Sorry Kyle, I totally missed this email.

At first glance the symptoms sound like finish() was never called on the
writer, but I skimmed some of the code you linked and saw at least one call
to finish(), so that may not be the root cause.
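
If it helps narrow that down, here is a minimal sketch of a sanity check you
could run on the bytes before they leave Rust. It assumes the default
(non-legacy) Arrow IPC stream framing, where each message starts with a
0xFFFFFFFF continuation marker followed by a little-endian int32 metadata
length, and where the end of the stream is marked by the same marker followed
by a zero length. The helper names are hypothetical and not part of the arrow
crate.

```rust
// Hypothetical helpers (not part of the arrow crate) for inspecting an
// Arrow IPC *stream* buffer. Assumes the default, non-legacy framing.

/// Marker that precedes every message in the stream format.
const CONTINUATION: [u8; 4] = [0xFF, 0xFF, 0xFF, 0xFF];

/// Returns the metadata length announced by the first message, or None if
/// the buffer does not start with a well-formed 8-byte prefix.
pub fn first_metadata_len(buf: &[u8]) -> Option<u32> {
    if buf.len() < 8 || !buf.starts_with(&CONTINUATION) {
        return None;
    }
    // The length is a little-endian signed 32-bit integer.
    let len = i32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]);
    if len < 0 {
        None
    } else {
        Some(len as u32)
    }
}

/// Returns true if the buffer ends with the end-of-stream marker
/// (continuation marker followed by a zero length).
pub fn ends_with_eos(buf: &[u8]) -> bool {
    buf.ends_with(&[0xFF, 0xFF, 0xFF, 0xFF, 0, 0, 0, 0])
}
```

If ends_with_eos returns false on the buffer you hand to JS, the finish() call
probably did not take effect on those bytes. If first_metadata_len reports a
length larger than the buffer itself, the bytes were likely truncated or
altered somewhere between the writer and the reader.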

For reading from a Parquet file and returning Arrow, I would recommend
checking out the arrow module in the parquet crate [2]. The linked
documentation also includes an example.

There is one existing issue[1] that sounds like it may be similar.

Hope that helps,
Andrew

[1] https://github.com/apache/arrow-rs/issues/1335
[2] https://docs.rs/parquet/10.0.0/parquet/arrow/index.html

On Wed, Mar 9, 2022 at 2:03 AM Micah Kornfield wrote:

> Hi Kyle,
> I'm not sure whether Rust contributors monitor this list. You might have
> better luck opening an issue on the Rust repo [1].
>
> [1] https://github.com/apache/arrow-rs
>
> On Sun, Feb 27, 2022 at 7:28 PM Kyle Barron wrote:
>
>> Hello!
>>
>> I've used Arrow a decent bit in Python and JS but I'm pretty new to Rust.
>> I'm trying to write a minimal binding of Rust's Parquet to WebAssembly in
>> order to decode Parquet files to Arrow on the web. I have code that works
>> <https://github.com/kylebarron/parquet-wasm/blob/main/src/lib.rs> but
>> only some of the time. For example, this test data
>> <https://github.com/kylebarron/parquet-wasm/blob/9495a87e00ae7073966d171bdcbfa1b87c63991b/data/works.parquet>
>>  (created here
>> <https://github.com/kylebarron/parquet-wasm/blob/9495a87e00ae7073966d171bdcbfa1b87c63991b/data/generate_data.py#L40-L43>)
>> seems to work with the js arrow.RecordBatchReader
>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/www/index.js#L50-L52>
>>  but other test data
>> <https://github.com/kylebarron/parquet-wasm/blob/9495a87e00ae7073966d171bdcbfa1b87c63991b/data/not_work.parquet>
>>  (created here
>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/data/generate_data.py#L45-L48>)
>> raises with "Error: Expected to read 1249648 metadata bytes, but only read
>> 300.".
>>
>> Based on logging, it *seems* as if parsing the Parquet file goes
>> smoothly. It's only writing the Arrow IPC format that fails (on the JS side
>> when trying to verify it). I'm currently trying to create the
>> StreamWriter
>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L122-L123>,
>> then write all the Arrow RecordBatches into the writer
>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L127-L128>,
>> then finish the writer
>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L142>,
>> and send the output back to JS
>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L145-L156>
>> .
>>
>> Has anyone seen a similar problem before, or does anyone have suggestions
>> on where to debug further? Alternatively, if an end-to-end example of
>> reading from a Parquet file and returning an Arrow buffer exists, it would
>> be very helpful to see.
>>
>> Best,
>> Kyle Barron
>>
>>
