Sorry Kyle, I totally missed this email.

Initially I would say the symptoms sound like "not calling finish() on the writer", but I skimmed some of your linked code and saw at least one call to finish, so maybe this is not the root cause.
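One way to narrow this down is to inspect the raw bytes before they are handed to JS. Below is a minimal, std-only sketch that assumes the non-legacy Arrow IPC stream framing: each message starts with the continuation marker 0xFFFFFFFF followed by a little-endian int32 metadata length, and the stream ends with the same marker followed by a zero length (which, as far as I know, is what finish() writes). The names `StreamCheck` and `check_ipc_stream` are hypothetical helpers for illustration, not part of the arrow crate.

```rust
/// Continuation marker that prefixes each message in the (non-legacy)
/// Arrow IPC streaming format.
const CONTINUATION: [u8; 4] = [0xFF; 4];

/// End-of-stream marker: continuation marker plus a zero metadata length.
const END_OF_STREAM: [u8; 8] = [0xFF, 0xFF, 0xFF, 0xFF, 0, 0, 0, 0];

/// Hypothetical summary of a framing check (not an arrow-rs type).
#[derive(Debug, PartialEq)]
pub struct StreamCheck {
    /// The buffer begins with the continuation marker.
    pub starts_with_continuation: bool,
    /// Metadata length declared by the first message, if it could be read.
    pub first_metadata_len: Option<i32>,
    /// The declared metadata length fits in the bytes that follow the prefix.
    pub first_metadata_fits: bool,
    /// The buffer ends with the end-of-stream marker.
    pub ends_with_eos: bool,
}

/// Inspect only the framing of an IPC stream buffer; the flatbuffer
/// metadata itself is not parsed.
pub fn check_ipc_stream(buf: &[u8]) -> StreamCheck {
    let starts_with_continuation = buf.starts_with(&CONTINUATION);

    // Bytes 4..8 hold the little-endian metadata length of the first message.
    let first_metadata_len = match (starts_with_continuation, buf.get(4..8)) {
        (true, Some(b)) => Some(i32::from_le_bytes([b[0], b[1], b[2], b[3]])),
        _ => None,
    };

    // A declared length larger than the remaining bytes would produce an
    // "expected N bytes, only read M" style failure in a reader.
    let first_metadata_fits = match first_metadata_len {
        Some(len) if len >= 0 => (len as usize) <= buf.len() - 8,
        _ => false,
    };

    StreamCheck {
        starts_with_continuation,
        first_metadata_len,
        first_metadata_fits,
        ends_with_eos: buf.ends_with(&END_OF_STREAM),
    }
}
```

If `ends_with_eos` is false, the writer was probably not finished or the buffer was cut short on the way out. If `first_metadata_fits` is false, the buffer being read likely does not start at a message boundary or is shorter than what was written. Since the check reads only the first prefix and the trailing marker, it cannot detect corruption in the middle of the stream.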
In terms of reading from a parquet file and returning arrow, I would recommend checking out the arrow module in the parquet crate [2]. The linked documentation also includes an example.

There is one existing issue [1] that sounds like it may be similar.

Hope that helps,
Andrew

[1] https://github.com/apache/arrow-rs/issues/1335
[2] https://docs.rs/parquet/10.0.0/parquet/arrow/index.html

On Wed, Mar 9, 2022 at 2:03 AM Micah Kornfield <[email protected]> wrote:

> Hi Kyle,
> I'm not sure if Rust contributors monitor this list, you might have better
> luck opening an issue on the Rust Repo [1]
>
> [1] https://github.com/apache/arrow-rs
>
> On Sun, Feb 27, 2022 at 7:28 PM Kyle Barron <[email protected]> wrote:
>
>> Hello!
>>
>> I've used Arrow a decent bit in Python and JS but I'm pretty new to Rust.
>> I'm trying to write a minimal binding of Rust's Parquet to WebAssembly in
>> order to decode Parquet files to Arrow on the web. I have code that works
>> <https://github.com/kylebarron/parquet-wasm/blob/main/src/lib.rs> but
>> only some of the time. For example this test data
>> <https://github.com/kylebarron/parquet-wasm/blob/9495a87e00ae7073966d171bdcbfa1b87c63991b/data/works.parquet>
>> (created here
>> <https://github.com/kylebarron/parquet-wasm/blob/9495a87e00ae7073966d171bdcbfa1b87c63991b/data/generate_data.py#L40-L43>)
>> seems to work with the js arrow.RecordBatchReader
>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/www/index.js#L50-L52>
>> but other test data
>> <https://github.com/kylebarron/parquet-wasm/blob/9495a87e00ae7073966d171bdcbfa1b87c63991b/data/not_work.parquet>
>> (created here
>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/data/generate_data.py#L45-L48>)
>> raises with "Error: Expected to read 1249648 metadata bytes, but only read
>> 300.".
>>
>> Based on logging, it *seems* as if parsing the Parquet file goes
>> smoothly. It's only writing the Arrow IPC format that fails (on the JS side
>> when trying to verify it). I'm currently trying to create the
>> StreamWriter
>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L122-L123>,
>> then write all the Arrow RecordBatches into the writer
>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L127-L128>,
>> then finish the writer
>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L142>,
>> and send the output back to JS
>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L145-L156>
>> .
>>
>> Has anyone seen a similar problem before, or any suggestions of where to
>> debug further? Alternatively, if an end-to-end example exists of reading
>> from a parquet file and returning an Arrow buffer would be very helpful to
>> see.
>>
>> Best,
>> Kyle Barron
