Thanks to both!
I did more debugging last night and I believe the entire issue was that `unsafe
{ Uint8Array::view(&file) }`
<https://github.com/kylebarron/parquet-wasm/blob/9aee64343b76c1c6b7550f9d27aede327f9a1b75/src/lib.rs#L140>
really was unsafe 😄. I originally copied that from Dominik Moritz's
`arrow-wasm`'s `Table.serialize`
<https://github.com/domoritz/arrow-wasm/blob/3d6d4c6ab940fd317c4a19610cd204a06dc29584/src/table.rs#L58-L76>,
and just assumed it was OK usage. (A view aliases wasm linear memory
directly, so it is only valid while the backing buffer is neither freed nor
reallocated.) When I instead create a new `js_sys::Uint8Array` and copy the
writer's contents into it, the bytes in JS match the bytes in Rust, and
`arrow.tableFromIPC` in JS works well.
Possibly related to #1335: I was originally surprised that these IPC file
format files (written with the unsafe view) were readable in Python but not
in JS. From looking at the hexdump, I think the unsafe view corrupted the
beginning of the file but not the end. So `pyarrow.ipc.open_file` was likely
able to open the file because it looks at the footer first, while Arrow JS
likely tries to parse stream and file IPC data in the same way, starting
from the beginning.
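For anyone who wants to check their own buffers for the same head-vs-tail damage, here is a rough sketch in plain Rust (no Arrow crates). It relies only on the fact that an Arrow IPC file starts and ends with the magic string `ARROW1`. The names `check_magic` and `MagicCheck` are made up for illustration, and the filler bytes in `main` are a stand-in, not a valid IPC file.

```rust
/// Magic string that both opens and closes an Arrow IPC *file*.
/// (The streaming format carries no such marker.)
const ARROW_MAGIC: &[u8] = b"ARROW1";

/// Which ends of a buffer still carry the file-format magic.
/// Hypothetical helper type, not part of any Arrow library.
#[derive(Debug, PartialEq)]
enum MagicCheck {
    Both,
    HeadOnly,
    TailOnly,
    Neither,
}

/// Report whether the magic string is present at the start and/or end.
fn check_magic(buf: &[u8]) -> MagicCheck {
    let head = buf.starts_with(ARROW_MAGIC);
    let tail = buf.ends_with(ARROW_MAGIC);
    match (head, tail) {
        (true, true) => MagicCheck::Both,
        (true, false) => MagicCheck::HeadOnly,
        (false, true) => MagicCheck::TailOnly,
        (false, false) => MagicCheck::Neither,
    }
}

fn main() {
    // Stand-in for a real file: magic + 2 padding bytes, filler, magic.
    let mut buf = b"ARROW1\0\0".to_vec();
    buf.extend_from_slice(&[0u8; 16]);
    buf.extend_from_slice(ARROW_MAGIC);
    println!("intact: {:?}", check_magic(&buf));

    // Simulate damage to the start of the buffer only.
    buf[..4].copy_from_slice(&[0xde, 0xad, 0xbe, 0xef]);
    println!("damaged head: {:?}", check_magic(&buf));
}
```

A `TailOnly` result would be consistent with what I saw in the hexdump: a reader that starts from the footer could still find its way in, while a reader that starts from the first bytes would fail early.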
Kyle
On Thu, Mar 10, 2022 at 4:24 AM Andrew Lamb <[email protected]> wrote:
> Sorry Kyle, I totally missed this email
>
> Initially I would say the symptoms sound like "not calling finish() on the
> writer", but I skimmed some of your linked code and saw at least one call to
> finish, so maybe this is not the root cause.
>
> In terms of reading from a parquet file and returning arrow, I would
> recommend checking out the arrow module in the parquet crate [2]. The linked
> documentation also includes an example.
>
> There is one existing issue[1] that sounds like it may be similar.
>
> Hope that helps,
> Andrew
>
> [1] https://github.com/apache/arrow-rs/issues/1335
> [2] https://docs.rs/parquet/10.0.0/parquet/arrow/index.html
>
> On Wed, Mar 9, 2022 at 2:03 AM Micah Kornfield <[email protected]>
> wrote:
>
>> Hi Kyle,
>> I'm not sure if Rust contributors monitor this list; you might have
>> better luck opening an issue on the Rust repo [1].
>>
>> [1] https://github.com/apache/arrow-rs
>>
>> On Sun, Feb 27, 2022 at 7:28 PM Kyle Barron <[email protected]>
>> wrote:
>>
>>> Hello!
>>>
>>> I've used Arrow a decent bit in Python and JS but I'm pretty new to
>>> Rust. I'm trying to write a minimal binding of Rust's Parquet to
>>> WebAssembly in order to decode Parquet files to Arrow on the web. I have
>>> code that works
>>> <https://github.com/kylebarron/parquet-wasm/blob/main/src/lib.rs> but
>>> only some of the time. For example this test data
>>> <https://github.com/kylebarron/parquet-wasm/blob/9495a87e00ae7073966d171bdcbfa1b87c63991b/data/works.parquet>
>>> (created here
>>> <https://github.com/kylebarron/parquet-wasm/blob/9495a87e00ae7073966d171bdcbfa1b87c63991b/data/generate_data.py#L40-L43>)
>>> seems to work with the js arrow.RecordBatchReader
>>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/www/index.js#L50-L52>
>>> but other test data
>>> <https://github.com/kylebarron/parquet-wasm/blob/9495a87e00ae7073966d171bdcbfa1b87c63991b/data/not_work.parquet>
>>> (created here
>>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/data/generate_data.py#L45-L48>)
>>> raises with "Error: Expected to read 1249648 metadata bytes, but only read
>>> 300.".
>>>
>>> Based on logging, it *seems* as if parsing the Parquet file goes
>>> smoothly. It's only writing the Arrow IPC format that fails (on the JS side
>>> when trying to verify it). I'm currently trying to create the
>>> StreamWriter
>>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L122-L123>,
>>> then write all the Arrow RecordBatches into the writer
>>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L127-L128>,
>>> then finish the writer
>>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L142>,
>>> and send the output back to JS
>>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L145-L156>.
>>>
>>> Has anyone seen a similar problem before, or have any suggestions on where
>>> to debug further? Alternatively, an end-to-end example of reading from a
>>> parquet file and returning an Arrow buffer would be very helpful to see.
>>>
>>> Best,
>>> Kyle Barron
>>>
>>>