I am glad you got it working!
On Thu, Mar 10, 2022 at 12:34 PM Kyle Barron <[email protected]> wrote:
> Thanks to both!
>
> I did more debugging last night and I believe the entire issue was that
> `unsafe { Uint8Array::view(&file) }`
> <https://github.com/kylebarron/parquet-wasm/blob/9aee64343b76c1c6b7550f9d27aede327f9a1b75/src/lib.rs#L140>
> really was unsafe 😄. I originally copied that from Dominik Moritz's
> `arrow-wasm`'s `Table.serialize`
> <https://github.com/domoritz/arrow-wasm/blob/3d6d4c6ab940fd317c4a19610cd204a06dc29584/src/table.rs#L58-L76>,
> and just assumed it was ok usage. But when I instead create a new
> `js_sys::Uint8Array` and then fill that array with the writer's contents,
> the bytes in JS match the bytes in Rust, and `arrow.tableFromIPC` in JS
> works well.
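
For anyone repeating this kind of check, here is a minimal sketch of comparing the bytes seen on each side. It uses only the Rust standard library; `first_mismatch` and `hex_window` are hypothetical helper names, not part of arrow-rs, js_sys, or parquet-wasm, and it assumes you have already captured both buffers (for example by logging them) as plain byte slices.

```rust
/// Offset of the first byte where `a` and `b` differ, or `None` if identical.
/// If one buffer is a strict prefix of the other, the mismatch is reported
/// at the length of the shorter buffer.
fn first_mismatch(a: &[u8], b: &[u8]) -> Option<usize> {
    match a.iter().zip(b).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}

/// Hexdump-style rendering of up to `width` bytes starting at offset `at`.
fn hex_window(buf: &[u8], at: usize, width: usize) -> String {
    let end = at.saturating_add(width).min(buf.len());
    let start = at.min(end);
    buf[start..end]
        .iter()
        .map(|b| format!("{:02x}", b))
        .collect::<Vec<_>>()
        .join(" ")
}
```

Calling `first_mismatch(rust_bytes, js_bytes)` and then printing `hex_window` of both buffers at that offset shows where the two copies start to diverge.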
>
> Possibly in relation to #1335, I was originally surprised that these
> IPC file format files (written with the unsafe view) were readable in
> Python but not in JS. From looking at the hexdump, I think the unsafe view
> corrupted the beginning of the file but not the end. So
> `pyarrow.ipc.open_file` was able to open the file likely because it first
> looked at the footer, while Arrow JS likely tries to parse stream and file
> IPC data in the same way.
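
That explanation fits the layout of the two formats: an IPC file begins and ends with the magic bytes `ARROW1`, while current versions of the stream format prefix each message with a `0xFFFFFFFF` continuation marker. As a rough sketch (standard library only; `IpcKind` and `sniff_ipc` are hypothetical names, and this inspects only the first and last few bytes rather than validating anything), a buffer could be classified like this before handing it to a reader:

```rust
/// Magic bytes that open and close an Arrow IPC *file*.
const ARROW_MAGIC: &[u8; 6] = b"ARROW1";
/// Continuation marker that prefixes each message in the IPC *stream* format.
const CONTINUATION: [u8; 4] = [0xFF; 4];
/// Smallest plausible file: 8-byte padded magic, 4-byte footer length, trailing magic.
const MIN_FILE_LEN: usize = 8 + 4 + 6;

#[derive(Debug, PartialEq)]
enum IpcKind {
    /// Magic present at both ends.
    File,
    /// Trailing magic is intact but the leading magic is missing.
    FileHeadDamaged,
    /// Starts with the stream continuation marker.
    Stream,
    /// None of the above.
    Unknown,
}

fn sniff_ipc(buf: &[u8]) -> IpcKind {
    let head = buf.starts_with(ARROW_MAGIC);
    let tail = buf.len() >= MIN_FILE_LEN && buf.ends_with(ARROW_MAGIC);
    match (head, tail) {
        (true, true) => IpcKind::File,
        (false, true) => IpcKind::FileHeadDamaged,
        _ if buf.starts_with(&CONTINUATION) => IpcKind::Stream,
        _ => IpcKind::Unknown,
    }
}
```

A `FileHeadDamaged` result would match the symptom described here: a footer-first reader can still open the file, while a reader that starts from the first bytes cannot.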
>
> Kyle
>
> On Thu, Mar 10, 2022 at 4:24 AM Andrew Lamb <[email protected]> wrote:
>
>> Sorry Kyle, I totally missed this email
>>
>> Initially I would have said the symptoms sound like "not calling finish()
>> on the writer", but I skimmed some of your linked code and saw at least
>> one call to finish, so maybe this is not the root cause.
>>
>> In terms of reading from a parquet file and returning arrow, I would
>> recommend checking out the arrow module in the parquet crate [2]. The
>> linked documentation also includes an example.
>>
>> There is one existing issue[1] that sounds like it may be similar.
>>
>> Hope that helps,
>> Andrew
>>
>> [1] https://github.com/apache/arrow-rs/issues/1335
>> [2] https://docs.rs/parquet/10.0.0/parquet/arrow/index.html
>>
>> On Wed, Mar 9, 2022 at 2:03 AM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> Hi Kyle,
>>> I'm not sure if Rust contributors monitor this list. You might have
>>> better luck opening an issue on the Rust repo [1].
>>>
>>> [1] https://github.com/apache/arrow-rs
>>>
>>> On Sun, Feb 27, 2022 at 7:28 PM Kyle Barron <[email protected]>
>>> wrote:
>>>
>>>> Hello!
>>>>
>>>> I've used Arrow a decent bit in Python and JS but I'm pretty new to
>>>> Rust. I'm trying to write a minimal binding of Rust's Parquet to
>>>> WebAssembly in order to decode Parquet files to Arrow on the web. I have
>>>> code
>>>> that works
>>>> <https://github.com/kylebarron/parquet-wasm/blob/main/src/lib.rs> but
>>>> only some of the time. For example this test data
>>>> <https://github.com/kylebarron/parquet-wasm/blob/9495a87e00ae7073966d171bdcbfa1b87c63991b/data/works.parquet>
>>>> (created here
>>>> <https://github.com/kylebarron/parquet-wasm/blob/9495a87e00ae7073966d171bdcbfa1b87c63991b/data/generate_data.py#L40-L43>)
>>>> seems to work with the js arrow.RecordBatchReader
>>>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/www/index.js#L50-L52>
>>>> but other test data
>>>> <https://github.com/kylebarron/parquet-wasm/blob/9495a87e00ae7073966d171bdcbfa1b87c63991b/data/not_work.parquet>
>>>> (created here
>>>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/data/generate_data.py#L45-L48>)
>>>> raises with "Error: Expected to read 1249648 metadata bytes, but only read
>>>> 300.".
>>>>
>>>> Based on logging, it *seems* as if parsing the Parquet file goes
>>>> smoothly. It's only writing the Arrow IPC format that fails (on the JS side
>>>> when trying to verify it). I'm currently trying to create the
>>>> StreamWriter
>>>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L122-L123>,
>>>> then write all the Arrow RecordBatches into the writer
>>>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L127-L128>,
>>>> then finish the writer
>>>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L142>,
>>>> and send the output back to JS
>>>> <https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L145-L156>
>>>> .
>>>>
>>>> Has anyone seen a similar problem before, or any suggestions of where
>>>> to debug further? Alternatively, an end-to-end example of reading from
>>>> a Parquet file and returning an Arrow buffer would be very helpful to
>>>> see, if one exists.
>>>>
>>>> Best,
>>>> Kyle Barron
>>>>
>>>>