Yes, I think you're correct that there isn't another way to do the conversion. A more efficient conversion may be in scope for the project, so you might consider opening a GitHub Issue [1] to discuss it further. You may also find this past discussion [2] interesting.
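For completeness, the round trip would look roughly like this (a minimal sketch assuming the read_parquet, Arrow.write, and Arrow.Table APIs of current Parquet.jl and Arrow.jl; the file names are just placeholders):

using Parquet, Arrow

# read the Parquet file into a Tables.jl-compatible table
tab = read_parquet("blah.parquet")

# write it out as an Arrow IPC file
Arrow.write("blah.arrow", tab)

# reload the IPC file as an Arrow.Table backed by Arrow memory
tab2 = Arrow.Table("blah.arrow")

Either way the data passes through a Parquet read plus an IPC serialization step, which is the overhead an issue about a more direct conversion would address.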
[1] https://github.com/apache/arrow-julia/issues
[2] https://github.com/apache/arrow-julia/issues/227

On Wed, Mar 22, 2023 at 6:11 AM Kazunori Akiyama <[email protected]> wrote:

> Hi Bryce,
>
> This clarifies a lot. I was indeed confused about the formats, and the reference [5] was really helpful for clearing that up.
>
> Let me ask one more question about the Julia interfaces before closing the thread. Does this mean that there is no function that loads Parquet files directly into the Julia implementation of the Arrow in-memory format? It looks like the only way is to convert them to the IPC format using Parquet.jl and Arrow.jl and then reload them. Am I correct?
>
> Like:
>
> # convert a Parquet file into the Arrow IPC format
> tab = read_parquet("blah.parquet")
> Arrow.write("blah.arrow", tab)
>
> # reload it into in-memory data
> tab2 = Arrow.Table("blah.arrow")
>
> - Kazu
>
> On Mar 21, 2023, at 6:40 PM, Bryce Mecum <[email protected]> wrote:
>
> Hi Kazu, from the description of the behavior you're seeing and the code you've provided, it looks like you may be mixing up the two file formats (Arrow IPC and Parquet) in your code. Your Julia code looks like it's using the Arrow IPC file format, whereas your Python code looks like it's using the Parquet file format.
>
> If you want to use Parquet to share data:
>
> - In Julia: use the Parquet package and its read_parquet and write_parquet functions [1]
> - In Python: use the pyarrow.parquet module and its read_table and write_table functions [2]
>
> If you want to use Arrow IPC to share data:
>
> - In Julia: use the Arrow package and its Arrow.Table and Arrow.write functions [3]
> - In Python: use the pyarrow package and its IPC readers and writers [4]
>
> Additionally, there is a FAQ entry [5] on the Apache Arrow website about the formats that you may find relevant.
>
> [1] https://github.com/JuliaIO/Parquet.jl
> [2] https://arrow.apache.org/docs/python/parquet.html
> [3] https://arrow.juliadata.org/dev/manual/#User-Manual
> [4] https://arrow.apache.org/docs/python/ipc.html
> [5] https://arrow.apache.org/faq/#what-about-arrow-files-then
>
> On Tue, Mar 21, 2023 at 12:00 PM Kazunori Akiyama <[email protected]> wrote:
>
>> Hello,
>>
>> I'm a radio astronomer working for the Event Horizon Telescope <https://eventhorizontelescope.org/> project. We are interested in Apache Arrow as our next-generation data format, as other radio astronomy groups have started to develop a new Arrow-based data format <https://github.com/ratt-ru/casa-arrow>. We are currently developing our major software ecosystems in Julia and Python, and we would like to test data IO interfaces with Arrow.jl and pyarrow.
>>
>> I'm writing because I ran into issues loading Arrow table data created in a different language. We did a very simple check: we created Arrow tables in Python and in Julia, then loaded each one in the other language. While we confirmed that pyarrow and Arrow.jl can each read parquet files they generated themselves, neither can load parquet files produced by the other. For instance, we found:
>>
>> - pyarrow can't read a table written by the Arrow.write method of Julia's Arrow.jl. It returns `ArrowInvalid: Could not open Parquet input source 'FILENAME': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.`
>> - Arrow.jl can't read a table from pyarrow. It doesn't show any errors, but the loaded table is completely empty, with no rows or columns.
>>
>> I have attached the Julia and Python scripts that create parquet files containing a very simple single-column table (juliadf.parquet from Julia, pandasdf.parquet from Python). pyarrow.parquet.read_table doesn't work for juliadf.parquet, and Arrow.Table doesn't work for pandasdf.parquet. I have also attached Python's pip freeze output and Julia's toml files in case you want to see my Python and Julia environments.
>>
>> As this is a very primitive test, I'm pretty sure I made some simple mistake here. What am I missing? Please let me know how I should handle parquet files created by interfaces in different languages.
>>
>> Thanks,
>> Kazu
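To make the format distinction discussed above concrete on the Julia side, here is a minimal sketch. It assumes Parquet.jl's write_parquet/read_parquet, Arrow.jl's Arrow.write/Arrow.Table, and DataFrames.jl for a toy table; the file names are placeholders:

using DataFrames, Parquet, Arrow

# a toy single-column table
df = DataFrame(x = 1:3)

# Parquet file on disk: read it back with read_parquet
# (or pyarrow.parquet.read_table from Python)
write_parquet("shared.parquet", df)
tab_parquet = read_parquet("shared.parquet")

# Arrow IPC file on disk: read it back with Arrow.Table
# (or pyarrow's IPC/Feather readers from Python)
Arrow.write("shared.arrow", df)
tab_ipc = Arrow.Table("shared.arrow")

Handing a file produced by one of these writers to the other format's reader is exactly what produces the "Parquet magic bytes not found" error and the silently empty Arrow.jl table described above.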
