Hello,

I’m a radio astronomer working on the Event Horizon Telescope project. We are interested in Apache Arrow for our next-generation data format, as other radio astronomy groups have started to develop a new Arrow-based data format. We are currently developing major software ecosystems in Julia and Python, and would like to test data I/O interfaces with Arrow.jl and pyarrow.

I’m writing this e-mail because I ran into issues loading Arrow table data created in a different language. We did a very simple check: creating Arrow tables in Python and Julia, and loading each one in the other language (i.e. Julia and Python, respectively). While we confirmed that pyarrow and Arrow.jl can each read parquet files generated by themselves, neither can load parquet files written by the other. For instance, we found:

  • pyarrow can’t read a table written by the Arrow.write method of Julia’s Arrow.jl. It raises `ArrowInvalid: Could not open Parquet input source 'FILENAME': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.`
  • Arrow.jl can’t read a table written by pyarrow. It doesn’t raise any errors, but the loaded table is completely empty, with no rows or columns.
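
A quick diagnostic that might narrow this down: Parquet files are bracketed by the magic bytes `PAR1` at both ends, while Arrow IPC files start with `ARROW1`, so peeking at a few bytes of each file should show what format was actually written. This is just an illustrative sketch (the helper name is my own):

```python
# Illustrative sketch: inspect a file's magic bytes to distinguish a
# Parquet file ("PAR1" at both ends) from an Arrow IPC file ("ARROW1" header).
def detect_format(path):
    with open(path, "rb") as f:
        head = f.read(8)
        f.seek(-4, 2)   # jump to the last 4 bytes of the file
        tail = f.read(4)
    if head[:4] == b"PAR1" and tail == b"PAR1":
        return "parquet"
    if head[:6] == b"ARROW1":
        return "arrow ipc file"
    return "unknown"
```

Running this on juliadf.parquet and pandasdf.parquet should show whether both files are really Parquet.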

I have attached the Julia and Python scripts that create parquet files of a very simple single-column table (juliadf.parquet from Julia, pandasdf.parquet from Python). pyarrow.parquet.read_table doesn’t work for juliadf.parquet, and the Arrow.Table method doesn’t work for pandasdf.parquet. I have also attached Python’s pip freeze output and Julia’s toml files in case you want to see my Python and Julia environments.

As this is a very primitive test, I’m pretty sure I made some simple mistake here. What am I missing? Please let me know how I should handle parquet files written by interfaces in different languages.

Thanks,
Kazu


Attachment: create_parquet.jl
Description: Binary data

import pandas as pd
import numpy as np
import pyarrow.parquet as pq

# Create a simple single-column dataframe
df = pd.DataFrame()
df["col1"] = np.zeros(10)
print(df)

# Save to a parquet file
df.to_parquet("pandasdf.parquet")

# Load it back with pyarrow
atab = pq.read_table("pandasdf.parquet")
print(atab)

Attachment: pip_freeze
Description: Binary data

Attachment: Manifest.toml
Description: Binary data

Attachment: Project.toml
Description: Binary data
