What version of pyarrow are you using? What's your OS? Is the file on a
local disk or S3? How many row groups are in your file?
A difference of that much is not expected. However, they do use different
infrastructure under the hood. Do you also get the faster performance with
Hi all, I found that for the same Parquet file,
pq.ParquetFile(file_name).read() takes 6s while
pq.read_table(file_name) takes 17s. How do those two APIs differ? I thought
they used the same internals, but it seems not. The Parquet file is 865 MB,
with Snappy compression and dictionary encoding enabled. All
> You might also try the GCS filesystem (released with 7.0.0) instead of
going through fsspec.
I don't think the native GCS filesystem support is complete in 7.0.0. I
think that if you are willing to compile from the latest commit in the repo,
it might be usable.
On Wed, Feb 23, 2022 at 11:41 AM
I'm pretty sure GCS is similar to S3 in that there is no such thing as
a "directory". Instead a directory is often emulated by an empty
file. Note that the single file being detected is hires-sonde/ (with
a trailing slash). I'm pretty sure this is the convention for
creating mock directories.
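To illustrate the convention described above, here is a minimal sketch in plain Python that separates real object keys from "directory" placeholder keys by their trailing slash. The function name `split_directory_markers` and the keys below `hires-sonde/` are hypothetical; only `hires-sonde/` itself comes from this thread.

```python
def split_directory_markers(keys):
    """Split object keys into real objects and trailing-slash placeholders.

    Hypothetical helper: it only inspects key names and does not talk to
    GCS or S3.
    """
    markers = [key for key in keys if key.endswith("/")]
    objects = [key for key in keys if not key.endswith("/")]
    return objects, markers


# Hypothetical bucket listing; the file names are made up for illustration.
listing = [
    "hires-sonde/",
    "hires-sonde/part-0.parquet",
    "hires-sonde/part-1.parquet",
]

objects, markers = split_directory_markers(listing)
print("objects:", objects)
print("markers:", markers)
```

A file-discovery step that does not skip such placeholder keys could end up treating `hires-sonde/` as a data file, which would match the single file being detected here.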
Hello,
> But it sounds like the general process in using the homebrew Arrow binaries
> on Mac OS is:
>
> 1. brew install apache-arrow
> 2. Either:
>    a. Do some rpath modification (like I did):
>       CXX_FLAGS=-Wl,-rpath,/opt/homebrew/opt/apache-arrow/lib/
>    b. Or set DYLD_LIBRARY_PATH (as
On Mon, 21 Feb 2022 at 00:04, Kelton Halbert wrote:
> Hello,
>
> I’ve been learning and working with PyArrow recently for a project to
> store some atmospheric science data as part of a partitioned dataset, and
> recently the dataset class with the fsspec/gcsfs filesystem has started
>
Hi Kelton,
I was looking into it a bit, and this seems to be some kind of bug in the
gcsfs package (or fsspec).
When looking at the dataset object that gets created with your initial
example, we can see:
>>> data.files
['global-radiosondes/hires-sonde']
So this indicates that for some reason,
Hi KB,
Thanks. I only have a superficial knowledge of how to do things in Rust,
but I'll attempt to contribute from what I know from the PyArrow side of
things.
With regards to laying out the data (flat or nested): personally I like to
be able to include complex data types in tabular data