Re: [Python][Parquet]pq.ParquetFile.read faster than pq.read_table?

2022-02-23 Thread Weston Pace
What version of pyarrow are you using? What's your OS? Is the file on a local disk or S3? How many row groups are in your file? A difference of that much is not expected. However, they do use different infrastructure under the hood. Do you also get the faster performance with

[Python][Parquet]pq.ParquetFile.read faster than pq.read_table?

2022-02-23 Thread Shawn Zeng
Hi all, I found that for the same Parquet file, pq.ParquetFile(file_name).read() takes 6s while pq.read_table(file_name) takes 17s. How do those two APIs differ? I thought they used the same internals, but it seems they do not. The Parquet file is 865MB, with snappy compression and dictionary encoding enabled. All
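The comparison described in this message could be reproduced with something like the sketch below. It is only an illustration, not the poster's actual benchmark: the helper names time_call and compare_readers are hypothetical, and it assumes pyarrow is installed and that the file fits in memory.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_readers(path):
    """Time both read paths on the same file and return the two durations."""
    # Imported lazily so time_call stays usable without pyarrow installed.
    import pyarrow.parquet as pq

    t_file, s_file = time_call(lambda: pq.ParquetFile(path).read())
    t_table, s_table = time_call(lambda: pq.read_table(path))
    # Both calls should yield the same rows; only the timing is of interest.
    assert t_file.num_rows == t_table.num_rows
    return {"ParquetFile.read": s_file, "read_table": s_table}
```

Calling compare_readers("data.parquet") (a placeholder path) returns the two durations in seconds. The second read may benefit from the operating system's file cache, so it is probably worth repeating the measurement a few times and in both orders before drawing conclusions.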

Re: FIleNotFound Error on root directory with fsspec partitioned dataset

2022-02-23 Thread Micah Kornfield
> You might also try the GCS filesystem (released with 7.0.0) instead of going through fsspec.
I don't think the native GCS filesystem support is complete in 7.0.0. If you are willing to compile from the latest commit in the repo, it might be usable. On Wed, Feb 23, 2022 at 11:41 AM

Re: FIleNotFound Error on root directory with fsspec partitioned dataset

2022-02-23 Thread Weston Pace
I'm pretty sure GCS is similar to S3 in that there is no such thing as a "directory". Instead a directory is often emulated by an empty file. Note that the single file being detected is hires-sonde/ (with a trailing slash). I'm pretty sure this is the convention for creating mock directories.
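The trailing-slash convention described above can be illustrated with a small sketch. It assumes a flat list of object keys in which any key ending in "/" is an empty directory marker. The function name split_directory_markers and the part-0.parquet key are hypothetical, and this is not how pyarrow, fsspec, or gcsfs handle listings internally.

```python
def split_directory_markers(keys):
    """Separate real object keys from mock-directory marker keys.

    Assumption: a key ending in "/" is an empty placeholder object that
    emulates a directory, as described in the message above.
    """
    markers = [k for k in keys if k.endswith("/")]
    files = [k for k in keys if not k.endswith("/")]
    return files, markers


# Example listing: one marker object plus one hypothetical data file.
keys = [
    "global-radiosondes/hires-sonde/",
    "global-radiosondes/hires-sonde/part-0.parquet",
]
files, markers = split_directory_markers(keys)
```

A listing that returns only the marker key, as seems to be happening in this thread, would leave files empty after such a split.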

Re: RPATH and Brew on MacOS

2022-02-23 Thread Jonathan Keane
Hello,
> But it sounds like the general process in using the homebrew Arrow binaries
> on Mac OS is:
>
> 1. brew install apache-arrow
> 2. Either:
>    a. Do some rpath modification (like I did):
>       CXX_FLAGS=-Wl,-rpath,/opt/homebrew/opt/apache-arrow/lib/
>    b. Or set DYLD_LIBRARY_PATH (as

Re: FIleNotFound Error on root directory with fsspec partitioned dataset

2022-02-23 Thread Joris Van den Bossche
On Mon, 21 Feb 2022 at 00:04, Kelton Halbert wrote:
> Hello,
>
> I’ve been learning and working with PyArrow recently for a project to
> store some atmospheric science data as part of a partitioned dataset, and
> recently the dataset class with the fsspec/gcsfs filesystem has started

Re: FIleNotFound Error on root directory with fsspec partitioned dataset

2022-02-23 Thread Joris Van den Bossche
Hi Kelton, I was looking into it a bit, and this seems to be some kind of bug in the gcsfs package (or fsspec). When looking at the dataset object that gets created with your initial example, we can see:
>>> data.files
['global-radiosondes/hires-sonde']
So this indicates that for some reason,

Re: Looking for suggestions on approach

2022-02-23 Thread Marnix van den Broek
Hi KB, Thanks. I only have a superficial knowledge of how to do things in Rust, but I'll attempt to contribute from what I know from the PyArrow side of things. With regard to laying out the data (flat or nested): personally I like to be able to include complex data types in tabular data