The attachments didn’t come through properly.

I’ve got additional questions.


  1.  What filesystem are these files stored on? I’ve seen issues with S3 when 
HEAD operations aren’t prioritized; my assumption is that without fast 
HEAD/ranged requests you can’t efficiently read just a Parquet file’s footer, 
and reading the entire file instead isn’t practical. (A footer-only read sketch 
follows the quoted passage below.)

For example, one S3-compatible object store documents an “available (eventual 
consistency for HEAD operations)” level as: behaves the same as the 
“read-after-new-write” consistency level, but only provides eventual 
consistency for HEAD operations; offers higher availability for HEAD operations 
than “read-after-new-write” if Storage Nodes are unavailable; and differs from 
AWS S3 consistency guarantees for HEAD operations only.
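
For reference, a minimal sketch of the kind of footer-only read I have in mind, 
assuming the data sat on S3 and s3fs were available (the bucket and object key 
below are placeholders, not taken from this thread):

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

# Open the remote object and read only the Parquet footer (FileMetaData);
# this relies on cheap HEAD/ranged requests instead of downloading the file.
with fs.open("my-bucket/path/to/part-00000.parquet", "rb") as f:
    meta = pq.ParquetFile(f).metadata

print(meta.num_row_groups, meta.num_rows)

If HEAD requests are slow or only eventually consistent, even this 
metadata-only step could become the bottleneck.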

  2.  Are you using pyarrow.parquet.ParquetDataset or pyarrow.dataset? (A short 
sketch of both follows the quoted docs below.)
https://arrow.apache.org/docs/python/parquet.html
Note
The ParquetDataset is being reimplemented based on the new generic Dataset API 
(see the Tabular Datasets <https://arrow.apache.org/docs/python/dataset.html#dataset> docs for an 
overview). This is not yet the default, but can already be enabled by passing 
the use_legacy_dataset=False keyword to ParquetDataset or read_table():
pq.ParquetDataset('dataset_name/', use_legacy_dataset=False)
Enabling this gives the following new features:

  *   Filtering on all columns (using row group statistics) instead of only on 
the partition keys.
  *   More fine-grained partitioning: support for a directory partitioning 
scheme in addition to the Hive-like partitioning (e.g. “/2019/11/15/” instead 
of “/year=2019/month=11/day=15/”), and the ability to specify a schema for the 
partition keys.
  *   General performance improvement and bug fixes.
It also has the following changes in behaviour:

  *   The partition keys need to be explicitly included in the columns keyword 
when you want to include them in the result while reading a subset of the 
columns
This new implementation is already enabled in read_table, and in the future, 
this will be turned on by default for ParquetDataset. The new implementation 
does not yet cover all existing ParquetDataset features (e.g. specifying the 
metadata, or the pieces property API). Feedback is very welcome.
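
To make the comparison concrete, here is a rough sketch of both reading paths, 
assuming a hive-partitioned layout like the one described in the quoted message 
below (the root path and filter values are placeholders):

import pyarrow.dataset as ds
import pyarrow.parquet as pq

# New generic Dataset API: partition keys come from the
# state_code=<n>/city_code=<n> directory names and the filter should
# prune non-matching directories before any file is opened.
dataset = ds.dataset("dataset_root/", format="parquet", partitioning="hive")
table = dataset.to_table(
    filter=(ds.field("state_code") == 31) & (ds.field("city_code") == 6200)
)

# Equivalent read through ParquetDataset with the new implementation enabled:
table2 = pq.ParquetDataset(
    "dataset_root/",
    use_legacy_dataset=False,
    filters=[("state_code", "==", 31), ("city_code", "==", 6200)],
).read()

Knowing which of these code paths you are on (and which pyarrow version) would 
help narrow this down.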


From: Tomaz Maia Suller <[email protected]>
Sent: Wednesday, August 3, 2022 2:54 PM
To: [email protected]
Subject: Issue filtering partitioned Parquet files on partition keys using 
PyArrow


Hi,

I'm trying to load a dataset I created, consisting of Parquet files partitioned 
on two columns. Reading a single partition through filters takes over 10 
minutes on the first try and still over 15 seconds on any subsequent one, while 
specifying the path to the partition directly takes about 50 milliseconds. Am I 
doing something wrong?

The data is arranged in the following way:
$ tree -d
.
├── state_code=11
│   ├── city_code=1005
│   ├── city_code=106
│   ├── city_code=1104
│   ├── city_code=114
│   ├── city_code=1203
│   ├── city_code=122
│   ├── city_code=130
│   ├── city_code=1302
...

There are 27 state codes and 5560 city codes, so 27 "first level" partitions 
and 5560 "second level" partitions in total. Each partition often contains only 
a few kB worth of Parquet files, and none is greater than ~5 MB. These files 
were written using PySpark and I have full control over how they're generated, 
in case you think there's a better way to arrange them. I chose this 
partitioning since I wish to analyse one city at a time; I have also 
experimented with a single level of partitioning with 5560 partitions, but 
didn't see any increase in performance.
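
For context, the write looks roughly like this simplified, hypothetical sketch 
(the DataFrame contents and output path are placeholders, not my real job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder frame standing in for the real data.
df = spark.createDataFrame(
    [(11, 1005, 1.23), (11, 106, 4.56)],
    ["state_code", "city_code", "value"],
)

# Writing partitioned by the two keys produces the
# state_code=<n>/city_code=<n>/ directory layout shown above.
df.write.mode("overwrite").partitionBy("state_code", "city_code").parquet("/path/to/output")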

I'm using Pandas to read the files, and have tried using PyArrow directly as 
well. Either way, I've profiled the reading of a single partition using 
cProfile, and the results clearly show PyArrow is taking the longest to run. 
I've attached the results of two runs I did in IPython with the command below: 
one right after rebooting my computer, which took well over 500 seconds, and 
one executed immediately afterwards, which took about 15 seconds:

>>> command = "pd.read_parquet('.', engine='pyarrow', filters=[('state_code', '==', 31), ('city_code', '==', 6200)])"
>>> cProfile.run(command, '/path/to/stats')

It was much better the second time around but still terrible compared to 
specifying the path manually, which took around 50 milliseconds according to 
%timeit.
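
To be explicit, the two access patterns I'm comparing look roughly like this 
(the direct path assumes the hive-style directory names from the tree listing 
above):

import pandas as pd

# Filter pushed through the dataset scan (the slow case in my timings):
df_filtered = pd.read_parquet(
    ".",
    engine="pyarrow",
    filters=[("state_code", "==", 31), ("city_code", "==", 6200)],
)

# Pointing straight at the partition directory (the ~50 ms case); note the
# partition key columns are not in this result, since they only exist in
# the directory names.
df_direct = pd.read_parquet("state_code=31/city_code=6200", engine="pyarrow")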

I have absolutely no idea why the filesystem scan is taking so long. I have 
seen the issue https://issues.apache.org/jira/browse/ARROW-11781, which is 
related to the same problem, but it mentions there should be no performance 
issue as of July 2021, whereas I'm having problems right now.

I think I'll stick to specifying the partitions to read "by hand", but I was 
really curious about whether I messed something up, or whether (Py)Arrow really 
is inefficient in a task which seems so trivial at first sight.

Thanks,
Tomaz.

P.S.: It's my first time sending an email to a mailing list, so I hope sending 
attachments is okay, and sorry if it isn't.


