Not reusing file names is generally a good idea - there are a bunch of
interesting consistency issues, particularly on object stores, if you reuse
file paths. This has come up for us with things like INSERT OVERWRITE in
Hive, which tends to generate the same file names.
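
As a side note, a minimal sketch (in Python; nothing the format itself
mandates) of the usual way around this: embed a per-job UUID in each part
file name, which is roughly what Spark's part-file naming does, so an
overwrite produces fresh paths instead of rewriting old ones.

    import uuid

    # One UUID per write job; every part file from that job embeds it, so
    # a later overwrite of the same table writes new paths rather than
    # reusing old ones on an eventually consistent object store.
    job_id = uuid.uuid4()

    def part_file_name(index: int) -> str:
        return f"part-{index:05d}-{job_id}.parquet"

    for i in range(4):
        print(part_file_name(i))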

I think there's an interesting set of discussions to be had around best
practices for file sizes and row group sizes.

One point is that a lot of big data frameworks schedule parallel work based
on filesystem metadata only (i.e. file sizes and block sizes, if the
filesystem has a concept of a block). If you have arbitrary Parquet files
this can break down in various ways - e.g. with a 1GB file, you have to
guess how to divide up the processing. If there are fewer row groups than
expected you'll get skew, and if there are more you'll lose out on
parallelism. HDFS blocks were often a good way to do this, since a lot of
writers aim for one row group per block, but Parquet files often come from
a variety of sources and get munged in different ways, so the heuristic
falls over in some applications. It's somewhat worse on object stores like
S3, where there isn't a concept of a block, just whatever the writer and
reader have configured - you ideally want reader and writer block sizes to
line up, but coordinating that can be difficult for some workflows.
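
For what it's worth, the row group layout a scheduler has to guess at is
easy to inspect from the footer - a minimal sketch with pyarrow (the 128MB
target here is just an illustrative HDFS-style block size):

    import pyarrow.parquet as pq

    TARGET_SPLIT_BYTES = 128 * 1024 * 1024  # illustrative split size

    md = pq.ParquetFile("foo.parquet").metadata
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        # total_byte_size is the uncompressed size of the row group's data
        print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")

    # If these sizes diverge a lot from TARGET_SPLIT_BYTES, a scheduler
    # splitting on file size alone will either skew or under-parallelize.
    print(f"{md.num_row_groups} row groups total")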

Working on Impala, I'm a bit biased towards larger blocks, because of the
scheduling problems and also because of the extra overhead added with row
groups - we end up needing to do extra I/O operations per row group, adding
overhead (some of the overhead is inherent because the data you're reading
is more fragmented; some of it is just our implementation).
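
To make that concrete, a hedged pyarrow sketch of biasing a writer towards
fewer, larger row groups - row_group_size caps the rows per group, so the
right value depends on your average row width:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": list(range(1_000_000))})  # stand-in data

    # Fewer, larger row groups mean fewer per-group I/O operations on the
    # read side; pick the row cap so a group lands near your target bytes.
    pq.write_table(table, "foo.parquet", row_group_size=500_000)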

On Wed, May 22, 2019 at 11:55 AM Brian Bowman <brian.bow...@sas.com> wrote:

>  Thanks for the info!
>
> HDFS is only one of many storage platforms (distributed or otherwise) that
> SAS supports.  In general, larger physical files (e.g. 100MB to 1GB) with
> multiple RowGroups are also a good thing for our use cases.  I'm working
> to get our Parquet (C to C++ via libparquet.so) writer to do this.
>
> -Brian
>
> On 5/22/19, 1:21 PM, "Lee, David" <david....@blackrock.com> wrote:
>
>     I'm not a big fan of this convention, which is a Spark convention.
>
>     A. The files should have at least "foo" in the name. Using PyArrow I
> would create these files as foo.1.parquet, foo.2.parquet, etc.
>     B. These files are around 3 megs each. For HDFS storage, files should
> be sized to match the HDFS block size, which is usually set at 128 megs
> (default) or 256 megs, 512 megs, 1 gig, etc.
>
>     https://blog.cloudera.com/blog/2009/02/the-small-files-problem/
>
>     I usually take small parquet files and save them as parquet row groups
> in a larger parquet file to match the HDFS block size.
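>
>     A rough sketch of that compaction pattern with pyarrow (file names
> here are hypothetical) - each write_table call emits at least one row
> group, so every small input becomes a row group in the combined file:
>
>         import pyarrow.parquet as pq
>
>         small_files = ["foo.1.parquet", "foo.2.parquet"]  # hypothetical
>         tables = [pq.read_table(p) for p in small_files]
>
>         # Append each small file's data as its own row group(s) inside
>         # one larger, block-sized output file.
>         writer = pq.ParquetWriter("foo.combined.parquet", tables[0].schema)
>         for t in tables:
>             writer.write_table(t)
>         writer.close()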
>
>     -----Original Message-----
>     From: Brian Bowman <brian.bow...@sas.com>
>     Sent: Wednesday, May 22, 2019 8:40 AM
>     To: dev@parquet.apache.org
>     Subject: Parquet File Naming Convention Standards
>
>     All,
>
>     Here is an example .parquet data set saved using pySpark, where the
> following files are members of the directory “foo.parquet”:
>
>     -rw-r--r--    1 sasbpb  r&d        8 Mar 26 12:10 ._SUCCESS.crc
>     -rw-r--r--    1 sasbpb  r&d    25632 Mar 26 12:10 .part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
>     -rw-r--r--    1 sasbpb  r&d    25356 Mar 26 12:10 .part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
>     -rw-r--r--    1 sasbpb  r&d    26300 Mar 26 12:10 .part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
>     -rw-r--r--    1 sasbpb  r&d    23728 Mar 26 12:10 .part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
>     -rw-r--r--    1 sasbpb  r&d        0 Mar 26 12:10 _SUCCESS
>     -rw-r--r--    1 sasbpb  r&d  3279617 Mar 26 12:10 part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
>     -rw-r--r--    1 sasbpb  r&d  3244105 Mar 26 12:10 part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
>     -rw-r--r--    1 sasbpb  r&d  3365039 Mar 26 12:10 part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
>     -rw-r--r--    1 sasbpb  r&d  3035960 Mar 26 12:10 part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
>
>
>     Questions:
>
>       1.  Is this the “standard” for creating/saving a .parquet data set?
>       2.  It appears that “b84abe50-a92b-4b2b-b011-30990891fb83” is a
> UUID.  Is the format part-fileSeq#-UUID.parquet or
> part-fileSeq#-UUID.parquet.crc an established convention?  Is this
> documented somewhere?
>       3.  Is there a C++ class to create the CRC?
>
>
>     Thanks,
>
>
>     Brian
>
