The ideal size depends on what engine is consuming the Parquet files (Drill,
I'm guessing)... and on the storage layer. For HDFS, where the block size is
usually 128-256 MB, we recommend bumping it to about 512 MB (with the
underlying HDFS block size set to match that).


You'll probably need to experiment a little with different block sizes stored
on S3 to see which works best.
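
For example, if this were HDFS/MapR-FS, a minimal sketch would look like the
following (536870912 is just 512 MB expressed in bytes, and you would also set
dfs.blocksize on the HDFS side to the same value so a row group fits in a
single filesystem block):

    -- 512 MB in bytes; a starting point to experiment from, not a hard rule
    ALTER SYSTEM SET `store.parquet.block-size` = 536870912;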


________________________________
From: Shuporno Choudhury <shuporno.choudh...@manthan.com>
Sent: Friday, June 9, 2017 11:23:37 AM
To: user@drill.apache.org
Subject: Re: Increasing store.parquet.block-size

Thanks for the information Kunal.
After the conversion, the file size scales down to about half when I use gzip
compression.
For a 10 GB gzipped CSV source file, I get roughly 5 GB of Parquet output
(2 + 2 + 1 GB across three files), using gzip compression.
So, if I have to produce multiple Parquet files, what block size would be
optimal if I have to read the files later?
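
For reference, the conversion is basically just a CTAS with gzip compression
enabled; a rough sketch with placeholder workspace and file names (these are
illustrative, not my actual paths):

    -- the s3 workspaces and file name below are placeholders
    ALTER SESSION SET `store.parquet.compression` = 'gzip';
    CREATE TABLE s3.tmp.`converted` AS
    SELECT * FROM s3.root.`source_10gb.csv.gz`;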

On 09-Jun-2017 11:28 PM, "Kunal Khatua" <kkha...@mapr.com> wrote:

>
> If you're storing this in S3... you might want to selectively read the
> files as well.
>
>
> I'm only speculating, but if you want to download the data, downloading as
> a queue of files might be more reliable than one massive file. Similarly,
> within AWS, it *might* be faster to have an EC2 instance access a couple of
> large Parquet files versus one massive Parquet file.
>
>
> Remember that when you use a large block size, Drill tries to write
> everything into a single row group per file. So there is no chance of
> parallelizing the read (i.e. reading parts of the file in parallel). The
> defaults should work well for S3 as well, and with compression (e.g. Snappy)
> you should get a reasonably smaller file size.
>
>
> With the current default settings... have you seen what Parquet file sizes
> you get with Drill when converting your 10GB CSV source files?
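>
> To double-check what those defaults are, a quick sketch (sys.options just
> reports the active values; nothing here changes any settings):
>
>     SELECT * FROM sys.options WHERE name LIKE 'store.parquet%';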
>
>
> ________________________________
> From: Shuporno Choudhury <shuporno.choudh...@manthan.com>
> Sent: Friday, June 9, 2017 10:50:06 AM
> To: user@drill.apache.org
> Subject: Re: Increasing store.parquet.block-size
>
> Thanks Kunal for your insight.
> I am actually converting some .csv files and storing them in Parquet format
> in S3, not in HDFS.
> The size of the individual .csv source files can be quite large (around
> 10 GB).
> So, is there a way to overcome this and create one Parquet file, or do I
> have to go ahead with multiple Parquet files?
>
> On 09-Jun-2017 11:04 PM, "Kunal Khatua" <kkha...@mapr.com> wrote:
>
> > Shuporno
> >
> >
> > There are some interesting problems when using Parquet files > 2GB on
> > HDFS.
> >
> >
> > If I'm not mistaken, the HDFS APIs that let you read at an offset (oddly
> > enough) return an int value. A large Parquet block size also means you'll
> > end up having the file span multiple HDFS blocks, and that would make
> > reading of row groups inefficient.
> >
> >
> > Is there a reason you want to create such a large parquet file?
> >
> >
> > ~ Kunal
> >
> > ________________________________
> > From: Vitalii Diravka <vitalii.dira...@gmail.com>
> > Sent: Friday, June 9, 2017 4:49:02 AM
> > To: user@drill.apache.org
> > Subject: Re: Increasing store.parquet.block-size
> >
> > Khurram,
> >
> > DRILL-2478 is a good placeholder for the LongValidator issue; it really
> > does behave incorrectly.
> >
> > But there is another issue, related to the inability to use long values
> > for the Parquet block-size.
> > That issue can be an independent task or a sub-task of updating the Drill
> > project to the latest Parquet library.
> >
> > Kind regards
> > Vitalii
> >
> > On Fri, Jun 9, 2017 at 10:25 AM, Khurram Faraaz <kfar...@mapr.com>
> > wrote:
> >
> > >   1.  DRILL-2478<https://issues.apache.org/jira/browse/DRILL-2478> is
> > >       open for this issue.
> > >   2.  I have added more details in the comments.
> > >
> > > Thanks,
> > > Khurram
> > >
> > > ________________________________
> > > From: Shuporno Choudhury <shuporno.choudh...@manthan.com>
> > > Sent: Friday, June 9, 2017 12:48:41 PM
> > > To: user@drill.apache.org
> > > Subject: Increasing store.parquet.block-size
> > >
> > > The max value that can be assigned to *store.parquet.block-size* is
> > > *2147483647*, as the value kind of this configuration parameter is LONG.
> > > This basically translates to a 2 GB block size.
> > > How do I increase it to 3/4/5 GB?
> > > Trying to set this parameter to a higher value using the following
> > > command actually succeeds:
> > >     ALTER SYSTEM SET `store.parquet.block-size` = 4294967296;
> > > But when I try to run a query that uses this config, it throws the
> > > following error:
> > >    Error: SYSTEM ERROR: NumberFormatException: For input string:
> > > "4294967296"
> > > So, is it possible to assign a higher value to this parameter?
> > > --
> > > Regards,
> > > Shuporno Choudhury
> > >
> >
>
