Thanks for the information, Kunal. After the conversion, the output comes down to about half the size when I use gzip compression: a 10 GB gzipped CSV source file becomes 5 GB of Parquet (written as 2 GB + 2 GB + 1 GB files, using gzip compression). So, given that I have to produce multiple Parquet files anyway, what block size would be optimal for reading the files back later?
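For reference, a minimal sketch of the conversion in Drill SQL; the workspaces and file name below are placeholders, not actual paths:

-- Placeholder workspaces (s3.raw, s3.tmp) and file name; store.parquet.compression selects the output codec.
ALTER SESSION SET `store.parquet.compression` = 'gzip';
-- Leaving store.parquet.block-size at its default (536870912 bytes = 512 MB) keeps one row group per
-- block, so a 10 GB source naturally lands in several Parquet files that can be read in parallel.
CREATE TABLE s3.tmp.`sales_parquet` AS
SELECT * FROM s3.raw.`sales_10gb.csv.gz`;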
On 09-Jun-2017 11:28 PM, "Kunal Khatua" <kkha...@mapr.com> wrote:

> If you're storing this in S3... you might want to selectively read the files as well.
>
> I'm only speculating, but if you want to download the data, downloading it as a queue of files might be more reliable than one massive file. Similarly, within AWS, it *might* be faster to have an EC2 instance access a couple of large Parquet files versus one massive Parquet file.
>
> Remember that when you create a large block size, Drill tries to write everything within a single row group for each file. So there is no chance of parallelizing the read (i.e. reading parts in parallel). The defaults should work well for S3 too, and with compression (e.g. Snappy) you should get a reasonably smaller file size.
>
> With the current default settings... have you seen what Parquet file sizes you get with Drill when converting your 10GB CSV source files?
>
> ________________________________
> From: Shuporno Choudhury <shuporno.choudh...@manthan.com>
> Sent: Friday, June 9, 2017 10:50:06 AM
> To: user@drill.apache.org
> Subject: Re: Increasing store.parquet.block-size
>
> Thanks, Kunal, for your insight.
> I am actually converting some .csv files and storing them in Parquet format in S3, not in HDFS.
> The individual .csv source files can be quite large (around 10 GB).
> So, is there a way to overcome this and create one Parquet file, or do I have to go ahead with multiple Parquet files?
>
> On 09-Jun-2017 11:04 PM, "Kunal Khatua" <kkha...@mapr.com> wrote:
>
> > Shuporno,
> >
> > There are some interesting problems when using Parquet files > 2GB on HDFS.
> >
> > If I'm not mistaken, the HDFS APIs that allow you to read offsets (oddly enough) return an int value. A large Parquet block size also means the file will span multiple HDFS blocks, and that would make reading of row groups inefficient.
> >
> > Is there a reason you want to create such a large Parquet file?
> >
> > ~ Kunal
> >
> > ________________________________
> > From: Vitalii Diravka <vitalii.dira...@gmail.com>
> > Sent: Friday, June 9, 2017 4:49:02 AM
> > To: user@drill.apache.org
> > Subject: Re: Increasing store.parquet.block-size
> >
> > Khurram,
> >
> > DRILL-2478 is a good placeholder for the LongValidator issue; it really does behave incorrectly.
> >
> > But the other issue concerns the impossibility of using long values for the Parquet block size. That issue can be an independent task or a sub-task of updating the Drill project to the latest Parquet library.
> >
> > Kind regards,
> > Vitalii
> >
> > On Fri, Jun 9, 2017 at 10:25 AM, Khurram Faraaz <kfar...@mapr.com> wrote:
> >
> > > 1. DRILL-2478 <https://issues.apache.org/jira/browse/DRILL-2478> is Open for this issue.
> > > 2. I have added more details in the comments.
> > >
> > > Thanks,
> > > Khurram
> > >
> > > ________________________________
> > > From: Shuporno Choudhury <shuporno.choudh...@manthan.com>
> > > Sent: Friday, June 9, 2017 12:48:41 PM
> > > To: user@drill.apache.org
> > > Subject: Increasing store.parquet.block-size
> > >
> > > The maximum value that can be assigned to *store.parquet.block-size* is *2147483647*, as the value kind of this configuration parameter is LONG. This basically translates to a 2 GB block size.
> > > How do I increase it to 3/4/5 GB?
> > > Trying to set this parameter to a higher value using the following command actually succeeds:
> > > ALTER SYSTEM SET `store.parquet.block-size` = 4294967296;
> > > But when I try to run a query that uses this config, it throws the following error:
> > > Error: SYSTEM ERROR: NumberFormatException: For input string: "4294967296"
> > > So, is it possible to assign a higher value to this parameter?
> > >
> > > --
> > > Regards,
> > > Shuporno Choudhury
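A follow-up sketch of working around the behavior reported above (sys.options is Drill's standard options table; its exact columns vary by version, hence SELECT *):

-- Inspect the current setting.
SELECT * FROM sys.options WHERE name = 'store.parquet.block-size';
-- Until the LongValidator issue (DRILL-2478) is resolved, staying at or below the int maximum
-- avoids the NumberFormatException shown above.
ALTER SYSTEM SET `store.parquet.block-size` = 2147483647;
-- Or revert to the default (536870912):
ALTER SYSTEM RESET `store.parquet.block-size`;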