Thanks Padma.

________________________________
From: Padma Penumarthy <ppenumar...@mapr.com>
Sent: Thursday, June 15, 2017 8:58:44 AM
To: user@drill.apache.org
Subject: Re: Increasing store.parquet.block-size
Sure. I will check and try to fix them as well.

Thanks,
Padma

> On Jun 14, 2017, at 3:12 AM, Khurram Faraaz <kfar...@mapr.com> wrote:
>
> Thanks Padma. There are some more related failures reported in DRILL-2478; do you think we should fix them too, if it is an easy fix?
>
> Regards,
> Khurram
>
> ________________________________
> From: Padma Penumarthy <ppenumar...@mapr.com>
> Sent: Wednesday, June 14, 2017 11:43:16 AM
> To: user@drill.apache.org
> Subject: Re: Increasing store.parquet.block-size
>
> I think you meant MB (not GB) below.
> HDFS allows creation of very large files (theoretically, there is no limit).
> I am wondering why a >2GB file is a problem. Maybe it is a blockSize >2GB that is not recommended.
>
> Anyway, we should not let the user set any value and then throw an error later.
> I opened a PR to fix this.
> https://github.com/apache/drill/pull/852
>
> Thanks,
> Padma
>
> On Jun 9, 2017, at 11:36 AM, Kunal Khatua <kkha...@mapr.com> wrote:
>
> The ideal size depends on what engine is consuming the parquet files (Drill, I'm guessing)... and the storage layer. For HDFS, which is usually 128-256GB, we recommend bumping it to about 512GB (with the underlying HDFS blocksize set to match).
>
> You'll probably need to experiment a little with different block sizes stored on S3 to see which works best.
>
> ________________________________
> From: Shuporno Choudhury <shuporno.choudh...@manthan.com>
> Sent: Friday, June 9, 2017 11:23:37 AM
> To: user@drill.apache.org
> Subject: Re: Increasing store.parquet.block-size
>
> Thanks for the information, Kunal.
> After the conversion, the file size scales down to half if I use gzip compression.
> For a 10 GB gzipped csv source file, it becomes a 5GB (2+2+1) parquet file (using gzip compression).
> So, if I have to make multiple parquet files, what block size would be optimal if I have to read the file later?
>
> On 09-Jun-2017 11:28 PM, "Kunal Khatua" <kkha...@mapr.com> wrote:
>
> If you're storing this in S3... you might want to selectively read the files as well.
>
> I'm only speculating, but if you want to download the data, downloading it as a queue of files might be more reliable than one massive file. Similarly, within AWS, it *might* be faster to have an EC2 instance access a couple of large Parquet files versus one massive Parquet file.
>
> Remember that when you configure a large block size, Drill tries to write everything within a single row group for each file. So there is no chance of parallelization of the read (i.e. reading parts in parallel). The defaults should work well for S3 as well, and with compression (e.g. Snappy) you should get a reasonably smaller file size.
>
> With the current default settings... have you seen what Parquet file sizes you get with Drill when converting your 10GB CSV source files?
>
> ________________________________
> From: Shuporno Choudhury <shuporno.choudh...@manthan.com>
> Sent: Friday, June 9, 2017 10:50:06 AM
> To: user@drill.apache.org
> Subject: Re: Increasing store.parquet.block-size
>
> Thanks Kunal for your insight.
> I am actually converting some .csv files and storing them in parquet format in S3, not in HDFS.
> The size of the individual .csv source files can be quite huge (around 10GB).
> So, is there a way to overcome this and create one parquet file, or do I have to go ahead with multiple parquet files?
>
> On 09-Jun-2017 11:04 PM, "Kunal Khatua" <kkha...@mapr.com> wrote:
>
> Shuporno
>
> There are some interesting problems when using Parquet files > 2GB on HDFS.
>
> If I'm not mistaken, the HDFS APIs that allow you to read offsets (oddly enough) return an int value. A large Parquet blocksize also means you'll end up having the file span multiple HDFS blocks, and that would make reading of rowgroups inefficient.
>
> Is there a reason you want to create such a large parquet file?
>
> ~ Kunal
>
> ________________________________
> From: Vitalii Diravka <vitalii.dira...@gmail.com>
> Sent: Friday, June 9, 2017 4:49:02 AM
> To: user@drill.apache.org
> Subject: Re: Increasing store.parquet.block-size
>
> Khurram,
>
> DRILL-2478 is a good placeholder for the LongValidator issue; it really does work incorrectly.
>
> But the other issue is the impossibility of using long values for the parquet block-size.
> That issue can be an independent task or a sub-task of updating the Drill project to the latest parquet library.
>
> Kind regards
> Vitalii
>
> On Fri, Jun 9, 2017 at 10:25 AM, Khurram Faraaz <kfar...@mapr.com> wrote:
>
> 1. DRILL-2478 <https://issues.apache.org/jira/browse/DRILL-2478> is Open for this issue.
> 2. I have added more details in the comments.
>
> Thanks,
> Khurram
>
> ________________________________
> From: Shuporno Choudhury <shuporno.choudh...@manthan.com>
> Sent: Friday, June 9, 2017 12:48:41 PM
> To: user@drill.apache.org
> Subject: Increasing store.parquet.block-size
>
> The max value that can be assigned to *store.parquet.block-size* is *2147483647*, as the value kind of this configuration parameter is LONG.
> This basically translates to a 2GB block size.
> How do I increase it to 3/4/5 GB?
> Trying to set this parameter to a higher value using the following command actually succeeds:
> ALTER SYSTEM SET `store.parquet.block-size` = 4294967296;
> But when I try to run a query that uses this config, it throws the following error:
> Error: SYSTEM ERROR: NumberFormatException: For input string: "4294967296"
> So, is it possible to assign a higher value to this parameter?
>
> --
> Regards,
> Shuporno Choudhury
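For readers following the thread, a minimal sketch of the option commands being discussed (the 512 MB value is only an example; as reported above, values larger than 2147483647 are accepted by ALTER SYSTEM but fail at query time):

    -- Inspect the current value of the option
    SELECT * FROM sys.options WHERE name = 'store.parquet.block-size';

    -- Values up to 2147483647 (2 GB) are accepted and applied
    ALTER SYSTEM SET `store.parquet.block-size` = 536870912;  -- 512 MB

    -- A larger value "succeeds" here but a later query fails with
    -- NumberFormatException: For input string: "4294967296"
    ALTER SYSTEM SET `store.parquet.block-size` = 4294967296;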
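And a sketch of the CSV-to-Parquet conversion under the default block size, so the resulting file sizes can be compared as Kunal suggests; the dfs.tmp workspace, source path, and table name are hypothetical placeholders, not from the thread:

    -- Write Parquet output with gzip compression, as in the 10 GB example above
    ALTER SESSION SET `store.format` = 'parquet';
    ALTER SESSION SET `store.parquet.compression` = 'gzip';

    -- CTAS from a CSV file; with no headers the CSV is exposed as a `columns` array,
    -- so a real query would typically project and cast individual columns here
    CREATE TABLE dfs.tmp.`sales_parquet` AS
    SELECT * FROM dfs.`/data/source/sales.csv`;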