Hi James,

Thanks for your detailed guidance. Please see my findings below.

*You wrote:* GZip compresses very well but uses a lot of CPU during compression and decompression. Try running a test with store.parquet.compression = 'zstd' (introduced in Drill 1.20.0). You can use CTAS statements in Drill to create Parquet files compressed with Zstandard.

*Me:* I tried both lz4 and zstd, but neither of them gives noticeably better results. LZ4 shows some improvement, but nothing considerable. In the operator metrics, LZ4 is faster at decompression, but that gain is cancelled out by time_load_datepage and time_to_decode_datapage. For zstd the decompression time is about the same as for gzip. On CPU utilization, querying all three types of compressed files uses roughly the same amount of CPU.
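For reference, this is roughly how I ran the zstd test, per your suggestion (the workspace and paths here are placeholders for my actual locations):

```sql
-- Switch the Parquet writer codec for this session (Drill 1.20+)
ALTER SESSION SET `store.parquet.compression` = 'zstd';

-- Re-write the existing data with CTAS so the new files use Zstandard
CREATE TABLE dfs.tmp.`my_table_zstd` AS
SELECT * FROM dfs.`/data/my_table_gzip`;
```

The lz4 test was the same with `store.parquet.compression` = 'lz4'.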
*You wrote:* If some columns or row groups need not be scanned, ensure that they are being excluded by the query.

*Me:* Yes, I had already tried this and it improved the performance considerably. I had to sort the data while creating the Parquet files.

*You wrote:* Ensure that your Parquet files have been partitioned to a suitable size, normally somewhere between 250 and 1000 MB.

*Me:* No changes made to the Drill defaults.

*You wrote:* For some data, setting store.parquet.use_new_reader = false will be significantly faster.

*Me:* I am using Drill 1.20.1. In this version the option is described as "Not supported in this version" and its value is false. I tried setting it to true, and the query could not complete even after three times the duration taken for gzip, so I don't think this option helps for my data.

*You wrote:* If profiling the Drillbits doing the scans reveals that they are waiting for data due to limited I/O throughput then consider faster storage. E.g. data locality in HDFS can be exploited by Drill to achieve higher throughput.

*Me:* The operator metric TIME_DISK_SCAN_WAIT is less than 0.2 sec, so I don't think disk I/O is the bottleneck here.

*My additional observations are:*

1. The operator metric TIME_VARCOLUMN_READ is taking 19+ seconds, as most of the columns the query reads are VARCHAR. Is there any way to improve on this?

2. The "numFiles" reported in the physical plan is different for all three compression codecs, for exactly the same data and the same query: gzip - 109, lz4 - 161 and zstd - 152. I was expecting this to be the same for all three compression formats. The same is true of the NUM_ROWGROUPS metric.

3. The minor fragments created under PARQUET_ROW_GROUP_SCAN are 6. I assume these are the number of parallel threads created to read the data. Is there any setting that would allow me to create more minor fragments for this operator?

Thanks for reading this long email.
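PS: the sorted re-write I mentioned above was done along these lines (table names and the sort column are placeholders for my actual schema):

```sql
-- Writing the data pre-sorted keeps each row group's min/max
-- statistics tight, so Drill can skip whole row groups when the
-- query filters on the sorted column
CREATE TABLE dfs.tmp.`my_table_sorted` AS
SELECT * FROM dfs.`/data/my_table`
ORDER BY filter_column;
```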
Regards
Prabhakar

On Mon, Mar 13, 2023 at 7:44 PM James Turton <[email protected]> wrote:

> GZip compresses very well but uses a lot of CPU during compression and
> decompression. Try running a test with store.parquet.compression =
> 'zstd' (introduced in Drill 1.20.0). You can use CTAS statements in
> Drill to create Parquet files compressed with Zstandard.
>
> If some columns or row groups need not be scanned, ensure that they are
> being excluded by the query.
>
> Ensure that your Parquet files have been partitioned to a suitable size,
> normally somewhere between 250 and 1000Mb.
>
> For some data, setting store.parquet.use_new_reader = false will be
> significantly faster.
>
> If profiling the Drillbits doing the scans reveals that they are waiting
> for data due to limited I/O throughput then consider faster storage.
> E.g. Data locality in HDFS can be exploited by Drill to achieve higher
> throughput.
>
>
> On 2023/03/07 08:17, Prabhakar Bhosale wrote:
> > hi team,
> > I have compressed (gzip) parquet files created with apache drill. the
> > total folder size is 7.8gb and the number of rows are 116,249,263. the
> > query takes 2min 18sec.
> > Most of the time is spent on "PARQUET_ROW_GROUP_SCAN". Is there any way
> > to improve this performance?
> > i am using
> > Drill - 1.20
> > CPU - 8 core
> > mem - 16gb
> >
> > I also tried increasing memory to 32GB but no much difference. I also
> > tried certain recommendations given in drill documentation but with no
> > success.
> >
> > Any pointer/help is highly appreciated. thx
> >
> > Regards
> > Prabhakar
