To increase the minor fragment count, set the option
planner.cpu_load_average. You can also increase the number of concurrent
Parquet reader threads using store.parquet.reader.columnreader.async.
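For example, something along these lines (the values are only a starting
point to experiment with; use ALTER SYSTEM instead if you want them
cluster-wide):

    ALTER SESSION SET `planner.cpu_load_average` = 0.9;  -- let the planner use more of your cores (default 0.70)
    ALTER SESSION SET `store.parquet.reader.columnreader.async` = true;  -- asynchronous Parquet column reading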
However, since your tests with faster compression codecs showed no
improvement, I think that your query is probably memory bandwidth bound,
a common state of affairs for a single-node cluster. To add more memory
bandwidth to your cluster you'll need to scale horizontally, e.g. 2x 16 GB
Drillbits instead of 1x 32 GB Drillbit.
What do you see for TIME_DISK_SCAN? If that's also small then disk I/O is
probably not your bottleneck, which again points to memory bandwidth.
On 2023/03/18 06:36, Prabhakar Bhosale wrote:
Hi James,
Thanks for your detailed guidance. Please see my findings below.
*You wrote*: GZip compresses very well but uses a lot of CPU during
compression and
decompression. Try running a test with store.parquet.compression =
'zstd' (introduced in Drill 1.20.0). You can use CTAS statements in
Drill to create Parquet files compressed with Zstandard.
*Me*: I tried both lz4 and zstd as well, but neither of them seems to
give better results. Lz4 gives some improvement, but it is not
considerable. In the operator metrics, lz4 is faster at decompression,
but that gain is nullified by time_load_datepage and
time_to_decode_datapage.
For zstd the decompression time is the same as that of gzip.
On CPU utilization, querying all 3 types of compressed files uses roughly
the same amount of CPU.
*You Wrote:* If some columns or row groups need not be scanned, ensure
that they are
being excluded by the query.
*Me: *Yes, I had already tried this and it improved the performance
considerably. I had to sort the data while creating the Parquet files.
*You Wrote: *Ensure that your Parquet files have been partitioned to a
suitable size,
normally somewhere between 250 and 1000Mb.
*Me: *No changes made to the Drill defaults.
*You Wrote:* For some data, setting store.parquet.use_new_reader =
false will be
significantly faster.
*Me: *I am using Drill 1.20.1. In this version, this option is documented
as "Not supported in this version" and its value is false. I tried
setting it to true and the query could not complete even after 3 times
the duration taken for gzip, so I think this is not useful for my data.
*You Wrote: *If profiling the Drillbits doing the scans reveals that
they are waiting
for data due to limited I/O throughput then consider faster storage.
E.g. Data locality in HDFS can be exploited by Drill to achieve higher
throughput.
*Me: *The operator metric "TIME_DISK_SCAN_WAIT" is less than 0.2 sec, so
I don't think disk I/O is the bottleneck here.
*My additional observations are*
1. The operator metric "TIME_VARCOLUMN_READ" is taking 19+ seconds, as
most of the columns the query reads are VARCHAR. Is there any way to
improve upon this?
2. the "numfiles" reported in physical plan is different for all 3
compression for same exact data and same query. The numfiles for
gzip-109, lz4 - 161 and zstd-152. I was expecting this should be same
for all 3 compression formats. Same is the case with NUM_ROWGROUPS
operator
3. The number of minor fragments created under "PARQUET_ROW_GROUP_SCAN"
is 6. I assume this is the number of parallel threads created to read the
data. Is there any setting that would allow me to create more minor
fragments for this operator?
Thanks for reading this long email.
Regards
Prabhakar
On Mon, Mar 13, 2023 at 7:44 PM James Turton <[email protected]> wrote:
GZip compresses very well but uses a lot of CPU during compression and
decompression. Try running a test with store.parquet.compression =
'zstd' (introduced in Drill 1.20.0). You can use CTAS statements in Drill
to create Parquet files compressed with Zstandard.
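A rough sketch of what that could look like (the workspace, table and
path names below are only placeholders; dfs.tmp must be a writable
workspace):

    ALTER SESSION SET `store.parquet.compression` = 'zstd';
    CREATE TABLE dfs.tmp.`my_data_zstd` AS
    SELECT * FROM dfs.`/path/to/gzip_parquet_data`;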
If some columns or row groups need not be scanned, ensure that they are
being excluded by the query.
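For example, something like the following reads only the projected
columns, and Drill's Parquet filter pushdown can skip row groups whose
min/max statistics fall outside the predicate (column and path names here
are made up for illustration):

    SELECT order_id, order_ts, amount        -- project only the columns you need
    FROM dfs.`/path/to/orders_parquet`
    WHERE order_ts >= DATE '2023-01-01';     -- prunes row groups via column statistics, especially if data is sorted on order_ts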
Ensure that your Parquet files have been partitioned to a suitable size,
normally somewhere between 250 and 1000 MB.
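One knob that influences this when Drill writes the files is the Parquet
block (row group) size, e.g. (536870912 bytes = 512 MB is just an example
value):

    ALTER SESSION SET `store.parquet.block-size` = 536870912;  -- target row group size for subsequent CTAS writes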
For some data, setting store.parquet.use_new_reader = false will be
significantly faster.
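That is an ordinary session/system option, e.g.:

    ALTER SESSION SET `store.parquet.use_new_reader` = false;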
If profiling the Drillbits doing the scans reveals that they are waiting
for data due to limited I/O throughput then consider faster storage.
E.g. data locality in HDFS can be exploited by Drill to achieve higher
throughput.
On 2023/03/07 08:17, Prabhakar Bhosale wrote:
> Hi team,
> I have compressed (gzip) Parquet files created with Apache Drill. The
> total folder size is 7.8 GB and the number of rows is 116,249,263. The
> query takes 2 min 18 sec.
> Most of the time is spent on "PARQUET_ROW_GROUP_SCAN". Is there any way
> to improve this performance?
> I am using
> Drill - 1.20
> CPU - 8 cores
> mem - 16 GB
>
> I also tried increasing memory to 32 GB but it made little difference. I
> also tried certain recommendations given in the Drill documentation, but
> with no success.
>
> Any pointer/help is highly appreciated. Thanks
>
> Regards
> Prabhakar
>