To increase the minor fragment count, set the option planner.cpu_load_average. You can also increase the number of concurrent Parquet reader threads with store.parquet.reader.columnreader.async.
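For example (the values below are only illustrative starting points, not recommendations, and should be tuned for your own node):

-- illustrative values; tune for your own hardware
ALTER SYSTEM SET `planner.cpu_load_average` = 0.95;
ALTER SYSTEM SET `store.parquet.reader.columnreader.async` = true;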

However, since your tests with faster compression codecs showed no improvement, I think that your query is probably memory bandwidth bound, a common state of affairs for a single-node cluster. To add more memory bandwidth to your cluster you'll need to scale horizontally, e.g. two 16 GB Drillbits instead of one 32 GB Drillbit.

What do you see for TIME_DISK_SCAN? If that's also small then the scan isn't waiting on I/O and the time is going into CPU work (decompression and decoding), which would be consistent with the memory bandwidth theory above.

On 2023/03/18 06:36, Prabhakar Bhosale wrote:
Hi James,
Thanks for your detailed guidance. Please see my findings below.

*You wrote*: GZip compresses very well but uses a lot of CPU during compression and
decompression. Try running a test with store.parquet.compression =
'zstd' (introduced in Drill 1.20.0). You can use CTAS statements in
Drill to create Parquet files compressed with Zstandard.
*Me*: I tried both lz4 and zstd as well, but neither of them seems to give better results. lz4 gives some improvement, but not a considerable one. In the operator metrics, lz4 is faster at decompression, but that gain is cancelled out by time_load_datapage and time_to_decode_datapage.
For zstd the decompression time is the same as that of gzip.
Regarding CPU utilization, querying all 3 types of compressed files uses roughly the same amount of CPU.
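(A recompression test of this kind can be run with a session option plus CTAS, e.g. as below; dfs.tmp and the table names are just placeholders.)

ALTER SESSION SET `store.parquet.compression` = 'zstd';
CREATE TABLE dfs.tmp.`events_zstd` AS SELECT * FROM dfs.tmp.`events_gzip`;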

*You Wrote:* If some columns or row groups need not be scanned, ensure that they are
being excluded by the query.
*Me*: Yes, I had already tried this and it improved the performance considerably. I had to sort the data while creating the Parquet files.
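(One way to produce sorted Parquet like that is a CTAS with an ORDER BY; the table and column names below are hypothetical. Sorting on the commonly filtered column keeps each row group's min/max statistics tight, so row groups outside the filter range can be skipped, and selecting only the needed columns avoids reading the rest.)

-- hypothetical table and column names
CREATE TABLE dfs.tmp.`events_sorted` AS
SELECT event_ts, account_id, event_type
FROM dfs.tmp.`events_gzip`
ORDER BY event_ts;

SELECT account_id, event_type
FROM dfs.tmp.`events_sorted`
WHERE event_ts >= TIMESTAMP '2023-01-01 00:00:00';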

*You Wrote: *Ensure that your Parquet files have been partitioned to a suitable size,
normally somewhere between 250 and 1000Mb.
*Me*: No changes made to the Drill defaults.
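(If you do want to experiment with row group sizing, the option to set before the CTAS is store.parquet.block-size, with the value in bytes; 512 MB below is only an example.)

ALTER SESSION SET `store.parquet.block-size` = 536870912;  -- 512 MB per row group (example value)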

*You Wrote:* For some data, setting store.parquet.use_new_reader = false will be
significantly faster.
*Me*: I am using Drill 1.20.1. In this version the option's description says "Not supported in this version" and its value is false. I tried setting it to true, and the query could not complete even after 3 times the duration taken for gzip, so I think this is not useful for my data.
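(The current value and scope of that option can be checked from sys.options, e.g.:)

SELECT * FROM sys.options WHERE name = 'store.parquet.use_new_reader';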

*You Wrote: *If profiling the Drillbits doing the scans reveals that they are waiting
for data due to limited I/O throughput then consider faster storage.
E.g. Data locality in HDFS can be exploited by Drill to achieve higher
throughput.
*Me*: The operator metric "TIME_DISK_SCAN_WAIT" is less than 0.2 sec, so I don't think disk I/O is the bottleneck here.

*My additional observations are:*
1. The operator metric "TIME_VARCOLUMN_READ" is taking 19+ seconds, as most of the columns the query reads are VARCHAR. Is there any way to improve upon this?
2. The "numFiles" reported in the physical plan is different for all 3 compressions of the same exact data with the same query: 109 for gzip, 161 for lz4 and 152 for zstd. I was expecting this to be the same for all 3 compression formats. The same is the case with the NUM_ROWGROUPS metric (an EXPLAIN example is sketched after this list).
3. The minor fragments created under "PARQUET_ROW_GROUP_SCAN" number 6. I assume these are the parallel threads created to select the data. Is there any setting that would allow me to create more minor fragments for this operator?
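(One way to view numFiles and the row group counts in the physical plan is EXPLAIN; the table and column names below are placeholders.)

EXPLAIN PLAN FOR
SELECT account_id, event_type
FROM dfs.tmp.`events_gzip`
WHERE event_ts >= TIMESTAMP '2023-01-01 00:00:00';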

Thanks for reading this long email.

Regards
Prabhakar



On Mon, Mar 13, 2023 at 7:44 PM James Turton <[email protected]> wrote:

    GZip compresses very well but uses a lot of CPU during compression
    and
    decompression. Try running a test with store.parquet.compression =
    'zstd' (introduced in Drill 1.20.0). You can use CTAS statements in
    Drill to create Parquet files compressed with Zstandard.

    If some columns or row groups need not be scanned, ensure that
    they are
    being excluded by the query.

    Ensure that your Parquet files have been partitioned to a suitable
    size,
    normally somewhere between 250 and 1000Mb.

    For some data, setting store.parquet.use_new_reader = false will be
    significantly faster.

    If profiling the Drillbits doing the scans reveals that they are
    waiting
    for data due to limited I/O throughput then consider faster storage.
    E.g. Data locality in HDFS can be exploited by Drill to achieve
    higher
    throughput.


    On 2023/03/07 08:17, Prabhakar Bhosale wrote:
    > hi team,
    > I have compressed (gzip) parquet files created with apache
    drill. the total
    > folder size is 7.8gb and the number of rows are 116,249,263. the
    query
    > takes 2min 18sec.
    > Most of the time is spent on "PARQUET_ROW_GROUP_SCAN". Is there
    any way to
    > improve this performance?
    > i am using
    > Drill - 1.20
    > CPU - 8 core
    > mem - 16gb
    >
    > I also tried increasing memory to 32GB but no much difference. I
    also tried
    > certain recommendations given in drill documentation but with no
    success.
    >
    > Any pointer/help is highly appreciated. thx
    >
    > Regards
    > Prabhakar
    >
