Dear Dmitry,

Thanks a lot for the quick reply! I had not thought of this. However, I have
just tried out both ways (per query and in the cluster configuration) and did
not see any change. Is there a way to verify that the setting was applied
successfully? I also tried setting compiler.parallelism to 4 and still
observed 16 cores being utilized.
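
For reference, the per-query attempt looked roughly like the following (the
dataset name is a placeholder for my actual Parquet collection):

    SET `compiler.parallelism` "4";
    SELECT COUNT(*) FROM MyParquetDataset;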

Note that the observed degree of parallelism does not correspond to anything
related to the data set (I tried every power of two between 1 and 128 files)
or to the cluster (I tried every power of two between 2 and 64 cores, as well
as 48 and 96): in every configuration, I see 16 cores being used (or fewer,
if the system has fewer). To me, this makes it unlikely that the system
really applies the documented semantics for p=0 or p<0; it looks more like
some hard-coded value.

Cheers,
Ingo


> -----Original Message-----
> From: Dmitry Lychagin <[email protected]>
> Sent: Monday, August 9, 2021 7:25 PM
> To: [email protected]
> Subject: Re: Increasing degree of parallelism when reading Parquet files
> 
> Ingo,
> 
> We have a `compiler.parallelism` parameter that controls how many cores
> are used for query execution.
> 
> See
> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
> 
> You can either set it per query (e.g. SET `compiler.parallelism` "-1";),
> or globally in the cluster configuration:
> 
> https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
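> 
> For example, the relevant entry would look something like this (see the
> linked file for the exact placement in the [common] section):
> 
>     [common]
>     compiler.parallelism=-1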
> 
> Thanks,
> 
> -- Dmitry
> 
> From: Müller Ingo <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Monday, August 9, 2021 at 10:05 AM
> To: "[email protected]" <[email protected]>
> Subject: Increasing degree of parallelism when reading Parquet files
> 
> 
> Dear AsterixDB devs,
> 
> I am currently trying out the new support for Parquet files on S3 (still
> in the context of my High-energy Physics use case [1]). This works great
> so far and has generally decent performance. However, I realized that it
> does not use more than 16 cores, even though 96 logical cores are
> available and even though I run long-running queries (several minutes) on
> large data sets with a large number of files (I tried 128 files of 17 GB
> each). Is this an arbitrary/artificial limitation that can be changed
> somehow (potentially with a small patch + recompiling), or is there more
> serious development required to lift it? FYI, I am currently using
> 03fd6d0f, which should include all S3/Parquet commits on master.
> 
> Cheers,
> 
> Ingo
> 
> [1] https://arxiv.org/abs/2104.12615