Thanks, Andries.

 

Is there no way to utilize all nodes in a cluster if we have a single large file?

Will splitting a single file into multiple files help utilize all nodes in
the cluster?

 

What if our use case requires querying a single csv/tsv file larger than
1 GB?

 

Even with a parquet file of size 1 GB, a select query with limit 100 takes more
than 20 minutes.

 

 

Regards

Chetan

 

 

 

 

-----Original Message-----
From: Andries Engelbrecht [mailto:aengelbre...@mapr.com] 
Sent: Thursday, February 23, 2017 8:51 PM
To: user@drill.apache.org
Subject: Re: Distribution of workload across nodes in a cluster

 

Last I checked, csv data is read with a single thread per file. To make
matters more challenging, Drill will typically scan the whole file (and in the
case of a select * you are requesting a full scan of the data anyway).

 

 

Try to split the file into several smaller files (128 MB, 256 MB, or smaller,
depending on your requirements). Also consider migrating the data locally to
your Drill cluster, or use parquet. In some use cases you may read data remotely
and then write it locally for repeated access; in that case, try to split the
file into smaller files on the remote cluster and write locally in parquet.
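
To illustrate the splitting suggestion above, here is a rough Python sketch, not a Drill utility. The function name split_csv, the part-NNNNN.csv naming, and the 128 MB default are illustrative assumptions. It cuts only at line ends, so it assumes no quoted field contains an embedded newline, and it repeats the header line in every chunk, which may or may not suit how your text format is configured.

```python
import os


def split_csv(src_path, out_dir, max_bytes=128 * 1024 * 1024, has_header=True):
    """Split src_path into chunk files of roughly max_bytes each.

    Hypothetical helper: cuts only at line boundaries and, when has_header
    is True, copies the first line of the source into every chunk.
    Returns the list of chunk paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    written = 0
    with open(src_path, "rb") as src:
        header = src.readline() if has_header else b""
        for line in src:
            # Start a new chunk when none is open or this line would overflow it.
            if out is None or written + len(line) > max_bytes:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, "part-%05d.csv" % len(paths))
                out = open(path, "wb")
                paths.append(path)
                out.write(header)
                written = len(header)
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return paths
```

For example, split_csv("big.csv", "big_split") would write the chunks into a big_split directory (both names are placeholders), and the query would then point at that directory rather than at the single file.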

 

 

--Andries

 

________________________________

From: PROJJWAL SAHA <proj.s...@gmail.com>

Sent: Wednesday, February 22, 2017 11:31:27 PM

To: user@drill.apache.org

Subject: Distribution of workload across nodes in a cluster

 

Hello,

 

I am running a select * query on a 1 GB csv file with a 5-node Drill cluster.
The csv file is stored in another storage cluster within the enterprise.

 

In the query profile, I see one major fragment and within the major fragment, I 
see only 1 minor fragment. The hostname for the minor fragment corresponds to 
one of the nodes of the cluster.

 

I therefore think that not all of the cluster's resources are being utilized.

Are there any configuration parameters that can be tweaked to achieve more
effective workload distribution across the cluster machines?

 

Let me know what you think.

 

Regards,

Projjwal

 
