Thanks, Nitin, for the metrics you provided and the suggestions.

On Tue, Feb 21, 2017 at 2:23 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
> Instead of doing select * in the first go,
> can you run a query like select count(1)?
>
> When your data is in CSV files then yes, all the data is transferred to the
> Drill node and then the query is executed on top of it.
> We had noticed that query time on CSV was significantly higher than on
> Parquet files, so we moved our data from CSV to Parquet and have not seen
> any issues since then.
>
> We did a test run on 125M records; the size was 8 GB in Parquet and it took
> roughly 30 seconds or so.
>
> I would suggest two things:
> 1) In which AWS region is your S3 bucket hosted, and in which region are
> your EC2 servers hosted?
> 2) If the answer to the above question is two different regions, then you
> might want to move them into a single region.
>
> In either case, from the AWS console you can figure out how much network
> throughput you are getting, if that is the bottleneck.
> Also, Drill machines need CPU, so along with 32GB memory, 8 cores
> would be desirable.
>
> On Tue, Feb 21, 2017 at 2:17 PM, PROJJWAL SAHA <proj.s...@gmail.com>
> wrote:
>
> > Hi Nitin,
> >
> > I am executing the SQL query on a drillbit node using drill-conf.
> > We have configured a 5 node drill cluster external to Amazon with 32GB
> > RAM. From one of the nodes, we are using the drill-conf utility to fire
> > the SQL query.
> >
> > One observation I had is
> > select * from `xxx.tsv`
> > select * from `xxx.tsv` where yyy = 'zzz'
> >
> > Both these queries take almost the same time for 1 GB data with
> > 1000000 rows. So if the network data transfer is the major time-consuming
> > component compared with the query execution time, I think that the entire
> > data is first transferred to the drill cluster and then the query is
> > executed on the drill cluster?
> >
> > Regards,
> > Projjwal
> >
> > On Mon, Feb 20, 2017 at 6:18 PM, Nitin Pawar <nitinpawar...@gmail.com>
> > wrote:
> >
> > > How are you doing select * .. using drill UI or sqlline?
> > > Where are you running it from?
> > > Is the drill hosted in AWS or on your local machine?
> > >
> > > I think the majority of the time is spent on displaying the result set
> > > instead of querying the file if the drill server is on AWS.
> > > If the drill server is local then it might be your network which might
> > > take a lot of time, based on the S3 bucket location and where your
> > > drill server is.
> > >
> > > On Mon, Feb 20, 2017 at 5:37 PM, PROJJWAL SAHA <proj.s...@gmail.com>
> > > wrote:
> > >
> > > > Hello all,
> > > >
> > > > I am using 1GB data in the form of .tsv file, stored in Amazon S3,
> > > > using Drill 1.8. I am using the default configuration of Drill with
> > > > the S3 storage plugin coming out of the box. The drillbits are
> > > > configured on a 5 node cluster with 32GB RAM and 4 vCPUs.
> > > >
> > > > I see that a select * from xxx; query takes 23 mins to fetch
> > > > 1,040,000 rows.
> > > >
> > > > Is this the expected behaviour?
> > > > I am looking for any quick tuning that can improve the performance,
> > > > or any other suggestions.
> > > >
> > > > Attached is the JSON profile for this query.
> > > >
> > > > Regards,
> > > > Projjwal
> > >
> > > --
> > > Nitin Pawar
>
> --
> Nitin Pawar
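
As a rough sanity check on the numbers reported in this thread, the sketch below works out the average end-to-end rate implied by fetching a 1 GB TSV file in 23 minutes. It assumes 1 GB = 1024 MB and that the whole file is read exactly once; the function name is hypothetical and only for illustration. The result, roughly 0.74 MB/s, is far below what a same-region S3 to EC2 link would normally deliver, which is consistent with the suggestion above to check network throughput and result-set display time first.

```python
def effective_throughput_mb_s(size_gb: float, elapsed_s: float) -> float:
    """Average end-to-end rate in MB/s.

    Assumptions (not from the thread): 1 GB = 1024 MB, and the whole
    file is read exactly once during the query.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return size_gb * 1024 / elapsed_s


# Figures from the original question: 1 GB TSV file, 23 minutes.
tsv_rate = effective_throughput_mb_s(1, 23 * 60)
print(f"TSV over S3: {tsv_rate:.2f} MB/s")
```

The same function can be applied to any other (size, time) pair measured on the cluster, for example after moving the bucket and the servers into one region, to see whether the rate changes.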