Re: Performance tuning for TPC-H Q1 on a three nodes cluster

2016-05-25 Thread Dechang Gu
On Tue, May 24, 2016 at 7:07 PM, Yijie Shen wrote:
> Hi Dechang,
>
> Thanks very much for your help!
>
> I get a little confused here, why does skew exist?
>
> After some statistic work, I got this: 1516 files and 102.54MB on average,
> max of 104MB, min of 95MB.
> On
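The per-file statistics quoted above (file count, average, max, min) can be gathered with a short script. This is only a minimal sketch, not the method used in the thread: it assumes the Parquet files are reachable through a local or mounted path, and the function name `file_size_stats` and the `.parquet` suffix filter are hypothetical choices made for illustration.

```python
import os
import statistics


def file_size_stats(directory, suffix=".parquet"):
    """Summarize sizes (in MB) of files under `directory` ending in `suffix`.

    Returns a dict with count, mean, min and max, or None if no file matches.
    Assumption: the data is visible as ordinary files (for example through
    an NFS mount); a distributed file system client would need its own listing call.
    """
    sizes_mb = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(suffix):
                size_bytes = os.path.getsize(os.path.join(root, name))
                sizes_mb.append(size_bytes / (1024 * 1024))
    if not sizes_mb:
        return None
    return {
        "count": len(sizes_mb),
        "mean_mb": statistics.mean(sizes_mb),
        "min_mb": min(sizes_mb),
        "max_mb": max(sizes_mb),
    }
```

A narrow spread between min and max only shows that the files are similar in size. It says nothing about how they are placed across disks or nodes, which is a separate question when looking for skew.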

Re: Performance tuning for TPC-H Q1 on a three nodes cluster

2016-05-24 Thread Dechang Gu
Hi Yijie, Thanks for the profile. From the Operator Profile overview, it looks like 03-xx-02 HASH_AGGREGATE and 03-xx-06 PARQUET_ROW_GROUP_SCAN took most of the time:
03-xx-02 HASH_AGGREGATE 0.020s 0.083s 0.213s 1m06s 1m55s 3m12s 0.000s 0.000s 0.000s 16MB 16MB
03-xx-03

Re: Performance tuning for TPC-H Q1 on a three nodes cluster

2016-05-23 Thread Dechang Gu
Hi Yijie, This is Dechang at MapR. I work on Drill performance. From what you described, it looks like the scan took most of the time. How are the files distributed on the disks, and is there any skew? How many disks are there? If possible, can you provide the profile for the run? Thanks, Dechang On

Performance tuning for TPC-H Q1 on a three nodes cluster

2016-05-22 Thread Yijie Shen
Hi all, I've been trying out Drill on the master branch lately and have deployed a cluster on three physical servers. The input data `lineitem` is in Parquet format, 150GB in total: 1516 files of about 101MB each. The server has two Intel(R) Xeon(R) CPU E5645 @2.40GHz CPUs and 24 cores in