Performance tuning for TPC-H Q1 on a three nodes cluster

Yijie Shen, Sun, 22 May 2016 09:07:06 -0700

Hi all,

I've been trying out Drill built from the master branch lately and have
deployed a cluster on three physical servers.


The input data, `lineitem`, is in Parquet format: 150GB in total, split into
1516 files of about 101MB each.

Each server has two Intel(R) Xeon(R) E5645 CPUs @ 2.40GHz (24 cores in
total) and 32GB of memory.

While executing Q1 using:

SELECT
  L_RETURNFLAG,
  L_LINESTATUS,
  SUM(L_QUANTITY),
  SUM(L_EXTENDEDPRICE),
  SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)),
  SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)),
  AVG(L_QUANTITY),
  AVG(L_EXTENDEDPRICE),
  AVG(L_DISCOUNT),
  COUNT(1)
FROM
  dfs.tpch.`lineitem`
WHERE
  L_SHIPDATE <= '1998-09-02'
GROUP BY L_RETURNFLAG, L_LINESTATUS
ORDER BY L_RETURNFLAG, L_LINESTATUS

I noticed that the parallelism was 51 (planner.width.max_per_node = 17) in my
case for Major Fragment 03 (Scan, Filter, Project, HashAgg and Project), and
that each minor fragment lasts about 8 to 9 minutes. One record, for example:

03-00-xx hw080 7.309s 42.358s 9m35s 118,758,489 14,540 22:31:32 22:31:32 33MB FINISHED
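
As a rough sanity check on these figures, the per-fragment throughput can be estimated with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes that 118,758,489 is the number of records handled by the sample minor fragment, that 9m35s is its runtime, and that the 1516 files are spread evenly over the 51 minor fragments. The function name `fragment_stats` is purely illustrative.

```python
# Figures quoted in the message above.
NODES = 3
WIDTH_PER_NODE = 17        # planner.width.max_per_node
FILES = 1516
FILE_SIZE_MB = 101
RECORDS = 118_758_489      # assumed: records handled by the sample minor fragment
RUNTIME_S = 9 * 60 + 35    # assumed: the 9m35s column is the fragment runtime


def fragment_stats():
    """Estimate the work done by one minor fragment of Major Fragment 03."""
    parallelism = NODES * WIDTH_PER_NODE
    # Assumption: files are distributed evenly across minor fragments.
    files_per_fragment = FILES / parallelism
    mb_per_fragment = files_per_fragment * FILE_SIZE_MB
    return {
        "parallelism": parallelism,
        "files_per_fragment": files_per_fragment,
        "records_per_second": RECORDS / RUNTIME_S,
        "mb_per_second": mb_per_fragment / RUNTIME_S,
    }


if __name__ == "__main__":
    for name, value in fragment_stats().items():
        print(f"{name}: {value:,.1f}")
```

Under those assumptions each minor fragment reads roughly 30 files and processes on the order of 200,000 records per second, which is the figure I would like to compare against what others see.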

Is this a normal speed for Drill on my current cluster (more than 10
minutes)? Did I miss an important configuration setting that would speed up
execution?

Thanks very much!

Yijie
