You might want to enable metadata caching and see if it helps. https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
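Enabling that cache amounts to running `REFRESH TABLE METADATA` once against the directory tree, after which planning reads the generated cache file instead of opening every Parquet footer. A minimal sketch, assuming the `s3` storage plugin and the path from the query below:

```sql
-- Build (or rebuild) the Parquet metadata cache for the whole tree.
-- Planning for queries under this path can then read the cache file
-- instead of fetching each file's footer from S3 individually.
REFRESH TABLE METADATA s3.`tables/stats/iad`;
```

After the first manual run, Drill should detect stale cache files at planning time and refresh them, so this is mostly a one-time cost.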
On Thu, Jun 23, 2016 at 1:36 PM, Tanmay Solanki <tsolank...@yahoo.in.invalid> wrote:

> Below is the plan. The table is ~213,000 Parquet files.
>
> 0: jdbc:drill:> explain plan for select count(*) from
> s3.`tables/stats/iad/201604*/`;
> +------+------+
> | text | json |
> +------+------+
> | 00-00 Screen
> 00-01   Project(EXPR$0=[$0])
> 00-02     Project(EXPR$0=[$0])
> 00-03       Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@7cb1a0e8[columns = null, isStarQuery = false, isSkipQuery = false]])
> | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "DirectGroupScan",
>     "@id" : 3,
>     "cost" : 20.0
>   }, {
>     "pop" : "project",
>     "@id" : 2,
>     "exprs" : [ {
>       "ref" : "`EXPR$0`",
>       "expr" : "`count`"
>     } ],
>     "child" : 3,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 20.0
>   }, {
>     "pop" : "project",
>     "@id" : 1,
>     "exprs" : [ {
>       "ref" : "`EXPR$0`",
>       "expr" : "`EXPR$0`"
>     } ],
>     "child" : 2,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 20.0
>   }, {
>     "pop" : "screen",
>     "@id" : 0,
>     "child" : 1,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 20.0
>   } ]
> } |
> +------+------+
> 1 row selected (7493.869 seconds)
>
> Additionally, here is the drillbit.log for this query:
>
> 2016-06-23 18:25:16,417 [2893d673-3dad-dd21-d5e6-8ef28e0f81c9:foreman] INFO o.a.drill.exec.work.foreman.Foreman - Query text for query id 2893d673-3dad-dd21-d5e6-8ef28e0f81c9: explain plan for select count(*) from s3.`tables/stats/iad/201604*/`
> 2016-06-23 20:29:45,446 [2893d673-3dad-dd21-d5e6-8ef28e0f81c9:foreman] INFO o.a.d.exec.store.parquet.Metadata - Fetch parquet metadata: Executed 218474 out of 218474 using 16 threads. Time: 3474817ms total, 254.452884ms avg, 50344ms max.
> 2016-06-23 20:29:45,446 [2893d673-3dad-dd21-d5e6-8ef28e0f81c9:foreman] INFO o.a.d.exec.store.parquet.Metadata - Fetch parquet metadata: Executed 218474 out of 218474 using 16 threads. Earliest start: 431.101000 µs, Latest start: 3474340355.187000 µs, Average start: 1753982685.665761 µs.
> 2016-06-23 20:30:10,211 [2893d673-3dad-dd21-d5e6-8ef28e0f81c9:frag:0:0] INFO o.a.d.e.w.fragment.FragmentExecutor - 2893d673-3dad-dd21-d5e6-8ef28e0f81c9:0:0: State change requested AWAITING_ALLOCATION --> RUNNING
> 2016-06-23 20:30:10,211 [2893d673-3dad-dd21-d5e6-8ef28e0f81c9:frag:0:0] INFO o.a.d.e.w.f.FragmentStatusReporter - 2893d673-3dad-dd21-d5e6-8ef28e0f81c9:0:0: State to report: RUNNING
> 2016-06-23 20:30:10,226 [2893d673-3dad-dd21-d5e6-8ef28e0f81c9:frag:0:0] INFO o.a.d.e.w.fragment.FragmentExecutor - 2893d673-3dad-dd21-d5e6-8ef28e0f81c9:0:0: State change requested RUNNING --> FINISHED
> 2016-06-23 20:30:10,226 [2893d673-3dad-dd21-d5e6-8ef28e0f81c9:frag:0:0] INFO o.a.d.e.w.f.FragmentStatusReporter - 2893d673-3dad-dd21-d5e6-8ef28e0f81c9:0:0: State to report: FINISHED
>
> On Thursday, 23 June 2016 11:22 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Also, how many files? What format?
> >
> > Being so slow is an anomaly.
> >
> > On Thu, Jun 23, 2016 at 11:15 AM, Khurram Faraaz <kfar...@maprtech.com> wrote:
> >
> > > Can you please share the query plan for that long-running query here?
> > >
> > > On Thu, Jun 23, 2016 at 11:40 PM, Tanmay Solanki <tsolank...@yahoo.in.invalid> wrote:
> > >
> > > > I am trying to run a query on Apache Drill that simply counts the number of rows in a table stored in Parquet format on S3. I am running this on a 20-node r3.8xlarge EC2 cluster, with direct memory set to 80 GB, heap memory set to 32 GB, and planner.memory.max_memory_per_node set to a very high value. However, counting the rows takes Drill around 7662 seconds, or about 2 hours, on a 9.93 TB dataset of 56 billion rows and 174 columns. From the logs and the web console it seems that query planning itself takes nearly 99% of that time, while the actual execution takes almost none. I ran the same query on PrestoDB on a similar setup (20 nodes, r3.8xlarge) and it completed in 137 seconds, just over 2 minutes. Is there something wrong with my Drill configuration, or is this the expected behaviour for Drill?
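For what it's worth, the metadata-fetch log line above is consistent with planning dominating the query: 218,474 files at ~254 ms each across 16 threads works out to roughly the 3,474,817 ms the log reports. A quick back-of-the-envelope check, with the values copied from the log:

```python
# Values taken from the drillbit.log lines quoted above.
files = 218474        # Parquet files whose footers were read
avg_ms = 254.452884   # average metadata-fetch time per file
threads = 16          # reader threads reported by Drill
query_s = 7493.869    # end-to-end time of the explain-plan query

# With 16 threads in parallel, the expected wall-clock time for the fetch:
wall_s = files * avg_ms / threads / 1000
print(f"estimated metadata fetch wall time: {wall_s:.0f} s")  # ~3474 s

# Share of the total query time spent just reading footers:
print(f"share of total query time: {wall_s / query_s:.0%}")   # ~46%
```

So footer reading alone accounts for close to an hour of the two-hour run, which is exactly the cost the metadata cache is designed to avoid.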