You might want to enable Parquet metadata caching and see if it helps:

 https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
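For reference, a sketch of how that might look before re-running the count. The path below is assumed from the query later in this thread; adjust it to your workspace. REFRESH TABLE METADATA is the Drill command documented at the link above:

```sql
-- Build (or rebuild) the Parquet metadata cache for the table directory.
-- Path is taken from the query in this thread; adjust to your workspace.
REFRESH TABLE METADATA s3.`tables/stats/iad`;

-- Subsequent planning can read the cached footer metadata instead of
-- opening every one of the ~213000 files individually.
SELECT COUNT(*) FROM s3.`tables/stats/iad/201604*/`;
```

The cache is written once and reused until the underlying data changes, so the cost of the footer scan is paid on the refresh rather than on every query plan.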

On Thu, Jun 23, 2016 at 1:36 PM, Tanmay Solanki <tsolank...@yahoo.in.invalid> wrote:

> Below is the plan. The amount of files is ~213000 files of parquet data.
>
> 0: jdbc:drill:> explain plan for select count(*) from
> s3.`tables/stats/iad/201604*/`;
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(EXPR$0=[$0])
> 00-02        Project(EXPR$0=[$0])
> 00-03         Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@7cb1a0e8[columns = null, isStarQuery = false, isSkipQuery = false]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "DirectGroupScan",
>     "@id" : 3,
>     "cost" : 20.0
>   }, {
>     "pop" : "project",
>     "@id" : 2,
>     "exprs" : [ {
>       "ref" : "`EXPR$0`",
>       "expr" : "`count`"
>     } ],
>     "child" : 3,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 20.0
>   }, {
>     "pop" : "project",
>     "@id" : 1,
>     "exprs" : [ {
>       "ref" : "`EXPR$0`",
>       "expr" : "`EXPR$0`"
>     } ],
>     "child" : 2,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 20.0
>   }, {
>     "pop" : "screen",
>     "@id" : 0,
>     "child" : 1,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 20.0
>   } ]
> } |
> +------+------+
> 1 row selected (7493.869 seconds)
> Additionally I have the drillbit.log for this query which I will post
> below:
> 2016-06-23 18:25:16,417 [2893d673-3dad-dd21-d5e6-8ef28e0f81c9:foreman]
> INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id
> 2893d673-3dad-dd21-d5e6-8ef28e0f81c9: explain plan for select count(*) from
> s3.`tables/stats/iad/201604*/`
> 2016-06-23 20:29:45,446 [2893d673-3dad-dd21-d5e6-8ef28e0f81c9:foreman]
> INFO  o.a.d.exec.store.parquet.Metadata - Fetch parquet metadata: Executed
> 218474 out of 218474 using 16 threads. Time: 3474817ms total, 254.452884ms
> avg, 50344ms max.
> 2016-06-23 20:29:45,446 [2893d673-3dad-dd21-d5e6-8ef28e0f81c9:foreman]
> INFO  o.a.d.exec.store.parquet.Metadata - Fetch parquet metadata: Executed
> 218474 out of 218474 using 16 threads. Earliest start: 431.101000 µs,
> Latest start: 3474340355.187000 µs, Average start: 1753982685.665761 µs.
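A quick sanity check on the log figures above shows the footer scan alone accounts for roughly 58 minutes of the planning time (this is arithmetic only, restated as a Drill expression using the numbers reported in the log):

```sql
-- 218474 footers at ~254.45 ms average, spread across 16 threads:
SELECT 218474 * 254.452884 / 16 / 1000 AS expected_seconds;
-- ≈ 3474 seconds (~58 minutes), consistent with the reported
-- "Time: 3474817ms total" for the metadata fetch.
```

So with ~218k files and no metadata cache, most of the two-hour query is Drill reading Parquet footers from S3 during planning, not executing the count.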
> 2016-06-23 20:30:10,211 [2893d673-3dad-dd21-d5e6-8ef28e0f81c9:frag:0:0]
> INFO  o.a.d.e.w.fragment.FragmentExecutor -
> 2893d673-3dad-dd21-d5e6-8ef28e0f81c9:0:0: State change requested
> AWAITING_ALLOCATION --> RUNNING
> 2016-06-23 20:30:10,211 [2893d673-3dad-dd21-d5e6-8ef28e0f81c9:frag:0:0]
> INFO  o.a.d.e.w.f.FragmentStatusReporter -
> 2893d673-3dad-dd21-d5e6-8ef28e0f81c9:0:0: State to report: RUNNING
> 2016-06-23 20:30:10,226 [2893d673-3dad-dd21-d5e6-8ef28e0f81c9:frag:0:0]
> INFO  o.a.d.e.w.fragment.FragmentExecutor -
> 2893d673-3dad-dd21-d5e6-8ef28e0f81c9:0:0: State change requested RUNNING
> --> FINISHED
> 2016-06-23 20:30:10,226 [2893d673-3dad-dd21-d5e6-8ef28e0f81c9:frag:0:0]
> INFO  o.a.d.e.w.f.FragmentStatusReporter -
> 2893d673-3dad-dd21-d5e6-8ef28e0f81c9:0:0: State to report: FINISHED
>
>     On Thursday, 23 June 2016 11:22 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> Also, how many files? What format?
>
> Being so slow is an anomaly.
>
> On Thu, Jun 23, 2016 at 11:15 AM, Khurram Faraaz <kfar...@maprtech.com>
> wrote:
>
> > Can you please share the query plan for that long running query here ?
> >
> > On Thu, Jun 23, 2016 at 11:40 PM, Tanmay Solanki <tsolank...@yahoo.in.invalid> wrote:
> >
> > > I am trying to run a query on Apache Drill that simply counts the
> > > number of rows in a table stored in Parquet format on S3. I am running
> > > this on a 20-node r3.8xlarge EC2 cluster with direct memory set to
> > > 80GB, heap memory set to 32GB, and planner.memory.max_memory_per_node
> > > set to a very high value. However, counting the rows in this table (a
> > > 9.93TB, 56-billion-row, 174-column dataset) takes Drill around 7662
> > > seconds, or roughly 2 hours. From the logs and the web console it
> > > seems that query planning alone takes nearly 99% of that time, while
> > > actual query execution takes almost no time. I ran the same query on
> > > PrestoDB on a similar setup (20-node r3.8xlarge) and it completed in
> > > 137 seconds, just over 2 minutes. Is there something wrong with my
> > > Drill configuration, or is this expected behaviour for Drill?
> > >
> >
>
