Re: Varying Execution Times For The Same Query On The Same File

Ted Dunning Fri, 16 Jan 2015 11:54:03 -0800

If you do want to have more parallelism, use several input files.
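
A minimal sketch of one way to get several input files, assuming the data is newline-delimited JSON (one record per line), so it can be cut at line boundaries. The function name, the `part-NNN.json` naming, and the output directory are hypothetical and not anything Drill provides:

```python
import os


def split_json_lines(src_path, out_dir, parts):
    """Split a newline-delimited JSON file into `parts` files, round-robin by line.

    Assumption: each line of src_path holds one complete JSON record.
    """
    os.makedirs(out_dir, exist_ok=True)
    # Hypothetical naming scheme: part-000.json, part-001.json, ...
    out_paths = [os.path.join(out_dir, "part-%03d.json" % i) for i in range(parts)]
    outs = [open(p, "w", encoding="utf-8") for p in out_paths]
    try:
        with open(src_path, encoding="utf-8") as src:
            for n, line in enumerate(src):
                if line.strip():  # skip blank lines
                    outs[n % parts].write(line)
    finally:
        for f in outs:
            f.close()
    return out_paths
```

The query would then point at the directory that holds the parts rather than at the single file. Whether every drillbit actually picks up a share of the scan still depends on the planner and on where the files end up, so treat this as something to try rather than a guarantee.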


On Fri, Jan 16, 2015 at 9:13 AM, Jason Altekruse <[email protected]>
wrote:

> I do not think we currently consider JSON files splittable. If we do treat
> them as such, it would depend on the file size and the read locality
> available on the nodes. Especially with a select * (or a count(*)) query,
> there is nothing to parallelize except for the read operation and a
> simple aggregation. Spreading a small read throughout the cluster would
> only guarantee that some of the reads happen over the wire, only to
> have the final aggregation sent later to the query's head node.
>
> On Fri, Jan 16, 2015 at 3:19 AM, mufy <[email protected]> wrote:
>
> > And what would be the best way of ensuring that all the drill-bit nodes
> > participated in the query execution?
> >
> >
> > ---
> > Mufeed Usman
> > My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400> | My
> > Social Cause <http://www.vision2016.org.in/> | My Blogs : LiveJournal
> > <http://mufeed.livejournal.com>
> >
> >
> >
> >
> > On Fri, Jan 16, 2015 at 4:45 PM, Steven Phillips <[email protected]>
> > wrote:
> >
> > > I would guess that for the first run, data had to be read off disk, plus
> > > runtime code had to be compiled. Subsequent runs did not need to do
> > > this, since the data should then be in cache, as well as the compiled
> > > classes, so the subsequent runs are noticeably faster. Runs 2 - 5 have a
> > > range of about 1.5 seconds, which seems like an unremarkable amount of
> > > noise.
> > >
> > > On Fri, Jan 16, 2015 at 3:07 AM, mufy <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I was curious to know the possible reason(s) behind the difference in
> > > > timings observed as shown below:
> > > >
> > > > 0: jdbc:drill:zk=> select count(*) from
> > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > +------------+
> > > > |   EXPR$0   |
> > > > +------------+
> > > > | 1125458    |
> > > > +------------+
> > > > 1 row selected (15.214 seconds)
> > > >
> > > > 0: jdbc:drill:zk=> select count(*) from
> > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > +------------+
> > > > |   EXPR$0   |
> > > > +------------+
> > > > | 1125458    |
> > > > +------------+
> > > > 1 row selected (12.717 seconds)
> > > >
> > > > 0: jdbc:drill:zk=> select count(*) from
> > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > +------------+
> > > > |   EXPR$0   |
> > > > +------------+
> > > > | 1125458    |
> > > > +------------+
> > > > 1 row selected (11.833 seconds)
> > > >
> > > > 0: jdbc:drill:zk=> select count(*) from
> > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > +------------+
> > > > |   EXPR$0   |
> > > > +------------+
> > > > | 1125458    |
> > > > +------------+
> > > > 1 row selected (13.298 seconds)
> > > >
> > > > 0: jdbc:drill:zk=> select count(*) from
> > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > +------------+
> > > > |   EXPR$0   |
> > > > +------------+
> > > > | 1125458    |
> > > > +------------+
> > > > 1 row selected (12.749 seconds)
> > > >
> > > > This was run using MapR Drill 0.7.0 on a 5 node MapR cluster.
> > > >
> > > >
> > > > ---
> > > > Mufeed Usman
> > > > My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400> | My
> > > > Social Cause <http://www.vision2016.org.in/> | My Blogs : LiveJournal
> > > > <http://mufeed.livejournal.com>
> > > >
> > >
> > >
> > >
> > > --
> > >  Steven Phillips
> > >  Software Engineer
> > >
> > >  mapr.com
> > >
> >
>
