If you do want more parallelism, split the data into several input files.
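
As a rough sketch of what I mean by several input files: the script below deals the records of one large file out to a handful of smaller ones. It assumes the input holds one JSON record per line, which I have not checked for this dataset. The function name, the output directory, and the part count are only illustrative.

```python
import os


def split_json_lines(src_path, out_dir, parts):
    """Deal the records of a line-delimited JSON file out to `parts` files.

    Assumption: every non-blank line of src_path is one complete JSON record.
    Returns the list of output paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(src_path))[0]
    out_paths = [
        os.path.join(out_dir, "%s_%02d.json" % (base, i)) for i in range(parts)
    ]
    outs = [open(p, "w", encoding="utf-8") for p in out_paths]
    try:
        kept = 0
        with open(src_path, encoding="utf-8") as src:
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                if not line.endswith("\n"):
                    line += "\n"  # last record may lack a trailing newline
                # Round-robin keeps the output files close to equal in size.
                outs[kept % parts].write(line)
                kept += 1
    finally:
        for f in outs:
            f.close()
    return out_paths
```

If the parts are written to a directory under dfs.tmp (say a hypothetical yelp_review_parts), the same count(*) can be pointed at the directory instead of the single file. Whether that actually pulls in every drillbit still depends on file sizes and locality, as Jason says below.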

On Fri, Jan 16, 2015 at 9:13 AM, Jason Altekruse <[email protected]> wrote:

> I do not think we currently consider JSON files splittable. If we do treat
> them as such, it would depend on the file size and the available read
> locality available on the nodes. Especially with a select * (or a count(*))
> query there is nothing to parallelize except for the read operation and a
> simple aggregation. Spreading a small read throughout the cluster would
> only guarantee that some of the reads would happen over the wire, only to
> have the final aggregation to be sent later to the query's head node.
>
> On Fri, Jan 16, 2015 at 3:19 AM, mufy <[email protected]> wrote:
>
> > And what would be the best way of ensuring that all the drill-bit nodes
> > participated in the query execution?
> >
> > ---
> > Mufeed Usman
> > My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400> |
> > My Social Cause <http://www.vision2016.org.in/> |
> > My Blogs : LiveJournal <http://mufeed.livejournal.com>
> >
> > On Fri, Jan 16, 2015 at 4:45 PM, Steven Phillips <[email protected]>
> > wrote:
> >
> > > I would guess that for the first run, data had to be read off disk, plus
> > > code runtime code had to be compiled. Subsequent runs did not need to do
> > > this, since the data should then be in cache, as well as the compiled
> > > classes, so the subsequent runs are noticeably faster. Runs 1 - 4 have a
> > > range of about 1.5 seconds, which seems like an unremarkable amount of
> > > noise.
> > >
> > > On Fri, Jan 16, 2015 at 3:07 AM, mufy <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I was curious to know the possible reason(s) behind the difference in
> > > > timings observed as shown below:
> > > >
> > > > 0: jdbc:drill:zk=> select count(*) from dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > +------------+
> > > > | EXPR$0     |
> > > > +------------+
> > > > | 1125458    |
> > > > +------------+
> > > > 1 row selected (15.214 seconds)
> > > >
> > > > 0: jdbc:drill:zk=> select count(*) from dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > +------------+
> > > > | EXPR$0     |
> > > > +------------+
> > > > | 1125458    |
> > > > +------------+
> > > > 1 row selected (12.717 seconds)
> > > >
> > > > 0: jdbc:drill:zk=> select count(*) from dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > +------------+
> > > > | EXPR$0     |
> > > > +------------+
> > > > | 1125458    |
> > > > +------------+
> > > > 1 row selected (11.833 seconds)
> > > >
> > > > 0: jdbc:drill:zk=> select count(*) from dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > +------------+
> > > > | EXPR$0     |
> > > > +------------+
> > > > | 1125458    |
> > > > +------------+
> > > > 1 row selected (13.298 seconds)
> > > >
> > > > 0: jdbc:drill:zk=> select count(*) from dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > +------------+
> > > > | EXPR$0     |
> > > > +------------+
> > > > | 1125458    |
> > > > +------------+
> > > > 1 row selected (12.749 seconds)
> > > >
> > > > This was run using MapR Drill 0.7.0 on a 5 node MapR cluster.
> > > >
> > > > ---
> > > > Mufeed Usman
> > > > My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400> |
> > > > My Social Cause <http://www.vision2016.org.in/> |
> > > > My Blogs : LiveJournal <http://mufeed.livejournal.com>
> > >
> > > --
> > > Steven Phillips
> > > Software Engineer
> > >
> > > mapr.com
