Hi Steven,

But a JSON file residing on HDFS is nonetheless split across datanode
boundaries. Are you saying that Drill will serialize one file to one
DrillBit?

George

On Fri, Jan 16, 2015 at 4:50 PM, Steven Phillips <[email protected]> wrote:

> json files are not splittable. There will be exactly one thread reading
> the file, regardless of how big it is.
>
> On Fri, Jan 16, 2015 at 4:15 PM, George Chow <[email protected]> wrote:
>
> > It should be possible to compare your HDFS block size with your file
> > size to determine how many blocks (and hence nodes) the file spans.
> >
> > Is my understanding sound?
> >
> > George
> >
> > On Fri, Jan 16, 2015 at 11:52 AM, Ted Dunning <[email protected]> wrote:
> >
> > > If you do want to have more parallelism, use several input files.
> > >
> > > On Fri, Jan 16, 2015 at 9:13 AM, Jason Altekruse <[email protected]> wrote:
> > >
> > > > I do not think we currently consider JSON files splittable. If we
> > > > do treat them as such, it would depend on the file size and the
> > > > available read locality available on the nodes. Especially with a
> > > > select * (or a count(*)) query there is nothing to parallelize
> > > > except for the read operation and a simple aggregation. Spreading
> > > > a small read throughout the cluster would only guarantee that some
> > > > of the reads would happen over the wire, only to have the final
> > > > aggregation to be sent later to the query's head node.
> > > >
> > > > On Fri, Jan 16, 2015 at 3:19 AM, mufy <[email protected]> wrote:
> > > >
> > > > > And what would be the best way of ensuring that all the
> > > > > drill-bit nodes participated in the query execution?
> > > > >
> > > > > ---
> > > > > Mufeed Usman
> > > > > My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400>
> > > > > | My Social Cause <http://www.vision2016.org.in/>
> > > > > | My Blogs : LiveJournal <http://mufeed.livejournal.com>
> > > > >
> > > > > On Fri, Jan 16, 2015 at 4:45 PM, Steven Phillips <[email protected]> wrote:
> > > > >
> > > > > > I would guess that for the first run, data had to be read off
> > > > > > disk, plus code runtime code had to be compiled. Subsequent
> > > > > > runs did not need to do this, since the data should then be in
> > > > > > cache, as well as the compiled classes, so the subsequent runs
> > > > > > are noticeably faster. Runs 1 - 4 have a range of about 1.5
> > > > > > seconds, which seems like an unremarkable amount of noise.
> > > > > >
> > > > > > On Fri, Jan 16, 2015 at 3:07 AM, mufy <[email protected]> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I was curious to know the possible reason(s) behind the
> > > > > > > difference in timings observed as shown below:
> > > > > > >
> > > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > > +------------+
> > > > > > > |   EXPR$0   |
> > > > > > > +------------+
> > > > > > > | 1125458    |
> > > > > > > +------------+
> > > > > > > 1 row selected (15.214 seconds)
> > > > > > >
> > > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > > +------------+
> > > > > > > |   EXPR$0   |
> > > > > > > +------------+
> > > > > > > | 1125458    |
> > > > > > > +------------+
> > > > > > > 1 row selected (12.717 seconds)
> > > > > > >
> > > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > > +------------+
> > > > > > > |   EXPR$0   |
> > > > > > > +------------+
> > > > > > > | 1125458    |
> > > > > > > +------------+
> > > > > > > 1 row selected (11.833 seconds)
> > > > > > >
> > > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > > +------------+
> > > > > > > |   EXPR$0   |
> > > > > > > +------------+
> > > > > > > | 1125458    |
> > > > > > > +------------+
> > > > > > > 1 row selected (13.298 seconds)
> > > > > > >
> > > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > > +------------+
> > > > > > > |   EXPR$0   |
> > > > > > > +------------+
> > > > > > > | 1125458    |
> > > > > > > +------------+
> > > > > > > 1 row selected (12.749 seconds)
> > > > > > >
> > > > > > > This was run using MapR Drill 0.7.0 on a 5 node MapR cluster.
> > > > > > >
> > > > > > > ---
> > > > > > > Mufeed Usman
> > > > > > > My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400>
> > > > > > > | My Social Cause <http://www.vision2016.org.in/>
> > > > > > > | My Blogs : LiveJournal <http://mufeed.livejournal.com>
> > > > > >
> > > > > > --
> > > > > > Steven Phillips
> > > > > > Software Engineer
> > > > > >
> > > > > > mapr.com
> >
> > --
> > --
> > "Not everything that can be counted counts, and not everything that
> > counts can be counted." Albert Einstein
>
> --
> Steven Phillips
> Software Engineer
>
> mapr.com

--
--
"Not everything that can be counted counts, and not everything that counts
can be counted." Albert Einstein
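
Two ideas from this thread can be put into a few lines of Python: comparing file size with HDFS block size to see how many blocks a file spans, and splitting one input into several files to get more parallelism. This is only a rough sketch, not Drill or HDFS code. The function names `blocks_spanned` and `partition_lines` are hypothetical, the 128 MB block size in the usage lines is just an example value, and the splitting part assumes the input is newline-delimited JSON (one record per line).

```python
def blocks_spanned(file_size_bytes, block_size_bytes):
    """Return how many fixed-size blocks a file of the given size occupies."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    # Integer ceiling division: a partially filled last block still counts.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def partition_lines(lines, parts):
    """Distribute records round-robin into `parts` lists (one per output file)."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    buckets = [[] for _ in range(parts)]
    for index, line in enumerate(lines):
        buckets[index % parts].append(line)
    return buckets


# Example values only: a 300 MB file with a 128 MB block size.
example_blocks = blocks_spanned(300 * 1024**2, 128 * 1024**2)

# Example values only: five records spread over two output files.
example_parts = partition_lines(["r1", "r2", "r3", "r4", "r5"], 2)
```

As noted above in the thread, the number of blocks a JSON file spans does not by itself give parallel reads, since the file is read by a single thread. Writing each bucket to its own file is one way to follow the "several input files" suggestion; whether every Drillbit then takes part is still up to the planner.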
