Re: Varying Execution Times For The Same Query On The Same File

George Chow Fri, 16 Jan 2015 18:28:31 -0800

Hi Steven,

But a JSON file residing on HDFS is nonetheless split across datanode
boundaries.


Are you saying that Drill will serialize one file to one DrillBit?
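
For reference, the block-count comparison I suggested earlier in the thread could be sketched roughly as below. This is only an illustration: the function name and the 1 GiB / 128 MiB figures are hypothetical, not measurements from this cluster. The result is an upper bound on the number of nodes holding a piece of the file, and it says nothing about how many threads Drill uses to read it.

```python
def hdfs_block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Return how many blocks a file of the given size occupies.

    Hypothetical helper for illustration; both sizes are in bytes.
    """
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    # Ceiling division using integers only, so large sizes stay exact.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


# Assumed example values: a 1 GiB file stored with a 128 MiB block size.
one_gib = 1024 ** 3
block_128_mib = 128 * 1024 ** 2
print(hdfs_block_count(one_gib, block_128_mib))
```

The actual file size and the configured block size would have to be read from the file system itself before plugging them in.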

George

On Fri, Jan 16, 2015 at 4:50 PM, Steven Phillips <[email protected]>
wrote:

> json files are not splittable. There will be exactly one thread reading the
> file, regardless of how big it is.
>
> On Fri, Jan 16, 2015 at 4:15 PM, George Chow <[email protected]> wrote:
>
> > It should be possible to compare your HDFS block size with your file size
> > to determine how many blocks (and hence nodes) the file spans.
> >
> > Is my understanding sound?
> >
> > George
> >
> >
> > On Fri, Jan 16, 2015 at 11:52 AM, Ted Dunning <[email protected]>
> > wrote:
> >
> > > If you do want to have more parallelism, use several input files.
> > >
> > >
> > > On Fri, Jan 16, 2015 at 9:13 AM, Jason Altekruse <[email protected]> wrote:
> > >
> > > > I do not think we currently consider JSON files splittable. If we do
> > > > treat them as such, it would depend on the file size and the read
> > > > locality available on the nodes. Especially with a select * (or a
> > > > count(*)) query there is nothing to parallelize except for the read
> > > > operation and a simple aggregation. Spreading a small read throughout the
> > > > cluster would only guarantee that some of the reads would happen over the
> > > > wire, only to have the final aggregation sent later to the query's head
> > > > node.
> > > >
> > > > On Fri, Jan 16, 2015 at 3:19 AM, mufy <[email protected]> wrote:
> > > >
> > > > > And what would be the best way of ensuring that all the drill-bit nodes
> > > > > participate in the query execution?
> > > > >
> > > > >
> > > > > ---
> > > > > Mufeed Usman
> > > > > My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400> | My
> > > > > Social Cause <http://www.vision2016.org.in/> | My Blogs : LiveJournal
> > > > > <http://mufeed.livejournal.com>
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Jan 16, 2015 at 4:45 PM, Steven Phillips <[email protected]> wrote:
> > > > >
> > > > > > I would guess that for the first run, data had to be read off disk,
> > > > > > plus runtime code had to be compiled. Subsequent runs did not need to
> > > > > > do this, since the data should then be in cache, as well as the
> > > > > > compiled classes, so the subsequent runs are noticeably faster. Runs
> > > > > > 1 - 4 have a range of about 1.5 seconds, which seems like an
> > > > > > unremarkable amount of noise.
> > > > > >
> > > > > > On Fri, Jan 16, 2015 at 3:07 AM, mufy <[email protected]> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I was curious to know the possible reason(s) behind the difference in
> > > > > > > timings observed as shown below:
> > > > > > >
> > > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > > +------------+
> > > > > > > |   EXPR$0   |
> > > > > > > +------------+
> > > > > > > | 1125458    |
> > > > > > > +------------+
> > > > > > > 1 row selected (15.214 seconds)
> > > > > > >
> > > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > > +------------+
> > > > > > > |   EXPR$0   |
> > > > > > > +------------+
> > > > > > > | 1125458    |
> > > > > > > +------------+
> > > > > > > 1 row selected (12.717 seconds)
> > > > > > >
> > > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > > +------------+
> > > > > > > |   EXPR$0   |
> > > > > > > +------------+
> > > > > > > | 1125458    |
> > > > > > > +------------+
> > > > > > > 1 row selected (11.833 seconds)
> > > > > > >
> > > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > > +------------+
> > > > > > > |   EXPR$0   |
> > > > > > > +------------+
> > > > > > > | 1125458    |
> > > > > > > +------------+
> > > > > > > 1 row selected (13.298 seconds)
> > > > > > >
> > > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > > +------------+
> > > > > > > |   EXPR$0   |
> > > > > > > +------------+
> > > > > > > | 1125458    |
> > > > > > > +------------+
> > > > > > > 1 row selected (12.749 seconds)
> > > > > > >
> > > > > > > This was run using MapR Drill 0.7.0 on a 5 node MapR cluster.
> > > > > > >
> > > > > > >
> > > > > > > ---
> > > > > > > Mufeed Usman
> > > > > > > My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400> | My
> > > > > > > Social Cause <http://www.vision2016.org.in/> | My Blogs : LiveJournal
> > > > > > > <http://mufeed.livejournal.com>
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > >  Steven Phillips
> > > > > >  Software Engineer
> > > > > >
> > > > > >  mapr.com
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > --
> > "Not everything that can be counted counts, and not everything that counts
> > can be counted." Albert Einstein
> >
>
>
>
> --
>  Steven Phillips
>  Software Engineer
>
>  mapr.com
>



-- 
--
"Not everything that can be counted counts, and not everything that counts
can be counted." Albert Einstein
