It doesn't currently have plan caching, but a simple implementation probably wouldn't be that difficult (assuming you keep it node-level as opposed to cluster-level). We merged the auto shuffling per session, so let us know how that looks.
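To make the node-level idea concrete, here's a rough sketch of the shape I'd imagine (purely illustrative: the class name, the generic plan type, and the key scheme are made up, and it uses Guava for the cache; this isn't code that exists in Drill today):

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import java.util.concurrent.TimeUnit;

    // Hypothetical node-local plan cache. Each drillbit keeps its own instance,
    // so there is no cross-node invalidation to coordinate (the node-level
    // option mentioned above). P stands in for whatever the planner produces.
    public class NodeLocalPlanCache<P> {

      private final Cache<String, P> plans = CacheBuilder.newBuilder()
          .maximumSize(1_000)                       // bound memory on this node
          .expireAfterAccess(10, TimeUnit.MINUTES)  // let cold plans fall out
          .build();

      // Key on the query text plus a fingerprint of the session options that
      // affect planning, since the same SQL can plan differently under
      // different options.
      private static String key(String sql, String optionsFingerprint) {
        return optionsFingerprint + "|" + sql;
      }

      public P get(String sql, String optionsFingerprint) {
        return plans.getIfPresent(key(sql, optionsFingerprint));
      }

      public void put(String sql, String optionsFingerprint, P plan) {
        plans.put(key(sql, optionsFingerprint), plan);
      }
    }

The catch with keeping it node-level is that a cached plan only helps repeat queries whose foreman lands on the same drillbit, so how much it buys you depends on how the foremen get spread around.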
On Wed, Apr 15, 2015 at 4:35 PM, Adam Gilmore <[email protected]> wrote:
> The workload does involve a fair number of short queries. Although when I say short, I'm talking about querying 2-10 million record Parquet files, so they're not extremely short.
>
> Does Drill have plan caching built in at this stage? It might help us reduce some of that foreman overhead.
>
> On Tue, Apr 14, 2015 at 3:02 AM, Jacques Nadeau <[email protected]> wrote:
> > Yeah, it seems that way. We should get your patch merged. I just reviewed and lgtm.
> >
> > What type of workload are you running? Unless your workload is planning heavy (e.g. lots of short queries) or does a lot of sorts (the last merge is on the foreman node), work should be reasonably distributed.
> >
> > On Sun, Apr 12, 2015 at 10:29 PM, Adam Gilmore <[email protected]> wrote:
> > > Looks like this definitely is the following bug:
> > >
> > > https://issues.apache.org/jira/browse/DRILL-2512
> > >
> > > It's a pretty severe performance bottleneck having the foreman do so much work. In our environment, the foreman hits basically 95-100% CPU while the other drillbits barely do any work, which means it's nearly impossible for us to scale out.
> > >
> > > On Wed, Apr 8, 2015 at 3:58 PM, Adam Gilmore <[email protected]> wrote:
> > > > Anyone have any more thoughts on this? Anywhere I can start trying to troubleshoot?
> > > >
> > > > On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <[email protected]> wrote:
> > > > > So there are 5 Parquet files, each ~125 MB - not sure what I can provide re the block locations? I believe it's under the HDFS block size, so they should be stored contiguously.
> > > > >
> > > > > I've tried setting the affinity factor to various values (1, 0, etc.) but nothing seems to change that. It always prefers certain nodes.
> > > > >
> > > > > Moreover, we added a stack more nodes and it started picking very specific nodes as foremen (perhaps 2-3 nodes out of 20 were always picked as foremen). Therefore, the foremen were being swamped with CPU while the other nodes were doing very little work.
> > > > >
> > > > > On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <[email protected]> wrote:
> > > > > > Actually, I believe a query submitted through the REST interface will instantiate a DrillClient, which uses the same ZKClusterCoordinator that sqlline uses, and thus the foreman for the query is not necessarily on the same drillbit as the one it was submitted to. But I'm still not sure it's related to DRILL-2512.
> > > > > >
> > > > > > I'll wait for your additional info before speculating further.
> > > > > >
> > > > > > On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <[email protected]> wrote:
> > > > > > > We actually set up a separate load balancer for port 8047 (we're submitting these queries via the REST API at the moment), so Zookeeper etc. is out of the equation, and thus I doubt we're hitting DRILL-2512.
> > > > > > >
> > > > > > > When shutting down the "troublesome" drillbit, it starts parallelizing much more nicely again. We even added 10+ nodes to the cluster, and as long as that particular drillbit is shut down, it distributes very nicely. The minute we start the drillbit on that node again, it starts swamping it with work.
> > > > > > >
> > > > > > > I'll shoot through the JSON profiles and some more information on the dataset etc. later today (Australian time!).
> > > > > > >
> > > > > > > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <[email protected]> wrote:
> > > > > > > > I didn't notice at first that Adam said "no matter who the foreman is".
> > > > > > > >
> > > > > > > > Another suspicion I have is that our current logic for assigning work will assign to the exact same nodes every time we query a particular table. Changing the affinity factor may change it, but it will still be the same every time. That is my suspicion, but I am not sure why shutting down the drillbit would improve performance. I would expect that shutting down the drillbit would result in a different drillbit becoming the hotspot.
> > > > > > > >
> > > > > > > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <[email protected]> wrote:
> > > > > > > > > On Steven's point, the node that the client connects to is not currently randomized. Given your description of the behavior, I'm not sure whether you're hitting 2512 or just general undesirable distribution.
> > > > > > > > >
> > > > > > > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <[email protected]> wrote:
> > > > > > > > > > This is a known issue:
> > > > > > > > > >
> > > > > > > > > > https://issues.apache.org/jira/browse/DRILL-2512
> > > > > > > > > >
> > > > > > > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <[email protected]> wrote:
> > > > > > > > > > > What version of Drill are you running?
> > > > > > > > > > >
> > > > > > > > > > > Any hints when looking at the query profiles? Is the node that is being hammered the foreman for the queries, and are most of the major fragments tied to the foreman?
> > > > > > > > > > >
> > > > > > > > > > > —Andries
> > > > > > > > > > >
> > > > > > > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <[email protected]> wrote:
> > > > > > > > > > > > Hi guys,
> > > > > > > > > > > >
> > > > > > > > > > > > I'm trying to understand how this could be possible. I have a Hadoop cluster set up with a name node and two data nodes, all with identical specs in terms of CPU/RAM etc.
> > > > > > > > > > > >
> > > > > > > > > > > > The two data nodes have a replicated HDFS setup where I'm storing some Parquet files.
> > > > > > > > > > > >
> > > > > > > > > > > > A Drill cluster (with Zookeeper) is running with Drillbits on all three servers.
> > > > > > > > > > > >
> > > > > > > > > > > > When I submit a query to *any* of the Drillbits, no matter who the foreman is, one particular data node gets picked to do the vast majority of the work.
> > > > > > > > > > > >
> > > > > > > > > > > > We've even added three more task nodes to the cluster and everything still puts a huge load on one particular server.
> > > > > > > > > > > >
> > > > > > > > > > > > There is nothing unique about this data node. HDFS is fully replicated (no unreplicated blocks) to the other data node.
> > > > > > > > > > > >
> > > > > > > > > > > > I know that Drill tries to get data locality, so I'm wondering if this is the cause, but it's essentially swamping this data node with 100% CPU usage while leaving the others barely doing any work.
> > > > > > > > > > > >
> > > > > > > > > > > > As soon as we shut down the Drillbit on this data node, query performance increases significantly.
> > > > > > > > > > > >
> > > > > > > > > > > > Any thoughts on how I can troubleshoot why Drill is picking that particular node?
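As a footnote on the assignment behavior discussed above: the hotspot pattern Steven describes is what you get whenever the candidate endpoints are considered in a stable order with stable tie-breaking. A toy sketch of why shuffling the candidate list changes the picture (this is not Drill's scheduler; the class and node names are made up):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;

    // Toy illustration only, not Drill's assignment code. If the candidate list
    // is derived from the same table's block locations, its order is identical
    // for every query, so the first nodes in the list win every slot every time.
    public class ToyAssignment {

      static List<String> assign(List<String> candidates, int slices, boolean shuffle) {
        List<String> order = new ArrayList<>(candidates);
        if (shuffle) {
          Collections.shuffle(order);  // randomize which nodes come first
        }
        List<String> chosen = new ArrayList<>();
        for (int i = 0; i < slices; i++) {
          chosen.add(order.get(i % order.size()));  // round-robin over the order
        }
        return chosen;
      }

      public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node-1", "node-2", "node-3", "node-4");
        System.out.println(assign(nodes, 2, false));  // same two nodes on every run
        System.out.println(assign(nodes, 2, true));   // varies run to run
      }
    }

As Steven noted, though, deterministic assignment alone doesn't explain why shutting the hot drillbit down helped rather than just moving the hotspot to another node.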
