It doesn't currently have plan caching, but a simple implementation probably wouldn't be that difficult (assuming you keep it node-level as opposed to cluster-level). We merged the auto shuffling per session, so let us know how that looks.
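To make the node-level idea concrete, here's a rough sketch of the shape I'd imagine (purely illustrative: the class name, the generic plan type, and the key scheme are made up, and it uses Guava for the cache; this isn't code that exists in Drill today):

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import java.util.concurrent.TimeUnit;

    // Hypothetical node-local plan cache. Each drillbit keeps its own instance,
    // so there is no cross-node invalidation to coordinate (the node-level
    // option mentioned above). P stands in for whatever the planner produces.
    public class NodeLocalPlanCache<P> {

      private final Cache<String, P> plans = CacheBuilder.newBuilder()
          .maximumSize(1_000)                       // bound memory on this node
          .expireAfterAccess(10, TimeUnit.MINUTES)  // let cold plans fall out
          .build();

      // Key on the query text plus a fingerprint of the session options that
      // affect planning, since the same SQL can plan differently under
      // different options.
      private static String key(String sql, String optionsFingerprint) {
        return optionsFingerprint + "|" + sql;
      }

      public P get(String sql, String optionsFingerprint) {
        return plans.getIfPresent(key(sql, optionsFingerprint));
      }

      public void put(String sql, String optionsFingerprint, P plan) {
        plans.put(key(sql, optionsFingerprint), plan);
      }
    }

The catch with keeping it node-level is that a cached plan only helps repeat queries whose foreman lands on the same drillbit, so how much it buys you depends on how the foremen get spread around.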
On Wed, Apr 15, 2015 at 4:35 PM, Adam Gilmore <[email protected]> wrote:
> The workload does involve a fair number of short queries. Although when I say short, I'm talking about querying 2-10 million record Parquet files, so they're not extremely short.
>
> Does Drill have plan caching built in at this stage? It might help us reduce some of that foreman overhead.
>
> On Tue, Apr 14, 2015 at 3:02 AM, Jacques Nadeau <[email protected]> wrote:
> > Yeah, it seems that way. We should get your patch merged. I just reviewed and lgtm.
> >
> > What type of workload are you running? Unless your workload is planning heavy (e.g. lots of short queries) or does a lot of sorts (the last merge is on the foreman node), work should be reasonably distributed.
> >
> > On Sun, Apr 12, 2015 at 10:29 PM, Adam Gilmore <[email protected]> wrote:
> > > Looks like this definitely is the following bug:
> > >
> > > https://issues.apache.org/jira/browse/DRILL-2512
> > >
> > > It's a pretty severe performance bottleneck having the foreman do so much work. In our environment, the foreman hits basically 95-100% CPU while the other drillbits barely do any work, which means it's nearly impossible for us to scale out.
> > >
> > > On Wed, Apr 8, 2015 at 3:58 PM, Adam Gilmore <[email protected]> wrote:
> > > > Anyone have any more thoughts on this? Anywhere I can start trying to troubleshoot?
> > > >
> > > > On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <[email protected]> wrote:
> > > > > So there are 5 Parquet files, each ~125 MB - not sure what I can provide re the block locations? I believe it's under the HDFS block size, so they should be stored contiguously.
> > > > >
> > > > > I've tried setting the affinity factor to various values (1, 0, etc.) but nothing seems to change that. It always prefers certain nodes.
> > > > >
> > > > > Moreover, we added a stack more nodes and it started picking very specific nodes as foremen (perhaps 2-3 nodes out of 20 were always picked as foremen). Therefore, the foremen were being swamped with CPU while the other nodes were doing very little work.
> > > > >
> > > > > On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <[email protected]> wrote:
> > > > > > Actually, I believe a query submitted through the REST interface will instantiate a DrillClient, which uses the same ZKClusterCoordinator that sqlline uses, and thus the foreman for the query is not necessarily on the same drillbit as the one it was submitted to. But I'm still not sure it's related to DRILL-2512.
> > > > > >
> > > > > > I'll wait for your additional info before speculating further.
> > > > > >
> > > > > > On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <[email protected]> wrote:
> > > > > > > We actually set up a separate load balancer for port 8047 (we're submitting these queries via the REST API at the moment), so Zookeeper etc. is out of the equation, and thus I doubt we're hitting DRILL-2512.
> > > > > > >
> > > > > > > When shutting down the "troublesome" drillbit, it starts parallelizing much more nicely again. We even added 10+ nodes to the cluster, and as long as that particular drillbit is shut down, it distributes very nicely. The minute we start the drillbit on that node again, it starts swamping it with work.
> > > > > > >
> > > > > > > I'll shoot through the JSON profiles and some more information on the dataset etc. later today (Australian time!).
> > > > > > >
> > > > > > > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <[email protected]> wrote:
> > > > > > > > I didn't notice at first that Adam said "no matter who the foreman is".
> > > > > > > >
> > > > > > > > Another suspicion I have is that our current logic for assigning work will assign to the exact same nodes every time we query a particular table. Changing the affinity factor may change it, but it will still be the same every time. That is my suspicion, but I am not sure why shutting down the drillbit would improve performance. I would expect that shutting down the drillbit would result in a different drillbit becoming the hotspot.
> > > > > > > >
> > > > > > > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <[email protected]> wrote:
> > > > > > > > > On Steven's point, the node that the client connects to is not currently randomized. Given your description of the behavior, I'm not sure whether you're hitting 2512 or just general undesirable distribution.
> > > > > > > > >
> > > > > > > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <[email protected]> wrote:
> > > > > > > > > > This is a known issue:
> > > > > > > > > >
> > > > > > > > > > https://issues.apache.org/jira/browse/DRILL-2512
> > > > > > > > > >
> > > > > > > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <[email protected]> wrote:
> > > > > > > > > > > What version of Drill are you running?
> > > > > > > > > > >
> > > > > > > > > > > Any hints when looking at the query profiles? Is the node that is being hammered the foreman for the queries, and are most of the major fragments tied to the foreman?
> > > > > > > > > > >
> > > > > > > > > > > —Andries
> > > > > > > > > > >
> > > > > > > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <[email protected]> wrote:
> > > > > > > > > > > > Hi guys,
> > > > > > > > > > > >
> > > > > > > > > > > > I'm trying to understand how this could be possible. I have a Hadoop cluster set up with a name node and two data nodes, all with identical specs in terms of CPU/RAM etc.
> > > > > > > > > > > >
> > > > > > > > > > > > The two data nodes have a replicated HDFS setup where I'm storing some Parquet files.
> > > > > > > > > > > >
> > > > > > > > > > > > A Drill cluster (with Zookeeper) is running with Drillbits on all three servers.
> > > > > > > > > > > >
> > > > > > > > > > > > When I submit a query to *any* of the Drillbits, no matter who the foreman is, one particular data node gets picked to do the vast majority of the work.
> > > > > > > > > > > >
> > > > > > > > > > > > We've even added three more task nodes to the cluster and everything still puts a huge load on one particular server.
> > > > > > > > > > > >
> > > > > > > > > > > > There is nothing unique about this data node. HDFS is fully replicated (no unreplicated blocks) to the other data node.
> > > > > > > > > > > >
> > > > > > > > > > > > I know that Drill tries to get data locality, so I'm wondering if this is the cause, but it's essentially swamping this data node with 100% CPU usage while leaving the others barely doing any work.
> > > > > > > > > > > >
> > > > > > > > > > > > As soon as we shut down the Drillbit on this data node, query performance increases significantly.
> > > > > > > > > > > >
> > > > > > > > > > > > Any thoughts on how I can troubleshoot why Drill is picking that particular node?
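As a footnote on the assignment behavior discussed above: the hotspot pattern Steven describes is what you get whenever the candidate endpoints are considered in a stable order with stable tie-breaking. A toy sketch of why shuffling the candidate list changes the picture (this is not Drill's scheduler; the class and node names are made up):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;

    // Toy illustration only, not Drill's assignment code. If the candidate list
    // is derived from the same table's block locations, its order is identical
    // for every query, so the first nodes in the list win every slot every time.
    public class ToyAssignment {

      static List<String> assign(List<String> candidates, int slices, boolean shuffle) {
        List<String> order = new ArrayList<>(candidates);
        if (shuffle) {
          Collections.shuffle(order);  // randomize which nodes come first
        }
        List<String> chosen = new ArrayList<>();
        for (int i = 0; i < slices; i++) {
          chosen.add(order.get(i % order.size()));  // round-robin over the order
        }
        return chosen;
      }

      public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node-1", "node-2", "node-3", "node-4");
        System.out.println(assign(nodes, 2, false));  // same two nodes on every run
        System.out.println(assign(nodes, 2, true));   // varies run to run
      }
    }

As Steven noted, though, deterministic assignment alone doesn't explain why shutting the hot drillbit down helped rather than just moving the hotspot to another node.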
