Hi Sudheesh,
+ Does the query involve any aggregations or filters? Or is this a select
query with only projections?

Simple query with only projections.

+ Any suspicious timings in the query profile?

Nothing especially different from our working query on our small cluster.

+ Any suspicious warning messages in the logs around the time of failure on
any of the drillbits? Especially on atsqa4-133.qa.lab? Especially this one
(".." are placeholders):
    Message of mode .. of rpc type .. took longer than ..ms. Actual
    duration was ..ms.

Well, we do see this warning on the failing node (on my last test). In fact,
I found this WARNING in our logs over the past month on pretty much every
node I checked:

2016-09-19 13:31:56,866 [BitServer-7] WARN o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6 took longer than 500 ms. Actual Duration was 16053ms.
2016-09-19 14:15:33,357 [BitServer-4] WARN o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1 took longer than 500 ms. Actual Duration was 981ms.

We really appreciate your help. I will dig into the source code to find out
when and why this error happens.

Francois

P.S.: We also see this:

2016-09-19 14:48:23,444 [drill-executor-9] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not found in the work bus.
2016-09-19 14:48:23,444 [drill-executor-11] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not found in the work bus.
2016-09-19 14:48:23,444 [drill-executor-12] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not found in the work bus.
2016-09-19 14:48:23,444 [drill-executor-10] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not found in the work bus.


On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam <[email protected]> wrote:

> Hi Francois,
>
> More questions..
>
> > + Can you share the query profile?
> > I will sum it up:
> > It is a select on 18 columns: 9 strings, 9 integers.
> > Scan is done on 13862 parquet files spread over 1000 fragments.
> > Fragments are spread across 215 nodes.
>
> So ~5 leaf fragments (or scanners) per drillbit seems fine.
>
> + Does the query involve any aggregations or filters? Or is this a select
> query with only projections?
> + Any suspicious timings in the query profile?
> + Any suspicious warning messages in the logs around the time of failure
> on any of the drillbits? Especially on atsqa4-133.qa.lab? Especially this
> one (".." are placeholders):
>     Message of mode .. of rpc type .. took longer than ..ms. Actual
>     duration was ..ms.
>
> Thank you,
> Sudheesh
>
> On Sep 15, 2016, at 11:27 AM, François Méthot <[email protected]>
> wrote:
>
> > Hi Sudheesh,
> >
> > + How many zookeeper servers in the quorum?
> > The quorum has 3 servers; everything looks healthy.
> >
> > + What is the load on atsqa4-133.qa.lab when this happens? Any other
> > applications running on that node? How many threads is the Drill process
> > using?
> > The load on the failing node (8 cores) is 14 when Drill is running, which
> > is nothing out of the ordinary according to our admin.
> > HBase is also running.
> > planner.width.max_per_node is set to 8.
> >
> > + When running the same query on 12 nodes, is the data size the same?
> > Yes
> >
> > + Can you share the query profile?
> > I will sum it up:
> > It is a select on 18 columns: 9 strings, 9 integers.
> > Scan is done on 13862 parquet files spread over 1000 fragments.
> > Fragments are spread across 215 nodes.
> >
> > We are in process of increasing our Zookeeper session timeout config to
> > see if it helps.
> >
> > thanks
> >
> > Francois
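
A note on the timeout change mentioned just above: if I read the settings
right, the value the drillbit requests from ZooKeeper comes from the zk block
in drill-override.conf (drill.exec.zk.timeout, in milliseconds), and whatever
it requests is still capped by maxSessionTimeout on the ZooKeeper servers.
The snippet below only illustrates the two knobs; the values are placeholders,
not the ones we are using:

    # drill-override.conf
    drill.exec.zk: {
      connect: "zkhost1:2181,zkhost2:2181,zkhost3:2181",  # placeholder quorum
      timeout: 30000,    # session timeout requested by the drillbit, in ms
      retry: { count: 7200, delay: 500 }
    }

    # zoo.cfg on the ZooKeeper servers
    tickTime=2000
    maxSessionTimeout=60000   # granted session timeout can never exceed this

Raising only the Drill-side value does nothing once it exceeds the
server-side cap.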
> > On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam <[email protected]>
> > wrote:
> >
> >> Hi Francois,
> >>
> >> Few questions:
> >> + How many zookeeper servers in the quorum?
> >> + What is the load on atsqa4-133.qa.lab when this happens? Any other
> >> applications running on that node? How many threads is the Drill process
> >> using?
> >> + When running the same query on 12 nodes, is the data size same?
> >> + Can you share the query profile?
> >>
> >> This may not be the right thing to do, but for now, if the cluster is
> >> heavily loaded, increase the zk timeout.
> >>
> >> Thank you,
> >> Sudheesh
> >>
> >>> On Sep 14, 2016, at 11:53 AM, François Méthot <[email protected]>
> >>> wrote:
> >>>
> >>> We are running 1.7.
> >>> The logs were taken from the JIRA tickets.
> >>>
> >>> We will try out 1.8 soon.
> >>>
> >>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <[email protected]> wrote:
> >>>
> >>>> Looks like you are running 1.5. I believe there has been some work done
> >>>> in that area and the newer release should behave better.
> >>>>
> >>>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> We are trying to find a solution/workaround to this issue:
> >>>>>
> >>>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
> >>>>> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException:
> >>>>> One more more nodes lost connectivity during query. Identified nodes
> >>>>> were [atsqa4-133.qa.lab:31010].
> >>>>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> >>>>> ForemanException: One more more nodes lost connectivity during query.
> >>>>> Identified nodes were [atsqa4-133.qa.lab:31010].
> >>>>>     at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>     at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>     at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>     at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>
> >>>>> DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325>
> >>>>> ForemanException: One or more nodes lost connectivity during query
> >>>>>
> >>>>> Anyone experienced this issue?
> >>>>>
> >>>>> It happens when running a query involving many parquet files on a
> >>>>> cluster of 200 nodes. The same query on a smaller cluster of 12 nodes
> >>>>> runs fine.
> >>>>>
> >>>>> It is not caused by garbage collection (checked on both the ZK node
> >>>>> and the involved drillbit).
> >>>>>
> >>>>> Negotiated max session timeout is 40 seconds.
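
That 40 seconds matches ZooKeeper's default cap of 20 x tickTime (tickTime is
2000 ms by default). A quick way to see what a client is actually granted is
to connect with the same quorum and requested timeout and print the negotiated
value. This is only a rough standalone sketch using the Curator API (the
quorum string and requested timeout are placeholders), not Drill's own
coordination code:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.RetryNTimes;

    public class ZkSessionTimeoutCheck {
      public static void main(String[] args) throws Exception {
        // Placeholder quorum; point it at the same servers the drillbits use.
        String quorum = "zkhost1:2181,zkhost2:2181,zkhost3:2181";
        int requestedSessionTimeoutMs = 60000;   // what the client asks for

        CuratorFramework client = CuratorFrameworkFactory.builder()
            .connectString(quorum)
            .sessionTimeoutMs(requestedSessionTimeoutMs)
            .connectionTimeoutMs(5000)
            .retryPolicy(new RetryNTimes(3, 1000))
            .build();
        client.start();
        client.blockUntilConnected();

        // The server caps the granted timeout at maxSessionTimeout
        // (20 * tickTime by default, i.e. 40000 ms), no matter what we ask for.
        int negotiatedMs = client.getZookeeperClient().getZooKeeper().getSessionTimeout();
        System.out.println("requested=" + requestedSessionTimeoutMs
            + " ms, negotiated=" + negotiatedMs + " ms");

        client.close();
      }
    }

If this keeps printing 40000 ms no matter what is requested, the client-side
setting alone is not the limiting factor; maxSessionTimeout on the ZooKeeper
servers has to be raised as well.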
> >>>>>
> >>>>> The sequence seems to be:
> >>>>> - Drill query begins, using an existing ZK session.
> >>>>> - Drill's ZK session times out
> >>>>>     - perhaps it was writing something that took too long
> >>>>> - Drill attempts to renew the session
> >>>>>     - Drill believes that the write operation failed, so it attempts
> >>>>>       to re-create the ZK node, which triggers another exception.
> >>>>>
> >>>>> We are open to any suggestion. We will report any findings.
> >>>>>
> >>>>> Thanks
> >>>>> Francois
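
One more note on the sequence described above: a session expiry like that is
normally visible on the client side as Curator connection-state transitions,
SUSPENDED when the connection drops and LOST when the session is considered
expired, at which point ephemeral znodes (such as a drillbit's registration)
are gone and must be re-created after reconnecting. The sketch below is a
rough, standalone watcher that timestamps those transitions so they can be
lined up with the ForemanException; the connect string and timeouts are
placeholders, and this is not Drill's actual registration code:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.state.ConnectionState;
    import org.apache.curator.framework.state.ConnectionStateListener;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class ZkStateWatcher {
      public static void main(String[] args) throws Exception {
        // Placeholder connect string and timeouts; align them with the drillbits'.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zkhost1:2181,zkhost2:2181,zkhost3:2181",
            40000,    // requested session timeout (ms)
            5000,     // connection timeout (ms)
            new ExponentialBackoffRetry(1000, 10));

        client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
          @Override
          public void stateChanged(CuratorFramework c, ConnectionState newState) {
            // SUSPENDED: the connection dropped but the session may still be alive.
            // LOST: the session is gone; ephemeral znodes disappear and have to be
            // re-created once the client reconnects.
            System.out.println(System.currentTimeMillis()
                + " ZK connection state -> " + newState);
          }
        });

        client.start();
        Thread.sleep(Long.MAX_VALUE);   // keep logging transitions until killed
      }
    }

Lining those timestamps up with the error on the foreman node should show
whether the failing drillbit's session really expired or whether its
registration merely flapped for a few seconds.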
