Hi Sudheesh,
+ Does the query involve any aggregations or filters? Or is this a select
query with only projections?

Simple query with only projections.

+ Any suspicious timings in the query profile?

Nothing especially different from our working query on our small cluster.

+ Any suspicious warning messages in the logs around the time of failure on
any of the drillbits? Especially on atsqa4-133.qa.lab? Especially this one
(".." are placeholders):
    Message of mode .. of rpc type .. took longer than ..ms. Actual
    duration was ..ms.

Well, we do see this warning on the failing node (on my last test). In fact,
I found this WARNING in our logs over the past month on pretty much every
node I checked:

2016-09-19 13:31:56,866 [BitServer-7] WARN o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6 took longer than 500 ms. Actual Duration was 16053ms.
2016-09-19 14:15:33,357 [BitServer-4] WARN o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1 took longer than 500 ms. Actual Duration was 981ms.

We really appreciate your help. I will dig into the source code to find out
when and why this error happens.

Francois

P.S.: We also see this:

2016-09-19 14:48:23,444 [drill-executor-9] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not found in the work bus.
2016-09-19 14:48:23,444 [drill-executor-11] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not found in the work bus.
2016-09-19 14:48:23,444 [drill-executor-12] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not found in the work bus.
2016-09-19 14:48:23,444 [drill-executor-10] WARN o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not found in the work bus.


On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam <[email protected]> wrote:

> Hi Francois,
>
> More questions..
>
> > + Can you share the query profile?
> > I will sum it up:
> > It is a select on 18 columns: 9 strings, 9 integers.
> > Scan is done on 13862 parquet files spread over 1000 fragments.
> > Fragments are spread across 215 nodes.
>
> So ~5 leaf fragments (or scanners) per drillbit seems fine.
>
> + Does the query involve any aggregations or filters? Or is this a select
> query with only projections?
> + Any suspicious timings in the query profile?
> + Any suspicious warning messages in the logs around the time of failure
> on any of the drillbits? Especially on atsqa4-133.qa.lab? Especially this
> one (".." are placeholders):
>     Message of mode .. of rpc type .. took longer than ..ms. Actual
>     duration was ..ms.
>
> Thank you,
> Sudheesh
>
> On Sep 15, 2016, at 11:27 AM, François Méthot <[email protected]>
> wrote:
>
> > Hi Sudheesh,
> >
> > + How many zookeeper servers in the quorum?
> > The quorum has 3 servers; everything looks healthy.
> >
> > + What is the load on atsqa4-133.qa.lab when this happens? Any other
> > applications running on that node? How many threads is the Drill process
> > using?
> > The load on the failing node (8 cores) is 14 when Drill is running, which
> > is nothing out of the ordinary according to our admin.
> > HBase is also running.
> > planner.width.max_per_node is set to 8.
> >
> > + When running the same query on 12 nodes, is the data size the same?
> > Yes
> >
> > + Can you share the query profile?
> > I will sum it up:
> > It is a select on 18 columns: 9 strings, 9 integers.
> > Scan is done on 13862 parquet files spread over 1000 fragments.
> > Fragments are spread across 215 nodes.
> >
> > We are in process of increasing our Zookeeper session timeout config to
> > see if it helps.
> >
> > thanks
> >
> > Francois
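
A note on the timeout change mentioned just above: if I read the settings
right, the value the drillbit requests from ZooKeeper comes from the zk block
in drill-override.conf (drill.exec.zk.timeout, in milliseconds), and whatever
it requests is still capped by maxSessionTimeout on the ZooKeeper servers.
The snippet below only illustrates the two knobs; the values are placeholders,
not the ones we are using:

    # drill-override.conf
    drill.exec.zk: {
      connect: "zkhost1:2181,zkhost2:2181,zkhost3:2181",  # placeholder quorum
      timeout: 30000,    # session timeout requested by the drillbit, in ms
      retry: { count: 7200, delay: 500 }
    }

    # zoo.cfg on the ZooKeeper servers
    tickTime=2000
    maxSessionTimeout=60000   # granted session timeout can never exceed this

Raising only the Drill-side value does nothing once it exceeds the
server-side cap.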
> > On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam <[email protected]>
> > wrote:
> >
> >> Hi Francois,
> >>
> >> Few questions:
> >> + How many zookeeper servers in the quorum?
> >> + What is the load on atsqa4-133.qa.lab when this happens? Any other
> >> applications running on that node? How many threads is the Drill process
> >> using?
> >> + When running the same query on 12 nodes, is the data size same?
> >> + Can you share the query profile?
> >>
> >> This may not be the right thing to do, but for now, if the cluster is
> >> heavily loaded, increase the zk timeout.
> >>
> >> Thank you,
> >> Sudheesh
> >>
> >>> On Sep 14, 2016, at 11:53 AM, François Méthot <[email protected]>
> >>> wrote:
> >>>
> >>> We are running 1.7.
> >>> The logs were taken from the JIRA tickets.
> >>>
> >>> We will try out 1.8 soon.
> >>>
> >>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <[email protected]> wrote:
> >>>
> >>>> Looks like you are running 1.5. I believe there has been some work done
> >>>> in that area and the newer release should behave better.
> >>>>
> >>>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> We are trying to find a solution/workaround to this issue:
> >>>>>
> >>>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
> >>>>> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException:
> >>>>> One more more nodes lost connectivity during query. Identified nodes
> >>>>> were [atsqa4-133.qa.lab:31010].
> >>>>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> >>>>> ForemanException: One more more nodes lost connectivity during query.
> >>>>> Identified nodes were [atsqa4-133.qa.lab:31010].
> >>>>>     at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>     at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>     at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>     at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>     at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>
> >>>>> DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325>
> >>>>> ForemanException: One or more nodes lost connectivity during query
> >>>>>
> >>>>> Anyone experienced this issue?
> >>>>>
> >>>>> It happens when running a query involving many parquet files on a
> >>>>> cluster of 200 nodes. The same query on a smaller cluster of 12 nodes
> >>>>> runs fine.
> >>>>>
> >>>>> It is not caused by garbage collection (checked on both the ZK node
> >>>>> and the involved drillbit).
> >>>>>
> >>>>> Negotiated max session timeout is 40 seconds.
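
That 40 seconds matches ZooKeeper's default cap of 20 x tickTime (tickTime is
2000 ms by default). A quick way to see what a client is actually granted is
to connect with the same quorum and requested timeout and print the negotiated
value. This is only a rough standalone sketch using the Curator API (the
quorum string and requested timeout are placeholders), not Drill's own
coordination code:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.RetryNTimes;

    public class ZkSessionTimeoutCheck {
      public static void main(String[] args) throws Exception {
        // Placeholder quorum; point it at the same servers the drillbits use.
        String quorum = "zkhost1:2181,zkhost2:2181,zkhost3:2181";
        int requestedSessionTimeoutMs = 60000;   // what the client asks for

        CuratorFramework client = CuratorFrameworkFactory.builder()
            .connectString(quorum)
            .sessionTimeoutMs(requestedSessionTimeoutMs)
            .connectionTimeoutMs(5000)
            .retryPolicy(new RetryNTimes(3, 1000))
            .build();
        client.start();
        client.blockUntilConnected();

        // The server caps the granted timeout at maxSessionTimeout
        // (20 * tickTime by default, i.e. 40000 ms), no matter what we ask for.
        int negotiatedMs = client.getZookeeperClient().getZooKeeper().getSessionTimeout();
        System.out.println("requested=" + requestedSessionTimeoutMs
            + " ms, negotiated=" + negotiatedMs + " ms");

        client.close();
      }
    }

If this keeps printing 40000 ms no matter what is requested, the client-side
setting alone is not the limiting factor; maxSessionTimeout on the ZooKeeper
servers has to be raised as well.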
> >>>>>
> >>>>> The sequence seems to be:
> >>>>> - Drill query begins, using an existing ZK session.
> >>>>> - Drill's ZK session times out
> >>>>>     - perhaps it was writing something that took too long
> >>>>> - Drill attempts to renew the session
> >>>>>     - Drill believes that the write operation failed, so it attempts
> >>>>>       to re-create the ZK node, which triggers another exception.
> >>>>>
> >>>>> We are open to any suggestion. We will report any findings.
> >>>>>
> >>>>> Thanks
> >>>>> Francois
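
One more note on the sequence described above: a session expiry like that is
normally visible on the client side as Curator connection-state transitions,
SUSPENDED when the connection drops and LOST when the session is considered
expired, at which point ephemeral znodes (such as a drillbit's registration)
are gone and must be re-created after reconnecting. The sketch below is a
rough, standalone watcher that timestamps those transitions so they can be
lined up with the ForemanException; the connect string and timeouts are
placeholders, and this is not Drill's actual registration code:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.state.ConnectionState;
    import org.apache.curator.framework.state.ConnectionStateListener;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class ZkStateWatcher {
      public static void main(String[] args) throws Exception {
        // Placeholder connect string and timeouts; align them with the drillbits'.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zkhost1:2181,zkhost2:2181,zkhost3:2181",
            40000,    // requested session timeout (ms)
            5000,     // connection timeout (ms)
            new ExponentialBackoffRetry(1000, 10));

        client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
          @Override
          public void stateChanged(CuratorFramework c, ConnectionState newState) {
            // SUSPENDED: the connection dropped but the session may still be alive.
            // LOST: the session is gone; ephemeral znodes disappear and have to be
            // re-created once the client reconnects.
            System.out.println(System.currentTimeMillis()
                + " ZK connection state -> " + newState);
          }
        });

        client.start();
        Thread.sleep(Long.MAX_VALUE);   // keep logging transitions until killed
      }
    }

Lining those timestamps up with the error on the foreman node should show
whether the failing drillbit's session really expired or whether its
registration merely flapped for a few seconds.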
