Re: ZK lost connectivity issue on large cluster

Sudheesh Katkam Mon, 19 Sep 2016 10:20:03 -0700

Hi Francois,

A simple query with only projections is not an “ideal” use case, since Drill is 
bound by how fast the client can consume records. There are 1000 scanners 
sending data to 1 client (vs far fewer scanners sending data in the 12 node 
case).


This might increase the load on the Drillbit that is the Foreman for the query. 
In the query profile, the scanners should be spending a lot more time “waiting” 
to send records to the client (via root fragment).
+ Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case? There 
is also a bizarre case where the node that is reported as lost is the node 
itself.
+ Is there a spike in memory usage of the Drillbit this is the Foreman for the 
query (process memory, not just heap)?

Regarding the warnings ...

> 2016-09-19 13:31:56,866 [BitServer-7] WARN
> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type
> 6 took longer than 500 ms. Actual Duration was 16053ms.


RPC type 6 is a cancellation request; DRILL-4766 [2] should help in this case, 
which is resolved in the latest version of Drill. So as Chun suggested, upgrade 
the cluster to the latest version of Drill.

> 2016-09-19 14:15:33,357 [BitServer-4] WARN
> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type
> 1 took longer than 500 ms. Actual Duration was 981ms.

I am surprised that responses are taking that long to handle.
+ Are both messages on the same Drillbit?

The other warnings can be ignored.

Thank you,
Sudheesh

[1] I just realized that atsqa4-133.qa.lab is in one of our test environments :)
[2] https://issues.apache.org/jira/browse/DRILL-4766 
<https://issues.apache.org/jira/browse/DRILL-4766>

> On Sep 19, 2016, at 9:15 AM, François Méthot <fmetho...@gmail.com> wrote:
> 
> Hi Sudheesh,
> 
> 
> + Does the query involve any aggregations or filters? Or is this a select
> query with only projections?
> Simple query with only projections
> 
> + Any suspicious timings in the query profile?
> Nothing specially different than our working query on our small cluster.
> 
> + Any suspicious warning messages in the logs around the time of failure on
> any of the drillbits? Specially on atsqa4-133.qa.lab? Specially this one
> (“..” are place holders):
>  Message of mode .. of rpc type .. took longer than ..ms.  Actual duration
> was ..ms.
> 
> Well we do see this warning on the failing node (on my last test), I found
> this WARNING in our log for the past month for pretty much every node I
> checked.
> 2016-09-19 13:31:56,866 [BitServer-7] WARN
> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type
> 6 took longer than 500 ms. Actual Duration was 16053ms.
> 2016-09-19 14:15:33,357 [BitServer-4] WARN
> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type
> 1 took longer than 500 ms. Actual Duration was 981ms.
> 
> We really appreciate your help. I will dig in the source code for when and
> why this error happen.
> 
> 
> Francois
> 
> P.S.:
> We do see this also:
> 2016-09-19 14:48:23,444 [drill-executor-9] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not found in the
> work bus.
> 2016-09-19 14:48:23,444 [drill-executor-11] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not found in the
> work bus.
> 2016-09-19 14:48:23,444 [drill-executor-12] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not found in the
> work bus.
> 2016-09-19 14:48:23,444 [drill-executor-10] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not found in the
> work bus.
> 
> 
> 
> 
> On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam <skat...@maprtech.com>
> wrote:
> 
>> Hi Francois,
>> 
>> More questions..
>> 
>>> + Can you share the query profile?
>>>  I will sum it up:
>>> It is a select on 18 columns: 9 string, 9 integers.
>>> Scan is done on 13862 parquet files spread  on 1000 fragments.
>>> Fragments are spread accross 215 nodes.
>> 
>> So ~5 leaf fragments (or scanners) per Drillbit seems fine.
>> 
>> + Does the query involve any aggregations or filters? Or is this a select
>> query with only projections?
>> + Any suspicious timings in the query profile?
>> + Any suspicious warning messages in the logs around the time of failure
>> on any of the drillbits? Specially on atsqa4-133.qa.lab? Specially this one
>> (“..” are place holders):
>>  Message of mode .. of rpc type .. took longer than ..ms.  Actual
>> duration was ..ms.
>> 
>> Thank you,
>> Sudheesh
>> 
>>> On Sep 15, 2016, at 11:27 AM, François Méthot <fmetho...@gmail.com>
>> wrote:
>>> 
>>> Hi Sudheesh,
>>> 
>>> + How many zookeeper servers in the quorum?
>>> The quorum has 3 servers, everything looks healthy
>>> 
>>> + What is the load on atsqa4-133.qa.lab when this happens? Any other
>>> applications running on that node? How many threads is the Drill process
>>> using?
>>> The load on the failing node(8 cores) is 14, when Drill is running. Which
>>> is nothing out of the ordinary according to our admin.
>>> HBase is also running.
>>> planner.width.max_per_node is set to 8
>>> 
>>> + When running the same query on 12 nodes, is the data size same?
>>> Yes
>>> 
>>> + Can you share the query profile?
>>>  I will sum it up:
>>> It is a select on 18 columns: 9 string, 9 integers.
>>> Scan is done on 13862 parquet files spread  on 1000 fragments.
>>> Fragments are spread accross 215 nodes.
>>> 
>>> 
>>> We are in process of increasing our Zookeeper session timeout config to
>> see
>>> if it helps.
>>> 
>>> thanks
>>> 
>>> Francois
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam <skat...@maprtech.com>
>>> wrote:
>>> 
>>>> Hi Francois,
>>>> 
>>>> Few questions:
>>>> + How many zookeeper servers in the quorum?
>>>> + What is the load on atsqa4-133.qa.lab when this happens? Any other
>>>> applications running on that node? How many threads is the Drill process
>>>> using?
>>>> + When running the same query on 12 nodes, is the data size same?
>>>> + Can you share the query profile?
>>>> 
>>>> This may not be the right thing to do, but for now, If the cluster is
>>>> heavily loaded, increase the zk timeout.
>>>> 
>>>> Thank you,
>>>> Sudheesh
>>>> 
>>>>> On Sep 14, 2016, at 11:53 AM, François Méthot <fmetho...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> We are running 1.7.
>>>>> The log were taken from the jira tickets.
>>>>> 
>>>>> We will try out 1.8 soon.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <cch...@maprtech.com>
>> wrote:
>>>>> 
>>>>>> Looks like you are running 1.5. I believe there are some work done in
>>>> that
>>>>>> area and the newer release should behave better.
>>>>>> 
>>>>>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <
>> fmetho...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> We are trying to find a solution/workaround to issue:
>>>>>>> 
>>>>>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
>>>>>>> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR:
>> ForemanException:
>>>>>>> One more more nodes lost connectivity during query.  Identified nodes
>>>>>>> were [atsqa4-133.qa.lab:31010].
>>>>>>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
>>>>>>> ForemanException: One more more nodes lost connectivity during query.
>>>>>>> Identified nodes were [atsqa4-133.qa.lab:31010].
>>>>>>>      at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.
>>>>>>> close(Foreman.java:746)
>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>      at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
>>>>>>> processEvent(Foreman.java:858)
>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>      at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
>>>>>>> processEvent(Foreman.java:790)
>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>      at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
>>>>>>> moveToState(Foreman.java:792)
>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>      at org.apache.drill.exec.work.foreman.Foreman.moveToState(
>>>>>>> Foreman.java:909)
>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>      at org.apache.drill.exec.work.foreman.Foreman.access$2700(
>>>>>>> Foreman.java:110)
>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>>      at org.apache.drill.exec.work.foreman.Foreman$StateListener.
>>>>>>> moveToState(Foreman.java:1183)
>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>>>>> 
>>>>>>> 
>>>>>>> DRILL-4325  <https://issues.apache.org/jira/browse/DRILL-4325>
>>>>>>> ForemanException:
>>>>>>> One or more nodes lost connectivity during query
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Any one experienced this issue ?
>>>>>>> 
>>>>>>> It happens when running query involving many parquet files on a
>> cluster
>>>>>> of
>>>>>>> 200 nodes. Same query on a smaller cluster of 12 nodes runs fine.
>>>>>>> 
>>>>>>> It is not caused by garbage collection, (checked on both ZK node and
>>>> the
>>>>>>> involved drill bit).
>>>>>>> 
>>>>>>> Negotiated max session timeout is 40 seconds.
>>>>>>> 
>>>>>>> The sequence seems:
>>>>>>> - Drill Query begins, using an existing ZK session.
>>>>>>> - Drill Zk session timeouts
>>>>>>>    - perhaps it was writing something that took too long
>>>>>>> - Drill attempts to renew session
>>>>>>>     - drill believes that the write operation failed, so it attempts
>>>>>> to
>>>>>>> re-create the zk node, which trigger another exception.
>>>>>>> 
>>>>>>> We are open to any suggestion. We will report any finding.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Francois
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>>

Re: ZK lost connectivity issue on large cluster

Reply via email to