I am surprised that it's not the default.
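
For anyone who wants to try the same change, it should be possible to apply
it at runtime from sqlline (a sketch; setting it per session with ALTER
SESSION should work too):

    -- raise the data-locality affinity, then confirm the value took effect
    ALTER SYSTEM SET `planner.affinity_factor` = 100;
    SELECT * FROM sys.options WHERE name = 'planner.affinity_factor';
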
On Fri, Oct 14, 2016 at 11:18 AM, Sudheesh Katkam <skat...@maprtech.com> wrote:

> Hi Francois,
>
> Thank you for posting your findings! Glad to see a 10X improvement.
>
> By increasing the affinity factor, it looks like Drill's parallelizer is
> forced to assign fragments to the nodes holding the data, i.e. with high
> favorability for data locality.
>
> Regarding the random disconnections, I agree with your guess that the
> network bandwidth is being used up by remote reads, which causes lags in
> drillbit-to-ZooKeeper heartbeats (since these services share the same
> network)? Maybe others can comment here.
>
> Thank you,
> Sudheesh
>
> > On Oct 12, 2016, at 6:06 PM, François Méthot <fmetho...@gmail.com> wrote:
> >
> > Hi,
> >
> > We finally got rid of this error. We tried many, many things (like
> > modifying Drill to ignore the error!), but it ultimately came down to
> > this change:
> >
> > from the default
> >   planner.affinity_factor=1.2
> > to
> >   planner.affinity_factor=100
> >
> > Basically this encourages fragments to only care about locally stored
> > files. We looked at the code that uses that property and figured that
> > 100 would have a strong impact.
> >
> > What led us to this property is the fact that 1/4 of our fragments would
> > take a lot more time to complete their scan, up to 10x the time of the
> > fastest nodes. On the slower nodes, Cloudera Manager would show very low
> > disk IOPS with high network IO compared to our faster nodes. We had
> > noticed that before but figured it was an optimization to be done later,
> > once more pressing issues were fixed, like the ZK disconnections and
> > OOMs. We were desperate and decided to fix anything that looked unusual.
> >
> > After this change, queries ran up to 10x faster.
> >
> > We no longer get random disconnections between nodes and ZooKeeper.
> >
> > We are still wondering why exactly. The network should not be a
> > bottleneck. Could high network traffic between a Drillbit and HDFS cause
> > the Drillbit to time out with ZooKeeper?
> >
> > On Fri, Sep 30, 2016 at 4:21 PM, François Méthot <fmetho...@gmail.com> wrote:
> >
> >> After the 30-second gap, all the Drill nodes receive the following:
> >>
> >> 2016-09-26 20:07:38,629 [Curator-ServiceCache-0] Debug Active drillbit set
> >> changed. Now includes 220 total bits. New Active drill bits....
> >> ...faulty node is not on the list...
> >> 2016-09-26 20:07:38,897 [Curator-ServiceCache-0] Debug Active drillbit set
> >> changed. Now includes 221 total bits. New Active drill bits....
> >> ...faulty node is back on the list...
> >>
> >> So the faulty Drill node gets unregistered, then registered right after.
> >>
> >> Drill is using the low-level API for registering and unregistering, and
> >> the only place where unregistering occurs is when the drillbit is closed
> >> at shutdown.
> >>
> >> That particular drillbit was still up and running after those logs, so
> >> it could not have triggered the unregistering process through a
> >> shutdown.
> >>
> >> Would you have an idea what else could cause a Drillbit to be
> >> unregistered from the DiscoveryService and registered again right after?
> >>
> >> We are using Zookeeper 3.4.5
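> >>
> >> If it helps anyone watch for the same flapping: the registered set can
> >> also be polled from SQL while reproducing, something like (assuming the
> >> sys.drillbits system table is available in your version):
> >>
> >>   -- lists the drillbits currently registered in ZooKeeper
> >>   SELECT * FROM sys.drillbits;
> >>
> >> run in a loop and diffed against the expected host list.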
> >>
> >> On Wed, Sep 28, 2016 at 10:36 AM, François Méthot <fmetho...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> We have switched to 1.8 and we are still getting node disconnections.
> >>>
> >>> We did many tests. We initially thought our standalone parquet
> >>> converter was generating parquet files with problematic data (like
> >>> 10K-character strings), but we were able to reproduce the problem with
> >>> the employee data from the tutorial.
> >>>
> >>> For example, we duplicated the Drill tutorial "Employee" data to reach
> >>> 500 M records spread over 130 parquet files.
> >>> Each file is ~60 MB.
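> >>>
> >>> (Roughly how such a test set can be built; a sketch that assumes the
> >>> hdfs.tmp workspace is writable, and that the CTAS is repeated/unioned
> >>> enough times to reach 500 M rows:
> >>>
> >>>   -- write parquet output, then copy the bundled tutorial data
> >>>   ALTER SESSION SET `store.format` = 'parquet';
> >>>   CREATE TABLE hdfs.tmp.`PARQUET_EMPLOYEE` AS
> >>>   SELECT * FROM cp.`employee.json`;
> >>> )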
> >>>
> >>> We ran this query over and over in 5 different sessions using a script:
> >>>
> >>>   select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like '%does not exist%';
> >>>
> >>> The query returns no rows and would take ~35 to 45 seconds to complete.
> >>>
> >>> Leaving the script running on each node, we eventually hit the "nodes
> >>> lost connectivity during query" error.
> >>>
> >>> On the node that failed, we see these logs:
> >>>
> >>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO
> >>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:10: State to report:
> >>> RUNNING
> >>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG
> >>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on server064:31010
> >>>
> >>> <--- 30 seconds gap for that fragment --->
> >>>
> >>> 2016-09-26 20:37:09,976 [BitServer-2] WARN
> >>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc
> >>> type 2 took longer than 500 ms. Actual duration was 23617ms.
> >>>
> >>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
> >>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
> >>> remaining: 0 incoming schema BatchSchema [...., selectionVector=TWO_BYTE]
> >>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
> >>> o.a.d.exec.rpc.control.WorkEventBus - Cancelling and removing fragment
> >>> manager : ...uuid...
> >>>
> >>> For the same query on a working node:
> >>>
> >>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] INFO
> >>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:2: State to report:
> >>> RUNNING
> >>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] DEBUG
> >>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:2 on server125:31010
> >>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
> >>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
> >>> remaining: 0 incoming schema BatchSchema [...., selectionVector=TWO_BYTE]
> >>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
> >>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
> >>> remaining: 0 incoming schema BatchSchema [...., selectionVector=TWO_BYTE]
> >>> 2016-09-26 20:07:11,005 [...uuid...frag:1:2] DEBUG
> >>> o.a.d.e.s.p.c.ParquetRecordReader - Read 87573 records out of row
> >>> groups(0) in file `/data/drill/tmp/PARQUET_EMPLOYEE/0_0_14.parquet
> >>>
> >>> We are investigating what could cause that 30-second gap for that
> >>> fragment.
> >>>
> >>> If you have any ideas, let us know.
> >>>
> >>> Thanks
> >>> Francois
> >>>
> >>> On Mon, Sep 19, 2016 at 2:59 PM, François Méthot <fmetho...@gmail.com> wrote:
> >>>
> >>>> Hi Sudheesh,
> >>>>
> >>>> If I add a selection filter so that no rows are returned, the same
> >>>> problem occurs. I also simplified the query to include only a few
> >>>> integer columns.
> >>>>
> >>>> That particular data repo is ~200+ billion records spread over
> >>>> ~50 000 parquet files.
> >>>>
> >>>> We have other CSV data repos that are 100x smaller that do not
> >>>> trigger this issue.
> >>>>
> >>>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this
> >>>> case? There is also a bizarre case where the node that is reported as
> >>>> lost is the node itself.
> >>>> Yes, the stack trace is from the ticket. It did occur once or twice
> >>>> (in the many, many attempts) that it was the node itself.
> >>>>
> >>>> + Is there a spike in memory usage of the Drillbit that is the
> >>>> Foreman for the query (process memory, not just heap)?
> >>>> We don't notice any unusual spike; each node gets busy in the same
> >>>> range when the query is running.
> >>>>
> >>>> I tried running with 8GB/20GB and 4GB/24GB heap/off-heap, and did not
> >>>> see any improvement.
> >>>>
> >>>> We will update from 1.7 to 1.8 before going ahead with more
> >>>> investigation.
> >>>>
> >>>> Thanks a lot.
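> >>>>
> >>>> (After the upgrade, the build each drillbit is actually running can
> >>>> presumably be confirmed with something like:
> >>>>
> >>>>   -- shows version and commit id of the connected drillbit
> >>>>   SELECT * FROM sys.version;
> >>>>
> >>>> assuming the sys.version table is available in this release.)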
> >>>>
> >>>> On Mon, Sep 19, 2016 at 1:19 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:
> >>>>
> >>>>> Hi Francois,
> >>>>>
> >>>>> A simple query with only projections is not an "ideal" use case,
> >>>>> since Drill is bound by how fast the client can consume records.
> >>>>> There are 1000 scanners sending data to 1 client (vs. far fewer
> >>>>> scanners sending data in the 12-node case).
> >>>>>
> >>>>> This might increase the load on the Drillbit that is the Foreman for
> >>>>> the query. In the query profile, the scanners should be spending a
> >>>>> lot more time "waiting" to send records to the client (via the root
> >>>>> fragment).
> >>>>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this
> >>>>> case? There is also a bizarre case where the node that is reported
> >>>>> as lost is the node itself.
> >>>>> + Is there a spike in memory usage of the Drillbit that is the
> >>>>> Foreman for the query (process memory, not just heap)?
> >>>>>
> >>>>> Regarding the warnings ...
> >>>>>
> >>>>>> 2016-09-19 13:31:56,866 [BitServer-7] WARN
> >>>>>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of
> >>>>>> rpc type 6 took longer than 500 ms. Actual Duration was 16053ms.
> >>>>>
> >>>>> RPC type 6 is a cancellation request; DRILL-4766 [2] should help in
> >>>>> this case, and it is resolved in the latest version of Drill. So as
> >>>>> Chun suggested, upgrade the cluster to the latest version of Drill.
> >>>>>
> >>>>>> 2016-09-19 14:15:33,357 [BitServer-4] WARN
> >>>>>> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of
> >>>>>> rpc type 1 took longer than 500 ms. Actual Duration was 981ms.
> >>>>>
> >>>>> I am surprised that responses are taking that long to handle.
> >>>>> + Are both messages on the same Drillbit?
> >>>>>
> >>>>> The other warnings can be ignored.
> >>>>>
> >>>>> Thank you,
> >>>>> Sudheesh
> >>>>>
> >>>>> [1] I just realized that atsqa4-133.qa.lab is in one of our test
> >>>>> environments :)
> >>>>> [2] https://issues.apache.org/jira/browse/DRILL-4766
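> >>>>>
> >>>>> (To eyeball process memory per drillbit, if the sys.memory table is
> >>>>> available in your build:
> >>>>>
> >>>>>   -- heap and direct memory usage, one row per drillbit
> >>>>>   SELECT * FROM sys.memory;
> >>>>>
> >>>>> comparing the Foreman's row against the others should show a spike.)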
> >>>>>
> >>>>>> On Sep 19, 2016, at 9:15 AM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Sudheesh,
> >>>>>>
> >>>>>> + Does the query involve any aggregations or filters? Or is this a
> >>>>>> select query with only projections?
> >>>>>> Simple query with only projections.
> >>>>>>
> >>>>>> + Any suspicious timings in the query profile?
> >>>>>> Nothing specially different from our working queries on our small
> >>>>>> cluster.
> >>>>>>
> >>>>>> + Any suspicious warning messages in the logs around the time of
> >>>>>> failure on any of the drillbits? Especially on atsqa4-133.qa.lab?
> >>>>>> Especially this one (".." are placeholders):
> >>>>>> Message of mode .. of rpc type .. took longer than ..ms. Actual
> >>>>>> duration was ..ms.
> >>>>>>
> >>>>>> We do see this warning on the failing node (on my last test); in
> >>>>>> fact, I found this WARNING in our logs for the past month on pretty
> >>>>>> much every node I checked:
> >>>>>> 2016-09-19 13:31:56,866 [BitServer-7] WARN
> >>>>>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of
> >>>>>> rpc type 6 took longer than 500 ms. Actual Duration was 16053ms.
> >>>>>> 2016-09-19 14:15:33,357 [BitServer-4] WARN
> >>>>>> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of
> >>>>>> rpc type 1 took longer than 500 ms. Actual Duration was 981ms.
> >>>>>>
> >>>>>> We really appreciate your help. I will dig into the source code for
> >>>>>> when and why this error happens.
> >>>>>>
> >>>>>> Francois
> >>>>>>
> >>>>>> P.S.:
> >>>>>> We also see this:
> >>>>>> 2016-09-19 14:48:23,444 [drill-executor-9] WARN
> >>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not found
> >>>>>> in the work bus.
> >>>>>> 2016-09-19 14:48:23,444 [drill-executor-11] WARN
> >>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not found
> >>>>>> in the work bus.
> >>>>>> 2016-09-19 14:48:23,444 [drill-executor-12] WARN
> >>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not found
> >>>>>> in the work bus.
> >>>>>> 2016-09-19 14:48:23,444 [drill-executor-10] WARN
> >>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not found
> >>>>>> in the work bus.
> >>>>>>
> >>>>>> On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:
> >>>>>>
> >>>>>>> Hi Francois,
> >>>>>>>
> >>>>>>> More questions..
> >>>>>>>
> >>>>>>>> + Can you share the query profile?
> >>>>>>>> I will sum it up:
> >>>>>>>> It is a select on 18 columns: 9 strings, 9 integers.
> >>>>>>>> The scan is done on 13862 parquet files spread over 1000 fragments.
> >>>>>>>> Fragments are spread across 215 nodes.
> >>>>>>>
> >>>>>>> So ~5 leaf fragments (or scanners) per Drillbit seems fine.
> >>>>>>>
> >>>>>>> + Does the query involve any aggregations or filters? Or is this a
> >>>>>>> select query with only projections?
> >>>>>>> + Any suspicious timings in the query profile?
> >>>>>>> + Any suspicious warning messages in the logs around the time of
> >>>>>>> failure on any of the drillbits? Especially on atsqa4-133.qa.lab?
> >>>>>>> Especially this one (".." are placeholders):
> >>>>>>> Message of mode .. of rpc type .. took longer than ..ms. Actual
> >>>>>>> duration was ..ms.
> >>>>>>>
> >>>>>>> Thank you,
> >>>>>>> Sudheesh
> >>>>>>>
> >>>>>>>> On Sep 15, 2016, at 11:27 AM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi Sudheesh,
> >>>>>>>>
> >>>>>>>> + How many zookeeper servers in the quorum?
> >>>>>>>> The quorum has 3 servers; everything looks healthy.
> >>>>>>>>
> >>>>>>>> + What is the load on atsqa4-133.qa.lab when this happens? Any
> >>>>>>>> other applications running on that node? How many threads is the
> >>>>>>>> Drill process using?
> >>>>>>>> The load on the failing node (8 cores) is 14 when Drill is
> >>>>>>>> running, which is nothing out of the ordinary according to our
> >>>>>>>> admin.
> >>>>>>>> HBase is also running.
> >>>>>>>> planner.width.max_per_node is set to 8.
> >>>>>>>>
> >>>>>>>> + When running the same query on 12 nodes, is the data size the
> >>>>>>>> same?
> >>>>>>>> Yes
> >>>>>>>>
> >>>>>>>> + Can you share the query profile?
> >>>>>>>> I will sum it up:
> >>>>>>>> It is a select on 18 columns: 9 strings, 9 integers.
> >>>>>>>> The scan is done on 13862 parquet files spread over 1000 fragments.
> >>>>>>>> Fragments are spread across 215 nodes.
> >>>>>>>>
> >>>>>>>> We are in the process of increasing our Zookeeper session timeout
> >>>>>>>> config to see if it helps.
> >>>>>>>>
> >>>>>>>> thanks
> >>>>>>>>
> >>>>>>>> Francois
> >>>>>>>>
> >>>>>>>> On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Francois,
> >>>>>>>>>
> >>>>>>>>> Few questions:
> >>>>>>>>> + How many zookeeper servers in the quorum?
> >>>>>>>>> + What is the load on atsqa4-133.qa.lab when this happens? Any
> >>>>>>>>> other applications running on that node? How many threads is the
> >>>>>>>>> Drill process using?
> >>>>>>>>> + When running the same query on 12 nodes, is the data size the
> >>>>>>>>> same?
> >>>>>>>>> + Can you share the query profile?
> >>>>>>>>>
> >>>>>>>>> This may not be the right thing to do, but for now, if the
> >>>>>>>>> cluster is heavily loaded, increase the zk timeout.
> >>>>>>>>>
> >>>>>>>>> Thank you,
> >>>>>>>>> Sudheesh
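> >>>>>>>>>
> >>>>>>>>> (The ZK session timeout is a boot-time option, set via
> >>>>>>>>> drill.exec.zk.timeout in drill-override.conf; the value in
> >>>>>>>>> effect should be visible with something like:
> >>>>>>>>>
> >>>>>>>>>   -- boot options whose names mention zk
> >>>>>>>>>   SELECT * FROM sys.boot WHERE name LIKE '%zk%';
> >>>>>>>>> )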
> >>>>>>>>>
> >>>>>>>>>> On Sep 14, 2016, at 11:53 AM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> We are running 1.7.
> >>>>>>>>>> The logs were taken from the jira ticket.
> >>>>>>>>>>
> >>>>>>>>>> We will try out 1.8 soon.
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <cch...@maprtech.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Looks like you are running 1.5. I believe there was some work
> >>>>>>>>>>> done in that area, and the newer releases should behave better.
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> We are trying to find a solution/workaround to this issue:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
> >>>>>>>>>>>> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR:
> >>>>>>>>>>>> ForemanException: One more more nodes lost connectivity during
> >>>>>>>>>>>> query. Identified nodes were [atsqa4-133.qa.lab:31010].
> >>>>>>>>>>>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> >>>>>>>>>>>> ForemanException: One more more nodes lost connectivity during
> >>>>>>>>>>>> query. Identified nodes were [atsqa4-133.qa.lab:31010].
> >>>>>>>>>>>> at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746)
> >>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>> at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858)
> >>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>> at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790)
> >>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>> at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792)
> >>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>> at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909)
> >>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>> at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110)
> >>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>> at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183)
> >>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>>
> >>>>>>>>>>>> DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325>
> >>>>>>>>>>>> ForemanException: One or more nodes lost connectivity during
> >>>>>>>>>>>> query
> >>>>>>>>>>>>
> >>>>>>>>>>>> Has anyone experienced this issue?
> >>>>>>>>>>>>
> >>>>>>>>>>>> It happens when running a query involving many parquet files
> >>>>>>>>>>>> on a cluster of 200 nodes. The same query on a smaller cluster
> >>>>>>>>>>>> of 12 nodes runs fine.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It is not caused by garbage collection (checked on both the ZK
> >>>>>>>>>>>> node and the involved drillbit).
> >>>>>>>>>>>>
> >>>>>>>>>>>> The negotiated max session timeout is 40 seconds.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The sequence seems to be:
> >>>>>>>>>>>> - The Drill query begins, using an existing ZK session.
> >>>>>>>>>>>> - Drill's ZK session times out
> >>>>>>>>>>>>   - perhaps it was writing something that took too long
> >>>>>>>>>>>> - Drill attempts to renew the session
> >>>>>>>>>>>> - Drill believes that the write operation failed, so it
> >>>>>>>>>>>>   attempts to re-create the zk node, which triggers another
> >>>>>>>>>>>>   exception.
> >>>>>>>>>>>>
> >>>>>>>>>>>> We are open to any suggestion. We will report any findings.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks
> >>>>>>>>>>>> Francois