I am surprised that it's not the default.
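
For anyone who wants to try the same change, it should be possible to apply
it at runtime from sqlline (a sketch; setting it per session with ALTER
SESSION should work too):

    -- raise the data-locality affinity, then confirm the value took effect
    ALTER SYSTEM SET `planner.affinity_factor` = 100;
    SELECT * FROM sys.options WHERE name = 'planner.affinity_factor';
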
On Fri, Oct 14, 2016 at 11:18 AM, Sudheesh Katkam <skat...@maprtech.com> wrote:

> Hi Francois,
>
> Thank you for posting your findings! Glad to see a 10X improvement.
>
> By increasing the affinity factor, it looks like Drill's parallelizer is
> forced to assign fragments to the nodes holding the data, i.e. with high
> favorability for data locality.
>
> Regarding the random disconnections, I agree with your guess that the
> network bandwidth is being used up by remote reads, which causes lags in
> drillbit-to-ZooKeeper heartbeats (since these services share the same
> network)? Maybe others can comment here.
>
> Thank you,
> Sudheesh
>
> > On Oct 12, 2016, at 6:06 PM, François Méthot <fmetho...@gmail.com> wrote:
> >
> > Hi,
> >
> > We finally got rid of this error. We tried many, many things (like
> > modifying Drill to ignore the error!), but it ultimately came down to
> > this change:
> >
> > from the default
> >   planner.affinity_factor=1.2
> > to
> >   planner.affinity_factor=100
> >
> > Basically this encourages fragments to only care about locally stored
> > files. We looked at the code that uses that property and figured that
> > 100 would have a strong impact.
> >
> > What led us to this property is the fact that 1/4 of our fragments would
> > take a lot more time to complete their scan, up to 10x the time of the
> > fastest nodes. On the slower nodes, Cloudera Manager would show very low
> > disk IOPS with high network IO compared to our faster nodes. We had
> > noticed that before but figured it was an optimization to be done later,
> > once more pressing issues were fixed, like the ZK disconnections and
> > OOMs. We were desperate and decided to fix anything that looked unusual.
> >
> > After this change, queries ran up to 10x faster.
> >
> > We no longer get random disconnections between nodes and ZooKeeper.
> >
> > We are still wondering why exactly. The network should not be a
> > bottleneck. Could high network traffic between a Drillbit and HDFS cause
> > the Drillbit to time out with ZooKeeper?
> >
> > On Fri, Sep 30, 2016 at 4:21 PM, François Méthot <fmetho...@gmail.com> wrote:
> >
> >> After the 30-second gap, all the Drill nodes receive the following:
> >>
> >> 2016-09-26 20:07:38,629 [Curator-ServiceCache-0] Debug Active drillbit set
> >> changed. Now includes 220 total bits. New Active drill bits....
> >> ...faulty node is not on the list...
> >> 2016-09-26 20:07:38,897 [Curator-ServiceCache-0] Debug Active drillbit set
> >> changed. Now includes 221 total bits. New Active drill bits....
> >> ...faulty node is back on the list...
> >>
> >> So the faulty Drill node gets unregistered, then registered right after.
> >>
> >> Drill is using the low-level API for registering and unregistering, and
> >> the only place where unregistering occurs is when the drillbit is closed
> >> at shutdown.
> >>
> >> That particular drillbit was still up and running after those logs, so
> >> it could not have triggered the unregistering process through a
> >> shutdown.
> >>
> >> Would you have an idea what else could cause a Drillbit to be
> >> unregistered from the DiscoveryService and registered again right after?
> >>
> >> We are using Zookeeper 3.4.5
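> >>
> >> If it helps anyone watch for the same flapping: the registered set can
> >> also be polled from SQL while reproducing, something like (assuming the
> >> sys.drillbits system table is available in your version):
> >>
> >>   -- lists the drillbits currently registered in ZooKeeper
> >>   SELECT * FROM sys.drillbits;
> >>
> >> run in a loop and diffed against the expected host list.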
> >>
> >> On Wed, Sep 28, 2016 at 10:36 AM, François Méthot <fmetho...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> We have switched to 1.8 and we are still getting node disconnections.
> >>>
> >>> We did many tests. We initially thought our standalone parquet
> >>> converter was generating parquet files with problematic data (like
> >>> 10K-character strings), but we were able to reproduce the problem with
> >>> the employee data from the tutorial.
> >>>
> >>> For example, we duplicated the Drill tutorial "Employee" data to reach
> >>> 500 M records spread over 130 parquet files.
> >>> Each file is ~60 MB.
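> >>>
> >>> (Roughly how such a test set can be built; a sketch that assumes the
> >>> hdfs.tmp workspace is writable, and that the CTAS is repeated/unioned
> >>> enough times to reach 500 M rows:
> >>>
> >>>   -- write parquet output, then copy the bundled tutorial data
> >>>   ALTER SESSION SET `store.format` = 'parquet';
> >>>   CREATE TABLE hdfs.tmp.`PARQUET_EMPLOYEE` AS
> >>>   SELECT * FROM cp.`employee.json`;
> >>> )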
> >>>
> >>> We ran this query over and over in 5 different sessions using a script:
> >>>
> >>>   select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like '%does not exist%';
> >>>
> >>> The query returns no rows and would take ~35 to 45 seconds to complete.
> >>>
> >>> Leaving the script running on each node, we eventually hit the "nodes
> >>> lost connectivity during query" error.
> >>>
> >>> On the node that failed, we see these logs:
> >>>
> >>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO
> >>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:10: State to report:
> >>> RUNNING
> >>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG
> >>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on server064:31010
> >>>
> >>> <--- 30 seconds gap for that fragment --->
> >>>
> >>> 2016-09-26 20:37:09,976 [BitServer-2] WARN
> >>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc
> >>> type 2 took longer than 500 ms. Actual duration was 23617ms.
> >>>
> >>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
> >>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
> >>> remaining: 0 incoming schema BatchSchema [...., selectionVector=TWO_BYTE]
> >>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
> >>> o.a.d.exec.rpc.control.WorkEventBus - Cancelling and removing fragment
> >>> manager : ...uuid...
> >>>
> >>> For the same query on a working node:
> >>>
> >>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] INFO
> >>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:2: State to report:
> >>> RUNNING
> >>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] DEBUG
> >>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:2 on server125:31010
> >>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
> >>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
> >>> remaining: 0 incoming schema BatchSchema [...., selectionVector=TWO_BYTE]
> >>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
> >>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
> >>> remaining: 0 incoming schema BatchSchema [...., selectionVector=TWO_BYTE]
> >>> 2016-09-26 20:07:11,005 [...uuid...frag:1:2] DEBUG
> >>> o.a.d.e.s.p.c.ParquetRecordReader - Read 87573 records out of row
> >>> groups(0) in file `/data/drill/tmp/PARQUET_EMPLOYEE/0_0_14.parquet
> >>>
> >>> We are investigating what could cause that 30-second gap for that
> >>> fragment.
> >>>
> >>> If you have any ideas, let us know.
> >>>
> >>> Thanks
> >>> Francois
> >>>
> >>> On Mon, Sep 19, 2016 at 2:59 PM, François Méthot <fmetho...@gmail.com> wrote:
> >>>
> >>>> Hi Sudheesh,
> >>>>
> >>>> If I add a selection filter so that no rows are returned, the same
> >>>> problem occurs. I also simplified the query to include only a few
> >>>> integer columns.
> >>>>
> >>>> That particular data repo is ~200+ billion records spread over
> >>>> ~50 000 parquet files.
> >>>>
> >>>> We have other CSV data repos that are 100x smaller that do not
> >>>> trigger this issue.
> >>>>
> >>>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this
> >>>> case? There is also a bizarre case where the node that is reported as
> >>>> lost is the node itself.
> >>>> Yes, the stack trace is from the ticket. It did occur once or twice
> >>>> (in the many, many attempts) that it was the node itself.
> >>>>
> >>>> + Is there a spike in memory usage of the Drillbit that is the
> >>>> Foreman for the query (process memory, not just heap)?
> >>>> We don't notice any unusual spike; each node gets busy in the same
> >>>> range when the query is running.
> >>>>
> >>>> I tried running with 8GB/20GB and 4GB/24GB heap/off-heap, and did not
> >>>> see any improvement.
> >>>>
> >>>> We will update from 1.7 to 1.8 before going ahead with more
> >>>> investigation.
> >>>>
> >>>> Thanks a lot.
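> >>>>
> >>>> (After the upgrade, the build each drillbit is actually running can
> >>>> presumably be confirmed with something like:
> >>>>
> >>>>   -- shows version and commit id of the connected drillbit
> >>>>   SELECT * FROM sys.version;
> >>>>
> >>>> assuming the sys.version table is available in this release.)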
> >>>>
> >>>> On Mon, Sep 19, 2016 at 1:19 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:
> >>>>
> >>>>> Hi Francois,
> >>>>>
> >>>>> A simple query with only projections is not an "ideal" use case,
> >>>>> since Drill is bound by how fast the client can consume records.
> >>>>> There are 1000 scanners sending data to 1 client (vs. far fewer
> >>>>> scanners sending data in the 12-node case).
> >>>>>
> >>>>> This might increase the load on the Drillbit that is the Foreman for
> >>>>> the query. In the query profile, the scanners should be spending a
> >>>>> lot more time "waiting" to send records to the client (via the root
> >>>>> fragment).
> >>>>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this
> >>>>> case? There is also a bizarre case where the node that is reported
> >>>>> as lost is the node itself.
> >>>>> + Is there a spike in memory usage of the Drillbit that is the
> >>>>> Foreman for the query (process memory, not just heap)?
> >>>>>
> >>>>> Regarding the warnings ...
> >>>>>
> >>>>>> 2016-09-19 13:31:56,866 [BitServer-7] WARN
> >>>>>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of
> >>>>>> rpc type 6 took longer than 500 ms. Actual Duration was 16053ms.
> >>>>>
> >>>>> RPC type 6 is a cancellation request; DRILL-4766 [2] should help in
> >>>>> this case, and it is resolved in the latest version of Drill. So as
> >>>>> Chun suggested, upgrade the cluster to the latest version of Drill.
> >>>>>
> >>>>>> 2016-09-19 14:15:33,357 [BitServer-4] WARN
> >>>>>> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of
> >>>>>> rpc type 1 took longer than 500 ms. Actual Duration was 981ms.
> >>>>>
> >>>>> I am surprised that responses are taking that long to handle.
> >>>>> + Are both messages on the same Drillbit?
> >>>>>
> >>>>> The other warnings can be ignored.
> >>>>>
> >>>>> Thank you,
> >>>>> Sudheesh
> >>>>>
> >>>>> [1] I just realized that atsqa4-133.qa.lab is in one of our test
> >>>>> environments :)
> >>>>> [2] https://issues.apache.org/jira/browse/DRILL-4766
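> >>>>>
> >>>>> (To eyeball process memory per drillbit, if the sys.memory table is
> >>>>> available in your build:
> >>>>>
> >>>>>   -- heap and direct memory usage, one row per drillbit
> >>>>>   SELECT * FROM sys.memory;
> >>>>>
> >>>>> comparing the Foreman's row against the others should show a spike.)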
> >>>>>
> >>>>>> On Sep 19, 2016, at 9:15 AM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Sudheesh,
> >>>>>>
> >>>>>> + Does the query involve any aggregations or filters? Or is this a
> >>>>>> select query with only projections?
> >>>>>> Simple query with only projections.
> >>>>>>
> >>>>>> + Any suspicious timings in the query profile?
> >>>>>> Nothing specially different from our working queries on our small
> >>>>>> cluster.
> >>>>>>
> >>>>>> + Any suspicious warning messages in the logs around the time of
> >>>>>> failure on any of the drillbits? Especially on atsqa4-133.qa.lab?
> >>>>>> Especially this one (".." are placeholders):
> >>>>>> Message of mode .. of rpc type .. took longer than ..ms. Actual
> >>>>>> duration was ..ms.
> >>>>>>
> >>>>>> We do see this warning on the failing node (on my last test); in
> >>>>>> fact, I found this WARNING in our logs for the past month on pretty
> >>>>>> much every node I checked:
> >>>>>> 2016-09-19 13:31:56,866 [BitServer-7] WARN
> >>>>>> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of
> >>>>>> rpc type 6 took longer than 500 ms. Actual Duration was 16053ms.
> >>>>>> 2016-09-19 14:15:33,357 [BitServer-4] WARN
> >>>>>> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of
> >>>>>> rpc type 1 took longer than 500 ms. Actual Duration was 981ms.
> >>>>>>
> >>>>>> We really appreciate your help. I will dig into the source code for
> >>>>>> when and why this error happens.
> >>>>>>
> >>>>>> Francois
> >>>>>>
> >>>>>> P.S.:
> >>>>>> We also see this:
> >>>>>> 2016-09-19 14:48:23,444 [drill-executor-9] WARN
> >>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ......:1:2 not found
> >>>>>> in the work bus.
> >>>>>> 2016-09-19 14:48:23,444 [drill-executor-11] WARN
> >>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:222 not found
> >>>>>> in the work bus.
> >>>>>> 2016-09-19 14:48:23,444 [drill-executor-12] WARN
> >>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:442 not found
> >>>>>> in the work bus.
> >>>>>> 2016-09-19 14:48:23,444 [drill-executor-10] WARN
> >>>>>> o.a.d.exec.rpc.control.WorkEventBus - Fragment ....:1:662 not found
> >>>>>> in the work bus.
> >>>>>>
> >>>>>> On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:
> >>>>>>
> >>>>>>> Hi Francois,
> >>>>>>>
> >>>>>>> More questions..
> >>>>>>>
> >>>>>>>> + Can you share the query profile?
> >>>>>>>> I will sum it up:
> >>>>>>>> It is a select on 18 columns: 9 strings, 9 integers.
> >>>>>>>> The scan is done on 13862 parquet files spread over 1000 fragments.
> >>>>>>>> Fragments are spread across 215 nodes.
> >>>>>>>
> >>>>>>> So ~5 leaf fragments (or scanners) per Drillbit seems fine.
> >>>>>>>
> >>>>>>> + Does the query involve any aggregations or filters? Or is this a
> >>>>>>> select query with only projections?
> >>>>>>> + Any suspicious timings in the query profile?
> >>>>>>> + Any suspicious warning messages in the logs around the time of
> >>>>>>> failure on any of the drillbits? Especially on atsqa4-133.qa.lab?
> >>>>>>> Especially this one (".." are placeholders):
> >>>>>>> Message of mode .. of rpc type .. took longer than ..ms. Actual
> >>>>>>> duration was ..ms.
> >>>>>>>
> >>>>>>> Thank you,
> >>>>>>> Sudheesh
> >>>>>>>
> >>>>>>>> On Sep 15, 2016, at 11:27 AM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi Sudheesh,
> >>>>>>>>
> >>>>>>>> + How many zookeeper servers in the quorum?
> >>>>>>>> The quorum has 3 servers; everything looks healthy.
> >>>>>>>>
> >>>>>>>> + What is the load on atsqa4-133.qa.lab when this happens? Any
> >>>>>>>> other applications running on that node? How many threads is the
> >>>>>>>> Drill process using?
> >>>>>>>> The load on the failing node (8 cores) is 14 when Drill is
> >>>>>>>> running, which is nothing out of the ordinary according to our
> >>>>>>>> admin.
> >>>>>>>> HBase is also running.
> >>>>>>>> planner.width.max_per_node is set to 8.
> >>>>>>>>
> >>>>>>>> + When running the same query on 12 nodes, is the data size the
> >>>>>>>> same?
> >>>>>>>> Yes
> >>>>>>>>
> >>>>>>>> + Can you share the query profile?
> >>>>>>>> I will sum it up:
> >>>>>>>> It is a select on 18 columns: 9 strings, 9 integers.
> >>>>>>>> The scan is done on 13862 parquet files spread over 1000 fragments.
> >>>>>>>> Fragments are spread across 215 nodes.
> >>>>>>>>
> >>>>>>>> We are in the process of increasing our Zookeeper session timeout
> >>>>>>>> config to see if it helps.
> >>>>>>>>
> >>>>>>>> thanks
> >>>>>>>>
> >>>>>>>> Francois
> >>>>>>>>
> >>>>>>>> On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Francois,
> >>>>>>>>>
> >>>>>>>>> Few questions:
> >>>>>>>>> + How many zookeeper servers in the quorum?
> >>>>>>>>> + What is the load on atsqa4-133.qa.lab when this happens? Any
> >>>>>>>>> other applications running on that node? How many threads is the
> >>>>>>>>> Drill process using?
> >>>>>>>>> + When running the same query on 12 nodes, is the data size the
> >>>>>>>>> same?
> >>>>>>>>> + Can you share the query profile?
> >>>>>>>>>
> >>>>>>>>> This may not be the right thing to do, but for now, if the
> >>>>>>>>> cluster is heavily loaded, increase the zk timeout.
> >>>>>>>>>
> >>>>>>>>> Thank you,
> >>>>>>>>> Sudheesh
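> >>>>>>>>>
> >>>>>>>>> (The ZK session timeout is a boot-time option, set via
> >>>>>>>>> drill.exec.zk.timeout in drill-override.conf; the value in
> >>>>>>>>> effect should be visible with something like:
> >>>>>>>>>
> >>>>>>>>>   -- boot options whose names mention zk
> >>>>>>>>>   SELECT * FROM sys.boot WHERE name LIKE '%zk%';
> >>>>>>>>> )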
> >>>>>>>>>
> >>>>>>>>>> On Sep 14, 2016, at 11:53 AM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> We are running 1.7.
> >>>>>>>>>> The logs were taken from the jira ticket.
> >>>>>>>>>>
> >>>>>>>>>> We will try out 1.8 soon.
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <cch...@maprtech.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Looks like you are running 1.5. I believe there was some work
> >>>>>>>>>>> done in that area, and the newer releases should behave better.
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <fmetho...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> We are trying to find a solution/workaround to this issue:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
> >>>>>>>>>>>> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR:
> >>>>>>>>>>>> ForemanException: One more more nodes lost connectivity during
> >>>>>>>>>>>> query. Identified nodes were [atsqa4-133.qa.lab:31010].
> >>>>>>>>>>>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> >>>>>>>>>>>> ForemanException: One more more nodes lost connectivity during
> >>>>>>>>>>>> query. Identified nodes were [atsqa4-133.qa.lab:31010].
> >>>>>>>>>>>> at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746)
> >>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>> at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858)
> >>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>> at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790)
> >>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>> at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792)
> >>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>> at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909)
> >>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>> at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110)
> >>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>> at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183)
> >>>>>>>>>>>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>>>>>>>>>>
> >>>>>>>>>>>> DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325>
> >>>>>>>>>>>> ForemanException: One or more nodes lost connectivity during
> >>>>>>>>>>>> query
> >>>>>>>>>>>>
> >>>>>>>>>>>> Has anyone experienced this issue?
> >>>>>>>>>>>>
> >>>>>>>>>>>> It happens when running a query involving many parquet files
> >>>>>>>>>>>> on a cluster of 200 nodes. The same query on a smaller cluster
> >>>>>>>>>>>> of 12 nodes runs fine.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It is not caused by garbage collection (checked on both the ZK
> >>>>>>>>>>>> node and the involved drillbit).
> >>>>>>>>>>>>
> >>>>>>>>>>>> The negotiated max session timeout is 40 seconds.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The sequence seems to be:
> >>>>>>>>>>>> - The Drill query begins, using an existing ZK session.
> >>>>>>>>>>>> - Drill's ZK session times out
> >>>>>>>>>>>>   - perhaps it was writing something that took too long
> >>>>>>>>>>>> - Drill attempts to renew the session
> >>>>>>>>>>>> - Drill believes that the write operation failed, so it
> >>>>>>>>>>>>   attempts to re-create the zk node, which triggers another
> >>>>>>>>>>>>   exception.
> >>>>>>>>>>>>
> >>>>>>>>>>>> We are open to any suggestion. We will report any findings.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks
> >>>>>>>>>>>> Francois