Opening a JIRA is fine. See if you can capture a stack trace during the hung stage and attach it to the JIRA so that we have more clues.
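
For the trace itself, running `jstack <pid>` against the stuck driver or executor JVM a few times, several seconds apart, is usually enough to show whether threads are blocked or still moving. If attaching to the process is awkward, a rough in-process alternative is sketched below. It only sees threads of the JVM it runs in, and `ThreadDump` and `render` are names made up for illustration, not part of Spark.

```scala
import scala.collection.JavaConverters._

// Hypothetical helper (not a Spark API): dumps the stack of every live
// thread in the current JVM, roughly what jstack prints.
object ThreadDump {
  def render(): String = {
    val sb = new StringBuilder
    // Thread.getAllStackTraces returns a java.util.Map[Thread, Array[StackTraceElement]]
    for ((thread, frames) <- Thread.getAllStackTraces.asScala) {
      sb.append('"').append(thread.getName).append("\" state=")
        .append(thread.getState).append('\n')
      frames.foreach(frame => sb.append(s"    at $frame\n"))
      sb.append('\n')
    }
    sb.toString
  }

  def main(args: Array[String]): Unit = println(render())
}
```

Taking two or three dumps and comparing them helps tell a real deadlock (identical blocked frames every time) from slow but progressing work.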
Thanks

> On Jan 25, 2016, at 4:25 AM, Darren Govoni <dar...@ontrenet.com> wrote:
>
> Probably we should open a ticket for this. There's definitely a deadlock situation occurring in Spark under certain conditions.
>
> The only clue I have is that it always happens on the last stage. And it does seem sensitive to scale. If my job has 300 MB of data I'll see the deadlock. But if I only run 10 MB of it, it will succeed. This suggests a serious fundamental scaling problem.
>
> Workers have plenty of resources.
>
> -------- Original message --------
> From: "Sanders, Isaac B" <sande...@rose-hulman.edu>
> Date: 01/24/2016 2:54 PM (GMT-05:00)
> To: Renu Yadav <yren...@gmail.com>
> Cc: Darren Govoni <dar...@ontrenet.com>, Muthu Jayakumar <bablo...@gmail.com>, Ted Yu <yuzhih...@gmail.com>, user@spark.apache.org
> Subject: Re: 10hrs of Scheduler Delay
>
> I am not getting anywhere with any of the suggestions so far. :(
>
> Trying some more outlets, I will share any solution I find.
>
> - Isaac
>
>> On Jan 23, 2016, at 1:48 AM, Renu Yadav <yren...@gmail.com> wrote:
>>
>> If you turn spark.speculation on, that might help. It worked for me.
>>
>>> On Sat, Jan 23, 2016 at 3:21 AM, Darren Govoni <dar...@ontrenet.com> wrote:
>>> Thanks for the tip. I will try it. But this is the kind of thing Spark is supposed to figure out and handle. Or at least not get stuck forever.
>>>
>>> -------- Original message --------
>>> From: Muthu Jayakumar <bablo...@gmail.com>
>>> Date: 01/22/2016 3:50 PM (GMT-05:00)
>>> To: Darren Govoni <dar...@ontrenet.com>, "Sanders, Isaac B" <sande...@rose-hulman.edu>, Ted Yu <yuzhih...@gmail.com>
>>> Cc: user@spark.apache.org
>>> Subject: Re: 10hrs of Scheduler Delay
>>>
>>> Does increasing the number of partitions help? You could try something like 3 times what you currently have.
>>> Another trick I used was to partition the problem into multiple dataframes, run them sequentially, persist each result, and then run a union on the results.
>>>
>>> Hope this helps.
>>>
>>>> On Fri, Jan 22, 2016, 3:48 AM Darren Govoni <dar...@ontrenet.com> wrote:
>>>> Me too. I had to shrink my dataset to get it to work. For us at least, Spark seems to have scaling issues.
>>>>
>>>> -------- Original message --------
>>>> From: "Sanders, Isaac B" <sande...@rose-hulman.edu>
>>>> Date: 01/21/2016 11:18 PM (GMT-05:00)
>>>> To: Ted Yu <yuzhih...@gmail.com>
>>>> Cc: user@spark.apache.org
>>>> Subject: Re: 10hrs of Scheduler Delay
>>>>
>>>> I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am using more resources on this one.
>>>>
>>>> - Isaac
>>>>
>>>>> On Jan 21, 2016, at 11:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>
>>>>> You may have seen the following on the GitHub page:
>>>>>
>>>>> Latest commit 50fdf0e on Feb 22, 2015
>>>>>
>>>>> That was 11 months ago.
>>>>>
>>>>> Can you search for a similar algorithm which runs on Spark and is newer?
>>>>>
>>>>> If nothing is found, consider running the tests that come with the project to determine whether the delay is intrinsic.
>>>>>
>>>>> Cheers
>>>>>
>>>>>> On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B <sande...@rose-hulman.edu> wrote:
>>>>>> That thread seems to be moving; it oscillates between a few different traces… Maybe it is working. It seems odd that it would take that long.
>>>>>>
>>>>>> This is 3rd party code, and after looking at some of it, I think it might not be as Spark-y as it could be.
>>>>>>
>>>>>> I linked it below. I don’t know a lot about Spark, so it might be fine, but I have my suspicions.
>>>>>>
>>>>>> https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala
>>>>>>
>>>>>> - Isaac
>>>>>>
>>>>>>> On Jan 21, 2016, at 10:08 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>
>>>>>>> You may have noticed the following - did this indicate prolonged computation in your code?