If you turn spark.speculation on, that might help. It worked for me.
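[A minimal sketch of how speculation is typically enabled at submit time, for readers finding this thread later. These are standard Spark configuration keys; the class and jar names are placeholders, not from this thread, and the threshold values are illustrative only:]

```shell
# Enable speculative execution so straggler tasks are re-launched on other
# executors. quantile/multiplier are standard knobs; tune them for your job.
spark-submit \
  --conf spark.speculation=true \
  --conf spark.speculation.quantile=0.75 \
  --conf spark.speculation.multiplier=1.5 \
  --class com.example.YourDriver \
  your-app.jar
```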
On Sat, Jan 23, 2016 at 3:21 AM, Darren Govoni <dar...@ontrenet.com> wrote:

> Thanks for the tip. I will try it. But this is the kind of thing Spark is
> supposed to figure out and handle. Or at least not get stuck forever.
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
> -------- Original message --------
> From: Muthu Jayakumar <bablo...@gmail.com>
> Date: 01/22/2016 3:50 PM (GMT-05:00)
> To: Darren Govoni <dar...@ontrenet.com>, "Sanders, Isaac B"
> <sande...@rose-hulman.edu>, Ted Yu <yuzhih...@gmail.com>
> Cc: user@spark.apache.org
> Subject: Re: 10hrs of Scheduler Delay
>
> Does increasing the number of partitions help? You could try something
> like 3 times what you currently have.
> Another trick I used was to partition the problem into multiple
> dataframes, run them sequentially, persist the results, and then run a
> union on the results.
>
> Hope this helps.
>
> On Fri, Jan 22, 2016, 3:48 AM Darren Govoni <dar...@ontrenet.com> wrote:
>
>> Me too. I had to shrink my dataset to get it to work. For us, at least,
>> Spark seems to have scaling issues.
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>> -------- Original message --------
>> From: "Sanders, Isaac B" <sande...@rose-hulman.edu>
>> Date: 01/21/2016 11:18 PM (GMT-05:00)
>> To: Ted Yu <yuzhih...@gmail.com>
>> Cc: user@spark.apache.org
>> Subject: Re: 10hrs of Scheduler Delay
>>
>> I have run the driver on a smaller dataset (k=2, n=5000) and it worked
>> quickly and didn’t hang like this. This dataset is closer to k=10,
>> n=4.4m, but I am using more resources on this one.
>>
>> - Isaac
>>
>> On Jan 21, 2016, at 11:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> You may have seen the following on the github page:
>>
>> Latest commit 50fdf0e on Feb 22, 2015
>>
>> That was 11 months ago.
>>
>> Can you search for a similar algorithm that runs on Spark and is newer?
>>
>> If nothing is found, consider running the tests that come with the
>> project to determine whether the delay is intrinsic.
>>
>> Cheers
>>
>> On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B
>> <sande...@rose-hulman.edu> wrote:
>>
>>> That thread seems to be moving; it oscillates between a few different
>>> traces… Maybe it is working. It seems odd that it would take that long.
>>>
>>> This is 3rd-party code, and after looking at some of it, I think it
>>> might not be as Spark-y as it could be.
>>>
>>> I linked it below. I don’t know a lot about Spark, so it might be fine,
>>> but I have my suspicions.
>>>
>>> https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala
>>>
>>> - Isaac
>>>
>>> On Jan 21, 2016, at 10:08 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>> You may have noticed the following - did this indicate prolonged
>>> computation in your code?
>>>
>>> org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
>>> org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
>>> org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
>>> org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)
>>>
>>> On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B
>>> <sande...@rose-hulman.edu> wrote:
>>>
>>>> Hadoop is: HDP 2.3.2.0-2950
>>>>
>>>> Here is a gist (pastebin) of my versions en masse and a stacktrace:
>>>> https://gist.github.com/isaacsanders/2e59131758469097651b
>>>>
>>>> Thanks
>>>>
>>>> On Jan 21, 2016, at 7:44 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>> Looks like you were running on YARN.
>>>>
>>>> What Hadoop version are you using?
>>>>
>>>> Can you capture a few stack traces of the AppMaster during the delay
>>>> and pastebin them?
>>>>
>>>> Thanks
>>>>
>>>> On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B
>>>> <sande...@rose-hulman.edu> wrote:
>>>>
>>>>> The Spark version is 1.4.1.
>>>>>
>>>>> The logs are full of standard fare, nothing like an exception or even
>>>>> interesting [INFO] lines.
>>>>>
>>>>> Here is the script I am using:
>>>>> https://gist.github.com/isaacsanders/660f480810fbc07d4df2
>>>>>
>>>>> Thanks
>>>>> Isaac
>>>>>
>>>>> On Jan 21, 2016, at 11:03 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>
>>>>> Can you provide a bit more information?
>>>>>
>>>>> command line for submitting the Spark job
>>>>> version of Spark
>>>>> anything interesting from driver / executor logs?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B
>>>>> <sande...@rose-hulman.edu> wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> I am a CS student in the United States working on my senior thesis.
>>>>>>
>>>>>> My thesis uses Spark, and I am encountering some trouble.
>>>>>>
>>>>>> I am using https://github.com/alitouka/spark_dbscan, and to determine
>>>>>> parameters, I am using the utility class they supply,
>>>>>> org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.
>>>>>>
>>>>>> I am on a 10-node cluster with one machine with 8 cores and 32G of
>>>>>> memory and nine machines with 6 cores and 16G of memory.
>>>>>>
>>>>>> I have 442M of data, which seems like it would be a joke, but the job
>>>>>> stalls at the last stage.
>>>>>>
>>>>>> It was stuck in Scheduler Delay for 10 hours overnight, and I have
>>>>>> tried a number of things for the last couple of days, but nothing
>>>>>> seems to be helping.
>>>>>>
>>>>>> I have tried:
>>>>>> - Increasing heap sizes and numbers of cores
>>>>>> - More/fewer executors with different amounts of resources
>>>>>> - Kryo serialization
>>>>>> - FAIR scheduling
>>>>>>
>>>>>> It doesn’t seem like it should require this much. Any ideas?
>>>>>>
>>>>>> - Isaac
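[A rough sketch of the split-persist-union trick suggested earlier in the thread, as it might look on the Spark 1.x DataFrame API. `df`, `process`, and `runInPieces` are hypothetical placeholders, not code from this thread; an existing SparkContext/SQLContext is assumed:]

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Split the input into several pieces, process each piece on its own,
// persist and materialize each result so the pieces run one after another,
// then union the results back together at the end.
def runInPieces(df: DataFrame, pieces: Int)(process: DataFrame => DataFrame): DataFrame = {
  val parts = df.randomSplit(Array.fill(pieces)(1.0)) // equal-weight random split
  val results = parts.map { part =>
    val r = process(part).persist(StorageLevel.MEMORY_AND_DISK)
    r.count() // forces computation of this piece before the next one starts
    r
  }
  results.reduce(_ unionAll _) // unionAll on Spark 1.x; union on 2.x+
}
```

The partition-count suggestion from the same message can be tried independently, e.g. `df.repartition(3 * df.rdd.partitions.length)` before the expensive stage.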