You may have seen the following on the GitHub page: Latest commit 50fdf0e on Feb 22, 2015
That was 11 months ago. Can you search for a similar algorithm that runs on Spark and is newer? If nothing turns up, consider running the tests that come with the project to determine whether the delay is intrinsic.

Cheers

On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B <sande...@rose-hulman.edu> wrote:

> That thread seems to be moving; it oscillates between a few different
> traces… Maybe it is working. It seems odd that it would take that long.
>
> This is 3rd-party code, and after looking at some of it, I think it might
> not be as Spark-y as it could be.
>
> I linked it below. I don’t know a lot about Spark, so it might be fine,
> but I have my suspicions.
>
> https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala
>
> - Isaac
>
> On Jan 21, 2016, at 10:08 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> You may have noticed the following - did this indicate prolonged
> computation in your code?
>
> org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
> org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
> org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
> org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)
>
> On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B <sande...@rose-hulman.edu> wrote:
>
>> Hadoop is: HDP 2.3.2.0-2950
>>
>> Here is a gist (pastebin) of my versions en masse and a stacktrace:
>> https://gist.github.com/isaacsanders/2e59131758469097651b
>>
>> Thanks
>>
>> On Jan 21, 2016, at 7:44 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> Looks like you were running on YARN.
>>
>> What Hadoop version are you using?
>>
>> Can you capture a few stack traces of the AppMaster during the delay and
>> pastebin them?
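For reference, the kind of AppMaster stack-trace capture requested above can be sketched roughly as follows. This is a sketch, not a recipe from this thread: the process-name grep pattern and the number/spacing of dumps are assumptions, and the commands must run on the cluster node that hosts the ApplicationMaster.

```shell
# List running YARN applications to find the stalled job's application id
# (the id returned is cluster-specific).
yarn application -list -appStates RUNNING

# On the node hosting the Spark ApplicationMaster, locate its JVM pid.
# The grep pattern is an assumption about how the process is named in jps.
AM_PID=$(jps | grep ApplicationMaster | awk '{print $1}')

# Dump a handful of stack traces a few seconds apart; frames that repeat
# across dumps show where the job is actually spending its time.
for i in 1 2 3; do
  jstack "$AM_PID" > "appmaster-stack-$i.txt"
  sleep 10
done
```

Comparing several dumps taken during the stall is the point: a single trace can catch a transient state, but a frame that appears in every dump (as the `EuclideanDistance.compute` frames later in this thread do) is a genuine hot spot.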
>>
>> Thanks
>>
>> On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B <sande...@rose-hulman.edu> wrote:
>>
>>> The Spark version is 1.4.1.
>>>
>>> The logs are full of standard fare - nothing like an exception or even
>>> an interesting [INFO] line.
>>>
>>> Here is the script I am using:
>>> https://gist.github.com/isaacsanders/660f480810fbc07d4df2
>>>
>>> Thanks
>>> Isaac
>>>
>>> On Jan 21, 2016, at 11:03 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>> Can you provide a bit more information?
>>>
>>> - command line for submitting the Spark job
>>> - version of Spark
>>> - anything interesting from driver / executor logs?
>>>
>>> Thanks
>>>
>>> On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B <sande...@rose-hulman.edu> wrote:
>>>
>>>> Hey all,
>>>>
>>>> I am a CS student in the United States working on my senior thesis.
>>>>
>>>> My thesis uses Spark, and I am encountering some trouble.
>>>>
>>>> I am using https://github.com/alitouka/spark_dbscan, and to determine
>>>> parameters, I am using the utility class it supplies,
>>>> org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.
>>>>
>>>> I am on a 10-node cluster with one machine with 8 cores and 32G of
>>>> memory and nine machines with 6 cores and 16G of memory.
>>>>
>>>> I have 442M of data, which seems like it would be a joke, but the job
>>>> stalls at the last stage.
>>>>
>>>> It was stuck in Scheduler Delay for 10 hours overnight, and I have
>>>> tried a number of things over the last couple of days, but nothing
>>>> seems to be helping.
>>>>
>>>> I have tried:
>>>> - Increasing heap sizes and numbers of cores
>>>> - More/fewer executors with different amounts of resources
>>>> - Kryo serialization
>>>> - FAIR scheduling
>>>>
>>>> It doesn’t seem like it should require this much. Any ideas?
>>>>
>>>> - Isaac
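For context, the Kryo and FAIR-scheduling settings mentioned in the original message are normally passed to spark-submit roughly as below. This is a hypothetical invocation in the Spark 1.4.x flag style: the jar name, class choice, and resource numbers are placeholders, not taken from Isaac's actual script (which is in the gist linked earlier).

```shell
# Hypothetical spark-submit invocation; jar name and resource sizes are
# placeholders, not the real script from the gist.
spark-submit \
  --master yarn-client \
  --class org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver \
  --num-executors 9 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.scheduler.mode=FAIR \
  spark_dbscan-assembly.jar
```

Two caveats worth noting: FAIR scheduling only affects how concurrent jobs within one application share resources, so it cannot speed up a single stalled stage; and if the captured stack traces keep landing in `EuclideanDistance.compute`, the time is going into the pairwise distance scan itself, which no serializer or scheduler setting will change.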