Does increasing the number of partitions help? You could try something
like 3 times what you currently have.
Another trick I used was to split the problem into multiple DataFrames,
run them sequentially, persist each result, and then run a union on
the results.
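
In case it helps, here is a rough sketch of both ideas in PySpark. It is
not taken from your job: process_in_chunks, with_more_partitions and the
transform argument are names I made up for illustration, and the code only
assumes that df behaves like a pyspark.sql.DataFrame (randomSplit, persist,
count, unionAll, repartition and rdd.getNumPartitions).

```python
from functools import reduce
from typing import Any, Callable


def process_in_chunks(df: Any, transform: Callable[[Any], Any],
                      num_chunks: int = 4, seed: int = 42) -> Any:
    """Split df into pieces, process them one at a time, and union the results.

    `df` is assumed to behave like a pyspark.sql.DataFrame; `transform` is a
    hypothetical per-chunk computation supplied by the caller.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")

    # randomSplit takes relative weights; equal weights give roughly equal chunks.
    chunks = df.randomSplit([1.0] * num_chunks, seed)

    results = []
    for chunk in chunks:
        result = transform(chunk).persist()
        # persist() is lazy, so count() forces this chunk to be computed
        # and cached before the next one starts.
        result.count()
        results.append(result)

    # unionAll is the Spark 1.x name; Spark 2.0+ also offers union().
    return reduce(lambda left, right: left.unionAll(right), results)


def with_more_partitions(df: Any, factor: int = 3) -> Any:
    """Repartition df to `factor` times its current partition count."""
    return df.repartition(df.rdd.getNumPartitions() * factor)
```

With a real DataFrame this might be called as
process_in_chunks(with_more_partitions(points_df), my_transform, num_chunks=8),
where points_df and my_transform are placeholders for your own data and
computation. The chunk count and the factor of 3 are starting points to
tune, not recommendations.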

Hope this helps.

On Fri, Jan 22, 2016, 3:48 AM Darren Govoni <dar...@ontrenet.com> wrote:

> Me too. I had to shrink my dataset to get it to work. For us at least
> Spark seems to have scaling issues.
>
> -------- Original message --------
> From: "Sanders, Isaac B" <sande...@rose-hulman.edu>
> Date: 01/21/2016 11:18 PM (GMT-05:00)
> To: Ted Yu <yuzhih...@gmail.com>
> Cc: user@spark.apache.org
> Subject: Re: 10hrs of Scheduler Delay
>
> I have run the driver on a smaller dataset (k=2, n=5000) and it worked
> quickly and didn’t hang like this. This dataset is closer to k=10, n=4.4m,
> but I am using more resources on this one.
>
> - Isaac
>
> On Jan 21, 2016, at 11:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> You may have seen the following on the GitHub page:
>
> Latest commit 50fdf0e  on Feb 22, 2015
>
> That was 11 months ago.
>
> Can you search for a similar algorithm that runs on Spark and is newer?
>
> If nothing is found, consider running the project's own tests to
> determine whether the delay is intrinsic.
>
> Cheers
>
> On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B <
> sande...@rose-hulman.edu> wrote:
>
>> That thread seems to be moving; it oscillates between a few different
>> traces… Maybe it is working. It seems odd that it would take that long.
>>
>> This is 3rd party code, and after looking at some of it, I think it might
>> not be as Spark-y as it could be.
>>
>> I linked it below. I don’t know a lot about spark, so it might be fine,
>> but I have my suspicions.
>>
>>
>> https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala
>>
>> - Isaac
>>
>> On Jan 21, 2016, at 10:08 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> You may have noticed the following. Does it indicate prolonged
>> computation in your code?
>>
>> org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
>> org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
>> org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
>> org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)
>>
>>
>> On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B <
>> sande...@rose-hulman.edu> wrote:
>>
>>> Hadoop is: HDP 2.3.2.0-2950
>>>
>>> Here is a gist (pastebin) of my versions en masse and a stacktrace:
>>> https://gist.github.com/isaacsanders/2e59131758469097651b
>>>
>>> Thanks
>>>
>>> On Jan 21, 2016, at 7:44 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>> Looks like you were running on YARN.
>>>
>>> What Hadoop version are you using?
>>>
>>> Can you capture a few stack traces of the AppMaster during the delay and
>>> pastebin them?
>>>
>>> Thanks
>>>
>>> On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B <
>>> sande...@rose-hulman.edu> wrote:
>>>
>>>> The Spark Version is 1.4.1
>>>>
>>>> The logs are full of standard fare, nothing like an exception or even
>>>> interesting [INFO] lines.
>>>>
>>>> Here is the script I am using:
>>>> https://gist.github.com/isaacsanders/660f480810fbc07d4df2
>>>>
>>>> Thanks
>>>> Isaac
>>>>
>>>> On Jan 21, 2016, at 11:03 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>> Can you provide a bit more information?
>>>>
>>>> - command line used to submit the Spark job
>>>> - version of Spark
>>>> - anything interesting from the driver / executor logs
>>>>
>>>> Thanks
>>>>
>>>> On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B <
>>>> sande...@rose-hulman.edu> wrote:
>>>>
>>>>> Hey all,
>>>>>
>>>>> I am a CS student in the United States working on my senior thesis.
>>>>>
>>>>> My thesis uses Spark, and I am encountering some trouble.
>>>>>
>>>>> I am using https://github.com/alitouka/spark_dbscan, and to determine
>>>>> parameters, I am using the utility class they supply,
>>>>> org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.
>>>>>
>>>>> I am on a 10 node cluster with one machine with 8 cores and 32G of
>>>>> memory and nine machines with 6 cores and 16G of memory.
>>>>>
>>>>> I have 442M of data, which seems like it would be a joke, but the job
>>>>> stalls at the last stage.
>>>>>
>>>>> It was stuck in Scheduler Delay for 10 hours overnight, and I have
>>>>> tried a number of things for the last couple days, but nothing seems to be
>>>>> helping.
>>>>>
>>>>> I have tried:
>>>>> - Increasing heap sizes and numbers of cores
>>>>> - More/fewer executors with different amounts of resources
>>>>> - Kryo Serialization
>>>>> - FAIR Scheduling
>>>>>
>>>>> It doesn’t seem like it should require this much. Any ideas?
>>>>>
>>>>> - Isaac
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
