Hey all,

I am a CS student in the United States working on my senior thesis.

My thesis uses Spark, and I am encountering some trouble.

I am using https://github.com/alitouka/spark_dbscan, and to choose clustering 
parameters, I am using the utility class it supplies, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.
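
For reference, here is roughly how I am launching it. This is a sketch of my 
own launcher, and the --ds-* argument names are an assumption on my part, based 
on how the project's README documents its other drivers, so treat them as 
approximate rather than the exact interface:

    // Sketch of my launcher; the --ds-* flag names are assumptions based on
    // the README's other drivers, not a documented interface.
    import org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver

    object ExploreEps {
      def main(args: Array[String]): Unit = {
        DistanceToNearestNeighborDriver.main(Array(
          "--ds-master", "spark://master:7077",             // assumed flag
          "--ds-jar", "/path/to/spark_dbscan_assembly.jar", // assumed flag
          "--ds-input", "hdfs:///path/to/input.csv",        // assumed flag
          "--ds-output", "hdfs:///path/to/nn_histogram"     // assumed flag
        ))
      }
    }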

I am on a 10-node cluster: one machine with 8 cores and 32 GB of memory, and 
nine machines with 6 cores and 16 GB of memory.

I have only 442 MB of data, which seems like a trivially small amount, but the 
job stalls at the last stage.

It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a 
number of things over the last couple of days, but nothing seems to help.

I have tried the following (rough config sketch below):
- Increasing heap sizes and numbers of cores
- More/fewer executors with different amounts of resources
- Kryo serialization
- FAIR scheduling
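
Concretely, the settings I have been varying look roughly like this; the 
specific values here are illustrative, not the only combinations I tried:

    // Illustrative SparkConf showing the knobs I have been turning; the
    // values shown are examples from one of many attempts.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("dbscan-parameter-exploration")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo
      .set("spark.scheduler.mode", "FAIR")   // FAIR scheduling
      .set("spark.executor.memory", "12g")   // varied between runs
      .set("spark.executor.cores", "4")      // varied between runs
      .set("spark.executor.instances", "9")  // varied between runs
    val sc = new SparkContext(conf)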

It doesn’t seem like a dataset this small should need this much hardware or 
tuning. Any ideas?

- Isaac
