Hi Xi,
Please create a JIRA if it takes longer to locate the issue. Did you
try a smaller k?
Best,
Xiangrui
On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen davidshe...@gmail.com wrote:
Hi Burak,
After I added .repartition(sc.defaultParallelism), I can see from the log
that the partition number is set to 32.
For the same amount of data, if I set k=500, the job finished in about
3 hrs. I wonder whether, if I set k=5000, the job would take 30 hrs...the
longest time I waited was 12 hrs...
If I use kmeans-random with the same amount of data and k=5000, the job
finishes in less than 2 hrs.
I think the current kmeans||
We tested large feature dimensions but not very large k
(https://github.com/databricks/spark-perf/blob/master/config/config.py.template#L525).
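As an aside, here is a sketch (my own illustration, not MLlib's implementation) of why the two initialization modes scale differently with k: "random" init just samples k points uniformly from the data, which stays cheap no matter how large k is, whereas k-means|| makes several passes that score every point against the current centers, so its cost grows with k. That is consistent with kmeans-random finishing much faster at k=5000:

```scala
import scala.util.Random

// Illustration only (not MLlib's code): random initialization samples
// k points uniformly from the data; its cost is independent of how the
// chosen k compares to the data size.
def randomInit(data: Seq[Array[Double]], k: Int, seed: Long): Seq[Array[Double]] =
  new Random(seed).shuffle(data).take(k)
```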
Again, please create a JIRA and post your test code and a link to your
test dataset so we can work on it. It is hard to track the issue with
multiple threads in
Hi,
When I run k-means clustering with Spark, I get these two lines at the end
of the log:
15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
Then it hangs for a long time. There's no active job. The driver machine is
Hi Burak,
After I added .repartition(sc.defaultParallelism), I can see from the log
that the partition number is set to 32. But in the Spark UI, it seems all
the data is loaded onto one executor. Previously it was loaded onto 4
executors.
Any idea?
Thanks,
David
On Fri, Mar 27, 2015 at 11:01
How do I get the number of cores that I specified at the command line? I
want to use it for spark.default.parallelism. I have 4 executors, each with
8 cores. According to
https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
the spark.default.parallelism value should be 4 * 8 = 32...I
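For what it's worth, the arithmetic in that doc page is just total cores across all executors when the property is not set explicitly (at runtime you can also read the live value from sc.defaultParallelism); a trivial sketch:

```scala
// Per the Spark configuration docs: for distributed shuffle operations,
// spark.default.parallelism defaults to the total number of cores on all
// executor nodes when it is not set explicitly.
val numExecutors = 4
val coresPerExecutor = 8
val defaultParallelism = numExecutors * coresPerExecutor  // 32
```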
Hi Burak,
My maxIterations is set to 500. But I think it should also stop when the
centroids converge, right?
My Spark version is 1.2.0, running on 64-bit Windows. My data set is about
40k vectors; each vector has about 300 features, all normalised. All worker
nodes have sufficient memory and disk space.
Oh, the job I talked about has run for more than 11 hrs without a result...it
doesn't make sense.
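For reference, a minimal sketch (my own, not MLlib's code) of the convergence test being described: training can stop before maxIterations once no centroid moves more than a small tolerance between consecutive iterations (MLlib exposes this tolerance via the KMeans epsilon parameter):

```scala
// Sketch: k-means is considered converged when every centroid moves less
// than epsilon (in Euclidean distance) between consecutive iterations.
def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

def converged(oldCenters: Seq[Array[Double]],
              newCenters: Seq[Array[Double]],
              epsilon: Double): Boolean =
  oldCenters.zip(newCenters).forall { case (o, n) => euclidean(o, n) < epsilon }
```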
On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com wrote:
Hi Burak,
The code is very simple.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("very/large/text/file") map { l =>
  // turn each line into a dense vector
  Vectors.dense(...)
}
// the resulting data set is about 40k vectors
KMeans.train(data, k=5000, maxIterations=500)
I just killed my application. In the log I found this: