Re: k-means can only run on one executor with one thread?

2015-03-30 Thread Xiangrui Meng
Hey Xi,

Have you tried Spark 1.3.0? The initialization happens on the driver node,
and we fixed an issue with the initialization in 1.3.0. Again, please start
with a smaller k and increase it gradually. Let us know at what k the
problem happens.

Best,
Xiangrui
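
The advice above (start with a small k and raise it gradually) could be
scripted as a small timing loop. This is only an illustrative sketch:
first_slow_k, train_fn, limit_seconds and clock are hypothetical names, not
part of Spark, and train_fn stands in for whatever function wraps your own
KMeans.train() call.

```python
import time


def first_slow_k(train_fn, k_values, limit_seconds, clock=time.perf_counter):
    """Run train_fn(k) for each k in ascending order.

    Returns (k, elapsed_seconds) for the first k whose run takes longer
    than limit_seconds, or None if every run finishes within the limit.
    train_fn is a hypothetical callable supplied by the caller.
    """
    for k in sorted(k_values):
        start = clock()
        train_fn(k)  # e.g. a wrapper around your own KMeans.train() call
        elapsed = clock() - start
        if elapsed > limit_seconds:
            return k, elapsed
    return None
```

Reporting the k at which the run first becomes slow, as returned by a loop
like this, is the kind of data point requested above.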


Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Xi Shen
I have posted more details of my problem at
http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed

I would really appreciate it if you could take a look at this problem. I
have tried various settings and ways to load/partition my data, but I just
cannot get rid of that long pause.


Thanks,
David








Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Reza Zadeh
How many dimensions does your data have? The size of the k-means model is
k * d, where d is the dimension of the data.

Since you're using k=1000, if your data has dimension higher than, say,
10,000, you will have trouble, because k * d doubles have to fit in the
driver.

Reza
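
To put rough numbers on the k * d estimate above, here is a minimal sketch
that counts only the raw doubles in the centroids (8 bytes each). The
function names are illustrative, not Spark APIs, and real JVM memory use
will be higher because of object overhead and temporary copies.

```python
BYTES_PER_DOUBLE = 8  # IEEE 754 double precision


def model_size_bytes(k, d):
    """Raw bytes needed to hold k centroids of d doubles each."""
    return k * d * BYTES_PER_DOUBLE


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 * 1024)


# (k, d) pairs: the threshold example above, and the figures reported
# elsewhere in this thread (5000 clusters, vectors of about 360 dimensions).
for k, d in [(1000, 10_000), (5000, 360)]:
    size = model_size_bytes(k, d)
    print(f"k={k}, d={d}: {size} bytes (~{to_mib(size):.1f} MiB)")
```

By this count, k=1000 with d=10,000 is 80,000,000 bytes of centroid data,
while k=5000 with d=360 is 14,400,000 bytes. This is a lower bound on the
driver memory the model needs, not a prediction of actual usage.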


Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Xi Shen
My vector dimension is about 360. The data count is about 270k. My driver
has 2.9 GB of memory. I have attached a screenshot of the current executor
status. I submitted this job with --master yarn-cluster. I have a total of 7
worker nodes, one of which acts as the driver. In the screenshot, you can see
that all worker nodes have loaded some data, but the driver is not loaded
with any data.

But the funny thing is, when I log on to the driver and check its CPU and
memory status, I see one Java process using about 18% of CPU and about
1.6 GB of memory.

[image: Inline image 1]


Re: k-means can only run on one executor with one thread?

2015-03-27 Thread Joseph Bradley
Can you try specifying the number of partitions when you load the data to
equal the number of executors?  If your ETL changes the number of
partitions, you can also repartition before calling KMeans.



Re: k-means can only run on one executor with one thread?

2015-03-27 Thread Xi Shen
Yes, I have already repartitioned.

I tried repartitioning to the number of cores in my cluster. It did not help.
I tried repartitioning to the number of centroids (the k value). It did not
help either.



k-means can only run on one executor with one thread?

2015-03-26 Thread Xi Shen
Hi,

I have a large data set, and I expect to get 5000 clusters.

I load the raw data and convert it into DenseVectors; then I repartition
and cache it; finally I pass the RDD[Vector] to KMeans.train().

Now the job is running, and the data are loaded. But according to the Spark
UI, all the data are loaded onto one executor. I checked that executor, and
its CPU workload is very low. I think it is using only 1 of its 8 cores. The
other 3 executors are idle.

Did I miss something? Is it possible to distribute the workload to all 4
executors?


Thanks,
David