I have put more detail about my problem at
http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed

I would really appreciate it if you could help me take a look at this
problem. I have tried various settings and ways to load/partition my data,
but I just cannot get rid of that long pause.


Thanks,
David

On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen <davidshe...@gmail.com> wrote:

> Yes, I have done the repartitioning.
>
> I tried repartitioning to the number of cores in my cluster. Not helping...
> I tried repartitioning to the number of centroids (the k value). Not helping...
>
>
> On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley <jos...@databricks.com>
> wrote:
>
>> Can you try specifying the number of partitions when you load the data to
>> equal the number of executors?  If your ETL changes the number of
>> partitions, you can also repartition before calling KMeans.
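>>
>> Something like this, for example (an untested sketch for the spark-shell;
>> the path, parsing, and executor count are placeholders):
>>
>>   import org.apache.spark.mllib.clustering.KMeans
>>   import org.apache.spark.mllib.linalg.Vectors
>>
>>   val numExecutors = 4  // placeholder: your cluster size
>>   // Ask for at least one partition per executor at load time.
>>   val raw = sc.textFile("hdfs:///path/to/data", minPartitions = numExecutors)
>>
>>   // If the ETL collapses the partitioning, restore it before training.
>>   val vectors = raw
>>     .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
>>     .repartition(numExecutors)
>>     .cache()
>>
>>   val model = KMeans.train(vectors, 5000, 20)  // k = 5000, 20 iterations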
>>
>>
>> On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen <davidshe...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a large data set, and I expect to get 5000 clusters.
>>>
>>> I load the raw data and convert it into DenseVectors; then I repartition
>>> and cache; finally I pass the RDD[Vector] to KMeans.train().
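>>>
>>> Roughly, the pipeline looks like this (a simplified sketch; the path and
>>> parsing are placeholders, and the partition count is just illustrative):
>>>
>>>   import org.apache.spark.mllib.clustering.KMeans
>>>   import org.apache.spark.mllib.linalg.Vectors
>>>
>>>   val data = sc.textFile("hdfs:///path/to/data")  // placeholder path
>>>     .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
>>>     .repartition(32)  // repartition after the conversion; count is illustrative
>>>     .cache()
>>>
>>>   val model = KMeans.train(data, 5000, 20)  // k = 5000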
>>>
>>> The job is now running and the data is loaded. But according to the Spark
>>> UI, all of the data has been loaded onto a single executor. I checked that
>>> executor: its CPU load is very low, and I think it is using only 1 of its
>>> 8 cores. The other 3 executors are idle.
>>>
>>> Did I miss something? Is it possible to distribute the workload across
>>> all 4 executors?
>>>
>>>
>>> Thanks,
>>> David
>>>
>>>
>>
