1. I meant that in the n (1k) by m (10k) case, we need to broadcast k
centers, each of dimension m, so the total size is m * k. In 1.0, the
driver sends the current centers to each partition one by one. In the
current master, we use torrent broadcast to send the centers to the
workers, which should be much faster.
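
For reference, here is a minimal, hypothetical sketch of what this looks
like from the user side (the data, the center values, and the partition
count are made up; the spark.broadcast.factory setting just opts into
torrent broadcast explicitly in case it isn't already the default in your
version):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // sketch: broadcast k centers of dimension m once per iteration
    // instead of shipping them inside every task closure
    val conf = new SparkConf()
      .setAppName("kmeans-broadcast-sketch")
      .set("spark.broadcast.factory",
           "org.apache.spark.broadcast.TorrentBroadcastFactory")
    val sc = new SparkContext(conf)

    val n = 1000   // rows
    val m = 10000  // columns (features)
    val k = 100    // clusters

    val data: RDD[Vector] = sc.parallelize(
      Seq.fill(n)(Vectors.dense(Array.fill(m)(scala.util.Random.nextDouble()))), 40)

    val centers: Array[Vector] =
      Array.fill(k)(Vectors.dense(Array.fill(m)(scala.util.Random.nextDouble())))

    // one broadcast of roughly m * k doubles, fetched once per executor
    val bcCenters = sc.broadcast(centers)

    def sqdist(a: Vector, b: Vector): Double = {
      val x = a.toArray; val y = b.toArray
      var s = 0.0; var i = 0
      while (i < x.length) { val d = x(i) - y(i); s += d * d; i += 1 }
      s
    }

    // assign each point to its closest broadcast center
    val assignments = data.map { p =>
      val cs = bcCenters.value
      cs.indices.minBy(i => sqdist(cs(i), p))
    }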

2. For MLlib algorithms, the number of partitions shouldn't be much
larger than the number of CPU cores. Your setting looks good.
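
As a concrete check, with 5 executors x 8 cores you have about 40 task
slots; a hypothetical snippet (reusing the data RDD and SparkContext from
the sketch above):

    // 5 executors * 8 cores = 40 slots; keep the partition count close to that
    val targetPartitions = sc.defaultParallelism   // typically the total core count
    val sized =
      if (data.partitions.length > 2 * targetPartitions)
        data.coalesce(targetPartitions)            // shrink without a full shuffle
      else
        data
    println(s"partitions: ${sized.partitions.length}, cores: $targetPartitions")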

3. You can use the hashing trick to limit the number of features, or
remove low-frequency and high-frequency words from the dictionary.
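
Here is a hand-rolled sketch of the hashing trick combined with a simple
frequency filter, reusing the SparkContext sc from the first sketch (the
toy corpus, the thresholds, and the names numFeatures, minCount, maxCount
and hashIndex are all made up for illustration). The point is that the
vector size is fixed by numFeatures rather than by the vocabulary, so the
dense centers stay small; a smaller numFeatures means smaller centers but
more hash collisions:

    import org.apache.spark.SparkContext._   // pair RDD functions (reduceByKey) on older Spark
    import org.apache.spark.mllib.linalg.Vectors

    val numFeatures = 1 << 18   // fixed dimensionality, independent of vocabulary size
    val minCount = 2            // drop rare words (raise this for a real corpus)
    val maxCount = 1000000      // drop extremely frequent words

    // toy tokenized corpus; in practice this would be your RDD[Seq[String]]
    val docs = sc.parallelize(Seq(
      Seq("spark", "kmeans", "spark", "vectors"),
      Seq("mllib", "spark", "clustering", "vectors")))

    // global term counts, used to filter low-/high-frequency words
    val keep = docs.flatMap(identity)
      .map((_, 1L)).reduceByKey(_ + _)
      .filter { case (_, c) => c >= minCount && c <= maxCount }
      .keys.collect().toSet
    val keepBc = sc.broadcast(keep)

    // hash a term into a non-negative bucket in [0, numFeatures)
    def hashIndex(term: String): Int =
      ((term.hashCode % numFeatures) + numFeatures) % numFeatures

    // sparse term-frequency vectors of fixed size numFeatures
    val tfVectors = docs.map { tokens =>
      val tf = scala.collection.mutable.HashMap.empty[Int, Double]
      tokens.filter(keepBc.value.contains).foreach { t =>
        val i = hashIndex(t)
        tf(i) = tf.getOrElse(i, 0.0) + 1.0
      }
      Vectors.sparse(numFeatures, tf.toSeq)
    }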

Best,
Xiangrui

On Mon, Jul 28, 2014 at 12:55 PM, durin <m...@simon-schaefer.net> wrote:
> Hi Xiangrui,
>
> thanks for the explanation.
>
> 1. You said we have to broadcast m * k centers (with m = number of rows). I
> thought there were only k centers at any time, which would then have a size
> of n * k and would need to be broadcast. Is that a typo, or did I
> misunderstand something?
> And the collection of the averages is partition-wise, so more partitions =
> more overhead, but basically the same number of operations?
>
> 2. I have 5 executors with 8 CPU cores and 25G of memory each, and I usually
> split the input RDD into 80 partitions for a few Gigs of input data. Is
> there a rule of thumb for the number of partitions in relation to the input
> size?
>
>
> 3. Assuming I didn't use numeric data but instead converted text data into a
> numeric representation using a dictionary and a featurization function: the
> number of columns would be the number of entries in my dictionary (i.e. the
> number of distinct words in my case). I'd use a sparse vector representation,
> of course. But even so, if I have a few hundred thousand entries and
> therefore columns, the broadcasting overhead will get very large, since the
> centers are still in a dense representation.
> Do you know of any way to improve performance in that case?
>
>
> Best regards,
> Simon
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-expensiveness-of-large-vectors-tp10614p10804.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
