Re: KMeans: expensiveness of large vectors

2014-07-29 Thread durin
Development is really rapid here; that's a great thing.

Out of curiosity, how did communication work before torrent? Did everything
have to go back to the master / driver first?





Re: KMeans: expensiveness of large vectors

2014-07-29 Thread Xiangrui Meng
Before torrent, HTTP was the default way of broadcasting. The driver
holds the data and the executors request it via HTTP, making the
driver the bottleneck when the data is large. -Xiangrui
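
For reference, a minimal sketch of how the two broadcast implementations
can be selected in a 1.0.x application, assuming the standard
spark.broadcast.factory setting; the app name is a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kmeans-broadcast-sketch")  // placeholder app name
  // Old HTTP broadcast: every executor fetches the payload from the driver.
  // .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")
  // Torrent broadcast: blocks are re-shared between executors, so the driver
  // is no longer the single source.
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")
val sc = new SparkContext(conf)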



Re: KMeans: expensiveness of large vectors

2014-07-28 Thread durin
Hi Xiangru,

thanks for the explanation.

1. You said we have to broadcast m * k centers (with m = number of rows). I
thought there were only k centers at any given time, which would then have a
size of n * k and would need to be broadcast. Is that a typo, or did I
misunderstand something?
And the collection of the averages is partition-wise, so more partitions mean
more overhead, but basically the same number of operations?

2. I have 5 executors with 8 CPU cores and 25 GB of memory each, and I
usually split the input RDD into 80 partitions for a few gigabytes of input
data. Is there a rule of thumb for the number of partitions in relation to
the input size?


3. Assuming I didn't use numeric data but instead converted text data into
a numeric representation using a dictionary and a featurization function:
the number of columns would be the number of entries in my dictionary (i.e.,
the number of distinct words in my case). I'd use a sparse vector
representation, of course. But even so, if I have a few hundred thousand
entries and therefore columns, the broadcasting overhead gets very large,
since the centers are still stored in a dense representation.
Do you know of any way to improve performance then?
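
For illustration, here is a rough sketch of the setup described above,
assuming MLlib's Vectors.sparse and KMeans.train; the dictionary, the
document RDD, and the chosen k are hypothetical placeholders:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Dictionary-based featurization: one column per dictionary entry,
// stored sparsely as (index, count) pairs.
def featurize(tokens: Seq[String], dict: Map[String, Int]): Vector = {
  val counts = tokens.flatMap(dict.get)
    .groupBy(identity)
    .map { case (idx, occs) => (idx, occs.size.toDouble) }
    .toSeq
  Vectors.sparse(dict.size, counts)
}

// val dictionary: Map[String, Int] = ...   // word -> column index (placeholder)
// val docs: RDD[Seq[String]] = ...         // tokenized documents (placeholder)
// val data = docs.map(featurize(_, dictionary)).cache()
// val model = KMeans.train(data, 20, 10)   // k = 20, 10 iterations (assumed values)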


Best regards,
Simon





Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
1. I meant that in the n (1k) by m (10k) case, we need to broadcast k
centers, each of dimension m, and hence the total size is m * k. In 1.0, the
driver needs to send the current centers to each partition one by one. In
the current master, we use torrent to broadcast the centers to the workers,
which should be much faster.
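
As a back-of-envelope illustration (the numbers below are assumed, not
measured): with dense centers, each of the k centers is a vector of m
doubles, so the broadcast payload per iteration is roughly m * k * 8 bytes.

// Assumed numbers: a few hundred thousand columns, k = 100 clusters.
val m = 300000
val k = 100
val payloadBytes = m.toLong * k * 8
println(s"~${payloadBytes / (1 << 20)} MB of centers per iteration")  // ~228 MB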

2. For MLlib algorithms, the number of partitions shouldn't be much
larger than the number of CPU cores. Your setting looks good.
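
A small sizing sketch for the setup described above (5 executors * 8 cores;
the input path is a placeholder and sc is assumed to be a live SparkContext):

val totalCores = 5 * 8
// Read the input with a partition count close to the total core count.
val data = sc.textFile("hdfs:///path/to/input", totalCores)
// Or shrink an existing RDD without a shuffle:
// val resized = existingRdd.coalesce(totalCores)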

3. You can use the hashing trick to limit the number of features, or
remove low-frequency and high-frequency words from the dictionary.
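
A minimal sketch of the hashing trick (the helper below is hypothetical, not
an MLlib API): hash each token into a fixed number of buckets instead of
keeping a dictionary entry per distinct word, so the centers stay at a
manageable dimensionality.

import org.apache.spark.mllib.linalg.{Vector, Vectors}

def hashFeatures(tokens: Seq[String], numFeatures: Int): Vector = {
  val counts = scala.collection.mutable.Map.empty[Int, Double]
  tokens.foreach { t =>
    // Map the token to a non-negative bucket index.
    val idx = ((t.hashCode % numFeatures) + numFeatures) % numFeatures
    counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
  }
  Vectors.sparse(numFeatures, counts.toSeq)
}

// Usage: e.g. 2^15 buckets instead of the full vocabulary.
// val vectors = docs.map(tokens => hashFeatures(tokens, 1 << 15))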

Best,
Xiangrui



Re: KMeans: expensiveness of large vectors

2014-07-28 Thread durin
Hi Xiangrui,

using the current master brought a huge improvement for my task. Something
that did not even finish before (training with 120 GB of dense data) now
completes in a reasonable time. I guess torrent broadcasting helps a lot in
this case.


Best regards,
Simon





Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
Great! Thanks for testing the new features! -Xiangrui
