Re: KMeans: expensiveness of large vectors
Development is really rapid here, that's a great thing. Out of curiosity, how did communication work before torrent? Did everything have to go back to the master / driver first?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-expensiveness-of-large-vectors-tp10614p10870.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: KMeans: expensiveness of large vectors
Before torrent, HTTP was the default way of broadcasting. The driver holds the data and the executors request it via HTTP, which makes the driver the bottleneck when the data is large. -Xiangrui
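The difference Xiangrui describes can be sketched with a back-of-the-envelope calculation. This is an illustrative model, not Spark's actual implementation: under HTTP-style broadcast every executor downloads the full payload from the driver, while under torrent-style broadcast the driver only needs to seed each block roughly once and the executors re-share blocks among themselves. The sizes and block granularity below are made-up example numbers.

```python
BLOCK = 4 * 1024 * 1024  # assume 4 MB blocks, similar to torrent-style chunking

def driver_bytes_http(data_size, num_executors):
    # HTTP broadcast: every executor fetches the full payload from the driver,
    # so driver egress grows linearly with the number of executors.
    return data_size * num_executors

def driver_bytes_torrent(data_size, num_executors):
    # Torrent broadcast: the driver seeds each block once; executors that
    # already hold a block serve it to the others, so driver egress is
    # roughly one copy of the data regardless of cluster size.
    num_blocks = -(-data_size // BLOCK)  # ceiling division
    return num_blocks * BLOCK

size = 800 * 1024 * 1024  # a hypothetical 800 MB broadcast payload
print(driver_bytes_http(size, 40) // 2**20)     # 32000 MB leaves the driver
print(driver_bytes_torrent(size, 40) // 2**20)  # 800 MB leaves the driver
```

With 40 executors the driver serves 40 full copies over HTTP but only about one copy under the torrent scheme, which is why the driver stops being the bottleneck.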
Re: KMeans: expensiveness of large vectors
Hi Xiangrui, thanks for the explanation.

1. You said we have to broadcast m * k centers (with m = number of rows). I thought there were only k centers at any one time, which would then have size n * k and would need to be broadcast. Is that a typo, or did I misunderstand something? And the collection of the averages is partition-wise, so more partitions means more overhead, but basically the same number of operations?

2. I have 5 executors with 8 CPU cores and 25G of memory each, and I usually split the input RDD into 80 partitions for a few gigs of input data. Is there a rule of thumb for the number of partitions in relation to the input size?

3. Assuming I didn't use numeric data but instead converted text data into a numeric representation using a dictionary and a featurization function: the number of columns would be the number of entries in my dictionary (i.e., the number of distinct words in my case). I'd use a sparse vector representation, of course. But even so, if I have a few hundred thousand entries and therefore columns, the broadcasting overhead gets very large, as the centers are still in a dense representation. Do you know of any way to improve performance then?

Best regards, Simon
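Simon's concern in point 3 is easy to quantify. Even when the input vectors are sparse, the k-means centers are dense, so broadcasting them costs roughly k * num_features * 8 bytes (64-bit doubles) per iteration. The dictionary size and cluster count below are hypothetical example values, not figures from the thread.

```python
def dense_centers_bytes(k, num_features):
    # Each center is a dense vector of num_features doubles (8 bytes each).
    return k * num_features * 8

# e.g. a hypothetical 300k-word dictionary with k = 100 clusters:
mb = dense_centers_bytes(100, 300_000) / 2**20
print(round(mb, 1))  # 228.9 -> ~229 MB broadcast to every worker per iteration
```

A few hundred thousand columns therefore puts the per-iteration broadcast in the hundreds of megabytes, which is exactly where the HTTP-based broadcast in 1.0 struggles.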
Re: KMeans: expensiveness of large vectors
1. I meant that in the n (1k) by m (10k) case, we need to broadcast k centers, and hence the total size is m * k. In 1.0, the driver sends the current centers to each partition one by one. In the current master, we use torrent to broadcast the centers to the workers, which should be much faster.

2. For MLlib algorithms, the number of partitions shouldn't be much larger than the number of CPU cores. Your setting looks good.

3. You can use the hashing trick to limit the number of features, or remove low-frequency and high-frequency words from the dictionary.

Best, Xiangrui
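The hashing trick Xiangrui suggests in point 3 can be sketched in plain Python: instead of building a dictionary with one column per distinct word, each word is hashed into a fixed number of buckets, which caps the feature dimension (and hence the size of the dense centers) regardless of vocabulary size. The bucket count below is an example choice; Python's built-in `hash` stands in for whatever hash function a real featurizer would use. Collisions are possible, but with a large bucket count they are rare enough in practice.

```python
NUM_FEATURES = 1 << 18  # 262144 buckets instead of an unbounded dictionary

def featurize(words, num_features=NUM_FEATURES):
    """Return a sparse {index: count} term-frequency vector of fixed dimension."""
    vec = {}
    for w in words:
        idx = hash(w) % num_features  # word -> fixed bucket index
        vec[idx] = vec.get(idx, 0) + 1
    return vec

v = featurize("to be or not to be".split())
print(sum(v.values()))  # 6 tokens in total, spread over at most 4 distinct buckets
```

The trade-off is that you can no longer map an index back to a word, but for clustering that usually doesn't matter.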
Re: KMeans: expensiveness of large vectors
Hi Xiangrui, using the current master meant a huge improvement for my task. Something that did not even finish before (training with 120G of dense data) now completes in a reasonable time. I guess using torrent helps a lot in this case. Best regards, Simon
Re: KMeans: expensiveness of large vectors
Great! Thanks for testing the new features! -Xiangrui