Re: KMeans|| opinions
On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning ted.dunn...@gmail.com wrote:
> - Have you considered sketch based algorithms?
Can you give me a reference? At this point I am just contemplating a more or less shameless port of what they've done in MLlib.
> - It can be important to use optimizations in the search for the nearest centroid. Consider triangle optimizations.
> - Do you mean parallel when you type || or is there another meaning there?
No, I mean the method called kmeans||. It's an unfortunate name, since I really don't know how to make Google search for it. http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
> - When you say ++ initialization, many people get this wrong and assume that you mean picking the furthest point. Getting really good initialization is fairly difficult and typically requires more time than the actual clustering. This is one of the key benefits of sketch based methods.
> - Most algorithms require multiple restarts. At higher dimension the number of restarts required becomes very large. An ideal implementation does parallel sketch extraction followed by parallel ball k-means for restarts.
On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
> Considering porting implementation [1] and paper for KMeans|| for Bindings. This seems like another method to map fairly nicely. The problem I am contemplating is ||-initialization, and in particular, centroid storage. That particular implementation assumes centroids can be kept in memory in front. (1) The question is, is that a dangerous idea? It doesn't seem like it particularly is, since people are unlikely to want more than k = 1e+6. Another thing: centers seem to be passed in via a closure attribute (i.e. a java-serialized array-backed matrix). However, with Bindings it is quite possible to keep centers at the back as a matrix. (2) Obviously, Lloyd iterations are not terribly accurate; the || and ++ versions mostly speed things up. Is there any better-than-Lloyd accuracy preference?
> [1] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
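For readers trying to place the algorithm: below is a minimal in-core sketch of the ||-seeding loop from the Bahmani et al. paper linked above, assuming Euclidean distance and plain Scala collections. All names are illustrative; the actual MLlib implementation referenced in [1] distributes the cost computation across the cluster rather than looping in memory.

import scala.util.Random

object KMeansParInit {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def cost(p: Point, centers: Seq[Point]): Double =
    centers.map(c => sqDist(p, c)).min

  // k-means|| seeding: a handful of passes, each sampling roughly l new
  // candidates with probability proportional to the point's squared distance
  // to the current candidate set (vs. one new seed per pass for k-means++).
  def init(data: IndexedSeq[Point], k: Int, l: Int, passes: Int, rng: Random): Seq[Point] = {
    var candidates = Vector(data(rng.nextInt(data.size)))    // one uniform seed
    for (_ <- 1 to passes) {
      val total = data.map(p => cost(p, candidates)).sum
      candidates ++= data.filter(p => rng.nextDouble() < l * cost(p, candidates) / total)
    }
    // The paper then weights each candidate by the number of points nearest
    // to it and reclusters the O(l * passes) candidates down to k with ++;
    // that weighted recluster step is elided here.
    candidates.take(k)
  }
}

Each pass is one scan over the data, which is what makes the method attractive on a cluster: the candidate set stays small enough to broadcast while the scan parallelizes.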
[jira] [Commented] (MAHOUT-1506) Creation of affinity matrix for spectral clustering
[ https://issues.apache.org/jira/browse/MAHOUT-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959703#comment-13959703 ]
Sebastian Schelter commented on MAHOUT-1506:
I don't see much value in adding new MapReduce code, to be honest. I'd suggest we implement this on the new Scala DSL as a first step toward porting spectral clustering.

Creation of affinity matrix for spectral clustering
Key: MAHOUT-1506
URL: https://issues.apache.org/jira/browse/MAHOUT-1506
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn

I wanted to get this discussion going, since I think this is a critical blocker for any kind of documentation update on spectral clustering (I can't update the documentation until the algorithm is useful, and it won't be useful until there's a built-in method for converting raw data to an affinity matrix). Namely, I'm wondering what kind of raw data this algorithm should expect (anything that k-means expects, basically?), and what are the data structures associated with it? I've created a proof of concept for how pairwise affinity generation could work: https://github.com/magsol/Hadoop-Affinity It's a two-step job, but if the input data format provides 1) the total number of data points, and 2) each data point's index in the overall set, then the first job can be scrapped entirely and affinity generation will consist of one MR task. (Discussions on Spark / h2o pending, of course.) Mainly this is an engineering problem at this point. Let me know your thoughts and I'll get this done (I'm out of town the next 10 days for my wedding/honeymoon; will get to this on my return).

-- This message was sent by Atlassian JIRA (v6.2#6252)
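To fix ideas on what "affinity generation" computes, here is a toy dense RBF (Gaussian) affinity kernel in Scala. The choice of kernel and the sigma bandwidth are assumptions for illustration; a real job would compute the pairwise step in parallel (the MR design above, or the Scala DSL Sebastian suggests) rather than in core.

object Affinity {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Dense RBF affinity: A(i,j) = exp(-||x_i - x_j||^2 / (2 sigma^2)), A(i,i) = 0.
  def rbf(points: IndexedSeq[Point], sigma: Double): Array[Array[Double]] = {
    val n = points.length
    val a = Array.ofDim[Double](n, n)
    for (i <- 0 until n; j <- i + 1 until n) {
      val w = math.exp(-sqDist(points(i), points(j)) / (2 * sigma * sigma))
      a(i)(j) = w   // affinity is symmetric, so fill both triangles
      a(j)(i) = w
    }
    a
  }
}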
Re: KMeans|| opinions
On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
> On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning ted.dunn...@gmail.com wrote:
> > - Have you considered sketch based algorithms?
> Can you give me a reference? At this point I am just contemplating a more or less shameless port of what they've done in MLlib.
Here is the reference I used: http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets

A quick summary:
- A single pass over the data with a sort of dp-means [1] algorithm using a very fast approximate centroid search can give you k log N centroids. This is the sketch.
- Clustering these centroids as weighted values gives a provably probably good clustering of the original data for well-clusterable data. For real data, it seems to work exactly as advertised.
- Clustering the sketch using ball k-means with clever initialization is important. Good initialization is very expensive in terms of the number of points, so using a sketch is a really good idea.
- The proofs all depend on Euclidean distance.
- The threshold can start very small (what "small" means is determined empirically). Each time we wind up with too many centroids, recursively clustering the centroids and possibly increasing the threshold will keep the number reasonably bounded.

Some notes from practical experience:
- Moderate levels of parallelism are easy, since sketches can be merged using set union. You may want to recursively cluster the centroids at this point if you have too many. This is a very nice application of map-reduce.
- High degrees of parallelism require multiple levels of merge/collapse, since otherwise you wind up with a sketch nearly as large as the original data. If you have m parallel clusterings then m k log(N/m) can be large. Take a billion points, m = 1000 and k = 10,000: the size of the parallel sketches is mk log(N/m) = 200 x 10^6 = 0.2 N. But with m = 100, k = 1000, we have mk log(N/m) = 210 x 10^3, which is quite manageable.
- Ball k-means on highly clusterable data often uses a cutoff for centroid computation of 0.5 x the distance to the nearest centroid. I find that with real data, 1 x that distance or even larger is much better. The good effects are still mostly there, but you don't need wonderful data to succeed.
- The ball k-means algorithm that we have in Mahout is pretty high quality, but could use a bit of triangle (Elkan [2]) speedup.

References
[1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
[2] http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf
> [...]
On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
> Considering porting implementation [1] and paper for KMeans|| for Bindings. [...]
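A minimal single-machine sketch of the one-pass dp-means-style sketching step Ted summarizes, assuming Euclidean distance; the growth constant, the probabilistic promotion rule, and all names here are illustrative rather than Mahout's actual StreamingKMeans code.

import scala.util.Random
import scala.collection.mutable.ArrayBuffer

// One-pass sketch in the spirit of the streaming k-means paper above:
// dp-means-like assignment with a threshold that grows whenever the
// centroid count blows past its budget.
object StreamingSketch {
  type Point = Array[Double]
  final case class Centroid(var sum: Point, var weight: Double) {
    def mean: Point = sum.map(_ / weight)
  }

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def sketch(data: Iterator[Point], maxCentroids: Int, rng: Random): Seq[Centroid] = {
    var threshold = 1e-6                        // start very small, grow as needed
    val cs = ArrayBuffer[Centroid]()
    for (p <- data) {
      if (cs.isEmpty) cs += Centroid(p.clone, 1.0)
      else {
        val (nearest, d2) = cs.map(c => (c, sqDist(p, c.mean))).minBy(_._2)
        // dp-means rule: far-away points tend to found new centroids
        if (rng.nextDouble() < math.min(1.0, d2 / threshold)) cs += Centroid(p.clone, 1.0)
        else {
          nearest.sum = nearest.sum.zip(p).map { case (s, x) => s + x }
          nearest.weight += 1
        }
      }
      if (cs.size > maxCentroids) {
        threshold *= 2.0
        // real implementations recursively re-sketch cs here to shrink it;
        // omitted for brevity
      }
    }
    cs.toSeq
  }
}

The weights carried on each centroid are what lets the second phase (ball k-means over the sketch) stand in for clustering the full data set, and set union of two such centroid lists is the merge operation Ted mentions.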
Re: KMeans|| opinions
Is there any opinion on whether either of the pipe-pipe (||) or dp-means initializations has bona fide advantages over the other?
On Fri, Apr 4, 2014 at 12:24 AM, Ted Dunning ted.dunn...@gmail.com wrote:
> [...]
Re: KMeans|| opinions
Pipe-pipe is an initialization algorithm which (roughly) parallelizes something like the ++ algorithm and requires logarithmically many parallel passes over the data, as opposed to the k passes required for ++.

DP-means itself is a variant of k-means where the value of k is not fixed. Instead, whenever a data point is further than a threshold \lambda from the nearest centroid, it is used to form a new centroid. Different choices of \lambda result in different numbers of clusters. I don't know how well dp-means by itself would work as an initialization pass.

Streaming k-means uses a variant of dp-means to get a sketch of the data. This sketch has many more centroids than the final desired number of clusters k. As such, we don't have to worry much about the quality of the initialization of this first dp-means pass. The sketch is then clustered using something like k-means++ and ball k-means iteration.

As I see it, having a good streaming k-means block should make any clustering algorithm better, as long as the sketch is much smaller than the original data. The second phase of clustering after streaming k-means can benefit from any technique that makes clustering better. The caveat is just that streaming k-means is likely to produce a small enough sketch that parallel clustering may not help very much. Exactly where that trade-off applies depends a lot on the parallel framework available. Spark and h2o have much lower task initiation costs, so they will definitely be able to get parallel speedup on much smaller datasets.

On Fri, Apr 4, 2014 at 11:29 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
> Is there any opinion on whether either of the pipe-pipe (||) or dp-means initializations has bona fide advantages over the other?
> [...]
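To make the "cutoff" remark from Ted's notes concrete, here is one ball k-means update step in Scala: each centroid is recomputed only from points inside the ball whose radius is cutoffFactor times the distance to the nearest other centroid (0.5 in the ball k-means analysis, around 1.0 per Ted's experience with real data). This is a toy sketch assuming Euclidean distance and at least two centers, not Mahout's BallKMeans implementation.

object BallKMeans {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // One Lloyd-style update where each centroid is recomputed only from the
  // "ball" of points closer than cutoffFactor times the distance to its
  // nearest sibling centroid. Assumes centers.size >= 2.
  def update(points: Seq[Point], centers: IndexedSeq[Point], cutoffFactor: Double): IndexedSeq[Point] = {
    val radius = centers.indices.map { i =>
      cutoffFactor * centers.indices.filter(_ != i).map(j => dist(centers(i), centers(j))).min
    }
    centers.indices.map { i =>
      val ball = points.filter { p =>
        val nearest = centers.indices.minBy(j => dist(p, centers(j)))
        nearest == i && dist(p, centers(i)) <= radius(i)
      }
      if (ball.isEmpty) centers(i)                        // keep a centroid whose ball is empty
      else ball.transpose.map(_.sum / ball.size).toArray  // coordinate-wise mean
    }
  }
}

The effect of the cutoff is to keep outliers from dragging centroids around, which is where the robustness on "not wonderful" data comes from.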
Re: Mail and IRC parsing
We've (LucidWorks) got full indexing and search of the Mahout mail archives at http://find.searchub.org. We could probably add in IRC pretty easily if you want.
-Grant

On Mar 22, 2014, at 2:06 AM, Andrew Musselman andrew.mussel...@gmail.com wrote:
> I put up a parser for the IRC history logs here: https://github.com/andrewmusselman/util/blob/master/irc-parser.sh
> I'd like to write one for the user list too, to figure out the most common problems/questions so we can focus effort on repairs to bugs and docs. But the mail archives at https://mail-archives.apache.org/mod_mbox/mahout-user/ are dynamic, loaded in through JavaScript, so parsing them isn't that straightforward. Is it possible to get the mbox files directly?

Grant Ingersoll | @gsingers http://www.lucidworks.com
Board Report
Can someone summarize the 0xData and the Spark work for me for the board report? I've unfortunately been too busy to keep up on the threads, but I need to write the board report for this month. You can either summarize here or add it to the community section at https://svn.apache.org/repos/asf/mahout/pmc/board-reports/2014/board-report-apr.txt
Also, assuming we are going ahead w/ the 0xData stuff, we likely need to do a software grant for that.
Thanks,
Grant

Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: Mail and IRC parsing
Could be useful; in the meantime I found the mail files on people.apache.org and can use those.
On Apr 4, 2014, at 8:39 AM, Grant Ingersoll gsing...@apache.org wrote:
> We've (LucidWorks) got full indexing and search of the Mahout mail archives at http://find.searchub.org. We could probably add in IRC pretty easily if you want.
> [...]
[jira] [Updated] (MAHOUT-1482) Rework quickstart website
[ https://issues.apache.org/jira/browse/MAHOUT-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jian wang updated MAHOUT-1482:
Attachment: examples.patch

Rework quickstart website
Key: MAHOUT-1482
URL: https://issues.apache.org/jira/browse/MAHOUT-1482
Project: Mahout
Issue Type: Improvement
Components: Documentation
Reporter: Sebastian Schelter
Priority: Blocker
Fix For: 1.0
Attachments: MAHOUT-1482-add-maven-dependency.patch, examples.patch

Our quickstart website must be much more helpful. We should describe how to add Mahout as a maven dependency to a Java project and how to start working on the source code. https://mahout.apache.org/users/basics/quickstart.html

-- This message was sent by Atlassian JIRA (v6.2#6252)
Re: KMeans|| opinions
On Fri, Apr 4, 2014 at 4:26 AM, Ted Dunning ted.dunn...@gmail.com wrote:
> Pipe-pipe is an initialization algorithm which (roughly) parallelizes something like the ++ algorithm and requires logarithmically many parallel passes over the data, as opposed to the k passes required for ++.
I'll re-read it. The way I understood the paper (and the code), pipe-pipe actually uniformly subsamples the data to bring in a limited number of points in each pass, which is O(k), not O(n). That is not the same as going over every point and computing distance-based probabilities for each, if that's what you imply by the comparison with ++. DP-means would have to do subsampling as well.
> [...]
Re: Board Report
Grant,
On the 0xdata / H2O front: We are very excited about making Apache Mahout the principal platform for scalable machine learning and are rapidly prototyping an initial integration with the Matrix API. Ted (apache.org), Cliff Click (acm.org/0xdata), Anand Avati (Redhat) and Michal Malohlava (0xdata) are heads down on that, making brisk progress. We hope to get the discussions restarted in the JIRAs and Google hangouts as soon as we get past the first cut. We also chose to make the first-level integration with Mahout a maven dependency; that way we can flesh things out without major interruption while the grant work proceeds.
In parallel, several members and teams have been reworking the core architecture to get a clean separation of the algorithms core, an in-memory (mr/task) API, and a decent client framework with data read/write. This will allow Apache Mahout and other ML libraries to use Spark, Stratosphere or other engines for performance and extensibility.
This is the state of the union at the moment; I'm very enthusiastic about making this a win for the ardent community of machine learning users and developers. We are very grateful for the warmth, welcome, attention and impassioned reviews we received from the Apache community. Thank you for that. We should have more to report in the month ahead.
Looking forward,
Sri

On Fri, Apr 4, 2014 at 6:44 AM, Grant Ingersoll gsing...@apache.org wrote:
> Can someone summarize the 0xData and the Spark work for me for the board report?
> [...]

--
ceo & co-founder, 0xdata Inc http://www.0xdata.com +1-408.316.8192
Re: Board Report
Grant,
Spark Bindings issues (done/in-progress) are summarized here: http://mahout.apache.org/users/sparkbindings/home.html. As it stands, there is one platform issue, whose core functionality is completed/committed, and five in-progress issues at various stages of ETA. The platform issue brought with it the completion of five algorithms as proof-of-concept guinea pigs. The in-progress issues are all but one about augmenting the algorithm set. One WIP issue is of significant importance on the platform side (bringing Data Frame-like semantics to distributed datasets), but its design is fairly vague and under-developed right now; the difficulty there is developing a solid concept rather than the coding itself, so it is at the brainstorming stage rather than something concrete. Adding one more issue in the clustering area is being contemplated in the near term.
There are three independent contributors working on these issues. The work is unpaid, though, so ETAs may vary. Some issues have been given ETAs by their contributors, but most of the in-progress work has not.

On Fri, Apr 4, 2014 at 6:44 AM, Grant Ingersoll gsing...@apache.org wrote:
> Can someone summarize the 0xData and the Spark work for me for the board report?
> [...]
Re: KMeans|| opinions
On Fri, Apr 4, 2014 at 8:44 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
> On Fri, Apr 4, 2014 at 4:26 AM, Ted Dunning ted.dunn...@gmail.com wrote:
> > Pipe-pipe is an initialization algorithm which (roughly) parallelizes something like the ++ algorithm and requires logarithmically many parallel passes over the data, as opposed to the k passes required for ++.
> I'll re-read it. The way I understood the paper (and the code), pipe-pipe actually uniformly subsamples the data to bring in a limited number of points in each pass, which is O(k), not O(n). That is not the same as going over every point and computing distance-based probabilities for each, if that's what you imply by the comparison with ++. DP-means would have to do subsampling as well.
Retracting that. They do full passes over the data with full point-cost recomputations each time.
> [...]
Re: Board Report
To add to Sri's comments: A prototype implementation of the Mahout Matrix and Vector interfaces with an h2o back-end is in progress outside of Mahout, with the intention of using it as a test-bed for developing implementations of several Mahout algorithms such as SSVD. This code is intended for contribution if the objections of one committer are overcome by the concrete results of the prototype.

On Fri, Apr 4, 2014 at 6:47 PM, SriSatish Ambati srisat...@0xdata.com wrote:
> Grant, On the 0xdata / H2O front: We are very excited about making Apache Mahout the principal platform for scalable machine learning and are rapidly prototyping an initial integration with the Matrix API.
> [...]
Re: KMeans|| opinions
On Fri, Apr 4, 2014 at 5:44 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
> The way I understood the paper (and the code), pipe-pipe actually uniformly subsamples the data to bring in a limited number of points in each pass, which is O(k), not O(n). That is not the same as going over every point and computing distance-based probabilities for each, if that's what you imply by the comparison with ++.
I don't understand your point here. My original point was that || makes a logarithmic number (log N or log k, I don't know which) of passes over the data (direct quote from the paper), while ++ makes O(k) passes over the data. Such passes are often I/O bound. Each ++ pass is made with the goal of selecting a new initial point. This is a sampling procedure based on all the previous initial points. That sounds very much like ||, except for the number of passes. So my point is O(log mumble) versus O(k). You say that pipe-pipe samples data, and that this is different from how ++ samples data. Is it the uniform versus non-uniform sampling that you are talking about? That hardly seems important compared to the difference in the number of passes.
> DP-means would have to do subsampling as well.
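For contrast with the || loop sketched earlier in the thread, here is the sequential ++ seeding rule under discussion: one full pass per new seed, k passes in total, with the next seed drawn with probability proportional to its squared distance to the seeds chosen so far. A minimal in-core sketch assuming Euclidean distance and k <= data.size; names are illustrative.

import scala.util.Random

object KMeansPP {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // k-means++ seeding: k sequential passes, next seed ~ d^2 weighting.
  def seed(data: IndexedSeq[Point], k: Int, rng: Random): IndexedSeq[Point] = {
    var centers = Vector(data(rng.nextInt(data.size)))
    while (centers.size < k) {
      val d2 = data.map(p => centers.map(c => sqDist(p, c)).min)  // one full pass
      val total = d2.sum
      var r = rng.nextDouble() * total        // roulette-wheel draw over the d^2 mass
      val idx = d2.indexWhere { w => r -= w; r <= 0 }
      centers :+= data(if (idx >= 0) idx else data.size - 1)      // guard for float round-off
    }
    centers
  }
}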
[jira] [Commented] (MAHOUT-1507) Support External/Foreign Keys/IDs for Vectors and Matrices
[ https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960837#comment-13960837 ]
Pat Ferrel commented on MAHOUT-1507:
OK, the desired behavior is to support a method to preserve user-specified arbitrary keys for row and column vectors; put another way, to preserve IDs for the rows and columns of a DRM. Below the terms ID and key are used interchangeably. For example, the DRM created for the item-based recommender has rows of users and columns of items. If the user-specified IDs could be preserved through to the output, this would make usage of the recommender much easier. Likewise for kmeans or other clustering: if user-specified row IDs were preserved (possibly document IDs in text clustering), this would make usage much easier. By "preserve" I mean that the user-specified keys would be available in the sequence file output of the appropriate job. Failing this, a dictionary of user-specified key -> Mahout key should be created for the output. In the case of a DRM as output, if each row vector were stored with its user-specified key and a dictionary of the user-specified column keys were created, then the user would not have to implement this behavior for every Mahout job. Partial examples of this behavior in Mahout 0.9 are provided by Named or Property Vectors. Also, the text analysis pipeline creates a dictionary of word token -> Mahout column ID; that dictionary has to be created by Mahout, since the tokens are created as a consequence of phases in the job. I should emphasize that this ID or key translation is an absolute requirement for Mahout use (except where the partial solution is enough). Why put the burden for this on every user?

Support External/Foreign Keys/IDs for Vectors and Matrices
Key: MAHOUT-1507
URL: https://issues.apache.org/jira/browse/MAHOUT-1507
Project: Mahout
Issue Type: Bug
Components: Math
Affects Versions: 0.9
Environment: Spark, Scala
Reporter: Pat Ferrel
Labels: spark
Fix For: 1.0

All users of Mahout have data which is addressed by keys or IDs of their own devising. In order to use much of Mahout they must translate these IDs into Mahout IDs, then run their jobs, and translate back again when retrieving the output. If the ID space is very large this is a difficult problem for users to solve at scale. For many Mahout operations this would not be necessary if these external keys could be maintained for vectors and dimensions, or for rows and columns of a DRM. The reason I bring this up now is that much groundwork is being laid for Mahout's future on Spark, so getting this notion in early could be fundamentally important and used to build on. If external IDs for rows and columns were maintained, then RSJ, DRM transpose (and other DRM ops), vector extraction, clustering, and recommenders would need no ID translation steps, a big user win. A partial solution might be to support external row IDs alone, somewhat like the NamedVector and PropertyVector in the Mahout hadoop code.

On Apr 3, 2014, at 11:00 AM, Pat Ferrel p...@occamsmachete.com wrote:
> Perhaps this is best phrased as a feature request.
On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
> PS. Sequence file keys also have special meaning if they are Ints. E.g. the A' physical operator requires keys to be ints, in which case it interprets them as row indexes that become column indexes. This of course isn't always the case; e.g. (Aexpr).t %*% Aexpr doesn't require int indices, because in reality the optimizer will never choose actual transposition as a physical step in such a pipeline. This interpretation is consistent with the interpretation of the long-existing Hadoop-side DistributedRowMatrix#transpose.
On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
> On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel p...@occamsmachete.com wrote:
> > On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
> > > I think this duality, names and keys, is not very healthy really, and just creates additional hassle. Spark DRM takes care of keys automatically throughout, but propagating names from named vectors is solely an algorithm concern as it stands.
> > Not sure what you mean.
> Not what you think, it looks like. I mean that the Mahout DRM structure is a bag of (key -> Vector) pairs. When persisted, the key goes to the key of a sequence file. In particular, it means that there is a case of Bag[key -> NamedVector], which means an external anchor could be saved to either the key or the name of a row. In practice it causes a compatibility mess: e.g. we saw those numerous cases where seq2sparse saves external keys (file paths) into the key, whereas clustering algorithms don't see them because they expect them to be the name part of the vector.
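A strawman of the dictionary fallback Pat describes, nothing more: carry an external-ID-to-row-index mapping alongside the matrix and invert it when reading results back. ExternalIds and everything in it are hypothetical names for illustration, not an existing or proposed Mahout API.

// Hypothetical helper: bidirectional mapping between user-supplied row IDs
// (e.g. document IDs) and the contiguous int indexes a DRM wants.
final class ExternalIds private (val toIndex: Map[String, Int], val toId: Array[String]) {
  def index(id: String): Int = toIndex(id)
  def id(row: Int): String = toId(row)
}

object ExternalIds {
  def apply(ids: Seq[String]): ExternalIds = {
    val arr = ids.distinct.toArray            // row order = order of first appearance
    new ExternalIds(arr.zipWithIndex.toMap, arr)
  }
}

object ExternalIdsDemo extends App {
  val dict = ExternalIds(Seq("user-9f3", "user-02a", "user-77c"))
  val row = dict.index("user-02a")            // external key -> DRM row index
  println(s"row $row holds ${dict.id(row)}")  // and back again when reading output
}

The point of Pat's request is precisely that this boilerplate, which every user currently rebuilds, could live in the platform and ride along with the DRM through ops like transpose and RSJ.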
[jira] [Commented] (MAHOUT-1507) Support External/Foreign Keys/IDs for Vectors and Matrices
[ https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960958#comment-13960958 ]
Andrew Musselman commented on MAHOUT-1507:
I agree with Pat that this is a big deal; maintaining a custom mapping from arbitrary ids to sequential integers is a hurdle that prevents people from adopting the project.

Support External/Foreign Keys/IDs for Vectors and Matrices
Key: MAHOUT-1507
URL: https://issues.apache.org/jira/browse/MAHOUT-1507

> [...] I am just saying we have two ways to name the rows, and it is generally not a healthy choice for the aforementioned reason.
> In my experience Names and Properties are primarily used to store external keys, which are quite healthy. Users never have data with Mahout keys; they must constantly go back and forth. This is exactly what the R data frame does, no? I'm not so concerned with being able to address an element by the external key, drmB["pat"]["iPad"], like a HashMap. But it would sure be nice to have the external ids follow the data through any calculation that makes sense.
> I am with you on this.
> This would mean clustering, recommendations, transpose, RSJ would require no id-transforming steps. This would make dealing with Mahout much easier.
> Data frames are a little bit of a different thing; right now we work just with matrices. Although, yes, our in-core matrices support row and column names (just like in R), and distributed matrices support row keys only. What I mean is that an algebraic expression, e.g. Aexpr %*% Bexpr, will automatically propagate _keys_ from Aexpr as implied above, but not necessarily named vectors, because internally algorithms blockify things into matrix blocks, and I am far from sure that Mahout in-core stuff works correctly with named vectors as part of a matrix block in all situations. I may be wrong.
[jira] [Commented] (MAHOUT-1507) Support External/Foreign Keys/IDs for Vectors and Matrices
[ https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960962#comment-13960962 ]
Andrew Musselman commented on MAHOUT-1507:
Maybe "prevents" is strong, but it's another step that would be nice to remove for adoptees.

Support External/Foreign Keys/IDs for Vectors and Matrices
Key: MAHOUT-1507
URL: https://issues.apache.org/jira/browse/MAHOUT-1507