Re: KMeans|| opinions

2014-04-04 Thread Dmitriy Lyubimov
On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 - Have you considered sketch based algorithms?


Can you give me a reference? At this point I am just contemplating a more or
less shameless port of what they've done in MLlib.



 - It can be important to use optimizations in the search for nearest
 centroid.  Consider triangle optimizations.

 - Do you mean parallel when you type || or is there another meaning
 there?


No, I mean the method called kmeans||. It's an unfortunate name since I really
don't know how to get Google to search for it.

http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf


 - When you say ++ initialization, many people get this wrong and assume
 that you mean pick the furthest point.  Getting really good initialization
 is fairly difficult and typically requires more time than the actual
 clustering.  This is one of the key benefits of sketch based methods.

 - Most algorithms require multiple restarts.  At higher dimension the
 number of restarts required becomes very large.  An ideal implementation
 does parallel sketch extraction followed by parallel ball k-means for
 restarts.



 On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  Considering porting implementation [1] and paper for KMeans || for
  Bindings.
 
  This seems like another method to map fairly nicely.
 
  The problem I am contemplating is ||-initialization, and in particular,
  centroid storage. That particular implementation assumes centroids could
 be
  kept in memory in front.
 
  (1) Question is, is it a dangerous idea? It doesn't seem like it
  particularly is, since it's unlikely people would want more than k > 1e+6. Another
  thing, centers seem to be passed in via a closure attribute (i.e. a
  java-serialized array-backed matrix). However, with Bindings it is quite
  possible to keep centers at the back as a matrix.
 
  (2) Obviously, Lloyd iterations are not terribly accurate. || and ++
  versions mostly speed things up. Is there any better-than-Lloyd accuracy
  preference?
 
 
  [1]
 
 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
 



[jira] [Commented] (MAHOUT-1506) Creation of affinity matrix for spectral clustering

2014-04-04 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13959703#comment-13959703
 ] 

Sebastian Schelter commented on MAHOUT-1506:


I don't see much value in adding new MapReduce code, to be honest. I'd suggest
we implement this on the new Scala DSL as a first step for porting spectral
clustering.
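
For illustration, here is a minimal plain-Scala sketch of the pairwise RBF
(Gaussian) affinity computation under discussion; the names (AffinitySketch,
rbfAffinity, sigma) are made up for the example, and a real port would express
this on the Scala DSL / a DRM rather than on in-memory arrays.

// Hypothetical sketch: dense pairwise RBF (Gaussian) affinities,
// A(i, j) = exp(-||x_i - x_j||^2 / (2 * sigma^2)), with a zero diagonal as is
// usual for spectral clustering. Plain in-memory Scala only.
object AffinitySketch {
  def rbfAffinity(points: Array[Array[Double]], sigma: Double): Array[Array[Double]] = {
    val n = points.length
    val a = Array.ofDim[Double](n, n)
    for (i <- 0 until n; j <- i + 1 until n) {
      val d2 = points(i).zip(points(j)).map { case (x, y) => (x - y) * (x - y) }.sum
      val w = math.exp(-d2 / (2 * sigma * sigma))
      a(i)(j) = w // the affinity matrix is symmetric
      a(j)(i) = w
    }
    a
  }
}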

 Creation of affinity matrix for spectral clustering
 ---

 Key: MAHOUT-1506
 URL: https://issues.apache.org/jira/browse/MAHOUT-1506
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn

 I wanted to get this discussion going, since I think this is a critical 
 blocker for any kind of documentation update on spectral clustering (I can't 
 update the documentation until the algorithm is useful, and it won't be 
 useful until there's a built-in method for converting raw data to an affinity 
 matrix).
 Namely, I'm wondering what kind of raw data should this algorithm be 
 expecting (anything that k-means expects, basically?), and what are the data 
 structures associated with this? I've created a proof-of-concept for how 
 pairwise affinity generation could work.
 https://github.com/magsol/Hadoop-Affinity
 It's a two-step job, but if the data structures in the input data format 
 provides 1) the total number of data points, and 2) for each data point to 
 know its index in the overall set, then the first job can be scrapped 
 entirely and affinity generation will consist of 1 MR task.
 (discussions on Spark / h20 pending, of course)
 Mainly this is an engineering problem at this point. Let me know your 
 thoughts and I'll get this done (I'm out of town the next 10 days for my 
 wedding/honeymoon, will get to this on my return).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: KMeans|| opinions

2014-04-04 Thread Ted Dunning
On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  - Have you considered sketch based algorithms?
 

 can you give me a reference. at this point i am just  contemplating more or
 less shameless port of what they've done in mllib.


Here is the reference I used:

http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets

*A quick summary:*

- a single pass over the data with a sort of dp-means [1] algorithm using a
very fast approximate centroid search can give you k log N centroids.  This
is the sketch.

- clustering these centroids as weighted values gives a provably probably
good clustering of the original data for well clusterable data.  For real
data, it seems to work exactly as advertised.

- clustering the sketch using ball k-means with clever initialization is
important.  Good initialization is very expensive in terms of number of
points so using a sketch is a really good idea.

- the proofs all depend on Euclidean distance.

- the threshold can start very small (what small means is determined
empirically).  Each time we wind up with too many centroids, recursively
clustering the centroids and possibly increasing the threshold will keep
the number reasonably bounded.
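
To make the sketch step concrete, here is a rough single-pass, threshold-based
(dp-means style) sketch in plain Scala with Euclidean distance; the names, the
weighted-mean update and the simple threshold-doubling policy are assumptions
for illustration, not the Mahout streaming k-means code.

import scala.collection.mutable.ArrayBuffer

// Hypothetical single-pass sketch: each point either folds into its nearest centroid
// (weighted mean update) or, if it is farther than `threshold`, becomes a new centroid.
// When too many centroids accumulate, a real implementation would also recursively
// cluster the centroids; only the threshold increase is shown here.
object SketchPass {
  case class Centroid(var c: Array[Double], var weight: Double)

  def dist2(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def sketch(data: Iterator[Array[Double]], maxCentroids: Int, t0: Double): ArrayBuffer[Centroid] = {
    val centroids = ArrayBuffer[Centroid]()
    var threshold = t0
    for (x <- data) {
      if (centroids.isEmpty) centroids += Centroid(x.clone, 1.0)
      else {
        val nearest = centroids.minBy(c => dist2(c.c, x))
        if (math.sqrt(dist2(nearest.c, x)) > threshold) centroids += Centroid(x.clone, 1.0)
        else {
          // fold the point into the nearest centroid as a weighted mean
          val w = nearest.weight
          nearest.c = nearest.c.zip(x).map { case (ci, xi) => (ci * w + xi) / (w + 1) }
          nearest.weight = w + 1
        }
      }
      if (centroids.size > maxCentroids) threshold *= 2 // placeholder; a real sketch re-clusters the centroids here too
    }
    centroids
  }
}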

*Some notes from practical experience:*

- moderate levels of parallelism are easy since sketches can be merged
using set union.  You may want to recursively cluster the centroids at this
point if you have too many.  This is a very nice application of map-reduce.

- high degrees of parallelism require multiple levels of merge/collapse
since otherwise you wind up with a sketch nearly as large as the original
data.  If you have m parallel clusterings then m k log (N/m) can be large.
 Take a billion points and m = 1000 and k = 10,000.  The size of the
parallel sketches is mk log(N/m) = 200 x 10^6 = 0.2 N.  But with m = 100, k
= 1000, we have mk log(N/m) = 210 x 10^3 which is quite manageable.

- ball k-means on highly clusterable data often uses a cutoff for centroid
computation of 0.5 x distance to nearest.  I find that with real data,
1 x distance to nearest or even larger is much better.  The good effects
are still mostly there, but you don't need wonderful data to succeed.

- the ball k-means algorithm that we have in Mahout is pretty high quality,
but could use a bit of triangle (Elkan [2] ) speedups.
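
For reference, the triangle-inequality pruning in question boils down to: if
d(c_best, c_j) >= 2 d(x, c_best), then c_j cannot be closer to x than c_best,
so d(x, c_j) never needs to be computed. A hypothetical nearest-centroid search
using that bound (names made up; this is not the Mahout ball k-means code):

// Hypothetical nearest-centroid search with Elkan-style triangle-inequality pruning:
// if dist(centroids(best), centroids(j)) >= 2 * bestDist, centroid j cannot beat the current best.
object ElkanSearch {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // centerDist(i)(j) = distance between centroids i and j, precomputed once per iteration
  def nearest(x: Array[Double], centroids: Array[Array[Double]],
              centerDist: Array[Array[Double]]): Int = {
    var best = 0
    var bestDist = dist(x, centroids(0))
    for (j <- 1 until centroids.length) {
      if (centerDist(best)(j) < 2 * bestDist) { // only compute when the bound does not rule j out
        val d = dist(x, centroids(j))
        if (d < bestDist) { best = j; bestDist = d }
      }
    }
    best
  }
}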

*References*

[1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
[2] http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf



 
  - It can be important to use optimizations in the search for nearest
  centroid.  Consider triangle optimizations.
 
  - Do you mean parallel when you type || or is there another meaning
  there?
 

 No, i mean method called kmeans||. It's unfortunate name since I really
 don't know how to make google to search for it.

 http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

 
  - When you say ++ initialization, many people get this wrong and assume
  that you mean pick the furthest point.  Getting really good
 initialization
  is fairly difficult and typically requires more time than the actual
  clustering.  This is one of the key benefits of sketch based methods.
 
  - Most algorithms require multiple restarts.  At higher dimension the
  number of restarts required becomes very large.  An ideal implementation
  does parallel sketch extraction followed by parallel ball k-means for
  restarts.
 
 
 
  On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
   Considering porting implementation [1] and paper for KMeans || for
   Bindings.
  
   This seems like another method to map fairly nicely.
  
   The problem I am contemplating is ||-initialization, and in particular,
   centroid storage. That particular implementation assumes centroids
 could
  be
   kept in memory in front.
  
   (1) Question is, is it a dangerous idea. It doesn't seem like it
   particularly is, since unlikely people would want more k1e+6. Another
   thing, centers seem to be passed in via closure attribute (i.e.
   java-serialized array-backed matrix).However, with Bindings it is quite
   possible to keep centers at the back as a matrix.
  
   (2) obviously, LLoyd iterations are not terribly accurate. || and ++
   versions mostly speed things up. Is there any better-than-LLoyd
 accuracy
   preference?
  
  
   [1]
  
  
 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
  
 



Re: KMeans|| opinions

2014-04-04 Thread Dmitriy Lyubimov
Is there any opinion on whether either of the pipe-pipe (||) or dp-means
initializations has bona-fide advantages over the other?


On Fri, Apr 4, 2014 at 12:24 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
   - Have you considered sketch based algorithms?
  
 
  can you give me a reference. at this point i am just  contemplating more
 or
  less shameless port of what they've done in mllib.
 

 Here is the reference I used:


 http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets

 *A quick summary:*

 - a single pass over the data with a sort of dp-means [1] algorithm using a
 very fast approximate centroid search can give you k log N centroids.  This
 is the sketch.

 - clustering these centroids as weighted values gives a provably probably
 good clustering of the original data for well clusterable data.  For real
 data, it seems to work exactly as advertised.

 - clustering the sketch using ball k-means with clever initialization is
 important.  Good initialization is very expensive in terms of number of
 points so using a sketch is a really good idea.

 - the proofs all depend on Euclidean distance.

 - the threshold can start very small (what small means is determined
 empirically).  Each time we wind up with too many centroids, recursively
 clustering the centroids and possibly increasing the threshold will keep
 the number reasonably bounded.

 *Some notes from practical experience:*

 - moderate levels of parallelism are easy since sketches can be merged
 using set union.  You may want to recursively cluster the centroids at this
 point if you have too many.  This is very nice application of map-reduce.

 - high degrees of parallelism require multiple levels of merge/collapse
 since otherwise you wind up with a sketch nearly as large as the original
 data.  If you have m parallel clusterings then m k log (N/m) can be large.
  Take a billion points and m = 1000 and k = 10,000.  The size of the
 parallel sketches is mk log(N/m) = 200 x 10^6 = 0.2 N.  But with m = 100, k
 = 1000, we have mk log(N/m) = 210 x 10^3 which is quite manageable.

 - ball k-means on highly clusterable data often uses a cutoff for centroid
 computation of 0.5 x distance to nearest.  I find that with real data that
 1 x distance or even larger to nearest is much better.  The good effects
 are still mostly there, but you don't need wonderful data to succeed

 - the ball k-means algorithm that we have in Mahout is pretty high quality,
 but could use a bit of triangle (Elkan [2] ) speedups.

 *References*

 [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
 [2] http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf


 
  
   - It can be important to use optimizations in the search for nearest
   centroid.  Consider triangle optimizations.
  
   - Do you mean parallel when you type || or is there another meaning
   there?
  
 
  No, i mean method called kmeans||. It's unfortunate name since I really
  don't know how to make google to search for it.
 
  http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
 
  
   - When you say ++ initialization, many people get this wrong and assume
   that you mean pick the furthest point.  Getting really good
  initialization
   is fairly difficult and typically requires more time than the actual
   clustering.  This is one of the key benefits of sketch based methods.
  
   - Most algorithms require multiple restarts.  At higher dimension the
   number of restarts required becomes very large.  An ideal
 implementation
   does parallel sketch extraction followed by parallel ball k-means for
   restarts.
  
  
  
   On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
  
Considering porting implementation [1] and paper for KMeans || for
Bindings.
   
This seems like another method to map fairly nicely.
   
The problem I am contemplating is ||-initialization, and in
 particular,
centroid storage. That particular implementation assumes centroids
  could
   be
kept in memory in front.
   
(1) Question is, is it a dangerous idea. It doesn't seem like it
particularly is, since unlikely people would want more k1e+6.
 Another
thing, centers seem to be passed in via closure attribute (i.e.
java-serialized array-backed matrix).However, with Bindings it is
 quite
possible to keep centers at the back as a matrix.
   
(2) obviously, LLoyd iterations are not terribly accurate. || and ++
versions mostly speed things up. Is there any better-than-LLoyd
  accuracy
preference?
   
   
[1]
   
   
  
 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
   
  
 



Re: KMeans|| opinions

2014-04-04 Thread Ted Dunning
Pipe-pipe is an initialization algorithm which (roughly) parallelizes
something like the ++ algorithm and requires logarithmically many parallel
passes over the data as opposed to the k passes required for ++.
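
Roughly, each || round does one pass over the data and samples points with
probability proportional to their squared distance to the current centers,
oversampled by a factor l; the accumulated centers are then weighted and
reduced to k with something like k-means++. A sequential toy sketch of that
loop follows (plain Scala, names hypothetical, not the MLlib implementation):

import scala.util.Random

// Hypothetical sequential sketch of k-means|| initialization. The point in practice is that each
// round is a single parallel pass over the data; the final weighted re-clustering of `centers`
// down to k via k-means++ is omitted here.
object KMeansParallelInit {
  def dist2(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def init(data: Array[Array[Double]], l: Double, rounds: Int, rng: Random): Vector[Array[Double]] = {
    var centers = Vector(data(rng.nextInt(data.length)))           // start from one uniformly sampled point
    for (_ <- 0 until rounds) {                                    // roughly a logarithmic number of rounds
      val costs = data.map(x => centers.map(c => dist2(c, x)).min) // cost of each point w.r.t. current centers
      val total = costs.sum
      val sampled = data.zip(costs).collect {
        case (x, cost) if rng.nextDouble() < l * cost / total => x // sample independently, prob ~ l * d^2 / phi
      }
      centers ++= sampled
    }
    centers // typically on the order of l * rounds centers; recluster these to k as the next step
  }
}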

DP-means itself is a variant of k-means where the value of k is not fixed.
 Instead, whenever a data-point is further than a threshold \lambda from
the nearest centroid, it is used to form a new centroid.  Different choices
of \lambda result in different numbers of clusters.  I don't know how well
DP-means by itself would work as an initialization pass.

Streaming k-means uses a variant of dp-means to get a sketch of the data.
 This sketch has many more centroids than the final desired number of
clusters k.  As such, we don't have to worry much about the quality of the
initialization of this first dp-means pass.  The sketch is then clustered
using something like k-means++ and ball k-means iteration.
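
A toy sketch of a single ball k-means update over the weighted sketch points,
with the trim factor (the 0.5x to 1x cutoff discussed elsewhere in this thread)
as a parameter; plain Scala, names hypothetical, not the Mahout implementation:

// Hypothetical single ball k-means step: assign each (weighted) point to its nearest centroid,
// then recompute every centroid from only the points lying within `trim` times the distance to
// that centroid's nearest other centroid.
object BallKMeansStep {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def step(points: Array[(Array[Double], Double)],   // (vector, weight) pairs, e.g. the sketch centroids
           centroids: Array[Array[Double]], trim: Double): Array[Array[Double]] = {
    val assign = points.map { case (x, _) => centroids.indices.minBy(i => dist(x, centroids(i))) }
    centroids.indices.map { i =>
      val nearestOther = centroids.indices.filter(_ != i).map(j => dist(centroids(i), centroids(j))).min
      val ball = points.zip(assign).collect {
        case ((x, w), a) if a == i && dist(x, centroids(i)) <= trim * nearestOther => (x, w)
      }
      if (ball.isEmpty) centroids(i)
      else {
        val wSum = ball.map(_._2).sum
        centroids(i).indices.map(d => ball.map { case (x, w) => x(d) * w }.sum / wSum).toArray
      }
    }.toArray
  }
}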

As I see it, having a good streaming k-means block should make any
clustering algorithm better as long as the sketch is much smaller than the
original data.

The second phase of clustering after streaming k-means can benefit from any
technique that makes clustering better.  The caveat is just that streaming
k-means is likely to produce a small enough sketch that parallel clustering
may not help very much.  Exactly where that trade-off applies depends a lot
on the parallel framework available. Spark and h2o have much lower task
initiation costs so they will definitely be able to get parallel speedup on
much smaller datasets.




On Fri, Apr 4, 2014 at 11:29 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 is there any opinion whether one of pipe-pipe (||), dp-means
 initializations has bona-fide advantages  one over another?


 On Fri, Apr 4, 2014 at 12:24 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
   On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning ted.dunn...@gmail.com
   wrote:
  
- Have you considered sketch based algorithms?
   
  
   can you give me a reference. at this point i am just  contemplating
 more
  or
   less shameless port of what they've done in mllib.
  
 
  Here is the reference I used:
 
 
 
 http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets
 
  *A quick summary:*
 
  - a single pass over the data with a sort of dp-means [1] algorithm
 using a
  very fast approximate centroid search can give you k log N centroids.
  This
  is the sketch.
 
  - clustering these centroids as weighted values gives a provably probably
  good clustering of the original data for well clusterable data.  For real
  data, it seems to work exactly as advertised.
 
  - clustering the sketch using ball k-means with clever initialization is
  important.  Good initialization is very expensive in terms of number of
  points so using a sketch is a really good idea.
 
  - the proofs all depend on Euclidean distance.
 
  - the threshold can start very small (what small means is determined
  empirically).  Each time we wind up with too many centroids, recursively
  clustering the centroids and possibly increasing the threshold will keep
  the number reasonably bounded.
 
  *Some notes from practical experience:*
 
  - moderate levels of parallelism are easy since sketches can be merged
  using set union.  You may want to recursively cluster the centroids at
 this
  point if you have too many.  This is very nice application of map-reduce.
 
  - high degrees of parallelism require multiple levels of merge/collapse
  since otherwise you wind up with a sketch nearly as large as the original
  data.  If you have m parallel clusterings then m k log (N/m) can be
 large.
   Take a billion points and m = 1000 and k = 10,000.  The size of the
  parallel sketches is mk log(N/m) = 200 x 10^6 = 0.2 N.  But with m =
 100, k
  = 1000, we have mk log(N/m) = 210 x 10^3 which is quite manageable.
 
  - ball k-means on highly clusterable data often uses a cutoff for
 centroid
  computation of 0.5 x distance to nearest.  I find that with real data
 that
  1 x distance or even larger to nearest is much better.  The good effects
  are still mostly there, but you don't need wonderful data to succeed
 
  - the ball k-means algorithm that we have in Mahout is pretty high
 quality,
  but could use a bit of triangle (Elkan [2] ) speedups.
 
  *References*
 
  [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
  [2] http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf
 
 
  
   
- It can be important to use optimizations in the search for nearest
centroid.  Consider triangle optimizations.
   
- Do you mean parallel when you type || or is there another meaning
there?
   
  
   No, i mean method called kmeans||. It's unfortunate name since I
 really
   don't know how to make google to search for it.
  
   http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
  
   
- When you say ++ initialization, 

Re: Mail and IRC parsing

2014-04-04 Thread Grant Ingersoll
We've (LucidWorks) got full indexing and search of the Mahout mail archives at 
http://find.searchub.org.  We could probably add in IRC pretty easily if you 
want.

-Grant

On Mar 22, 2014, at 2:06 AM, Andrew Musselman andrew.mussel...@gmail.com 
wrote:

 I put up a parser for the IRC history logs here
 https://github.com/andrewmusselman/util/blob/master/irc-parser.sh
 
 I'd like to write one for the user list too to figure out the most common
 problems/questions so we can focus effort on repairs to bugs and docs.
 
 But the mail archives at
 https://mail-archives.apache.org/mod_mbox/mahout-user/ are dynamic, loaded
 in through JavaScript, so parsing them isn't that straightforward.
 
 Is it possible to get the mbox files directly?


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Board Report

2014-04-04 Thread Grant Ingersoll
Can someone summarize the 0xData and the Spark work for me for the board 
report?  I've unfortunately been too busy to keep up on the threads on it, but 
need to write the board report for this month.

You can either summarize here or add it to the community section at 
https://svn.apache.org/repos/asf/mahout/pmc/board-reports/2014/board-report-apr.txt

Also, assuming we are going ahead w/ the 0xData stuff, we likely need to do a 
software grant for that.

Thanks,
Grant


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: Mail and IRC parsing

2014-04-04 Thread Andrew Musselman
Could be useful; in the meantime I found the mail files on people.apache.org 
and can use those.

 On Apr 4, 2014, at 8:39 AM, Grant Ingersoll gsing...@apache.org wrote:
 
 We've (LucidWorks) got full indexing and search of the Mahout mail archives 
 at http://find.searchub.org.  We could probably add in IRC pretty easily if 
 you want.
 
 -Grant
 
 On Mar 22, 2014, at 2:06 AM, Andrew Musselman andrew.mussel...@gmail.com 
 wrote:
 
 I put up a parser for the IRC history logs here
 https://github.com/andrewmusselman/util/blob/master/irc-parser.sh
 
 I'd like to write one for the user list too to figure out the most common
 problems/questions so we can focus effort on repairs to bugs and docs.
 
 But the mail archives at
 https://mail-archives.apache.org/mod_mbox/mahout-user/ are dynamic, loaded
 in through JavaScript, so parsing them isn't that straightforward.
 
 Is it possible to get the mbox files directly?
 
 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com
 
 
 
 
 


[jira] [Updated] (MAHOUT-1482) Rework quickstart website

2014-04-04 Thread jian wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jian wang updated MAHOUT-1482:
--

Attachment: examples.patch

 Rework quickstart website
 -

 Key: MAHOUT-1482
 URL: https://issues.apache.org/jira/browse/MAHOUT-1482
 Project: Mahout
  Issue Type: Improvement
  Components: Documentation
Reporter: Sebastian Schelter
Priority: Blocker
 Fix For: 1.0

 Attachments: MAHOUT-1482-add-maven-dependency.patch, examples.patch


 Our quickstart website must be much more helpful.
 We should describe how to add Mahout as maven dependency to a Java project 
 and how to start working on the source code.
 https://mahout.apache.org/users/basics/quickstart.html



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: KMeans|| opinions

2014-04-04 Thread Dmitriy Lyubimov
On Fri, Apr 4, 2014 at 4:26 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 Pipe-pipe is an initialization algorithm which (roughly) parallelizes
 something like the ++ algorithm and requires logarithmically many parallel
 passes over the data as opposed to the k passes required for ++.


I'll re-read it again. The way I understood the paper (and the code),
pipe-pipe actually uniformly subsamples the data to bring in a
limited number of points in each pass, which is O(k) not O(n). This is not
the same as going over each point and computing distance-based probability
metrics for it, if that's what you imply by the comparison to ++.
DP-means would have to do subsampling as well.


 DP-means itself is a variant of k-means where the value of k is not fixed.
  Instead, whenever a data-point is further than a threshold \lambda from
 the nearest centroid, it is used to form a new centroid.  Different choices
 of \lambda result in different numbers of clusters.  I don't know how well
 DP-means by itself would work as an initialization pass.

 Streaming k-means uses a variant of dp-means to get a sketch of the data.
  This sketch has many more centroids than the final desired number of
 clusters k.  As such, we don't have to worry much about the quality of the
 initialization of this first dp-means pass.  The sketch is then clustered
 using something like k-means++ and ball k-means iteration.

 As I see it, having a good streaming k-means block should make any
 clustering algorithm better as long as the sketch is much smaller than the
 original data.

 The second phase of clustering after streaming k-means can benefit from any
 technique that makes clustering better.  The caveat is just that streaming
 k-means is likely to produce a small enough sketch that parallel clustering
 may not help very much.  Exactly where that trade-off applies depends a lot
 on the parallel framework available. Spark and h2o have much lower task
 initiation costs so they will definitely be able to get parallel speedup on
 much smaller datasets.




 On Fri, Apr 4, 2014 at 11:29 AM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  is there any opinion whether one of pipe-pipe (||), dp-means
  initializations has bona-fide advantages  one over another?
 
 
  On Fri, Apr 4, 2014 at 12:24 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
   On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
  
On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
   
 - Have you considered sketch based algorithms?

   
can you give me a reference. at this point i am just  contemplating
  more
   or
less shameless port of what they've done in mllib.
   
  
   Here is the reference I used:
  
  
  
 
 http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets
  
   *A quick summary:*
  
   - a single pass over the data with a sort of dp-means [1] algorithm
  using a
   very fast approximate centroid search can give you k log N centroids.
   This
   is the sketch.
  
   - clustering these centroids as weighted values gives a provably
 probably
   good clustering of the original data for well clusterable data.  For
 real
   data, it seems to work exactly as advertised.
  
   - clustering the sketch using ball k-means with clever initialization
 is
   important.  Good initialization is very expensive in terms of number of
   points so using a sketch is a really good idea.
  
   - the proofs all depend on Euclidean distance.
  
   - the threshold can start very small (what small means is determined
   empirically).  Each time we wind up with too many centroids,
 recursively
   clustering the centroids and possibly increasing the threshold will
 keep
   the number reasonably bounded.
  
   *Some notes from practical experience:*
  
   - moderate levels of parallelism are easy since sketches can be merged
   using set union.  You may want to recursively cluster the centroids at
  this
   point if you have too many.  This is very nice application of
 map-reduce.
  
   - high degrees of parallelism require multiple levels of merge/collapse
   since otherwise you wind up with a sketch nearly as large as the
 original
   data.  If you have m parallel clusterings then m k log (N/m) can be
  large.
Take a billion points and m = 1000 and k = 10,000.  The size of the
   parallel sketches is mk log(N/m) = 200 x 10^6 = 0.2 N.  But with m =
  100, k
   = 1000, we have mk log(N/m) = 210 x 10^3 which is quite manageable.
  
   - ball k-means on highly clusterable data often uses a cutoff for
  centroid
   computation of 0.5 x distance to nearest.  I find that with real data
  that
   1 x distance or even larger to nearest is much better.  The good
 effects
   are still mostly there, but you don't need wonderful data to succeed
  
   - the ball k-means algorithm that we have in Mahout is pretty high
  quality,
   but could use a bit of triangle (Elkan [2] ) speedups.
  
   

Re: Board Report

2014-04-04 Thread SriSatish Ambati
Grant,
On 0xdata / H2O front:

We are very excited about making Apache Mahout the principal platform for
scalable machine learning and are rapidly prototyping an initial
integration with the Matrix API. Ted (apache.org), Cliff Click (
acm.org/0xdata), Anand Avati (Redhat) and Michal Malohava (0xdata) are
heads down on that and making brisk progress. We hope to get the discussions
restarted in the JIRAs and google hangouts as soon as we get past the first
cut.

We also chose to have the first-level integration with Mahout be as a
maven dependency.
That way we can flesh things out without major interruption and ahead of the
grant work.

In parallel, several members and teams have been reworking the core
architecture to get a clean separation between the Algorithms & Core, an
in-memory (mr/task) API, and a decent client framework with data read/write.
This will allow Apache Mahout and other ML libraries to use Spark,
Stratosphere or other engines for performance and extensibility.

This is the state of the union at the moment.
I'm very enthusiastic about making this a win for the ardent community of
Machine Learning users and developers.
We are very grateful for the warmth, welcome, attention and impassioned
reviews we received from the Apache community.  Thank you for that.
We should have more to report in the month ahead.

Looking forward, Sri



On Fri, Apr 4, 2014 at 6:44 AM, Grant Ingersoll gsing...@apache.org wrote:

 Can someone summarize the 0xData and the Spark work for me for the board
 report?  I've unfortunately been too busy to keep up on the threads on it,
 but need to write the board report for this month.

 You can either summarize here or add it to the community section at
 https://svn.apache.org/repos/asf/mahout/pmc/board-reports/2014/board-report-apr.txt

 Also, assuming we are going ahead w/ the 0xData stuff, we likely need to
 do a software grant for that.

 Thanks,
 Grant

 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com








-- 
ceo & co-founder, 0xdata Inc (http://www.0xdata.com/)
+1-408.316.8192


Re: Board Report

2014-04-04 Thread Dmitriy Lyubimov
Grant,

Spark Bindings issues (done/in-progress) are summarized here:
http://mahout.apache.org/users/sparkbindings/home.html.

As it stands, there is 1 platform issue, of which the core functionality is
completed/committed, and 5 in-progress issues at various stages of ETA.

The platform issue brought completion of 5 algorithms as POC guinea pigs.

All but one of the in-progress issues are about augmenting the algorithm set.

One WIP issue is of significant importance on the platform side (bringing
in Data Frame-like semantics to distributed datasets), but its design is fairly
vague and under-developed right now. The difficulty there is
developing a solid concept around it rather than the coding itself. So it is
at the brainstorming stage rather than something concrete.

Adding one more issue in the clustering area is being contemplated in the near term.

There are 3 independent contributors working on these issues. The work is
unpaid, though, meaning ETAs may well vary. Some issues have been given ETAs by
the contributors, but most of the in-progress work has not.


On Fri, Apr 4, 2014 at 6:44 AM, Grant Ingersoll gsing...@apache.org wrote:

 Can someone summarize the 0xData and the Spark work for me for the board
 report?  I've unfortunately been too busy to keep up on the threads on it,
 but need to write the board report for this month.

 You can either summarize here or add it to the community section at
 https://svn.apache.org/repos/asf/mahout/pmc/board-reports/2014/board-report-apr.txt

 Also, assuming we are going ahead w/ the 0xData stuff, we likely need to
 do a software grant for that.

 Thanks,
 Grant

 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com








Re: KMeans|| opinions

2014-04-04 Thread Dmitriy Lyubimov
On Fri, Apr 4, 2014 at 8:44 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:




 On Fri, Apr 4, 2014 at 4:26 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 Pipe-pipe is an initialization algorithm which (roughly) parallelizes
 something like the ++ algorithm and requires logarithmically many parallel
 passes over the data as opposed to the k passes required for ++.


 i'll re-read it again. The way i understood the paper (and the code),
 pipe-pipe  actually uniformly subsamples data in each pass to bring in a
 limited number of points in each pass which is O(k) not O(n). This is not
 the same as going over each point computing distance-based probability
 metrics for them, if that's what you imply by going for ++ comparison.
 DP-means would have to do subsampling as well.

Retracting that. They do full passes over the data with full point-cost
recomputations each time.



 DP-means itself is a variant of k-means where the value of k is not fixed.
  Instead, whenever a data-point is further than a threshold \lambda from
 the nearest centroid, it is used to form a new centroid.  Different
 choices
 of \lambda result in different numbers of clusters.  I don't know how well
 DP-means by itself would work as an initialization pass.

 Streaming k-means uses a variant of dp-means to get a sketch of the data.
  This sketch has many more centroids than the final desired number of
 clusters k.  As such, we don't have to worry much about the quality of the
 initialization of this first dp-means pass.  The sketch is then clustered
 using something like k-means++ and ball k-means iteration.

 As I see it, having a good streaming k-means block should make any
 clustering algorithm better as long as the sketch is much smaller than the
 original data.

 The second phase of clustering after streaming k-means can benefit from
 any
 technique that makes clustering better.  The caveat is just that streaming
 k-means is likely to produce a small enough sketch that parallel
 clustering
 may not help very much.  Exactly where that trade-off applies depends a
 lot
 on the parallel framework available. Spark and h2o have much lower task
 initiation costs so they will definitely be able to get parallel speedup
 on
 much smaller datasets.




 On Fri, Apr 4, 2014 at 11:29 AM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  is there any opinion whether one of pipe-pipe (||), dp-means
  initializations has bona-fide advantages  one over another?
 
 
  On Fri, Apr 4, 2014 at 12:24 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
   On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
  
On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning ted.dunn...@gmail.com
 
wrote:
   
 - Have you considered sketch based algorithms?

   
can you give me a reference. at this point i am just  contemplating
  more
   or
less shameless port of what they've done in mllib.
   
  
   Here is the reference I used:
  
  
  
 
 http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets
  
   *A quick summary:*
  
   - a single pass over the data with a sort of dp-means [1] algorithm
  using a
   very fast approximate centroid search can give you k log N centroids.
   This
   is the sketch.
  
   - clustering these centroids as weighted values gives a provably
 probably
   good clustering of the original data for well clusterable data.  For
 real
   data, it seems to work exactly as advertised.
  
   - clustering the sketch using ball k-means with clever initialization
 is
   important.  Good initialization is very expensive in terms of number
 of
   points so using a sketch is a really good idea.
  
   - the proofs all depend on Euclidean distance.
  
   - the threshold can start very small (what small means is determined
   empirically).  Each time we wind up with too many centroids,
 recursively
   clustering the centroids and possibly increasing the threshold will
 keep
   the number reasonably bounded.
  
   *Some notes from practical experience:*
  
   - moderate levels of parallelism are easy since sketches can be merged
   using set union.  You may want to recursively cluster the centroids at
  this
   point if you have too many.  This is very nice application of
 map-reduce.
  
   - high degrees of parallelism require multiple levels of
 merge/collapse
   since otherwise you wind up with a sketch nearly as large as the
 original
   data.  If you have m parallel clusterings then m k log (N/m) can be
  large.
Take a billion points and m = 1000 and k = 10,000.  The size of the
   parallel sketches is mk log(N/m) = 200 x 10^6 = 0.2 N.  But with m =
  100, k
   = 1000, we have mk log(N/m) = 210 x 10^3 which is quite manageable.
  
   - ball k-means on highly clusterable data often uses a cutoff for
  centroid
   computation of 0.5 x distance to nearest.  I find that with real data
  that
   1 x distance or even larger to nearest is much better.  The good
 effects
   are still mostly there, but you 

Re: Board Report

2014-04-04 Thread Ted Dunning
To add to Sri's comments:

A prototype implementation of the Mahout Matrix and Vector interfaces with an
h2o back-end is in progress outside of Mahout, with the intention of using
it as a test-bed for developing implementations of several Mahout
algorithms such as SSVD.  This code is intended for contribution if the
objections of one committer are overcome by the concrete results of the
prototype.



On Fri, Apr 4, 2014 at 6:47 PM, SriSatish Ambati srisat...@0xdata.comwrote:

 Grant,
 On 0xdata / H2O front:

 We feel very excited at making Apache Mahout the principal platform for
 scalable machine learning and are rapidly prototyping an initial
 integration with the Matrix API. Ted (apache.org), Cliff Click (
 acm.org/0xdata), Anand Avati (Redhat) and Michal Malohava (0xdata) are
 heads down on that  making brisk progress. We hope to get the discussions
 restarted in the JIRAs and google hangouts as soon as we get past the first
 cut .

 We also chose to have the first level integration with Mahout will be as a
 maven dependency -
 That way we can flesh things out without major interruption and the grant
 work.

 In parallel, several members and teams have been reworking the core
 architecture to get a clean separation on the Algorithms  Core, an
 in-memory (mr/task) API and a decent client framework with data read/write.
 This will allow Apache Mahout and other ML libraries to use Spark,
 Stratosphere or other engines for performance and extensibility.

 This is the state of the union at the moment -
 I'm very enthusiastic at making this a win for the ardent Community of
 Machine Learning users and developers.
 We are very grateful for the warmth, welcome, attention and impassionate
 reviews we received from the Apache community.  Thank you for that.
 We should have more to report in the month ahead.

 Looking forward, Sri



 On Fri, Apr 4, 2014 at 6:44 AM, Grant Ingersoll gsing...@apache.org
 wrote:

  Can someone summarize the 0xData and the Spark work for me for the board
  report?  I've unfortunately been too busy to keep up on the threads on
 it,
  but need to write the board report for this month.
 
  You can either summarize here or add it to the community section at
 
 https://svn.apache.org/repos/asf/mahout/pmc/board-reports/2014/board-report-apr.txt
 
  Also, assuming we are going ahead w/ the 0xData stuff, we likely need to
  do a software grant for that.
 
  Thanks,
  Grant
 
  
  Grant Ingersoll | @gsingers
  http://www.lucidworks.com
 
 
 
 
 
 


 --
 ceo & co-founder, 0xdata Inc (http://www.0xdata.com/)
 +1-408.316.8192



Re: KMeans|| opinions

2014-04-04 Thread Ted Dunning
On Fri, Apr 4, 2014 at 5:44 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 On Fri, Apr 4, 2014 at 4:26 AM, Ted Dunning ted.dunn...@gmail.com wrote:

  Pipe-pipe is an initialization algorithm which (roughly) parallelizes
  something like the ++ algorithm and requires logarithmically many
 parallel
  passes over the data as opposed to the k passes required for ++.
 

 i'll re-read it again. The way i understood the paper (and the code),
 pipe-pipe  actually uniformly subsamples data in each pass to bring in a
 limited number of points in each pass which is O(k) not O(n). This is not
 the same as going over each point computing distance-based probability
 metrics for them, if that's what you imply by going for ++ comparison.


I don't understand your point here.  My original point was that || makes a
logarithmic number (log N or log k, I don't know which) of passes over the
data (a direct quote from the paper) while ++ makes O(k) passes over the
data.  Such passes are often I/O bound.

Each ++ pass is made with the goal of selecting a new initial point.  This
is a sampling procedure based on all the previous initial points.  This
sounds very much like || except for the number of passes.

So my point is O(log mumble) versus O(k).  You say that pipe-pipe samples
data.  You say that is different than ++ sampling data.

Is it the uniform versus non-uniform sampling that you are talking about?
That hardly seems important compared to the difference in the number of passes.




DP-means would have to do subsampling as well.



[jira] [Commented] (MAHOUT-1507) Support External/Foreign Keys/IDs for Vectors and Matrices

2014-04-04 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13960837#comment-13960837
 ] 

Pat Ferrel commented on MAHOUT-1507:


OK, the desired behavior is to support a method to preserve user-specified 
arbitrary keys for row and column vectors. Put another way, preserve IDs for 
rows and columns of a DRM. Below, the terms ID and Key are used interchangeably. 

For example, the DRM created for the item-based recommender has rows of users 
and columns of items. If the user-specified IDs could be preserved through to the 
output, this would make usage of the recommender much easier.

Likewise for kmeans or other clustering: if user-specified row IDs were preserved 
(possibly document IDs in text clustering), this would make usage much easier. 

By preserve I mean that the user-specified keys would be available in the 
sequence file output of the appropriate job. Failing this, a dictionary of 
user-specified key -> Mahout key could be created for the output. In the case of a DRM 
as output, if the row vector were stored with its user-specified key and a 
dictionary of the user-specified column keys were created, then the user would 
not have to implement this behavior for every Mahout job.

Partial examples of this behavior in Mahout 0.9 are provided by Named or 
Property Vectors. Also, the text analysis pipeline creates a dictionary of word 
token -> Mahout column ID. This dictionary has to be created by Mahout since 
the tokens are created as a consequence of phases in the job.

I should emphasize that this ID or Key translation is an absolute requirement 
for Mahout use (except where the partial solution is enough). Why put the 
burden for this on every user?
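
To make the translation burden concrete, here is a minimal sketch (plain Scala,
hypothetical names, not an existing Mahout class) of the kind of external-key
to Mahout-int-key dictionary every user currently ends up writing and inverting
around a job:

// Hypothetical sketch of the ID translation described above: map arbitrary external IDs to the
// sequential int keys Mahout expects, run the job, then translate the output back.
class IdDictionary[K] {
  private val toInt = scala.collection.mutable.HashMap[K, Int]()
  private val fromInt = scala.collection.mutable.ArrayBuffer[K]()

  def index(key: K): Int = toInt.getOrElseUpdate(key, { fromInt += key; fromInt.length - 1 })
  def key(i: Int): K = fromInt(i)
  def size: Int = fromInt.length
}

// Usage sketch: users -> row indices and items -> column indices before building a DRM,
// then rows.key(i) / cols.key(j) to label the recommender's output afterwards.
// val rows = new IdDictionary[String](); val cols = new IdDictionary[String]()
// val (i, j) = (rows.index("pat"), cols.index("iPad"))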

 Support External/Foreign Keys/IDs for Vectors and Matrices
 --

 Key: MAHOUT-1507
 URL: https://issues.apache.org/jira/browse/MAHOUT-1507
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.9
 Environment: Spark Scala
Reporter: Pat Ferrel
  Labels: spark
 Fix For: 1.0


 All users of Mahout have data which is addressed by keys or IDs of their own 
 devise. In order to use much of Mahout they must translate these IDs into 
 Mahout IDs, then run their jobs and translate back again when retrieving the 
 output. If the ID space is very large this is a difficult problem for users 
 to solve at scale.
 For many Mahout operations this would not be necessary if these external keys 
 could be maintained for vectors and dimensions, or for rows and columns of a 
 DRM.
 The reason I bring this up now is that much groundwork is being laid for 
 Mahout's future on Spark so getting this notion in early could be 
 fundamentally important and used to build on.
 If external IDs for rows and columns were maintained then RSJ, DRM Transpose 
 (and other DRM ops), vector extraction, clustering, and recommenders would 
 need no ID translation steps, a big user win.
 A partial solution might be to support external row IDs alone somewhat like 
 the NamedVector and PropertyVector in the Mahout hadoop code.
 On Apr 3, 2014, at 11:00 AM, Pat Ferrel p...@occamsmachete.com wrote:
 Perhaps this is best phrased as a feature request.
 On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 PS.
 sequence file keys have also special meaning if they are Ints. .E.g. A'
 physical operator requires keys to be ints, in which case it interprets
 them as row indexes that become column indexes. This of course isn't always
 the case, e.g. (Aexpr).t %*% Aexpr doesn't require int indices because in
 reality optimizer will never choose actual transposition as a physical step
 in such pipeline. This interpretation is consistent with interpretation of
 long-existing Hadoop-side DistributedRowMatrix#transpose.
 On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel p...@occamsmachete.com wrote:
 On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 I think this duality, names and keys, is not very healthy really, and
 just
 creates additional hassle. Spark drm takes care of keys automatically
 throughout, but propagating names from name vectors is solely an algorithm
 concern as it stands.
 Not sure what you mean.
 Not what you think, it looks like.
 I mean that Mahout DRM structure is a bag of (key -> Vector) pairs. When
 persisted, key goes to the key of a sequence file. In particular, it means
 that there is a case of Bag[key -> NamedVector]. Which means, external
 anchor could be saved to either key or name of a row. In practice it causes
 compatibility mess, e.g. we saw those numerous cases where e.g. seq2sparse
 saves external keys (file paths) into  key, whereas e.g. clustering
 algorithms are not seeing them because they 

[jira] [Commented] (MAHOUT-1507) Support External/Foreign Keys/IDs for Vectors and Matrices

2014-04-04 Thread Andrew Musselman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13960958#comment-13960958
 ] 

Andrew Musselman commented on MAHOUT-1507:
--

I agree with Pat that this is a big deal; maintaining a custom mapping from 
arbitrary ids to sequential integers is a hurdle that prevents people from 
adopting the project.

 Support External/Foreign Keys/IDs for Vectors and Matrices
 --

 Key: MAHOUT-1507
 URL: https://issues.apache.org/jira/browse/MAHOUT-1507
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.9
 Environment: Spark Scala
Reporter: Pat Ferrel
  Labels: spark
 Fix For: 1.0


 All users of Mahout have data which is addressed by keys or IDs of their own 
 devise. In order to use much of Mahout they must translate these IDs into 
 Mahout IDs, then run their jobs and translate back again when retrieving the 
 output. If the ID space is very large this is a difficult problem for users 
 to solve at scale.
 For many Mahout operations this would not be necessary if these external keys 
 could be maintained for vectors and dimensions, or for rows and columns of a 
 DRM.
 The reason I bring this up now is that much groundwork is being laid for 
 Mahout's future on Spark so getting this notion in early could be 
 fundamentally important and used to build on.
 If external IDs for rows and columns were maintained then RSJ, DRM Transpose 
 (and other DRM ops), vector extraction, clustering, and recommenders would 
 need no ID translation steps, a big user win.
 A partial solution might be to support external row IDs alone somewhat like 
 the NamedVector and PropertyVector in the Mahout hadoop code.
 On Apr 3, 2014, at 11:00 AM, Pat Ferrel p...@occamsmachete.com wrote:
 Perhaps this is best phrased as a feature request.
 On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 PS.
 sequence file keys have also special meaning if they are Ints. .E.g. A'
 physical operator requires keys to be ints, in which case it interprets
 them as row indexes that become column indexes. This of course isn't always
 the case, e.g. (Aexpr).t %*% Aexpr doesn't require int indices because in
 reality optimizer will never choose actual transposition as a physical step
 in such pipeline. This interpretation is consistent with interpretation of
 long-existing Hadoop-side DistributedRowMatrix#transpose.
 On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel p...@occamsmachete.com wrote:
 On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 I think this duality, names and keys, is not very healthy really, and
 just
 creates additional hassle. Spark drm takes care of keys automatically
 throughout, but propagating names from name vectors is solely an algorithm
 concern as it stands.
 Not sure what you mean.
 Not what you think, it looks like.
 I mean that Mahout DRM structure is a bag of (key -> Vector) pairs. When
 persisted, key goes to the key of a sequence file. In particular, it means
 that there is a case of Bag[key -> NamedVector]. Which means, external
 anchor could be saved to either key or name of a row. In practice it causes
 compatibility mess, e.g. we saw those numerous cases where e.g. seq2sparse
 saves external keys (file paths) into  key, whereas e.g. clustering
 algorithms are not seeing them because they expect them to be the name part
 of the vector. I am just saying we have two ways to name the rows, and it
 is generally not a healthy choice for the aforementioned reason.
 In my experience Names and Properties are primarily used to store
 external keys, which are quite healthy.
 Users never have data with Mahout keys, they must constantly go back and
 forth. This is exactly what the R data frame does, no? I'm not so concerned
 with being able to address an element by the external key
 drmB["pat"]["iPad"] like a HashMap. But it would sure be nice to have the
 external ids follow the data through any calculation that makes sense.
 I am with you on this.
 This would mean clustering, recommendations, transpose, RSJ would require
 no id transforming steps. This would make dealing with Mahout much easier.
 Data frames is a little bit a different thing, right now we work just with
 matrices. Although, yes, our in-core matrices support row and column names
 (just like in R) and distributed matrices support row keys only.  what i
 mean is that algebraic expression e.g.
 Aexpr %*% Bexpr will automatically propagate _keys_ from Aexpr as implied
 above, but not necessarily named vectors, because internally algorithms
 blockify things into matrix blocks, and i am far from sure that Mahout
 in-core stuff works correctly with named vectors as 

[jira] [Commented] (MAHOUT-1507) Support External/Foreign Keys/IDs for Vectors and Matrices

2014-04-04 Thread Andrew Musselman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13960962#comment-13960962
 ] 

Andrew Musselman commented on MAHOUT-1507:
--

Maybe prevents is strong but it's another step that would be nice to remove 
for adoptees.

 Support External/Foreign Keys/IDs for Vectors and Matrices
 --

 Key: MAHOUT-1507
 URL: https://issues.apache.org/jira/browse/MAHOUT-1507
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.9
 Environment: Spark Scala
Reporter: Pat Ferrel
  Labels: spark
 Fix For: 1.0


 All users of Mahout have data which is addressed by keys or IDs of their own 
 devise. In order to use much of Mahout they must translate these IDs into 
 Mahout IDs, then run their jobs and translate back again when retrieving the 
 output. If the ID space is very large this is a difficult problem for users 
 to solve at scale.
 For many Mahout operations this would not be necessary if these external keys 
 could be maintained for vectors and dimensions, or for rows and columns of a 
 DRM.
 The reason I bring this up now is that much groundwork is being laid for 
 Mahout's future on Spark so getting this notion in early could be 
 fundamentally important and used to build on.
 If external IDs for rows and columns were maintained then RSJ, DRM Transpose 
 (and other DRM ops), vector extraction, clustering, and recommenders would 
 need no ID translation steps, a big user win.
 A partial solution might be to support external row IDs alone somewhat like 
 the NamedVector and PropertyVector in the Mahout hadoop code.
 On Apr 3, 2014, at 11:00 AM, Pat Ferrel p...@occamsmachete.com wrote:
 Perhaps this is best phrased as a feature request.
 On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 PS.
 sequence file keys have also special meaning if they are Ints. .E.g. A'
 physical operator requires keys to be ints, in which case it interprets
 them as row indexes that become column indexes. This of course isn't always
 the case, e.g. (Aexpr).t %*% Aexpr doesn't require int indices because in
 reality optimizer will never choose actual transposition as a physical step
 in such pipeline. This interpretation is consistent with interpretation of
 long-existing Hadoop-side DistributedRowMatrix#transpose.
 On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel p...@occamsmachete.com wrote:
 On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 I think this duality, names and keys, is not very healthy really, and
 just
 creates additional hassle. Spark drm takes care of keys automatically
 throughout, but propagating names from name vectors is solely an algorithm
 concern as it stands.
 Not sure what you mean.
 Not what you think, it looks like.
 I mean that Mahout DRM structure is a bag of (key -> Vector) pairs. When
 persisted, key goes to the key of a sequence file. In particular, it means
 that there is a case of Bag[key -> NamedVector]. Which means, external
 anchor could be saved to either key or name of a row. In practice it causes
 compatibility mess, e.g. we saw those numerous cases where e.g. seq2sparse
 saves external keys (file paths) into  key, whereas e.g. clustering
 algorithms are not seeing them because they expect them to be the name part
 of the vector. I am just saying we have two ways to name the rows, and it
 is generally not a healthy choice for the aforementioned reason.
 In my experience Names and Properties are primarily used to store
 external keys, which are quite healthy.
 Users never have data with Mahout keys, they must constantly go back and
 forth. This is exactly what the R data frame does, no? I'm not so concerned
 with being able to address an element by the external key
 drmB["pat"]["iPad"] like a HashMap. But it would sure be nice to have the
 external ids follow the data through any calculation that makes sense.
 I am with you on this.
 This would mean clustering, recommendations, transpose, RSJ would require
 no id transforming steps. This would make dealing with Mahout much easier.
 Data frames is a little bit a different thing, right now we work just with
 matrices. Although, yes, our in-core matrices support row and column names
 (just like in R) and distributed matrices support row keys only.  what i
 mean is that algebraic expression e.g.
 Aexpr %*% Bexpr will automatically propagate _keys_ from Aexpr as implied
 above, but not necessarily named vectors, because internally algorithms
 blockify things into matrix blocks, and i am far from sure that Mahout
 in-core stuff works correctly with named vectors as part of a matrix block
 in all situations. I may be wrong. I always relied on