Re: KMeans|| opinions

2014-04-03 Thread Dmitriy Lyubimov
On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning  wrote:

> - Have you considered sketch-based algorithms?
>

Can you give me a reference? At this point I am just contemplating a more or
less shameless port of what they've done in MLlib.


>
> - It can be important to use optimizations in the search for nearest
> centroid.  Consider triangle optimizations.
>
> - Do you mean "parallel" when you type || or is there another meaning
> there?
>

No, I mean the method called "kmeans||". It's an unfortunate name, since I
really don't know how to make Google search for it.

http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
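
For reference, the whole ||-initialization is just a few oversampling passes
over the data; roughly (an in-memory sketch with illustrative names, not the
MLlib code):

import scala.util.Random

// k-means|| oversampling init (Bahmani et al., VLDB'12) -- sketch only.
// l is the oversampling factor (~2k); a handful of rounds suffices in practice.
def kMeansParInit(data: IndexedSeq[Array[Double]], k: Int, l: Double,
                  rounds: Int): Seq[Array[Double]] = {
  val rnd = new Random()
  def dSq(a: Array[Double], b: Array[Double]) =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  var centers = Vector(data(rnd.nextInt(data.size)))   // one uniform seed
  for (_ <- 1 to rounds) {
    val costs = data.map(p => centers.map(c => dSq(p, c)).min)
    val total = costs.sum
    // sample each point independently with probability l * d^2(p, C) / cost(C)
    centers ++= data.zip(costs).collect {
      case (p, c) if rnd.nextDouble() < l * c / total => p
    }
  }
  // the paper then weights the O(l * rounds) candidates by the points they
  // attract and reclusters them down to k (e.g. with k-means++) -- elided here.
  centers.take(k)
}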

>
> - When you say ++ initialization, many people get this wrong and assume
> that you mean picking the furthest point.  Getting really good
> initialization is fairly difficult and typically requires more time than
> the actual clustering.  This is one of the key benefits of sketch-based
> methods.
>
> - Most algorithms require multiple restarts.  At higher dimensions the
> number of restarts required becomes very large.  An ideal implementation
> does parallel sketch extraction followed by parallel ball k-means for
> restarts.
>
>
>
> On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov 
> wrote:
>
> > Considering porting the implementation [1] and the paper for KMeans|| to
> > Bindings.
> >
> > This seems like another method that maps fairly nicely.
> >
> > The problem I am contemplating is ||-initialization, and in particular,
> > centroid storage. That particular implementation assumes centroids can be
> > kept in memory on the front end.
> >
> > (1) The question is whether this is a dangerous idea. It doesn't seem like
> > it particularly is, since it is unlikely people would want k > 1e+6.
> > Another thing: centers seem to be passed in via a closure attribute (i.e.
> > a java-serialized, array-backed matrix). However, with Bindings it is
> > quite possible to keep the centers at the back as a matrix.
> >
> > (2) Obviously, Lloyd iterations are not terribly accurate; the || and ++
> > versions mostly speed things up. Is there a preferred better-than-Lloyd
> > option for accuracy?
> >
> >
> > [1]
> >
> >
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
> >
>


Re: KMeans|| opinions

2014-04-03 Thread Ted Dunning
- Have you considered sketch-based algorithms?

- It can be important to use optimizations in the search for nearest
centroid.  Consider triangle optimizations.

- Do you mean "parallel" when you type || or is there another meaning there?

- When you say ++ initialization, many people get this wrong and assume
that you mean picking the furthest point.  Getting really good initialization
is fairly difficult and typically requires more time than the actual
clustering.  This is one of the key benefits of sketch-based methods.

- Most algorithms require multiple restarts.  At higher dimensions the
number of restarts required becomes very large.  An ideal implementation
does parallel sketch extraction followed by parallel ball k-means for
restarts.
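
To make the triangle point concrete: if d(b, c) >= 2 d(x, b) for the current
best center b, then c cannot beat b, so d(x, c) never needs to be computed.
A rough sketch of that pruning (illustrative only, assuming center-to-center
distances ccDist are precomputed):

// Nearest-centroid search with triangle-inequality pruning (Elkan-style sketch).
def nearestCenter(x: Array[Double], centers: Array[Array[Double]],
                  ccDist: Array[Array[Double]]): Int = {
  def dist(a: Array[Double], b: Array[Double]) =
    math.sqrt(a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum)
  var best = 0
  var bestD = dist(x, centers(0))
  for (j <- 1 until centers.length) {
    // d(x, c_j) >= d(best, c_j) - d(x, best), so when
    // d(best, c_j) >= 2 * d(x, best) the center c_j can be skipped outright.
    if (ccDist(best)(j) < 2 * bestD) {
      val d = dist(x, centers(j))
      if (d < bestD) { bestD = d; best = j }
    }
  }
  best
}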



On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov  wrote:

> Considering porting the implementation [1] and the paper for KMeans|| to
> Bindings.
>
> This seems like another method that maps fairly nicely.
>
> The problem I am contemplating is ||-initialization, and in particular,
> centroid storage. That particular implementation assumes centroids can be
> kept in memory on the front end.
>
> (1) The question is whether this is a dangerous idea. It doesn't seem like
> it particularly is, since it is unlikely people would want k > 1e+6.
> Another thing: centers seem to be passed in via a closure attribute (i.e.
> a java-serialized, array-backed matrix). However, with Bindings it is
> quite possible to keep the centers at the back as a matrix.
>
> (2) Obviously, Lloyd iterations are not terribly accurate; the || and ++
> versions mostly speed things up. Is there a preferred better-than-Lloyd
> option for accuracy?
>
>
> [1]
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
>


Re: Data frames

2014-04-03 Thread Ted Dunning
And the feature request should be phrased in terms of code with desired
behavior.




On Thu, Apr 3, 2014 at 8:00 PM, Pat Ferrel  wrote:

> Perhaps this is best phrased as a feature request.
>
> On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov  wrote:
>
> PS.
>
> Sequence file keys also have a special meaning if they are Ints. E.g. the
> A' physical operator requires keys to be ints, in which case it interprets
> them as row indexes that become column indexes. This of course isn't always
> the case; e.g. (Aexpr).t %*% Aexpr doesn't require int indices, because in
> reality the optimizer will never choose an actual transposition as a
> physical step in such a pipeline. This interpretation is consistent with
> that of the long-existing Hadoop-side DistributedRowMatrix#transpose.
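
In DSL terms the distinction reads roughly as follows (sketch only; package
and method names approximate the in-progress bindings, and an implicit
distributed context is assumed to be in scope):

import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

val drmA = drmDfsRead("hdfs://.../drm")  // bag of (key -> Vector) rows
// A.t alone needs Int keys (row index -> column index), but in an expression
// like this the optimizer never materializes the transpose as a physical step:
val drmAtA = drmA.t %*% drmA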
>
>
> On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov 
> wrote:
>
> >
> >
> >
> > On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel 
> wrote:
> >
> >>
> >>> On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov 
> wrote:
> >>>
> >>> I think this duality, names and keys, is not very healthy really, and
> >>> just creates additional hassle. Spark DRM takes care of keys
> >>> automatically throughout, but propagating names from named vectors is
> >>> solely an algorithm concern as it stands.
> >>
> >> Not sure what you mean.
> >
> > Not what you think, it looks like.
> >
> > I mean that the Mahout DRM structure is a bag of (key -> Vector) pairs.
> > When persisted, the key goes to the key of a sequence file. In particular,
> > that means there is a case of Bag[key -> NamedVector], which means an
> > external anchor could be saved to either the key or the name of a row. In
> > practice this causes a compatibility mess; e.g. we saw numerous cases
> > where seq2sparse saves external keys (file paths) into the key, whereas
> > e.g. clustering algorithms do not see them because they expect them to be
> > the name part of the vector. I am just saying we have two ways to name the
> > rows, and that is generally not a healthy choice, for the aforementioned
> > reason.
> >
> >
> >> In my experience Names and Properties are primarily used to store
> >> external keys, which is quite healthy.
> >
> >> Users never have data with Mahout keys; they must constantly go back and
> >> forth. This is exactly what the R data frame does, no? I'm not so
> >> concerned with being able to address an element by the external key,
> >> drmB["pat"]["iPad"], like a HashMap. But it would sure be nice to have
> >> the external ids follow the data through any calculation that makes
> >> sense.
> >>
> >
> > I am with you on this.
> >
> >
> >> This would mean clustering, recommendations, transpose, and RSJ would
> >> require no id-translation steps. This would make dealing with Mahout
> >> much easier.
> >>
> >
> > Data frames are a bit of a different thing; right now we work just with
> > matrices. Although, yes, our in-core matrices support row and column names
> > (just like in R), and distributed matrices support row keys only. What I
> > mean is that an algebraic expression, e.g.
> >
> > Aexpr %*% Bexpr, will automatically propagate _keys_ from Aexpr as implied
> > above, but not necessarily named vectors, because internally algorithms
> > blockify things into matrix blocks, and I am far from sure that the Mahout
> > in-core stuff works correctly with named vectors as part of a matrix block
> > in all situations. I may be wrong. I always relied on sequence file keys
> > to identify data points.
> >
> > Note that sequence file keys are more than just a name; they can be
> > anything Writable. I.e. you could save a whole data structure there, as
> > long as you have a Writable for it.
> >
> >
> >>> On Apr 2, 2014 1:08 PM, "Pat Ferrel"  wrote:
> >>>
> >>> Are the Spark efforts supporting all Mahout Vector types? Named,
> >>> Property Vectors? It occurred to me that data frames in R are a related
> >>> but more general solution. If all rows and columns of a DRM and their
> >>> corresponding Vectors (row or column vectors) were to support arbitrary
> >>> properties attached to them, in such a way that they are preserved
> >>> during transpose, Vector extraction, and any other operations that make
> >>> sense, there would be a huge benefit for users.
> >>>
> >>> One of the constant problems with input to Mahout is the translation of
> >>> IDs: external to Mahout going in, Mahout to external coming out. Most
> >>> of this would be unneeded if Mahout supported data frames; some would
> >>> be avoided by supporting named or property vectors universally.
> 
> >>>
> >>
> >
> >
>
>


[jira] [Comment Edited] (MAHOUT-1507) Support External/Foreign Keys/IDs for Vectors and Matrices

2014-04-03 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959158#comment-13959158
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1507 at 4/3/14 8:23 PM:
--

Pat, 

Honestly, I must admit I don't understand what you are requesting. As I said,
row ids are already supported thru row keys. They are also universally
"passed thru" by the physical operator implementations.

It is my opinion that NamedVectors should be avoided, as they create interface
ambiguity. (We need to agree to support keys one particular way, not two ways
-- either named vectors or row keys, but not "whatever".)

Named vectors are supported to the extent that they would be loaded correctly
into an in-memory matrix off an HDFS DRM. Passing them universally thru
"whatever" algorithm implicitly and without limitations would be much harder
because of Mahout in-core math limitations (there would be too many corner
cases). This is therefore solely a mahout-math in-core issue, not a Spark- or
Bindings-related one (as far as the issue of NamedVector pass-thru is
concerned).

Any particular algorithm is, however, quite capable of passing them
(NamedVectors) thru explicitly, just as the MR solvers handled it, if that
makes sense for the algorithm.

The same statements apply to property vectors.
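
To illustrate the explicit pass-thru, roughly (a sketch; operator names
approximate the bindings work in progress, and drmA is assumed to be a DRM
already in scope):

// mapBlock hands an algorithm the row keys together with the in-core
// vertical block, so external row ids survive the operation untouched:
val drmScaled = drmA.mapBlock(ncol = drmA.ncol) { case (keys, block) =>
  (keys, block.times(2.0))  // keys (external row ids) are returned unchanged
}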



[jira] [Comment Edited] (MAHOUT-1507) Support External/Foreign Keys/IDs for Vectors and Matrices

2014-04-03 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959158#comment-13959158
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1507 at 4/3/14 8:20 PM:
--

Pat, 

Honestly, I must admit I don't understand what you are requesting. As I said,
row ids are already supported thru row keys. They are also universally
"passed thru" by the physical operator implementations.

It is my opinion that NamedVectors should be avoided, as they create interface
ambiguity. (We need to agree to support keys one particular way, not two ways
-- either named vectors or row keys, but not "whatever".)

However, named vectors are still supported to the extent that they would be
loaded correctly into an in-memory matrix. Passing them universally thru
"whatever" algorithm implicitly would be much harder because of Mahout in-core
math limitations (there would be too many corner cases). This is therefore
solely a mahout-math in-core issue, not a Spark- or Bindings-related one.

Any particular algorithm is, however, quite capable of passing them
(NamedVectors) thru explicitly, if that makes sense for the algorithm. The
same statement applies to property vectors.



[jira] [Comment Edited] (MAHOUT-1507) Support External/Foreign Keys/IDs for Vectors and Matrices

2014-04-03 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959158#comment-13959158
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1507 at 4/3/14 8:18 PM:
--

Pat, 

Honestly, I must admit I don't understand what you are requesting. As I said,
row ids are already supported thru row keys. It is my opinion that
NamedVectors should be avoided, as they create interface ambiguity. (We need
to agree to support keys one particular way, not two ways -- either named
vectors or row keys, but not "whatever".)

However, named vectors are still supported to the extent that they would be
loaded correctly into an in-memory matrix. Passing them universally thru
"whatever" algorithm implicitly would be much harder because of Mahout
in-core math limitations (there would be too many corner cases). This is
therefore solely a mahout-math in-core issue, not a Spark- or
Bindings-related one.

Any particular algorithm is, however, quite capable of passing them thru
explicitly, if that makes sense for the algorithm. The same statement applies
to property vectors.



[jira] [Comment Edited] (MAHOUT-1507) Support External/Foreign Keys/IDs for Vectors and Matrices

2014-04-03 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959158#comment-13959158
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1507 at 4/3/14 8:17 PM:
--

Pat, 

Honestly, I must admit I don't understand what you are requesting. As I said,
row ids are already supported thru row keys. It is my opinion that
NamedVectors should be avoided, as they create interface ambiguity. (We need
to agree to support keys one particular way, not two ways -- either named
vectors or row keys, but not "whatever".)

However, named vectors are still supported to the extent that they would be
loaded correctly into an in-memory matrix. Passing them universally thru
"whatever" algorithm implicitly would be much harder because of Mahout
in-core math limitations (there would be too many corner cases). This is
therefore solely a mahout-math in-core issue, not a Spark-related one.

Any particular algorithm is, however, quite capable of passing them thru
explicitly, if that makes sense for the algorithm. The same statement applies
to property vectors.



[jira] [Commented] (MAHOUT-1507) Support External/Foreign Keys/IDs for Vectors and Matrices

2014-04-03 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959158#comment-13959158
 ] 

Dmitriy Lyubimov commented on MAHOUT-1507:
--

Pat, 

Honestly, I must admit I don't understand what you are requesting. As I said,
row ids are already supported thru row keys. It is my opinion that
NamedVectors should be avoided, as they create ambiguity. (We need to agree to
support keys one particular way, not two ways.)

However, named vectors are still supported to the extent that they would be
loaded correctly into an in-memory matrix. Passing them universally thru
"whatever" algorithm implicitly will be much harder because of Mahout in-core
math limitations (there would be too many corner cases). Any particular
algorithm is, however, quite capable of passing them thru explicitly, if that
makes sense for the algorithm. The same statement applies to property vectors.


[jira] [Comment Edited] (MAHOUT-1507) Support External/Foreign Keys/IDs for Vectors and Matrices

2014-04-03 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959158#comment-13959158
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1507 at 4/3/14 8:16 PM:
--

Pat, 

Honestly, I must admit I don't understand what you are requesting. As I said,
row ids are already supported thru row keys. It is my opinion that
NamedVectors should be avoided, as they create ambiguity. (We need to agree to
support keys one particular way, not two ways.)

However, named vectors are still supported to the extent that they would be
loaded correctly into an in-memory matrix. Passing them universally thru
"whatever" algorithm implicitly would be much harder because of Mahout
in-core math limitations (there would be too many corner cases). This is
therefore solely a mahout-math in-core issue, not a Spark-related one.

Any particular algorithm is, however, quite capable of passing them thru
explicitly, if that makes sense for the algorithm. The same statement applies
to property vectors.



[jira] [Updated] (MAHOUT-1507) Support External/Foreign Keys/IDs for Vectors and Matrices

2014-04-03 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1507:
---

Description: 
All users of Mahout have data which is addressed by keys or IDs of their own 
devising. In order to use much of Mahout they must translate these IDs into 
Mahout IDs, then run their jobs and translate back again when retrieving the 
output. If the ID space is very large, this is a difficult problem for users 
to solve at scale.

For many Mahout operations this would not be necessary if these external keys 
could be maintained for vectors and dimensions, or for rows and columns of a 
DRM.

The reason I bring this up now is that much groundwork is being laid for 
Mahout's future on Spark so getting this notion in early could be fundamentally 
important and used to build on.

If external IDs for rows and columns were maintained then RSJ, DRM Transpose 
(and other DRM ops), vector extraction, clustering, and recommenders would need 
no ID translation steps, a big user win.

A partial solution might be to support external row IDs alone somewhat like the 
NamedVector and PropertyVector in the Mahout hadoop code.
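
For context, the round trip users must implement today looks roughly like
this (hypothetical snippet, purely illustrative):

// Going in: external IDs -> dense Mahout row indices; coming out: back again.
val externalIds: Seq[String] = Seq("user-a", "user-b", "user-c")
val toMahout: Map[String, Int] = externalIds.zipWithIndex.toMap
val toExternal: Map[Int, String] = toMahout.map(_.swap)

val row = toMahout("user-b")     // 1, used to address the DRM row
// ... run the Mahout job on integer-keyed data ...
val who = toExternal(row)        // "user-b", recovered on the way out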



[jira] [Created] (MAHOUT-1507) Support External/Foreign Keys/IDs for Vectors and Matrices

2014-04-03 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-1507:
--

 Summary: Support External/Foreign Keys/IDs for Vectors and Matrices
 Key: MAHOUT-1507
 URL: https://issues.apache.org/jira/browse/MAHOUT-1507
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.9
 Environment: Spark Scala
Reporter: Pat Ferrel
 Fix For: 1.0


All users of Mahout have data which is addressed by keys or IDs of their own 
devising. In order to use much of Mahout they must translate these IDs into 
Mahout IDs, then run their jobs and translate back again when retrieving the 
output. If the ID space is very large, this is a difficult problem for users 
to solve at scale.

For many Mahout operations this would not be necessary if these external keys 
could be maintained for vectors and dimensions, or for rows and columns of a 
DRM.

The reason I bring this up now is that much groundwork is being laid for 
Mahout's future on Spark so getting this notion in early could be fundamentally 
important and used to build on.

If external IDs for rows and columns were maintained then RSJ, DRM Transpose 
(and other DRM ops), vector extraction, clustering, and recommenders would need 
no ID translation steps, a big user win.

A partial solution might be to support external row IDs somewhat like the 
NamedVector and PropertyVector in the Mahout hadoop code.



Re: Data frames

2014-04-03 Thread Pat Ferrel
Perhaps this is best phrased as a feature request.

On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov  wrote:

PS.

Sequence file keys also have a special meaning if they are Ints. E.g. the A'
physical operator requires keys to be ints, in which case it interprets
them as row indexes that become column indexes. This of course isn't always
the case; e.g. (Aexpr).t %*% Aexpr doesn't require int indices, because in
reality the optimizer will never choose an actual transposition as a physical
step in such a pipeline. This interpretation is consistent with that of the
long-existing Hadoop-side DistributedRowMatrix#transpose.


On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov  wrote:

> 
> 
> 
> On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel  wrote:
> 
>> 
>>> On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov  wrote:
>>> 
>>> I think this duality, names and keys, is not very healthy really, and
>>> just creates additional hassle. Spark DRM takes care of keys
>>> automatically throughout, but propagating names from named vectors is
>>> solely an algorithm concern as it stands.
>> 
>> Not sure what you mean.
> 
> Not what you think, it looks like.
> 
> I mean that the Mahout DRM structure is a bag of (key -> Vector) pairs.
> When persisted, the key goes to the key of a sequence file. In particular,
> that means there is a case of Bag[key -> NamedVector], which means an
> external anchor could be saved to either the key or the name of a row. In
> practice this causes a compatibility mess; e.g. we saw numerous cases where
> seq2sparse saves external keys (file paths) into the key, whereas e.g.
> clustering algorithms do not see them because they expect them to be the
> name part of the vector. I am just saying we have two ways to name the
> rows, and that is generally not a healthy choice, for the aforementioned
> reason.
> 
> 
>> In my experience Names and Properties are primarily used to store
>> external keys, which is quite healthy.
>
>> Users never have data with Mahout keys; they must constantly go back and
>> forth. This is exactly what the R data frame does, no? I'm not so
>> concerned with being able to address an element by the external key,
>> drmB["pat"]["iPad"], like a HashMap. But it would sure be nice to have
>> the external ids follow the data through any calculation that makes sense.
>> 
> 
> I am with you on this.
> 
> 
>> This would mean clustering, recommendations, transpose, and RSJ would
>> require no id-translation steps. This would make dealing with Mahout much
>> easier.
>> 
> 
> Data frames are a bit of a different thing; right now we work just with
> matrices. Although, yes, our in-core matrices support row and column names
> (just like in R), and distributed matrices support row keys only. What I
> mean is that an algebraic expression, e.g.
> 
> Aexpr %*% Bexpr, will automatically propagate _keys_ from Aexpr as implied
> above, but not necessarily named vectors, because internally algorithms
> blockify things into matrix blocks, and I am far from sure that the Mahout
> in-core stuff works correctly with named vectors as part of a matrix block
> in all situations. I may be wrong. I always relied on sequence file keys
> to identify data points.
> 
> Note that sequence file keys are more than just a name; they can be
> anything Writable. I.e. you could save a whole data structure there, as
> long as you have a Writable for it.
> 
> 
>>> On Apr 2, 2014 1:08 PM, "Pat Ferrel"  wrote:
>>>
>>> Are the Spark efforts supporting all Mahout Vector types? Named,
>>> Property Vectors? It occurred to me that data frames in R are a related
>>> but more general solution. If all rows and columns of a DRM and their
>>> corresponding Vectors (row or column vectors) were to support arbitrary
>>> properties attached to them, in such a way that they are preserved
>>> during transpose, Vector extraction, and any other operations that make
>>> sense, there would be a huge benefit for users.
>>>
>>> One of the constant problems with input to Mahout is the translation of
>>> IDs: external to Mahout going in, Mahout to external coming out. Most of
>>> this would be unneeded if Mahout supported data frames; some would be
>>> avoided by supporting named or property vectors universally.
 
 
>>> 
>> 
> 
> 



[jira] [Created] (MAHOUT-1506) Creation of affinity matrix for spectral clustering

2014-04-03 Thread Shannon Quinn (JIRA)
Shannon Quinn created MAHOUT-1506:
-

 Summary: Creation of affinity matrix for spectral clustering
 Key: MAHOUT-1506
 URL: https://issues.apache.org/jira/browse/MAHOUT-1506
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 1.0
Reporter: Shannon Quinn
Assignee: Shannon Quinn


I wanted to get this discussion going, since I think this is a critical blocker 
for any kind of documentation update on spectral clustering (I can't update the 
documentation until the algorithm is useful, and it won't be useful until 
there's a built-in method for converting raw data to an affinity matrix).

Namely, I'm wondering what kind of "raw" data this algorithm should be 
expecting (anything that k-means expects, basically?), and what data 
structures are associated with it. I've created a proof-of-concept for how 
pairwise affinity generation could work.

https://github.com/magsol/Hadoop-Affinity

It's a two-step job, but if the input data format provides 1) the total 
number of data points and 2) each data point's index in the overall set, then 
the first job can be scrapped entirely and affinity generation will consist 
of a single MR task.

(discussions on Spark / H2O pending, of course)

Mainly this is an engineering problem at this point. Let me know your thoughts 
and I'll get this done (I'm out of town the next 10 days for my 
wedding/honeymoon, will get to this on my return).
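
For concreteness, by pairwise affinity I mean the usual RBF kernel; a minimal
in-memory sketch (sigma is a user parameter; illustrative only, not the
proof-of-concept code itself):

// Dense pairwise affinity A(i, j) = exp(-||xi - xj||^2 / (2 * sigma^2)).
// The MR/Spark job would distribute exactly this computation.
def affinityMatrix(points: Array[Array[Double]], sigma: Double): Array[Array[Double]] = {
  val n = points.length
  def dSq(a: Array[Double], b: Array[Double]) =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  Array.tabulate(n, n) { (i, j) =>
    if (i == j) 0.0  // conventionally a zero diagonal for spectral clustering
    else math.exp(-dSq(points(i), points(j)) / (2 * sigma * sigma))
  }
}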





[jira] [Commented] (MAHOUT-1468) Creating a new page for StreamingKMeans documentation on mahout website

2014-04-03 Thread Pavan Kumar N (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958696#comment-13958696
 ] 

Pavan Kumar N commented on MAHOUT-1468:
---

[~ssc] for your attention: perhaps we can add more information to the kmeans 
page if the committers feel a new page is not necessary, and then close this 
one.

> Creating a new page for StreamingKMeans documentation on mahout website
> ---
>
> Key: MAHOUT-1468
> URL: https://issues.apache.org/jira/browse/MAHOUT-1468
> Project: Mahout
>  Issue Type: Documentation
>Affects Versions: 1.0
>Reporter: Pavan Kumar N
>  Labels: Documentation
> Fix For: 1.0
>
>
> A separate page is required with a Streaming K-Means algorithm description 
> and overview, explaining the various parameters that can be used in 
> streamingkmeans, the strategy for parallelization, and linking to this paper: 
> http://papers.nips.cc/paper/3812-streaming-k-means-approximation.pdf
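
As a starting point for such a page, the core streaming step from the paper
can be sketched in a few lines (a rough illustration of the approach, not
Mahout's StreamingKMeans code):

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// One pass: keep weighted centroids; fold each point into the nearest one or
// promote it to a new centroid, growing the cutoff as the budget fills.
def streamingPass(points: Iterator[Array[Double]], budget: Int,
                  cutoff0: Double): ArrayBuffer[(Array[Double], Double)] = {
  val rnd = new Random()
  val centroids = ArrayBuffer[(Array[Double], Double)]()  // (center, weight)
  var cutoff = cutoff0
  def dSq(a: Array[Double], b: Array[Double]) =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  for (p <- points) {
    if (centroids.isEmpty) centroids += ((p, 1.0))
    else {
      val (d, i) = centroids.map(c => dSq(p, c._1)).zipWithIndex.minBy(_._1)
      if (rnd.nextDouble() < d / cutoff) centroids += ((p, 1.0)) // new centroid
      else { // merge into the nearest centroid, weight-averaged
        val (c, w) = centroids(i)
        centroids(i) = (c.zip(p).map { case (ci, pi) => (ci * w + pi) / (w + 1) }, w + 1)
      }
      if (centroids.size > budget) cutoff *= 2 // then collapse/recluster; elided
    }
  }
  centroids
}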





[jira] [Commented] (MAHOUT-1480) Clean up website on 20 newsgroups

2014-04-03 Thread Pavan Kumar N (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958691#comment-13958691
 ] 

Pavan Kumar N commented on MAHOUT-1480:
---

Not to brag about a resolved issue here, but it would be nice to put in a 
screenshot of the confusion matrix on the 20 newsgroups page, as it currently 
looks really cluttered and the numbers are all over the place (it does not 
look like a matrix).

> Clean up website on 20 newsgroups
> -
>
> Key: MAHOUT-1480
> URL: https://issues.apache.org/jira/browse/MAHOUT-1480
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Sebastian Schelter
>Assignee: Suneel Marthi
> Fix For: 1.0
>
>
> The website on the twenty newsgroups example needs clean up. We need to go 
> through the text, remove dead links and check whether the information is 
> still consistent with the current code.
> https://mahout.apache.org/users/clustering/twenty-newsgroups.html





[jira] [Comment Edited] (MAHOUT-1450) Cleaning up clustering documentation on mahout website

2014-04-03 Thread Pavan Kumar N (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958685#comment-13958685
 ] 

Pavan Kumar N edited comment on MAHOUT-1450 at 4/3/14 10:16 AM:


[~ssc] Thanks for your comment. As [~smarthi] has mentioned in earlier 
comments, using canopy for initial clusters followed by kmeans is an 
old-school and inefficient approach; kindly reconsider whether you need the 
same documentation in the implementation part. It describes how you can use 
canopy's initial clusters for running kmeans, followed by a flowchart 
representation of the same. I suggest adding a flowchart on streamingkmeans 
instead.

This is for your consideration. Other than that, everything else seems fine 
to me. The Reuters kmeans example (cluster-reuters.sh) worked for me. I 
suggest adding a note that a first-time user should run "mvn install" 
(-DskipTests if necessary) before running these examples.


was (Author: pknarayan):
[~ssc] Thanks for your comment. As [~smarthi] mentioned in earlier comments, 
using canopy to seed initial clusters for kmeans is an old-school and 
inefficient approach, so kindly reconsider whether you need that documentation 
in the implementation part: it describes how to use canopy's initial clusters 
to run kmeans, followed by a flowchart of the same. I suggest adding a 
flowchart for streamingkmeans instead. 

This is for your consideration. Other than that, everything else seems fine to 
me.

> Cleaning up clustering documentation on mahout website 
> ---
>
> Key: MAHOUT-1450
> URL: https://issues.apache.org/jira/browse/MAHOUT-1450
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
> Environment: This affects all mahout versions
>Reporter: Pavan Kumar N
>  Labels: documentation, newbie
> Fix For: 1.0
>
>
> In canopy clustering, the strategy for parallelization section seems to have 
> some dead links. They need to be cleaned up and replaced with new links (if 
> there are any). Here is the link:
> http://mahout.apache.org/users/clustering/canopy-clustering.html
> Here are the details of the dead links on the kmeans clustering page:
> On the k-Means clustering - basics page, in the first line of the Quickstart 
> part of the documentation, the hyperlink "Here" is dead:
> http://mahout.apache.org/users/clustering/k-means-clustering%5Equickstart-kmeans.sh.html
> In the Strategy for parallelization part, the hyperlink "Cluster computing and 
> MapReduce" in the first sentence, the hyperlink "here" in the second sentence, 
> and the hyperlink "http://www2.chass.ncsu.edu/garson/PA765/cluster.htm" in the 
> last sentence are dead:
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt
> http://www2.chass.ncsu.edu/garson/PA765/cluster.htm
> Under the page: 
> http://mahout.apache.org/users/clustering/visualizing-sample-clusters.html
> in the second sentence of the Pre-prep part of the page, the hyperlink "setup 
> mahout" is dead:
> http://mahout.apache.org/users/clustering/users/basics/quickstart.html
> The existing documentation is too ambiguous; I recommend making the following 
> changes so that new users can use it as a tutorial.
> The Quickstart should be replaced with the following:
> Get the data from:
> wget 
> http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
> Place it within the examples folder under the mahout home directory:
> mahout-0.7/examples/reuters
> mkdir reuters
> cd reuters
> mkdir reuters-out
> mv reuters21578.tar.gz reuters-out
> cd reuters-out
> tar -xzvf reuters21578.tar.gz
> cd ..
> Mahout-specific commands
> #1 Run the org.apache.lucene.benchmark.utils.ExtractReuters class
> ${MAHOUT_HOME}/bin/mahout
> org.apache.lucene.benchmark.utils.ExtractReuters reuters-out
> reuters-text
> #2 copy the file to your HDFS
> bin/hadoop fs -copyFromLocal
> /home/bigdata/mahout-distribution-0.7/examples/reuters-text
> hdfs://localhost:54310/user/bigdata/
> #3 generate sequence-file
> mahout seqdirectory -i hdfs://localhost:54310/user/bigdata/reuters-text
> -o hdfs://localhost:54310/user/bigdata/reuters-seqfiles -c UTF-8 -chunk 5
> -chunk → chunk size in MB for the output sequence files
> -c UTF-8 → character encoding of the input files
> #4 Check the generated sequence-file
> mahout-0.7$ ./bin/mahout seqdumper -i
> /your-hdfs-path-to/reuters-seqfiles/chunk-0 | less
> #5 From sequence-file generate vector file
> mahout seq2sparse -i
> hdfs://localhost:54310/user/bigdata/reuters-seqfiles -o
> hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
> -ow → overwrite
> #6 Take a look at the output directory; it should have 7 items:
> bin/hadoop fs -ls reuters-vectors
> reuters-vectors/df-count
> reuters-vectors/d

[jira] [Commented] (MAHOUT-1450) Cleaning up clustering documentation on mahout website

2014-04-03 Thread Pavan Kumar N (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958685#comment-13958685
 ] 

Pavan Kumar N commented on MAHOUT-1450:
---

[~ssc] Thanks for your comment. As [~smarthi] mentioned in earlier comments, 
using canopy to seed initial clusters for kmeans is an old-school and 
inefficient approach, so kindly reconsider whether you need that documentation 
in the implementation part: it describes how to use canopy's initial clusters 
to run kmeans, followed by a flowchart of the same. I suggest adding a 
flowchart for streamingkmeans instead. 

This is for your consideration. Other than that, everything else seems fine to 
me.

> Cleaning up clustering documentation on mahout website 
> ---
>
> Key: MAHOUT-1450
> URL: https://issues.apache.org/jira/browse/MAHOUT-1450
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
> Environment: This affects all mahout versions
>Reporter: Pavan Kumar N
>  Labels: documentation, newbie
> Fix For: 1.0
>
>
> In canopy clustering, the strategy for parallelization section seems to have 
> some dead links. They need to be cleaned up and replaced with new links (if 
> there are any). Here is the link:
> http://mahout.apache.org/users/clustering/canopy-clustering.html
> Here are the details of the dead links on the kmeans clustering page:
> On the k-Means clustering - basics page, in the first line of the Quickstart 
> part of the documentation, the hyperlink "Here" is dead:
> http://mahout.apache.org/users/clustering/k-means-clustering%5Equickstart-kmeans.sh.html
> In the Strategy for parallelization part, the hyperlink "Cluster computing and 
> MapReduce" in the first sentence, the hyperlink "here" in the second sentence, 
> and the hyperlink "http://www2.chass.ncsu.edu/garson/PA765/cluster.htm" in the 
> last sentence are dead:
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt
> http://www2.chass.ncsu.edu/garson/PA765/cluster.htm
> Under the page: 
> http://mahout.apache.org/users/clustering/visualizing-sample-clusters.html
> in the second sentence of the Pre-prep part of the page, the hyperlink "setup 
> mahout" is dead:
> http://mahout.apache.org/users/clustering/users/basics/quickstart.html
> The existing documentation is too ambiguous; I recommend making the following 
> changes so that new users can use it as a tutorial.
> The Quickstart should be replaced with the following:
> Get the data from:
> wget 
> http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
> Place it within the examples folder under the mahout home directory:
> mahout-0.7/examples/reuters
> mkdir reuters
> cd reuters
> mkdir reuters-out
> mv reuters21578.tar.gz reuters-out
> cd reuters-out
> tar -xzvf reuters21578.tar.gz
> cd ..
> Mahout-specific commands
> #1 Run the org.apache.lucene.benchmark.utils.ExtractReuters class
> ${MAHOUT_HOME}/bin/mahout
> org.apache.lucene.benchmark.utils.ExtractReuters reuters-out
> reuters-text
> #2 copy the file to your HDFS
> bin/hadoop fs -copyFromLocal
> /home/bigdata/mahout-distribution-0.7/examples/reuters-text
> hdfs://localhost:54310/user/bigdata/
> #3 generate sequence-file
> mahout seqdirectory -i hdfs://localhost:54310/user/bigdata/reuters-text
> -o hdfs://localhost:54310/user/bigdata/reuters-seqfiles -c UTF-8 -chunk 5
> -chunk → chunk size in MB for the output sequence files
> -c UTF-8 → character encoding of the input files
> #4 Check the generated sequence-file
> mahout-0.7$ ./bin/mahout seqdumper -i
> /your-hdfs-path-to/reuters-seqfiles/chunk-0 | less
> #5 From sequence-file generate vector file
> mahout seq2sparse -i
> hdfs://localhost:54310/user/bigdata/reuters-seqfiles -o
> hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
> -ow → overwrite
> #6 Take a look at the output directory; it should have 7 items:
> bin/hadoop fs -ls reuters-vectors
> reuters-vectors/df-count
> reuters-vectors/dictionary.file-0
> reuters-vectors/frequency.file-0
> reuters-vectors/tf-vectors
> reuters-vectors/tfidf-vectors
> reuters-vectors/tokenized-documents
> reuters-vectors/wordcount
> #7 check the vector: reuters-vectors/tf-vectors/part-r-0
> mahout-0.7$ hadoop fs -ls reuters-vectors/tf-vectors
> #8 Run canopy clustering to get optimal initial centroids for k-means
> mahout canopy -i
> hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors -o
> hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -t1 1500 -t2 2000
> -dm → the distance measure to use while clustering (here, the cosine distance 
> measure)
> #9 Run k-means clustering a

[jira] [Commented] (MAHOUT-1489) Interactive Scala & Spark Bindings Shell & Script processor

2014-04-03 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958589#comment-13958589
 ] 

Saikat Kanjilal commented on MAHOUT-1489:
-

Ok, lots of changes here so far:

1) Added a new maven project called shell.
2) Created a new package under org/apache/mahout called shell, which contains 
the spark shell code and all its dependencies (sub-packages include 
server/storage/ui and util). We may not need all of these, but for now my goal 
is just to get the project compiling.
3) Currently battling through a bunch of compilation errors from classes not 
being found, which I'll bring in as needed (100+ errors related to this alone).

The github repo is here again for reference: 
https://github.com/skanjila/mahout-scala-spark-shell

Dmitriy, my goal is to get this compiling with all dependencies brought in from 
spark, which we can remove later.

> Interactive Scala & Spark Bindings Shell & Script processor
> ---
>
> Key: MAHOUT-1489
> URL: https://issues.apache.org/jira/browse/MAHOUT-1489
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 1.0
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> Build an interactive shell / script runner (just like the spark shell), 
> something very similar to R's interactive / script-runner mode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)