[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input

2013-06-01 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672182#comment-13672182
 ] 

Grant Ingersoll commented on MAHOUT-1080:
-

Here's a thought: kill NamedVector and move the single name string to 
Vector. It seems to me that naming a Vector is very, very common. One possible 
issue is dealing with older Vectors that don't have a name, but we could just 
treat a missing name as an empty string.

IMO, this should be fixed before 1.0
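
As a rough illustration of the idea above, here is a minimal sketch in which 
the vector type itself carries the name and a missing name falls back to the 
empty string. SketchVector and all of its members are hypothetical stand-ins 
made up for this example; this is not Mahout's Vector API and not the change 
that was actually committed.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a vector that carries its own name; not Mahout's Vector. */
final class SketchVector {
    private final double[] values;
    private final String name;

    /** Legacy-style constructor: no name is supplied, so the name is the empty string. */
    SketchVector(double[] values) {
        this(values, "");
    }

    SketchVector(double[] values, String name) {
        this.values = Arrays.copyOf(values, values.length);
        // A null name (for example, data written before names existed) becomes "".
        this.name = (name == null) ? "" : name;
    }

    String getName() {
        return name;
    }

    /** True only when a non-empty name was supplied. */
    boolean hasName() {
        return !name.isEmpty();
    }

    double get(int index) {
        return values[index];
    }

    int size() {
        return values.length;
    }
}
```

With this shape, callers never need a null check or an instanceof test to read 
a name; an unnamed vector simply reports "".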

 Kmeans clustered output losses vectorId given in the input
 --

 Key: MAHOUT-1080
 URL: https://issues.apache.org/jira/browse/MAHOUT-1080
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
Reporter: Smita Wadhwa
 Fix For: 0.8

 Attachments: kMeansClusterVectorId.diff


 The input to k-means is (IntWritable, VectorWritable) pairs, and the 
 clustered-points output is (clusterId, 
 WeightedVectorWritable(vector, distance-from-the-centre)). 
 The id of the input vector is lost in this processing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input

2013-06-01 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672210#comment-13672210
 ] 

Pat Ferrel commented on MAHOUT-1080:


+10

As a frequent user of named vectors I would love to see this supported 
generally.



[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input

2012-09-26 Thread Paritosh Ranjan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463626#comment-13463626
 ] 

Paritosh Ranjan commented on MAHOUT-1080:
-

The vectors can be wrapped in NamedVectors, with the id as the name, to trace 
them back, which solves the problem described.

So, I am not sure whether this fix should be applied or not.



[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input

2012-09-26 Thread Smita Wadhwa (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463643#comment-13463643
 ] 

Smita Wadhwa commented on MAHOUT-1080:
--

Yes, wrapping the vectors (sparse vector format) in NamedVector would also 
work, but I thought this might be the cleaner way, given that we also have 
WeightedPropertyVectorWritable.

No issues if NamedVector is better in your view.



[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input

2012-09-26 Thread Paritosh Ranjan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463658#comment-13463658
 ] 

Paritosh Ranjan commented on MAHOUT-1080:
-

Jeff, can you share your views on this?

Other users have also run into this: 
http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%3c50502f5f.3050...@xebia.com%3E



[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input

2012-09-26 Thread Jeff Eastman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463838#comment-13463838
 ] 

Jeff Eastman commented on MAHOUT-1080:
--

Using the WritableComparable key from the vector input file as an identifier 
certainly seems reasonable. We introduced NamedVectors a long time ago to allow 
for identifiers to pass through the clustering classification phase and most 
current Mahout applications take this approach. I'm not sure a new writable 
needs to be introduced here. We could also modify the 
ClusterClassificationMapper to emit a NamedVector with the key in it if the 
VectorWritable was not already named.
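
The last suggestion above could look roughly like the following sketch: if the 
incoming vector is already named it passes through untouched, otherwise it is 
wrapped with the input key as its name. Vec, DenseVec, NamedVec and 
KeyNaming.nameIfUnnamed are hypothetical stand-ins written only for this 
example; they are not Mahout's Vector, NamedVector or 
ClusterClassificationMapper, and the real mapper would also deal with 
Writables and the Hadoop context.

```java
/** Hypothetical stand-in for a vector interface; only what the sketch needs. */
interface Vec {
    double get(int index);
    int size();
}

/** Hypothetical plain vector with no identifier of its own. */
final class DenseVec implements Vec {
    private final double[] values;

    DenseVec(double... values) {
        this.values = values.clone();
    }

    public double get(int index) {
        return values[index];
    }

    public int size() {
        return values.length;
    }
}

/** Hypothetical wrapper that attaches a name to any vector by delegation. */
final class NamedVec implements Vec {
    private final Vec delegate;
    private final String name;

    NamedVec(Vec delegate, String name) {
        this.delegate = delegate;
        this.name = name;
    }

    String getName() {
        return name;
    }

    public double get(int index) {
        return delegate.get(index);
    }

    public int size() {
        return delegate.size();
    }
}

final class KeyNaming {
    private KeyNaming() {
    }

    /** Returns an already-named vector unchanged; otherwise wraps it using the input key as the name. */
    static NamedVec nameIfUnnamed(Object inputKey, Vec vector) {
        if (vector instanceof NamedVec) {
            return (NamedVec) vector;
        }
        return new NamedVec(vector, String.valueOf(inputKey));
    }
}
```

Because an existing name always wins, applications that already wrap their 
input in named vectors would see no change in output under this approach.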



[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input

2012-09-26 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464038#comment-13464038
 ] 

Pat Ferrel commented on MAHOUT-1080:


The inconsistent support for NamedVector seems to be an issue with Mahout in 
general. If you don't use a NamedVector, your clustered points will have no 
ids. In another issue, NamedVectors were just added to the output of SSVD. It 
would be nice to have a more expressive version of a vector that passes 
through the whole analysis pipeline.

However, I'd vote for WeightedPropertyVectorWritable, which seems a more 
general solution and already exists. There are several, if not many, things 
that would be nice to associate with a vector at some point in the analysis 
pipeline (distance to centroid, name, some external key, pdf, etc.). Why not 
adopt it as a standard for the I/O of jobs that can support it, and then add 
properties for each pipeline task where they make sense? It would do away 
with the need for several dictionaries, methinks.
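
To make the idea concrete, here is a minimal sketch of a vector that carries a 
weight plus free-form properties, with each pipeline stage adding what it 
knows. PropertyVector, its methods and the property keys "name" and 
"distance-to-centroid" are all hypothetical and chosen for illustration; this 
is not the actual WeightedPropertyVectorWritable class or its serialization.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in: a vector plus a weight and free-form string properties. */
final class PropertyVector {
    private final double weight;
    private final double[] values;
    // Insertion order is kept so properties read back in the order stages added them.
    private final Map<String, String> properties = new LinkedHashMap<>();

    PropertyVector(double weight, double... values) {
        this.weight = weight;
        this.values = values.clone();
    }

    /** Adds or overwrites a property and returns this, so pipeline stages can chain calls. */
    PropertyVector with(String key, String value) {
        properties.put(key, value);
        return this;
    }

    /** Returns the property value, or null if no stage has set it. */
    String property(String key) {
        return properties.get(key);
    }

    /** Read-only view of everything attached so far. */
    Map<String, String> properties() {
        return Collections.unmodifiableMap(properties);
    }

    double weight() {
        return weight;
    }

    double get(int index) {
        return values[index];
    }
}
```

For example, an input stage might call with("name", "doc-7") and a clustering 
stage might later call with("distance-to-centroid", "0.25") on the same 
object, so both facts travel with the vector instead of living in separate 
lookup tables.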
