[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input
[ https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672182#comment-13672182 ]

Grant Ingersoll commented on MAHOUT-1080:
-----------------------------------------

Here's a thought: kill NamedVector and move the single name string to Vector. It seems to me that naming a Vector is very, very common. A possible issue, however, is dealing with older Vectors that don't have a name, but we could just treat a missing name as an empty string.

IMO, this should be fixed before 1.0.

Kmeans clustered output losses vectorId given in the input
----------------------------------------------------------

Key: MAHOUT-1080
URL: https://issues.apache.org/jira/browse/MAHOUT-1080
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.7
Reporter: Smita Wadhwa
Fix For: 0.8
Attachments: kMeansClusterVectorId.diff

The input to Kmeans is IntWritable and VectorWritable, and the output of clustered points is clusterId and WeightedVectorWritable(vector, distance-from-the-centre). The id of the vector is lost in this processing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input
[ https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672210#comment-13672210 ]

Pat Ferrel commented on MAHOUT-1080:
------------------------------------

+10

As a frequent user of named vectors I would love to see this supported generally.
[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input
[ https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463626#comment-13463626 ]

Paritosh Ranjan commented on MAHOUT-1080:
-----------------------------------------

The vectors can be wrapped in NamedVectors, with the id as the name, to trace them back, which solves the problem described. So I am not sure whether this fix should be applied or not.
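The wrapping approach described in this comment can be sketched roughly as follows. This is only an illustration using simplified stand-in types: `SketchNamedVector` and `wrapWithIds` are hypothetical names, not Mahout's `NamedVector` or any Mahout API, and plain Java collections stand in for the IntWritable/VectorWritable input.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a vector that carries a name; NOT Mahout's NamedVector.
final class SketchNamedVector {
    final String name;
    final double[] values;

    SketchNamedVector(String name, double[] values) {
        this.name = name;
        this.values = values;
    }
}

final class NamedVectorWrapSketch {
    // Turn each (id, vector) record into a named vector whose name is the id,
    // so the id can still be read from the vector after clustering.
    static List<SketchNamedVector> wrapWithIds(Map<Integer, double[]> input) {
        List<SketchNamedVector> named = new ArrayList<>();
        for (Map.Entry<Integer, double[]> record : input.entrySet()) {
            named.add(new SketchNamedVector(String.valueOf(record.getKey()), record.getValue()));
        }
        return named;
    }

    public static void main(String[] args) {
        // Hypothetical input records keyed by integer id.
        Map<Integer, double[]> input = new LinkedHashMap<>();
        input.put(17, new double[] {1.0, 2.0});
        input.put(42, new double[] {3.0, 4.0});
        for (SketchNamedVector v : wrapWithIds(input)) {
            System.out.println(v.name + " -> " + v.values.length + " dimensions");
        }
    }
}
```

With this shape the id survives only because the caller wraps every vector before clustering, which is the extra step the reporter wanted to avoid.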
[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input
[ https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463643#comment-13463643 ]

Smita Wadhwa commented on MAHOUT-1080:
--------------------------------------

Yes, wrapping the Vectors (converting the sparse Vector format to NamedVector) would also work, but I thought this might be the cleaner way, assuming we also have WeightedPropertyVectorWritable. No issues if NamedVector is better in your view.
[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input
[ https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463658#comment-13463658 ]

Paritosh Ranjan commented on MAHOUT-1080:
-----------------------------------------

Jeff, can you share your views on this? Other users have also experienced this: http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%3c50502f5f.3050...@xebia.com%3E
[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input
[ https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463838#comment-13463838 ]

Jeff Eastman commented on MAHOUT-1080:
--------------------------------------

Using the WritableComparable key from the vector input file as an identifier certainly seems reasonable. We introduced NamedVectors a long time ago to allow identifiers to pass through the clustering classification phase, and most current Mahout applications take this approach. I'm not sure a new writable needs to be introduced here. We could also modify the ClusterClassificationMapper to emit a NamedVector with the key in it if the VectorWritable was not already named.
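The last suggestion in this comment, naming a vector from its input key only when it has no name yet, might look something like the sketch below. The types `PlainVec` and `NamedVec` and the helper `ensureNamed` are hypothetical stand-ins, not Mahout classes, and this is not the actual ClusterClassificationMapper code.

```java
// Simplified stand-ins for illustration; these are NOT Mahout's Vector or NamedVector.
class PlainVec {
    final double[] values;

    PlainVec(double[] values) {
        this.values = values;
    }
}

final class NamedVec extends PlainVec {
    final String name;

    NamedVec(String name, double[] values) {
        super(values);
        this.name = name;
    }
}

final class KeyAsNameSketch {
    // Keep an existing name untouched; otherwise use the input key as the name.
    static NamedVec ensureNamed(Object key, PlainVec vector) {
        if (vector instanceof NamedVec) {
            return (NamedVec) vector;
        }
        return new NamedVec(String.valueOf(key), vector.values);
    }

    public static void main(String[] args) {
        // An unnamed vector picks up its key; an already named one is passed through.
        NamedVec fromKey = ensureNamed(17, new PlainVec(new double[] {1.0, 2.0}));
        NamedVec alreadyNamed = ensureNamed(18, new NamedVec("doc-a", new double[] {3.0}));
        System.out.println(fromKey.name);
        System.out.println(alreadyNamed.name);
    }
}
```

Doing this inside the mapper would keep existing NamedVector users unaffected while giving unnamed input an identifier without a new writable type.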
[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input
[ https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464038#comment-13464038 ]

Pat Ferrel commented on MAHOUT-1080:
------------------------------------

The inconsistent support for NamedVector seems to be an issue with Mahout in general. If you don't use a NamedVector, your clustered points will have no ids. In another issue, NamedVectors were just added to the output of SSVD. It would be nice to have a more expressive version of a vector that goes through the whole analysis pipeline.

However, I'd vote for the WeightedPropertyVectorWritable, which seems a more general solution and already exists. There are several, if not many, things that would be nice to associate with a vector at some point in the analysis pipeline (distance to centroid, name, some external key, pdf, etc.). Why not adopt it as a standard for the i/o of jobs that can support it? Then add properties for each pipeline task that make sense. It would do away with the need for several dictionaries, methinks.
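The idea of one record type that accumulates properties as it moves through the pipeline could be sketched as below. `PropertyVectorSketch`, its `with` method, and the property keys are hypothetical and chosen for illustration; this is not Mahout's WeightedPropertyVectorWritable and it does no Hadoop serialization.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for a vector record with a weight and free-form properties.
// It is NOT Mahout's WeightedPropertyVectorWritable.
final class PropertyVectorSketch {
    final double[] values;
    final double weight;
    final Map<String, String> properties = new LinkedHashMap<>();

    PropertyVectorSketch(double[] values, double weight) {
        this.values = values;
        this.weight = weight;
    }

    // Attach one property and return this record so pipeline stages can chain calls.
    PropertyVectorSketch with(String key, String value) {
        properties.put(key, value);
        return this;
    }

    public static void main(String[] args) {
        // Each stage adds what it knows; the property keys here are made up.
        PropertyVectorSketch point = new PropertyVectorSketch(new double[] {1.0, 2.0}, 1.0)
            .with("name", "doc-17")
            .with("distance-to-centroid", "0.25");
        System.out.println(point.properties);
    }
}
```

A later stage can then read the name or the distance directly from the record instead of joining against a separate dictionary.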