[ https://issues.apache.org/jira/browse/MAHOUT-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734942#comment-14734942 ]
ASF GitHub Bot commented on MAHOUT-1771:
----------------------------------------

GitHub user srowen opened a pull request:

    https://github.com/apache/mahout/pull/158

    MAHOUT-1771 Cluster dumper omits indices and 0 elements for dense vector or sparse containing 0s

    Output indices in cluster representation whenever *any* vector has *some* zero elements that won't be output.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/mahout MAHOUT-1771

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/mahout/pull/158.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #158

----
commit a167e13bd9d420c291fbcd8c28cffafe04dc9a4c
Author: Sean Owen <so...@cloudera.com>
Date:   2015-09-08T14:58:26Z

    Output indices in cluster representation whenever *any* vector has *some* zero elements that won't be output.

----


> Cluster dumper omits indices and 0 elements for dense vector or sparse containing 0s
> ------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1771
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1771
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering, mrlegacy
>    Affects Versions: 0.9
>            Reporter: Sean Owen
>            Priority: Minor
>         Attachments: MAHOUT-1771.patch
>
>
> (EDIT: fixed incorrect analysis)
> Blast from the past -- are patches still being accepted for "mrlegacy" code?
> Something turned up incidentally when working with a customer that looks like
> a minor bug in the cluster dumper code.
> In {{AbstractCluster.java}}:
> {code}
>   public static List<Object> formatVectorAsJson(Vector v, String[] bindings) throws IOException {
>     boolean hasBindings = bindings != null;
>     boolean isSparse = !v.isDense() && v.getNumNondefaultElements() != v.size();
>     // we assume sequential access in the output
>     Vector provider = v.isSequentialAccess() ? v : new SequentialAccessSparseVector(v);
>     List<Object> terms = new LinkedList<>();
>     String term = "";
>     for (Element elem : provider.nonZeroes()) {
>       if (hasBindings && bindings.length >= elem.index() + 1 && bindings[elem.index()] != null) {
>         term = bindings[elem.index()];
>       } else if (hasBindings || isSparse) {
>         term = String.valueOf(elem.index());
>       }
>       Map<String, Object> term_entry = new HashMap<>();
>       double roundedWeight = (double) Math.round(elem.get() * 1000) / 1000;
>       if (hasBindings || isSparse) {
>         term_entry.put(term, roundedWeight);
>         terms.add(term_entry);
>       } else {
>         terms.add(roundedWeight);
>       }
>     }
>     return terms;
>   }
> {code}
> The problem is that this never outputs any element of a vector whose value
> is 0, but in some cases it also doesn't print indices. The output then has
> fewer entries than the vector has dimensions, and it's not possible to know
> where the omitted 0s are.
> It will not output indices if the vector is a dense vector, or if the number
> of non-default elements is the same as the size (which includes sparse
> vectors that contain 0 values, if those values have been set explicitly).
> However, the iteration is over non-zero elements only.
> The fix seems to be to print indices if the number of _non-zero_ elements is
> less than size, for _any_ vector:
> {code}
>     boolean isSparse = v.getNumNonZeroElements() != v.size();
> {code}
> Pretty straightforward, and minor, but wanted to check with everyone before
> making a change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
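
For illustration only, here is a minimal sketch of the ambiguity described in the issue. It uses plain double[] arrays rather than Mahout's Vector, so the class IndexedOutputSketch and the methods format and needsIndices are hypothetical stand-ins, not code from AbstractCluster.java or from the patch. It assumes only what the issue states: zeros are skipped during iteration, and the proposed rule is to emit indices whenever the non-zero count is less than the size.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexedOutputSketch {

  // Hypothetical stand-in for formatVectorAsJson: zeros are skipped, and each
  // surviving value is emitted either bare or as an {index: value} entry.
  public static List<Object> format(double[] values, boolean withIndices) {
    List<Object> terms = new ArrayList<>();
    for (int i = 0; i < values.length; i++) {
      if (values[i] == 0.0) {
        continue; // mirrors iterating over nonZeroes() only
      }
      double rounded = (double) Math.round(values[i] * 1000) / 1000;
      if (withIndices) {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put(String.valueOf(i), rounded);
        terms.add(entry);
      } else {
        terms.add(rounded);
      }
    }
    return terms;
  }

  // Rule proposed in the issue: indices are needed whenever the number of
  // non-zero elements differs from the size, regardless of vector type.
  public static boolean needsIndices(double[] values) {
    int nonZero = 0;
    for (double v : values) {
      if (v != 0.0) {
        nonZero++;
      }
    }
    return nonZero != values.length;
  }

  public static void main(String[] args) {
    double[] a = {1.5, 0.0, 2.25};
    double[] b = {0.0, 1.5, 2.25};
    // Without indices, two different vectors print identically.
    System.out.println(format(a, false));
    System.out.println(format(b, false));
    // With the proposed rule, the position of each value is preserved.
    System.out.println(format(a, needsIndices(a)));
    System.out.println(format(b, needsIndices(b)));
  }
}
```

In this sketch the first two calls produce the same list for two different vectors, which is the loss of information the issue reports; the last two calls differ because each value carries its index.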