[
https://issues.apache.org/jira/browse/MAHOUT-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734944#comment-14734944
]
Sean Owen commented on MAHOUT-1771:
-----------------------------------
See pull request https://github.com/apache/mahout/pull/158 actually
> Cluster dumper omits indices and 0 elements for dense vector or sparse
> containing 0s
> ------------------------------------------------------------------------------------
>
> Key: MAHOUT-1771
> URL: https://issues.apache.org/jira/browse/MAHOUT-1771
> Project: Mahout
> Issue Type: Bug
> Components: Clustering, mrlegacy
> Affects Versions: 0.9
> Reporter: Sean Owen
> Priority: Minor
> Attachments: MAHOUT-1771.patch
>
>
> (EDIT: fixed incorrect analysis)
> Blast from the past -- are patches still being accepted for "mrlegacy" code?
> Something turned up incidentally when working with a customer that looks like
> a minor bug in the cluster dumper code.
> In {{AbstractCluster.java}}:
> {code}
> public static List<Object> formatVectorAsJson(Vector v, String[] bindings)
> throws IOException {
> boolean hasBindings = bindings != null;
> boolean isSparse = !v.isDense() && v.getNumNondefaultElements() !=
> v.size();
> // we assume sequential access in the output
> Vector provider = v.isSequentialAccess() ? v : new
> SequentialAccessSparseVector(v);
> List<Object> terms = new LinkedList<>();
> String term = "";
> for (Element elem : provider.nonZeroes()) {
> if (hasBindings && bindings.length >= elem.index() + 1 &&
> bindings[elem.index()] != null) {
> term = bindings[elem.index()];
> } else if (hasBindings || isSparse) {
> term = String.valueOf(elem.index());
> }
> Map<String, Object> term_entry = new HashMap<>();
> double roundedWeight = (double) Math.round(elem.get() * 1000) / 1000;
> if (hasBindings || isSparse) {
> term_entry.put(term, roundedWeight);
> terms.add(term_entry);
> } else {
> terms.add(roundedWeight);
> }
> }
> return terms;
> }
> {code}
> The problem is that this never outputs any elements of a vector with value 0,
> but, also doesn't print indices in some cases. This means the output is
> smaller than the number of dimensions, but it's not possible to know where
> the omitted 0s are.
> It will not output indices if the vector is a dense vector, or if the number
> of non-default elements is the same as the size (which includes sparse
> vectors even containing 0 values, if they have been set explicitly). However
> the iteration is over non-zero elements only.
> The fix seems to be to print indices if the number of _non-zero_ elements is
> less than size, for _any_ vector:
> {code}
> boolean isSparse = v.getNumZeroElements() != v.size();
> {code}
> Pretty straightforward, and minor, but wanted to check with everyone before
> making a change.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)