[ 
https://issues.apache.org/jira/browse/MAHOUT-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734944#comment-14734944
 ] 

Sean Owen commented on MAHOUT-1771:
-----------------------------------

Actually, see pull request https://github.com/apache/mahout/pull/158

> Cluster dumper omits indices and 0 elements for dense vector or sparse 
> containing 0s
> ------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1771
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1771
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering, mrlegacy
>    Affects Versions: 0.9
>            Reporter: Sean Owen
>            Priority: Minor
>         Attachments: MAHOUT-1771.patch
>
>
> (EDIT: fixed incorrect analysis)
> Blast from the past -- are patches still being accepted for "mrlegacy" code? 
> Something turned up incidentally when working with a customer that looks like 
> a minor bug in the cluster dumper code.
> In {{AbstractCluster.java}}:
> {code}
> public static List<Object> formatVectorAsJson(Vector v, String[] bindings) throws IOException {
>   boolean hasBindings = bindings != null;
>   boolean isSparse = !v.isDense() && v.getNumNondefaultElements() != v.size();
>   // we assume sequential access in the output
>   Vector provider = v.isSequentialAccess() ? v : new SequentialAccessSparseVector(v);
>   List<Object> terms = new LinkedList<>();
>   String term = "";
>   for (Element elem : provider.nonZeroes()) {
>     if (hasBindings && bindings.length >= elem.index() + 1 && bindings[elem.index()] != null) {
>       term = bindings[elem.index()];
>     } else if (hasBindings || isSparse) {
>       term = String.valueOf(elem.index());
>     }
>     Map<String, Object> term_entry = new HashMap<>();
>     double roundedWeight = (double) Math.round(elem.get() * 1000) / 1000;
>     if (hasBindings || isSparse) {
>       term_entry.put(term, roundedWeight);
>       terms.add(term_entry);
>     } else {
>       terms.add(roundedWeight);
>     }
>   }
>   return terms;
> }
> {code}
> The problem is that this never outputs the elements of a vector whose value 
> is 0, and in some cases it also omits the indices. The output then has fewer 
> entries than the vector has dimensions, and it is not possible to know where 
> the omitted 0s are.
> It will not output indices if the vector is a dense vector, or if the number 
> of non-default elements is the same as the size (which includes sparse 
> vectors that contain 0 values, if those have been set explicitly). However, 
> the iteration is over non-zero elements only.
> The fix seems to be to print indices if the number of _non-zero_ elements is 
> less than size, for _any_ vector:
> {code}
>     boolean isSparse = v.getNumNonZeroElements() != v.size();
> {code}
> Pretty straightforward, and minor, but wanted to check with everyone before 
> making a change.
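
To make the ambiguity in the quoted description concrete, here is a minimal, self-contained sketch. It does not use Mahout: a plain double[] stands in for a dense vector, and the class and method names (IndexedOutputSketch, format, proposedPrintIndices) are hypothetical, chosen only for illustration. It mimics the loop above, which skips zeros, and compares output without indices against output that uses the proposed "fewer non-zero elements than size" rule.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IndexedOutputSketch {

    // Mimics the loop in formatVectorAsJson: zero elements are always skipped,
    // and indices are attached only when printIndices is true.
    static List<Object> format(double[] values, boolean printIndices) {
        List<Object> terms = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            if (values[i] == 0.0) {
                continue; // same effect as iterating over non-zero elements only
            }
            double rounded = (double) Math.round(values[i] * 1000) / 1000;
            if (printIndices) {
                Map<String, Object> entry = new LinkedHashMap<>();
                entry.put(String.valueOf(i), rounded);
                terms.add(entry);
            } else {
                terms.add(rounded);
            }
        }
        return terms;
    }

    // Counts the elements that are not exactly zero.
    static int countNonZero(double[] values) {
        int count = 0;
        for (double value : values) {
            if (value != 0.0) {
                count++;
            }
        }
        return count;
    }

    // Proposed rule: print indices whenever at least one element is zero,
    // regardless of how the vector is stored.
    static boolean proposedPrintIndices(double[] values) {
        return countNonZero(values) != values.length;
    }

    public static void main(String[] args) {
        double[] dense = {1.5, 0.0, 2.25};
        // No indices (what happens today for a dense vector): the position
        // of the dropped zero cannot be recovered from the output.
        System.out.println(format(dense, false));
        // Proposed rule: indices show which dimension was omitted.
        System.out.println(format(dense, proposedPrintIndices(dense)));
    }
}
```

With the input {1.5, 0.0, 2.25}, the first call yields two bare weights with no hint that index 1 was skipped, while the second keys each weight by its index. A vector with no zeros at all would still be printed as a plain list under the proposed rule, since nothing is omitted in that case.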



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
