[ 
https://issues.apache.org/jira/browse/MAHOUT-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-1771:
------------------------------
    Description: 
(EDIT: fixed incorrect analysis)

Blast from the past -- are patches still being accepted for "mrlegacy" code? 
Something turned up incidentally when working with a customer that looks like a 
minor bug in the cluster dumper code.

In {{AbstractCluster.java}}:

{code}
public static List<Object> formatVectorAsJson(Vector v, String[] bindings) 
throws IOException {

    boolean hasBindings = bindings != null;
    boolean isSparse = !v.isDense() && v.getNumNondefaultElements() != v.size();

    // we assume sequential access in the output
    Vector provider = v.isSequentialAccess() ? v : new 
SequentialAccessSparseVector(v);

    List<Object> terms = new LinkedList<>();
    String term = "";

    for (Element elem : provider.nonZeroes()) {

      if (hasBindings && bindings.length >= elem.index() + 1 && 
bindings[elem.index()] != null) {
        term = bindings[elem.index()];
      } else if (hasBindings || isSparse) {
        term = String.valueOf(elem.index());
      }

      Map<String, Object> term_entry = new HashMap<>();
      double roundedWeight = (double) Math.round(elem.get() * 1000) / 1000;
      if (hasBindings || isSparse) {
        term_entry.put(term, roundedWeight);
        terms.add(term_entry);
      } else {
        terms.add(roundedWeight);
      }
    }

    return terms;
  }
{code}

The problem is that this never outputs any elements of a vector with value 0, 
but, also doesn't print indices in some cases. This means the output is smaller 
than the number of dimensions, but it's not possible to know where the omitted 
0s are.

It will not output indices if the vector is a dense vector, or if the number of 
non-default elements is the same as the size (which includes sparse vectors 
even containing 0 values, if they have been set explicitly). However the 
iteration is over non-zero elements only. 

The fix seems to be to print indices if the number of _non-zero_ elements is 
less than size, for _any_ vector:

{code}
    boolean isSparse = v.getNumZeroElements() != v.size();
{code}

Pretty straightforward, and minor, but wanted to check with everyone before 
making a change.

  was:
Blast from the past -- are patches still being accepted for "mrlegacy" code? 
Something turned up incidentally when working with a customer that looks like a 
minor bug in the cluster dumper code.

In {{AbstractCluster.java}}:

{code}
public static List<Object> formatVectorAsJson(Vector v, String[] bindings) 
throws IOException {

    boolean hasBindings = bindings != null;
    boolean isSparse = !v.isDense() && v.getNumNondefaultElements() != v.size();

    // we assume sequential access in the output
    Vector provider = v.isSequentialAccess() ? v : new 
SequentialAccessSparseVector(v);

    List<Object> terms = new LinkedList<>();
    String term = "";

    for (Element elem : provider.nonZeroes()) {

      if (hasBindings && bindings.length >= elem.index() + 1 && 
bindings[elem.index()] != null) {
        term = bindings[elem.index()];
      } else if (hasBindings || isSparse) {
        term = String.valueOf(elem.index());
      }

      Map<String, Object> term_entry = new HashMap<>();
      double roundedWeight = (double) Math.round(elem.get() * 1000) / 1000;
      if (hasBindings || isSparse) {
        term_entry.put(term, roundedWeight);
        terms.add(term_entry);
      } else {
        terms.add(roundedWeight);
      }
    }

    return terms;
  }
{code}

Imagine a {{DenseVector}} with 5 elements, of which two are 0. It's considered 
dense in this method since the number of non-default elements is 5 (all 
elements are "non default" in a dense vector).

However the iteration is over non-zero elements only. And indices are only 
printed if it's sparse (or has bindings). So the result will be the 3 non-zero 
elements printed without indices. Which dimensions they are can't be determined.

The fix seems to be either:
- Compare number of _non-zero_ elements to the size when determining if it's 
sparse
- Iterate over all elements if non-sparse

I think the first is the intent? it would be a one-line change if so.

{code}
    boolean isSparse = !v.isDense() && v.getNumZeroElements() != v.size();
{code}

Pretty straightforward, and minor, but wanted to check with everyone before 
making a change.

        Summary: Cluster dumper omits indices and 0 elements for dense vector 
or sparse containing 0s  (was: Cluster dumper omits indices and 0 elements for 
dense vectors)

> Cluster dumper omits indices and 0 elements for dense vector or sparse 
> containing 0s
> ------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1771
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1771
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering, mrlegacy
>    Affects Versions: 0.9
>            Reporter: Sean Owen
>            Priority: Minor
>
> (EDIT: fixed incorrect analysis)
> Blast from the past -- are patches still being accepted for "mrlegacy" code? 
> Something turned up incidentally when working with a customer that looks like 
> a minor bug in the cluster dumper code.
> In {{AbstractCluster.java}}:
> {code}
> public static List<Object> formatVectorAsJson(Vector v, String[] bindings) 
> throws IOException {
>     boolean hasBindings = bindings != null;
>     boolean isSparse = !v.isDense() && v.getNumNondefaultElements() != 
> v.size();
>     // we assume sequential access in the output
>     Vector provider = v.isSequentialAccess() ? v : new 
> SequentialAccessSparseVector(v);
>     List<Object> terms = new LinkedList<>();
>     String term = "";
>     for (Element elem : provider.nonZeroes()) {
>       if (hasBindings && bindings.length >= elem.index() + 1 && 
> bindings[elem.index()] != null) {
>         term = bindings[elem.index()];
>       } else if (hasBindings || isSparse) {
>         term = String.valueOf(elem.index());
>       }
>       Map<String, Object> term_entry = new HashMap<>();
>       double roundedWeight = (double) Math.round(elem.get() * 1000) / 1000;
>       if (hasBindings || isSparse) {
>         term_entry.put(term, roundedWeight);
>         terms.add(term_entry);
>       } else {
>         terms.add(roundedWeight);
>       }
>     }
>     return terms;
>   }
> {code}
> The problem is that this never outputs any elements of a vector with value 0, 
> but, also doesn't print indices in some cases. This means the output is 
> smaller than the number of dimensions, but it's not possible to know where 
> the omitted 0s are.
> It will not output indices if the vector is a dense vector, or if the number 
> of non-default elements is the same as the size (which includes sparse 
> vectors even containing 0 values, if they have been set explicitly). However 
> the iteration is over non-zero elements only. 
> The fix seems to be to print indices if the number of _non-zero_ elements is 
> less than size, for _any_ vector:
> {code}
>     boolean isSparse = v.getNumZeroElements() != v.size();
> {code}
> Pretty straightforward, and minor, but wanted to check with everyone before 
> making a change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to