The default with the Mahout encoders is two probes.  For the intercept
term this is unnecessary, of course, provided you protect the intercept
term from other updates.  You can do that by encoding the other data into
a view of the original feature vector.
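
To make the view idea concrete, here is a minimal sketch in plain Java.  It
does not use the Mahout classes: the class name, the string-based hash, and
the fixed offset of 1 are all assumptions made for illustration.  Slot 0 is
reserved for the intercept and the hashed values are only ever written into
the remaining slots.  In Mahout itself I believe the equivalent would be to
pass something like `vector.viewPart(offset, length)` to the encoder, but
treat that as an assumption to check against the API.

```java
import java.util.Arrays;

// Illustrative only: a stand-in for hashing into a view so that the
// intercept slot can never be touched by hashed features.
final class ProtectedInterceptSketch {

    static final int INTERCEPT_INDEX = 0;

    // Hypothetical hashed update restricted to indices [offset, vector.length).
    static void addHashed(String value, double[] vector, int offset, int probes) {
        int width = vector.length - offset;
        for (int p = 0; p < probes; p++) {
            // A different hash per probe, here by salting the string (assumption).
            int h = (value + "#" + p).hashCode();
            vector[offset + Math.floorMod(h, width)] += 1.0;
        }
    }

    // Sets the intercept once, then hashes the categorical value with two
    // probes into the slots after the intercept.
    static double[] encode(String shape, int size) {
        double[] v = new double[size];
        v[INTERCEPT_INDEX] = 1.0;
        addHashed(shape, v, INTERCEPT_INDEX + 1, 2);
        return v;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode("circle", 30)));
    }
}
```

Because the hashed writes start at index 1, the intercept keeps its value no
matter which locations the probes pick.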

For each probe, a different hash is used so each value is put into multiple
locations.  Multiple probes are useful in general to decrease the effect of
the reduced dimensionality of the hashed representation.
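
As a rough illustration of what multiple probes do, here is a small
self-contained Java sketch.  It is not the Mahout implementation: the class
name and the seeded FNV-style hash are assumptions chosen only to show the
mechanism.  Each probe hashes the same value with a different seed and adds
1.0 at the resulting location.  When two probes, or a probe and a value
already stored in the vector, land on the same index, the weights add up in
that slot.  That looks consistent with the second sample quoted below, where
index 3 goes from 2.0 to 3.0 and only one new index appears.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative multi-probe hashed encoder; NOT Mahout's actual code.
final class MultiProbeSketch {

    // Hypothetical seeded hash (FNV-1a style); Mahout uses its own hash.
    static int seededHash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    // Location chosen by one probe: the probe number serves as the seed.
    static int location(String value, int probe, int size) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(seededHash(bytes, probe), size);
    }

    // Adds 1.0 at every probe location; collisions accumulate in one slot.
    static void addToVector(String value, double[] vector, int probes) {
        for (int p = 0; p < probes; p++) {
            vector[location(value, p, vector.length)] += 1.0;
        }
    }

    public static void main(String[] args) {
        double[] v = new double[30];
        addToVector("circle", v, 2);
        System.out.println(Arrays.toString(v));
    }
}
```

With two probes the total weight added is always 2.0, spread over either two
indices or a single index when the probes collide.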



On Fri, Nov 29, 2013 at 1:14 AM, Paul van Hoven <paul.van.ho...@gmail.com> wrote:

> For an example program using mahout I use the donut.csv sample data
> from the project (
>
> https://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/resources/donut.csv
> ). My code looks like this:
>
>     import java.util.ArrayList;
>
>     import org.apache.mahout.math.RandomAccessSparseVector;
>     import org.apache.mahout.math.Vector;
>     import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
>     import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
>     import com.csvreader.CsvReader;
>
>     public class Runner {
>
>         //Set the path accordingly!
>         public static final String csvInputDataPath = "/path/to/donut.csv";
>
>         public static void main(String[] args) {
>
>             FeatureVectorEncoder encoder = new StaticWordValueEncoder("features");
>             ArrayList<RandomAccessSparseVector> featureVectors =
>                 new ArrayList<RandomAccessSparseVector>();
>             try {
>                 CsvReader csvReader = new CsvReader(csvInputDataPath);
>                 csvReader.readHeaders();
>                 while( csvReader.readRecord() ) {
>                     Vector featureVector = new RandomAccessSparseVector(30);
>                     featureVector.set(0, new Double(csvReader.get("x")));
>                     featureVector.set(1, new Double(csvReader.get("y")));
>                     featureVector.set(2, new Double(csvReader.get("c")));
>                     featureVector.set(3, new Integer(csvReader.get("color")));
>                     System.out.println("Before: " + featureVector.toString());
>                     encoder.addToVector(csvReader.get("shape").getBytes(),
>                         featureVector);
>                     System.out.println(" After: " + featureVector.toString());
>                     featureVectors.add((RandomAccessSparseVector) featureVector);
>                 }
>             } catch(Exception e) {
>                 e.printStackTrace();
>             }
>
>             System.out.println("Program is done.");
>         }
>
>     }
>
>
> What confuses me is the following output (one sample):
>
>     Before: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0}
>      After: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0,29:1.0,25:1.0}
>
> As you can see, I added just one value "shape" to the vector. However
> two dimensions of this vector are encoded with 1.0. On the other hand,
> for some other data I get the output
>
>     Before: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:2.0}
>      After: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:3.0,16:1.0}
>
> Why? I would expect that _always_ only one dimension gets occupied by
> 1.0, as this is the standard case for categorical encoding. As it is,
> the output seems to be wrong.
>
> Thanks in advance,
> Paul
>
