Re: consistency of StaticWordValueEncoder

2015-01-08 Thread chirag lakhani
Thanks!  Is that standard practice or do people typically serialize their
encoders and then load the binaries later?

On Wed, Jan 7, 2015 at 5:25 PM, Ted Dunning ted.dunn...@gmail.com wrote:

> On Wed, Jan 7, 2015 at 2:20 PM, chirag lakhani chirag.lakh...@gmail.com
> wrote:
>
> > In the Mahout in Action book I got the impression that the term memo will
> > seed the random number generator and I wanted to confirm that means I will
> > have consistency if I deploy this vectorizer in both my Hadoop environment
> > as well as my Java app.  In particular, I am fixing the vector size to be
> > of length FEATURES and I am using memo as the name of my encoder.  Will
> > those two things guarantee consistency of my text vectorization?
>
> It should do.
>
> Anything else would be a bug (which is, of course, possible)



consistency of StaticWordValueEncoder

2015-01-07 Thread chirag lakhani
I am trying to vectorize text data for a Naive Bayes classifier that will be
trained in Hadoop; the resulting model will then be deployed in a Java app.
My basic approach is to tokenize a string of text using Lucene and then
encode each token using a StaticWordValueEncoder.  Here are the relevant
code snippets:

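// Encoder named "memo"; the same name is used in Hadoop and in the Java app.
private static FeatureVectorEncoder memoEncoder =
    new StaticWordValueEncoder("memo");

// Fixed-size sparse vector; FEATURES must be identical in both environments.
Vector v = new RandomAccessSparseVector(FEATURES);

// Tokenize the text and emit unigrams plus shingles (word n-grams).
StringReader reader = new StringReader(text);
StandardTokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
ShingleFilter sf = new ShingleFilter(source);
sf.setOutputUnigrams(true);
CharTermAttribute charTermAttribute =
    sf.addAttribute(CharTermAttribute.class);
sf.reset();
while (sf.incrementToken()) {
    // Add each token to the vector with weight 1.
    memoEncoder.addToVector(charTermAttribute.toString(), 1, v);
}
// Finish and release the token stream.
sf.end();
sf.close();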


In the Mahout in Action book I got the impression that the term memo will
seed the random number generator and I wanted to confirm that means I will
have consistency if I deploy this vectorizer in both my Hadoop environment
as well as my Java app.  In particular, I am fixing the vector size to be
of length FEATURES and I am using memo as the name of my encoder.  Will
those two things guarantee consistency of my text vectorization?


Re: consistency of StaticWordValueEncoder

2015-01-07 Thread Ted Dunning
On Wed, Jan 7, 2015 at 2:20 PM, chirag lakhani chirag.lakh...@gmail.com
wrote:

> In the Mahout in Action book I got the impression that the term memo will
> seed the random number generator and I wanted to confirm that means I will
> have consistency if I deploy this vectorizer in both my Hadoop environment
> as well as my Java app.  In particular, I am fixing the vector size to be
> of length FEATURES and I am using memo as the name of my encoder.  Will
> those two things guarantee consistency of my text vectorization?


It should do.

Anything else would be a bug (which is, of course, possible)
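
For anyone who wants to see why the encoder name and the vector size should be
enough, here is a minimal sketch of the general hashed-encoding idea.  It is
not Mahout's implementation: the class HashedEncoderSketch, its encode helper
and the use of CRC32 as the hash are all assumptions made for illustration.
The point it shows is that the index depends only on the name, the token and
the vector length, so two independently constructed encoders agree.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

/** Illustrative stand-in for a hashed word encoder; NOT Mahout's code. */
public class HashedEncoderSketch {
    private final String name;

    public HashedEncoderSketch(String name) {
        this.name = name;
    }

    /** Adds weight at an index derived only from (name, token, vector length). */
    public void addToVector(String token, double weight, double[] vector) {
        CRC32 crc = new CRC32();
        // Explicit UTF-8 so the bytes do not depend on the platform default charset.
        crc.update(name.getBytes(StandardCharsets.UTF_8));
        crc.update(0); // separator byte between name and token
        crc.update(token.getBytes(StandardCharsets.UTF_8));
        int index = (int) (crc.getValue() % vector.length);
        vector[index] += weight;
    }

    /** Builds a fresh encoder and vector, as a separate process would. */
    public static double[] encode(String name, int features, List<String> tokens) {
        HashedEncoderSketch encoder = new HashedEncoderSketch(name);
        double[] v = new double[features];
        for (String token : tokens) {
            encoder.addToVector(token, 1.0, v);
        }
        return v;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("naive", "bayes", "naive bayes");
        double[] trainingSide = encode("memo", 1000, tokens);
        double[] servingSide = encode("memo", 1000, tokens);
        // Same name, same size, same tokens: the two vectors are identical.
        System.out.println(Arrays.equals(trainingSide, servingSide)); // prints true
    }
}
```

The same kind of check can be run against the real encoder: vectorize a few
fixed strings in both environments and compare the results.  Note that the
tokenizer configuration and Lucene version also have to match, since the
encoder only sees the tokens it is given.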