Re: consistency of StaticWordValueEncoder
Thanks! Is that standard practice, or do people typically serialize their encoders and then load the binaries later?

On Wed, Jan 7, 2015 at 5:25 PM, Ted Dunning ted.dunn...@gmail.com wrote:

> On Wed, Jan 7, 2015 at 2:20 PM, chirag lakhani chirag.lakh...@gmail.com wrote:
>
>> In the Mahout in Action book I got the impression that the term memo will seed the random number generator and I wanted to confirm that means I will have consistency if I deploy this vectorizer in both my Hadoop environment as well as my Java app. In particular, I am fixing the vector size to be of length FEATURES and I am using memo as the name of my encoder. Will those two things guarantee consistency of my text vectorization?
>
> It should do. Anything else would be a bug (which is, of course, possible)
consistency of StaticWordValueEncoder
I am trying to vectorize text data for a Naive Bayes classifier that will be trained in Hadoop; the resulting model will then be deployed in a Java app. My basic approach is to tokenize a string of text using Lucene and then encode each token using a StaticWordValueEncoder. Here are the relevant code snippets:

    private static FeatureVectorEncoder memoEncoder = new StaticWordValueEncoder(memo);

    Vector v = new RandomAccessSparseVector(FEATURES);

    // Tokenize the text and emit shingles (word n-grams) alongside the unigrams.
    StringReader reader = new StringReader(text);
    StandardTokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
    ShingleFilter sf = new ShingleFilter(source);
    sf.setOutputUnigrams(true);
    CharTermAttribute charTermAttribute = sf.addAttribute(CharTermAttribute.class);

    // Hash each token into the fixed-size vector with weight 1.
    sf.reset();
    while (sf.incrementToken()) {
        memoEncoder.addToVector(charTermAttribute.toString(), 1, v);
    }

In the Mahout in Action book I got the impression that the term memo will seed the random number generator, and I wanted to confirm that this means I will have consistency if I deploy this vectorizer in both my Hadoop environment and my Java app. In particular, I am fixing the vector size to be of length FEATURES and I am using memo as the name of my encoder. Will those two things guarantee consistency of my text vectorization?
Re: consistency of StaticWordValueEncoder
On Wed, Jan 7, 2015 at 2:20 PM, chirag lakhani chirag.lakh...@gmail.com wrote:

> In the Mahout in Action book I got the impression that the term memo will seed the random number generator and I wanted to confirm that means I will have consistency if I deploy this vectorizer in both my Hadoop environment as well as my Java app. In particular, I am fixing the vector size to be of length FEATURES and I am using memo as the name of my encoder. Will those two things guarantee consistency of my text vectorization?

It should do. Anything else would be a bug (which is, of course, possible)
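
To illustrate why a fixed encoder name plus a fixed vector size should be enough, here is a minimal, stdlib-only sketch of hashed encoding. It is not Mahout's implementation: the class HashedEncoderSketch, its methods, the FNV-1a hash, and the name/token separator are all assumptions made for illustration, and the name "memo" and the size 1000 are stand-ins for the encoder name and FEATURES from the question. The point is only that the index is a pure function of (name, token, vector size), so two separately constructed encoders agree without any serialized state.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Toy hashed encoder for illustration only; NOT Mahout's StaticWordValueEncoder. */
final class HashedEncoderSketch {
    private final String name;      // stands in for the encoder name, e.g. "memo"
    private final int numFeatures;  // stands in for FEATURES

    HashedEncoderSketch(String name, int numFeatures) {
        if (numFeatures <= 0) {
            throw new IllegalArgumentException("numFeatures must be positive");
        }
        this.name = name;
        this.numFeatures = numFeatures;
    }

    /** 32-bit FNV-1a over explicit UTF-8 bytes, so the platform charset does not matter. */
    static int fnv1a(String s) {
        int h = 0x811C9DC5;               // FNV offset basis
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;              // FNV prime
        }
        return h;
    }

    /** The index depends only on the encoder name, the token, and the vector size. */
    int indexFor(String token) {
        return Math.floorMod(fnv1a(name + ":" + token), numFeatures);
    }

    void addToVector(String token, double weight, double[] vector) {
        vector[indexFor(token)] += weight;
    }

    double[] encode(String[] tokens) {
        double[] v = new double[numFeatures];
        for (String t : tokens) {
            addToVector(t, 1.0, v);
        }
        return v;
    }

    public static void main(String[] args) {
        String[] tokens = {"naive", "bayes", "naive bayes"};
        // Two independent instances, as in a training job and a deployed app.
        double[] trained = new HashedEncoderSketch("memo", 1000).encode(tokens);
        double[] deployed = new HashedEncoderSketch("memo", 1000).encode(tokens);
        System.out.println(Arrays.equals(trained, deployed)); // prints true
    }
}
```

In this sketch the encoder carries no learned state, so there is nothing to serialize beyond the name and the vector size. What can still differ between two environments is everything upstream of the hash: the tokenizer version and settings, and how strings are turned into bytes. Comparing the vector for a few fixed strings in both the Hadoop job and the Java app is a cheap way to catch that kind of drift.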