This shows up with Mahout 0.10.0 (the distribution archive) and Hadoop 2.2.0.

When I run seq2sparse on a document containing the following tokens:

cash cash equival cash cash equival consist highli liquid instrument
commerci paper time deposit other monei market instrument which origin
matur three month less aggreg cash balanc bank reclassifi neg balanc
consist mainli unclear check account payabl neg balanc reclassifi
account payabl decemb

the tokens mainli, check and unclear are silently dropped (they do not
appear in the dictionary file).  The issue persists if I change the
analyzer to SimpleAnalyzer (-a
org.apache.lucene.analysis.core.SimpleAnalyzer).  I can understand an
English analyzer doing something like this, but it seems strange that
it would happen with SimpleAnalyzer.  (I wonder whether it is a
coincidence that these tokens appear consecutively in the input.)
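
To pin down exactly which tokens go missing, a check along these lines
could be used.  This is only a minimal sketch, not Mahout code:
missing_tokens is a made-up helper, and dictionary_terms is a
hypothetical stand-in for the terms read back from the dictionary file
(for example after dumping it to text).

```python
from collections import Counter

# Token stream copied from the document quoted above.
DOCUMENT = (
    "cash cash equival cash cash equival consist highli liquid instrument "
    "commerci paper time deposit other monei market instrument which origin "
    "matur three month less aggreg cash balanc bank reclassifi neg balanc "
    "consist mainli unclear check account payabl neg balanc reclassifi "
    "account payabl decemb"
)


def missing_tokens(text, dictionary_terms):
    """Return {token: count in text} for tokens absent from dictionary_terms."""
    counts = Counter(text.split())
    return {tok: n for tok, n in counts.items() if tok not in dictionary_terms}


# Hypothetical stand-in for the real dictionary contents: every token in
# the document except the three reported as missing.
dictionary_terms = set(DOCUMENT.split()) - {"mainli", "check", "unclear"}

print(missing_tokens(DOCUMENT, dictionary_terms))
```

The per-token counts are included because it may help to see whether
the dropped tokens share a frequency pattern, rather than only a
position in the input.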

What I am trying to do:  The standard analyzers don't do enough, and I
have no access to the client's cluster to preload a custom analyzer.
Processing the text before stuffing it into the initial sequence file
seemed to be the cleanest alternative, since there doesn't seem to be
any way to add a custom jar when using a stock Mahout app.
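
For illustration, the kind of pre-processing I mean looks roughly like
the sketch below.  It is an assumption-laden example, not my actual
pipeline: preprocess and the tiny STOPWORDS set are hypothetical, and
stemming is left out entirely.  The output is a plain space-separated
string, so that even a simple analyzer should only need to split on
whitespace.

```python
import re

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "with"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, keep only runs of ASCII letters, drop stopwords, rejoin."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in stopwords)


print(preprocess("Cash and cash equivalents consist of highly liquid instruments."))
```

The resulting string is what would be written as the value of each
record in the initial sequence file.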

Why dropped or mangled tokens matter, other than as missing information:
Ultimately what I need to do is calculate topic weights for an
arbitrary chunk of text.  (See next post.)  If I can't get the tokens
right, I don't think I can do this.


