Allen, could you please file a JIRA for this?

On Fri, May 29, 2015 at 8:58 AM, Allen McIntosh <amcint...@appcomsci.com> wrote:
> This shows up with Mahout 0.10.0 (the distribution archive) and Hadoop 2.2.0
>
> When I run seq2sparse on a document containing the following tokens:
>
> cash cash equival cash cash equival consist highli liquid instrument
> commerci paper time deposit other monei market instrument which origin
> matur three month less aggreg cash balanc bank reclassifi neg balanc
> consist mainli unclear check account payabl neg balanc reclassifi
> account payabl decemb
>
> the tokens mainli, check and unclear are dropped on the floor (they do
> not appear in the dictionary file). The issue persists if I change the
> analyzer to SimpleAnalyzer (-a
> org.apache.lucene.analysis.core.SimpleAnalyzer). I can understand an
> English analyzer doing something like this, but it seems a little
> strange that it would happen with SimpleAnalyzer. (I wonder if it is
> coincidence that these tokens appear consecutively in the input.)
>
> What I am trying to do: The standard analyzers don't do enough, and I
> have no access to the client's cluster to preload a custom analyzer.
> Processing the text before stuffing it into the initial sequence file
> seemed to be the cleanest alternative, since there doesn't seem to be
> any way to add a custom jar when using a stock Mahout app.
>
> Why dropped or mangled tokens matter, other than as missing information:
> Ultimately what I need to do is calculate topic weights for an
> arbitrary chunk of text. (See next post.) If I can't get the tokens
> right, I don't think I can do this.
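
For the JIRA, it may help to list exactly which input tokens are absent from the dictionary. Below is a minimal Python sketch of that comparison, assuming the dictionary file has already been dumped to a plain set of terms. The name `missing_tokens` is hypothetical and not part of Mahout, and the dictionary in the usage section is simulated to mirror the report, not real seq2sparse output.

```python
from collections import Counter

# The pre-tokenized document quoted in the report above.
DOC = (
    "cash cash equival cash cash equival consist highli liquid instrument "
    "commerci paper time deposit other monei market instrument which origin "
    "matur three month less aggreg cash balanc bank reclassifi neg balanc "
    "consist mainli unclear check account payabl neg balanc reclassifi "
    "account payabl decemb"
)


def missing_tokens(text, dictionary_terms):
    """Return (token, count) pairs for whitespace-separated tokens in `text`
    that are absent from `dictionary_terms`, in first-seen order."""
    counts = Counter(text.split())  # Counter keeps insertion order (Python 3.7+)
    return [(tok, n) for tok, n in counts.items() if tok not in dictionary_terms]


if __name__ == "__main__":
    # Simulated dictionary (assumption): every input token except the three
    # reported as dropped. Replace with the terms from the real dictionary dump.
    simulated_dictionary = set(DOC.split()) - {"mainli", "unclear", "check"}
    for token, count in missing_tokens(DOC, simulated_dictionary):
        print(token, count)
```

Printing the per-document count next to each missing token makes it easy to see whether the drops correlate with token frequency or only with position in the input.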