Re: seq2sparse dropping tokens

Suneel Marthi Fri, 29 May 2015 12:14:39 -0700

Allen, could you please file a JIRA for this?

On Fri, May 29, 2015 at 8:58 AM, Allen McIntosh <amcint...@appcomsci.com>
wrote:


> This shows up with Mahout 0.10.0 (the distribution archive) and Hadoop
> 2.2.0
>
> When I run seq2sparse on a document containing the following tokens:
>
> cash cash equival cash cash equival consist highli liquid instrument
> commerci paper time deposit other monei market instrument which origin
> matur three month less aggreg cash balanc bank reclassifi neg balanc
> consist mainli unclear check account payabl neg balanc reclassifi
> account payabl decemb
>
> the tokens mainli, check and unclear are dropped on the floor (they do
> not appear in the dictionary file).  The issue persists if I change the
> analyzer to SimpleAnalyzer (-a
> org.apache.lucene.analysis.core.SimpleAnalyzer).  I can understand an
> English analyzer doing something like this, but it seems a little
> strange that it would happen with SimpleAnalyzer.  (I wonder if it is
> coincidence that these tokens appear consecutively in the input.)
>
> What I am trying to do:  The standard analyzers don't do enough, and I
> have no access to the client's cluster to preload a custom analyzer.
> Processing the text before stuffing it into the initial sequence file
> seemed to be the cleanest alternative, since there doesn't seem to be
> any way to add a custom jar when using a stock Mahout app.
>
> Why dropped or mangled tokens matter, other than as missing information:
> Ultimately what I need to do is calculate topic weights for an
> arbitrary chunk of text.  (See next post.)  If I can't get the tokens
> right, I don't think I can do this.
>
>
>
>
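For the JIRA, it may help to attach the exact list of tokens that never reach the dictionary. Below is a minimal Python sketch of that check, not part of Mahout. It assumes the dictionary file has first been dumped to text (for example with mahout seqdumper) and that each entry is printed as "Key: <term>: Value: <id>"; that line format, and the helper names parse_seqdumper and missing_tokens, are assumptions made for illustration.

```python
import re

# Tokens of the document quoted above, exactly as posted.
DOCUMENT = (
    "cash cash equival cash cash equival consist highli liquid instrument "
    "commerci paper time deposit other monei market instrument which origin "
    "matur three month less aggreg cash balanc bank reclassifi neg balanc "
    "consist mainli unclear check account payabl neg balanc reclassifi "
    "account payabl decemb"
)

# Assumed shape of one dumped dictionary entry: "Key: <term>: Value: <id>".
ENTRY = re.compile(r"^Key: (.*): Value: (.*)$")


def parse_seqdumper(lines):
    """Collect the dictionary terms from a text dump, skipping other lines."""
    terms = set()
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:
            terms.add(match.group(1))
    return terms


def missing_tokens(text, dictionary_terms):
    """Return whitespace-separated tokens of text that are absent from the
    dictionary, each reported once, in order of first appearance."""
    seen = set()
    missing = []
    for token in text.split():
        if token not in dictionary_terms and token not in seen:
            missing.append(token)
        seen.add(token)
    return missing
```

With the dump saved to a file (the name dictionary.txt is hypothetical), missing_tokens(DOCUMENT, parse_seqdumper(open("dictionary.txt"))) lists the dropped tokens. The check only compares whitespace-separated tokens against dictionary keys, so it reports what is missing but says nothing about why.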
