I think your analysis is correct, but you are also probably correct that
having multiple levels at the same time would be preferable.
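To make the idea concrete, here is a toy sketch of what emitting shingles for a range of sizes would look like. This is not Lucene's ShingleFilter (the class name `Shingles` and method `shingles` are made up for illustration); it just shows the "sizes 1..n at the same time" behavior being discussed. Note that newer Lucene versions' ShingleFilter does accept a minimum and maximum shingle size, which may make the loop unnecessary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Shingles {
    // Emit all token n-grams ("shingles") of sizes minSize..maxSize,
    // analogous to running a single-size shingle filter once per size.
    public static List<String> shingles(List<String> tokens, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int size = minSize; size <= maxSize; size++) {
            // Slide a window of the current size across the token stream.
            for (int start = 0; start + size <= tokens.size(); start++) {
                out.add(String.join(" ", tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("please", "divide", "this", "sentence");
        // Sizes 1..2: four unigrams followed by three bigrams.
        System.out.println(shingles(tokens, 1, 2));
    }
}
```

Every size from 1 to n contributes its own windows, so shorter shingles are not "overcounted" per se, but each size level does add its own token mass to the feature counts.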

On Wed, Feb 1, 2012 at 1:05 PM, Stuart Smith <stu24m...@yahoo.com> wrote:

> Hello,
>    I was curious about how bayes handles the ngram argument, and how it
> could be modified..
>
> I started with the question: "If you say -ngram 3, does it consider ngrams
> of size 1,2, and 3, or just 3?"
>
> From looking at the code, it looks like it just uses the ShingleFilter
> from Lucene, which, per its documentation:
>
> A ShingleFilter constructs shingles (token n-grams) from a token stream.
> In other words, it creates combinations of tokens as a single token.
> For example, the sentence "please divide this sentence into shingles"
> might be tokenized into shingles "please divide", "divide this", "this
> sentence", "sentence into", and "into shingles".
> This filter handles position increments > 1 by inserting filler tokens
> (tokens with termtext "_"). It does not handle a position increment of 0.
>
>
> So, it looks like, no, it only uses the specific number.
> Is this understanding correct?
>
> It just takes the second column in your input file and runs ShingleFilter
> with a fixed ngram size?
> Are there plans to expand this any further?
>
> How badly would adding a loop like:
>
> for (int i = 1; i <= ngram; ++i) {
>     filter = new ShingleFilter(attribute_line, i);
>     // iterate through and add tokens for this shingle size
> }
>
> mess up the bayes classifier?
>
> This would overcount terms, but I'm not sure if this would wash out as
> every term would be overcounted equally, etc..
> And I assume it could cause other issues, but I don't know what.
>
> Ooo.. it looks like ngram size 3 just crashed some job tasks by hitting
> the GC limit.. so I suppose I have other issues to fix first :)
>
>
> Take care,
>   -stu
>
