I think your analysis is correct, and you're probably also right that emitting multiple n-gram sizes at the same time would be preferable.
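To make that concrete, here is a minimal, self-contained sketch (a hypothetical `ShingleSketch` class, not Mahout or Lucene code) of what "multiple levels at once" could look like: emit every n-gram of size 1 up to a maximum in a single pass over the token list, rather than re-running a fixed-size shingler once per size. (If I remember right, Lucene's ShingleFilter also has a constructor taking a min and max shingle size, which would be the idiomatic way to do this inside the existing pipeline.)

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleSketch {
    // Emit all token n-grams of sizes 1..maxSize, joined with spaces,
    // in one pass over the token list (no per-size rerun needed).
    static List<String> shingles(List<String> tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder sb = new StringBuilder();
            // Grow the shingle one token at a time, emitting each size as we go.
            for (int size = 1; size <= maxSize && start + size <= tokens.size(); size++) {
                if (size > 1) sb.append(' ');
                sb.append(tokens.get(start + size - 1));
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("please", "divide", "this", "sentence");
        // Unigrams and bigrams together, in one pass.
        System.out.println(shingles(tokens, 2));
    }
}
```

Note this emits each size exactly once per position, so nothing is double-counted the way a naive per-size rerun that also re-emits unigrams would be.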
On Wed, Feb 1, 2012 at 1:05 PM, Stuart Smith <stu24m...@yahoo.com> wrote:
> Hello,
>   I was curious about how bayes handles the ngram argument, and how it
> could be modified.
>
> I started with the question: "If you say -ngram 3, does it consider
> ngrams of size 1, 2, and 3, or just 3?"
>
> From looking at the code, it looks like it just uses the ShingleFilter
> from Lucene, which, per its documentation:
>
>   A ShingleFilter constructs shingles (token n-grams) from a token stream.
>   In other words, it creates combinations of tokens as a single token.
>   For example, the sentence "please divide this sentence into shingles"
>   might be tokenized into shingles "please divide", "divide this",
>   "this sentence", "sentence into", and "into shingles".
>   This filter handles position increments > 1 by inserting filler tokens
>   (tokens with termtext "_"). It does not handle a position increment of 0.
>
> So it looks like, no, it only uses the specific size.
> Is this understanding correct?
>
> It just takes the second column in your input file and runs ShingleFilter
> with a set ngram size?
> Are there plans to expand this any more?
>
> How badly would adding a loop like:
>
> for( int i = 2; i <= ngram; ++i ) {
>     filter = new ShingleFilter( attribute_line, i );
>     // iterate through and add tokens
> }
>
> mess up the bayes classifier?
>
> This would overcount terms, but I'm not sure if this would wash out, as
> every term would be overcounted equally, etc.
> And I assume it could cause other issues, but I don't know what.
>
> Ooo.. it looks like ngram size 3 just crashed some job tasks by hitting
> the GC limit.. so I suppose I have other issues to fix first :)
>
> Take care,
> -stu