> Is this what you want?

Yes.

ShingleFilter outputs both the original tokens and the shingled tokens,
so putting something like KeepWordFilter before ShingleFilter will not
solve the issue.

We could probably extend ShingleFilter to create a
ShingleStopWordFilter, since this type of problem/solution seems
fairly common?
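
To make the intended behavior concrete, here is a rough sketch in plain
Python rather than against the Lucene TokenStream API. It uses the example
sentence and stoplist from Steve's mail quoted below, with whitespace
tokenization and n=2. The names shingles, keep_stop_shingles and keep_regex
are made up for illustration and are not existing Lucene or Solr classes;
the regex variant is only my reading of the KeepRegexFilter idea.

```python
import re

# Stoplist and regex taken from the quoted example.
STOP_WORDS = {"is", "the", "of", "and", "to", "for", "or"}
STOP_REGEX = re.compile(r"(?:^|\s)(?:is|the|of|and|to|for|or)(?:\s|$)")


def shingles(tokens, n=2):
    """Return every n-token shingle, joined by single spaces (no unigrams)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def keep_stop_shingles(tokens, stop_words, n=2):
    """Keep only the shingles that contain at least one stop word."""
    return [
        s for s in shingles(tokens, n)
        if any(t in stop_words for t in s.split(" "))
    ]


def keep_regex(shingle_list, pattern):
    """Hypothetical regex variant: keep shingles the pattern matches."""
    return [s for s in shingle_list if pattern.search(s)]


text = ("Manufacturing is the use of machines, tools and labor "
        "to make things for use or sale.")

# Whitespace tokenization: punctuation stays attached ("machines,", "sale.").
tokens = text.split()

all_bigrams = shingles(tokens)
kept = keep_stop_shingles(tokens, STOP_WORDS)
dropped = [s for s in all_bigrams if s not in kept]

print(dropped)  # ['machines, tools', 'make things']
```

On this input the set-membership version and the regex version keep the
same 13 bigrams, so either could sit behind ShingleFilter (with unigram
output turned off) or inside a subclass of it.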

On Mon, Jul 27, 2009 at 1:22 PM, Steven A Rowe<sar...@syr.edu> wrote:
> Hi Jason,
>
> On 7/27/2009 at 3:15 PM, Jason Rutherglen wrote:
>> I'd like to enable ShingleFilter to only create shingles for a set of
>> (stop) words (rather than for all N tokens).
>
> For purposes of discussion, here's some example input (first sentence from 
> <http://en.wikipedia.org/wiki/Manufacturing>):
>
>        Manufacturing is the use of machines, tools and labor
>        to make things for use or sale.
>
> For n=2 and stoplist = { is, the, of, and, to, for, or }, and assuming 
> WhitespaceAnalyzer, I think what you want is for ShingleFilter to *exclude* 
> from output the following shingles (no unigrams output); since all other 
> bigrams contain at least one stopword, they would be output:
>
>        /machines, tools/
>        /make things/
>
> Is this what you want?
>
> It might make sense, rather than modifying ShingleFilter, to create a new 
> TokenFilter that can exclude terms you don't like.
>
> Solr has KeepWordFilter, which is close to what you want (the inverse of 
> StopFilter), with the exception that you want to keep shingles that *contain* 
> words from a list you supply.
>
> Perhaps a new TokenFilter subclass that can take in a regular expression 
> would work?  (Maybe called KeepRegexFilter.)  Stopword lists are generally 
> small enough to make building a regex to match them fairly simple, e.g. for 
> the above list:
>
>        (?:^|\s)(?:is|the|of|and|to|for|or)(?:\s|$)
>
> Alternatively/additionally, maybe a Keep{Term,Phrase,Keyword}Filter that 
> takes in a list of words, then builds a regex like above?
>
> Having this functionality separate from ShingleFilter would be nice, I think, 
> because it would be useful in other contexts.
>
> Steve
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>
