I thought about this, since I think the overall resources required would be less than for creating a filter. Ultimately I decided against it for two reasons:

1) I'm working with an existing index of ~50 million documents, and I don't want to reindex the whole thing, or even just the documents that contain profanity, if I can avoid it.
2) Filtering at indexing time means I can't effectively add new words to the profanity list without reindexing.

Good suggestion, though, I appreciate it.

Greg

----- Original Message ----
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, March 7, 2007 2:07:38 PM
Subject: Re: Negative Filtering (such as for profanity)

Not sure if this is helpful given your proposed solution, but could you do something on the indexing side, such as:

1. Remove the profanity from the token stream, much like a stopword. This would also mean stripping it from the display text.
2. If your TokenFilter comes across a profanity, somehow mark the document as containing a profanity via a "profanity" Field (not sure if there is a way, in Lucene, to add another Field while you are in the analysis phase, but you could also have it update a table in a db or something).

Then on search, you could just say (regular query) +profanity:false

HTH,
Grant

On Mar 7, 2007, at 10:07 AM, Greg Gershman wrote:

> I'm attempting to create a profanity filter. I thought to use a
> QueryFilter created with a Query of (-$#!+ AND [EMAIL PROTECTED] AND etc).
> The problem I have run into is that, as a pure negative query is not
> supported (a query for (-term) DOES NOT return the inverse of a
> query for (term)), I believe the bit set returned by a purely
> negative QueryFilter is empty, so no matter how many results are
> returned by the initial query, the result after filtering is always
> zero documents.
>
> I was wondering if anyone had suggestions as to how else to do
> this. I've considered simply amending the query string submitted
> by the user to include a pre-generated String that would exclude
> the query terms, but I consider this a non-elegant solution. I had
> also thought about creating a new sub-class of QueryFilter,
> NegativeQueryFilter. Basically, it would work just like a
> QueryFilter, taking a positive query (so, I would pass it an OR'ed
> list of profane words), then the resulting bits are simply
> flipped. I think this would work, unless I'm missing something.
> I'm going to experiment with it, I'd appreciate anyone's thoughts
> on this.
>
> Thanks,
>
> Greg

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
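
For anyone following the thread, the bit-flipping idea behind the proposed NegativeQueryFilter can be sketched roughly as below. This is only an illustration under some assumptions, not Lucene code: it uses plain java.util.BitSet and assumes the bits for the positive (OR'ed profanity) query are already available in that form. The class and method names (NegativeFilterSketch, negate, applyFilter) and the maxDoc parameter are hypothetical, and negate assumes the input has no bits set at or above maxDoc.

```java
import java.util.BitSet;

/** Hypothetical sketch of the "flip the bits" idea; not part of Lucene. */
final class NegativeFilterSketch {

    /**
     * Returns a copy of matches with every bit in [0, maxDoc) flipped, so
     * documents that matched the positive query become excluded and all
     * other documents become allowed.
     */
    public static BitSet negate(BitSet matches, int maxDoc) {
        BitSet allowed = (BitSet) matches.clone();
        allowed.flip(0, maxDoc);
        return allowed;
    }

    /** Keeps only the hits that are also set in the allowed bits. */
    public static BitSet applyFilter(BitSet hits, BitSet allowed) {
        BitSet result = (BitSet) hits.clone();
        result.and(allowed);
        return result;
    }

    public static void main(String[] args) {
        int maxDoc = 8; // assumed number of documents in the index

        // Documents 1 and 4 match the OR'ed list of profane words.
        BitSet profane = new BitSet(maxDoc);
        profane.set(1);
        profane.set(4);

        // Documents 0, 1, 4 and 6 match the user's query.
        BitSet hits = new BitSet(maxDoc);
        hits.set(0);
        hits.set(1);
        hits.set(4);
        hits.set(6);

        BitSet allowed = negate(profane, maxDoc);
        System.out.println(applyFilter(hits, allowed)); // prints {0, 6}
    }
}
```

Because the profanity bits come from a query evaluated at search time, extending the word list only means rebuilding that one bit set, not reindexing.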