I _think_ Lucene 2.1 (or is it trunk?, I lose track) has the ability
to delete all documents containing a term. So, every time you update
your profanity list, you could iterate over it and remove all
documents that contain the terms.
If a user can never get these documents via a query, then I don't see
any reason to allow them in the index to begin with.
Also, I don't use QueryFilters much, but I'm curious as to how they
perform on that many docs.
On Mar 7, 2007, at 5:38 PM, Greg Gershman wrote:
I thought about this, as I think overall the resources required
would be less than creating a filter. Ultimately I decided against
it for a few reasons:
1) I'm working with an existing index of ~50 million documents, I
don't want to reindex the whole thing, or even just the documents
that contain profanity, if I can avoid it.
2) Filtering at indexing time means I can't effectively add new
words to the profanity list without reindexing.
Good suggestion, though, I appreciate it.
Greg
----- Original Message ----
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, March 7, 2007 2:07:38 PM
Subject: Re: Negative Filtering (such as for profanity)
Not sure if this helpful given your proposed solution, but could you
do something on the indexing side, such as:
1. Remove the profanity from the token stream, much like a
stopword. This would also mean stripping it from the display text
2. If your TokenFilter comes across a profanity, somehow mark the
document as containing a profanity via a "profanity" Field (not sure
if there is a way, in Lucene, to add another Field while you are in
the analysis phase, but you could also have it update a table in a db
or something.) Then on search, you could just say (regular query)
+profanity:false
HTH,
Grant
On Mar 7, 2007, at 10:07 AM, Greg Gershman wrote:
I'm attempting to create a profanity filter. I thought to use a
QueryFilter created with a Query of (-$#!+ AND [EMAIL PROTECTED] AND etc). The
problem I have run into is that, as a pure negative query is not
supported (a query for (-term) DOES NOT return the inverse of a
query for (term)), I believe the bit set returned by a purely
negative QueryFilter is empty, so no matter how many results
returned by the initial query, the result after filtering is always
zero documents.
I was wondering if anyone had suggestions as to how else to do
this. I've considered simply amending the query string submitted
by the user to include a pre-generated String that would exclude
the query terms, but I consider this a non-elegant solution. I had
also thought about creating a new sub-class of QueryFilter,
NegativeQueryFilter. Basically, it would works just like a
QueryFilter, taking a positive query (so, I would pass it an OR'ed
list of profane words), then the resulting bits are simply
flipped. I think this would work, unless I'm missing something.
I'm going to experiment with it, I'd appreciate anyone's thoughts
on this.
Thanks,
Greg
_____________________________________________________________________
_
______________
It's here! Your new message!
Get new email alerts with the free Yahoo! Toolbar.
http://tools.search.yahoo.com/toolbar/features/mail/
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/
LuceneFAQ
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
______________________________________________________________________
______________
Need Mail bonding?
Go to the Yahoo! Mail Q&A for great tips from Yahoo! Answers users.
http://answers.yahoo.com/dir/?link=list&sid=396546091
------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]