Matching on "owned" docs -- filter or query? Or sort?

Uncle Sun, 22 Jul 2012 08:07:03 -0700

I also posted this to StackOverflow, apologies if you see this twice.

I have a data set in which documents are associated with a user id. Say that the 
documents represent books, and each book can have one or more owners. I am 
indexing the titles with Lucene. When searching, I want all results owned by me 
to be sorted at the top of the results, before results that are not owned by me. 
So the data might look like:


Owner ID    Book Title
--------    ----------
13          To Have and To Have Not
14          To Have and To Have Not
19          To Have and To Have Not
18          Have a Little Faith
15          Snow Crash
17          Snow Crash
18          Cryptonomicon
14          Of Mice And Men
17          Flash Crash

Say that my user id is 14 and I search on "have". I want to match both "To 
Have and To Have Not" and "Have a Little Faith", but "To Have and To Have Not" 
should show up higher in my search results, because I own it.  Similarly, if I 
am user id 15 and search for "Crash", I will match both "Snow Crash" and 
"Flash Crash", but "Snow Crash" should show up first because I own it.  If I am 
user id 14 and I search for "crash", I would still get a match for "Snow Crash" 
even though I don't own it.  If I did a fuzzy match for "a", which would match 
almost all of these titles, I would see those that I own before I see the 
others.
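
To make the ordering I am after concrete, here is a minimal sketch in plain 
Java with no Lucene involved. It is only an illustration of the desired 
behaviour, not a proposed implementation: case-insensitive substring matching 
stands in for the real title query, and the names OwnedFirstSearch, Book, 
sampleData and search are made up for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class OwnedFirstSearch {

    /** One indexed row: an owner id paired with a book title. */
    static final class Book {
        final int ownerId;
        final String title;

        Book(int ownerId, String title) {
            this.ownerId = ownerId;
            this.title = title;
        }
    }

    /** The sample rows from the table above, in the same order. */
    static List<Book> sampleData() {
        List<Book> books = new ArrayList<>();
        books.add(new Book(13, "To Have and To Have Not"));
        books.add(new Book(14, "To Have and To Have Not"));
        books.add(new Book(19, "To Have and To Have Not"));
        books.add(new Book(18, "Have a Little Faith"));
        books.add(new Book(15, "Snow Crash"));
        books.add(new Book(17, "Snow Crash"));
        books.add(new Book(18, "Cryptonomicon"));
        books.add(new Book(14, "Of Mice And Men"));
        books.add(new Book(17, "Flash Crash"));
        return books;
    }

    /**
     * Returns "ownerId: title" for every row whose title contains the term,
     * with rows owned by userId placed before all other rows.
     */
    static List<String> search(List<Book> books, int userId, String term) {
        String needle = term.toLowerCase(Locale.ROOT);

        // Stand-in for the title query: keep rows whose title contains the term.
        List<Book> hits = new ArrayList<>();
        for (Book b : books) {
            if (b.title.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(b);
            }
        }

        // Sort key is "not owned by me": false sorts before true, so my rows
        // come first. List.sort is stable, so ties keep their original order.
        hits.sort(Comparator.comparing((Book b) -> b.ownerId != userId));

        List<String> out = new ArrayList<>();
        for (Book b : hits) {
            out.add(b.ownerId + ": " + b.title);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : search(sampleData(), 14, "have")) {
            System.out.println(line);
        }
    }
}
```

With the sample data, search(sampleData(), 14, "have") puts the copy of "To 
Have and To Have Not" owned by 14 first, followed by the three rows I don't 
own in their original order. Note that this sketch does not remove duplicate 
titles; it only shows the owned-first ordering.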

I am a little stuck on whether this is a query, filter, custom sort, or some 
combination, and how to get the best performance.  For example, if I could 
write a filter that eliminates all duplicate titles, giving preference to those 
owned by me, I could then just perform a search on the remainder (assuming that 
filters are applied before searches). Then, a custom sort based on whether or 
not I own the doc would be straightforward.

But I am not sure how to implement the filter. It is not a simple 
DuplicateFilter because it operates on two fields. It is similar to the 
security filter example in section 5.6.7 of Lucene in Action, except that I 
still want to be able to see documents that I don't own, if I don't own a book 
with the same title. The custom filter in section 6.4 is also close, but my 
problem is more complex because it depends on two fields.

While iterating over the documents, the filter would have to remember which 
titles have been seen, and then keep the ones that I own. For example, if it 
iterated over the values above in order, it would see the title "To Have and To 
Have Not", not owned by me; and then see the same title again, owned by me, and 
have to know that it should drop the first doc and keep the second. I can't 
think of how to do this without using a lot of memory, essentially keeping all 
titles in memory while iterating, which seems very expensive. It isn't a simple 
"match" function because whether or not I match depends on the other documents 
in the set.
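
The single pass described above could be sketched roughly as follows, again in 
plain Java rather than as a real Lucene Filter. It is a hedged illustration of 
the logic only: the names PreferOwnedDedup, Book, sampleData and 
dedupePreferOwned are invented for this example, and it assumes titles can be 
compared as exact strings. It also shows where the memory goes: the map holds 
one entry per distinct title seen so far.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PreferOwnedDedup {

    /** One indexed row: an owner id paired with a book title. */
    static final class Book {
        final int ownerId;
        final String title;

        Book(int ownerId, String title) {
            this.ownerId = ownerId;
            this.title = title;
        }
    }

    /** The sample rows from the table above, in the same order. */
    static List<Book> sampleData() {
        List<Book> books = new ArrayList<>();
        books.add(new Book(13, "To Have and To Have Not"));
        books.add(new Book(14, "To Have and To Have Not"));
        books.add(new Book(19, "To Have and To Have Not"));
        books.add(new Book(18, "Have a Little Faith"));
        books.add(new Book(15, "Snow Crash"));
        books.add(new Book(17, "Snow Crash"));
        books.add(new Book(18, "Cryptonomicon"));
        books.add(new Book(14, "Of Mice And Men"));
        books.add(new Book(17, "Flash Crash"));
        return books;
    }

    /**
     * Keeps exactly one row per title. The first row seen for a title is kept
     * unless a later row with the same title is owned by userId, in which case
     * the owned row replaces it.
     */
    static List<Book> dedupePreferOwned(List<Book> books, int userId) {
        // One entry per distinct title: this map is the memory cost of the pass.
        // LinkedHashMap keeps titles in first-seen order, even when a value is replaced.
        Map<String, Book> keptByTitle = new LinkedHashMap<>();
        for (Book b : books) {
            Book kept = keptByTitle.get(b.title);
            boolean firstSighting = (kept == null);
            boolean ownedReplacesUnowned =
                    !firstSighting && kept.ownerId != userId && b.ownerId == userId;
            if (firstSighting || ownedReplacesUnowned) {
                keptByTitle.put(b.title, b);
            }
        }
        return new ArrayList<>(keptByTitle.values());
    }

    public static void main(String[] args) {
        for (Book b : dedupePreferOwned(sampleData(), 14)) {
            System.out.println(b.ownerId + ": " + b.title);
        }
    }
}
```

For user id 14 this drops the row for "To Have and To Have Not" owned by 13 as 
soon as the row owned by 14 is seen, and keeps the first-seen row for titles I 
don't own, such as "Snow Crash".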

Thanks much for any guidance or info.
