Ok sorry for not explaining my problem clearly earlier. We have around 5 fields in each document. ID, ISBN, author, title and the category which this book falls under. ( You are right about point 3, we are indeed storing multiple genre against the book, which means 1 book 1 doc.)
doc.add(new Field("entityId", book.get("id"), Field.Store.YES, Field.Index.NO)); doc.add(new Field("author", book.get("author"), Field.Store.NO, Field.Index.TOKENIZED)); etc etc... and this document is added using the IndexWriter. and when a search is issued we search for the search term in title/author/isbn/category....based on some inputs... then a set of books are returned( you are right about point 2 as well... since we search only on title/author/genre, we were only indexing those ). The way we wanted these books to be laid out to the user was such that he can navigate through the categories, which the books he searched for belong to, to kind of being able to narrow the search. While the count of books, for the given search term, under a particular category was correct, the overall count of the books were incorrect because of some books being repeated in various categories. For this reason, we wanted a duplicate filter on the ID which would give us only the unique books... and there was something wrong in the way it was implemented... the ID in the document was not indexed as you can see in the above code. When this was fixed it worked as expected...but for some performance issues.. because of the huge index sizes ( 3 million books ). Anyway looks like we have figured the solution ( moved the filter out of the search.. applied it on the result or something like that ) Thanks so much for ur time. -Anisha Anshum-2 wrote: > > Hi Anish, > So am I getting something wrong here? You said "I have created a search > index on book Id , title ,and author from a database of books which fall > under various categories." so those are 3 fields, right? > 1. How do you filter the doc types (as in the genres) at search time? Do > you > even need to do that, if yes how? > 2. If you're doing that 'm assuming you're already indexing the genre > somehow. Right? > 3. How about a field for the genre having multi-valued entries (multiple > field objects going into the same doc with the same field label). This > would > help you store 1 doc as 1 doc having multiple genres instead of duplicate > entries. > > I'm still not sure if I've gotten tre problem correctly, but hope this is > of > help! > > -- > Anshum Gupta > Naukri Labs! > http://ai-cafe.blogspot.com > > The facts expressed here belong to everybody, the opinions to me. The > distinction is yours to draw............ > > > On Fri, Mar 5, 2010 at 12:07 PM, ani...@ekkitab <ani...@ekkitab.com> > wrote: > >> >> Hi Zhangchi >> >> >> Thanks for your reply. >> >> We have about 3 million records (different isbns) in the database and >> documents little more than that, and we wouldn't want to do the deduping >> at >> indexing time, because one book ( one isbn ) can be available under 2 or >> more categories( like fiction, comics & novels, science etc) >> >> We had actually applied filter on the primary key ie ID, and it wasn't >> working, so I was hoping for some sample code. But then we found out that >> the field name on which we wanted the duplicate filter to be applied (Id) >> was not actually indexed while adding it into the document. ie >> Field.Index >> was set to NO. We changed this, repopulated the documents and the >> filtering >> works now. >> >> Thanks for your time. >> >> >> >> >> zhangchi wrote: >> > >> > >> > i think you should check the index first.using the lukeall to see if >> there >> > is the duplicate books. >> > >> > On Thu, 04 Mar 2010 20:43:26 +0800, ani...@ekkitab <ani...@ekkitab.com> >> > wrote: >> > >> >> >> >> Hi there, Could someone help me with the usage of DuplicateFilters. >> Here >> >> is >> >> my problem >> >> >> >> I have created a search index on book Id , title ,and author from a >> >> database >> >> of books which fall under various categories. Some books fall under >> more >> >> than one category. Now, when i issue a search, I get back 'X' books >> >> matching >> >> the search criteria, some of which are repeated, because that books >> are >> >> in >> >> different documents and its the expected behaviour. >> >> >> >> I use the TopFieldDocCollector . getTotalHits() to get the total >> count. >> >> But >> >> this includes the repeats as mentioned above. This count is not the >> >> actual >> >> count, Hence when I issue a search on title or author i want to get a >> >> unique >> >> count / list of books. How do I use DuplicateFilter to acheive this. >> >> >> >> Please help >> >> >> >> Regards >> >> Anish >> > >> > >> > -- >> > Using Opera's revolutionary e-mail client: http://www.opera.com/mail/ >> > >> > >> > --------------------------------------------------------------------- >> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> > For additional commands, e-mail: java-user-h...@lucene.apache.org >> > >> > >> > >> >> -- >> View this message in context: >> http://old.nabble.com/how-to-use-DuplicateFilter-to-get-unique-documents-based-on-a-fieldName-tp27780251p27790391.html >> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > -- View this message in context: http://old.nabble.com/how-to-use-DuplicateFilter-to-get-unique-documents-based-on-a-fieldName-tp27780251p27793771.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org