Can you please provide your schema details here?

Cheers
Avlesh

On Tue, Aug 11, 2009 at 1:29 AM, Joe Calderon <calderon....@gmail.com> wrote:

> so in the case someone can help me with the query syntax, the
> relational query i would use for this would be something like:
>
> SELECT * FROM videos
> WHERE
>   title LIKE 'family guy'
>   AND desc LIKE 'stewie%'
>   AND (
>     ( is_dup = 0 )
>     OR
>     ( is_dup = 1 AND id NOT IN
>       (
>         SELECT id FROM videos
>         WHERE
>           title LIKE 'family guy'
>           AND desc LIKE 'stewie%'
>           AND is_dup = 0
>       )
>     )
>   )
> ORDER BY views
> LIMIT 10
>
> can a similar query be written in lucene or do i need to structure my
> index differently to be able to do such a query?
>
> thx much
>
> --joe
>
>
> On Sat, Aug 1, 2009 at 9:15 AM, Joe Calderon <calderon....@gmail.com>
> wrote:
> > hello, thanks for the response, i did take a look at that document but
> > in my application i actually want the duplicates, as i mentioned, the
> > matching text could be very different among cluster members, what
> > joins them together is a similar set of numeric features.
> >
> > currently i do a query with fq=duplicate:0 and show a link to
> > optionally show the "dupes" by querying for all dupes of the
> > master id, however im currently missing any documents that matched the
> > query but are duplicates of other masters not included in that result
> > set.
> >
> > in a relational database (fulltext indexing aside) i would use a
> > subquery, i imagine a similar approach could be used with lucene, i
> > just dont know the syntax
> >
> > best,
> >
> > --joe
> >
> > On Fri, Jul 31, 2009 at 11:32 PM, Otis
> > Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> >> Joe,
> >>
> >> Maybe we can take a step back first. Would it be better if your index
> >> was cleaner and didn't have flagged duplicates in the first place? If so,
> >> have you tried using http://wiki.apache.org/solr/Deduplication ?
> >>
> >> Otis
> >> --
> >> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> >> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> >>
> >>
> >> ----- Original Message ----
> >>> From: Joe Calderon <calderon....@gmail.com>
> >>> To: solr-user@lucene.apache.org
> >>> Sent: Friday, July 31, 2009 5:06:48 PM
> >>> Subject: dealing with duplicates
> >>>
> >>> hello all, i have a collection of a few million documents; i have many
> >>> duplicates in this collection. they have been clustered with a simple
> >>> algorithm, i have a field called 'duplicate' which is 0 or 1 and
> >>> fields called 'description, tags, meta', documents are clustered on
> >>> different criteria and the text i search against could be very
> >>> different among members of a cluster.
> >>>
> >>> im currently using a dismax handler to search across the text fields
> >>> with different boosts, and a filter query to restrict to masters
> >>> (duplicate: 0)
> >>>
> >>> my question is then, how do i best query for documents which are
> >>> masters OR match text but are not included in the matched set of
> >>> masters?
> >>>
> >>> does this make sense?
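
For reference, the selection Joe describes (matching masters, plus matching duplicates whose master is not itself in the matched set) can be sketched client-side in Python. This only illustrates the intended semantics and is not a Lucene or Solr query. It assumes each duplicate carries a `master_id` field pointing at its master (the thread mentions a "master id" but never names the field); `matches` stands in for the text criteria, and the function name is made up for this sketch.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]


def masters_plus_orphan_dupes(
    docs: List[Doc],
    matches: Callable[[Doc], bool],
    limit: int = 10,
) -> List[Doc]:
    """Return matching masters plus matching dupes whose master did not match.

    Assumes each doc has 'id', 'is_dup' (0 or 1), 'views', and, for
    duplicates, a hypothetical 'master_id' field naming its master.
    """
    # Pass 1: everything that satisfies the text criteria.
    matched = [d for d in docs if matches(d)]

    # Ids of the masters that made it into the matched set.
    matched_master_ids = {d["id"] for d in matched if d["is_dup"] == 0}

    # Pass 2: keep every matched master, and keep a matched duplicate
    # only when its master is absent from the matched set.
    keep = [
        d
        for d in matched
        if d["is_dup"] == 0 or d["master_id"] not in matched_master_ids
    ]

    # Ascending by views, mirroring "ORDER BY views" in the SQL as written.
    keep.sort(key=lambda d: d["views"])
    return keep[:limit]
```

Against a live index this would probably mean two requests (one for the matching masters, one for the matching duplicates) merged in the application, so whether it scales to a few million documents depends on how large the matched set gets.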