Re: dealing with duplicates

Avlesh Singh Mon, 10 Aug 2009 21:38:48 -0700

Can you please provide your schema details here?

Cheers
Avlesh


On Tue, Aug 11, 2009 at 1:29 AM, Joe Calderon <calderon....@gmail.com> wrote:

> so in the case someone can help me with the query syntax, the
> relational query i would use for this would be something like:
>
> SELECT * FROM videos
> WHERE
> title LIKE 'family guy'
> AND desc LIKE 'stewie%'
> AND (
>  ( is_dup = 0 )
>  OR
>  ( is_dup = 1 AND id NOT IN
>    (
>    SELECT id FROM videos
>    WHERE
>    title LIKE 'family guy'
>    AND desc LIKE 'stewie%'
>    AND is_dup = 0
>    )
>  )
> )
> ORDER BY views
> LIMIT 10
>
> can a similar query be written in lucene or do i need to structure my
> index differently to be able to do such a query?
>
> thx much
>
> --joe
>
>
> On Sat, Aug 1, 2009 at 9:15 AM, Joe Calderon <calderon....@gmail.com> wrote:
> > hello, thanks for the response, i did take a look at that document but
> > in my application i actually want the duplicates, as i mentioned, the
> > matching text could be very different among cluster members, what
> > joins them together is a similar set of numeric features.
> >
> > currently i do a query with fq=duplicate:0 and show a link to
> > optionally show the "dupes" by querying for all dupes of the
> > master id, however im currently missing any documents that matched the
> > query but are duplicates of other masters not included in that result
> > set.
> >
> > in a relational database (fulltext indexing aside) i would use a
> > subquery, i imagine a similar approach could be used with lucene, i
> > just dont know the syntax
> >
> > best,
> >
> > --joe
> >
> > On Fri, Jul 31, 2009 at 11:32 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> >> Joe,
> >>
> >> Maybe we can take a step back first.  Would it be better if your index
> >> was cleaner and didn't have flagged duplicates in the first place?  If so,
> >> have you tried using http://wiki.apache.org/solr/Deduplication ?
> >>
> >>  Otis
> >> --
> >> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> >> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> >>
> >>
> >>
> >> ----- Original Message ----
> >>> From: Joe Calderon <calderon....@gmail.com>
> >>> To: solr-user@lucene.apache.org
> >>> Sent: Friday, July 31, 2009 5:06:48 PM
> >>> Subject: dealing with duplicates
> >>>
> >>> hello all, i have a collection of a few million documents; i have many
> >>> duplicates in this collection. they have been clustered with a simple
> >>> algorithm, i have a field called 'duplicate' which is 0 or 1 and
> >>> fields called 'description', 'tags' and 'meta'; documents are clustered on
> >>> different criteria and the text i search against could be very
> >>> different among members of a cluster.
> >>>
> >>> im currently using a dismax handler to search across the text fields
> >>> with different boosts, and a filter query to restrict to masters
> >>> (duplicate:0)
> >>>
> >>> my question is then, how do i best query for documents which are
> >>> masters OR match text but are not included in the matched set of
> >>> masters?
> >>>
> >>> does this make sense?
> >>
> >>
> >
>
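
A note on the SQL quoted above: as written, the inner SELECT returns only master ids, so comparing a duplicate's own id against it can never exclude anything. The sketch below assumes a hypothetical master_id column on each duplicate that points at its master (the thread mentions a "master id" but never names the field), and renames desc to description because DESC is an SQL keyword. It runs the adjusted query against an in-memory SQLite table with made-up rows, purely to illustrate the intended semantics. It is not a Solr solution.

```python
import sqlite3

# Adjusted form of the query from the thread. Assumptions (not in the
# original mail): a master_id column linking each duplicate to its master,
# and "description" in place of the reserved word "desc".
QUERY = """
SELECT id FROM videos
WHERE title LIKE 'family guy'
  AND description LIKE 'stewie%'
  AND (
    is_dup = 0
    OR (is_dup = 1 AND master_id NOT IN (
          SELECT id FROM videos
          WHERE title LIKE 'family guy'
            AND description LIKE 'stewie%'
            AND is_dup = 0))
  )
ORDER BY views
LIMIT 10
"""

# Made-up sample rows: (id, title, description, is_dup, master_id, views)
SAMPLE_ROWS = [
    (1, "family guy", "stewie sings", 0, None, 50),      # matching master
    (2, "family guy", "stewie sings again", 1, 1, 40),   # dup of matching master
    (3, "family guy", "peter falls", 0, None, 90),       # master, text does not match
    (4, "family guy", "stewie and brian", 1, 3, 70),     # dup whose master did not match
    (5, "american dad", "stewie cameo", 0, None, 10),    # wrong title
]


def build_demo_db():
    """Create an in-memory table holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE videos (id INTEGER PRIMARY KEY, title TEXT, "
        "description TEXT, is_dup INTEGER, master_id INTEGER, views INTEGER)"
    )
    conn.executemany("INSERT INTO videos VALUES (?, ?, ?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def find_videos(conn):
    """Return matching masters plus matching dups whose master did not match."""
    return [row[0] for row in conn.execute(QUERY)]


if __name__ == "__main__":
    print(find_videos(build_demo_db()))
```

With the sample rows this should return ids 1 and 4: the matching master, plus the one duplicate whose master is not in the matched set. Duplicate 2 is dropped because its master (1) already matched.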
