Alexander,

I have two ideas for implementing fast dedupe externally, assuming your PKs
don't fit into a java.util.*Map:

   - your crawler can use an in-process RDBMS (Derby, H2) to track dupes;
   - if your crawler is stateless, i.e. it doesn't track which PKs have
   already been crawled, you can retrieve them from Solr via
   http://wiki.apache.org/solr/TermsComponent (see the sketch after this
   list). That's blazingly fast, but there might be a problem with removed
   documents (I'm not sure), and it can also lead to an OutOfMemoryError
   (if you have too many PKs). Let me know if you need a workaround for
   either of these problems.
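
For illustration, a minimal SolrJ sketch of pulling all indexed PKs via the
TermsComponent. It assumes a /terms handler is registered in solrconfig.xml
and that the PK field is named "id" - adjust both to your setup:

  import java.util.HashSet;
  import java.util.Set;

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.client.solrj.response.TermsResponse;

  public class IndexedPks {
      public static void main(String[] args) throws Exception {
          // 3.x-era client class; newer SolrJ versions use a different impl
          SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

          SolrQuery query = new SolrQuery();
          query.setQueryType("/terms");     // assumes a /terms handler is configured
          query.set("terms", true);
          query.set("terms.fl", "id");      // assumed PK field name
          query.set("terms.limit", -1);     // fetch ALL terms - this is the OOM risk

          QueryResponse rsp = solr.query(query);
          TermsResponse terms = rsp.getTermsResponse();

          // Every PK already in the index; the crawler can skip anything found here.
          Set<String> seen = new HashSet<String>();
          for (TermsResponse.Term t : terms.getTerms("id")) {
              seen.add(t.getTerm());
          }
          System.out.println("indexed PKs: " + seen.size());
      }
  }

(The removed-documents caveat: terms belonging to deleted docs can still be
reported until the segments holding them are merged away.)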

If you choose internal dedupe (an UpdateProcessor), please let me know if
querying one-by-one turns out to be too slow for you and you need to do it
page-by-page. I've done some paging like that, and will do something
similar soon, so I'm interested in it.
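
For the page-by-page variant, a rough sketch of what I mean: collect a
batch of candidate PKs and probe the index with one disjunction query per
batch. The "id" field name and the page size are again assumptions:

  import java.util.HashSet;
  import java.util.List;
  import java.util.Set;

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;

  public class BatchedPkCheck {
      /** Returns the subset of candidate PKs that already exist in the index. */
      public static Set<String> findExisting(SolrServer solr, List<String> pks)
              throws Exception {
          Set<String> existing = new HashSet<String>();
          int pageSize = 100; // tune to your setup
          for (int from = 0; from < pks.size(); from += pageSize) {
              List<String> page = pks.subList(from, Math.min(from + pageSize, pks.size()));
              // build a disjunction like id:("a" OR "b" OR ...)
              // (PK values are not escaped here; do that if they can contain quotes)
              StringBuilder q = new StringBuilder("id:(");
              for (int i = 0; i < page.size(); i++) {
                  if (i > 0) q.append(" OR ");
                  q.append('"').append(page.get(i)).append('"');
              }
              q.append(')');
              SolrQuery query = new SolrQuery(q.toString());
              query.setFields("id");
              query.setRows(page.size());
              QueryResponse rsp = solr.query(query);
              for (SolrDocument doc : rsp.getResults()) {
                  existing.add(doc.getFieldValue("id").toString());
              }
          }
          return existing;
      }
  }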

Regards

On Thu, Dec 29, 2011 at 9:34 AM, Alexander Aristov <alexander.aris...@gmail.com> wrote:

> Unfortunately I have a lot of duplicates, and given that searching might
> suffer, I will try implementing an update processor.
>
> But your idea is interesting and I will consider it, thanks.
>
> Best Regards
> Alexander Aristov
>
>
> On 28 December 2011 19:12, Tanguy Moal <tanguy.m...@gmail.com> wrote:
>
> > Hello Alexander,
> >
> > I don't know much about your requirements in terms of size and
> > performance, but I've had a similar use case and found a pretty simple
> > workaround.
> > If your duplicate rate is not too high, you can have the
> > SignatureProcessor generate a fingerprint for each document (you already
> > did that).
> >
> > Simply turn off overwriting of duplicates; you can then rely on Solr's
> > grouping / field collapsing to group your search results by fingerprint.
> > You'll then have one document group per "real" document. You can use
> > group.sort to sort each group by indexing date ascending, and
> > group.limit=1 to keep only the oldest one.
> > You can even use group.format=simple to serve results as if no
> > collapsing occurred, and use group.ngroups (/!\ could be expensive /!\)
> > to get the real number of deduplicated documents.
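> >
> > For illustration, the solrconfig.xml side could look roughly like the
> > chain on the Deduplication wiki page, with overwriting turned off (the
> > "fields" list is just an example):
> >
> >   <updateRequestProcessorChain name="dedupe">
> >     <processor class="solr.processor.SignatureUpdateProcessorFactory">
> >       <bool name="enabled">true</bool>
> >       <str name="signatureField">signature</str>
> >       <bool name="overwriteDupes">false</bool>
> >       <str name="fields">name,features,cat</str>
> >       <str name="signatureClass">solr.processor.Lookup3Signature</str>
> >     </processor>
> >     <processor class="solr.LogUpdateProcessorFactory" />
> >     <processor class="solr.RunUpdateProcessorFactory" />
> >   </updateRequestProcessorChain>
> >
> > The query-time grouping parameters would then be along these lines
> > ("tstamp" as the indexing-date field is an assumption):
> >
> >   group=true
> >   group.field=signature
> >   group.sort=tstamp asc
> >   group.limit=1
> >   group.format=simple
> >   group.ngroups=true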
> >
> > Of course the index will be larger; as I said, I made no assumptions
> > regarding your operating requirements. And search can be a bit slower,
> > depending on the average rate of duplicated documents.
> > But your issue gets addressed by configuration tuning only...
> > Depending on your project's sizing, that could be time-saving.
> >
> > The advantage is that you have the precious information of what content
> > is duplicated from where :-)
> >
> > Hope this helps,
> >
> > --
> > Tanguy
> >
> > On 28/12/2011 15:45, Alexander Aristov wrote:
> >
> >> Thanks Erick,
> >>
> >> That gives me a direction. I will write a new plugin, get back to the
> >> dev forum with results, and then we will decide on next steps.
> >>
> >> Best Regards
> >> Alexander Aristov
> >>
> >>
> >> On 28 December 2011 18:08, Erick Erickson <erickerick...@gmail.com> wrote:
> >>
> >>> Well, the short answer is that nobody else has
> >>> 1> had a similar requirement
> >>> AND
> >>> 2> not found a suitable workaround
> >>> AND
> >>> 3> implemented the change and contributed it back.
> >>>
> >>> So, if you'd like to volunteer <G>.....
> >>>
> >>> Seriously. If you think this would be valuable and are
> >>> willing to work on it, hop on over to the dev list and
> >>> discuss it, open a JIRA and make it work. I'd start
> >>> by opening a discussion on the dev list before
> >>> opening a JIRA, just to get a sense of where the
> >>> snags in changing the Solr code would be, but that's
> >>> optional.
> >>>
> >>> That said, writing your own update request processor
> >>> that detects this case isn't very difficult:
> >>> extend UpdateRequestProcessorFactory/UpdateRequestProcessor
> >>> and use it as a plugin.
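> >>>
> >>> A minimal sketch of such a processor, against the 3.x API (the
> >>> uniqueKey lookup and the silent skip are the whole trick; note the
> >>> searcher won't see uncommitted docs from the same batch):
> >>>
> >>>   import java.io.IOException;
> >>>
> >>>   import org.apache.lucene.index.Term;
> >>>   import org.apache.solr.request.SolrQueryRequest;
> >>>   import org.apache.solr.response.SolrQueryResponse;
> >>>   import org.apache.solr.schema.SchemaField;
> >>>   import org.apache.solr.update.AddUpdateCommand;
> >>>   import org.apache.solr.update.processor.UpdateRequestProcessor;
> >>>   import org.apache.solr.update.processor.UpdateRequestProcessorFactory;
> >>>
> >>>   public class SkipExistingFactory extends UpdateRequestProcessorFactory {
> >>>     @Override
> >>>     public UpdateRequestProcessor getInstance(final SolrQueryRequest req,
> >>>         SolrQueryResponse rsp, UpdateRequestProcessor next) {
> >>>       return new UpdateRequestProcessor(next) {
> >>>         @Override
> >>>         public void processAdd(AddUpdateCommand cmd) throws IOException {
> >>>           SchemaField key = req.getSchema().getUniqueKeyField();
> >>>           Object id = cmd.getSolrInputDocument().getFieldValue(key.getName());
> >>>           // If a doc with this ID is already indexed, stop the chain
> >>>           // here, i.e. silently skip the new document.
> >>>           if (id != null && req.getSearcher()
> >>>               .getFirstMatch(new Term(key.getName(), id.toString())) != -1) {
> >>>             return;
> >>>           }
> >>>           super.processAdd(cmd); // not a duplicate - continue the chain
> >>>         }
> >>>       };
> >>>     }
> >>>   }
> >>>
> >>> Wire the factory into an updateRequestProcessorChain in solrconfig.xml
> >>> and point your update handler at that chain.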
> >>>
> >>> Best
> >>> Erick
> >>>
> >>> On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
> >>> <alexander.aris...@gmail.com>  wrote:
> >>>
> >>>> the problem with dedupe (SignatureUpdateProcessor) is that it
> >>>> REPLACES old docs. I have tried it already.
> >>>>
> >>>> Best Regards
> >>>> Alexander Aristov
> >>>>
> >>>>
> >>>> On 28 December 2011 13:04, Lance Norskog <goks...@gmail.com> wrote:
> >>>>
> >>>>> The SignatureUpdateProcessor is for exactly this problem:
> >>>>>
> >>>>> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
> >>>>>
> >>>>> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
> >>>>> <alexander.aris...@gmail.com> wrote:
> >>>>>
> >>>>>> I get docs from external sources and the only place I keep them is
> >>>>>> the Solr index. I have no database or other means to track indexed
> >>>>>> docs (my personal opinion is that it might be a huge headache).
> >>>>>>
> >>>>>> Some docs might change slightly in their original sources, but I
> >>>>>> don't need those changes. In fact, I need the original data only.
> >>>>>>
> >>>>>> So I have no other way but to either check whether a document is
> >>>>>> already in the index before I put it into the SolrJ array (read:
> >>>>>> query Solr), or develop my own update chain processor, implement an
> >>>>>> ID check there, and skip such docs.
> >>>>>>
> >>>>>> Maybe it's the wrong place to argue, and probably it's been
> >>>>>> discussed before, but I wonder why the simple overwrite parameter
> >>>>>> doesn't work here.
> >>>>>>
> >>>>>> In my opinion it suits perfectly here. In combination with a unique
> >>>>>> ID it can cover all possible variants.
> >>>>>>
> >>>>>> Cases:
> >>>>>>
> >>>>>> 1. overwrite=true and uniqueID exists: the newer doc should
> >>>>>> overwrite the old one.
> >>>>>>
> >>>>>> 2. overwrite=false and uniqueID exists: the newer doc must be
> >>>>>> skipped since the old one exists.
> >>>>>>
> >>>>>> 3. uniqueID doesn't exist: the newer doc just gets added regardless
> >>>>>> of whether an old one exists or not.
> >>>>>>
> >>>>>> Best Regards
> >>>>>> Alexander Aristov
> >>>>>>
> >>>>>>
> >>>>>> On 27 December 2011 22:53, Erick Erickson <erickerick...@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Mikhail is right as far as I know: the assumption built into Solr
> >>>>>>> is that duplicate IDs (when <uniqueKey> is defined) should trigger
> >>>>>>> the old document being replaced.
> >>>>>>>
> >>>>>>> What is your system-of-record? By that I mean: what does your SolrJ
> >>>>>>> program do to send data to Solr? Is there any way you could just
> >>>>>>> *not* send documents that are already in the Solr index, based on,
> >>>>>>> for instance, a timestamp associated with your system-of-record
> >>>>>>> and the last time you did an incremental index?
> >>>>>>>
> >>>>>>> Best
> >>>>>>> Erick
> >>>>>>>
> >>>>>>> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
> >>>>>>> <alexander.aris...@gmail.com>  wrote:
> >>>>>>>
> >>>>>>>> Hi
> >>>>>>>>
> >>>>>>>> I am not using a database. All the needed data is in the Solr
> >>>>>>>> index; that's why I want to skip excessive checks.
> >>>>>>>>
> >>>>>>>> I will check DIH, but I'm not sure if it helps.
> >>>>>>>>
> >>>>>>>> I am fluent in Java and it's not a problem for me to write a
> >>>>>>>> class or so, but I want to check first whether there are any ways
> >>>>>>>> (workarounds) to make it work without coding, just by playing
> >>>>>>>> around with configuration and params. I don't want to go away
> >>>>>>>> from the default Solr implementation.
> >>>>>>>>
> >>>>>>>> Best Regards
> >>>>>>>> Alexander Aristov
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 27 December 2011 09:33, Mikhail Khludnev
> >>>>>>>> <mkhlud...@griddynamics.com> wrote:
> >>>>>>>>
> >>>>>>>>> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov
> >>>>>>>>> <alexander.aris...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi people,
> >>>>>>>>>>
> >>>>>>>>>> I urgently need your help!
> >>>>>>>>>>
> >>>>>>>>>> I have Solr 3.3 configured and running. I do incremental
> >>>>>>>>>> indexing 4 times a day using bulk updates. Some documents are
> >>>>>>>>>> identical to some extent, and I wish to skip them, not index
> >>>>>>>>>> them.
> >>>>>>>>>> But here is the problem: I could not find a way to tell Solr to
> >>>>>>>>>> ignore new duplicate docs and keep the old indexed docs. I
> >>>>>>>>>> don't care that a doc is newer. Just determine by ID that such
> >>>>>>>>>> a document is already in the index, and that's it.
> >>>>>>>>>>
> >>>>>>>>>> I use SolrJ for indexing. I have tried setting overwrite=false
> >>>>>>>>>> and the dedupe approach, but nothing helped me. Either a newer
> >>>>>>>>>> doc overwrites the old one, or I get a duplicate.
> >>>>>>>>>>
> >>>>>>>>>> I think it's a very simple and basic feature, and it must
> >>>>>>>>>> exist. What did I do wrong, or fail to do?
> >>>>>>>>>>
> >>>>>>>>> I guess that's because the mainstream approach is delta-import,
> >>>>>>>>> where you have "updated" timestamps in your DB and a
> >>>>>>>>> "last-import" timestamp stored somewhere. You can check how it
> >>>>>>>>> works in DIH.
> >>>>>>>>>
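> >>>>>>>>> For reference, the relevant piece of a DIH data-config is the
> >>>>>>>>> deltaQuery (table and column names below are made up):
> >>>>>>>>>
> >>>>>>>>>   <entity name="doc" query="SELECT * FROM docs"
> >>>>>>>>>           deltaQuery="SELECT id FROM docs WHERE updated >
> >>>>>>>>>                       '${dataimporter.last_index_time}'"/>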
> >>>>>>>>>
> >>>>>>>>>> I tried Google but couldn't find a solution there, although
> >>>>>>>>>> many people have encountered this problem.
> >>>>>>>>>>
> >>>>>>>>> It definitely can be done by overriding
> >>>>>>>>> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I
> >>>>>>>>> suggest starting by implementing your own
> >>>>>>>>> http://wiki.apache.org/solr/UpdateRequestProcessor - search for
> >>>>>>>>> the PK and bypass the chain call if it's found. Then, if you hit
> >>>>>>>>> performance issues querying your PKs one by one (but only after
> >>>>>>>>> that), you can batch your searches; there are a couple of
> >>>>>>>>> optimization techniques for huge disjunction queries like
> >>>>>>>>> PK:(2 OR 4 OR 5 OR 6).
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> I'm starting to consider that I must query the index to check
> >>>>>>>>>> whether a doc to be added is already in the index, and not add
> >>>>>>>>>> it to the array, but I have so many docs that I'm afraid it's
> >>>>>>>>>> not a good solution.
> >>>>>>>>>>
> >>>>>>>>>> Best Regards
> >>>>>>>>>> Alexander Aristov
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Sincerely yours
> >>>>>>>>> Mikhail Khludnev
> >>>>>>>>> Lucid Certified
> >>>>>>>>> Apache Lucene/Solr Developer
> >>>>>>>>> Grid Dynamics
> >>>>>>>>>
> >>>>>>>>>
> >>>>>
> >>>>> --
> >>>>> Lance Norskog
> >>>>> goks...@gmail.com
> >>>>>
> >>>>>
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
 <mkhlud...@griddynamics.com>
