Alexander, I have two ideas for implementing fast dedupe externally, assuming your PKs don't fit into a java.util.*Map:
- your crawler can use an in-process RDBMS (Derby, H2) to track dupes;
- if your crawler is stateless - i.e. it doesn't track PKs which have already been crawled - you can retrieve them from Solr via http://wiki.apache.org/solr/TermsComponent (rough sketch below). That's blazingly fast, but removed documents might be a problem (I'm not sure), and it can also lead to an OutOfMemoryError if you have too many PKs.
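For the TermsComponent option, pulling the indexed PKs could look roughly like the sketch below. This is an untested SolrJ 3.x sketch, not a drop-in implementation: it assumes a /terms request handler with the TermsComponent is registered in solrconfig.xml, and pkField stands for whatever your <uniqueKey> field is.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.response.TermsResponse;

    public class IndexedKeys {
        /** Fetches every indexed PK so the crawler can skip documents it already sent. */
        public static Set<String> fetchIndexedKeys(SolrServer server, String pkField)
                throws SolrServerException {
            SolrQuery q = new SolrQuery();
            q.setQueryType("/terms");        // handler exposing the TermsComponent (assumed name)
            q.set("terms", "true");
            q.set("terms.fl", pkField);
            q.set("terms.limit", -1);        // all terms; this is where the OOM risk lives
            QueryResponse rsp = server.query(q);
            Set<String> keys = new HashSet<String>();
            List<TermsResponse.Term> terms = rsp.getTermsResponse().getTerms(pkField);
            if (terms != null) {
                for (TermsResponse.Term t : terms) {
                    keys.add(t.getTerm());
                }
            }
            return keys;
        }
    }

Keep the caveat above in mind: TermsComponent reads raw index terms, so PKs of deleted-but-not-yet-merged documents can still show up in the result.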
Let me know if you need a workaround for one of these problems.

If you choose internal dedupe (UpdateProcessor), please let me know if querying one-by-one turns out to be too slow for you and you need to do it page-by-page. I've done some paging like that, and will do something similar soon, so I'm interested in it.

Regards

On Thu, Dec 29, 2011 at 9:34 AM, Alexander Aristov <alexander.aris...@gmail.com> wrote:

> Unfortunately I have a lot of duplicates, and since searching might suffer
> I will try implementing an update processor.
>
> But your idea is interesting and I will consider it, thanks.
>
> Best Regards
> Alexander Aristov
>
>
> On 28 December 2011 19:12, Tanguy Moal <tanguy.m...@gmail.com> wrote:
>
> > Hello Alexander,
> >
> > I don't know much about your requirements in terms of size and
> > performance, but I've had a similar use case and found a pretty simple
> > workaround.
> > If your duplicate rate is not too high, you can have the
> > SignatureProcessor generate a fingerprint of each document (you already
> > did that).
> >
> > Simply turn off overwriting of duplicates; you can then rely on Solr's
> > grouping / field collapsing to group your search results by fingerprint.
> > You'll then have one document group per "real" document. You can use
> > group.sort to sort your groups by indexing date ascending, and
> > group.limit=1 to keep only the oldest one.
> > You can even use group.format=simple to serve results as if no
> > collapsing occurred, and use group.ngroups (/!\ could be expensive /!\) to
> > get the real number of deduplicated documents.
> >
> > Of course the index will be larger; as I said, I made no assumptions
> > regarding your operating requirements. And search can be a bit slower,
> > depending on the average rate of duplicated documents.
> > But you've got your issue addressed by configuration tuning only...
> > Depending on your project's sizing, it could be time-saving.
> >
> > The advantage is that you have the precious information of what content
> > is duplicated from where :-)
> >
> > Hope this helps,
> >
> > --
> > Tanguy
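(For illustration only: the collapsing setup Tanguy describes boils down to query parameters along these lines. The field names "signature" and "indexed_at" are placeholders for your fingerprint field and an indexing-date field.)

    http://localhost:8983/solr/select?q=*:*
      &group=true
      &group.field=signature
      &group.sort=indexed_at+asc
      &group.limit=1
      &group.format=simple
      &group.ngroups=true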
> > On 28/12/2011 15:45, Alexander Aristov wrote:
> >
> >> Thanks Erick,
> >>
> >> it gives me a direction. I will write a new plugin and will get back to
> >> the dev forum with results, and then we will decide on next steps.
> >>
> >> Best Regards
> >> Alexander Aristov
> >>
> >>
> >> On 28 December 2011 18:08, Erick Erickson <erickerick...@gmail.com> wrote:
> >>
> >>> Well, the short answer is that nobody else has
> >>> 1> had a similar requirement
> >>> AND
> >>> 2> not found a suitable workaround
> >>> AND
> >>> 3> implemented the change and contributed it back.
> >>>
> >>> So, if you'd like to volunteer <G>.....
> >>>
> >>> Seriously. If you think this would be valuable and are
> >>> willing to work on it, hop on over to the dev list and
> >>> discuss it, open a JIRA and make it work. I'd start
> >>> by opening a discussion on the dev list before
> >>> opening a JIRA, just to get a sense of where the
> >>> snags would be in changing the Solr code, but that's
> >>> optional.
> >>>
> >>> That said, writing your own update request processor
> >>> that detects this case isn't very difficult:
> >>> extend UpdateRequestProcessorFactory/UpdateRequestProcessor
> >>> and use it as a plugin.
> >>>
> >>> Best
> >>> Erick
> >>>
> >>> On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
> >>> <alexander.aris...@gmail.com> wrote:
> >>>
> >>>> the problem with dedupe (SignatureUpdateProcessor) is that it REPLACES
> >>>> old docs. I have tried it already.
> >>>>
> >>>> Best Regards
> >>>> Alexander Aristov
> >>>>
> >>>>
> >>>> On 28 December 2011 13:04, Lance Norskog <goks...@gmail.com> wrote:
> >>>>
> >>>>> The SignatureUpdateProcessor is for exactly this problem:
> >>>>>
> >>>>> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
> >>>>>
> >>>>> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
> >>>>> <alexander.aris...@gmail.com> wrote:
> >>>>>
> >>>>>> I get docs from external sources and the only place I keep them is the
> >>>>>> solr index. I have no database or other means to track indexed docs
> >>>>>> (my personal opinion is that it would be a huge headache).
> >>>>>>
> >>>>>> Some docs might change slightly in their original sources, but I don't
> >>>>>> need those changes. In fact I need the original data only.
> >>>>>>
> >>>>>> So I have no other way but to either check whether a document is already
> >>>>>> in the index before I put it into the solrj array (read - query solr), or
> >>>>>> develop my own update chain processor, implement the ID check there, and
> >>>>>> skip such docs.
> >>>>>>
> >>>>>> Maybe it's the wrong place to argue, and probably it's been discussed
> >>>>>> before, but I wonder why the simple overwrite parameter doesn't work here.
> >>>>>>
> >>>>>> My opinion is that it perfectly suits this case. In combination with the
> >>>>>> unique ID it can cover all possible variants.
> >>>>>>
> >>>>>> cases:
> >>>>>>
> >>>>>> 1. overwrite=true and uniqueID exists: the newer doc should overwrite the
> >>>>>> old one.
> >>>>>>
> >>>>>> 2. overwrite=false and uniqueID exists: the newer doc must be skipped
> >>>>>> since the old one exists.
> >>>>>>
> >>>>>> 3. uniqueID doesn't exist: the newer doc just gets added, regardless of
> >>>>>> whether an old one exists or not.
> >>>>>>
> >>>>>>
> >>>>>> Best Regards
> >>>>>> Alexander Aristov
> >>>>>>
> >>>>>>
> >>>>>> On 27 December 2011 22:53, Erick Erickson <erickerick...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Mikhail is right as far as I know; the assumption built into Solr is
> >>>>>>> that duplicate IDs (when <uniqueKey> is defined) should trigger the old
> >>>>>>> document to be replaced.
> >>>>>>>
> >>>>>>> What is your system-of-record? By that I mean what does your SolrJ
> >>>>>>> program do to send data to Solr? Is there any way you could just
> >>>>>>> *not* send documents that are already in the Solr index based on,
> >>>>>>> for instance, any timestamp associated with your system-of-record
> >>>>>>> and the last time you did an incremental index?
> >>>>>>>
> >>>>>>> Best
> >>>>>>> Erick
> >>>>>>>
> >>>>>>> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
> >>>>>>> <alexander.aris...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi
> >>>>>>>>
> >>>>>>>> I am not using a database. All needed data is in the solr index; that's
> >>>>>>>> why I want to skip excessive checks.
> >>>>>>>>
> >>>>>>>> I will check DIH but I'm not sure it helps.
> >>>>>>>>
> >>>>>>>> I am fluent in Java and it's not a problem for me to write a class or
> >>>>>>>> so, but I want to check first whether there are any ways (workarounds)
> >>>>>>>> to make it work without coding, just by playing around with
> >>>>>>>> configuration and params. I don't want to go away from the default solr
> >>>>>>>> implementation.
> >>>>>>>>
> >>>>>>>> Best Regards
> >>>>>>>> Alexander Aristov
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 27 December 2011 09:33, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
> >>>>>>>>
> >>>>>>>>> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
> >>>>>>>>> alexander.aris...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi people,
> >>>>>>>>>>
> >>>>>>>>>> I urgently need your help!
> >>>>>>>>>>
> >>>>>>>>>> I have solr 3.3 configured and running. I do incremental indexing 4
> >>>>>>>>>> times a day using bulk updates. Some documents are identical to some
> >>>>>>>>>> extent and I wish to skip them, not index them.
> >>>>>>>>>> But here is the problem: I could not find a way to tell solr to
> >>>>>>>>>> ignore new duplicate docs and keep the old indexed docs. I don't care
> >>>>>>>>>> that it's new. Just determine by ID that such a document is in the
> >>>>>>>>>> index already and that's it.
> >>>>>>>>>>
> >>>>>>>>>> I use solrj for indexing. I have tried setting overwrite=false and
> >>>>>>>>>> the dedupe approach but nothing helped me. I either have a newer doc
> >>>>>>>>>> overwrite the old one or I get a duplicate.
> >>>>>>>>>>
> >>>>>>>>>> I think it's a very simple and basic feature and it must exist. What
> >>>>>>>>>> did I do wrong or not do?
> >>>>>>>>>>
> >>>>>>>>> I guess it's because the mainstream approach is delta-import, where
> >>>>>>>>> you have "updated" timestamps in your DB and a "last-import" timestamp
> >>>>>>>>> stored somewhere. You can check how it works in DIH.
> >>>>>>>>>
> >>>>>>>>>> Tried google but I couldn't find a solution there, although many
> >>>>>>>>>> people have encountered this problem.
> >>>>>>>>>>
> >>>>>>>>> It can definitely be done by overriding
> >>>>>>>>> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I
> >>>>>>>>> suggest starting by implementing your own
> >>>>>>>>> http://wiki.apache.org/solr/UpdateRequestProcessor - search for the PK
> >>>>>>>>> and bypass the chain call if it's found. Then, if you run into
> >>>>>>>>> performance issues querying your PKs one by one (but only after that),
> >>>>>>>>> you can batch your searches; there are a couple of optimization
> >>>>>>>>> techniques for huge disjunction queries like PK:(2 OR 4 OR 5 OR 6).
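(To make the skip-if-exists processor concrete, below is a rough, untested sketch against the Solr 3.x plugin API. The class names are made up, and it assumes a plain string <uniqueKey> whose stored value equals its indexed term.)

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.search.SolrIndexSearcher;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class SkipExistingProcessorFactory extends UpdateRequestProcessorFactory {

        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                  UpdateRequestProcessor next) {
            return new SkipExistingProcessor(req, next);
        }

        static class SkipExistingProcessor extends UpdateRequestProcessor {
            private final SolrQueryRequest req;

            SkipExistingProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
                super(next);
                this.req = req;
            }

            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                String keyField = req.getSchema().getUniqueKeyField().getName();
                SolrInputDocument doc = cmd.solrDoc;          // the incoming document
                Object key = doc.getFieldValue(keyField);
                SolrIndexSearcher searcher = req.getSearcher();
                // one lookup per added document, through the filter cache;
                // fine for a sketch, not necessarily for millions of adds
                boolean exists = key != null
                    && searcher.getDocSet(new TermQuery(new Term(keyField, key.toString()))).size() > 0;
                if (!exists) {
                    super.processAdd(cmd);   // not in the index yet -> let it through the chain
                }
                // else: drop the duplicate silently, keeping the old document
            }
        }
    }

It would be wired into solrconfig.xml roughly like this (update.processor was the chain-selection parameter in the 3.x examples, if memory serves):

    <updateRequestProcessorChain name="skipexisting">
      <processor class="com.example.SkipExistingProcessorFactory"/>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

    <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
      <lst name="defaults">
        <str name="update.processor">skipexisting</str>
      </lst>
    </requestHandler>

Note that the searcher only sees committed documents, so duplicates arriving within a single uncommitted batch would still get through, and each add costs one lookup, which is where the one-by-one vs. batched querying trade-off above comes in.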
> >>>>>>>>>
> >>>>>>>>>> I am starting to consider that I must query the index to check
> >>>>>>>>>> whether a doc to be added is already in the index, and not add it to
> >>>>>>>>>> the array, but I have so many docs that I am afraid it's not a good
> >>>>>>>>>> solution.
> >>>>>>>>>>
> >>>>>>>>>> Best Regards
> >>>>>>>>>> Alexander Aristov
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Sincerely yours
> >>>>>>>>> Mikhail Khludnev
> >>>>>>>>> Lucid Certified
> >>>>>>>>> Apache Lucene/Solr Developer
> >>>>>>>>> Grid Dynamics
> >>>>>
> >>>>> --
> >>>>> Lance Norskog
> >>>>> goks...@gmail.com

--
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>