Mikhail is right as far as I know; the assumption built into Solr is that duplicate IDs (when <uniqueKey> is defined) should trigger replacement of the old document.
What is your system-of-record? By that I mean, what does your SolrJ program do to send data to Solr? Is there any way you could just *not* send documents that are already in the Solr index, based, for instance, on a timestamp associated with your system-of-record and the last time you did an incremental index?

Best
Erick

On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
<alexander.aris...@gmail.com> wrote:
> Hi
>
> I am not using a database. All needed data is in the Solr index; that's
> why I want to skip excessive checks.
>
> I will check DIH, but I am not sure it helps.
>
> I am fluent in Java and it's not a problem for me to write a class or so,
> but I want to check first whether there are any ways (workarounds) to make
> it work without coding, just by playing around with configuration and
> params. I don't want to go away from the default Solr implementation.
>
> Best Regards
> Alexander Aristov
>
>
> On 27 December 2011 09:33, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
>
>> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
>> alexander.aris...@gmail.com> wrote:
>>
>> > Hi people,
>> >
>> > I urgently need your help!
>> >
>> > I have Solr 3.3 configured and running. I do incremental indexing 4
>> > times a day using bulk updates. Some documents are identical to some
>> > extent and I wish to skip them, not index them.
>> > But here is the problem: I could not find a way to tell Solr to ignore
>> > new duplicate docs and keep the old indexed docs. I don't care that a
>> > doc is newer. Just determine by ID that such a document is in the index
>> > already, and that's it.
>> >
>> > I use SolrJ for indexing. I have tried setting overwrite=false and the
>> > dedupe approach, but nothing helped me. I either have a newer doc
>> > overwrite the old one or I get duplicates.
>> >
>> > I think it's a very simple and basic feature and it must exist. What
>> > did I do wrong or not do?
>> >
>>
>> I guess because the mainstream approach is delta-import, where you have
>> "updated" timestamps in your DB and a "last-import" timestamp stored
>> somewhere. You can check how it works in DIH.
>>
>>
>> >
>> > I tried Google but couldn't find a solution there, although many
>> > people have encountered this problem.
>> >
>> >
>> It can definitely be done by overriding
>> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I suggest
>> starting by implementing your own
>> http://wiki.apache.org/solr/UpdateRequestProcessor - search for the PK,
>> and bypass the chain call if it's found. Then, if you hit performance
>> issues querying your PKs one by one (but only after that), you can batch
>> your searches; there are a couple of optimization techniques for huge
>> disjunction queries like PK:(2 OR 4 OR 5 OR 6).
>>
>>
>> > I am starting to consider that I must query the index to check whether
>> > a doc to be added is in the index already, and not add it to the array,
>> > but I have so many docs that I am afraid it's not a good solution.
>> >
>> > Best Regards
>> > Alexander Aristov
>> >
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Lucid Certified
>> Apache Lucene/Solr Developer
>> Grid Dynamics
>>
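[Editor's note] Mikhail's UpdateRequestProcessor suggestion can be sketched roughly as below. This is a hedged, untested sketch against the Solr 3.x server-side API, not a drop-in implementation: the class names are invented for illustration, and the key call assumed here is `SolrIndexSearcher.getFirstMatch(Term)`, which returns -1 when no document with that term exists in the index. It requires the Solr/Lucene jars on the classpath and only runs inside a Solr core.

```java
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

/**
 * Illustrative factory/processor pair: skips an add command when a document
 * with the same unique key already exists, so the old document is kept.
 */
public class SkipExistingDocumentsProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new SkipExistingDocumentsProcessor(req, next);
  }

  static class SkipExistingDocumentsProcessor extends UpdateRequestProcessor {
    private final SolrQueryRequest req;

    SkipExistingDocumentsProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
      super(next);
      this.req = req;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.getSolrInputDocument();
      // Look up the <uniqueKey> field name from the schema rather than hard-coding it.
      String keyField = req.getSchema().getUniqueKeyField().getName();
      Object id = doc.getFieldValue(keyField);
      SolrIndexSearcher searcher = req.getSearcher();
      if (id != null
          && searcher.getFirstMatch(new Term(keyField, id.toString())) != -1) {
        // A document with this ID is already indexed: bypass the rest of the
        // chain so the old document is kept and the new one is dropped.
        return;
      }
      super.processAdd(cmd);
    }
  }
}
```

To use something like this, the factory would be registered in an `<updateRequestProcessorChain>` in solrconfig.xml (before `RunUpdateProcessorFactory`) and the update handler pointed at that chain, as described on the UpdateRequestProcessor wiki page Mikhail links. Note the one-search-per-document cost he mentions: batching the PK lookups is the follow-up optimization if this proves too slow.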