On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
alexander.aris...@gmail.com> wrote:

> Hi people,
>
> I urgently need your help!
>
> I have Solr 3.3 configured and running. I do incremental indexing 4 times a
> day using bulk updates. Some documents are identical to some extent and I
> wish to skip them rather than index them.
> The problem is that I could not find a way to tell Solr to ignore new
> duplicate docs and keep the old indexed docs. I don't care that it's new. Just
> determine by ID that such a document is already in the index and that's it.
>
> I use SolrJ for indexing. I have tried setting overwrite=false and the dedupe
> approach but nothing helped. Either a newer doc overwrites the
> old one or I get a duplicate.
>
> I think it's a very simple and basic feature and it must exist. What did I
> do wrong or fail to do?
>

I guess that is because the mainstream approach is delta-import: you have
"updated" timestamps in your DB and a "last-import" timestamp stored
somewhere. You can check how it works in DIH.
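
A rough sketch of the delta idea in plain Java, with no Solr or DIH
dependency. DeltaSelector, Row and changedSince are made-up names for
illustration only; in DIH the same filter is normally written in SQL, in
the entity's deltaQuery.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class DeltaSelector {
    /** Hypothetical row type: a primary key plus the DB "updated" timestamp. */
    static final class Row {
        final String id;
        final Instant updated;

        Row(String id, Instant updated) {
            this.id = id;
            this.updated = updated;
        }
    }

    /** Returns only the rows modified strictly after the previous import. */
    static List<Row> changedSince(List<Row> rows, Instant lastImport) {
        List<Row> delta = new ArrayList<>();
        for (Row row : rows) {
            if (row.updated.isAfter(lastImport)) {
                delta.add(row);
            }
        }
        return delta;
    }
}
```

Only the rows returned by changedSince would be sent to Solr, and the
stored "last-import" timestamp would be advanced once the import succeeds.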


>
> I tried Google but I couldn't find a solution there, although many people
> have encountered this problem.
>
>
It can definitely be done by overriding
o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I suggest
starting by implementing your own
http://wiki.apache.org/solr/UpdateRequestProcessor: search for the PK and
bypass the chain call if it's found. Then, if you hit performance issues
querying your PKs one by one (but only after that), you can batch your
searches; there are a couple of optimization techniques for huge
disjunction queries like PK:(2 OR 4 OR 5 OR 6).
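
A minimal sketch of the batching step in plain Java, deliberately without
any Solr classes so it runs on its own. PkBatcher, buildQueries and
withoutExisting are hypothetical names, and the code assumes the IDs
contain no query-syntax characters (otherwise they would need escaping,
for example with SolrJ's ClientUtils.escapeQueryChars). Keeping the batch
size below the maxBooleanClauses limit (1024 by default) is probably wise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class PkBatcher {
    /** Builds queries such as PK:(2 OR 4 OR 5), with at most batchSize IDs each. */
    static List<String> buildQueries(String field, List<String> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> queries = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            String disjunction = String.join(" OR ", ids.subList(start, end));
            queries.add(field + ":(" + disjunction + ")");
        }
        return queries;
    }

    /** Keeps only the incoming IDs that the index does not already hold. */
    static List<String> withoutExisting(List<String> incoming, Set<String> existing) {
        List<String> fresh = new ArrayList<>();
        for (String id : incoming) {
            if (!existing.contains(id)) {
                fresh.add(id);
            }
        }
        return fresh;
    }
}
```

Each query string would be run against the index, the IDs that come back
collected into the existing set, and only the documents whose IDs survive
withoutExisting passed further down the chain.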


> I am starting to think that I must query the index to check whether a doc
> to be added is already in the index and not add it to the array, but I have
> so many docs that I am afraid it's not a good solution.
>
> Best Regards
> Alexander Aristov
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics
