The SignatureUpdateProcessor is for exactly this problem:

http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
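
To make the signature idea concrete, here is a rough client-side sketch in plain Java. It is not Solr's implementation: the class and method names are made up for illustration, documents are modelled as simple maps, and MD5 is just one possible hash. It computes a signature from selected fields and keeps only the first document seen for each signature within a batch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
final class SignatureDedupSketch {

    // Hash the chosen fields of one document into a hex MD5 signature.
    static String signature(Map<String, String> doc, List<String> fields) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String field : fields) {
                String value = doc.get(field);
                if (value == null) {
                    value = ""; // a missing field contributes an empty value
                }
                md5.update(value.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator so "ab"+"c" differs from "a"+"bc"
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Keep only the first document seen for each signature, in input order.
    static List<Map<String, String>> keepFirst(List<Map<String, String>> docs,
                                               List<String> fields) {
        Set<String> seen = new HashSet<>();
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (seen.add(signature(doc, fields))) {
                kept.add(doc);
            }
        }
        return kept;
    }
}
```

Note that this only deduplicates inside one batch and keeps the first copy. The processor described on the wiki page works against the index itself, and as far as I know its overwriteDupes option replaces the earlier document instead of skipping the new one, so check the page before relying on it for keep-the-old-doc behaviour.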

On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
<alexander.aris...@gmail.com> wrote:
> I get docs from external sources and the only place I keep them is the Solr
> index. I have no database or other means to track indexed docs (my
> personal opinion is that it might be a huge headache).
>
> Some docs might change slightly in their original sources but I don't need
> those changes. In fact I need the original data only.
>
> So I have no other option but to either check whether a document is already
> in the index before I put it into the SolrJ array (read: query Solr) or
> develop my own update chain processor, implement the ID check there, and
> skip such docs.
>
> Maybe it's the wrong place to argue, and it has probably been discussed
> before, but I wonder why the simple overwrite parameter doesn't work here.
>
> In my opinion it suits this case perfectly. In combination with a unique ID
> it can cover all possible variants.
>
> cases:
>
> 1. overwrite=true and uniqueID exists: the newer doc should overwrite the
> old one.
>
> 2. overwrite=false and uniqueID exists: the newer doc must be skipped since
> the old one exists.
>
> 3. uniqueID doesn't exist: the newer doc just gets added regardless of
> whether an old one exists.
>
>
> Best Regards
> Alexander Aristov
>
>
> On 27 December 2011 22:53, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Mikhail is right as far as I know: the assumption built into Solr is that
>> duplicate IDs (when <uniqueKey> is defined) should trigger the old
>> document to be replaced.
>>
>> What is your system-of-record? By that I mean what does your SolrJ
>> program do to send data to Solr? Is there any way you could just
>> *not* send documents that are already in the Solr index based on,
>> for instance, any timestamp associated with your system-of-record
>> and the last time you did an incremental index?
>>
>> Best
>> Erick
>>
>> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
>> <alexander.aris...@gmail.com> wrote:
>> > Hi
>> >
>> > I am not using a database. All needed data is in the Solr index, which is
>> > why I want to skip excessive checks.
>> >
>> > I will check DIH, but I'm not sure it will help.
>> >
>> > I am fluent in Java and it's not a problem for me to write a class or so,
>> > but I want to check first whether there are any ways (workarounds) to make
>> > it work without coding, just by playing around with configuration and
>> > params. I don't want to move away from the default Solr implementation.
>> >
>> > Best Regards
>> > Alexander Aristov
>> >
>> >
>> > On 27 December 2011 09:33, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
>> >
>> >> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
>> >> alexander.aris...@gmail.com> wrote:
>> >>
>> >> > Hi people,
>> >> >
>> >> > I urgently need your help!
>> >> >
>> >> > I have Solr 3.3 configured and running. I do incremental indexing 4
>> >> > times a day using bulk updates. Some documents are identical to some
>> >> > extent and I wish to skip them rather than index them.
>> >> > The problem is that I could not find a way to tell Solr to ignore new
>> >> > duplicate docs and keep the old indexed docs. I don't care that a doc is
>> >> > new. Just determine by ID that such a document is already in the index,
>> >> > and that's it.
>> >> >
>> >> > I use SolrJ for indexing. I have tried setting overwrite=false and the
>> >> > dedupe approach but nothing helped. Either a newer doc overwrites the
>> >> > old one or I get a duplicate.
>> >> >
>> >> > I think it's a very simple and basic feature and it must exist. What
>> >> > did I do wrong or fail to do?
>> >> >
>> >>
>> >> I guess it's because the mainstream approach is delta-import, where you
>> >> have "updated" timestamps in your DB and a "last-import" timestamp stored
>> >> somewhere. You can check how it works in DIH.
>> >>
>> >>
>> >> >
>> >> > I tried Google but couldn't find a solution there, although many people
>> >> > have encountered this problem.
>> >> >
>> >> >
>> >> It can definitely be done by overriding
>> >> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I suggest
>> >> starting by implementing your own
>> >> http://wiki.apache.org/solr/UpdateRequestProcessor - search for the PK and
>> >> bypass the chain call if it's found. Then, if you hit performance issues
>> >> querying your PKs one by one (but only after that), you can batch your
>> >> searches; there are a couple of optimization techniques for huge
>> >> disjunction queries like PK:(2 OR 4 OR 5 OR 6).
>> >>
>> >>
>> >> > I am starting to think that I must query the index to check whether a
>> >> > doc to be added is already in the index and, if so, not add it to the
>> >> > array, but I have so many docs that I am afraid it's not a good solution.
>> >> >
>> >> > Best Regards
>> >> > Alexander Aristov
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Sincerely yours
>> >> Mikhail Khludnev
>> >> Lucid Certified
>> >> Apache Lucene/Solr Developer
>> >> Grid Dynamics
>> >>
>>
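
Mikhail's suggestion above to batch the primary-key lookups into disjunction queries like PK:(2 OR 4 OR 5 OR 6) could look roughly like this. It is only a sketch of the query-building part in plain Java: the class and method names are hypothetical, the key field name is whatever your schema uses, and actually sending the query through SolrJ and reading back the IDs that already exist is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
final class PkBatchSketch {

    // Build one disjunction query such as PK:("2" OR "4" OR "5" OR "6").
    static String buildQuery(String keyField, List<String> ids) {
        if (ids.isEmpty()) {
            throw new IllegalArgumentException("ids must not be empty");
        }
        StringBuilder q = new StringBuilder(keyField).append(":(");
        for (int i = 0; i < ids.size(); i++) {
            if (i > 0) {
                q.append(" OR ");
            }
            // Quote each value and escape backslashes and quotes inside it.
            String escaped = ids.get(i).replace("\\", "\\\\").replace("\"", "\\\"");
            q.append('"').append(escaped).append('"');
        }
        return q.append(')').toString();
    }

    // Split the IDs into chunks of at most batchSize, one query per chunk.
    static List<List<String>> partition(List<String> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            batches.add(new ArrayList<>(ids.subList(start, end)));
        }
        return batches;
    }
}
```

Each batch yields one query; any candidate ID that does not come back in the results is new and can be added. The batch size should stay below the maxBooleanClauses limit in solrconfig.xml (1024 by default, as far as I know).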



-- 
Lance Norskog
goks...@gmail.com
