Re: Question about http://wiki.apache.org/solr/Deduplication

2011-04-04 Thread eks dev
Thanks Hoss,

Externalizing this part is exactly the path we are exploring now, and
not only for this reason.

We have already started testing Hadoop SequenceFile as a write-ahead
log (WAL) for updates/deletes.
SequenceFile supports append now (simply great!). It was a pain to
have to add Hadoop into the mix for modest collection sizes
(200 million), but on the other hand, having Hadoop around offers
huge flexibility.
The write-ahead log captures update commands (all Solr slaves, which
front the clients, accept updates but only forward them to the WAL).
The Solr master tries to catch up with the update stream, indexing
asynchronously, and finally the Solr slaves chase the master index
via standard Solr replication.
Overnight we run simple MapReduce jobs to consolidate, normalize and
sort the update stream, and we reindex at the end.
Deduplication and collection sorting are only an optimization for us
if done reasonably often, like once per day/week, but if we do not
do it, it doubles our HW resources.
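
A minimal sketch of the flow described above, assuming an in-memory list
in place of the Hadoop SequenceFile and a plain dict in place of the Solr
master index. The names WriteAheadLog, Master and catch_up are
hypothetical and are not Solr or Hadoop APIs; the point is only that
front-ends append and the master consumes at its own pace.

```python
from dataclasses import dataclass, field


@dataclass
class WriteAheadLog:
    """Append-only update log (stand-in for the SequenceFile)."""
    entries: list = field(default_factory=list)

    def append(self, command: dict) -> int:
        # Front-ends only forward commands here; they never index directly.
        self.entries.append(command)
        return len(self.entries) - 1  # offset of the new entry


@dataclass
class Master:
    """Indexer that consumes the log asynchronously, tracking its offset."""
    index: dict = field(default_factory=dict)
    offset: int = 0  # next log entry to apply

    def catch_up(self, wal: WriteAheadLog, batch: int) -> int:
        # Apply at most `batch` pending commands; the rest wait in the log.
        pending = wal.entries[self.offset:self.offset + batch]
        for cmd in pending:
            if cmd["op"] == "add":
                self.index[cmd["id"]] = cmd["doc"]
            elif cmd["op"] == "delete":
                self.index.pop(cmd["id"], None)
        self.offset += len(pending)
        return len(wal.entries) - self.offset  # remaining lag, in commands


wal = WriteAheadLog()
wal.append({"op": "add", "id": "1", "doc": {"text": "dummy text 1"}})
wal.append({"op": "add", "id": "2", "doc": {"text": "dummy text 2"}})
wal.append({"op": "delete", "id": "1"})

master = Master()
lag_after_first_batch = master.catch_up(wal, batch=2)
lag_after_second_batch = master.catch_up(wal, batch=2)
```

Under load the lag grows, but no command is dropped: the master simply
picks up from its stored offset on the next call.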

IMO, native WAL support in Solr would definitely be a nice-to-have
(for HA, update scalability...). The charm of a WAL is that
updates never wait or disappear: if there is too much traffic, we only
get slightly higher update latency, but updates definitely get processed.
Some basic primitives on the WAL (consolidation, replaying the update
stream on Solr, etc.) should be supported in this case, a sort of
smallish subset of Hadoop features for Solr clusters, but nothing
oversized.
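
To make the two primitives concrete, here is a rough sketch assuming
update commands are plain dicts with "op", "id" and "doc" keys. The
functions consolidate and replay are hypothetical illustrations, not an
existing Solr feature.

```python
def consolidate(commands):
    """Collapse an update stream to one command per id (the last one wins),
    sorted by id."""
    latest = {}
    for cmd in commands:
        latest[cmd["id"]] = cmd
    return [latest[doc_id] for doc_id in sorted(latest)]


def replay(commands, index):
    """Apply a command stream to an index (here just a dict) and return it."""
    for cmd in commands:
        if cmd["op"] == "add":
            index[cmd["id"]] = cmd["doc"]
        elif cmd["op"] == "delete":
            index.pop(cmd["id"], None)
    return index


stream = [
    {"op": "add", "id": "b", "doc": "v1"},
    {"op": "add", "id": "a", "doc": "v1"},
    {"op": "add", "id": "b", "doc": "v2"},
    {"op": "delete", "id": "a"},
]
compact = consolidate(stream)

# Replaying the compacted stream yields the same index as the full stream.
full_result = replay(stream, {})
compact_result = replay(compact, {})
```

The final delete for "a" is kept in the compacted stream on purpose, so
that replaying it onto an index that already contains "a" still removes it.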

Cheers,
eks


On Sun, Apr 3, 2011 at 1:05 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : Is it possible in solr to have multivalued id? Or I need to make my
 : own mv_ID for this? Any ideas how to achieve this efficiently?

 This isn't something the SignatureUpdateProcessor is going to be able to
 help you with -- it does the deduplication by changing the low level
 update (implemented as a delete then add) so that the key used to delete
 the older documents is based on the signature field instead of the id
 field.

 In order to do what you are describing, you would need to query the index
 for matching signatures, then add the resulting ids to your document
 before doing that update.

 You could possibly do this in a custom UpdateProcessor, but you'd have to
 do something tricky to ensure you didn't overlook docs that had been added
 but not yet committed when checking for dups.

 I don't have a good suggestion for how to do this internally in Solr -- it
 seems like the type of bulk processing logic that would be better suited
 for an external process before you ever start indexing (much like link
 analysis for back references).

 -Hoss



Re: Question about http://wiki.apache.org/solr/Deduplication

2011-04-02 Thread Chris Hostetter

: Is it possible in solr to have multivalued id? Or I need to make my
: own mv_ID for this? Any ideas how to achieve this efficiently?

This isn't something the SignatureUpdateProcessor is going to be able to
help you with -- it does the deduplication by changing the low level
update (implemented as a delete then add) so that the key used to delete
the older documents is based on the signature field instead of the id
field.
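
As a toy illustration of that behaviour (not Solr's actual
implementation), the sketch below models an index where every add is
"delete by key, then add". TinyIndex is a hypothetical class; the two
documents are the ones from the original question.

```python
class TinyIndex:
    """Toy index where every add is 'delete by key, then add'."""

    def __init__(self, key_field):
        self.key_field = key_field
        self.docs = []

    def add(self, doc):
        key = doc[self.key_field]
        # Delete step: drop every existing doc that shares the key value.
        self.docs = [d for d in self.docs if d[self.key_field] != key]
        # Add step: the incoming doc replaces whatever was deleted.
        self.docs.append(doc)


docs = [
    {"id": "1", "text": "dummy text 1", "signature": "A"},
    {"id": "2", "text": "dummy text 1", "signature": "A"},
]

by_id = TinyIndex("id")                # normal behaviour: key is the id
by_signature = TinyIndex("signature")  # dedup behaviour: key is the signature
for d in docs:
    by_id.add(d)
    by_signature.add(d)

ids_keyed_by_id = [d["id"] for d in by_id.docs]
ids_keyed_by_signature = [d["id"] for d in by_signature.docs]
```

With the signature as the delete key, only the document with id 2
survives, so id 1 is lost rather than kept as an alias.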

In order to do what you are describing, you would need to query the index
for matching signatures, then add the resulting ids to your document
before doing that update.

You could possibly do this in a custom UpdateProcessor, but you'd have to
do something tricky to ensure you didn't overlook docs that had been added
but not yet committed when checking for dups.
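
One way to picture the lookup-and-merge step and the uncommitted-docs
pitfall is the toy model below. It is only a sketch under assumptions:
AliasingIndexer and the multivalued "ids" field are hypothetical, and a
real custom UpdateProcessor would have to do the equivalent against
Solr's own searcher and pending updates.

```python
class AliasingIndexer:
    """Toy model: committed docs plus a buffer of uncommitted adds."""

    def __init__(self):
        self.committed = {}  # signature -> doc, visible to searches
        self.pending = {}    # signature -> doc, added but not yet committed

    def add(self, doc_id, text, signature):
        # Check the uncommitted buffer first, then the committed index,
        # so a duplicate added since the last commit is not overlooked.
        existing = self.pending.get(signature) or self.committed.get(signature)
        ids = list(existing["ids"]) if existing else []
        if doc_id not in ids:
            ids.append(doc_id)
        self.pending[signature] = {"ids": ids, "text": text, "signature": signature}

    def commit(self):
        self.committed.update(self.pending)
        self.pending.clear()


indexer = AliasingIndexer()
indexer.add("1", "dummy text 1", "A")
indexer.add("2", "dummy text 1", "A")  # duplicate arrives before any commit
visible_before_commit = indexer.committed.get("A")  # a search sees nothing yet
indexer.commit()
indexer.add("3", "dummy text 1", "A")  # duplicate of an already committed doc
indexer.commit()
merged_ids = indexer.committed["A"]["ids"]
```

If add consulted only the committed index, the second add would not find
the first document and id 1 would be dropped from the alias list.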

I don't have a good suggestion for how to do this internally in Solr -- it
seems like the type of bulk processing logic that would be better suited
for an external process before you ever start indexing (much like link
analysis for back references).
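
A rough sketch of such an external pass, assuming the input records
already carry a precomputed signature and that the merged documents use
a hypothetical multivalued "ids" field. The function name
collapse_by_signature is made up for illustration.

```python
def collapse_by_signature(records):
    """Merge records sharing a signature into one doc with a list of ids."""
    merged = {}
    for rec in records:
        entry = merged.setdefault(
            rec["signature"],
            {"ids": [], "text": rec["text"], "signature": rec["signature"]},
        )
        entry["ids"].append(rec["id"])
    # Dicts preserve insertion order, so docs come out in first-seen order.
    return list(merged.values())


records = [
    {"id": "1", "text": "dummy text 1", "signature": "A"},
    {"id": "2", "text": "dummy text 1", "signature": "A"},
    {"id": "3", "text": "other text", "signature": "B"},
]
docs_to_index = collapse_by_signature(records)
```

Each resulting document would then be sent to the indexer once, with all
of its ids attached.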

-Hoss


Question about http://wiki.apache.org/solr/Deduplication

2011-03-24 Thread eks dev
Hi,
The use case I am trying to figure out is about preserving IDs without
re-indexing on a duplicate, and instead adding the new ID to a list of
document id aliases.

Example:
Input collection:
id:1, text:dummy text 1, signature:A
id:2, text:dummy text 1, signature:A

I add the first document to an empty index; the text is going to be
indexed and the ID is going to be 1. So far so good.

Now the question: if I add the second document with id == 2, instead of
deleting/indexing this new document, I would like to store id == 2 in
a multivalued field id.

At the end, I would have one less document indexed, and both IDs would
be searchable (and stored as well)...

Is it possible in solr to have multivalued id? Or I need to make my
own mv_ID for this? Any ideas how to achieve this efficiently?

My goal is to not add a new document if the signature matches, but
still have its ID indexed and stored.

Thanks,
eks