Re: Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances

2011-06-01 Thread Tanguy Moal

Lee,

Thank you very much for your answer.

Using the signature field as the uniqueKey is indeed what I was 
doing, so the overwriteDupes=true parameter in my solrconfig was 
somehow redundant, although I wasn't aware of it! =D


In practice it works perfectly and that's the nice part.
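For reference, the configuration this boils down to looks roughly like the 
following, adapted from the wiki's Deduplication page (the hashed field names 
are illustrative, not my actual schema):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- the signature is written directly into the uniqueKey field -->
    <str name="signatureField">id</str>
    <!-- Solr's normal overwrite-by-uniqueKey takes care of duplicates,
         hence overwriteDupes=false -->
    <bool name="overwriteDupes">false</bool>
    <!-- fields hashed to build the signature (illustrative names) -->
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>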

By the way, I wonder what happens when we reach the following code 
snippet when the id field is the same as the signature field, from 
addDoc@DirectUpdateHandler2(AddUpdateCommand):

  if (del) { // ensure id remains unique
      BooleanQuery bq = new BooleanQuery();
      // exclude documents sharing the just-added document's signature (updateTerm)...
      bq.add(new BooleanClause(new TermQuery(updateTerm), Occur.MUST_NOT));
      // ...while requiring the same uniqueKey (idTerm): older documents with
      // this id but a different signature get deleted
      bq.add(new BooleanClause(new TermQuery(idTerm), Occur.MUST));
      // note: if idTerm equals updateTerm (signature field == uniqueKey), this
      // query can presumably never match, yet it still has to be evaluated
      writer.deleteDocuments(bq);
  }

Maybe all my problems started from here...

I'll try to reproduce with a different uniqueKey field and overwriteDupes 
turned back on, to see whether the problem comes from the signature field 
being the same as the uniqueKey field *and* having overwriteDupes on, when 
I have some time. If so, maybe a simple configuration check could be 
performed to avoid the issue. Otherwise it would mean that having 
overwriteDupes turned on simply doesn't scale, and that should be added to 
the wiki's Deduplication page, IMHO.


Thank you again.
Regards,

--
Tanguy

On 31/05/2011 14:58, lee carroll wrote:

Tanguy

You might have tried this already, but can you set overwriteDupes to
false and set the signature key to be the id? That way Solr
will manage updates.

from the wiki

http://wiki.apache.org/solr/Deduplication

<!-- An example dedup update processor that creates the id field on the fly
     based on the hash code of some other fields.  This example has
     overwriteDupes set to false since we are using the id field as the
     signatureField and Solr will maintain uniqueness based on that anyway. -->

HTH

Lee


On 30 May 2011 08:32, Tanguy Moal tanguy.m...@gmail.com wrote:

Hello,

Sorry for re-posting this, but it seems my message got lost in the mailing 
list's message stream without catching anyone's attention... =D

In short, has anyone already experienced dramatic indexing slowdowns during 
large bulk imports with overwriteDupes turned on and a fairly high duplicate 
rate (around 4-8x)?

It seems to produce a lot of deletions, which in turn appear to make the 
merging of segments pretty slow, by significantly increasing the number of 
small read operations occurring simultaneously with the regular large write 
operations of the merge. Added to the poor IO performance of a commodity SATA 
drive, indexing takes ages.

I temporarily bypassed that limitation by disabling the overwriting of 
duplicates, but that changes the way I query the index, requiring me to turn 
on field collapsing at search time.

Is this a known limitation?

Does anyone have a few hints on how to optimize the handling of index-time 
deduplication?

More details on my setup and the state of my understanding are in my previous 
message, quoted below.

Thank you very much in advance.

Regards,

Tanguy

On 05/25/11 15:35, Tanguy Moal wrote:

Dear list,

I'm posting here after some unsuccessful investigations.
In my setup I push documents to Solr using the StreamingUpdateSolrServer.

I'm sending a comfortable initial amount of documents (~250M) and wanted to 
overwrite duplicated documents at index time, during the update, taking 
advantage of the UpdateProcessorChain.

At the beginning of the indexing stage, everything is quite fast; documents 
arrive at a rate of about 1000 doc/s.
The only extra processing during the import is the computation of a couple of 
hashes used to uniquely identify documents by their content, using both stock 
(MD5Signature) and custom (derived from Lookup3Signature) update processors.
I send a commit command to the server every 500k documents sent.

During a first period, the server is CPU bound. After a short while (~10 
minutes), the rate at which documents are received starts to fall dramatically, 
the server becoming IO bound.
At first I thought it was the normal speed decrease during a commit, while my 
push client waits for the flush to occur. That would have been a normal 
slowdown.

What caught my attention was that, unexpectedly, the server was performing a 
lot of small reads, far more than the number of writes, which seem to be 
larger.
The combination of the many small reads with the constant stream of bigger 
writes seems to be creating a lot of IO contention on my commodity SATA drive, 
and the ETA of my index build started to increase scarily =D

I then restarted the JVM with JMX enabled so I could start investigating a 
little bit more. I then realized that the UpdateHandler was performing many 
reads while processing the update request.

Are there any known limitations around the UpdateProcessorChain when 
overwriteDupes is set to true?
I turned that off, which of course breaks the intent of my built index, but 
for comparison purposes it's good.

Re: Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances

2011-05-31 Thread lee carroll
Tanguy

You might have tried this already, but can you set overwriteDupes to
false and set the signature key to be the id? That way Solr
will manage updates.

from the wiki

http://wiki.apache.org/solr/Deduplication

<!-- An example dedup update processor that creates the id field on the fly
     based on the hash code of some other fields.  This example has
     overwriteDupes set to false since we are using the id field as the
     signatureField and Solr will maintain uniqueness based on that anyway. -->

HTH

Lee


On 30 May 2011 08:32, Tanguy Moal tanguy.m...@gmail.com wrote:

 Hello,

 Sorry for re-posting this, but it seems my message got lost in the mailing 
 list's message stream without catching anyone's attention... =D

 In short, has anyone already experienced dramatic indexing slowdowns during 
 large bulk imports with overwriteDupes turned on and a fairly high duplicate 
 rate (around 4-8x)?

 It seems to produce a lot of deletions, which in turn appear to make the 
 merging of segments pretty slow, by significantly increasing the number of 
 small read operations occurring simultaneously with the regular large write 
 operations of the merge. Added to the poor IO performance of a commodity 
 SATA drive, indexing takes ages.

 I temporarily bypassed that limitation by disabling the overwriting of 
 duplicates, but that changes the way I query the index, requiring me to 
 turn on field collapsing at search time.

 Is this a known limitation?

 Does anyone have a few hints on how to optimize the handling of index-time 
 deduplication?

 More details on my setup and the state of my understanding are in my previous 
 message, quoted below.

 Thank you very much in advance.

 Regards,

 Tanguy

 On 05/25/11 15:35, Tanguy Moal wrote:

 Dear list,

 I'm posting here after some unsuccessful investigations.
 In my setup I push documents to Solr using the StreamingUpdateSolrServer.

 I'm sending a comfortable initial amount of documents (~250M) and wanted to 
 overwrite duplicated documents at index time, during the update, taking 
 advantage of the UpdateProcessorChain.

 At the beginning of the indexing stage, everything is quite fast; documents 
 arrive at a rate of about 1000 doc/s.
 The only extra processing during the import is the computation of a couple of 
 hashes used to uniquely identify documents by their content, using both 
 stock (MD5Signature) and custom (derived from Lookup3Signature) update 
 processors.
 I send a commit command to the server every 500k documents sent.

 During a first period, the server is CPU bound. After a short while (~10 
 minutes), the rate at which documents are received starts to fall 
 dramatically, the server becoming IO bound.
 At first I thought it was the normal speed decrease during a commit, 
 while my push client waits for the flush to occur. That would have been 
 a normal slowdown.

 What caught my attention was that, unexpectedly, the server was 
 performing a lot of small reads, far more than the number of writes, 
 which seem to be larger.
 The combination of the many small reads with the constant stream of bigger 
 writes seems to be creating a lot of IO contention on my commodity SATA 
 drive, and the ETA of my index build started to increase scarily =D

 I then restarted the JVM with JMX enabled so I could start investigating a 
 little bit more. I then realized that the UpdateHandler was performing 
 many reads while processing the update request.

 Are there any known limitations around the UpdateProcessorChain when 
 overwriteDupes is set to true?
 I turned that off, which of course breaks the intent of my built index, but 
 for comparison purposes it's good.

 That did the trick, indexing is fast again, even with the periodic commits.

 I therefore have two questions, an interesting first one and a boring second 
 one:

 1 / What's the workflow of the UpdateProcessorChain when one or more 
 processors have overwriting of duplicates turned on? What happens under the 
 hood?

 I tried to answer that myself by looking at DirectUpdateHandler2, and my 
 understanding stopped at the following:
 - The document is added to the Lucene IW
 - The duplicates are deleted from the Lucene IW
 The dark magic I couldn't understand seems to occur around the idTerm and 
 updateTerm things, in the addDoc method. The deletions seem to be buffered 
 somewhere; I just didn't get it :-)

 I might be wrong since I didn't read the code more than that, but the point 
 might be in how Solr handles deletions, which is something still unclear to 
 me. Anyway, a lot of reads seem to occur for that precise task, and it tends 
 to produce a lot of IO, killing indexing performance when overwriteDupes is 
 on. I don't even understand why so many read operations occur at this stage, 
 since my process had a comfortable amount of RAM (Xms=Xmx=8GB), of which 
 only 4.5GB were used so far.

 Any help, recommendation or idea is welcome :-)

Re: Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances

2011-05-30 Thread Tanguy Moal

Hello,

Sorry for re-posting this, but it seems my message got lost in the 
mailing list's message stream without catching anyone's attention... =D


In short, has anyone already experienced dramatic indexing slowdowns 
during large bulk imports with overwriteDupes turned on and a fairly 
high duplicate rate (around 4-8x)?


It seems to produce a lot of deletions, which in turn appear to make the 
merging of segments pretty slow, by significantly increasing the number 
of small read operations occurring simultaneously with the regular large 
write operations of the merge. Added to the poor IO performance of a 
commodity SATA drive, indexing takes ages.


I temporarily bypassed that limitation by disabling the overwriting of 
duplicates, but that changes the way I query the index, requiring me 
to turn on field collapsing at search time.


Is this a known limitation?

Does anyone have a few hints on how to optimize the handling of index-time 
deduplication?


More details on my setup and the state of my understanding are in my 
previous message, quoted below.


Thank you very much in advance.

Regards,

Tanguy

On 05/25/11 15:35, Tanguy Moal wrote:

Dear list,

I'm posting here after some unsuccessful investigations.
In my setup I push documents to Solr using the StreamingUpdateSolrServer.

I'm sending a comfortable initial amount of documents (~250M) and 
wanted to overwrite duplicated documents at index time, during the 
update, taking advantage of the UpdateProcessorChain.


At the beginning of the indexing stage, everything is quite fast; 
documents arrive at a rate of about 1000 doc/s.
The only extra processing during the import is the computation of a 
couple of hashes used to uniquely identify documents by their content, 
using both stock (MD5Signature) and custom (derived from 
Lookup3Signature) update processors.

I send a commit command to the server every 500k documents sent.

During a first period, the server is CPU bound. After a short while 
(~10 minutes), the rate at which documents are received starts to fall 
dramatically, the server becoming IO bound.
At first I thought it was the normal speed decrease during a 
commit, while my push client waits for the flush to occur. That 
would have been a normal slowdown.


What caught my attention was that, unexpectedly, 
the server was performing a lot of small reads, far more than the 
number of writes, which seem to be larger.
The combination of the many small reads with the constant stream of 
bigger writes seems to be creating a lot of IO contention on my 
commodity SATA drive, and the ETA of my index build started to 
increase scarily =D


I then restarted the JVM with JMX enabled so I could start 
investigating a little bit more. I then realized that the 
UpdateHandler was performing many reads while processing the update 
request.


Are there any known limitations around the UpdateProcessorChain when 
overwriteDupes is set to true?
I turned that off, which of course breaks the intent of my built 
index, but for comparison purposes it's good.


That did the trick, indexing is fast again, even with the periodic 
commits.


I therefore have two questions, an interesting first one and a boring 
second one:


1 / What's the workflow of the UpdateProcessorChain when one or more 
processors have overwriting of duplicates turned on? What happens 
under the hood?


I tried to answer that myself by looking at DirectUpdateHandler2, and my 
understanding stopped at the following:

- The document is added to the Lucene IW
- The duplicates are deleted from the Lucene IW
The dark magic I couldn't understand seems to occur around the idTerm 
and updateTerm things, in the addDoc method. The deletions seem to be 
buffered somewhere; I just didn't get it :-)


I might be wrong since I didn't read the code more than that, but the 
point might be in how Solr handles deletions, which is something 
still unclear to me. Anyway, a lot of reads seem to occur for that 
precise task, and it tends to produce a lot of IO, killing indexing 
performance when overwriteDupes is on. I don't even understand why so 
many read operations occur at this stage, since my process had a 
comfortable amount of RAM (Xms=Xmx=8GB), of which only 4.5GB were used 
so far.


Any help, recommendation or idea is welcome :-)

2 / In case there isn't a simple fix for this, I'll have to live 
with duplicates in my index. I don't mind, since Solr offers a great 
grouping feature, which I already use in some other applications. The 
only thing I don't know yet: if I rely on grouping at search 
time, in combination with the Stats component (which is the intent of 
that index), limiting the results to 1 document per group, will 
the computed statistics take those duplicates into account or not? 
In short, how well does the Stats component behave when combined with 
hit collapsing?


I had first implemented my solution using overwriteDupes because it would 
have reduced both the target size of my index and the complexity of the 
queries used to obtain statistics on the search results, at once.

Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances

2011-05-25 Thread Tanguy Moal

Dear list,

I'm posting here after some unsuccessful investigations.
In my setup I push documents to Solr using the StreamingUpdateSolrServer.

I'm sending a comfortable initial amount of documents (~250M) and wanted 
to overwrite duplicated documents at index time, during the update, 
taking advantage of the UpdateProcessorChain.


At the beginning of the indexing stage, everything is quite fast; 
documents arrive at a rate of about 1000 doc/s.
The only extra processing during the import is the computation of a 
couple of hashes used to uniquely identify documents by their content, 
using both stock (MD5Signature) and custom (derived from 
Lookup3Signature) update processors.
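
For illustration, the dedup chain in question is of roughly this shape 
(field names and parameter values are placeholders, not my exact 
configuration; my custom Lookup3Signature-derived processor is declared the 
same way as the stock one):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <!-- overwriteDupes=true: previously indexed documents carrying the
         same signature are deleted when a new one arrives -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <str name="signatureClass">org.apache.solr.update.processor.MD5Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>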

I send a commit command to the server every 500k documents sent.
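
A minimal sketch of the push client, for context (the URL, queue size, 
thread count and field name are illustrative placeholders, not my actual 
values):

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkPusher {
    public static void main(String[] args) throws Exception {
        // queue size and thread count are illustrative
        StreamingUpdateSolrServer server =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 4096, 4);
        for (long i = 1; i <= 250000000L; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            // "content" is a hypothetical field; the dedup chain derives the
            // uniqueKey from the document's content, so no id is sent here
            doc.addField("content", "document body #" + i);
            server.add(doc);
            if (i % 500000 == 0) {
                server.commit(); // commit every 500k documents, as described
            }
        }
        server.commit();
    }
}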

During a first period, the server is CPU bound. After a short while (~10 
minutes), the rate at which documents are received starts to fall 
dramatically, the server becoming IO bound.
At first I thought it was the normal speed decrease during a commit, 
while my push client waits for the flush to occur. That would have 
been a normal slowdown.


What caught my attention was that, unexpectedly, the 
server was performing a lot of small reads, far more than the number of 
writes, which seem to be larger.
The combination of the many small reads with the constant stream of 
bigger writes seems to be creating a lot of IO contention on my commodity 
SATA drive, and the ETA of my index build started to increase scarily =D


I then restarted the JVM with JMX enabled so I could start investigating 
a little bit more. I then realized that the UpdateHandler was 
performing many reads while processing the update request.


Are there any known limitations around the UpdateProcessorChain when 
overwriteDupes is set to true?
I turned that off, which of course breaks the intent of my built index, 
but for comparison purposes it's good.


That did the trick, indexing is fast again, even with the periodic commits.

I therefore have two questions, an interesting first one and a boring 
second one:


1 / What's the workflow of the UpdateProcessorChain when one or more 
processors have overwriting of duplicates turned on? What happens under 
the hood?


I tried to answer that myself by looking at DirectUpdateHandler2, and my 
understanding stopped at the following:

- The document is added to the Lucene IW
- The duplicates are deleted from the Lucene IW
The dark magic I couldn't understand seems to occur around the idTerm 
and updateTerm things, in the addDoc method. The deletions seem to be 
buffered somewhere; I just didn't get it :-)


I might be wrong since I didn't read the code more than that, but the 
point might be in how Solr handles deletions, which is something 
still unclear to me. Anyway, a lot of reads seem to occur for that 
precise task, and it tends to produce a lot of IO, killing indexing 
performance when overwriteDupes is on. I don't even understand why so 
many read operations occur at this stage, since my process had a 
comfortable amount of RAM (Xms=Xmx=8GB), of which only 4.5GB were used 
so far.


Any help, recommendation or idea is welcome :-)

2 / In case there isn't a simple fix for this, I'll have to live with 
duplicates in my index. I don't mind, since Solr offers a great grouping 
feature, which I already use in some other applications. The only thing 
I don't know yet: if I rely on grouping at search time, in 
combination with the Stats component (which is the intent of that 
index), limiting the results to 1 document per group, will the 
computed statistics take those duplicates into account or not? In short, 
how well does the Stats component behave when combined with hit collapsing?
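
For the record, the kind of search-time request I mean would look like this 
in SolrJ (field names are hypothetical); the open question being whether the 
stats are computed before or after the collapsing:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupedStatsQuery {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.set("group", true);
        q.set("group.field", "signature"); // collapse duplicates by signature
        q.set("group.limit", 1);           // keep 1 document per group
        q.set("stats", true);
        q.set("stats.field", "price");     // hypothetical numeric field
        QueryResponse rsp = server.query(q);
        // do the returned stats see 1 document per group, or all duplicates?
        System.out.println(rsp.getFieldStatsInfo().get("price"));
    }
}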


I had first implemented my solution using overwriteDupes because it 
would have reduced both the target size of my index and the complexity 
of the queries used to obtain statistics on the search results, at once.


Thank you very much in advance.

--
Tanguy