Dear list,
I'm posting here after some unsuccessful investigations.
In my setup I push documents to Solr using the StreamingUpdateSolrServer.
I'm sending a fairly large initial batch of documents (~250M) and want
duplicate documents to be overwritten at index time, during the update,
by taking advantage of the UpdateProcessorChain.
At the beginning of the indexing stage, everything is quite fast;
documents arrive at a rate of about 1000 docs/s.
The only extra processing during the import is the computation of a
couple of hashes used to uniquely identify documents by their content,
using both stock (MD5Signature) and custom (derived from
Lookup3Signature) update processors.
I send a commit command to the server every 500k documents sent.
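For clarity, the commit cadence on the client side looks roughly like the
sketch below. It is plain Java with no SolrJ classes: IndexClient is a
hypothetical stand-in for the real client, and the final commit for a
partial last batch is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PeriodicCommitSketch {

    /** Hypothetical stand-in for the Solr client; not a SolrJ type. */
    interface IndexClient {
        void add(String doc);
        void commit();
    }

    /**
     * Pushes every document and commits after each full batch of
     * commitEvery documents, plus once at the end for any remainder.
     * Returns the number of commits issued.
     */
    static int pushAll(Iterator<String> docs, IndexClient client, int commitEvery) {
        if (commitEvery <= 0) {
            throw new IllegalArgumentException("commitEvery must be positive");
        }
        int sent = 0;
        int commits = 0;
        while (docs.hasNext()) {
            client.add(docs.next());
            sent++;
            if (sent % commitEvery == 0) {
                client.commit();
                commits++;
            }
        }
        // Assumption: flush the last, partial batch as well.
        if (sent % commitEvery != 0) {
            client.commit();
            commits++;
        }
        return commits;
    }

    public static void main(String[] args) {
        final List<String> log = new ArrayList<>();
        IndexClient recorder = new IndexClient() {
            public void add(String doc) { log.add("add:" + doc); }
            public void commit() { log.add("commit"); }
        };
        // Tiny batch size for the demo; the real run uses 500k.
        int commits = pushAll(Arrays.asList("a", "b", "c", "d", "e").iterator(), recorder, 2);
        System.out.println(commits + " commits, log=" + log);
    }
}
```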
At first, the server is CPU bound. After a short while (~10 minutes),
the rate at which documents are received starts to fall dramatically,
and the server becomes IO bound.
My first thought was that this was the normal slowdown during a commit,
while my push client waits for the flush to occur.
What caught my attention was that the server was unexpectedly
performing a lot of small reads, far more numerous than the writes,
which seem to be larger.
The combination of the many small reads with the constant amount of
bigger writes seems to be creating a lot of IO contention on my
commodity SATA drive, and the ETA of my built index started to increase
scarily =D
I then restarted the JVM with JMX enabled so I could investigate a
little bit more. I then realized that the UpdateHandler was performing
many reads while processing the update request.
Are there any known limitations around the UpdateProcessorChain when
overwriteDupes is set to true?
I turned that off, which of course breaks the intent of my built index,
but for comparison purposes it's good.
That did the trick: indexing is fast again, even with the periodic commits.
I therefore have two questions, an interesting first one and a boring
second one:
1 / What's the workflow of the UpdateProcessorChain when one or more
processors have overwriting of duplicates turned on? What happens under
the hood?
I tried to answer that myself by looking at DirectUpdateHandler2, and my
understanding stopped at the following:
- The document is added to the Lucene IndexWriter (IW)
- The duplicates are deleted from the Lucene IW
The dark magic I couldn't understand seems to happen around the idTerm
and updateTerm handling in the addDoc method. The deletions seem to be
buffered somewhere, but I just didn't get it :-)
I might be wrong since I didn't read the code any further, but the key
may be how Solr handles deletions, which is still unclear to me. In any
case, a lot of reads seem to occur for that precise task, and they
produce enough IO to kill indexing performance when overwriteDupes is
on. I don't even understand why so many read operations occur at this
stage, since my process has a comfortable amount of RAM (Xms=Xmx=8GB),
of which only 4.5GB is used so far.
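To make my guess concrete, here is a toy model of what I imagine
"buffered deletions" could mean. It is only a sketch under assumptions I
have not verified against the Lucene source: each overwriting add
records a delete for its signature, and at flush time every buffered
delete is looked up in every existing segment. All class and method
names are made up, and there is no segment merging in this model.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BufferedDeleteSketch {

    // Each flushed "segment" is just the list of signatures of its live docs.
    private final List<List<String>> segments = new ArrayList<>();
    // Docs added since the last flush, identified by their signature.
    private final List<String> bufferedDocs = new ArrayList<>();
    // signature -> number of buffered docs the delete applies to,
    // i.e. only the docs that were buffered before the delete was recorded.
    private final Map<String, Integer> bufferedDeletes = new HashMap<>();
    private int segmentLookups = 0;

    /** Overwriting add: record a delete for the signature, then buffer the doc. */
    void update(String signature) {
        bufferedDeletes.put(signature, bufferedDocs.size());
        bufferedDocs.add(signature);
    }

    /** Resolves buffered deletes, then turns the buffered docs into a new segment. */
    void flush() {
        // Assumed cost model: one lookup per (buffered delete, existing segment).
        for (String signature : bufferedDeletes.keySet()) {
            for (List<String> segment : segments) {
                segmentLookups++;
                segment.removeIf(signature::equals);
            }
        }
        // A delete only hits docs that were buffered before it was recorded.
        List<String> newSegment = new ArrayList<>();
        for (int i = 0; i < bufferedDocs.size(); i++) {
            String signature = bufferedDocs.get(i);
            Integer upto = bufferedDeletes.get(signature);
            if (upto == null || i >= upto) {
                newSegment.add(signature);
            }
        }
        segments.add(newSegment);
        bufferedDocs.clear();
        bufferedDeletes.clear();
    }

    int liveDocs() {
        int total = 0;
        for (List<String> segment : segments) {
            total += segment.size();
        }
        return total;
    }

    int segmentLookups() {
        return segmentLookups;
    }

    public static void main(String[] args) {
        BufferedDeleteSketch index = new BufferedDeleteSketch();
        index.update("a");
        index.update("b");
        index.flush();
        index.update("a");
        index.update("c");
        index.update("a");
        index.flush();
        index.update("b");
        index.flush();
        System.out.println("live docs: " + index.liveDocs()
                + ", segment lookups: " + index.segmentLookups());
    }
}
```

If something like this is what happens, the number of lookups grows with
(distinct signatures per flush) x (number of segments), and with hashes
as keys each lookup would land at an essentially random place on disk.
That would match the many small reads I am seeing, but again, this is a
guess.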
Any help, recommendation or idea is welcome :-)
2 / In case there isn't a simple fix for this, I'll have to make do
with duplicates in my index. I don't mind, since Solr offers a great
grouping feature, which I already use in some other applications. The
only thing I don't know yet is this: if I rely on grouping at search
time, in combination with the Stats component (which is the intent of
that index), and limit the results to 1 document per group, will the
computed statistics take those duplicates into account or not? In short,
how well does the Stats component behave when combined with hit
collapsing?
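To illustrate the difference I am asking about, here is a toy example in
plain Java. It says nothing about what Solr actually does; the Hit class
and both methods are made up. The question is simply which of the two
numbers the Stats component would report.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GroupedStatsSketch {

    /** Made-up search hit: a content signature and one numeric field. */
    static final class Hit {
        final String signature;
        final double value;

        Hit(String signature, double value) {
            this.signature = signature;
            this.value = value;
        }
    }

    /** Sum over every matching document, duplicates included. */
    static double sumAll(List<Hit> hits) {
        double sum = 0;
        for (Hit hit : hits) {
            sum += hit.value;
        }
        return sum;
    }

    /** Sum over the first document of each signature group only. */
    static double sumFirstPerGroup(List<Hit> hits) {
        Set<String> seen = new HashSet<>();
        double sum = 0;
        for (Hit hit : hits) {
            // Set.add returns false when the signature was already seen.
            if (seen.add(hit.signature)) {
                sum += hit.value;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("s1", 10.0),
                new Hit("s1", 10.0), // duplicate of the first hit
                new Hit("s2", 5.0));
        System.out.println("all hits: " + sumAll(hits));
        System.out.println("one per group: " + sumFirstPerGroup(hits));
    }
}
```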
I first implemented my solution using overwriteDupes because it would
have reduced both the target size of my index and the complexity of the
queries used to obtain statistics on the search results.
Thank you very much in advance.
--
Tanguy