Hi, I have a question about the SolrUpdateBolt.execute() <https://github.com/apache/storm/blob/master/external/storm-solr/src/main/java/org/apache/storm/solr/bolt/SolrUpdateBolt.java#L92> method.
It seems that SolrUpdateBolt sends every tuple to Solr individually in execute(), and only issues a commit() after a specified number of documents have been sent. Would it be better if we batched the documents in memory and then sent them to Solr in a single request? I am drawing inspiration from EsBolt, another very popular search-engine bolt, which keeps tuples in memory, sends them as one batch request, and then acks or fails the whole batch based on that single request's outcome. Here are some pointers showing how EsBolt does it:

EsBolt.execute() <https://github.com/elastic/elasticsearch-hadoop/blob/master/storm/src/main/java/org/elasticsearch/storm/EsBolt.java#L116-L120>
  --> RestRepository.writeToIndex() <https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/rest/RestRepository.java#L154-L164>
  --> RestRepository.doWriteToIndex() <https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/rest/RestRepository.java#L182-L214>

If we did the same in SolrUpdateBolt, the number of HTTP calls would be reduced by a factor of N, where N is the batch size of the request, and that would be a good performance boost IMO.
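To make the idea concrete, here is a rough sketch of the buffering pattern I have in mind. This is my own illustration, not the real bolt: BatchingSolrBolt, toDoc() and the field names are made up, and the actual SolrUpdateBolt maps tuples through its SolrMapper rather than a hard-coded mapping like this.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class BatchingSolrBolt extends BaseRichBolt {
    private final String solrUrl;
    private final String collection;
    private final int batchSize;

    private transient SolrClient solrClient;
    private transient OutputCollector collector;
    private transient List<Tuple> queued;            // tuples held back until the batch is flushed
    private transient List<SolrInputDocument> docs;  // their corresponding Solr documents

    public BatchingSolrBolt(String solrUrl, String collection, int batchSize) {
        this.solrUrl = solrUrl;
        this.collection = collection;
        this.batchSize = batchSize;
    }

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.solrClient = new HttpSolrClient.Builder(solrUrl).build();
        this.queued = new ArrayList<>();
        this.docs = new ArrayList<>();
    }

    @Override
    public void execute(Tuple tuple) {
        // Buffer the tuple instead of sending it to Solr immediately.
        queued.add(tuple);
        docs.add(toDoc(tuple));
        if (queued.size() >= batchSize) {
            flush();
        }
    }

    private void flush() {
        try {
            // One HTTP round trip for the whole batch, followed by one commit.
            solrClient.add(collection, docs);
            solrClient.commit(collection);
            for (Tuple t : queued) {
                collector.ack(t);
            }
        } catch (Exception e) {
            // The whole batch shares one fate: fail everything so the spout replays it.
            for (Tuple t : queued) {
                collector.fail(t);
            }
        } finally {
            queued.clear();
            docs.clear();
        }
    }

    private SolrInputDocument toDoc(Tuple tuple) {
        // Hypothetical mapping for illustration; the real bolt would delegate to its mapper.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", tuple.getStringByField("id"));
        doc.addField("text", tuple.getStringByField("text"));
        return doc;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // No downstream stream; this bolt only writes to Solr.
    }
}

A real implementation would of course also need to flush partially filled batches on tick tuples (or a timer), so buffered tuples don't sit around past the topology's message timeout.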
Thanks,
Tid