> <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#merge-factors>.
> <https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html#UpdateHandlersinSolrConfig-autoCommit>

Thanks for those relevant pointers and the explanation.
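If I follow the maxMergeAtOnce / segmentsPerTier advice from the first link, the indexConfig change would look roughly like this (a sketch based on the linked page; the value 50 is the one you mentioned, not something I have validated on my side):

```xml
<!-- solrconfig.xml: merge segments less eagerly during heavy indexing -->
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">50</int>
    <int name="segmentsPerTier">50</int>
  </mergePolicyFactory>
</indexConfig>
```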

> How often do you commit? Are you committing after each XML is
> indexed? If yes, what is your batch (XML) size? Review the default settings of
> autoCommit and consider increasing it. 

I guess I do not use any XML under the hood: spark-solr uses solrj, which
serializes documents as Java binary objects. However, the commit strategy
still applies: I have set it to 20,000 documents or 20,000 ms.
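For the record, that commit strategy corresponds to roughly this in solrconfig.xml (a sketch; openSearcher=false is the usual bulk-indexing recommendation and an assumption on my side, not taken from my actual config):

```xml
<!-- solrconfig.xml: hard commit every 20,000 docs or 20,000 ms -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>20000</maxDocs>
    <maxTime>20000</maxTime>           <!-- milliseconds -->
    <openSearcher>false</openSearcher> <!-- commit for durability, not visibility -->
  </autoCommit>
</updateHandler>
```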

> Do you want real-time reflection
> of updates? If no, you can compromise on commits and merge factors and do
> faster indexing. Don't do soft commits then.

Indeed, I'd like the documents to be accessible sooner. That being said,
a 5 minute delay is acceptable.
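Since a 5 minute delay is acceptable, a soft commit setting like this should do (a sketch; 300000 ms matches that tolerance, and the element goes inside the updateHandler block):

```xml
<!-- solrconfig.xml: make updates searchable at most 5 minutes after indexing -->
<autoSoftCommit>
  <maxTime>300000</maxTime> <!-- 5 minutes, in milliseconds -->
</autoSoftCommit>
```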

> In our case, I have set autoCommit to commit after 50,000 documents are
> indexed. After EdgeNGrams tokenization, during full indexing, we have seen
> the index grow over 60 GB. Once we are done with full indexing, I optimize
> the index and the index size comes down below 13 GB!

I guess I get the idea: "put the dollars in the bag as fast as possible,
we will clean up when back home".
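The "clean-up" being an explicit optimize once bulk indexing is done; if I read the docs right, that is a single update request like this (hypothetical host and collection name):

```
http://localhost:8983/solr/my_collection/update?optimize=true&maxSegments=1
```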

Thanks

On Wed, Oct 23, 2019 at 11:34:44AM +0530, Paras Lehana wrote:
> Hi Nicolas,
> 
> What kind of change exactly on the merge factor side?
> 
> 
> We increased maxMergeAtOnce and segmentsPerTier from 5 to 50. This makes
> Solr merge segments less frequently across many index updates. Yes,
> you need to find the sweet spot here, but do try increasing these values
> from the defaults. I strongly recommend giving a 2 min read to this
> <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#merge-factors>.
> Do note that increasing these values will require you to have larger
> physical storage until segments merge.
> 
> Besides this, do review your autoCommit config
> <https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html#UpdateHandlersinSolrConfig-autoCommit>
> or the frequency of your hard commits. In our case, we don't want real-time
> updates - so we can always commit less frequently. This makes indexing
> faster. How often do you commit? Are you committing after each XML is
> indexed? If yes, what is your batch (XML) size? Review the default settings of
> autoCommit and consider increasing it. Do you want real-time reflection
> of updates? If no, you can compromise on commits and merge factors and do
> faster indexing. Don't do soft commits then.
> 
> In our case, I have set autoCommit to commit after 50,000 documents are
> indexed. After EdgeNGrams tokenization, during full indexing, we have seen
> the index grow over 60 GB. Once we are done with full indexing, I optimize
> the index and the index size comes down below 13 GB! Since we can trade off
> space temporarily for increased indexing speed, we are still committed to
> finding sweeter spots for faster indexing. For the sake of statistics: we have
> over 250 million documents for indexing that converge to 60 million unique
> documents after atomic updates (full indexing).
> 
> 
> 
> > Would you say atomic updates are faster than regular replacement of
> > documents?
> 
> 
> No, I wouldn't say that. Either of the two configs (autoCommit, Merge Policy)
> will impact regular indexing too. In our case, non-atomic indexing is out
> of the question.
> 
> On Wed, 23 Oct 2019 at 00:43, Nicolas Paris <nicolas.pa...@riseup.net>
> wrote:
> 
> > > We, at Auto-Suggest, also do atomic updates daily and specifically
> > > changing merge factor gave us a boost of ~4x
> >
> > Interesting. What kind of change exactly on the merge factor side?
> >
> >
> > > At current configuration, our core atomically updates ~423 documents
> > > per second.
> >
> > Would you say atomic updates are faster than regular replacement of
> > documents? (considering my first thought on this below)
> >
> > > > I am wondering if the **atomic update feature** would speed up the
> > > > process.
> > > > On one hand, using this feature would save network bandwidth because
> > > > only a small subset of the document would be sent from the client to
> > > > the server.
> > > > On the other hand, the server will have to collect the values from the
> > > > disk and reindex them. In addition, this implies storing the values for
> > > > every field (I am not storing every field) and using more space.
> >
> >
> > Thanks Paras
> >
> >
> >
> > On Tue, Oct 22, 2019 at 01:00:10PM +0530, Paras Lehana wrote:
> > > Hi Nicolas,
> > >
> > > Have you tried playing with values of *IndexConfig*
> > > <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html>
> > > (merge factor, segment size, maxBufferedDocs, Merge Policies)? We, at
> > > Auto-Suggest, also do atomic updates daily, and changing the merge
> > > factor specifically gave us a boost of ~4x during indexing. At the
> > > current configuration, our core atomically updates ~423 documents per
> > > second.
> > >
> > > On Sun, 20 Oct 2019 at 02:07, Nicolas Paris <nicolas.pa...@riseup.net>
> > > wrote:
> > >
> > > > > Maybe you need to give more details. I recommend always to try and
> > > > > test yourself as you know your own solution best. What performance
> > > > > does your use case need and what is your current performance?
> > > >
> > > > I have 10 collections on 4 shards (no replication). The collections
> > > > are quite large, ranging from 2 GB to 60 GB per shard. In every case,
> > > > the update process only adds several values to an indexed array field
> > > > on a document subset of each collection. The proportion of the subset
> > > > ranges from 0 to 100%, and is 95% of the time below 20%. The array
> > > > field is 1 of the 20 fields, which are mainly unstored fields with
> > > > some large textual fields.
> > > >
> > > > The 4 Solr instances are collocated with Spark. Right now I have
> > > > tested with 40 Spark executors. Commit time and commit document count
> > > > are both set to 20000. Each shard has 20 GB of memory.
> > > > Loading/replacing the largest collection takes about 2 hours - which is
> > > > quite fast I guess. Updating 5% of the documents of each collection
> > > > takes about half an hour.
> > > >
> > > > Because my need is "only" to append several values to an array, I
> > > > suspect there is some trick to make things faster.
> > > >
> > > >
> > > >
> > > > On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:
> > > > > Maybe you need to give more details. I recommend always to try and
> > > > > test yourself as you know your own solution best. Depending on your
> > > > > Spark process, atomic updates could be faster.
> > > > >
> > > > > With Spark-Solr comes additional complexity. You could have too many
> > > > > executors for your Solr instance(s), i.e. too high a parallelism.
> > > > >
> > > > > Probably the most important question is:
> > > > > What performance does your use case need and what is your current
> > > > > performance?
> > > > >
> > > > > Once this is clear, further architecture aspects can be derived,
> > > > > such as the number of Spark executors, number of Solr instances,
> > > > > sharding, replication, commit timing, etc.
> > > > >
> > > > > > On 19.10.2019 at 21:52, Nicolas Paris <nicolas.pa...@riseup.net>
> > > > > > wrote:
> > > > > >
> > > > > > Hi community,
> > > > > >
> > > > > > Any advice to speed up updates?
> > > > > > Is there any advice on commit, memory, docValues, stored fields, or
> > > > > > any tips to make things faster?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > >
> > > > > >> On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:
> > > > > >> Hi
> > > > > >>
> > > > > >> I am looking for a way to speed up the update of documents.
> > > > > >>
> > > > > >> In my context, the update replaces one of the many existing
> > > > > >> indexed fields, and keeps the others as is.
> > > > > >>
> > > > > >> Right now, I am building the whole document, and replacing the
> > > > > >> existing one by id.
> > > > > >>
> > > > > >> I am wondering if the **atomic update feature** would speed up
> > > > > >> the process.
> > > > > >>
> > > > > >> On one hand, using this feature would save network bandwidth
> > > > > >> because only a small subset of the document would be sent from the
> > > > > >> client to the server.
> > > > > >> On the other hand, the server will have to collect the values
> > > > > >> from the disk and reindex them. In addition, this implies storing
> > > > > >> the values for every field (I am not storing every field) and
> > > > > >> using more space.
> > > > > >>
> > > > > >> Also, I have read that the ConcurrentUpdateSolrServer class
> > > > > >> might be an optimized way of updating documents.
> > > > > >>
> > > > > >> I am using the spark-solr library to deal with SolrCloud. If
> > > > > >> something exists to speed up the process, I would be glad to
> > > > > >> implement it in that library.
> > > > > >> Also, I have split the collection over multiple shards, and I
> > > > > >> admit this speeds up the update process, but who knows?
> > > > > >>
> > > > > >> Thoughts?
> > > > > >>
> > > > > >> --
> > > > > >> nicolas
> > > > > >>
> > > > > >
> > > > > > --
> > > > > > nicolas
> > > > >
> > > >
> > > > --
> > > > nicolas
> > > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > *Paras Lehana* [65871]
> > > Software Programmer, Auto-Suggest,
> > > IndiaMART Intermesh Ltd.
> > >
> > > 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> > > Noida, UP, IN - 201303
> > >
> > > Mob.: +91-9560911996
> > > Work: 01203916600 | Extn:  *8173*
> > >
> > > --
> > > IMPORTANT:
> > > NEVER share your IndiaMART OTP/ Password with anyone.
> >
> > --
> > nicolas
> >
> 
> 
> -- 
> Regards,
> 
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
> 
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
> 
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
> 

-- 
nicolas
