> <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#merge-factors>
> <https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html#UpdateHandlersinSolrConfig-autoCommit>
Thanks for those relevant pointers and the explanation.

> How often do you commit? Are you committing after each XML is
> indexed? If yes, what is your batch (XML) size? Review default settings of
> autoCommit and consider increasing it.

I guess I do not use any XML under the hood: spark-solr uses solrj, which
serializes the documents as Java binary objects. However, the commit
strategy still applies: I have set it up with 20,000 documents or 20,000 ms.

> Do you want real time reflection
> of updates? If no, you can compromise on commits and merge factors and do
> faster indexing. Don't do soft commits then.

Indeed, I'd like the documents to be accessible sooner. That being said, a
5 minute delay is acceptable.

> In our case, I have set autoCommit to commit after 50,000 documents are
> indexed. After EdgeNGrams tokenization, while full indexing, we have seen
> the index get over 60 GB. Once we are done with full indexing, I optimize
> the index and the index size comes below 13 GB!

I guess I get the idea: "put the dollars in the bag as fast as possible,
we will clean up when back home".

Thanks

On Wed, Oct 23, 2019 at 11:34:44AM +0530, Paras Lehana wrote:
> Hi Nicolas,
>
> > What kind of change exactly on the merge factor side ?
>
> We increased maxMergeAtOnce and segmentsPerTier from 5 to 50. This will
> make Solr merge segments less frequently after many index updates. Yes,
> you need to find the sweet spot here but do try increasing these values
> from the default ones. I strongly recommend you to give a 2 min read to this
> <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#merge-factors>.
> Do note that increasing these values will require you to have larger
> physical storage until segments merge.
>
> Besides this, do review your autoCommit config
> <https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html#UpdateHandlersinSolrConfig-autoCommit>
> or the frequency of your hard commits.
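For the archive, this is roughly how I translated the advice above into
solrconfig.xml - only a sketch with the values mentioned in this thread
(50/50 merge settings, my 20,000 docs / 20,000 ms commits, and a 5 minute
soft commit), to be double-checked against the reference guide linked above:

```xml
<!-- indexConfig: merge less frequently by raising both values from the
     default of 10 (Paras went from 5 to 50) -->
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">50</int>
    <int name="segmentsPerTier">50</int>
  </mergePolicyFactory>
</indexConfig>

<!-- updateHandler: hard commit often for durability without opening a
     new searcher; soft commit rarely, for visibility only -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>20000</maxDocs>
    <maxTime>20000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>300000</maxTime> <!-- 5 minutes is acceptable in my case -->
  </autoSoftCommit>
</updateHandler>
```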
> In our case, we don't want real time
> updates - so we can always commit less frequently. This makes indexing
> faster. How often do you commit? Are you committing after each XML is
> indexed? If yes, what is your batch (XML) size? Review default settings of
> autoCommit and consider increasing it. Do you want real time reflection
> of updates? If no, you can compromise on commits and merge factors and do
> faster indexing. Don't do soft commits then.
>
> In our case, I have set autoCommit to commit after 50,000 documents are
> indexed. After EdgeNGrams tokenization, while full indexing, we have seen
> the index get over 60 GB. Once we are done with full indexing, I optimize
> the index and the index size comes below 13 GB! Since we can trade off
> space temporarily for increased indexing speed, we are still committed to
> finding sweeter spots for faster indexing. For statistics purposes, we have
> over 250 million documents for indexing that converge to 60 million unique
> documents after atomic updates (full indexing).
>
> > Would you say atomic update is faster than regular replacement of
> > documents?
>
> No, I don't say that. Either of the two configs (autoCommit, Merge Policy)
> will impact regular indexing too. In our case, non-atomic indexing is out
> of the question.
>
> On Wed, 23 Oct 2019 at 00:43, Nicolas Paris <nicolas.pa...@riseup.net>
> wrote:
>
> > > We, at Auto-Suggest, also do atomic updates daily and specifically
> > > changing merge factor gave us a boost of ~4x
> >
> > Interesting. What kind of change exactly on the merge factor side ?
> >
> > > At current configuration, our core atomically updates ~423 documents
> > > per second.
> >
> > Would you say atomic update is faster than regular replacement of
> > documents ? (considering my first thought on this below)
> >
> > > I am wondering if the **atomic update feature** would speed up the process.
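(A side note for the archive on what an atomic update ships over the wire:
only the delta. A minimal Python sketch of the JSON payload - the id and
the field name "codes" are made-up examples; the "add" modifier appends
values to an existing multi-valued field instead of resending the whole
document:)

```python
import json

# Sketch of a Solr atomic-update payload. "add" appends to a multi-valued
# field; the document id and field name here are made-up examples.
doc = {
    "id": "doc-42",
    "codes": {"add": ["A", "B"]},  # append two values to the "codes" array
}
# Solr's JSON update handler accepts a list of such documents.
payload = json.dumps([doc])
print(payload)  # prints: [{"id": "doc-42", "codes": {"add": ["A", "B"]}}]
```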
> > > On the one hand, using this feature would save network because only a
> > > small subset of the document would be sent from the client to the
> > > server.
> > > On the other hand, the server would have to collect the values from the
> > > disk and reindex them. In addition, this implies storing the values of
> > > every field (I am not storing every field) and using more space.
> >
> > Thanks Paras
> >
> > On Tue, Oct 22, 2019 at 01:00:10PM +0530, Paras Lehana wrote:
> > > Hi Nicolas,
> > >
> > > Have you tried playing with values of *IndexConfig*
> > > <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html>
> > > (merge factor, segment size, maxBufferedDocs, Merge Policies)? We, at
> > > Auto-Suggest, also do atomic updates daily and specifically changing merge
> > > factor gave us a boost of ~4x during indexing. At current configuration,
> > > our core atomically updates ~423 documents per second.
> > >
> > > On Sun, 20 Oct 2019 at 02:07, Nicolas Paris <nicolas.pa...@riseup.net>
> > > wrote:
> > >
> > > > > Maybe you need to give more details. I recommend always to try and
> > > > > test yourself as you know your own solution best. What performance does
> > > > > your use case need and what is your current performance?
> > > >
> > > > I have 10 collections on 4 shards (no replication). The collections are
> > > > quite large, ranging from 2 GB to 60 GB per shard. In every case, the
> > > > update process only adds several values to an indexed array field on a
> > > > document subset of each collection. The proportion of the subset is from
> > > > 0 to 100%, and 95% of the time below 20%. The array field is 1 of
> > > > 20 fields, which are mainly unstored fields with some large textual
> > > > fields.
> > > >
> > > > The 4 solr instances are collocated with spark. Right now I tested with 40
> > > > spark executors. Commit timing and commit document count are both set
> > > > to 20000.
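(Another note for the archive: with spark-solr, the batch and commit
settings mentioned above are passed as write options. A hypothetical sketch
of the option map - option names as I understand them from the spark-solr
README, with zkhost and collection being placeholders:)

```python
# Hypothetical sketch of spark-solr write options. zkhost and collection
# are placeholders; batch_size is how many documents solrj sends per
# javabin request, commit_within the commit delay in ms.
solr_opts = {
    "zkhost": "zk1:2181,zk2:2181/solr",  # placeholder ZooKeeper ensemble
    "collection": "mycollection",        # placeholder collection name
    "batch_size": "20000",
    "commit_within": "20000",
}
# In a real job: df.write.format("solr").options(**solr_opts).save()
print(sorted(solr_opts))
# prints: ['batch_size', 'collection', 'commit_within', 'zkhost']
```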
> > > > Each shard has 20 GB of memory.
> > > > Loading/replacing the largest collection takes about 2 hours - which is
> > > > quite fast I guess. Updating 5% of the documents of each
> > > > collection takes about half an hour.
> > > >
> > > > Because my need is "only" to append several values to an array, I suspect
> > > > there is some trick to make things faster.
> > > >
> > > > On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:
> > > > > Maybe you need to give more details. I recommend always to try and test
> > > > > yourself as you know your own solution best. Depending on your spark
> > > > > process, atomic updates could be faster.
> > > > >
> > > > > With Spark-Solr comes additional complexity. You could have too many
> > > > > executors for your Solr instance(s), i.e. too high parallelism.
> > > > >
> > > > > Probably the most important question is:
> > > > > What performance does your use case need and what is your current
> > > > > performance?
> > > > >
> > > > > Once this is clear, further architecture aspects can be derived, such as
> > > > > the number of spark executors, number of Solr instances, sharding,
> > > > > replication, commit timing etc.
> > > > >
> > > > > > On 19.10.2019 at 21:52, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
> > > > > >
> > > > > > Hi community,
> > > > > >
> > > > > > Any advice to speed up updates ?
> > > > > > Is there any advice on commit, memory, docvalues, stored fields or any tips to
> > > > > > make things faster ?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > >> On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:
> > > > > >> Hi
> > > > > >>
> > > > > >> I am looking for a way to speed up the update of documents.
> > > > > >>
> > > > > >> In my context, the update replaces one of the many existing indexed
> > > > > >> fields, and keeps the others as is.
> > > > > >>
> > > > > >> Right now, I am building the whole document and replacing the existing
> > > > > >> one by id.
> > > > > >>
> > > > > >> I am wondering if the **atomic update feature** would speed up the process.
> > > > > >>
> > > > > >> On the one hand, using this feature would save network because only a
> > > > > >> small subset of the document would be sent from the client to the
> > > > > >> server.
> > > > > >> On the other hand, the server would have to collect the values from the
> > > > > >> disk and reindex them. In addition, this implies storing the values for
> > > > > >> every field (I am not storing every field) and using more space.
> > > > > >>
> > > > > >> Also, I have read that the ConcurrentUpdateSolrServer class might be an
> > > > > >> optimized way of updating documents.
> > > > > >>
> > > > > >> I am using the spark-solr library to deal with solr-cloud. If something
> > > > > >> exists to speed up the process, I would be glad to implement it in that
> > > > > >> library.
> > > > > >> Also, I have split the collection over multiple shards, and I admit this
> > > > > >> speeds up the update process, but who knows ?
> > > > > >>
> > > > > >> Thoughts ?
> > > > > >>
> > > > > >> --
> > > > > >> nicolas
> > > > > >
> > > > > > --
> > > > > > nicolas
> > > >
> > > > --
> > > > nicolas
> > >
> > > --
> > > Regards,
> > >
> > > *Paras Lehana* [65871]
> > > Software Programmer, Auto-Suggest,
> > > IndiaMART Intermesh Ltd.
> > >
> > > 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> > > Noida, UP, IN - 201303
> > >
> > > Mob.: +91-9560911996
> > > Work: 01203916600 | Extn: *8173*
> > >
> > > --
> > > IMPORTANT:
> > > NEVER share your IndiaMART OTP/ Password with anyone.
> >
> > --
> > nicolas
>
> --
> Regards,
>
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
>
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
>
> Mob.: +91-9560911996
> Work: 01203916600 | Extn: *8173*
>
> --
> IMPORTANT:
> NEVER share your IndiaMART OTP/ Password with anyone.

--
nicolas