Re: Need advice on architecture.
Well, thanks a lot.

Chris Hostetter-3 wrote
> The first question i have is why you are using a version of Solr that's
> almost 5 years old.

*Well, Solr is part of another piece of software and is integrated at this version. With the next update they will also update Solr to version 7...*

Chris Hostetter-3 wrote
> The second question you should consider is what your indexing process
> looks like, and whether it's multithreaded or not, and if the bottleneck
> is your network/DB.

*Digging deeper into the system shows that SQL is the bottleneck. Besides Solr, around 25 applications access the DB (110GB), which drives DB memory [32GB] and disk access [SAS RAID] to 100% load. The main problem is getting data out of the DB as fast as possible. We run into further problems because of these circumstances: the API agent tries to send a batch of 25 documents at once to Solr, but already hits a timeout while fetching all the associated fields for that batch from SQL. After the batch of 25 fails, it retries with 12 > 6 > 3 > 2 > 1. This ends up at, at best, 1 document every 7 minutes. :(*

So at this time the DB admin has to do his work first.

Really appreciate your thoughts on this.

kindest regards

Francois

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
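One way around that per-batch timeout is to avoid doing one expensive field-fetch per batch: stream all metadata rows from the DB in a single query ordered by document id, regroup them into documents on the client side, and only then batch them to Solr. A minimal sketch of the regrouping step (the `(doc_id, field, value)` row shape and the function name are illustrative assumptions, not something from this thread):

```python
from itertools import groupby
from operator import itemgetter

def rows_to_docs(rows):
    """Group (doc_id, field, value) rows -- already sorted by doc_id,
    e.g. via ORDER BY in the SQL query -- into one dict per document,
    ready to batch to Solr's JSON update handler."""
    docs = []
    for doc_id, group in groupby(rows, key=itemgetter(0)):
        doc = {"id": doc_id}
        for _, field, value in group:
            # a repeated field becomes a multi-valued list
            if field in doc:
                existing = doc[field]
                doc[field] = existing + [value] if isinstance(existing, list) else [existing, value]
            else:
                doc[field] = value
        docs.append(doc)
    return docs

# Example: four metadata rows describing two documents
rows = [
    (1, "title_t", "foo"),
    (1, "tag_t", "a"),
    (1, "tag_t", "b"),
    (2, "title_t", "bar"),
]
print(rows_to_docs(rows))
```

This trades one sequential scan on the DB side for a little client memory, instead of one multi-join lookup per 25-document batch.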
Re: Need advice on architecture.
: FWIW: I used the script below to build myself 3.8 million documents, with
: 300 "text fields" consisting of anywhere from 1-10 "words" (integers
: between 1 and 200)

Whoops ... forgot to post the script...

#!/usr/bin/perl
use strict;
use warnings;

my $num_docs = 3_800_000;
my $max_words_in_field = 10;
my $words_in_vocab = 200;
my $num_fields = 300;

# header
print "id";
map { print ",${_}_t" } 1..$num_fields;
print "\n";

while ($num_docs--) {
  print "$num_docs"; # uniqueKey
  for (1..$num_fields) {
    my $words_in_field = int(rand($max_words_in_field));
    print ",\"";
    map { print int(rand($words_in_vocab)) . " " } 0..$words_in_field;
    print "\"";
  }
  print "\n";
}
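For anyone who doesn't read Perl, the same CSV shape can be sketched in Python (sizes shrunk here so the output is easy to eyeball; function name and defaults are mine, not from the script above): a header row naming the `*_t` dynamic fields, then one row per document with each quoted cell holding random "words".

```python
import random

def make_csv(num_docs=3, num_fields=4, max_words_in_field=10,
             words_in_vocab=200, seed=42):
    """Build a Solr-ingestable CSV string: an 'id' column plus
    num_fields text fields named 1_t, 2_t, ..."""
    rng = random.Random(seed)
    # header row: id,1_t,2_t,...
    lines = ["id," + ",".join(f"{i}_t" for i in range(1, num_fields + 1))]
    for doc_id in range(num_docs, 0, -1):  # descending ids, like the Perl
        cells = [str(doc_id)]
        for _ in range(num_fields):
            n_words = rng.randint(1, max_words_in_field)
            words = " ".join(str(rng.randrange(words_in_vocab)) for _ in range(n_words))
            cells.append('"' + words + '"')
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"

print(make_csv())
```

Scaled back up to 3.8M docs and 300 fields, this is the kind of file you can stream straight at Solr's CSV update handler.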
Re: Need advice on architecture.
: SQL DB 4M documents with up to 5000 metadata fields each document [2xXeon
: 2.1Ghz, 32GB RAM]
: Actual Solr: 1 Core version 4.6, 3.8M documents, schema has 300 metadata
: fields to import, size 3.6GB [2xXeon 2.4Ghz, 32GB RAM]
: (atm we need 35h to build the index and about 24h for a mass update which
: affects the production)

The first question I have is why you are using a version of Solr that's almost 5 years old.

The second question you should consider is what your indexing process looks like, whether it's multithreaded or not, and whether the bottleneck is your network/DB.

The third question to consider is your Solr configuration / schema: how complex is the Solr-side indexing process -- i.e. are these 300 fields all TextFields with complex analyzers?

FWIW: I used the script below to build myself 3.8 million documents, with 300 "text fields" consisting of anywhere from 1-10 "words" (integers between 1 and 200). The resulting CSV file was 24GB, and using a simple curl command to index with a single client thread (and a single Solr thread) against Solr 7.4 running with the sample techproducts configs took less than 2 hours on my laptop (less CPU & half as much RAM compared to your server) while I was doing other stuff.

(I would bet your current indexing speed has very little to do with Solr and is largely a factor of your source DB and how you are sending the data to Solr.)

-Hoss
http://www.lucidworks.com/
Re: Need advice on architecture.
Are you doing a commit after every document? Is the index on local disk?

That is very slow indexing. With four shards and smaller documents, we can index about a million documents per minute.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 19, 2018, at 1:28 AM, Emir Arnautović wrote:
>
> Hi Francois,
> If I got your numbers right, you are indexing on a single server and the
> indexing rate is ~31 doc/s. I would first check if something is wrong with
> the indexing logic. Check where the bottleneck is: do you read documents
> from the DB fast enough, do you batch documents…
> Assuming you cannot get a better rate than ~31 doc/s and that the
> bottleneck is Solr, then in order to finish in 6h you need to parallelise
> indexing on Solr by splitting the index across ~6 servers, for an overall
> indexing rate of 180 doc/s.
>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>> On 19 Jul 2018, at 09:59, servus01 wrote:
>>
>> Would like to ask what your recommendations are for a new performant Solr
>> architecture.
>>
>> SQL DB: 4M documents with up to 5000 metadata fields per document
>> [2x Xeon 2.1GHz, 32GB RAM]
>> Current Solr: 1 core, version 4.6, 3.8M documents, schema has 300 metadata
>> fields to import, index size 3.6GB [2x Xeon 2.4GHz, 32GB RAM]
>> (atm we need 35h to build the index and about 24h for a mass update,
>> which affects production)
>>
>> Building the index should take less than 6h. Sometimes we change some of
>> the metadata fields, which affects most of the documents, so a mass
>> update / reindex is necessary. A reindex of about 6h (overnight) is also
>> OK, but it should not have an impact on user queries. Anyway, any faster
>> indexing is very welcome. We will have at most 20-30 concurrent users.
>>
>> So I asked myself: how many nodes, shards, replicas etc.? Could someone
>> please give me a recommendation for a fast working architecture.
>>
>> Really appreciate this, best
>>
>> Francois
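If the indexing client is in fact committing per document, the usual alternative is to let Solr commit on its own schedule via autocommit. A minimal sketch of the relevant solrconfig.xml fragment (the intervals here are illustrative, not tuned for this workload):

```
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flush to disk regularly, but don't open a new searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: controls how quickly new documents become searchable -->
  <autoSoftCommit>
    <maxTime>300000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

With this in place the client just streams documents and never calls commit itself; per-document commits force constant searcher reopens and are one of the most common causes of indexing rates this slow.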
Re: Need advice on architecture.
Hi Francois,
If I got your numbers right, you are indexing on a single server and the indexing rate is ~31 doc/s. I would first check if something is wrong with the indexing logic. Check where the bottleneck is: do you read documents from the DB fast enough, do you batch documents…
Assuming you cannot get a better rate than ~31 doc/s and that the bottleneck is Solr, then in order to finish in 6h you need to parallelise indexing on Solr by splitting the index across ~6 servers, for an overall indexing rate of 180 doc/s.

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 19 Jul 2018, at 09:59, servus01 wrote:
>
> Would like to ask what your recommendations are for a new performant Solr
> architecture.
>
> SQL DB: 4M documents with up to 5000 metadata fields per document
> [2x Xeon 2.1GHz, 32GB RAM]
> Current Solr: 1 core, version 4.6, 3.8M documents, schema has 300 metadata
> fields to import, index size 3.6GB [2x Xeon 2.4GHz, 32GB RAM]
> (atm we need 35h to build the index and about 24h for a mass update,
> which affects production)
>
> Building the index should take less than 6h. Sometimes we change some of
> the metadata fields, which affects most of the documents, so a mass
> update / reindex is necessary. A reindex of about 6h (overnight) is also
> OK, but it should not have an impact on user queries. Anyway, any faster
> indexing is very welcome. We will have at most 20-30 concurrent users.
>
> So I asked myself: how many nodes, shards, replicas etc.? Could someone
> please give me a recommendation for a fast working architecture.
>
> Really appreciate this, best
>
> Francois
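Emir's sizing arithmetic checks out; using the numbers from the thread (3.8M docs, a 6-hour window, ~31 doc/s observed on the current single server):

```python
import math

total_docs = 3_800_000
window_s = 6 * 3600          # 6-hour indexing window, in seconds
per_server_rate = 31         # observed doc/s on the current single server

required_rate = total_docs / window_s                 # overall doc/s needed
servers = math.ceil(required_rate / per_server_rate)  # parallel servers needed

print(f"required rate: {required_rate:.0f} doc/s, servers: {servers}")
# ~176 doc/s overall, i.e. 6 servers -- matching the ~6 servers / 180 doc/s above
```

Of course this only holds if the 31 doc/s bottleneck really is Solr; if the DB is the limit, adding Solr servers changes nothing.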
Need advice on architecture.
Would like to ask what your recommendations are for a new performant Solr architecture.

SQL DB: 4M documents with up to 5000 metadata fields per document [2x Xeon 2.1GHz, 32GB RAM]
Current Solr: 1 core, version 4.6, 3.8M documents, schema has 300 metadata fields to import, index size 3.6GB [2x Xeon 2.4GHz, 32GB RAM]
(atm we need 35h to build the index and about 24h for a mass update, which affects production)

Building the index should take less than 6h. Sometimes we change some of the metadata fields, which affects most of the documents, so a mass update / reindex is necessary. A reindex of about 6h (overnight) is also OK, but it should not have an impact on user queries. Anyway, any faster indexing is very welcome. We will have at most 20-30 concurrent users.

So I asked myself: how many nodes, shards, replicas etc.? Could someone please give me a recommendation for a fast working architecture.

Really appreciate this, best

Francois