Re: Need advice for architecture.

2018-07-20 Thread servus01
Well, thanks a lot. 


Chris Hostetter-3 wrote
> The first question i have is why you are using a version of Solr that's 
> almost 5 years old.

*Well, Solr is part of another software product and comes integrated at this
version. With the next update they will also move Solr to version 7...*


Chris Hostetter-3 wrote
> The second question you should consider is what your indexing process 
> looks like, and whether it's multithreaded or not, and if the bottleneck 
> is your network/DB. 

*Digging deeper into the system shows that SQL is the bottleneck. Besides
Solr, around 25 applications access the DB (110 GB), driving DB memory
[32 GB] and disk access [SAS RAID] to 100% load.
The main problem is getting data out of the DB as fast as possible, and this
causes some follow-on problems: the API agent tries to send batches of 25
elements at once to Solr, but already runs into a timeout fetching all the
associated fields for such a batch from SQL. After a failure it halves the
batch: 25 > 12 > 6 > 3 > 2 > 1. In the end this comes down to 1 document
every 7 minutes. :(*
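
To make the halving concrete, here is a minimal Perl sketch of the retry
behavior described above. fetch_fields_from_sql() and send_to_solr() are
illustrative stubs, not the vendor's actual API agent code, and the exact
rounding of the halving may differ:

#!/usr/bin/perl
use strict;
use warnings;

# Stubs standing in for the real work: querying SQL for the associated
# fields and posting the batch to Solr. The 30% failure rate is arbitrary.
sub fetch_fields_from_sql { return [ @_ ] }
sub send_to_solr { die "timeout\n" if rand() < 0.3; }

my @pending    = (1 .. 100);   # document ids still waiting to be indexed
my $batch_size = 25;

while (@pending) {
    my @batch = splice(@pending, 0, $batch_size);
    if (eval { send_to_solr(fetch_fields_from_sql(@batch)); 1 }) {
        next;                                 # batch made it to Solr
    }
    unshift @pending, @batch;                 # put the failed batch back
    $batch_size = int($batch_size / 2) || 1;  # roughly 25 > 12 > 6 > 3 > 1
}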

So at this time the DB admin has to do his work first.

Really appreciate your thoughts on this.

kindest regards

Francois


Re: Need advice for architecture.

2018-07-19 Thread Chris Hostetter


: FWIW: I used the script below to build myself 3.8 million documents, with 
: 300 "text fields" consisting of anywhere from 1-10 "words" (integers 
: between 1 and 200)

Whoops ... forgot to post the script...


#!/usr/bin/perl

use strict;
use warnings;

my $num_docs           = 3_800_000;
my $max_words_in_field = 10;
my $words_in_vocab     = 200;
my $num_fields         = 300;

# CSV header: id,1_t,2_t,...,300_t
print "id";
map { print ",${_}_t" } 1..$num_fields;
print "\n";

while ($num_docs--) {
    print "$num_docs";    # uniqueKey
    for (1..$num_fields) {
        # 1-10 "words", each an integer from a 200-"word" vocabulary
        my $words_in_field = int(rand($max_words_in_field));
        print ",\"";
        map { print int(rand($words_in_vocab)) . " " } 0..$words_in_field;
        print "\"";
    }
    print "\n";
}
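
(To reproduce the test, the output would be written to a file, e.g.
"perl make_docs.pl > docs.csv" -- the script name here is illustrative --
and the resulting CSV posted to Solr with the curl command from my other
mail.)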




Re: Need advice for architecture.

2018-07-19 Thread Chris Hostetter


: SQL DB: 4M documents with up to 5000 metadata fields per document [2x Xeon
: 2.1 GHz, 32 GB RAM]
: Current Solr: 1 core, version 4.6, 3.8M documents, schema has 300 metadata
: fields to import, size 3.6 GB [2x Xeon 2.4 GHz, 32 GB RAM]
: (at the moment we need 35h to build the index and about 24h for a mass
: update, which affects production)

The first question I have is why you are using a version of Solr that's 
almost 5 years old.

The second question you should consider is what your indexing process 
looks like, and whether it's multithreaded or not, and if the bottleneck 
is your network/DB.

The third question to consider is your Solr configuration / schema: how 
complex the Solr-side indexing process is -- i.e. are these 300 fields all 
TextFields with complex analyzers?
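
(For illustration: in the sample schemas, a *_t field name -- as generated
by the test script posted separately -- is matched by a dynamic field backed
by a tokenized text type roughly like this; the exact type name and filter
chain in your schema may differ:

<dynamicField name="*_t" type="text_general" indexed="true" stored="true"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
)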

FWIW: I used the script below to build myself 3.8 million documents, with 
300 "text fields" consisting of anywhere from 1-10 "words" (integers 
between 1 and 200)

The resulting CSV file was 24GB, and using a simple curl command to index 
with a single client thread (and a single Solr thread) against Solr 
7.4 running with the sample techproducts configs took less than 2 hours on 
my laptop (less CPU & half as much RAM compared to your server) while I 
was doing other stuff.
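
(For reference, a single-threaded CSV load of this kind boils down to one
command along these lines -- collection and file names here are assumptions:

curl 'http://localhost:8983/solr/techproducts/update?commit=true' \
  -H 'Content-type: application/csv' --data-binary @docs.csv
)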

(I would bet your current indexing speed has very little to do with Solr 
and is largely a factor of your source DB and how you are sending the data 
to Solr)


-Hoss
http://www.lucidworks.com/


Re: Need advice for architecture.

2018-07-19 Thread Walter Underwood
Are you doing a commit after every document? Is the index on local disk?

That is very slow indexing. With four shards and smaller documents, we can 
index about a million documents per minute.
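
(If the answer to the commit question is "yes": per-document commits can be
replaced by batched hard commits in solrconfig.xml -- a minimal sketch, the
60s interval being an assumption, not a recommendation:

<autoCommit>
  <maxTime>60000</maxTime>            <!-- hard commit at most every 60s -->
  <openSearcher>false</openSearcher>  <!-- don't reopen a searcher on hard commit -->
</autoCommit>
)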

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 19, 2018, at 1:28 AM, Emir Arnautović  
> wrote:
> 
> Hi Francois,
> If I got your numbers right, you are indexing on a single server and the 
> indexing rate is ~31 docs/s. I would first check whether something is wrong 
> with the indexing logic. Check where the bottleneck is: do you read 
> documents from the DB fast enough, do you batch documents…
> Assuming you cannot get a better rate than 30 docs/s and that the bottleneck 
> is Solr, then in order to finish in 6h you need to parallelise indexing on 
> Solr by splitting the index across ~6 servers, for an overall indexing rate 
> of 180 docs/s.
> 
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 19 Jul 2018, at 09:59, servus01  wrote:
>> 
>> I would like to ask for your recommendations for a new, performant Solr
>> architecture.
>> 
>> SQL DB: 4M documents with up to 5000 metadata fields per document [2x Xeon
>> 2.1 GHz, 32 GB RAM]
>> Current Solr: 1 core, version 4.6, 3.8M documents, schema has 300 metadata
>> fields to import, size 3.6 GB [2x Xeon 2.4 GHz, 32 GB RAM]
>> (at the moment we need 35h to build the index and about 24h for a mass
>> update, which affects production)
>> 
>> Building the index should take less than 6h. Sometimes we change some of
>> the metadata fields, which affects most of the documents, so a mass
>> update / reindex is necessary. A reindex of about 6h (overnight) is also
>> OK, but it should not impact user queries. Anyway, any faster indexing is
>> very welcome. We will have at most 20-30 concurrent users.
>> 
>> So I asked myself: how many nodes, shards, replicas etc.? Could someone
>> please give me a recommendation for a fast, working architecture.
>> 
>> Really appreciate this, best
>> 
>> Francois



Re: Need advice for architecture.

2018-07-19 Thread Emir Arnautović
Hi Francois,
If I got your numbers right, you are indexing on a single server and the 
indexing rate is ~31 docs/s. I would first check whether something is wrong 
with the indexing logic. Check where the bottleneck is: do you read documents 
from the DB fast enough, do you batch documents…
Assuming you cannot get a better rate than 30 docs/s and that the bottleneck 
is Solr, then in order to finish in 6h you need to parallelise indexing on 
Solr by splitting the index across ~6 servers, for an overall indexing rate 
of 180 docs/s.
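
(For concreteness, the arithmetic behind those numbers:

  \frac{3\,800\,000~\text{docs}}{35~\text{h} \times 3600~\text{s/h}} \approx 30~\text{docs/s}
  \qquad
  \frac{3\,800\,000~\text{docs}}{6~\text{h} \times 3600~\text{s/h}} \approx 176~\text{docs/s}

i.e. about six parallel indexing streams at ~30 docs/s are needed to reach
~180 docs/s.)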

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 19 Jul 2018, at 09:59, servus01  wrote:
> 
> I would like to ask for your recommendations for a new, performant Solr
> architecture.
> 
> SQL DB: 4M documents with up to 5000 metadata fields per document [2x Xeon
> 2.1 GHz, 32 GB RAM]
> Current Solr: 1 core, version 4.6, 3.8M documents, schema has 300 metadata
> fields to import, size 3.6 GB [2x Xeon 2.4 GHz, 32 GB RAM]
> (at the moment we need 35h to build the index and about 24h for a mass
> update, which affects production)
> 
> Building the index should take less than 6h. Sometimes we change some of
> the metadata fields, which affects most of the documents, so a mass
> update / reindex is necessary. A reindex of about 6h (overnight) is also
> OK, but it should not impact user queries. Anyway, any faster indexing is
> very welcome. We will have at most 20-30 concurrent users.
> 
> So I asked myself: how many nodes, shards, replicas etc.? Could someone
> please give me a recommendation for a fast, working architecture.
> 
> Really appreciate this, best
> 
> Francois



Need advice for architecture.

2018-07-19 Thread servus01
I would like to ask for your recommendations for a new, performant Solr
architecture.

SQL DB: 4M documents with up to 5000 metadata fields per document [2x Xeon
2.1 GHz, 32 GB RAM]
Current Solr: 1 core, version 4.6, 3.8M documents, schema has 300 metadata
fields to import, size 3.6 GB [2x Xeon 2.4 GHz, 32 GB RAM]
(at the moment we need 35h to build the index and about 24h for a mass
update, which affects production)

Building the index should take less than 6h. Sometimes we change some of
the metadata fields, which affects most of the documents, so a mass
update / reindex is necessary. A reindex of about 6h (overnight) is also
OK, but it should not impact user queries. Anyway, any faster indexing is
very welcome. We will have at most 20-30 concurrent users.

So I asked myself: how many nodes, shards, replicas etc.? Could someone
please give me a recommendation for a fast, working architecture.

Really appreciate this, best

Francois