Two things to add to make the Elasticsearch/Solr comparison fairer. In the ES mapping, you did not disable the _all field.
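For example, a minimal sketch of recreating the index with _all disabled (assuming ES 1.x, the version used in this thread; the "doc" index and "type" names follow the thread, and `index.codec.bloom.load` is the 1.x setting for the bloom-filter point below):

```shell
# Sketch, not a drop-in replacement: create the "doc" index with the _all
# field disabled and the postings bloom filter left unloaded (ES 1.x).
curl -XPOST 'localhost:9200/doc' -d '{
  "settings" : { "index.codec.bloom.load" : false },
  "mappings" : {
    "type" : {
      "_all" : { "enabled" : false }
    }
  }
}'
```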
If the _all field is enabled, every token is indexed twice: once for its own field and once for _all. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-all-field.html

You may also want to disable the ES codec bloom filter, because loading the bloom filter consumes significant memory: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-codec.html#bloom-postings

I am also not sure why you call curl from Perl, since this adds overhead. There are nice Solr/ES Perl clients to push docs using bulk indexing.

Jörg

On Wednesday, June 18, 2014 4:50:13 AM UTC+2, Maco Ma wrote:
>
> Hi Mike,
>
> new_ES_config.sh (defines the templates and disables refresh/flush):
>
> curl -XPOST localhost:9200/doc -d '{
>   "mappings" : {
>     "type" : {
>       "_source" : { "enabled" : false },
>       "dynamic_templates" : [
>         {"t1":{
>           "match" : "*_ss",
>           "mapping":{
>             "type": "string",
>             "store": false,
>             "norms" : {"enabled" : false}
>           }
>         }},
>         {"t2":{
>           "match" : "*_dt",
>           "mapping":{
>             "type": "date",
>             "store": false
>           }
>         }},
>         {"t3":{
>           "match" : "*_i",
>           "mapping":{
>             "type": "integer",
>             "store": false
>           }
>         }}
>       ]
>     }
>   }
> }'
>
> curl -XPUT localhost:9200/doc/_settings -d '{
>   "index.refresh_interval" : "-1"
> }'
>
> curl -XPUT localhost:9200/doc/_settings -d '{
>   "index.translog.disable_flush" : true
> }'
>
> new_ES_ingest_threads.pl (spawns 10 threads that ingest the docs via curl and one thread that flushes/optimizes periodically):
>
> my $num_args = $#ARGV + 1;
> if ($num_args < 1 || $num_args > 2) {
>   print "\n usage: $0 [src_dir] [thread_count]\n";
>   exit;
> }
>
> my $INST_HOME = "/scratch/aime/elasticsearch-1.2.1";
>
> my $pid = qx(jps | sed -e '/Elasticsearch/p' -n | sed 's/ .*//');
> chomp($pid);
> if ("$pid" eq "") {
>   print "Instance is not up\n";
>   exit;
> }
>
> my $dir = $ARGV[0];
> my $td_count = 10;
> $td_count = $ARGV[1] if ($num_args == 2);
> open(FH, ">$lf");   # note: $lf (the log file name) is never defined in the posted script
> print FH "source dir: $dir\nthread_count: $td_count\n";
> print FH localtime()."\n";
>
> use threads;
> use threads::shared;
>
> my $flush_intv = 10;
>
> my $no :shared = 0;
> my $total = 10000;
> my $intv = 1000;
> my $tstr :shared = "";
> my $ltime :shared = time;
>
> sub commit {
>   $SIG{'KILL'} = sub {
>     `curl -XPOST 'http://localhost:9200/doc/_flush'`;
>     print "forced commit done on ".localtime()."\n";
>     threads->exit();
>   };
>
>   while ($no < $total) {
>     `curl -XPOST 'http://localhost:9200/doc/_flush'`;
>     `curl -XPOST 'http://localhost:9200/doc/_optimize'`;
>     print "commit on ".localtime()."\n";
>     sleep($flush_intv);
>   }
>   `curl -XPOST 'http://localhost:9200/doc/_flush'`;
>   print "commit done on ".localtime()."\n";
> }
>
> sub do {
>   my $c = -1;
>   while (1) {
>     {
>       lock($no);
>       $c = $no;
>       $no++;
>     }
>     last if ($c >= $total);
>     `curl -XPOST -s localhost:9200/doc/type/$c --data-binary \@$dir/$c.json`;
>     if (($c + 1) % $intv == 0) {
>       lock($ltime);
>       $curtime = time;
>       $tstr .= ($curtime - $ltime)." ";
>       $ltime = $curtime;
>     }
>   }
> }
>
> # start the monitor processes
> my $sarId = qx(sar -A 5 100000 -o sar5sec_$dir.out > /dev/null &\necho \$!);
> my $jgcId = qx(jstat -gc $pid 2s > jmem_$dir.out &\necho \$!);
>
> my $ct = threads->create(\&commit);
> my $start = time;
> my @ts = ();
> for $i (1..$td_count) {
>   my $t = threads->create(\&do);
>   push(@ts, $t);
> }
>
> for my $t (@ts) {
>   $t->join();
> }
>
> $ct->kill('KILL');
> my $fin = time;
>
> qx(kill -9 $sarId\nkill -9 $jgcId);
>
> print FH localtime()."\n";
> $ct->join();
> print FH qx(curl 'http://localhost:9200/doc/type/_count?q=*');
> close(FH);
>
> new_Solr_ingest_threads.pl is similar to new_ES_ingest_threads.pl, with different parameters for the curl commands.
Only the differences are posted here:
>
> sub commit {
>   while ($no < $total) {
>     `curl 'http://localhost:8983/solr/collection2/update?commit=true'`;
>     `curl 'http://localhost:8983/solr/collection2/update?optimize=true'`;
>     print "commit on ".localtime()."\n";
>     sleep(10);
>   }
>   `curl 'http://localhost:8983/solr/collection2/update?commit=true'`;
>   print "commit done on ".localtime()."\n";
> }
>
> sub do {
>   my $c = -1;
>   while (1) {
>     {
>       lock($no);
>       $c = $no;
>       $no++;
>     }
>     last if ($c >= $total);
>     `curl -s 'http://localhost:8983/solr/collection2/update/json' --data-binary \@$dir/$c.json -H 'Content-type:application/json'`;
>     if (($c + 1) % $intv == 0) {
>       lock($ltime);
>       $curtime = time;
>       $tstr .= ($curtime - $ltime)." ";
>       $ltime = $curtime;
>     }
>   }
> }
>
> B&R
> Maco
>
> On Wednesday, June 18, 2014 4:44:35 AM UTC+8, Michael McCandless wrote:
>>
>> Hi,
>>
>> Could you post the scripts you linked to (new_ES_config.sh, new_ES_ingest_threads.pl, new_Solr_ingest_threads.pl) inline? I can't download them from where you linked.
>>
>> Optimizing every 10 seconds or 10 minutes is really not a good idea in general, but I guess if you're doing the same with ES and Solr then the comparison is at least "fair".
>>
>> It's odd you see such a slowdown with ES...
>>
>> Mike
>>
>> On Fri, Jun 13, 2014 at 2:40 PM, Cindy Hsin <cindy...@gmail.com> wrote:
>>
>>> Hi, Mark:
>>>
>>> We are doing single-document ingestion. We ran a performance comparison between Solr and Elasticsearch (ES).
>>> ES performance degrades dramatically as we increase the number of metadata fields, while Solr performance remains the same.
>>> The test uses a very small data set (i.e. 10k documents; the index is only 75 MB).
>>> The machine is a high-spec machine with 48 GB of memory.
>>> You can see ES performance drop 50% even though the machine has plenty of memory, and ES consumes all of the machine's memory once the metadata field count reaches 100k.
>>> This behavior seems abnormal since the data set is really tiny.
>>>
>>> We also tried larger data sets (i.e. 100k and 1 million documents); ES threw OOM for scenario 2 in the 1 million doc run.
>>> We want to know whether this is a bug in ES and/or whether there is a workaround (a config change) that eliminates the performance degradation.
>>> Currently ES performance does not meet the customer requirement, so we want to see if there is any way to bring ES performance to the same level as Solr.
>>>
>>> Below are the configuration settings and benchmark results for the 10k document set.
>>> Scenario 0 means there are 1000 different metadata fields in the system.
>>> Scenario 1 means there are 10k different metadata fields in the system.
>>> Scenario 2 means there are 100k different metadata fields in the system.
>>> Scenario 3 means there are 1M different metadata fields in the system.
>>>
>>>    - Hard commit and soft commit are disabled; a *client* issues a commit (ES & Solr) every 10 seconds
>>>    - ES: flush and refresh are disabled
>>>    - Solr: autoSoftCommit is disabled
>>>    - Monitor load on the system (CPU, memory, etc.) and how the ingestion speed changes over time
>>>    - Monitor the ingestion speed (is there any degradation over time?)
>>>    - New ES config: new_ES_config.sh; new ingestion: new_ES_ingest_threads.pl
>>>    - New Solr ingestion: new_Solr_ingest_threads.pl
>>>    - Flush interval: 10s
>>>
>>> Results by number of different metadata fields (ES vs. Solr):
>>>
>>> Scenario 0: 1000 fields
>>>   ES:   12 secs -> 833 docs/sec; CPU: 30.24%; heap: 1.08G; index size: 36M; iowait: 0.02%; time (secs) per 1k docs: 3 1 1 1 1 1 0 1 2 1
>>>   Solr: 13 secs -> 769 docs/sec; CPU: 28.85%; heap: 9.39G; time (secs) per 1k docs: 2 1 1 1 1 1 1 1 2 2
>>>
>>> Scenario 1: 10k fields
>>>   ES:   29 secs -> 345 docs/sec; CPU: 40.83%; heap: 5.74G; index size: 36M; iowait: 0.02%; time (secs) per 1k docs: 14 2 2 2 1 2 2 1 2 1
>>>   Solr: 12 secs -> 833 docs/sec; CPU: 28.62%; heap: 9.88G; time (secs) per 1k docs: 1 1 1 1 2 1 1 1 1 2
>>>
>>> Scenario 2: 100k fields
>>>   ES:   17 mins 44 secs -> 9.4 docs/sec; CPU: 54.73%; heap: 47.99G; index size: 75M; iowait: 0.02%; time (secs) per 1k docs: 97 183 196 147 109 89 87 49 66 40
>>>   Solr: 13 secs -> 769 docs/sec; CPU: 29.43%; heap: 9.84G; time (secs) per 1k docs: 2 1 1 1 1 1 1 1 2 2
>>>
>>> Scenario 3: 1M fields
>>>   ES:   183 mins 8 secs -> 0.9 docs/sec; CPU: 40.47%; heap: 47.99G; time (secs) per 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
>>>   Solr: 15 secs -> 666.7 docs/sec; CPU: 45.10%; heap: 9.64G; time (secs) per 1k docs: 2 1 1 1 1 2 1 1 3 2
>>>
>>> Thanks!
>>> Cindy
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4efc9c2d-ead4-4702-896d-dc32b5867859%40googlegroups.com.
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>
-- 
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8a62db16-378e-4079-a48e-461d579a1f83%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.