Hi Jörg,

I re-ran the benchmark with _all and the codec bloom filter disabled: the index size shrank dramatically, but ingestion speed is still similar to before.

Scenario 0: 1000 different metadata fields
  ES (baseline):            12 secs -> 833 docs/sec; CPU 30.24%; heap 1.08GB; iowait 0.02%; index size 36MB; secs per 1k docs: 3 1 1 1 1 1 0 1 2 1
  ES (_all/bloom disabled): 13 secs -> 769 docs/sec; CPU 23.68%; heap 1.31GB; iowait 0.01%; index size 248KB; secs per 1k docs: 2 1 1 1 1 1 1 1 2 1

Scenario 1: 10k different metadata fields
  ES (baseline):            29 secs -> 345 docs/sec; CPU 40.83%; heap 5.74GB; iowait 0.02%; index size 36MB; secs per 1k docs: 14 2 2 2 1 2 2 1 2 1
  ES (_all/bloom disabled): 31 secs -> 322.6 docs/sec; CPU 39.29%; heap 47.95GB; iowait 0.01%; index size 396KB; secs per 1k docs: 12 1 2 1 1 1 2 1 4 2

Scenario 2: 100k different metadata fields
  ES (baseline):            17 mins 44 secs -> 9.4 docs/sec; CPU 54.73%; heap 47.99GB; iowait 0.02%; index size 75MB; secs per 1k docs: 97 183 196 147 109 89 87 49 66 40
  ES (_all/bloom disabled): 14 mins 24 secs -> 11.6 docs/sec; CPU 52.30%; heap 47.96GB; iowait 0.02%; index size 1.5MB; secs per 1k docs: 93 153 151 112 84 65 61 53 51 41
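For anyone reproducing these runs, it may help to confirm both settings actually took effect before trusting the numbers. A minimal check, assuming an ES 1.x node on localhost:9200 and the index name "doc" used below:

```shell
# Sketch: verify the benchmark settings are live (ES 1.x API assumed).
# The mapping should show "_all" with "enabled": false:
curl -s 'localhost:9200/doc/_mapping?pretty' | grep -A1 '"_all"'
# The settings should show index.codec.bloom.load set to false:
curl -s 'localhost:9200/doc/_settings?pretty' | grep 'bloom'
```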
We ingested one single doc per request, instead of using bulk ingestion, because that matches our real-world requirement.

Script to disable _all and the codec bloom filter:

curl -XPOST localhost:9200/doc -d '{
  "mappings" : {
    "type" : {
      "_source" : { "enabled" : false },
      "_all" : { "enabled" : false },
      "dynamic_templates" : [
        {"t1":{
          "match" : "*_ss",
          "mapping":{ "type": "string", "store": false, "norms" : {"enabled" : false} }
        }},
        {"t2":{
          "match" : "*_dt",
          "mapping":{ "type": "date", "store": false }
        }},
        {"t3":{
          "match" : "*_i",
          "mapping":{ "type": "integer", "store": false }
        }}
      ]
    }
  }
}'

curl -XPUT localhost:9200/doc/_settings -d '{
  "index.codec.bloom.load" : false
}'

Best Regards
Maco

On Monday, June 23, 2014 12:17:27 AM UTC+8, Jörg Prante wrote:
>
> Two things to add to make the Elasticsearch/Solr comparison more fair.
>
> In the ES mapping, you did not disable the _all field.
> If you have the _all field enabled, all tokens will be indexed twice: once
> for the field, once for _all.
>
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-all-field.html
>
> You may also want to disable the ES codec bloom filter
>
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-codec.html#bloom-postings
>
> because loading the bloom filter consumes significant memory.
>
> Not sure why you call curl from Perl, since this adds overhead. There are
> nice Solr/ES Perl clients to push docs using bulk indexing.
>
> Jörg
>
> On Wednesday, June 18, 2014 4:50:13 AM UTC+2, Maco Ma wrote:
>>
>> Hi Mike,
>>
>> new_ES_config.sh (defines the templates and disables refresh/flush):
>>
>> curl -XPOST localhost:9200/doc -d '{
>>   "mappings" : {
>>     "type" : {
>>       "_source" : { "enabled" : false },
>>       "dynamic_templates" : [
>>         {"t1":{
>>           "match" : "*_ss",
>>           "mapping":{ "type": "string", "store": false, "norms" : {"enabled" : false} }
>>         }},
>>         {"t2":{
>>           "match" : "*_dt",
>>           "mapping":{ "type": "date", "store": false }
>>         }},
>>         {"t3":{
>>           "match" : "*_i",
>>           "mapping":{ "type": "integer", "store": false }
>>         }}
>>       ]
>>     }
>>   }
>> }'
>>
>> curl -XPUT localhost:9200/doc/_settings -d '{
>>   "index.refresh_interval" : "-1"
>> }'
>>
>> curl -XPUT localhost:9200/doc/_settings -d '{
>>   "index.translog.disable_flush" : true
>> }'
>>
>> new_ES_ingest_threads.pl (spawns 10 threads that use curl to ingest the
>> docs, plus one thread that flushes/optimizes periodically):
>>
>> my $num_args = $#ARGV + 1;
>> if ($num_args < 1 || $num_args > 2) {
>>   print "\n usage: $0 [src_dir] [thread_count]\n";
>>   exit;
>> }
>>
>> my $INST_HOME="/scratch/aime/elasticsearch-1.2.1";
>>
>> my $pid = qx(jps | sed -e '/Elasticsearch/p' -n | sed 's/ .*//');
>> chomp($pid);
>> if( "$pid" eq "")
>> {
>>   print "Instance is not up\n";
>>   exit;
>> }
>>
>> my $dir = $ARGV[0];
>> my $td_count = 10;
>> $td_count = $ARGV[1] if($num_args == 2);
>> open(FH, ">$lf");   # note: $lf (the log file name) is not defined in the posted excerpt
>> print FH "source dir: $dir\nthread_count: $td_count\n";
>> print FH localtime()."\n";
>>
>> use threads;
>> use threads::shared;
>>
>> my $flush_intv = 10;
>>
>> my $no:shared = 0;
>> my $total = 10000;
>> my $intv = 1000;
>> my $tstr:shared = "";
>> my $ltime:shared = time;
>>
>> sub commit {
>>   $SIG{'KILL'} = sub {
>>     `curl -XPOST 'http://localhost:9200/doc/_flush'`;
>>     print "forced commit done on ".localtime()."\n";
>>     threads->exit();
>>   };
>>
>>   while ($no < $total )
>>   {
>>     `curl -XPOST 'http://localhost:9200/doc/_flush'`;
>>     `curl -XPOST 'http://localhost:9200/doc/_optimize'`;
>>     print "commit on ".localtime()."\n";
>>     sleep($flush_intv);
>>   }
>>   `curl -XPOST 'http://localhost:9200/doc/_flush'`;
>>   print "commit done on ".localtime()."\n";
>> }
>>
>> sub do {
>>   my $c = -1;
>>   while(1)
>>   {
>>     {
>>       lock($no);
>>       $c = $no;
>>       $no++;
>>     }
>>     last if($c >= $total);
>>     `curl -XPOST -s localhost:9200/doc/type/$c --data-binary \@$dir/$c.json`;
>>     if( ($c + 1) % $intv == 0 )
>>     {
>>       lock($ltime);
>>       $curtime = time;
>>       $tstr .= ($curtime - $ltime)." ";
>>       $ltime = $curtime;
>>     }
>>   }
>> }
>>
>> # start the monitor processes
>> my $sarId = qx(sar -A 5 100000 -o sar5sec_$dir.out > /dev/null &\necho \$!);
>> my $jgcId = qx(jstat -gc $pid 2s > jmem_$dir.out &\necho \$!);
>>
>> my $ct = threads->create(\&commit);
>> my $start = time;
>> my @ts = ();
>> for $i (1..$td_count)
>> {
>>   my $t = threads->create(\&do);
>>   push(@ts, $t);
>> }
>>
>> for my $t (@ts)
>> {
>>   $t->join();
>> }
>>
>> $ct->kill('KILL');
>> my $fin = time;
>>
>> qx(kill -9 $sarId\nkill -9 $jgcId);
>>
>> print FH localtime()."\n";
>> $ct->join();
>> print FH qx(curl 'http://localhost:9200/doc/type/_count?q=*');
>> close(FH);
>>
>> new_Solr_ingest_threads.pl is similar to new_ES_ingest_threads.pl and
>> uses different parameters for its curl commands.
>> Only the differences are posted here:
>>
>> sub commit {
>>   while ($no < $total )
>>   {
>>     `curl 'http://localhost:8983/solr/collection2/update?commit=true'`;
>>     `curl 'http://localhost:8983/solr/collection2/update?optimize=true'`;
>>     print "commit on ".localtime()."\n";
>>     sleep(10);
>>   }
>>   `curl 'http://localhost:8983/solr/collection2/update?commit=true'`;
>>   print "commit done on ".localtime()."\n";
>> }
>>
>> sub do {
>>   my $c = -1;
>>   while(1)
>>   {
>>     {
>>       lock($no);
>>       $c = $no;
>>       $no++;
>>     }
>>     last if($c >= $total);
>>     `curl -s 'http://localhost:8983/solr/collection2/update/json' --data-binary \@$dir/$c.json -H 'Content-type:application/json'`;
>>     if( ($c + 1) % $intv == 0 )
>>     {
>>       lock($ltime);
>>       $curtime = time;
>>       $tstr .= ($curtime - $ltime)." ";
>>       $ltime = $curtime;
>>     }
>>   }
>> }
>>
>> B&R
>> Maco
>>
>> On Wednesday, June 18, 2014 4:44:35 AM UTC+8, Michael McCandless wrote:
>>>
>>> Hi,
>>>
>>> Could you post the scripts you linked to (new_ES_config.sh,
>>> new_ES_ingest_threads.pl, new_Solr_ingest_threads.pl) inline? I can't
>>> download them from where you linked.
>>>
>>> Optimizing every 10 seconds or 10 minutes is really not a good idea in
>>> general, but I guess if you're doing the same with ES and Solr then the
>>> comparison is at least "fair".
>>>
>>> It's odd you see such a slowdown with ES...
>>>
>>> Mike
>>>
>>> On Fri, Jun 13, 2014 at 2:40 PM, Cindy Hsin <cindy...@gmail.com> wrote:
>>>
>>>> Hi, Mark:
>>>>
>>>> We are doing single-document ingestion. We did a performance comparison
>>>> between Solr and Elasticsearch (ES).
>>>> ES performance degrades dramatically when we increase the number of
>>>> metadata fields, whereas Solr performance remains the same.
>>>> The test uses a very small data set (i.e. 
10k documents; the index
>>>> size is only 75MB), and the machine is a high-spec machine with 48GB memory.
>>>> You can see ES performance drop 50% even when the machine has plenty of
>>>> memory. ES consumes all the machine memory when the metadata field count
>>>> increases to 100k.
>>>> This behavior seems abnormal since the data is really tiny.
>>>>
>>>> We also tried larger data sets (i.e. 100k and 1M documents); ES threw
>>>> OOM errors for scenario 2 of the 1M-doc run.
>>>> We want to know whether this is a bug in ES and/or whether there is any
>>>> workaround (config step) we can use to eliminate the performance
>>>> degradation.
>>>> Currently ES performance does not meet the customer requirement, so we
>>>> want to see if there is any way to bring ES performance to the same
>>>> level as Solr.
>>>>
>>>> Below are the configuration settings and benchmark results for the 10k
>>>> document set.
>>>> Scenario 0 means there are 1000 different metadata fields in the system.
>>>> Scenario 1 means there are 10k different metadata fields in the system.
>>>> Scenario 2 means there are 100k different metadata fields in the system.
>>>> Scenario 3 means there are 1M different metadata fields in the system.
>>>>
>>>> - disable hard commit & soft commit + use a client to do a commit
>>>>   (ES & Solr) every 10 seconds
>>>> - ES: flush and refresh are disabled
>>>> - Solr: autoSoftCommit is disabled
>>>> - monitor load on the system (cpu, memory, etc.) and the ingestion
>>>>   speed change over time
>>>> - monitor the ingestion speed (is there any degradation over time?)
>>>> - new ES config: new_ES_config.sh; new ES ingestion: new_ES_ingest_threads.pl
>>>> - new Solr ingestion: new_Solr_ingest_threads.pl
>>>> - flush interval: 10s
>>>>
>>>> Scenario 0: 1000 different metadata fields
>>>>   ES:   12 secs -> 833 docs/sec; CPU 30.24%; heap 1.08GB; iowait 0.02%; index size 36MB; secs per 1k docs: 3 1 1 1 1 1 0 1 2 1
>>>>   Solr: 13 secs -> 769 docs/sec; CPU 28.85%; heap 9.39GB; secs per 1k docs: 2 1 1 1 1 1 1 1 2 2
>>>>
>>>> Scenario 1: 10k different metadata fields
>>>>   ES:   29 secs -> 345 docs/sec; CPU 40.83%; heap 5.74GB; iowait 0.02%; index size 36MB; secs per 1k docs: 14 2 2 2 1 2 2 1 2 1
>>>>   Solr: 12 secs -> 833 docs/sec; CPU 28.62%; heap 9.88GB; secs per 1k docs: 1 1 1 1 2 1 1 1 1 2
>>>>
>>>> Scenario 2: 100k different metadata fields
>>>>   ES:   17 mins 44 secs -> 9.4 docs/sec; CPU 54.73%; heap 47.99GB; iowait 0.02%; index size 75MB; secs per 1k docs: 97 183 196 147 109 89 87 49 66 40
>>>>   Solr: 13 secs -> 769 docs/sec; CPU 29.43%; heap 9.84GB; secs per 1k docs: 2 1 1 1 1 1 1 1 2 2
>>>>
>>>> Scenario 3: 1M different metadata fields
>>>>   ES:   183 mins 8 secs -> 0.9 docs/sec; CPU 40.47%; heap 47.99GB; secs per 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
>>>>   Solr: 15 secs -> 666.7 docs/sec; CPU 45.10%; heap 9.64GB; secs per 1k docs: 2 1 1 1 1 2 1 1 3 2
>>>>
>>>> Thanks!
>>>> Cindy
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to elasticsearch+unsubscr...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/elasticsearch/4efc9c2d-ead4-4702-896d-dc32b5867859%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
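Following up on Jörg's point about bulk indexing: the per-document curl calls in the scripts above could be replaced by batched _bulk requests. A minimal sketch, assuming the ES 1.x bulk NDJSON syntax and the doc/type index layout used in this thread; the helper name and file layout are illustrative, not from the original scripts:

```shell
# Sketch: turn per-document JSON files (0.json, 1.json, ...) into one NDJSON
# payload for the ES _bulk API, instead of one curl process per document.
# Assumes ES 1.x bulk syntax with index "doc" and type "type" as in the thread.
build_bulk_payload() {
  src_dir=$1; out=$2
  : > "$out"
  for f in "$src_dir"/*.json; do
    id=$(basename "$f" .json)
    # action line: one per document, carrying index/type/id
    printf '{"index":{"_index":"doc","_type":"type","_id":"%s"}}\n' "$id" >> "$out"
    # source line: the bulk API requires each document on a single line
    tr -d '\n' < "$f" >> "$out"
    printf '\n' >> "$out"
  done
}

# A single request then indexes the whole batch:
# curl -s -XPOST 'localhost:9200/_bulk' --data-binary @bulk.ndjson
```

With 10k documents this turns 10,000 HTTP round-trips (and 10,000 curl process spawns) into a handful of requests, which likely matters far more than the overhead of calling curl from Perl.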