Hi Jörg,

I re-ran the benchmark with _all and the codec bloom filter disabled: the index size shrank dramatically, but ingestion speed is still similar to before.

Scenario 0: 1000 different metadata fields
  ES (baseline):            12 secs -> 833 docs/sec; CPU 30.24%; heap 1.08GB; iowait 0.02%; index size 36MB; secs per 1k docs: 3 1 1 1 1 1 0 1 2 1
  ES (_all/bloom disabled): 13 secs -> 769 docs/sec; CPU 23.68%; heap 1.31GB; iowait 0.01%; index size 248KB; secs per 1k docs: 2 1 1 1 1 1 1 1 2 1

Scenario 1: 10k different metadata fields
  ES (baseline):            29 secs -> 345 docs/sec; CPU 40.83%; heap 5.74GB; iowait 0.02%; index size 36MB; secs per 1k docs: 14 2 2 2 1 2 2 1 2 1
  ES (_all/bloom disabled): 31 secs -> 322.6 docs/sec; CPU 39.29%; heap 47.95GB; iowait 0.01%; index size 396KB; secs per 1k docs: 12 1 2 1 1 1 2 1 4 2

Scenario 2: 100k different metadata fields
  ES (baseline):            17 mins 44 secs -> 9.4 docs/sec; CPU 54.73%; heap 47.99GB; iowait 0.02%; index size 75MB; secs per 1k docs: 97 183 196 147 109 89 87 49 66 40
  ES (_all/bloom disabled): 14 mins 24 secs -> 11.6 docs/sec; CPU 52.30%; heap 47.96GB; iowait 0.02%; index size 1.5MB; secs per 1k docs: 93 153 151 112 84 65 61 53 51 41
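For anyone reproducing these runs, it may help to confirm both settings actually took effect before trusting the numbers. A minimal check, assuming an ES 1.x node on localhost:9200 and the index name "doc" used below:

```shell
# Sketch: verify the benchmark settings are live (ES 1.x API assumed).
# The mapping should show "_all" with "enabled": false:
curl -s 'localhost:9200/doc/_mapping?pretty' | grep -A1 '"_all"'
# The settings should show index.codec.bloom.load set to false:
curl -s 'localhost:9200/doc/_settings?pretty' | grep 'bloom'
```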
We ingested one single doc per request, instead of using bulk ingestion, because that matches our real-world requirement.

Script to disable _all and the codec bloom filter:

curl -XPOST localhost:9200/doc -d '{
  "mappings" : {
    "type" : {
      "_source" : { "enabled" : false },
      "_all" : { "enabled" : false },
      "dynamic_templates" : [
        {"t1":{
          "match" : "*_ss",
          "mapping":{ "type": "string", "store": false, "norms" : {"enabled" : false} }
        }},
        {"t2":{
          "match" : "*_dt",
          "mapping":{ "type": "date", "store": false }
        }},
        {"t3":{
          "match" : "*_i",
          "mapping":{ "type": "integer", "store": false }
        }}
      ]
    }
  }
}'

curl -XPUT localhost:9200/doc/_settings -d '{
  "index.codec.bloom.load" : false
}'

Best Regards
Maco

On Monday, June 23, 2014 12:17:27 AM UTC+8, Jörg Prante wrote:
>
> Two things to add to make the Elasticsearch/Solr comparison more fair.
>
> In the ES mapping, you did not disable the _all field.
> If you have the _all field enabled, all tokens will be indexed twice: once
> for the field, once for _all.
>
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-all-field.html
>
> You may also want to disable the ES codec bloom filter
>
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-codec.html#bloom-postings
>
> because loading the bloom filter consumes significant memory.
>
> Not sure why you call curl from Perl, since this adds overhead. There are
> nice Solr/ES Perl clients to push docs using bulk indexing.
>
> Jörg
>
> On Wednesday, June 18, 2014 4:50:13 AM UTC+2, Maco Ma wrote:
>>
>> Hi Mike,
>>
>> new_ES_config.sh (defines the templates and disables refresh/flush):
>>
>> curl -XPOST localhost:9200/doc -d '{
>>   "mappings" : {
>>     "type" : {
>>       "_source" : { "enabled" : false },
>>       "dynamic_templates" : [
>>         {"t1":{
>>           "match" : "*_ss",
>>           "mapping":{ "type": "string", "store": false, "norms" : {"enabled" : false} }
>>         }},
>>         {"t2":{
>>           "match" : "*_dt",
>>           "mapping":{ "type": "date", "store": false }
>>         }},
>>         {"t3":{
>>           "match" : "*_i",
>>           "mapping":{ "type": "integer", "store": false }
>>         }}
>>       ]
>>     }
>>   }
>> }'
>>
>> curl -XPUT localhost:9200/doc/_settings -d '{
>>   "index.refresh_interval" : "-1"
>> }'
>>
>> curl -XPUT localhost:9200/doc/_settings -d '{
>>   "index.translog.disable_flush" : true
>> }'
>>
>> new_ES_ingest_threads.pl (spawns 10 threads that use curl to ingest the
>> docs, plus one thread that flushes/optimizes periodically):
>>
>> my $num_args = $#ARGV + 1;
>> if ($num_args < 1 || $num_args > 2) {
>>   print "\n usage: $0 [src_dir] [thread_count]\n";
>>   exit;
>> }
>>
>> my $INST_HOME="/scratch/aime/elasticsearch-1.2.1";
>>
>> my $pid = qx(jps | sed -e '/Elasticsearch/p' -n | sed 's/ .*//');
>> chomp($pid);
>> if( "$pid" eq "")
>> {
>>   print "Instance is not up\n";
>>   exit;
>> }
>>
>> my $dir = $ARGV[0];
>> my $td_count = 10;
>> $td_count = $ARGV[1] if($num_args == 2);
>> open(FH, ">$lf");   # note: $lf (the log file name) is not defined in the posted excerpt
>> print FH "source dir: $dir\nthread_count: $td_count\n";
>> print FH localtime()."\n";
>>
>> use threads;
>> use threads::shared;
>>
>> my $flush_intv = 10;
>>
>> my $no:shared = 0;
>> my $total = 10000;
>> my $intv = 1000;
>> my $tstr:shared = "";
>> my $ltime:shared = time;
>>
>> sub commit {
>>   $SIG{'KILL'} = sub {
>>     `curl -XPOST 'http://localhost:9200/doc/_flush'`;
>>     print "forced commit done on ".localtime()."\n";
>>     threads->exit();
>>   };
>>
>>   while ($no < $total )
>>   {
>>     `curl -XPOST 'http://localhost:9200/doc/_flush'`;
>>     `curl -XPOST 'http://localhost:9200/doc/_optimize'`;
>>     print "commit on ".localtime()."\n";
>>     sleep($flush_intv);
>>   }
>>   `curl -XPOST 'http://localhost:9200/doc/_flush'`;
>>   print "commit done on ".localtime()."\n";
>> }
>>
>> sub do {
>>   my $c = -1;
>>   while(1)
>>   {
>>     {
>>       lock($no);
>>       $c = $no;
>>       $no++;
>>     }
>>     last if($c >= $total);
>>     `curl -XPOST -s localhost:9200/doc/type/$c --data-binary \@$dir/$c.json`;
>>     if( ($c + 1) % $intv == 0 )
>>     {
>>       lock($ltime);
>>       $curtime = time;
>>       $tstr .= ($curtime - $ltime)." ";
>>       $ltime = $curtime;
>>     }
>>   }
>> }
>>
>> # start the monitor processes
>> my $sarId = qx(sar -A 5 100000 -o sar5sec_$dir.out > /dev/null &\necho \$!);
>> my $jgcId = qx(jstat -gc $pid 2s > jmem_$dir.out &\necho \$!);
>>
>> my $ct = threads->create(\&commit);
>> my $start = time;
>> my @ts = ();
>> for $i (1..$td_count)
>> {
>>   my $t = threads->create(\&do);
>>   push(@ts, $t);
>> }
>>
>> for my $t (@ts)
>> {
>>   $t->join();
>> }
>>
>> $ct->kill('KILL');
>> my $fin = time;
>>
>> qx(kill -9 $sarId\nkill -9 $jgcId);
>>
>> print FH localtime()."\n";
>> $ct->join();
>> print FH qx(curl 'http://localhost:9200/doc/type/_count?q=*');
>> close(FH);
>>
>> new_Solr_ingest_threads.pl is similar to new_ES_ingest_threads.pl and
>> uses different parameters for its curl commands.
>> Only the differences are posted here:
>>
>> sub commit {
>>   while ($no < $total )
>>   {
>>     `curl 'http://localhost:8983/solr/collection2/update?commit=true'`;
>>     `curl 'http://localhost:8983/solr/collection2/update?optimize=true'`;
>>     print "commit on ".localtime()."\n";
>>     sleep(10);
>>   }
>>   `curl 'http://localhost:8983/solr/collection2/update?commit=true'`;
>>   print "commit done on ".localtime()."\n";
>> }
>>
>> sub do {
>>   my $c = -1;
>>   while(1)
>>   {
>>     {
>>       lock($no);
>>       $c = $no;
>>       $no++;
>>     }
>>     last if($c >= $total);
>>     `curl -s 'http://localhost:8983/solr/collection2/update/json' --data-binary \@$dir/$c.json -H 'Content-type:application/json'`;
>>     if( ($c + 1) % $intv == 0 )
>>     {
>>       lock($ltime);
>>       $curtime = time;
>>       $tstr .= ($curtime - $ltime)." ";
>>       $ltime = $curtime;
>>     }
>>   }
>> }
>>
>> B&R
>> Maco
>>
>> On Wednesday, June 18, 2014 4:44:35 AM UTC+8, Michael McCandless wrote:
>>>
>>> Hi,
>>>
>>> Could you post the scripts you linked to (new_ES_config.sh,
>>> new_ES_ingest_threads.pl, new_Solr_ingest_threads.pl) inline? I can't
>>> download them from where you linked.
>>>
>>> Optimizing every 10 seconds or 10 minutes is really not a good idea in
>>> general, but I guess if you're doing the same with ES and Solr then the
>>> comparison is at least "fair".
>>>
>>> It's odd you see such a slowdown with ES...
>>>
>>> Mike
>>>
>>> On Fri, Jun 13, 2014 at 2:40 PM, Cindy Hsin <cindy...@gmail.com> wrote:
>>>
>>>> Hi, Mark:
>>>>
>>>> We are doing single-document ingestion. We did a performance comparison
>>>> between Solr and Elasticsearch (ES).
>>>> ES performance degrades dramatically when we increase the number of
>>>> metadata fields, whereas Solr performance remains the same.
>>>> The test uses a very small data set (i.e. 
10k documents; the index
>>>> size is only 75MB), and the machine is a high-spec machine with 48GB memory.
>>>> You can see ES performance drop 50% even when the machine has plenty of
>>>> memory. ES consumes all the machine memory when the metadata field count
>>>> increases to 100k.
>>>> This behavior seems abnormal since the data is really tiny.
>>>>
>>>> We also tried larger data sets (i.e. 100k and 1M documents); ES threw
>>>> OOM errors for scenario 2 of the 1M-doc run.
>>>> We want to know whether this is a bug in ES and/or whether there is any
>>>> workaround (config step) we can use to eliminate the performance
>>>> degradation.
>>>> Currently ES performance does not meet the customer requirement, so we
>>>> want to see if there is any way to bring ES performance to the same
>>>> level as Solr.
>>>>
>>>> Below are the configuration settings and benchmark results for the 10k
>>>> document set.
>>>> Scenario 0 means there are 1000 different metadata fields in the system.
>>>> Scenario 1 means there are 10k different metadata fields in the system.
>>>> Scenario 2 means there are 100k different metadata fields in the system.
>>>> Scenario 3 means there are 1M different metadata fields in the system.
>>>>
>>>> - disable hard commit & soft commit + use a client to do a commit
>>>>   (ES & Solr) every 10 seconds
>>>> - ES: flush and refresh are disabled
>>>> - Solr: autoSoftCommit is disabled
>>>> - monitor load on the system (cpu, memory, etc.) and the ingestion
>>>>   speed change over time
>>>> - monitor the ingestion speed (is there any degradation over time?)
>>>> - new ES config: new_ES_config.sh; new ES ingestion: new_ES_ingest_threads.pl
>>>> - new Solr ingestion: new_Solr_ingest_threads.pl
>>>> - flush interval: 10s
>>>>
>>>> Scenario 0: 1000 different metadata fields
>>>>   ES:   12 secs -> 833 docs/sec; CPU 30.24%; heap 1.08GB; iowait 0.02%; index size 36MB; secs per 1k docs: 3 1 1 1 1 1 0 1 2 1
>>>>   Solr: 13 secs -> 769 docs/sec; CPU 28.85%; heap 9.39GB; secs per 1k docs: 2 1 1 1 1 1 1 1 2 2
>>>>
>>>> Scenario 1: 10k different metadata fields
>>>>   ES:   29 secs -> 345 docs/sec; CPU 40.83%; heap 5.74GB; iowait 0.02%; index size 36MB; secs per 1k docs: 14 2 2 2 1 2 2 1 2 1
>>>>   Solr: 12 secs -> 833 docs/sec; CPU 28.62%; heap 9.88GB; secs per 1k docs: 1 1 1 1 2 1 1 1 1 2
>>>>
>>>> Scenario 2: 100k different metadata fields
>>>>   ES:   17 mins 44 secs -> 9.4 docs/sec; CPU 54.73%; heap 47.99GB; iowait 0.02%; index size 75MB; secs per 1k docs: 97 183 196 147 109 89 87 49 66 40
>>>>   Solr: 13 secs -> 769 docs/sec; CPU 29.43%; heap 9.84GB; secs per 1k docs: 2 1 1 1 1 1 1 1 2 2
>>>>
>>>> Scenario 3: 1M different metadata fields
>>>>   ES:   183 mins 8 secs -> 0.9 docs/sec; CPU 40.47%; heap 47.99GB; secs per 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
>>>>   Solr: 15 secs -> 666.7 docs/sec; CPU 45.10%; heap 9.64GB; secs per 1k docs: 2 1 1 1 1 2 1 1 3 2
>>>>
>>>> Thanks!
>>>> Cindy
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to elasticsearch+unsubscr...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/elasticsearch/4efc9c2d-ead4-4702-896d-dc32b5867859%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
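Following up on Jörg's point about bulk indexing: the per-document curl calls in the scripts above could be replaced by batched _bulk requests. A minimal sketch, assuming the ES 1.x bulk NDJSON syntax and the doc/type index layout used in this thread; the helper name and file layout are illustrative, not from the original scripts:

```shell
# Sketch: turn per-document JSON files (0.json, 1.json, ...) into one NDJSON
# payload for the ES _bulk API, instead of one curl process per document.
# Assumes ES 1.x bulk syntax with index "doc" and type "type" as in the thread.
build_bulk_payload() {
  src_dir=$1; out=$2
  : > "$out"
  for f in "$src_dir"/*.json; do
    id=$(basename "$f" .json)
    # action line: one per document, carrying index/type/id
    printf '{"index":{"_index":"doc","_type":"type","_id":"%s"}}\n' "$id" >> "$out"
    # source line: the bulk API requires each document on a single line
    tr -d '\n' < "$f" >> "$out"
    printf '\n' >> "$out"
  done
}

# A single request then indexes the whole batch:
# curl -s -XPOST 'localhost:9200/_bulk' --data-binary @bulk.ndjson
```

With 10k documents this turns 10,000 HTTP round-trips (and 10,000 curl process spawns) into a handful of requests, which likely matters far more than the overhead of calling curl from Perl.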