Two things to add to make the Elasticsearch/Solr comparison fairer. In the ES mapping, you did not disable the _all field.
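For example, a minimal sketch of recreating the index with _all disabled (assuming ES 1.x, the version used in this thread; the "doc" index and "type" names follow the thread, and `index.codec.bloom.load` is the 1.x setting for the bloom-filter point below):

```shell
# Sketch, not a drop-in replacement: create the "doc" index with the _all
# field disabled and the postings bloom filter left unloaded (ES 1.x).
curl -XPOST 'localhost:9200/doc' -d '{
  "settings" : { "index.codec.bloom.load" : false },
  "mappings" : {
    "type" : {
      "_all" : { "enabled" : false }
    }
  }
}'
```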
If the _all field is enabled, every token is indexed twice: once for its own field and once for _all. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-all-field.html

You may also want to disable the ES codec bloom filter, because loading the bloom filter consumes significant memory: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-codec.html#bloom-postings

I am also not sure why you call curl from Perl, since this adds overhead. There are nice Solr/ES Perl clients to push docs using bulk indexing.

Jörg

On Wednesday, June 18, 2014 4:50:13 AM UTC+2, Maco Ma wrote:
>
> Hi Mike,
>
> new_ES_config.sh (defines the templates and disables refresh/flush):
>
> curl -XPOST localhost:9200/doc -d '{
>   "mappings" : {
>     "type" : {
>       "_source" : { "enabled" : false },
>       "dynamic_templates" : [
>         {"t1":{
>           "match" : "*_ss",
>           "mapping":{
>             "type": "string",
>             "store": false,
>             "norms" : {"enabled" : false}
>           }
>         }},
>         {"t2":{
>           "match" : "*_dt",
>           "mapping":{
>             "type": "date",
>             "store": false
>           }
>         }},
>         {"t3":{
>           "match" : "*_i",
>           "mapping":{
>             "type": "integer",
>             "store": false
>           }
>         }}
>       ]
>     }
>   }
> }'
>
> curl -XPUT localhost:9200/doc/_settings -d '{
>   "index.refresh_interval" : "-1"
> }'
>
> curl -XPUT localhost:9200/doc/_settings -d '{
>   "index.translog.disable_flush" : true
> }'
>
> new_ES_ingest_threads.pl (spawns 10 threads that ingest the docs via curl and one thread that flushes/optimizes periodically):
>
> my $num_args = $#ARGV + 1;
> if ($num_args < 1 || $num_args > 2) {
>   print "\n usage: $0 [src_dir] [thread_count]\n";
>   exit;
> }
>
> my $INST_HOME = "/scratch/aime/elasticsearch-1.2.1";
>
> my $pid = qx(jps | sed -e '/Elasticsearch/p' -n | sed 's/ .*//');
> chomp($pid);
> if ("$pid" eq "") {
>   print "Instance is not up\n";
>   exit;
> }
>
> my $dir = $ARGV[0];
> my $td_count = 10;
> $td_count = $ARGV[1] if ($num_args == 2);
> open(FH, ">$lf");   # note: $lf (the log file name) is never defined in the posted script
> print FH "source dir: $dir\nthread_count: $td_count\n";
> print FH localtime()."\n";
>
> use threads;
> use threads::shared;
>
> my $flush_intv = 10;
>
> my $no :shared = 0;
> my $total = 10000;
> my $intv = 1000;
> my $tstr :shared = "";
> my $ltime :shared = time;
>
> sub commit {
>   $SIG{'KILL'} = sub {
>     `curl -XPOST 'http://localhost:9200/doc/_flush'`;
>     print "forced commit done on ".localtime()."\n";
>     threads->exit();
>   };
>
>   while ($no < $total) {
>     `curl -XPOST 'http://localhost:9200/doc/_flush'`;
>     `curl -XPOST 'http://localhost:9200/doc/_optimize'`;
>     print "commit on ".localtime()."\n";
>     sleep($flush_intv);
>   }
>   `curl -XPOST 'http://localhost:9200/doc/_flush'`;
>   print "commit done on ".localtime()."\n";
> }
>
> sub do {
>   my $c = -1;
>   while (1) {
>     {
>       lock($no);
>       $c = $no;
>       $no++;
>     }
>     last if ($c >= $total);
>     `curl -XPOST -s localhost:9200/doc/type/$c --data-binary \@$dir/$c.json`;
>     if (($c + 1) % $intv == 0) {
>       lock($ltime);
>       $curtime = time;
>       $tstr .= ($curtime - $ltime)." ";
>       $ltime = $curtime;
>     }
>   }
> }
>
> # start the monitor processes
> my $sarId = qx(sar -A 5 100000 -o sar5sec_$dir.out > /dev/null &\necho \$!);
> my $jgcId = qx(jstat -gc $pid 2s > jmem_$dir.out &\necho \$!);
>
> my $ct = threads->create(\&commit);
> my $start = time;
> my @ts = ();
> for $i (1..$td_count) {
>   my $t = threads->create(\&do);
>   push(@ts, $t);
> }
>
> for my $t (@ts) {
>   $t->join();
> }
>
> $ct->kill('KILL');
> my $fin = time;
>
> qx(kill -9 $sarId\nkill -9 $jgcId);
>
> print FH localtime()."\n";
> $ct->join();
> print FH qx(curl 'http://localhost:9200/doc/type/_count?q=*');
> close(FH);
>
> new_Solr_ingest_threads.pl is similar to new_ES_ingest_threads.pl, with different parameters for the curl commands.
Only the differences are posted here:
>
> sub commit {
>   while ($no < $total) {
>     `curl 'http://localhost:8983/solr/collection2/update?commit=true'`;
>     `curl 'http://localhost:8983/solr/collection2/update?optimize=true'`;
>     print "commit on ".localtime()."\n";
>     sleep(10);
>   }
>   `curl 'http://localhost:8983/solr/collection2/update?commit=true'`;
>   print "commit done on ".localtime()."\n";
> }
>
> sub do {
>   my $c = -1;
>   while (1) {
>     {
>       lock($no);
>       $c = $no;
>       $no++;
>     }
>     last if ($c >= $total);
>     `curl -s 'http://localhost:8983/solr/collection2/update/json' --data-binary \@$dir/$c.json -H 'Content-type:application/json'`;
>     if (($c + 1) % $intv == 0) {
>       lock($ltime);
>       $curtime = time;
>       $tstr .= ($curtime - $ltime)." ";
>       $ltime = $curtime;
>     }
>   }
> }
>
> B&R
> Maco
>
> On Wednesday, June 18, 2014 4:44:35 AM UTC+8, Michael McCandless wrote:
>>
>> Hi,
>>
>> Could you post the scripts you linked to (new_ES_config.sh, new_ES_ingest_threads.pl, new_Solr_ingest_threads.pl) inline? I can't download them from where you linked.
>>
>> Optimizing every 10 seconds or 10 minutes is really not a good idea in general, but I guess if you're doing the same with ES and Solr then the comparison is at least "fair".
>>
>> It's odd you see such a slowdown with ES...
>>
>> Mike
>>
>> On Fri, Jun 13, 2014 at 2:40 PM, Cindy Hsin <cindy...@gmail.com> wrote:
>>
>>> Hi, Mark:
>>>
>>> We are doing single-document ingestion. We ran a performance comparison between Solr and Elasticsearch (ES).
>>> ES performance degrades dramatically as we increase the number of metadata fields, while Solr performance remains the same.
>>> The test uses a very small data set (i.e. 10k documents; the index is only 75 MB).
>>> The machine is a high-spec machine with 48 GB of memory.
>>> You can see ES performance drop 50% even though the machine has plenty of memory, and ES consumes all of the machine's memory once the metadata field count reaches 100k.
>>> This behavior seems abnormal since the data set is really tiny.
>>>
>>> We also tried larger data sets (i.e. 100k and 1 million documents); ES threw OOM for scenario 2 in the 1 million doc run.
>>> We want to know whether this is a bug in ES and/or whether there is a workaround (a config change) that eliminates the performance degradation.
>>> Currently ES performance does not meet the customer requirement, so we want to see if there is any way to bring ES performance to the same level as Solr.
>>>
>>> Below are the configuration settings and benchmark results for the 10k document set.
>>> Scenario 0 means there are 1000 different metadata fields in the system.
>>> Scenario 1 means there are 10k different metadata fields in the system.
>>> Scenario 2 means there are 100k different metadata fields in the system.
>>> Scenario 3 means there are 1M different metadata fields in the system.
>>>
>>>    - Hard commit and soft commit are disabled; a *client* issues a commit (ES & Solr) every 10 seconds
>>>    - ES: flush and refresh are disabled
>>>    - Solr: autoSoftCommit is disabled
>>>    - Monitor load on the system (CPU, memory, etc.) and how the ingestion speed changes over time
>>>    - Monitor the ingestion speed (is there any degradation over time?)
>>>    - New ES config: new_ES_config.sh; new ingestion: new_ES_ingest_threads.pl
>>>    - New Solr ingestion: new_Solr_ingest_threads.pl
>>>    - Flush interval: 10s
>>>
>>> Results by number of different metadata fields (ES vs. Solr):
>>>
>>> Scenario 0: 1000 fields
>>>   ES:   12 secs -> 833 docs/sec; CPU: 30.24%; heap: 1.08G; index size: 36M; iowait: 0.02%; time (secs) per 1k docs: 3 1 1 1 1 1 0 1 2 1
>>>   Solr: 13 secs -> 769 docs/sec; CPU: 28.85%; heap: 9.39G; time (secs) per 1k docs: 2 1 1 1 1 1 1 1 2 2
>>>
>>> Scenario 1: 10k fields
>>>   ES:   29 secs -> 345 docs/sec; CPU: 40.83%; heap: 5.74G; index size: 36M; iowait: 0.02%; time (secs) per 1k docs: 14 2 2 2 1 2 2 1 2 1
>>>   Solr: 12 secs -> 833 docs/sec; CPU: 28.62%; heap: 9.88G; time (secs) per 1k docs: 1 1 1 1 2 1 1 1 1 2
>>>
>>> Scenario 2: 100k fields
>>>   ES:   17 mins 44 secs -> 9.4 docs/sec; CPU: 54.73%; heap: 47.99G; index size: 75M; iowait: 0.02%; time (secs) per 1k docs: 97 183 196 147 109 89 87 49 66 40
>>>   Solr: 13 secs -> 769 docs/sec; CPU: 29.43%; heap: 9.84G; time (secs) per 1k docs: 2 1 1 1 1 1 1 1 2 2
>>>
>>> Scenario 3: 1M fields
>>>   ES:   183 mins 8 secs -> 0.9 docs/sec; CPU: 40.47%; heap: 47.99G; time (secs) per 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
>>>   Solr: 15 secs -> 666.7 docs/sec; CPU: 45.10%; heap: 9.64G; time (secs) per 1k docs: 2 1 1 1 1 2 1 1 3 2
>>>
>>> Thanks!
>>> Cindy
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4efc9c2d-ead4-4702-896d-dc32b5867859%40googlegroups.com.
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>
-- 
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8a62db16-378e-4079-a48e-461d579a1f83%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.