Re: ingest performance degrades sharply along with the documents having more fileds

Maco Ma Thu, 26 Jun 2014 03:30:32 -0700

Added the Solr benchmark as well:

Number of different meta data field


ES with disable _all/codec bloom filter 
ES (Ingestion & Query concurrently) 
Solr 
Solr(Ingestion & Query concurrently) 
Scenario 0: 1000

13 secs ->769 docs/sec
CPU: 23.68%
iowait: 0.01%
Heap: 1.31G
Index Size: 248K
Ingestion speed change: 2 1 1 1 1 1 1 1 2 1
14 secs ->714 docs/sec
CPU: 27.51%
iowait: 0.03%
Heap: 1.27G
Index Size: 304K
Ingestion speed change: 3 1 1 1 1 1 1 2 2 1
13 secs -> 769 docs/sec
CPU: 28.85%
Heap: 9.39G
time(secs) for each 1k docs: 2 1 1 1 1 1 1 1 2 2

14 secs->714 docs/sec
CPU: 37.02%
Heap: 10G
Ingestion speed change: 2 2 1 1 1 1 2 2 1 1 

Scenario 1: 10k

31 secs -> 322.6 docs/sec
CPU: 39.29%
iowait: 0.01%
Heap: 4.76G
Index Size: 396K
Ingestion speed change: 12 1 2 1 1 1 2 1 4 2

35 secs -> 285docs/sec
CPU: 42.46%
iowait: 0.01%
Heap: 5.14G
Index Size: 336K
Ingestion speed change: 13 2 1 1 2 1 1 4 1 2 

12 secs -> 833 docs/sec
CPU: 28.62%
Heap: 9.88G
time(secs) for each 1k docs:1 1 1 1 2 1 1 1 1 2

16 secs-> 625 docs/sec
CPU: 34.07%
Heap: 10G
Ingestion speed change: 2 2 1 1 1 1 2 2 2 2

 

List several sample queries for Solr:
curl -s 
'http://localhost:8983/solr/collection2/query?rows=0&q=field282_ss:f*'
curl -s 
'http://localhost:8983/solr/collection2/query?rows=0&q=field989_dt:\[2012-3-06T01%3A15%3A51Z%20TO%20NOW\]'
curl -s 
'http://localhost:8983/solr/collection2/query?rows=0&q=field363_i:\[0%20TO%20177\]'

filters:
curl -s 
'http://localhost:8983/solr/collection2/query?rows=0&q=*&fq=field118_i:\[0%20TO%2029\]'
curl -s 
'http://localhost:8983/solr/collection2/query?rows=0&q=*&fq=field91_dt:\[2012-1-06T01%3A15%3A51Z%20TO%20NOW\]'
curl -s 
'http://localhost:8983/solr/collection2/query?rows=0&q=*&fq=field879_ss:f*'

Maco

On Wednesday, June 25, 2014 5:23:16 PM UTC+8, Maco Ma wrote:
>
> I run the benchmark where search and ingest runs concurrently. Paste the 
> results here:
> Number of different meta data field 
> ES with disable _all/codec bloom filter 
> ES disabled params (Ingestion & Query concurrently) 
> Scenario 0: 1000
> 13 secs ->769 docs/sec
> CPU: 23.68%
> iowait: 0.01%
> Heap: 1.31G
> Index Size: 248K
> Ingestion speed change: 2 1 1 1 1 1 1 1 2 1
> 14 secs ->714 docs/sec
> CPU: 27.51%
> iowait: 0.03%
> Heap: 1.27G
> Index Size: 304K
> Ingestion speed change: 3 1 1 1 1 1 1 2 2 1
> Scenario 1: 10k
> 31 secs -> 322.6 docs/sec
> CPU: 39.29%
> iowait: 0.01%
> Heap: 4.76G
> Index Size: 396K
> Ingestion speed change: 12 1 2 1 1 1 2 1 4 2
>
> 35 secs -> 285docs/sec
> CPU: 42.46%
> iowait: 0.01%
> Heap: 5.14G
> Index Size: 336K
> Ingestion speed change: 13 2 1 1 2 1 1 4 1 2 
>
>
> I added one more thread to do the query to the existing ingestion script:
> sub query {
>   my $qstr = q(curl -s 'http://localhost:9200/doc/type/_search' 
> -d'{"query":{"filtered":{"query":{"query_string":{"fields" : [");
>   my $fstr = q(curl -s 'http://localhost:9200/doc/type/_search' 
> -d'{"query":{"filtered":{"query":{"match_all":{}},"filter":{");
>   my $fieldNum =  1000;
>
>   while ($no < $total )
>   {
>     $tr= int(rand(5));
>     if( $tr == 0 )
>     {
>       $fieldName = "field".int(rand($fieldNum))."_i";
>       $fieldValue = "*1*";
>     }
>     elsif ($tr == 1)
>     {
>       $fieldName = "field".int(rand($fieldNum))."_dt";
>       $fieldValue = "*2*";
>     }
>     else
>     {
>       $fieldName = "field".int(rand($fieldNum))."_ss";
>       $fieldValue = "f*";
>     }
>
>     $cstr = $qstr. "$fieldName" . q("],"query":") . $fieldValue . 
> q("}}}}}');
>     print $cstr."\n";
>     print `$cstr`."\n";
>
>     $tr= int(rand(5));
>     if( $tr == 0 )
>     {
>       $cstr = $fstr. q(range":{ 
> "field).int(rand($fieldNum)).q(_i":{"gte":). int(rand(1000)). q(}}}}}}');
>     }
>     elsif ($tr == 1)
>     {
>       $cstr = $fstr. q(range":{ "field). 
> int(rand($fieldNum)).q(_dt":{"from": 
> "2010-01-).(1+int(rand(31))).q(T02:10:03"}}}}}}');
>     }
>     else
>     {
>       $cstr = $fstr. 
> q(regexp":{"field).int(rand($fieldNum)).q(_ss":"f.*"}}}}}');
>     }
> print $cstr."\n";
>     print `$cstr`."\n";
>   }
> }
>
>
> Maco
>
> On Wednesday, June 25, 2014 1:04:08 AM UTC+8, Cindy Hsin wrote:
>>
>> Looks like the memory usage increased a lot with 10k fields with these 
>> two parameter disabled.
>>
>> Based on the experiment we have done, looks like ES have abnormal memory 
>> usage and performance degradation when number of fields are large (ie. 
>> 10k). Where Solr memory usage and performance remains for the large number 
>> fields. 
>>
>> If we are only looking at 10k fields scenario, is there a way for ES to 
>> make the ingest performance better (perhaps via a bug fix)? Looking at the 
>> performance number, I think this abnormal memory usage & performance drop 
>> is most likely a bug in ES layer. If this is not technically feasible then 
>> we'll report back that we have checked with ES experts and confirmed that 
>> there is no way for ES to provide a fix to address this issue. The solution 
>> Mike suggestion sounds like a workaround (ie combine multiple fields into 
>> one field to reduce the large number of fields). I can run it by our team 
>> but not sure if this will fly.
>>
>> I have also asked Maco to do one more benchmark (where search and ingest 
>> runs concurrently) for both ES and Solr to check whether there is any 
>> performance degradation for Solr when search and ingest happens 
>> concurrently. I think this is one point that Mike mentioned, right? Even 
>> with Solr, you think we will hit some performance issue with large fields 
>> when ingest and query runs concurrently.
>>
>> Thanks!
>> Cindy
>>
>> On Thursday, June 12, 2014 10:57:23 PM UTC-7, Maco Ma wrote:
>>>
>>> I try to measure the performance of ingesting the documents having lots 
>>> of fields.
>>>
>>>
>>> The latest elasticsearch 1.2.1:
>>> Total docs count: 10k (a small set definitely)
>>> ES_HEAP_SIZE: 48G
>>> settings:
>>>
>>> {"doc":{"settings":{"index":{"uuid":"LiWHzE5uQrinYW1wW4E3nA","number_of_replicas":"0","translog":{"disable_flush":"true"},"number_of_shards":"5","refresh_interval":"-1","version":{"created":"1020199"}}}}}
>>>
>>> mappings:
>>>
>>> {"doc":{"mappings":{"type":{"dynamic_templates":[{"t1":{"mapping":{"store":false,"norms":{"enabled":false},"type":"string"},"match":"*_ss"}},{"t2":{"mapping":{"store":false,"type":"date"},"match":"*_dt"}},{"t3":{"mapping":{"store":false,"type":"integer"},"match":"*_i"}}],"_source":{"enabled":false},"properties":{}}}}}
>>>
>>> All fields in the documents mach the templates in the mappings.
>>>
>>> Since I disabled the flush & refresh, I submitted the flush command 
>>> (along with optimize command after it) in the client program every 10 
>>> seconds. (I tried the another interval 10mins and got the similar results)
>>>
>>> Scenario 0 - 10k docs have 1000 different fields:
>>> Ingestion took 12 secs.  Only 1.08G heap mem is used(only states the 
>>> used heap memory).
>>>
>>>
>>> Scenario 1 - 10k docs have 10k different fields(10 times fields compared 
>>> with scenario0):
>>> This time ingestion took 29 secs.   Only 5.74G heap mem is used.
>>>
>>> Not sure why the performance degrades sharply.
>>>
>>> If I try to ingest the docs having 100k different fields, it will take 
>>> 17 mins 44 secs.  We only have 10k docs totally and not sure why ES perform 
>>> so badly. 
>>>
>>> Anyone can give suggestion to improve the performance?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/765afa4b-5b9a-414d-91f5-e1c6f234a9a9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: ingest performance degrades sharply along with the documents having more fileds

Reply via email to