Hi, thanks for running the tests! My tests were capped at 10k fields, and 
that is what the improvements were aimed at; anything more than that, I, 
and anybody here on Elasticsearch (+Lucene: Mike/Robert), simply don't 
recommend and can't really stand behind when it comes to supporting it.

In Elasticsearch, there is a conscious decision to create a concrete 
mapping for every field introduced. This allows for nice upstream features, 
such as autocomplete in Kibana and Sense, as well as certain 
index/search-level optimizations that can't be done without a concrete 
mapping for each field. It also incurs a cost when many fields are 
introduced.
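
To see what that means concretely (the index and field names here are just 
for illustration), you can watch the mapping grow as fields are dynamically 
introduced:

curl -XPUT 'localhost:9200/test/doc/1' -d '{"color_ss": "red"}'

# the mapping now contains a concrete entry for color_ss; with 1M distinct
# field names, the mapping (and the cluster state) carries 1M such entries
curl -XGET 'localhost:9200/test/_mapping?pretty'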

The idea here is that a system that tries to put 1M different fields into 
Lucene is simply not going to scale. The cost overhead, and even the 
testability, of such a system is simply not something that we can support.

Aside from the obvious overhead of just wrangling so many fields in Lucene 
(merge costs that keep adding up, ...), there is also the question of what 
you plan to do with them. For example, if sorting is enabled, then there is 
a multiplied cost to loading field data for sorting (compared to using 
nested documents, where the cost is constant, since it's always the same 
field).
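
As a rough illustration (the field names are hypothetical): every distinct 
field that gets sorted on loads its own field data structure, so the cost 
scales with the number of fields instead of staying constant:

# each distinct field sorted on loads its own field data
curl -XPOST 'localhost:9200/test/_search' -d '{"sort": [{"attr_000001_i": "asc"}]}'
curl -XPOST 'localhost:9200/test/_search' -d '{"sort": [{"attr_000002_i": "asc"}]}'
# ... with 1M distinct fields, that is potentially 1M separate field data
# entries, versus a single one when all values live in one nested field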

I think there might be other factors at play in the performance test 
numbers I see below, aside from the 100k and 1M different fields scenarios. 
We can try to chase them, but the bottom line is the same: we can't support 
a system that asks for 1M different fields, as we don't believe it uses 
either ES or Lucene correctly at this point.

I suggest looking into nested documents (regardless of the system you 
decide to use) as a viable alternative to the many-fields solution. This is 
the only way you will be able to scale such a system, especially across 
multiple nodes (nested documents scale out well; many fields don't). A 
rough sketch follows.
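
Something like this (the index name and the attrs/name/value fields are 
just an illustration, not a prescribed schema):

curl -XPUT 'localhost:9200/items' -d '{
  "mappings": {
    "doc": {
      "properties": {
        "attrs": {
          "type": "nested",
          "properties": {
            "name":  { "type": "string", "index": "not_analyzed" },
            "value": { "type": "string" }
          }
        }
      }
    }
  }
}'

# each metadata field becomes a name/value pair in the same two fields,
# no matter how many distinct attribute names exist
curl -XPUT 'localhost:9200/items/doc/1' -d '{
  "attrs": [
    { "name": "color_ss",  "value": "red" },
    { "name": "author_ss", "value": "maco" }
  ]
}'

# querying an attribute always hits the same attrs.name/attrs.value fields
curl -XPOST 'localhost:9200/items/_search' -d '{
  "query": {
    "nested": {
      "path": "attrs",
      "query": { "bool": { "must": [
        { "term": { "attrs.name":  "color_ss" } },
        { "term": { "attrs.value": "red" } }
      ] } }
    }
  }
}'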

On Tuesday, July 8, 2014 11:41:11 AM UTC+2, Maco Ma wrote:
>
> Hi Kimchy,
>
> I reran the benchmark using ES 1.3 with the default settings (just 
> disabling _source & _all) and it made great progress on the performance. 
> However, Solr still outperforms ES 1.3:
> For each number of different metadata fields, the four setups compared 
> are: ES, ES with _all/codec bloom filter disabled, ES 1.3, and Solr.
>
> Scenario 0: 1000 fields
>
> ES: 12 secs -> 833 docs/sec; CPU: 30.24%; heap: 1.08G; iowait: 0.02%; 
> index size: 36Mb; time (secs) per 1k docs: 3 1 1 1 1 1 0 1 2 1
>
> ES (_all/codec bloom filter disabled): 13 secs -> 769 docs/sec; CPU: 
> 23.68%; iowait: 0.01%; heap: 1.31G; index size: 248K; time (secs) per 
> 1k docs: 2 1 1 1 1 1 1 1 2 1
>
> ES 1.3: 13 secs -> 769 docs/sec; CPU: 44.22%; iowait: 0.01%; heap: 
> 1.38G; index size: 69M; time (secs) per 1k docs: 2 1 1 1 1 1 2 0 2 2
>
> Solr: 13 secs -> 769 docs/sec; CPU: 28.85%; heap: 9.39G; time (secs) 
> per 1k docs: 2 1 1 1 1 1 1 1 2 2
>
> Scenario 1: 10k fields
>
> ES: 29 secs -> 345 docs/sec; CPU: 40.83%; heap: 5.74G; iowait: 0.02%; 
> index size: 36Mb; time (secs) per 1k docs: 14 2 2 2 1 2 2 1 2 1
>
> ES (_all/codec bloom filter disabled): 31 secs -> 322.6 docs/sec; CPU: 
> 39.29%; iowait: 0.01%; heap: 4.76G; index size: 396K; time (secs) per 
> 1k docs: 12 1 2 1 1 1 2 1 4 2
>
> ES 1.3: 20 secs -> 500 docs/sec; CPU: 54.74%; iowait: 0.02%; heap: 
> 3.06G; index size: 133M; time (secs) per 1k docs: 2 2 1 2 2 3 2 2 2 1
>
> Solr: 12 secs -> 833 docs/sec; CPU: 28.62%; heap: 9.88G; time (secs) 
> per 1k docs: 1 1 1 1 2 1 1 1 1 2
>
> Scenario 2: 100k fields
>
> ES: 17 mins 44 secs -> 9.4 docs/sec; CPU: 54.73%; heap: 47.99G; iowait: 
> 0.02%; index size: 75Mb; time (secs) per 1k docs: 97 183 196 147 109 89 
> 87 49 66 40
>
> ES (_all/codec bloom filter disabled): 14 mins 24 secs -> 11.6 docs/sec; 
> CPU: 52.30%; iowait: 0.02%; heap: n/a; index size: 1.5M; time (secs) per 
> 1k docs: 93 153 151 112 84 65 61 53 51 41
>
> ES 1.3: 1 min 24 secs -> 119 docs/sec; CPU: 47.67%; iowait: 0.12%; heap: 
> 8.66G; index size: 163M; time (secs) per 1k docs: 9 14 12 12 8 8 5 7 5 4
>
> Solr: 13 secs -> 769 docs/sec; CPU: 29.43%; heap: 9.84G; time (secs) per 
> 1k docs: 2 1 1 1 1 1 1 1 2 2
>
> Scenario 3: 1M fields
>
> ES: 183 mins 8 secs -> 0.9 docs/sec; CPU: 40.47%; heap: 47.99G; time 
> (secs) per 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
>
> ES (_all/codec bloom filter disabled): not reported
>
> ES 1.3: 11 mins 9 secs -> 15 docs/sec; CPU: 41.45%; iowait: 0.07%; heap: 
> 36.12G; index size: 163M; time (secs) per 1k docs: 12 24 38 55 70 86 106 
> 117 83 78
>
> Solr: 15 secs -> 666.7 docs/sec; CPU: 45.10%; heap: 9.64G; time (secs) 
> per 1k docs: 2 1 1 1 1 2 1 1 3 2
>
>  
>
> Best Regards
> Maco
>
> On Saturday, July 5, 2014 11:46:59 PM UTC+8, kimchy wrote:
>>
>> Heya, I worked a bit on it, and 1.x (the upcoming 1.3) now has some 
>> significant perf improvements for this case (including Lucene-level 
>> improvements that are in ES for now, but will be in the next Lucene 
>> version). Those include:
>>
>> 6648: https://github.com/elasticsearch/elasticsearch/pull/6648
>> 6714: https://github.com/elasticsearch/elasticsearch/pull/6714
>> 6707: https://github.com/elasticsearch/elasticsearch/pull/6707
>>
>> It would be interesting if you could run the tests again with the 1.x 
>> branch. Also note: please use the default settings in ES for now, with no 
>> disabling of flushing and such.
>>
>> On Friday, June 13, 2014 7:57:23 AM UTC+2, Maco Ma wrote:
>>>
>>> I'm trying to measure the performance of ingesting documents that have 
>>> lots of fields.
>>>
>>>
>>> The latest elasticsearch 1.2.1:
>>> Total docs count: 10k (a small set definitely)
>>> ES_HEAP_SIZE: 48G
>>> settings:
>>>
>>> {
>>>   "doc": {
>>>     "settings": {
>>>       "index": {
>>>         "uuid": "LiWHzE5uQrinYW1wW4E3nA",
>>>         "number_of_replicas": "0",
>>>         "translog": { "disable_flush": "true" },
>>>         "number_of_shards": "5",
>>>         "refresh_interval": "-1",
>>>         "version": { "created": "1020199" }
>>>       }
>>>     }
>>>   }
>>> }
>>>
>>> mappings:
>>>
>>> {
>>>   "doc": {
>>>     "mappings": {
>>>       "type": {
>>>         "dynamic_templates": [
>>>           { "t1": { "match": "*_ss", "mapping": { "store": false,
>>>               "norms": { "enabled": false }, "type": "string" } } },
>>>           { "t2": { "match": "*_dt", "mapping": { "store": false,
>>>               "type": "date" } } },
>>>           { "t3": { "match": "*_i", "mapping": { "store": false,
>>>               "type": "integer" } } }
>>>         ],
>>>         "_source": { "enabled": false },
>>>         "properties": {}
>>>       }
>>>     }
>>>   }
>>> }
>>>
>>> All fields in the documents match the templates in the mappings.
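>>>
>>> For example, a document matching those templates could look like this 
>>> (the field names are invented for illustration):
>>>
>>> {
>>>   "title_ss": "some string value",
>>>   "modified_dt": "2014-06-13T07:57:23Z",
>>>   "page_count_i": 42
>>> }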
>>>
>>> Since I disabled flush & refresh, I submitted a flush command (followed 
>>> by an optimize command) from the client program every 10 seconds. (I 
>>> also tried a 10-minute interval and got similar results.)
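>>>
>>> Roughly, those commands amount to the following REST calls (the index 
>>> is named "doc", per the settings above):
>>>
>>> curl -XPOST 'localhost:9200/doc/_flush'
>>> curl -XPOST 'localhost:9200/doc/_optimize'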
>>>
>>> Scenario 0 - 10k docs with 1000 different fields:
>>> Ingestion took 12 secs. Only 1.08G of heap memory was used (this counts 
>>> only the used heap).
>>>
>>>
>>> Scenario 1 - 10k docs with 10k different fields (10x the fields of 
>>> scenario 0):
>>> This time ingestion took 29 secs. Only 5.74G of heap memory was used.
>>>
>>> Not sure why the performance degrades so sharply.
>>>
>>> If I try to ingest docs with 100k different fields, it takes 17 mins 44 
>>> secs. We only have 10k docs in total, and I'm not sure why ES performs 
>>> so badly.
>>>
>>> Can anyone give suggestions on how to improve the performance?
>>>
