Re: ingest performance degrades sharply along with the documents having more fields

2014-07-08 Thread Maco Ma
Hi Kimchy,

I reran the benchmark using ES 1.3 with default settings (only _source & _all 
disabled), and it makes great progress on performance. However, Solr still 
outperforms ES 1.3:
Results by number of different metadata fields
(setups: ES 1.2.1; ES 1.2.1 with _all and the codec bloom filter disabled; ES 1.3; Solr)

Scenario 0: 1000 fields
  ES 1.2.1:            12 secs - 833 docs/sec, CPU 30.24%, heap 1.08G, iowait 0.02%, index size 36MB, secs per 1k docs: 3 1 1 1 1 1 0 1 2 1
  ES (_all/bloom off): 13 secs - 769 docs/sec, CPU 23.68%, iowait 0.01%, heap 1.31G, index size 248K, secs per 1k docs: 2 1 1 1 1 1 1 1 2 1
  ES 1.3:              13 secs - 769 docs/sec, CPU 44.22%, iowait 0.01%, heap 1.38G, index size 69M, secs per 1k docs: 2 1 1 1 1 1 2 0 2 2
  Solr:                13 secs - 769 docs/sec, CPU 28.85%, heap 9.39G, secs per 1k docs: 2 1 1 1 1 1 1 1 2 2

Scenario 1: 10k fields
  ES 1.2.1:            29 secs - 345 docs/sec, CPU 40.83%, heap 5.74G, iowait 0.02%, index size 36MB, secs per 1k docs: 14 2 2 2 1 2 2 1 2 1
  ES (_all/bloom off): 31 secs - 322.6 docs/sec, CPU 39.29%, iowait 0.01%, heap 4.76G, index size 396K, secs per 1k docs: 12 1 2 1 1 1 2 1 4 2
  ES 1.3:              20 secs - 500 docs/sec, CPU 54.74%, iowait 0.02%, heap 3.06G, index size 133M, secs per 1k docs: 2 2 1 2 2 3 2 2 2 1
  Solr:                12 secs - 833 docs/sec, CPU 28.62%, heap 9.88G, secs per 1k docs: 1 1 1 1 2 1 1 1 1 2

Scenario 2: 100k fields
  ES 1.2.1:            17 mins 44 secs - 9.4 docs/sec, CPU 54.73%, heap 47.99G, iowait 0.02%, index size 75MB, secs per 1k docs: 97 183 196 147 109 89 87 49 66 40
  ES (_all/bloom off): 14 mins 24 secs - 11.6 docs/sec, CPU 52.30%, iowait 0.02%, heap not recorded, index size 1.5M, secs per 1k docs: 93 153 151 112 84 65 61 53 51 41
  ES 1.3:              1 min 24 secs - 119 docs/sec, CPU 47.67%, iowait 0.12%, heap 8.66G, index size 163M, secs per 1k docs: 9 14 12 12 8 8 5 7 5 4
  Solr:                13 secs - 769 docs/sec, CPU 29.43%, heap 9.84G, secs per 1k docs: 2 1 1 1 1 1 1 1 2 2

Scenario 3: 1M fields
  ES 1.2.1:            183 mins 8 secs - 0.9 docs/sec, CPU 40.47%, heap 47.99G, secs per 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
  ES (_all/bloom off): no result reported
  ES 1.3:              11 mins 9 secs - 15 docs/sec, CPU 41.45%, iowait 0.07%, heap 36.12G, index size 163M, secs per 1k docs: 12 24 38 55 70 86 106 117 83 78
  Solr:                15 secs - 666.7 docs/sec, CPU 45.10%, heap 9.64G, secs per 1k docs: 2 1 1 1 1 2 1 1 3 2

Best Regards
Maco

On Saturday, July 5, 2014 11:46:59 PM UTC+8, kimchy wrote:

 Heya, I worked a bit on it, and 1.x (the upcoming 1.3) now has some significant 
 perf improvements for this case (including Lucene-side improvements that are in 
 ES for now but will be in the next Lucene version). Those include:

 6648: https://github.com/elasticsearch/elasticsearch/pull/6648
 6714: https://github.com/elasticsearch/elasticsearch/pull/6714
 6707: https://github.com/elasticsearch/elasticsearch/pull/6707

 It would be interesting if you could run the tests again with the 1.x branch. 
 Note also: please use the default settings in ES for now, with no disabling of 
 flushing and such.

 On Friday, June 13, 2014 7:57:23 AM UTC+2, Maco Ma wrote:

 I am trying to measure the performance of ingesting documents that have a 
 large number of fields.


 The latest Elasticsearch, 1.2.1:
 Total docs count: 10k (a small set, definitely)
 ES_HEAP_SIZE: 48G
 settings:

 {"doc":{"settings":{"index":{"uuid":"LiWHzE5uQrinYW1wW4E3nA","number_of_replicas":"0","translog":{"disable_flush":"true"},"number_of_shards":"5","refresh_interval":"-1","version":{"created":"1020199"}}}}}

 mappings:

 {"doc":{"mappings":{"type":{"dynamic_templates":[{"t1":{"mapping":{"store":false,"norms":{"enabled":false},"type":"string"},"match":"*_ss"}},{"t2":{"mapping":{"store":false,"type":"date"},"match":"*_dt"}},{"t3":{"mapping":{"store":false,"type":"integer"},"match":"*_i"}}],"_source":{"enabled":false},"properties":{}}}}}

 All fields in the documents match the templates in the mappings.

 Since I disabled flush & refresh, I submit a flush command (followed by an 
 optimize command) from the client program every 10 seconds. (I also tried a 
 10-minute interval and got similar results.)
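 For reference, a minimal sketch of those periodic flush + optimize calls 
 (index name doc, as in the settings above):

 curl -XPOST 'http://localhost:9200/doc/_flush'
 curl -XPOST 'http://localhost:9200/doc/_optimize'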

 Scenario 0 - 10k docs with 1000 different fields:
 Ingestion took 12 secs.  Only 1.08G of heap memory is used (this counts only 
 the used heap).


 Scenario 1 - 10k docs with 10k different fields (10x the fields of scenario 0):
 This time ingestion took 29 secs.  Only 5.74G of heap memory is used.

 I am not sure why the performance degrades so sharply.

 If I try to ingest docs with 100k different fields, it takes 17 mins 44 secs.  
 We only have 10k docs in total and I am not sure why ES performs so badly.

 Can anyone give suggestions to improve the performance?











Re: ingest performance degrades sharply along with the documents having more fields

2014-07-08 Thread kimchy
Yes, this is the equivalent of using RAMDirectory. Please don't use it. 
Mmap is optimized for random access, and if the Lucene index can fit in the 
heap (to use a RAM directory), it can certainly fit in OS RAM, without the 
implications of loading it onto the heap.
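To make that concrete, a minimal sketch of keeping the on-disk mmap store 
rather than the memory store (the index.store.type setting; the index name doc 
is assumed from this thread):

# per node, in elasticsearch.yml
index.store.type: mmapfs

# or per index, at creation time
curl -XPUT 'http://localhost:9200/doc' -d '{
  "settings" : { "index.store.type" : "mmapfs" }
}'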

On Monday, July 7, 2014 6:26:07 PM UTC+2, Mahesh Venkat wrote:

 Thanks Shay for updating us on the perf improvements.
 Apart from using the default parameters, should we follow the guidelines 
 listed in 

 http://elasticsearch-users.115913.n3.nabble.com/Is-ES-es-index-store-type-memory-equivalent-to-Lucene-s-RAMDirectory-td4057417.html

 Lucene supports MMapDirectory at the data indexing phase (in a batch), with a 
 switch to in-memory for queries to optimize search latency.

 Should we use the JVM system parameter -Des.index.store.type=memory?  Isn't 
 this equivalent to using RAMDirectory in Lucene for in-memory search queries?
 Thanks
 --Mahesh

 On Saturday, July 5, 2014 8:46:59 AM UTC-7, kimchy wrote:

 Heya, I worked a bit on it, and 1.x (the upcoming 1.3) now has some significant 
 perf improvements for this case (including Lucene-side improvements that are in 
 ES for now but will be in the next Lucene version). Those include:

 6648: https://github.com/elasticsearch/elasticsearch/pull/6648
 6714: https://github.com/elasticsearch/elasticsearch/pull/6714
 6707: https://github.com/elasticsearch/elasticsearch/pull/6707

 It would be interesting if you could run the tests again with the 1.x branch. 
 Note also: please use the default settings in ES for now, with no disabling of 
 flushing and such.

 On Friday, June 13, 2014 7:57:23 AM UTC+2, Maco Ma wrote:

 I am trying to measure the performance of ingesting documents that have a 
 large number of fields.


 The latest Elasticsearch, 1.2.1:
 Total docs count: 10k (a small set, definitely)
 ES_HEAP_SIZE: 48G
 settings:

 {"doc":{"settings":{"index":{"uuid":"LiWHzE5uQrinYW1wW4E3nA","number_of_replicas":"0","translog":{"disable_flush":"true"},"number_of_shards":"5","refresh_interval":"-1","version":{"created":"1020199"}}}}}

 mappings:

 {"doc":{"mappings":{"type":{"dynamic_templates":[{"t1":{"mapping":{"store":false,"norms":{"enabled":false},"type":"string"},"match":"*_ss"}},{"t2":{"mapping":{"store":false,"type":"date"},"match":"*_dt"}},{"t3":{"mapping":{"store":false,"type":"integer"},"match":"*_i"}}],"_source":{"enabled":false},"properties":{}}}}}

 All fields in the documents match the templates in the mappings.

 Since I disabled flush & refresh, I submit a flush command (followed by an 
 optimize command) from the client program every 10 seconds. (I also tried a 
 10-minute interval and got similar results.)

 Scenario 0 - 10k docs with 1000 different fields:
 Ingestion took 12 secs.  Only 1.08G of heap memory is used (this counts only 
 the used heap).


 Scenario 1 - 10k docs with 10k different fields (10x the fields of scenario 0):
 This time ingestion took 29 secs.  Only 5.74G of heap memory is used.

 I am not sure why the performance degrades so sharply.

 If I try to ingest docs with 100k different fields, it takes 17 mins 44 secs.  
 We only have 10k docs in total and I am not sure why ES performs so badly.

 Can anyone give suggestions to improve the performance?











Re: ingest performance degrades sharply along with the documents having more fields

2014-07-08 Thread kimchy
Hi, thanks for running the tests! My tests were capped at 10k fields, and the 
improvements target that range; anything beyond that is something that I, and 
anybody here at Elasticsearch (+Lucene: Mike/Robert), simply don't recommend 
and can't really stand behind when it comes to supporting it.

In Elasticsearch, there is a conscious decision to have concrete mappings 
for every field introduced. This allows for nice upstream features, such as 
autocomplete in Kibana and Sense, as well as certain index/search-level 
optimizations that can't be done without a concrete mapping for each field 
introduced. This incurs a cost when many fields are introduced.

The idea here is that a system that tries to put 1M different fields into 
Lucene is simply not going to scale. The cost overhead, and even the 
testability, of such a system is simply not something that we can support.

Aside from the obvious overhead of just wrangling so many fields in Lucene 
(merge costs that keep adding up, ...), there is also the question of what 
you plan to do with those fields. For example, if sorting is enabled, then 
there is a multiplied cost in loading them for sorting (compared to using 
nested documents, where the cost is constant, since it is the same field).

I think there might be other factors at play in the performance test 
numbers I see below, aside from the 100k and 1M different-fields scenarios. 
We can try to chase them, but the bottom line is the same: we can't support 
a system that asks for 1M different fields, as we don't believe that uses 
either ES or Lucene correctly at this point.

I suggest looking into nested documents (regardless of the system you 
decide to use) as a viable alternative to the many-fields solution. This is 
the only way you will be able to scale such a system, especially across 
multiple nodes (nested documents scale out well; many fields don't).
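As an illustration of that direction (a sketch only, not something from the 
benchmarks above), a nested key/value mapping in ES 1.x could look like the 
following; the index name docs and the field names attrs/key/value are made up 
for the example:

curl -XPOST 'http://localhost:9200/docs' -d '{
  "mappings" : {
    "type" : {
      "properties" : {
        "attrs" : {
          "type" : "nested",
          "properties" : {
            "key"   : { "type" : "string", "index" : "not_analyzed" },
            "value" : { "type" : "string", "index" : "not_analyzed" }
          }
        }
      }
    }
  }
}'

Each original metadata field then becomes one key/value object inside attrs 
instead of a new top-level field, so the mapping stays the same size no matter 
how many distinct metadata names show up.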

On Tuesday, July 8, 2014 11:41:11 AM UTC+2, Maco Ma wrote:

 Hi Kimchy,

 I reran the benchmark using ES 1.3 with default settings (only _source & _all 
 disabled), and it makes great progress on performance. However, Solr still 
 outperforms ES 1.3:
 Results by number of different metadata fields
 (setups: ES 1.2.1; ES 1.2.1 with _all and the codec bloom filter disabled; ES 1.3; Solr)

 Scenario 0: 1000 fields
   ES 1.2.1:            12 secs - 833 docs/sec, CPU 30.24%, heap 1.08G, iowait 0.02%, index size 36MB, secs per 1k docs: 3 1 1 1 1 1 0 1 2 1
   ES (_all/bloom off): 13 secs - 769 docs/sec, CPU 23.68%, iowait 0.01%, heap 1.31G, index size 248K, secs per 1k docs: 2 1 1 1 1 1 1 1 2 1
   ES 1.3:              13 secs - 769 docs/sec, CPU 44.22%, iowait 0.01%, heap 1.38G, index size 69M, secs per 1k docs: 2 1 1 1 1 1 2 0 2 2
   Solr:                13 secs - 769 docs/sec, CPU 28.85%, heap 9.39G, secs per 1k docs: 2 1 1 1 1 1 1 1 2 2

 Scenario 1: 10k fields
   ES 1.2.1:            29 secs - 345 docs/sec, CPU 40.83%, heap 5.74G, iowait 0.02%, index size 36MB, secs per 1k docs: 14 2 2 2 1 2 2 1 2 1
   ES (_all/bloom off): 31 secs - 322.6 docs/sec, CPU 39.29%, iowait 0.01%, heap 4.76G, index size 396K, secs per 1k docs: 12 1 2 1 1 1 2 1 4 2
   ES 1.3:              20 secs - 500 docs/sec, CPU 54.74%, iowait 0.02%, heap 3.06G, index size 133M, secs per 1k docs: 2 2 1 2 2 3 2 2 2 1
   Solr:                12 secs - 833 docs/sec, CPU 28.62%, heap 9.88G, secs per 1k docs: 1 1 1 1 2 1 1 1 1 2

 Scenario 2: 100k fields
   ES 1.2.1:            17 mins 44 secs - 9.4 docs/sec, CPU 54.73%, heap 47.99G, iowait 0.02%, index size 75MB, secs per 1k docs: 97 183 196 147 109 89 87 49 66 40
   ES (_all/bloom off): 14 mins 24 secs - 11.6 docs/sec, CPU 52.30%, iowait 0.02%, heap not recorded, index size 1.5M, secs per 1k docs: 93 153 151 112 84 65 61 53 51 41
   ES 1.3:              1 min 24 secs - 119 docs/sec, CPU 47.67%, iowait 0.12%, heap 8.66G, index size 163M, secs per 1k docs: 9 14 12 12 8 8 5 7 5 4
   Solr:                13 secs - 769 docs/sec, CPU 29.43%, heap 9.84G, secs per 1k docs: 2 1 1 1 1 1 1 1 2 2

 Scenario 3: 1M fields
   ES 1.2.1:            183 mins 8 secs - 0.9 docs/sec, CPU 40.47%, heap 47.99G, secs per 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
   ES (_all/bloom off): no result reported
   ES 1.3:              11 mins 9 secs - 15 docs/sec, CPU 41.45%, iowait 0.07%, heap 36.12G, index size 163M, secs per 1k docs: 12 24 38 55 70 86 106 117 83 78
   Solr:                15 secs - 666.7 docs/sec, CPU 45.10%, heap 9.64G, secs per 1k docs: 2 1 1 1 1 2 1 1 3 2

 Best Regards
 Maco

 On Saturday, July 5, 2014 11:46:59 PM UTC+8, kimchy wrote:

 Heya, I worked a bit on it, and 1.x (the upcoming 1.3) now has some significant 
 perf improvements for this case (including Lucene-side improvements that are in 
 ES for now but will be in the next Lucene version). Those include:

 6648: https://github.com/elasticsearch/elasticsearch/pull/6648
 6714: https://github.com/elasticsearch/elasticsearch/pull/6714
 6707: https://github.com/elasticsearch/elasticsearch/pull/6707

 It would be interesting if you could run the tests again with the 1.x branch. 
 Note also: please use the default settings in ES for now, with no disabling of 
 flushing and such.

 On Friday, June 13, 2014 7:57:23 AM UTC+2, Maco Ma wrote:

 I am trying to measure the performance of ingesting documents that have lots 
 of 

Re: ingest performance degrades sharply along with the documents having more fields

2014-07-07 Thread Mahesh Venkat
Thanks Shay for updating us on the perf improvements.
Apart from using the default parameters, should we follow the guidelines 
listed in 

http://elasticsearch-users.115913.n3.nabble.com/Is-ES-es-index-store-type-memory-equivalent-to-Lucene-s-RAMDirectory-td4057417.html

Lucene supports MMapDirectory at the data indexing phase (in a batch), with a 
switch to in-memory for queries to optimize search latency.

Should we use the JVM system parameter -Des.index.store.type=memory?  Isn't 
this equivalent to using RAMDirectory in Lucene for in-memory search queries?
Thanks
--Mahesh

On Saturday, July 5, 2014 8:46:59 AM UTC-7, kimchy wrote:

 Heya, I worked a bit on it, and 1.x (the upcoming 1.3) now has some significant 
 perf improvements for this case (including Lucene-side improvements that are in 
 ES for now but will be in the next Lucene version). Those include:

 6648: https://github.com/elasticsearch/elasticsearch/pull/6648
 6714: https://github.com/elasticsearch/elasticsearch/pull/6714
 6707: https://github.com/elasticsearch/elasticsearch/pull/6707

 It would be interesting if you could run the tests again with the 1.x branch. 
 Note also: please use the default settings in ES for now, with no disabling of 
 flushing and such.

 On Friday, June 13, 2014 7:57:23 AM UTC+2, Maco Ma wrote:

 I am trying to measure the performance of ingesting documents that have a 
 large number of fields.


 The latest Elasticsearch, 1.2.1:
 Total docs count: 10k (a small set, definitely)
 ES_HEAP_SIZE: 48G
 settings:

 {"doc":{"settings":{"index":{"uuid":"LiWHzE5uQrinYW1wW4E3nA","number_of_replicas":"0","translog":{"disable_flush":"true"},"number_of_shards":"5","refresh_interval":"-1","version":{"created":"1020199"}}}}}

 mappings:

 {"doc":{"mappings":{"type":{"dynamic_templates":[{"t1":{"mapping":{"store":false,"norms":{"enabled":false},"type":"string"},"match":"*_ss"}},{"t2":{"mapping":{"store":false,"type":"date"},"match":"*_dt"}},{"t3":{"mapping":{"store":false,"type":"integer"},"match":"*_i"}}],"_source":{"enabled":false},"properties":{}}}}}

 All fields in the documents match the templates in the mappings.

 Since I disabled flush & refresh, I submit a flush command (followed by an 
 optimize command) from the client program every 10 seconds. (I also tried a 
 10-minute interval and got similar results.)

 Scenario 0 - 10k docs with 1000 different fields:
 Ingestion took 12 secs.  Only 1.08G of heap memory is used (this counts only 
 the used heap).


 Scenario 1 - 10k docs with 10k different fields (10x the fields of scenario 0):
 This time ingestion took 29 secs.  Only 5.74G of heap memory is used.

 I am not sure why the performance degrades so sharply.

 If I try to ingest docs with 100k different fields, it takes 17 mins 44 secs.  
 We only have 10k docs in total and I am not sure why ES performs so badly.

 Can anyone give suggestions to improve the performance?











Re: ingest performance degrades sharply along with the documents having more fields

2014-07-05 Thread kimchy
Heya, I worked a bit on it, and 1.x (the upcoming 1.3) now has some significant 
perf improvements for this case (including Lucene-side improvements that are in 
ES for now but will be in the next Lucene version). Those include:

6648: https://github.com/elasticsearch/elasticsearch/pull/6648
6714: https://github.com/elasticsearch/elasticsearch/pull/6714
6707: https://github.com/elasticsearch/elasticsearch/pull/6707

It would be interesting if you could run the tests again with the 1.x branch. 
Note also: please use the default settings in ES for now, with no disabling of 
flushing and such.
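For completeness, a sketch of putting those settings back to their defaults on 
the test index (index name doc and setting names taken from the scripts in this 
thread; 1s is the default refresh interval):

curl -XPUT 'http://localhost:9200/doc/_settings' -d '{
  "index.refresh_interval" : "1s",
  "index.translog.disable_flush" : false
}'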

On Friday, June 13, 2014 7:57:23 AM UTC+2, Maco Ma wrote:

 I am trying to measure the performance of ingesting documents that have a 
 large number of fields.


 The latest Elasticsearch, 1.2.1:
 Total docs count: 10k (a small set, definitely)
 ES_HEAP_SIZE: 48G
 settings:

 {"doc":{"settings":{"index":{"uuid":"LiWHzE5uQrinYW1wW4E3nA","number_of_replicas":"0","translog":{"disable_flush":"true"},"number_of_shards":"5","refresh_interval":"-1","version":{"created":"1020199"}}}}}

 mappings:

 {"doc":{"mappings":{"type":{"dynamic_templates":[{"t1":{"mapping":{"store":false,"norms":{"enabled":false},"type":"string"},"match":"*_ss"}},{"t2":{"mapping":{"store":false,"type":"date"},"match":"*_dt"}},{"t3":{"mapping":{"store":false,"type":"integer"},"match":"*_i"}}],"_source":{"enabled":false},"properties":{}}}}}

 All fields in the documents match the templates in the mappings.

 Since I disabled flush & refresh, I submit a flush command (followed by an 
 optimize command) from the client program every 10 seconds. (I also tried a 
 10-minute interval and got similar results.)

 Scenario 0 - 10k docs with 1000 different fields:
 Ingestion took 12 secs.  Only 1.08G of heap memory is used (this counts only 
 the used heap).


 Scenario 1 - 10k docs with 10k different fields (10x the fields of scenario 0):
 This time ingestion took 29 secs.  Only 5.74G of heap memory is used.

 I am not sure why the performance degrades so sharply.

 If I try to ingest docs with 100k different fields, it takes 17 mins 44 secs.  
 We only have 10k docs in total and I am not sure why ES performs so badly.

 Can anyone give suggestions to improve the performance?











Re: ingest performance degrades sharply along with the documents having more fields

2014-06-26 Thread Maco Ma
Added the Solr benchmark as well:

Results by number of different metadata fields
(setups: ES with _all/codec bloom filter disabled; ES with ingestion & query running concurrently; Solr; Solr with ingestion & query running concurrently)

Scenario 0: 1000 fields
  ES (_all/bloom off):    13 secs - 769 docs/sec, CPU 23.68%, iowait 0.01%, heap 1.31G, index size 248K, secs per 1k docs: 2 1 1 1 1 1 1 1 2 1
  ES (ingest + query):    14 secs - 714 docs/sec, CPU 27.51%, iowait 0.03%, heap 1.27G, index size 304K, secs per 1k docs: 3 1 1 1 1 1 1 2 2 1
  Solr:                   13 secs - 769 docs/sec, CPU 28.85%, heap 9.39G, secs per 1k docs: 2 1 1 1 1 1 1 1 2 2
  Solr (ingest + query):  14 secs - 714 docs/sec, CPU 37.02%, heap 10G, secs per 1k docs: 2 2 1 1 1 1 2 2 1 1

Scenario 1: 10k fields
  ES (_all/bloom off):    31 secs - 322.6 docs/sec, CPU 39.29%, iowait 0.01%, heap 4.76G, index size 396K, secs per 1k docs: 12 1 2 1 1 1 2 1 4 2
  ES (ingest + query):    35 secs - 285 docs/sec, CPU 42.46%, iowait 0.01%, heap 5.14G, index size 336K, secs per 1k docs: 13 2 1 1 2 1 1 4 1 2
  Solr:                   12 secs - 833 docs/sec, CPU 28.62%, heap 9.88G, secs per 1k docs: 1 1 1 1 2 1 1 1 1 2
  Solr (ingest + query):  16 secs - 625 docs/sec, CPU 34.07%, heap 10G, secs per 1k docs: 2 2 1 1 1 1 2 2 2 2

Several sample queries for Solr:
curl -s 'http://localhost:8983/solr/collection2/query?rows=0&q=field282_ss:f*'
curl -s 'http://localhost:8983/solr/collection2/query?rows=0&q=field989_dt:\[2012-3-06T01%3A15%3A51Z%20TO%20NOW\]'
curl -s 'http://localhost:8983/solr/collection2/query?rows=0&q=field363_i:\[0%20TO%20177\]'

filters:
curl -s 'http://localhost:8983/solr/collection2/query?rows=0&q=*&fq=field118_i:\[0%20TO%2029\]'
curl -s 'http://localhost:8983/solr/collection2/query?rows=0&q=*&fq=field91_dt:\[2012-1-06T01%3A15%3A51Z%20TO%20NOW\]'
curl -s 'http://localhost:8983/solr/collection2/query?rows=0&q=*&fq=field879_ss:f*'

Maco

On Wednesday, June 25, 2014 5:23:16 PM UTC+8, Maco Ma wrote:

 I ran the benchmark where search and ingest run concurrently. The results are 
 below:
 Results by number of different metadata fields
 (setups: ES with _all/codec bloom filter disabled; the same disabled setup with ingestion & query running concurrently)

 Scenario 0: 1000 fields
   ES (_all/bloom off):            13 secs - 769 docs/sec, CPU 23.68%, iowait 0.01%, heap 1.31G, index size 248K, secs per 1k docs: 2 1 1 1 1 1 1 1 2 1
   ES (off, ingest + query):       14 secs - 714 docs/sec, CPU 27.51%, iowait 0.03%, heap 1.27G, index size 304K, secs per 1k docs: 3 1 1 1 1 1 1 2 2 1

 Scenario 1: 10k fields
   ES (_all/bloom off):            31 secs - 322.6 docs/sec, CPU 39.29%, iowait 0.01%, heap 4.76G, index size 396K, secs per 1k docs: 12 1 2 1 1 1 2 1 4 2
   ES (off, ingest + query):       35 secs - 285 docs/sec, CPU 42.46%, iowait 0.01%, heap 5.14G, index size 336K, secs per 1k docs: 13 2 1 1 2 1 1 4 1 2


 I added one more thread that issues queries alongside the existing ingestion 
 script:

 # query thread: issue a random query_string query and a random filter query
 # against the doc index
 sub query {
   my $qstr = q(curl -s 'http://localhost:9200/doc/type/_search' -d'{"query":{"filtered":{"query":{"query_string":{"fields" : [");
   my $fstr = q(curl -s 'http://localhost:9200/doc/type/_search' -d'{"query":{"filtered":{"query":{"match_all":{}},"filter":{");
   my $fieldNum = 1000;

   while ($no < $total )
   {
     $tr = int(rand(5));
     if( $tr == 0 )
     {
       $fieldName  = "field".int(rand($fieldNum))."_i";
       $fieldValue = "1";
     }
     elsif ($tr == 1)
     {
       $fieldName  = "field".int(rand($fieldNum))."_dt";
       $fieldValue = "2";
     }
     else
     {
       $fieldName  = "field".int(rand($fieldNum))."_ss";
       $fieldValue = "f*";
     }

     $cstr = $qstr . $fieldName . q("],"query":") . $fieldValue . q("}');
     print $cstr."\n";
     print `$cstr`."\n";

     $tr = int(rand(5));
     if( $tr == 0 )
     {
       $cstr = $fstr . q("range":{ "field") . int(rand($fieldNum)) . q(_i":{"gte":) . int(rand(1000)) . q(}}');
     }
     elsif ($tr == 1)
     {
       $cstr = $fstr . q("range":{ "field") . int(rand($fieldNum)) . q(_dt":{"from": "2010-01-) . (1+int(rand(31))) . q(T02:10:03"}}');
     }
     else
     {
       $cstr = $fstr . q("regexp":{"field") . int(rand($fieldNum)) . q(_ss":"f.*"}');
     }
     print $cstr."\n";
     print `$cstr`."\n";
   }
 }


 Maco

 On Wednesday, June 25, 2014 1:04:08 AM UTC+8, Cindy Hsin wrote:

 Looks like the memory usage increased a lot with 10k fields with these two 
 parameters disabled.

 Based on the experiments we have done, it looks like ES has abnormal memory 
 usage and performance degradation when the number of fields is large (i.e. 
 10k), whereas Solr memory usage and performance remain stable for a large 
 number of fields. 

 If we are only looking at the 10k-fields scenario, is there a way for ES to 
 make the ingest performance better (perhaps via a bug fix)? Looking at the 
 performance numbers, I think this abnormal memory usage & performance drop 
 is most likely a bug in the ES layer. If this is not technically feasible then 
 we'll report back that we have checked with ES experts and confirmed that 
 there is no way for ES to provide a fix to address this issue. The solution 
 Mike suggested sounds like a workaround (i.e. combine multiple fields into 
 one field to reduce the large number of fields). I can run it by our team 
 but I am not sure if this will fly.

 I have also asked Maco to do one more 

Re: ingest performance degrades sharply along with the documents having more fields

2014-06-25 Thread Maco Ma
I ran the benchmark where search and ingest run concurrently. The results are 
below:
Results by number of different metadata fields
(setups: ES with _all/codec bloom filter disabled; the same disabled setup with ingestion & query running concurrently)

Scenario 0: 1000 fields
  ES (_all/bloom off):            13 secs - 769 docs/sec, CPU 23.68%, iowait 0.01%, heap 1.31G, index size 248K, secs per 1k docs: 2 1 1 1 1 1 1 1 2 1
  ES (off, ingest + query):       14 secs - 714 docs/sec, CPU 27.51%, iowait 0.03%, heap 1.27G, index size 304K, secs per 1k docs: 3 1 1 1 1 1 1 2 2 1

Scenario 1: 10k fields
  ES (_all/bloom off):            31 secs - 322.6 docs/sec, CPU 39.29%, iowait 0.01%, heap 4.76G, index size 396K, secs per 1k docs: 12 1 2 1 1 1 2 1 4 2
  ES (off, ingest + query):       35 secs - 285 docs/sec, CPU 42.46%, iowait 0.01%, heap 5.14G, index size 336K, secs per 1k docs: 13 2 1 1 2 1 1 4 1 2


I added one more thread that issues queries alongside the existing ingestion 
script:

# query thread: issue a random query_string query and a random filter query
# against the doc index
sub query {
  my $qstr = q(curl -s 'http://localhost:9200/doc/type/_search' -d'{"query":{"filtered":{"query":{"query_string":{"fields" : [");
  my $fstr = q(curl -s 'http://localhost:9200/doc/type/_search' -d'{"query":{"filtered":{"query":{"match_all":{}},"filter":{");
  my $fieldNum = 1000;

  while ($no < $total )
  {
    $tr = int(rand(5));
    if( $tr == 0 )
    {
      $fieldName  = "field".int(rand($fieldNum))."_i";
      $fieldValue = "1";
    }
    elsif ($tr == 1)
    {
      $fieldName  = "field".int(rand($fieldNum))."_dt";
      $fieldValue = "2";
    }
    else
    {
      $fieldName  = "field".int(rand($fieldNum))."_ss";
      $fieldValue = "f*";
    }

    $cstr = $qstr . $fieldName . q("],"query":") . $fieldValue . q("}');
    print $cstr."\n";
    print `$cstr`."\n";

    $tr = int(rand(5));
    if( $tr == 0 )
    {
      $cstr = $fstr . q("range":{ "field") . int(rand($fieldNum)) . q(_i":{"gte":) . int(rand(1000)) . q(}}');
    }
    elsif ($tr == 1)
    {
      $cstr = $fstr . q("range":{ "field") . int(rand($fieldNum)) . q(_dt":{"from": "2010-01-) . (1+int(rand(31))) . q(T02:10:03"}}');
    }
    else
    {
      $cstr = $fstr . q("regexp":{"field") . int(rand($fieldNum)) . q(_ss":"f.*"}');
    }
    print $cstr."\n";
    print `$cstr`."\n";
  }
}


Maco

On Wednesday, June 25, 2014 1:04:08 AM UTC+8, Cindy Hsin wrote:

 Looks like the memory usage increased a lot with 10k fields with these two 
 parameters disabled.

 Based on the experiments we have done, it looks like ES has abnormal memory 
 usage and performance degradation when the number of fields is large (i.e. 
 10k), whereas Solr memory usage and performance remain stable for a large 
 number of fields. 

 If we are only looking at the 10k-fields scenario, is there a way for ES to 
 make the ingest performance better (perhaps via a bug fix)? Looking at the 
 performance numbers, I think this abnormal memory usage & performance drop 
 is most likely a bug in the ES layer. If this is not technically feasible then 
 we'll report back that we have checked with ES experts and confirmed that 
 there is no way for ES to provide a fix to address this issue. The solution 
 Mike suggested sounds like a workaround (i.e. combine multiple fields into 
 one field to reduce the large number of fields). I can run it by our team 
 but I am not sure if this will fly.

 I have also asked Maco to do one more benchmark (where search and ingest 
 run concurrently) for both ES and Solr, to check whether there is any 
 performance degradation for Solr when search and ingest happen 
 concurrently. I think this is one point that Mike mentioned, right? Even 
 with Solr, you think we will hit some performance issues with a large 
 number of fields when ingest and query run concurrently.

 Thanks!
 Cindy

 On Thursday, June 12, 2014 10:57:23 PM UTC-7, Maco Ma wrote:

 I am trying to measure the performance of ingesting documents that have a 
 large number of fields.


 The latest Elasticsearch, 1.2.1:
 Total docs count: 10k (a small set, definitely)
 ES_HEAP_SIZE: 48G
 settings:

 {"doc":{"settings":{"index":{"uuid":"LiWHzE5uQrinYW1wW4E3nA","number_of_replicas":"0","translog":{"disable_flush":"true"},"number_of_shards":"5","refresh_interval":"-1","version":{"created":"1020199"}}}}}

 mappings:

 {"doc":{"mappings":{"type":{"dynamic_templates":[{"t1":{"mapping":{"store":false,"norms":{"enabled":false},"type":"string"},"match":"*_ss"}},{"t2":{"mapping":{"store":false,"type":"date"},"match":"*_dt"}},{"t3":{"mapping":{"store":false,"type":"integer"},"match":"*_i"}}],"_source":{"enabled":false},"properties":{}}}}}

 All fields in the documents match the templates in the mappings.

 Since I disabled flush & refresh, I submit a flush command (followed by an 
 optimize command) from the client program every 10 seconds. (I also tried a 
 10-minute interval and got similar results.)

 Scenario 0 - 10k docs with 1000 different fields:
 Ingestion took 12 secs.  Only 1.08G of heap memory is used (this counts only 
 the used heap).


 Scenario 1 - 10k docs with 10k different fields (10x the fields of scenario 0):
 This time ingestion took 29 secs.  Only 5.74G of heap memory is used.

 I am not sure why the performance degrades so sharply.

 If I try to ingest docs with 100k different fields, it takes 17 mins 44 secs.  
 We only have 10k docs in total and I am not sure why ES performs 

Re: ingest performance degrades sharply along with the documents having more fields

2014-06-25 Thread Michael McCandless
Some responses below:

On Tue, Jun 24, 2014 at 7:04 PM, Cindy Hsin cindy.h...@gmail.com wrote:

 Looks like the memory usage increased a lot with 10k fields with these two
 parameters disabled.

 Based on the experiments we have done, it looks like ES has abnormal memory
 usage and performance degradation when the number of fields is large (i.e.
 10k), whereas Solr memory usage and performance remain stable for a large
 number of fields.

 If we are only looking at the 10k-fields scenario, is there a way for ES to
 make the ingest performance better (perhaps via a bug fix)?


I've opened an ES issue to address the slowdown as more and more unique
fields are added via dynamic templates:
https://github.com/elasticsearch/elasticsearch/issues/6619


 The solution Mike suggested sounds like a workaround (i.e. combine multiple
 fields into one field to reduce the large number of fields). I can run it
 by our team but I am not sure if this will fly.


Well, I think both Solr and ES (once we fix the above issue) will still
have high cost if you index so many fields, since they both are based on
Lucene.

One simple but effective approach, whether you use Solr or ES, is to use
nested documents, where the parent document holds any common fields
across all of your documents, and then each child document has two fields,
key and value.  key holds the original field name you wanted to index, and
value holds the original field value, so you have as many child documents
as you had field+values to index for your original document.  This approach
has worked well in other applications that needed so many fields...

It essentially trades the wide range of field names for a wide range of field
values instead, which Lucene handles very well.  It results in more, smaller
documents, but this scales out well as you add nodes.
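As a rough sketch of that shape (the index name docs and the field names attrs,
key, and value are illustrative, not from this thread), a document that
originally carried the fields color_ss=red and size_i=42 would be indexed
roughly like this:

curl -XPOST 'http://localhost:9200/docs/type/1' -d '{
  "title" : "a common field shared by all documents",
  "attrs" : [
    { "key" : "color_ss", "value" : "red" },
    { "key" : "size_i",   "value" : "42" }
  ]
}'

A filter on a specific metadata field then becomes a query on key plus value
(nested or parent/child, depending on the mapping), so the number of distinct
Lucene fields stays fixed no matter how many metadata names exist.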

Mike McCandless

http://blog.mikemccandless.com



Re: ingest performance degrades sharply along with the documents having more fields

2014-06-24 Thread Maco Ma
Hi Jörg,

I reran the benchmark with the _all field and the codec bloom filter disabled: 
the index data size was reduced dramatically, but the ingestion speed is still 
similar to before:
Results by number of different metadata fields
(setups: ES baseline; ES with _all and the codec bloom filter disabled)

Scenario 0: 1000 fields
  ES:                  12 secs - 833 docs/sec, CPU 30.24%, heap 1.08G, iowait 0.02%, index size 36MB, secs per 1k docs: 3 1 1 1 1 1 0 1 2 1
  ES (_all/bloom off): 13 secs - 769 docs/sec, CPU 23.68%, iowait 0.01%, heap 1.31G, index size 248K, secs per 1k docs: 2 1 1 1 1 1 1 1 2 1

Scenario 1: 10k fields
  ES:                  29 secs - 345 docs/sec, CPU 40.83%, heap 5.74G, iowait 0.02%, index size 36MB, secs per 1k docs: 14 2 2 2 1 2 2 1 2 1
  ES (_all/bloom off): 31 secs - 322.6 docs/sec, CPU 39.29%, iowait 0.01%, heap 47.95G, index size 396K, secs per 1k docs: 12 1 2 1 1 1 2 1 4 2

Scenario 2: 100k fields
  ES:                  17 mins 44 secs - 9.4 docs/sec, CPU 54.73%, heap 47.99G, iowait 0.02%, index size 75MB, secs per 1k docs: 97 183 196 147 109 89 87 49 66 40
  ES (_all/bloom off): 14 mins 24 secs - 11.6 docs/sec, CPU 52.30%, iowait 0.02%, heap 47.96G, index size 1.5M, secs per 1k docs: 93 153 151 112 84 65 61 53 51 41

We ingest a single doc per request, instead of using bulk ingestion, because 
that matches our real-world requirements.

scripts to disable _all and the bloom filter:
curl -XPOST localhost:9200/doc -d '{
  "mappings" : {
    "type" : {
      "_source" : { "enabled" : false },
      "_all" : { "enabled" : false },
      "dynamic_templates" : [
        {"t1":{
          "match" : "*_ss",
          "mapping":{
            "type": "string",
            "store": false,
            "norms" : {"enabled" : false}
          }
        }},
        {"t2":{
          "match" : "*_dt",
          "mapping":{
            "type": "date",
            "store": false
          }
        }},
        {"t3":{
          "match" : "*_i",
          "mapping":{
            "type": "integer",
            "store": false
          }
        }}
      ]
    }
  }
}'


curl -XPUT localhost:9200/doc/_settings -d '{
  "index.codec.bloom.load" : false
}'

Best Regards
Maco

On Monday, June 23, 2014 12:17:27 AM UTC+8, Jörg Prante wrote:

 Two things to add, to make the Elasticsearch/Solr comparison fairer.

 In the ES mapping, you did not disable the _all field.

 If you have the _all field enabled, all tokens will be indexed twice: once for 
 the field, once for _all.


 http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-all-field.html

 Also, you may want to disable the ES codec bloom filter


 http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-codec.html#bloom-postings

 because loading the bloom filter consumes significant memory.

 Not sure why you call curl from Perl, since this adds overhead. There are 
 nice Solr/ES Perl clients to push docs using bulk indexing.

 Jörg


 On Wednesday, June 18, 2014 4:50:13 AM UTC+2, Maco Ma wrote:

 Hi Mike,

 new_ES_config.sh (defines the templates and disables refresh/flush):
 curl -XPOST localhost:9200/doc -d '{
   "mappings" : {
     "type" : {
       "_source" : { "enabled" : false },
       "dynamic_templates" : [
         {"t1":{
           "match" : "*_ss",
           "mapping":{
             "type": "string",
             "store": false,
             "norms" : {"enabled" : false}
           }
         }},
         {"t2":{
           "match" : "*_dt",
           "mapping":{
             "type": "date",
             "store": false
           }
         }},
         {"t3":{
           "match" : "*_i",
           "mapping":{
             "type": "integer",
             "store": false
           }
         }}
       ]
     }
   }
 }'

 curl -XPUT localhost:9200/doc/_settings -d '{
   "index.refresh_interval" : "-1"
 }'

 curl -XPUT localhost:9200/doc/_settings -d '{
   "index.translog.disable_flush" : true
 }'

 new_ES_ingest_threads.pl (spawns 10 threads that use curl to ingest docs, plus 
 one thread that flushes/optimizes periodically):

 my $num_args = $#ARGV + 1;
 if ($num_args < 1 || $num_args > 2) {
   print "\n usage: $0 [src_dir] [thread_count]\n";
   exit;
 }

 my $INST_HOME = "/scratch/aime/elasticsearch-1.2.1";

 my $pid = qx(jps | sed -e '/Elasticsearch/p' -n | sed 's/ .*//');
 chomp($pid);
 if( $pid eq "" )
 {
   print "Instance is not up\n";
   exit;
 }


 my $dir = $ARGV[0];
 my $td_count = 10;
 $td_count = $ARGV[1] if($num_args == 2);
 open(FH, ">$lf");
 print FH "source dir: $dir\nthread_count: $td_count\n";
 print FH localtime()."\n";

 use threads;
 use threads::shared;

 my 

Re: ingest performance degrades sharply along with the documents having more fields

2014-06-24 Thread Cindy Hsin
Looks like the memory usage increased a lot with 10k fields with these two 
parameters disabled.

Based on the experiments we have done, it looks like ES has abnormal memory 
usage and performance degradation when the number of fields is large (i.e. 
10k), whereas Solr memory usage and performance remain stable for a large 
number of fields. 

If we are only looking at the 10k-fields scenario, is there a way for ES to 
make the ingest performance better (perhaps via a bug fix)? Looking at the 
performance numbers, I think this abnormal memory usage & performance drop 
is most likely a bug in the ES layer. If this is not technically feasible then 
we'll report back that we have checked with ES experts and confirmed that 
there is no way for ES to provide a fix to address this issue. The solution 
Mike suggested sounds like a workaround (i.e. combine multiple fields into 
one field to reduce the large number of fields). I can run it by our team 
but I am not sure if this will fly.

I have also asked Maco to do one more benchmark (where search and ingest 
run concurrently) for both ES and Solr, to check whether there is any 
performance degradation for Solr when search and ingest happen 
concurrently. I think this is one point that Mike mentioned, right? Even 
with Solr, you think we will hit some performance issues with a large 
number of fields when ingest and query run concurrently.

Thanks!
Cindy

On Thursday, June 12, 2014 10:57:23 PM UTC-7, Maco Ma wrote:

 I am trying to measure the performance of ingesting documents that have a 
 large number of fields.


 The latest Elasticsearch, 1.2.1:
 Total docs count: 10k (a small set, definitely)
 ES_HEAP_SIZE: 48G
 settings:

 {"doc":{"settings":{"index":{"uuid":"LiWHzE5uQrinYW1wW4E3nA","number_of_replicas":"0","translog":{"disable_flush":"true"},"number_of_shards":"5","refresh_interval":"-1","version":{"created":"1020199"}}}}}

 mappings:

 {"doc":{"mappings":{"type":{"dynamic_templates":[{"t1":{"mapping":{"store":false,"norms":{"enabled":false},"type":"string"},"match":"*_ss"}},{"t2":{"mapping":{"store":false,"type":"date"},"match":"*_dt"}},{"t3":{"mapping":{"store":false,"type":"integer"},"match":"*_i"}}],"_source":{"enabled":false},"properties":{}}}}}

 All fields in the documents match the templates in the mappings.

 Since I disabled flush & refresh, I submit a flush command (followed by an 
 optimize command) from the client program every 10 seconds. (I also tried a 
 10-minute interval and got similar results.)

 Scenario 0 - 10k docs with 1000 different fields:
 Ingestion took 12 secs.  Only 1.08G of heap memory is used (this counts only 
 the used heap).


 Scenario 1 - 10k docs with 10k different fields (10x the fields of scenario 0):
 This time ingestion took 29 secs.  Only 5.74G of heap memory is used.

 I am not sure why the performance degrades so sharply.

 If I try to ingest docs with 100k different fields, it takes 17 mins 44 secs.  
 We only have 10k docs in total and I am not sure why ES performs so badly.

 Can anyone give suggestions to improve the performance?











Re: ingest performance degrades sharply along with the documents having more fields

2014-06-23 Thread Cindy Hsin
Thanks!

I have asked Maco to re-test ES with these two parameters disabled.

One more question regarding Lucene's capability with a large number of metadata 
fields: what is the largest number of metadata fields Lucene supports per index?
What are the different strategies for solving the large-metadata-fields issue? 
Do you recommend using type to partition different sets of metadata fields 
within an index?
I will also clarify with our team their usage of large numbers of metadata 
fields.

Thanks!
Cindy

On Thursday, June 12, 2014 10:57:23 PM UTC-7, Maco Ma wrote:

 I am trying to measure the performance of ingesting documents that have a 
 large number of fields.


 The latest Elasticsearch, 1.2.1:
 Total docs count: 10k (a small set, definitely)
 ES_HEAP_SIZE: 48G
 settings:

 {"doc":{"settings":{"index":{"uuid":"LiWHzE5uQrinYW1wW4E3nA","number_of_replicas":"0","translog":{"disable_flush":"true"},"number_of_shards":"5","refresh_interval":"-1","version":{"created":"1020199"}}}}}

 mappings:

 {"doc":{"mappings":{"type":{"dynamic_templates":[{"t1":{"mapping":{"store":false,"norms":{"enabled":false},"type":"string"},"match":"*_ss"}},{"t2":{"mapping":{"store":false,"type":"date"},"match":"*_dt"}},{"t3":{"mapping":{"store":false,"type":"integer"},"match":"*_i"}}],"_source":{"enabled":false},"properties":{}}}}}

 All fields in the documents match the templates in the mappings.

 Since I disabled flush & refresh, I submit a flush command (followed by an 
 optimize command) from the client program every 10 seconds. (I also tried a 
 10-minute interval and got similar results.)

 Scenario 0 - 10k docs with 1000 different fields:
 Ingestion took 12 secs.  Only 1.08G of heap memory is used (this counts only 
 the used heap).


 Scenario 1 - 10k docs with 10k different fields (10x the fields of scenario 0):
 This time ingestion took 29 secs.  Only 5.74G of heap memory is used.

 I am not sure why the performance degrades so sharply.

 If I try to ingest docs with 100k different fields, it takes 17 mins 44 secs.  
 We only have 10k docs in total and I am not sure why ES performs so badly.

 Can anyone give suggestions to improve the performance?











Re: ingest performance degrades sharply along with the documents having more fields

2014-06-23 Thread Michael McCandless
Hi Cindy,

There isn't a hard limit on the number of fields Lucene supports; it's more 
that per field there is fairly high heap usage, added CPU/IO cost for merging, 
etc.  It's just not a well-tested usage of Lucene, not something the 
developers focus on optimizing, etc.

Partitioning by _type won't change things (it's still a single Lucene 
index).

How you design your schema really depends on how you want to search on 
these fields.  E.g. if they are single-token text fields that you need to 
filter on, then you can index them all under a single field (say 
allFilterFields), pre-pending the original field name onto each token, and 
then doing the same at search time (searching for field:text as your text 
token within allFilterFields).
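A minimal sketch of that idea in ES terms (the index name docs and the field 
values are made up; allFilterFields is mapped with the whitespace analyzer so 
each fieldName:value pair stays one token):

curl -XPOST 'http://localhost:9200/docs' -d '{
  "mappings" : {
    "type" : {
      "properties" : {
        "allFilterFields" : { "type" : "string", "analyzer" : "whitespace" }
      }
    }
  }
}'

curl -XPOST 'http://localhost:9200/docs/type/1' -d '{
  "allFilterFields" : "color_ss:red size_i:42 created_dt:2014-06-23"
}'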


Mike McCandless

http://blog.mikemccandless.com


On Tue, Jun 24, 2014 at 12:12 AM, Cindy Hsin cindy.h...@gmail.com wrote:

 Thanks!

 I have asked Maco to re-test ES with these two parameters disabled.

 One more question regarding Lucene's capability with a large number of metadata
 fields: what is the largest number of metadata fields Lucene supports per index?
 What are the different strategies for solving the large-metadata-fields issue? 
 Do you recommend using type to partition different sets of metadata fields
 within an index?
 I will also clarify with our team their usage of large numbers of metadata
 fields.


 Thanks!
 Cindy

 On Thursday, June 12, 2014 10:57:23 PM UTC-7, Maco Ma wrote:

 I am trying to measure the performance of ingesting documents that have a 
 large number of fields.


 The latest Elasticsearch, 1.2.1:
 Total docs count: 10k (a small set, definitely)
 ES_HEAP_SIZE: 48G
 settings:

 {"doc":{"settings":{"index":{"uuid":"LiWHzE5uQrinYW1wW4E3nA","number_of_replicas":"0","translog":{"disable_flush":"true"},"number_of_shards":"5","refresh_interval":"-1","version":{"created":"1020199"}}}}}

 mappings:

 {"doc":{"mappings":{"type":{"dynamic_templates":[{"t1":{"mapping":{"store":false,"norms":{"enabled":false},"type":"string"},"match":"*_ss"}},{"t2":{"mapping":{"store":false,"type":"date"},"match":"*_dt"}},{"t3":{"mapping":{"store":false,"type":"integer"},"match":"*_i"}}],"_source":{"enabled":false},"properties":{}}}}}

 All fields in the documents match the templates in the mappings.

 Since I disabled flush & refresh, I submit a flush command (followed by an 
 optimize command) from the client program every 10 seconds. (I also tried a 
 10-minute interval and got similar results.)

 Scenario 0 - 10k docs with 1000 different fields:
 Ingestion took 12 secs.  Only 1.08G of heap memory is used (this counts only 
 the used heap).


 Scenario 1 - 10k docs with 10k different fields (10x the fields of scenario 0):
 This time ingestion took 29 secs.  Only 5.74G of heap memory is used.

 I am not sure why the performance degrades so sharply.

 If I try to ingest docs with 100k different fields, it takes 17 mins 44 secs.  
 We only have 10k docs in total and I am not sure why ES performs so badly.

 Can anyone give suggestions to improve the performance?











Re: ingest performance degrades sharply along with the documents having more fields

2014-06-22 Thread Jörg Prante
Two things to add, to make the Elasticsearch/Solr comparison fairer.

In the ES mapping, you did not disable the _all field.

If you have the _all field enabled, all tokens will be indexed twice: once for 
the field, once for _all.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-all-field.html

Also, you may want to disable the ES codec bloom filter

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-codec.html#bloom-postings

because loading the bloom filter consumes significant memory.

Not sure why you call curl from Perl, since this adds overhead. There are 
nice Solr/ES Perl clients to push docs using bulk indexing.
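As a sketch of what bulk indexing looks like even with plain curl (using the 
doc index and type from the scripts in this thread; the field names and values 
here are made up), the _bulk endpoint takes newline-delimited action/source 
pairs:

curl -XPOST 'http://localhost:9200/doc/type/_bulk' --data-binary '{ "index" : { "_id" : "1" } }
{ "field1_ss" : "foo", "field2_i" : 1 }
{ "index" : { "_id" : "2" } }
{ "field3_dt" : "2014-06-22T00:00:00Z", "field4_ss" : "bar" }
'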

Jörg


On Wednesday, June 18, 2014 4:50:13 AM UTC+2, Maco Ma wrote:

 Hi Mike,

 new_ES_config.sh (defines the templates and disables refresh/flush):
 curl -XPOST localhost:9200/doc -d '{
   "mappings" : {
     "type" : {
       "_source" : { "enabled" : false },
       "dynamic_templates" : [
         {"t1":{
           "match" : "*_ss",
           "mapping":{
             "type": "string",
             "store": false,
             "norms" : {"enabled" : false}
           }
         }},
         {"t2":{
           "match" : "*_dt",
           "mapping":{
             "type": "date",
             "store": false
           }
         }},
         {"t3":{
           "match" : "*_i",
           "mapping":{
             "type": "integer",
             "store": false
           }
         }}
       ]
     }
   }
 }'

 curl -XPUT localhost:9200/doc/_settings -d '{
   "index.refresh_interval" : "-1"
 }'

 curl -XPUT localhost:9200/doc/_settings -d '{
   "index.translog.disable_flush" : true
 }'

 new_ES_ingest_threads.pl (spawns 10 threads that use curl to ingest docs, plus 
 one thread that flushes/optimizes periodically):

 my $num_args = $#ARGV + 1;
 if ($num_args < 1 || $num_args > 2) {
   print "\n usage: $0 [src_dir] [thread_count]\n";
   exit;
 }

 my $INST_HOME = "/scratch/aime/elasticsearch-1.2.1";

 my $pid = qx(jps | sed -e '/Elasticsearch/p' -n | sed 's/ .*//');
 chomp($pid);
 if( $pid eq "" )
 {
   print "Instance is not up\n";
   exit;
 }


 my $dir = $ARGV[0];
 my $td_count = 10;
 $td_count = $ARGV[1] if($num_args == 2);
 open(FH, ">$lf");
 print FH "source dir: $dir\nthread_count: $td_count\n";
 print FH localtime()."\n";

 use threads;
 use threads::shared;

 my $flush_intv = 10;

 my $no :shared = 0;
 my $total = 1;
 my $intv = 1000;
 my $tstr :shared = "";
 my $ltime :shared = time;

 # commit thread: flush (and optimize) the index every $flush_intv seconds
 sub commit {
   $SIG{'KILL'} = sub { `curl -XPOST 'http://localhost:9200/doc/_flush'`; print "forced commit done on ".localtime()."\n"; threads->exit(); };

   while ($no < $total )
   {
     `curl -XPOST 'http://localhost:9200/doc/_flush'`;
     `curl -XPOST 'http://localhost:9200/doc/_optimize'`;
     print "commit on ".localtime()."\n";
     sleep($flush_intv);
   }
   `curl -XPOST 'http://localhost:9200/doc/_flush'`;
   print "commit done on ".localtime()."\n";
 }

 # worker thread: ingest one JSON doc per curl call, recording the time per 1k docs
 sub do {
   my $c = -1;
   while(1)
   {
     {
       lock($no);
       $c = $no;
       $no++;
     }
     last if($c >= $total);
     `curl -XPOST -s localhost:9200/doc/type/$c --data-binary \@$dir/$c.json`;
     if( ($c +1) % $intv == 0 )
     {
       lock($ltime);
       $curtime = time;
       $tstr .= ($curtime - $ltime)." ";
       $ltime = $curtime;
     }
   }
 }

 # start the monitor processes
 my $sarId = qx(sar -A 5 10 -o sar5sec_$dir.out > /dev/null &\necho \$!);
 my $jgcId = qx(jstat -gc $pid 2s > jmem_$dir.out &\necho \$!);

 my $ct = threads->create(\&commit);
 my $start = time;
 my @ts = ();
 for $i (1..$td_count)
 {
   my $t = threads->create(\&do);
   push(@ts, $t);
 }

 for my $t (@ts)
 {
   $t->join();
 }

 $ct->kill('KILL');
 my $fin = time;

 qx(kill -9 $sarId\nkill -9 $jgcId);

 print FH localtime()."\n";
 $ct->join();
 print FH qx(curl 'http://localhost:9200/doc/type/_count?q=*');
 close(FH);

 new_Solr_ingest_threads.pl is similar to new_ES_ingest_threads.pl and uses 
 different parameters for the curl commands. Only the differences are posted 
 here:

 sub commit {
   while ($no < $total )
   {
     `curl 'http://localhost:8983/solr/collection2/update?commit=true'`;
     `curl 'http://localhost:8983/solr/collection2/update?optimize=true'`;
     print "commit on ".localtime()."\n";
     sleep(10);
   }
   `curl 'http://localhost:8983/solr/collection2/update?commit=true'`;
   print "commit done on ".localtime()."\n";
 }


 

Re: ingest performance degrades sharply along with the documents having more fields

2014-06-21 Thread Michael McCandless
On Fri, Jun 20, 2014 at 8:00 PM, Cindy Hsin cindy.h...@gmail.com wrote:

 Hi, Mike:

 Since both ES and Solr use Lucene, do you know why we only see a big ingest
 performance degradation with ES but not with Solr?


I'm not sure why: clearly something is slow with ES as you add more and
more fields.  I think it has to do with how it manages its mappings.


 Are you suggesting that if our customers require a large number of metadata
 fields, even Solr won't be able to provide decent performance when ingest
 and search happen concurrently?


Exactly.  Even if you/we fixed ES's slowness as you add tons of fields, or
if you went with Solr, you're still going to see poor
indexing/merging/searching performance because Lucene itself doesn't scale
very well to so many fields: this use case (tons of fields) has never been
a priority for Lucene developers because it's typically easy for the
application to change its approach to not use so many fields.

Mike McCandless

http://blog.mikemccandless.com



Re: ingest performance degrades sharply along with the documents having more fields

2014-06-18 Thread Maco Ma
I tried your script with iwc.setRAMBufferSizeMB(4) set and a 48G heap. The 
speed is around 430 docs/sec before the first flush, and the final speed is 
350 docs/sec. Not sure what configuration Solr uses; its ingestion speed can 
be 800 docs/sec.

Maco

On Wednesday, June 18, 2014 6:09:07 AM UTC+8, Michael McCandless wrote:

 I tested roughly your Scenario 2 (100K unique fields, 100 fields per 
 document) with a straight Lucene test (attached, but not sure if the list 
 strips attachments).  Net/net I see ~100 docs/sec with one thread ... which 
 is very slow.

 Lucene stores quite a lot for each unique indexed field name and it's 
 really a bad idea to plan on having so many unique fields in the index: 
 you'll spend lots of RAM and CPU.

 Can you describe the wider use case here?  Maybe there's a more performant 
 way to achieve it...



 On Fri, Jun 13, 2014 at 2:40 PM, Cindy Hsin cindy...@gmail.com wrote:

 Hi, Mark:

 We are doing single-document ingestion. We did a performance comparison 
 between Solr and Elasticsearch (ES).
 The performance of ES degrades dramatically when we increase the number of 
 metadata fields, whereas Solr performance remains the same. 
 The test uses a very small data set (i.e. 10k documents; the index size is 
 only 75MB). The machine is a high-spec machine with 48GB of memory.
 You can see ES performance drop by 50% even when the machine has plenty of 
 memory. ES consumes all of the machine's memory when the metadata fields 
 increase to 100k. 
 This behavior seems abnormal since the data is really tiny.

 We also tried larger data sets (i.e. 100k and 1 Mil documents); ES threw an 
 OOM for scenario 2 in the 1 Mil doc case. 
 We want to know whether this is a bug in ES and/or whether there is any 
 workaround (config step) we can use to eliminate the performance 
 degradation. 
 Currently ES performance does not meet the customer requirement, so we want 
 to see if there is any way we can bring ES performance to the same level as 
 Solr.

 Below are the configuration settings and benchmark results for the 10k 
 document set.
 scenario 0 means there are 1000 different metadata fields in the system.
 scenario 1 means there are 10k different metadata fields in the system.
 scenario 2 means there are 100k different metadata fields in the system.
 scenario 3 means there are 1M different metadata fields in the system.

- disable hard commit & soft commit + use a client to issue commits (ES & 
 Solr) every 10 seconds
- ES: flush and refresh are disabled
   - Solr: autoSoftCommit is disabled
- monitor load on the system (cpu, memory, etc.) and the ingestion 
speed change over time
- monitor the ingestion speed (is there any degradation over time?) 
- new ES config:new_ES_config.sh 

 https://stbeehive.oracle.com/content/dav/st/Cloud%20Search/Documents/new_ES_config.sh;
  
new ingestion: new_ES_ingest_threads.pl 

 https://stbeehive.oracle.com/content/dav/st/Cloud%20Search/Documents/new_ES_ingest_threads.pl
  
- new Solr ingestion: new_Solr_ingest_threads.pl 

 https://stbeehive.oracle.com/content/dav/st/Cloud%20Search/Documents/new_Solr_ingest_threads.pl
- flush interval: 10s


 Results by number of different metadata fields (setups: ES; Solr)

 Scenario 0: 1000 fields
   ES:   12 secs - 833 docs/sec, CPU 30.24%, heap 1.08G, iowait 0.02%, index size 36M, secs per 1k docs: 3 1 1 1 1 1 0 1 2 1
   Solr: 13 secs - 769 docs/sec, CPU 28.85%, heap 9.39G, secs per 1k docs: 2 1 1 1 1 1 1 1 2 2

 Scenario 1: 10k fields
   ES:   29 secs - 345 docs/sec, CPU 40.83%, heap 5.74G, iowait 0.02%, index size 36M, secs per 1k docs: 14 2 2 2 1 2 2 1 2 1
   Solr: 12 secs - 833 docs/sec, CPU 28.62%, heap 9.88G, secs per 1k docs: 1 1 1 1 2 1 1 1 1 2

 Scenario 2: 100k fields
   ES:   17 mins 44 secs - 9.4 docs/sec, CPU 54.73%, heap 47.99G, iowait 0.02%, index size 75M, secs per 1k docs: 97 183 196 147 109 89 87 49 66 40
   Solr: 13 secs - 769 docs/sec, CPU 29.43%, heap 9.84G, secs per 1k docs: 2 1 1 1 1 1 1 1 2 2

 Scenario 3: 1M fields
   ES:   183 mins 8 secs - 0.9 docs/sec, CPU 40.47%, heap 47.99G, secs per 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
   Solr: 15 secs - 666.7 docs/sec, CPU 45.10%, heap 9.64G, secs per 1k docs: 2 1 1 1 1 2 1 1 3 2

 Thanks!
 Cindy






Re: ingest performance degrades sharply along with the documents having more fileds

2014-06-18 Thread Michael McCandless
On Wed, Jun 18, 2014 at 2:38 AM, Maco Ma mayaohu...@gmail.com wrote:

 I tried your script with iwc.setRAMBufferSizeMB(4) and a 48G heap size. The
 speed is around 430 docs/sec before the first flush and the final speed is
 350 docs/sec. I am not sure what configuration Solr uses, but its ingestion
 speed can be 800 docs/sec.


Well, probably the difference is threads?  That simple Lucene test uses
only 1 thread, but your ES/Solr test uses 10 threads.

I think the cost in ES is how the MapperService maintains mappings for all
fields; I don't think there's a quick fix to reduce this cost.

But net/net you really need to take a step back and re-evaluate your
approach here: even if you use Solr, indexing at 800 docs/sec using 10
threads is awful indexing performance and this is because Lucene itself has
a high cost per field, at indexing time and searching time.  E.g. have you
tried opening a searcher once you've built a large index with so many
unique fields?  The heap usage will be very high.  Tested search
performance on that searcher?  Merging cost will be very high, etc.

Lucene is just not optimized for the zillions of unique fields case,
because you can so easily move those N fields into a single field; e.g. if
this is just for simple term filtering, make a single field and then as
terms insert fieldName:fieldValue as your tokens.

If you insist on creating so many unique fields in your use case you will
be unhappy down the road with Lucene ...

Mike McCandless

http://blog.mikemccandless.com



Re: ingest performance degrades sharply along with the documents having more fileds

2014-06-17 Thread Michael McCandless
Hi,

Could you post the scripts you linked to (new_ES_config.sh,
new_ES_ingest_threads.pl, new_Solr_ingest_threads.pl) inlined?  I can't
download them from where you linked.

Optimizing every 10 seconds or 10 minutes is really not a good idea in
general, but I guess if you're doing the same with ES and Solr then the
comparison is at least fair.

It's odd you see such a slowdown with ES...

Mike

On Fri, Jun 13, 2014 at 2:40 PM, Cindy Hsin cindy.h...@gmail.com wrote:

 Hi, Mark:

 We are doing single-document ingestion. We ran a performance comparison
 between Solr and Elasticsearch (ES).
 ES performance degrades dramatically as we increase the number of metadata
 fields, whereas Solr performance remains the same.
 The benchmark uses a very small data set (i.e. 10k documents; the index
 size is only 75MB). The machine is a high-spec machine with 48GB of memory.
 ES performance drops by 50% even though the machine has plenty of memory,
 and ES consumes all of the machine's memory when the metadata field count
 increases to 100k.
 This behavior seems abnormal since the data set is really tiny.

 We also tried larger data sets (i.e. 100k and 1M documents); ES threw an
 out-of-memory (OOM) error for scenario 2 with the 1M-document set.
 We want to know whether this is a bug in ES and/or whether there is any
 workaround (configuration step) we can use to eliminate the performance
 degradation.
 Currently ES performance does not meet the customer requirement, so we want
 to see if there is any way we can bring ES performance to the same level as
 Solr.

 Below is the configuration setting and the benchmark results for the
 10k-document set.
 Scenario 0 means there are 1000 different metadata fields in the system.
 Scenario 1 means there are 10k different metadata fields in the system.
 Scenario 2 means there are 100k different metadata fields in the system.
 Scenario 3 means there are 1M different metadata fields in the system.

 - disable hard commit & soft commit + use a *client* to do a commit (ES &
   Solr) every 10 seconds
 - ES: flush and refresh are disabled
 - Solr: autoSoftCommit is disabled
 - monitor load on the system (CPU, memory, etc.) and the ingestion speed
   change over time
 - monitor the ingestion speed (is there any degradation over time?)
 - new ES config: new_ES_config.sh; new ES ingestion: new_ES_ingest_threads.pl
 - new Solr ingestion: new_Solr_ingest_threads.pl
 - flush interval: 10s


 Number of different metadata fields (ES vs. Solr):

 Scenario 0: 1000 fields
   ES:   12 secs - 833 docs/sec; CPU: 30.24%; Heap: 1.08G; iowait: 0.02%; index size: 36M
         time (secs) per 1k docs: 3 1 1 1 1 1 0 1 2 1
   Solr: 13 secs - 769 docs/sec; CPU: 28.85%; Heap: 9.39G
         time (secs) per 1k docs: 2 1 1 1 1 1 1 1 2 2

 Scenario 1: 10k fields
   ES:   29 secs - 345 docs/sec; CPU: 40.83%; Heap: 5.74G; iowait: 0.02%; index size: 36M
         time (secs) per 1k docs: 14 2 2 2 1 2 2 1 2 1
   Solr: 12 secs - 833 docs/sec; CPU: 28.62%; Heap: 9.88G
         time (secs) per 1k docs: 1 1 1 1 2 1 1 1 1 2

 Scenario 2: 100k fields
   ES:   17 mins 44 secs - 9.4 docs/sec; CPU: 54.73%; Heap: 47.99G; iowait: 0.02%; index size: 75M
         time (secs) per 1k docs: 97 183 196 147 109 89 87 49 66 40
   Solr: 13 secs - 769 docs/sec; CPU: 29.43%; Heap: 9.84G
         time (secs) per 1k docs: 2 1 1 1 1 1 1 1 2 2

 Scenario 3: 1M fields
   ES:   183 mins 8 secs - 0.9 docs/sec; CPU: 40.47%; Heap: 47.99G
         time (secs) per 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
   Solr: 15 secs - 666.7 docs/sec; CPU: 45.10%; Heap: 9.64G
         time (secs) per 1k docs: 2 1 1 1 1 2 1 1 3 2

 Thanks!
 Cindy





Re: ingest performance degrades sharply along with the documents having more fileds

2014-06-17 Thread Michael McCandless
I tested roughly your Scenario 2 (100K unique fields, 100 fields per
document) with a straight Lucene test (attached, but not sure if the list
strips attachments).  Net/net I see ~100 docs/sec with one thread ... which
is very slow.

Lucene stores quite a lot for each unique indexed field name and it's
really a bad idea to plan on having so many unique fields in the index:
you'll spend lots of RAM and CPU.

Can you describe the wider use case here?  Maybe there's a more performant
way to achieve it...



On Fri, Jun 13, 2014 at 2:40 PM, Cindy Hsin cindy.h...@gmail.com wrote:

 Hi, Mark:

 We are doing single-document ingestion. We ran a performance comparison
 between Solr and Elasticsearch (ES).
 ES performance degrades dramatically as we increase the number of metadata
 fields, whereas Solr performance remains the same.
 The benchmark uses a very small data set (i.e. 10k documents; the index
 size is only 75MB). The machine is a high-spec machine with 48GB of memory.
 ES performance drops by 50% even though the machine has plenty of memory,
 and ES consumes all of the machine's memory when the metadata field count
 increases to 100k.
 This behavior seems abnormal since the data set is really tiny.

 We also tried larger data sets (i.e. 100k and 1M documents); ES threw an
 out-of-memory (OOM) error for scenario 2 with the 1M-document set.
 We want to know whether this is a bug in ES and/or whether there is any
 workaround (configuration step) we can use to eliminate the performance
 degradation.
 Currently ES performance does not meet the customer requirement, so we want
 to see if there is any way we can bring ES performance to the same level as
 Solr.

 Below is the configuration setting and the benchmark results for the
 10k-document set.
 Scenario 0 means there are 1000 different metadata fields in the system.
 Scenario 1 means there are 10k different metadata fields in the system.
 Scenario 2 means there are 100k different metadata fields in the system.
 Scenario 3 means there are 1M different metadata fields in the system.

 - disable hard commit & soft commit + use a *client* to do a commit (ES &
   Solr) every 10 seconds
 - ES: flush and refresh are disabled
 - Solr: autoSoftCommit is disabled
 - monitor load on the system (CPU, memory, etc.) and the ingestion speed
   change over time
 - monitor the ingestion speed (is there any degradation over time?)
 - new ES config: new_ES_config.sh
   https://stbeehive.oracle.com/content/dav/st/Cloud%20Search/Documents/new_ES_config.sh
 - new ES ingestion: new_ES_ingest_threads.pl
   https://stbeehive.oracle.com/content/dav/st/Cloud%20Search/Documents/new_ES_ingest_threads.pl
 - new Solr ingestion: new_Solr_ingest_threads.pl
   https://stbeehive.oracle.com/content/dav/st/Cloud%20Search/Documents/new_Solr_ingest_threads.pl
 - flush interval: 10s


 Number of different metadata fields (ES vs. Solr):

 Scenario 0: 1000 fields
   ES:   12 secs - 833 docs/sec; CPU: 30.24%; Heap: 1.08G; iowait: 0.02%; index size: 36M
         time (secs) per 1k docs: 3 1 1 1 1 1 0 1 2 1
   Solr: 13 secs - 769 docs/sec; CPU: 28.85%; Heap: 9.39G
         time (secs) per 1k docs: 2 1 1 1 1 1 1 1 2 2

 Scenario 1: 10k fields
   ES:   29 secs - 345 docs/sec; CPU: 40.83%; Heap: 5.74G; iowait: 0.02%; index size: 36M
         time (secs) per 1k docs: 14 2 2 2 1 2 2 1 2 1
   Solr: 12 secs - 833 docs/sec; CPU: 28.62%; Heap: 9.88G
         time (secs) per 1k docs: 1 1 1 1 2 1 1 1 1 2

 Scenario 2: 100k fields
   ES:   17 mins 44 secs - 9.4 docs/sec; CPU: 54.73%; Heap: 47.99G; iowait: 0.02%; index size: 75M
         time (secs) per 1k docs: 97 183 196 147 109 89 87 49 66 40
   Solr: 13 secs - 769 docs/sec; CPU: 29.43%; Heap: 9.84G
         time (secs) per 1k docs: 2 1 1 1 1 1 1 1 2 2

 Scenario 3: 1M fields
   ES:   183 mins 8 secs - 0.9 docs/sec; CPU: 40.47%; Heap: 47.99G
         time (secs) per 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
   Solr: 15 secs - 666.7 docs/sec; CPU: 45.10%; Heap: 9.64G
         time (secs) per 1k docs: 2 1 1 1 1 2 1 1 3 2

 Thanks!
 Cindy





ManyLuceneFields.java
Description: Binary data


Re: ingest performance degrades sharply along with the documents having more fileds

2014-06-17 Thread Cindy Hsin
The way we make Solr ingest faster (single-document ingest) is by turning off 
the engine's soft commit and hard commit and using a client to commit the 
changes every 10 seconds.
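
As a rough sketch, that client-side commit amounts to a loop like the one below 
(using the same Solr collection and ES index that appear elsewhere in this 
thread; the exact script is posted later):

# issue an explicit Solr commit and an ES flush every 10 seconds
while true; do
  curl -s 'http://localhost:8983/solr/collection2/update?commit=true' > /dev/null
  curl -s -XPOST 'http://localhost:9200/doc/_flush' > /dev/null
  sleep 10
done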

Solr ingest speed remains at 800 docs per second, whereas ES ingest speed 
drops by half when we increase the number of fields (i.e. from 1000 to 10k).
I have asked Maco to send you the requested script so you can do more 
analysis.

If you can help solve the first level of ES performance degradation (i.e. 
1000 to 10k fields) as a starting point, that would be best.

We do have a real customer scenario that requires a large number of metadata 
fields, which is why this is a blocking issue for the stack evaluation 
between Solr and Elasticsearch.

Thanks!
Cindy

On Thursday, June 12, 2014 10:57:23 PM UTC-7, Maco Ma wrote:

 I try to measure the performance of ingesting the documents having lots of 
 fields.


 The latest elasticsearch 1.2.1:
 Total docs count: 10k (a small set definitely)
 ES_HEAP_SIZE: 48G
 settings:

 {doc:{settings:{index:{uuid:LiWHzE5uQrinYW1wW4E3nA,number_of_replicas:0,translog:{disable_flush:true},number_of_shards:5,refresh_interval:-1,version:{created:1020199}

 mappings:

 {doc:{mappings:{type:{dynamic_templates:[{t1:{mapping:{store:false,norms:{enabled:false},type:string},match:*_ss}},{t2:{mapping:{store:false,type:date},match:*_dt}},{t3:{mapping:{store:false,type:integer},match:*_i}}],_source:{enabled:false},properties:{}

 All fields in the documents match the templates in the mappings.

 Since I disabled flush & refresh, I submitted the flush command (along 
 with the optimize command after it) in the client program every 10 seconds. (I 
 tried another interval, 10 mins, and got similar results.)

 Scenario 0 - 10k docs with 1000 different fields:
 Ingestion took 12 secs. Only 1.08G of heap memory is used (this counts only 
 the used heap memory).


 Scenario 1 - 10k docs with 10k different fields (10x the fields of scenario 0):
 This time ingestion took 29 secs. Only 5.74G of heap memory is used.

 Not sure why the performance degrades so sharply.

 If I try to ingest docs with 100k different fields, it takes 17 
 mins 44 secs. We only have 10k docs in total, so I am not sure why ES performs 
 so badly.

 Can anyone give suggestions to improve the performance?











Re: ingest performance degrades sharply along with the documents having more fileds

2014-06-17 Thread Maco Ma
Hi Mike,

new_ES_config.sh (defines the templates and disables refresh/flush):
curl -XPOST localhost:9200/doc -d '{
  "mappings" : {
    "type" : {
      "_source" : { "enabled" : false },
      "dynamic_templates" : [
        {"t1":{
          "match" : "*_ss",
          "mapping":{
            "type": "string",
            "store": false,
            "norms" : {"enabled" : false}
          }
        }},
        {"t2":{
          "match" : "*_dt",
          "mapping":{
            "type": "date",
            "store": false
          }
        }},
        {"t3":{
          "match" : "*_i",
          "mapping":{
            "type": "integer",
            "store": false
          }
        }}
      ]
    }
  }
}'

curl -XPUT localhost:9200/doc/_settings -d '{
  "index.refresh_interval" : -1
}'

curl -XPUT localhost:9200/doc/_settings -d '{
  "index.translog.disable_flush" : true
}'
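
(As a quick sanity check, not part of the original script, the settings and 
mappings can be read back to confirm they applied:)

curl 'localhost:9200/doc/_settings?pretty'
curl 'localhost:9200/doc/_mapping?pretty'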

new_ES_ingest_threads.pl (spawns 10 threads that use curl to ingest the docs, 
plus one thread to flush/optimize periodically):

my $num_args = $#ARGV + 1;
if ($num_args < 1 || $num_args > 2) {
  print "\n usage: $0 [src_dir] [thread_count]\n";
  exit;
}

my $INST_HOME = "/scratch/aime/elasticsearch-1.2.1";

# find the Elasticsearch process id (used below for jstat monitoring)
my $pid = qx(jps | sed -e '/Elasticsearch/p' -n | sed 's/ .*//');
chomp($pid);
if( $pid eq "" )
{
  print "Instance is not up\n";
  exit;
}


my $dir = $ARGV[0];
my $td_count = 10;
$td_count = $ARGV[1] if($num_args == 2);
open(FH, ">$lf");   # $lf: log file path (its definition was not included in the post)
print FH "source dir: $dir\nthread_count: $td_count\n";
print FH localtime()."\n";

use threads;
use threads::shared;

my $flush_intv = 10;

my $no :shared = 0;
my $total = 1;
my $intv = 1000;
my $tstr :shared = "";
my $ltime :shared = time;

# commit thread: flush + optimize every $flush_intv seconds until all docs are ingested
sub commit {
  $SIG{'KILL'} = sub {`curl -XPOST 'http://localhost:9200/doc/_flush'`;
                      print "forced commit done on ".localtime()."\n";
                      threads->exit();};

  while ($no < $total )
  {
    `curl -XPOST 'http://localhost:9200/doc/_flush'`;
    `curl -XPOST 'http://localhost:9200/doc/_optimize'`;
    print "commit on ".localtime()."\n";
    sleep($flush_intv);
  }
  `curl -XPOST 'http://localhost:9200/doc/_flush'`;
  print "commit done on ".localtime()."\n";
}

# worker threads: grab the next doc number and POST $dir/$c.json, timing every $intv docs
sub do {
  my $c = -1;
  while(1)
  {
    {
      lock($no);
      $c = $no;
      $no++;
    }
    last if($c >= $total);
    `curl -XPOST -s localhost:9200/doc/type/$c --data-binary \@$dir/$c.json`;
    if( ($c + 1) % $intv == 0 )
    {
      lock($ltime);
      $curtime = time;
      $tstr .= ($curtime - $ltime)." ";
      $ltime = $curtime;
    }
  }
}

# start the monitor processes (sar for system load, jstat for JVM heap/GC)
my $sarId = qx(sar -A 5 10 -o sar5sec_$dir.out > /dev/null &\necho \$!);
my $jgcId = qx(jstat -gc $pid 2s > jmem_$dir.out &\necho \$!);

my $ct = threads->create(\&commit);
my $start = time;
my @ts = ();
for $i (1..$td_count)
{
  my $t = threads->create(\&do);
  push(@ts, $t);
}

for my $t (@ts)
{
  $t->join();
}

$ct->kill('KILL');
my $fin = time;

qx(kill -9 $sarId\nkill -9 $jgcId);

print FH localtime()."\n";
$ct->join();
print FH qx(curl 'http://localhost:9200/doc/type/_count?q=*');
close(FH);

new_Solr_ingest_threads.pl is similar to new_ES_ingest_threads.pl and uses 
different parameters for the curl commands. Only the differences are posted 
here:

sub commit {
  while ($no < $total )
  {
    `curl 'http://localhost:8983/solr/collection2/update?commit=true'`;
    `curl 'http://localhost:8983/solr/collection2/update?optimize=true'`;
    print "commit on ".localtime()."\n";
    sleep(10);
  }
  `curl 'http://localhost:8983/solr/collection2/update?commit=true'`;
  print "commit done on ".localtime()."\n";
}


sub do {
  my $c = -1;
  while(1)
  {
    {
      lock($no);
      $c = $no;
      $no++;
    }
    last if($c >= $total);
    `curl -s 'http://localhost:8983/solr/collection2/update/json' --data-binary \@$dir/$c.json -H 'Content-type:application/json'`;
    if( ($c + 1) % $intv == 0 )
    {
      lock($ltime);
      $curtime = time;
      $tstr .= ($curtime - $ltime)." ";
      $ltime = $curtime;
    }
  }
}
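
(Judging from the usage string at the top of the ES script, both scripts are 
presumably invoked along these lines, where the source directory holds the 
pre-generated 0.json, 1.json, ... documents and the thread count defaults to 
10; the directory name here is just an example:)

perl new_ES_ingest_threads.pl docs_10k 10
perl new_Solr_ingest_threads.pl docs_10k 10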


BR
Maco

On Wednesday, June 18, 2014 4:44:35 AM UTC+8, Michael McCandless wrote:

 Hi,

 Could you post the scripts you linked to (new_ES_config.sh, 
 new_ES_ingest_threads.pl, new_Solr_ingest_threads.pl) inlined?  I can't 
 download them from where you linked.

 Optimizing every 10 seconds or 10 minutes is really not a good idea in 
 general, but I guess if you're doing the same with ES and Solr then the 
 comparison is at least fair.

 It's odd you see such a slowdown with ES...

 Mike

 On Fri, Jun 13, 2014 at 2:40 PM, Cindy Hsin cindy...@gmail.com wrote:

 Hi, Mark:

 We are doing single-document ingestion. We ran a performance comparison 
 between Solr and Elasticsearch (ES).
 ES performance degrades dramatically as we increase the number of 
 metadata fields, whereas Solr performance remains 

Re: ingest performance degrades sharply along with the documents having more fileds

2014-06-13 Thread Mark Walkom
It's not surprising that the time increases when you have an order of
magnitude more fields.

Are you using the bulk API?
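
(For reference, a bulk request batches many index operations into one call; a 
minimal sketch against the index used in this thread, with made-up IDs and 
field values, would be:)

cat > bulk.json <<'EOF'
{ "index" : { "_index" : "doc", "_type" : "type", "_id" : "1" } }
{ "f1_ss" : "value one", "f2_i" : 3 }
{ "index" : { "_index" : "doc", "_type" : "type", "_id" : "2" } }
{ "f1_ss" : "value two", "f2_i" : 4 }
EOF
curl -s -XPOST 'localhost:9200/_bulk' --data-binary @bulk.json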

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: ma...@campaignmonitor.com
web: www.campaignmonitor.com


On 13 June 2014 15:57, Maco Ma mayaohu...@gmail.com wrote:

 I try to measure the performance of ingesting the documents having lots of
 fields.


 The latest elasticsearch 1.2.1:
 Total docs count: 10k (a small set definitely)
 ES_HEAP_SIZE: 48G
 settings:

 {doc:{settings:{index:{uuid:LiWHzE5uQrinYW1wW4E3nA,number_of_replicas:0,translog:{disable_flush:true},number_of_shards:5,refresh_interval:-1,version:{created:1020199}

 mappings:

 {doc:{mappings:{type:{dynamic_templates:[{t1:{mapping:{store:false,norms:{enabled:false},type:string},match:*_ss}},{t2:{mapping:{store:false,type:date},match:*_dt}},{t3:{mapping:{store:false,type:integer},match:*_i}}],_source:{enabled:false},properties:{}

 All fields in the documents match the templates in the mappings.

 Since I disabled flush & refresh, I submitted the flush command (along
 with the optimize command after it) in the client program every 10 seconds. (I
 tried another interval, 10 mins, and got similar results.)

 Scenario 0 - 10k docs with 1000 different fields:
 Ingestion took 12 secs. Only 1.08G of heap memory is used (this counts only
 the used heap memory).


 Scenario 1 - 10k docs with 10k different fields (10x the fields of scenario 0):
 This time ingestion took 29 secs. Only 5.74G of heap memory is used.

 Not sure why the performance degrades so sharply.

 If I try to ingest docs with 100k different fields, it takes 17
 mins 44 secs. We only have 10k docs in total, so I am not sure why ES performs
 so badly.

 Can anyone give suggestions to improve the performance?











Re: ingest performance degrades sharply along with the documents having more fileds

2014-06-13 Thread Maco Ma
I used the curl command to do the ingestion (one command per doc) and the 
flush. I also tried Solr (with soft/hard commit disabled, doing the commit 
from a client program) with the same data & commands, and its performance did 
not degrade. Lucene is used by both of them, so I am not sure why there is 
such a big difference in performance.
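
(For concreteness, each ingest call looked roughly like the following; the 
field names here are illustrative ones that match the *_ss/*_dt/*_i dynamic 
templates:)

# index one document (one curl call per doc), then flush explicitly
curl -XPOST 'localhost:9200/doc/type/1' --data-binary '{
  "title_ss"   : "some value",
  "created_dt" : "2014-06-13",
  "count_i"    : 42
}'
curl -XPOST 'localhost:9200/doc/_flush'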

On Friday, June 13, 2014 2:02:58 PM UTC+8, Mark Walkom wrote:

 It's not surprising that the time increases when you have an order of 
 magnitude more fields.

 Are you using the bulk API?

 Regards,
 Mark Walkom

 Infrastructure Engineer
 Campaign Monitor
 email: ma...@campaignmonitor.com
 web: www.campaignmonitor.com
  

 On 13 June 2014 15:57, Maco Ma mayao...@gmail.com wrote:

 I try to measure the performance of ingesting the documents having lots 
 of fields.


 The latest elasticsearch 1.2.1:
 Total docs count: 10k (a small set definitely)
 ES_HEAP_SIZE: 48G
 settings:

 {doc:{settings:{index:{uuid:LiWHzE5uQrinYW1wW4E3nA,number_of_replicas:0,translog:{disable_flush:true},number_of_shards:5,refresh_interval:-1,version:{created:1020199}

 mappings:

 {doc:{mappings:{type:{dynamic_templates:[{t1:{mapping:{store:false,norms:{enabled:false},type:string},match:*_ss}},{t2:{mapping:{store:false,type:date},match:*_dt}},{t3:{mapping:{store:false,type:integer},match:*_i}}],_source:{enabled:false},properties:{}

 All fields in the documents match the templates in the mappings.

 Since I disabled flush & refresh, I submitted the flush command 
 (along with the optimize command after it) in the client program every 10 
 seconds. (I tried another interval, 10 mins, and got similar results.)

 Scenario 0 - 10k docs with 1000 different fields:
 Ingestion took 12 secs. Only 1.08G of heap memory is used (this counts only 
 the used heap memory).


 Scenario 1 - 10k docs with 10k different fields (10x the fields of scenario 0):
 This time ingestion took 29 secs. Only 5.74G of heap memory is used.

 Not sure why the performance degrades so sharply.

 If I try to ingest docs with 100k different fields, it takes 17 
 mins 44 secs. We only have 10k docs in total, so I am not sure why ES performs 
 so badly.

 Can anyone give suggestions to improve the performance?













Re: ingest performance degrades sharply along with the documents having more fileds

2014-06-13 Thread Cindy Hsin
Hi, Mark:

We are doing single-document ingestion. We ran a performance comparison 
between Solr and Elasticsearch (ES).
ES performance degrades dramatically as we increase the number of metadata 
fields, whereas Solr performance remains the same.
The benchmark uses a very small data set (i.e. 10k documents; the index 
size is only 75MB). The machine is a high-spec machine with 48GB of memory.
ES performance drops by 50% even though the machine has plenty of memory, 
and ES consumes all of the machine's memory when the metadata field count 
increases to 100k.
This behavior seems abnormal since the data set is really tiny.

We also tried larger data sets (i.e. 100k and 1M documents); ES threw an 
out-of-memory (OOM) error for scenario 2 with the 1M-document set.
We want to know whether this is a bug in ES and/or whether there is any 
workaround (configuration step) we can use to eliminate the performance 
degradation.
Currently ES performance does not meet the customer requirement, so we want 
to see if there is any way we can bring ES performance to the same level as 
Solr.

Below is the configuration setting and the benchmark results for the 
10k-document set.
Scenario 0 means there are 1000 different metadata fields in the system.
Scenario 1 means there are 10k different metadata fields in the system.
Scenario 2 means there are 100k different metadata fields in the system.
Scenario 3 means there are 1M different metadata fields in the system.

   - disable hard commit & soft commit + use a *client* to do a commit (ES & 
     Solr) every 10 seconds
   - ES: flush and refresh are disabled
   - Solr: autoSoftCommit is disabled
   - monitor load on the system (CPU, memory, etc.) and the ingestion speed 
     change over time
   - monitor the ingestion speed (is there any degradation over time?)
   - new ES config: new_ES_config.sh 
     https://stbeehive.oracle.com/content/dav/st/Cloud%20Search/Documents/new_ES_config.sh
   - new ES ingestion: new_ES_ingest_threads.pl 
     https://stbeehive.oracle.com/content/dav/st/Cloud%20Search/Documents/new_ES_ingest_threads.pl
   - new Solr ingestion: new_Solr_ingest_threads.pl 
     https://stbeehive.oracle.com/content/dav/st/Cloud%20Search/Documents/new_Solr_ingest_threads.pl
   - flush interval: 10s


Number of different metadata fields (ES vs. Solr):

Scenario 0: 1000 fields
  ES:   12 secs - 833 docs/sec; CPU: 30.24%; Heap: 1.08G; iowait: 0.02%; index size: 36M
        time (secs) per 1k docs: 3 1 1 1 1 1 0 1 2 1
  Solr: 13 secs - 769 docs/sec; CPU: 28.85%; Heap: 9.39G
        time (secs) per 1k docs: 2 1 1 1 1 1 1 1 2 2

Scenario 1: 10k fields
  ES:   29 secs - 345 docs/sec; CPU: 40.83%; Heap: 5.74G; iowait: 0.02%; index size: 36M
        time (secs) per 1k docs: 14 2 2 2 1 2 2 1 2 1
  Solr: 12 secs - 833 docs/sec; CPU: 28.62%; Heap: 9.88G
        time (secs) per 1k docs: 1 1 1 1 2 1 1 1 1 2

Scenario 2: 100k fields
  ES:   17 mins 44 secs - 9.4 docs/sec; CPU: 54.73%; Heap: 47.99G; iowait: 0.02%; index size: 75M
        time (secs) per 1k docs: 97 183 196 147 109 89 87 49 66 40
  Solr: 13 secs - 769 docs/sec; CPU: 29.43%; Heap: 9.84G
        time (secs) per 1k docs: 2 1 1 1 1 1 1 1 2 2

Scenario 3: 1M fields
  ES:   183 mins 8 secs - 0.9 docs/sec; CPU: 40.47%; Heap: 47.99G
        time (secs) per 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
  Solr: 15 secs - 666.7 docs/sec; CPU: 45.10%; Heap: 9.64G
        time (secs) per 1k docs: 2 1 1 1 1 2 1 1 3 2

Thanks!
Cindy
