Re: ingest performance degrades sharply as documents have more fields

2014-07-08 Thread kimchy
Yes, this is the equivalent of using RAMDirectory. Please don't use it: 
mmap is optimized for random access, and if the Lucene index can fit in the heap 
(to use a RAM directory), it can certainly fit in OS RAM, without the implications 
of loading it onto the heap.
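
For reference, a rough sketch of how a store type would be set per index (the index name 
myindex is just a placeholder; mmapfs is the on-disk, mmap-backed store, while memory is 
the RAMDirectory-like store being discouraged above):

curl -XPUT 'localhost:9200/myindex' -d '{
  "settings": {
    "index.store.type": "mmapfs"
  }
}'
# the store discouraged above would be "index.store.type": "memory"
# (the same setting that -Des.index.store.type=memory applies node-wide)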

On Monday, July 7, 2014 6:26:07 PM UTC+2, Mahesh Venkat wrote:

 Thanks, Shay, for updating us on the perf improvements.
 Apart from using the default parameters, should we follow the guideline 
 listed in 


 http://elasticsearch-users.115913.n3.nabble.com/Is-ES-es-index-store-type-memory-equivalent-to-Lucene-s-RAMDirectory-td4057417.html
  

 Lucene supports using MMapDirectory during the data indexing phase (in a batch) and 
 switching to in-memory for queries, to optimize search latency.

 Should we use the JVM system parameter -Des.index.store.type=memory?  Isn't 
 this equivalent to using RAMDirectory in Lucene for in-memory search queries?
 Thanks
 --Mahesh

 On Saturday, July 5, 2014 8:46:59 AM UTC-7, kimchy wrote:

 Heya, I worked a bit on it, and 1.x (upcoming 1.3) has some significant 
 perf improvements now for this case (including Lucene-level improvements 
 that are for now in ES, but will be in the next Lucene version). Those include:

 6648: https://github.com/elasticsearch/elasticsearch/pull/6648
 6714: https://github.com/elasticsearch/elasticsearch/pull/6714
 6707: https://github.com/elasticsearch/elasticsearch/pull/6707

 It would be interesting if you could run the tests again with the 1.x branch. 
 Note also: please use the default settings in ES for now, with no disabling of 
 flushing and such.

 On Friday, June 13, 2014 7:57:23 AM UTC+2, Maco Ma wrote:

 I tried to measure the performance of ingesting documents that have lots 
 of fields.


 The latest Elasticsearch 1.2.1:
 Total docs count: 10k (a small set definitely)
 ES_HEAP_SIZE: 48G
 settings:

 {"doc":{"settings":{"index":{"uuid":"LiWHzE5uQrinYW1wW4E3nA","number_of_replicas":0,"translog":{"disable_flush":true},"number_of_shards":5,"refresh_interval":-1,"version":{"created":1020199}}}}}

 mappings:

 {"doc":{"mappings":{"type":{"dynamic_templates":[{"t1":{"mapping":{"store":false,"norms":{"enabled":false},"type":"string"},"match":"*_ss"}},{"t2":{"mapping":{"store":false,"type":"date"},"match":"*_dt"}},{"t3":{"mapping":{"store":false,"type":"integer"},"match":"*_i"}}],"_source":{"enabled":false},"properties":{}}}}}

 All fields in the documents match the templates in the mappings.
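
 For illustration, a document of the kind being ingested might look like the following 
 (the index/type names doc and type are taken from the dumps above; the field names are 
 made up and just need to end in _ss, _dt, or _i to hit the dynamic templates):

 curl -XPOST 'localhost:9200/doc/type/1' -d '{
   "subject_ss": "some text value",
   "created_dt": "2014-06-13T07:57:23",
   "count_i": 42
 }'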

 Since I disabled flush & refresh, I submitted the flush command 
 (along with an optimize command after it) from the client program every 10 
 seconds. (I also tried another interval, 10 mins, and got similar results.)
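
 In other words, the client periodically issues something along these lines (assuming 
 the index is named doc, as in the settings dump above):

 curl -XPOST 'localhost:9200/doc/_flush'
 curl -XPOST 'localhost:9200/doc/_optimize'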

 Scenario 0 - 10k docs have 1000 different fields:
 Ingestion took 12 secs.  Only 1.08G of heap memory was used (this and the figures 
 below are used heap only).


 Scenario 1 - 10k docs have 10k different fields (10 times the fields of 
 scenario 0):
 This time ingestion took 29 secs.   Only 5.74G of heap memory was used.

 Not sure why the performance degrades sharply.

 If I try to ingest docs having 100k different fields, it takes 
 17 mins 44 secs.  We only have 10k docs in total, and I am not sure why ES 
 performs so badly. 

 Can anyone give suggestions to improve the performance?











Re: ingest performance degrades sharply as documents have more fields

2014-07-08 Thread kimchy
Hi, thanks for running the tests! My tests were capped at 10k fields, and the 
improvements target that case; anything beyond that I, and anybody here on Elasticsearch 
(+Lucene: Mike/Robert), simply don't recommend and can't really stand behind 
when it comes to supporting it.

In Elasticsearch, there is a conscious decision to have concrete mappings 
for every field introduced. This allows for nice upstream features, such as 
autocomplete in Kibana and Sense, as well as certain index/search-level 
optimizations that can't be done without a concrete mapping for each field. 
It does incur a cost when many fields are introduced.
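
To make that concrete: every dynamically introduced field ends up as an explicit entry 
in the index mapping (and hence in the cluster state), which you can see by fetching the 
mapping back. A rough sketch, reusing the doc index from the test below:

curl -XGET 'localhost:9200/doc/_mapping'
# after ingesting 100k distinct *_ss/*_dt/*_i fields, the returned "properties"
# section contains a concrete definition for every one of those fields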

The idea here is that a system that tries to put 1M different fields into 
Lucene is simply not going to scale. The cost overhead, and even the testability 
of such a system, is simply not something that we can support.

Aside from the obvious overhead of just wrangling so many fields in Lucene 
(merge costs that keep adding up, ...), there is also the question of what you 
plan to do with them. For example, if sorting is enabled, then there is a 
multiplied cost of loading each of those fields for sorting (compared to 
using nested documents, where the cost is constant, since it's the same 
field).

I think there might be other factors at play in the performance test 
numbers I see below, aside from the 100k and 1M different fields scenarios. 
We can try to chase them, but the bottom line is the same: we can't 
support a system that asks to have 1M different fields, as we don't believe 
it uses either ES or Lucene correctly at this point.

I suggest looking into nested documents (regardless of the system you 
decide to use) as a viable alternative to the many-fields solution. This 
is the only way you will be able to scale such a system, especially across 
multiple nodes (nested documents scale out well, many fields don't).
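
As a rough illustration of that alternative (index, type, and field names here are made 
up, not a prescribed schema): instead of one concrete field per metadata key, the metadata 
can be modelled as key/value pairs under a single nested field, so every document uses the 
same small, fixed set of fields no matter how many distinct keys exist:

curl -XPUT 'localhost:9200/myindex' -d '{
  "mappings": {
    "type": {
      "properties": {
        "attrs": {
          "type": "nested",
          "properties": {
            "key":   { "type": "string", "index": "not_analyzed" },
            "value": { "type": "string" }
          }
        }
      }
    }
  }
}'

# a document then carries its metadata as nested key/value pairs:
curl -XPOST 'localhost:9200/myindex/type/1' -d '{
  "attrs": [
    { "key": "subject_ss", "value": "some text value" },
    { "key": "status_ss",  "value": "active" }
  ]
}'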

On Tuesday, July 8, 2014 11:41:11 AM UTC+2, Maco Ma wrote:

 Hi Kimchy,

 I reran the benchmark using ES 1.3 with default settings (just disabled the 
 _source & _all) and it makes great progress on performance. However, 
 Solr still outperforms ES 1.3:
 Scenario 0: 1000 different metadata fields
   ES: 12 secs (833 docs/sec), CPU 30.24%, iowait 0.02%, heap 1.08G, index size 36Mb, time (secs) per 1k docs: 3 1 1 1 1 1 0 1 2 1
   ES (_all & codec bloom filter disabled): 13 secs (769 docs/sec), CPU 23.68%, iowait 0.01%, heap 1.31G, index size 248K, time (secs) per 1k docs: 2 1 1 1 1 1 1 1 2 1
   ES 1.3: 13 secs (769 docs/sec), CPU 44.22%, iowait 0.01%, heap 1.38G, index size 69M, time (secs) per 1k docs: 2 1 1 1 1 1 2 0 2 2
   Solr: 13 secs (769 docs/sec), CPU 28.85%, heap 9.39G, time (secs) per 1k docs: 2 1 1 1 1 1 1 1 2 2

 Scenario 1: 10k different metadata fields
   ES: 29 secs (345 docs/sec), CPU 40.83%, iowait 0.02%, heap 5.74G, index size 36Mb, time (secs) per 1k docs: 14 2 2 2 1 2 2 1 2 1
   ES (_all & codec bloom filter disabled): 31 secs (322.6 docs/sec), CPU 39.29%, iowait 0.01%, heap 4.76G, index size 396K, time (secs) per 1k docs: 12 1 2 1 1 1 2 1 4 2
   ES 1.3: 20 secs (500 docs/sec), CPU 54.74%, iowait 0.02%, heap 3.06G, index size 133M, time (secs) per 1k docs: 2 2 1 2 2 3 2 2 2 1
   Solr: 12 secs (833 docs/sec), CPU 28.62%, heap 9.88G, time (secs) per 1k docs: 1 1 1 1 2 1 1 1 1 2

 Scenario 2: 100k different metadata fields
   ES: 17 mins 44 secs (9.4 docs/sec), CPU 54.73%, iowait 0.02%, heap 47.99G, index size 75Mb, time (secs) per 1k docs: 97 183 196 147 109 89 87 49 66 40
   ES (_all & codec bloom filter disabled): 14 mins 24 secs (11.6 docs/sec), CPU 52.30%, iowait 0.02%, heap (not reported), index size 1.5M, time (secs) per 1k docs: 93 153 151 112 84 65 61 53 51 41
   ES 1.3: 1 min 24 secs (119 docs/sec), CPU 47.67%, iowait 0.12%, heap 8.66G, index size 163M, time (secs) per 1k docs: 9 14 12 12 8 8 5 7 5 4
   Solr: 13 secs (769 docs/sec), CPU 29.43%, heap 9.84G, time (secs) per 1k docs: 2 1 1 1 1 1 1 1 2 2

 Scenario 3: 1M different metadata fields
   ES: 183 mins 8 secs (0.9 docs/sec), CPU 40.47%, heap 47.99G, time (secs) per 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
   ES (_all & codec bloom filter disabled): (not reported)
   ES 1.3: 11 mins 9 secs (15 docs/sec), CPU 41.45%, iowait 0.07%, heap 36.12G, index size 163M, time (secs) per 1k docs: 12 24 38 55 70 86 106 117 83 78
   Solr: 15 secs (666.7 docs/sec), CPU 45.10%, heap 9.64G, time (secs) per 1k docs: 2 1 1 1 1 2 1 1 3 2

 Best Regards
 Maco

 On Saturday, July 5, 2014 11:46:59 PM UTC+8, kimchy wrote:

 Heya, I worked a bit on it, and 1.x (upcoming 1.3) has some significant 
 perf improvements now for this case (including Lucene-level improvements 
 that are for now in ES, but will be in the next Lucene version). Those include:

 6648: https://github.com/elasticsearch/elasticsearch/pull/6648
 6714: https://github.com/elasticsearch/elasticsearch/pull/6714
 6707: https://github.com/elasticsearch/elasticsearch/pull/6707

 It would be interesting if you could run the tests again with the 1.x branch. 
 Note also: please use the default settings in ES for now, with no disabling of 
 flushing and such.

 On Friday, June 13, 2014 7:57:23 AM UTC+2, Maco Ma wrote:

 I try to measure the performance of ingesting the documents having lots

Re: ingest performance degrades sharply as documents have more fields

2014-07-05 Thread kimchy
Heya, I worked a bit on it, and 1.x (upcoming 1.3) has some significant 
perf improvements now for this case (including Lucene-level improvements 
that are for now in ES, but will be in the next Lucene version). Those include:

6648: https://github.com/elasticsearch/elasticsearch/pull/6648
6714: https://github.com/elasticsearch/elasticsearch/pull/6714
6707: https://github.com/elasticsearch/elasticsearch/pull/6707

It would be interesting if you could run the tests again with the 1.x branch. 
Note also: please use the default settings in ES for now, with no disabling of 
flushing and such.

On Friday, June 13, 2014 7:57:23 AM UTC+2, Maco Ma wrote:

 I tried to measure the performance of ingesting documents that have lots of 
 fields.


 The latest Elasticsearch 1.2.1:
 Total docs count: 10k (a small set definitely)
 ES_HEAP_SIZE: 48G
 settings:

 {"doc":{"settings":{"index":{"uuid":"LiWHzE5uQrinYW1wW4E3nA","number_of_replicas":0,"translog":{"disable_flush":true},"number_of_shards":5,"refresh_interval":-1,"version":{"created":1020199}}}}}

 mappings:

 {"doc":{"mappings":{"type":{"dynamic_templates":[{"t1":{"mapping":{"store":false,"norms":{"enabled":false},"type":"string"},"match":"*_ss"}},{"t2":{"mapping":{"store":false,"type":"date"},"match":"*_dt"}},{"t3":{"mapping":{"store":false,"type":"integer"},"match":"*_i"}}],"_source":{"enabled":false},"properties":{}}}}}

 All fields in the documents match the templates in the mappings.

 Since I disabled flush & refresh, I submitted the flush command (along 
 with an optimize command after it) from the client program every 10 seconds. (I 
 also tried another interval, 10 mins, and got similar results.)

 Scenario 0 - 10k docs have 1000 different fields:
 Ingestion took 12 secs.  Only 1.08G of heap memory was used (this and the figures 
 below are used heap only).


 Scenario 1 - 10k docs have 10k different fields (10 times the fields of 
 scenario 0):
 This time ingestion took 29 secs.   Only 5.74G of heap memory was used.

 Not sure why the performance degrades sharply.

 If I try to ingest docs having 100k different fields, it takes 17 
 mins 44 secs.  We only have 10k docs in total, and I am not sure why ES performs so 
 badly. 

 Can anyone give suggestions to improve the performance?











Re: Java Serialization of Exceptions

2014-03-21 Thread kimchy
I wonder why you are asking for this feature? If it's because Java broke 
backward compatibility on the serialization of InetAddress, which we use in our 
exceptions, then it's a bug in Java serialization, and it's hard for us to do 
something about it. 

You will lose a lot by trying to serialize exceptions using JSON, and we 
prefer not to introduce a dependency on Jackson's ObjectMapper, or try to 
serialize exceptions using Jackson.

I would be very careful about introducing this just because of a (one-time) 
bug in Java.

On Friday, March 21, 2014 5:18:38 PM UTC+1, Chris Berry wrote:

 Greetings,

 Let me say up-front, I am a huge fan and proponent of Elasticsearch. It is 
 a beautiful tool.

 So, that said, it surprises me that Elasticsearch has such a pedestrian 
 flaw and serializes its exceptions using Java Serialization.
 In a big shop it is quite difficult (i.e. next to impossible) to keep all 
 the ES Clients on the same exact JVM as Elasticsearch, and thus, it is not 
 uncommon to get TransportSerializationExceptions instead of the actual 
 underlying problem.
 I was really hoping this would be corrected in ES 1.0.X, but no such luck. 
 (As far as I can tell...)

 It seems that this could be pretty easily fixed?
 Just switch to a JSON representation of the basic Exception and gracefully 
 (forwards-compatibly) attempt to re-hydrate the actual Exception class. 
 You'd just have to drop an additional header in the stream that tells 
 the code it is a JSON response and route it to the right handler 
 accordingly. If the header is missing, then do things the old way with Java 
 Serialization?
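
 For what it's worth, the JSON form being suggested here could be as simple as something 
 like the following (purely a hypothetical shape with placeholder values, not anything ES 
 implements today):

 {
   "exception": "org.example.SomeServerSideException",
   "message": "human-readable message from the server",
   "stack_trace": "...",
   "cause": null
 }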

 Are there any plans to fix this? Or perhaps to entertain a Pull Request?
 It may seem just an annoyance, but it is actually pretty bad, in that it 
 keeps Clients from seeing their real issues. Especially in shops where it 
 is difficult to see the Production logs of Elasticsearch itself. 

 Thanks much,
 -- Chris 






Re: Java Serialization of Exceptions

2014-03-21 Thread kimchy
Not trivializing the bug at all, god knows I spent close to a week tracing 
it down to a JVM backward-incompatibility change, but this has happened once 
over the almost 5 years Elasticsearch has existed. Introducing a workaround for 
something that happened once, weighed against potential bugs in the workaround 
(Jackson is great, but what would happen if there was a bug in it, for 
example), is not a great solution. Obviously, if this happened more often, 
then it is something we would need to address.

On Friday, March 21, 2014 7:12:02 PM UTC+1, Chris Berry wrote:

 If it happened once, then by definition it will happen again. History 
 repeats itself. ;-)

 What exactly would you lose?
 You are simply trading one rigid serialization scheme for another, more 
 lenient one.
 Yes, you would have to introduce something like Jackson's ObjectMapper, 
 but that seems to be the de facto standard today, and with your use of the 
 Shade Plugin it wouldn't really be a burden on the Client anyway.

 With all due respect, you may be trivializing the impact of this one-time 
 bug.
 It is difficult, at best, to inform all the Clients of your Cluster: “Hey, 
 if you want to see what your Exceptions really are, then upgrade your JVM”, 
 especially in large SOA shops.

 This just decouples the Client and Server deployments.

 Thanks much,
 — Chris 

 On Mar 21, 2014, at 12:18 PM, kimchy kim...@gmail.com 
 wrote:

 I wonder why you are asking for this feature? If it's because Java broke 
 backward compatibility on the serialization of InetAddress, which we use in our 
 exceptions, then it's a bug in Java serialization, and it's hard for us to do 
 something about it. 

 You will lose a lot by trying to serialize exceptions using JSON, and we 
 prefer not to introduce a dependency on Jackson's ObjectMapper, or try to 
 serialize exceptions using Jackson.

 I would be very careful about introducing this just because of a (one-time) 
 bug in Java.

 On Friday, March 21, 2014 5:18:38 PM UTC+1, Chris Berry wrote:

 Greetings,

 Let me say up-front, I am a huge fan and proponent of Elasticsearch. It 
 is a beautiful tool.

 So, that said, it surprises me that Elasticsearch has such a pedestrian 
 flaw and serializes its exceptions using Java Serialization.
 In a big shop it is quite difficult (i.e. next to impossible) to keep all 
 the ES Clients on the same exact JVM as Elasticsearch, and thus, it is not 
 uncommon to get TransportSerializationExceptions instead of the actual 
 underlying problem.
 I was really hoping this would be corrected in ES 1.0.X, but no such 
 luck. (As far as I can tell...)

 It seems that this could be pretty easily fixed?
 Just switch to a JSON representation of the basic Exception and 
 gracefully (forwards-compatibly) attempt to re-hydrate the actual Exception 
 class. 
 You'd just have to drop an additional header in the stream that tells 
 the code it is a JSON response and route it to the right handler 
 accordingly. If the header is missing, then do things the old way with Java 
 Serialization?

 Are there any plans to fix this? Or perhaps to entertain a Pull Request?
 It may seem just an annoyance, but it is actually pretty bad, in that it 
 keeps Clients from seeing their real issues. Especially in shops where it 
 is difficult to see the Production logs of Elasticsearch itself. 

 Thanks much,
 -- Chris 






