What is the correct _primary_first syntax? What is the relevant debug logger?

2015-04-26 Thread Itai Frenkel
Hello,

What is the correct syntax for using _primary_first in search and search 
template queries?

GET myindex/_search/template?preference=_primary_first

or

GET myindex/_search/template?routing=_primary_first

Is there any verbose mode that can log the list of shards that were 
actually accessed?
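For context, this is roughly how I build the request now, assuming preference (rather than routing) is the right parameter; the host and index names are illustrative:

```python
from urllib.parse import urlencode

# Sketch of the request I am building, assuming "preference" (not
# "routing") is the right query parameter for _primary_first.
# Host and index names are illustrative.
def search_template_url(host, index, preference="_primary_first"):
    """Build a search-template URL that prefers primary shards."""
    return "http://%s/%s/_search/template?%s" % (
        host, index, urlencode({"preference": preference}))

print(search_template_url("localhost:9200", "myindex"))
# http://localhost:9200/myindex/_search/template?preference=_primary_first
```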

thanks,
Itai

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/6f6d44a0-f689-4168-85cf-574610f73155%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Using serialized doc_value instead of _source to improve read latency

2015-04-21 Thread Itai Frenkel
The answer is these changes in elasticsearch.yml:

script.groovy.sandbox.class_whitelist: com.fasterxml.jackson.databind.ObjectMapper
script.groovy.sandbox.package_whitelist: com.fasterxml.jackson.databind

For some reason these classes are not shaded, even though the pom.xml does 
shade them.

On Tuesday, April 21, 2015 at 5:21:58 AM UTC+3, Itai Frenkel wrote:

 If I could focus the question better: how do I whitelist a specific 
 class in the Groovy script inside transform?






Using serialized doc_value instead of _source to improve read latency

2015-04-20 Thread Itai Frenkel
Hi,

We are having a performance problem: for each hit, elasticsearch parses the 
entire _source and then generates a new JSON document containing only the 
requested _source fields. To overcome this, we would like to use a mapping 
transform script that serializes the requested query fields (which are known 
in advance) into a doc_value. Does that make sense?

The actual problem with the transform script is a SecurityException that 
does not allow using any JSON serialization mechanism. A binary 
serialization would also be OK.
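Roughly, the mapping I have in mind looks like this (a sketch only; the field name and the serialize() call are placeholders for whatever the sandbox ends up allowing):

```
PUT myindex
{
  "mappings": {
    "mytype": {
      "transform": {
        "lang": "groovy",
        "script": "ctx._source['hit_payload'] = serialize(ctx._source)"
      },
      "properties": {
        "hit_payload": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}
```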


Itai



Re: Using serialized doc_value instead of _source to improve read latency

2015-04-20 Thread Itai Frenkel
Itamar,

1. The _source field includes many fields that are only indexed and many 
fields that are only needed in the search result; _source contains both. 
Projecting the requested fields out of _source is too CPU intensive to do 
at search time for each result, especially if the document is big.
2. I agree that adding another NoSQL store could solve this problem, but it 
is currently out of scope, as it would require syncing data with another 
data store.
3. Wouldn't a big stored field bloat the Lucene index size? Even if not, 
aren't non_analyzed fields destined to be (or already) doc_values?

On Tuesday, April 21, 2015 at 1:36:20 AM UTC+3, Itamar Syn-Hershko wrote:

 This is how _source works. doc_values don't make sense in this regard - 
 what you are looking for is using stored fields and have the transform 
 script write to that. Loading stored fields (even one field per hit) may be 
 slower than loading and parsing _source, though.

 I'd just put this logic in the indexer, though. It will definitely help 
 with other things as well, such as nasty huge mappings.

 Alternatively, find a way to avoid IO completely. How about using ES for 
 search and something like riak for loading the actual data, if IO costs are 
 so noticeable?

 --

 Itamar Syn-Hershko
 http://code972.com | @synhershko https://twitter.com/synhershko
 Freelance Developer & Consultant
 Lucene.NET committer and PMC member






Re: Using serialized doc_value instead of _source to improve read latency

2015-04-20 Thread Itai Frenkel
A quick check shows no significant performance difference between a 
doc_value and a stored field that is not a doc_value. I suppose warm-up and 
file-system caching effects are at play. I do not have that field in the 
source, since the ETL process does not currently generate it. The ETL could 
be fixed to generate the required field. However, even then I would prefer 
a doc_value over _source, since I do not need _source at all. You may be 
right that reading the entire source, parsing it, and returning only one 
field would be fast (I suspect the CPU time is in the JSON generator, not 
the parser, but confirming that requires more work).
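If we do fix the ETL, the idea is roughly this sketch: pre-render the fields the query will request into a single string field at index time, so search responses can return one field instead of re-projecting _source per hit (the field name result_json is illustrative):

```python
import json

# Sketch of the ETL-side fix: pre-render the subset of fields the query
# will ask for into one string field at index time. The field name
# "result_json" is illustrative.
def add_result_payload(doc, wanted_fields):
    doc["result_json"] = json.dumps(
        {f: doc[f] for f in wanted_fields if f in doc},
        sort_keys=True)
    return doc

doc = add_result_payload({"title": "t", "body": "long...", "views": 3},
                         ["title", "views"])
print(doc["result_json"])  # {"title": "t", "views": 3}
```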


On Tuesday, April 21, 2015 at 2:25:22 AM UTC+3, Itamar Syn-Hershko wrote:

 What if all those fields are collapsed to one, like you suggest, but that 
 one field is projected out of _source (think non-indexed json in a string 
 field)? Do you see a noticeable performance gain then?

 What if that field is set to be stored (and loaded using fields, not via 
 _source)? what is the performance gain then?

 Fielddata and the doc_values optimization on top of them will not help you 
 here, those data structures aren't being used for sending data out, only 
 for aggregations and sorting. Also, using fielddata will require indexing 
 those fields; it is apparent that you are not looking to be doing that.

 --

 Itamar Syn-Hershko
 http://code972.com | @synhershko https://twitter.com/synhershko
 Freelance Developer & Consultant
 Lucene.NET committer and PMC member






Re: Using serialized doc_value instead of _source to improve read latency

2015-04-20 Thread Itai Frenkel
Also, does "fielddata": { "loading": "eager" } make sense with 
doc_values in this use case? Would that combination be supported in the 
future?





Re: Using serialized doc_value instead of _source to improve read latency

2015-04-20 Thread Itai Frenkel
If I could focus the question better: how do I whitelist a specific class 
in the Groovy script inside transform?






Re: Using serialized doc_value instead of _source to improve read latency

2015-04-20 Thread Itai Frenkel
Hi Nik,

When "_source": true, the time it takes for the search to complete in 
elasticsearch is very short. When _source is a list of fields, it is 
significantly slower.

Itai

On Tuesday, April 21, 2015 at 3:06:06 AM UTC+3, Nikolas Everett wrote:

 Have you profiled it and seen that reading the source is actually the slow 
 part? hot_threads can lie here so I'd go with a profiler or just sigquit or 
 something.

 I've got some reasonably big documents and generally don't see that as a 
 problem even under decent load.

 I could see an argument for a second source field with the long stuff 
 removed, if you see the JSON decode or the disk read of the source being 
 really slow - but transform doesn't do that.

 Nik


tuning elasticsearch node client non-heap memory consumption

2015-01-13 Thread Itai Frenkel
Hello,

We are running a node client on each machine with a small JVM heap 
(-Xms384m -Xmx384m -Xss256k), which suits our use case.
There is, however, another ~288MB of resident non-heap memory (677MB - 389MB).
How can this extra non-heap memory usage be configured? What is it used for?
Below are the relevant node stats.

Regards,
Itai

"process": {
  "open_file_descriptors": 340,
  "mem": {
    "resident_in_bytes": 676884480,
    "share_in_bytes": 23248896,
    "total_virtual_in_bytes": 1696899072
  }
},
"jvm": {
  "mem": {
    "heap_used_in_bytes": 44794784,
    "heap_used_percent": 11,
    "heap_committed_in_bytes": 389283840,
    "heap_max_in_bytes": 389283840,
    "non_heap_used_in_bytes": 44208640,
    "non_heap_committed_in_bytes": 44564480,
    "pools": {
      "young": {
        "used_in_bytes": 13765016,
        "max_in_bytes": 107479040,
        "peak_used_in_bytes": 107479040,
        "peak_max_in_bytes": 107479040
      },
      "survivor": {
        "used_in_bytes": 5086896,
        "max_in_bytes": 13369344,
        "peak_used_in_bytes": 13369344,
        "peak_max_in_bytes": 13369344
      },
      "old": {
        "used_in_bytes": 25942872,
        "max_in_bytes": 268435456,
        "peak_used_in_bytes": 25942872,
        "peak_max_in_bytes": 268435456
      }
    }
  }
},
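For reference, this is how I computed the extra amount (numbers copied from the stats above):

```python
# Numbers copied from the node stats above. The remainder is resident
# memory the JVM uses outside the committed heap (permgen/metaspace,
# thread stacks, direct buffers, mmapped files, allocator overhead).
resident = 676884480        # process.mem.resident_in_bytes
heap_committed = 389283840  # jvm.mem.heap_committed_in_bytes

extra_mb = round((resident - heap_committed) / 1e6)
print(extra_mb)  # 288
```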



Recommended discovery mechanism for production cluster on ec2

2014-11-13 Thread Itai Frenkel
Hi,

What is the recommended discovery mechanism for production clusters on ec2?

Zen: My main concern is the possibility of dedicated master nodes changing 
their IP addresses (which can happen on ec2). Does zen use the configured 
unicast hosts just as a gossip seed, or is it a static list? If it is just a 
gossip seed, does zen persist the gossiped addresses back to disk in case a 
process restart is needed and the original IP addresses are out of date? If 
it does not persist gossiped IP addresses to disk, do I need a script that 
injects fresh seeds each time the elasticsearch service starts?

ec2: My main concern is the resiliency of the cluster if the ec2 API returns 
inconsistent results, causing some kind of split-brain scenario. Has such a 
thing been reported?

zookeeper: This plugin has been reported to handle split brain better than 
zen discovery. Is this observation still relevant in v1.4? I am willing to 
risk an unofficial plugin if it makes the cluster more stable.
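For reference, this is roughly the zen unicast setup I have in mind in elasticsearch.yml (the host list and the minimum_master_nodes value are illustrative, assuming three dedicated masters):

```
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["master-1:9300", "master-2:9300", "master-3:9300"]
discovery.zen.minimum_master_nodes: 2
```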


Regards,
Itai
