Re: Using serialized doc_value instead of _source to improve read latency

Itai Frenkel Mon, 20 Apr 2015 16:58:44 -0700

A quick check shows no significant performance difference between a 
doc_value and a stored field that is not a doc value. I suppose warm-up 
and file system caching issues are at play. I do not have that field in 
the source, since the ETL process does not generate it at this point. The 
ETL could be fixed so that it generates the required field. However, even 
then I would still prefer a doc_value over _source, since I do not need 
_source at all. You are right to assume that reading the entire source, 
parsing it, and returning only one field would be fast (I suspect the CPU 
time goes to the JSON generator rather than the parser, but confirming 
that requires more work).
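
A minimal sketch of the three request bodies such a comparison could time 
against each other (after warming caches) is given below. The field name 
result_json is hypothetical, and the parameter names (_source, fields, 
fielddata_fields) are assumed to match the 1.x-era search API; later 
versions are understood to rename the last two (stored_fields, 
docvalue_fields), so check them against the version in use.

```python
import json

# Hypothetical field holding the pre-serialized result payload.
FIELD = "result_json"


def search_bodies(query, field=FIELD):
    """Return one search request body per retrieval strategy."""
    return {
        # Parse _source for each hit and project a single field out of it.
        "source_filter": {"query": query, "_source": [field]},
        # Do not return _source; load a field mapped as stored instead.
        "stored_field": {"query": query, "_source": False, "fields": [field]},
        # Do not return _source; read the value from fielddata / doc values.
        "doc_value": {
            "query": query,
            "_source": False,
            "fielddata_fields": [field],
        },
    }


if __name__ == "__main__":
    # Print each body as JSON, ready to send to the _search endpoint.
    for name, body in search_bodies({"match_all": {}}).items():
        print(name, json.dumps(body, sort_keys=True))
```

Running each body many times and discarding the first runs would help 
separate real differences from warm-up and file system cache effects.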



On Tuesday, April 21, 2015 at 2:25:22 AM UTC+3, Itamar Syn-Hershko wrote:
>
> What if all those fields are collapsed into one, as you suggest, but that 
> one field is projected out of _source (think non-indexed JSON in a string 
> field)? Do you see a noticeable performance gain then?
>
> What if that field is set to be stored (and loaded using fields, not via 
> _source)? What is the performance gain then?
>
> Fielddata, and the doc_values optimization on top of it, will not help you 
> here; those data structures aren't used for sending data out, only for 
> aggregations and sorting. Also, using fielddata would require indexing 
> those fields, and it is apparent that you are not looking to do that.
>
> --
>
> Itamar Syn-Hershko
> http://code972.com | @synhershko <https://twitter.com/synhershko>
> Freelance Developer & Consultant
> Lucene.NET committer and PMC member
>
> On Tue, Apr 21, 2015 at 12:14 AM, Itai Frenkel <itaif...@live.com> wrote:
>
>> Itamar,
>>
>> 1. The _source field includes many fields that are only indexed, and 
>> many fields that are only needed in the query search result. _source 
>> includes them both. Projecting fields out of _source is too CPU intensive 
>> to do at search time for each result, especially if the size is big.
>> 2. I agree that adding another NoSQL store could solve this problem; 
>> however, it is currently out of scope, as it would require syncing data 
>> with another data store.
>> 3. Wouldn't a big stored field bloat the Lucene index size? Even if not, 
>> aren't not_analyzed fields destined to be (or already) doc_values?
>>
>> On Tuesday, April 21, 2015 at 1:36:20 AM UTC+3, Itamar Syn-Hershko wrote:
>>>
>>> This is how _source works. doc_values don't make sense in this regard; 
>>> what you are looking for is to use stored fields and have the transform 
>>> script write to them. Loading stored fields (even one field per hit) may 
>>> be slower than loading and parsing _source, though.
>>>
>>> I'd just put this logic in the indexer, though. It will definitely help 
>>> with other things as well, such as nasty huge mappings.
>>>
>>> Alternatively, find a way to avoid IO completely. How about using ES for 
>>> search and something like Riak for loading the actual data, if IO costs 
>>> are so noticeable?
>>>
>>> --
>>>
>>> Itamar Syn-Hershko
>>> http://code972.com | @synhershko <https://twitter.com/synhershko>
>>> Freelance Developer & Consultant
>>> Lucene.NET committer and PMC member
>>>
>>> On Mon, Apr 20, 2015 at 11:18 PM, Itai Frenkel <itaif...@live.com> 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We are having a performance problem in which, for each hit, 
>>>> Elasticsearch parses the entire _source and then generates a new JSON 
>>>> document with only the requested _source fields. To overcome this 
>>>> issue, we would like to use a mapping transform script that serializes 
>>>> the requested fields (which are known in advance) into a doc_value. 
>>>> Does that make sense?
>>>>
>>>> The actual problem with the transform script is a SecurityException that 
>>>> does not allow using any JSON serialization mechanism. A binary 
>>>> serialization would also be OK.
>>>>
>>>>
>>>> Itai
>>>>
>>>>  -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to elasticsearc...@googlegroups.com.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/elasticsearch/b897aba2-c250-4474-a03f-1d2a993baef9%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/elasticsearch/b897aba2-c250-4474-a03f-1d2a993baef9%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>
>

