Re: Looking for a best practice to get all data according to some filters

2014-12-14 Thread David Pilato
The implication is the memory that needs to be allocated on each shard.


David

 On 14 Dec 2014, at 05:46, Ron Sher ron.s...@gmail.com wrote:
 
 Again, why not use a very large count size? What are the implications of 
 using a very large count?
 Regarding performance - it seems doing one request with a very large count 
 performs better than using scan/scroll (with a count of 100 using 32 shards).
 

-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/7717B0E2-E971-4653-A0A7-BA66EC3EAE9F%40pilato.fr.
For more options, visit https://groups.google.com/d/optout.


Re: Looking for a best practice to get all data according to some filters

2014-12-14 Thread Nikolas Everett
Search consumes O(offset + size) memory and O((offset + size) * log(offset + size))
CPU. Scan/scroll has higher overhead but stays O(size) the whole time. I don't
know the break-even point.

The other thing is that scroll provides a consistent snapshot. That means it
holds on to resources, so it's not something you should expose to end users,
but it won't miss results or return duplicates the way paging with an
increasing offset can.

You can certainly do large fetches with a big size, but it's less stable in
general.

Finally, scan/scroll has always been pretty quick for me. I usually use a
batch size in the thousands.

Nik
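To make the O(offset + size) point above concrete, here is a minimal, self-contained sketch (plain Python, no Elasticsearch client; the shard contents are made up) of what the coordinating node does for a from/size query: every shard must ship its top (from + size) hits, and the coordinator merges them before discarding the first `from`:

```python
import heapq

def coordinate(shard_hits, offset, size):
    """Merge per-shard result lists the way a from/size search does.

    Each shard must ship its top (offset + size) hits to the
    coordinator, so memory grows with offset + size per shard.
    shard_hits: list of lists of (score, doc_id), each sorted descending.
    """
    top_per_shard = [hits[:offset + size] for hits in shard_hits]
    # heapq.merge keeps the streams sorted; negate scores for descending order.
    merged = heapq.merge(*[[(-s, d) for s, d in hits] for hits in top_per_shard])
    # Only now can the coordinator drop the first `offset` hits.
    return [(-s, d) for s, d in merged][offset:offset + size]

shards = [
    [(9.1, "a"), (7.0, "b"), (3.2, "c")],
    [(8.5, "d"), (6.6, "e"), (1.0, "f")],
]
page = coordinate(shards, 1, 2)
print(page)  # the second- and third-best hits overall
```

Note that the work already done for the first `offset` hits is thrown away on every page, which is why deep paging degrades.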

Re: Looking for a best practice to get all data according to some filters

2014-12-14 Thread Jonathan Foy
Just to reword what others have said: as I understand it, ES will allocate memory 
for [size] scores (per shard?) regardless of the final result count. If you're 
getting back 4986 results from a query, it'd be faster to use size: 4986 than a 
much larger size.

What I've done in similar situations is to issue a count first with the same 
filter (which is very fast), then use the result of that in the size field. It 
worked much better/faster than using a default large size.
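The count-then-fetch pattern described above can be sketched as two calls. The transport is injected here so the logic runs standalone; a real client would POST the same bodies to the `_count` and `_search` endpoints (treat that wiring as an assumption):

```python
def fetch_all(filter_query, do_count, do_search):
    """Count first, then request exactly that many hits.

    do_count / do_search are injected callables standing in for the
    _count and _search endpoints (assumption: the caller wires them
    to a real HTTP client).
    """
    n = do_count({"query": filter_query})
    if n == 0:
        return []
    body = {"query": filter_query, "size": n}  # exact size, not a huge default
    return do_search(body)["hits"]["hits"]

# Fake endpoints for illustration:
docs = [{"_id": str(i)} for i in range(4986)]
count = lambda body: len(docs)
search = lambda body: {"hits": {"hits": docs[:body["size"]]}}

hits = fetch_all({"term": {"active": True}}, count, search)
print(len(hits))  # 4986
```

The race between the count and the search is the usual caveat: documents indexed in between may push the real total past `n`.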



Re: Looking for a best practice to get all data according to some filters

2014-12-13 Thread Ron Sher
Again, why not use a very large count size? What are the implications of 
using a very large count?
Regarding performance - it seems doing one request with a very large count 
performs better than using scan/scroll (with a count of 100 using 32 shards).



Re: Looking for a best practice to get all data according to some filters

2014-12-11 Thread Ron Sher
Just tested this.
When I used a large number to get all of my documents according to some 
criteria (4926 in the result) I got:
13.951s when using a size of 1M
43.6s when using scan/scroll (with a size of 100)

Looks like I should be using the not-recommended large size.
Can I make the scroll better?

Thanks,
Ron



Re: Looking for a best practice to get all data according to some filters

2014-12-11 Thread Dani Castro
Hi,
  I am facing the same situation:
We would like to get all the ids of the documents matching certain 
criteria. In the worst case (which is the one I am describing here), the 
documents matching the criteria would be around 200K, and in our first 
tests it is really slow (around 15 seconds). However, if we do the same 
query just to count the documents, ES replies in just 10-15ms, which is 
amazing.
I suspect the problem is in the transport layer and the latency generated 
by transferring a big JSON result.

Would you recommend, in a situation like this, using another transport 
layer like Thrift, or a custom solution?
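Before changing transports, one way to shrink that big JSON result is to ask for hit metadata only, so each hit carries just `_index`, `_type` and `_id` and no source. A minimal sketch of such a request body (the `"fields": []` behaviour is my reading of the 1.x search API; verify against your version):

```python
import json

def ids_only_body(filter_query, size):
    # "fields": [] asks Elasticsearch to return hit metadata only
    # (no _source), which keeps the response small even for ~200K hits.
    return {"query": filter_query, "fields": [], "size": size}

body = ids_only_body({"term": {"status": "active"}}, 200000)
print(json.dumps(body))
```

If the IDs are still too heavy in one response, the same body works with scan/scroll to stream them in batches.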

Thanks in advance


Looking for a best practice to get all data according to some filters

2014-12-10 Thread Ron Sher
Hi,

I was wondering about best practices to get all data according to some 
filters.
The options as I see them are:

   - Use a very big size that will return all accounts, i.e. use some value 
   like 1m to make sure I get everything back (even if I need just a few 
   hundred or a few dozen documents). This is the quickest way, development-wise.
   - Use paging - using size and from. This requires looping over the 
   result, and performance gets worse as we advance to later pages. Also, 
   we need to use preference if we want consistent results across the 
   pages. And it's not clear what the recommended size is for each page.
   - Use scan/scroll - this gives consistent paging but also has several 
   drawbacks: if I use search_type=scan then it can't be sorted; using 
   scan/scroll is (maybe) less performant than paging (the documentation says 
   it's not for realtime use); and again it's not clear which size is recommended.

So you see - many options and not clear which path to take.

What do you think?

Thanks,
Ron
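The from/size option above can be sketched as the sequence of request bodies one session would send; the `preference` value and the page size are illustrative assumptions (in practice `preference` travels as a URL parameter, noted in the comment):

```python
def page_bodies(filter_query, total, page_size=500, preference="my-session"):
    """Build the from/size request bodies for paging through `total` hits.

    preference pins the session to the same shard copies so pages stay
    consistent with each other; note that a deep 'from' gets expensive,
    since every shard must still score from + size hits per page.
    """
    bodies = []
    for start in range(0, total, page_size):
        bodies.append({
            "query": filter_query,
            "from": start,
            "size": page_size,
            "preference": preference,  # sent as a URL param in practice
        })
    return bodies

pages = page_bodies({"term": {"active": True}}, total=1200, page_size=500)
print([b["from"] for b in pages])  # [0, 500, 1000]
```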



Re: Looking for a best practice to get all data according to some filters

2014-12-10 Thread David Pilato
Scan/scroll is the best option to extract a huge amount of data.
Never use size:1000 or from:1000. 

It's not realtime because you basically scroll over a given set of segments and 
all new changes that will come in new segments won't be taken into account 
during the scroll.
Which is good because you won't get inconsistent results.

About size, I would try and test. It depends on your docs' size, I believe.
Try with 1 and see how it goes when you increase it. You may discover that 
getting 10*1 docs is the same as 1*10. :)

Best

David
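The scan/scroll pattern recommended above can be sketched as a generator. The transport is injected so the loop logic runs standalone; a real implementation would POST with `?search_type=scan&scroll=1m` first, then repeatedly call `_search/scroll` with the returned `_scroll_id` (treat the exact parameters as assumptions against your ES version):

```python
def scroll_all(start_scan, continue_scroll):
    """Yield every hit from a scan/scroll session.

    start_scan()        -> (scroll_id, [])    # scan's first response has no hits
    continue_scroll(id) -> (scroll_id, hits)  # an empty batch means done
    """
    scroll_id, _ = start_scan()
    while True:
        scroll_id, batch = continue_scroll(scroll_id)
        if not batch:
            break
        for hit in batch:
            yield hit

# Fake transport for illustration: three batches, then done.
batches = [["a", "b"], ["c"], []]
start = lambda: ("s0", [])
cont = lambda sid: (sid, batches.pop(0))

hits = list(scroll_all(start, cont))
print(hits)  # ['a', 'b', 'c']
```

The batch size (`size` on the initial request, which scan applies per shard) is what the "try and test" advice above refers to.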



Re: Looking for a best practice to get all data according to some filters

2014-12-10 Thread Ron Sher
So you're saying there's no impact on elasticsearch if I issue a large 
size? 
If that's the case then why shouldn't I just call size of 1M if I want to 
make sure I get everything?



Re: Looking for a best practice to get all data according to some filters

2014-12-10 Thread David Pilato
No, I did not say that, or at least I did not mean that. Sorry if it was unclear.
I said: don’t use large sizes:

 Never use size:1000 or from:1000. 


You should read this: 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan
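The scroll protocol the linked page documents boils down to: open a cursor once, then keep exchanging the returned scroll_id for the next batch until a batch comes back empty. Here is a minimal sketch of that loop; the `open_scroll`/`next_scroll` stubs simulate a cluster in memory (real code would make HTTP calls, and with search_type=scan in ES 1.x the initial response returns only a scroll_id, no hits, which the stub mirrors):

```python
def scroll_all(open_scroll, next_scroll, size=100):
    """Drain a scroll cursor: open it, then repeatedly fetch the next
    batch with the returned scroll_id until a batch is empty."""
    scroll_id, batch = open_scroll(size)
    hits = list(batch)
    while True:
        scroll_id, batch = next_scroll(scroll_id)
        if not batch:
            break
        hits.extend(batch)
    return hits

# Toy in-memory cursor standing in for a cluster.
_docs = [f"doc-{i}" for i in range(7)]
_state = {}

def open_scroll(size):
    _state["pos"], _state["size"] = 0, size
    return "scroll-1", []  # scan: first response carries no hits

def next_scroll(scroll_id):
    p, s = _state["pos"], _state["size"]
    _state["pos"] = p + s
    return scroll_id, _docs[p:p + s]

print(scroll_all(open_scroll, next_scroll, size=3))
```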

-- 
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs



 On 10 Dec 2014, at 21:16, Ron Sher ron.s...@gmail.com wrote:
 
 So you're saying there's no impact on Elasticsearch if I issue a large size? 
 If that's the case, then why shouldn't I just use a size of 1M when I want 
 to make sure I get everything?
 
 On Wednesday, December 10, 2014 8:22:47 PM UTC+2, David Pilato wrote:
 Scan/scroll is the best option to extract a huge amount of data.
 Never use size:1000 or from:1000. 
 
 It's not realtime because you basically scroll over a given set of segments; 
 any new changes that arrive in new segments won't be taken into account 
 during the scroll.
 Which is good, because you won't get inconsistent results.
 
 About size, I would try and test. It depends on your doc size, I believe.
 Try with 1 and see how it goes when you increase it. You may well 
 discover that getting 10*1 docs is the same as 1*10. :)
 
 Best
 
 David
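The try-and-test advice above can be automated as a small batch-size sweep. This is only a harness sketch: `fake_fetch` is a stub, and a real measurement would replace it with timed scroll requests against your own cluster:

```python
import time

def sweep_batch_sizes(fetch_batch, total_docs, sizes=(1, 10, 100, 1000)):
    """Time a full extraction at several batch sizes.

    `fetch_batch(offset, size)` stands in for one scroll round-trip;
    swap in a real client call to measure actual cluster behaviour.
    Returns {batch_size: elapsed_seconds}.
    """
    results = {}
    for size in sizes:
        start = time.perf_counter()
        fetched = 0
        while fetched < total_docs:
            fetched += len(fetch_batch(fetched, size))
        results[size] = time.perf_counter() - start
    return results

# Stub round-trip: returns `size` placeholder docs instantly.
def fake_fetch(offset, size):
    return [None] * size

print(sweep_batch_sizes(fake_fetch, total_docs=10_000))
```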
 
