[Hadoop] - How write operation is divided into tasks?

2015-05-06 Thread piyush goyal
Hi

A very basic question about the implementation, best explained through an 
example.

Architecture: A 3-node cluster with a single index and 32 shards. One type 
holds several months of data, at roughly 40K-50K documents per month. A 
routing value built from the month and year is used to route this data to a 
shard. So, in short, 1 month of data goes to 1 shard.

Requirement: A simple one: pass a query, get the data, update each document 
and insert it back into the same shard. Since the index has 32 shards, the 
read creates 32 tasks; each task fetches 1 month of data, updates it and 
sends it back to ES for writing with the same routing value so that it 
overwrites the previous document.

Flow: The retrieval seems easy: 32 tasks are created, one per shard, and 
they bring the data into a single RDD. The next step updates each document. 
After that comes the write step, which raises the following question:

How does the write operation divide itself into tasks?
Going by the documentation, it depends on es.batch.size.bytes and 
es.batch.size.entries; the values of these two properties define the number 
of tasks. What I presumed was that the RDD is repartitioned into n tasks 
depending on the values of these parameters, and that many tasks then run 
to index/update the data. However, when I ran a write operation with just 
5 documents and es.batch.size.entries set to 10,000, I still saw as many as 
32 tasks doing a write operation on my es.resource. I am still confused 
about how the task allocation works here. Can you please explain?
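For what it is worth, the observation is consistent with Spark running one task per RDD partition: an RDD read from a 32-shard index typically carries 32 partitions, so writing it back runs 32 tasks even for 5 documents. The elasticsearch-hadoop documentation describes the es.batch.size.* settings as bulk sizes allocated per task, which suggests they control how often each task flushes a bulk request rather than how many tasks exist. Below is a minimal sketch of that reading in plain Python, with no Spark or ES involved. The function plan_write and its arguments are hypothetical names used only for illustration, and the byte-size threshold is ignored.

```python
from math import ceil

def plan_write(partition_sizes, batch_entries):
    """Return (task_count, bulk_requests_per_task) for a hypothetical write.

    partition_sizes: number of documents held by each RDD partition.
    batch_entries:   stand-in for es.batch.size.entries (flush threshold).
    """
    # One task per partition, regardless of how many documents it holds.
    task_count = len(partition_sizes)
    # Within a task, documents are flushed in bulk requests of at most
    # batch_entries documents; an empty partition sends nothing.
    bulks = [ceil(n / batch_entries) for n in partition_sizes]
    return task_count, bulks

# 5 documents spread over the 32 partitions of an RDD read from 32 shards.
sizes = [1] * 5 + [0] * 27
tasks, bulks = plan_write(sizes, batch_entries=10_000)
print(tasks)       # 32 tasks, even though only 5 of them hold data
print(sum(bulks))  # 5 bulk requests in total
```

Under this reading, changing the task count for a write would mean changing the number of partitions (for example by repartitioning the RDD), not the batch settings.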

Now comes another question: in a standalone write to ES, how does the code 
identify which shard holds which routing value? My assumption was that all 
the tasks send the data to an ES node, which then distributes the data to 
the shards itself based on the routing value, just like a normal bulk index 
operation.
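As background for this question: Elasticsearch documents its routing rule as shard = hash(routing) % number_of_primary_shards, evaluated on the server side by the node that receives the request. The sketch below mimics that rule in plain Python. The function pick_shard is a hypothetical name, and zlib.crc32 is only a stand-in for Elasticsearch's internal hash function, so the shard numbers it prints will not match a real cluster.

```python
import zlib

def pick_shard(routing, num_primary_shards):
    """Map a routing value to a shard number with a modulo over a hash."""
    # Stand-in for Elasticsearch's internal hash; CRC32 is an assumption
    # used only to keep the example deterministic.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

# Hypothetical month-and-year routing values.
routing_values = ["2015-01", "2015-02", "2015-03"]
for r in routing_values:
    print(r, "->", pick_shard(r, 32))
```

One consequence of a hash-and-modulo rule is that a given month always lands on the same shard, but two different months may also hash to the same shard, so "1 month of data goes to 1 shard" does not by itself guarantee that every shard holds exactly one month.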

Can you please explain how tasks are created for the two operations: 
read-update-write and write-only?

Thanks in advance
Piyush

-- 
Please update your bookmarks! We moved to https://discuss.elastic.co/
--- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/2d9dac53-da38-4309-8dc1-7440cb9479ae%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[Hadoop] - Difference between task creation for a write and read-update-write operation in ES

2015-05-06 Thread piyush goyal
Hi Costin,

I saw different task-creation behavior for writes to ES while working on my 
project. The difference is as follows:

1.) Write-only to ES - When I create an RDD of my own to insert data into 
ES, the tasks are created based on the properties es.batch.size.bytes and 
es.batch.size.entries. Number of tasks created = number of documents in the 
RDD / the value of either of these properties. The request hits the node, 
and the node decides which shard the document should be routed to based on 
the routing value (if specified).

2.) Read-update-write to ES - Consider the case where I have to read data 
from ES, store it in an RDD, update the documents in the RDD and then index 
these documents back into ES. While reading, the number of tasks created is 
based on the number of shards, and I presume that each task fetches data 
from one shard (not sure how that works; does the task delegate the request 
to the node to serve data from a particular shard?). Now when I try to 
update/re-index the data using the same RDD and the function 
saveToEsWithMeta, the number of tasks created does not follow point 1. If 
the data in each partition is smaller than es.batch.size.entries, it 
creates as many tasks as there are shards; otherwise it creates more.

What's the reason behind this? Also, like the read operation, where a 
request is served from a particular shard, does the write operation write to 
a particular shard, or do all the tasks delegate their requests to the node?

Thanks in advance
Piyush



Re: ES JAVA API | QueryBuilders.java | FilterBuilders does not have nullable queryBuilder as arguement

2015-02-11 Thread piyush goyal

OK. That helps a lot in understanding things, but I still feel that, since 
ES internally treats a missing query as a MatchAll query, the method 
definition should accept a Nullable query as well.

Thanks once again for helping out. :)

Regards
Piyush


On Wednesday, 11 February 2015 13:29:32 UTC+5:30, David Pilato wrote:

 Actually, in a filtered query, filters are applied first.
 The match all query then only says that all filtered documents match.
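The order described here can be pictured with a toy model in plain Python. This is only an illustration of "filter first, then let the query score what is left", not how Elasticsearch is implemented; the functions filtered and match_all and the sample documents are hypothetical.

```python
def match_all(_doc):
    """Every document matches with a constant score of 1.0."""
    return 1.0

def filtered(docs, query, flt):
    """Toy filtered query: apply the filter, then score the survivors."""
    hits = []
    for doc in docs:
        if not flt(doc):
            continue  # documents rejected by the filter are never scored
        score = query(doc)
        if score is not None:
            hits.append((doc["id"], score))
    return hits

docs = [
    {"id": 1, "response_timestamp": "2015-01-01"},
    {"id": 2, "response_timestamp": "2015-01-02"},
    {"id": 3, "response_timestamp": "2015-01-01"},
]
hits = filtered(docs, match_all,
                lambda d: d["response_timestamp"] == "2015-01-01")
print(hits)  # [(1, 1.0), (3, 1.0)]
```

In this model the match all query adds no extra selection work: the result is exactly the set of documents that passed the filter, each with a score of 1.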


 David



Re: Large results sets and paging for Aggregations

2015-02-11 Thread piyush goyal
Aah..! This seems to be the best explanation of how aggregations work. 
Thanks a ton, Mark, for that. :) A few other questions:

1.) Should I assume that as my document count increases, the time to 
compute the aggregation increases as well? Reason: I am trying to figure 
out whether bucket creation happens at the individual shard level; if so, 
the document counting would run in parallel on each shard, which would 
decrease the execution time significantly. Also, at the shard level, as the 
number of documents matching the query grows, and assuming the process is 
linear in time, the execution time would increase.

2.) How would I relate this analogy to sub-aggregations? My observation is 
that as you increase the number of child aggregations, both the execution 
time and the memory utilization increase. What happens in the case of 
sub-aggregations?

3.) I didn't get your last statement:
 There is however a fixed overhead for all queries which *is* a 
function of number of docs and that is the Field Data cache required to 
hold the dates/member IDs in RAM - if this becomes a problem then you may 
want to look at on-disk alternative structure in the form of DocValues.
 
 4.) Off topic, but I guess it is best to ask here since we are talking 
about it. :) - DocValues - Since they were introduced in 1.0.0 and most of 
our mappings were defined in ES 0.9, can I change the mapping of existing 
fields now? Maybe I should take this conversation to another thread, but I 
would love to hear about points 1-3. You made this thread very interesting 
for me.

Thanks
Piyush 



On Wednesday, 11 February 2015 15:12:37 UTC+5:30, Mark Harwood wrote:

 5k doesn't sound  too scary.

 Think of the aggs tree like a Bean Machine [1] - one of those wooden 
 boards with pins arranged on it like a Christmas tree, where you drop balls 
 at the top of the board and they rattle down a choice of path to the bottom.
 In the case of aggs, your buckets are the pins and documents are the balls.

 The memory requirement for processing the agg tree is typically the number 
 of pins, not the number of balls you drop into the tree as these just fall 
 out of the bottom of the tree.
 So in your case it is 5k members multiplied by 12 months each = 60k unique 
 buckets, each of which will maintain a counter of how many docs pass 
 through that point. So you could pass millions or billions of docs through 
 and the working memory requirement for the query would be the same.
 There is however a fixed overhead for all queries which *is* a function 
 of number of docs and that is the Field Data cache required to hold the 
 dates/member IDs in RAM - if this becomes a problem then you may want to 
 look at on-disk alternative structure in the form of DocValues.
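The pins-versus-balls point can be illustrated with a small Python sketch: one counter per (member, month) bucket, with documents streamed through and not retained. The names count_buckets and make_docs are hypothetical, the data is random, and the sketch uses 50 members instead of 5k to keep it quick. It does not model the Field Data cache mentioned above.

```python
from collections import Counter
import random

def count_buckets(docs):
    """Keep one counter per (member, month) bucket; documents are not stored."""
    buckets = Counter()
    for member_id, month in docs:
        buckets[(member_id, month)] += 1
    return buckets

def make_docs(n, members=50, months=12, seed=0):
    """Stream n synthetic (member_id, month) pairs without building a list."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.randrange(members), rng.randrange(months)

small = count_buckets(make_docs(10_000))
large = count_buckets(make_docs(200_000))
# Both bucket counts are capped at 50 * 12 = 600, however many docs pass through.
print(len(small), len(large))
```

The working state is the Counter, whose size is bounded by members times months, while the number of documents only changes the values stored in it.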

 Hope that helps.

 [1] http://en.wikipedia.org/wiki/Bean_machine

 On Wednesday, February 11, 2015 at 7:04:04 AM UTC, piyush goyal wrote:

 Hi Mark,

 Before getting into queries, here is a little bit info about the project:

 1.) A community whose membership keeps growing, shrinking and changing. 
 Members are maintained in a separate type.
 2.) Approximately 3K to 4K documents per user are inserted into ES each 
 month, in a different type, keyed by member ID.
 3.) The mapping is flat; there are no nested or array fields.

 Requirement:

 Here is a sample requirement:

 1.) Getting a report of the document count per member ID for the last 
 three months.
 2.) Query used to get the data is:

 {
   "query": {
     "constant_score": {
       "filter": {
         "bool": {
           "must": [
             {
               "term": {
                 "datatype": "XYZ"
               }
             },
             {
               "range": {
                 "response_timestamp": {
                   "from": "2014-11-01",
                   "to": "2015-01-31"
                 }
               }
             }
           ]
         }
       }
     }
   },
   "aggs": {
     "memberIDAggs": {
       "terms": {
         "field": "member_id",
         "size": 0
       },
       "aggs": {
         "dateHistAggs": {
           "date_histogram": {
             "field": "response_timestamp",
             "interval": "month"
           }
         }
       }
     }
   },
   "size": 0
 }

 The current member count is approximately 1K and will increase to 5K in 
 the next 10 months. That means 5K * 4K * 3 documents will be used for this 
 aggregation, which I guess is a major hit on the system. And this is only 
 two levels of aggregation. The next requirement from our analysts is to 
 split the per-month data into three different categories.
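The arithmetic behind that estimate, worked out in Python. It assumes the upper figure of 4K documents per member per month from the project description and the three-month range used in the query; the variable names are illustrative only.

```python
members = 5_000                    # projected member count
docs_per_member_per_month = 4_000  # upper end of the 3K-4K estimate
months = 3                         # 2014-11 through 2015-01

docs_scanned = members * docs_per_member_per_month * months
buckets = members * months         # one date_histogram bucket per member per month

print(f"{docs_scanned:,} documents feed the aggregation")   # 60,000,000
print(f"{buckets:,} (member, month) buckets hold counters")  # 15,000
```

So roughly 60 million documents pass through the aggregation, while the number of distinct (member, month) buckets is 15,000.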

 What is the optimum solution to this problem?

 Regards
 Piyush

 On Tuesday, 10 February 2015 16:15:22 UTC+5:30, Mark Harwood wrote:

 these kind of queries are hit more for qualitative analysis.

 Do you have any example queries? The pay as you go summarisation need 
 not be about just maintaining quantities.  In the demo here [1] I derive 
 profile names for people, categorizing them as newbies, fanboys or 
 haters based on a history of their reviewing behaviours

Re: Large results sets and paging for Aggregations

2015-02-10 Thread piyush goyal
Thanks Mark. Your pay-as-you-go suggestion seems amazing. But considering 
the dynamics of the application, these kinds of queries are run mostly for 
qualitative analysis. There are hundreds of such queries (I am not 
exaggerating) being run daily by our analytics team. Keeping count of all 
those qualitative checks daily and maintaining them as documents is a 
headache in itself. Additions, updates and removals of these documents 
would cause us huge maintenance overhead. Hence I was thinking of some form 
of pagination on aggregations, which would definitely help us keep ES 
memory problems away.

By the way, are there any other strategies suggested by ES for this kind of 
scenario?

Thanks

On Tuesday, 10 February 2015 15:20:40 UTC+5:30, Mark Harwood wrote:

  Why can't aggs be based on shard based calculations 

 They are. The shard_size setting will determine how many member 
 *summaries* will be returned from each shard - we won't stream each 
 member's thousands of related records back to a centralized point to 
 compute a final result. The final step is to summarise the summaries from 
 each shard.
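The two-step shape described here (summarise on each shard, then summarise the summaries) can be sketched in plain Python. This is a simplified model, not Elasticsearch code: shard_summary, reduce_summaries and the sample member IDs are hypothetical, and shard_size is modelled as "how many entries each shard returns".

```python
from collections import Counter

def shard_summary(shard_docs, shard_size):
    """Each shard counts its own members and returns only its top entries."""
    return Counter(shard_docs).most_common(shard_size)

def reduce_summaries(summaries, size):
    """The final step merges the per-shard summaries into one result."""
    merged = Counter()
    for summary in summaries:
        for member_id, count in summary:
            merged[member_id] += count
    return merged.most_common(size)

# Hypothetical data: the member ID of each document stored on two shards.
shard_a = ["m1"] * 5 + ["m2"] * 3
shard_b = ["m3"] * 4 + ["m4"] * 1
top = reduce_summaries(
    [shard_summary(shard_a, 10), shard_summary(shard_b, 10)], size=3
)
print(top)  # [('m1', 5), ('m3', 4), ('m2', 3)]
```

Only the small per-shard summaries travel to the merging step; the individual records never leave their shard.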

  if the number of members keep on increasing, day by day ES has to keep 
 more and more data into memory to calculate the aggs

 This is a different point to the one above (shard-level computation vs 
 memory costs). If your analysis involves summarising the behaviours of 
 large numbers of people over time then you may well find the cost of doing 
 this in a single query too high when the numbers of people are extremely 
 large. There is a cost to any computation and in that scenario you have 
 deferred all these member-summarising costs to the very last moment. A 
 better strategy for large-scale analysis of behaviours over time is to use 
 a pay-as-you-go model where you update a per-member summary document at 
 regular intervals with batches of their related records. This shifts the 
 bulk of the computation cost from your single query to many smaller costs 
 when writing data. You can then perform efficient aggs or scan/scroll 
 operations on *member* documents with pre-summarised attributes e.g. 
 totalSpend rather than deriving these properties on-the-fly from records 
 with a shared member ID.
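A minimal sketch of the pay-as-you-go idea in plain Python, using an in-memory dictionary as a stand-in for an index of per-member summary documents. The names summaries and apply_batch and the record fields member_id and spend are hypothetical; totalSpend is the example attribute mentioned above.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for an index of per-member summary docs.
summaries = defaultdict(lambda: {"record_count": 0, "totalSpend": 0.0})

def apply_batch(records):
    """Fold a batch of raw records into the per-member summaries."""
    for rec in records:
        doc = summaries[rec["member_id"]]
        doc["record_count"] += 1
        doc["totalSpend"] += rec["spend"]

# Two batches arriving at different times.
apply_batch([
    {"member_id": "m1", "spend": 10.0},
    {"member_id": "m2", "spend": 4.5},
])
apply_batch([{"member_id": "m1", "spend": 2.5}])

print(summaries["m1"])  # {'record_count': 2, 'totalSpend': 12.5}
```

Each batch costs a little at write time, and a later report reads one small summary per member instead of re-deriving it from every raw record.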



 On Tuesday, February 10, 2015 at 7:03:17 AM UTC, piyush goyal wrote:

 Well, in my use case I have tens of thousands of records for each member. 
 I want to do a simple terms agg on member ID. If the number of member IDs 
 stays the same throughout, good enough; but if the number of members keeps 
 increasing, then day by day ES has to keep more and more data in memory to 
 calculate the aggs. That does not sound very promising. We use routing to 
 put member-specific data into a particular shard. Why can't aggs be based 
 on shard-based calculations, so that I am safe from loading tons of data 
 into memory?

 Any thoughts?

 On Sunday, 9 November 2014 22:58:12 UTC+5:30, pulkitsinghal wrote:

 Sharing a response I received from Igor Motov:

 Scroll works only to page results. Paging aggs doesn't make sense since 
 aggs are executed on the entire result set. Therefore if it managed to fit 
 into the memory you should just get it. Paging will mean that you throw 
 away a lot of results that were already calculated. The only way to page 
 is by limiting the results that you are running aggs on, for example if 
 your data is sorted by date and you want to build a histogram for the 
 results one date range at a time.
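The "one date range at a time" approach can be sketched by generating one request body per month and running them separately. The snippet below only builds the bodies and sends nothing to a cluster. The helper names month_ranges and agg_request are hypothetical; the field names, the aggregation name and the from/to range syntax are borrowed from the query posted elsewhere in this thread.

```python
from datetime import date, timedelta
import json

def month_ranges(start, end):
    """Yield (first_day, last_day) for each month between start and end."""
    current = date(start.year, start.month, 1)
    while current <= end:
        if current.month == 12:
            nxt = date(current.year + 1, 1, 1)
        else:
            nxt = date(current.year, current.month + 1, 1)
        # Clamp the first and last slices to the requested bounds.
        yield max(current, start), min(nxt - timedelta(days=1), end)
        current = nxt

def agg_request(first, last):
    """Build one request body that aggregates a single date range."""
    return {
        "size": 0,
        "query": {"constant_score": {"filter": {"range": {
            "response_timestamp": {
                "from": first.isoformat(),
                "to": last.isoformat(),
            }
        }}}},
        "aggs": {"memberIDAggs": {"terms": {"field": "member_id", "size": 0}}},
    }

bodies = [agg_request(a, b)
          for a, b in month_ranges(date(2014, 11, 1), date(2015, 1, 31))]
print(len(bodies))  # 3
print(json.dumps(bodies[0]["query"]))
```

Each body covers a smaller slice of documents, so the per-request cost is bounded by one month of data, at the price of combining the per-month results on the client side.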






ES JAVA API | QueryBuilders.java | FilterBuilders does not have nullable queryBuilder as arguement

2015-02-10 Thread piyush goyal
Hi All,

If I try to write a filtered query through Sense, it lets me add just a 
filter; the query is not a mandatory field. For example:
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "response_timestamp": "2015-01-01"
        }
      }
    }
  }
}

is a valid query through Sense. However, if I try to implement the same 
through the Java API, I have to use the abstract class QueryBuilders.java 
and its method:

 filteredQuery(QueryBuilder queryBuilder, @Nullable FilterBuilder filterBuilder)



Please note that here the FilterBuilder argument is nullable and the 
QueryBuilder argument is not, which means that I eventually have to write a 
query inside the filtered part. If this is correct, then how can I write a 
complete query with aggregations such that no score is calculated and the 
response time is faster?
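For reference, the filter-only request above can be written with an explicit match_all in the query part, which is the shape the replies in this thread suggest. The Python sketch below only builds the JSON body; the helper name filter_only_body is hypothetical. On the Java side the analogous move would presumably be to pass QueryBuilders.matchAllQuery() as the first argument of filteredQuery.

```python
import json

def filter_only_body(field, value):
    """Request body for a filtered query whose query part matches everything."""
    return {
        "query": {
            "filtered": {
                # Explicit match_all: every document that passes the filter
                # matches, all with the same score.
                "query": {"match_all": {}},
                "filter": {"term": {field: value}},
            }
        }
    }

body = filter_only_body("response_timestamp", "2015-01-01")
print(json.dumps(body, indent=2))
```

Because the query part matches everything with a constant score, the filter alone decides which documents are returned.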

Regards
Piyush



Re: ES JAVA API | QueryBuilders.java | FilterBuilders does not have nullable queryBuilder as arguement

2015-02-10 Thread piyush goyal
I don't need the query part. All I need is a filter.

Not sure how matchAllQuery will help.

On Tuesday, 10 February 2015 16:14:51 UTC+5:30, David Pilato wrote:

 Use a matchAllQuery for the query part.
 All scores will be set to 1.

 --
 David ;-)
 Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs




Re: ES JAVA API | QueryBuilders.java | FilterBuilders does not have nullable queryBuilder as arguement

2015-02-10 Thread piyush goyal
Ahh.. Now I get you.

But does that mean that if my REST query through Sense has no query part and 
only a filter part, ES adds a matchAll query as the query part by default? 
One more question; I may be asking the wrong question, but I just want to 
clear up my doubts.

So now this query will do two separate things:

1.) What a normal match all query does.
2.) Whatever my filter operations are, the query will perform those as 
well.

And then ES picks the intersection of the two. Am I right?

Regards
Piyush

On Wednesday, 11 February 2015 12:25:43 UTC+5:30, David Pilato wrote:

 I think I answered.
 This is what is done by default in REST: 
 https://github.com/elasticsearch/elasticsearch/blob/1816951b6b0320e7a011436c7c7519ec2bfabc6e/src/main/java/org/elasticsearch/index/query/FilteredQueryParser.java#L54

 David



Re: ES JAVA API | QueryBuilders.java | FilterBuilders does not have nullable queryBuilder as argument

2015-02-10 Thread piyush goyal
Hi Folks,

Any inputs?

Regards
Piyush

On Tuesday, 10 February 2015 16:23:43 UTC+5:30, piyush goyal wrote:

 I don't need the query part. All I need is a filter.

 Not sure how matchAllQuery will help.

 On Tuesday, 10 February 2015 16:14:51 UTC+5:30, David Pilato wrote:

 Use a matchAllQuery for the query part.
 All scores will be set to 1.

 --
 David ;-)
 Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

 On 10 Feb 2015 at 11:24, piyush goyal coolpi...@gmail.com wrote:

 Hi All,

 When I write a filtered query through Sense, I can specify just a filter; 
 the query is not a mandatory field. For example:

 {
   "query": {
     "filtered": {
       "filter": {
         "term": {
           "response_timestamp": "2015-01-01"
         }
       }
     }
   }
 }

 is a valid query through Sense. However, to implement the same thing 
 through the Java API, I have to use the abstract class QueryBuilders.java 
 and its method:

   filteredQuery(QueryBuilder queryBuilder, @Nullable FilterBuilder filterBuilder)

 Please note that the FilterBuilder argument is nullable and the 
 QueryBuilder argument is not, which means I eventually have to write a 
 query inside the filtered part. If this is correct, how can I write a 
 complete query with aggregations so that no score is calculated and the 
 response time is faster?

 Regards
 Piyush



Re: Large results sets and paging for Aggregations

2015-02-10 Thread piyush goyal
Hi Mark,

Before getting into queries, here is some background on the project:

1.) A community whose membership keeps growing, shrinking and changing. 
Members are maintained in a separate type.
2.) Approximately 3K to 4K documents per user are inserted into ES each 
month, in a different type keyed by member ID.
3.) The mapping is flat; there are no nested or array fields.

Requirement:

Here is a sample requirement:

1.) A report giving, for each member ID, the count of documents over the 
last three months.
2.) The query used to get the data is:

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "datatype": "XYZ"
              }
            },
            {
              "range": {
                "response_timestamp": {
                  "from": "2014-11-01",
                  "to": "2015-01-31"
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "memberIDAggs": {
      "terms": {
        "field": "member_id",
        "size": 0
      },
      "aggs": {
        "dateHistAggs": {
          "date_histogram": {
            "field": "response_timestamp",
            "interval": "month"
          }
        }
      }
    }
  },
  "size": 0
}

The current member count is approximately 1K and will grow to 5K in the 
next 10 months. That means 5K members * 4K documents * 3 months, roughly 
60 million documents, feeding this aggregation, which I expect to be a 
major hit on the system. And this is only two levels of aggregation. The 
next requirement from our analysts is to split the per-month data into 
three different categories.

What is the optimum solution to this problem?

Regards
Piyush
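
For reference, a small sketch that builds the request above as a Python dictionary and works through the volume estimate from the message. The function names are hypothetical; the field names and figures are the ones quoted above, and the bucket count is only a rough upper bound that assumes every member has data in every month.

```python
import json


def member_month_report(datatype, date_from, date_to):
    """Build the two-level aggregation request from the message (no hits returned)."""
    return {
        "query": {"constant_score": {"filter": {"bool": {"must": [
            {"term": {"datatype": datatype}},
            {"range": {"response_timestamp": {"from": date_from,
                                              "to": date_to}}},
        ]}}}},
        "aggs": {"memberIDAggs": {
            # size 0 asks for a bucket for every member ID
            "terms": {"field": "member_id", "size": 0},
            "aggs": {"dateHistAggs": {"date_histogram": {
                "field": "response_timestamp", "interval": "month"}}},
        }},
        "size": 0,
    }


def estimated_docs(members, docs_per_member_per_month, months):
    """Documents the aggregation has to visit."""
    return members * docs_per_member_per_month * months


def estimated_buckets(members, months):
    """Upper bound on (member, month) buckets in the response."""
    return members * months


request_body = member_month_report("XYZ", "2014-11-01", "2015-01-31")
print(json.dumps(request_body, indent=2))
print(estimated_docs(5000, 4000, 3))
print(estimated_buckets(5000, 3))
```

With the projected 5K members this comes to 60 million documents visited and up to 15 thousand buckets returned per request.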

On Tuesday, 10 February 2015 16:15:22 UTC+5:30, Mark Harwood wrote:

 > these kinds of queries are hit more for qualitative analysis.

 Do you have any example queries? The pay as you go summarisation need 
 not be about just maintaining quantities.  In the demo here [1] I derive 
 profile names for people, categorizing them as newbies, fanboys or 
 haters based on a history of their reviewing behaviours in a marketplace. 

 > By the way, are there any other strategies suggested by ES for these kinds 
 > of scenarios?

 Igor hit on one which is to use some criteria eg. date to limit the volume 
 of what you analyze in any one query request.

 [1] 
 http://www.elasticsearch.org/videos/entity-centric-indexing-london-meetup-sep-2014/



 On Tuesday, February 10, 2015 at 10:05:24 AM UTC, piyush goyal wrote:

 Thanks Mark. Your pay-as-you-go suggestion sounds great. But given the 
 dynamics of the application, these kinds of queries are hit more for 
 qualitative analysis. There are hundreds of such queries (I am not 
 exaggerating) run daily by our analytics team. Keeping a daily count for 
 all of those qualitative checks and maintaining them as documents is a 
 headache in itself; adding, updating and removing these documents would 
 cause us huge maintenance overhead. Hence I was hoping for some form of 
 pagination on aggregations, which would definitely help us keep ES memory 
 leaks away.

 By the way, are there any other strategies suggested by ES for these kinds 
 of scenarios?

 Thanks

 On Tuesday, 10 February 2015 15:20:40 UTC+5:30, Mark Harwood wrote:

 > Why can't aggs be based on shard based calculations

 They are. The shard_size setting will determine how many member 
 *summaries* will be returned from each shard - we won't stream each 
 member's thousands of related records back to a centralized point to 
 compute a final result. The final step is to summarise the summaries from 
 each shard.
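
A simplified model of the two-step reduction described above, offered as a sketch rather than as ES's actual implementation. The function names and the sample member IDs are made up; only the idea (each shard returns at most shard_size summaries, and the coordinating step sums them) comes from the explanation.

```python
from collections import Counter


def shard_summary(member_ids, shard_size):
    """Count documents per member on one shard, keeping the shard_size largest."""
    return Counter(member_ids).most_common(shard_size)


def reduce_summaries(shard_summaries, size):
    """Summarise the per-shard summaries into the final top 'size' members."""
    merged = Counter()
    for summary in shard_summaries:
        for member_id, count in summary:
            merged[member_id] += count
    return merged.most_common(size)


# One list of member IDs per shard; each entry stands for one document.
shard_1 = ["a", "a", "b"]
shard_2 = ["a", "c", "c", "c"]

summaries = [shard_summary(shard, shard_size=2) for shard in (shard_1, shard_2)]
top_members = reduce_summaries(summaries, size=2)
print(top_members)
```

Only the small (member, count) pairs travel between the steps; the underlying records never leave their shard. If routing keeps each member's documents on a single shard, as in the setup described in this thread, each member's count comes from exactly one summary.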

 > if the number of members keep on increasing, day by day ES has to keep 
 > more and more data into memory to calculate the aggs

 This is a different point to the one above (shard-level computation vs 
 memory costs). If your analysis involves summarising the behaviours of 
 large numbers of people over time then you may well find the cost of doing 
 this in a single query too high when the numbers of people are extremely 
 large. There is a cost to any computation and in that scenario you have 
 deferred all these member-summarising costs to the very last moment. A 
 better strategy for large-scale analysis of behaviours over time is to use 
 a pay-as-you-go model where you update a per-member summary document at 
 regular intervals with batches of their related records. This shifts the 
 bulk of the computation cost from your single query to many smaller costs 
 when writing data. You can then perform efficient aggs or scan/scroll 
 operations on *member* documents with pre-summarised attributes e.g. 
 totalSpend rather than deriving these properties on-the-fly from records 
 with a shared member ID.
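
The pay-as-you-go idea can be sketched in memory as follows. This is an illustration under assumptions, not a recommended schema: the summary layout (total_records, per_month) and the function name apply_batch are hypothetical, and a real setup would write each summary back to ES as a per-member document instead of keeping a Python dict.

```python
def apply_batch(summaries, batch):
    """Fold one batch of raw records into the per-member summary documents."""
    for record in batch:
        member_id = record["member_id"]
        summary = summaries.setdefault(
            member_id,
            {"member_id": member_id, "total_records": 0, "per_month": {}},
        )
        month = record["response_timestamp"][:7]  # "2015-01-15" -> "2015-01"
        summary["total_records"] += 1
        summary["per_month"][month] = summary["per_month"].get(month, 0) + 1
    return summaries


summaries = {}
apply_batch(summaries, [
    {"member_id": "m1", "response_timestamp": "2014-12-30"},
    {"member_id": "m1", "response_timestamp": "2015-01-02"},
    {"member_id": "m2", "response_timestamp": "2015-01-05"},
])
apply_batch(summaries, [
    {"member_id": "m1", "response_timestamp": "2015-01-20"},
])
print(summaries["m1"])
```

Each batch costs a little at write time, and a later report reads one small summary per member instead of re-aggregating every raw record.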



 On Tuesday, February 10, 2015 at 7:03:17 AM UTC, piyush goyal wrote:

 Well, my use case says I have tens of thousands of records for each 
 members. I want to do a simple terms aggs on member ID. If my count of 
 member ID remains same throughout .. good enough, if the number of members 
 keep

Re: Large results sets and paging for Aggregations

2015-02-09 Thread piyush goyal
Well, in my use case I have tens of thousands of records for each member, 
and I want to do a simple terms aggregation on member ID. If the number of 
member IDs stays the same throughout, good enough; but if the number of 
members keeps increasing, ES has to keep more and more data in memory day 
by day to calculate the aggs. That does not sound very promising. We use 
routing to put each member's data into a particular shard. Why can't aggs 
be computed per shard, so that I am safe from loading tons of data into 
memory?

Any thoughts?

On Sunday, 9 November 2014 22:58:12 UTC+5:30, pulkitsinghal wrote:

 Sharing a response I received from Igor Motov:

 Scroll works only to page results. Paging aggs doesn't make sense since 
 aggs are executed on the entire result set. Therefore, if it managed to fit 
 into memory, you should just get it. Paging would mean that you throw 
 away a lot of results that were already calculated. The only way to page 
 is by limiting the results that you are running aggs on, for example if 
 your data is sorted by date and you want to build a histogram for the 
 results one date range at a time.
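
One way to read Igor's suggestion is to split the reporting period into date windows and run the aggregation once per window. The sketch below only generates the windows and the request bodies; the function names are hypothetical, the field names member_id and response_timestamp are borrowed from the query elsewhere in this thread, and sending the requests is left out.

```python
import calendar
from datetime import date


def month_windows(start, end):
    """Yield (first_day, last_day) for each calendar month in [start, end]."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        first = max(start, date(year, month, 1))
        last_day = calendar.monthrange(year, month)[1]
        last = min(end, date(year, month, last_day))
        yield first, last
        month += 1
        if month > 12:
            year, month = year + 1, 1


def windowed_request(first, last):
    """Terms aggregation on member_id restricted to one date window."""
    return {
        "query": {"constant_score": {"filter": {"range": {
            "response_timestamp": {"from": first.isoformat(),
                                   "to": last.isoformat()}}}}},
        "aggs": {"memberIDAggs": {"terms": {"field": "member_id", "size": 0}}},
        "size": 0,
    }


windows = list(month_windows(date(2014, 11, 1), date(2015, 1, 31)))
requests = [windowed_request(first, last) for first, last in windows]
for first, last in windows:
    print(first, last)
```

Each request then aggregates over a single month of data, and the caller combines the per-window results.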



