[Hadoop] - How is a write operation divided into tasks?
Hi,

A very basic question about implementation, best understood through an example.

Architecture: A 3-node cluster with a single index and 32 shards. One type holds several months of data, with roughly 40K-50K documents per month. A routing value built from the month and year routes this data to a shard, so in short one month of data goes to one shard.

Requirement: Pass a query, get the data, update each document and write it back to the same shard. Since there are 32 shards, 32 tasks are created; each task fetches one month of data, updates it and sends it back to ES with the same routing value so that it overwrites the previous document.

Flow: Retrieval seems easy: 32 tasks are created, one per shard, and they bring the data into a single RDD. The next step updates each document. The step after that is the write, which raises the following question.

How does the write operation divide itself into tasks? Going by the documentation, it depends on es.batch.size.bytes and es.batch.size.entries; the values of these two properties define the number of tasks. What I presumed was that the RDD is partitioned again into n tasks depending on the values of these parameters, and that many tasks then run to index/update the data. However, when I ran a write with just 5 documents and es.batch.size.entries set to 10,000, I still saw as many as 32 tasks writing to my es.resource. I am still confused about how task allocation works here. Can you please explain?

Now another question: in a standalone write to ES, how does the code identify which shard holds which routing value? My assumption was that all the tasks send their data to an ES node, which then distributes it to the shards based on the routing value, just like a normal bulk index operation.
Can you please explain how tasks are created for the two operations: read-update-write and write-only?

Thanks in advance
Piyush

--
Please update your bookmarks! We moved to https://discuss.elastic.co/
---
You received this message because you are subscribed to the Google Groups elasticsearch group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/2d9dac53-da38-4309-8dc1-7440cb9479ae%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
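The observation in this post (5 documents, es.batch.size.entries of 10,000, still 32 tasks) can be reproduced on paper with a small sketch. It is not the connector's code. It assumes what the elasticsearch-hadoop configuration documentation states, namely that the batch settings apply per task instance, and that Spark runs one task per RDD partition. The function name plan_write and the partition layout are hypothetical.

```python
import math

def plan_write(partition_sizes, batch_size_entries):
    """Return (task_count, bulk_requests_per_task) for a hypothetical write job.

    Assumption: one write task per RDD partition; the batch size only
    decides how many bulk requests each task sends.
    """
    task_count = len(partition_sizes)
    bulk_requests = [math.ceil(n / batch_size_entries) for n in partition_sizes]
    return task_count, bulk_requests

# 5 documents sitting in an RDD that still has the 32 partitions of the read.
partition_sizes = [1] * 5 + [0] * 27
tasks, bulk_requests = plan_write(partition_sizes, batch_size_entries=10_000)
print(tasks)               # 32
print(sum(bulk_requests))  # 5
```

Under these assumptions, changing es.batch.size.entries changes the number of bulk requests, while the task count only changes if the RDD is repartitioned.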
[Hadoop] - Difference between task creation for a write and read-update-write operation in ES
Hi Costin,

I saw different task-creation behavior for writes to ES while working on my project. The difference is as follows:

1.) Write only - When I create an RDD of my own to insert data into ES, the tasks are created based on the properties es.batch.size.bytes and es.batch.size.entries. Number of tasks created = number of documents in the RDD / the value of either of these properties. The request hits a node, and the node decides which shard the document is routed to based on the routing value (if specified).

2.) Read-update-write - Consider the case where I read data from ES, store it in an RDD, update the documents in the RDD and then index them back to ES. While reading, the number of tasks is based on the number of shards, and I presume that each task fetches data from one shard (I am not sure how this works: does the task delegate the request to a node, which serves data from a particular shard?). When I then update/re-index the data using the same RDD and the function saveToESWithMetadata, the number of tasks created does not follow point 1. If the data in each partition is less than es.batch.size.entries, it creates as many tasks as there are shards; otherwise it creates more.

What is the reason behind this? Also, just as a read request goes to a particular shard, does a write go to a particular shard, or do all the tasks delegate their requests to a node?

Thanks in advance
Piyush
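On the routing part of the question, the Elasticsearch documentation describes shard selection as hash(_routing) % number_of_primary_shards, computed on the Elasticsearch side. The sketch below only illustrates that formula. zlib.crc32 is a stand-in for the hash Elasticsearch really uses, so the shard numbers it prints will not match a real cluster, and the "YYYY-MM" routing format is an assumption.

```python
import zlib

NUM_PRIMARY_SHARDS = 32

def shard_for(routing, num_shards=NUM_PRIMARY_SHARDS):
    """Map a routing value to a shard number with a stand-in hash."""
    # Assumption: crc32 replaces Elasticsearch's internal hash function.
    return zlib.crc32(routing.encode("utf-8")) % num_shards

months = [f"2014-{m:02d}" for m in range(1, 13)]
for routing in months:
    print(routing, "->", shard_for(routing))
```

Because the mapping is a hash, every document with the same month lands on the same shard, but two different months are not guaranteed to land on different shards.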
Re: ES JAVA API | QueryBuilders.java | FilterBuilders does not have nullable queryBuilder as argument
OK. That helps a lot in understanding things, but I still feel that, since ES internally treats the missing query as a match_all query, the method definition should accept a nullable query as well.

Thanks once again for helping out. :)

Regards
Piyush

On Wednesday, 11 February 2015 13:29:32 UTC+5:30, David Pilato wrote:

Actually in a filtered query, filters are applied first. The match all query then only says that all filtered documents match.

David
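David Pilato's point that the filter runs first and match_all then accepts whatever is left can be pictured with a toy in-memory model. This is only an illustration of the described behavior, not how Elasticsearch is implemented; the documents and function names are made up.

```python
docs = [
    {"id": 1, "response_timestamp": "2015-01-01"},
    {"id": 2, "response_timestamp": "2015-01-02"},
    {"id": 3, "response_timestamp": "2015-01-01"},
]

def match_all(doc):
    # Every document matches with the same constant score.
    return 1.0

def term_filter(doc):
    return doc["response_timestamp"] == "2015-01-01"

def filtered_search(docs, query, doc_filter):
    # The filter narrows the candidates first; the query scores what is left.
    return [(doc["id"], query(doc)) for doc in docs if doc_filter(doc)]

print(filtered_search(docs, match_all, term_filter))  # [(1, 1.0), (3, 1.0)]
```

In this model the result is exactly the filtered set, each hit with score 1, which matches "all scores will be set to 1" from earlier in the thread.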
Re: Large results sets and paging for Aggregations
Aah! This seems to be the best explanation of how aggregations work. Thanks a ton, Mark. :)

A few other questions:

1.) Should I assume that as my document count increases, the time to calculate the aggregation increases as well? Reason: I am trying to figure out whether bucket creation happens at the individual shard level; if so, document counting would happen in parallel on each shard, which would decrease the execution time significantly. Also, at shard level, as the number of documents satisfying the query grows, the execution time would increase if this process is linear.

2.) How does this analogy apply to sub-aggregations? My observation is that as you increase the number of child aggregations, both execution time and memory utilization increase. What happens in the case of sub-aggregations?

3.) I didn't get your last statement: "There is however a fixed overhead for all queries which *is* a function of number of docs and that is the Field Data cache required to hold the dates/member IDs in RAM - if this becomes a problem then you may want to look at on-disk alternative structure in the form of DocValues."

4.) Off topic, but best asked here since we are talking about it. :) DocValues: since they were introduced in 1.0.0 and most of our mapping was defined in ES 0.9, can I change the mapping of existing fields now?

I could take point 4 to another thread, but I would love to hear about points 1-3. You made this thread very interesting for me.

Thanks
Piyush

On Wednesday, 11 February 2015 15:12:37 UTC+5:30, Mark Harwood wrote:

5k doesn't sound too scary.

Think of the aggs tree like a Bean Machine [1] - one of those wooden boards with pins arranged on it like a Christmas tree, where you drop balls at the top of the board and they rattle down a choice of path to the bottom. In the case of aggs, your buckets are the pins and documents are the balls.

The memory requirement for processing the agg tree is typically the number of pins, not the number of balls you drop into the tree, as these just fall out of the bottom of the tree. So in your case it is 5k members multiplied by 12 months each = 60k unique buckets, each of which will maintain a counter of how many docs pass through that point. So you could pass millions or billions of docs through and the working memory requirement for the query would be the same.

There is however a fixed overhead for all queries which *is* a function of number of docs and that is the Field Data cache required to hold the dates/member IDs in RAM - if this becomes a problem then you may want to look at on-disk alternative structure in the form of DocValues.

Hope that helps.

[1] http://en.wikipedia.org/wiki/Bean_machine

On Wednesday, February 11, 2015 at 7:04:04 AM UTC, piyush goyal wrote:

Hi Mark,

Before getting into queries, here is a little info about the project:

1.) A community whose members keep increasing, decreasing and changing. Members are maintained in a separate type.
2.) Approximately 3K to 4K documents per user are inserted into ES per month, in a different type keyed by member ID.
3.) The mapping is flat; there is no nested or array data.

Requirement: here is a sample requirement:

1.) A report, per member ID, of the count of data for the last three months.
2.) The query used to get the data is:

    {
      "query": { "constant_score": { "filter": { "bool": { "must": [
        { "term": { "datatype": "XYZ" } },
        { "range": { "response_timestamp": { "from": "2014-11-01", "to": "2015-01-31" } } }
      ] } } } },
      "aggs": { "memberIDAggs": {
        "terms": { "field": "member_id", "size": 0 },
        "aggs": { "dateHistAggs": {
          "date_histogram": { "field": "response_timestamp", "interval": "month" }
        } }
      } },
      "size": 0
    }

The current member count is approximately 1K and will increase to 5K in the next 10 months. That means 5K * 4K * 3 documents feeding this aggregation, which I guess is a major hit on the system. And this is only two levels of aggregation. The next requirement from our analysts is to split the per-month data into three different categories. What is the optimum solution to this problem?

Regards
Piyush

On Tuesday, 10 February 2015 16:15:22 UTC+5:30, Mark Harwood wrote:

"these kind of queries are hit more for qualitative analysis."

Do you have any example queries? The pay-as-you-go summarisation need not be about just maintaining quantities. In the demo here [1] I derive profile names for people, categorizing them as newbies, fanboys or haters based on a history of their reviewing behaviours
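Mark's Bean Machine analogy (memory follows the number of pins, not the number of balls) can be checked with a small counting sketch. It is a toy model of a terms aggregation with a date_histogram sub-aggregation, scaled down to 50 members so it runs quickly; the synthetic documents and the function name are assumptions made for illustration.

```python
from collections import Counter

MEMBERS, MONTHS = 50, 12

def aggregate(docs):
    """Count docs per (member_id, month) bucket, one counter per bucket."""
    buckets = Counter()
    for member_id, month in docs:
        buckets[(member_id, month)] += 1
    return buckets

def synthetic_docs(n):
    # Deterministic stand-in data that cycles through every member and month.
    return ((i % MEMBERS, (i // MEMBERS) % MONTHS) for i in range(n))

small = aggregate(synthetic_docs(6_000))
large = aggregate(synthetic_docs(600_000))
print(len(small), len(large))  # 600 600
```

A hundred times more documents still produce 50 * 12 = 600 counters; only the values stored in the counters grow. This says nothing about the field data cache, which Mark notes does scale with document count.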
Re: Large results sets and paging for Aggregations
Thanks Mark. Your pay-as-you-go suggestion sounds great. But given the dynamics of the application, these kinds of queries are run mostly for qualitative analysis. There are hundreds of such queries (I am not exaggerating) run daily by our analytics team. Keeping count of all those qualitative checks daily and maintaining them as documents is a headache in itself; adding, updating and removing these documents would cause us a huge maintenance overhead. Hence I was thinking of some form of pagination on aggregations, which would definitely help us keep ES memory leaks away.

By the way, are there any other strategies suggested by ES for these kinds of scenarios?

Thanks

On Tuesday, 10 February 2015 15:20:40 UTC+5:30, Mark Harwood wrote:

"Why can't aggs be based on shard based calculations"

They are. The shard_size setting will determine how many member *summaries* will be returned from each shard - we won't stream each member's thousands of related records back to a centralized point to compute a final result. The final step is to summarise the summaries from each shard.

"if the number of members keep on increasing, day by day ES has to keep more and more data into memory to calculate the aggs"

This is a different point to the one above (shard-level computation vs memory costs). If your analysis involves summarising the behaviours of large numbers of people over time then you may well find the cost of doing this in a single query too high when the numbers of people are extremely large. There is a cost to any computation and in that scenario you have deferred all these member-summarising costs to the very last moment. A better strategy for large-scale analysis of behaviours over time is to use a pay-as-you-go model where you update a per-member summary document at regular intervals with batches of their related records. This shifts the bulk of the computation cost from your single query to many smaller costs when writing data. You can then perform efficient aggs or scan/scroll operations on *member* documents with pre-summarised attributes, e.g. totalSpend, rather than deriving these properties on-the-fly from records with a shared member ID.

On Tuesday, February 10, 2015 at 7:03:17 AM UTC, piyush goyal wrote:

Well, in my use case I have tens of thousands of records for each member, and I want to do a simple terms agg on member ID. If the number of member IDs stays the same throughout, good enough; if the number of members keeps increasing, day by day ES has to keep more and more data in memory to calculate the aggs. That does not sound very promising. What we do is use routing to put member-specific data into a particular shard. Why can't aggs be based on shard-level calculations, so that I am safe from loading tons of data into memory?

Any thoughts?

On Sunday, 9 November 2014 22:58:12 UTC+5:30, pulkitsinghal wrote:

Sharing a response I received from Igor Motov:

Scroll works only to page results. Paging aggs doesn't make sense since aggs are executed on the entire result set; therefore, if it managed to fit into memory, you should just get it. Paging would mean that you throw away a lot of results that were already calculated. The only way to page is by limiting the results that you are running aggs on, for example if your data is sorted by date and you want to build a histogram for the results one date range at a time.
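The pay-as-you-go model Mark describes in this message can be sketched as an incremental fold of record batches into per-member summary documents. This is a minimal in-memory illustration, not an Elasticsearch client example; the field names record_count and total_spend (modelled on Mark's totalSpend), the record shape and the function name are all hypothetical.

```python
from collections import defaultdict

def apply_batch(summaries, records):
    """Fold one batch of raw records into the per-member summaries."""
    for rec in records:
        summary = summaries[rec["member_id"]]
        summary["record_count"] += 1
        summary["total_spend"] += rec["spend"]
    return summaries

# One summary document per member, created lazily on first use.
summaries = defaultdict(lambda: {"record_count": 0, "total_spend": 0.0})

apply_batch(summaries, [
    {"member_id": "m1", "spend": 10.0},
    {"member_id": "m2", "spend": 5.0},
])
apply_batch(summaries, [{"member_id": "m1", "spend": 2.5}])

print(dict(summaries))
```

Each batch costs a little at write time, and a later report reads one small document per member instead of re-aggregating all of that member's records.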
ES JAVA API | QueryBuilders.java | FilterBuilders does not have nullable queryBuilder as argument
Hi All,

If I write a filtered query through Sense, it lets me add just a filter; the query is not a mandatory field. For example:

    { "query": { "filtered": { "filter": { "term": { "response_timestamp": "2015-01-01" } } } } }

is a valid query through Sense. However, if I try to implement the same through the Java API, I have to use the abstract class QueryBuilders.java and its method:

    filteredQuery(QueryBuilder queryBuilder, @Nullable FilterBuilder filterBuilder)

Please note that the FilterBuilder argument is nullable and the QueryBuilder argument is not, which means that eventually I have to write a query inside the filtered part. If this is correct, how can I write a complete query with aggregations such that no score is calculated and the response time is faster?

Regards
Piyush
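For reference, here are the two request bodies under discussion, built as plain dictionaries and printed as valid JSON. The first is the filter-only body from this post; the second spells out the match_all query that the Java signature forces you to supply. That Elasticsearch treats a missing query part as match_all is taken from the replies in this thread, not verified here.

```python
import json

term_filter = {"term": {"response_timestamp": "2015-01-01"}}

# Body as typed in Sense: a filter and no query part.
filter_only = {"query": {"filtered": {"filter": term_filter}}}

# Body mirroring filteredQuery(matchAllQuery(), filter) in the Java API.
explicit = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": term_filter,
        }
    }
}

print(json.dumps(filter_only, indent=2))
print(json.dumps(explicit, indent=2))
```

The two bodies differ only in the explicit "query" key, so passing a match_all query from Java adds no extra condition on top of the filter.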
Re: ES JAVA API | QueryBuilders.java | FilterBuilders does not have nullable queryBuilder as argument
I don't need the query part. All I need is a filter. I am not sure how matchAllQuery will help.

On Tuesday, 10 February 2015 16:14:51 UTC+5:30, David Pilato wrote:

Use a matchAllQuery for the query part. All scores will be set to 1.

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
Re: ES JAVA API | QueryBuilders.java | FilterBuilders does not have nullable queryBuilder as arguement
Ahh, now I get you. But does that mean that if my REST query through Sense has no query part and only a filter part, ES adds a matchAll query to the query part by default?

One more question (I might be asking the wrong question, but just to clear up my doubts): does this query now do two separate things?

1.) What a normal match-all query does.
2.) Whatever my filter operations specify.

And then ES picks the intersection of the two. Am I right?

Regards
Piyush

On Wednesday, 11 February 2015 12:25:43 UTC+5:30, David Pilato wrote:

I think I answered. This is what is done by default in REST: https://github.com/elasticsearch/elasticsearch/blob/1816951b6b0320e7a011436c7c7519ec2bfabc6e/src/main/java/org/elasticsearch/index/query/FilteredQueryParser.java#L54

David

On 11 Feb 2015 at 07:46, piyush goyal coolpi...@gmail.com wrote:

Hi Folks, Any inputs?

Regards
Piyush

On Tuesday, 10 February 2015 16:23:43 UTC+5:30, piyush goyal wrote:

I don't need the query part. All I need is a filter. Not sure how matchAllQuery will help.

On Tuesday, 10 February 2015 16:14:51 UTC+5:30, David Pilato wrote:

Use a matchAllQuery for the query part. All scores will be set to 1.

-- David ;-) Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 10 Feb 2015 at 11:24, piyush goyal coolpi...@gmail.com wrote:

Hi All,

If I write a filtered query through Sense, I can supply just a filter; the query is not a mandatory field. For example:

    { "query": { "filtered": { "filter": { "term": { "response_timestamp": "2015-01-01" } } } } }

is a valid query through Sense. However, to implement the same through the Java API, I have to use the abstract class QueryBuilders and its method:

    filteredQuery(QueryBuilder queryBuilder, @Nullable FilterBuilder filterBuilder)

Note that the FilterBuilder argument is nullable and the QueryBuilder argument is not, which means that I eventually have to write a query inside the filtered part. If this is correct, how can I write a complete query with aggregations such that no score is calculated and the response time is faster?

Regards
Piyush
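A minimal sketch of the request body this exchange converges on: a filtered query whose query part is an explicit match_all, so only the filter decides which documents match. It is written in Python purely to build and print the JSON; the helper name `filter_only_query` is hypothetical, and the field and value are taken from the example above. In the 1.x Java API the same shape would presumably be `QueryBuilders.filteredQuery(QueryBuilders.matchAllQuery(), FilterBuilders.termFilter(...))`, as David suggests.

```python
import json


def filter_only_query(field, value):
    """Build a filtered query with an explicit match_all query part.

    The helper name is hypothetical; the body mirrors the Sense example,
    with the query part that REST fills in by default written out.
    """
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"term": {field: value}},
            }
        }
    }


body = filter_only_query("response_timestamp", "2015-01-01")
print(json.dumps(body, indent=2))
```

Since match_all matches every document, intersecting it with the filter returns exactly the documents the filter selects.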
Re: ES JAVA API | QueryBuilders.java | FilterBuilders does not have nullable queryBuilder as arguement
Hi Folks, Any inputs?

Regards
Piyush

On Tuesday, 10 February 2015 16:23:43 UTC+5:30, piyush goyal wrote:

I don't need the query part. All I need is a filter. Not sure how matchAllQuery will help.

On Tuesday, 10 February 2015 16:14:51 UTC+5:30, David Pilato wrote:

Use a matchAllQuery for the query part. All scores will be set to 1.

-- David ;-) Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 10 Feb 2015 at 11:24, piyush goyal coolpi...@gmail.com wrote:

Hi All,

If I write a filtered query through Sense, I can supply just a filter; the query is not a mandatory field. For example:

    { "query": { "filtered": { "filter": { "term": { "response_timestamp": "2015-01-01" } } } } }

is a valid query through Sense. However, to implement the same through the Java API, I have to use the abstract class QueryBuilders and its method:

    filteredQuery(QueryBuilder queryBuilder, @Nullable FilterBuilder filterBuilder)

Note that the FilterBuilder argument is nullable and the QueryBuilder argument is not, which means that I eventually have to write a query inside the filtered part. If this is correct, how can I write a complete query with aggregations such that no score is calculated and the response time is faster?

Regards
Piyush
Re: Large results sets and paging for Aggregations
Hi Mark,

Before getting into queries, here is a little info about the project:

1.) A community whose membership keeps increasing, decreasing and changing. Members are maintained in a separate type.
2.) Approximately 3K to 4K documents per user are inserted into ES each month, in a different type keyed by member ID.
3.) The mapping is flat; there are no nested or array fields.

Requirement: here is a sample requirement:

1.) A report of the document count per member ID for the last three months.
2.) The query used to get the data is:

    {
      "query": {
        "constant_score": {
          "filter": {
            "bool": {
              "must": [
                { "term": { "datatype": "XYZ" } },
                { "range": { "response_timestamp": { "from": "2014-11-01", "to": "2015-01-31" } } }
              ]
            }
          }
        }
      },
      "aggs": {
        "memberIDAggs": {
          "terms": { "field": "member_id", "size": 0 },
          "aggs": {
            "dateHistAggs": {
              "date_histogram": { "field": "response_timestamp", "interval": "month" }
            }
          }
        }
      },
      "size": 0
    }

The current member count is approximately 1K and will increase to 5K in the next 10 months. That means 5K * 4K * 3 (roughly 60 million) documents feeding this aggregation, which I guess is a major hit on the system. And this is only two levels of aggregation. The next requirement from our analysts is to split the per-month data into three different categories.

What is the optimum solution to this problem?

Regards
Piyush

On Tuesday, 10 February 2015 16:15:22 UTC+5:30, Mark Harwood wrote:

> these kind of queries are hit more for qualitative analysis.

Do you have any example queries? The pay-as-you-go summarisation need not be about just maintaining quantities. In the demo here [1] I derive profile names for people, categorizing them as newbies, fanboys or haters based on a history of their reviewing behaviours in a marketplace.

> By the way, are there any other strategies suggested by ES for these kind of scenarios?

Igor hit on one, which is to use some criteria, e.g. date, to limit the volume of what you analyze in any one query request.

[1] http://www.elasticsearch.org/videos/entity-centric-indexing-london-meetup-sep-2014/

On Tuesday, February 10, 2015 at 10:05:24 AM UTC, piyush goyal wrote:

Thanks Mark. Your suggestion of pay-as-you-go seems amazing. But considering the dynamics of the application, these kinds of queries are hit more for qualitative analysis. There are hundreds of such queries (I am not exaggerating) being run daily by our analytics team. Keeping count of all those qualitative checks daily and maintaining them as documents is a headache in itself. Additions, updates and removals of these documents would cause us huge maintenance overheads. Hence I was thinking of some form of pagination on aggregations, which would definitely help us keep ES memory leaks away.

By the way, are there any other strategies suggested by ES for these kinds of scenarios?

Thanks

On Tuesday, 10 February 2015 15:20:40 UTC+5:30, Mark Harwood wrote:

> Why can't aggs be based on shard based calculations

They are. The shard_size setting will determine how many member *summaries* will be returned from each shard - we won't stream each member's thousands of related records back to a centralized point to compute a final result. The final step is to summarise the summaries from each shard.

> if the number of members keep on increasing, day by day ES has to keep more and more data into memory to calculate the aggs

This is a different point to the one above (shard-level computation vs memory costs). If your analysis involves summarising the behaviours of large numbers of people over time, then you may well find the cost of doing this in a single query too high when the numbers of people are extremely large. There is a cost to any computation, and in that scenario you have deferred all these member-summarising costs to the very last moment.

A better strategy for large-scale analysis of behaviours over time is to use a pay-as-you-go model where you update a per-member summary document at regular intervals with batches of their related records. This shifts the bulk of the computation cost from your single query to many smaller costs when writing data. You can then perform efficient aggs or scan/scroll operations on *member* documents with pre-summarised attributes, e.g. totalSpend, rather than deriving these properties on-the-fly from records with a shared member ID.

On Tuesday, February 10, 2015 at 7:03:17 AM UTC, piyush goyal wrote:

Well, my use case says I have tens of thousands of records for each members. I want to do a simple terms aggs on member ID. If my count of member ID remains same throughout .. good enough, if the number of members keep
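The pay-as-you-go model Mark describes can be sketched roughly as follows. This is an illustration under assumptions, not his implementation: summaries live in a plain Python dict rather than in ES documents, the function name `apply_batch` and the summary fields `total` and `per_month` are hypothetical, and only the field names `member_id` and `response_timestamp` come from the thread.

```python
def apply_batch(summaries, records):
    """Fold one batch of raw records into per-member summaries, in place.

    Each summary tracks a running total and a per-month count, so a later
    report reads one small summary per member instead of aggregating
    every raw record at query time.
    """
    for rec in records:
        month = rec["response_timestamp"][:7]  # "2014-11-03" -> "2014-11"
        summary = summaries.setdefault(
            rec["member_id"], {"total": 0, "per_month": {}}
        )
        summary["total"] += 1
        summary["per_month"][month] = summary["per_month"].get(month, 0) + 1
    return summaries


summaries = {}
# First batch, e.g. one scheduled run
apply_batch(summaries, [
    {"member_id": "m1", "response_timestamp": "2014-11-03"},
    {"member_id": "m1", "response_timestamp": "2014-11-20"},
    {"member_id": "m2", "response_timestamp": "2015-01-05"},
])
# A later batch only touches the members it contains
apply_batch(summaries, [
    {"member_id": "m1", "response_timestamp": "2015-01-09"},
])
print(summaries)
```

The cost of each run is proportional to the batch size, not to the full history, which is the shift of work from query time to write time that the reply argues for.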
Re: Large results sets and paging for Aggregations
Well, my use case is that I have tens of thousands of records for each member. I want to do a simple terms agg on member ID. If my count of member IDs stays the same throughout, good enough; but if the number of members keeps increasing, ES has to keep more and more data in memory day by day to calculate the aggs. That does not sound very promising. What we do is use routing to put member-specific data into a particular shard. Why can't aggs be based on shard-level calculations, so that I am safe from loading tons of data into memory?

Any thoughts?

On Sunday, 9 November 2014 22:58:12 UTC+5:30, pulkitsinghal wrote:

Sharing a response I received from Igor Motov:

Scroll works only to page results. Paging aggs doesn't make sense, since aggs are executed on the entire result set; therefore, if it managed to fit into memory, you should just get it. Paging would mean that you throw away a lot of results that were already calculated. The only way to page is by limiting the results that you are running aggs on, for example if your data is sorted by date and you want to build a histogram for the results one date range at a time.
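Igor's suggestion of running aggs one date range at a time could look something like the sketch below. It only builds the request bodies, one per calendar month; the helper names `month_windows` and `agg_request` are hypothetical, and the field names (`response_timestamp`, `member_id`), the `from`/`to` range syntax and the `memberIDAggs` name are borrowed from the query posted elsewhere in this thread.

```python
import calendar
from datetime import date


def month_windows(start, end):
    """Yield (first_day, last_day) for each calendar month from start to end."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        last_day = calendar.monthrange(year, month)[1]
        yield date(year, month, 1), date(year, month, last_day)
        # Advance one month, rolling over the year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def agg_request(first, last):
    """Build a terms agg on member_id restricted to one date window."""
    return {
        "size": 0,
        "query": {
            "constant_score": {
                "filter": {
                    "range": {
                        "response_timestamp": {
                            "from": first.isoformat(),
                            "to": last.isoformat(),
                        }
                    }
                }
            }
        },
        "aggs": {
            "memberIDAggs": {"terms": {"field": "member_id", "size": 0}}
        },
    }


requests = [
    agg_request(first, last)
    for first, last in month_windows(date(2014, 11, 1), date(2015, 1, 31))
]
for req in requests:
    print(req["query"]["constant_score"]["filter"]["range"])
```

Each request then aggregates over only one month of documents, and the client combines the per-month results, trading a single large aggregation for several smaller ones.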