[google-appengine] Re: What is the most efficient way to compute a rolling median on appengine?
I need the median value for multiple entities, but only compared to themselves. In the future I will probably create the median across entities by doing a median average.

On Tuesday, November 19, 2013 9:23:59 PM UTC-5, Jim wrote:

Are you doing a time-series type analysis where you need the rolling median value for a specific entity, or do you need the median value across a range of entities?

On Tuesday, November 12, 2013 2:07:34 PM UTC-6, Mathieu Simard wrote:

Since there is no App Engine solution available such as the Redis atomic list, I'm left wondering how to implement a cost-effective rolling median. Has anyone come up with a solution that would be more convenient than running a Redis instance on Google Compute Engine?

-- You received this message because you are subscribed to the Google Groups "Google App Engine" group. To unsubscribe from this group and stop receiving emails from it, send an email to google-appengine+unsubscr...@googlegroups.com. To post to this group, send email to google-appengine@googlegroups.com. Visit this group at http://groups.google.com/group/google-appengine. For more options, visit https://groups.google.com/groups/opt_out.
[google-appengine] Re: What is the most efficient way to compute a rolling median on appengine?
Missed your comment... this is what we're doing, except we avoid the 1 MB limitation by storing the data sets in blobs and storing a pointer to the blob in the entity record.

On Wednesday, November 13, 2013 1:20:21 PM UTC-6, Kaan Soral wrote:

A single datastore entity can hold up to 1 MB. How big will a single dataset be? If it's smaller than 1 MB in summarized format, you could build a queue-based solution to handle the 15/s data rate. You could also probably develop something like a tree, with each entity representing a node and storing data about its leaves; it could maybe lead to a practical median calculator. Just an idea; the point is, as Vinny P stated, the solution always depends on exactly what you are doing.

On Tuesday, November 12, 2013 10:07:34 PM UTC+2, Mathieu Simard wrote:

Since there is no App Engine solution available such as the Redis atomic list, I'm left wondering how to implement a cost-effective rolling median. Has anyone come up with a solution that would be more convenient than running a Redis instance on Google Compute Engine?
[google-appengine] Re: What is the most efficient way to compute a rolling median on appengine?
I'm already using that approach. However, the distribution of my metrics requires a more precise solution for my median.

On Thursday, November 21, 2013 12:50:41 AM UTC-5, Luca de Alfaro wrote:

If you can weigh recent data more than older data, you might consider, instead of building a rolling average, an exponentially decaying weighted average. You can store in ndb, sharded: total_amount, total_weight, and timestamp. Then, when you get an update, you compute the decay_factor, which is equal to exp(-(time since update) / (time constant)). You then do:

    total_amount = total_amount * decay_factor + amount_now
    total_weight = total_weight * decay_factor + weight_now
    timestamp = present time
    avg = total_amount / total_weight

On Tuesday, November 12, 2013 12:07:34 PM UTC-8, Mathieu Simard wrote:

Since there is no App Engine solution available such as the Redis atomic list, I'm left wondering how to implement a cost-effective rolling median. Has anyone come up with a solution that would be more convenient than running a Redis instance on Google Compute Engine?
Re: [google-appengine] Re: What is the most efficient way to compute a rolling median on appengine?
The query volume has doubled since I first started this thread and, at the current rate, should double again by the end of next week. I definitely need a solution that can handle thousands of QPS, because that's where we're heading. I'm currently running this without consistency, using a dedicated memcache.

On Thu, Nov 21, 2013 at 2:00 PM, Jim jeb62...@gmail.com wrote:

If data points for each entity are not coming too fast, you could use Blobstore/GCS to store your time series for each entity in a blob, then store a pointer to that blob in your entity in the datastore. Updating is expensive but can run off a task queue. Retrieval of the blobs is very fast, and then you can quickly parse the blob into memory and compute your stats on a given entity. Cross-entity stats are trickier and require some map-reduce-esque processing. We use this approach for smart-meter analytics, where data points for a given entity (meter) don't come any faster than once every 15 minutes... not sure if it would work for you.

On Thursday, November 21, 2013 8:12:15 AM UTC-6, Mathieu Simard wrote:

I need the median value for multiple entities, but only compared to themselves. In the future I will probably create the median across entities by doing a median average.

On Tuesday, November 19, 2013 9:23:59 PM UTC-5, Jim wrote:

Are you doing a time-series type analysis where you need the rolling median value for a specific entity, or do you need the median value across a range of entities?

On Tuesday, November 12, 2013 2:07:34 PM UTC-6, Mathieu Simard wrote:

Since there is no App Engine solution available such as the Redis atomic list, I'm left wondering how to implement a cost-effective rolling median. Has anyone come up with a solution that would be more convenient than running a Redis instance on Google Compute Engine?
[google-appengine] Re: What is the most efficient way to compute a rolling median on appengine?
If data points for each entity are not coming too fast, you could use Blobstore/GCS to store your time series for each entity in a blob, then store a pointer to that blob in your entity in the datastore. Updating is expensive but can run off a task queue. Retrieval of the blobs is very fast, and then you can quickly parse the blob into memory and compute your stats on a given entity. Cross-entity stats are trickier and require some map-reduce-esque processing. We use this approach for smart-meter analytics, where data points for a given entity (meter) don't come any faster than once every 15 minutes... not sure if it would work for you.

On Thursday, November 21, 2013 8:12:15 AM UTC-6, Mathieu Simard wrote:

I need the median value for multiple entities, but only compared to themselves. In the future I will probably create the median across entities by doing a median average.

On Tuesday, November 19, 2013 9:23:59 PM UTC-5, Jim wrote:

Are you doing a time-series type analysis where you need the rolling median value for a specific entity, or do you need the median value across a range of entities?

On Tuesday, November 12, 2013 2:07:34 PM UTC-6, Mathieu Simard wrote:

Since there is no App Engine solution available such as the Redis atomic list, I'm left wondering how to implement a cost-effective rolling median. Has anyone come up with a solution that would be more convenient than running a Redis instance on Google Compute Engine?
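The blob-per-entity approach above can be sketched in a few lines. This is a minimal illustration, not App Engine API code: `read_blob` is a hypothetical stand-in for a Blobstore/GCS read, and the line format ("timestamp,value" per line) is an assumption for the sketch.

```python
import statistics

def read_blob(path):
    # Hypothetical stand-in for a Blobstore/GCS read; a real app would
    # use the blobstore or cloudstorage client here.
    with open(path) as f:
        return f.read()

def stats_for_entity(blob_text):
    # Assumed blob format: one "timestamp,value" pair per line.
    values = [float(line.split(",")[1])
              for line in blob_text.splitlines() if line]
    return {
        "count": len(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
    }
```

Once the blob is in memory, computing the median (or any other per-entity stat) is cheap; the expensive part is the rewrite of the blob on each update, which is why the message above pushes updates onto a task queue.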
[google-appengine] Re: What is the most efficient way to compute a rolling median on appengine?
If you can weigh recent data more than older data, you might consider, instead of building a rolling average, an exponentially decaying weighted average. You can store in ndb, sharded: total_amount, total_weight, and timestamp. Then, when you get an update, you compute the decay_factor, which is equal to exp(-(time since update) / (time constant)). You then do:

    total_amount = total_amount * decay_factor + amount_now
    total_weight = total_weight * decay_factor + weight_now
    timestamp = present time
    avg = total_amount / total_weight

On Tuesday, November 12, 2013 12:07:34 PM UTC-8, Mathieu Simard wrote:

Since there is no App Engine solution available such as the Redis atomic list, I'm left wondering how to implement a cost-effective rolling median. Has anyone come up with a solution that would be more convenient than running a Redis instance on Google Compute Engine?
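The update rule above can be sketched as a small class. This is an illustration only: the real version would keep total_amount, total_weight, and timestamp in a sharded ndb entity, as the message suggests; here they are plain attributes, and the time_constant value is an arbitrary assumption.

```python
import math
import time

class DecayingAverage:
    """Exponentially decaying weighted average, per the rule above."""

    def __init__(self, time_constant, now=None):
        self.time_constant = float(time_constant)  # seconds
        self.total_amount = 0.0
        self.total_weight = 0.0
        self.timestamp = time.time() if now is None else now

    def update(self, amount_now, weight_now=1.0, now=None):
        now = time.time() if now is None else now
        # decay_factor = exp(-(time since update) / (time constant))
        decay = math.exp(-(now - self.timestamp) / self.time_constant)
        self.total_amount = self.total_amount * decay + amount_now
        self.total_weight = self.total_weight * decay + weight_now
        self.timestamp = now
        return self.total_amount / self.total_weight
```

Note this gives a decaying mean, not a median; as the reply above points out, a skewed metric distribution can make the two diverge badly.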
[google-appengine] Re: What is the most efficient way to compute a rolling median on appengine?
Are you doing a time-series type analysis where you need the rolling median value for a specific entity, or do you need the median value across a range of entities?

On Tuesday, November 12, 2013 2:07:34 PM UTC-6, Mathieu Simard wrote:

Since there is no App Engine solution available such as the Redis atomic list, I'm left wondering how to implement a cost-effective rolling median. Has anyone come up with a solution that would be more convenient than running a Redis instance on Google Compute Engine?
[google-appengine] Re: What is the most efficient way to compute a rolling median on appengine?
I miss some Redis functionality in App Engine as well. Memcache is just an unreliable cache to hold some data for a while... nothing more. To make such calculations, which iterate over large sets of data, I use backends with in-memory processing: load part of the data from the datastore into memory, spawn multiple threads (if applicable), and iterate over the data. Ugly, strange, error-prone, and sometimes slow, but it works. A bomb-to-kill-an-ant solution would be using Google BigQuery. I don't like the idea, but depending on your problem it can solve it for you. You can try to use some MapReduce processing as well. But since I'm using Java (a not-so-loved language in App Engine; see the servlet 3.0 discussion at http://code.google.com/p/googleappengine/issues/detail?id=3091), MapReduce (Mapper, actually: http://code.google.com/p/appengine-mapreduce/) is too experimental to put into production (after the Conversion and Files APIs, I learned my lesson: never ever use an experimental API in App Engine). Anyway, you have several options to try. I just recommend that you avoid storing large datasets in Memcache, since it's just a cache and can wipe your data at any time, invalidating your calculations.

On Tuesday, November 12, 2013 6:07:34 PM UTC-2, Mathieu Simard wrote:

Since there is no App Engine solution available such as the Redis atomic list, I'm left wondering how to implement a cost-effective rolling median. Has anyone come up with a solution that would be more convenient than running a Redis instance on Google Compute Engine?
[google-appengine] Re: What is the most efficient way to compute a rolling median on appengine?
Storing the data points isn't an option; I'm already receiving far too many data points to start writing them all, unless they dramatically lower the write costs... As for the in-memory approach, can you give a sense of the scale at which you're using this technique? Do you have to ensure that the backend runs on a single thread?

On Wednesday, November 13, 2013 6:59:49 AM UTC-5, Gilberto Torrezan Filho wrote:

I miss some Redis functionality in App Engine as well. Memcache is just an unreliable cache to hold some data for a while... nothing more. To make such calculations, which iterate over large sets of data, I use backends with in-memory processing: load part of the data from the datastore into memory, spawn multiple threads (if applicable), and iterate over the data. Ugly, strange, error-prone, and sometimes slow, but it works. A bomb-to-kill-an-ant solution would be using Google BigQuery. I don't like the idea, but depending on your problem it can solve it for you. You can try to use some MapReduce processing as well. But since I'm using Java (a not-so-loved language in App Engine; see the servlet 3.0 discussion at http://code.google.com/p/googleappengine/issues/detail?id=3091), MapReduce (Mapper, actually: http://code.google.com/p/appengine-mapreduce/) is too experimental to put into production (after the Conversion and Files APIs, I learned my lesson: never ever use an experimental API in App Engine). Anyway, you have several options to try. I just recommend that you avoid storing large datasets in Memcache, since it's just a cache and can wipe your data at any time, invalidating your calculations.

On Tuesday, November 12, 2013 6:07:34 PM UTC-2, Mathieu Simard wrote:

Since there is no App Engine solution available such as the Redis atomic list, I'm left wondering how to implement a cost-effective rolling median. Has anyone come up with a solution that would be more convenient than running a Redis instance on Google Compute Engine?
[google-appengine] Re: What is the most efficient way to compute a rolling median on appengine?
A single datastore entity can hold up to 1 MB. How big will a single dataset be? If it's smaller than 1 MB in summarized format, you could build a queue-based solution to handle the 15/s data rate. You could also probably develop something like a tree, with each entity representing a node and storing data about its leaves; it could maybe lead to a practical median calculator. Just an idea; the point is, as Vinny P stated, the solution always depends on exactly what you are doing.

On Tuesday, November 12, 2013 10:07:34 PM UTC+2, Mathieu Simard wrote:

Since there is no App Engine solution available such as the Redis atomic list, I'm left wondering how to implement a cost-effective rolling median. Has anyone come up with a solution that would be more convenient than running a Redis instance on Google Compute Engine?
[google-appengine] Re: What is the most efficient way to compute a rolling median on appengine?
Here's a better definition of the problem: I receive a *value* for a tracked metric (i.e. start-up time) for different systems at a rate of 15/s. This rate is expected to grow quickly as clients add systems. I need to produce a rolling median of that metric. Inserting all entries in the datastore is not an option, since that would already require 15 writes per second, which is extremely expensive. Using the memcache is not a good solution, since there is no atomic push/pop on arrays (hence my earlier reference to Redis). Using a backend instance to hold it all in memory is a quick fix, but it won't scale as we add new metrics. At the same time, I'm trying to keep the cost low, since volume is only going to grow.

On Wednesday, November 13, 2013 2:20:21 PM UTC-5, Kaan Soral wrote:

A single datastore entity can hold up to 1 MB. How big will a single dataset be? If it's smaller than 1 MB in summarized format, you could build a queue-based solution to handle the 15/s data rate. You could also probably develop something like a tree, with each entity representing a node and storing data about its leaves; it could maybe lead to a practical median calculator. Just an idea; the point is, as Vinny P stated, the solution always depends on exactly what you are doing.

On Tuesday, November 12, 2013 10:07:34 PM UTC+2, Mathieu Simard wrote:

Since there is no App Engine solution available such as the Redis atomic list, I'm left wondering how to implement a cost-effective rolling median. Has anyone come up with a solution that would be more convenient than running a Redis instance on Google Compute Engine?
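For reference, the in-memory rolling median itself (whether it lives on a backend instance or a Compute Engine box) is straightforward with a standard sliding-window technique: keep a FIFO of arrivals plus a sorted copy of the window. This is a generic sketch, not from the thread; the window size is an arbitrary assumption, and eviction is O(window) per update, which is fine at 15/s but worth profiling before thousands of QPS.

```python
import bisect
from collections import deque

class RollingMedian:
    """Rolling median over the last `window` values of one metric."""

    def __init__(self, window):
        self.window = window
        self.fifo = deque()        # arrival order, for eviction
        self.sorted_vals = []      # same values, kept sorted

    def add(self, value):
        self.fifo.append(value)
        bisect.insort(self.sorted_vals, value)
        if len(self.fifo) > self.window:
            old = self.fifo.popleft()
            # Locate and drop the evicted value from the sorted copy.
            del self.sorted_vals[bisect.bisect_left(self.sorted_vals, old)]

    def median(self):
        s, n = self.sorted_vals, len(self.sorted_vals)
        if n == 0:
            return None
        mid = n // 2
        return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2.0
```

The hard part on App Engine, as the thread makes clear, is not this computation but where the mutable window state lives: a frontend instance can be killed at any time, which is what pushes people toward backends, Compute Engine, or a summarized datastore representation.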
Re: [google-appengine] Re: What is the most efficient way to compute a rolling median on appengine?
On Wed, Nov 13, 2013 at 7:58 PM, Mathieu Simard mathieu.simar...@gmail.com wrote:

Here's a better definition of the problem: I receive a *value* for a tracked metric (i.e. start-up time) for different systems at a rate of 15/s. This rate is expected to grow quickly as clients add systems. I need to produce a rolling median of that metric. At the same time, I'm trying to keep the cost low since volume is only going to grow.

A few months back someone posted a similar problem to this mailing list: a mobile game needed a backend to collect scores from thousands of mobile clients, compute a leaderboard, then send the leaderboard back to the clients, all in a 10-second window. After the discussion, the consensus IIRC was to either (1) run App Engine backends to reap incoming requests and calculate values within a high-memory backend, or (2) run a Compute Engine machine to reap and calculate values.

The choice is up to you since it depends on what you're comfortable with, but if low cost is an important goal I'd choose the Compute Engine route. Since your application and values can be held entirely within RAM, you can choose a high-memory, diskless instance to optimize your resource usage. However, if the incoming values will have spiky traffic levels, or if your application requires complex services such as Endpoints, hosting on App Engine is the better solution.

-Vinny P
Technology Media Advisor
Chicago, IL
App Engine Code Samples: http://www.learntogoogleit.com