Re: [google-appengine] Re: How to solve the aggregate query with condition in Google App Engine

2012-08-07 Thread Neo
Hi Joakim,

Thanks for the detailed reply. The problem is, though there will be only a 
single entity for each day, it may have thousands of string > count items. 
I will have to process all these string > count items. If we start 
aggregating each week and each month and so on, and find the largest such 
interval that fits inside the given interval, then we reduce the items 
processing drastically. The only problem remains here is what if there are 
half a million of different input strings between the given interval. It 
seems that even aggregating on months won't be able to help. 

Having said that, I must say that I really like the idea. Do you think that 
it is scalable?

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/google-appengine/-/2k5o3kCRuSEJ.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.



Re: [google-appengine] Re: How to solve the aggregate query with condition in Google App Engine

2012-08-06 Thread Joakim
The string-day entities are only needed for the current day, as a way to 
parallel concurrent writes. In fact, it sounds like you will want to delete 
the day-string entities when they are no longer needed. The digested 
DayCount entity also described is the way to process queries, and will mean 
fetching one entity per day in the requested range. Remember to put the 
results generated in the datastore for re-use.

The model I laid out can be made to work for the numbers you mention, and 
the more instantaneous consistency you're willing to sacrifice, the easier 
it gets.

   - You can push the string-writes to asynchronous tasks on the TaskQueue 
   and then have the task operate on the string-day counter. Generally these 
   are called within 100 ms of being added.
   - At the end of the day, have a cron job process each of the string-day 
   entities, generating and updating the DayCount entity and deleting for that 
   day and then deleting the string-day entity. You wont be able to process 
   all of these in one transaction, either do 4 string-day entities per 
   transaction (hitting the maximum of 5 entities for an XG transaction, 
   including the DayCount entity), or just stick to one per transaction.


Notice that this means today's data will not show up in your queries, since 
they're only updated at the end of the day. We can fix that too.

   - After incrementing a string-day's counter, set a property called 
   updated to [current date+time] before writing the entity and committing the 
   transaction. As an optimization, ignore it if updated was set within the 
   last few minutes.
   - Have a cron job run every X minutes (as often as you want the current 
   day's DayCount updated) to fetch today's DayCount and read it's 
   update-time, then query for string-day entities where updated > 
   dayCountUpdated.
   - Pull all of the counts from these string-day entities into your 
   DayCount entity and set its update-time to the highest observed update-time 
   in the list of string-day entities you got. Commit your changes to the 
   current DayCount.


Be smart about reusing old query results.

   - At the end of the week, generate weekly count aggregations
   - Same for end of month
   - Identify the largest aggregate result or group thereof which fit 
   inside the requested date range, and use these as the base of your 
   calculation

Remember to use unindexed properties for any part of the entities you're 
not going to use in a query, saves on cost.

Also, as someone who used JDO, I suggest you give Objectify a look.

On Monday, August 6, 2012 3:35:18 PM UTC+2, Neo wrote:
>
> Hi Joakim ,
>
> If I understand it correctly, the string-day entity would be something 
> like this:
> Suppose there are 1 different strings S1, S2, ..., S1 on some 
> particular day, each asked many a times in that day d1, then we will have 
> 1 of entities: d1S1, d1S2, ...,d1S1. That means, suppose there are 
> 't' days in the given time period, and on all of the days in this interval, 
> there are only 1 different strings (S1 to S1), and assuming that 
> each of the string is inputted on half of the days in the time interval - 
> that means we will have to process 1 * (t/2) records - for t = 200 
> days, it turns out to be 1 Million of records processing for a single 
> query. This too, when we have assumed that there are only 1 different 
> queries in those 200 days - in a real scenario this can well be 10. 
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/google-appengine/-/ZrWBL5zNgYUJ.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.



[google-appengine] Re: How to solve the aggregate query with condition in Google App Engine

2012-08-06 Thread hyperflame
On Aug 6, 1:14 am, Steve James  wrote:
> I'd say horses for courses on this one.
>
> IMHO the Datastore may not be the best fit to solving your requirement.
>
> Perhaps Cloud SQL or BigQuery might be better suited?


Have to agree with Steve, this is increasingly sounding like a storage
schema that the datastore is not appropriate for.


On Aug 6, 4:49 am, Joakim  wrote:
> I'd solve this by having one unique entity per combination of date and
> string, storing the string's total for that day in said entity. You can
> achieve this uniqueness either by formatting special entity names (string
> ids) to include both the date and the string. Aggregate the day's totals in
> an entity keyed on the date, holding a map/dictionary of string->count.
> Your query would become something like this:
> SELECT * FROM DayCount WHERE __key__ >= Key('DayCount', startDate) AND
> __key__ <= Key('DayCount', endDate)

Nice one, Joakim!

Just to turn this into code:

To insert an Entity into the datastore (this is not production-quality
code, just showing an example):

try {
Key key = KeyFactory.createKey("someentityname", unique_string +
CURRENT_MONTH + CURRENT_DAY + CURRENT_YEAR);
Entity entity = datastore.get(key);
//If that operation succeeded, then there is an entity already with
that unique string, and was added today. Increase the count by one.
int count = (Integer)entity.getProperty("count");
entity.setProperty("count", count + 1);
entity.setProperty("add_date", new Date());
datastore.put(entity);
}
catch (EntityNotFoundException e) {
//this means that the unique string has not been added so far today
Entity entity = new Entity("someentityname", unique_string +
CURRENT_MONTH + CURRENT_DAY + CURRENT_YEAR);
entity.setProperty("count", 1);
entity.setProperty("add_date", new Date());
datastore.put(entity);
}

And then at query time, you can do a standard search using the Query
class. Of course, if your app is being accessed a lot, you need to
shard, or do some memcaching of data.

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.



[google-appengine] Re: How to solve the aggregate query with condition in Google App Engine

2012-08-06 Thread Steve James
I'd say horses for courses on this one.

IMHO the Datastore may not be the best fit to solving your requirement.

Perhaps Cloud SQL or BigQuery might be better suited?

Personally, i've consciously had to decide to make better use of all three; 
rather than hoping that Datastore'll solve all my needs (not saying that's 
necessarily what's happening here tho... don't get me wrong).

Im starting to find the following:-

Datastore's great for object persistence and basic querying

With a bit of live indexing onto Cloud Sql i can achieve the more complex 
searches i need

i copy Log type data into BigQuery for bigdata analytics.

And i guess there's also a legitimate place for the full text search Api too 
(tho i personally haven't used it yet).

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/google-appengine/-/_QhhWG8HVkAJ.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.



[google-appengine] Re: How to solve the aggregate query with condition in Google App Engine

2012-08-06 Thread Neo
Hi Joakim ,

If I understand it correctly, the string-day entity would be something like 
this:
Suppose there are 1 different strings S1, S2, ..., S1 on some 
particular day, each asked many a times in that day d1, then we will have 
1 of entities: d1S1, d1S2, ...,d1S1. That means, suppose there are 
't' days in the given time period, and on all of the days in this interval, 
there are only 1 different strings (S1 to S1), and assuming that 
each of the string is inputted on half of the days in the time interval - 
that means we will have to process 1 * (t/2) records - for t = 200 
days, it turns out to be 1 Million of records processing for a single 
query. This too, when we have assumed that there are only 1 different 
queries in those 200 days - in a real scenario this can well be 10. 

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/google-appengine/-/5o4P_bEwXG0J.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.



Re: [google-appengine] Re: How to solve the aggregate query with condition in Google App Engine

2012-08-06 Thread Rising India
Hi Joakim ,

If I understand it correctly, the string-day entity would be something like
this:
Suppose there are 1 different strings S1, S2, ..., S1 on some
particular day, each asked many a times in that day d1, then we will have
1 of entities: d1S1, d1S2, ...,d1S1. That means, suppose there are
't' days in the given time period, and on all of the days in this interval,
there are only 1 different strings (S1 to S1), and assuming that
each of the string is inputted on half of the days in the time interval -
that means we will have to process 1 * (t/2) records - for t = 200
days, it turns out to be 1 Million of records processing for a single
query. This too, when we have assumed that there are only 1 different
queries in those 200 days - in a real scenario this can well be 10.

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.



[google-appengine] Re: How to solve the aggregate query with condition in Google App Engine

2012-08-06 Thread Joakim
I'd solve this by having one unique entity per combination of date and 
string, storing the string's total for that day in said entity. You can 
achieve this uniqueness either by formatting special entity names (string 
ids) to include both the date and the string. Aggregate the day's totals in 
an entity keyed on the date, holding a map/dictionary of string->count.
Your query would become something like this:
SELECT * FROM DayCount WHERE __key__ >= Key('DayCount', startDate) AND 
__key__ <= Key('DayCount', endDate)

Or just generate all possible keys from start to end and do a batch fetch. 
Do the counting and sorting yourself. It's easy to see that this could 
become slow and expensive, so make sure to cache the generated results - 
especially if endDate is in the past - in an entity, and cache that entity 
in memcache. Keep the key to that entity predictable (formatted string with 
both dates) and you're down to a single fetch. This is also why I said to 
aggregate all of the day's totals in the DayCount-entity, so that there 
will be fewer fetches at query-time.

You may be asking yourself why you should bother having one entity per 
string per day at all, when you could do this with just the 
DayCount-entity. The reason is that there are limits to how many times per 
second you can write to a single entity (entity group to be correct). For 
more on this, see the App Engine talks from Google I/O 2011 (relevant 
link
).
Depending on how important consistency is to your application, you could 
update the DayCount in the same transaction as the per-string-and-day 
entity - which would be a bit slower, or defer that job to ASAP-but-not-now 
by scheduling a task on the TaskQueue, or just have a cron job checking 
every 5 or ten minutes.


On Wednesday, August 1, 2012 8:29:49 AM UTC+2, Neo wrote:
>
> Suppose, In my website, I ask users to input some string. A user can input 
> string multiple times. Whenever any user inputs a string, I log it in the 
> database along with the time. Many strings can be same, even though 
> inputted by different users. In the home page, I need to give the interface 
> such that any user can query for top n (say 50) strings in any time period 
> (say 10 Jan 2012 to 30 Jan 2012). If it was SQL, I could have written query 
> like: 
>
> select string, count(*)
> from userStrings where day >= d1 and day <= d2
> group by string 
> order by count(*) desc
> limit n
>
> How do I solve it in GAE environment? For each such user query, I can't 
> process the record at query time - there can be millions of records.
>
> I am using JDO. My obvious goal is to minimize the app engine cost : CPU + 
> data.
>
> Thanks,
>

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/google-appengine/-/E13JhQDkaFYJ.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.



[google-appengine] Re: How to solve the aggregate query with condition in Google App Engine

2012-08-05 Thread Neo
Hi Hyperflame,

Again, the problem with this strategy is that it gives the top 'n' items 
without considering time period - it is the absolute count of each string 
till date. It doesn't take into account the time period window. 

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/google-appengine/-/4Fdi8t3TjJwJ.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.



[google-appengine] Re: How to solve the aggregate query with condition in Google App Engine

2012-08-05 Thread hyperflame
Instead of counting the number of times a string has appeared at
querytime, just change to counting the number of times the string has
been input at input-time.

E.g. When your user puts in a string, search the datastore to see if
that string already appears in an entity. If so, set the count
property on that entity +1. If not, then add a new entity and set the
count property to 1. Then at query time, you can sort by the count
property, in addition to the date limitations.

On Aug 5, 6:20 am, Neo  wrote:
> Hi Hyperflame. Thanks for your reply. The problem is, between any given
> time period, there can be millions of such records. For each of such query,
> I will have to process those millions of records to get the top x entities.
> This is definitely very inefficient and not acceptable.
>
> -Neo
>
>
>
> On Thursday, August 2, 2012 7:12:12 AM UTC+5:30, hyperflame wrote:
>
> > Well, you could just use Google Cloud SQL if you really want a SQL
> > environment.
>
> > Otherwise ( this is just a quick example, obviously there are some
> > optimizations you could do). Also i'm typing this out on my phone, so
> > forgive me if I miss a semicolon or something.
>
> > To add a string to the datastore:
> > Entity record = new Entity("record");
> > record.setProperty("some_input", input_string);
> > record.setProperty("add_date", new Date());
> > datastore.put(record);
>
> > To search for strings in a given time period:
> > Query q = new Query("record");
> > Q.setFilter("add_date", FilterOperator.LESS_THAN,
> > some_date_that_you're_searching_for);
> > PreparedQuery pq = datastore.prepare(q);
>
> > And then you can pull out the top x entities out of pq as a List, then
> > extract their properties to find the string.
>
> > On Aug 1, 1:29 am, Neo  wrote:
> > > Suppose, In my website, I ask users to input some string. A user can
> > input
> > > string multiple times. Whenever any user inputs a string, I log it in
> > the
> > > database along with the time. Many strings can be same, even though
> > > inputted by different users. In the home page, I need to give the
> > interface
> > > such that any user can query for top n (say 50) strings in any time
> > period
> > > (say 10 Jan 2012 to 30 Jan 2012). If it was SQL, I could have written
> > query
> > > like:
>
> > > select string, count(*)
> > > from userStrings where day >= d1 and day <= d2
> > > group by string
> > > order by count(*) desc
> > > limit n
>
> > > How do I solve it in GAE environment? For each such user query, I can't
> > > process the record at query time - there can be millions of records.
>
> > > I am using JDO. My obvious goal is to minimize the app engine cost : CPU
> > +
> > > data.
>
> > > Thanks,

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.



[google-appengine] Re: How to solve the aggregate query with condition in Google App Engine

2012-08-05 Thread Neo
Hi Hyperflame. Thanks for your reply. The problem is, between any given 
time period, there can be millions of such records. For each of such query, 
I will have to process those millions of records to get the top x entities. 
This is definitely very inefficient and not acceptable.

-Neo

On Thursday, August 2, 2012 7:12:12 AM UTC+5:30, hyperflame wrote:
>
> Well, you could just use Google Cloud SQL if you really want a SQL 
> environment. 
>
> Otherwise ( this is just a quick example, obviously there are some 
> optimizations you could do). Also i'm typing this out on my phone, so 
> forgive me if I miss a semicolon or something. 
>
> To add a string to the datastore: 
> Entity record = new Entity("record"); 
> record.setProperty("some_input", input_string); 
> record.setProperty("add_date", new Date()); 
> datastore.put(record); 
>
> To search for strings in a given time period: 
> Query q = new Query("record"); 
> Q.setFilter("add_date", FilterOperator.LESS_THAN, 
> some_date_that_you're_searching_for); 
> PreparedQuery pq = datastore.prepare(q); 
>
> And then you can pull out the top x entities out of pq as a List, then 
> extract their properties to find the string. 
>
>
>
> On Aug 1, 1:29 am, Neo  wrote: 
> > Suppose, In my website, I ask users to input some string. A user can 
> input 
> > string multiple times. Whenever any user inputs a string, I log it in 
> the 
> > database along with the time. Many strings can be same, even though 
> > inputted by different users. In the home page, I need to give the 
> interface 
> > such that any user can query for top n (say 50) strings in any time 
> period 
> > (say 10 Jan 2012 to 30 Jan 2012). If it was SQL, I could have written 
> query 
> > like: 
> > 
> > select string, count(*) 
> > from userStrings where day >= d1 and day <= d2 
> > group by string 
> > order by count(*) desc 
> > limit n 
> > 
> > How do I solve it in GAE environment? For each such user query, I can't 
> > process the record at query time - there can be millions of records. 
> > 
> > I am using JDO. My obvious goal is to minimize the app engine cost : CPU 
> + 
> > data. 
> > 
> > Thanks, 
>

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/google-appengine/-/W9RoTvb0PHYJ.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.



[google-appengine] Re: How to solve the aggregate query with condition in Google App Engine

2012-08-01 Thread hyperflame
Well, you could just use Google Cloud SQL if you really want a SQL
environment.

Otherwise ( this is just a quick example, obviously there are some
optimizations you could do). Also i'm typing this out on my phone, so
forgive me if I miss a semicolon or something.

To add a string to the datastore:
Entity record = new Entity("record");
record.setProperty("some_input", input_string);
record.setProperty("add_date", new Date());
datastore.put(record);

To search for strings in a given time period:
Query q = new Query("record");
Q.setFilter("add_date", FilterOperator.LESS_THAN,
some_date_that_you're_searching_for);
PreparedQuery pq = datastore.prepare(q);

And then you can pull out the top x entities out of pq as a List, then
extract their properties to find the string.



On Aug 1, 1:29 am, Neo  wrote:
> Suppose, In my website, I ask users to input some string. A user can input
> string multiple times. Whenever any user inputs a string, I log it in the
> database along with the time. Many strings can be same, even though
> inputted by different users. In the home page, I need to give the interface
> such that any user can query for top n (say 50) strings in any time period
> (say 10 Jan 2012 to 30 Jan 2012). If it was SQL, I could have written query
> like:
>
> select string, count(*)
> from userStrings where day >= d1 and day <= d2
> group by string
> order by count(*) desc
> limit n
>
> How do I solve it in GAE environment? For each such user query, I can't
> process the record at query time - there can be millions of records.
>
> I am using JDO. My obvious goal is to minimize the app engine cost : CPU +
> data.
>
> Thanks,

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.