Re: [google-appengine] Effectively Parallelizing Fetches (with pictures, yay!)

2010-05-04 Thread Patrick Twohig
Ah, thanks Nick!  I actually started to implement some of those changes but
got sidetracked with other things. I'm picking it up again now, and will
probably have more questions later :)


Re: [google-appengine] Effectively Parallelizing Fetches (with pictures, yay!)

2010-04-20 Thread Nick Johnson (Google)
Hi Patrick,

Good questions!

On Tue, Apr 20, 2010 at 12:57 AM, Patrick Twohig
wrote:

> Hi All,
>
> As I understand it, the process of performing a single fetch (a call to
> get()) from the datastore using a key basically involves finding the host
> housing the entity, opening a socket, fetching the data, and then cleaning
> up the connection.  So to fetch something like 30 entities from the
> datastore, you're repeating the process 30 times over in serial, each time
> incurring whatever overhead is involved.  I also read that if you perform
> bulk fetches (i.e. passing multiple keys at once), you can eliminate a great
> deal of that overhead.  In one of the videos I watched from Google I/O 2009,
> the presenter (whose name I forget - d'oh) said that a bulk fetch actually
> performs the fetches in parallel from the datastore, so you should see
> requests complete noticeably faster.
>
> Currently I have a few situations where the app performs many fetches from
> the datastore serially, rather than in bulk, and I believe that is why
> these requests are extremely slow and CPU intensive.  Where possible, I
> have put in as many bulk fetches as I can, but I'm a little stuck in a few
> places.
>
> I'm basing the fetch latency on today's numbers --
> http://code.google.com/status/appengine/detail/datastore/2010/04/19.
> Anomalies aside, it looks like the get latency is somewhere between 80ms and
> 160ms; let's split the difference and just say that it's 120 milliseconds.
> Additionally, the query latency is somewhere between 250ms and 500ms.
> Splitting the difference, that's 375ms.  I'm just going to use those numbers
> as a ballpark estimate for fetching multiple entities from the datastore;
> feel free to correct me if any of my reasoning is flawed or incorrect.
>

The figures shown by the status site seem to be on the high side at the
moment - they represent worst cases. In my own apps, gets are observed to be
more on the order of 10-20ms, while queries vary widely depending on
returned data, but average about 100-300ms.


> Example 1: http://imagepaste.nullnetwork.net/viewimage.php?id=830
>
> Given the above example, I'm assuming that if I performed an ancestor query
> with Foo("A") as the ancestor it would effectively bulk-fetch the entire
> entity group.  I could then use the result of that query to get the data I
> need.  That would make the fetch from the datastore one query at 375
> milliseconds, versus 7 entities * 160ms (the worst-case get latency), or
> 1120ms.  So long as you need 3 or more entities (3 * 160ms = 480ms), it
> would stand to reason that you're better off just fetching the whole thing.
> In some simple tests I did, that seemed to be the case: the query approach
> was faster, and that's great if you know everything is in the same entity
> group.
>
> Example 2:  http://imagepaste.nullnetwork.net/viewimage.php?id=831
>
> Given the above example, none of the entities are in the same entity group,
> but I would want to perform bulk fetches wherever possible.  I would
> first fetch Foo("A").  I would then see that it has two key properties
> pointing to Bar("B") and Bar("C"), and fetch those two entities at
> once.  Finally, I would see that Bar("B") and Bar("C") each reference two
> more entities -- Baz("D"), Baz("E"), Baz("F"), and Baz("G") -- for a total
> of four.  In the worst case, I would fetch each entity individually, taking,
> once again, 1120ms.  In the best case, where I perform 3 bulk fetches (fetch
> A first, then B and C, then lastly D, E, F, and G), it would be more in the
> neighborhood of 480 milliseconds.  It's still an improvement over fetching
> each entity individually, but not much.
>

Very similar to this is the 'ReferenceProperty prefetching' pattern - see
http://blog.notdot.net/2010/01/ReferenceProperty-prefetching-in-App-Engine



>
> So I was thinking of ways to improve this, the second example in
> particular, because I have a few places in my app where that exact thing is
> happening.  Right now it's actually implemented with individual fetches, but
> it is backed by memcache in many circumstances, so that definitely helps.
>
> So given that, here are my questions...
>
>    - When serializing the objects, would it be worthwhile adding some sort
>    of metadata in the entity that would tell me what other entities it
>    references (either directly or indirectly) so that I could fetch the whole
>    thing with one or two API calls?  I was thinking that an entity could have
>    child entities with all the keys it references directly or indirectly.
>    This would be a huge pain to implement, and I'm not sure it would make a
>    noticeable performance boost.
>
>
Certainly, if you experience serial gets as a significant problem that isn't
solved with simple prefetching, this could be worth doing. I would avoid
using child entities, however, and simply have a list of keys instead.
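
A minimal sketch of the list-of-keys idea, in Python. It uses an in-memory
dict as a stand-in for the datastore, and the names transitive_refs, bulk_get
and the all_refs field are hypothetical, chosen only for illustration. The
point is that the full key list is computed once at write time, so a read
needs two bulk calls regardless of how deep the reference graph is.

```python
def transitive_refs(store, key):
    """Return every key reachable from `key`, directly or indirectly."""
    seen = []
    queue = list(store[key]["refs"])
    while queue:
        ref = queue.pop(0)
        if ref not in seen:
            seen.append(ref)
            queue.extend(store[ref]["refs"])
    return seen

# Stand-in for the datastore: each entity records its direct references.
store = {
    "A": {"refs": ["B", "C"]},
    "B": {"refs": ["D", "E"]},
    "C": {"refs": ["F", "G"]},
    "D": {"refs": []}, "E": {"refs": []},
    "F": {"refs": []}, "G": {"refs": []},
}

# At write time: denormalize the complete key list onto the root entity.
store["A"]["all_refs"] = transitive_refs(store, "A")

calls = []  # records each simulated round trip

def bulk_get(keys):
    """Hypothetical bulk get: one round trip however many keys are passed."""
    calls.append(list(keys))
    return [store[k] for k in keys]

# At read time: one get for the root, one bulk get for everything else.
root = bulk_get(["A"])[0]
others = bulk_get(root["all_refs"])
print(len(calls), root["all_refs"])
```

The trade-off is that the list has to be recomputed whenever any reference
below the root changes, which is where most of the implementation effort
would go.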


>- Is there 

[google-appengine] Effectively Parallelizing Fetches (with pictures, yay!)

2010-04-19 Thread Patrick Twohig
Hi All,

As I understand it, the process of performing a single fetch (a call to
get()) from the datastore using a key basically involves finding the host
housing the entity, opening a socket, fetching the data, and then cleaning
up the connection.  So to fetch something like 30 entities from the
datastore, you're repeating the process 30 times over in serial, each time
incurring whatever overhead is involved.  I also read that if you perform
bulk fetches (i.e. passing multiple keys at once), you can eliminate a great
deal of that overhead.  In one of the videos I watched from Google I/O 2009,
the presenter (whose name I forget - d'oh) said that a bulk fetch actually
performs the fetches in parallel from the datastore, so you should see
requests complete noticeably faster.

Currently I have a few situations where the app performs many fetches from
the datastore serially, rather than in bulk, and I believe that is why these
requests are extremely slow and CPU intensive.  Where possible, I have put
in as many bulk fetches as I can, but I'm a little stuck in a few places.

I'm basing the fetch latency on today's numbers --
http://code.google.com/status/appengine/detail/datastore/2010/04/19.
Anomalies aside, it looks like the get latency is somewhere between 80ms and
160ms; let's split the difference and just say that it's 120 milliseconds.
Additionally, the query latency is somewhere between 250ms and 500ms.
Splitting the difference, that's 375ms.  I'm just going to use those numbers
as a ballpark estimate for fetching multiple entities from the datastore;
feel free to correct me if any of my reasoning is flawed or incorrect.

Example 1: http://imagepaste.nullnetwork.net/viewimage.php?id=830

Given the above example, I'm assuming that if I performed an ancestor query
with Foo("A") as the ancestor it would effectively bulk-fetch the entire
entity group.  I could then use the result of that query to get the data I
need.  That would make the fetch from the datastore one query at 375
milliseconds, versus 7 entities * 160ms (the worst-case get latency), or
1120ms.  So long as you need 3 or more entities (3 * 160ms = 480ms), it
would stand to reason that you're better off just fetching the whole thing.
In some simple tests I did, that seemed to be the case: the query approach
was faster, and that's great if you know everything is in the same entity
group.
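
The arithmetic above can be written out as a rough Python sketch. The
function names and the cost model are assumptions for illustration only: each
serial get is taken to cost a fixed latency, and one ancestor query is taken
to cost a fixed latency regardless of how many entities it returns.

```python
import math

def serial_get_cost_ms(num_entities, get_latency_ms):
    """Estimated cost of fetching entities one get() at a time."""
    return num_entities * get_latency_ms

def break_even_entities(query_latency_ms, get_latency_ms):
    """Smallest entity count at which one query is strictly cheaper
    than fetching the same entities with serial gets."""
    return math.floor(query_latency_ms / get_latency_ms) + 1

# Figures from the estimate above: 160ms per get, 375ms per query.
print(serial_get_cost_ms(7, 160))     # 1120
print(break_even_entities(375, 160))  # 3
```

Under this model the break-even point depends entirely on the ratio of the
two latencies, so it moves a long way if real per-get latency turns out to be
lower than the ballpark figure.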

Example 2:  http://imagepaste.nullnetwork.net/viewimage.php?id=831

Given the above example, none of the entities are in the same entity group,
but I would want to perform bulk fetches wherever possible.  I would
first fetch Foo("A").  I would then see that it has two key properties
pointing to Bar("B") and Bar("C"), and fetch those two entities at
once.  Finally, I would see that Bar("B") and Bar("C") each reference two
more entities -- Baz("D"), Baz("E"), Baz("F"), and Baz("G") -- for a total
of four.  In the worst case, I would fetch each entity individually, taking,
once again, 1120ms.  In the best case, where I perform 3 bulk fetches (fetch
A first, then B and C, then lastly D, E, F, and G), it would be more in the
neighborhood of 480 milliseconds.  It's still an improvement over fetching
each entity individually, but not much.
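
The three-step fetch described above amounts to one bulk get per level of
the reference graph. Here is a minimal Python sketch of that idea, using an
in-memory dict as a stand-in for the datastore. FAKE_STORE, batch_get and
fetch_graph are hypothetical names, and batch_get only simulates a bulk
get by counting round trips.

```python
# Stand-in for the datastore: each key maps to the keys it references.
FAKE_STORE = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F", "G"],
    "D": [], "E": [], "F": [], "G": [],
}

def batch_get(keys, stats):
    """Hypothetical bulk get: one round trip no matter how many keys."""
    stats["round_trips"] += 1
    return {key: FAKE_STORE[key] for key in keys}

def fetch_graph(root_key):
    """Fetch everything reachable from root_key, one batch per level."""
    stats = {"round_trips": 0}
    fetched = {}
    frontier = [root_key]
    while frontier:
        results = batch_get(frontier, stats)
        fetched.update(results)
        # Collect the next level, skipping keys we already have.
        next_frontier = []
        for refs in results.values():
            for key in refs:
                if key not in fetched and key not in next_frontier:
                    next_frontier.append(key)
        frontier = next_frontier
    return fetched, stats["round_trips"]

entities, round_trips = fetch_graph("A")
print(sorted(entities), round_trips)
```

For the seven entities in the example this takes 3 round trips instead of 7.
The number of round trips grows with the depth of the graph rather than with
the number of entities.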

So I was thinking of ways to improve this, the second example in particular,
because I have a few places in my app where that exact thing is happening.
Right now it's actually implemented with individual fetches, but it is backed
by memcache in many circumstances, so that definitely helps.

So given that, here are my questions...

   - When serializing the objects, would it be worthwhile adding some sort
   of metadata in the entity that would tell me what other entities it
   references (either directly or indirectly) so that I could fetch the whole
   thing with one or two API calls?  I was thinking that an entity could have
   child entities with all the keys it references directly or indirectly.  This
   would be a huge pain to implement, and I'm not sure it would make a
   noticeable performance boost.
   - Is there something "under the covers" of the API that actually makes
   more efficient usage of resources that I don't know about?
   - Is there something in the API that I don't know about that could make
   the second example faster w/o much effort?
   - Is my design just bad and I should figure out a better way of doing
   it?  If so, how would I go about doing that?

Alright, that's all for now.

Thanks,
Patrick.

-- 
Patrick H. Twohig.

Namazu Studios
P.O. Box 34161
San Diego, CA 92163-4161
