Re: [google-appengine] Effectively Parallelizing Fetches (with pictures, yay!)
Ah, thanks Nick! I actually started to implement some of those changes but got sidetracked with other things. I'm starting on it again now. Will probably have more questions later :)

(In reply to Nick Johnson (Google) nick.john...@google.com, Tue, Apr 20, 2010 at 4:23 AM.)
Re: [google-appengine] Effectively Parallelizing Fetches (with pictures, yay!)
Hi Patrick,

Good questions!

On Tue, Apr 20, 2010 at 12:57 AM, Patrick Twohig patr...@namazustudios.com wrote:

> Hi All,
>
> As I understand it, the process of performing a single fetch (call to get()) from the datastore using a key basically involves finding the host housing the entity, opening a socket, fetching the data, and then cleaning up the connection. So to fetch something like 30 entities from the datastore, you're repeating the process 30 times over in serial, each time incurring whatever overhead is involved. I also read that if you perform bulk fetches (i.e. passing multiple keys at once) you can eliminate a great deal of that overhead. In one of the videos I watched from Google I/O 2009, the presenter (whose name I forget - d'oh) said that performing a bulk fetch actually performs the fetches in parallel from the datastore and you should see requests complete noticeably faster.
>
> Currently I have a few situations where the app performs many fetches from the datastore serially, rather than in bulk, and I believe this is why those requests are extremely slow and CPU intensive. Where possible, I have put in as many bulk fetches as I can, but I'm a little stuck in a few places.
>
> I'm basing the fetch latency on today's numbers -- http://code.google.com/status/appengine/detail/datastore/2010/04/19. Anomalies aside, it looks like the get latency is somewhere between 80ms and 160ms; let's split the difference and just say that it's 120 milliseconds. Additionally, the query latency is somewhere between 250ms and 500ms. Splitting the difference, that's 375ms. I'm just going to use those numbers as a ballpark estimate for fetching multiple entities from the datastore; feel free to correct me if any of my reasoning is flawed or incorrect.

The figures shown by the status site seem to be on the high side at the moment - they represent worst cases. In my own apps, gets are observed to be more on the order of 10-20ms, while queries vary widely depending on returned data, but average about 100-300ms.

> Example 1: http://imagepaste.nullnetwork.net/viewimage.php?id=830
>
> Given the above example, I'm assuming that if I performed an ancestor query with Foo(A) as the ancestor it would effectively bulk-fetch the entire entity group. I could then use the result of that query to get the data I need. That would make the fetch from the datastore one query, 375 milliseconds, versus (7 entities * 160ms) or 1120ms. So long as you need 3 or more entities (3 * 160) it would stand to reason that you're better off just fetching the whole thing. In some simple tests I did, that seemed to be the case: the query approach was faster, and that's great if you know everything is in the same entity group.
>
> Example 2: http://imagepaste.nullnetwork.net/viewimage.php?id=831
>
> Given the above example, none of the entities are in the same entity group, but I would want to try to perform bulk fetches wherever possible. I would first fetch Foo(A). I would then see that it has two key properties pointing to Bar(B) and Bar(C), and perform a fetch of those two entities at once. Finally, I would see that Bar(B) and Bar(C) each reference two more entities -- Baz(D), Baz(E), Baz(F), and Baz(G) -- for a total of four. In the worst case, I would fetch each entity individually, taking, once again, 1120ms. In the best case, where I perform 3 fetches (fetch A first, then fetch B and C, then lastly fetch D, E, F, and G), it would be more in the neighborhood of 480 milliseconds. It's still an improvement over fetching each entity individually, but not much.

Very similar to this is the 'ReferenceProperty prefetching' pattern - see http://blog.notdot.net/2010/01/ReferenceProperty-prefetching-in-App-Engine

> So I was thinking of ways to improve this, the second example in particular, because I have a few places in my app where that exact thing is happening. Right now it's actually implemented with individual fetches, but it is backed by memcache in many circumstances, so that definitely helps. So given that, here are my questions...
>
> - When serializing the objects, would it be worthwhile adding some sort of metadata in the entity that would tell me what other entities it references (either directly or indirectly) so that I could fetch the whole thing with one or two API calls? I was thinking that an entity could have child entities with all the keys it references directly or indirectly. This would be a huge pain to implement, and I'm not sure it would make a noticeable performance boost.

Certainly, if serial gets are a significant problem for you that isn't solved with simple prefetching, this could be worth doing. I would avoid using child entities, however, and simply have a list of keys instead.

> - Is there something under the covers of the API that actually makes more
[google-appengine] Effectively Parallelizing Fetches (with pictures, yay!)
Hi All,

As I understand it, the process of performing a single fetch (call to get()) from the datastore using a key basically involves finding the host housing the entity, opening a socket, fetching the data, and then cleaning up the connection. So to fetch something like 30 entities from the datastore, you're repeating the process 30 times over in serial, each time incurring whatever overhead is involved. I also read that if you perform bulk fetches (i.e. passing multiple keys at once) you can eliminate a great deal of that overhead. In one of the videos I watched from Google I/O 2009, the presenter (whose name I forget - d'oh) said that performing a bulk fetch actually performs the fetches in parallel from the datastore and you should see requests complete noticeably faster.

Currently I have a few situations where the app performs many fetches from the datastore serially, rather than in bulk, and I believe this is why those requests are extremely slow and CPU intensive. Where possible, I have put in as many bulk fetches as I can, but I'm a little stuck in a few places.

I'm basing the fetch latency on today's numbers -- http://code.google.com/status/appengine/detail/datastore/2010/04/19. Anomalies aside, it looks like the get latency is somewhere between 80ms and 160ms; let's split the difference and just say that it's 120 milliseconds. Additionally, the query latency is somewhere between 250ms and 500ms. Splitting the difference, that's 375ms. I'm just going to use those numbers as a ballpark estimate for fetching multiple entities from the datastore; feel free to correct me if any of my reasoning is flawed or incorrect.

Example 1: http://imagepaste.nullnetwork.net/viewimage.php?id=830

Given the above example, I'm assuming that if I performed an ancestor query with Foo(A) as the ancestor it would effectively bulk-fetch the entire entity group. I could then use the result of that query to get the data I need. That would make the fetch from the datastore one query, 375 milliseconds, versus (7 entities * 160ms) or 1120ms. So long as you need 3 or more entities (3 * 160) it would stand to reason that you're better off just fetching the whole thing. In some simple tests I did, that seemed to be the case: the query approach was faster, and that's great if you know everything is in the same entity group.

Example 2: http://imagepaste.nullnetwork.net/viewimage.php?id=831

Given the above example, none of the entities are in the same entity group, but I would want to try to perform bulk fetches wherever possible. I would first fetch Foo(A). I would then see that it has two key properties pointing to Bar(B) and Bar(C), and perform a fetch of those two entities at once. Finally, I would see that Bar(B) and Bar(C) each reference two more entities -- Baz(D), Baz(E), Baz(F), and Baz(G) -- for a total of four. In the worst case, I would fetch each entity individually, taking, once again, 1120ms. In the best case, where I perform 3 fetches (fetch A first, then fetch B and C, then lastly fetch D, E, F, and G), it would be more in the neighborhood of 480 milliseconds. It's still an improvement over fetching each entity individually, but not much.

So I was thinking of ways to improve this, the second example in particular, because I have a few places in my app where that exact thing is happening. Right now it's actually implemented with individual fetches, but it is backed by memcache in many circumstances, so that definitely helps. So given that, here are my questions...

- When serializing the objects, would it be worthwhile adding some sort of metadata in the entity that would tell me what other entities it references (either directly or indirectly) so that I could fetch the whole thing with one or two API calls? I was thinking that an entity could have child entities with all the keys it references directly or indirectly. This would be a huge pain to implement, and I'm not sure it would make a noticeable performance boost.
- Is there something under the covers of the API that actually makes more efficient usage of resources that I don't know about?
- Is there something in the API that I don't know about that could make the second example faster w/o much effort?
- Is my design just bad and I should figure out a better way of doing it? If so, how would I go about doing that?

Alright, that's all for now.

Thanks,
Patrick.

--
Patrick H. Twohig.
Namazu Studios
P.O. Box 34161
San Diego, CA 92163-4161
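
For anyone following along, here is a rough sketch of the two batching strategies discussed for Example 2: fetching the reference graph one level at a time, and storing a flat list of referenced keys on the root entity so the rest can be pulled in a single batch. It is not App Engine code. `FakeDatastore`, the `refs` and `all_refs` fields, and both function names are hypothetical stand-ins, and the latency arithmetic assumes, as the post does, that a batch get costs about the same 160ms as a single get.

```python
class FakeDatastore:
    """In-memory stand-in for the datastore (hypothetical); counts round trips."""

    def __init__(self, entities):
        self.entities = entities
        self.round_trips = 0

    def get(self, keys):
        # Assumption: a batch get is one round trip however many keys it carries.
        self.round_trips += 1
        return [self.entities[key] for key in keys]


def fetch_by_level(store, root_key):
    """Fetch a reference graph level by level, one batch get per level."""
    fetched = {}
    seen = {root_key}
    frontier = [root_key]
    while frontier:
        entities = store.get(frontier)
        next_frontier = []
        for key, entity in zip(frontier, entities):
            fetched[key] = entity
            for ref in entity["refs"]:
                if ref not in seen:  # never queue the same key twice
                    seen.add(ref)
                    next_frontier.append(ref)
        frontier = next_frontier
    return fetched


def fetch_with_key_list(store, root_key):
    """Fetch the root, then everything in its precomputed key list in one batch."""
    root = store.get([root_key])[0]
    fetched = {root_key: root}
    all_refs = root.get("all_refs", [])
    if all_refs:
        fetched.update(zip(all_refs, store.get(all_refs)))
    return fetched


# The graph from Example 2: A -> B, C; B -> D, E; C -> F, G.
# "all_refs" is the hypothetical flat list of every key reachable from A.
GRAPH = {
    "A": {"refs": ["B", "C"], "all_refs": ["B", "C", "D", "E", "F", "G"]},
    "B": {"refs": ["D", "E"]},
    "C": {"refs": ["F", "G"]},
    "D": {"refs": []},
    "E": {"refs": []},
    "F": {"refs": []},
    "G": {"refs": []},
}

GET_LATENCY_MS = 160  # worst-case get latency used in the post

by_level = FakeDatastore(GRAPH)
level_result = fetch_by_level(by_level, "A")
level_ms = by_level.round_trips * GET_LATENCY_MS

key_list = FakeDatastore(GRAPH)
list_result = fetch_with_key_list(key_list, "A")
list_ms = key_list.round_trips * GET_LATENCY_MS

serial_ms = len(GRAPH) * GET_LATENCY_MS

print(serial_ms, level_ms, list_ms)  # 1120 480 320
```

Under those assumptions the level-by-level approach needs three round trips (the 480ms figure from the post), while the key-list approach needs two regardless of how deep the graph is. The cost of the key list is that it has to be kept up to date whenever a referenced entity changes what it points to.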