I'm just digging into MapReduce on Google App Engine, and my early results 
are discouraging. I had in mind that I'd process about 10GB of data for an 
analysis I want to do, and I didn't think that would be a big deal (given 
all the talk about petabyte-scale storage and such), but it's currently 
looking impossible. 

I did a simple word count mapreduce on some Gutenberg books (63MB zipped, 
166MB unzipped), once using Google's Python mapreduce example 
(https://cloud.google.com/appengine/docs/python/dataprocessing/) and once 
using the dumb-as-rocks standalone Python scripts posted at the top of 
Michael Noll's Hadoop tutorial 
(http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/).
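
For anyone who wants a sense of what the "simple" side of the comparison looks like, here is a minimal sketch of a standalone word count in the same map-then-reduce spirit. It is not the code from either link; the names map_words and reduce_counts are made up for illustration, and it splits on whitespace only.

```python
from collections import Counter


def map_words(lines):
    """Mapper: yield a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(pairs):
    """Reducer: sum the counts emitted for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


# Tiny in-memory input; a real run would iterate over the lines of the book files.
sample = ["the quick brown fox", "the lazy dog"]
counts = reduce_counts(map_words(sample))
for word in sorted(counts):
    print("%s\t%d" % (word, counts[word]))
```

Everything here runs in one process and keeps the totals in memory, which is presumably a large part of why the local version has so little overhead.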

Experimental results:

Simple Python: 1 minute 22 seconds
GAE dev server: 2 hours 17 minutes 12 seconds
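
To put those two timings on the same scale, a quick conversion to seconds (the helper name to_seconds is just for illustration):

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s wall-clock duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


local_s = to_seconds(minutes=1, seconds=22)           # simple Python run
gae_s = to_seconds(hours=2, minutes=17, seconds=12)   # GAE dev server run

# float() keeps the division exact on Python 2 as well as Python 3.
slowdown = float(gae_s) / local_s
print("local: %d s, dev server: %d s, slowdown: about %.0fx" % (local_s, gae_s, slowdown))
```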

Given the staggering difference in run time, even if computation in the 
cloud were free, I'd still opt to compute locally unless my hand were 
forced somehow (e.g. input files that didn't fit on my disk). Of course, 
the computation is not free, which means you're not only enduring all that 
overhead, but paying for it too.

I did try running this same test "in production," i.e. on Google Cloud's 
infrastructure. At first it failed, because just getting the job started 
exceeded the 128MB memory limit for the free tier. I turned on billing, 
bumped up the instance class to F4, and let it go. It chewed through the 
free tier quickly, then about USD$8 of instance time before one of the 
shuffle-merge shards seemed to enter an infinite loop (ran for 2 hours, no 
errors in logs). I aborted and gave up at that point.

Everything I hear about cloud computing makes it sound like the gleaming, 
glossy future, but these results make it seem expensive and slow. $8+ to 
do a mapreduce across 60MB of data just doesn't seem like a good deal to 
me. At that rate, there's no way I can afford to process my 10GB dataset on 
App Engine. I understand that with the pipeline model you get fault 
tolerance and status reports and basic job management, but none of that is 
worth the expense or a 100x performance hit.
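
As a rough sketch of "at that rate": the extrapolation below assumes cost scales linearly with input size and that my 10GB is comparable to the zipped 63MB sample. Neither assumption is verified, and the $8 was spent on a job that never finished, so treat the result as a loose lower bound at best. The function name linear_cost is made up for illustration.

```python
COST_USD = 8.0            # instance time spent before I aborted the job
SAMPLE_MB = 63.0          # zipped size of the Gutenberg sample
TARGET_MB = 10 * 1024.0   # the 10GB dataset, taking 1GB = 1024MB


def linear_cost(target_mb, sample_mb, sample_cost):
    """Scale the observed cost by the ratio of input sizes (assumes linear scaling)."""
    return sample_cost * target_mb / sample_mb


estimate = linear_cost(TARGET_MB, SAMPLE_MB, COST_USD)
print("extrapolated cost for 10GB: about $%.0f" % estimate)
```

Under those assumptions the figure lands somewhere around $1,300, which is the scale of number I have in mind when I say I can't afford it.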

I think there are two possible problems here:

1) I made a technical mistake in my experiment and my results are invalid
2) I'm not understanding the benefits / value proposition of App Engine

Are my results consistent with what others would expect? Does either of my 
candidate explanations (or both) ring true? What else am I not considering? I 
think this is more a discussion topic than a discrete ask-and-answer, which 
is why I'm posting here instead of Stack Overflow.
