Hello,

I am trying to better understand how a map-reduce job runs. I know that
many datastore operations are needed to keep the state of the job,
but how does the mapreduce library keep track of "yielded" data?

I am running a job that processes over 13 million entities and normalizes some
data to dump into Google Cloud Storage (approximately 15 GB of data). When I use
the FileOutputWriter, where does it keep track of each line I've yielded?
How do I end up with only one large file written to GCS? I looked
at the bucket during a map-reduce operation and saw nothing until
the job was done and one large file was ready for me to use. Does each
shard aggregate its data into a blobstore object before a final step merges
all the shards' data and writes it to GCS? How is the library able to do
this on F1 instances with their memory constraints? I was not able to easily
follow the code behind all this, so I was hoping someone who is more
familiar with the process could shed some light.

I have other use cases in which I need to process a lot of data and would 
like to end up with a single large output file, but my current method is not
very stable and does not fit into a map-reduce job. Knowing the general
logic behind aggregating the data and placing it into Google Cloud Storage
would be of great benefit to me.

Any and all insight would be greatly appreciated!

Thank you,
Prateek
