Hi,
I'm running into a problem where IdentityMapper seems to produce far too much data. For example, I have a job that reads a sequence file using IdentityMapper and then uses IdentityReducer to write everything back out to another sequence file. My input is a ~60MB sequence file, and after the map phase has completed, the job tracker UI reports about 10GB for "Map output bytes". It looks as if the output collector does not get properly reset, so each record that gets emitted has the correct key, but the value ends up being all the data read up to that point. I think this is a known issue, but I can't find any discussion of it right now. Has anyone else run into this, and if so, is there a solution? I'm using the latest code in the 0.15 branch.
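To illustrate the kind of behavior I suspect, here is a minimal sketch in plain Java. It is not Hadoop code and does not reproduce the actual bug; the class BufferReuseSketch and the method emittedBytes are made up for illustration. It only shows how a shared output buffer that is never reset between records makes the emitted byte count grow roughly quadratically with the number of records.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only: models a collector that serializes each
// record into a shared buffer and emits whatever the buffer currently holds.
public class BufferReuseSketch {

    // Returns the total number of bytes emitted across all records.
    // When resetBetweenRecords is false, each emitted value also contains
    // every record written before it.
    static long emittedBytes(byte[][] records, boolean resetBetweenRecords) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        long total = 0;
        for (byte[] record : records) {
            if (resetBetweenRecords) {
                buffer.reset(); // discard the previous record's bytes
            }
            buffer.write(record, 0, record.length);
            total += buffer.size(); // bytes emitted for this record
        }
        return total;
    }

    public static void main(String[] args) {
        // 100 records of 1000 bytes each, i.e. 100,000 bytes of input.
        byte[][] records = new byte[100][1000];
        System.out.println("with reset:    " + emittedBytes(records, true));
        System.out.println("without reset: " + emittedBytes(records, false));
    }
}
```

With 100 records of 1000 bytes, the reset version emits 100,000 bytes and the version without a reset emits 5,050,000 bytes, so a blow-up of this sort gets worse the more records a single map task handles.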
Thanks
Mike
