Hi,

Can you let us know what the values are for:

  Map input records
  Map spilled records
  Map output bytes

Is there any side-effect file written?
Thanks,
Amogh

On 2/23/10 8:57 PM, "Tim Kiefer" <tim-kie...@gmx.de> wrote:

No... 900GB is in the map column. Reduce adds another ~70GB of
FILE_BYTES_WRITTEN, and the total column consequently shows ~970GB.

Am 23.02.2010 16:11, schrieb Ed Mazur:
> Hi Tim,
>
> I'm guessing a lot of these writes are happening on the reduce side.
> On the JT web interface, there are three columns: map, reduce,
> overall. Is the 900GB figure from the overall column? The value in the
> map column will probably be closer to what you were expecting. There
> are writes on the reduce side too during the shuffle and multi-pass
> merge.
>
> Ed
>
> 2010/2/23 Tim Kiefer <tim-kie...@gmx.de>:
>
>> Hi Gang,
>>
>> thanks for your reply.
>>
>> To clarify: I look at the statistics through the job tracker. In the
>> web interface for my job I have columns for map, reduce, and total. What I
>> was referring to is "map" - i.e., I see FILE_BYTES_WRITTEN = 3 * Map
>> Output Bytes in the map column.
>>
>> About the replication factor: I would expect the exact same thing -
>> changing it to 6 has no influence on FILE_BYTES_WRITTEN.
>>
>> About the sorting: I have io.sort.mb = 100 and io.sort.factor = 10.
>> Furthermore, I have 40 mappers and the map output data is ~300GB. I can't
>> see how that ends up at a factor of 3?
>>
>> - tim
>>
>> Am 23.02.2010 14:39, schrieb Gang Luo:
>>
>>> Hi Tim,
>>> the intermediate data is materialized to the local file system. Before it
>>> is available to reducers, mappers will sort it. If the buffer (io.sort.mb)
>>> is too small for the intermediate data, multi-phase sorting happens, which
>>> means you read and write the same bits more than once.
>>>
>>> Besides, are you looking at the statistics per mapper through the job
>>> tracker, or just the information output when a job finishes? If you look at
>>> the information given out at the end of the job, note that this is an
>>> overall statistic which includes sorting on the reduce side. It may also
>>> include the amount of data written to HDFS (I am not 100% sure).
>>>
>>> And FILE_BYTES_WRITTEN has nothing to do with the replication factor.
>>> I think if you change the factor to 6, FILE_BYTES_WRITTEN will still be
>>> the same.
>>>
>>> -Gang
>>>
>>>
>>> Hi there,
>>>
>>> can anybody help me out with a (most likely) simple point of confusion?
>>>
>>> I am wondering how intermediate key/value pairs are materialized. I have a
>>> job where the map phase produces 600,000 records and map output bytes is
>>> ~300GB. What I thought (up to now) is that these 600,000 records, i.e.,
>>> 300GB, are materialized locally by the mappers and that later on reducers
>>> pull these records (based on the key).
>>> What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter is
>>> as high as ~900GB.
>>>
>>> So - where does the factor of 3 between Map output bytes and
>>> FILE_BYTES_WRITTEN come from??? I thought about the replication factor of
>>> 3 in the file system - but that should be HDFS only?!
>>>
>>> Thanks
>>> - tim
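
For what it's worth, Tim's own numbers do work out to a factor of roughly 3.
Below is a minimal back-of-the-envelope sketch (illustrative code, not Hadoop
source), assuming the default io.sort.spill.percent of 0.80 and that each
extra merge pass rewrites all of the data once; neither assumption is stated
in the thread.

    // Rough spill/merge I/O estimate from the numbers in this thread.
    public class SpillEstimate {
        public static void main(String[] args) {
            double mapOutputGB = 300.0;  // "Map output bytes" (from the thread)
            int mappers = 40;            // from the thread
            double ioSortMB = 100;       // io.sort.mb (from the thread)
            int ioSortFactor = 10;       // io.sort.factor (from the thread)
            double spillPercent = 0.80;  // io.sort.spill.percent, assumed default

            // Output per mapper and the size of each spill file.
            double perMapperMB = mapOutputGB * 1024 / mappers;   // ~7680 MB
            double spillMB = ioSortMB * spillPercent;            // ~80 MB
            int spills = (int) Math.ceil(perMapperMB / spillMB); // ~96 spill files

            // Merging with fan-in io.sort.factor cuts the file count ~10x per
            // pass, and each pass rewrites the data it merges.
            int passes = (int) Math.ceil(
                    Math.log(spills) / Math.log(ioSortFactor)); // 2 passes

            // One write for the initial spills plus one per merge pass.
            double writtenGB = mapOutputGB * (1 + passes);      // ~900 GB
            System.out.printf("spills/mapper=%d, merge passes=%d, ~%.0f GB written%n",
                    spills, passes, writtenGB);
        }
    }

With ~96 spill files per mapper and a merge factor of 10, two merge passes are
needed, so every byte is written roughly three times: once at spill time and
once per merge pass. That matches the ~900GB in the map column. Under this
model, raising io.sort.mb (fewer spills) or io.sort.factor (fewer passes)
should bring FILE_BYTES_WRITTEN back down toward the map output size.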
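And on Amogh's question: rather than reading the values off the JobTracker
page, the counters can also be fetched from the driver once the job finishes.
A sketch against the newer org.apache.hadoop.mapreduce API (the thread
predates that API being standard, and the PrintMapCounters class is
illustrative only):

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class PrintMapCounters {
        // Call after job.waitForCompletion(true) in the driver.
        static void print(Job job) throws IOException {
            Counters c = job.getCounters();
            System.out.println("Map input records: "
                    + c.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue());
            System.out.println("Map output bytes:  "
                    + c.findCounter(TaskCounter.MAP_OUTPUT_BYTES).getValue());
            System.out.println("Spilled records:   "
                    + c.findCounter(TaskCounter.SPILLED_RECORDS).getValue());
        }
    }

If Spilled Records comes out near 3x the map output records, that should be
the multi-pass merge at work, since each spill and re-merge of a record
increments the counter again.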