Counting records
Hi,

I am a complete noob with Hadoop and MapReduce and I have a question that is probably silly, but I still don't know the answer. For the purposes of discussion I'll assume that I'm using a standard TextInputFormat. (I don't think that this changes things too much.)

To simplify (a fair bit): I want to count all the records that meet specific criteria. I would like to use MapReduce because I anticipate large sources and I want the performance and reliability that MapReduce offers.

The obvious and simple approach is to have my Mapper check whether each record meets the criteria and emit a 0 or a 1, then use a combiner that accumulates (like a LongSumReducer) and use that as the reducer as well. I am sure that would work fine. However, it seems massive overkill to have all those 1s and 0s emitted and stored on disc.

It seems tempting to have the Mapper accumulate the count for all of the records that it sees and then emit the total just once at the end. This seems simple enough, except that the Mapper doesn't seem to have any easy way to know when it is presented with the last record.

I could have the Mapper keep a copy of the OutputCollector passed in for each record and then do a single emit in its close() method. Although this looks like it would work with the current implementation, there seems to be no guarantee that the collector is still valid by the time close() is called. This just seems ugly.

Or I could have the Mapper record the first offset that it sees, read the split length using reporter.getInputSplit().getLength(), and then monitor how far through the split it is, so that it can detect the last record. It looks like the MapRunner class creates one Mapper object and uses it to process a whole split, so it appears safe to store state in the mapper class between invocations of the map() method. (But is this just an implementation artefact? Is the mapper class supposed to be completely stateless?)

Maybe I should write a custom InputFormat class and have it flag the last record by placing some extra information in the key? (Assuming that the InputFormat has enough information from the split to be able to detect the last record, which seems reasonable enough.)

Is there some blessed way to do this? Or am I barking up the wrong tree, and should I really just generate all those 1s and 0s and accept the overhead?

Regards,

Peter Marron
Trillium Software UK Limited
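For concreteness, the "0s and 1s" scheme described in the question can be sketched in plain Java. This only simulates the two phases locally; the matching criterion and all names are made up for illustration, and in a real job every emitted 0/1 would be materialized between the map and reduce phases, which is exactly the overhead being questioned.

```java
import java.util.ArrayList;
import java.util.List;

// Local simulation of the naive approach: map each record to 0 or 1, then
// sum the emitted values LongSumReducer-style.
class NaiveCount {
    // Stand-in for the criteria check done in the Mapper.
    static long mapRecord(String record) {
        return record.contains("ERROR") ? 1L : 0L;   // hypothetical criterion
    }

    // Stand-in for LongSumReducer: add up everything that was emitted.
    static long reduce(List<Long> emitted) {
        long total = 0;
        for (long v : emitted) total += v;
        return total;
    }

    static long countMatching(List<String> records) {
        List<Long> emitted = new ArrayList<>();
        for (String r : records) emitted.add(mapRecord(r));  // one emit per record
        return reduce(emitted);
    }
}
```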
Re: Counting records
Hi,

an additional idea is to use the counter API inside the framework. http://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/ has a good example.

Kai

On 23.07.2012, at 16:25, Peter Marron wrote:

> I am a complete noob with Hadoop and MapReduce and I have a question that is probably silly, but I still don't know the answer. To simplify (a fair bit) I want to count all the records that meet specific criteria. I would like to use MapReduce because I anticipate large sources and I want to get the performance and reliability that MapReduce offers.

--
Kai Voigt
k...@123.org
RE: Counting records
You could just use a counter and never emit anything from the map(). Call getCounter(MyRecords, RecordTypeToCount).increment(1) whenever you find the type of record you are looking for. Never call output.collect(). Configure the job with setNumReduceTasks(0). When the job finishes, you can programmatically get the values of all counters, including the one you increment in the map() method.

Dave Shine
Sr. Software Engineer
321.939.5093 direct | 407.314.0122 mobile
CI Boost(tm) Clients Outperform Online(tm)
www.ciboost.com

-----Original Message-----
From: Peter Marron [mailto:peter.mar...@trilliumsoftware.com]
Sent: Monday, July 23, 2012 10:25 AM
To: common-user@hadoop.apache.org
Subject: Counting records

> Hi, I am a complete noob with Hadoop and MapReduce and I have a question that is probably silly, but I still don't know the answer. [...]

The information contained in this email message is considered confidential and proprietary to the sender and is intended solely for review and use by the named recipient. Any unauthorized review, use or distribution is strictly prohibited. If you have received this message in error, please advise the sender by reply email and delete the message.
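The counter approach can be sketched in plain Java. This only mimics the framework's counter behaviour (no Hadoop dependencies); the record criterion and all group/counter names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the counter approach: the map phase never collects any output,
// it only bumps a named counter, and the client reads the total afterwards.
class CounterSketch {
    final Map<String, Long> counters = new HashMap<>();

    void increment(String group, String name, long by) {
        counters.merge(group + ":" + name, by, Long::sum);
    }

    long getCounter(String group, String name) {
        return counters.getOrDefault(group + ":" + name, 0L);
    }

    // Stand-in for the map() method: count matches, never "collect" anything.
    void map(Iterable<String> split) {
        for (String record : split) {
            if (record.contains("ERROR"))    // hypothetical criterion
                increment("MyRecords", "RecordTypeToCount", 1);
            // no output.collect() call, so nothing is written to disk
        }
    }
}
```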
Re: Counting records
Look at using a dynamic counter. You don't need to set up or declare an enum. The only caveat is that counters are passed back to the JobTracker by each task and are stored in memory.

On Jul 23, 2012, at 9:32 AM, Kai Voigt wrote:

> http://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/
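What "dynamic" means in practice can be shown with a small plain-Java sketch (not Hadoop code; the old mapred Reporter offers both incrCounter(Enum, long) and incrCounter(String, String, long)): with an enum the counter names are fixed at compile time, while a dynamic counter creates a new entry for every distinct group/name string — which is exactly why a very large number of counters costs JobTracker memory.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of enum-based vs "dynamic" counters. Every distinct
// group:name pair becomes one in-memory entry.
class DynamicCounters {
    enum MyRecords { MATCHED }        // enum style: names declared up front

    final Map<String, Long> byName = new HashMap<>();

    // enum style: key derived from the enum constant
    void incr(Enum<?> key, long by) {
        byName.merge(key.getDeclaringClass().getSimpleName() + ":" + key.name(),
                     by, Long::sum);
    }

    // dynamic style: any strings, nothing declared in advance
    void incr(String group, String name, long by) {
        byName.merge(group + ":" + name, by, Long::sum);
    }

    long get(String group, String name) {
        return byName.getOrDefault(group + ":" + name, 0L);
    }

    int distinctCounters() { return byName.size(); }  // entries held in memory
}
```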
RE: Counting records
Yeah, I thought about using counters, but I was worried about what happens if a Mapper task fails. Does the counter get adjusted to remove any contributions that the failed Mapper made before a replacement Mapper is started? Otherwise, in the case of any Mapper failure, I'm going to get an overcount, am I not? Or is there some way to make sure that counters have the correct semantics in the face of failures?

Peter Marron

-----Original Message-----
From: Dave Shine [mailto:Dave.Shine@channelintelligence.com]
Sent: 23 July 2012 15:35
To: common-user@hadoop.apache.org
Subject: RE: Counting records

> You could just use a counter and never emit anything from the Map(). Use the getCounter(MyRecords, RecordTypeToCount).increment(1) whenever you find the type of record you are looking for. Never call output.collect(). Call the job with reduceTasks(0). When the job finishes, you can programmatically get the values of all counters including the one you create in the Map() method. [...]
Re: Counting records
If the task fails, the counters for that task attempt are not used. So if you have speculative execution turned on and the JobTracker kills a task, it won't affect your end results. Again, the only major caveat is that the counters are held in memory, so be careful if you have a lot of counters...

On Jul 23, 2012, at 4:52 PM, Peter Marron wrote:

> Yeah, I thought about using counters but I was worried about what happens if a Mapper task fails. Does the counter get adjusted to remove any contributions that the failed Mapper made before another replacement Mapper is started? Otherwise in the case of any Mapper failure I'm going to get an overcount am I not? Or is there some way to make sure that counters have the correct semantics in the face of failures? [...]
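The retry semantics described in that last answer can be illustrated with a small plain-Java model (not Hadoop code; names and structure are illustrative): counters are reported per task attempt, and the job-level total only folds in the one successful attempt of each task, so failed or speculatively killed attempts contribute nothing and a retried task is never double-counted.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of counter aggregation across task attempts: failed attempts
// report counts too, but only the successful attempt of each task is folded
// into the job total.
class CounterAggregation {
    static class Attempt {
        final String taskId;
        final boolean succeeded;
        final long count;
        Attempt(String taskId, boolean succeeded, long count) {
            this.taskId = taskId;
            this.succeeded = succeeded;
            this.count = count;
        }
    }

    // Job-level total: one successful attempt per task; failures are ignored.
    static long jobTotal(List<Attempt> attempts) {
        Map<String, Long> perTask = new HashMap<>();
        for (Attempt a : attempts) {
            if (a.succeeded) perTask.put(a.taskId, a.count);
        }
        long total = 0;
        for (long c : perTask.values()) total += c;
        return total;
    }
}
```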