I have a use case where I push data into my HTable in waves, and each wave is followed by mapper-only processing. Currently, as soon as a row is processed in map, I mark it as processed by executing a table.put(isprocessed=true) inside the map. I am not sure whether modifying the table like this is a good idea, and I am also concerned that I am modifying the same table that the MR job is running on. So I am considering another approach: accumulate the processed rows in a list (or a more compact data structure) and use the cleanup method of the MR job to execute all the table.put(isprocessed=true) calls at once. What is the suggested best practice?
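To make the second approach concrete, here is a minimal sketch of the buffering pattern in plain Java. It is an illustration under assumptions, not the actual job: ProcessedRowBuffer and its method names are hypothetical, row keys are plain strings, and the sink callback stands in for the real write, which in the actual mapper would presumably be a batched put such as table.put(List<Put>). It also flushes every batchSize rows instead of holding everything until cleanup, so the buffer stays bounded for large splits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: collects processed row keys and hands them to a sink in batches.
class ProcessedRowBuffer {
    private final List<String> pending = new ArrayList<>();
    private final int batchSize;
    // Stand-in for the real write; in the mapper this would build Puts and call the table.
    private final Consumer<List<String>> sink;

    ProcessedRowBuffer(int batchSize, Consumer<List<String>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Would be called from map() once a row has been processed.
    void markProcessed(String rowKey) {
        pending.add(rowKey);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Would be called from cleanup() to write whatever is still buffered.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Pass a copy so that clearing the buffer does not affect the sink's batch.
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }

    int pendingCount() {
        return pending.size();
    }
}
```

In the mapper, the buffer would be created in setup(), markProcessed() called at the end of map(), and flush() called from cleanup(). One thing I am unsure about with this pattern is that rows still in the buffer are not marked if the task fails before the flush, so they would be processed again on retry.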
- R