Thanks JM. I am not so concerned about holding those rows in memory because they are mostly ordered integers and I would be using a bitset, so I have some leeway in that sense. My dilemma was between (1) updating instantly within the map, and (2) bulk updating at the end of the map. Yes, I do understand the drawback with (2) if the map crashes. I am ready to incur that penalty if it avoids any inconsistent behaviour on HBase.
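To make (2) concrete, below is roughly what I have in mind — an untested sketch against the 0.94-era client API. The table name, column family and qualifier ("mytable", "cf", "isprocessed") are placeholders for my actual schema, and the List here stands in for the bitset I mentioned.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

// Option (2): remember the processed row keys during map() and write all
// the isprocessed=true puts as one batch from cleanup().
public class MarkProcessedMapper extends TableMapper<ImmutableBytesWritable, Result> {

  // in practice this would be the bitset over ordered integer row keys
  private final List<byte[]> processedRows = new ArrayList<byte[]>();

  @Override
  protected void map(ImmutableBytesWritable key, Result value, Context context)
      throws IOException, InterruptedException {
    // ... actual per-row processing goes here ...
    processedRows.add(key.copyBytes()); // buffer instead of putting immediately
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    HTable table = new HTable(context.getConfiguration(), "mytable"); // placeholder name
    try {
      List<Put> puts = new ArrayList<Put>(processedRows.size());
      for (byte[] row : processedRows) {
        Put put = new Put(row);
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("isprocessed"), Bytes.toBytes(true));
        puts.add(put);
      }
      table.put(puts); // one batched write at the end of the task
    } finally {
      table.close();
    }
  }
}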
- R

On Sat, Jun 22, 2013 at 12:16 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:
> Hi Rohit,
>
> The list is a bad idea. When you have millions of lines per region,
> are you going to put millions of them in memory in your list?
>
> Your MR will scan the entire table, row by row. If you modify the
> current row, when the scanner looks for the next one, it will not
> look at the current one again. So there is no real issue with that.
>
> Also, instead of doing puts one by one, I would recommend you buffer
> them (say, 100 at a time) and put them as a batch. Don't forget to
> push the remaining puts at the end of the job. The drawback is that
> if the MR crashes, you will have some rows already processed but not
> marked as processed...
>
> JM
>
> 2013/6/22 Rohit Kelkar <rohitkel...@gmail.com>:
> > I have a use case where I push data into my HTable in waves,
> > followed by mapper-only processing. Currently, once a row is
> > processed in map, I immediately mark it as processed=true. To do
> > this, inside the map I execute a table.put(isprocessed=true). I am
> > not sure if modifying the table like this is a good idea. I am also
> > concerned that I am modifying the same table that I am running the
> > MR job on.
> > So I am thinking of another approach where I accumulate the
> > processed rows in a list (or a better, more compact data structure)
> > and use the cleanup method of the MR job to execute all the
> > table.put(isprocessed=true) calls at once.
> > What is the suggested best practice?
> >
> > - R
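P.S. For the archives, JM's 100-at-a-time suggestion would only change the buffering policy — something like the following (again an untested sketch, same imports and placeholder names as above):

// JM's variant: flush the buffer every 100 puts during map(), then push
// whatever is left from cleanup() so the tail of the buffer is not lost.
public class BatchedMarkMapper extends TableMapper<ImmutableBytesWritable, Result> {

  private static final int BATCH_SIZE = 100;
  private final List<Put> buffer = new ArrayList<Put>(BATCH_SIZE);
  private HTable table;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    table = new HTable(context.getConfiguration(), "mytable"); // placeholder name
  }

  @Override
  protected void map(ImmutableBytesWritable key, Result value, Context context)
      throws IOException, InterruptedException {
    // ... actual per-row processing goes here ...
    Put put = new Put(key.copyBytes());
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("isprocessed"), Bytes.toBytes(true));
    buffer.add(put);
    if (buffer.size() >= BATCH_SIZE) {
      table.put(buffer); // one round trip for the whole batch
      buffer.clear();
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    if (!buffer.isEmpty()) {
      table.put(buffer); // push the remaining puts
    }
    table.close();
  }
}

If I am reading the 0.94 client right, HTable can also do this buffering internally via table.setAutoFlush(false) and table.setWriteBufferSize(...), flushing on flushCommits()/close(), so the manual batch may not even be necessary.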