Use an external database (e.g., MySQL) or some other transactional bookkeeping system to record the state of each of your datasets (STAGING, UPLOADED, PROCESSED).
- Aaron

On Thu, Sep 17, 2009 at 7:17 PM, Huy Phan <dac...@gmail.com> wrote:
> Hi all,
>
> I have a question about the strategy for preparing data for Hadoop to run a
> MapReduce job. We have to (somehow) copy input files from our local
> filesystem to HDFS. How can we make sure that one input file is not
> processed twice in different executions of the same MapReduce job (let's say
> my MapReduce job runs once every 30 mins)?
> I don't want to delete my input files after finishing the MR job because I
> may want to re-use them later.
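A minimal sketch of the bookkeeping idea, using SQLite instead of MySQL for brevity. The table name, file names, and helper functions are all hypothetical; the point is just that each run asks the database which files are UPLOADED but not yet PROCESSED, so re-running the job every 30 minutes never consumes the same input twice, and the originals never have to be deleted:

```python
import sqlite3

def init_db(path=":memory:"):
    # One row per dataset; state moves STAGING -> UPLOADED -> PROCESSED.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS datasets ("
        "  name TEXT PRIMARY KEY,"
        "  state TEXT NOT NULL"
        "    CHECK (state IN ('STAGING','UPLOADED','PROCESSED'))"
        ")"
    )
    return conn

def register(conn, name):
    # A new input file has been copied into the local staging area.
    conn.execute(
        "INSERT OR IGNORE INTO datasets (name, state) VALUES (?, 'STAGING')",
        (name,))
    conn.commit()

def mark(conn, name, state):
    # Advance the dataset's state (e.g., after copying to HDFS, or after the job).
    conn.execute("UPDATE datasets SET state = ? WHERE name = ?", (state, name))
    conn.commit()

def pending_inputs(conn):
    # Files already in HDFS that no previous job execution has consumed.
    return [row[0] for row in
            conn.execute("SELECT name FROM datasets WHERE state = 'UPLOADED'")]

conn = init_db()
register(conn, "logs-2009-09-17.txt")          # staged locally
mark(conn, "logs-2009-09-17.txt", "UPLOADED")  # copied into HDFS
print(pending_inputs(conn))                    # eligible for this run
mark(conn, "logs-2009-09-17.txt", "PROCESSED") # job finished
print(pending_inputs(conn))                    # next run sees nothing to do
```

The file itself is never deleted, so it can still be re-used later; only its recorded state changes.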