Use an external database (e.g., MySQL) or some other transactional
bookkeeping system to record the state of all your datasets (STAGING,
UPLOADED, PROCESSED).
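
For concreteness, here is a minimal sketch of that bookkeeping in
Java/JDBC; the dataset_state table, the state names, and the
DatasetTracker class are all illustrative assumptions, not an existing
API:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper: tracks each input file's lifecycle
// (STAGING -> UPLOADED -> PROCESSED) in a MySQL table such as
//   CREATE TABLE dataset_state (
//     path  VARCHAR(512) PRIMARY KEY,
//     state VARCHAR(16) NOT NULL
//   );
public class DatasetTracker {
    private final Connection conn;

    public DatasetTracker(String jdbcUrl, String user, String pass)
            throws SQLException {
        conn = DriverManager.getConnection(jdbcUrl, user, pass);
        conn.setAutoCommit(false); // commit each state transition explicitly
    }

    // Register a new input file in STAGING; the primary key makes a
    // duplicate registration fail with an SQLException.
    public void stage(String path) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO dataset_state (path, state) VALUES (?, 'STAGING')")) {
            ps.setString(1, path);
            ps.executeUpdate();
        }
        conn.commit();
    }

    // Compare-and-set one state transition; returns false if the file
    // was not in the expected state, so only one run can claim it.
    public boolean advance(String path, String from, String to)
            throws SQLException {
        boolean claimed;
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE dataset_state SET state = ? WHERE path = ? AND state = ?")) {
            ps.setString(1, to);
            ps.setString(2, path);
            ps.setString(3, from);
            claimed = ps.executeUpdate() == 1;
        }
        conn.commit();
        return claimed;
    }
}

The conditional UPDATE in advance() is what prevents double processing:
if two executions race for the same file, only one UPDATE matches the
expected state and the other sees false.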

- Aaron


On Thu, Sep 17, 2009 at 7:17 PM, Huy Phan <dac...@gmail.com> wrote:

> Hi all,
>
> I have a question about the strategy for preparing data for a Hadoop
> MapReduce job: we have to (somehow) copy input files from our local
> filesystem to HDFS. How can we make sure that an input file is not
> processed twice by different executions of the same MapReduce job
> (say my MapReduce job runs once every 30 minutes)?
> I don't want to delete my input files after the MR job finishes,
> because I may want to reuse them later.
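
On the copy step itself, one pattern worth considering (a sketch using
the standard org.apache.hadoop.fs API; the class name and directory
layout below are made up) is to upload into a temporary HDFS directory
and rename into the job's input directory only after the copy
completes, since rename is atomic on the HDFS NameNode:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AtomicUpload {
    // Copy a local file into tmpDir, then rename it into inputDir so
    // the job never sees a half-written file.
    public static void upload(String local, String tmpDir, String inputDir)
            throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path tmp = new Path(tmpDir, new Path(local).getName());
        Path dst = new Path(inputDir, tmp.getName());
        fs.copyFromLocalFile(new Path(local), tmp); // may leave a partial file on failure
        if (!fs.rename(tmp, dst)) {                 // atomic on the NameNode
            throw new RuntimeException("rename failed: " + tmp + " -> " + dst);
        }
    }
}

Combined with a bookkeeping table like the one above, each file enters
the input directory exactly once, and its PROCESSED state keeps later
runs from re-reading it even though the file is never deleted.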
