This is exactly what we do as well.  We also have auto-detection of
modifications plus downstream reprocessing, so that back-filling for error
correction is possible (the errors can come from old processing code or from
file munging).
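
For concreteness, here is a minimal sketch of the kind of modification check
this implies -- not our actual pipeline code, and the paths, class name and
FileSystem usage are only illustrative: compare each raw file already copied
into HDFS against its local source, and report anything whose size or
timestamp no longer matches so the downstream output derived from it can be
re-run.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ModifiedLogDetector {

      /** Returns local source files whose copy in HDFS is stale or differs in size. */
      public static List<Path> findModified(FileSystem localFs, Path localDir,
                                            FileSystem hdfs, Path hdfsDir)
          throws IOException {
        List<Path> modified = new ArrayList<Path>();
        for (FileStatus src : localFs.listStatus(localDir)) {
          Path copied = new Path(hdfsDir, src.getPath().getName());
          if (!hdfs.exists(copied)) {
            continue; // never copied yet; the incremental copier will pick it up
          }
          FileStatus dst = hdfs.getFileStatus(copied);
          // A size mismatch, or a local timestamp newer than the HDFS copy,
          // means the source was rewritten (old processing code, file munging)
          // and the downstream partitions built from it should be back-filled.
          if (src.getLen() != dst.getLen()
              || src.getModificationTime() > dst.getModificationTime()) {
            modified.add(src.getPath());
          }
        }
        return modified;
      }

      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem localFs = FileSystem.getLocal(conf);
        FileSystem hdfs = FileSystem.get(conf);
        for (Path p : findModified(localFs, new Path(args[0]),
                                   hdfs, new Path(args[1]))) {
          System.out.println("needs back-fill: " + p);
        }
      }
    }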


On 2/28/08 6:06 PM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:

> We have had a lot of peace of mind by building a data pipeline that does
> not assume that hdfs is always up and running. If the application is
> primarily non real-time log processing - I would suggest
> batch/incremental copies of data to hdfs that can catch up automatically
> in case of failures/downtimes.
> 
> We have an rsync-like map-reduce job that monitors log directories and
> keeps pulling new data in (and I suspect a lot of other users do similar
> things as well). It might be a useful notion to generalize and put in
> contrib.
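
In the spirit of the rsync-like job described above, a rough sketch of the
catch-up copy step -- not the actual contrib candidate, and the directory
arguments are placeholders: copy into HDFS any local log file that is missing
or shorter than its source, so the pipeline catches up automatically after an
HDFS outage.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class IncrementalLogCopier {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem localFs = FileSystem.getLocal(conf);
        FileSystem hdfs = FileSystem.get(conf);

        Path logDir = new Path(args[0]);   // local log directory
        Path hdfsDir = new Path(args[1]);  // destination directory in HDFS

        for (FileStatus src : localFs.listStatus(logDir)) {
          if (src.isDir()) {
            continue; // flat directory assumed; recurse here if needed
          }
          Path dst = new Path(hdfsDir, src.getPath().getName());
          // Copy files HDFS has never seen, or whose copy is incomplete
          // (the log grew, or a previous copy was cut short by downtime).
          if (!hdfs.exists(dst)
              || hdfs.getFileStatus(dst).getLen() < src.getLen()) {
            hdfs.copyFromLocalFile(false /* keep source */, true /* overwrite */,
                                   src.getPath(), dst);
            System.out.println("copied " + src.getPath() + " -> " + dst);
          }
        }
      }
    }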
