This is exactly what we do as well. We also have auto-detection of modifications and of the downstream processing they affect, so that back-filling for error correction is possible (the errors can come from old processing code or file munging).
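
In case it helps, here is a rough sketch of what that modification detection can look like (plain Python with a simple size/mtime manifest; the paths and the manifest format are illustrative assumptions, not our actual code):

#!/usr/bin/env python
"""Sketch: detect source files that changed after they were processed,
so the affected downstream output can be back-filled.
Paths and the manifest format are illustrative assumptions."""
import json
import os

MANIFEST = "/var/lib/logship/processed_manifest.json"  # hypothetical path
LOG_DIR = "/var/log/app"                                # hypothetical path

def load_manifest(path):
    try:
        with open(path) as f:
            return json.load(f)          # {filename: [size, mtime]}
    except IOError:
        return {}

def files_needing_backfill(log_dir, manifest):
    """Return files that are new, or whose size/mtime no longer match
    what was recorded when they were last processed."""
    stale = []
    for name in os.listdir(log_dir):
        full = os.path.join(log_dir, name)
        st = os.stat(full)
        if manifest.get(name) != [st.st_size, int(st.st_mtime)]:
            stale.append(full)
    return stale

if __name__ == "__main__":
    manifest = load_manifest(MANIFEST)
    for path in files_needing_backfill(LOG_DIR, manifest):
        print(path)   # feed into the reprocessing / back-fill step

Anything this prints gets handed to the back-fill step, which reruns the downstream jobs for those inputs.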
On 2/28/08 6:06 PM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:

> We have had a lot of peace of mind by building a data pipeline that does
> not assume that hdfs is always up and running. If the application is
> primarily non-real-time log processing, I would suggest batch/incremental
> copies of data to hdfs that can catch up automatically in case of
> failures/downtimes.
>
> We have an rsync-like map-reduce job that monitors log directories and
> keeps pulling new data in (and I suspect a lot of other users do similar
> stuff as well). Might be a useful notion to generalize and put in contrib.
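
On the incremental-copy idea quoted above, a minimal sketch of the non-map-reduce version (plain Python driving the hadoop fs -put CLI; the paths and state file below are hypothetical). The point is that state only advances on a successful copy, so the job catches up on its own after an HDFS outage:

#!/usr/bin/env python
"""Sketch: rsync-like incremental shipper of local log files into HDFS.
State only advances on a successful copy, so the job catches up by itself
after HDFS downtime. Paths and the state-file format are assumptions."""
import os
import subprocess

LOG_DIR  = "/var/log/app"               # hypothetical local log directory
HDFS_DIR = "/logs/app"                  # hypothetical HDFS destination
STATE    = "/var/lib/logship/shipped"   # one shipped filename per line

def already_shipped(state_path):
    try:
        with open(state_path) as f:
            return set(line.strip() for line in f)
    except IOError:
        return set()

def ship_new_files(log_dir, hdfs_dir, state_path):
    shipped = already_shipped(state_path)
    for name in sorted(os.listdir(log_dir)):
        if name in shipped:
            continue
        src = os.path.join(log_dir, name)
        dst = hdfs_dir + "/" + name
        # hadoop fs -put returns non-zero if HDFS is down or the copy fails;
        # in that case we simply don't record the file and retry next run.
        rc = subprocess.call(["hadoop", "fs", "-put", src, dst])
        if rc == 0:
            with open(state_path, "a") as f:
                f.write(name + "\n")

if __name__ == "__main__":
    ship_new_files(LOG_DIR, HDFS_DIR, STATE)

Run from cron every few minutes; when HDFS is down nothing gets recorded, and the next run picks up everything that accumulated in the meantime.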