Hi Peter,

AFAIK Oozie has a mechanism to achieve this: a coordinator can trigger your jobs as soon as the files are written to a certain HDFS directory.
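A minimal coordinator sketch, assuming an hourly directory layout and assuming the writing process (or a wrapper around it) drops a _SUCCESS marker when it finishes; the app name, paths, dates, and frequency below are placeholders:

<coordinator-app name="log-pickup" frequency="${coord:hours(1)}"
                 start="2012-09-25T00:00Z" end="2013-09-25T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <dataset name="logs" frequency="${coord:hours(1)}"
             initial-instance="2012-09-25T00:00Z" timezone="UTC">
      <!-- One directory per hour; adjust to your layout. -->
      <uri-template>hdfs:///data/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
      <!-- The action is held until this marker file exists. -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="logs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/log-pickup-wf</app-path>
    </workflow>
  </action>
</coordinator-app>

The key piece is <done-flag>: Oozie materializes the action but holds it until that file appears in the dataset directory, so the workflow never starts against half-written input.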
On Tue, Sep 25, 2012 at 10:23 PM, Peter Sheridan <psheri...@millennialmedia.com> wrote:

> These are log files being deposited by other processes, which we may not
> have control over.
>
> We don't want multiple processes to write to the same files — we just
> don't want to start our jobs until they have been completely written.
>
> Sorry for lack of clarity & thanks for the response.
>
> --Pete
>
> From: Bertrand Dechoux <decho...@gmail.com>
> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Date: Tuesday, September 25, 2012 12:33 PM
> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: Re: Detect when file is not being written by another process
>
> Hi,
>
> Multiple files and aggregation, or something like HBase?
>
> Could you tell us more about your context? What are the volumes? Why do
> you want multiple processes to write to the same file?
>
> Regards
>
> Bertrand
>
> On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan <psheri...@millennialmedia.com> wrote:
>
>> Hi all.
>>
>> We're using Hadoop 1.0.3. We need to pick up a set of large (4+ GB)
>> files when they've finished being written to HDFS by a different process.
>> There doesn't appear to be an API specifically for this. We discovered
>> through experimentation that the FileSystem.append() method can be used
>> for this purpose — it will fail if another process is writing to the file.
>>
>> However, when running this on a multi-node cluster, using that API
>> actually corrupts the file. Perhaps this is a known issue? Looking at the
>> bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a
>> bunch of similar-sounding things.
>>
>> What's the right way to solve this problem? Thanks.
>>
>> --Pete
>
> --
> Bertrand Dechoux
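P.S. For anyone finding this in the archives: the append probe Peter describes looks roughly like the sketch below, written against the Hadoop 1.0 FileSystem API. Append is known to be unstable in the 1.0.x line (which is consistent with the corruption he saw and with HDFS-265), so treat this as an illustration of the trick rather than a recommendation:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenForWriteProbe {
    /**
     * Returns true if no other client currently holds the write lease on
     * the file: opening it for append fails while a writer is active.
     * Assumes dfs.support.append is enabled on the cluster (it is off by
     * default in 1.0.x). WARNING: on 1.0.x this probe can itself corrupt
     * the file, as reported in this thread.
     */
    static boolean isClosed(FileSystem fs, Path file) {
        try {
            // If another process is still writing, the namenode refuses
            // to grant us the lease and append() throws an IOException.
            fs.append(file).close();
            return true;
        } catch (IOException stillBeingWritten) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        System.out.println(isClosed(fs, new Path(args[0])));
    }
}

The safer patterns, if you can influence the writer at all: write to a temporary name and rename into place when finished (rename is atomic in HDFS), or drop a marker file like _SUCCESS. Failing that, wait until a file's length and modification time have stopped changing for some grace period.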