[ https://issues.apache.org/jira/browse/FLUME-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389178#comment-15389178 ]
Mike Percy commented on FLUME-2458:
-----------------------------------

[~jfield], to me this sounds like a bug in HDFS snapshots, or maybe a bug in distcp. In particular, if I take a snapshot and then copy the snapshotted data with distcp, I would expect the resulting state of the filesystem to be isolated from renames that occur after the snapshot was taken. It sounds like there are holes in that isolation. I would also expect a consistent snapshot across the whole filesystem. Admittedly, I am not really familiar with HDFS snapshot semantics or internals, or with how distcp interacts with those snapshots. Would you agree? Is this a bug in one of those systems?

> Separate hdfs tmp directory for flume hdfs sink
> -----------------------------------------------
>
>                 Key: FLUME-2458
>                 URL: https://issues.apache.org/jira/browse/FLUME-2458
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>    Affects Versions: v1.5.0.1
>            Reporter: Sverre Bakke
>            Assignee: Neerja Khattar
>            Priority: Minor
>         Attachments: FLUME-2458.patch, patch-2458.txt
>
>
> The current HDFS sink writes temporary files to the same directory in which
> the final file will be stored. This is a problem for several reasons:
> 1) File moving
> When MapReduce fetches a list of files to process and some of those files are
> gone by the time it reads them (i.e. they were renamed from .tmp to whatever
> final name they are supposed to have), the MapReduce job will crash.
> 2) File type
> When MapReduce decides how to process a file, it looks at the file's
> extension. If the file is compressed, it will decompress it for you. If
> the file has a .tmp file extension (in the same folder), it will treat a
> compressed file as an uncompressed file, thus breaking the results of the
> MapReduce job.
> I propose that the sink get an optional tmp path for storing these files to
> avoid these issues.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)