[ https://issues.apache.org/jira/browse/FLUME-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415442#comment-15415442 ]

Neerja Khattar commented on FLUME-2458:
---------------------------------------

[~mpercy] Yes, it will work, but it is too much manual work. Up to you now; I am OK with either the workaround or the patch.

If you look at this old comment, this is where it can fail, as far as I can recall. The logic only works halfway: it fails during the rename. In the code, the logic is to create only the first (main) directory, because previously the path was the same for the .tmp and final files. Now that the paths differ, the directory you write the .tmp file into will exist the first time, but when you rename, the second directory also needs to be created, so it crashes. See the errors in the namenode logs; the directory does not exist:

2015-06-03 09:09:23,908 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /flume/tmp/data/.1433347193172.tmp is closed by DFSClient_NONMAPREDUCE_695732204_38
2015-06-03 09:09:23,924 WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /flume/tmp/data/.1433347193172.tmp to /flume/data/.1433347193172 because destination's parent does not exist

streaming_vitals.sinks.streaming_vitals_hdfs_sink1.hdfs.path = /flume
streaming_vitals.sinks.streaming_vitals_hdfs_sink1.hdfs.filePrefix = data/
streaming_vitals.sinks.streaming_vitals_hdfs_sink1.hdfs.inUsePrefix = tmp/

Here /flume exists, but the data directory inside it does not; either we create it in advance every time, or the sink crashes.

> Separate hdfs tmp directory for flume hdfs sink
> -----------------------------------------------
>
> Key: FLUME-2458
> URL: https://issues.apache.org/jira/browse/FLUME-2458
> Project: Flume
> Issue Type: Improvement
> Components: Sinks+Sources
> Affects Versions: v1.5.0.1
> Reporter: Sverre Bakke
> Assignee: Neerja Khattar
> Priority: Minor
> Attachments: FLUME-2458.patch, patch-2458.txt
>
> The current HDFS sink will write temporary files to the same directory where the final file will be stored. This is a problem for several reasons:
>
> 1) File moving
> When MapReduce fetches a list of files to be processed and then processes files that are gone by that time (i.e. have been moved from .tmp to whatever final name they are supposed to have), the MapReduce job will crash.
>
> 2) File type
> When MapReduce decides how to process files, it looks at the file extension. If using compressed files, it will decompress them for you. If a file in the same folder has a .tmp extension, it will treat a compressed file as an uncompressed one, thus breaking the results of the MapReduce job.
>
> I propose that the sink gets an optional tmp path for storing these files to avoid these issues.
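For illustration, a minimal sketch of what the proposed configuration might look like once a separate tmp directory is supported. The property name hdfs.tmpDir below is hypothetical, not an existing Flume setting; it stands in for whatever name the patch introduces:

    # Hypothetical sketch only -- hdfs.tmpDir is NOT an existing Flume property.
    # Final files land under /flume/data; in-progress .tmp files are kept
    # in a separate /flume/tmp tree, so MapReduce jobs scanning /flume/data
    # never see half-written or .tmp-suffixed files.
    streaming_vitals.sinks.streaming_vitals_hdfs_sink1.hdfs.path = /flume/data
    streaming_vitals.sinks.streaming_vitals_hdfs_sink1.hdfs.tmpDir = /flume/tmp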
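For reference, here is a minimal sketch (not Flume's actual BucketWriter code; the class and method names are illustrative) of the kind of fix the rename failure above calls for: create the destination's parent directory before renaming, using the standard Hadoop FileSystem API.

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RenameSketch {
        // Rename a completed .tmp file to its final name, creating the
        // destination's parent directory first. Without the mkdirs() call,
        // renaming /flume/tmp/data/.1433347193172.tmp to
        // /flume/data/.1433347193172 fails exactly as in the namenode log
        // above, because /flume/data was never created.
        static void renameBucket(FileSystem fs, Path tmpPath, Path destPath)
                throws IOException {
            Path destParent = destPath.getParent();
            if (destParent != null && !fs.exists(destParent)) {
                fs.mkdirs(destParent); // succeeds even if the dir already exists
            }
            if (!fs.rename(tmpPath, destPath)) {
                throw new IOException("Unable to rename " + tmpPath
                        + " to " + destPath);
            }
        }
    }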
Neerja Khattar commented on FLUME-2458: --------------------------------------- [~mpercy] Yes, it will work but too much manual work. up to you now. I am ok with workaround or the patch If you see this old comment this is where it can fail and that I could recall. that logic works half way during renaming it fails as in code the logic is to create the first main directory only as previously the path was same for .tmp and non tmp files. now as path has changed so first time wherever u put .tmp that path will exist and when u rename u need second path to be created so it crashes see errors in namenode logs dir doesnt exist 2015-06-03 09:09:23,908 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /flume/tmp/data/.1433347193172.tmp is closed by DFSClient_NONMAPREDUCE_695732204_38 2015-06-03 09:09:23,924 WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /flume/tmp/data/.1433347193172.tmp to /flume/data/.1433347193172 because destination's parent does not exist streaming_vitals.sinks.streaming_vitals_hdfs_sink1.hdfs.path = /flume -( this exist) but /data inside this doesnt or either we create everytime in advance or it crashes streaming_vitals.sinks.streaming_vitals_hdfs_sink1.hdfs.filePrefix = data/ streaming_vitals.sinks.streaming_vitals_hdfs_sink1.hdfs.inUsePrefix=tmp/ > Separate hdfs tmp directory for flume hdfs sink > ----------------------------------------------- > > Key: FLUME-2458 > URL: https://issues.apache.org/jira/browse/FLUME-2458 > Project: Flume > Issue Type: Improvement > Components: Sinks+Sources > Affects Versions: v1.5.0.1 > Reporter: Sverre Bakke > Assignee: Neerja Khattar > Priority: Minor > Attachments: FLUME-2458.patch, patch-2458.txt > > > The current HDFS sink will write temporary files to the same directory as the > final file will be stored. This is a problem for several reasons: > 1) File moving > When mapreduce fetches a list of files to be processed and then processes > files that are then gone (i.e. are moved from .tmp to whatever final name it > is suppose to have), then the mapreduce job will crash. > 2) File type > When mapreduce decides how to process files, then it looks at files > extension. If using compressed files, then it will decompress it for you. If > the file has a .tmp file extension (in the same folder) then it will treat a > compressed file as an uncompressed files, thus breaking the results of the > mapreduce job. > I propose that the sink gets an optional tmp path for storing these files to > avoid these issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)