[ https://issues.apache.org/jira/browse/MAPREDUCE-5247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13689154#comment-13689154 ]
Devaraj K commented on MAPREDUCE-5247:
--------------------------------------

I don't think giving this responsibility to FileInputFormat is a good idea. FileInputFormat already provides extensibility to add new filters through the "mapred.input.pathFilter.class" configuration. Users who want to exclude specific files from the input dir for particular jobs can already do so with the current behavior.

> FileInputFormat should filter files with '._COPYING_' sufix
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-5247
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5247
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Stan Rosenberg
>
> FsShell copy/put creates staging files with a '._COPYING_' suffix. These files
> should be considered hidden by FileInputFormat. A simple fix is to add the
> following conjunct to the existing hiddenFilter:
> {code}
> !name.endsWith("._COPYING_")
> {code}
> After upgrading to CDH 4.2.0 we encountered this bug. We have a legacy data
> loader which uses 'hadoop fs -put' to load data into hourly partitions. We
> also have intra-hourly jobs which are scheduled to execute several times per
> hour using the same hourly partition as input. Thus, as new data is
> continuously loaded, these staging files (i.e., ._COPYING_) are breaking our
> jobs, since the staging files are moved once copy/put completes.
> As a workaround, we've defined a custom input path filter and loaded it with
> "mapred.input.pathFilter.class".
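
For reference, the custom filter that both the comment and the reporter's workaround rely on could look roughly like the sketch below. This is only an illustration, not the reporter's actual filter: the class name CopyingSuffixFilter is hypothetical, and the rule is written against plain file names so that it runs without Hadoop on the classpath. In a real job the same check would sit in a class implementing org.apache.hadoop.fs.PathFilter and be registered through "mapred.input.pathFilter.class".

```java
// Sketch only: the filtering rule is kept free of Hadoop imports so it runs
// with a plain JDK. In a real job this logic would live in a class that
// implements org.apache.hadoop.fs.PathFilter, roughly:
//     public boolean accept(Path path) { return accept(path.getName()); }
final class CopyingSuffixFilter {   // hypothetical class name

    // Suffix that FsShell copy/put gives its staging files (from the issue).
    static final String STAGING_SUFFIX = "._COPYING_";

    // Returns true if a file with this name should be read as job input.
    static boolean accept(String name) {
        return !name.endsWith(STAGING_SUFFIX);
    }

    public static void main(String[] args) {
        String[] names = {"events.log", "events.log._COPYING_"};
        for (String name : names) {
            System.out.println(name + " -> " + accept(name));
        }
    }
}
```

Only the in-flight staging files are excluded; once copy/put finishes and the file is moved to its final name, it is accepted like any other input file.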