[ 
https://issues.apache.org/jira/browse/ORC-162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149728#comment-16149728
 ] 

ASF GitHub Bot commented on ORC-162:
------------------------------------

Github user prasanthj commented on a diff in the pull request:

    https://github.com/apache/orc/pull/163#discussion_r136466436
  
    --- Diff: java/mapreduce/src/java/org/apache/orc/mapred/OrcInputFormat.java ---
    @@ -151,4 +153,26 @@ public static void setSearchArgument(Configuration conf,
         return new OrcMapredRecordReader<>(file, buildOptions(conf,
             file, split.getStart(), split.getLength()));
       }
    +
    +  /**
    +   * Filter out the 0 byte files, so that we don't generate splits for the
    +   * empty ORC files.
    +   * @param job the job configuration
    +   * @return a list of files that need to be read
    +   * @throws IOException
    +   */
    +  protected FileStatus[] listStatus(JobConf job) throws IOException {
    +    FileStatus[] result = super.listStatus(job);
    +    List<FileStatus> ok = new ArrayList<>(result.length);
    --- End diff --
    
    Instead of checking this after retrieving all of the FileStatus objects,
    it would be better to pass a PathFilter to listStatus() so that only
    non-zero-length files are returned. Fetching thousands of zero-length
    files and then filtering them out here seems wasteful.
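
    The diff above is cut off before the filtering loop, so the following is
    only a minimal sketch of the "list first, then drop zero-length entries"
    approach under discussion, not the code from the pull request. To keep it
    runnable without Hadoop on the classpath, FileEntry is a hypothetical
    stand-in for Hadoop's FileStatus that exposes only a path and getLen();
    the class name, method name, and sample file names are likewise
    assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: FileEntry is a hypothetical stand-in for Hadoop's FileStatus,
// so the example compiles and runs without any Hadoop dependency.
class ZeroByteFilterSketch {

  /** Hypothetical stand-in that carries just a path and a length. */
  static final class FileEntry {
    private final String path;
    private final long len;

    FileEntry(String path, long len) {
      this.path = path;
      this.len = len;
    }

    String getPath() {
      return path;
    }

    // Plays the role of FileStatus.getLen(): the file length in bytes.
    long getLen() {
      return len;
    }
  }

  /** Returns the entries whose length is non-zero, preserving their order. */
  static List<FileEntry> dropEmpty(List<FileEntry> listed) {
    List<FileEntry> ok = new ArrayList<>(listed.size());
    for (FileEntry entry : listed) {
      if (entry.getLen() != 0) {
        ok.add(entry);
      }
    }
    return ok;
  }

  public static void main(String[] args) {
    // Made-up listing: two empty bucket files and one file with data.
    List<FileEntry> listed = new ArrayList<>();
    listed.add(new FileEntry("bucket_00000", 0L));
    listed.add(new FileEntry("bucket_00001", 4096L));
    listed.add(new FileEntry("bucket_00002", 0L));

    for (FileEntry entry : dropEmpty(listed)) {
      System.out.println(entry.getPath() + " " + entry.getLen());
    }
  }
}
```

    The sketch makes the cost raised in the review visible: every entry,
    empty or not, is materialised before any of them is dropped. Whether a
    PathFilter avoids that cost depends on how the length is obtained, since
    PathFilter.accept() receives only a Path.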


> Handle 0 byte files as empty ORC files
> --------------------------------------
>
>                 Key: ORC-162
>                 URL: https://issues.apache.org/jira/browse/ORC-162
>             Project: ORC
>          Issue Type: Bug
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>
> Hive often creates empty files for empty buckets, which can introduce 
> significant load on the HDFS cluster. To reduce that load, the Hive 
> OrcOutputFormat and OrcInputFormat were changed to treat 0 byte files as a 
> special case of ORC file. The other readers need to handle such files as 
> empty ORC files as well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
