[ https://issues.apache.org/jira/browse/ORC-162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149740#comment-16149740 ]
ASF GitHub Bot commented on ORC-162:
------------------------------------
Github user prasanthj commented on a diff in the pull request:
https://github.com/apache/orc/pull/163#discussion_r136468354
--- Diff: java/mapreduce/src/java/org/apache/orc/mapred/OrcInputFormat.java ---
@@ -151,4 +153,26 @@ public static void setSearchArgument(Configuration conf,
return new OrcMapredRecordReader<>(file, buildOptions(conf,
file, split.getStart(), split.getLength()));
}
+
+ /**
+ * Filter out the 0 byte files, so that we don't generate splits for the
+ * empty ORC files.
+ * @param job the job configuration
+ * @return a list of files that need to be read
+ * @throws IOException
+ */
+ protected FileStatus[] listStatus(JobConf job) throws IOException {
+ FileStatus[] result = super.listStatus(job);
+ List<FileStatus> ok = new ArrayList<>(result.length);
--- End diff --
Makes sense. I just noticed that the filter gets applied after listStatus anyway.
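
The hunk quoted above stops right after the `ok` list is created, so the filtering loop itself is not shown. As a rough illustration of the idea only (not the code from the pull request), here is a minimal sketch that runs without Hadoop on the classpath. `Entry` is a hypothetical stand-in for `org.apache.hadoop.fs.FileStatus`, and `dropEmpty` is a hypothetical name. The sketch assumes that "empty" means a reported length of exactly 0 bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: models filtering 0 byte files out of a directory listing.
class EmptyFileFilter {

    // Hypothetical stand-in for Hadoop's FileStatus (path and length only).
    static final class Entry {
        private final String path;
        private final long length;

        Entry(String path, long length) {
            this.path = path;
            this.length = length;
        }

        public String getPath() { return path; }

        // Mirrors FileStatus.getLen(): the file length in bytes.
        public long getLen() { return length; }
    }

    // Returns the entries whose length is non-zero, in their original order.
    static Entry[] dropEmpty(Entry[] listed) {
        List<Entry> ok = new ArrayList<>(listed.length);
        for (Entry e : listed) {
            if (e.getLen() != 0) {
                ok.add(e);
            }
        }
        // If nothing was filtered out, hand back the original array.
        if (ok.size() == listed.length) {
            return listed;
        }
        return ok.toArray(new Entry[0]);
    }

    public static void main(String[] args) {
        Entry[] listed = {
            new Entry("bucket_00000", 0L),   // empty bucket file
            new Entry("bucket_00001", 4096L),
            new Entry("bucket_00002", 0L)    // empty bucket file
        };
        for (Entry e : dropEmpty(listed)) {
            System.out.println(e.getPath());
        }
    }
}
```

With a filter like this applied to the result of the listing step, no splits would be generated for the empty files, which is the behaviour the Javadoc in the diff describes.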
> Handle 0 byte files as empty ORC files
> --------------------------------------
>
> Key: ORC-162
> URL: https://issues.apache.org/jira/browse/ORC-162
> Project: ORC
> Issue Type: Bug
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
>
> Hive often creates empty files for empty buckets, which can introduce
> significant load on the HDFS cluster. Therefore, they made the Hive
> OrcOutputFormat and OrcInputFormat use 0 byte ORC files as a special case.
> We need to make the other readers treat them reasonably.