Hello gentlemen,

This is Shaofeng Shi from the Apache Kylin community. We use HBase as the storage engine, and we use an MR job to generate HFiles before bulk load. A user reported that, when S3 is configured as the output location for the HFiles, the files are generated in the "_temporary" folder and are never committed to the target path. As a result, no data is loaded in the end. We can reproduce this problem easily. The original report is in [1].
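
To illustrate the symptom, here is a minimal sketch in plain Java. It is not Hadoop or HBase code: the class name, the method names, the "hfile-0" file name and the exact attempt directory layout are assumptions made only for illustration. It models a writer that always places its file under "_temporary", and shows that the file only becomes visible under the output path if a separate commit step moves it there. When that step never happens, a reader of the output path finds nothing to load.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical model of the symptom; none of these names exist in Hadoop or HBase.
public class TemporaryFolderSketch {

    // Models a record writer that always writes under "_temporary",
    // whichever committer the job is configured with.
    static Path writeTaskOutput(Path outputDir, String fileName) throws IOException {
        Path workDir = outputDir.resolve("_temporary").resolve("0")
                .resolve("_temporary").resolve("attempt_0"); // assumed layout
        Files.createDirectories(workDir);
        return Files.write(workDir.resolve(fileName), new byte[] {1, 2, 3});
    }

    // Models the commit step that moves task output to its final location.
    static void commit(Path outputDir, Path taskFile) throws IOException {
        Files.move(taskFile, outputDir.resolve(taskFile.getFileName()));
    }

    // Lists the regular files directly under the output path,
    // which is what a bulk load would look for in this simplified model.
    static List<String> visibleFiles(Path outputDir) throws IOException {
        try (Stream<Path> entries = Files.list(outputDir)) {
            return entries.filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("hfile-out");
        Path taskFile = writeTaskOutput(out, "hfile-0");
        // Without the move, the output path contains only "_temporary".
        System.out.println("before commit: " + visibleFiles(out));
        commit(out, taskFile);
        System.out.println("after commit: " + visibleFiles(out));
    }
}
```

In this model, the listing is empty until commit() runs, which matches what the user sees on S3: the data exists, but only under "_temporary".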
Kylin uses HBase's HFileOutputFormat2.java to configure the MR job. After some investigation, I found that this class always uses the default "FileOutputCommitter" (see [2]), regardless of the job's configuration, so it always writes to the "_temporary" folder. Since AWS EMR is configured to use DirectOutputCommitter for S3, the problem occurs: Hadoop expects to see the files directly under the output path, while the RecordWriter generates them in the "_temporary" folder.

Have you received such a report before? I have a temporary fix in my fork now. I am just wondering what you think about it; if it is okay, I will file a JIRA.

Thanks!

[1] https://issues.apache.org/jira/browse/KYLIN-2788
[2] https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java#L193

--
Best regards,

Shaofeng Shi 史少锋