Hello gentlemen,

This is Shaofeng Shi from the Apache Kylin community. We use HBase as the
storage engine, and we use an MR job to generate HFiles before bulk load. We
received a user report that, when S3 is configured as the output location
for the HFiles, the files are generated in the "_temporary" folder and are
never committed to the target path. As a result, no data is loaded in the
end. We can reproduce this problem easily. The original report is in [1].

Kylin uses HBase's HFileOutputFormat2.java to configure the MR job. After
some investigation, I found that this class always uses the default
"FileOutputCommitter" (see [2]), regardless of the job's configuration, so
it always writes to the "_temporary" folder. Since AWS EMR is configured to
use DirectOutputCommitter for S3, the problem occurs: Hadoop expects to see
the files directly under the output path, while the RecordWriter generates
them in the "_temporary" folder.
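
To make the mismatch concrete, here is a small self-contained Java model. It
is only a sketch and not the actual HBase or Hadoop code: the Committer
interface, the STAGING and DIRECT instances, the two helper methods, and the
simplified "<output>/_temporary/<attempt>" layout are all made up for
illustration. It shows where a writer that hard-codes the staging behaviour
puts its files, compared with one that asks the configured committer.

```java
public class CommitterMismatchSketch {

    /** Hypothetical stand-in for an output committer: decides where a task writes. */
    interface Committer {
        String workPath(String outputDir, String attemptId);
    }

    /** Models a staging committer: tasks write under "_temporary"; a later commit moves the files. */
    static final Committer STAGING =
            (outputDir, attemptId) -> outputDir + "/_temporary/" + attemptId;

    /** Models a direct committer: tasks write straight into the output directory, nothing is moved. */
    static final Committer DIRECT = (outputDir, attemptId) -> outputDir;

    /** Behaviour described in this mail: the configured committer is ignored. */
    static String hardCodedWriterPath(String outputDir, String attemptId, Committer configured) {
        return STAGING.workPath(outputDir, attemptId);
    }

    /** One possible direction for a fix: derive the path from the configured committer. */
    static String configuredWriterPath(String outputDir, String attemptId, Committer configured) {
        return configured.workPath(outputDir, attemptId);
    }

    public static void main(String[] args) {
        String outputDir = "/data/hfiles";   // placeholder output location
        String attemptId = "attempt_0";      // placeholder task attempt id

        // With a direct committer configured, the job looks for files here:
        System.out.println("expected by job : " + DIRECT.workPath(outputDir, attemptId));
        // ...but the hard-coded writer puts them here, and nothing ever moves them:
        System.out.println("hard-coded      : " + hardCodedWriterPath(outputDir, attemptId, DIRECT));
        // A writer that respects the configuration agrees with the job:
        System.out.println("config-aware    : " + configuredWriterPath(outputDir, attemptId, DIRECT));
    }
}
```

With the staging committer configured, both helpers return the same path, which
would explain why the problem only shows up when a direct committer is used.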

Have you received such a report before? I have a temporary fix in my fork
now. I am just wondering what you think about it; if it is okay, I will file
a JIRA. Thanks!

[1] https://issues.apache.org/jira/browse/KYLIN-2788
[2] https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java#L193

-- 
Best regards,

Shaofeng Shi 史少锋
