Hi Benjamin,

Thanks a lot for reporting this! It makes sense from reading the posts. Could you open a JIRA? Are you interested in assigning it to yourself and contributing the fix?
Thanks a lot again!

-Yi

On Thu, Jun 16, 2016 at 9:52 AM, Benjamin Smith <ben.sm...@ranksoftwareinc.com> wrote:
>
> Hello,
>
> I am working on a project where we are integrating Samza and Hive. As part
> of this project, we ran into an issue where sequence files written from
> Samza were taking a long time (hours) to completely sync with HDFS.
>
> After some Googling and digging into the code, it appears that the issue
> is here:
>
> https://github.com/apache/samza/blob/master/samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/writer/SequenceFileHdfsWriter.scala#L111
>
> Writer.stream(dfs.create(path)) implies that the caller of
> dfs.create(path) is responsible for closing the created stream explicitly.
> This doesn't happen, and the SequenceFileHdfsWriter call to close will only
> flush the stream.
>
> I believe the correct line should be:
>
> Writer.file(path)
>
> Or, SequenceFileHdfsWriter should explicitly track and close the stream.
>
> Thanks!
>
> Ben
>
> Reference material:
>
> http://stackoverflow.com/questions/27916872/why-the-sequencefile-is-truncated
>
> https://apache.googlesource.com/hadoop-common/+/HADOOP-6685/src/java/org/apache/hadoop/io/SequenceFile.java#1238
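For readers following the thread: the ownership issue Ben describes can be sketched without any Hadoop dependency. The classes below are hypothetical stand-ins, not Samza or Hadoop code — `CallerStreamWriter` mimics a writer built with `Writer.stream(dfs.create(path))`, whose `close()` only flushes and leaves the caller-supplied stream open, and `TrackingOutputStream` just records whether `close()` ever reached the underlying stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class StreamOwnershipDemo {

    // Stand-in for a writer wrapping a caller-supplied stream:
    // close() only flushes; closing the stream remains the caller's job.
    static class CallerStreamWriter {
        private final OutputStream out;
        CallerStreamWriter(OutputStream out) { this.out = out; }
        void append(byte[] bytes) throws IOException { out.write(bytes); }
        void close() throws IOException { out.flush(); } // stream stays open
    }

    // Records whether close() was ever called on the wrapped stream.
    static class TrackingOutputStream extends FilterOutputStream {
        boolean closed = false;
        TrackingOutputStream(OutputStream inner) { super(inner); }
        @Override public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    public static void main(String[] args) throws IOException {
        TrackingOutputStream stream =
            new TrackingOutputStream(new ByteArrayOutputStream());
        CallerStreamWriter writer = new CallerStreamWriter(stream);

        writer.append("record".getBytes("UTF-8"));
        writer.close();
        // The writer flushed but never closed the caller-created stream,
        // which is why the files appeared to sync so slowly.
        System.out.println("closed after writer.close(): " + stream.closed);

        // The proposed fix: track the stream and close it explicitly
        // (or let the writer open and own the stream itself).
        stream.close();
        System.out.println("closed after explicit close(): " + stream.closed);
    }
}
```

Running this prints `false` after `writer.close()` and `true` only after the explicit `stream.close()`, which is the behavior gap the fix (either `Writer.file(path)` or explicit tracking) removes.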