Hi, I am using AbstractFileOutputOperator in my application for writing incoming tuples into a file on HDFS.
To handle failover scenarios, I am using fileOutputOperator.setMaxLength() to roll over files after a specified length, on the assumption that rolled-over files recover faster from a failure (since only the last part of the file needs to be recovered, not the entire file). My use case does not suggest a specific value for maxLength, so I would prefer the rolled-over file size to match the HDFS block size (say 64 MB).

With the current implementation of AbstractFileOutputOperator, the actual size of a rolled-over file ends up slightly greater than 64 MB. This is because the file is rolled over only after the incoming tuple has been written: the size check for the roll-over happens after the write. I believe files slightly larger than 64 MB would result in 2 block entries on the NameNode. This can be avoided by flipping the sequence: check whether adding the incoming tuple would exceed the limit and roll over to a new file *before* writing the tuple (a rough sketch is in the P.S. below).

Do you think this improvement should be considered? If yes, I will create a JIRA and work on it. Also, would this change break backward compatibility? The signature of the API remains the same, but there is a slight change in the semantics, so I wanted to get feedback from the community.

~ Yogi
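
P.S. A minimal sketch of the reordering I have in mind, assuming a simplified processTuple. The names used here (currentLength, rotate(), write()) are illustrative stand-ins, not the actual internals of AbstractFileOutputOperator:

import java.io.IOException;

public class RollBeforeWriteSketch {

  private long maxLength = 64L * 1024 * 1024; // e.g. one 64 MB HDFS block
  private long currentLength = 0;

  protected void processTuple(byte[] serialized) throws IOException {
    // Current behavior (roughly): write first, then roll over if the file has
    // grown past maxLength, so rolled files end up slightly larger than maxLength.
    //
    // Proposed behavior: roll over first if this tuple would push the file past
    // maxLength, then write, so each rolled part stays within maxLength.
    if (currentLength + serialized.length > maxLength) {
      rotate();                  // close the current part, open the next one
    }
    write(serialized);
    currentLength += serialized.length;
  }

  private void rotate() throws IOException {
    // close the current part file and reset the length counter
    currentLength = 0;
  }

  private void write(byte[] bytes) throws IOException {
    // append bytes to the current part file on HDFS
  }
}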
