Mike Percy created FLUME-1714:
---------------------------------
Summary: Improve handling of HDFS sink .tmp files after crash
Key: FLUME-1714
URL: https://issues.apache.org/jira/browse/FLUME-1714
Project: Flume
Issue Type: Improvement
Reporter: Mike Percy
Currently, the .tmp files left after a system or Flume client crash are never
cleaned up, and several users have noted that it would be better if Flume
itself took care of this.
This is actually a complicated issue, with multiple facets. These include:
# We would need to persist the in-progress filenames somewhere, probably on the
agent's local FS. This is not very hard.
# At startup, we would need to handle each leftover .tmp file in a way that
guarantees at least one of the following:
** Mark it as a potentially partial file somehow when renaming it from .tmp
** Ensure that the file format is valid before renaming it from .tmp
*** This second option is actually harder than it sounds, since arbitrary
serializers may be plugged in. With an XML serializer, for example, we would
need some way to programmatically read (deserialize) the file, throw away any
potentially unfinished records at the end (this is OK since the transaction
must not have been committed), then re-serialize the file with all the valid
records and correct opening/closing tags.
*** General deserialization / recovery APIs would need to be added to support
this, and they would need to be very carefully designed and implemented in
order to work. In the end, if this turns out to be as complex as it sounds,
it seems likely that most people would rely on out-of-the-box implementations
(supported file formats) to get this functionality, unless they are building on
top of abstract classes (e.g. for XML schema handling) to help accomplish this.
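
To make the first facet and the "mark as partial" option concrete, here is a
minimal sketch in Java. It is not existing Flume code: the class name
TmpFileJournal, the OPEN/CLOSE journal format, and the ".partial" marker suffix
are all assumptions made for illustration. It only tracks filenames on the
agent's local FS and computes the name a leftover file could be renamed to; the
actual HDFS rename is left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of Flume): local journal of in-progress .tmp files. */
class TmpFileJournal {
    private static final String TMP_SUFFIX = ".tmp";
    // Assumed marker for "potentially partial" files; any convention would do.
    private static final String PARTIAL_SUFFIX = ".partial";

    private final Path journal;

    TmpFileJournal(Path journal) {
        this.journal = journal;
    }

    /** Record that a .tmp file is about to be opened for writing. */
    void recordOpen(String tmpPath) throws IOException {
        append("OPEN " + tmpPath);
    }

    /** Record that a .tmp file was closed and renamed normally. */
    void recordClose(String tmpPath) throws IOException {
        append("CLOSE " + tmpPath);
    }

    private void append(String line) throws IOException {
        // SYNC asks the OS to flush each entry to disk before returning.
        Files.write(journal, (line + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND,
                StandardOpenOption.SYNC);
    }

    /** At startup: files that were opened but never closed, in open order. */
    List<String> findOrphans() throws IOException {
        Set<String> open = new LinkedHashSet<>();
        if (Files.exists(journal)) {
            for (String line : Files.readAllLines(journal, StandardCharsets.UTF_8)) {
                if (line.startsWith("OPEN ")) {
                    open.add(line.substring("OPEN ".length()));
                } else if (line.startsWith("CLOSE ")) {
                    open.remove(line.substring("CLOSE ".length()));
                }
            }
        }
        return new ArrayList<>(open);
    }

    /** Name for a leftover file: drop ".tmp" (if present) and add the partial marker. */
    static String recoveredName(String tmpPath) {
        String base = tmpPath.endsWith(TMP_SUFFIX)
                ? tmpPath.substring(0, tmpPath.length() - TMP_SUFFIX.length())
                : tmpPath;
        return base + PARTIAL_SUFFIX;
    }
}
```

Under these assumptions, the sink would call recordOpen() before creating a
.tmp file and recordClose() after the normal rename. At startup, each path
returned by findOrphans() would be renamed to recoveredName(path). The second
option (validating the file format before renaming) would still need the
serializer-specific recovery APIs described above.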
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira