Hi,

We have a recurring need for Flume deserializers that go beyond line or blob. Examples include XML deserialization where events are generated from XPath/XQuery expressions, and parsers for XLS, PDF, etc.
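
To make the XML case concrete, here is a minimal sketch of turning one document into several event bodies with an XPath expression. It uses only the JDK's javax.xml.xpath API and is not wired into Flume at all; the class name XPathSplitter, the method splitEvents and the sample document are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSplitter {

    // Hypothetical helper: returns one event body per node matched by the expression.
    public static List<String> splitEvents(String xml, String expression) throws Exception {
        // The whole document has to be parsed before any event can be produced.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);

        List<String> bodies = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            bodies.add(nodes.item(i).getTextContent());
        }
        return bodies;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<orders><order>a</order><order>b</order><order>c</order></orders>";
        // One input document, three event bodies.
        System.out.println(splitEvents(xml, "/orders/order"));
    }
}
```

The point of the sketch is only the 1 to N shape: a single input yields as many events as the expression matches.
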
There is no proper solution in Flume for these use cases. A significant number of our projects have required workarounds, such as an external preprocessing or postprocessing step.

We have explored the following solutions to the problem:

- Using BlobDeserializer and then an interceptor (1 to N events) to perform the transformation. This is currently not possible, since an interceptor must output 0 or 1 events for each input event. This was brought up on this mailing list a long time ago [1], but it seems no one came up with a viable solution.

- Implementing an EventDeserializer. We have done this in some cases, with varying degrees of success; for example, an XML deserializer with XPath [2]. The main limitation of this approach is the lack of a common method for position tracking at the deserializer level. Currently, Flume's core has a PositionTracker at the Source/InputStream level, which tracks the input offset. LineDeserializer and BlobDeserializer rely on the assumption that events can be mapped to an input offset (i.e. an event can be created by reading only from a given input offset). This assumption does not hold for more complex use cases (e.g. formats where events cannot be produced without first reading the file headers). This can be solved by using a second PositionTracker at the deserializer level. Here's a commit with a possible implementation of this approach [3].

Do you think this is a problem worth solving in Flume? If so, what would be the best approach?

[1] http://mail-archives.apache.org/mod_mbox/flume-dev/201208.mbox/%3CCABCB9rJ0-puRp1FfPfvyfO41wnMgUh=tifcpgufwxbnyv_p...@mail.gmail.com%3E
[2] https://github.com/Stratio/flume-ingestion/tree/develop/stratio-deserializers/stratio-xmlxpath-deserializer
[3] https://github.com/Stratio/flume/commit/a6fac7247b7fc48dec5dc3ab4c658ab4e5c0e753

Best,
-- 
Santiago M. Mola
<http://www.stratio.com/>
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // *@stratiobd <https://twitter.com/StratioBD>*
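
P.S. For anyone who wants the offset problem in concrete terms, below is a rough, self-contained sketch. It does not use Flume's classes: HeaderAwareReader and EventIndexTracker are hypothetical names, the CSV-with-header input stands in for any format whose records cannot be decoded without the header, and the tracker keeps its position in memory rather than on disk. It is not the implementation in the commit linked above, only an illustration of tracking an event index at the deserializer level instead of relying on a byte offset alone.

```java
public class HeaderAwareReader {

    /** Hypothetical stand-in for a deserializer-level position tracker. */
    public static class EventIndexTracker {
        private long committed = 0;
        public long getPosition() { return committed; }
        public void storePosition(long position) { committed = position; }
    }

    private final String[] lines;
    private final EventIndexTracker tracker;
    private String[] header;
    private int next; // index of the next record to emit, not a byte offset

    public HeaderAwareReader(String content, EventIndexTracker tracker) {
        this.lines = content.split("\n");
        this.tracker = tracker;
        reset();
    }

    /** Returns the next record as "column=value" pairs, or null at end of input. */
    public String readEvent() {
        if (1 + next >= lines.length) {
            return null;
        }
        String[] values = lines[1 + next].split(",");
        next++;
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < header.length && i < values.length; i++) {
            if (i > 0) body.append(';');
            body.append(header[i]).append('=').append(values[i]);
        }
        return body.toString();
    }

    /** Commits the current event index. */
    public void mark() {
        tracker.storePosition(next);
    }

    /** Re-reads the header, then resumes from the last committed event index. */
    public void reset() {
        header = lines[0].split(",");
        next = (int) tracker.getPosition();
    }

    public static void main(String[] args) {
        EventIndexTracker tracker = new EventIndexTracker();
        HeaderAwareReader reader = new HeaderAwareReader("id,name\n1,ann\n2,bob", tracker);
        System.out.println(reader.readEvent()); // first record
        reader.mark();                          // first record is now committed
        System.out.println(reader.readEvent()); // second record, not committed
        reader.reset();                         // simulate a rollback or restart
        System.out.println(reader.readEvent()); // second record again
    }
}
```

Resuming from a raw byte offset in the middle of this input would lose the column names; resuming from an event index lets the reader rebuild that state first and then skip what was already delivered.
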