Hi,

We have a recurring need for Flume deserializers that go beyond line or
blob. Some examples are XML deserialization, where events are generated
with XPath/XQuery expressions, and parsers for XLS, PDF, etc.

There is no proper solution in Flume for these use cases. A significant
number of our projects have required workarounds, such as an external
preprocessing or postprocessing step.

So we have explored the following solutions to the problem:

- Using BlobDeserializer and then an interceptor (1 to N events) to
perform the transformation. This is currently not possible, since an
interceptor must output 0 or 1 events for each input event. This was
brought up on this mailing list a long time ago [1], but it seems no one
came up with a viable solution.

- Implementing an EventDeserializer. We have done this in some cases with
varying degrees of success, for example with an XML deserializer based on
XPath [2]. The main limitation of this approach is the lack of a common
method for position tracking at the deserializer level. Currently, Flume's
core has a PositionTracker at the Source/InputStream level, which tracks
the input offset. LineDeserializer and BlobDeserializer rely on the
assumption that events can be mapped to an input offset (i.e. an event can
be created by reading only from a given input offset). This assumption
does not hold for more complex formats (e.g. those where no event can be
produced without first reading the file headers). This can be solved by
using a second PositionTracker
at the deserializer level. Here's a commit with a possible implementation
of this approach [3].
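
To make the second option more concrete, here is a minimal, self-contained
sketch of the idea. It is not the code from [3] and it does not use Flume's
classes: RecordPositionTracker and HeaderAwareDeserializer are hypothetical
names made up for illustration. The assumption is an input whose first line
is a header that every event depends on, so the tracked position is a
record count instead of a byte offset.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for a deserializer-level tracker. It stores a count
// of committed records rather than a byte offset. Not part of Flume's API.
class RecordPositionTracker {
    private long committed = 0;
    void store(long records) { committed = records; }
    long get() { return committed; }
}

// Hypothetical deserializer for input whose first line is a header that
// every event depends on, so no event can be built from an offset alone.
class HeaderAwareDeserializer {
    private final Iterator<String> lines;
    private final RecordPositionTracker tracker;
    private final String header;
    private long emitted = 0;

    HeaderAwareDeserializer(String input, RecordPositionTracker tracker) {
        this.lines = Arrays.asList(input.split("\n")).iterator();
        this.tracker = tracker;
        // The header is re-read on every (re)start.
        this.header = lines.hasNext() ? lines.next() : "";
        // Skip the records that were already committed before the restart.
        while (emitted < tracker.get() && lines.hasNext()) {
            lines.next();
            emitted++;
        }
    }

    // Returns the next event body, or null when the input is exhausted.
    String readEvent() {
        if (!lines.hasNext()) {
            return null;
        }
        emitted++;
        return header + " | " + lines.next();
    }

    // Commits everything read so far, in the spirit of
    // EventDeserializer.mark().
    void mark() {
        tracker.store(emitted);
    }
}

class DeserializerSketch {
    public static void main(String[] args) {
        String input = "id,name\n1,a\n2,b\n3,c";
        RecordPositionTracker tracker = new RecordPositionTracker();

        HeaderAwareDeserializer first =
            new HeaderAwareDeserializer(input, tracker);
        System.out.println(first.readEvent());
        System.out.println(first.readEvent());
        first.mark();                           // two records committed
        System.out.println(first.readEvent());  // read but never committed

        // Simulated restart: the header is re-read and two records are
        // skipped, so the uncommitted third record is delivered again.
        HeaderAwareDeserializer second =
            new HeaderAwareDeserializer(input, tracker);
        System.out.println(second.readEvent());
    }
}
```

Counting records keeps the sketch short, but resuming costs a re-read of
everything before the committed position. A real implementation would
probably persist the count next to the existing offset and would need to
decide what a "record" is for each format.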

Do you think this is a problem worth solving in Flume? If yes, what would
be the best approach?


[1]
http://mail-archives.apache.org/mod_mbox/flume-dev/201208.mbox/%3CCABCB9rJ0-puRp1FfPfvyfO41wnMgUh=tifcpgufwxbnyv_p...@mail.gmail.com%3E
[2]
https://github.com/Stratio/flume-ingestion/tree/develop/stratio-deserializers/stratio-xmlxpath-deserializer
[3]
https://github.com/Stratio/flume/commit/a6fac7247b7fc48dec5dc3ab4c658ab4e5c0e753

Best,
-- 

Santiago M. Mola


<http://www.stratio.com/>
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // *@stratiobd <https://twitter.com/StratioBD>*
