[ https://issues.apache.org/jira/browse/FLUME-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13780509#comment-13780509 ]
wolfgang hoschek commented on FLUME-1988:
-----------------------------------------

Splitting an input stream into events in a configurable and extensible way sounds like a good idea.

An alternative would be to address this problem (and many similar problems) by writing a MorphlineDeserializer that implements a java.io.InputStream on top of the SpoolingDirectorySource, and then have that MorphlineDeserializer feed the InputStream into a configurable morphline that contains a readMultiLine command. You can then easily replace readMultiLine with a command that splits on a character sequence, and so on. There are many other flavours of the same byte stream -> event splitting theme, and this way individual commands can be composed together in a morphline, which makes them more powerful, flexible and reusable.

http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html#readMultiLine

> Add Support for Additional Deserializers for SpoolingDirectorySource
> --------------------------------------------------------------------
>
>                 Key: FLUME-1988
>                 URL: https://issues.apache.org/jira/browse/FLUME-1988
>             Project: Flume
>          Issue Type: New Feature
>          Components: Docs, Sinks+Sources
>    Affects Versions: v1.4.0
>            Reporter: Israel Ekpo
>            Assignee: Israel Ekpo
>              Labels: serializers
>         Attachments: EventDeserializerType.java,
> RegexDelimiterDeSerializer.java, ResettableTestStringInputStream.java,
> TestRegexDelimiterDeSerializer.java
>
>
> There are certain use cases for SpoolingDirectorySource where the events in
> the log file are not delimited with newline characters.
> Certain log files that contain stack traces, XML documents and pretty-printed
> JSON strings contain multiple newline characters within each event.
> We can use alternative logic such as specific characters, strings or regular
> expressions to determine when the event is complete.
> Hence I am proposing the following new deserializers based on
> org.apache.flume.serialization.LineDeserializer
> # org.apache.flume.serialization.RegexDelimiterDeSerializer
> Allows the user to specify a regular expression that is a delimiter for
> events within the log file
> # org.apache.flume.serialization.CharSequenceDelimiterDeSerializer
> Allows the user to specify a comma-separated character sequence that is a
> delimiter for events within the log file
> The user will specify an integer for each ASCII character and we will use
> that sequence as the delimiter.
> For example, support for \r\n could be specified as 13,10
> A list of codes is available at http://www.asciitable.com/
> We will also need to update the user guide with examples on how to configure
> and specify a custom deserializer.
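
The two delimiter schemes in the description can be illustrated with a small standalone Java sketch. It is not the attached RegexDelimiterDeSerializer.java and does not use Flume's deserializer API: the class name DelimiterSketch and both method names are hypothetical, and it splits an in-memory string rather than a resettable input stream, which is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Flume.
final class DelimiterSketch {

    // Turn a comma-separated list of ASCII codes such as "13,10" into the
    // literal delimiter string those codes describe ("\r\n" in that case).
    static String delimiterFromCodes(String codes) {
        StringBuilder sb = new StringBuilder();
        for (String code : codes.split(",")) {
            int value = Integer.parseInt(code.trim());
            if (value < 0 || value > 127) {
                throw new IllegalArgumentException("Not an ASCII code: " + value);
            }
            sb.append((char) value);
        }
        return sb.toString();
    }

    // Split text into events wherever the regular expression matches.
    // The delimiter itself is discarded and empty events are skipped.
    static List<String> splitEvents(String text, String delimiterRegex) {
        List<String> events = new ArrayList<>();
        for (String part : Pattern.compile(delimiterRegex).split(text)) {
            if (!part.isEmpty()) {
                events.add(part);
            }
        }
        return events;
    }

    public static void main(String[] args) {
        // Character-sequence delimiter given as ASCII codes: 13,10 is \r\n.
        // Pattern.quote makes the sequence match literally.
        String crlf = delimiterFromCodes("13,10");
        String log = "event one\nstill event one\r\nevent two\r\n";
        System.out.println(splitEvents(log, Pattern.quote(crlf)).size());

        // Regex delimiter: a line consisting of three or more dashes.
        String dashed = "a\nb\n---\nc\n----\nd";
        System.out.println(splitEvents(dashed, "\\n-{3,}\\n").size());
    }
}
```

An event keeps any embedded newline characters here, which is the behaviour the multi-line use cases above ask for. A real deserializer would additionally need to track stream positions so that reading can resume after a restart.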