What about parser chaining? Should the original string be the one from
Kafka, or the last one parsed?


On May 10, 2019 at 19:03:39, Simon Elliston Ball (
si...@simonellistonball.com) wrote:

The only scenario I can think of where a parser might treat the original
string differently, or even need to know about it, would be different
encoding locales. For example, if the string were encoded in a locale
specific to the device, and the parser had to choose the encoding based on
metadata or parsed content, then that could merit pushing it down. The other
edge case might be binary data that does not convert to an original string
well (e.g. a netflow parser).

That said, that’s a highly unlikely edge case that could be handled by
workarounds.

I’m definitely +1 on Nick’s idea of pulling the original string up to the
runner. Right now we’re pretty inconsistent in how it’s done, so that would
help.
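
To illustrate the runner-level approach, here is a minimal sketch. It is
written in Python rather than Metron's Java and does not reflect the actual
ParserRunnerImpl code; the names run_parser, json_parser, and keep_original
are hypothetical, and the "original_string" field name is assumed. The
runner, not the parser, attaches the raw input to each message, and a single
switch turns that behavior off.

```python
import json
from typing import Callable, Dict, List

Message = Dict[str, object]
Parser = Callable[[bytes], List[Message]]


def run_parser(raw: bytes, parser: Parser, keep_original: bool = True,
               encoding: str = "utf-8") -> List[Message]:
    """Run a parser, then attach the untouched input to every message."""
    messages = parser(raw)
    if keep_original:
        # Decode once, in the runner, so every parser is treated the same.
        # errors="replace" keeps binary input from raising; it is lossy.
        original = raw.decode(encoding, errors="replace")
        for message in messages:
            message["original_string"] = original  # assumed field name
    return messages


def json_parser(raw: bytes) -> List[Message]:
    """Toy parser (hypothetical): one JSON object in, one message out."""
    return [json.loads(raw)]


raw = b'{"b": 1,   "a": 2}'
print(run_parser(raw, json_parser))
print(run_parser(raw, json_parser, keep_original=False))
```

With the switch on, the message carries the input exactly as received,
including its original spacing. With it off, nothing is appended, and a
parser with special needs could still add its own field under another name.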

Simon

On 10 May 2019, at 23:10, Nick Allen <n...@nickallen.org> wrote:

>> I suppose we could always allow this to be overridden, also.
>
> I like an on/off switch for the "original string" functionality. If on,
> you get the original string in pristine condition. If off, no original
> string is appended for those who care more about storage space.
>
> I can't think of a reason why one kind of parser would need a different
> original string mechanism than the others. If something like that does
> come up, the parser can create its own original string by just naming it
> something different and then turning "off" the switch that you described.
>
>
>
> On Fri, May 10, 2019 at 5:53 PM Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
>
>> I think that's an excellent idea. Can anyone think of a situation where we
>> wouldn't want to add this the same way for all parsers? I suppose we could
>> always allow this to be overridden, also.
>>
>>> On Fri, May 10, 2019 at 3:43 PM Nick Allen <n...@nickallen.org> wrote:
>>>
>>> I think maintaining the integrity of the original data makes a lot of sense
>>> for any parser. And ideally the original string should be what came out of
>>> Kafka with only the minimally necessary processing.
>>>
>>> With that in mind, we could solve this one level up. Instead of relying on
>>> each parser to do this right, we could have the ParserRunner, and
>>> specifically the ParserRunnerImpl, handle this roughly here [1].
>>> It has the raw message data and can append the original string to each
>>> message it gets back from the parsers.
>>>
>>> Just another approach to consider.
>>>
>>> --
>>> [1]
>>> https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
>>>
>>> On Fri, May 10, 2019 at 4:11 PM Otto Fowler <ottobackwa...@gmail.com>
>>> wrote:
>>>
>>>> +1
>>>>
>>>>
>>>> On May 10, 2019 at 13:57:55, Michael Miklavcic (
>>>> michael.miklav...@gmail.com)
>>>> wrote:
>>>>
>>>> When the capability for parsing messages with JSON Path expressions was
>>>> added to the JSONMapParser, the original behavior for managing original
>>>> strings was changed.
>>>>
>>>> https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
>>>>
>>>> A couple issues have been reported recently regarding this change:
>>>>
>>>> 1. We're losing the actual original string, which is a legal issue for
>>>> data lineage for some customers.
>>>> 2. Even for the degenerate case with no sub-messages created, the
>>>> original sub-message string is modified because of the
>>>> serialization/deserialization process with Jackson/JsonSimple. The fields
>>>> are reordered because the content is normalized.
>>>>
>>>> I looked at options for preserving formatting, but am unable to find a
>>>> method that allows you to parse and query the original message and
>>>> then also obtain the raw string matches without the normalization from
>>>> serialization/deserialization.
>>>>
>>>> I'd like to propose that we add a configuration option for this parser that
>>>> allows the user to toggle which approach they'd like to use. My personal
>>>> preference, based on feedback I've gotten from multiple customers, is that
>>>> the default should be the older approach, which takes the raw original
>>>> string. It's arguable that this change in contract is a regression, so the
>>>> default should be the earlier behavior. Any sub-messages would then have a
>>>> copy of that raw original string, not just the sub-message original string.
>>>> Enabling the flag would enable the current sub-message original string
>>>> functionality.
>>>>
>>>> Mike
>>>>
>>>
>>
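
The normalization described in point 2 of the quoted message can be
reproduced with almost any JSON library. The sketch below uses Python's
standard json module purely as a stand-in for Jackson/JsonSimple, which is an
assumption made for illustration. Python's module happens to preserve key
order, so only the whitespace changes here; other libraries may also reorder
fields.

```python
import json

# Raw input with irregular spacing, as a device might emit it.
raw = '{"b":1,   "a":  2}'

# Parse and re-serialize: the data survives, the exact text does not.
round_tripped = json.dumps(json.loads(raw))

print(round_tripped)
print(round_tripped == raw)                          # text differs
print(json.loads(round_tripped) == json.loads(raw))  # content is equal
```

Because the re-serialized text is not byte-identical to the input, the only
reliable way to keep the true original is to hold on to the raw string
before any parsing happens.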
