I think that's an excellent idea. Can anyone think of a situation where we wouldn't want to add this the same way for all parsers? I suppose we could always allow this to be overridden, also.
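To make the runner-level idea concrete, here is a rough sketch in plain Java. It does not use the actual Metron classes: `attachOriginal`, the `overwrite` flag, and the `original_string` field name are assumptions for illustration only, and messages are modeled as plain maps rather than whatever the parsers really return.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OriginalStringSketch {

    // Assumed field name; it is not taken from this thread.
    static final String ORIGINAL_STRING = "original_string";

    /**
     * Hypothetical runner-level step: stamp every message a parser returned
     * with the raw bytes that were read, decoded as UTF-8.
     * When overwrite is false, a value the parser already set is kept,
     * which is one way a parser could override the default behavior.
     */
    static List<Map<String, Object>> attachOriginal(byte[] rawMessage,
                                                    List<Map<String, Object>> parsed,
                                                    boolean overwrite) {
        String original = new String(rawMessage, StandardCharsets.UTF_8);
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> message : parsed) {
            // Copy so the parser's own map is left untouched.
            Map<String, Object> copy = new HashMap<>(message);
            if (overwrite) {
                copy.put(ORIGINAL_STRING, original);
            } else {
                copy.putIfAbsent(ORIGINAL_STRING, original);
            }
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] raw = "{\"b\":1,\"a\":2}".getBytes(StandardCharsets.UTF_8);

        // One message as a parser might return it, with no original string set.
        Map<String, Object> message = new HashMap<>();
        message.put("a", 2);
        message.put("b", 1);
        List<Map<String, Object>> parsed = new ArrayList<>();
        parsed.add(message);

        // The untouched raw string is attached, regardless of field ordering
        // inside the parsed map.
        System.out.println(attachOriginal(raw, parsed, true).get(0).get(ORIGINAL_STRING));
    }
}
```

With `overwrite` set to true, every message (including sub-messages) carries the same raw string. With it set to false, a parser that deliberately sets its own original string keeps it.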
On Fri, May 10, 2019 at 3:43 PM Nick Allen <n...@nickallen.org> wrote:

> I think maintaining the integrity of the original data makes a lot of sense
> for any parser. And ideally the original string should be what came out of
> Kafka with only the minimally necessary processing.
>
> With that in mind, we could solve this one level up. Instead of relying on
> each parser to do this right, we could have the ParserRunner and
> specifically the ParserRunnerImpl [1] handle this round-abouts here
> <https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158>
> [1].
> It has the raw message data and can append the original string to each
> message it gets back from the parsers.
>
> Just another approach to consider.
>
> --
> [1]
> https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
>
> On Fri, May 10, 2019 at 4:11 PM Otto Fowler <ottobackwa...@gmail.com>
> wrote:
>
> > +1
> >
> > On May 10, 2019 at 13:57:55, Michael Miklavcic
> > (michael.miklav...@gmail.com) wrote:
> >
> > When adding the capability for parsing messages in the JsonMapParser
> > using JSON Path expressions the original behavior for managing original
> > strings was changed.
> >
> > https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
> >
> > A couple issues have been reported recently regarding this change:
> >
> > 1. We're losing the actual original string, which is a legal issue for
> >    data lineage for some customers
> > 2. Even for the degenerate case with no sub-messages created, the
> >    original sub-message string is modified because of the
> >    serialization/deserialization process with Jackson/JsonSimple. The
> >    fields are reordered bc the content is normalized.
> >
> > I looked at options for preserving formatting, but am unable to find a
> > method that allows you to both parse, then query the original message
> > and then also obtain the raw string matches without the normalizing from
> > ser/deserialization.
> >
> > I'd like to propose that we add a configuration option for this parser
> > that allows the user to toggle which approach they'd like to use. My
> > personal preference based on feedback I've gotten from multiple
> > customers is that the default should be the older approach which takes
> > the raw original string. It's arguable that this change in contract is a
> > regression, so the default should be the earlier behavior. Any
> > sub-messages would then have a copy of that raw original string, not
> > just the sub-message original string. Enabling the flag would enable the
> > current sub-message original string functionality.
> >
> > Mike