So I've kept my ear to the ground regarding this topic for a while now, and
had some conversations a year or so ago about the idea as well.  At the
very least, I think having the concept of a pre-parser is a good one, if
not chaining an arbitrary number of parsers together.  I see this as an
important way to reduce the complexity of implementing new parsers and
getting more community involvement/contributions.

Syslog headers are a solid use case to start with because a lot of
implementations fail to implement them properly on the sending side, at
least in the real-world scenarios that I've seen.  Having a way to extend
the parser to easily handle incorrect implementations of syslog would be
great, but anything that can pre-parse or trim the syslog headers to make
parsing further along in the pipeline simpler would help.
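
To make the pre-parser idea concrete, here is a minimal sketch in Python
rather than anything Metron-specific.  Every name in it (SYSLOG_HEADER,
pre_parse_syslog, chain, the syslog_ field prefix) is hypothetical, and the
regex is a deliberately lenient guess at an RFC 3164-style header, not a
faithful implementation of the spec.  It trims the header if one is found,
hands the rest to whatever payload parser is configured, and merges the
header fields back in afterwards:

```python
import json
import re

# Hypothetical, lenient pattern for an RFC 3164-style header:
# optional <PRI>, then a "Mmm dd hh:mm:ss" timestamp, then a hostname.
SYSLOG_HEADER = re.compile(
    r"^(?:<(?P<pri>\d{1,3})>)?"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
)


def pre_parse_syslog(line):
    """Split a line into (header_fields, remaining_message)."""
    match = SYSLOG_HEADER.match(line)
    if match is None:
        # Header missing or unrecognisable: pass the whole line downstream.
        return {}, line
    header = {k: v for k, v in match.groupdict().items() if v is not None}
    return header, line[match.end():]


def parse_json_payload(message):
    """Stand-in for any downstream parser; here the payload is JSON."""
    return json.loads(message)


def chain(line, payload_parser):
    """Run the pre-parser, then the payload parser, then merge the results."""
    header, message = pre_parse_syslog(line)
    fields = payload_parser(message)
    # Prefix header fields so they cannot overwrite payload fields.
    fields.update({"syslog_" + k: v for k, v in header.items()})
    return fields


result = chain(
    '<34>Mar 20 18:14:05 fw01 {"action": "deny", "port": 22}',
    parse_json_payload,
)
print(result)
```

The point of the sketch is only that the payload parser never sees the
header, so a JSON or CSV parser would not need to know it arrived over
syslog.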

Another attractive idea would be the ability to do opportunistic parsing,
given an ordered list of parsers and some criteria for successful parsing
(which I admittedly am not sure how to define).  At least in my mind, this
would require similar logic to parser chaining.  In some highly
decentralized organizations this would be helpful as it takes the
configuration effort off of the team sending the logs (and thus makes them
more willing to send logs _at all_) and pushes it onto the team parsing
and/or storing them.
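
A rough sketch of what I mean by opportunistic parsing, again in plain
Python with hypothetical names throughout.  The success criterion here (the
result is a dict containing a "timestamp" key) is just a placeholder
assumption, since what the real criteria should be is exactly the open
question:

```python
import json


def parse_json(line):
    """Parse the line as a JSON document; raises ValueError on bad input."""
    return json.loads(line)


def parse_key_value(line):
    """Parse space-separated key=value tokens; raises ValueError otherwise."""
    return dict(token.split("=", 1) for token in line.split())


def looks_successful(fields, required=("timestamp",)):
    """Placeholder criterion: a dict that contains every required key."""
    return isinstance(fields, dict) and all(k in fields for k in required)


def parse_opportunistically(line, parsers, accept=looks_successful):
    """Try each (name, parser) in order; return the first accepted result."""
    for name, parser in parsers:
        try:
            fields = parser(line)
        except Exception:
            # This parser could not handle the line; try the next one.
            continue
        if accept(fields):
            return name, fields
    # No parser produced an acceptable result.
    return None, {}


# The order of this list is the priority order.
PARSERS = [("json", parse_json), ("kv", parse_key_value)]

print(parse_opportunistically('{"timestamp": 1521569645}', PARSERS))
print(parse_opportunistically("timestamp=1521569645 action=deny", PARSERS))
print(parse_opportunistically("garbage", PARSERS))
```

The loop itself is trivial; the hard part is an accept function that does
not let an overly permissive parser early in the list swallow messages
meant for a later one.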

I'm not suggesting we attempt to crack that second nut here, but I would
love to see that use case kept in mind during discussions.

TL;DR:  +1

Jon

On Tue, Mar 20, 2018 at 6:14 PM Otto Fowler <ottobackwa...@gmail.com> wrote:

> I think the chaining of parsers, or ability to compose parsers is a good
> idea, but with reference to the pr mentioned, I would have some number of
> StellarChainLinks as opposed to re-implementing stellar in chainlinks.
> Although it is NiFi-y.  But since I write Processors too, that is fine.
>
>
> On March 20, 2018 at 18:05:12, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> It seems like parser chaining is becoming a hot topic on the repo too with
> https://github.com/apache/metron/pull/969#partial-pull-merging
>
> I would like to discuss the option of configuring parsers to operate on the
> output of other parsers, and how we might architect it. This may also give
> us the opportunity to be more efficient in scenarios where people have
> large numbers of sources, and so use up a lot of slots for lower volume
> parsers for example.
>
> I have a bunch of ideas around this, but am more keen to hear what everyone
> else thinks at this stage. How should we go about fixing parser config so
> that it’s clearer (removing the need for people to reinvent the parser
> wheel as we’ve seen in a few places) and also more concise and powerful
> (consolidating the parsing of transports such as syslog and content such as
> application logs, or types of device logs).
>
> If this can lead to a more efficient way of handling both the syslog
> problem, and the kind of problem that leads to switching between grok
> statements in something like our ASA parser then all the better. I suspect
> that there might also be a case for multi-level chaining here too, since
> some things are embedded in multiple transports, or might have complex
> fields that want ‘sub-parsing’.
>
> Of course one of the key values of Metron is its speed, so maybe
> formalising some of the microbenchmarking approaches a few of us have been
> working on might help here too. I’ve got a few bits of micro-benching
> infrastructure around CEF and ASA, and I believe there’s also been some
> work to load and perf test things like enrichment that might be leveraged.
>
> Thoughts on a dev board?
>
> Simon
>
> > On 20 Mar 2018, at 21:47, Otto Fowler <ottobackwa...@gmail.com> wrote:
> >
> > I entered METRON-1453 <https://issues.apache.org/jira/browse/METRON-1453>
> > a little while ago while working on the PR#579
> > <https://github.com/apache/metron/pull/579>.
> >
> > "We have several parsers now, with many imaginable that are based on
> > syslog, where the format is SYSLOG HEADER MESSAGE.
> >
> > With message being in a different format. It would be great if we had a
> > way to generically handle syslog headers, such that ANY parser data could
> > come over syslog.
> >
> > Either you could have a custom parser, or configure CSV or JSON such that
> > they could be the payload, such that you can handle JSON over syslog by
> > configuration only."
> >
> > The idea would be that the parser bolt would use the configuration to
> > trigger parsing the incoming message as syslog formatted, and pass the
> > message part to the parser, and put the syslog parts in the message(s)
> > after parsing.
> >
> > As part of this I did some work on parsing syslog, using both grok and a
> > DSL that I did from the spec: https://github.com/ottobackwards/grok-v-antlr
> >
> > The DSL is slower, but grok cannot handle multiple structured data
> > entries, and the DSL can. I’m not good enough at grok to fix it so that it is
> > functionally equivalent. Another option would be to write a third parser…
> > It is also possible that the DSL could be improved for speed of course.
> >
> > Thoughts?
>
-- 

Jon
