Re: Architectural reason to split in 4 topologies / impact on the kafka ressources

2018-06-27 Thread Carolyn Duby
Another reason for keeping the original string is that you may not want to extract all
components of the original event into JSON.  If you look at Windows events, you
will want to keep the original event, but you will not want to extract
everything because the events are very verbose.

You should have a choice, per sensor type, of whether to include the
original string in the index or not.

Thanks  

Carolyn Duby
Solutions Engineer, Northeast
cd...@hortonworks.com
+1.508.965.0584


Re: Architectural reason to split in 4 topologies / impact on the kafka ressources

2018-06-25 Thread Simon Elliston Ball
The original string serves purposes well beyond debugging. Many users will
need to be able to prove provenance back to the raw logs in order to prove or
prosecute an attack from an internal threat, or to provide evidence to law
enforcement on an external threat. As such, the original string is
important.

It also provides a valuable source for free-text search where parsing
has not extracted all the necessary tokens for a hunt use case, so it can
be a valuable field to have in Elastic or Solr for text rather than keyword
indexing.

That said, it may make sense to remove a field that is this heavyweight to
process and store from the Lucene store. We have been talking for a
while about filtering some of the data out of the realtime index and
preserving full copies in the batch index, which could meet the forensic
use cases above and would make it a matter of user choice. That would
probably be configured through the indexing config to filter fields.

Simon
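
The field filtering described above could look roughly like the sketch below. It is a minimal illustration only: the config keys ("realtime", "batch", "exclude_fields"), the sensor name, and the function name are all hypothetical, not Metron's actual indexing config schema, and the field name original_string is assumed for the original string.

```python
import json

# Hypothetical per-sensor config: which fields to drop for each index writer.
# These key names are illustrative and are NOT Metron's real config schema.
INDEXING_CONFIG = {
    "windows_events": {
        "realtime": {"exclude_fields": ["original_string"]},
        "batch": {"exclude_fields": []},
    },
}


def filter_for_index(message, sensor, index, config=INDEXING_CONFIG):
    """Return a copy of message without the fields excluded for this index."""
    excluded = set(
        config.get(sensor, {}).get(index, {}).get("exclude_fields", [])
    )
    return {key: value for key, value in message.items() if key not in excluded}


message = {
    "source.type": "windows_events",
    "event_id": 4624,
    "original_string": "<Event>...very verbose raw event...</Event>",
}

# The realtime index gets the slimmed-down document; the batch index keeps
# the full copy, including the original string, for forensic use.
realtime_doc = filter_for_index(message, "windows_events", "realtime")
batch_doc = filter_for_index(message, "windows_events", "batch")

print(json.dumps(realtime_doc, sort_keys=True))
print(json.dumps(batch_doc, sort_keys=True))
```

A sensor with no entry in the config is passed through unfiltered, so keeping the original string stays the default and dropping it is an explicit per-sensor choice.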

-- 
--
simon 

Re: Architectural reason to split in 4 topologies / impact on the kafka ressources

2018-06-25 Thread Michel Sumbul
Depending on the source of data, it might be useful to bypass a step
that the user considers useless.
For example, you may have a source of data that doesn't need profiling but
that you still want ingested like the other sources, so that the SOC analysts
can use it in their analysis and have everything in the same place.

How can we bypass it for a specific sensor?



Re: Architectural reason to split in 4 topologies / impact on the kafka ressources

2018-06-25 Thread James Sirota
There is a way to wire the system to bypass enrichment and profiling, but you 
would then bypass a lot of key features of the system.  It would be unwise to 
do that. 


--- 
Thank you,

James Sirota
PMC- Apache Metron
jsirota AT apache DOT org



Re: Architectural reason to split in 4 topologies / impact on the kafka ressources

2018-06-25 Thread Michel Sumbul
Hi Casey,

That makes complete sense.
Short question: if there is no enrichment or no profiling, does the message
still pass through the enrichment/profiling topic?

If yes, do you think it's possible to imagine a way for messages that
don't need enrichment or profiling to skip the topic and go directly
to the next one? This is again to avoid in/out in Kafka.

Thanks for the explanation,
Michel



Re: Architectural reason to split in 4 topologies / impact on the kafka ressources

2018-06-22 Thread Casey Stella
Hey Michel,

Those are good questions and there were some reasons surrounding that.  In
fact, historically, we had fewer topologies (e.g. indexing and enrichment
were merged). Even earlier on, we had just one giant topology per parser
that enriched and indexed.  Long story short, we moved this way
because we saw how people were using Metron and we gained more insight
into tuning it.  That led us down this architectural path.

Some of the reasons that we went this way:

   - Fewer, larger topologies were a nightmare to tune
      - Enrichment would have different memory requirements than, say,
      parsers or indexing
      - You can adjust the Kafka topic params per topology to adjust the
      number of partitions, etc.
   - Having the separate topologies gives a natural set of extension points
   for customization and enhancement (e.g. you want a phase between parsing
   and enrichment).
   - Decoupling the topologies lets us spin up and down parts of Metron
   without affecting others (e.g. you don't have to take down enrichments to
   add a parser, even for a moment)
   - The movement to Flux meant we were limited in how much we could adjust
   the topology at runtime (e.g. colocating parsers and enrichment would mean
   moving away from Flux, essentially, as the topology changes its structure)

Best,

Casey
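
The decoupling point can be illustrated with a toy model. This is only a sketch: the Topic and Stage classes are in-memory stand-ins invented for illustration (not Kafka or Storm APIs), and the topic and sensor names are assumptions. It shows a second parser being added while the enrichment stage is left untouched.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a Kafka topic (a simple FIFO buffer)."""

    def __init__(self, name):
        self.name = name
        self.messages = deque()


class Stage:
    """Stand-in for one topology: reads one topic, writes to the next."""

    def __init__(self, name, source, sink, transform):
        self.name = name
        self.source = source
        self.sink = sink
        self.transform = transform

    def poll(self):
        """Process everything currently buffered; return how many were handled."""
        processed = 0
        while self.source.messages:
            self.sink.messages.append(self.transform(self.source.messages.popleft()))
            processed += 1
        return processed


# Hypothetical topic names, chosen for illustration only.
raw_bro, raw_snort = Topic("bro"), Topic("snort")
enrichments, indexing = Topic("enrichments"), Topic("indexing")

enrichment = Stage("enrichment", enrichments, indexing,
                   lambda m: {**m, "enriched": True})
bro_parser = Stage("bro_parser", raw_bro, enrichments,
                   lambda raw: {"sensor": "bro", "raw": raw})

raw_bro.messages.append("bro-event-1")
bro_parser.poll()
enrichment.poll()

# Add a second parser later. The enrichment stage object is never rebuilt or
# restarted; the new parser simply writes to the same intermediate topic.
snort_parser = Stage("snort_parser", raw_snort, enrichments,
                     lambda raw: {"sensor": "snort", "raw": raw})
raw_snort.messages.append("snort-event-1")
snort_parser.poll()
enrichment.poll()

print([m["sensor"] for m in indexing.messages])
```

Because each stage only knows its input and output topic, each one could also be given its own resources and its own topic settings, which is the tuning benefit listed above.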




Architectural reason to split in 4 topologies / impact on the kafka ressources

2018-06-22 Thread Michel Sumbul
Hi Everyone,

I was asking myself what the architectural reason was to split the
ingestion in Metron into four different topologies that all read from and
write to Kafka.

For example, why have the parsing and enrichment topologies not been
merged? Would it not be possible to enrich the message directly
when you parse it?

I'm asking because splitting into several topologies means that all of
the topologies read from and write to Kafka, which puts a bigger load on the
Kafka cluster and so requires far more infrastructure/servers. The cost
is especially significant when we are talking about TBs of data ingested
every day.

I'm sure there was a very good reason; I was just curious.

Thanks,
Michel
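
The load concern in this message can be put into rough numbers. The sketch below is a back-of-the-envelope model only: the function name, the 2 TB/day figure, the replication factor, and the hop counts are all assumptions made for illustration. It also assumes message size stays constant between hops, which enrichment would not respect in practice since it adds fields, and it ignores replica fetch traffic between brokers.

```python
def kafka_traffic(daily_ingest_tb, kafka_hops, replication_factor=1):
    """Estimate daily Kafka volume in TB as (written, read).

    Each hop is one topic that every message is written to once (multiplied
    by the replication factor) and read from once by a single consumer.
    """
    written = daily_ingest_tb * kafka_hops * replication_factor
    read = daily_ingest_tb * kafka_hops
    return written, read


# Hypothetical figures: 2 TB/day ingested, replication factor 3.
# Three hops models separate topologies chained through Kafka topics;
# one hop models a single merged topology reading only the raw topic.
split = kafka_traffic(2, kafka_hops=3, replication_factor=3)
merged = kafka_traffic(2, kafka_hops=1, replication_factor=3)

print("split  (written TB, read TB):", split)
print("merged (written TB, read TB):", merged)
```

Under these assumptions the written and read volumes both grow linearly with the number of Kafka hops, which is the cost being weighed against the tuning and decoupling benefits.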