Many thanks for starting off this discussion. Today in Metron we make a basic 
assumption that once the data is written it stays written. All our enrichments 
and modifications happen in the stream before landing in an immutable store, 
and this is something we need to maintain.

However, as we start to look at integration use cases, and the idea of 
providing an interactive UI to investigators using the platform, we need to 
capture additional data about events:

- human-entered data (small scale):
  - has this alert been seen
  - escalated to a case system
  - manually combined with other alerts
- machine-generated data (large scale):
  - restatement of threat feeds
  - batch analytics too expensive to fit in the stream
These require some mutability of the stored data. However, I would argue that 
we must maintain that all mutability to Metron data is additive: once data is 
stated, we should not restate it, in order to preserve the integrity of the 
record provided by Metron, which is a key value for security departments. 

In the case of the 'post-indexing' data, we expect this to have a smaller 
profile than the telemetry, since it is mostly human scale. That said, we still 
have challenges when reading that data. Essentially it provides a delta overlay 
on the core indexed data which needs to be checked for a significant number of 
operations, creating in effect a join condition for many queries. The primary 
query sources are going to be interactive UIs for things like alert status, for 
which an HBase or search index makes a lot of sense. However, we will also need 
to be able to access these efficiently in batch for things like relevancy 
modelling and capturing feedback for human-in-the-loop style models. On that 
basis, I would argue that something that's easy to join to the HDFS index in 
Spark is also essential. HBase would be a candidate here.
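To make that concrete, here is a rough sketch of the kind of batch join I have 
in mind. The mutation location and the idea of reading the deltas straight from 
JSON are assumptions purely for illustration; in practice they would come from 
HBase via a connector or a periodic export:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Sketch only: overlay the post-index deltas onto the immutable HDFS index by guid.
    SparkSession spark = SparkSession.builder().appName("metron-delta-join").getOrCreate();
    Dataset<Row> indexed = spark.read().json("/apps/metron/indexing/indexed/snort");
    Dataset<Row> deltas = spark.read().json("/apps/metron/mutations/snort"); // hypothetical location
    Dataset<Row> overlaid = indexed.join(
        deltas, indexed.col("guid").equalTo(deltas.col("guid")), "left_outer");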

The format of the stored mutation data also needs to be considered. Since it is 
likely to involve a relatively small number of modifications, and in keeping 
with the principle of immutability and preservation of provenance, I would 
suggest the mutations are stored as a timestamped transaction log against the 
original message. We may also want a current-state representation. It makes 
sense to me to store the log in HBase while the current state is updated 
against the original message in ES / Solr, depending on your search index of 
choice. 
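To make the log form concrete, an entry might look something like the following 
(the field names here are illustrative only, not a schema proposal):

    {"guid": "id2", "timestamp": 1490335748664, "source": "analyst1", "field": "alert_status", "value": "escalated"}

The current-state view would then simply be the original message with 
alert_status merged in, while the log preserves who added what and when.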

Looking at the idea of storing the log in HBase, we would have to consider 
schema. I would recommend keying by message guid, with column-based versioning 
by timestamp or some sort of vector clock, depending on the expected volume and 
variance of changes, which I would expect to be low. Alternatively we could 
look at something like the OpenTSDB schema, with guid and partial timestamp in 
the key, if we're expecting high volumes (this seems very unlikely to me). 
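As a minimal sketch of that first layout (the table name, column family and 
field are assumptions for illustration only), appending one log entry would 
look roughly like:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Row key = message guid; one column per appended field; the cell timestamp is
    // the version, so keeping VERSIONS high on the family preserves the full history.
    Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
    Table table = conn.getTable(TableName.valueOf("metron_mutations")); // hypothetical table name
    Put put = new Put(Bytes.toBytes("id2"));
    put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("alert_status"),
        1490335748664L, Bytes.toBytes("escalated")); // explicit timestamp as the version
    table.put(put);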

Another option, similar to Raghu's sidecar files, is to borrow the architecture 
of Hive updates, which is to write sidecar delta files that are checked against 
the underlying file for modifications on every query, and to periodically 
compact. This would make sense were it not for our need for immutability. 
Compaction could be done in batch against the original record file, and would 
only add fields in the log form to it. We can get away with this optimisation 
over the Hive method, since we are never looking to change original values, but 
only 'after the index' values. That said, compaction is still likely to be 
heavy and full of potential problems with things like stripe and block 
alignment for performance (maybe there is something we can learn here from the 
early problems with Hive ACID if we go down that route). Personally I see this 
as a high-risk option.
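If we did go that route, an additive compaction pass might be as simple as the 
following sketch in Spark (the paths and the delta column are hypothetical, and 
a real job would also need to resolve multiple log entries per guid down to the 
latest value):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Sketch only: append the logged fields to the original records and write a new
    // file; original values are never changed, only 'after the index' fields are added.
    SparkSession spark = SparkSession.builder().appName("metron-compaction").getOrCreate();
    Dataset<Row> original = spark.read().json("/apps/metron/indexing/indexed/snort");
    Dataset<Row> log = spark.read().json("/apps/metron/mutations/snort");  // hypothetical location
    Dataset<Row> additions = log.select("guid", "newKey");                 // hypothetical appended field
    Dataset<Row> compacted = original
        .join(additions, original.col("guid").equalTo(additions.col("guid")), "left_outer")
        .drop(additions.col("guid"));
    compacted.write().json("/apps/metron/indexing/compacted/snort");       // hypothetical output location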

Something I would like to consider is how we abstract this from the Metron UI 
and other Metron users. I would recommend we deliver a data services layer API 
covering access to all the underlying data, enforcing the immutability 
guarantees, and handling maintenance of whatever persistence we use. I would 
also like to see a Spark relation built for Metron to abstract data access on 
the backend of Spark jobs, which would allow us to decouple things like model 
building from the underlying mechanisms and file formats. 
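Purely to illustrate the shape of such a layer (none of these names exist in 
Metron today; they are assumptions for the sake of the example), the contract 
might look something like:

    import java.util.List;
    import java.util.Map;

    // Hypothetical data services layer: reads are unified across the stores, and
    // writes are strictly additive so the original record is never restated.
    public interface MetronDataService {
      Map<String, Object> getCurrentState(String guid);             // original message merged with its additive mutations
      List<Map<String, Object>> getMutationLog(String guid);        // full timestamped provenance, never rewritten
      void appendMutation(String guid, Map<String, Object> fields); // append-only; no update or delete operations
    }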

The short version is that I would say we store a transaction log in HBase and 
consider mutating the document in search. 

Simon


> On 27 Mar 2017, at 10:26, Raghu Mitra Kandikonda <r...@hortonworks.com> wrote:
> 
> Hi All,
> 
> I would like to start a discussion around what would be the good approach to 
> append data to the existing records that are processed by Metron. Here are 
> few thoughts that I have to start with.
> 
> 1. Store the new fields just in ES and allow records to be different in ES and 
> HDFS.
> 2. Store the new fields in HBASE along with ES.
> a. We can create a new table in HBASE that stores guid + key (or any other 
> unique key of the record) and the new value.
> b. The table name will be same as the file name that originally contained the 
> record.
> 3. Store new fields in ES and in HDFS.
> a. The new fields will be stored in same file as the original record.
> b. The new fields are stored along with guid of the record.
> c. Any changes to the values of the fields will have a new record instead of 
> modifying the existing record.
> d. To read the latest value for a record we need to parse the entire file.
> Ex: File enrichment-null-0-0-1490335748664.json has 3 records
> {"key1": "value1", "key2": "value2", "key3": "value3", "guid": "id1"}
> {"key1": "value11", "key2": "value21", "key3": "value31", "guid": "id2"}
> {"key1": "value12", "key2": "value22", "key3": "value32", "guid": "id3"}
> Now we have to store new field for record with guid id2 the new file looks as 
> follows
> {"key1": "value1", "key2": "value2", "key3": "value3"}
> {"key1": "value11", "key2": "value21", "key3": "value31"}
> {"key1": "value12", "key2": "value22", "key3": "value32"}
> {"guid": "id2", "newKey": "newValue"}
> Again the value of newKey for record has been changed to newestValue the new 
> file looks as follows
> {"key1": "value1", "key2": "value2", "key3": "value3"}
> {"key1": "value11", "key2": "value21", "key3": "value31"}
> {"key1": "value12", "key2": "value22", "key3": "value32"}
> {"guid": "id2", "newKey": "newValue"}
> {"guid": "id2", "newKey": "newestValue"}
> 4. Store the new fields in ES and in HDFS.
> a. The new fields will be stored in new file than the file where the record 
> originally existed.
> b. The name of file will be the same  as the file where the record is 
> originally present but it will be in a different folder.
> c. The new fields are stored along with guid of the record.
> d. A new value for an existing field or a new field would be appended to the end 
> of the file instead of modifying a record.
> e. To read the latest value for a record we need to parse the entire file.
> Ex: File  
> /apps/metron/indexing/indexed/snort/enrichment-null-0-0-1490335746765.json 
> has following records
> {"key1": "value1", "key2": "value2", "key3": "value3", "guid": "id1"}
> {"key1": "value11", "key2": "value21", "key3": "value31", "guid": "id2"}
> {"key1": "value12", "key2": "value22", "key3": "value32", "guid": "id3"}
> Now we have a 'newKey' and 'newValue' to be stored for record with guid id2. 
> The file enrichment-null-0-0-1490335748664.json will look the same but we 
> will have a new file
> /apps/metron/augmented/snort/enrichment-null-0-0-1490335746765.json with the 
> following content
> {“guid”: “id2", “newKey”: “newValue”}
> Again the value of newKey is changed to  newestValue  and there is a new key 
> called newestKey the file looks as follows
> {“guid”: “id2", “newKey”: “newValue”}
> {“guid”: “id2", “newKey”: “newestValue”}
> {“guid”: “id2", “newestKey”: “nextNewestValue”}
> 
> -Raghu
> 
> 
> 
