Github user cestella commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1021#discussion_r189916513
  
    --- Diff: metron-platform/metron-common/src/main/java/org/apache/metron/common/configuration/enrichment/handler/StellarConfig.java ---
    @@ -142,8 +143,14 @@ else if(kv.getValue() instanceof List) {
       {
     
    --- End diff --
    
    > It feels like '_' conflates the 'messaging' with the language.
    
    I hear you there, that's why support for this variable is VariableResolver 
specific.  You very well could have a variable resolver that does NOT support 
it (it's just that all of ours happen to).  I'd argue that it's not even really 
part of the language as it's a feature of the variable resolver rather than the 
parsing infrastructure.
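    
    To make the distinction concrete, here is a minimal sketch of two resolvers over the same message, only one of which understands `_`. The names (`ResolverSketch`, `Resolver`, `wholeMessageAware`, `fieldsOnly`) are hypothetical and are not Metron's actual classes; the point is only that the parser never needs to know about `_`.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: illustrative names, not Metron's actual classes.
class ResolverSketch {

    /** Minimal stand-in for a variable resolver: maps a variable name to a value. */
    interface Resolver {
        Object resolve(String variable);
    }

    /** Resolves named fields and, additionally, resolves "_" to the whole message. */
    static Resolver wholeMessageAware(Map<String, Object> message) {
        return variable -> "_".equals(variable)
                ? Collections.unmodifiableMap(message)
                : message.get(variable);
    }

    /** Resolves named fields only; "_" is just another (absent) variable name. */
    static Resolver fieldsOnly(Map<String, Object> message) {
        return message::get;
    }

    public static void main(String[] args) {
        Map<String, Object> message = new HashMap<>();
        message.put("ip_src_addr", "10.0.0.1");
        message.put("packet_count", 42);

        // The first resolver hands back the full message, the second hands back null.
        System.out.println(wholeMessageAware(message).resolve("_"));
        System.out.println(fieldsOnly(message).resolve("_"));
    }
}
```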
    
    One reason this was done as a variable is that the split/join topology 
needs to know which fields are used, and it learns that by inspecting the 
variables referenced in stellar (this way we send only the required fields to 
the individual stellar adapter workers).  I had contemplated adjusting the 
interface or passing along the VariableResolver in the spark context, but that 
didn't feel right either: it was more complex, and it mandated that 
VariableResolvers support `_`, which not all can do.
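    
    As a rough illustration of the field-selection idea (a sketch under assumptions, not the topology's actual code): given the set of variable names an expression references, forward only those fields, and forward the whole message when `_` is among them. `FieldPruningSketch` and `fieldsToSend` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: illustrative names, not Metron's actual classes.
class FieldPruningSketch {

    /**
     * Returns the subset of the message that a worker needs, based on the
     * variables referenced by the expression. A reference to "_" means the
     * expression wants the entire message, so nothing is pruned.
     */
    static Map<String, Object> fieldsToSend(Map<String, Object> message,
                                            Set<String> variablesUsed) {
        if (variablesUsed.contains("_")) {
            return new HashMap<>(message);
        }
        Map<String, Object> pruned = new HashMap<>();
        for (String name : variablesUsed) {
            if (message.containsKey(name)) {
                pruned.put(name, message.get(name));
            }
        }
        return pruned;
    }

    public static void main(String[] args) {
        Map<String, Object> message = new HashMap<>();
        message.put("ip_src_addr", "10.0.0.1");
        message.put("ip_dst_addr", "10.0.0.2");
        message.put("byte_count", 1024);

        Set<String> named = new HashSet<>(Arrays.asList("ip_src_addr"));
        Set<String> whole = new HashSet<>(Arrays.asList("_"));

        System.out.println(fieldsToSend(message, named).size()); // one field forwarded
        System.out.println(fieldsToSend(message, whole).size()); // all three forwarded
    }
}
```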
    
    > Also, I hope some of these MAAS applications make it into Metron /contrib 
;)
    
    You will get your wish as this is the preliminary PR for one of them going 
into Metron.  It's actually not a MaaS model, but a semantic hash function 
backed by Word2Vec that fits into the `HASH()` infrastructure you and JJ 
created.
    
    > I'm saying why not just have the _ in the configuration side, and just 
have the scripts reference the vars by name and not have to MAP_GET()?
    
    So, the model scripts wouldn't reference `MAP_GET`.  I was going to wait 
and put out a discuss thread, but perhaps an example of what I'm contributing 
next will illuminate the need.  The model in question's job is to take the 
whole message and generate a hash from it such that messages that are similar 
have the same hash.  This is similar to the forensic clustering use-case that 
I wrote up, but it's customized to your data and does not presume the user is 
constructing a string.  
    
    The model itself knows about the schema because it's specific to your data. 
 For instance, if you build the model on netflow data, it'll know about netflow 
fields:
    * source computer/port
    * destination computer/port
    * packet count
    * byte count
    * duration
    
    Now, I need a way to pass the whole message into the `HASH()` function.  
One way of doing it would be:
    `HASH( { 'ip_src_addr' : ip_src_addr, 'ip_dst_addr' : ip_dst_addr, 
'ip_src_port' : ip_src_port, 'ip_dst_port' : ip_dst_port, 'packet_count' : 
packet_count, 'duration' : duration, 'byte_count' : byte_count}, 'SEMHASH', { 
'model' : OBJECT_GET('/path/to/model.ser') })`
    
    Rather than doing that, I'd rather let the model select the relevant fields 
like so:
    `HASH( _ , 'SEMHASH', { 'model' : OBJECT_GET('/path/to/model.ser') })`
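    
    A minimal sketch of what "letting the model select the relevant fields" could look like on the receiving side, assuming the model carries its own schema. This is not the SEMHASH implementation; `SchemaAwareModelSketch` and `selectFeatures` are hypothetical names, and the hashing step itself is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: illustrative names, not the actual SEMHASH model.
class SchemaAwareModelSketch {

    // Field names the model was built on, in the order it expects them.
    private final List<String> schema;

    SchemaAwareModelSketch(List<String> schema) {
        this.schema = new ArrayList<>(schema);
    }

    /**
     * Pulls the fields the model cares about out of the whole message,
     * ignoring everything else. Missing fields come back as null.
     */
    List<Object> selectFeatures(Map<String, Object> message) {
        List<Object> features = new ArrayList<>();
        for (String field : schema) {
            features.add(message.get(field));
        }
        return features;
    }

    public static void main(String[] args) {
        SchemaAwareModelSketch model = new SchemaAwareModelSketch(
                Arrays.asList("ip_src_addr", "ip_dst_addr", "packet_count"));

        Map<String, Object> message = new HashMap<>();
        message.put("ip_src_addr", "10.0.0.1");
        message.put("ip_dst_addr", "10.0.0.2");
        message.put("packet_count", 42);
        message.put("unrelated_field", "ignored by the model");

        // The caller passes the whole message; the model decides what to use.
        System.out.println(model.selectFeatures(message));
    }
}
```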
    
    Similar situations exist with MaaS models as well, where the model knows 
which fields it cares about, and spelling out the translation can become 
onerous to the user as the number of input fields grows.
    
    What do you think?  Do you like another option that would solve the issue?
    
    PS. By end of week you'll get a full PR for the semantic hashing function 
I teased, with a worked use-case on the Los Alamos National Labs data.

