Github user cestella commented on a diff in the pull request:

https://github.com/apache/metron/pull/1021#discussion_r189916513

--- Diff: metron-platform/metron-common/src/main/java/org/apache/metron/common/configuration/enrichment/handler/StellarConfig.java ---
@@ -142,8 +143,14 @@ else if(kv.getValue() instanceof List) {
{
--- End diff --

> It feels like '_' conflates the 'messaging' with the language.

I hear you there; that's why support for this variable is VariableResolver-specific. You very well could have a variable resolver that does NOT support it (it's just that all of ours happen to). I'd argue that it's not even really part of the language, as it's a feature of the variable resolver rather than the parsing infrastructure.

One reason why this was done as a variable is that the split/join topology requires knowledge of the fields used, which it gets by inspecting the variables used in Stellar (this way we send only the required fields to the individual Stellar adapter workers). I had contemplated adjusting the interface or passing along the VariableResolver in the spark context, but that didn't feel right either: it was more complex, and it mandated that every VariableResolver support `_`, which not all can do.

> Also, I hope some of these MAAS applications make it into Metron /contrib ;)

You will get your wish, as this is the preliminary PR for one of them going into Metron. It's actually not a MaaS model, but a semantic hash function backed by Word2Vec that fits into the `HASH()` infrastructure you and JJ created.

> I'm saying why not just have the _ in the configuration side, and just have the scripts reference the vars by name and not have to MAP_GET()?

So, the model scripts wouldn't reference `MAP_GET`. I was going to wait and put out a discuss thread, but perhaps an example of what I'm contributing next will illuminate the need.

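To make the resolver point above concrete, here is a minimal Python sketch of the idea, not Metron's actual Java code. The names `ALL_FIELDS`, `MapResolver`, and `fields_to_send` are hypothetical and exist only for illustration; the sketch assumes a message is a flat map of field names to values.

```python
# Hypothetical name for the special "whole message" variable.
ALL_FIELDS = "_"


class MapResolver:
    """Hypothetical resolver backed by a dict.

    Supporting `_` is a choice made by this resolver, not by the
    expression parser; another resolver could decline to support it.
    """

    def __init__(self, message):
        self._message = message

    def resolve(self, name):
        if name == ALL_FIELDS:
            # Hand back a copy of the entire message.
            return dict(self._message)
        return self._message.get(name)


def fields_to_send(variables_used, message):
    """Project a message onto the variables an expression uses.

    If `_` is among the variables, the whole message is required;
    otherwise only the named fields are forwarded to the worker.
    """
    if ALL_FIELDS in variables_used:
        return dict(message)
    return {k: message[k] for k in variables_used if k in message}
```

Because `_` shows up as an ordinary variable, the field-projection step needs no separate mechanism to learn that an expression wants the whole message.
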
The model in question's job is to take the whole message and generate a hash from it such that messages that are similar have the same hash. This has a similar use case to the forensic clustering use case that I wrote up, but it's customized to your data and does not presume the user is constructing a string. The model itself knows about the schema because it's specific to your data. For instance, if you build the model on netflow data, it'll know about netflow fields:

* source computer/port
* destination computer/port
* packet count
* byte count
* duration

Now, I need a way to pass the whole message into the `HASH()` function. One way of doing it would be:

`HASH( { 'ip_src_addr' : ip_src_addr, 'ip_dst_addr' : ip_dst_addr, 'ip_src_port' : ip_src_port, 'ip_dst_port' : ip_dst_port, 'packet_count' : packet_count, 'duration' : duration, 'byte_count' : byte_count}, 'SEMHASH', { 'model' : OBJECT_GET('/path/to/model.ser') })`

Rather than doing that, I'd rather let the model select the relevant fields like so:

`HASH( _ , 'SEMHASH', { 'model' : OBJECT_GET('/path/to/model.ser') })`

Similar situations exist with MaaS models as well, where the model knows which fields it cares about and spelling out the translation can become onerous to the user as the number of input fields grows.

What do you think? Is there another option you'd prefer that would solve the issue?

PS. You'll get a full PR with a worked use case on the Los Alamos National Labs data for the semantic hashing function I teased by end of week.

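The difference between the two `HASH()` calls above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Word2Vec-backed implementation: `ToyModel` is a hypothetical stand-in for the serialized model, and SHA-256 over the selected fields replaces the semantic hash, so it only groups messages whose model fields match exactly rather than messages that are merely similar.

```python
import hashlib
import json


class ToyModel:
    """Hypothetical stand-in for the serialized model; it knows its own schema."""

    # Field names taken from the netflow example in the comment.
    FIELDS = (
        "ip_src_addr", "ip_src_port",
        "ip_dst_addr", "ip_dst_port",
        "packet_count", "byte_count", "duration",
    )

    def select(self, message):
        """Pick out the fields the model cares about; ignore the rest."""
        return {f: message.get(f) for f in self.FIELDS}

    def hash(self, message):
        """Hash the whole message by hashing only the selected fields.

        SHA-256 is a placeholder here, NOT a semantic hash.
        """
        projected = self.select(message)
        payload = json.dumps(projected, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
```

The caller passes the entire message, as `HASH( _ , ...)` would, and never enumerates fields; fields outside the model's schema have no effect on the result.
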
---