Repository: metron
Updated Branches:
  refs/heads/master 353bc8b3b -> 7b6a3da6d


METRON-1052: Add forensic similarity hash functions to Stellar closes apache/incubator-metron#781


Project: http://git-wip-us.apache.org/repos/asf/metron/repo
Commit: http://git-wip-us.apache.org/repos/asf/metron/commit/7b6a3da6
Tree: http://git-wip-us.apache.org/repos/asf/metron/tree/7b6a3da6
Diff: http://git-wip-us.apache.org/repos/asf/metron/diff/7b6a3da6

Branch: refs/heads/master
Commit: 7b6a3da6d6612e61b3bde33ccf50f0638b6072fa
Parents: 353bc8b
Author: cstella <ceste...@gmail.com>
Authored: Mon Oct 2 09:54:17 2017 -0400
Committer: cstella <ceste...@gmail.com>
Committed: Mon Oct 2 09:54:17 2017 -0400

----------------------------------------------------------------------
 dependencies_with_url.csv                       |   1 +
 metron-stellar/stellar-common/README.md         |  27 +-
 metron-stellar/stellar-common/pom.xml           |   6 +
 .../common/utils/hashing/DefaultHasher.java     |  54 +++
 .../common/utils/hashing/EnumConfigurable.java  |  31 ++
 .../common/utils/hashing/HashStrategy.java      |  84 ++++
 .../stellar/common/utils/hashing/Hasher.java    |  13 +-
 .../stellar/common/utils/hashing/tlsh/TLSH.java |  54 +++
 .../common/utils/hashing/tlsh/TLSHCache.java    |  40 ++
 .../common/utils/hashing/tlsh/TLSHHasher.java   | 188 ++++++++
 .../stellar/dsl/functions/HashFunctions.java    |  75 +++-
 .../dsl/functions/HashFunctionsTest.java        | 135 +++++-
 pom.xml                                         |   4 +
 use-cases/forensic_clustering/README.md         | 433 +++++++++++++++++++
 use-cases/forensic_clustering/clustered.png     | Bin 0 -> 218476 bytes
 use-cases/forensic_clustering/find_alerts.png   | Bin 0 -> 581508 bytes
 use-cases/geographic_login_outliers/README.md   |   5 +-
 17 files changed, 1123 insertions(+), 27 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/dependencies_with_url.csv
----------------------------------------------------------------------
diff --git a/dependencies_with_url.csv b/dependencies_with_url.csv
index f022647..38a9f5e 100644
--- a/dependencies_with_url.csv
+++ b/dependencies_with_url.csv
@@ -310,6 +310,7 @@ org.springframework.security.kerberos:spring-security-kerberos-core:jar:1.0.1.RE
 org.springframework.kafka:spring-kafka:jar:1.1.1.RELEASE:compile,ASLv2,https://github.com/spring-projects/spring-kafka
 ch.hsr:geohash:jar:1.3.0:compile,ASLv2,https://github.com/kungfoo/geohash-java
 org.locationtech.spatial4j:spatial4j:jar:0.6:compile,ASLv2,https://github.com/locationtech/spatial4j
+com.trendmicro:tlsh:jar:3.7.1:compile,ASLv2,https://github.com/trendmicro/tlsh
 org.glassfish:javax.json:jar:1.0.4:compile,Common Development and Distribution License (CDDL) v1.0,https://github.com/javaee/jsonp
 org.eclipse.persistence:javax.persistence:jar:2.1.1:compile,EPL 1.0,http://www.eclipse.org/eclipselink
 org.eclipse.persistence:org.eclipse.persistence.antlr:jar:2.6.4:compile,EPL 1.0,http://www.eclipse.org/eclipselink

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/metron-stellar/stellar-common/README.md
----------------------------------------------------------------------
diff --git a/metron-stellar/stellar-common/README.md b/metron-stellar/stellar-common/README.md
index b35a5c7..713af06 100644
--- a/metron-stellar/stellar-common/README.md
+++ b/metron-stellar/stellar-common/README.md
@@ -222,6 +222,7 @@ In the core language functions, we support basic functional programming primitiv
 | [ `SYSTEM_ENV_GET`](#system_env_get)                                                               |
 | [ `SYSTEM_PROPERTY_GET`](#system_property_get)                                                     |
 | [ `TAN`](#tan)                                                                                     |
+| [ `TLSH_DIST`](#tlsh_dist)                                                                                     |
 | [ `TO_DOUBLE`](#to_double)                                                                         |
 | [ `TO_EPOCH_TIMESTAMP`](#to_epoch_timestamp)                                                       |
 | [ `TO_FLOAT`](#to_float)                                                                           |
@@ -524,12 +525,18 @@ In the core language functions, we support basic functional programming primitiv
 
 ### `HASH`
   * Description: Hashes a given value using the given hashing algorithm and returns a hex encoded string.
-  * Input: 
-    * toHash - value to hash.
-    * hashType - A valid string representation of a hashing algorithm. See 'GET_HASHES_AVAILABLE'.
-  * Returns: A hex encoded string of a hashed value using the given algorithm. If 'hashType' is null 
-  then '00', padded to the necessary length, will be returned. If 'toHash' is not able to be hashed or 
-  'hashType' is null then null is returned.
+  * Input:
+     * toHash - value to hash.
+     * hashType - A valid string representation of a hashing algorithm. See 'GET_HASHES_AVAILABLE'.
+     * config? - Configuration for the hash function in the form of a String to object map.
+        * For forensic hash TLSH (see [https://github.com/trendmicro/tlsh](https://github.com/trendmicro/tlsh) and Jonathan Oliver, Chun Cheng, and Yanggui Chen, TLSH - A Locality Sensitive Hash. 4th Cybercrime and Trustworthy Computing Workshop, Sydney, November 2013):
+          * bucketSize : This defines the size of the hash created.  Valid values are 128 (default) or 256 (the former results in a 70 character hash and the latter results in 134 characters)
+          * checksumBytes : This defines how many bytes are used to capture the checksum.  Valid values are 1 (default) and 3
+          * force : If true (the default) then a hash can be generated from as few as 50 bytes.  If false, then at least 256 bytes are required.  Insufficient variation or size in the bytes results in a null being returned.
+          * hashes : You can compute a second hash for use in fuzzy clustering TLSH signatures.  The number of hashes is the lever to adjust the size of those clusters and how "fuzzy" the clusters are.  If this is specified, then one or more bins are created based on the specified size and the function will return a Map containing the bins.
+        * For all other hashes:
+          * charset : The character set to use (UTF8 is default).
+  * Returns: A hex encoded string of a hashed value using the given algorithm. If 'toHash' is null then '00', padded to the necessary length, will be returned. If 'toHash' is not able to be hashed or 'hashType' is null then null is returned.
 
 ### `IN_SUBNET`
   * Description: Returns true if an IP is within a subnet range.
@@ -916,6 +923,14 @@ In the core language functions, we support basic functional programming primitiv
     * number - The number to take the tangent of
   * Returns: The tangent of the number passed in.
 
+### `TLSH_DIST`
+  * Description: Will return the Hamming distance between two TLSH hashes (note: must be computed with the same params).  For more information, see [https://github.com/trendmicro/tlsh](https://github.com/trendmicro/tlsh) and Jonathan Oliver, Chun Cheng, and Yanggui Chen, TLSH - A Locality Sensitive Hash. 4th Cybercrime and Trustworthy Computing Workshop, Sydney, November 2013.  For a discussion of tradeoffs, see Table II on page 5 of [https://github.com/trendmicro/tlsh/blob/master/TLSH_CTC_final.pdf](https://github.com/trendmicro/tlsh/blob/master/TLSH_CTC_final.pdf)
+  * Input:
+     * hash1 - The first TLSH hash
+     * hash2 - The second TLSH hash
+     * includeLength? - Include the length in the distance calculation or not?
+  * Returns: An integer representing the distance between hash1 and hash2.  The distance is roughly Hamming distance, so 0 is very similar.
+
 ### `TO_DOUBLE`
   * Description: Transforms the first argument to a double precision number
   * Input:

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/metron-stellar/stellar-common/pom.xml
----------------------------------------------------------------------
diff --git a/metron-stellar/stellar-common/pom.xml b/metron-stellar/stellar-common/pom.xml
index 9ec29b8..5ec8a4e 100644
--- a/metron-stellar/stellar-common/pom.xml
+++ b/metron-stellar/stellar-common/pom.xml
@@ -29,6 +29,7 @@
         <commons.config.version>1.10</commons.config.version>
     </properties>
     <dependencies>
+
         <dependency>
             <groupId>org.apache.hadoop</groupId>
             <artifactId>hadoop-auth</artifactId>
@@ -51,6 +52,11 @@
             </exclusions>
         </dependency>
         <dependency>
+            <groupId>com.trendmicro</groupId>
+            <artifactId>tlsh</artifactId>
+            <version>3.7.1</version>
+        </dependency>
+        <dependency>
             <groupId>org.apache.commons</groupId>
             <artifactId>commons-math3</artifactId>
             <version>3.6.1</version>

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/DefaultHasher.java
----------------------------------------------------------------------
diff --git a/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/DefaultHasher.java b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/DefaultHasher.java
index c950a19..b2eeca5 100644
--- a/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/DefaultHasher.java
+++ b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/DefaultHasher.java
@@ -17,19 +17,42 @@
  */
 package org.apache.metron.stellar.common.utils.hashing;
 
+import com.google.common.base.Joiner;
 import org.apache.commons.codec.BinaryEncoder;
+import org.apache.commons.codec.Charsets;
 import org.apache.commons.codec.EncoderException;
+import org.apache.commons.codec.binary.Hex;
 import org.apache.commons.lang3.SerializationUtils;
 import org.apache.commons.lang3.StringUtils;
+import org.apache.metron.stellar.common.utils.ConversionUtils;
 
 import java.io.Serializable;
 import java.nio.charset.Charset;
 import java.nio.charset.StandardCharsets;
+import java.nio.charset.UnsupportedCharsetException;
 import java.security.MessageDigest;
 import java.security.NoSuchAlgorithmException;
+import java.security.Security;
+import java.util.*;
+import java.util.function.Function;
 
 public class DefaultHasher implements Hasher {
 
+  public enum Config implements EnumConfigurable {
+    CHARSET("charset"),
+    ;
+    private String key;
+    Config(String key) {
+      this.key = key;
+    }
+
+    @Override
+    public String getKey() {
+      return key;
+    }
+
+  }
+
   private String algorithm;
   private BinaryEncoder encoder;
   private Charset charset;
@@ -62,6 +85,16 @@ public class DefaultHasher implements Hasher {
   }
 
   /**
+   * Builds a utility to hash values based on a given algorithm. Uses {@link StandardCharsets#UTF_8} for encoding.
+   * @param algorithm The algorithm used when hashing a value.
+   * @see java.security.Security
+   * @see java.security.MessageDigest
+   */
+  public DefaultHasher(final String algorithm) {
+    this(algorithm, new Hex(StandardCharsets.UTF_8));
+  }
+
+  /**
    * {@inheritDoc}
    *
    * Returns a hash which has been encoded using the supplied encoder. If input is null then a string
@@ -94,4 +127,25 @@ public class DefaultHasher implements Hasher {
 
     return new String(encode, charset);
   }
+
+  @Override
+  public void configure(Optional<Map<String, Object>> config) {
+    if(config.isPresent() && !config.get().isEmpty()) {
+      charset = Config.CHARSET.get(config.get()
+              , o -> {
+                String charset = ConversionUtils.convert(o, String.class);
+                if(charset != null) {
+                  Charset set = Charset.forName(charset);
+                  return set;
+                }
+                return null;
+              }
+      ).orElse(charset);
+    }
+  }
+
+  public static final Set<String> supportedHashes() {
+    return new HashSet<>(Security.getAlgorithms("MessageDigest"));
+  }
+
 }

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/EnumConfigurable.java
----------------------------------------------------------------------
diff --git a/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/EnumConfigurable.java b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/EnumConfigurable.java
new file mode 100644
index 0000000..42923a0
--- /dev/null
+++ b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/EnumConfigurable.java
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.metron.stellar.common.utils.hashing;
+
+import java.util.Map;
+import java.util.Optional;
+import java.util.function.Function;
+
+public interface EnumConfigurable {
+  String getKey();
+
+  default <T> Optional<T> get(Map<String, Object> config, Function<Object, T> converter) {
+    Object o = config.get(getKey());
+    return o == null?Optional.empty():Optional.ofNullable(converter.apply(o));
+  }
+}
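
The lookup-and-convert pattern in `EnumConfigurable` can be illustrated with a small standalone sketch. The interface below mirrors the one added in this commit; the `DemoConfig` enum and the `bucketSize` helper are hypothetical names introduced only for illustration and are not part of Metron. The point it tries to show is that a missing key and a converter that returns null both fall through to the caller's default.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

class EnumConfigurableSketch {

  // Same shape as the interface in this commit: a key plus a converting lookup.
  interface EnumConfigurable {
    String getKey();

    default <T> Optional<T> get(Map<String, Object> config, Function<Object, T> converter) {
      Object o = config.get(getKey());
      return o == null ? Optional.empty() : Optional.ofNullable(converter.apply(o));
    }
  }

  // Hypothetical option set, for illustration only.
  enum DemoConfig implements EnumConfigurable {
    BUCKET_SIZE("bucketSize"),
    FORCE("force");

    private final String key;

    DemoConfig(String key) {
      this.key = key;
    }

    @Override
    public String getKey() {
      return key;
    }
  }

  // Resolve bucketSize, falling back to the supplied default when the key is
  // absent or when the converter rejects the value by returning null.
  static int bucketSize(Map<String, Object> config, int defaultValue) {
    return DemoConfig.BUCKET_SIZE.get(config, o -> {
      int v = Integer.parseInt(o.toString());
      return (v == 128 || v == 256) ? Integer.valueOf(v) : null;
    }).orElse(defaultValue);
  }

  public static void main(String[] args) {
    Map<String, Object> config = new HashMap<>();
    config.put("bucketSize", 256);
    System.out.println(bucketSize(config, 128));          // key present and accepted
    System.out.println(bucketSize(new HashMap<>(), 128)); // key absent, default used
  }
}
```

Because the converter's null result is wrapped with `Optional.ofNullable`, an unrecognised value behaves like a missing key rather than raising an error; whether that silent fallback is desirable is a design choice for the caller.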

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/HashStrategy.java
----------------------------------------------------------------------
diff --git a/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/HashStrategy.java b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/HashStrategy.java
new file mode 100644
index 0000000..a26832b
--- /dev/null
+++ b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/HashStrategy.java
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.metron.stellar.common.utils.hashing;
+
+import com.google.common.base.Joiner;
+import org.apache.metron.stellar.common.utils.hashing.tlsh.TLSHHasher;
+
+import java.util.*;
+import java.util.function.Function;
+
+/**
+ * This is an enum implementing a hashing strategy pattern.  Because hash types may
+ * be quite different, but have a similar interface we have laid on top this abstraction
+ * to allow new types of hashes to be added easily.
+ *
+ * In order to add a new family of hashes, simply implement the Hasher interface and register it with this
+ * enum.  The search order for algorithms to their respective Hasher is in the order of the entries
+ * in this enum.
+ */
+public enum HashStrategy {
+  TLSH(a -> new TLSHHasher(), TLSHHasher.supportedHashes()),
+  DEFAULT(a -> new DefaultHasher(a), DefaultHasher.supportedHashes())
+  ;
+
+  /**
+   * An accumulated list of all the supported hash algorithms.
+   */
+  public static final Set<String> ALL_SUPPORTED_HASHES = new HashSet<>();
+  static {
+    for(HashStrategy factory : HashStrategy.values()) {
+      ALL_SUPPORTED_HASHES.addAll(factory.supportedHashes);
+    }
+  }
+
+  Function<String, Hasher> hasherCreator;
+  Set<String> supportedHashes;
+
+  HashStrategy(Function<String, Hasher> hasherCreator, Set<String> supportedHashes) {
+    this.hasherCreator = hasherCreator;
+    this.supportedHashes = supportedHashes;
+  }
+
+  /**
+   * Return the appropriate hasher given the algorithm.
+   * @param algorithm The algorithm to find a hasher handler for.  Note: this is upper-cased prior to search
+   * @param config The config for the hasher
+   * @return The hasher which will handle the algorithm.  If the algorithm is not supported by any registered
+   *         hashers, an IllegalArgumentException is thrown.
+   */
+  public static Hasher getHasher(String algorithm, Optional<Map<String, Object>> config) {
+    Hasher h = null;
+    for(HashStrategy factory : HashStrategy.values()) {
+      if(factory.getSupportedHashes().contains(algorithm.toUpperCase())) {
+        h = factory.hasherCreator.apply(algorithm);
+        break;
+      }
+    }
+    if(h == null) {
+      throw new IllegalArgumentException("Unsupported hash function: " + algorithm
+                                        + ".  Supported algorithms are " + Joiner.on(",").join(ALL_SUPPORTED_HASHES));
+    }
+    h.configure(config);
+    return h;
+  }
+
+  public Set<String> getSupportedHashes() {
+    return supportedHashes;
+  }
+}
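
The ordered lookup that `HashStrategy` performs can be sketched without the TLSH or commons-codec dependencies. Everything below is a simplified stand-in: the `Hasher` interface is reduced to a single hypothetical `family()` method, the enum constants are named `CUSTOM` and `DEFAULT` for illustration, and the supported sets are fixed lists rather than the values the real classes report. It is meant only to show that the first enum entry whose set contains the upper-cased name wins.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

class HashStrategySketch {

  // Stand-in for the Hasher interface; only what the lookup needs.
  interface Hasher {
    String family();
  }

  enum Strategy {
    // Declared first, so it is consulted first.
    CUSTOM(a -> () -> "custom", setOf("TLSH")),
    // Illustrative fixed list; the real code asks the JDK for its digests.
    DEFAULT(a -> () -> "default", setOf("MD5", "SHA-1", "SHA-256"));

    final Function<String, Hasher> creator;
    final Set<String> supported;

    Strategy(Function<String, Hasher> creator, Set<String> supported) {
      this.creator = creator;
      this.supported = supported;
    }

    // Walk the entries in declaration order and use the first match.
    static Hasher getHasher(String algorithm) {
      String key = algorithm.toUpperCase(Locale.ROOT);
      for (Strategy s : values()) {
        if (s.supported.contains(key)) {
          return s.creator.apply(algorithm);
        }
      }
      throw new IllegalArgumentException("Unsupported hash function: " + algorithm);
    }
  }

  private static Set<String> setOf(String... names) {
    return new HashSet<>(Arrays.asList(names));
  }

  public static void main(String[] args) {
    System.out.println(Strategy.getHasher("tlsh").family());
    System.out.println(Strategy.getHasher("sha-256").family());
  }
}
```

Keeping the registry in an enum means adding a hash family is a one-line change, at the cost of making declaration order significant when two families claim the same name.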

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/Hasher.java
----------------------------------------------------------------------
diff --git a/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/Hasher.java b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/Hasher.java
index 08e8f72..a059842 100644
--- a/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/Hasher.java
+++ b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/Hasher.java
@@ -20,6 +20,10 @@ package org.apache.metron.stellar.common.utils.hashing;
 import org.apache.commons.codec.EncoderException;
 
 import java.security.NoSuchAlgorithmException;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.Set;
 
 public interface Hasher {
 
@@ -31,5 +35,12 @@ public interface Hasher {
    * @throws EncoderException If unable to encode the hash then this exception occurs.
    * @throws NoSuchAlgorithmException If the supplied algorithm is not known.
    */
-  String getHash(final Object toHash) throws EncoderException, NoSuchAlgorithmException;
+  Object getHash(final Object toHash) throws EncoderException, NoSuchAlgorithmException;
+
+  /**
+   * Configure the hasher with a string to object map.
+   * @param config
+   */
+  void configure(final Optional<Map<String, Object>> config);
+
 }

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/tlsh/TLSH.java
----------------------------------------------------------------------
diff --git a/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/tlsh/TLSH.java b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/tlsh/TLSH.java
new file mode 100644
index 0000000..9913b82
--- /dev/null
+++ b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/tlsh/TLSH.java
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.metron.stellar.common.utils.hashing.tlsh;
+
+import com.trendmicro.tlsh.BucketOption;
+import com.trendmicro.tlsh.ChecksumOption;
+import com.trendmicro.tlsh.Tlsh;
+import com.trendmicro.tlsh.TlshCreator;
+
+import java.util.Optional;
+
+/**
+ * The abstraction around interacting with TLSH.
+ */
+public class TLSH {
+  TlshCreator creator;
+  public TLSH(BucketOption bucketOption, ChecksumOption checksumOption) {
+    creator = new TlshCreator(bucketOption, checksumOption);
+  }
+
+  public String apply(byte[] data, boolean force) {
+    try {
+      creator.update(data);
+      return creator.getHash(force).getEncoded();
+    }
+    finally {
+      creator.reset();
+    }
+  }
+
+  public static int distance(String hash1, String hash2, Optional<Boolean> includeLength) {
+    if(hash1 == null || hash2 == null) {
+      return -1;
+    }
+    Tlsh t1 = Tlsh.fromTlshStr(hash1);
+    Tlsh t2 = Tlsh.fromTlshStr(hash2);
+    return t1.totalDiff(t2, includeLength.orElse(false));
+  }
+}

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/tlsh/TLSHCache.java
----------------------------------------------------------------------
diff --git a/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/tlsh/TLSHCache.java b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/tlsh/TLSHCache.java
new file mode 100644
index 0000000..10d106f
--- /dev/null
+++ b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/tlsh/TLSHCache.java
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.metron.stellar.common.utils.hashing.tlsh;
+
+import com.trendmicro.tlsh.BucketOption;
+import com.trendmicro.tlsh.ChecksumOption;
+
+import java.util.AbstractMap;
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Create a threadlocal cache of TLSH handlers.
+ */
+public class TLSHCache {
+  public static ThreadLocal<TLSHCache> INSTANCE = ThreadLocal.withInitial(() -> new TLSHCache());
+  private Map<Map.Entry<BucketOption, ChecksumOption>, TLSH> cache = new HashMap<>();
+  private TLSHCache() {}
+
+  public TLSH getTLSH(BucketOption bo, ChecksumOption co) {
+    return cache.computeIfAbsent( new AbstractMap.SimpleEntry<>(bo, co)
+                                , kv -> new TLSH(kv.getKey(), kv.getValue())
+                                );
+  }
+}
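
The caching approach in `TLSHCache` (one cache per thread, keyed by the option pair) can be sketched with plain JDK types. The `Handler` class here is a hypothetical stand-in for the TLSH wrapper, and integers replace the `BucketOption` and `ChecksumOption` enums; only the `ThreadLocal` plus `computeIfAbsent` structure is taken from the commit.

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;

class ThreadLocalCacheSketch {

  // Hypothetical stand-in for a stateful handler that is not safe to share
  // between threads.
  static final class Handler {
    final int buckets;
    final int checksumBytes;

    Handler(int buckets, int checksumBytes) {
      this.buckets = buckets;
      this.checksumBytes = checksumBytes;
    }
  }

  // Each thread lazily receives its own cache instance.
  static final ThreadLocal<ThreadLocalCacheSketch> INSTANCE =
      ThreadLocal.withInitial(ThreadLocalCacheSketch::new);

  // SimpleEntry has value-based equals/hashCode, so it works as a pair key.
  private final Map<Map.Entry<Integer, Integer>, Handler> cache = new HashMap<>();

  private ThreadLocalCacheSketch() {}

  // Create the handler on first request for a given option pair, then reuse it.
  Handler get(int buckets, int checksumBytes) {
    return cache.computeIfAbsent(
        new AbstractMap.SimpleEntry<>(buckets, checksumBytes),
        kv -> new Handler(kv.getKey(), kv.getValue()));
  }

  public static void main(String[] args) {
    Handler first = INSTANCE.get().get(128, 1);
    Handler again = INSTANCE.get().get(128, 1);
    System.out.println(first == again); // same thread, same options
  }
}
```

A plain `HashMap` is sufficient here because the map is only ever reached through the thread-local, so no two threads touch the same instance.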

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/tlsh/TLSHHasher.java
----------------------------------------------------------------------
diff --git a/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/tlsh/TLSHHasher.java b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/tlsh/TLSHHasher.java
new file mode 100644
index 0000000..b04fbc7
--- /dev/null
+++ b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/utils/hashing/tlsh/TLSHHasher.java
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.metron.stellar.common.utils.hashing.tlsh;
+
+import com.trendmicro.tlsh.BucketOption;
+import com.trendmicro.tlsh.ChecksumOption;
+import org.apache.commons.codec.DecoderException;
+import org.apache.commons.codec.EncoderException;
+import org.apache.commons.codec.binary.Hex;
+import org.apache.metron.stellar.common.utils.ConversionUtils;
+import org.apache.metron.stellar.common.utils.SerDeUtils;
+import org.apache.metron.stellar.common.utils.hashing.EnumConfigurable;
+import org.apache.metron.stellar.common.utils.hashing.Hasher;
+
+import java.security.NoSuchAlgorithmException;
+import java.util.*;
+
+public class TLSHHasher implements Hasher {
+  public static final String TLSH_KEY = "tlsh";
+  public static final String TLSH_BIN_KEY = "tlsh_bin";
+
+  public enum Config implements EnumConfigurable {
+    BUCKET_SIZE("bucketSize"),
+    CHECKSUM("checksumBytes"),
+    HASHES("hashes"),
+    FORCE("force")
+    ;
+    final public String key;
+    Config(String key) {
+      this.key = key;
+    }
+
+    @Override
+    public String getKey() {
+      return key;
+    }
+  }
+
+  BucketOption bucketOption = BucketOption.BUCKETS_128;
+  ChecksumOption checksumOption = ChecksumOption.CHECKSUM_1B;
+  Boolean force = true;
+  List<Integer> hashes = new ArrayList<>();
+
+  /**
+   * Returns an encoded string representation of the hash value of the input. It is expected that
+   * this implementation does throw exceptions when the input is null.
+   *
+   * @param o The value to hash.
+   * @return A hash of {@code o} that has been encoded.
+   * @throws EncoderException         If unable to encode the hash then this exception occurs.
+   * @throws NoSuchAlgorithmException If the supplied algorithm is not known.
+   */
+  @Override
+  public Object getHash(Object o) throws EncoderException, NoSuchAlgorithmException {
+    TLSH tlsh = TLSHCache.INSTANCE.get().getTLSH(bucketOption, checksumOption);
+    byte[] data = null;
+    if(o instanceof String) {
+      data = ((String)o).getBytes();
+    }
+    else if(o instanceof byte[]) {
+      data = (byte[])o;
+    }
+    else {
+      data = SerDeUtils.toBytes(o);
+    }
+    try {
+      String hash = tlsh.apply(data, force);
+      if(hashes != null && hashes.size() > 0) {
+        Map<String, Object> ret = new HashMap<>();
+        ret.put(TLSH_KEY, hash);
+        ret.putAll(bin(hash));
+        return ret;
+      }
+      else {
+        return hash;
+      }
+    }
+    catch(IllegalStateException ise) {
+      return null;
+    } catch (DecoderException e) {
+      return null;
+    }
+  }
+
+  public Map<String, String> bin(String hash) throws DecoderException {
+    Random r = new Random(0);
+    byte[] h = Hex.decodeHex(hash.substring(2*checksumOption.getChecksumLength()).toCharArray());
+    BitSet vector = BitSet.valueOf(h);
+    int n = vector.length();
+    Map<String, String> ret = new HashMap<>();
+    boolean singleHash = hashes.size() == 1;
+    for(int numHashes : hashes) {
+      BitSet projection = new BitSet();
+      for (int i = 0; i < numHashes; ++i) {
+        int index = r.nextInt(n);
+        projection.set(i, vector.get(index));
+      }
+      String outputHash = numHashes + Hex.encodeHexString(projection.toByteArray());
+      if(singleHash) {
+        ret.put(TLSH_BIN_KEY, outputHash);
+      }
+      else {
+        ret.put(TLSH_BIN_KEY + "_" + numHashes, outputHash);
+      }
+    }
+    return ret;
+  }
+
+
+
+
+  @Override
+  public void configure(Optional<Map<String, Object>> config) {
+    if(config.isPresent() && !config.get().isEmpty()) {
+      bucketOption = Config.BUCKET_SIZE.get(config.get()
+              , o -> {
+                Integer bucketSize = ConversionUtils.convert(o, Integer.class);
+                switch (bucketSize) {
+                  case 128:
+                    return BucketOption.BUCKETS_128;
+                  case 256:
+                    return BucketOption.BUCKETS_256;
+                  default:
+                    return null;
+                }
+
+              }
+      ).orElse(bucketOption);
+
+      checksumOption = Config.CHECKSUM.get(config.get()
+              , o -> {
+                Integer checksumBytes= ConversionUtils.convert(o, Integer.class);
+                switch (checksumBytes) {
+                  case 1:
+                    return ChecksumOption.CHECKSUM_1B;
+                  case 3:
+                    return ChecksumOption.CHECKSUM_3B;
+                  default:
+                    return null;
+                }
+
+              }
+      ).orElse(checksumOption);
+
+      force = Config.FORCE.get(config.get()
+              , o -> ConversionUtils.convert(o, Boolean.class)
+      ).orElse(force);
+
+      hashes = Config.HASHES.get(config.get()
+              , o -> {
+                List<Integer> ret = new ArrayList<>();
+                if(o instanceof List) {
+                  List<? extends Object> vals = (List<? extends Object>)o;
+                  for(Object oVal : vals) {
+                    ret.add(ConversionUtils.convert(oVal, Integer.class));
+                  }
+                }
+                else {
+                  ret.add(ConversionUtils.convert(o, Integer.class));
+                }
+                return ret;
+              }
+      ).orElse(hashes);
+    }
+  }
+
+  public static final Set<String> supportedHashes() {
+    return new HashSet<String>() {{
+      add("TLSH");
+    }};
+  }
+
+}
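
The binning step in `TLSHHasher.bin` samples bit positions of the digest with a fixed-seed `Random`, so the same positions are sampled for every input and similar digests tend to land in the same bin. The sketch below reproduces that idea with JDK classes only. The class and method names are hypothetical, the hex encoding is done by hand instead of with commons-codec, an empty-vector guard is added, and a fresh `Random` is created per call, which is a simplification of the original, where one generator is shared across all requested sizes.

```java
import java.util.BitSet;
import java.util.Random;

class BitProjectionSketch {

  // Sample numBits positions of the digest (chosen by a seeded generator) and
  // return the sample as "<numBits><hex of sampled bits>".
  static String project(byte[] digest, int numBits, long seed) {
    BitSet vector = BitSet.valueOf(digest);
    int n = vector.length(); // index of the highest set bit, plus one
    if (n == 0) {
      // No set bits: nothing to sample, so the projection is empty.
      return Integer.toString(numBits);
    }
    Random r = new Random(seed);
    BitSet projection = new BitSet();
    for (int i = 0; i < numBits; ++i) {
      projection.set(i, vector.get(r.nextInt(n)));
    }
    return numBits + toHex(projection.toByteArray());
  }

  // Minimal lower-case hex encoder.
  private static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    byte[] digest = {0x12, 0x34, 0x56, 0x78};
    // The fixed seed makes repeated calls on the same digest agree.
    System.out.println(project(digest, 8, 0L).equals(project(digest, 8, 0L)));
  }
}
```

Fewer sampled bits give coarser bins that group more digests together, while more bits give finer bins; this is the trade-off the `hashes` option in the README exposes.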

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/dsl/functions/HashFunctions.java
----------------------------------------------------------------------
diff --git a/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/dsl/functions/HashFunctions.java b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/dsl/functions/HashFunctions.java
index 5e59b6e..660daf1 100644
--- a/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/dsl/functions/HashFunctions.java
+++ b/metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/dsl/functions/HashFunctions.java
@@ -18,16 +18,18 @@
 package org.apache.metron.stellar.dsl.functions;
 
 import org.apache.commons.codec.EncoderException;
-import org.apache.commons.codec.binary.Hex;
-import org.apache.metron.stellar.common.utils.hashing.DefaultHasher;
+import org.apache.metron.stellar.common.utils.ConversionUtils;
+import org.apache.metron.stellar.common.utils.hashing.HashStrategy;
+import org.apache.metron.stellar.common.utils.hashing.tlsh.TLSH;
+import org.apache.metron.stellar.common.utils.hashing.tlsh.TLSHHasher;
 import org.apache.metron.stellar.dsl.BaseStellarFunction;
 import org.apache.metron.stellar.dsl.Stellar;
 
-import java.nio.charset.StandardCharsets;
 import java.security.NoSuchAlgorithmException;
-import java.security.Security;
 import java.util.ArrayList;
 import java.util.List;
+import java.util.Map;
+import java.util.Optional;
 
 public class HashFunctions {
 
@@ -44,16 +46,27 @@ public class HashFunctions {
         throw new IllegalArgumentException("Invalid call. This function does 
not expect any arguments.");
       }
 
-      return new ArrayList<>(Security.getAlgorithms("MessageDigest"));
+      List<String> ret = new ArrayList<>();
+      ret.addAll(HashStrategy.ALL_SUPPORTED_HASHES);
+      return ret;
     }
   }
 
+
   @Stellar(
     name = "HASH",
     description = "Hashes a given value using the given hashing algorithm and 
returns a hex encoded string.",
     params = {
       "toHash - value to hash.",
       "hashType - A valid string representation of a hashing algorithm. See 
'GET_HASHES_AVAILABLE'.",
+      "config? - Configuration for the hash function in the form of a String 
to object map.\n"
+    + "          For forensic hash TLSH (see 
https://github.com/trendmicro/tlsh and Jonathan Oliver, Chun Cheng, and Yanggui 
Chen, TLSH - A Locality Sensitive Hash. 4th Cybercrime and Trustworthy 
Computing Workshop, Sydney, November 2013):\n"
+    + "          - bucketSize : This defines the size of the hash created.  
Valid values are 128 (default) or 256 (the former results in a 70 character 
hash and latter results in 134 characters) \n"
+    + "          - checksumBytes : This defines how many bytes are used to 
capture the checksum.  Valid values are 1 (default) and 3\n"
+    + "          - force : If true (the default) then a hash can be generated 
from as few as 50 bytes.  If false, then at least 256 bytes are required.  
Insufficient variation or size in the bytes result in a null being returned.\n"
+    + "          - hashes : You can compute a second hash for use in fuzzy 
clustering TLSH signatures.  The number of hashes is the lever to adjust the 
size of those clusters and how \"fuzzy\" the clusters are.  If this is specified, 
then one or more bins are created based on the specified size and the function 
will return a Map containing the bins.\n"
+    + "          For all other hashes:\n"
+    + "          - charset : The character set to use (UTF8 is default). \n"
     },
     returns = "A hex encoded string of a hashed value using the given 
algorithm. If 'hashType' is null " +
       "then '00', padded to the necessary length, will be returned. If 
'toHash' is not able to be hashed or " +
@@ -63,19 +76,25 @@ public class HashFunctions {
 
     @Override
     public Object apply(final List<Object> args) {
-      if (args == null || args.size() != 2) {
+      if (args == null || args.size() < 2) {
         throw new IllegalArgumentException("Invalid number of arguments: " + 
(args == null ? 0 : args.size()));
       }
 
       final Object toHash = args.get(0);
       final Object hashType = args.get(1);
-
       if (hashType == null) {
         return null;
       }
 
+      Map<String, Object> config = null;
+      if(args.size() > 2) {
+        Object configObj = args.get(2);
+        if(configObj instanceof Map && configObj != null) {
+          config = (Map<String, Object>)configObj;
+        }
+      }
       try {
-        return new DefaultHasher(hashType.toString(), new 
Hex(StandardCharsets.UTF_8)).getHash(toHash);
+        return HashStrategy.getHasher(hashType.toString(), 
Optional.ofNullable(config)).getHash(toHash);
       } catch (final EncoderException e) {
         return null;
       } catch (final NoSuchAlgorithmException e) {
@@ -83,4 +102,44 @@ public class HashFunctions {
       }
     }
   }
+
+  @Stellar(
+    name = "DIST",
+    namespace="TLSH",
+    params = {
+          "hash1 - The first TLSH hash",
+          "hash2 - The second TLSH hash",
+          "includeLength? - Include the length in the distance calculation or 
not?",
+          },
+    description = "Will return the hamming distance between two TLSH hashes 
(note: must be computed with the same params).  " +
+            "For more information, see https://github.com/trendmicro/tlsh and 
Jonathan Oliver, Chun Cheng, and Yanggui Chen, TLSH - A Locality Sensitive 
Hash. 4th Cybercrime and Trustworthy Computing Workshop, Sydney, November 2013. 
" +
+            "For a discussion of tradeoffs, see Table II on page 5 of 
https://github.com/trendmicro/tlsh/blob/master/TLSH_CTC_final.pdf",
+    returns = "An integer representing the distance between hash1 and hash2.  
The distance is roughly hamming distance, so 0 is very similar."
+  )
+  public static class TlshDist extends BaseStellarFunction {
+
+    @Override
+    public Integer apply(final List<Object> args) {
+      if (args == null || args.size() < 2) {
+        throw new IllegalArgumentException("Invalid call. This function 
requires at least 2 arguments: the two TLSH hashes.");
+      }
+      Object h1Obj = args.get(0);
+      Object h2Obj = args.get(1);
+      if(h1Obj != null && !(h1Obj instanceof String) ) {
+        throw new IllegalArgumentException(h1Obj + " must be strings");
+      }
+      if(h2Obj != null && !(h2Obj instanceof String) ) {
+        throw new IllegalArgumentException(h2Obj + " must be strings");
+      }
+
+      Optional<Boolean> includeLength = Optional.empty();
+      if(args.size() > 2) {
+        Object includeLengthArg = args.get(2);
+        if(includeLengthArg != null) {
+          includeLength = 
Optional.ofNullable(ConversionUtils.convert(includeLengthArg, Boolean.class));
+        }
+      }
+      return TLSH.distance(h1Obj == null?null:h1Obj.toString(), h2Obj == 
null?null:h2Obj.toString(), includeLength);
+    }
+  }
 }
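
Note on the `TLSH_DIST` description above, which calls the score "roughly hamming distance": the sketch below is a plain bitwise Hamming distance over two equal-length hex strings, written only to illustrate the properties the description relies on (identical inputs score 0, and the score is symmetric). It is an assumption-laden illustration, not the scoring performed by `TLSH.distance`; the class name `HexHammingSketch` and its method are hypothetical.

```java
final class HexHammingSketch {
  // Counts differing bits between two hex strings of equal length.
  // Illustrative only: the real TLSH score is computed by the TLSH library.
  static int distance(String a, String b) {
    if (a.length() != b.length()) {
      throw new IllegalArgumentException("hex strings must have equal length");
    }
    int d = 0;
    for (int i = 0; i < a.length(); ++i) {
      int x = Character.digit(a.charAt(i), 16);
      int y = Character.digit(b.charAt(i), 16);
      if (x < 0 || y < 0) {
        throw new IllegalArgumentException("not a hex character at index " + i);
      }
      // XOR leaves a 1 wherever the two nibbles disagree.
      d += Integer.bitCount(x ^ y);
    }
    return d;
  }

  public static void main(String[] args) {
    System.out.println(distance("6FF02BEF", "6FF02BEF"));
    System.out.println(distance("6FF02BEF", "6FF02BE0"));
  }
}
```

As with the Stellar function, a smaller value means the two inputs are closer.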

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/metron-stellar/stellar-common/src/test/java/org/apache/metron/stellar/dsl/functions/HashFunctionsTest.java
----------------------------------------------------------------------
diff --git 
a/metron-stellar/stellar-common/src/test/java/org/apache/metron/stellar/dsl/functions/HashFunctionsTest.java
 
b/metron-stellar/stellar-common/src/test/java/org/apache/metron/stellar/dsl/functions/HashFunctionsTest.java
index 31bc6d3..e0ba241 100644
--- 
a/metron-stellar/stellar-common/src/test/java/org/apache/metron/stellar/dsl/functions/HashFunctionsTest.java
+++ 
b/metron-stellar/stellar-common/src/test/java/org/apache/metron/stellar/dsl/functions/HashFunctionsTest.java
@@ -17,9 +17,13 @@
  */
 package org.apache.metron.stellar.dsl.functions;
 
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
 import org.apache.commons.codec.binary.Hex;
 import org.apache.commons.lang.SerializationUtils;
 import org.apache.commons.lang3.StringUtils;
+import org.apache.metron.stellar.common.utils.hashing.tlsh.TLSHHasher;
+import org.junit.Assert;
 import org.junit.Test;
 
 import java.io.Serializable;
@@ -27,18 +31,11 @@ import java.nio.charset.StandardCharsets;
 import java.security.MessageDigest;
 import java.security.NoSuchAlgorithmException;
 import java.security.Security;
-import java.util.Arrays;
-import java.util.Collection;
-import java.util.Collections;
-import java.util.HashMap;
-import java.util.List;
-import java.util.Map;
-import java.util.Set;
+import java.util.*;
+import java.util.concurrent.ForkJoinPool;
 
 import static org.apache.metron.stellar.common.utils.StellarProcessorUtils.run;
-import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertNull;
-import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.*;
 
 public class HashFunctionsTest {
   static final Hex HEX = new Hex(StandardCharsets.UTF_8);
@@ -199,6 +196,124 @@ public class HashFunctionsTest {
     assertNull(run("HASH(toHash, 'md5')", variables));
   }
 
+  public static String TLSH_DATA = "The best documentation is the UNIX source. 
After all, this is what the "
+            + "system uses for documentation when it decides what to do next! 
The "
+            + "manuals paraphrase the source code, often having been written 
at "
+            + "different times and by different people than who wrote the 
code. "
+            + "Think of them as guidelines. Sometimes they are more like 
wishes... "
+            + "Nonetheless, it is all too common to turn to the source and 
find "
+            + "options and behaviors that are not documented in the manual. 
Sometimes "
+            + "you find options described in the manual that are unimplemented 
"
+            + "and ignored by the source.";
+  String TLSH_EXPECTED = 
"6FF02BEF718027B0160B4391212923ED7F1A463D563B1549B86CF62973B197AD2731F8";
+
+  @Test
+  public void tlsh_happyPath() throws Exception {
+    final Map<String, Object> variables = new HashMap<>();
+
+    variables.put("toHash", TLSH_DATA);
+    variables.put("toHashBytes", TLSH_DATA.getBytes(StandardCharsets.UTF_8));
+    //this value is pulled from a canonical example at  
https://github.com/idealista/tlsh#how-to-calculate-a-hash
+    assertEquals(TLSH_EXPECTED, run("HASH(toHash, 'tlsh')", variables));
+    assertEquals(TLSH_EXPECTED, run("HASH(toHash, 'TLSH')", variables));
+    assertEquals(TLSH_EXPECTED, run("HASH(toHashBytes, 'tlsh')", variables));
+  }
+
+  @Test
+  public void tlsh_multiBin() throws Exception {
+    final Map<String, Object> variables = new HashMap<>();
+
+    variables.put("toHash", TLSH_DATA);
+    Map<String, String> out = (Map<String, String>)run("HASH(toHash, 'tlsh', { 
'hashes' : [ 8, 16, 32 ]} )", variables);
+
+    Assert.assertTrue(out.containsKey(TLSHHasher.TLSH_KEY));
+    for(int h : ImmutableList.of(8, 16, 32)) {
+      Assert.assertTrue(out.containsKey(TLSHHasher.TLSH_BIN_KEY + "_" + h));
+    }
+  }
+
+
+  @Test
+  public void tlsh_multithread() throws Exception {
+    //we want to ensure that everything is threadsafe, so we'll spin up some 
random data
+    //generate some hashes and then do it all in parallel and make sure it all 
matches.
+    Map<Map.Entry<byte[], Map<String, Object>>, String> hashes = new 
HashMap<>();
+    Random r = new Random(0);
+    for(int i = 0;i < 20;++i) {
+      byte[] d = new byte[256];
+      r.nextBytes(d);
+      Map<String, Object> config = new HashMap<String, Object>()
+      {{
+          put(TLSHHasher.Config.BUCKET_SIZE.key, r.nextBoolean() ? 128 : 256);
+          put(TLSHHasher.Config.CHECKSUM.key, r.nextBoolean() ? 1 : 3);
+      }};
+      String hash = (String)run("HASH(data, 'tlsh', config)", 
ImmutableMap.of("config", config, "data", d));
+      Assert.assertNotNull(hash);
+      hashes.put(new AbstractMap.SimpleEntry<>(d, config), hash);
+    }
+    ForkJoinPool forkJoinPool = new ForkJoinPool(5);
+
+    forkJoinPool.submit(() ->
+            hashes.entrySet().parallelStream().forEach(
+                   kv ->  {
+                     Map<String, Object> config = kv.getKey().getValue();
+                     byte[] data = kv.getKey().getKey();
+                     String hash = (String)run("HASH(data, 'tlsh', config)", 
ImmutableMap.of("config", config, "data", data));
+                     Assert.assertEquals(hash, kv.getValue());
+                   }
+            )
+    ).get();
+  }
+
+  @Test
+  public void tlsh_similarity() throws Exception {
+    for(Map.Entry<String, String> kv : ImmutableMap.of("been", "ben", 
"document", "dokumant", "code", "cad").entrySet()) {
+      Map<String, Object> variables = ImmutableMap.of("toHash", TLSH_DATA, 
"toHashSimilar", TLSH_DATA.replace(kv.getKey(), kv.getValue()));
+      Map<String, Object> bin1 = (Map<String, Object>) 
run("HASH(toHashSimilar, 'tlsh', { 'hashes' : 4, 'bucketSize' : 128 })", 
variables);
+      Map<String, Object> bin2 = (Map<String, Object>) run("HASH(toHash, 
'tlsh', { 'hashes' : [ 4 ], 'bucketSize' : 128 })", variables);
+      assertEquals(kv.getKey() + " != " + kv.getValue() + " because " + 
bin1.get("tlsh") + " != " + bin2.get("tlsh"), bin1.get("tlsh_bin"), 
bin2.get("tlsh_bin"));
+      assertNotEquals(bin1.get("tlsh"), bin2.get("tlsh"));
+      Map<String, Object> distVariables = ImmutableMap.of("hash1", 
bin1.get(TLSHHasher.TLSH_KEY), "hash2", bin2.get(TLSHHasher.TLSH_KEY));
+      {
+        //ensure the diff is minimal
+        Integer diff = (Integer) run("TLSH_DIST( hash1, hash2)", 
distVariables);
+        Integer diffReflexive = (Integer) run("TLSH_DIST( hash2, hash1)", 
distVariables);
+        Assert.assertTrue("diff == " + diff, diff < 100);
+        Assert.assertEquals(diff, diffReflexive);
+      }
+
+      {
+        //ensure that d(x,x) == 0
+        Integer diff = (Integer) run("TLSH_DIST( hash1, hash1)", 
distVariables);
+        Assert.assertEquals((int)0, (int)diff);
+      }
+    }
+  }
+
+  @Test(expected=Exception.class)
+  public void tlshDist_invalidInput() throws Exception {
+    final Map<String, Object> variables = new HashMap<>();
+    variables.put("hash1", 1);
+    variables.put("hash2", TLSH_EXPECTED);
+    run("TLSH_DIST( hash1, hash1)", variables);
+  }
+
+  @Test
+  public void tlsh_insufficientComplexity() throws Exception {
+    final Map<String, Object> variables = new HashMap<>();
+    String data = "Metron is the best";
+    variables.put("toHash", data);
+    assertNull(run("HASH(toHash, 'tlsh')", variables));
+  }
+
+  @Test
+  public void tlsh_nullInput() throws Exception {
+    final Map<String, Object> variables = new HashMap<>();
+    String data = null;
+    variables.put("toHash", data);
+    assertNull(run("HASH(toHash, 'tlsh')", variables));
+  }
+
   private String expectedHexString(MessageDigest expected) {
     return new String(HEX.encode(expected.digest()), StandardCharsets.UTF_8);
   }

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/pom.xml
----------------------------------------------------------------------
diff --git a/pom.xml b/pom.xml
index b22a2c9..9ae95eb 100644
--- a/pom.xml
+++ b/pom.xml
@@ -44,6 +44,10 @@
             <url>http://clojars.org/repo</url>
         </repository>
         <repository>
+          <id>jcenter</id>
+          <url>https://jcenter.bintray.com/</url>
+        </repository>
+        <repository>
             <releases>
                 <enabled>true</enabled>
                 <updatePolicy>always</updatePolicy>

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/use-cases/forensic_clustering/README.md
----------------------------------------------------------------------
diff --git a/use-cases/forensic_clustering/README.md 
b/use-cases/forensic_clustering/README.md
new file mode 100644
index 0000000..7aa3468
--- /dev/null
+++ b/use-cases/forensic_clustering/README.md
@@ -0,0 +1,433 @@
+# Problem Statement
+
+Having a forensic hash, such as [TLSH](https://github.com/trendmicro/tlsh), is 
a useful tool in cybersecurity.
+In short, the notion is that semantically similar documents should hash
+to a value which is also similar.  Contrast this with your standard
+cryptographic hashes, such as SHA and MD, where small deviations in the
+input data will yield large deviations in the hashes.
+
+The traditional use-case is to hash input documents or binaries and
+compare against a known blacklist of malicious hashes.  A sufficiently
+similar hash will indicate a match.  This will avoid malicious parties
+fuzzing input data to avoid detection.
+
+While this is interesting, it still requires metric-space searches in a 
blacklist.
+I envisioned a slightly more interesting streaming use-case of
+on-the-fly clustering of data.  While the TLSH hashes created do not
+necessarily hash to precisely the same value on similar documents, more
+traditional non-forensic hashes *do* collide when sufficiently similar.
+Namely, the Hamming distance
+[LSH](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Bit_sampling_for_Hamming_distance)
+applied to the TLSH hash would give us a way to bin semantic hashes such
+that similar hashes (by hamming distance) have the same hash.
+
+Inspired by a good
+[talk](https://github.com/fluenda/dataworks_summit_iot_botnet/blob/master/dws-fucs-lopresto.pdf)
 by Andy
+LoPresto and Andre Fucs de Miranda from Apache NiFi, we will proceed to
+take logs from the Cowrie honeypot and compute TLSH hashes and semantic
+bins so that users can easily find similarly malicious activity to known
+threats in logs.
+
+Consider the following excerpts from the Cowrie logs the authors above
+have shared:
+```
+{
+  "eventid": "cowrie.command.success"
+, "timestamp": "2017-09-18T11:45:25.028091Z"
+, "message": "Command found: /bin/busybox LSUCT"
+, "system": "CowrieTelnetTransport,787,121.237.129.163"
+, "isError": 0
+, "src_ip": "121.237.129.163"
+, "session": "21caf72c6358"
+, "input": "/bin/busybox LSUCT"
+, "sensor": "a927e8b28666"
+}
+```
+and
+```
+{
+  "eventid": "cowrie.command.success"
+, "timestamp": "2017-09-17T04:06:39.673206Z"
+, "message": "Command found: /bin/busybox XUSRH"
+, "system": "CowrieTelnetTransport,93,94.51.110.74"
+, "isError": 0
+, "src_ip": "94.51.110.74"
+, "session": "4c047bbc016c"
+, "input": "/bin/busybox XUSRH"
+, "sensor": "a927e8b28666"
+}
+```
+
+You will note the `/bin/busybox` call followed by a random string.  
+Excerpting from an analysis of an IOT exploit
+[here](https://isc.sans.edu/diary/21543):
+```
+The use of the command "busybox ECCHI" appears to have two functions.
+First of all, cowrie, and more "complete" Linux distrubtions then
+commonly found on DVRs will respond with a help screen if a wrong module
+is used. So this way, "ECCHI" can be used to detect honeypots and
+irrelevant systems if the reply isn't simply "ECCHI: applet not found".
+Secondly, the command is used as a market to indicate that the prior
+command finished. Later, the attacker adds "/bin/busybox ECCHI" at the
+end of each line, following the actual command to be executed.
+```
+
+We have a few options at our disposal:
+* If we were merely filtering and alerting on the execution of `/bin/busybox` 
we would include false positives.  
+* If we looked at `/bin/busybox XUSRH`, we'd miss many attempts with a 
*different* value as `XUSRH` is able to be swapped out for another random 
sequence to foil overly strict rules.
+* If we looked for `/bin/busybox *` then we'd capture this scenario well, but 
it'd be nice to be able to not be specific to detecting the `/bin/busybox` 
style of exploits.
+
+Indeed, this is precisely what semantic hashing and binning allow us:
+the ability to group by semantic similarity without being too specific
+about what we mean by "semantic" or "similar".  We want to cast a
+wide net, but not pull back every fish in the sea.
+
+For this demonstration, we will 
+* ingest some 400 cowrie records 
+* tag records from an IP blacklist for known malicious actors
+* use the alerts UI to investigate and find similar attacks.
+
+## Preliminaries
+
+We assume that the following environment variables are set:
+* `METRON_HOME` - the home directory for metron
+* `ZOOKEEPER` - The zookeeper quorum (comma separated with port specified: 
e.g. `node1:2181` for full-dev)
+* `BROKERLIST` - The Kafka broker list (comma separated with port specified: 
e.g. `node1:6667` for full-dev)
+* `ES_HOST` - The elasticsearch master (and port) e.g. `node1:9200` for 
full-dev.
+
+Also, this assumes that you are not using a kerberized cluster.  If you 
are, then the parser start command must be adjusted slightly to include the security 
protocol.
+
+Before editing configurations, be sure to pull the configs from zookeeper 
locally via
+```
+$METRON_HOME/bin/zk_load_configs.sh --mode PULL -z $ZOOKEEPER -o 
$METRON_HOME/config/zookeeper/ -f
+```
+
+## Setting up the Data
+
+First we must set up the cowrie log data in our cluster's access node.
+
+* Download the data from the github repository for the talk mentioned above 
[here](https://github.com/fluenda/dataworks_summit_iot_botnet/blob/master/180424243034750.tar.gz).
 Ensure that's moved into your home directory on the metron node.
+* Create a directory called `cowrie` in ~ and untar the tarball into that
+  directory via:
+```
+mkdir ~/cowrie
+cd ~/cowrie
+tar xzvf ~/180424243034750.tar.gz
+```
+
+## Configuring the Parser
+
+The Cowrie data is coming in as simple JSON blobs, so it's easy to
+parse.  We really just need to adjust the timestamp and a few fields and
+we have valid data.
+
+* Create `$METRON_HOME/config/zookeeper/parsers/cowrie.json` with the 
following content:
+```
+{
+  "parserClassName":"org.apache.metron.parsers.json.JSONMapParser",
+  "sensorTopic":"cowrie",
+  "fieldTransformations" : [
+    {
+    "transformation" : "STELLAR"
+   ,"output" : [ "timestamp"]
+   ,"config" : {
+      "timestamp" : "TO_EPOCH_TIMESTAMP( timestamp, 
'yyyy-MM-dd\\'T\\'HH:mm:ss.SSS')"
+               }
+    }
+                           ]
+
+}
+
+```
+
+Before we start, we will want to install ES mappings so ES knows how to 
interpret our fields:
+```
+curl -XPUT "http://$ES_HOST/cowrie*/_mapping/cowrie_doc" -d '
+{
+        "properties" : {
+          "adapter:stellaradapter:begin:ts" : {
+            "type" : "string"
+          },
+          "adapter:stellaradapter:end:ts" : {
+            "type" : "string"
+          },
+          "blacklisted" : {
+            "type" : "boolean"
+          },
+          "compCS" : {
+            "type" : "string"
+          },
+          "data" : {
+            "type" : "string"
+          },
+          "dst_ip" : {
+            "type" : "string"
+          },
+          "dst_port" : {
+            "type" : "long"
+          },
+          "duration" : {
+            "type" : "double"
+          },
+          "encCS" : {
+            "type" : "string"
+          },
+          "enrichmentjoinbolt:joiner:ts" : {
+            "type" : "string"
+          },
+          "enrichmentsplitterbolt:splitter:begin:ts" : {
+            "type" : "string"
+          },
+          "enrichmentsplitterbolt:splitter:end:ts" : {
+            "type" : "string"
+          },
+          "eventid" : {
+            "type" : "string"
+          },
+          "guid" : {
+            "type" : "string"
+          },
+          "input" : {
+            "type" : "string"
+          },
+          "isError" : {
+            "type" : "long"
+          },
+          "is_alert" : {
+            "type" : "string"
+          },
+          "kexAlgs" : {
+            "type" : "string"
+          },
+          "keyAlgs" : {
+            "type" : "string"
+          },
+          "macCS" : {
+            "type" : "string"
+          },
+          "message" : {
+            "type" : "string"
+          },
+          "original_string" : {
+            "type" : "string"
+          },
+          "password" : {
+            "type" : "string"
+          },
+          "sensor" : {
+            "type" : "string"
+          },
+          "session" : {
+            "type" : "string"
+          },
+          "similarity_bin" : {
+            "type" : "string"
+          },
+          "size" : {
+            "type" : "long"
+          },
+          "source:type" : {
+            "type" : "string"
+          },
+          "src_ip" : {
+            "type" : "string"
+          },
+          "src_port" : {
+            "type" : "long"
+          },
+          "system" : {
+            "type" : "string"
+          },
+          "threat:triage:rules:0:comment" : {
+            "type" : "string"
+          },
+          "threat:triage:rules:0:name" : {
+            "type" : "string"
+          },
+          "threat:triage:rules:0:reason" : {
+            "type" : "string"
+          },
+          "threat:triage:rules:0:score" : {
+            "type" : "long"
+          },
+          "threat:triage:score" : {
+            "type" : "double"
+          },
+          "threatinteljoinbolt:joiner:ts" : {
+            "type" : "string"
+          },
+          "threatintelsplitterbolt:splitter:begin:ts" : {
+            "type" : "string"
+          },
+          "threatintelsplitterbolt:splitter:end:ts" : {
+            "type" : "string"
+          },
+          "timestamp" : {
+            "type" : "long"
+          },
+          "tlsh" : {
+            "type" : "string"
+          },
+          "ttylog" : {
+            "type" : "string"
+          },
+          "username" : {
+            "type" : "string"
+          },
+          "version" : {
+            "type" : "string"
+          },
+          "alert" : {
+            "type" : "nested"
+          }
+        }
+}
+'
+```
+
+* Create the `cowrie` kafka topic via:
+```
+/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper $ZOOKEEPER 
--create --topic cowrie --partitions 1 --replication-factor 1
+```
+
+## Import the Blacklist
+
+Here, to build out a scenario, we will assume that we have a blacklist of 
known malicious hosts.  For our purposes, we'll choose 
+one particular host IP to be malicious.
+
+* Create `~/blacklist.csv` to contain the following:
+```
+94.51.110.74
+```
+* Create `~/blacklist_extractor.json` to contain the following:
+```
+{
+  "config" : {
+    "columns" : {
+       "ip" : 0
+    },
+    "indicator_column" : "ip",
+    "type" : "blacklist",
+    "separator" : ","
+  },
+  "extractor" : "CSV"
+}
+```
+* Import the data `$METRON_HOME/bin/flatfile_loader.sh -i ~/blacklist.csv -t 
threatintel -c t -e ~/blacklist_extractor.json`
+
+This will create a new enrichment type "blacklist" with a single entry 
"94.51.110.74".
+
+## Configure Enrichments
+
+We will want to do the following:
+* Add enrichments to facilitate binning
+  * Construct what we consider to be a sufficient representation of the thing 
we want to cluster.  For our purposes, this is centered around the input 
command, so that would be:
+    * The `message` field
+    * The `input` field
+    * The `isError` field
+  * Compute the TLSH hash of this representation, called `tlsh`
+  * Compute the locality sensitive hash of the TLSH hash suitable for binning, 
called `similarity_bin`
+* Set up the threat intelligence to use the blacklist
+  * Set an alert if the message is from an IP address in the threat 
intelligence blacklist.
+  * Score blacklisted messages with `10`.  In production, this would be more 
complex.
+
+Now, we can create the enrichments thusly by creating 
`$METRON_HOME/config/zookeeper/enrichments/cowrie.json` with the following 
content:
+```
+{
+  "enrichment": {
+    "fieldMap": {
+      "stellar" : {
+        "config" : [
+          "characteristic_rep := JOIN([ 'message', exists(message)?message:'', 
'input', exists(input)?input:'', 'isError', exists(isError)?isError:''], '|')",
+          "forensic_hashes := HASH(characteristic_rep, 'tlsh', { 'hashes' : 
16, 'bucketSize' : 128 })",
+          "similarity_bin := MAP_GET('tlsh_bin', forensic_hashes)",
+          "tlsh := MAP_GET('tlsh', forensic_hashes)",
+          "forensic_hashes := null",
+          "characteristic_rep := null"
+        ]
+      }
+   }
+  ,"fieldToTypeMap": { }
+  },
+  "threatIntel": {
+    "fieldMap": {
+      "stellar" : {
+        "config" : [
+          "blacklisted := ENRICHMENT_EXISTS( 'blacklist', src_ip, 
'threatintel', 't')",
+          "is_alert := (exists(is_alert) && is_alert) || blacklisted"
+        ]
+      }
+
+    },
+    "fieldToTypeMap": { },
+    "triageConfig" : {
+      "riskLevelRules" : [
+        {
+          "name" : "Blacklisted Host",
+          "comment" : "Determine if a host is blacklisted",
+          "rule" : "blacklisted != null && blacklisted",
+          "score" : 10,
+          "reason" : "FORMAT('IP %s is blacklisted', src_ip)"
+        }
+      ],
+      "aggregator" : "MAX"
+    }
+  }
+}
+```
+
+### A Note About Similarity Hashes and TLSH
+
+Notice that we have specified a number of hash functions of `16` when 
constructing the similarity bin.  
+I arrived at that by trial and error, which is not always tenable.  
A more sensible approach is 
+likely to construct *multiple* similarity bins, with sizes `8`, `16`, and `32` at 
minimum.
+* The smaller the number of hashes, the more loose the notion of similarity 
(more possibly dissimilar things would get grouped together).  
+* The larger the number of hashes, the more strict (similar things may not be 
grouped together).
+
+## Create the Data Loader
+
+We want to pull a snapshot of the cowrie logs, so create `~/load_data.sh` with 
the following content:
+```
+COWRIE_HOME=~/cowrie
+for i in cowrie.1626302-1636522.json cowrie.16879981-16892488.json 
cowrie.21312194-21331475.json cowrie.698260-710913.json 
cowrie.762933-772239.json cowrie.929866-939552.json cowrie.1246880-1248235.json 
cowrie.19285959-19295444.json cowrie.16542668-16581213.json 
cowrie.5849832-5871517.json cowrie.6607473-6609163.json;do
+  echo $i
+  cat $COWRIE_HOME/$i | 
/usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list 
node1:6667 --topic cowrie
+  sleep 2
+done
+```
+* Set the `+x` bit on the executable via:
+```
+chmod +x ~/load_data.sh
+```
+
+## Execute Demonstration
+
+From here, we've set up our configuration and can push the configs:
+* Push the configs to zookeeper via
+```
+$METRON_HOME/bin/zk_load_configs.sh --mode PUSH -z $ZOOKEEPER -i 
$METRON_HOME/config/zookeeper/
+```
+* Start the parser via:
+```
+$METRON_HOME/bin/start_parser_topology.sh -k $BROKERLIST -z $ZOOKEEPER -s 
cowrie
+```
+* Push cowrie data into the `cowrie` topic via
+```
+~/load_data.sh
+```
+
+Once this data is loaded, we can use the Alerts UI, starting from known 
malicious actors, to find others doing similar things.
+
+* First we can look at the alerts directly and find an instance of our 
`/bin/busybox` activity:
+![Alerts](find_alerts.png)
+
+* We can now pivot and look for instances of messages with the same 
`similarity_bin` but which are *not* alerts:
+![Pivot](clustered.png)
+
+As you can see, we have found a few more malicious actors:
+* 177.239.192.172
+* 180.110.69.182
+* 177.238.236.21
+* 94.78.80.45
+
+Now we can look at *other* things that they're doing to build and refine our 
definition of what an alert is without resorting to hard-coding of rules.  Note 
that nothing in our enrichments actually used the string `busybox`, so this is 
a more general purpose way of navigating similar things.
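
Note on the binning idea from the README's problem statement (bit sampling for Hamming distance, applied to the TLSH hash): the sketch below samples a fixed set of bit positions from a hex-encoded hash and uses the sampled bits as the bin key, so two hashes that differ only in unsampled bits land in the same bin. It is a minimal illustration under stated assumptions, not the code in `TLSHHasher`; the names `BitSamplingSketch`, `samplePositions`, `bitAt`, and `bin`, and the seed value, are hypothetical.

```java
import java.util.Random;

final class BitSamplingSketch {
  // Chooses numBits bit positions. The seed is fixed so that every input
  // is sampled at the same positions (positions may repeat).
  static int[] samplePositions(int totalBits, int numBits, long seed) {
    Random r = new Random(seed);
    int[] positions = new int[numBits];
    for (int i = 0; i < numBits; ++i) {
      positions[i] = r.nextInt(totalBits);
    }
    return positions;
  }

  // Reads one bit from a hex string, treating each character as 4 bits,
  // most significant bit first.
  static int bitAt(String hex, int bitIndex) {
    int nibble = Character.digit(hex.charAt(bitIndex / 4), 16);
    return (nibble >> (3 - (bitIndex % 4))) & 1;
  }

  // The bin key is the sampled bits written out as a string of 0s and 1s.
  static String bin(String hex, int[] positions) {
    StringBuilder sb = new StringBuilder();
    for (int p : positions) {
      sb.append(bitAt(hex, p));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // The 70-character hash used in HashFunctionsTest (280 bits).
    String hash = "6FF02BEF718027B0160B4391212923ED7F1A463D563B1549B86CF62973B197AD2731F8";
    int[] positions = samplePositions(hash.length() * 4, 16, 0L);
    System.out.println(bin(hash, positions));
  }
}
```

Sampling fewer positions makes collisions more likely (looser bins), while sampling more makes them rarer (stricter bins), which matches the trade-off described for the `hashes` setting.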

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/use-cases/forensic_clustering/clustered.png
----------------------------------------------------------------------
diff --git a/use-cases/forensic_clustering/clustered.png 
b/use-cases/forensic_clustering/clustered.png
new file mode 100644
index 0000000..fb09921
Binary files /dev/null and b/use-cases/forensic_clustering/clustered.png differ

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/use-cases/forensic_clustering/find_alerts.png
----------------------------------------------------------------------
diff --git a/use-cases/forensic_clustering/find_alerts.png 
b/use-cases/forensic_clustering/find_alerts.png
new file mode 100644
index 0000000..bb730ba
Binary files /dev/null and b/use-cases/forensic_clustering/find_alerts.png 
differ

http://git-wip-us.apache.org/repos/asf/metron/blob/7b6a3da6/use-cases/geographic_login_outliers/README.md
----------------------------------------------------------------------
diff --git a/use-cases/geographic_login_outliers/README.md 
b/use-cases/geographic_login_outliers/README.md
index 99e9a5b..bfa5234 100644
--- a/use-cases/geographic_login_outliers/README.md
+++ b/use-cases/geographic_login_outliers/README.md
@@ -223,7 +223,7 @@ We also want to set up a triage rule associating a score 
and setting an alert if
 From here, we've set up our configuration and can push the configs:
 * Push the configs to zookeeper via
 ```
-$METRON_HOME/bin/zk_load_configs.sh --mode PUSH -z node1:2181 -i 
$METRON_HOME/config/zookeeper/
+$METRON_HOME/bin/zk_load_configs.sh --mode PUSH -z $ZOOKEEPER -i 
$METRON_HOME/config/zookeeper/
 ```
 * Start the parser via:
 ```
@@ -231,7 +231,8 @@ $METRON_HOME/bin/start_parser_topology.sh -k $BROKERLIST -z 
$ZOOKEEPER -s auth
 ```
 * Push synthetic data into the `auth` topic via
 ```
-python ~/gen_data.py | 
/usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list 
node1:6667 --topic auth
+python ~/gen_data.py |
+/usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list 
$BROKERLIST --topic auth
 ```
 * Wait for about `5` minutes and kill the previous command
 * Push a synthetic record indicating `user1` has logged in from a russian IP 
(`109.252.227.173`):
