[GitHub] metron pull request #882: METRON-1380: Create a typosquatting use-case (comm...
Github user asfgit closed the pull request at: https://github.com/apache/metron/pull/882 ---
[GitHub] metron pull request #882: METRON-1380: Create a typosquatting use-case (comm...
Github user cestella commented on a diff in the pull request:

https://github.com/apache/metron/pull/882#discussion_r160269501

--- Diff: use-cases/typosquat_detection/README.md ---
@@ -0,0 +1,448 @@
+
+# Problem Statement
+
+[Typosquatting](https://en.wikipedia.org/wiki/Typosquatting) is a form of cybersquatting which relies on
+likely typos to trick unsuspecting users into visiting possibly malicious URLs. In the best case, this is a
+mischievous joke as in the following RickRoll: [http://www.latlmes.com/breaking/apache-metron-named-best-software-by-asf-1](http://www.latlmes.com/breaking/apache-metron-named-best-software-by-asf-1).
+In the worst case, however, it can be overtly malicious, as Bitcoin users found out in [2014](https://nakedsecurity.sophos.com/2014/03/24/bitcoin-user-loses-10k-to-typosquatters/)
+when thousands of dollars of Bitcoin were stolen as part of a phishing attack which used typosquatting.
+
+It is therefore useful for us to detect so-called typosquatting attacks as they appear over the network. We
+have had for some time, through the flatfile loader and open source typosquatting generation tools such
+as [DNS Twist](https://github.com/elceef/dnstwist), the ability to generate potential typosquatted domains,
+import them into HBase and look them up via `ENRICHMENT_EXISTS`.
+
+Though entirely viable, this approach has some challenges:
+* Even for modest numbers of domains, the number of records can grow quite large. The Top Alexa 10k domains have on the order of 3 million potential typosquatted domains.
+* It still requires a network hop if out of cache.
+
+# The Tools Metron Provides
+
+## Bloom Filters
+
+It would be nice to have a local solution for these types of problems that may trade off accuracy for better
+locality and space. Those who have been following the general theme of Metron's analytics philosophy will see
+that we are likely in the domain where a probabilistic sketching data structure is in order. In this case, we
+are asking simple existence queries, so a [Bloom Filter](https://en.wikipedia.org/wiki/Bloom_filter) fits
+well here.
+
+In Metron, we have the ability to create, add to and merge bloom filters via:
+* `BLOOM_INIT( size, fpp)` - Creates a bloom filter to handle `size` number of elements with `fpp` probability of false positives (`0 < fpp < 1`).
+* `BLOOM_ADD( filter, object)` - Adds an item to an existing bloom filter.
+* `BLOOM_MERGE( filters )` - Merges `filters`, a list of Bloom Filters.
+
+## Typosquatting Domain Generation
+
+Now that we have a suitable data structure, we need a way to generate potential typosquatted domains for a
+given domain. Following the good work of [DNS Twist](https://github.com/elceef/dnstwist), we have ported
+their set of typosquatting strategies to Metron:
+* Bitsquatting - See [here](http://dinaburg.org/bitsquatting.html)
+* Homoglyphs - Substituting characters for ASCII or Unicode analogues which are visually similar (e.g. `latlmes.com` for `latimes.com` as above)
+* Subdomain - Making part of the domain a subdomain (e.g. `am.azon.com`)
+* Hyphenation
+* Insertion
+* Addition
+* Omission
+* Repetition
+* Replacement
+* Transposition
+* Vowel swapping
+
+The Stellar function in Metron is `DOMAIN_TYPOSQUAT( domain )`. It is recommended to remove the TLD from the
+domain. You can see it in action here with our rick roll example above:
+```
+[Stellar]>>> 'latlmes' in DOMAIN_TYPOSQUAT( 'latimes')
+true
+```
+
+## Generating Summaries
+
+We need a way to generate the summary sketches from flat data for this to work. This is similar to, but
+somewhat different from, loading flat data into HBase. Instead of each row in the file being loaded
+generating a record in HBase, what we want is for each record to contribute to the summary sketch and at the
+end to write out the summary sketch.
+
+For this purpose, we have a new utility `$METRON_HOME/bin/flatfile_summarizer.sh` to accompany
+`$METRON_HOME/bin/flatfile_loader.sh`. The same extractor config is used, but we have 3 new configuration
+options:
+* `state_init` - Allows a state object to be initialized. This is a string, so a single expression is created. The output of this expression will be available as the `state` variable.
+* `state_update` - Allows a state object to be updated. This is a map, so you can have temporary variables here. Note that you can reference the `state` variable from this.
+* `state_merge` - Allows a list of states to be merged. This is a string, so a single expression. There is a special field called `states` available, which is a list of the states (one per thread). If this is not in existence, the
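The summarizer flow quoted in the diff above (initialize one state per thread, update it for every record, merge the per-thread states at the end) can be sketched in plain Python. This is only an illustration under stated assumptions, not Metron's implementation: `BloomFilter`, `simple_typos` and `summarize` are hypothetical names, the filter sizing uses the textbook Bloom filter formulas, and only three of the listed strategies (omission, repetition, transposition) are approximated.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter (hypothetical; not the filter Metron uses)."""

    def __init__(self, size, fpp):
        # Textbook sizing: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hashes.
        self.num_bits = max(1, math.ceil(-size * math.log(fpp) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / size * math.log(2)))
        self.bits = 0  # a Python int serves as the bit set

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos
        return self

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    @classmethod
    def merge(cls, filters):
        # Filters built with the same parameters merge by OR-ing their bits.
        first = filters[0]
        merged = cls.__new__(cls)
        merged.num_bits, merged.num_hashes, merged.bits = first.num_bits, first.num_hashes, 0
        for f in filters:
            if (f.num_bits, f.num_hashes) != (first.num_bits, first.num_hashes):
                raise ValueError("filters must share size and fpp")
            merged.bits |= f.bits
        return merged


def simple_typos(domain):
    """Rough stand-in for a typosquat generator: three strategies only."""
    typos = set()
    for i in range(len(domain)):
        typos.add(domain[:i] + domain[i + 1:])          # omission
        typos.add(domain[:i] + domain[i] + domain[i:])  # repetition
        if i + 1 < len(domain):                         # transposition
            typos.add(domain[:i] + domain[i + 1] + domain[i] + domain[i + 2:])
    typos.discard(domain)  # the original domain is not a typosquat
    typos.discard("")
    return typos


def summarize(partitions, size, fpp):
    states = []
    for rows in partitions:             # one partition per worker thread
        state = BloomFilter(size, fpp)  # plays the role of state_init
        for domain in rows:             # plays the role of state_update
            for typo in simple_typos(domain):
                state.add(typo)
        states.append(state)
    return BloomFilter.merge(states)    # plays the role of state_merge


summary = summarize([["latimes", "amazon"], ["apache"]], size=1000, fpp=0.001)
print("latmies" in summary)  # True: a Bloom filter has no false negatives
```

Merging by OR-ing bits only works when every per-thread filter was created with the same size and false-positive probability, which is why the sketch builds each state from one shared set of parameters.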
[GitHub] metron pull request #882: METRON-1380: Create a typosquatting use-case (comm...
Github user justinleet commented on a diff in the pull request: https://github.com/apache/metron/pull/882#discussion_r160245549 --- Diff: use-cases/typosquat_detection/README.md --- @@ -0,0 +1,448 @@
[GitHub] metron pull request #882: METRON-1380: Create a typosquatting use-case (comm...
Github user justinleet commented on a diff in the pull request: https://github.com/apache/metron/pull/882#discussion_r160241987 --- Diff: use-cases/typosquat_detection/README.md --- @@ -0,0 +1,448 @@
[GitHub] metron pull request #882: METRON-1380: Create a typosquatting use-case (comm...
Github user justinleet commented on a diff in the pull request:

https://github.com/apache/metron/pull/882#discussion_r159122512

--- Diff: use-cases/typosquat_detection/README.md ---
@@ -0,0 +1,431 @@
+# Problem Statement
--- End diff --

Can you please add the license header to this? https://github.com/apache/metron/pull/884 is close to going in and enforcing this, so I'm hoping to avoid impact to master.

---
[GitHub] metron pull request #882: METRON-1380: Create a typosquatting use-case (comm...
GitHub user cestella reopened a pull request: https://github.com/apache/metron/pull/882 METRON-1380: Create a typosquatting use-case (commit after METRON-1379, METRON-1377, METRON-1378)
[GitHub] metron pull request #882: METRON-1380: Create a typosquatting use-case (comm...
Github user cestella closed the pull request at: https://github.com/apache/metron/pull/882 ---
[GitHub] metron pull request #882: METRON-1380: Create a typosquatting use-case
GitHub user cestella opened a pull request:

https://github.com/apache/metron/pull/882

METRON-1380: Create a typosquatting use-case

## Contributor Comments

This is a documented use-case on how to use the following JIRAs (PRs) to detect typosquatting in-stream using bloom filters:
* METRON-1379 (#880)
* METRON-1377 (#878)
* METRON-1378 (#879)

The code here is a merger of the PRs above to allow reviewers to test the entire feature together.

The manual testing plan is to execute the typosquatting use-case [instructions](https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection).

## Pull Request Checklist

Thank you for submitting a contribution to Apache Metron.
Please refer to our [Development Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235) for the complete guide to follow for contributions.
Please refer also to our [Build Verification Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview) for complete smoke testing guides.

In order to streamline the review of the contribution we ask you to follow these guidelines and to double check the following:

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? If not one needs to be created at [Metron Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
- [x] Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [x] Has your PR been rebased against the latest commit within the target branch (typically master)?

### For code changes:
- [x] Have you included steps to reproduce the behavior or problem that is being changed or addressed?
- [x] Have you included steps or a guide to how the change may be verified and tested manually?
- [x] Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
  ```
  mvn -q clean integration-test install && build_utils/verify_licenses.sh
  ```
- [x] Have you written or updated unit tests and/or integration tests to verify your changes?
- [x] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [x] Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

### For documentation related changes:
- [x] Have you ensured that format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not then run the following commands and then verify changes via `site-book/target/site/index.html`:
  ```
  cd site-book
  mvn site
  ```

Note: Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that [travis-ci](https://travis-ci.org) is set up for your personal repository such that your branches are built there before submitting a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cestella/incubator-metron typosquat_merge

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/metron/pull/882.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #882

commit a95014ed1e145f9133dd95dcbfbf7e9212401fef
Author: cstella
Date: 2017-12-19T22:26:03Z

    METRON-1377: Stellar function to generate typosquatted domains (similar to dnstwist)

commit 9c492c4540534fa72550aff330ce6c588f640965
Author: cstella
Date: 2017-12-21T15:17:18Z

    flatfile summarizer initial commit.

commit 71e63b2604ad94c51423762582e547184169d8a2
Author: cstella
Date: 2017-12-21T15:20:48Z

    Don't want to generate original domain as it's not a typosquatted domain

commit 42af879d5fc1623fd9b24dd24af687292d9bcc73
Author: cstella
Date: 2017-12-21T16:20:10Z

    Fixed homoglyph bug with ACE domains

commit 7ee3ab14b81b0cb3fd899cf082050b7e3fade63e
Author: cstella
Date: 2017-12-21T17:04:58Z

    Persistent bug..

commit 15681143e86913a69270d0a89e1c877e3d99
Author: cstella
Date: 2017-12-21T18:50:58Z

    typo

commit 0d1e7b304b926bae65a2d6b4c63dec565542ad7e
Author: cstella
Date: 2017-12-21T18:51:50Z

    Weirdness with international domains.

commit 935d4d2933e7156219722e54cec5dfce228fdbcc
Author: cstella
Date: 2017-12-21T21:17:23Z

    Updating tests and docs.

commit afe91c341608468e2637db4a02f9428e