[jira] [Commented] (METRON-1677) UUIDv4 GUID is not Lucene friendly
[ https://issues.apache.org/jira/browse/METRON-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689996#comment-16689996 ]

ASF GitHub Bot commented on METRON-1677:

GitHub user nickwallen opened a pull request:

    https://github.com/apache/metron/pull/1269

    METRON-1677 UUIDv4 GUID is not Lucene friendly

With this change, when documents are written to Elasticsearch, the document ID is no longer set to the Metron GUID. It is left unset so that Elasticsearch can auto-generate it. Leaving the ID unset improves write performance into Elasticsearch, and the same holds for any Lucene-based indexer, including Solr. This work only covers Elasticsearch; the same should be done for Solr as part of a separate effort.

While the default table view looks the same, in the following screenshot I customized the table to show both the document ID and the GUID.

This change is dependent on the following open pull requests.
- [ ] #1247
- [ ] #1254
- [ ] #1259

## Changes

* The `ElasticsearchRetrieveLatestDao` was updated since the GUID is no longer the document ID. It now does a terms query on the GUID field instead of an ID query.
* The `Document` class now contains an optional documentID field. If the `Document` is retrieved from one of the DAOs, this field will be populated. When creating a new document, this field will be empty.
* Many of the integration tests had to be updated because the GUID and document ID are now different.
* The Alert UI was updated so that it visually looks the same. By default, the Metron GUID is still shown as one of the first columns in the table.
* The table now shows the document's GUID field rather than the document ID it showed before. The ID field remains and contains the document ID generated by Elasticsearch. The user can choose to add this to the table, if they like.

## Testing

1. Spin-up Full Dev.

1. Open up the Alerts UI and perform the following basic actions.
    * Search for alerts
    * Escalate an alert
    * Comment on an alert
    * Delete a comment from an alert
    * Create a meta-alert
    * Escalate a meta-alert

1. Click on the configure wheel and add the 'id' field to the table view. This will now display both the GUID and document ID in the table. They will, of course, be different.

1. Click on the 'guid' field in any row to filter the search results by the guid.

    ![screen shot 2018-11-14 at 2 44 21 pm](https://user-images.githubusercontent.com/2475409/48646597-53433300-e9b7-11e8-870b-f061af8cca47.png)

1. Click on the 'id' field to filter the search results by the document ID.

    ![screen shot 2018-11-14 at 2 44 08 pm](https://user-images.githubusercontent.com/2475409/48646566-3a3a8200-e9b7-11e8-8370-f596346d4a62.png)

1. Group by some fields to drill into the data. In the tree view, click on the 'guid' column and ensure the data sorts correctly. Do the same for the 'id' column that was added.

    ![screen shot 2018-11-14 at 2 46 43 pm](https://user-images.githubusercontent.com/2475409/48646519-1414e200-e9b7-11e8-96a2-50c568d909b2.png)

## Pull Request Checklist

- [ ] Is there a JIRA ticket associated with this PR? If not one needs to be created at [Metron Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
- [ ] Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [ ] Has your PR been rebased against the latest commit within the target branch (typically master)?
- [ ] Have you included steps to reproduce the behavior or problem that is being changed or addressed?
- [ ] Have you included steps or a guide to how the change may be verified and tested manually?
- [ ] Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
- [ ] Have you written or updated unit tests and/or integration tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nickwallen/metron METRON-1677

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/metron/pull/1269.patch

To close this pull request, make a commit to your
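
The lookup change listed under "Changes" in the pull request above (a terms query on the GUID field in place of an ID query) can be illustrated with the request bodies of the two Elasticsearch query types. This is only a sketch, not the code in `ElasticsearchRetrieveLatestDao`: the class `GuidLookupQueries` and its methods are hypothetical, the JSON is built by hand to avoid any client dependency, and it assumes the GUID is stored in a field named `guid` that is indexed as an exact (keyword) value.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper, not part of Metron: builds Elasticsearch query bodies as JSON. */
final class GuidLookupQueries {

  /** Earlier behavior: fetch by Elasticsearch document ID using an "ids" query. */
  static String idsQuery(List<String> documentIds) {
    return "{\"query\":{\"ids\":{\"values\":" + jsonArray(documentIds) + "}}}";
  }

  /** Behavior described in the PR: fetch by GUID using a "terms" query on the GUID field. */
  static String termsQueryOnGuid(String guidField, List<String> guids) {
    return "{\"query\":{\"terms\":{\"" + guidField + "\":" + jsonArray(guids) + "}}}";
  }

  // Minimal rendering; assumes the values need no JSON escaping (true for UUID strings).
  private static String jsonArray(List<String> values) {
    return values.stream()
        .map(v -> "\"" + v + "\"")
        .collect(Collectors.joining(",", "[", "]"));
  }

  public static void main(String[] args) {
    List<String> guids = Arrays.asList("123e4567-e89b-12d3-a456-426614174000");
    System.out.println(idsQuery(guids));
    System.out.println(termsQueryOnGuid("guid", guids));
  }
}
```

Because the GUID is now an ordinary field, a terms query may in principle match more than one document if the same GUID was indexed twice, which the ID query could not.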
[jira] [Commented] (METRON-1677) UUIDv4 GUID is not Lucene friendly
[ https://issues.apache.org/jira/browse/METRON-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593092#comment-16593092 ]

Ali Nazemian commented on METRON-1677:
--

[~simonellistonball] What if we still keep the GUID as an extra field in ES/Solr, but don't pass it as the document ID and let ES/Solr decide what to use? We would still need to give a Metron user who wants deduplication the ability to overwrite the ID at index time. No matter how Lucene friendly the document ID is, providing it from the indexing client is always slower, because it enables the deduplication pipeline and the index operation becomes an upsert.

> UUIDv4 GUID is not Lucene friendly
> --
>
> Key: METRON-1677
> URL: https://issues.apache.org/jira/browse/METRON-1677
> Project: Metron
> Issue Type: Bug
> Reporter: Ali Nazemian
> Priority: Major
>
> Using UUIDv4 by UUID.randomUUID() in Java is not Lucene friendly and impacts
> Elasticsearch and Solr indexing/search performance and makes it unpredictable
> sometimes.
> http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html
> Moreover, specifying doc id at the client side will impact indexing
> throughput due to enabling Elasticsearch deduplication policy and changing
> insert to upsert. Hence, indexing throughput can be increased by providing an
> ability to disable ID generation at the client side. Currently, the way ID is
> generated can be overwritten at the config level by replacing Metron default
> guid via Stellar, but it is not possible to disable it completely to let
> Elasticsearch decide what ID can be used for the corresponding document.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
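
The proposal in this comment (always keep the GUID as a field, and only optionally reuse it as the document ID) might look like the following when expressed as lines of an Elasticsearch bulk request. This is a sketch under assumptions, not Metron code: the class `BulkIndexAction`, the flag `useGuidAsDocumentId`, and the index name `bro_index` are hypothetical, and the document type is assumed to be supplied in the request URL rather than in the action line. It relies on the bulk API behavior that an `index` action without `_id` lets Elasticsearch generate the ID.

```java
/** Hypothetical sketch, not part of Metron: renders one entry of a bulk index request. */
final class BulkIndexAction {

  /**
   * Returns the two newline-terminated lines of a bulk "index" entry.
   * The GUID is always written as a field; it is used as the document ID
   * only when the (hypothetical) deduplication flag is enabled.
   */
  static String bulkEntry(String index, String guid, boolean useGuidAsDocumentId) {
    StringBuilder action = new StringBuilder("{\"index\":{\"_index\":\"")
        .append(index)
        .append("\"");
    if (useGuidAsDocumentId) {
      // Client-supplied ID: re-indexing the same GUID overwrites the earlier document.
      action.append(",\"_id\":\"").append(guid).append("\"");
    }
    action.append("}}");
    String source = "{\"guid\":\"" + guid + "\"}";
    return action + "\n" + source + "\n";
  }

  public static void main(String[] args) {
    String guid = "123e4567-e89b-12d3-a456-426614174000";
    System.out.print(bulkEntry("bro_index", guid, false)); // ID left to Elasticsearch
    System.out.print(bulkEntry("bro_index", guid, true));  // GUID doubles as the ID
  }
}
```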
[jira] [Commented] (METRON-1677) UUIDv4 GUID is not Lucene friendly
[ https://issues.apache.org/jira/browse/METRON-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593086#comment-16593086 ]

Ali Nazemian commented on METRON-1677:
--

Given that we are using ES/Solr for a time series use case, incorporating the timestamp into ID generation might be a good idea. We are working on a Stellar function that gives us a more Lucene friendly ID for this case. I will share the outcome once it's tested.
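
One way to bring the timestamp into ID generation is to put a fixed-width time prefix in front of a random suffix, so that IDs created close together share leading characters and sort by creation time. The sketch below is only an illustration of that idea; it is not the Stellar function mentioned in the comment, and the class `TimePrefixedId` and its ID layout are assumptions made for this example.

```java
import java.security.SecureRandom;

/** Hypothetical sketch, not part of Metron: IDs with a sortable timestamp prefix. */
final class TimePrefixedId {

  private static final SecureRandom RANDOM = new SecureRandom();

  /**
   * Returns 32 lowercase hex characters: 16 for the epoch milliseconds
   * (zero padded, so string order follows time order) and 16 random ones
   * to keep IDs created in the same millisecond distinct.
   */
  static String generate(long epochMillis) {
    return String.format("%016x%016x", epochMillis, RANDOM.nextLong());
  }

  public static void main(String[] args) {
    System.out.println(generate(System.currentTimeMillis()));
  }
}
```

With 64 random bits per millisecond, collisions are unlikely but not impossible, so a layout like this would still need to be tested against the uniqueness guarantees Metron expects from a GUID.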
[jira] [Commented] (METRON-1677) UUIDv4 GUID is not Lucene friendly
[ https://issues.apache.org/jira/browse/METRON-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547609#comment-16547609 ]

Simon Elliston Ball commented on METRON-1677:
--

This is an excellent point on the performance side. GUIDs still need to be generated by the Metron topologies to ensure consistency with the HDFS-based store (e.g. reindexing scenarios and consistency of the HBase-based update log for mutation). However, uuid1 may make more sense. The interesting element will be how we encode that. Binary encoding would be optimal, but we will need to consider the implications for JSON-friendly encoding of the binary uuid1 and other touchpoints for UUID use. This could be a pretty broad PR touching the DAO, REST and UI layers as well as the ingest pipeline.
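
The encoding trade-off raised in this comment can be made concrete with the JDK alone: a UUID is 16 bytes in binary, 36 characters in its usual text form, and 22 characters when the bytes are Base64 encoded with the URL-safe alphabet, which remains a plain JSON string. The sketch below is an illustration, not a proposal from the thread; the class `UuidCodec` is hypothetical, and because `java.util.UUID` has no version 1 generator, the example works on any UUID value.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.UUID;

/** Hypothetical sketch, not part of Metron: compact encodings of a UUID. */
final class UuidCodec {

  /** Binary form: the two 64-bit halves of the UUID, big endian, 16 bytes in total. */
  static byte[] toBytes(UUID uuid) {
    return ByteBuffer.allocate(16)
        .putLong(uuid.getMostSignificantBits())
        .putLong(uuid.getLeastSignificantBits())
        .array();
  }

  /** JSON-friendly form: URL-safe Base64 of the 16 bytes without padding (22 characters). */
  static String toBase64Url(UUID uuid) {
    return Base64.getUrlEncoder().withoutPadding().encodeToString(toBytes(uuid));
  }

  /** Inverse of toBase64Url. */
  static UUID fromBase64Url(String text) {
    ByteBuffer buffer = ByteBuffer.wrap(Base64.getUrlDecoder().decode(text));
    return new UUID(buffer.getLong(), buffer.getLong());
  }

  public static void main(String[] args) {
    UUID uuid = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
    String compact = toBase64Url(uuid);
    System.out.println(uuid + " -> " + compact + " -> " + fromBase64Url(compact));
  }
}
```

Any such encoding would have to be applied consistently at every touchpoint the comment lists (DAO, REST, UI, and the ingest pipeline), since the text form is what users see and filter on.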