[jira] [Commented] (METRON-1677) UUIDv4 GUID is not Lucene friendly

2018-11-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/METRON-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689996#comment-16689996
 ] 

ASF GitHub Bot commented on METRON-1677:


GitHub user nickwallen opened a pull request:

https://github.com/apache/metron/pull/1269

METRON-1677 UUIDv4 GUID is not Lucene friendly

With this change, when documents are written to Elasticsearch the document 
ID is no longer set as the Metron GUID, but instead left unset so that 
Elasticsearch can auto-generate it.  Doing this improves write performance into 
Elasticsearch.

This will also be the case for any Lucene based Indexer, including Solr.  
This work only covers Elasticsearch, but the same should be done for Solr as 
part of a separate effort.
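
As a rough illustration of the difference on the wire, here is a minimal sketch of the action line of an Elasticsearch bulk request with and without an explicit `_id`. The `BulkActionLine` helper and the index name are hypothetical and are not part of this PR. The sketch does no JSON escaping and leaves out `_type`, which older Elasticsearch versions also expect.

```java
import java.util.Optional;

// Hypothetical helper, not Metron code: builds the action line of an
// Elasticsearch bulk request for a single "index" operation.
class BulkActionLine {

    // When docId is empty, "_id" is omitted so Elasticsearch generates one.
    // Assumes index and docId contain no characters that need JSON escaping.
    static String indexAction(String index, Optional<String> docId) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"index\":{\"_index\":\"").append(index).append('"');
        docId.ifPresent(id -> sb.append(",\"_id\":\"").append(id).append('"'));
        sb.append("}}");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example values only; neither the index name nor the GUID is real.
        String guid = "a1b2c3d4-0000-4000-8000-000000000001";
        // Before: the Metron GUID is sent as the document ID.
        System.out.println(indexAction("example_index", Optional.of(guid)));
        // After: no ID is sent; the GUID stays in the document body as a field.
        System.out.println(indexAction("example_index", Optional.empty()));
    }
}
```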

While the default table view looks the same, in the following screenshot I 
customized the table to show both the document ID and the GUID. 

This change is dependent on the following open pull requests.
- [ ] #1247
- [ ] #1254 
- [ ] #1259  

## Changes

* The `ElasticsearchRetrieveLatestDao` was updated since the GUID is no 
longer the document ID.  It now does a terms query on the GUID field 
instead of an ID query.

* The `Document` class now contains an optional documentID field.  If the 
`Document` is retrieved from one of the DAOs this field will be populated.  
When creating a new document, this field will be empty.

* Many of the integration tests had to be updated because the GUID and 
document ID are now different.

* The Alert UI was updated so that it visually looks the same. By default, 
the Metron GUID is still shown as one of the first columns in the table.  

* The table is actually showing the document's GUID field instead of 
the document ID as it was before.  The ID field remains, which contains the 
document ID generated by Elasticsearch.  The user can choose to add this to the 
table, if they like.
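
The first two changes above could be sketched roughly as follows. This is not the code from the PR: `DocumentSketch` and `GuidLookupSketch` are simplified, hypothetical stand-ins, and the JSON is built by hand without escaping. The query bodies use the standard Elasticsearch `terms` and `ids` query forms, and the field name `guid` is assumed to match the indexed field.

```java
import java.util.Optional;

// Simplified stand-in for the Document class; the real class carries more state.
class DocumentSketch {
    private final String guid;
    private final Optional<String> documentID;

    private DocumentSketch(String guid, Optional<String> documentID) {
        this.guid = guid;
        this.documentID = documentID;
    }

    // A newly created document has no document ID yet.
    static DocumentSketch newDocument(String guid) {
        return new DocumentSketch(guid, Optional.empty());
    }

    // A document read back from the index carries the ID the index assigned.
    static DocumentSketch retrieved(String guid, String documentID) {
        return new DocumentSketch(guid, Optional.of(documentID));
    }

    String getGuid() { return guid; }
    Optional<String> getDocumentID() { return documentID; }

    public static void main(String[] args) {
        System.out.println(GuidLookupSketch.termsQueryOnGuid("example-guid"));
        System.out.println(GuidLookupSketch.idsQuery("example-doc-id"));
    }
}

// Hypothetical helper that renders the two lookup styles as query bodies.
class GuidLookupSketch {
    // New behavior: look the document up by its GUID field.
    static String termsQueryOnGuid(String guid) {
        return "{\"query\":{\"terms\":{\"guid\":[\"" + guid + "\"]}}}";
    }

    // Old behavior: look the document up by its document ID.
    static String idsQuery(String documentID) {
        return "{\"query\":{\"ids\":{\"values\":[\"" + documentID + "\"]}}}";
    }
}
```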


## Testing

1. Spin-up Full Dev.

1. Open up the Alerts UI and perform the following basic actions.
* Search for alerts
* Escalate an alert
* Comment on an alert
* Delete a comment from an alert
* Create a meta-alert
* Escalate a meta-alert

1. Click on the configure wheel and add the 'id' field to the table view.  
This will now display both the GUID and document ID in the table. They of 
course will be different.

1.  Click on the 'guid' field in any row to filter the search results by 
the guid.
![screen shot 2018-11-14 at 2 44 21 
pm](https://user-images.githubusercontent.com/2475409/48646597-53433300-e9b7-11e8-870b-f061af8cca47.png)

1. Click on the 'id' field to filter the search results by the document ID.
![screen shot 2018-11-14 at 2 44 08 
pm](https://user-images.githubusercontent.com/2475409/48646566-3a3a8200-e9b7-11e8-8370-f596346d4a62.png)

1. Group by some fields to drill into the data.  In the tree view, click on 
the 'guid' column and ensure the data sorts correctly.  Do the same for the 
'id' column that was added.
![screen shot 2018-11-14 at 2 46 43 
pm](https://user-images.githubusercontent.com/2475409/48646519-1414e200-e9b7-11e8-96a2-50c568d909b2.png)

## Pull Request Checklist

- [ ] Is there a JIRA ticket associated with this PR? If not one needs to 
be created at [Metron 
Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
- [ ] Does your PR title start with METRON-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
- [ ] Has your PR been rebased against the latest commit within the target 
branch (typically master)?
- [ ] Have you included steps to reproduce the behavior or problem that is 
being changed or addressed?
- [ ] Have you included steps or a guide to how the change may be verified 
and tested manually?
- [ ] Have you ensured that the full suite of tests and checks have been 
executed in the root metron folder via:
- [ ] Have you written or updated unit tests and or integration tests to 
verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] Have you verified the basic functionality of the build by building 
and running locally with Vagrant full-dev environment or the equivalent?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nickwallen/metron METRON-1677

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/metron/pull/1269.patch

To close this pull request, make a commit to your 

[jira] [Commented] (METRON-1677) UUIDv4 GUID is not Lucene friendly

2018-08-26 Thread Ali Nazemian (JIRA)


[ 
https://issues.apache.org/jira/browse/METRON-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593092#comment-16593092
 ] 

Ali Nazemian commented on METRON-1677:
--

[~simonellistonball] What if we still keep the GUID as an extra field in ES/Solr, 
but don't pass it as the document ID and let ES/Solr decide what to use? 
However, a Metron user who wants to enable deduplication still needs a way to 
override the ID at index time. No matter how Lucene friendly the document ID 
is, providing it from the indexing client is always slower, because it enables 
the deduplication pipeline and each index operation becomes an upsert. 
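
To make the upsert argument concrete, here is a toy in-memory model. It is not how Lucene or Elasticsearch is implemented; `ToyIndex` and its counters are invented purely for illustration. The point it shows is that a client-supplied ID forces a lookup for an existing document on every write, while an index-generated ID can be appended without that check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: an append-only list of documents plus an ID-to-position map.
class ToyIndex {
    private final List<String> segment = new ArrayList<>();
    private final Map<String, Integer> liveDocs = new HashMap<>();
    private long nextAutoId = 0;
    int idLookups = 0;  // how often an existing ID had to be looked up
    int deletes = 0;    // how often an older version had to be marked deleted

    // Client-supplied ID: the write is an upsert, so check for an old version first.
    void indexWithId(String id, String doc) {
        idLookups++;
        Integer oldPosition = liveDocs.get(id);
        if (oldPosition != null) {
            segment.set(oldPosition, null); // mark the old version as deleted
            deletes++;
        }
        segment.add(doc);
        liveDocs.put(id, segment.size() - 1);
    }

    // Index-generated ID: known to be new, so the write is a plain append.
    String indexAutoId(String doc) {
        String id = "auto-" + (nextAutoId++);
        segment.add(doc);
        liveDocs.put(id, segment.size() - 1);
        return id;
    }

    int liveCount() { return liveDocs.size(); }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.indexWithId("guid-1", "first version");
        index.indexWithId("guid-1", "second version"); // replaces the first
        index.indexAutoId("another document");
        System.out.println("live=" + index.liveCount()
            + " lookups=" + index.idLookups + " deletes=" + index.deletes);
    }
}
```

The same model also shows why deduplication needs the client-side ID: only `indexWithId` can recognize that two writes refer to the same document.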

 

 

> UUIDv4 GUID is not Lucene friendly
> --
>
> Key: METRON-1677
> URL: https://issues.apache.org/jira/browse/METRON-1677
> Project: Metron
>  Issue Type: Bug
>Reporter: Ali Nazemian
>Priority: Major
>
> Using UUIDv4 by UUID.randomUUID() in Java is not Lucene friendly and impacts 
> Elasticsearch and Solr indexing/search performance and makes it unpredictable 
> sometimes.
> http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html
> Moreover, specifying doc id at the client side will impact indexing 
> throughput due to enabling Elasticsearch deduplication policy and changing 
> insert to upsert. Hence, indexing throughput can be increased by providing an 
> ability to disable ID generation at the client side. Currently, the way ID is 
> generated can be overwritten at the config level by replacing Metron default 
> guid via Stellar, but it is not possible to disable it completely to let 
> Elasticsearch decide what ID can be used for the corresponding document.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (METRON-1677) UUIDv4 GUID is not Lucene friendly

2018-08-26 Thread Ali Nazemian (JIRA)


[ 
https://issues.apache.org/jira/browse/METRON-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593086#comment-16593086
 ] 

Ali Nazemian commented on METRON-1677:
--

Given we are using ES/Solr for a time series use case, bringing the timestamp 
into the ID generation might be a good idea. We are working on implementing a 
Stellar function to give us a more Lucene friendly ID for this case. I will 
share the outcome once it's tested.
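
One possible shape for such an ID, sketched in plain Java rather than as a Stellar function: a zero-padded hexadecimal timestamp prefix followed by a random suffix. The `TimePrefixedId` class and the 16+16 character layout are assumptions made for illustration and are not the function referred to above. IDs generated close together in time share a long common prefix and sort by time.

```java
import java.security.SecureRandom;

// Illustrative only: a time-prefixed ID, not the Stellar function from this thread.
class TimePrefixedId {
    private static final SecureRandom RANDOM = new SecureRandom();

    // 16 hex characters of epoch milliseconds, then 16 hex characters of randomness.
    static String generate(long epochMillis) {
        return String.format("%016x%016x", epochMillis, RANDOM.nextLong());
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // The two IDs differ only in their random suffix when created in the same millisecond.
        System.out.println(generate(now));
        System.out.println(generate(now));
    }
}
```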



[jira] [Commented] (METRON-1677) UUIDv4 GUID is not Lucene friendly

2018-07-18 Thread Simon Elliston Ball (JIRA)


[ 
https://issues.apache.org/jira/browse/METRON-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547609#comment-16547609
 ] 

Simon Elliston Ball commented on METRON-1677:
-

This is an excellent point on the performance side. GUIDs still need to be 
generated by Metron topologies to ensure consistency with HDFS-based storage 
(e.g. reindexing scenarios and consistency of the HBase-based update log for 
mutation). However, uuid1 may make more sense. The interesting element will be 
how we encode that. Binary encoding would be optimal, but we will need to 
consider the implications for JSON-friendly encoding of the binary uuid1 and 
other touchpoints for uuid use. This could be a pretty broad PR touching the 
DAO, REST and UI layers as well as the ingest pipeline.
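
To give a feel for the encoding trade-off, the sketch below packs a UUID into its 16 raw bytes and then into a URL-safe Base64 string that can be carried in JSON. The `UuidEncoding` class is hypothetical and Base64 is just one possible JSON-friendly choice. Note that `java.util.UUID` cannot generate version 1 UUIDs on its own, so the example parses a fixed, made-up version 1 value.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.UUID;

// Illustrative only: binary and JSON-friendly encodings of a UUID.
class UuidEncoding {

    // The raw 128 bits of the UUID as 16 bytes, most significant half first.
    static byte[] toBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
            .putLong(uuid.getMostSignificantBits())
            .putLong(uuid.getLeastSignificantBits())
            .array();
    }

    // 22-character URL-safe Base64 form, versus 36 characters for the usual text form.
    static String toBase64Url(UUID uuid) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(toBytes(uuid));
    }

    // Reverses toBase64Url.
    static UUID fromBase64Url(String encoded) {
        ByteBuffer buf = ByteBuffer.wrap(Base64.getUrlDecoder().decode(encoded));
        return new UUID(buf.getLong(), buf.getLong());
    }

    public static void main(String[] args) {
        // Made-up example value whose version field is 1.
        UUID uuid = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        String encoded = toBase64Url(uuid);
        System.out.println(uuid + " -> " + encoded);
        System.out.println("round trip ok: " + uuid.equals(fromBase64Url(encoded)));
    }
}
```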
