[jira] [Commented] (METRON-322) Global Batching and Flushing
[ https://issues.apache.org/jira/browse/METRON-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085020#comment-16085020 ] ASF GitHub Bot commented on METRON-322: --- Github user ottobackwards commented on the issue: https://github.com/apache/metron/pull/481 +1 > Global Batching and Flushing > > > Key: METRON-322 > URL: https://issues.apache.org/jira/browse/METRON-322 > Project: Metron > Issue Type: Improvement >Reporter: Ajay Yadav >Assignee: Matt Foley > > All Writers and other bolts that maintain an internal "batch" queue, need to > have a timeout flush, to prevent messages from low-volume telemetries from > sitting in their queues indefinitely. Storm has a timeout value > (topology.message.timeout.secs) that prevents it from waiting for too long. > If the Writer does not process the queue before the timeout, then Storm > recycles the tuples through the topology. This has multiple undesirable > consequences, including data duplication and waste of compute resources. We > would like to be able to specify an interval after which the queues would > flush, even if the batch size is not met. > We will utilize the Storm Tick Tuple to trigger timeout flushing, following > the recommendations of the article at > http://hortonworks.com/blog/apache-storm-design-pattern-micro-batching/#CONCLUSION > Since every Writer processes its queue somewhat differently, every bolt that > has a "batchSize" parameter will be given a "batchTimeout" parameter too. It > will default to 1/2 the value of "topology.message.timeout.secs", as > recommended, and will ignore settings larger than the default, which could > cause failure to flush in time. In the Enrichment topology, where two > Writers may be placed one after the other (enrichment and threat intel), the > default timeout interval will be 1/4 the value of > "topology.message.timeout.secs". The default value of > "topology.message.timeout.secs" in Storm is 30 seconds. > In addition, Storm provides a limit on the number of pending messages that > have not been acked. If more than "topology.max.spout.pending" messages are > waiting in a topology, then Storm will recycle them through the topology. > However, the default value of "topology.max.spout.pending" is null, and if > set to non-null value, the user can manage the consequences by setting > batchSize limits appropriately. Having the timeout flush will also > ameliorate this issue. So we do not need to address > "topology.max.spout.pending" directly in this task. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-322) Global Batching and Flushing
[ https://issues.apache.org/jira/browse/METRON-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084834#comment-16084834 ] ASF GitHub Bot commented on METRON-322: --- Github user ottobackwards commented on a diff in the pull request: https://github.com/apache/metron/pull/481#discussion_r127087711 --- Diff: metron-platform/metron-writer/src/main/java/org/apache/metron/writer/bolt/BulkMessageWriterBolt.java --- @@ -92,24 +177,58 @@ public void prepare(Map stormConf, TopologyContext context, OutputCollector coll configurationTransformation = x -> x; } try { - bulkMessageWriter.init(stormConf, - context, - configurationTransformation.apply(new IndexingWriterConfiguration(bulkMessageWriter.getName(), - getConfigurations())) -); + WriterConfiguration writerconf = configurationTransformation.apply( + new IndexingWriterConfiguration(bulkMessageWriter.getName(), getConfigurations())); + if (defaultBatchTimeout == 0) { +//This means getComponentConfiguration was never called to initialize defaultBatchTimeout, +//probably because we are in a unit test scenario. So calculate it here. +BatchTimeoutHelper timeoutHelper = new BatchTimeoutHelper(writerconf::getAllConfiguredTimeouts, batchTimeoutDivisor); +defaultBatchTimeout = timeoutHelper.getDefaultBatchTimeout(); + } + writerComponent.setDefaultBatchTimeout(defaultBatchTimeout); + bulkMessageWriter.init(stormConf, context, writerconf); } catch (Exception e) { throw new RuntimeException(e); } } + /** + * Used only for unit testing. + */ --- End diff -- let the intellisense fall where it may then > Global Batching and Flushing > > > Key: METRON-322 > URL: https://issues.apache.org/jira/browse/METRON-322 > Project: Metron > Issue Type: Improvement >Reporter: Ajay Yadav >Assignee: Matt Foley > > All Writers and other bolts that maintain an internal "batch" queue, need to > have a timeout flush, to prevent messages from low-volume telemetries from > sitting in their queues indefinitely. Storm has a timeout value > (topology.message.timeout.secs) that prevents it from waiting for too long. > If the Writer does not process the queue before the timeout, then Storm > recycles the tuples through the topology. This has multiple undesirable > consequences, including data duplication and waste of compute resources. We > would like to be able to specify an interval after which the queues would > flush, even if the batch size is not met. > We will utilize the Storm Tick Tuple to trigger timeout flushing, following > the recommendations of the article at > http://hortonworks.com/blog/apache-storm-design-pattern-micro-batching/#CONCLUSION > Since every Writer processes its queue somewhat differently, every bolt that > has a "batchSize" parameter will be given a "batchTimeout" parameter too. It > will default to 1/2 the value of "topology.message.timeout.secs", as > recommended, and will ignore settings larger than the default, which could > cause failure to flush in time. In the Enrichment topology, where two > Writers may be placed one after the other (enrichment and threat intel), the > default timeout interval will be 1/4 the value of > "topology.message.timeout.secs". The default value of > "topology.message.timeout.secs" in Storm is 30 seconds. > In addition, Storm provides a limit on the number of pending messages that > have not been acked. If more than "topology.max.spout.pending" messages are > waiting in a topology, then Storm will recycle them through the topology. > However, the default value of "topology.max.spout.pending" is null, and if > set to non-null value, the user can manage the consequences by setting > batchSize limits appropriately. Having the timeout flush will also > ameliorate this issue. So we do not need to address > "topology.max.spout.pending" directly in this task. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (METRON-1037) Stellar power function
Simon Elliston Ball created METRON-1037: --- Summary: Stellar power function Key: METRON-1037 URL: https://issues.apache.org/jira/browse/METRON-1037 Project: Metron Issue Type: Improvement Affects Versions: 0.4.0 Reporter: Simon Elliston Ball Stellar does not currently have a power function. We should have a function of the form POWER(number,power) named for consistency with SQL and Excel. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (METRON-1036) Stellar log function
Simon Elliston Ball created METRON-1036: --- Summary: Stellar log function Key: METRON-1036 URL: https://issues.apache.org/jira/browse/METRON-1036 Project: Metron Issue Type: Improvement Reporter: Simon Elliston Ball Priority: Critical Stellar does not currently have a natural log, or other base log function. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-322) Global Batching and Flushing
[ https://issues.apache.org/jira/browse/METRON-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084783#comment-16084783 ] ASF GitHub Bot commented on METRON-322: --- Github user mattf-horton commented on a diff in the pull request: https://github.com/apache/metron/pull/481#discussion_r127081873 --- Diff: metron-platform/metron-writer/src/main/java/org/apache/metron/writer/bolt/BulkMessageWriterBolt.java --- @@ -92,24 +177,58 @@ public void prepare(Map stormConf, TopologyContext context, OutputCollector coll configurationTransformation = x -> x; } try { - bulkMessageWriter.init(stormConf, - context, - configurationTransformation.apply(new IndexingWriterConfiguration(bulkMessageWriter.getName(), - getConfigurations())) -); + WriterConfiguration writerconf = configurationTransformation.apply( + new IndexingWriterConfiguration(bulkMessageWriter.getName(), getConfigurations())); + if (defaultBatchTimeout == 0) { +//This means getComponentConfiguration was never called to initialize defaultBatchTimeout, +//probably because we are in a unit test scenario. So calculate it here. +BatchTimeoutHelper timeoutHelper = new BatchTimeoutHelper(writerconf::getAllConfiguredTimeouts, batchTimeoutDivisor); +defaultBatchTimeout = timeoutHelper.getDefaultBatchTimeout(); + } + writerComponent.setDefaultBatchTimeout(defaultBatchTimeout); + bulkMessageWriter.init(stormConf, context, writerconf); } catch (Exception e) { throw new RuntimeException(e); } } + /** + * Used only for unit testing. + */ --- End diff -- Hm. With respect, this one I don't really agree with. It's just a polymorphism of the regular prepare() method. Since its signature differs from the "regular" prepare() method, system code will never call it. It is documented, at javadoc level, as being only for use in unit testing, which is by definition code-level. Those who aren't willing to look at code shouldn't be using it, and indeed will have a difficult time finding it at all. I'd rather let the polymorphism be evident. I will, however, add further comment that this is "used only for unit testing, with injection of a FakeClock for mocking passage of time in a controlled way, to test queue timeout behavior in the WriterComponent." Would that be okay? > Global Batching and Flushing > > > Key: METRON-322 > URL: https://issues.apache.org/jira/browse/METRON-322 > Project: Metron > Issue Type: Improvement >Reporter: Ajay Yadav >Assignee: Matt Foley > > All Writers and other bolts that maintain an internal "batch" queue, need to > have a timeout flush, to prevent messages from low-volume telemetries from > sitting in their queues indefinitely. Storm has a timeout value > (topology.message.timeout.secs) that prevents it from waiting for too long. > If the Writer does not process the queue before the timeout, then Storm > recycles the tuples through the topology. This has multiple undesirable > consequences, including data duplication and waste of compute resources. We > would like to be able to specify an interval after which the queues would > flush, even if the batch size is not met. > We will utilize the Storm Tick Tuple to trigger timeout flushing, following > the recommendations of the article at > http://hortonworks.com/blog/apache-storm-design-pattern-micro-batching/#CONCLUSION > Since every Writer processes its queue somewhat differently, every bolt that > has a "batchSize" parameter will be given a "batchTimeout" parameter too. It > will default to 1/2 the value of "topology.message.timeout.secs", as > recommended, and will ignore settings larger than the default, which could > cause failure to flush in time. In the Enrichment topology, where two > Writers may be placed one after the other (enrichment and threat intel), the > default timeout interval will be 1/4 the value of > "topology.message.timeout.secs". The default value of > "topology.message.timeout.secs" in Storm is 30 seconds. > In addition, Storm provides a limit on the number of pending messages that > have not been acked. If more than "topology.max.spout.pending" messages are > waiting in a topology, then Storm will recycle them through the topology. > However, the default value of "topology.max.spout.pending" is null, and if > set to non-null value, the user can manage the consequences by setting > batchSize limits appropriately. Having the timeout flush will also > ameliorate this issue. So we do not
[jira] [Commented] (METRON-1035) Documentation for rules triage aggregation does not include SUM
[ https://issues.apache.org/jira/browse/METRON-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084824#comment-16084824 ] ASF GitHub Bot commented on METRON-1035: GitHub user simonellistonball opened a pull request: https://github.com/apache/metron/pull/649 METRON-1035 Added SUM to the rules triage aggregation docs ## Contributor Comments Quick doc fix verified against code. ## Pull Request Checklist ### For all changes: - [x] Is there a JIRA ticket associated with this PR? If not one needs to be created at [Metron Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel). - [x] Does your PR title start with METRON- where is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically master)? ### For documentation related changes: - [x] Have you ensured that format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not then run the following commands and the verify changes via `site-book/target/site/index.html`: ``` cd site-book mvn site ``` Note: Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible. It is also recommended that [travis-ci](https://travis-ci.org) is set up for your personal repository such that your branches are built there before submitting a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/simonellistonball/incubator-metron METRON-1035 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/metron/pull/649.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #649 commit d2604086c52f00656ffab23b8b4eab1d30aeecee Author: Simon Elliston BallDate: 2017-07-12T22:12:31Z Added SUM to the rules triage aggregation docs > Documentation for rules triage aggregation does not include SUM > --- > > Key: METRON-1035 > URL: https://issues.apache.org/jira/browse/METRON-1035 > Project: Metron > Issue Type: Bug >Reporter: Simon Elliston Ball >Priority: Trivial > > The docs in the enrichment project does not indicate that SUM is an allowed > aggregation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (METRON-1035) Documentation for rules triage aggregation does not include SUM
Simon Elliston Ball created METRON-1035: --- Summary: Documentation for rules triage aggregation does not include SUM Key: METRON-1035 URL: https://issues.apache.org/jira/browse/METRON-1035 Project: Metron Issue Type: Bug Reporter: Simon Elliston Ball Priority: Trivial The docs in the enrichment project does not indicate that SUM is an allowed aggregation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-322) Global Batching and Flushing
[ https://issues.apache.org/jira/browse/METRON-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084729#comment-16084729 ] ASF GitHub Bot commented on METRON-322: --- Github user mattf-horton commented on a diff in the pull request: https://github.com/apache/metron/pull/481#discussion_r127077516 --- Diff: metron-platform/metron-enrichment/src/test/java/org/apache/metron/enrichment/bolt/BulkMessageWriterBoltTest.java --- @@ -164,4 +174,100 @@ public void test() throws Exception { verify(outputCollector, times(1)).emit(eq(Constants.ERROR_STREAM), any(Values.class)); verify(outputCollector, times(1)).reportError(any(Throwable.class)); } + + @Test --- End diff -- @ottobackwards , how about "Mocking sucks"? :-) Or at least "Mocking complex objects is excruciating." But seriously, yes I'll add some comments. Not planning on a line-by-line walk-thru, tho. > Global Batching and Flushing > > > Key: METRON-322 > URL: https://issues.apache.org/jira/browse/METRON-322 > Project: Metron > Issue Type: Improvement >Reporter: Ajay Yadav >Assignee: Matt Foley > > All Writers and other bolts that maintain an internal "batch" queue, need to > have a timeout flush, to prevent messages from low-volume telemetries from > sitting in their queues indefinitely. Storm has a timeout value > (topology.message.timeout.secs) that prevents it from waiting for too long. > If the Writer does not process the queue before the timeout, then Storm > recycles the tuples through the topology. This has multiple undesirable > consequences, including data duplication and waste of compute resources. We > would like to be able to specify an interval after which the queues would > flush, even if the batch size is not met. > We will utilize the Storm Tick Tuple to trigger timeout flushing, following > the recommendations of the article at > http://hortonworks.com/blog/apache-storm-design-pattern-micro-batching/#CONCLUSION > Since every Writer processes its queue somewhat differently, every bolt that > has a "batchSize" parameter will be given a "batchTimeout" parameter too. It > will default to 1/2 the value of "topology.message.timeout.secs", as > recommended, and will ignore settings larger than the default, which could > cause failure to flush in time. In the Enrichment topology, where two > Writers may be placed one after the other (enrichment and threat intel), the > default timeout interval will be 1/4 the value of > "topology.message.timeout.secs". The default value of > "topology.message.timeout.secs" in Storm is 30 seconds. > In addition, Storm provides a limit on the number of pending messages that > have not been acked. If more than "topology.max.spout.pending" messages are > waiting in a topology, then Storm will recycle them through the topology. > However, the default value of "topology.max.spout.pending" is null, and if > set to non-null value, the user can manage the consequences by setting > batchSize limits appropriately. Having the timeout flush will also > ameliorate this issue. So we do not need to address > "topology.max.spout.pending" directly in this task. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-322) Global Batching and Flushing
[ https://issues.apache.org/jira/browse/METRON-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084725#comment-16084725 ] ASF GitHub Bot commented on METRON-322: --- Github user mattf-horton commented on a diff in the pull request: https://github.com/apache/metron/pull/481#discussion_r127076248 --- Diff: metron-platform/metron-common/src/main/java/org/apache/metron/common/configuration/writer/SingleBatchConfigurationFacade.java --- @@ -18,6 +18,8 @@ package org.apache.metron.common.configuration.writer; +import java.util.ArrayList; +import java.util.List; import java.util.Map; --- End diff -- Well, there's some weirdness in this stuff, and I didn't try to get rid of it lest I break something I didn't understand. In this case (and other places), it appears that non-batched writers are essentially faked by using batched writers with a batchSize of "1". Doesn't seem real efficient to me, but obviously decreases the overall quantity of code and complexity of testing. Anyway, since I only did a trivial change to this file, I didn't feel obligated to try to explain it. > Global Batching and Flushing > > > Key: METRON-322 > URL: https://issues.apache.org/jira/browse/METRON-322 > Project: Metron > Issue Type: Improvement >Reporter: Ajay Yadav >Assignee: Matt Foley > > All Writers and other bolts that maintain an internal "batch" queue, need to > have a timeout flush, to prevent messages from low-volume telemetries from > sitting in their queues indefinitely. Storm has a timeout value > (topology.message.timeout.secs) that prevents it from waiting for too long. > If the Writer does not process the queue before the timeout, then Storm > recycles the tuples through the topology. This has multiple undesirable > consequences, including data duplication and waste of compute resources. We > would like to be able to specify an interval after which the queues would > flush, even if the batch size is not met. > We will utilize the Storm Tick Tuple to trigger timeout flushing, following > the recommendations of the article at > http://hortonworks.com/blog/apache-storm-design-pattern-micro-batching/#CONCLUSION > Since every Writer processes its queue somewhat differently, every bolt that > has a "batchSize" parameter will be given a "batchTimeout" parameter too. It > will default to 1/2 the value of "topology.message.timeout.secs", as > recommended, and will ignore settings larger than the default, which could > cause failure to flush in time. In the Enrichment topology, where two > Writers may be placed one after the other (enrichment and threat intel), the > default timeout interval will be 1/4 the value of > "topology.message.timeout.secs". The default value of > "topology.message.timeout.secs" in Storm is 30 seconds. > In addition, Storm provides a limit on the number of pending messages that > have not been acked. If more than "topology.max.spout.pending" messages are > waiting in a topology, then Storm will recycle them through the topology. > However, the default value of "topology.max.spout.pending" is null, and if > set to non-null value, the user can manage the consequences by setting > batchSize limits appropriately. Having the timeout flush will also > ameliorate this issue. So we do not need to address > "topology.max.spout.pending" directly in this task. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-322) Global Batching and Flushing
[ https://issues.apache.org/jira/browse/METRON-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084717#comment-16084717 ] ASF GitHub Bot commented on METRON-322: --- Github user mattf-horton commented on the issue: https://github.com/apache/metron/pull/481 Comment on testing: There are so many permutations it only seemed reasonable to automate them in unit test, and so I did. As part of code review, please provide your opinion on whether the provided unit tests are adequate, or what additional test cases should be added. Manual end-to-end testing, if you are so moved, consists of six scenarios for a given sensor queue: 1. Under **heavy continuous load** the batchSize still controls flushing behavior, because the queue size always exceeds batchSize before queue age exceeds batchTimeout. 2. Under **light continuous load**, where each queue continues to receive at least one message per second, and batchSize is large enough it is never exceeded, then the batchTimeout for each queue should control flushing behavior within +/- 1 sec, because each new message triggers a check of the queue age and potential timeout flush. - NOTE: If the configured batchTimeout is set to a large number, bigger than `1/2 topology.message.timeout.secs - 1` (which equals **14 sec** by default), then it will be replaced by an effective value equal to `1/2 topology.message.timeout.secs - 1`. Flushing will occur within +/- 1 sec of each _effective_ batchTimeout interval, rather than the _configured_ batchTimeout interval. 3. Under **light intermittent load**, where less than batchSize messages queue up, and gaps between messages may exceed the timeout interval, then age checks and potential flush events may be triggered by _either_ incoming messages or TickTuple events, depending on the phase relationship between intermittent bursts of messages, and the TickTuple system tick. The TickTuple interval is guaranteed to be < `1/2 topology.message.timeout.secs`, hence the default TickTuple interval is 14 seconds. But if the smallest batchTimeout configured for any sensor queue on the Bolt is < the default TickTuple interval, then that smallest value becomes the actual TickTuple interval. This produces three sub-cases, all of which guarantee a flush event before any message gets recycled due to aging past `topology.message.timeout.secs`: - If the queue's configured batchTimeout is the smallest (or only) such on this Bolt, and that number is smaller than the default TickTuple interval, then it _becomes_ the actual TickTuple interval. The queue is guaranteed to flush between 1x and 2x this interval. - If the queue's configured batchTimeout is not the smallest such, but still is < the default TickTuple interval, then the queue is guaranteed to flush between its own `configured batchTimeout` and its `configured batchTimeout + actual TickTuple interval` (which is less than 2x its own `configured batchTimeout`). - If the queue's configured batchTimeout is > the default TickTuple interval (14 sec default), then its effective batchTimeout is set to the default TickTuple interval. The queue is guaranteed to flush between this `effective batchTimeout` and its `effective batchTimeout + actual TickTuple interval`. The upshot is that: * "Configured batchTimeout" should be thought of as "minimum age before you'll allow a time-based flush" (capped by default TickTuple interval, aka 1/2 `topology.message.timeout.secs`) * "Actual TickTuple interval" is the "maximum time between age checks". It will be <= all the configured batchTimeouts for the various sensors on the Bolt. * When a flush actually happens may be up to "effective batchTimeout" + "actual TickTuple interval", depending on exactly when intermittent message events and periodic Tick events happen. > Global Batching and Flushing > > > Key: METRON-322 > URL: https://issues.apache.org/jira/browse/METRON-322 > Project: Metron > Issue Type: Improvement >Reporter: Ajay Yadav >Assignee: Matt Foley > > All Writers and other bolts that maintain an internal "batch" queue, need to > have a timeout flush, to prevent messages from low-volume telemetries from > sitting in their queues indefinitely. Storm has a timeout value > (topology.message.timeout.secs) that prevents it from waiting for too long. > If the Writer does not process the queue before the timeout, then Storm > recycles the tuples through the topology. This has multiple undesirable > consequences, including data duplication and waste of compute resources. We > would like to be able to specify an interval after which the queues would > flush, even if the batch size is not met. > We will utilize the Storm Tick Tuple to trigger timeout
[jira] [Commented] (METRON-322) Global Batching and Flushing
[ https://issues.apache.org/jira/browse/METRON-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084562#comment-16084562 ] ASF GitHub Bot commented on METRON-322: --- Github user ottobackwards commented on a diff in the pull request: https://github.com/apache/metron/pull/481#discussion_r127051759 --- Diff: metron-platform/metron-common/src/main/java/org/apache/metron/common/configuration/writer/SingleBatchConfigurationFacade.java --- @@ -18,6 +18,8 @@ package org.apache.metron.common.configuration.writer; +import java.util.ArrayList; +import java.util.List; import java.util.Map; --- End diff -- It would be nice if there was some javadoc on this class as to it's purpose and usage > Global Batching and Flushing > > > Key: METRON-322 > URL: https://issues.apache.org/jira/browse/METRON-322 > Project: Metron > Issue Type: Improvement >Reporter: Ajay Yadav >Assignee: Matt Foley > > All Writers and other bolts that maintain an internal "batch" queue, need to > have a timeout flush, to prevent messages from low-volume telemetries from > sitting in their queues indefinitely. Storm has a timeout value > (topology.message.timeout.secs) that prevents it from waiting for too long. > If the Writer does not process the queue before the timeout, then Storm > recycles the tuples through the topology. This has multiple undesirable > consequences, including data duplication and waste of compute resources. We > would like to be able to specify an interval after which the queues would > flush, even if the batch size is not met. > We will utilize the Storm Tick Tuple to trigger timeout flushing, following > the recommendations of the article at > http://hortonworks.com/blog/apache-storm-design-pattern-micro-batching/#CONCLUSION > Since every Writer processes its queue somewhat differently, every bolt that > has a "batchSize" parameter will be given a "batchTimeout" parameter too. It > will default to 1/2 the value of "topology.message.timeout.secs", as > recommended, and will ignore settings larger than the default, which could > cause failure to flush in time. In the Enrichment topology, where two > Writers may be placed one after the other (enrichment and threat intel), the > default timeout interval will be 1/4 the value of > "topology.message.timeout.secs". The default value of > "topology.message.timeout.secs" in Storm is 30 seconds. > In addition, Storm provides a limit on the number of pending messages that > have not been acked. If more than "topology.max.spout.pending" messages are > waiting in a topology, then Storm will recycle them through the topology. > However, the default value of "topology.max.spout.pending" is null, and if > set to non-null value, the user can manage the consequences by setting > batchSize limits appropriately. Having the timeout flush will also > ameliorate this issue. So we do not need to address > "topology.max.spout.pending" directly in this task. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-322) Global Batching and Flushing
[ https://issues.apache.org/jira/browse/METRON-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084569#comment-16084569 ] ASF GitHub Bot commented on METRON-322: --- Github user ottobackwards commented on a diff in the pull request: https://github.com/apache/metron/pull/481#discussion_r127053228 --- Diff: metron-platform/metron-writer/src/main/java/org/apache/metron/writer/bolt/BulkMessageWriterBolt.java --- @@ -92,24 +177,58 @@ public void prepare(Map stormConf, TopologyContext context, OutputCollector coll configurationTransformation = x -> x; } try { - bulkMessageWriter.init(stormConf, - context, - configurationTransformation.apply(new IndexingWriterConfiguration(bulkMessageWriter.getName(), - getConfigurations())) -); + WriterConfiguration writerconf = configurationTransformation.apply( + new IndexingWriterConfiguration(bulkMessageWriter.getName(), getConfigurations())); + if (defaultBatchTimeout == 0) { +//This means getComponentConfiguration was never called to initialize defaultBatchTimeout, +//probably because we are in a unit test scenario. So calculate it here. +BatchTimeoutHelper timeoutHelper = new BatchTimeoutHelper(writerconf::getAllConfiguredTimeouts, batchTimeoutDivisor); +defaultBatchTimeout = timeoutHelper.getDefaultBatchTimeout(); + } + writerComponent.setDefaultBatchTimeout(defaultBatchTimeout); + bulkMessageWriter.init(stormConf, context, writerconf); } catch (Exception e) { throw new RuntimeException(e); } } + /** + * Used only for unit testing. + */ --- End diff -- Maybe we can name this testPrepare? So it is explicit to caller, who may not 'peek' inside the source > Global Batching and Flushing > > > Key: METRON-322 > URL: https://issues.apache.org/jira/browse/METRON-322 > Project: Metron > Issue Type: Improvement >Reporter: Ajay Yadav >Assignee: Matt Foley > > All Writers and other bolts that maintain an internal "batch" queue, need to > have a timeout flush, to prevent messages from low-volume telemetries from > sitting in their queues indefinitely. Storm has a timeout value > (topology.message.timeout.secs) that prevents it from waiting for too long. > If the Writer does not process the queue before the timeout, then Storm > recycles the tuples through the topology. This has multiple undesirable > consequences, including data duplication and waste of compute resources. We > would like to be able to specify an interval after which the queues would > flush, even if the batch size is not met. > We will utilize the Storm Tick Tuple to trigger timeout flushing, following > the recommendations of the article at > http://hortonworks.com/blog/apache-storm-design-pattern-micro-batching/#CONCLUSION > Since every Writer processes its queue somewhat differently, every bolt that > has a "batchSize" parameter will be given a "batchTimeout" parameter too. It > will default to 1/2 the value of "topology.message.timeout.secs", as > recommended, and will ignore settings larger than the default, which could > cause failure to flush in time. In the Enrichment topology, where two > Writers may be placed one after the other (enrichment and threat intel), the > default timeout interval will be 1/4 the value of > "topology.message.timeout.secs". The default value of > "topology.message.timeout.secs" in Storm is 30 seconds. > In addition, Storm provides a limit on the number of pending messages that > have not been acked. If more than "topology.max.spout.pending" messages are > waiting in a topology, then Storm will recycle them through the topology. > However, the default value of "topology.max.spout.pending" is null, and if > set to non-null value, the user can manage the consequences by setting > batchSize limits appropriately. Having the timeout flush will also > ameliorate this issue. So we do not need to address > "topology.max.spout.pending" directly in this task. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-322) Global Batching and Flushing
[ https://issues.apache.org/jira/browse/METRON-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084577#comment-16084577 ] ASF GitHub Bot commented on METRON-322: --- Github user ottobackwards commented on the issue: https://github.com/apache/metron/pull/481 I am seeing errors like ```bash --- T E S T S --- Running org.apache.metron.solr.integration.SolrIndexingIntegrationTest Waiting for zookeeper... Found index config in zookeeper... 0 vs 10 vs 0 0 vs 10 vs 0 0 vs 10 vs 0 0 vs 10 vs 0 0 vs 10 vs 0 0 vs 10 vs 0 Processed target/indexingIntegrationTest/hdfs/test/enrichment-hdfsIndexingBolt-1-0-1499888936956.json 10 vs 10 vs 10 Processed target/indexingIntegrationTest/hdfs/test/enrichment-hdfsIndexingBolt-1-0-1499888936956.json 2017-07-12 15:48:57 ERROR ClientCnxn:524 - Error while calling watcher java.util.concurrent.RejectedExecutionException: Task org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1@9651f04 rejected from org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@d5a9af0[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 16] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:135) at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) at org.apache.solr.common.cloud.SolrZkClient$3.process(SolrZkClient.java:261) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 2017-07-12 15:49:01 ERROR ReadClusterState:181 - Failed to Sync Supervisor java.lang.RuntimeException: java.lang.InterruptedException at org.apache.storm.utils.Utils.wrapInRuntime(Utils.java:1507) at org.apache.storm.zookeeper.Zookeeper.getChildren(Zookeeper.java:260) at org.apache.storm.cluster.ZKStateStorage.get_children(ZKStateStorage.java:174) at org.apache.storm.cluster.StormClusterStateImpl.assignments(StormClusterStateImpl.java:153) at org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:126) at org.apache.storm.event.EventManagerImp$1.run(EventManagerImp.java:54) Caused by: java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at org.apache.storm.shade.org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342) at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1588) at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1625) at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:210) at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:203) at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108) at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:200) at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:191) at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:38) at org.apache.storm.zookeeper.Zookeeper.getChildren(Zookeeper.java:255) ... 4 more Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 45.675 sec - in org.apache.metron.solr.integration.SolrIndexingIntegrationTest Results : ``` Have you seen these? > Global Batching and Flushing > > > Key: METRON-322 > URL: https://issues.apache.org/jira/browse/METRON-322 > Project: Metron > Issue Type: Improvement >Reporter: Ajay Yadav >Assignee: Matt Foley > > All Writers and other bolts that maintain an internal "batch" queue, need to > have a timeout flush, to prevent messages from low-volume telemetries from > sitting in their queues indefinitely. Storm has a timeout value > (topology.message.timeout.secs) that prevents it from waiting for too long. > If the Writer does not process the queue before the timeout,
[jira] [Commented] (METRON-322) Global Batching and Flushing
[ https://issues.apache.org/jira/browse/METRON-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084570#comment-16084570 ] ASF GitHub Bot commented on METRON-322: --- Github user ottobackwards commented on the issue: https://github.com/apache/metron/pull/481 This is a lot of work, and a great job @mattf-horton. A couple of nit comments I would like to get response to before a +1 > Global Batching and Flushing > > > Key: METRON-322 > URL: https://issues.apache.org/jira/browse/METRON-322 > Project: Metron > Issue Type: Improvement >Reporter: Ajay Yadav >Assignee: Matt Foley > > All Writers and other bolts that maintain an internal "batch" queue, need to > have a timeout flush, to prevent messages from low-volume telemetries from > sitting in their queues indefinitely. Storm has a timeout value > (topology.message.timeout.secs) that prevents it from waiting for too long. > If the Writer does not process the queue before the timeout, then Storm > recycles the tuples through the topology. This has multiple undesirable > consequences, including data duplication and waste of compute resources. We > would like to be able to specify an interval after which the queues would > flush, even if the batch size is not met. > We will utilize the Storm Tick Tuple to trigger timeout flushing, following > the recommendations of the article at > http://hortonworks.com/blog/apache-storm-design-pattern-micro-batching/#CONCLUSION > Since every Writer processes its queue somewhat differently, every bolt that > has a "batchSize" parameter will be given a "batchTimeout" parameter too. It > will default to 1/2 the value of "topology.message.timeout.secs", as > recommended, and will ignore settings larger than the default, which could > cause failure to flush in time. In the Enrichment topology, where two > Writers may be placed one after the other (enrichment and threat intel), the > default timeout interval will be 1/4 the value of > "topology.message.timeout.secs". The default value of > "topology.message.timeout.secs" in Storm is 30 seconds. > In addition, Storm provides a limit on the number of pending messages that > have not been acked. If more than "topology.max.spout.pending" messages are > waiting in a topology, then Storm will recycle them through the topology. > However, the default value of "topology.max.spout.pending" is null, and if > set to non-null value, the user can manage the consequences by setting > batchSize limits appropriately. Having the timeout flush will also > ameliorate this issue. So we do not need to address > "topology.max.spout.pending" directly in this task. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (METRON-1020) minor changes to release process
[ https://issues.apache.org/jira/browse/METRON-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Foley reassigned METRON-1020: -- Assignee: Matt Foley > minor changes to release process > > > Key: METRON-1020 > URL: https://issues.apache.org/jira/browse/METRON-1020 > Project: Metron > Issue Type: Improvement >Reporter: Matt Foley >Assignee: Matt Foley > > The following proposed changes are not just editorial in nature, hence will > request vote of the community to change. Regarding the process at > https://cwiki.apache.org/confluence/display/METRON/Release+Process : > * Add a step to tag the final release, as > "apache-metron--release". > * The current policy says that when a critical release is urgently needed, > "the 72 hour waiting periods in Steps 7 and 8 can be waived." The formerly > referenced Step 8 was for the Incubator vote, so that can be removed as an > editorial issue, but we should also allow for not waiting for mirror > propagation -- let the mirrors catch up as fast as they can. So the text > should now read: "the 72 hour waiting period in Step 7 and the wait for > mirror propagation in Step 10 can be waived." > * Finally, it is good practice to increment the build version in POMs > immediately AFTER a release, so that builds with new stuff cannot be mistaken > for builds of the release version. The current policy says to increment it > just BEFORE a release. I suggest changing this to say: > ** immediately after a release, increment the MINOR version number (eg, with > the 0.4.0 just released, set the new version number to 0.4.1) > ** immediately before a release, decide whether it will be a minor or major > release. If minor, assure that the minor version number was already > incremented after the last release and continue to use that number. If > major, change the version number to the desired new major version. > ** These version number changes are in master branch. Creation of new > branches does not occur until the idea of creating a maintenance branch or a > new release branch has been consented by the community. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-1033) Profiler example uses incorrect units for expires
[ https://issues.apache.org/jira/browse/METRON-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084264#comment-16084264 ] ASF GitHub Bot commented on METRON-1033: GitHub user simonellistonball opened a pull request: https://github.com/apache/metron/pull/648 METRON-1033 Corrected profiler docs units on expires field Minor change to update profiler docs ## Contributor Comments [Please place any comments here. A description of the problem/enhancement, how to reproduce the issue, your testing methodology, etc.] ## Pull Request Checklist Thank you for submitting a contribution to Apache Metron. Please refer to our [Development Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235) for the complete guide to follow for contributions. Please refer also to our [Build Verification Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview) for complete smoke testing guides. In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following: ### For all changes: - [x] Is there a JIRA ticket associated with this PR? If not one needs to be created at [Metron Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel). - [x] Does your PR title start with METRON- where is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically master)? ### For documentation related changes: - [x] Have you ensured that format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not then run the following commands and the verify changes via `site-book/target/site/index.html`: You can merge this pull request into a Git repository by running: $ git pull https://github.com/simonellistonball/incubator-metron METRON-1033 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/metron/pull/648.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #648 commit 880e10cfc31f49cb11a3f5639c2c0a217743c044 Author: Simon Elliston BallDate: 2017-07-12T16:30:25Z Corrected profiler docs units on expires field > Profiler example uses incorrect units for expires > - > > Key: METRON-1033 > URL: https://issues.apache.org/jira/browse/METRON-1033 > Project: Metron > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Simon Elliston Ball >Priority: Trivial > Labels: newbie > > The profiler documentation states that expires is in ms, but examples show > (correctly) days. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (METRON-1033) Profiler example uses incorrect units for expires
[ https://issues.apache.org/jira/browse/METRON-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Elliston Ball updated METRON-1033: Description: The profiler documentation states that expires is in ms, but examples show (correctly) days. (was: The profiler example in the README use days as a unit, but the value is expected in ms. ) > Profiler example uses incorrect units for expires > - > > Key: METRON-1033 > URL: https://issues.apache.org/jira/browse/METRON-1033 > Project: Metron > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Simon Elliston Ball >Priority: Trivial > Labels: newbie > > The profiler documentation states that expires is in ms, but examples show > (correctly) days. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (METRON-1034) Profiler Definition Expires Field Doc'd as Milliseconds, but is Days
Nick Allen created METRON-1034: -- Summary: Profiler Definition Expires Field Doc'd as Milliseconds, but is Days Key: METRON-1034 URL: https://issues.apache.org/jira/browse/METRON-1034 Project: Metron Issue Type: Bug Reporter: Nick Allen Assignee: Nick Allen There is an inconsistency in the profiler README. The sentence in the table is wrong, the rest of the docs are correct. *_Wrong_* https://github.com/apache/metron/tree/master/metron-analytics/metron-profiler#creating-profiles {code} expires OptionalProfile data is purged after this period of time, specified in m̶i̶l̶l̶i̶s̶e̶c̶o̶n̶d̶s̶ [days] {code} *_Correct_* https://github.com/apache/metron/tree/master/metron-analytics/metron-profiler#expires -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-1013) StellarShell does not validate parameters
[ https://issues.apache.org/jira/browse/METRON-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084063#comment-16084063 ] ASF GitHub Bot commented on METRON-1013: Github user asfgit closed the pull request at: https://github.com/apache/metron/pull/639 > StellarShell does not validate parameters > - > > Key: METRON-1013 > URL: https://issues.apache.org/jira/browse/METRON-1013 > Project: Metron > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Otto Fowler >Assignee: Otto Fowler > > If you pass an invalid url to -z for example - you will get a wait and then > an exception crashing the shell. > In the case of -z, I am not sure url is the right name, since we want > host:port, with host being a valid name or ip address and port. > We also do use a default port if none is provided, we probably should. > A 'dev' default for the whole thing would be 'node1:2181', to match our > vagrant environments. I propose that we use that. > Thus: > * no -z => node1:2181 > * -z -> validate host name, if no port then 2181 > * if port then validate port -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-1033) Profiler example uses incorrect units for expires
[ https://issues.apache.org/jira/browse/METRON-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084081#comment-16084081 ] Nick Allen commented on METRON-1033: The `expires` field is expected to be in days (not milliseconds). > Profiler example uses incorrect units for expires > - > > Key: METRON-1033 > URL: https://issues.apache.org/jira/browse/METRON-1033 > Project: Metron > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Simon Elliston Ball >Priority: Trivial > Labels: newbie > > The profiler example in the README use days as a unit, but the value is > expected in ms. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-1013) StellarShell does not validate parameters
[ https://issues.apache.org/jira/browse/METRON-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083962#comment-16083962 ] ASF GitHub Bot commented on METRON-1013: Github user cestella commented on the issue: https://github.com/apache/metron/pull/639 +1 by inspection > StellarShell does not validate parameters > - > > Key: METRON-1013 > URL: https://issues.apache.org/jira/browse/METRON-1013 > Project: Metron > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Otto Fowler >Assignee: Otto Fowler > > If you pass an invalid url to -z for example - you will get a wait and then > an exception crashing the shell. > In the case of -z, I am not sure url is the right name, since we want > host:port, with host being a valid name or ip address and port. > We also do use a default port if none is provided, we probably should. > A 'dev' default for the whole thing would be 'node1:2181', to match our > vagrant environments. I propose that we use that. > Thus: > * no -z => node1:2181 > * -z -> validate host name, if no port then 2181 > * if port then validate port -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (METRON-1033) Profiler example uses incorrect units for expires
Simon Elliston Ball created METRON-1033: --- Summary: Profiler example uses incorrect units for expires Key: METRON-1033 URL: https://issues.apache.org/jira/browse/METRON-1033 Project: Metron Issue Type: Bug Affects Versions: 0.4.0 Reporter: Simon Elliston Ball Priority: Trivial The profiler example in the README use days as a unit, but the value is expected in ms. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-1022) Elasticsearch REST endpoint
[ https://issues.apache.org/jira/browse/METRON-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083864#comment-16083864 ] ASF GitHub Bot commented on METRON-1022: Github user cestella commented on the issue: https://github.com/apache/metron/pull/636 @ottobackwards agreed, we should work to that goal. I think we might want to make baby steps, though, and create the abstractions first and move to making it pluggable later. > Elasticsearch REST endpoint > --- > > Key: METRON-1022 > URL: https://issues.apache.org/jira/browse/METRON-1022 > Project: Metron > Issue Type: New Feature >Reporter: Ryan Merriman > > We need a "search" endpoint that will allow basic lucene-style searches with > sorting and pagination options. This endpoint should have a light > abstraction on top to make it simpler to consume and possibly allow different > search engines to be used in the future. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-1022) Elasticsearch REST endpoint
[ https://issues.apache.org/jira/browse/METRON-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083852#comment-16083852 ] ASF GitHub Bot commented on METRON-1022: Github user ottobackwards commented on the issue: https://github.com/apache/metron/pull/636 @cestella https://issues.apache.org/jira/browse/METRON-956 Is the jira I had created on this topic. > Elasticsearch REST endpoint > --- > > Key: METRON-1022 > URL: https://issues.apache.org/jira/browse/METRON-1022 > Project: Metron > Issue Type: New Feature >Reporter: Ryan Merriman > > We need a "search" endpoint that will allow basic lucene-style searches with > sorting and pagination options. This endpoint should have a light > abstraction on top to make it simpler to consume and possibly allow different > search engines to be used in the future. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-539) Add stellar keywords for hashing
[ https://issues.apache.org/jira/browse/METRON-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083815#comment-16083815 ] ASF GitHub Bot commented on METRON-539: --- Github user jjmeyer0 commented on a diff in the pull request: https://github.com/apache/metron/pull/641#discussion_r126926090 --- Diff: metron-stellar/stellar-common/src/test/java/org/apache/metron/stellar/dsl/functions/HashFunctionsTest.java --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.metron.stellar.dsl.functions; + +import com.google.common.io.BaseEncoding; +import org.apache.commons.lang.SerializationUtils; +import org.junit.Test; + +import java.io.Serializable; +import java.nio.charset.StandardCharsets; +import java.security.MessageDigest; +import java.security.NoSuchAlgorithmException; +import java.security.Security; +import java.util.Arrays; +import java.util.Collections; +import java.util.HashMap; +import java.util.Map; +import java.util.Set; + +import static org.apache.metron.stellar.common.utils.StellarProcessorUtils.run; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNull; + +public class HashFunctionsTest { + final HashFunctions.Hash hash = new HashFunctions.Hash(); + + @Test(expected = IllegalArgumentException.class) + public void nullArgumentListShouldThrowException() throws Exception { +hash.apply(null); + } + + @Test(expected = IllegalArgumentException.class) + public void emptyArgumentListShouldThrowException() throws Exception { +hash.apply(Collections.emptyList()); + } + + @Test(expected = IllegalArgumentException.class) + public void singleArgumentListShouldThrowException() throws Exception { +hash.apply(Collections.singletonList("some value.")); + } + + @Test(expected = IllegalArgumentException.class) + public void argumentListWithMoreThanTwoValuesShouldThrowException3() throws Exception { +hash.apply(Arrays.asList("1", "2", "3")); + } + + @Test(expected = IllegalArgumentException.class) + public void argumentListWithMoreThanTwoValuesShouldThrowException4() throws Exception { +hash.apply(Arrays.asList("1", "2", "3", "4")); + } + + @Test(expected = IllegalArgumentException.class) + public void invalidAlgorithmArgumentShouldThrowException() throws Exception { +hash.apply(Arrays.asList("value to hash", "invalidAlgorithm")); + } + + @Test + public void invalidNullAlgorithmArgumentShouldThrowException() throws Exception { +assertNull(hash.apply(Arrays.asList("value to hash", null))); + } + + @Test + public void nullInputForValueToHashShouldProperlyThrowException() throws Exception { +assertNull(hash.apply(Arrays.asList(null, "md5"))); + } --- End diff -- oh boy, @mattf-horton that makes much more sense. forgive me i had a long day yesterday. I had a moment haha > Add stellar keywords for hashing > > > Key: METRON-539 > URL: https://issues.apache.org/jira/browse/METRON-539 > Project: Metron > Issue Type: Improvement >Reporter: Jon Zeolla >Priority: Minor > > Stellar should have the ability to natively hash values using a prefix of TO > or HASH, then the algorithm, and an optional length. For instance, > "TO_SHA3_256" or "HASH_MD5." -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-1022) Elasticsearch REST endpoint
[ https://issues.apache.org/jira/browse/METRON-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083646#comment-16083646 ] ASF GitHub Bot commented on METRON-1022: Github user cestella commented on the issue: https://github.com/apache/metron/pull/636 I just want to follow-up with something a bit more concrete suggestions. I think the beginnings of an abstraction are there. You pulled out a bunch of utility methods from `ElasticsearchWriter` which is good. I'm looking for a DAO-like abstraction in `metron-elasticsearch` and the `ServiceImpl` in the REST layer uses the generic DAO implementation (which would not live in the REST layer so it's usable for lower level access) and the index impl is specified by at runtime. The impl for the DAO would live in one of `metron-elasticsearch` or `metron-solr`. Again, these are my $0.02 and just suggestions. > Elasticsearch REST endpoint > --- > > Key: METRON-1022 > URL: https://issues.apache.org/jira/browse/METRON-1022 > Project: Metron > Issue Type: New Feature >Reporter: Ryan Merriman > > We need a "search" endpoint that will allow basic lucene-style searches with > sorting and pagination options. This endpoint should have a light > abstraction on top to make it simpler to consume and possibly allow different > search engines to be used in the future. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (METRON-1022) Elasticsearch REST endpoint
[ https://issues.apache.org/jira/browse/METRON-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083633#comment-16083633 ] ASF GitHub Bot commented on METRON-1022: Github user cestella commented on the issue: https://github.com/apache/metron/pull/636 Looking at this implementation and working a bit on the PoC for index data mutation, I think the abstraction here isn't in the right place. It's too bound-up in the REST layer whereas data access has traditionally existed further down in the abstraction layer. It's a smell that metron-rest has to depend on ES at all. That dependency should be provided at runtime entirely. Also, we're going to need data access in batch as well and I'd like a sensible abstraction in place to enable that (e.g. merging modified records when reading as part of a Spark or MR job). It seems to me that we need a DAO abstraction living in `metron-writer` or even `metron-indexing` for interacting with indices with the concrete implementations existing in `metron-elasticsearch` and `metron-solr` for ES and Solr respectively. I think `metron-rest` should work entirely on abstract classes. The REST layer should choose the DAO implementation via reflection and we should adjust the classpath in the shell script to include one of `metron-elasticsearch` or `metron-solr`. In essence, I'm recommending something similar to how JDBC works where you choose your driver implementation via reflection (often) and work with base JDBC classes from there. > Elasticsearch REST endpoint > --- > > Key: METRON-1022 > URL: https://issues.apache.org/jira/browse/METRON-1022 > Project: Metron > Issue Type: New Feature >Reporter: Ryan Merriman > > We need a "search" endpoint that will allow basic lucene-style searches with > sorting and pagination options. This endpoint should have a light > abstraction on top to make it simpler to consume and possibly allow different > search engines to be used in the future. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (METRON-1032) Metron Ambari mpack is not blueprint friendly
Ali Nazemian created METRON-1032: Summary: Metron Ambari mpack is not blueprint friendly Key: METRON-1032 URL: https://issues.apache.org/jira/browse/METRON-1032 Project: Metron Issue Type: Bug Affects Versions: 0.4.0 Reporter: Ali Nazemian Priority: Minor In order to deploy a platform using Ambari blueprint, one can divide generic and non-generic parts to two different files. In this case, the generic parts should go to the main blueprint and the non-generic parts like passwords or the fields that come with hostnames should go to the host-group mapping file. This is the way that all HDP and HDF components have been implemented. However, it is not right for Metron and all the components that come with HCP. All the following fields should be included in the main blueprint. Otherwise, you cannot kick start the blueprint. { metron-env= [ storm_rest_addr, es_hosts, zeppelin_server_url, metron_jdbc_url ] }, { kibana-env= [ kibana_es_url ] } -- This message was sent by Atlassian JIRA (v6.4.14#64029)