[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=123540&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-123540 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 16/Jul/18 08:48 Start Date: 16/Jul/18 08:48 Worklog Time Spent: 10m Work Description: lgajowy commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-405182970 Yay! Nice to see this merged! Thanks @pabloem and @iemejia. :) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 123540) Time Spent: 8h 50m (was: 8h 40m) > Performance tests need a way to generate Synthetic data > --- > > Key: BEAM-4432 > URL: https://issues.apache.org/jira/browse/BEAM-4432 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Minor > Time Spent: 8h 50m > Remaining Estimate: 0h > > GenerateSequence falls short in this regard, as we may want to generate > data in custom distributions, or with specific repeatability requirements / > and hardcoded delays for autoscaling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=123414&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-123414 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 15/Jul/18 14:56 Start Date: 15/Jul/18 14:56 Worklog Time Spent: 10m Work Description: pabloem closed pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy b/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy index 49147654ba9..3b7ce252161 100644 --- a/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy +++ b/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy @@ -341,6 +341,7 @@ class BeamModulePlugin implements Plugin { commons_io_1x : "commons-io:commons-io:1.3.2", commons_io_2x : "commons-io:commons-io:2.5", commons_lang3 : "org.apache.commons:commons-lang3:3.6", +commons_math3 : "org.apache.commons:commons-math3:3.6.1", datastore_v1_proto_client : "com.google.cloud.datastore:datastore-v1-proto-client:1.4.0", datastore_v1_protos : "com.google.cloud.datastore:datastore-v1-protos:1.3.0", error_prone_annotations : "com.google.errorprone:error_prone_annotations:2.0.15", diff --git a/sdks/java/io/synthetic/build.gradle b/sdks/java/io/synthetic/build.gradle new file mode 100644 index 000..3d6be8eb13c --- /dev/null +++ b/sdks/java/io/synthetic/build.gradle @@ -0,0 +1,36 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +apply plugin: org.apache.beam.gradle.BeamModulePlugin +applyJavaNature() + +description = "Apache Beam :: SDKs :: Java :: IO :: Synthetic" +ext.summary = "Generators of Synthetic IO for Testing." + +dependencies { + compile library.java.joda_time + compile library.java.commons_math3 + shadow library.java.jackson_core + shadow library.java.jackson_annotations + shadow library.java.jackson_databind + testCompile library.java.guava + testCompile library.java.junit + testCompile library.java.hamcrest_core + testCompile library.java.hamcrest_library + shadow project(path: ":beam-sdks-java-core", configuration: "shadow") +} diff --git a/sdks/java/io/synthetic/src/main/java/org/apache/beam/sdk/io/synthetic/SyntheticBoundedIO.java b/sdks/java/io/synthetic/src/main/java/org/apache/beam/sdk/io/synthetic/SyntheticBoundedIO.java new file mode 100644 index 000..d9f652ace9d --- /dev/null +++ b/sdks/java/io/synthetic/src/main/java/org/apache/beam/sdk/io/synthetic/SyntheticBoundedIO.java @@ -0,0 +1,437 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.beam.sdk.io.synthetic; + +import static com.google.common.base.Preconditions.checkArgument; +import static
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=123100&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-123100 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 13/Jul/18 23:24 Start Date: 13/Jul/18 23:24 Worklog Time Spent: 10m Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-404978005 Run Python PreCommit Issue Time Tracking --- Worklog Id: (was: 123100) Time Spent: 8.5h (was: 8h 20m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=123034&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-123034 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 13/Jul/18 20:46 Start Date: 13/Jul/18 20:46 Worklog Time Spent: 10m Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-404949448 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 123034) Time Spent: 8h 20m (was: 8h 10m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=121626&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-121626 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 10/Jul/18 21:27 Start Date: 10/Jul/18 21:27 Worklog Time Spent: 10m Work Description: pabloem commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r201445730 ## File path: sdks/java/io/synthetic/pom.xml ## @@ -0,0 +1,84 @@ + Review comment: Done. Issue Time Tracking --- Worklog Id: (was: 121626) Time Spent: 8h 10m (was: 8h)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=121407&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-121407 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 10/Jul/18 13:29 Start Date: 10/Jul/18 13:29 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r201340262 ## File path: sdks/java/io/synthetic/src/main/java/org/apache/beam/sdk/io/synthetic/SyntheticBoundedIO.java ## @@ -0,0 +1,436 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.beam.sdk.io.synthetic; + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; +import static org.apache.commons.math3.stat.StatUtils.sum; + +import com.fasterxml.jackson.annotation.JsonProperty; +import com.fasterxml.jackson.databind.annotation.JsonDeserialize; +import com.google.common.base.MoreObjects; +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; +import java.util.NoSuchElementException; +import java.util.Random; +import java.util.stream.Collectors; +import org.apache.beam.sdk.coders.ByteArrayCoder; +import org.apache.beam.sdk.coders.Coder; +import org.apache.beam.sdk.coders.KvCoder; +import org.apache.beam.sdk.io.OffsetBasedSource; +import org.apache.beam.sdk.io.Read; +import org.apache.beam.sdk.io.range.OffsetRange; +import org.apache.beam.sdk.options.PipelineOptions; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.commons.math3.distribution.ConstantRealDistribution; +import org.joda.time.Duration; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * This {@link SyntheticBoundedIO} class provides a parameterizable batch custom source that is + * deterministic. + * + * The {@link SyntheticBoundedSource} generates a {@link PCollection} of {@code KV<byte[], byte[]>}. A fraction of the generated records {@code KV<byte[], byte[]>} are associated with + * "hot" keys, which are uniformly distributed over a fixed number of hot keys. The remaining + * generated records are associated with "random" keys. Each record will be slowed down by a certain + * sleep time generated based on the specified sleep time distribution when the {@link + * SyntheticSourceReader} reads each record. The record {@code KV<byte[], byte[]>} is generated + * deterministically based on the record's position in the source, which enables repeatable + * execution for debugging. 
The SyntheticBoundedInput configurable parameters are defined in {@link + * SyntheticBoundedIO.SyntheticSourceOptions}. + * + * To read a {@link PCollection} of {@code KV<byte[], byte[]>} from {@link SyntheticBoundedIO}, + * use {@link SyntheticBoundedIO#readFrom} to construct the synthetic source with synthetic source + * options. See {@link SyntheticBoundedIO.SyntheticSourceOptions} for how to construct an instance. + * An example is below: + * + * {@code + * Pipeline p = ...; + * SyntheticBoundedInput.SourceOptions sso = ...; + * + * // Construct the synthetic input with synthetic source options. + * PCollection<KV<byte[], byte[]>> input = p.apply(SyntheticBoundedInput.readFrom(sso)); + * } + */ +public class SyntheticBoundedIO { Review comment: I would probably prefer SyntheticIO and .bounded() maybe, but well, that's nitpicking and could be fixed in the future if we ever have an unbounded one. Issue Time Tracking --- Worklog Id: (was: 121407) Time Spent: 7h 50m (was: 7h 40m)
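The Javadoc quoted in the review comment above says that each record is generated deterministically from its position in the source, and that a fraction of the records land on a fixed set of "hot" keys. A minimal Java sketch of that idea follows. It is not the code from pull request #5519: the class `PositionSeededRecords`, the method `recordAt`, the `mix` helper and every parameter name are hypothetical, chosen only for illustration, and the sleep-time behaviour is left out.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical illustration; NOT the SyntheticBoundedIO implementation from the PR. */
final class PositionSeededRecords {

  /** Plain key/value holder standing in for Beam's KV of byte arrays. */
  static final class Record {
    final byte[] key;
    final byte[] value;

    Record(byte[] key, byte[] value) {
      this.key = key;
      this.value = value;
    }
  }

  /** SplitMix64-style bit mixer, so neighbouring positions give unrelated Random seeds. */
  static long mix(long z) {
    z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
    z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
    return z ^ (z >>> 31);
  }

  /** Builds the record for one position; the same arguments always give the same bytes. */
  static Record recordAt(
      long position, long seed, double hotKeyFraction, int numHotKeys, int keySize, int valueSize) {
    // The generator depends only on (seed, position), never on read order or on how
    // the source was split, which is what makes a run repeatable.
    Random random = new Random(mix(seed ^ position));

    byte[] key = new byte[keySize];
    if (numHotKeys > 0 && random.nextDouble() < hotKeyFraction) {
      // Hot key: pick one of numHotKeys fixed keys uniformly; each key is derived from its index.
      int hotIndex = random.nextInt(numHotKeys);
      new Random(mix(seed + hotIndex)).nextBytes(key);
    } else {
      // "Random" key: still deterministic, because it comes from the position-seeded generator.
      random.nextBytes(key);
    }

    byte[] value = new byte[valueSize];
    random.nextBytes(value);
    return new Record(key, value);
  }

  public static void main(String[] args) {
    Record first = recordAt(42L, 7L, 0.5, 3, 8, 16);
    Record again = recordAt(42L, 7L, 0.5, 3, 8, 16);
    System.out.println(Arrays.equals(first.key, again.key)); // prints true
    System.out.println(Arrays.equals(first.value, again.value)); // prints true
  }
}
```

The position is mixed before seeding because `java.util.Random` instances created from consecutive seeds tend to produce similar first values, which would skew the hot-key fraction for neighbouring positions.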
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=121408&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-121408 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 10/Jul/18 13:29 Start Date: 10/Jul/18 13:29 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r201339767 ## File path: sdks/java/io/synthetic/pom.xml ## @@ -0,0 +1,84 @@ + Review comment: I suppose we have to remove this one now. Issue Time Tracking --- Worklog Id: (was: 121408) Time Spent: 8h (was: 7h 50m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=121158&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-121158 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 10/Jul/18 01:16 Start Date: 10/Jul/18 01:16 Worklog Time Spent: 10m Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-403668051 Rebased. Issue Time Tracking --- Worklog Id: (was: 121158) Time Spent: 7h 40m (was: 7.5h)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=120896&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-120896 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 09/Jul/18 17:08 Start Date: 09/Jul/18 17:08 Worklog Time Spent: 10m Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-403550594 @iemejia PTAL Issue Time Tracking --- Worklog Id: (was: 120896) Time Spent: 7.5h (was: 7h 20m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=120852&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-120852 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 09/Jul/18 16:25 Start Date: 09/Jul/18 16:25 Worklog Time Spent: 10m Work Description: pabloem removed a comment on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-403537539 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 120852) Time Spent: 7h 10m (was: 7h)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=120853&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-120853 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 09/Jul/18 16:25 Start Date: 09/Jul/18 16:25 Worklog Time Spent: 10m Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-403537899 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 120853) Time Spent: 7h 20m (was: 7h 10m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=120851&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-120851 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 09/Jul/18 16:24 Start Date: 09/Jul/18 16:24 Worklog Time Spent: 10m Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-403537539 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 120851) Time Spent: 7h (was: 6h 50m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=119371&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-119371 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 05/Jul/18 14:04 Start Date: 05/Jul/18 14:04 Worklog Time Spent: 10m Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-402732984 this is weird, tests pass ok locally, let's try again Issue Time Tracking --- Worklog Id: (was: 119371) Time Spent: 6h 40m (was: 6.5h)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=119372&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-119372 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 05/Jul/18 14:04 Start Date: 05/Jul/18 14:04 Worklog Time Spent: 10m Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-402733023 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 119372) Time Spent: 6h 50m (was: 6h 40m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=118864&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-118864 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 03/Jul/18 22:06 Start Date: 03/Jul/18 22:06 Worklog Time Spent: 10m Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-402305902 ugh.. I can't repro this on my environment : / Issue Time Tracking --- Worklog Id: (was: 118864) Time Spent: 6.5h (was: 6h 20m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=118781&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-118781 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 03/Jul/18 17:32 Start Date: 03/Jul/18 17:32 Worklog Time Spent: 10m Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-402235246 I've renamed to `SyntheticBoundedIO`, but I think it makes sense to have `readFrom(SyntheticSourceOptions options)` as an entry point to add to a pipeline. WDYT? Issue Time Tracking --- Worklog Id: (was: 118781) Time Spent: 6h 10m (was: 6h)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=118782&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-118782 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 03/Jul/18 17:32 Start Date: 03/Jul/18 17:32 Worklog Time Spent: 10m Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-402235268 PTAL : ) Issue Time Tracking --- Worklog Id: (was: 118782) Time Spent: 6h 20m (was: 6h 10m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=118616&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-118616 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 03/Jul/18 08:25 Start Date: 03/Jul/18 08:25 Worklog Time Spent: 10m Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-402055230 Thanks Pablo, eager to review and merge this when ready (sorry for the wait before). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 118616) Time Spent: 6h (was: 5h 50m) > Performance tests need a way to generate Synthetic data > --- > > Key: BEAM-4432 > URL: https://issues.apache.org/jira/browse/BEAM-4432 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Minor > Time Spent: 6h > Remaining Estimate: 0h > > GenerateSequence falls short in this regard, as we may want to generate > data in custom distributions, or with specific repeatability requirements / > and hardcoded delays for autoscaling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=118509&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-118509 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 03/Jul/18 00:22 Start Date: 03/Jul/18 00:22 Worklog Time Spent: 10m Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-401975854 Thanks Ismael! For some reason, I hadn't seen your other comments. GitHub hid that file from me. I've addressed them. The only thing remaining is setting up `SyntheticIO.read`. I'll work on that tomorrow. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 118509) Time Spent: 5h 50m (was: 5h 40m) > Performance tests need a way to generate Synthetic data > --- > > Key: BEAM-4432 > URL: https://issues.apache.org/jira/browse/BEAM-4432 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Minor > Time Spent: 5h 50m > Remaining Estimate: 0h > > GenerateSequence falls short in this regard, as we may want to generate > data in custom distributions, or with specific repeatability requirements / > and hardcoded delays for autoscaling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=117404&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-117404 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 29/Jun/18 14:36 Start Date: 29/Jun/18 14:36 Worklog Time Spent: 10m Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-401373701 @pabloem maybe we should call it `SyntheticIO.read()` or `gen` or something like that now, to make it closer to the other IOs, no? Do you plan to address some of the other comments (most are quite minor but still nice to have)? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 117404) Time Spent: 5h 40m (was: 5.5h) > Performance tests need a way to generate Synthetic data > --- > > Key: BEAM-4432 > URL: https://issues.apache.org/jira/browse/BEAM-4432 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Minor > Time Spent: 5h 40m > Remaining Estimate: 0h > > GenerateSequence falls short in this regard, as we may want to generate > data in custom distributions, or with specific repeatability requirements / > and hardcoded delays for autoscaling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116217&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116217 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 27/Jun/18 00:33 Start Date: 27/Jun/18 00:33 Worklog Time Spent: 10m Work Description: pabloem edited a comment on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-400505021 Thanks @iemejia - I've addressed your comments. LMK what you think. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 116217) Time Spent: 5.5h (was: 5h 20m) > Performance tests need a way to generate Synthetic data > --- > > Key: BEAM-4432 > URL: https://issues.apache.org/jira/browse/BEAM-4432 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Minor > Time Spent: 5.5h > Remaining Estimate: 0h > > GenerateSequence falls short in this regard, as we may want to generate > data in custom distributions, or with specific repeatability requirements / > and hardcoded delays for autoscaling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116216&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116216 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 27/Jun/18 00:30 Start Date: 27/Jun/18 00:30 Worklog Time Spent: 10m Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-400505021 Thanks @iemejia - I've addressed some of your concerns This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 116216) Time Spent: 5h 20m (was: 5h 10m) > Performance tests need a way to generate Synthetic data > --- > > Key: BEAM-4432 > URL: https://issues.apache.org/jira/browse/BEAM-4432 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Minor > Time Spent: 5h 20m > Remaining Estimate: 0h > > GenerateSequence falls short in this regard, as we may want to generate > data in custom distributions, or with specific repeatability requirements / > and hardcoded delays for autoscaling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116213&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116213 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 27/Jun/18 00:29 Start Date: 27/Jun/18 00:29 Worklog Time Spent: 10m Work Description: pabloem commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198336079 ## File path: sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticUtils.java ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.beam.sdk.io.common.synthetic; + +import com.google.common.base.Stopwatch; +import com.google.common.hash.Hashing; +import com.google.common.util.concurrent.Uninterruptibles; +import java.util.Random; +import java.util.concurrent.TimeUnit; +import org.joda.time.Duration; + +/** + * Utility functions used in {@link org.apache.beam.sdk.io.common.synthetic}. + */ +public class SyntheticUtils { Review comment: Done. This is an automated message from the Apache Git Service. 
To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 116213) Time Spent: 5h (was: 4h 50m) > Performance tests need a way to generate Synthetic data > --- > > Key: BEAM-4432 > URL: https://issues.apache.org/jira/browse/BEAM-4432 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Minor > Time Spent: 5h > Remaining Estimate: 0h > > GenerateSequence falls short in this regard, as we may want to generate > data in custom distributions, or with specific repeatability requirements / > and hardcoded delays for autoscaling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116209&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116209 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 27/Jun/18 00:29 Start Date: 27/Jun/18 00:29 Worklog Time Spent: 10m Work Description: pabloem commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198322384 ## File path: sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticUtils.java ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.beam.sdk.io.common.synthetic; + +import com.google.common.base.Stopwatch; +import com.google.common.hash.Hashing; +import com.google.common.util.concurrent.Uninterruptibles; +import java.util.Random; +import java.util.concurrent.TimeUnit; +import org.joda.time.Duration; + +/** + * Utility functions used in {@link org.apache.beam.sdk.io.common.synthetic}. + */ +public class SyntheticUtils { Review comment: Done This is an automated message from the Apache Git Service. 
To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 116209) Time Spent: 4h 20m (was: 4h 10m) > Performance tests need a way to generate Synthetic data > --- > > Key: BEAM-4432 > URL: https://issues.apache.org/jira/browse/BEAM-4432 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Minor > Time Spent: 4h 20m > Remaining Estimate: 0h > > GenerateSequence falls short in this regard, as we may want to generate > data in custom distributions, or with specific repeatability requirements / > and hardcoded delays for autoscaling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116214&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116214 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 27/Jun/18 00:29 Start Date: 27/Jun/18 00:29 Worklog Time Spent: 10m Work Description: pabloem commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198335721 ## File path: sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/package-info.java ## @@ -0,0 +1,19 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +/** Transforms for performing Synthetic Operations in Apache Beam pipelines. */ +package org.apache.beam.sdk.io.common.synthetic; Review comment: I think that makes a lot of sense. I added a new component. WDYT? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 116214) Time Spent: 5h 10m (was: 5h) > Performance tests need a way to generate Synthetic data > --- > > Key: BEAM-4432 > URL: https://issues.apache.org/jira/browse/BEAM-4432 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Minor > Time Spent: 5h 10m > Remaining Estimate: 0h > > GenerateSequence falls short in this regard, as we may want to generate > data in custom distributions, or with specific repeatability requirements / > and hardcoded delays for autoscaling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116212&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116212 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 27/Jun/18 00:29 Start Date: 27/Jun/18 00:29 Worklog Time Spent: 10m Work Description: pabloem commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198328174 ## File path: sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInputTest.java ## @@ -0,0 +1,223 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.beam.sdk.io.common.synthetic; + +import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromIntegerDistribution; +import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromRealDistribution; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import com.fasterxml.jackson.core.JsonParseException; +import com.fasterxml.jackson.databind.ObjectMapper; +import java.io.IOException; +import java.util.List; +import org.apache.beam.sdk.io.BoundedSource; +import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SourceOptions; +import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SyntheticBoundedSource; +import org.apache.beam.sdk.options.PipelineOptions; +import org.apache.beam.sdk.options.PipelineOptionsFactory; +import org.apache.beam.sdk.testing.SourceTestUtils; +import org.apache.beam.sdk.values.KV; +import org.apache.commons.math3.distribution.ConstantRealDistribution; +import org.apache.commons.math3.distribution.ZipfDistribution; +import org.junit.Before; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.ExpectedException; +import org.junit.runner.RunWith; +import org.junit.runners.JUnit4; + +/** Unit tests for {@link SyntheticBoundedInput}. 
*/ +@RunWith(JUnit4.class) +public class SyntheticBoundedInputTest { + @Rule public final ExpectedException thrown = ExpectedException.none(); + + private SourceOptions testSourceOptions = new SourceOptions(); + + @Before + public void setUp() { +testSourceOptions.splitPointFrequencyRecords = 1; +testSourceOptions.numRecords = 10; +testSourceOptions.keySizeBytes = 10; +testSourceOptions.valueSizeBytes = 20; +testSourceOptions.numHotKeys = 3; +testSourceOptions.hotKeyFraction = 0.3; +testSourceOptions.setSeed(123456); +testSourceOptions.bundleSizeDistribution = +fromIntegerDistribution(new ZipfDistribution(100, 2.5)); +testSourceOptions.forceNumInitialBundles = null; + } + + private SourceOptions fromString(String jsonString) throws IOException { +ObjectMapper mapper = new ObjectMapper(); +SourceOptions result = mapper.readValue(jsonString, SourceOptions.class); +result.validate(); +return result; + } + + @Test + public void testInvalidSourceOptionsJsonFormat() throws Exception { +thrown.expect(JsonParseException.class); +String syntheticSourceOptions = "input:unknown URI"; +fromString(syntheticSourceOptions); + } + + @Test + public void testFromString() throws Exception { +String syntheticSourceOptions = + "{\"numRecords\":100,\"splitPointFrequencyRecords\":10,\"keySizeBytes\":10," ++ "\"valueSizeBytes\":20,\"numHotKeys\":3," ++ "\"hotKeyFraction\":0.3,\"seed\":123456," ++ "\"bundleSizeDistribution\":{\"type\":\"const\",\"const\":42}," ++ "\"forceNumInitialBundles\":10,\"progressShape\":\"LINEAR_REGRESSING\"" ++ "}"; +SourceOptions sourceOptions = fromString(syntheticSourceOptions); +assertEquals(100, sourceOptions.numRecords); +assertEquals(10, sourceOptions.splitPointFrequencyRecords); +assertEquals(10, sourceOptions.keySizeBytes); +assertEquals(20, sourceOptions.valueSizeBytes); +assertEquals(3, sourceOptions.numHotKeys); +assertEquals(0.3, sourceOptions.hotKey
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116210&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116210 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 27/Jun/18 00:29 Start Date: 27/Jun/18 00:29 Worklog Time Spent: 10m Work Description: pabloem commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198335679 ## File path: sdks/java/io/common/build.gradle ## @@ -23,9 +23,15 @@ description = "Apache Beam :: SDKs :: Java :: IO :: Common" ext.summary = "Code used by all Beam IOs" dependencies { + compile library.java.joda_time Review comment: Done : ) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 116210) Time Spent: 4.5h (was: 4h 20m) > Performance tests need a way to generate Synthetic data > --- > > Key: BEAM-4432 > URL: https://issues.apache.org/jira/browse/BEAM-4432 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Minor > Time Spent: 4.5h > Remaining Estimate: 0h > > GenerateSequence falls short in this regard, as we may want to generate > data in custom distributions, or with specific repeatability requirements / > and hardcoded delays for autoscaling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116211&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116211 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 27/Jun/18 00:29 Start Date: 27/Jun/18 00:29 Worklog Time Spent: 10m Work Description: pabloem commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198327121 ## File path: sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInputTest.java ## @@ -0,0 +1,223 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.beam.sdk.io.common.synthetic; + +import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromIntegerDistribution; +import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromRealDistribution; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import com.fasterxml.jackson.core.JsonParseException; +import com.fasterxml.jackson.databind.ObjectMapper; +import java.io.IOException; +import java.util.List; +import org.apache.beam.sdk.io.BoundedSource; +import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SourceOptions; +import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SyntheticBoundedSource; +import org.apache.beam.sdk.options.PipelineOptions; +import org.apache.beam.sdk.options.PipelineOptionsFactory; +import org.apache.beam.sdk.testing.SourceTestUtils; +import org.apache.beam.sdk.values.KV; +import org.apache.commons.math3.distribution.ConstantRealDistribution; +import org.apache.commons.math3.distribution.ZipfDistribution; +import org.junit.Before; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.ExpectedException; +import org.junit.runner.RunWith; +import org.junit.runners.JUnit4; + +/** Unit tests for {@link SyntheticBoundedInput}. 
*/ +@RunWith(JUnit4.class) +public class SyntheticBoundedInputTest { + @Rule public final ExpectedException thrown = ExpectedException.none(); + + private SourceOptions testSourceOptions = new SourceOptions(); + + @Before + public void setUp() { +testSourceOptions.splitPointFrequencyRecords = 1; +testSourceOptions.numRecords = 10; +testSourceOptions.keySizeBytes = 10; +testSourceOptions.valueSizeBytes = 20; +testSourceOptions.numHotKeys = 3; +testSourceOptions.hotKeyFraction = 0.3; +testSourceOptions.setSeed(123456); +testSourceOptions.bundleSizeDistribution = +fromIntegerDistribution(new ZipfDistribution(100, 2.5)); +testSourceOptions.forceNumInitialBundles = null; + } + + private SourceOptions fromString(String jsonString) throws IOException { +ObjectMapper mapper = new ObjectMapper(); +SourceOptions result = mapper.readValue(jsonString, SourceOptions.class); +result.validate(); +return result; + } + + @Test + public void testInvalidSourceOptionsJsonFormat() throws Exception { +thrown.expect(JsonParseException.class); +String syntheticSourceOptions = "input:unknown URI"; +fromString(syntheticSourceOptions); + } + + @Test + public void testFromString() throws Exception { +String syntheticSourceOptions = + "{\"numRecords\":100,\"splitPointFrequencyRecords\":10,\"keySizeBytes\":10," ++ "\"valueSizeBytes\":20,\"numHotKeys\":3," ++ "\"hotKeyFraction\":0.3,\"seed\":123456," ++ "\"bundleSizeDistribution\":{\"type\":\"const\",\"const\":42}," ++ "\"forceNumInitialBundles\":10,\"progressShape\":\"LINEAR_REGRESSING\"" ++ "}"; +SourceOptions sourceOptions = fromString(syntheticSourceOptions); +assertEquals(100, sourceOptions.numRecords); +assertEquals(10, sourceOptions.splitPointFrequencyRecords); +assertEquals(10, sourceOptions.keySizeBytes); +assertEquals(20, sourceOptions.valueSizeBytes); +assertEquals(3, sourceOptions.numHotKeys); +assertEquals(0.3, sourceOptions.hotKey
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115950&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115950 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 26/Jun/18 12:38 Start Date: 26/Jun/18 12:38 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198074385 ## File path: sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java ## @@ -0,0 +1,452 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.beam.sdk.io.common.synthetic; + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; +import static org.apache.commons.math3.stat.StatUtils.sum; + +import com.fasterxml.jackson.annotation.JsonProperty; +import com.fasterxml.jackson.databind.annotation.JsonDeserialize; +import com.google.common.base.MoreObjects; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.NoSuchElementException; +import java.util.Random; +import org.apache.beam.sdk.coders.ByteArrayCoder; +import org.apache.beam.sdk.coders.Coder; +import org.apache.beam.sdk.coders.KvCoder; +import org.apache.beam.sdk.io.OffsetBasedSource; +import org.apache.beam.sdk.io.Read; +import org.apache.beam.sdk.options.PipelineOptions; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.commons.math3.distribution.ConstantRealDistribution; +import org.joda.time.Duration; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is + * deterministic. + * + * The {@link SyntheticBoundedSource} generates a {@link PCollection} of + * {@code KV}. A fraction of the generated records {@code KV} are + * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys. + * The remaining generated records are associated with "random" keys. + * Each record will be slowed down by a certain sleep time generated based on the specified sleep + * time distribution when the {@link SyntheticSourceReader} reads each record. + * The record {@code KV} is generated deterministically based on the record's + * position in the source, which enables repeatable execution for debugging. 
+ * The SyntheticBoundedInput configurable parameters are defined in {@link + * SyntheticBoundedInput.SourceOptions}. + * + * To read a {@link PCollection} of {@code KV} from {@link + * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic + * source with synthetic source options. + * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance. + * An example is below: + * {@code + * Pipeline p = ...; + * SyntheticBoundedInput.SourceOptions sso = ...; + * + * // Construct the synthetic input with synthetic source options. + * PCollection<KV<byte[], byte[]>> input = p.apply(SyntheticBoundedInput.readFrom(sso)); Review comment: Please add a test for `SyntheticBoundedInput` use. Also validate in the tests the full set of options on `SourceOptions`. I found a validation issue with the HashFunction not being available because it is transient, so it probably gets lost on deserialization. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 115950) Time Spent: 4h 10m (was: 4h) > Performance tests need a way to generate Synthetic data > ---
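Editor's note on the review comment above: the concern that a transient hash function "gets lost on deserialization" can be illustrated with a minimal sketch. It assumes plain Java serialization and uses hypothetical names (`TransientFieldSketch`, `Options`, `PositionMixer`) that are not part of the PR or of Beam; `PositionMixer` merely stands in for a non-serializable hash function, and the fix shown (rebuilding the field in `readObject`) is one common approach, not necessarily the one the PR adopted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Random;

public class TransientFieldSketch {

  /** Hypothetical helper standing in for a non-serializable hash function. */
  static final class PositionMixer {
    private final long seed;

    PositionMixer(long seed) {
      this.seed = seed;
    }

    /** The same seed and position always produce the same value. */
    long mix(long position) {
      return new Random(seed ^ position).nextLong();
    }
  }

  /** Hypothetical options class; NOT the real SourceOptions from the PR. */
  static final class Options implements Serializable {
    private static final long serialVersionUID = 1L;

    private final long seed;
    // transient: skipped by Java serialization, so it would be null in a
    // deserialized copy unless it is rebuilt explicitly.
    private transient PositionMixer mixer;

    Options(long seed) {
      this.seed = seed;
      this.mixer = new PositionMixer(seed);
    }

    // Serialization hook: restore the transient helper from the serialized seed.
    private void readObject(ObjectInputStream in)
        throws IOException, ClassNotFoundException {
      in.defaultReadObject();
      this.mixer = new PositionMixer(seed);
    }

    /** Deterministic key for a record position. */
    long keyFor(long position) {
      return mixer.mix(position);
    }
  }

  /** Serializes and deserializes the options in memory. */
  static Options roundTrip(Options options)
      throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(options);
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      return (Options) in.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    Options original = new Options(123456L);
    Options copy = roundTrip(original);
    // Compares the key generated before and after the round trip.
    System.out.println(original.keyFor(7) == copy.keyFor(7));
  }
}
```

Without the `readObject` hook, `copy.keyFor(7)` would throw a `NullPointerException`, which is the kind of failure the reviewer describes. Deriving the key from the seed and the record position also mirrors the Javadoc's statement that records are generated deterministically from their position in the source.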
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115949&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115949 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 26/Jun/18 12:37 Start Date: 26/Jun/18 12:37 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198113556 ## File path: sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java ## @@ -0,0 +1,452 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.beam.sdk.io.common.synthetic; + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; +import static org.apache.commons.math3.stat.StatUtils.sum; + +import com.fasterxml.jackson.annotation.JsonProperty; +import com.fasterxml.jackson.databind.annotation.JsonDeserialize; +import com.google.common.base.MoreObjects; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.NoSuchElementException; +import java.util.Random; +import org.apache.beam.sdk.coders.ByteArrayCoder; +import org.apache.beam.sdk.coders.Coder; +import org.apache.beam.sdk.coders.KvCoder; +import org.apache.beam.sdk.io.OffsetBasedSource; +import org.apache.beam.sdk.io.Read; +import org.apache.beam.sdk.options.PipelineOptions; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.commons.math3.distribution.ConstantRealDistribution; +import org.joda.time.Duration; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is + * deterministic. + * + * The {@link SyntheticBoundedSource} generates a {@link PCollection} of + * {@code KV}. A fraction of the generated records {@code KV} are + * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys. + * The remaining generated records are associated with "random" keys. + * Each record will be slowed down by a certain sleep time generated based on the specified sleep + * time distribution when the {@link SyntheticSourceReader} reads each record. + * The record {@code KV} is generated deterministically based on the record's + * position in the source, which enables repeatable execution for debugging. 
+ * The SyntheticBoundedInput configurable parameters are defined in {@link + * SyntheticBoundedInput.SourceOptions}. + * + * To read a {@link PCollection} of {@code KV} from {@link + * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic + * source with synthetic source options. + * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance. + * An example is below: + * {@code + * Pipeline p = ...; + * SyntheticBoundedInput.SourceOptions sso = ...; + * + * // Construct the synthetic input with synthetic source options. + * PCollection> input = p.apply(SyntheticBoundedInput.readFrom(sso)); + * } + */ +public class SyntheticBoundedInput { + /** + * Read from the synthetic source options. + */ + public static Read.Bounded> readFrom(SourceOptions options) { +checkNotNull(options, "Input synthetic source options should not be null."); +options.validate(); +return Read.from(new SyntheticBoundedSource(options)); + } + + /** + * A {@link SyntheticBoundedSource} that reads {@code KV}. + */ + public static class SyntheticBoundedSource extends OffsetBasedSource> { +private static final long serialVersionUID = 0; +private static final Logger LOG = LoggerFactory.getLogger(SyntheticBoundedSource.class); + +private final SourceOptions sourceOptions; + +public SyntheticBoundedSource(SourceOptions sourceOptions) { + this(0, sourceOptions.numRecords, sourceOptions); +
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115943&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115943 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 26/Jun/18 12:37 Start Date: 26/Jun/18 12:37 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198084851 ## File path: sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java ## @@ -0,0 +1,452 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.beam.sdk.io.common.synthetic; + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; +import static org.apache.commons.math3.stat.StatUtils.sum; + +import com.fasterxml.jackson.annotation.JsonProperty; +import com.fasterxml.jackson.databind.annotation.JsonDeserialize; +import com.google.common.base.MoreObjects; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.NoSuchElementException; +import java.util.Random; +import org.apache.beam.sdk.coders.ByteArrayCoder; +import org.apache.beam.sdk.coders.Coder; +import org.apache.beam.sdk.coders.KvCoder; +import org.apache.beam.sdk.io.OffsetBasedSource; +import org.apache.beam.sdk.io.Read; +import org.apache.beam.sdk.options.PipelineOptions; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.commons.math3.distribution.ConstantRealDistribution; +import org.joda.time.Duration; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is + * deterministic. + * + * The {@link SyntheticBoundedSource} generates a {@link PCollection} of + * {@code KV}. A fraction of the generated records {@code KV} are + * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys. + * The remaining generated records are associated with "random" keys. + * Each record will be slowed down by a certain sleep time generated based on the specified sleep + * time distribution when the {@link SyntheticSourceReader} reads each record. + * The record {@code KV} is generated deterministically based on the record's + * position in the source, which enables repeatable execution for debugging. 
+ * The SyntheticBoundedInput configurable parameters are defined in {@link + * SyntheticBoundedInput.SourceOptions}. + * + * To read a {@link PCollection} of {@code KV} from {@link + * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic + * source with synthetic source options. + * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance. + * An example is below: + * {@code + * Pipeline p = ...; + * SyntheticBoundedInput.SourceOptions sso = ...; + * + * // Construct the synthetic input with synthetic source options. + * PCollection> input = p.apply(SyntheticBoundedInput.readFrom(sso)); + * } + */ +public class SyntheticBoundedInput { + /** + * Read from the synthetic source options. + */ + public static Read.Bounded> readFrom(SourceOptions options) { +checkNotNull(options, "Input synthetic source options should not be null."); +options.validate(); +return Read.from(new SyntheticBoundedSource(options)); + } + + /** + * A {@link SyntheticBoundedSource} that reads {@code KV}. + */ + public static class SyntheticBoundedSource extends OffsetBasedSource> { +private static final long serialVersionUID = 0; +private static final Logger LOG = LoggerFactory.getLogger(SyntheticBoundedSource.class); + +private final SourceOptions sourceOptions; + +public SyntheticBoundedSource(SourceOptions sourceOptions) { + this(0, sourceOptions.numRecords, sourceOptions); +
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115942&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115942 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 26/Jun/18 12:37 Start Date: 26/Jun/18 12:37 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r194428847 ## File path: sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticUtils.java ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.beam.sdk.io.common.synthetic; + +import com.google.common.base.Stopwatch; +import com.google.common.hash.Hashing; +import com.google.common.util.concurrent.Uninterruptibles; +import java.util.Random; +import java.util.concurrent.TimeUnit; +import org.joda.time.Duration; + +/** + * Utility functions used in {@link org.apache.beam.sdk.io.common.synthetic}. 
+ */ +public class SyntheticUtils { Review comment: This class should be package-private, and the same goes for its methods. It may be worth running IntelliJ's code analysis to restrict access to the narrowest intended scope (also in the other classes). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 115942) Time Spent: 3h 10m (was: 3h) > Performance tests need a way to generate Synthetic data > --- > > Key: BEAM-4432 > URL: https://issues.apache.org/jira/browse/BEAM-4432 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Minor > Time Spent: 3h 10m > Remaining Estimate: 0h > > GenerateSequence falls short in this regard, as we may want to generate > data in custom distributions, or with specific repeatability requirements / > and hardcoded delays for autoscaling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115945&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115945 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 26/Jun/18 12:37 Start Date: 26/Jun/18 12:37 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198074805 ## File path: sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java ## @@ -0,0 +1,452 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.beam.sdk.io.common.synthetic; + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; +import static org.apache.commons.math3.stat.StatUtils.sum; + +import com.fasterxml.jackson.annotation.JsonProperty; +import com.fasterxml.jackson.databind.annotation.JsonDeserialize; +import com.google.common.base.MoreObjects; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.NoSuchElementException; +import java.util.Random; +import org.apache.beam.sdk.coders.ByteArrayCoder; +import org.apache.beam.sdk.coders.Coder; +import org.apache.beam.sdk.coders.KvCoder; +import org.apache.beam.sdk.io.OffsetBasedSource; +import org.apache.beam.sdk.io.Read; +import org.apache.beam.sdk.options.PipelineOptions; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.commons.math3.distribution.ConstantRealDistribution; +import org.joda.time.Duration; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is + * deterministic. + * + * The {@link SyntheticBoundedSource} generates a {@link PCollection} of + * {@code KV}. A fraction of the generated records {@code KV} are + * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys. + * The remaining generated records are associated with "random" keys. + * Each record will be slowed down by a certain sleep time generated based on the specified sleep + * time distribution when the {@link SyntheticSourceReader} reads each record. + * The record {@code KV} is generated deterministically based on the record's + * position in the source, which enables repeatable execution for debugging. 
+ * The SyntheticBoundedInput configurable parameters are defined in {@link + * SyntheticBoundedInput.SourceOptions}. + * + * To read a {@link PCollection} of {@code KV} from {@link + * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic + * source with synthetic source options. + * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance. + * An example is below: + * {@code + * Pipeline p = ...; + * SyntheticBoundedInput.SourceOptions sso = ...; + * + * // Construct the synthetic input with synthetic source options. + * PCollection> input = p.apply(SyntheticBoundedInput.readFrom(sso)); + * } + */ +public class SyntheticBoundedInput { + /** + * Read from the synthetic source options. + */ + public static Read.Bounded> readFrom(SourceOptions options) { +checkNotNull(options, "Input synthetic source options should not be null."); +options.validate(); Review comment: Is this needed? I think validate is already called during the lifecycle of the Read. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 115945) Time Spent: 3.5h (was: 3h 20m) > Performance te
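The Javadoc quoted in the diff above says that each record is generated deterministically from its position in the source, and that a fraction of records map to a fixed number of "hot" keys. A minimal sketch of that idea follows, assuming a `java.util.Random` seeded from the position. The `recordAt` method, its parameters, and the seed-mixing constant are hypothetical and chosen for illustration; the pull request's own generation logic is not shown in this thread and may well differ.

```java
import java.util.Arrays;
import java.util.Random;

public class DeterministicRecords {

  /**
   * Returns {key, value} for the given position. The same inputs always yield the
   * same bytes, which is what makes a run repeatable. All parameters are illustrative.
   */
  static byte[][] recordAt(
      long position,
      long seed,
      int numHotKeys,
      double hotKeyFraction,
      int keySizeBytes,
      int valueSizeBytes) {
    // Seed a fresh generator from the position, so no state is shared between records.
    Random random = new Random(seed ^ (position * 0x9E3779B97F4A7C15L));

    byte[] key = new byte[keySizeBytes];
    if (random.nextDouble() < hotKeyFraction) {
      // Hot record: choose one of numHotKeys fixed keys uniformly.
      int hotIndex = random.nextInt(numHotKeys);
      new Random(seed + hotIndex).nextBytes(key);
    } else {
      // Regular record: a pseudo-random key derived from the position.
      random.nextBytes(key);
    }

    byte[] value = new byte[valueSizeBytes];
    random.nextBytes(value);
    return new byte[][] {key, value};
  }

  public static void main(String[] args) {
    byte[][] first = recordAt(5, 123456L, 3, 0.3, 10, 20);
    byte[][] second = recordAt(5, 123456L, 3, 0.3, 10, 20);
    System.out.println("repeatable: " + Arrays.deepEquals(first, second));
  }
}
```

Deriving everything from the position, rather than from a generator that advances as records are read, means a split or a restarted reader reproduces the same records for the same offsets.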
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115939&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115939 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 26/Jun/18 12:37 Start Date: 26/Jun/18 12:37 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r194427503 ## File path: sdks/java/io/common/build.gradle ## @@ -23,9 +23,15 @@ description = "Apache Beam :: SDKs :: Java :: IO :: Common" ext.summary = "Code used by all Beam IOs" dependencies { + compile library.java.joda_time Review comment: It is probably worth adding the deps to pom.xml as well; until we remove the pom files, we should not break the other build. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 115939) Time Spent: 2h 50m (was: 2h 40m) > Performance tests need a way to generate Synthetic data > --- > > Key: BEAM-4432 > URL: https://issues.apache.org/jira/browse/BEAM-4432 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Minor > Time Spent: 2h 50m > Remaining Estimate: 0h > > GenerateSequence falls short in this regard, as we may want to generate > data in custom distributions, or with specific repeatability requirements / > and hardcoded delays for autoscaling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115941&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115941 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 26/Jun/18 12:37 Start Date: 26/Jun/18 12:37 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198081158 ## File path: sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java ## @@ -0,0 +1,452 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.beam.sdk.io.common.synthetic; + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; +import static org.apache.commons.math3.stat.StatUtils.sum; + +import com.fasterxml.jackson.annotation.JsonProperty; +import com.fasterxml.jackson.databind.annotation.JsonDeserialize; +import com.google.common.base.MoreObjects; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.NoSuchElementException; +import java.util.Random; +import org.apache.beam.sdk.coders.ByteArrayCoder; +import org.apache.beam.sdk.coders.Coder; +import org.apache.beam.sdk.coders.KvCoder; +import org.apache.beam.sdk.io.OffsetBasedSource; +import org.apache.beam.sdk.io.Read; +import org.apache.beam.sdk.options.PipelineOptions; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.commons.math3.distribution.ConstantRealDistribution; +import org.joda.time.Duration; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is + * deterministic. + * + * The {@link SyntheticBoundedSource} generates a {@link PCollection} of + * {@code KV}. A fraction of the generated records {@code KV} are + * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys. + * The remaining generated records are associated with "random" keys. + * Each record will be slowed down by a certain sleep time generated based on the specified sleep + * time distribution when the {@link SyntheticSourceReader} reads each record. + * The record {@code KV} is generated deterministically based on the record's + * position in the source, which enables repeatable execution for debugging. 
+ * The SyntheticBoundedInput configurable parameters are defined in {@link + * SyntheticBoundedInput.SourceOptions}. + * + * To read a {@link PCollection} of {@code KV} from {@link + * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic + * source with synthetic source options. + * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance. + * An example is below: + * {@code + * Pipeline p = ...; + * SyntheticBoundedInput.SourceOptions sso = ...; + * + * // Construct the synthetic input with synthetic source options. + * PCollection> input = p.apply(SyntheticBoundedInput.readFrom(sso)); + * } + */ +public class SyntheticBoundedInput { + /** + * Read from the synthetic source options. + */ + public static Read.Bounded> readFrom(SourceOptions options) { +checkNotNull(options, "Input synthetic source options should not be null."); +options.validate(); +return Read.from(new SyntheticBoundedSource(options)); + } + + /** + * A {@link SyntheticBoundedSource} that reads {@code KV}. + */ + public static class SyntheticBoundedSource extends OffsetBasedSource> { +private static final long serialVersionUID = 0; +private static final Logger LOG = LoggerFactory.getLogger(SyntheticBoundedSource.class); + +private final SourceOptions sourceOptions; + +public SyntheticBoundedSource(SourceOptions sourceOptions) { + this(0, sourceOptions.numRecords, sourceOptions); +
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115937&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115937 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 26/Jun/18 12:37 Start Date: 26/Jun/18 12:37 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198120428 ## File path: sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java ## @@ -0,0 +1,452 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.beam.sdk.io.common.synthetic; + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; +import static org.apache.commons.math3.stat.StatUtils.sum; + +import com.fasterxml.jackson.annotation.JsonProperty; +import com.fasterxml.jackson.databind.annotation.JsonDeserialize; +import com.google.common.base.MoreObjects; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.NoSuchElementException; +import java.util.Random; +import org.apache.beam.sdk.coders.ByteArrayCoder; +import org.apache.beam.sdk.coders.Coder; +import org.apache.beam.sdk.coders.KvCoder; +import org.apache.beam.sdk.io.OffsetBasedSource; +import org.apache.beam.sdk.io.Read; +import org.apache.beam.sdk.options.PipelineOptions; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.commons.math3.distribution.ConstantRealDistribution; +import org.joda.time.Duration; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is + * deterministic. + * + * The {@link SyntheticBoundedSource} generates a {@link PCollection} of + * {@code KV}. A fraction of the generated records {@code KV} are + * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys. + * The remaining generated records are associated with "random" keys. + * Each record will be slowed down by a certain sleep time generated based on the specified sleep + * time distribution when the {@link SyntheticSourceReader} reads each record. + * The record {@code KV} is generated deterministically based on the record's + * position in the source, which enables repeatable execution for debugging. 
+ * The SyntheticBoundedInput configurable parameters are defined in {@link + * SyntheticBoundedInput.SourceOptions}. + * + * To read a {@link PCollection} of {@code KV} from {@link + * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic + * source with synthetic source options. + * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance. + * An example is below: + * {@code + * Pipeline p = ...; + * SyntheticBoundedInput.SourceOptions sso = ...; + * + * // Construct the synthetic input with synthetic source options. + * PCollection> input = p.apply(SyntheticBoundedInput.readFrom(sso)); + * } + */ +public class SyntheticBoundedInput { + /** + * Read from the synthetic source options. + */ + public static Read.Bounded> readFrom(SourceOptions options) { +checkNotNull(options, "Input synthetic source options should not be null."); +options.validate(); +return Read.from(new SyntheticBoundedSource(options)); + } + + /** + * A {@link SyntheticBoundedSource} that reads {@code KV}. + */ + public static class SyntheticBoundedSource extends OffsetBasedSource> { +private static final long serialVersionUID = 0; +private static final Logger LOG = LoggerFactory.getLogger(SyntheticBoundedSource.class); + +private final SourceOptions sourceOptions; + +public SyntheticBoundedSource(SourceOptions sourceOptions) { + this(0, sourceOptions.numRecords, sourceOptions); +
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115944&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115944 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 26/Jun/18 12:37 Start Date: 26/Jun/18 12:37 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198112261 ## File path: sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInputTest.java ## @@ -0,0 +1,223 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.beam.sdk.io.common.synthetic; + +import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromIntegerDistribution; +import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromRealDistribution; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import com.fasterxml.jackson.core.JsonParseException; +import com.fasterxml.jackson.databind.ObjectMapper; +import java.io.IOException; +import java.util.List; +import org.apache.beam.sdk.io.BoundedSource; +import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SourceOptions; +import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SyntheticBoundedSource; +import org.apache.beam.sdk.options.PipelineOptions; +import org.apache.beam.sdk.options.PipelineOptionsFactory; +import org.apache.beam.sdk.testing.SourceTestUtils; +import org.apache.beam.sdk.values.KV; +import org.apache.commons.math3.distribution.ConstantRealDistribution; +import org.apache.commons.math3.distribution.ZipfDistribution; +import org.junit.Before; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.ExpectedException; +import org.junit.runner.RunWith; +import org.junit.runners.JUnit4; + +/** Unit tests for {@link SyntheticBoundedInput}. 
*/ +@RunWith(JUnit4.class) +public class SyntheticBoundedInputTest { + @Rule public final ExpectedException thrown = ExpectedException.none(); + + private SourceOptions testSourceOptions = new SourceOptions(); + + @Before + public void setUp() { +testSourceOptions.splitPointFrequencyRecords = 1; +testSourceOptions.numRecords = 10; +testSourceOptions.keySizeBytes = 10; +testSourceOptions.valueSizeBytes = 20; +testSourceOptions.numHotKeys = 3; +testSourceOptions.hotKeyFraction = 0.3; +testSourceOptions.setSeed(123456); +testSourceOptions.bundleSizeDistribution = +fromIntegerDistribution(new ZipfDistribution(100, 2.5)); +testSourceOptions.forceNumInitialBundles = null; + } + + private SourceOptions fromString(String jsonString) throws IOException { +ObjectMapper mapper = new ObjectMapper(); +SourceOptions result = mapper.readValue(jsonString, SourceOptions.class); +result.validate(); +return result; + } + + @Test + public void testInvalidSourceOptionsJsonFormat() throws Exception { +thrown.expect(JsonParseException.class); +String syntheticSourceOptions = "input:unknown URI"; +fromString(syntheticSourceOptions); + } + + @Test + public void testFromString() throws Exception { +String syntheticSourceOptions = + "{\"numRecords\":100,\"splitPointFrequencyRecords\":10,\"keySizeBytes\":10," ++ "\"valueSizeBytes\":20,\"numHotKeys\":3," ++ "\"hotKeyFraction\":0.3,\"seed\":123456," ++ "\"bundleSizeDistribution\":{\"type\":\"const\",\"const\":42}," ++ "\"forceNumInitialBundles\":10,\"progressShape\":\"LINEAR_REGRESSING\"" ++ "}"; +SourceOptions sourceOptions = fromString(syntheticSourceOptions); +assertEquals(100, sourceOptions.numRecords); +assertEquals(10, sourceOptions.splitPointFrequencyRecords); +assertEquals(10, sourceOptions.keySizeBytes); +assertEquals(20, sourceOptions.valueSizeBytes); +assertEquals(3, sourceOptions.numHotKeys); +assertEquals(0.3, sourceOptions.hotKey
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115938&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115938 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 26/Jun/18 12:37 Start Date: 26/Jun/18 12:37 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198086099 ## File path: sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java ## @@ -0,0 +1,452 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.beam.sdk.io.common.synthetic; + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; +import static org.apache.commons.math3.stat.StatUtils.sum; + +import com.fasterxml.jackson.annotation.JsonProperty; +import com.fasterxml.jackson.databind.annotation.JsonDeserialize; +import com.google.common.base.MoreObjects; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.NoSuchElementException; +import java.util.Random; +import org.apache.beam.sdk.coders.ByteArrayCoder; +import org.apache.beam.sdk.coders.Coder; +import org.apache.beam.sdk.coders.KvCoder; +import org.apache.beam.sdk.io.OffsetBasedSource; +import org.apache.beam.sdk.io.Read; +import org.apache.beam.sdk.options.PipelineOptions; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.commons.math3.distribution.ConstantRealDistribution; +import org.joda.time.Duration; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is + * deterministic. + * + * The {@link SyntheticBoundedSource} generates a {@link PCollection} of + * {@code KV}. A fraction of the generated records {@code KV} are + * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys. + * The remaining generated records are associated with "random" keys. + * Each record will be slowed down by a certain sleep time generated based on the specified sleep + * time distribution when the {@link SyntheticSourceReader} reads each record. + * The record {@code KV} is generated deterministically based on the record's + * position in the source, which enables repeatable execution for debugging. 
+ * The SyntheticBoundedInput configurable parameters are defined in {@link + * SyntheticBoundedInput.SourceOptions}. + * + * To read a {@link PCollection} of {@code KV} from {@link + * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic + * source with synthetic source options. + * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance. + * An example is below: + * {@code + * Pipeline p = ...; + * SyntheticBoundedInput.SourceOptions sso = ...; + * + * // Construct the synthetic input with synthetic source options. + * PCollection> input = p.apply(SyntheticBoundedInput.readFrom(sso)); + * } + */ +public class SyntheticBoundedInput { + /** + * Read from the synthetic source options. + */ + public static Read.Bounded> readFrom(SourceOptions options) { +checkNotNull(options, "Input synthetic source options should not be null."); +options.validate(); +return Read.from(new SyntheticBoundedSource(options)); + } + + /** + * A {@link SyntheticBoundedSource} that reads {@code KV}. + */ + public static class SyntheticBoundedSource extends OffsetBasedSource> { +private static final long serialVersionUID = 0; +private static final Logger LOG = LoggerFactory.getLogger(SyntheticBoundedSource.class); + +private final SourceOptions sourceOptions; + +public SyntheticBoundedSource(SourceOptions sourceOptions) { + this(0, sourceOptions.numRecords, sourceOptions); +
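The Javadoc quoted above says records are generated deterministically from their position in the source, with a configurable fraction assigned to a fixed set of hot keys. The following is only a minimal sketch of that idea, not the code from the pull request: the class name, the `keyFor` helper, and the way the seed is mixed with the position are all hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch; not the class from the pull request. */
public class DeterministicRecordsSketch {

  /** Returns the key for the record at {@code position}; equal inputs always give equal keys. */
  public static byte[] keyFor(
      long seed, long position, int keySizeBytes, int numHotKeys, double hotKeyFraction) {
    // Seed a fresh generator from (seed, position) so the result does not depend on read order.
    Random random = new Random(seed ^ (position * 0x9E3779B97F4A7C15L));
    byte[] key = new byte[keySizeBytes];
    if (numHotKeys > 0 && random.nextDouble() < hotKeyFraction) {
      // Hot record: choose one of numHotKeys fixed keys uniformly.
      int hotIndex = random.nextInt(numHotKeys);
      new Random(seed + hotIndex).nextBytes(key);
    } else {
      // Remaining records get a position-specific "random" key.
      random.nextBytes(key);
    }
    return key;
  }

  public static void main(String[] args) {
    byte[] first = keyFor(123456L, 7, 10, 3, 0.3);
    byte[] second = keyFor(123456L, 7, 10, 3, 0.3);
    // Regenerating the same position yields the same bytes.
    System.out.println(Arrays.equals(first, second));
  }
}
```

Because each key depends only on the seed and the position, a split of the source can regenerate any sub-range without replaying the records before it, which is what makes repeated runs comparable.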
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115948&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115948 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 26/Jun/18 12:37 Start Date: 26/Jun/18 12:37 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198123166 ## File path: sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java ## @@ -0,0 +1,452 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.beam.sdk.io.common.synthetic; + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; +import static org.apache.commons.math3.stat.StatUtils.sum; + +import com.fasterxml.jackson.annotation.JsonProperty; +import com.fasterxml.jackson.databind.annotation.JsonDeserialize; +import com.google.common.base.MoreObjects; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.NoSuchElementException; +import java.util.Random; +import org.apache.beam.sdk.coders.ByteArrayCoder; +import org.apache.beam.sdk.coders.Coder; +import org.apache.beam.sdk.coders.KvCoder; +import org.apache.beam.sdk.io.OffsetBasedSource; +import org.apache.beam.sdk.io.Read; +import org.apache.beam.sdk.options.PipelineOptions; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.commons.math3.distribution.ConstantRealDistribution; +import org.joda.time.Duration; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is + * deterministic. + * + * The {@link SyntheticBoundedSource} generates a {@link PCollection} of + * {@code KV}. A fraction of the generated records {@code KV} are + * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys. + * The remaining generated records are associated with "random" keys. + * Each record will be slowed down by a certain sleep time generated based on the specified sleep + * time distribution when the {@link SyntheticSourceReader} reads each record. + * The record {@code KV} is generated deterministically based on the record's + * position in the source, which enables repeatable execution for debugging. 
+ * The SyntheticBoundedInput configurable parameters are defined in {@link + * SyntheticBoundedInput.SourceOptions}. + * + * To read a {@link PCollection} of {@code KV} from {@link + * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic + * source with synthetic source options. + * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance. + * An example is below: + * {@code + * Pipeline p = ...; + * SyntheticBoundedInput.SourceOptions sso = ...; + * + * // Construct the synthetic input with synthetic source options. + * PCollection> input = p.apply(SyntheticBoundedInput.readFrom(sso)); + * } + */ +public class SyntheticBoundedInput { + /** + * Read from the synthetic source options. + */ + public static Read.Bounded> readFrom(SourceOptions options) { +checkNotNull(options, "Input synthetic source options should not be null."); +options.validate(); +return Read.from(new SyntheticBoundedSource(options)); + } + + /** + * A {@link SyntheticBoundedSource} that reads {@code KV}. + */ + public static class SyntheticBoundedSource extends OffsetBasedSource> { +private static final long serialVersionUID = 0; +private static final Logger LOG = LoggerFactory.getLogger(SyntheticBoundedSource.class); + +private final SourceOptions sourceOptions; + +public SyntheticBoundedSource(SourceOptions sourceOptions) { + this(0, sourceOptions.numRecords, sourceOptions); +
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115940&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115940 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 26/Jun/18 12:37 Start Date: 26/Jun/18 12:37 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198116718 ## File path: sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/package-info.java ## @@ -0,0 +1,19 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +/** Transforms for performing Synthetic Operations in Apache Beam pipelines. */ +package org.apache.beam.sdk.io.common.synthetic; Review comment: I am torn about having this in common. If the goal is to have a synthetic IO, the need may be specific enough that it is better as an independent IO, no? We could then migrate some features from other generators there and write just ParDo transformations on top of it, e.g. to generate data for Nexmark or the examples. This is an automated message from the Apache Git Service. 
To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 115940) Time Spent: 3h (was: 2h 50m) > Performance tests need a way to generate Synthetic data > --- > > Key: BEAM-4432 > URL: https://issues.apache.org/jira/browse/BEAM-4432 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Minor > Time Spent: 3h > Remaining Estimate: 0h > > GenerateSequence falls short in this regard, as we may want to generate > data in custom distributions, or with specific repeatability requirements / > and hardcoded delays for autoscaling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115947&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115947 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 26/Jun/18 12:37 Start Date: 26/Jun/18 12:37 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198113865 ## File path: sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java ## @@ -0,0 +1,452 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.beam.sdk.io.common.synthetic; + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; +import static org.apache.commons.math3.stat.StatUtils.sum; + +import com.fasterxml.jackson.annotation.JsonProperty; +import com.fasterxml.jackson.databind.annotation.JsonDeserialize; +import com.google.common.base.MoreObjects; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.NoSuchElementException; +import java.util.Random; +import org.apache.beam.sdk.coders.ByteArrayCoder; +import org.apache.beam.sdk.coders.Coder; +import org.apache.beam.sdk.coders.KvCoder; +import org.apache.beam.sdk.io.OffsetBasedSource; +import org.apache.beam.sdk.io.Read; +import org.apache.beam.sdk.options.PipelineOptions; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.commons.math3.distribution.ConstantRealDistribution; +import org.joda.time.Duration; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is + * deterministic. + * + * The {@link SyntheticBoundedSource} generates a {@link PCollection} of + * {@code KV}. A fraction of the generated records {@code KV} are + * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys. + * The remaining generated records are associated with "random" keys. + * Each record will be slowed down by a certain sleep time generated based on the specified sleep + * time distribution when the {@link SyntheticSourceReader} reads each record. + * The record {@code KV} is generated deterministically based on the record's + * position in the source, which enables repeatable execution for debugging. 
+ * The SyntheticBoundedInput configurable parameters are defined in {@link + * SyntheticBoundedInput.SourceOptions}. + * + * To read a {@link PCollection} of {@code KV} from {@link + * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic + * source with synthetic source options. + * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance. + * An example is below: + * {@code + * Pipeline p = ...; + * SyntheticBoundedInput.SourceOptions sso = ...; + * + * // Construct the synthetic input with synthetic source options. + * PCollection> input = p.apply(SyntheticBoundedInput.readFrom(sso)); + * } + */ +public class SyntheticBoundedInput { + /** + * Read from the synthetic source options. + */ + public static Read.Bounded> readFrom(SourceOptions options) { +checkNotNull(options, "Input synthetic source options should not be null."); +options.validate(); +return Read.from(new SyntheticBoundedSource(options)); + } + + /** + * A {@link SyntheticBoundedSource} that reads {@code KV}. + */ + public static class SyntheticBoundedSource extends OffsetBasedSource> { +private static final long serialVersionUID = 0; +private static final Logger LOG = LoggerFactory.getLogger(SyntheticBoundedSource.class); + +private final SourceOptions sourceOptions; + +public SyntheticBoundedSource(SourceOptions sourceOptions) { + this(0, sourceOptions.numRecords, sourceOptions); +
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115936&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115936 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 26/Jun/18 12:37 Start Date: 26/Jun/18 12:37 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198074385 ## File path: sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java ## @@ -0,0 +1,452 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.beam.sdk.io.common.synthetic; + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; +import static org.apache.commons.math3.stat.StatUtils.sum; + +import com.fasterxml.jackson.annotation.JsonProperty; +import com.fasterxml.jackson.databind.annotation.JsonDeserialize; +import com.google.common.base.MoreObjects; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.NoSuchElementException; +import java.util.Random; +import org.apache.beam.sdk.coders.ByteArrayCoder; +import org.apache.beam.sdk.coders.Coder; +import org.apache.beam.sdk.coders.KvCoder; +import org.apache.beam.sdk.io.OffsetBasedSource; +import org.apache.beam.sdk.io.Read; +import org.apache.beam.sdk.options.PipelineOptions; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.commons.math3.distribution.ConstantRealDistribution; +import org.joda.time.Duration; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is + * deterministic. + * + * The {@link SyntheticBoundedSource} generates a {@link PCollection} of + * {@code KV}. A fraction of the generated records {@code KV} are + * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys. + * The remaining generated records are associated with "random" keys. + * Each record will be slowed down by a certain sleep time generated based on the specified sleep + * time distribution when the {@link SyntheticSourceReader} reads each record. + * The record {@code KV} is generated deterministically based on the record's + * position in the source, which enables repeatable execution for debugging. 
+ * The SyntheticBoundedInput configurable parameters are defined in {@link + * SyntheticBoundedInput.SourceOptions}. + * + * To read a {@link PCollection} of {@code KV} from {@link + * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic + * source with synthetic source options. + * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance. + * An example is below: + * {@code + * Pipeline p = ...; + * SyntheticBoundedInput.SourceOptions sso = ...; + * + * // Construct the synthetic input with synthetic source options. + * PCollection> input = p.apply(SyntheticBoundedInput.readFrom(sso)); Review comment: Please add a test for `SyntheticBoundedInput` use. Also validate in the tests the case of the complete options on `SourceOptions`. I found a validation issue with the HashFunction not being available: because it is transient, it probably gets lost on deserialization. Issue Time Tracking --- Worklog Id: (was: 115936) Time Spent: 2.5h (was: 2h 20m)
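The review comment above suspects that a transient field is lost on deserialization. The following is a minimal sketch of that failure mode and one common remedy, assuming plain Java serialization; the class `TransientFieldSketch`, the nested `Options` class, and its `hashFunction` field are hypothetical stand-ins, not the types from the pull request.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.LongUnaryOperator;

/** Hypothetical sketch; names do not come from the pull request. */
public class TransientFieldSketch {

  /** Options holding a derived helper that is not part of the serialized form. */
  public static class Options implements Serializable {
    private static final long serialVersionUID = 0;
    public final long seed;
    // Skipped by Java serialization, so it would be null after deserialization
    // unless it is rebuilt explicitly.
    private transient LongUnaryOperator hashFunction;

    public Options(long seed) {
      this.seed = seed;
      this.hashFunction = buildHashFunction(seed);
    }

    private static LongUnaryOperator buildHashFunction(long seed) {
      return value -> (value ^ seed) * 0x9E3779B97F4A7C15L;
    }

    // Invoked by Java serialization: rebuild the transient helper from serialized state.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
      in.defaultReadObject();
      this.hashFunction = buildHashFunction(seed);
    }

    public long hashOf(long value) {
      return hashFunction.applyAsLong(value);
    }
  }

  /** Serializes and deserializes the options in memory. */
  public static Options roundTrip(Options options) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(options);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (Options) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Options original = new Options(123456L);
    Options restored = roundTrip(original);
    // Without the readObject hook, hashOf would throw a NullPointerException here.
    System.out.println(original.hashOf(7) == restored.hashOf(7));
  }
}
```

A round-trip test of this shape is one way to cover the "complete options" case the reviewer asks for: build the options, serialize and deserialize them, then exercise every derived field.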
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115946&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115946 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 26/Jun/18 12:37 Start Date: 26/Jun/18 12:37 Worklog Time Spent: 10m Work Description: iemejia commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#discussion_r198112035 ## File path: sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInputTest.java ## @@ -0,0 +1,223 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.beam.sdk.io.common.synthetic; + +import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromIntegerDistribution; +import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromRealDistribution; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import com.fasterxml.jackson.core.JsonParseException; +import com.fasterxml.jackson.databind.ObjectMapper; +import java.io.IOException; +import java.util.List; +import org.apache.beam.sdk.io.BoundedSource; +import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SourceOptions; +import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SyntheticBoundedSource; +import org.apache.beam.sdk.options.PipelineOptions; +import org.apache.beam.sdk.options.PipelineOptionsFactory; +import org.apache.beam.sdk.testing.SourceTestUtils; +import org.apache.beam.sdk.values.KV; +import org.apache.commons.math3.distribution.ConstantRealDistribution; +import org.apache.commons.math3.distribution.ZipfDistribution; +import org.junit.Before; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.ExpectedException; +import org.junit.runner.RunWith; +import org.junit.runners.JUnit4; + +/** Unit tests for {@link SyntheticBoundedInput}. 
*/ +@RunWith(JUnit4.class) +public class SyntheticBoundedInputTest { + @Rule public final ExpectedException thrown = ExpectedException.none(); + + private SourceOptions testSourceOptions = new SourceOptions(); + + @Before + public void setUp() { +testSourceOptions.splitPointFrequencyRecords = 1; +testSourceOptions.numRecords = 10; +testSourceOptions.keySizeBytes = 10; +testSourceOptions.valueSizeBytes = 20; +testSourceOptions.numHotKeys = 3; +testSourceOptions.hotKeyFraction = 0.3; +testSourceOptions.setSeed(123456); +testSourceOptions.bundleSizeDistribution = +fromIntegerDistribution(new ZipfDistribution(100, 2.5)); +testSourceOptions.forceNumInitialBundles = null; + } + + private SourceOptions fromString(String jsonString) throws IOException { +ObjectMapper mapper = new ObjectMapper(); +SourceOptions result = mapper.readValue(jsonString, SourceOptions.class); +result.validate(); +return result; + } + + @Test + public void testInvalidSourceOptionsJsonFormat() throws Exception { +thrown.expect(JsonParseException.class); +String syntheticSourceOptions = "input:unknown URI"; +fromString(syntheticSourceOptions); + } + + @Test + public void testFromString() throws Exception { +String syntheticSourceOptions = + "{\"numRecords\":100,\"splitPointFrequencyRecords\":10,\"keySizeBytes\":10," ++ "\"valueSizeBytes\":20,\"numHotKeys\":3," ++ "\"hotKeyFraction\":0.3,\"seed\":123456," ++ "\"bundleSizeDistribution\":{\"type\":\"const\",\"const\":42}," ++ "\"forceNumInitialBundles\":10,\"progressShape\":\"LINEAR_REGRESSING\"" ++ "}"; +SourceOptions sourceOptions = fromString(syntheticSourceOptions); +assertEquals(100, sourceOptions.numRecords); +assertEquals(10, sourceOptions.splitPointFrequencyRecords); +assertEquals(10, sourceOptions.keySizeBytes); +assertEquals(20, sourceOptions.valueSizeBytes); +assertEquals(3, sourceOptions.numHotKeys); +assertEquals(0.3, sourceOptions.hotKey
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=114339&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-114339 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 21/Jun/18 13:46 Start Date: 21/Jun/18 13:46 Worklog Time Spent: 10m Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-399109148 @aaltay I discussed the urgency of this with Pablo on Slack because I was not available to review it, and we agreed that I would do it as soon as I can: hopefully today, tomorrow at the latest. Issue Time Tracking --- Worklog Id: (was: 114339) Time Spent: 2h 20m (was: 2h 10m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=113968&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-113968 ] ASF GitHub Bot logged work on BEAM-4432: Author: ASF GitHub Bot Created on: 20/Jun/18 23:20 Start Date: 20/Jun/18 23:20 Worklog Time Spent: 10m Work Description: aaltay commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines URL: https://github.com/apache/beam/pull/5519#issuecomment-398927561 What is the state of this PR? @iemejia, are you planning to review? @pabloem, I have an open question about whether we can remove zipf or not; could you take a look at that? Issue Time Tracking --- Worklog Id: (was: 113968) Time Spent: 2h 10m (was: 2h)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=110958&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-110958 ]

ASF GitHub Bot logged work on BEAM-4432:
    Author: ASF GitHub Bot
    Created on: 12/Jun/18 05:50
    Start Date: 12/Jun/18 05:50
    Worklog Time Spent: 10m
    Work Description: pabloem commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
    URL: https://github.com/apache/beam/pull/5519#discussion_r194620716

## File path: sdks/python/setup.py ##
@@ -117,6 +117,7 @@ def get_version():
 REQUIRED_TEST_PACKAGES = [
     'nose>=1.3.7',
     'pyhamcrest>=1.9,<2.0',
+    'numpy>=1.14.3',

Review comment:
Numpy is needed here because we do a zipf distribution (i.e. heavily weighted towards a few keys). We can define an 'extra' that is `'perftests'`, so that numpy is only installed when we do `pip install -e .[perftest]`. WDYT?

Issue Time Tracking
---
    Worklog Id: (was: 110958)
    Time Spent: 2h (was: 1h 50m)
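
The 'extra' proposed in the comment above could look roughly like the following. This is only a sketch of the idea, not the change that was merged: the extra name `perftest`, the `PERFTEST_PACKAGES` list, and the `SETUP_KWARGS` dictionary are hypothetical names chosen for illustration, and the comment itself spells the extra two ways (`perftests` and `perftest`). `tests_require` and `extras_require` are standard setuptools keyword arguments.

```python
# Sketch of a setup.py fragment that keeps numpy out of the default test
# dependencies and exposes it through an optional extra instead.

REQUIRED_TEST_PACKAGES = [
    'nose>=1.3.7',
    'pyhamcrest>=1.9,<2.0',
]

# Hypothetical list: packages needed only by the synthetic-data perf tests.
PERFTEST_PACKAGES = [
    'numpy>=1.14.3',
]

# Hypothetical holder for the keyword arguments that would be passed to
# setuptools.setup(**SETUP_KWARGS) alongside the package metadata.
SETUP_KWARGS = {
    'tests_require': REQUIRED_TEST_PACKAGES,
    # The extra name 'perftest' is an assumption for this sketch.
    'extras_require': {'perftest': PERFTEST_PACKAGES},
}
```

With a layout like this, a plain `pip install -e .` would not pull in numpy, while `pip install -e .[perftest]` would.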
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=110957&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-110957 ]

ASF GitHub Bot logged work on BEAM-4432:
    Author: ASF GitHub Bot
    Created on: 12/Jun/18 05:48
    Start Date: 12/Jun/18 05:48
    Worklog Time Spent: 10m
    Work Description: pabloem commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
    URL: https://github.com/apache/beam/pull/5519#discussion_r194620716

## File path: sdks/python/setup.py ##
@@ -117,6 +117,7 @@ def get_version():
 REQUIRED_TEST_PACKAGES = [
     'nose>=1.3.7',
     'pyhamcrest>=1.9,<2.0',
+    'numpy>=1.14.3',

Review comment:
Numpy is needed here because we do a zipf distribution (i.e. heavily weighted towards a few keys). Another option would be to have these files not be part of Beam, but since they will be used for perf tests, it'll likely be worth having them directly in Beam.

Issue Time Tracking
---
    Worklog Id: (was: 110957)
    Time Spent: 1h 50m (was: 1h 40m)
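
To illustrate what "heavily weighted towards a few keys" means, here is a minimal standard-library sketch of a Zipf-like key generator. It is not the code from the pull request and does not use numpy: it samples from a bounded Zipf distribution over a fixed number of keys, which differs from `numpy.random.zipf` (unbounded support). The function name `bounded_zipf_keys` and its parameters are hypothetical.

```python
import random
from collections import Counter


def bounded_zipf_keys(num_keys, exponent, count, seed=0):
    """Draw `count` keys from 0..num_keys-1 with P(k) proportional to 1 / (k + 1) ** exponent."""
    rng = random.Random(seed)  # a fixed seed keeps the generated data repeatable
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_keys + 1)]
    return rng.choices(range(num_keys), weights=weights, k=count)


keys = bounded_zipf_keys(num_keys=100, exponent=1.5, count=10000)
histogram = Counter(keys)  # key 0 dominates; high-numbered keys are rare
```

Passing the same `seed` reproduces the same key sequence, which matches the repeatability requirement mentioned in the issue description.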
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=110956&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-110956 ]

ASF GitHub Bot logged work on BEAM-4432:
    Author: ASF GitHub Bot
    Created on: 12/Jun/18 05:47
    Start Date: 12/Jun/18 05:47
    Worklog Time Spent: 10m
    Work Description: pabloem commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
    URL: https://github.com/apache/beam/pull/5519#discussion_r194620610

## File path: sdks/python/scripts/generate_pydoc.sh ##
@@ -174,6 +174,7 @@ nitpicky = True
 nitpick_ignore = []
 nitpick_ignore += [('py:class', iden) for iden in ignore_identifiers]
 nitpick_ignore += [('py:obj', iden) for iden in ignore_identifiers]
+nitpick_ignore += [('py:exc', 'ValueError')]

Review comment:
I found this issue in Sphinx: https://github.com/sphinx-doc/sphinx/issues/1034

Issue Time Tracking
---
    Worklog Id: (was: 110956)
    Time Spent: 1h 40m (was: 1.5h)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=110951&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-110951 ]

ASF GitHub Bot logged work on BEAM-4432:
    Author: ASF GitHub Bot
    Created on: 12/Jun/18 04:58
    Start Date: 12/Jun/18 04:58
    Worklog Time Spent: 10m
    Work Description: aaltay commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
    URL: https://github.com/apache/beam/pull/5519#discussion_r194614742

## File path: sdks/python/scripts/generate_pydoc.sh ##
@@ -174,6 +174,7 @@ nitpicky = True
 nitpick_ignore = []
 nitpick_ignore += [('py:class', iden) for iden in ignore_identifiers]
 nitpick_ignore += [('py:obj', iden) for iden in ignore_identifiers]
+nitpick_ignore += [('py:exc', 'ValueError')]

Review comment:
Do you know why?

Issue Time Tracking
---
    Worklog Id: (was: 110951)
    Time Spent: 1.5h (was: 1h 20m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=110950&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-110950 ]

ASF GitHub Bot logged work on BEAM-4432:
    Author: ASF GitHub Bot
    Created on: 12/Jun/18 04:58
    Start Date: 12/Jun/18 04:58
    Worklog Time Spent: 10m
    Work Description: aaltay commented on a change in pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
    URL: https://github.com/apache/beam/pull/5519#discussion_r194614733

## File path: sdks/python/setup.py ##
@@ -117,6 +117,7 @@ def get_version():
 REQUIRED_TEST_PACKAGES = [
     'nose>=1.3.7',
     'pyhamcrest>=1.9,<2.0',
+    'numpy>=1.14.3',

Review comment:
This is a rather big dependency. Is it possible for us to not use it?

Issue Time Tracking
---
    Worklog Id: (was: 110950)
    Time Spent: 1h 20m (was: 1h 10m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=109865&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-109865 ]

ASF GitHub Bot logged work on BEAM-4432:
    Author: ASF GitHub Bot
    Created on: 07/Jun/18 21:34
    Start Date: 07/Jun/18 21:34
    Worklog Time Spent: 10m
    Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
    URL: https://github.com/apache/beam/pull/5519#issuecomment-395572681

Hi Pablo, my apologies, I have been quite busy these last few days. I expect to do a first round of the Java part tomorrow.

Issue Time Tracking
---
    Worklog Id: (was: 109865)
    Time Spent: 1h 10m (was: 1h)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=109852&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-109852 ]

ASF GitHub Bot logged work on BEAM-4432:
    Author: ASF GitHub Bot
    Created on: 07/Jun/18 20:36
    Start Date: 07/Jun/18 20:36
    Worklog Time Spent: 10m
    Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
    URL: https://github.com/apache/beam/pull/5519#issuecomment-395556195

Hi Pablo, my apologies, I have been quite busy these last few days. I expect to do a first round of the Java part tomorrow.

Issue Time Tracking
---
    Worklog Id: (was: 109852)
    Time Spent: 1h (was: 50m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=107762&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-107762 ]

ASF GitHub Bot logged work on BEAM-4432:
    Author: ASF GitHub Bot
    Created on: 31/May/18 17:12
    Start Date: 31/May/18 17:12
    Worklog Time Spent: 10m
    Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
    URL: https://github.com/apache/beam/pull/5519#issuecomment-393605464

r: @aaltay
cc: @iemejia
Starting the review process, but leaving space for discussion as we go.

Issue Time Tracking
---
    Worklog Id: (was: 107762)
    Time Spent: 50m (was: 40m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=107390&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-107390 ]

ASF GitHub Bot logged work on BEAM-4432:
    Author: ASF GitHub Bot
    Created on: 30/May/18 21:38
    Start Date: 30/May/18 21:38
    Worklog Time Spent: 10m
    Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
    URL: https://github.com/apache/beam/pull/5519#issuecomment-393328065

Removing lambdas that take advantage of tuple unpacking in function arguments: https://www.python.org/dev/peps/pep-3113/

Issue Time Tracking
---
    Worklog Id: (was: 107390)
    Time Spent: 40m (was: 0.5h)
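
For context on the comment above: PEP 3113 removed tuple parameter unpacking in Python 3, so a lambda such as `lambda (key, values): ...` is a syntax error there. The sketch below shows two common rewrites. The data and the names `pairs` and `count_values` are made up for illustration and are not taken from the pull request.

```python
# Python 2 only (a SyntaxError on Python 3 after PEP 3113):
#     lambda (key, values): (key, len(values))

pairs = [('a', [1, 2, 3]), ('b', [4])]

# Option 1: accept the tuple as one argument and index into it.
counts_by_index = list(map(lambda kv: (kv[0], len(kv[1])), pairs))


# Option 2: unpack explicitly inside a named function.
def count_values(kv):
    key, values = kv
    return key, len(values)


counts_by_name = [count_values(kv) for kv in pairs]
```

Both forms run unchanged on Python 2 and Python 3, which is typically the reason for making this kind of change in a codebase that supports both.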
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=107380&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-107380 ]

ASF GitHub Bot logged work on BEAM-4432:
    Author: ASF GitHub Bot
    Created on: 30/May/18 21:12
    Start Date: 30/May/18 21:12
    Worklog Time Spent: 10m
    Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
    URL: https://github.com/apache/beam/pull/5519#issuecomment-393321132

Passing Java PreCommit: https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/5907/

Issue Time Tracking
---
    Worklog Id: (was: 107380)
    Time Spent: 0.5h (was: 20m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=107379&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-107379 ]

ASF GitHub Bot logged work on BEAM-4432:
    Author: ASF GitHub Bot
    Created on: 30/May/18 21:12
    Start Date: 30/May/18 21:12
    Worklog Time Spent: 10m
    Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
    URL: https://github.com/apache/beam/pull/5519#issuecomment-393320967

Also adding ValueError as a nitpick_ignore element due to a Sphinx bug: https://github.com/sphinx-doc/sphinx/issues/1034

Issue Time Tracking
---
    Worklog Id: (was: 107379)
    Time Spent: 20m (was: 10m)
[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data
[ https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=107350&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-107350 ]

ASF GitHub Bot logged work on BEAM-4432:
    Author: ASF GitHub Bot
    Created on: 30/May/18 20:08
    Start Date: 30/May/18 20:08
    Worklog Time Spent: 10m
    Work Description: pabloem opened a new pull request #5519: [BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
    URL: https://github.com/apache/beam/pull/5519

Adding sources to produce Synthetic data in Beam. These will be used in subsequent performance tests that validate the performance of basic Beam operations.

Issue Time Tracking
---
    Worklog Id: (was: 107350)
    Time Spent: 10m
    Remaining Estimate: 0h