[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-16 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=123540=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-123540
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 16/Jul/18 08:48
Start Date: 16/Jul/18 08:48
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-405182970
 
 
   Yay! Nice to see this merged! Thanks @pabloem and @iemejia. :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 123540)
Time Spent: 8h 50m  (was: 8h 40m)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate 
> data in custom distributions, or with specific repeatability requirements 
> and hardcoded delays for autoscaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-15 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=123414=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-123414
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 15/Jul/18 14:56
Start Date: 15/Jul/18 14:56
Worklog Time Spent: 10m 
  Work Description: pabloem closed pull request #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy b/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy
index 49147654ba9..3b7ce252161 100644
--- a/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy
+++ b/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy
@@ -341,6 +341,7 @@ class BeamModulePlugin implements Plugin<Project> {
 commons_io_1x   : "commons-io:commons-io:1.3.2",
 commons_io_2x   : "commons-io:commons-io:2.5",
 commons_lang3   : "org.apache.commons:commons-lang3:3.6",
+commons_math3   : "org.apache.commons:commons-math3:3.6.1",
 datastore_v1_proto_client   : "com.google.cloud.datastore:datastore-v1-proto-client:1.4.0",
 datastore_v1_protos : "com.google.cloud.datastore:datastore-v1-protos:1.3.0",
 error_prone_annotations : "com.google.errorprone:error_prone_annotations:2.0.15",
diff --git a/sdks/java/io/synthetic/build.gradle b/sdks/java/io/synthetic/build.gradle
new file mode 100644
index 000..3d6be8eb13c
--- /dev/null
+++ b/sdks/java/io/synthetic/build.gradle
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+apply plugin: org.apache.beam.gradle.BeamModulePlugin
+applyJavaNature()
+
+description = "Apache Beam :: SDKs :: Java :: IO :: Synthetic"
+ext.summary = "Generators of Synthetic IO for Testing."
+
+dependencies {
+  compile library.java.joda_time
+  compile library.java.commons_math3
+  shadow library.java.jackson_core
+  shadow library.java.jackson_annotations
+  shadow library.java.jackson_databind
+  testCompile library.java.guava
+  testCompile library.java.junit
+  testCompile library.java.hamcrest_core
+  testCompile library.java.hamcrest_library
+  shadow project(path: ":beam-sdks-java-core", configuration: "shadow")
+}
diff --git a/sdks/java/io/synthetic/src/main/java/org/apache/beam/sdk/io/synthetic/SyntheticBoundedIO.java b/sdks/java/io/synthetic/src/main/java/org/apache/beam/sdk/io/synthetic/SyntheticBoundedIO.java
new file mode 100644
index 000..d9f652ace9d
--- /dev/null
+++ b/sdks/java/io/synthetic/src/main/java/org/apache/beam/sdk/io/synthetic/SyntheticBoundedIO.java
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.synthetic;
+
+import static com.google.common.base.Preconditions.checkArgument;
+import static 

[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=123100=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-123100
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 13/Jul/18 23:24
Start Date: 13/Jul/18 23:24
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-404978005
 
 
   Run Python PreCommit




Issue Time Tracking
---

Worklog Id: (was: 123100)
Time Spent: 8.5h  (was: 8h 20m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=123034=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-123034
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 13/Jul/18 20:46
Start Date: 13/Jul/18 20:46
Worklog Time Spent: 10m 
  Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-404949448
 
 
   Run Java PreCommit




Issue Time Tracking
---

Worklog Id: (was: 123034)
Time Spent: 8h 20m  (was: 8h 10m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-10 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=121626=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-121626
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 10/Jul/18 21:27
Start Date: 10/Jul/18 21:27
Worklog Time Spent: 10m 
  Work Description: pabloem commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r201445730
 
 

 ##
 File path: sdks/java/io/synthetic/pom.xml
 ##
 @@ -0,0 +1,84 @@
+
 
 Review comment:
   Done.




Issue Time Tracking
---

Worklog Id: (was: 121626)
Time Spent: 8h 10m  (was: 8h)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-10 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=121407=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-121407
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 10/Jul/18 13:29
Start Date: 10/Jul/18 13:29
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r201340262
 
 

 ##
 File path: 
sdks/java/io/synthetic/src/main/java/org/apache/beam/sdk/io/synthetic/SyntheticBoundedIO.java
 ##
 @@ -0,0 +1,436 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.synthetic;
+
+import static com.google.common.base.Preconditions.checkArgument;
+import static com.google.common.base.Preconditions.checkNotNull;
+import static org.apache.commons.math3.stat.StatUtils.sum;
+
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
+import com.google.common.base.MoreObjects;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.NoSuchElementException;
+import java.util.Random;
+import java.util.stream.Collectors;
+import org.apache.beam.sdk.coders.ByteArrayCoder;
+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.KvCoder;
+import org.apache.beam.sdk.io.OffsetBasedSource;
+import org.apache.beam.sdk.io.Read;
+import org.apache.beam.sdk.io.range.OffsetRange;
+import org.apache.beam.sdk.options.PipelineOptions;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.commons.math3.distribution.ConstantRealDistribution;
+import org.joda.time.Duration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * This {@link SyntheticBoundedIO} class provides a parameterizable batch custom source that is
+ * deterministic.
+ *
+ * The {@link SyntheticBoundedSource} generates a {@link PCollection} of {@code KV<byte[],
+ * byte[]>}. A fraction of the generated records {@code KV<byte[], byte[]>} are associated with
+ * "hot" keys, which are uniformly distributed over a fixed number of hot keys. The remaining
+ * generated records are associated with "random" keys. Each record will be slowed down by a certain
+ * sleep time generated based on the specified sleep time distribution when the {@link
+ * SyntheticSourceReader} reads each record. The record {@code KV<byte[], byte[]>} is generated
+ * deterministically based on the record's position in the source, which enables repeatable
+ * execution for debugging. The SyntheticBoundedInput configurable parameters are defined in {@link
+ * SyntheticBoundedIO.SyntheticSourceOptions}.
+ *
+ * To read a {@link PCollection} of {@code KV<byte[], byte[]>} from {@link SyntheticBoundedIO},
+ * use {@link SyntheticBoundedIO#readFrom} to construct the synthetic source with synthetic source
+ * options. See {@link SyntheticBoundedIO.SyntheticSourceOptions} for how to construct an instance.
+ * An example is below:
+ *
+ * {@code
+ * Pipeline p = ...;
+ * SyntheticBoundedInput.SourceOptions sso = ...;
+ *
+ * // Construct the synthetic input with synthetic source options.
+ * PCollection<KV<byte[], byte[]>> input = p.apply(SyntheticBoundedInput.readFrom(sso));
+ * }
+ */
+public class SyntheticBoundedIO {
 
 Review comment:
   I would probably prefer SyntheticIO with a .bounded() method, but that's 
nitpicking and could be changed in the future if we ever have an unbounded one.
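
The Javadoc quoted above says each record is derived deterministically from its position in the source, with a fraction of records mapped onto a fixed set of "hot" keys. The following is a minimal sketch of that idea only, not the code from the pull request: the class name SyntheticRecordSketch, the seed, the sizes, and the hot-key parameters are all assumptions made for illustration, and a plain holder class stands in for KV<byte[], byte[]>. The per-record sleep delay is left out.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative sketch only; not the SyntheticBoundedIO code from the pull request. */
final class SyntheticRecordSketch {
  // All constants below are assumed values picked for the example.
  static final long SEED = 42L;
  static final int NUM_HOT_KEYS = 3;
  static final double HOT_KEY_FRACTION = 0.25;
  static final int KEY_SIZE = 8;
  static final int VALUE_SIZE = 16;

  /** Plain holder standing in for a KV of two byte arrays. */
  static final class Record {
    final byte[] key;
    final byte[] value;
    final boolean hot;

    Record(byte[] key, byte[] value, boolean hot) {
      this.key = key;
      this.value = value;
      this.hot = hot;
    }
  }

  /** Returns the record for a position; equal positions always give equal records. */
  static Record recordAt(long position) {
    // The generator is seeded from the position only, so the result does not
    // depend on the order in which positions are read.
    Random random = new Random(SEED + position);
    boolean hot = random.nextDouble() < HOT_KEY_FRACTION;
    byte[] key = new byte[KEY_SIZE];
    if (hot) {
      // Pick one of a fixed number of hot keys uniformly, then derive its bytes
      // from the hot-key index alone so every hit on that index shares a key.
      int hotIndex = random.nextInt(NUM_HOT_KEYS);
      new Random(hotIndex).nextBytes(key);
    } else {
      random.nextBytes(key);
    }
    byte[] value = new byte[VALUE_SIZE];
    random.nextBytes(value);
    return new Record(key, value, hot);
  }

  public static void main(String[] args) {
    Record first = recordAt(7);
    Record again = recordAt(7);
    // Both lines print "true": the record depends only on its position.
    System.out.println(Arrays.equals(first.key, again.key));
    System.out.println(Arrays.equals(first.value, again.value));
  }
}
```

Because nothing but the position feeds the generator, a reader that starts in the middle of a range (for example after a split) would reproduce the same records, which is the repeatability property the Javadoc refers to.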




Issue Time Tracking
---

Worklog Id: (was: 121407)
Time Spent: 7h 50m  (was: 7h 40m)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-10 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=121408=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-121408
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 10/Jul/18 13:29
Start Date: 10/Jul/18 13:29
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r201339767
 
 

 ##
 File path: sdks/java/io/synthetic/pom.xml
 ##
 @@ -0,0 +1,84 @@
+
 
 Review comment:
   I suppose we have to remove this one now.




Issue Time Tracking
---

Worklog Id: (was: 121408)
Time Spent: 8h  (was: 7h 50m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=121158=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-121158
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 10/Jul/18 01:16
Start Date: 10/Jul/18 01:16
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-403668051
 
 
   Rebased.




Issue Time Tracking
---

Worklog Id: (was: 121158)
Time Spent: 7h 40m  (was: 7.5h)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=120896=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-120896
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 09/Jul/18 17:08
Start Date: 09/Jul/18 17:08
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-403550594
 
 
   @iemejia PTAL




Issue Time Tracking
---

Worklog Id: (was: 120896)
Time Spent: 7.5h  (was: 7h 20m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=120852=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-120852
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 09/Jul/18 16:25
Start Date: 09/Jul/18 16:25
Worklog Time Spent: 10m 
  Work Description: pabloem removed a comment on issue #5519: [BEAM-4432] 
Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-403537539
 
 
   Run Java PreCommit
   
   




Issue Time Tracking
---

Worklog Id: (was: 120852)
Time Spent: 7h 10m  (was: 7h)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=120853=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-120853
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 09/Jul/18 16:25
Start Date: 09/Jul/18 16:25
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-403537899
 
 
   Run Java PreCommit




Issue Time Tracking
---

Worklog Id: (was: 120853)
Time Spent: 7h 20m  (was: 7h 10m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=120851=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-120851
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 09/Jul/18 16:24
Start Date: 09/Jul/18 16:24
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-403537539
 
 
   Run Java PreCommit
   
   




Issue Time Tracking
---

Worklog Id: (was: 120851)
Time Spent: 7h  (was: 6h 50m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-05 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=119371=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-119371
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 05/Jul/18 14:04
Start Date: 05/Jul/18 14:04
Worklog Time Spent: 10m 
  Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-402732984
 
 
    this is weird, tests pass ok locally, let's try again




Issue Time Tracking
---

Worklog Id: (was: 119371)
Time Spent: 6h 40m  (was: 6.5h)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-05 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=119372=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-119372
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 05/Jul/18 14:04
Start Date: 05/Jul/18 14:04
Worklog Time Spent: 10m 
  Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-402733023
 
 
   Run Java PreCommit




Issue Time Tracking
---

Worklog Id: (was: 119372)
Time Spent: 6h 50m  (was: 6h 40m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=118864=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-118864
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 03/Jul/18 22:06
Start Date: 03/Jul/18 22:06
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-402305902
 
 
   ugh.. I can't repro this on my environment : /




Issue Time Tracking
---

Worklog Id: (was: 118864)
Time Spent: 6.5h  (was: 6h 20m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=118781=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-118781
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 03/Jul/18 17:32
Start Date: 03/Jul/18 17:32
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-402235246
 
 
   I've renamed to `SyntheticBoundedIO`, but I think it makes sense to have 
`readFrom(SyntheticSourceOptions options)` as an entry point to add to a 
pipeline. WDYT?




Issue Time Tracking
---

Worklog Id: (was: 118781)
Time Spent: 6h 10m  (was: 6h)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate
> data in custom distributions, or with specific repeatability requirements / 
> and hardcoded delays for autoscaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=118782=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-118782
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 03/Jul/18 17:32
Start Date: 03/Jul/18 17:32
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-402235268
 
 
   PTAL : )




Issue Time Tracking
---

Worklog Id: (was: 118782)
Time Spent: 6h 20m  (was: 6h 10m)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate
> data in custom distributions, or with specific repeatability requirements / 
> and hardcoded delays for autoscaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=118616=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-118616
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 03/Jul/18 08:25
Start Date: 03/Jul/18 08:25
Worklog Time Spent: 10m 
  Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-402055230
 
 
   Thanks Pablo, eager to review and merge this when ready (sorry for the wait 
before).




Issue Time Tracking
---

Worklog Id: (was: 118616)
Time Spent: 6h  (was: 5h 50m)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate
> data in custom distributions, or with specific repeatability requirements / 
> and hardcoded delays for autoscaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-07-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=118509=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-118509
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 03/Jul/18 00:22
Start Date: 03/Jul/18 00:22
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-401975854
 
 
   Thanks Ismael! For some reason, I hadn't seen your other comments. Github 
hid that file from me. I've addressed them. Only thing remaining is setting up 
`SyntheticIO.read`. I'll work on that tomorrow.




Issue Time Tracking
---

Worklog Id: (was: 118509)
Time Spent: 5h 50m  (was: 5h 40m)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate
> data in custom distributions, or with specific repeatability requirements / 
> and hardcoded delays for autoscaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-29 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=117404=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-117404
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 29/Jun/18 14:36
Start Date: 29/Jun/18 14:36
Worklog Time Spent: 10m 
  Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-401373701
 
 
   @pabloem maybe we should call it `SyntheticIO.read()` or gen or something 
like that now to make it closer to the other IOs, no? Do you plan to address 
some of the other comments (most are quite minor but still nice to have).




Issue Time Tracking
---

Worklog Id: (was: 117404)
Time Spent: 5h 40m  (was: 5.5h)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate
> data in custom distributions, or with specific repeatability requirements / 
> and hardcoded delays for autoscaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116217=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116217
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 27/Jun/18 00:33
Start Date: 27/Jun/18 00:33
Worklog Time Spent: 10m 
  Work Description: pabloem edited a comment on issue #5519: [BEAM-4432] 
Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-400505021
 
 
   Thanks @iemejia - I've addressed your comments. LMK what you think.




Issue Time Tracking
---

Worklog Id: (was: 116217)
Time Spent: 5.5h  (was: 5h 20m)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate
> data in custom distributions, or with specific repeatability requirements / 
> and hardcoded delays for autoscaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116216=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116216
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 27/Jun/18 00:30
Start Date: 27/Jun/18 00:30
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-400505021
 
 
   Thanks @iemejia - I've addressed some of your concerns




Issue Time Tracking
---

Worklog Id: (was: 116216)
Time Spent: 5h 20m  (was: 5h 10m)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate
> data in custom distributions, or with specific repeatability requirements / 
> and hardcoded delays for autoscaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116213=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116213
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 27/Jun/18 00:29
Start Date: 27/Jun/18 00:29
Worklog Time Spent: 10m 
  Work Description: pabloem commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198336079
 
 

 ##
 File path: 
sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticUtils.java
 ##
 @@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common.synthetic;
+
+import com.google.common.base.Stopwatch;
+import com.google.common.hash.Hashing;
+import com.google.common.util.concurrent.Uninterruptibles;
+import java.util.Random;
+import java.util.concurrent.TimeUnit;
+import org.joda.time.Duration;
+
+/**
+ * Utility functions used in {@link org.apache.beam.sdk.io.common.synthetic}.
+ */
+public class SyntheticUtils {
 
 Review comment:
   Done.




Issue Time Tracking
---

Worklog Id: (was: 116213)
Time Spent: 5h  (was: 4h 50m)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate
> data in custom distributions, or with specific repeatability requirements / 
> and hardcoded delays for autoscaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116209=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116209
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 27/Jun/18 00:29
Start Date: 27/Jun/18 00:29
Worklog Time Spent: 10m 
  Work Description: pabloem commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198322384
 
 

 ##
 File path: 
sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticUtils.java
 ##
 @@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common.synthetic;
+
+import com.google.common.base.Stopwatch;
+import com.google.common.hash.Hashing;
+import com.google.common.util.concurrent.Uninterruptibles;
+import java.util.Random;
+import java.util.concurrent.TimeUnit;
+import org.joda.time.Duration;
+
+/**
+ * Utility functions used in {@link org.apache.beam.sdk.io.common.synthetic}.
+ */
+public class SyntheticUtils {
 
 Review comment:
   Done




Issue Time Tracking
---

Worklog Id: (was: 116209)
Time Spent: 4h 20m  (was: 4h 10m)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate
> data in custom distributions, or with specific repeatability requirements / 
> and hardcoded delays for autoscaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116214=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116214
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 27/Jun/18 00:29
Start Date: 27/Jun/18 00:29
Worklog Time Spent: 10m 
  Work Description: pabloem commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198335721
 
 

 ##
 File path: 
sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/package-info.java
 ##
 @@ -0,0 +1,19 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+/** Transforms for performing Synthetic Operations in Apache Beam pipelines. */
+package org.apache.beam.sdk.io.common.synthetic;
 
 Review comment:
   I think that makes a lot of sense. I added a new component. WDYT?




Issue Time Tracking
---

Worklog Id: (was: 116214)
Time Spent: 5h 10m  (was: 5h)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate
> data in custom distributions, or with specific repeatability requirements / 
> and hardcoded delays for autoscaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116212=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116212
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 27/Jun/18 00:29
Start Date: 27/Jun/18 00:29
Worklog Time Spent: 10m 
  Work Description: pabloem commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198328174
 
 

 ##
 File path: 
sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInputTest.java
 ##
 @@ -0,0 +1,223 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common.synthetic;
+
+import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromIntegerDistribution;
+import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromRealDistribution;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+import com.fasterxml.jackson.core.JsonParseException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.util.List;
+import org.apache.beam.sdk.io.BoundedSource;
+import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SourceOptions;
+import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SyntheticBoundedSource;
+import org.apache.beam.sdk.options.PipelineOptions;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.SourceTestUtils;
+import org.apache.beam.sdk.values.KV;
+import org.apache.commons.math3.distribution.ConstantRealDistribution;
+import org.apache.commons.math3.distribution.ZipfDistribution;
+import org.junit.Before;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.ExpectedException;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/** Unit tests for {@link SyntheticBoundedInput}. */
+@RunWith(JUnit4.class)
+public class SyntheticBoundedInputTest {
+  @Rule public final ExpectedException thrown = ExpectedException.none();
+
+  private SourceOptions testSourceOptions = new SourceOptions();
+
+  @Before
+  public void setUp() {
+    testSourceOptions.splitPointFrequencyRecords = 1;
+    testSourceOptions.numRecords = 10;
+    testSourceOptions.keySizeBytes = 10;
+    testSourceOptions.valueSizeBytes = 20;
+    testSourceOptions.numHotKeys = 3;
+    testSourceOptions.hotKeyFraction = 0.3;
+    testSourceOptions.setSeed(123456);
+    testSourceOptions.bundleSizeDistribution =
+        fromIntegerDistribution(new ZipfDistribution(100, 2.5));
+    testSourceOptions.forceNumInitialBundles = null;
+  }
+
+  private SourceOptions fromString(String jsonString) throws IOException {
+    ObjectMapper mapper = new ObjectMapper();
+    SourceOptions result = mapper.readValue(jsonString, SourceOptions.class);
+    result.validate();
+    return result;
+  }
+
+  @Test
+  public void testInvalidSourceOptionsJsonFormat() throws Exception {
+    thrown.expect(JsonParseException.class);
+    String syntheticSourceOptions = "input:unknown URI";
+    fromString(syntheticSourceOptions);
+  }
+
+  @Test
+  public void testFromString() throws Exception {
+    String syntheticSourceOptions =
+        "{\"numRecords\":100,\"splitPointFrequencyRecords\":10,\"keySizeBytes\":10,"
+            + "\"valueSizeBytes\":20,\"numHotKeys\":3,"
+            + "\"hotKeyFraction\":0.3,\"seed\":123456,"
+            + "\"bundleSizeDistribution\":{\"type\":\"const\",\"const\":42},"
+            + "\"forceNumInitialBundles\":10,\"progressShape\":\"LINEAR_REGRESSING\""
+            + "}";
+    SourceOptions sourceOptions = fromString(syntheticSourceOptions);
+    assertEquals(100, sourceOptions.numRecords);
+    assertEquals(10, sourceOptions.splitPointFrequencyRecords);
+    assertEquals(10, sourceOptions.keySizeBytes);
+    assertEquals(20, sourceOptions.valueSizeBytes);
+    assertEquals(3, sourceOptions.numHotKeys);
+    assertEquals(0.3, 

[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116210=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116210
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 27/Jun/18 00:29
Start Date: 27/Jun/18 00:29
Worklog Time Spent: 10m 
  Work Description: pabloem commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198335679
 
 

 ##
 File path: sdks/java/io/common/build.gradle
 ##
 @@ -23,9 +23,15 @@ description = "Apache Beam :: SDKs :: Java :: IO :: Common"
 ext.summary = "Code used by all Beam IOs"
 
 dependencies {
+  compile library.java.joda_time
 
 Review comment:
   Done : )




Issue Time Tracking
---

Worklog Id: (was: 116210)
Time Spent: 4.5h  (was: 4h 20m)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate
> data in custom distributions, or with specific repeatability requirements / 
> and hardcoded delays for autoscaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=116211=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-116211
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 27/Jun/18 00:29
Start Date: 27/Jun/18 00:29
Worklog Time Spent: 10m 
  Work Description: pabloem commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198327121
 
 

 ##
 File path: 
sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInputTest.java
 ##
 @@ -0,0 +1,223 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common.synthetic;
+
+import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromIntegerDistribution;
+import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromRealDistribution;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+import com.fasterxml.jackson.core.JsonParseException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.util.List;
+import org.apache.beam.sdk.io.BoundedSource;
+import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SourceOptions;
+import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SyntheticBoundedSource;
+import org.apache.beam.sdk.options.PipelineOptions;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.SourceTestUtils;
+import org.apache.beam.sdk.values.KV;
+import org.apache.commons.math3.distribution.ConstantRealDistribution;
+import org.apache.commons.math3.distribution.ZipfDistribution;
+import org.junit.Before;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.ExpectedException;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/** Unit tests for {@link SyntheticBoundedInput}. */
+@RunWith(JUnit4.class)
+public class SyntheticBoundedInputTest {
+  @Rule public final ExpectedException thrown = ExpectedException.none();
+
+  private SourceOptions testSourceOptions = new SourceOptions();
+
+  @Before
+  public void setUp() {
+    testSourceOptions.splitPointFrequencyRecords = 1;
+    testSourceOptions.numRecords = 10;
+    testSourceOptions.keySizeBytes = 10;
+    testSourceOptions.valueSizeBytes = 20;
+    testSourceOptions.numHotKeys = 3;
+    testSourceOptions.hotKeyFraction = 0.3;
+    testSourceOptions.setSeed(123456);
+    testSourceOptions.bundleSizeDistribution =
+        fromIntegerDistribution(new ZipfDistribution(100, 2.5));
+    testSourceOptions.forceNumInitialBundles = null;
+  }
+
+  private SourceOptions fromString(String jsonString) throws IOException {
+    ObjectMapper mapper = new ObjectMapper();
+    SourceOptions result = mapper.readValue(jsonString, SourceOptions.class);
+    result.validate();
+    return result;
+  }
+
+  @Test
+  public void testInvalidSourceOptionsJsonFormat() throws Exception {
+    thrown.expect(JsonParseException.class);
+    String syntheticSourceOptions = "input:unknown URI";
+    fromString(syntheticSourceOptions);
+  }
+
+  @Test
+  public void testFromString() throws Exception {
+    String syntheticSourceOptions =
+        "{\"numRecords\":100,\"splitPointFrequencyRecords\":10,\"keySizeBytes\":10,"
+            + "\"valueSizeBytes\":20,\"numHotKeys\":3,"
+            + "\"hotKeyFraction\":0.3,\"seed\":123456,"
+            + "\"bundleSizeDistribution\":{\"type\":\"const\",\"const\":42},"
+            + "\"forceNumInitialBundles\":10,\"progressShape\":\"LINEAR_REGRESSING\""
+            + "}";
+    SourceOptions sourceOptions = fromString(syntheticSourceOptions);
+    assertEquals(100, sourceOptions.numRecords);
+    assertEquals(10, sourceOptions.splitPointFrequencyRecords);
+    assertEquals(10, sourceOptions.keySizeBytes);
+    assertEquals(20, sourceOptions.valueSizeBytes);
+    assertEquals(3, sourceOptions.numHotKeys);
+    assertEquals(0.3, 

[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115950=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115950
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 26/Jun/18 12:38
Start Date: 26/Jun/18 12:38
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198074385
 
 

 ##
 File path: 
sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java
 ##
 @@ -0,0 +1,452 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common.synthetic;
+
+import static com.google.common.base.Preconditions.checkArgument;
+import static com.google.common.base.Preconditions.checkNotNull;
+import static org.apache.commons.math3.stat.StatUtils.sum;
+
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
+import com.google.common.base.MoreObjects;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.NoSuchElementException;
+import java.util.Random;
+import org.apache.beam.sdk.coders.ByteArrayCoder;
+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.KvCoder;
+import org.apache.beam.sdk.io.OffsetBasedSource;
+import org.apache.beam.sdk.io.Read;
+import org.apache.beam.sdk.options.PipelineOptions;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.commons.math3.distribution.ConstantRealDistribution;
+import org.joda.time.Duration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is
+ * deterministic.
+ *
+ * The {@link SyntheticBoundedSource} generates a {@link PCollection} of
+ * {@code KV<byte[], byte[]>}. A fraction of the generated records {@code KV<byte[], byte[]>} are
+ * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys.
+ * The remaining generated records are associated with "random" keys.
+ * Each record will be slowed down by a certain sleep time generated based on the specified sleep
+ * time distribution when the {@link SyntheticSourceReader} reads each record.
+ * The record {@code KV<byte[], byte[]>} is generated deterministically based on the record's
+ * position in the source, which enables repeatable execution for debugging.
+ * The SyntheticBoundedInput configurable parameters are defined in {@link
+ * SyntheticBoundedInput.SourceOptions}.
+ *
+ * To read a {@link PCollection} of {@code KV<byte[], byte[]>} from {@link
+ * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic
+ * source with synthetic source options.
+ * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance.
+ * An example is below:
+ *  {@code
+ * Pipeline p = ...;
+ * SyntheticBoundedInput.SourceOptions sso = ...;
+ *
+ * // Construct the synthetic input with synthetic source options.
+ * PCollection<KV<byte[], byte[]>> input = p.apply(SyntheticBoundedInput.readFrom(sso));
 
 Review comment:
   Please add a test for `SyntheticBoundedInput` usage. Also validate the full 
set of options on `SourceOptions` in the tests. I found a validation issue: the 
HashFunction is not available because it is transient, so it probably gets lost 
on deserialization.
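
The concern about a transient field being lost can be illustrated with a minimal sketch. It assumes plain Java serialization (the PR may hit the problem through a different deserialization path), and `LossyOptions`, `RestoringOptions`, and `TransientFieldDemo` are hypothetical names for illustration, not classes from the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.IntUnaryOperator;

class TransientFieldDemo {
  /** Serializes and deserializes a value in memory. */
  static <T extends Serializable> T roundTrip(T value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(value);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        @SuppressWarnings("unchecked")
        T copy = (T) in.readObject();
        return copy;
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // The transient field is null after the round trip unless it is rebuilt.
    System.out.println(roundTrip(new LossyOptions()).hashFunction == null);
    System.out.println(roundTrip(new RestoringOptions()).hashFunction != null);
  }
}

/** Hypothetical options class whose transient field is only set in the constructor. */
class LossyOptions implements Serializable {
  private static final long serialVersionUID = 1L;
  long seed = 42L;
  // Skipped by Java serialization, so it is null on the deserialized copy.
  transient IntUnaryOperator hashFunction;

  LossyOptions() {
    hashFunction = x -> Long.hashCode(seed * 31 + x);
  }
}

/** Hypothetical variant that rebuilds the transient field when it is deserialized. */
class RestoringOptions implements Serializable {
  private static final long serialVersionUID = 1L;
  long seed = 42L;
  transient IntUnaryOperator hashFunction;

  RestoringOptions() {
    initHashFunction();
  }

  private void initHashFunction() {
    hashFunction = x -> Long.hashCode(seed * 31 + x);
  }

  // Java serialization invokes this hook; the constructor is not run again.
  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    in.defaultReadObject(); // restores the non-transient seed
    initHashFunction();     // rebuilds the transient field
  }
}
```

A test that round-trips the options and then calls the hash function would catch this kind of loss.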


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 115950)
Time Spent: 4h 10m  (was: 4h)

> Performance tests need a way to generate Synthetic data
> ---
>
>  

[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115949&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115949
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 26/Jun/18 12:37
Start Date: 26/Jun/18 12:37
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198113556
 
 

 ##
 File path: 
sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java
 ##
 @@ -0,0 +1,452 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common.synthetic;
+
+import static com.google.common.base.Preconditions.checkArgument;
+import static com.google.common.base.Preconditions.checkNotNull;
+import static org.apache.commons.math3.stat.StatUtils.sum;
+
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
+import com.google.common.base.MoreObjects;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.NoSuchElementException;
+import java.util.Random;
+import org.apache.beam.sdk.coders.ByteArrayCoder;
+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.KvCoder;
+import org.apache.beam.sdk.io.OffsetBasedSource;
+import org.apache.beam.sdk.io.Read;
+import org.apache.beam.sdk.options.PipelineOptions;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.commons.math3.distribution.ConstantRealDistribution;
+import org.joda.time.Duration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is
+ * deterministic.
+ *
+ * The {@link SyntheticBoundedSource} generates a {@link PCollection} of
+ * {@code KV<byte[], byte[]>}. A fraction of the generated records {@code KV<byte[], byte[]>} are
+ * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys.
+ * The remaining generated records are associated with "random" keys.
+ * Each record will be slowed down by a certain sleep time generated based on the specified sleep
+ * time distribution when the {@link SyntheticSourceReader} reads each record.
+ * The record {@code KV<byte[], byte[]>} is generated deterministically based on the record's
+ * position in the source, which enables repeatable execution for debugging.
+ * The SyntheticBoundedInput configurable parameters are defined in {@link
+ * SyntheticBoundedInput.SourceOptions}.
+ *
+ * To read a {@link PCollection} of {@code KV<byte[], byte[]>} from {@link
+ * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic
+ * source with synthetic source options.
+ * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance.
+ * An example is below:
+ *  {@code
+ * Pipeline p = ...;
+ * SyntheticBoundedInput.SourceOptions sso = ...;
+ *
+ * // Construct the synthetic input with synthetic source options.
+ * PCollection<KV<byte[], byte[]>> input = p.apply(SyntheticBoundedInput.readFrom(sso));
+ * } 
+ */
+public class SyntheticBoundedInput {
+  /**
+   * Read from the synthetic source options.
+   */
+  public static Read.Bounded<KV<byte[], byte[]>> readFrom(SourceOptions options) {
+    checkNotNull(options, "Input synthetic source options should not be null.");
+    options.validate();
+    return Read.from(new SyntheticBoundedSource(options));
+  }
+
+  /**
+   * A {@link SyntheticBoundedSource} that reads {@code KV<byte[], byte[]>}.
+   */
+  public static class SyntheticBoundedSource extends OffsetBasedSource<KV<byte[], byte[]>> {
+    private static final long serialVersionUID = 0;
+    private static final Logger LOG = LoggerFactory.getLogger(SyntheticBoundedSource.class);
+
+    private final SourceOptions sourceOptions;
+
+    public SyntheticBoundedSource(SourceOptions sourceOptions) {
+      this(0, sourceOptions.numRecords, sourceOptions);
+    }

[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115943&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115943
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 26/Jun/18 12:37
Start Date: 26/Jun/18 12:37
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198084851
 
 

 ##
 File path: 
sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java
 ##
 @@ -0,0 +1,452 @@

[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115942&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115942
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 26/Jun/18 12:37
Start Date: 26/Jun/18 12:37
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r194428847
 
 

 ##
 File path: 
sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticUtils.java
 ##
 @@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common.synthetic;
+
+import com.google.common.base.Stopwatch;
+import com.google.common.hash.Hashing;
+import com.google.common.util.concurrent.Uninterruptibles;
+import java.util.Random;
+import java.util.concurrent.TimeUnit;
+import org.joda.time.Duration;
+
+/**
+ * Utility functions used in {@link org.apache.beam.sdk.io.common.synthetic}.
+ */
+public class SyntheticUtils {
 
 Review comment:
   Make this package-private, and the same for the methods. It may be worth 
running IntelliJ's code analysis to restrict access scope as much as intended 
(also in the other classes).




Issue Time Tracking
---

Worklog Id: (was: 115942)
Time Spent: 3h 10m  (was: 3h)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate 
> data in custom distributions, or with specific repeatability requirements 
> and hardcoded delays for autoscaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115945&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115945
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 26/Jun/18 12:37
Start Date: 26/Jun/18 12:37
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198074805
 
 

 ##
 File path: 
sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java
 ##
 @@ -0,0 +1,452 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common.synthetic;
+
+import static com.google.common.base.Preconditions.checkArgument;
+import static com.google.common.base.Preconditions.checkNotNull;
+import static org.apache.commons.math3.stat.StatUtils.sum;
+
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
+import com.google.common.base.MoreObjects;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.NoSuchElementException;
+import java.util.Random;
+import org.apache.beam.sdk.coders.ByteArrayCoder;
+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.KvCoder;
+import org.apache.beam.sdk.io.OffsetBasedSource;
+import org.apache.beam.sdk.io.Read;
+import org.apache.beam.sdk.options.PipelineOptions;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.commons.math3.distribution.ConstantRealDistribution;
+import org.joda.time.Duration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is
+ * deterministic.
+ *
+ * The {@link SyntheticBoundedSource} generates a {@link PCollection} of
+ * {@code KV<byte[], byte[]>}. A fraction of the generated records {@code KV<byte[], byte[]>} are
+ * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys.
+ * The remaining generated records are associated with "random" keys.
+ * Each record will be slowed down by a certain sleep time generated based on the specified sleep
+ * time distribution when the {@link SyntheticSourceReader} reads each record.
+ * The record {@code KV<byte[], byte[]>} is generated deterministically based on the record's
+ * position in the source, which enables repeatable execution for debugging.
+ * The SyntheticBoundedInput configurable parameters are defined in {@link
+ * SyntheticBoundedInput.SourceOptions}.
+ *
+ * To read a {@link PCollection} of {@code KV<byte[], byte[]>} from {@link
+ * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic
+ * source with synthetic source options.
+ * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance.
+ * An example is below:
+ *  {@code
+ * Pipeline p = ...;
+ * SyntheticBoundedInput.SourceOptions sso = ...;
+ *
+ * // Construct the synthetic input with synthetic source options.
+ * PCollection<KV<byte[], byte[]>> input = p.apply(SyntheticBoundedInput.readFrom(sso));
+ * } 
+ */
+public class SyntheticBoundedInput {
+  /**
+   * Read from the synthetic source options.
+   */
+  public static Read.Bounded<KV<byte[], byte[]>> readFrom(SourceOptions options) {
+    checkNotNull(options, "Input synthetic source options should not be null.");
+    options.validate();
 
 Review comment:
   Is this needed? I think validate is already called as part of the lifecycle of the Read.
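
The class Javadoc quoted in the diff above says each record is derived deterministically from its position, with a fraction of records landing on a fixed set of hot keys. A minimal sketch of that idea follows. It is not the PR's implementation: `DeterministicRecordSketch`, `recordAt`, and the seeding scheme are assumptions made for illustration, and the sample sizes (10-byte keys, 20-byte values, 3 hot keys, fraction 0.3) simply mirror the values asserted in the quoted test.

```java
import java.util.Arrays;
import java.util.Random;

class DeterministicRecordSketch {
  /** Hypothetical helper: returns {key, value}, derived only from the arguments. */
  static byte[][] recordAt(long position, long seed, int numHotKeys, double hotKeyFraction,
      int keySizeBytes, int valueSizeBytes) {
    // A generator seeded from the position yields the same draws on every call.
    Random random = new Random(31 * seed + position);
    boolean hot = numHotKeys > 0 && random.nextDouble() < hotKeyFraction;
    // Hot records share one of numHotKeys key seeds; the rest use a per-position seed.
    long keySeed = hot ? random.nextInt(numHotKeys) : numHotKeys + position;
    byte[] key = new byte[keySizeBytes];
    new Random(keySeed).nextBytes(key);
    byte[] value = new byte[valueSizeBytes];
    random.nextBytes(value);
    return new byte[][] {key, value};
  }

  public static void main(String[] args) {
    byte[][] first = recordAt(7, 123L, 3, 0.3, 10, 20);
    byte[][] second = recordAt(7, 123L, 3, 0.3, 10, 20);
    // Same position and options give byte-for-byte identical records.
    System.out.println(Arrays.equals(first[0], second[0]) && Arrays.equals(first[1], second[1]));
  }
}
```

Because nothing depends on reader state, a split that starts at any offset would reproduce the same records as an unsplit read, which is what makes reruns repeatable.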




Issue Time Tracking
---

Worklog Id: (was: 115945)
Time Spent: 3.5h  (was: 3h 20m)

> Performance tests 

[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115939&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115939
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 26/Jun/18 12:37
Start Date: 26/Jun/18 12:37
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r194427503
 
 

 ##
 File path: sdks/java/io/common/build.gradle
 ##
 @@ -23,9 +23,15 @@ description = "Apache Beam :: SDKs :: Java :: IO :: Common"
 ext.summary = "Code used by all Beam IOs"
 
 dependencies {
+  compile library.java.joda_time
 
 Review comment:
   Probably worth adding the deps to pom.xml as well. Until we remove the pom 
files, we should not break the other build.




Issue Time Tracking
---

Worklog Id: (was: 115939)
Time Spent: 2h 50m  (was: 2h 40m)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate 
> data in custom distributions, or with specific repeatability requirements 
> and hardcoded delays for autoscaling.





[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115941&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115941
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 26/Jun/18 12:37
Start Date: 26/Jun/18 12:37
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198081158
 
 

 ##
 File path: 
sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java
 ##
 @@ -0,0 +1,452 @@

[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115937&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115937
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 26/Jun/18 12:37
Start Date: 26/Jun/18 12:37
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198120428
 
 

 ##
 File path: 
sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java
 ##
 @@ -0,0 +1,452 @@

[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115946=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115946
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 26/Jun/18 12:37
Start Date: 26/Jun/18 12:37
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198112035
 
 

 ##
 File path: 
sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInputTest.java
 ##
 @@ -0,0 +1,223 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common.synthetic;
+
+import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromIntegerDistribution;
+import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromRealDistribution;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+import com.fasterxml.jackson.core.JsonParseException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.util.List;
+import org.apache.beam.sdk.io.BoundedSource;
+import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SourceOptions;
+import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SyntheticBoundedSource;
+import org.apache.beam.sdk.options.PipelineOptions;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.SourceTestUtils;
+import org.apache.beam.sdk.values.KV;
+import org.apache.commons.math3.distribution.ConstantRealDistribution;
+import org.apache.commons.math3.distribution.ZipfDistribution;
+import org.junit.Before;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.ExpectedException;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/** Unit tests for {@link SyntheticBoundedInput}. */
+@RunWith(JUnit4.class)
+public class SyntheticBoundedInputTest {
+  @Rule public final ExpectedException thrown = ExpectedException.none();
+
+  private SourceOptions testSourceOptions = new SourceOptions();
+
+  @Before
+  public void setUp() {
+    testSourceOptions.splitPointFrequencyRecords = 1;
+    testSourceOptions.numRecords = 10;
+    testSourceOptions.keySizeBytes = 10;
+    testSourceOptions.valueSizeBytes = 20;
+    testSourceOptions.numHotKeys = 3;
+    testSourceOptions.hotKeyFraction = 0.3;
+    testSourceOptions.setSeed(123456);
+    testSourceOptions.bundleSizeDistribution =
+        fromIntegerDistribution(new ZipfDistribution(100, 2.5));
+    testSourceOptions.forceNumInitialBundles = null;
+  }
+
+  private SourceOptions fromString(String jsonString) throws IOException {
+    ObjectMapper mapper = new ObjectMapper();
+    SourceOptions result = mapper.readValue(jsonString, SourceOptions.class);
+    result.validate();
+    return result;
+  }
+
+  @Test
+  public void testInvalidSourceOptionsJsonFormat() throws Exception {
+    thrown.expect(JsonParseException.class);
+    String syntheticSourceOptions = "input:unknown URI";
+    fromString(syntheticSourceOptions);
+  }
+
+  @Test
+  public void testFromString() throws Exception {
+    String syntheticSourceOptions =
+        "{\"numRecords\":100,\"splitPointFrequencyRecords\":10,\"keySizeBytes\":10,"
+            + "\"valueSizeBytes\":20,\"numHotKeys\":3,"
+            + "\"hotKeyFraction\":0.3,\"seed\":123456,"
+            + "\"bundleSizeDistribution\":{\"type\":\"const\",\"const\":42},"
+            + "\"forceNumInitialBundles\":10,\"progressShape\":\"LINEAR_REGRESSING\""
+            + "}";
+    SourceOptions sourceOptions = fromString(syntheticSourceOptions);
+    assertEquals(100, sourceOptions.numRecords);
+    assertEquals(10, sourceOptions.splitPointFrequencyRecords);
+    assertEquals(10, sourceOptions.keySizeBytes);
+    assertEquals(20, sourceOptions.valueSizeBytes);
+    assertEquals(3, sourceOptions.numHotKeys);
+    assertEquals(0.3, 

[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115944=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115944
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 26/Jun/18 12:37
Start Date: 26/Jun/18 12:37
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198112261
 
 

 ##
 File path: 
sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInputTest.java
 ##
 @@ -0,0 +1,223 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common.synthetic;
+
+import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromIntegerDistribution;
+import static org.apache.beam.sdk.io.common.synthetic.SyntheticOptions.fromRealDistribution;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+import com.fasterxml.jackson.core.JsonParseException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.util.List;
+import org.apache.beam.sdk.io.BoundedSource;
+import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SourceOptions;
+import org.apache.beam.sdk.io.common.synthetic.SyntheticBoundedInput.SyntheticBoundedSource;
+import org.apache.beam.sdk.options.PipelineOptions;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.SourceTestUtils;
+import org.apache.beam.sdk.values.KV;
+import org.apache.commons.math3.distribution.ConstantRealDistribution;
+import org.apache.commons.math3.distribution.ZipfDistribution;
+import org.junit.Before;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.ExpectedException;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/** Unit tests for {@link SyntheticBoundedInput}. */
+@RunWith(JUnit4.class)
+public class SyntheticBoundedInputTest {
+  @Rule public final ExpectedException thrown = ExpectedException.none();
+
+  private SourceOptions testSourceOptions = new SourceOptions();
+
+  @Before
+  public void setUp() {
+    testSourceOptions.splitPointFrequencyRecords = 1;
+    testSourceOptions.numRecords = 10;
+    testSourceOptions.keySizeBytes = 10;
+    testSourceOptions.valueSizeBytes = 20;
+    testSourceOptions.numHotKeys = 3;
+    testSourceOptions.hotKeyFraction = 0.3;
+    testSourceOptions.setSeed(123456);
+    testSourceOptions.bundleSizeDistribution =
+        fromIntegerDistribution(new ZipfDistribution(100, 2.5));
+    testSourceOptions.forceNumInitialBundles = null;
+  }
+
+  private SourceOptions fromString(String jsonString) throws IOException {
+    ObjectMapper mapper = new ObjectMapper();
+    SourceOptions result = mapper.readValue(jsonString, SourceOptions.class);
+    result.validate();
+    return result;
+  }
+
+  @Test
+  public void testInvalidSourceOptionsJsonFormat() throws Exception {
+    thrown.expect(JsonParseException.class);
+    String syntheticSourceOptions = "input:unknown URI";
+    fromString(syntheticSourceOptions);
+  }
+
+  @Test
+  public void testFromString() throws Exception {
+    String syntheticSourceOptions =
+        "{\"numRecords\":100,\"splitPointFrequencyRecords\":10,\"keySizeBytes\":10,"
+            + "\"valueSizeBytes\":20,\"numHotKeys\":3,"
+            + "\"hotKeyFraction\":0.3,\"seed\":123456,"
+            + "\"bundleSizeDistribution\":{\"type\":\"const\",\"const\":42},"
+            + "\"forceNumInitialBundles\":10,\"progressShape\":\"LINEAR_REGRESSING\""
+            + "}";
+    SourceOptions sourceOptions = fromString(syntheticSourceOptions);
+    assertEquals(100, sourceOptions.numRecords);
+    assertEquals(10, sourceOptions.splitPointFrequencyRecords);
+    assertEquals(10, sourceOptions.keySizeBytes);
+    assertEquals(20, sourceOptions.valueSizeBytes);
+    assertEquals(3, sourceOptions.numHotKeys);
+    assertEquals(0.3, 

[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115938=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115938
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 26/Jun/18 12:37
Start Date: 26/Jun/18 12:37
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198086099
 
 

 ##
 File path: 
sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java
 ##
 @@ -0,0 +1,452 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common.synthetic;
+
+import static com.google.common.base.Preconditions.checkArgument;
+import static com.google.common.base.Preconditions.checkNotNull;
+import static org.apache.commons.math3.stat.StatUtils.sum;
+
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
+import com.google.common.base.MoreObjects;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.NoSuchElementException;
+import java.util.Random;
+import org.apache.beam.sdk.coders.ByteArrayCoder;
+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.KvCoder;
+import org.apache.beam.sdk.io.OffsetBasedSource;
+import org.apache.beam.sdk.io.Read;
+import org.apache.beam.sdk.options.PipelineOptions;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.commons.math3.distribution.ConstantRealDistribution;
+import org.joda.time.Duration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is
+ * deterministic.
+ *
+ * The {@link SyntheticBoundedSource} generates a {@link PCollection} of
+ * {@code KV<byte[], byte[]>}. A fraction of the generated records {@code KV<byte[], byte[]>} are
+ * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys.
+ * The remaining generated records are associated with "random" keys.
+ * Each record will be slowed down by a certain sleep time generated based on the specified sleep
+ * time distribution when the {@link SyntheticSourceReader} reads each record.
+ * The record {@code KV<byte[], byte[]>} is generated deterministically based on the record's
+ * position in the source, which enables repeatable execution for debugging.
+ * The SyntheticBoundedInput configurable parameters are defined in {@link
+ * SyntheticBoundedInput.SourceOptions}.
+ *
+ * To read a {@link PCollection} of {@code KV<byte[], byte[]>} from {@link
+ * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic
+ * source with synthetic source options.
+ * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance.
+ * An example is below:
+ * {@code
+ * Pipeline p = ...;
+ * SyntheticBoundedInput.SourceOptions sso = ...;
+ *
+ * // Construct the synthetic input with synthetic source options.
+ * PCollection<KV<byte[], byte[]>> input = p.apply(SyntheticBoundedInput.readFrom(sso));
+ * }
+ */
+public class SyntheticBoundedInput {
+  /**
+   * Read from the synthetic source options.
+   */
+  public static Read.Bounded<KV<byte[], byte[]>> readFrom(SourceOptions options) {
+    checkNotNull(options, "Input synthetic source options should not be null.");
+    options.validate();
+    return Read.from(new SyntheticBoundedSource(options));
+  }
+
+  /**
+   * A {@link SyntheticBoundedSource} that reads {@code KV<byte[], byte[]>}.
+   */
+  public static class SyntheticBoundedSource extends OffsetBasedSource<KV<byte[], byte[]>> {
+    private static final long serialVersionUID = 0;
+    private static final Logger LOG = LoggerFactory.getLogger(SyntheticBoundedSource.class);
+
+    private final SourceOptions sourceOptions;
+
+    public SyntheticBoundedSource(SourceOptions sourceOptions) {
+      this(0, sourceOptions.numRecords, sourceOptions);
+    }

[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115948=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115948
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 26/Jun/18 12:37
Start Date: 26/Jun/18 12:37
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198123166
 
 

 ##
 File path: 
sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java
 ##
 @@ -0,0 +1,452 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common.synthetic;
+
+import static com.google.common.base.Preconditions.checkArgument;
+import static com.google.common.base.Preconditions.checkNotNull;
+import static org.apache.commons.math3.stat.StatUtils.sum;
+
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
+import com.google.common.base.MoreObjects;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.NoSuchElementException;
+import java.util.Random;
+import org.apache.beam.sdk.coders.ByteArrayCoder;
+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.KvCoder;
+import org.apache.beam.sdk.io.OffsetBasedSource;
+import org.apache.beam.sdk.io.Read;
+import org.apache.beam.sdk.options.PipelineOptions;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.commons.math3.distribution.ConstantRealDistribution;
+import org.joda.time.Duration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is
+ * deterministic.
+ *
+ * The {@link SyntheticBoundedSource} generates a {@link PCollection} of
+ * {@code KV<byte[], byte[]>}. A fraction of the generated records {@code KV<byte[], byte[]>} are
+ * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys.
+ * The remaining generated records are associated with "random" keys.
+ * Each record will be slowed down by a certain sleep time generated based on the specified sleep
+ * time distribution when the {@link SyntheticSourceReader} reads each record.
+ * The record {@code KV<byte[], byte[]>} is generated deterministically based on the record's
+ * position in the source, which enables repeatable execution for debugging.
+ * The SyntheticBoundedInput configurable parameters are defined in {@link
+ * SyntheticBoundedInput.SourceOptions}.
+ *
+ * To read a {@link PCollection} of {@code KV<byte[], byte[]>} from {@link
+ * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic
+ * source with synthetic source options.
+ * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance.
+ * An example is below:
+ * {@code
+ * Pipeline p = ...;
+ * SyntheticBoundedInput.SourceOptions sso = ...;
+ *
+ * // Construct the synthetic input with synthetic source options.
+ * PCollection<KV<byte[], byte[]>> input = p.apply(SyntheticBoundedInput.readFrom(sso));
+ * }
+ */
+public class SyntheticBoundedInput {
+  /**
+   * Read from the synthetic source options.
+   */
+  public static Read.Bounded<KV<byte[], byte[]>> readFrom(SourceOptions options) {
+    checkNotNull(options, "Input synthetic source options should not be null.");
+    options.validate();
+    return Read.from(new SyntheticBoundedSource(options));
+  }
+
+  /**
+   * A {@link SyntheticBoundedSource} that reads {@code KV<byte[], byte[]>}.
+   */
+  public static class SyntheticBoundedSource extends OffsetBasedSource<KV<byte[], byte[]>> {
+    private static final long serialVersionUID = 0;
+    private static final Logger LOG = LoggerFactory.getLogger(SyntheticBoundedSource.class);
+
+    private final SourceOptions sourceOptions;
+
+    public SyntheticBoundedSource(SourceOptions sourceOptions) {
+      this(0, sourceOptions.numRecords, sourceOptions);
+    }

[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115940=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115940
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 26/Jun/18 12:37
Start Date: 26/Jun/18 12:37
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198116718
 
 

 ##
 File path: 
sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/package-info.java
 ##
 @@ -0,0 +1,19 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+/** Transforms for performing Synthetic Operations in Apache Beam pipelines. */
+package org.apache.beam.sdk.io.common.synthetic;
 
 Review comment:
   I am somewhat torn ('partagé') on having this in common. If the goal is to 
have a synthetic IO, the need seems specific enough that it would be better as 
an independent IO, no? Then we could migrate some features from other 
generators there and write just ParDo transformations on top of it, e.g. to 
generate data for Nexmark or the examples.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 115940)
Time Spent: 3h  (was: 2h 50m)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate 
> data in custom distributions, or with specific repeatability requirements / 
> and hardcoded delays for autoscaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115947=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115947
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 26/Jun/18 12:37
Start Date: 26/Jun/18 12:37
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198113865
 
 

 ##
 File path: 
sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java
 ##
 @@ -0,0 +1,452 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common.synthetic;
+
+import static com.google.common.base.Preconditions.checkArgument;
+import static com.google.common.base.Preconditions.checkNotNull;
+import static org.apache.commons.math3.stat.StatUtils.sum;
+
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
+import com.google.common.base.MoreObjects;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.NoSuchElementException;
+import java.util.Random;
+import org.apache.beam.sdk.coders.ByteArrayCoder;
+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.KvCoder;
+import org.apache.beam.sdk.io.OffsetBasedSource;
+import org.apache.beam.sdk.io.Read;
+import org.apache.beam.sdk.options.PipelineOptions;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.commons.math3.distribution.ConstantRealDistribution;
+import org.joda.time.Duration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is
+ * deterministic.
+ *
+ * The {@link SyntheticBoundedSource} generates a {@link PCollection} of
+ * {@code KV<byte[], byte[]>}. A fraction of the generated records {@code KV<byte[], byte[]>} are
+ * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys.
+ * The remaining generated records are associated with "random" keys.
+ * Each record will be slowed down by a certain sleep time generated based on the specified sleep
+ * time distribution when the {@link SyntheticSourceReader} reads each record.
+ * The record {@code KV<byte[], byte[]>} is generated deterministically based on the record's
+ * position in the source, which enables repeatable execution for debugging.
+ * The SyntheticBoundedInput configurable parameters are defined in {@link
+ * SyntheticBoundedInput.SourceOptions}.
+ *
+ * To read a {@link PCollection} of {@code KV<byte[], byte[]>} from {@link
+ * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic
+ * source with synthetic source options.
+ * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance.
+ * An example is below:
+ * {@code
+ * Pipeline p = ...;
+ * SyntheticBoundedInput.SourceOptions sso = ...;
+ *
+ * // Construct the synthetic input with synthetic source options.
+ * PCollection<KV<byte[], byte[]>> input = p.apply(SyntheticBoundedInput.readFrom(sso));
+ * }
+ */
+public class SyntheticBoundedInput {
+  /**
+   * Read from the synthetic source options.
+   */
+  public static Read.Bounded<KV<byte[], byte[]>> readFrom(SourceOptions options) {
+    checkNotNull(options, "Input synthetic source options should not be null.");
+    options.validate();
+    return Read.from(new SyntheticBoundedSource(options));
+  }
+
+  /**
+   * A {@link SyntheticBoundedSource} that reads {@code KV<byte[], byte[]>}.
+   */
+  public static class SyntheticBoundedSource extends OffsetBasedSource<KV<byte[], byte[]>> {
+    private static final long serialVersionUID = 0;
+    private static final Logger LOG = LoggerFactory.getLogger(SyntheticBoundedSource.class);
+
+    private final SourceOptions sourceOptions;
+
+    public SyntheticBoundedSource(SourceOptions sourceOptions) {
+      this(0, sourceOptions.numRecords, sourceOptions);
+    }

[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=115936=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-115936
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 26/Jun/18 12:37
Start Date: 26/Jun/18 12:37
Worklog Time Spent: 10m 
  Work Description: iemejia commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r198074385
 
 

 ##
 File path: 
sdks/java/io/common/src/main/java/org/apache/beam/sdk/io/common/synthetic/SyntheticBoundedInput.java
 ##
 @@ -0,0 +1,452 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common.synthetic;
+
+import static com.google.common.base.Preconditions.checkArgument;
+import static com.google.common.base.Preconditions.checkNotNull;
+import static org.apache.commons.math3.stat.StatUtils.sum;
+
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
+import com.google.common.base.MoreObjects;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.NoSuchElementException;
+import java.util.Random;
+import org.apache.beam.sdk.coders.ByteArrayCoder;
+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.KvCoder;
+import org.apache.beam.sdk.io.OffsetBasedSource;
+import org.apache.beam.sdk.io.Read;
+import org.apache.beam.sdk.options.PipelineOptions;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.commons.math3.distribution.ConstantRealDistribution;
+import org.joda.time.Duration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * This {@link SyntheticBoundedInput} class provides a parameterizable batch custom source that is
+ * deterministic.
+ *
+ * The {@link SyntheticBoundedSource} generates a {@link PCollection} of
+ * {@code KV<byte[], byte[]>}. A fraction of the generated records {@code KV<byte[], byte[]>} are
+ * associated with "hot" keys, which are uniformly distributed over a fixed number of hot keys.
+ * The remaining generated records are associated with "random" keys.
+ * Each record will be slowed down by a certain sleep time generated based on the specified sleep
+ * time distribution when the {@link SyntheticSourceReader} reads each record.
+ * The record {@code KV<byte[], byte[]>} is generated deterministically based on the record's
+ * position in the source, which enables repeatable execution for debugging.
+ * The SyntheticBoundedInput configurable parameters are defined in {@link
+ * SyntheticBoundedInput.SourceOptions}.
+ *
+ * To read a {@link PCollection} of {@code KV<byte[], byte[]>} from {@link
+ * SyntheticBoundedInput}, use {@link SyntheticBoundedInput#readFrom} to construct the synthetic
+ * source with synthetic source options.
+ * See {@link SyntheticBoundedInput.SourceOptions} for how to construct an instance.
+ * An example is below:
+ * {@code
+ * Pipeline p = ...;
+ * SyntheticBoundedInput.SourceOptions sso = ...;
+ *
+ * // Construct the synthetic input with synthetic source options.
+ * PCollection<KV<byte[], byte[]>> input = p.apply(SyntheticBoundedInput.readFrom(sso));
 
 Review comment:
   Please add a test that exercises `SyntheticBoundedInput`. Also cover in the 
tests the case where all options on `SourceOptions` are set. I found a 
validation issue: the HashFunction is not available because it is transient, so 
it probably gets lost on deserialization.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 115936)
Time Spent: 2.5h  (was: 2h 20m)

> Performance tests need a way to generate Synthetic data
> 

[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-21 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=114339=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-114339
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 21/Jun/18 13:46
Start Date: 21/Jun/18 13:46
Worklog Time Spent: 10m 
  Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-399109148
 
 
   @aaltay I discussed the urgency of this with Pablo via Slack because I was 
not available to do it; we agreed that I would do it as soon as I can, 
hopefully today, tomorrow at the latest.




Issue Time Tracking
---

Worklog Id: (was: 114339)
Time Spent: 2h 20m  (was: 2h 10m)

> Performance tests need a way to generate Synthetic data
> ---
>
> Key: BEAM-4432
> URL: https://issues.apache.org/jira/browse/BEAM-4432
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Minor
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> GenerateSequence falls short in this regard, as we may want to generate 
> data in custom distributions, or with specific repeatability requirements / 
> and hardcoded delays for autoscaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-20 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=113968=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-113968
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 20/Jun/18 23:20
Start Date: 20/Jun/18 23:20
Worklog Time Spent: 10m 
  Work Description: aaltay commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-398927561
 
 
   What is the state of this PR?
   
   @iemejia are you planning to review?
   @pabloem I have an open question about whether we can remove zipf; 
could you take a look at that?




Issue Time Tracking
---

Worklog Id: (was: 113968)
Time Spent: 2h 10m  (was: 2h)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=110958=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-110958
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 12/Jun/18 05:50
Start Date: 12/Jun/18 05:50
Worklog Time Spent: 10m 
  Work Description: pabloem commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r194620716
 
 

 ##
 File path: sdks/python/setup.py
 ##
 @@ -117,6 +117,7 @@ def get_version():
 REQUIRED_TEST_PACKAGES = [
 'nose>=1.3.7',
 'pyhamcrest>=1.9,<2.0',
+'numpy>=1.14.3',
 
 Review comment:
   Numpy is needed here because we use a zipf distribution (i.e. heavily 
weighted towards a few keys). We can define an 'extra' named `'perftest'`, 
so that numpy is only installed when we do `pip install -e .[perftest]`. WDYT?
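
   A minimal sketch of that idea, assuming the extra is named `perftest`. The 
`EXTRAS_REQUIRE` variable name is hypothetical and not taken from Beam's 
setup.py; only the package pins come from the diff above.

```python
# Hypothetical fragment of a setup.py. Only the version pins are taken from
# the diff under review; the variable names are assumptions.
REQUIRED_TEST_PACKAGES = [
    'nose>=1.3.7',
    'pyhamcrest>=1.9,<2.0',
]

# Optional dependency groups, installed only on request,
# e.g. `pip install -e .[perftest]`.
EXTRAS_REQUIRE = {
    'test': REQUIRED_TEST_PACKAGES,
    'perftest': ['numpy>=1.14.3'],
}

# In a real setup.py this dict would be passed along as
# setuptools.setup(..., extras_require=EXTRAS_REQUIRE).
```

   With this layout a plain install or test run would not pull in numpy; only 
the perf-test extra would.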




Issue Time Tracking
---

Worklog Id: (was: 110958)
Time Spent: 2h  (was: 1h 50m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=110957=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-110957
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 12/Jun/18 05:48
Start Date: 12/Jun/18 05:48
Worklog Time Spent: 10m 
  Work Description: pabloem commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r194620716
 
 

 ##
 File path: sdks/python/setup.py
 ##
 @@ -117,6 +117,7 @@ def get_version():
 REQUIRED_TEST_PACKAGES = [
 'nose>=1.3.7',
 'pyhamcrest>=1.9,<2.0',
+'numpy>=1.14.3',
 
 Review comment:
   Numpy is needed here because we use a zipf distribution (i.e. heavily 
weighted towards a few keys). Another option would be to keep these files out 
of Beam, but since they will be used for perf tests, it is likely worth having 
them directly in Beam.




Issue Time Tracking
---

Worklog Id: (was: 110957)
Time Spent: 1h 50m  (was: 1h 40m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=110956=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-110956
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 12/Jun/18 05:47
Start Date: 12/Jun/18 05:47
Worklog Time Spent: 10m 
  Work Description: pabloem commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r194620610
 
 

 ##
 File path: sdks/python/scripts/generate_pydoc.sh
 ##
 @@ -174,6 +174,7 @@ nitpicky = True
 nitpick_ignore = []
 nitpick_ignore += [('py:class', iden) for iden in ignore_identifiers]
 nitpick_ignore += [('py:obj', iden) for iden in ignore_identifiers]
+nitpick_ignore += [('py:exc', 'ValueError')]
 
 Review comment:
   I found this issue in Sphinx. 
https://github.com/sphinx-doc/sphinx/issues/1034




Issue Time Tracking
---

Worklog Id: (was: 110956)
Time Spent: 1h 40m  (was: 1.5h)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=110951=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-110951
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 12/Jun/18 04:58
Start Date: 12/Jun/18 04:58
Worklog Time Spent: 10m 
  Work Description: aaltay commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r194614742
 
 

 ##
 File path: sdks/python/scripts/generate_pydoc.sh
 ##
 @@ -174,6 +174,7 @@ nitpicky = True
 nitpick_ignore = []
 nitpick_ignore += [('py:class', iden) for iden in ignore_identifiers]
 nitpick_ignore += [('py:obj', iden) for iden in ignore_identifiers]
+nitpick_ignore += [('py:exc', 'ValueError')]
 
 Review comment:
   Do you know why?




Issue Time Tracking
---

Worklog Id: (was: 110951)
Time Spent: 1.5h  (was: 1h 20m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=110950=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-110950
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 12/Jun/18 04:58
Start Date: 12/Jun/18 04:58
Worklog Time Spent: 10m 
  Work Description: aaltay commented on a change in pull request #5519: 
[BEAM-4432] Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#discussion_r194614733
 
 

 ##
 File path: sdks/python/setup.py
 ##
 @@ -117,6 +117,7 @@ def get_version():
 REQUIRED_TEST_PACKAGES = [
 'nose>=1.3.7',
 'pyhamcrest>=1.9,<2.0',
+'numpy>=1.14.3',
 
 Review comment:
   This is a rather big dependency. Is it possible for us to not use it?
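
   For reference, one way to drop the dependency could look like the sketch 
below: a bounded Zipf-like key sampler that uses only the standard library. 
The function name and parameters are hypothetical, and unlike 
`numpy.random.zipf` it draws from a finite set of key ranks, so it is an 
approximation of the idea rather than a drop-in replacement.

```python
import random
from collections import Counter


def zipf_keys(num_records, num_keys, exponent=1.5, seed=0):
    """Draw key indices in [0, num_keys) with Zipf-like skew.

    Key i is chosen with probability proportional to 1 / (i + 1) ** exponent,
    so a few low-numbered keys receive most of the records. A fixed seed
    makes the sequence repeatable across runs.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_keys + 1)]
    return rng.choices(range(num_keys), weights=weights, k=num_records)


keys = zipf_keys(num_records=10000, num_keys=100)
hottest_key, hottest_count = Counter(keys).most_common(1)[0]
```

   Whether this skew is close enough to what the perf tests need is an open 
question; the sketch only shows that a heavy-tailed key choice does not 
strictly require numpy.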




Issue Time Tracking
---

Worklog Id: (was: 110950)
Time Spent: 1h 20m  (was: 1h 10m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=109865=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-109865
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 07/Jun/18 21:34
Start Date: 07/Jun/18 21:34
Worklog Time Spent: 10m 
  Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-395572681
 
 
   Hi Pablo, my apologies, I have been quite busy these last few days. I expect 
to do a first round of the Java part tomorrow.




Issue Time Tracking
---

Worklog Id: (was: 109865)
Time Spent: 1h 10m  (was: 1h)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-06-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=109852=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-109852
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 07/Jun/18 20:36
Start Date: 07/Jun/18 20:36
Worklog Time Spent: 10m 
  Work Description: iemejia commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-395556195
 
 
   Hi Pablo, my apologies, I have been quite busy these last few days. I expect 
to do a first round of the Java part tomorrow.




Issue Time Tracking
---

Worklog Id: (was: 109852)
Time Spent: 1h  (was: 50m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-05-31 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=107762=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-107762
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 31/May/18 17:12
Start Date: 31/May/18 17:12
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-393605464
 
 
   r: @aaltay 
   cc: @iemejia 
   starting the review process, but leaving space for discussion as we go




Issue Time Tracking
---

Worklog Id: (was: 107762)
Time Spent: 50m  (was: 40m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-05-30 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=107390=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-107390
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 30/May/18 21:38
Start Date: 30/May/18 21:38
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-393328065
 
 
   Removing lambdas that take advantage of tuple unpacking in function 
arguments: https://www.python.org/dev/peps/pep-3113/
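
   As an illustration of the kind of change this involves (the record data and 
function names below are made up, not taken from the PR):

```python
# Python 2 only; a SyntaxError in Python 3 because PEP 3113 removed
# tuple parameter unpacking:
#   sizes = map(lambda (key, value): (key, len(value)), records)

# Illustrative data: (key, value) pairs of bytes.
records = [(b'k1', b'abc'), (b'k2', b'de')]

# Portable form 1: accept a single argument and index into it.
sizes = list(map(lambda kv: (kv[0], len(kv[1])), records))


# Portable form 2: unpack inside a named function body.
def key_and_size(kv):
    key, value = kv
    return key, len(value)


sizes_via_def = list(map(key_and_size, records))
```

   Both forms run unchanged on Python 2 and Python 3.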




Issue Time Tracking
---

Worklog Id: (was: 107390)
Time Spent: 40m  (was: 0.5h)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-05-30 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=107380=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-107380
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 30/May/18 21:12
Start Date: 30/May/18 21:12
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-393321132
 
 
   Passing Java PreCommit: 
https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/5907/




Issue Time Tracking
---

Worklog Id: (was: 107380)
Time Spent: 0.5h  (was: 20m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-05-30 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=107379=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-107379
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 30/May/18 21:12
Start Date: 30/May/18 21:12
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #5519: [BEAM-4432] Adding 
Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519#issuecomment-393320967
 
 
   Also adding ValueError as a nitpick_ignore element due to a sphinx bug: 
https://github.com/sphinx-doc/sphinx/issues/1034




Issue Time Tracking
---

Worklog Id: (was: 107379)
Time Spent: 20m  (was: 10m)



[jira] [Work logged] (BEAM-4432) Performance tests need a way to generate Synthetic data

2018-05-30 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4432?focusedWorklogId=107350=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-107350
 ]

ASF GitHub Bot logged work on BEAM-4432:


Author: ASF GitHub Bot
Created on: 30/May/18 20:08
Start Date: 30/May/18 20:08
Worklog Time Spent: 10m 
  Work Description: pabloem opened a new pull request #5519: [BEAM-4432] 
Adding Sources to produce Synthetic output for Batch pipelines
URL: https://github.com/apache/beam/pull/5519
 
 
   Adding sources to produce Synthetic data in Beam. These will be used in 
subsequent performance tests that validate the performance of basic Beam 
operations.




Issue Time Tracking
---

Worklog Id: (was: 107350)
Time Spent: 10m
Remaining Estimate: 0h
