[jira] [Commented] (BEAM-10074) Hash Functions in BeamSQL

2020-05-27 Thread Darshan Jani (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117483#comment-17117483
 ] 

Darshan Jani commented on BEAM-10074:
-

Hi Rui,
I feel hashing functions can be part of the built-in functions, as they are in
BigQuery.
Writing our own UDF does work. The only limitation I see is that when we want
to use SerializableFunction, we need a public class derived from it which
implements the apply method; we cannot use a local variable or a lambda
function instead, as we can in other transforms such as MapElements. That is a
Calcite limitation, I think.

On a side note, I feel the official Beam SQL pages should document how to
write and register UDFs. I also see that the official Beam SQL documentation
is outdated and not all functions are documented there. It would be good to
provide example usage along with the list of functions. For example:
https://beam.apache.org/documentation/dsls/sql/calcite/aggregate-functions/
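 
For reference, a minimal sketch of the pattern described above (the class
name, query, and input PCollection are illustrative, not from this thread):
the UDF must be a named public class implementing SerializableFunction, since
a lambda or local class cannot be registered through Calcite.

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.transforms.SerializableFunction;

// Must be a named public class; a lambda cannot be registered here.
public class Md5Fn implements SerializableFunction<String, byte[]> {
  @Override
  public byte[] apply(String input) {
    try {
      return MessageDigest.getInstance("MD5")
          .digest(input.getBytes(StandardCharsets.UTF_8));
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e);
    }
  }
}

// Registration and use, assuming a PCollection<Row> named rows; a hex-string
// variant could format each byte with String.format("%02x", b).
// rows.apply(
//     SqlTransform.query("SELECT MD5(name) AS md5 FROM PCOLLECTION")
//         .registerUdf("MD5", new Md5Fn()));
{code}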



> Hash Functions in BeamSQL
> -
>
> Key: BEAM-10074
> URL: https://issues.apache.org/jira/browse/BEAM-10074
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Darshan Jani
>Assignee: Darshan Jani
>Priority: P2
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I would like to propose hash functions (implemented as UDFs).
> Optionally, we can also add variants of the functions below that return a
> hex string instead of bytes.
> # MD5
> Calculates an MD5 128-bit checksum of a string or bytes and returns it as
> bytes.
> {code:java}
> SELECT MD5("Some String") as md5;
> {code}
> # SHA1
> Calculates a SHA-1 hash value of a string or bytes and returns it as bytes.
> {code:java}
> SELECT SHA1("Some String") as sha1;
> {code}
> # SHA256
> Calculates a SHA-256 hash value of a string or bytes and returns it as bytes.
> {code:java}
> SELECT SHA256("Some String") as sha256;
> {code}
> # SHA512
> Calculates a SHA-512 hash value of a string or bytes and returns it as bytes.
> {code:java}
> SELECT SHA512("Some String") as sha512;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (BEAM-10074) Hash Functions in BeamSQL

2020-05-27 Thread Darshan Jani (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117483#comment-17117483
 ] 

Darshan Jani edited comment on BEAM-10074 at 5/27/20, 7:54 AM:
---

Hi Rui,
It would be nice if hashing functions could be part of the built-in functions,
as they are in BigQuery.
Writing our own UDF does work. The only limitation I see is that when we want
to use SerializableFunction, we need a public class derived from it which
implements the apply method; we cannot use a local variable or a lambda
function instead, as we can in other transforms such as MapElements. That is a
Calcite limitation, I think.

On a side note, I feel the official Beam SQL pages should document how to
write and register UDFs. I also see that the official Beam SQL documentation
is outdated and not all functions are documented there. It would be good to
provide example usage along with the list of functions. For example:
https://beam.apache.org/documentation/dsls/sql/calcite/aggregate-functions/




was (Author: darshanjani):
Hi Rui,
I feel hashing functions can be part of the built-in functions, as they are in
BigQuery.
Writing our own UDF does work. The only limitation I see is that when we want
to use SerializableFunction, we need a public class derived from it which
implements the apply method; we cannot use a local variable or a lambda
function instead, as we can in other transforms such as MapElements. That is a
Calcite limitation, I think.

On a side note, I feel the official Beam SQL pages should document how to
write and register UDFs. I also see that the official Beam SQL documentation
is outdated and not all functions are documented there. It would be good to
provide example usage along with the list of functions. For example:
https://beam.apache.org/documentation/dsls/sql/calcite/aggregate-functions/



> Hash Functions in BeamSQL
> -
>
> Key: BEAM-10074
> URL: https://issues.apache.org/jira/browse/BEAM-10074
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Darshan Jani
>Assignee: Darshan Jani
>Priority: P2
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I would like to propose hash functions (implemented as UDFs).
> Optionally, we can also add variants of the functions below that return a
> hex string instead of bytes.
> # MD5
> Calculates an MD5 128-bit checksum of a string or bytes and returns it as
> bytes.
> {code:java}
> SELECT MD5("Some String") as md5;
> {code}
> # SHA1
> Calculates a SHA-1 hash value of a string or bytes and returns it as bytes.
> {code:java}
> SELECT SHA1("Some String") as sha1;
> {code}
> # SHA256
> Calculates a SHA-256 hash value of a string or bytes and returns it as bytes.
> {code:java}
> SELECT SHA256("Some String") as sha256;
> {code}
> # SHA512
> Calculates a SHA-512 hash value of a string or bytes and returns it as bytes.
> {code:java}
> SELECT SHA512("Some String") as sha512;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7774) Stop using Perfkit Benchmarker tool in Python performance tests

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7774?focusedWorklogId=437690&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437690
 ]

ASF GitHub Bot logged work on BEAM-7774:


Author: ASF GitHub Bot
Created on: 27/May/20 08:12
Start Date: 27/May/20 08:12
Worklog Time Spent: 10m 
  Work Description: piotr-szuberski commented on a change in pull request 
#11661:
URL: https://github.com/apache/beam/pull/11661#discussion_r430935474



##
File path: sdks/python/apache_beam/examples/wordcount_it_test.py
##
@@ -104,18 +107,33 @@ def _run_wordcount_it(self, run_wordcount, **opts):
 run_time = end_time - start_time
 
 if publish_to_bq:
-  bq_publisher = BigQueryMetricsPublisher(
-  project_name=test_pipeline.get_option('project'),
-  table=test_pipeline.get_option('metrics_table'),
-  dataset=test_pipeline.get_option('metrics_dataset'),
-  )
-  result = Metric(
-  submit_timestamp=time.time(),
-  metric_id=uuid.uuid4().hex,
-  value=run_time,
-  label='Python performance test',
-  )
-  bq_publisher.publish([result.as_dict()])
+  self._publish_metrics(test_pipeline, run_time)
+
+  def _publish_metrics(self, pipeline, metric_value):

Review comment:
   I can add something like "publish_single_value(bq/influx/console)" for 
each publisher





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437690)
Time Spent: 2h 50m  (was: 2h 40m)

> Stop using Perfkit Benchmarker tool in Python performance tests
> ---
>
> Key: BEAM-7774
> URL: https://issues.apache.org/jira/browse/BEAM-7774
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Piotr Szuberski
>Priority: P2
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7774) Stop using Perfkit Benchmarker tool in Python performance tests

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7774?focusedWorklogId=437691&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437691
 ]

ASF GitHub Bot logged work on BEAM-7774:


Author: ASF GitHub Bot
Created on: 27/May/20 08:13
Start Date: 27/May/20 08:13
Worklog Time Spent: 10m 
  Work Description: piotr-szuberski commented on a change in pull request 
#11661:
URL: https://github.com/apache/beam/pull/11661#discussion_r430935474



##
File path: sdks/python/apache_beam/examples/wordcount_it_test.py
##
@@ -104,18 +107,33 @@ def _run_wordcount_it(self, run_wordcount, **opts):
 run_time = end_time - start_time
 
 if publish_to_bq:
-  bq_publisher = BigQueryMetricsPublisher(
-  project_name=test_pipeline.get_option('project'),
-  table=test_pipeline.get_option('metrics_table'),
-  dataset=test_pipeline.get_option('metrics_dataset'),
-  )
-  result = Metric(
-  submit_timestamp=time.time(),
-  metric_id=uuid.uuid4().hex,
-  value=run_time,
-  label='Python performance test',
-  )
-  bq_publisher.publish([result.as_dict()])
+  self._publish_metrics(test_pipeline, run_time)
+
+  def _publish_metrics(self, pipeline, metric_value):

Review comment:
   I can add something like "publish_single_value_(bq/influx/console)" for 
each publisher





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437691)
Time Spent: 3h  (was: 2h 50m)

> Stop using Perfkit Benchmarker tool in Python performance tests
> ---
>
> Key: BEAM-7774
> URL: https://issues.apache.org/jira/browse/BEAM-7774
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Piotr Szuberski
>Priority: P2
>  Time Spent: 3h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10072) RequiesTimeSortedInput not working for DoFns without state

2020-05-27 Thread Jira


 [ 
https://issues.apache.org/jira/browse/BEAM-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Lukavský updated BEAM-10072:

Summary: RequiesTimeSortedInput not working for DoFns without state  (was: 
RequiesTimeSortedInput not working for DnFs without state)

> RequiesTimeSortedInput not working for DoFns without state
> --
>
> Key: BEAM-10072
> URL: https://issues.apache.org/jira/browse/BEAM-10072
> Project: Beam
>  Issue Type: Bug
>  Components: runner-core
>Affects Versions: 2.20.0, 2.21.0, 2.22.0
>Reporter: Jan Lukavský
>Assignee: Jan Lukavský
>Priority: P2
> Fix For: 2.23.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When DoFn annotated with `@RequiresTimeSortedInput` doesn't have a 
> `StateSpec`, the ordering might break.
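 
For context, a minimal sketch of the shape this issue describes (element
types and logic are illustrative, not taken from the report): a stateless
DoFn that relies on the annotation.

{code:java}
// A DoFn that requests time-sorted input but declares no StateSpec;
// per this issue, ordering can break for exactly this stateless shape.
new DoFn<KV<String, Long>, Long>() {
  @RequiresTimeSortedInput
  @ProcessElement
  public void process(@Element KV<String, Long> e, OutputReceiver<Long> out) {
    out.output(e.getValue());
  }
};
{code}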



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-9994) Cannot create a virtualenv using Python 3.8 on Jenkins machines

2020-05-27 Thread Kamil Wasilewski (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamil Wasilewski closed BEAM-9994.
--

> Cannot create a virtualenv using Python 3.8 on Jenkins machines
> ---
>
> Key: BEAM-9994
> URL: https://issues.apache.org/jira/browse/BEAM-9994
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Reporter: Kamil Wasilewski
>Assignee: Kamil Wasilewski
>Priority: P2
> Fix For: Not applicable
>
>
> Command: *virtualenv --python /usr/bin/python3.8 env*
> Output:
> {noformat}
> Running virtualenv with interpreter /usr/bin/python3.8
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.5/dist-packages/virtualenv.py", line 22, in <module>
>     import distutils.spawn
> ModuleNotFoundError: No module named 'distutils.spawn'
> {noformat}
> Example test affected: 
> https://builds.apache.org/job/beam_PreCommit_PythonFormatter_Commit/1723/console



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-9994) Cannot create a virtualenv using Python 3.8 on Jenkins machines

2020-05-27 Thread Kamil Wasilewski (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamil Wasilewski resolved BEAM-9994.

Fix Version/s: Not applicable
   Resolution: Fixed

> Cannot create a virtualenv using Python 3.8 on Jenkins machines
> ---
>
> Key: BEAM-9994
> URL: https://issues.apache.org/jira/browse/BEAM-9994
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Reporter: Kamil Wasilewski
>Assignee: Kamil Wasilewski
>Priority: P2
> Fix For: Not applicable
>
>
> Command: *virtualenv --python /usr/bin/python3.8 env*
> Output:
> {noformat}
> Running virtualenv with interpreter /usr/bin/python3.8
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.5/dist-packages/virtualenv.py", line 22, in <module>
>     import distutils.spawn
> ModuleNotFoundError: No module named 'distutils.spawn'
> {noformat}
> Example test affected: 
> https://builds.apache.org/job/beam_PreCommit_PythonFormatter_Commit/1723/console



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9994) Cannot create a virtualenv using Python 3.8 on Jenkins machines

2020-05-27 Thread Kamil Wasilewski (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117504#comment-17117504
 ] 

Kamil Wasilewski commented on BEAM-9994:


Thanks! Much appreciated.

Creating a virtualenv using Python 3.8 works now. I think everything is set and 
we can close the ticket.

> Cannot create a virtualenv using Python 3.8 on Jenkins machines
> ---
>
> Key: BEAM-9994
> URL: https://issues.apache.org/jira/browse/BEAM-9994
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Reporter: Kamil Wasilewski
>Assignee: Kamil Wasilewski
>Priority: P2
>
> Command: *virtualenv --python /usr/bin/python3.8 env*
> Output:
> {noformat}
> Running virtualenv with interpreter /usr/bin/python3.8
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.5/dist-packages/virtualenv.py", line 22, in <module>
>     import distutils.spawn
> ModuleNotFoundError: No module named 'distutils.spawn'
> {noformat}
> Example test affected: 
> https://builds.apache.org/job/beam_PreCommit_PythonFormatter_Commit/1723/console



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9810) Add a Tox (precommit) suite for Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9810?focusedWorklogId=437696&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437696
 ]

ASF GitHub Bot logged work on BEAM-9810:


Author: ASF GitHub Bot
Created on: 27/May/20 08:32
Start Date: 27/May/20 08:32
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on pull request #11707:
URL: https://github.com/apache/beam/pull/11707#issuecomment-634513015


   Retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437696)
Time Spent: 2h 10m  (was: 2h)

> Add a Tox (precommit) suite for Python 3.8
> --
>
> Key: BEAM-9810
> URL: https://issues.apache.org/jira/browse/BEAM-9810
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core, testing
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9723) [Java] PTransform that connects to Cloud DLP deidentification service

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9723?focusedWorklogId=437698&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437698
 ]

ASF GitHub Bot logged work on BEAM-9723:


Author: ASF GitHub Bot
Created on: 27/May/20 08:36
Start Date: 27/May/20 08:36
Worklog Time Spent: 10m 
  Work Description: mwalenia commented on pull request #11566:
URL: https://github.com/apache/beam/pull/11566#issuecomment-634514801


   @tysonjh, @santhh Pinging for review: I applied your suggestions and I 
think the code is ready.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437698)
Time Spent: 7h  (was: 6h 50m)

> [Java] PTransform that connects to Cloud DLP deidentification service
> -
>
> Key: BEAM-9723
> URL: https://issues.apache.org/jira/browse/BEAM-9723
> Project: Beam
>  Issue Type: Sub-task
>  Components: io-java-gcp
>Reporter: Michał Walenia
>Assignee: Michał Walenia
>Priority: P2
>  Time Spent: 7h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9646) [Java] PTransform that integrates Cloud Vision functionality

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9646?focusedWorklogId=437699&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437699
 ]

ASF GitHub Bot logged work on BEAM-9646:


Author: ASF GitHub Bot
Created on: 27/May/20 08:37
Start Date: 27/May/20 08:37
Worklog Time Spent: 10m 
  Work Description: mwalenia commented on a change in pull request #11331:
URL: https://github.com/apache/beam/pull/11331#discussion_r430950390



##
File path: 
sdks/java/extensions/ml/src/main/java/org/apache/beam/sdk/extensions/ml/AnnotateImages.java
##
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.ml;
+
+import com.google.cloud.vision.v1.AnnotateImageRequest;
+import com.google.cloud.vision.v1.AnnotateImageResponse;
+import com.google.cloud.vision.v1.BatchAnnotateImagesResponse;
+import com.google.cloud.vision.v1.Feature;
+import com.google.cloud.vision.v1.ImageAnnotatorClient;
+import com.google.cloud.vision.v1.ImageContext;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.Random;
+import org.apache.beam.sdk.transforms.DoFn;
+import org.apache.beam.sdk.transforms.GroupIntoBatches;
+import org.apache.beam.sdk.transforms.PTransform;
+import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.PCollectionView;
+
+/**
+ * Parent class for transform utilizing Cloud Vision API.
+ *
+ * @param <T> Type of input PCollection.
+ */
+public abstract class AnnotateImages<T>
+    extends PTransform<PCollection<T>, PCollection<List<AnnotateImageResponse>>> {
+
+  private static final Long MIN_BATCH_SIZE = 1L;
+  private static final Long MAX_BATCH_SIZE = 5L;
+
+  protected final PCollectionView<Map<T, ImageContext>> contextSideInput;
+  protected final List<Feature> featureList;
+  private long batchSize;
+
+  public AnnotateImages(
+      PCollectionView<Map<T, ImageContext>> contextSideInput,
+      List<Feature> featureList,
+      long batchSize) {
+    this.contextSideInput = contextSideInput;
+    this.featureList = featureList;
+    checkBatchSizeCorrectness(batchSize);
+    this.batchSize = batchSize;
+  }
+
+  public AnnotateImages(List<Feature> featureList, long batchSize) {
+    contextSideInput = null;
+    this.featureList = featureList;
+    checkBatchSizeCorrectness(batchSize);
+    this.batchSize = batchSize;
+  }
+
+  private void checkBatchSizeCorrectness(long batchSize) {
+    if (batchSize > MAX_BATCH_SIZE) {
+      throw new IllegalArgumentException(
+          String.format(
+              "Max batch size exceeded.\n" + "Batch size needs to be equal or smaller than %d",
+              MAX_BATCH_SIZE));
+    } else if (batchSize < MIN_BATCH_SIZE) {
+      throw new IllegalArgumentException(
+          String.format(
+              "Min batch size not reached.\n" + "Batch size needs to be larger than %d",
+              MIN_BATCH_SIZE));
+    }
+  }
+
+  /**
+   * Applies all necessary transforms to call the Vision API. In order to group requests into
+   * batches, we assign keys to the requests, as {@link GroupIntoBatches} works only on {@link KV}s.
+   */
+  @Override
+  public PCollection<List<AnnotateImageResponse>> expand(PCollection<T> input) {
+    ParDo.SingleOutput<T, AnnotateImageRequest> inputToRequestMapper;
+    if (contextSideInput != null) {
+      inputToRequestMapper =
+          ParDo.of(new MapInputToRequest(contextSideInput)).withSideInputs(contextSideInput);
+    } else {
+      inputToRequestMapper = ParDo.of(new MapInputToRequest(null));
+    }
+    return input
+        .apply(inputToRequestMapper)
+        .apply(ParDo.of(new AssignRandomKeys()))
+        .apply(GroupIntoBatches.ofSize(batchSize))
+        .apply(ParDo.of(new ExtractValues()))
+        .apply(ParDo.of(new PerformImageAnnotation()));
+  }
+
+  /**
+   * Input type to {@link AnnotateImageRequest} mapper. Needs to be implemented by child classes.
+   *
+   * @param input Input element.
+   * @param ctx optional image context.
+   * @return A valid {@link AnnotateImageRequest} object.
+   */
+  public abstract AnnotateImageRequest mapToRequest(T input, ImageContext ctx);
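 
To make the abstract contract above concrete, a hedged sketch of a possible
child class follows (the class name, input type, and wiring are hypothetical,
not part of this PR):

{code:java}
import com.google.cloud.vision.v1.AnnotateImageRequest;
import com.google.cloud.vision.v1.Feature;
import com.google.cloud.vision.v1.Image;
import com.google.cloud.vision.v1.ImageContext;
import com.google.protobuf.ByteString;
import java.util.List;

// Hypothetical subclass: annotates images supplied as raw bytes.
public class AnnotateBytesImages extends AnnotateImages<ByteString> {

  public AnnotateBytesImages(List<Feature> featureList, long batchSize) {
    super(featureList, batchSize);
  }

  @Override
  public AnnotateImageRequest mapToRequest(ByteString input, ImageContext ctx) {
    // Build one Vision API request per input image, attaching the configured
    // feature list and the optional per-image context.
    AnnotateImageRequest.Builder request =
        AnnotateImageRequest.newBuilder()
            .setImage(Image.newBuilder().setContent(input).build())
            .addAllFeatures(featureList);
    if (ctx != null) {
      request.setImageContext(ctx);
    }
    return request.build();
  }
}
{code}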

[jira] [Work logged] (BEAM-9646) [Java] PTransform that integrates Cloud Vision functionality

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9646?focusedWorklogId=437700&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437700
 ]

ASF GitHub Bot logged work on BEAM-9646:


Author: ASF GitHub Bot
Created on: 27/May/20 08:37
Start Date: 27/May/20 08:37
Worklog Time Spent: 10m 
  Work Description: mwalenia commented on a change in pull request #11331:
URL: https://github.com/apache/beam/pull/11331#discussion_r430950390



##
File path: 
sdks/java/extensions/ml/src/main/java/org/apache/beam/sdk/extensions/ml/AnnotateImages.java
##
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.ml;
+
+import com.google.cloud.vision.v1.AnnotateImageRequest;
+import com.google.cloud.vision.v1.AnnotateImageResponse;
+import com.google.cloud.vision.v1.BatchAnnotateImagesResponse;
+import com.google.cloud.vision.v1.Feature;
+import com.google.cloud.vision.v1.ImageAnnotatorClient;
+import com.google.cloud.vision.v1.ImageContext;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.Random;
+import org.apache.beam.sdk.transforms.DoFn;
+import org.apache.beam.sdk.transforms.GroupIntoBatches;
+import org.apache.beam.sdk.transforms.PTransform;
+import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.PCollectionView;
+
+/**
+ * Parent class for transform utilizing Cloud Vision API.
+ *
+ * @param <T> Type of input PCollection.
+ */
+public abstract class AnnotateImages<T>
+    extends PTransform<PCollection<T>, PCollection<List<AnnotateImageResponse>>> {
+
+  private static final Long MIN_BATCH_SIZE = 1L;
+  private static final Long MAX_BATCH_SIZE = 5L;
+
+  protected final PCollectionView<Map<T, ImageContext>> contextSideInput;
+  protected final List<Feature> featureList;
+  private long batchSize;
+
+  public AnnotateImages(
+      PCollectionView<Map<T, ImageContext>> contextSideInput,
+      List<Feature> featureList,
+      long batchSize) {
+    this.contextSideInput = contextSideInput;
+    this.featureList = featureList;
+    checkBatchSizeCorrectness(batchSize);
+    this.batchSize = batchSize;
+  }
+
+  public AnnotateImages(List<Feature> featureList, long batchSize) {
+    contextSideInput = null;
+    this.featureList = featureList;
+    checkBatchSizeCorrectness(batchSize);
+    this.batchSize = batchSize;
+  }
+
+  private void checkBatchSizeCorrectness(long batchSize) {
+    if (batchSize > MAX_BATCH_SIZE) {
+      throw new IllegalArgumentException(
+          String.format(
+              "Max batch size exceeded.\n" + "Batch size needs to be equal or smaller than %d",
+              MAX_BATCH_SIZE));
+    } else if (batchSize < MIN_BATCH_SIZE) {
+      throw new IllegalArgumentException(
+          String.format(
+              "Min batch size not reached.\n" + "Batch size needs to be larger than %d",
+              MIN_BATCH_SIZE));
+    }
+  }
+
+  /**
+   * Applies all necessary transforms to call the Vision API. In order to group requests into
+   * batches, we assign keys to the requests, as {@link GroupIntoBatches} works only on {@link KV}s.
+   */
+  @Override
+  public PCollection<List<AnnotateImageResponse>> expand(PCollection<T> input) {
+    ParDo.SingleOutput<T, AnnotateImageRequest> inputToRequestMapper;
+    if (contextSideInput != null) {
+      inputToRequestMapper =
+          ParDo.of(new MapInputToRequest(contextSideInput)).withSideInputs(contextSideInput);
+    } else {
+      inputToRequestMapper = ParDo.of(new MapInputToRequest(null));
+    }
+    return input
+        .apply(inputToRequestMapper)
+        .apply(ParDo.of(new AssignRandomKeys()))
+        .apply(GroupIntoBatches.ofSize(batchSize))
+        .apply(ParDo.of(new ExtractValues()))
+        .apply(ParDo.of(new PerformImageAnnotation()));
+  }
+
+  /**
+   * Input type to {@link AnnotateImageRequest} mapper. Needs to be implemented by child classes.
+   *
+   * @param input Input element.
+   * @param ctx optional image context.
+   * @return A valid {@link AnnotateImageRequest} object.
+   */
+  public abstract AnnotateImageRequest mapToRequest(T input, ImageContext ctx);

[jira] [Work logged] (BEAM-9894) Add batch SnowflakeIO.Write to Java SDK

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9894?focusedWorklogId=437704&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437704
 ]

ASF GitHub Bot logged work on BEAM-9894:


Author: ASF GitHub Bot
Created on: 27/May/20 08:56
Start Date: 27/May/20 08:56
Worklog Time Spent: 10m 
  Work Description: purbanow commented on a change in pull request #11794:
URL: https://github.com/apache/beam/pull/11794#discussion_r430886885



##
File path: 
sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/CSVSink.java
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.io.snowflake;
+
+import java.io.BufferedWriter;
+import java.io.IOException;
+import java.io.OutputStreamWriter;
+import java.io.PrintWriter;
+import java.nio.channels.Channels;
+import java.nio.channels.WritableByteChannel;
+import java.nio.charset.Charset;
+import java.util.List;
+import org.apache.beam.sdk.io.FileIO;
+import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Joiner;
+
+/** Implementation of {@link org.apache.beam.sdk.io.FileIO.Sink} for writing 
CSV. */
+public class CSVSink implements FileIO.Sink {

Review comment:
   Yes, we are using [COPY INTO 
table](https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html#type-csv)
 with CSV format. 
   
   Currently, `SnowflakeIO.write` is constructed in a way that requires the 
target table to already exist in Snowflake before writing starts. 
   
   In one of the next Snowflake PRs we're planning to add an option that lets 
the user pass a table schema.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437704)
Time Spent: 1h 10m  (was: 1h)

> Add batch SnowflakeIO.Write to Java SDK
> ---
>
> Key: BEAM-9894
> URL: https://issues.apache.org/jira/browse/BEAM-9894
> Project: Beam
>  Issue Type: New Feature
>  Components: io-ideas
>Reporter: Dariusz Aniszewski
>Priority: P2
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9810) Add a Tox (precommit) suite for Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9810?focusedWorklogId=437707&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437707
 ]

ASF GitHub Bot logged work on BEAM-9810:


Author: ASF GitHub Bot
Created on: 27/May/20 09:27
Start Date: 27/May/20 09:27
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on pull request #11707:
URL: https://github.com/apache/beam/pull/11707#issuecomment-634542088


   Run Python PreCommit



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437707)
Time Spent: 2h 20m  (was: 2h 10m)

> Add a Tox (precommit) suite for Python 3.8
> --
>
> Key: BEAM-9810
> URL: https://issues.apache.org/jira/browse/BEAM-9810
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core, testing
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9936) Create SDK harness containers with Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9936?focusedWorklogId=437710&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437710
 ]

ASF GitHub Bot logged work on BEAM-9936:


Author: ASF GitHub Bot
Created on: 27/May/20 09:36
Start Date: 27/May/20 09:36
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on pull request #11738:
URL: https://github.com/apache/beam/pull/11738#issuecomment-634546732


   Run Python Dataflow ValidatesContainer



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437710)
Time Spent: 2h 50m  (was: 2h 40m)

> Create SDK harness containers with Python 3.8
> -
>
> Key: BEAM-9936
> URL: https://issues.apache.org/jira/browse/BEAM-9936
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-10066) Support ValueProvider for RedisIO

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10066?focusedWorklogId=437715&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437715
 ]

ASF GitHub Bot logged work on BEAM-10066:
-

Author: ASF GitHub Bot
Created on: 27/May/20 09:40
Start Date: 27/May/20 09:40
Worklog Time Spent: 10m 
  Work Description: iemejia commented on pull request #11799:
URL: https://github.com/apache/beam/pull/11799#issuecomment-634548871


   @naipath Can you please run `./gradlew :sdks:java:io:redis:check 
spotlessApply` and update the PR? Can you also please change the commit title 
to include the ticket: "[BEAM-10066] Added support for ValueProvider in 
RedisConnectionConfiguration"? Thanks!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437715)
Remaining Estimate: 0.5h  (was: 40m)
Time Spent: 0.5h  (was: 20m)

> Support ValueProvider for RedisIO
> -
>
> Key: BEAM-10066
> URL: https://issues.apache.org/jira/browse/BEAM-10066
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-redis
>Affects Versions: 2.20.0
>Reporter: Teije van Sloten
>Assignee: Teije van Sloten
>Priority: P3
>   Original Estimate: 1h
>  Time Spent: 0.5h
>  Remaining Estimate: 0.5h
>
> RedisIO doesn't have support for `ValueProvider` when setting up the 
> connection with Redis; therefore I cannot provide the connection at runtime 
> of the application, only at compile time.
> This will involve wrapping the fields of RedisConnectionConfiguration in 
> ValueProvider and ensuring that building the configuration still supports 
> plain values without ValueProvider.
> E.g.:
>  
> {code:java}
> public abstract class RedisConnectionConfiguration implements Serializable {
>   abstract ValueProvider<String> host();
>   abstract ValueProvider<Integer> port();
>   @Nullable
>   abstract ValueProvider<String> auth();
>   abstract ValueProvider<Integer> timeout();
>   abstract ValueProvider<Boolean> ssl();
>   abstract Builder builder();
> }
>  
> {code}
>  
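 
A hedged sketch of the backward-compatibility point above (the builder and
setter names are assumed for illustration, not taken from RedisIO):
plain-value setters can wrap their argument in StaticValueProvider so
existing call sites keep working unchanged.

{code:java}
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.StaticValueProvider;

// Existing call sites pass a plain String; wrap it once here.
public RedisConnectionConfiguration withHost(String host) {
  return builder().setHost(StaticValueProvider.of(host)).build(); // setHost is hypothetical
}

// Template-style pipelines pass a ValueProvider straight through.
public RedisConnectionConfiguration withHost(ValueProvider<String> host) {
  return builder().setHost(host).build();
}
{code}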



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-2546) Add InfluxDbIO

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-2546?focusedWorklogId=437713&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437713
 ]

ASF GitHub Bot logged work on BEAM-2546:


Author: ASF GitHub Bot
Created on: 27/May/20 09:40
Start Date: 27/May/20 09:40
Worklog Time Spent: 10m 
  Work Description: Sorkanius commented on pull request #11459:
URL: https://github.com/apache/beam/pull/11459#issuecomment-634548570


   Any updates on this?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437713)
Time Spent: 18h 40m  (was: 18.5h)

> Add InfluxDbIO
> --
>
> Key: BEAM-2546
> URL: https://issues.apache.org/jira/browse/BEAM-2546
> Project: Beam
>  Issue Type: New Feature
>  Components: io-ideas
>Reporter: Jean-Baptiste Onofré
>Assignee: Bipin Upadhyaya
>Priority: P2
> Fix For: 2.23.0
>
>  Time Spent: 18h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9936) Create SDK harness containers with Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9936?focusedWorklogId=437717&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437717
 ]

ASF GitHub Bot logged work on BEAM-9936:


Author: ASF GitHub Bot
Created on: 27/May/20 09:47
Start Date: 27/May/20 09:47
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on pull request #11738:
URL: https://github.com/apache/beam/pull/11738#issuecomment-634552389


   Run Python2_PVR_Flink PreCommit



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437717)
Time Spent: 3h  (was: 2h 50m)

> Create SDK harness containers with Python 3.8
> -
>
> Key: BEAM-9936
> URL: https://issues.apache.org/jira/browse/BEAM-9936
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 3h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9936) Create SDK harness containers with Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9936?focusedWorklogId=437724&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437724
 ]

ASF GitHub Bot logged work on BEAM-9936:


Author: ASF GitHub Bot
Created on: 27/May/20 09:59
Start Date: 27/May/20 09:59
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on pull request #11738:
URL: https://github.com/apache/beam/pull/11738#issuecomment-634557941


   Run SQL_Java11 PreCommit



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437724)
Time Spent: 3h 10m  (was: 3h)

> Create SDK harness containers with Python 3.8
> -
>
> Key: BEAM-9936
> URL: https://issues.apache.org/jira/browse/BEAM-9936
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10072) RequiresTimeSortedInput not working for DoFns without state

2020-05-27 Thread Jira


 [ 
https://issues.apache.org/jira/browse/BEAM-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-10072:

Summary: RequiresTimeSortedInput not working for DoFns without state  (was: 
RequiesTimeSortedInput not working for DoFns without state)

> RequiresTimeSortedInput not working for DoFns without state
> ---
>
> Key: BEAM-10072
> URL: https://issues.apache.org/jira/browse/BEAM-10072
> Project: Beam
>  Issue Type: Bug
>  Components: runner-core
>Affects Versions: 2.20.0, 2.21.0, 2.22.0
>Reporter: Jan Lukavský
>Assignee: Jan Lukavský
>Priority: P2
> Fix For: 2.23.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When DoFn annotated with `@RequiresTimeSortedInput` doesn't have a 
> `StateSpec`, the ordering might break.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-10016) Flink postcommits failing testFlattenWithDifferentInputAndOutputCoders2

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10016?focusedWorklogId=437728&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437728
 ]

ASF GitHub Bot logged work on BEAM-10016:
-

Author: ASF GitHub Bot
Created on: 27/May/20 10:07
Start Date: 27/May/20 10:07
Worklog Time Spent: 10m 
  Work Description: mxm commented on pull request #11825:
URL: https://github.com/apache/beam/pull/11825#issuecomment-634561794


   Looks like a bunch of the portable ValidatesRunner tests are failing for 
Spark. That needs to be looked into independently.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437728)
Time Spent: 1.5h  (was: 1h 20m)

> Flink postcommits failing testFlattenWithDifferentInputAndOutputCoders2
> ---
>
> Key: BEAM-10016
> URL: https://issues.apache.org/jira/browse/BEAM-10016
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Kyle Weaver
>Assignee: Maximilian Michels
>Priority: P2
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Both beam_PostCommit_Java_PVR_Flink_Batch and 
> beam_PostCommit_Java_PVR_Flink_Streaming are failing newly added test 
> org.apache.beam.sdk.transforms.FlattenTest.testFlattenWithDifferentInputAndOutputCoders2.
> SEVERE: Error in task code:  CHAIN MapPartition (MapPartition at [6]{Values, 
> FlatMapElements, PAssert$0}) -> FlatMap (FlatMap at ExtractOutput[0]) -> Map 
> (Key Extractor) -> GroupCombine (GroupCombine at GroupCombine: 
> PAssert$0/GroupGlobally/GatherAllOutputs/GroupByKey) -> Map (Key Extractor) 
> (2/2) java.lang.ClassCastException: org.apache.beam.sdk.values.KV cannot be 
> cast to [B
>   at 
> org.apache.beam.sdk.coders.ByteArrayCoder.encode(ByteArrayCoder.java:41)
>   at 
> org.apache.beam.sdk.coders.LengthPrefixCoder.encode(LengthPrefixCoder.java:56)
>   at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
>   at 
> org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:590)
>   at 
> org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:581)
>   at 
> org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:541)
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataSizeBasedBufferingOutboundObserver.accept(BeamFnDataSizeBasedBufferingOutboundObserver.java:109)
>   at 
> org.apache.beam.runners.fnexecution.control.SdkHarnessClient$CountingFnDataReceiver.accept(SdkHarnessClient.java:667)
>   at 
> org.apache.beam.runners.flink.translation.functions.FlinkExecutableStageFunction.processElements(FlinkExecutableStageFunction.java:271)
>   at 
> org.apache.beam.runners.flink.translation.functions.FlinkExecutableStageFunction.mapPartition(FlinkExecutableStageFunction.java:203)
>   at 
> org.apache.flink.runtime.operators.MapPartitionDriver.run(MapPartitionDriver.java:103)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:504)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>   at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9810) Add a Tox (precommit) suite for Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9810?focusedWorklogId=437742&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437742
 ]

ASF GitHub Bot logged work on BEAM-9810:


Author: ASF GitHub Bot
Created on: 27/May/20 11:22
Start Date: 27/May/20 11:22
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on pull request #11707:
URL: https://github.com/apache/beam/pull/11707#issuecomment-634596264


   Run Python PreCommit



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437742)
Time Spent: 2.5h  (was: 2h 20m)

> Add a Tox (precommit) suite for Python 3.8
> --
>
> Key: BEAM-9810
> URL: https://issues.apache.org/jira/browse/BEAM-9810
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core, testing
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Dave Martin (Jira)
Dave Martin created BEAM-10100:
--

 Summary: FileIO writeDynamic with AvroIO.sink not writing all data
 Key: BEAM-10100
 URL: https://issues.apache.org/jira/browse/BEAM-10100
 Project: Beam
  Issue Type: Bug
  Components: beam-community
Affects Versions: 2.20.0, 2.17.0
 Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
Reporter: Dave Martin
Assignee: Aizhamal Nurmamat kyzy


{code:java}
Pipeline p = Pipeline.create(options);
PCollection<KV<String, AvroRecord>> records =
    p.apply(TextIO.read().from("/tmp/input.csv"))
        .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));

// write out AVRO into a separate directory per dataset
records.apply("Write avro file per dataset",
    FileIO.<String, KV<String, AvroRecord>>writeDynamic()
        .by(KV::getKey)
        .via(Contextful.fn(KV::getValue),
            Contextful.fn(x -> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
        .to(options.getTargetPath())
        .withDestinationCoder(StringUtf8Coder.of())
        .withNaming(key -> defaultNaming(key + "/export",
            PipelinesVariables.Pipeline.AVRO_EXTENSION)));

p.run().waitUntilFinish();
{code}


If I replace AvroIO.sink() with TextIO.sink() (and replace the initial mapping 
function) then the correct number of records is written to the separate 
directories.

e.g.

{code:java}
// Initialise pipeline
Pipeline p = Pipeline.create(options);

PCollection<KV<String, String>> records =
    p.apply(TextIO.read().from("/tmp/input.csv"))
        .apply(ParDo.of(new StringToDatasetIDKVFcn()));

// write out CSV into a separate directory per dataset
records.apply("Write CSV file per dataset",
    FileIO.<String, KV<String, String>>writeDynamic()
        .by(KV::getKey)
        .via(Contextful.fn(KV::getValue), TextIO.sink())
        .to(options.getTargetPath())
        .withDestinationCoder(StringUtf8Coder.of())
        .withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));

p.run().waitUntilFinish();
{code}

cc [~timrobertson100]





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7774) Stop using Perfkit Benchmarker tool in Python performance tests

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7774?focusedWorklogId=437744&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437744
 ]

ASF GitHub Bot logged work on BEAM-7774:


Author: ASF GitHub Bot
Created on: 27/May/20 11:41
Start Date: 27/May/20 11:41
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on a change in pull request #11661:
URL: https://github.com/apache/beam/pull/11661#discussion_r431052998



##
File path: 
.test-infra/metrics/grafana/dashboards/perftests_metrics/Python_Performance_Tests.json
##
@@ -0,0 +1,297 @@
+{

Review comment:
   @tvalentyn 
   
   > Is there a way to visualize the dashboard on an in-progress PR?
   
   Currently, the best way is to pull the branch, deploy the metrics stack 
locally (`docker-compose build`; `docker-compose up`), open a web browser and 
go to localhost:3000.
   
   > Is it possible to reduce duplication in the configs?
   
   There's a concept called [Scripted 
Dashboards](https://grafana.com/docs/grafana/latest/reference/scripting/), but 
we didn't implement it. So, if we want to create two dashboards next to each 
other, we have to provide two configs for them (even if they are very similar). 
   
   Apart from that, we can also consider another question: is keeping those 
.json files under version control worthwhile? Diffs are often very large, so 
the standard review process does not really apply to them.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437744)
Time Spent: 3h 10m  (was: 3h)

> Stop using Perfkit Benchmarker tool in Python performance tests
> ---
>
> Key: BEAM-7774
> URL: https://issues.apache.org/jira/browse/BEAM-7774
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Piotr Szuberski
>Priority: P2
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Dave Martin (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Martin updated BEAM-10100:
---
Description: 
{code:java}
Pipeline p = Pipeline.create(options);
PCollection<KV<String, AvroRecord>> records =
    p.apply(TextIO.read().from("/tmp/input.csv"))
        .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));

// write out AVRO into a separate directory per dataset
records.apply("Write avro file per dataset",
    FileIO.<String, KV<String, AvroRecord>>writeDynamic()
        .by(KV::getKey)
        .via(Contextful.fn(KV::getValue),
            Contextful.fn(x -> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
        .to(options.getTargetPath())
        .withDestinationCoder(StringUtf8Coder.of())
        .withNaming(key -> defaultNaming(key + "/export",
            PipelinesVariables.Pipeline.AVRO_EXTENSION)));

p.run().waitUntilFinish();
{code}


If I replace AvroIO.sink() with TextIO.sink() (and replace the initial mapping 
function) then the correct number of records is written to the separate 
directories.

e.g.

{code:java}
// Initialise pipeline
Pipeline p = Pipeline.create(options);

PCollection<KV<String, String>> records =
    p.apply(TextIO.read().from("/tmp/input.csv"))
        .apply(ParDo.of(new StringToDatasetIDKVFcn()));

// write out CSV into a separate directory per dataset
records.apply("Write CSV file per dataset",
    FileIO.<String, KV<String, String>>writeDynamic()
        .by(KV::getKey)
        .via(Contextful.fn(KV::getValue), TextIO.sink())
        .to(options.getTargetPath())
        .withDestinationCoder(StringUtf8Coder.of())
        .withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));

p.run().waitUntilFinish();
{code}

cc [~timrobertson100]



  was:
{code:java}
Pipeline p = Pipeline.create(options);
PCollection<KV<String, AvroRecord>> records =
    p.apply(TextIO.read().from("/tmp/input.csv"))
        .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));

// write out AVRO into a separate directory per dataset
records.apply("Write avro file per dataset",
    FileIO.<String, KV<String, AvroRecord>>writeDynamic()
        .by(KV::getKey)
        .via(Contextful.fn(KV::getValue),
            Contextful.fn(x -> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
        .to(options.getTargetPath())
        .withDestinationCoder(StringUtf8Coder.of())
        .withNaming(key -> defaultNaming(key + "/export",
            PipelinesVariables.Pipeline.AVRO_EXTENSION)));

p.run().waitUntilFinish();
{code}


If I replace AvroIO.sink() with TextIO.sink() (and replace the initial mapping 
function) then the correct number of records is written to the separate 
directories.

e.g.

{code:java}
// Initialise pipeline
Pipeline p = Pipeline.create(options);

PCollection<KV<String, String>> records =
    p.apply(TextIO.read().from("/tmp/input.csv"))
        .apply(ParDo.of(new StringToDatasetIDKVFcn()));

// write out CSV into a separate directory per dataset
records.apply("Write CSV file per dataset",
    FileIO.<String, KV<String, String>>writeDynamic()
        .by(KV::getKey)
        .via(Contextful.fn(KV::getValue), TextIO.sink())
        .to(options.getTargetPath())
        .withDestinationCoder(StringUtf8Coder.of())
        .withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));

p.run().waitUntilFinish();
{code}

cc [~timrobertson100]




> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: beam-community
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Assignee: Aizhamal Nurmamat kyzy
>Priority: P2
>  Labels: AVRO, IO, Spark
>
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records =
>     p.apply(TextIO.read().from("/tmp/input.csv"))
>         .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> // write out AVRO into a separate directory per dataset
> records.apply("Write avro file per dataset",
>     FileIO.<String, KV<String, AvroRecord>>writeDynamic()
>       .by(KV::getKey)
>       .via(Contextful.fn(KV::getValue),
>           Contextful.fn(x -> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>       .to(options.getTargetPath())
>       .withDestinationCoder(StringUtf8Coder.of())
>       .withNaming(key -> defaultNaming(key + "/export",
>           PipelinesVariables.Pipeline.AVRO_EXTENSION)));
> p.run().waitUntilFinish();
> {code}
> If I replace AvroIO.sink() with TextIO.sink() (and replace the initial 
> mapping function) then the correct number of records is written to the 
> separate directories.
> e.g.
> {code:java}
> // Initialise pipeline
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, String>> records =
>     p.apply(TextIO.read().from("/tmp/input.csv"))
>         .apply(ParDo.of(new StringToDatasetIDKVFcn()));
> // write out CSV into a separate directory per dataset
> records.apply("Write CSV

[jira] [Work logged] (BEAM-7774) Stop using Perfkit Benchmarker tool in Python performance tests

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7774?focusedWorklogId=437745&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437745
 ]

ASF GitHub Bot logged work on BEAM-7774:


Author: ASF GitHub Bot
Created on: 27/May/20 11:42
Start Date: 27/May/20 11:42
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on a change in pull request #11661:
URL: https://github.com/apache/beam/pull/11661#discussion_r431052998



##
File path: 
.test-infra/metrics/grafana/dashboards/perftests_metrics/Python_Performance_Tests.json
##
@@ -0,0 +1,297 @@
+{

Review comment:
   @tvalentyn 
   
   > Is there a way to visualize the dashboard on an in-progress PR?
   
   Currently, the best way is to pull the branch, deploy the metrics stack 
locally (`docker-compose build`; `docker-compose up`), open a web browser and 
go to localhost:3000.
   
   > Is it possible to reduce duplication in the configs?
   
   There's a concept called [Scripted 
Dashboards](https://grafana.com/docs/grafana/latest/reference/scripting/), but 
we didn't implement it. So, if we want to create two charts next to each other, 
we have to provide two configs for them (even if they are very similar). 
   
   Apart from that, we can also consider another question: is keeping those 
.json files under version control worthwhile? Diffs are often very large, so 
the standard review process does not really apply to them.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437745)
Time Spent: 3h 20m  (was: 3h 10m)

> Stop using Perfkit Benchmarker tool in Python performance tests
> ---
>
> Key: BEAM-7774
> URL: https://issues.apache.org/jira/browse/BEAM-7774
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Piotr Szuberski
>Priority: P2
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Dave Martin (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Martin updated BEAM-10100:
---
Description: 
FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
pipeline. The amount of data written varies between runs, but records are 
consistently dropped. This is with a very small test dataset of 5 records, 
which should produce 3 directories.

{code:java}
Pipeline p = Pipeline.create(options);
PCollection> records = 
p.apply(TextIO.read().from("/tmp/input.csv"))
.apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));

//write out into AVRO in each separate directory
records.apply("Write avro file per dataset", FileIO.>writeDynamic()
  .by(KV::getKey)
  .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
  .to(options.getTargetPath())
  .withDestinationCoder(StringUtf8Coder.of())
  .withNaming(key -> defaultNaming(key + "/export", 
PipelinesVariables.Pipeline.AVRO_EXTENSION)));

p.run().waitUntilFinish();
{code}


If I replace AvroIO.sink() with TextIO.sink() (and replace the initial mapping 
function), then the correct number of records is written to the separate 
directories.

e.g.

{code:java}
// Initialise pipeline
Pipeline p = Pipeline.create(options);

PCollection<KV<String, String>> records = 
p.apply(TextIO.read().from("/tmp/input.csv"))

.apply(ParDo.of(new StringToDatasetIDKVFcn()));
//write out into AVRO in each separate directory
records.apply("Write CSV file per dataset", FileIO.>writeDynamic()
.by(KV::getKey)
.via(Contextful.fn(KV::getValue), TextIO.sink())
.to(options.getTargetPath())
.withDestinationCoder(StringUtf8Coder.of())
.withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));

 p.run().waitUntilFinish();
{code}

cc [~timrobertson100]
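
Note: the mapping DoFn is not included in the report. Purely as a hypothetical 
sketch of what such a function could look like (the CSV layout and the 
`AvroRecord` builder methods are assumptions, not the actual 
StringToDatasetIDAvroRecordFcn):

{code:java}
// Hypothetical sketch only: parse a CSV line into a KV keyed by dataset ID.
static class StringToDatasetIDAvroRecordFcn extends DoFn<String, KV<String, AvroRecord>> {
  @ProcessElement
  public void processElement(@Element String line, OutputReceiver<KV<String, AvroRecord>> out) {
    String[] fields = line.split(",");
    String datasetID = fields[0]; // assumes the dataset ID is the first column
    AvroRecord record = AvroRecord.newBuilder() // assumes a generated Avro class
        .setDatasetID(datasetID)
        .setValue(line)
        .build();
    out.output(KV.of(datasetID, record));
  }
}
{code}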



  was:
{code:java}
Pipeline p = Pipeline.create(options);
PCollection<KV<String, AvroRecord>> records = 
p.apply(TextIO.read().from("/tmp/input.csv"))
.apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));

//write out into AVRO in each separate directory
records.apply("Write avro file per dataset", FileIO.>writeDynamic()
  .by(KV::getKey)
  .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
  .to(options.getTargetPath())
  .withDestinationCoder(StringUtf8Coder.of())
  .withNaming(key -> defaultNaming(key + "/export", 
PipelinesVariables.Pipeline.AVRO_EXTENSION)));

p.run().waitUntilFinish();
{code}


If I replace AvroIO.sink() with TextIO.sink() (and replace the initial mapping 
function), then the correct number of records is written to the separate 
directories.

e.g.

{code:java}
// Initialise pipeline
Pipeline p = Pipeline.create(options);

PCollection<KV<String, String>> records = 
p.apply(TextIO.read().from("/tmp/input.csv"))

.apply(ParDo.of(new StringToDatasetIDKVFcn()));
//write out into AVRO in each separate directory
records.apply("Write CSV file per dataset", FileIO.>writeDynamic()
.by(KV::getKey)
.via(Contextful.fn(KV::getValue), TextIO.sink())
.to(options.getTargetPath())
.withDestinationCoder(StringUtf8Coder.of())
.withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));

 p.run().waitUntilFinish();
{code}

cc [~timrobertson100]




> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: beam-community
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Assignee: Aizhamal Nurmamat kyzy
>Priority: P2
>  Labels: AVRO, IO, Spark
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping. This is with a very small test dataset - 5 records, 
> which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO. AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder(StringUtf8Coder.of())
>   .withNaming(key -> defaultNaming(key + "/export", 
> PipelinesVariables.Pipeline.AVRO_EXTENSION)));
> p.run().waitUntilFinish();
> {code}
> If I replace 

[jira] [Work logged] (BEAM-7774) Stop using Perfkit Benchmarker tool in Python performance tests

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7774?focusedWorklogId=437748&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437748
 ]

ASF GitHub Bot logged work on BEAM-7774:


Author: ASF GitHub Bot
Created on: 27/May/20 11:46
Start Date: 27/May/20 11:46
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on a change in pull request #11661:
URL: https://github.com/apache/beam/pull/11661#discussion_r431055234



##
File path: 
.test-infra/metrics/grafana/dashboards/perftests_metrics/Python_Performance_Tests.json
##
@@ -0,0 +1,297 @@
+{

Review comment:
   @piotr-szuberski 
   
   Could you rename your dashboard and charts? `Python Performance Tests` and 
`Python27/37 Performance Tests | 1GB` don't say much. How about using `Python 
WordCount IT Benchmarks` from the old Perfkit dashboard? [1]
   
   [1] 
https://apache-beam-testing.appspot.com/explore?dashboard=5691127080419328





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437748)
Time Spent: 3.5h  (was: 3h 20m)

> Stop using Perfkit Benchmarker tool in Python performance tests
> ---
>
> Key: BEAM-7774
> URL: https://issues.apache.org/jira/browse/BEAM-7774
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Piotr Szuberski
>Priority: P2
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Dave Martin (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Martin updated BEAM-10100:
---
Description: 
FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
pipeline. The amount of data written varies between runs but it is consistently 
dropping. This is with a very small test dataset - 5 records, which should 
produce 3 directories.

{code:java}
Pipeline p = Pipeline.create(options);
PCollection<KV<String, AvroRecord>> records = 
p.apply(TextIO.read().from("/tmp/input.csv"))
.apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));

//write out into AVRO in each separate directory
records.apply("Write avro file per dataset", FileIO.>writeDynamic()
  .by(KV::getKey)
  .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
  .to(options.getTargetPath())
  .withDestinationCoder(StringUtf8Coder.of())
  .withNaming(key -> defaultNaming(key + "/export", 
PipelinesVariables.Pipeline.AVRO_EXTENSION)));

p.run().waitUntilFinish();
{code}


If I replace AvroIO.sink() with TextIO.sink() (and replace the initial mapping 
function), then the correct number of records is written to the separate 
directories.

e.g.

{code:java}
// Initialise pipeline
Pipeline p = Pipeline.create(options);

PCollection<KV<String, String>> records = 
p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
StringToDatasetIDKVFcn()));

//write out into AVRO in each separate directory
records.apply("Write CSV file per dataset", FileIO.>writeDynamic()
.by(KV::getKey)
.via(Contextful.fn(KV::getValue), TextIO.sink())
.to(options.getTargetPath())
.withDestinationCoder(StringUtf8Coder.of())
.withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));

 p.run().waitUntilFinish();
{code}

cc [~timrobertson100]



  was:
FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
pipeline. The amount of data written varies between runs but it is consistently 
dropping. This is with a very small test dataset - 5 records, which should 
produce 3 directories.

{code:java}
Pipeline p = Pipeline.create(options);
PCollection<KV<String, AvroRecord>> records = 
p.apply(TextIO.read().from("/tmp/input.csv"))
.apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));

//write out into AVRO in each separate directory
records.apply("Write avro file per dataset", FileIO.>writeDynamic()
  .by(KV::getKey)
  .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
  .to(options.getTargetPath())
  .withDestinationCoder(StringUtf8Coder.of())
  .withNaming(key -> defaultNaming(key + "/export", 
PipelinesVariables.Pipeline.AVRO_EXTENSION)));

p.run().waitUntilFinish();
{code}


If I replace AvroIO.sink() with TextIO.sink() (and replace the initial mapping 
function), then the correct number of records is written to the separate 
directories.

e.g.

{code:java}
// Initialise pipeline
Pipeline p = Pipeline.create(options);

PCollection<KV<String, String>> records = 
p.apply(TextIO.read().from("/tmp/input.csv"))

.apply(ParDo.of(new StringToDatasetIDKVFcn()));
//write out into AVRO in each separate directory
records.apply("Write CSV file per dataset", FileIO.>writeDynamic()
.by(KV::getKey)
.via(Contextful.fn(KV::getValue), TextIO.sink())
.to(options.getTargetPath())
.withDestinationCoder(StringUtf8Coder.of())
.withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));

 p.run().waitUntilFinish();
{code}

cc [~timrobertson100]




> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: beam-community
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Assignee: Aizhamal Nurmamat kyzy
>Priority: P2
>  Labels: AVRO, IO, Spark
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping. This is with a very small test dataset - 5 records, 
> which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO. AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder

[jira] [Work logged] (BEAM-7770) Add ReadAll transform for SolrIO

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7770?focusedWorklogId=437750&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437750
 ]

ASF GitHub Bot logged work on BEAM-7770:


Author: ASF GitHub Bot
Created on: 27/May/20 11:51
Start Date: 27/May/20 11:51
Worklog Time Spent: 10m 
  Work Description: iemejia commented on pull request #11357:
URL: https://github.com/apache/beam/pull/11357#issuecomment-634608874


   `readAll()` is already tested. Notice that `read()` was refactored to rely 
on `readAll()` in line 392. I can write an extra test if you prefer, but it is 
mostly duplication because that path is already covered. Should I write the extra test?
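
   For reference, the delegation pattern referred to here is the usual Beam 
one - roughly the sketch below. This is illustrative only, not the actual PR 
code; the transform names and the `expand` override are assumptions:

{code:java}
// Inside a SolrIO.Read-style transform: wrap the configured read spec into a
// single-element PCollection and hand it to the composable readAll() transform.
@Override
public PCollection<SolrDocument> expand(PBegin input) {
  return input
      .apply("Create read spec", Create.of(this))
      .apply("Execute read", SolrIO.readAll());
}
{code}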



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437750)
Time Spent: 50m  (was: 40m)

> Add ReadAll transform for SolrIO
> 
>
> Key: BEAM-7770
> URL: https://issues.apache.org/jira/browse/BEAM-7770
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-solr
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: P3
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> SolrIO already uses a composable approach internally, but we need to expose 
> an explicit ReadAll transform that allows users to create reads in the middle 
> of the Pipeline to improve composability (e.g. reads in the middle of a 
> Pipeline).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Dave Martin (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Martin updated BEAM-10100:
---
Description: 
FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
pipeline. The amount of data written varies between runs but it is consistently 
dropping. This is with a very small test dataset - 5 records, which should 
produce 3 directories.

{code:java}
Pipeline p = Pipeline.create(options);
PCollection<KV<String, AvroRecord>> records = 
p.apply(TextIO.read().from("/tmp/input.csv"))
.apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));

//write out into AVRO in each separate directory
records.apply("Write avro file per dataset", FileIO.>writeDynamic()
  .by(KV::getKey)
  .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
  .to(options.getTargetPath())
  .withDestinationCoder(StringUtf8Coder.of())
  .withNaming(key -> defaultNaming(key + "/export", 
PipelinesVariables.Pipeline.AVRO_EXTENSION)));

p.run().waitUntilFinish();
{code}


If I replace AvroIO.sink() with TextIO.sink() (and replace the initial mapping 
function), then the correct number of records is written to the separate 
directories. This is working consistently.

e.g.

{code:java}
// Initialise pipeline
Pipeline p = Pipeline.create(options);

PCollection<KV<String, String>> records = 
p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
StringToDatasetIDKVFcn()));

//write out into AVRO in each separate directory
records.apply("Write CSV file per dataset", FileIO.>writeDynamic()
.by(KV::getKey)
.via(Contextful.fn(KV::getValue), TextIO.sink())
.to(options.getTargetPath())
.withDestinationCoder(StringUtf8Coder.of())
.withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));

 p.run().waitUntilFinish();
{code}

cc [~timrobertson100]



  was:
FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
pipeline. The amount of data written varies between runs but it is consistently 
dropping. This is with a very small test dataset - 5 records, which should 
produce 3 directories.

{code:java}
Pipeline p = Pipeline.create(options);
PCollection<KV<String, AvroRecord>> records = 
p.apply(TextIO.read().from("/tmp/input.csv"))
.apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));

//write out into AVRO in each separate directory
records.apply("Write avro file per dataset", FileIO.>writeDynamic()
  .by(KV::getKey)
  .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
  .to(options.getTargetPath())
  .withDestinationCoder(StringUtf8Coder.of())
  .withNaming(key -> defaultNaming(key + "/export", 
PipelinesVariables.Pipeline.AVRO_EXTENSION)));

p.run().waitUntilFinish();
{code}


If I replace AvroIO.sink() with TextIO.sink() (and replace the initial mapping 
function), then the correct number of records is written to the separate 
directories.

e.g.

{code:java}
// Initialise pipeline
Pipeline p = Pipeline.create(options);

PCollection<KV<String, String>> records = 
p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
StringToDatasetIDKVFcn()));

//write out into AVRO in each separate directory
records.apply("Write CSV file per dataset", FileIO.>writeDynamic()
.by(KV::getKey)
.via(Contextful.fn(KV::getValue), TextIO.sink())
.to(options.getTargetPath())
.withDestinationCoder(StringUtf8Coder.of())
.withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));

 p.run().waitUntilFinish();
{code}

cc [~timrobertson100]




> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: beam-community
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Assignee: Aizhamal Nurmamat kyzy
>Priority: P2
>  Labels: AVRO, IO, Spark
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping. This is with a very small test dataset - 5 records, 
> which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO. AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder(StringUtf8Coder.of())
>   .withNaming(key -> defa

[jira] [Updated] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Dave Martin (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Martin updated BEAM-10100:
---
Description: 
FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
pipeline. The amount of data written varies between runs but it is consistently 
dropping records. This is with a very small test dataset - 6 records, which 
should produce 3 directories.

{code:java}
Pipeline p = Pipeline.create(options);
PCollection<KV<String, AvroRecord>> records = 
p.apply(TextIO.read().from("/tmp/input.csv"))
.apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));

//write out into AVRO in each separate directory
records.apply("Write avro file per dataset", FileIO.>writeDynamic()
  .by(KV::getKey)
  .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
  .to(options.getTargetPath())
  .withDestinationCoder(StringUtf8Coder.of())
  .withNaming(key -> defaultNaming(key + "/export", 
PipelinesVariables.Pipeline.AVRO_EXTENSION)));

p.run().waitUntilFinish();
{code}


If I replace AvroIO.sink() with TextIO.sink() (and replace the initial mapping 
function), then the correct number of records is written to the separate 
directories. This is working consistently.

e.g.

{code:java}
// Initialise pipeline
Pipeline p = Pipeline.create(options);

PCollection<KV<String, String>> records = 
p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
StringToDatasetIDKVFcn()));

//write out into AVRO in each separate directory
records.apply("Write CSV file per dataset", FileIO.>writeDynamic()
.by(KV::getKey)
.via(Contextful.fn(KV::getValue), TextIO.sink())
.to(options.getTargetPath())
.withDestinationCoder(StringUtf8Coder.of())
.withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));

 p.run().waitUntilFinish();
{code}

cc [~timrobertson100]



  was:
FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
pipeline. The amount of data written varies between runs but it is consistently 
dropping. This is with a very small test dataset - 5 records, which should 
produce 3 directories.

{code:java}
Pipeline p = Pipeline.create(options);
PCollection<KV<String, AvroRecord>> records = 
p.apply(TextIO.read().from("/tmp/input.csv"))
.apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));

//write out into AVRO in each separate directory
records.apply("Write avro file per dataset", FileIO.>writeDynamic()
  .by(KV::getKey)
  .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
  .to(options.getTargetPath())
  .withDestinationCoder(StringUtf8Coder.of())
  .withNaming(key -> defaultNaming(key + "/export", 
PipelinesVariables.Pipeline.AVRO_EXTENSION)));

p.run().waitUntilFinish();
{code}


If I replace AvroIO.sink() with TextIO.sink() (and replace the initial mapping 
function), then the correct number of records is written to the separate 
directories. This is working consistently.

e.g.

{code:java}
// Initialise pipeline
Pipeline p = Pipeline.create(options);

PCollection<KV<String, String>> records = 
p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
StringToDatasetIDKVFcn()));

//write out into AVRO in each separate directory
records.apply("Write CSV file per dataset", FileIO.>writeDynamic()
.by(KV::getKey)
.via(Contextful.fn(KV::getValue), TextIO.sink())
.to(options.getTargetPath())
.withDestinationCoder(StringUtf8Coder.of())
.withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));

 p.run().waitUntilFinish();
{code}

cc [~timrobertson100]




> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: beam-community
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Assignee: Aizhamal Nurmamat kyzy
>Priority: P2
>  Labels: AVRO, IO, Spark
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping records. This is with a very small test dataset - 6 
> records, which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO. AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder(Str

[jira] [Work logged] (BEAM-10054) Direct Runner execution stalls with test pipeline

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10054?focusedWorklogId=437751&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437751
 ]

ASF GitHub Bot logged work on BEAM-10054:
-

Author: ASF GitHub Bot
Created on: 27/May/20 11:55
Start Date: 27/May/20 11:55
Worklog Time Spent: 10m 
  Work Description: mxm commented on pull request #11777:
URL: https://github.com/apache/beam/pull/11777#issuecomment-634610378


   This should be covered by the tests added in the PR which introduced the 
changes: https://github.com/apache/beam/pull/10304.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437751)
Time Spent: 1h 20m  (was: 1h 10m)

> Direct Runner execution stalls with test pipeline
> -
>
> Key: BEAM-10054
> URL: https://issues.apache.org/jira/browse/BEAM-10054
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Maximilian Michels
>Assignee: Maximilian Michels
>Priority: P2
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Internally, we have a test pipeline which runs with the DirectRunner. When 
> upgrading from 2.18.0 to 2.21.0, the test failed with the following exception:
> {noformat}
> tp = Exception('Monitor task detected a pipeline stall.',), value = None, tb 
> = None
> def raise_(tp, value=None, tb=None):
> """
> A function that matches the Python 2.x ``raise`` statement. This
> allows re-raising exceptions with the cls value and traceback on
> Python 2 and 3.
> """
> if value is not None and isinstance(tp, Exception):
> raise TypeError("instance exception may not have a separate 
> value")
> if value is not None:
> exc = tp(value)
> else:
> exc = tp
> if exc.__traceback__ is not tb:
> raise exc.with_traceback(tb)
> >   raise exc
> E   Exception: Monitor task detected a pipeline stall.
> {noformat}
> I was able to bisect the error. This commit introduced the failure: 
> https://github.com/apache/beam/commit/ea9b1f350b88c2996cafb4d24351869e82857731
> If the following condition evaluates to False, the pipeline runs correctly: 
> https://github.com/apache/beam/commit/ea9b1f350b88c2996cafb4d24351869e82857731#diff-2bb845e226f3a97c0f0f737d0558c5dbR1273



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7774) Stop using Perfkit Benchmarker tool in Python performance tests

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7774?focusedWorklogId=437752&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437752
 ]

ASF GitHub Bot logged work on BEAM-7774:


Author: ASF GitHub Bot
Created on: 27/May/20 11:58
Start Date: 27/May/20 11:58
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on a change in pull request #11661:
URL: https://github.com/apache/beam/pull/11661#discussion_r431061661



##
File path: sdks/python/apache_beam/examples/wordcount_it_test.py
##
@@ -104,18 +107,33 @@ def _run_wordcount_it(self, run_wordcount, **opts):
 run_time = end_time - start_time
 
 if publish_to_bq:
-  bq_publisher = BigQueryMetricsPublisher(
-  project_name=test_pipeline.get_option('project'),
-  table=test_pipeline.get_option('metrics_table'),
-  dataset=test_pipeline.get_option('metrics_dataset'),
-  )
-  result = Metric(
-  submit_timestamp=time.time(),
-  metric_id=uuid.uuid4().hex,
-  value=run_time,
-  label='Python performance test',
-  )
-  bq_publisher.publish([result.as_dict()])
+  self._publish_metrics(test_pipeline, run_time)
+
+  def _publish_metrics(self, pipeline, metric_value):

Review comment:
   `MetricsReader` from the `load_test_metrics_utils` module has a 
`publish_metrics` method. This method takes a pipeline result as input, 
extracts metrics from it, and calls all publishers. I think we could add a 
similar method, e.g. `publish_value`, which would take a list of key-value 
pairs. @piotr-szuberski Does that make sense? 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437752)
Time Spent: 3h 40m  (was: 3.5h)

> Stop using Perfkit Benchmarker tool in Python performance tests
> ---
>
> Key: BEAM-7774
> URL: https://issues.apache.org/jira/browse/BEAM-7774
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Piotr Szuberski
>Priority: P2
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Jira


 [ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía reassigned BEAM-10100:
---

Assignee: (was: Aizhamal Nurmamat kyzy)

> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: beam-community
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Priority: P2
>  Labels: AVRO, IO, Spark
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping records. This is with a very small test dataset - 6 
> records, which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO. AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder(StringUtf8Coder.of())
>   .withNaming(key -> defaultNaming(key + "/export", 
> PipelinesVariables.Pipeline.AVRO_EXTENSION)));
> p.run().waitUntilFinish();
> {code}
> If I replace AvroIO.sink() with TextIO.sink() (and replace the initial 
> mapping function), then the correct number of records is written to the 
> separate directories. This is working consistently.
> e.g.
> {code:java}
> // Initialise pipeline
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, String>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
> StringToDatasetIDKVFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write CSV file per dataset", FileIO. String>>writeDynamic()
> .by(KV::getKey)
> .via(Contextful.fn(KV::getValue), TextIO.sink())
> .to(options.getTargetPath())
> .withDestinationCoder(StringUtf8Coder.of())
> .withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));
>  p.run().waitUntilFinish();
> {code}
> cc [~timrobertson100]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Jira


 [ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-10100:

Component/s: io-java-avro

> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-avro, runner-spark, sdk-java-core
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Priority: P2
>  Labels: AVRO, IO, Spark
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping records. This is with a very small test dataset - 6 
> records, which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO. AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder(StringUtf8Coder.of())
>   .withNaming(key -> defaultNaming(key + "/export", 
> PipelinesVariables.Pipeline.AVRO_EXTENSION)));
> p.run().waitUntilFinish();
> {code}
> If I replace AvroIO.sink() with TextIO.sink() (and replace the initial 
> mapping function), then the correct number of records is written to the 
> separate directories. This is working consistently.
> e.g.
> {code:java}
> // Initialise pipeline
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, String>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
> StringToDatasetIDKVFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write CSV file per dataset", FileIO. String>>writeDynamic()
> .by(KV::getKey)
> .via(Contextful.fn(KV::getValue), TextIO.sink())
> .to(options.getTargetPath())
> .withDestinationCoder(StringUtf8Coder.of())
> .withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));
>  p.run().waitUntilFinish();
> {code}
> cc [~timrobertson100]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Jira


 [ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-10100:

Component/s: runner-spark

> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark, sdk-java-core
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Priority: P2
>  Labels: AVRO, IO, Spark
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping records. This is with a very small test dataset - 6 
> records, which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO. AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder(StringUtf8Coder.of())
>   .withNaming(key -> defaultNaming(key + "/export", 
> PipelinesVariables.Pipeline.AVRO_EXTENSION)));
> p.run().waitUntilFinish();
> {code}
> If I replace AvroIO.sink() with TextIO.sink() (and replace the initial 
> mapping function), then the correct number of records is written to the 
> separate directories. This is working consistently.
> e.g.
> {code:java}
> // Initialise pipeline
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, String>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
> StringToDatasetIDKVFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write CSV file per dataset", FileIO. String>>writeDynamic()
> .by(KV::getKey)
> .via(Contextful.fn(KV::getValue), TextIO.sink())
> .to(options.getTargetPath())
> .withDestinationCoder(StringUtf8Coder.of())
> .withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));
>  p.run().waitUntilFinish();
> {code}
> cc [~timrobertson100]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Jira


 [ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-10100:

Component/s: (was: beam-community)
 sdk-java-core

> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Priority: P2
>  Labels: AVRO, IO, Spark
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping records. This is with a very small test dataset - 6 
> records, which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO. AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder(StringUtf8Coder.of())
>   .withNaming(key -> defaultNaming(key + "/export", 
> PipelinesVariables.Pipeline.AVRO_EXTENSION)));
> p.run().waitUntilFinish();
> {code}
> If I replace AvroIO.sink() with TextIO.sink() (and replace the initial 
> mapping function), then the correct number of records is written to the 
> separate directories. This is working consistently.
> e.g.
> {code:java}
> // Initialise pipeline
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, String>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
> StringToDatasetIDKVFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write CSV file per dataset", FileIO. String>>writeDynamic()
> .by(KV::getKey)
> .via(Contextful.fn(KV::getValue), TextIO.sink())
> .to(options.getTargetPath())
> .withDestinationCoder(StringUtf8Coder.of())
> .withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));
>  p.run().waitUntilFinish();
> {code}
> cc [~timrobertson100]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Jira


 [ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-10100:

Labels:   (was: AVRO IO Spark)

> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-avro, runner-spark, sdk-java-core
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Priority: P2
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping records. This is with a very small test dataset - 6 
> records, which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO. AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder(StringUtf8Coder.of())
>   .withNaming(key -> defaultNaming(key + "/export", 
> PipelinesVariables.Pipeline.AVRO_EXTENSION)));
> p.run().waitUntilFinish();
> {code}
> If I replace AvroIO.sink() with TextIO.sink() (and replace the initial 
> mapping function), then the correct number of records is written to the 
> separate directories. This is working consistently.
> e.g.
> {code:java}
> // Initialise pipeline
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, String>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
> StringToDatasetIDKVFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write CSV file per dataset", FileIO. String>>writeDynamic()
> .by(KV::getKey)
> .via(Contextful.fn(KV::getValue), TextIO.sink())
> .to(options.getTargetPath())
> .withDestinationCoder(StringUtf8Coder.of())
> .withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));
>  p.run().waitUntilFinish();
> {code}
> cc [~timrobertson100]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Jira


 [ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-10100:

Component/s: io-java-files

> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-avro, io-java-files, runner-spark, sdk-java-core
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Priority: P2
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping records. This is with a very small test dataset - 6 
> records, which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO. AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder(StringUtf8Coder.of())
>   .withNaming(key -> defaultNaming(key + "/export", 
> PipelinesVariables.Pipeline.AVRO_EXTENSION)));
> p.run().waitUntilFinish();
> {code}
> If I replace AvroIO.sink() with TextIO.sink() (and replace the initial 
> mapping function), then the correct number of records is written to the 
> separate directories. This is working consistently.
> e.g.
> {code:java}
> // Initialise pipeline
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, String>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
> StringToDatasetIDKVFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write CSV file per dataset", FileIO. String>>writeDynamic()
> .by(KV::getKey)
> .via(Contextful.fn(KV::getValue), TextIO.sink())
> .to(options.getTargetPath())
> .withDestinationCoder(StringUtf8Coder.of())
> .withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));
>  p.run().waitUntilFinish();
> {code}
> cc [~timrobertson100]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Jira


 [ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-10100:

Component/s: (was: sdk-java-core)

> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-avro, io-java-files, runner-spark
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Priority: P2
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping records. This is with a very small test dataset - 6 
> records, which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO. AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder(StringUtf8Coder.of())
>   .withNaming(key -> defaultNaming(key + "/export", 
> PipelinesVariables.Pipeline.AVRO_EXTENSION)));
> p.run().waitUntilFinish();
> {code}
> If I replace AvroIO.sink() with TextIO.sink() (and replace the initial 
> mapping function), then the correct number of records is written to the 
> separate directories. This is working consistently.
> e.g.
> {code:java}
> // Initialise pipeline
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, String>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
> StringToDatasetIDKVFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write CSV file per dataset", FileIO. String>>writeDynamic()
> .by(KV::getKey)
> .via(Contextful.fn(KV::getValue), TextIO.sink())
> .to(options.getTargetPath())
> .withDestinationCoder(StringUtf8Coder.of())
> .withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));
>  p.run().waitUntilFinish();
> {code}
> cc [~timrobertson100]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Jira


[ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117703#comment-17117703
 ] 

Ismaël Mejía commented on BEAM-10100:
-

Just a minor question: if you try the same pipeline with the Direct runner, is 
the result the expected one? (Just trying to confirm whether the issue is on 
the Spark side or not.)
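
For reference, switching runners for such a comparison is a small change; a 
minimal sketch, assuming the standard Beam options API (passing 
`--runner=DirectRunner` on the command line is equivalent):

{code:java}
// Illustrative: force the Direct runner programmatically to check whether the
// dropped records are Spark-specific; swap in SparkRunner.class to compare.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
options.setRunner(DirectRunner.class);
Pipeline p = Pipeline.create(options);
// ...apply the same TextIO/FileIO transforms as in the description, then:
p.run().waitUntilFinish();
{code}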

> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-avro, runner-spark, sdk-java-core
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Priority: P2
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping records. This is with a very small test dataset - 6 
> records, which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO. AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder(StringUtf8Coder.of())
>   .withNaming(key -> defaultNaming(key + "/export", 
> PipelinesVariables.Pipeline.AVRO_EXTENSION)));
> p.run().waitUntilFinish();
> {code}
> If I replace AvroIO.sink() with TextIO.sink() (and replace the initial 
> mapping function), then the correct number of records is written to the 
> separate directories. This is working consistently.
> e.g.
> {code:java}
> // Initialise pipeline
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, String>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
> StringToDatasetIDKVFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write CSV file per dataset", FileIO. String>>writeDynamic()
> .by(KV::getKey)
> .via(Contextful.fn(KV::getValue), TextIO.sink())
> .to(options.getTargetPath())
> .withDestinationCoder(StringUtf8Coder.of())
> .withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));
>  p.run().waitUntilFinish();
> {code}
> cc [~timrobertson100]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9810) Add a Tox (precommit) suite for Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9810?focusedWorklogId=437755&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437755
 ]

ASF GitHub Bot logged work on BEAM-9810:


Author: ASF GitHub Bot
Created on: 27/May/20 12:04
Start Date: 27/May/20 12:04
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on pull request #11707:
URL: https://github.com/apache/beam/pull/11707#issuecomment-634614486


   Run Python PreCommit



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437755)
Time Spent: 2h 40m  (was: 2.5h)

> Add a Tox (precommit) suite for Python 3.8
> --
>
> Key: BEAM-9810
> URL: https://issues.apache.org/jira/browse/BEAM-9810
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core, testing
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9810) Add a Tox (precommit) suite for Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9810?focusedWorklogId=437756&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437756
 ]

ASF GitHub Bot logged work on BEAM-9810:


Author: ASF GitHub Bot
Created on: 27/May/20 12:04
Start Date: 27/May/20 12:04
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on pull request #11707:
URL: https://github.com/apache/beam/pull/11707#issuecomment-634614624


   Run Java PreCommit



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437756)
Time Spent: 2h 50m  (was: 2h 40m)

> Add a Tox (precommit) suite for Python 3.8
> --
>
> Key: BEAM-9810
> URL: https://issues.apache.org/jira/browse/BEAM-9810
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core, testing
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7770) Add ReadAll transform for SolrIO

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7770?focusedWorklogId=437757&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437757
 ]

ASF GitHub Bot logged work on BEAM-7770:


Author: ASF GitHub Bot
Created on: 27/May/20 12:05
Start Date: 27/May/20 12:05
Worklog Time Spent: 10m 
  Work Description: aromanenko-dev commented on pull request #11357:
URL: https://github.com/apache/beam/pull/11357#issuecomment-634615002


   @iemejia You are right - it's clear that `readAll()` is tested implicitly 
through calling `read()`. However, tests are not (and shouldn't be!) aware of 
the internal implementation of the tested method, so we should have tests for 
all public API methods. So, yes, please add an extra test for that.
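
   A dedicated test could look roughly like the sketch below. Identifiers such 
as `pipeline`, `connectionConfiguration`, `SOLR_COLLECTION` and `NUM_DOCS` are 
assumed to follow the existing SolrIOTest conventions; the actual test added 
in the PR may differ:

{code:java}
// Sketch of a standalone readAll() test: build a Read spec, feed it through
// readAll(), and assert on the number of documents returned.
@Test
public void testReadAll() {
  PCollection<SolrDocument> output =
      pipeline
          .apply(Create.of(SolrIO.read()
              .withConnectionConfiguration(connectionConfiguration)
              .from(SOLR_COLLECTION)))
          .apply(SolrIO.readAll());
  PAssert.thatSingleton(output.apply("Count", Count.globally()))
      .isEqualTo((long) NUM_DOCS);
  pipeline.run();
}
{code}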



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437757)
Time Spent: 1h  (was: 50m)

> Add ReadAll transform for SolrIO
> 
>
> Key: BEAM-7770
> URL: https://issues.apache.org/jira/browse/BEAM-7770
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-solr
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: P3
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> SolrIO already uses a composable approach internally, but we need to expose 
> an explicit ReadAll transform that allows users to create reads in the middle 
> of the Pipeline to improve composability (e.g. reads in the middle of a 
> Pipeline).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Dave Martin (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117718#comment-17117718
 ] 

Dave Martin commented on BEAM-10100:


Thanks [~iemejia]

Just tested with the DirectRunner and the result is as expected (i.e. a full 
export is achieved, with no dropped records). So this seems to be an issue only 
for the SparkRunner.

> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-avro, io-java-files, runner-spark
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Priority: P2
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping records. This is with a very small test dataset - 6 
> records, which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO. AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder(StringUtf8Coder.of())
>   .withNaming(key -> defaultNaming(key + "/export", 
> PipelinesVariables.Pipeline.AVRO_EXTENSION)));
> p.run().waitUntilFinish();
> {code}
> If I replace AvroIO.sink() with TextIO.sink() (and replace the initial 
> mapping function), then the correct number of records is written to the 
> separate directories. This is working consistently.
> e.g.
> {code:java}
> // Initialise pipeline
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, String>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
> StringToDatasetIDKVFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write CSV file per dataset", FileIO. String>>writeDynamic()
> .by(KV::getKey)
> .via(Contextful.fn(KV::getValue), TextIO.sink())
> .to(options.getTargetPath())
> .withDestinationCoder(StringUtf8Coder.of())
> .withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));
>  p.run().waitUntilFinish();
> {code}
> cc [~timrobertson100]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Dave Martin (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117718#comment-17117718
 ] 

Dave Martin edited comment on BEAM-10100 at 5/27/20, 12:20 PM:
---

Thanks [~iemejia]

I've just tested with the DirectRunner and the result is as expected (i.e. a 
full export is achieved, with no dropped records). So this seems to be an issue 
only for the SparkRunner.


was (Author: djtfmartin):
Thanks [~iemejia]

Just tested with the DirectRunner and the result is as expected (i.e. a full 
export is achieved, with no dropped records). So this seems to be an issue only 
for the SparkRunner.

> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-avro, io-java-files, runner-spark
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Priority: P2
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping records. This is with a very small test dataset - 6 
> records, which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO. AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder(StringUtf8Coder.of())
>   .withNaming(key -> defaultNaming(key + "/export", 
> PipelinesVariables.Pipeline.AVRO_EXTENSION)));
> p.run().waitUntilFinish();
> {code}
> If I replace AvroIO.sink() with TextIO.sink() (and replace the initial 
> mapping function), then the correct number of records is written to the 
> separate directories. This is working consistently.
> e.g.
> {code:java}
> // Initialise pipeline
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, String>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
> StringToDatasetIDKVFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write CSV file per dataset", FileIO. String>>writeDynamic()
> .by(KV::getKey)
> .via(Contextful.fn(KV::getValue), TextIO.sink())
> .to(options.getTargetPath())
> .withDestinationCoder(StringUtf8Coder.of())
> .withNaming(datasetID -> defaultNaming(key + "/export", ".csv"));
>  p.run().waitUntilFinish();
> {code}
> cc [~timrobertson100]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7770) Add ReadAll transform for SolrIO

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7770?focusedWorklogId=437764&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437764
 ]

ASF GitHub Bot logged work on BEAM-7770:


Author: ASF GitHub Bot
Created on: 27/May/20 12:32
Start Date: 27/May/20 12:32
Worklog Time Spent: 10m 
  Work Description: iemejia commented on pull request #11357:
URL: https://github.com/apache/beam/pull/11357#issuecomment-634628307


   `testReadAll` added, can you PTAL again @aromanenko-dev? Thanks!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437764)
Time Spent: 1h 10m  (was: 1h)

> Add ReadAll transform for SolrIO
> 
>
> Key: BEAM-7770
> URL: https://issues.apache.org/jira/browse/BEAM-7770
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-solr
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: P3
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> SolrIO already uses a composable approach internally, but we need to expose an 
> explicit ReadAll transform that allows users to create reads in the middle of 
> the Pipeline to improve composability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Dave Martin (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Martin updated BEAM-10100:
---
Component/s: runner-flink

> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-avro, io-java-files, runner-flink, runner-spark
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Priority: P2
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping records. This is with a very small test dataset - 6 
> records, which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO.<String, KV<String, AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder(StringUtf8Coder.of())
>   .withNaming(key -> defaultNaming(key + "/export", 
> PipelinesVariables.Pipeline.AVRO_EXTENSION)));
> p.run().waitUntilFinish();
> {code}
> If I replace AvroIO.sink() with TextIO.sink() (and replace the initial 
> mapping function), then the correct number of records is written to the 
> separate directories. This works consistently.
> e.g.
> {code:java}
> // Initialise pipeline
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, String>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
> StringToDatasetIDKVFcn()));
> //write out CSV into each separate directory
> records.apply("Write CSV file per dataset", FileIO.<String, KV<String, String>>writeDynamic()
> .by(KV::getKey)
> .via(Contextful.fn(KV::getValue), TextIO.sink())
> .to(options.getTargetPath())
> .withDestinationCoder(StringUtf8Coder.of())
> .withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));
>  p.run().waitUntilFinish();
> {code}
> cc [~timrobertson100]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-10100) FileIO writeDynamic with AvroIO.sink not writing all data

2020-05-27 Thread Dave Martin (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117729#comment-17117729
 ] 

Dave Martin commented on BEAM-10100:


I've just tested with FlinkRunner and the results are the same as with 
SparkRunner. Records are dropped when run with FlinkRunner.

> FileIO writeDynamic with AvroIO.sink not writing all data
> -
>
> Key: BEAM-10100
> URL: https://issues.apache.org/jira/browse/BEAM-10100
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-avro, io-java-files, runner-spark
>Affects Versions: 2.17.0, 2.20.0
> Environment: Mac OSX Catalina, tested with SparkRunner - Spark 2.4.5.
>Reporter: Dave Martin
>Priority: P2
>
> FileIO writeDynamic with AvroIO.sink is not writing all data in the following 
> pipeline. The amount of data written varies between runs but it is 
> consistently dropping records. This is with a very small test dataset - 6 
> records, which should produce 3 directories.
> {code:java}
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, AvroRecord>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv"))
> .apply(ParDo.of(new StringToDatasetIDAvroRecordFcn()));
> //write out into AVRO in each separate directory
> records.apply("Write avro file per dataset", FileIO.<String, KV<String, AvroRecord>>writeDynamic()
>   .by(KV::getKey)
>   .via(Contextful.fn(KV::getValue), Contextful.fn(x -> 
> AvroIO.sink(AvroRecord.class).withCodec(BASE_CODEC)))
>   .to(options.getTargetPath())
>   .withDestinationCoder(StringUtf8Coder.of())
>   .withNaming(key -> defaultNaming(key + "/export", 
> PipelinesVariables.Pipeline.AVRO_EXTENSION)));
> p.run().waitUntilFinish();
> {code}
> If I replace AvroIO.sink() with TextIO.sink() (and replace the initial 
> mapping function), then the correct number of records is written to the 
> separate directories. This works consistently.
> e.g.
> {code:java}
> // Initialise pipeline
> Pipeline p = Pipeline.create(options);
> PCollection<KV<String, String>> records = 
> p.apply(TextIO.read().from("/tmp/input.csv")).apply(ParDo.of(new 
> StringToDatasetIDKVFcn()));
> //write out CSV into each separate directory
> records.apply("Write CSV file per dataset", FileIO.<String, KV<String, String>>writeDynamic()
> .by(KV::getKey)
> .via(Contextful.fn(KV::getValue), TextIO.sink())
> .to(options.getTargetPath())
> .withDestinationCoder(StringUtf8Coder.of())
> .withNaming(datasetID -> defaultNaming(datasetID + "/export", ".csv")));
>  p.run().waitUntilFinish();
> {code}
> cc [~timrobertson100]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9603) Support Dynamic Timer in Java SDK over FnApi

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9603?focusedWorklogId=437766&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437766
 ]

ASF GitHub Bot logged work on BEAM-9603:


Author: ASF GitHub Bot
Created on: 27/May/20 12:41
Start Date: 27/May/20 12:41
Worklog Time Spent: 10m 
  Work Description: iemejia commented on pull request #11756:
URL: https://github.com/apache/beam/pull/11756#issuecomment-634633557


   This PR did not remove the exclusion of these tests on Flink, so running the 
PVR jobs on it was not validating proper support for timer families.
   
https://github.com/apache/beam/blob/03d99dfa359f44a29a772fcc8ec8b0a237cab113/runners/flink/job-server/flink_job_server.gradle#L150
   It may be worth creating a new PR that enables the tests, to validate that 
the support works on portable Flink.
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437766)
Time Spent: 7h 10m  (was: 7h)

> Support Dynamic Timer in Java SDK over FnApi
> 
>
> Key: BEAM-9603
> URL: https://issues.apache.org/jira/browse/BEAM-9603
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-harness
>Reporter: Boyuan Zhang
>Assignee: Yichi Zhang
>Priority: P2
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9810) Add a Tox (precommit) suite for Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9810?focusedWorklogId=437769&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437769
 ]

ASF GitHub Bot logged work on BEAM-9810:


Author: ASF GitHub Bot
Created on: 27/May/20 12:45
Start Date: 27/May/20 12:45
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on pull request #11707:
URL: https://github.com/apache/beam/pull/11707#issuecomment-634635541


   Run Java PreCommit



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437769)
Time Spent: 3h  (was: 2h 50m)

> Add a Tox (precommit) suite for Python 3.8
> --
>
> Key: BEAM-9810
> URL: https://issues.apache.org/jira/browse/BEAM-9810
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core, testing
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 3h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-10016) Flink postcommits failing testFlattenWithDifferentInputAndOutputCoders2

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10016?focusedWorklogId=437768&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437768
 ]

ASF GitHub Bot logged work on BEAM-10016:
-

Author: ASF GitHub Bot
Created on: 27/May/20 12:45
Start Date: 27/May/20 12:45
Worklog Time Spent: 10m 
  Work Description: mxm merged pull request #11825:
URL: https://github.com/apache/beam/pull/11825


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437768)
Time Spent: 1h 40m  (was: 1.5h)

> Flink postcommits failing testFlattenWithDifferentInputAndOutputCoders2
> ---
>
> Key: BEAM-10016
> URL: https://issues.apache.org/jira/browse/BEAM-10016
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Kyle Weaver
>Assignee: Maximilian Michels
>Priority: P2
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Both beam_PostCommit_Java_PVR_Flink_Batch and 
> beam_PostCommit_Java_PVR_Flink_Streaming are failing newly added test 
> org.apache.beam.sdk.transforms.FlattenTest.testFlattenWithDifferentInputAndOutputCoders2.
> SEVERE: Error in task code:  CHAIN MapPartition (MapPartition at [6]{Values, 
> FlatMapElements, PAssert$0}) -> FlatMap (FlatMap at ExtractOutput[0]) -> Map 
> (Key Extractor) -> GroupCombine (GroupCombine at GroupCombine: 
> PAssert$0/GroupGlobally/GatherAllOutputs/GroupByKey) -> Map (Key Extractor) 
> (2/2) java.lang.ClassCastException: org.apache.beam.sdk.values.KV cannot be 
> cast to [B
>   at 
> org.apache.beam.sdk.coders.ByteArrayCoder.encode(ByteArrayCoder.java:41)
>   at 
> org.apache.beam.sdk.coders.LengthPrefixCoder.encode(LengthPrefixCoder.java:56)
>   at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
>   at 
> org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:590)
>   at 
> org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:581)
>   at 
> org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:541)
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataSizeBasedBufferingOutboundObserver.accept(BeamFnDataSizeBasedBufferingOutboundObserver.java:109)
>   at 
> org.apache.beam.runners.fnexecution.control.SdkHarnessClient$CountingFnDataReceiver.accept(SdkHarnessClient.java:667)
>   at 
> org.apache.beam.runners.flink.translation.functions.FlinkExecutableStageFunction.processElements(FlinkExecutableStageFunction.java:271)
>   at 
> org.apache.beam.runners.flink.translation.functions.FlinkExecutableStageFunction.mapPartition(FlinkExecutableStageFunction.java:203)
>   at 
> org.apache.flink.runtime.operators.MapPartitionDriver.run(MapPartitionDriver.java:103)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:504)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>   at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9810) Add a Tox (precommit) suite for Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9810?focusedWorklogId=437770&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437770
 ]

ASF GitHub Bot logged work on BEAM-9810:


Author: ASF GitHub Bot
Created on: 27/May/20 12:46
Start Date: 27/May/20 12:46
Worklog Time Spent: 10m 
  Work Description: kamilwu removed a comment on pull request #11707:
URL: https://github.com/apache/beam/pull/11707#issuecomment-634614624


   Run Java PreCommit



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437770)
Time Spent: 3h 10m  (was: 3h)

> Add a Tox (precommit) suite for Python 3.8
> --
>
> Key: BEAM-9810
> URL: https://issues.apache.org/jira/browse/BEAM-9810
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core, testing
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9603) Support Dynamic Timer in Java SDK over FnApi

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9603?focusedWorklogId=437771&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437771
 ]

ASF GitHub Bot logged work on BEAM-9603:


Author: ASF GitHub Bot
Created on: 27/May/20 12:48
Start Date: 27/May/20 12:48
Worklog Time Spent: 10m 
  Work Description: iemejia commented on pull request #11756:
URL: https://github.com/apache/beam/pull/11756#issuecomment-634636975


   Additionally, it seems the breaking tests are passing now (#11825), so this 
should be good to go.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437771)
Time Spent: 7h 20m  (was: 7h 10m)

> Support Dynamic Timer in Java SDK over FnApi
> 
>
> Key: BEAM-9603
> URL: https://issues.apache.org/jira/browse/BEAM-9603
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-harness
>Reporter: Boyuan Zhang
>Assignee: Yichi Zhang
>Priority: P2
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-9785) Add PostCommit suite for Python 3.8

2020-05-27 Thread yoshiki obata (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yoshiki obata reassigned BEAM-9785:
---

Assignee: Ashwin Ramaswami

> Add PostCommit suite for Python 3.8
> ---
>
> Key: BEAM-9785
> URL: https://issues.apache.org/jira/browse/BEAM-9785
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: yoshiki obata
>Assignee: Ashwin Ramaswami
>Priority: P2
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Add PostCommit suites for Python 3.8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9810) Add a Tox (precommit) suite for Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9810?focusedWorklogId=437776&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437776
 ]

ASF GitHub Bot logged work on BEAM-9810:


Author: ASF GitHub Bot
Created on: 27/May/20 13:11
Start Date: 27/May/20 13:11
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on pull request #11707:
URL: https://github.com/apache/beam/pull/11707#issuecomment-634649590


   Run Python PreCommit



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437776)
Time Spent: 3h 20m  (was: 3h 10m)

> Add a Tox (precommit) suite for Python 3.8
> --
>
> Key: BEAM-9810
> URL: https://issues.apache.org/jira/browse/BEAM-9810
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core, testing
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-2788) Use SerializablePipelineOptions to serde PipelineOptions

2020-05-27 Thread Ashwin Ramaswami (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-2788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin Ramaswami resolved BEAM-2788.

Fix Version/s: Not applicable
   Resolution: Fixed

This was merged in 
[https://github.com/apache/beam/commit/d8d3f30936a33cfdc106e486daad6ef81a4699bd]

> Use SerializablePipelineOptions to serde PipelineOptions
> 
>
> Key: BEAM-2788
> URL: https://issues.apache.org/jira/browse/BEAM-2788
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-gearpump
>Affects Versions: 2.0.0
>Reporter: Manu Zhang
>Assignee: Huafeng Wang
>Priority: P3
>  Labels: triaged
> Fix For: Not applicable
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-10065) Docs - Beam "Release guide" template is broken

2020-05-27 Thread Ashwin Ramaswami (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin Ramaswami resolved BEAM-10065.
-
Fix Version/s: 2.23.0
   Resolution: Fixed

> Docs - Beam "Release guide" template is broken
> --
>
> Key: BEAM-10065
> URL: https://issues.apache.org/jira/browse/BEAM-10065
> Project: Beam
>  Issue Type: Bug
>  Components: website
>Reporter: Ashwin Ramaswami
>Assignee: Ashwin Ramaswami
>Priority: P2
> Fix For: 2.23.0
>
> Attachments: Screen Shot 2020-05-22 at 9.09.35 AM.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> It just shows "e>" for the template.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10101) Add a HttpIO / HttpFileSystem

2020-05-27 Thread Ashwin Ramaswami (Jira)
Ashwin Ramaswami created BEAM-10101:
---

 Summary: Add a HttpIO / HttpFileSystem
 Key: BEAM-10101
 URL: https://issues.apache.org/jira/browse/BEAM-10101
 Project: Beam
  Issue Type: Bug
  Components: io-ideas
Reporter: Ashwin Ramaswami


Add HttpIO (and a related HttpFileSystem and HttpsFileSystem), which can 
download files from a particular http:// or https:// URL. HttpIO cannot upload 
/ write to files, though, because there's no standardized way to write to files 
using HTTP.

Sample usage:

 
{code:python}
(
p
| 
ReadFromText("https://raw.githubusercontent.com/apache/beam/5ff5313f0913ec81d31ad306400ad30c0a928b34/NOTICE")
| WriteToText("output.txt", shard_name_template="", num_shards=0)
)
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-10101) Add a HttpIO / HttpFileSystem

2020-05-27 Thread Ashwin Ramaswami (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin Ramaswami reassigned BEAM-10101:
---

Assignee: Ashwin Ramaswami

> Add a HttpIO / HttpFileSystem
> -
>
> Key: BEAM-10101
> URL: https://issues.apache.org/jira/browse/BEAM-10101
> Project: Beam
>  Issue Type: Bug
>  Components: io-ideas
>Reporter: Ashwin Ramaswami
>Assignee: Ashwin Ramaswami
>Priority: P2
>
> Add HttpIO (and a related HttpFileSystem and HttpsFileSystem), which can 
> download files from a particular http:// or https:// URL. HttpIO cannot 
> upload / write to files, though, because there's no standardized way to write 
> to files using HTTP.
> Sample usage:
>  
> {code:python}
> (
> p
> | 
> ReadFromText("https://raw.githubusercontent.com/apache/beam/5ff5313f0913ec81d31ad306400ad30c0a928b34/NOTICE")
> | WriteToText("output.txt", shard_name_template="", num_shards=0)
> )
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9810) Add a Tox (precommit) suite for Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9810?focusedWorklogId=437782&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437782
 ]

ASF GitHub Bot logged work on BEAM-9810:


Author: ASF GitHub Bot
Created on: 27/May/20 13:33
Start Date: 27/May/20 13:33
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on pull request #11707:
URL: https://github.com/apache/beam/pull/11707#issuecomment-634664271


   Run Java PreCommit



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437782)
Time Spent: 3.5h  (was: 3h 20m)

> Add a Tox (precommit) suite for Python 3.8
> --
>
> Key: BEAM-9810
> URL: https://issues.apache.org/jira/browse/BEAM-9810
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core, testing
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-10016) Flink postcommits failing testFlattenWithDifferentInputAndOutputCoders2

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10016?focusedWorklogId=437783&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437783
 ]

ASF GitHub Bot logged work on BEAM-10016:
-

Author: ASF GitHub Bot
Created on: 27/May/20 13:37
Start Date: 27/May/20 13:37
Worklog Time Spent: 10m 
  Work Description: ibzib commented on pull request #11825:
URL: https://github.com/apache/beam/pull/11825#issuecomment-634666477


   > Looks like a bunch of the portable ValidatesRunner tests are failing for 
Spark. Needs to be looked up independently.
   
   I'm working on it (BEAM-9971).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437783)
Time Spent: 1h 50m  (was: 1h 40m)

> Flink postcommits failing testFlattenWithDifferentInputAndOutputCoders2
> ---
>
> Key: BEAM-10016
> URL: https://issues.apache.org/jira/browse/BEAM-10016
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Kyle Weaver
>Assignee: Maximilian Michels
>Priority: P2
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Both beam_PostCommit_Java_PVR_Flink_Batch and 
> beam_PostCommit_Java_PVR_Flink_Streaming are failing newly added test 
> org.apache.beam.sdk.transforms.FlattenTest.testFlattenWithDifferentInputAndOutputCoders2.
> SEVERE: Error in task code:  CHAIN MapPartition (MapPartition at [6]{Values, 
> FlatMapElements, PAssert$0}) -> FlatMap (FlatMap at ExtractOutput[0]) -> Map 
> (Key Extractor) -> GroupCombine (GroupCombine at GroupCombine: 
> PAssert$0/GroupGlobally/GatherAllOutputs/GroupByKey) -> Map (Key Extractor) 
> (2/2) java.lang.ClassCastException: org.apache.beam.sdk.values.KV cannot be 
> cast to [B
>   at 
> org.apache.beam.sdk.coders.ByteArrayCoder.encode(ByteArrayCoder.java:41)
>   at 
> org.apache.beam.sdk.coders.LengthPrefixCoder.encode(LengthPrefixCoder.java:56)
>   at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
>   at 
> org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:590)
>   at 
> org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:581)
>   at 
> org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:541)
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataSizeBasedBufferingOutboundObserver.accept(BeamFnDataSizeBasedBufferingOutboundObserver.java:109)
>   at 
> org.apache.beam.runners.fnexecution.control.SdkHarnessClient$CountingFnDataReceiver.accept(SdkHarnessClient.java:667)
>   at 
> org.apache.beam.runners.flink.translation.functions.FlinkExecutableStageFunction.processElements(FlinkExecutableStageFunction.java:271)
>   at 
> org.apache.beam.runners.flink.translation.functions.FlinkExecutableStageFunction.mapPartition(FlinkExecutableStageFunction.java:203)
>   at 
> org.apache.flink.runtime.operators.MapPartitionDriver.run(MapPartitionDriver.java:103)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:504)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>   at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10102) Fix query in python pubsub IO streaming performance tests dashboards

2020-05-27 Thread Kenneth Knowles (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-10102:
---
Status: Open  (was: Triage Needed)

> Fix query in python pubsub IO streaming performance tests dashboards
> 
>
> Key: BEAM-10102
> URL: https://issues.apache.org/jira/browse/BEAM-10102
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: Not applicable
>Reporter: Piotr Szuberski
>Assignee: Piotr Szuberski
>Priority: P1
> Fix For: Not applicable
>
>
> There is
> "metric" = \"pubsub_io_perf_read_runtime\"
> instead of
> "metric" = 'pubsub_io_perf_read_runtime'
> in the Grafana dashboard JSON. (In SQL, double quotes mark identifiers while 
> single quotes mark string literals, so the double-quoted form never matches 
> the metric name.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10102) Fix query in python pubsub IO streaming performance tests dashboards

2020-05-27 Thread Piotr Szuberski (Jira)
Piotr Szuberski created BEAM-10102:
--

 Summary: Fix query in python pubsub IO streaming performance tests 
dashboards
 Key: BEAM-10102
 URL: https://issues.apache.org/jira/browse/BEAM-10102
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Affects Versions: Not applicable
Reporter: Piotr Szuberski
Assignee: Piotr Szuberski
 Fix For: Not applicable


There is

"metric" = \"pubsub_io_perf_read_runtime\"

instead of

"metric" = 'pubsub_io_perf_read_runtime'

in the Grafana dashboard JSON. (In SQL, double quotes mark identifiers while 
single quotes mark string literals, so the double-quoted form never matches 
the metric name.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-10102) Fix query in python pubsub IO streaming performance tests dashboards

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10102?focusedWorklogId=437784&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437784
 ]

ASF GitHub Bot logged work on BEAM-10102:
-

Author: ASF GitHub Bot
Created on: 27/May/20 13:43
Start Date: 27/May/20 13:43
Worklog Time Spent: 10m 
  Work Description: piotr-szuberski opened a new pull request #11827:
URL: https://github.com/apache/beam/pull/11827


   Replace "" (double quotes) with '' (single quotes) in the pubsub read query 
in the Grafana dashboard.
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] Update `CHANGES.md` with noteworthy changes.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more 
tips on [how to make review process 
smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   

[jira] [Work logged] (BEAM-7774) Stop usign Perfkit Benchmarker tool in Python performance tests

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7774?focusedWorklogId=437785&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437785
 ]

ASF GitHub Bot logged work on BEAM-7774:


Author: ASF GitHub Bot
Created on: 27/May/20 13:45
Start Date: 27/May/20 13:45
Worklog Time Spent: 10m 
  Work Description: piotr-szuberski commented on a change in pull request 
#11661:
URL: https://github.com/apache/beam/pull/11661#discussion_r431142517



##
File path: sdks/python/apache_beam/examples/wordcount_it_test.py
##
@@ -104,18 +107,33 @@ def _run_wordcount_it(self, run_wordcount, **opts):
 run_time = end_time - start_time
 
 if publish_to_bq:
-  bq_publisher = BigQueryMetricsPublisher(
-  project_name=test_pipeline.get_option('project'),
-  table=test_pipeline.get_option('metrics_table'),
-  dataset=test_pipeline.get_option('metrics_dataset'),
-  )
-  result = Metric(
-  submit_timestamp=time.time(),
-  metric_id=uuid.uuid4().hex,
-  value=run_time,
-  label='Python performance test',
-  )
-  bq_publisher.publish([result.as_dict()])
+  self._publish_metrics(test_pipeline, run_time)
+
+  def _publish_metrics(self, pipeline, metric_value):

Review comment:
   I think that's a very good idea.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437785)
Time Spent: 3h 50m  (was: 3h 40m)

> Stop usign Perfkit Benchmarker tool in Python performance tests
> ---
>
> Key: BEAM-7774
> URL: https://issues.apache.org/jira/browse/BEAM-7774
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Piotr Szuberski
>Priority: P2
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7774) Stop usign Perfkit Benchmarker tool in Python performance tests

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7774?focusedWorklogId=437786&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437786
 ]

ASF GitHub Bot logged work on BEAM-7774:


Author: ASF GitHub Bot
Created on: 27/May/20 13:45
Start Date: 27/May/20 13:45
Worklog Time Spent: 10m 
  Work Description: piotr-szuberski commented on a change in pull request 
#11661:
URL: https://github.com/apache/beam/pull/11661#discussion_r431142517



##
File path: sdks/python/apache_beam/examples/wordcount_it_test.py
##
@@ -104,18 +107,33 @@ def _run_wordcount_it(self, run_wordcount, **opts):
 run_time = end_time - start_time
 
 if publish_to_bq:
-  bq_publisher = BigQueryMetricsPublisher(
-  project_name=test_pipeline.get_option('project'),
-  table=test_pipeline.get_option('metrics_table'),
-  dataset=test_pipeline.get_option('metrics_dataset'),
-  )
-  result = Metric(
-  submit_timestamp=time.time(),
-  metric_id=uuid.uuid4().hex,
-  value=run_time,
-  label='Python performance test',
-  )
-  bq_publisher.publish([result.as_dict()])
+  self._publish_metrics(test_pipeline, run_time)
+
+  def _publish_metrics(self, pipeline, metric_value):

Review comment:
   I think it's a very good idea to add a method to the MetricsReader.
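
   For illustration, a rough sketch of what that helper could look like, 
reusing the BigQueryMetricsPublisher and Metric classes from the diff above 
(the import path and the placement of the method are assumptions, not the 
final design):

{code:python}
# Sketch only: import path assumed from Beam's load-test utilities; the
# helper is written as a method (hence `self`) per the review suggestion.
import time
import uuid

from apache_beam.testing.load_tests.load_test_metrics_utils import (
    BigQueryMetricsPublisher, Metric)


def _publish_metrics(self, pipeline, metric_value):
  publisher = BigQueryMetricsPublisher(
      project_name=pipeline.get_option('project'),
      table=pipeline.get_option('metrics_table'),
      dataset=pipeline.get_option('metrics_dataset'),
  )
  result = Metric(
      submit_timestamp=time.time(),
      metric_id=uuid.uuid4().hex,
      value=metric_value,
      label='Python performance test',
  )
  publisher.publish([result.as_dict()])
{code}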





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437786)
Time Spent: 4h  (was: 3h 50m)

> Stop usign Perfkit Benchmarker tool in Python performance tests
> ---
>
> Key: BEAM-7774
> URL: https://issues.apache.org/jira/browse/BEAM-7774
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Piotr Szuberski
>Priority: P2
>  Time Spent: 4h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10103) Add WasbIO / WasbFileSystem (Azure Blob Storage)

2020-05-27 Thread Ashwin Ramaswami (Jira)
Ashwin Ramaswami created BEAM-10103:
---

 Summary: Add WasbIO / WasbFileSystem (Azure Blob Storage)
 Key: BEAM-10103
 URL: https://issues.apache.org/jira/browse/BEAM-10103
 Project: Beam
  Issue Type: Bug
  Components: io-ideas
Reporter: Ashwin Ramaswami


Azure Blob Storage can be accessed by using the wasb:// and wasbs:// protocols. 
This should be quite similar to the hdfs:// implementations already there.

 

See:

 [1] 
[https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage]

 [2] [https://hadoop.apache.org/docs/current/hadoop-azure/index.html]

 [3] [https://gerardnico.com/azure/wasb]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10103) Add WasbIO / WasbFileSystem (Azure Blob Storage)

2020-05-27 Thread Ashwin Ramaswami (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin Ramaswami updated BEAM-10103:

Description: 
Azure Blob Storage can be accessed by using the wasb:// and wasbs:// protocols. 
This should be quite similar to the hdfs:// implementations already there.

 

We should just be able to use it like this:

 
{code:python}
(
p
| 
ReadFromText("yourcontai...@youraccount.blob.core.windows.net/test/sample.txt")
| WriteToText("output.txt", shard_name_template="", num_shards=0)
)
{code}


 

 

See:

 [1] 
[https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage]

 [2] [https://hadoop.apache.org/docs/current/hadoop-azure/index.html]

 [3] [https://gerardnico.com/azure/wasb]

  was:
Azure Blob Storage can be accessed by using the wasb:// and wasbs:// protocols. 
This should be quite similar to the hdfs:// implementations already there.

 

See:

 [1] 
[https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage]

 [2] [https://hadoop.apache.org/docs/current/hadoop-azure/index.html]

 [3] [https://gerardnico.com/azure/wasb]


> Add WasbIO / WasbFileSystem (Azure Blob Storage)
> 
>
> Key: BEAM-10103
> URL: https://issues.apache.org/jira/browse/BEAM-10103
> Project: Beam
>  Issue Type: Bug
>  Components: io-ideas
>Reporter: Ashwin Ramaswami
>Priority: P2
>
> Azure Blob Storage can be accessed by using the wasb:// and wasbs:// 
> protocols. This should be quite similar to the hdfs:// implementations 
> already there.
>  
> We should just be able to use it like this:
>  
> {code:python}
> (
> p
> | 
> ReadFromText("yourcontai...@youraccount.blob.core.windows.net/test/sample.txt")
> | WriteToText("output.txt", shard_name_template="", num_shards=0)
> )
> {code}
>  
>  
> See:
>  [1] 
> [https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage]
>  [2] [https://hadoop.apache.org/docs/current/hadoop-azure/index.html]
>  [3] [https://gerardnico.com/azure/wasb]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7774) Stop usign Perfkit Benchmarker tool in Python performance tests

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7774?focusedWorklogId=437787&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437787
 ]

ASF GitHub Bot logged work on BEAM-7774:


Author: ASF GitHub Bot
Created on: 27/May/20 13:47
Start Date: 27/May/20 13:47
Worklog Time Spent: 10m 
  Work Description: piotr-szuberski commented on a change in pull request 
#11661:
URL: https://github.com/apache/beam/pull/11661#discussion_r431143884



##
File path: 
.test-infra/metrics/grafana/dashboards/perftests_metrics/Python_Performance_Tests.json
##
@@ -0,0 +1,297 @@
+{

Review comment:
   Python WordCount IT Benchmarks definitely sounds better.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437787)
Time Spent: 4h 10m  (was: 4h)

> Stop usign Perfkit Benchmarker tool in Python performance tests
> ---
>
> Key: BEAM-7774
> URL: https://issues.apache.org/jira/browse/BEAM-7774
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Piotr Szuberski
>Priority: P2
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10104) Allow configuring credentials for access to S3FileSystem (Python)

2020-05-27 Thread Ashwin Ramaswami (Jira)
Ashwin Ramaswami created BEAM-10104:
---

 Summary: Allow configuring credentials for access to S3FileSystem 
(Python)
 Key: BEAM-10104
 URL: https://issues.apache.org/jira/browse/BEAM-10104
 Project: Beam
  Issue Type: Improvement
  Components: io-py-aws
Reporter: Ashwin Ramaswami


We should create something like Java's AwsOptions for Python, so that we can 
configure authenticated S3 access. See how hadoopfilesystem does it with 
HadoopFileSystemOptions, and see pipeline_options.py.
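
A minimal sketch of what this could look like, following the existing 
PipelineOptions pattern (the S3Options class and its flag names below are 
illustrative assumptions, not an existing Beam API):

{code:python}
# Sketch only: mirrors how HadoopFileSystemOptions registers its flags.
from apache_beam.options.pipeline_options import PipelineOptions


class S3Options(PipelineOptions):
  @classmethod
  def _add_argparse_args(cls, parser):
    parser.add_argument(
        '--s3_access_key_id',
        default=None,
        help='AWS access key ID used to authenticate S3 requests.')
    parser.add_argument(
        '--s3_secret_access_key',
        default=None,
        help='AWS secret access key used to authenticate S3 requests.')
    parser.add_argument(
        '--s3_region_name',
        default=None,
        help='AWS region of the S3 buckets being accessed.')
{code}

The S3 filesystem could then read these options from the pipeline the same way 
hadoopfilesystem reads HadoopFileSystemOptions.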



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7774) Stop usign Perfkit Benchmarker tool in Python performance tests

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7774?focusedWorklogId=437790&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437790
 ]

ASF GitHub Bot logged work on BEAM-7774:


Author: ASF GitHub Bot
Created on: 27/May/20 13:51
Start Date: 27/May/20 13:51
Worklog Time Spent: 10m 
  Work Description: piotr-szuberski commented on a change in pull request 
#11661:
URL: https://github.com/apache/beam/pull/11661#discussion_r431143884



##
File path: 
.test-infra/metrics/grafana/dashboards/perftests_metrics/Python_Performance_Tests.json
##
@@ -0,0 +1,297 @@
+{

Review comment:
   Python WordCount IT Benchmarks definitely sounds better.
   BTW, there was a typo WorldCount instead of WordCount :D





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437790)
Time Spent: 4h 20m  (was: 4h 10m)

> Stop usign Perfkit Benchmarker tool in Python performance tests
> ---
>
> Key: BEAM-7774
> URL: https://issues.apache.org/jira/browse/BEAM-7774
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Piotr Szuberski
>Priority: P2
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9810) Add a Tox (precommit) suite for Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9810?focusedWorklogId=437792&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437792
 ]

ASF GitHub Bot logged work on BEAM-9810:


Author: ASF GitHub Bot
Created on: 27/May/20 13:52
Start Date: 27/May/20 13:52
Worklog Time Spent: 10m 
  Work Description: kamilwu removed a comment on pull request #11707:
URL: https://github.com/apache/beam/pull/11707#issuecomment-634649590







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437792)
Time Spent: 3h 40m  (was: 3.5h)

> Add a Tox (precommit) suite for Python 3.8
> --
>
> Key: BEAM-9810
> URL: https://issues.apache.org/jira/browse/BEAM-9810
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core, testing
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9679) Core Transforms | Go SDK Code Katas

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9679?focusedWorklogId=437795&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437795
 ]

ASF GitHub Bot logged work on BEAM-9679:


Author: ASF GitHub Bot
Created on: 27/May/20 14:01
Start Date: 27/May/20 14:01
Worklog Time Spent: 10m 
  Work Description: henryken commented on a change in pull request #11806:
URL: https://github.com/apache/beam/pull/11806#discussion_r431148739



##
File path: learning/katas/go/Core Transforms/Flatten/Flatten/task-info.yaml
##
@@ -0,0 +1,31 @@
+  // Licensed to the Apache Software Foundation (ASF) under one or more

Review comment:
   Incorrect comment syntax for a YAML file.

##
File path: learning/katas/go/Core Transforms/Flatten/Flatten/task.md
##
@@ -0,0 +1,38 @@
+
+
+Flatten
+---
+
+Flatten is a Beam transform for PCollection objects that store the same data 
type. Flatten merges 
+multiple PCollection objects into a single logical PCollection.
+
+**Kata:** Implement a 

Review comment:
   The use of "implement" here means to implement a solution. The same 
description is used across many task descriptions.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437795)
Time Spent: 3h  (was: 2h 50m)

> Core Transforms | Go SDK Code Katas
> ---
>
> Key: BEAM-9679
> URL: https://issues.apache.org/jira/browse/BEAM-9679
> Project: Beam
>  Issue Type: Sub-task
>  Components: katas, sdk-go
>Reporter: Damon Douglas
>Assignee: Damon Douglas
>Priority: P2
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> A kata devoted to core Beam transform patterns, after 
> [https://github.com/apache/beam/tree/master/learning/katas/java/Core%20Transforms],
>  where the takeaway is an individual's ability to master the following 
> transforms in an Apache Beam pipeline using the Golang SDK.
>  
> ||Transform||Pull Request||Status||
> |Map|[11564|https://github.com/apache/beam/pull/11564]|Closed|
> |GroupByKey|[11734|https://github.com/apache/beam/pull/11734]|Closed|
> |CoGroupByKey|[11803|https://github.com/apache/beam/pull/11803]|Open|
> |Combine| | |
> |Flatten|[11806|https://github.com/apache/beam/pull/11806]| |
> |Partition| | |
> |Side Input| | |
> |Side Output| | |
> |Branching| | |
> |Composite Transform| | |
> |DoFn Additional Parameters| | |



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9679) Core Transforms | Go SDK Code Katas

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9679?focusedWorklogId=437796&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437796
 ]

ASF GitHub Bot logged work on BEAM-9679:


Author: ASF GitHub Bot
Created on: 27/May/20 14:03
Start Date: 27/May/20 14:03
Worklog Time Spent: 10m 
  Work Description: henryken commented on a change in pull request #11806:
URL: https://github.com/apache/beam/pull/11806#discussion_r431156976



##
File path: learning/katas/go/Core Transforms/Flatten/Flatten/cmd/main.go
##
@@ -0,0 +1,28 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements.  See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License.  You may obtain a copy of the License at
+//
+//http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package task
+
+import "github.com/apache/beam/sdks/go/pkg/beam"
+
+func ApplyTransform(s beam.Scope, input beam.PCollection) beam.PCollection {

Review comment:
   Looks good now. Thanks!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437796)
Time Spent: 3h 10m  (was: 3h)

> Core Transforms | Go SDK Code Katas
> ---
>
> Key: BEAM-9679
> URL: https://issues.apache.org/jira/browse/BEAM-9679
> Project: Beam
>  Issue Type: Sub-task
>  Components: katas, sdk-go
>Reporter: Damon Douglas
>Assignee: Damon Douglas
>Priority: P2
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> A kata devoted to core Beam transform patterns, after 
> [https://github.com/apache/beam/tree/master/learning/katas/java/Core%20Transforms],
>  where the takeaway is an individual's ability to master the following 
> transforms in an Apache Beam pipeline using the Golang SDK.
>  
> ||Transform||Pull Request||Status||
> |Map|[11564|https://github.com/apache/beam/pull/11564]|Closed|
> |GroupByKey|[11734|https://github.com/apache/beam/pull/11734]|Closed|
> |CoGroupByKey|[11803|https://github.com/apache/beam/pull/11803]|Open|
> |Combine| | |
> |Flatten|[11806|https://github.com/apache/beam/pull/11806]| |
> |Partition| | |
> |Side Input| | |
> |Side Output| | |
> |Branching| | |
> |Composite Transform| | |
> |DoFn Additional Parameters| | |



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10106) Script the deployment of artifacts to pypi

2020-05-27 Thread Kyle Weaver (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Weaver updated BEAM-10106:
---
Description: Right now there are only manual instructions, which are tedious 
and error-prone. 
http://localhost:1313/contribute/release-guide/#8-finalize-the-release  (was: 
Right now there are only manual instructions, which are tedious and error-prone.)

> Script the deployment of artifacts to pypi
> --
>
> Key: BEAM-10106
> URL: https://issues.apache.org/jira/browse/BEAM-10106
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system
>Reporter: Kyle Weaver
>Assignee: Kyle Weaver
>Priority: P2
>
> Right now there are only manual instructions, which are tedious and 
> error-prone. 
> http://localhost:1313/contribute/release-guide/#8-finalize-the-release



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10106) Script the deployment of artifacts to pypi

2020-05-27 Thread Kenneth Knowles (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-10106:
---
Status: Open  (was: Triage Needed)

> Script the deployment of artifacts to pypi
> --
>
> Key: BEAM-10106
> URL: https://issues.apache.org/jira/browse/BEAM-10106
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system
>Reporter: Kyle Weaver
>Assignee: Kyle Weaver
>Priority: P2
>
> Right now there are only manual instructions, which are tedious and error-prone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10106) Script the deployment of artifacts to pypi

2020-05-27 Thread Kyle Weaver (Jira)
Kyle Weaver created BEAM-10106:
--

 Summary: Script the deployment of artifacts to pypi
 Key: BEAM-10106
 URL: https://issues.apache.org/jira/browse/BEAM-10106
 Project: Beam
  Issue Type: Improvement
  Components: build-system
Reporter: Kyle Weaver
Assignee: Kyle Weaver


Right now there are only manual instructions, which are tedious and error-prone.
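
For illustration, a minimal sketch of what such a script could automate, 
assuming twine is installed and the release artifacts are already staged under 
dist/ (this is not the agreed release tooling):

{code:python}
# Sketch only: assumes twine and pre-built artifacts in dist/.
import glob
import subprocess


def deploy_to_pypi(dist_dir='dist'):
  artifacts = glob.glob('%s/*' % dist_dir)
  # Validate wheel/sdist metadata before uploading.
  subprocess.check_call(['twine', 'check'] + artifacts)
  # Upload; credentials come from ~/.pypirc or TWINE_* environment variables.
  subprocess.check_call(['twine', 'upload'] + artifacts)


if __name__ == '__main__':
  deploy_to_pypi()
{code}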



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10105) Add more metadata fields to FileMetadata

2020-05-27 Thread Ashwin Ramaswami (Jira)
Ashwin Ramaswami created BEAM-10105:
---

 Summary: Add more metadata fields to FileMetadata
 Key: BEAM-10105
 URL: https://issues.apache.org/jira/browse/BEAM-10105
 Project: Beam
  Issue Type: Improvement
  Components: io-py-files
Reporter: Ashwin Ramaswami


We currently only have "path" and "size_in_bytes" in FileMetadata. We could add 
additional fields such as the following (see the sketch after the list):

- mime type
- permissions
- data hash (such as the ETag from S3)
- date created
- date updated
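
A sketch of what the extended type could look like (the new field names are 
illustrative assumptions for discussion, not an agreed-on API):

{code:python}
# Sketch only: extends the existing FileMetadata with optional fields.
class FileMetadata(object):
  """Metadata about a single file in a FileSystem."""

  def __init__(self, path, size_in_bytes, mime_type=None, permissions=None,
               content_hash=None, created=None, updated=None):
    self.path = path                    # existing field
    self.size_in_bytes = size_in_bytes  # existing field
    self.mime_type = mime_type          # e.g. 'text/csv'
    self.permissions = permissions      # filesystem-specific mode / ACL
    self.content_hash = content_hash    # e.g. the ETag from S3
    self.created = created              # creation timestamp, if available
    self.updated = updated              # last-modified timestamp
{code}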



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10106) Script the deployment of artifacts to pypi

2020-05-27 Thread Kyle Weaver (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Weaver updated BEAM-10106:
---
Description: Right now there are only manual instructions, which are tedious 
and error-prone. 
https://beam.apache.org/contribute/release-guide/#8-finalize-the-release  (was: 
Right now there are only manual instructions, which are tedious and error-prone. 
http://localhost:1313/contribute/release-guide/#8-finalize-the-release)

> Script the deployment of artifacts to pypi
> --
>
> Key: BEAM-10106
> URL: https://issues.apache.org/jira/browse/BEAM-10106
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system
>Reporter: Kyle Weaver
>Assignee: Kyle Weaver
>Priority: P2
>
> Right now there are only manual instructions, which are tedious and 
> error-prone. 
> https://beam.apache.org/contribute/release-guide/#8-finalize-the-release



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9679) Core Transforms | Go SDK Code Katas

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9679?focusedWorklogId=437801&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437801
 ]

ASF GitHub Bot logged work on BEAM-9679:


Author: ASF GitHub Bot
Created on: 27/May/20 14:17
Start Date: 27/May/20 14:17
Worklog Time Spent: 10m 
  Work Description: henryken commented on a change in pull request #11803:
URL: https://github.com/apache/beam/pull/11803#discussion_r431167018



##
File path: learning/katas/go/Core 
Transforms/CoGroupByKey/CoGroupByKey/pkg/task/task.go
##
@@ -0,0 +1,52 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements.  See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License.  You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package task
+
+import (
+   "fmt"
+   "github.com/apache/beam/sdks/go/pkg/beam"
+)
+
+func ApplyTransform(s beam.Scope, fruits beam.PCollection, countries beam.PCollection) beam.PCollection {
+   fruitsKV := beam.ParDo(s, func(e string) (string, string) {
+      return string(e[0]), e
+   }, fruits)
+
+   countriesKV := beam.ParDo(s, func(e string) (string, string) {
+      return string(e[0]), e
+   }, countries)
+
+   grouped := beam.CoGroupByKey(s, fruitsKV, countriesKV)
+   return beam.ParDo(s, func(key string, f func(*string) bool, c func(*string) bool, emit func(string)) {

Review comment:
   Thanks for creating the cleanup task!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437801)
Time Spent: 3h 20m  (was: 3h 10m)

> Core Transforms | Go SDK Code Katas
> ---
>
> Key: BEAM-9679
> URL: https://issues.apache.org/jira/browse/BEAM-9679
> Project: Beam
>  Issue Type: Sub-task
>  Components: katas, sdk-go
>Reporter: Damon Douglas
>Assignee: Damon Douglas
>Priority: P2
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> A kata devoted to core Beam transform patterns, modeled after 
> [https://github.com/apache/beam/tree/master/learning/katas/java/Core%20Transforms],
>  where the takeaway is an individual's ability to master the following 
> transforms in an Apache Beam pipeline using the Golang SDK.
>  
> ||Transform||Pull Request||Status||
> |Map|[11564|https://github.com/apache/beam/pull/11564]|Closed|
> |GroupByKey|[11734|https://github.com/apache/beam/pull/11734]|Closed|
> |CoGroupByKey|[11803|https://github.com/apache/beam/pull/11803]|Open|
> |Combine| | |
> |Flatten|[11806|https://github.com/apache/beam/pull/11806]| |
> |Partition| | |
> |Side Input| | |
> |Side Output| | |
> |Branching| | |
> |Composite Transform| | |
> |DoFn Additional Parameters| | |



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9679) Core Transforms | Go SDK Code Katas

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9679?focusedWorklogId=437804&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437804
 ]

ASF GitHub Bot logged work on BEAM-9679:


Author: ASF GitHub Bot
Created on: 27/May/20 14:25
Start Date: 27/May/20 14:25
Worklog Time Spent: 10m 
  Work Description: henryken commented on pull request #11803:
URL: https://github.com/apache/beam/pull/11803#issuecomment-634696399


   LGTM now. Thanks for the changes @damondouglas.
   
   For updating to Stepik, can you wait to do it together with [PR 
#11806](https://github.com/apache/beam/pull/11806)?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437804)
Time Spent: 3.5h  (was: 3h 20m)

> Core Transforms | Go SDK Code Katas
> ---
>
> Key: BEAM-9679
> URL: https://issues.apache.org/jira/browse/BEAM-9679
> Project: Beam
>  Issue Type: Sub-task
>  Components: katas, sdk-go
>Reporter: Damon Douglas
>Assignee: Damon Douglas
>Priority: P2
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> A kata devoted to core Beam transform patterns, modeled after 
> [https://github.com/apache/beam/tree/master/learning/katas/java/Core%20Transforms],
>  where the takeaway is an individual's ability to master the following 
> transforms in an Apache Beam pipeline using the Golang SDK.
>  
> ||Transform||Pull Request||Status||
> |Map|[11564|https://github.com/apache/beam/pull/11564]|Closed|
> |GroupByKey|[11734|https://github.com/apache/beam/pull/11734]|Closed|
> |CoGroupByKey|[11803|https://github.com/apache/beam/pull/11803]|Open|
> |Combine| | |
> |Flatten|[11806|https://github.com/apache/beam/pull/11806]| |
> |Partition| | |
> |Side Input| | |
> |Side Output| | |
> |Branching| | |
> |Composite Transform| | |
> |DoFn Additional Parameters| | |



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9810) Add a Tox (precommit) suite for Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9810?focusedWorklogId=437807&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437807
 ]

ASF GitHub Bot logged work on BEAM-9810:


Author: ASF GitHub Bot
Created on: 27/May/20 14:33
Start Date: 27/May/20 14:33
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on pull request #11707:
URL: https://github.com/apache/beam/pull/11707#issuecomment-634701208


   Run Java PreCommit



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437807)
Time Spent: 3h 50m  (was: 3h 40m)

> Add a Tox (precommit) suite for Python 3.8
> --
>
> Key: BEAM-9810
> URL: https://issues.apache.org/jira/browse/BEAM-9810
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core, testing
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9946) Enhance Partition transform to provide partitionfn with SideInputs

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9946?focusedWorklogId=437811&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437811
 ]

ASF GitHub Bot logged work on BEAM-9946:


Author: ASF GitHub Bot
Created on: 27/May/20 14:41
Start Date: 27/May/20 14:41
Worklog Time Spent: 10m 
  Work Description: darshanj commented on pull request #11682:
URL: https://github.com/apache/beam/pull/11682#issuecomment-634707426


   This fails for:
   
org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapperTest$ParameterizedUnboundedSourceWrapperTest.testWatermarkEmission[numTasks = 1; numSplits=2]
   
   The failure doesn't seem to be related to this change.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437811)
Remaining Estimate: 93h  (was: 93h 10m)
Time Spent: 3h  (was: 2h 50m)

> Enhance Partition transform to provide partitionfn with SideInputs
> --
>
> Key: BEAM-9946
> URL: https://issues.apache.org/jira/browse/BEAM-9946
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Darshan Jani
>Assignee: Darshan Jani
>Priority: P2
>   Original Estimate: 96h
>  Time Spent: 3h
>  Remaining Estimate: 93h
>
> Currently the _Partition_ transform can partition a collection into n 
> collections based only on the _element_ value, which _PartitionFn_ uses to 
> decide which partition a particular element belongs to.
> {code:java}
> public interface PartitionFn<T> extends Serializable {
>   int partitionFor(T elem, int numPartitions);
> }
>
> public static <T> Partition<T> of(int numPartitions, PartitionFn<? super T> partitionFn) {
>   return new Partition<>(new PartitionDoFn<T>(numPartitions, partitionFn));
> }
> {code}
> It would be useful to introduce a new API that additionally provides 
> _sideInputs_ to the partition function. Users would then be able to use both 
> the _element_ value and the _sideInputs_ to decide which partition a 
> particular element belongs to.
> Option-1: Proposed new API:
> {code:java}
> public interface PartitionWithSideInputsFn<T> extends Serializable {
>   int partitionFor(T elem, int numPartitions, Context c);
> }
>
> public static <T> Partition<T> of(int numPartitions, PartitionWithSideInputsFn<? super T> partitionFn, Requirements requirements) {
>  ...
> }
> {code}
> Users can pick whichever of the two APIs suits their partitioning logic.
> Option-2: Redesign the old API with a builder pattern that can optionally 
> take a _Requirements_ with _sideInputs_, and deprecate the old API.
> {code:java}
> // using side input views
> Partition.into(numberOfPartitions).via(
>     fn(
>         (input, c) -> {
>             // use c.sideInput(view)
>             // use input
>             // return the partition number
>         }, requiresSideInputs(view))
> )
>
> // without using side input views
> Partition.into(numberOfPartitions).via(
>     fn((input, c) -> {
>         // use input
>         // return the partition number
>     })
> )
> {code}
>  
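> For comparison, a rough sketch of how this can look from user code in the 
> Python SDK (illustrative only; it assumes the Python _Partition_ transform 
> forwards extra arguments such as {{beam.pvalue.AsSingleton}} side inputs to 
> the partition function, as other transforms with side inputs do):
> {code:python}
> import apache_beam as beam
>
> with beam.Pipeline() as p:
>     # Side input: a single threshold value computed elsewhere.
>     threshold = p | 'Threshold' >> beam.Create([5])
>     nums = p | 'Nums' >> beam.Create([1, 3, 7, 9])
>     # The partition fn receives the side input as an extra argument.
>     partitions = nums | beam.Partition(
>         lambda n, num_partitions, t: 0 if n < t else 1,
>         2,
>         beam.pvalue.AsSingleton(threshold))
>     below, above = partitions[0], partitions[1]
> {code}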



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9421) AI Platform pipeline patterns

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9421?focusedWorklogId=437813&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437813
 ]

ASF GitHub Bot logged work on BEAM-9421:


Author: ASF GitHub Bot
Created on: 27/May/20 14:42
Start Date: 27/May/20 14:42
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on a change in pull request #11776:
URL: https://github.com/apache/beam/pull/11776#discussion_r431191464



##
File path: sdks/python/apache_beam/examples/snippets/snippets.py
##
@@ -1520,3 +1528,90 @@ def bigqueryio_deadletter():
   # [END BigQueryIODeadLetter]
 
   return result
+
+
+def extract_sentiments(response):
+  # [START nlp_extract_sentiments]
+  return {
+      'sentences': [{
+          sentence.text.content: sentence.sentiment.score
+      } for sentence in response.sentences],
+      'document_sentiment': response.document_sentiment.score,
+  }
+

Review comment:
   Done, thanks.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437813)
Time Spent: 12.5h  (was: 12h 20m)

> AI Platform pipeline patterns
> -
>
> Key: BEAM-9421
> URL: https://issues.apache.org/jira/browse/BEAM-9421
> Project: Beam
>  Issue Type: Sub-task
>  Components: website
>Reporter: Kamil Wasilewski
>Assignee: Kamil Wasilewski
>Priority: P2
>  Labels: pipeline-patterns
>  Time Spent: 12.5h
>  Remaining Estimate: 0h
>
> New pipeline patterns should be contributed to the Beam website in order to 
> demonstrate how the newly implemented Google Cloud AI PTransforms can be used 
> in pipelines.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9810) Add a Tox (precommit) suite for Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9810?focusedWorklogId=437818&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437818
 ]

ASF GitHub Bot logged work on BEAM-9810:


Author: ASF GitHub Bot
Created on: 27/May/20 14:44
Start Date: 27/May/20 14:44
Worklog Time Spent: 10m 
  Work Description: kamilwu removed a comment on pull request #11707:
URL: https://github.com/apache/beam/pull/11707#issuecomment-634701208







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437818)
Time Spent: 4h  (was: 3h 50m)

> Add a Tox (precommit) suite for Python 3.8
> --
>
> Key: BEAM-9810
> URL: https://issues.apache.org/jira/browse/BEAM-9810
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core, testing
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 4h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-10106) Script the deployment of artifacts to pypi

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10106?focusedWorklogId=437819&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437819
 ]

ASF GitHub Bot logged work on BEAM-10106:
-

Author: ASF GitHub Bot
Created on: 27/May/20 14:45
Start Date: 27/May/20 14:45
Worklog Time Spent: 10m 
  Work Description: ibzib opened a new pull request #11828:
URL: https://github.com/apache/beam/pull/11828


   R: @tvalentyn 
   cc: @TheNeuralBit 
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] Update `CHANGES.md` with noteworthy changes.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more 
tips on [how to make review process 
smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   

[jira] [Work logged] (BEAM-9810) Add a Tox (precommit) suite for Python 3.8

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9810?focusedWorklogId=437824&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437824
 ]

ASF GitHub Bot logged work on BEAM-9810:


Author: ASF GitHub Bot
Created on: 27/May/20 14:51
Start Date: 27/May/20 14:51
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on pull request #11707:
URL: https://github.com/apache/beam/pull/11707#issuecomment-634714585


   Run Java PreCommit



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437824)
Time Spent: 4h 10m  (was: 4h)

> Add a Tox (precommit) suite for Python 3.8
> --
>
> Key: BEAM-9810
> URL: https://issues.apache.org/jira/browse/BEAM-9810
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core, testing
>Reporter: Valentyn Tymofieiev
>Assignee: Kamil Wasilewski
>Priority: P2
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-10102) Fix query in python pubsub IO streaming performance tests dashboards

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10102?focusedWorklogId=437827&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437827
 ]

ASF GitHub Bot logged work on BEAM-10102:
-

Author: ASF GitHub Bot
Created on: 27/May/20 14:53
Start Date: 27/May/20 14:53
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on pull request #11827:
URL: https://github.com/apache/beam/pull/11827#issuecomment-634715978


   LGTM, thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437827)
Time Spent: 20m  (was: 10m)

> Fix query in python pubsub IO streaming performance tests dashboards
> 
>
> Key: BEAM-10102
> URL: https://issues.apache.org/jira/browse/BEAM-10102
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: Not applicable
>Reporter: Piotr Szuberski
>Assignee: Piotr Szuberski
>Priority: P1
> Fix For: Not applicable
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> There is
> {{"metric" = \"pubsub_io_perf_read_runtime\"}}
> instead of
> {{"metric" = 'pubsub_io_perf_read_runtime'}}
> in the grafana dashboard JSON. In SQL, double quotes denote identifiers 
> rather than string literals, so the double-quoted value never matches the 
> metric name.
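> As a quick illustrative check over the dashboard JSON (the file name and 
> panel layout below are assumptions, not the actual repository layout):
> {code:python}
> import json
>
> # Assumed file name for the dashboard definition.
> with open('pubsub_io_performance.json') as f:
>     dashboard = json.load(f)
>
> for panel in dashboard.get('panels', []):
>     for target in panel.get('targets', []):
>         sql = target.get('rawSql', '')
>         # String literals must be single-quoted in SQL; a double-quoted
>         # metric name is parsed as an identifier and matches nothing.
>         assert '"pubsub_io_perf_read_runtime"' not in sql
> {code}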



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9421) AI Platform pipeline patterns

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9421?focusedWorklogId=437828&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437828
 ]

ASF GitHub Bot logged work on BEAM-9421:


Author: ASF GitHub Bot
Created on: 27/May/20 14:54
Start Date: 27/May/20 14:54
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on pull request #11776:
URL: https://github.com/apache/beam/pull/11776#issuecomment-634716326


   Run Portable_Python PreCommit



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437828)
Time Spent: 12h 40m  (was: 12.5h)

> AI Platform pipeline patterns
> -
>
> Key: BEAM-9421
> URL: https://issues.apache.org/jira/browse/BEAM-9421
> Project: Beam
>  Issue Type: Sub-task
>  Components: website
>Reporter: Kamil Wasilewski
>Assignee: Kamil Wasilewski
>Priority: P2
>  Labels: pipeline-patterns
>  Time Spent: 12h 40m
>  Remaining Estimate: 0h
>
> New pipeline patterns should be contributed to the Beam website in order to 
> demonstrate how the newly implemented Google Cloud AI PTransforms can be used 
> in pipelines.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7770) Add ReadAll transform for SolrIO

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7770?focusedWorklogId=437830&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437830
 ]

ASF GitHub Bot logged work on BEAM-7770:


Author: ASF GitHub Bot
Created on: 27/May/20 14:58
Start Date: 27/May/20 14:58
Worklog Time Spent: 10m 
  Work Description: aromanenko-dev commented on a change in pull request 
#11357:
URL: https://github.com/apache/beam/pull/11357#discussion_r431203774



##
File path: sdks/java/io/solr/src/test/java/org/apache/beam/sdk/io/solr/SolrIOTest.java
##
@@ -155,6 +155,23 @@ public void testRead() throws Exception {
 pipeline.run();
   }
 
+  @Test
+  public void testReadAll() throws Exception {
+    SolrIOTestUtils.insertTestDocuments(SOLR_COLLECTION, NUM_DOCS, solrClient);
+
+    PCollection<SolrDocument> output =
+        pipeline
+            .apply(
+                Create.of(
+                    SolrIO.read()
+                        .withConnectionConfiguration(connectionConfiguration)
+                        .from(SOLR_COLLECTION)
+                        .withBatchSize(101)))
+            .apply(SolrIO.readAll());
+    PAssert.thatSingleton(output.apply("Count", Count.globally())).isEqualTo(NUM_DOCS);

Review comment:
   I believe we can improve this test in the future by reading with 
multiple `Read`s (the goal of adding `readAll()`) and checking the content of 
the read messages (not only the count).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437830)
Time Spent: 1h 20m  (was: 1h 10m)

> Add ReadAll transform for SolrIO
> 
>
> Key: BEAM-7770
> URL: https://issues.apache.org/jira/browse/BEAM-7770
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-solr
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: P3
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> SolrIO already uses a composable approach internally, but we need to expose 
> an explicit ReadAll transform that allows users to create reads in the 
> middle of the pipeline to improve composability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-7770) Add ReadAll transform for SolrIO

2020-05-27 Thread Alexey Romanenko (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko resolved BEAM-7770.

Fix Version/s: 2.23.0
   Resolution: Fixed

> Add ReadAll transform for SolrIO
> 
>
> Key: BEAM-7770
> URL: https://issues.apache.org/jira/browse/BEAM-7770
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-solr
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: P3
> Fix For: 2.23.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> SolrIO already uses a composable approach internally, but we need to expose 
> an explicit ReadAll transform that allows users to create reads in the 
> middle of the pipeline to improve composability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7770) Add ReadAll transform for SolrIO

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7770?focusedWorklogId=437831&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437831
 ]

ASF GitHub Bot logged work on BEAM-7770:


Author: ASF GitHub Bot
Created on: 27/May/20 14:59
Start Date: 27/May/20 14:59
Worklog Time Spent: 10m 
  Work Description: aromanenko-dev merged pull request #11357:
URL: https://github.com/apache/beam/pull/11357


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437831)
Time Spent: 1.5h  (was: 1h 20m)

> Add ReadAll transform for SolrIO
> 
>
> Key: BEAM-7770
> URL: https://issues.apache.org/jira/browse/BEAM-7770
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-solr
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: P3
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> SolrIO already uses a composable approach internally, but we need to expose 
> an explicit ReadAll transform that allows users to create reads in the 
> middle of the pipeline to improve composability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10107) beam website PR listed twice in release guide with contradictory instructions

2020-05-27 Thread Kyle Weaver (Jira)
Kyle Weaver created BEAM-10107:
--

 Summary: beam website PR listed twice in release guide with 
contradictory instructions
 Key: BEAM-10107
 URL: https://issues.apache.org/jira/browse/BEAM-10107
 Project: Beam
  Issue Type: Improvement
  Components: website
Reporter: Kyle Weaver
Assignee: Kyle Weaver


The Beam website update PR is mentioned twice in the release guide: once in 
step 5 with the new instructions and again in step 6 with the old instructions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10107) beam website PR listed twice in release guide with contradictory instructions

2020-05-27 Thread Kenneth Knowles (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-10107:
---
Status: Open  (was: Triage Needed)

> beam website PR listed twice in release guide with contradictory instructions
> -
>
> Key: BEAM-10107
> URL: https://issues.apache.org/jira/browse/BEAM-10107
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Kyle Weaver
>Assignee: Kyle Weaver
>Priority: P2
>
> The Beam website update PR is mentioned twice in the release guide: once in 
> step 5 with the new instructions and again in step 6 with the old instructions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10108) publish_docker_images.sh has out of date Flink versions

2020-05-27 Thread Kenneth Knowles (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-10108:
---
Status: Open  (was: Triage Needed)

> publish_docker_images.sh has out of date Flink versions
> ---
>
> Key: BEAM-10108
> URL: https://issues.apache.org/jira/browse/BEAM-10108
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Reporter: Kyle Weaver
>Assignee: Kyle Weaver
>Priority: P2
>
> The script currently lists Flink versions 1.7, 1.8, and 1.9; it should list 
> 1.8, 1.9, and 1.10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-5863) Automate Community Metrics infrastructure deployment

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-5863?focusedWorklogId=437841&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437841
 ]

ASF GitHub Bot logged work on BEAM-5863:


Author: ASF GitHub Bot
Created on: 27/May/20 15:14
Start Date: 27/May/20 15:14
Worklog Time Spent: 10m 
  Work Description: kamilwu commented on a change in pull request #11816:
URL: https://github.com/apache/beam/pull/11816#discussion_r431219888



##
File path: .test-infra/metrics/build.gradle
##
@@ -50,9 +51,49 @@ dockerCompose {
 
 dockerCompose.isRequiredBy(testMetricsStack)
 
-task preCommit { dependsOn testMetricsStack }
+task validateConfiguration(type: Exec) {
+  commandLine 'sh', '-c', 'kubectl apply --dry-run=true -Rf kubernetes'
+}
+
+task preCommit {
+  dependsOn validateConfiguration
+  dependsOn testMetricsStack
+}
+
+task buildAndPublishContainers(type: Exec) {
+  commandLine './build_and_publish_containers.sh', 'true'
+}
+
+// Applies new configuration to all resources labeled with `app=beammetrics`
+// and forces Kubernetes to re-pull images.
+task applyConfiguration() {
+  doLast {
+    assert grgit : 'Cannot use outside of git repository'
+
+    def git = grgit.open()
+    def commitedChanges = git.log(paths: ['.test-infra/metrics']).findAll {
+      it.dateTime > ZonedDateTime.now().minusHours(6)

Review comment:
   To be honest, I don't know if this is possible. We're using the ghprb 
plugin to integrate GitHub with Jenkins, and comments on this issue 
(https://github.com/jenkinsci/ghprb-plugin/issues/651) claim that this feature 
(triggering on merges to master) is outside the scope of the plugin. That's 
why I decided to create a cron job (Website_Publish is based on a cron job 
too).
   
   Do you want me to continue searching for a trigger-on-merge solution? 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437841)
Time Spent: 40m  (was: 0.5h)

> Automate Community Metrics infrastructure deployment
> 
>
> Key: BEAM-5863
> URL: https://issues.apache.org/jira/browse/BEAM-5863
> Project: Beam
>  Issue Type: Sub-task
>  Components: community-metrics, project-management
>Reporter: Scott Wegner
>Assignee: Kamil Wasilewski
>Priority: P3
>  Labels: community-metrics
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently the deployment process for the production Community Metrics stack 
> is manual (documented 
> [here|https://cwiki.apache.org/confluence/display/BEAM/Community+Metrics]). 
> If we end up having to deploy more than a few times a year, it would be nice 
> to automate these steps.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10108) publish_docker_images.sh has out of date Flink versions

2020-05-27 Thread Kyle Weaver (Jira)
Kyle Weaver created BEAM-10108:
--

 Summary: publish_docker_images.sh has out of date Flink versions
 Key: BEAM-10108
 URL: https://issues.apache.org/jira/browse/BEAM-10108
 Project: Beam
  Issue Type: Bug
  Components: build-system
Reporter: Kyle Weaver
Assignee: Kyle Weaver


The script currently lists Flink versions 1.7, 1.8, and 1.9; it should list 
1.8, 1.9, and 1.10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-10102) Fix query in python pubsub IO streaming performance tests dashboards

2020-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10102?focusedWorklogId=437843&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437843
 ]

ASF GitHub Bot logged work on BEAM-10102:
-

Author: ASF GitHub Bot
Created on: 27/May/20 15:16
Start Date: 27/May/20 15:16
Worklog Time Spent: 10m 
  Work Description: kamilwu merged pull request #11827:
URL: https://github.com/apache/beam/pull/11827


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 437843)
Time Spent: 0.5h  (was: 20m)

> Fix query in python pubsub IO streaming performance tests dashboards
> 
>
> Key: BEAM-10102
> URL: https://issues.apache.org/jira/browse/BEAM-10102
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: Not applicable
>Reporter: Piotr Szuberski
>Assignee: Piotr Szuberski
>Priority: P1
> Fix For: Not applicable
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> There is
> {{"metric" = \"pubsub_io_perf_read_runtime\"}}
> instead of
> {{"metric" = 'pubsub_io_perf_read_runtime'}}
> in the grafana dashboard JSON. In SQL, double quotes denote identifiers 
> rather than string literals, so the double-quoted value never matches the 
> metric name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

