[jira] [Work logged] (BEAM-7916) Change ElasticsearchIO query parameter to be a ValueProvider

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7916?focusedWorklogId=290305&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290305
 ]

ASF GitHub Bot logged work on BEAM-7916:


Author: ASF GitHub Bot
Created on: 07/Aug/19 07:35
Start Date: 07/Aug/19 07:35
Worklog Time Spent: 10m 
  Work Description: oliverhenlich commented on pull request #9285: 
[BEAM-7916] - Change ElasticsearchIO query parameter to be a ValueProvider
URL: https://github.com/apache/beam/pull/9285
 
 
   We need to be able to perform Elasticsearch queries that are dynamic. The 
problem is that `ElasticsearchIO.read().withQuery()` only accepts a string, which 
means the query must be known when the pipeline/Google Dataflow Template is 
built. 
   
   This change converts the `withQuery()` parameter from `String` to 
`ValueProvider`.
   
   R: @jbonofre 
   R: @echauchot 
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [x] @jbonofre, @echauchot 
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
   
   Post-Commit Tests Status (on master branch)
   

   

[jira] [Updated] (BEAM-7916) Change ElasticsearchIO query parameter to be a ValueProvider

2019-08-07 Thread Oliver Henlich (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oliver Henlich updated BEAM-7916:
-
Description: 
We need to be able to perform Elasticsearch queries that are dynamic. The 
problem is that {{ElasticsearchIO.read().withQuery()}} only accepts a string, which 
means the query must be known when the pipeline/Google Dataflow Template is 
built.

It would be great if we could change the parameter on the {{withQuery()}} 
method from {{String}} to {{ValueProvider}}.

Pull request: https://github.com/apache/beam/pull/9285
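The motivation is the usual template pattern: a plain string is evaluated when the 
template is built, while a ValueProvider defers evaluation until the pipeline 
actually runs. A minimal Python sketch of that distinction follows; the class and 
function names here are illustrative only and differ from Beam's real 
ValueProvider API:

```python
class StaticValueProvider:
    """Wraps a value already known at pipeline-construction time."""
    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value


class RuntimeValueProvider:
    """Defers resolution until runtime options are supplied."""
    def __init__(self, option_name, runtime_options):
        self._option_name = option_name
        self._runtime_options = runtime_options  # filled in at job-submission time

    def get(self):
        return self._runtime_options[self._option_name]


def read_from_elasticsearch(query_provider):
    # The transform holds the provider and only calls get() at execution time,
    # so a template can be built before the query is known.
    return "executing query: " + query_provider.get()


# Construction time: the query is unknown, but the "template" still builds.
opts = {}
provider = RuntimeValueProvider("query", opts)
# Submission time: the option is supplied, and get() resolves it.
opts["query"] = '{"match_all": {}}'
print(read_from_elasticsearch(provider))
```

This is why a `String` parameter cannot work for templates: it forces the query to 
exist before submission-time options are available.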

  was:
We need to be able to perform Elasticsearch queries that are dynamic. The 
problem is that {{ElasticsearchIO.read().withQuery()}} only accepts a string, which 
means the query must be known when the pipeline/Google Dataflow Template is 
built.

It would be great if we could change the parameter on the {{withQuery()}} 
method from {{String}} to {{ValueProvider}}.


> Change ElasticsearchIO query parameter to be a ValueProvider
> 
>
> Key: BEAM-7916
> URL: https://issues.apache.org/jira/browse/BEAM-7916
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Affects Versions: 2.14.0
>Reporter: Oliver Henlich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We need to be able to perform Elasticsearch queries that are dynamic. The 
> problem is that {{ElasticsearchIO.read().withQuery()}} only accepts a string, which 
> means the query must be known when the pipeline/Google Dataflow Template is 
> built.
> It would be great if we could change the parameter on the {{withQuery()}} 
> method from {{String}} to {{ValueProvider}}.
> Pull request: https://github.com/apache/beam/pull/9285



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7839) Distinguish unknown and unrecognized states in Dataflow runner.

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7839?focusedWorklogId=290314&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290314
 ]

ASF GitHub Bot logged work on BEAM-7839:


Author: ASF GitHub Bot
Created on: 07/Aug/19 09:01
Start Date: 07/Aug/19 09:01
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on issue #9132: [BEAM-7839] Default 
to UNRECOGNIZED state when a state cannot be accurately interpreted by the SDK.
URL: https://github.com/apache/beam/pull/9132#issuecomment-519009560
 
 
   R: @kennknowles (once you are back).
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290314)
Time Spent: 0.5h  (was: 20m)

> Distinguish unknown and unrecognized states in Dataflow runner.
> ---
>
> Key: BEAM-7839
> URL: https://issues.apache.org/jira/browse/BEAM-7839
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Reporter: Valentyn Tymofieiev
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In BEAM-7766 [~kenn] mentioned that when reasoning about pipeline state, the 
> Dataflow runner should distinguish between a pipeline state that is unknown 
> to the service and one that is unrecognized by the SDK. An unrecognized state 
> may happen when the service API introduces a new state that older versions of 
> the SDK will not know about. Filing this issue to track introducing that 
> distinction.
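The distinction above can be sketched in a few lines. This is a hypothetical
model, not the Dataflow runner's actual code; the state names and the
`interpret_state` helper are assumptions for illustration:

```python
from enum import Enum


class PipelineState(Enum):
    RUNNING = "RUNNING"
    DONE = "DONE"
    UNKNOWN = "UNKNOWN"            # the service reported no state at all
    UNRECOGNIZED = "UNRECOGNIZED"  # the service reported a state this SDK doesn't know


def interpret_state(service_state):
    # None or empty means the service itself has no state for the job.
    if not service_state:
        return PipelineState.UNKNOWN
    try:
        return PipelineState(service_state)
    except ValueError:
        # A newer service API may return states an older SDK has never heard of.
        return PipelineState.UNRECOGNIZED


print(interpret_state("RUNNING"))      # PipelineState.RUNNING
print(interpret_state(None))           # PipelineState.UNKNOWN
print(interpret_state("DRAINING_V2"))  # PipelineState.UNRECOGNIZED
```

Keeping the two cases separate means "the service doesn't know" and "the SDK
doesn't understand" can be handled differently downstream.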





[jira] [Resolved] (BEAM-7838) LTS backport: Dataflow runner should default to PipelineState.UNKNOWN when job state received via v1beta3 cannot be recognized.

2019-08-07 Thread Valentyn Tymofieiev (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentyn Tymofieiev resolved BEAM-7838.
---
Resolution: Fixed

> LTS backport: Dataflow runner should default to PipelineState.UNKNOWN when 
> job state received via v1beta3 cannot be recognized.
> ---
>
> Key: BEAM-7838
> URL: https://issues.apache.org/jira/browse/BEAM-7838
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.1.0
>Reporter: Valentyn Tymofieiev
>Assignee: Valentyn Tymofieiev
>Priority: Minor
> Fix For: 2.7.1
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>






[jira] [Resolved] (BEAM-5490) Beam should not irreversibly modify inspect.getargspec

2019-08-07 Thread Valentyn Tymofieiev (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentyn Tymofieiev resolved BEAM-5490.
---
   Resolution: Fixed
Fix Version/s: 2.9.0

Closing this, as the modifications to inspect.getargspec have been reverted since 2.9.0.

> Beam should not irreversibly modify inspect.getargspec
> --
>
> Key: BEAM-5490
> URL: https://issues.apache.org/jira/browse/BEAM-5490
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Chuan Yu Foo
>Assignee: Valentyn Tymofieiev
>Priority: Major
> Fix For: 2.9.0
>
>
> In the Python SDK, in {{typehints/decorators.py}}, Beam irreversibly 
> monkey-patches {{inspect.getargspec}} for its own purposes. This is really 
> bad form, since it leads to hard-to-debug issues in other modules that also 
> use {{inspect.getargspec}} when Beam is imported. 
> Beam should either maintain its own modified copy of the functions it needs, 
> or else it should always reverse the monkey-patching so that it does not 
> affect other modules.





[jira] [Updated] (BEAM-5787) test_translate_pattern fails on Python 3.7

2019-08-07 Thread Valentyn Tymofieiev (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentyn Tymofieiev updated BEAM-5787:
--
Issue Type: Sub-task  (was: Bug)
Parent: BEAM-6984

> test_translate_pattern fails on Python 3.7
> --
>
> Key: BEAM-5787
> URL: https://issues.apache.org/jira/browse/BEAM-5787
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Affects Versions: 2.8.0
>Reporter: Joar Wandborg
>Assignee: Valentyn Tymofieiev
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Originally reported by michel on slack: 
> [https://the-asf.slack.com/archives/CBDNLQZM1/p1539697967000100]
> Cause:
> {quote}[https://docs.python.org/3/whatsnew/3.7.html]:
>  Change re.escape() to only escape regex special characters instead of
>  escaping all characters other than ASCII letters, numbers, and '_'.
>  (Contributed by Serhiy Storchaka in bpo-29995.)
> {quote}
>  
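The quoted Python 3.7 change can be observed directly. A short sketch (run under
Python 3.7 or later; on 3.6 the first assertion would fail because characters
like ':' and '/' were escaped too):

```python
import re

# Python >= 3.7: only regex special characters are escaped.
print(re.escape("http://host/path"))  # http://host/path  (':' and '/' untouched)
print(re.escape("a.b*c"))             # a\.b\*c

# Code that compared against the old, fully-escaped form breaks on 3.7,
# which is the kind of mismatch behind test_translate_pattern failing.
assert re.escape("http://host") == "http://host"
```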





[jira] [Resolved] (BEAM-5787) test_translate_pattern fails on Python 3.7

2019-08-07 Thread Valentyn Tymofieiev (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentyn Tymofieiev resolved BEAM-5787.
---
   Resolution: Fixed
Fix Version/s: Not applicable

> test_translate_pattern fails on Python 3.7
> --
>
> Key: BEAM-5787
> URL: https://issues.apache.org/jira/browse/BEAM-5787
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Affects Versions: 2.8.0
>Reporter: Joar Wandborg
>Assignee: Valentyn Tymofieiev
>Priority: Major
> Fix For: Not applicable
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Originally reported by michel on slack: 
> [https://the-asf.slack.com/archives/CBDNLQZM1/p1539697967000100]
> Cause:
> {quote}[https://docs.python.org/3/whatsnew/3.7.html]:
>  Change re.escape() to only escape regex special characters instead of
>  escaping all characters other than ASCII letters, numbers, and '_'.
>  (Contributed by Serhiy Storchaka in bpo-29995.)
> {quote}
>  





[jira] [Updated] (BEAM-6923) OOM errors in jobServer when using GCS artifactDir

2019-08-07 Thread Lukasz Gajowy (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Gajowy updated BEAM-6923:

Attachment: (was: Screenshot 2019-08-07 at 11.42.52.png)

> OOM errors in jobServer when using GCS artifactDir
> --
>
> Key: BEAM-6923
> URL: https://issues.apache.org/jira/browse/BEAM-6923
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-harness
>Reporter: Lukasz Gajowy
>Priority: Major
> Attachments: Instance counts.png, Paths to GC root.png, 
> Telemetries.png
>
>
> When starting jobServer with artifactDir pointing to a GCS bucket: 
> {code:java}
> ./gradlew :beam-runners-flink_2.11-job-server:runShadow 
> -PflinkMasterUrl=localhost:8081 -PartifactsDir=gs://the-bucket{code}
> and running a Java portable pipeline with the following, portability related 
> pipeline options: 
> {code:java}
> --runner=PortableRunner --jobEndpoint=localhost:8099 
> --defaultEnvironmentType=DOCKER 
> --defaultEnvironmentConfig=gcr.io//java:latest'{code}
>  
> I'm facing a series of OOM errors, like this: 
> {code:java}
> Exception in thread "grpc-default-executor-3" java.lang.OutOfMemoryError: 
> Java heap space
> at 
> com.google.api.client.googleapis.media.MediaHttpUploader.buildContentChunk(MediaHttpUploader.java:606)
> at 
> com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:408)
> at 
> com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
> at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:508)
> at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
> at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:549)
> at 
> com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:301)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745){code}
>  
> This does not happen when I'm using a local filesystem for the artifact 
> staging location. 
>  





[jira] [Updated] (BEAM-6923) OOM errors in jobServer when using GCS artifactDir

2019-08-07 Thread Lukasz Gajowy (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Gajowy updated BEAM-6923:

Attachment: Screenshot 2019-08-07 at 11.42.52.png

> OOM errors in jobServer when using GCS artifactDir
> --
>
> Key: BEAM-6923
> URL: https://issues.apache.org/jira/browse/BEAM-6923
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-harness
>Reporter: Lukasz Gajowy
>Priority: Major
> Attachments: Instance counts.png, Paths to GC root.png, 
> Telemetries.png
>
>
> When starting jobServer with artifactDir pointing to a GCS bucket: 
> {code:java}
> ./gradlew :beam-runners-flink_2.11-job-server:runShadow 
> -PflinkMasterUrl=localhost:8081 -PartifactsDir=gs://the-bucket{code}
> and running a Java portable pipeline with the following, portability related 
> pipeline options: 
> {code:java}
> --runner=PortableRunner --jobEndpoint=localhost:8099 
> --defaultEnvironmentType=DOCKER 
> --defaultEnvironmentConfig=gcr.io//java:latest'{code}
>  
> I'm facing a series of OOM errors, like this: 
> {code:java}
> Exception in thread "grpc-default-executor-3" java.lang.OutOfMemoryError: 
> Java heap space
> at 
> com.google.api.client.googleapis.media.MediaHttpUploader.buildContentChunk(MediaHttpUploader.java:606)
> at 
> com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:408)
> at 
> com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
> at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:508)
> at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
> at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:549)
> at 
> com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:301)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745){code}
>  
> This does not happen when I'm using a local filesystem for the artifact 
> staging location. 
>  





[jira] [Updated] (BEAM-6923) OOM errors in jobServer when using GCS artifactDir

2019-08-07 Thread Lukasz Gajowy (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukasz Gajowy updated BEAM-6923:

Attachment: heapdump size-sorted.png

> OOM errors in jobServer when using GCS artifactDir
> --
>
> Key: BEAM-6923
> URL: https://issues.apache.org/jira/browse/BEAM-6923
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-harness
>Reporter: Lukasz Gajowy
>Priority: Major
> Attachments: Instance counts.png, Paths to GC root.png, 
> Telemetries.png, heapdump size-sorted.png
>
>
> When starting jobServer with artifactDir pointing to a GCS bucket: 
> {code:java}
> ./gradlew :beam-runners-flink_2.11-job-server:runShadow 
> -PflinkMasterUrl=localhost:8081 -PartifactsDir=gs://the-bucket{code}
> and running a Java portable pipeline with the following, portability related 
> pipeline options: 
> {code:java}
> --runner=PortableRunner --jobEndpoint=localhost:8099 
> --defaultEnvironmentType=DOCKER 
> --defaultEnvironmentConfig=gcr.io//java:latest'{code}
>  
> I'm facing a series of OOM errors, like this: 
> {code:java}
> Exception in thread "grpc-default-executor-3" java.lang.OutOfMemoryError: 
> Java heap space
> at 
> com.google.api.client.googleapis.media.MediaHttpUploader.buildContentChunk(MediaHttpUploader.java:606)
> at 
> com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:408)
> at 
> com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
> at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:508)
> at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
> at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:549)
> at 
> com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:301)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745){code}
>  
> This does not happen when I'm using a local filesystem for the artifact 
> staging location. 
>  





[jira] [Commented] (BEAM-6923) OOM errors in jobServer when using GCS artifactDir

2019-08-07 Thread Lukasz Gajowy (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901955#comment-16901955
 ] 

Lukasz Gajowy commented on BEAM-6923:
-

[~angoenka] I added a new screenshot with a size-sorted heap dump. However, for 
future investigation, I think screenshots are not the best way here. The issue 
can be easily reproduced, and you can add:
{code:java}
"-XX:+HeapDumpOnOutOfMemoryError"{code}
to the JVM options in the 
[flink_job_server.gradle|https://github.com/apache/beam/blob/ff0f308fb83056bd2ba990de2edec33a0c6c7720/runners/flink/job-server/flink_job_server.gradle#L112]
 file. It will generate a full heap dump once the error occurs. Once you 
have that, you can use the jvisualvm tool from the JDK to investigate (I just 
learned that! :) ). See more 
[here|https://docs.oracle.com/javase/7/docs/webnotes/tsg/TSG-VM/html/clopts.html#gbzrr]

 

As for the pipelines - I encountered the error while running performance tests 
of core beam operations. I tried 
[GroupByKeyLoadTest|https://github.com/apache/beam/blob/e0cbd2aa7e371c75511544ab78075d54f3f086ca/sdks/java/testing/load-tests/src/main/java/org/apache/beam/sdk/loadtests/GroupByKeyLoadTest.java]
 and 
[ParDoLoadTest|https://github.com/apache/beam/blob/e0cbd2aa7e371c75511544ab78075d54f3f086ca/sdks/java/testing/load-tests/src/main/java/org/apache/beam/sdk/loadtests/ParDoLoadTest.java].
 Example command:
{code:java}
./gradlew --continue --max-workers=12 -Dorg.gradle.jvmargs=-Xms2g 
-Dorg.gradle.jvmargs=-Xmx4g 
-PloadTest.mainClass=org.apache.beam.sdk.loadtests.ParDoLoadTest 
'-PloadTest.args=--sourceOptions={"numRecords":100,"keySizeBytes":1,"valueSizeBytes":9}
 --iterations=1 --runner=PortableRunner --jobEndpoint=localhost:8099 
--defaultEnvironmentType=DOCKER 
--defaultEnvironmentConfig=gcr.io/apache-beam-testing/beam/java:latest' 
:beam-sdks-java-load-tests:run -Prunner=":runners:reference:java{code}
As far as I understand, [~marcelo.castro] is running something completely 
different on Spark and still gets the error.

 

> OOM errors in jobServer when using GCS artifactDir
> --
>
> Key: BEAM-6923
> URL: https://issues.apache.org/jira/browse/BEAM-6923
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-harness
>Reporter: Lukasz Gajowy
>Priority: Major
> Attachments: Instance counts.png, Paths to GC root.png, 
> Telemetries.png, heapdump size-sorted.png
>
>
> When starting jobServer with artifactDir pointing to a GCS bucket: 
> {code:java}
> ./gradlew :beam-runners-flink_2.11-job-server:runShadow 
> -PflinkMasterUrl=localhost:8081 -PartifactsDir=gs://the-bucket{code}
> and running a Java portable pipeline with the following, portability related 
> pipeline options: 
> {code:java}
> --runner=PortableRunner --jobEndpoint=localhost:8099 
> --defaultEnvironmentType=DOCKER 
> --defaultEnvironmentConfig=gcr.io//java:latest'{code}
>  
> I'm facing a series of OOM errors, like this: 
> {code:java}
> Exception in thread "grpc-default-executor-3" java.lang.OutOfMemoryError: 
> Java heap space
> at 
> com.google.api.client.googleapis.media.MediaHttpUploader.buildContentChunk(MediaHttpUploader.java:606)
> at 
> com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:408)
> at 
> com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
> at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:508)
> at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
> at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:549)
> at 
> com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:301)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745){code}
>  
> This does not happen when I'm using a local filesystem for the artifact 
> staging location. 
>  





[jira] [Work logged] (BEAM-7613) HadoopFileSystem can be only used with fs.defaultFS

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7613?focusedWorklogId=290354&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290354
 ]

ASF GitHub Bot logged work on BEAM-7613:


Author: ASF GitHub Bot
Created on: 07/Aug/19 10:32
Start Date: 07/Aug/19 10:32
Worklog Time Spent: 10m 
  Work Description: dmvk commented on issue #8923: [BEAM-7613] 
HadoopFileSystem can work with more than one cluster.
URL: https://github.com/apache/beam/pull/8923#issuecomment-519040361
 
 
   @tvalentyn this is still valid, will take a look at it later this week
 



Issue Time Tracking
---

Worklog Id: (was: 290354)
Time Spent: 1h 50m  (was: 1h 40m)

> HadoopFileSystem can be only used with fs.defaultFS
> ---
>
> Key: BEAM-7613
> URL: https://issues.apache.org/jira/browse/BEAM-7613
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop-file-system
>Affects Versions: 2.13.0
>Reporter: David Moravek
>Assignee: David Moravek
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> _HadoopFileSystem_ creates the underlying _FileSystem_ instance (the one from 
> org.apache.hadoop) during its construction. A single _FileSystem_ 
> instance is tied to a particular cluster (scheme + authority pair). In case 
> we want to talk to another cluster, this fails due to _FileSystem#checkPath_.
>  
> This can be fixed by using _FileSystem#get(java.net.URI, 
> org.apache.hadoop.conf.Configuration)_ instead of 
> _FileSystem#newInstance(org.apache.hadoop.conf.Configuration)_.





[jira] [Work logged] (BEAM-7819) PubsubMessage message parsing is lacking non-attribute fields

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7819?focusedWorklogId=290360&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290360
 ]

ASF GitHub Bot logged work on BEAM-7819:


Author: ASF GitHub Bot
Created on: 07/Aug/19 10:44
Start Date: 07/Aug/19 10:44
Worklog Time Spent: 10m 
  Work Description: matt-darwin commented on issue #9232: [BEAM-7819] 
Python - parse PubSub message_id into attributes property
URL: https://github.com/apache/beam/pull/9232#issuecomment-519043709
 
 
   > Note that in Dataflow, the Python SDK uses FnAPI to read from Pub/Sub 
using a Java harness. In other words, the message id might be missing because 
Java code doesn't provide it somehow. I'll look into it later.
   > 
   I'm afraid my Java is very weak; much worse 
   
   
   > When running locally using DirectRunner, there's a different 
implementation that replaces `ReadFromPubSub` entirely, which probably needs 
modification as well. See:
   > 
https://github.com/apache/beam/blob/02267999542c2ae0f52d217faa14b32c93b8787d/sdks/python/apache_beam/runners/direct/transform_evaluator.py#L380
   
   
   
   > FYI, there is #8370 which is attempting to add the message id to the Beam 
Java SDK.
   
   I'll have a look at building the runner from this change to see if I can 
then access the message_id from the dataflow runner.
 



Issue Time Tracking
---

Worklog Id: (was: 290360)
Time Spent: 1h 20m  (was: 1h 10m)

> PubsubMessage message parsing is lacking non-attribute fields
> -
>
> Key: BEAM-7819
> URL: https://issues.apache.org/jira/browse/BEAM-7819
> Project: Beam
>  Issue Type: Bug
>  Components: io-python-gcp
>Reporter: Ahmet Altay
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> User reported issue: 
> https://lists.apache.org/thread.html/139b0c15abc6471a2e2202d76d915c645a529a23ecc32cd9cfecd315@%3Cuser.beam.apache.org%3E
> """
> Looking at the source code, with my untrained python eyes, I think if the 
> intention is to include the message id and the publish time in the attributes 
> attribute of the PubSubMessage type, then the protobuf mapping is missing 
> something:-
> @staticmethod
> def _from_proto_str(proto_msg):
>     """Construct from serialized form of ``PubsubMessage``.
>     Args:
>       proto_msg: String containing a serialized protobuf of type
>       https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage
>     Returns:
>       A new PubsubMessage object.
>     """
>     msg = pubsub.types.pubsub_pb2.PubsubMessage()
>     msg.ParseFromString(proto_msg)
>     # Convert ScalarMapContainer to dict.
>     attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
>     return PubsubMessage(msg.data, attributes)
> The protobuf definition is here:-
> https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage
> and so it looks as if the message_id and publish_time are not being parsed as 
> they are separate from the attributes. Perhaps the PubsubMessage class needs 
> expanding to include these as attributes, or they would need adding to the 
> dictionary for attributes. This would only need doing for the _from_proto_str 
> as obviously they would not need to be populated when transmitting a message 
> to PubSub.
> My python is not great, I'm assuming the latter option would need to look 
> something like this?
> attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
> attributes.update({'message_id': msg.message_id, 'publish_time': 
> msg.publish_time})
> return PubsubMessage(msg.data, attributes)
> """
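The reporter's suggested fix can be sketched with a stand-in message object. The 
real code parses a pubsub_pb2 protobuf; the `from_proto` helper and the 
SimpleNamespace stand-in here are illustrative assumptions, not Beam's actual 
implementation:

```python
from types import SimpleNamespace


class PubsubMessage:
    """Minimal stand-in for apache_beam.io.gcp.pubsub.PubsubMessage."""
    def __init__(self, data, attributes):
        self.data = data
        self.attributes = attributes


def from_proto(msg):
    # Copy the attribute map, then fold the top-level protobuf fields in,
    # as the reporter suggests, so downstream code can read them uniformly.
    attributes = dict(msg.attributes)
    attributes.update({
        'message_id': msg.message_id,
        'publish_time': msg.publish_time,
    })
    return PubsubMessage(msg.data, attributes)


# Stand-in for a parsed protobuf message (fields per the Pub/Sub v1 schema).
proto = SimpleNamespace(
    data=b'payload',
    attributes={'k': 'v'},
    message_id='1234',
    publish_time='2019-08-07T10:44:00Z',
)
m = from_proto(proto)
print(m.attributes['message_id'])  # 1234
```

As the reporter notes, this folding would only be needed when reading, since the 
fields are server-populated and not set when publishing.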





[jira] [Work logged] (BEAM-7819) PubsubMessage message parsing is lacking non-attribute fields

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7819?focusedWorklogId=290361&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290361
 ]

ASF GitHub Bot logged work on BEAM-7819:


Author: ASF GitHub Bot
Created on: 07/Aug/19 10:47
Start Date: 07/Aug/19 10:47
Worklog Time Spent: 10m 
  Work Description: matt-darwin commented on issue #9232: [BEAM-7819] 
Python - parse PubSub message_id into attributes property
URL: https://github.com/apache/beam/pull/9232#issuecomment-519044616
 
 
   > Note that in Dataflow, the Python SDK uses FnAPI to read from Pub/Sub 
using a Java harness. In other words, the message id might be missing because 
Java code doesn't provide it somehow. I'll look into it later.
   > 
   > When running locally using DirectRunner, there's a different 
implementation that replaces `ReadFromPubSub` entirely, which probably needs 
modification as well. See:
   > 
https://github.com/apache/beam/blob/02267999542c2ae0f52d217faa14b32c93b8787d/sdks/python/apache_beam/runners/direct/transform_evaluator.py#L380
   
   The DirectRunner has been working OK with the above changes; I think the 
issue is on the Dataflow runner side.  The DirectRunner uses the 
_from_message method, and this is parsing correctly and returning the Pub/Sub 
message id in my testing so far.
   
```
def _read_from_pubsub(self, timestamp_attribute):
  from apache_beam.io.gcp.pubsub import PubsubMessage
  from google.cloud import pubsub

  def _get_element(message):
    parsed_message = PubsubMessage._from_message(message)
```
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290361)
Time Spent: 1.5h  (was: 1h 20m)

> PubsubMessage message parsing is lacking non-attribute fields
> -
>
> Key: BEAM-7819
> URL: https://issues.apache.org/jira/browse/BEAM-7819
> Project: Beam
>  Issue Type: Bug
>  Components: io-python-gcp
>Reporter: Ahmet Altay
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> User reported issue: 
> https://lists.apache.org/thread.html/139b0c15abc6471a2e2202d76d915c645a529a23ecc32cd9cfecd315@%3Cuser.beam.apache.org%3E





[jira] [Work logged] (BEAM-7819) PubsubMessage message parsing is lacking non-attribute fields

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7819?focusedWorklogId=290363&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290363
 ]

ASF GitHub Bot logged work on BEAM-7819:


Author: ASF GitHub Bot
Created on: 07/Aug/19 10:48
Start Date: 07/Aug/19 10:48
Worklog Time Spent: 10m 
  Work Description: matt-darwin commented on issue #9232: [BEAM-7819] 
Python - parse PubSub message_id into attributes property
URL: https://github.com/apache/beam/pull/9232#issuecomment-519043709
 
 
   > FYI, there is #8370 which is attempting to add the message id to the Beam 
Java SDK.
   
   I'll have a look at building the runner from this change to see if I can 
then access the message_id from the Dataflow runner.  I'm not adept at Java, 
however, so I wouldn't be much help there.
 



Issue Time Tracking
---

Worklog Id: (was: 290363)
Time Spent: 1h 40m  (was: 1.5h)

> PubsubMessage message parsing is lacking non-attribute fields
> -
>
> Key: BEAM-7819
> URL: https://issues.apache.org/jira/browse/BEAM-7819
> Project: Beam
>  Issue Type: Bug
>  Components: io-python-gcp
>Reporter: Ahmet Altay
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> User reported issue: 
> https://lists.apache.org/thread.html/139b0c15abc6471a2e2202d76d915c645a529a23ecc32cd9cfecd315@%3Cuser.beam.apache.org%3E





[jira] [Work logged] (BEAM-5980) Add load tests for Core Apache Beam operations

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5980?focusedWorklogId=290365&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290365
 ]

ASF GitHub Bot logged work on BEAM-5980:


Author: ASF GitHub Bot
Created on: 07/Aug/19 10:49
Start Date: 07/Aug/19 10:49
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #9286: [BEAM-5980] Remove 
redundant combine tests
URL: https://github.com/apache/beam/pull/9286#issuecomment-519045299
 
 
   Run seed job
 



Issue Time Tracking
---

Worklog Id: (was: 290365)
Time Spent: 20m  (was: 10m)

> Add load tests for Core Apache Beam operations 
> ---
>
> Key: BEAM-5980
> URL: https://issues.apache.org/jira/browse/BEAM-5980
> Project: Beam
>  Issue Type: New Feature
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This involves adding a suite of load tests described in this proposal: 
> [https://s.apache.org/load-test-basic-operations]





[jira] [Work logged] (BEAM-5980) Add load tests for Core Apache Beam operations

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5980?focusedWorklogId=290364&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290364
 ]

ASF GitHub Bot logged work on BEAM-5980:


Author: ASF GitHub Bot
Created on: 07/Aug/19 10:49
Start Date: 07/Aug/19 10:49
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on pull request #9286: [BEAM-5980] 
Remove redundant combine tests
URL: https://github.com/apache/beam/pull/9286
 
 
   The size of input records is irrelevant to the combine operation because the 
byte[] value is mapped to a Long before the Combine actually happens. This 
makes the tests redundant, as record size is the only thing that varies in the 
synthetic source options given to the input. 
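
The redundancy argument can be illustrated with a toy reduction (plain Python, not the actual load-test code): once every record is mapped to the same fixed-size value, payload size cannot influence the combine's input or output.

```python
def combine_counts(records):
    # Each record is reduced to the same value regardless of its byte size,
    # mirroring the byte[] -> Long mapping described above.
    return sum(1 for _ in records)

small = combine_counts([b'a'] * 3)       # three tiny records
big = combine_counts([b'x' * 1000] * 3)  # three large records, same result
```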
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   Post-Commit Tests Status (on master branch)
   

   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)
   Python | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://builds.

[jira] [Work logged] (BEAM-7912) Optimize GroupIntoBatches for batch Dataflow pipelines

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7912?focusedWorklogId=290373&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290373
 ]

ASF GitHub Bot logged work on BEAM-7912:


Author: ASF GitHub Bot
Created on: 07/Aug/19 10:56
Start Date: 07/Aug/19 10:56
Worklog Time Spent: 10m 
  Work Description: lukecwik commented on issue #9280: [BEAM-7912] Optimize 
GroupIntoBatches for batch Dataflow pipelines.
URL: https://github.com/apache/beam/pull/9280#issuecomment-519047288
 
 
   R: @boyuanzz @Ardagan 
 



Issue Time Tracking
---

Worklog Id: (was: 290373)
Time Spent: 40m  (was: 0.5h)

> Optimize GroupIntoBatches for batch Dataflow pipelines
> --
>
> Key: BEAM-7912
> URL: https://issues.apache.org/jira/browse/BEAM-7912
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Luke Cwik
>Assignee: Luke Cwik
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The GroupIntoBatches transform can be significantly optimized on Dataflow 
> since it always ensures that a key K appears in only one bundle after a 
> GroupByKey. This removes the usage of state and timers in the generic 
> GroupIntoBatches transform.
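
A minimal sketch (plain Python, not Dataflow's implementation) of the optimization described above: when a runner guarantees that all values for a key arrive in a single bundle after a GroupByKey, batching reduces to a local buffer flushed inline, with no state or timer APIs involved.

```python
def group_into_batches(grouped, batch_size):
    """grouped: (key, values) pairs as produced by a GroupByKey.
    Yields (key, batch) pairs with at most batch_size values per batch."""
    for key, values in grouped:
        batch = []
        for value in values:
            batch.append(value)
            if len(batch) == batch_size:
                yield key, batch
                batch = []
        if batch:
            # Flushing the remainder here is safe only because this key
            # never appears in another bundle.
            yield key, batch

batches = list(group_into_batches([('a', range(5)), ('b', range(2))], 2))
```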





[jira] [Work logged] (BEAM-5980) Add load tests for Core Apache Beam operations

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5980?focusedWorklogId=290376&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290376
 ]

ASF GitHub Bot logged work on BEAM-5980:


Author: ASF GitHub Bot
Created on: 07/Aug/19 11:00
Start Date: 07/Aug/19 11:00
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #9286: [BEAM-5980] Remove 
redundant combine tests
URL: https://github.com/apache/beam/pull/9286#issuecomment-519048539
 
 
   @pabloem @kkucharc could you take a look?
 



Issue Time Tracking
---

Worklog Id: (was: 290376)
Time Spent: 0.5h  (was: 20m)

> Add load tests for Core Apache Beam operations 
> ---
>
> Key: BEAM-5980
> URL: https://issues.apache.org/jira/browse/BEAM-5980
> Project: Beam
>  Issue Type: New Feature
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This involves adding a suite of load tests described in this proposal: 
> [https://s.apache.org/load-test-basic-operations]





[jira] [Commented] (BEAM-6923) OOM errors in jobServer when using GCS artifactDir

2019-08-07 Thread Marcelo Pio de Castro (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902012#comment-16902012
 ] 

Marcelo Pio de Castro commented on BEAM-6923:
-

Sorry for the late response.

[~iemejia], the library was guava-20, but at the time I was using a mismatched 
version of Beam and Spark (Spark was on an earlier version than the one Apache 
Beam depends on). I'll retest this with the newer version of Beam and get back 
to you.

As for the thread issue: I was uploading files that could reach 300+ GB in 
size using the AvroIO sink. At the time I had a very limited cluster of Spark 
workers, with about 4 GB of RAM. The problem is in the async HTTP upload that 
the GCS library uses.

I think an option should exist to use a sync method instead in limited-memory 
cluster scenarios. (If such an option exists, I couldn't find it.)

I solved my problem by doing the upload myself in a ParDo, uploading 
synchronously, instead of using the AvroIO sink. Even an async method can work 
with more limited RAM, but that is a problem with the Google library.
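
The synchronous-upload workaround described above can be sketched generically. The `upload_chunk` callable here is a placeholder: with GCS one might use a blocking call such as google-cloud-storage's `Blob.upload_from_string`, but the exact API used in the reporter's ParDo is not shown in the thread.

```python
def sync_upload_in_chunks(payload, chunk_size, upload_chunk):
    """Upload payload one chunk at a time with a blocking call, so only a
    single chunk is held in memory (unlike the async channel that OOMed)."""
    sizes = []
    for start in range(0, len(payload), chunk_size):
        chunk = payload[start:start + chunk_size]
        upload_chunk(chunk)       # blocks until this chunk is persisted
        sizes.append(len(chunk))
    return sizes

# Exercise the pattern with a no-op uploader.
sizes = sync_upload_in_chunks(b'x' * 10, chunk_size=4,
                              upload_chunk=lambda chunk: None)
```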

 

> OOM errors in jobServer when using GCS artifactDir
> --
>
> Key: BEAM-6923
> URL: https://issues.apache.org/jira/browse/BEAM-6923
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-harness
>Reporter: Lukasz Gajowy
>Priority: Major
> Attachments: Instance counts.png, Paths to GC root.png, 
> Telemetries.png, heapdump size-sorted.png
>
>
> When starting jobServer with artifactDir pointing to a GCS bucket: 
> {code:java}
> ./gradlew :beam-runners-flink_2.11-job-server:runShadow 
> -PflinkMasterUrl=localhost:8081 -PartifactsDir=gs://the-bucket{code}
> and running a Java portable pipeline with the following, portability related 
> pipeline options: 
> {code:java}
> --runner=PortableRunner --jobEndpoint=localhost:8099 
> --defaultEnvironmentType=DOCKER 
> --defaultEnvironmentConfig=gcr.io//java:latest'{code}
>  
> I'm facing a series of OOM errors, like this: 
> {code:java}
> Exception in thread "grpc-default-executor-3" java.lang.OutOfMemoryError: 
> Java heap space
> at 
> com.google.api.client.googleapis.media.MediaHttpUploader.buildContentChunk(MediaHttpUploader.java:606)
> at 
> com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:408)
> at 
> com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
> at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:508)
> at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
> at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:549)
> at 
> com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:301)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745){code}
>  
> This does not happen when I'm using a local filesystem for the artifact 
> staging location. 
>  





[jira] [Comment Edited] (BEAM-6923) OOM errors in jobServer when using GCS artifactDir

2019-08-07 Thread Marcelo Pio de Castro (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902012#comment-16902012
 ] 

Marcelo Pio de Castro edited comment on BEAM-6923 at 8/7/19 11:57 AM:
--

Sorry for the late response.

[~iemejia], the library was guava-20, but at the time I was using a mismatched 
version of Beam and Spark (Spark was on an earlier version than the one Apache 
Beam depends on). I'll retest this with the newer version of Beam and get back 
to you.

As for the thread issue: I was uploading files that could reach 300+ GB in 
size using the AvroIO sink. At the time I had a very limited cluster of Spark 
workers, with about 4 GB of RAM. The problem is in the async HTTP upload that 
the GCS library uses.

I think an option should exist to use a sync method instead in limited-memory 
cluster scenarios. (If such an option exists, I couldn't find it.)

I solved my problem by doing the upload myself in a ParDo, uploading 
synchronously, instead of using the AvroIO sink. Even an async method can work 
with more limited RAM, but that is a problem with the Google library.

 



> OOM errors in jobServer when using GCS artifactDir
> --
>
> Key: BEAM-6923
> URL: https://issues.apache.org/jira/browse/BEAM-6923
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-harness
>Reporter: Lukasz Gajowy
>Priority: Major
> Attachments: Instance counts.png, Paths to GC root.png, 
> Telemetries.png, heapdump size-sorted.png
>
>





[jira] [Work logged] (BEAM-5980) Add load tests for Core Apache Beam operations

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5980?focusedWorklogId=290429&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290429
 ]

ASF GitHub Bot logged work on BEAM-5980:


Author: ASF GitHub Bot
Created on: 07/Aug/19 12:04
Start Date: 07/Aug/19 12:04
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #9286: [BEAM-5980] Remove 
redundant combine tests
URL: https://github.com/apache/beam/pull/9286#issuecomment-519066797
 
 
   You are right - I will remove there too. 
 



Issue Time Tracking
---

Worklog Id: (was: 290429)
Time Spent: 40m  (was: 0.5h)

> Add load tests for Core Apache Beam operations 
> ---
>
> Key: BEAM-5980
> URL: https://issues.apache.org/jira/browse/BEAM-5980
> Project: Beam
>  Issue Type: New Feature
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This involves adding a suite of load tests described in this proposal: 
> [https://s.apache.org/load-test-basic-operations]





[jira] [Work logged] (BEAM-5980) Add load tests for Core Apache Beam operations

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5980?focusedWorklogId=290432&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290432
 ]

ASF GitHub Bot logged work on BEAM-5980:


Author: ASF GitHub Bot
Created on: 07/Aug/19 12:10
Start Date: 07/Aug/19 12:10
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #9286: [BEAM-5980] Remove 
redundant combine tests
URL: https://github.com/apache/beam/pull/9286#issuecomment-519068705
 
 
   Run seed job
 



Issue Time Tracking
---

Worklog Id: (was: 290432)
Time Spent: 50m  (was: 40m)

> Add load tests for Core Apache Beam operations 
> ---
>
> Key: BEAM-5980
> URL: https://issues.apache.org/jira/browse/BEAM-5980
> Project: Beam
>  Issue Type: New Feature
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This involves adding a suite of load tests described in this proposal: 
> [https://s.apache.org/load-test-basic-operations]





[jira] [Work logged] (BEAM-7389) Colab examples for element-wise transforms (Python)

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7389?focusedWorklogId=290433&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290433
 ]

ASF GitHub Bot logged work on BEAM-7389:


Author: ASF GitHub Bot
Created on: 07/Aug/19 12:12
Start Date: 07/Aug/19 12:12
Worklog Time Spent: 10m 
  Work Description: robertwb commented on issue #9260: [BEAM-7389] Add code 
examples for FlatMap page
URL: https://github.com/apache/beam/pull/9260#issuecomment-519069426
 
 
   I agree. I don't think tuple variants of Filter/Partition would hold their 
weight, and ParDo is supposed to take a DoFn, not a lambda, and I don't think 
we should introduce another kind of DoFn just for that (by then you're 
probably heavyweight enough that manually unpacking isn't too bad).
 



Issue Time Tracking
---

Worklog Id: (was: 290433)
Time Spent: 33h 20m  (was: 33h 10m)

> Colab examples for element-wise transforms (Python)
> ---
>
> Key: BEAM-7389
> URL: https://issues.apache.org/jira/browse/BEAM-7389
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Rose Nguyen
>Assignee: David Cavazos
>Priority: Minor
>  Time Spent: 33h 20m
>  Remaining Estimate: 0h
>






[jira] [Resolved] (BEAM-7578) Add integration tests for HDFS

2019-08-07 Thread Valentyn Tymofieiev (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentyn Tymofieiev resolved BEAM-7578.
---
   Resolution: Fixed
Fix Version/s: Not applicable

> Add integration tests for HDFS 
> ---
>
> Key: BEAM-7578
> URL: https://issues.apache.org/jira/browse/BEAM-7578
> Project: Beam
>  Issue Type: Sub-task
>  Components: io-python-hadoop
>Reporter: Frederik Bode
>Assignee: Frederik Bode
>Priority: Major
> Fix For: Not applicable
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> Add Integration tests for HDFS IO





[jira] [Work logged] (BEAM-7874) FnApi only supports up to 10 workers

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7874?focusedWorklogId=290435&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290435
 ]

ASF GitHub Bot logged work on BEAM-7874:


Author: ASF GitHub Bot
Created on: 07/Aug/19 12:33
Start Date: 07/Aug/19 12:33
Worklog Time Spent: 10m 
  Work Description: robertwb commented on pull request #9218: [BEAM-7874], 
[BEAM-7873] Distributed FnApiRunner bugfixes
URL: https://github.com/apache/beam/pull/9218#discussion_r311524807
 
 

 ##
 File path: sdks/python/apache_beam/runners/portability/fn_api_runner.py
 ##
 @@ -1319,55 +1320,62 @@ def stop_worker(self):
 @WorkerHandler.register_environment(common_urns.environments.DOCKER.urn,
 beam_runner_api_pb2.DockerPayload)
 class DockerSdkWorkerHandler(GrpcWorkerHandler):
+
+  _lock = threading.Lock()
 
 Review comment:
   Add some comments as to why this is needed. Is it to work around subprocess 
issues with Python 2? If so, perhaps lift it up as an appropriately named 
module global (as it should be shared with the SubProcess worker handler 
below). 
 



Issue Time Tracking
---

Worklog Id: (was: 290435)

> FnApi only supports up to 10 workers
> 
>
> Key: BEAM-7874
> URL: https://issues.apache.org/jira/browse/BEAM-7874
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Because max_workers of the grpc servers is hardcoded to 10, only up to 10 
> workers are supported, and if we pass direct_num_workers greater than 10, 
> the pipeline hangs, because not all workers get connected to the runner.
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L1141]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7874) FnApi only supports up to 10 workers

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7874?focusedWorklogId=290436&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290436
 ]

ASF GitHub Bot logged work on BEAM-7874:


Author: ASF GitHub Bot
Created on: 07/Aug/19 12:33
Start Date: 07/Aug/19 12:33
Worklog Time Spent: 10m 
  Work Description: robertwb commented on pull request #9218: [BEAM-7874], 
[BEAM-7873] Distributed FnApiRunner bugfixes
URL: https://github.com/apache/beam/pull/9218#discussion_r311523584
 
 

 ##
 File path: sdks/python/apache_beam/runners/portability/fn_api_runner.py
 ##
 @@ -1134,11 +1134,12 @@ class GrpcServer(object):
 
   _DEFAULT_SHUTDOWN_TIMEOUT_SECS = 5
 
-  def __init__(self, state, provision_info):
+  def __init__(self, state, provision_info, num_workers):
 self.state = state
 self.provision_info = provision_info
+max_workers = max(10, num_workers)
 
 Review comment:
   How is this safe?
 



Issue Time Tracking
---

Worklog Id: (was: 290436)
Time Spent: 1.5h  (was: 1h 20m)

> FnApi only supports up to 10 workers
> 
>
> Key: BEAM-7874
> URL: https://issues.apache.org/jira/browse/BEAM-7874
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Because max_workers of the grpc servers is hardcoded to 10, only up to 10 
> workers are supported; if direct_num_workers is greater than 10, the 
> pipeline hangs, because not all workers get connected to the runner.
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L1141]
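The `max_workers = max(10, num_workers)` line questioned in the diff sizes the server's thread pool to at least the number of SDK workers, keeping the old floor of 10. A stdlib-only sketch of that sizing logic (using `concurrent.futures` directly rather than grpc; `_max_workers` is a CPython implementation detail, inspected here only for illustration):

```python
from concurrent import futures

_MIN_GRPC_WORKERS = 10  # the previously hardcoded cap

def make_thread_pool(num_workers):
    # Size the pool to at least num_workers so every SDK worker can hold a
    # connection to the runner; max() preserves the old floor of 10.
    max_workers = max(_MIN_GRPC_WORKERS, num_workers)
    return futures.ThreadPoolExecutor(max_workers=max_workers)

print(make_thread_pool(25)._max_workers)  # 25
print(make_thread_pool(3)._max_workers)   # 10
```

With the hardcoded cap, 25 workers would contend for 10 server threads and some would never connect, which matches the reported hang.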



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=290440&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290440
 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 07/Aug/19 12:43
Start Date: 07/Aug/19 12:43
Worklog Time Spent: 10m 
  Work Description: robertwb commented on pull request #9179: [BEAM-7060] 
Use typing in type decorators of core.py
URL: https://github.com/apache/beam/pull/9179#discussion_r311530990
 
 

 ##
 File path: sdks/python/apache_beam/transforms/core.py
 ##
 @@ -89,9 +90,9 @@
 ]
 
 # Type variables
-T = typehints.TypeVariable('T')
-K = typehints.TypeVariable('K')
-V = typehints.TypeVariable('V')
+T = typing.TypeVar('T')
 
 Review comment:
   Done. PTAL
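The diff above swaps Beam's in-house `typehints.TypeVariable` for the standard library's `typing.TypeVar`; at the declaration site the replacement is a drop-in. A minimal stdlib-only illustration (the `first` helper is a toy example, not Beam code):

```python
import typing

# Standard-library type variables, replacing typehints.TypeVariable('T') etc.
T = typing.TypeVar('T')
K = typing.TypeVar('K')
V = typing.TypeVar('V')

def first(pairs: typing.List[typing.Tuple[K, V]]) -> K:
    # A toy generic function parameterized by the variables above.
    return pairs[0][0]

print(first([("a", 1), ("b", 2)]))  # a
```

Using `typing.TypeVar` lets Beam's decorators interoperate with standard PEP 484 annotations instead of a parallel in-house notation.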
 



Issue Time Tracking
---

Worklog Id: (was: 290440)
Time Spent: 10h  (was: 9h 50m)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 10h
>  Remaining Estimate: 0h
>
> Existing [Typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] 
> heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6, see for 
> example: https://issues.apache.org/jira/browse/BEAM-6877, which makes  
> typehints support unusable on Python 3.6 as of now. [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail]
>  lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate in-house typehints implementation.
> - Continue to support in-house implementation, which at this point is a stale 
> code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting 
> type-annotations, like  Pytype, Mypy, PyAnnotate.
> WRT this decision, we also need to plan immediate next steps to unblock 
> adoption of Beam for Python 3.6+ users. One potential option may be to have 
> Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7916) Change ElasticsearchIO query parameter to be a ValueProvider

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7916?focusedWorklogId=290441&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290441
 ]

ASF GitHub Bot logged work on BEAM-7916:


Author: ASF GitHub Bot
Created on: 07/Aug/19 12:44
Start Date: 07/Aug/19 12:44
Worklog Time Spent: 10m 
  Work Description: RyanSkraba commented on issue #9285: [BEAM-7916] - 
Change ElasticsearchIO query parameter to be a ValueProvider
URL: https://github.com/apache/beam/pull/9285#issuecomment-519080014
 
 
   For info: @jbonofre and @echauchot are not available this week, you might 
want to find an alternate reviewer.  
   
   For what it's worth, LGTM but I would highly recommend the suggested change 
for compatibility.
 



Issue Time Tracking
---

Worklog Id: (was: 290441)
Time Spent: 20m  (was: 10m)

> Change ElasticsearchIO query parameter to be a ValueProvider
> 
>
> Key: BEAM-7916
> URL: https://issues.apache.org/jira/browse/BEAM-7916
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Affects Versions: 2.14.0
>Reporter: Oliver Henlich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We need to be able to perform Elasticsearch queries that are dynamic. The 
> problem is {{ElasticsearchIO.read().withQuery()}} only accepts a string which 
> means the query must be known when the pipeline/Google Dataflow Template is 
> built.
> It would be great if we could change the parameter on the {{withQuery()}} 
> method from {{String}} to {{ValueProvider}}.
> Pull request: https://github.com/apache/beam/pull/9285
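The `ValueProvider` pattern defers a parameter's value from template-construction time to execution time. A minimal stdlib-only sketch of the idea — these are stand-in classes illustrating the pattern, not the Beam SDK's actual implementations (Beam has similarly named providers):

```python
class StaticValueProvider:
    """Wraps a value already known at pipeline-construction time."""
    def __init__(self, value):
        self._value = value
    def get(self):
        return self._value

class RuntimeValueProvider:
    """Resolves its value only when the pipeline actually runs."""
    def __init__(self, fetch):
        self._fetch = fetch  # callable evaluated at execution time
    def get(self):
        return self._fetch()

def with_query(query):
    # Accept either a plain string (wrapped) or a provider, so callers
    # with static queries keep working while templates can defer the value.
    if isinstance(query, str):
        return StaticValueProvider(query)
    return query

q = with_query(RuntimeValueProvider(lambda: '{"match_all": {}}'))
print(q.get())  # {"match_all": {}}
```

Accepting both forms is why the reviewer's compatibility suggestion matters: existing `withQuery(String)` callers should not break when the parameter becomes a provider.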



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-5980) Add load tests for Core Apache Beam operations

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5980?focusedWorklogId=290442&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290442
 ]

ASF GitHub Bot logged work on BEAM-5980:


Author: ASF GitHub Bot
Created on: 07/Aug/19 12:47
Start Date: 07/Aug/19 12:47
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #9286: [BEAM-5980] Remove 
redundant combine tests
URL: https://github.com/apache/beam/pull/9286#issuecomment-519081033
 
 
   It seems that I didn't mess up the jobs. Could you take a look again 
@kkucharc ?
 



Issue Time Tracking
---

Worklog Id: (was: 290442)
Time Spent: 1h  (was: 50m)

> Add load tests for Core Apache Beam operations 
> ---
>
> Key: BEAM-5980
> URL: https://issues.apache.org/jira/browse/BEAM-5980
> Project: Beam
>  Issue Type: New Feature
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This involves adding a suite of load tests described in this proposal: 
> [https://s.apache.org/load-test-basic-operations]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-4775) JobService should support returning metrics

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=290459&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290459
 ]

ASF GitHub Bot logged work on BEAM-4775:


Author: ASF GitHub Bot
Created on: 07/Aug/19 13:20
Start Date: 07/Aug/19 13:20
Worklog Time Spent: 10m 
  Work Description: robertwb commented on pull request #9020: [BEAM-4775] 
Support returning metrics from job service
URL: https://github.com/apache/beam/pull/9020#discussion_r311548631
 
 

 ##
 File path: 
runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/jobsubmission/JobInvocation.java
 ##
 @@ -86,11 +89,17 @@ public synchronized void start() {
 setState(JobState.Enum.RUNNING);
 Futures.addCallback(
 invocationFuture,
-new FutureCallback() {
+new FutureCallback() {
   @Override
-  public void onSuccess(PipelineResult pipelineResult) {
+  public void onSuccess(PortablePipelineResult pipelineResult) {
 if (pipelineResult != null) {
-  switch (pipelineResult.getState()) {
+  PipelineResult.State state = pipelineResult.getState();
+
+  if (state.isTerminal()) {
+metrics = pipelineResult.portableMetrics();
 
 Review comment:
   Otherwise we should stash a handle to pipelineResult such that we can query 
its current metrics in the getMetrics call below. 
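The alternative described in the comment — stash a handle to the pipeline result and query its live metrics on demand rather than caching them only at terminal state — can be sketched in Python as follows. All names here (`JobInvocation`, `FakeResult`) are hypothetical stand-ins for the Java classes in the diff:

```python
class JobInvocation:
    def __init__(self):
        self._pipeline_result = None  # stashed handle, set once the job starts

    def on_result(self, pipeline_result):
        # Keep the handle instead of copying metrics out at one point in time.
        self._pipeline_result = pipeline_result

    def get_metrics(self):
        # Query the live result on every call, so non-terminal jobs can
        # still report their current metrics.
        if self._pipeline_result is None:
            raise RuntimeError("job has not produced a result yet")
        return self._pipeline_result.metrics()

class FakeResult:
    def metrics(self):
        return {"elements_read": 42}

job = JobInvocation()
job.on_result(FakeResult())
print(job.get_metrics())  # {'elements_read': 42}
```

The trade-off versus the diff's approach is freshness: caching at terminal state gives one snapshot, while holding the handle supports mid-run metric queries.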
 



Issue Time Tracking
---

Worklog Id: (was: 290459)
Time Spent: 48.5h  (was: 48h 20m)

> JobService should support returning metrics
> ---
>
> Key: BEAM-4775
> URL: https://issues.apache.org/jira/browse/BEAM-4775
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Eugene Kirpichov
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 48.5h
>  Remaining Estimate: 0h
>
> Design doc: [https://s.apache.org/get-metrics-api].
> Further discussion is ongoing on [this 
> doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm].
> We want to report job metrics back to the portability harness from the runner 
> harness, for displaying to users.
> h1. Relevant PRs in flight:
> h2. Ready for Review:
>  * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC 
> protos from [#8018|https://github.com/apache/beam/pull/8018].
> h2. Iterating / Discussing:
>  * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: 
> get ptransform from MonitoringInfo, not stage name
>  ** this is a simpler, Flink-specific PR that is basically duplicated inside 
> each of the following two, so may be worth trying to merge in first
>  * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data 
> model in Java SDK metrics
>  * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks
> h2. Merged
>  * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC 
> protos
>  * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a 
> MetricKey
>  * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo 
> protos to model/pipeline module
>  * [#7883|https://github.com/apache/beam/pull/7883]: Add 
> MetricQueryResults.allMetrics() helper
>  * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers 
> from fn-harness to sdks/java/core
>  * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult 
> implementations
> h2. Closed
>  * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK 
> support
>  * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; 
> support integer distributions, gauges



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-5980) Add load tests for Core Apache Beam operations

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5980?focusedWorklogId=290460&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290460
 ]

ASF GitHub Bot logged work on BEAM-5980:


Author: ASF GitHub Bot
Created on: 07/Aug/19 13:24
Start Date: 07/Aug/19 13:24
Worklog Time Spent: 10m 
  Work Description: kkucharc commented on pull request #9286: [BEAM-5980] 
Remove redundant combine tests
URL: https://github.com/apache/beam/pull/9286#discussion_r311549710
 
 

 ##
 File path: .test-infra/jenkins/job_LoadTests_Combine_Flink_Python.groovy
 ##
 @@ -149,22 +83,15 @@ def batchLoadTestJob = { scope, triggeringContext ->
 scope.description('Runs Python Combine load tests on Flink runner in batch 
mode')
 commonJobProperties.setTopLevelMainJobProperties(scope, 'master', 240)
 
-def numberOfWorkers = 16
-def scaledNumberOfWorkers = 5
 def datasetName = loadTestsBuilder.getBigQueryDataset('load_test', 
triggeringContext)
 
 infra.prepareSDKHarness(scope, CommonTestProperties.SDK.PYTHON, 
dockerRegistryRoot, dockerTag)
 infra.prepareFlinkJobServer(scope, flinkVersion, dockerRegistryRoot, 
dockerTag)
-infra.setupFlinkCluster(scope, jenkinsJobName, flinkDownloadUrl, 
pythonHarnessImageTag, jobServerImageTag, numberOfWorkers)
 
-def testConfigs = loadTestConfigurationsSixteenWorkers(datasetName)
-for (config in testConfigs) {
-loadTestsBuilder.loadTest(scope, config.title, config.runner, 
CommonTestProperties.SDK.PYTHON, config.jobProperties, config.itClass)
-}
-
-infra.scaleCluster(scope, jenkinsJobName, scaledNumberOfWorkers)
+def numberOfWorkers = 16
 
 Review comment:
   What do you think about making it global again and putting the parallelism 
param in each test configuration?
 



Issue Time Tracking
---

Worklog Id: (was: 290460)
Time Spent: 1h 10m  (was: 1h)

> Add load tests for Core Apache Beam operations 
> ---
>
> Key: BEAM-5980
> URL: https://issues.apache.org/jira/browse/BEAM-5980
> Project: Beam
>  Issue Type: New Feature
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> This involves adding a suite of load tests described in this proposal: 
> [https://s.apache.org/load-test-basic-operations]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7776) Implement Kubernetes setup/teardown code to gradle/jenkins tasks.

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7776?focusedWorklogId=290463&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290463
 ]

ASF GitHub Bot logged work on BEAM-7776:


Author: ASF GitHub Bot
Created on: 07/Aug/19 13:31
Start Date: 07/Aug/19 13:31
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #9287: [BEAM-7776] Remove 
Pkb from file-based IOIT jobs
URL: https://github.com/apache/beam/pull/9287#issuecomment-519098037
 
 
   Run seed job
 



Issue Time Tracking
---

Worklog Id: (was: 290463)
Time Spent: 9h 10m  (was: 9h)

> Implement Kubernetes setup/teardown code to gradle/jenkins tasks. 
> --
>
> Key: BEAM-7776
> URL: https://issues.apache.org/jira/browse/BEAM-7776
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> Currently this is done by Perfkit Benchmarker but can be easily moved to 
> Beam's codebase and a set of fine-grained gradle tasks. Those could then be 
> invoked by Jenkins, giving more flexibility to our tests and making Perfkit 
> totally obsolete in IOITs. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7776) Implement Kubernetes setup/teardown code to gradle/jenkins tasks.

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7776?focusedWorklogId=290462&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290462
 ]

ASF GitHub Bot logged work on BEAM-7776:


Author: ASF GitHub Bot
Created on: 07/Aug/19 13:31
Start Date: 07/Aug/19 13:31
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on pull request #9287: [BEAM-7776] 
Remove Pkb from file-based IOIT jobs
URL: https://github.com/apache/beam/pull/9287
 
 
   
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   

[jira] [Created] (BEAM-7917) Python datastore v1new fails on retry

2019-08-07 Thread Dmytro Sadovnychyi (JIRA)
Dmytro Sadovnychyi created BEAM-7917:


 Summary: Python datastore v1new fails on retry
 Key: BEAM-7917
 URL: https://issues.apache.org/jira/browse/BEAM-7917
 Project: Beam
  Issue Type: Bug
  Components: io-python-gcp, runner-dataflow
Affects Versions: 2.14.0
 Environment: Python 3.7 on Dataflow
Reporter: Dmytro Sadovnychyi


Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 782, in 
apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 454, in 
apache_beam.runners.common.SimpleInvoker.invoke_process
  File 
"/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/datastore/v1new/datastoreio.py",
 line 334, in process
self._flush_batch()
  File 
"/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/datastore/v1new/datastoreio.py",
 line 349, in _flush_batch
throttle_delay=util.WRITE_BATCH_TARGET_LATENCY_MS // 1000)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/utils/retry.py", 
line 197, in wrapper
return fun(*args, **kwargs)
  File 
"/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/datastore/v1new/helper.py",
 line 99, in write_mutations
batch.commit()
  File 
"/usr/local/lib/python3.7/site-packages/google/cloud/datastore/batch.py", line 
271, in commit
raise ValueError("Batch must be in progress to commit()")
ValueError: Batch must be in progress to commit()
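The traceback suggests the retry wrapper re-invokes `commit()` on a batch whose in-progress state was already consumed by the first attempt. A minimal reproduction of that state machine — a stand-in, not the real google-cloud-datastore classes:

```python
class Batch:
    """Minimal stand-in for the Batch begin/commit lifecycle."""
    def __init__(self):
        self._in_progress = False

    def begin(self):
        self._in_progress = True

    def commit(self):
        # Mirrors the check that raised in the traceback above.
        if not self._in_progress:
            raise ValueError("Batch must be in progress to commit()")
        self._in_progress = False  # commit() ends the batch either way

b = Batch()
b.begin()
b.commit()          # first attempt consumes the in-progress state
try:
    b.commit()      # a blind retry now fails
except ValueError as e:
    print(e)        # Batch must be in progress to commit()

b.begin()           # a retry-aware caller re-begins before committing again
b.commit()
```

Under this model, a fix would rebuild or re-begin the batch inside the retry loop rather than re-calling `commit()` on the spent one.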



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7918) UNNEST does not work with nested records

2019-08-07 Thread Sahith Nallapareddy (JIRA)
Sahith Nallapareddy created BEAM-7918:
-

 Summary: UNNEST does not work with nested records
 Key: BEAM-7918
 URL: https://issues.apache.org/jira/browse/BEAM-7918
 Project: Beam
  Issue Type: Bug
  Components: dsl-sql
Affects Versions: 2.15.0
Reporter: Sahith Nallapareddy
Assignee: Sahith Nallapareddy


UNNEST seems to have problems with nested rows. It assumes that the values will 
be primitives and adds them to the resulting row, but for a nested row it must 
go one level deeper and add the row's values. 
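Going one level deeper — descending into nested rows instead of appending them wholesale — can be sketched as a recursive flatten. This is a sketch of the idea only, modeling rows as lists, not the actual BeamSQL operator:

```python
def flatten_row(row):
    """Recursively expand nested rows (modeled here as lists) into a flat
    list of primitive values, mirroring what UNNEST must do."""
    out = []
    for value in row:
        if isinstance(value, list):      # a nested row: go one level deeper
            out.extend(flatten_row(value))
        else:                            # a primitive: copy it through
            out.append(value)
    return out

print(flatten_row([1, ["a", "b"], [2, [3]]]))  # [1, 'a', 'b', 2, 3]
```

The buggy behavior corresponds to the `else` branch being taken for every value, so a nested row lands in the output as a single opaque value instead of its contents.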



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7776) Implement Kubernetes setup/teardown code to gradle/jenkins tasks.

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7776?focusedWorklogId=290467&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290467
 ]

ASF GitHub Bot logged work on BEAM-7776:


Author: ASF GitHub Bot
Created on: 07/Aug/19 13:49
Start Date: 07/Aug/19 13:49
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #9287: [BEAM-7776] Remove 
Pkb from file-based IOIT jobs
URL: https://github.com/apache/beam/pull/9287#issuecomment-519105896
 
 
   Run seed job
 



Issue Time Tracking
---

Worklog Id: (was: 290467)
Time Spent: 9h 20m  (was: 9h 10m)

> Implement Kubernetes setup/teardown code to gradle/jenkins tasks. 
> --
>
> Key: BEAM-7776
> URL: https://issues.apache.org/jira/browse/BEAM-7776
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> Currently this is done by Perfkit Benchmarker but can be easily moved to 
> Beam's codebase and a set of fine-grained gradle tasks. Those could then be 
> invoked by Jenkins, giving more flexibility to our tests and making Perfkit 
> totally obsolete in IOITs. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-5980) Add load tests for Core Apache Beam operations

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5980?focusedWorklogId=290475&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290475
 ]

ASF GitHub Bot logged work on BEAM-5980:


Author: ASF GitHub Bot
Created on: 07/Aug/19 13:54
Start Date: 07/Aug/19 13:54
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on pull request #9286: [BEAM-5980] 
Remove redundant combine tests
URL: https://github.com/apache/beam/pull/9286#discussion_r311565958
 
 

 ##
 File path: .test-infra/jenkins/job_LoadTests_Combine_Flink_Python.groovy
 ##
 @@ -149,22 +83,15 @@ def batchLoadTestJob = { scope, triggeringContext ->
 scope.description('Runs Python Combine load tests on Flink runner in batch 
mode')
 commonJobProperties.setTopLevelMainJobProperties(scope, 'master', 240)
 
-def numberOfWorkers = 16
-def scaledNumberOfWorkers = 5
 def datasetName = loadTestsBuilder.getBigQueryDataset('load_test', 
triggeringContext)
 
 infra.prepareSDKHarness(scope, CommonTestProperties.SDK.PYTHON, 
dockerRegistryRoot, dockerTag)
 infra.prepareFlinkJobServer(scope, flinkVersion, dockerRegistryRoot, 
dockerTag)
-infra.setupFlinkCluster(scope, jenkinsJobName, flinkDownloadUrl, 
pythonHarnessImageTag, jobServerImageTag, numberOfWorkers)
 
-def testConfigs = loadTestConfigurationsSixteenWorkers(datasetName)
-for (config in testConfigs) {
-loadTestsBuilder.loadTest(scope, config.title, config.runner, 
CommonTestProperties.SDK.PYTHON, config.jobProperties, config.itClass)
-}
-
-infra.scaleCluster(scope, jenkinsJobName, scaledNumberOfWorkers)
+def numberOfWorkers = 16
 
 Review comment:
   👍 
 



Issue Time Tracking
---

Worklog Id: (was: 290475)
Time Spent: 1h 20m  (was: 1h 10m)

> Add load tests for Core Apache Beam operations 
> ---
>
> Key: BEAM-5980
> URL: https://issues.apache.org/jira/browse/BEAM-5980
> Project: Beam
>  Issue Type: New Feature
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> This involves adding a suite of load tests described in this proposal: 
> [https://s.apache.org/load-test-basic-operations]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7918) UNNEST does not work with nested records

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7918?focusedWorklogId=290478&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290478
 ]

ASF GitHub Bot logged work on BEAM-7918:


Author: ASF GitHub Bot
Created on: 07/Aug/19 13:58
Start Date: 07/Aug/19 13:58
Worklog Time Spent: 10m 
  Work Description: snallapa commented on pull request #9288: [BEAM-7918] 
adding nested row implementation for unnest and uncollect
URL: https://github.com/apache/beam/pull/9288
 
 
   UNNEST and UNCOLLECT did not seem to handle nested rows: for a nested row, 
all of the values in the nested row itself must be added to the resulting row. 
Added tests as well to cover these cases.
   `R:@kennknowles`
   `R:@amaliujia`
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [x] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   

[jira] [Work logged] (BEAM-7776) Implement Kubernetes setup/teardown code to gradle/jenkins tasks.

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7776?focusedWorklogId=290481&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290481
 ]

ASF GitHub Bot logged work on BEAM-7776:


Author: ASF GitHub Bot
Created on: 07/Aug/19 13:59
Start Date: 07/Aug/19 13:59
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #9287: [BEAM-7776] Remove 
Pkb from file-based IOIT jobs
URL: https://github.com/apache/beam/pull/9287#issuecomment-519110432
 
 
   Run Java TextIO Performance Test
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290481)
Time Spent: 9.5h  (was: 9h 20m)

> Implement Kubernetes setup/teardown code to gradle/jenkins tasks. 
> --
>
> Key: BEAM-7776
> URL: https://issues.apache.org/jira/browse/BEAM-7776
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> Currently this is done by Perfkit Benchmarker, but it can easily be moved to 
> Beam's codebase and a set of fine-grained Gradle tasks. Those could then be 
> invoked by Jenkins, giving our tests more flexibility and making Perfkit 
> totally obsolete in IOITs. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-5980) Add load tests for Core Apache Beam operations

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5980?focusedWorklogId=290482&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290482
 ]

ASF GitHub Bot logged work on BEAM-5980:


Author: ASF GitHub Bot
Created on: 07/Aug/19 14:02
Start Date: 07/Aug/19 14:02
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #9286: [BEAM-5980] Remove 
redundant combine tests
URL: https://github.com/apache/beam/pull/9286#issuecomment-519111574
 
 
   Run seed job
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290482)
Time Spent: 1.5h  (was: 1h 20m)

> Add load tests for Core Apache Beam operations 
> ---
>
> Key: BEAM-5980
> URL: https://issues.apache.org/jira/browse/BEAM-5980
> Project: Beam
>  Issue Type: New Feature
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This involves adding a suite of load tests described in this proposal: 
> [https://s.apache.org/load-test-basic-operations]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7776) Implement Kubernetes setup/teardown code to gradle/jenkins tasks.

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7776?focusedWorklogId=290483&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290483
 ]

ASF GitHub Bot logged work on BEAM-7776:


Author: ASF GitHub Bot
Created on: 07/Aug/19 14:08
Start Date: 07/Aug/19 14:08
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #9287: [BEAM-7776] Remove 
Pkb from file-based IOIT jobs
URL: https://github.com/apache/beam/pull/9287#issuecomment-519114012
 
 
   Run seed job
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290483)
Time Spent: 9h 40m  (was: 9.5h)

> Implement Kubernetes setup/teardown code to gradle/jenkins tasks. 
> --
>
> Key: BEAM-7776
> URL: https://issues.apache.org/jira/browse/BEAM-7776
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 9h 40m
>  Remaining Estimate: 0h
>
> Currently this is done by Perfkit Benchmarker, but it can easily be moved to 
> Beam's codebase and a set of fine-grained Gradle tasks. Those could then be 
> invoked by Jenkins, giving our tests more flexibility and making Perfkit 
> totally obsolete in IOITs. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-4379) Make ParquetIO Read splittable

2019-08-07 Thread Ryan Skraba (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902108#comment-16902108
 ] 

Ryan Skraba commented on BEAM-4379:
---

I'm looking at what Spark has done for splittable Parquet files -- it looks 
like there's a lot of reusable strategy that might be pushed up to Parquet, 
especially with their own ReadSupport for catalyst data types (the equivalent 
of BEAM-4812, reading and writing Rows directly from Parquet). 

I'm still ramping up on the necessary changes to Parquet, but I won't be 
offended if my conclusion is proven wrong or someone with more expertise takes 
the JIRA, of course!

> Make ParquetIO Read splittable
> --
>
> Key: BEAM-4379
> URL: https://issues.apache.org/jira/browse/BEAM-4379
> Project: Beam
>  Issue Type: Improvement
>  Components: io-ideas, io-java-parquet
>Reporter: Lukasz Gajowy
>Priority: Major
>
> As the title stands - currently it is not splittable which is not optimal for 
> runners that support splitting.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7776) Implement Kubernetes setup/teardown code to gradle/jenkins tasks.

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7776?focusedWorklogId=290489&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290489
 ]

ASF GitHub Bot logged work on BEAM-7776:


Author: ASF GitHub Bot
Created on: 07/Aug/19 14:22
Start Date: 07/Aug/19 14:22
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #9287: [BEAM-7776] Remove 
Pkb from file-based IOIT jobs
URL: https://github.com/apache/beam/pull/9287#issuecomment-519119431
 
 
   Run Java TextIO Performance Test
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290489)
Time Spent: 9h 50m  (was: 9h 40m)

> Implement Kubernetes setup/teardown code to gradle/jenkins tasks. 
> --
>
> Key: BEAM-7776
> URL: https://issues.apache.org/jira/browse/BEAM-7776
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> Currently this is done by Perfkit Benchmarker, but it can easily be moved to 
> Beam's codebase and a set of fine-grained Gradle tasks. Those could then be 
> invoked by Jenkins, giving our tests more flexibility and making Perfkit 
> totally obsolete in IOITs. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7776) Implement Kubernetes setup/teardown code to gradle/jenkins tasks.

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7776?focusedWorklogId=290491&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290491
 ]

ASF GitHub Bot logged work on BEAM-7776:


Author: ASF GitHub Bot
Created on: 07/Aug/19 14:29
Start Date: 07/Aug/19 14:29
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #9287: [BEAM-7776] Remove 
Pkb from file-based IOIT jobs
URL: https://github.com/apache/beam/pull/9287#issuecomment-519122720
 
 
   Run seed job
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290491)
Time Spent: 10h  (was: 9h 50m)

> Implement Kubernetes setup/teardown code to gradle/jenkins tasks. 
> --
>
> Key: BEAM-7776
> URL: https://issues.apache.org/jira/browse/BEAM-7776
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 10h
>  Remaining Estimate: 0h
>
> Currently this is done by Perfkit Benchmarker, but it can easily be moved to 
> Beam's codebase and a set of fine-grained Gradle tasks. Those could then be 
> invoked by Jenkins, giving our tests more flexibility and making Perfkit 
> totally obsolete in IOITs. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7776) Implement Kubernetes setup/teardown code to gradle/jenkins tasks.

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7776?focusedWorklogId=290497&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290497
 ]

ASF GitHub Bot logged work on BEAM-7776:


Author: ASF GitHub Bot
Created on: 07/Aug/19 14:40
Start Date: 07/Aug/19 14:40
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #9287: [BEAM-7776] Remove 
Pkb from file-based IOIT jobs
URL: https://github.com/apache/beam/pull/9287#issuecomment-519127434
 
 
   Run Java TextIO Performance Test
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290497)
Time Spent: 10h 10m  (was: 10h)

> Implement Kubernetes setup/teardown code to gradle/jenkins tasks. 
> --
>
> Key: BEAM-7776
> URL: https://issues.apache.org/jira/browse/BEAM-7776
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 10h 10m
>  Remaining Estimate: 0h
>
> Currently this is done by Perfkit Benchmarker, but it can easily be moved to 
> Beam's codebase and a set of fine-grained Gradle tasks. Those could then be 
> invoked by Jenkins, giving our tests more flexibility and making Perfkit 
> totally obsolete in IOITs. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7742) BigQuery File Loads to work well with load job size limits

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7742?focusedWorklogId=290514&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290514
 ]

ASF GitHub Bot logged work on BEAM-7742:


Author: ASF GitHub Bot
Created on: 07/Aug/19 15:06
Start Date: 07/Aug/19 15:06
Worklog Time Spent: 10m 
  Work Description: ttanay commented on issue #9242: [BEAM-7742] Partition 
files in BQFL to cater to quotas & limits
URL: https://github.com/apache/beam/pull/9242#issuecomment-519138968
 
 
   Run Python PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290514)
Time Spent: 0.5h  (was: 20m)

> BigQuery File Loads to work well with load job size limits
> --
>
> Key: BEAM-7742
> URL: https://issues.apache.org/jira/browse/BEAM-7742
> Project: Beam
>  Issue Type: Improvement
>  Components: io-python-gcp
>Reporter: Pablo Estrada
>Assignee: Tanay Tummalapalli
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Load jobs into BigQuery have a number of limitations: 
> [https://cloud.google.com/bigquery/quotas#load_jobs]
>  
> Currently, the python BQ sink implemented in `bigquery_file_loads.py` does 
> not handle these limitations well. Improvements need to be made to the 
> implementation, to:
>  * Decide to use temp_tables dynamically at pipeline execution
>  * Add code to determine when a load job to a single destination needs to be 
> partitioned into multiple jobs.
>  * When this happens, then we definitely need to use temp_tables, in case one 
> of the two load jobs fails, and the pipeline is rerun.
> Tanay, would you be able to look at this?
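The partitioning point in the description can be sketched as a greedy grouping step. The following is a minimal, hypothetical illustration (not the actual `bigquery_file_loads.py` code), assuming per-load-job caps on file count and total bytes; the default limit values here are illustrative, not BigQuery's real quotas:

```python
def partition_files(files, max_files=10000, max_bytes=15 * (1 << 40)):
    """Greedily group (path, size) pairs into partitions, each staying
    under a per-load-job file-count cap and total-size cap."""
    partitions = [[]]
    count, size = 0, 0
    for path, file_size in files:
        # Start a new partition when adding this file would exceed a cap.
        if partitions[-1] and (count + 1 > max_files or size + file_size > max_bytes):
            partitions.append([])
            count, size = 0, 0
        partitions[-1].append(path)
        count += 1
        size += file_size
    return partitions
```

When this produces more than one partition for a single destination, the description's point about temp tables applies: each partition loads into its own temporary table, so a rerun after a partial failure does not duplicate data.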



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7918) UNNEST does not work with nested records

2019-08-07 Thread Sahith Nallapareddy (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahith Nallapareddy updated BEAM-7918:
--
Affects Version/s: (was: 2.15.0)

> UNNEST does not work with nested records
> 
>
> Key: BEAM-7918
> URL: https://issues.apache.org/jira/browse/BEAM-7918
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql
>Reporter: Sahith Nallapareddy
>Assignee: Sahith Nallapareddy
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> UNNEST seems to have problems with nested rows. It assumes that the values 
> will be primitives and adds it to the resulting row, but for a nested row it 
> must go one level deeper and add the row values. 
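The "one level deeper" behavior described above can be illustrated with a small sketch. This is not Beam SQL's actual UNNEST implementation; it simply models a row as a Python list of field values to show the difference between adding a primitive and expanding a nested row:

```python
def unnest_values(values):
    """Flatten one level of nesting: primitives are added as-is, while a
    nested row (modeled here as a list) contributes its own field values
    instead of being appended wholesale."""
    out = []
    for v in values:
        if isinstance(v, list):  # nested row: go one level deeper
            out.extend(v)
        else:                    # primitive: add directly
            out.append(v)
    return out
```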



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7918) UNNEST does not work with nested records

2019-08-07 Thread Sahith Nallapareddy (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahith Nallapareddy updated BEAM-7918:
--
Affects Version/s: 2.16.0
   2.15.0

> UNNEST does not work with nested records
> 
>
> Key: BEAM-7918
> URL: https://issues.apache.org/jira/browse/BEAM-7918
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql
>Affects Versions: 2.15.0, 2.16.0
>Reporter: Sahith Nallapareddy
>Assignee: Sahith Nallapareddy
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> UNNEST seems to have problems with nested rows. It assumes that the values 
> will be primitives and adds it to the resulting row, but for a nested row it 
> must go one level deeper and add the row values. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7919) Add a Python 3 test scenario for MongoDB IO.

2019-08-07 Thread Valentyn Tymofieiev (JIRA)
Valentyn Tymofieiev created BEAM-7919:
-

 Summary: Add a Python 3 test scenario for MongoDB IO.
 Key: BEAM-7919
 URL: https://issues.apache.org/jira/browse/BEAM-7919
 Project: Beam
  Issue Type: Sub-task
  Components: io-ideas
Reporter: Valentyn Tymofieiev
Assignee: Yichi Zhang


Python 2 MongoDB IO suite was added in:

https://github.com/apache/beam/commit/17bf89d6070565b715f44ecb5f6394219b94cfe6

We should also exercise this IO in Python 3. 

cc: [~chamikara] [~altay]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7742) BigQuery File Loads to work well with load job size limits

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7742?focusedWorklogId=290537&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290537
 ]

ASF GitHub Bot logged work on BEAM-7742:


Author: ASF GitHub Bot
Created on: 07/Aug/19 15:48
Start Date: 07/Aug/19 15:48
Worklog Time Spent: 10m 
  Work Description: ttanay commented on pull request #9242: [BEAM-7742] 
Partition files in BQFL to cater to quotas & limits
URL: https://github.com/apache/beam/pull/9242#discussion_r311629495
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigquery_file_loads.py
 ##
 @@ -116,6 +115,11 @@ def _make_new_file_writer(file_prefix, destination):
   return file_path, fs.FileSystems.create(file_path, 'application/text')
 
 
+def _file_size(file_path):
 
 Review comment:
   Try different approaches
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290537)
Time Spent: 40m  (was: 0.5h)

> BigQuery File Loads to work well with load job size limits
> --
>
> Key: BEAM-7742
> URL: https://issues.apache.org/jira/browse/BEAM-7742
> Project: Beam
>  Issue Type: Improvement
>  Components: io-python-gcp
>Reporter: Pablo Estrada
>Assignee: Tanay Tummalapalli
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Load jobs into BigQuery have a number of limitations: 
> [https://cloud.google.com/bigquery/quotas#load_jobs]
>  
> Currently, the python BQ sink implemented in `bigquery_file_loads.py` does 
> not handle these limitations well. Improvements need to be made to the 
> implementation, to:
>  * Decide to use temp_tables dynamically at pipeline execution
>  * Add code to determine when a load job to a single destination needs to be 
> partitioned into multiple jobs.
>  * When this happens, then we definitely need to use temp_tables, in case one 
> of the two load jobs fails, and the pipeline is rerun.
> Tanay, would you be able to look at this?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-1383) Consistency in the Metrics examples

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1383?focusedWorklogId=290563&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290563
 ]

ASF GitHub Bot logged work on BEAM-1383:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:05
Start Date: 07/Aug/19 16:05
Worklog Time Spent: 10m 
  Work Description: coveralls commented on issue #2575: [BEAM-1383] Making 
metrics usage in datastore_wordcount consistent
URL: https://github.com/apache/beam/pull/2575#issuecomment-294906127
 
 
   
   [![Coverage 
Status](https://coveralls.io/builds/25037900/badge)](https://coveralls.io/builds/25037900)
   
   Coverage increased (+18.9%) to 89.589% when pulling 
**157a3eeb6a8b3b8931b1ac3bffc750d6f3133aa5 on pabloem:consistent-metrics** into 
**686b774ceda8bee32032cb421651e8350ca5bf3d on apache:master**.
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290563)
Time Spent: 10m
Remaining Estimate: 0h

> Consistency in the Metrics examples
> ---
>
> Key: BEAM-1383
> URL: https://issues.apache.org/jira/browse/BEAM-1383
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Major
>  Labels: newbie, starter
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> snippets.py and wordcount.py initialize them in different places. Fix this.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7742) BigQuery File Loads to work well with load job size limits

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7742?focusedWorklogId=290567&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290567
 ]

ASF GitHub Bot logged work on BEAM-7742:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:09
Start Date: 07/Aug/19 16:09
Worklog Time Spent: 10m 
  Work Description: ttanay commented on pull request #9242: [BEAM-7742] 
Partition files in BQFL to cater to quotas & limits
URL: https://github.com/apache/beam/pull/9242#discussion_r311639554
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigquery_file_loads.py
 ##
 @@ -116,6 +115,11 @@ def _make_new_file_writer(file_prefix, destination):
   return file_path, fs.FileSystems.create(file_path, 'application/text')
 
 
+def _file_size(file_path):
 
 Review comment:
   1. Calculate the size while writing each row
   2. Return the filesystem object from _make_new_file_writer so that 
.size() is accessible 
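The first approach can be sketched with a small wrapper around the file handle. This is a hypothetical illustration (`SizeTrackingWriter` is not a Beam class): it counts bytes as rows are written, so the total size is known without a separate filesystem size lookup:

```python
import io


class SizeTrackingWriter:
    """Wrap a file-like object and keep a running byte count, so the
    total written size is known without querying the filesystem."""

    def __init__(self, underlying):
        self._underlying = underlying
        self.bytes_written = 0

    def write(self, data):
        # Delegate the write, then account for the bytes just written.
        self._underlying.write(data)
        self.bytes_written += len(data)

    def close(self):
        self._underlying.close()


# Usage: track the running size while serializing rows to a buffer.
writer = SizeTrackingWriter(io.BytesIO())
writer.write(b'{"a": 1}\n')
```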
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290567)
Time Spent: 50m  (was: 40m)

> BigQuery File Loads to work well with load job size limits
> --
>
> Key: BEAM-7742
> URL: https://issues.apache.org/jira/browse/BEAM-7742
> Project: Beam
>  Issue Type: Improvement
>  Components: io-python-gcp
>Reporter: Pablo Estrada
>Assignee: Tanay Tummalapalli
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Load jobs into BigQuery have a number of limitations: 
> [https://cloud.google.com/bigquery/quotas#load_jobs]
>  
> Currently, the python BQ sink implemented in `bigquery_file_loads.py` does 
> not handle these limitations well. Improvements need to be made to the 
> implementation, to:
>  * Decide to use temp_tables dynamically at pipeline execution
>  * Add code to determine when a load job to a single destination needs to be 
> partitioned into multiple jobs.
>  * When this happens, then we definitely need to use temp_tables, in case one 
> of the two load jobs fails, and the pipeline is rerun.
> Tanay, would you be able to look at this?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6858) Support side inputs injected into a DoFn

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6858?focusedWorklogId=290585&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290585
 ]

ASF GitHub Bot logged work on BEAM-6858:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:31
Start Date: 07/Aug/19 16:31
Worklog Time Spent: 10m 
  Work Description: reuvenlax commented on pull request #9275: [BEAM-6858] 
Support side inputs injected into a DoFn
URL: https://github.com/apache/beam/pull/9275#discussion_r311648908
 
 

 ##
 File path: 
sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/ParDoTest.java
 ##
 @@ -3311,6 +3310,47 @@ public void processElement(
 }
   }
 
+  /** Tests validating SideInput annotation behaviors. */
+  @RunWith(JUnit4.class)
+  public static class SideInputAnnotationTests extends SharedTestBase 
implements Serializable {
+
+@Test
+@Category({ValidatesRunner.class, UsesSideInputs.class})
+public void testSideInputAnnotation() {
+
+  final PCollectionView> sideInput1 =
+  pipeline
+  .apply("CreateSideInput1", Create.of(2, 1, 0))
+  .apply("ViewSideInput1", View.asList());
+
+  // SideInput tag id
+  final String sideInputTag1 = "tag1";
+
+  DoFn> fn =
+  new DoFn>() {
+@ProcessElement
+public void processElement(
+ProcessContext c,
 
 Review comment:
   is the ProcessContext parameter needed?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290585)
Time Spent: 1h 10m  (was: 1h)

> Support side inputs injected into a DoFn
> 
>
> Key: BEAM-6858
> URL: https://issues.apache.org/jira/browse/BEAM-6858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Shehzaad Nakhoda
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Beam currently supports injecting main inputs into a DoFn process method. A 
> user can write the following:
> @ProcessElement public void process(@Element InputT element)
> And Beam will (using ByteBuddy code generation) inject the input element into 
> the process method.
> We would like to also support the same for side inputs. For example:
> @ProcessElement public void process(@Element InputT element, 
> @SideInput("tag1") String input1, @SideInput("tag2") Integer input2) 
> This requires the existing process-method analysis framework to capture these 
> side inputs. The ParDo code would have to verify the type of the side input 
> and include them in the list of side inputs. This would also eliminate the 
> need for the user to explicitly call withSideInputs on the ParDo.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6858) Support side inputs injected into a DoFn

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6858?focusedWorklogId=290584&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290584
 ]

ASF GitHub Bot logged work on BEAM-6858:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:31
Start Date: 07/Aug/19 16:31
Worklog Time Spent: 10m 
  Work Description: reuvenlax commented on pull request #9275: [BEAM-6858] 
Support side inputs injected into a DoFn
URL: https://github.com/apache/beam/pull/9275#discussion_r311648483
 
 

 ##
 File path: 
sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/ParDoTest.java
 ##
 @@ -3311,6 +3310,47 @@ public void processElement(
 }
   }
 
+  /** Tests validating SideInput annotation behaviors. */
+  @RunWith(JUnit4.class)
 
 Review comment:
   Don't think we need a new test class here. A new test method (right after 
the existing testParDoWithSideInputs method) should be fine.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290584)
Time Spent: 1h  (was: 50m)

> Support side inputs injected into a DoFn
> 
>
> Key: BEAM-6858
> URL: https://issues.apache.org/jira/browse/BEAM-6858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Shehzaad Nakhoda
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Beam currently supports injecting main inputs into a DoFn process method. A 
> user can write the following:
> @ProcessElement public void process(@Element InputT element)
> And Beam will (using ByteBuddy code generation) inject the input element into 
> the process method.
> We would like to also support the same for side inputs. For example:
> @ProcessElement public void process(@Element InputT element, 
> @SideInput("tag1") String input1, @SideInput("tag2") Integer input2) 
> This requires the existing process-method analysis framework to capture these 
> side inputs. The ParDo code would have to verify the type of the side input 
> and include them in the list of side inputs. This would also eliminate the 
> need for the user to explicitly call withSideInputs on the ParDo.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6858) Support side inputs injected into a DoFn

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6858?focusedWorklogId=290581&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290581
 ]

ASF GitHub Bot logged work on BEAM-6858:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:31
Start Date: 07/Aug/19 16:31
Worklog Time Spent: 10m 
  Work Description: reuvenlax commented on pull request #9275: [BEAM-6858] 
Support side inputs injected into a DoFn
URL: https://github.com/apache/beam/pull/9275#discussion_r311647741
 
 

 ##
 File path: 
sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollectionView.java
 ##
 @@ -59,6 +59,10 @@
   @Internal
   PCollection getPCollection();
 
+  /** Sets the {@link PCollection} this {@link PCollectionView} was created 
from. * */
 
 Review comment:
   extra *
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290581)
Time Spent: 0.5h  (was: 20m)

> Support side inputs injected into a DoFn
> 
>
> Key: BEAM-6858
> URL: https://issues.apache.org/jira/browse/BEAM-6858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Shehzaad Nakhoda
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Beam currently supports injecting main inputs into a DoFn process method. A 
> user can write the following:
> @ProcessElement public void process(@Element InputT element)
> And Beam will (using ByteBuddy code generation) inject the input element into 
> the process method.
> We would like to also support the same for side inputs. For example:
> @ProcessElement public void process(@Element InputT element, 
> @SideInput("tag1") String input1, @SideInput("tag2") Integer input2) 
> This requires the existing process-method analysis framework to capture these 
> side inputs. The ParDo code would have to verify the type of the side input 
> and include them in the list of side inputs. This would also eliminate the 
> need for the user to explicitly call withSideInputs on the ParDo.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6858) Support side inputs injected into a DoFn

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6858?focusedWorklogId=290582&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290582
 ]

ASF GitHub Bot logged work on BEAM-6858:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:31
Start Date: 07/Aug/19 16:31
Worklog Time Spent: 10m 
  Work Description: reuvenlax commented on pull request #9275: [BEAM-6858] 
Support side inputs injected into a DoFn
URL: https://github.com/apache/beam/pull/9275#discussion_r311646536
 
 

 ##
 File path: 
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/reflect/DoFnInvoker.java
 ##
 @@ -135,6 +135,9 @@
 /** Provide a reference to the input element. */
 InputT element(DoFn doFn);
 
+/** Provide a reference to the input sideInput. */
 
 Review comment:
   to the side input with the specified tag.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290582)
Time Spent: 40m  (was: 0.5h)

> Support side inputs injected into a DoFn
> 
>
> Key: BEAM-6858
> URL: https://issues.apache.org/jira/browse/BEAM-6858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Shehzaad Nakhoda
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Beam currently supports injecting main inputs into a DoFn process method. A 
> user can write the following:
> @ProcessElement public void process(@Element InputT element)
> And Beam will (using ByteBuddy code generation) inject the input element into 
> the process method.
> We would like to also support the same for side inputs. For example:
> @ProcessElement public void process(@Element InputT element, 
> @SideInput("tag1") String input1, @SideInput("tag2") Integer input2) 
> This requires the existing process-method analysis framework to capture these 
> side inputs. The ParDo code would have to verify the type of the side input 
> and include them in the list of side inputs. This would also eliminate the 
> need for the user to explicitly call withSideInputs on the ParDo.
>  





[jira] [Work logged] (BEAM-6858) Support side inputs injected into a DoFn

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6858?focusedWorklogId=290579&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290579
 ]

ASF GitHub Bot logged work on BEAM-6858:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:31
Start Date: 07/Aug/19 16:31
Worklog Time Spent: 10m 
  Work Description: reuvenlax commented on pull request #9275: [BEAM-6858] 
Support side inputs injected into a DoFn
URL: https://github.com/apache/beam/pull/9275#discussion_r311645935
 
 

 ##
 File path: 
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java
 ##
 @@ -652,6 +650,17 @@ public static DoFnSchemaInformation 
getDoFnSchemaInformation(
   return withSideInputs(Arrays.asList(sideInputs));
 }
 
+/**
+ * Returns a new {@link ParDo} {@link PTransform} that's like this {@link 
PTransform} but with
+ * the specified additional side inputs. Does not modify this {@link 
PTransform}.
+ *
+ * See the discussion of Side Inputs above for more explanation.
+ */
+public SingleOutput withSideInput(String tagId, 
PCollectionView sideInput) {
+  sideInput.setTagInternalId(tagId);
 
 Review comment:
   We shouldn't modify the input view. Rather, clone it and modify the cloned copy.
 



Issue Time Tracking
---

Worklog Id: (was: 290579)
Time Spent: 10m
Remaining Estimate: 0h

> Support side inputs injected into a DoFn
> 
>
> Key: BEAM-6858
> URL: https://issues.apache.org/jira/browse/BEAM-6858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Shehzaad Nakhoda
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Beam currently supports injecting main inputs into a DoFn process method. A 
> user can write the following:
> @ProcessElement public void process(@Element InputT element)
> And Beam will (using ByteBuddy code generation) inject the input element into 
> the process method.
> We would like to also support the same for side inputs. For example:
> @ProcessElement public void process(@Element InputT element, 
> @SideInput("tag1") String input1, @SideInput("tag2") Integer input2) 
> This requires the existing process-method analysis framework to capture these 
> side inputs. The ParDo code would have to verify the type of the side input 
> and include them in the list of side inputs. This would also eliminate the 
> need for the user to explicitly call withSideInputs on the ParDo.
>  





[jira] [Work logged] (BEAM-6858) Support side inputs injected into a DoFn

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6858?focusedWorklogId=290583&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290583
 ]

ASF GitHub Bot logged work on BEAM-6858:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:31
Start Date: 07/Aug/19 16:31
Worklog Time Spent: 10m 
  Work Description: reuvenlax commented on pull request #9275: [BEAM-6858] 
Support side inputs injected into a DoFn
URL: https://github.com/apache/beam/pull/9275#discussion_r311647094
 
 

 ##
 File path: 
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/reflect/DoFnSignature.java
 ##
 @@ -556,6 +573,34 @@ public static TimerParameter 
timerParameter(TimerDeclaration decl) {
   TimeDomainParameter() {}
 }
 
+/**
+ * Descriptor for a {@link Parameter} of type {@link DoFn.SideInput}.
+ *
+ * All such descriptors are equal.
 
 Review comment:
   I don't think they are all equal. 
 



Issue Time Tracking
---

Worklog Id: (was: 290583)
Time Spent: 50m  (was: 40m)

> Support side inputs injected into a DoFn
> 
>
> Key: BEAM-6858
> URL: https://issues.apache.org/jira/browse/BEAM-6858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Shehzaad Nakhoda
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Beam currently supports injecting main inputs into a DoFn process method. A 
> user can write the following:
> @ProcessElement public void process(@Element InputT element)
> And Beam will (using ByteBuddy code generation) inject the input element into 
> the process method.
> We would like to also support the same for side inputs. For example:
> @ProcessElement public void process(@Element InputT element, 
> @SideInput("tag1") String input1, @SideInput("tag2") Integer input2) 
> This requires the existing process-method analysis framework to capture these 
> side inputs. The ParDo code would have to verify the type of the side input 
> and include them in the list of side inputs. This would also eliminate the 
> need for the user to explicitly call withSideInputs on the ParDo.
>  





[jira] [Work logged] (BEAM-6858) Support side inputs injected into a DoFn

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6858?focusedWorklogId=290580&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290580
 ]

ASF GitHub Bot logged work on BEAM-6858:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:31
Start Date: 07/Aug/19 16:31
Worklog Time Spent: 10m 
  Work Description: reuvenlax commented on pull request #9275: [BEAM-6858] 
Support side inputs injected into a DoFn
URL: https://github.com/apache/beam/pull/9275#discussion_r311646048
 
 

 ##
 File path: 
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java
 ##
 @@ -793,6 +801,18 @@ public String toString() {
   return withSideInputs(Arrays.asList(sideInputs));
 }
 
+/**
+ * Returns a new multi-output {@link ParDo} {@link PTransform} that's like 
this {@link
+ * PTransform} but with the specified additional side inputs. Does not 
modify this {@link
+ * PTransform}.
+ *
+ * See the discussion of Side Inputs above for more explanation.
+ */
+public MultiOutput withSideInput(String tagId, 
PCollectionView sideInput) {
+  sideInput.setTagInternalId(tagId);
 
 Review comment:
   ditto - don't modify the input.
 



Issue Time Tracking
---

Worklog Id: (was: 290580)
Time Spent: 20m  (was: 10m)

> Support side inputs injected into a DoFn
> 
>
> Key: BEAM-6858
> URL: https://issues.apache.org/jira/browse/BEAM-6858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Shehzaad Nakhoda
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Beam currently supports injecting main inputs into a DoFn process method. A 
> user can write the following:
> @ProcessElement public void process(@Element InputT element)
> And Beam will (using ByteBuddy code generation) inject the input element into 
> the process method.
> We would like to also support the same for side inputs. For example:
> @ProcessElement public void process(@Element InputT element, 
> @SideInput("tag1") String input1, @SideInput("tag2") Integer input2) 
> This requires the existing process-method analysis framework to capture these 
> side inputs. The ParDo code would have to verify the type of the side input 
> and include them in the list of side inputs. This would also eliminate the 
> need for the user to explicitly call withSideInputs on the ParDo.
>  





[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290587&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290587
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:33
Start Date: 07/Aug/19 16:33
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311649962
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -194,18 +212,110 @@ def display_data(self):
 res['mongo_client_spec'] = self.spec
 return res
 
-  def _get_avg_document_size(self):
+  def _get_split_keys(self, desired_chunk_size_in_mb, start_pos, end_pos):
+# if desired chunk size smaller than 1mb, use mongodb default split size of
+# 1mb
+if desired_chunk_size_in_mb < 1:
+  desired_chunk_size_in_mb = 1
+if start_pos >= end_pos:
+  # single document not splittable
+  return []
 with MongoClient(self.uri, **self.spec) as client:
-  size = client[self.db].command('collstats', self.coll).get('avgObjSize')
-  if size is None or size <= 0:
-raise ValueError(
-'Collection %s not found or average doc size is '
-'incorrect', self.coll)
-  return size
-
-  def _get_document_count(self):
+  name_space = '%s.%s' % (self.db, self.coll)
+  return (client[self.db].command(
+  'splitVector',
+  name_space,
+  keyPattern={'_id': 1},
+  min={'_id': start_pos},
+  max={'_id': end_pos},
+  maxChunkSize=desired_chunk_size_in_mb)['splitKeys'])
+
+  def _merge_id_filter(self, range_tracker):
+all_filters = self.filter.copy()
+if '_id' in all_filters:
+  id_filter = all_filters['_id']
+  id_filter['$gte'] = (
+  max(id_filter['$gte'], range_tracker.start_position())
+  if '$gte' in id_filter else range_tracker.start_position())
+
+  id_filter['$lt'] = (min(id_filter['$lt'], range_tracker.stop_position())
+  if '$lt' in id_filter else
+  range_tracker.stop_position())
+else:
+  all_filters.update({
+  '_id': {
+  '$gte': range_tracker.start_position(),
+  '$lt': range_tracker.stop_position()
+  }
+  })
+return all_filters
+
+  def _get_head_document_id(self, sort_order):
 with MongoClient(self.uri, **self.spec) as client:
-  return max(client[self.db][self.coll].count_documents(self.filter), 0)
+  cursor = client[self.db][self.coll].find(filter={}, projection=[]).sort([
+  ('_id', sort_order)
+  ]).limit(1)
+  try:
+return cursor[0]['_id']
+  except IndexError:
+raise ValueError('Empty Mongodb collection')
 
 Review comment:
   it's actually a required primary key for every document.
 



Issue Time Tracking
---

Worklog Id: (was: 290587)
Time Spent: 6h 50m  (was: 6h 40m)
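The `_merge_id_filter` logic quoted in the diff above combines a user-supplied query filter with the range tracker's `_id` bounds. A standalone sketch of that merge, with integers standing in for ObjectIds so no MongoDB client is needed:

```python
def merge_id_filter(user_filter, start, stop):
    # Merge the range tracker's [start, stop) _id bounds into the user's
    # filter, tightening any $gte/$lt bounds the user already supplied.
    all_filters = dict(user_filter)
    id_filter = dict(all_filters.get('_id', {}))
    id_filter['$gte'] = (max(id_filter['$gte'], start)
                         if '$gte' in id_filter else start)
    id_filter['$lt'] = (min(id_filter['$lt'], stop)
                        if '$lt' in id_filter else stop)
    all_filters['_id'] = id_filter
    return all_filters

merge_id_filter({'x': 1, '_id': {'$gte': 5}}, 3, 9)
# {'x': 1, '_id': {'$gte': 5, '$lt': 9}}
```

Taking the max of the `$gte` bounds and the min of the `$lt` bounds keeps the intersection of the user's range and the tracker's range, which is what makes each bundle read a disjoint slice.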

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated

[jira] [Commented] (BEAM-4379) Make ParquetIO Read splittable

2019-08-07 Thread Ryan Skraba (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902224#comment-16902224
 ] 

Ryan Skraba commented on BEAM-4379:
---

Alright, I was mistaken about one thing -- the current ParquetIO *already* 
includes the hadoop-client and hadoop-common jars in its dependencies, but it 
uses only the parts of the Parquet API that do not expose org.apache.hadoop 
classes.

I suppose PARQUET-1126 is necessary to remove the hadoop-client dependency 
(which would be a desirable outcome, but is not currently possible, with or 
without splittability).

It should be possible to implement splittability by using the current Parquet 
API, and I'll take a look. 

> Make ParquetIO Read splittable
> --
>
> Key: BEAM-4379
> URL: https://issues.apache.org/jira/browse/BEAM-4379
> Project: Beam
>  Issue Type: Improvement
>  Components: io-ideas, io-java-parquet
>Reporter: Lukasz Gajowy
>Priority: Major
>
> As the title stands - currently it is not splittable which is not optimal for 
> runners that support splitting.





[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290589&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290589
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:39
Start Date: 07/Aug/19 16:39
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311652584
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -194,18 +212,110 @@ def display_data(self):
 res['mongo_client_spec'] = self.spec
 return res
 
-  def _get_avg_document_size(self):
+  def _get_split_keys(self, desired_chunk_size_in_mb, start_pos, end_pos):
+# if desired chunk size smaller than 1mb, use mongodb default split size of
+# 1mb
+if desired_chunk_size_in_mb < 1:
+  desired_chunk_size_in_mb = 1
+if start_pos >= end_pos:
+  # single document not splittable
+  return []
 with MongoClient(self.uri, **self.spec) as client:
-  size = client[self.db].command('collstats', self.coll).get('avgObjSize')
-  if size is None or size <= 0:
-raise ValueError(
-'Collection %s not found or average doc size is '
-'incorrect', self.coll)
-  return size
-
-  def _get_document_count(self):
+  name_space = '%s.%s' % (self.db, self.coll)
+  return (client[self.db].command(
+  'splitVector',
+  name_space,
+  keyPattern={'_id': 1},
+  min={'_id': start_pos},
+  max={'_id': end_pos},
+  maxChunkSize=desired_chunk_size_in_mb)['splitKeys'])
+
+  def _merge_id_filter(self, range_tracker):
+all_filters = self.filter.copy()
+if '_id' in all_filters:
+  id_filter = all_filters['_id']
+  id_filter['$gte'] = (
+  max(id_filter['$gte'], range_tracker.start_position())
+  if '$gte' in id_filter else range_tracker.start_position())
+
+  id_filter['$lt'] = (min(id_filter['$lt'], range_tracker.stop_position())
+  if '$lt' in id_filter else
+  range_tracker.stop_position())
+else:
+  all_filters.update({
+  '_id': {
+  '$gte': range_tracker.start_position(),
+  '$lt': range_tracker.stop_position()
+  }
+  })
+return all_filters
+
+  def _get_head_document_id(self, sort_order):
 with MongoClient(self.uri, **self.spec) as client:
-  return max(client[self.db][self.coll].count_documents(self.filter), 0)
+  cursor = client[self.db][self.coll].find(filter={}, projection=[]).sort([
+  ('_id', sort_order)
+  ]).limit(1)
+  try:
+return cursor[0]['_id']
+  except IndexError:
+raise ValueError('Empty Mongodb collection')
+
+
+class _ObjectIdHelper(object):
+  """A Utility class to bson object ids."""
+
+  @classmethod
+  def id_to_int(cls, id):
+# converts object id binary to integer
+# id object is bytes type with size of 12
+ints = struct.unpack('>III', id.binary)
+return (ints[0] << 64) + (ints[1] << 32) + ints[2]
+
+  @classmethod
+  def int_to_id(cls, number):
+# converts integer value to object id. Int value should be less than
+# (2 ^ 96) so it can be converted to the 12 bytes required by object id.
+if number < 0 or number >= (1 << 96):
+  raise ValueError('number value must be within [0, %s)' % (1 << 96))
+ints = [(number & 0xffffffff0000000000000000) >> 64,
+(number & 0x00000000ffffffff00000000) >> 32,
+number & 0x00000000ffffffff]
+
+bytes = struct.pack('>III', *ints)
+return objectid.ObjectId(bytes)
+
+  @classmethod
+  def increment_id(cls, object_id, inc):
+# increment object_id binary value by inc value and return new object id.
+id_number = _ObjectIdHelper.id_to_int(object_id)
+new_number = id_number + inc
+if new_number < 0 or new_number >= (1 << 96):
+  raise ValueError('invalid incremental, inc value must be within ['
 
 Review comment:
   This shows the acceptable inc argument range more clearly, but I can also 
remove this check.
 



Issue Time Tracking
---

Worklog Id: (was: 290589)
Time Spent: 7h  (was: 6h 50m)
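The `_ObjectIdHelper` arithmetic in the diff above treats a 12-byte ObjectId as a 96-bit integer. The mapping can be sketched with plain `struct` over `bytes` (this sketch returns `bytes` rather than `bson.objectid.ObjectId`, so it needs no bson dependency; function names mirror the diff but are otherwise illustrative):

```python
import struct

def id_to_int(binary12):
    # A 12-byte ObjectId is three big-endian unsigned 32-bit words.
    hi, mid, lo = struct.unpack('>III', binary12)
    return (hi << 64) | (mid << 32) | lo

def int_to_id(number):
    # Inverse mapping; only values below 2**96 fit back into 12 bytes.
    if number < 0 or number >= (1 << 96):
        raise ValueError('number value must be within [0, %s)' % (1 << 96))
    return struct.pack('>III',
                       (number >> 64) & 0xffffffff,
                       (number >> 32) & 0xffffffff,
                       number & 0xffffffff)

def increment_id(binary12, inc):
    # Shift an id by inc; out-of-range results surface via int_to_id's check.
    return int_to_id(id_to_int(binary12) + inc)

increment_id(b'\x00' * 12, 1)
# b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'
```

Incrementing the last document's id by one is what lets the split range use a half-open `[start, stop)` interval without excluding the final document.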

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
> 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290593&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290593
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:44
Start Date: 07/Aug/19 16:44
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311654458
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,64 @@ def __init__(self,
 self.filter = filter
 self.projection = projection
 self.spec = extra_client_params
-self.doc_count = self._get_document_count()
-self.avg_doc_size = self._get_avg_document_size()
-self.client = None
 
   def estimate_size(self):
-return self.avg_doc_size * self.doc_count
+with MongoClient(self.uri, **self.spec) as client:
+  size = client[self.db].command('collstats', self.coll).get('size')
+  if size is None or size <= 0:
+raise ValueError('Collection %s not found or total doc size is '
+ 'incorrect' % self.coll)
+  return size
 
   def split(self, desired_bundle_size, start_position=None, 
stop_position=None):
 # use document cursor index as the start and stop positions
 if start_position is None:
-  start_position = 0
+  start_position = self._get_head_document_id(ASCENDING)
 if stop_position is None:
-  stop_position = self.doc_count
+  last_doc_id = self._get_head_document_id(DESCENDING)
+  # increment last doc id binary value by 1 to make sure the last document
+  # is not excluded
+  stop_position = _ObjectIdHelper.increment_id(last_doc_id, 1)
 
-# get an estimate on how many documents should be included in a split batch
-desired_bundle_count = desired_bundle_size // self.avg_doc_size
+desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024
+split_keys = self._get_split_keys(desired_bundle_size_in_mb, 
start_position,
+  stop_position)
 
 bundle_start = start_position
-while bundle_start < stop_position:
-  bundle_end = min(stop_position, bundle_start + desired_bundle_count)
-  yield iobase.SourceBundle(weight=bundle_end - bundle_start,
+for split_key_id in split_keys:
+  if bundle_start is not None or bundle_start >= stop_position:
+break
+  bundle_end = min(stop_position, split_key_id)
+  yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
 source=self,
 start_position=bundle_start,
 stop_position=bundle_end)
   bundle_start = bundle_end
+# add range of last split_key to stop_position
+if bundle_start < stop_position:
+  yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
+source=self,
+start_position=bundle_start,
+stop_position=stop_position)
 
   def get_range_tracker(self, start_position, stop_position):
 if start_position is None:
-  start_position = 0
+  start_position = self._get_head_document_id(ASCENDING)
 
 Review comment:
   Moved to a utility function.
 



Issue Time Tracking
---

Worklog Id: (was: 290593)
Time Spent: 7h 10m  (was: 7h)
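The `split` loop in the diff above turns `splitVector` keys into contiguous bundles. A standalone sketch with integer positions follows; note the diff's guard reads `if bundle_start is not None or bundle_start >= stop_position`, which looks inverted, so this sketch keeps only the presumably intended `bundle_start >= stop` check:

```python
def bundles_from_split_keys(start, stop, split_keys):
    # Partition [start, stop) at each split key, then emit the tail range so
    # documents after the last split key are not dropped.
    ranges = []
    bundle_start = start
    for split_key in split_keys:
        if bundle_start >= stop:
            break
        bundle_end = min(stop, split_key)
        ranges.append((bundle_start, bundle_end))
        bundle_start = bundle_end
    if bundle_start < stop:
        ranges.append((bundle_start, stop))
    return ranges

bundles_from_split_keys(0, 10, [3, 6])
# [(0, 3), (3, 6), (6, 10)]
```

The trailing range is the part the diff adds after the loop ("add range of last split_key to stop_position"); without it the documents beyond the last split key would be silently lost.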

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290595&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290595
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:44
Start Date: 07/Aug/19 16:44
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311338009
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,64 @@ def __init__(self,
 self.filter = filter
 self.projection = projection
 self.spec = extra_client_params
-self.doc_count = self._get_document_count()
-self.avg_doc_size = self._get_avg_document_size()
-self.client = None
 
   def estimate_size(self):
-return self.avg_doc_size * self.doc_count
+with MongoClient(self.uri, **self.spec) as client:
+  size = client[self.db].command('collstats', self.coll).get('size')
+  if size is None or size <= 0:
+raise ValueError('Collection %s not found or total doc size is '
+ 'incorrect' % self.coll)
+  return size
 
   def split(self, desired_bundle_size, start_position=None, 
stop_position=None):
 # use document cursor index as the start and stop positions
 if start_position is None:
-  start_position = 0
+  start_position = self._get_head_document_id(ASCENDING)
 if stop_position is None:
-  stop_position = self.doc_count
+  last_doc_id = self._get_head_document_id(DESCENDING)
+  # increment last doc id binary value by 1 to make sure the last document
+  # is not excluded
+  stop_position = _ObjectIdHelper.increment_id(last_doc_id, 1)
 
-# get an estimate on how many documents should be included in a split batch
-desired_bundle_count = desired_bundle_size // self.avg_doc_size
+desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024
+split_keys = self._get_split_keys(desired_bundle_size_in_mb, 
start_position,
+  stop_position)
 
 bundle_start = start_position
-while bundle_start < stop_position:
-  bundle_end = min(stop_position, bundle_start + desired_bundle_count)
-  yield iobase.SourceBundle(weight=bundle_end - bundle_start,
+for split_key_id in split_keys:
+  if bundle_start is not None or bundle_start >= stop_position:
+break
+  bundle_end = min(stop_position, split_key_id)
+  yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
 source=self,
 start_position=bundle_start,
 stop_position=bundle_end)
   bundle_start = bundle_end
+# add range of last split_key to stop_position
+if bundle_start < stop_position:
 
 Review comment:
   yes; also, bundle_start will have been set to start_position earlier, so it 
is unlikely to be None.
 



Issue Time Tracking
---

Worklog Id: (was: 290595)
Time Spent: 7h 20m  (was: 7h 10m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then

[jira] [Created] (BEAM-7920) AvroTableProvider

2019-08-07 Thread Neville Li (JIRA)
Neville Li created BEAM-7920:


 Summary: AvroTableProvider
 Key: BEAM-7920
 URL: https://issues.apache.org/jira/browse/BEAM-7920
 Project: Beam
  Issue Type: Improvement
  Components: dsl-sql
Affects Versions: 2.14.0
Reporter: Neville Li


https://github.com/apache/beam/pull/6777 and BEAM-5807 mentioned 
{{AvroTableProvider}} but I don't see one in the code base. Is this to be 
implemented or am I missing something?

cc [~kanterov]





[jira] [Work logged] (BEAM-6858) Support side inputs injected into a DoFn

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6858?focusedWorklogId=290596&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290596
 ]

ASF GitHub Bot logged work on BEAM-6858:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:46
Start Date: 07/Aug/19 16:46
Worklog Time Spent: 10m 
  Work Description: salmanVD commented on pull request #9275: [BEAM-6858] 
Support side inputs injected into a DoFn
URL: https://github.com/apache/beam/pull/9275#discussion_r311655427
 
 

 ##
 File path: 
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java
 ##
 @@ -652,6 +650,17 @@ public static DoFnSchemaInformation 
getDoFnSchemaInformation(
   return withSideInputs(Arrays.asList(sideInputs));
 }
 
+/**
+ * Returns a new {@link ParDo} {@link PTransform} that's like this {@link 
PTransform} but with
+ * the specified additional side inputs. Does not modify this {@link 
PTransform}.
+ *
+ * See the discussion of Side Inputs above for more explanation.
+ */
+public SingleOutput withSideInput(String tagId, 
PCollectionView sideInput) {
+  sideInput.setTagInternalId(tagId);
 
 Review comment:
   I tried to use SerializableUtils.clone to copy the view, but since 
`PCollection` and `PValueBase` do not implement `Serializable`, we cannot clone 
it with this method. Cloning it this way leaves `PCollection`, `Pipeline`, and 
a couple of other fields `null`. I tried to add a setter, but `Pipeline` is 
final in `PValueBase`, which affects a lot of other classes as well. 
   
   So, cloning the object and setting just the `PCollection` on the cloned 
object works, but during pipeline execution it fails the check in the 
`SideInputContainer.createReaderForViews` method with the error `Can't create a 
SideInputReader with unknown views`.
   
   I think we need to set all the fields that end up `null` after cloning 
the object. Do you have any other suggestion?
 



Issue Time Tracking
---

Worklog Id: (was: 290596)
Time Spent: 1h 20m  (was: 1h 10m)
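The review thread above asks for a tag-carrying clone instead of mutating the caller's view, while the serialization route fails on non-serializable fields. The trade-off can be illustrated with a shallow copy in Python (the `View` class below is hypothetical, not Beam code; a `threading.Lock` stands in for any non-serializable owner such as `Pipeline`):

```python
import copy
import pickle
import threading

class View:
    # Stand-in for PCollectionView: holds a backref to an owner object that,
    # like Beam's Pipeline, cannot be serialized.
    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.tag = None

pipeline = threading.Lock()      # any non-picklable object works here
view = View(pipeline)

tagged = copy.copy(view)         # shallow copy keeps the pipeline backref
tagged.tag = 'tag1'

assert tagged.pipeline is view.pipeline  # clone still resolves to the same owner
assert view.tag is None                  # caller's view is left unmodified

try:
    pickle.dumps(view)           # serialization-based cloning fails up front
except TypeError:
    pass
```

A shallow copy sidesteps both problems mentioned in the thread: the non-serializable fields stay as shared references (so nothing ends up `null`), and the tag is set only on the copy, leaving the caller's view untouched.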

> Support side inputs injected into a DoFn
> 
>
> Key: BEAM-6858
> URL: https://issues.apache.org/jira/browse/BEAM-6858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Shehzaad Nakhoda
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Beam currently supports injecting main inputs into a DoFn process method. A 
> user can write the following:
> @ProcessElement public void process(@Element InputT element)
> And Beam will (using ByteBuddy code generation) inject the input element into 
> the process method.
> We would like to also support the same for side inputs. For example:
> @ProcessElement public void process(@Element InputT element, 
> @SideInput("tag1") String input1, @SideInput("tag2") Integer input2) 
> This requires the existing process-method analysis framework to capture these 
> side inputs. The ParDo code would have to verify the type of the side input 
> and include them in the list of side inputs. This would also eliminate the 
> need for the user to explicitly call withSideInputs on the ParDo.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6858) Support side inputs injected into a DoFn

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6858?focusedWorklogId=290597&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290597
 ]

ASF GitHub Bot logged work on BEAM-6858:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:47
Start Date: 07/Aug/19 16:47
Worklog Time Spent: 10m 
  Work Description: salmanVD commented on pull request #9275: [BEAM-6858] 
Support side inputs injected into a DoFn
URL: https://github.com/apache/beam/pull/9275#discussion_r311655427
 
 

 ##
 File path: 
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java
 ##
 @@ -652,6 +650,17 @@ public static DoFnSchemaInformation 
getDoFnSchemaInformation(
   return withSideInputs(Arrays.asList(sideInputs));
 }
 
+/**
+ * Returns a new {@link ParDo} {@link PTransform} that's like this {@link 
PTransform} but with
+ * the specified additional side inputs. Does not modify this {@link 
PTransform}.
+ *
+ * See the discussion of Side Inputs above for more explanation.
+ */
+    public SingleOutput<InputT, OutputT> withSideInput(
+        String tagId, PCollectionView<?> sideInput) {
+      sideInput.setTagInternalId(tagId);
 
 Review comment:
   I tried to use `SerializableUtils.clone` to copy, but since `PCollection` 
and `PValueBase` do not implement `Serializable`, we cannot clone it with this 
method. Cloning it with this method leaves `PCollection`, `Pipeline` and a 
couple of other fields `null`. I tried to implement a setter, but `Pipeline` is 
final in `PValueBase`, which affects a lot of other classes as well. 
   
   So, cloning the object and setting just the `PCollection` in that cloned 
object works, but during execution of the pipeline it fails the condition in 
the `SideInputContainer.createReaderForViews` method with the error `Can't 
create a SideInputReader with unknown views`.
   
   I think we need to set all the fields that become `null` after cloning the 
object. Do you have any other suggestion?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290597)
Time Spent: 1.5h  (was: 1h 20m)

> Support side inputs injected into a DoFn
> 
>
> Key: BEAM-6858
> URL: https://issues.apache.org/jira/browse/BEAM-6858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Shehzaad Nakhoda
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Beam currently supports injecting main inputs into a DoFn process method. A 
> user can write the following:
> @ProcessElement public void process(@Element InputT element)
> And Beam will (using ByteBuddy code generation) inject the input element into 
> the process method.
> We would like to also support the same for side inputs. For example:
> @ProcessElement public void process(@Element InputT element, 
> @SideInput("tag1") String input1, @SideInput("tag2") Integer input2) 
> This requires the existing process-method analysis framework to capture these 
> side inputs. The ParDo code would have to verify the type of the side input 
> and include them in the list of side inputs. This would also eliminate the 
> need for the user to explicitly call withSideInputs on the ParDo.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290598&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290598
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:47
Start Date: 07/Aug/19 16:47
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311337745
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,64 @@ def __init__(self,
 self.filter = filter
 self.projection = projection
 self.spec = extra_client_params
-self.doc_count = self._get_document_count()
-self.avg_doc_size = self._get_avg_document_size()
-self.client = None
 
   def estimate_size(self):
-return self.avg_doc_size * self.doc_count
+with MongoClient(self.uri, **self.spec) as client:
+  size = client[self.db].command('collstats', self.coll).get('size')
+  if size is None or size <= 0:
+raise ValueError('Collection %s not found or total doc size is '
 
 Review comment:
   A size of 0 normally means the collection is empty. I guess we can just 
return the size; I've not seen a value below zero.
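
   As a minimal sketch of that suggestion (hypothetical helper name, not the 
actual mongodbio code), the reader could pass a zero size through instead of 
raising:

```python
def estimate_size(collstats):
    # collstats: the dict returned by MongoDB's 'collstats' command.
    size = collstats.get('size')
    if size is None:
        # No 'size' field at all: the collection does not exist.
        raise ValueError('Collection not found')
    # A size of 0 simply means the collection is empty; return it as-is.
    return size
```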
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290598)
Time Spent: 7.5h  (was: 7h 20m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing the number of results in the 
> constructor, and then in each reader re-executing the whole query and getting 
> an index sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.
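
The first two bullet points can be reproduced with a small plain-Python
simulation (hypothetical helpers, not the mongodbio implementation):
index-based sub-ranges shift when a document is inserted between shard reads,
while a fixed `_id` range is stable.

```python
def read_by_index(docs, start, end):
    # Old approach: re-execute the query and take an index sub-range.
    return sorted(docs)[start:end]

def read_by_id_range(docs, lo, hi):
    # Fixed approach: filter by an immutable _id range instead.
    return [d for d in sorted(docs) if lo <= d < hi]

db = [10, 20, 30, 40, 50]
shard_a = read_by_index(db, 0, 3)   # reads [10, 20, 30]
db.append(25)                       # concurrent insert between shard reads
shard_b = read_by_index(db, 3, 6)   # reads [30, 40, 50]: 30 duplicated, 25 lost

db = [10, 20, 30, 40, 50]
range_a = read_by_id_range(db, 10, 40)  # [10, 20, 30]
db.append(25)
range_b = read_by_id_range(db, 40, 60)  # [40, 50]: no pre-existing doc duplicated or lost
```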



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290599&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290599
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 16:49
Start Date: 07/Aug/19 16:49
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311336933
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio_test.py
 ##
 @@ -30,38 +34,136 @@
 from apache_beam.io.mongodbio import _BoundedMongoSource
 from apache_beam.io.mongodbio import _GenerateObjectIdFn
 from apache_beam.io.mongodbio import _MongoSink
+from apache_beam.io.mongodbio import _ObjectIdHelper
+from apache_beam.io.mongodbio import _ObjectIdRangeTracker
 from apache_beam.io.mongodbio import _WriteMongoFn
 from apache_beam.testing.test_pipeline import TestPipeline
 from apache_beam.testing.util import assert_that
 from apache_beam.testing.util import equal_to
 
 
+class _MockMongoColl(object):
+  """Fake mongodb collection cursor."""
+
+  def __init__(self, docs):
+    self.docs = docs
+
+  def _filter(self, filter):
+    match = []
+    if not filter:
+      return self
+    start = filter['_id'].get('$gte')
+    end = filter['_id'].get('$lt')
+    assert start is not None
+    assert end is not None
+    for doc in self.docs:
+      if start and doc['_id'] < start:
+        continue
+      if end and doc['_id'] >= end:
+        continue
+      match.append(doc)
+    return match
+
+  def find(self, filter=None, **kwargs):
+    return _MockMongoColl(self._filter(filter))
+
+  def sort(self, sort_items):
+    key, order = sort_items[0]
+    self.docs = sorted(self.docs,
+                       key=lambda x: x[key],
+                       reverse=(order != ASCENDING))
+    return self
+
+  def limit(self, num):
+    return _MockMongoColl(self.docs[0:num])
+
+  def count_documents(self, filter):
+    return len(self._filter(filter))
+
+  def __getitem__(self, index):
+    return self.docs[index]
+
+
+class _MockMongoDb(object):
+  """Fake Mongo Db."""
+
+  def __init__(self, docs):
+    self.docs = docs
+
+  def __getitem__(self, coll_name):
+    return _MockMongoColl(self.docs)
+
+  def command(self, command, *args, **kwargs):
+    if command == 'collstats':
+      return {'size': 5, 'avgSize': 1}
+    elif command == 'splitVector':
+      return self.get_split_key(command, *args, **kwargs)
+
+  def get_split_key(self, command, ns, min, max, maxChunkSize, **kwargs):
+    # simulate mongo db splitVector command, return split keys based on chunk
+    # size, assuming every doc is of size 1mb
+    start_id = min['_id']
 
 Review comment:
   These are the argument keys required by the mongo client; I don't think it 
is easy to change them.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290599)
Time Spent: 7h 40m  (was: 7.5h)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the quer

[jira] [Created] (BEAM-7921) Add Schema support for Tensorflow

2019-08-07 Thread Neville Li (JIRA)
Neville Li created BEAM-7921:


 Summary: Add Schema support for Tensorflow
 Key: BEAM-7921
 URL: https://issues.apache.org/jira/browse/BEAM-7921
 Project: Beam
  Issue Type: Improvement
  Components: dsl-sql
Affects Versions: 2.14.0
Reporter: Neville Li


Similar to BEAM-5807, Tensorflow's de facto file format is {{TFRecord}}s with 
{{Example}} proto payload and its own schema.proto. We already have 
{{TFRecordIO}} support. We need to implement:

* Conversion between Beam and TF schema
* Conversion between Beam Row and TF Example proto
* {{TFRecordTableProvider}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7922) Storage API reads in BigQueryTableProvider

2019-08-07 Thread Neville Li (JIRA)
Neville Li created BEAM-7922:


 Summary: Storage API reads in BigQueryTableProvider
 Key: BEAM-7922
 URL: https://issues.apache.org/jira/browse/BEAM-7922
 Project: Beam
  Issue Type: Improvement
  Components: dsl-sql
Reporter: Neville Li
 Fix For: 2.14.0


Looks like {{BigQueryTable.buildIOReader}} is reading the entire table.
It could probably be optimized to use the BigQuery Storage API with 
{{withReadOptions}} and infer the "selected fields" column projection from the {{Schema}}?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7389) Colab examples for element-wise transforms (Python)

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7389?focusedWorklogId=290619&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290619
 ]

ASF GitHub Bot logged work on BEAM-7389:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:33
Start Date: 07/Aug/19 17:33
Worklog Time Spent: 10m 
  Work Description: davidcavazos commented on pull request #9289: 
[BEAM-7389] Add code examples for ToString page
URL: https://github.com/apache/beam/pull/9289
 
 
   Adding code samples for the ToString page.
   
   R: @rosetn [website]
   R: @aaltay [approval]
   Can you take a look at this whenever you have a chance? Thanks!
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [x] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [x] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   Post-Commit Tests Status (on master branch)
   

   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)
   Python | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.

[jira] [Work logged] (BEAM-7912) Optimize GroupIntoBatches for batch Dataflow pipelines

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7912?focusedWorklogId=290620&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290620
 ]

ASF GitHub Bot logged work on BEAM-7912:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:38
Start Date: 07/Aug/19 17:38
Worklog Time Spent: 10m 
  Work Description: Ardagan commented on pull request #9280: [BEAM-7912] 
Optimize GroupIntoBatches for batch Dataflow pipelines.
URL: https://github.com/apache/beam/pull/9280#discussion_r311677973
 
 

 ##
 File path: 
runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
 ##
 @@ -422,6 +428,14 @@ protected DataflowRunner(DataflowPipelineOptions options) 
{
   // Dataflow Streaming runner overrides the SPLITTABLE_PROCESS_KEYED 
transform
   // natively in the Dataflow service.
 } else {
+  overridesBuilder
 
 Review comment:
   Can you extract the whole if/else, or the contents of the if/else branches, 
into separate methods please? The first if branch is big enough to be worth 
moving to a separate method/class.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290620)
Time Spent: 50m  (was: 40m)

> Optimize GroupIntoBatches for batch Dataflow pipelines
> --
>
> Key: BEAM-7912
> URL: https://issues.apache.org/jira/browse/BEAM-7912
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Luke Cwik
>Assignee: Luke Cwik
>Priority: Minor
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The GroupIntoBatches transform can be significantly optimized on Dataflow 
> since it always ensures that a key K appears in only one bundle after a 
> GroupByKey. This removes the usage of state and timers in the generic 
> GroupIntoBatches transform.
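
A plain-Python sketch of the optimization idea (hypothetical function, not the
Dataflow override itself): once a GroupByKey guarantees that all values for a
key K land in a single bundle, batching reduces to simple chunking with no
state or timers.

```python
def group_into_batches(grouped, batch_size):
    # grouped: mapping of key -> iterable of values, as after a GroupByKey
    # where each key appears in exactly one bundle.
    for key, values in grouped.items():
        batch = []
        for value in values:
            batch.append(value)
            if len(batch) == batch_size:
                yield key, batch
                batch = []
        if batch:  # emit the final partial batch
            yield key, batch
```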



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7389) Colab examples for element-wise transforms (Python)

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7389?focusedWorklogId=290621&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290621
 ]

ASF GitHub Bot logged work on BEAM-7389:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:43
Start Date: 07/Aug/19 17:43
Worklog Time Spent: 10m 
  Work Description: davidcavazos commented on issue #9289: [BEAM-7389] Add 
code examples for ToString page
URL: https://github.com/apache/beam/pull/9289#issuecomment-519199731
 
 
   Staged: 
http://apache-beam-website-pull-requests.storage.googleapis.com/9289/documentation/transforms/python/elementwise/tostring/index.html
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290621)
Time Spent: 33h 40m  (was: 33.5h)

> Colab examples for element-wise transforms (Python)
> ---
>
> Key: BEAM-7389
> URL: https://issues.apache.org/jira/browse/BEAM-7389
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Rose Nguyen
>Assignee: David Cavazos
>Priority: Minor
>  Time Spent: 33h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7912) Optimize GroupIntoBatches for batch Dataflow pipelines

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7912?focusedWorklogId=290622&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290622
 ]

ASF GitHub Bot logged work on BEAM-7912:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:44
Start Date: 07/Aug/19 17:44
Worklog Time Spent: 10m 
  Work Description: lukecwik commented on pull request #9280: [BEAM-7912] 
Optimize GroupIntoBatches for batch Dataflow pipelines.
URL: https://github.com/apache/beam/pull/9280#discussion_r311680855
 
 

 ##
 File path: 
runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
 ##
 @@ -422,6 +428,14 @@ protected DataflowRunner(DataflowPipelineOptions options) 
{
   // Dataflow Streaming runner overrides the SPLITTABLE_PROCESS_KEYED 
transform
   // natively in the Dataflow service.
 } else {
+  overridesBuilder
 
 Review comment:
   Breaking up the contents would hinder understanding of what happens in this 
method, since many of these transforms have to be overridden in a specific 
order: X must be replaced before the Y replacement happens, because X replaces 
part of something that Y then replaces with something else.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290622)
Time Spent: 1h  (was: 50m)

> Optimize GroupIntoBatches for batch Dataflow pipelines
> --
>
> Key: BEAM-7912
> URL: https://issues.apache.org/jira/browse/BEAM-7912
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Luke Cwik
>Assignee: Luke Cwik
>Priority: Minor
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The GroupIntoBatches transform can be significantly optimized on Dataflow 
> since it always ensures that a key K appears in only one bundle after a 
> GroupByKey. This removes the usage of state and timers in the generic 
> GroupIntoBatches transform.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7912) Optimize GroupIntoBatches for batch Dataflow pipelines

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7912?focusedWorklogId=290623&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290623
 ]

ASF GitHub Bot logged work on BEAM-7912:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:46
Start Date: 07/Aug/19 17:46
Worklog Time Spent: 10m 
  Work Description: Ardagan commented on pull request #9280: [BEAM-7912] 
Optimize GroupIntoBatches for batch Dataflow pipelines.
URL: https://github.com/apache/beam/pull/9280#discussion_r311681717
 
 

 ##
 File path: 
runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
 ##
 @@ -422,6 +428,14 @@ protected DataflowRunner(DataflowPipelineOptions options) 
{
   // Dataflow Streaming runner overrides the SPLITTABLE_PROCESS_KEYED 
transform
   // natively in the Dataflow service.
 } else {
+  overridesBuilder
 
 Review comment:
   Makes sense.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290623)
Time Spent: 1h 10m  (was: 1h)

> Optimize GroupIntoBatches for batch Dataflow pipelines
> --
>
> Key: BEAM-7912
> URL: https://issues.apache.org/jira/browse/BEAM-7912
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Luke Cwik
>Assignee: Luke Cwik
>Priority: Minor
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The GroupIntoBatches transform can be significantly optimized on Dataflow 
> since it always ensures that a key K appears in only one bundle after a 
> GroupByKey. This removes the usage of state and timers in the generic 
> GroupIntoBatches transform.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-5878) Support DoFns with Keyword-only arguments in Python 3.

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5878?focusedWorklogId=290626&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290626
 ]

ASF GitHub Bot logged work on BEAM-5878:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:54
Start Date: 07/Aug/19 17:54
Worklog Time Spent: 10m 
  Work Description: lazylynx commented on issue #9237: [BEAM-5878] support 
DoFns with Keyword-only arguments
URL: https://github.com/apache/beam/pull/9237#issuecomment-519203798
 
 
   Run Python PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290626)
Time Spent: 6h 20m  (was: 6h 10m)

> Support DoFns with Keyword-only arguments in Python 3.
> --
>
> Key: BEAM-5878
> URL: https://issues.apache.org/jira/browse/BEAM-5878
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: yoshiki obata
>Priority: Minor
> Fix For: 2.16.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Python 3.0 [adds a possibility|https://www.python.org/dev/peps/pep-3102/] to 
> define functions with keyword-only arguments. 
> Currently Beam does not handle them correctly. [~ruoyu] pointed out [one 
> place|https://github.com/apache/beam/blob/a56ce43109c97c739fa08adca45528c41e3c925c/sdks/python/apache_beam/typehints/decorators.py#L118]
>  in our codebase that we should fix: in Python in 3.0 inspect.getargspec() 
> will fail on functions with keyword-only arguments, but a new method 
> [inspect.getfullargspec()|https://docs.python.org/3/library/inspect.html#inspect.getfullargspec]
>  supports them.
> There may be implications for our (best-effort) type-hints machinery.
> We should also add a Py3-only unit tests that covers DoFn's with keyword-only 
> arguments once Beam Python 3 tests are in a good shape.
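
The incompatibility is easy to demonstrate with a standalone sketch (unrelated
to the Beam codebase): `inspect.getfullargspec()` reports keyword-only
arguments, while the legacy `getargspec()` raised on them (and has since been
removed from the standard library).

```python
import inspect

def process(element, *, timestamp=None):
    # 'timestamp' is a keyword-only argument (PEP 3102).
    return element, timestamp

spec = inspect.getfullargspec(process)
# Positional args and keyword-only args are reported separately.
print(spec.args)        # ['element']
print(spec.kwonlyargs)  # ['timestamp']
```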



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7678) typehints with_output_types annotation doesn't work for stateful DoFn

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7678?focusedWorklogId=290628&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290628
 ]

ASF GitHub Bot logged work on BEAM-7678:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:56
Start Date: 07/Aug/19 17:56
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9238: [BEAM-7678] 
Fixes bug in output element_type generation in Kv PipelineVisitor
URL: https://github.com/apache/beam/pull/9238
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290628)
Time Spent: 2h  (was: 1h 50m)

> typehints with_output_types annotation doesn't work for stateful DoFn 
> --
>
> Key: BEAM-7678
> URL: https://issues.apache.org/jira/browse/BEAM-7678
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.13.0
>Reporter: Enrico Canzonieri
>Priority: Minor
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The output types typehints seem to be ignored when using a stateful DoFn, but 
> the same typehint works perfectly when used without state. This issue 
> prevents a custom Coder from being used because Beam will default to one of 
> the {{FastCoders}} (I believe Pickle).
> Example code:
> {code}
> @typehints.with_output_types(Message)
> class StatefulDoFn(DoFn):
>   COUNTER_STATE = BagStateSpec('counter', VarIntCoder())
>
>   def process(self, element, counter=DoFn.StateParam(COUNTER_STATE)):
>     (key, messages) = element
>     newMessage = Message()
>     return [newMessage]
> {code}
> The example code is just defining a stateful DoFn for python. The used runner 
> is the Flink 1.6.4 portable runner.
> Finally, overriding {{infer_output_type}} to return a 
> {{typehints.List[Message]}} solves the issue.
> Looking at the code, it seems to me that in 
> [https://github.com/apache/beam/blob/v2.13.0/sdks/python/apache_beam/pipeline.py#L643]
>  we do not take the typehints into account.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=290634&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290634
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:58
Start Date: 07/Aug/19 17:58
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added iBeam module
URL: https://github.com/apache/beam/pull/9278#discussion_r311687253
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,199 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of the current iBeam (interactive Beam) environment.
+
+The purpose of the module is to reduce the learning curve for iBeam users,
+provide a single place for imports, and add syntactic sugar for all iBeam
+components. It gives users the ability to manipulate the existing environment
+for interactive Beam, TODO(ningk) run an interactive pipeline on a selected
+runner as a normal pipeline, create a pipeline with the interactive runner,
+and visualize PCollections as bounded datasets.
+
+Note: iBeam works the same as normal Beam with the DirectRunner when not in an
+interactive environment such as JupyterLab or Jupyter Notebook. You can also
+run a pipeline created by iBeam as a normal Beam pipeline by calling
+run_pipeline() with the desired runner.
+"""
+
+import importlib
+
+import apache_beam as beam
+from apache_beam.runners.interactive import interactive_runner
+
+_ibeam_env = None
+
+
+def watch(watchable):
+  """Watches a watchable so that iBeam can understand your pipeline.
+
+  If you write your Beam pipeline in a notebook or in the __main__ module
+  directly, you don't have to instruct iBeam, since the __main__ module is
+  always watched by default. However, if your Beam pipeline is defined in some
+  module other than __main__, e.g., inside a class method or a unit test, you
+  can watch() the scope to instruct iBeam to apply magic to your pipeline when
+  running it interactively.
+
+  For example:
+
+    class Foo(object):
+      def build_pipeline(self):
+        p = create_pipeline()
+        init_pcoll = p | 'Init Create' >> beam.Create(range(10))
+        watch(locals())
+        return p
+
+    Foo().build_pipeline().run()
+
+  iBeam will cache init_pcoll on the first run. You can then use:
+
+    visualize(init_pcoll)
+
+  to visualize the data from init_pcoll once the pipeline is executed. If you
+  change the original pipeline by adding:
+
+    squares = init_pcoll | 'Square' >> beam.Map(lambda x: x*x)
+
+  and re-run the pipeline from the line you just added, squares will use the
+  cached init_pcoll data so you can have an interactive experience.
+
+  Currently the implementation mainly watches for PCollection variables defined
+  in user code. A watchable can be a dictionary of variable metadata such as
+  locals(), the str name of a module, a module object, or an instance of a
+  class. The variable can come from any scope, even local variables in a
+  method of a class defined in a module.
+
+  Below are all valid:
+
+    watch(__main__)  # if import __main__ is already invoked
+    watch('__main__')  # does not require invoking import __main__ beforehand
+    watch(self)  # inside a class
+    watch(SomeInstance())  # an instance of a class
+    watch(locals())  # inside a function, watching local variables within
+  """
+  current_env().watch(watchable)
+
+
+def create_pipeline(runner=None, options=None, argv=None):
+  """Creates a pipeline with the interactive runner by default.
+
+  You can use run_pipeline() provided within this module to execute the iBeam
+  pipeline with other runners.
+
+  Args:
+    runner (~apache_beam.runners.runner.PipelineRunner): An object of
+      type :class:`~apache_beam.runners.runner.PipelineRunner` that will be
+      used to execute the pipeline. For registered runners, the runner name
+      can be specified, otherwise a runner object must be supplied.
+    options (~apache_beam.options.pipeline_options.PipelineOptions):
+      A configured
+      :class:`~apache_beam.options.pipeline_options.Pipel

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=290630&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290630
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:58
Start Date: 07/Aug/19 17:58
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added iBeam module
URL: https://github.com/apache/beam/pull/9278#discussion_r311655571
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,199 @@
+def watch(watchable):
+  """Watches a watchable so that iBeam can understand your pipeline.
 
 Review comment:
   R: @rosetn 
   PTAL
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290630)
Time Spent: 1h  (was: 50m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to an existing pipeline for a new run, the already-executed part 
> of the pipeline doesn't need to be re-executed.
> A PCollection is a "leaf" when it is never used as input to any PTransform 
> in the pipeline.
> The problem with building caches and the pipeline to execute around "leaf" 
> PCollections is that when a PCollection is consumed by a sink with no 
> output, the pipeline built for execution will miss the subgraph generating 
> and consuming that PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching PCollections bound to user-defined variables and replacing 
> transforms with cache sources and sinks could build the pipeline to execute 
> properly under the interactive execution scenario. Also, a cached 
> PCollection can now be traced back to user code and used for user data 
> visualization if the user wants it.
> E.g.,
> {code:java}
> // ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>   options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(to
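The "leaf" definition above, and why a write-only pipeline ends up empty, can be sketched in plain Python (a hypothetical helper, independent of the actual Interactive Beam implementation):

```python
def leaf_pcollections(consumed_by):
    """Return the set of 'leaf' PCollections.

    consumed_by maps each PCollection name to the transforms that take it
    as input; a leaf is a PCollection that no transform consumes.
    """
    return {pcoll for pcoll, consumers in consumed_by.items() if not consumers}

# "ReadFromPubSub --> WriteToPubSub": the single PCollection is consumed by
# the sink, so there are no leaves and a leaf-based cache yields an empty
# pipeline to execute.
print(leaf_pcollections({'messages': ['Write']}))  # set()

# With a downstream transform producing an unconsumed output, only that
# output is a leaf; caching around user-bound variables would still capture
# 'messages'.
print(leaf_pcollections({'messages': ['Square'], 'squares': []}))  # {'squares'}
```

This is why the issue proposes keying the cache on user-bound variables rather than on graph leaves.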

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=290633&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290633
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:58
Start Date: 07/Aug/19 17:58
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added iBeam module
URL: https://github.com/apache/beam/pull/9278#discussion_r311667439
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,199 @@

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=290635&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290635
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:58
Start Date: 07/Aug/19 17:58
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added iBeam module
URL: https://github.com/apache/beam/pull/9278#discussion_r311667474
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,199 @@

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=290632&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290632
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:58
Start Date: 07/Aug/19 17:58
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added iBeam module
URL: https://github.com/apache/beam/pull/9278#discussion_r311662255
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,199 @@

[jira] [Work logged] (BEAM-7912) Optimize GroupIntoBatches for batch Dataflow pipelines

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7912?focusedWorklogId=290629&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290629
 ]

ASF GitHub Bot logged work on BEAM-7912:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:58
Start Date: 07/Aug/19 17:58
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on issue #9280: [BEAM-7912] Optimize 
GroupIntoBatches for batch Dataflow pipelines.
URL: https://github.com/apache/beam/pull/9280#issuecomment-519205117
 
 
   I'm wondering whether we want to do the overrides by default for Dataflow 
batch. Is it possible that a batch pipeline applies windowing?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290629)
Time Spent: 1h 20m  (was: 1h 10m)

> Optimize GroupIntoBatches for batch Dataflow pipelines
> --
>
> Key: BEAM-7912
> URL: https://issues.apache.org/jira/browse/BEAM-7912
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Luke Cwik
>Assignee: Luke Cwik
>Priority: Minor
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The GroupIntoBatches transform can be significantly optimized on Dataflow 
> since it always ensures that a key K appears in only one bundle after a 
> GroupByKey. This removes the usage of state and timers in the generic 
> GroupIntoBatches transform.
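The batching this optimization enables can be sketched in plain Python: once a GroupByKey guarantees that all values for a key arrive in one bundle, batching reduces to a simple loop with no state or timers (a hypothetical sketch, not the actual Dataflow override):

```python
def group_into_batches(grouped, batch_size):
    """Yield (key, batch) pairs with at most batch_size values per batch.

    `grouped` is an iterable of (key, values) pairs as produced by a
    GroupByKey. Because all values for a key are available together, no
    state or timers are needed to track partial batches across bundles.
    """
    for key, values in grouped:
        batch = []
        for value in values:
            batch.append(value)
            if len(batch) == batch_size:
                yield key, batch
                batch = []
        if batch:  # emit the final partial batch for this key
            yield key, batch

print(list(group_into_batches([('a', [1, 2, 3, 4, 5])], 2)))
# [('a', [1, 2]), ('a', [3, 4]), ('a', [5])]
```

The generic transform needs BagState plus a timer to flush partial batches precisely because, on other runners, a key's values may be split across bundles.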



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=290631&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290631
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:58
Start Date: 07/Aug/19 17:58
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9278: [BEAM-7760] 
Added iBeam module
URL: https://github.com/apache/beam/pull/9278#discussion_r311661371
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,199 @@
+def create_pipeline(runner=None, options=None, argv=None):
 
 Review comment:
   I'll make the change to only use the interactive runner (with the direct 
runner as the default underlying runner) for now.
   
   The function basically creates a pipeline object that is interactive. 
Currently we can only provide an interactive experience if the underlying 
runner is DirectRunner (which makes the code equivalent to 
`beam.Pipeline(interactive_runner.InteractiveRunner(), ...)`).
   Yeah, you're right, this is confusing. I'll remove the runner arg from the 
signature. In the future, when more underlying runners are supported, we can 
add back the runner arg and set the runner in the Int

[jira] [Work logged] (BEAM-7912) Optimize GroupIntoBatches for batch Dataflow pipelines

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7912?focusedWorklogId=290636&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290636
 ]

ASF GitHub Bot logged work on BEAM-7912:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:59
Start Date: 07/Aug/19 17:59
Worklog Time Spent: 10m 
  Work Description: aaltay commented on issue #9280: [BEAM-7912] Optimize 
GroupIntoBatches for batch Dataflow pipelines.
URL: https://github.com/apache/beam/pull/9280#issuecomment-519205745
 
 
   The same optimization could be applied to python as well. 
(https://github.com/apache/beam/blob/bc2c6ff5d4a464a4103db4f9835bac2e42258771/sdks/python/apache_beam/transforms/util.py#L690).
 You can keep the JIRA open after this change for that.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290636)
Time Spent: 1.5h  (was: 1h 20m)

> Optimize GroupIntoBatches for batch Dataflow pipelines
> --
>
> Key: BEAM-7912
> URL: https://issues.apache.org/jira/browse/BEAM-7912
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Luke Cwik
>Assignee: Luke Cwik
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The GroupIntoBatches transform can be significantly optimized on Dataflow 
> since it always ensures that a key K appears in only one bundle after a 
> GroupByKey. This removes the usage of state and timers in the generic 
> GroupIntoBatches transform.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7860?focusedWorklogId=290638&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290638
 ]

ASF GitHub Bot logged work on BEAM-7860:


Author: ASF GitHub Bot
Created on: 07/Aug/19 17:59
Start Date: 07/Aug/19 17:59
Worklog Time Spent: 10m 
  Work Description: udim commented on pull request #9240: [BEAM-7860] 
Python Datastore: fix key sort order
URL: https://github.com/apache/beam/pull/9240#discussion_r311687900
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/datastore/v1new/query_splitter.py
 ##
 @@ -128,10 +131,57 @@ def _create_scatter_query(query, num_splits):
   return scatter_query
 
 
+class IdOrName(object):
+  """Represents an ID or name of a Datastore key.
+
+  Implements sort ordering: by ID, then by name, keys with IDs before those
+  with names.
+  """
+  def __init__(self, id_or_name):
+    self.id_or_name = id_or_name
+    if isinstance(id_or_name, (str, unicode)):
+      self.id = None
+      self.name = id_or_name
+    elif isinstance(id_or_name, (int, long)):
+      self.id = id_or_name
+      self.name = None
+    else:
+      raise TypeError('Unexpected type of id_or_name: %s' % id_or_name)
+
+  def __lt__(self, other):
+    if not isinstance(other, IdOrName):
+      return super(IdOrName, self).__lt__(other)
+
+    if self.id is not None:
+      if other.id is None:
+        return True
+      else:
+        return self.id < other.id
+
+    if other.id is not None:
+      return False
+
+    return self.name < other.name
+
+  def __eq__(self, other):
+    if not isinstance(other, IdOrName):
+      return super(IdOrName, self).__eq__(other)
+    return self.id == other.id and self.name == other.name
+
+  def __hash__(self):
+    return hash((self.id, self.name))
+
+
 def client_key_sort_key(client_key):
   """Key function for sorting lists of ``google.cloud.datastore.key.Key``."""
-  return [client_key.project, client_key.namespace or ''] + [
-  str(element) for element in client_key.flat_path]
+  sort_key = [client_key.project, client_key.namespace or '']
+  flat_path = list(client_key.flat_path)
 
 Review comment:
   Done
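For context, a minimal standalone sketch of the ordering this comparator implements (re-typed here for Python 3 as an illustration; it is not imported from the Beam codebase): keys with numeric IDs sort among themselves and before all keys with string names.

```python
from functools import total_ordering

@total_ordering
class IdOrName:
    """Sort key for a Datastore key's id-or-name: IDs first, then names."""
    def __init__(self, id_or_name):
        if isinstance(id_or_name, str):
            self.id, self.name = None, id_or_name
        elif isinstance(id_or_name, int):
            self.id, self.name = id_or_name, None
        else:
            raise TypeError('Unexpected type: %r' % (id_or_name,))

    def __eq__(self, other):
        return (self.id, self.name) == (other.id, other.name)

    def __lt__(self, other):
        if self.id is not None:
            # Any numeric ID precedes any name; IDs compare numerically.
            return other.id is None or self.id < other.id
        # Both are names: compare lexicographically.
        return other.id is None and self.name < other.name

    def __hash__(self):
        return hash((self.id, self.name))

keys = [IdOrName('b'), IdOrName(7), IdOrName('a'), IdOrName(3)]
print([k.id if k.id is not None else k.name for k in sorted(keys)])
# → [3, 7, 'a', 'b']
```

`functools.total_ordering` derives the remaining comparison methods from `__eq__` and `__lt__`, which keeps the sketch shorter than the diff above.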
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290638)
Time Spent: 1h 20m  (was: 1h 10m)

> v1new ReadFromDatastore returns duplicates if keys are of mixed types
> -
>
> Key: BEAM-7860
> URL: https://issues.apache.org/jira/browse/BEAM-7860
> Project: Beam
>  Issue Type: Bug
>  Components: io-python-gcp
>Affects Versions: 2.13.0
> Environment: Python 2.7
> Python 3.7
>Reporter: Niels Stender
>Assignee: Udi Meiri
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> In the presence of mixed type keys, v1new ReadFromDatastore may return 
> duplicate items. The attached example returns 4 records, not the expected 3.
>  
> {code:python}
> from __future__ import unicode_literals
> import apache_beam as beam
> from apache_beam.io.gcp.datastore.v1new.types import Key, Entity, Query
> from apache_beam.io.gcp.datastore.v1new import datastoreio
>
> config = dict(project='your-google-project', namespace='test')
>
> def test_mixed():
>     keys = [
>         Key(['mixed', '10038260-iperm_eservice'], **config),
>         Key(['mixed', 4812224868188160], **config),
>         Key(['mixed', '99152975-pointshop'], **config)
>     ]
>     entities = map(lambda key: Entity(key=key), keys)
>     with beam.Pipeline() as p:
>         (p
>          | beam.Create(entities)
>          | datastoreio.WriteToDatastore(project=config['project'])
>         )
>
>     query = Query(kind='mixed', **config)
>     with beam.Pipeline() as p:
>         (p
>          | datastoreio.ReadFromDatastore(query=query, num_splits=4)
>          | beam.io.WriteToText('tmp.txt', num_shards=1,
>                                shard_name_template='')
>         )
>
>     items = open('tmp.txt').read().strip().split('\n')
>     assert len(items) == 3, 'incorrect number of items'
> {code}
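A hedged reading of why mixed key types trigger duplicates: the old `client_key_sort_key` (quoted in the diff above) stringified every path element, so a numeric ID and a string name were compared lexicographically, which disagrees with Datastore's native order (all numeric IDs before all names). If query split points are computed in one order while Datastore serves ranges in the other, ranges can overlap and entities get read twice. A quick illustration with the IDs and names from the repro:

```python
keys = ['10038260-iperm_eservice', 4812224868188160, '99152975-pointshop']

# Old approach: every path element stringified, then compared lexicographically.
# The numeric ID lands *between* the two names ('1...' < '4...' < '9...').
old_order = sorted(keys, key=str)

# Datastore's native order: all numeric IDs sort before all names.
# (isinstance(e, str) is False for ints, so ints come first in the tuple sort.)
new_order = sorted(keys, key=lambda e: (isinstance(e, str), e))

print(old_order)
print(new_order)
```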



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types

2019-08-07 Thread Udi Meiri (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902366#comment-16902366
 ] 

Udi Meiri commented on BEAM-7860:
-

Hopefully the PR will be merged today.

> v1new ReadFromDatastore returns duplicates if keys are of mixed types
> -
>
> Key: BEAM-7860
> URL: https://issues.apache.org/jira/browse/BEAM-7860
> Project: Beam
>  Issue Type: Bug
>  Components: io-python-gcp
>Affects Versions: 2.13.0
> Environment: Python 2.7
> Python 3.7
>Reporter: Niels Stender
>Assignee: Udi Meiri
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In the presence of mixed type keys, v1new ReadFromDatastore may return 
> duplicate items. The attached example returns 4 records, not the expected 3.
>  
> {code:python}
> from __future__ import unicode_literals
> import apache_beam as beam
> from apache_beam.io.gcp.datastore.v1new.types import Key, Entity, Query
> from apache_beam.io.gcp.datastore.v1new import datastoreio
>
> config = dict(project='your-google-project', namespace='test')
>
> def test_mixed():
>     keys = [
>         Key(['mixed', '10038260-iperm_eservice'], **config),
>         Key(['mixed', 4812224868188160], **config),
>         Key(['mixed', '99152975-pointshop'], **config)
>     ]
>     entities = map(lambda key: Entity(key=key), keys)
>     with beam.Pipeline() as p:
>         (p
>          | beam.Create(entities)
>          | datastoreio.WriteToDatastore(project=config['project'])
>         )
>
>     query = Query(kind='mixed', **config)
>     with beam.Pipeline() as p:
>         (p
>          | datastoreio.ReadFromDatastore(query=query, num_splits=4)
>          | beam.io.WriteToText('tmp.txt', num_shards=1,
>                                shard_name_template='')
>         )
>
>     items = open('tmp.txt').read().strip().split('\n')
>     assert len(items) == 3, 'incorrect number of items'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7860?focusedWorklogId=290639&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290639
 ]

ASF GitHub Bot logged work on BEAM-7860:


Author: ASF GitHub Bot
Created on: 07/Aug/19 18:00
Start Date: 07/Aug/19 18:00
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #9240: [BEAM-7860] Python 
Datastore: fix key sort order
URL: https://github.com/apache/beam/pull/9240#issuecomment-519205882
 
 
   run python 2 postcommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290639)
Time Spent: 1.5h  (was: 1h 20m)

> v1new ReadFromDatastore returns duplicates if keys are of mixed types
> -
>
> Key: BEAM-7860
> URL: https://issues.apache.org/jira/browse/BEAM-7860
> Project: Beam
>  Issue Type: Bug
>  Components: io-python-gcp
>Affects Versions: 2.13.0
> Environment: Python 2.7
> Python 3.7
>Reporter: Niels Stender
>Assignee: Udi Meiri
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In the presence of mixed type keys, v1new ReadFromDatastore may return 
> duplicate items. The attached example returns 4 records, not the expected 3.
>  
> {code:python}
> from __future__ import unicode_literals
> import apache_beam as beam
> from apache_beam.io.gcp.datastore.v1new.types import Key, Entity, Query
> from apache_beam.io.gcp.datastore.v1new import datastoreio
>
> config = dict(project='your-google-project', namespace='test')
>
> def test_mixed():
>     keys = [
>         Key(['mixed', '10038260-iperm_eservice'], **config),
>         Key(['mixed', 4812224868188160], **config),
>         Key(['mixed', '99152975-pointshop'], **config)
>     ]
>     entities = map(lambda key: Entity(key=key), keys)
>     with beam.Pipeline() as p:
>         (p
>          | beam.Create(entities)
>          | datastoreio.WriteToDatastore(project=config['project'])
>         )
>
>     query = Query(kind='mixed', **config)
>     with beam.Pipeline() as p:
>         (p
>          | datastoreio.ReadFromDatastore(query=query, num_splits=4)
>          | beam.io.WriteToText('tmp.txt', num_shards=1,
>                                shard_name_template='')
>         )
>
>     items = open('tmp.txt').read().strip().split('\n')
>     assert len(items) == 3, 'incorrect number of items'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7389) Colab examples for element-wise transforms (Python)

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7389?focusedWorklogId=290644&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290644
 ]

ASF GitHub Bot logged work on BEAM-7389:


Author: ASF GitHub Bot
Created on: 07/Aug/19 18:01
Start Date: 07/Aug/19 18:01
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9276: [BEAM-7389] Add 
code examples for MapTuple and FlatMapTuple
URL: https://github.com/apache/beam/pull/9276
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290644)
Time Spent: 33h 50m  (was: 33h 40m)

> Colab examples for element-wise transforms (Python)
> ---
>
> Key: BEAM-7389
> URL: https://issues.apache.org/jira/browse/BEAM-7389
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Rose Nguyen
>Assignee: David Cavazos
>Priority: Minor
>  Time Spent: 33h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7389) Colab examples for element-wise transforms (Python)

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7389?focusedWorklogId=290647&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290647
 ]

ASF GitHub Bot logged work on BEAM-7389:


Author: ASF GitHub Bot
Created on: 07/Aug/19 18:03
Start Date: 07/Aug/19 18:03
Worklog Time Spent: 10m 
  Work Description: davidcavazos commented on issue #9260: [BEAM-7389] Add 
code examples for FlatMap page
URL: https://github.com/apache/beam/pull/9260#issuecomment-519207280
 
 
   Makes sense :)
   
   Thank you!
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290647)
Time Spent: 34h  (was: 33h 50m)

> Colab examples for element-wise transforms (Python)
> ---
>
> Key: BEAM-7389
> URL: https://issues.apache.org/jira/browse/BEAM-7389
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Rose Nguyen
>Assignee: David Cavazos
>Priority: Minor
>  Time Spent: 34h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7915) show cross-language validate runner Flink badge on github PR template

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7915?focusedWorklogId=290648&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290648
 ]

ASF GitHub Bot logged work on BEAM-7915:


Author: ASF GitHub Bot
Created on: 07/Aug/19 18:04
Start Date: 07/Aug/19 18:04
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9282: [BEAM-7915] 
show cross-language validate runner Flink badge on github PR template
URL: https://github.com/apache/beam/pull/9282
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290648)
Time Spent: 0.5h  (was: 20m)

> show cross-language validate runner Flink badge on github PR template
> -
>
> Key: BEAM-7915
> URL: https://issues.apache.org/jira/browse/BEAM-7915
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> show cross-language validate runner Flink badge on github template



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7921) Add Schema support for Tensorflow

2019-08-07 Thread Neville Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Li updated BEAM-7921:
-
Description: 
Similar to BEAM-5807, TensorFlow's de facto storage format is {{TFRecord}} files 
with {{Example}} proto payload and its own schema.proto. We already have 
{{TFRecordIO}} support. Need to implement:
 * Conversion between Beam and TF schema
 * Conversion between Beam Row and TF Example proto
 * {{TFRecordTableProvider}}

[https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto]

[https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto]
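
As a rough sketch of the first bullet (an assumed mapping for illustration, not actual Beam or TF API): an {{Example}} proto feature carries one of exactly three value types (bytes_list, int64_list, float_list), so a Beam-to-TF schema conversion largely reduces to mapping Beam field types onto those three kinds. The table and function names below are hypothetical:

{code:python}
# Hypothetical Beam-field-type -> TF Example feature-kind mapping.
# The type names on the left are illustrative labels, not Beam API constants.
BEAM_TO_TF_FEATURE = {
    'STRING': 'bytes_list',
    'BYTES': 'bytes_list',
    'INT32': 'int64_list',   # widened: Example protos only carry int64
    'INT64': 'int64_list',
    'FLOAT': 'float_list',
    'DOUBLE': 'float_list',  # narrowed: Example protos only carry float32
}

def to_tf_feature_kind(beam_field_type):
    try:
        return BEAM_TO_TF_FEATURE[beam_field_type]
    except KeyError:
        raise ValueError(
            'No TF Example encoding for Beam type %s' % beam_field_type)

print(to_tf_feature_kind('INT32'))  # → int64_list
{code}

Nested and repeated fields would need additional handling (e.g. repeated scalars map onto a multi-element list of the same kind), which is where most of the real conversion work lies.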

  was:
Similar to BEAM-5807, Tensorflow's defacto file format is {{TFRecord}}s with 
{{Example}} proto payload and its own schema.proto. We already have 
{{TFRecordIO}} support. Need to implement:

* Conversion between Beam and TF schema
* Conversion between Beam Row and TF Example proto
* {{TFRecordTableProvider}}


> Add Schema support for Tensorflow
> -
>
> Key: BEAM-7921
> URL: https://issues.apache.org/jira/browse/BEAM-7921
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Affects Versions: 2.14.0
>Reporter: Neville Li
>Priority: Major
>
> Similar to BEAM-5807, TensorFlow's de facto storage format is {{TFRecord}} 
> files with {{Example}} proto payload and its own schema.proto. We already 
> have {{TFRecordIO}} support. Need to implement:
>  * Conversion between Beam and TF schema
>  * Conversion between Beam Row and TF Example proto
>  * {{TFRecordTableProvider}}
> [https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto]
> [https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7860) v1new ReadFromDatastore returns duplicates if keys are of mixed types

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7860?focusedWorklogId=290653&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290653
 ]

ASF GitHub Bot logged work on BEAM-7860:


Author: ASF GitHub Bot
Created on: 07/Aug/19 18:07
Start Date: 07/Aug/19 18:07
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #9240: [BEAM-7860] Python 
Datastore: fix key sort order
URL: https://github.com/apache/beam/pull/9240#issuecomment-519208647
 
 
   run python 3.6 postcommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290653)
Time Spent: 1h 40m  (was: 1.5h)

> v1new ReadFromDatastore returns duplicates if keys are of mixed types
> -
>
> Key: BEAM-7860
> URL: https://issues.apache.org/jira/browse/BEAM-7860
> Project: Beam
>  Issue Type: Bug
>  Components: io-python-gcp
>Affects Versions: 2.13.0
> Environment: Python 2.7
> Python 3.7
>Reporter: Niels Stender
>Assignee: Udi Meiri
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> In the presence of mixed type keys, v1new ReadFromDatastore may return 
> duplicate items. The attached example returns 4 records, not the expected 3.
>  
> {code:python}
> from __future__ import unicode_literals
> import apache_beam as beam
> from apache_beam.io.gcp.datastore.v1new.types import Key, Entity, Query
> from apache_beam.io.gcp.datastore.v1new import datastoreio
>
> config = dict(project='your-google-project', namespace='test')
>
> def test_mixed():
>     keys = [
>         Key(['mixed', '10038260-iperm_eservice'], **config),
>         Key(['mixed', 4812224868188160], **config),
>         Key(['mixed', '99152975-pointshop'], **config)
>     ]
>     entities = map(lambda key: Entity(key=key), keys)
>     with beam.Pipeline() as p:
>         (p
>          | beam.Create(entities)
>          | datastoreio.WriteToDatastore(project=config['project'])
>         )
>
>     query = Query(kind='mixed', **config)
>     with beam.Pipeline() as p:
>         (p
>          | datastoreio.ReadFromDatastore(query=query, num_splits=4)
>          | beam.io.WriteToText('tmp.txt', num_shards=1,
>                                shard_name_template='')
>         )
>
>     items = open('tmp.txt').read().strip().split('\n')
>     assert len(items) == 3, 'incorrect number of items'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7921) Add Schema support for Tensorflow

2019-08-07 Thread Neville Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Li updated BEAM-7921:
-
Description: 
Similar to BEAM-5807, TensorFlow's de facto storage format is {{TFRecord}} files 
with {{Example}} proto payload and its own schema.proto. We already have 
{{TFRecordIO}} support. Need to implement:
 * Conversion between Beam and TF schema
 * Conversion between Beam Row and TF Example proto
 * {{TFRecordTableProvider}}

[https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto]

[https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto]

 

Also it seems the metadata protos are not published as Java artifacts:

[https://github.com/tensorflow/metadata/issues/5]

  was:
Similar to BEAM-5807, Tensorflow's defacto storage format is {{TFRecord}} files 
with {{Example}} proto payload and its own schema.proto. We already have 
{{TFRecordIO}} support. Need to implement:
 * Conversion between Beam and TF schema
 * Conversion between Beam Row and TF Example proto
 * {{TFRecordTableProvider}}

[https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto]

[https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto]


> Add Schema support for Tensorflow
> -
>
> Key: BEAM-7921
> URL: https://issues.apache.org/jira/browse/BEAM-7921
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Affects Versions: 2.14.0
>Reporter: Neville Li
>Priority: Major
>
> Similar to BEAM-5807, TensorFlow's de facto storage format is {{TFRecord}} 
> files with {{Example}} proto payload and its own schema.proto. We already 
> have {{TFRecordIO}} support. Need to implement:
>  * Conversion between Beam and TF schema
>  * Conversion between Beam Row and TF Example proto
>  * {{TFRecordTableProvider}}
> [https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto]
> [https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto]
>  
> Also it seems the metadata protos are not published as Java artifacts:
> [https://github.com/tensorflow/metadata/issues/5]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7389) Colab examples for element-wise transforms (Python)

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7389?focusedWorklogId=290666&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290666
 ]

ASF GitHub Bot logged work on BEAM-7389:


Author: ASF GitHub Bot
Created on: 07/Aug/19 18:17
Start Date: 07/Aug/19 18:17
Worklog Time Spent: 10m 
  Work Description: davidcavazos commented on issue #9265: [BEAM-7389] Add 
code examples for Map page
URL: https://github.com/apache/beam/pull/9265#issuecomment-519212416
 
 
   Added example for `MapTuple` from #9276 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290666)
Time Spent: 34h 10m  (was: 34h)

> Colab examples for element-wise transforms (Python)
> ---
>
> Key: BEAM-7389
> URL: https://issues.apache.org/jira/browse/BEAM-7389
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Rose Nguyen
>Assignee: David Cavazos
>Priority: Minor
>  Time Spent: 34h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7389) Colab examples for element-wise transforms (Python)

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7389?focusedWorklogId=290667&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290667
 ]

ASF GitHub Bot logged work on BEAM-7389:


Author: ASF GitHub Bot
Created on: 07/Aug/19 18:18
Start Date: 07/Aug/19 18:18
Worklog Time Spent: 10m 
  Work Description: davidcavazos commented on issue #9260: [BEAM-7389] Add 
code examples for FlatMap page
URL: https://github.com/apache/beam/pull/9260#issuecomment-519212535
 
 
   Added example for `FlatMapTuple` from #9276 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290667)
Time Spent: 34h 20m  (was: 34h 10m)

> Colab examples for element-wise transforms (Python)
> ---
>
> Key: BEAM-7389
> URL: https://issues.apache.org/jira/browse/BEAM-7389
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Rose Nguyen
>Assignee: David Cavazos
>Priority: Minor
>  Time Spent: 34h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7923) Interactive Beam

2019-08-07 Thread Ning Kang (JIRA)
Ning Kang created BEAM-7923:
---

 Summary: Interactive Beam
 Key: BEAM-7923
 URL: https://issues.apache.org/jira/browse/BEAM-7923
 Project: Beam
  Issue Type: New Feature
  Components: examples-python
Reporter: Ning Kang
Assignee: Ning Kang


This is the top-level ticket for all efforts leveraging [interactive 
Beam|https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive].

As development proceeds, blocking tickets will be added to this one.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7912) Optimize GroupIntoBatches for batch Dataflow pipelines

2019-08-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7912?focusedWorklogId=290679&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290679
 ]

ASF GitHub Bot logged work on BEAM-7912:


Author: ASF GitHub Bot
Created on: 07/Aug/19 18:30
Start Date: 07/Aug/19 18:30
Worklog Time Spent: 10m 
  Work Description: lukecwik commented on issue #9280: [BEAM-7912] Optimize 
GroupIntoBatches for batch Dataflow pipelines.
URL: https://github.com/apache/beam/pull/9280#issuecomment-519216916
 
 
   > I'm wondering whether we want to do the overrides by default for dataflow 
batch. Is it possible that a batch pipeline applies window?
   
   The override is currently set up to be the default for all batch Dataflow 
jobs regardless of the windowing strategy. We can't make this the default for 
all runners since a runner may execute in a way where it produces multiple 
panes across multiple bundles for a batch pipeline (we could do this 
optimization if we knew the windowing strategy had at most one firing per key 
and window).
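A hedged sketch of the optimization being described (plain-Python toy model, not the actual Dataflow override): when a runner guarantees that each key appears in exactly one bundle after a GroupByKey, batching becomes simple per-key slicing, with no state cells or expiry timers needed.

```python
from itertools import islice

def group_by_key(pairs):
    """Toy GroupByKey: afterwards, all of a key's values live together."""
    grouped = {}
    for k, v in pairs:
        grouped.setdefault(k, []).append(v)
    return grouped.items()

def group_into_batches(pairs, batch_size):
    # Because GroupByKey already co-located every value for a key, batching
    # is plain slicing -- no per-key state or timers as in the generic
    # GroupIntoBatches transform.
    for key, values in group_by_key(pairs):
        it = iter(values)
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                break
            yield key, batch

pairs = [('a', 1), ('b', 2), ('a', 3), ('a', 4), ('b', 5)]
print(list(group_into_batches(pairs, 2)))
# → [('a', [1, 3]), ('a', [4]), ('b', [2, 5])]
```

The generic transform needs state and timers precisely because it cannot assume this co-location: another runner might deliver a key's values across several bundles or firings.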
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290679)
Time Spent: 1h 40m  (was: 1.5h)

> Optimize GroupIntoBatches for batch Dataflow pipelines
> --
>
> Key: BEAM-7912
> URL: https://issues.apache.org/jira/browse/BEAM-7912
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Luke Cwik
>Assignee: Luke Cwik
>Priority: Minor
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> The GroupIntoBatches transform can be significantly optimized on Dataflow 
> since it always ensures that a key K appears in only one bundle after a 
> GroupByKey. This removes the usage of state and timers in the generic 
> GroupIntoBatches transform.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

