[jira] [Work logged] (BEAM-8318) Add a num_threads_per_worker pipeline option to Python SDK.
[ https://issues.apache.org/jira/browse/BEAM-8318?focusedWorklogId=319321&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319321 ] ASF GitHub Bot logged work on BEAM-8318: Author: ASF GitHub Bot Created on: 27/Sep/19 02:43 Start Date: 27/Sep/19 02:43 Worklog Time Spent: 10m Work Description: chamikaramj commented on pull request #9675: [BEAM-8318] Adds a pipeline option to Python SDK for controlling the number of threads per worker. URL: https://github.com/apache/beam/pull/9675 This will be similar to the following option already available in the Java SDK: https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineDebugOptions.java#L178 Currently, this only works for DataflowRunner on the Fn API path. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
Post-Commit Tests Status (on master branch): per-runner build-status badge table (Go/Java/Python across Apex, Dataflow, Flink, Gearpump, Samza, and Spark) linking to builds.apache.org.
[jira] [Created] (BEAM-8318) Add a num_threads_per_worker pipeline option to Python SDK.
Chamikara Madhusanka Jayalath created BEAM-8318: --- Summary: Add a num_threads_per_worker pipeline option to Python SDK. Key: BEAM-8318 URL: https://issues.apache.org/jira/browse/BEAM-8318 Project: Beam Issue Type: Improvement Components: sdk-py-core Reporter: Chamikara Madhusanka Jayalath Assignee: Chamikara Madhusanka Jayalath Similar to what we have here for Java: [https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineDebugOptions.java#L178] -- This message was sent by Atlassian Jira (v8.3.4#803005)
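A minimal sketch of how such an option could be registered. Beam's Python pipeline options are backed by argparse, so the sketch below uses a bare argparse parser; the flag name `--number_of_worker_harness_threads`, its default, and the help text are assumptions for illustration, not necessarily what the PR finally shipped.

```python
import argparse

# Illustrative stand-in for a Beam PipelineOptions subclass registering a flag.
# Assumption: the option limits concurrent work-item threads per worker and is
# only honored by DataflowRunner on the Fn API path, mirroring the Java option.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--number_of_worker_harness_threads",
    type=int,
    default=None,  # None lets the runner pick its own default
    help="Limit on the number of concurrent work-item threads per worker.")

# Unrecognized flags are tolerated, as Beam's option parsing does.
opts, _ = parser.parse_known_args(["--number_of_worker_harness_threads", "12"])
```

A user would then pass the flag on the pipeline command line alongside the other Dataflow options.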
[jira] [Work logged] (BEAM-7455) Improve Avro IO integration test coverage on Python 3.
[ https://issues.apache.org/jira/browse/BEAM-7455?focusedWorklogId=319296&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319296 ] ASF GitHub Bot logged work on BEAM-7455: Author: ASF GitHub Bot Created on: 27/Sep/19 00:54 Start Date: 27/Sep/19 00:54 Worklog Time Spent: 10m Work Description: tvalentyn commented on pull request #9466: [DO NOT MERGE] [BEAM-7455] Improve Avro IO integration test coverage on Python 3. URL: https://github.com/apache/beam/pull/9466 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319296) Time Spent: 6h 10m (was: 6h) > Improve Avro IO integration test coverage on Python 3. > -- > > Key: BEAM-7455 > URL: https://issues.apache.org/jira/browse/BEAM-7455 > Project: Beam > Issue Type: Sub-task > Components: io-py-avro >Reporter: Valentyn Tymofieiev >Assignee: Valentyn Tymofieiev >Priority: Major > Time Spent: 6h 10m > Remaining Estimate: 0h > > It seems that we don't have an integration test for Avro IO on Python 3: > fastavro_it_test [1] depends on both avro and fastavro; however, the avro package > currently does not work with Beam on Python 3, so we don't have an > integration test that exercises Avro IO on Python 3. > We should add an integration test for Avro IO that does not need both > libraries at the same time, and instead can run using either library. > [~frederik] is this something you could help with? > cc: [~chamikara] [~Juta] > [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/fastavro_it_test.py -- This message was sent by Atlassian Jira (v8.3.4#803005)
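The ticket asks for a test that can run with either Avro library rather than requiring both. One way to sketch that selection (helper name is made up; this is not the test's actual code) is a try-import chain that picks whichever implementation is installed:

```python
import importlib

def pick_avro_impl(preferred=("fastavro", "avro")):
    """Return the name of the first importable Avro library, or None.

    Illustrative sketch: an integration test could use this to parameterize
    itself over whichever library is present instead of importing both.
    """
    for name in preferred:
        try:
            importlib.import_module(name)
            return name
        except ImportError:
            continue
    return None

impl = pick_avro_impl()
```

The test body would then branch on `impl` (or skip when it is None), so the Python 3 run no longer depends on the `avro` package being importable.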
[jira] [Work logged] (BEAM-8213) Run and report python tox tasks separately within Jenkins
[ https://issues.apache.org/jira/browse/BEAM-8213?focusedWorklogId=319287&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319287 ] ASF GitHub Bot logged work on BEAM-8213: Author: ASF GitHub Bot Created on: 27/Sep/19 00:05 Start Date: 27/Sep/19 00:05 Worklog Time Spent: 10m Work Description: markflyhigh commented on issue #9642: [BEAM-8213] Split up monolithic python preCommit tests on jenkins URL: https://github.com/apache/beam/pull/9642#issuecomment-535728199 I suggest moving this discussion to dev@ since we have had similar discussions before and many other people also have insights into this problem. For me, I don't see a big benefit to this split (1 job into 5 jobs). - The pain points you mentioned about pylint failures don't require this change. I agree with the approach of splitting out pylint alone. A similar thing is done in Java (RAT), and we could move pylint into that as well (or put it in a separate job). - For logging, Gradle scan organizes logs by task and provides a pretty good UI to surface the error. The link is on the Jenkins job page. Did you have a chance to explore that? - For efficiency, this split will **not shorten** the wall time of the precommit run (50-75 mins); on the contrary, it adds a requirement for 4 extra job slots. Given that these are precommit jobs and the run frequency is very likely high (triggered by each commit push and manually in PRs), it's likely to increase the precommit queue time. - I agree with Daniel's proposal in https://github.com/apache/beam/pull/9642#issuecomment-535241993 if we want to proceed with this change. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319287) Time Spent: 9h 40m (was: 9.5h) > Run and report python tox tasks separately within Jenkins > - > > Key: BEAM-8213 > URL: https://issues.apache.org/jira/browse/BEAM-8213 > Project: Beam > Issue Type: Improvement > Components: build-system >Reporter: Chad Dombrova >Priority: Major > Time Spent: 9h 40m > Remaining Estimate: 0h > > As a python developer, the speed and comprehensibility of the jenkins > PreCommit job could be greatly improved. > Here are some of the problems > - when a lint job fails, it's not reported in the test results summary, so > even though the job is marked as failed, I see "Test Result (no failures)" > which is quite confusing > - I have to wait for over an hour to discover the lint failed, which takes > about a minute to run on its own > - The logs are a jumbled mess of all the different tasks running on top of > each other > - The test results give no indication of which version of python they use. I > click on Test results, then the test module, then the test class, then I see > 4 tests named the same thing. I assume that the first is python 2.7, the > second is 3.5 and so on. It takes 5 clicks and then reading the log output > to know which version of python a single error pertains to, then I need to > repeat for each failure. This makes it very difficult to discover problems, > and deduce that they may have something to do with python version mismatches. > I believe the solution to this is to split up the single monolithic python > PreCommit job into sub-jobs (possibly using a pipeline with steps). 
This > would give us the following benefits: > - sub job results should become available as they finish, so for example, > lint results should be available very early on > - sub job results will be reported separately, and there will be a job for > each py2, py35, py36 and so on, so it will be clear when an error is related > to a particular python version > - sub jobs without reports, like docs and lint, will have their own failure > status and logs, so when they fail it will be more obvious what went wrong. > I'm happy to help out once I get some feedback on the desired way forward. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-876) Support schemaUpdateOption in BigQueryIO
[ https://issues.apache.org/jira/browse/BEAM-876?focusedWorklogId=319284&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319284 ] ASF GitHub Bot logged work on BEAM-876: --- Author: ASF GitHub Bot Created on: 26/Sep/19 23:56 Start Date: 26/Sep/19 23:56 Worklog Time Spent: 10m Work Description: chamikaramj commented on issue #9524: [BEAM-876] Support schemaUpdateOption in BigQueryIO URL: https://github.com/apache/beam/pull/9524#issuecomment-535727479 Thanks for the contribution. R: @pabloem will you be able to review? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319284) Time Spent: 40m (was: 0.5h) > Support schemaUpdateOption in BigQueryIO > > > Key: BEAM-876 > URL: https://issues.apache.org/jira/browse/BEAM-876 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Eugene Kirpichov >Assignee: canaan silberberg >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > BigQuery recently added support for updating the schema as a side effect of > the load job. > Here is the relevant API method in JobConfigurationLoad: > https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#setSchemaUpdateOptions(java.util.List) > BigQueryIO should support this too. See user request for this: > http://stackoverflow.com/questions/40333245/is-it-possible-to-update-schema-while-doing-a-load-into-an-existing-bigquery-tab -- This message was sent by Atlassian Jira (v8.3.4#803005)
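At the BigQuery REST level, the feature boils down to setting `schemaUpdateOptions` on the load-job configuration. A hedged sketch of building that configuration as a plain dict mirroring the `JobConfigurationLoad` resource (the helper function itself is invented for illustration, not Beam or BigQuery client code):

```python
# Valid values per the BigQuery JobConfigurationLoad reference.
ALLOWED_SCHEMA_UPDATE_OPTIONS = {"ALLOW_FIELD_ADDITION", "ALLOW_FIELD_RELAXATION"}

def load_job_config(table, schema_update_options=()):
    """Build a load-job configuration dict carrying schemaUpdateOptions.

    Sketch only: BigQueryIO would thread these options through to the load
    job it issues; schema updates are only meaningful for appending writes.
    """
    bad = set(schema_update_options) - ALLOWED_SCHEMA_UPDATE_OPTIONS
    if bad:
        raise ValueError("unsupported schemaUpdateOptions: %s" % sorted(bad))
    config = {"destinationTable": table, "writeDisposition": "WRITE_APPEND"}
    if schema_update_options:
        config["schemaUpdateOptions"] = sorted(schema_update_options)
    return config
```

With `ALLOW_FIELD_ADDITION` set, a load job may append rows whose schema adds new nullable fields to the destination table, which is exactly the Stack Overflow use case cited in the issue.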
[jira] [Comment Edited] (BEAM-8212) StatefulParDoFn creates GC timers for every record
[ https://issues.apache.org/jira/browse/BEAM-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937228#comment-16937228 ] Akshay Iyangar edited comment on BEAM-8212 at 9/26/19 11:19 PM:
{code:java}
public class TestDecodeTimer {

  @Test
  public void gctimerValue() throws IOException, ClassNotFoundException {
    StateNamespace stateNamespace =
        StateNamespaces.window(GlobalWindow.Coder.INSTANCE, GlobalWindow.INSTANCE);
    String GC_TIMER_ID = "__StatefulParDoGcTimerId";
    // timerInternals.setTimer(
    //     StateNamespaces.window(windowCoder, window), GC_TIMER_ID, gcTime, TimeDomain.EVENT_TIME);

    // Encode the timer id and namespace key, then find their hex representation.
    ByteArrayOutputStream outStream = new ByteArrayOutputStream();
    StringUtf8Coder.of().encode(GC_TIMER_ID, outStream);
    StringUtf8Coder.of().encode(stateNamespace.stringKey(), outStream);
    System.out.println("The output stream is: " + outStream.toString());
    // __StatefulParDoGcTimerId//
    String encode = BaseEncoding.base16().encode(outStream.toByteArray());
    System.out.println("The encoded string is " + encode);
    // 185F5F537461746566756C506172446F476354696D65724964022F2F

    // Everything after that prefix is the GC timer; encode the time domain too,
    // so it can be stripped from the end of the key.
    ByteArrayOutputStream outStream1 = new ByteArrayOutputStream();
    StringUtf8Coder.of().encode(TimeDomain.EVENT_TIME.toString(), outStream1);
    String encode1 = BaseEncoding.base16().encode(outStream1.toByteArray());
    System.out.println("The encoded1 string is " + encode1);
    // 0A4556454E545F54494D45
    System.out.println("Total length of the encoded key: " + outStream.size());

    // Example key taken from RocksDB:
    String decode = "008020C49BA0BCF7F901006A6176612E6E696F2E4865617042797465427565F20100010C0107313831303639000C0100185F5F537461746566756C506172446F476354696D65724964022F2F8020C49BA0BCF7F80A4556454E545F54494D45";
    // The timer is whatever sits between 185F5F537461746566756C506172446F476354696D65724964022F2F
    // and 0A4556454E545F54494D45, viz. 8020C49BA0BCF7F8.
    Instant timeDecode = InstantCoder.of().decode(
        new ByteArrayInputStream(BaseEncoding.base16().decode("8020C49BA0BCF7F8")));
    System.out.println("GC timer for Global Window is " + timeDecode);
    // 294247-01-10T04:00:54.775Z
    // This is nothing but +infinity, so these timers would never be cleaned up,
    // because the window never closes. Cross-verify:
    System.out.println("MAX value " + BoundedWindow.TIMESTAMP_MAX_VALUE);
    System.out.println("MAX: " + GlobalWindow.TIMESTAMP_MAX_VALUE);
  }
}
{code}
So I wrote a test to verify the values being generated for each of the events. I took one key from RocksDB to analyze, and the timer is +infinity, i.e. GlobalWindow.TIMESTAMP_MAX_VALUE, which makes sense as it's a global window. Also, I didn't see any keys associated with timers in the StatefulParDoFn:
{code:java}
rocksdb_ldb --db=db --column_family=_timer_state/event_beam-timer scan --max_keys=100 --key_hex
{code}
returned zero keys. I ran a big pipeline to see the effect of having the GC timers disabled: at the 1-hour mark, with a Global Window and RocksDB as the state backend, the pipeline had consumed 432 million records with node memory usage at roughly 50%. The node is a 32GB EKS node where I gave 15GB to the JVM. With the timers enabled, the same pipeline took 1 hr 30 mins to read 432 million records, with total node memory usage at 62%. So I think it is fair to assume that for global windows the timers can affect pipeline performance. [~mxm] and [~NathanHowell] ^^
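The hex value analyzed in the comment above can be cross-checked in plain Python. Beam's InstantCoder stores the millisecond timestamp shifted by Long.MIN_VALUE so the big-endian bytes sort chronologically; undoing the shift recovers the instant (the helper name here is made up):

```python
def decode_instant_millis(hex_bytes):
    """Decode a Beam InstantCoder-encoded big-endian long into epoch millis.

    InstantCoder shifts millis by Long.MIN_VALUE before encoding, so
    subtracting 2**63 undoes the shift.
    """
    shifted = int.from_bytes(bytes.fromhex(hex_bytes), byteorder="big")
    return shifted - 2**63

gc_timer_millis = decode_instant_millis("8020C49BA0BCF7F8")
# ~9.22e15 ms, the year-294247 instant printed by the test above. It appears
# to equal BoundedWindow.TIMESTAMP_MAX_VALUE (Long.MAX_VALUE microseconds
# expressed as millis) minus one day plus 1 ms, consistent with a GC timer
# set at the global window's max timestamp plus a small cleanup delay.
```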
[jira] [Work logged] (BEAM-7933) Adding timeout to JobServer grpc calls
[ https://issues.apache.org/jira/browse/BEAM-7933?focusedWorklogId=319263&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319263 ] ASF GitHub Bot logged work on BEAM-7933: Author: ASF GitHub Bot Created on: 26/Sep/19 22:51 Start Date: 26/Sep/19 22:51 Worklog Time Spent: 10m Work Description: ecanzonieri commented on issue #9673: [BEAM-7933] Add job server request timeout (default to 60 seconds) URL: https://github.com/apache/beam/pull/9673#issuecomment-535714867 R: @ibzib @aaltay This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319263) Time Spent: 20m (was: 10m) > Adding timeout to JobServer grpc calls > -- > > Key: BEAM-7933 > URL: https://issues.apache.org/jira/browse/BEAM-7933 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Affects Versions: 2.14.0 >Reporter: Enrico Canzonieri >Assignee: Enrico Canzonieri >Priority: Minor > Labels: portability > Time Spent: 20m > Remaining Estimate: 0h > > grpc calls to the JobServer from the Python SDK do not have timeouts. That > means that the call to pipeline.run() could hang forever if the JobServer is > not running (or failing to start). > E.g. > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/portable_runner.py#L307] > the call to Prepare() doesn't provide any timeout value, and the same applies > to other JobServer requests. > As part of this ticket we could add a default timeout of 60 seconds, as is the default timeout for the http client.
> Additionally, we could consider adding a --job-server-request-timeout to the > [PortableOptions|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/options/pipeline_options.py#L805] > class to be used in the JobServer interactions inside portable_runner.py. -- This message was sent by Atlassian Jira (v8.3.4#803005)
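The proposal above amounts to routing every job-server RPC through a default deadline. A hedged, stdlib-only sketch of that behaviour (the wrapper and names are hypothetical, not Beam code; it relies only on the fact that gRPC stub methods accept a `timeout` keyword argument):

```python
DEFAULT_JOB_SERVER_TIMEOUT_SECS = 60

def call_with_deadline(rpc, request, timeout=DEFAULT_JOB_SERVER_TIMEOUT_SECS):
    """Invoke an RPC-like callable with a default deadline.

    Callers pass timeout=None explicitly for requests that are expected to
    run long or hang by design (e.g. waiting on a job's state stream).
    """
    return rpc(request, timeout=timeout)

# Fake stub method standing in for e.g. JobServiceStub.Prepare, so the
# sketch is runnable without a job server.
def fake_prepare(request, timeout=None):
    return {"request": request, "timeout": timeout}

resp = call_with_deadline(fake_prepare, "prepare-request")
```

With this pattern, `pipeline.run()` fails with a deadline-exceeded error instead of hanging forever when the job server never comes up.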
[jira] [Work logged] (BEAM-7933) Adding timeout to JobServer grpc calls
[ https://issues.apache.org/jira/browse/BEAM-7933?focusedWorklogId=319262&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319262 ] ASF GitHub Bot logged work on BEAM-7933: Author: ASF GitHub Bot Created on: 26/Sep/19 22:48 Start Date: 26/Sep/19 22:48 Worklog Time Spent: 10m Work Description: ecanzonieri commented on pull request #9673: [BEAM-7933] Add job server request timeout (default to 60 seconds) URL: https://github.com/apache/beam/pull/9673 Add a new option --job_server_timeout that defaults to 60 seconds. The request timeout is used for all job server requests (except for the ones that are expected to run long or hang). The request timeout is also used upon channel creation, so that the Beam driver will fail if the job server is not available. Let me know if the option name looks fine or whether it needs renaming. Finally, there are other grpc requests (non job service) that I'm not covering as part of this PR. Let me know if I should add the timeout to those as well. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
[jira] [Work started] (BEAM-7933) Adding timeout to JobServer grpc calls
[ https://issues.apache.org/jira/browse/BEAM-7933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on BEAM-7933 started by Enrico Canzonieri. --- > Adding timeout to JobServer grpc calls > -- > > Key: BEAM-7933 > URL: https://issues.apache.org/jira/browse/BEAM-7933 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Affects Versions: 2.14.0 >Reporter: Enrico Canzonieri >Assignee: Enrico Canzonieri >Priority: Minor > Labels: portability > > grpc calls to the JobServer from the Python SDK do not have timeouts. That > means that the call to pipeline.run() could hang forever if the JobServer is > not running (or failing to start). > E.g. > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/portable_runner.py#L307] > the call to Prepare() doesn't provide any timeout value, and the same applies > to other JobServer requests. > As part of this ticket we could add a default timeout of 60 seconds, as is the default timeout for the http client. > Additionally, we could consider adding a --job-server-request-timeout to the > [PortableOptions|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/options/pipeline_options.py#L805] > class to be used in the JobServer interactions inside portable_runner.py. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8303) Filesystems not properly registered using FileIO.write()
[ https://issues.apache.org/jira/browse/BEAM-8303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939002#comment-16939002 ] Preston Koprivica commented on BEAM-8303: - [~mxm] Sorry for the delay. I was struggling with my local build and I had to track down some issues. Totally unrelated, but if you have some time, I think I may have found an issue related to some recent build changes [1]. In any case, I was able to finally get the local build working and pulled into my test project. {quote} Just to prove this theory, do you mind building Beam and testing your pipeline with the following line added before line 75? https://github.com/apache/beam/blob/04dc3c3b14ab780e9736d5f769c6bf2a27a390bb/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/types/CoderTypeInformation.java#L75 FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create()); {quote} This change did not impact the behavior at all. And I guess the question is, would we have expected it to, using the default PipelineOptions (which I'm assuming wouldn't include the S3 options)? [1] https://issues.apache.org/jira/browse/BEAM-8021 > Filesystems not properly registered using FileIO.write() > > > Key: BEAM-8303 > URL: https://issues.apache.org/jira/browse/BEAM-8303 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Affects Versions: 2.15.0 >Reporter: Preston Koprivica >Assignee: Maximilian Michels >Priority: Major > > I’m getting the following error when attempting to use the FileIO apis > (beam-2.15.0) and integrating with AWS S3.
I have setup the PipelineOptions > with all the relevant AWS options, so the filesystem registry **should** be > properly seeded by the time the graph is compiled and executed: > {code:java} > java.lang.IllegalArgumentException: No filesystem found for scheme s3 > at > org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:456) > at > org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:526) > at > org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1149) > at > org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1105) > at org.apache.beam.sdk.coders.Coder.decode(Coder.java:159) > at > org.apache.beam.sdk.transforms.join.UnionCoder.decode(UnionCoder.java:83) > at > org.apache.beam.sdk.transforms.join.UnionCoder.decode(UnionCoder.java:32) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:543) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:534) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:480) > at > org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.deserialize(CoderTypeSerializer.java:93) > at > org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55) > at > org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:106) > at > org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:72) > at > org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47) > at > org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73) > at > org.apache.flink.runtime.operators.FlatMapDriver.run(FlatMapDriver.java:107) > at 
org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:503) > at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:368) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) > at java.lang.Thread.run(Thread.java:748) > {code} > For reference, the write code resembles this: > {code:java} > FileIO.Write write = FileIO.write() > .via(ParquetIO.sink(schema)) > .to(options.getOutputDir()). // will be something like: > s3:/// > .withSuffix(".parquet"); > records.apply(String.format("Write(%s)", options.getOutputDir()), > write);{code} > The issue does not appear to be related to ParquetIO.sink(). I am able to > reliably reproduce the issue using JSON formatted records and TextIO.sink(), > as well. Moreover, AvroIO is affected if withWindowedWrites() option is > added. > Just trying some different knobs, I went ahead and set the following option: > {code:java} > write = write.withNoSpilling();{code} > This actually seemed to fix the issue, only to have it reemerge as I scaled > up the data set size. The stack trace, while very
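The symptom above — `No filesystem found for scheme s3` thrown from a decode path on a Flink task manager — is consistent with a JVM-wide registry being consulted in a worker JVM that was never seeded from the pipeline options. A pure-Java sketch of that failure mode (the `SchemeRegistry` names here are hypothetical stand-ins, not Beam's actual `FileSystems` class):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a static, JVM-wide registry such as
// org.apache.beam.sdk.io.FileSystems: schemes are registered from pipeline
// options, and lookups throw if the scheme was never registered.
public class Main {
    static final Map<String, String> SCHEMES = new HashMap<>();

    static void registerFromOptions(String scheme, String filesystem) {
        SCHEMES.put(scheme, filesystem);
    }

    static String lookup(String scheme) {
        String fs = SCHEMES.get(scheme);
        if (fs == null) {
            // Mirrors the IllegalArgumentException in the reported stack trace.
            throw new IllegalArgumentException("No filesystem found for scheme " + scheme);
        }
        return fs;
    }

    public static void main(String[] args) {
        // The submitting JVM seeds the registry from the AWS options...
        registerFromOptions("s3", "S3FileSystem");
        System.out.println(lookup("s3")); // S3FileSystem

        // ...but a fresh task-manager JVM that decodes a FileResult before any
        // seeding step runs goes straight to lookup() and hits the exception.
    }
}
```

This also explains why seeding with `PipelineOptionsFactory.create()` would not help: default options carry no S3 configuration, so the `s3` scheme still never gets registered.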
[jira] [Work logged] (BEAM-8275) Beam SQL should support BigQuery in DIRECT_READ mode
[ https://issues.apache.org/jira/browse/BEAM-8275?focusedWorklogId=319254=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319254 ] ASF GitHub Bot logged work on BEAM-8275: Author: ASF GitHub Bot Created on: 26/Sep/19 22:13 Start Date: 26/Sep/19 22:13 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9625: [BEAM-8275] Beam SQL should support BigQuery in DIRECT_READ mode URL: https://github.com/apache/beam/pull/9625#discussion_r328848385 ## File path: sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryReadWriteIT.java ## @@ -154,6 +156,78 @@ public void testSQLRead() { assertEquals(state, State.DONE); } + @Test + public void testSQLRead_withDirectRead() { Review comment: We should probably also add a test that EXPORT works, so when that is no longer the default we will still have a test for it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319254) Time Spent: 1h 10m (was: 1h) > Beam SQL should support BigQuery in DIRECT_READ mode > > > Key: BEAM-8275 > URL: https://issues.apache.org/jira/browse/BEAM-8275 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Andrew Pilloud >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > SQL currently only supports reading from BigQuery in DEFAULT (EXPORT) mode. > We also need to support DIRECT_READ mode. The method should be configurable > by TBLPROPERTIES through the SQL CLI. This will enable us to take advantage > of the DIRECT_READ features for filter and project push down. 
> References: > [https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api] > [https://beam.apache.org/blog/2019/06/04/adding-data-sources-to-sql.html] > [https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java] -- This message was sent by Atlassian Jira (v8.3.4#803005)
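For readers following along, making the read method configurable through TBLPROPERTIES would give the SQL CLI declaration roughly this shape (a sketch: the `method` property key matches the METHOD_PROPERTY constant added in PR #9625, but the table schema and location are hypothetical):

```sql
CREATE EXTERNAL TABLE users (id BIGINT, name VARCHAR)
TYPE bigquery
LOCATION 'my-project:my_dataset.users'
TBLPROPERTIES '{"method": "DIRECT_READ"}'
```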
[jira] [Work logged] (BEAM-8275) Beam SQL should support BigQuery in DIRECT_READ mode
[ https://issues.apache.org/jira/browse/BEAM-8275?focusedWorklogId=319255=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319255 ] ASF GitHub Bot logged work on BEAM-8275: Author: ASF GitHub Bot Created on: 26/Sep/19 22:13 Start Date: 26/Sep/19 22:13 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9625: [BEAM-8275] Beam SQL should support BigQuery in DIRECT_READ mode URL: https://github.com/apache/beam/pull/9625#discussion_r326830964 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java ## @@ -45,15 +49,45 @@ */ @Experimental class BigQueryTable extends BaseBeamTable implements Serializable { + @VisibleForTesting static final String METHOD_PROPERTY = "method"; @VisibleForTesting final String bqLocation; private final ConversionOptions conversionOptions; private BeamTableStatistics rowCountStatistics = null; private static final Logger LOGGER = LoggerFactory.getLogger(BigQueryTable.class); + @VisibleForTesting Method method; Review comment: This is set exactly once (per code path) in the constructor, so you should be able to make it `final`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319255) Time Spent: 1h 10m (was: 1h) > Beam SQL should support BigQuery in DIRECT_READ mode > > > Key: BEAM-8275 > URL: https://issues.apache.org/jira/browse/BEAM-8275 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Andrew Pilloud >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > SQL currently only supports reading from BigQuery in DEFAULT (EXPORT) mode. > We also need to support DIRECT_READ mode. 
The method should be configurable > by TBLPROPERTIES through the SQL CLI. This will enable us to take advantage > of the DIRECT_READ features for filter and project push down. > References: > [https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api] > [https://beam.apache.org/blog/2019/06/04/adding-data-sources-to-sql.html] > [https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8275) Beam SQL should support BigQuery in DIRECT_READ mode
[ https://issues.apache.org/jira/browse/BEAM-8275?focusedWorklogId=319256&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319256 ] ASF GitHub Bot logged work on BEAM-8275: Author: ASF GitHub Bot Created on: 26/Sep/19 22:13 Start Date: 26/Sep/19 22:13 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9625: [BEAM-8275] Beam SQL should support BigQuery in DIRECT_READ mode URL: https://github.com/apache/beam/pull/9625#discussion_r326830808 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java ## @@ -45,15 +49,45 @@ */ @Experimental class BigQueryTable extends BaseBeamTable implements Serializable { + @VisibleForTesting static final String METHOD_PROPERTY = "method"; @VisibleForTesting final String bqLocation; private final ConversionOptions conversionOptions; private BeamTableStatistics rowCountStatistics = null; private static final Logger LOGGER = LoggerFactory.getLogger(BigQueryTable.class); + @VisibleForTesting Method method; - BigQueryTable(Table table, BigQueryUtils.ConversionOptions options) { + BigQueryTable(Table table, BigQueryUtils.ConversionOptions options) + throws InvalidPropertyException { Review comment: This statement is redundant, please remove. `InvalidPropertyException` indirectly extends `RuntimeException`, and every Java method implicitly `throws RuntimeException`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319256) Time Spent: 1h 20m (was: 1h 10m) > Beam SQL should support BigQuery in DIRECT_READ mode > > > Key: BEAM-8275 > URL: https://issues.apache.org/jira/browse/BEAM-8275 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Andrew Pilloud >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 1h 20m > Remaining Estimate: 0h > > SQL currently only supports reading from BigQuery in DEFAULT (EXPORT) mode. > We also need to support DIRECT_READ mode. The method should be configurable > by TBLPROPERTIES through the SQL CLI. This will enable us to take advantage > of the DIRECT_READ features for filter and project push down. > References: > [https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api] > [https://beam.apache.org/blog/2019/06/04/adding-data-sources-to-sql.html] > [https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java] -- This message was sent by Atlassian Jira (v8.3.4#803005)
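The review point above — that a `throws` clause for an unchecked exception is redundant — can be demonstrated in plain Java. The `InvalidPropertyException` and `validate` method below are stand-ins for the PR's code, not the actual Beam classes:

```java
public class Main {
    // Stand-in for the PR's exception: any subclass of RuntimeException
    // is unchecked, so callers never have to declare or catch it.
    static class InvalidPropertyException extends RuntimeException {
        InvalidPropertyException(String message) { super(message); }
    }

    // No `throws` clause, yet throwing the unchecked exception compiles fine;
    // adding `throws InvalidPropertyException` would change nothing.
    static void validate(String method) {
        if (!method.equals("DIRECT_READ") && !method.equals("EXPORT")) {
            throw new InvalidPropertyException("Invalid method: " + method);
        }
    }

    public static void main(String[] args) {
        validate("DIRECT_READ"); // ok, no try/catch required by the compiler
        boolean threw = false;
        try {
            validate("BOGUS");
        } catch (InvalidPropertyException e) {
            threw = true;
        }
        System.out.println(threw); // true
    }
}
```

Only checked exceptions (subclasses of `Exception` outside the `RuntimeException` hierarchy) participate in the compiler's exception checking, which is why the reviewer asks for the clause to be dropped rather than documented.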
[jira] [Work logged] (BEAM-8317) SqlTransform doesn't support aggregation over a filter node
[ https://issues.apache.org/jira/browse/BEAM-8317?focusedWorklogId=319251=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319251 ] ASF GitHub Bot logged work on BEAM-8317: Author: ASF GitHub Bot Created on: 26/Sep/19 22:06 Start Date: 26/Sep/19 22:06 Worklog Time Spent: 10m Work Description: apilloud commented on issue #9672: [BEAM-8317] Add (skipped) test for aggregating after a filter URL: https://github.com/apache/beam/pull/9672#issuecomment-535704007 LGTM. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319251) Time Spent: 20m (was: 10m) > SqlTransform doesn't support aggregation over a filter node > --- > > Key: BEAM-8317 > URL: https://issues.apache.org/jira/browse/BEAM-8317 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Affects Versions: 2.15.0 >Reporter: Brian Hulette >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > For example, the following query fails to translate to a physical plan: > SELECT SUM(f_intValue) FROM PCOLLECTION WHERE f_intValue < 5 GROUP BY > f_intGroupingKey -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8317) SqlTransform doesn't support aggregation over a filter node
[ https://issues.apache.org/jira/browse/BEAM-8317?focusedWorklogId=319252=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319252 ] ASF GitHub Bot logged work on BEAM-8317: Author: ASF GitHub Bot Created on: 26/Sep/19 22:06 Start Date: 26/Sep/19 22:06 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9672: [BEAM-8317] Add (skipped) test for aggregating after a filter URL: https://github.com/apache/beam/pull/9672 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319252) Time Spent: 0.5h (was: 20m) > SqlTransform doesn't support aggregation over a filter node > --- > > Key: BEAM-8317 > URL: https://issues.apache.org/jira/browse/BEAM-8317 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Affects Versions: 2.15.0 >Reporter: Brian Hulette >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > For example, the following query fails to translate to a physical plan: > SELECT SUM(f_intValue) FROM PCOLLECTION WHERE f_intValue < 5 GROUP BY > f_intGroupingKey -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8021) Add Automatic-Module-Name headers for Beam Java modules
[ https://issues.apache.org/jira/browse/BEAM-8021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938995#comment-16938995 ] Preston Koprivica commented on BEAM-8021: - [~ŁukaszG]I think the PR associated to this issue may have broken local builds. I'm still very new to beam (and gradle), so please bear with me and apologies if I'm mistaken. The default for applyJavaNature (as of 2.15.0) was to publish [1]. The project :sdks:java:build-tools was previously being published and there was a compile dependency on it by the flink-runner [2]. It appears that dependency still exists [3], but the build-tools project is no longer being published, hence the broken builds. I'm guessing that the reason it wasn't caught in the PR is because the SNAPSHOT artifact was still available in whatever repo the build server was accessing. And I'm also wondering if this doesn't manifest when you attempt to release it. [1] https://github.com/apache/beam/blob/v2.15.0/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L129 [2] https://github.com/apache/beam/blob/release-2.15.0/runners/flink/flink_runner.gradle [3] https://github.com/apache/beam/blob/2acbfbd/runners/flink/flink_runner.gradle#L102 > Add Automatic-Module-Name headers for Beam Java modules > > > Key: BEAM-8021 > URL: https://issues.apache.org/jira/browse/BEAM-8021 > Project: Beam > Issue Type: Sub-task > Components: build-system >Reporter: Ismaël Mejía >Assignee: Lukasz Gajowy >Priority: Minor > Fix For: 2.17.0 > > Time Spent: 8h > Remaining Estimate: 0h > > For compatibility with the Java Platform Module System (JPMS) in Java 9 and > later, every JAR should have a module name, even if the library does not > itself use modules. As [suggested in the mailing > list|https://lists.apache.org/thread.html/956065580ce049481e756482dc3ccfdc994fef3b8cdb37cab3e2d9b1@%3Cdev.beam.apache.org%3E], > this is a simple change that we can do and still be backwards compatible. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
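The dependency the comment describes has roughly this shape in `flink_runner.gradle` (a sketch of the pattern under the old Gradle configuration names, not the exact line in the repository):

```groovy
// runners/flink/flink_runner.gradle (sketch)
// This resolves only while a :sdks:java:build-tools artifact is produced;
// if that project stops being published while the dependency remains,
// builds that rely on the published artifact break.
dependencies {
  compile project(":sdks:java:build-tools")
}
```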
[jira] [Work logged] (BEAM-7886) Make row coder a standard coder and implement in python
[ https://issues.apache.org/jira/browse/BEAM-7886?focusedWorklogId=319240=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319240 ] ASF GitHub Bot logged work on BEAM-7886: Author: ASF GitHub Bot Created on: 26/Sep/19 21:36 Start Date: 26/Sep/19 21:36 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on issue #9188: [BEAM-7886] Make row coder a standard coder and implement in Python URL: https://github.com/apache/beam/pull/9188#issuecomment-535695414 Apologies for letting this go stale. After [BEAM-8111](https://issues.apache.org/jira/browse/BEAM-8111) I wanted to make sure we had some better test coverage on the Java side. @udim, @aaltay, @robertwb, and/or @chadrik - would you mind taking another look at the Python changes now? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319240) Time Spent: 10h 50m (was: 10h 40m) > Make row coder a standard coder and implement in python > --- > > Key: BEAM-7886 > URL: https://issues.apache.org/jira/browse/BEAM-7886 > Project: Beam > Issue Type: Improvement > Components: beam-model, sdk-java-core, sdk-py-core >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Time Spent: 10h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7886) Make row coder a standard coder and implement in python
[ https://issues.apache.org/jira/browse/BEAM-7886?focusedWorklogId=319238=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319238 ] ASF GitHub Bot logged work on BEAM-7886: Author: ASF GitHub Bot Created on: 26/Sep/19 21:33 Start Date: 26/Sep/19 21:33 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on pull request #9188: [BEAM-7886] Make row coder a standard coder and implement in Python URL: https://github.com/apache/beam/pull/9188#discussion_r328835963 ## File path: sdks/python/apache_beam/coders/row_coder.py ## @@ -0,0 +1,162 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +from __future__ import absolute_import + +import itertools +from array import array + +from apache_beam.coders.coder_impl import StreamCoderImpl +from apache_beam.coders.coders import BytesCoder +from apache_beam.coders.coders import Coder +from apache_beam.coders.coders import FastCoder +from apache_beam.coders.coders import FloatCoder +from apache_beam.coders.coders import IterableCoder +from apache_beam.coders.coders import StrUtf8Coder +from apache_beam.coders.coders import TupleCoder +from apache_beam.coders.coders import VarIntCoder +from apache_beam.portability import common_urns +from apache_beam.portability.api import schema_pb2 +from apache_beam.typehints.schemas import named_tuple_from_schema +from apache_beam.typehints.schemas import named_tuple_to_schema + +__all__ = ["RowCoder"] + + +class RowCoder(FastCoder): + """ Coder for `typing.NamedTuple` instances. + + Implements the beam:coder:row:v1 standard coder spec. + """ + + def __init__(self, schema): +self.schema = schema +self.components = [ +coder_from_type(field.type) for field in self.schema.fields +] + + def _create_impl(self): +return RowCoderImpl(self.schema, self.components) + + def is_deterministic(self): +return all(c.is_deterministic() for c in self.components) + + def to_type_hint(self): +return named_tuple_from_schema(self.schema) + + def as_cloud_object(self, coders_context=None): +raise NotImplementedError("TODO") + + def __eq__(self, other): +return type(self) == type(other) and self.schema == other.schema + + def __hash__(self): Review comment: Did some reading on this and it looks like I was wrong. From [SO](https://stackoverflow.com/a/53519136): > A class that overrides `__eq__()` and does not define `__hash__()` will have its `__hash__()` implicitly set to None. I went ahead and removed __hash__ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319238) Time Spent: 10.5h (was: 10h 20m) > Make row coder a standard coder and implement in python > --- > > Key: BEAM-7886 > URL: https://issues.apache.org/jira/browse/BEAM-7886 > Project: Beam > Issue Type: Improvement > Components: beam-model, sdk-java-core, sdk-py-core >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Time Spent: 10.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
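The Stack Overflow behavior cited in the review thread above is easy to verify in plain Python: a class that defines `__eq__` without `__hash__` gets `__hash__ = None` and its instances become unhashable. The class below is a minimal stand-in, not the actual `RowCoder`:

```python
class EqOnlyCoder:
    """Defines __eq__ but not __hash__, like RowCoder after the change."""

    def __init__(self, schema):
        self.schema = schema

    def __eq__(self, other):
        return type(self) == type(other) and self.schema == other.schema


# Python implicitly sets __hash__ to None when __eq__ is defined...
print(EqOnlyCoder.__hash__)  # None

# ...so instances can no longer be used as dict keys or set members.
try:
    hash(EqOnlyCoder("schema"))
except TypeError:
    print("unhashable")
```

In other words, removing an explicit `__hash__` from a class that defines `__eq__` does not fall back to the default identity hash; it disables hashing entirely, which is the safe behavior when equality is value-based.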
[jira] [Work logged] (BEAM-7886) Make row coder a standard coder and implement in python
[ https://issues.apache.org/jira/browse/BEAM-7886?focusedWorklogId=319239=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319239 ] ASF GitHub Bot logged work on BEAM-7886: Author: ASF GitHub Bot Created on: 26/Sep/19 21:33 Start Date: 26/Sep/19 21:33 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on pull request #9188: [BEAM-7886] Make row coder a standard coder and implement in Python URL: https://github.com/apache/beam/pull/9188#discussion_r328835963 ## File path: sdks/python/apache_beam/coders/row_coder.py ## @@ -0,0 +1,162 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +from __future__ import absolute_import + +import itertools +from array import array + +from apache_beam.coders.coder_impl import StreamCoderImpl +from apache_beam.coders.coders import BytesCoder +from apache_beam.coders.coders import Coder +from apache_beam.coders.coders import FastCoder +from apache_beam.coders.coders import FloatCoder +from apache_beam.coders.coders import IterableCoder +from apache_beam.coders.coders import StrUtf8Coder +from apache_beam.coders.coders import TupleCoder +from apache_beam.coders.coders import VarIntCoder +from apache_beam.portability import common_urns +from apache_beam.portability.api import schema_pb2 +from apache_beam.typehints.schemas import named_tuple_from_schema +from apache_beam.typehints.schemas import named_tuple_to_schema + +__all__ = ["RowCoder"] + + +class RowCoder(FastCoder): + """ Coder for `typing.NamedTuple` instances. + + Implements the beam:coder:row:v1 standard coder spec. + """ + + def __init__(self, schema): +self.schema = schema +self.components = [ +coder_from_type(field.type) for field in self.schema.fields +] + + def _create_impl(self): +return RowCoderImpl(self.schema, self.components) + + def is_deterministic(self): +return all(c.is_deterministic() for c in self.components) + + def to_type_hint(self): +return named_tuple_from_schema(self.schema) + + def as_cloud_object(self, coders_context=None): +raise NotImplementedError("TODO") + + def __eq__(self, other): +return type(self) == type(other) and self.schema == other.schema + + def __hash__(self): Review comment: Did some reading on this and it looks like I was wrong. From [SO](https://stackoverflow.com/a/53519136): > A class that overrides `__eq__()` and does not define `__hash__()` will have its `__hash__()` implicitly set to None. I went ahead and removed `__hash__` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319239) Time Spent: 10h 40m (was: 10.5h) > Make row coder a standard coder and implement in python > --- > > Key: BEAM-7886 > URL: https://issues.apache.org/jira/browse/BEAM-7886 > Project: Beam > Issue Type: Improvement > Components: beam-model, sdk-java-core, sdk-py-core >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Time Spent: 10h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7886) Make row coder a standard coder and implement in python
[ https://issues.apache.org/jira/browse/BEAM-7886?focusedWorklogId=319237=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319237 ] ASF GitHub Bot logged work on BEAM-7886: Author: ASF GitHub Bot Created on: 26/Sep/19 21:32 Start Date: 26/Sep/19 21:32 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on pull request #9188: [BEAM-7886] Make row coder a standard coder and implement in Python URL: https://github.com/apache/beam/pull/9188#discussion_r328835963 ## File path: sdks/python/apache_beam/coders/row_coder.py ## @@ -0,0 +1,162 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +from __future__ import absolute_import + +import itertools +from array import array + +from apache_beam.coders.coder_impl import StreamCoderImpl +from apache_beam.coders.coders import BytesCoder +from apache_beam.coders.coders import Coder +from apache_beam.coders.coders import FastCoder +from apache_beam.coders.coders import FloatCoder +from apache_beam.coders.coders import IterableCoder +from apache_beam.coders.coders import StrUtf8Coder +from apache_beam.coders.coders import TupleCoder +from apache_beam.coders.coders import VarIntCoder +from apache_beam.portability import common_urns +from apache_beam.portability.api import schema_pb2 +from apache_beam.typehints.schemas import named_tuple_from_schema +from apache_beam.typehints.schemas import named_tuple_to_schema + +__all__ = ["RowCoder"] + + +class RowCoder(FastCoder): + """ Coder for `typing.NamedTuple` instances. + + Implements the beam:coder:row:v1 standard coder spec. + """ + + def __init__(self, schema): +self.schema = schema +self.components = [ +coder_from_type(field.type) for field in self.schema.fields +] + + def _create_impl(self): +return RowCoderImpl(self.schema, self.components) + + def is_deterministic(self): +return all(c.is_deterministic() for c in self.components) + + def to_type_hint(self): +return named_tuple_from_schema(self.schema) + + def as_cloud_object(self, coders_context=None): +raise NotImplementedError("TODO") + + def __eq__(self, other): +return type(self) == type(other) and self.schema == other.schema + + def __hash__(self): Review comment: Did some reading on this and it looks like I was wrong. From (SO)[https://stackoverflow.com/a/53519136]: > A class that overrides `__eq__()` and does not define `__hash__()` will have its `__hash__()` implicitly set to None. I went ahead and removed __hash__ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319237) Time Spent: 10h 20m (was: 10h 10m) > Make row coder a standard coder and implement in python > --- > > Key: BEAM-7886 > URL: https://issues.apache.org/jira/browse/BEAM-7886 > Project: Beam > Issue Type: Improvement > Components: beam-model, sdk-java-core, sdk-py-core >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Time Spent: 10h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8317) SqlTransform doesn't support aggregation over a filter node
[ https://issues.apache.org/jira/browse/BEAM-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938980#comment-16938980 ] Brian Hulette commented on BEAM-8317: - Discussed this with [~apilloud]. He said that the appropriate fix is probably to make [BeamBasicAggregationRule|https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamBasicAggregationRule.java] accept anything that's _not_ a projection, rather than the current approach, which accepts only table scans. Basically it should complement [BeamAggregationRule|https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamAggregationRule.java], which accepts _only_ projections. > SqlTransform doesn't support aggregation over a filter node > --- > > Key: BEAM-8317 > URL: https://issues.apache.org/jira/browse/BEAM-8317 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Affects Versions: 2.15.0 >Reporter: Brian Hulette >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > For example, the following query fails to translate to a physical plan: > SELECT SUM(f_intValue) FROM PCOLLECTION WHERE f_intValue < 5 GROUP BY > f_intGroupingKey -- This message was sent by Atlassian Jira (v8.3.4#803005)
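The proposed fix is about making the two rules' match conditions complementary on the aggregate's input node. A schematic sketch of that predicate split (hypothetical stand-in types, not Calcite's actual `RelNode` classes or the rules' real match API):

```java
public class Main {
    // Stand-ins for relational plan nodes.
    interface RelNode {}
    static class Project implements RelNode {}
    static class Filter implements RelNode {}
    static class TableScan implements RelNode {}

    // BeamAggregationRule: fires only when the aggregate's input is a Project.
    static boolean aggregationRuleMatches(RelNode input) {
        return input instanceof Project;
    }

    // Proposed BeamBasicAggregationRule: fires on anything that is NOT a
    // Project (Filter, TableScan, ...), instead of only TableScan as today.
    static boolean basicAggregationRuleMatches(RelNode input) {
        return !(input instanceof Project);
    }

    public static void main(String[] args) {
        // A Filter input -- the WHERE-clause case from BEAM-8317 -- is now
        // matched by the basic rule, so the aggregate gets a physical plan.
        System.out.println(basicAggregationRuleMatches(new Filter())); // true
        System.out.println(aggregationRuleMatches(new Project()));     // true
    }
}
```

With the two predicates exact complements of each other, every aggregate input is matched by exactly one rule, closing the gap where a Filter input matched neither.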
[jira] [Work logged] (BEAM-8317) SqlTransform doesn't support aggregation over a filter node
[ https://issues.apache.org/jira/browse/BEAM-8317?focusedWorklogId=319227&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319227 ] ASF GitHub Bot logged work on BEAM-8317: Author: ASF GitHub Bot Created on: 26/Sep/19 21:19 Start Date: 26/Sep/19 21:19 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on pull request #9672: [BEAM-8317] Add (skipped) test for aggregating after a filter URL: https://github.com/apache/beam/pull/9672 R: @apilloud
[jira] [Created] (BEAM-8317) SqlTransform doesn't support aggregation over a filter node
Brian Hulette created BEAM-8317: --- Summary: SqlTransform doesn't support aggregation over a filter node Key: BEAM-8317 URL: https://issues.apache.org/jira/browse/BEAM-8317 Project: Beam Issue Type: Bug Components: dsl-sql Affects Versions: 2.15.0 Reporter: Brian Hulette For example, the following query fails to translate to a physical plan: SELECT SUM(f_intValue) FROM PCOLLECTION WHERE f_intValue < 5 GROUP BY f_intGroupingKey -- This message was sent by Atlassian Jira (v8.3.4#803005)
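The intended semantics of the failing query are easy to state even though the planner cannot translate it. The following plain-Python sketch (illustrative only, not Beam SQL; the field names are taken from the report) shows what the filter-then-aggregate should compute:

```python
# Plain-Python sketch of the result the BEAM-8317 query should produce:
# SELECT SUM(f_intValue) FROM PCOLLECTION WHERE f_intValue < 5 GROUP BY f_intGroupingKey
from collections import defaultdict

def filtered_grouped_sum(rows, threshold=5):
    """Filter rows by f_intValue, then sum f_intValue per f_intGroupingKey."""
    sums = defaultdict(int)
    for row in rows:
        if row["f_intValue"] < threshold:
            sums[row["f_intGroupingKey"]] += row["f_intValue"]
    return dict(sums)

rows = [
    {"f_intGroupingKey": "a", "f_intValue": 1},
    {"f_intGroupingKey": "a", "f_intValue": 7},  # filtered out (>= 5)
    {"f_intGroupingKey": "b", "f_intValue": 3},
]
print(filtered_grouped_sum(rows))  # {'a': 1, 'b': 3}
```

The skipped test added in PR #9672 exercises this same filter-before-aggregate shape through SqlTransform.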
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319207&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319207 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 20:48 Start Date: 26/Sep/19 20:48 Worklog Time Spent: 10m Work Description: jhalaria commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535680190 Thanks @aromanenko-dev . Build is happy now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319207) Time Spent: 4h 10m (was: 4h)
> KinesisIO.write causes NPE as the producer is null
> --
>
> Key: BEAM-8300
> URL: https://issues.apache.org/jira/browse/BEAM-8300
> Project: Beam
> Issue Type: Bug
> Components: io-java-kinesis
> Affects Versions: 2.15.0
> Reporter: Ankit Jhalaria
> Assignee: Ankit Jhalaria
> Priority: Minor
> Fix For: Not applicable
>
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> While using KinesisIO.write(), we encountered an NPE with the following stack trace:
> {code:java}
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)
>     at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)
>     at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)
>     at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)
>     at org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException: null
>     at org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)
>     at org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669)
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
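The failure pattern behind this stack trace, a lazily created producer that can still be null when a bundle is flushed, can be sketched in plain Python. The names below are hypothetical and this is not the KinesisIO code; it only illustrates the guard that prevents this class of error:

```python
# Illustrative sketch of guarding a lazily initialized producer so that
# flushing a bundle before any write does not fail.
class BundleWriter:
    def __init__(self, producer_factory):
        self._producer_factory = producer_factory
        self._producer = None  # created lazily, as in the reported bug

    def _ensure_producer(self):
        # Guarding every flush path avoids the null-producer failure.
        if self._producer is None:
            self._producer = self._producer_factory()
        return self._producer

    def write(self, record):
        self._ensure_producer().append(record)

    def finish_bundle(self):
        # Without _ensure_producer() this would raise on an empty bundle,
        # the Python analogue of the reported NullPointerException.
        return len(self._ensure_producer())

writer = BundleWriter(list)
print(writer.finish_bundle())  # 0: safe even though nothing was written
```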
[jira] [Assigned] (BEAM-5543) Beam Dependency Update Request: com.gradle:build-scan-plugin
[ https://issues.apache.org/jira/browse/BEAM-5543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik reassigned BEAM-5543: --- Assignee: (was: Luke Cwik) > Beam Dependency Update Request: com.gradle:build-scan-plugin > > > Key: BEAM-5543 > URL: https://issues.apache.org/jira/browse/BEAM-5543 > Project: Beam > Issue Type: Sub-task > Components: dependencies >Reporter: Beam JIRA Bot >Priority: Major > > - 2018-10-01 19:29:44.279184 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 1.16 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-10-08 12:16:12.858178 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 1.16 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-10-15 12:11:19.400992 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 1.16 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-10-22 12:11:02.816812 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 2.0 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-10-29 12:13:18.853246 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. 
The latest version is 2.0.1 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-11-05 12:12:01.873127 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 2.0.1 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-11-12 12:11:52.184461 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 2.0.2 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-11-19 12:12:39.182250 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 2.0.2 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-11-26 12:11:44.898503 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 2.0.2 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-12-03 12:12:05.592304 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 2.0.2 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-12-10 12:14:26.144633 > - > Please consider upgrading the
[jira] [Assigned] (BEAM-6646) Beam Dependency Update Request: com.gradle.build-scan
[ https://issues.apache.org/jira/browse/BEAM-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik reassigned BEAM-6646: --- Assignee: (was: Luke Cwik) > Beam Dependency Update Request: com.gradle.build-scan > - > > Key: BEAM-6646 > URL: https://issues.apache.org/jira/browse/BEAM-6646 > Project: Beam > Issue Type: Bug > Components: dependencies >Reporter: Beam JIRA Bot >Priority: Major > > - 2019-02-11 12:12:25.062529 > - > Please consider upgrading the dependency com.gradle.build-scan. > The current version is None. The latest version is None > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-6647) Beam Dependency Update Request: com.gradle.build-scan:com.gradle.build-scan.gradle.plugin
[ https://issues.apache.org/jira/browse/BEAM-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik resolved BEAM-6647. - Fix Version/s: 2.16.0 Resolution: Fixed This has been updated to 2.4 > Beam Dependency Update Request: > com.gradle.build-scan:com.gradle.build-scan.gradle.plugin > - > > Key: BEAM-6647 > URL: https://issues.apache.org/jira/browse/BEAM-6647 > Project: Beam > Issue Type: Sub-task > Components: dependencies >Reporter: Beam JIRA Bot >Assignee: Luke Cwik >Priority: Major > Fix For: 2.16.0 > > > - 2019-02-11 12:12:26.233579 > - > Please consider upgrading the dependency > com.gradle.build-scan:com.gradle.build-scan.gradle.plugin. > The current version is 1.13.1. The latest version is 2.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-08-12 12:04:23.565409 > - > Please consider upgrading the dependency > com.gradle.build-scan:com.gradle.build-scan.gradle.plugin. > The current version is 2.1. The latest version is 2.4 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1296) Providing a small dataset for "Apache Beam Mobile Gaming Pipeline Examples"
[ https://issues.apache.org/jira/browse/BEAM-1296?focusedWorklogId=319197&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319197 ] ASF GitHub Bot logged work on BEAM-1296: Author: ASF GitHub Bot Created on: 26/Sep/19 20:18 Start Date: 26/Sep/19 20:18 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9633: [BEAM-1296] Providing a small dataset for "Apache Beam Mobile Gaming … URL: https://github.com/apache/beam/pull/9633 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319197) Time Spent: 1.5h (was: 1h 20m) > Providing a small dataset for "Apache Beam Mobile Gaming Pipeline Examples" > --- > > Key: BEAM-1296 > URL: https://issues.apache.org/jira/browse/BEAM-1296 > Project: Beam > Issue Type: Wish > Components: examples-java > Reporter: Keiji Yoshida > Assignee: John Patoch > Priority: Trivial > Labels: ccoss2019, newbie, starter > Time Spent: 1.5h > Remaining Estimate: 0h > > The dataset "gs://apache-beam-samples/game/gaming_data*.csv" for the "Apache Beam Mobile Gaming Pipeline Examples" is huge (about 12 GB), and it takes a long time to download. This can pose difficulties for Apache Beam beginners who want to try the examples quickly. How about providing a small dataset (say, less than 1 GB) for these examples? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1296) Providing a small dataset for "Apache Beam Mobile Gaming Pipeline Examples"
[ https://issues.apache.org/jira/browse/BEAM-1296?focusedWorklogId=319196&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319196 ] ASF GitHub Bot logged work on BEAM-1296: Author: ASF GitHub Bot Created on: 26/Sep/19 20:17 Start Date: 26/Sep/19 20:17 Worklog Time Spent: 10m Work Description: pabloem commented on issue #9633: [BEAM-1296] Providing a small dataset for "Apache Beam Mobile Gaming … URL: https://github.com/apache/beam/pull/9633#issuecomment-535669628 Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319196) Time Spent: 1h 20m (was: 1h 10m) > Providing a small dataset for "Apache Beam Mobile Gaming Pipeline Examples" > --- > > Key: BEAM-1296 > URL: https://issues.apache.org/jira/browse/BEAM-1296 > Project: Beam > Issue Type: Wish > Components: examples-java > Reporter: Keiji Yoshida > Assignee: John Patoch > Priority: Trivial > Labels: ccoss2019, newbie, starter > Time Spent: 1h 20m > Remaining Estimate: 0h > > The dataset "gs://apache-beam-samples/game/gaming_data*.csv" for the "Apache Beam Mobile Gaming Pipeline Examples" is huge (about 12 GB), and it takes a long time to download. This can pose difficulties for Apache Beam beginners who want to try the examples quickly. How about providing a small dataset (say, less than 1 GB) for these examples? -- This message was sent by Atlassian Jira (v8.3.4#803005)
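One low-tech way to produce such a reduced dataset is to sample the large CSV down by a fixed fraction. The sketch below is a hypothetical plain-Python illustration (the data here is made up), not part of the Beam examples themselves:

```python
# Hypothetical sketch: downsample a large CSV, e.g. to cut a ~12 GB
# gaming_data file down to something quick to experiment with.
import random

def sample_lines(lines, keep_fraction, seed=42):
    """Keep roughly keep_fraction of the data lines; the header is always kept."""
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    it = iter(lines)
    header = next(it)
    kept = [header]
    kept.extend(line for line in it if rng.random() < keep_fraction)
    return kept

data = ["user,score"] + [f"u{i},{i}" for i in range(1000)]
small = sample_lines(data, keep_fraction=0.1)
print(len(small))  # header plus roughly 10% of the rows
```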
[jira] [Work logged] (BEAM-8313) Rename yyy_reference to yyy_id to be consistent across protos
[ https://issues.apache.org/jira/browse/BEAM-8313?focusedWorklogId=319189=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319189 ] ASF GitHub Bot logged work on BEAM-8313: Author: ASF GitHub Bot Created on: 26/Sep/19 19:55 Start Date: 26/Sep/19 19:55 Worklog Time Spent: 10m Work Description: lukecwik commented on pull request #9668: [BEAM-8313] Follow up on PR comments for #9663 URL: https://github.com/apache/beam/pull/9668 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319189) Time Spent: 2h 10m (was: 2h) > Rename yyy_reference to yyy_id to be consistent across protos > - > > Key: BEAM-8313 > URL: https://issues.apache.org/jira/browse/BEAM-8313 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Fix For: 2.17.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > This will replace: > * ptransform_id -> transform_id > * ptransform_reference -> transform_id > * instruction_reference -> instruction_id > * process_bundle_descriptor_reference -> process_bundle_descriptor_id > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work started] (BEAM-8313) Rename yyy_reference to yyy_id to be consistent across protos
[ https://issues.apache.org/jira/browse/BEAM-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on BEAM-8313 started by Luke Cwik. --- > Rename yyy_reference to yyy_id to be consistent across protos > - > > Key: BEAM-8313 > URL: https://issues.apache.org/jira/browse/BEAM-8313 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > This will replace: > * ptransform_id -> transform_id > * ptransform_reference -> transform_id > * instruction_reference -> instruction_id > * process_bundle_descriptor_reference -> process_bundle_descriptor_id > -- This message was sent by Atlassian Jira (v8.3.4#803005)
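The rename list above is mechanical and can be expressed as a simple lookup table. A plain-Python sketch (purely illustrative; the real change edits the .proto definitions and regenerated code):

```python
# BEAM-8313 rename table, taken directly from the issue description.
RENAMES = {
    "ptransform_id": "transform_id",
    "ptransform_reference": "transform_id",
    "instruction_reference": "instruction_id",
    "process_bundle_descriptor_reference": "process_bundle_descriptor_id",
}

def rename_field(name: str) -> str:
    """Map an old proto field name to its new, consistent *_id form."""
    return RENAMES.get(name, name)

print(rename_field("instruction_reference"))  # instruction_id
print(rename_field("transform_id"))           # transform_id (already consistent)
```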
[jira] [Resolved] (BEAM-8315) Remove unused fields for splitting SDFs in beam_fn_api.proto
[ https://issues.apache.org/jira/browse/BEAM-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik resolved BEAM-8315. - Fix Version/s: 2.17.0 Resolution: Fixed > Remove unused fields for splitting SDFs in beam_fn_api.proto > > > Key: BEAM-8315 > URL: https://issues.apache.org/jira/browse/BEAM-8315 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Fix For: 2.17.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-8313) Rename yyy_reference to yyy_id to be consistent across protos
[ https://issues.apache.org/jira/browse/BEAM-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik resolved BEAM-8313. - Fix Version/s: 2.17.0 Resolution: Fixed > Rename yyy_reference to yyy_id to be consistent across protos > - > > Key: BEAM-8313 > URL: https://issues.apache.org/jira/browse/BEAM-8313 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Fix For: 2.17.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > This will replace: > * ptransform_id -> transform_id > * ptransform_reference -> transform_id > * instruction_reference -> instruction_id > * process_bundle_descriptor_reference -> process_bundle_descriptor_id > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8306) improve estimation of data byte size reading from source in ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-8306?focusedWorklogId=319182&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319182 ] ASF GitHub Bot logged work on BEAM-8306: Author: ASF GitHub Bot Created on: 26/Sep/19 19:43 Start Date: 26/Sep/19 19:43 Worklog Time Spent: 10m Work Description: timrobertson100 commented on issue #9660: [BEAM-8306] improve estimation datasize elasticsearch io URL: https://github.com/apache/beam/pull/9660#issuecomment-535657412 I don't have time to do a thorough review until next week, but I looked over this for 10 mins and my general impression is that it looks like a good addition. (Commit message is not formatted correctly) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319182) Time Spent: 0.5h (was: 20m) > improve estimation of data byte size reading from source in ElasticsearchIO > --- > > Key: BEAM-8306 > URL: https://issues.apache.org/jira/browse/BEAM-8306 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch > Affects Versions: 2.14.0 > Reporter: Derek He > Assignee: Derek He > Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > ElasticsearchIO splits a BoundedSource based on the Elasticsearch index size. We expect it to be more accurate to split based on the query result size. Currently, we have a big Elasticsearch index, but the query result contains only a few documents from the index. ElasticsearchIO splits it into up to 1024 BoundedSources in Google Dataflow, and it takes a long time to finish processing the small number of Elasticsearch documents. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8306) improve estimation of data byte size reading from source in ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-8306?focusedWorklogId=319181&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319181 ] ASF GitHub Bot logged work on BEAM-8306: Author: ASF GitHub Bot Created on: 26/Sep/19 19:43 Start Date: 26/Sep/19 19:43 Worklog Time Spent: 10m Work Description: timrobertson100 commented on issue #9660: [BEAM-8306] improve estimation datasize elasticsearch io URL: https://github.com/apache/beam/pull/9660#issuecomment-535657412 I don't have time to do a thorough review until next week, but I looked over this and my general impression is that it looks like a good addition. (Commit message is not formatted correctly) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319181) Time Spent: 20m (was: 10m) > improve estimation of data byte size reading from source in ElasticsearchIO > --- > > Key: BEAM-8306 > URL: https://issues.apache.org/jira/browse/BEAM-8306 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch > Affects Versions: 2.14.0 > Reporter: Derek He > Assignee: Derek He > Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > ElasticsearchIO splits a BoundedSource based on the Elasticsearch index size. We expect it to be more accurate to split based on the query result size. Currently, we have a big Elasticsearch index, but the query result contains only a few documents from the index. ElasticsearchIO splits it into up to 1024 BoundedSources in Google Dataflow, and it takes a long time to finish processing the small number of Elasticsearch documents. -- This message was sent by Atlassian Jira (v8.3.4#803005)
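The improvement discussed for BEAM-8306 amounts to simple arithmetic: scale the index size by the fraction of documents the query actually matches, rather than using the full index size, so far fewer splits are produced. A hedged plain-Python sketch (the function name and the numbers are illustrative, not ElasticsearchIO's API):

```python
# Sketch of estimating the byte size of a query result from index-level
# statistics: average document size times the number of matching documents.
def estimate_query_bytes(index_size_bytes, docs_in_index, docs_matching_query):
    if docs_in_index == 0:
        return 0
    avg_doc_size = index_size_bytes / docs_in_index
    return int(avg_doc_size * docs_matching_query)

# A big index where the query matches only a handful of documents:
est = estimate_query_bytes(
    index_size_bytes=50 * 1024**3,   # 50 GiB index
    docs_in_index=100_000_000,
    docs_matching_query=2_000,
)
print(est)  # about 1 MiB, so nowhere near 1024 splits are warranted
```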
[jira] [Commented] (BEAM-5192) Support Elasticsearch 7.x
[ https://issues.apache.org/jira/browse/BEAM-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938907#comment-16938907 ] Tim Robertson commented on BEAM-5192: - [~chetaldrich] - help would be greatly appreciated. Please feel free to reassign to you. > Support Elasticsearch 7.x > - > > Key: BEAM-5192 > URL: https://issues.apache.org/jira/browse/BEAM-5192 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch > Reporter: Etienne Chauchot > Assignee: Tim Robertson > Priority: Major > > ES v7 is not out yet. But the Elastic team scheduled a breaking change for ES 7.0: the removal of the type feature. See > [https://www.elastic.co/blog/index-type-parent-child-join-now-future-in-elasticsearch] > This will require a good amount of changes in the IO. > This ticket is there to track the future work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-5192) Support Elasticsearch 7.x
[ https://issues.apache.org/jira/browse/BEAM-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938902#comment-16938902 ] Chet Aldrich commented on BEAM-5192: Hey folks, just curious about the current state of things here. Seems like since this issue was created ES 7.3 was released: [https://www.elastic.co/blog/elasticsearch-7-3-0-released]. I might be able to help if there's a need for it, but if work has started on this in some form I can hold off. > Support Elasticsearch 7.x > - > > Key: BEAM-5192 > URL: https://issues.apache.org/jira/browse/BEAM-5192 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch > Reporter: Etienne Chauchot > Assignee: Tim Robertson > Priority: Major > > ES v7 is not out yet. But the Elastic team scheduled a breaking change for ES 7.0: the removal of the type feature. See > [https://www.elastic.co/blog/index-type-parent-child-join-now-future-in-elasticsearch] > This will require a good amount of changes in the IO. > This ticket is there to track the future work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7389) Colab examples for element-wise transforms (Python)
[ https://issues.apache.org/jira/browse/BEAM-7389?focusedWorklogId=319138&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319138 ] ASF GitHub Bot logged work on BEAM-7389: Author: ASF GitHub Bot Created on: 26/Sep/19 18:54 Start Date: 26/Sep/19 18:54 Worklog Time Spent: 10m Work Description: davidcavazos commented on pull request #9669: [BEAM-7389] Update include buttons to support multiple languages URL: https://github.com/apache/beam/pull/9669 **No content changes, should render exactly the same** Small update on code snippet buttons to support multiple languages. R: @aaltay Filter: http://apache-beam-website-pull-requests.storage.googleapis.com/9661/documentation/transforms/python/elementwise/filter/index.html FlatMap: http://apache-beam-website-pull-requests.storage.googleapis.com/9661/documentation/transforms/python/elementwise/flatmap/index.html
[jira] [Work logged] (BEAM-8315) Remove unused fields for splitting SDFs in beam_fn_api.proto
[ https://issues.apache.org/jira/browse/BEAM-8315?focusedWorklogId=319137=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319137 ] ASF GitHub Bot logged work on BEAM-8315: Author: ASF GitHub Bot Created on: 26/Sep/19 18:53 Start Date: 26/Sep/19 18:53 Worklog Time Spent: 10m Work Description: boyuanzz commented on pull request #9666: [BEAM-8315] Clean-up unused fields for splitting SDFs URL: https://github.com/apache/beam/pull/9666 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319137) Time Spent: 0.5h (was: 20m) > Remove unused fields for splitting SDFs in beam_fn_api.proto > > > Key: BEAM-8315 > URL: https://issues.apache.org/jira/browse/BEAM-8315 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7738) Support PubSubIO to be configured externally for use with other SDKs
[ https://issues.apache.org/jira/browse/BEAM-7738?focusedWorklogId=319133=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319133 ] ASF GitHub Bot logged work on BEAM-7738: Author: ASF GitHub Bot Created on: 26/Sep/19 18:28 Start Date: 26/Sep/19 18:28 Worklog Time Spent: 10m Work Description: chamikaramj commented on issue #9268: [BEAM-7738] Add external transform support to PubsubIO URL: https://github.com/apache/beam/pull/9268#issuecomment-535629521 Run Python PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319133) Time Spent: 4.5h (was: 4h 20m) > Support PubSubIO to be configured externally for use with other SDKs > > > Key: BEAM-7738 > URL: https://issues.apache.org/jira/browse/BEAM-7738 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp, runner-flink, sdk-py-core >Reporter: Chad Dombrova >Assignee: Chad Dombrova >Priority: Major > Labels: portability > Time Spent: 4.5h > Remaining Estimate: 0h > > Now that KafkaIO is supported via the external transform API (BEAM-7029) we > should add support for PubSub. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7738) Support PubSubIO to be configured externally for use with other SDKs
[ https://issues.apache.org/jira/browse/BEAM-7738?focusedWorklogId=319132&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319132 ] ASF GitHub Bot logged work on BEAM-7738: Author: ASF GitHub Bot Created on: 26/Sep/19 18:28 Start Date: 26/Sep/19 18:28 Worklog Time Spent: 10m Work Description: chamikaramj commented on issue #9268: [BEAM-7738] Add external transform support to PubsubIO URL: https://github.com/apache/beam/pull/9268#issuecomment-535629486 Python failure seems to be a docs issue (':sdks:python:test-suites:tox:py2:docs'). Could be a flake. Retrying. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319132) Time Spent: 4h 20m (was: 4h 10m) > Support PubSubIO to be configured externally for use with other SDKs > > > Key: BEAM-7738 > URL: https://issues.apache.org/jira/browse/BEAM-7738 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp, runner-flink, sdk-py-core > Reporter: Chad Dombrova > Assignee: Chad Dombrova > Priority: Major > Labels: portability > Time Spent: 4h 20m > Remaining Estimate: 0h > > Now that KafkaIO is supported via the external transform API (BEAM-7029) we should add support for PubSub. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7738) Support PubSubIO to be configured externally for use with other SDKs
[ https://issues.apache.org/jira/browse/BEAM-7738?focusedWorklogId=319129=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319129 ] ASF GitHub Bot logged work on BEAM-7738: Author: ASF GitHub Bot Created on: 26/Sep/19 18:23 Start Date: 26/Sep/19 18:23 Worklog Time Spent: 10m Work Description: chamikaramj commented on issue #9268: [BEAM-7738] Add external transform support to PubsubIO URL: https://github.com/apache/beam/pull/9268#issuecomment-535627722 Seems like the Java PreCommit failure is for the new test ? Stacktrace is: org.apache.beam.sdk.io.gcp.pubsub.PubsubIOExternalTest > testConstructPubsubRead FAILED java.lang.RuntimeException at PubsubIOExternalTest.java:97 Caused by: java.lang.RuntimeException at PubsubIOExternalTest.java:97 Caused by: java.lang.reflect.InvocationTargetException at PubsubIOExternalTest.java:97 Caused by: java.lang.IllegalStateException at PubsubIOExternalTest.java:97 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319129) Time Spent: 4h 10m (was: 4h) > Support PubSubIO to be configured externally for use with other SDKs > > > Key: BEAM-7738 > URL: https://issues.apache.org/jira/browse/BEAM-7738 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp, runner-flink, sdk-py-core >Reporter: Chad Dombrova >Assignee: Chad Dombrova >Priority: Major > Labels: portability > Time Spent: 4h 10m > Remaining Estimate: 0h > > Now that KafkaIO is supported via the external transform API (BEAM-7029) we > should add support for PubSub. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1296) Providing a small dataset for "Apache Beam Mobile Gaming Pipeline Examples"
[ https://issues.apache.org/jira/browse/BEAM-1296?focusedWorklogId=319126=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319126 ] ASF GitHub Bot logged work on BEAM-1296: Author: ASF GitHub Bot Created on: 26/Sep/19 18:20 Start Date: 26/Sep/19 18:20 Worklog Time Spent: 10m Work Description: angulartist commented on issue #9633: [BEAM-1296] Providing a small dataset for "Apache Beam Mobile Gaming … URL: https://github.com/apache/beam/pull/9633#issuecomment-535626785 The file is in fact accessible. I've deleted the previous file and updated the comment. I left the default original input as it is, because we just want a lighter alternative xoxo This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319126) Time Spent: 1h 10m (was: 1h) > Providing a small dataset for "Apache Beam Mobile Gaming Pipeline Examples" > --- > > Key: BEAM-1296 > URL: https://issues.apache.org/jira/browse/BEAM-1296 > Project: Beam > Issue Type: Wish > Components: examples-java >Reporter: Keiji Yoshida >Assignee: John Patoch >Priority: Trivial > Labels: ccoss2019, newbie, starter > Time Spent: 1h 10m > Remaining Estimate: 0h > > A dataset "gs://apache-beam-samples/game/gaming_data*.csv" for "Apache Beam > Mobile Gaming Pipeline Examples" is so huge (about 12 GB) and it takes long > time to download the dataset. It might pose difficulties to Apache Beam > beginners who want to try "Apache Beam Mobile Gaming Pipeline Examples" > quickly. > How about providing a small dataset (say less than 1 GB) for this examples? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8313) Rename yyy_reference to yyy_id to be consistent across protos
[ https://issues.apache.org/jira/browse/BEAM-8313?focusedWorklogId=319125=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319125 ] ASF GitHub Bot logged work on BEAM-8313: Author: ASF GitHub Bot Created on: 26/Sep/19 18:19 Start Date: 26/Sep/19 18:19 Worklog Time Spent: 10m Work Description: lukecwik commented on issue #9668: [BEAM-8313] Follow up on PR comments for #9663 URL: https://github.com/apache/beam/pull/9668#issuecomment-535626489 Run Java_Examples_Dataflow PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319125) Time Spent: 2h (was: 1h 50m) > Rename yyy_reference to yyy_id to be consistent across protos > - > > Key: BEAM-8313 > URL: https://issues.apache.org/jira/browse/BEAM-8313 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > This will replace: > * ptransform_id -> transform_id > * ptransform_reference -> transform_id > * instruction_reference -> instruction_id > * process_bundle_descriptor_reference -> process_bundle_descriptor_id > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8313) Rename yyy_reference to yyy_id to be consistent across protos
[ https://issues.apache.org/jira/browse/BEAM-8313?focusedWorklogId=319124=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319124 ] ASF GitHub Bot logged work on BEAM-8313: Author: ASF GitHub Bot Created on: 26/Sep/19 18:19 Start Date: 26/Sep/19 18:19 Worklog Time Spent: 10m Work Description: lukecwik commented on issue #9668: [BEAM-8313] Follow up on PR comments for #9663 URL: https://github.com/apache/beam/pull/9668#issuecomment-535626437 Run Java PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319124) Time Spent: 1h 50m (was: 1h 40m) > Rename yyy_reference to yyy_id to be consistent across protos > - > > Key: BEAM-8313 > URL: https://issues.apache.org/jira/browse/BEAM-8313 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > > This will replace: > * ptransform_id -> transform_id > * ptransform_reference -> transform_id > * instruction_reference -> instruction_id > * process_bundle_descriptor_reference -> process_bundle_descriptor_id > -- This message was sent by Atlassian Jira (v8.3.4#803005)
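The rename listed above is purely mechanical, so the migration across protos and harness code amounts to a whole-identifier substitution. The following is an illustrative sketch only, not the actual tooling used for BEAM-8313; the `rename_fields` helper and `RENAMES` table are hypothetical names:

```python
import re

# Field renames from BEAM-8313 (old proto field name -> new name).
RENAMES = {
    'ptransform_id': 'transform_id',
    'ptransform_reference': 'transform_id',
    'instruction_reference': 'instruction_id',
    'process_bundle_descriptor_reference': 'process_bundle_descriptor_id',
}


def rename_fields(text):
  # Replace whole identifiers only; try the longest names first so that
  # e.g. 'ptransform_reference' is not partially matched by a shorter key.
  pattern = re.compile(
      r'\b(%s)\b' % '|'.join(
          sorted(map(re.escape, RENAMES), key=len, reverse=True)))
  return pattern.sub(lambda m: RENAMES[m.group(1)], text)
```

Applying it to a proto text snippet rewrites every old field name in one pass while leaving unrelated identifiers untouched.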
[jira] [Work logged] (BEAM-7495) Add support for dynamic worker re-balancing when reading BigQuery data using Cloud Dataflow
[ https://issues.apache.org/jira/browse/BEAM-7495?focusedWorklogId=319121=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319121 ] ASF GitHub Bot logged work on BEAM-7495: Author: ASF GitHub Bot Created on: 26/Sep/19 18:13 Start Date: 26/Sep/19 18:13 Worklog Time Spent: 10m Work Description: chamikaramj commented on issue #9083: [BEAM-7495] Improve the test that compares EXPORT and DIRECT_READ URL: https://github.com/apache/beam/pull/9083#issuecomment-535624269 Looks like we lost the logs so not sure failures are related. Trying again. Run Java PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319121) Remaining Estimate: 490h 50m (was: 491h) Time Spent: 13h 10m (was: 13h) > Add support for dynamic worker re-balancing when reading BigQuery data using > Cloud Dataflow > --- > > Key: BEAM-7495 > URL: https://issues.apache.org/jira/browse/BEAM-7495 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Aryan Naraghi >Assignee: Aryan Naraghi >Priority: Major > Original Estimate: 504h > Time Spent: 13h 10m > Remaining Estimate: 490h 50m > > Currently, the BigQuery connector for reading data using the BigQuery Storage > API does not support any of the facilities on the source for Dataflow to > split streams. > > On the server side, the BigQuery Storage API supports splitting streams at a > fraction. By adding support to the connector, we enable Dataflow to split > streams, which unlocks dynamic worker re-balancing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
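For context, the fraction-based split that the BigQuery Storage API supports boils down to cutting a position range at a given fraction, yielding a primary piece the current worker keeps and a residual piece another worker can pick up. A minimal sketch of that arithmetic (the `split_at_fraction` name is hypothetical, not the connector's API):

```python
def split_at_fraction(start, end, fraction):
  """Split the half-open range [start, end) at the given fraction.

  Returns (primary, residual) sub-ranges, or None if the split point
  would leave an empty piece on either side (in which case the split
  request should be rejected).
  """
  split = start + int((end - start) * fraction)
  if split <= start or split >= end:
    return None  # refuse splits that produce an empty sub-range
  return (start, split), (split, end)
```

Dynamic work rebalancing relies on the runner being able to issue such splits while a read is in progress, which is exactly the facility the connector was missing.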
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319110&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319110 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 18:02 Start Date: 26/Sep/19 18:02 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328751006

## File path: sdks/python/apache_beam/runners/worker/sdk_worker.py
## @@ -620,6 +632,107 @@ def _next_id(self):
    return str(self._last_id)

class CachingMaterializingStateHandler(object):
  """ A State handler which retrieves and caches state. """

  def __init__(self, global_state_cache, underlying_state):
    self._underlying = underlying_state
    self._state_cache = global_state_cache
    self._context = threading.local()

  @contextlib.contextmanager
  def process_instruction_id(self, bundle_id, cache_tokens):
    if getattr(self._context, 'cache_token', None) is not None:
      raise RuntimeError(
          'Cache tokens already set to %s' % self._context.cache_token)
    # TODO Also handle cache tokens for side input, if present:
    # https://issues.apache.org/jira/browse/BEAM-8298
    user_state_cache_token = None
    for cache_token_struct in cache_tokens:
      if cache_token_struct.HasField("user_state"):
        # There should only be one user state token present
        assert not user_state_cache_token
        user_state_cache_token = cache_token_struct.token
    try:
      self._context.cache_token = user_state_cache_token
      with self._underlying.process_instruction_id(bundle_id):
        yield
    finally:
      self._context.cache_token = None

  def blocking_get(self, state_key, coder, is_cached=False):
    if not self._should_be_cached(is_cached):
      # no cache / tokens, can't do a lookup/store in the cache
      return self._materialize_iter(state_key, coder)
    # Cache lookup
    cache_state_key = self._convert_to_cache_key(state_key)
    cached_value = self._state_cache.get(cache_state_key,
                                         self._context.cache_token)
    if cached_value is None:
      # Cache miss, need to retrieve from the Runner
      materialized = cached_value = list(
          self._materialize_iter(state_key, coder))
      self._state_cache.put(
          cache_state_key,
          self._context.cache_token,
          materialized)
    return iter(cached_value)

  def append(self, state_key, coder, elements, is_cached=False):
    if self._should_be_cached(is_cached):
      # Update the cache
      cache_key = self._convert_to_cache_key(state_key)
      self._state_cache.append(cache_key, self._context.cache_token, elements)
    # Write to state handler
    out = coder_impl.create_OutputStream()
    for element in elements:
      coder.encode_to_stream(element, out, True)
    return self._underlying.append_raw(state_key, out.get())

  def clear(self, state_key, is_cached=False):
    if self._should_be_cached(is_cached):
      cache_key = self._convert_to_cache_key(state_key)
      self._state_cache.clear(cache_key, self._context.cache_token)
    return self._underlying.clear(state_key)

  # The following methods are for interaction with the FnApiRunner:

  def get_raw(self, state_key, continuation_token=None):
    return self._underlying.get_raw(state_key, continuation_token)

  def append_raw(self, state_key, data):
    return self._underlying.append_raw(state_key, data)

  def restore(self):
    self._underlying.restore()

  def checkpoint(self):

Review comment: Note, these and the above until the comment will go away (fix will come tomorrow), as they won't be necessary with the cache not being inserted for the state handler of the fn_api_runner. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319110) Time Spent: 23h 50m (was: 23h 40m) > Implement cross-bundle state caching.
> - > > Key: BEAM-5428 > URL: https://issues.apache.org/jira/browse/BEAM-5428 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: Robert Bradshaw >Assignee: Maximilian Michels >Priority: Major > Time Spent: 23h 50m > Remaining Estimate: 0h > > Tech spec: >
[jira] [Created] (BEAM-8316) What is corresponding data type to set in UDF parameter to match RecordType
Yang Zhang created BEAM-8316: Summary: What is corresponding data type to set in UDF parameter to match RecordType Key: BEAM-8316 URL: https://issues.apache.org/jira/browse/BEAM-8316 Project: Beam Issue Type: Bug Components: beam-model Affects Versions: 2.15.0 Reporter: Yang Zhang Hello Beam community, I want to have a UDF that takes a record as input. Per the error info shown below, the input is *RecordType*, but what should I set as the UDF parameter type so that Beam would not complain about the type compatibility? Below is the full error trace. Thank you very much! === error trace === Exception in thread "main" org.apache.beam.sdk.extensions.sql.impl.ParseException: Unable to parse query select fooudf(pv.header) from kafka.tracking.PageViewEvent as pv Exception in thread "main" org.apache.beam.sdk.extensions.sql.impl.ParseException: Unable to parse query select fooudf(pv.header) from kafka.tracking.PageViewEvent as pv at org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner.convertToBeamRel(CalciteQueryPlanner.java:165) at org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery(BeamSqlEnv.java:103) at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:124) at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:82) at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:539) at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:473) at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:44) at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:169) at com.linkedin.samza.sql.engine.BeamSqlEntry.preparePipeline(BeamSqlEntry.java:52) at com.linkedin.samza.sql.engine.BeamSqlEntry.exec(BeamSqlEntry.java:41) at com.linkedin.samza.sql.engine.BeamSqlUI.main(BeamSqlUI.java:33) Caused by: org.apache.calcite.tools.ValidationException: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 24: No match found for function signature fooudf() at
org.apache.calcite.prepare.PlannerImpl.validate(PlannerImpl.java:190) at org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner.convertToBeamRel(CalciteQueryPlanner.java:136) ... 10 moreCaused by: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 24: No match found for function signature fooudf() at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.calcite.runtime.Resources$ExInstWithCause.ex(Resources.java:463) at org.apache.calcite.sql.SqlUtil.newContextException(SqlUtil.java:787) at org.apache.calcite.sql.SqlUtil.newContextException(SqlUtil.java:772) at org.apache.calcite.sql.validate.SqlValidatorImpl.newValidationError(SqlValidatorImpl.java:4825) at org.apache.calcite.sql.validate.SqlValidatorImpl.handleUnresolvedFunction(SqlValidatorImpl.java:1739) at org.apache.calcite.sql.SqlFunction.deriveType(SqlFunction.java:270) at org.apache.calcite.sql.SqlFunction.deriveType(SqlFunction.java:215) at org.apache.calcite.sql.validate.SqlValidatorImpl$DeriveTypeVisitor.visit(SqlValidatorImpl.java:5584) at org.apache.calcite.sql.validate.SqlValidatorImpl$DeriveTypeVisitor.visit(SqlValidatorImpl.java:5571) at org.apache.calcite.sql.SqlCall.accept(SqlCall.java:138) at org.apache.calcite.sql.validate.SqlValidatorImpl.deriveTypeImpl(SqlValidatorImpl.java:1657) at org.apache.calcite.sql.validate.SqlValidatorImpl.deriveType(SqlValidatorImpl.java:1642) at org.apache.calcite.sql.validate.SqlValidatorImpl.expandSelectItem(SqlValidatorImpl.java:462) at org.apache.calcite.sql.validate.SqlValidatorImpl.validateSelectList(SqlValidatorImpl.java:4089) at org.apache.calcite.sql.validate.SqlValidatorImpl.validateSelect(SqlValidatorImpl.java:3352) at 
org.apache.calcite.sql.validate.SelectNamespace.validateImpl(SelectNamespace.java:60) at org.apache.calcite.sql.validate.AbstractNamespace.validate(AbstractNamespace.java:84) at org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace(SqlValidatorImpl.java:994) at org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery(SqlValidatorImpl.java:954) at org.apache.calcite.sql.SqlSelect.validate(SqlSelect.java:216) at org.apache.calcite.sql.validate.SqlValidatorImpl.validateScopedExpression(SqlValidatorImpl.java:929) at org.apache.calcite.sql.validate.SqlValidatorImpl.validate(SqlValidatorImpl.java:633) at org.apache.calcite.prepare.PlannerImpl.validate(PlannerImpl.java:188) ... 11 moreCaused by: org.apache.calcite.sql.validate.SqlValidatorException: No match found for
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319093=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319093 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 17:15 Start Date: 26/Sep/19 17:15 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328730529 ## File path: sdks/python/apache_beam/runners/portability/fn_api_runner.py ## @@ -1412,11 +1470,13 @@ def stop_worker(self): class WorkerHandlerManager(object): - def __init__(self, environments, job_provision_info): + def __init__(self, environments, job_provision_info, state_cache_size): self._environments = environments self._job_provision_info = job_provision_info self._cached_handlers = collections.defaultdict(list) -self._state = FnApiRunner.StateServicer() # rename? +self._state = sdk_worker.CachingMaterializingStateHandler( +StateCache(state_cache_size), Review comment: I added this because the WorkerHandlerManager will insert the state handler into the WorkerHandlerFactory, which generates a cached BundleProcessorCache with that state handler for the EmbeddedWorkerHandler. It is not necessary otherwise and just does unnecessary caching on the FnApiRunner side. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319093) Time Spent: 23h 40m (was: 23.5h) > Implement cross-bundle state caching. 
> - > > Key: BEAM-5428 > URL: https://issues.apache.org/jira/browse/BEAM-5428 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: Robert Bradshaw >Assignee: Maximilian Michels >Priority: Major > Time Spent: 23h 40m > Remaining Estimate: 0h > > Tech spec: > [https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m] > Relevant document: > [https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit#|https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit] > Mailing list link: > [https://lists.apache.org/thread.html/caa8d9bc6ca871d13de2c5e6ba07fdc76f85d26497d95d90893aa1f6@%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
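As a rough illustration of the caching scheme discussed in this thread (class and method names here are hypothetical, not the SDK's actual implementation): entries are keyed by both the state key and the runner-issued cache token, so when the runner hands out a fresh token, entries stored under the old token simply miss instead of serving stale state.

```python
class TokenScopedStateCache(object):
  """Toy cross-bundle state cache keyed by (state_key, cache_token).

  The runner issues a new cache token whenever previously cached state
  may have become invalid; lookups under the new token then fall back
  to fetching state from the runner.
  """

  def __init__(self):
    self._entries = {}

  def get(self, state_key, token):
    # Returns None on a miss, including any lookup with a stale token.
    return self._entries.get((state_key, token))

  def put(self, state_key, token, value):
    self._entries[(state_key, token)] = value
```

A lookup with the token a value was stored under hits; the same key under a newer token misses, which is the invalidation mechanism in miniature.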
[jira] [Work logged] (BEAM-8313) Rename yyy_reference to yyy_id to be consistent across protos
[ https://issues.apache.org/jira/browse/BEAM-8313?focusedWorklogId=319092=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319092 ] ASF GitHub Bot logged work on BEAM-8313: Author: ASF GitHub Bot Created on: 26/Sep/19 17:10 Start Date: 26/Sep/19 17:10 Worklog Time Spent: 10m Work Description: lukecwik commented on issue #9668: [BEAM-8313] Follow up on PR comments for #9663 URL: https://github.com/apache/beam/pull/9668#issuecomment-535599328 R: @lostluck This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319092) Time Spent: 1h 40m (was: 1.5h) > Rename yyy_reference to yyy_id to be consistent across protos > - > > Key: BEAM-8313 > URL: https://issues.apache.org/jira/browse/BEAM-8313 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > This will replace: > * ptransform_id -> transform_id > * ptransform_reference -> transform_id > * instruction_reference -> instruction_id > * process_bundle_descriptor_reference -> process_bundle_descriptor_id > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8313) Rename yyy_reference to yyy_id to be consistent across protos
[ https://issues.apache.org/jira/browse/BEAM-8313?focusedWorklogId=319091=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319091 ] ASF GitHub Bot logged work on BEAM-8313: Author: ASF GitHub Bot Created on: 26/Sep/19 17:10 Start Date: 26/Sep/19 17:10 Worklog Time Spent: 10m Work Description: lukecwik commented on pull request #9668: [BEAM-8313] Follow up on PR comments for #9663 URL: https://github.com/apache/beam/pull/9668 Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). 
[jira] [Work logged] (BEAM-7389) Colab examples for element-wise transforms (Python)
[ https://issues.apache.org/jira/browse/BEAM-7389?focusedWorklogId=319087=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319087 ] ASF GitHub Bot logged work on BEAM-7389: Author: ASF GitHub Bot Created on: 26/Sep/19 17:07 Start Date: 26/Sep/19 17:07 Worklog Time Spent: 10m Work Description: davidcavazos commented on issue #9664: [BEAM-7389] Created code files to match doc filenames URL: https://github.com/apache/beam/pull/9664#issuecomment-535597904 Yes, will do. Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319087) Time Spent: 60h 50m (was: 60h 40m) > Colab examples for element-wise transforms (Python) > --- > > Key: BEAM-7389 > URL: https://issues.apache.org/jira/browse/BEAM-7389 > Project: Beam > Issue Type: Improvement > Components: website >Reporter: Rose Nguyen >Assignee: David Cavazos >Priority: Minor > Time Spent: 60h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8314) Beam Fn Api metrics piling causes pipeline to stuck after running for a while
[ https://issues.apache.org/jira/browse/BEAM-8314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8314: -- Attachment: E4UaSUhJJKF.png > Beam Fn Api metrics piling causes pipeline to stuck after running for a while > - > > Key: BEAM-8314 > URL: https://issues.apache.org/jira/browse/BEAM-8314 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Yichi Zhang >Priority: Blocker > Fix For: 2.16.0 > > Attachments: E4UaSUhJJKF.png > > > Seems that in StreamingDataflowWorker we are not able to update the metrics > fast enough to dataflow service, the piling metrics causes memory usage to > increase and eventually leads to excessive memory thrashing/GC. And it will > almost stop the pipeline from processing new items. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8314) Beam Fn Api metrics piling causes pipeline to stuck after running for a while
[ https://issues.apache.org/jira/browse/BEAM-8314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8314: -- Description: Seems that in StreamingDataflowWorker we are not able to update the metrics fast enough to dataflow service, the piling metrics causes memory usage to increase and eventually leads to excessive memory thrashing/GC. And it will almost stop the pipeline from processing new items. was: Seems that in StreamingDataflowWorker we are not able to update the metrics fast enough to dataflow service, the piling metrics causes memory usage to increase and eventually leads to excessive memory thrashing/GC. And it will almost stop the pipeline from processing new items. https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2019-09-25_17_47_17-1934265625846707281?project=google.com:clouddfe > Beam Fn Api metrics piling causes pipeline to stuck after running for a while > - > > Key: BEAM-8314 > URL: https://issues.apache.org/jira/browse/BEAM-8314 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Yichi Zhang >Priority: Blocker > Fix For: 2.16.0 > > Attachments: E4UaSUhJJKF.png > > > Seems that in StreamingDataflowWorker we are not able to update the metrics > fast enough to dataflow service, the piling metrics causes memory usage to > increase and eventually leads to excessive memory thrashing/GC. And it will > almost stop the pipeline from processing new items. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8314) Beam Fn Api metrics piling causes pipeline to stuck after running for a while
[ https://issues.apache.org/jira/browse/BEAM-8314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8314: -- Description: Seems that in StreamingDataflowWorker we are not able to update the metrics fast enough to dataflow service, the piling metrics causes memory usage to increase and eventually leads to excessive memory thrashing/GC. And it will almost stop the pipeline from processing new items. !E4UaSUhJJKF.png! was: Seems that in StreamingDataflowWorker we are not able to update the metrics fast enough to dataflow service, the piling metrics causes memory usage to increase and eventually leads to excessive memory thrashing/GC. And it will almost stop the pipeline from processing new items. > Beam Fn Api metrics piling causes pipeline to stuck after running for a while > - > > Key: BEAM-8314 > URL: https://issues.apache.org/jira/browse/BEAM-8314 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Yichi Zhang >Priority: Blocker > Fix For: 2.16.0 > > Attachments: E4UaSUhJJKF.png > > > Seems that in StreamingDataflowWorker we are not able to update the metrics > fast enough to dataflow service, the piling metrics causes memory usage to > increase and eventually leads to excessive memory thrashing/GC. And it will > almost stop the pipeline from processing new items. > > !E4UaSUhJJKF.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
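One conceivable mitigation for this kind of backlog (an illustrative sketch only, not the actual StreamingDataflowWorker fix; all names here are hypothetical) is to coalesce pending updates so that only the latest cumulative value per counter is buffered, keeping memory bounded by the number of distinct counters rather than by how far reporting has fallen behind:

```python
class CoalescingMetricBuffer(object):
  """Buffers metric updates for the service, keeping only the latest
  cumulative value per counter so the buffer cannot pile up unboundedly
  when reporting falls behind."""

  def __init__(self):
    self._latest = {}

  def update(self, counter_name, cumulative_value):
    # Overwrites any pending value for this counter instead of queueing.
    self._latest[counter_name] = cumulative_value

  def drain(self):
    # Hand the coalesced batch to the reporter and reset the buffer.
    updates, self._latest = self._latest, {}
    return updates
```

With cumulative (rather than delta) values, dropping intermediate updates loses no information, since the most recent value subsumes the earlier ones.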
[jira] [Work logged] (BEAM-8315) Remove unused fields for splitting SDFs in beam_fn_api.proto
[ https://issues.apache.org/jira/browse/BEAM-8315?focusedWorklogId=319080=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319080 ] ASF GitHub Bot logged work on BEAM-8315: Author: ASF GitHub Bot Created on: 26/Sep/19 16:52 Start Date: 26/Sep/19 16:52 Worklog Time Spent: 10m Work Description: lukecwik commented on issue #9666: [BEAM-8315] Clean-up unused fields for splitting SDFs URL: https://github.com/apache/beam/pull/9666#issuecomment-535591932 R: @boyuanzz This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319080) Time Spent: 20m (was: 10m) > Remove unused fields for splitting SDFs in beam_fn_api.proto > > > Key: BEAM-8315 > URL: https://issues.apache.org/jira/browse/BEAM-8315 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8315) Remove unused fields for splitting SDFs in beam_fn_api.proto
[ https://issues.apache.org/jira/browse/BEAM-8315?focusedWorklogId=319078=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319078 ] ASF GitHub Bot logged work on BEAM-8315: Author: ASF GitHub Bot Created on: 26/Sep/19 16:52 Start Date: 26/Sep/19 16:52 Worklog Time Spent: 10m Work Description: lukecwik commented on pull request #9666: [BEAM-8315] Clean-up unused fields for splitting SDFs URL: https://github.com/apache/beam/pull/9666 I kept output watermark around since this is expected to be reported for streaming SDFs.
[jira] [Created] (BEAM-8315) Remove unused fields for splitting SDFs in beam_fn_api.proto
Luke Cwik created BEAM-8315: --- Summary: Remove unused fields for splitting SDFs in beam_fn_api.proto Key: BEAM-8315 URL: https://issues.apache.org/jira/browse/BEAM-8315 Project: Beam Issue Type: Sub-task Components: beam-model Reporter: Luke Cwik Assignee: Luke Cwik -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8315) Remove unused fields for splitting SDFs in beam_fn_api.proto
[ https://issues.apache.org/jira/browse/BEAM-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik updated BEAM-8315: Status: Open (was: Triage Needed) > Remove unused fields for splitting SDFs in beam_fn_api.proto > > > Key: BEAM-8315 > URL: https://issues.apache.org/jira/browse/BEAM-8315 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook
[ https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=319068=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319068 ] ASF GitHub Bot logged work on BEAM-7760: Author: ASF GitHub Bot Created on: 26/Sep/19 16:34 Start Date: 26/Sep/19 16:34 Worklog Time Spent: 10m Work Description: KevinGG commented on issue #9619: [BEAM-7760] Added pipeline_instrument module URL: https://github.com/apache/beam/pull/9619#issuecomment-535585079 @aaltay Thanks for the quick response! I'm currently on call till next Tuesday. I'll get back to the PR as soon as I'm off duty. Thank you very much for the detailed review! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319068) Time Spent: 11h 20m (was: 11h 10m) > Interactive Beam Caching PCollections bound to user defined vars in notebook > > > Key: BEAM-7760 > URL: https://issues.apache.org/jira/browse/BEAM-7760 > Project: Beam > Issue Type: New Feature > Components: examples-python >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 11h 20m > Remaining Estimate: 0h > > Cache only PCollections bound to user-defined variables in a pipeline when > running the pipeline with the interactive runner in Jupyter notebooks. > [Interactive Beam|https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive] > has been caching and using caches of "leaf" PCollections for interactive > execution in Jupyter notebooks. > The interactive execution is currently supported so that when appending new > transforms to an existing pipeline for a new run, the already-executed part of the > pipeline doesn't need to be re-executed. > A PCollection is a "leaf" when it is never used as input to any PTransform in > the pipeline. 
> The problem with building the caches and the pipeline to execute around "leaf" > PCollections is that when a PCollection is consumed by a sink with no output, the > built pipeline to execute will miss the subgraph generating and consuming that > PCollection. > For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty > pipeline. > Caching around PCollections bound to user-defined variables and replacing > transforms with the source and sink of caches could resolve the pipeline to > execute properly under the interactive execution scenario. Also, a cached > PCollection can now be traced back to user code and can be used for user data > visualization if the user wants to do it. > E.g., > {code:python} > # ... > p = beam.Pipeline(interactive_runner.InteractiveRunner(), > options=pipeline_options) > messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...') > messages | "Write" >> beam.io.WriteToPubSub(topic_path) > result = p.run() > # ... > visualize(messages){code} > The interactive runner automatically figures out that the PCollection > {code:python} > messages{code} > created by > {code:python} > p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code} > should be cached and reused if the notebook user appends more transforms. > And once the pipeline gets executed, the user could use any > visualize(PCollection) module to visualize the data statically (batch) or > dynamically (stream) -- This message was sent by Atlassian Jira (v8.3.4#803005)
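A rough sketch of what "PCollections bound to user-defined variables" means (hypothetical helper and class names, not Interactive Beam's actual API): scan a user namespace and keep only the names that reference a PCollection.

```python
class PCollectionStub(object):
    """Stand-in for apache_beam.pvalue.PCollection in this sketch."""

    def __init__(self, label):
        self.label = label

def pcollections_bound_to_vars(namespace):
    """Return {variable name: PCollection} for user-defined bindings.

    Underscore-prefixed names are treated as non-user-defined here;
    the real heuristics in Interactive Beam may differ.
    """
    return {
        name: value
        for name, value in namespace.items()
        if isinstance(value, PCollectionStub) and not name.startswith('_')
    }

# Usage: only `messages` qualifies for caching; the underscore-prefixed
# binding and unrelated variables are skipped.
namespace = {
    'messages': PCollectionStub('Read'),
    '_hidden': PCollectionStub('internal'),
    'count': 42,
}
bound = pcollections_bound_to_vars(namespace)
```

Keying the cache on such bindings is what lets the runner replace already-computed subgraphs with cache reads even when a PCollection feeds only a sink.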
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319064=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319064 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:21 Start Date: 26/Sep/19 16:21 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() throws Exception {
  if (producer != null && producer.getOutstandingRecordsCount() > 0) {
    producer.flushSync();
  }
  producer = null;
}
```
and change `new UnsupportedOperationException` in `KinesisProducerMock.flushSync()` to just a `flush()` call. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319064) Time Spent: 4h (was: 3h 50m) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 4h > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319063=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319063 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:21 Start Date: 26/Sep/19 16:21 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() throws Exception {
  if (producer != null && producer.getOutstandingRecordsCount() > 0) {
    producer.flushSync();
  }
  producer = null;
}
```
and change `UnsupportedOperationException` in `KinesisProducerMock.flushSync()` to just a `flush()` call. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319063) Time Spent: 3h 50m (was: 3h 40m) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 3h 50m > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319062=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319062 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:19 Start Date: 26/Sep/19 16:19 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() throws Exception {
  if (producer != null && producer.getOutstandingRecordsCount() > 0) {
    producer.flushSync();
  }
  producer = null;
}
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319062) Time Spent: 3h 40m (was: 3.5h) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 3h 40m > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > 
org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319061=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319061 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:18 Start Date: 26/Sep/19 16:18 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() throws Exception {
  if (producer != null) {
    if (producer.getOutstandingRecordsCount() > 0) {
      producer.flushSync();
    }
    producer = null;
  }
}
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319061) Time Spent: 3.5h (was: 3h 20m) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 3.5h > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > 
org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319060=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319060 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:16 Start Date: 26/Sep/19 16:16 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() throws Exception {
  if (producer != null) {
    producer.flushSync();
    producer = null;
  }
}
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319060) Time Spent: 3h 20m (was: 3h 10m) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 3h 20m > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > 
org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319053=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319053 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:07 Start Date: 26/Sep/19 16:07 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() throws Exception {
  producer = null;
}
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319053) Time Spent: 2h 50m (was: 2h 40m) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 2h 50m > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > 
org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319055=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319055 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:07 Start Date: 26/Sep/19 16:07 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() {
  producer = null;
}
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319055) Time Spent: 3h 10m (was: 3h) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 3h 10m > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319054=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319054 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:07 Start Date: 26/Sep/19 16:07 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() throws Exception {
  producer = null;
}
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319054) Time Spent: 3h (was: 2h 50m) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 3h > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319037&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319037 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:20 Start Date: 26/Sep/19 15:20 Worklog Time Spent: 10m Work Description: tweise commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328676138

File path: sdks/python/apache_beam/runners/worker/statecache.py (new file, @@ -0,0 +1,122 @@)

```python
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"""A module for caching state reads/writes in Beam applications."""
from __future__ import absolute_import

import collections
import logging
from threading import Lock


class StateCache(object):
  """Cache for Beam state access, scoped by state key and cache_token.

  For a given state_key, caches a (cache_token, value) tuple and allows to
  a) read from the cache,
     if the currently stored cache_token matches the provided
  b) write to the cache,
     storing the new value alongside with a cache token
  c) append to the cache,
     if the currently stored cache_token matches the provided

  The operations on the cache are thread-safe for use by multiple workers.

  :arg max_entries The maximum number of entries to store in the cache.
  TODO Memory-based caching: https://issues.apache.org/jira/browse/BEAM-8297
  """

  def __init__(self, max_entries):
    logging.info('Creating state cache with size %s', max_entries)
    self._cache = self.LRUCache(max_entries, (None, None))
    self._lock = Lock()

  def get(self, state_key, cache_token):
    assert cache_token and self.is_cache_enabled()
    with self._lock:
      token, value = self._cache.get(state_key)
      return value if token == cache_token else None

  def put(self, state_key, cache_token, value):
    assert cache_token and self.is_cache_enabled()
    with self._lock:
      return self._cache.put(state_key, (cache_token, value))

  def append(self, state_key, cache_token, elements):
    assert cache_token and self.is_cache_enabled()
    with self._lock:
      token, value = self._cache.get(state_key)
      if token in [cache_token, None]:
        if value is None:
          value = []
        value.extend(elements)
        self._cache.put(state_key, (cache_token, value))
      else:
        # Discard cached state if tokens do not match
        self._cache.evict(state_key)

  def clear(self, state_key, cache_token):
    assert cache_token and self.is_cache_enabled()
    with self._lock:
      token, _ = self._cache.get(state_key)
      if token in [cache_token, None]:
        self._cache.put(state_key, (cache_token, []))
      else:
        # Discard cached state if tokens do not match
        self._cache.evict(state_key)

  def evict(self, state_key):
    assert self.is_cache_enabled()
    with self._lock:
      self._cache.evict(state_key)

  def evict_all(self):
    with self._lock:
      self._cache.evict_all()

  def is_cache_enabled(self):
    return self._cache._max_entries > 0

  def __len__(self):
    return len(self._cache)

  class LRUCache(object):

    def __init__(self, max_entries, default_entry):
      self._max_entries = max_entries
      self._default_entry = default_entry
      self._cache = collections.OrderedDict()

    def get(self, key):
      value = self._cache.pop(key, self._default_entry)
      if value != self._default_entry:
        self._cache[key] = value
      return value

    def put(self, key, value):
      self._cache[key] = value
      while len(self._cache) > self._max_entries:
        self._cache.popitem(last=False)

    def evict(self, key):
      self._cache.pop(key, self._default_entry)

    def evict_all(self):
      self._cache.clear()
```

Review comment (on `evict_all`): nit: would `clear` be the better name for this?
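The nested LRUCache above is built on `collections.OrderedDict`: `get()` re-inserts a hit so it moves to the most-recently-used position, and `put()` trims from the front with `popitem(last=False)` once capacity is exceeded. A minimal standalone sketch of the same technique, assuming a capacity of two entries for illustration:

```python
import collections


class LRUCache:
    """Minimal LRU cache using the OrderedDict technique from the review:
    get() re-inserts the key to mark it most-recently-used, put() evicts
    the oldest entry (popitem(last=False)) once capacity is exceeded."""

    def __init__(self, max_entries):
        self._max_entries = max_entries
        self._cache = collections.OrderedDict()

    def get(self, key, default=None):
        if key not in self._cache:
            return default
        value = self._cache.pop(key)
        self._cache[key] = value  # move to most-recently-used position
        return value

    def put(self, key, value):
        self._cache[key] = value
        while len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)  # evict least-recently-used


cache = LRUCache(2)
cache.put('a', 1)
cache.put('b', 2)
cache.get('a')     # touch 'a', so 'b' is now least recently used
cache.put('c', 3)  # exceeds capacity: evicts 'b', not 'a'
```

The same re-insert-on-read trick is what makes the Beam cache's `get` an O(1) recency update.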
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319036&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319036 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:16 Start Date: 26/Sep/19 15:16 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328673772

File path: sdks/python/apache_beam/runners/worker/bundle_processor.py

```diff
@@ -430,13 +424,13 @@ def clear(self):
   def _commit(self):
     if self._cleared:
-      self._state_handler.blocking_clear(self._state_key)
+      self._state_handler.clear(self._state_key, is_cached=True).get()
     if self._added_elements:
-      value_coder_impl = self._value_coder.get_impl()
-      out = coder_impl.create_OutputStream()
-      for element in self._added_elements:
-        value_coder_impl.encode_to_stream(element, out, True)
-      self._state_handler.blocking_append(self._state_key, out.get())
+      self._state_handler.append(
```

Review comment: Will consolidate with the above.

Issue Time Tracking --- Worklog Id: (was: 319036) Time Spent: 23h 20m (was: 23h 10m)
> Implement cross-bundle state caching.
> - > > Key: BEAM-5428 > URL: https://issues.apache.org/jira/browse/BEAM-5428 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: Robert Bradshaw >Assignee: Maximilian Michels >Priority: Major > Time Spent: 23h 20m > Remaining Estimate: 0h > > Tech spec: > [https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m] > Relevant document: > [https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit#|https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit] > Mailing list link: > [https://lists.apache.org/thread.html/caa8d9bc6ca871d13de2c5e6ba07fdc76f85d26497d95d90893aa1f6@%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319033&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319033 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:14 Start Date: 26/Sep/19 15:14 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328673182

File path: sdks/python/apache_beam/runners/worker/sdk_worker_main.py

```diff
@@ -205,6 +206,28 @@ def _get_worker_count(pipeline_options):
   return 12


+def _get_state_cache_size(pipeline_options):
+  """Defines the upper number of state items to cache.
+
+  Note: state_cache_size is an experimental flag and might not be available in
+  future releases.
+
+  Returns:
+    an int indicating the maximum number of items to cache.
+    Default is 0 (disabled)
+  """
+  experiments = pipeline_options.view_as(DebugOptions).experiments
+  experiments = experiments if experiments else []
+
+  for experiment in experiments:
+    # There should only be 1 match so returning from the loop
+    if re.match(r'state_cache_size=', experiment):
+      return int(
+          re.match(r'state_cache_size=(?P<state_cache_size>.*)',
+                   experiment).group('state_cache_size'))
+  return 100
```

Review comment: Good catch, this was actually pending in my branch.

Issue Time Tracking --- Worklog Id: (was: 319033) Time Spent: 23h 10m (was: 23h)
> Implement cross-bundle state caching.
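The experiment-flag lookup above can be exercised in isolation. A sketch of the same regex-based parsing (the standalone function name is an assumption here, and the default of 0 reflects the value under discussion in the review):

```python
import re


def get_state_cache_size(experiments, default=0):
    """Parse 'state_cache_size=N' out of an experiments list, mirroring the
    lookup in sdk_worker_main; returns `default` when the flag is absent."""
    for experiment in experiments or []:
        # There should only be one match, so return from inside the loop.
        match = re.match(r'state_cache_size=(?P<state_cache_size>.*)', experiment)
        if match:
            return int(match.group('state_cache_size'))
    return default
```

Using `or []` keeps the function safe when `experiments` is None, which matches the `experiments if experiments else []` guard in the diff.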
[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python
[ https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=319032&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319032 ] ASF GitHub Bot logged work on BEAM-3342: Author: ASF GitHub Bot Created on: 26/Sep/19 15:14 Start Date: 26/Sep/19 15:14 Worklog Time Spent: 10m Work Description: chamikaramj commented on pull request #8457: [BEAM-3342] Create a Cloud Bigtable IO connector for Python URL: https://github.com/apache/beam/pull/8457#discussion_r327778002

File path: sdks/python/apache_beam/io/gcp/bigtableio.py

```diff
@@ -122,22 +126,145 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
   """

-  def __init__(self, project_id=None, instance_id=None,
-               table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
     """ The PTransform to access the Bigtable Write connector

     Args:
       project_id(str): GCP Project of to write the Rows
       instance_id(str): GCP Instance to write the Rows
       table_id(str): GCP Table to write the `DirectRows`
     """
     super(WriteToBigTable, self).__init__()
-    self.beam_options = {'project_id': project_id,
-                         'instance_id': instance_id,
-                         'table_id': table_id}
+    self._beam_options = {'project_id': project_id,
+                          'instance_id': instance_id,
+                          'table_id': table_id}

   def expand(self, pvalue):
-    beam_options = self.beam_options
+    beam_options = self._beam_options
     return (pvalue
             | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
                                           beam_options['instance_id'],
                                           beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+    project_id(str): GCP Project ID
+    instance_id(str): GCP Instance ID
+    table_id(str): GCP Table ID
+  """
+
+  def __init__(self, project_id, instance_id, table_id, filter_=b''):
+    """ Constructor of the Read connector of Bigtable
+
+    Args:
+      project_id: [str] GCP Project of to write the Rows
+      instance_id: [str] GCP Instance to write the Rows
+      table_id: [str] GCP Table to write the `DirectRows`
+      filter_: [RowFilter] Filter to apply to columns in a row.
+    """
+    super(self.__class__, self).__init__()
+    self._initialize({'project_id': project_id,
+                      'instance_id': instance_id,
+                      'table_id': table_id,
+                      'filter_': filter_})
+
+  def __getstate__(self):
+    return self._beam_options
+
+  def __setstate__(self, options):
+    self._initialize(options)
+
+  def _initialize(self, options):
+    self._beam_options = options
+    self.table = None
+    self.sample_row_keys = None
+    self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
+
+  def start_bundle(self):
+    if self.table is None:
+      self.table = Client(project=self._beam_options['project_id'])\
```

Review comment: nit: PEP 8 recommends using parenthesis for formatting instead of backslashes.

Issue Time Tracking --- Worklog Id: (was: 319032) Time Spent: 40h 40m (was: 40.5h)
> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Solomon Duskis
> Assignee: Solomon Duskis
> Priority: Major
> Time Spent: 40h 40m
> Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
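The `_initialize`/`__getstate__`/`__setstate__` trio in the diff above keeps only the plain options dict picklable, while per-bundle resources (the table client, metrics) are dropped in transit and rebuilt on the worker. A standalone sketch of that serialization pattern (class and attribute names are hypothetical, and a tuple stands in for the real Bigtable client):

```python
import pickle


class ReadFnSketch:
    """Only the options dict survives pickling; the connection-like
    attribute is reset to None and lazily re-created per bundle."""

    def __init__(self, project_id, table_id):
        self._initialize({'project_id': project_id, 'table_id': table_id})

    def _initialize(self, options):
        self._options = options
        self.table = None  # re-created in start_bundle, never pickled

    def __getstate__(self):
        # Pickle only the serializable configuration.
        return self._options

    def __setstate__(self, options):
        # Rebuild transient state from the configuration on the worker.
        self._initialize(options)

    def start_bundle(self):
        if self.table is None:
            # Stand-in for constructing the real client connection.
            self.table = ('connection', self._options['table_id'])


fn = ReadFnSketch('my-project', 'my-table')
fn.start_bundle()
clone = pickle.loads(pickle.dumps(fn))  # the open "connection" is dropped
```

This is why the DoFn can hold an unpicklable client: the client never travels through serialization, only the options that recreate it do.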
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319031&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319031 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:14 Start Date: 26/Sep/19 15:14 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328672702

File path: sdks/python/apache_beam/runners/worker/bundle_processor.py

```diff
@@ -430,13 +424,13 @@ def clear(self):
   def _commit(self):
     if self._cleared:
-      self._state_handler.blocking_clear(self._state_key)
+      self._state_handler.clear(self._state_key, is_cached=True).get()
```

Review comment: Yes, and no. We need to wait for the last response here. For example, if there is no following append, we'll have to wait on the clear. So it is best to save the last returned future in a variable and call `.get()` on it here.

Issue Time Tracking --- Worklog Id: (was: 319031) Time Spent: 23h (was: 22h 50m)
> Implement cross-bundle state caching.
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319030&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319030 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:13 Start Date: 26/Sep/19 15:13 Worklog Time Spent: 10m Work Description: tweise commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328672482

File path: sdks/python/apache_beam/runners/worker/sdk_worker_main.py

```diff
+def _get_state_cache_size(pipeline_options):
+  """Defines the upper number of state items to cache.
+
+  Note: state_cache_size is an experimental flag and might not be available in
+  future releases.
+
+  Returns:
+    an int indicating the maximum number of items to cache.
+    Default is 0 (disabled)
+  """
+  experiments = pipeline_options.view_as(DebugOptions).experiments
+  experiments = experiments if experiments else []
+
+  for experiment in experiments:
+    # There should only be 1 match so returning from the loop
+    if re.match(r'state_cache_size=', experiment):
+      return int(
+          re.match(r'state_cache_size=(?P<state_cache_size>.*)',
+                   experiment).group('state_cache_size'))
+  return 100
```

Review comment (on the final `return 100`): Should be 0 (the default)?

Issue Time Tracking --- Worklog Id: (was: 319030) Time Spent: 22h 50m (was: 22h 40m)
> Implement cross-bundle state caching.
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319029&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319029 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:12 Start Date: 26/Sep/19 15:12 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328671827

File path: sdks/python/apache_beam/runners/worker/bundle_processor.py

```diff
@@ -199,26 +199,19 @@ def finish(self):
 class _StateBackedIterable(object):

-  def __init__(self, state_handler, state_key, coder_or_impl):
+  def __init__(self, state_handler, state_key, coder_or_impl,
+               is_cached=False):
     self._state_handler = state_handler
     self._state_key = state_key
     if isinstance(coder_or_impl, coders.Coder):
       self._coder_impl = coder_or_impl.get_impl()
     else:
       self._coder_impl = coder_or_impl
+    self._is_cached = is_cached

   def __iter__(self):
-    # This is the continuation token this might be useful
-    data, continuation_token = self._state_handler.blocking_get(self._state_key)
-    while True:
-      input_stream = coder_impl.create_InputStream(data)
-      while input_stream.size() > 0:
-        yield self._coder_impl.decode_from_stream(input_stream, True)
-      if not continuation_token:
-        break
-      else:
-        data, continuation_token = self._state_handler.blocking_get(
-            self._state_key, continuation_token)
+    return self._state_handler.blocking_get(
```

Review comment: Could you elaborate? The continuation token is respected as before.

Issue Time Tracking --- Worklog Id: (was: 319029) Time Spent: 22h 40m (was: 22.5h)
> Implement cross-bundle state caching.
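The loop removed by the diff above is the standard continuation-token paging pattern: keep re-requesting with the returned token until the backend stops handing one back. Sketched standalone (the paged backend here is simulated, not the Fn API state protocol):

```python
def make_backend(pages):
    """Simulated backend: returns (data, continuation_token), where the token
    is the index of the next page or None when exhausted (hypothetical API)."""
    def blocking_get(token=0):
        data = pages[token]
        next_token = token + 1 if token + 1 < len(pages) else None
        return data, next_token
    return blocking_get


def iterate_state(blocking_get):
    """Yield every element across pages, following continuation tokens."""
    data, token = blocking_get()
    while True:
        for element in data:
            yield element
        if token is None:
            break
        # Re-request with the token the backend handed back.
        data, token = blocking_get(token)


elements = list(iterate_state(make_backend([[1, 2], [3], [4, 5]])))
```

Delegating `__iter__` to the state handler, as the diff does, moves exactly this loop behind the handler's API rather than eliminating it.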
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319028&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319028 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:12 Start Date: 26/Sep/19 15:12 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328671811

File path: sdks/python/apache_beam/runners/portability/portable_runner_test.py

```diff
@@ -201,7 +204,8 @@ class PortableRunnerTestWithExternalEnv(PortableRunnerTest):
   @classmethod
   def setUpClass(cls):
     cls._worker_address, cls._worker_server = (
-        worker_pool_main.BeamFnExternalWorkerPoolServicer.start())
+        worker_pool_main.BeamFnExternalWorkerPoolServicer.start(
```

Review comment: Will look into this.

Issue Time Tracking --- Worklog Id: (was: 319028) Time Spent: 22.5h (was: 22h 20m)
> Implement cross-bundle state caching.
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319027&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319027 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:12 Start Date: 26/Sep/19 15:12 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328671668

File path: sdks/python/apache_beam/runners/portability/fn_api_runner.py

```diff
@@ -1412,11 +1470,13 @@ def stop_worker(self):
 class WorkerHandlerManager(object):

-  def __init__(self, environments, job_provision_info):
+  def __init__(self, environments, job_provision_info, state_cache_size):
     self._environments = environments
     self._job_provision_info = job_provision_info
     self._cached_handlers = collections.defaultdict(list)
-    self._state = FnApiRunner.StateServicer()  # rename?
+    self._state = sdk_worker.CachingMaterializingStateHandler(
+        StateCache(state_cache_size),
```

Review comment: Good catch. This does not seem to be necessary. Let me revisit.

Issue Time Tracking --- Worklog Id: (was: 319027) Time Spent: 22h 20m (was: 22h 10m)
> Implement cross-bundle state caching.
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319009&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319009 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:01 Start Date: 26/Sep/19 15:01 Worklog Time Spent: 10m Work Description: tweise commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328665599

File path: sdks/python/apache_beam/runners/worker/bundle_processor.py

```diff
@@ -430,13 +424,13 @@ def clear(self):
   def _commit(self):
     if self._cleared:
-      self._state_handler.blocking_clear(self._state_key)
+      self._state_handler.clear(self._state_key, is_cached=True).get()
     if self._added_elements:
-      value_coder_impl = self._value_coder.get_impl()
-      out = coder_impl.create_OutputStream()
-      for element in self._added_elements:
-        value_coder_impl.encode_to_stream(element, out, True)
-      self._state_handler.blocking_append(self._state_key, out.get())
+      self._state_handler.append(
```

Review comment: An explicit comment regarding the need for the blocking call might be useful here.

Issue Time Tracking --- Worklog Id: (was: 319009) Time Spent: 22h 10m (was: 22h)
> Implement cross-bundle state caching.
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319008&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319008 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:00 Start Date: 26/Sep/19 15:00 Worklog Time Spent: 10m Work Description: tweise commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328665061

File path: sdks/python/apache_beam/runners/worker/bundle_processor.py

```diff
   def _commit(self):
     if self._cleared:
-      self._state_handler.blocking_clear(self._state_key)
+      self._state_handler.clear(self._state_key, is_cached=True).get()
```

Review comment: Is this the call where we don't need to wait for a response?

Issue Time Tracking --- Worklog Id: (was: 319008) Time Spent: 22h (was: 21h 50m)
> Implement cross-bundle state caching.
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319002&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319002 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 14:50 Start Date: 26/Sep/19 14:50 Worklog Time Spent: 10m Work Description: tweise commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328659068

File path: sdks/python/apache_beam/runners/worker/bundle_processor.py

```diff
   def __iter__(self):
-    # This is the continuation token this might be useful
-    data, continuation_token = self._state_handler.blocking_get(self._state_key)
-    while True:
-      input_stream = coder_impl.create_InputStream(data)
-      while input_stream.size() > 0:
-        yield self._coder_impl.decode_from_stream(input_stream, True)
-      if not continuation_token:
-        break
-      else:
-        data, continuation_token = self._state_handler.blocking_get(
-            self._state_key, continuation_token)
+    return self._state_handler.blocking_get(
```

Review comment: Is there a JIRA for the continuation token that can be referenced here?

Issue Time Tracking --- Worklog Id: (was: 319002) Time Spent: 21h 50m (was: 21h 40m)
> Implement cross-bundle state caching.
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=318997=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318997 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 14:48 Start Date: 26/Sep/19 14:48 Worklog Time Spent: 10m Work Description: tweise commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328658157
## File path: sdks/python/apache_beam/runners/portability/portable_runner_test.py
@@ -201,7 +204,8 @@ class PortableRunnerTestWithExternalEnv(PortableRunnerTest):
   @classmethod
   def setUpClass(cls):
     cls._worker_address, cls._worker_server = (
-        worker_pool_main.BeamFnExternalWorkerPoolServicer.start())
+        worker_pool_main.BeamFnExternalWorkerPoolServicer.start(
Review comment: Logically this duplicates the experiment flag above. I can see why that is currently necessary, but maybe it would be better to provide the pipeline options to the worker pool servicer? (Inside the container this would happen via the provisioning endpoint and environment.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 318997) Time Spent: 21h 40m (was: 21.5h) > Implement cross-bundle state caching. 
> - > > Key: BEAM-5428 > URL: https://issues.apache.org/jira/browse/BEAM-5428 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: Robert Bradshaw >Assignee: Maximilian Michels >Priority: Major > Time Spent: 21h 40m > Remaining Estimate: 0h > > Tech spec: > [https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m] > Relevant document: > [https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit#|https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit] > Mailing list link: > [https://lists.apache.org/thread.html/caa8d9bc6ca871d13de2c5e6ba07fdc76f85d26497d95d90893aa1f6@%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=318994=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318994 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 14:37 Start Date: 26/Sep/19 14:37 Worklog Time Spent: 10m Work Description: tweise commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328651322
## File path: sdks/python/apache_beam/runners/portability/fn_api_runner.py
@@ -1412,11 +1470,13 @@ def stop_worker(self):
 class WorkerHandlerManager(object):
 
-  def __init__(self, environments, job_provision_info):
+  def __init__(self, environments, job_provision_info, state_cache_size):
     self._environments = environments
     self._job_provision_info = job_provision_info
     self._cached_handlers = collections.defaultdict(list)
-    self._state = FnApiRunner.StateServicer()  # rename?
+    self._state = sdk_worker.CachingMaterializingStateHandler(
+        StateCache(state_cache_size),
Review comment: For my learning, why is the cache required inside the fn_api_runner vs. on the SDK side? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 318994) Time Spent: 21.5h (was: 21h 20m) > Implement cross-bundle state caching. 
> - > > Key: BEAM-5428 > URL: https://issues.apache.org/jira/browse/BEAM-5428 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: Robert Bradshaw >Assignee: Maximilian Michels >Priority: Major > Time Spent: 21.5h > Remaining Estimate: 0h > > Tech spec: > [https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m] > Relevant document: > [https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit#|https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit] > Mailing list link: > [https://lists.apache.org/thread.html/caa8d9bc6ca871d13de2c5e6ba07fdc76f85d26497d95d90893aa1f6@%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
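Conceptually, the `StateCache(state_cache_size)` wired into the handler above is a size-bounded, recency-evicting map from state keys to materialized values. A minimal LRU sketch of that idea (illustrative only, not Beam's actual `StateCache` implementation):

```python
from collections import OrderedDict


class StateCache:
    """Minimal sketch of a size-bounded, LRU-evicting state cache.
    Names and behavior are illustrative, not Beam's actual class."""

    def __init__(self, max_entries):
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, state_key):
        if state_key in self._entries:
            self._entries.move_to_end(state_key)  # mark as recently used
            return self._entries[state_key]
        return None  # cache miss: caller falls back to the state channel

    def put(self, state_key, value):
        self._entries[state_key] = value
        self._entries.move_to_end(state_key)
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```

Cross-bundle caching means entries survive between bundles, so repeated reads of the same user state key avoid a round trip to the runner until the entry is evicted or invalidated by a write.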
[jira] [Work logged] (BEAM-8306) improve estimation of data byte size reading from source in ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-8306?focusedWorklogId=318943=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318943 ] ASF GitHub Bot logged work on BEAM-8306: Author: ASF GitHub Bot Created on: 26/Sep/19 12:59 Start Date: 26/Sep/19 12:59 Worklog Time Spent: 10m Work Description: iemejia commented on issue #9660: [BEAM-8306] improve estimation datasize elasticsearch io URL: https://github.com/apache/beam/pull/9660#issuecomment-535490565 R: @echauchot @timrobertson100 Could one of you PTAL at this one? Looks like an interesting improvement. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 318943) Remaining Estimate: 0h Time Spent: 10m > improve estimation of data byte size reading from source in ElasticsearchIO > --- > > Key: BEAM-8306 > URL: https://issues.apache.org/jira/browse/BEAM-8306 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Affects Versions: 2.14.0 >Reporter: Derek He >Assignee: Derek He >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > ElasticsearchIO splits the BoundedSource based on the Elasticsearch index size. > We expect it would be more accurate to split based on the query result size. > Currently, we have a big Elasticsearch index, but the query result only > contains a few of the documents in the index. ElasticsearchIO splits it into up > to 1024 BoundedSources in Google Dataflow, and it takes a long time to finish > processing the small number of Elasticsearch documents in Google Dataflow. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
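The proposed improvement amounts to deriving the split count from the estimated size of the query result rather than the whole index. A hedged sketch of that arithmetic (the 1024 cap comes from the report; the function name and `desired_bundle_bytes` parameter are assumptions for illustration):

```python
def estimate_split_count(query_result_bytes, desired_bundle_bytes,
                         max_splits=1024):
    """Hypothetical sketch: derive a BoundedSource split count from the
    estimated query result size instead of the full index size."""
    if query_result_bytes <= 0:
        return 1  # nothing (or nothing measurable) to read: one source
    # Ceiling division: enough splits that each holds about one bundle.
    splits = -(-query_result_bytes // desired_bundle_bytes)
    return max(1, min(splits, max_splits))
```

With index-size-based estimation, a tiny query over a huge index still fans out to 1024 sources, most of them empty; sizing by the query result keeps the split count proportional to the work actually available.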
[jira] [Assigned] (BEAM-8212) StatefulParDoFn creates GC timers for every record
[ https://issues.apache.org/jira/browse/BEAM-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía reassigned BEAM-8212: -- Assignee: (was: Aizhamal Nurmamat kyzy) > StatefulParDoFn creates GC timers for every record > --- > > Key: BEAM-8212 > URL: https://issues.apache.org/jira/browse/BEAM-8212 > Project: Beam > Issue Type: Bug > Components: beam-community >Reporter: Akshay Iyangar >Priority: Major > > Hi > So currently StatefulParDoFn creates timers for all the records. > [https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/StatefulDoFnRunner.java#L211] > This becomes a problem if you are using GlobalWindows for streaming, where > these timers get created and never get closed since the window will never > close. > This is a problem especially if you are memory-bound in RocksDB, where these > timers take up space and slow the pipelines considerably. > Was wondering: if the pipeline runs in global windows, should we avoid > adding timers to it at all? > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
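The report suggests skipping the GC timer when the window is the global window, since such a timer could only fire at end-of-time and the window never closes. A minimal sketch of that guard (names are illustrative, not the actual `StatefulDoFnRunner` API):

```python
# Stand-in for the global window's maximum timestamp ("end of time").
GLOBAL_WINDOW_MAX_TIMESTAMP = float('inf')


def maybe_set_cleanup_timer(window_end, set_timer):
    """Sketch of the proposed guard: skip GC timers that can never
    usefully fire, so they never accumulate in timer state."""
    if window_end >= GLOBAL_WINDOW_MAX_TIMESTAMP:
        return False  # global window: timer would never fire, skip it
    set_timer(window_end + 1)  # fire just after the window closes
    return True
```

Under GlobalWindows every record would otherwise register a timer that never fires, which is exactly the unbounded timer-state growth the report describes for RocksDB-backed pipelines.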
[jira] [Updated] (BEAM-8212) StatefulParDoFn creates GC timers for every record
[ https://issues.apache.org/jira/browse/BEAM-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-8212: --- Component/s: (was: beam-community) runner-core > StatefulParDoFn creates GC timers for every record > --- > > Key: BEAM-8212 > URL: https://issues.apache.org/jira/browse/BEAM-8212 > Project: Beam > Issue Type: Bug > Components: runner-core >Reporter: Akshay Iyangar >Priority: Major > > Hi > So currently StatefulParDoFn creates timers for all the records. > [https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/StatefulDoFnRunner.java#L211] > This becomes a problem if you are using GlobalWindows for streaming, where > these timers get created and never get closed since the window will never > close. > This is a problem especially if you are memory-bound in RocksDB, where these > timers take up space and slow the pipelines considerably. > Was wondering: if the pipeline runs in global windows, should we avoid > adding timers to it at all? > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8312) Flink portable pipeline jars do not need to stage artifacts remotely
[ https://issues.apache.org/jira/browse/BEAM-8312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-8312: --- Status: Open (was: Triage Needed) > Flink portable pipeline jars do not need to stage artifacts remotely > > > Key: BEAM-8312 > URL: https://issues.apache.org/jira/browse/BEAM-8312 > Project: Beam > Issue Type: Improvement > Components: runner-flink >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Labels: portability-flink > > Currently, Flink job jars re-stage all artifacts at runtime (on the > JobManager) by using the usual BeamFileSystemArtifactRetrievalService [1]. > However, since the manifest and all the artifacts live on the classpath of > the jar, and everything from the classpath is copied to the Flink workers > anyway [2], it should not be necessary to do additional artifact staging. We > could replace BeamFileSystemArtifactRetrievalService in this case with a > simple ArtifactRetrievalService that just pulls the artifacts from the > classpath. > > [1] > [https://github.com/apache/beam/blob/340c3202b1e5824b959f5f9f626e4c7c7842a3cb/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/artifact/BeamFileSystemArtifactRetrievalService.java] > [2] > [https://github.com/apache/beam/blob/2f1b56ccc506054e40afe4793a8b556e872e1865/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkExecutionEnvironments.java#L93] -- This message was sent by Atlassian Jira (v8.3.4#803005)
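The idea above — serve artifacts from resources already shipped inside the jar instead of re-staging them remotely — can be sketched language-neutrally. In the following sketch a local directory plus a JSON manifest stands in for the jar's classpath resources; all class and file names are hypothetical, not Beam's actual `ArtifactRetrievalService` interface:

```python
import json
import os


class ClasspathArtifactRetrievalService:
    """Sketch: serve artifacts straight from files bundled with the job.
    A directory stands in for the jar's classpath; MANIFEST.json maps
    artifact names to relative paths (both names are assumptions)."""

    def __init__(self, resource_dir):
        self._resource_dir = resource_dir
        with open(os.path.join(resource_dir, 'MANIFEST.json')) as f:
            self._manifest = json.load(f)  # artifact name -> relative path

    def get_manifest(self):
        return sorted(self._manifest)

    def get_artifact(self, name):
        # No remote staging: the bytes are already local to every worker,
        # because the runner ships the whole classpath to them anyway.
        path = os.path.join(self._resource_dir, self._manifest[name])
        with open(path, 'rb') as f:
            return f.read()
```

This mirrors the ticket's argument: since Flink already copies the jar's classpath to every worker, a retrieval service only needs local reads, not a second staging round trip.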
[jira] [Updated] (BEAM-8146) SchemaCoder/RowCoder have no equals() function
[ https://issues.apache.org/jira/browse/BEAM-8146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-8146: --- Status: Open (was: Triage Needed) > SchemaCoder/RowCoder have no equals() function > -- > > Key: BEAM-8146 > URL: https://issues.apache.org/jira/browse/BEAM-8146 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Affects Versions: 2.15.0 >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > SchemaCoder has no equals function, so it can't be compared in tests, like > CloudComponentsTests$DefaultCoders, which is being re-enabled in > https://github.com/apache/beam/pull/9446 -- This message was sent by Atlassian Jira (v8.3.4#803005)
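The fix amounts to giving the coder structural equality: two coders built from equal schemas should compare (and hash) equal, so tests like CloudComponentsTests$DefaultCoders can assert on them. An illustrative sketch of that pattern (not Beam's actual SchemaCoder, which is Java):

```python
class SchemaCoderSketch:
    """Illustrative coder wrapper with value-based equality: instances
    built from equal schemas compare equal. The class and attribute
    names are assumptions for illustration."""

    def __init__(self, schema):
        self._schema = schema

    def __eq__(self, other):
        # Compare by type and schema, not by object identity.
        return (type(self) is type(other)
                and self._schema == other._schema)

    def __hash__(self):
        # Equal objects must hash equal, so hash the same fields.
        return hash((type(self), self._schema))
```

Without this, independently constructed but logically identical coders fail equality assertions, which is exactly why the test in the linked PR could not compare them.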
[jira] [Updated] (BEAM-8275) Beam SQL should support BigQuery in DIRECT_READ mode
[ https://issues.apache.org/jira/browse/BEAM-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-8275: --- Status: Open (was: Triage Needed) > Beam SQL should support BigQuery in DIRECT_READ mode > > > Key: BEAM-8275 > URL: https://issues.apache.org/jira/browse/BEAM-8275 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Andrew Pilloud >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > SQL currently only supports reading from BigQuery in DEFAULT (EXPORT) mode. > We also need to support DIRECT_READ mode. The method should be configurable > by TBLPROPERTIES through the SQL CLI. This will enable us to take advantage > of the DIRECT_READ features for filter and project push down. > References: > [https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api] > [https://beam.apache.org/blog/2019/06/04/adding-data-sources-to-sql.html] > [https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java] -- This message was sent by Atlassian Jira (v8.3.4#803005)
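Making the read method configurable through TBLPROPERTIES could look roughly like the following; the `method` property key and the accepted values are assumptions for illustration, not Beam SQL's actual configuration surface:

```python
import json


def bigquery_read_method(tblproperties_json, default='EXPORT'):
    """Hypothetical sketch: pick the BigQuery read mode from a table's
    TBLPROPERTIES blob; the 'method' key name is an assumption."""
    props = json.loads(tblproperties_json or '{}')
    method = props.get('method', default).upper()
    if method not in ('EXPORT', 'DIRECT_READ'):
        raise ValueError('unsupported read method: %s' % method)
    return method
```

The point of DIRECT_READ (the BigQuery Storage API) is that filter and projection push-down can then be applied at the source, instead of exporting the whole table first.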
[jira] [Assigned] (BEAM-8306) improve estimation of data byte size reading from source in ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía reassigned BEAM-8306: -- Assignee: Derek He > improve estimation of data byte size reading from source in ElasticsearchIO > --- > > Key: BEAM-8306 > URL: https://issues.apache.org/jira/browse/BEAM-8306 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Affects Versions: 2.14.0 >Reporter: Derek He >Assignee: Derek He >Priority: Major > > ElasticsearchIO splits the BoundedSource based on the Elasticsearch index size. > We expect it would be more accurate to split based on the query result size. > Currently, we have a big Elasticsearch index, but the query result only > contains a few of the documents in the index. ElasticsearchIO splits it into up > to 1024 BoundedSources in Google Dataflow, and it takes a long time to finish > processing the small number of Elasticsearch documents in Google Dataflow. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8306) improve estimation of data byte size reading from source in ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-8306: --- Status: Open (was: Triage Needed) > improve estimation of data byte size reading from source in ElasticsearchIO > --- > > Key: BEAM-8306 > URL: https://issues.apache.org/jira/browse/BEAM-8306 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Affects Versions: 2.14.0 >Reporter: Derek He >Priority: Major > > ElasticsearchIO splits the BoundedSource based on the Elasticsearch index size. > We expect it would be more accurate to split based on the query result size. > Currently, we have a big Elasticsearch index, but the query result only > contains a few of the documents in the index. ElasticsearchIO splits it into up > to 1024 BoundedSources in Google Dataflow, and it takes a long time to finish > processing the small number of Elasticsearch documents in Google Dataflow. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8303) Filesystems not properly registered using FileIO.write()
[ https://issues.apache.org/jira/browse/BEAM-8303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-8303: --- Status: Open (was: Triage Needed) > Filesystems not properly registered using FileIO.write() > > > Key: BEAM-8303 > URL: https://issues.apache.org/jira/browse/BEAM-8303 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Affects Versions: 2.15.0 >Reporter: Preston Koprivica >Assignee: Maximilian Michels >Priority: Major > > I’m getting the following error when attempting to use the FileIO apis > (beam-2.15.0) and integrating with AWS S3. I have setup the PipelineOptions > with all the relevant AWS options, so the filesystem registry **should** be > properly seeded by the time the graph is compiled and executed: > {code:java} > java.lang.IllegalArgumentException: No filesystem found for scheme s3 > at > org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:456) > at > org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:526) > at > org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1149) > at > org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1105) > at org.apache.beam.sdk.coders.Coder.decode(Coder.java:159) > at > org.apache.beam.sdk.transforms.join.UnionCoder.decode(UnionCoder.java:83) > at > org.apache.beam.sdk.transforms.join.UnionCoder.decode(UnionCoder.java:32) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:543) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:534) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:480) > at > org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.deserialize(CoderTypeSerializer.java:93) > at > org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55) > at > 
org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:106) > at > org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:72) > at > org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47) > at > org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73) > at > org.apache.flink.runtime.operators.FlatMapDriver.run(FlatMapDriver.java:107) > at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:503) > at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:368) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) > at java.lang.Thread.run(Thread.java:748) > {code} > For reference, the write code resembles this: > {code:java} > FileIO.Write write = FileIO.write() > .via(ParquetIO.sink(schema)) > .to(options.getOutputDir()). // will be something like: > s3:/// > .withSuffix(".parquet"); > records.apply(String.format("Write(%s)", options.getOutputDir()), > write);{code} > The issue does not appear to be related to ParquetIO.sink(). I am able to > reliably reproduce the issue using JSON formatted records and TextIO.sink(), > as well. Moreover, AvroIO is affected if withWindowedWrites() option is > added. > Just trying some different knobs, I went ahead and set the following option: > {code:java} > write = write.withNoSpilling();{code} > This actually seemed to fix the issue, only to have it reemerge as I scaled > up the data set size. 
The stack trace, while very similar, reads: > {code:java} > java.lang.IllegalArgumentException: No filesystem found for scheme s3 > at > org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:456) > at > org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:526) > at > org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1149) > at > org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1105) > at org.apache.beam.sdk.coders.Coder.decode(Coder.java:159) > at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:82) > at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:36) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:543) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:534) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:480) >
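The failure above is a scheme-to-filesystem registry miss: the registry was seeded with the AWS options on the submitting JVM, but not in the Flink worker processes that later deserialize the FileResult coder. A toy sketch of such a registry (hypothetical names, not Beam's actual FileSystems class) reproduces the error shape:

```python
class FileSystemRegistry:
    """Sketch of a process-wide scheme -> filesystem registry. The
    lookup below fails exactly like the stack trace when registration
    never ran in the process doing the decoding."""

    _registry = {}

    @classmethod
    def register(cls, scheme, filesystem):
        cls._registry[scheme] = filesystem

    @classmethod
    def get_filesystem(cls, path):
        scheme = path.split('://', 1)[0]
        try:
            return cls._registry[scheme]
        except KeyError:
            raise ValueError('No filesystem found for scheme %s' % scheme)
```

Because registration is per-process state, it must happen on every worker before any coder decodes a file path, which is why the error surfaces only once decoding moves onto the Flink task managers.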