[jira] [Work logged] (BEAM-8318) Add a num_threads_per_worker pipeline option to Python SDK.
[ https://issues.apache.org/jira/browse/BEAM-8318?focusedWorklogId=319321&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319321 ] ASF GitHub Bot logged work on BEAM-8318: Author: ASF GitHub Bot Created on: 27/Sep/19 02:43 Start Date: 27/Sep/19 02:43 Worklog Time Spent: 10m Work Description: chamikaramj commented on pull request #9675: [BEAM-8318] Adds a pipeline option to Python SDK for controlling the number of threads per worker. URL: https://github.com/apache/beam/pull/9675 This will be similar to the following option already available in the Java SDK: https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineDebugOptions.java#L178 Currently, this only works for DataflowRunner on the Fn API path. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
Post-Commit Tests Status (on master branch): per-runner build-status badge table (Go/Java/Python across Apex, Dataflow, Flink, Gearpump, Samza, and Spark) linking to builds.apache.org.
[jira] [Created] (BEAM-8318) Add a num_threads_per_worker pipeline option to Python SDK.
Chamikara Madhusanka Jayalath created BEAM-8318: --- Summary: Add a num_threads_per_worker pipeline option to Python SDK. Key: BEAM-8318 URL: https://issues.apache.org/jira/browse/BEAM-8318 Project: Beam Issue Type: Improvement Components: sdk-py-core Reporter: Chamikara Madhusanka Jayalath Assignee: Chamikara Madhusanka Jayalath Similar to what we have here for Java: [https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineDebugOptions.java#L178] -- This message was sent by Atlassian Jira (v8.3.4#803005)
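A minimal sketch of how such an option could be registered. Beam's Python pipeline options are backed by argparse, so the sketch below uses a bare argparse parser; the flag name `--number_of_worker_harness_threads`, its default, and the help text are assumptions for illustration, not necessarily what the PR finally shipped.

```python
import argparse

# Illustrative stand-in for a Beam PipelineOptions subclass registering a flag.
# Assumption: the option limits concurrent work-item threads per worker and is
# only honored by DataflowRunner on the Fn API path, mirroring the Java option.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--number_of_worker_harness_threads",
    type=int,
    default=None,  # None lets the runner pick its own default
    help="Limit on the number of concurrent work-item threads per worker.")

# Unrecognized flags are tolerated, as Beam's option parsing does.
opts, _ = parser.parse_known_args(["--number_of_worker_harness_threads", "12"])
```

A user would then pass the flag on the pipeline command line alongside the other Dataflow options.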
[jira] [Work logged] (BEAM-7455) Improve Avro IO integration test coverage on Python 3.
[ https://issues.apache.org/jira/browse/BEAM-7455?focusedWorklogId=319296&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319296 ] ASF GitHub Bot logged work on BEAM-7455: Author: ASF GitHub Bot Created on: 27/Sep/19 00:54 Start Date: 27/Sep/19 00:54 Worklog Time Spent: 10m Work Description: tvalentyn commented on pull request #9466: [DO NOT MERGE] [BEAM-7455] Improve Avro IO integration test coverage on Python 3. URL: https://github.com/apache/beam/pull/9466 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319296) Time Spent: 6h 10m (was: 6h) > Improve Avro IO integration test coverage on Python 3. > -- > > Key: BEAM-7455 > URL: https://issues.apache.org/jira/browse/BEAM-7455 > Project: Beam > Issue Type: Sub-task > Components: io-py-avro >Reporter: Valentyn Tymofieiev >Assignee: Valentyn Tymofieiev >Priority: Major > Time Spent: 6h 10m > Remaining Estimate: 0h > > It seems that we don't have an integration test for Avro IO on Python 3: > fastavro_it_test [1] depends on both avro and fastavro; however, the avro package > currently does not work with Beam on Python 3, so we don't have an > integration test that exercises Avro IO on Python 3. > We should add an integration test for Avro IO that does not need both > libraries at the same time, and instead can run using either library. > [~frederik] is this something you could help with? > cc: [~chamikara] [~Juta] > [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/fastavro_it_test.py -- This message was sent by Atlassian Jira (v8.3.4#803005)
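The ticket asks for a test that can run with either Avro library rather than requiring both. One way to sketch that selection (helper name is made up; this is not the test's actual code) is a try-import chain that picks whichever implementation is installed:

```python
import importlib

def pick_avro_impl(preferred=("fastavro", "avro")):
    """Return the name of the first importable Avro library, or None.

    Illustrative sketch: an integration test could use this to parameterize
    itself over whichever library is present instead of importing both.
    """
    for name in preferred:
        try:
            importlib.import_module(name)
            return name
        except ImportError:
            continue
    return None

impl = pick_avro_impl()
```

The test body would then branch on `impl` (or skip when it is None), so the Python 3 run no longer depends on the `avro` package being importable.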
[jira] [Work logged] (BEAM-8213) Run and report python tox tasks separately within Jenkins
[ https://issues.apache.org/jira/browse/BEAM-8213?focusedWorklogId=319287&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319287 ] ASF GitHub Bot logged work on BEAM-8213: Author: ASF GitHub Bot Created on: 27/Sep/19 00:05 Start Date: 27/Sep/19 00:05 Worklog Time Spent: 10m Work Description: markflyhigh commented on issue #9642: [BEAM-8213] Split up monolithic python preCommit tests on jenkins URL: https://github.com/apache/beam/pull/9642#issuecomment-535728199 I suggest moving this discussion to dev@ since we have had similar discussions before and many other people also have insights into this problem. For me, I don't see a big benefit to this split (1 job into 5 jobs). - The pain points you mentioned about pylint failures don't require this change. I agree with the approach of splitting out pylint alone. A similar thing is done in Java (RAT), and we could move pylint into that as well (or put it in a separate job). - For logging, Gradle scan organizes logs by task and provides a pretty good UI to surface the error. The link is on the Jenkins job page. Did you have a chance to explore that? - For efficiency, this split will **not shorten** the wall time of the precommit run (50-75 mins); on the contrary, it adds a requirement for 4 extra job slots. Given that these are precommit jobs and the run frequency is very likely high (triggered by each commit push and manually in PRs), it's likely to increase the precommit queue time. - I agree with Daniel's proposal in https://github.com/apache/beam/pull/9642#issuecomment-535241993 if we want to proceed with this change. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319287) Time Spent: 9h 40m (was: 9.5h) > Run and report python tox tasks separately within Jenkins > - > > Key: BEAM-8213 > URL: https://issues.apache.org/jira/browse/BEAM-8213 > Project: Beam > Issue Type: Improvement > Components: build-system >Reporter: Chad Dombrova >Priority: Major > Time Spent: 9h 40m > Remaining Estimate: 0h > > As a python developer, the speed and comprehensibility of the jenkins > PreCommit job could be greatly improved. > Here are some of the problems > - when a lint job fails, it's not reported in the test results summary, so > even though the job is marked as failed, I see "Test Result (no failures)" > which is quite confusing > - I have to wait for over an hour to discover the lint failed, which takes > about a minute to run on its own > - The logs are a jumbled mess of all the different tasks running on top of > each other > - The test results give no indication of which version of python they use. I > click on Test results, then the test module, then the test class, then I see > 4 tests named the same thing. I assume that the first is python 2.7, the > second is 3.5 and so on. It takes 5 clicks and then reading the log output > to know which version of python a single error pertains to, then I need to > repeat for each failure. This makes it very difficult to discover problems, > and deduce that they may have something to do with python version mismatches. > I believe the solution to this is to split up the single monolithic python > PreCommit job into sub-jobs (possibly using a pipeline with steps). 
This > would give us the following benefits: > - sub job results should become available as they finish, so for example, > lint results should be available very early on > - sub job results will be reported separately, and there will be a job for > each py2, py35, py36 and so on, so it will be clear when an error is related > to a particular python version > - sub jobs without reports, like docs and lint, will have their own failure > status and logs, so when they fail it will be more obvious what went wrong. > I'm happy to help out once I get some feedback on the desired way forward. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-876) Support schemaUpdateOption in BigQueryIO
[ https://issues.apache.org/jira/browse/BEAM-876?focusedWorklogId=319284&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319284 ] ASF GitHub Bot logged work on BEAM-876: --- Author: ASF GitHub Bot Created on: 26/Sep/19 23:56 Start Date: 26/Sep/19 23:56 Worklog Time Spent: 10m Work Description: chamikaramj commented on issue #9524: [BEAM-876] Support schemaUpdateOption in BigQueryIO URL: https://github.com/apache/beam/pull/9524#issuecomment-535727479 Thanks for the contribution. R: @pabloem will you be able to review? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319284) Time Spent: 40m (was: 0.5h) > Support schemaUpdateOption in BigQueryIO > > > Key: BEAM-876 > URL: https://issues.apache.org/jira/browse/BEAM-876 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Eugene Kirpichov >Assignee: canaan silberberg >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > BigQuery recently added support for updating the schema as a side effect of > the load job. > Here is the relevant API method in JobConfigurationLoad: > https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#setSchemaUpdateOptions(java.util.List) > BigQueryIO should support this too. See user request for this: > http://stackoverflow.com/questions/40333245/is-it-possible-to-update-schema-while-doing-a-load-into-an-existing-bigquery-tab -- This message was sent by Atlassian Jira (v8.3.4#803005)
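At the BigQuery REST level, the feature boils down to setting `schemaUpdateOptions` on the load-job configuration. A hedged sketch of building that configuration as a plain dict mirroring the `JobConfigurationLoad` resource (the helper function itself is invented for illustration, not Beam or BigQuery client code):

```python
# Valid values per the BigQuery JobConfigurationLoad reference.
ALLOWED_SCHEMA_UPDATE_OPTIONS = {"ALLOW_FIELD_ADDITION", "ALLOW_FIELD_RELAXATION"}

def load_job_config(table, schema_update_options=()):
    """Build a load-job configuration dict carrying schemaUpdateOptions.

    Sketch only: BigQueryIO would thread these options through to the load
    job it issues; schema updates are only meaningful for appending writes.
    """
    bad = set(schema_update_options) - ALLOWED_SCHEMA_UPDATE_OPTIONS
    if bad:
        raise ValueError("unsupported schemaUpdateOptions: %s" % sorted(bad))
    config = {"destinationTable": table, "writeDisposition": "WRITE_APPEND"}
    if schema_update_options:
        config["schemaUpdateOptions"] = sorted(schema_update_options)
    return config
```

With `ALLOW_FIELD_ADDITION` set, a load job may append rows whose schema adds new nullable fields to the destination table, which is exactly the Stack Overflow use case cited in the issue.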
[jira] [Comment Edited] (BEAM-8212) StatefulParDoFn creates GC timers for every record
[ https://issues.apache.org/jira/browse/BEAM-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937228#comment-16937228 ] Akshay Iyangar edited comment on BEAM-8212 at 9/26/19 11:19 PM:
{code:java}
public class TestDecodeTimer {

  @Test
  public void gctimerValue() throws IOException, ClassNotFoundException {
    StateNamespace stateNamespace =
        StateNamespaces.window(GlobalWindow.Coder.INSTANCE, GlobalWindow.INSTANCE);
    String GC_TIMER_ID = "__StatefulParDoGcTimerId";
    // timerInternals.setTimer(
    //     StateNamespaces.window(windowCoder, window), GC_TIMER_ID, gcTime, TimeDomain.EVENT_TIME);

    // Encode the timer id and namespace key, then find their hex representation.
    ByteArrayOutputStream outStream = new ByteArrayOutputStream();
    StringUtf8Coder.of().encode(GC_TIMER_ID, outStream);
    StringUtf8Coder.of().encode(stateNamespace.stringKey(), outStream);
    System.out.println("The output stream is: " + outStream.toString());
    // __StatefulParDoGcTimerId//
    String encode = BaseEncoding.base16().encode(outStream.toByteArray());
    System.out.println("The encoded string is " + encode);
    // 185F5F537461746566756C506172446F476354696D65724964022F2F

    // Everything after that prefix is the GC timer; encode the time domain too,
    // so it can be stripped from the end of the key.
    ByteArrayOutputStream outStream1 = new ByteArrayOutputStream();
    StringUtf8Coder.of().encode(TimeDomain.EVENT_TIME.toString(), outStream1);
    String encode1 = BaseEncoding.base16().encode(outStream1.toByteArray());
    System.out.println("The encoded1 string is " + encode1);
    // 0A4556454E545F54494D45
    System.out.println("Total length of the encoded key: " + outStream.size());

    // Example key taken from RocksDB:
    String decode = "008020C49BA0BCF7F901006A6176612E6E696F2E4865617042797465427565F20100010C0107313831303639000C0100185F5F537461746566756C506172446F476354696D65724964022F2F8020C49BA0BCF7F80A4556454E545F54494D45";
    // The timer is whatever sits between 185F5F537461746566756C506172446F476354696D65724964022F2F
    // and 0A4556454E545F54494D45, viz. 8020C49BA0BCF7F8.
    Instant timeDecode = InstantCoder.of().decode(
        new ByteArrayInputStream(BaseEncoding.base16().decode("8020C49BA0BCF7F8")));
    System.out.println("GC timer for Global Window is " + timeDecode);
    // 294247-01-10T04:00:54.775Z
    // This is nothing but +infinity, so these timers would never be cleaned up,
    // because the window never closes. Cross-verify:
    System.out.println("MAX value " + BoundedWindow.TIMESTAMP_MAX_VALUE);
    System.out.println("MAX: " + GlobalWindow.TIMESTAMP_MAX_VALUE);
  }
}
{code}
So I wrote a test to verify the values being generated for each of the events. I took one key from RocksDB to analyze, and the timer is +infinity, i.e. GlobalWindow.TIMESTAMP_MAX_VALUE, which makes sense as it's a global window. Also, I didn't see any keys associated with timers in the StatefulParDoFn:
{code:java}
rocksdb_ldb --db=db --column_family=_timer_state/event_beam-timer scan --max_keys=100 --key_hex
{code}
returned zero keys. I ran a big pipeline to see the effect of having the GC timers disabled: at the 1-hour mark, with a Global Window and RocksDB as the state backend, the pipeline had consumed 432 million records with node memory usage at roughly 50%. The node is a 32GB EKS node where I gave 15GB to the JVM. With the timers enabled, the same pipeline took 1 hr 30 mins to read 432 million records, with total node memory usage at 62%. So I think it is fair to assume that for global windows the timers can affect pipeline performance. [~mxm] and [~NathanHowell] ^^
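The hex value analyzed in the comment above can be cross-checked in plain Python. Beam's InstantCoder stores the millisecond timestamp shifted by Long.MIN_VALUE so the big-endian bytes sort chronologically; undoing the shift recovers the instant (the helper name here is made up):

```python
def decode_instant_millis(hex_bytes):
    """Decode a Beam InstantCoder-encoded big-endian long into epoch millis.

    InstantCoder shifts millis by Long.MIN_VALUE before encoding, so
    subtracting 2**63 undoes the shift.
    """
    shifted = int.from_bytes(bytes.fromhex(hex_bytes), byteorder="big")
    return shifted - 2**63

gc_timer_millis = decode_instant_millis("8020C49BA0BCF7F8")
# ~9.22e15 ms, the year-294247 instant printed by the test above. It appears
# to equal BoundedWindow.TIMESTAMP_MAX_VALUE (Long.MAX_VALUE microseconds
# expressed as millis) minus one day plus 1 ms, consistent with a GC timer
# set at the global window's max timestamp plus a small cleanup delay.
```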
[jira] [Work logged] (BEAM-7933) Adding timeout to JobServer grpc calls
[ https://issues.apache.org/jira/browse/BEAM-7933?focusedWorklogId=319263&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319263 ] ASF GitHub Bot logged work on BEAM-7933: Author: ASF GitHub Bot Created on: 26/Sep/19 22:51 Start Date: 26/Sep/19 22:51 Worklog Time Spent: 10m Work Description: ecanzonieri commented on issue #9673: [BEAM-7933] Add job server request timeout (default to 60 seconds) URL: https://github.com/apache/beam/pull/9673#issuecomment-535714867 R: @ibzib @aaltay This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319263) Time Spent: 20m (was: 10m) > Adding timeout to JobServer grpc calls > -- > > Key: BEAM-7933 > URL: https://issues.apache.org/jira/browse/BEAM-7933 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Affects Versions: 2.14.0 >Reporter: Enrico Canzonieri >Assignee: Enrico Canzonieri >Priority: Minor > Labels: portability > Time Spent: 20m > Remaining Estimate: 0h > > grpc calls to the JobServer from the Python SDK do not have timeouts. That > means that the call to pipeline.run() could hang forever if the JobServer is > not running (or failing to start). > E.g. > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/portable_runner.py#L307] > the call to Prepare() doesn't provide any timeout value, and the same applies > to other JobServer requests. > As part of this ticket we could add a default timeout of 60 seconds, as is the default timeout for the http client.
> Additionally, we could consider adding a --job-server-request-timeout to the > [PortableOptions|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/options/pipeline_options.py#L805] > class to be used in the JobServer interactions inside portable_runner.py. -- This message was sent by Atlassian Jira (v8.3.4#803005)
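The proposal above amounts to routing every job-server RPC through a default deadline. A hedged, stdlib-only sketch of that behaviour (the wrapper and names are hypothetical, not Beam code; it relies only on the fact that gRPC stub methods accept a `timeout` keyword argument):

```python
DEFAULT_JOB_SERVER_TIMEOUT_SECS = 60

def call_with_deadline(rpc, request, timeout=DEFAULT_JOB_SERVER_TIMEOUT_SECS):
    """Invoke an RPC-like callable with a default deadline.

    Callers pass timeout=None explicitly for requests that are expected to
    run long or hang by design (e.g. waiting on a job's state stream).
    """
    return rpc(request, timeout=timeout)

# Fake stub method standing in for e.g. JobServiceStub.Prepare, so the
# sketch is runnable without a job server.
def fake_prepare(request, timeout=None):
    return {"request": request, "timeout": timeout}

resp = call_with_deadline(fake_prepare, "prepare-request")
```

With this pattern, `pipeline.run()` fails with a deadline-exceeded error instead of hanging forever when the job server never comes up.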
[jira] [Work logged] (BEAM-7933) Adding timeout to JobServer grpc calls
[ https://issues.apache.org/jira/browse/BEAM-7933?focusedWorklogId=319262&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319262 ] ASF GitHub Bot logged work on BEAM-7933: Author: ASF GitHub Bot Created on: 26/Sep/19 22:48 Start Date: 26/Sep/19 22:48 Worklog Time Spent: 10m Work Description: ecanzonieri commented on pull request #9673: [BEAM-7933] Add job server request timeout (default to 60 seconds) URL: https://github.com/apache/beam/pull/9673 Add a new option --job_server_timeout that defaults to 60 seconds. The request timeout is used for all job server requests (except for the ones that are expected to run long or hang). The request timeout is also used upon channel creation, so that the Beam driver will fail if the job server is not available. Let me know if the option name looks fine or whether it needs renaming. Finally, there are other grpc requests (non job service) that I'm not covering as part of this PR. Let me know if I should add the timeout to those as well. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
[jira] [Work started] (BEAM-7933) Adding timeout to JobServer grpc calls
[ https://issues.apache.org/jira/browse/BEAM-7933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on BEAM-7933 started by Enrico Canzonieri. --- > Adding timeout to JobServer grpc calls > -- > > Key: BEAM-7933 > URL: https://issues.apache.org/jira/browse/BEAM-7933 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Affects Versions: 2.14.0 >Reporter: Enrico Canzonieri >Assignee: Enrico Canzonieri >Priority: Minor > Labels: portability > > grpc calls to the JobServer from the Python SDK do not have timeouts. That > means that the call to pipeline.run() could hang forever if the JobServer is > not running (or failing to start). > E.g. > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/portable_runner.py#L307] > the call to Prepare() doesn't provide any timeout value, and the same applies > to other JobServer requests. > As part of this ticket we could add a default timeout of 60 seconds, as is the default timeout for the http client. > Additionally, we could consider adding a --job-server-request-timeout to the > [PortableOptions|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/options/pipeline_options.py#L805] > class to be used in the JobServer interactions inside portable_runner.py. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8303) Filesystems not properly registered using FileIO.write()
[ https://issues.apache.org/jira/browse/BEAM-8303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939002#comment-16939002 ] Preston Koprivica commented on BEAM-8303: - [~mxm] Sorry for the delay. I was struggling with my local build and I had to track down some issues. Totally unrelated, but if you have some time, I think I may have found an issue related to some recent build changes [1]. In any case, I was able to finally get the local build working and pulled into my test project. {quote} Just to prove this theory, do you mind building Beam and testing your pipeline with the following line added before line 75? https://github.com/apache/beam/blob/04dc3c3b14ab780e9736d5f769c6bf2a27a390bb/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/types/CoderTypeInformation.java#L75 FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create()); {quote} This change did not impact the behavior at all. And I guess the question is, would we have expected it to, using the default PipelineOptions (which I'm assuming wouldn't include the S3 options)? [1] https://issues.apache.org/jira/browse/BEAM-8021 > Filesystems not properly registered using FileIO.write() > > > Key: BEAM-8303 > URL: https://issues.apache.org/jira/browse/BEAM-8303 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Affects Versions: 2.15.0 >Reporter: Preston Koprivica >Assignee: Maximilian Michels >Priority: Major > > I’m getting the following error when attempting to use the FileIO apis > (beam-2.15.0) and integrating with AWS S3.
I have setup the PipelineOptions > with all the relevant AWS options, so the filesystem registry **should** be > properly seeded by the time the graph is compiled and executed: > {code:java} > java.lang.IllegalArgumentException: No filesystem found for scheme s3 > at > org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:456) > at > org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:526) > at > org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1149) > at > org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1105) > at org.apache.beam.sdk.coders.Coder.decode(Coder.java:159) > at > org.apache.beam.sdk.transforms.join.UnionCoder.decode(UnionCoder.java:83) > at > org.apache.beam.sdk.transforms.join.UnionCoder.decode(UnionCoder.java:32) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:543) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:534) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:480) > at > org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.deserialize(CoderTypeSerializer.java:93) > at > org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55) > at > org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:106) > at > org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:72) > at > org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47) > at > org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73) > at > org.apache.flink.runtime.operators.FlatMapDriver.run(FlatMapDriver.java:107) > at 
org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:503) > at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:368) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) > at java.lang.Thread.run(Thread.java:748) > {code} > For reference, the write code resembles this: > {code:java} > FileIO.Write write = FileIO.write() > .via(ParquetIO.sink(schema)) > .to(options.getOutputDir()). // will be something like: > s3:/// > .withSuffix(".parquet"); > records.apply(String.format("Write(%s)", options.getOutputDir()), > write);{code} > The issue does not appear to be related to ParquetIO.sink(). I am able to > reliably reproduce the issue using JSON formatted records and TextIO.sink(), > as well. Moreover, AvroIO is affected if withWindowedWrites() option is > added. > Just trying some different knobs, I went ahead and set the following option: > {code:java} > write = write.withNoSpilling();{code} > This actually seemed to fix the issue, only to have it reemerge as I scaled > up the data set size. The stack trace, while very
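The symptom above — `No filesystem found for scheme s3` thrown from a decode path on a Flink task manager — is consistent with a JVM-wide registry being consulted in a worker JVM that was never seeded from the pipeline options. A pure-Java sketch of that failure mode (the `SchemeRegistry` names here are hypothetical stand-ins, not Beam's actual `FileSystems` class):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a static, JVM-wide registry such as
// org.apache.beam.sdk.io.FileSystems: schemes are registered from pipeline
// options, and lookups throw if the scheme was never registered.
public class Main {
    static final Map<String, String> SCHEMES = new HashMap<>();

    static void registerFromOptions(String scheme, String filesystem) {
        SCHEMES.put(scheme, filesystem);
    }

    static String lookup(String scheme) {
        String fs = SCHEMES.get(scheme);
        if (fs == null) {
            // Mirrors the IllegalArgumentException in the reported stack trace.
            throw new IllegalArgumentException("No filesystem found for scheme " + scheme);
        }
        return fs;
    }

    public static void main(String[] args) {
        // The submitting JVM seeds the registry from the AWS options...
        registerFromOptions("s3", "S3FileSystem");
        System.out.println(lookup("s3")); // S3FileSystem

        // ...but a fresh task-manager JVM that decodes a FileResult before any
        // seeding step runs goes straight to lookup() and hits the exception.
    }
}
```

This also explains why seeding with `PipelineOptionsFactory.create()` would not help: default options carry no S3 configuration, so the `s3` scheme still never gets registered.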
[jira] [Work logged] (BEAM-8275) Beam SQL should support BigQuery in DIRECT_READ mode
[ https://issues.apache.org/jira/browse/BEAM-8275?focusedWorklogId=319254=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319254 ] ASF GitHub Bot logged work on BEAM-8275: Author: ASF GitHub Bot Created on: 26/Sep/19 22:13 Start Date: 26/Sep/19 22:13 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9625: [BEAM-8275] Beam SQL should support BigQuery in DIRECT_READ mode URL: https://github.com/apache/beam/pull/9625#discussion_r328848385 ## File path: sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryReadWriteIT.java ## @@ -154,6 +156,78 @@ public void testSQLRead() { assertEquals(state, State.DONE); } + @Test + public void testSQLRead_withDirectRead() { Review comment: We should probably also add a test that EXPORT works, so when that is no longer the default we will still have a test for it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319254) Time Spent: 1h 10m (was: 1h) > Beam SQL should support BigQuery in DIRECT_READ mode > > > Key: BEAM-8275 > URL: https://issues.apache.org/jira/browse/BEAM-8275 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Andrew Pilloud >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > SQL currently only supports reading from BigQuery in DEFAULT (EXPORT) mode. > We also need to support DIRECT_READ mode. The method should be configurable > by TBLPROPERTIES through the SQL CLI. This will enable us to take advantage > of the DIRECT_READ features for filter and project push down. 
> References: > [https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api] > [https://beam.apache.org/blog/2019/06/04/adding-data-sources-to-sql.html] > [https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java] -- This message was sent by Atlassian Jira (v8.3.4#803005)
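For readers following along, making the read method configurable through TBLPROPERTIES would give the SQL CLI declaration roughly this shape (a sketch: the `method` property key matches the METHOD_PROPERTY constant added in PR #9625, but the table schema and location are hypothetical):

```sql
CREATE EXTERNAL TABLE users (id BIGINT, name VARCHAR)
TYPE bigquery
LOCATION 'my-project:my_dataset.users'
TBLPROPERTIES '{"method": "DIRECT_READ"}'
```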
[jira] [Work logged] (BEAM-8275) Beam SQL should support BigQuery in DIRECT_READ mode
[ https://issues.apache.org/jira/browse/BEAM-8275?focusedWorklogId=319255=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319255 ] ASF GitHub Bot logged work on BEAM-8275: Author: ASF GitHub Bot Created on: 26/Sep/19 22:13 Start Date: 26/Sep/19 22:13 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9625: [BEAM-8275] Beam SQL should support BigQuery in DIRECT_READ mode URL: https://github.com/apache/beam/pull/9625#discussion_r326830964 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java ## @@ -45,15 +49,45 @@ */ @Experimental class BigQueryTable extends BaseBeamTable implements Serializable { + @VisibleForTesting static final String METHOD_PROPERTY = "method"; @VisibleForTesting final String bqLocation; private final ConversionOptions conversionOptions; private BeamTableStatistics rowCountStatistics = null; private static final Logger LOGGER = LoggerFactory.getLogger(BigQueryTable.class); + @VisibleForTesting Method method; Review comment: This is set exactly once (per code path) in the constructor, so you should be able to make it `final`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319255) Time Spent: 1h 10m (was: 1h) > Beam SQL should support BigQuery in DIRECT_READ mode > > > Key: BEAM-8275 > URL: https://issues.apache.org/jira/browse/BEAM-8275 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Andrew Pilloud >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > SQL currently only supports reading from BigQuery in DEFAULT (EXPORT) mode. > We also need to support DIRECT_READ mode. 
The method should be configurable > by TBLPROPERTIES through the SQL CLI. This will enable us to take advantage > of the DIRECT_READ features for filter and project push down. > References: > [https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api] > [https://beam.apache.org/blog/2019/06/04/adding-data-sources-to-sql.html] > [https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8275) Beam SQL should support BigQuery in DIRECT_READ mode
[ https://issues.apache.org/jira/browse/BEAM-8275?focusedWorklogId=319256&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319256 ] ASF GitHub Bot logged work on BEAM-8275: Author: ASF GitHub Bot Created on: 26/Sep/19 22:13 Start Date: 26/Sep/19 22:13 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9625: [BEAM-8275] Beam SQL should support BigQuery in DIRECT_READ mode URL: https://github.com/apache/beam/pull/9625#discussion_r326830808 ## File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java ## @@ -45,15 +49,45 @@ */ @Experimental class BigQueryTable extends BaseBeamTable implements Serializable { + @VisibleForTesting static final String METHOD_PROPERTY = "method"; @VisibleForTesting final String bqLocation; private final ConversionOptions conversionOptions; private BeamTableStatistics rowCountStatistics = null; private static final Logger LOGGER = LoggerFactory.getLogger(BigQueryTable.class); + @VisibleForTesting Method method; - BigQueryTable(Table table, BigQueryUtils.ConversionOptions options) { + BigQueryTable(Table table, BigQueryUtils.ConversionOptions options) + throws InvalidPropertyException { Review comment: This statement is redundant, please remove. `InvalidPropertyException` indirectly extends `RuntimeException`, and every Java method implicitly `throws RuntimeException`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319256) Time Spent: 1h 20m (was: 1h 10m) > Beam SQL should support BigQuery in DIRECT_READ mode > > > Key: BEAM-8275 > URL: https://issues.apache.org/jira/browse/BEAM-8275 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Andrew Pilloud >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 1h 20m > Remaining Estimate: 0h > > SQL currently only supports reading from BigQuery in DEFAULT (EXPORT) mode. > We also need to support DIRECT_READ mode. The method should be configurable > by TBLPROPERTIES through the SQL CLI. This will enable us to take advantage > of the DIRECT_READ features for filter and project push down. > References: > [https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api] > [https://beam.apache.org/blog/2019/06/04/adding-data-sources-to-sql.html] > [https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java] -- This message was sent by Atlassian Jira (v8.3.4#803005)
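The review point above — that a `throws` clause for an unchecked exception is redundant — can be demonstrated in plain Java. The `InvalidPropertyException` and `validate` method below are stand-ins for the PR's code, not the actual Beam classes:

```java
public class Main {
    // Stand-in for the PR's exception: any subclass of RuntimeException
    // is unchecked, so callers never have to declare or catch it.
    static class InvalidPropertyException extends RuntimeException {
        InvalidPropertyException(String message) { super(message); }
    }

    // No `throws` clause, yet throwing the unchecked exception compiles fine;
    // adding `throws InvalidPropertyException` would change nothing.
    static void validate(String method) {
        if (!method.equals("DIRECT_READ") && !method.equals("EXPORT")) {
            throw new InvalidPropertyException("Invalid method: " + method);
        }
    }

    public static void main(String[] args) {
        validate("DIRECT_READ"); // ok, no try/catch required by the compiler
        boolean threw = false;
        try {
            validate("BOGUS");
        } catch (InvalidPropertyException e) {
            threw = true;
        }
        System.out.println(threw); // true
    }
}
```

Only checked exceptions (subclasses of `Exception` outside the `RuntimeException` hierarchy) participate in the compiler's exception checking, which is why the reviewer asks for the clause to be dropped rather than documented.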
[jira] [Work logged] (BEAM-8317) SqlTransform doesn't support aggregation over a filter node
[ https://issues.apache.org/jira/browse/BEAM-8317?focusedWorklogId=319251=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319251 ] ASF GitHub Bot logged work on BEAM-8317: Author: ASF GitHub Bot Created on: 26/Sep/19 22:06 Start Date: 26/Sep/19 22:06 Worklog Time Spent: 10m Work Description: apilloud commented on issue #9672: [BEAM-8317] Add (skipped) test for aggregating after a filter URL: https://github.com/apache/beam/pull/9672#issuecomment-535704007 LGTM. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319251) Time Spent: 20m (was: 10m) > SqlTransform doesn't support aggregation over a filter node > --- > > Key: BEAM-8317 > URL: https://issues.apache.org/jira/browse/BEAM-8317 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Affects Versions: 2.15.0 >Reporter: Brian Hulette >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > For example, the following query fails to translate to a physical plan: > SELECT SUM(f_intValue) FROM PCOLLECTION WHERE f_intValue < 5 GROUP BY > f_intGroupingKey -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8317) SqlTransform doesn't support aggregation over a filter node
[ https://issues.apache.org/jira/browse/BEAM-8317?focusedWorklogId=319252=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319252 ] ASF GitHub Bot logged work on BEAM-8317: Author: ASF GitHub Bot Created on: 26/Sep/19 22:06 Start Date: 26/Sep/19 22:06 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9672: [BEAM-8317] Add (skipped) test for aggregating after a filter URL: https://github.com/apache/beam/pull/9672 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319252) Time Spent: 0.5h (was: 20m) > SqlTransform doesn't support aggregation over a filter node > --- > > Key: BEAM-8317 > URL: https://issues.apache.org/jira/browse/BEAM-8317 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Affects Versions: 2.15.0 >Reporter: Brian Hulette >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > For example, the following query fails to translate to a physical plan: > SELECT SUM(f_intValue) FROM PCOLLECTION WHERE f_intValue < 5 GROUP BY > f_intGroupingKey -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8021) Add Automatic-Module-Name headers for Beam Java modules
[ https://issues.apache.org/jira/browse/BEAM-8021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938995#comment-16938995 ] Preston Koprivica commented on BEAM-8021: - [~ŁukaszG]I think the PR associated to this issue may have broken local builds. I'm still very new to beam (and gradle), so please bear with me and apologies if I'm mistaken. The default for applyJavaNature (as of 2.15.0) was to publish [1]. The project :sdks:java:build-tools was previously being published and there was a compile dependency on it by the flink-runner [2]. It appears that dependency still exists [3], but the build-tools project is no longer being published, hence the broken builds. I'm guessing that the reason it wasn't caught in the PR is because the SNAPSHOT artifact was still available in whatever repo the build server was accessing. And I'm also wondering if this doesn't manifest when you attempt to release it. [1] https://github.com/apache/beam/blob/v2.15.0/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L129 [2] https://github.com/apache/beam/blob/release-2.15.0/runners/flink/flink_runner.gradle [3] https://github.com/apache/beam/blob/2acbfbd/runners/flink/flink_runner.gradle#L102 > Add Automatic-Module-Name headers for Beam Java modules > > > Key: BEAM-8021 > URL: https://issues.apache.org/jira/browse/BEAM-8021 > Project: Beam > Issue Type: Sub-task > Components: build-system >Reporter: Ismaël Mejía >Assignee: Lukasz Gajowy >Priority: Minor > Fix For: 2.17.0 > > Time Spent: 8h > Remaining Estimate: 0h > > For compatibility with the Java Platform Module System (JPMS) in Java 9 and > later, every JAR should have a module name, even if the library does not > itself use modules. As [suggested in the mailing > list|https://lists.apache.org/thread.html/956065580ce049481e756482dc3ccfdc994fef3b8cdb37cab3e2d9b1@%3Cdev.beam.apache.org%3E], > this is a simple change that we can do and still be backwards compatible. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
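The dependency the comment describes has roughly this shape in `flink_runner.gradle` (a sketch of the pattern under the old Gradle configuration names, not the exact line in the repository):

```groovy
// runners/flink/flink_runner.gradle (sketch)
// This resolves only while a :sdks:java:build-tools artifact is produced;
// if that project stops being published while the dependency remains,
// builds that rely on the published artifact break.
dependencies {
  compile project(":sdks:java:build-tools")
}
```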
[jira] [Work logged] (BEAM-7886) Make row coder a standard coder and implement in python
[ https://issues.apache.org/jira/browse/BEAM-7886?focusedWorklogId=319240=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319240 ] ASF GitHub Bot logged work on BEAM-7886: Author: ASF GitHub Bot Created on: 26/Sep/19 21:36 Start Date: 26/Sep/19 21:36 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on issue #9188: [BEAM-7886] Make row coder a standard coder and implement in Python URL: https://github.com/apache/beam/pull/9188#issuecomment-535695414 Apologies for letting this go stale. After [BEAM-8111](https://issues.apache.org/jira/browse/BEAM-8111) I wanted to make sure we had some better test coverage on the Java side. @udim, @aaltay, @robertwb, and/or @chadrik - would you mind taking another look at the Python changes now? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319240) Time Spent: 10h 50m (was: 10h 40m) > Make row coder a standard coder and implement in python > --- > > Key: BEAM-7886 > URL: https://issues.apache.org/jira/browse/BEAM-7886 > Project: Beam > Issue Type: Improvement > Components: beam-model, sdk-java-core, sdk-py-core >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Time Spent: 10h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7886) Make row coder a standard coder and implement in python
[ https://issues.apache.org/jira/browse/BEAM-7886?focusedWorklogId=319238=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319238 ] ASF GitHub Bot logged work on BEAM-7886: Author: ASF GitHub Bot Created on: 26/Sep/19 21:33 Start Date: 26/Sep/19 21:33 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on pull request #9188: [BEAM-7886] Make row coder a standard coder and implement in Python URL: https://github.com/apache/beam/pull/9188#discussion_r328835963 ## File path: sdks/python/apache_beam/coders/row_coder.py ## @@ -0,0 +1,162 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +from __future__ import absolute_import + +import itertools +from array import array + +from apache_beam.coders.coder_impl import StreamCoderImpl +from apache_beam.coders.coders import BytesCoder +from apache_beam.coders.coders import Coder +from apache_beam.coders.coders import FastCoder +from apache_beam.coders.coders import FloatCoder +from apache_beam.coders.coders import IterableCoder +from apache_beam.coders.coders import StrUtf8Coder +from apache_beam.coders.coders import TupleCoder +from apache_beam.coders.coders import VarIntCoder +from apache_beam.portability import common_urns +from apache_beam.portability.api import schema_pb2 +from apache_beam.typehints.schemas import named_tuple_from_schema +from apache_beam.typehints.schemas import named_tuple_to_schema + +__all__ = ["RowCoder"] + + +class RowCoder(FastCoder): + """ Coder for `typing.NamedTuple` instances. + + Implements the beam:coder:row:v1 standard coder spec. + """ + + def __init__(self, schema): +self.schema = schema +self.components = [ +coder_from_type(field.type) for field in self.schema.fields +] + + def _create_impl(self): +return RowCoderImpl(self.schema, self.components) + + def is_deterministic(self): +return all(c.is_deterministic() for c in self.components) + + def to_type_hint(self): +return named_tuple_from_schema(self.schema) + + def as_cloud_object(self, coders_context=None): +raise NotImplementedError("TODO") + + def __eq__(self, other): +return type(self) == type(other) and self.schema == other.schema + + def __hash__(self): Review comment: Did some reading on this and it looks like I was wrong. From [SO](https://stackoverflow.com/a/53519136): > A class that overrides `__eq__()` and does not define `__hash__()` will have its `__hash__()` implicitly set to None. I went ahead and removed __hash__ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319238) Time Spent: 10.5h (was: 10h 20m) > Make row coder a standard coder and implement in python > --- > > Key: BEAM-7886 > URL: https://issues.apache.org/jira/browse/BEAM-7886 > Project: Beam > Issue Type: Improvement > Components: beam-model, sdk-java-core, sdk-py-core >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Time Spent: 10.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
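The Stack Overflow behavior cited in the review thread above is easy to verify in plain Python: a class that defines `__eq__` without `__hash__` gets `__hash__ = None` and its instances become unhashable. The class below is a minimal stand-in, not the actual `RowCoder`:

```python
class EqOnlyCoder:
    """Defines __eq__ but not __hash__, like RowCoder after the change."""

    def __init__(self, schema):
        self.schema = schema

    def __eq__(self, other):
        return type(self) == type(other) and self.schema == other.schema


# Python implicitly sets __hash__ to None when __eq__ is defined...
print(EqOnlyCoder.__hash__)  # None

# ...so instances can no longer be used as dict keys or set members.
try:
    hash(EqOnlyCoder("schema"))
except TypeError:
    print("unhashable")
```

In other words, removing an explicit `__hash__` from a class that defines `__eq__` does not fall back to the default identity hash; it disables hashing entirely, which is the safe behavior when equality is value-based.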
[jira] [Work logged] (BEAM-7886) Make row coder a standard coder and implement in python
[ https://issues.apache.org/jira/browse/BEAM-7886?focusedWorklogId=319239=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319239 ] ASF GitHub Bot logged work on BEAM-7886: Author: ASF GitHub Bot Created on: 26/Sep/19 21:33 Start Date: 26/Sep/19 21:33 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on pull request #9188: [BEAM-7886] Make row coder a standard coder and implement in Python URL: https://github.com/apache/beam/pull/9188#discussion_r328835963 ## File path: sdks/python/apache_beam/coders/row_coder.py ## @@ -0,0 +1,162 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +from __future__ import absolute_import + +import itertools +from array import array + +from apache_beam.coders.coder_impl import StreamCoderImpl +from apache_beam.coders.coders import BytesCoder +from apache_beam.coders.coders import Coder +from apache_beam.coders.coders import FastCoder +from apache_beam.coders.coders import FloatCoder +from apache_beam.coders.coders import IterableCoder +from apache_beam.coders.coders import StrUtf8Coder +from apache_beam.coders.coders import TupleCoder +from apache_beam.coders.coders import VarIntCoder +from apache_beam.portability import common_urns +from apache_beam.portability.api import schema_pb2 +from apache_beam.typehints.schemas import named_tuple_from_schema +from apache_beam.typehints.schemas import named_tuple_to_schema + +__all__ = ["RowCoder"] + + +class RowCoder(FastCoder): + """ Coder for `typing.NamedTuple` instances. + + Implements the beam:coder:row:v1 standard coder spec. + """ + + def __init__(self, schema): +self.schema = schema +self.components = [ +coder_from_type(field.type) for field in self.schema.fields +] + + def _create_impl(self): +return RowCoderImpl(self.schema, self.components) + + def is_deterministic(self): +return all(c.is_deterministic() for c in self.components) + + def to_type_hint(self): +return named_tuple_from_schema(self.schema) + + def as_cloud_object(self, coders_context=None): +raise NotImplementedError("TODO") + + def __eq__(self, other): +return type(self) == type(other) and self.schema == other.schema + + def __hash__(self): Review comment: Did some reading on this and it looks like I was wrong. From [SO](https://stackoverflow.com/a/53519136): > A class that overrides `__eq__()` and does not define `__hash__()` will have its `__hash__()` implicitly set to None. I went ahead and removed `__hash__` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319239) Time Spent: 10h 40m (was: 10.5h) > Make row coder a standard coder and implement in python > --- > > Key: BEAM-7886 > URL: https://issues.apache.org/jira/browse/BEAM-7886 > Project: Beam > Issue Type: Improvement > Components: beam-model, sdk-java-core, sdk-py-core >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Time Spent: 10h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7886) Make row coder a standard coder and implement in python
[ https://issues.apache.org/jira/browse/BEAM-7886?focusedWorklogId=319237=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319237 ] ASF GitHub Bot logged work on BEAM-7886: Author: ASF GitHub Bot Created on: 26/Sep/19 21:32 Start Date: 26/Sep/19 21:32 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on pull request #9188: [BEAM-7886] Make row coder a standard coder and implement in Python URL: https://github.com/apache/beam/pull/9188#discussion_r328835963 ## File path: sdks/python/apache_beam/coders/row_coder.py ## @@ -0,0 +1,162 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +from __future__ import absolute_import + +import itertools +from array import array + +from apache_beam.coders.coder_impl import StreamCoderImpl +from apache_beam.coders.coders import BytesCoder +from apache_beam.coders.coders import Coder +from apache_beam.coders.coders import FastCoder +from apache_beam.coders.coders import FloatCoder +from apache_beam.coders.coders import IterableCoder +from apache_beam.coders.coders import StrUtf8Coder +from apache_beam.coders.coders import TupleCoder +from apache_beam.coders.coders import VarIntCoder +from apache_beam.portability import common_urns +from apache_beam.portability.api import schema_pb2 +from apache_beam.typehints.schemas import named_tuple_from_schema +from apache_beam.typehints.schemas import named_tuple_to_schema + +__all__ = ["RowCoder"] + + +class RowCoder(FastCoder): + """ Coder for `typing.NamedTuple` instances. + + Implements the beam:coder:row:v1 standard coder spec. + """ + + def __init__(self, schema): +self.schema = schema +self.components = [ +coder_from_type(field.type) for field in self.schema.fields +] + + def _create_impl(self): +return RowCoderImpl(self.schema, self.components) + + def is_deterministic(self): +return all(c.is_deterministic() for c in self.components) + + def to_type_hint(self): +return named_tuple_from_schema(self.schema) + + def as_cloud_object(self, coders_context=None): +raise NotImplementedError("TODO") + + def __eq__(self, other): +return type(self) == type(other) and self.schema == other.schema + + def __hash__(self): Review comment: Did some reading on this and it looks like I was wrong. From (SO)[https://stackoverflow.com/a/53519136]: > A class that overrides `__eq__()` and does not define `__hash__()` will have its `__hash__()` implicitly set to None. I went ahead and removed __hash__ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319237) Time Spent: 10h 20m (was: 10h 10m) > Make row coder a standard coder and implement in python > --- > > Key: BEAM-7886 > URL: https://issues.apache.org/jira/browse/BEAM-7886 > Project: Beam > Issue Type: Improvement > Components: beam-model, sdk-java-core, sdk-py-core >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Time Spent: 10h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8317) SqlTransform doesn't support aggregation over a filter node
[ https://issues.apache.org/jira/browse/BEAM-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938980#comment-16938980 ] Brian Hulette commented on BEAM-8317: - Discussed this with [~apilloud]. He said that the appropriate fix is probably to make [BeamBasicAggregationRule|https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamBasicAggregationRule.java] accept anything that's _not_ a projection, rather than the current approach, which accepts only table scans. Basically it should complement [BeamAggregationRule|https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamAggregationRule.java], which accepts _only_ projections. > SqlTransform doesn't support aggregation over a filter node > --- > > Key: BEAM-8317 > URL: https://issues.apache.org/jira/browse/BEAM-8317 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Affects Versions: 2.15.0 >Reporter: Brian Hulette >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > For example, the following query fails to translate to a physical plan: > SELECT SUM(f_intValue) FROM PCOLLECTION WHERE f_intValue < 5 GROUP BY > f_intGroupingKey -- This message was sent by Atlassian Jira (v8.3.4#803005)
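The proposed fix is about making the two rules' match conditions complementary on the aggregate's input node. A schematic sketch of that predicate split (hypothetical stand-in types, not Calcite's actual `RelNode` classes or the rules' real match API):

```java
public class Main {
    // Stand-ins for relational plan nodes.
    interface RelNode {}
    static class Project implements RelNode {}
    static class Filter implements RelNode {}
    static class TableScan implements RelNode {}

    // BeamAggregationRule: fires only when the aggregate's input is a Project.
    static boolean aggregationRuleMatches(RelNode input) {
        return input instanceof Project;
    }

    // Proposed BeamBasicAggregationRule: fires on anything that is NOT a
    // Project (Filter, TableScan, ...), instead of only TableScan as today.
    static boolean basicAggregationRuleMatches(RelNode input) {
        return !(input instanceof Project);
    }

    public static void main(String[] args) {
        // A Filter input -- the WHERE-clause case from BEAM-8317 -- is now
        // matched by the basic rule, so the aggregate gets a physical plan.
        System.out.println(basicAggregationRuleMatches(new Filter())); // true
        System.out.println(aggregationRuleMatches(new Project()));     // true
    }
}
```

With the two predicates exact complements of each other, every aggregate input is matched by exactly one rule, closing the gap where a Filter input matched neither.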
[jira] [Work logged] (BEAM-8317) SqlTransform doesn't support aggregation over a filter node
[ https://issues.apache.org/jira/browse/BEAM-8317?focusedWorklogId=319227&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319227 ] ASF GitHub Bot logged work on BEAM-8317: Author: ASF GitHub Bot Created on: 26/Sep/19 21:19 Start Date: 26/Sep/19 21:19 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on pull request #9672: [BEAM-8317] Add (skipped) test for aggregating after a filter URL: https://github.com/apache/beam/pull/9672 R: @apilloud
[jira] [Created] (BEAM-8317) SqlTransform doesn't support aggregation over a filter node
Brian Hulette created BEAM-8317: --- Summary: SqlTransform doesn't support aggregation over a filter node Key: BEAM-8317 URL: https://issues.apache.org/jira/browse/BEAM-8317 Project: Beam Issue Type: Bug Components: dsl-sql Affects Versions: 2.15.0 Reporter: Brian Hulette For example, the following query fails to translate to a physical plan: SELECT SUM(f_intValue) FROM PCOLLECTION WHERE f_intValue < 5 GROUP BY f_intGroupingKey -- This message was sent by Atlassian Jira (v8.3.4#803005)
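The intended semantics of the failing query are easy to state even though the planner cannot translate it. The following plain-Python sketch (illustrative only, not Beam SQL; the field names are taken from the report) shows what the filter-then-aggregate should compute:

```python
# Plain-Python sketch of the result the BEAM-8317 query should produce:
# SELECT SUM(f_intValue) FROM PCOLLECTION WHERE f_intValue < 5 GROUP BY f_intGroupingKey
from collections import defaultdict

def filtered_grouped_sum(rows, threshold=5):
    """Filter rows by f_intValue, then sum f_intValue per f_intGroupingKey."""
    sums = defaultdict(int)
    for row in rows:
        if row["f_intValue"] < threshold:
            sums[row["f_intGroupingKey"]] += row["f_intValue"]
    return dict(sums)

rows = [
    {"f_intGroupingKey": "a", "f_intValue": 1},
    {"f_intGroupingKey": "a", "f_intValue": 7},  # filtered out (>= 5)
    {"f_intGroupingKey": "b", "f_intValue": 3},
]
print(filtered_grouped_sum(rows))  # {'a': 1, 'b': 3}
```

The skipped test added in PR #9672 exercises this same filter-before-aggregate shape through SqlTransform.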
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319207&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319207 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 20:48 Start Date: 26/Sep/19 20:48 Worklog Time Spent: 10m Work Description: jhalaria commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535680190 Thanks @aromanenko-dev . Build is happy now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319207) Time Spent: 4h 10m (was: 4h)
> KinesisIO.write causes NPE as the producer is null
> --
>
> Key: BEAM-8300
> URL: https://issues.apache.org/jira/browse/BEAM-8300
> Project: Beam
> Issue Type: Bug
> Components: io-java-kinesis
> Affects Versions: 2.15.0
> Reporter: Ankit Jhalaria
> Assignee: Ankit Jhalaria
> Priority: Minor
> Fix For: Not applicable
>
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> While using KinesisIO.write(), we encountered an NPE with the following stack trace:
> {code:java}
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)
>     at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)
>     at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)
>     at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)
>     at org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException: null
>     at org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)
>     at org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669)
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
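The failure pattern behind this stack trace, a lazily created producer that can still be null when a bundle is flushed, can be sketched in plain Python. The names below are hypothetical and this is not the KinesisIO code; it only illustrates the guard that prevents this class of error:

```python
# Illustrative sketch of guarding a lazily initialized producer so that
# flushing a bundle before any write does not fail.
class BundleWriter:
    def __init__(self, producer_factory):
        self._producer_factory = producer_factory
        self._producer = None  # created lazily, as in the reported bug

    def _ensure_producer(self):
        # Guarding every flush path avoids the null-producer failure.
        if self._producer is None:
            self._producer = self._producer_factory()
        return self._producer

    def write(self, record):
        self._ensure_producer().append(record)

    def finish_bundle(self):
        # Without _ensure_producer() this would raise on an empty bundle,
        # the Python analogue of the reported NullPointerException.
        return len(self._ensure_producer())

writer = BundleWriter(list)
print(writer.finish_bundle())  # 0: safe even though nothing was written
```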
[jira] [Assigned] (BEAM-5543) Beam Dependency Update Request: com.gradle:build-scan-plugin
[ https://issues.apache.org/jira/browse/BEAM-5543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik reassigned BEAM-5543: --- Assignee: (was: Luke Cwik) > Beam Dependency Update Request: com.gradle:build-scan-plugin > > > Key: BEAM-5543 > URL: https://issues.apache.org/jira/browse/BEAM-5543 > Project: Beam > Issue Type: Sub-task > Components: dependencies >Reporter: Beam JIRA Bot >Priority: Major > > - 2018-10-01 19:29:44.279184 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 1.16 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-10-08 12:16:12.858178 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 1.16 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-10-15 12:11:19.400992 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 1.16 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-10-22 12:11:02.816812 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 2.0 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-10-29 12:13:18.853246 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. 
The latest version is 2.0.1 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-11-05 12:12:01.873127 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 2.0.1 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-11-12 12:11:52.184461 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 2.0.2 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-11-19 12:12:39.182250 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 2.0.2 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-11-26 12:11:44.898503 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 2.0.2 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-12-03 12:12:05.592304 > - > Please consider upgrading the dependency > com.gradle:build-scan-plugin. > The current version is 1.13.1. The latest version is 2.0.2 > cc: [~swegner], > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-12-10 12:14:26.144633 > - > Please consider upgrading the
[jira] [Assigned] (BEAM-6646) Beam Dependency Update Request: com.gradle.build-scan
[ https://issues.apache.org/jira/browse/BEAM-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik reassigned BEAM-6646: --- Assignee: (was: Luke Cwik) > Beam Dependency Update Request: com.gradle.build-scan > - > > Key: BEAM-6646 > URL: https://issues.apache.org/jira/browse/BEAM-6646 > Project: Beam > Issue Type: Bug > Components: dependencies >Reporter: Beam JIRA Bot >Priority: Major > > - 2019-02-11 12:12:25.062529 > - > Please consider upgrading the dependency com.gradle.build-scan. > The current version is None. The latest version is None > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-6647) Beam Dependency Update Request: com.gradle.build-scan:com.gradle.build-scan.gradle.plugin
[ https://issues.apache.org/jira/browse/BEAM-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik resolved BEAM-6647. - Fix Version/s: 2.16.0 Resolution: Fixed This has been updated to 2.4 > Beam Dependency Update Request: > com.gradle.build-scan:com.gradle.build-scan.gradle.plugin > - > > Key: BEAM-6647 > URL: https://issues.apache.org/jira/browse/BEAM-6647 > Project: Beam > Issue Type: Sub-task > Components: dependencies >Reporter: Beam JIRA Bot >Assignee: Luke Cwik >Priority: Major > Fix For: 2.16.0 > > > - 2019-02-11 12:12:26.233579 > - > Please consider upgrading the dependency > com.gradle.build-scan:com.gradle.build-scan.gradle.plugin. > The current version is 1.13.1. The latest version is 2.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-08-12 12:04:23.565409 > - > Please consider upgrading the dependency > com.gradle.build-scan:com.gradle.build-scan.gradle.plugin. > The current version is 2.1. The latest version is 2.4 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1296) Providing a small dataset for "Apache Beam Mobile Gaming Pipeline Examples"
[ https://issues.apache.org/jira/browse/BEAM-1296?focusedWorklogId=319197&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319197 ] ASF GitHub Bot logged work on BEAM-1296: Author: ASF GitHub Bot Created on: 26/Sep/19 20:18 Start Date: 26/Sep/19 20:18 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9633: [BEAM-1296] Providing a small dataset for "Apache Beam Mobile Gaming … URL: https://github.com/apache/beam/pull/9633 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319197) Time Spent: 1.5h (was: 1h 20m) > Providing a small dataset for "Apache Beam Mobile Gaming Pipeline Examples" > --- > > Key: BEAM-1296 > URL: https://issues.apache.org/jira/browse/BEAM-1296 > Project: Beam > Issue Type: Wish > Components: examples-java > Reporter: Keiji Yoshida > Assignee: John Patoch > Priority: Trivial > Labels: ccoss2019, newbie, starter > Time Spent: 1.5h > Remaining Estimate: 0h > > The dataset "gs://apache-beam-samples/game/gaming_data*.csv" for the "Apache Beam Mobile Gaming Pipeline Examples" is huge (about 12 GB), and it takes a long time to download. This can pose difficulties for Apache Beam beginners who want to try the examples quickly. How about providing a small dataset (say, less than 1 GB) for these examples? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1296) Providing a small dataset for "Apache Beam Mobile Gaming Pipeline Examples"
[ https://issues.apache.org/jira/browse/BEAM-1296?focusedWorklogId=319196&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319196 ] ASF GitHub Bot logged work on BEAM-1296: Author: ASF GitHub Bot Created on: 26/Sep/19 20:17 Start Date: 26/Sep/19 20:17 Worklog Time Spent: 10m Work Description: pabloem commented on issue #9633: [BEAM-1296] Providing a small dataset for "Apache Beam Mobile Gaming … URL: https://github.com/apache/beam/pull/9633#issuecomment-535669628 Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319196) Time Spent: 1h 20m (was: 1h 10m) > Providing a small dataset for "Apache Beam Mobile Gaming Pipeline Examples" > --- > > Key: BEAM-1296 > URL: https://issues.apache.org/jira/browse/BEAM-1296 > Project: Beam > Issue Type: Wish > Components: examples-java > Reporter: Keiji Yoshida > Assignee: John Patoch > Priority: Trivial > Labels: ccoss2019, newbie, starter > Time Spent: 1h 20m > Remaining Estimate: 0h > > The dataset "gs://apache-beam-samples/game/gaming_data*.csv" for the "Apache Beam Mobile Gaming Pipeline Examples" is huge (about 12 GB), and it takes a long time to download. This can pose difficulties for Apache Beam beginners who want to try the examples quickly. How about providing a small dataset (say, less than 1 GB) for these examples? -- This message was sent by Atlassian Jira (v8.3.4#803005)
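One low-tech way to produce such a reduced dataset is to sample the large CSV down by a fixed fraction. The sketch below is a hypothetical plain-Python illustration (the data here is made up), not part of the Beam examples themselves:

```python
# Hypothetical sketch: downsample a large CSV, e.g. to cut a ~12 GB
# gaming_data file down to something quick to experiment with.
import random

def sample_lines(lines, keep_fraction, seed=42):
    """Keep roughly keep_fraction of the data lines; the header is always kept."""
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    it = iter(lines)
    header = next(it)
    kept = [header]
    kept.extend(line for line in it if rng.random() < keep_fraction)
    return kept

data = ["user,score"] + [f"u{i},{i}" for i in range(1000)]
small = sample_lines(data, keep_fraction=0.1)
print(len(small))  # header plus roughly 10% of the rows
```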
[jira] [Work logged] (BEAM-8313) Rename yyy_reference to yyy_id to be consistent across protos
[ https://issues.apache.org/jira/browse/BEAM-8313?focusedWorklogId=319189=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319189 ] ASF GitHub Bot logged work on BEAM-8313: Author: ASF GitHub Bot Created on: 26/Sep/19 19:55 Start Date: 26/Sep/19 19:55 Worklog Time Spent: 10m Work Description: lukecwik commented on pull request #9668: [BEAM-8313] Follow up on PR comments for #9663 URL: https://github.com/apache/beam/pull/9668 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319189) Time Spent: 2h 10m (was: 2h) > Rename yyy_reference to yyy_id to be consistent across protos > - > > Key: BEAM-8313 > URL: https://issues.apache.org/jira/browse/BEAM-8313 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Fix For: 2.17.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > This will replace: > * ptransform_id -> transform_id > * ptransform_reference -> transform_id > * instruction_reference -> instruction_id > * process_bundle_descriptor_reference -> process_bundle_descriptor_id > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work started] (BEAM-8313) Rename yyy_reference to yyy_id to be consistent across protos
[ https://issues.apache.org/jira/browse/BEAM-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on BEAM-8313 started by Luke Cwik. --- > Rename yyy_reference to yyy_id to be consistent across protos > - > > Key: BEAM-8313 > URL: https://issues.apache.org/jira/browse/BEAM-8313 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > This will replace: > * ptransform_id -> transform_id > * ptransform_reference -> transform_id > * instruction_reference -> instruction_id > * process_bundle_descriptor_reference -> process_bundle_descriptor_id > -- This message was sent by Atlassian Jira (v8.3.4#803005)
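The rename list above is mechanical and can be expressed as a simple lookup table. A plain-Python sketch (purely illustrative; the real change edits the .proto definitions and regenerated code):

```python
# BEAM-8313 rename table, taken directly from the issue description.
RENAMES = {
    "ptransform_id": "transform_id",
    "ptransform_reference": "transform_id",
    "instruction_reference": "instruction_id",
    "process_bundle_descriptor_reference": "process_bundle_descriptor_id",
}

def rename_field(name: str) -> str:
    """Map an old proto field name to its new, consistent *_id form."""
    return RENAMES.get(name, name)

print(rename_field("instruction_reference"))  # instruction_id
print(rename_field("transform_id"))           # transform_id (already consistent)
```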
[jira] [Resolved] (BEAM-8315) Remove unused fields for splitting SDFs in beam_fn_api.proto
[ https://issues.apache.org/jira/browse/BEAM-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik resolved BEAM-8315. - Fix Version/s: 2.17.0 Resolution: Fixed > Remove unused fields for splitting SDFs in beam_fn_api.proto > > > Key: BEAM-8315 > URL: https://issues.apache.org/jira/browse/BEAM-8315 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Fix For: 2.17.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-8313) Rename yyy_reference to yyy_id to be consistent across protos
[ https://issues.apache.org/jira/browse/BEAM-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik resolved BEAM-8313. - Fix Version/s: 2.17.0 Resolution: Fixed > Rename yyy_reference to yyy_id to be consistent across protos > - > > Key: BEAM-8313 > URL: https://issues.apache.org/jira/browse/BEAM-8313 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Fix For: 2.17.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > This will replace: > * ptransform_id -> transform_id > * ptransform_reference -> transform_id > * instruction_reference -> instruction_id > * process_bundle_descriptor_reference -> process_bundle_descriptor_id > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8306) improve estimation of data byte size reading from source in ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-8306?focusedWorklogId=319182&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319182 ] ASF GitHub Bot logged work on BEAM-8306: Author: ASF GitHub Bot Created on: 26/Sep/19 19:43 Start Date: 26/Sep/19 19:43 Worklog Time Spent: 10m Work Description: timrobertson100 commented on issue #9660: [BEAM-8306] improve estimation datasize elasticsearch io URL: https://github.com/apache/beam/pull/9660#issuecomment-535657412 I don't have time to do a thorough review until next week, but I looked over this for 10 mins and my general impression is that it looks like a good addition. (Commit message is not formatted correctly) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319182) Time Spent: 0.5h (was: 20m) > improve estimation of data byte size reading from source in ElasticsearchIO > --- > > Key: BEAM-8306 > URL: https://issues.apache.org/jira/browse/BEAM-8306 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch > Affects Versions: 2.14.0 > Reporter: Derek He > Assignee: Derek He > Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > ElasticsearchIO splits a BoundedSource based on the Elasticsearch index size. We expect it to be more accurate to split based on the query result size. Currently, we have a big Elasticsearch index, but the query result contains only a few documents from the index. ElasticsearchIO splits it into up to 1024 BoundedSources in Google Dataflow, and it takes a long time to finish processing the small number of Elasticsearch documents. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8306) improve estimation of data byte size reading from source in ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-8306?focusedWorklogId=319181&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319181 ] ASF GitHub Bot logged work on BEAM-8306: Author: ASF GitHub Bot Created on: 26/Sep/19 19:43 Start Date: 26/Sep/19 19:43 Worklog Time Spent: 10m Work Description: timrobertson100 commented on issue #9660: [BEAM-8306] improve estimation datasize elasticsearch io URL: https://github.com/apache/beam/pull/9660#issuecomment-535657412 I don't have time to do a thorough review until next week, but I looked over this and my general impression is that it looks like a good addition. (Commit message is not formatted correctly) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319181) Time Spent: 20m (was: 10m) > improve estimation of data byte size reading from source in ElasticsearchIO > --- > > Key: BEAM-8306 > URL: https://issues.apache.org/jira/browse/BEAM-8306 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch > Affects Versions: 2.14.0 > Reporter: Derek He > Assignee: Derek He > Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > ElasticsearchIO splits a BoundedSource based on the Elasticsearch index size. We expect it to be more accurate to split based on the query result size. Currently, we have a big Elasticsearch index, but the query result contains only a few documents from the index. ElasticsearchIO splits it into up to 1024 BoundedSources in Google Dataflow, and it takes a long time to finish processing the small number of Elasticsearch documents. -- This message was sent by Atlassian Jira (v8.3.4#803005)
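The improvement discussed for BEAM-8306 amounts to simple arithmetic: scale the index size by the fraction of documents the query actually matches, rather than using the full index size, so far fewer splits are produced. A hedged plain-Python sketch (the function name and the numbers are illustrative, not ElasticsearchIO's API):

```python
# Sketch of estimating the byte size of a query result from index-level
# statistics: average document size times the number of matching documents.
def estimate_query_bytes(index_size_bytes, docs_in_index, docs_matching_query):
    if docs_in_index == 0:
        return 0
    avg_doc_size = index_size_bytes / docs_in_index
    return int(avg_doc_size * docs_matching_query)

# A big index where the query matches only a handful of documents:
est = estimate_query_bytes(
    index_size_bytes=50 * 1024**3,   # 50 GiB index
    docs_in_index=100_000_000,
    docs_matching_query=2_000,
)
print(est)  # about 1 MiB, so nowhere near 1024 splits are warranted
```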
[jira] [Commented] (BEAM-5192) Support Elasticsearch 7.x
[ https://issues.apache.org/jira/browse/BEAM-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938907#comment-16938907 ] Tim Robertson commented on BEAM-5192: - [~chetaldrich] - help would be greatly appreciated. Please feel free to reassign to you. > Support Elasticsearch 7.x > - > > Key: BEAM-5192 > URL: https://issues.apache.org/jira/browse/BEAM-5192 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch > Reporter: Etienne Chauchot > Assignee: Tim Robertson > Priority: Major > > ES v7 is not out yet. But the Elastic team scheduled a breaking change for ES 7.0: the removal of the type feature. See > [https://www.elastic.co/blog/index-type-parent-child-join-now-future-in-elasticsearch] > This will require a good amount of changes in the IO. > This ticket is there to track the future work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-5192) Support Elasticsearch 7.x
[ https://issues.apache.org/jira/browse/BEAM-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938902#comment-16938902 ] Chet Aldrich commented on BEAM-5192: Hey folks, just curious about the current state of things here. Seems like since this issue was created ES 7.3 was released: [https://www.elastic.co/blog/elasticsearch-7-3-0-released]. I might be able to help if there's a need for it, but if work has started on this in some form I can hold off. > Support Elasticsearch 7.x > - > > Key: BEAM-5192 > URL: https://issues.apache.org/jira/browse/BEAM-5192 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch > Reporter: Etienne Chauchot > Assignee: Tim Robertson > Priority: Major > > ES v7 is not out yet. But the Elastic team scheduled a breaking change for ES 7.0: the removal of the type feature. See > [https://www.elastic.co/blog/index-type-parent-child-join-now-future-in-elasticsearch] > This will require a good amount of changes in the IO. > This ticket is there to track the future work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7389) Colab examples for element-wise transforms (Python)
[ https://issues.apache.org/jira/browse/BEAM-7389?focusedWorklogId=319138&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319138 ] ASF GitHub Bot logged work on BEAM-7389: Author: ASF GitHub Bot Created on: 26/Sep/19 18:54 Start Date: 26/Sep/19 18:54 Worklog Time Spent: 10m Work Description: davidcavazos commented on pull request #9669: [BEAM-7389] Update include buttons to support multiple languages URL: https://github.com/apache/beam/pull/9669 **No content changes, should render exactly the same** Small update on code snippet buttons to support multiple languages. R: @aaltay Filter: http://apache-beam-website-pull-requests.storage.googleapis.com/9661/documentation/transforms/python/elementwise/filter/index.html FlatMap: http://apache-beam-website-pull-requests.storage.googleapis.com/9661/documentation/transforms/python/elementwise/flatmap/index.html
[jira] [Work logged] (BEAM-8315) Remove unused fields for splitting SDFs in beam_fn_api.proto
[ https://issues.apache.org/jira/browse/BEAM-8315?focusedWorklogId=319137=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319137 ] ASF GitHub Bot logged work on BEAM-8315: Author: ASF GitHub Bot Created on: 26/Sep/19 18:53 Start Date: 26/Sep/19 18:53 Worklog Time Spent: 10m Work Description: boyuanzz commented on pull request #9666: [BEAM-8315] Clean-up unused fields for splitting SDFs URL: https://github.com/apache/beam/pull/9666 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319137) Time Spent: 0.5h (was: 20m) > Remove unused fields for splitting SDFs in beam_fn_api.proto > > > Key: BEAM-8315 > URL: https://issues.apache.org/jira/browse/BEAM-8315 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7738) Support PubSubIO to be configured externally for use with other SDKs
[ https://issues.apache.org/jira/browse/BEAM-7738?focusedWorklogId=319133=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319133 ] ASF GitHub Bot logged work on BEAM-7738: Author: ASF GitHub Bot Created on: 26/Sep/19 18:28 Start Date: 26/Sep/19 18:28 Worklog Time Spent: 10m Work Description: chamikaramj commented on issue #9268: [BEAM-7738] Add external transform support to PubsubIO URL: https://github.com/apache/beam/pull/9268#issuecomment-535629521 Run Python PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319133) Time Spent: 4.5h (was: 4h 20m) > Support PubSubIO to be configured externally for use with other SDKs > > > Key: BEAM-7738 > URL: https://issues.apache.org/jira/browse/BEAM-7738 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp, runner-flink, sdk-py-core >Reporter: Chad Dombrova >Assignee: Chad Dombrova >Priority: Major > Labels: portability > Time Spent: 4.5h > Remaining Estimate: 0h > > Now that KafkaIO is supported via the external transform API (BEAM-7029) we > should add support for PubSub. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7738) Support PubSubIO to be configured externally for use with other SDKs
[ https://issues.apache.org/jira/browse/BEAM-7738?focusedWorklogId=319132&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319132 ] ASF GitHub Bot logged work on BEAM-7738: Author: ASF GitHub Bot Created on: 26/Sep/19 18:28 Start Date: 26/Sep/19 18:28 Worklog Time Spent: 10m Work Description: chamikaramj commented on issue #9268: [BEAM-7738] Add external transform support to PubsubIO URL: https://github.com/apache/beam/pull/9268#issuecomment-535629486 Python failure seems to be a docs issue (':sdks:python:test-suites:tox:py2:docs'). Could be a flake. Retrying. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319132) Time Spent: 4h 20m (was: 4h 10m) > Support PubSubIO to be configured externally for use with other SDKs > > > Key: BEAM-7738 > URL: https://issues.apache.org/jira/browse/BEAM-7738 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp, runner-flink, sdk-py-core > Reporter: Chad Dombrova > Assignee: Chad Dombrova > Priority: Major > Labels: portability > Time Spent: 4h 20m > Remaining Estimate: 0h > > Now that KafkaIO is supported via the external transform API (BEAM-7029) we should add support for PubSub. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7738) Support PubSubIO to be configured externally for use with other SDKs
[ https://issues.apache.org/jira/browse/BEAM-7738?focusedWorklogId=319129=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319129 ] ASF GitHub Bot logged work on BEAM-7738: Author: ASF GitHub Bot Created on: 26/Sep/19 18:23 Start Date: 26/Sep/19 18:23 Worklog Time Spent: 10m Work Description: chamikaramj commented on issue #9268: [BEAM-7738] Add external transform support to PubsubIO URL: https://github.com/apache/beam/pull/9268#issuecomment-535627722 Seems like the Java PreCommit failure is for the new test ? Stacktrace is: org.apache.beam.sdk.io.gcp.pubsub.PubsubIOExternalTest > testConstructPubsubRead FAILED java.lang.RuntimeException at PubsubIOExternalTest.java:97 Caused by: java.lang.RuntimeException at PubsubIOExternalTest.java:97 Caused by: java.lang.reflect.InvocationTargetException at PubsubIOExternalTest.java:97 Caused by: java.lang.IllegalStateException at PubsubIOExternalTest.java:97 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319129) Time Spent: 4h 10m (was: 4h) > Support PubSubIO to be configured externally for use with other SDKs > > > Key: BEAM-7738 > URL: https://issues.apache.org/jira/browse/BEAM-7738 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp, runner-flink, sdk-py-core >Reporter: Chad Dombrova >Assignee: Chad Dombrova >Priority: Major > Labels: portability > Time Spent: 4h 10m > Remaining Estimate: 0h > > Now that KafkaIO is supported via the external transform API (BEAM-7029) we > should add support for PubSub. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1296) Providing a small dataset for "Apache Beam Mobile Gaming Pipeline Examples"
[ https://issues.apache.org/jira/browse/BEAM-1296?focusedWorklogId=319126=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319126 ] ASF GitHub Bot logged work on BEAM-1296: Author: ASF GitHub Bot Created on: 26/Sep/19 18:20 Start Date: 26/Sep/19 18:20 Worklog Time Spent: 10m Work Description: angulartist commented on issue #9633: [BEAM-1296] Providing a small dataset for "Apache Beam Mobile Gaming … URL: https://github.com/apache/beam/pull/9633#issuecomment-535626785 The file is in fact accessible. I've deleted the previous file and updated the comment. I left the default original input as it is, because we just want a lighter alternative xoxo This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319126) Time Spent: 1h 10m (was: 1h) > Providing a small dataset for "Apache Beam Mobile Gaming Pipeline Examples" > --- > > Key: BEAM-1296 > URL: https://issues.apache.org/jira/browse/BEAM-1296 > Project: Beam > Issue Type: Wish > Components: examples-java >Reporter: Keiji Yoshida >Assignee: John Patoch >Priority: Trivial > Labels: ccoss2019, newbie, starter > Time Spent: 1h 10m > Remaining Estimate: 0h > > A dataset "gs://apache-beam-samples/game/gaming_data*.csv" for "Apache Beam > Mobile Gaming Pipeline Examples" is so huge (about 12 GB) and it takes long > time to download the dataset. It might pose difficulties to Apache Beam > beginners who want to try "Apache Beam Mobile Gaming Pipeline Examples" > quickly. > How about providing a small dataset (say less than 1 GB) for this examples? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8313) Rename yyy_reference to yyy_id to be consistent across protos
[ https://issues.apache.org/jira/browse/BEAM-8313?focusedWorklogId=319125=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319125 ] ASF GitHub Bot logged work on BEAM-8313: Author: ASF GitHub Bot Created on: 26/Sep/19 18:19 Start Date: 26/Sep/19 18:19 Worklog Time Spent: 10m Work Description: lukecwik commented on issue #9668: [BEAM-8313] Follow up on PR comments for #9663 URL: https://github.com/apache/beam/pull/9668#issuecomment-535626489 Run Java_Examples_Dataflow PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319125) Time Spent: 2h (was: 1h 50m) > Rename yyy_reference to yyy_id to be consistent across protos > - > > Key: BEAM-8313 > URL: https://issues.apache.org/jira/browse/BEAM-8313 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > This will replace: > * ptransform_id -> transform_id > * ptransform_reference -> transform_id > * instruction_reference -> instruction_id > * process_bundle_descriptor_reference -> process_bundle_descriptor_id > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8313) Rename yyy_reference to yyy_id to be consistent across protos
[ https://issues.apache.org/jira/browse/BEAM-8313?focusedWorklogId=319124=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319124 ] ASF GitHub Bot logged work on BEAM-8313: Author: ASF GitHub Bot Created on: 26/Sep/19 18:19 Start Date: 26/Sep/19 18:19 Worklog Time Spent: 10m Work Description: lukecwik commented on issue #9668: [BEAM-8313] Follow up on PR comments for #9663 URL: https://github.com/apache/beam/pull/9668#issuecomment-535626437 Run Java PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319124) Time Spent: 1h 50m (was: 1h 40m) > Rename yyy_reference to yyy_id to be consistent across protos > - > > Key: BEAM-8313 > URL: https://issues.apache.org/jira/browse/BEAM-8313 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > > This will replace: > * ptransform_id -> transform_id > * ptransform_reference -> transform_id > * instruction_reference -> instruction_id > * process_bundle_descriptor_reference -> process_bundle_descriptor_id > -- This message was sent by Atlassian Jira (v8.3.4#803005)
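The rename listed above is purely mechanical, so the migration across protos and harness code amounts to a whole-identifier substitution. The following is an illustrative sketch only, not the actual tooling used for BEAM-8313; the `rename_fields` helper and `RENAMES` table are hypothetical names:

```python
import re

# Field renames from BEAM-8313 (old proto field name -> new name).
RENAMES = {
    'ptransform_id': 'transform_id',
    'ptransform_reference': 'transform_id',
    'instruction_reference': 'instruction_id',
    'process_bundle_descriptor_reference': 'process_bundle_descriptor_id',
}


def rename_fields(text):
  # Replace whole identifiers only; try the longest names first so that
  # e.g. 'ptransform_reference' is not partially matched by a shorter key.
  pattern = re.compile(
      r'\b(%s)\b' % '|'.join(
          sorted(map(re.escape, RENAMES), key=len, reverse=True)))
  return pattern.sub(lambda m: RENAMES[m.group(1)], text)
```

Applying it to a proto text snippet rewrites every old field name in one pass while leaving unrelated identifiers untouched.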
[jira] [Work logged] (BEAM-7495) Add support for dynamic worker re-balancing when reading BigQuery data using Cloud Dataflow
[ https://issues.apache.org/jira/browse/BEAM-7495?focusedWorklogId=319121=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319121 ] ASF GitHub Bot logged work on BEAM-7495: Author: ASF GitHub Bot Created on: 26/Sep/19 18:13 Start Date: 26/Sep/19 18:13 Worklog Time Spent: 10m Work Description: chamikaramj commented on issue #9083: [BEAM-7495] Improve the test that compares EXPORT and DIRECT_READ URL: https://github.com/apache/beam/pull/9083#issuecomment-535624269 Looks like we lost the logs so not sure failures are related. Trying again. Run Java PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319121) Remaining Estimate: 490h 50m (was: 491h) Time Spent: 13h 10m (was: 13h) > Add support for dynamic worker re-balancing when reading BigQuery data using > Cloud Dataflow > --- > > Key: BEAM-7495 > URL: https://issues.apache.org/jira/browse/BEAM-7495 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Aryan Naraghi >Assignee: Aryan Naraghi >Priority: Major > Original Estimate: 504h > Time Spent: 13h 10m > Remaining Estimate: 490h 50m > > Currently, the BigQuery connector for reading data using the BigQuery Storage > API does not support any of the facilities on the source for Dataflow to > split streams. > > On the server side, the BigQuery Storage API supports splitting streams at a > fraction. By adding support to the connector, we enable Dataflow to split > streams, which unlocks dynamic worker re-balancing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
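For context, the fraction-based split that the BigQuery Storage API supports boils down to cutting a position range at a given fraction, yielding a primary piece the current worker keeps and a residual piece another worker can pick up. A minimal sketch of that arithmetic (the `split_at_fraction` name is hypothetical, not the connector's API):

```python
def split_at_fraction(start, end, fraction):
  """Split the half-open range [start, end) at the given fraction.

  Returns (primary, residual) sub-ranges, or None if the split point
  would leave an empty piece on either side (in which case the split
  request should be rejected).
  """
  split = start + int((end - start) * fraction)
  if split <= start or split >= end:
    return None  # refuse splits that produce an empty sub-range
  return (start, split), (split, end)
```

Dynamic work rebalancing relies on the runner being able to issue such splits while a read is in progress, which is exactly the facility the connector was missing.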
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319110&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319110 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 18:02 Start Date: 26/Sep/19 18:02 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328751006

## File path: sdks/python/apache_beam/runners/worker/sdk_worker.py
## @@ -620,6 +632,107 @@ def _next_id(self):
    return str(self._last_id)

class CachingMaterializingStateHandler(object):
  """ A State handler which retrieves and caches state. """

  def __init__(self, global_state_cache, underlying_state):
    self._underlying = underlying_state
    self._state_cache = global_state_cache
    self._context = threading.local()

  @contextlib.contextmanager
  def process_instruction_id(self, bundle_id, cache_tokens):
    if getattr(self._context, 'cache_token', None) is not None:
      raise RuntimeError(
          'Cache tokens already set to %s' % self._context.cache_token)
    # TODO Also handle cache tokens for side input, if present:
    # https://issues.apache.org/jira/browse/BEAM-8298
    user_state_cache_token = None
    for cache_token_struct in cache_tokens:
      if cache_token_struct.HasField("user_state"):
        # There should only be one user state token present
        assert not user_state_cache_token
        user_state_cache_token = cache_token_struct.token
    try:
      self._context.cache_token = user_state_cache_token
      with self._underlying.process_instruction_id(bundle_id):
        yield
    finally:
      self._context.cache_token = None

  def blocking_get(self, state_key, coder, is_cached=False):
    if not self._should_be_cached(is_cached):
      # no cache / tokens, can't do a lookup/store in the cache
      return self._materialize_iter(state_key, coder)
    # Cache lookup
    cache_state_key = self._convert_to_cache_key(state_key)
    cached_value = self._state_cache.get(cache_state_key,
                                         self._context.cache_token)
    if cached_value is None:
      # Cache miss, need to retrieve from the Runner
      materialized = cached_value = list(
          self._materialize_iter(state_key, coder))
      self._state_cache.put(
          cache_state_key,
          self._context.cache_token,
          materialized)
    return iter(cached_value)

  def append(self, state_key, coder, elements, is_cached=False):
    if self._should_be_cached(is_cached):
      # Update the cache
      cache_key = self._convert_to_cache_key(state_key)
      self._state_cache.append(cache_key, self._context.cache_token, elements)
    # Write to state handler
    out = coder_impl.create_OutputStream()
    for element in elements:
      coder.encode_to_stream(element, out, True)
    return self._underlying.append_raw(state_key, out.get())

  def clear(self, state_key, is_cached=False):
    if self._should_be_cached(is_cached):
      cache_key = self._convert_to_cache_key(state_key)
      self._state_cache.clear(cache_key, self._context.cache_token)
    return self._underlying.clear(state_key)

  # The following methods are for interaction with the FnApiRunner:

  def get_raw(self, state_key, continuation_token=None):
    return self._underlying.get_raw(state_key, continuation_token)

  def append_raw(self, state_key, data):
    return self._underlying.append_raw(state_key, data)

  def restore(self):
    self._underlying.restore()

  def checkpoint(self):

Review comment: Note, these and the above until the comment will go away (fix will come tomorrow), as they won't be necessary with the cache not being inserted for the state handler of the fn_api_runner. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319110) Time Spent: 23h 50m (was: 23h 40m) > Implement cross-bundle state caching.
> - > > Key: BEAM-5428 > URL: https://issues.apache.org/jira/browse/BEAM-5428 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: Robert Bradshaw >Assignee: Maximilian Michels >Priority: Major > Time Spent: 23h 50m > Remaining Estimate: 0h > > Tech spec: >
[jira] [Created] (BEAM-8316) What is corresponding data type to set in UDF parameter to match RecordType
Yang Zhang created BEAM-8316: Summary: What is corresponding data type to set in UDF parameter to match RecordType Key: BEAM-8316 URL: https://issues.apache.org/jira/browse/BEAM-8316 Project: Beam Issue Type: Bug Components: beam-model Affects Versions: 2.15.0 Reporter: Yang Zhang Hello Beam community, I want to have a UDF that takes a record as input. Per the error info shown below, the input is *RecordType*, but what should I set as the UDF parameter type so that Beam would not complain about the type compatibility? Below is the full error trace. Thank you very much! === error trace === Exception in thread "main" org.apache.beam.sdk.extensions.sql.impl.ParseException: Unable to parse query select fooudf(pv.header) from kafka.tracking.PageViewEvent as pv Exception in thread "main" org.apache.beam.sdk.extensions.sql.impl.ParseException: Unable to parse query select fooudf(pv.header) from kafka.tracking.PageViewEvent as pv at org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner.convertToBeamRel(CalciteQueryPlanner.java:165) at org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery(BeamSqlEnv.java:103) at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:124) at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:82) at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:539) at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:473) at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:44) at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:169) at com.linkedin.samza.sql.engine.BeamSqlEntry.preparePipeline(BeamSqlEntry.java:52) at com.linkedin.samza.sql.engine.BeamSqlEntry.exec(BeamSqlEntry.java:41) at com.linkedin.samza.sql.engine.BeamSqlUI.main(BeamSqlUI.java:33) Caused by: org.apache.calcite.tools.ValidationException: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 24: No match found for function signature fooudf() at
org.apache.calcite.prepare.PlannerImpl.validate(PlannerImpl.java:190) at org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner.convertToBeamRel(CalciteQueryPlanner.java:136) ... 10 moreCaused by: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 24: No match found for function signature fooudf() at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.calcite.runtime.Resources$ExInstWithCause.ex(Resources.java:463) at org.apache.calcite.sql.SqlUtil.newContextException(SqlUtil.java:787) at org.apache.calcite.sql.SqlUtil.newContextException(SqlUtil.java:772) at org.apache.calcite.sql.validate.SqlValidatorImpl.newValidationError(SqlValidatorImpl.java:4825) at org.apache.calcite.sql.validate.SqlValidatorImpl.handleUnresolvedFunction(SqlValidatorImpl.java:1739) at org.apache.calcite.sql.SqlFunction.deriveType(SqlFunction.java:270) at org.apache.calcite.sql.SqlFunction.deriveType(SqlFunction.java:215) at org.apache.calcite.sql.validate.SqlValidatorImpl$DeriveTypeVisitor.visit(SqlValidatorImpl.java:5584) at org.apache.calcite.sql.validate.SqlValidatorImpl$DeriveTypeVisitor.visit(SqlValidatorImpl.java:5571) at org.apache.calcite.sql.SqlCall.accept(SqlCall.java:138) at org.apache.calcite.sql.validate.SqlValidatorImpl.deriveTypeImpl(SqlValidatorImpl.java:1657) at org.apache.calcite.sql.validate.SqlValidatorImpl.deriveType(SqlValidatorImpl.java:1642) at org.apache.calcite.sql.validate.SqlValidatorImpl.expandSelectItem(SqlValidatorImpl.java:462) at org.apache.calcite.sql.validate.SqlValidatorImpl.validateSelectList(SqlValidatorImpl.java:4089) at org.apache.calcite.sql.validate.SqlValidatorImpl.validateSelect(SqlValidatorImpl.java:3352) at 
org.apache.calcite.sql.validate.SelectNamespace.validateImpl(SelectNamespace.java:60) at org.apache.calcite.sql.validate.AbstractNamespace.validate(AbstractNamespace.java:84) at org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace(SqlValidatorImpl.java:994) at org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery(SqlValidatorImpl.java:954) at org.apache.calcite.sql.SqlSelect.validate(SqlSelect.java:216) at org.apache.calcite.sql.validate.SqlValidatorImpl.validateScopedExpression(SqlValidatorImpl.java:929) at org.apache.calcite.sql.validate.SqlValidatorImpl.validate(SqlValidatorImpl.java:633) at org.apache.calcite.prepare.PlannerImpl.validate(PlannerImpl.java:188) ... 11 moreCaused by: org.apache.calcite.sql.validate.SqlValidatorException: No match found for
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319093=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319093 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 17:15 Start Date: 26/Sep/19 17:15 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328730529 ## File path: sdks/python/apache_beam/runners/portability/fn_api_runner.py ## @@ -1412,11 +1470,13 @@ def stop_worker(self): class WorkerHandlerManager(object): - def __init__(self, environments, job_provision_info): + def __init__(self, environments, job_provision_info, state_cache_size): self._environments = environments self._job_provision_info = job_provision_info self._cached_handlers = collections.defaultdict(list) -self._state = FnApiRunner.StateServicer() # rename? +self._state = sdk_worker.CachingMaterializingStateHandler( +StateCache(state_cache_size), Review comment: I added this because the WorkerHandlerManager will insert the state handler into the WorkerHandlerFactory, which generates a cached BundleProcessorCache with that state handler for the EmbeddedWorkerHandler. It is not necessary otherwise and just does unnecessary caching on the FnApiRunner side. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319093) Time Spent: 23h 40m (was: 23.5h) > Implement cross-bundle state caching. 
> - > > Key: BEAM-5428 > URL: https://issues.apache.org/jira/browse/BEAM-5428 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: Robert Bradshaw >Assignee: Maximilian Michels >Priority: Major > Time Spent: 23h 40m > Remaining Estimate: 0h > > Tech spec: > [https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m] > Relevant document: > [https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit#|https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit] > Mailing list link: > [https://lists.apache.org/thread.html/caa8d9bc6ca871d13de2c5e6ba07fdc76f85d26497d95d90893aa1f6@%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
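As a rough illustration of the caching scheme discussed in this thread (class and method names here are hypothetical, not the SDK's actual implementation): entries are keyed by both the state key and the runner-issued cache token, so when the runner hands out a fresh token, entries stored under the old token simply miss instead of serving stale state.

```python
class TokenScopedStateCache(object):
  """Toy cross-bundle state cache keyed by (state_key, cache_token).

  The runner issues a new cache token whenever previously cached state
  may have become invalid; lookups under the new token then fall back
  to fetching state from the runner.
  """

  def __init__(self):
    self._entries = {}

  def get(self, state_key, token):
    # Returns None on a miss, including any lookup with a stale token.
    return self._entries.get((state_key, token))

  def put(self, state_key, token, value):
    self._entries[(state_key, token)] = value
```

A lookup with the token a value was stored under hits; the same key under a newer token misses, which is the invalidation mechanism in miniature.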
[jira] [Work logged] (BEAM-8313) Rename yyy_reference to yyy_id to be consistent across protos
[ https://issues.apache.org/jira/browse/BEAM-8313?focusedWorklogId=319092=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319092 ] ASF GitHub Bot logged work on BEAM-8313: Author: ASF GitHub Bot Created on: 26/Sep/19 17:10 Start Date: 26/Sep/19 17:10 Worklog Time Spent: 10m Work Description: lukecwik commented on issue #9668: [BEAM-8313] Follow up on PR comments for #9663 URL: https://github.com/apache/beam/pull/9668#issuecomment-535599328 R: @lostluck This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319092) Time Spent: 1h 40m (was: 1.5h) > Rename yyy_reference to yyy_id to be consistent across protos > - > > Key: BEAM-8313 > URL: https://issues.apache.org/jira/browse/BEAM-8313 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > This will replace: > * ptransform_id -> transform_id > * ptransform_reference -> transform_id > * instruction_reference -> instruction_id > * process_bundle_descriptor_reference -> process_bundle_descriptor_id > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8313) Rename yyy_reference to yyy_id to be consistent across protos
[ https://issues.apache.org/jira/browse/BEAM-8313?focusedWorklogId=319091=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319091 ] ASF GitHub Bot logged work on BEAM-8313: Author: ASF GitHub Bot Created on: 26/Sep/19 17:10 Start Date: 26/Sep/19 17:10 Worklog Time Spent: 10m Work Description: lukecwik commented on pull request #9668: [BEAM-8313] Follow up on PR comments for #9663 URL: https://github.com/apache/beam/pull/9668 Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). 
[jira] [Work logged] (BEAM-7389) Colab examples for element-wise transforms (Python)
[ https://issues.apache.org/jira/browse/BEAM-7389?focusedWorklogId=319087=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319087 ] ASF GitHub Bot logged work on BEAM-7389: Author: ASF GitHub Bot Created on: 26/Sep/19 17:07 Start Date: 26/Sep/19 17:07 Worklog Time Spent: 10m Work Description: davidcavazos commented on issue #9664: [BEAM-7389] Created code files to match doc filenames URL: https://github.com/apache/beam/pull/9664#issuecomment-535597904 Yes, will do. Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319087) Time Spent: 60h 50m (was: 60h 40m) > Colab examples for element-wise transforms (Python) > --- > > Key: BEAM-7389 > URL: https://issues.apache.org/jira/browse/BEAM-7389 > Project: Beam > Issue Type: Improvement > Components: website >Reporter: Rose Nguyen >Assignee: David Cavazos >Priority: Minor > Time Spent: 60h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8314) Beam Fn Api metrics piling causes pipeline to stuck after running for a while
[ https://issues.apache.org/jira/browse/BEAM-8314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8314: -- Attachment: E4UaSUhJJKF.png > Beam Fn Api metrics piling causes pipeline to stuck after running for a while > - > > Key: BEAM-8314 > URL: https://issues.apache.org/jira/browse/BEAM-8314 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Yichi Zhang >Priority: Blocker > Fix For: 2.16.0 > > Attachments: E4UaSUhJJKF.png > > > Seems that in StreamingDataflowWorker we are not able to update the metrics > fast enough to dataflow service, the piling metrics causes memory usage to > increase and eventually leads to excessive memory thrashing/GC. And it will > almost stop the pipeline from processing new items. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8314) Beam Fn Api metrics piling causes pipeline to stuck after running for a while
[ https://issues.apache.org/jira/browse/BEAM-8314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8314: -- Description: Seems that in StreamingDataflowWorker we are not able to update the metrics fast enough to dataflow service, the piling metrics causes memory usage to increase and eventually leads to excessive memory thrashing/GC. And it will almost stop the pipeline from processing new items. was: Seems that in StreamingDataflowWorker we are not able to update the metrics fast enough to dataflow service, the piling metrics causes memory usage to increase and eventually leads to excessive memory thrashing/GC. And it will almost stop the pipeline from processing new items. https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2019-09-25_17_47_17-1934265625846707281?project=google.com:clouddfe > Beam Fn Api metrics piling causes pipeline to stuck after running for a while > - > > Key: BEAM-8314 > URL: https://issues.apache.org/jira/browse/BEAM-8314 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Yichi Zhang >Priority: Blocker > Fix For: 2.16.0 > > Attachments: E4UaSUhJJKF.png > > > Seems that in StreamingDataflowWorker we are not able to update the metrics > fast enough to dataflow service, the piling metrics causes memory usage to > increase and eventually leads to excessive memory thrashing/GC. And it will > almost stop the pipeline from processing new items. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8314) Beam Fn Api metrics piling causes pipeline to stuck after running for a while
[ https://issues.apache.org/jira/browse/BEAM-8314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8314: -- Description: Seems that in StreamingDataflowWorker we are not able to update the metrics fast enough to dataflow service, the piling metrics causes memory usage to increase and eventually leads to excessive memory thrashing/GC. And it will almost stop the pipeline from processing new items. !E4UaSUhJJKF.png! was: Seems that in StreamingDataflowWorker we are not able to update the metrics fast enough to dataflow service, the piling metrics causes memory usage to increase and eventually leads to excessive memory thrashing/GC. And it will almost stop the pipeline from processing new items. > Beam Fn Api metrics piling causes pipeline to stuck after running for a while > - > > Key: BEAM-8314 > URL: https://issues.apache.org/jira/browse/BEAM-8314 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Yichi Zhang >Priority: Blocker > Fix For: 2.16.0 > > Attachments: E4UaSUhJJKF.png > > > Seems that in StreamingDataflowWorker we are not able to update the metrics > fast enough to dataflow service, the piling metrics causes memory usage to > increase and eventually leads to excessive memory thrashing/GC. And it will > almost stop the pipeline from processing new items. > > !E4UaSUhJJKF.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
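One conceivable mitigation for this kind of backlog (an illustrative sketch only, not the actual StreamingDataflowWorker fix; all names here are hypothetical) is to coalesce pending updates so that only the latest cumulative value per counter is buffered, keeping memory bounded by the number of distinct counters rather than by how far reporting has fallen behind:

```python
class CoalescingMetricBuffer(object):
  """Buffers metric updates for the service, keeping only the latest
  cumulative value per counter so the buffer cannot pile up unboundedly
  when reporting falls behind."""

  def __init__(self):
    self._latest = {}

  def update(self, counter_name, cumulative_value):
    # Overwrites any pending value for this counter instead of queueing.
    self._latest[counter_name] = cumulative_value

  def drain(self):
    # Hand the coalesced batch to the reporter and reset the buffer.
    updates, self._latest = self._latest, {}
    return updates
```

With cumulative (rather than delta) values, dropping intermediate updates loses no information, since the most recent value subsumes the earlier ones.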
[jira] [Work logged] (BEAM-8315) Remove unused fields for splitting SDFs in beam_fn_api.proto
[ https://issues.apache.org/jira/browse/BEAM-8315?focusedWorklogId=319080=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319080 ] ASF GitHub Bot logged work on BEAM-8315: Author: ASF GitHub Bot Created on: 26/Sep/19 16:52 Start Date: 26/Sep/19 16:52 Worklog Time Spent: 10m Work Description: lukecwik commented on issue #9666: [BEAM-8315] Clean-up unused fields for splitting SDFs URL: https://github.com/apache/beam/pull/9666#issuecomment-535591932 R: @boyuanzz This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319080) Time Spent: 20m (was: 10m) > Remove unused fields for splitting SDFs in beam_fn_api.proto > > > Key: BEAM-8315 > URL: https://issues.apache.org/jira/browse/BEAM-8315 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8315) Remove unused fields for splitting SDFs in beam_fn_api.proto
[ https://issues.apache.org/jira/browse/BEAM-8315?focusedWorklogId=319078=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319078 ] ASF GitHub Bot logged work on BEAM-8315: Author: ASF GitHub Bot Created on: 26/Sep/19 16:52 Start Date: 26/Sep/19 16:52 Worklog Time Spent: 10m Work Description: lukecwik commented on pull request #9666: [BEAM-8315] Clean-up unused fields for splitting SDFs URL: https://github.com/apache/beam/pull/9666 I kept output watermark around since this is expected to be reported for streaming SDFs.
[jira] [Created] (BEAM-8315) Remove unused fields for splitting SDFs in beam_fn_api.proto
Luke Cwik created BEAM-8315: --- Summary: Remove unused fields for splitting SDFs in beam_fn_api.proto Key: BEAM-8315 URL: https://issues.apache.org/jira/browse/BEAM-8315 Project: Beam Issue Type: Sub-task Components: beam-model Reporter: Luke Cwik Assignee: Luke Cwik -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8315) Remove unused fields for splitting SDFs in beam_fn_api.proto
[ https://issues.apache.org/jira/browse/BEAM-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik updated BEAM-8315: Status: Open (was: Triage Needed) > Remove unused fields for splitting SDFs in beam_fn_api.proto > > > Key: BEAM-8315 > URL: https://issues.apache.org/jira/browse/BEAM-8315 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook
[ https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=319068=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319068 ] ASF GitHub Bot logged work on BEAM-7760: Author: ASF GitHub Bot Created on: 26/Sep/19 16:34 Start Date: 26/Sep/19 16:34 Worklog Time Spent: 10m Work Description: KevinGG commented on issue #9619: [BEAM-7760] Added pipeline_instrument module URL: https://github.com/apache/beam/pull/9619#issuecomment-535585079 @aaltay Thanks for the quick response! I'm currently on call till next Tuesday. I'll get back to the PR as soon as I'm off duty. Thank you very much for the detailed review! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319068) Time Spent: 11h 20m (was: 11h 10m) > Interactive Beam Caching PCollections bound to user defined vars in notebook > > > Key: BEAM-7760 > URL: https://issues.apache.org/jira/browse/BEAM-7760 > Project: Beam > Issue Type: New Feature > Components: examples-python >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 11h 20m > Remaining Estimate: 0h > > Cache only PCollections bound to user-defined variables in a pipeline when > running the pipeline with the interactive runner in Jupyter notebooks. > [Interactive Beam|https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive] > has been caching and using caches of "leaf" PCollections for interactive > execution in Jupyter notebooks. > The interactive execution is currently supported so that when appending new > transforms to an existing pipeline for a new run, the already-executed part of the > pipeline doesn't need to be re-executed. > A PCollection is a "leaf" when it is never used as input to any PTransform in > the pipeline. 
> The problem with building the caches and the pipeline to execute around "leaf" > PCollections is that when a PCollection is consumed by a sink with no output, the > built pipeline to execute will miss the subgraph generating and consuming that > PCollection. > For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty > pipeline. > Caching around PCollections bound to user-defined variables and replacing > transforms with the source and sink of caches could resolve the pipeline to > execute properly under the interactive execution scenario. Also, a cached > PCollection can now be traced back to user code and can be used for user data > visualization if the user wants to do it. > E.g., > {code:python} > # ... > p = beam.Pipeline(interactive_runner.InteractiveRunner(), > options=pipeline_options) > messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...') > messages | "Write" >> beam.io.WriteToPubSub(topic_path) > result = p.run() > # ... > visualize(messages){code} > The interactive runner automatically figures out that the PCollection > {code:python} > messages{code} > created by > {code:python} > p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code} > should be cached and reused if the notebook user appends more transforms. > And once the pipeline gets executed, the user could use any > visualize(PCollection) module to visualize the data statically (batch) or > dynamically (stream) -- This message was sent by Atlassian Jira (v8.3.4#803005)
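A rough sketch of what "PCollections bound to user-defined variables" means (hypothetical helper and class names, not Interactive Beam's actual API): scan a user namespace and keep only the names that reference a PCollection.

```python
class PCollectionStub(object):
    """Stand-in for apache_beam.pvalue.PCollection in this sketch."""

    def __init__(self, label):
        self.label = label

def pcollections_bound_to_vars(namespace):
    """Return {variable name: PCollection} for user-defined bindings.

    Underscore-prefixed names are treated as non-user-defined here;
    the real heuristics in Interactive Beam may differ.
    """
    return {
        name: value
        for name, value in namespace.items()
        if isinstance(value, PCollectionStub) and not name.startswith('_')
    }

# Usage: only `messages` qualifies for caching; the underscore-prefixed
# binding and unrelated variables are skipped.
namespace = {
    'messages': PCollectionStub('Read'),
    '_hidden': PCollectionStub('internal'),
    'count': 42,
}
bound = pcollections_bound_to_vars(namespace)
```

Keying the cache on such bindings is what lets the runner replace already-computed subgraphs with cache reads even when a PCollection feeds only a sink.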
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319064=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319064 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:21 Start Date: 26/Sep/19 16:21 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() throws Exception {
  if (producer != null && producer.getOutstandingRecordsCount() > 0) {
    producer.flushSync();
  }
  producer = null;
}
```
and change `new UnsupportedOperationException` in `KinesisProducerMock.flushSync()` to just a `flush()` call. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319064) Time Spent: 4h (was: 3h 50m) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 4h > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319063=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319063 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:21 Start Date: 26/Sep/19 16:21 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() throws Exception {
  if (producer != null && producer.getOutstandingRecordsCount() > 0) {
    producer.flushSync();
  }
  producer = null;
}
```
and change `UnsupportedOperationException` in `KinesisProducerMock.flushSync()` to just a `flush()` call. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319063) Time Spent: 3h 50m (was: 3h 40m) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 3h 50m > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319062=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319062 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:19 Start Date: 26/Sep/19 16:19 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() throws Exception {
  if (producer != null && producer.getOutstandingRecordsCount() > 0) {
    producer.flushSync();
  }
  producer = null;
}
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319062) Time Spent: 3h 40m (was: 3.5h) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 3h 40m > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > 
org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319061=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319061 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:18 Start Date: 26/Sep/19 16:18 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() throws Exception {
  if (producer != null) {
    if (producer.getOutstandingRecordsCount() > 0) {
      producer.flushSync();
    }
    producer = null;
  }
}
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319061) Time Spent: 3.5h (was: 3h 20m) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 3.5h > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > 
org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319060=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319060 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:16 Start Date: 26/Sep/19 16:16 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() throws Exception {
  if (producer != null) {
    producer.flushSync();
    producer = null;
  }
}
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319060) Time Spent: 3h 20m (was: 3h 10m) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 3h 20m > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > 
org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319053=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319053 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:07 Start Date: 26/Sep/19 16:07 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() throws Exception {
  producer = null;
}
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319053) Time Spent: 2h 50m (was: 2h 40m) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 2h 50m > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > 
org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319055=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319055 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:07 Start Date: 26/Sep/19 16:07 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() {
  producer = null;
}
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319055) Time Spent: 3h 10m (was: 3h) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 3h 10m > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8300) KinesisIO.write causes NPE as the producer is null
[ https://issues.apache.org/jira/browse/BEAM-8300?focusedWorklogId=319054=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319054 ] ASF GitHub Bot logged work on BEAM-8300: Author: ASF GitHub Bot Created on: 26/Sep/19 16:07 Start Date: 26/Sep/19 16:07 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9640: [BEAM-8300]: KinesisIO.write throws NPE because producer is null URL: https://github.com/apache/beam/pull/9640#issuecomment-535574848 @jhalaria To fix the failed test, I think you need to add a `tearDown` method to `KinesisWriterFn`, like:
```
@Teardown
public void teardown() throws Exception {
  producer = null;
}
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 319054) Time Spent: 3h (was: 2h 50m) > KinesisIO.write causes NPE as the producer is null > -- > > Key: BEAM-8300 > URL: https://issues.apache.org/jira/browse/BEAM-8300 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.15.0 >Reporter: Ankit Jhalaria >Assignee: Ankit Jhalaria >Priority: Minor > Fix For: Not applicable > > Time Spent: 3h > Remaining Estimate: 0h > > While using KinesisIO.write(), we encountered a NPE with the following stack > trace > {code:java} > org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:297)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)\n\tat > > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)\n\tat > > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)\n\tat > > org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)\n\tat > > 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)\n\tat > org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)\n\tat > java.lang.Thread.run(Thread.java:748)\nCaused by: > java.lang.NullPointerException: null\n\tat > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.flushBundle(KinesisIO.java:685)\n\tat > > org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.finishBundle(KinesisIO.java:669){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319037&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319037 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:20 Start Date: 26/Sep/19 15:20 Worklog Time Spent: 10m Work Description: tweise commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328676138

File path: sdks/python/apache_beam/runners/worker/statecache.py (new file, @@ -0,0 +1,122 @@)

```python
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"""A module for caching state reads/writes in Beam applications."""
from __future__ import absolute_import

import collections
import logging
from threading import Lock


class StateCache(object):
  """Cache for Beam state access, scoped by state key and cache_token.

  For a given state_key, caches a (cache_token, value) tuple and allows to
  a) read from the cache,
     if the currently stored cache_token matches the provided
  b) write to the cache,
     storing the new value alongside with a cache token
  c) append to the cache,
     if the currently stored cache_token matches the provided

  The operations on the cache are thread-safe for use by multiple workers.

  :arg max_entries The maximum number of entries to store in the cache.
  TODO Memory-based caching: https://issues.apache.org/jira/browse/BEAM-8297
  """

  def __init__(self, max_entries):
    logging.info('Creating state cache with size %s', max_entries)
    self._cache = self.LRUCache(max_entries, (None, None))
    self._lock = Lock()

  def get(self, state_key, cache_token):
    assert cache_token and self.is_cache_enabled()
    with self._lock:
      token, value = self._cache.get(state_key)
      return value if token == cache_token else None

  def put(self, state_key, cache_token, value):
    assert cache_token and self.is_cache_enabled()
    with self._lock:
      return self._cache.put(state_key, (cache_token, value))

  def append(self, state_key, cache_token, elements):
    assert cache_token and self.is_cache_enabled()
    with self._lock:
      token, value = self._cache.get(state_key)
      if token in [cache_token, None]:
        if value is None:
          value = []
        value.extend(elements)
        self._cache.put(state_key, (cache_token, value))
      else:
        # Discard cached state if tokens do not match
        self._cache.evict(state_key)

  def clear(self, state_key, cache_token):
    assert cache_token and self.is_cache_enabled()
    with self._lock:
      token, _ = self._cache.get(state_key)
      if token in [cache_token, None]:
        self._cache.put(state_key, (cache_token, []))
      else:
        # Discard cached state if tokens do not match
        self._cache.evict(state_key)

  def evict(self, state_key):
    assert self.is_cache_enabled()
    with self._lock:
      self._cache.evict(state_key)

  def evict_all(self):
    with self._lock:
      self._cache.evict_all()

  def is_cache_enabled(self):
    return self._cache._max_entries > 0

  def __len__(self):
    return len(self._cache)

  class LRUCache(object):

    def __init__(self, max_entries, default_entry):
      self._max_entries = max_entries
      self._default_entry = default_entry
      self._cache = collections.OrderedDict()

    def get(self, key):
      value = self._cache.pop(key, self._default_entry)
      if value != self._default_entry:
        self._cache[key] = value
      return value

    def put(self, key, value):
      self._cache[key] = value
      while len(self._cache) > self._max_entries:
        self._cache.popitem(last=False)

    def evict(self, key):
      self._cache.pop(key, self._default_entry)

    def evict_all(self):
      self._cache.clear()
```

Review comment (on `evict_all`): nit: would `clear` be the better name for this?
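The nested LRUCache above is built on `collections.OrderedDict`: `get()` re-inserts a hit so it moves to the most-recently-used position, and `put()` trims from the front with `popitem(last=False)` once capacity is exceeded. A minimal standalone sketch of the same technique, assuming a capacity of two entries for illustration:

```python
import collections


class LRUCache:
    """Minimal LRU cache using the OrderedDict technique from the review:
    get() re-inserts the key to mark it most-recently-used, put() evicts
    the oldest entry (popitem(last=False)) once capacity is exceeded."""

    def __init__(self, max_entries):
        self._max_entries = max_entries
        self._cache = collections.OrderedDict()

    def get(self, key, default=None):
        if key not in self._cache:
            return default
        value = self._cache.pop(key)
        self._cache[key] = value  # move to most-recently-used position
        return value

    def put(self, key, value):
        self._cache[key] = value
        while len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)  # evict least-recently-used


cache = LRUCache(2)
cache.put('a', 1)
cache.put('b', 2)
cache.get('a')     # touch 'a', so 'b' is now least recently used
cache.put('c', 3)  # exceeds capacity: evicts 'b', not 'a'
```

The same re-insert-on-read trick is what makes the Beam cache's `get` an O(1) recency update.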
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319036&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319036 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:16 Start Date: 26/Sep/19 15:16 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328673772

File path: sdks/python/apache_beam/runners/worker/bundle_processor.py

```diff
@@ -430,13 +424,13 @@ def clear(self):
   def _commit(self):
     if self._cleared:
-      self._state_handler.blocking_clear(self._state_key)
+      self._state_handler.clear(self._state_key, is_cached=True).get()
     if self._added_elements:
-      value_coder_impl = self._value_coder.get_impl()
-      out = coder_impl.create_OutputStream()
-      for element in self._added_elements:
-        value_coder_impl.encode_to_stream(element, out, True)
-      self._state_handler.blocking_append(self._state_key, out.get())
+      self._state_handler.append(
```

Review comment: Will consolidate with the above.

Issue Time Tracking --- Worklog Id: (was: 319036) Time Spent: 23h 20m (was: 23h 10m)
> Implement cross-bundle state caching.
> - > > Key: BEAM-5428 > URL: https://issues.apache.org/jira/browse/BEAM-5428 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: Robert Bradshaw >Assignee: Maximilian Michels >Priority: Major > Time Spent: 23h 20m > Remaining Estimate: 0h > > Tech spec: > [https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m] > Relevant document: > [https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit#|https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit] > Mailing list link: > [https://lists.apache.org/thread.html/caa8d9bc6ca871d13de2c5e6ba07fdc76f85d26497d95d90893aa1f6@%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319033&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319033 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:14 Start Date: 26/Sep/19 15:14 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328673182

File path: sdks/python/apache_beam/runners/worker/sdk_worker_main.py

```diff
@@ -205,6 +206,28 @@ def _get_worker_count(pipeline_options):
   return 12


+def _get_state_cache_size(pipeline_options):
+  """Defines the upper number of state items to cache.
+
+  Note: state_cache_size is an experimental flag and might not be available in
+  future releases.
+
+  Returns:
+    an int indicating the maximum number of items to cache.
+    Default is 0 (disabled)
+  """
+  experiments = pipeline_options.view_as(DebugOptions).experiments
+  experiments = experiments if experiments else []
+
+  for experiment in experiments:
+    # There should only be 1 match so returning from the loop
+    if re.match(r'state_cache_size=', experiment):
+      return int(
+          re.match(r'state_cache_size=(?P<state_cache_size>.*)',
+                   experiment).group('state_cache_size'))
+  return 100
```

Review comment: Good catch, this was actually pending in my branch.

Issue Time Tracking --- Worklog Id: (was: 319033) Time Spent: 23h 10m (was: 23h)
> Implement cross-bundle state caching.
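The experiment-flag lookup above can be exercised in isolation. A sketch of the same regex-based parsing (the standalone function name is an assumption here, and the default of 0 reflects the value under discussion in the review):

```python
import re


def get_state_cache_size(experiments, default=0):
    """Parse 'state_cache_size=N' out of an experiments list, mirroring the
    lookup in sdk_worker_main; returns `default` when the flag is absent."""
    for experiment in experiments or []:
        # There should only be one match, so return from inside the loop.
        match = re.match(r'state_cache_size=(?P<state_cache_size>.*)', experiment)
        if match:
            return int(match.group('state_cache_size'))
    return default
```

Using `or []` keeps the function safe when `experiments` is None, which matches the `experiments if experiments else []` guard in the diff.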
[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python
[ https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=319032&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319032 ] ASF GitHub Bot logged work on BEAM-3342: Author: ASF GitHub Bot Created on: 26/Sep/19 15:14 Start Date: 26/Sep/19 15:14 Worklog Time Spent: 10m Work Description: chamikaramj commented on pull request #8457: [BEAM-3342] Create a Cloud Bigtable IO connector for Python URL: https://github.com/apache/beam/pull/8457#discussion_r327778002

File path: sdks/python/apache_beam/io/gcp/bigtableio.py

```diff
@@ -122,22 +126,145 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
   """

-  def __init__(self, project_id=None, instance_id=None,
-               table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
     """ The PTransform to access the Bigtable Write connector

     Args:
       project_id(str): GCP Project of to write the Rows
       instance_id(str): GCP Instance to write the Rows
       table_id(str): GCP Table to write the `DirectRows`
     """
     super(WriteToBigTable, self).__init__()
-    self.beam_options = {'project_id': project_id,
-                         'instance_id': instance_id,
-                         'table_id': table_id}
+    self._beam_options = {'project_id': project_id,
+                          'instance_id': instance_id,
+                          'table_id': table_id}

   def expand(self, pvalue):
-    beam_options = self.beam_options
+    beam_options = self._beam_options
     return (pvalue
             | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
                                           beam_options['instance_id'],
                                           beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+    project_id(str): GCP Project ID
+    instance_id(str): GCP Instance ID
+    table_id(str): GCP Table ID
+  """
+
+  def __init__(self, project_id, instance_id, table_id, filter_=b''):
+    """ Constructor of the Read connector of Bigtable
+
+    Args:
+      project_id: [str] GCP Project of to write the Rows
+      instance_id: [str] GCP Instance to write the Rows
+      table_id: [str] GCP Table to write the `DirectRows`
+      filter_: [RowFilter] Filter to apply to columns in a row.
+    """
+    super(self.__class__, self).__init__()
+    self._initialize({'project_id': project_id,
+                      'instance_id': instance_id,
+                      'table_id': table_id,
+                      'filter_': filter_})
+
+  def __getstate__(self):
+    return self._beam_options
+
+  def __setstate__(self, options):
+    self._initialize(options)
+
+  def _initialize(self, options):
+    self._beam_options = options
+    self.table = None
+    self.sample_row_keys = None
+    self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
+
+  def start_bundle(self):
+    if self.table is None:
+      self.table = Client(project=self._beam_options['project_id'])\
```

Review comment: nit: PEP 8 recommends using parenthesis for formatting instead of backslashes.

Issue Time Tracking --- Worklog Id: (was: 319032) Time Spent: 40h 40m (was: 40.5h)
> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Solomon Duskis
> Assignee: Solomon Duskis
> Priority: Major
> Time Spent: 40h 40m
> Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
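The `_initialize`/`__getstate__`/`__setstate__` trio in the diff above keeps only the plain options dict picklable, while per-bundle resources (the table client, metrics) are dropped in transit and rebuilt on the worker. A standalone sketch of that serialization pattern (class and attribute names are hypothetical, and a tuple stands in for the real Bigtable client):

```python
import pickle


class ReadFnSketch:
    """Only the options dict survives pickling; the connection-like
    attribute is reset to None and lazily re-created per bundle."""

    def __init__(self, project_id, table_id):
        self._initialize({'project_id': project_id, 'table_id': table_id})

    def _initialize(self, options):
        self._options = options
        self.table = None  # re-created in start_bundle, never pickled

    def __getstate__(self):
        # Pickle only the serializable configuration.
        return self._options

    def __setstate__(self, options):
        # Rebuild transient state from the configuration on the worker.
        self._initialize(options)

    def start_bundle(self):
        if self.table is None:
            # Stand-in for constructing the real client connection.
            self.table = ('connection', self._options['table_id'])


fn = ReadFnSketch('my-project', 'my-table')
fn.start_bundle()
clone = pickle.loads(pickle.dumps(fn))  # the open "connection" is dropped
```

This is why the DoFn can hold an unpicklable client: the client never travels through serialization, only the options that recreate it do.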
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319031&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319031 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:14 Start Date: 26/Sep/19 15:14 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328672702

File path: sdks/python/apache_beam/runners/worker/bundle_processor.py

```diff
@@ -430,13 +424,13 @@ def clear(self):
   def _commit(self):
     if self._cleared:
-      self._state_handler.blocking_clear(self._state_key)
+      self._state_handler.clear(self._state_key, is_cached=True).get()
```

Review comment: Yes, and no. We need to wait for the last response here. For example, if there is no following append, we'll have to wait on the clear. So it is best to save the last returned future in a variable and call `.get()` on it here.

Issue Time Tracking --- Worklog Id: (was: 319031) Time Spent: 23h (was: 22h 50m)
> Implement cross-bundle state caching.
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319030&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319030 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:13 Start Date: 26/Sep/19 15:13 Worklog Time Spent: 10m Work Description: tweise commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328672482

File path: sdks/python/apache_beam/runners/worker/sdk_worker_main.py

```diff
+def _get_state_cache_size(pipeline_options):
+  """Defines the upper number of state items to cache.
+
+  Note: state_cache_size is an experimental flag and might not be available in
+  future releases.
+
+  Returns:
+    an int indicating the maximum number of items to cache.
+    Default is 0 (disabled)
+  """
+  experiments = pipeline_options.view_as(DebugOptions).experiments
+  experiments = experiments if experiments else []
+
+  for experiment in experiments:
+    # There should only be 1 match so returning from the loop
+    if re.match(r'state_cache_size=', experiment):
+      return int(
+          re.match(r'state_cache_size=(?P<state_cache_size>.*)',
+                   experiment).group('state_cache_size'))
+  return 100
```

Review comment (on the final `return 100`): Should be 0 (the default)?

Issue Time Tracking --- Worklog Id: (was: 319030) Time Spent: 22h 50m (was: 22h 40m)
> Implement cross-bundle state caching.
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319029&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319029 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:12 Start Date: 26/Sep/19 15:12 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328671827

File path: sdks/python/apache_beam/runners/worker/bundle_processor.py

```diff
@@ -199,26 +199,19 @@ def finish(self):
 class _StateBackedIterable(object):

-  def __init__(self, state_handler, state_key, coder_or_impl):
+  def __init__(self, state_handler, state_key, coder_or_impl,
+               is_cached=False):
     self._state_handler = state_handler
     self._state_key = state_key
     if isinstance(coder_or_impl, coders.Coder):
       self._coder_impl = coder_or_impl.get_impl()
     else:
       self._coder_impl = coder_or_impl
+    self._is_cached = is_cached

   def __iter__(self):
-    # This is the continuation token this might be useful
-    data, continuation_token = self._state_handler.blocking_get(self._state_key)
-    while True:
-      input_stream = coder_impl.create_InputStream(data)
-      while input_stream.size() > 0:
-        yield self._coder_impl.decode_from_stream(input_stream, True)
-      if not continuation_token:
-        break
-      else:
-        data, continuation_token = self._state_handler.blocking_get(
-            self._state_key, continuation_token)
+    return self._state_handler.blocking_get(
```

Review comment: Could you elaborate? The continuation token is respected as before.

Issue Time Tracking --- Worklog Id: (was: 319029) Time Spent: 22h 40m (was: 22.5h)
> Implement cross-bundle state caching.
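The loop removed by the diff above is the standard continuation-token paging pattern: keep re-requesting with the returned token until the backend stops handing one back. Sketched standalone (the paged backend here is simulated, not the Fn API state protocol):

```python
def make_backend(pages):
    """Simulated backend: returns (data, continuation_token), where the token
    is the index of the next page or None when exhausted (hypothetical API)."""
    def blocking_get(token=0):
        data = pages[token]
        next_token = token + 1 if token + 1 < len(pages) else None
        return data, next_token
    return blocking_get


def iterate_state(blocking_get):
    """Yield every element across pages, following continuation tokens."""
    data, token = blocking_get()
    while True:
        for element in data:
            yield element
        if token is None:
            break
        # Re-request with the token the backend handed back.
        data, token = blocking_get(token)


elements = list(iterate_state(make_backend([[1, 2], [3], [4, 5]])))
```

Delegating `__iter__` to the state handler, as the diff does, moves exactly this loop behind the handler's API rather than eliminating it.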
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319028&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319028 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:12 Start Date: 26/Sep/19 15:12 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328671811

File path: sdks/python/apache_beam/runners/portability/portable_runner_test.py

```diff
@@ -201,7 +204,8 @@ class PortableRunnerTestWithExternalEnv(PortableRunnerTest):
   @classmethod
   def setUpClass(cls):
     cls._worker_address, cls._worker_server = (
-        worker_pool_main.BeamFnExternalWorkerPoolServicer.start())
+        worker_pool_main.BeamFnExternalWorkerPoolServicer.start(
```

Review comment: Will look into this.

Issue Time Tracking --- Worklog Id: (was: 319028) Time Spent: 22.5h (was: 22h 20m)
> Implement cross-bundle state caching.
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319027&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319027 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:12 Start Date: 26/Sep/19 15:12 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328671668

File path: sdks/python/apache_beam/runners/portability/fn_api_runner.py

```diff
@@ -1412,11 +1470,13 @@ def stop_worker(self):
 class WorkerHandlerManager(object):

-  def __init__(self, environments, job_provision_info):
+  def __init__(self, environments, job_provision_info, state_cache_size):
     self._environments = environments
     self._job_provision_info = job_provision_info
     self._cached_handlers = collections.defaultdict(list)
-    self._state = FnApiRunner.StateServicer()  # rename?
+    self._state = sdk_worker.CachingMaterializingStateHandler(
+        StateCache(state_cache_size),
```

Review comment: Good catch. This does not seem to be necessary. Let me revisit.

Issue Time Tracking --- Worklog Id: (was: 319027) Time Spent: 22h 20m (was: 22h 10m)
> Implement cross-bundle state caching.
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319009&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319009 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:01 Start Date: 26/Sep/19 15:01 Worklog Time Spent: 10m Work Description: tweise commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328665599

File path: sdks/python/apache_beam/runners/worker/bundle_processor.py

```diff
@@ -430,13 +424,13 @@ def clear(self):
   def _commit(self):
     if self._cleared:
-      self._state_handler.blocking_clear(self._state_key)
+      self._state_handler.clear(self._state_key, is_cached=True).get()
     if self._added_elements:
-      value_coder_impl = self._value_coder.get_impl()
-      out = coder_impl.create_OutputStream()
-      for element in self._added_elements:
-        value_coder_impl.encode_to_stream(element, out, True)
-      self._state_handler.blocking_append(self._state_key, out.get())
+      self._state_handler.append(
```

Review comment: An explicit comment regarding the need for the blocking call might be useful here.

Issue Time Tracking --- Worklog Id: (was: 319009) Time Spent: 22h 10m (was: 22h)
> Implement cross-bundle state caching.
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319008&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319008 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 15:00 Start Date: 26/Sep/19 15:00 Worklog Time Spent: 10m Work Description: tweise commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328665061

File path: sdks/python/apache_beam/runners/worker/bundle_processor.py

```diff
   def _commit(self):
     if self._cleared:
-      self._state_handler.blocking_clear(self._state_key)
+      self._state_handler.clear(self._state_key, is_cached=True).get()
```

Review comment: Is this the call where we don't need to wait for a response?

Issue Time Tracking --- Worklog Id: (was: 319008) Time Spent: 22h (was: 21h 50m)
> Implement cross-bundle state caching.
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=319002&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-319002 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 14:50 Start Date: 26/Sep/19 14:50 Worklog Time Spent: 10m Work Description: tweise commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328659068

File path: sdks/python/apache_beam/runners/worker/bundle_processor.py

```diff
   def __iter__(self):
-    # This is the continuation token this might be useful
-    data, continuation_token = self._state_handler.blocking_get(self._state_key)
-    while True:
-      input_stream = coder_impl.create_InputStream(data)
-      while input_stream.size() > 0:
-        yield self._coder_impl.decode_from_stream(input_stream, True)
-      if not continuation_token:
-        break
-      else:
-        data, continuation_token = self._state_handler.blocking_get(
-            self._state_key, continuation_token)
+    return self._state_handler.blocking_get(
```

Review comment: Is there a JIRA for the continuation token that can be referenced here?

Issue Time Tracking --- Worklog Id: (was: 319002) Time Spent: 21h 50m (was: 21h 40m)
> Implement cross-bundle state caching.
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=318997=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318997 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 14:48 Start Date: 26/Sep/19 14:48 Worklog Time Spent: 10m Work Description: tweise commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328658157
## File path: sdks/python/apache_beam/runners/portability/portable_runner_test.py
@@ -201,7 +204,8 @@ class PortableRunnerTestWithExternalEnv(PortableRunnerTest):
   @classmethod
   def setUpClass(cls):
     cls._worker_address, cls._worker_server = (
-        worker_pool_main.BeamFnExternalWorkerPoolServicer.start())
+        worker_pool_main.BeamFnExternalWorkerPoolServicer.start(
Review comment: Logically this duplicates the experiment flag above. I can see why that is currently necessary, but maybe it would be better to provide the pipeline options to the worker pool servicer? (Inside the container this would happen via the provisioning endpoint and environment.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 318997) Time Spent: 21h 40m (was: 21.5h) > Implement cross-bundle state caching. 
> - > > Key: BEAM-5428 > URL: https://issues.apache.org/jira/browse/BEAM-5428 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: Robert Bradshaw >Assignee: Maximilian Michels >Priority: Major > Time Spent: 21h 40m > Remaining Estimate: 0h > > Tech spec: > [https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m] > Relevant document: > [https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit#|https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit] > Mailing list link: > [https://lists.apache.org/thread.html/caa8d9bc6ca871d13de2c5e6ba07fdc76f85d26497d95d90893aa1f6@%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-5428) Implement cross-bundle state caching.
[ https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=318994=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318994 ] ASF GitHub Bot logged work on BEAM-5428: Author: ASF GitHub Bot Created on: 26/Sep/19 14:37 Start Date: 26/Sep/19 14:37 Worklog Time Spent: 10m Work Description: tweise commented on pull request #9418: [BEAM-5428] Implement cross-bundle user state caching in the Python SDK URL: https://github.com/apache/beam/pull/9418#discussion_r328651322
## File path: sdks/python/apache_beam/runners/portability/fn_api_runner.py
@@ -1412,11 +1470,13 @@ def stop_worker(self):
 class WorkerHandlerManager(object):
 
-  def __init__(self, environments, job_provision_info):
+  def __init__(self, environments, job_provision_info, state_cache_size):
     self._environments = environments
     self._job_provision_info = job_provision_info
     self._cached_handlers = collections.defaultdict(list)
-    self._state = FnApiRunner.StateServicer()  # rename?
+    self._state = sdk_worker.CachingMaterializingStateHandler(
+        StateCache(state_cache_size),
Review comment: For my learning, why is the cache required inside the fn_api_runner vs. on the SDK side? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 318994) Time Spent: 21.5h (was: 21h 20m) > Implement cross-bundle state caching. 
> - > > Key: BEAM-5428 > URL: https://issues.apache.org/jira/browse/BEAM-5428 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: Robert Bradshaw >Assignee: Maximilian Michels >Priority: Major > Time Spent: 21.5h > Remaining Estimate: 0h > > Tech spec: > [https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m] > Relevant document: > [https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit#|https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit] > Mailing list link: > [https://lists.apache.org/thread.html/caa8d9bc6ca871d13de2c5e6ba07fdc76f85d26497d95d90893aa1f6@%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
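Conceptually, the `StateCache(state_cache_size)` wired into the handler above is a size-bounded, recency-evicting map from state keys to materialized values. A minimal LRU sketch of that idea (illustrative only, not Beam's actual `StateCache` implementation):

```python
from collections import OrderedDict


class StateCache:
    """Minimal sketch of a size-bounded, LRU-evicting state cache.
    Names and behavior are illustrative, not Beam's actual class."""

    def __init__(self, max_entries):
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, state_key):
        if state_key in self._entries:
            self._entries.move_to_end(state_key)  # mark as recently used
            return self._entries[state_key]
        return None  # cache miss: caller falls back to the state channel

    def put(self, state_key, value):
        self._entries[state_key] = value
        self._entries.move_to_end(state_key)
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```

Cross-bundle caching means entries survive between bundles, so repeated reads of the same user state key avoid a round trip to the runner until the entry is evicted or invalidated by a write.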
[jira] [Work logged] (BEAM-8306) improve estimation of data byte size reading from source in ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-8306?focusedWorklogId=318943=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318943 ] ASF GitHub Bot logged work on BEAM-8306: Author: ASF GitHub Bot Created on: 26/Sep/19 12:59 Start Date: 26/Sep/19 12:59 Worklog Time Spent: 10m Work Description: iemejia commented on issue #9660: [BEAM-8306] improve estimation datasize elasticsearch io URL: https://github.com/apache/beam/pull/9660#issuecomment-535490565 R: @echauchot @timrobertson100 Could one of you PTAL at this one? Looks like an interesting improvement. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 318943) Remaining Estimate: 0h Time Spent: 10m > improve estimation of data byte size reading from source in ElasticsearchIO > --- > > Key: BEAM-8306 > URL: https://issues.apache.org/jira/browse/BEAM-8306 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Affects Versions: 2.14.0 >Reporter: Derek He >Assignee: Derek He >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > ElasticsearchIO splits the BoundedSource based on the Elasticsearch index size. > We expect it would be more accurate to split based on the query result size. > Currently, we have a big Elasticsearch index, but the query result only > contains a few of the documents in the index. ElasticsearchIO splits it into up > to 1024 BoundedSources in Google Dataflow, and it takes a long time to finish > processing the small number of Elasticsearch documents in Google Dataflow. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
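The proposed improvement amounts to deriving the split count from the estimated size of the query result rather than the whole index. A hedged sketch of that arithmetic (the 1024 cap comes from the report; the function name and `desired_bundle_bytes` parameter are assumptions for illustration):

```python
def estimate_split_count(query_result_bytes, desired_bundle_bytes,
                         max_splits=1024):
    """Hypothetical sketch: derive a BoundedSource split count from the
    estimated query result size instead of the full index size."""
    if query_result_bytes <= 0:
        return 1  # nothing (or nothing measurable) to read: one source
    # Ceiling division: enough splits that each holds about one bundle.
    splits = -(-query_result_bytes // desired_bundle_bytes)
    return max(1, min(splits, max_splits))
```

With index-size-based estimation, a tiny query over a huge index still fans out to 1024 sources, most of them empty; sizing by the query result keeps the split count proportional to the work actually available.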
[jira] [Assigned] (BEAM-8212) StatefulParDoFn creates GC timers for every record
[ https://issues.apache.org/jira/browse/BEAM-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía reassigned BEAM-8212: -- Assignee: (was: Aizhamal Nurmamat kyzy) > StatefulParDoFn creates GC timers for every record > --- > > Key: BEAM-8212 > URL: https://issues.apache.org/jira/browse/BEAM-8212 > Project: Beam > Issue Type: Bug > Components: beam-community >Reporter: Akshay Iyangar >Priority: Major > > Hi > So currently StatefulParDoFn creates timers for all the records. > [https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/StatefulDoFnRunner.java#L211] > This becomes a problem if you are using GlobalWindows for streaming, where > these timers get created and never get closed since the window will never > close. > This is a problem especially if you are memory-bound in RocksDB, where these > timers take up space and slow the pipelines considerably. > Was wondering: if the pipeline runs in global windows, should we avoid > adding timers to it at all? > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
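The report suggests skipping the GC timer when the window is the global window, since such a timer could only fire at end-of-time and the window never closes. A minimal sketch of that guard (names are illustrative, not the actual `StatefulDoFnRunner` API):

```python
# Stand-in for the global window's maximum timestamp ("end of time").
GLOBAL_WINDOW_MAX_TIMESTAMP = float('inf')


def maybe_set_cleanup_timer(window_end, set_timer):
    """Sketch of the proposed guard: skip GC timers that can never
    usefully fire, so they never accumulate in timer state."""
    if window_end >= GLOBAL_WINDOW_MAX_TIMESTAMP:
        return False  # global window: timer would never fire, skip it
    set_timer(window_end + 1)  # fire just after the window closes
    return True
```

Under GlobalWindows every record would otherwise register a timer that never fires, which is exactly the unbounded timer-state growth the report describes for RocksDB-backed pipelines.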
[jira] [Updated] (BEAM-8212) StatefulParDoFn creates GC timers for every record
[ https://issues.apache.org/jira/browse/BEAM-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-8212: --- Component/s: (was: beam-community) runner-core > StatefulParDoFn creates GC timers for every record > --- > > Key: BEAM-8212 > URL: https://issues.apache.org/jira/browse/BEAM-8212 > Project: Beam > Issue Type: Bug > Components: runner-core >Reporter: Akshay Iyangar >Priority: Major > > Hi > So currently StatefulParDoFn creates timers for all the records. > [https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/StatefulDoFnRunner.java#L211] > This becomes a problem if you are using GlobalWindows for streaming, where > these timers get created and never get closed since the window will never > close. > This is a problem especially if you are memory-bound in RocksDB, where these > timers take up space and slow the pipelines considerably. > Was wondering: if the pipeline runs in global windows, should we avoid > adding timers to it at all? > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8312) Flink portable pipeline jars do not need to stage artifacts remotely
[ https://issues.apache.org/jira/browse/BEAM-8312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-8312: --- Status: Open (was: Triage Needed) > Flink portable pipeline jars do not need to stage artifacts remotely > > > Key: BEAM-8312 > URL: https://issues.apache.org/jira/browse/BEAM-8312 > Project: Beam > Issue Type: Improvement > Components: runner-flink >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Labels: portability-flink > > Currently, Flink job jars re-stage all artifacts at runtime (on the > JobManager) by using the usual BeamFileSystemArtifactRetrievalService [1]. > However, since the manifest and all the artifacts live on the classpath of > the jar, and everything from the classpath is copied to the Flink workers > anyway [2], it should not be necessary to do additional artifact staging. We > could replace BeamFileSystemArtifactRetrievalService in this case with a > simple ArtifactRetrievalService that just pulls the artifacts from the > classpath. > > [1] > [https://github.com/apache/beam/blob/340c3202b1e5824b959f5f9f626e4c7c7842a3cb/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/artifact/BeamFileSystemArtifactRetrievalService.java] > [2] > [https://github.com/apache/beam/blob/2f1b56ccc506054e40afe4793a8b556e872e1865/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkExecutionEnvironments.java#L93] -- This message was sent by Atlassian Jira (v8.3.4#803005)
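The idea above — serve artifacts from resources already shipped inside the jar instead of re-staging them remotely — can be sketched language-neutrally. In the following sketch a local directory plus a JSON manifest stands in for the jar's classpath resources; all class and file names are hypothetical, not Beam's actual `ArtifactRetrievalService` interface:

```python
import json
import os


class ClasspathArtifactRetrievalService:
    """Sketch: serve artifacts straight from files bundled with the job.
    A directory stands in for the jar's classpath; MANIFEST.json maps
    artifact names to relative paths (both names are assumptions)."""

    def __init__(self, resource_dir):
        self._resource_dir = resource_dir
        with open(os.path.join(resource_dir, 'MANIFEST.json')) as f:
            self._manifest = json.load(f)  # artifact name -> relative path

    def get_manifest(self):
        return sorted(self._manifest)

    def get_artifact(self, name):
        # No remote staging: the bytes are already local to every worker,
        # because the runner ships the whole classpath to them anyway.
        path = os.path.join(self._resource_dir, self._manifest[name])
        with open(path, 'rb') as f:
            return f.read()
```

This mirrors the ticket's argument: since Flink already copies the jar's classpath to every worker, a retrieval service only needs local reads, not a second staging round trip.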
[jira] [Updated] (BEAM-8146) SchemaCoder/RowCoder have no equals() function
[ https://issues.apache.org/jira/browse/BEAM-8146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-8146: --- Status: Open (was: Triage Needed) > SchemaCoder/RowCoder have no equals() function > -- > > Key: BEAM-8146 > URL: https://issues.apache.org/jira/browse/BEAM-8146 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Affects Versions: 2.15.0 >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > SchemaCoder has no equals function, so it can't be compared in tests, like > CloudComponentsTests$DefaultCoders, which is being re-enabled in > https://github.com/apache/beam/pull/9446 -- This message was sent by Atlassian Jira (v8.3.4#803005)
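The fix amounts to giving the coder structural equality: two coders built from equal schemas should compare (and hash) equal, so tests like CloudComponentsTests$DefaultCoders can assert on them. An illustrative sketch of that pattern (not Beam's actual SchemaCoder, which is Java):

```python
class SchemaCoderSketch:
    """Illustrative coder wrapper with value-based equality: instances
    built from equal schemas compare equal. The class and attribute
    names are assumptions for illustration."""

    def __init__(self, schema):
        self._schema = schema

    def __eq__(self, other):
        # Compare by type and schema, not by object identity.
        return (type(self) is type(other)
                and self._schema == other._schema)

    def __hash__(self):
        # Equal objects must hash equal, so hash the same fields.
        return hash((type(self), self._schema))
```

Without this, independently constructed but logically identical coders fail equality assertions, which is exactly why the test in the linked PR could not compare them.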
[jira] [Updated] (BEAM-8275) Beam SQL should support BigQuery in DIRECT_READ mode
[ https://issues.apache.org/jira/browse/BEAM-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-8275: --- Status: Open (was: Triage Needed) > Beam SQL should support BigQuery in DIRECT_READ mode > > > Key: BEAM-8275 > URL: https://issues.apache.org/jira/browse/BEAM-8275 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Andrew Pilloud >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > SQL currently only supports reading from BigQuery in DEFAULT (EXPORT) mode. > We also need to support DIRECT_READ mode. The method should be configurable > by TBLPROPERTIES through the SQL CLI. This will enable us to take advantage > of the DIRECT_READ features for filter and project push down. > References: > [https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api] > [https://beam.apache.org/blog/2019/06/04/adding-data-sources-to-sql.html] > [https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java] -- This message was sent by Atlassian Jira (v8.3.4#803005)
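Making the read method configurable through TBLPROPERTIES could look roughly like the following; the `method` property key and the accepted values are assumptions for illustration, not Beam SQL's actual configuration surface:

```python
import json


def bigquery_read_method(tblproperties_json, default='EXPORT'):
    """Hypothetical sketch: pick the BigQuery read mode from a table's
    TBLPROPERTIES blob; the 'method' key name is an assumption."""
    props = json.loads(tblproperties_json or '{}')
    method = props.get('method', default).upper()
    if method not in ('EXPORT', 'DIRECT_READ'):
        raise ValueError('unsupported read method: %s' % method)
    return method
```

The point of DIRECT_READ (the BigQuery Storage API) is that filter and projection push-down can then be applied at the source, instead of exporting the whole table first.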
[jira] [Assigned] (BEAM-8306) improve estimation of data byte size reading from source in ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía reassigned BEAM-8306: -- Assignee: Derek He > improve estimation of data byte size reading from source in ElasticsearchIO > --- > > Key: BEAM-8306 > URL: https://issues.apache.org/jira/browse/BEAM-8306 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Affects Versions: 2.14.0 >Reporter: Derek He >Assignee: Derek He >Priority: Major > > ElasticsearchIO splits the BoundedSource based on the Elasticsearch index size. > We expect it would be more accurate to split based on the query result size. > Currently, we have a big Elasticsearch index, but the query result only > contains a few of the documents in the index. ElasticsearchIO splits it into up > to 1024 BoundedSources in Google Dataflow, and it takes a long time to finish > processing the small number of Elasticsearch documents in Google Dataflow. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8306) improve estimation of data byte size reading from source in ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-8306: --- Status: Open (was: Triage Needed) > improve estimation of data byte size reading from source in ElasticsearchIO > --- > > Key: BEAM-8306 > URL: https://issues.apache.org/jira/browse/BEAM-8306 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Affects Versions: 2.14.0 >Reporter: Derek He >Priority: Major > > ElasticsearchIO splits the BoundedSource based on the Elasticsearch index size. > We expect it would be more accurate to split based on the query result size. > Currently, we have a big Elasticsearch index, but the query result only > contains a few of the documents in the index. ElasticsearchIO splits it into up > to 1024 BoundedSources in Google Dataflow, and it takes a long time to finish > processing the small number of Elasticsearch documents in Google Dataflow. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8303) Filesystems not properly registered using FileIO.write()
[ https://issues.apache.org/jira/browse/BEAM-8303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-8303: --- Status: Open (was: Triage Needed) > Filesystems not properly registered using FileIO.write() > > > Key: BEAM-8303 > URL: https://issues.apache.org/jira/browse/BEAM-8303 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Affects Versions: 2.15.0 >Reporter: Preston Koprivica >Assignee: Maximilian Michels >Priority: Major > > I’m getting the following error when attempting to use the FileIO apis > (beam-2.15.0) and integrating with AWS S3. I have setup the PipelineOptions > with all the relevant AWS options, so the filesystem registry **should** be > properly seeded by the time the graph is compiled and executed: > {code:java} > java.lang.IllegalArgumentException: No filesystem found for scheme s3 > at > org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:456) > at > org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:526) > at > org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1149) > at > org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1105) > at org.apache.beam.sdk.coders.Coder.decode(Coder.java:159) > at > org.apache.beam.sdk.transforms.join.UnionCoder.decode(UnionCoder.java:83) > at > org.apache.beam.sdk.transforms.join.UnionCoder.decode(UnionCoder.java:32) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:543) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:534) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:480) > at > org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.deserialize(CoderTypeSerializer.java:93) > at > org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55) > at > 
org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:106) > at > org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:72) > at > org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47) > at > org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73) > at > org.apache.flink.runtime.operators.FlatMapDriver.run(FlatMapDriver.java:107) > at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:503) > at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:368) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) > at java.lang.Thread.run(Thread.java:748) > {code} > For reference, the write code resembles this: > {code:java} > FileIO.Write write = FileIO.write() > .via(ParquetIO.sink(schema)) > .to(options.getOutputDir()). // will be something like: > s3:/// > .withSuffix(".parquet"); > records.apply(String.format("Write(%s)", options.getOutputDir()), > write);{code} > The issue does not appear to be related to ParquetIO.sink(). I am able to > reliably reproduce the issue using JSON formatted records and TextIO.sink(), > as well. Moreover, AvroIO is affected if withWindowedWrites() option is > added. > Just trying some different knobs, I went ahead and set the following option: > {code:java} > write = write.withNoSpilling();{code} > This actually seemed to fix the issue, only to have it reemerge as I scaled > up the data set size. 
The stack trace, while very similar, reads: > {code:java} > java.lang.IllegalArgumentException: No filesystem found for scheme s3 > at > org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:456) > at > org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:526) > at > org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1149) > at > org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1105) > at org.apache.beam.sdk.coders.Coder.decode(Coder.java:159) > at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:82) > at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:36) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:543) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:534) > at > org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:480) >
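The failure above is a scheme-to-filesystem registry miss: the registry was seeded with the AWS options on the submitting JVM, but not in the Flink worker processes that later deserialize the FileResult coder. A toy sketch of such a registry (hypothetical names, not Beam's actual FileSystems class) reproduces the error shape:

```python
class FileSystemRegistry:
    """Sketch of a process-wide scheme -> filesystem registry. The
    lookup below fails exactly like the stack trace when registration
    never ran in the process doing the decoding."""

    _registry = {}

    @classmethod
    def register(cls, scheme, filesystem):
        cls._registry[scheme] = filesystem

    @classmethod
    def get_filesystem(cls, path):
        scheme = path.split('://', 1)[0]
        try:
            return cls._registry[scheme]
        except KeyError:
            raise ValueError('No filesystem found for scheme %s' % scheme)
```

Because registration is per-process state, it must happen on every worker before any coder decodes a file path, which is why the error surfaces only once decoding moves onto the Flink task managers.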