Re: Configuring file-based transforms with different options

2018-03-08 Thread Robert Bradshaw
On Thu, Mar 8, 2018 at 9:38 PM Eugene Kirpichov wrote: > I think it may have been an API design mistake to put the S3 region into > PipelineOptions. > +1, IMHO it's generally a mistake to put any transform configuration into PipelineOptions for exactly this reason. > PipelineOptions are global

Re: Configuring file-based transforms with different options

2018-03-08 Thread Eugene Kirpichov
I think it may have been an API design mistake to put the S3 region into PipelineOptions. PipelineOptions are global per pipeline, whereas it's totally reasonable to access S3 files in different regions even from the code of a single DoFn running on a single element. The same applies to "setS3Stora

Re: Configuring file-based transforms with different options

2018-03-08 Thread Romain Manni-Bucau
The "hint" would probably be to use hints :) - indeed, this joke refers to the hint thread. Long story short, with hints you should be able to say "use that specialized config here". Now, personally, I'd like to see a way to specialize config per transform. With a hint, an easy way is to use a prefix:

Re: Gradle status

2018-03-08 Thread Romain Manni-Bucau
Hi Kenneth, For now Maven covers the full needs of Beam. If we start to have this kind of PR we become dependent on the 2 builds, which is what this thread is about avoiding, so I'm tempted to say it must be a PR that drops Maven completely or nothing, as mentioned before. On 9 March 2018 04:48, "Kenneth Know

Re: Gradle status

2018-03-08 Thread Kenneth Knowles
I would like to briefly re-focus this discussion and suggest that we merge https://github.com/apache/beam/pull/4814. The only material objection I've heard is that it means the precommit no longer tests exactly what is built for release. It is a valid concern, but we have mvn postcommit so the cov

Re: Main input element accesses expired side input windows

2018-03-08 Thread Shen Li
Thank you, Kenn! Shen On Thu, Mar 8, 2018 at 9:58 PM, Kenneth Knowles wrote: > > > On Thu, Mar 8, 2018 at 6:50 PM Shen Li wrote: > >> Hi Kenn, >> >> I just want to confirm that I understand it correctly. >> >> > - You know that W is expired only when you can be sure that no main >> input elem

Re: Main input element accesses expired side input windows

2018-03-08 Thread Kenneth Knowles
On Thu, Mar 8, 2018 at 6:50 PM Shen Li wrote: > Hi Kenn, > > I just want to confirm that I understand it correctly. > > > - You know that W is expired only when you can be sure that no main > input element could reference it. > > This is determined by the *main input* watermark, allowedLateness,

Re: Main input element accesses expired side input windows

2018-03-08 Thread Shen Li
Hi Kenn, I just want to confirm that I understand it correctly. > - You know that W is expired only when you can be sure that no main input element could reference it. This is determined by the *main input* watermark, allowedLateness, and maximumLookback, right? https://github.com/apache/beam/

Re: Main input element accesses expired side input windows

2018-03-08 Thread Shen Li
I see. Thank you Kenn and Lukasz. Best, Shen On Thu, Mar 8, 2018 at 7:46 PM, Kenneth Knowles wrote: > I think the description of when a side input is ready vs expired is the > trouble here. > > - You know that W is expired only when you can be sure that no main input > element could reference

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Chamikara Jayalath
Great talk, Eugene. Ted, will share more info on Kafka IO for Python soon :) - Cham On Thu, Mar 8, 2018 at 4:55 PM Ted Yu wrote: > I see. > > I have added myself as watcher on BEAM-3788. > > Thanks > > On Thu, Mar 8, 2018 at 4:51 PM, Eugene Kirpichov > wrote: > >> Hi Ted - KafkaIO is not yet

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Ted Yu
I see. I have added myself as watcher on BEAM-3788. Thanks On Thu, Mar 8, 2018 at 4:51 PM, Eugene Kirpichov wrote: > Hi Ted - KafkaIO is not yet implemented using Splittable DoFn's (it was > implemented before SDFs existed and hasn't been rewritten yet), but it will > be, once more runners cat

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Eugene Kirpichov
Hi Ted - KafkaIO is not yet implemented using Splittable DoFn's (it was implemented before SDFs existed and hasn't been rewritten yet), but it will be, once more runners catch up with the support: currently we have Dataflow and Flink. +Chamikara Jayalath is currently working on implementing it usi

Re: Main input element accesses expired side input windows

2018-03-08 Thread Kenneth Knowles
I think the description of when a side input is ready vs expired is the trouble here. - You know that W is expired only when you can be sure that no main input element could reference it. - You know that W is ready *even if it got no data* if the input that would end up in W would be dropped (ak

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Ted Yu
Eugene: Very informative talk. I looked at: sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/splittabledofn/OffsetRangeTrackerTest.java Is there some example showing how OffsetRangeTracker works with Kafka partition(s) ? Thanks On Thu, Mar 8, 2018 at 3:58 PM, Eugene Kirpichov wrote:

Re: Main input element accesses expired side input windows

2018-03-08 Thread Shen Li
Hi Lukasz, Let's explain this problem using a specific example. Say I have a main input element X, which accesses side input window W. When X arrives at a ParDo operator, W is not ready and not expired either. So, in this case, the ParDo should push back X and wait for W to become ready. Say, aft

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Eugene Kirpichov
Hi Thomas! In case of tailing a Kafka partition, the restriction would be [start_offset, infinity), and it would keep being split by checkpointing into [start_offset, end_offset) and [end_offset, infinity) On Thu, Mar 8, 2018 at 3:52 PM Thomas Weise wrote: > Eugene, > > I actually had one quest
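The checkpoint split described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `OffsetRange` and `checkpoint` are hypothetical names invented here (they are not the Beam SDK classes), and the unbounded end of a tail read is modelled with `math.inf`.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical half-open range [start, end); end may be math.inf for a tail read."""
    start: float
    end: float


def checkpoint(restriction: OffsetRange, next_unread: float) -> Tuple[OffsetRange, OffsetRange]:
    """Split a restriction at the first offset that has not been processed yet.

    Returns (primary, residual): the primary covers what was already read,
    the residual is the remaining work to resume later.
    """
    if not restriction.start <= next_unread <= restriction.end:
        raise ValueError("checkpoint offset must lie inside the restriction")
    primary = OffsetRange(restriction.start, next_unread)
    residual = OffsetRange(next_unread, restriction.end)
    return primary, residual


# Tailing a partition from offset 100: after reading up to (not including) 250,
# the work splits into [100, 250) and [250, infinity).
done, remaining = checkpoint(OffsetRange(100, math.inf), 250)
print(done, remaining)
```

Each later checkpoint would be applied to the residual, so the unbounded part keeps shrinking from the left while always staying of the form [end_offset, infinity).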

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Thomas Weise
Eugene, I actually had one question regarding the application of SDF for the Kafka consumer. Reading through a topic partition can be parallel by splitting a partition into multiple restrictions (for use cases where order does not matter). But how would the tail read be managed? I assume there wou

Re: [VOTE] Release 2.4.0, release candidate #2

2018-03-08 Thread Lukasz Cwik
Yes, the release guide has a segment "Update release specific configurations" that has a tidbit about this. On Thu, Mar 8, 2018 at 3:45 PM, Alan Myrvold wrote: > The dataflow java worker version wasn't updated on the branch as in past > releases ... should it be? > https://issues.apache.org/jira

Re: [VOTE] Release 2.4.0, release candidate #2

2018-03-08 Thread Alan Myrvold
The dataflow java worker version wasn't updated on the branch as in past releases ... should it be? https://issues.apache.org/jira/browse/BEAM-3815 On Thu, Mar 8, 2018 at 1:40 PM Romain Manni-Bucau wrote: > Can still be provided as a generic one (like the an offset or key based > one) but good

Re: Main input element accesses expired side input windows

2018-03-08 Thread Lukasz Cwik
I believe you're missing this point: "and also to not expire the side input till the main input watermark advances beyond the garbage collection hold of the side input." On Thu, Mar 8, 2018 at 3:33 PM, Shen Li wrote: > Hi Lukasz, > > Thanks again. > > > the runner is required to hold back th

Re: Main input element accesses expired side input windows

2018-03-08 Thread Shen Li
Hi Lukasz, Thanks again. > the runner is required to hold back the main input till the side input is ready Yes, I understand these requirements. But what if the side input expires before it becomes ready? Shen

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Thomas Weise
Great, thanks for sharing! On Thu, Mar 8, 2018 at 12:16 PM, Eugene Kirpichov wrote: > Oops that's just the template I used. Thanks for noticing, will regenerate > the PDF and reupload when I get to it. > > > On Thu, Mar 8, 2018, 11:59 AM Dan Halperin wrote: > >> Looks like it was a good talk!

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Raghu Angadi
Terrific! Thanks Eugene. Just the slides themselves are so good, can't wait for the video. Do you know when the video might be available? On Thu, Mar 8, 2018 at 12:16 PM Eugene Kirpichov wrote: > Oops that's just the template I used. Thanks for noticing, will regenerate > the PDF and reupload w

Re: Portable Flink Runner plan

2018-03-08 Thread Kenneth Knowles
I want to nitpick slightly the wording of "Java-only runner". I would like/expect that a runner using some specialized Java execution paths would still be accepting a portable pipeline and using the URNs and URLs to pick out special codepaths, so it is still different than just leaving the old code

Configuring file-based transforms with different options

2018-03-08 Thread Ismaël Mejía
I was trying to create a really simple pipeline that reads from a bucket in a filesystem (s3) and writes to a different bucket in the same filesystem. S3Options options = PipelineOptionsFactory.fromArgs(args).create().as(S3Options.class); Pipeline pipeline = Pipeline.create(options); pi

Re: Main input element accesses expired side input windows

2018-03-08 Thread Lukasz Cwik
Neither, the runner is required to hold back the main input till the side input is ready and also to not expire the side input till the main input watermark advances beyond the garbage collection hold of the side input. On Thu, Mar 8, 2018 at 1:52 PM, Shen Li wrote: > Hi Lukasz, > > Thanks. > >
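As a rough mental model of the two requirements stated above, the toy below holds back main-input elements until the side input window is ready, and only marks the window expired once the main input watermark has passed its garbage collection hold. Everything here is an assumption made for illustration: `SideInputWindow`, `ToyRunner`, and the integer timestamps are hypothetical and do not correspond to any Beam runner's actual code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SideInputWindow:
    gc_hold: int           # hypothetical timestamp until which the window must be kept
    ready: bool = False
    expired: bool = False


class ToyRunner:
    """Toy model of one ParDo with a single side input window."""

    def __init__(self, window: SideInputWindow) -> None:
        self.window = window
        self.pushed_back: List[str] = []
        self.processed: List[str] = []

    def on_main_element(self, element: str) -> None:
        # Requirement 1: hold back the main input until the side input is ready.
        if self.window.ready:
            self.processed.append(element)
        else:
            self.pushed_back.append(element)

    def on_side_input_ready(self) -> None:
        # Once ready, release everything that was pushed back.
        self.window.ready = True
        self.processed.extend(self.pushed_back)
        self.pushed_back.clear()

    def on_main_watermark(self, watermark: int) -> None:
        # Requirement 2: expire only after the *main input* watermark
        # has advanced beyond the window's garbage collection hold.
        if watermark > self.window.gc_hold:
            self.window.expired = True
```

The sketch deliberately leaves out how pushed-back elements would themselves hold the main input watermark; it only shows that readiness and expiry are driven by different signals.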

Re: Main input element accesses expired side input windows

2018-03-08 Thread Shen Li
Hi Lukasz, Thanks. > So having the side input significantly delayed can cause a serious backlog on the main input. Sometimes, side input is delayed and it is out of the applications' control. In this situation, should the runner/engine discard pushed back main elements if they access expired si

Re: [VOTE] Release 2.4.0, release candidate #2

2018-03-08 Thread Romain Manni-Bucau
Can still be provided as a generic one (like an offset or key based one) but good enough for now, right, was just surprising to not see it when checking the breakage. On 8 March 2018 22:05, "Eugene Kirpichov" wrote: All SDF-related method annotations in DoFn are marked @Experimental. I gue

Re: Main input element accesses expired side input windows

2018-03-08 Thread Lukasz Cwik
The runner/engine is responsible for pushing back the main input until the side input becomes ready. So having the side input significantly delayed can cause a serious backlog on the main input. On Thu, Mar 8, 2018 at 1:34 PM, Shen Li wrote: > Hi Lukasz, > > Thanks for the prompt response. Does

Re: Main input element accesses expired side input windows

2018-03-08 Thread Shen Li
Hi Lukasz, Thanks for the prompt response. Does it mean that if the side input elements and watermarks got delayed and lagged behind the main input stream, it is considered as an application problem, and Beam runners/engines do not need to handle that? Best, Shen On Thu, Mar 8, 2018 at 4:15 PM,

Re: Main input element accesses expired side input windows

2018-03-08 Thread Lukasz Cwik
The side input expires relative to the input watermark of the ParDo, so what you're suggesting could only happen if the runner had a bug and expired the side input before it should have happened or the user pipeline has a bug and is attempting to access a window for something that would always be cons

Main input element accesses expired side input windows

2018-03-08 Thread Shen Li
Hi, When a main input element tries to access an expired side input window (violating maximumLookback), should ParDo discard the element or treat it as an error? Besides, what should ParDo do in the following situation: 1. The side input window W is not expired but unready when the main input ele

Re: [VOTE] Release 2.4.0, release candidate #2

2018-03-08 Thread Eugene Kirpichov
All SDF-related method annotations in DoFn are marked @Experimental. I guess that should apply to RestrictionTracker too, but I wouldn't be too worried about that, since it only makes sense to use in the context of those methods. On Thu, Mar 8, 2018 at 12:36 PM Romain Manni-Bucau wrote: > Hmm, d

Re: Portable Flink Runner plan

2018-03-08 Thread Robert Bradshaw
All runners should support portable execution for Java, which should be just as easy as supporting execution of non-Java pipelines over this API. As for non-portable "specialized" execution of Java, I think it's a tradeoff between the overhead of the portability framework vs. the maintenance cost

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Kenneth Knowles
Love it. Great flashy title, too :-) On Thu, Mar 8, 2018 at 12:16 PM Eugene Kirpichov wrote: > Oops that's just the template I used. Thanks for noticing, will regenerate > the PDF and reupload when I get to it. > > On Thu, Mar 8, 2018, 11:59 AM Dan Halperin wrote: > >> Looks like it was a good

Re: Portable Flink Runner plan

2018-03-08 Thread Kenneth Knowles
+1 to Luke's answer of "yes" for everything to be "portable by default". However, I (always) favor decentralizing this decision as long as the "Beam model" is respected. Baseline: - the input pipeline should always be in portable format - the results of execution should match portable execution

Re: [VOTE] Release 2.4.0, release candidate #2

2018-03-08 Thread Romain Manni-Bucau
Hmm, does the SDF API miss some @Experimental then? To clarify: for waitUntilFinish I'm fine with the 2.4 as is but can't +1 or +0 since none of my tests pass reliably in the current state without a retry strategy, making the call useless. On 8 March 2018 21:02, "Reuven Lax" wrote: > Does Nexmark u

Re: Gradle status

2018-03-08 Thread Romain Manni-Bucau
@Lukasz: not sure I'm the best to host it since I know more Gradle internals than user interface/ecosystem but happy to help. Will also require a "sudo" merger for this day to merge fixes asap - guess we can bypass reviews or have a fast cycle plan for this day to avoid it becoming a week? On 8 March

Re: Mentoring for Google Summer of Code

2018-03-08 Thread Yanael Barbier
Thanks, I had a look; these are nice suggestions. Preparing the next step to choose one of them. Yanael On Wed 7 Mar 2018 at 9:46 PM, Kenneth Knowles wrote: > Hi Yanael, > > Glad to hear from you! Here is a saved filter for Jira tickets describing > GSoC project ideas in Beam: > > https://issues.ap

Re: [VOTE] Release 2.4.0, release candidate #2

2018-03-08 Thread Robert Bradshaw
On Thu, Mar 8, 2018 at 12:02 PM Reuven Lax wrote: > Does Nexmark use SerializableCoder? > That's what the errors (and fix) for RC1 seemed to indicate. > On Thu, Mar 8, 2018 at 10:42 AM Robert Bradshaw > wrote: > >> I put the validation checklist spreadsheet is up at >> https://docs.google.com

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Eugene Kirpichov
Oops that's just the template I used. Thanks for noticing, will regenerate the PDF and reupload when I get to it. On Thu, Mar 8, 2018, 11:59 AM Dan Halperin wrote: > Looks like it was a good talk! Why is it Google Confidential & > Proprietary, though? > > Dan > > On Thu, Mar 8, 2018 at 11:49 AM,

Re: Gradle status

2018-03-08 Thread Kenneth Knowles
@Romain - Idea/IntelliJ is great with Gradle, way better than Maven, and what I mean is that I have added enough hints that it works OOTB already. The rest of my instructions are just how you should override IntelliJ's defaults to have a proper dev env - mostly just about storing files outside the

Re: [VOTE] Release 2.4.0, release candidate #2

2018-03-08 Thread Reuven Lax
Does Nexmark use SerializableCoder? On Thu, Mar 8, 2018 at 10:42 AM Robert Bradshaw wrote: > I put the validation checklist spreadsheet is up at > https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?ts=5a1c7310#gid=1663314475 > > Regarding the direct runner

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Lukasz Cwik
I really like slide 19: Author: "I made a bigdata programming model" Reader: "Cool, how does data get in and out?" Author: "Brb" On Thu, Mar 8, 2018 at 11:49 AM, Eugene Kirpichov wrote: > Hey all, > > The slides for my yesterday's talk at Strata San Jose https://conferences. > oreilly.com/strata

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Dan Halperin
Looks like it was a good talk! Why is it Google Confidential & Proprietary, though? Dan On Thu, Mar 8, 2018 at 11:49 AM, Eugene Kirpichov wrote: > Hey all, > > The slides for my yesterday's talk at Strata San Jose https://conferences. > oreilly.com/strata/strata-ca/public/schedule/detail/63696

"Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Eugene Kirpichov
Hey all, The slides for my yesterday's talk at Strata San Jose https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63696 have been posted on the talk page. They may be of interest both to users and IO authors. Thanks.

Re: Gradle status

2018-03-08 Thread Lukasz Cwik
Romain, would you like to host/plan/run the Gradle fixit day? On Thu, Mar 8, 2018 at 11:24 AM, Chamikara Jayalath wrote: > +1 for the general idea of fixit day/week for Gradle. > > Agree with what Łukasz said. Some of these performance tests are new and > are flaky due to other issues that were

Re: Gradle status

2018-03-08 Thread Chamikara Jayalath
+1 for the general idea of fixit day/week for Gradle. Agree with what Łukasz said. Some of these performance tests are new and are flaky due to other issues that were discovered during the process of adding the test. I think the high level blocker is updating performance testing framework to use

Re: Portable Flink Runner plan

2018-03-08 Thread Lukasz Cwik
I ran some very pessimistic pipelines that were shuffle heavy (Random KV -> GBK -> IdentityDoFn) and found that the performance overhead was 15% when executed with Dataflow. This was a while back and there were a lot of inefficiencies due to coder encode/decode cycles and based upon profiling informa

Re: Portable Flink Runner plan

2018-03-08 Thread Thomas Weise
Performance, due to the extra gRPC hop. On Thu, Mar 8, 2018 at 11:08 AM, Lukasz Cwik wrote: > The goal is to use containers (and similar technologies) in the future. It > really hinders pipeline portability between runners if you also have to > deal with the dependency conflicts between Flink/D

Re: Portable Flink Runner plan

2018-03-08 Thread Lukasz Cwik
The goal is to use containers (and similar technologies) in the future. It really hinders pipeline portability between runners if you also have to deal with the dependency conflicts between Flink/Dataflow/Spark/... execution runtimes. What kinds of penalty are you referring to (perf, user complexi

Re: Portable Flink Runner plan

2018-03-08 Thread Thomas Weise
I'm curious if pipelines that are exclusively Java will be executed (when running on Flink or other JVM based runner) in separate harness containers also? This would impose a significant penalty compared to the current execution model. Will this be something the user can control? Thanks, Thomas

Re: [VOTE] Release 2.4.0, release candidate #2

2018-03-08 Thread Eugene Kirpichov
Yes, SDF is an experimental API at this point, so backwards incompatible changes are allowed and expected. On Thu, Mar 8, 2018, 10:42 AM Robert Bradshaw wrote: > I put the validation checklist spreadsheet is up at > https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXB

Re: [VOTE] Release 2.4.0, release candidate #2

2018-03-08 Thread Robert Bradshaw
I put the validation checklist spreadsheet up at https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?ts=5a1c7310#gid=1663314475 Regarding the direct runner regression on query 10, this is understandable given how mutation detection has been changed for seria

Re: The Go SDK got accidentally merged - options to deal with the pain

2018-03-08 Thread Robert Bradshaw
+1 to both of these points. SGA should have probably already been filed, and excising this from releases should be easy, but I added a line item to the validation checklist template to make sure we don't forget. On Thu, Mar 8, 2018 at 7:13 AM Davor Bonaci wrote: > I support leaving things as the

Re: The Go SDK got accidentally merged - options to deal with the pain

2018-03-08 Thread Davor Bonaci
I support leaving things as they stand now -- thanks for finding a good way out of an uncomfortable situation. That said, two things need to happen: (1) SGA needs to be filed asap, per Board feedback in the last report, and (2) releases cannot contain any code from the Go SDK before formally voted

Re: Calling rest api at the end of the pipeline

2018-03-08 Thread Jean-Baptiste Onofré
Hi David, you can use a simple ParDo(DoFn) at the end of your pipeline. For instance, you can mimic what we do in the ElasticsearchIO Write: https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L789 I also
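One common shape for such a write step is to buffer elements and flush them in batches, including a final flush at the end of each bundle. The sketch below shows that shape in plain Python without depending on Beam. It is an assumption-laden illustration: `BatchingPoster` and its `post` callable are hypothetical, the method names only mirror the `start_bundle` / `process` / `finish_bundle` lifecycle of a Beam Python DoFn, and a real pipeline would subclass `DoFn` and issue the HTTP request inside `post`.

```python
from typing import Callable, List


class BatchingPoster:
    """Plain-Python stand-in for a DoFn that posts results to a REST API in batches."""

    def __init__(self, post: Callable[[List[dict]], None], max_batch_size: int = 3) -> None:
        self._post = post                  # hypothetical callable that performs the HTTP POST
        self._max_batch_size = max_batch_size
        self._batch: List[dict] = []

    def start_bundle(self) -> None:
        # Start every bundle with an empty buffer.
        self._batch = []

    def process(self, element: dict) -> None:
        # Buffer the element and flush as soon as the batch is full.
        self._batch.append(element)
        if len(self._batch) >= self._max_batch_size:
            self._flush()

    def finish_bundle(self) -> None:
        # Flush whatever is left so no element is lost at the bundle boundary.
        self._flush()

    def _flush(self) -> None:
        if self._batch:
            self._post(list(self._batch))
            self._batch = []
```

Injecting `post` keeps the batching logic testable without a network; swapping in a function that calls the real endpoint is then a separate concern, as are retries and error handling.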

Calling rest api at the end of the pipeline

2018-03-08 Thread Sam, David
Hi, I have a use case where I would like to post the results to an API. I was thinking of using Sink, but it's documented not to use that. What is the recommended way of doing this? My flow is Read from a flow --> do transformation / validation --> post the results to rest API and also post

Re: Gradle status

2018-03-08 Thread Łukasz Gajowy
2018-03-07 22:29 GMT+01:00 Kenneth Knowles : > > Based on https://builds.apache.org/view/A-D/view/Beam/ and our failure > spam level the performance tests are mostly not healthy anyhow. So is there > any high level blocker to switching them or is it just someone sitting down > with each one? > > I

Re: [VOTE] Release 2.4.0, release candidate #2

2018-03-08 Thread Ismaël Mejía
I confirm that the new release fixes both problems reported previously: - python package name - nexmark query 10 mutability issue with the direct runner. One extra regression is that the fix produced a way longer execution time on the query. Not sure if a blocker but worth tracking. Query 10

Re: [YouTube channel] Add video: Apache Beam meetup London 2: use case in finance + IO in Beam and Splittable DoFns

2018-03-08 Thread Ismaël Mejía
+1 I was wondering if we can also add a playlist that links to presentations Beamers have done in different conferences, e.g. some of the publicly available talks from the past by Frances/Tyler/others are worth including so they can be easily found. (of course not sure if we need approval from