Re: How to serialize/deserialize a Pipeline object?

2016-12-21 Thread Kenneth Knowles
I went ahead and filed a more specific ticket just for this use,
https://issues.apache.org/jira/browse/BEAM-1196

On Wed, Dec 21, 2016 at 11:12 AM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> On Wed, Dec 21, 2016 at 10:58 AM, Shen Li <cs.she...@gmail.com> wrote:
> > Hi Kenn,
> >
> > Thanks a lot for the information.
> >
> > Sure, below are more details about the problem I encountered.
> >
> > I am developing a runner for IBM Streams, and am exploring possible ways
> to
> > conduct integration tests. As Streams is not an open-source project, we
> > cannot add the full set of libraries to Maven Central repo. Nor can we
> > guarantee to provide a server (with Streams installed) as a long-term
> > Jenkins slave. So, it seems more flexible to let the runner submit the
> > graph to a Streams cloud service, and provide the account info through
> > "-DrunnableOnServicePipelineOptions" (please correct me if it does not
> work
> > in this way). The problem is that the runner cannot convert the Pipeline
> > into a Streams graph format without a local Streams install. So, I am
> > thinking about sending the serialized Pipeline to the Cloud service for
> > execution. Maybe I should create some intermediate format between the
> > Pipeline and Streams graph format. Or, is there any other way to carry
> out
> > the integration test without a Streams install?
>
> Choosing an intermediate representation that can be serialized and
> sent to a cloud service (where it is then translated into the actual
> implementation representation) is a fine solution. In fact that's what
> Dataflow itself does.
>
> Of course we'll want to move as close to (2) as possible once it exists.
>
> > On Wed, Dec 21, 2016 at 12:08 PM, Kenneth Knowles <k...@google.com.invalid
> >
> > wrote:
> >
> >> Hi Shen,
> >>
> >> I want to tell you (1) how things work today and (2) how we want them
> to be
> >> eventually.
> >>
> >> (1) So far, each runner translates the Pipeline to its own graph
> format
> >> before serialization, so we have not yet encountered this issue.
> >>
> >> (2) We intend to make a standard mostly-readable JSON format for a
> >> Pipeline. It is based on the Avro schema sketched in the design doc at
> >> https://s.apache.org/beam-runner-api and there is also a draft JSON
> schema
> >> at https://github.com/apache/incubator-beam/pull/662.
> >>
> >> You may wish to follow https://issues.apache.org/jira/browse/BEAM-115,
> >> though that is a very general ticket.
> >>
> >> Can you share any more details?
> >>
> >> Kenn
> >>
> >> On Wed, Dec 21, 2016 at 8:47 AM, Shen Li <cs.she...@gmail.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > What are the recommended ways to serialize/deserialize a Pipeline
> >> object? I
> >> > need to submit a pipeline object to cloud for execution and fetch the
> >> > result.
> >> >
> >> > Thanks,
> >> >
> >> > Shen
> >> >
> >>
>


Re: How to serialize/deserialize a Pipeline object?

2016-12-21 Thread Kenneth Knowles
Hi Shen,

I want to tell you (1) how things work today and (2) how we want them to be
eventually.

(1) So far, each runner translates the Pipeline to its own graph format
before serialization, so we have not yet encountered this issue.
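For illustration, that per-runner translation typically walks the graph with
Pipeline#traverseTopologically and emits a runner-specific node per transform.
A rough sketch (StreamsNode is a hypothetical target type, and the visitor
method names follow the later SDK, so exact signatures may differ here):

    // Walk the pipeline and collect a runner-specific node per transform.
    final List<StreamsNode> graph = new ArrayList<>();
    pipeline.traverseTopologically(new Pipeline.PipelineVisitor.Defaults() {
      @Override
      public void visitPrimitiveTransform(TransformHierarchy.Node node) {
        graph.add(StreamsNode.fromTransform(node.getTransform()));
      }
    });
    // The collected graph can then be serialized (e.g. as JSON) and shipped.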

(2) We intend to make a standard mostly-readable JSON format for a
Pipeline. It is based on the Avro schema sketched in the design doc at
https://s.apache.org/beam-runner-api and there is also a draft JSON schema
at https://github.com/apache/incubator-beam/pull/662.

You may wish to follow https://issues.apache.org/jira/browse/BEAM-115,
though that is a very general ticket.

Can you share any more details?

Kenn

On Wed, Dec 21, 2016 at 8:47 AM, Shen Li  wrote:

> Hi,
>
> What are the recommended ways to serialize/deserialize a Pipeline object? I
> need to submit a pipeline object to cloud for execution and fetch the
> result.
>
> Thanks,
>
> Shen
>


Jenkins seed job breakage

2016-12-19 Thread Kenneth Knowles
Hi all,

The massive Jenkins breakage just now was me updating the seed job in
unfriendly ways. It should be all cleared up now. Apologies for that. I'll
be trying to come up with safer ways to validate such changes in the future.

Kenn


Re: Build failed in Jenkins: beam_SeedJob_Main #43

2016-12-19 Thread Kenneth Knowles
Context: PR #1640 has its LGTM. Before committing it, I am ensuring it
works by running the seed job against it. This _will_ change the other jobs
if/when it succeeds, but it will change them to what they are about to be
anyhow.

The failure here is not substantive. I built against origin/pr/1640 instead
of origin/pr/1640/head. The fetch spec of the Jenkins job is benignly
different from the one in our contribution guide.
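For reference, the two fetch specs in play are roughly (quoting from memory,
so treat this as a sketch):

    +refs/pull/*/head:refs/remotes/origin/pr/*   # contribution guide: origin/pr/1640 resolves
    +refs/pull/*:refs/remotes/origin/pr/*        # Jenkins job: the commit is at origin/pr/1640/head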

On Mon, Dec 19, 2016 at 10:31 AM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See 
>
> --
> Started by user kenn
> [EnvInject] - Loading node environment variables.
> Building remotely on beam2 (beam) in workspace  job/beam_SeedJob_Main/ws/>
>  > git rev-parse --is-inside-work-tree # timeout=10
> Fetching changes from the remote Git repository
>  > git config remote.origin.url https://github.com/apache/
> incubator-beam.git # timeout=10
> Fetching upstream changes from https://github.com/apache/
> incubator-beam.git
>  > git --version # timeout=10
>  > git -c core.askpass=true fetch --tags --progress
> https://github.com/apache/incubator-beam.git 
> +refs/heads/*:refs/remotes/origin/*
> +refs/pull/*:refs/remotes/origin/pr/*
>  > git rev-parse refs/remotes/origin/pr/1640^{commit} # timeout=10
>  > git rev-parse refs/remotes/origin/origin/pr/1640^{commit} # timeout=10
>  > git rev-parse origin/pr/1640^{commit} # timeout=10
> ERROR: Couldn't find any revision to build. Verify the repository and
> branch configuration for this job.
> Retrying after 10 seconds
>  > git rev-parse --is-inside-work-tree # timeout=10
> Fetching changes from the remote Git repository
>  > git config remote.origin.url https://github.com/apache/
> incubator-beam.git # timeout=10
> Fetching upstream changes from https://github.com/apache/
> incubator-beam.git
>  > git --version # timeout=10
>  > git -c core.askpass=true fetch --tags --progress
> https://github.com/apache/incubator-beam.git 
> +refs/heads/*:refs/remotes/origin/*
> +refs/pull/*:refs/remotes/origin/pr/*
>  > git rev-parse refs/remotes/origin/pr/1640^{commit} # timeout=10
>  > git rev-parse refs/remotes/origin/origin/pr/1640^{commit} # timeout=10
>  > git rev-parse origin/pr/1640^{commit} # timeout=10
> ERROR: Couldn't find any revision to build. Verify the repository and
> branch configuration for this job.
> Retrying after 10 seconds
>  > git rev-parse --is-inside-work-tree # timeout=10
> Fetching changes from the remote Git repository
>  > git config remote.origin.url https://github.com/apache/
> incubator-beam.git # timeout=10
> Fetching upstream changes from https://github.com/apache/
> incubator-beam.git
>  > git --version # timeout=10
>  > git -c core.askpass=true fetch --tags --progress
> https://github.com/apache/incubator-beam.git 
> +refs/heads/*:refs/remotes/origin/*
> +refs/pull/*:refs/remotes/origin/pr/*
>  > git rev-parse refs/remotes/origin/pr/1640^{commit} # timeout=10
>  > git rev-parse refs/remotes/origin/origin/pr/1640^{commit} # timeout=10
>  > git rev-parse origin/pr/1640^{commit} # timeout=10
> ERROR: Couldn't find any revision to build. Verify the repository and
> branch configuration for this job.
>
>


Re: [VOTE] Release 0.4.0-incubating, release candidate #3

2016-12-17 Thread Kenneth Knowles
+1, as long as it is fine for the release to be signed by a PMC member
other than the release manager. Otherwise we need to replace the .asc file.

Following [Apache release checklist](
http://incubator.apache.org/guides/releasemanagement.html#check-list):

1.1 Verified checksums & signature (Davor's)
2.1 Ran unit tests and integration tests
3.1 DISCLAIMER is correct
3.2 LICENSE & NOTICE are correct
3.3 Files have license headers (RAT & checkstyle)
3.4 Provenance is clear
3.5 Dependency licenses are legal (RAT) [2]
3.6 Release contains source code, no binaries

Additionally:

 - Went over the generated javadoc (filed tickets but no release blockers)
 - Went over the generated release notes
 - Sanity checked the Maven Central artifacts
 - Confirmed that the git tag matches
 - Checked the website PR

I heartily agree that the components would give much better context on
tickets. Even with that, our JIRA titles could use a lot of improvement.


On Fri, Dec 16, 2016 at 5:06 AM, Jean-Baptiste Onofré 
wrote:

> Hi everyone,
>
> Please review and vote on the release candidate #3 for the version
> 0.4.0-incubating, as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org
> [2],
> * all artifacts to be deployed to the Maven Central Repository [3],
> * source code tag "v0.4.0-incubating-RC3" [4],
> * website pull request listing the release and publishing the API reference
> manual [5].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PPMC affirmative votes.
>
> Thanks,
> Regards
> JB
>
> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?proje
> ctId=12319527&version=12338590
> [2] https://dist.apache.org/repos/dist/dev/incubator/beam/0.4.0-
> incubating/
> [3] https://repository.apache.org/content/repositories/orgapachebeam-1008/
> [4] https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git
> ;a=tag;h=112e38e4a68b07e6bf4916d1bdcc7ecaca8bbbd4
> [5] https://github.com/apache/incubator-beam-site/pull/109
>


Re: Jenkins build became unstable: beam_PostCommit_Java_MavenInstall #2124

2016-12-16 Thread Kenneth Knowles
Prior to this coming in, it was already mostly rolled forward in Dataflow.
It would actually be counterproductive to revert in Beam as that would
re-introduce the same bug in reverse.

Filed BEAM-1172 to prevent this in the future.

On Fri, Dec 16, 2016 at 2:18 PM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See  MavenInstall/2124/changes>
>
>


Re: [VOTE] Release 0.4.0-incubating, release candidate #1

2016-12-15 Thread Kenneth Knowles
Looping back, all issues I know about or have mentioned are now on the
release branch.

On Thu, Dec 15, 2016 at 11:36 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> -1 (binding)
>
> I would include the fix for metrics and impacting dataflow and flink
> runners.
>
> I agree with Davor: I would prefer to cut a RC2.
>
> Regards
> JB⁣​
>
> On Dec 15, 2016, 20:06, at 20:06, Kenneth Knowles <k...@google.com.INVALID>
> wrote:
> >Agreed. I had thought the issue in PR #1620 only affected Dataflow (in
> >which
> >case we could address it in the service) but it now also affects the
> >Flink
> >runner, so it should be included in the release.
> >
> >On Thu, Dec 15, 2016 at 10:46 AM, Eugene Kirpichov <
> >kirpic...@google.com.invalid> wrote:
> >
> >> There is one more data-loss type error, a fix for which should go
> >into the
> >> release.
> >> https://github.com/apache/incubator-beam/pull/1620
> >>
> >> On Thu, Dec 15, 2016 at 10:42 AM Davor Bonaci <da...@apache.org>
> >wrote:
> >>
> >> > I think we should build another RC.
> >> >
> >> > Two issues:
> >> > * Metrics issue that JB pointed out earlier. It seems to cause a
> >somewhat
> >> > poor user experience for every pipeline executed on the Direct
> >runner.
> >> > (Thanks JB for finding this out!)
> >> > * Failure of testSideInputsWithMultipleWindows in Jenkins [1].
> >> >
> >> > Both issues seem easy, trivial, non-risky fixes that are already
> >> committed
> >> > to master. I'd suggest just taking them.
> >> >
> >> > Davor
> >> >
> >> > [1]
> >> >
> >> > https://builds.apache.org/view/Beam/job/beam_PostCommit_
> >> Java_RunnableOnService_Dataflow/1819/
> >> >
> >> > On Thu, Dec 15, 2016 at 8:45 AM, Ismaël Mejía <ieme...@gmail.com>
> >wrote:
> >> >
> >> > > +1 (non-binding)
> >> > >
> >> > > - verified signatures + checksums
> >> > > - run mvn clean verify -Prelease, all artifacts+tests run
> >smoothly
> >> > >
> >> > > The release artifacts are signed with the key with fingerprint
> >8F0D334F
> >> > > https://dist.apache.org/repos/dist/release/incubator/beam/KEYS
> >> > >
> >> > > I just created a JIRA to add the signer/KEYS information in the
> >release
> >> > > template, I will do a PR for this later on.
> >> > >
> >> > > Ismaël
> >> > >
> >> > > On Thu, Dec 15, 2016 at 2:26 PM, Jean-Baptiste Onofré
> ><j...@nanthrax.net
> >> >
> >> > > wrote:
> >> > >
> >> > > > Hi Amit,
> >> > > >
> >> > > > thanks for the update.
> >> > > >
> >> > > > As you changed the Jira, the Release Notes are now up to date.
> >> > > >
> >> > > > Regards
> >> > > > JB
> >> > > >
> >> > > >
> >> > > > On 12/15/2016 02:20 PM, Amit Sela wrote:
> >> > > >
> >> > > >> I see three problems in the release notes (related to Spark
> >runner):
> >> > > >>
> >> > > >> Improvement:
> >> > > >> 
> >> > > >> [BEAM-757] - The SparkRunner should utilize the SDK's
> >DoFnRunner
> >> > instead
> >> > > >> of
> >> > > >> writing it's own.
> >> > > >> 
> >> > > >> [BEAM-807] - [SparkRunner] Replace OldDoFn with DoFn
> >> > > >> 
> >> > > >> [BEAM-855] - Remove the need for --streaming option in the
> >spark
> >> > runner
> >> > > >>
> >> > > >> BEAM-855 is duplicate and probably shouldn't have had a Fix
> >Version.
> >> > > >>
> >> > > >> The other two are not a part of this release - I was probably
> >too
> >> > eager
> >> > > to
> >> > > >> mark them fixed after merge and I accidentally put 0.4.0 as
> >the Fix
> >> > > >> Version.
> >> > > >>
> >> > > >> I made the changes in JIRA now.
> >> > > >>
> >> > > >> Thanks,
> >> >

Re: [VOTE] Release 0.4.0-incubating, release candidate #1

2016-12-15 Thread Kenneth Knowles
Agreed. I had thought the issue in PR #1620 only affected Dataflow (in which
case we could address it in the service) but it now also affects the Flink
runner, so it should be included in the release.

On Thu, Dec 15, 2016 at 10:46 AM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> There is one more data-loss type error, a fix for which should go into the
> release.
> https://github.com/apache/incubator-beam/pull/1620
>
> On Thu, Dec 15, 2016 at 10:42 AM Davor Bonaci  wrote:
>
> > I think we should build another RC.
> >
> > Two issues:
> > * Metrics issue that JB pointed out earlier. It seems to cause a somewhat
> > poor user experience for every pipeline executed on the Direct runner.
> > (Thanks JB for finding this out!)
> > * Failure of testSideInputsWithMultipleWindows in Jenkins [1].
> >
> > Both issues seem easy, trivial, non-risky fixes that are already
> committed
> > to master. I'd suggest just taking them.
> >
> > Davor
> >
> > [1]
> >
> > https://builds.apache.org/view/Beam/job/beam_PostCommit_
> Java_RunnableOnService_Dataflow/1819/
> >
> > On Thu, Dec 15, 2016 at 8:45 AM, Ismaël Mejía  wrote:
> >
> > > +1 (non-binding)
> > >
> > > - verified signatures + checksums
> > > - run mvn clean verify -Prelease, all artifacts+tests run smoothly
> > >
> > > The release artifacts are signed with the key with fingerprint 8F0D334F
> > > https://dist.apache.org/repos/dist/release/incubator/beam/KEYS
> > >
> > > I just created a JIRA to add the signer/KEYS information in the release
> > > template, I will do a PR for this later on.
> > >
> > > Ismaël
> > >
> > > On Thu, Dec 15, 2016 at 2:26 PM, Jean-Baptiste Onofré  >
> > > wrote:
> > >
> > > > Hi Amit,
> > > >
> > > > thanks for the update.
> > > >
> > > > As you changed the Jira, the Release Notes are now up to date.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > >
> > > > On 12/15/2016 02:20 PM, Amit Sela wrote:
> > > >
> > > >> I see three problems in the release notes (related to Spark runner):
> > > >>
> > > >> Improvement:
> > > >> 
> > > >> [BEAM-757] - The SparkRunner should utilize the SDK's DoFnRunner
> > instead
> > > >> of
> > > >> writing it's own.
> > > >> 
> > > >> [BEAM-807] - [SparkRunner] Replace OldDoFn with DoFn
> > > >> 
> > > >> [BEAM-855] - Remove the need for --streaming option in the spark
> > runner
> > > >>
> > > >> BEAM-855 is duplicate and probably shouldn't have had a Fix Version.
> > > >>
> > > >> The other two are not a part of this release - I was probably too
> > eager
> > > to
> > > >> mark them fixed after merge and I accidentally put 0.4.0 as the Fix
> > > >> Version.
> > > >>
> > > >> I made the changes in JIRA now.
> > > >>
> > > >> Thanks,
> > > >> Amit
> > > >>
> > > >> On Thu, Dec 15, 2016 at 3:09 PM Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > > >> wrote:
> > > >>
> > > >> Reviewing and testing the release, I see:
> > > >>>
> > > >>> 16/12/15 14:04:47 ERROR MetricsContainer: Unable to update metrics
> on
> > > >>> the current thread. Most likely caused by using metrics outside the
> > > >>> managed work-execution thread.
> > > >>>
> > > >>> It doesn't block the execution of the pipeline, but basically, it
> > means
> > > >>> that metrics don't work anymore.
> > > >>>
> > > >>> I'm investigating.
> > > >>>
> > > >>> Regards
> > > >>> JB
> > > >>>
> > > >>> On 12/15/2016 01:46 PM, Jean-Baptiste Onofré wrote:
> > > >>>
> > >  Hi everyone,
> > > 
> > >  Please review and vote on the release candidate #1 for the version
> > >  0.4.0-incubating, as follows:
> > >  [ ] +1, Approve the release
> > >  [ ] -1, Do not approve the release (please provide specific
> > comments)
> > > 
> > >  The complete staging area is available for your review, which
> > > includes:
> > >  * JIRA release notes [1],
> > >  * the official Apache source release to be deployed to
> > > dist.apache.org
> > > 
> > > >>> [2],
> > > >>>
> > >  * all artifacts to be deployed to the Maven Central Repository
> [3],
> > >  * source code tag "v0.4.0-incubating-RC1" [4],
> > >  * website pull request listing the release and publishing the API
> > > 
> > > >>> reference
> > > >>>
> > >  manual [5].
> > > 
> > >  The vote will be open for at least 72 hours. It is adopted by
> > majority
> > >  approval, with at least 3 PPMC affirmative votes.
> > > 
> > >  Thanks,
> > >  Regards
> > >  JB
> > > 
> > >  [1]
> > > 
> > >  https://issues.apache.org/jira/secure/ReleaseNote.jspa?proje
> > > >>> ctId=12319527&version=12338590
> > > >>>
> > > 
> > >  [2]
> > > 
> > > >>> https://dist.apache.org/repos/dist/dev/incubator/beam/0.4.0-
> > > incubating/
> > > >>>
> > >  [3]
> > > 
> > > >>>
> > https://repository.apache.org/content/repositories/orgapachebeam-1006/
> > > >>>
> > >  [4]
> > > 
> > >  

Re: New testSideInputsWithMultipleWindows and should DoFnRunner explode if DoFn contains a side input ?

2016-12-14 Thread Kenneth Knowles
Yes, this is a bug in SimpleDoFnRunner (or perhaps a lack of clarity about
whether it owns this), not the Spark runner. FWIW the test is definitely
correct, and runners-core has had this bug for a while. It is
https://issues.apache.org/jira/browse/BEAM-1149 and I'm on it.
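The shape of the fix is roughly the following (a sketch rather than the
actual patch; observesWindow and sideInputReader stand in for the runner's
existing bookkeeping):

    // Explode a compressed element into one value per window if the DoFn
    // observes the window directly OR reads side inputs, since side input
    // lookup is resolved per window.
    boolean requiresExplosion = observesWindow || !sideInputReader.isEmpty();
    if (requiresExplosion) {
      for (WindowedValue<InputT> value : compressedElem.explodeWindows()) {
        invokeProcessElement(value);
      }
    } else {
      invokeProcessElement(compressedElem);
    }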

On Wed, Dec 14, 2016 at 11:03 AM, Amit Sela  wrote:

> Hi all,
>
> Yesterday a new test was added to ParDoTest suite:
> "testSideInputsWithMultipleWindows".
> To the best of my understanding, it's meant to test sideInputs for elements
> in multiple windows (unexploded).
>
> The Spark runner uses the DoFnRunner (Simple) to process DoFns, and it will
> explode compressed elements only if it's "tagged" as "Observes Window".
>
> Should it also explode if it has sideInputs ?
>
> Thanks,
> Amit
>


Re: Jenkins build is still unstable: beam_PostCommit_Java_RunnableOnService_Spark #409

2016-12-14 Thread Kenneth Knowles
This is still https://issues.apache.org/jira/browse/BEAM-1149. We recently
added a test for it. The actual behavior has been broken for everyone for a
while. It is half-fixed by Eugene K. (some DoFnRunners) but not all.

On Wed, Dec 14, 2016 at 10:51 AM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See  RunnableOnService_Spark/409/>
>
>


Re: Jenkins build is still unstable: beam_PostCommit_Java_RunnableOnService_Dataflow #1806

2016-12-13 Thread Kenneth Knowles
Failure in
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Dataflow/1806/
is caused by https://github.com/apache/incubator-beam/pull/1541, which I am
reverting.

On Tue, Dec 13, 2016 at 3:16 PM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See  RunnableOnService_Dataflow/changes>
>
>


Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-12-13 Thread Kenneth Knowles
I don't think there is any conflict here.

On Tue, Dec 13, 2016 at 12:34 PM, Pei He <pe...@google.com.invalid> wrote:

> One design decision made during previous design discussion [1] is
> "Replacing
> FilePath with URI for resolving file paths". This has been brought back to
> the dev@ mailing list in my previous email.
>

The direction of this argument, in my opinion, gets the burden of proof
wrong.

The original design document effectively proposed "instead of using URIs,
let's make a Beam-specific abstraction" and [1] is just the natural comment
"let's just use URI". This works for the internet, and gives interop with
essentially all code, so you need a very special reason not to do it (and
special cases generally manifest as custom URI schemes).

Comment [2] asked me to clarify the impact on Windows OS users because
> users have to specify the path in the URI format, such as:
> "file:///C:/home/input-*"
> "C:/home/"
>

It is not really true that users have to do this. For the command line, it
is the responsibility of the code that parses "--filesToStage
C:\my\windows\path". Users should absolutely be able to specify paths like
this on Windows, and it is not difficult and nothing your proposal needs to
solve.

With programmatic creation in Java code, the same principle applies: the
environment-specific String/File/Path should be converted to a URI at the
membrane. Making an API take a URI makes it completely obvious to a Java
programmer that if they have a String/File/Path they need to convert it
appropriately.
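For example, in plain Java (no Beam involvement needed at this layer):

    // The environment-specific value the user supplied...
    File windowsPath = new File("C:\\my\\windows\\path");
    // ...becomes a URI at the API boundary, i.e. the membrane.
    URI uri = windowsPath.toURI();  // e.g. file:/C:/my/windows/path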

Kenn


> Using URIs in the API is to ensure Beam code is file-system agnostic.
>
> Another alternative is Java Path/File. It is used in the current
> IOChannelFactory API, and it works poorly. For example, Path throws when
> there are file scheme or asterisk in the path:
> new File("file:///C:/home/").toPath() throws in toPath().
> Paths.get("C:/home/").resolve("output-*") throws in resolve().
>
> any thoughts and suggestions are welcome.
>
> Thanks
> --
> Pei
>
> ---
> [1]:
> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-
> XJsVG3qel2lhdKTknmZ_7M/edit?disco=A30vtPU#heading=h.p3gc3colc2cs
>
> [2]:
> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-
> XJsVG3qel2lhdKTknmZ_7M/edit?disco=A02O1cY
>
> On Tue, Dec 6, 2016 at 1:25 PM, Kenneth Knowles <k...@google.com.invalid>
> wrote:
>
> > Thanks for the thorough answers. It all sounds good to me.
> >
> > On Tue, Dec 6, 2016 at 12:57 PM, Pei He <pe...@google.com.invalid>
> wrote:
> >
> > > Thanks Kenn for the feedback and questions.
> > >
> > > I responded inline.
> > >
> > > On Mon, Dec 5, 2016 at 7:49 PM, Kenneth Knowles <k...@google.com.invalid
> >
> > > wrote:
> > >
> > > > I really like this document. It is easy to read and informative.
> Three
> > > > things not addressed by the document:
> > > >
> > > > 1. Major Beam use cases. I'm sure we have a few in the SDK that could
> > be
> > > > outlined in terms of the new API with pseudocode.
> > >
> > >
> > > (I am writing pseudocode directly with FileSystem interface to
> > demonstrate.
> > > However, clients will use the utility FileSystems. This is for us to
> > have a
> > > layer between the file systems providers' interface and the client
> > > interface. We can add utility functions to FileSystems for common use
> > > patterns as needed.)
> > >
> > > Major Beam use cases are the following:
> > > A. FileBasedSource:
> > > // a. Get input URIs and file sizes from user-provided specs.
> > > // Note: I updated the match() to be a bulk operation after I sent my
> > last
> > > email.
> > > List results = match(specList);
> > > List inputMetadataList = FluentIterable.from(results)
> > > .transformAndConcat(
> > > new Function<MatchResult, Metadata>() {
> > >   @Override
> > >   public Iterable apply(MatchResult result) {
> > > return Arrays.asList(result.metadata());
> > >   });
> > >
> > > // b. Read from a start offset to support the source splitting.
> > > SeekableByteChannel seekChannel = open(fileUri);
> > > seekChannel.position(source.getStartOffset());
> > > seekChannel.read(...);
> > >
> > > B. FileBasedSink:
> > > // bulk rename temporary files to output files
> > > rename(tempUris, outputUris);
> > >
> > > C. General file operations:
>

Re: Beam Tuple

2016-12-13 Thread Kenneth Knowles
If the scope is really just tuples, then supposing a user chooses to go
with Apache Commons tuples or javatuples it seems that the problem to be
solved is easily providing coders for common data types that are not part
of Beam. I think we should address this anyhow.
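A sketch of what that per-pipeline registration looks like (Pair stands for
whichever third-party tuple type the user picks, PairCoder is a hypothetical
coder for it, and registerCoder is the CoderRegistry method name of this era):

    Pipeline p = Pipeline.create(options);
    p.getCoderRegistry().registerCoder(Pair.class, PairCoder.class);
    // Coder inference can now resolve elements of PCollection<Pair<...>>.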

The scope of having a common format is much more broad. Remember that a
coder is just a proxy for a well-defined binary format [1], so a solution
will fall somewhere in that arena. Even before encoding IDs, We had some
rudimentary support for tagging the most critical common formats [2] [3]
but it was too runner-specific and not a general solution.

Kenn

[1]
https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L227
[2]
https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/KvCoder.java#L129
[3]
https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/IterableCoder.java#L73

On Dec 13, 2016 09:03, "Jean-Baptiste Onofré"  wrote:

> Hi Robert,
>
> Agree, however which one the user would use ? Create his own one ?
>
> Today, I think Beam is heavily flexible in term of data format (which is
> great), but the trade off is that the end-users have to write lot of
> boilerplate code (just to convert from one type to another).
>
> So, basically, the purpose of a Beam Tuple is to have something provided
> out of box: if the user wants to use another tuple, that's fine.
> Generally speaking, the discussion about data format extension is about to
> simplify the way for users to manipulate popular data formats.
>
> Regards
> JB
>
> On 12/13/2016 05:56 PM, Robert Bradshaw wrote:
>
>> The Java language isn't very amenable to Tuple APIs as there are several
>> (mutually exclusive?) tradeoffs that must be made, each with their pros
>> and
>> cons. What advantage is there of Beam providing its own tuple API vs.
>> letting users pick whatever tuple library they want and using that with
>> Beam?
>>
>> (I suppose we're already using and encouraging AutoValue which covers a
>> lot
>> of tuple cases.)
>>
>> On Tue, Dec 13, 2016 at 8:20 AM, Aparup Banerjee (apbanerj) <
>> apban...@cisco.com> wrote:
>>
>> We have created one. An untagged Tuple. Will be happy to contribute it to
>>> the community
>>>
>>> Aparup
>>>
>>> On Dec 13, 2016, at 5:11 AM, Amit  wrote:

 I'll add that I know of Beam's PTuple, but my question is about much
 simpler Tuples, untagged.

 On Tue, Dec 13, 2016 at 1:56 PM Jean-Baptiste Onofré 
 wrote:

 Hi Amit,
>
> as discussed together, I think a Tuple abstraction would be good in the
> SDK (more than in the data format extension).
>
> Regards
> JB
>
> On 12/13/2016 11:06 AM, Amit Sela wrote:
>> Hi all,
>>
>> I was wondering why Beam doesn't have tuples as part of the SDK ?
>> To the best of my knowledge all currently supported (OSS) runners:
>>
> Spark,
>>>
 Flink, Apex provide a Tuple abstraction and I was wondering if Beam
>>
> should
>
>> too ?
>>
>> Consider KV for example; it is a special ("*keyed*" by the first
>> field)
>> implementation Tuple2.
>> While KV's importance is far more than being a Tuple2, I'm wondering
>> if
>>
> the
>
>> SDK would benefit from a proper TupleX support ?
>>
>> Thanks,
>> Amit
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Jenkins pre/postcommit increased from 35m to 60m+ on Friday

2016-12-12 Thread Kenneth Knowles
Great. That means the timestamp change I made for Travis, ported to
Jenkins, should reveal more.

Meanwhile - any known issues with Jenkins or Maven Central? Status
dashboard for Maven Central doesn't look unhappy.

On Mon, Dec 12, 2016 at 6:25 PM, Dan Halperin <dhalp...@google.com.invalid>
wrote:

> From the "bad run", the Maven part took 35 minutes and presumably the rest
> is Jenkins / Maven / downloading overhead.
>
> [INFO] ------------------------------------------------------------------------
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Apache Beam :: Parent .............................. SUCCESS [ 37.558 s]
> [INFO] Apache Beam :: SDKs :: Java :: Build Tools ......... SUCCESS [  8.172 s]
> [INFO] Apache Beam :: SDKs ................................ SUCCESS [ 11.521 s]
> [INFO] Apache Beam :: SDKs :: Java ........................ SUCCESS [ 10.801 s]
> [INFO] Apache Beam :: SDKs :: Java :: Core ................ SUCCESS [04:04 min]
> [INFO] Apache Beam :: Runners ............................. SUCCESS [ 14.945 s]
> [INFO] Apache Beam :: Runners :: Core Java ................ SUCCESS [ 44.654 s]
> [INFO] Apache Beam :: Runners :: Direct Java .............. SUCCESS [02:06 min]
> [INFO] Apache Beam :: Runners :: Google Cloud Dataflow .... SUCCESS [ 33.245 s]
> [INFO] Apache Beam :: SDKs :: Java :: IO .................. SUCCESS [  4.047 s]
> [INFO] Apache Beam :: SDKs :: Java :: IO :: Google Cloud Platform SUCCESS [04:32 min]
> [INFO] Apache Beam :: SDKs :: Java :: IO :: HDFS .......... SUCCESS [ 32.009 s]
> [INFO] Apache Beam :: SDKs :: Java :: IO :: JMS ........... SUCCESS [ 19.006 s]
> [INFO] Apache Beam :: SDKs :: Java :: IO :: Kafka ......... SUCCESS [ 22.021 s]
> [INFO] Apache Beam :: SDKs :: Java :: IO :: Kinesis ....... SUCCESS [ 22.817 s]
> [INFO] Apache Beam :: SDKs :: Java :: IO :: MongoDB ....... SUCCESS [ 27.276 s]
> [INFO] Apache Beam :: SDKs :: Java :: IO :: JDBC .......... SUCCESS [ 23.662 s]
> [INFO] Apache Beam :: SDKs :: Java :: Maven Archetypes .... SUCCESS [  3.115 s]
> [INFO] Apache Beam :: SDKs :: Java :: Maven Archetypes :: Starter SUCCESS [ 17.818 s]
> [INFO] Apache Beam :: SDKs :: Java :: Maven Archetypes :: Examples SUCCESS [ 15.111 s]
> [INFO] Apache Beam :: SDKs :: Java :: Maven Archetypes :: Examples - Java 8 SUCCESS [ 24.477 s]
> [INFO] Apache Beam :: SDKs :: Java :: Extensions .......... SUCCESS [  5.759 s]
> [INFO] Apache Beam :: SDKs :: Java :: Extensions :: Join library SUCCESS [ 22.293 s]
> [INFO] Apache Beam :: SDKs :: Java :: Extensions :: Sorter  SUCCESS [ 31.145 s]
> [INFO] Apache Beam :: SDKs :: Java :: Java 8 Tests ........ SUCCESS [  8.033 s]
> [INFO] Apache Beam :: Runners :: Flink .................... SUCCESS [  6.503 s]
> [INFO] Apache Beam :: Runners :: Flink :: Core ............ SUCCESS [ 43.593 s]
> [INFO] Apache Beam :: Runners :: Flink :: Examples ........ SUCCESS [ 20.006 s]
> [INFO] Apache Beam :: Runners :: Spark .................... SUCCESS [03:10 min]
> [INFO] Apache Beam :: Runners :: Apex ..................... SUCCESS [01:03 min]
> [INFO] Apache Beam :: Examples ............................ SUCCESS [  8.124 s]
> [INFO] Apache Beam :: Examples :: Java .................... SUCCESS [05:29 min]
> [INFO] Apache Beam :: Examples :: Java 8 .................. SUCCESS [ 19.005 s]
> [INFO] ------------------------------------------------------------------------
> [INFO] BUILD SUCCESS
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time: 34:30 min
> [INFO] Finished at: 2016-12-09T18:50:49+00:00
> [INFO] Final Memory: 196M/1051M
> [INFO] ------------------------------------------------------------------------
>
>
>
> On Mon, Dec 12, 2016 at 5:36 PM, Kenneth Knowles <k...@google.com.invalid>
> wrote:
>
> > Hi all,
> >
> > We have a huge Jenkins backlog, surely exacerbated by the fact that our
> > test time (precommit and postcommit mvn install) has roughly doubled in
> the
> > last few days.
> >
> > Here's the quick link to the trend:
> > https://builds.apache.org/view/Beam/job/beam_PostCommit_
> Java_MavenInstall/
> > buildTimeTrend
> >
> > Good 33m build at 2016-12-09 02:42:
> > https://builds.apache.org/view/Beam/job/beam_PostCommit_
> > Java_MavenInstall/2041/
> >
> > Bad 59m build at 2016-12-09 18:00 (triggered by timer):
> > https://builds.apache.org/view/Beam/job/beam_PostCommit_
> > Java_MavenInstall/2048/
> >
> > There are a couple middling runs in between that I can't place
> immediately
> > into either bucket. I'm still looking into this, but it is a
> clicky-clicky
> > process that hasn't yielded anything yet.
> >
> > If anyone knows something, please share your insight.
> >
> > Kenn
> >
>


Re: examples-java8 tests running slow

2016-12-12 Thread Kenneth Knowles
Yes, it is a bit harder to get fine-tuned executions of them. But they should
only be run in the integration-test phase, not with the unit tests. Is this
happening when you run them locally, or in Jenkins?

On Mon, Dec 12, 2016 at 5:06 PM, Manu Zhang  wrote:

> Sorry, they are tests under *maven-archetypes/examples-java8* and cannot be
> skipped with "-DskipTests"
>
> On Tue, Dec 13, 2016 at 9:01 AM Manu Zhang 
> wrote:
>
> > Hi,
> >
> > I find tests under example-java8 are running slow each time. Looking at
> > the codes, it might be that MinimalWordCountJava8Test reading from and
> > writing to "gs://".
> > Have anyone else had similar experience ? Do we have to access external
> > resources in a UT ?
> >
> > Thanks,
> > Manu
> >
>


Jenkins pre/postcommit increased from 35m to 60m+ on Friday

2016-12-12 Thread Kenneth Knowles
Hi all,

We have a huge Jenkins backlog, surely exacerbated by the fact that our
test time (precommit and postcommit mvn install) has roughly doubled in the
last few days.

Here's the quick link to the trend:
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_MavenInstall/buildTimeTrend

Good 33m build at 2016-12-09 02:42:
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_MavenInstall/2041/

Bad 59m build at 2016-12-09 18:00 (triggered by timer):
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_MavenInstall/2048/

There are a couple middling runs in between that I can't place immediately
into either bucket. I'm still looking into this, but it is a clicky-clicky
process that hasn't yielded anything yet.

If anyone knows something, please share your insight.

Kenn


Re: New DoFn and WindowedValue/WinowingInternals

2016-12-11 Thread Kenneth Knowles
You've got it right. My recommendation is to just directly implement it
for the Spark runner. It will often actually clean things up a bit. Here's
the analogous change for the Flink runner:
https://github.com/apache/incubator-beam/pull/1435/files.

With GABW, I tried going through the process of keeping some utility
expansion in runners-core, making StateInternalsFactory, refactoring
GroupAlsoByWindowsDoFn, then GroupByKeyViaGroupByKeyOnly,
GroupAlsoByWindow. But it ended up simpler for each runner to just not use
most of that and do it directly. (they all still share GABW but none of the
surrounding bits, IIRC)
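For the Window.Bound case specifically, the runner-side replacement for
AssignWindowsDoFn boils down to invoking the WindowFn and re-wrapping the
element; roughly (eliding the WindowFn.AssignContext plumbing, whose exact
shape has shifted between SDK versions):

    Collection<? extends BoundedWindow> windows =
        windowFn.assignWindows(assignContext);  // the context exposes the element
    output(WindowedValue.of(
        element.getValue(), element.getTimestamp(), windows, element.getPane()));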

On Sun, Dec 11, 2016 at 10:33 AM, Amit Sela  wrote:

> So basically using new DoFn with *SplittableParDo*, *Window.Bound*, and
> *GroupAlsoByWindow* requires a custom implementation by per runner as they
> are not handled by DoFn anymore, right ?
>
> On Sun, Dec 11, 2016 at 3:42 PM Eugene Kirpichov
>  wrote:
>
> > Hi Amit, I'll comment in more detail later, but meanwhile please take a
> > look at https://github.com/apache/incubator-beam/pull/1565
> > There is a small amount of relevant changes to spark runner.
> > Take a look at implementation of SplittableParDo (already committed) in
> > particular ProcessFn and it's usage in direct runner - this is exactly
> what
> > you're looking for, a new DoFn that with per-runner support is able to
> emit
> > multi-windowed values.
> > On Sun, Dec 11, 2016 at 4:28 AM Amit Sela  wrote:
> >
> > > Hi all,
> > >
> > > I've been working on migrating the Spark runner to new DoFn and I've
> > > stumbled upon a couple of cases where OldDoFn is used in a way that
> > > accessed windowInternals (outputWindowedValue) such as
> AssignWindowsDoFn.
> > >
> > > Since changing windows is no longer the responsibility of DoFn I was
> > > wondering who and how is this done.
> > >
> > > Thanks,
> > > Amit
> > >
> >
>


Re: [DISCUSS] [BEAM-438] Rename one of PTransform.apply or PInput.apply

2016-12-08 Thread Kenneth Knowles
Seems like the topic of "block 0.4.0-incubating for all major
backwards-incompatible changes" deserves its own thread.

On Thu, Dec 8, 2016 at 9:33 PM, Dan Halperin <dhalp...@google.com.invalid>
wrote:

> I did not expect this to be merged until after we'd confirmed there were no
> more major changes to be made, or that they were all "ready to go".
>
> Are there any? If so, we should block the next release.
>
> On Fri, Dec 9, 2016 at 1:58 AM, Kenneth Knowles <k...@google.com.invalid>
> wrote:
>
> > Thanks all! This has been done.
> >
> > On Thu, Dec 8, 2016 at 3:37 AM, Amit Sela <amitsel...@gmail.com> wrote:
> >
> > > +1
> > >
> > > On Thu, Dec 8, 2016 at 1:27 PM Manu Zhang <owenzhang1...@gmail.com>
> > wrote:
> > >
> > > > +1
> > > >
> > > > Manu
> > > >
> > > > On Thu, Dec 8, 2016 at 2:40 PM Tyler Akidau
> <taki...@google.com.invalid
> > >
> > > > wrote:
> > > >
> > > > > +1
> > > > >
> > > > > On Thu, Dec 8, 2016 at 1:10 PM Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > > > > wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > Regards
> > > > > > JB
> > > > > >
> > > > > > On 12/07/2016 10:37 PM, Kenneth Knowles wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I want to bring up another major backwards-incompatible change
> > > before
> > > > > it
> > > > > > is
> > > > > > > too late, to resolve [BEAM-438].
> > > > > > >
> > > > > > > Summary: Leave PInput.apply the same but rename
> PTransform.apply
> > to
> > > > > > > PTransform.expand. I have opened [PR #1538] just for reference
> > (it
> > > > took
> > > > > > 30
> > > > > > > seconds using IDE automated refactor)
> > > > > > >
> > > > > > > This change affects *PTransform authors* but does *not* affect
> > > > pipeline
> > > > > > > authors.
> > > > > > >
> > > > > > > This issue was filed a long time ago. It has been a problem
> many
> > > > times
> > > > > > with
> > > > > > > actual users since before Beam started incubating. This is what
> > > goes
> > > > > > wrong
> > > > > > > (often):
> > > > > > >
> > > > > > >PCollection input = ...
> > > > > > >PTransform<PCollection, ...> transform = ...
> > > > > > >
> > > > > > >transform.apply(input)
> > > > > > >
> > > > > > > This type checks and even looks perfectly normal. Do you see
> the
> > > > error?
> > > > > > >
> > > > > > > ... what we need the user to write is:
> > > > > > >
> > > > > > > input.apply(transform)
> > > > > > >
> > > > > > > What a confusing difference! After all, the first one
> type-checks
> > > and
> > > > > the
> > > > > > > first one is how you apply a Function or Predicate or
> > > > > > SerializableFunction,
> > > > > > > etc. But it is broken. With transform.apply(input) the
> transform
> > is
> > > > not
> > > > > > > registered with the pipeline at all.
> > > > > > >
> > > > > > > We obviously can't (and don't want to) change the most core way
> > > that
> > > > > > > pipeline authors use Beam, so PInput.apply (aka
> > PCollection.apply)
> > > > must
> > > > > > > remain the same. But we do need a way to make it impossible to
> > mix
> > > > > these
> > > > > > up.
> > > > > > >
> > > > > > > The simplest way I can think of is to choose a new name for the
> > > other
> > > > > > > method involved. Users probably won't write
> > transform.expand(input)
> > > > > since
> > > > > > > they will never have seen it in any examples, etc. This will
> just
> > > > make
> > > > > > > PTransform authors need to do a global rename, and the type
> > system
> > > > will
> > > > > > > direct them to all cases so there is no silent failure
> possible.
> > > > > > >
> > > > > > > What do you think?
> > > > > > >
> > > > > > > Kenn
> > > > > > >
> > > > > > > [BEAM-438] https://issues.apache.org/jira/browse/BEAM-438
> > > > > > > [PR #1538] https://github.com/apache/incubator-beam/pull/1538
> > > > > > >
> > > > > > > p.s. there is a really amusing and confusing call chain:
> > > > > > PCollection.apply
> > > > > > > -> Pipeline.applyTransform -> Pipeline.applyInternal ->
> > > > > > > PipelineRunner.apply -> PTransform.apply
> > > > > > >
> > > > > > > After this change and work to get the runner out of the loop,
> it
> > > > > becomes
> > > > > > > PCollection.apply -> Pipeline.applyTransform ->
> PTransform.expand
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Jean-Baptiste Onofré
> > > > > > jbono...@apache.org
> > > > > > http://blog.nanthrax.net
> > > > > > Talend - http://www.talend.com
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] [BEAM-438] Rename one of PTransform.apply or PInput.apply

2016-12-08 Thread Kenneth Knowles
Thanks all! This has been done.

On Thu, Dec 8, 2016 at 3:37 AM, Amit Sela <amitsel...@gmail.com> wrote:

> +1
>
> On Thu, Dec 8, 2016 at 1:27 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>
> > +1
> >
> > Manu
> >
> > On Thu, Dec 8, 2016 at 2:40 PM Tyler Akidau <taki...@google.com.invalid>
> > wrote:
> >
> > > +1
> > >
> > > On Thu, Dec 8, 2016 at 1:10 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> > > wrote:
> > >
> > > > +1
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 12/07/2016 10:37 PM, Kenneth Knowles wrote:
> > > > > Hi all,
> > > > >
> > > > > I want to bring up another major backwards-incompatible change
> before
> > > it
> > > > is
> > > > > too late, to resolve [BEAM-438].
> > > > >
> > > > > Summary: Leave PInput.apply the same but rename PTransform.apply to
> > > > > PTransform.expand. I have opened [PR #1538] just for reference (it
> > took
> > > > 30
> > > > > seconds using IDE automated refactor)
> > > > >
> > > > > This change affects *PTransform authors* but does *not* affect
> > pipeline
> > > > > authors.
> > > > >
> > > > > This issue was filed a long time ago. It has been a problem many
> > times
> > > > with
> > > > > actual users since before Beam started incubating. This is what
> goes
> > > > wrong
> > > > > (often):
> > > > >
> > > > >PCollection input = ...
> > > > >PTransform<PCollection, ...> transform = ...
> > > > >
> > > > >transform.apply(input)
> > > > >
> > > > > This type checks and even looks perfectly normal. Do you see the
> > error?
> > > > >
> > > > > ... what we need the user to write is:
> > > > >
> > > > > input.apply(transform)
> > > > >
> > > > > What a confusing difference! After all, the first one type-checks
> and
> > > the
> > > > > first one is how you apply a Function or Predicate or
> > > > SerializableFunction,
> > > > > etc. But it is broken. With transform.apply(input) the transform is
> > not
> > > > > registered with the pipeline at all.
> > > > >
> > > > > We obviously can't (and don't want to) change the most core way
> that
> > > > > pipeline authors use Beam, so PInput.apply (aka PCollection.apply)
> > must
> > > > > remain the same. But we do need a way to make it impossible to mix
> > > these
> > > > up.
> > > > >
> > > > > The simplest way I can think of is to choose a new name for the
> other
> > > > > method involved. Users probably won't write transform.expand(input)
> > > since
> > > > > they will never have seen it in any examples, etc. This will just
> > make
> > > > > PTransform authors need to do a global rename, and the type system
> > will
> > > > > direct them to all cases so there is no silent failure possible.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Kenn
> > > > >
> > > > > [BEAM-438] https://issues.apache.org/jira/browse/BEAM-438
> > > > > [PR #1538] https://github.com/apache/incubator-beam/pull/1538
> > > > >
> > > > > p.s. there is a really amusing and confusing call chain:
> > > > PCollection.apply
> > > > > -> Pipeline.applyTransform -> Pipeline.applyInternal ->
> > > > > PipelineRunner.apply -> PTransform.apply
> > > > >
> > > > > After this change and work to get the runner out of the loop, it
> > > becomes
> > > > > PCollection.apply -> Pipeline.applyTransform -> PTransform.expand
> > > > >
> > > >
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> >
>


[DISCUSS] [BEAM-438] Rename one of PTransform.apply or PInput.apply

2016-12-07 Thread Kenneth Knowles
Hi all,

I want to bring up another major backwards-incompatible change before it is
too late, to resolve [BEAM-438].

Summary: Leave PInput.apply the same but rename PTransform.apply to
PTransform.expand. I have opened [PR #1538] just for reference (it took 30
seconds using IDE automated refactor)

This change affects *PTransform authors* but does *not* affect pipeline
authors.

This issue was filed a long time ago. It has been a problem many times with
actual users since before Beam started incubating. This is what goes wrong
(often):

   PCollection input = ...
    PTransform<PCollection, ...> transform = ...

   transform.apply(input)

This type-checks and even looks perfectly normal. Do you see the error?

... what we need the user to write is:

input.apply(transform)

What a confusing difference! After all, the first one type-checks and the
first one is how you apply a Function or Predicate or SerializableFunction,
etc. But it is broken. With transform.apply(input) the transform is not
registered with the pipeline at all.

We obviously can't (and don't want to) change the most core way that
pipeline authors use Beam, so PInput.apply (aka PCollection.apply) must
remain the same. But we do need a way to make it impossible to mix these up.

The simplest way I can think of is to choose a new name for the other
method involved. Users probably won't write transform.expand(input) since
they will never have seen it in any examples, etc. This will just make
PTransform authors need to do a global rename, and the type system will
direct them to all cases so there is no silent failure possible.
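After the rename, a transform author's code would look like this minimal
sketch:

    public class CountElements
        extends PTransform<PCollection<String>, PCollection<Long>> {
      @Override
      public PCollection<Long> expand(PCollection<String> input) {
        // The composite body is unchanged; only the overridden method renames.
        return input.apply(Count.<String>globally());
      }
    }

while pipeline authors continue to write input.apply(new CountElements()).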

What do you think?

Kenn

[BEAM-438] https://issues.apache.org/jira/browse/BEAM-438
[PR #1538] https://github.com/apache/incubator-beam/pull/1538

p.s. there is a really amusing and confusing call chain: PCollection.apply
-> Pipeline.applyTransform -> Pipeline.applyInternal ->
PipelineRunner.apply -> PTransform.apply

After this change and work to get the runner out of the loop, it becomes
PCollection.apply -> Pipeline.applyTransform -> PTransform.expand


Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-12-06 Thread Kenneth Knowles
Thanks for the thorough answers. It all sounds good to me.

On Tue, Dec 6, 2016 at 12:57 PM, Pei He <pe...@google.com.invalid> wrote:

> Thanks Kenn for the feedback and questions.
>
> I responded inline.
>
> On Mon, Dec 5, 2016 at 7:49 PM, Kenneth Knowles <k...@google.com.invalid>
> wrote:
>
> > I really like this document. It is easy to read and informative. Three
> > things not addressed by the document:
> >
> > 1. Major Beam use cases. I'm sure we have a few in the SDK that could be
> > outlined in terms of the new API with pseudocode.
>
>
> (I am writing pseudocode directly with FileSystem interface to demonstrate.
> However, clients will use the utility FileSystems. This is for us to have a
> layer between the file systems providers' interface and the client
> interface. We can add utility functions to FileSystems for common use
> patterns as needed.)
>
> Major Beam use cases are the following:
> A. FileBasedSource:
> // a. Get input URIs and file sizes from user-provided specs.
> // Note: I updated the match() to be a bulk operation after I sent my last
> email.
> List<MatchResult> results = match(specList);
> List<Metadata> inputMetadataList = FluentIterable.from(results)
> .transformAndConcat(
> new Function<MatchResult, Iterable<Metadata>>() {
>   @Override
>   public Iterable<Metadata> apply(MatchResult result) {
> return Arrays.asList(result.metadata());
>   });
>
> // b. Read from a start offset to support the source splitting.
> SeekableByteChannel seekChannel = open(fileUri);
> seekChannel.position(source.getStartOffset());
> seekChannel.read(...);
>
> B. FileBasedSink:
> // bulk rename temporary files to output files
> rename(tempUris, outputUris);
>
> C. General file operations:
> a. resolve paths
> b. create file to write, open file to read (for example in tests).
> c. bulk delete files/directories
>
>
>
> 2. Related work. How does this differ from other filesystem APIs and why?
>
> We need three sets of functionalities:
> 1. resolve paths.
> 2. read and write channels.
> 3. bulk files management operations(bulk delete/rename/match).
>
> And, they are available from Java nio, hadoop FileSystem APIs, and other
> standard library such as java.net.URI.
>
> Current IOChannelFactory interface uses Java nio for (1) and (2), and
> defines its own interface for (3).
>
> In my redesign, I made the following choices:
> For (1), I replaced Java nio with URI, because it is standardized and
> precise and doesn't require additional implementation of a Path interface
> from file system providers.
>
> For (2), I kept the uses of Java nio (Writable/SeekableByteChannel), since
> I don't see any things that need to improve and I don't see any better
> alternatives (hadoop's FSDataInput/OutputStream provide same
> functionalities, but requires additional dependencies).
>
> For (3), reasons that I didn't choose Java nio or hadoop are:
> 1. Beam needs a bulk-operations API for better performance; however, Java nio
> and hadoop FileSystems are single-file-based APIs.
> 2. Have APIs that are file-system agnostic. For example, we can use URI
> instead of Path.
> 3. Have APIs that are minimal, and easy to implement by file system
> providers.
> 4. Introducing less dependencies.
> 5. It is easy to build an adaptor based on Java nio or hadoop interfaces.
>
> 3. Discussion of non-Java languages. It would be good to know what classes
> > in e.g. Python we might use in place of URI, SeekableByteChannel, etc.
>
> I don't want to mislead people here without a thorough investigation. You
> can see from your second question that it would require iterations on design
> and prototyping.
>
> I didn't introduce any Java-specific requirements in the redesign.
> Resolving paths, seeking with channels or streams, and file management
> operations are language independent. And I am pretty sure there are Python
> libraries for that.
>
> However, I am happy to hear thoughts and get help from people working on
> the python sdk.
>
>
> > On Mon, Dec 5, 2016 at 4:41 PM, Pei He <pe...@google.com.invalid> wrote:
> >
> > > I have received a lot of comments in "Part 1: IOChannelFactory
> > > Redesign" [1]. And, I have updated the design based on the feedback.
> > >
> > > Now, I feel it is close to be ready for implementation, and I would
> like
> > to
> > > summarize the changes:
> > > 1. Replaced FilePath with URI for resolving file paths.
> > > 2. Required match(String spec) to handle ambiguities in users provided
> > > strings (see the match() java doc in the design doc for details).
> > >

Re: Jenkins build is unstable: beam_PostCommit_Java_RunnableOnService_Dataflow #1730

2016-12-05 Thread Kenneth Knowles
The error message looks like a transient error, though it is easy to
believe this change could cause a problem. I will keep a sharp eye on it.

On Mon, Dec 5, 2016 at 4:21 PM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See  RunnableOnService_Dataflow/1730/changes>
>
>


Re: PAssertTest#runExpectingAssertionFailure() and waitUntilFinish()

2016-12-05 Thread Kenneth Knowles
Hi Stas,

This is something special to TestPipeline and the test configuration for a
runner.

If runExpectingAssertionFailure() does not succeed, then our whole suite of
RunnableOnService tests is not going to work, because they all have an
assumption that TestPipeline#run() waits until the assertions are known to
all succeed or one of them fails. This can be different from waiting until
the pipeline itself terminates.

Today, unfortunately, every runner is responsible for making that work in
their own way, thus we have TestDataflowRunner, TestFlinkRunner,
TestApexRunner, and TestSparkRunner.

Eventually the logic should be centralized in TestPipeline, and
TestPipeline should become a JUnit @Rule. These issues are roughly covered by
[BEAM-85] and [BEAM-298]
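
For a runner whose run() does not block, the explicit form of that test
block would be (a sketch using PipelineResult#waitUntilFinish):

    try {
      pipeline.run().waitUntilFinish();
    } catch (AssertionError exc) {
      return exc;
    }

though, as noted above, today each Test*Runner bakes the equivalent waiting
into TestPipeline#run() itself.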

Kenn

[BEAM-85] https://issues.apache.org/jira/browse/BEAM-85
[BEAM-298] https://issues.apache.org/jira/browse/BEAM-298

On Mon, Dec 5, 2016 at 3:30 AM, Stas Levin  wrote:

> Hi,
>
> PAssertTest#runExpectingAssertionFailure() contains the following block:
>
> try {
>   pipeline.run();
> } catch (AssertionError exc) {
>   return exc;
> }
>
> I was wondering if in some cases this might not produce the desired effect,
> particularly if the run() method returns before the pipeline has ended.
> While some runners may choose to inherently wait for the pipeline to end in
> their run() method (e.g., DirectRunner if options.isBlockOnRun() is set to
> true), this seems to be an optional behavior that may be absent in other
> runners.
>
> Is this indeed an issue (in which case the above will benefit from calling
> waitUntilFinish)?
>
> Regards,
> Stas
>


Jenkins precommit worker affinity

2016-11-30 Thread Kenneth Knowles
It appears that the new job beam_PreCommit_Java_MavenInstall has an
affinity for Jenkins worker beam3 while workers beam1 and beam2 sit idle.
Is this intentional? There seems to be a backlog of half a dozen builds.


Re: Questions about coders

2016-11-30 Thread Kenneth Knowles
On Wed, Nov 30, 2016 at 3:52 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Hello,
>
> Do we have anywhere a set of recommendations for developing new coders? I'm
> confused by a couple of things:
>
> - Why are coders serialized by JSON serialization instead of by regular
> Java serialization, unlike other elements of the pipeline such as DoFn's
> and transforms?
>

Big picture: a Coder is actually an agreed-upon binary format that should
be cross-language when possible. It (the coder itself) needs to be able to
be deserialized/implemented by any SDK. Alluded to a bit in
https://s.apache.org/beam-runner-api but not developed at length. There are
also some fragmentary bits moving in this direction (the encodingId, which
should be a URN in Beam).

It is still very convenient to use Java serialization which is why we
provided CustomCoder, but this presents a portability barrier. We have
options here: We could remove it, or we could have language-specific coders
that cannot be used for PCollections consumed by diverse languages. Maybe
there are other good options.



> - Which should one inherit from: CustomCoder, AtomicCoder, StandardCoder? I
> looked at their direct subclasses and didn't see a clear distinction. Seems
> like, when the encoded type is not parameterized then it's CustomCoder, and
> when it's parameterized then it's StandardCoder? [but then again
> CustomCoder isolates you from JSON weirdness, and this seems useful for
> non-atomic coders too]
>

Within the SDK, one should not inherit from CustomCoder. It is bulky and
non-portable. If a coder has component coders then it should inherit from
StandardCoder. If not, then AtomicCoder adds convenience.
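To make that concrete, an atomic coder for a hypothetical two-int Point type
might look like the following (signatures follow this era's Coder API, which
threads a Context through encode/decode; AtomicCoder already asserts
determinism via its DeterministicStandardCoder parent):

    public class PointCoder extends AtomicCoder<Point> {
      private static final VarIntCoder INT_CODER = VarIntCoder.of();

      @Override
      public void encode(Point value, OutputStream outStream, Context context)
          throws IOException {
        // Fixed components are written one after another; all but the
        // outermost value use the nested context.
        INT_CODER.encode(value.x, outStream, context.nested());
        INT_CODER.encode(value.y, outStream, context);
      }

      @Override
      public Point decode(InputStream inStream, Context context)
          throws IOException {
        int x = INT_CODER.decode(inStream, context.nested());
        int y = INT_CODER.decode(inStream, context);
        return new Point(x, y);
      }
    }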

These APIs are not perfect. For coder inference, the component coders are
really expected to correspond to type variables. For example, List<T> has a
component coder that is a Coder<T>. But for the purposes of construction
and update-compatibility, component coders should include *all* coders that
can be passed in that would modify the overall binary format, whether or
not they correspond to a type variable; for example,
FullWindowedValueCoder<T> accepts not just a Coder<T> but also a
Coder<? extends BoundedWindow> for the windows.

There is work to be done here to support both uses of "component" coders,
probably by separating them entirely.


- Which methods are necessary to implement? E.g. should I implement
> verifyDeterministic? Should I implement the "byte size observer" methods?
>

You should implement verifyDeterministic if your coder is deterministic so
it can be used for the keys in GroupByKey.


I'm actually even more confused by the hierarchy between Coder =>
> StandardCoder => DeterministicStandardCoder => AtomicCoder => CustomCoder.
> DeterministicStandardCoder implements verifyDeterministic(), but it has
> subclasses that override this method...
>

This is broken. It is that way for backwards-compatibility reasons, I believe.
Certainly DeterministicStandardCoder is a bit questionable, as it ignores
components, and overriding verifyDeterministic so that subclasses cannot
safely be used as a DeterministicStandardCoder is wrong.

Kenn


Re: Jenkins build is still unstable: beam_PostCommit_MavenVerify #1948

2016-11-30 Thread Kenneth Knowles
This is a Dataflow-specific linking error. I am investigating and
proceeding with a temporary rollback.

On Wed, Nov 30, 2016 at 3:06 PM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See 
>
>


Re: Jenkins build became unstable: beam_Release_NightlySnapshot #249

2016-11-30 Thread Kenneth Knowles
This looks like it might have been the sort of thing that #1189 (just
merged) will fix.

On Tue, Nov 29, 2016 at 11:29 PM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See <https://builds.apache.org/job/beam_Release_NightlySnapshot/249/changes>
>
>


Re: [DISCUSS] Graduation to a top-level project

2016-11-22 Thread Kenneth Knowles
+1 !!!

I especially love how the diversity of the community has contributed to the
conceptual growth and quality of Beam. I can't wait for more!

On Tue, Nov 22, 2016 at 11:22 AM, Thomas Groh 
wrote:

> +1
>
> It's been a thrilling experience thus far, and I'm excited for the future.
>
> On Tue, Nov 22, 2016 at 11:07 AM, Aljoscha Krettek 
> wrote:
>
> > +1
> >
> > I'm quite enthusiastic about the growth of the community and the open
> > discussions!
> >
> > On Tue, 22 Nov 2016 at 19:51 Jason Kuster  invalid>
> > wrote:
> >
> > > An enthusiastic +1!
> > >
> > > In particular it's been really great to see the commitment and interest
> > of
> > > the community in different kinds of testing. Between what we currently
> > have
> > > on Jenkins and Travis and the in-progress work on IO integration tests
> > and
> > > performance tests (plus, I'm sure, other things I'm not aware of) we're
> > in
> > > a really good place.
> > >
> > > On Tue, Nov 22, 2016 at 10:49 AM, Amit Sela 
> > wrote:
> > >
> > > > +1, super exciting!
> > > >
> > > > Thanks to JB, Davor and the whole team for creating this community. I
> > > think
> > > > we've achieved a lot in a short time.
> > > >
> > > > Amit.
> > > >
> > > > On Tue, Nov 22, 2016, 20:36 Tyler Akidau  >
> > > > wrote:
> > > >
> > > > > +1, thanks to everyone who's invested time getting us to this
> point.
> > > :-)
> > > > >
> > > > > -Tyler
> > > > >
> > > > > On Tue, Nov 22, 2016 at 10:33 AM Jean-Baptiste Onofré <
> > j...@nanthrax.net
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > First of all, I would like to thank the whole team, and
> especially
> > > > Davor
> > > > > > for the great work and commitment to Apache and the community.
> > > > > >
> > > > > > Of course, a big +1 to move forward on graduation !
> > > > > >
> > > > > > Regards
> > > > > > JB
> > > > > >
> > > > > > On 11/22/2016 07:19 PM, Davor Bonaci wrote:
> > > > > > > Hi everyone,
> > > > > > > With all the progress we’ve had recently in Apache Beam, I
> think
> > it
> > > > is
> > > > > > time
> > > > > > > we start the discussion about graduation as a new top-level
> > project
> > > > at
> > > > > > the
> > > > > > > Apache Software Foundation.
> > > > > > >
> > > > > > > Graduation means we are a self-sustaining and self-governing
> > > > community,
> > > > > > and
> > > > > > > ready to be a full participant in the Apache Software
> Foundation.
> > > It
> > > > > does
> > > > > > > not imply that our community growth is complete or that a
> > > particular
> > > > > > level
> > > > > > > of technical maturity has been reached, rather that we are on a
> > > solid
> > > > > > > trajectory in those areas. After graduation, we will still
> > > > periodically
> > > > > > > report to, and be overseen by, the ASF Board to ensure
> continued
> > > > growth
> > > > > > of
> > > > > > > a healthy community.
> > > > > > >
> > > > > > > Graduation is an important milestone for the project. It is
> also
> > > key
> > > > to
> > > > > > > further grow the user community: many users (incorrectly) see
> > > > > incubation
> > > > > > as
> > > > > > > a sign of instability and are much less likely to consider us
> > for a
> > > > > > > production use.
> > > > > > >
> > > > > > > A way to think about graduation readiness is through the Apache
> > > > > Maturity
> > > > > > > Model [1]. I think we clearly satisfy all the requirements [2].
> > It
> > > is
> > > > > > > probably worth emphasizing the recent community growth: over
> each
> > > of
> > > > > the
> > > > > > > past three months, no single organization contributing to Beam
> > has
> > > > had
> > > > > > more
> > > > > > > than ~50% of the unique contributors per month [2, see
> > > assumptions].
> > > > > > That’s
> > > > > > > a great statistic that shows how much we’ve grown our
> diversity!
> > > > > > >
> > > > > > > Process-wise, graduation consists of drafting a board
> resolution,
> > > > which
> > > > > > > needs to identify the full Project Management Committee, and
> > > getting
> > > > it
> > > > > > > approved by the community, the Incubator, and the Board. Within
> > the
> > > > > Beam
> > > > > > > community, most of these discussions and votes have to be on
> the
> > > > > private@
> > > > > > > mailing list, but, as usual, we’ll try to keep dev@ updated as
> > > much
> > > > as
> > > > > > > possible.
> > > > > > >
> > > > > > > With that in mind, let’s use this discussion on dev@ for two
> > > things:
> > > > > > > * Collect additional data points on our progress that we may
> want
> > > to
> > > > > > > present to the Incubator as a part of the proposal to accept
> our
> > > > > > graduation.
> > > > > > > * Determine whether the community supports graduation. Please
> > reply
> > > > > +1/-1
> > > > > > > with any additional comments, as appropriate. I’d encourage
> > > everyone
> > > > 

Re: Batcher DoFn

2016-11-14 Thread Kenneth Knowles
Hi Josh,

I think you probably mean something like buffering elements in a field on
the DoFn, emitting batches as appropriate, and emitting the remainder in
finishBundle.

Unfortunately there are two issues:

 - in the presence of windowing the DoFn might be invoked in different
windows, so you'll garble the contents between windows
 - when data is streamed in small bundles, way smaller than batch size, the
results might be unintuitive

The solution to both is the State API which I am hard at work on. Then you
buffer in state, which is per-window and cross-bundle, and output as
appropriate, emitting the remainder from a callback invoked once the window
has expired (exceeded the allowed lateness).
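
For reference, the field-buffering version looks roughly like this with the
2016-era DoFn annotations (a sketch only, and subject to exactly the two
caveats above):

class BatchingDoFn<T> extends DoFn<T, List<T>> {
  private final int maxSize;
  private transient List<T> buffer;

  BatchingDoFn(int maxSize) {
    this.maxSize = maxSize;
  }

  @StartBundle
  public void startBundle(Context c) {
    buffer = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    buffer.add(c.element());
    if (buffer.size() >= maxSize) {
      c.output(new ArrayList<>(buffer));
      buffer.clear();
    }
  }

  @FinishBundle
  public void finishBundle(Context c) {
    // Emitting the remainder here is exactly what garbles windows and
    // makes small streaming bundles produce tiny batches.
    if (!buffer.isEmpty()) {
      c.output(new ArrayList<>(buffer));
      buffer.clear();
    }
  }
}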

Kenn

On Mon, Nov 14, 2016 at 1:57 PM, Josh Cogan 
wrote:

> Hi Dev,
>
> After offline discussions with Gus, I'd like propose we include a Batcher
> function into contrib/.  This would be a DoFn that behaves like this:
>
> [1,2,3,4,5] -> Batcher(max_size=2) -> [[1,2],[3,4],[5]]
>
> Its simple code, but it also shows off that values can still be yielded
> from finish_bundle(), and lots of people found it useful for the internal
> Google version too.
>
> LMK what you think.  Thanks!
>
> Josh
>
> --
> joshgc
> :wq
>


Re: Introduction + contributing to docs

2016-11-11 Thread Kenneth Knowles
Welcome! It is great to witness the website really coming together.

On Fri, Nov 11, 2016 at 12:35 PM, Amit Sela  wrote:

> Welcome Melissa!
>
> On Fri, Nov 11, 2016, 22:31 Jean-Baptiste Onofré  wrote:
>
> > Hi Melissa,
> >
> > welcome aboard !!
> >
> > Regards
> > JB
> >
> > On 11/11/2016 08:11 PM, Melissa Pashniak wrote:
> > > Hello!
> > >
> > >
> > > My name is Melissa. I’ve previously been involved with Dataflow
> > > documentation, and I’m excited to start contributing to the Beam
> project
> > > and documentation.
> > >
> > >
> > > I’ve written up some text for Beam’s direct runner and Cloud Dataflow
> > > runner pages, currently available in pull requests [1][2]. I am also
> > > working on the unfinished parts of the programming guide [3]. Let me
> know
> > > if you have any thoughts or feedback.
> > >
> > > I look forward to working with everyone in the community!
> > >
> > > Melissa
> > >
> > >
> > > [1] https://github.com/apache/incubator-beam-site/pull/76
> > > [2] https://github.com/apache/incubator-beam-site/pull/77
> > > [3] https://issues.apache.org/jira/browse/BEAM-193
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: [PROPOSAL] Merge apex-runner to master branch

2016-11-11 Thread Kenneth Knowles
OK, I believe enough time has passed, and enough +1s, with caveats
addressed agreeably, that we have reach consensus on this. LGTM! I'll limit
technical details to the PR.

On Fri, Nov 11, 2016 at 11:09 AM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> Thanks, David! +1 to getting this into master from me.
>
> On Thu, Nov 10, 2016 at 12:03 AM, David Yan  wrote:
> >> >- Have at least 2 contributors interested in maintaining it, and 1
> >> >committer interested in supporting it:  *I'm going to sign up for
> the
> >> >support and there are more folks interested. Some have already
> contrib=
> > uted
> >> >and helped with PR reviews, others from the Apex community have
> expres=
> > sed
> >> >interest [3].*
> >>
> >> As anyone in the open source ecosystem knows, maintaining is a much
> >> higher bar than contributing, but very important. I'd like to see
> >> specific names here.
> >
> >
> > I would like to sign up as a maintainer for the Apex runner if possible.
> > Thanks!
> >
> > David
>


Re: [jira] [Created] (BEAM-961) CountingInput could have starting number

2016-11-10 Thread Kenneth Knowles
I'm not particular about whether the source itself does it versus replacing
uses of the source with a new PTransform encapsulating them. As long as
there is some object with `startingAt` and `upTo`, either way should be
easy. I left it with the starter tag, as it would potentially be a fun
initial dive into the codebase.
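
For comparison, the boilerplate today versus the proposed API (the
startingAt method is hypothetical until someone implements it):

// Today (sketch): produce 11..20 by shifting the output of upTo(10).
PCollection<Long> elevenThroughTwenty =
    p.apply(CountingInput.upTo(10))
     .apply(MapElements.via(new SimpleFunction<Long, Long>() {
       @Override
       public Long apply(Long n) {
         return n + 11;
       }
     }));

// Proposed (hypothetical): CountingInput.startingAt(11).upTo(20)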

On Thu, Nov 10, 2016 at 1:23 PM, Dan Halperin <dhalp...@google.com> wrote:

> Why not support this in a follow-on pardo that shifts the range?
>
> On Thu, Nov 10, 2016 at 1:22 PM, Kenneth Knowles (JIRA) <j...@apache.org>
> wrote:
>
>> Kenneth Knowles created BEAM-961:
>> 
>>
>>  Summary: CountingInput could have starting number
>>  Key: BEAM-961
>>  URL: https://issues.apache.org/jira/browse/BEAM-961
>>  Project: Beam
>>   Issue Type: New Feature
>>   Components: sdk-java-core
>> Reporter: Kenneth Knowles
>> Priority: Trivial
>>
>>
>> TL;DR: Add {{startingAt}} to {{CountingInput}}.
>>
>> Right now you can have {{CountingInput.upTo(someNumber)}} but it came up
>> in a test that if you want to have, say, one PCollection that is 1 through
>> 10 and another that is 11 through 20 - so you know they are disjoint - then
>> it requires some boilerplate to add 10 to every element. That boilerplate
>> should be part of the {{CountingInput}}
>>
>>
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v6.3.4#6332)
>>
>
>


Re: [DISCUSS] Change "RunnableOnService" To A More Intuitive Name

2016-11-10 Thread Kenneth Knowles
Nice. I like ValidatesRunner.

On Nov 10, 2016 03:39, "Amit Sela" <amitsel...@gmail.com> wrote:

> How about @ValidatesRunner ?
> Seems to complement @NeedsRunner as well.
>
> On Thu, Nov 10, 2016 at 9:47 AM Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > +1
> >
> > What I would really like to see is automatic derivation of the capability
> > matrix from an extended Runner Test Suite. (As outlined in Thomas' doc).
> >
> > On Wed, 9 Nov 2016 at 21:42 Kenneth Knowles <k...@google.com.invalid>
> > wrote:
> >
> > > Huge +1 to this.
> > >
> > > The two categories I care most about are:
> > >
> > > 1. Tests that need a runner, but are testing the other "thing under
> > test";
> > > today this is NeedsRunner.
> > > 2. Tests that are intended to test a runner; today this is
> > > RunnableOnService.
> > >
> > > Actually the lines are not necessarily clear between them, but I think
> > > we can make good choices, like we already do.
> > >
> > > The idea of two categories with a common superclass actually has a
> > pitfall:
> > > what if a test is put in the superclass category, when it does not
> have a
> > > clear meaning? And also, I don't have any good ideas for names.
> > >
> > > So I think just replacing RunnableOnService with RunnerTest to make
> clear
> > > that it is there just to test the runner is good. We might also want
> > > RunnerIntegrationTest extends NeedsRunner to use in the IO modules.
> > >
> > > See also Thomas's doc on capability matrix testing* which is aimed at
> > case
> > > 2. Those tests should all have a category from the doc, or a new one
> > added.
> > >
> > > *
> > >
> > >
> > https://docs.google.com/document/d/1fICxq32t9yWn9qXhmT07xpclHeHX2VlUyVtpi2WzzGM/edit
> > >
> > > Kenn
> > >
> > > On Wed, Nov 9, 2016 at 12:20 PM, Jean-Baptiste Onofré <j...@nanthrax.net
> >
> > > wrote:
> > >
> > > > Hi Mark,
> > > >
> > > > Generally speaking, I agree.
> > > >
> > > > As RunnableOnService extends NeedsRunner, @TestsWithRunner or
> > > @RunOnRunner
> > > > sound clearer.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > >
> > > > On 11/09/2016 09:00 PM, Mark Liu wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> I'm working on building RunnableOnService in Python SDK. After
> having
> > > >> discussions with folks, "RunnableOnService" looks like not a very
> > > >> intuitive
> > > >> name for those unit tests that require runners and build lightweight
> > > >> pipelines to test specific components. Especially, they don't have
> to
> > > run
> > > >> on a service.
> > > >>
> > > >> So I want to raise this idea to the community and see if anyone have
> > > >> similar thoughts. Maybe we can come up with a name that is tied to
> > > >> the runner.
> > > >> Currently, I have two names in my head:
> > > >>
> > > >> - TestsWithRunners
> > > >> - RunnerExecutable
> > > >>
> > > >> Any thoughts?
> > > >>
> > > >> Thanks,
> > > >> Mark
> > > >>
> > > >>
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> >
>


Re: SBT/ivy dependency issues

2016-11-09 Thread Kenneth Knowles
Hi Abbass,

Seeing the output from `sbt dependency-tree` from the sbt-dependency-graph
plugin [1] might help. (caveat: I did not try this out; I don't know the
state of maintenance)

Kenn

[1] https://github.com/jrudolph/sbt-dependency-graph

On Wed, Nov 9, 2016 at 6:33 AM, Jean-Baptiste Onofré 
wrote:

> Hi Abbass,
>
> As discussed together, it could be related to some changes we did in the
> Maven profiles and build.
>
> Let me investigate.
>
> I keep you posted.
>
> Thanks !
> Regards
> JB
>
>
> On 11/09/2016 03:03 PM, amarouni wrote:
>
>> Hi guys,
>>
>> I'm facing a weird issue with a Scala project (using SBT/ivy) that uses
>> beam-runners-spark:0.3.0-incubating, which depends on
>> beam-sdks-java-core & beam-runners-core-java.
>>
>> Until recently everything worked as expected, i.e. I had to declare a
>> single dependency on beam-runners-spark:0.3.0-incubating which brought
>> with it beam-sdks-java-core & beam-runners-core-java, but a couple
>> of weeks ago I started having issues where the only workaround was to
>> explicitly declare dependencies on beam-runners-spark:0.3.0-incubating
>> in addition to its direct beam dependencies: beam-sdks-java-core &
>> beam-runners-core-java.
>>
>> I verified that beam-runners-spark's pom contains both of the
>> beam-sdks-java-core & beam-runners-core-java dependencies, but I still
>> had to declare them explicitly. I'm not sure if this is an issue with
>> SBT/ivy, because Maven can correctly fetch the required beam dependencies
>> but this issue appears only with beam dependencies.
>>
>> Did anyone with SBT/ivy encounter this issue.
>>
>> Thanks,
>>
>> Abbass,
>>
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: PCollection to PCollection Conversion

2016-11-09 Thread Kenneth Knowles
On this point from Amit and Ismaël, I agree: we could benefit from a place
for miscellaneous non-core helper transformations.

We have sdks/java/extensions but it is organized as separate artifacts. I
think that is fine, considering the nature of Join and SortValues. But for
simpler transforms, Importing one artifact per tiny transform is too much
overhead. It also seems unlikely that we will have enough commonality among
the transforms to call the artifact anything other than [some synonym for]
"miscellaneous".

I wouldn't want to take this too far - even though the SDK has many
transforms* that are not required for the model [1], I like that the SDK
artifact has
everything a user might need in their "getting started" phase of use. This
user-friendliness (the user doesn't care that ParDo is core and Sum is not)
plus the difficulty of judging which transforms go where, are probably why
we have them mostly all in one place.

Models to look at, off the top of my head, include Pig's PiggyBank and
Apex's Malhar. These have different levels of support implied. Others?

Kenn

[1] ApproximateQuantiles, ApproximateUnique, Count, Distinct, Filter,
FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min, Values, KvSwap,
Partition, Regex, Sample, Sum, Top, Values, WithKeys, WithTimestamps

* at least they are separate classes and not methods on PCollection :-)


On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <ieme...@gmail.com> wrote:

> ​Nice discussion, and thanks Jesse for bringing this subject back.
>
> I agree 100% with Amit and the idea of having a home for those transforms
> that are not core enough to be part of the sdk, but that we all end up
> re-writing somehow.
>
> This is a needed improvement to be more developer friendly, but also as a
> reference of good practices of Beam development, and for this reason I
> agree with JB that at this moment it would be better for these transforms
> to reside in the Beam repository at least for visibility reasons.
>
> One additional question is if these transforms represent a different DSL or
> if those could be grouped with the current extensions (e.g. Join and
> SortValues) into something more general that we as a community could
> maintain, but well even if it is not the case, it would be really nice to
> start working on something like this.
>
> Ismaël Mejía​
>
>
> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> > Related to spark-package, we also have Apache Bahir to host
> > connectors/transforms for Spark and Flink.
> >
> > IMHO, right now, Beam should host this, not sure if it makes sense
> > directly in the core.
> >
> > It reminds me the "Integration" DSL we discussed in the technical vision
> > document.
> >
> > Regards
> > JB
> >
> >
> > On 11/09/2016 11:17 AM, Amit Sela wrote:
> >
> >> I think Jesse has a very good point on one hand, while Luke's and
> >> Kenneth's
> >> worries about committing users to specific implementations is in place.
> >>
> >> The Spark community has a 3rd party repository for useful libraries that
> >> for various reasons are not a part of the Apache Spark project:
> >> https://spark-packages.org/.
> >>
> >> Maybe a "common-transformations" package would serve both users quick
> >> ramp-up and ease-of-use while keeping Beam more "enabling" ?
> >>
> >> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles <k...@google.com.invalid>
> >> wrote:
> >>
> >> It seems useful for small scale debugging / demoing to have
> >>> Dump.toString(). I think it should be named to clearly indicate its
> >>> limited
> >>> scope. Maybe other stuff could go in the Dump namespace, but
> >>> "Dump.toJson()" would be for humans to read - so it should be pretty
> >>> printed, not treated as a machine-to-machine wire format.
> >>>
> >>> The broader question of representing data in JSON or XML, etc, is
> already
> >>> the subject of many mature libraries which are already easy to use with
> >>> Beam.
> >>>
> >>> The more esoteric practice of implicit or semi-implicit coercions seems
> >>> like it is also already addressed in many ways elsewhere.
> >>> Transform.via(TypeConverter) is basically the same as
> >>> MapElements.via() and also easy to use with Beam.
> >>>
> >>> In both of the last cases, there are many reasonable approaches, and we
> >>> shouldn't commit our users to one of them.
> >>>
> >>> On Tue, Nov 8, 2016 at 10:15 AM, Luk

Re: [PROPOSAL] Merge apex-runner to master branch

2016-11-09 Thread Kenneth Knowles
Hi Thomas,

Very good point about establishing more clear definitions of the roles
mentioned in the guidelines. Let's discuss in a separate thread.

Kenn

On Tue, Nov 8, 2016 at 1:03 PM, Thomas Weise  wrote:

> Thanks for the support. It may be helpful to describe the roles of
> "maintainer" and "supporter" in this context, perhaps even capture it on:
>
> http://beam.apache.org/contribute/contribution-guide/
>
> Thanks,
> Thomas
>
>
> On Tue, Nov 8, 2016 at 7:51 PM, Robert Bradshaw
>  > wrote:
>
> > Nice. I'm +1 modulo one caveat below (hopefully easily addressed).
> >
> > On Tue, Nov 8, 2016 at 5:54 AM, Thomas Weise  wrote:
> > > Hi,
> > >
> > > As per previous discussion [1], I would like to propose to merge the
> > > apex-runner branch into master. The runner satisfies the criteria
> > outlined
> > > in [2] and merging it to master will give more visibility to other
> > > contributors and users.
> > >
> > > Specifically the Apex runner addresses:
> > >
> > >- Have at least 2 contributors interested in maintaining it, and 1
> > >committer interested in supporting it:  *I'm going to sign up for
> the
> > >support and there are more folks interested. Some have already
> > contributed
> > >and helped with PR reviews, others from the Apex community have
> > expressed
> > >interest [3].*
> >
> > As anyone in the open source ecosystem knows, maintaining is a much
> > higher bar than contributing, but very important. I'd like to see
> > specific names here.
> >
> > >- Provide both end-user and developer-facing documentation:  *Runner
> > has
> > >README, capability matrix, Javadoc. Planning to add it to the
> tutorial
> > >later.*
> > >- Have at least a basic level of unit test coverage:  *Has 30 runner
> > >specific tests and passes all Beam RunnableOnService tests.*
> > >- Run all existing applicable integration tests with other Beam
> > >components and create additional tests as appropriate: * Enabled
> > runner
> > >for examples integration tests in the same way as other runners.*
> > >- Be able to handle a subset of the model that address a significant
> > set of
> > >use cases (aka. ‘traditional batch’ or ‘processing time
> > > streaming’):  *Passes
> > >RunnableOnService without exclusions and example IT.*
> > >- Update the capability matrix with the current status:  *Done.*
> > >- Add a webpage under learn/runners: *Same "TODO" page as other
> > runners
> > >added to site.*
> > >
> > > The PR for the merge: https://github.com/apache/
> incubator-beam/pull/1305
> > >
> > > (There are intermittent test failures in individual Travis runs that
> are
> > > unrelated to the runner.)
> > >
> > > Thanks,
> > > Thomas
> > >
> > > [1]
> > > https://lists.apache.org/thread.html/2b420a35f05e47561f27c19e8ec648
> > 4f595553f32da88fe593ad931d@%3Cdev.beam.apache.org%3E
> > >
> > > [2] http://beam.apache.org/contribute/contribution-guide/
> > #feature-branches
> > >
> > > [3]
> > > https://lists.apache.org/thread.html/6e7618768cdcde81c28aa9883a1fcf
> > 4d3d4e41de4249547
> > >  > 4d3d4e41de4249547130691d52@%3Cdev.apex.apache.org%3E>
> > > 130691d52@%3Cdev.apex.apache.org%3E
> > >  > 4d3d4e41de4249547130691d52@%3Cdev.apex.apache.org%3E>
> >
>


Re: PCollection to PCollection Conversion

2016-11-08 Thread Kenneth Knowles
It seems useful for small scale debugging / demoing to have
Dump.toString(). I think it should be named to clearly indicate its limited
scope. Maybe other stuff could go in the Dump namespace, but
"Dump.toJson()" would be for humans to read - so it should be pretty
printed, not treated as a machine-to-machine wire format.
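
To be concrete, something like this sketch is all I mean; the names are
hypothetical (and note a static toString() would clash with
Object.toString() in Java, so the real thing would need another name):

public class Dump {
  public static <T> MapElements<T, String> asStrings() {
    return MapElements.via(new SimpleFunction<T, String>() {
      @Override
      public String apply(T input) {
        // For human eyes only; not a stable wire format.
        return String.valueOf(input);
      }
    });
  }
}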

The broader question of representing data in JSON or XML, etc, is already
the subject of many mature libraries which are already easy to use with
Beam.

The more esoteric practice of implicit or semi-implicit coercions seems
like it is also already addressed in many ways elsewhere.
Transform.via(TypeConverter) is basically the same as
MapElements.via() and also easy to use with Beam.

In both of the last cases, there are many reasonable approaches, and we
shouldn't commit our users to one of them.

On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik 
wrote:

> The suggestions you give seem good except for the the XML cases.
>
> Might want to have the XML be a document per line similar to the JSON
> examples you have been giving.
>
> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson 
> wrote:
>
> > @lukasz Agreed there would have to be KV handling. I was more thinking that
> > whatever the addition, it shouldn't just handle KV. It should handle
> > Iterables, Lists, Sets, and KVs.
> >
> > For JSON and XML, I wonder if we'd be able to give someone something
> > general purpose enough that you would just end up writing your own code
> to
> > handle it anyway.
> >
> > Here are some ideas on what it could look like with a method and the
> > resulting string output:
> > *Stringify.toJSON()*
> >
> > With KV:
> > {"key": "value"}
> >
> > With Iterables:
> > ["one", "two", "three"]
> >
> > *Stringify.toXML("rootelement")*
> >
> > With KV:
> > 
> >
> > With Iterables:
> > <rootelement>
> >   one
> >   two
> >   three
> > </rootelement>
> >
> > *Stringify.toDelimited(",")*
> >
> > With KV:
> > key,value
> >
> > With Iterables:
> > one,two,three
> >
> > Do you think that would strike a good balance between reusable code and
> > writing your own for more difficult formatting?
> >
> > Thanks,
> >
> > Jesse
> >
> > On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik 
> > wrote:
> >
> > Jesse, I believe if one format gets special treatment in TextIO, people
> > will then ask why doesn't JSON, XML, ... also not supported.
> >
> > Also, the example that you provide is using the fact that the input
> format
> > is an Iterable. You had posted a question about using KV with
> > TextIO.Write which wouldn't align with the proposed input format and
> still
> > would require to write a type conversion function, this time from KV to
> > Iterable instead of KV to string.
> >
> > On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson 
> > wrote:
> >
> > > Lukasz,
> > >
> > > I don't think you'd need complicated logic for TextIO.Write. For CSV
> the
> > > call would look like:
> > > Stringify.to("", ",", "\n");
> > >
> > > Where the arguments would be Stringify.to(prefix, delimiter, suffix).
> > >
> > > The code would be something like:
> > > StringBuffer buffer = new StringBuffer(prefix);
> > >
> > > for (Item item : list) {
> > >   buffer.append(item.toString());
> > >
> > >   if(notLast) {
> > > buffer.append(delimiter);
> > >   }
> > > }
> > >
> > > buffer.append(suffix);
> > >
> > > c.output(buffer.toString());
> > >
> > > That would allow you to do the basic CSV, TSV, and other formats
> without
> > > complicated logic. The same sort of thing could be done for
> TextIO.Write.
> > >
> > > Thanks,
> > >
> > > Jesse
> > >
> > > On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik 
> > > wrote:
> > >
> > > > The conversion from object to string will have uses outside of just
> > > > TextIO.Write so it seems logical that we would want to have a ParDo
> do
> > > the
> > > > conversion.
> > > >
> > > > Text file formats have a lot of variance, even if you consider the
> > subset
> > > > of CSV like formats where it could have fixed width fields, or
> escaping
> > > and
> > > > quoting around other fields, or headers that should be placed at the
> > top.
> > > >
> > > > Having all these format conversions within TextIO.Write seems like a
> > lot
> > > of
> > > > logic to contain in that transform which should just focus on writing
> > to
> > > > files.
> > > >
> > > > On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
> je...@smokinghand.com>
> > > > wrote:
> > > >
> > > > > This is a thread moved over from the user mailing list.
> > > > >
> > > > > I think there needs to be a way to do a PCollection<T> to
> > > > > PCollection<String> conversion.
> > > > >
> > > > > To do a minimal WordCount, you have to manually convert the KV to a
> > > > String:
> > > > > p
> > > > > .apply(TextIO.Read.from("playing_cards.tsv"))
> > > > > .apply(Regex.split("\\W+"))
> > > > > .apply(Count.perElement())
> > > > > *   

Re: Verify a new Runner

2016-11-07 Thread Kenneth Knowles
Hi Zhixin,

I would love to help you out with this.

One of the best ways to test your runner is to enable the
"RunnableOnService" test suite in the core SDK. Here is an example of the
configuration for the Flink runner:
https://github.com/apache/incubator-beam/blob/master/runners/flink/runner/pom.xml#L49

Another good source of tests is the integration tests which makes sure your
runner can run all of our user-facing examples. The configuration for this
lives here:
https://github.com/apache/incubator-beam/blob/master/examples/java/pom.xml#L133

You said you are almost done, but just because you might find some issues
when you start running these tests, here are some pointers to other
references.

The standard abstract description of the model is still the Dataflow Model
paper from a couple years ago:
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf
.

And finally, we don't have extensive documentation, but the best current
reference for what transforms are primitive (and why) for a runner author
is probably the Beam Runner API proposal at
https://s.apache.org/beam-runner-api. The implementation (moving Beam to
the ideal) is still under development, but you may be helped by the
sections "Primitive Transforms" at
https://s.apache.org/beam-runner-api#heading=h.tt55lhd3k6by and "What does
a runner author need to do?" at
https://s.apache.org/beam-runner-api#heading=h.cdbhozvw83un.

I hope this helps. And, of course, if there are more details you can share
then we can talk about specifics.

Kenn

On Mon, Nov 7, 2016 at 7:50 PM 李劲松(之信)  wrote:

> Hi there,
>
> I'm working on the beam integration (= a new runner, mainly for streaming;
> almost done) for an internal system at Alibaba, targeted for production use. 
> I'm wondering if you could give me some advice on how to test/verify such an 
> implementation. Thank you!
>
> Best,
> Zhixin


Re: Contributing to Beam docs

2016-11-03 Thread Kenneth Knowles
This is great. These menus seem really intuitive for finding what you need.
I especially like the clarity in Get Started and Documentation. That's a
pretty big challenge, since we have several runners and SDKs that all need
to be called out prominently in order to let users know what Beam is about.

I had ~3 thoughts.

1. "Get Started > Support " contains mostly things that I associate with
"Community". I would say only the users@ mailing list, StackOverflow, and
to a lesser extent JIRA are user facing, while the rest is contributor
facing.

2. "Contribute > Work In Progress" I wasn't quite sure how to interpret
until I clicked in on it. I favor slightly florid vocabulary like "Ongoing
Endeavors" but maybe there is a middle ground?

3. The organization of the Contribute subsections seems like it could put
together three sections for guides, miscellany, and promotion.

Contribute >
Get Started Contributing (was "Contributor Hub", should have the
"starter" content from Work In Progress page)

Contribution Guide
Testing Guide (was "Testing")
Release Guide
--
Team
Technical Vision (optional, from "Contributor Hub")
Design Principles
Ongoing Endeavors (was "Work In Progress")
Source Repository
Issue Tracking (was "Support" elsewhere)
Mailing Lists, etc (include Slack and dev@/commits@ lists)
--
Presentation Materials (was "Talks" but "Talks" sounds like prerecorded
learning material)
Logos and Design

Kenn

p.s. I'd love to contribute to solving the /index.html thing. Seems like
something we should be able to engineer our way around.

On Thu, Nov 3, 2016 at 7:42 PM Hadar Hod  wrote:

> Hi Beamers!
>
> I'm Hadar. I've worked on Dataflow documentation in the past, and have
> recently started contributing to the Beam docs. I'm excited to work with
> all of you and to be a part of this project. :)
>
> I believe the current structure of the website can be improved, so I'd like
> to propose a slightly different one. On a high level, I propose changing
> the tabs in the top menu and reorganizing some of the docs. Instead of the
> current "Use", "Learn", "Contribute", "Blog", and "Project" tabs, we could
> have "Get Started", "Documentation", "Contribute", and "Blog".
>
> I applied this new structure in a pull request
> , which is staged
> here
> <
> http://apache-beam-website-pull-requests.storage.googleapis.com/62/index.html
> >.
> If you've worked on the website before, you've probably run into this -
> note that you'll have to append "/index.html" to each URL.
>
> Thoughts? Thanks!
>
> Hadar
>


Re: PAssert.GroupedGlobally defaults to a single empty Iterable.

2016-11-02 Thread Kenneth Knowles
Interesting problem.

First, I do think that PAssert is defined entirely in terms of the model so
this is a real problem with the results that are produced.

Then, I think the answer might lie with triggering. When PAssert rewindows
into the global window and triggers "never" this actually means that its
internal GBK emits all of its output exactly once, when the window is
expiring. So producing more than one output - even if spread across
microbatches - is actually the trouble.
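
In other words, the rewindowing PAssert does internally amounts to
something like this sketch (reconstructed from the description above, not
the exact code):

// 'input' is the PCollection under test.
PCollection<String> globally = input.apply(
    Window.<String>into(new GlobalWindows())
        .triggering(Never.ever())
        .withAllowedLateness(Duration.millis(Long.MAX_VALUE))
        .discardingFiredPanes());
// A GroupByKey downstream of this fires exactly once, when the global
// window expires - hence multiple outputs across microbatches are the bug.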

On Wed, Nov 2, 2016 at 10:20 AM Amit Sela <amitsel...@gmail.com> wrote:

> Spark won't evaluate empty PCollections either, and I've handled it by
> asserting the SUCCESS/FAIL aggregators in PAssert in the TestSparkRunner.
>
> Main problem for Spark is when running gracefully it might take some time
> to terminate and execute another batch which is an empty iterable, then
> fail on AssertionError <0> but expected <100>.
> Running ungracefully seemed to come back and bite me in the past, so I'm
> trying to avoid it.
>
> Thoughts ?
>
> On Wed, Nov 2, 2016 at 6:00 PM Dan Halperin <dhalp...@google.com.invalid>
> wrote:
>
> (Also: I meant "tests will [incorrectly] pass silently")
>
> On Wed, Nov 2, 2016 at 8:57 AM, Kenneth Knowles <k...@google.com> wrote:
>
> > FWIW if the runner is set up properly the tests will still fail with a
> > timeout waiting for the assertion aggregators to reach expected values.
> > Unfortunately we haven't yet centralized this functionality into
> > TestPipeline or thereabouts.
> >
> > On Wed, Nov 2, 2016 at 8:56 AM Dan Halperin <dhalp...@google.com> wrote:
> >
> >> +Kenn
> >>
> >> I believe this is done because if there is no output, no assertions will
> >> be run and tests will fail silently. (This was a side effect of
> switching
> >> from side inputs to groupbykey for this flow, which enabled testing of
> >> triggers/panes/etc.)
> >>
> >> On Wed, Nov 2, 2016 at 5:19 AM, Amit Sela <amitsel...@gmail.com> wrote:
> >>
> >> I've proposed https://github.com/apache/incubator-beam/pull/1257 (also
> >> opened a ticket).
> >> Tested locally Direct/Flink/Spark runners, and it looks fine, I've
> issued
> >> a
> >> PR to see if it affects Dataflow runner.
> >>
> >> Amit.
> >>
> >> On Wed, Nov 2, 2016 at 11:56 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> >> wrote:
> >>
> >> > Agree, this element should be removed.
> >> >
> >> > Regards
> >> > JB
> >> >
> >> > On 11/02/2016 10:53 AM, Amit Sela wrote:
> >> > > Hi all,
> >> > >
> >> > > I've been looking at PAssert and I've noticed that
> >> PAssert.GroupedGlobally
> >> > > points
> >> > > <
> >> > https://github.com/apache/incubator-beam/blob/master/sdks/
> >> java/core/src/main/java/org/apache/beam/sdk/testing/PAssert.java#L825
> >> > >
> >> > > that it will result in a single empty iterable even if the input
> >> > PCollection
> >> > > is empty.
> >> > > This is a weird behaviour as it may cause following assertions to
> >> fail.
> >> > >
> >> > > Wouldn't it be more correct to remove (filter out ?) this element ?
> >> > >
> >> > > Thanks,
> >> > > Amit
> >> > >
> >> >
> >> > --
> >> > Jean-Baptiste Onofré
> >> > jbono...@apache.org
> >> > http://blog.nanthrax.net
> >> > Talend - http://www.talend.com
> >> >
> >>
> >>
> >>
>
>


Re: PAssert.GroupedGlobally defaults to a single empty Iterable.

2016-11-02 Thread Kenneth Knowles
The iterable is the entirety of the contents of the PCollection. So empty
iterable -> empty PCollection.

It is actually the main purpose/complexity of this transform to make sure
it is non-empty, because otherwise downstream asserts do not run.
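
Concretely (sketch): even asserting emptiness depends on that single empty
iterable, because the check itself runs downstream of the grouping:

PCollection<String> empty = p.apply(Create.empty(StringUtf8Coder.of()));
// PAssert groups the whole PCollection into one iterable and asserts on
// it; if an empty input produced no iterable, this would silently pass
// without ever executing the check.
PAssert.that(empty).empty();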

On Wed, Nov 2, 2016 at 5:20 AM Amit Sela  wrote:

> I've proposed https://github.com/apache/incubator-beam/pull/1257 (also
> opened a ticket).
> Tested locally Direct/Flink/Spark runners, and it looks fine, I've issued a
> PR to see if it affects Dataflow runner.
>
> Amit.
>
> On Wed, Nov 2, 2016 at 11:56 AM Jean-Baptiste Onofré 
> wrote:
>
> > Agree, this element should be removed.
> >
> > Regards
> > JB
> >
> > On 11/02/2016 10:53 AM, Amit Sela wrote:
> > > Hi all,
> > >
> > > I've been looking at PAssert and I've noticed that
> PAssert.GroupedGlobally
> > > points
> > > <
> >
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/PAssert.java#L825
> > >
> > > that it will result in a single empty iterable even if the input
> > PCollection
> > > is empty.
> > > This is a weird behaviour as it may cause following assertions to fail.
> > >
> > > Wouldn't it be more correct to remove (filter out ?) this element ?
> > >
> > > Thanks,
> > > Amit
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: PAssert.GroupedGlobally defaults to a single empty Iterable.

2016-11-02 Thread Kenneth Knowles
FWIW if the runner is set up properly the tests will still fail with a
timeout waiting for the assertion aggregators to reach expected values.
Unfortunately we haven't yet centralized this functionality into
TestPipeline or thereabouts.

On Wed, Nov 2, 2016 at 8:56 AM Dan Halperin  wrote:

> +Kenn
>
> I believe this is done because if there is no output, no assertions will
> be run and tests will fail silently. (This was a side effect of switching
> from side inputs to groupbykey for this flow, which enabled testing of
> triggers/panes/etc.)
>
> On Wed, Nov 2, 2016 at 5:19 AM, Amit Sela  wrote:
>
> I've proposed https://github.com/apache/incubator-beam/pull/1257 (also
> opened a ticket).
> Tested locally Direct/Flink/Spark runners, and it looks fine, I've issued a
> PR to see if it affects Dataflow runner.
>
> Amit.
>
> On Wed, Nov 2, 2016 at 11:56 AM Jean-Baptiste Onofré 
> wrote:
>
> > Agree, this element should be removed.
> >
> > Regards
> > JB
> >
> > On 11/02/2016 10:53 AM, Amit Sela wrote:
> > > Hi all,
> > >
> > > I've been looking at PAssert and I've noticed that
> PAssert.GroupedGlobally
> > > points
> > > <
> >
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/PAssert.java#L825
> > >
> > > that it will result in a single empty iterable even if the input
> > PCollection
> > > is empty.
> > > This is a weird behaviour as it may cause following assertions to fail.
> > >
> > > Wouldn't it be more correct to remove (filter out ?) this element ?
> > >
> > > Thanks,
> > > Amit
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
>
>


Re: Why does `Combine.perKey(SerializableFunction)` require same input and output type

2016-10-31 Thread Kenneth Knowles
Manu, I think your critique about user interface clarity is valid.
CombineFn conflates a few operations and is not that clear about what it is
doing or why. You seem to be concerned about CombineFn versus
SerializableFunction constructors for the Combine family of transforms. I
thought I'd respond from my own perspective, in case it is helpful. It is
mostly the same things that Luke has said. Let's ignore keys. I don't think
they change things much.

As you seem to already understand, a CombineFn is a convenient collapsed
representation of three functions:

init : InputT -> AccumT
combiner: (AccumT, AccumT) -> AccumT
extract: AccumT -> OutputT

And the real semantics:

MapElements.via(init)
Combine.via(combiner)
MapElements.via(extract)

For starters, "associative" is not even a well-typed word to use unless
input type and output type are the same. So it is `combiner` that needs to
be associative and commutative. Sometimes `combiner` also has an identity
element. I'm afraid `createAccumulator()` and `defaultValue()` confuse
things here (the latter is never meaningfully used). When we say a
CombineFn has to be "associative" and "commutative" we just mean that it
can be factored into these methods.

So the SerializableFunction just needs to be factorable into these methods,
too, like Luke said. Pragmatically, if we only have a
SerializableFunction<Iterable<InputT>, OutputT> then we don't have a way to
do hierarchical combines (can't feed the output of one layer into the next
layer), so associativity is irrelevant and it might as well be a
MapElements. So it only makes sense to allow
SerializableFunction<Iterable<V>, V>. Some variant that is a binary
function ((V, V) -> V) would make sense for lambdas, etc.
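
To make that concrete, here is a sketch of a combiner where hierarchical
combining typechecks precisely because InputT == OutputT:

SerializableFunction<Iterable<Integer>, Integer> sum =
    new SerializableFunction<Iterable<Integer>, Integer>() {
      @Override
      public Integer apply(Iterable<Integer> xs) {
        int total = 0;
        for (int x : xs) {
          total += x;
        }
        return total;
      }
    };
// Associativity is what lets a runner combine partial results:
//   sum([sum([1,2]), sum([3,4])]) == sum([1,2,3,4])
// If InputT and OutputT differed, the outer application would not
// typecheck, so no hierarchy would be possible.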

Here are some reasons for the particular design of CombineFn that actually
should be called out:

 - It is a major efficiency gain to mutate the accumulator.
 - Usually `init` is trivial and best to inline, hence addInput(InputT,
AccumT)
 - With `compact` we allow multiple physical representations of the same
semantic accumulator, and a hook to switch between them
 - And it is hard to take the user through the journey from the real
reasons behind it to the particular Java interface.

Note also that CombineWithContext allows side inputs, which complicates the
formalities somewhat but doesn't change the intuition.

Kenn

On Mon, Oct 31, 2016 at 6:37 PM Manu Zhang  wrote:

> I'm a bit confused here because neither of them requires same type of
> input and output. Also, the Javadoc of Globally says "It is common for
> {@code InputT == OutputT}, but not required". If associative and
> commutative is expected, why don't they have restrictions like
> Combine.perKey(SerializableFunction) ?
>
> I understand the motive and requirement behind Combine functions. I'm more
> asking about the user interface consistency.
> By the way, it's hard to know what Combine.Globally does from the name but
> that discussion should be put in another thread.
>
> Thanks for your patience here.
>
> Manu
>
> On Tue, Nov 1, 2016 at 12:04 AM Lukasz Cwik  wrote:
>
> GlobalCombineFn and PerKeyCombineFn still expect an associative and
> commutative function when accumulating.
> GlobalCombineFn is shorthand for assigning everything to a single key,
> doing the combine, and then discarding the key and extracting the single
> output.
> PerKeyCombineFn is shorthand for doing accumulation where the key doesn't
> modify the accumulation in any way.
>
> On Fri, Oct 28, 2016 at 6:09 PM, Manu Zhang 
> wrote:
>
> Then what about the other interfaces, like Combine.perKey(GlobalCombineFn)
> and Combine.perKey(PerKeyCombineFn)? Why don't all of them have these
> requirements?
>
> On Sat, Oct 29, 2016 at 12:06 AM Lukasz Cwik  wrote:
>
> For it to be considered a combiner, the function needs to be associative
> and commutative.
>
> The issue is that from an API perspective it would be easy to have a
> Combine.perKey(SerializableFunction<Iterable<InputT>, OutputT>). But many
> people in the data processing world expect that this
> parallelization/optimization is performed, and thus exposing such a method
> would be dangerous as it would break users' expectations, so from the
> design perspective it is a hard requirement.
> ordered or gain other properties, these requirements may loosen but it
> seems unlikely in the short term.
>
> At this point, I think you're looking for a MapElements to which you pass
> a SerializableFunction<KV<K, InputT>, KV<K, OutputT>>.
> Creating a wrapper SerializableFunction<KV<K, InputT>, KV<K, OutputT>>
> which can delegate to a SerializableFunction<InputT, OutputT> should be
> trivial.
>
>
> On Thu, Oct 27, 2016 at 6:17 PM, Manu Zhang 
> wrote:
>
> Thanks for the thorough explanation. I see the benefits for such a
> function.
> My follow-up question is whether this is a hard requirement.
> 

Re: migrating gearpump-runner to new DoFn fails with NotSerializableException

2016-10-30 Thread Kenneth Knowles
Hi Manu,

That class is generated by DoFnInvokers, which generates bytecode to
efficiently execute a DoFn. It should not be part of the serialized
payload, but should be instantiated on the service/worker/etc. If you are
trying to serialize a DoFnInvoker, then my recommendation is to serialize
only the DoFn. If this is caused by something else, then can you provide
some more details? Perhaps even open a pull request (where tests will fail,
of course) so I can see and comment on the code itself.
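
A sketch of the recommended flow (the invoker factory method here is from
memory of the current API and may differ in your version; 'doFn' stands in
for the user's DoFn instance):

// Ship only the DoFn across the wire; rebuild the invoker where it runs.
byte[] payload = SerializableUtils.serializeToByteArray(doFn);
DoFn<String, String> revived = (DoFn<String, String>)
    SerializableUtils.deserializeFromByteArray(payload, "DoFn");
DoFnInvoker<String, String> invoker = DoFnInvokers.invokerFor(revived);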

Kenn

On Sun, Oct 30, 2016 at 8:52 PM Manu Zhang  wrote:

> Hi all,
>
> I'm migrating `OldDoFn` to `DoFn` in gearpump-runner. We serialize all the
> functions locally and ship to remote cluster. Hence, I try to make sure the
> functions are serializable. Unfortunately, the integration tests fail with
> `NotSerializableException` as follows. Does anyone know what that
> AddTimestampsDoFn$auxiliary$x5sQZaqi is?
>
> Caused by: java.io.NotSerializableException:
>
> org.apache.beam.sdk.transforms.WithTimestamps$AddTimestampsDoFn$auxiliary$x5sQZaqi
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at
>
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at
>
> org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:49)
> ... 83 more
>
> Thanks,
> Manu
>


Re: [DISCUSS] Merging master -> feature branch

2016-10-27 Thread Kenneth Knowles
In the spirit of explicitly summarizing and concluding threads on list: I
think we have affirmative consensus to go for it when a downstream
integration is completely conflict-free and fixup-free.

On Thu, Oct 27, 2016 at 12:43 PM Robert Bradshaw
 wrote:

> My concern was mostly about what to do in the face of conflicts, but
> it sounds like the consensus is that for a clean merge, with no
> conflicts or test breakage (or other concerns) a committer is free to
> push without any oversight which is fine by me.
>
> [If/when the Mergebot comes into action, and runs more extensive tests
> than standard precommit, it might make sense to still go through that
> rather than debug bad merges discovered in postcommit tests.]
>
> On Wed, Oct 26, 2016 at 9:07 PM, Davor Bonaci 
> wrote:
> > +1
> >
> > I concur it is fine to proceed with a downstream integration (master ->
> > feature branch -> sub-feature branch) without waiting for review for a
> > completely clean merge. Exactly as proposed -- I think there should still
> > be a pull request and comment saying it is a clean merge. (In some ideal
> > world, this would happen nightly by a tool automatically, but I think
> > that's not feasible in the short term.)
> >
> > I think other cases (upstream integration, merge conflict, any manual
> > action, etc.) should still wait for a normal review.
> >
> > On Wed, Oct 26, 2016 at 10:34 AM, Thomas Weise  wrote:
> >
> >> +1
> >>
> >> For a merge from master to the feature branch that does not require
> extra
> >> changes, RTC does not add value. It actually delays and burns reviewer
> time
> >> (even mechanics need some) that "real" PRs could benefit from. If
> >> adjustments are needed, then the regular process kicks in.
> >>
> >> Thanks,
> >> Thomas
> >>
> >>
> >> On Wed, Oct 26, 2016 at 1:33 AM, Amit Sela 
> wrote:
> >>
> >> > I generally agree with Kenneth.
> >> >
> >> > While working on the SparkRunnerV2 branch, it was a pain - I avoided
> >> > frequent merges to avoid trivial PRs, but it cost me with very large
> and
> >> > non-trivial merges later.
> >> > I think that frequent merges for feature-branches should most of the
> time
> >> > be trivial (no conflicts) and a committer should be allowed to
> self-merge
> >> > once tests pass.
> >> > As for conflicts, even for the smallest ones I'd go with review just
> so
> >> > it's very clear when self-merging is OK - we can always revisit this
> >> later
> >> > and further discuss if we think we can improve this process.
> >> >
> >> > I guess +1 from me.
> >> >
> >> > Thanks,
> >> > Amit.
> >> >
> >> > On Wed, Oct 26, 2016 at 8:10 AM Frances Perry  >
> >> > wrote:
> >> >
> >> > > On Tue, Oct 25, 2016 at 9:44 PM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> >> >
> >> > > wrote:
> >> > >
> >> > > > Agree. When possible it would be great to have the branch merged
> on
> >> > > master
> >> > > > quickly, even when it's not fully ready. It would give more
> >> visibility
> >> > to
> >> > > > potential contributors.
> >> > > >
> >> > >
> >> > > This thread is about the opposite, I think -- merging master into
> >> feature
> >> > > branches regularly to prevent them from getting out of sync.
> >> > >
> >> > > As for increasing the visibility of feature branches, we have these
> new
> >> > > webpages:
> >> > > http://beam.incubator.apache.org/contribute/work-in-progress/
> >> > > http://beam.incubator.apache.org/contribute/contribution-
> >> > > guide/#feature-branches
> >> > > with more changes coming in the basic SDK/Runner landing pages too.
> >> > >
> >> >
> >>
>


Re: [DISCUSS] Using Verbs for Transforms

2016-10-25 Thread Kenneth Knowles
To be clear: I am not saying that I think the discussion has concluded. I
think we should give some more time for different time zone rotations to
occur. I just meant to say that if it does come to a vote, I'd prefer to
keep it focused rather than generalizing.

On Tue, Oct 25, 2016 at 10:51 PM Kenneth Knowles <k...@google.com> wrote:

> I'd prefer to keep the vote focused on this rename, not a general policy.
>
> On Tue, Oct 25, 2016 at 10:26 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> Yes I would start a formal vote with the three proposals: descriptive
> verb, adjective, verbs + adjective.
>
> Regards
> JB
>
> ⁣​
>
> On Oct 26, 2016, 07:16, at 07:16, Jesse Anderson <je...@smokinghand.com>
> wrote:
> >We need to make a decision on this so Neelesh can finish his commit.
> >Should
> >we take a vote or something?
> >
> >On Tue, Oct 25, 2016, 7:55 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> >wrote:
> >
> >> Sounds good to me.
> >>
> >> ⁣​
> >>
> >> On Oct 24, 2016, 19:11, at 19:11, je...@smokinghand.com wrote:
> >> >I prefer MakeDistinct if we have to make it a verb.
> >>
>
>


Re: [DISCUSS] Using Verbs for Transforms

2016-10-25 Thread Kenneth Knowles
I'd prefer to keep the vote focused on this rename, not a general policy.

On Tue, Oct 25, 2016 at 10:26 PM Jean-Baptiste Onofré 
wrote:

> Yes I would start a formal vote with the three proposals: descriptive
> verb, adjective, verbs + adjective.
>
> Regards
> JB
>
> ⁣​
>
> On Oct 26, 2016, 07:16, at 07:16, Jesse Anderson 
> wrote:
> >We need to make a decision on this so Neelesh can finish his commit.
> >Should
> >we take a vote or something?
> >
> >On Tue, Oct 25, 2016, 7:55 AM Jean-Baptiste Onofré 
> >wrote:
> >
> >> Sounds good to me.
> >>
> >> ⁣​
> >>
> >> On Oct 24, 2016, 19:11, at 19:11, je...@smokinghand.com wrote:
> >> >I prefer MakeDistinct if we have to make it a verb.
> >>
>


Re: Apex runner integration tests

2016-10-25 Thread Kenneth Knowles
I've commented on a PR but also want to respond here.

In the precommit, we run
https://builds.apache.org/job/beam_PreCommit_MavenVerify/ which uses
-Pjenkins-precommit to select very few integration tests. It should just be
unit tests and integration tests based on our examples. This catches the
bulk of issues without being too slow.

In postcommit, we also run `mvn verify` in
https://builds.apache.org/job/beam_PostCommit_MavenVerify/ which has a
slightly different configuration. I actually can't decipher the exact level
of overlap.

In postcommit, we also run a separate full RunnableOnService job per
runner:
https://builds.apache.org/job/beam_PostCommit_RunnableOnService_GoogleCloudDataflow/
,
https://builds.apache.org/view/Beam/job/beam_PostCommit_RunnableOnService_GearpumpLocal/
,
https://builds.apache.org/view/Beam/job/beam_PostCommit_RunnableOnService_FlinkLocal/
,
https://builds.apache.org/view/Beam/job/beam_PostCommit_RunnableOnService_SparkLocal/.
We definitely should add Apex to this list ASAP.

Hope this clears things up a bit.

Kenn

On Tue, Oct 25, 2016 at 10:28 AM Thomas Weise  wrote:

> The Apex runner has the integration tests enabled and that causes Travis PR
> builds to fail with timeout (they complete in Jenkins).
>
> What is the correct setup for this, when are the tests supposed to run?
>
>
> https://github.com/apache/incubator-beam/blob/apex-runner/runners/apex/pom.xml#L190
>
> Thanks,
> Thomas
>


Re: [VOTE] Release 0.3.0-incubating, release candidate #1

2016-10-25 Thread Kenneth Knowles
+1 (binding)

On Tue, Oct 25, 2016 at 5:26 PM Dan Halperin 
wrote:

> My reading of the LEGAL threads is that since we are not including (shading
> or bundling) the ASL-licensed code we are fine to distribute kinesis-io
> module. This was the original conclusion that LEGAL-198 got to, and that
> thread has not been resolved differently (even if Spark went ahead and
> broke the assembly). The beam-sdks-java-io-kinesis module is an optional
> part (Beam materially works just fine without it).
>
> So I think we're fine to keep this vote open.
>
> +1 (binding) on the release
>
> Thanks Aljoscha!
>
>
> On Tue, Oct 25, 2016 at 12:07 PM, Aljoscha Krettek 
> wrote:
>
> > Yep, I was looking at those same threads when I reviewing the artefacts.
> > The release was already close to being finished so I went through with it
> > but if we think it's not good to have them in we should quickly cancel in
> > favour of a new RC without a published Kinesis connector.
> >
> > On Tue, 25 Oct 2016 at 20:46 Dan Halperin 
> > wrote:
> >
> > > I can't tell whether it is a problem that we are distributing the
> > > beam-sdks-java-io-kinesis module [0].
> > >
> > > Here is the dev@ discussion thread [1] and the (unanswered) relevant
> > LEGAL
> > > thread [2].
> > > We linked through to a Spark-related discussion [3], and here is how to
> > > disable distribution of the KinesisIO module [4].
> > >
> > > [0]
> > >
> > > https://repository.apache.org/content/repositories/staging/
> > org/apache/beam/beam-sdks-java-io-kinesis/
> > > [1]
> > >
> > > https://lists.apache.org/thread.html/6784bc005f329d93fd59d0f8759ed4
> > 745e72f105e39d869e094d9645@%3Cdev.beam.apache.org%3E
> > > [2]
> > >
> > > https://issues.apache.org/jira/browse/LEGAL-198?
> > focusedCommentId=15471529=com.atlassian.jira.
> > plugin.system.issuetabpanels:comment-tabpanel#comment-15471529
> > > [3] https://issues.apache.org/jira/browse/SPARK-17418
> > > [4] https://github.com/apache/spark/pull/15167/files
> > >
> > > Dan
> > >
> > > On Tue, Oct 25, 2016 at 11:01 AM, Seetharam Venkatesh <
> > > venkat...@innerzeal.com> wrote:
> > >
> > > > +1
> > > >
> > > > Thanks!
> > > >
> > > > On Mon, Oct 24, 2016 at 2:30 PM Aljoscha Krettek <
> aljos...@apache.org>
> > > > wrote:
> > > >
> > > > > Hi Team!
> > > > >
> > > > > Please review and vote at your leisure on release candidate #1 for
> > > > version
> > > > > 0.3.0-incubating, as follows:
> > > > > [ ] +1, Approve the release
> > > > > [ ] -1, Do not approve the release (please provide specific
> comments)
> > > > >
> > > > > The complete staging area is available for your review, which
> > includes:
> > > > > * JIRA release notes [1],
> > > > > * the official Apache source release to be deployed to
> > dist.apache.org
> > > > > [2],
> > > > > * all artifacts to be deployed to the Maven Central Repository [3],
> > > > > * source code tag "v0.3.0-incubating-RC1" [4],
> > > > > * website pull request listing the release and publishing the API
> > > > reference
> > > > > manual [5].
> > > > >
> > > > > Please keep in mind that this release is not focused on providing
> new
> > > > > functionality. We want to refine the release process and make
> stable
> > > > source
> > > > > and binary artefacts available to our users.
> > > > >
> > > > > The vote will be open for at least 72 hours. It is adopted by
> > majority
> > > > > approval, with at least 3 PPMC affirmative votes.
> > > > >
> > > > > Cheers,
> > > > > Aljoscha
> > > > >
> > > > > [1]
> > > > >
> > > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> > > > projectId=12319527=12338051
> > > > > [2]
> > > > >
> > >
> https://dist.apache.org/repos/dist/dev/incubator/beam/0.3.0-incubating/
> > > > > [3]
> > > > > https://repository.apache.org/content/repositories/staging/org/apache/beam/
> > > > > [4]
> > > > >
> > > > > https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=5d86ff7f04862444c266142b0d5acecb5a6b7144
> > > > > [5] https://github.com/apache/incubator-beam-site/pull/52
> > > > >
> > > >
> > >
> >
>


[DISCUSS] Merging master -> feature branch

2016-10-25 Thread Kenneth Knowles
Hi all,

While collaborating on the apex-runner branch, the issue of how best to
continuously merge master into the feature branch came up. IMO it differs
somewhat from normal commits in two notable ways:

1. Modulo fix-ups, it is actually not adding any new code to the overall
codebase, so reviews don't serve to raise the quality bar for contributions.
2. It is pretty important to do very frequently to keep out of trouble, so
friction must be thoroughly justified.

I really wouldn't want reviewing a daily merge from master to consume
someone's time every day. On the gearpump-runner branch we had some major
review latency problems. Obviously, I'd like to hear from folks on other
feature branches. How has this process been for the Python SDK?

I will also throw out an idea for a variation in process:

1. A committer merges master to their feature branch without conflicts.*
2. They open a PR for the merge.
3a. If the tests pass without modifications _the committer self-merges the
PR without waiting for review_.
3b. If there are any adjustments needed, these go in separate commits on
the same PR and the review process is as usual (the reviewer can review
just the added commits).

What do you think? Does this treat such merges too blithely? This is meant
just as a starting point for discussion.

Kenn

* There should never be real conflicts unless something strange has
happened - the feature can't be edited on the master branch, and master
stuff shouldn't be touched on the feature branch.


Re: The Availability of PipelineOptions

2016-10-25 Thread Kenneth Knowles
In the spirit of some recent conversations about tracking proposals like
this, are there JIRAs you can [file and then] mention on this thread?

On Tue, Oct 25, 2016 at 2:07 PM Kenneth Knowles <k...@google.com> wrote:

> Yea +1. Definitely a real prerequisite to a true runner-independent graph.
>
> On Tue, Oct 25, 2016 at 1:24 PM Amit Sela <amitsel...@gmail.com> wrote:
>
> +1
>
> On Tue, Oct 25, 2016 at 8:43 PM Robert Bradshaw
> <rober...@google.com.invalid>
> wrote:
>
> > +1
> >
> > On Tue, Oct 25, 2016 at 7:26 AM, Thomas Weise <t...@apache.org> wrote:
> > > +1
> > >
> > >
> > > On Tue, Oct 25, 2016 at 3:03 AM, Jean-Baptiste Onofré <j...@nanthrax.net
> >
> > > wrote:
> > >
> > >> +1
> > >>
> > >> Agree
> > >>
> > >> Regards
> > >> JB
> > >>
> > >>
> > >>
> > >> On Oct 25, 2016, 12:01, at 12:01, Aljoscha Krettek <
> aljos...@apache.org
> > >
> > >> wrote:
> > >> >+1 This sounds quite straightforward.
> > >> >
> > >> >On Tue, 25 Oct 2016 at 01:36 Thomas Groh <tg...@google.com.invalid>
> > >> >wrote:
> > >> >
> > >> >> Hey everyone,
> > >> >>
> > >> >> I've been working on a declaration of intent for how we want to use
> > >> >> PipelineOptions and an API change to be consistent with that
> intent.
> > >> >This
> > >> >> is generally part of the move to the Runner API, specifically the
> > >> >desire to
> > >> >> be able to reuse Pipelines and the ability to choose runner at the
> > >> >time of
> > >> >> the call to run.
> > >> >>
> > >> >> The high-level summary is I want to remove the
> > >> >Pipeline.getPipelineOptions
> > >> >> method.
> > >> >>
> > >> >> I believe this will be compatible with other in-flight proposals,
> > >> >> especially Dynamic PipelineOptions, but would love to see what
> > >> >everyone
> > >> >> else thinks. The document is available at the link below.
> > >> >>
> > >> >>
> > >> >>
> > >> >https://docs.google.com/document/d/1Wr05cYdqnCfrLLqSk--XmGMGgDwwNwWZaFbxLKvPqEQ/edit?usp=sharing
> > >> >>
> > >> >> Thanks,
> > >> >>
> > >> >> Thomas
> > >> >>
> > >>
> >
>
>
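
For concreteness, a minimal sketch of the intent (the no-argument create()
and the run(options) signature are hypothetical here, used only to
illustrate the direction of the proposal; they are not the SDK's API as it
stood at the time):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;

    public class RunnerChosenAtRunTime {
      public static void main(String[] args) {
        // Construct the pipeline without consulting options anywhere.
        Pipeline p = Pipeline.create(); // note: no options captured here
        p.apply(Create.of("a", "b", "c"));

        // Bind options -- including the runner -- only when running, so the
        // same Pipeline object could be reused with different runners.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        p.run(options); // hypothetical signature illustrating the proposal
      }
    }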


Re: The Availability of PipelineOptions

2016-10-25 Thread Kenneth Knowles
Yea +1. Definitely a real prerequisite to a true runner-independent graph.

On Tue, Oct 25, 2016 at 1:24 PM Amit Sela  wrote:

> +1
>
> On Tue, Oct 25, 2016 at 8:43 PM Robert Bradshaw
> 
> wrote:
>
> > +1
> >
> > On Tue, Oct 25, 2016 at 7:26 AM, Thomas Weise  wrote:
> > > +1
> > >
> > >
> > > On Tue, Oct 25, 2016 at 3:03 AM, Jean-Baptiste Onofré  >
> > > wrote:
> > >
> > >> +1
> > >>
> > >> Agree
> > >>
> > >> Regards
> > >> JB
> > >>
> > >>
> > >>
> > >> On Oct 25, 2016, 12:01, at 12:01, Aljoscha Krettek <
> aljos...@apache.org
> > >
> > >> wrote:
> > >> >+1 This sounds quite straightforward.
> > >> >
> > >> >On Tue, 25 Oct 2016 at 01:36 Thomas Groh 
> > >> >wrote:
> > >> >
> > >> >> Hey everyone,
> > >> >>
> > >> >> I've been working on a declaration of intent for how we want to use
> > >> >> PipelineOptions and an API change to be consistent with that
> intent.
> > >> >This
> > >> >> is generally part of the move to the Runner API, specifically the
> > >> >desire to
> > >> >> be able to reuse Pipelines and the ability to choose runner at the
> > >> >time of
> > >> >> the call to run.
> > >> >>
> > >> >> The high-level summary is I want to remove the
> > >> >Pipeline.getPipelineOptions
> > >> >> method.
> > >> >>
> > >> >> I believe this will be compatible with other in-flight proposals,
> > >> >> especially Dynamic PipelineOptions, but would love to see what
> > >> >everyone
> > >> >> else thinks. The document is available at the link below.
> > >> >>
> > >> >>
> > >> >>
> > >> >https://docs.google.com/document/d/1Wr05cYdqnCfrLLqSk--XmGMGgDwwNwWZaFbxLKvPqEQ/edit?usp=sharing
> > >> >>
> > >> >> Thanks,
> > >> >>
> > >> >> Thomas
> > >> >>
> > >>
> >
>


Re: [ANNOUNCEMENT] New committers!

2016-10-21 Thread Kenneth Knowles
Huzzah!

I've personally enjoyed working together, and I am glad to extend this
acknowledgement and welcome this addition to the Beam community.

Kenn

On Fri, Oct 21, 2016 at 3:18 PM Davor Bonaci  wrote:

> Hi everyone,
> Please join me and the rest of Beam PPMC in welcoming the following
> contributors as our newest committers. They have significantly contributed
> to the project in different ways, and we look forward to many more
> contributions in the future.
>
> * Thomas Weise
> Thomas authored the Apache Apex runner for Beam [1]. This is an exciting
> new runner that opens a new user base. It is a large contribution, which
> starts the whole new component with a great potential.
>
> * Jesse Anderson
> Jesse has contributed significantly by promoting Beam. He has co-developed
> a Beam tutorial and delivered it at a top big data conference. He published
> several blog posts positioning Beam, a Q&A with the Apache Beam team, and a
> demo video on how to run Beam on multiple runners [2]. On the side, he has
> authored 7 pull requests and reported 6 JIRA issues.
>
> * Thomas Groh
> Since starting incubation, Thomas has contributed the most commits to the
> project [3], a total of 226 commits, which is more than anybody else. He
> has contributed broadly to the project, most significantly by developing
> from scratch the DirectRunner that supports the full model semantics.
> Additionally, he has contributed a new set of APIs for testing unbounded
> pipelines. He published a blog highlighting this work.
>
> Congratulations to all three! Welcome!
>
> Davor
>
> [1] https://github.com/apache/incubator-beam/tree/apex-runner
> [2] http://www.smokinghand.com/
> [3] https://github.com/apache/incubator-beam/graphs/contributors?from=2016-02-01&to=2016-10-14&type=c
>


Re: Placement of temporary files by FileBasedSink

2016-10-20 Thread Kenneth Knowles
@Eugene, we can make breaking changes. But if we really don't want to, we
can add it under a new name easily. That particular inheritance hierarchy
is not precious IMO.

On Thu, Oct 20, 2016, 14:03 Eugene Kirpichov <kirpic...@google.com.invalid>
wrote:

> @Cham - this addresses temporary files that were written by successful
> bundles, but not by failed bundles (and not the case when the entire
> pipeline fails), so it's not sufficient.
>
> @Dan - yes, there are situations when it's impossible to create a sibling.
> In that case, we'd need a fallback - either something the user needs to
> explicitly specify ("your path is such that we don't know where to place
> temporary files, please specify withTempLocation or something"), or I like
> Robert's option of using sibling but differently-named files in this case.
>
> @Kenn - yeah, a directory-based format would be great
> (/path/to/foo/x-of-y), but this would be a breaking change to the
> expected behavior.
>
> I actually really like the option of sibling-but-differently-named files
> (/path/to/temp-beam-foo-$uid) which would be a very non-invasive change to
> the current (/path/to/foo-temp-$uid) and indeed would not involve creating
> new directories or needing new IOChannelFactory APIs. It will still match a
> glob like /path/to/* though (which a user could conceivably specify in a
> situation like gs://my-logs-bucket/*), but it might be good enough.
>
>
> On Thu, Oct 20, 2016 at 10:14 AM Robert Bradshaw
> <rober...@google.com.invalid> wrote:
>
> > On Thu, Oct 20, 2016 at 9:58 AM, Kenneth Knowles <k...@google.com.invalid
> >
> > wrote:
> > > I like the spirit of proposal #1 for addressing the critical
> duplication
> > > problem, though as Dan points out the logic to choose a related but
> > > collision-free name might be slightly more complex.
> > >
> > > It is a nice bonus that it addresses the less critical issues and
> > improves
> > > usability for manual inspections and interventions.
> > >
> > > The term "sibling" is being slightly misused here. I'd say #1 as
> proposed
> > > is a "sibling of the parent" while today's behavior is "sibling". I'd
> > say a
> > > root cause of multiple problems is that our sharded file format is "a
> > bunch
> > > of files next to each other" and the sibling is "other files in the
> same
> > > directory" so it takes some care, and explicit file name tracking
> instead
> > > of globbing, to work with it correctly.
> > >
> > >  AFAIK (corrections welcome) there is nothing special about
> > > Write.to("s3://bucket/file") meaning write to
> > > "s3://bucket/file-$shardnum-of-$totalshards". An alternative that seems
> > > superior is to write to "s3://bucket/file/$shardnum-of-$totalshards"
> with
> > > the convention that this prefix is fully owned by this file. Now the
> > prefix
> > > "s3://bucket/file/" _is_ the sharded file. It is conceptually simpler
> and
> > > more glob and UI friendly. (any non "-" character would work for GCS
> and
> > > S3, but the "/" convention is better, considering the broader world)
> > >
> > > And bringing it back to this thread, the "sibling" is no longer "more
> > files
> > > in the same directory" now "s3://bucket/file-temp-$uid" which is on the
> > > same filesystem with the same ACLs. It is also more UI friendly, easier
> > to
> > > clean up, and does more to explicitly indicate that this is really one
> > > sharded file. Perhaps there's a pitfall I am overlooking?
> >
> > Using directories rather than prefixes is a big change, and introduces
> > complications like dealing with hidden dot files (some placed
> > implicitly by the system or applications, and worrying about
> > executable bits rather than just the rw ones and possibly more
> > complicated permission inheritance).
> >
> > > Also since you mentioned local file support, FWIW the cleanup glob
> > "file-*"
> > > today breaks on Windows due to Java library vagaries, while "file/*"
> > would
> > > succeed.
> > > On Thu, Oct 20, 2016, 09:14 Dan Halperin <dhalp...@google.com.invalid>
> > > wrote:
> > >
> > > This thread is conflating many issues.
> > >
> > > * Putting temp files where they will not match the glob for the desired
> > > output files
> > > * Dealing with eventually-c
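
For illustration, a minimal sketch of the sibling-but-differently-named
option discussed above (the helper name and exact prefix format are
invented for this example; this is not the actual FileBasedSink code):

    import java.util.UUID;

    class TempPaths {
      // Derive "/path/to/temp-beam-foo-$uid" from an output prefix
      // "/path/to/foo": same directory (so same filesystem and ACLs), but
      // no longer matching the natural output glob "/path/to/foo-*".
      static String tempPrefixFor(String outputPrefix) {
        int slash = outputPrefix.lastIndexOf('/');
        String dir = outputPrefix.substring(0, slash + 1);
        String name = outputPrefix.substring(slash + 1);
        return dir + "temp-beam-" + name + "-" + UUID.randomUUID();
      }
    }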

Re: Start of release 0.3.0-incubating

2016-10-20 Thread Kenneth Knowles
Aljoscha, I'm very interested in hearing how easy it is, or how fast we
think it could get, from your perspective as first time release manager.
The more frequent releases we have (eventually minor or patch version only)
the less these concerns impact users.

On Thu, Oct 20, 2016, 10:26 Jesse Anderson  wrote:

> +1 to Davor's. I'd really like to see an 0.3.0 release because there have
> been big API changes between 0.2.0 and 0.3.0 like the DoFn changes. It'd be
> nice to stop pointing people to HEAD and point them back to a release.
>
> On Thu, Oct 20, 2016 at 10:17 AM Davor Bonaci 
> wrote:
>
> > It's been a while since the last release, and I think we have accumulated
> > plenty of improvements across the board [1]. There are new IOs to be
> > released, performance improvements, and a ton of fixes.
> >
> > As a general principle, I'm always advocating for delaying releases when
> > there are outstanding bug fixes. For new features, however, I'm usually
> on
> > the fence. It happens sometimes that new features are rushed to make a
> > release, then we discover important issues later on, and sometimes regret
> > the decision.
> >
> > Of course, UnboundedSource for the SparkRunner and MqttIo would be
> > additional great improvements, and we should get that out to our users as
> > soon as possible too.
> >
> > In this particular case, I think it is perfectly reasonable either to:
> > * try to get 0.3.0 out now and follow it quickly with 0.4.0, as soon as
> > these improvements are ready, or
> > * delay the release, but with a specific time box of a few days.
> >
> > I'd give some preference to the first option now, since it is important
> to
> > keep a cadence of releases during incubation and build experience with
> the
> > process. If we were post-graduation, I'd almost certainly give a
> preference
> > to the second approach.
> >
> > Davor
> >
> > [1]
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338051
> >
> > On Thu, Oct 20, 2016 at 9:32 AM, Amit Sela  wrote:
> >
> > > +1
> > >
> > > I would like to have my standing PRs merged please - they should
> provide
> > > support for UnboundedSource for the SparkRunner.
> > > If it won't be ready for merge at the beginning of next week, don't
> hold
> > > for me.
> > >
> > > Thanks,
> > > Amit
> > >
> > > On Thu, Oct 20, 2016 at 7:27 PM Jean-Baptiste Onofré 
> > > wrote:
> > >
> > > > +1
> > > >
> > > > Thanks Aljosha !!
> > > >
> > > > Do you mind to wait the week end or Monday to start the release ? I
> > would
> > > > like to include MqttIO if possible.
> > > >
> > > > Thanks !
> > > > Regards
> > > > JB
> > > >
> > > >
> > > >
> > > > On Oct 20, 2016, 18:07, at 18:07, Dan Halperin
> > > 
> > > > wrote:
> > > > >On Thu, Oct 20, 2016 at 12:37 AM, Aljoscha Krettek
> > > > >
> > > > > wrote:
> > > > >
> > > > >> Hi,
> > > > >> thanks for taking the time and writing this extensive doc!
> > > > >>
> > > > >> If no-one is against this I would like to be the release manager
> for
> > > > >the
> > > > >> next (0.3.0-incubating) release. I would work with the guide and
> > > > >update it
> > > > >> with anything that I learn along the way. Should I open a new
> thread
> > > > >for
> > > > >> this or is it ok of nobody objects here?
> > > > >>
> > > > >> Cheers,
> > > > >> Aljoscha
> > > > >>
> > > > >
> > > > >Spinning this out as a separate thread.
> > > > >
> > > > >+1 -- Sounds great to me!
> > > > >
> > > > >Dan
> > > > >
> > > > >On Thu, Oct 20, 2016 at 12:37 AM, Aljoscha Krettek
> > > > >
> > > > >wrote:
> > > > >
> > > > >> Hi,
> > > > >> thanks for taking the time and writing this extensive doc!
> > > > >>
> > > > >> If no-one is against this I would like to be the release manager
> for
> > > > >the
> > > > >> next (0.3.0-incubating) release. I would work with the guide and
> > > > >update it
> > > > >> with anything that I learn along the way. Should I open a new
> thread
> > > > >for
> > > > >> this or is it ok of nobody objects here?
> > > > >>
> > > > >> Cheers,
> > > > >> Aljoscha
> > > > >>
> > > > >> On Thu, 20 Oct 2016 at 07:10 Jean-Baptiste Onofré <
> j...@nanthrax.net>
> > > > >wrote:
> > > > >>
> > > > >> > Hi,
> > > > >> >
> > > > >> > well done.
> > > > >> >
> > > > >> > As already discussed, it looks good to me ;)
> > > > >> >
> > > > >> > Regards
> > > > >> > JB
> > > > >> >
> > > > >> > On 10/20/2016 01:24 AM, Davor Bonaci wrote:
> > > > >> > > Hi everybody,
> > > > >> > > As a project, I think we should have a Release Guide to
> document
> > > > >the
> > > > >> > > process, have consistent releases, on-board additional release
> > > > >> managers,
> > > > >> > > and generally share knowledge. It is also one of the project
> > > > >graduation
> > > > >> > > guidelines.
> > > > >> > >
> > > > >> > > Dan and I wrote a draft version, documenting the 

Re: Release Guide

2016-10-20 Thread Kenneth Knowles
This is really nice. Very readable and streamlined.

On Thu, Oct 20, 2016 at 7:44 AM Aljoscha Krettek 
wrote:

> Hi,
> thanks for taking the time and writing this extensive doc!
>
> If no-one is against this I would like to be the release manager for the
> next (0.3.0-incubating) release. I would work with the guide and update it
> with anything that I learn along the way. Should I open a new thread for
> this or is it ok of nobody objects here?
>
> Cheers,
> Aljoscha
>
> On Thu, 20 Oct 2016 at 07:10 Jean-Baptiste Onofré  wrote:
>
> > Hi,
> >
> > well done.
> >
> > As already discussed, it looks good to me ;)
> >
> > Regards
> > JB
> >
> > On 10/20/2016 01:24 AM, Davor Bonaci wrote:
> > > Hi everybody,
> > > As a project, I think we should have a Release Guide to document the
> > > process, have consistent releases, on-board additional release
> managers,
> > > and generally share knowledge. It is also one of the project graduation
> > > guidelines.
> > >
> > > Dan and I wrote a draft version, documenting the process we did for the
> > > first two releases. It is currently in a pull request [1]. I'd invite
> > > everyone interested to take a peek and comment, either on the pull
> > request
> > > itself or here on mailing list, as appropriate.
> > >
> > > Thanks,
> > > Davor
> > >
> > > [1] https://github.com/apache/incubator-beam-site/pull/49
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: [KUDOS] Contributed runner: Apache Apex!

2016-10-17 Thread Kenneth Knowles
*I would like to :-)

On Mon, Oct 17, 2016 at 9:51 AM Kenneth Knowles <k...@google.com> wrote:

> Hi all,
>
> I would to, once again, call attention to a great addition to Beam: a
> runner for Apache Apex.
>
> After lots of review and much thoughtful revision, pull request #540 has
> been merged to the apex-runner feature branch today. Please do take a look,
> and help us put the finishing touches on it to get it ready for the master
> branch.
>
> And please also congratulate and thank Thomas Weise for this large
> endeavor, Vlad Rosov who helped get the integration tests working, and
> Guarav Gupta who contributed review comments.
>
> Kenn
>


[KUDOS] Contributed runner: Apache Apex!

2016-10-17 Thread Kenneth Knowles
Hi all,

I would to, once again, call attention to a great addition to Beam: a
runner for Apache Apex.

After lots of review and much thoughtful revision, pull request #540 has
been merged to the apex-runner feature branch today. Please do take a look,
and help us put the finishing touches on it to get it ready for the master
branch.

And please also congratulate and thank Thomas Weise for this large
endeavor, Vlad Rosov who helped get the integration tests working, and
Guarav Gupta who contributed review comments.

Kenn


Re: [PROPOSAL] State and Timers for DoFn (aka per-key workflows)

2016-10-14 Thread Kenneth Knowles
Hi all,

I thought I would loop back on this proposal and email thread with an FYI
that coding has begun for this design. Here are some recent PRs for your
perusal, if you are interested.

https://github.com/apache/incubator-beam/pull/1044 "Refactor StateSpec
out of StateTag"
https://github.com/apache/incubator-beam/pull/1086 "Add DoFn.StateId
annotation and validation on fields"
https://github.com/apache/incubator-beam/pull/1102 "Add initial bits for
DoFn Timer API"

Kenn

On Fri, Jul 29, 2016 at 12:26 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> +1
>
> It sounds very good.
>
> Regards
> JB
>
> On 07/27/2016 05:20 AM, Kenneth Knowles wrote:
> > Hi everyone,
> >
> >
> > I would like to offer a proposal for a much-requested feature in Beam:
> > Stateful processing in a DoFn. Please check out and comment on the
> proposal
> > at this URL:
> >
> >
> >   https://s.apache.org/beam-state
> >
> >
> > This proposal includes user-facing APIs for persistent state and timers.
> > Together, these provide rich capabilities that have been called "per-key
> > workflows", the subject of [BEAM-23].
> >
> >
> > Note that this proposal has an important prerequisite: a new design for
> > DoFn. The new DoFn is strongly motivated by this design for state and
> > timers, but we should discuss it separately. I will start a separate
> thread
> > for that.
> >
> >
> > On this email thread, I'd like to try to focus the discussion on state &
> > timers. And of course, please do comment on the particulars in the
> document.
> >
> >
> > Kenn
> >
> >
> > [BEAM-23] https://issues.apache.org/jira/browse/BEAM-23
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
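
For readers following along, a rough sketch of the user-facing shape these
PRs build toward (package locations and signatures are as the API later
stabilized and may differ from the code under review at the time):

    import org.apache.beam.sdk.coders.VarIntCoder;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.ValueState;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    class AssignIndexFn extends DoFn<KV<String, String>, KV<String, Integer>> {
      // Declares a per-key-and-window value cell; the @StateId annotation
      // is validated against the parameter below.
      @StateId("index")
      private final StateSpec<ValueState<Integer>> indexSpec =
          StateSpecs.value(VarIntCoder.of());

      @ProcessElement
      public void process(
          ProcessContext c, @StateId("index") ValueState<Integer> index) {
        Integer stored = index.read();
        int current = (stored == null) ? 0 : stored;
        c.output(KV.of(c.element().getKey(), current));
        index.write(current + 1);
      }
    }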


Re: Specifying type arguments for generic PTransform builders

2016-10-13 Thread Kenneth Knowles
On Thu, Oct 13, 2016 at 4:55 PM Dan Halperin <dhalp...@google.com.invalid>
wrote:
> These
> suggestions are motivated by making things easier on transform writers,
but
> IMO we need to be optimizing for transform users.

To be fair to Eugene, he was actually analyzing real code patterns that
exists in Beam today, not suggesting new ones.

Along those lines, your addition of the BigTableIO pattern is well-taken
and my own analysis of that code is #5: "when you don't have a type
variable to bind, leave every field blank and validate later. Also, have an
XYZOptions object". I believe in the presence of type parameters this
reduces to #4 Bound/Unbound classes but it is more palatable in the
no-type-variable case. It is also a good choice when varying subsets of
parameters might be optional - the Window transform matches this pattern
for good reason.

The other major drawback of #3 is the inability to provide generic
> configuration. For example, a utility function that gives you a database
> reader ready-to-go with all the default credentials and options you need --
> but independent of the type you want the result to end up as. You can't do
> that if you must bind the type first.
>

This is a compelling use case. It is valuable for configuration to be a
first-class object that can be passed around. BigTableOptions is a good
example. It isn't in contradiction with #3, but actually fits very nicely.

By definition for this default configuration to be first-class it has to be
more than an invalid intermediate state of a PTransform's builder methods.
Concretely, it would be BigTableIO.defaultOptions(), which would make
manifest the inaccessible default options that could be implied by
BigTableIO.read(). There can sometimes be a pretty fine line between a
builder and an options object, to be sure. You might distinguish it by
whether you would conceivably use the object elsewhere - and for
BigTableOptions the answer is certainly "yes" since it actually is an
external class. In the extreme, every method takes one giant POJO and that
sucks.


> I think that in general all of these patterns are significantly worse in
> the long run than the existing standards, e.g., for BigtableIO.


Note that BigTableIO.read() is actually not "ready-to-go" but has nulls and
empty strings that cause crashes if they are not overridden. It is just a
builder without the concluding "build()" method (for the record: I find
concluding "build()" methods pointless, too :-)

One of the better examples of the pattern of "ready-to-go" builders -
though not a transform - is WindowingStrategy (props to Ben), where there
are intelligent defaults and you can override them, and it tracks whether
or not each field is a default or a user-specified value. To start it off
you have to either request "globalDefault()" or "of(WindowFn)", in the
spirit of #3.

Kenn
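
To make pattern #3 concrete, a small self-contained sketch (all names here
are invented; this is not a Beam transform): the mandatory parameter binds
the type variable up front, and optional settings follow with no concluding
build():

    import java.util.function.Function;

    class TextParse<T> {
      private final Function<String, T> parseFn; // mandatory; binds T
      private final String charset;              // optional, has a default

      private TextParse(Function<String, T> parseFn, String charset) {
        this.parseFn = parseFn;
        this.charset = charset;
      }

      // Pattern #3, in the spirit of ParDo.of(...) and MapElements.via(...):
      // the required argument comes first and fixes the type variable.
      static <T> TextParse<T> of(Function<String, T> parseFn) {
        return new TextParse<>(parseFn, "UTF-8");
      }

      // Optional settings return a modified copy; the object is valid at
      // every step, so there is no invalid intermediate state to validate.
      TextParse<T> withCharset(String charset) {
        return new TextParse<>(parseFn, charset);
      }
    }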

On Fri, Oct 7, 2016 at 4:48 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> > In my original email, all FooBuilder's should be simply Foo. Sorry for
> the
> > confusion.
> >
> > On Thu, Oct 6, 2016 at 3:08 PM Kenneth Knowles <k...@google.com.invalid>
> > wrote:
> >
> > > Mostly my thoughts are the same as Robert's. Use #3 whenever possible,
> > > fallback to #1 otherwise, but please consider using informative names
> for
> > > your methods in all cases.
> > >
> > > #1 GBK.create(): IMO this pattern is best only for transforms where
> > > withBar is optional or there is no such method, as in GBK. If it is
> > > mandatory, it should just be required in the first method, eliding the
> > > issue, as in ParDo.of(DoFn<I, O>), MapElements.via(...), etc, like you
> > say
> > > in your concluding remark.
> > >
> > > #2 FooBuilder FooBuilder.create(): this too - if you are going to
> fix
> > > the type, fix it first. If it is optional and Foo is usable as a
> > > transform, then sure. (it would have be something weird like
> Foo<InputT,
> > > OutputT, ?> extends PTransform<InputT, OutputT>)
> > >
> > > #3 Foo.create(Bar): this is best. Do this whenever possible. From my
> > > perspective, instead of "move the param to create(...)" I would
> describe
> > > this as "delete create() then rename withBar to create". Just skip the
> > > second step and you are in an even better design, withBar being the
> > > starting point. Just like ParDo.of and MapElements.via.
> > >
> > > #4 Dislike this, too, for the same reasons as #2 plus code bloat plus
> > user
> > > confusion.
> > >
> > > Side note since you use this method in all your examples: 

Re: Simplifying User-Defined Metrics in Beam

2016-10-12 Thread Kenneth Knowles
Correction: In my eagerness to see the end of aggregators, I mistook the
intention. Both A and B leave aggregators in place until there is a
replacement. In which case, I am strongly in favor of B. As soon as we can
remove aggregators, I think we should.

On Wed, Oct 12, 2016 at 10:48 AM Kenneth Knowles <k...@google.com> wrote:

> Huzzah! This is IMO a really great change. I agree that we can get
> something in to allow work to continue, and improve the API as we learn.
>
> On Wed, Oct 12, 2016 at 10:20 AM Ben Chambers <bchamb...@google.com.invalid>
> wrote:
>
> 3. One open question is what to do with Aggregators. In the doc I mentioned
>
> that long term I'd like to consider whether we can improve Aggregators to
> be a better fit for the model by supporting windowing and allowing them to
> serve as input for future steps. In the interim it's not clear what we
> should do with them. The two obvious (and extreme) options seem to be:
>
>
>
>   Option A: Do nothing, leave aggregators as they are until we revisit.
>
>
>   Option B: Remove aggregators from the SDK until we revisit.
>
> I'd like to suggest removing Aggregators once the existing runners have
> reasonable support for Metrics. Doing so reduces the surface area we need
> to maintain/support and simplifies other changes being made. It will also
> allow us to revisit them from a clean slate.
>
>
> +1 to removing aggregators, either of A or B. The new metrics design
> addresses aggregator use cases as well or better.
>
> So A vs B is a choice of whether we have a gap with no aggregator or
> metrics-like functionality. I think that is perhaps a bit of a bummer for
> users, and we will likely port over the runner code for it, so we wouldn't
> want to actually delete it, right? Can we do it in a week or two?
>
> One thing motivating me to do this quickly: Currently the new DoFn does
> not have its own implementation of aggregators, but leverages that of
> OldDoFn, so we cannot remove OldDoFn until either (1) new DoFn
> re-implements the aggregator instantiation and worker-side delegation (not
> hard, but it is throwaway code) or (2) aggregators are removed. This
> dependency also makes running the new DoFn directly (required for the state
> API) a bit more annoying.
>


Re: Simplifying User-Defined Metrics in Beam

2016-10-12 Thread Kenneth Knowles
Huzzah! This is IMO a really great change. I agree that we can get
something in to allow work to continue, and improve the API as we learn.

On Wed, Oct 12, 2016 at 10:20 AM Ben Chambers 
wrote:

> 3. One open question is what to do with Aggregators. In the doc I mentioned
> that long term I'd like to consider whether we can improve Aggregators to
> be a better fit for the model by supporting windowing and allowing them to
> serve as input for future steps. In the interim it's not clear what we
> should do with them. The two obvious (and extreme) options seem to be:
>
>   Option A: Do nothing, leave aggregators as they are until we revisit.
>
>   Option B: Remove aggregators from the SDK until we revisit.
>
> I'd like to suggest removing Aggregators once the existing runners have
> reasonable support for Metrics. Doing so reduces the surface area we need
> to maintain/support and simplifies other changes being made. It will also
> allow us to revisit them from a clean slate.
>

+1 to removing aggregators, either of A or B. The new metrics design
addresses aggregator use cases as well or better.

So A vs B is a choice of whether we have a gap with no aggregator or
metrics-like functionality. I think that is perhaps a bit of a bummer for
users, and we will likely port over the runner code for it, so we wouldn't
want to actually delete it, right? Can we do it in a week or two?

One thing motivating me to do this quickly: Currently the new DoFn does not
have its own implementation of aggregators, but leverages that of OldDoFn,
so we cannot remove OldDoFn until either (1) new DoFn re-implements the
aggregator instantiation and worker-side delegation (not hard, but it is
throwaway code) or (2) aggregators are removed. This dependency also makes
running the new DoFn directly (required for the state API) a bit more
annoying.
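
For reference, a minimal sketch of the proposed metrics API in use (the
names shown are those that eventually landed in
org.apache.beam.sdk.metrics; the exact surface was still subject to change
at the time of this thread):

    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.transforms.DoFn;

    class CountingFn extends DoFn<String, String> {
      // Declared once per step; reported by the runner and queryable by name.
      private final Counter elements =
          Metrics.counter(CountingFn.class, "elements");

      @ProcessElement
      public void processElement(ProcessContext c) {
        elements.inc();
        c.output(c.element());
      }
    }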


Re: [PROPOSAL] New Beam website design?

2016-10-05 Thread Kenneth Knowles
Just because the thread got bumped... I kind of miss the old bucket of
technical docs. They aren't user-facing, but I used it quite a lot. Perhaps
instead of deleting it, move from "Learn" to "Contribute" or bury it
somewhere near the bottom of the contributors' guide?

On Wed, Oct 5, 2016 at 11:21 AM James Malone 
wrote:

> Hi all!
>
> I want to revive this thread if I may. Is there any way I can help on the
> website redesign? Additionally, is anyone currently working on the UI/UX
> design? I want to make sure I don't duplicate any work.
>
> Cheers!
>
> James
>
> On Tue, Aug 2, 2016 at 1:41 PM, Frances Perry 
> wrote:
>
> > Just an update for everyone...
> >
> > The new website navigation structure is roughly in place. However there's
> > still work remaining to fill in the pages with the right content. There
> are
> > bugs in JIRA tracking the major gaps.
> >
> > In addition, JB is working on new CSS to give our site a snazzier look
> ;-)
> >
> > On Thu, Jun 16, 2016 at 10:11 PM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > Good point. It make sense to wait that it's actually implemented and
> > > available before putting on the website.
> > >
> > > Thanks !
> > > Regards
> > > JB
> > >
> > >
> > > On 06/17/2016 07:01 AM, Frances Perry wrote:
> > >
> > >> Good thoughts.
> > >>
> > >> There'll be a section on IO that comes over as part of the programming
> > >> guide being ported from the Cloud Dataflow docs. Hopefully that has
> the
> > >> technical info needed. Once we see how that's structured (Devin is
> still
> > >> playing around with a single page or multiple), then we can decide if
> we
> > >> need to make it more visible. Maybe we should add a summary table to
> the
> > >> overview page too?
> > >>
> > >> As for DSLs, I think it remains to be seen how we tightly we choose to
> > >> integrate them into Beam. As we've started discussing before, we may
> > >> decide
> > >> that some of them belong elsewhere because they change the
> user-visible
> > >> concepts, but should be discoverable from our documentation. Or others
> > may
> > >> more closely align and just expose subsets. But in any case -- totally
> > >> agree we should add the right concepts when we cross that bridge.
> > >>
> > >>
> > >>
> > >> On Thu, Jun 16, 2016 at 9:53 PM, Jean-Baptiste Onofré <
> j...@nanthrax.net>
> > >> wrote:
> > >>
> > >> Hi Frances,
> > >>>
> > >>> great doc !
> > >>>
> > >>> Maybe in the "Learn" section, we can also add IOs (like SDKs, and
> > >>> runners), like we do in Camel (
> http://camel.apache.org/components.html
> > )
> > >>> For the SDKs, I would also add DSLs in the same section.
> > >>>
> > >>> WDYT ?
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>>
> > >>> On 06/17/2016 12:21 AM, Frances Perry wrote:
> > >>>
> > >>> Good point, JB -- let's redo the page layout as well.
> > 
> >  I started with your proposal and tweaked it a bit to add in more
> > details
> >  and divide things a bit more according to use case (end user vs.
> >  runner/sdk
> >  developer):
> > 
> > 
> >  https://docs.google.com/document/d/1-0jMv7NnYp0Ttt4voulUMwVe_qjBYeNMLm2LusYF3gQ/edit
> > 
> >  Let me know what you think, and what part you'd like to drive! I'd
> >  suggest
> >  we get the new section layout set this week, so we can parallelize
> > site
> >  design and assorted page content.
> > 
> >  On Mon, Jun 6, 2016 at 8:38 AM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> >  wrote:
> > 
> >  Hi James,
> > 
> > >
> > > very good idea !
> > >
> > > Couple of month ago, I completely revamped the Karaf website:
> > >
> > > http://karaf.apache.org/
> > >
> > > It could be a good skeleton in term of sections/pages.
> > >
> > > IMHO, for Beam, at least for the home page, we should have:
> > > 1. a clear message about what Beam is from a user perspective: why
> > > should
> > > I use Beam and write pipelines, what's the value, etc. The runner
> > > writers,
> > > or DSL writers will find their resources but not on the homepage
> (on
> > > dedicated section of the website).
> > >
> > > In term of sections, we could propose
> > > 1.1. Overview (with the three perspective/type of users)
> > > 1.2. Libraries: SDKs, DSLs, IOs, Runners
> > > 1.3. Documentation: Dev Guide, Samples, Runners Writer guide, ...
> > > 1.4. Community: mailing list, contribution guide, ...
> > > 1.5. Apache (link to ASF)
> > >
> > > 2. a look'n feel should be clean and professional, at least for the
> > > home
> > > page.
> > >
> > > I would love to help here !
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 06/06/2016 05:29 PM, James Malone wrote:
> > >
> > > Hello everyone!
> > >
> > >>
> > >> The current design of the Apache Beam website[1] 

Re: [REMINDER] Technical discussion on the mailing list

2016-10-05 Thread Kenneth Knowles
This is a great idea. And it produces many easy starter tickets! :-)

On Wed, Oct 5, 2016 at 4:51 AM Jean-Baptiste Onofré  wrote:

> Hi team,
>
> I would like to excuse myself to have forgotten to discuss and share with
> you a technical point and generally speaking do a small reminder.
>
> When we work with Eugene on the JdbcIO, we experimented AutoValue to deal
> with IO configuration. AutoValue provides a nice way to reduce and limit
> the boilerplate code required by the IO configuration.
> We used AutoValue in JdbcIO and, regarding the good improvements we saw,
> we started to refactor the other IOs.
>
> The use of AutoValue should have been notice and discussed on the mailing
> list.
>
> "If it doesn't exist on the mailing list, it doesn't exist at all."
>
> So, any comment happening on a GitHub pull request, or discussion on
> hangouts which can impact the project (generally speaking) has to happen on
> the mailing list.
>
> It provides project transparency and facilitates the new contribution
> onboarding.
>
> Thanks !
>
> Regards
> JB
>
>
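
For anyone unfamiliar with the pattern, a minimal sketch of the AutoValue
approach used in JdbcIO (field names invented; the real IO has more fields
plus validation):

    import com.google.auto.value.AutoValue;

    @AutoValue
    abstract class ReadConfig {
      abstract String query();
      abstract int fetchSize();

      static Builder builder() {
        // Defaults live in one place; AutoValue generates the rest.
        return new AutoValue_ReadConfig.Builder().setFetchSize(1000);
      }

      @AutoValue.Builder
      abstract static class Builder {
        abstract Builder setQuery(String query);
        abstract Builder setFetchSize(int fetchSize);
        abstract ReadConfig build();
      }
    }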


Re: Apex Runner support for View.CreatePCollectionView

2016-09-15 Thread Kenneth Knowles
Hi Thomas,

The side inputs 1-pager is a forward-looking document for the design of
side inputs in Beam once the portability layers are completed. The current
SDK and implementations do not quite respect the same abstraction
boundaries, even though they are similar.

Here are some specifics about that 1-pager that I hope will help you right
now:

 - The purple cylinder that says "Runner materializes" corresponds to the
CreatePCollectionView transform. Eventually this should not appear in the
SDK or the pipeline representation, but today that is where you put your
logic to write to some runner-specific storage medium, etc.
 - This "Runner materializes" / "CreatePCollectionView" is consistent with
streaming, of course. When new data arrives, the runner makes the new side
input value available. Most of the View.asXYZ transforms have a GroupByKey
within them, so the triggering on the side input PCollection will regulate
this.
 - The red "RPC" boundary in the diagram will be part of the cross-language
Fn API. For today, that layer is not present, and it is the Java class
ViewFn on each PCollectionView. It takes an
Iterable and produces a ViewT.
 - If we were to use the existing ViewFns without modification, the
primitive "access_pattern" would be "iterable", not "multimap". Thus, the
access pattern does not support fetching an individual KV record
efficiently when the side input is large (when it is small, the map can be
built in memory and cached). As we move forwards, this should change.

And here are answers beyond the side input 1-pager:

 - The problem of expiry of the side input data is [BEAM-260]. The solution
is pretty easy, but I have been waiting to send out my proposal to solve it
with a [WindowMappingFn] since we have so many proposals already in flight.
I am sharing it now, here, since you brought it up again.
 - A good, though very large, reference is the recent addition of side
inputs to the Flink runner in [PR #737] by Aljoscha. In particular, it adds
[SideInputHandler] as a runner-independent way to build side inputs on top
of StateInternals. I suspect you would benefit from using this.

I hope this helps!

Kenn

[BEAM-260] https://issues.apache.org/jira/browse/BEAM-260
[WindowMappingFn] https://s.apache.org/beam-windowmappingfn-1-pager
[PR #737] https://github.com/apache/incubator-beam/pull/737
[SideInputHandler] https://github.com/apache/incubator-beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/SideInputHandler.java

On Thu, Sep 15, 2016 at 10:12 AM, Thomas Weise  wrote:

> Hi,
>
> I'm working on the Apex runner (
> https://github.com/apache/incubator-beam/pull/540) and based on the
> integration test results my next target is support for PCollectionView.
>
> I looked at the side inputs doc (
> https://s.apache.org/beam-side-inputs-1-pager) and see that a suggested
> implementation approach is RPC.
>
> Apex is a streaming engine where individual records flow through the
> pipeline and operators process data once it becomes available. Hence I'm
> also looking at side inputs as a stream vs. a call to fetch a specific
> record. But that would also require a ParDo operator to hold on to the side
> input state until it is no longer needed (based on expiry of the window)?
>
> I would appreciate your thoughts on this. Is there a good streaming based
> implementation to look at for reference? Also, any suggestions to break the
> support for side inputs into multiple tasks that can be taken up
> independently?
>
> Thanks!
> Thomas
>
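
From the pipeline author's point of view, what a runner must ultimately
support looks roughly like this (a sketch; words and counts are assumed
input PCollections):

    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    class SideInputExample {
      static PCollection<String> annotate(
          PCollection<String> words, PCollection<KV<String, Long>> counts) {
        // The runner materializes the view and must keep it available, per
        // main-input window, until that window expires.
        final PCollectionView<Map<String, Long>> countsView =
            counts.apply(View.<String, Long>asMap());
        return words.apply(
            ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void process(ProcessContext c) {
                Long n = c.sideInput(countsView).get(c.element());
                c.output(c.element() + ":" + (n == null ? 0 : n));
              }
            }).withSideInputs(countsView));
      }
    }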


Re: About Finishing Triggers

2016-09-14 Thread Kenneth Knowles
Caveat: I want to emphasize that I don't have a specific proposal. I
haven't thought through enough details to consider a proposal, or you would
have seen it already :-)

On Sep 14, 2016 5:14 AM, "Aljoscha Krettek"  wrote:
>
> Hi,
> I had a chat with Kenn at Flink Forward and he did an off-hand remark
about
> how it might be better if triggers where not allowed to mark a window as
> finished

Yes. I've seen many cases of accidental closing and intentional uses of
closing that were wrong, but very rarely intentional uses of closing that
were good.

When we chatted, I think I mentioned this pair of situations, which I
restate for the benefit of the thread and the list:

1. If a trigger's predicate is never satisfied, then we will emit buffered
results at window expiration against the trigger's wishes.
2. If a trigger "finishes" then we throw away the rest of the data.

I don't really like fact #2. It causes silly bugs in user pipelines. I only
can think of one good use case, which is to approximate some custom notion
of completeness other than the watermark.

If you compare, we think of allowed lateness as expressing when to drop
data. We don't think of "AfterEndOfWindow" as expressing when to drop data.
But a trigger that finishes to express a custom idea of completeness is
like the latter. And moving triggers to a DSL instead of a UDF further
reduces the applicability.

I think a good solution for custom notions of completeness should probably
have all pieces: (a) the measure of completeness (b) triggering based on it
in whatever way makes sense (c) a way of noting special levels of
completeness like the ON_TIME pane does for the watermark, and (d) a policy
for eventually considering the output complete enough that we can drop
further input. So that is a lot of different things to design carefully.

I also want to point out that this is not against phase transitions, such
as moving from early firings to late firing when the watermark reaches the
end of the window. That is like a "OnceTrigger" but it is helpful IMO to
separate in my mind the event that we are interested in from a trigger for
controlling output based on that event.

> and instead always be "Repeatedly" (if I understood correctly).

I don't mean necessarily to have this automatically wrapped, but also to
design the trigger DSL so triggers are all/mostly well-behaved. For
example, EOW + withEarlyFirings + withLateFirings is well crafted to make
only sensible things easy.

> Maybe you (Kenn) could go a bit more in depth about what you meant by this
> and if we should actually change this in Beam. Would this mean that we
then
> have the opposite of Repeatedly, i.e. Once, or Only.once(T)?

This is exactly a thought I have had sometimes - make it only possible to
use a OnceTrigger as the top level expression if it is explicitly
requested. This would at least quickly prevent the common pitfall.

But the OnceTrigger / Trigger split is not quite right to enforce this
restriction. Instead of distinguishing "at most once" from "any number of
times", we need to distinguish "finishes" and "never finishes". Or a
vocabulary I have started to favor in my head is "Predicate" and "Trigger"
where you have OnceTrigger via something like Once.at(Predicate) and
otherwise every other trigger you can construct will never lose data. I
actually have a branch sitting around with an experiment along these lines,
I think...

> I also noticed some inconsistencies in when triggers behave as repeated
> triggers and once triggers. For example, AfterPane.elementCountAtLeast(5)
> only fires once if used alone but it it fires repeatedly if used as the
> speculative trigger in
> AfterWatermark.pastEndOfWindow().withEarlyFirings(...). (This is true for
> all "once" triggers.)

This is actually by design. The early & late firings are automatically
repeated. But it is a good example to think about: if a user writes
.trigger(AfterCount(n)) they probably don't mean only once, but are
expecting it to fire whenever the predicate is satisfied. So, using the
vocabulary I mentioned, this example seems to encourage making an overload
.triggering(Predicate) the same as
.triggering(Repeatedly.forever(Predicate)). We can separate the Java
overloads/API question from the model, of course.

Kenn
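
To ground the discussion, the well-crafted composite mentioned above,
written out in the Java DSL (standard Beam windowing API; the specific
durations are arbitrary):

    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    // Assuming input is a PCollection<String>. Early and late firings are
    // automatically repeated, and the window is never "finished" early, so
    // no data can be silently dropped before the allowed lateness expires.
    input.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            .triggering(
                AfterWatermark.pastEndOfWindow()
                    .withEarlyFirings(
                        AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(Duration.standardSeconds(30)))
                    .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardMinutes(10))
            .accumulatingFiredPanes());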


Re: Remove legacy import-order?

2016-08-24 Thread Kenneth Knowles
+1 to import order

I don't care about actually enforcing formatting, but would add it to IDE
tips and just make it an "OK topic for code review". Enforcing it would
result in obscuring a lot of history for who to talk to about pieces of
code.

And by the way there is a recent build of the IntelliJ plugin for
https://github.com/google/google-java-format, available through the usual
plugin search functionality. I use it and it is very nice.

On Tue, Aug 23, 2016 at 11:26 PM, Aljoscha Krettek 
wrote:

> +1 on the import order
>
> +1 on also starting a discussion about enforced formatting
>
> On Wed, 24 Aug 2016 at 06:43 Jean-Baptiste Onofré  wrote:
>
> > Agreed.
> >
> > It makes sense for the import order.
> >
> > Regards
> > JB
> >
> > On 08/24/2016 02:32 AM, Ben Chambers wrote:
> > > I think introducing formatting should be a separate discussion.
> > >
> > > Regarding the import order: this PR demonstrates the change
> > > https://github.com/apache/incubator-beam/pull/869
> > >
> > > I would need to update the second part (applying optimize imports)
> prior
> > to
> > > actually merging.
> > >
> > > On Tue, Aug 23, 2016 at 5:08 PM Eugene Kirpichov
> > >  wrote:
> > >
> > >> Two cents: While we're at it, we could consider enforcing formatting
> as
> > >> well (https://github.com/google/google-java-format). That's a bigger
> > >> change
> > >> though, and I don't think it has checkstyle integration or anything
> like
> > >> that.
> > >>
> > >> On Tue, Aug 23, 2016 at 4:54 PM Dan Halperin
> > 
> > >> wrote:
> > >>
> > >>> yeah I think that we would be SO MUCH better off if we worked with an
> > >>> out-of-the-box IDE. We don't even distribute an IntelliJ/Eclipse
> config
> > >>> file right now, and I'd like to not have to.
> > >>>
> > >>> But, ugh, it will mess up ongoing PRs. I guess committers could fix
> > them
> > >> in
> > >>> merge, or we could just make proposers rebase. (Since committers are
> > most
> > >>> proposers, probably little harm in the latter).
> > >>>
> > >>> On Tue, Aug 23, 2016 at 4:11 PM, Jesse Anderson <
> je...@smokinghand.com
> > >
> > >>> wrote:
> > >>>
> >  Please. That's the one that always trips me up.
> > 
> >  On Tue, Aug 23, 2016, 4:10 PM Ben Chambers 
> > >> wrote:
> > 
> > > When Beam was contributed it inherited an import order [1] that was
> >  pretty
> > > arbitrary. We've added org.apache.beam [2], but continue to use
> this
> > > ordering.
> > >
> > > Both Eclipse and IntelliJ default to grouping imports into
> alphabetic
> > > order. I think it would simplify development if we switched our
> >  checkstyle
> > > ordering to agree with these IDEs. This also removes special
> > >> treatment
> >  for
> > > specific packages.
> > >
> > > If people agree, I'll send out a PR that changes the checkstyle
> > > configuration and runs IntelliJ's sort-imports on the existing
> files.
> > >
> > > -- Ben
> > >
> > > [1]
> > > org.apache.beam,com.google,android,com,io,Jama,junit,net,org,sun,java,javax
> > > [2] com.google,android,com,io,Jama,junit,net,org,sun,java,javax
> > >
> > 
> > >>>
> > >>
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
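
An illustration of the difference (the grouping shown is plain
alphabetical, as an example of an IDE default; exact defaults vary by IDE):

    // Old checkstyle order: org.apache.beam, com.google, android, com, io, ...
    // With the proposed change, groups simply sort alphabetically, which is
    // what IDEs produce out of the box (example imports only):
    import com.google.common.collect.ImmutableList;

    import java.util.List;

    import javax.annotation.Nullable;

    import org.apache.beam.sdk.Pipeline;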


Re: Configuring IntelliJ to enforce checkstyle rules

2016-08-24 Thread Kenneth Knowles
Nice step-by-step.

+1 to adding tips for particular IDEs in the contribution guide.

On Wed, Aug 24, 2016 at 7:48 AM, Jean-Baptiste Onofré 
wrote:

> Hi Stas,
>
> Thanks for sharing !
>
> As discussed with Amit on Hangout (and indirectly with you ;)), it's what
> I'm using in my config.
>
> Some stuff that I added on IntelliJ:
> - in Editor -> Code Style -> Java -> Tabs and Indents, I disabled "Use tab
> character", and defined 2 for "Tab Size" and "Indent", and 4 for
> "Continuation indent".
> - in Editor -> Code Style -> Java -> Imports, I changed the layout to
> match the checkstyle definition (static first, org.apache.beam.*,
> com.google.*, ...)
> - in Editor -> Code Style -> Java -> Wrapping and Braces, I enabled
> "Ensure right margin is not exceeded"
>
> I also enabled checkstyle in code inspection, and the checkstyle button
> (next to the terminal button in the button bar).
>
> Related to the discussion about checkstyle, I think it makes sense to add
> a section about "Configuring IDE" in the contribution guide.
>
> WDYT ?
>
> Regards
> JB
>
>
> On 08/24/2016 04:35 PM, Stas Levin wrote:
>
>> Hi guys,
>>
>> Having IntelliJ enforce Beam's Checkstyle rules turned out to be very
>> useful for me, so I figured I'd share the steps just in case.
>>
>>1. Install the Checkstyle plugin
>>   1. Select the *Plugins* menu from the *Preferences* (Cmd+",")
>>   2. Click "*Browse Repositories*"
>>   3. Type "*Checkstyle-IDEA*" in the search box and select the search
>>   result
>>   4. Click "*install*"
>>   5. *Restart* Intellij
>>2. Activate Beam's checkstyle rules
>>   1. Select the newly added "*Checkstyle*" menu from the Preferences
>>   (Cmd+",") menu
>>   2. Hit the "*+*" icon to add a checkstyle file, *select Beam's
>>   checkstyle.xml*
>>   "..incubator-beam/sdks/java/build-tools/src/main/resources/beam/checkstyle.xml"
>>   3. Check the "*Active*" checkbox to activate
>>3. Done!
>>
>> Hope you find this useful.
>>
>> Regards,
>> Stas
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [PROPOSAL] Website page or Jira to host all current proposal discussion and docs

2016-08-09 Thread Kenneth Knowles
I didn't have a specific rubric, but here are some factors:

 - Impact on users
 - Impact on other devs (while we are incubating, this is possibly a big
deal)
 - Backwards compatibility (not that important until stable release if it
is minor)
 - Amount of detail needed to understand the proposal
 - Whether the proposal needs multiple re-readings to understand thoroughly
 - Whether the proposal will take a while to implement, or is basically a
one-PR thing

I think any of these is enough to consider a BIP. I'm sure others will
think of other considerations.

All my "no" answers are pretty mild on all categories IMO. Most of the
"yes" answers are heavy in more than one.

So actually I didn't specifically consider whether it was a model change,
but only the impact on users and backwards compatibility. For your example
of PipelineResult#waitToFinish, if we had a stable release then I would
have said "yes" for these reasons.

The "maybe" answers were all testing infrastructure, because they take a
while to complete and have high impact on development processes. But now
that I write these criteria down, I would change the "maybe" answers to
"no".

Thoughts?

Kenn

On Tue, Aug 9, 2016 at 1:15 AM, Ismaël Mejía <ieme...@gmail.com> wrote:

> Kenn, just to start the discussion, what was your criteria to decide what
> proposals are worth been a BIP ?
>
> I can clearly spot the most common case to create a BIP:  Changes to the
> model / SDK (this covers most of the 'yes' in your list, with the exception
> of Pipeline#waitToFinish).
>
> Do you guys have ideas for other criteria ? (e.g. are new runners and DSLs
> worth a BIP ?, or do Infrastructure issues deserve a BIP ?).
>
> Ismael
>
>
> On Mon, Aug 8, 2016 at 10:05 PM, Kenneth Knowles <k...@google.com.invalid>
> wrote:
>
> > +1 to the overall idea, though I would limit it to large and/or long-term
> > proposals.
> >
> > I like:
> >
> >  - JIRA for tracking: that's what it does best.
> >  - Google Docs for detailed commenting and revision - basically a wiki
> with
> > easier commenting
> >  - Beam site page for process description and list of current "BIPs",
> just
> > a one liner and a link to JIRA. A proposal to dev@beam could include a
> > link
> > to a PR against the asf-site to add the BIP. However, I would agree with
> > the counter-argument that this could just be a JIRA component or tag.
> > Either one works for me. Or a page with the process that links to a JIRA
> > saved search. The more formal list mostly just makes it even more
> visible,
> > right?
> >
> > I think that the number can be small. Here are examples scraped from the
> > mailing list archives (in random order) and whether I would use a "BIP":
> >
> >  - Runner API: yes
> >  - Serialization tech: no
> >  - Dynamic parameters: yes
> >  - Splittable DoFn: yes
> >  - Scio: yes
> >  - Pipeline#waitToFinish(), etc: no
> >  - DoFn setup / teardown: yes
> >  - State & Timers: yes
> >  - Pipeline job naming changes: no
> >  - CoGBK as primitive: yes
> >  - New website design: no
> >  - new DoFn: yes
> >  - Cluster infrastructure for tests: maybe
> >  - Beam recipes: no
> >  - Two spark runners: no
> >  - Nightly builds by Jenkins: maybe
> >
> > When I write them all down it really is a lot :-)
> >
> > Of course, the first thing that could be discussed in a [PROPOSAL] thread
> > would be whether to file a "BIP".
> >
> > Kenn
> >
> > On Mon, Aug 8, 2016 at 8:29 AM, Lukasz Cwik <lc...@google.com.invalid>
> > wrote:
> >
> > > +1 for the cwiki approach that Aljoscha and Ismael gave examples of.
> > >
> > > On Mon, Aug 8, 2016 at 2:57 AM, Ismaël Mejía <ieme...@gmail.com>
> wrote:
> > >
> > > > +1 for a more formal "Improvement Proposals" with ids we can refer
> to:
> > > >
> > > > like Flink does too:
> > > > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> > > >
> > > >
> > > > On Mon, Aug 8, 2016 at 10:14 AM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > > > wrote:
> > > >
> > > > > Same thing at Karaf: https://cwiki.apache.org/confluence/display/KARAF/
> > > > >
> > > > > Combine with Jira.
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > >
> > > > > On 08/08/201

Re: [PROPOSAL] Having 2 Spark runners to support Spark 1 users while advancing towards better streaming implementation with Spark 2

2016-08-04 Thread Kenneth Knowles
+1

I definitely think it is important to support spark 1 and 2 simultaneously,
and I agree that side-by-side seems the best way to do it. I'll refrain
from commenting on the specific technical aspects of the two runners and
focus just on the split: I am also curious about the answer to Dan's
question about what code is likely to be shared, if any.

On Thu, Aug 4, 2016 at 9:40 AM, Dan Halperin 
wrote:

> Can they share any substantial code? If not, they will really be separate
> runners.
>
> If so, would it make more sense to fork into runners/spark and
> runners/spark2?
>
>
>
> On Thu, Aug 4, 2016 at 9:33 AM, Ismaël Mejía  wrote:
>
> > +1
> >
> > In particular for three reasons:
> >
> > 1. The new DataSet API in spark 2 and the new semantics it allows for the
> > runner (and the effect that we cannot retro port this to the spark 1
> > runner).
> > 2. The current performance regressions in spark 2 (another reason to keep
> > the spark 1 runner).
> > 3. The different dependencies between spark versions (less important but
> > also a source of runtime conflicts).
> >
> > Just two points:
> > 1.  Considering the alpha state of the Structured Streaming API and the
> > performance regressions I consider that it is important to preserve the
> > previous TransformTranslator in the spark 2 runner, at least until spark
> 2
> > releases some stability fixes.
> > 2. Porting Read.Bound to the spark 1 runner is a must, we must guarantee
> > the same IO compatibility in both runners to make this ‘split’ make
> sense.
> >
> > Negative points of the proposal:
> > - More maintenance work + tests to do, but still worth at least for some
> > time given the current state.
> >
> > Extra comments:
> >
> > - This means that we will have two compatibility matrix columns now (at
> > least while we support spark 1) ?
> > - We must probably make clear for users the advantages/disadvantages of
> > both versions of the runner, and make clear that the spark 1 runner will
> be
> > almost on maintenance mode (with not many new features).
> > - We must also decide later on to deprecate the spark 1 runner, this will
> > depend in part of the feedback from users + the progress/adoption of
> spark
> > 2.
> >
> > Ismaël
> >
> > On Thu, Aug 4, 2016 at 8:39 AM, Amit Sela  wrote:
> >
> > > After discussions with JB, and understanding that a lot of companies
> > > running Spark will probably run 1.6.x for a while, we thought it would
> > be a
> > > good idea to have (some) support for both branches.
> > >
> > > The SparkRunnerV1 will mostly support Batch, but could also support
> > > “KeyedState” workflows and Sessions. As for streaming, I suggest to
> > > eliminate the awkward
> > > <https://github.com/apache/incubator-beam/tree/master/runners/spark#streaming>
> > > way it uses Beam Windows, and only support Processing-Time windows.
> > >
> > > The SparkRunnerV2 will have a batch/streaming support relying on
> > Structured
> > > Streaming and the functionality it provides, and will provide in the
> > > future, to support the Beam model best as it can.
> > >
> > > The runners will exist under “runners/spark/spark1” and
> > > “runners/spark/spark2”.
> > >
> > > If this proposal is accepted, I will change JIRA tickets according to a
> > > proposed roadmap for both runners.
> > >
> > > General roadmap:
> > >
> > >
> > > SparkRunnerV1 should mostly “cleanup” and get rid of the
> Window-mocking,
> > > while specifically declaring Unsupported where it should.
> > >
> > > Additional features:
> > >
> > >1.
> > >
> > >Read.Bound support - actually supported in the SparkRunnerV2 branch
> > that
> > >is at work and it already passed some tests by JB and Ismael from
> > > Talend.
> > >I’ve also asked Michael Armbrust from Apache Spark to review this,
> and
> > > once
> > >it’s all set I’ll backport it to V1 as well.
> > >2.
> > >
> > >Consider support for “Keyed-State”.
> > >3.
> > >
> > >Consider support for “Sessions”
> > >
> > >
> > > SparkRunnerV2 branch <https://github.com/apache/incubator-beam/pull/495>
> > > is
> > > at work right now and I hope to have it out supporting (some)
> event-time
> > > windowing, triggers and accumulation modes for streaming.
> > >
> > > Thanks,
> > > Amit
> > >
> >
>


Re: [PROPOSAL] Pipeline Runner API design doc

2016-08-02 Thread Kenneth Knowles
Hi,

Yes, there are a few things "TODO" including aggregators and triggers.
Triggers can either be an inline syntax tree or flattened, using
"pointers" like the transforms and PCollections. With coders we've hit
issues with nesting and repetition that lead us to keep them flattened,
but a flattened representation is essentially unreadable without
un-flattening, so I would keep things un-flattened if we weren't worried
about those issues.

Kenn

On Tue, Aug 2, 2016 at 7:22 PM, Aljoscha Krettek <aljos...@apache.org>
wrote:

> Hi,
> thanks for putting this together. Now that I'm seeing them side by side I
> think the Avro schema looks a lot nicer than the JSON schema but it's
> probably alright since we don't want to change this often (as you already
> said). The advantage of JSON is that the (intermediate) plans can easily be
> inspected by humans.
>
> I think at this stage there is not much left to discuss on the plan
> representation. To me it seems pretty straightforward what has to be in
> there and that is already more or less in. The only real thing missing are
> triggers but there isn't yet a discussion about how that is going to work
> out, correct?
>
> Cheers,
> Aljoscha
>
> On Thu, 14 Jul 2016 at 21:34 Kenneth Knowles <k...@google.com.invalid>
> wrote:
>
> > Hi everyone,
> >
> > I wanted to circle back on this thread and with another invitation to a
> > discussion. Work on the high level refactorings to align the Java SDK
> with
> > the primitives of the proposed model is pretty far along, as is moving
> out
> > the stuff that we don't want in the user-facing SDK.
> >
> > Since our runners are all Java-based, and we tend to discuss the model in
> > Java first, I think part of the proposal that may have received less
> > attention was the concrete Avro schema towards the bottom of the doc.
> Since
> > our serialization tech discussion seemed to favor JSON on the front end,
> I
> > just spent a few minutes to port the Avro schema to a JSON schema and do
> > some project set up to demonstrate where & how it would incorporate into
> > the project structure. I'd done the same for Avro previously, so we can
> see
> > how they compare.
> >
> > I put the code in a PR, for discussion only at this point, at
> > https://github.com/apache/incubator-beam/pull/662. I'd love if you took
> a
> > look at the notes on the PR and briefly at the schema; I'll continue to
> > evolve it according to current & future feedback.
> >
> > Kenn
> >
> > On Wed, Mar 23, 2016 at 2:17 PM, Kenneth Knowles <k...@google.com> wrote:
> >
> > > Hi everyone,
> > >
> > > Incorporating the feedback from the 1-pager I circulated a week ago, I
> > > have put together a concrete design document for the new API(s).
> > >
> > >
> > >
> >
> https://docs.google.com/document/d/1bao-5B6uBuf-kwH1meenAuXXS0c9cBQ1B2J59I3FiyI/edit?usp=sharing
> > >
> > > I appreciate any and all feedback on the design.
> > >
> > > Kenn
> > >
> >
>


Re: Suggestion for Writing Sink Implementation

2016-07-28 Thread Kenneth Knowles
What I said earlier is not quite accurate, though my advice is the same.
Here are the corrections:

 - The Write transform actually has a too-general name, and Write.of(Sink)
only really works for finite data. It re-windows into the global window and
replaces any triggers.
 - So the special case in the Flink runner actually just _enables_ a (fake)
Sink to work.

We should probably rename Write to some more specific name that indicates
the particular strategy, and make it easier for a user to decide whether
that pattern is what they want. And the transform as-is should probably
reject unbounded inputs.

So you should still proceed with implementation via ParDo and your own
logic. If you want some logic similar to Write (but with different
windowing and triggering) then it is a pretty simple composite to derive
something from.
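
For the ParDo approach, the usual shape is a DoFn that opens a connection in
startBundle and flushes/closes it in finishBundle. A rough sketch (for
illustration only; "SinkClient" and its connect/write methods are
hypothetical placeholders for a real Cassandra/Titan client, not Beam API):

import org.apache.beam.sdk.transforms.DoFn;

class WriteViaParDoFn extends DoFn<String, Void> {
  private final String connectionString;
  private transient SinkClient client; // hypothetical client handle

  WriteViaParDoFn(String connectionString) {
    this.connectionString = connectionString;
  }

  @Override
  public void startBundle(Context c) throws Exception {
    client = SinkClient.connect(connectionString); // open once per bundle
  }

  @Override
  public void processElement(ProcessContext c) throws Exception {
    client.write(c.element()); // write one element
  }

  @Override
  public void finishBundle(Context c) throws Exception {
    client.close(); // flush and release per-bundle resources
  }
}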

On Thu, Jul 28, 2016 at 12:37 PM, Chawla,Sumit <sumitkcha...@gmail.com>
wrote:

> Thanks Jean
>
> This is an interesting pattern here.  I see that its implemented as
> PTransform, with constructs ( WriteOperation/Writer)  pretty similar to
> Sink interface.  Would love to hear more pros/cons of this pattern :) .
> Definitely it gives more control over connection initialization and
> cleanup.
>
> Regards
> Sumit Chawla
>
>
> On Thu, Jul 28, 2016 at 12:20 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> > Hi Sumit,
> >
> > I created a PR containing Cassandra IO with a sink:
> >
> > https://github.com/apache/incubator-beam/pull/592
> >
> > Maybe it can help you.
> >
> > Regards
> > JB
> >
> >
> > On 07/28/2016 09:00 PM, Chawla,Sumit  wrote:
> >
> >> Hi Kenneth
> >>
> >> Thanks for looking into it. I am currently trying to implement Sinks for
> >> writing data into Cassandra/Titan DB.  My immediate goal is to run it on
> >> Flink Runner.
> >>
> >>
> >>
> >> Regards
> >> Sumit Chawla
> >>
> >>
> >> On Thu, Jul 28, 2016 at 11:56 AM, Kenneth Knowles
> <k...@google.com.invalid
> >> >
> >> wrote:
> >>
> >> Hi Sumit,
> >>>
> >>> I see what has happened here, from that snippet you pasted from the
> Flink
> >>> runner's code [1]. Thanks for looking into it!
> >>>
> >>> The Flink runner today appears to reject Write.Bounded transforms in
> >>> streaming mode if the sink is not an instance of UnboundedFlinkSink.
> The
> >>> intent of that code, I believe, was to special case UnboundedFlinkSink
> to
> >>> make it easy to use an existing Flink sink, not to disable all other
> >>> Write
> >>> transforms. What do you think, Max?
> >>>
> >>> Until we fix this issue, you should use ParDo transforms to do the
> >>> writing.
> >>> If you can share a little about your sink, we may be able to suggest
> >>> patterns for implementing it. Like Eugene said, the Write.of(Sink)
> >>> transform is just a specialized pattern of ParDo's, not a Beam
> primitive.
> >>>
> >>> Kenn
> >>>
> >>> [1]
> >>>
> >>>
> >>>
> https://github.com/apache/incubator-beam/blob/master/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/translation/FlinkStreamingTransformTranslators.java#L203
> >>>
> >>>
> >>> On Wed, Jul 27, 2016 at 5:57 PM, Eugene Kirpichov <
> >>> kirpic...@google.com.invalid> wrote:
> >>>
> >>> Thanks Sumit. Looks like your question is, indeed, specific to the
> Flink
> >>>> runner, and I'll then defer to somebody familiar with it.
> >>>>
> >>>> On Wed, Jul 27, 2016 at 5:25 PM Chawla,Sumit <sumitkcha...@gmail.com>
> >>>> wrote:
> >>>>
> >>>> Thanks a lot Eugene.
> >>>>>
> >>>>> My immediate requirement is to run this Sink on FlinkRunner. Which
> >>>>> mandates that my implementation must also implement SinkFunction<>.
> >>>>> In that case, none of the Sink<> methods get called anyway.
> >>>>>
> >>>>> I am using FlinkRunner. The Sink implementation that I was writing by
> >>>>> extending the Sink<> class had to implement the Flink-specific
> >>>>> SinkFunction for the correct translation.

[PROPOSAL] A brand new DoFn

2016-07-26 Thread Kenneth Knowles
Hi all,

I have a major new feature to propose: the next generation of DoFn.

It sounds a bit grandiose, but I think it is the best way to understand the
proposal.

This is strongly motivated by the design for state and timers, aka "per-key
workflows". Since the two features are separable and have separate design
docs, I have started a separate thread for each.

To get a quick overview of the proposal for a new DoFn, and how it improves
upon the flexibility and validation of DoFn, browse this presentation:

  https://s.apache.org/presenting-a-new-dofn

Due to the extent of this proposal, Ben & I have also prepared an in-depth
document at https://s.apache.org/a-new-dofn with additional details. Please
comment on particulars there, or just reply to this email.

The remainder of this email is yet another summary of the proposal, to
entice you to read the documents above and respond with a "+1".

This is a feature that has been an experimental feature of the Java SDK for
some time, under the name DoFnWithContext. For the purposes of this email
and the linked documents, I will call it NewDoFn and I will call the status
quo OldDoFn.

The differences between NewDoFn and and OldDoFn are most easily understood
with a quick code snippet:

new OldDoFn() {
  @Override
  public void processElement(ProcessContext c) { … }
}

new NewDoFn() {
  @ProcessElement   // <-- This is the only difference
  public void processElement(ProcessContext c) { … }
}

What changed? NewDoFn uses annotation-based dispatch instead of method
overrides. The method annotated with @ProcessElement is used to process
elements. It can have any name or signature, and validation is performed at
pipeline construction time.
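
To make the annotation-based dispatch concrete, here is a minimal sketch (not
the SDK's actual implementation; the placeholder @ProcessElement annotation
stands in for the real one) of how the annotated method can be located and
validated reflectively at construction time:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface ProcessElement {}

class FnDispatch {
  // Find the single @ProcessElement method, failing at pipeline
  // construction time rather than at execution time.
  static Method findProcessElementMethod(Class<?> fnClass) {
    Method found = null;
    for (Method method : fnClass.getDeclaredMethods()) {
      if (method.isAnnotationPresent(ProcessElement.class)) {
        if (found != null) {
          throw new IllegalArgumentException(
              fnClass.getName() + " has more than one @ProcessElement method");
        }
        found = method;
      }
    }
    if (found == null) {
      throw new IllegalArgumentException(
          fnClass.getName() + " has no @ProcessElement method");
    }
    // The parameter list of the found method can then be checked against
    // the supported parameter types (ProcessContext, BoundedWindow, ...).
    return found;
  }
}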

Why do this? It allows the argument list for processElement to change. This
approach gives NewDoFn many advantages, which are demonstrated at length in
the linked documents. Here are some highlights:

 - Simpler backwards-compatible approaches to new features
 - Simpler composition of advanced features
 - Greater pipeline construction-time validation
 - Easier evolution of a simple anonymous DoFn into one that uses advanced
features

Here are some abbreviated demonstrations of things that work today or could
work easily with NewDoFn but require complex interrelated designs without
it:

Access the element's window:

new NewDoFn() {
  @ProcessElement
  public void processElement(ProcessContext c, BoundedWindow w) { … }
}

Use persistent state:

new NewDoFn() {
  @ProcessElement
  public void processElement(
  ProcessContext c,
  @StateId("cell-id") ValueState state) {
…
  }
}

Set and receive timers:

new NewDoFn() {
  @ProcessElement
  public void processElement(
  ProcessContext c,
  @TimerId("timer-id") Timer state) {
…
  }

  @OnTimer("timer-id")
  void onMyTimer(OnTimerContext c) { … }
}

Receive a side input as a parameter:

new NewDoFn() {
  @ProcessElement
  public void processElement(
  ProcessContext c,
  @SideInput Supplier side) {
…
  }
}

So this is what I am proposing: We should move the Beam Java SDK to NewDoFn!

My proposed migration plan is:

1. leave a git tag before anything, so users can pin to it
2. mv DoFn OldDoFn && mv DoFnWithContext DoFn
3. get everything working with all runners
4. rm OldDoFn # a few weeks later

This will affect bleeding edge users, who will need to replace @Override
with @ProcessElement in all their DoFns. They can also pin to a commit
prior to the change or temporarily replace DoFn with OldDoFn everywhere.

I've already done step 2 in a branch at
https://github.com/kennknowles/incubator-beam/DoFnWithContext and ported a
few examples in their own commits. If you view those commits, you can see
how simple the migration path is.

Please let me know what you think. It is a big change, but one that I think
yields pretty nice rewards.

Kenn


[PROPOSAL] State and Timers for DoFn (aka per-key workflows)

2016-07-26 Thread Kenneth Knowles
Hi everyone,


I would like to offer a proposal for a much-requested feature in Beam:
Stateful processing in a DoFn. Please check out and comment on the proposal
at this URL:


  https://s.apache.org/beam-state


This proposal includes user-facing APIs for persistent state and timers.
Together, these provide rich capabilities that have been called "per-key
workflows", the subject of [BEAM-23].


Note that this proposal has an important prerequisite: a new design for
DoFn. The new DoFn is strongly motivated by this design for state and
timers, but we should discuss it separately. I will start a separate thread
for that.


On this email thread, I'd like to try to focus the discussion on state &
timers. And of course, please do comment on the particulars in the document.


Kenn


[BEAM-23] https://issues.apache.org/jira/browse/BEAM-23


Re: Jenkins build is still unstable: beam_PostCommit_RunnableOnService_SparkLocal #12

2016-07-25 Thread Kenneth Knowles
Looks like it didn't take. I don't think it can be done via the maven
command line. I think you may need to put this into the system properties
section of the pom for it to get plumbed in the needed way. In searching
about, I noticed that it is an internal system property, not documented
(why not?), so we might also set spark.ui.port=0 to just get an arbitrary
unused port.
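
For reference, the wiring might look roughly like this in the job's pom (a
sketch, assuming the tests are forked by the surefire plugin, whose
systemPropertyVariables element is the usual way to plumb system properties
into test JVMs):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <systemPropertyVariables>
      <!-- Disable the Spark UI in tests; alternatively set spark.ui.port
           to 0 to let Spark pick an arbitrary free port. -->
      <spark.ui.enabled>false</spark.ui.enabled>
    </systemPropertyVariables>
  </configuration>
</plugin>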

On Mon, Jul 25, 2016 at 9:48 AM, Dan Halperin 
wrote:

> Done. We'll see if that fixes things. If not, I'll turn off the build until
> I have more cycles to get it fixed up.
>
> Thanks Amit.
>
> On Sat, Jul 23, 2016 at 5:16 AM, Amit Sela  wrote:
>
> > Not sure what's the setup here, but there seems to be issues with the
> ports
> > for the UI.
> > Generally we don't need it for tests so you could add
> > -Dspark.ui.enabled=false
> > to the executing command.
> >
> > Thanks,
> > Amit
> >
> > -- Forwarded message -
> > From: Apache Jenkins Server 
> > Date: Sat, Jul 23, 2016 at 3:10 PM
> > Subject: Jenkins build is still unstable:
> > beam_PostCommit_RunnableOnService_SparkLocal #12
> > To: , , <
> > je...@smokinghand.com>
> >
> >
> > See <
> >
> >
> https://builds.apache.org/job/beam_PostCommit_RunnableOnService_SparkLocal/12/
> > >
> >
>


Re: Getting started with contribution

2016-07-23 Thread Kenneth Knowles
Hi Minudika,

Happy to hear from you!

We have labels in JIRA that you can look at to find starter tasks that
interest you. The tags are "starter" and "newbie" and "easyfix". They all
mean the same thing.

Take a look and if you find something that sounds interesting, comment on
the ticket and we can give it to you.

Kenn

On Jul 23, 2016 15:26, "Minudika Malshan"  wrote:

> Hi all,
>
> I am interested in contributing to Apache Beam project. I am going through
> the JIRRA issues hoping to find some issues to resolve as a start.
> It would be a great help if you could direct me to some issues which are
> understandable to a novice to this project and useful for getting started
> with the project.
>
> Thanks a lot.
> Minudika
>


Re: [Proposal] Add waitToFinish(), cancel(), waitToRunning() to PipelineResult.

2016-07-21 Thread Kenneth Knowles
On Thu, Jul 21, 2016 at 1:42 PM, Robert Bradshaw <
rober...@google.com.invalid> wrote:
>
> (Totally backwards incompatible, we could call this p.launch() for
> clarity, and maybe keep run() as run() { return
> p.launch().waitUntilFinish(); }.)
>

I must say this reads really well.


Re: [Proposal] Add waitToFinish(), cancel(), waitToRunning() to PipelineResult.

2016-07-21 Thread Kenneth Knowles
Some more comments:

 - What are the allowed/expected state transitions prior to RUNNING? Today,
I presume it is any nonterminal state, so it can be UNKNOWN or STOPPED
(which really means "not yet started") prior to RUNNING. Is this what we
want?

 - If a job can be paused, a transition from RUNNING to STOPPED, then
waitUntilPaused(Duration) makes sense.

 - Assuming there is some polling under the hood, are runners required to
send back a full history of transitions? Or can transitions be missed, with
only the latest state retrieved?

 - If the latter, then does waitUntilRunning() only wait until RUNNING or
does it also return when it sees STOPPED, which could certainly indicate
that the job transitioned to RUNNING then STOPPED in between polls. In that
case it is, today, the same as waitUntilStateIsKnown().

 - The obvious limit of this discussion is waitUntilState(Duration,
Set<State>), which is the same amount of work to implement. Am I correct
that everyone in this thread thinks this generality is just not the right
thing for a user API?

 - This enum could probably use revision. I'd choose some combination of
tightening the enum, making it extensible, and making some aspect of it
free-form. Not sure where the best balance lies.
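
For concreteness, the surface under discussion is roughly the following (a
sketch of one possible shape, not a settled API; Duration is Joda time, and
the state names are today's enum values):

public interface PipelineResult {
  State getState();
  State cancel() throws IOException;
  State waitUntilRunning(Duration duration);
  State waitUntilFinish(Duration duration);

  // The enum in question; as noted above, it could use revision.
  enum State { UNKNOWN, STOPPED, RUNNING, DONE, FAILED, CANCELLED, UPDATED }
}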



On Thu, Jul 21, 2016 at 12:47 PM, Ben Chambers <bchamb...@google.com.invalid
> wrote:

> (Minor Issue: I'd propose waitUntilDone and waitUntilRunning rather than
> waitToRunning which reads oddly)
>
> The only reason to separate submission from waitUntilRunning would be if
> you wanted to kick off several pipelines in quick succession, then wait for
> them all to be running. For instance:
>
> PipelineResult p1Future = p1.run();
> PipelineResult p2Future = p2.run();
> ...
>
> p1Future.waitUntilRunning();
> p2Future.waitUntilRunning();
> ...
>
> In this setup, you can more quickly start several pipelines, but your main
> program would wait and report any errors before exiting.
>
> On Thu, Jul 21, 2016 at 12:41 PM Robert Bradshaw
> <rober...@google.com.invalid> wrote:
>
> > I'm in favor of the proposal. My only question is whether we need
> > PipelineResult.waitToRunning(), instead I'd propose that run() block
> > until the pipeline's running/successfully submitted (or failed). This
> > would simplify the API--we'd only have one kind of wait that makes
> > sense in all cases.
> >
> > What kinds of interactions would one want to have with the
> > PipelineResults before it's running?
> >
> > On Thu, Jul 21, 2016 at 12:24 PM, Thomas Groh <tg...@google.com.invalid>
> > wrote:
> > > TestPipeline is probably the one runner that can be expected to block,
> as
> > > certainly JUnit tests and likely other tests will run the Pipeline, and
> > > succeed, even if the PipelineRunner throws an exception. Luckily, this
> > can
> > > be added to TestPipeline.run(), which already has additional behavior
> > > associated with it (currently regarding the unwrapping of
> > AssertionErrors)
> > >
> > > On Thu, Jul 21, 2016 at 11:40 AM, Kenneth Knowles
> <k...@google.com.invalid
> > >
> > > wrote:
> > >
> > >> I like this proposal. It makes pipeline.run() seem like a pretty
> normal
> > >> async request, and easy to program with. It removes the implicit
> > assumption
> > >> in the prior design that main() is pretty much just "build and run a
> > >> pipeline".
> > >>
> > >> The part of this that I care about most is being able to write a
> program
> > >> (not the pipeline, but the program that launches one or more
> pipelines)
> > >> that has reasonable cross-runner behavior.
> > >>
> > >> One comment:
> > >>
> > >> On Wed, Jul 20, 2016 at 3:39 PM, Pei He <pe...@google.com.invalid>
> > wrote:
> > >> >
> > >> > 4. PipelineRunner.run() should (but not required) do non-blocking
> runs
> > >> >
> > >>
> > >> I think we can elaborate on this a little bit. Obviously there might
> be
> > >> "blocking" in terms of, say, an HTTP round-trip to submit the job, but
> > >> run() should never be non-terminating.
> > >>
> > >> For a test runner that finishes the pipeline quickly, I would be fine
> > with
> > >> run() just executing the pipeline, but the PipelineResult should still
> > >> emulate the usual - just always returning a terminal status. It would
> be
> > >> annoying to add waitToFinish() to the end of all our tests, but
> leaving
> > a
> > >> run() makes the tests only work with special blocking runner wrappers
> > (and
> > >> make them poor examples). A JUnit @Rule for test pipeline would hide
> all
> > >> that, perhaps.
> > >>
> > >>
> > >> Kenn
> > >>
> >
>


Re: [Proposal] Add waitToFinish(), cancel(), waitToRunning() to PipelineResult.

2016-07-21 Thread Kenneth Knowles
I like this proposal. It makes pipeline.run() seem like a pretty normal
async request, and easy to program with. It removes the implicit assumption
in the prior design that main() is pretty much just "build and run a
pipeline".

The part of this that I care about most is being able to write a program
(not the pipeline, but the program that launches one or more pipelines)
that has reasonable cross-runner behavior.

One comment:

On Wed, Jul 20, 2016 at 3:39 PM, Pei He  wrote:
>
> 4. PipelineRunner.run() should (but not required) do non-blocking runs
>

I think we can elaborate on this a little bit. Obviously there might be
"blocking" in terms of, say, an HTTP round-trip to submit the job, but
run() should never be non-terminating.

For a test runner that finishes the pipeline quickly, I would be fine with
run() just executing the pipeline, but the PipelineResult should still
emulate the usual - just always returning a terminal status. It would be
annoying to add waitToFinish() to the end of all our tests, but leaving a
run() makes the tests only work with special blocking runner wrappers (and
make them poor examples). A JUnit @Rule for test pipeline would hide all
that, perhaps.


Kenn


[KUDOS] Contributed runner: Gearpump!

2016-07-20 Thread Kenneth Knowles
Hi all,

I would like to call attention to a huge contribution to Beam: a runner for
Apache Gearpump (incubating).

The runner landed on the gearpump-runner feature branch today. Check it
out! And contribute towards getting it ready for the master branch :-)

Please join me in special congratulations and thanks to Manu Zhang who
worked through everything on PR #323.

Kenn


Re: [PROPOSAL] CoGBK as primitive transform instead of GBK

2016-07-20 Thread Kenneth Knowles
+1

I assume that the intent is for the semantics of both GBK and CoGBK to be
unchanged, just swapping their status as primitives.

This seems like a good change, with strictly positive impact on users and
SDK authors, with only an extremely minor burden (doing an insertion of the
provided implementation in the worst case) on runner authors.
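
For concreteness, GBK as a composite over CoGBK could look roughly like this
(a sketch only, with generics simplified and imports from
org.apache.beam.sdk.transforms.join elided; not the proposed patch):

static <K, V> PCollection<KV<K, Iterable<V>>> groupViaCoGbk(
    PCollection<KV<K, V>> input) {
  final TupleTag<V> tag = new TupleTag<V>();
  return KeyedPCollectionTuple.of(tag, input)
      .apply(CoGroupByKey.<K>create())
      .apply(ParDo.of(new DoFn<KV<K, CoGbkResult>, KV<K, Iterable<V>>>() {
        @Override
        public void processElement(ProcessContext c) {
          // With a single tagged input, getAll(tag) is exactly the
          // grouped values for this key.
          c.output(KV.of(
              c.element().getKey(), c.element().getValue().getAll(tag)));
        }
      }));
}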

Kenn


On Wed, Jul 20, 2016 at 10:38 AM, Lukasz Cwik 
wrote:

> I would like to propose a change to Beam to make CoGBK the basis for
> grouping instead of GBK. The idea behind this proposal is that CoGBK is a
> more powerful operator then GBK allowing for two key benefits:
>
> 1) SDKs are simplified: transforming a CoGBK into a GBK is trivial while
> the reverse is not.
> 2) It will be easier for runners to provide more efficient implementations
> of CoGBK as they will be responsible for the logic which takes their own
> internal grouping implementation and maps it onto a CoGBK.
>
> This requires the following modifications to the Beam code base:
>
> 1) Make GBK a composite transform in terms of CoGBK.
> 2) Move the CoGBK from contrib to runners-core as an adapter*. Runners that
> more naturally support GBK can just use this and everything executes
> exactly as before.
>
> *just like GroupByKeyViaGroupByKeyOnly and UnboundedReadFromBoundedSource
>


Re: [PROPOSAL] Pipeline Runner API design doc

2016-07-14 Thread Kenneth Knowles
Hi everyone,

I wanted to circle back on this thread and with another invitation to a
discussion. Work on the high level refactorings to align the Java SDK with
the primitives of the proposed model is pretty far along, as is moving out
the stuff that we don't want in the user-facing SDK.

Since our runners are all Java-based, and we tend to discuss the model in
Java first, I think part of the proposal that may have received less
attention was the concrete Avro schema towards the bottom of the doc. Since
our serialization tech discussion seemed to favor JSON on the front end, I
just spent a few minutes to port the Avro schema to a JSON schema and do
some project set up to demonstrate where & how it would incorporate into
the project structure. I'd done the same for Avro previously, so we can see
how they compare.

I put the code in a PR, for discussion only at this point, at
https://github.com/apache/incubator-beam/pull/662. I'd love if you took a
look at the notes on the PR and briefly at the schema; I'll continue to
evolve it according to current & future feedback.

Kenn

On Wed, Mar 23, 2016 at 2:17 PM, Kenneth Knowles <k...@google.com> wrote:

> Hi everyone,
>
> Incorporating the feedback from the 1-pager I circulated a week ago, I
> have put together a concrete design document for the new API(s).
>
>
> https://docs.google.com/document/d/1bao-5B6uBuf-kwH1meenAuXXS0c9cBQ1B2J59I3FiyI/edit?usp=sharing
>
> I appreciate any and all feedback on the design.
>
> Kenn
>


Re: Improvements to issue/version tracking

2016-06-28 Thread Kenneth Knowles
+1

On Tue, Jun 28, 2016 at 12:06 AM, Jean-Baptiste Onofré 
wrote:

> +1
>
> Regards
> JB
>
>
> On 06/28/2016 01:01 AM, Davor Bonaci wrote:
>
>> Hi everyone,
>> I'd like to propose a simple change in Beam JIRA that will hopefully
>> improve our issue and version tracking -- to actually use the "Fix
>> Versions" field as intended [1].
>>
>> The goal would be to simplify issue tracking, streamline generation of
>> release notes, add a view of outstanding work towards a release, and
>> clearly communicate which Beam version contains fixes for each issue.
>>
>> The standard usage of the field is:
>> * For open (or in-progress/re-opened) issues, "Fix Versions" field is
>> optional and indicates an unreleased version that this issue is targeting.
>> The release is not expected to proceed unless this issue is fixed, or the
>> field is changed.
>> * For closed (or resolved) issues, "Fix Versions" field indicates a
>> released or unreleased version that has the fix.
>>
>> I think the field should be mandatory once the issue is resolved/closed
>> [4], so we make a deliberate choice about this. I propose we use "Not
>> applicable" for all those issues that aren't being resolved as Fixed
>> (e.g.,
>> duplicates, working as intended, invalid, etc.) and those that aren't
>> released (e.g., website, build system, etc.).
>>
>> We can then trivially view outstanding work for the next release [2], or
>> generate release notes [3].
>>
>> I'd love to hear if there are any comments! I know that at least JB
>> agrees,
>> as he was convincing me on this -- thanks ;).
>>
>> Thanks,
>> Davor
>>
>> [1]
>>
>> https://confluence.atlassian.com/adminjiraserver071/managing-versions-802592484.html
>> [2]
>>
>> https://issues.apache.org/jira/browse/BEAM/fixforversion/12335766/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
>> [3]
>>
> >> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12335764
>> [4] https://issues.apache.org/jira/browse/INFRA-12120
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: What is the "Keyed State" in the capability matrix?

2016-06-24 Thread Kenneth Knowles
Hi Shen,

The row refers to the ability for a DoFn in a ParDo to access per-key (and
window) state cells that persist beyond the lifetime of an element or
bundle. This is a feature that was in the later stages of design when the
Beam code was donated. Hence it is a row in the matrix, but even the Beam Model
column says "no", pending a public design proposal & consensus. Most
runners already have a similar capability at a low level; this feature
refers to exposing it in a nice way for users.
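
To make the idea concrete, usage might eventually look something like the
following (purely illustrative; the API was still under design when this was
written, so the annotation and state-cell names are assumptions):

new DoFn<KV<String, Long>, Void>() {
  @ProcessElement
  public void processElement(
      ProcessContext c, @StateId("sum") ValueState<Long> sum) {
    // State is scoped to the current key and window, and persists
    // across elements and bundles.
    Long current = sum.read();
    sum.write((current == null ? 0L : current) + c.element().getValue());
  }
}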

I have a design doc that I'm busily revising to make sense for the whole
community. I will send the doc to this list and add it to our technical
docs folder as soon as I can get it ready. You can follow BEAM-25 [1] if
you like, too.

Kenn

[1] https://issues.apache.org/jira/browse/BEAM-25


On Fri, Jun 24, 2016 at 10:56 AM, Shen Li  wrote:

> Hi,
>
> There is a "Keyed State" row in the  "What is being computed" section of
> the capability matrix. What does the "Keyed State" refer to? Is it a global
> key-value store?
>
> (
>
> http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
> )
>
> Thanks,
>
> Shen
>


Re: [DISCUSS] Beam data plane serialization tech

2016-06-22 Thread Kenneth Knowles
rg
> >
> > > wrote:
> > >
> > > > Hi,
> > > > am I correct in assuming that the transmitted envelopes would mostly
> > > > contain coder-serialized values? If so, wouldn't the header of an
> > > envelope
> > > > just be the number of contained bytes and number of values? I'm
> > probably
> > > > missing something but with these assumptions I don't see the benefit
> of
> > > > using something like Avro/Thrift/Protobuf for serializing the
> > main-input
> > > > value envelopes. We would just need a system that can send byte data
> > > really
> > > > fast between languages/VMs.
> > > >
> > > > By the way, another interesting question (at least for me) is how
> other
> > > > data, such as side-inputs, is going to arrive at the DoFn if we want
> to
> > > > support a general interface for different languages.
> > > >
> > > > Cheers,
> > > > Aljoscha
> > > >
> > > > On Thu, 16 Jun 2016 at 22:33 Kenneth Knowles <k...@google.com.invalid
> >
> > > > wrote:
> > > >
> > > > > (Apologies for the formatting)
> > > > >
> > > > > On Thu, Jun 16, 2016 at 12:12 PM, Kenneth Knowles <k...@google.com>
> > > > wrote:
> > > > >
> > > > > > Hello everyone!
> > > > > >
> > > > > > We are busily working on a Runner API (for building and
> > transmitting
> > > > > > pipelines)
> > > > > > and a Fn API (for invoking user-defined functions found within
> > > > pipelines)
> > > > > > as
> > > > > > outlined in the Beam technical vision [1]. Both of these require
> a
> > > > > > language-independent serialization technology for
> interoperability
> > > > > between
> > > > > > SDKs
> > > > > > and runners.
> > > > > >
> > > > > > The Fn API includes a high-bandwidth data plane where bundles are
> > > > > > transmitted
> > > > > > via some serialization/RPC envelope (inside the envelope, the
> > stream
> > > of
> > > > > > elements is encoded with a coder) to transfer bundles between the
> > > > runner
> > > > > > and
> > > > > > the SDK, so performance is extremely important. There are many
> > > choices
> > > > > for
> > > > > > high
> > > > > > performance serialization, and we would like to start the
> > > conversation
> > > > > > about
> > > > > > what serialization technology is best for Beam.
> > > > > >
> > > > > > The goal of this discussion is to arrive at consensus on the
> > > question:
> > > > > > What
> > > > > > serialization technology should we use for the data plane
> envelope
> > of
> > > > the
> > > > > > Fn
> > > > > > API?
> > > > > >
> > > > > > To facilitate community discussion, we looked at the available
> > > > > > technologies and
> > > > > > tried to narrow the choices based on three criteria:
> > > > > >
> > > > > >  - Performance: What is the size of serialized data? How do we
> > expect
> > > > the
> > > > > >technology to affect pipeline speed and cost? etc
> > > > > >
> > > > > >  - Language support: Does the technology support the most
> > widespread
> > > > > > language
> > > > > >for data processing? Does it have a vibrant ecosystem of
> > > contributed
> > > > > >language bindings? etc
> > > > > >
> > > > > >  - Community: What is the adoption of the technology? How mature
> is
> > > it?
> > > > > > How
> > > > > >active is development? How is the documentation? etc
> > > > > >
> > > > > > Given these criteria, we came up with four technologies that are
> > good
> > > > > > contenders. All have similar & adequate schema capabilities.
> > > > > >
> > > > > >  - Apache Avro: Does not require code gen, but embedding the
> schema
> > > in
> > > > > the
> > > > > > data
> > > > > >could be an issue. Very popular.
> > > > > >
> > > > > >  - Apache Thrift: Probably a bit faster and compact than Avro. A
> > huge
> > > > > > number of
> > > > > >language supported.
> > > > > >
> > > > > >  - Protocol Buffers 3: Incorporates the lessons that Google has
> > > learned
> > > > > > through
> > > > > >long-term use of Protocol Buffers.
> > > > > >
> > > > > >  - FlatBuffers: Some benchmarks imply great performance from the
> > > > > zero-copy
> > > > > > mmap
> > > > > >idea. We would need to run representative experiments.
> > > > > >
> > > > > > I want to emphasize that this is a community decision, and this
> > > thread
> > > > is
> > > > > > just
> > > > > > the conversation starter for us all to weigh in. We just wanted
> to
> > do
> > > > > some
> > > > > > legwork to focus the discussion if we could.
> > > > > >
> > > > > > And there's a minor follow-up question: Once we settle here, is
> > that
> > > > > > technology
> > > > > > also suitable for the low-bandwidth Runner API for defining
> > > pipelines,
> > > > or
> > > > > > does
> > > > > > anyone think we need to consider a second technology (like JSON)
> > for
> > > > > > usability
> > > > > > reasons?
> > > > > >
> > > > > > [1]
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/presentation/d/1E9seGPB_VXtY_KZP4HngDPTbsu5RVZFFaTlwEYa88Zw/present?slide=id.g108d3a202f_0_38
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
>
>


Re: Running examples with different runners

2016-06-21 Thread Kenneth Knowles
To expand on the RunnableOnService test suggestion, here [1] is the commit
from the Spark runner. You will get a lot more information if you can port
this for your runner than you would from an example end-to-end test.

Note that this just pulls in the tests from the core SDK. For testing with
other I/O connectors, you'll add them to the dependenciesToScan.
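
In pom terms, the shape of that commit is roughly the following (a sketch;
the exact artifact coordinates and JUnit category may differ for your
runner):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- Run the RunnableOnService-categorized tests shipped in the SDK
         jar against this runner. -->
    <groups>org.apache.beam.sdk.testing.RunnableOnService</groups>
    <dependenciesToScan>
      <dependency>org.apache.beam:java-sdk-all</dependency>
    </dependenciesToScan>
  </configuration>
</plugin>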

[1]
https://github.com/apache/incubator-beam/commit/4254749bf103c4bb6f68e316768c0aa46d9f7df0

On Tue, Jun 21, 2016 at 4:06 PM, Lukasz Cwik 
wrote:

> There is a start to getting more e2e like integration tests going with the
> first being WordCount.
>
> https://github.com/apache/incubator-beam/blob/master/examples/java/src/test/java/org/apache/beam/examples/WordCountIT.java
> You could add WindowedWordCountIT.java which will be launched with the
> proper configuration of the Apex runner pom.xml
>
> I would also suggest that you take a look at the @RunnableOnService tests
> which are a comprehensive validation suite of ~200ish tests that test
> everything from triggers to side inputs. It requires some pom changes and
> creating a test runner which is able to setup an apex environment.
>
> Furthermore, we could really use an addition to the Beam wiki about testing
> and how runners write tests/execute tests/...
>
> Some relevant links:
> Older presentation about getting cross runner tests going:
>
> https://docs.google.com/presentation/d/1uTb7dx4-Y2OM_B0_3XF_whwAL2FlDTTuq2QzP9sJ4Mg/edit#slide=id.g127d614316_19_39
>
> Examples of test runners:
>
> https://github.com/apache/incubator-beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/testing/TestDataflowRunner.java
>
> https://github.com/apache/incubator-beam/blob/master/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/TestFlinkRunner.java
>
> https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/TestSparkRunner.java
>
> Section of pom dedicated to enabling runnable on service tests:
>
> https://github.com/apache/incubator-beam/blob/master/runners/spark/pom.xml#L54
>
> On Tue, Jun 21, 2016 at 2:21 PM, Thomas Weise 
> wrote:
>
> > Hi,
> >
> > As part of the Apex runner, we have a few unit tests for the supported
> > transformations. Next, I would like to test the WindowedWordCount
> example.
> >
> > Is there an example of configuring this pipeline for another runner? Is
> it
> > recommended to supply such configuration as a JUnit test? What is the
> > general (repeatable?) approach to exercise different runners with the set
> > of example pipelines?
> >
> > Thanks,
> > Thomas
> >
>


Re: Using GroupAlsoByWindowViaWindowSetDoFn for stream of input element

2016-06-21 Thread Kenneth Knowles
Broadly, yes: This and other semantics-preserving transformations are (by
definition) runner-independent and we have a home for them in mind. The
place we would put them is in the nascent runners-core module, which is
generally the place for utilities for helping to implement runners.

An optimization library has not been a priority since our existing runners
already have optimized execution strategies, and many optimizations are a
better fit for a runner-specific lower level "intermediate language" anyhow.

But if you implemented the transformation and contributed it, I think it
would be welcome.

On Tue, Jun 21, 2016 at 10:08 AM, Lukasz Cwik 
wrote:

> Maybe, the issue is that pushing the combine function upstream impacts the
> windowing and triggering behavior of the GBK. I don't believe its as simple
> as always being able to push the combiner upstream and it depends on how a
> runner has decided to implement GBK.
>
>
> On Tue, Jun 21, 2016 at 9:58 AM, Thomas Weise 
> wrote:
>
> > Hi Thomas,
> >
> > Thanks for the info. When the pipeline contains:
> >
> > .apply(Count.perElement())
> >
> > The translation looks as follows:
> >
> > 58   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
> > entering composite transform Count.PerElement
> > 58   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
> > visiting transform Init [AnonymousParDo]
> > 58   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
> > visiting value Count.PerElement/Init.out [PCollection]
> > 58   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
> > entering composite transform Count.PerKey [Combine.PerKey]
> > 58   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
> > visiting transform GroupByKey
> > 93   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
> > visiting value Count.PerElement/Count.PerKey/GroupByKey.out [PCollection]
> > 93   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
> > entering composite transform Combine.GroupedValues
> > 93   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
> > visiting transform AnonymousParDo
> > 93   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
> > visiting value
> > Count.PerElement/Count.PerKey/Combine.GroupedValues/AnonymousParDo.out
> > [PCollection]
> > 93   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
> > leaving composite transform Combine.GroupedValues
> > 93   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
> > leaving composite transform Count.PerKey [Combine.PerKey]
> > 93   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
> > leaving composite transform Count.PerElement
> >
> > So the runner's translator needs to take care of pushing the combine
> > function upstream, when it is possible. I was wondering whether this is
> > something that could be handled in a runner independent way?
> >
> > Thanks,
> > Thomas
> >
> >
> >
> >
> > On Fri, Jun 17, 2016 at 10:19 AM, Thomas Groh 
> > wrote:
> >
> > > Generally, the above code snippet will work, producing (after trigger
> > > firing) an output Iterable containing all of the input elements. It
> > may
> > > be notable that timers (and TimerInternals) are also per-key, so that
> > > interface must also be updated per element.
> > >
> > > By specifying the ReduceFn of the ReduceFnRunner, you can change how
> the
> > > ReduceFnRunner adds and merges state. The combining ReduceFn is
> suitable
> > > for use with upstream CombineFns, while buffering is suitable for
> general
> > > use.
> > >
> > > On Fri, Jun 17, 2016 at 9:52 AM, Thomas Weise 
> > > wrote:
> > >
> > > > The source for my windowed groupByKey experiment is here:
> > > >
> > > >
> > > >
> > >
> >
> https://github.com/tweise/incubator-beam/blob/master/runners/apex/src/main/java/org/apache/beam/runners/apex/translators/functions/ApexGroupByKeyOperator.java
> > > >
> > > > The result is Iterable. In cases such as counting, what is the
> > > > recommended way to perform the incremental aggregation, without
> > building
> > > an
> > > > intermediate collection?
> > > >
> > > > Thomas
> > > >
> > > > On Fri, Jun 17, 2016 at 8:27 AM, Thomas Weise <
> thomas.we...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm looking into using the GroupAlsoByWindowViaWindowSetDoFn to
> > > > accumulate
> > > > > the windowed state with the elements arriving one by one (stream).
> > > > >
> > > > > Once the window is complete, I would like to emit an Iterable or
> > > > > another form of aggregation of the elements. Is the following
> > supposed
> > > to
> > > > > lead to merging of current element with previously received
> elements
> > > for
> > > > > the same window:
> > > > >
> > > > > KeyedWorkItem kwi = KeyedWorkItems.elementsWorkItem(

Re: 0.1.0-incubating release

2016-06-07 Thread Kenneth Knowles
+1

Lovely. Very readable.

The "-parent" artifacts are just leaked implementation details of our build
configuration that no one should ever actually reference, right?

Kenn

On Tue, Jun 7, 2016 at 8:54 AM, Dan Halperin 
wrote:

> +2! This seems most concordant with other Apache products and the most
> future-proof.
>
> On Mon, Jun 6, 2016 at 9:35 PM, Jean-Baptiste Onofré 
> wrote:
>
> > +1
> >
> > Regards
> > JB
> >
> >
> > On 06/07/2016 02:49 AM, Davor Bonaci wrote:
> >
> >> After a few rounds of discussions and examining patterns of other
> >> projects,
> >> I think we are converging towards:
> >>
> >> * A flat group structure, where all artifacts belong to the
> >> org.apache.beam
> >> group.
> >> * Prefix all artifact ids with "beam-".
> >> * Name artifacts according to the existing directory/module layout:
> >> beam-sdks-java-core, beam-runners-google-cloud-dataflow-java, etc.
> >> * Suffix all parents with "-parent", e.g., "beam-parent",
> >> "sdks-java-parent", etc.
> >> * Create a "distributions" module, for the purpose of packaging the
> source
> >> code for the ASF release.
> >>
> >> I believe this approach takes into account everybody's feedback so far,
> >> and
> >> I think opposing positions have been retracted.
> >>
> >> Please comment if that's not the case, or if there are any additional
> >> points that we may have missed. If not, this is implemented in pending
> >> pull
> >> requests #420 and #423.
> >>
> >> Thanks!
> >>
> >> On Fri, Jun 3, 2016 at 9:59 AM, Thomas Weise 
> >> wrote:
> >>
> >> Another consideration for potential future packaging/distribution
> >>> solutions
> >>> is how the artifacts line up as files in a flat directory. For that it
> >>> may
> >>> be good to have a common prefix in the artifactId and unique
> artifactId.
> >>>
> >>> The name for the source archive (when relying on ASF parent POM) can
> also
> >>> be controlled without expanding the artifactId:
> >>>
> >>>   <build>
> >>>     <plugins>
> >>>       <plugin>
> >>>         <artifactId>maven-assembly-plugin</artifactId>
> >>>         <configuration>
> >>>           <finalName>apache-beam</finalName>
> >>>         </configuration>
> >>>       </plugin>
> >>>     </plugins>
> >>>   </build>
> >>>
> >>> Thanks,
> >>> Thomas
> >>>
> >>> On Fri, Jun 3, 2016 at 9:39 AM, Davor Bonaci  >
> >>> wrote:
> >>>
> >>> BEAM-315 is definitely important. Normally, I'd always advocate for
> 
> >>> holding
> >>>
>  the release to pick that fix. For the very first release, however, I'd
>  prefer to proceed to get something out there and test the process. As
>  you
>  said, we can address this rather quickly once we have the fix merged
> in.
> 
>  In terms of Maven coordinates, there are two basic approaches:
>  * flat structure, where artifacts live under "org.apache.beam" group
> and
>  are differentiated by their artifact id.
>  * hierarchical structure, where we use different groups for different
> 
> >>> types
> >>>
>  of artifacts (org.apache.beam.sdks; org.apache.beam.runners).
> 
>  There are pros and cons on the both sides of the argument. Different
>  projects made different choices. Flat structure is easier to find and
>  navigate, but often breaks down with too many artifacts. Hierarchical
>  structure is just the opposite.
> 
>  On my end, the only important thing is consistency. We used to have
> it,
> 
> >>> and
> >>>
>  it got broken by PR #365. This part should be fixed -- we should
> either
>  finish the vision of the hierarchical structure, or rollback that PR
> to
> 
> >>> get
> >>>
>  back to a fully flat structure.
> 
>  My general biases tend to be:
>  * hierarchical structure, since we have many artifacts already.
>  * short identifiers; no need to repeat a part of the group id in the
>  artifact id.
> 
>  On Fri, Jun 3, 2016 at 4:03 AM, Jean-Baptiste Onofré  >
>  wrote:
> 
>  Hi Max,
> >
> > I discussed with Davor yesterday. Basically, I proposed:
> >
> > 1. To rename all parent with a prefix (beam-parent,
> >
>  flink-runner-parent,
> >>>
>  spark-runner-parent, etc).
> > 2. For the groupId, I prefer to use different groupId, it's clearer
> to
> >
>  me,
> 
> > and it's exactly the usage of the groupId. Some projects use a single
> > groupId (spark, hadoop, etc), others use multiple (camel, karaf,
> >
>  activemq,
> 
> > etc). I prefer different groupIds but ok to go back to single one.
> >
> > Anyway, I'm preparing a PR to introduce a new Maven module:
> > "distribution". The purpose is to address both BEAM-319 (first) and
> > BEAM-320 (later). It's where we will be able to define the different
> > distributions we plan to publish (source and binaries).
> >
> > Regards
> > JB
> >
> >
> > On 06/03/2016 11:02 AM, Maximilian 

Re: Fewer number of minor/trivial issues

2016-05-23 Thread Kenneth Knowles
We do already have a couple of issues labeled as "starter" for just this
purpose. I don't care much about the actual name; there are different words
people think of ("easy-win", "starter", "newbie", "low-hanging-fruit") so
probably it would be useful to have a good Jira search linked from the
contribution guide.

Kenn

On Mon, May 23, 2016 at 9:10 AM, Scott Wegner 
wrote:

> I'm working on integrating FindBugs static analysis into our build, which
> has uncovered a long list of outstanding issues. (JIRA
> , pull request
> ). Once integrated, I'd
> like to triage the baseline issues which will be a great source of
> low-hanging-fruit bugs.
>
> On Sun, May 22, 2016 at 3:07 AM Yash Sharma  wrote:
>
> > Great to know someone's on top of it already.
> > We should encourage creating /marking minor tasks which would give new
> > contributors opportunity to break the ice.
> >
> > Another place could be test cases for certain modules which will also
> > provide knowledge of code flow.
> >
> > -regards
> > On May 22, 2016 8:01 PM, "Jean-Baptiste Onofré"  wrote:
> >
> > > Hi Yash,
> > >
> > > During ApacheCon, I discussed with Davor to create a Jira tag: "low
> > > hanging fruit" ;)
> > > I did such tag in other Apache projects to encourage contribution.
> > >
> > > I see some potential Jira on this kind, especially documentation and
> > > examples.
> > >
> > > Regards
> > > JB
> > >
> > > On 05/22/2016 11:56 AM, Yash Sharma wrote:
> > >
> > >> Hi Experts,
> > >> I have just been checking the Beam issue tracker but could not find
> lot
> > of
> > >> minor/trivial tasks. Also many of the existing minor tasks are quite
> > old.
> > >> These tasks are very helpful for newbie contributions and it would be
> > >> great
> > >> to see some minor tasks (or a newbie label), and would serve as good
> > >> starting points for fresh contributors.
> > >>
> > >> Thoughts ?
> > >>
> > >> Best Regards,
> > >> Yash
> > >>
> > >>
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>


Re: [DISCUSS] Adding Some Sort of SideInputRunner

2016-04-28 Thread Kenneth Knowles
On Thu, Apr 28, 2016 at 1:26 AM, Aljoscha Krettek 
wrote:

> Bump.
>
> I'm afraid this might have gotten lost during the conferences/summits.
>
> On Thu, 21 Apr 2016 at 13:30 Aljoscha Krettek  wrote:
>
> > Ok, I'll try and start such a design. Before I can start, I have a few
> > questions about how the side inputs actually work.  Some of it is in the
> > docs, but some is also conjecture on my part about how MillWheel/Windmill
> > (I don't know, is the first the system and the second a worker in there?)
> > works.
> >
> > I'll write the questions as a set of assumptions and please correct me on
> > those where I'm wrong:
>

Sorry for the delayed reply. I may have taken you too literally. Since
nothing seemed wrong, I didn't say anything at all :-)


>
> > 1. Side inputs are always global/broadcast, they are never scoped to a
> key.
>

True. A side input is a way of reading/viewing an entire PCollection.


> 2. Mapping of main-input window to side-input window is done by the
> > side-input WindowFn.
>

True.


> If the side input has a larger window than the main
> > input, processing of main-input elements will block until the side input
> > becomes available. (Side inputs for a larger side-input window can become
> > available early if using a speculative trigger)
>

True to this point: The main input element waits for the necessary side
input window to be ready. It doesn't necessarily have to do with window
size. It could just be a scheduling thing, or other runner-specific reason.


>
> > 2.5 If we update the side input because a speculative trigger fires again
> > we don't reprocess the main-input elements that were already processed.
> > Only new elements see the updated side input.
>

True. This is a place where the decision is due mostly to feasibility of
implementation. It is easy to create a pipeline where this behavior is not
ideal.


> 3. The decision about whether side inputs are "available", i.e. complete
> > in case of list/map side inputs is made by a Trigger. (This is also true
> > for updates to side input caused by speculative triggers firing again.)
> > This uses the continuation trigger feature, which is easy for time
> triggers
> > and interesting for the AfterPane.elementCountAtLeast() trigger which
> > changes to AfterPane.elementCountAtLeast(1) on continuation and other
> > speculative/late triggers.
>

True up to this point: The side input is considered ready when there has
been any data output/added to the PCollection that is being read as a
side input. So the upstream trigger controls this. I don't know if the
continuation trigger is important. The goal of the continuation trigger is
to handle an implementation detail: Once an upstream trigger has already
regulated the flow, we want downstream stuff to just proceed as fast as
reasonable.


> 4. About the StreamingSideInputDoFnRunner.java
> > <
> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/1e5524a7f5d0d774488cb0206ea6433085461775/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/worker/StreamingSideInputDoFnRunner.java>
> posted
> > by Kenneth. It uses StateInternals to buffer the elements of the main
> input
> > in a BagState.  I was under the assumption that state is always scoped
> to a
> > key but side inputs can also be used on non-keyed streams. In this case,
> > the state is scoped to the key group (the unit/granularity that is used
> to
> > rebalance in case of rescaling) and when we access the state we get all
> > elements for the key groups for which our parallel worker instance is
> > responsible. (This is the part where I am completely unsure about what is
> > happening... :-O)
>

In this case, the key that is scoping the state is just the sharding key
for the computation, just like you say. It may not be the actual key of a
KV element, and the PCollection may not even be keyed, but the computation
always has such a sharding.

I believe replacing the runner-specific pieces of that code might be most
of the way to what you want. And I think the runner-specific pieces are
really just there because it didn't matter to make it more general, so it
will be easy to change. The upcoming in-process runner is developing some
abstractions in this area. But I can easily believe that there are other
ways of designing this. For example in batch mode we simply schedule the
side input first. Even in streaming style, the runner backend might be able
to take some more responsibility. So that code is just one point in the
design space.

Kenn


Re: [DISCUSS] Adding Some Sort of SideInputRunner

2016-04-20 Thread Kenneth Knowles
Hi Aljoscha,

Great idea.

 - The logic for matching up the windows is WindowFn#getSideInputWindow [1]
 - The SDK used to have something along the lines of what you describe [2]
but we thought it was too runner-specific, directly referencing Dataflow
details, and with a particular model of buffering + timer. Perhaps it is a
starting place for your design?
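
For partitioning WindowFns, the window matching in [1] amounts to roughly the
following (a paraphrase of the idea, not the exact code): assign the
side-input window that contains the main-input window's maximum timestamp.

@Override
public W getSideInputWindow(final BoundedWindow mainWindow) {
  // Map the main-input window to the side-input window containing its
  // maximum timestamp.
  return assignWindow(mainWindow.maxTimestamp());
}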

Kenn

[1]
https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/WindowFn.java#L131

[2]
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/1e5524a7f5d0d774488cb0206ea6433085461775/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/worker/StreamingSideInputDoFnRunner.java

On Wed, Apr 20, 2016 at 4:25 AM, Jean-Baptiste Onofré 
wrote:

> Hi Aljoscha
>
> AFAIR, the Runner API Proposal document (from Kenneth) contains some
> points about side input.
>
>
> https://drive.google.com/folderview?id=0B-IhJZh9Ab52OFBVZHpsNjc4eXc&usp=sharing
>
> I don't think it goes into the details of side inputs and windows, but
> definitely the document we should extend.
>
> Regards
> JB
>
>
>
> On 04/20/2016 11:55 AM, Aljoscha Krettek wrote:
>
>> Hi,
>> for https://issues.apache.org/jira/browse/BEAM-102 we will need to have
>> some functionality that deals with side inputs and windows (of both the
>> main input and the side inputs) and how they get matched and how we wait
>> for windows (blocking). I imagine that we could add some component that is
>> similar to ReduceFnRunner but for side inputs: We would just instantiate
>> it
>> with a factory for state storage, then push elements into it while
>> processing and it would provide a way to get a SideInputReader.
>>
>> I think this would not be specific to the Flink runner because other
>> runner
>> implementors will face similar problems. Are there any ideas/design docs
>> about such a thing already? If not, we should probably start designing.
>>
>> What do you think?
>>
>> Cheers,
>> Aljoscha
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: A question about windowed values

2016-04-13 Thread Kenneth Knowles
Good thread. Filed as https://issues.apache.org/jira/browse/BEAM-191.

On Wed, Apr 13, 2016 at 10:08 AM, Amit Sela  wrote:

> First of all, Thanks for the detailed explanation!
>
> I can say that from my point of view (as a runner developer) this is
> definitely confusing, especially discovering that an element in an empty
> window can be dropped at anytime, so +1 for Robert's comment on not having
> this public API, and according to Kenneth's lookup it looks like it's not
> entangled too deep.
>
> So I guess #valueInGlobalWindow should be the "go-to" default window (as
> long as no "real" windows are involved), should we consider making this
> more clear in the public API ? maybe WindowedValue#defaultValue(T) ?
> which will probably implement a global window.. just a thought.
>
> On Wed, Apr 13, 2016 at 7:29 PM Robert Bradshaw
> 
> wrote:
>
> > As Thomas says, the fact that we ever produce values in "no window" is
> > an implementation quirk that should probably be fixed. (IIRC, it's
> > used for the output of a GBK before we've done the
> > group-also-by-windows to figure out what window it really should be
> > in, so "value in unknown windows" would be a better choice).
> >
> > If a WindowFn doesn't assign a value to any windows, the system is
> > free to drop it. There are pros and cons to supporting this degenerate
> > case vs. making it an error. However, this should almost certainly not
> > be in the public API...
> >
> > - Robert
> >
> >
> > On Wed, Apr 13, 2016 at 9:06 AM, Thomas Groh 
> > wrote:
> > > Actually, my above claim isn't as strong as it can be.
> > >
> > > A value in no windows is considered to not exist. Values that are not
> > > assigned to any window can be dropped by a runner at *any time*. A
> > > WindowFn *must* assign all elements to at least one window. All elements
> > > that are produced by any PTransform (including Sources) must be in a
> > > window, potentially the GlobalWindow.
> > >
> > > On Wed, Apr 13, 2016 at 8:52 AM, Thomas Groh  wrote:
> > >
> > >> Values should almost always be part of at least one window. WindowFns
> > >> should place all elements in at least one window, as values that are in
> > >> no windows will be dropped when they reach a GroupByKey.
> > >>
> > >> Elements in no windows, for example those created by
> > >> WindowedValue.valueInEmptyWindows(T), are generally an implementation
> > >> detail of a transform; for example, in the InProcessPipelineRunner, the
> > >> KV<K, Iterable<V>> elements output by a GroupByKeyOnly are in empty
> > >> windows - but by the time the element reaches the boundary of the
> > >> GroupByKey, the elements are reassigned to the appropriate window(s).
> > >>
> > >> On Tue, Apr 12, 2016 at 11:44 PM, Amit Sela 
> > wrote:
> > >>
> > >>> My instinct tells me that if a value does not belong to a specific
> > >>> window (in time), it's part of the global window. But if so, what's
> > >>> the role of the "empty window"? When should an element be a "value in
> > >>> an empty window"?
> > >>>
> > >>
> > >>
> >
>
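
For concreteness, a tiny sketch of the distinction in this thread, assuming
the org.apache.beam.sdk.util.WindowedValue API of the time
(valueInGlobalWindow, plus the internal valueInEmptyWindows mentioned above):

    import org.apache.beam.sdk.util.WindowedValue;

    public class WindowAssignmentDemo {
      public static void main(String[] args) {
        // An element with no *specific* window still lives in the global
        // window, so it survives all the way through a GroupByKey.
        WindowedValue<String> kept = WindowedValue.valueInGlobalWindow("a");

        // An element in *zero* windows may be dropped by a runner at any
        // time; per this thread it is an implementation detail, not a
        // public concept.
        WindowedValue<String> dropped =
            WindowedValue.valueInEmptyWindows("b");

        System.out.println(kept.getWindows());    // [the global window]
        System.out.println(dropped.getWindows()); // [] - nothing holds it
      }
    }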


Re: TextIO.Read.Bound vs Create

2016-04-13 Thread Kenneth Knowles
This seems wrong. They should both be in the global window. I think your
trouble is
https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L472

On Tue, Apr 12, 2016 at 9:43 PM, Amit Sela  wrote:

> Why do input values from *TextIO.Read.Bound* belong to an empty window while
> values from *Create* belong in the global window?
>
> Thanks,
> Amit
>


Re: Add Beam up to https://analysis.apache.org?

2016-04-12 Thread Kenneth Knowles
Thanks for the hint! Filed https://issues.apache.org/jira/browse/INFRA-11642.

On Fri, Apr 8, 2016 at 9:08 AM, Nitin Lamba <nitin.la...@gmail.com> wrote:

> Actually, I found out that one has to put in an infra request to hook up to
> ASF's Sonar. You could look at [1] for examples.
>
> HTH,
> Nitin
> [1]
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20INFRA%20AND%20text%20~%20%22Sonar%22
>
>
>
>
> On Wed, Mar 30, 2016 at 1:43 PM, Kenneth Knowles <k...@google.com.invalid>
> wrote:
>
> > Something to look forward to, then :-)
> >
> > But I also note that there are incubating versions of Apache Falcon,
> Apache
> > Sirona, Apache Tamaya, Apache BatchEE, and Apache Sentry.
> >
> > On Tue, Mar 29, 2016 at 6:35 PM, Nitin Lamba <nla...@apache.org> wrote:
> >
> > > Looks fun, though it appears to be for Apache TLPs only.
> > >
> > > In the meantime, the following could be used for project stats for
> > > mails [1] and git commits [2].
> > >
> > > Nitin
> > > [1] http://markmail.org/search/?q=Apache+Beam
> > > [2] https://biterg.io
> > >
> > >
> > > On Tue, Mar 29, 2016 at 10:09 AM, Kenneth Knowles
> <k...@google.com.invalid
> > >
> > > wrote:
> > >
> > > > This site looks very fun, possibly enlightening. Not urgent at all,
> but
> > > is
> > > > there just a bit to flip to get Beam added?
> > > >
> > > > Kenn
> > > >
> > >
> >
>


Re: Add Beam up to https://analysis.apache.org?

2016-03-30 Thread Kenneth Knowles
Something to look forward to, then :-)

But I also note that there are incubating versions of Apache Falcon, Apache
Sirona, Apache Tamaya, Apache BatchEE, and Apache Sentry.

On Tue, Mar 29, 2016 at 6:35 PM, Nitin Lamba <nla...@apache.org> wrote:

> Looks fun, though it appears to be for Apache TLPs only.
>
> In the meantime, the following could be used for project stats for mails
> [1] and git commits [2].
>
> Nitin
> [1] http://markmail.org/search/?q=Apache+Beam
> [2] https://biterg.io
>
>
> On Tue, Mar 29, 2016 at 10:09 AM, Kenneth Knowles <k...@google.com.invalid>
> wrote:
>
> > This site looks very fun, possibly enlightening. Not urgent at all, but
> is
> > there just a bit to flip to get Beam added?
> >
> > Kenn
> >
>


Add Beam up to https://analysis.apache.org?

2016-03-29 Thread Kenneth Knowles
This site looks very fun, possibly enlightening. Not urgent at all, but is
there just a bit to flip to get Beam added?

Kenn


[PROPOSAL] Pipeline Runner API design doc

2016-03-23 Thread Kenneth Knowles
Hi everyone,

Incorporating the feedback from the 1-pager I circulated a week ago, I have
put together a concrete design document for the new API(s).

https://docs.google.com/document/d/1bao-5B6uBuf-kwH1meenAuXXS0c9cBQ1B2J59I3FiyI/edit?usp=sharing

I appreciate any and all feedback on the design.

Kenn


Re: Capability matrix question

2016-03-23 Thread Kenneth Knowles
+1 to considering "metric" / PMetric / etc.

On Wed, Mar 23, 2016 at 8:09 AM, Amit Sela  wrote:

> How about "PMetric"?
>
> On Wed, Mar 23, 2016, 16:53 Frances Perry  wrote:
>
>>
>>> Perhaps I'm unclear on what an “Aggregator” is. I assumed that a line
>>> such as the following:
>>>
>>> PCollection<KV<String, Double>> meanByName =
>>>     dataPoints.apply(Mean.perKey());
>>>
>>> …would be considered an Aggregator, since it applies a mean aggregation
>>> over a window. Is that correct, with respect to the Beam terminology? If
>>> not, what would an example of an Aggregator be?
>>>
>> Ah, we may have some slightly confusing terminology here.
>>
>> In that code snippet you are using a PTransform (Mean.perKey) to combine
>> a PCollection using the Mean CombineFn. An Aggregator takes a CombineFn
>> and applies it continuously within a DoFn. So it's more analogous to a
>> 'counter'. You can see an example of aggregators in DebuggingWordCount.
>>
>> We never really used the term *aggregation* to refer to a general set of
>> PTransforms until we started describing things to the community. But it is
>> a useful word, so we've ended up in a bit of a confusing state. Maybe we
>> should consider renaming Aggregator? Something like "metric" might be
>> clearer.
>>
>>
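
To make the contrast concrete, a rough sketch in the spirit of
DebuggingWordCount, assuming the DoFn/Aggregator API of that era
(createAggregator plus Sum.SumLongFn). Mean.perKey() above is a PTransform
whose result flows into a PCollection; the Aggregator below is a DoFn-scoped
counter that the runner reports alongside the job:

    import org.apache.beam.sdk.transforms.Aggregator;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.Sum;

    class CountShortWords extends DoFn<String, String> {
      // Created once per DoFn, bumped per element, and surfaced by the
      // runner rather than flowing into a PCollection.
      private final Aggregator<Long, Long> shortWords =
          createAggregator("shortWords", new Sum.SumLongFn());

      @Override
      public void processElement(ProcessContext c) {
        if (c.element().length() < 4) {
          shortWords.addValue(1L);
        }
        c.output(c.element()); // pass the element through unchanged
      }
    }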


Re: Renaming process: first step Maven coordonates

2016-03-22 Thread Kenneth Knowles
+1 JB.

If it works for other incubating projects, then I'm happy to proceed.

On Mon, Mar 21, 2016 at 1:00 PM, Jean-Baptiste Onofré 
wrote:

> Hi Ben,
>
> it works fine with Maven >= 3.2.x (current version is 3.3.9).
>
> Most incubator projects use x.x.x-incubating-SNAPSHOT:
>
>
> https://git1-us-west.apache.org/repos/asf?p=incubator-batchee.git;a=blob_plain;f=pom.xml;hb=HEAD
>
>
> https://git1-us-west.apache.org/repos/asf?p=incubator-apex-core.git;a=blob_plain;f=pom.xml;hb=HEAD
>
>
> https://git1-us-west.apache.org/repos/asf?p=incubator-atlas.git;a=blob_plain;f=pom.xml;hb=HEAD
>
>
> https://git1-us-west.apache.org/repos/asf?p=incubator-falcon.git;a=blob_plain;f=pom.xml;hb=HEAD
>
> etc
>
> And we don't have any problem with Maven, even in OSGi related projects
> which are a bit "complex" in versioning (as '-' is not allowed).
>
> finalName is not a solution, as it's not part of the Maven coordinates.
>
> I don't see any valid argument for using different versioning in Beam, and
> we will be compliant with the release management recommendations
> (http://incubator.apache.org/guides/releasemanagement.html).
>
> IMHO, we should use 0.1.0-incubating-SNAPSHOT (the Maven parser uses the
> final -SNAPSHOT, and takes 0.1.0-incubating as the base version).
>
> Regards
> JB
>
>
> On 03/21/2016 07:22 PM, Ben Chambers wrote:
>
>> I don't think Maven will recognize 0.1.0-incubating-SNAPSHOT as a
>> snapshot.
>> It will recognize it as 0.1.0 with the "incubating-SNAPSHOT" qualifier.
>>
>> For instance, looking at the code for parsing qualifiers, it only handles
>> the string "SNAPSHOT" specially, not "incubating-SNAPSHOT".
>>
>> http://maven.apache.org/ref/3.0.4/maven-artifact/xref/org/apache/maven/artifact/versioning/ComparableVersion.html#52
>>
>> Looking at this Stack Overflow answer (
>> http://stackoverflow.com/a/31482463/4539304) it looks like support was
>> improved in Maven 3.2.4 to allow multiple qualifiers (it's still unclear
>> whether incubating would be considered by the code as a qualifier).
>>
>> Either way, we shouldn't expect users to upgrade to Maven 3.2.4 or newer
>> just to get reasonable version number treatment. It seems like sticking
>> with the standard "-SNAPSHOT" and "" for releases is preferable.
>>
>> If the goal is to get incubating into the file names, I think we can
>> configure the Maven build process to do so. For instance, finalName
>> defaults to "${project.artifactId}-${project.version}". If we changed
>> that to "${project.artifactId}-incubating-${project.version}", it seems
>> like we'd get "incubating" in the file names without needing to
>> complicate the release numbering.
>>
>> On Mon, Mar 21, 2016 at 10:24 AM Jean-Baptiste Onofré 
>> wrote:
>>
>> Hi Ben,
>>>
>>> 1. True for Python, but it can go in a folder in sdk (sdk/python)
>>> anyway. I think the DSLs (Java based) and other languages that we might
>>> introduce (Scala, ...) can be the same.
>>>
>>> 2. The incubating has to be in the released filenames. So it can be in
>>> the version or name. Anyway, my proposal was 0.1.0-incubating-SNAPSHOT
>>> for a SNAPSHOT and 0.1.0-incubating for a release (it's what I did in
>>> the PR). That way, the Maven standards remain valid.
>>>
>>> Regards
>>> JB
>>>
>>> On 03/21/2016 06:20 PM, Ben Chambers wrote:
>>>
 1. Regarding "java" as a module -- are we sure that other languages will
 be packaged using Maven as well? For instance, Python has its own
 ecosystem which likely doesn't play well with Maven.

 2. Using the literal "SNAPSHOT" as the qualifier has special meaning to
 Maven -- it is newer than all other qualified releases, but older than any
 unqualified release. It feels like we should take advantage of this, which
 makes our versioning more consistent with Maven standards. Specifically,
 snapshots should be 0.1.0-SNAPSHOT and releases should be 0.1.0.
   0.1.0-SNAPSHOT because that uses the standard definition of SNAPSHOT
   0.1.0 because if we had any qualifier then the 0.1.0-SNAPSHOT would be
   considered newer

 Davor's suggestion of putting the "incubating" in the name or description
 of the artifacts seems like a preferable option.

 On Mon, Mar 21, 2016 at 7:33 AM Jean-Baptiste Onofré 
 wrote:

 Hi beamers,
>
> I updated the PR according to your comments.
>
> I have a couple of points I want to discuss:
>
> 1. All modules use the same groupId (org.apache.beam). In order to have
> a cleaner structure on the Maven repo, I wonder if it's not better to
> have different groupIds depending on the artifacts. For instance,
> org.apache.beam.sdk, containing a module with java as artifactId (it
> will contain new artifacts with id python, scala, ...),
> org.apache.beam.runners containing modules with flink and spark as
> 
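
One way to settle the qualifier debate empirically: a small check using
Maven's own ComparableVersion (from the maven-artifact jar linked earlier in
this thread). The comparisons are printed rather than asserted, since - as
noted above - the result depends on the Maven version doing the parsing:

    import org.apache.maven.artifact.versioning.ComparableVersion;

    public class VersionOrdering {
      public static void main(String[] args) {
        ComparableVersion release  = new ComparableVersion("0.1.0");
        ComparableVersion snapshot = new ComparableVersion("0.1.0-SNAPSHOT");
        ComparableVersion incubating =
            new ComparableVersion("0.1.0-incubating-SNAPSHOT");

        // Standard Maven behavior: a SNAPSHOT sorts before its release.
        System.out.println(snapshot.compareTo(release) < 0); // true

        // The contested case: where the extra "incubating" qualifier
        // lands relative to the plain release and the plain snapshot.
        System.out.println(incubating.compareTo(release));
        System.out.println(incubating.compareTo(snapshot));
      }
    }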

Re: Committer workflow

2016-03-21 Thread Kenneth Knowles
+1 for emphasizing the knowledge-sharing aspect of review.

I think it is the most important for project health, and the most fun too!
I love the chance to learn about a new piece of code (or learn how I messed
up in my own code :-)


  1   2   >