Re: [VOTE] Release 0.4.0-incubating, release candidate #3

2016-12-20 Thread Thomas Weise
+1

- signatures
- build from source
- run quickstart with Apex runner

To get more folks to participate in release verification, it may be
helpful to publish verification steps/guidelines on the web site.
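[Editor's note] As a starting point for such guidelines, here is a minimal sketch. The gpg/mvn commands are illustrative placeholders inferred from this thread (artifact names and the `-Prelease` profile are assumptions, not an official checklist); the digest check at the end is runnable as-is against a dummy artifact.

```shell
# Illustrative signature/build steps (placeholder file names, commented out):
#
#   gpg --verify apache-beam-0.4.0-incubating-source-release.zip.asc \
#       apache-beam-0.4.0-incubating-source-release.zip
#   mvn clean verify -Prelease
#
# Runnable demo of the digest check against a dummy artifact:
set -e
workdir=$(mktemp -d)
cd "$workdir"
printf 'dummy release artifact\n' > artifact.zip
# What a release manager would publish alongside the artifact:
sha512sum artifact.zip > artifact.zip.sha512
# What a verifier runs; prints "artifact.zip: OK" on success:
sha512sum -c artifact.zip.sha512
```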

Thanks,
Thomas



On Tue, Dec 20, 2016 at 2:22 AM, Sergio Fernández  wrote:

> +1 (ipmc binding)
>
> So far I've successfully checked:
> * signatures and digests
> * source releases file layouts
> * matched git tags and commit ids
> * incubator suffix and disclaimer
> * NOTICE and LICENSE files
> * license headers
> * clean build (Java 1.8.0_91, Maven 3.3.9, Debian amd64)
>
>
> P.S.: you can keep the dev@ vote as long as you need to verify the
> technical part of the release, the single format requirement is that it
> should be open _at least_ 72 hours.
>
>
> On Tue, Dec 20, 2016 at 11:13 AM, Ismaël Mejía  wrote:
>
> > +1 (non-binding)
> >
> > - verified signatures + checksums
> > - ran mvn clean verify -Prelease; all artifacts build and the tests run
> > smoothly
> >
> > A new 0.4.1 release to include the BigQuery fix is a good idea. I think
> > as we approach graduation it is important that we define (if we haven't
> > already; probably I just don't know) what issues are release blockers for
> > future releases and which issues can be fixed in minor patch versions.
> >
> >
> > On Tue, Dec 20, 2016 at 10:02 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > Thanks for the update Dan.
> > >
> > > I think we should move forward on this release as, as you said, we have
> > > important improvements compared to the 0.3.0-incubating release.
> > > We can do a 0.4.1-incubating pretty soon to address the BigQuery IO
> > > issues. I volunteer to do that.
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 12/19/2016 09:21 PM, Dan Halperin wrote:
> > >
> > >> I vetted the binary artifacts accompanying the release by running
> > >> several jobs on the Dataflow and Direct runners. At a high level, the
> > >> release looks fine -- I ran some of my favorite jobs and they all
> > >> worked swimmingly.
> > >>
> > >> There are some severe bugs in BigQueryIO in the release. Specifically,
> > >> we broke the ability to write to BigQuery using different tables for
> > >> every window. To a large degree, this makes BigQuery useless when
> > >> working with unbounded data (streaming pipelines). The bugs have been
> > >> fixed (and accompanying tests added) in PRs #1651 and #1400.
> > >>
> > >> Conclusion: +0.8
> > >>
> > >> * 0.4.0-incubating RC3 is largely an improvement over 0.3.0-incubating,
> > >> especially in the user getting started experience.
> > >> * The bugs in BigQueryIO are blockers for BigQuery users, but this is
> > >> likely a relatively small fraction of the Beam community. I would not
> > >> retract RC3 based on this alone. Unless we plan to cut an RC4 for
> > >> other reasons, we should move forward with RC3.
> > >>
> > >> I'd hope that we hear from key users of the Apex, Flink, and Spark
> > >> runners before closing the vote, even though it's technically been 72+
> > >> hours. I suggest we wait to ensure they have an opportunity to chime in.
> > >>
> > >> Thanks,
> > >> Dan
> > >>
> > >>
> > >> Appendix: pom.xml changes to use binary releases from Apache Staging:
> > >>
> > >>   <repositories>
> > >>     <repository>
> > >>       <id>apache.staging</id>
> > >>       <name>Apache Development Staging Repository</name>
> > >>       <url>https://repository.apache.org/content/repositories/staging/</url>
> > >>       <releases>
> > >>         <enabled>true</enabled>
> > >>       </releases>
> > >>       <snapshots>
> > >>         <enabled>false</enabled>
> > >>       </snapshots>
> > >>     </repository>
> > >>   </repositories>
> > >>
> > >> On Sun, Dec 18, 2016 at 10:14 PM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > >> wrote:
> > >>
> > >> Hi guys,
> > >>>
> > >>> The good thing is that my issue to access repository.apache.org
> > >>> Nexus is now fixed.
> > >>>
> > >>> To update the signature files, we have to drop the Nexus repository
> > >>> to stage a new one, meaning cancel the current vote and do a new RC4.
> > >>>
> > >>> I can do that, up to you.
> > >>>
> > >>> Anyway, regarding the release content, +1 (binding).
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>>
> > >>> On 12/18/2016 06:56 PM, Davor Bonaci wrote:
> > >>>
> > >>> Indeed -- I did help JB with the release ever so slightly, due to the
> >  networking connectivity issue reaching repository.apache.org, which JB
> >  further described and is tracked in INFRA-13086 [1]. This is not
> >  Beam-specific.
> > 
> >  The current signature shouldn't be a problem at all, but, since others
> >  are asking about it, I think it would be best to simply re-sign the
> >  source .zip archive and continue this vote. JB, what do you think?
> > 
> >  Regarding the release itself, I think we need to keep raising the
> >  quality and maturity release-over-release, and test signals are an
> >  excellent way to demonstrate that. 

Re: [DISCUSS] Graduation to a top-level project

2016-11-22 Thread Thomas Weise
+1

I would like to mention the welcoming, growing community and the focus on
solid processes and testing.


On Tue, Nov 22, 2016 at 11:07 AM, Aljoscha Krettek 
wrote:

> +1
>
> I'm quite enthusiastic about the growth of the community and the open
> discussions!
>
> On Tue, 22 Nov 2016 at 19:51 Jason Kuster 
> wrote:
>
> > An enthusiastic +1!
> >
> > In particular it's been really great to see the commitment and interest
> of
> > the community in different kinds of testing. Between what we currently
> have
> > on Jenkins and Travis and the in-progress work on IO integration tests
> and
> > performance tests (plus, I'm sure, other things I'm not aware of) we're
> in
> > a really good place.
> >
> > On Tue, Nov 22, 2016 at 10:49 AM, Amit Sela 
> wrote:
> >
> > > +1, super exciting!
> > >
> > > Thanks to JB, Davor and the whole team for creating this community. I
> > think
> > > we've achieved a lot in a short time.
> > >
> > > Amit.
> > >
> > > On Tue, Nov 22, 2016, 20:36 Tyler Akidau 
> > > wrote:
> > >
> > > > +1, thanks to everyone who's invested time getting us to this point.
> > :-)
> > > >
> > > > -Tyler
> > > >
> > > > On Tue, Nov 22, 2016 at 10:33 AM Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > First of all, I would like to thank the whole team, and especially
> > > Davor
> > > > > for the great work and commitment to Apache and the community.
> > > > >
> > > > > Of course, a big +1 to move forward on graduation !
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > > On 11/22/2016 07:19 PM, Davor Bonaci wrote:
> > > > > > Hi everyone,
> > > > > > With all the progress we’ve had recently in Apache Beam, I think
> > > > > > it is time we start the discussion about graduation as a new
> > > > > > top-level project at the Apache Software Foundation.
> > > > > >
> > > > > > Graduation means we are a self-sustaining and self-governing
> > > > > > community, and ready to be a full participant in the Apache
> > > > > > Software Foundation. It does not imply that our community growth
> > > > > > is complete or that a particular level of technical maturity has
> > > > > > been reached, rather that we are on a solid trajectory in those
> > > > > > areas. After graduation, we will still periodically report to, and
> > > > > > be overseen by, the ASF Board to ensure continued growth of a
> > > > > > healthy community.
> > > > > >
> > > > > > Graduation is an important milestone for the project. It is also
> > > > > > key to further grow the user community: many users (incorrectly)
> > > > > > see incubation as a sign of instability and are much less likely
> > > > > > to consider us for production use.
> > > > > >
> > > > > > A way to think about graduation readiness is through the Apache
> > > > > > Maturity Model [1]. I think we clearly satisfy all the
> > > > > > requirements [2]. It is probably worth emphasizing the recent
> > > > > > community growth: over each of the past three months, no single
> > > > > > organization contributing to Beam has had more than ~50% of the
> > > > > > unique contributors per month [2, see assumptions]. That’s a great
> > > > > > statistic that shows how much we’ve grown our diversity!
> > > > > >
> > > > > > Process-wise, graduation consists of drafting a board resolution,
> > > > > > which needs to identify the full Project Management Committee, and
> > > > > > getting it approved by the community, the Incubator, and the
> > > > > > Board. Within the Beam community, most of these discussions and
> > > > > > votes have to be on the private@ mailing list, but, as usual,
> > > > > > we’ll try to keep dev@ updated as much as possible.
> > > > > >
> > > > > > With that in mind, let’s use this discussion on dev@ for two
> > > > > > things:
> > > > > > * Collect additional data points on our progress that we may want
> > > > > > to present to the Incubator as a part of the proposal to accept
> > > > > > our graduation.
> > > > > > * Determine whether the community supports graduation. Please
> > > > > > reply +1/-1 with any additional comments, as appropriate. I’d
> > > > > > encourage everyone to participate -- regardless of whether you are
> > > > > > an occasional visitor or have a specific role in the project --
> > > > > > we’d love to hear your perspective.
> > > > > >
> > > > > > Data points so far:
> > > > > > * Project’s maturity self-assessment [2].
> > > > > > * 1500 pull requests in incubation, which makes us one of the most
> > > > > > active projects across all of ASF on this metric.
> > > > > > * 3 releases, each driven by a different release manager.
> > > > > > * 120+ 

Re: Including Apex runner in Beam tutorial at Strata - Singapore

2016-11-15 Thread Thomas Weise
The runner currently only executes in embedded mode and I'm not sure that
will change prior to Strata due to dependency on next Apex core release.

I would suggest aiming for the following:


   - Create a sample wordcount project that produces the packaging that is
   required to launch the pipeline on YARN (I can take this up).
   - Create the instructions for folks to set up a YARN cluster locally. I
   think a good option for the demo would be to use the DataTorrent sandbox.
   - Create tutorial documentation that shows how to alter the example,
   launch it using the Apex CLI, and also how to observe the application
   through the UI (comes with the sandbox).


Thanks,
Thomas





On Tue, Nov 15, 2016 at 7:07 PM, Tyler Akidau 
wrote:

> Yes, I'll be giving the tutorial in Singapore w/ Dan Halperin. We'd be
> happy to include Apex in the demos as part of that. Let's sync up offline
> about what that will entail.
>
> -Tyler
>
>
> On Tue, Nov 15, 2016 at 10:02 AM Davor Bonaci 
> wrote:
>
> > Hi Sandeep,
> > It would be great to include the Apex runner as a part of any tutorial
> > going forward. I suspect we'll have the 0.4.0-incubating release
> > completed just before Strata Singapore, which will be the first release
> > with the Apex runner, so that aligns quite nicely.
> >
> > Are you planning to attend Strata Singapore? If so, I'd encourage you to
> > reach out to Tyler Akidau offline, who's leading the tutorial at this
> > conference.
> >
> > Davor
> >
> > On Tue, Nov 15, 2016 at 7:04 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > Hi Sandeep,
> > >
> > > Great news !
> > >
> > > Yes, you can definitely do a demo using the Apex runner. It's what Dan
> > > and I are also planning during ApacheCon this week: same Wordcount
> > > example running on different execution engines.
> > >
> > > Maybe this blog could help you to prepare the demo:
> > > http://blog.nanthrax.net/2016/08/apache-beam-in-action-same-code-several-execution-engines/
> > >
> > > By the way, I will propose a PR to "merge" those blogs into the Beam website.
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 11/15/2016 04:00 PM, Sandeep Deshmukh wrote:
> > >
> > >> Dear Beam Community,
> > >>
> > >> There is a Beam tutorial at Strata-Singapore. I would like to explore
> > >> the possibility of including the Apex runner as a part of that
> > >> tutorial. As the Apex runner was recently merged into the master
> > >> branch of Beam, it would be of interest to many people.
> > >>
> > >> Please let us know if we can do so. I can accordingly work on the
> same.
> > >>
> > >> Regards,
> > >> Sandeep
> > >>
> > >>
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>


Re: [PROPOSAL] Merge apex-runner to master branch

2016-11-08 Thread Thomas Weise
Thanks for the support. It may be helpful to describe the roles of
"maintainer" and "supporter" in this context, perhaps even capture it on:

http://beam.apache.org/contribute/contribution-guide/

Thanks,
Thomas


On Tue, Nov 8, 2016 at 7:51 PM, Robert Bradshaw <rober...@google.com.invalid
> wrote:

> Nice. I'm +1 modulo one caveat below (hopefully easily addressed).
>
> On Tue, Nov 8, 2016 at 5:54 AM, Thomas Weise <t...@apache.org> wrote:
> > Hi,
> >
> > As per previous discussion [1], I would like to propose merging the
> > apex-runner branch into master. The runner satisfies the criteria
> > outlined in [2], and merging it to master will give more visibility to
> > other contributors and users.
> >
> > Specifically the Apex runner addresses:
> >
> >    - Have at least 2 contributors interested in maintaining it, and 1
> >    committer interested in supporting it:  *I'm going to sign up for the
> >    support and there are more folks interested. Some have already
> >    contributed and helped with PR reviews, others from the Apex community
> >    have expressed interest [3].*
>
> As anyone in the open source ecosystem knows, maintaining is a much
> higher bar than contributing, but very important. I'd like to see
> specific names here.
>
> >    - Provide both end-user and developer-facing documentation:  *Runner
> >    has README, capability matrix, Javadoc. Planning to add it to the
> >    tutorial later.*
> >    - Have at least a basic level of unit test coverage:  *Has 30 runner
> >    specific tests and passes all Beam RunnableOnService tests.*
> >    - Run all existing applicable integration tests with other Beam
> >    components and create additional tests as appropriate:  *Enabled
> >    runner for examples integration tests in the same way as other
> >    runners.*
> >    - Be able to handle a subset of the model that addresses a significant
> >    set of use cases (aka. ‘traditional batch’ or ‘processing time
> >    streaming’):  *Passes RunnableOnService without exclusions and
> >    example IT.*
> >    - Update the capability matrix with the current status:  *Done.*
> >    - Add a webpage under learn/runners:  *Same "TODO" page as other
> >    runners added to site.*
> >
> > The PR for the merge: https://github.com/apache/incubator-beam/pull/1305
> >
> > (There are intermittent test failures in individual Travis runs that are
> > unrelated to the runner.)
> >
> > Thanks,
> > Thomas
> >
> > [1]
> > https://lists.apache.org/thread.html/2b420a35f05e47561f27c19e8ec6484f595553f32da88fe593ad931d@%3Cdev.beam.apache.org%3E
> >
> > [2] http://beam.apache.org/contribute/contribution-guide/#feature-branches
> >
> > [3]
> > https://lists.apache.org/thread.html/6e7618768cdcde81c28aa9883a1fcf4d3d4e41de4249547130691d52@%3Cdev.apex.apache.org%3E
>


[PROPOSAL] Merge apex-runner to master branch

2016-11-08 Thread Thomas Weise
Hi,

As per previous discussion [1], I would like to propose merging the
apex-runner branch into master. The runner satisfies the criteria outlined
in [2], and merging it to master will give more visibility to other
contributors and users.

Specifically the Apex runner addresses:

   - Have at least 2 contributors interested in maintaining it, and 1
   committer interested in supporting it:  *I'm going to sign up for the
   support and there are more folks interested. Some have already contributed
   and helped with PR reviews, others from the Apex community have expressed
   interest [3].*
   - Provide both end-user and developer-facing documentation:  *Runner has
   README, capability matrix, Javadoc. Planning to add it to the tutorial
   later.*
   - Have at least a basic level of unit test coverage:  *Has 30 runner
   specific tests and passes all Beam RunnableOnService tests.*
   - Run all existing applicable integration tests with other Beam
   components and create additional tests as appropriate:  *Enabled runner
   for examples integration tests in the same way as other runners.*
   - Be able to handle a subset of the model that addresses a significant
   set of use cases (aka. ‘traditional batch’ or ‘processing time
   streaming’):  *Passes RunnableOnService without exclusions and example IT.*
   - Update the capability matrix with the current status:  *Done.*
   - Add a webpage under learn/runners:  *Same "TODO" page as other runners
   added to site.*

The PR for the merge: https://github.com/apache/incubator-beam/pull/1305

(There are intermittent test failures in individual Travis runs that are
unrelated to the runner.)

Thanks,
Thomas

[1]
https://lists.apache.org/thread.html/2b420a35f05e47561f27c19e8ec6484f595553f32da88fe593ad931d@%3Cdev.beam.apache.org%3E

[2] http://beam.apache.org/contribute/contribution-guide/#feature-branches

[3]
https://lists.apache.org/thread.html/6e7618768cdcde81c28aa9883a1fcf4d3d4e41de4249547130691d52@%3Cdev.apex.apache.org%3E



Re: Can we have more quick start examples ?

2016-10-27 Thread Thomas Weise
The Beam tutorials seem to address this:

https://github.com/eljefe6a/beamexample/blob/master/README.md
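[Editor's note] As an illustration of the "quick start that can be run anywhere" being asked for below, here is a minimal, cloud-free stand-in sketch. It is plain shell, not Beam API, and the file paths are made up for the demo; it only shows what a MinimalWordCount needs when freed from Google-specific services: local input, local output.

```shell
# Illustrative only: the essence of a cloud-free MinimalWordCount --
# read a local file, split into words, count, write local output.
set -e
printf 'beam runs everywhere beam\n' > /tmp/wc_input.txt
tr -s ' ' '\n' < /tmp/wc_input.txt | sort | uniq -c \
  | awk '{print $2": "$1}' > /tmp/wc_counts.txt
cat /tmp/wc_counts.txt   # beam: 2, everywhere: 1, runs: 1
```

The real examples would do the same with a Beam pipeline; the point is that no cloud storage or BigQuery dependency is inherent to the exercise.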


On Thu, Oct 27, 2016 at 8:04 AM, Manu Zhang  wrote:

> Hey guys,
>
> I find the Beam examples under the examples folder are not easy to run due
> to a dependency on Google-specific services. Even the MinimalWordCount
> <examples/java/src/main/java/org/apache/beam/examples/MinimalWordCount.java>
> requires input and output to be on Google Cloud Storage. Others like
> WindowedWordCount
> <examples/java/src/main/java/org/apache/beam/examples/WindowedWordCount.java>
> require BigQuery.  I wouldn't expect newcomers to tweak IO themselves.
>
> Can we have more quick start examples that can be run anywhere ?
>
> Thanks,
> Manu Zhang
>


Re: [DISCUSS] Merging master -> feature branch

2016-10-26 Thread Thomas Weise
+1

For a merge from master to the feature branch that does not require extra
changes, RTC does not add value. It actually delays and burns reviewer time
(even mechanics need some) that "real" PRs could benefit from. If
adjustments are needed, then the regular process kicks in.

Thanks,
Thomas


On Wed, Oct 26, 2016 at 1:33 AM, Amit Sela  wrote:

> I generally agree with Kenneth.
>
> While working on the SparkRunnerV2 branch, it was a pain - I avoided
> frequent merges to avoid trivial PRs, but it cost me with very large and
> non-trivial merges later.
> I think that frequent merges for feature-branches should most of the time
> be trivial (no conflicts) and a committer should be allowed to self-merge
> once tests pass.
> As for conflicts, even for the smallest ones I'd go with review just so
> it's very clear when self-merging is OK - we can always revisit this later
> and further discuss if we think we can improve this process.
>
> I guess +1 from me.
>
> Thanks,
> Amit.
>
> On Wed, Oct 26, 2016 at 8:10 AM Frances Perry 
> wrote:
>
> > On Tue, Oct 25, 2016 at 9:44 PM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > Agree. When possible it would be great to have the branch merged on
> > > master quickly, even when it's not fully ready. It would give more
> > > visibility to potential contributors.
> > >
> >
> > This thread is about the opposite, I think -- merging master into feature
> > branches regularly to prevent them from getting out of sync.
> >
> > As for increasing the visibility of feature branches, we have these new
> > webpages:
> > http://beam.incubator.apache.org/contribute/work-in-progress/
> > http://beam.incubator.apache.org/contribute/contribution-guide/#feature-branches
> > with more changes coming in the basic SDK/Runner landing pages too.
> >
>


Apex runner integration tests

2016-10-25 Thread Thomas Weise
The Apex runner has the integration tests enabled and that causes Travis PR
builds to fail with a timeout (they complete in Jenkins).

What is the correct setup for this, when are the tests supposed to run?

https://github.com/apache/incubator-beam/blob/apex-runner/runners/apex/pom.xml#L190

Thanks,
Thomas


Re: The Availability of PipelineOptions

2016-10-25 Thread Thomas Weise
+1


On Tue, Oct 25, 2016 at 3:03 AM, Jean-Baptiste Onofré 
wrote:

> +1
>
> Agree
>
> Regards
> JB
>
> ⁣​
>
> On Oct 25, 2016, 12:01, at 12:01, Aljoscha Krettek 
> wrote:
> >+1 This sounds quite straightforward.
> >
> >On Tue, 25 Oct 2016 at 01:36 Thomas Groh 
> >wrote:
> >
> >> Hey everyone,
> >>
> >> I've been working on a declaration of intent for how we want to use
> >> PipelineOptions and an API change to be consistent with that intent.
> >> This is generally part of the move to the Runner API, specifically the
> >> desire to be able to reuse Pipelines and the ability to choose the
> >> runner at the time of the call to run.
> >>
> >> The high-level summary is that I want to remove the
> >> Pipeline.getPipelineOptions method.
> >>
> >> I believe this will be compatible with other in-flight proposals,
> >> especially Dynamic PipelineOptions, but would love to see what everyone
> >> else thinks. The document is available at the link below.
> >>
> >> https://docs.google.com/document/d/1Wr05cYdqnCfrLLqSk--XmGMGgDwwNwWZaFbxLKvPqEQ/edit?usp=sharing
> >>
> >> Thanks,
> >>
> >> Thomas
> >>
>


Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Thomas Weise
Hadoop FS has a local file system implementation that can be used for
testing (the "file" URL scheme, no service needed).

Thanks

On Wed, Oct 19, 2016 at 10:43 AM, Amit Sela <amitsel...@gmail.com> wrote:

> Oh cool, that didn't exist in 0.8 I think, but anything that is Kafka
> native is best.
> I'm pretty sure there's an embedded HDFS for testing as well.
>
> While embedded Kafka/HDFS won't reflect a "real-life" distributed
> environment, it could be a good place to start and provide some basic
> functional testing.
>
> On Wed, Oct 19, 2016 at 8:25 PM Satish Duggana <sati...@apache.org> wrote:
>
> >
> > https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/integration/utils/EmbeddedKafkaCluster.java
> >
> > This is currently used in one of our repos and it comes as part of one of
> > kafka libs.
> >
> > On Wed, Oct 19, 2016 at 10:49 PM, Amit Sela <amitsel...@gmail.com>
> wrote:
> >
> > > The SparkRunner actually has an embedded Kafka for its unit tests.
> > >
> > > On Wed, Oct 19, 2016, 20:16 Thomas Weise <t...@apache.org> wrote:
> > >
> > > > Kafka can be embedded for integration testing, which should
> > > > significantly simplify the setup.
> > > >
> > > > Here is an example I found:
> > > >
> > > > https://gist.github.com/fjavieralba/7930018
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Oct 19, 2016 at 9:44 AM, Dan Halperin
> > > <dhalp...@google.com.invalid
> > > > >
> > > > wrote:
> > > >
> > > > > My thoughts:
> > > > >
> > > > > * It's worth reading the Beam testing
> > > > > <http://beam.incubator.apache.org/contribute/testing/> document that
> > > > > Jason Kuster wrote!
> > > > > * Beam already has support for "End-to-end" integration tests, of
> > > > > examples (e.g., WordCountIT
> > > > > <https://github.com/apache/incubator-beam/blob/master/examples/java/src/test/java/org/apache/beam/examples/WordCountIT.java>)
> > > > > or data stores (e.g., BigtableReadIT
> > > > > <https://github.com/apache/incubator-beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableReadIT.java>).
> > > > >
> > > > > It sounds like we're in agreement that we want one of these for
> > > > > KafkaIO, too. (Among others.) This now circles back to other issues
> > > > > of testing (which are not yet covered in that doc):
> > > > >
> > > > > * Unlike Bigtable, Kafka is not a cloud service. We'll either need a
> > > > > permanent testing cluster or the ability to spin one up dynamically.
> > > > > That work is hard, but we need to figure it out. (Note: the cluster
> > > > > needs to be accessible from all runners and real clusters, so not
> > > > > just the local machine like the integration tests.)
> > > > > * Right now, only WordCountIT and a few other example ITs are run on
> > > > > all runners. We need to add per-runner postcommit suites that run
> > > > > all ITs in the project.
> > > > >
> > > > > Jason and several others of us have been thinking hard about the
> > > > > best ways to build these tests and necessary test infrastructure.
> > > > > (See the performance thread Jason started. IMO the most important
> > > > > issue to solve first is infrastructure). Please help!
> > > > >
> > > > > Dan
> > > > >
> > > > > On Wed, Oct 19, 2016 at 7:37 AM, Thomas Weise <t...@apache.org>
> > wrote:
> > > > >
> > > > > > +1 those are probably the most used sources. Hadoop FS has a
> > > > > > number of different implementations, HDFS is one of them.
> > > > > >
> > > > > > On Wed, Oct 19, 2016 at 2:55 AM, Amit Sela <amitsel...@gmail.com
> >
> > > > w

Re: [KUDOS] Contributed runner: Apache Apex!

2016-10-17 Thread Thomas Weise
Thanks to Kenn for helping with the review and many questions!

The focus till here has been on making the runner functional. I will start
creating JIRAs for follow-up work.

Looking forward to the next steps to make it a top-level runner and input
from the community on the same.

Thanks!
Thomas


On Mon, Oct 17, 2016 at 10:35 AM, Amit Sela <amitsel...@gmail.com> wrote:

> Congrats and thanks to everyone who was involved in this effort!
>
> On Mon, Oct 17, 2016 at 8:07 PM Neelesh Salian <nsal...@cloudera.com>
> wrote:
>
> > Awesome. Great work.
> >
> > On Mon, Oct 17, 2016 at 10:03 AM, Aljoscha Krettek <aljos...@apache.org>
> > wrote:
> >
> > > Congrats! :-)
> > >
> > > On Mon, 17 Oct 2016 at 18:55 Kenneth Knowles <k...@google.com.invalid>
> > > wrote:
> > >
> > > > *I would like to :-)
> > > >
> > > > On Mon, Oct 17, 2016 at 9:51 AM Kenneth Knowles <k...@google.com>
> > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I would to, once again, call attention to a great addition to Beam:
> > > > > a runner for Apache Apex.
> > > > >
> > > > > After lots of review and much thoughtful revision, pull request #540
> > > > > has been merged to the apex-runner feature branch today. Please do
> > > > > take a look, and help us put the finishing touches on it to get it
> > > > > ready for the master branch.
> > > > >
> > > > > And please also congratulate and thank Thomas Weise for this large
> > > > > endeavor, Vlad Rosov who helped get the integration tests working,
> > > > > and Guarav Gupta who contributed review comments.
> > > > >
> > > > > Kenn
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Neelesh Srinivas Salian
> > Customer Operations Engineer
> >
>


Re: [REMINDER] Technical discussion on the mailing list

2016-10-05 Thread Thomas Weise
How about sending just the notifications for creating new JIRAs and opening
PRs to dev@ so that those who are interested can start watching?

Thanks,
Thomas

On Wed, Oct 5, 2016 at 5:33 PM, Dan Halperin 
wrote:

> On Wed, Oct 5, 2016 at 5:13 PM, Daniel Kulp  wrote:
>
> > I just want to give a little more context to this….  I’ve been lurking on
> > this list for several months now reading everything that’s going on. From
> > Apache’s standpoint, that should be a “very good start” for getting to
> > know what is happening in a project.
> >
> > On my last PR, Eugene commented about using the AutoValue pattern for
> > part of it, which caught me off guard.   None of the other IO’s in
> > master were using it, there wasn’t any discussion on this list about it,
> > I had no idea what it was about.   So I asked JB to make sure I hadn’t
> > missed anything.
> >
>
> > Anyway, this is one of the main concerns I have with Beam’s PR work flow:
> > I feel I’m missing things as there is a significant amount of things not
> > happening on a list.   The initial pull request is going to the commits
> > list (ok, would prefer the dev list, but at least it’s on a list).
> > However, none of the comments or discussions or anything that is
> > occurring as part of the review is making it to any list.   The only
> > people that “learn” from the reviews are the reviewers and the person
> > who initiated the PR unless they go into each and every PR and read the
> > comments (and find the new ones and such).   With my Apache hat on, this
> > bothers me.
>
>
> > Anyway, I don’t really understand why the full GitHub integration wasn’t
> > set up for the Beam PRs so that the comments would come back to the lists
> > as well (and JIRA, BTW).
> >
>
> This part confuses me. We've been told that discussions on JIRA, even
> though they are emailed to the mailing lists, don't count as happening on
> the mailing list. So why would github integration be helpful vs just more
> spam?
>
> > As another example, the comments on PR1003 are very applicable to anyone
> > looking into writing IO’s and they could learn about some of the “best
> > practices” presented there.
>
>
> As Beam grows during its incubation, we are moving a lot of knowledge to
> documentation, but yes -- right now, most of the I/O related practices live
> in Eugene's and my head (and now, JB's!). We're working on it, and hope to
> dramatically improve documentation for source authors in the next quarter.
>
> For AutoValue specifically, this is by no means codified and it is
> DEFINITELY not mandatory. Eugene and JB just experimented with it in the
> last few days and decided it was useful in a few cases. We do (or did,
> before this thread) need to have an actual discussion on the mailing list
> before moving forward further towards making it policy.
>
> Right now Ben Chambers is trying to apply AutoValue in places that need
> templated types and struggling with multiple ?s, so the discussion may need
> to continue! ...
>
> Thanks,
> Dan
>
> > That’s basically the background as to why JB sent this.  I was confused
> > and bugged him.   :-)
> >
> > Dan
> >
> >
> >
> > > On Oct 5, 2016, at 1:51 PM, Jean-Baptiste Onofré 
> > wrote:
> > >
> > > Hi team,
> > >
> > > I would like to excuse myself for having forgotten to discuss and share
> > > with you a technical point, and generally speaking to do a small reminder.
> > >
> > > When we worked with Eugene on the JdbcIO, we experimented with AutoValue
> > > to deal with IO configuration. AutoValue provides a nice way to reduce
> > > and limit the boilerplate code required by the IO configuration.
> > > We used AutoValue in JdbcIO and, given the good improvements we saw,
> > > we started to refactor the other IOs.
> > >
> > > The use of AutoValue should have been announced and discussed on the
> > > mailing list.
> > >
> > > "If it doesn't exist on the mailing list, it doesn't exist at all."
> > >
> > > So, any comment happening on a GitHub pull request, or discussion on
> > > hangouts which can impact the project (generally speaking), has to
> > > happen on the mailing list.
> > >
> > > It provides project transparency and facilitates the onboarding of new
> > contributors.
> > >
> > > Thanks !
> > >
> > > Regards
> > > JB
> > >
> >
> > --
> > Daniel Kulp
> > dk...@apache.org  - http://dankulp.com/blog <
> > http://dankulp.com/blog>
> > Talend Community Coder - http://coders.talend.com <
> > http://coders.talend.com/>
> >
>


Re: Apex Runner support for View.CreatePCollectionView

2016-10-01 Thread Thomas Weise
Kenn,

Thanks for the pointer WRT the source watermark. It wasn't set to
MAX_TIMESTAMP upon end of input and hence the global window was never
emitted. Got the assertions almost working now, just need to propagate the
exceptions to the unit test driver and then we should be ready for a new
revision of the PR.

Thomas
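The watermark fix described above can be illustrated with a small plain-Java model (this is a toy sketch, not runner code): a global window buffers elements and only fires once the watermark is advanced to the maximum timestamp, which is why a source that never advances its watermark to MAX_TIMESTAMP at end of input never emits the global window.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of watermark-driven window emission. Illustrative only.
class GlobalWindowEmission {
    static final long MAX_TIMESTAMP = Long.MAX_VALUE;

    private final List<String> buffer = new ArrayList<>();
    private long watermark = Long.MIN_VALUE;
    private List<String> emitted = null;

    public void addElement(String value) {
        buffer.add(value);
    }

    // Advancing the watermark to MAX_TIMESTAMP (e.g. at end of input)
    // is what makes the global window fire.
    public void advanceWatermark(long newWatermark) {
        watermark = newWatermark;
        if (watermark >= MAX_TIMESTAMP && emitted == null) {
            emitted = new ArrayList<>(buffer);
        }
    }

    public List<String> getEmitted() {
        return emitted;
    }
}
```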

On Tue, Sep 27, 2016 at 10:17 PM, Thomas Weise <thomas.we...@gmail.com>
wrote:

> I have GroupByKey working, here is a unit test:
>
> https://github.com/tweise/incubator-beam/blob/BEAM-261.sideinputs/runners/apex/src/test/java/org/apache/beam/runners/apex/translators/GroupByKeyTranslatorTest.java#L65-L110
>
> For the earlier PAssert example, PAssert.GroupGlobally will assign global
> window and remove any triggering. Then the following groupBy won't emit an
> aggregate. I'm trying to figure out what I'm missing.
>
> > PCollection pcollection = pipeline.apply(Create.of(...));
> > PAssert.that(pcollection).empty();
>
> Thomas
>
>
> On Tue, Sep 27, 2016 at 5:02 PM, Kenneth Knowles <k...@google.com.invalid>
> wrote:
>
>> Hi Thomas,
>>
>> Great news about the side inputs! The only thing you should need for most
>> PAsserts to work is GroupByKey. A few PAsserts require side inputs, so if
>> you got those working then you should have everything you need for all the
>> PAsserts.
>>
>> The lack of triggering in PAsserts like the one you mention is because
>> they
>> rely on the behavior that the aggregation for a window is emitted when the
>> window expires and is garbage collected, so all the values for the window
>> are in one aggregation, thus it is the final value and can be tested. This
>> happens as part of the GroupAlsoByWindowViaWindowSetDoFn (the logic
>> itself
>> is part of ReduceFnRunner), so if you have state and timers working, you
>> should see output.
>>
>> If this doesn't seem to be happening, maybe you can give some more
>> details?
>>
>> Kenn
>>
>> On Tue, Sep 27, 2016 at 7:09 PM Thomas Weise <t...@apache.org> wrote:
>>
>> > Hi Kenn,
>> >
>> > Thanks, this was very helpful. I got the side input translation working
>> > now, although I want to go back and see if the View.asXYZ expansions
>> can be
>> > simplified.
>> >
>> > But before that I need to tackle PAssert, which is the next blocker for
>> me
>> > to get many of the integration tests working. I see that the PAsserts
>> > generate TimestampedValueInGlobalWindow with no triggers and so grouping
>> > will accumulate state but not emit anything
>> > (PAssert$0/GroupGlobally/GatherAllOutputs/GroupByKey).
>> >
>> > PCollection pcollection = pipeline.apply(Create.of(...));
>> > PAssert.that(pcollection).empty();
>> >
>> > Is there a good place to look for a basic understanding of PAssert and
>> what
>> > the runner needs to support?
>> >
>> > Thanks,
>> > Thomas
>> >
>> >
>> >
>> > On Thu, Sep 15, 2016 at 11:51 AM, Kenneth Knowles
>> <k...@google.com.invalid>
>> > wrote:
>> >
>> > > Hi Thomas,
>> > >
>> > > The side inputs 1-pager is a forward-looking document for the design
>> of
>> > > side inputs in Beam once the portability layers are completed. The
>> > current
>> > > SDK and implementations do not quite respect the same abstraction
>> > > boundaries, even though they are similar.
>> > >
>> > > Here are some specifics about that 1-pager that I hope will help you
>> > right
>> > > now:
>> > >
>> > >  - The purple cylinder that says "Runner materializes" corresponds to
>> the
>> > > CreatePCollectionView transform. Eventually this should not appear in
>> the
>> > > SDK or the pipeline representation, but today that is where you put
>> your
>> > > logic to write to some runner-specific storage medium, etc.
>> > >  - This "Runner materializes" / "CreatePCollectionView" is consistent
>> > with
>> > > streaming, of course. When new data arrives, the runner makes the new
>> > side
>> > > input value available. Most of the View.asXYZ transforms have a
>> > GroupByKey
>> > > within them, so the triggering on the side input PCollection will
>> > regulate
>> > > this.
>> > >  - The red "RPC" boundary in the diagram will be part of the
>> > cross-language
>> > > Fn

Re: Apex Runner support for View.CreatePCollectionView

2016-09-27 Thread Thomas Weise
I have GroupByKey working, here is a unit test:

https://github.com/tweise/incubator-beam/blob/BEAM-261.sideinputs/runners/apex/src/test/java/org/apache/beam/runners/apex/translators/GroupByKeyTranslatorTest.java#L65-L110

For the earlier PAssert example, PAssert.GroupGlobally will assign global
window and remove any triggering. Then the following groupBy won't emit an
aggregate. I'm trying to figure out what I'm missing.

> PCollection pcollection = pipeline.apply(Create.of(...));
> PAssert.that(pcollection).empty();

Thomas


On Tue, Sep 27, 2016 at 5:02 PM, Kenneth Knowles <k...@google.com.invalid>
wrote:

> Hi Thomas,
>
> Great news about the side inputs! The only thing you should need for most
> PAsserts to work is GroupByKey. A few PAsserts require side inputs, so if
> you got those working then you should have everything you need for all the
> PAsserts.
>
> The lack of triggering in PAsserts like the one you mention is because they
> rely on the behavior that the aggregation for a window is emitted when the
> window expires and is garbage collected, so all the values for the window
> are in one aggregation, thus it is the final value and can be tested. This
> happens as part of the GroupAlsoByWindowViaWindowSetDoFn (the logic itself
> is part of ReduceFnRunner), so if you have state and timers working, you
> should see output.
>
> If this doesn't seem to be happening, maybe you can give some more details?
>
> Kenn
>
> On Tue, Sep 27, 2016 at 7:09 PM Thomas Weise <t...@apache.org> wrote:
>
> > Hi Kenn,
> >
> > Thanks, this was very helpful. I got the side input translation working
> > now, although I want to go back and see if the View.asXYZ expansions can
> be
> > simplified.
> >
> > But before that I need to tackle PAssert, which is the next blocker for
> me
> > to get many of the integration tests working. I see that the PAsserts
> > generate TimestampedValueInGlobalWindow with no triggers and so grouping
> > will accumulate state but not emit anything
> > (PAssert$0/GroupGlobally/GatherAllOutputs/GroupByKey).
> >
> > PCollection pcollection = pipeline.apply(Create.of(...));
> > PAssert.that(pcollection).empty();
> >
> > Is there a good place to look for a basic understanding of PAssert and
> what
> > the runner needs to support?
> >
> > Thanks,
> > Thomas
> >
> >
> >
> > On Thu, Sep 15, 2016 at 11:51 AM, Kenneth Knowles <k...@google.com.invalid
> >
> > wrote:
> >
> > > Hi Thomas,
> > >
> > > The side inputs 1-pager is a forward-looking document for the design of
> > > side inputs in Beam once the portability layers are completed. The
> > current
> > > SDK and implementations do not quite respect the same abstraction
> > > boundaries, even though they are similar.
> > >
> > > Here are some specifics about that 1-pager that I hope will help you
> > right
> > > now:
> > >
> > >  - The purple cylinder that says "Runner materializes" corresponds to
> the
> > > CreatePCollectionView transform. Eventually this should not appear in
> the
> > > SDK or the pipeline representation, but today that is where you put
> your
> > > logic to write to some runner-specific storage medium, etc.
> > >  - This "Runner materializes" / "CreatePCollectionView" is consistent
> > with
> > > streaming, of course. When new data arrives, the runner makes the new
> > side
> > > input value available. Most of the View.asXYZ transforms have a
> > GroupByKey
> > > within them, so the triggering on the side input PCollection will
> > regulate
> > > this.
> > >  - The red "RPC" boundary in the diagram will be part of the
> > cross-language
> > > Fn API. For today, that layer is not present, and it is the Java class
> > > ViewFn on each PCollectionView. It takes an
> > > Iterable<WindowedValue> and produces a ViewT.
> > >  - If we were to use the existing ViewFns without modification, the
> > > primitive "access_pattern" would be "iterable", not "multimap". Thus,
> the
> > > access pattern does not support fetching an individual KV record
> > > efficiently when the side input is large (when it is small, the map can
> > be
> > > built in memory and cached). As we move forwards, this should change.
> > >
> > > And here are answers beyond the side input 1-pager:
> > >
> > >  - The problem of expiry of the side input data is [BEAM-260]. The

Re: OldDoFn - CounterSet replacement

2016-08-17 Thread Thomas Weise
Hi Ben,

Thanks for the reply. Here is the PR:

https://github.com/apache/incubator-beam/pull/540

The doFnRunner instantiation in old style is here:

https://github.com/apache/incubator-beam/pull/540/files#diff-86746f538c22ebafd06fca17f0d0aa94R116

I should also note that the focus of the PR is to establish the Apex runner
baseline; proper support for aggregators isn't part of it and is
something I was planning to take up in a subsequent round.

Thomas
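As a rough illustration of the wiring discussed in this thread, a runner-provided counter factory might look like the following plain-Java sketch (the names here are illustrative, not Beam's actual AggregatorFactory interface); the point is that the runner owns the factory, and a DoFn runner instantiated without one fails as soon as a delegate aggregator is set up:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: the runner hands the DoFn runner a factory that creates
// counters wired to the runner's metrics backend. Illustrative only.
class RunnerAggregators {
    public interface CounterFactory {
        Counter create(String name);
    }

    public static class Counter {
        private long value;
        public void add(long n) { value += n; }
        public long get() { return value; }
    }

    // A "wired up" factory: counters are registered so the runner can query them.
    public static class MapBackedFactory implements CounterFactory {
        public final Map<String, Counter> counters = new HashMap<>();

        @Override
        public Counter create(String name) {
            // Reuse the same counter for the same name across calls.
            return counters.computeIfAbsent(name, k -> new Counter());
        }
    }
}
```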


On Wed, Aug 17, 2016 at 8:14 AM, Ben Chambers <bchamb...@apache.org> wrote:

> Hi Thomas!
>
> On Tue, Aug 16, 2016 at 9:40 PM Thomas Weise <thomas.we...@gmail.com>
> wrote:
>
> > I'm trying to rebase a PR and adjust for the DoFn changes.
> >
>
> Can you elaborate on what you're trying to do (or send a link to the PR)?
>
>
> > CounterSet is gone and there is now AggregatorFactory and I'm looking to
> > fix an existing usage of org.apache.beam.sdk.util.
> DoFnRunners.simpleRunner.
> >
>
> In practice, these should act the same. CounterSet was an implementation
> detail used to create implementation-specific Counters. The DoFnRunner was
> supposed to get the CounterSet that was wired up correctly. Now, the
> AggregatorFactory serves the role of creating wired-up Aggregators. As
> before, the DoFnRunner should be instantiated with an AggregatorFactory
> wired up appropriately.
>
>
> > Given the instance of OldDoFn, what is the recommended way to obtain the
> > aggregator factory when creating the fn runner?
> >
>
> This should come from the runner. When the runner wants to instantiate a
> DoFnRunner to execute a user DoFn, it provides an AggregatorFactory that
> will wire up aggregators appropriately.
>
>
> > Thanks!
> >
> >
> > java.lang.NullPointerException
> > at
> >
> > org.apache.beam.sdk.util.DoFnRunnerBase$DoFnContext.
> createAggregatorInternal(DoFnRunnerBase.java:348)
> > at
> >
> > org.apache.beam.sdk.transforms.OldDoFn$Context.setupDelegateAggregator(
> OldDoFn.java:224)
> > at
> >
> > org.apache.beam.sdk.transforms.OldDoFn$Context.setupDelegateAggregators(
> OldDoFn.java:215)
> > at
> >
> > org.apache.beam.sdk.util.DoFnRunnerBase$DoFnContext.<init>(DoFnRunnerBase.java:214)
> > at org.apache.beam.sdk.util.DoFnRunnerBase.<init>(DoFnRunnerBase.java:87)
> > at
> > org.apache.beam.sdk.util.SimpleDoFnRunner.<init>(SimpleDoFnRunner.java:42)
> > at org.apache.beam.sdk.util.DoFnRunners.simpleRunner(
> DoFnRunners.java:60)
> >
>


Re: Running examples with different runners

2016-06-24 Thread Thomas Weise
Thanks! I will try that out.

Regarding the View translation, it still fails with HEAD of master (just
did a pull):

---
Test set: org.apache.beam.sdk.testing.PAssertTest
---
Tests run: 10, Failures: 1, Errors: 3, Skipped: 0, Time elapsed: 29.566 sec
<<< FAILURE! - in org.apache.beam.sdk.testing.PAssertTest
testIsEqualTo(org.apache.beam.sdk.testing.PAssertTest)  Time elapsed: 1.518
sec  <<< ERROR!
java.lang.IllegalStateException: no translator registered for
View.CreatePCollectionView
at
org.apache.beam.runners.apex.ApexPipelineTranslator.visitPrimitiveTransform(ApexPipelineTranslator.java:98)


On Fri, Jun 24, 2016 at 5:28 PM, Lukasz Cwik <lc...@google.com.invalid>
wrote:

> Below I outline a different approach than the DirectRunner, which didn't
> require an override for Create since it knows that there is no data
> remaining and can correctly shut the pipeline down by pushing the watermark
> all the way through the pipeline. This is a superior approach, but I believe
> it is more difficult to get right.
>
> PAssert emits an aggregator with a specific name which states that the
> PAssert succeeded or failed:
>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/PAssert.java#L110
>
> The test Dataflow runner counts how many PAsserts were applied and then
> polls itself every 10 seconds checking to see if the aggregator has any
> failures or all the successes for streaming pipelines.
> Polling logic:
>
> https://github.com/apache/incubator-beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/testing/TestDataflowRunner.java#L114
> Check logic:
>
> https://github.com/apache/incubator-beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/testing/TestDataflowRunner.java#L177
>
> As for overriding a transform, the runner is currently invoked during
> application of a transform and is able to inject/replace/modify the
> transform that was being applied. The test Dataflow runner uses this a
> little bit to do the PAssert counting while the normal Dataflow runner does
> this a lot for its own specific needs.
>
> Finally, I believe Kenn just made some changes which removed the requirement
> to support View.YYY and replaced it with GroupByKey, so the "no translator
> registered for View..." may go away.
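The polling scheme described above can be condensed into a small decision function; this is an illustrative sketch of the idea (count the PAsserts applied, then poll counters), not the actual TestDataflowRunner code:

```java
// Sketch of the streaming-test check: the job passes once the success counter
// reaches the number of PAsserts, and fails as soon as any failure is counted.
// Names are illustrative, not the Beam API.
class AssertPoller {
    public enum Result { PENDING, SUCCEEDED, FAILED }

    public static Result check(int expectedAsserts, int successes, int failures) {
        if (failures > 0) {
            return Result.FAILED;
        }
        if (successes >= expectedAsserts) {
            return Result.SUCCEEDED;
        }
        return Result.PENDING; // keep polling (e.g. every 10 seconds)
    }
}
```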
>
>
> On Fri, Jun 24, 2016 at 4:52 PM, Thomas Weise <thomas.we...@gmail.com>
> wrote:
>
> > Kenneth and Lukasz, thanks for the direction.
> >
> > Is there any information about other requirements to run the cross-runner
> > tests, and hints to troubleshoot? On first attempt they mostly fail due to
> > missing translator:
> >
> > PAssertTest.testIsEqualTo:219 ▒ IllegalState no translator registered for
> > View...
> >
> > Also, for run() to be synchronous or wait, there needs to be an exit
> > condition. I know how to solve this for the Apex runner specific tests.
> But
> > for the cross runner tests, what is the recommended way to do this?
> Kenneth
> > mentioned that Create could signal end of stream. Should I look to
> override
> > the Create transformation to configure the behavior (just for this test
> > suite), and if so, is there an example of how to do this cleanly?
> >
> > Thanks,
> > Thomas
> >
> >
> >
> >
> > On Tue, Jun 21, 2016 at 7:32 PM, Kenneth Knowles <k...@google.com.invalid
> >
> > wrote:
> >
> > > To expand on the RunnableOnService test suggestion, here [1] is the
> > commit
> > > from the Spark runner. You will get a lot more information if you can
> > port
> > > this for your runner than you would from an example end-to-end test.
> > >
> > > Note that this just pulls in the tests from the core SDK. For testing
> > with
> > > other I/O connectors, you'll add them to the dependenciesToScan.
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/apache/incubator-beam/commit/4254749bf103c4bb6f68e316768c0aa46d9f7df0
> > >
> > > On Tue, Jun 21, 2016 at 4:06 PM, Lukasz Cwik <lc...@google.com.invalid
> >
> > > wrote:
> > >
> > > > There is a start to getting more e2e like integration tests going
> with
> > > the
> > > > first being WordCount.
> > > >
> > > >
> > >
> >
> https://github.com/apache/incubator-beam/blob/master/examples/java/src/test/java/org/apache/beam/examples/Word

Running examples with different runners

2016-06-21 Thread Thomas Weise
Hi,

As part of the Apex runner, we have a few unit tests for the supported
transformations. Next, I would like to test the WindowedWordCount example.

Is there an example of configuring this pipeline for another runner? Is it
recommended to supply such configuration as a JUnit test? What is the
general (repeatable?) approach to exercise different runners with the set
of example pipelines?

Thanks,
Thomas


Re: Using GroupAlsoByWindowViaWindowSetDoFn for stream of input element

2016-06-21 Thread Thomas Weise
Hi Thomas,

Thanks for the info. When the pipeline contains:

.apply(Count.perElement())

The translation looks as follows:

58   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
entering composite transform Count.PerElement
58   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
visiting transform Init [AnonymousParDo]
58   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
visiting value Count.PerElement/Init.out [PCollection]
58   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
entering composite transform Count.PerKey [Combine.PerKey]
58   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
visiting transform GroupByKey
93   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
visiting value Count.PerElement/Count.PerKey/GroupByKey.out [PCollection]
93   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
entering composite transform Combine.GroupedValues
93   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
visiting transform AnonymousParDo
93   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
visiting value
Count.PerElement/Count.PerKey/Combine.GroupedValues/AnonymousParDo.out
[PCollection]
93   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
leaving composite transform Combine.GroupedValues
93   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
leaving composite transform Count.PerKey [Combine.PerKey]
93   [main] DEBUG org.apache.beam.runners.apex.ApexPipelineTranslator  -
leaving composite transform Count.PerElement

So the runner's translator needs to take care of pushing the combine
function upstream when possible. I was wondering whether this is
something that could be handled in a runner-independent way?

Thanks,
Thomas
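For reference, "pushing the combine function upstream" (often called combiner lifting) amounts to the split below, sketched in plain Java with a counting combine (this is an illustration of the technique, not the Beam translation code):

```java
import java.util.HashMap;
import java.util.Map;

// Combiner lifting sketch: instead of shipping every element to the grouping
// stage, each upstream task keeps per-key partial counts, and the grouping
// stage only merges partials. Illustrative only.
class CombinerLifting {
    // Upstream (pre-GroupByKey) side: fold each element into a per-key accumulator.
    public static Map<String, Long> partialCount(Iterable<String> elements) {
        Map<String, Long> partials = new HashMap<>();
        for (String e : elements) {
            partials.merge(e, 1L, Long::sum);
        }
        return partials;
    }

    // Downstream (post-GroupByKey) side: merge partial counts from each task.
    public static Map<String, Long> mergePartials(Iterable<Map<String, Long>> shards) {
        Map<String, Long> result = new HashMap<>();
        for (Map<String, Long> shard : shards) {
            shard.forEach((k, v) -> result.merge(k, v, Long::sum));
        }
        return result;
    }
}
```

The lifting is valid because counting is associative and commutative, which is the property a CombineFn guarantees.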




On Fri, Jun 17, 2016 at 10:19 AM, Thomas Groh <tg...@google.com.invalid>
wrote:

> Generally, the above code snippet will work, producing (after trigger
> firing) an output Iterable containing all of the input elements. It may
> be notable that timers (and TimerInternals) are also per-key, so that
> interface must also be updated per element.
>
> By specifying the ReduceFn of the ReduceFnRunner, you can change how the
> ReduceFnRunner adds and merges state. The combining ReduceFn is suitable
> for use with upstream CombineFns, while buffering is suitable for general
> use.
>
> On Fri, Jun 17, 2016 at 9:52 AM, Thomas Weise <thomas.we...@gmail.com>
> wrote:
>
> > The source for my windowed groupByKey experiment is here:
> >
> >
> >
> https://github.com/tweise/incubator-beam/blob/master/runners/apex/src/main/java/org/apache/beam/runners/apex/translators/functions/ApexGroupByKeyOperator.java
> >
> > The result is Iterable. In cases such as counting, what is the
> > recommended way to perform the incremental aggregation, without building
> an
> > intermediate collection?
> >
> > Thomas
> >
> > On Fri, Jun 17, 2016 at 8:27 AM, Thomas Weise <thomas.we...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I'm looking into using the GroupAlsoByWindowViaWindowSetDoFn to
> > accumulate
> > > the windowed state with the elements arriving one by one (stream).
> > >
> > > Once the window is complete, I would like to emit an Iterable or
> > > another form of aggregation of the elements. Is the following supposed
> to
> > > lead to merging of current element with previously received elements
> for
> > > the same window:
> > >
> > > KeyedWorkItem<K, V> kwi = KeyedWorkItems.elementsWorkItem(
> > > kv.getKey(),
> > > Collections.singletonList(updatedWindowedValue));
> > >
> > > context.setElement(kwi, getStateInternalsForKey(kwi.key()));
> > > fn.processElement(context);
> > >
> > > The input here are always single elements.
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > >
> >
>


Re: Using GroupAlsoByWindowViaWindowSetDoFn for stream of input element

2016-06-17 Thread Thomas Weise
The source for my windowed groupByKey experiment is here:

https://github.com/tweise/incubator-beam/blob/master/runners/apex/src/main/java/org/apache/beam/runners/apex/translators/functions/ApexGroupByKeyOperator.java

The result is Iterable. In cases such as counting, what is the
recommended way to perform the incremental aggregation, without building an
intermediate collection?

Thomas

On Fri, Jun 17, 2016 at 8:27 AM, Thomas Weise <thomas.we...@gmail.com>
wrote:

> Hi,
>
> I'm looking into using the GroupAlsoByWindowViaWindowSetDoFn to accumulate
> the windowed state with the elements arriving one by one (stream).
>
> Once the window is complete, I would like to emit an Iterable or
> another form of aggregation of the elements. Is the following supposed to
> lead to merging of current element with previously received elements for
> the same window:
>
> KeyedWorkItem<K, V> kwi = KeyedWorkItems.elementsWorkItem(
> kv.getKey(),
> Collections.singletonList(updatedWindowedValue));
>
> context.setElement(kwi, getStateInternalsForKey(kwi.key()));
> fn.processElement(context);
>
> The input here are always single elements.
>
> Thanks,
> Thomas
>
>
>


Using GroupAlsoByWindowViaWindowSetDoFn for stream of input element

2016-06-17 Thread Thomas Weise
Hi,

I'm looking into using the GroupAlsoByWindowViaWindowSetDoFn to accumulate
the windowed state with the elements arriving one by one (stream).

Once the window is complete, I would like to emit an Iterable or another
form of aggregation of the elements. Is the following supposed to lead to
merging of current element with previously received elements for the same
window:

KeyedWorkItem<K, V> kwi = KeyedWorkItems.elementsWorkItem(
kv.getKey(),
Collections.singletonList(updatedWindowedValue));

context.setElement(kwi, getStateInternalsForKey(kwi.key()));
fn.processElement(context);

The input here are always single elements.

Thanks,
Thomas
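A toy model of the intended behavior (per-key, per-window state that merges each singleton work item with previously received elements) might look like the following; the names are illustrative, not Beam's APIs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of feeding single-element work items into per-(key, window) state:
// state persists across calls, so each element merges with earlier arrivals.
class WindowedState {
    private final Map<String, List<String>> stateByKeyAndWindow = new HashMap<>();

    // One call per arriving element, mirroring elementsWorkItem(key, singletonList(v)).
    public void processElement(String key, String window, String value) {
        stateByKeyAndWindow
            .computeIfAbsent(key + "/" + window, k -> new ArrayList<>())
            .add(value);
    }

    // When the window completes, the accumulated Iterable is emitted and cleared.
    public List<String> emit(String key, String window) {
        return stateByKeyAndWindow.remove(key + "/" + window);
    }
}
```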


Re: 0.1.0-incubating release

2016-06-03 Thread Thomas Weise
Another consideration for potential future packaging/distribution solutions
is how the artifacts line up as files in a flat directory. For that, it may
be good to have a common prefix and a unique artifactId.

The name for the source archive (when relying on ASF parent POM) can also
be controlled without expanding the artifactId:

 <build>
   <plugins>
     <plugin>
       <artifactId>maven-assembly-plugin</artifactId>
       <configuration>
         <finalName>apache-beam</finalName>
       </configuration>
     </plugin>
   </plugins>
 </build>

Thanks,
Thomas

On Fri, Jun 3, 2016 at 9:39 AM, Davor Bonaci 
wrote:

> BEAM-315 is definitely important. Normally, I'd always advocate for holding
> the release to pick that fix. For the very first release, however, I'd
> prefer to proceed to get something out there and test the process. As you
> said, we can address this rather quickly once we have the fix merged in.
>
> In terms of Maven coordinates, there are two basic approaches:
> * flat structure, where artifacts live under "org.apache.beam" group and
> are differentiated by their artifact id.
> * hierarchical structure, where we use different groups for different types
> of artifacts (org.apache.beam.sdks; org.apache.beam.runners).
>
> There are pros and cons on the both sides of the argument. Different
> projects made different choices. Flat structure is easier to find and
> navigate, but often breaks down with too many artifacts. Hierarchical
> structure is just the opposite.
>
> On my end, the only important thing is consistency. We used to have it, and
> it got broken by PR #365. This part should be fixed -- we should either
> finish the vision of the hierarchical structure, or rollback that PR to get
> back to a fully flat structure.
>
> My general biases tend to be:
> * hierarchical structure, since we have many artifacts already.
> * short identifiers; no need to repeat a part of the group id in the
> artifact id.
>
> On Fri, Jun 3, 2016 at 4:03 AM, Jean-Baptiste Onofré 
> wrote:
>
> > Hi Max,
> >
> > I discussed with Davor yesterday. Basically, I proposed:
> >
> > 1. To rename all parent with a prefix (beam-parent, flink-runner-parent,
> > spark-runner-parent, etc).
> > 2. For the groupId, I prefer to use different groupIds; it's clearer to
> > me, and it's exactly the usage of the groupId. Some projects use a single
> > groupId (spark, hadoop, etc), others use multiple (camel, karaf,
> > activemq, etc). I prefer different groupIds, but I'm OK to go back to a
> > single one.
> >
> > Anyway, I'm preparing a PR to introduce a new Maven module:
> > "distribution". The purpose is to address both BEAM-319 (first) and
> > BEAM-320 (later). It's where we will be able to define the different
> > distributions we plan to publish (source and binaries).
> >
> > Regards
> > JB
> >
> >
> > On 06/03/2016 11:02 AM, Maximilian Michels wrote:
> >
> >> Thanks for getting us ready for the first release, Davor! We would
> >> like to fix BEAM-315 next week. Is there already a timeline for the
> >> first release? If so, we could also address this in a minor release.
> >> Releasing often will give us some experience with our release process
> >> :)
> >>
> >> I would like everyone to think about the artifact names and group ids
> >> again. "parent" and "flink" are not very suitable names for the Beam
> >> parent or the Flink Runner artifact (same goes for the Spark Runner).
> >> I'd prefer "beam-parent", "flink-runner", and "spark-runner" as
> >> artifact ids.
> >>
> >> One might think of Maven GroupIds as a sort of hierarchy but they're
> >> not. They're just an identifier. Renaming the parent pom to
> >> "apache-beam" or "beam-parent" would give us the old naming scheme
> >> which used flat group ids (before [1]).
> >>
> >> In the end, I guess it doesn't matter too much if we document the
> >> naming schemes accordingly. What matters is that we use a consistent
> >> naming scheme.
> >>
> >> Cheers,
> >> Max
> >>
> >> [1] https://issues.apache.org/jira/browse/BEAM-287
> >>
> >>
> >> On Thu, Jun 2, 2016 at 4:00 PM, Jean-Baptiste Onofré 
> >> wrote:
> >>
> >>> Actually, I think we can fix both issue in one commit.
> >>>
> >>> What do you think about renaming the main parent POM with:
> >>> groupId: org.apache.beam
> >>> artifactId: apache-beam
> >>>
> >>> ?
> >>>
> >>> Thanks to that, the source distribution will be named
> >>> apache-beam-xxx-sources.zip and it would be clearer to dev.
> >>>
> >>> Thoughts ?
> >>>
> >>> Regards
> >>> JB
> >>>
> >>>
> >>> On 06/02/2016 03:10 PM, Jean-Baptiste Onofré wrote:
> >>>
> 
>  Another annoying thing is the main parent POM artifactId.
> 
>  Now, it's just "parent". What do you think about renaming to
>  "beam-parent" ?
> 
>  Regarding the source distribution name, I would cancel this staging to
>  fix that (I will have a PR ready soon).
> 
>  Thoughts ?
> 
>  Regards
>  JB
> 
>  On 06/02/2016 03:46 AM, Davor Bonaci wrote:
> 
> 

Re: Serialization for org.apache.beam.sdk.util.WindowedValue$*

2016-06-03 Thread Thomas Weise
Amit,

Thanks for this pointer as well, CoderHelpers helps indeed!

Thomas

On Thu, Jun 2, 2016 at 12:51 PM, Amit Sela <amitsel...@gmail.com> wrote:

> Oh sorry, of course I meant Thomas Groh in my previous email.. But @Thomas
> Weise this example
> <
> https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/EvaluationContext.java#L108
> >
> might
> help, this is how the Spark runner uses Coders like Thomas Groh described.
>
> And I agree that we should consider making PipelineOptions Serializable or
> providing a generic solution for Runners.
>
> Hope this helps,
> Amit
>
> On Thu, Jun 2, 2016 at 10:35 PM Amit Sela <amitsel...@gmail.com> wrote:
>
> > Thomas is right, though in my case, I encountered this issue when using
> > Spark's new API that uses Encoders
> > <
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoder.scala>
> not
> > just for serialization but also for "translating" the object into a
> schema
> > of optimized execution with Tungsten
> > <
> https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
> >.
> >
> > In this case I'm using Kryo, and I've solved this by registering (in Spark,
> > not Beam) custom serializers from
> > https://github.com/magro/kryo-serializers
> > I would consider (in the future) implementing Encoders with the help of
> > Coders, but I still haven't wrapped my mind around this.
> >
> > On Thu, Jun 2, 2016 at 9:59 PM Thomas Groh <tg...@google.com.invalid>
> > wrote:
> >
> >> The Beam Model ensures that all PCollections have a Coder; the
> PCollection
> >> Coder is the standard way to materialize the elements of a
> >> PCollection[1][2]. Most SDK-provided classes that will need to be
> >> transferred across the wire have an associated coder, and some
> additional
> >> default datatypes have coders associated with (in the CoderRegistry[3]).
> >>
> >> FullWindowedValueCoder[4] is capable of encoding and decoding the
> entirety
> >> of a WindowedValue, and is constructed from a ValueCoder (obtained from
> >> the
> >> PCollection) and a WindowCoder (obtained from the WindowFn of the
> >> WindowingStrategy of the PCollection). Given an input PCollection `pc`,
> >> you
> >> can construct the FullWindowedValueCoder with the following code snippet
> >>
> >> ```
> >> FullWindowedValueCoder.of(pc.getCoder(),
> >> pc.getWindowingStrategy().getWindowFn().windowCoder())
> >> ```
> >>
> >> [1]
> >>
> >>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java
> >> [2]
> >>
> >>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollection.java#L130
> >> [3]
> >>
> >>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/CoderRegistry.java#L94
> >> [4]
> >>
> >>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/WindowedValue.java#L515
> >>
> >> On Thu, Jun 2, 2016 at 10:41 AM, Thomas Weise <thomas.we...@gmail.com>
> >> wrote:
> >>
> >> > Hi Amit,
> >> >
> >> > Thanks for the help. I implemented the same serialization workaround
> for
> >> > the PipelineOptions. Since every distributed runner will have to solve
> >> > this, would it make sense to provide the serialization support along
> >> with
> >> > the interface proxy?
> >> >
> >> > Here is the exception I get with WindowedValue:
> >> >
> >> > com.esotericsoftware.kryo.KryoException: Class cannot be created
> >> (missing
> >> > no-arg constructor):
> >> > org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow
> >> > at
> >> >
> >> >
> >>
> com.esotericsoftware.kryo.Kryo$DefaultInstantiatorStrategy.newInstantiatorOf(Kryo.java:1228)
> >> > at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1049)
> >> > at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1058)
> >> > at
> >> >
> >> >
> >>
> com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:547)
> >> > at
>

Re: Serialization for org.apache.beam.sdk.util.WindowedValue$*

2016-06-03 Thread Thomas Weise
Thanks, works like a charm! For such hidden gems there should be a Beam
runner newbie guide ;-)

Thomas


On Thu, Jun 2, 2016 at 11:59 AM, Thomas Groh <tg...@google.com.invalid>
wrote:

> The Beam Model ensures that all PCollections have a Coder; the PCollection
> Coder is the standard way to materialize the elements of a
> PCollection[1][2]. Most SDK-provided classes that will need to be
> transferred across the wire have an associated coder, and some additional
> default datatypes have coders associated with (in the CoderRegistry[3]).
>
> FullWindowedValueCoder[4] is capable of encoding and decoding the entirety
> of a WindowedValue, and is constructed from a ValueCoder (obtained from the
> PCollection) and a WindowCoder (obtained from the WindowFn of the
> WindowingStrategy of the PCollection). Given an input PCollection `pc`, you
> can construct the FullWindowedValueCoder with the following code snippet:
>
> ```
> FullWindowedValueCoder.of(pc.getCoder(),
> pc.getWindowingStrategy().getWindowFn().windowCoder())
> ```
>
> [1]
>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java
> [2]
>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollection.java#L130
> [3]
>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/CoderRegistry.java#L94
> [4]
>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/WindowedValue.java#L515
>
> On Thu, Jun 2, 2016 at 10:41 AM, Thomas Weise <thomas.we...@gmail.com>
> wrote:
>
> > Hi Amit,
> >
> > Thanks for the help. I implemented the same serialization workaround for
> > the PipelineOptions. Since every distributed runner will have to solve
> > this, would it make sense to provide the serialization support along with
> > the interface proxy?
> >
> > Here is the exception I get with WindowedValue:
> >
> > com.esotericsoftware.kryo.KryoException: Class cannot be created (missing
> > no-arg constructor):
> > org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow
> > at
> >
> >
> com.esotericsoftware.kryo.Kryo$DefaultInstantiatorStrategy.newInstantiatorOf(Kryo.java:1228)
> > at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1049)
> > at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1058)
> > at
> >
> >
> com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:547)
> > at
> >
> >
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:523)
> > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
> >
> > Thanks,
> > Thomas
> >
> >
> > On Wed, Jun 1, 2016 at 12:45 AM, Amit Sela <amitsel...@gmail.com> wrote:
> >
> > > Hi Thomas,
> > >
> > > Spark and the Spark runner are using Kryo for serialization and it
> seems
> > to
> > > work just fine. What is your exact problem? Stack trace/message?
> > > I've hit an issue with Guava's ImmutableList/Map etc. and used
> > > https://github.com/magro/kryo-serializers for that.
> > >
> > > For PipelineOptions you can take a look at the Spark runner code here:
> > >
> > >
> >
> https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkRuntimeContext.java#L73
> > >
> > > I'd be happy to assist with Kryo.
> > >
> > > Thanks,
> > > Amit
> > >
> > > On Wed, Jun 1, 2016 at 7:10 AM Thomas Weise <t...@apache.org> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm working on putting together a basic runner for Apache Apex.
> > > >
> > > > Hitting a couple of serialization-related issues when running tests.
> > Apex
> > > > is using Kryo for serialization by default (and Kryo can delegate to
> > > other
> > > > serialization frameworks).
> > > >
> > > > The inner classes of WindowedValue are private and have no default
> > > > constructor, which the Kryo field serializer does not like. Also
> these
> > > > classes are not Java serializable, so that's not a fallback option
> (not
> > > > that it would be efficient anyways).
> > > >
> > > > What's the recommended technique to move the WindowedValues over the
> > > wire?
> > > >
> > > > Also, PipelineOptions aren't serializable, while most other classes
> > are.
> > > > They are needed for example with DoFnRunnerBase, so what's the
> > > recommended
> > > > way to distribute them? Disassemble/reassemble? :)
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > >
> >
>
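The coder-based round trip Thomas Groh describes above can be sketched as follows. This is a minimal illustration, assuming the Beam Java SDK (0.x-era `org.apache.beam.sdk.util` package) is on the classpath; the class name is hypothetical:

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.util.WindowedValue;
import org.apache.beam.sdk.util.WindowedValue.FullWindowedValueCoder;

public class WindowedValueRoundTrip {
  public static void main(String[] args) throws Exception {
    // Coder for String values in the global window. In a real runner the
    // value coder comes from pc.getCoder() and the window coder from
    // pc.getWindowingStrategy().getWindowFn().windowCoder().
    FullWindowedValueCoder<String> coder =
        FullWindowedValueCoder.of(StringUtf8Coder.of(), GlobalWindow.Coder.INSTANCE);

    WindowedValue<String> original = WindowedValue.valueInGlobalWindow("hello");

    // Encode to bytes for the wire, then decode on the receiving side.
    byte[] wire = CoderUtils.encodeToByteArray(coder, original);
    WindowedValue<String> decoded = CoderUtils.decodeFromByteArray(coder, wire);

    System.out.println(decoded.getValue());  // hello
  }
}
```

This sidesteps Kryo's no-arg-constructor requirement entirely, since the private `WindowedValue` subclasses are never instantiated reflectively.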


Re: Serialization for org.apache.beam.sdk.util.WindowedValue$*

2016-06-02 Thread Thomas Weise
Hi Amit,

Thanks for the help. I implemented the same serialization workaround for
the PipelineOptions. Since every distributed runner will have to solve
this, would it make sense to provide the serialization support along with
the interface proxy?

Here is the exception I get with WindowedValue:

com.esotericsoftware.kryo.KryoException: Class cannot be created (missing
no-arg constructor):
org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow
at
com.esotericsoftware.kryo.Kryo$DefaultInstantiatorStrategy.newInstantiatorOf(Kryo.java:1228)
at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1049)
at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1058)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:547)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:523)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)

Thanks,
Thomas


On Wed, Jun 1, 2016 at 12:45 AM, Amit Sela <amitsel...@gmail.com> wrote:

> Hi Thomas,
>
> Spark and the Spark runner are using Kryo for serialization and it seems to
> work just fine. What is your exact problem? Stack trace/message?
> I've hit an issue with Guava's ImmutableList/Map etc. and used
> https://github.com/magro/kryo-serializers for that.
>
> For PipelineOptions you can take a look at the Spark runner code here:
>
> https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkRuntimeContext.java#L73
>
> I'd be happy to assist with Kryo.
>
> Thanks,
> Amit
>
> On Wed, Jun 1, 2016 at 7:10 AM Thomas Weise <t...@apache.org> wrote:
>
> > Hi,
> >
> > I'm working on putting together a basic runner for Apache Apex.
> >
> > Hitting a couple of serialization-related issues when running tests. Apex
> > is using Kryo for serialization by default (and Kryo can delegate to
> other
> > serialization frameworks).
> >
> > The inner classes of WindowedValue are private and have no default
> > constructor, which the Kryo field serializer does not like. Also these
> > classes are not Java serializable, so that's not a fallback option (not
> > that it would be efficient anyways).
> >
> > What's the recommended technique to move the WindowedValues over the
> wire?
> >
> > Also, PipelineOptions aren't serializable, while most other classes are.
> > They are needed for example with DoFnRunnerBase, so what's the
> recommended
> > way to distribute them? Disassemble/reassemble? :)
> >
> > Thanks,
> > Thomas
> >
>
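The "disassemble/reassemble" workaround for PipelineOptions can be sketched as a thin holder that keeps the options as a JSON string, which is roughly the approach the SparkRuntimeContext link above takes. This assumes Jackson can round-trip the options proxy (Beam's PipelineOptions carry Jackson annotations for this); the class name is hypothetical:

```java
import java.io.Serializable;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.beam.sdk.options.PipelineOptions;

/**
 * Holds PipelineOptions as a JSON string so the enclosing object stays
 * Java-serializable; the non-serializable proxy is rebuilt lazily on the
 * remote side.
 */
public class SerializablePipelineOptions implements Serializable {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  private final String json;
  private transient PipelineOptions options;  // rebuilt on demand after deserialization

  public SerializablePipelineOptions(PipelineOptions options) throws Exception {
    this.json = MAPPER.writeValueAsString(options);
    this.options = options;
  }

  public synchronized PipelineOptions get() throws Exception {
    if (options == null) {
      options = MAPPER.readValue(json, PipelineOptions.class);
    }
    return options;
  }
}
```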


Serialization for org.apache.beam.sdk.util.WindowedValue$*

2016-05-31 Thread Thomas Weise
Hi,

I'm working on putting together a basic runner for Apache Apex.

Hitting a couple of serialization-related issues when running tests. Apex
is using Kryo for serialization by default (and Kryo can delegate to other
serialization frameworks).

The inner classes of WindowedValue are private and have no default
constructor, which the Kryo field serializer does not like. Also these
classes are not Java serializable, so that's not a fallback option (not
that it would be efficient anyways).

What's the recommended technique to move the WindowedValues over the wire?

Also, PipelineOptions aren't serializable, while most other classes are.
They are needed for example with DoFnRunnerBase, so what's the recommended
way to distribute them? Disassemble/reassemble? :)

Thanks,
Thomas
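The coder-based answer given earlier in the thread can also be wired directly into Kryo: a custom `Serializer` that delegates to a Beam `Coder` avoids the no-arg-constructor requirement. A sketch, assuming Kryo and the Beam SDK are on the classpath (the class name is hypothetical):

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.util.CoderUtils;

/** Bridges Kryo to a Beam Coder instead of using reflective field serialization. */
public class BeamCoderSerializer<T> extends Serializer<T> {
  private final Coder<T> coder;

  public BeamCoderSerializer(Coder<T> coder) {
    this.coder = coder;
  }

  @Override
  public void write(Kryo kryo, Output output, T value) {
    try {
      byte[] bytes = CoderUtils.encodeToByteArray(coder, value);
      output.writeInt(bytes.length, true);  // length-prefix the payload
      output.writeBytes(bytes);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public T read(Kryo kryo, Input input, Class<T> type) {
    try {
      byte[] bytes = input.readBytes(input.readInt(true));
      return CoderUtils.decodeFromByteArray(coder, bytes);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```

Since the WindowedValue inner classes are private, the `Class` token for registration has to be obtained from an instance, e.g. `kryo.register(WindowedValue.valueInGlobalWindow("").getClass(), new BeamCoderSerializer<>(fullCoder))`.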