Re: A personal update

2017-12-13 Thread Ismaël Mejía
Hello Davor, great to know you are going to continue contributing to
the project. Welcome back and best wishes for this new phase!

On Wed, Dec 13, 2017 at 3:12 PM, Kenneth Knowles  wrote:
> Great to have you back!
>
> On Tue, Dec 12, 2017 at 11:20 PM, Robert Bradshaw 
> wrote:
>>
>> Great to hear from you again, and really happy you're sticking around!
>>
>> - Robert
>>
>>
>> On Tue, Dec 12, 2017 at 10:47 PM, Ahmet Altay  wrote:
>> > Welcome back! Looking forward to your contributions.
>> >
>> > Ahmet
>> >
>> > On Tue, Dec 12, 2017 at 10:05 PM, Jesse Anderson
>> > 
>> > wrote:
>> >>
>> >> Congrats!
>> >>
>> >>
>> >> On Wed, Dec 13, 2017, 5:54 AM Jean-Baptiste Onofré 
>> >> wrote:
>> >>>
>> >>> Hi Davor,
>> >>>
>> >>> welcome back !!
>> >>>
>> >>> It's really great to see you back active in the Beam community. We
>> >>> really
>> >>> need you !
>> >>>
>> >>> I'm so happy !
>> >>>
>> >>> Regards
>> >>> JB
>> >>>
>> >>> On 12/13/2017 05:51 AM, Davor Bonaci wrote:
>> >>> > My dear friends,
>> >>> > As many of you have noticed, I’ve been visibly absent from the
>> >>> > project
>> >>> > for a
>> >>> > little while. During this time, a great number of you kept reaching
>> >>> > out, and for
>> >>> > that I’m deeply humbled and grateful to each and every one of you.
>> >>> >
>> >>> > I needed some time for personal reflection, which led to a
>> >>> > transition
>> >>> > in my
>> >>> > professional life. As things have settled, I’m happy to again be
>> >>> > working among
>> >>> > all of you, as we propel this project forward. I plan to be active
>> >>> > in
>> >>> > the
>> >>> > future, but perhaps not quite full-time as I was before.
>> >>> >
>> >>> > In the near term, I’m working on getting the report to the Board
>> >>> > completed, as
>> >>> > well as framing the discussion about the project state and vision
>> >>> > going
>> >>> > forwards. Additionally, I’ll make sure that we foster healthy
>> >>> > community
>> >>> > culture
>> >>> > and operate in the Apache Way.
>> >>> >
>> >>> > For those who are curious, I’m happy to say that I’m starting a
>> >>> > company
>> >>> > building
>> >>> > products related to Beam, along with several other members of this
>> >>> > community and
>> >>> > authors of this technology. I’ll share more on this next year, but
>> >>> > until then if
>> >>> > you have a data processing problem or an Apache Beam question, I’d
>> >>> > love
>> >>> > to hear
>> >>> > from you ;-).
>> >>> >
>> >>> > Thanks -- and so happy to be back!
>> >>> >
>> >>> > Davor
>> >>>
>> >>> --
>> >>> Jean-Baptiste Onofré
>> >>> jbono...@apache.org
>> >>> http://blog.nanthrax.net
>> >>> Talend - http://www.talend.com
>> >
>> >
>
>


Re: [DISCUSS] [Java] Private shaded dependency uber jars

2017-12-11 Thread Ismaël Mejía
Hello, I wanted to bring back this subject because I think we should
take action on it, and at least start with a shaded version of guava.
I was playing with a toy project and followed the procedure we use to
submit jars to a Hadoop cluster via Flink/Spark, which involves
creating an uber jar, and I realized that the jar was way bigger than
I expected; the fact that we shade guava in every module contributes
to this. I found guava shaded in:

sdks/java/core
runners/core-construction-java
runners/core-java
model/job-management
runners/spark
sdks/java/io/hadoop-file-system
sdks/java/io/kafka

This means guava is bundled at least 6 more times than it should be,
which adds around 15MB (2.4MB x 6 copies) of extra weight that we
could simply remove by depending on a single shaded version of the
library.

I add this point to the ones previously mentioned in the discussion
because this hurts the end-user experience and affects us all
(devs/users).

Another question is whether (and how) we should shade protocol
buffers, because with the portability work we are now exposing it
closer to the end users. I say this because I also hit an issue while
running a job on YARN with the Spark runner: hadoop-common includes
protobuf-java 2, and I had to explicitly provide protobuf 3 as a
dependency to be able to use triggers (note that the Spark runner
translates them using a method from runners/core-java). Since
hadoop-common is provided in the cluster with the older version of
protobuf, I am afraid this will bite us in the future.

Ismaël

ps. There is already a JIRA to shade protobuf in hadoop-common, but
this is not coming until Hadoop 3 is out.
https://issues.apache.org/jira/browse/HADOOP-13136

ps2. An extra curious detail: the dataflow runner ends up having guava
shaded twice, via the shaded copy in core-construction-java.

ps3. Of course this message is a de facto +1, at least to do it for
guava and to evaluate it for other libs.
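
To make this concrete, here is a minimal Java sketch of what the
single-shaded-artifact idea (Kenn's proposal quoted below) would look
like from the SDK side. This is purely illustrative: the class and the
relocated package here are assumptions, not actual Beam code.

// Hypothetical sketch: guava is consumed through the relocated package
// of a single pre-shaded artifact (beam-sdks-java-private-deps), so the
// modules listed above would need no per-module relocation/bundling.
import org.apache.beam.sdk.private.com.google.common.base.Preconditions;
import org.apache.beam.sdk.private.com.google.common.collect.ImmutableList;

public class ShadedGuavaExample {
  public static ImmutableList<String> requireNonEmpty(ImmutableList<String> names) {
    Preconditions.checkArgument(!names.isEmpty(), "names must not be empty");
    return names;
  }
}

With that in place a build rule could forbid plain com.google.common
imports everywhere else, and the uber jar would carry exactly one
shaded copy of guava.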


On Tue, Oct 17, 2017 at 7:29 PM, Lukasz Cwik  wrote:
> An issue to call out is how to deal with our generated code (.avro and
> .proto) as I don't believe those plugins allow you to generate code using a
> shaded package prefix on imports.
>
> On Tue, Oct 17, 2017 at 10:28 AM, Thomas Groh 
> wrote:
>
>> +1 to the goal. I'm hugely in favor of not doing the same shading work
>> every time for dependencies we know we'll use.
>>
>> This also means that if we end up pulling in transitive dependencies we
>> don't want in any particular module we can avoid having to adjust our
>> repackaging strategy for that module - which I have run into face-first in
>> the past.
>>
>> On Tue, Oct 17, 2017 at 9:48 AM, Kenneth Knowles 
>> wrote:
>>
>> > Hi all,
>> >
>> > Shading is a big part of how we keep our dependencies sane in Beam. But
>> > downsides: shading is super slow, causes massive jar bloat, and kind of
>> > hard to get right because artifacts and namespaces are not 1-to-1.
>> >
>> > I know that some communities distribute their own shaded distributions of
>> > dependencies. I had a thought about doing something similar that I wanted
>> > to throw out there for people to poke holes in.
>> >
>> > To set the scene, here is how I view shading:
>> >
>> >  - A module has public dependencies and private dependencies.
>> >  - Public deps are used for data interchange; users must share these
>> deps.
>> >  - Private deps are just functionality and can be hidden (in our case,
>> > relocated + bundled)
>> >  - It isn't necessarily that simple, because public and private deps
>> might
>> > interact in higher-order ways ("public" is contagious)
>> >
>> > Shading is an implementation detail of expressing these characteristics.
>> We
>> > use shading selectively because of its downsides I mentioned above.
>> >
>> > But what about this idea: Introduce shaded deps as a single separate
>> > artifact.
>> >
>> >  - sdks/java/private-deps: bundled uber jar with relocated versions of
>> > everything we want to shade
>> >
>> >  - sdks/java/core and sdks/java/harness: no relocation or bundling -
>> > depends on `beam-sdks-java-private-deps` and imports like
>> > `org.apache.beam.sdk.private.com.google.common` directly (this is what
>> > they are rewritten to today)
>> >
>> > Some benefits
>> >
>> >  - much faster builds of other modules
>> >  - only one shaded uber jar
>> >  - rare/no rebuilds of the uber jar
>> >  - can use maven enforcer to forbid imports like com.google.common
>> >  - configuration all in one place
>> >  - no automated rewriting of our real code, which has led to some major
>> > confusion
>> >  - easy to implement incrementally
>> >
>> > Downsides:
>> >
>> >  - plenty of effort to get there
>> >  - unclear how many different such deps modules we need; sharing them
>> could
>> > get weird
>> >  - if we hit a roadblock, we will have committed a lot of time
>> >
>> > Just something I was musing as I spent another evening waiting for slow
>> > builds to try to confirm

Re: Introduction + interest in helping Beam builds, tests, and releases

2017-12-08 Thread Ismaël Mejía
Welcome Alan, looking forward to your help, in particular to having
ways to validate our releases on Hadoop's YARN too (Dataproc).
If I can add an extra point, it would be to also have a 'backwards'
version of Holden's wish, so we can test the releases against previous
versions of the dependencies, e.g. Spark 2.0, 2.1, 2.2, etc.



On Fri, Dec 8, 2017 at 8:49 AM, Ahmet Altay  wrote:
> Welcome Alan, this sounds great!
>
> On Thu, Dec 7, 2017 at 8:00 PM, Holden Karau  wrote:
>>
>> Also, and I know this is maybe a bit beyond the scope of what would make
>> sense initially, but if you wanted to set up something to test BEAM against
>> the new Spark/Flink RCs we could give feedback about any breaking changes we
>> see in upstream projects and I’d be happy to help with that :)
>>
>> On Fri, Dec 8, 2017 at 9:44 AM Eugene Kirpichov 
>> wrote:
>>>
>>> Awesome, excited to see release validation automated! Please let me know
>>> if you need help getting Flink and Spark runner validation on Dataproc - I
>>> did that manually and it involved some non-obvious steps.
>>>
>>> On Thu, Dec 7, 2017 at 5:29 PM Alan Myrvold  wrote:

 Hi, I'm Alan.

 I've been working with Google Cloud engineering productivity, and I'm
 keen on improving the Beam release process and build/test infrastructure.

 I will be first looking into scripting some of the release validation
 steps for the nightly java snapshot releases, but hope to learn and improve
 the whole development and testing experience for Beam.

 Look forward to working with everyone!

 Alan Myrvold

>> --
>> Twitter: https://twitter.com/holdenkarau
>
>


Re: Apache Ignite as a distributed processing back-ends

2017-12-08 Thread Ismaël Mejía
Hello Denis,

This is really great news. I think Ignite can be integrated with Beam
as an IO; in that case Beam developers would read/write their data
from/to Ignite in their data processing pipelines.

You can take a look at some of the existing IOs for ideas and follow
the PTransform style guide:
https://github.com/apache/beam/tree/master/sdks/java/io
https://beam.apache.org/contribute/ptransform-style-guide/

Note that there is an open JIRA to support a JCache-based connector,
so a good idea would be to implement it and use Ignite as the
reference example (of course you can go the Ignite-native route, but
community-wise the JCache one would be neat).
https://issues.apache.org/jira/browse/BEAM-2584
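
Just to illustrate how such a connector could surface to users, here
is a hypothetical sketch in the style of the existing IOs. IgniteIO
does not exist yet; its name and methods are made up, and only the
Pipeline API around it is real.

// Hypothetical usage sketch of an Ignite/JCache connector.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
// (import of the hypothetical IgniteIO omitted, it does not exist yet)

public class IgniteIOExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // Read all entries of a named Ignite cache as key-value pairs;
    // read() and withCacheName() are invented method names.
    PCollection<KV<String, String>> entries =
        p.apply(IgniteIO.<String, String>read().withCacheName("myCache"));
    p.run().waitUntilFinish();
  }
}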

From a quick look at the Compute Grid documentation on the website, it
also seems that it could make sense to integrate Ignite into Beam as a
runner. This requires translating the Beam model into the appropriate
Ignite API. For this, the best reference to start with is:

Also, I saw that you guys have an Ignite filesystem (IGFS) with HDFS
compatibility, so a first quick contribution would be to validate that
it works with Beam and add some documentation on how to use it, for
example along the lines of the sketch below.
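
Here p is a Pipeline as in the sketch above, and TextIO comes from the
core SDK; whether the igfs:// scheme behaves like this is exactly what
would need validating, so the path below is illustrative and untested.

// Hypothetical validation sketch: read text files from IGFS through
// Beam's Hadoop filesystem module (sdks/java/io/hadoop-file-system),
// assuming HadoopFileSystemOptions is configured with the IGFS settings.
PCollection<String> lines =
    p.apply(TextIO.read().from("igfs://igfs@127.0.0.1/data/input-*"));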

Don’t hesitate to ask questions, create JIRAs, or contact us here or
in the slack channel if needed.

Best,
Ismaël

On Fri, Dec 8, 2017 at 6:54 AM, Romain Manni-Bucau
 wrote:
> Hi
>
> This sounds awesome to have an Ignite runner which could compete with
> hazelcast-jet.
>
> The entry point would be https://beam.apache.org/contribute/runner-guide/
> IMHO.
>
> Being on Ignite cluster also opens a lot of doors - reusing the filesystem
> or distributed structures. Very exciting.
>
> On 8 Dec 2017 at 05:46, "Denis Magda"  wrote:
>>
>> Hello Apache Beam fellows!
>>
>> We at Apache Ignite community came across your project and would be happy
>> to integrate with it.
>>
>> In short, Ignite is a distributed database and computational platform that
>> has its own map-reduce like component:
>> https://apacheignite.readme.io/docs/compute-grid
>>
>> The integration will give Beam users an ability to use Ignite as a
>> distributed processing back-end system and database.
>>
>> How should we proceed? Please share any relevant information.
>>
>> —
>> Denis
>> Ignite PMC


Re: Apache Beam, version 2.2.0

2017-12-08 Thread Ismaël Mejía
Thanks Eugene for opening the poll (sorry I didn't do it before; I was
quite busy in the last two days but expected to do it today).


On Fri, Dec 8, 2017 at 1:27 AM, Ahmet Altay  wrote:
>
> On Thu, Dec 7, 2017 at 3:51 PM, Eugene Kirpichov 
> wrote:
>>
>> I've sent the poll
>> https://lists.apache.org/thread.html/5bc2e184a24de9dbc8184ffd2720d1894010497d47d956b395e037df@%3Cuser.beam.apache.org%3E
>> Will figure out how to tweet from @ApacheBeam, and sent the Twitter poll
>> as well (or ask someone to).
>
>
> I tweeted the poll.
>
>>
>>
>> On Wed, Dec 6, 2017 at 1:47 PM Lukasz Cwik  wrote:
>>>
>>> +1 on moving forward with the plan suggested by kirpichov@
>>>
>>> On Wed, Dec 6, 2017 at 9:14 AM, Robert Bradshaw 
>>> wrote:
>>>>
>>>> +1 to moving forward with this plan.
>>>>
>>>> (FWIW, this seems *less* backwards incompatible than, say, moving from
>>>> Spark 1 to Spark 2, which was decided much quicker. I suppose the
>>>> Spark change has a lower bound on the number of users it could impact
>>>> though.)
>>>>
>>>> On Wed, Dec 6, 2017 at 9:09 AM, Eugene Kirpichov 
>>>> wrote:
>>>> > Okay, then let's go forward. Seems that we should:
>>>> > - Open a new poll on user@, in light of 2.2 having been released
>>>> > - Open a twitter poll
>>>> > - Tweet that there's also a poll going on on user@
>>>> > - Runner authors will reach out to respective runner user communities
>>>> > - 2 weeks later we gather results and decide
>>>> > ?
>>>> >
>>>> > On Wed, Dec 6, 2017 at 6:16 AM Ismaël Mejía  wrote:
>>>> >>
>>>> >> +1 for Eugene’s arguments; waiting for Beam 3.0 seems still far away,
>>>> >> and starting to improve Beam to offer a Java 8 friendly experience
>>>> >> seems like an excellent idea.
>>>> >>
>>>> >> I understand the backwards compatibility argument. We should do the
>>>> >> poll in twitter + try to reach more users for comments. If you
>>>> >> consider that it is worth, I can open a second poll at user@.
>>>> >>
>>>> >> In any case we should try to move forward, even if we have more than
>>>> >> 5% of users who want to stay on Java 7, we can consider maintaining
>>>> >> minor releases of a backwards-compatible version where we can
>>>> >> backport
>>>> >> only critical fixes e.g. security/data related errors but nothing
>>>> >> new,
>>>> >> in case some user really needs to have them. Of course this can be
>>>> >> some extra work (to be discussed).
>>>> >>
>>>> >>
>>>> >> On Tue, Dec 5, 2017 at 7:24 AM, Jean-Baptiste Onofré
>>>> >> 
>>>> >> wrote:
>>>> >> > +1, and sorry again, I thought we got an consensus.
>>>> >> >
>>>> >> > Regards
>>>> >> > JB
>>>> >> >
>>>> >> > On 12/05/2017 07:10 AM, Kenneth Knowles wrote:
>>>> >> >>
>>>> >> >> +1 to the poll and also to Reuven's point.
>>>> >> >>
>>>> >> >> Those without a support contract would have been using JDK 7
>>>> >> >> without
>>>> >> >> security updates for years. IMO it seems harmful, as a netizen, to
>>>> >> >> encourage
>>>> >> >> its use/existence.
>>>> >> >>
>>>> >> >> If there's no noise from the prior thread, then I would assume no
>>>> >> >> one
>>>> >> >> on
>>>> >> >> user@ has any objection. Anyone else with customers should reach
>>>> >> >> out to
>>>> >> >> them.
>>>> >> >>
>>>> >> >> Kenn
>>>> >> >>
>>>> >> >> On Mon, Dec 4, 2017 at 9:49 PM, Reuven Lax  wrote:
>>>> >> >>
>>>> >> >> Technically it's a backwards-incompatible change, however if
>>>> >> >> we are
>>>> >> >> convinced the risk is low we could do it.

Re: Apache Beam, version 2.2.0

2017-12-06 Thread Ismaël Mejía
 7
>> would
>> be a blocker or hindrance to adopting the new release for me
>>
>> We could tweet this poll on Apache Beam twitter and publish on
>> user@,
>> and, say, if we receive 5% or fewer votes for option 3 after
>> keeping it
>> open for 2 weeks, then adopt Java 8 without a major version
>> change.
>>
>> WDYT?
>>
>> On Mon, Dec 4, 2017 at 8:34 PM Jean-Baptiste Onofré  wrote:
>>
>> Good idea ! Definitely +1
>>
>> Regards
>> JB
>>
>> On 12/05/2017 05:25 AM, Reuven Lax wrote:
>>  > We should bring this up on the Beam 3.0 thread. Since it's
>> technically a
>>  > backwards-incompatible change, it might make a good item
>> for Beam
>> 3.0.
>>  >
>>  > Reuven
>>  >
>>  > On Mon, Dec 4, 2017 at 8:20 PM, Jean-Baptiste Onofré  wrote:
>>  >
>>  > My apologizes, I thought we had a consensus already.
>>  >
>>  > Regards
>>  > JB
>>  >
>>  > On 12/04/2017 11:22 PM, Eugene Kirpichov wrote:
>>  >
>>  > Thanks JB for sending the detailed notes about new
>> stuff
>> in 2.2.0! A lot
>>  > of exciting things indeed.
>>  >
>>  > Regarding Java 8: I thought our consensus was to
>> have the
>> release notes
>>  > say that we're *considering* going Java8-only, and
>> use
>> that to get more
>>  > opinions from the user community - but I can't find
>> the
>> emails that made
>>  > me think so.
>>  >
>>  > +Ismaël Mejía  - do
>>  > you think we should formally conclude the vote on
>> the
>> thread [VOTE]
>>  > [DISCUSSION] Remove support for Java 7?
>>  > Or should we take more steps - e.g. perhaps tweet a
>> link
>> to that thread
>>  > from the Beam twitter account, ask people to chime
>> in,
>> and wait for say
>>  > 2 weeks before declaring a conclusion?
>>  >
>>  > Let's also have a process JIRA for going Java8.
>> I've
>> filed one:
>>  > https://issues.apache.org/jira/browse/BEAM-3285
>>  >
>>  > On Mon, Dec 4, 2017 at 1:58 AM Jean-Baptiste Onofré  wrote:
>>  >
>>  >  Just an important note that we forgot to
>> mention.
>>  >
>>  >  !! The 2.2.0 release will be the last one
>> supporting
>> Spark 1.x and
>>  > Java 7 !!
>>  >
>>  >  Starting from Beam 2.3.0, the Spark runner
>> will work
>> only with
>>  > Spark 2.x and we
>>  >  will focus only Java 8.
>>  >
>>  >  Regards
>>  >  JB
>>  >
>>  >  On 12/04/

Re: [DISCUSS] Updating contribution guide for gitbox

2017-11-29 Thread Ismaël Mejía
+1 to Kenneth's proposal of using the reviewer and assignee fields. For
the merge strategy, (a) +1 with the same arguments (preserving commits
when they are meaningful and isolated, and asking committers to do an
extra squash if needed).

I don't really favor having one big commit per PR (in particular if
the change is big) because you lose information with that approach. We
should encourage contributors to make meaningful and isolated commits;
of course, if that is not the case we can go and squash them, but this
is something to review case by case.

Regards,
Ismaël

On Wed, Nov 29, 2017 at 4:37 PM, Aljoscha Krettek  wrote:
> I think I agree with Kenn on the "merge question":
>  - There should be a merge commit because this records important information, 
> for example, I like having the option of figuring out what PR certain commits 
> came from
>  - Individual meaningful commits of a PR should be preserved, I think having 
> commits as small as possible is nice and the git history tells a story of 
> where the code came from
>  - fixup commits should be squashed
>
> The question of whether to keep or squash commits could also be solved by 
> enforcing 1 PR = 1 commit and making people open several PRs where they would 
> previously open one PR with several distinct and meaningful commits. This 
> might introduce quite some overhead, though.
>
> Best,
> Aljoscha
>
>> On 29. Nov 2017, at 09:40, Jean-Baptiste Onofré  wrote:
>>
>> Hi,
>>
>> I don't see why gitbox merge button change what we are doing.
>>
>> I agree with Kenn for 1 (reviewer field) & 2 (assignee field).
>>
>> IMHO, for 3, I think the reviewer should only use rebase & merge. The squash 
>> should be under the contributor scope. The reviewer can ask to squash some 
>> commits, but he should not do it himself (the contributor should update the 
>> PR with the squashes).
>>
>> My $0.01 ;)
>>
>> Regards
>> JB
>>
>> On 11/28/2017 06:45 PM, Kenneth Knowles wrote:
>>> Hi all,
>>> James brought up a great question in Slack, which was how should we use the 
>>> merge button, illustrated [1]
>>> I want to broaden the discussion to talk about all the new capabilities:
>>> 1. Whether & how to use the "reviewer" field
>>> 2. Whether & how to use the "assignee" field
>>> 3. Whether & how to use the merge button
>>> My preferences are:
>>> 1. Use the reviewer field instead of "R:" comments.
>>> 2. Use the assignee field to keep track of who the review is blocked on 
>>> (either the reviewer for more comments or the author for fixes)
>>> 3. Use merge commits, but editing the commit subject line
>>> To expand on part 3, GitHub's merge button has three options [1]. They are 
>>> not described accurately in the UI, as they all say "merge" when only one 
>>> of them performs a merge. They do the following:
>>> (a) Merge the branch with a merge commit
>>> (b) Squash all the commits, rebase and push
>>> (c) Rebase and push without squash
>>> Unlike our current guide, all of these result in a "merged" status for the 
>>> PR, so we can correctly distinguish those PRs that were actually merged.
>>> My votes on these options are:
>>> (a) +1 this preserves the most information
>>> (b) -1 this erases the most information
>>> (c) -0 this is just sort of a middle ground; it breaks commit hashes, does 
>>> not have a clear merge commit, but preserves other info
>>> Kenn
>>> [1] https://apachebeam.slack.com/messages/C1AAFJYMP/
>>> Kenn
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>


Re: [DISCUSS] Thinking about Beam 3.x roadmap and release schedule

2017-11-29 Thread Ismaël Mejía
It is good to see so much enthusiasm about the future of Beam,
independently of whether we call it Beam 3 or not.

I have some doubts about the idea of a release per month; Apache
releases are designed to be slow-paced (via the 3-day voting process).
It only takes a holiday period in the same month plus some issues
during the release that require two RCs, and a release will easily
take two weeks (of course I understand the will to improve on our
not-so-good status quo of 6 weeks for the last two votes). My point is
that a monthly release can bring a ton of extra validation work for
every release; remember, validating a release is not just running the
unit tests.

I want to add one idea to the wishlist for Beam in the future:

- We need to improve Beam’s monitorability in a unified way even if
this goes beyond the initial goals of the project because this is a
big pain point for Beam adopters. We need things like system metrics
and utilities to monitor what is going on with Beam pipelines in a
runner-agnostic way.

It would be nice to create JIRAs for the issues discussed in this
thread (those that don't exist yet); with them we can track the work
and put together some sort of roadmap.


On Wed, Nov 29, 2017 at 7:05 AM, Romain Manni-Bucau
 wrote:
> Ps: forgot another wish: make usable beam sql. Today you need to add a fn
> before and after cause of that type breakage not consistent with the
> pipeline API. It would be nice to support pojo (extracted from the select
> fields or created from "views" like in jackson) but not having to wrap the
> sql usage in multiple UDF would make it powerful and ready to use.
>
> Le 29 nov. 2017 07:01, "Romain Manni-Bucau"  a écrit
> :
>>
>> My user wishes - whatever version, it is just a number after all ;):
>>
>> - make coder usage simpler and consistent (PCollection TypeDescriptor and
>> Coder are duplicated in term of API)
>> - have a beam api (split from the sdk and internals and impl)
>> - have SDF supported by runners
>> - have a SDFRunner allowing to simulate the SDF lifecycle manually (same
>> for DoFn short term - see next point for the current issue)
>> - ensure classloader usage is consistent, ie any proxy is created into the
>> final artifact classloader (transform if custom, dofn/source/sdf otherwise)
>> - have a test compatibility kit (TCK) for runner. It would be a jar any
>> runner impl can import to run with surefire
>> - make IO configuration reflection friendly (get rid of the autovalue
>> pattern which is not industriablizable and allow pojo like classes or
>> alternatively support reading the conf from properties)
>> - support pipeline implicit option based on transform names to override
>> some attributes
>> - change runner implementations to let the bundle size have a pipeline
>> option defining an upper bound and not hardcode them arbitrarly - defaults
>> can stay the current ones
>> - better multi input/output support (just PCollection based and fully
>> wireable)
>> - a smoother pipeline API would be nice. I like hazelcast jet one for
>> instance
>>
>>> On 29 Nov 2017 at 03:29, "Robert Bradshaw"  wrote:
>>>
>>> On Tue, Nov 28, 2017 at 9:48 AM, Reuven Lax  wrote:
>>> >
>>> > On Tue, Nov 28, 2017 at 9:14 AM, Jean-Baptiste Onofré 
>>> > wrote:
>>> >>
>>> >> Hi Reuven,
>>> >>
>>> >> Yes, I remember that we agreed on a release per month. However, we
>>> >> didn't
>>> >> do it before. I think the most important is not the period, it's more
>>> >> a
>>> >> stable pace. I think it's more interesting for our community to have
>>> >> "always" a release every two months, more than a tentative of a
>>> >> release
>>> >> every month that end later than that. Of course, if we can do both,
>>> >> it's
>>> >> perfect ;)
>>> >
>>> > Agree. A stable pace is the most important thing.
>>>
>>> +1, and I think everyone who's done a release is in favor of making it
>>> easier and more frequent. Someone should put together a proposal of
>>> easy things we can do to automate, etc.
>>>
>>> >> For Beam 3.x, I wasn't talking about breaking change, but more about
>>> >> "marketing" announcement. I think that, even if we don't break API,
>>> >> some
>>> >> features are "strong enough" to be "qualified" in a major version.
>>> >
>>> > Ah, good point. This doesn't stop us from checking in these new
>>> > features
>>> > into 2.x possibly tagged with an @Experimental flag. We can then use
>>> > 3.0 to
>>> > announce all these features more broadly, and remove @Experimental
>>> > tags.
>>> >
>>> > I would also like to see enterprise-ready BeamSQL and Java 7
>>> > deprecation on
>>> > the list for Beam 3.0
>>> >
>>> >>
>>> >> I think that any major idea & feature (breaking or not the API) are
>>> >> valuables for Beam 3.x (and it's a good sign for our community again
>>> >> ;)).
>>>
>>> I'm generally not a fan of bumping the major version number just
>>> because enough time has passed, or enough new features have gone in
>>> (and am mostly opposed to holding features bac

Re: [DISCUSS] Move away from Apache Maven as build tool

2017-11-27 Thread Ismaël Mejía
I have been a little bit out of the discussion on Maven vs Gradle
because I was waiting for the technical proofs of concept to evaluate
the best approach. I deeply appreciate all the effort that Lukasz has
put into the Gradle version, and I also think that during the
discussion Romain and others have brought up some serious and
important points that make the decision less simple than I expected
(in the end, sadly, it is not as simple as "the fastest wins"). In any
case I don't think it is wise to switch to Gradle immediately, at
least if switching means removing the Maven files: we have to consider
that the 'full' build/tests were introduced in the CI only around one
week ago, and I am not sure that this is enough time to evaluate any
possible regression. I am also particularly curious to know whether
the artifacts are correct and complete. Has somebody already simulated
a release with the Gradle build, for example? This, for me, is a
prerequisite before we even start discussing the switch.


On Mon, Nov 27, 2017 at 9:34 PM, Lukasz Cwik  wrote:
> On Mon, Nov 27, 2017 at 11:51 AM, Romain Manni-Bucau 
> wrote:
>
>> 2017-11-27 20:26 GMT+01:00 Lukasz Cwik :
>> > Romain, as mentioned earlier, I identified that Maven was slower because
>> it
>> > needed to finish building the entire module before dependent modules
>> could
>> > start which included running tests, performing checkstyle, etc...
>> > Gradle is able to increase the parallelism of the build process since it
>> > has task driven parallelism so as long as the files are compiled, the
>> > dependent projects can start.
>>
>> This means we can implement a maven graph builder which is better than
>> the default one - surely with a thread safe local repo - and
>> contribute it back to solve it durably.
>>
>> If speed for a clean build was the only problem then maybe but lack of
> incremental builds across tasks is a goal we can actually achieve using
> Gradle and won't require rewriting almost all of the Maven plugins to
> support incremental builds.
>
>
>> >
>> > Maven and Gradle are both heavily used since there are ~146k Maven
>> projects
>> > on Github while there are ~122k Gradle project on Github. Do you have
>> data
>> > which shows that Maven is significantly more "mainstream"?
>>
>> Yep, project i worked on in companies using gradle: 0, all were based
>> on maven and maven was "tool-ed" versus gradle was "best effort" in
>> term of plugins.
>> Now  - with my EE background - I can guarantee you gradle is not able
>> to handle properly its build since it flatten the classpath and
>> plugins conflicts very quickly (their plugin dependency feature never
>> worked and almost no plugin impl it correctly).
>>
>> Wonder if it is easy to have the ASF stats, anyone knows?
>
>>
>> > I believe we want a rich multi-language SDK and community and feel as
>> > though it would be unwise to treat non JDK based languages as second
>> class.
>>
>> Hmm, not sure how it is related to the build tool since Maven and
>> Gradle have the same level of support - actually surprsingly maven is
>> better for js and surely as bad as gradle for others - or here again
>> we can create plugin like the frontend-maven-plugin if needed for
>> other languages.
>>
>> That said it can be an interesting other thread since people consuming
>> these languages will probably want their mainstream build tool and a
>> "standard" repository layout rather than a java one. But this is
>> harder to measure.
>>
>>
>>
>> > On Mon, Nov 27, 2017 at 11:00 AM, Romain Manni-Bucau <
>> rmannibu...@gmail.com>
>> > wrote:
>> >
>> >> Hi Lukasz,
>> >>
>> >> Did you manage to identify how maven was slower and test tesla stuff
>> >> and potentially a few other fixes?
>> >>
>> >> Side note: figures without python can be interesting cause locally -
>> >> for me - python tends to flatten the figures whereas I get something
>> >> close to your conclusions without python part.
>> >>
>> >> My point is mainly that switching now on gradle and being back on
>> >> maven in a few months cause gradle ecosystem is far to support java 9
>> >> - or any other volatile reason like this one - is probably not a good
>> >> choice for a community. Maven is way more mainstream than gradle so
>> >> helps to encourage people to contribute - vs gradle will increase the
>> >> step to do it.
>> >>
>> >> I'd like to be sure before a switch that it is a one way decision and
>> >> that the build tool was not just challenged by itself and its current
>> >> state but also in the way it could be improved (= its community and
>> >> potentially some local hacks).
>> >>
>> >> Romain Manni-Bucau
>> >> @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>> >>
>> >>
>> >> 2017-11-27 19:46 GMT+01:00 Lukasz Cwik :
>> >> > I have collected data by running several builds against master using
>> >> Gradle
>> >> > and Maven without using Gradle's support for incremental builds.
>> >> >
>> >> > Gradle (mins)
>> >> > min: 25.04
>> >> > max: 160.14
>> >> > median

Re: [RESULT][VOTE] Migrate to gitbox

2017-11-23 Thread Ismaël Mejía
If GitHub already does the notifications, I think that having an extra
notifications/reviews mailing list could be overkill (or spammy).
However, I can see the value of this for archival reasons, e.g. to
store the history of the project's comments outside of GitHub for the
future.

+1 for a new mailing list (reviews@) or disabled

I don't think that sending these to commits@ is a good idea. The
commits mailing list already has a good amount of stuff going on, and
adding more granular information will make it harder to follow.


On Thu, Nov 23, 2017 at 12:17 PM, Jean-Baptiste Onofré  
wrote:
> Hi,
>
> following the migration to gitbox, we now have a notification e-mail (on the
> dev mailing list) for each action on a PR (comments, closing, etc).
>
> It could be very verbose and I think we have to change that. For now, I will
> ask to disable this notification.
>
> However, I think it's worth asking on the mailing list. Basically we have the
> following options:
>
> - send the notification to commits@ mailing list
> - send the notification to a new mailing list (like review@ mailing list)
> - leave the notification disabled
>
> Please, let me know what you prefer.
>
> Thanks
> Regards
> JB
>
>
> On 11/23/2017 11:19 AM, Jean-Baptiste Onofré wrote:
>>
>> The migration is done, you have to update your local copy with git remote
>> set-url to use gitbox.apache.org instead of git-wip-us.apache.org.
>>
>> I'm checking the GitHub PRs (if we now have the merge button).
>>
>> Regards
>> JB
>>
>> On 11/23/2017 10:55 AM, Jean-Baptiste Onofré wrote:
>>>
>>> Hi guys,
>>>
>>> I just got an update from INFRA: the migration to gitbox starts now.
>>>
>>> Regards
>>> JB
>>>
>>> On 11/07/2017 05:51 PM, Jean-Baptiste Onofré wrote:

 Hi guys,

 quick update on the gitbox migration.

 I created a Jira for INFRA:

 https://issues.apache.org/jira/browse/INFRA-15456

 It should be done pretty soon.

 Regards
 JB

 On 10/23/2017 07:24 AM, Jean-Baptiste Onofré wrote:
>
> Hi all,
>
> this vote passed with only +1.
>
> I will request INFRA to move the repositories to gitbox.
>
> Thanks all for your vote !
>
> Regards
> JB
>
> On 10/10/2017 09:42 AM, Jean-Baptiste Onofré wrote:
>>
>> Hi all,
>>
>> following the discussion, here's the formal vote to migrate to gitbox:
>>
>> [ ] +1, Approve to migrate to gitbox
>> [ ] -1, Do not migrate (please provide specific comments)
>>
>> The vote will be open for at least 36 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>>
>> Thanks,
>> Regards
>> JB
>
>

>>>
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: [VOTE] Fixing @yyy.com.INVALID mailing addresses

2017-11-23 Thread Ismaël Mejía
+1

On Thu, Nov 23, 2017 at 6:35 AM, Robert Bradshaw
 wrote:
> +1
>
> On Wed, Nov 22, 2017, 10:10 PM Jean-Baptiste Onofré  wrote:
>
>> +1
>>
>> Regards
>> JB
>>
>> On 11/23/2017 12:25 AM, Lukasz Cwik wrote:
>> > I have noticed that some e-mail addresses (notably @google.com) get
>> > .INVALID suffixed onto it so per...@yyy.com become
>> per...@yyy.com.INVALID
>> > in the From: header.
>> >
>> > I have figured out that this is an issue with the way that our mail
>> server
>> > is configured and opened
>> https://issues.apache.org/jira/browse/INFRA-15529.
>> >
>> > For those of us that are impacted, it makes it more difficult for users
>> to
>> > reply directly to the originator.
>> >
>> > Infra has asked to get consensus from PMC members before making the
>> change
>> > which I figured it would be easiest with a vote.
>> >
>> > Please vote:
>> > +1 Update mail server to stop suffixing .INVALID
>> > -1 Don't change mail server settings.
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>


Re: [VOTE] Choose the "new" Spark runner

2017-11-20 Thread Ismaël Mejía
Moving my vote from previous threads:

[ ] Use Spark 1 & Spark 2 Support Branch
[X] Use Spark 2 Only Branch

Ismaël


On Thu, Nov 16, 2017 at 2:08 PM, Jean-Baptiste Onofré  wrote:
> Hi guys,
>
> To illustrate the current discussion about Spark versions support, you can
> take a look on:
>
> --
> Spark 1 & Spark 2 Support Branch
>
> https://github.com/jbonofre/beam/tree/BEAM-1920-SPARK2-MODULES
>
> This branch contains a Spark runner common module compatible with both Spark
> 1.x and 2.x. For convenience, we introduced spark1 & spark2
> modules/artifacts containing just a pom.xml to define the dependencies set.
>
> --
> Spark 2 Only Branch
>
> https://github.com/jbonofre/beam/tree/BEAM-1920-SPARK2-ONLY
>
> This branch is an upgrade to Spark 2.x and "drop" support of Spark 1.x.
>
> As I'm ready to merge one of the other in the PR, I would like to complete
> the vote/discussion pretty soon.
>
> Correct me if I'm wrong, but it seems that the preference is to drop Spark
> 1.x to focus only on Spark 2.x (for the Spark 2 Only Branch).
>
> I would like to call a final vote to act the merge I will do:
>
> [ ] Use Spark 1 & Spark 2 Support Branch
> [ ] Use Spark 2 Only Branch
>
> This informal vote is open for 48 hours.
>
> Please, let me know what your preference is.
>
> Thanks !
> Regards
> JB
>
> On 11/13/2017 09:32 AM, Jean-Baptiste Onofré wrote:
>>
>> Hi Beamers,
>>
>> I'm forwarding this discussion & vote from the dev mailing list to the
>> user mailing list.
>> The goal is to have your feedback as user.
>>
>> Basically, we have two options:
>> 1. Right now, in the PR, we support both Spark 1.x and 2.x using three
>> artifacts (common, spark1, spark2). You, as users, pick up spark1 or spark2
>> in your dependencies set depending on the Spark target version you want.
>> 2. The other option is to upgrade and focus on Spark 2.x in Beam 2.3.0. If
>> you still want to use Spark 1.x, then, you will be stuck up to Beam 2.2.0.
>>
>> Thoughts ?
>>
>> Thanks !
>> Regards
>> JB
>>
>>
>>  Forwarded Message 
>> Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
>> Date: Wed, 8 Nov 2017 08:27:58 +0100
>> From: Jean-Baptiste Onofré 
>> Reply-To: dev@beam.apache.org
>> To: dev@beam.apache.org
>>
>> Hi all,
>>
>> as you might know, we are working on Spark 2.x support in the Spark
>> runner.
>>
>> I'm working on a PR about that:
>>
>> https://github.com/apache/beam/pull/3808
>>
>> Today, we have something working with both Spark 1.x and 2.x from a code
>> standpoint, but I have to deal with dependencies. It's the first step of the
>> update as I'm still using RDD, the second step would be to support dataframe
>> (but for that, I would need PCollection elements with schemas, that's
>> another topic on which Eugene, Reuven and I are discussing).
>>
>> However, as all major distributions now ship Spark 2.x, I don't think it's
>> required anymore to support Spark 1.x.
>>
>> If we agree, I will update and cleanup the PR to only support and focus on
>> Spark 2.x.
>>
>> So, that's why I'm calling for a vote:
>>
>>[ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
>>[ ] 0 (I don't care ;))
>>[ ] -1, I would like to still support Spark 1.x, and so having support
>> of both Spark 1.x and 2.x (please provide specific comment)
>>
>> This vote is open for 48 hours (I have the commits ready, just waiting the
>> end of the vote to push on the PR).
>>
>> Thanks !
>> Regards
>> JB
>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: New Contributor

2017-11-14 Thread Ismaël Mejía
Great news, Welcome Axel and Ben !


On Tue, Nov 14, 2017 at 11:46 PM, Reuven Lax  wrote:
> Welcome both of you!
>
> On Wed, Nov 15, 2017 at 6:14 AM, Griselda Cuevas 
> wrote:
>
>> Welcome guys!
>>
>> On 14 November 2017 at 13:11, Jesse Anderson 
>> wrote:
>>
>> > Welcome!
>> >
>> > On Tue, Nov 14, 2017, 10:03 PM Ben Sidhom 
>> > wrote:
>> >
>> > > Hey all,
>> > >
>> > > My name is Ben Sidhom. I'm an engineer at Google working on open source
>> > > data processing on top of GCP. I hope to contribute to the runner
>> > > portability effort along with Axel.
>> > >
>> > >
>> > > On 2017-11-14 11:38, Axel Magnuson  wrote:
>> > > > Hello All,>
>> > > >
>> > > > My name is Axel Magnuson and I intend to start contributing to the
>> > Beam>
>> > > > project.>
>> > > >
>> > > > I work as a Software Engineer at Google, with a background in Spark
>> > and>
>> > > > Hadoop.  I am hoping to make myself useful in particular around
>> > > portability>
>> > > > efforts and the open source engine runners.>
>> > > >
>> > > > Best,>
>> > > > Axel>
>> > > >
>> > > > -- >
>> > > > Axel Magnuson | Software Engineer | axelm...@google.com |   1 (425)
>> > > 893-4624>
>> > > >
>> > >
>> > --
>> > Thanks,
>> >
>> > Jesse
>> >
>>


Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-09 Thread Ismaël Mejía
+1 for the move to Spark 2, modulo warning users and deciding on support:

I agree that having compatibility with both versions of Spark is
desirable, but I am not sure it is worth the effort. Apart from the
reasons mentioned by Holden and Pei, I will add that the burden of
simultaneous maintenance could be bigger than the return, and also
that most Big Data/Cloud distributions have already moved to Spark 2,
so it makes sense to prioritize new users over legacy ones, in
particular if we consider that Beam is a 'recent' project.

We can announce the end of support for Spark 1 in the Beam 2.2 release
notes and decide whether we will support it in maintenance mode. In
that case we would backport or fix any reported issue related to the
Spark 1 runner on the 2.2.x branch, let's say for a year, but we
wouldn't add new functionality. Or we can just decide not to support
it anymore and encourage users to move to Spark 2.

On Thu, Nov 9, 2017 at 6:59 AM, Pei HE  wrote:
> +1 on moving forward with Spark 2.x only.
> Spark 1 users can still use already released Spark runners, and we can
> support them with minor version releases for future bug fixes.
>
> I don't see how important it is to make future Beam releases available to
> Spark 1 users. If they choose not to upgrade Spark clusters, maybe they
> don't need the newest Beam releases as well.
>
> I think it is more important to 1). be able to leverage new features in
> Spark 2.x, 2.) extend user base to Spark 2.
> --
> Pei
>
>
> On Thu, Nov 9, 2017 at 1:45 PM, Holden Karau  wrote:
>
>> That's a good point about Oozie only supporting Spark 1 or 2 at a
>> time on a cluster -- but do we know people using Oozie and Spark 1 that
>> would still be using Spark 1 by the time of the next BEAM release? The last
>> Spark 1 release was a year ago (and last non-maintenance release almost 20
>> months ago).
>>
>> On Wed, Nov 8, 2017 at 9:30 PM, NerdyNick  wrote:
>>
>> > I don't know if ditching Spark 1 out right right now would be a great
>> move
>> > given that a lot of the main support applications around spark haven't
>> yet
>> > fully moved to Spark 2 yet, let alone have support for having a cluster
>> > with both. Oozie for example is still pre stable release for their Spark
>> 1
>> > and can't support a cluster with mixed Spark version. I think maybe doing
>> > as suggested above with the common, spark1, spark2 packaging might be
>> best
>> > during this carry over phase. Maybe even just flag spark 1 as deprecated
>> > and just being maintained might be enough.
>> >
>> > On Wed, Nov 8, 2017 at 10:25 PM, Holden Karau 
>> > wrote:
>> >
>> > > Also, upgrading Spark 1 to 2 is generally easier than changing JVM
>> > > versions. For folks using YARN or the hosted environments it pretty
>> much
>> > > trivial since you can effectively have distinct Spark clusters for each
>> > > job.
>> > >
>> > > On Wed, Nov 8, 2017 at 9:19 PM, Holden Karau 
>> > wrote:
>> > >
>> > > > I'm +1 on dropping Spark 1. There are a lot of exciting improvements
>> in
>> > > > Spark 2, and trying to write efficient code that runs between Spark 1
>> > and
>> > > > Spark 2 is super painful in the long term. It would be one thing if
>> > there
>> > > > were a lot of people available to work on the Spark runners, but it
>> > seems
>> > > > like we'd be better spent focusing our energy on the future.
>> > > >
>> > > > I don't know a lot of folks who are stuck on Spark 1, and the few
>> that
>> > I
>> > > > know are planning to migrate in the next few months anyways.
>> > > >
>> > > > Note: this is a non-binding vote as I'm not a committer or PMC
>> member.
>> > > >
>> > > > On Wed, Nov 8, 2017 at 3:43 AM, Ted Yu  wrote:
>> > > >
>> > > >> Having both Spark1 and Spark2 modules would benefit wider user base.
>> > > >>
>> > > >> I would vote for that.
>> > > >>
>> > > >> Cheers
>> > > >>
>> > > >> On Wed, Nov 8, 2017 at 12:51 AM, Jean-Baptiste Onofré <
>> > j...@nanthrax.net>
>> > > >> wrote:
>> > > >>
>> > > >> > Hi Robert,
>> > > >> >
>> > > >> > Thanks for your feedback !
>> > > >> >
>> > > >> > From an user perspective, with the current state of the PR, the
>> same
>> > > >> > pipelines can run on both Spark 1.x and 2.x: the only difference
>> is
>> > > the
>> > > >> > dependencies set.
>> > > >> >
>> > > >> > I'm calling the vote to get such kind of feedback: if we consider
>> > > Spark
>> > > >> > 1.x still need to be supported, no problem, I will improve the PR
>> to
>> > > >> have
>> > > >> > three modules (common, spark1, spark2) and let users pick the
>> > desired
>> > > >> > version.
>> > > >> >
>> > > >> > Let's wait a bit other feedbacks, I will update the PR
>> accordingly.
>> > > >> >
>> > > >> > Regards
>> > > >> > JB
>> > > >> >
>> > > >> >
>> > > >> > On 11/08/2017 09:47 AM, Robert Bradshaw wrote:
>> > > >> >
>> > > >> >> I'm generally a -0.5 on this change, or at least doing so
>> hastily.
>> > > >> >>
>> > > >> >> As with dropping Java 7 support, I think this should at 

Re: [VOTE] Release 2.2.0, release candidate #2

2017-11-08 Thread Ismaël Mejía
I tested the Python version of the release: I created a new virtualenv
and ran

python setup.py install

and it gave me this message:

Traceback (most recent call last):
  File "setup.py", line 203, in 
'test': generate_protos_first(test),
  File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
dist.run_commands()
  File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
  File 
"/home/ismael/.virtualenvs/beam-vote2/local/lib/python2.7/site-packages/setuptools/command/install.py",
line 67, in run
self.do_egg_install()
  File 
"/home/ismael/.virtualenvs/beam-vote2/local/lib/python2.7/site-packages/setuptools/command/install.py",
line 109, in do_egg_install
self.run_command('bdist_egg')
  File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
  File 
"/home/ismael/.virtualenvs/beam-vote2/local/lib/python2.7/site-packages/setuptools/command/bdist_egg.py",
line 169, in run
cmd = self.call_command('install_lib', warn_dir=0)
  File 
"/home/ismael/.virtualenvs/beam-vote2/local/lib/python2.7/site-packages/setuptools/command/bdist_egg.py",
line 155, in call_command
self.run_command(cmdname)
  File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
  File 
"/home/ismael/.virtualenvs/beam-vote2/local/lib/python2.7/site-packages/setuptools/command/install_lib.py",
line 11, in run
self.build()
  File "/usr/lib/python2.7/distutils/command/install_lib.py", line 109, in build
self.run_command('build_py')
  File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
  File "setup.py", line 143, in run
gen_protos.generate_proto_files()
  File 
"/home/ismael/releases/votes/beam/apache-beam-2.2.0-python/gen_protos.py",
line 66, in generate_proto_files
'Not in apache git tree; unable to find proto definitions.')
RuntimeError: Not in apache git tree; unable to find proto definitions.

Not sure if this is something in my environment, but this passed when
I validated the previous release (2.1.0).


On Wed, Nov 8, 2017 at 11:30 AM, Reuven Lax  wrote:
> Hi everyone,
>
> Please review and vote on the release candidate #2 for the version 2.2.0,
> as follows:
>   [ ] +1, Approve the release
>   [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
>   * JIRA release notes [1],
>   * the official Apache source release to be deployed to dist.apache.org [2],
> which is signed with the key with fingerprint B98B7708 [3],
>   * all artifacts to be deployed to the Maven Central Repository [4],
>   * source code tag "v2.2.0-RC3" [5],
>   * website pull request listing the release and publishing the API
> reference manual [6].
>   * Java artifacts were built with Maven 3.5.0 and OpenJDK/Oracle JDK
> 1.8.0_144.
>   * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Reuven
>
> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> projectId=12319527&version=12341044
> [2] https://dist.apache.org/repos/dist/dev/beam/2.2.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1023/
> 
> [5] https://github.com/apache/beam/tree/v2.2.0-RC
> 3
> [6] https://github.com/apache/beam-site/pull/337


Re: [VOTE] Release 2.2.0, release candidate #2

2017-11-03 Thread Ismaël Mejía
I found some issues during the vote validation (not sure if those
would require a new vote, since most seem to be packaging related and
we could fix them by removing the extra stuff that ended up in the zip
files):

1. I inspected the apache-beam-2.2.0-source-release.zip file and was a
bit surprised to notice that it was twice the size of the one for the
2.1.0 vote. Then I discovered that the sdks/python/.eggs directory was
part of the 2.2.0 zip file (I suppose this is an issue).

2. There are some directories/files that appear in the zip file that
don't exist in the 2.2.0-rc2 git tag:

2.1.1/
foo/
model/
sdks/python/README.md

3. Then I ran the rat validation and it broke because some files don't
have the correct license headers (I suppose these are generated files
that should not be part of the final distribution). This is a part of
the release process that we have done manually and that has bitten us
in the last two releases.

[WARNING] Files with unapproved licenses:
  sdks/python/apache_beam/portability/api/beam_runner_api_pb2_grpc.py
  sdks/python/apache_beam/portability/api/standard_window_fns_pb2.py
  sdks/python/apache_beam/portability/api/beam_job_api_pb2.py
  sdks/python/apache_beam/portability/api/endpoints_pb2.py
  sdks/python/apache_beam/portability/api/beam_artifact_api_pb2_grpc.py
  sdks/python/apache_beam/portability/api/beam_artifact_api_pb2.py
  sdks/python/apache_beam/portability/api/beam_fn_api_pb2_grpc.py
  sdks/python/apache_beam/portability/api/beam_fn_api_pb2.py
  sdks/python/apache_beam/portability/api/beam_runner_api_pb2.py
  sdks/python/apache_beam/portability/api/beam_provision_api_pb2.py
  sdks/python/apache_beam/portability/api/beam_job_api_pb2_grpc.py
  sdks/python/apache_beam/portability/api/endpoints_pb2_grpc.py
  sdks/python/apache_beam/portability/api/beam_provision_api_pb2_grpc.py
  sdks/python/apache_beam/portability/api/standard_window_fns_pb2_grpc.py

On Wed, Nov 1, 2017 at 4:47 AM, Reuven Lax  wrote:
> Hi everyone,
>
> Please review and vote on the release candidate #2 for the version 2.2.0,
> as follows:
>   [ ] +1, Approve the release
>   [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
>   * JIRA release notes [1],
>   * the official Apache source release to be deployed to dist.apache.org
> [2], which is signed with the key with fingerprint B98B7708 [3],
>   * all artifacts to be deployed to the Maven Central Repository [4],
>   * source code tag "v2.2.0-RC2" [5],
>   * website pull request listing the release and publishing the API
> reference manual [6].
>   * Java artifacts were built with Maven 3.5.0 and OpenJDK/Oracle JDK
> 1.8.0_144.
>   * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Reuven
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341044
> [2] https://dist.apache.org/repos/dist/dev/beam/2.2.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1022/
> [5] https://github.com/apache/beam/tree/v2.2.0-RC2
> [6] https://github.com/apache/beam-site/pull/337


Re: [Proposal] Sharing Neville's post and upcoming meetups in the Twitter handle

2017-10-23 Thread Ismaël Mejía
Has anybody thought about getting some Beam 'swag' for these events
(the meetups + conference talks)? So far I have seen some Beam
stickers around, but it would be really nice to have some other items:
t-shirts, mugs, socks, corkscrews, webcam covers, whatever.
People seem to love these things, and it is a simple way to promote
the project.

Ismaël


On Sat, Oct 21, 2017 at 7:40 AM, Jean-Baptiste Onofré  wrote:
> +1 for Neville's post.
>
> And no problem to promote the coming meetups.
>
> Regards
> JB
>
>
> On 10/20/2017 10:08 PM, Griselda Cuevas wrote:
>>
>> Hi everyone - What do you think about sharing Neville's blogpost[1] about
>> the road to Scio on the Apache Beam Twitter account?, I think it'd be good
>> to share some content since the last time we were active as 9/27.
>>
>> Also - could you help promote some of the upcoming Meetups? I made the
>> following tweets:
>>
>> 11/1 - San Francisco Cloud Mafia
>> Tweet:
>> Come join the SF Cloud Mafia to learn about stream & batch processing with
>> #ApacheBeam on Nov. 1st. https://www.meetup.com/San-
>> Francisco-Cloud-Mafia/events/244180581/
>>
>> 11/22 - StockholmApache Beam Meetup
>> Tweet:
>> Stockholm is ready for its first #ApacheBeam meetup on Nov. 22nd. Join if
>> you're around! https://www.meetup.com/Apache-Beam-Stockholm/
>>
>> [1] https://labs.spotify.com/2017/10/16/big-data-processing-at-
>> spotify-the-road-to-scio-part-1/
>>
>> Thanks!
>> G
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: Possibility of requiring Java 8 compiler for building Java 7 sources?

2017-10-17 Thread Ismaël Mejía
pache.org%3E
>> > > > > might
>> > > > > be it.
>> > > > >
>> > > > > Kafka is considering it:
>> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>> > > > > 118%3A+Drop+Support+for+Java+7+in+Kafka+0.11
>> > > > > and
>> > > > > quotes a number of other open-source projects that have switched
>> > > > > http://markmail.org/message/l7s276y3xkga2eqf
>> > > > >
>> > > > > So basically these projects all did a mailing list poll, and one
>> did
>> > > > also a
>> > > > > twitter poll.
>> > > > >
>> > > > > Beam has the advantage of being a relatively young project with
>> > > perhaps a
>> > > > > smaller base of users entrenched in using old versions of Java;
>> > > moreover,
>> > > > > Java version would matter only for the smaller subset of users who
>> > use
>> > > > Beam
>> > > > > Spark/Flink/Apex/.. runners (as opposed to Cloud Dataflow), which
>> is
>> > > > likely
>> > > > > an even more "early adopter"-ish group of users, as these runners
>> > > > generally
>> > > > > receive less support.
>> > > > >
>> > > > > It may be a good idea to have at least 1 release pass between
>> > > announcing
>> > > > > the intention to drop Java 7 and actually dropping it (e.g. if we
>> > > decided
>> > > > it
>> > > > > now, then 2.4 would drop Java7). Also, we could start by switching
>> > > tests
>> > > > to
>> > > > > compile/run with java8 (Maven allows this). This is, I think,
>> pretty
>> > > much
>> > > > > safe to do immediately.
>> > > > >
>> > > > > On Mon, Oct 16, 2017 at 7:35 AM Ismaël Mejía 
>> > > wrote:
>> > > > >
>> > > > > > Any progress on this? What is the proposed way to validate if
>> users
>> > > > > > are still interested on Java 7? A vote on user or something
>> > > different?
>> > > > > >
>> > > > > >
>> > > > > > On Wed, Sep 27, 2017 at 7:59 PM, Kenneth Knowles
>> > > > > > > > > >
>> > > > > > wrote:
>> > > > > > > Agree with polling Beam users as well as each runner's
>> community
>> > in
>> > > > > > > aggregate.
>> > > > > > >
>> > > > > > > On Wed, Sep 27, 2017 at 9:44 AM, Jean-Baptiste Onofré <
>> > > > j...@nanthrax.net
>> > > > > >
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > >> Definitely agree
>> > > > > > >>
>> > > > > > >>
>> > > > > > >> On 09/27/2017 06:00 PM, Robert Bradshaw wrote:
>> > > > > > >>
>> > > > > > >>> I also think that it's time to seriously consider dropping
>> > > support
>> > > > > for
>> > > > > > >>> Java 7.
>> > > > > > >>>
>> > > > > > >>> On Tue, Sep 26, 2017 at 9:14 PM, Daniel Oliveira
>> > > > > > >>>  wrote:
>> > > > > > >>>
>> > > > > > >>>> Yes, just as Ismaël said it's a compilation blocker right
>> now
>> > > > > despite
>> > > > > > >>>> that
>> > > > > > >>>> (I believe) we don't use the extension that's breaking.
>> > > > > > >>>>
>> > > > > > >>>> As for other ways to solve this, if there is a way to avoid
>> > > > > compiling
>> > > > > > the
>> > > > > > >>>> advanced features of AutoValue that might be worth a try. We
>> > > could
>> > > > > > also
>> > > > > > >>>> try
>> > > > > > >>>> to get a release of AutoValue with the fix that works in
>> Java
>> > 7.

Re: Possibility of requiring Java 8 compiler for building Java 7 sources?

2017-10-16 Thread Ismaël Mejía
Any progress on this? What is the proposed way to validate whether users
are still interested in Java 7? A vote on the user list, or something different?


On Wed, Sep 27, 2017 at 7:59 PM, Kenneth Knowles  
wrote:
> Agree with polling Beam users as well as each runner's community in
> aggregate.
>
> On Wed, Sep 27, 2017 at 9:44 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Definitely agree
>>
>>
>> On 09/27/2017 06:00 PM, Robert Bradshaw wrote:
>>
>>> I also think that it's time to seriously consider dropping support for
>>> Java 7.
>>>
>>> On Tue, Sep 26, 2017 at 9:14 PM, Daniel Oliveira
>>>  wrote:
>>>
>>>> Yes, just as Ismaël said it's a compilation blocker right now despite
>>>> that
>>>> (I believe) we don't use the extension that's breaking.
>>>>
>>>> As for other ways to solve this, if there is a way to avoid compiling the
>>>> advanced features of AutoValue that might be worth a try. We could also
>>>> try
>>>> to get a release of AutoValue with the fix that works in Java 7. However
>>>> I
>>>> feel that slowly moving over to Java 8 is the most future-proof solution
>>>> if
>>>> it's possible.
>>>>
>>>> On Tue, Sep 26, 2017 at 2:47 PM, Ismaël Mejía  wrote:
>>>>
>>>> The current issue is that compilation fails on master because beam's
>>>>> parent pom is configured to fail if it finds warnings):
>>>>>
>>>>>  -Werror
>>>>>
>>>>> However if you remove that line from the parent pom the compilation
>>>>> passes.
>>>>>
>>>>> Of course this does not mean that everything is solved for Java 9,
>>>>> there are some tests that break and other issues because of other
>>>>> plugins and dependencies (e.g. bytebuddy), but those are not part of
>>>>> this discussion.
>>>>>
>>>>> On Tue, Sep 26, 2017 at 11:38 PM, Eugene Kirpichov
>>>>>  wrote:
>>>>>
>>>>>> AFAIK we don't use any advanced capabilities of AutoValue. Does that
>>>>>> mean
>>>>>> this issue is moot? I didn't quite understand from your email whether
>>>>>> it
>>>>>>
>>>>> is
>>>>>
>>>>>> a compilation blocker for Beam or not.
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 2:32 PM Ismaël Mejía 
>>>>>> wrote:
>>>>>>
>>>>>> Great that you are also working on this too Daniel and thanks for
>>>>>>> bringing this subject to the mailing list, I was waiting to  my return
>>>>>>> to office next week, but you did it first :)
>>>>>>>
>>>>>>> Eugene for reference (This is the issue on the migration to Java 9),
>>>>>>> notice that here the goal is first that beam passes mvn clean install
>>>>>>> with pure Java 9 (and also add this to jenkins), not to rewrite
>>>>>>> anything to use the new stuff (e.g. modules):
>>>>>>> https://issues.apache.org/jira/browse/BEAM-2530
>>>>>>>
>>>>>>> Eugene can you also PTAL at the AutoValue issue, more details on the
>>>>>>> issue, this is a warning so I don't know if it is really critical in
>>>>>>> particular because we are not using Memoization (do we?).
>>>>>>> https://github.com/google/auto/issues/503
>>>>>>>
>>>>>>> https://github.com/google/auto/commit/71514081f2ca6fb4ead2b7f0a25f5d
>>>>>>>
>>>>>> 02247b8532
>>>>>
>>>>>>
>>>>>>> Wouldn't the easiest way be that you guys convince the google.auto
>>>>>>> guys to generate that simple fix in a Java 7 compatible way and
>>>>>>> 'voila' ?
>>>>>>>
>>>>>>> However I agree that moving to Java 8 is an excellent idea and as
>>>>>>> Eugene mentions there is less friction now since most projects are
>>>>>>> moving, only pending issue are existing clusters on java 7 in the
>>>>>>> hadoop world, but those are less frequent now. Anyway this discussion
>>>>>>> is really important (maybe even worth a vote). Because moving to Java
>>>>>>

Re: CouchDbIO connector in beam io

2017-10-12 Thread Ismaël Mejía
This is an interesting one; please go ahead and create the JIRA.

Maybe it is a good idea to ping Seshadri and the guys who were
interested in implementing CouchbaseIO:
https://issues.apache.org/jira/browse/BEAM-1893

I have no idea whether the APIs of CouchDB and Couchbase are similar, but
if they are, it would be nice to share some code. It would also be nice
if the IO API were similar to those of other document-oriented stores like
MongoDB; we already have similar data stores like Bigtable/HBase or
Elasticsearch/Solr sharing some of their 'style'.

Don't hesitate to ping me (us) or reach out via Slack if you have
questions or need some help.

Ismaël

On Thu, Oct 12, 2017 at 11:46 AM, tarush grover  wrote:
> I do not have a concrete use case, but CouchDB has been used for real-time
> event storage and can be used for analytics, so I thought that providing a
> connector for it would be helpful.
>
> Not contacted Apache CouchDB team but now I am thinking, thanks for the
> input!!
>
> Regards,
> Tarush
>
> On Fri, Oct 6, 2017 at 3:08 AM, Chamikara Jayalath 
> wrote:
>
>> CouchDB sounds interesting. Could you expand a bit more on potential
>> use-cases ? Also did you get any input from Apache CouchDB team ?
>>
>> Thanks,
>> Cham
>>
>> On Thu, Oct 5, 2017 at 12:49 AM tarush grover 
>> wrote:
>>
>> > Hi All,
>> >
>> > I wanted to have inputs from community members regarding to have couchdb
>> io
>> > connectors in our beam io module.
>> >
>> > Regards,
>> > Tarush
>> >
>>


Re: [VOTE] Migrate to gitbox

2017-10-10 Thread Ismaël Mejía
+1 (non-binding)

On Tue, Oct 10, 2017 at 10:42 AM, Aljoscha Krettek  wrote:
> +1
>
>> On 10. Oct 2017, at 09:42, Jean-Baptiste Onofré  wrote:
>>
>> Hi all,
>>
>> following the discussion, here's the formal vote to migrate to gitbox:
>>
>> [ ] +1, Approve to migrate to gitbox
>> [ ] -1, Do not migrate (please provide specific comments)
>>
>> The vote will be open for at least 36 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>>
>> Thanks,
>> Regards
>> JB
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>


Re: [DISCUSS] Switch to gitbox

2017-10-09 Thread Ismaël Mejía
+1

On Oct 9, 2017 6:52 PM, "Thomas Groh"  wrote:

> +1.
>
> I do love myself a forcing function for passing tests.
>
> On Mon, Oct 9, 2017 at 7:51 AM, Aljoscha Krettek 
> wrote:
>
> > +1
> >
> > > On 6. Oct 2017, at 18:57, Kenneth Knowles 
> > wrote:
> > >
> > > Sounds great. If I recall correctly, it means we could also us
> > assignment /
> > > review requests to pass pull requests around, instead of "R: foo"
> > comments.
> > >
> > > On Fri, Oct 6, 2017 at 9:30 AM, Tyler Akidau
>  > >
> > > wrote:
> > >
> > >> +1
> > >>
> > >> On Fri, Oct 6, 2017 at 8:54 AM Reuven Lax 
> > >> wrote:
> > >>
> > >>> +1
> > >>>
> > >>> On Oct 6, 2017 4:51 PM, "Lukasz Cwik" 
> > wrote:
> > >>>
> >  I think its a great idea and find that the mergebot works well on
> the
> >  website.
> >  Since gitbox enforces that the precommit checks pass, it would also
> be
> > >> a
> >  good forcing function for the community to maintain reliably passing
> > >>> tests.
> > 
> >  On Fri, Oct 6, 2017 at 4:58 AM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> >  wrote:
> > 
> > > Hi guys,
> > >
> > > We use Apache gitbox for the website and it works fine (as soon as
> > >> you
> > > linked your Apache and github with 2FA enabled).
> > >
> > > What do you think about moving to gitbox for the codebase itself ?
> > >
> > > It could speed up the review and merge for the PRs.
> > >
> > > Thoughts ?
> > >
> > > Regards
> > > JB
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> > >
> > 
> > >>>
> > >>
> >
> >
>


Re: Possibility of requiring Java 8 compiler for building Java 7 sources?

2017-09-26 Thread Ismaël Mejía
The current issue is that compilation fails on master because Beam's
parent pom is configured to fail if it finds warnings:

 -Werror

However, if you remove that line from the parent pom, the compilation passes.

Of course this does not mean that everything is solved for Java 9;
there are some tests that break and other issues caused by other
plugins and dependencies (e.g. ByteBuddy), but those are not part of
this discussion.

On Tue, Sep 26, 2017 at 11:38 PM, Eugene Kirpichov
 wrote:
> AFAIK we don't use any advanced capabilities of AutoValue. Does that mean
> this issue is moot? I didn't quite understand from your email whether it is
> a compilation blocker for Beam or not.
>
> On Tue, Sep 26, 2017 at 2:32 PM Ismaël Mejía  wrote:
>
>> Great that you are also working on this too Daniel and thanks for
>> bringing this subject to the mailing list, I was waiting to  my return
>> to office next week, but you did it first :)
>>
>> Eugene for reference (This is the issue on the migration to Java 9),
>> notice that here the goal is first that beam passes mvn clean install
>> with pure Java 9 (and also add this to jenkins), not to rewrite
>> anything to use the new stuff (e.g. modules):
>> https://issues.apache.org/jira/browse/BEAM-2530
>>
>> Eugene can you also PTAL at the AutoValue issue, more details on the
>> issue, this is a warning so I don't know if it is really critical in
>> particular because we are not using Memoization (do we?).
>> https://github.com/google/auto/issues/503
>>
>> https://github.com/google/auto/commit/71514081f2ca6fb4ead2b7f0a25f5d02247b8532
>>
>> Wouldn't the easiest way be that you guys convince the google.auto
>> guys to generate that simple fix in a Java 7 compatible way and
>> 'voila' ?
>>
>> However I agree that moving to Java 8 is an excellent idea and as
>> Eugene mentions there is less friction now since most projects are
>> moving, only pending issue are existing clusters on java 7 in the
>> hadoop world, but those are less frequent now. Anyway this discussion
>> is really important (maybe even worth a vote). Because moving to Java
>> 8 will allow us also to move some of the dependencies that we are
>> keeping in old versions and in general to move forward.
>>
>> What do the others think ?
>>
>>
>>
>> On Tue, Sep 26, 2017 at 11:09 PM, Eugene Kirpichov
>>  wrote:
>> > Very excited to hear that there's work on JDK9 support - is there a
>> public
>> > description of the plans for this work somewhere?
>> >
>> > In general, Beam could probably drop Java7 support altogether at some
>> point
>> > soon: Java7 has reached end-of-life (i.e. it's not receiving even
>> security
>> > patches) 2 years ago, and all major players in the data processing
>> > ecosystem have dropped Java7 support (Spark, Flink, Hadoop), so I presume
>> > the demand for Java7 support in the data processing industry is low. By
>> the
>> > way: would a Java8 migration be in the scope of your work in general?
>> >
>> > However, until we say that Beam requires Java8, what would be the
>> > implications of using a version of AutoValue that can only be compiled
>> with
>> > Java8? Are you saying that this is simply a matter of a compiler bug, and
>> > if we use a Java8 compiler but configured to use source and target
>> versions
>> > of 1.7 and using bootclasspath of rt.jar from 1.7, then the resulting
>> Beam
>> > artifacts will be usable by people who don't have Java8?
>> >
>> > On Tue, Sep 26, 2017 at 1:53 PM Daniel Oliveira
>> >  wrote:
>> >
>> >> So I've been working on JDK 9 support for Beam, and I have a bug in
>> >> AutoValue that can be fixed by updating our AutoValue dependency to the
>> >> latest. The problem is that AutoValue from 1.5+ seems to be banned for
>> Beam
>> >> due to requiring Java 8 compilers. However, it should still be possible
>> to
>> >> compile and execute Java 7 code from the Java 8 compiler by building
>> with
>> >> the correct arguments. So the fix to this bug would essentially require
>> >> Java 8 compilers even for compiling Java 7 code.
>> >>
>> >> Does anyone need to use Java 7 compilers? Because if not I would like to
>> >> continue with this fix.
>> >>
>>


Re: Possibility of requiring Java 8 compiler for building Java 7 sources?

2017-09-26 Thread Ismaël Mejía
Great that you are working on this too, Daniel, and thanks for
bringing this subject to the mailing list. I was waiting until my return
to the office next week, but you did it first :)

Eugene, for reference, this is the issue on the migration to Java 9.
Notice that the goal here is first that Beam passes mvn clean install
with pure Java 9 (and also adding this to Jenkins), not rewriting
anything to use the new features (e.g. modules):
https://issues.apache.org/jira/browse/BEAM-2530

Eugene, can you also PTAL at the AutoValue issue? More details are on the
issue; it is a warning, so I don't know if it is really critical, in
particular because we are not using memoization (do we?).
https://github.com/google/auto/issues/503
https://github.com/google/auto/commit/71514081f2ca6fb4ead2b7f0a25f5d02247b8532

Wouldn't the easiest way be for you guys to convince the google.auto
maintainers to generate that simple fix in a Java 7-compatible way, and
'voilà'?

However, I agree that moving to Java 8 is an excellent idea, and as
Eugene mentions there is less friction now since most projects are
moving; the only pending issue is existing Java 7 clusters in the
Hadoop world, but those are less frequent now. Anyway, this discussion
is really important (maybe even worth a vote), because moving to Java 8
will also allow us to upgrade some of the dependencies that we are
keeping at old versions and, in general, to move forward.

What do the others think ?



On Tue, Sep 26, 2017 at 11:09 PM, Eugene Kirpichov
 wrote:
> Very excited to hear that there's work on JDK9 support - is there a public
> description of the plans for this work somewhere?
>
> In general, Beam could probably drop Java7 support altogether at some point
> soon: Java7 has reached end-of-life (i.e. it's not receiving even security
> patches) 2 years ago, and all major players in the data processing
> ecosystem have dropped Java7 support (Spark, Flink, Hadoop), so I presume
> the demand for Java7 support in the data processing industry is low. By the
> way: would a Java8 migration be in the scope of your work in general?
>
> However, until we say that Beam requires Java8, what would be the
> implications of using a version of AutoValue that can only be compiled with
> Java8? Are you saying that this is simply a matter of a compiler bug, and
> if we use a Java8 compiler but configured to use source and target versions
> of 1.7 and using bootclasspath of rt.jar from 1.7, then the resulting Beam
> artifacts will be usable by people who don't have Java8?
>
> On Tue, Sep 26, 2017 at 1:53 PM Daniel Oliveira
>  wrote:
>
>> So I've been working on JDK 9 support for Beam, and I have a bug in
>> AutoValue that can be fixed by updating our AutoValue dependency to the
>> latest. The problem is that AutoValue from 1.5+ seems to be banned for Beam
>> due to requiring Java 8 compilers. However, it should still be possible to
>> compile and execute Java 7 code from the Java 8 compiler by building with
>> the correct arguments. So the fix to this bug would essentially require
>> Java 8 compilers even for compiling Java 7 code.
>>
>> Does anyone need to use Java 7 compilers? Because if not I would like to
>> continue with this fix.
>>


Re: Java 9

2017-09-11 Thread Ismaël Mejía
Hello Alexey,

There is a JIRA issue covering Java 9 support:
https://issues.apache.org/jira/browse/BEAM-2530

The goal of this JIRA is to update the project to support the current
Beam features with Java 9 (we need to keep our current level of
backwards compatibility with Java >= 7). Migration to the new module
system is not part of this JIRA and probably will be something to
consider once this first step is done. This JIRA will be complete when
we can take the Beam source code and compile/test/use all the
artifacts with Java 9. Unfortunately we need changes not only in the
Beam source code but also across the dependencies we use, from Maven
plugins (enforcer, compiler, etc.) to other libraries, e.g. Google
AutoValue and ByteBuddy/ASM.

If you (or anyone else in the community) are interested in helping with
this issue, contributions are welcome; if the fix you propose works and
is backwards compatible, please feel free to open a new Pull Request.
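
Regarding the URLClassLoader point in your message below, a variant
based on the java.class.path property could look roughly like this (a
minimal sketch, not the actual code used in the runners):

import java.io.File;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

class ClasspathUrls {
  // java.class.path contains the entries passed on the command line,
  // separated by File.pathSeparator (':' on Unix, ';' on Windows).
  static List<URL> get() throws Exception {
    List<URL> urls = new ArrayList<>();
    for (String entry :
        System.getProperty("java.class.path").split(File.pathSeparator)) {
      urls.add(new File(entry).toURI().toURL());
    }
    return urls;
  }
}

As you say, it would not cover classes loaded through the module system,
but it behaves the same on Java 7, 8 and 9 for the plain classpath case.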

Regards,
Ismaël

On Sun, Sep 10, 2017 at 9:37 PM, Alexey Demin  wrote:
> Hi
>
> Do you have plan for support java 9 in nearest future?
>
> In beam's runners you have code like:
>
> if (classloader instanceof URLClassLoader) {
>   URLClassLoader urlClassLoader = (URLClassLoader) classloader;
>   for (URL url : urlClassLoader.getURLs()) {
>
> only for extract list of jars from classpath.
> but in java 9 jdk.internal.loader.ClassLoaders$AppClassLoader not extend
> URLClassLoader.
>
> why not use System.getProperty("java.class.path") if we need only list of
> jars from classpath?
>
> Yes, this code does not work correctly with the module system enabled,
> but it works without any problem with the default parameters of Java 9.
>
> I tried running the examples and they worked correctly on the Java 9 runtime.
>
> Thanks,
> Alexey


Re: Merge branch DSL_SQL to master

2017-09-07 Thread Ismaël Mejía
+1
A nice feature to have in Beam. Great work, guys!

On Thu, Sep 7, 2017 at 10:21 AM, Pei HE  wrote:
> +1
>
> On Thu, Sep 7, 2017 at 4:03 PM, tarush grover 
> wrote:
>
>> Thank you all, it was a great learning experience!
>>
>> Regards,
>> Tarush
>>
>> On Thu, 7 Sep 2017 at 1:05 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>> > +1
>> >
>> > Great work guys !
>> > Ready to help for the merge and maintain !
>> >
>> > Regards
>> > JB
>> >
>> > On 09/07/2017 08:48 AM, Mingmin Xu wrote:
>> > > Hi all,
>> > >
>> > > On behalf of the virtual Beam SQL team[1], I'd like to propose to merge
>> > > DSL_SQL branch into master (PR #3782 [2]) and include it in release
>> > version
>> > > 2.2.0, which will give it more visibility to other contributors and
>> > users.
>> > > The SQL feature satisfies the following criteria outlined in
>> contribution
>> > > guide[3].
>> > >
>> > > 1. Have at least 2 contributors interested in maintaining it, and 1
>> > > committer interested in supporting it
>> > >
>> > > * James and me will continue for new features and maintain it;
>> > >
>> > >Tyler, James and me will support it as committers;
>> > >
>> > > 2. Provide both end-user and developer-facing documentation
>> > >
>> > > * A web page[4] is added to describe the usage of SQL DSL and how it
>> > works;
>> > >
>> > >
>> > > 3. Have at least a basic level of unit test coverage
>> > >
>> > > * Totally 230 unit/integration tests, with code coverage 83.4%;
>> > >
>> > > 4. Run all existing applicable integration tests with other Beam
>> > components
>> > > and create additional tests as appropriate
>> > >
>> > > * Besides of integration tests in package
>> > org.apache.beam.sdk.extensions.sql,
>> > > there's another example in org.apache.beam.sdk.extensions.sql.example.
>> > > BeamSqlExample.
>> > >
>> > > [1]. Special thanks to all contributors/reviewers:
>> > >
>> > >   Tyler Akidau
>> > >
>> > >   Davor Bonaci
>> > >
>> > >   Robert Bradshaw
>> > >
>> > >   Lukasz Cwik
>> > >
>> > >   Tarush Grover
>> > >
>> > >   Kai Jiang
>> > >
>> > >   Kenneth Knowles
>> > >
>> > >   Jingsong Lee
>> > >
>> > >   Ismaël Mejía
>> > >
>> > >   Jean-Baptiste Onofré
>> > >
>> > >   James Xu
>> > >
>> > >   Mingmin Xu
>> > >
>> > > [2]. https://github.com/apache/beam/pull/3782
>> > >
>> > > [3]. https://beam.apache.org/contribute/contribution-guide/
>> > > #merging-into-master
>> > >
>> > > [4]. https://beam.apache.org/documentation/dsls/sql/
>> > >
>> > > Thanks!
>> > > 
>> > > Mingmin
>> > >
>> >
>> > --
>> > Jean-Baptiste Onofré
>> > jbono...@apache.org
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>> >
>>


Re: Beam 2.2.0 release

2017-08-30 Thread Ismaël Mejía
The current master has accumulated a good number of nice features
since 2.1.0, so a new release is welcome. I have two JIRAs/PRs that I
think are important to check/solve before the cut:

BEAM-2516 (a regression in the performance of the Java Direct runner).
We have never really defined whether a performance regression is
critical enough to be a blocker. I executed WordCount with the
kinglear.txt (170KB) file on version 2.1.0 vs the current
2.2.0-SNAPSHOT and found that the execution time went from 5s to 126s,
so maybe we need to review this one before the release. I can
understand if others consider this a minor issue because the Direct
runner is not supposed to be used in production, but this performance
regression can make a bad impression on a casual user starting with Beam.

BEAM-2790 (fix reading from Amazon S3 via HadoopFileSystem). I think
this one is a nice-to-have. I am not sure that I can tackle it before
the Wednesday cut; I'm OOO until the beginning of next week, but maybe
someone else can take a look. In the worst case it is not a release
blocker, but it is definitely a really nice fix to include.


On Wed, Aug 30, 2017 at 8:49 PM, Eugene Kirpichov
 wrote:
> I'd like to get the following PRs into 2.2.0:
>
> #3765  [BEAM-2753
> ] Fixes translation of
> WriteFiles side inputs (important bugfix for DynamicDestinations in files)
> #3725  [BEAM-2827
> ] Introduces
> AvroIO.watchForNewFiles (parity for AvroIO with TextIO in a few important
> features)
> #3759  [BEAM-2828
> ] Moves Match into
> FileIO.match()/matchAll() (to prevent releasing current
> Match.filepatterns() into 2.2.0 and then having to keep it under that name)
>
> On Wed, Aug 30, 2017 at 11:31 AM Mingmin Xu  wrote:
>
>> Glad to see that 2.2.0 is coming. Can we include SQL feature in next
>> release? We're in the final stage and expect to merge back to master this
>> week.
>>
>> On Wed, Aug 30, 2017 at 11:27 AM, Reuven Lax 
>> wrote:
>>
>> > Now that Beam 2.1.0 has finally completed, I think we should cut Beam
>> 2.2.0
>> > soon. I volunteer to coordinate this release.
>> >
>> > Are there any pending pull requests that people think should be merged
>> > before we cut 2.2.0? If so, please let me know soon, as I would like to
>> cut
>> > by Wednesday of next week.
>> >
>> > Thanks,
>> >
>> > Reuven
>> >
>>
>>
>>
>> --
>> 
>> Mingmin
>>


Re: Policy for stale PRs

2017-08-16 Thread Ismaël Mejía
Thanks Ahmet for bringing this subject.

+1 to closing stale PRs automatically after a fixed time of inactivity. 90
days is OK, but maybe a shorter period is better, if we consider that being
stale just means not having any activity, i.e., the author of the PR does not
answer any message. The author can buy extra time just by adding a message to
say 'wait, I am still working on this' and win a complete new period, so the
longer the staleness period is, the longer it can eventually be extended.

I agree with Thomas that the JIRAs should stay open but become unassigned:
the issue won't yet be fixed, and we want to encourage other people to work
on it.

An additional subject that makes sense to discuss here is whether we need
policies to avoid 'stale' JIRAs (JIRAs that have been taken but show no
progress), for example:

- Prevent contributors/committers from taking more than 'n' JIRAs at the same
  time (we should define this n considering the period of staleness, maybe 10?).

- Automatically free 'stale' JIRAs after a fixed time period with no active work

Remember, the objective is to encourage more people to contribute, but
people won't be encouraged to work on subjects that other people have already
taken; this is a well-known anti-pattern in volunteer communities, see
http://communitymgt.wikia.com/wiki/Cookie_Licking

On Wed, Aug 16, 2017 at 10:38 PM, Thomas Groh  wrote:
> JIRAs should only be closed if the issue that they track is no longer
> relevant (either via being fixed or being determined to not be a problem).
> If a JIRA isn't being meaningfully worked on, it should be unassigned (in
> all cases, not just if there's an associated pull request that has not been
> worked on).
>
> +1 on closing PRs with no action from the original author after some
> reasonable time frame (90 days is certainly reasonable; 30 might be too
> short) if the author has not responded to actionable feedback.
>
> On Wed, Aug 16, 2017 at 12:07 PM, Sourabh Bajaj <
> sourabhba...@google.com.invalid> wrote:
>
>> Some projects I have seen close stale PRs after 30 days, saying "Closing
>> due to lack of activity, please feel free to re-open".
>>
>> On Wed, Aug 16, 2017 at 12:05 PM Ahmet Altay 
>> wrote:
>>
>> > Sounds like we have consensus. Since this is a new policy, I would
>> suggest
>> > picking the most flexible option for now (90 days) and we can tighten it
>> in
>> > the future. To answer Kenn's question, I do not know, how other projects
>> > handle this. I did a basic search but could not find a good answer.
>> >
>> > What mechanism can we use to close PRs, assuming that author will be out
>> of
>> > communication. We can push a commit with a "This closes #xyz #abc"
>> message.
>> > Is there another way to do this?
>> >
>> > Ahmet
>> >
>> > On Wed, Aug 16, 2017 at 4:32 AM, Aviem Zur  wrote:
>> >
>> > > Makes sense to close after a long time of inactivity and no response,
>> and
>> > > as Kenn mentioned they can always re-open.
>> > >
>> > > On Wed, Aug 16, 2017 at 12:20 AM Jean-Baptiste Onofré > >
>> > > wrote:
>> > >
>> > > > If we consider the author, it makes sense.
>> > > >
>> > > > Regards
>> > > > JB
>> > > >
>> > > > On Aug 15, 2017, 01:29, at 01:29, Ted Yu 
>> wrote:
>> > > > >The proposal makes sense.
>> > > > >
>> > > > >If the author of PR doesn't respond for 90 days, the PR is likely
>> out
>> > > > >of
>> > > > >sync with current repo.
>> > > > >
>> > > > >Cheers
>> > > > >
>> > > > >On Mon, Aug 14, 2017 at 5:27 PM, Ahmet Altay
>> > > >
>> > > > >wrote:
>> > > > >
>> > > > >> Hi all,
>> > > > >>
>> > > > >> Do we have an existing policy for handling stale PRs? If not could
>> > we
>> > > > >come
>> > > > >> up with one. We are getting close to 100 open PRs. Some of the
>> open
>> > > > >PRs
>> > > > >> have not been touched for a while, and if we exclude the pings the
>> > > > >number
>> > > > >> will be higher.
>> > > > >>
>> > > > >> For example, we could close PRs that have not been updated by the
>> > > > >original
>> > > > >> author for 90 days even after multiple attempts to reach them
>> (e.g.
>> > > > >[1],
>> > > > >> [2] are such PRs.)
>> > > > >>
>> > > > >> What do you think?
>> > > > >>
>> > > > >> Thank you,
>> > > > >> Ahmet
>> > > > >>
>> > > > >> [1] https://github.com/apache/beam/pull/1464
>> > > > >> [2] https://github.com/apache/beam/pull/2949
>> > > > >>
>> > > >
>> > >
>> >
>>


Re: Hello from a newbie to the data world living in the city by the bay!

2017-08-16 Thread Ismaël Mejía
Hello and welcome Griselda, Umang, Justin

Apart from the links provided by Ahmet, you can read Beam-related
material on the website (see Documentation > Programming Guide and
Documentation > Additional Resources, among others).

But probably as important as improving your Beam-related knowledge is
understanding the principles of an open source project and, more
concretely, the way Apache projects work (in case this is your
first Apache project): concepts like how projects are structured
(PMCs, committers, votes, etc.) and, most importantly, Community
over Code and meritocracy.

https://www.apache.org/foundation/how-it-works.html
https://blogs.apache.org/foundation/entry/asf_15_community_over_code

Welcome all, and don't hesitate to ask questions; we are all here to
make this project better, so we can surely help.
Ismaël


On Tue, Aug 15, 2017 at 11:04 PM, Justin T  wrote:
> Hello Beam community,
>
> I am also a new member, and I feel a little better knowing that there are
> others in the same boat :)
>
> My name is Justin and I work as a full stack engineer for Neustar, a
> marketing analytics company in San Diego. Over the past few weeks I have
> been getting more familiar with Beam via documentation, papers, videos, and
> the old email archives and I am very excited to start making contributions.
> Thank you Altay for the useful links!
>
> -Justin Tumale
>
> On Tue, Aug 15, 2017 at 11:19 AM, Ahmet Altay 
> wrote:
>
>> Welcome both of you!
>>
>> Some helpful starting points:
>> - Contribution guide [1]
>> - Unassigned starter issues in JIRA [2]
>>
>> Ahmet
>>
>> [1] https://beam.apache.org/contribute/contribution-guide/
>> [2]
>> https://issues.apache.org/jira/browse/BEAM-2632?jql=
>> project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20Reopened)%20AND%
>> 20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20starter%20AND%
>> 20assignee%20in%20(EMPTY)%20ORDER%20BY%20created%20DESC%
>> 2C%20priority%20DESC
>>
>> On Tue, Aug 15, 2017 at 11:13 AM, Umang Sharma 
>> wrote:
>>
>> > Hi Gris,
>> > Nice to meet you.
>> >
>> > I'd like to take this opportunity to introduce me to you and everyone
>> else
>> > in  the dev team.
>> >
>> > I’m m Umang Sharma. I'm an associate in Data Science and Applications at
>> > Accenture Digital.
>> >
>> >
>> > I write in python, Java and a number of other languages.
>> > I'd love to contribute to Beam. It'd be great if someone guides me to get
>> > started with contributing :)
>> >
>> > Among the other things I like are polo golf, giving talks and talking
>> > about my work.
>> >
>> > Thanks,
>> > Umang
>> >
>> >
>> > On Aug 15, 2017 22:40, "Griselda Cuevas" 
>> wrote:
>> >
>> > Hi Beam community,
>> >
>> > I’m Griselda (Gris) Cuevas and I’m very excited to join the community,
>> I’m
>> > looking forward to learning awesome things from you and to getting the
>> > chance to collaborate on great initiatives.
>> >
>> > I’m currently working at Google and I’m studying a masters in operations
>> > research and data science at UC Berkeley. I’m interested in Natural
>> > Language Processing, Information Retrieval and Online Communities. Some
>> > other fun topics I love are juggling, camping and -just getting into it-
>> >  listening to podcasts, so if you ever want to discuss and talk about any
>> > of these topics, here I am!
>> >
>> > Another reason why I’m here is because I want to help this project grow
>> and
>> > thrive. This means that you’ll see me contributing to the project,
>> reaching
>> > out to ask questions as I get familiar with our community, and I also
>> > helping evangelize Apache Beam by organizing meetups, hangouts, etc.
>> >
>> > I say bye for now, I’ll see you around,
>> >
>> > Cheers,
>> >
>> > G
>> >
>>


Re: [VOTE] Release 2.1.0, release candidate #3

2017-08-14 Thread Ismaël Mejía
+1 (non-binding)

- Validated signatures OK
- mvn clean verify -Prelease on both OpenJDK 1.7 and Oracle JDK 8 with
the docker development images (WIP), both OK
- Run WordCount on local Flink and Spark runners OK

Everything looks nice; only one minor thing (not blocking at all): the
proto-generated files for Python are not cleaned correctly, which
causes the validation to complain because the Maven RAT plugin does
not find the Apache headers on the files (this happens if you execute
mvn clean verify -Prelease immediately after the validation).

On Sun, Aug 13, 2017 at 6:52 AM, Jean-Baptiste Onofré  wrote:
> +1 (binding)
>
> I do my own tests and casting my own vote ;)
>
> Regards
> JB
>
> On 08/09/2017 07:08 AM, Jean-Baptiste Onofré wrote:
>>
>> Hi everyone,
>>
>> Please review and vote on the release candidate #3 for the version 2.1.0,
>> as follows:
>>
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>>
>> The complete staging area is available for your review, which includes:
>> * JIRA release notes [1],
>> * the official Apache source release to be deployed to dist.apache.org
>> [2], which is signed with the key with fingerprint C8282E76 [3],
>> * all artifacts to be deployed to the Maven Central Repository [4],
>> * source code tag "v2.1.0-RC3" [5],
>> * website pull request listing the release and publishing the API
>> reference manual [6].
>> * Python artifacts are deployed along with the source release to the
>> dist.apache.org [2].
>>
>> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>>
>> Thanks,
>> JB
>>
>> [1]
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12340528
>> [2] https://dist.apache.org/repos/dist/dev/beam/2.1.0/
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1020/
>> [5] https://github.com/apache/beam/tree/v2.1.0-RC3
>> [6] https://github.com/apache/beam-site/pull/270
>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: Proposal : An extension for sketch-based statistics

2017-08-14 Thread Ismaël Mejía
Kenneth’s idea of using sketches for state with the State API is
really interesting; it opens up some appealing use cases that I hadn’t
really thought about. Note that the origin of this work was in the
line of statistics: in particular, we were interested in data
sketches (especially the cardinality ones) as a ‘lightweight’ way
to have approximate metrics.
There are two pending subjects to discuss:

1. Having sketches as approximate metrics seems interesting; however,
the current Beam Metrics API does not allow user-defined metrics. I
don’t really know the details of the current metrics implementation.
Would it eventually be possible to support this, i.e., to extend
metrics to reuse something like the sketches extension?

2. There is also another contribution that Arnaud did, in case there is
interest: a transform for standard deviation. We decided not to
include it as part of the sketches extension since it was not
consistent with the approximate nature of the extension, but I think
it could be another interesting contribution as a subsequent PR (if
there is interest in this too).
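
For reference, a textbook version of such a combiner could look roughly
like this (a sketch of the idea, not Arnaud's actual code): the
accumulator keeps the count, sum and sum of squares, and the population
standard deviation is sqrt(sumSq/n - (sum/n)^2).

import java.io.Serializable;
import org.apache.beam.sdk.transforms.Combine;

class StdDevFn extends Combine.CombineFn<Double, StdDevFn.Acc, Double> {
  static class Acc implements Serializable {
    long n;
    double sum;
    double sumSq;
  }

  @Override
  public Acc createAccumulator() {
    return new Acc();
  }

  @Override
  public Acc addInput(Acc acc, Double x) {
    acc.n++;
    acc.sum += x;
    acc.sumSq += x * x;
    return acc;
  }

  @Override
  public Acc mergeAccumulators(Iterable<Acc> accs) {
    Acc out = new Acc();
    for (Acc a : accs) {
      out.n += a.n;
      out.sum += a.sum;
      out.sumSq += a.sumSq;
    }
    return out;
  }

  @Override
  public Double extractOutput(Acc acc) {
    if (acc.n == 0) {
      return 0.0;
    }
    double mean = acc.sum / acc.n;
    return Math.sqrt(acc.sumSq / acc.n - mean * mean); // population std dev
  }
}

It would then be used as Combine.globally(new StdDevFn()) or
Combine.perKey(new StdDevFn()).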

Regards,
Ismaël

On Sat, Aug 12, 2017 at 11:20 AM, Arnaud Fournier
 wrote:
> Hello Kenneth, thank you for your answer.
>
> I read your blog post about stateful processing and that is indeed a great
> feature !
>
> So if I understood correctly we could use the combineFns to declare
> combiningStates so it can be used while processing elements in a DoFn. That
> opens up a lot more use cases for the sketches !
>
> Actually this was already possible for 2 sketches but now I refined the
> constructors of the 2 other sketches, and will do so for the other ones to
> come.
>
>
> Regards,
>
> Arnaud
>
> 2017-08-08 2:07 GMT+02:00 Kenneth Knowles :
>
>> This is a great development! I have wanted Beam to have a library of
>> sketches.
>>
>> What Eugene is referring to is the fact that you can write
>> Combine.perKey(combineFn) to use these in a transform but also
>> StateSpecs.combiningState(combineFn) to use them in a stateful ParDo. So
>> it
>> is good to make the CombineFn public and refine their constructors to be
>> user-friendly.
>>
>> Kenn
>>
>> On Fri, Aug 4, 2017 at 7:45 AM, Arnaud Fournier <
>> arnaudfournier...@gmail.com
>> > wrote:
>>
>> > Thanks for your comments, that is very encouraging !
>> >
>> > I have created a Jira : https://issues.apache.org/jira/browse/BEAM-2728
>> > and a PR : https://github.com/apache/beam/pull/3686
>> >
>> > Eugene and Lucas I saw that you already have some ideas so I put you as
>> > reviewers,
>> > I look forward to hear more from you.
>> >
>> > With Ismael and JB, we already thought about using some of these
>> indicators
>> > as metric cells,
>> > as it can be useful for some kinds of monitoring.
>> > But I have never heard about state cells, is it something like the
>> > QuantileState in ApproximateQuantiles ?
>> >
>> >
>> >
>> > 2017-08-04 3:14 GMT+02:00 Anand Iyer :
>> >
>> > > This is awesome!! Very exciting to see the addition of statistical and
>> > > data-mining algorithms to Apache Beam.
>> > >
>> > > On Thu, Aug 3, 2017 at 2:32 PM, Eugene Kirpichov <
>> > > kirpic...@google.com.invalid> wrote:
>> > >
>> > > > +1, Very exciting! I have some suggestions on the exact API to expose
>> > > (e.g.
>> > > > I think it makes sense to expose the CombineFn's directly, so that
>> they
>> > > can
>> > > > also be used for combining state cells and not just as PTransforms),
>> > but
>> > > > that can be handled during regular code review.
>> > > >
>> > > > On Thu, Aug 3, 2017 at 2:23 PM Sourabh Bajaj
>> > > >  wrote:
>> > > >
>> > > > > +1 to this.
>> > > > >
>> > > > > On Thu, Aug 3, 2017 at 6:28 AM Lukasz Cwik
>> > > >
>> > > > > wrote:
>> > > > >
>> > > > > > I'm most interested in the frequency / cardinality tools as it
>> > could
>> > > be
>> > > > > > used to help improve performance automatically for combiners by
>> > > > detecting
>> > > > > > the few keys case or automatically handle hot keys without
>> needing
>> > > > users
>> > > > > to
>> > > > > > specify the hints when they use a combiner.
>> > > > > >
>> > > > > > On Thu, Aug 3, 2017 at 5:35 AM, Jean-Baptiste Onofré <
>> > > j...@nanthrax.net>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Nice work Arnaud ;)
>> > > > > > >
>> > > > > > > Happy to have been able to help.
>> > > > > > >
>> > > > > > > Let's see what the others will think about this.
>> > > > > > >
>> > > > > > > Regards
>> > > > > > > JB
>> > > > > > >
>> > > > > > >
>> > > > > > > On 08/03/2017 02:32 PM, Arnaud Fournier wrote:
>> > > > > > >
>> > > > > > >> Hello everyone,
>> > > > > > >>
>> > > > > > >> My name is Arnaud Fournier and I am a CS student. I am
>> currently
>> > > > doing
>> > > > > > an
>> > > > > > >> internship at Talend.
>> > > > > > >>
>> > > > > > >> With the support of Jean-Baptiste Onofre and Ismaël Mejia, I
>> > have
>> > > > been
>> > > > > > >> working on statistical

Re: [ANNOUNCEMENT] New PMC members, August 2017 edition!

2017-08-11 Thread Ismaël Mejía
Congratulations Ahmet and Aviem, keep up the great work!

On Fri, Aug 11, 2017 at 8:30 PM, Thomas Groh  wrote:
> Congratulations to both of you! Looking forwards to both of your continued
> contributions.
>
> On Fri, Aug 11, 2017 at 10:40 AM, Davor Bonaci  wrote:
>
>> Please join me and the rest of Beam PMC in welcoming the following
>> committers as our newest PMC members. They have significantly contributed
>> to the project in different ways, and we look forward to many more
>> contributions in the future.
>>
>> * Ahmet Altay
>> Beyond significant work to drive the Python SDK to the master branch, Ahmet
>> has worked project-wide, driving releases, improving processes and testing,
>> and growing the community.
>>
>> * Aviem Zur
>> Beyond significant work in the Spark runner, Aviem has worked to improve
>> how the project operates, leading discussions on inclusiveness and
>> openness.
>>
>> Congratulations to both! Welcome!
>>
>> Davor
>>


Re: [ANNOUNCEMENT] New committers, August 2017 edition!

2017-08-11 Thread Ismaël Mejía
Congrats everyone, well deserved. Excellent work, guys!

On Fri, Aug 11, 2017 at 7:53 PM, Jesse Anderson
 wrote:
> Welcome!
>
> On Fri, Aug 11, 2017, 10:48 AM Jason Kuster 
> wrote:
>
>> Congrats to all, many thanks for the great contributions.
>>
>> On Fri, Aug 11, 2017 at 10:46 AM, Ahmet Altay 
>> wrote:
>>
>> > Congratulations to all of you. Well deserved and thank you for your
>> > contributions.
>> >
>> > On Fri, Aug 11, 2017 at 10:43 AM, tarush grover > >
>> > wrote:
>> >
>> > > Congratulations!!
>> > >
>> > > Regards,
>> > > Tarush
>> > >
>> > > On Fri, 11 Aug 2017 at 11:11 PM, Davor Bonaci 
>> wrote:
>> > >
>> > > > Please join me and the rest of Beam PMC in welcoming the following
>> > > > contributors as our newest committers. They have significantly
>> > > contributed
>> > > > to the project in different ways, and we look forward to many more
>> > > > contributions in the future.
>> > > >
>> > > > * Reuven Lax
>> > > > Reuven has been with the project since the very beginning,
>> contributing
>> > > > mostly to the core SDK and the GCP IO connectors. He accumulated 52
>> > > commits
>> > > > (19,824 ++ / 12,039 --). Most recently, Reuven re-wrote several IO
>> > > > connectors that significantly expanded their functionality.
>> > Additionally,
>> > > > Reuven authored important new design documents relating to update and
>> > > > snapshot functionality.
>> > > >
>> > > > * Jingsong Lee
>> > > > Jingsong has been contributing to Apache Beam since the beginning of
>> > the
>> > > > year, particularly to the Flink runner. He has accumulated 34 commits
>> > > > (11,214 ++ / 6,314 --) of deep, fundamental changes that
>> significantly
>> > > > improved the quality of the runner. Additionally, Jingsong has
>> > > contributed
>> > > > to the project in other ways too -- reviewing contributions, and
>> > > > participating in discussions on the mailing list, design documents,
>> and
>> > > > JIRA issue tracker.
>> > > >
>> > > > * Mingmin Xu
>> > > > Mingmin started the SQL DSL effort, and has driven it to the point of
>> > > > merging to the master branch. In this effort, he extended the project
>> > to
>> > > > the significant new user community.
>> > > >
>> > > > * Mingming (James) Xu
>> > > > James joined the SQL DSL effort, contributing some of the trickier
>> > parts,
>> > > > such as the Join functionality. Additionally, he's consistently shown
>> > > > himself to be an insightful code reviewer, significantly impacting
>> the
>> > > > project’s code quality and ensuring the success of the new major
>> > > component.
>> > > >
>> > > > * Manu Zhang
>> > > > Manu initiated and developed a runner for the Apache Gearpump
>> > > (incubating)
>> > > > engine, and has driven it to the point of merging to the master
>> branch.
>> > > In
>> > > > this effort, he accumulated 65 commits (7,812 ++ / 4,882 --) and
>> > extended
>> > > > the project to the new user community.
>> > > >
>> > > > Congratulations to all five! Welcome!
>> > > >
>> > > > Davor
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>> ---
>> Jason Kuster
>> Apache Beam / Google Cloud Dataflow
>>
> --
> Thanks,
>
> Jesse


Re: [CANCEL][VOTE] Release 2.1.0, release candidate #2

2017-07-24 Thread Ismaël Mejía
Not a blocker, but maybe it is worth considering the fix for
https://issues.apache.org/jira/browse/BEAM-2587 too.

I was also bitten by this issue and could only get it to work by
doing a 'pip install --user grpcio-tools' (not sure if this is a
proper solution, but it works for me); however, when I validated the
Python-only source code it worked out of the box without issue.

On Mon, Jul 24, 2017 at 2:37 PM, Jean-Baptiste Onofré  wrote:
> Awesome !
>
> Thanks Aljoscha
>
> Regards
> JB
>
>
> On 07/24/2017 02:32 PM, Aljoscha Krettek wrote:
>>
>> I opened a PR against the release-2.1.0 branch:
>> https://github.com/apache/beam/pull/3625
>> 
>>
>> This should not fail any tests since it was recently reviewed and merged
>> for the master.
>>
>> Best,
>> Aljoscha
>>
>>> On 24. Jul 2017, at 14:09, Jean-Baptiste Onofré  wrote:
>>>
>>> +1
>>>
>>> Definitely good to have it for RC3.
>>>
>>> Regards
>>> JB
>>>
>>> On 07/24/2017 02:05 PM, Aljoscha Krettek wrote:

 When we're cutting a new RC anyways we could also include the fixes for
 https://issues.apache.org/jira/browse/BEAM-2571
 . It's an actual bug in 
 the
 Flink Runner and the fix for that is a set of three fixes that should be
 easy to cherry-pick on top of the release branch.
 If we agree I could open a PR for that.
 Best,
 Aljoscha
>
> On 24. Jul 2017, at 13:47, Aviem Zur  wrote:
>
> We also have two tests failing in Spark runner as detailed by the
> following
> two tickets:
> https://issues.apache.org/jira/browse/BEAM-2670
> https://issues.apache.org/jira/browse/BEAM-2671
>
> On Mon, Jul 24, 2017 at 11:44 AM Jean-Baptiste Onofré 
> wrote:
>
>> Hi all,
>>
>> due to https://issues.apache.org/jira/browse/BEAM-2662, I cancel this
>> vote.
>>
>> We also have a build issue with the Spark runner that I would like to
>> fix
>> for RC3:
>>
>>
>>
>> https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_ValidatesRunner_Spark/2446/
>>
>> So, we are going to work on the Spark runner test fix for RC3
>> (BEAM-2662 is
>> already fixed on release-2.1.0 branch).
>>
>> I will submit RC3 to vote as soon as Spark runner tests are fully OK.
>>
>> Regards
>> JB
>>
>> On 07/18/2017 06:30 PM, Jean-Baptiste Onofré wrote:
>>>
>>> Hi everyone,
>>>
>>> Please review and vote on the release candidate #2 for the version
>>
>> 2.1.0, as
>>>
>>> follows:
>>>
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>>
>>> The complete staging area is available for your review, which
>>> includes:
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to
>>> dist.apache.org
>>
>> [2],
>>>
>>> which is signed with the key with fingerprint C8282E76 [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v2.1.0-RC2" [5],
>>> * website pull request listing the release and publishing the API
>>
>> reference
>>>
>>> manual [6].
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by
>>> majority
>>
>> approval,
>>>
>>> with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>> JB
>>>
>>> [1]
>>>
>>
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12340528
>>>
>>>
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.1.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4]
>>
>> https://repository.apache.org/content/repositories/orgapachebeam-1019/
>>>
>>> [5] https://github.com/apache/beam/tree/v2.1.0-RC2
>>> [6] https://github.com/apache/beam-site/pull/270
>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>
>>
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: [PROPOSAL] Connectors for memcache and Couchbase

2017-07-11 Thread Ismaël Mejía
Hello again,

Thanks Lukasz for the details. We will take a look and discuss with
the others how to achieve this. We hadn’t considered the case of a
full-scan Read (as Eugene mentions), so now your comments about the
snapshot make more sense; however, I am still wondering if the snapshot
is worth the effort, since we are not doing something like this for
other data stores, but it is a really interesting idea.
I also didn’t know Memcached had watches; this reminds me of Redis’
pub/sub mechanism and could make sense for a possible unbounded source,
as you mention. That is another idea to explore, and I think JB is
doing something like this for RedisIO.

@Eugene, thanks also for your comments. You are right that this is more
of a lookup, but I am not sure that renaming it ‘lookup’ will make things
easier for the end users, considering that other IOs use the read()
convention and they can indeed do lookups as well as full scans of the
data. I partially agree with you about the usefulness of lookup, but a
simple example that comes to my mind is doing a lookup in Memcached to
use the result as a side input of a pipeline. Finally, I agree that
supporting other commands is important; we just have to be sure to get
the correct abstraction for this. I suppose we should restrict it to
idempotent operations (so not incr/decr) and possibly make users
pass the expiry time as an absolute date so it does not get ‘overwritten’
if a worker fails and the operation is re-executed. On this
point, it is probably a good idea to have common API semantics
across the different in-memory stores (Redis, Memcached, JCache,
etc.).
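
To make the side input example concrete, with the proposed transform it
could look roughly like this (just a sketch of the intended shape;
MemcachedIO.lookup(), config and the Event type are illustrative
placeholders, not an existing API):

PCollection<String> keys = ...;
PCollection<KV<String, String>> entries =
    keys.apply(MemcachedIO.lookup().withConfig(config));

// Expose the looked-up entries as a map-shaped side input.
PCollectionView<Map<String, String>> cacheView = entries.apply(View.asMap());

PCollection<Event> enriched = events.apply(
    ParDo.of(new DoFn<Event, Event>() {
      @ProcessElement
      public void process(ProcessContext c) {
        String cached = c.sideInput(cacheView).get(c.element().getKey());
        c.output(c.element().withCachedValue(cached));
      }
    }).withSideInputs(cacheView));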

Any other ideas/comments?

I think it is important that we get a first working version now, and
then we can refine it incrementally with the different ideas.

Ismaël

On Tue, Jul 11, 2017 at 3:20 AM, Eugene Kirpichov
 wrote:
> I think Madhusudan's proposal does not involve reading the whole contents
> of the memcached cluster - it's applied to a PCollection of keys.
> So I'd suggest to call it MemcachedIO.lookup() rather than
> MemcachedIO.read(). And it will not involve the questions of splitting -
> however, it *will* involve snapshot consistency (looking up the same key at
> different times may yield different results, including a null result).
>
> Concur with others - please take a look at
> https://beam.apache.org/documentation/io/authoring-overview/ and
> https://beam.apache.org/contribute/ptransform-style-guide/ , as well as at
> the code of other IO transforms. The proposed API contradicts several best
> practices described in these documents, but is easily fixable.
>
> I recommend to also consider how you plan to extend this to support other
> commands - and which commands do you expect to ever support.
> Also, I'm unsure about the usefulness of MemcachedIO.lookup(). What's an
> example real-world use case for such a bulk lookup operation, where you
> transform a PCollection of keys into a PCollection of key/value pairs? I
> suppose such a use case exists, but I'd like to know more about it, to see
> whether this is the best API for it.
>
> On Mon, Jul 10, 2017 at 9:18 AM Lukasz Cwik 
> wrote:
>
>> Splitting on slabs should allow you to split more finely grained then per
>> server since each server itself maintains this information. If you take a
>> look at the memcached protocol, you can see that lru_crawler supports a
>> metadump command which will enumerate all the key for a set of given slabs
>> or for all the slabs.
>>
>> For the consistency part, you can get a snapshot like effect (snapshot like
>> since its per server and not across the server farm) by combining
>> the "watch mutations evictions" command on one connection with the
>> "lru_crawler metadump all" on another connection to the same memcached
>> server. By first connecting using a watcher and then performing a dump you
>> can create two logical streams of data that can be joined to get a snapshot
>> per server. If the amount of data/mutations/evictions is small, you can
>> perform all of this within a DoFn otherwise you can just treat each as two
>> different outputs which you join and perform the same logical operation to
>> rebuild the snapshot on a per key basis.
>>
>> Interestingly, the "watch mutations" command would allow one to build a
>> streaming memcache IO which shows all changes occurring underneath.
>>
>> memcached protocol:
>> https://github.com/memcached/memcached/blob/master/doc/protocol.txt
>>
>> On Mon, Jul 10, 2017 at 2:41 AM, Ismaël Mejía  wrote:
>>
>> > Hello,
>> >
>> > Thanks Lukasz for bring some of this subjects. I have briefly
>> > discussed with the guys working on this they 

Re: MergeBot is here!

2017-07-11 Thread Ismaël Mejía
Thanks a lot Jason,

Great that Infra solved (2) so fast.

About (3), maybe the extra pause/validation is not needed, because the
bot will in principle do its work correctly. Maybe what we could
have instead is a way to see, as part of the review process, the git
branch with the commits that mergebot is going to push, similar to the
view we have for the staging website: an extra validation, but within
the same step.

About the squashing parameter, I think it is a good idea, but I am
afraid that it could end up becoming more complex/interactive, because
squashing can require rewriting the commit message, so I am not sure
whether this is achievable (or worth the effort).

WDYT?

On Tue, Jul 11, 2017 at 10:37 AM, Aljoscha Krettek  wrote:
> +1
>
> This is excellent!
>
>> On 10. Jul 2017, at 21:42, Jason Kuster  
>> wrote:
>>
>> (quick update re #2 above): ~4 minutes after I reopened the ticket, it's
>> fixed.
>> https://github.com/apache/infrastructure-puppet/commit/709944291da5e8aea711cb8578f0594deb45e222
>> updates the website to the correct address. Infra is once again the best.
>>
>> On Mon, Jul 10, 2017 at 12:38 PM, Jason Kuster 
>> wrote:
>>
>>> Glad to hear everyone's pretty happy about it! Have a couple answers for
>>> your questions.
>>>
>>> Ted: I believe the MFA stuff (two-factor auth on github) is necessary for
>>> getting the additional features on GitHub (reviewer, etc), but may not be
>>> necessary for MergeBot. I'll check in with Infra and get back to you.
>>>
>>> Ismaël: Great questions! Answered below.
>>>
>>> 1. The code will likely be transitioned over to an Infra-controlled
>>> repository, but for now is under my account: https://github.com/
>>> jasonkuster/merge-bot. It's written in Python, so Python aficionados
>>> especially feel free to take a look, kick the tires, and open PRs.
>>>
>>> 2. Glad to hear mergebot worked for you. :) The website not showing
>>> appears to be an issue with transitioning to GitBox; it seems a reference
>>> may have not been updated. Thanks for the report! I've reopened
>>> https://issues.apache.org/jira/browse/INFRA-14405 to track.
>>>
>>> 3. I'd love to chat about this more! It's totally possible to have
>>> mergebot pause and show the status of the repository before it does the
>>> final push, but given that mergebot is merging PRs serially I don't want to
>>> have someone forget to click "ok" and block other people's PRs. One other
>>> option would be to allow the person requesting the merge to say something
>>> like "@asfgit merge squash" or "@asfgit merge nosquash", parametrizing the
>>> merge request. Thoughts?
>>>
>>> On Mon, Jul 10, 2017 at 10:52 AM, Mark Liu 
>>> wrote:
>>>
>>>> +1 Awesome work!
>>>>
>>>> Thank you Jason!!!
>>>>
>>>> Mark
>>>>
>>>> On Mon, Jul 10, 2017 at 10:05 AM, Robert Bradshaw <
>>>> rober...@google.com.invalid> wrote:
>>>>
>>>>> +1, this is great! I'll second Ismaël's list requests, especially 1 and
>>>> 3.
>>>>>
>>>>> On Mon, Jul 10, 2017 at 2:09 AM, Ismaël Mejía 
>>>> wrote:
>>>>>> Excellent!, Automation of such repetitive (and error-prone) tasks is
>>>>>> strongly welcomed.
>>>>>>
>>>>>> Thanks for making this happen Jason!
>>>>>>
>>>>>> Some comments:
>>>>>>
>>>>>> 1. I suppose the code of mergebot is now part of Apache Infra, no? Do
>>>>>> you know exactly where the code is hosted? And what is the procedure
>>>>>> in case somebody wants to improve it or change something in the
>>>>>> future? I suppose other projects can/would benefit of this.
>>>>>>
>>>>>> 2. I configured and used the mergebot with success, however the
>>>>>> website does not reflect the changes of the PR I 'merged', I suppose
>>>>>> there are still some things we have to fix, because the changes are
>>>>>> not there.
>>>>>> (The PR I am talking about is https://github.com/apache/
>>>>> beam-site/pull/264)
>>>>>>
>>>>>> 3. Other thing I noticed is that the mergebot didn’t squash the
>>>>>> commits (this probably makes sense) and I didn’t realize this to do it
>>>>>> before b

Re: [PROPOSAL] Connectors for memcache and Couchbase

2017-07-10 Thread Ismaël Mejía
Hello,

Thanks Lukasz for bringing up some of these subjects. I have briefly
discussed this with the guys working on it; they are the same team who
did HCatalogIO (Hive).

We analyzed the different libraries that allow developing this
integration from Java and decided that the most complete
implementation was spymemcached. One thing I really didn’t like about
their API is that there is no abstraction for a Mutation (like in
Bigtable/HBase), just a dedicated method per operation, so to make
things easier we agreed to focus first on read/write.

@Lukasz for the enumeration part, I am not sure I follow, we had just
discussed a naive approach for splitting by server given that
Memcached is not a cluster but a server farm ‘which means every server
is its own’ we thought this will be the easiest way to partition, is
there any technical issue that impeaches this (creating a
BoundedSource and just read per each server)? Or partitioning by slabs
will bring us a better optimization? (Notice I am far from an expert
on Memcached).

For the consistency part I assumed it would be inconsistent when
reading, because I didn’t know how to do the snapshot, but if you can
give us more details on how to do this, and why it is worth the effort
(vs the cost of the snapshot), this would be something interesting to
integrate.

Thanks,
Ismaël


On Sun, Jul 9, 2017 at 7:39 PM, Lukasz Cwik  wrote:
> For the source:
> Do you plan to support enumerating all the keys via cachedump / lru_crawler
> metadump / ...?
> If there is an option which doesn't require enumerating the keys, how will
> splitting be done (no splitting / splitting on slab ids / ...)?
> Can the cache be read while it's still being modified (will effectively a
> snapshot be made using a watcher or is it expected that the cache will be
> read only or inconsistent when reading)?
>
> Also, as a usability point, all PTransforms are meant to be applied to
> PCollections and not vice versa.
> e.g.
> PCollection<String> keys = ...;
> keys.apply(MemCacheIO.withConfig());
>
> This makes it so that people can write:
> PCollection<...> output =
> input.apply(ptransform1).apply(ptransform2).apply(...);
> It also makes it so that a PTransform can be applied to multiple
> PCollections.
>
> If you haven't already, I would also suggest that you take a look at the
> Pipeline I/O guide: https://beam.apache.org/documentation/io/io-toc/
> Talks about various usability points and how to write a good I/O connector.
>
>
> On Sat, Jul 8, 2017 at 9:31 PM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi,
>>
>> Great job !
>>
>> I'm looking forward for the PRs review.
>>
>> Regards
>> JB
>>
>>
>> On 07/08/2017 09:50 AM, Madhusudan Borkar wrote:
>>
>>> Hi,
>>> We are proposing to build connectors for memcache first and then use it
>>> for
>>> Couchbase. The connector for memcache will be build as a IOTransform and
>>> then it can be used for other memcache implementations including
>>> Couchbase.
>>>
>>> 1. As Source
>>>
>>> input will be a key(String / byte[]), output will be a KV
>>>
>>> where key - String / byte[]
>>>
>>> value - String / byte[]
>>>
>>> Spymemcached supports a multi-get operation where it takes a bunch of
>>> keys and retrieves the associated values, the input PCollection can
>>> be
>>> bundled into multiple batches and each batch can be submitted via the
>>> multi-get operation.
>>>
>>> PCollection<KV<String, String>> values =
>>>
>>> MemCacheIO
>>>
>>> .withConfig()
>>>
>>> .read()
>>>
>>> .withKey(PCollection<String>);
>>>
>>>
>>> 2. As Sink
>>>
>>> input will be a KV, output will be none or probably a
>>> boolean indicating the outcome of the operation
>>>
>>>
>>>
>>>
>>>
>>> //write
>>>
>>> MemCacheIO
>>>
>>> .withConfig()
>>>
>>> .write()
>>>
>>> .withEntries(PCollection<KV<String, String>>);
>>>
>>>
>>> Implementation plan
>>>
>>> 1. Develop Memcache connector with 'set' and 'add' operation
>>>
>>> 2. Then develop other operations
>>>
>>> 3. Use Memcache connector for Couchbase
>>>
>>>
>>> Thanks @Ismael for help
>>>
>>> Please, let us know your views.
>>>
>>> Madhu Borkar
>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>


Re: MergeBot is here!

2017-07-10 Thread Ismaël Mejía
Excellent!, Automation of such repetitive (and error-prone) tasks is
strongly welcomed.

Thanks for making this happen Jason!

Some comments:

1. I suppose the code of mergebot is now part of Apache Infra, no? Do
you know exactly where the code is hosted? And what is the procedure
in case somebody wants to improve it or change something in the
future? I suppose other projects can/would benefit of this.

2. I configured and used the mergebot with success, however the
website does not reflect the changes of the PR I 'merged', I suppose
there are still some things we have to fix, because the changes are
not there.
(The PR I am talking about is https://github.com/apache/beam-site/pull/264)

3. Other thing I noticed is that the mergebot didn’t squash the
commits (this probably makes sense) and I didn’t realize this to do it
before because there is not a preview of the state of the actions that
the mergebot is going to do, can this eventually be improved? (I don’t
know if this makes sense because this will add an extra validation
step and we must trust robots anyway :P).

This new issue is something that reviewers/committers must remember,
and talking about this we need to update this in the contribution
guide to include the configuration/use of the mergebot instructions.

Thanks again Jason and the other who made this possible, this is great!
Ismaël

ps. I’m eager to see this included too for the beam project.

On Sat, Jul 8, 2017 at 7:28 AM, tarush grover  wrote:
> This is really good!!
>
> Regards,
> Tarush
>
> On Sat, 8 Jul 2017 at 10:20 AM, Jean-Baptiste Onofré 
> wrote:
>
>> That's awesome !
>>
>> Thanks Jason !
>>
>> Regards
>> JB
>>
>> On 07/07/2017 10:21 PM, Jason Kuster wrote:
>> > Hi Beam Community,
>> >
>> > Early on in the project, we had a number of discussions about creating an
>> > automated tool for merging pull requests. I’m happy to announce that
>> we’ve
>> > developed such a tool and it is ready for experimental usage in Beam!
>> >
>> > The tool, MergeBot, works in conjunction with ASF’s existing GitBox tool,
>> > providing numerous benefits:
>> > * Automating the merge process -- instead of many manual steps with
>> > multiple Git remotes, merging is as simple as commenting a specific
>> command
>> > in GitHub.
>> > * Automatic verification of each pull request against the latest master
>> > code before merge.
>> > * Merge queue enforces an ordering of pull requests, which ensures that
>> > pull requests that have bad interactions don’t get merged at the same
>> time.
>> > * GitBox-enabled features such as reviewers, assignees, and labels.
>> > * Enabling enhanced use of tools like reviewable.io.
>> >
>> > If you are a committer, the first step is to link your Apache and GitHub
>> > accounts at http://gitbox.apache.org/setup. Once the accounts are
>> linked,
>> > you should have immediate access to new GitHub features like labels,
>> > assignees, etc., as well as the ability to merge pull requests by simply
>> > commenting “@asfgit merge” on the pull request. MergeBot will communicate
>> > its status back to you via the same mechanism used already by Jenkins.
>> >
>> > This functionality is currently enabled for the “beam-site” repository
>> only.
>> > In this phase, we’d like to gather feedback and improve the user
>> experience
>> > -- so please comment back early and often. Once we are happy with the
>> > experience, we’ll deploy it on the main Beam repository, and recommend it
>> > for wider adoption.
>> >
>> > I’d like to give a huge thank you to the Apache Infrastructure team,
>> > especially Daniel Pono Takamori, Daniel Gruno, and Chris Thistlethwaite
>> who
>> > were instrumental in bringing this project to fruition. Additionally,
>> this
>> > could not have happened without the extensive work Davor put in to keep
>> > things moving along. Thank you Davor.
>> >
>> > Looking forward to hearing your comments and feedback. Thanks.
>> >
>> > Jason
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>


Re: [DISCUSS] Bridge beam metrics to underlying runners to support metrics reporters?

2017-06-23 Thread Ismaël Mejía
That seems like a great idea (improving the current metrics design), I
suppose there is a tradeoff between complexity and simplicity, and
when I read the design document I think that some design decisions
were made for the sake of simplicity, however as dropwizard is the
'de-facto' standard for metrics (at least in the java world), then it
makes sense to align more with it, and that also reminds me that Aviem
also wanted to add dropwizard's EWMA to the metrics API, so there is
still some work to do.
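
Just to make the comparison concrete, a minimal illustrative example of
the dropwizard constructs in question — the Meter's EWMA rates and the
Histogram quantiles are exactly the parts that our current Counter and
Distribution do not cover (the metric names here are made up):

import com.codahale.metrics.Histogram;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Snapshot;

public class DropwizardSketch {
  public static void main(String[] args) {
    MetricRegistry registry = new MetricRegistry();

    // Meter: throughput with exponentially-weighted moving averages (EWMA).
    Meter requests = registry.meter("requests");
    requests.mark();
    System.out.println("1-min rate: " + requests.getOneMinuteRate());

    // Histogram: a value distribution with quantiles, which a plain
    // min/max/sum/count Distribution cannot answer after the fact.
    Histogram sizes = registry.histogram("sizes");
    sizes.update(512);
    Snapshot snapshot = sizes.getSnapshot();
    System.out.println("median: " + snapshot.getMedian());
    System.out.println("p99: " + snapshot.get99thPercentile());
  }
}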


On Fri, Jun 23, 2017 at 11:00 AM, Cody Innowhere  wrote:
> Hi Ismaël,
> Yes Distribution is similar to codahale's Histogram without the quantiles,
> and what I meant "adding support of Histogram" might be extending
> Distribution
> so that quantiles can be supported.
> I think in metrics area dropwizard metrics is more or less a standard and
> many frameworks have direct support for this including Spark, Flink and
> JStorm
> (Well I happen to be the developer of JStorm metrics and our internal alert
> engine), and you're right if beam metrics are compatible with dropwizard we
> can surely benefit from it.
>
> I've also read the design doc and IMHO it's not easy to support
> Meter/Histogram (currently Distribution is a bit too simple). I'm thinking
> about adding full
> support of dropwizard metrics and will come up with a doc later so that we
> can discuss this in detail.
>
> On Fri, Jun 23, 2017 at 4:30 PM, Ismaël Mejía  wrote:
>
>> Cody not sure if I follow, but isn't Distribution on Beam similar to
>> codahale/dropwizard's Histogram (without the quantiles) ?
>>
>> Meters are also in the plan but not implemented yet, see the Metrics
>> design doc:
>> https://s.apache.org/beam-metrics-api
>>
>> If I understand correctly, what you want is to have some sort of compatibility
>> with dropwizard so we can benefit from their sinks? If so (and I am not
>> misreading it), that would be neat, however the only problem is
>> the way the aggregation phase works on Beam because of distribution
>> vs dropwizard (not sure if they have some implementation that takes
>> distribution into account).
>>
>> Improving metrics is on the agenda and contributions are welcome
>> because the API is still evolving and we can have new ideas as part of
>> it.
>>
>> Regards,
>> Ismaël
>>
>>
>> On Fri, Jun 23, 2017 at 9:29 AM, Cody Innowhere 
>> wrote:
>> > Yes I agree with you and sorry for messing them together in this
>> discussion.
>> > I just wonder if someone plans to support Meters/Histograms in the near
>> > future. If so, we might need to modify metrics a bit in beam sdk IMHO,
>> > that's the reason I started this discussion.
>> >
>> > On Fri, Jun 23, 2017 at 3:21 PM, Jean-Baptiste Onofré 
>> > wrote:
>> >
>> >> Hi Codi,
>> >>
>> >> I think there are two "big" topics around metrics:
>> >>
>> >> - what we collect
>> >> - where we send the collected data
>> >>
>> >> The "generic metric sink" (BEAM-2456) is for the later: we don't really
>> >> change/touch the collected data (or maybe just in case of data format)
>> we
>> >> send to the sink.
>> >>
>> >> The Meters/Histograms is both more the collected data IMHO.
>> >>
>> >> Regards
>> >> JB
>> >>
>> >>
>> >> On 06/23/2017 04:09 AM, Cody Innowhere wrote:
>> >>
>> >>> Hi JB,
>> >>> Glad to hear that.
>> >>> Still, I'm thinking about adding support of Meters & Histograms(maybe
>> >>> extending Distribution). As the discussion mentions, problem is that
>> >>> Meter/Histogram
>> >>> cannot be updated directly in current way because their internal data
>> >>> decays after time. Do you plan to refactor current implementation so
>> that
>> >>> they can be supported while working on the generic metric sink?
>> >>>
>> >>> On Thu, Jun 22, 2017 at 9:37 PM, Jean-Baptiste Onofré > >
>> >>> wrote:
>> >>>
>> >>> Hi
>> >>>>
>> >>>> Agree with Aviem and yes actually I'm working on a generic metric
>> sink. I
>> >>>> created a Jira about that. I'm off today, I will send some details
>> asap.
>> >>>>
>> >>>> Regards
>> >>>> JB
>> >>>>
>> >>

Re: [DISCUSS] Bridge beam metrics to underlying runners to support metrics reporters?

2017-06-23 Thread Ismaël Mejía
Cody not sure if I follow, but isn't Distribution on Beam similar to
codahale/dropwizard's Histogram (without the quantiles) ?

Meters are also in the plan but not implemented yet, see the Metrics design doc:
https://s.apache.org/beam-metrics-api

If I understand correctly, what you want is to have some sort of compatibility
with dropwizard so we can benefit from their sinks? If so (and I am not
misreading it), that would be neat, however the only problem is
the way the aggregation phase works on Beam because of distribution
vs dropwizard (not sure if they have some implementation that takes
distribution into account).
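
For reference, this is roughly how the current API is used from a DoFn
(Counter, Distribution and the Metrics factory methods are the real SDK
classes; the namespace and metric names are just for illustration):

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

class ParseFn extends DoFn<String, String> {
  private final Counter parsed =
      Metrics.counter(ParseFn.class, "parsed-records");
  private final Distribution sizes =
      Metrics.distribution(ParseFn.class, "record-sizes");

  @ProcessElement
  public void processElement(ProcessContext c) {
    parsed.inc();
    // A Distribution keeps only min/max/sum/count, so dropwizard-style
    // quantiles cannot be derived from it afterwards.
    sizes.update(c.element().length());
    c.output(c.element());
  }
}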

Improving metrics is on the agenda and contributions are welcome
because the API is still evolving and we can have new ideas as part of
it.

Regards,
Ismaël


On Fri, Jun 23, 2017 at 9:29 AM, Cody Innowhere  wrote:
> Yes I agree with you and sorry for messing them together in this discussion.
> I just wonder if someone plans to support Meters/Histograms in the near
> future. If so, we might need to modify metrics a bit in beam sdk IMHO,
> that's the reason I started this discussion.
>
> On Fri, Jun 23, 2017 at 3:21 PM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi Codi,
>>
>> I think there are two "big" topics around metrics:
>>
>> - what we collect
>> - where we send the collected data
>>
>> The "generic metric sink" (BEAM-2456) is for the later: we don't really
>> change/touch the collected data (or maybe just in case of data format) we
>> send to the sink.
>>
>> The Meters/Histograms is both more the collected data IMHO.
>>
>> Regards
>> JB
>>
>>
>> On 06/23/2017 04:09 AM, Cody Innowhere wrote:
>>
>>> Hi JB,
>>> Glad to hear that.
>>> Still, I'm thinking about adding support of Meters & Histograms(maybe
>>> extending Distribution). As the discussion mentions, problem is that
>>> Meter/Histogram
>>> cannot be updated directly in current way because their internal data
>>> decays after time. Do you plan to refactor current implementation so that
>>> they can be supported while working on the generic metric sink?
>>>
>>> On Thu, Jun 22, 2017 at 9:37 PM, Jean-Baptiste Onofré 
>>> wrote:
>>>
>>> Hi

 Agree with Aviem and yes actually I'm working on a generic metric sink. I
 created a Jira about that. I'm off today, I will send some details asap.

 Regards
 JB

 On Jun 22, 2017, 15:16, at 15:16, Aviem Zur  wrote:

> Hi Cody,
>
> Some of the runners have their own metrics sink, for example Spark
> runner
> uses Spark's metrics sink which you can configure to send the metrics
> to
> backends such as Graphite.
>
> There have been ideas floating around for a Beam metrics sink extension
> which will allow users to send Beam metrics to various metrics
> backends, I
> believe @JB is working on something along these lines.
>
> On Thu, Jun 22, 2017 at 2:00 PM Cody Innowhere 
> wrote:
>
>> Hi guys,
>> Currently metrics are implemented in runners/core as CounterCell,
>> GaugeCell, DistributionCell, etc. If we want to send metrics to external
>> systems via metrics reporter, we would have to define another set of
>> metrics, say, codahale metrics, and update codahale metrics periodically
>> with beam sdk metrics, which is inconvenient and inefficient.
>>
>> Another problem is that Meter/Histogram cannot be updated directly in this
>> way because their internal data decays after time.
>>
>> My opinion would be to bridge beam sdk metrics to underlying runners so that
>> updates would directly apply to underlying runners (Flink, Spark, etc)
>> without conversion.
>>
>> Specifically, currently we already delegate
>> Metrics.counter/gauge/distribution to DelegatingCounter/Gauge/Distribution,
>> which uses MetricsContainer to store the actual metrics with the
>> implementation of MetricsContainerImpl. If we can add an API in
>> MetricsEnvironment to allow runners to override the default implementation,
>> say, for flink, we have FlinkMetricsContainerImpl, then all metric updates
>> will directly apply to metrics in FlinkMetricsContainerImpl without
>> intermediate conversion and updates. And since the metrics are
>> runner-specific, it would be a lot easier to support metrics reporters as
>> well as Meters/Histograms.
>>
>> What do you think?
>>
>>

>>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>


Re: [DISCUSS] Apache Beam 2.1.0 release next week ?

2017-06-21 Thread Ismaël Mejía
Thanks JB for keeping the time-based release agenda. I really don't
have any blocker, but I would like to have the hadoop version alignment
PR merged before this one, and probably also Nexmark (considering that
Etienne fixed most of the issues and we already have the LGTM, we are
just waiting for a last review on the final fixes).
Of course neither of the two is a blocker, but it would be nice to get
them merged if possible.
When do you plan to start the vote?

On Thu, Jun 22, 2017 at 6:48 AM, Reuven Lax  wrote:
> This does mean that value-dependent FileBasedSink will miss 2.1.0, but I guess
> it will make 2.2.0 then.
>
> On Wed, Jun 21, 2017 at 7:23 PM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi guys,
>>
>> As we released 2.0.0 (first stable release) last month during ApacheCon,
>> and to maintain our release pace, I would like to release 2.1.0 next week.
>>
>> This release would include lot of bug fixes and some new features:
>>
>> https://issues.apache.org/jira/projects/BEAM/versions/12340528
>>
>> I'm volunteer to be release manager for this one.
>>
>> Thoughts ?
>>
>> Thanks,
>> Regards
>> JB
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>


Re: Beam Proposal: Pipeline Drain

2017-06-12 Thread Ismaël Mejía
Hello Reuven,

I finally took the time to read the Drain proposal, thanks a lot for
bringing this, it looks like a nice fit with the current APIs and it
would be great if this could be implemented as much as possible in a
Runner independent way.

I am eager now to see the snapshot and update proposal.
Thanks again,
Ismaël

On Tue, Jun 6, 2017 at 10:03 PM, Reuven Lax  wrote:
>
> I believe so, but it looks like the dispenser is an interface to the user
> for such features. We still need to define what the semantics of features
> like Drain are, and how they affect the pipeline.
>
> On Tue, Jun 6, 2017 at 12:06 PM, Jean-Baptiste Onofré 
> wrote:
>
> > Hi Reuven,
> >
> > In the "Apache Beam: Technical vision" document (dating from the
> > incubation) (https://docs.google.com/document/d/1UyAeugHxZmVlQ5cEWo_eOPgXNQA1oD-rGooWOSwAqh8/edit?usp=sharing),
> > I added a section named "Beam
> > Pipelines Dispenser".
> >
> > The idea is to be able to bootstrap, run and control pipelines (and the
> > runners).
> >
> > I think it's somehow related. WDYT ?
> >
> > Regards
> > JB
> >
> > On 06/06/2017 07:43 PM, Reuven Lax wrote:
> >
> >> Hi all,
> >>
> >> Beam is a great programming mode, but in order to really run pipelines
> >> (especially streaming pipelines which are "always on") in a production
> >> setting, there is a set of features necessary. Dataflow has a couple of
> >> those features built in (Drain and Update), and inspired by those I'll be
> >> sending out a few proposals for similar features in Beam.
> >>
> >> Please note that my intention here is _not_ to simply forklift the
> >> Dataflow
> >> features to Beam. The Dataflow features are being used as inspiration, and
> >> we have two years of experience how real users have used these feature
> >> (and
> >> also experienced when users have found these features limited and
> >> frustrating). In every case my Beam proposals are different - hopefully
> >> better! - than the actual Dataflow feature that exists today.
> >>
> >> I think all runners would greatly benefit from production-control features
> >> like this, and I would love to see community input. The first proposal is
> >> for a way of draining a streaming pipeline before stopping it, and here it
> >> is
> >>  >> GhDPmm3cllSN8IMmWci8/edit>
> >> .
> >>
> >> Reuven
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >


Re: [DISCUSS] HadoopInputFormat based IOs

2017-06-01 Thread Ismaël Mejía
Stephen I agree with you that the most important thing is not to lose
functionality for the integration tests, so it is important to keep at
least one of the two (Cassandra or Elasticsearch) to do a real
integration test for HIFIO.

Your proposal of making the IT tests for the native IOs parallelized
with WriteThenRead seems excellent, we have to sync to see how we can
do this; my only doubt is that the setup still takes a long time (at
least for the bounded sources, it will all depend on the size of the
generated data).

I think we are good in our discussion (at least the missing part of
the tests) so please confirm me if you agree with this phrase to
summarize the thread:

New IOs based on HIFIO won’t be merged as tests of HIFIO or as
independent IOs. However we still encourage these contributions as
documentation on how to use HIFIO with different data stores, these
contributions will be part of the documentation in the Beam website.


On Tue, May 30, 2017 at 11:16 PM, Stephen Sisk  wrote:
> Ah, thanks for clarifying ismael.
>
> I think you would agree that we need to have integration testing of HIFIO.
> Cassandra and ES are currently the only ITs for HIFIO. If we want to write
> ITs for HIFIO that don't rely on ES/Cassandra with the idea that we'd
> remove ES/Cassandra, I could be okay with that. The data store in question
> would need to have both small & large k8s cluster scripts so that we can do
> small & large integration tests (since that's what's currently supported
> with HIFIO today and I don't think we should go backwards.)
>
> The reason I hesitate to use a data store that doesn't have a native
> implementation is that we can use ES/Cassandra's native write transform to
> eventually switch HIFIO ITs to the new writeThenRead style IO IT [1] that
> will *drastically* simplify maintenance requirements for the HIFIO tests.
> WriteThenRead writes the test data inside of the test, thus removing the
> requirement for a separate data loading step outside of the step. We
> *could* write inside the test setup code (thus running only on one
> machine), but for larger data amounts, that takes too long - it's easier to
> do the write using the IO, which runs in parallel, and thus is a lot
> quicker. That means we need a data store that has a native, parallelizable
> write.
>
> What do you think? Basically, I agree with you in principle, but given that
> using a data store without a native implementation either a separate data
> loading step or slower tests, I'd strongly prefer to keep using
> ES/Cassandra. (you could make the case that we should remove one of them.
> I'm not attached to keeping both.)
>
>
>> having [ES/Cassandra HIFIO read-code] in the source code base would [not]
> be
> consistent with the ideas of the previous paragraph.
> I do agree with this. If we keep the ES/Cassandra HIFIO test code, I'd
> propose that we add comments in there directing people to the correct
> native source.
>
> S
> [1] writeThenRead style IO IT -
> https://lists.apache.org/thread.html/26ee3ba827c2917c393ab26ce97e7491846594d8f574b5ae29a44551@%3Cdev.beam.apache.org%3E
>
> On Tue, May 30, 2017 at 1:47 PM Ismaël Mejía  wrote:
>
>> The whole goal of this discussion is that we define what shall we do
>> when someone wants to add a new IO that uses HIFIO. The consensus so
>> far following the PR comments + this thread is that it should be
>> discouraged and those contributions be included as documentation in the
>> website, and that we should give priority to the native
>> implementations, which seems reasonable (e,g, to encourage better
>> implementations and avoid the maintenance burden).
>>
>> So, I was wondering what would be a good rule to justify that we have
>> tests for some data stores as part of the tests of HIFIO and I don't
>> see a strong reason to do this, in particular once those have native
>> implementations, to be more clear, in the current case we have HIFIO
>> tests (jdk1.8-tests) for Elasticsearch5 and Cassandra which both are
>> not covered by the native IOs yet. However once the native IOs for
>> both systems are merged I don't see any reason to keep the extra tests
>> in HIFIO, because we will be doing a double effort to test an IO that
>> is not native, and that does not support Write, so I think we should
>> remove those. Also not having this in the source code base would be
>> consistent with the ideas of the previous paragraph.
>>
>> But well maybe I am missing something here, do you see any strong
>> reason to keep them.
>>


Re: [DISCUSS] HadoopInputFormat based IOs

2017-05-30 Thread Ismaël Mejía
The whole goal of this discussion is that we define what shall we do
when someone wants to add a new IO that uses HIFIO. The consensus so
far following the PR comments + this thread is that it should be
discouraged and those contributions be included as documentation in the
website, and that we should give priority to the native
implementations, which seems reasonable (e,g, to encourage better
implementations and avoid the maintenance burden).

So, I was wondering what would be a good rule to justify that we have
tests for some data stores as part of the tests of HIFIO and I don't
see a strong reason to do this, in particular once those have native
implementations, to be more clear, in the current case we have HIFIO
tests (jdk1.8-tests) for Elasticsearch5 and Cassandra which both are
not covered by the native IOs yet. However once the native IOs for
both systems are merged I don't see any reason to keep the extra tests
in HIFIO, because we will be doing a double effort to test an IO that
is not native, and that does not support Write, so I think we should
remove those. Also not having this in the source code base would be
consistent with the ideas of the previous paragraph.

But well maybe I am missing something here, do you see any strong
reason to keep them.


Re: [DISCUSS] HadoopInputFormat based IOs

2017-05-30 Thread Ismaël Mejía
I agree 100% with Stephen points, I think that including a
'discoverability' section for these IOs that are shared by multiple
data stores is a great step, in particular for the HIF ones.

I would like us to define what we would do concretely with the
HIFIO-based implementations of IOs once their native implementation is
merged, e.g. today we have Cassandra and Elasticsearch5 examples based
on HIF that will be clearly redundant once we have the native
versions, so they should maybe be moved into the proposed website
section. What do you guys think?

Any other ideas/comments on the general subject?



On Tue, May 23, 2017 at 7:25 PM, Stephen Sisk  wrote:
> hey,
>
> Thanks for bringing this up! It's definitely an interesting question and I
> can see both sides of the argument.
>
> I can see the appeal of HIFIO wrapper IOs as stop-gaps and if they have
> good test coverage, it does ensure that the HIFIO route is working. If we
> have good IT coverage, it also means there's fewer steps involved in
> building a native IO as well, since the ITs will already be written.
>
> However, I think I'm still assuming that the community will implement
> native IOs for most data stores that users want to interact with, and thus
> I'd still discourage building IOs that are just HIFIO/jdbc wrappers. I'd
> personally rather devote time and resources to native IOs. If we don't see
> traction on building more IOs then I'd be more open to it.
>
> If we do choose to go down this "Don't build HIFIO wrappers, just improve
> discoverability" route, one idea I had floating around in my head was that
> we might add a section to the Built-in IO Transforms page that covers
> "non-native but readable" IOs (better name suggestions appreciated :) -
> that could include a list of data stores that jdbc/jms/hifio support and
> link to HIFIO's info on how to use them. (That might also be a good place
> to document the performance tradeoffs of using HIFIO)
>
> S
>
>
> On Tue, May 23, 2017 at 9:53 AM Ismaël Mejía  wrote:
>
>> Hello, I bring this subject to the mailing list to see everybody’s
>> opinion on the subject.
>>
>> The recent inclusion of HadoopInputFormatIO (HiFiIO) gave Beam users
>> the option to ‘easily’ include data stores that support the
>> Hadoop-based partitioning scheme. There are currently examples of how
>> to use it for example to read from Elasticsearch and Cassandra. In
>> both cases we already have specific IOs on master or as WIP so using
>> HiFiIO based IO is not needed.
>>
>> During the review of the recent IO for Hive (HCatalog) that uses
>> HiFiIO instead of a native API, there was a discussion about the fact
>> that this shouldn’t be included as a specific IO but better to add the
>> tests/documentation of how to read Hive records using the existing
>> HiFiIO. This makes sense from an abstraction point of view, however
>> there are visibility issues since end users would need to repackage
>> and discover the supported (and tested) HiFi-based IOs that won’t be
>> explicit in the code base.
>>
>> I would like to know what other members of the community think about
>> this, is it worth to have individual IOs based on HiFiIO for things
>> that we currently don’t support (e.g. Hive or Amazon Redshift) (option
>> 1) or maybe it is better to just add the tests/docs of how to use
>> them as proposed in the PR (option 2).
>>
>> Feel free to comment/vote or maybe add an eventual third option if you
>> think there is a better option.
>>
>> Regards,
>> Ismaël Mejía
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-1158
>>


Re: [New Proposal] Hive connector using native api

2017-05-24 Thread Ismaël Mejía
One quick thing I forgot to mention is that maybe it is a good idea
for the guys working on the Beam SQL implementation to take a look at
their needs for this IO considering that it could be quite useful to
test the SQL (given the structured nature of HCatalog).


On Wed, May 24, 2017 at 11:54 AM, Ismaël Mejía  wrote:
> Hello,
>
> I created a new JIRA for this native implementation of the IO so feel
> free to PR the 'native' implementation using this ticket.
> https://issues.apache.org/jira/browse/BEAM-2357
>
> We will discuss all the small details in the PR.
>
> The old JIRA (BEAM-1158) will still be there just to add the read
> example for HCatalog using HIFIO.
>
> Regards,
> Ismaël
>
>
> On Wed, May 24, 2017 at 8:03 AM, Jean-Baptiste Onofré  
> wrote:
>> Hi,
>>
>> It looks good. I just saw some issues:
>>
>> - javadoc is not correct in HiveIO (it says write() for read ;)).
>> - estimated size is global to the table (doesn't consider the filter). It's
>> not a big deal, but it should be documented.
>> - you don't use the desired bundle size provided by the runner for the
>> split. You are using the Hive split count, which is fine, just explain in
>> the main javadoc maybe.
>> - the reader should set current to null when nothing is read
>> - getCurrent() should throw NoSuchElementException in case of current is
>> null
>> - in the writer, the flush should happen at the end of the batch as you did,
>> but also when the bundle is finished
>>
>> Thanks !
>> Great work
>>
>> Regards
>> JB
>>
>>
>> On 05/24/2017 01:36 AM, Seshadri Raghunathan wrote:
>>>
>>> Hi,
>>>
>>>
>>> You can find a draft implementation of the same here :
>>>
>>>
>>> HiveIO Source -
>>> https://github.com/seshadri-cr/beam/commit/b74523c13e03dc70038bc1e348ce270fbb3fd99b
>>>
>>> HiveIO Sink -
>>> https://github.com/seshadri-cr/beam/commit/0008f772a989c8cd817a99987a145fbf2f7fc795
>>>
>>>
>>> Please let us know your comments and suggestions.
>>>
>>>
>>> Regards,
>>>
>>> Seshadri
>>>
>>> 408 601 7548
>>>
>>>
>>> From: Madhusudan Borkar [mailto:mbor...@etouch.net]
>>> Sent: Tuesday, May 23, 2017 3:12 PM
>>> To: dev@beam.apache.org; Seshadri Raghunathan ;
>>> Rajesh Pandey 
>>> Subject: [New Proposal] Hive connector using native api
>>>
>>>
>>> Hi,
>>>
>>> HadoopIO can be used to read from Hive. It doesn't provide writing to
>>> Hive. This new proposal for Hive connector includes both source and sink. It
>>> uses Hive native api.
>>>
>>> Apache HCatalog provides way to read / write to hive without using
>>> mapreduce. HCatReader reads data from cluster, using basic storage
>>> abstraction of tables and rows. HCatWriter writes to cluster and a batching
>>> process will be used to write in bulk. Please refer to Apache documentation
>>> on HCatalog ReaderWriter
>>> https://cwiki.apache.org/confluence/display/Hive/HCatalog+ReaderWriter
>>>
>>>
>>> Solution:
>>>
>>> It will work like:
>>>
>>>
>>> pipeline.apply(HiveIO.read()
>>>
>>> .withMetastoreUri("uri") //mandatory
>>>
>>> .withTable("myTable") //mandatory
>>>
>>> .withDatabase("myDb") //optional, assumes default if none specified
>>>
>>> .withPartition(“partition”) //optional,should be specified if the table is
>>> partitioned
>>>
>>>
>>> pipeline.apply(HiveIO.write()
>>>
>>> .withMetastoreUri("uri") //mandatory
>>>
>>> .withTable("myTable") //mandatory
>>>
>>> .withDatabase("myDb") //optional, assumes default if none specified
>>>
>>> .withPartition(“partition”) //optional
>>>
>>> .withBatchSize(size)) //optional
>>>
>>>
>>> Please, let us know your comments and suggestions.
>>>
>>>
>>>
>>>
>>> Madhu Borkar
>>>
>>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com


Re: [New Proposal] Hive connector using native api

2017-05-24 Thread Ismaël Mejía
Hello,

I created a new JIRA for this native implementation of the IO so feel
free to PR the 'native' implementation using this ticket.
https://issues.apache.org/jira/browse/BEAM-2357

We will discuss all the small details in the PR.

The old JIRA (BEAM-1158) will still be there just to add the read
example for HCatalog using HIFIO.

Regards,
Ismaël


On Wed, May 24, 2017 at 8:03 AM, Jean-Baptiste Onofré  wrote:
> Hi,
>
> It looks good. I just saw some issues:
>
> - javadoc is not correct in HiveIO (it says write() for read ;)).
> - estimated size is global to the table (doesn't consider the filter). It's
> not a big deal, but it should be documented.
> - you don't use the desired bundle size provided by the runner for the
> split. You are using the Hive split count, which is fine, just explain in
> the main javadoc maybe.
> - the reader should set current to null when nothing is read
> - getCurrent() should throw NoSuchElementException in case of current is
> null
> - in the writer, the flush should happen at the end of the batch as you did,
> but also when the bundle is finished
>
> Thanks !
> Great work
>
> Regards
> JB
>
>
> On 05/24/2017 01:36 AM, Seshadri Raghunathan wrote:
>>
>> Hi,
>>
>>
>> You can find a draft implementation of the same here :
>>
>>
>> HiveIO Source -
>> https://github.com/seshadri-cr/beam/commit/b74523c13e03dc70038bc1e348ce270fbb3fd99b
>>
>> HiveIO Sink -
>> https://github.com/seshadri-cr/beam/commit/0008f772a989c8cd817a99987a145fbf2f7fc795
>>
>>
>> Please let us know your comments and suggestions.
>>
>>
>> Regards,
>>
>> Seshadri
>>
>> 408 601 7548
>>
>>
>> From: Madhusudan Borkar [mailto:mbor...@etouch.net]
>> Sent: Tuesday, May 23, 2017 3:12 PM
>> To: dev@beam.apache.org; Seshadri Raghunathan ;
>> Rajesh Pandey 
>> Subject: [New Proposal] Hive connector using native api
>>
>>
>> Hi,
>>
>> HadoopIO can be used to read from Hive. It doesn't provide writing to
>> Hive. This new proposal for Hive connector includes both source and sink. It
>> uses Hive native api.
>>
>> Apache HCatalog provides way to read / write to hive without using
>> mapreduce. HCatReader reads data from cluster, using basic storage
>> abstraction of tables and rows. HCatWriter writes to cluster and a batching
>> process will be used to write in bulk. Please refer to Apache documentation
>> on HCatalog ReaderWriter
>> https://cwiki.apache.org/confluence/display/Hive/HCatalog+ReaderWriter
>>
>>
>> Solution:
>>
>> It will work like:
>>
>>
>> pipeline.apply(HiveIO.read()
>>
>> .withMetastoreUri("uri") //mandatory
>>
>> .withTable("myTable") //mandatory
>>
>> .withDatabase("myDb") //optional, assumes default if none specified
>>
>> .withPartition(“partition”) //optional,should be specified if the table is
>> partitioned
>>
>>
>> pipeline.apply(HiveIO.write()
>>
>> .withMetastoreUri("uri") //mandatory
>>
>> .withTable("myTable") //mandatory
>>
>> .withDatabase("myDb") //optional, assumes default if none specified
>>
>> .withPartition(“partition”) //optional
>>
>> .withBatchSize(size)) //optional
>>
>>
>> Please, let us know your comments and suggestions.
>>
>>
>>
>>
>> Madhu Borkar
>>
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


[DISCUSS] HadoopInputFormat based IOs

2017-05-23 Thread Ismaël Mejía
Hello, I bring this subject to the mailing list to see everybody’s
opinion on the subject.

The recent inclusion of HadoopInputFormatIO (HiFiIO) gave Beam users
the option to ‘easily’ include data stores that support the
Hadoop-based partitioning scheme. There are currently examples of how
to use it for example to read from Elasticsearch and Cassandra. In
both cases we already have specific IOs on master or as WIP so using
HiFiIO based IO is not needed.

During the review of the recent IO for Hive (HCatalog) that uses
HiFiIO instead of a native API, there was a discussion about the fact
that this shouldn’t be included as a specific IO but better to add the
tests/documentation of how to read Hive records using the existing
HiFiIO. This makes sense from an abstraction point of view, however
there are visibility issues since end users would need to repackage
and discover the supported (and tested) HiFi-based IOs that won’t be
explicit in the code base.

I would like to know what other members of the community think about
this, is it worth to have individual IOs based on HiFiIO for things
that we currently don’t support (e.g. Hive or Amazon Redshift) (option
1) or maybe it is better to just add the tests/docs of how to use
them as proposed in the PR (option 2).

Feel free to comment/vote or maybe add an eventual third option if you
think there is a better option.

Regards,
Ismaël Mejía

[1] https://issues.apache.org/jira/browse/BEAM-1158


Re: First stable release completed!

2017-05-17 Thread Ismaël Mejía
Amazing milestone, congrats everyone!

On Wed, May 17, 2017 at 7:54 PM, Reuven Lax  wrote:
> Sweet!
>
> On Wed, May 17, 2017 at 4:28 AM, Davor Bonaci  wrote:
>
>> The first stable release is now complete!
>>
>> Release artifacts are available through various repositories, including
>> dist.apache.org, Maven Central, and PyPI. The website is updated, and
>> announcements are published.
>>
>> Apache Software Foundation press release:
>> http://globenewswire.com/news-release/2017/05/17/986839/0/
>> en/The-Apache-Software-Foundation-Announces-Apache-Beam-v2-0-0.html
>>
>> Beam blog:
>> https://beam.apache.org/blog/2017/05/17/beam-first-stable-release.html
>>
>> Congratulations to everyone -- this is a really big milestone for the
>> project, and I'm proud to be a part of this great community.
>>
>> Davor
>>


Re: [VOTE] First stable release: release candidate #4

2017-05-14 Thread Ismaël Mejía
+1 (non-binding)

Validated signatures OK
Run mvn clean verify -Prelease OK
Executed Nexmark with Direct/Spark/Flink/Apex runners in local mode
(temporarily downgraded to 2.0.0 to validate the version). OK

This is looking great now. As Robert said, a release to be proud of.

On Sun, May 14, 2017 at 8:25 AM, Robert Bradshaw
 wrote:
> +1
>
> Verified all the checksums and signatures. (Now that both md5 and sha1
> are broken, we should probably provide sha-256 as well.)
>
> Spot checked the site and documentation, left comments on the PR. The
> main landing page has nothing about the Beam stable release, and the
> top blog entry (right in the center) mentions 0.6.0 which catches the
> eye. I assume a 2.0.0 blog will be here shortly?
>
> Ran a couple of trivial but novel direct-runner pipelines (Python and Java).
>
> https://github.com/tensorflow/transform is pinning 0.6.0, so we won't
> break them (though hopefully they'll upgrade to >=2.0.0 shortly after
> the release).
>
> The Python zipfile at
> https://dist.apache.org/repos/dist/dev/beam/2.0.0-RC4/ is missing
> sdks/python/apache_beam/transforms/trigger_transcripts.yaml. This will
> cause some tests to be skipped (but no failure). However, I don't
> think it's worth cutting another release candidate for.
>
> Everything else is looking great. This is a release to be proud of!
>
> - Robert
>
>
>
> On Sat, May 13, 2017 at 8:40 PM, Mingmin Xu  wrote:
>> +1
>>
>> Test beam-examples with FlinkRunner, and several cases of KafkaIO/JdbcIO.
>>
>> Thanks!
>> Mingmin
>>
>> On Sat, May 13, 2017 at 7:38 PM, Ahmet Altay 
>> wrote:
>>
>>> +1
>>>
>>> - Tested Python wordcount with DirectRunner & DataflowRunner on
>>> Windows/Mac/Linux, and python mobile gaming examples with DirectRunner &
>>> DataflowRunner on Mac/Linux
>>> - Verified that generated pydocs are accurate.
>>> - Python zip file has valid metadata and contains LICENSE, NOTICE and
>>> README.
>>>
>>> Ahmet
>>>
>>> On Sat, May 13, 2017 at 1:12 AM, María García Herrero <
>>> mari...@google.com.invalid> wrote:
>>>
>>> > +1 -- validated python quickstart and mobile game for DirectRunner and
>>> > DataflowRunner on Linux (RC3) and Mac (RC4).
>>> >
>>> > Go Beam!
>>> >
>>> > On Fri, May 12, 2017 at 11:12 PM, Jean-Baptiste Onofré 
>>> > wrote:
>>> >
>>> > > +1 (binding)
>>> > >
>>> > > Tested on beam-samples, especially focus on HDFS support, etc.
>>> > >
>>> > > Thanks !
>>> > > Regards
>>> > > JB
>>> > >
>>> > >
>>> > > On 05/13/2017 06:47 AM, Davor Bonaci wrote:
>>> > >
>>> > >> Hi everyone --
>>> > >> After going through several release candidates, setting and validating
>>> > >> acceptance criteria, running a hackathon, and polishing the release,
>>> now
>>> > >> is
>>> > >> the time to vote!
>>> > >>
>>> > >> Please review and vote on the release candidate #4 for the version
>>> > 2.0.0,
>>> > >> as follows:
>>> > >> [ ] +1, Approve the release
>>> > >> [ ] -1, Do not approve the release (please provide specific comments)
>>> > >>
>>> > >> The complete staging area is available for review, which includes:
>>> > >> * JIRA release notes [1],
>>> > >> * the official Apache source release to be deployed to
>>> dist.apache.org
>>> > >> [2],
>>> > >> which is signed with the key with fingerprint 8F0D334F [3],
>>> > >> * all artifacts to be deployed to the Maven Central Repository [4],
>>> > >> * source code tag "v2.0.0-RC4" [5],
>>> > >> * website pull request listing the release and publishing the API
>>> > >> reference
>>> > >> manual [6],
>>> > >> * Python artifacts are deployed along with the source release to the
>>> > >> dist.apache.org [2].
>>> > >>
>>> > >> Jenkins suites:
>>> > >> * https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/11439/
>>> > >> * https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/3801/
>>> > >> * https://builds.apache.org/job/beam_PostCommit_Python_Verify/2216/
>>> > >> *
>>> > >> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/1461/
>>> > >> *
>>> > >> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/3123/
>>> > >> *
>>> > >> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/2808/
>>> > >> *
>>> > >> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/2060/
>>> > >>
>>> > >> The vote will be open for at least 72 hours. It is adopted by majority
>>> > >> approval of qualified votes, with at least 3 PMC affirmative votes.
>>> > >>
>>> > >> Thanks!
>>> > >>
>>> > >> Davor
>>> > >>
>>> > >> [1]
>>> > >> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12339746
>>> > >> [2] https://dist.apache.org/repos/dist/dev/beam/2.0.0-RC4/
>>> > >> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> > >> [4] https://repository.apache.org/content/repositories/orgapachebeam-1017/
>>> > >> [5] https://github.com/apache/beam/tree/v2.0.0-RC4
>>> > >> [6] https://github.com/apache/

Re: [DISCUSSION] using NexMark for Beam

2017-05-14 Thread Ismaël Mejía
Hello,

Thanks Etienne for opening the Pull Request and starting the
discussion for the review process. I also want to thank publicly all
the people that somehow contributed to this:

- Mark Shields and the original people at google who worked on nexmark
for contributing this in the first place.
- Etienne because his work and constant help really improved the
status of the queries, your work on query 3 was really nice, and also
for the hard work of helping me test all the queries with all the
runners and ping the runner maintainers for fixes.
- Aviem/Amit for all the help to solve the issues with the spark
runner whose support is now almost feature complete (even in
streaming!).
- Aljoscha/Jinsong for the fix to merge IntervalWindowFn and for
quickly adding the support for metrics.
- Thomas Groh and Kenneth for fixing some needed parts in Direct
Runner + answering our questions on the State/Timer API.
- JB and the talend crew for all the feedback and help to run it in our
benchmark cluster.
- And of course the rest of the Beam community :)

Some comments:

- This does not need to have a feature branch since we have been
working on this in a fork for months now and with the stable API we
can simply do a traditional PR review. Of course the review is a bit
bigger so we expect it to take some time, but I hope we can get some
quick progress once FSR is out.

- We need a hand from the google guys, for the moment we have tested
all the queries in all the runners, but not in the Dataflow runner
because we don't have access to it (well we have but not with the
freedom that you guys have to run the benchmark at will), so if we can
get some access that would be nice or if this is not possible, it
would be nice if some of you guys help us test/report any given issue
on this runner.

- We also have to decide the future of some features, this is probably
independent of the current PR and part of the evolution of Nexmark on
Beam:

-- There are still some pending things that can be improved even after
the review once in master, e.g. for the moment we have only synthetic
sources, but the original version also took data from Pubsub; we have
to define the correct scope for this and, if needed, also add other
sources, e.g. Kafka, HDFS.

-- Query 10 is really oriented to testing Google Runner/IOs specific
features, so we have to decide what to do with this one, maybe
mirroring it with Kafka/HDFS to have something equivalent in the
Apache world.

This is all for now, I am really glad that this is finally happening
and I hope this soon gets merged.

Ismaël

On Fri, May 12, 2017 at 6:07 PM, Lukasz Cwik  wrote:
> I think these are valuable enough that we should get them into apache/master
>
> On Fri, May 12, 2017 at 4:34 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi,
>>
>> PR or even a feature branch could work. Up to you.
>>
>> Regards
>> JB
>>
>>
>> On 05/12/2017 10:55 AM, Etienne Chauchot wrote:
>>
>>> Hi guys,
>>>
>>> I wanted to let you know that I have just submitted a PR around NexMark.
>>> This is
>>> a port of the NexMark queries to Beam, to be used as integration tests.
>>> This can also be used as A-B testing (no-regression or performance
>>> comparison
>>> between 2 versions of the same engine or of the same runner)
>>>
>>> This a continuation of the previous PR (#99) from Mark Shields.
>>> The code has changed quite a bit: some queries have changed to use new
>>> Beam APIs
>>> and there where some big refactorings. More important, we can now run all
>>> the
>>> queries in all the runners.
>>>
>>> Nevertheless, there are still some open issues in Nexmark
>>> (https://github.com/iemejia/beam/issues) and in Beam upstream (see issue
>>> links
>>> in https://issues.apache.org/jira/browse/BEAM-160)
>>>
>>> I wanted to submit the PR before our (Ismaël and I) NexMark talk at the
>>> ApacheCon. The PR is not perfect but it is in a good shape to share it.
>>>
>>> Best,
>>>
>>> Etienne
>>>
>>>
>>>
>>> Le 22/03/2017 à 04:51, Kenneth Knowles a écrit :
>>>
 This is great! Having a variety of realistic-ish pipelines running on all
 runners complements the validation suite and IO IT work.

 If I recall, some of these involve heavy and esoteric uses of state, so
 definitely give me a ping if you hit any trouble.

 Kenn

 On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot 
 wrote:

 Hi all,
>
> Ismael and I are working on upgrading the Nexmark implementation for
> Beam.
> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
> work done by Mark Shields. See https://github.com/apache/beam/pull/366
> for the original PR.
>
> The PR contains queries that have a wide coverage of the Beam model and
> that represent a realistic end user use case (some come from client
> experience on Google Cloud Dataflow).
>
> So far, we have upgraded the implementation to the latest 

Re: First stable release: version designation?

2017-05-04 Thread Ismaël Mejía
My vote, like Davor:
Slight preference toward 2.0.0, but fine with 1.0.0

On Thu, May 4, 2017 at 9:32 PM, Thomas Weise  wrote:
> I'm in the relaxed 1.0.0 camp.
>
> --
> sent from mobile
> On May 4, 2017 12:29 PM, "Mingmin Xu"  wrote:
>
>> I slightly prefer1.0.0 for the *first* stable release, but fine with 2.0.0.
>>
>> On Thu, May 4, 2017 at 12:25 PM, Lukasz Cwik 
>> wrote:
>>
>> > Put me under Strongly for 2.0.0
>> >
>> > On Thu, May 4, 2017 at 12:24 PM, Kenneth Knowles > >
>> > wrote:
>> >
>> > > I'll join Davor's group.
>> > >
>> > > On Thu, May 4, 2017 at 12:07 PM, Davor Bonaci 
>> wrote:
>> > >
>> > > > I don't think we have reached a consensus here yet. Let's re-examine
>> > this
>> > > > after some time has passed.
>> > > >
>> > > > If I understand everyone's opinion correctly, this is the summary:
>> > > >
>> > > > Strongly for 2.0.0:
>> > > > * Aljoscha
>> > > > * Dan
>> > > >
>> > > > Slight preference toward 2.0.0, but fine with 1.0.0:
>> > > > * Davor
>> > > >
>> > > > Strongly for 1.0.0: none.
>> > > >
>> > > > Slight preference toward 1.0.0, but fine with 2.0.0:
>> > > > * Amit
>> > > > * Jesse
>> > > > * JB
>> > > > * Ted
>> > > >
>> > > > Any additional opinions?
>> > > >
>> > > > Thanks!
>> > > >
>> > > > Davor
>> > > >
>> > > > On Wed, Mar 8, 2017 at 12:58 PM, Amit Sela 
>> > wrote:
>> > > >
>> > > > > If we were to go with a 2.0 release, we would have to be very clear
>> > on
>> > > > > maturity of different modules; for example python SDK is not as
>> > mature
>> > > as
>> > > > > Java SDK, some runners support streaming better than others, some
>> run
>> > > on
>> > > > > YARN better than others, etc.
>> > > > >
>> > > > > My only reservation here is that the Apache community usually
>> expects
>> > > > > version 2.0 to be a mature products, so I'm OK as long as we do
>> some
>> > > > > "maturity-analysis" and document properly.
>> > > > >
>> > > > > On Tue, Mar 7, 2017 at 4:48 AM Ted Yu  wrote:
>> > > > >
>> > > > > > If we end up with version 2.0, more effort (trying out more use
>> > > > scenarios
>> > > > > > e.g.) should go into release process to make sure what is
>> released
>> > is
>> > > > > > indeed stable.
>> > > > > >
>> > > > > > Normally people would have higher expectation on 2.0 release
>> > compared
>> > > > to
>> > > > > > 1.0 release.
>> > > > > >
>> > > > > > On Mon, Mar 6, 2017 at 6:34 PM, Davor Bonaci 
>> > > wrote:
>> > > > > >
>> > > > > > > It sounds like we'll end up with two camps on this topic. This
>> > > issue
>> > > > is
>> > > > > > > probably best resolved with a vote, but I'll try to rephrase
>> the
>> > > > > question
>> > > > > > > once to see whether a consensus is possible.
>> > > > > > >
>> > > > > > > Instead of asking which option is better, does anyone think the
>> > > > project
>> > > > > > > would be negatively impacted if we were to decide on, in your
>> > > > opinion,
>> > > > > > the
>> > > > > > > less desirable variant? If so, can you comment on the negative
>> > > impact
>> > > > > of
>> > > > > > > the less desirable alternative please?
>> > > > > > >
>> > > > > > > (I understand this may be pushing it a bit, but I think a
>> > possible
>> > > > > > > consensus on this is worth it. Personally, I'll stay away from
>> > > > weighing
>> > > > > > in
>> > > > > > > on this topic.)
>> > > > > > >
>> > > > > > > On Thu, Mar 2, 2017 at 2:57 AM, Aljoscha Krettek <
>> > > > aljos...@apache.org>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > I prefer 2.0.0 for the first stable release. It totally makes
>> > > sense
>> > > > > for
>> > > > > > > > people coming from Dataflow 1.x and I can already envision
>> the
>> > > > > > confusion
>> > > > > > > > between Beam 1.5 and Dataflow 1.5.
>> > > > > > > >
>> > > > > > > > On Thu, 2 Mar 2017 at 07:42 Jean-Baptiste Onofré <
>> > > j...@nanthrax.net>
>> > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi Davor,
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > For a Beam community perspective, 1.0.0 would make more
>> > sense.
>> > > We
>> > > > > > have
>> > > > > > > a
>> > > > > > > > > fair number of people starting with Beam (without knowing
>> > > > > Dataflow).
>> > > > > > > > >
>> > > > > > > > > However, as Dataflow SDK (origins of Beam) was in 1.0.0, in
>> > > order
>> > > > > to
>> > > > > > > > > avoid confusion with users coming to Beam from Dataflow,
>> > 2.0.0
>> > > > > could
>> > > > > > > > help.
>> > > > > > > > >
>> > > > > > > > > I have a preference to 1.0.0 anyway, but I would understand
>> > > > > starting
>> > > > > > > > > from 2.0.0.
>> > > > > > > > >
>> > > > > > > > > Regards
>> > > > > > > > > JB
>> > > > > > > > >
>> > > > > > > > > On 03/01/2017 07:56 PM, Davor Bonaci wrote:
>> > > > > > > > > > The first stable release is our next major project-wide
>> > goal;
>> > > > see
>> > > > > > > > > > discussion in [1]. I've been referring to it as "the
>> first
>> > > > stable
>> > > > > > > > > release"
>> > > > > > > > > > for a long time, not "1.0.0" o

Re: Congratulations Davor!

2017-05-04 Thread Ismaël Mejía
Congratulations Davor!
Your membership is really deserved. You really got the Apache spirit!

On Thu, May 4, 2017 at 5:02 PM, Thomas Groh  wrote:
> Congratulations!
>
> On Thu, May 4, 2017 at 7:56 AM, Thomas Weise  wrote:
>
>> Congrats!
>>
>>
>> On Thu, May 4, 2017 at 7:53 AM, Sourabh Bajaj <
>> sourabhba...@google.com.invalid> wrote:
>>
>> > Congrats!!
>> > On Thu, May 4, 2017 at 7:48 AM Mingmin Xu  wrote:
>> >
>> > > Congratulations @Davor!
>> > >
>> > >
>> > > > On May 4, 2017, at 7:08 AM, Amit Sela  wrote:
>> > > >
>> > > > Congratulations Davor!
>> > > >
>> > > >> On Thu, May 4, 2017, 10:02 JingsongLee 
>> > wrote:
>> > > >>
>> > > >> Congratulations!
>> > > >> --
>> > > >> From:Jesse Anderson 
>> > > >> Time:2017 May 4 (Thu) 21:36
>> > > >> To:dev 
>> > > >> Subject:Re: Congratulations Davor!
>> > > >> Congrats!
>> > > >>
>> > > >>> On Thu, May 4, 2017, 6:20 AM Aljoscha Krettek > >
>> > > wrote:
>> > > >>>
>> > > >>> Congrats! :-)
>> > >  On 4. May 2017, at 14:34, Kenneth Knowles > >
>> > > >>> wrote:
>> > > 
>> > >  Awesome!
>> > > 
>> > > > On Thu, May 4, 2017 at 1:19 AM, Ted Yu 
>> > wrote:
>> > > >
>> > > > Congratulations, Davor!
>> > > >
>> > > > On Thu, May 4, 2017 at 12:45 AM, Aviem Zur > > > >>> wrote:
>> > > >
>> > > >> Congrats Davor! :)
>> > > >>
>> > > >> On Thu, May 4, 2017 at 10:42 AM Jean-Baptiste Onofré <
>> > > >> j...@nanthrax.net>
>> > > >> wrote:
>> > > >>
>> > > >>> Congrats ! Well deserved ;)
>> > > >>>
>> > > >>> Regards
>> > > >>> JB
>> > > >>>
>> > >  On 05/04/2017 09:30 AM, Jason Kuster wrote:
>> > >  Hi all,
>> > > 
>> > >  The ASF has just published a blog post[1] welcoming new
>> members
>> > of
>> > > > the
>> > >  Apache Software Foundation, and our own Davor Bonaci is among
>> > > them!
>> > >  Congratulations and thank you to Davor for all of your work
>> for
>> > > the
>> > > >> Beam
>> > >  community, and the ASF at large. Well deserved.
>> > > 
>> > >  Best,
>> > > 
>> > >  Jason
>> > > 
>> > >  [1] https://blogs.apache.org/foundation/entry/the-apache-software-foundation-welcomes
>> > > 
>> > >  P.S. I dug through the list to make sure I wasn't missing any
>> > > other
>> > > >> Beam
>> > >  community members; if I have, my sincerest apologies and
>> please
>> > > >> recognize
>> > >  them on this or a new thread.
>> > > 
>> > > >>>
>> > > >>> --
>> > > >>> Jean-Baptiste Onofré
>> > > >>> jbono...@apache.org
>> > > >>> http://blog.nanthrax.net
>> > > >>> Talend - http://www.talend.com
>> > > >>>
>> > > >>
>> > > >
>> > > >>>
>> > > >>> --
>> > > >> Thanks,
>> > > >>
>> > > >> Jesse
>> > > >>
>> > > >>
>> > >
>> >
>>


Re: [PROPOSAL] HiveIO - updated link to document

2017-04-25 Thread Ismaël Mejía
Hello,

I created the HiveIO JIRA and followed the initial discussions about
the best approach for HiveIO, so first I want to suggest that you read
the previous thread(s) on the mailing list.

https://www.mail-archive.com/dev@beam.incubator.apache.org/msg02313.html

The main idea I concluded from that thread is that a really valuable
part of accessing Hive for Beam is to access the records exposed via
the catalog of the data using HCatalog. This approach is way more
interesting because Beam can benefit from the multiple-runner
execution to process the data exposed by Hive in all the different
runners. This is not the case if we invoke HiveQL (or SQL) queries
over Hive via MapReduce. Note also that you can do this today in Beam
by using JdbcIO + the specific Hive JDBC configuration, as sketched
below.
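As a rough illustration of that JDBC route (an untested sketch: the
host/port, table and query are made up, and "pipeline" is an assumed
Pipeline object; the HiveDriver class and the JdbcIO calls are the
standard ones):

    import java.sql.ResultSet;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.values.PCollection;

    // The query executes inside Hive itself (e.g. as MapReduce); Beam
    // only receives the result rows through HiveServer2.
    PCollection<String> names = pipeline.apply(JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.apache.hive.jdbc.HiveDriver",
            "jdbc:hive2://hive-server:10000/default"))
        .withQuery("SELECT name FROM my_table")
        .withRowMapper(new JdbcIO.RowMapper<String>() {
          @Override
          public String mapRow(ResultSet resultSet) throws Exception {
            return resultSet.getString(1);
          }
        })
        .withCoder(StringUtf8Coder.of()));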

It is probably a good idea that you take a look at how the Flink
connector does this, because it is essentially the same idea that we
want.
https://github.com/apache/flink/tree/master/flink-connectors/flink-hcatalog
(Note: try not to get confused by the names of the classes in Flink
vs Hive, because they are really similar.)

Also take a look at HadoopIO: you can make a simpler implementation
by reusing the code that is there, because HCatInputFormat is a Hadoop
InputFormat class.

So the idea, at least for the read part, would be to build a
PCollection from a Hadoop Configuration + the database name + the
table + optionally a filter, and this PCollection would then be
processed in Beam pipelines (see the sketch below). The advantage of
this approach is that once the Beam SQL DSL is ready it will integrate
perfectly with this IO, so we can have SQL reading from Hive/HCatalog
and processing on whatever runner the users want.
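To make that concrete, the read part could look roughly like this
(a hand-written, untested sketch: HCatInputFormat comes from
hive-hcatalog-core, the metastore URI, database and table names are
placeholders, and the exact key/value classes and coders would likely
need tuning for HCatRecord):

    import org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hive.hcatalog.data.DefaultHCatRecord;
    import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

    Configuration conf = new Configuration();
    conf.set("hive.metastore.uris", "thrift://metastore-host:9083");
    // Point HCatInputFormat at the database and table (an overload
    // also accepts a filter string).
    HCatInputFormat.setInput(conf, "default", "my_table");
    // The configuration keys HadoopInputFormatIO expects:
    conf.setClass("mapreduce.job.inputformat.class",
        HCatInputFormat.class, InputFormat.class);
    conf.setClass("key.class", WritableComparable.class, Object.class);
    conf.setClass("value.class", DefaultHCatRecord.class, Object.class);

    PCollection<KV<WritableComparable, DefaultHCatRecord>> records =
        pipeline.apply(HadoopInputFormatIO
            .<WritableComparable, DefaultHCatRecord>read()
            .withConfiguration(conf));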

Finally, if you agree with this approach, I think it probably makes
sense to rename the IO to HCatalogIO as Flink does.

One extra thing: I have still not looked at the write part, but I
suppose it should be something similar.

Regards,
Ismael.


Re: [DISCUSSION] Encouraging more contributions

2017-04-25 Thread Ismaël Mejía
I think it is important to clarify that the developer documentation
discussed in this thread is of two kinds:

6.1. Documents with proposals and new designs, those covered by the
Beam Improvement Proposal (BEAM-566), which we need to gather under a
single index (I remember there was a Google Drive directory for this,
but I am not sure it is still valid; in any case the website is
probably a better place for this). Is there any progress on this?

6.2. Documentation about how things work, so new developers can get
into developing features/fixes for the project. These are the kind
that Kenneth/Etienne mention and include Stephen’s IO guide, but they
could definitely be expanded to include things like how the different
runner translations work, or some details on triggers/materialization
of panes/windows from the SDK point of view. However, the hard part of
these documents is that they must be maintained, i.e. updated when the
code evolves, so they don’t get outdated, as JB mentions.

On Tue, Apr 25, 2017 at 10:47 AM, Wesley Tanaka
 wrote:
> These are the ones I've come across so far, are there others?
>
> * Dynamic DoFn https://s.apache.org/a-new-dofn
>
> ** Splittable DoFn (Obsoletes Source API) http://s.apache.org/splittable-do-fn
>
> ** State and Timers for DoFn: https://s.apache.org/beam-state
>
>
> * Lateness https://s.apache.org/beam-lateness
>
>
> * Metrics API http://s.apache.org/beam-metrics-api
>
> ** I/O Metrics https://s.apache.org/standard-io-metrics
>
>
> * Runner API http://s.apache.org/beam-runner-api
>
> ** https://s.apache.org/beam-runner-composites
>
> ** https://s.apache.org/beam-side-inputs-1-pager
>
>
> * Fn API http://s.apache.org/beam-fn-api
>
> ---
> Wesley Tanaka
> https://wtanaka.com/
>
>
> On Monday, April 24, 2017, 2:45:45 PM HST, Sourabh Bajaj 
>  wrote:
> For 6. I think having them in one page on the website where we can find the
> design docs more easily would be great.
>
> 7. For low-hanging-fruit, one thing I really liked from some Mozilla
> projects was assigning a mentor on the ticket. Someone you can reach out to
> if you have questions. I think this makes the entry barrier really low for
> first time contributors who might feel intimidated asking questions
> completely in public.
>
> On Mon, Apr 24, 2017 at 10:06 AM Kenneth Knowles 
> wrote:
>
>> I like the subject Etienne has brought up, and will give it a number in
>> this list :-)
>>
>> 6. Have more technical reference docs (not just workspace set up) for
>> contributors.
>>
>> I think this overlaps a lot with a prior discussion about where to collect
>> design proposals [1]. Design docs used to be just dropped into a public
>> folder, but that got disorganized. And that thread was about work in
>> progress, so JIRA was a good place for details after a dev@ thread agrees
>> on a proposal. At this point, the designs are pretty solid conceptually or
>> even implemented and we could start to build out deeper technical bits on
>> the web site, or at least some place that people can find it. We do have
>> the Testing Guide and the PTransform Style Guide and somewhere near there
>> we could have deeper references. I think we need a broader vision for the
>> "table of contents" here.
>>
>> For my docs (triggers, lateness, runner API, side inputs, state, coders) I
>> haven't had time, but I do intend to both translate from GDoc to some other
>> format and also rewrite versions for users where appropriate. Probably this
>> will mean coming up with that table of contents.
>>
>> Kenn
>>
>> [1]
>>
>> https://lists.apache.org/thread.html/%3c6bc60c88-cf91-4fff-eae6-fea6ee06f...@nanthrax.net%3E
>>
>>
>> On Mon, Apr 24, 2017 at 9:33 AM, Neelesh Salian 
>> wrote:
>>
>> > Agreed. I have some old JIRAs that I am cleaning up.
>> >
>> > Thank you for bringing this up.
>> >
>> > On Mon, Apr 24, 2017 at 9:29 AM, Jean-Baptiste Onofré 
>> > wrote:
>> >
>> > > Same also for Slack, github comments, etc.
>> > >
>> > > From a Apache perspective, it should happen on the mailing list,
>> > > eventually referencing a central wiki/faq/whatever.
>> > >
>> > > Regards
>> > > JB
>> > >
>> > >
>> > > On 04/24/2017 06:23 PM, Mingmin Xu wrote:
>> > >
>> > >> many design documents are mixed in maillist, jira comments, it would
>> be
>> > a
>> > >> big help to put them in a centralized list. Also I would expect more
>> > >> wiki/blogs to provide in-depth analysis, like the translation from
>> > >> pipeline
>> > >> to runner specified topology, window/trigger implementation. Without
>> > these
>> > >> knowledge, it's hard to touch the core concepts.
>> > >>
>> > >> On Mon, Apr 24, 2017 at 6:03 AM, Jean-Baptiste Onofré <
>> j...@nanthrax.net>
>> > >> wrote:
>> > >>
>> > >> Got it. By experience on other Apache projects, it's really hard to
>> > >>> maintain ;)
>> > >>>
>> > >>> Regards
>> > >>> JB
>> > >>>
>> > >>>
>> > >>> On 04/24/2017 02:56 PM, Etienne Chauchot wrote:
>> > >>>
>> > >>> Hi JB,
>> > 
>> >  I was proposing a FAQ (or another form), not something about

Re: [DISCUSSION] Encouraging more contributions

2017-04-24 Thread Ismaël Mejía
+1 Great idea Aviem, thanks for bringing this subject to the mailing list.

I agree in particular with the part about freeing JIRAs; I think we
shouldn’t keep JIRAs assigned when we don’t expect to solve them in
the next few weeks (the exception being long-running features).

I would add two more issues.

4. We need to react to and review code from new contributors faster,
and help them as much as we can.

I know that this one implies extra work but I have seen many times
people asking for reviews days after they create a PR and even worse,
people who have not been able to merge their changes because they were
dealing with a long code review and then a different PR already
included changes that fixed the same issue.

5. We should try to keep the number of open pull requests low.

Our average number of open pull requests is continuously increasing
(currently around 70). Some PRs are under active discussion, but some
have clearly stagnated; maybe we should set a deadline: if no
discussion or improvement has happened in the last month we close
them, and if there is still interest they can be re-opened.

The ‘good news’ is that we have 350 unassigned unresolved issues that
anyone can take; this is a good improvement, but I agree that we can
do better.

Ismaël


On Sun, Apr 23, 2017 at 6:32 AM, Jean-Baptiste Onofré  wrote:
> Hi,
>
> as we already discussed about that, +1.
>
> I would also propose to not assign new Jira automatically: now, the Jira is
> automatically assigned to the Jira component leader.
>
> Regards
> JB
>
>
> On 04/22/2017 04:31 PM, Aviem Zur wrote:
>>
>> Hi all,
>>
>> I wanted to start a discussion about actions we can take to encourage more
>> contributions to the project.
>>
>> A few points I've been thinking of:
>>
>> 1. Have people unassign themselves from issues they're not actively
>> working
>> on.
>> 2. Have the community engage more in triage, improving tickets
>> descriptions
>> and raising concerns.
>> 3. Clean house - apply (2) to currently open issues (over 800). Perhaps
>> some can be closed.
>>
>> Thoughts? Ideas?
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: Pipeline termination in the unified Beam model

2017-04-18 Thread Ismaël Mejía
+1 Having a unified termination semantics for all runners is super important.

Stas or Aviem, is it feasible to do this for the Spark runner, or is
the timeout due to a technical limitation of Spark?

Thomas Weise, Aljoscha, anything to say on this?

Aljoscha, what is the current status for the Flink runner? Is there
any progress towards BEAM-593?
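
For the record, what I would like to be able to write in a portable
way is simply this (a minimal sketch, assuming a constructed Pipeline
named "pipeline"; the 10-minute timeout value is arbitrary):

    import org.apache.beam.sdk.PipelineResult;
    import org.joda.time.Duration;

    PipelineResult result = pipeline.run();
    // With unified termination semantics this should block until the
    // output watermarks of all PCollections reach +infinity, on any
    // runner, in batch or streaming mode:
    PipelineResult.State state = result.waitUntilFinish();
    // Today some runners need the timeout variant as a workaround,
    // e.g. the Spark runner in streaming mode as discussed below:
    // result.waitUntilFinish(Duration.standardMinutes(10));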


On Tue, Apr 18, 2017 at 5:05 PM, Stas Levin  wrote:
> Ted, the timeout is needed mostly for testing purposes.
> AFAIK there is no easy way to express the fact a source is "done" in a
> Spark native streaming application.
> Moreover, the Spark streaming "native" flow can either "awaitTermination()"
> or "awaitTerminationOrTimeout(...)". If you "awaitTermination" then you're
> blocked until the execution is either stopped or has failed, so if you wish
> to stop the app sooner, say after a certain period of time,
> "awaitTerminationOrTimeout(...)" may be the way to go.
>
> Using the unified approach discussed in this thread, when a source is
> "done" (i.e. the watermark is +Infinity) the app (e.g. runner) would
> gracefully stop.
>
>
>
> On Tue, Apr 18, 2017 at 3:19 PM Ted Yu  wrote:
>
>> Why is the timeout needed for Spark ?
>>
>> Thanks
>>
>> > On Apr 18, 2017, at 3:05 AM, Etienne Chauchot 
>> wrote:
>> >
>> > +1 on "runners really terminate in a timely manner to easily
>> programmatically orchestrate Beam pipelines in a portable way, you do need
>> to know whether
>> > the pipeline will finish without thinking about the specific runner and
>> its options"
>> >
>> > As an example, in Nexmark, we have streaming mode tests, and for the
>> benchmark, we need all the queries to behave the same between runners
>> towards termination.
>> >
>> > For now, to have the consistent behavior, in this mode we need to set a
>> timeout (a bit random and flaky) on waitUntilFinish() for spark but this
>> timeout is not needed for direct runner.
>> >
>> > Etienne
>> >
>> >> Le 02/03/2017 à 19:27, Kenneth Knowles a écrit :
>> >> Isn't this already the case? I think semantically it is an unavoidable
>> >> conclusion, so certainly +1 to that.
>> >>
>> >> The DirectRunner and TestDataflowRunner both have this behavior already.
>> >> I've always considered that a streaming job running forever is just
>> [very]
>> >> suboptimal shutdown latency :-)
>> >>
>> >> Some bits of the discussion on the ticket seem to surround whether or
>> how
>> >> to communicate this property in a generic way. Since a runner owns its
>> >> PipelineResult it doesn't seem necessary.
>> >>
>> >> So is the bottom line just that you want to more strongly insist that
>> >> runners really terminate in a timely manner? I'm +1 to that, too, for
>> >> basically the reason Stas gives: In order to easily programmatically
>> >> orchestrate Beam pipelines in a portable way, you do need to know
>> whether
>> >> the pipeline will finish without thinking about the specific runner and
>> its
>> >> options (as with our RunnableOnService tests).
>> >>
>> >> Kenn
>> >>
>> >> On Thu, Mar 2, 2017 at 9:09 AM, Dan Halperin
>> 
>> >> wrote:
>> >>
>> >>> Note that even "unbounded pipeline in a streaming
>> runner".waitUntilFinish()
>> >>> can return, e.g., if you cancel it or terminate it. It's totally
>> reasonable
>> >>> for users to want to understand and handle these cases.
>> >>>
>> >>> +1
>> >>>
>> >>> Dan
>> >>>
>> >>> On Thu, Mar 2, 2017 at 2:53 AM, Jean-Baptiste Onofré 
>> >>> wrote:
>> >>>
>>  +1
>> 
>>  Good idea !!
>> 
>>  Regards
>>  JB
>> 
>> 
>> > On 03/02/2017 02:54 AM, Eugene Kirpichov wrote:
>> >
>> > Raising this onto the mailing list from
>> > https://issues.apache.org/jira/browse/BEAM-849
>> >
>> > The issue came up: what does it mean for a pipeline to finish, in the
>> >>> Beam
>> > model?
>> >
>> > Note that I am deliberately not talking about "batch" and "streaming"
>> > pipelines, because this distinction does not exist in the model.
>> Several
>> > runners have batch/streaming *modes*, which implement the same
>> semantics
>> > (potentially different subsets: in batch mode typically a runner will
>> > reject pipelines that have at least one unbounded PCollection) but
>> in an
>> > operationally different way. However we should define pipeline
>> >>> termination
>> > at the level of the unified model, and then make sure that all
>> runners
>> >>> in
>> > all modes implement that properly.
>> >
>> > One natural way is to say "a pipeline terminates when the output
>> > watermarks
>> > of all of its PCollection's progress to +infinity". (Note: this can
>> be
>> > generalized, I guess, to having partial executions of a pipeline: if
>> > you're
>> > interested in the full contents of only some collections, then you
>> wait
>> > until only the watermarks of those collections progress to infinity)
>> >
>> > A typical "batch" runner mode does not implement watermarks - we can
>> >>> thi

Re: [DISCUSSION] PAssert success/failure count validation for all runners

2017-04-10 Thread Ismaël Mejía
I have the impression this conversation went into a different
sub-discussion, ignoring the core subject, which is whether it makes
sense to implement PAssert as we are doing it right now (1), or in a
runner-agnostic way (2).

Big +1 for (2).

I also think this is critical enough to be part of the First Stable
Release (FSR) blockers. Again, I have the impression that removing
aggregators is a top priority, but we cannot remove them until we
cover the exact same use cases with metrics and have an acceptable
level of support from the runners.
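
For runners that support metrics, the runner-agnostic check of (2)
could be as simple as this sketch (the "PAssert"/"successes" counter
name is hypothetical, and "result" is the PipelineResult of the test
pipeline):

    import org.apache.beam.sdk.metrics.MetricNameFilter;
    import org.apache.beam.sdk.metrics.MetricQueryResults;
    import org.apache.beam.sdk.metrics.MetricResult;
    import org.apache.beam.sdk.metrics.MetricsFilter;

    MetricQueryResults metrics = result.metrics().queryMetrics(
        MetricsFilter.builder()
            .addNameFilter(MetricNameFilter.named("PAssert", "successes"))
            .build());
    long successes = 0;
    for (MetricResult<Long> counter : metrics.counters()) {
      successes += counter.attempted();
    }
    // Fail the test if not every expected assertion actually ran:
    int expectedAssertions = 3; // known count for the test pipeline
    if (successes != expectedAssertions) {
      throw new AssertionError("Expected " + expectedAssertions
          + " successful assertions, saw " + successes);
    }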

Regards,
Ismaël

On Sat, Apr 8, 2017 at 4:00 PM, Aljoscha Krettek  wrote:
> @kenn What’s the design you’re mentioning? (I probably missed it because I’m 
> not completely up to data on the Jiras and ML because of Flink Forward 
> preparations)
>
>> On 7. Apr 2017, at 12:42, Kenneth Knowles  wrote:
>>
>> We also have a design that improves the signal even without metrics, so I'm
>> pretty happy with this.
>>
>> On Fri, Apr 7, 2017 at 12:12 PM, Lukasz Cwik 
>> wrote:
>>
>>> I like the usage of metrics since it doesn't depend on external resources.
>>> I believe there could be some small amount of code shared between runners
>>> for the PAssert metric verification.
>>>
>>> I would say that PAssert by itself and PAssert with metrics are two levels
>>> of testing available. For runners that don't support metrics than PAssert
>>> gives a signal (albeit weaker one) and ones that do support metrics will
>>> have a stronger signal for execution correctness.
>>>
>>> On Fri, Apr 7, 2017 at 11:59 AM, Aviem Zur  wrote:
>>>
 Currently, PAssert assertions may not happen and tests will pass while
 silently hiding issues.

 Up until now, several runners have implemented an assertion that the
>>> number
 of expected successful assertions have actually happened, and that no
 failed assertions have happened. (runners which check this are Dataflow
 runner and Spark runner).

 This has been valuable in the past to find bugs which were hidden by
 passing tests.

 The work to repeat this in https://issues.apache.org/
>>> jira/browse/BEAM-1726
 has
 surfaced bugs in the Flink runner that were also hidden by passing tests.
 However, with the removal of aggregators in
 https://issues.apache.org/jira/browse/BEAM-1148 this ticket will be
>>> harder
 to implement, since Flink runner does not support metrics.

 I believe that validating that runners do in fact support Beam model is a
 blocker for first stable release. (BEAM-1726 was also marked as a blocker
 for Flink runner).

 I think we have one of 2 choices here:
 1. Keep implementing this for each runner separately.
 2. Implement this in a runner agnostic way (For runners which support
 metrics - use metrics, for those that do not use a fallback
>>> implementation,
 perhaps using files or some other method). This should be covered by the
 following ticket: https://issues.apache.org/jira/browse/BEAM-1763

 Thoughts?

>>>
>


Re: Update of Pei in Alibaba

2017-04-07 Thread Ismaël Mejía
Hello Basti,

Thanks a lot for answering. I imagined that all the improvements of
both JStorm and Heron wouldn’t translate perfectly, but it is still a
worthwhile goal to try to have the ‘common’ Storm parts isolated so
they can be shared with the other runners.

Really interesting, I wish you guys the best for this work, and
welcome to the community.

Ismaël

On Thu, Apr 6, 2017 at 11:51 AM, 刘键(Basti Liu)  wrote:
> Hi Ismaël,
>
>
>
> Sorry for the late response. I am the developer of JStorm, and currently work 
> with Pei on JStorm runner.
>
> We have gone through the current Storm runner
> (https://github.com/apache/storm/commits/beam-runner). But it is a very
> draft version; several PTransforms are not supported or not fully
> supported, especially window, trigger and state.
>
>
>
> Generally, JStorm is compatible with the basic API of Storm, while providing 
> improvements or new features on topology master, state manager, exactly once, 
> message transfer mechanism, stage-by-stage backpressure flow control, 
> metrics, etc.
>
> For the basic “at most once” job, JStorm runner can be reused on Storm. But 
> for “window”, “state” and “exactly once” job, unfortunately, JStorm runner 
> can’t be reused. Anyway, we will figure out if the propagation is possible 
> for Storm in the future.
>
>
>
> Regards
>
> Jian Liu(Basti)
>
>
>
> -Original Message-
> From: Ismaël Mejía [mailto:ieme...@gmail.com]
> Sent: Sunday, April 02, 2017 3:18 AM
> To: dev@beam.apache.org
> Subject: Re: Update of Pei in Alibaba
>
>
>
> Excellent news,
>
>
>
> Pei it would be great to have a new runner. I am curious about how different 
> are the implementations of storm among them considering that there are 
> already three 'versions': Storm, Jstorm and Heron, I wonder if one runner 
> could traduce to an API that would cover all of them (of course maybe I am 
> super naive I really don't know much about JStorm or Heron and how much they 
> differ from the original storm).
>
>
>
> Jingson, I am super curious about this Galaxy project, it is there any public 
> information about it? is this related to the previous blink ali baba project? 
> I already looked a bit but searching "Ali baba galaxy"
>
> is a recipe for a myriad of telephone sellers :)
>
>
>
> Nice to see that you are going to keep contributing to the project Pei.
>
>
>
> Regards,
>
> Ismaël
>
>
>
>
>
>
>
> On Sat, Apr 1, 2017 at 4:59 PM, Tibor Kiss <tibor.k...@gmail.com> wrote:
>
>> Exciting times, looking forward to try it out!
>
>>
>
>> I shall mention that Taylor Goetz also started creating a BEAM runner
>
>> using Storm.
>
>> His work is available in the storm repo:
>
>> https://github.com/apache/storm/commits/beam-runner
>
>> Maybe it's worth while to take a peek and see if something is reusable
>
>> from there.
>
>>
>
>> - Tibor
>
>>
>
>> On Sat, Apr 1, 2017 at 4:37 AM, JingsongLee <lzljs3620...@aliyun.com> wrote:
>
>>
>
>>> Wow, very glad to see JStorm also started building BeamRunner.
>
>>> I am working in Galaxy (Another streaming process engine in Alibaba).
>
>>> I hope that we can work together to promote the use of Apache Beam in
>
>>> Alibaba and China.
>
>>>
>
>>> best,
>
>>> JingsongLee
>
>>> --
>>> From: Pei HE <pei...@gmail.com>
>>> Time: 2017 Apr 1 (Sat) 09:24
>>> To: dev <dev@beam.apache.org>
>>> 09:24To:dev <
>
>>>  <mailto:dev@beam.apache.org%3eSubject:Update> 
>>> dev@beam.apache.org>Subject:Update of Pei in Alibaba Hi all, On
>
>>> February, I moved from Seattle to Hangzhou, China, and joined Alibaba.
>
>>> And, I want to give an update of things in here.
>
>>>
>
>>> A colleague and I have been working on JStorm
>
>>> <https://github.com/alibaba/jstorm> runner. We have a prototype that
>
>>> works with WordCount and PAssert. (I am going to start a separate
>
>>> email thread about how to get it reviewed and merged in Apache Beam.)
>
>>> We also have Spark clusters, and are very interested in using Spark
>
>>> runner.
>
>>>
>
>>> Last Saturday, I went to China Hadoop Summit, and gave a talk about
>
>>> Apache Beam model. While many companies gave talks of their in-house
>
>>> solutions for unified batch&streaming and unified SQL, there are also
>
>>> lots of interests and enthusiasts of Beam.
>
>>>
>
>>> Looking forward to chat more.
>
>>> --
>
>>> Pei
>
>>>
>
>>>
>
>>
>
>>
>
>> --
>
>> Kiss Tibor
>
>>
>
>> +36 70 275 9863
>
>> tibor.k...@gmail.com
>


Re: Update of Pei in Alibaba

2017-04-03 Thread Ismaël Mejía
Thanks Jingsong for answering, and for the StreamScope reference; I
am going to check the paper. The concept of non-global checkpointing
sounds super interesting.

It is nice that you guys are also trying to promote the move to a unified model.

Regards,
Ismaël


On Sun, Apr 2, 2017 at 3:40 PM, JingsongLee  wrote:
> Hi Ismaël,
> We have a streaming computing platform in Alibaba.
> Galaxy is an internal system, so you can't find some information from Google.
> It is becoming more like StreamScope (you can search it for the paper).
> Non-global-checkpoint makes failure recovery quickly and makes streaming
> applications easier to develop and debug.
>
>
> But as far as I know, each engine has its own tradeoffs, has its own good 
> cases.
> So we also developed an open source platform, which has Spark, Flink and so 
> on.
> We hope we can use Apache Beam to unify the user program model.  This will 
> make
>  the user learning costs are low, the application migration costs are low.
> (Not only from batch to streaming, but also conducive to migration from the
> streaming to the streaming.)
>
>
> --
> From: Ismaël Mejía
> Time: 2017 Apr 2 (Sun) 03:18
> To: dev
> Subject: Re: Update of Pei in Alibaba
> Excellent news,
>
> Pei it would be great to have a new runner. I am curious about how
> different are the implementations of storm among them considering that
> there are already three 'versions': Storm, Jstorm and Heron, I wonder
> if one runner could traduce to an API that would cover all of them (of
> course maybe I am super naive I really don't know much about JStorm or
> Heron and how much they differ from the original storm).
>
> Jingson, I am super curious about this Galaxy project, it is there any
> public information about it? is this related to the previous blink ali
> baba project? I already looked a bit but searching "Ali baba galaxy"
> is a recipe for a myriad of telephone sellers :)
>
> Nice to see that you are going to keep contributing to the project Pei.
>
> Regards,
> Ismaël
>
>
>
> On Sat, Apr 1, 2017 at 4:59 PM, Tibor Kiss  wrote:
>> Exciting times, looking forward to try it out!
>>
>> I shall mention that Taylor Goetz also started creating a BEAM runner using
>> Storm.
>> His work is available in the storm repo:
>> https://github.com/apache/storm/commits/beam-runner
>> Maybe it's worth while to take a peek and see if something is reusable from
>> there.
>>
>> - Tibor
>>
>> On Sat, Apr 1, 2017 at 4:37 AM, JingsongLee  wrote:
>>
>>> Wow, very glad to see JStorm also started building BeamRunner.
>>> I am working in Galaxy (Another streaming process engine in Alibaba).
>>> I hope that we can work together to promote the use of Apache Beam
>>> in Alibaba and China.
>>>
>>> best,
>>> JingsongLee
>>> --
>>> From: Pei HE
>>> Time: 2017 Apr 1 (Sat) 09:24
>>> To: dev <dev@beam.apache.org>
>>> Subject: Update of Pei in Alibaba
>>> Hi all,
>>> On February, I moved from Seattle to Hangzhou, China, and joined Alibaba.
>>> And, I want to give an update of things in here.
>>>
>>> A colleague and I have been working on JStorm
>>>  runner. We have a prototype that works
>>> with WordCount and PAssert. (I am going to start a separate email thread
>>> about how to get it reviewed and merged in Apache Beam.)
>>> We also have Spark clusters, and are very interested in
>>> using Spark runner.
>>>
>>> Last Saturday, I went to China Hadoop Summit, and gave a talk about Apache
>>> Beam model. While many companies gave talks of their
>>> in-house solutions for
>>> unified batch&streaming and unified SQL, there are also lots of interests
>>> and enthusiasts of Beam.
>>>
>>> Looking forward to chat more.
>>> --
>>> Pei
>>>
>>>
>>
>>
>> --
>> Kiss Tibor
>>
>> +36 70 275 9863
>> tibor.k...@gmail.com


Re: [PROPOSAL] ORC support

2017-04-01 Thread Ismaël Mejía
+1

From my previous work experience, ORC in certain cases performs
better than Parquet and really deserves to be supported.


On Sat, Apr 1, 2017 at 5:58 PM, Ted Yu  wrote:
> +1
>
>> On Apr 1, 2017, at 8:31 AM, Tibor Kiss  wrote:
>>
>> Hello,
>>
>> Recently the Optimized Row Columnar (ORC) file format was spin off from Hive
>> and became a top level Apache Project: https://orc.apache.org/
>>
>> It is similar to Parquet in a sense that it uses column major format but
>> ORC has
>> a more elaborate type system and stores basic statistics about each row.
>>
>> I'd be interested extending Beam with ORC support if others find it helpful
>> too.
>>
>> What do you think?
>>
>> - Tibor


Re: Update of Pei in Alibaba

2017-04-01 Thread Ismaël Mejía
Excellent news,

Pei, it would be great to have a new runner. I am curious about how
different the Storm implementations are among them, considering that
there are already three 'versions': Storm, JStorm and Heron. I wonder
if one runner could translate to an API that would cover all of them
(of course, maybe I am super naive; I really don't know much about
JStorm or Heron and how much they differ from the original Storm).

Jingsong, I am super curious about this Galaxy project; is there any
public information about it? Is this related to the previous Blink
Alibaba project? I already looked a bit, but searching "Ali baba
galaxy" is a recipe for a myriad of telephone sellers :)

Nice to see that you are going to keep contributing to the project Pei.

Regards,
Ismaël



On Sat, Apr 1, 2017 at 4:59 PM, Tibor Kiss  wrote:
> Exciting times, looking forward to try it out!
>
> I shall mention that Taylor Goetz also started creating a BEAM runner using
> Storm.
> His work is available in the storm repo:
> https://github.com/apache/storm/commits/beam-runner
> Maybe it's worth while to take a peek and see if something is reusable from
> there.
>
> - Tibor
>
> On Sat, Apr 1, 2017 at 4:37 AM, JingsongLee  wrote:
>
>> Wow, very glad to see JStorm also started building BeamRunner.
>> I am working in Galaxy (Another streaming process engine in Alibaba).
>> I hope that we can work together to promote the use of Apache Beam
>> in Alibaba and China.
>>
>> best,
>> JingsongLee
>> --
>> From: Pei HE
>> Time: 2017 Apr 1 (Sat) 09:24
>> To: dev <dev@beam.apache.org>
>> Subject: Update of Pei in Alibaba
>> Hi all,
>> On February, I moved from Seattle to Hangzhou, China, and joined Alibaba.
>> And, I want to give an update of things in here.
>>
>> A colleague and I have been working on JStorm
>>  runner. We have a prototype that works
>> with WordCount and PAssert. (I am going to start a separate email thread
>> about how to get it reviewed and merged in Apache Beam.)
>> We also have Spark clusters, and are very interested in
>> using Spark runner.
>>
>> Last Saturday, I went to China Hadoop Summit, and gave a talk about Apache
>> Beam model. While many companies gave talks of their
>> in-house solutions for
>> unified batch&streaming and unified SQL, there are also lots of interests
>> and enthusiasts of Beam.
>>
>> Looking forward to chat more.
>> --
>> Pei
>>
>>
>
>
> --
> Kiss Tibor
>
> +36 70 275 9863
> tibor.k...@gmail.com


Re: First IO IT Running!

2017-03-22 Thread Ismaël Mejía
Excellent news! I am eager to see more IOs/runners being included in
the integration tests, and I will be glad to contribute in any way I
can.

Congratulations for this important milestone.
Ismaël

P.S. I will try to reproduce the Kubernetes setup, so I will
eventually be annoying you with questions.

On Wed, Mar 22, 2017 at 11:28 AM, Aljoscha Krettek  wrote:
> Great news! I can’t wait to also have support for this for the Flink Runner. 
> Which is partially blocked by me or others working on the Flink Runner, I 
> guess… :-(
>> On 22 Mar 2017, at 05:15, Jean-Baptiste Onofré  wrote:
>>
>> Awesome !!! Great news !
>>
>> Thanks guys for that !
>>
>> I started to implement IT in JMS, MQTT, Redis, Cassandra IOs. I keep you 
>> posted.
>>
>> Regards
>> JB
>>
>> On 03/21/2017 11:01 PM, Stephen Sisk wrote:
>>> I'm really excited to see these tests are running!
>>>
>>> These Jdbc tests are testing against a postgres instance - that instance is
>>> running on the kubernetes cluster I've set up for beam IO ITs as discussed
>>> in the "Hosting data stores for IO transform testing" thread[0]. I set up
>>> that postgres instance using the kubernetes scripts for Jdbc[1]. Anyone can
>>> run their own kubernetes cluster and do the same thing for themselves to
>>> run the ITs. (I'd actually to love to hear about that if anyone does it.)
>>>
>>> I'm excited to get a few more ITs using this infrastructure so we can test
>>> it out/smooth out the remaining rough edges in creating ITs. I'm happy to
>>> answer questions about that on the mailing list, but we obviously have to
>>> have the process written down - the Testing IO Transforms in Apache Beam
>>> doc [2] covers how to do this, but is still rough. I'm working on getting
>>> that up on the website and ironing out the rough edges [3], but generally
>>> reading that doc plus checking out how the JdbcIO or ElasticsearchIO tests
>>> work should give you a sense of how to get it working. I'm also thinking we
>>> might want to simplify the way we do data loading, so I don't consider this
>>> process fully stabilized, but I'll port code written according to the
>>> current standards to the new standards if we make changes.
>>>
>>> ElasticsearchIO has all the prerequisites, so I'd like to get them going in
>>> the near future. I know JB has started on this in his RedisIO PR, and the
>>> HadoopInputFormatIO also has ITs & k8 scripts, so there's more in the pipe.
>>> For now, each datastore has to be manually set up, but I'd like to automate
>>> that process - I'll file a JIRA ticket shortly for that.
>>>
>>> Thanks,
>>> Stephen
>>> [0] Hosting data stores for IO transform testing -
>>> https://lists.apache.org/thread.html/9fd3c51cb679706efa4d0df2111a6ac438b851818b639aba644607af@%3Cdev.beam.apache.org%3E
>>> [1] Postgres k8 scripts -
>>> https://github.com/apache/beam/tree/master/sdks/java/io/jdbc/src/test/resources/kubernetes
>>> [2] IO testing guide -
>>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?usp=sharing
>>> [3] Jira for IO guide - https://issues.apache.org/jira/browse/BEAM-1025
>>>
>>> On Tue, Mar 21, 2017 at 2:28 PM Jason Kuster 
>>> 
>>> wrote:
>>>
 Hi all,

 Exciting news! As of yesterday, we have checked in the Jenkins
 configuration for our first continuously running IO Integration Test! You
 can check it out in Jenkins here[1]. We’re also publishing results to a
 database, and we’ve turned up a basic dashboarding system where you can see
 the results here[2]. Caveat: there are only two runs, and we’ll be tweaking
 the underlying system still, so don’t panic that we’re up and to the right
 currently. ;)

 This is the first test running continuously on top of the performance / IO
 testing infrastructure described in this doc[3].  Initial support for Beam
 is now present in PerfKit Benchmarker; given what they had already, it was
 easiest to add support for Dataflow and Java. We need your help to add
 additional support! The doc lists a number of JIRA issues to build out
 support for other systems. I’m happy to work with people to help them
 understand what is necessary for these tasks; just send an email to the
 list if you need help and I’ll help you move forwards.

 Looking forward to it!

 Jason

 [1] https://builds.apache.org/job/beam_PerformanceTests_JDBC/
 [2]
 https://apache-beam-testing.appspot.com/explore?dashboard=5714163003293696
 [3]

 https://docs.google.com/document/d/1PsjGPSN6FuorEEPrKEP3u3m16tyOzph5FnL2DhaRDz0/edit?ts=58a78e73

 --
 ---
 Jason Kuster
 Apache Beam / Google Cloud Dataflow

>>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>


Re: Docker image dependencies

2017-03-22 Thread Ismaël Mejía
You make really good points and I agree 100%: Docker is easier if it
is local, but once we talk about distributed setups all of the options
have their pros/cons. I don’t intend to reopen the discussion, and of
course it would be silly to go back and redo all the work you have
already done.

We already agreed on Kubernetes and that is it. My point in
mentioning docker-compose was more that we need to make the life of IT
test contributors easier, and maybe adding an extra tool is not the
way; but at least we will need better documentation or references to
help developers bootstrap their Kubernetes clusters so they can
contribute and validate the tests on their own.

On Wed, Mar 22, 2017 at 12:14 AM, Stephen Sisk  wrote:
> Hey Ismael,
>
> I definitely agree with you that we want something that developers will
> actually be able to/want to use.
>
> in my experience *all* the container orchestration engines are non-trivial
> to set up. When I started examining solutions for beam hosting, I did
> installs of mesos, kubernetes and docker. Docker is easier in the "run only
> on my local machine" case if devs have it set up, but to do anything
> interesting (ie, interact with machines that aren't already yours), they
> all involve work to get them setup on each machine you want to use[4].
>
> Kubernetes has some options that make it extremely simple to setup - both
> AWS[2] and GCE[3] seem to be straightforward to set up for simple dev
> clusters, with scripts to automate the process (I'm assuming docker has
> similar setups.)
>
> Once kubernetes is set up, it's also a simple yaml file + command to set up
> multiple machines. The kubernetes setup for postgres[5] shows a simple one
> machine example, and the kubernetes setups for HIFIO[6] show multi-machine
> examples.
>
> We've spent a lot of time discussing the various options - when we talked
> about this earlier [1] we decided we would move forward with investigating
> kubernetes, so that's what I used for the IO ITs work I've been doing,
> which we've now gotten working.
>
> Do you feel the advantages of docker are such that we should re-open the
> discussion and potentially re-do the work we've done so far to get k8
> working?
>
> I took a genuine look at docker earlier in the process and it didn't seem
> like it was better than the other options in any dimensions (other than
> "developers usually have it installed already"), and kubernetes/mesos
> seemed to be more stable/have more of the features discussed in [1].
> Perhaps that's changed?
>
> I think we are just starting to use container orchestration engines, and so
> while I don't want to throw away the work we've done so far, I also don't
> want to have to do it later if there are reasons we knew about now. :)
>
> S
>
> [1]
> https://lists.apache.org/thread.html/9fd3c51cb679706efa4d0df2111a6ac438b851818b639aba644607af@%3Cdev.beam.apache.org%3E
>
> [2] k8 AWS - https://kubernetes.io/docs/getting-started-guides/aws/
> [3] k8 GKE - https://cloud.google.com/container-engine/docs/quickstart or
> https://kubernetes.io/docs/getting-started-guides/gce/
> [4] docker swarm on GCE -
> https://rominirani.com/docker-swarm-on-google-compute-engine-364765b400ed#.gzvruzis9
>
> [5] postgres k8 script -
> https://github.com/apache/beam/tree/master/sdks/java/io/jdbc/src/test/resources/kubernetes
>
> [6]
> https://github.com/diptikul/incubator-beam/tree/HIFIO-CS-ES/sdks/java/io/hadoop/jdk1.8-tests/src/test/resources/kubernetes
>
>
> On Mon, Mar 20, 2017 at 3:25 PM Ismaël Mejía  wrote:
>
> I have somehow forgotten this one.
>
>> Basically - I'm trying to keep number of tools at a minimum while still
>> providing good support for the functionality we need. Does docker-compose
>> provide something beyond the functionality that k8 does? I'm not familiar
>> with docker-compose, but looking at
>> https://docs.docker.com/ it doesn't
>> seem to provide anything that k8 doesn't already.
>
> I agree to have the most minimal set of tools, I mentioned
> docker-compose because I consider also its advantages because its
> installation is trivial compared to kubernetes (or even minikube for a
> local install), docker-compose does not have any significant advantage
> over kubernetes apart of been easier to install/use.
>
> But well, better to be consistent and go full with kubernetes, however
> we need to find a way to help IO authors to bootstrap this, because
> from my experience creating a cluster with docker-compose is a yaml
> file + a command, not sure if the basic installation and run of
> kubernetes is that easy.
>
> Ismaël
>
> On Wed, Mar 15, 2017 at 8:09 PM, Stephen Sisk 

Re: Beam spark 2.x runner status

2017-03-22 Thread Ismaël Mejía
Amit, I suppose JB is talking about the RDD based version, so no need
to worry about SparkSession or different incompatible APIs.
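
(For context, the incompatibility in question is mostly at the level
of the entry points; an illustrative sketch, not actual runner code:)

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Spark 1.x style, what the RDD/DStream translation uses today:
    SparkConf conf = new SparkConf().setAppName("beam-on-spark");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    JavaStreamingContext jssc =
        new JavaStreamingContext(jsc, new Duration(500));

    // Spark 2.x style, what a Dataset-based translation would use:
    SparkSession session = SparkSession.builder()
        .appName("beam-on-spark")
        .getOrCreate();

The RDD/DStream code above still compiles and runs on Spark 2, which
is what makes a shared RDD-based translation plausible.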

Remember the idea we are discussing is to have in master both the
spark 1 and spark 2 runners using the RDD based translation. At the
same time we can have a feature branch to evolve the DataSet based
translator (this one will replace the RDD based translator for spark 2
once it is mature).

The advantages have been already discussed as well as the possible
issues so I think we have to see now if JB's idea is feasible and how
hard would be to live with this while the DataSet version evolves.

I think what we are trying to avoid is to have a long-living branch
for a Spark 2 runner based on RDDs, because the maintenance burden
would be even worse. We would have to fight not only with the double
merge of fixes (in case the profile idea does not work), but also with
the continued evolution of Beam, and we would end up in the
long-living-branch mess that other runners have dealt with (e.g. the
Apex runner):

https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce5416c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E

What do you think about this, Amit? Would you be OK to go with it if
JB's profile idea proves to help with the maintenance issues?

Ismaël



On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu  wrote:
> hbase-spark module doesn't use SparkSession. So situation there is simpler
> :-)
>
> On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela  wrote:
>
>> I'm still wondering how we'll do this - it's not just different
>> implementations of the same Class, but a completely different concepts such
>> as using SparkSession in Spark 2 instead of SparkContext/StreamingContext
>> in Spark 1.
>>
>> On Tue, Mar 21, 2017 at 7:25 PM Ted Yu  wrote:
>>
>> > I have done some work over in HBASE-16179 where compatibility modules are
>> > created to isolate changes in Spark 2.x API so that code in hbase-spark
>> > module can be reused.
>> >
>> > FYI
>> >
>>


Re: Docker image dependencies

2017-03-20 Thread Ismaël Mejía
I have somehow forgotten this one.

> Basically - I'm trying to keep number of tools at a minimum while still
> providing good support for the functionality we need. Does docker-compose
> provide something beyond the functionality that k8 does? I'm not familiar
> with docker-compose, but looking at
> https://docs.docker.com/ it doesn't
> seem to provide anything that k8 doesn't already.

I agree to have the most minimal set of tools, I mentioned
docker-compose because I consider also its advantages because its
installation is trivial compared to kubernetes (or even minikube for a
local install), docker-compose does not have any significant advantage
over kubernetes apart of been easier to install/use.

But well, better to be consistent and go full with kubernetes, however
we need to find a way to help IO authors to bootstrap this, because
from my experience creating a cluster with docker-compose is a yaml
file + a command, not sure if the basic installation and run of
kubernetes is that easy.

Ismaël

On Wed, Mar 15, 2017 at 8:09 PM, Stephen Sisk  wrote:
> thanks for the discussion! In general, I agree with the sentiments
> expressed here. I updated
> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.hlirex1vus1a
> to
> reflect this discussion. (The plan is still that I will put that on the
> website.)
>
> Apache Docker Repository - are you talking about
> https://hub.docker.com/u/apache/ ? If not, can you point me at more info? I
> can't seem to find info about this on the publicly visible apache-infra
> mailing lists thatI could find, and the apache infra website doesn't seem
> to mention a docker repository.
>
>
>
>> However the current Beam Elasticsearch IO does not support Elasticsearch
> 5, and elastic does not have an image for version 2, so in this particular 
> case
> following the priority order we should use the official docker image (2)
> for the tests (assuming that both require the same version). Do you agree
> with this ?
>
> Yup, that makes sense to me.
>
>
>
>> How do we deal with IOs that require more than one base image, this is a  
>> common
> scenario for projects that depend on Zookeeper?
>
> Is there a reason not to just run a kubernetes ReplicaController+Service
> for these cases? k8 can easily support having a hostname that pods can rely
> on having the zookeeper instance. It also uses text config - see
> https://github.com/apache/beam/tree/master/sdks/java/io/jdbc/src/test/resources/kubernetes,
> and sets up the connections/nameservice between the hosts - if other tests
> wanted to rely on postgres, it could just connect to host "postgres" and
> postgres is there.
>
> Basically - I'm trying to keep number of tools at a minimum while still
> providing good support for the functionality we need. Does docker-compose
> provide something beyond the functionality that k8 does? I'm not familiar
> with docker-compose, but looking at
> https://docs.docker.com/compose/overview/#compose-documentation it doesn't
> seem to provide anything that k8 doesn't already.
>
>
> S
>
> On Wed, Mar 15, 2017 at 7:10 AM Ismaël Mejía  wrote:
>
> Hi, Thanks for bringing this subject to the mailing list.
>
> +1
> We definitely need a consensus on this, and I agree with your proposal and
> JB’s comments modulo certain clarifications:
>
> I think we shall go in this priority order if the version of the image we
> want is available:
>
> 1. Image provided by the creator of the data source/sink (if they
> officially maintain it). (This is the case of Elasticsearch for example) or
> the Apache projects (if they provide one) as JB mentions.
> 2. Official docker images (because they have security fixes and have
> guaranteed maintenance.
> 3. Non-official docker images or images from other providers that have good
> maintainers e.g. quay.io
>
> It makes sense to use the same image for all the tests. and to use the
> fixed versions supported by the respective IO to avoid possible issues
> during testing between different versions/naming of env variables, etc.
>
> The Elasticsearch case is a 'good' example because it shows all the current
> issues:
>
> We should not use one elasticsearch image (elk) for some tests and a
> different one for the other (the quay one), and if we resolve by priority
> we would take the image provided by the creator (1) for both cases.
> https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html
> However the current Beam Elasticsearch IO does not support Elasticsearch 5,
> and elastic does not have an image for version 2, so in this particular
> case following the priority order we should use 

Re: [ANNOUNCEMENT] New committers, March 2017 edition!

2017-03-20 Thread Ismaël Mejía
Thanks everyone, it feels great to be part of the team.
Congratulations to the other new committers!

-Ismaël

On Mon, Mar 20, 2017 at 2:50 PM, Tyler Akidau
 wrote:
> Welcome!
>
> On Mon, Mar 20, 2017, 02:25 Jean-Baptiste Onofré  wrote:
>
>> Welcome aboard, and congrats !
>>
>> Really happy to count you all in the team ;)
>>
>> Regards
>> JB
>>
>> On 03/17/2017 10:13 PM, Davor Bonaci wrote:
>> > Please join me and the rest of Beam PMC in welcoming the following
>> > contributors as our newest committers. They have significantly
>> contributed
>> > to the project in different ways, and we look forward to many more
>> > contributions in the future.
>> >
>> > * Chamikara Jayalath
>> > Chamikara has been contributing to Beam since inception, and previously
>> to
>> > Google Cloud Dataflow, accumulating a total of 51 commits (8,301 ++ /
>> 3,892
>> > --) since February 2016 [1]. He contributed broadly to the project, but
>> > most significantly to the Python SDK, building the IO framework in this
>> SDK
>> > [2], [3].
>> >
>> > * Eugene Kirpichov
>> > Eugene has been contributing to Beam since inception, and previously to
>> > Google Cloud Dataflow, accumulating a total of 95 commits (22,122 ++ /
>> > 18,407 --) since February 2016 [1]. In recent months, he’s been driving
>> the
>> > Splittable DoFn effort [4]. A true expert on IO subsystem, Eugene has
>> > reviewed nearly every IO contributed to Beam. Finally, Eugene contributed
>> > the Beam Style Guide, and is championing it across the project.
>> >
>> > * Ismaël Mejia
>> > Ismaël has been contributing to Beam since mid-2016, accumulating a total
>> > of 35 commits (3,137 ++ / 1,328 --) [1]. He authored the HBaseIO
>> connector,
>> > helped on the Spark runner, and contributed in other areas as well,
>> > including cross-project collaboration with Apache Zeppelin. Ismaël
>> reported
>> > 24 Jira issues.
>> >
>> > * Aviem Zur
>> > Aviem has been contributing to Beam since early fall, accumulating a
>> total
>> > of 49 commits (6,471 ++ / 3,185 --) [1]. He reported 43 Jira issues, and
>> > resolved ~30 issues. Aviem improved the stability of the Spark runner a
>> > lot, and introduced support for metrics. Finally, Aviem is championing
>> > dependency management across the project.
>> >
>> > Congratulations to all four! Welcome!
>> >
>> > Davor
>> >
>> > [1]
>> >
>> https://github.com/apache/beam/graphs/contributors?from=2016-02-01&to=2017-03-17&type=c
>> > [2]
>> >
>> https://github.com/apache/beam/blob/v0.6.0/sdks/python/apache_beam/io/iobase.py#L70
>> > [3]
>> >
>> https://github.com/apache/beam/blob/v0.6.0/sdks/python/apache_beam/io/iobase.py#L561
>> > [4] https://s.apache.org/splittable-do-fn
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>


Re: splitIntoBundles vs. generateInitialSplits

2017-03-20 Thread Ismaël Mejía
This is a forgotten one; Stas, did you create a JIRA about it? I
think this change should also be tagged for the first stable release,
because it is an API change and can break things if we do it later
on.
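
For context, a quick sketch of how a batch runner uses the bounded
variant at job-startup time (simplified and assumption-laden; real
runners also use splitAtFraction for dynamic rebalancing later, and
the 64 MB target is just an example):

    import java.util.List;
    import org.apache.beam.sdk.io.BoundedSource;
    import org.apache.beam.sdk.options.PipelineOptions;

    static <T> List<? extends BoundedSource<T>> initialSplits(
        BoundedSource<T> source, PipelineOptions options) throws Exception {
      // e.g. aim for bundles of about 64 MB each
      long desiredBundleSizeBytes = 64L * 1024 * 1024;
      return source.splitIntoBundles(desiredBundleSizeBytes, options);
    }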

On Wed, Jan 11, 2017 at 4:30 PM, Jean-Baptiste Onofré  wrote:
> Hi Eugene and Stas,
>
> Just back from couple of days off and jump on this discussion.
>
> I agree with Stas: it's worth to create a Jira about that. The only
> "semantic" difference is unbounded vs bounded source, but the behavior is
> the same.
>
> Regards
> JB
>
>
> On 01/11/2017 04:26 PM, Stas Levin wrote:
>>
>> Eugene, that makes a lot of sense to me.
>>
>> Do you think it's worth filing a Jira ticket?
>>
>> On Wed, Jan 11, 2017 at 2:58 AM Eugene Kirpichov
>>  wrote:
>>
>> I agree that the methods are named somewhat confusingly, and ideally would
>> be named the same. Both of the names miss some aspect of the underlying
>> concept.
>>
>> The underlying concept is split the source into smaller sub-sources which,
>> if you read all of them, would have read the same data as the original
>> one.
>> "splitIntoBundles" assumes that 1 source = 1 bundle, which is completely
>> false in streaming, and only partially true in batch (I'm talking about
>> the
>> Dataflow runner).
>> "generateInitialSplits" assumes that this splitting happens only
>> "initially", i.e. at job startup time. This is currently true in practice
>> for all existing runners, but it doesn't have to be - we could conceivably
>> call it again at some point during the job if we see that some of the
>> sub-sources are still too large.
>>
>> The analogous method in Splittable DoFn (
>> https://s.apache.org/splittable-do-fn) is called @SplitRestriction, but
>> there are no restrictions in source API, only sources.
>>
>> Perhaps both should be called simply "split", or "splitIntoSubSources".
>>
>> On Mon, Jan 9, 2017 at 2:12 PM Stas Levin  wrote:
>>
>>> Definitely seems like the formatting got lost in translation, sorry about
>>> that :)
>>>
>>> I guess both cases (methods) create splits, which are essentially a list
>>
>> of
>>>
>>> bounded/unbounded source instances, each responsible for reading certain
>>> segments (physical or otherwise) of the data.
>>>
>>> On Mon, Jan 9, 2017 at 11:51 PM Stephen Sisk 
>>> wrote:
>>>
 hi!

 I think your strikethrough got lost due to this being a text-only email
 list. To make sure, I think you're asking the following:
 " would it be reasonable to think of splitIntoBundles as generateSplits?
>>>
>>> "

 (ie, you strikethrough'd Initial)

 They are very similar and I definitely also think of them as occupying
>>>
>>> the

 same niche. I'll let someone else who was around for naming discuss
>>>
>>> whether

 it was intentional or not. Conceptually, the way that bounded vs
>>>
>>> streaming

 are handled means that they are doing slightly different things: a
>>>
>>> bounded

 source is really kind of creating physical chunks of the data, whereas
>>>
>>> the

 streaming source is creating conceptual divisions of the data that will
>>>
>>> be

 used later. I'm not sure that's worth the confusion caused by the
 differences.

 One thing to clarify - splitIntoBundles does have an "Initial" aspect to
 it. I don't believe there is a publicly defined/written down order the
 Sources & Reader methods are called in, but a runner trying to get
 efficiency would be able to use splitIntoBundles during job startup to
>>
>> be

 able to split up the work before creating readers rather than after
 creating readers and waiting to use splitAtFraction.

 S

 On Sun, Jan 8, 2017 at 6:06 AM Stas Levin  wrote:

> Hi,
>
> A short terminology question regarding "bundle", and
> particularly splitIntoBundles vs. generateInitialSplits.
>
> In *BoundedSource* we have:
> List<? extends BoundedSource<T>> *splitIntoBundles*(...)
>
> In *UnboundedSource* we have:
> List<? extends UnboundedSource<OutputT, CheckpointMarkT>>
> *generateInitialSplits*(...)
>
> I was wondering if the names were intentionally made different, i.e.

 "into
>
> bundles" vs "into splits"?
> In a way these two methods carry out a very similar task, would it be
> reasonable to think of *splitIntoBundles *as *generate*Initial*Splits?
>>>
>>> *
>
> (strikethrough due to "initial" not being applicable in the case of

 bounded
>
> sources)
>
> Regards,
> Stas
>

>>>
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: Performance Testing Next Steps

2017-03-16 Thread Ismaël Mejía
> .. if the provider we are bringing up also
> provides the data store, we can just omit the data store for that benchmark
> and use what we've already brought up. Does that answer your question, or
> have I misunderstood?

Yes, and it is a perfect approach for the case, great idea.

> Great point -- I neglected to include the DirectRunner in the plans here.
> I'll add it to the doc and file a JIRA.

Excellent.

This work is super interesting, so don’t hesitate to ask anything of
us, the rest of the community; I think many of us are interested and
can give a hand if needed.


On Thu, Mar 16, 2017 at 9:17 AM, Jason Kuster
 wrote:
> Thanks Ismael for the comments! Replied inline.
>
> On Wed, Mar 15, 2017 at 8:18 AM, Ismaël Mejía  wrote:
>
>> Excellent proposal, sorry to jump into this discussion so late, this
>> was in my toread list for almost two weeks, and I finally got the time
>> to read the document and I have two minor comments:
>>
>> I have the impression that the strict separation of Providers (the
>> data-processing systems) and Resources (the concrete Data Stores)
>> makes sense for the general case, but is lacking if what we want to
>> test are things in the Hadoop ecosystem where the data stores commonly
>> co-exist in the same group of machines with the data-processing
>> systems (the Providers), e.g. HDFS, Hbase + YARN. This is important to
>> correctly test that data locality works correctly for example. Have
>> you considered such case?
>>
>
> Definitely interesting to think about, and I don't think I added provisions
> for this in the doc. My impression, though, is that since the providers and
> the data stores are not coupled, if the provider we are bringing up also
> provides the data store, we can just omit the data store for that benchmark
> and use what we've already brought up. Does that answer your question, or
> have I misunderstood?
>
>>
>> Another thing I noticed is that in the list of runners supporting PKB
>> the Direct Runner is not included, is there any particular reason for
>> this? I think that even if performance is not the main goal of the
>> direct runner it can be nice to have it there too to catch any
>> performance regressions, or is it because it is already ready for it?
>> what do you think?
>>
>>
> Great point -- I neglected to include the DirectRunner in the plans here.
> I'll add it to the doc and file a JIRA.
>
>
>> Thanks,
>> Ismaël
>>
>> On Thu, Mar 2, 2017 at 11:49 PM, Amit Sela  wrote:
>> > Looks great, and I'll be sure to follow this. Ping me if I can assist in
>> > any way!
>> >
>> > On Fri, Mar 3, 2017 at 12:09 AM Ahmet Altay 
>> > wrote:
>> >
>> >> Sounds great, thank you!
>> >>
>> >> On Thu, Mar 2, 2017 at 1:41 PM, Jason Kuster > >> .invalid
>> >> > wrote:
>> >>
>> >> > D'oh, my bad Ahmet. I've opened BEAM-1610, which handles support for
>> >> Python
>> >> > in PKB against the Dataflow runner. Once the Fn API progresses some
>> more
>> >> we
>> >> > can add some work items for the other runners too. Let's chat about
>> this
>> >> > more, maybe next week?
>> >> >
>> >> > On Thu, Mar 2, 2017 at 1:31 PM, Ahmet Altay > >
>> >> > wrote:
>> >> >
>> >> > > Thank you Jason, this is great.
>> >> > >
>> >> > > Which one of these issues fall into the land of sdk-py?
>> >> > >
>> >> > > Ahmet
>> >> > >
>> >> > > On Thu, Mar 2, 2017 at 12:34 PM, Jason Kuster <
>> >> > > jasonkus...@google.com.invalid> wrote:
>> >> > >
>> >> > > > Glad to hear the excitement. :)
>> >> > > >
>> >> > > > Filed BEAM-1595 - 1609 to track work items. Some of these fall
>> under
>> >> > > runner
>> >> > > > components, please feel free to reach out to me if you have any
>> >> > questions
>> >> > > > about how to accomplish these.
>> >> > > >
>> >> > > > Best,
>> >> > > >
>> >> > > > Jason
>> >> > > >
>> >> > > > On Wed, Mar 1, 2017 at 5:50 AM, Aljoscha Krettek <
>> >> aljos...@apache.org>
>> >> > > > wrote:
>> >> > > >

Re: Beam spark 2.x runner status

2017-03-15 Thread Ismaël Mejía
> So you're suggesting we copy-paste the current runner and adapt whatever is
> necessary so it runs with Spark 2 ?

Yes

> This also means any bug-fix / improvement would have to be maintained in
> two runners, and I wouldn't wanna do that.

No. This is the reason I first proposed to deprecate the spark 1
runner (doing only eventual bug fixes, with no new development on it)
to keep maintenance minimal; the current line of
development/maintenance would move into the spark 2 RDD version.
Additionally, in parallel we can progress on the Dataset-based
translation, but that would be considered experimental, so no
maintenance commitments. Finally, when the Dataset version is mature,
we would get rid of the RDD one.

This has the additional advantage of not needing a long-lived branch
or a full rewrite as a starting point.

> I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset API.
> Since the RDD API is mature, it should be the runner in master (not
> preventing another runner once Dataset API is mature enough) and the
> version (1.6.3 or 2.x) should be determined by the common installation.

I agree with you, but the reality is that people who run on clusters
need the specific versions of the libraries; this is independent of
the APIs.

Don’t you think that taking the approach I described at least reduces
the maintenance burden a little? I understand your hesitation, but if
we decide on the exact set of features that won’t be supported in the
spark 1 runner, we can branch out from it. Of course, this decision is
totally up to you.

On Wed, Mar 15, 2017 at 5:57 PM, Amit Sela  wrote:
> So you're suggesting we copy-paste the current runner and adapt whatever is
> necessary so it runs with Spark 2 ?
> This also means any bug-fix / improvement would have to be maintained in
> two runners, and I wouldn't wanna do that.
>
> I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset API.
> Since the RDD API is mature, it should be the runner in master (not
> preventing another runner once Dataset API is mature enough) and the
> version (1.6.3 or 2.x) should be determined by the common installation.
>
> That's why I believe we still need to leave things as they are, but start
> working on the Dataset API runner.
> Otherwise, we'll have the current runner, another RDD API runner with Spark
> 2, and a third one for the Dataset API. I don't want to maintain all of
> them. It's a mess.
>
> On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía  wrote:
>
>> > However, I do feel that we should use the Dataset API, starting with
>> batch
>> > support first. WDYT ?
>>
>> Well, this is exactly the current status quo, and it will take us some
>> time to have something for spark 2 as complete as what we have with
>> the spark 1 runner.
>>
>> The other proposal has two advantages:
>>
>> One is that we can leverage the existing implementation (with the
>> needed adjustments) to run Beam pipelines on Spark 2. In the end, final
>> users don’t care much whether pipelines are translated via RDD/DStream
>> or Dataset; they just want to know that with Beam they can run their
>> code on their favorite data processing framework.
>>
>> The other advantage is that we can base the work on the latest spark
>> version and advance simultaneously on translators for both APIs, and
>> once we consider the Dataset one mature enough we can stop
>> maintaining the RDD one and make the Dataset translator the official one.
>>
>> The only missing piece is backporting new developments on the
>> RDD-based translator from the spark 2 version into spark 1, but maybe
>> this won’t be so hard if we consider what you said: at this point
>> we are getting closer to having streaming right (of course you are the
>> most appropriate person to decide whether we are in good enough shape
>> for this).
>>
>> Finally, I agree with you: I would prefer a nice, full-featured
>> translator based on the Structured Streaming API, but the question is
>> how much time it will take to get into shape, and the impact on final
>> users who are already requesting this. This is why I think the more
>> conservative approach (keeping the RDD translator around) and
>> moving incrementally makes sense.
>>
>> On Wed, Mar 15, 2017 at 4:52 PM, Amit Sela  wrote:
>> > I feel that as we're getting closer to supporting streaming with Spark 1
>> > runner, and having Structured Streaming advance in Spark 2, we could
>> start
>> > work on Spark 2 runner in a separate branch.
>> >
>> > However, I do

Re: Beam spark 2.x runner status

2017-03-15 Thread Ismaël Mejía
> However, I do feel that we should use the Dataset API, starting with batch
> support first. WDYT ?

Well, this is exactly the current status quo, and it will take us some
time to have something for spark 2 as complete as what we have with
the spark 1 runner.

The other proposal has two advantages:

One is that we can leverage the existing implementation (with the
needed adjustments) to run Beam pipelines on Spark 2. In the end, final
users don’t care much whether pipelines are translated via RDD/DStream
or Dataset; they just want to know that with Beam they can run their
code on their favorite data processing framework.

The other advantage is that we can base the work on the latest spark
version and advance simultaneously on translators for both APIs, and
once we consider the Dataset one mature enough we can stop
maintaining the RDD one and make the Dataset translator the official one.

The only missing piece is backporting new developments on the
RDD-based translator from the spark 2 version into spark 1, but maybe
this won’t be so hard if we consider what you said: at this point
we are getting closer to having streaming right (of course you are the
most appropriate person to decide whether we are in good enough shape
for this).

Finally, I agree with you: I would prefer a nice, full-featured
translator based on the Structured Streaming API, but the question is
how much time it will take to get into shape, and the impact on final
users who are already requesting this. This is why I think the more
conservative approach (keeping the RDD translator around) and moving
incrementally makes sense.

On Wed, Mar 15, 2017 at 4:52 PM, Amit Sela  wrote:
> I feel that as we're getting closer to supporting streaming with Spark 1
> runner, and having Structured Streaming advance in Spark 2, we could start
> work on Spark 2 runner in a separate branch.
>
> However, I do feel that we should use the Dataset API, starting with batch
> support first. WDYT ?
>
> On Wed, Mar 15, 2017 at 5:47 PM Ismaël Mejía  wrote:
>
>> > So you propose to have the Spark 2 branch a clone of the current one with
>> > adaptations around Context->Session, Accumulator->AccumulatorV2 etc.
>> while
>> > still using the RDD API ?
>>
>> Yes this is exactly what I have in mind.
>>
>> > I think that having another Spark runner is great if it has value,
>> > otherwise, let's just bump the version.
>>
>> There is value because most people are already starting to move to
>> spark 2, and all Big Data distribution providers support it now, as
>> well as the cloud-based distributions (Dataproc and EMR), unlike the
>> last time we had this discussion.
>>
>> > We could think of starting to migrate the Spark 1 runner to Spark 2 and
>> > follow with Dataset API support feature-by-feature as it advances, but I
>> > think most Spark installations today still run 1.X, or am I wrong ?
>>
>> No, you are right; that’s why I didn’t even mention removing the
>> spark 1 runner. I know that having to support things for both versions
>> adds additional work for us, but maybe the best approach would be to
>> continue the work only in the spark 2 runner (both refining the
>> RDD-based translator and starting to create the Dataset one there,
>> co-existing until the Dataset API is mature enough) and keep the spark 1
>> runner only for bug fixes for the users who are still using it (this
>> way we don’t have to keep backporting stuff). Do you see any other
>> particular issue?
>>
>> Ismaël
>>
>> On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela  wrote:
>> > So you propose to have the Spark 2 branch a clone of the current one with
>> > adaptations around Context->Session, Accumulator->AccumulatorV2 etc.
>> while
>> > still using the RDD API ?
>> >
>> > I think that having another Spark runner is great if it has value,
>> > otherwise, let's just bump the version.
>> > My idea of having another runner for Spark was not to support more
>> versions
>> > - we should always support the most popular version in terms of
>> > compatibility - the idea was to try and make Beam work with Structured
>> > Streaming, which is still not fully mature so that's why we're not
>> heavily
>> > investing there.
>> >
>> > We could think of starting to migrate the Spark 1 runner to Spark 2 and
>> > follow with Dataset API support feature-by-feature as it advances, but I
>> > think most Spark installations today still run 1.X, or am I wrong ?
>> >
>> > On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía  wrote:

Re: Beam spark 2.x runner status

2017-03-15 Thread Ismaël Mejía
> So you propose to have the Spark 2 branch a clone of the current one with
> adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while
> still using the RDD API ?

Yes, this is exactly what I have in mind.
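
To make the scope concrete, here is a minimal sketch of the kind of
mechanical adaptation involved (illustrative only, not actual runner
code; it assumes the standard Spark 2.0 entry points):

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.util.LongAccumulator;

class Spark2AdaptationSketch {
  static JavaSparkContext javaContext() {
    // Spark 1.x: new JavaSparkContext(new SparkConf().setAppName("beam"))
    SparkSession session =
        SparkSession.builder().appName("beam").getOrCreate();
    // The RDD-based translation keeps working on top of this context.
    return new JavaSparkContext(session.sparkContext());
  }

  static LongAccumulator accumulator(SparkSession session) {
    // Spark 1.x: jsc.accumulator(0), returning Accumulator<Integer>
    return session.sparkContext().longAccumulator("beam-aggregators");
  }
}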

> I think that having another Spark runner is great if it has value,
> otherwise, let's just bump the version.

There is value because most people are already starting to move to
spark 2, and all Big Data distribution providers support it now, as
well as the cloud-based distributions (Dataproc and EMR), unlike the
last time we had this discussion.

> We could think of starting to migrate the Spark 1 runner to Spark 2 and
> follow with Dataset API support feature-by-feature as it advances, but I
> think most Spark installations today still run 1.X, or am I wrong ?

No, you are right; that’s why I didn’t even mention removing the
spark 1 runner. I know that having to support things for both versions
adds additional work for us, but maybe the best approach would be to
continue the work only in the spark 2 runner (both refining the
RDD-based translator and starting to create the Dataset one there,
co-existing until the Dataset API is mature enough) and keep the spark 1
runner only for bug fixes for the users who are still using it (this
way we don’t have to keep backporting stuff). Do you see any other
particular issue?

Ismaël

On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela  wrote:
> So you propose to have the Spark 2 branch a clone of the current one with
> adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while
> still using the RDD API ?
>
> I think that having another Spark runner is great if it has value,
> otherwise, let's just bump the version.
> My idea of having another runner for Spark was not to support more versions
> - we should always support the most popular version in terms of
> compatibility - the idea was to try and make Beam work with Structured
> Streaming, which is still not fully mature so that's why we're not heavily
> investing there.
>
> We could think of starting to migrate the Spark 1 runner to Spark 2 and
> follow with Dataset API support feature-by-feature as it advances, but I
> think most Spark installations today still run 1.X, or am I wrong ?
>
> On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía  wrote:
>
>> BIG +1 JB,
>>
>> If we can just bump the version number with minor changes, staying as
>> close as possible to the current spark 1 implementation, we can go
>> faster and offer in principle the exact same support, but for version
>> 2.
>>
>> I know that the advanced streaming stuff based on the Dataset API
>> won't be there, but with this common canvas the community can iterate
>> on a Dataset-based translator at the same time. In particular, I
>> consider the most important thing is that the spark 2 branch should
>> not live for a long time; it should be merged into master quickly
>> for the benefit of everybody.
>>
>> Ismaël
>>
>>
>> On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré 
>> wrote:
>> > Hi Amit,
>> >
>> > What do you think of the following:
>> >
>> > - in the mean time that you reintroduce the Spark 2 branch, what about
>> > "extending" the version in the current Spark runner ? Still using
>> > RDD/DStream, I think we can support Spark 2.x even if we don't yet
>> leverage
>> > the new provided features.
>> >
>> > Thoughts ?
>> >
>> > Regards
>> > JB
>> >
>> >
>> > On 03/15/2017 07:39 PM, Amit Sela wrote:
>> >>
>> >> Hi Cody,
>> >>
>> >> I will re-introduce this branch soon as part of the work on BEAM-913
>> >> <https://issues.apache.org/jira/browse/BEAM-913>.
>> >> For now, and from previous experience with the mentioned branch, batch
>> >> implementation should be straight-forward.
>> >> Only issue is with streaming support - in the current runner (Spark 1.x)
>> >> we
>> >> have experimental support for windows/triggers and we're working towards
>> >> full streaming support.
>> >> With Spark 2.x, there is no "general-purpose" stateful operator for the
>> >> Dataset API, so I was waiting to see if the new operator
>> >> <https://github.com/apache/spark/pull/17179> planned for next version
>> >> could
>> >> help with that.
>> >>
>> >> To summarize, I will introduce a skeleton for the Spark 2 runner with
>> >> batch
>> >> support as soon as I can as a separate branch.
>> >>
>> >> Thanks,
>> >> Amit
>> >>
>> >> On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere 
>> >> wrote:
>> >>
>> >>> Hi guys,
>> >>> Is there anybody who's currently working on Spark 2.x runner? An old PR
>> >>> for
>> >>> spark 2.x runner was closed a few days ago, so I wonder what's the
>> status
>> >>> now, and is there a roadmap for this?
>> >>> Thanks~
>> >>>
>> >>
>> >
>> > --
>> > Jean-Baptiste Onofré
>> > jbono...@apache.org
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>>


Re: Performance Testing Next Steps

2017-03-15 Thread Ismaël Mejía
Excellent proposal; sorry to jump into this discussion so late. This
was in my to-read list for almost two weeks, and I finally got the time
to read the document. I have two minor comments:

I have the impression that the strict separation of Providers (the
data-processing systems) and Resources (the concrete Data Stores)
makes sense for the general case, but is lacking if what we want to
test are things in the Hadoop ecosystem, where the data stores commonly
co-exist on the same group of machines as the data-processing
systems (the Providers), e.g. HDFS, HBase + YARN. This is important,
for example, to verify that data locality works correctly. Have
you considered such a case?

Another thing I noticed is that the Direct Runner is not included in
the list of runners supporting PKB. Is there any particular reason for
this? I think that even if performance is not the main goal of the
direct runner, it would be nice to have it there too, to catch any
performance regressions. Or is it because it is already ready for it?
What do you think?

Thanks,
Ismaël

On Thu, Mar 2, 2017 at 11:49 PM, Amit Sela  wrote:
> Looks great, and I'll be sure to follow this. Ping me if I can assist in
> any way!
>
> On Fri, Mar 3, 2017 at 12:09 AM Ahmet Altay 
> wrote:
>
>> Sounds great, thank you!
>>
>> On Thu, Mar 2, 2017 at 1:41 PM, Jason Kuster > .invalid
>> > wrote:
>>
>> > D'oh, my bad Ahmet. I've opened BEAM-1610, which handles support for
>> Python
>> > in PKB against the Dataflow runner. Once the Fn API progresses some more
>> we
>> > can add some work items for the other runners too. Let's chat about this
>> > more, maybe next week?
>> >
>> > On Thu, Mar 2, 2017 at 1:31 PM, Ahmet Altay 
>> > wrote:
>> >
>> > > Thank you Jason, this is great.
>> > >
>> > > Which one of these issues fall into the land of sdk-py?
>> > >
>> > > Ahmet
>> > >
>> > > On Thu, Mar 2, 2017 at 12:34 PM, Jason Kuster <
>> > > jasonkus...@google.com.invalid> wrote:
>> > >
>> > > > Glad to hear the excitement. :)
>> > > >
>> > > > Filed BEAM-1595 - 1609 to track work items. Some of these fall under
>> > > runner
>> > > > components, please feel free to reach out to me if you have any
>> > questions
>> > > > about how to accomplish these.
>> > > >
>> > > > Best,
>> > > >
>> > > > Jason
>> > > >
>> > > > On Wed, Mar 1, 2017 at 5:50 AM, Aljoscha Krettek <
>> aljos...@apache.org>
>> > > > wrote:
>> > > >
>> > > > > Thanks for writing this and taking care of this, Jason!
>> > > > >
>> > > > > I'm afraid I also cannot add anything except that I'm excited to
>> see
>> > > some
>> > > > > results from this.
>> > > > >
>> > > > > On Wed, 1 Mar 2017 at 03:28 Kenneth Knowles > >
>> > > > > wrote:
>> > > > >
>> > > > > Just got a chance to look this over. I don't have anything to add,
>> > but
>> > > > I'm
>> > > > > pretty excited to follow this project. Have the JIRAs been filed
>> > since
>> > > > you
>> > > > > shared the doc?
>> > > > >
>> > > > > On Wed, Feb 22, 2017 at 10:38 AM, Jason Kuster <
>> > > > > jasonkus...@google.com.invalid> wrote:
>> > > > >
>> > > > > > Hey all, just wanted to pop this up again for people -- if anyone
>> > has
>> > > > > > thoughts on performance testing please feel welcome to chime in.
>> :)
>> > > > > >
>> > > > > > On Fri, Feb 17, 2017 at 4:03 PM, Jason Kuster <
>> > > jasonkus...@google.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi all,
>> > > > > > >
>> > > > > > > I've written up a doc on next steps for getting performance
>> > testing
>> > > > up
>> > > > > > and
>> > > > > > > running for Beam. I'd love to hear from people -- there's a
>> fair
>> > > > amount
>> > > > > > of
>> > > > > > > work encapsulated in here, but the end result is that we have a
>> > > > > > performance
>> > > > > > > testing system which we can use for benchmarking all aspects of
>> > > Beam,
>> > > > > > which
>> > > > > > > would be really exciting. Looking forward to your thoughts.
>> > > > > > >
>> > > > > > > https://docs.google.com/document/d/
>> > 1PsjGPSN6FuorEEPrKEP3u3m16tyOz
>> > > > > > > ph5FnL2DhaRDz0/edit?ts=58a78e73
>> > > > > > >
>> > > > > > > Best,
>> > > > > > >
>> > > > > > > Jason
>> > > > > > >
>> > > > > > > --
>> > > > > > > ---
>> > > > > > > Jason Kuster
>> > > > > > > Apache Beam / Google Cloud Dataflow
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > ---
>> > > > > > Jason Kuster
>> > > > > > Apache Beam / Google Cloud Dataflow
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > ---
>> > > > Jason Kuster
>> > > > Apache Beam / Google Cloud Dataflow
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > ---
>> > Jason Kuster
>> > Apache Beam / Google Cloud Dataflow
>> >
>>


Re: Beam spark 2.x runner status

2017-03-15 Thread Ismaël Mejía
BIG +1 JB,

If we can just bump the version number with minor changes, staying as
close as possible to the current spark 1 implementation, we can go
faster and offer in principle the exact same support, but for version
2.

I know that the advanced streaming stuff based on the Dataset API
won't be there, but with this common canvas the community can iterate
on a Dataset-based translator at the same time. In particular, I
consider the most important thing is that the spark 2 branch should
not live for a long time; it should be merged into master quickly for
the benefit of everybody.

Ismaël


On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré  wrote:
> Hi Amit,
>
> What do you think of the following:
>
> - in the mean time that you reintroduce the Spark 2 branch, what about
> "extending" the version in the current Spark runner ? Still using
> RDD/DStream, I think we can support Spark 2.x even if we don't yet leverage
> the new provided features.
>
> Thoughts ?
>
> Regards
> JB
>
>
> On 03/15/2017 07:39 PM, Amit Sela wrote:
>>
>> Hi Cody,
>>
>> I will re-introduce this branch soon as part of the work on BEAM-913
>> .
>> For now, and from previous experience with the mentioned branch, batch
>> implementation should be straight-forward.
>> Only issue is with streaming support - in the current runner (Spark 1.x)
>> we
>> have experimental support for windows/triggers and we're working towards
>> full streaming support.
>> With Spark 2.x, there is no "general-purpose" stateful operator for the
>> Dataset API, so I was waiting to see if the new operator
>>  planned for next version
>> could
>> help with that.
>>
>> To summarize, I will introduce a skeleton for the Spark 2 runner with
>> batch
>> support as soon as I can as a separate branch.
>>
>> Thanks,
>> Amit
>>
>> On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere 
>> wrote:
>>
>>> Hi guys,
>>> Is there anybody who's currently working on Spark 2.x runner? An old PR
>>> for
>>> spark 2.x runner was closed a few days ago, so I wonder what's the status
>>> now, and is there a roadmap for this?
>>> Thanks~
>>>
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: Docker image dependencies

2017-03-15 Thread Ismaël Mejía
Hi, thanks for bringing this subject to the mailing list.

+1
We definitely need a consensus on this, and I agree with your proposal and
JB’s comments modulo certain clarifications:

I think we should go in this priority order, if the version of the image we
want is available:

1. Image provided by the creator of the data source/sink, if they
officially maintain it (this is the case for Elasticsearch, for example),
or by the Apache project itself (if it provides one), as JB mentions.
2. Official docker images (because they get security fixes and have
guaranteed maintenance).
3. Non-official docker images, or images from other providers that have
good maintainers, e.g. quay.io.

It makes sense to use the same image for all the tests, and to use the
fixed versions supported by the respective IO, to avoid possible issues
during testing between different versions/naming of env variables, etc.

The Elasticsearch case is a 'good' example because it shows all the current
issues:

We should not use one elasticsearch image (elk) for some tests and a
different one for the others (the quay one); if we resolve by priority,
we would take the image provided by the creator (1) for both cases:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html
However, the current Beam Elasticsearch IO does not support Elasticsearch 5,
and elastic does not provide an image for version 2, so in this particular
case, following the priority order, we should use the official docker image
(2) for the tests (assuming that both require the same version).
Do you agree with this?


Thinking about the ELK image, I came up with a new question: how do we deal
with IOs that require more than one base image? This is a common scenario
for projects that depend on Zookeeper, e.g. Kafka/Solr. Usually people
coordinate those with a docker-compose file that creates a network to
connect the Zookeeper image and the Kafka/Solr one, just by executing the
'docker-compose up' command. Will we adopt this for such cases?

I know that Kubernetes can do this too, but the docker-compose format is
quite easy and textual, and it usually comes with the docker installation;
additionally, docker-compose files can easily be translated into
Kubernetes resources with kompose.

Ismaël

On Wed, Mar 15, 2017 at 3:17 AM, Jean-Baptiste Onofré 
wrote:

> Hi Stephen,
>
> 1. About the docker repositories, we now have official Docker repo at
> Apache. So, for the Apache projects, I would recommend the Apache official
> repo. Anyway, generally speaking, I would recommend the official repo (from
> the projects).
>
> 2. To avoid "unpredictable" breaking change, I would pin to a particular
> versions, and explicitly update if needed.
>
> 3. It's better that docker images are under an unique responsibility scope
> as different IOs can use the same resources, so they should use the same
> provided docker.
>
> By the way, I also have a docker coming for RedisIO ;)
>
> Regards
> JB
>
>
> On 03/15/2017 08:01 AM, Stephen Sisk wrote:
>
>> hi!
>>
>> as part of doing the work to enable IO ITs, we decided we want to use
>> docker. As part of that, we need to run docker images and they'll probably
>> be pulled from a docker repository.
>>
>> Questions:
>> * What docker repositories (and users on docker hub) do we as a group
>> allow
>> for images we'll run for hosted data stores?
>>  -> My proposal is we should only use repositories/images that are
>> regularly updated and that have someone saying that the images we depend
>> on
>> are secure. In the set of images currently linked to by checked in code/in
>> PR code, quay.io and official docker images seem fine. They both have
>> security scans (for what that's worth) and generally seem okay.
>>
>> * Do we pin to particular docker images or allow our version to float?
>>  -> I have seen docker images change in insecure way (e.g. switching the
>> name of the password parameter, meaning that the data store was secure
>> when
>> set up, and became insecure because no password was set after the image
>> update), so I'd prefer to pin to particular versions, and update on a
>> periodic basis.
>>
>> I'm relatively new to docker best practices, so I'm open to suggestions on
>> this.
>>
>> Current ITs with docker images:
>> * Jdbc - https://hub.docker.com/_/postgres/  (official image)
>> * Elasticsearch - https://hub.docker.com/r/sebp/elk/ (semi-official
>> looking
>> image)
>> * (PR in-flight
>> > ff9aebc9e99a3f324c9cf75a9R52>)
>> HadoopInputFormat's elasticsearch and cassandra tests -
>> https://hub.docker.com/_/cassandra/ and
>> https://quay.io/repository/pires/docker-elasticsearch-kubern
>> etes?tag=5.2.2&tab=tags
>> (official image, and image from quay.io, which provides security audits
>> of
>> their images)
>>
>> The more I think about it, the less I'm excited about the sebp/elk image -
>> I'm sure it's fine, but I'd prefer using images from a source that we know
>> i

Re: Style: how much testing for transform builder classes?

2017-03-15 Thread Ismaël Mejía
+1 to Vikas' point: maybe the right place to enforce correct
construction is in validate(), which would reduce the test boilerplate
to testing only the validation. But I wonder whether this totally
covers both cases (the buildsCorrectly and
buildsCorrectlyInDifferentOrder ones).

I'll answer Eugene's question here, even though you are aware of it now
since you commented on the PR, so everyone understands the case.

The case is pretty simple: when you extend an IO and add a new
configuration parameter, suppose we have withFoo(String foo) and we
want to add withBar(String bar). In some cases the implementation, or
even worse the combination of the two, is not built correctly, so the
only way to guarantee that this works is to have tests that cover the
complete parameter combination, or that at least assert that the
object is built correctly.

This can happen both with and without AutoValue, because the with*
methods are hand-written and the natural tendency with boilerplate
methods like this is to copy/paste, so we can end up doing silly
things like:

private Read(String foo, String bar) { … }

public Read withBar(String bar) {
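  // bug: 'bar' is silently dropped; null is stored instead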
  return new Read(foo, null);
}

In this case the reference to bar is never stored or assigned (this is
similar to the case in the DatastoreIO PR). AutoValue may seem to
solve this issue, but you can end up in the same situation if you
copy/paste the withFoo method and just change the method name:

public Read withBar(String foo) {
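  // bug: copy/pasted from withFoo, so it sets foo instead of bar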
  return builder().setFoo(foo).build();
}

Of course both seem silly, but both can happen, and the tests at least
help to discover them. If Vikas' proposition covers the
testsBuildCorrectly and testsBuildCorrectlyInDifferentOrder kinds of
tests, I think it is OK to get rid of those.
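
A small round-trip test in the buildsCorrectly style is enough to catch
both variants. A sketch with a hypothetical Read transform (the names
are illustrative, not from an actual IO):

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class ReadBuilderTest {
  @Test
  public void testBuildsCorrectly() {
    Read read = Read.create().withFoo("foo").withBar("bar");
    assertEquals("foo", read.getFoo());
    // This assertion fails for both copy/paste bugs shown above.
    assertEquals("bar", read.getBar());
  }

  @Test
  public void testBuildsCorrectlyInDifferentOrder() {
    Read read = Read.create().withBar("bar").withFoo("foo");
    assertEquals("foo", read.getFoo());
    assertEquals("bar", read.getBar());
  }
}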

On Wed, Mar 15, 2017 at 1:05 AM, vikas rk  wrote:
> Yes, what I meant is: Necessary tests are ones that blocks users if not
> present. Trivial or non-trivial shouldn't be the issue in such cases.
>
> Some of the boilerplate code and tests is because IO PTransforms are
> returned to the user before they are fully constructed and actual
> validation happens in the validate method rather than at construction. I
> understand that the reasoning here is that we want to support constructing
> them with options in any order and using Builder pattern can be confusing.
>
> If validate method is where all the validation happens, then we should able
> to eliminate some redundant checks and tests during construction time like
> in *withOption* methods here
> <https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java#L199>
>  and here
> <https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1.java#L387>
> as
> these are also checked in the validate method.
>
>
>
>
>
>
>
>
>
>
>
> -Vikas
>
>
>
> On 14 March 2017 at 15:40, Eugene Kirpichov 
> wrote:
>
>> Thanks all. Looks like people are on board with the general direction
>> though it remains to refine it to concrete guidelines to go into style
>> guide.
>>
>> Ismaël, can you give more details about the situation you described in the
>> first paragraph? Is it perhaps that really a RunnableOnService test was
>> missing (and perhaps still is), rather than a builder test?
>>
>> Vikas, regarding trivial tests and user waiting for a work-around: in the
>> situation I described, they don't really need a workaround - they specified
>> an invalid value and have been minorly inconvenienced because the error
>> they got about it was not very readable, so fixing their value took them a
>> little longer than it could have, but they fixed it and their work is not
>> blocked. I think Robert's arguments about the cost of trivial tests apply.
>>
>> I agree that the author should be at liberty to choose which validation to
>> unit-test and which to skip as trivial, so documentation on this topic
>> should be in the form of guidelines, high-quality example code (i.e. clean
>> up the unit tests of IOs bundled with Beam SDK), and informal knowledge in
>> the heads of readers of this thread, rather than hard rules.
>>
>> On Tue, Mar 14, 2017 at 8:07 AM Ismaël Mejía  wrote:
>>
>> > +0.5
>> >
>> > I used to think that some of those tests were not worth, for example
>> > testBuildRead and
>> > testBuildReadAlt. However the reality is that these tests allowed me to
>> > find bugs both during the development of HBaseIO and just yesterday when
>> I
>> > tried to test the write support for the emulator with DataStoreIO (that
>> > lack

Re: [RESULT] [VOTE] Release 0.6.0, release candidate #2

2017-03-15 Thread Ismaël Mejía
Thanks Ahmet for dealing with the release. I just tried 'pip install
apache-beam' and the wordcount example, and as you said, it feels awesome
to see this working so easily now. Congrats to everyone working on the
Python SDK!


On Wed, Mar 15, 2017 at 8:17 AM, Ahmet Altay 
wrote:

> This release is now complete. Thanks to everyone who have helped make this
> release possible!
>
> Before sending a note to users@, I would like to make a pass over the
> website and simplify things now that we have an official python release. I
> did the first 'pip install apache-beam' today and it felt amazing!
>
> Ahmet
>
>
> On Tue, Mar 14, 2017 at 2:22 PM, Ahmet Altay  wrote:
>
> > I'm happy to announce that we have unanimously approved this release.
> >
> > There are 7 approving votes, 4 of which are binding:
> > * Aljoscha Krettek
> > * Davor Bonaci
> > * Ismaël Mejía
> > * Jean-Baptiste Onofré
> > * Robert Bradshaw
> > * Ted Yu
> > * Tibor Kiss
> >
> > There are no disapproving votes.
> >
> > Thanks everyone!
> >
> > Ahmet
> >
>


Re: Style: how much testing for transform builder classes?

2017-03-14 Thread Ismaël Mejía
+0.5

I used to think that some of those tests were not worth it, for example
testBuildRead and testBuildReadAlt. However, the reality is that these
tests allowed me to find bugs, both during the development of HBaseIO and
just yesterday when I tried to test write support for the emulator with
DatastoreIO (which lacked a parameter in testBuildWrite, didn't have a
testBuildWriteAlt, and broke in that case too). So I now believe they are
not necessarily useless.

I agree with the idea of trying to test the most important things first
and, as Kenneth said, trying to reduce the tests of successful validation.
However, I am not sure how this can be formalized in the style guide,
considering that the value of the tests seems to be based on the judgment
of both the developer and the reviewer.

One final point, if we achieve consensus: we should not forget that if we
remove some of these tests, we have to pay attention not to reduce the
current level of coverage.

Regards,
Ismaël



On Mon, Mar 13, 2017 at 8:37 PM, vikas rk  wrote:

> +0.5
>
> My two cents,
>
>   * However trivial the test is, it should be added, unless the user has
> an easy workaround to not having to wait a few days until the trivial
> fixes are merged to beam and then propagated to the runner.
>
>   * While I agree with trivial tests like "ensuring meaningful error
> message thrown for null options" or "ordering of parameters" not being
> strictly necessary and definitely not something that should be used as
> reference for other IOs, (staying consistent with other IOs is probably the
> reason why DatastoreIO has some of them), I still think some degree of
> leeway for the author is desirable depending on what level of user
> experience they want to provide, as long as it is not a burden for adding
> new tests.
>
>
>
> On 11 March 2017 at 14:16, Jean-Baptiste Onofré  wrote:
>
> > +1
> >
> > Testing is always hard, especially to have concrete tests. Reducing the
> > "noise" is a good idea.
> >
> > Regards
> > JB
> >
> >
> > On 03/10/2017 04:09 PM, Eugene Kirpichov wrote:
> >
> >> Hello,
> >>
> >> I've seen a pattern in a couple of different transforms (IOs) where we,
> I
> >> think, spend an excessive amount of code unit-testing the trivial
> builder
> >> methods.
> >>
> >> E.g. a significant part of
> >> https://github.com/apache/beam/blob/master/sdks/java/io/goog
> >> le-cloud-platform/src/test/java/org/apache/beam/sdk/io/
> >> gcp/datastore/DatastoreV1Test.java
> >> is
> >> devoted to that, and a significant part of the in-progress Hadoop
> >> InputFormat IO.
> >>
> >> In particular, I mean unit-testing trivial builder functionality such
> as:
> >> - That setting a parameter actually sets it
> >> - That setting parameters in one order gives the same effect as a
> >> different
> >> order
> >> - That a null value of a parameter is rejected with an appropriate error
> >> message
> >> - That a non-null value of a parameter is accepted
> >>
> >> I'd like to come to a consensus as to how much unit-testing of such
> stuff
> >> is appropriate when developing a new transform. And then put this
> >> consensus
> >> into the PTransform Style Guide
> >> .
> >>
> >> I think such tests are not worth their weight in lines of code, for a
> few
> >> reasons:
> >> - Whether or not a parameter actually takes effect, should be tested
> using
> >> a semantic test (RunnableOnService). Testing whether a getter returns
> the
> >> same thing a setter set is neither necessary (already covered by a
> >> semantic
> >> test) nor sufficient (it's possible that the expansion of the transform
> >> forgets to even call the getter).
> >> - Testing "a non-null value (or an otherwise valid value) is accepted"
> is
> >> redundant with every semantic test that supplies the value.
> >> - For testing of supplying a different order of parameters: my main
> >> objections are 1) I'm not sure what kind of reasonably-likely coding
> error
> >> this guards against, especially given most multi-parameter transforms
> use
> >> AutoValue anyway, and given that "does a setter take effect" is already
> >> tested via bullets above 2) given such a coding error exists, there's no
> >> way to tell which orders  or how many you should test; unit-testing two
> or
> >> three orders probably gives next to no protection, and unit-testing more
> >> than that is impractical.
> >> - Testing "a null value is rejected" effectively unit-tests a single
> line
> >> that says checkNotNull(...). I don't think it's worth 10 or so lines of
> >> code to test something this trivial: even if an author forgot to add
> this
> >> line, and then a used supplied null - the user will just get a less
> >> informative error message, possibly report it as a bug, and the error
> will
> >> easily get fixed. I.e. it's a very low-risk thing to get wrong.
> >>
> >> I think the following similar types of tests are still worth doing
> though:
> >> - Non-trivial parameter v

Re: [VOTE] Release 0.6.0, release candidate #2

2017-03-13 Thread Ismaël Mejía
+1 (non-binding)

- verified signatures + checksums
- ran mvn clean install -Prelease; all artifacts build and the tests run
smoothly (modulo some local issues I had with the installation of tox for
the Python SDK; I created a PR to fix those in case other people hit the
same trouble).

Some remaining remarks about the release, which I don't consider
blockers:

1. The Getting Started section in the main README.md needs to be updated
with information about creating/activating the virtualenv. At the moment,
just running mvn clean install won't work without this.

2. Both zip files in the current release produce a folder with the same
name ‘apache-beam-0.6.0’. This can be messy if users unzip both files into
the same folder (as happened to me). Each compressed file should produce a
directory with the exact same name as the file, so
apache-beam-0.6.0-python.zip would produce apache-beam-0.6.0-python, and
the other its respective directory.

3. The names of the release files should probably be different:

The source release could be just apache-beam-0.6.0.zip instead of
apache-beam-0.6.0-source-release.zip, considering that we don’t have binary
artifacts, or just apache-beam-0.6.0-src.zip, following the convention of
other Apache projects.

The Python release could also be renamed to
apache-beam-0.6.0-bin-python.zip instead of apache-beam-0.6.0-python.zip,
so users understand that these are executable files (though I am not sure
about that one, considering that Python is a scripting language).

Finally, I would prefer that we have a .tar.gz release, as JB mentioned in
the previous vote, and as most Apache projects do. In any case, if the zip
is somehow a requirement, it would be nice to have both a .zip and a
.tar.gz file.


Re: [VOTE] Release 0.6.0, release candidate #2

2017-03-12 Thread Ismaël Mejía
I also found an issue with the .md5 and .sha1 files of the Python release:
they refer to a different file name (a forgotten part of the renaming):

curl
https://dist.apache.org/repos/dist/dev/beam/0.6.0/apache-beam-0.6.0-python.zip.md5
7d4170e381ce0e1aa8d11bee2e63d151  apache-beam-0.6.0.zip

This should have been apache-beam-0.6.0-python.zip, and the same for the
.sha1, because if users run the standard tools to validate:
md5sum -c apache-beam-0.6.0-python.zip.md5
md5sum: apache-beam-0.6.0.zip: No such file or directory
apache-beam-0.6.0.zip: FAILED open or read
md5sum: WARNING: 1 listed file could not be read

I don't know if this is critical enough to trigger another vote, but it is
an issue too.
Ismaël.

On Sun, Mar 12, 2017 at 9:02 AM, Ahmet Altay 
wrote:

> Amit,
>
> I was able to successfully build in a clean environment with the following
> commands:
>
> git checkout tags/v0.6.0-RC2 -b RC2
> mvn clean install -Prelease
>
> I am not very familiar with the maven build process; it would be great if
> someone else could also confirm this.
>
> Ahmet
>
>
>
> On Sat, Mar 11, 2017 at 11:00 PM, Amit Sela  wrote:
>
> > Building the RC2 tag failed for me with: "mvn clean install -Prelease"
> on a
> > missing artifact "beam-sdks-java-harness" when trying to build
> > "beam-sdks-java-javadoc".
> >
> > I want to make sure It's not something local that happens in my env. so
> if
> > anyone else could validate this it would be great.
> >
> > Amit
> >
> > On Sat, Mar 11, 2017 at 9:48 PM Robert Bradshaw
> > 
> > wrote:
> >
> > > On Fri, Mar 10, 2017 at 9:05 PM, Ahmet Altay  >
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Please review and vote on the release candidate #2 for the version
> > 0.6.0,
> > > > as follows:
> > > > [ ] +1, Approve the release
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > >
> > > > The complete staging area is available for your review, which
> includes:
> > > > * JIRA release notes [1],
> > > > * the official Apache source release to be deployed to
> dist.apache.org
> > > > [2],
> > > > which is signed with the key with fingerprint 6096FA00 [3],
> > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > * source code tag "v0.6.0-RC2" [5],
> > > > * website pull request listing the release and publishing the API
> > > reference
> > > > manual [6].
> > > > * python artifacts are deployed along with the source release to to
> > > > dist.apache.org [2].
> > > >
> > >
> > > Are there plans also to deploy this at PyPi, and if so, what are the
> > > details?
> > >
> > >
> > > > A suite of Jenkins jobs:
> > > > * PreCommit_Java_MavenInstall [7],
> > > > * PostCommit_Java_MavenInstall [8],
> > > > * PostCommit_Java_RunnableOnService_Apex [9],
> > > > * PostCommit_Java_RunnableOnService_Flink [10],
> > > > * PostCommit_Java_RunnableOnService_Spark [11],
> > > > * PostCommit_Java_RunnableOnService_Dataflow [12]
> > > > * PostCommit_Python_Verify [13]
> > > >
> > > > Compared to release candidate #1, this candidate contains pull
> requests
> > > > #2217 [14], #2221 [15], # [16], #2224 [17], and #2225 [18]; see
> the
> > > > discussion for reasoning.
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > > > approval, with at least 3 PMC affirmative votes.
> > > >
> > > > Thanks,
> > > > Ahmet
> > > >
> > > > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?proje
> > > > ctId=12319527&version=12339256
> > > > [2] https://dist.apache.org/repos/dist/dev/beam/0.6.0/
> > > > [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
> > > > [4]
> > > https://repository.apache.org/content/repositories/orgapachebeam-1013/
> > > > [5] https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=r
> > > > efs/tags/v0.6.0-RC2
> > > > [6] https://github.com/apache/beam-site/pull/175
> > > > [7] https://builds.apache.org/view/Beam/job/beam_PreCommit_Java_
> > > > MavenInstall/8340/
> > > > [8] https://builds.apache.org/view/Beam/job/beam_PostCommit_
> > > > Java_MavenInstall/2877/
> > > > [9] https://builds.apache.org/view/Beam/job/beam_PostCommit_Java
> > > > _RunnableOnService_Apex/736/
> > > > [10] https://builds.apache.org/view/Beam/job/beam_PostCommit_Java
> > > > _RunnableOnService_Flink/1895/
> > > > [11] https://builds.apache.org/view/Beam/job/beam_PostCommit_Java
> > > > _RunnableOnService_Spark/1207/
> > > > [12] https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_
> > > > RunnableOnService_Dataflow/2526/
> > > > [13] https://builds.apache.org/view/Beam/job/beam_PostCommit_Pyth
> > > > on_Verify/1481/
> > > > [14] https://github.com/apache/beam/pull/2217
> > > > [15] https://github.com/apache/beam/pull/2221
> > > > [16] https://github.com/apache/beam/pull/
> > > > [17] https://github.com/apache/beam/pull/2224
> > > > [18] https://github.com/apache/beam/pull/2225
> > > >
> > >
> >
>


Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-03-02 Thread Ismaël Mejía
Hello,

I'm answering since I have been leading the refactor to hadoop-common. My
criterion for moving a class into hadoop-common is that it is used by more
than one other module or IO; this is the reason it is not big, but it can
grow if needed.

+1 for option #1 because of the visibility reasons you mention.
For the concrete PR I have the following remarks:

From looking at the PR I think today you can already do the basic refactors
that depend on hadoop-common (to avoid adding repeated code):
- Remove NullWritableCoder from hadoop/inputformat and refactor to use
WritableCoder from hadoop-common.
- Remove WritableCoder from hadoop/inputformat and refactor to use
WritableCoder from hadoop-common.

I have other comments, but since they are not directly related to the
refactoring I will address them in the PR.

Thanks for bringing this issue back to the mailing list, Stephen.
Ismaël


On Thu, Mar 2, 2017 at 3:32 PM, Jean-Baptiste Onofré 
wrote:

> By the way Stephen, when BEAM-59 is done, hadoop IO will only
> contain the hadoop format support (no HdfsFileSource or HdfsSink required
> as it will use the "regular" FileIO).
>
> Agree ?
>
> Regards
> JB
>
>
> On 03/02/2017 03:27 PM, Jean-Baptiste Onofré wrote:
>
>> Hi Stephen,
>>
>> I agree to use the following structure (and it's basically what I
>> proposed in a comment of the PR):
>>
>> io/hadoop
>> io/hadoop-common
>> io/hbase
>>
>> I would be more than happy to help on the "merge" of HdfsIO and
>> HadoopFormat.
>>
>> Regards
>> JB
>>
>> On 03/01/2017 08:00 PM, Stephen Sisk wrote:
>>
>>> I wanted to follow up on this thread since I see some potential blocking
>>> questions arising, and I'm trying to help dipti along with her PR.
>>>
>>> Dipti's PR[1] is currently written to put files into:
>>> io/hadoop/inputformat
>>>
>>> The recent changes to create hadoop-common created:
>>> io/hadoop-common
>>>
>>> This means that the overall structure if we take the HIFIO PR as-is would
>>> be:
>>> io/hadoop/inputformat - the HIFIO (copies of some code in
>>> hadoop-common and
>>> hdfs, but no dependency on hadoop-common)
>>> io/hadoop-common - module with some shared code
>>> io/hbase - hbase IO transforms
>>> io/hdfs - FileInputFormat IO transforms - much shared code with
>>> hadoop/inputformat.
>>>
>>> Which I don't think is great b/c there's a common dir, but only some
>>> directories use it, and there's lots of similar-but-slightly different
>>> code
>>> in hadoop/inputformat and hdfsio. I don't believe anyone intends this
>>> to be
>>> the final result.
>>>
>>> After looking at the comments in this thread, I'd like to recommend the
>>> following end-result:  (#1)
>>> io/hadoop -  the HIFIO (dependency on  hadoop-common) - contains both
>>> HadoopInputFormatIO.java and HDFSFileSink/HDFSFileSource (so contents of
>>> hdfs and hadoop/inputformat)
>>> io/hadoop-common - module with some shared code
>>> io/hbase - hbase IO transforms
>>>
>>> To get there I propose the following steps:
>>> 1. finish current PR [1] with only renaming the containing module from
>>> hadoop/inputformat -> hadoop, and taking dependency on hadoop-common
>>> 2. someone does cleanup to reconcile hdfs and hadoop directories,
>>> including
>>> renaming the files so they make sense
>>>
>>> I would also be fine with: (#2)
>>> io/hadoop - container dir only
>>> io/hadoop/common
>>> io/hadoop/hbase
>>> io/hadoop/inputformat
>>>
>>> I think the downside of #2 is that it hides hbase, which I think deserves
>>> to be top level.
>>>
>>> Other comments:
>>> It should be noted that when we have all modules use hadoop-common, we'll
>>> be forcing all hadoop modules to have the same dependencies on hadoop - I
>>> think this makes sense, but worth noting that as the one advantage of the
>>> "every hadoop IO transform has its own hadoop dependency"
>>>
>>> On the naming discussion: I personally prefer "inputformat" as the
>>> name of
>>> the directory, but I defer to the folks who know the hadoop community
>>> more.
>>>
>>> S
>>>
>>> [1] HadoopInputFormatIO PR - https://github.com/apache/beam/pull/1994
>>> [2] HdfsIO dependency change PR -
>>> https://github.com/apache/beam/pull/2087
>>>
>>>
>>> On Fri, Feb 17, 2017 at 9:38 AM, Dipti Kulkarni <
>>> dipti_dkulka...@persistent.com> wrote:
>>>
>>> Thank you  all for your inputs!


 -Original Message-
 From: Dan Halperin [mailto:dhalp...@google.com.INVALID]
 Sent: Friday, February 17, 2017 12:17 PM
 To: dev@beam.apache.org
 Subject: Re: Merge HadoopInputFormatIO and HDFSIO in a single module

 Raghu, Amit -- +1 to your expertise :)

 On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela 
 wrote:

 I agree with Dan on everything regarding HdfsFileSystem - it's super
> convenient for users to use TextIO with HdfsFileSystem rather then
> replacing the IO and also specifying the InputFormat type.
>
> I disagree on "HadoopIO" - I think that people who work with Hadoop
> would find this na

Re: Next major milestone: first stable release

2017-03-01 Thread Ismaël Mejía
Just added the two I mentioned in my previous message. Thanks Davor.

On Wed, Mar 1, 2017 at 6:27 PM, Aljoscha Krettek 
wrote:

> On it!
>
> On Wed, 1 Mar 2017 at 18:17 Davor Bonaci  wrote:
>
> > We've now moved the discussion into the content of the first stable
> > release.
> >
> > I've created a version in JIRA called "First stable release". I'd like to
> > invite everyone to triage JIRA issues you care about, and assign "Fix
> > Versions" field to "First stable release" to mark the issue blocking for
> > the first stable release. This creates a project-wide burndown list and
> we
> > can track our progress towards the goal.
> >
> > I'll try make a pass over as many JIRA issues as possible over the next
> day
> > or two, but it would be great if everyone, particularly component leads
> in
> > JIRA, take a pass too!
> >
> > On Wed, Mar 1, 2017 at 2:51 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > Yes, fully agree.
> > >
> > > As far as I understood/know, BEAM-59 is targeted for Beam 1.0 (it's
> what
> > > we discussed with Pei and Davor).
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 03/01/2017 11:39 AM, Ismaël Mejía wrote:
> > >
> > >> Also joining a bit late, I agree with Amit, HDFS improvements are a
> > really
> > >> good thing to have before the stable release. I will also add the
> > >> IOChannelFactory refactorings to support things like
> > Read.from(“hdfs://”)
> > >> aka BEAM-59.
> > >>
> > >> In the worse case particular IOs can still be marked as experimental
> to
> > >> show users that they can still evolve, even after the first ‘stable’
> > >> release, the part that we have to pay more attention is not to break
> the
> > >> core SDK. And the question about Data Locality (BEAM-673) is where I
> am
> > >> afraid that we can have some breaking changes because there is not a
> way
> > >> from the IOs (Source/Sink) to send ‘a hint’ to the runner about Data
> > >> Locality (please correct me if I am wrong). And this even if not
> > supported
> > >> in the first stable release by any runner, would be a really great
> thing
> > >> to
> > >> have and I think this is a good moment to do it, to avoid breaking any
> > >> IO/runner signature because of new methods.
> > >>
> > >> What do the others think ?
> > >> Ismaël
> > >>
> > >>
> > >>
> > >> On Tue, Feb 28, 2017 at 6:29 PM, Amit Sela 
> > wrote:
> > >>
> > >> Joining in just a bit late, I'll be quick and say that IMHO the SDK is
> > >>> mature enough and so my only point to add is *HDFS support*.
> > >>> I think that in terms of adoption we have to support HDFS as a
> > >>> "first-class
> > >>> citizen" via the FileSystem API, and provide data locality (batch) on
> > top
> > >>> of it - it serves not only HDFS, but other eco-system IOs such as
> > HBase.
> > >>> From my experience with talking to people and companies, most are
> > running
> > >>> batch in production with some streaming POC or even production use,
> but
> > >>> batch still takes most of production work. If we give them the same
> > >>> production results, with the Beam API, we can on-board them faster
> and
> > >>> make
> > >>> it easier for them to adopt streaming as well.
> > >>>
> > >>> Thanks,
> > >>> Amit
> > >>>
> > >>> On Tue, Feb 28, 2017 at 7:12 PM Davor Bonaci 
> wrote:
> > >>>
> > >>> Alright -- sounds like we have a consensus to proceed with the first
> > >>>>
> > >>> stable
> > >>>
> > >>>> release after 0.6.0, targeting end of March / early April. I'll kick
> > off
> > >>>> separate threads for specific decisions we need to make.
> > >>>>
> > >>>> On Thu, Feb 23, 2017 at 6:07 AM, Aljoscha Krettek <
> > aljos...@apache.org>
> > >>>> wrote:
> > >>>>
> > >>>> I think we're ready for this! The public APIs are in very good
> shape,
> > >>>>> especially now that we have the new DoFn, user facing state and
&g

Re: Next major milestone: first stable release

2017-03-01 Thread Ismaël Mejía
Also joining a bit late: I agree with Amit that HDFS improvements are a
really good thing to have before the stable release. I will also add the
IOChannelFactory refactorings to support things like Read.from("hdfs://"),
aka BEAM-59.
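
That is, once BEAM-59 lands, reading from HDFS would look the same as
reading from any other filesystem; a sketch of the intended user
experience (path and cluster name are illustrative):

p.apply(TextIO.Read.from("hdfs://namenode:8020/logs/part-*"));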

In the worst case, particular IOs can still be marked as experimental to
show users that they can still evolve even after the first ‘stable’
release; the part where we have to pay more attention is not breaking the
core SDK. The question about data locality (BEAM-673) is where I am afraid
we could have some breaking changes, because there is no way for the IOs
(Source/Sink) to send a hint to the runner about data locality (please
correct me if I am wrong). Even if not supported by any runner in the
first stable release, this would be a really great thing to have, and I
think this is a good moment to do it, to avoid later breaking any
IO/runner signature because of new methods.

What do the others think ?
Ismaël



On Tue, Feb 28, 2017 at 6:29 PM, Amit Sela  wrote:

> Joining in just a bit late, I'll be quick and say that IMHO the SDK is
> mature enough and so my only point to add is *HDFS support*.
> I think that in terms of adoption we have to support HDFS as a "first-class
> citizen" via the FileSystem API, and provide data locality (batch) on top
> of it - it serves not only HDFS, but other eco-system IOs such as HBase.
> From my experience with talking to people and companies, most are running
> batch in production with some streaming POC or even production use, but
> batch still takes most of production work. If we give them the same
> production results, with the Beam API, we can on-board them faster and make
> it easier for them to adopt streaming as well.
>
> Thanks,
> Amit
>
> On Tue, Feb 28, 2017 at 7:12 PM Davor Bonaci  wrote:
>
> > Alright -- sounds like we have a consensus to proceed with the first
> stable
> > release after 0.6.0, targeting end of March / early April. I'll kick off
> > separate threads for specific decisions we need to make.
> >
> > On Thu, Feb 23, 2017 at 6:07 AM, Aljoscha Krettek 
> > wrote:
> >
> > > I think we're ready for this! The public APIs are in very good shape,
> > > especially now that we have the new DoFn, user facing state and timers
> > and
> > > splittable DoFn. Not all Runners support the more advanced features but
> > we
> > > can work on this after a stable release and there are enough runners
> that
> > > support a large part of the features.
> > >
> > > Best,
> > > Aljoscha
> > >
> > > On Thu, 23 Feb 2017 at 06:15 Kenneth Knowles 
> > > wrote:
> > >
> > > > On Wed, Feb 22, 2017 at 5:35 PM, Chamikara Jayalath <
> > > chamik...@apache.org>
> > > > wrote:
> > > > >
> > > > > I think, this point applies to Python SDK as well (though as you
> > > > mentioned,
> > > > > API hiding in Python is a mere convention (prefix with underscore)
> > not
> > > > > enforced. We already have mechanism for marking APIs as deprecated
> > > which
> > > > > might be useful here:
> > > > > https://github.com/apache/beam/blob/master/sdks/python/
> > > > > apache_beam/utils/annotations.py
> > > > >
> > > > > - Cham
> > > > >
> > > >
> > > > Perhaps an explicit @public annotation would fit. I could imagine
> > easily
> > > > generating a spec to check against from such annotations, though
> > tooling
> > > is
> > > > secondary to documentation.
> > > >
> > > > Kenn
> > > >
> > >
> >
>


Re: Interest in a (virtual) contributor meeting?

2017-02-23 Thread Ismaël Mejía
+1 to doing this periodically, on different subjects.

It is a good idea to have a sort of mini agenda, in the sense that the two
previous meetings had really different focuses: the first one was about
contributors meeting each other and discussing ongoing work just after the
project started at Apache; the second one was really focused on the
SplittableDoFn proposal, aimed more at runner writers and IO authors, and
was a real 'tour de force' led by Eugene.

On Wed, Feb 22, 2017 at 3:19 PM, Kobi Salant  wrote:

> +1
>
> On Feb 22, 2017, 2:54 PM, "Aljoscha Krettek" 
> wrote:
>
> > +1
> >
> > On Wed, 22 Feb 2017 at 10:08 JingsongLee 
> wrote:
> >
> > > +1
> > >
> > >
> > > Sent from Alibaba Mail for iPhone -- Original Message -- From: Davor
> > > Bonaci <da...@apache.org> Date: 2017-02-22 11:19:12
> > > To: dev@beam.apache.org Subject: Interest in a (virtual) contributor
> > > meeting? In
> > the
> > > early days of the project, we have held a few meetings for the
> > > initial community to get to know each other. Since then, the community
> > has
> > > grown a huge amount, but we haven't organized any get-togethers.
> > >
> > > I wanted to gauge interest in a potential video conference call in the
> > near
> > > future. No specific agenda -- simply a chance for everyone to meet
> others
> > > and see the faces of people we share a common passion with. Of course,
> an
> > > open discussion on any topic of interest to the contributor community
> is
> > > welcome. This would be strictly informal -- any decisions are reserved
> > for
> > > the mailing list discussions.
> > >
> > > If you'd be interested in attending, please reply back. If there's
> > > sufficient interest, I'd be happy to try to organize something in the
> > near
> > > future.
> > >
> > > Thanks!
> > >
> > > Davor
> > >
> >
>


Re: Metrics for Beam IOs.

2017-02-22 Thread Ismaël Mejía
Hello,

Thanks everyone for giving your points of view. I was waiting to see how
the conversation evolved to summarize it and continue on the open points.

Points where mostly everybody agrees (please correct me if somebody still
disagrees):

- Default metrics should not affect performance; for that reason they should
be calculated by default (in that case disabling them matters less, and we
can probably evaluate making this configurable per IO in the future).

- There is a clear distinction between metrics that are useful for the user
and metrics that are useful for the runner. The user metrics are the most
important for user experience. (we will focus on these).

- We should support metrics for IOs in both APIs: Source/Sink based IOs and
SDF.

- IO metrics should focus on what's relevant to the pipeline. Relevant
metrics should be discussed on a per-IO basis. (It is hard to generalize,
but we will probably make progress faster by just creating metrics for each
one and then consolidating the common ones.)

Points where consensus is not yet achieved

- Should IOs expose metrics that are useful for the runners? And if so, how?

I think this is important but not relevant to the current discussion, so we
should probably open a different conversation for it. My only comment here
is that we must prevent the interference that comes from mixing runner- and
user-oriented metrics (probably with a namespace).
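For illustration, a minimal sketch of how a namespace could keep the two
apart with the Java Metrics API as it stands today; the "KafkaIO" and
"KafkaIO.runner" namespace strings are just an assumed convention, not
something Beam defines:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.PipelineResult;
    import org.apache.beam.sdk.metrics.MetricNameFilter;
    import org.apache.beam.sdk.metrics.MetricQueryResults;
    import org.apache.beam.sdk.metrics.MetricsFilter;

    class MetricsQueryExample {
      // Query only the user-facing namespace; runner-oriented metrics would
      // live under a separate namespace (e.g. "KafkaIO.runner") and so would
      // never show up in user-facing query results.
      static MetricQueryResults userMetrics(Pipeline pipeline) {
        PipelineResult result = pipeline.run();
        // (in practice you may want to waitUntilFinish() before querying)
        return result.metrics().queryMetrics(
            MetricsFilter.builder()
                .addNameFilter(MetricNameFilter.inNamespace("KafkaIO"))
                .build());
      }
    }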

- Where is the boundary of Beam's responsibilities for metrics? Should we
have a runner-agnostic way to collect metrics (different from polling the
pipeline result)?

We can offer a plugin-like system to push metrics into given sinks; JB
proposed an approach similar to Karaf's Decanter. There is also the issue
of pull-based metrics like those of Codahale.

As a user I think having something like what JB proposed is nice; even a
REST service to query pipelines in a runner-agnostic way would make me
happy too, but again it is up to the community to decide how we can
implement this and whether it should be part of Beam.
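To illustrate the push idea, a purely hypothetical shape; no such interface
exists in Beam, and the MetricsSink name and method are invented here in
the spirit of JB's Decanter-like proposal:

    import java.io.Serializable;
    import org.apache.beam.sdk.metrics.MetricQueryResults;

    // Hypothetical plugin point: the runner would periodically snapshot the
    // pipeline's metrics and hand them to a user-configured sink
    // (Elasticsearch, Kafka, ...), independently of the execution engine.
    public interface MetricsSink extends Serializable {
      void publish(MetricQueryResults snapshot);
    }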

What do you guys think about the pending issues? Did I miss anything else?

Ismaël



On Sat, Feb 18, 2017 at 9:02 PM, Jean-Baptiste Onofré 
wrote:

> Yes, agree Ben.
>
> More than the collected metrics, my question is more how to
> "expose"/"push" those metrics.
>
> Imagine, I have a pipeline executed using the Spark runner on a Spark
> cluster. Now, I change this pipeline to use the dataflow runner on Google
> Cloud Dataflow service. I have to completely change the way of getting
> metrics.
> Optionally, if I'm able to define something like --metric-appender=foo.cfg
> containing, in addition to the execution-engine-specific layer, a target
> where I can push the metrics (like Elasticsearch or Kafka), I can implement
> a simple and generic way of harvesting metrics.
> It's totally fine to have some metric specific to Spark when using the
> Spark runner and others specific to Dataflow when using the dataflow
> runner, my point is more: how I can send the metrics to my global
> monitoring/reporting layer.
> Somehow, we can see the metrics as a meta side output of the pipeline,
> sent to a target sink.
>
> I don't want to change or bypass the execution-engine specifics; I mean to
> provide a way for the user to target his own system (optional).
>
> Regards
> JB
>
>
> On 02/18/2017 06:15 PM, Ben Chambers wrote:
>
>> The question is how much of metrics has to be in the runner and how much
>> can be shared. So far we share the API the user uses to report metrics -
>> this is the most important part since it is required for pipelines to be
>> portable.
>>
>> The next piece that could be shared is something related to reporting.
>> But,
>> implementing logically correct metrics requires the execution engine to be
>> involved, since it depends on how and which bundles are retried. What I'm
>> not sure about is how much can be shared and/or made available for these
>> runners vs. how much is tied to the execution engine.
>>
>> On Sat, Feb 18, 2017, 8:52 AM Jean-Baptiste Onofré 
>> wrote:
>>
>> For Spark, I fully agree. My point is more when the execution engine or
>>> runner doesn't provide anything or we have to provide a generic way of
>>> harvesting/pushing metrics.
>>>
>>> It could at least be a documentation point. Actually, I'm evaluation the
>>> monitoring capabilities of the different runners.
>>>
>>> Regards
>>> JB
>>>
>>> On 02/18/2017 05:47 PM, Amit Sela wrote:
>>>
 That's what I don't understand - why would we want that ?
 Taking on responsibilities in the "stack" should have a good reason.

 Someone choosing to run Beam on Spark/Flink/Apex would have to take care

>>> of
>>>
 installing those clusters, right ? perhaps providing them a resilient
 underlying FS ? and if he wants, setup monitoring (which even with the

>>> API
>>>
 proposed he'd have to do).

 I just don't see why it should be a part of the runner and/or Metrics

>>> API.
>>>

 

Re: Hbase IO preview

2017-02-22 Thread Ismaël Mejía
For those interested (and working on Beam snapshots), HBaseIO was merged
today (thanks Dan and JB for the review/help). Notice that some design
decisions in the IO are fundamentally aligned with the idea of helping
users switch 'easily' between HBase and Bigtable. But don't worry: any user
of HBase will understand the API.
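For reference, reading with the new IO looks roughly like this (a sketch
based on the merged API; the ZooKeeper quorum value and table name below
are placeholders):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.hbase.HBaseIO;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;

    class HBaseReadExample {
      static PCollection<Result> readTable(Pipeline pipeline) {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host"); // placeholder
        // Rows come back as plain HBase Result objects, which is why
        // existing HBase users should feel at home with the API.
        return pipeline.apply(
            HBaseIO.read().withConfiguration(conf).withTableId("my-table"));
      }
    }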

There are already some small optimizations to do (there are some JIRAs
filed for this). This message is also a call for interested users to report
any issues or additional needs not covered by the current version of
the IO.

Thanks,
Ismaël

On Wed, Dec 14, 2016 at 8:42 PM, Ismaël Mejía  wrote:

> We are having progress with this one, we will keep you informed once the
> branch is ready for testing/contribution so you can try it (or help us
> improve it).
>
> For the moment you can track the progress following this JIRA
> https://issues.apache.org/jira/browse/BEAM-1157
>
> On Wed, Dec 14, 2016 at 3:25 PM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi Andrew,
>>
>> I have a protobuf issue on this IO that I would like to address.
>>
>> Sorry, I didn't have time to work on it this week. I do my best to push
>> something work-able asap.
>>
>> Regards
>> JB
>>
>>
>> On 12/14/2016 03:18 PM, Andrew Hoblitzell wrote:
>>
>>> Any update on which branch the preview for HBase IO might be available
>>> in?
>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>


Re: Better developer instructions for using Maven?

2017-02-16 Thread Ismaël Mejía
JB, maybe I was not clear: when I talked about the tests I was thinking
more about executing them in parallel on the same machine. This is not the
case today for some test suites, and for those the tests need to be refined
to support it and configured via the pom to execute in parallel per method,
class, etc. Of course we need to check if this is worth it, because I can
imagine that the most expensive part, for example in the IO case, comes
from starting the embedded versions of the IOs (e.g. HadoopMiniCluster,
MongodExecutable, HBaseTestingUtility, etc.) and not from the tests
themselves, but this has to be evaluated.



On Wed, Feb 15, 2017 at 5:46 PM, Jean-Baptiste Onofré 
wrote:

> On Jenkins it's possible to run several jobs in the same time but on
> different executor. That's the easiest way.
>
> Regards
> JB
>
> On Feb 15, 2017, 10:15, at 10:15, "Ismaël Mejía" 
> wrote:
> >This question got lost in the discussion, but there is a small
> >improvement
> >that we can do:
> >
> >> Just to check, are we doing parallel builds?
> >
> >We are on jenkins, not in travis, there is an ongoing PR to fix this.
> >
> >What we can improve is to check if we can run some of the test suites
> >in
> >parallel to gain some extra time. For example most of the IOs and some
> >runners don't execute tests in parallel.
> >
> >Ismael
> >
> >(slightly related) is there a way to change the timeout of Travis
> >jobs? Because they are failing most of the time now because of this, and
> >it
> >is quite noisy to have so many false positives.
> >
> >
> >
> >
> >On Fri, Feb 10, 2017 at 8:00 PM, Robert Bradshaw <
> >rober...@google.com.invalid> wrote:
> >
> >> On Fri, Feb 10, 2017 at 8:45 AM, Dan Halperin
> > >> >
> >> wrote:
> >>
> >> > On Fri, Feb 10, 2017 at 7:42 AM, Kenneth Knowles
> > >> >
> >> > wrote:
> >> >
> >> > > On Feb 10, 2017 07:36, "Dan Halperin"
> >
> >> > wrote:
> >> > >
> >> > > Before we added checkstyle it was under a minute. Now it's over
> >five?
> >> > > That's awful IMO
> >> > >
> >> > >
> >> > > Checkstyle didn't cause all that, did it?
> >> > >
> >> >
> >> > The "5 minutes" was going with Aviem's numbers after this change.
> >But
> >> yes,
> >> > Checkstyle alone substantially increased (>+50%) the time from what it was
> >> previously
> >> > to adding it back to the default build.
> >>
> >>
> >> Just to check, are we doing parallel builds?
> >>
> >>
> >> >
> >> > Noting that findbugs takes quite a lot more time. Javadoc and jar
> >are the
> >> > > other two slow ones.
> >> > >
> >> > > RAT is fast. But it has very poor error messages, so we wouldn't
> >want a
> >> > new
> >> > > contributor trying to figure out what is going on without our
> >help.
> >> > >
> >> >
> >> > There is a larger philosophical issue here: is there a point of
> >Jenkins
> >> > precommit testing? Why not just make `mvn install` run everything
> >that
> >> > Jenkins does? For that matter, why don't committers just push
> >directly to
> >> > master? Wouldn't that make everyone's life easier?
> >> >
> >> > I'd argue that's not true.
> >> >
> >> > 1. Developer productivity -- Jenkins should run many more checks
> >than
> >> > developers do. Especially time-, resource-, or setup- intensive
> >tasks.
> >> > 2. Automated enforcement -- Jenkins is better at running the right
> >> commands
> >> > than we are.
> >> > 3. Lower the barrier to entry -- individual developers need not
> >have a
> >> > running Spark/Flink/Apex/Dataflow setup in order to contribute
> >code.
> >> > 4. Focus on the user -- someone checking out the code and using it
> >for
> >> the
> >> > first time does not care whether the code style checks or has the
> >right
> >> > licenses -- that should have been enforced by the Beam team before
> >> > committing.
> >> >
> >> > We should be *very* choosy about what we enforce on every developer
> >every
> >> > time they go to compile. I probably compile Beam 50x-100x a day.

Re: Better developer instructions for using Maven?

2017-02-15 Thread Ismaël Mejía
This question got lost in the discussion, but there is a small improvement
that we can do:

> Just to check, are we doing parallel builds?

We are on jenkins, not in travis, there is an ongoing PR to fix this.

What we can improve is to check if we can run some of the test suites in
parallel to gain some extra time. For example most of the IOs and some
runners don't execute tests in parallel.

Ismael

(slightly related) is there a way to change the timeout of Travis
jobs? Because they are failing most of the time now because of this, and it
is quite noisy to have so many false positives.




On Fri, Feb 10, 2017 at 8:00 PM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> On Fri, Feb 10, 2017 at 8:45 AM, Dan Halperin  >
> wrote:
>
> > On Fri, Feb 10, 2017 at 7:42 AM, Kenneth Knowles  >
> > wrote:
> >
> > > On Feb 10, 2017 07:36, "Dan Halperin" 
> > wrote:
> > >
> > > Before we added checkstyle it was under a minute. Now it's over five?
> > > That's awful IMO
> > >
> > >
> > > Checkstyle didn't cause all that, did it?
> > >
> >
> > The "5 minutes" was going with Aviem's numbers after this change. But
> yes,
> > Checkstyle alone substantially increased (>+50%) the time from what it was
> previously
> > to adding it back to the default build.
>
>
> Just to check, are we doing parallel builds?
>
>
> >
> > Noting that findbugs takes quite a lot more time. Javadoc and jar are the
> > > other two slow ones.
> > >
> > > RAT is fast. But it has very poor error messages, so we wouldn't want a
> > new
> > > contributor trying to figure out what is going on without our help.
> > >
> >
> > There is a larger philosophical issue here: is there a point of Jenkins
> > precommit testing? Why not just make `mvn install` run everything that
> > Jenkins does? For that matter, why don't committers just push directly to
> > master? Wouldn't that make everyone's life easier?
> >
> > I'd argue that's not true.
> >
> > 1. Developer productivity -- Jenkins should run many more checks than
> > developers do. Especially time-, resource-, or setup- intensive tasks.
> > 2. Automated enforcement -- Jenkins is better at running the right
> commands
> > than we are.
> > 3. Lower the barrier to entry -- individual developers need not have a
> > running Spark/Flink/Apex/Dataflow setup in order to contribute code.
> > 4. Focus on the user -- someone checking out the code and using it for
> the
> > first time does not care whether the code style checks or has the right
> > licenses -- that should have been enforced by the Beam team before
> > committing.
> >
> > We should be *very* choosy about what we enforce on every developer every
> > time they go to compile. I probably compile Beam 50x-100x a day.
> Literally,
> > the extra minutes you want to add here will cost me an hour daily.
> >
>
> By the same token of having a different bar for the Jenkins presubmit vs.
> what's run locally, I think it makes a lot of sense to run a different
> command for iterative development than you run before creating a pull
> request. E.g. during development I'll often run only one test rather than
> the entire suite, but do run the entire suite occasionally (often before
> commit, especially before pushing).
>
> The contributor's guide should give a suggested command to run before
> creating a PR, right in the docs of how to create a PR, which may be more
> expensive than what you run during development. IMHO, this should be fairly
> comprehensive (certainly tests and checkstyle, maybe javadoc and findbugs).
> This should be the "default" command that the one-time-contributor should
> know. For those compiling 50x or more a day, I think the burden of learning
> a second (or more) cheaper commands is not high, and we could even put such
> a thing in the docs (and hopefully a common maven convention like "mvn
> test").
>
> I've listed the fraction of commits I think will break one of the following
> > if that property is not tested:
> >
> > * compiling (100%)
> > * tests (100%)
> > * checkstyle (90%)
> > * javadoc (30%)
> > * findbugs (5%)
> > * rat (1%)
> >
> > So you can see where I stand and why. I'm sorry that 1/20 PRs has Apache
> > RAT catch a licensing issue or Findbugs catch a threading issue -- you
> can
> > always get a larger set of the precommit checks using -Prelease, though
> of
> > course the integration tests and runnableonservice tests may catch more
> > issues still. But I want my developer minutes back for the 95%+ of the
> > rest.
> >
> > Dan
> >
>


Metrics for Beam IOs.

2017-02-14 Thread Ismaël Mejía
Hello,

The new metrics API allows us to integrate some basic metrics into the Beam
IOs. I have been following some discussions about this on JIRAs/PRs, and I
think it is important to discuss the subject here so we can have more
awareness and obtain ideas from the community.

First I want to thank Ben for his work on the metrics API, and Aviem for
his ongoing work on metrics for IOs (e.g. KafkaIO), which made me aware of
this subject.

There are some basic ideas to discuss, e.g.:

- What are the responsibilities of Beam IOs in terms of Metrics
(considering the fact that the actual IOs, server + client, usually provide
their own)?

- What metrics are relevant to the pipeline (or to particular IOs)? Kafka
backlog, for one, could indicate that a pipeline is behind the ingestion rate.

- Should metrics be calculated in IOs by default or not?

- If metrics are enabled by default, does it make sense to allow users to
disable them?

Well, these are just some questions around the subject, so that we can
create a common set of practices for including metrics in the IOs and
eventually improve the transform guide with them. What do you think? Do you
have other questions/ideas?
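As a starting point for the discussion, a minimal sketch of what a cheap,
default metric could look like inside an IO's read ParDo, using the new
Metrics API (the "MyIO"/"recordsRead" strings are just an example
convention, not a Beam-defined name):

    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.transforms.DoFn;

    class ReadRecordsFn extends DoFn<byte[], byte[]> {
      // A simple counter is cheap to maintain, so it is a natural candidate
      // for a metric that is enabled by default.
      private final Counter recordsRead = Metrics.counter("MyIO", "recordsRead");

      @ProcessElement
      public void processElement(ProcessContext c) {
        recordsRead.inc();
        c.output(c.element());
      }
    }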

Thanks,
Ismaël


Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-30 Thread Ismaël Mejía
+1 (non-binding)

- verified signatures + checksums
- ran mvn clean verify -Prelease; all artifacts build and the tests run
smoothly

Great to see a shorter release cycle, the improvements and the new IOs.


On Fri, Jan 27, 2017 at 9:55 PM, Jean-Baptiste Onofré 
wrote:

> Hi everyone,
>
> Please review and vote on the release candidate #1 for the version 0.5.0
> as follows:
>
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
>
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org
> [2], which is signed with the key with fingerprint C8282E76 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v0.5.0-RC1" [5],
> * website pull request listing the release and publishing the API
> reference manual [6].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PPMC affirmative votes.
>
> Thanks,
> JB
>
> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338859
> [2] https://dist.apache.org/repos/dist/dev/beam/0.5.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1010/
> [5] https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=refs/tags/v0.5.0-RC1
> [6] https://github.com/apache/beam-site/pull/132
>


Re: [ANNOUNCEMENT] New committers, January 2017 edition!

2017-01-27 Thread Ismaël Mejía
Congratulations, well deserved guys !


On Fri, Jan 27, 2017 at 9:28 AM, Amit Sela  wrote:

> Welcome and congratulations to all!
>
> On Fri, Jan 27, 2017, 10:12 Ahmet Altay  wrote:
>
> > Thank you all! And congratulations to other new committers.
> >
> > Ahmet
> >
> > On Thu, Jan 26, 2017 at 9:45 PM, Kobi Salant 
> > wrote:
> >
> > > Congrats! Well deserved Stas
> > >
> > > On Jan 27, 2017, 7:26 AM, "Frances Perry" wrote:
> > >
> > > > Woohoo! Congrats ;-)
> > > >
> > > > On Thu, Jan 26, 2017 at 9:05 PM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > > > wrote:
> > > >
> > > > > Welcome aboard !⁣
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > > On Jan 27, 2017, 01:27, at 01:27, Davor Bonaci 
> > > wrote:
> > > > > >Please join me and the rest of Beam PMC in welcoming the following
> > > > > >contributors as our newest committers. They have significantly
> > > > > >contributed
> > > > > >to the project in different ways, and we look forward to many more
> > > > > >contributions in the future.
> > > > > >
> > > > > >* Stas Levin
> > > > > >Stas has contributed across the breadth of the project, from the
> > Spark
> > > > > >runner to the core pieces and Java SDK. Looking at code
> > contributions
> > > > > >alone, he authored 43 commits and reported 25 issues. Stas is very
> > > > > >active
> > > > > >on the mailing lists too, contributing to good discussions and
> > > > > >proposing
> > > > > >improvements to the Beam model.
> > > > > >
> > > > > >* Ahmet Altay
> > > > > >Ahmet is a major contributor to the Python SDK, both in terms of
> > > design
> > > > > >and
> > > > > >code contribution. Looking at code contributions alone, he
> authored
> > 98
> > > > > >commits and reviewed dozens of pull requests. With Python SDK’s
> > > > > >imminent
> > > > > >merge to the master branch, Ahmet contributed towards
> establishing a
> > > > > >new
> > > > > >major component in Beam.
> > > > > >
> > > > > >* Pei He
> > > > > >Pei has been contributing to Beam since its inception,
> accumulating
> > a
> > > > > >total
> > > > > >of 118 commits since February. He has made several major
> > > contributions,
> > > > > >most recently by redesigning IOChannelFactory / FileSystem APIs
> (in
> > > > > >progress), which would extend Beam’s portability to many
> additional
> > > > > >file
> > > > > >systems and cloud providers.
> > > > > >
> > > > > >Congratulations to all three! Welcome!
> > > > > >
> > > > > >Davor
> > > > >
> > > >
> > >
> >
>


Re: Request for becoming a contributor

2017-01-24 Thread Ismaël Mejía
Similar to yesterday's discussion about opening access to the Slack
channel, I wonder if it makes sense to let people assign themselves as
contributors and pick JIRAs without asking first. Is this possible with
Apache's JIRA? And do you think this is a good idea?


On Tue, Jan 24, 2017 at 7:15 AM, Vincent Wang  wrote:

> Got it, thanks!
>
> On Tue, Jan 24, 2017 at 2:12 PM, Kenneth Knowles wrote:
>
> > Done! And welcome! You are probably already communicating, but I have
> > CC'd Manu Zhang anyhow.
> >
> > On Mon, Jan 23, 2017 at 9:41 PM, Vincent Wang 
> wrote:
> >
> > > Hi,
> > >
> > > This is Huafeng from the Apache Gearpump (incubating) team, and I'd like
> > > to make a contribution to the Gearpump runner on Beam.
> > > Would you mind adding me to the contributor list? My jira system email
> is
> > > huafeng.w...@intel.com
> > >
> > > Thanks,
> > > Huafeng
> > >
> >
>


Re: Beam Fn API

2017-01-24 Thread Ismaël Mejía
Awesome job, Lukasz! I have to confess that the first time I heard about
the Fn API idea I was a bit incredulous, but you are making it real.
Amazing!

Just one question from your document: you said that 80% of the extra (15%)
time goes into encoding and decoding the data for your test case. Can you
expand on your current ideas to improve this? (I am not sure I completely
understand the issue.)
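For context on what that encode/decode step involves (a sketch of the
mechanism only, not Lukasz's benchmark, and assuming the CoderUtils helper
in the Java SDK): every element crossing the Fn API data plane is
round-tripped through its coder, e.g.:

    import org.apache.beam.sdk.coders.CoderException;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.util.CoderUtils;

    class CoderRoundTrip {
      static String roundTrip(String element) throws CoderException {
        // Each element sent over the data plane is serialized on one side
        // and deserialized on the other; for small elements this
        // per-element coder work can easily dominate the measured overhead.
        byte[] wire = CoderUtils.encodeToByteArray(StringUtf8Coder.of(), element);
        return CoderUtils.decodeFromByteArray(StringUtf8Coder.of(), wire);
      }
    }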


On Mon, Jan 23, 2017 at 7:10 PM, Lukasz Cwik 
wrote:

> Responded inline.
>
> On Sat, Jan 21, 2017 at 8:20 AM, Amit Sela  wrote:
>
> > This is truly amazing Luke!
> >
> > If I understand this right, the runner executing the DoFn will delegate
> the
> > function code and input data (and state, coders, etc.) to the container
> > where it will execute with the user's SDK of choice, right ?
>
>
> Yes, that is correct.
>
>
> > I wonder how the containers relate to the underlying engine's worker
> > processes ? is it a 1-1, container per worker ? if there's less "work"
> for
> > the worker's Java process (for example) now and it becomes a sort of
> > "dispatcher", would that change the resource allocation commonly used for
> > the same Pipeline so that the worker processes would require less
> > resources, while giving those to the container ?
> >
>
> I think with the four services (control, data, state, logging) you can go
> with a 1-1 relationship or break it up more finely grained and dedicate
> some machines to have specific tasks. Like you could have a few machines
> dedicated to log aggregation which all the workers push their logs to.
> Similarly, you could have some machines that have a lot of memory which
> would be better to be able to do shuffles in memory and then this cluster
> of high memory machines could front the data service. I believe there is a
> lot of flexibility based upon what a runner can do and what it specializes
> in and believe that with more effort comes more possibilities albeit with
> increased internal complexity.
>
> The layout of resources depends on whether the services and SDK containers
> are co-hosted on the same machine or whether there is a different
> architecture in play. In a co-hosted configuration, it seems likely that
> the SDK container will get more resources but is dependent on the runner
> and pipeline shape (shuffle heavy dominated pipelines will look different
> then ParDo dominated pipelines).
>
>
> > About executing sub-graphs, would it be true to say that as long as
> there's
> > no shuffle, you could keep executing in the same container ? meaning that
> > the graph is broken into sub-graphs by shuffles ?
> >
>
> The only thing that is required is that the Apache Beam model is preserved
> so typical break points will be at shuffles and language crossing points
> (e.g. Python ParDo -> Java ParDo). A runner is free to break up the graph
> even more for other reasons.
>
>
> > I have to dig-in deeper, so I could have more questions ;-) thanks Luke!
> >
> > On Sat, Jan 21, 2017 at 1:52 AM Lukasz Cwik 
> > wrote:
> >
> > > I updated the PR description to contain the same.
> > >
> > > I would start by looking at the API/object model definitions found in
> > > beam_fn_api.proto
> > > <
> > > https://github.com/lukecwik/incubator-beam/blob/fn_api/
> > sdks/common/fn-api/src/main/proto/beam_fn_api.proto
> > > >
> > >
> > > Then depending on your interest, look at the following:
> > > * FnHarness.java
> > > <
> > > https://github.com/lukecwik/incubator-beam/blob/fn_api/
> > sdks/java/harness/src/main/java/org/apache/beam/fn/
> harness/FnHarness.java
> > > >
> > > is the main entry point.
> > > * org.apache.beam.fn.harness.data
> > > <
> > > https://github.com/lukecwik/incubator-beam/tree/fn_api/
> > sdks/java/harness/src/main/java/org/apache/beam/fn/harness/data
> > > >
> > > contains the most interesting bits of code since it is able to
> multiplex
> > a
> > > gRPC stream into multiple logical streams of elements bound for
> multiple
> > > concurrent process bundle requests. It also contains the code to take
> > > multiple logical outbound streams and multiplex them back onto a gRPC
> > > stream.
> > > * org.apache.beam.runners.core
> > > <
> > > https://github.com/lukecwik/incubator-beam/tree/fn_api/
> > sdks/java/harness/src/main/java/org/apache/beam/runners/core
> > > >
> > > contains additional runners akin to the DoFnRunner found in
> runners-core
> > to
> > > support sources and gRPC endpoints.
> > >
> > > Unless your really interested in how domain sockets, epoll, nio channel
> > > factories or how stream readiness callbacks work in gRPC, I would avoid
> > the
> > > packages org.apache.beam.fn.harness.channel and
> > > org.apache.beam.fn.harness.stream. Similarly I would avoid
> > > org.apache.beam.fn.harness.fn and org.apache.beam.fn.harness.fake as
> > they
> > > don't add anything meaningful to the api.
> > >
> > > Code package descriptions:
> > >
> > > org.apache.beam.fn.harness.FnHarness: main entry point
> > > org.apache.beam.fn.harness.control: Control ser

Re: Better developer instructions for using Maven?

2017-01-24 Thread Ismaël Mejía
I just wanted to know if we have reached some consensus about this; I just
saw this PR that reminded me of the discussion:

https://github.com/apache/beam/pull/1829

It is important that we mention the existing profiles (and the checks they
are intended to run) in the contribution guide, e.g. -Prelease (or
-Pall-checks) triggers these validations.

I can add this to the guide if you like once we define the checks per
stage/profile.

Ismaël


On Wed, Jan 11, 2017 at 8:12 AM, Aviem Zur  wrote:

> I agree with Dan and Lukasz.
> Developers should not be expected to know beforehand which specific
> profiles to run.
> The phase specified in the PR instructions (`verify`) should run all the
> relevant verifications and be the "slower" build, while a preceding
> lifecycle, such as `test`, should run the "faster" verifications.
>
> Aviem.
>
> On Mon, Jan 9, 2017 at 7:57 PM Robert Bradshaw  >
> wrote:
>
> > On Mon, Jan 9, 2017 at 3:49 AM, Aljoscha Krettek 
> > wrote:
> > > I also usually prefer "mvn verify" to do the expected thing but I see
> > that
> > > quick iteration times are key.
> >
> > I see
> > https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html
> >
> > verify - run any checks on results of integration tests to ensure
> > quality criteria are met
> >
> > Of course our integration tests are long enough that we shouldn't be
> > putting all of them here, but I too would expect checkstyle.
> >
> > Perhaps we could introduce a verify-fast or somesuch for fast (but
> > lower coverage) turnaround time. I would expect "mvn verify test" to
> > pass before submitting a PR, and would want to run that before asking
> > others to look at it. I think this should be our criteria (i.e. what
> > will a new but maven-savvy user run before pushing their code).
> >
> > > As long as the pre-commit hooks still check everything I'm ok with
> making
> > > the default a little more lightweight.
> >
> > The fact that our pre-commit hooks take a long time to run does change
> > things. Nothing more annoying than seeing that your PR failed 3 hours
> > later because you had some trailing whitespace...
> >
> > > On Thu, 5 Jan 2017 at 21:49 Lukasz Cwik 
> > wrote:
> > >
> > >> I was hoping that the default mvn verify would be the slow build and a
> > >> profile could be enabled that would skip checks to make things faster
> > for
> > >> regular contributors. This way a person doesn't need to have detailed
> > >> knowledge of all our profiles and what they do (typically mvn verify)
> > will
> > >> do the right thing most of the time.
> > >>
> > >> On Thu, Jan 5, 2017 at 9:30 AM, Dan Halperin
> > 
> > >> wrote:
> > >>
> > >> > On Thu, Jan 5, 2017 at 9:28 AM, Jesse Anderson <
> je...@smokinghand.com
> > >
> > >> > wrote:
> > >> >
> > >> > > @dan are you saying that mvn verify isn't doing checkstyle
> anymore?
> > >> >
> > >> >
> > >> > `mvn verify` alone should not be running checkstyle, if modules are
> > >> > configured correctly.
> > >> >
> > >> >
> > >> > > Some of
> > >> > > the checkstyles are still running for a few modules. Also, the
> > >> > contribution
> > >> > > docs will need to change.
> > >> >
> > >> >
> > >> > Yes. The PR includes discussion of these other needed changes,
> > >> > unfortunately one PR can't change two repositories.
> > >> >
> > >> > Please continue the discussion on the PR, then I will summarize it
> > back
> > >> > into the dev thread.
> > >> >
> > >> > Thanks,
> > >> > Dan
> > >> >
> > >> >
> > >> > > They say to run mvn verify before commits.
> > >> > >
> > >> > > On Thu, Jan 5, 2017 at 9:25 AM Dan Halperin
> > >>  > >> > >
> > >> > > wrote:
> > >> > >
> > >> > > > Several folks seem to have been confused after BEAM-246, where
> we
> > >> moved
> > >> > > the
> > >> > > > "slow things" into the release profile. I've started a
> discussion
> > >> with
> > >> > > > https://github.com/apache/beam/pull/1740 to see if there are
> > things
> > >> we
> > >> > > can
> > >> > > > do to fill these gaps.
> > >> > > >
> > >> > > > Would love folks to chime in with opinions.
> > >> > > >
> > >> > > > Dan
> > >> > > >
> > >> > > > On Wed, Jan 4, 2017 at 1:34 PM, Jesse Anderson <
> > >> je...@smokinghand.com>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > @Eugene, yes that failed on the checkstyle.
> > >> > > > >
> > >> > > > > On Wed, Jan 4, 2017 at 1:27 PM Eugene Kirpichov
> > >> > > > >  wrote:
> > >> > > > >
> > >> > > > > > Try just -Prelease.
> > >> > > > > > On Wed, Jan 4, 2017 at 1:21 PM Jesse Anderson <
> > >> > je...@smokinghand.com
> > >> > > >
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > Fails because I don't have a secret key.
> > >> > > > > > >
> > >> > > > > > > On Wed, Jan 4, 2017 at 1:03 PM Jean-Baptiste Onofré <
> > >> > > j...@nanthrax.net
> > >> > > > >
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi Jesse,
> > >> > > > > > > >
> > >> > > > > > > > Could you try the same with:
> > >> > > > > > > >
> > >> > > > > > > > mvn verify -Prelease

Re: [VOTE] Merge Python SDK to the master branch

2017-01-23 Thread Ismaël Mejía
[X] +1, Merge python-sdk branch to master after the 0.5.0 release

Big +1; unbounded support will come/stabilize later on (as it happened with
the InProcessRunner); visibility is more important.

On Mon, Jan 23, 2017 at 8:36 AM, Sergio Fernández  wrote:

> +1
>
> On Fri, Jan 20, 2017 at 9:24 PM, Robert Bradshaw <
> rober...@google.com.invalid> wrote:
>
> > On Fri, Jan 20, 2017 at 9:03 AM, Ahmet Altay 
> > wrote:
> > > Hi all,
> > >
> > >
> > > Please review the earlier discussion on the status of the Python SDK
> [1]
> > > and vote on merging the python-sdk branch to the master branch, as
> > follows:
> >
> > [X] +1, Merge python-sdk branch to master after the 0.5.0 release, and
> > release it in the subsequent minor release.
> >
> > - Robert
> >
>
>
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernan...@redlink.co
> w: http://redlink.co
>


Re: Hosting data stores for IO Transform testing

2017-01-18 Thread Ismaël Mejía
Hello again,

Stephen, I agree with you that the real question is the scope of the tests.
Maybe the discussion so far has been more about testing a 'real' data store
and finding infra/performance issues (and future regressions), but having a
modern cluster manager opens the door to creating more interesting
integration tests like the ones I mentioned; in particular, my idea is more
oriented towards validating the 'correct' expected behavior of the IOs and
runners. But this is quite ambitious for a first goal; maybe we should
first get things working and leave this for later (if there is still
interest).

I am not sure that unit tests are enough to test distribution issues,
because those are harder to simulate, in particular once we add the fact
that there can be many moving pieces. For example, imagine that we run a
Beam pipeline deployed via Spark on a YARN cluster (where some nodes can
fail) that reads from Kafka (with some slow partition) and writes to
Cassandra (with a partition that goes down). You see, this is quite a
complex combination of pieces (and possible issues), but it is not a
totally artificial scenario; in fact this is a common architecture, and it
can (at least in theory) be simulated with a cluster manager, but I don't
see how I could easily reproduce it with a unit test.

Anyway, this scenario makes me think that the boundaries of what we want to
test are really important. Complexity can be huge.

About the Mesos package question: indeed, I was referring to Mesos Universe
(the repo you linked), and what you said is sadly true; it is not easy to
find multi-node instance packages, which are the most interesting ones for
our tests (in both k8s and Mesos). I agree with your decision to use
Kubernetes; I just wanted to mention that in some cases we will need to
produce these multi-node packages ourselves to have interesting tests.

Ismaël


On Wed, Jan 18, 2017 at 10:09 PM, Jean-Baptiste Onofré 
wrote:

> Yes, for both DCOS (Mesos+Marathon) and Kubernetes, I think we may find
> single node config but not sure for multi-node setup. Anyway, I'm not sure
> if we find a multi-node configuration, it would cover our needs.
>
> Regards
> JB
>
> On 01/18/2017 12:52 PM, Stephen Sisk wrote:
>
>> ah! I looked around a bit more and found the dcos package repo -
>> https://github.com/mesosphere/universe/tree/version-3.x/repo/packages
>>
>> poking around a bit, I can find a lot of packages for single node
>> instances, but not many packages for multi-node instances. Single node
>> instance packages are kind of useful, but I don't think it's *too*
>> helpful.
>> The multi-node instance packages that run the data store's high
>> availability mode are where the real work is, and it seems like both
>> kubernetes helm and dcos' package universe don't have a lot of those.
>>
>> S
>>
>> On Wed, Jan 18, 2017 at 9:56 AM Stephen Sisk  wrote:
>>
>> Hi Ishmael,
>>>
>>> these are good questions, thanks for raising them.
>>>
>>> Ability to modify network/compute resources to simulate failures
>>> =
>>> I see two real questions here:
>>> 1. Is this something we want to do?
>>> 2. Is it possible with both/either?
>>>
>>> So far, the test strategy I've been advocating is that we test problems
>>> like this in unit tests rather than do this in ITs/Perf tests. Otherwise,
>>> it's hard to re-create the same conditions.
>>>
>>> I can investigate whether it's possible, but I want to clarify whether
>>> this is something that we care about. I know both support killing
>>> individual nodes. I haven't seen a lot of network control in either, but
>>> haven't tried to look for it.
>>>
>>> Availability of ready to play packages
>>> 
>>> I did look at this, and as far as I could tell, mesos didn't have any
>>> pre-built packages for multi-node clusters of data stores. If there's a
>>> good repository of them that we trust, that would definitely save us
>>> time.
>>> Can you point me at the mesos repository?
>>>
>>> S
>>>
>>>
>>>
>>> On Wed, Jan 18, 2017 at 8:37 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
>>> Hi Ismael
>>>
>>> Stephen will reply with details but I know he did a comparison and
>>> evaluate different options.
>>>
>>> He tested with the jdbc Io itests.
>>>
>>> Regards
>>> JB
>>>
>>> On Jan 18, 2017, 08:26, at 08:26, "Ismaël Mejía" 
>>> wrote:
>>>

Re: Hosting data stores for IO Transform testing

2017-01-18 Thread Ismaël Mejía
t;>>>>> Here's my thinking on the questions we've raised here -
> > >>>>>>>>>>
> > >>>>>>>>>> Embedded versions of data stores for testing
> > >>>>>>>>>> 
> > >>>>>>>>>> Summary: yes! But we still need real data stores to test
> > against.
> > >>>>>>>>>>
> > >>>>>>>>>> I am a gigantic fan of using embedded versions of the various
> > data
> > >>>>>>>>>> stores.
> > >>>>>>>>>> I think we should test everything we possibly can using them,
> > >>>>>>>>>> and do
> > >>>>>>>>>>
> > >>>>>>>>> the
> > >>>>>>
> > >>>>>>> majority of our correctness testing using embedded versions + the
> > >>>>>>>>>>
> > >>>>>>>>> direct
> > >>>>>>
> > >>>>>>> runner. However, it's also important to have at least one test
> that
> > >>>>>>>>>> actually connects to an actual instance, so we can get
> coverage
> > >>>>>>>>>> for
> > >>>>>>>>>> things
> > >>>>>>>>>> like credentials, real connection strings, etc...
> > >>>>>>>>>>
> > >>>>>>>>>> The key point is that embedded versions definitely can't cover
> > the
> > >>>>>>>>>> performance tests, so we need to host instances if we want to
> > test
> > >>>>>>>>>>
> > >>>>>>>>> that.
> > >>>>>>
> > >>>>>>> I consider the integration tests/performance benchmarks to be
> > >>>>>>>>>> costly
> > >>>>>>>>>> things
> > >>>>>>>>>> that we do only for the IO transforms with large amounts of
> > >>>>>>>>>> community
> > >>>>>>>>>> support/usage. A random IO transform used by a few users
> doesn't
> > >>>>>>>>>> necessarily need integration & perf tests, but for heavily
> used
> > IO
> > >>>>>>>>>> transforms, there's a lot of community value in these tests.
> The
> > >>>>>>>>>> maintenance proposal below scales with the amount of community
> > >>>>>>>>>> support
> > >>>>>>>>>> for
> > >>>>>>>>>> a particular IO transform.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Reusing data stores ("use the data stores across executions.")
> > >>>>>>>>>> --
> > >>>>>>>>>> Summary: I favor a hybrid approach: some frequently used, very
> > >>>>>>>>>> small
> > >>>>>>>>>> instances that we keep up all the time + larger
> multi-container
> > >>>>>>>>>> data
> > >>>>>>>>>> store
> > >>>>>>>>>> instances that we spin up for perf tests.
> > >>>>>>>>>>
> > >>>>>>>>>> I don't think we need to have a strong answer to this
> question,
> > >>>>>>>>>> but I
> > >>>>>>>>>> think
> > >>>>>>>>>> we do need to know what range of capabilities we need, and use
> > >>>>>>>>>> that to
> > >>>>>>>>>> inform our requirements on the hosting infrastructure. I think
> > >>>>>>>>>> kubernetes/mesos + docker can support all the scenarios I
> > discuss
> > >>>>>>>>>>
> > >>>>>>>>> below.
> > >>>>>>
> > >>>>>>> I had been thinking of a hybrid approach - reuse some instances

Re: Graduation!

2017-01-10 Thread Ismaël Mejía
Congratulations, everyone! This is great news; graduation will give new
users confidence in the project and its community.
Awesome!

Ismaël

P.S. @Michal: it is one unified API for both bounded and unbounded data;
the behavior changes depending on the data sources.

On Tue, Jan 10, 2017 at 4:33 PM, Thomas Weise  wrote:

> Congratulations everyone!
>
> Thomas
>
>
> On Tue, Jan 10, 2017 at 6:14 AM, Bobby Evans 
> wrote:
>
> > Yes, great news. Congratulations!
> >
> > - Bobby
> >
> > On Tuesday, January 10, 2017, 8:11:43 AM CST, Jungtaek Lim <
> > kabh...@gmail.com> wrote: Congrats all!
> >
> > - Jungtaek Lim (HeartSaVioR)
> >
> > On Tue, Jan 10, 2017 at 10:53 PM, Andrew Hoblitzell wrote:
> >
> > > Good work all!
> > >
> > > On Tue, Jan 10, 2017 at 8:50 AM, Jesse Anderson  >
> > > wrote:
> > >
> > > > Excellent!
> > > >
> > > > On Tue, Jan 10, 2017, 7:12 AM Jacky Li  wrote:
> > > >
> > > > > Great work! Congratulations!
> > > > >
> > > > > Regards,
> > > > > Jacky
> > > > >
> > > > > > On Jan 10, 2017, at 7:11 PM, Sergio Fernández wrote:
> > > > > >
> > > > > > Congrats, guys!
> > > > > >
> > > > > > On Tue, Jan 10, 2017 at 12:07 PM, Davor Bonaci  >
> > > > wrote:
> > > > > >
> > > > > >> The ASF has publicly announced our graduation!
> > > > > >>
> > > > > >>
> > > > > >> https://blogs.apache.org/foundation/entry/the-apache-
> > > > > >> software-foundation-announces
> > > > > >>
> > > > > >>https://beam.apache.org/blog/2017/01/10/beam-graduates.html
> > > > > >>
> > > > > >> Graduation is a recognition of the community that we have built
> > > > > together. I
> > > > > >> am humbled to be part of this group and this project, and so
> > excited
> > > > for
> > > > > >> what we can accomplish together going forward.
> > > > > >>
> > > > > >> Davor
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Sergio Fernández
> > > > > > Partner Technology Manager
> > > > > > Redlink GmbH
> > > > > > m: +43 6602747925
> > > > > > e: sergio.fernan...@redlink.co
> > > > > > w: http://redlink.co
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>

