Re: [VOTE] Release 2.3.0, release candidate #3

2018-02-12 Thread Romain Manni-Bucau
You can't once you've closed the staging repo.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-12 17:32 GMT+01:00 Reuven Lax <re...@google.com>:

> If it's possible to just publish the module separately, that seems better
> than RC4
>
> On Mon, Feb 12, 2018 at 8:28 AM, Ismaël Mejía <ieme...@gmail.com> wrote:
>
>> Sorry, the skip was an error made while merging this module with the tests
>> during the move to Java 1.8. I will create a JIRA + PR, but I wonder if
>> we can somehow just publish this artifact to avoid creating a new
>> RC + vote.
>>
>> On Mon, Feb 12, 2018 at 5:22 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>> > But that's it: deploy skip is set on the module, so it's expected.
>> >
>> > Regards
>> > JB
>> >
>> > On 02/12/2018 05:21 PM, Romain Manni-Bucau wrote:
>> >> oops sorry, read too fast (thanks to not align artifactId and folder
>> names ;)):
>> >> deploy#skip=true in the module :)
>> >>
>> >>
>> >> Romain Manni-Bucau
>> >> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> >> <https://rmannibucau.metawerx.net/> | Old Blog
>> >> <http://rmannibucau.wordpress.com> | Github <
>> https://github.com/rmannibucau> |
>> >> LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
>> >> <https://www.packtpub.com/application-development/java-ee-8-
>> high-performance>
>> >>
>> >> 2018-02-12 17:19 GMT+01:00 Romain Manni-Bucau <rmannibu...@gmail.com
>> >> <mailto:rmannibu...@gmail.com>>:
>> >>
>> >> it is not in the parent modules so completely skipped from the
>> reactor
>> >>
>> >>
>> >> Romain Manni-Bucau
>> >> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> >> <https://rmannibucau.metawerx.net/> | Old Blog
>> >> <http://rmannibucau.wordpress.com> | Github
>> >> <https://github.com/rmannibucau> | LinkedIn
>> >> <https://www.linkedin.com/in/rmannibucau> | Book
>> >> <https://www.packtpub.com/application-development/java-ee-
>> 8-high-performance>
>> >>
>> >> 2018-02-12 17:15 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net
>> >> <mailto:j...@nanthrax.net>>:
>> >>
>> >> Hi Neville,
>> >>
>> >> Let me take a look on the profile used for the release:perform.
>> >>
>> >> I'll keep you posted.
>> >>
>> >> Regards
>> >> JB
>> >>
>> >> On 02/12/2018 05:10 PM, Neville Li wrote:
>> >> > I don't see a beam-sdks-java-io-hadoop-input-format
>> artifact in the staging
>> >> > repo, but the Maven module still exists:
>> >> > https://github.com/apache/beam/tree/v2.3.0-RC3/sdks/java/io/
>> hadoop-input-format
>> >> <https://github.com/apache/beam/tree/v2.3.0-RC3/sdks/java/
>> io/hadoop-input-format>
>> >> >
>> >> > Was it not published by mistake? We still have code that
>> depends on this.
>> >> >
>> >> > On Mon, Feb 12, 2018 at 3:55 AM Romain Manni-Bucau <
>> rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>
>> >> > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>>>
>> wrote:
>> >> >
>> >> > Ok, checked custom jobs on spark and direct runners +
>> -parameters is usable
>> >> > + some advanced sdk-core integration usages (outside
>> runners) - not sure
>> >> > where it fits the spreadsheet though.
>> >> >
>> >> >
>> >> > Romain Manni-Bucau
>> >> > @rmannibucau <https://twitter.com/rmannibucau
>> >> <https://twitter.com/rmannibucau>> |  Blog
>> >> > <https://rmannibucau.metawerx.net/
>> >> <http

Re: [VOTE] Release 2.3.0, release candidate #3

2018-02-12 Thread Romain Manni-Bucau
Oops, sorry, I read too fast (thanks to the artifactId and folder names not
being aligned ;)): deploy#skip=true in the module :)
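For readers following along: the per-module deploy skip being referred to typically looks like the following in the module's pom.xml (a sketch of the usual maven-deploy-plugin configuration, not the actual Beam POM):

```xml
<!-- Hypothetical example: this configuration prevents the module's
     artifact from being deployed, and thus from reaching the staging
     repository during release:perform. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-deploy-plugin</artifactId>
  <configuration>
    <skip>true</skip>
  </configuration>
</plugin>
```

Setting the `maven.deploy.skip` property to `true` in the module's `<properties>` has the same effect.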


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-12 17:19 GMT+01:00 Romain Manni-Bucau <rmannibu...@gmail.com>:

> it is not in the parent modules so completely skipped from the reactor
>
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
> 2018-02-12 17:15 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>
>> Hi Neville,
>>
>> Let me take a look on the profile used for the release:perform.
>>
>> I'll keep you posted.
>>
>> Regards
>> JB
>>
>> On 02/12/2018 05:10 PM, Neville Li wrote:
>> > I don't see a beam-sdks-java-io-hadoop-input-format artifact in the
>> staging
>> > repo, but the Maven module still exists:
>> > https://github.com/apache/beam/tree/v2.3.0-RC3/sdks/java/io/
>> hadoop-input-format
>> >
>> > Was it not published by mistake? We still have code that depends on
>> this.
>> >
>> > On Mon, Feb 12, 2018 at 3:55 AM Romain Manni-Bucau <
>> rmannibu...@gmail.com
>> > <mailto:rmannibu...@gmail.com>> wrote:
>> >
>> > Ok, checked custom jobs on spark and direct runners + -parameters
>> is usable
>> > + some advanced sdk-core integration usages (outside runners) - not
>> sure
>> > where it fits the spreadsheet though.
>> >
>> >
>> > Romain Manni-Bucau
>> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> > <https://rmannibucau.metawerx.net/> | Old Blog
>> > <http://rmannibucau.wordpress.com> | Github
>> > <https://github.com/rmannibucau> | LinkedIn
>> > <https://www.linkedin.com/in/rmannibucau> | Book
>> > <https://www.packtpub.com/application-development/java-ee-
>> 8-high-performance>
>> >
>> > 2018-02-11 20:54 GMT+01:00 Eugene Kirpichov <kirpic...@google.com
>> > <mailto:kirpic...@google.com>>:
>> >
>> > Reminder: validation spreadsheet
>> > at  https://s.apache.org/beam-2.3.0-release-validation .
>> > It'd be good to accompany votes by specifying in the
>> spreadsheet what
>> > has been validated.
>> >
>> > On Sun, Feb 11, 2018 at 7:57 AM Romain Manni-Bucau
>> > <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote:
>> >
>> > +1
>> >
>> >
>> > Romain Manni-Bucau
>> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> > <https://rmannibucau.metawerx.net/> | Old Blog
>> > <http://rmannibucau.wordpress.com> | Github
>> > <https://github.com/rmannibucau> | LinkedIn
>> > <https://www.linkedin.com/in/rmannibucau> | Book
>> > <https://www.packtpub.com/application-development/java-ee-
>> 8-high-performance>
>> >
>> > 2018-02-11 6:33 GMT+01:00 Jean-Baptiste Onofré <
>> j...@nanthrax.net
>> > <mailto:j...@nanthrax.net>>:
>> >
>> > Hi everyone,
>> >
>> > Please review and vote on the release candidate #3 for
>> the
>> > version 2.3.0, as
>> > follows:
>> >
>> > [ ] +1, Approve the release
>> > [ ] -1, Do not approve the release (please provide
>> specific
>> > comments)
>> >
>> >
>> > The complete staging area is available for your review,
>> which
>> > includes:
>> > * JIRA release notes [1],
>> > * the official Apache source release to be deployed to
>> > 

Re: [VOTE] Release 2.3.0, release candidate #3

2018-02-12 Thread Romain Manni-Bucau
it is not in the parent's modules, so it is completely skipped by the reactor
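Context for "skipped by the reactor": Maven only builds the modules listed in the parent POM's <modules> section, so a directory that exists in the source tree but is missing from that list is never built, tested, or deployed. A sketch (sibling module names are illustrative):

```xml
<!-- parent pom.xml -->
<modules>
  <module>hadoop-common</module>
  <module>hadoop-file-system</module>
  <!-- if hadoop-input-format is not listed here, the reactor
       silently ignores it even though the directory exists -->
</modules>
```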


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-12 17:15 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:

> Hi Neville,
>
> Let me take a look on the profile used for the release:perform.
>
> I'll keep you posted.
>
> Regards
> JB
>
> On 02/12/2018 05:10 PM, Neville Li wrote:
> > I don't see a beam-sdks-java-io-hadoop-input-format artifact in the
> staging
> > repo, but the Maven module still exists:
> > https://github.com/apache/beam/tree/v2.3.0-RC3/sdks/
> java/io/hadoop-input-format
> >
> > Was it not published by mistake? We still have code that depends on this.
> >
> > On Mon, Feb 12, 2018 at 3:55 AM Romain Manni-Bucau <
> rmannibu...@gmail.com
> > <mailto:rmannibu...@gmail.com>> wrote:
> >
> > Ok, checked custom jobs on spark and direct runners + -parameters is
> usable
> > + some advanced sdk-core integration usages (outside runners) - not
> sure
> > where it fits the spreadsheet though.
> >
> >
> > Romain Manni-Bucau
> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> > <https://rmannibucau.metawerx.net/> | Old Blog
> > <http://rmannibucau.wordpress.com> | Github
> > <https://github.com/rmannibucau> | LinkedIn
> > <https://www.linkedin.com/in/rmannibucau> | Book
> > <https://www.packtpub.com/application-development/java-
> ee-8-high-performance>
> >
> > 2018-02-11 20:54 GMT+01:00 Eugene Kirpichov <kirpic...@google.com
> > <mailto:kirpic...@google.com>>:
> >
> > Reminder: validation spreadsheet
> > at  https://s.apache.org/beam-2.3.0-release-validation .
> > It'd be good to accompany votes by specifying in the spreadsheet
> what
> > has been validated.
> >
> > On Sun, Feb 11, 2018 at 7:57 AM Romain Manni-Bucau
> > <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote:
> >
> > +1
> >
> >
> > Romain Manni-Bucau
> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> > <https://rmannibucau.metawerx.net/> | Old Blog
> > <http://rmannibucau.wordpress.com> | Github
> > <https://github.com/rmannibucau> | LinkedIn
> > <https://www.linkedin.com/in/rmannibucau> | Book
> > <https://www.packtpub.com/application-development/java-
> ee-8-high-performance>
> >
> > 2018-02-11 6:33 GMT+01:00 Jean-Baptiste Onofré <
> j...@nanthrax.net
> > <mailto:j...@nanthrax.net>>:
> >
> > Hi everyone,
> >
> > Please review and vote on the release candidate #3 for
> the
> > version 2.3.0, as
> > follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide
> specific
> > comments)
> >
> >
> > The complete staging area is available for your review,
> which
> > includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to
> > dist.apache.org <http://dist.apache.org> [2],
> > which is signed with the key with fingerprint C8282E76
> [3],
> > * all artifacts to be deployed to the Maven Central
> Repository [4],
> > * source code tag "v2.3.0-RC3" [5],
> > * website pull request listing the release and
> publishing the
> > API reference
> > manual [6].
> > * Java artifacts were built with Maven 3.3.9 and Oracle
> JDK
> > 1.8.0_111.
> > * Python artifacts are deployed along with the source
> release to the
> > dist.apache.org <http://dist.apache.org> [2].
> >
> > The vote will be open for at least 72 hours. It is
> adopted by
> > majority approval,
> >  

Re: [VOTE] Release 2.3.0, release candidate #3

2018-02-12 Thread Romain Manni-Bucau
OK, I checked custom jobs on the Spark and direct runners, that -parameters
is usable, and some advanced sdk-core integration usages (outside runners) -
not sure where that fits in the spreadsheet though.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-11 20:54 GMT+01:00 Eugene Kirpichov <kirpic...@google.com>:

> Reminder: validation spreadsheet at
> https://s.apache.org/beam-2.3.0-release-validation .
> It'd be good to accompany votes by specifying in the spreadsheet what has
> been validated.
>
> On Sun, Feb 11, 2018 at 7:57 AM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> +1
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>> 2018-02-11 6:33 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>>
>>> Hi everyone,
>>>
>>> Please review and vote on the release candidate #3 for the version
>>> 2.3.0, as
>>> follows:
>>>
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>>
>>> The complete staging area is available for your review, which includes:
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to dist.apache.org
>>> [2],
>>> which is signed with the key with fingerprint C8282E76 [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v2.3.0-RC3" [5],
>>> * website pull request listing the release and publishing the API
>>> reference
>>> manual [6].
>>> * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval,
>>> with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>> JB
>>>
>>> [1]
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341608
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1028/
>>> [5] https://github.com/apache/beam/tree/v2.3.0-RC3
>>> [6] https://github.com/apache/beam-site/pull/381
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>>


Re: [INFO] Gradle build is flaky on Jenkins

2018-02-09 Thread Romain Manni-Bucau
Did you check the concurrency? For me it is plain wrong, since it is
hardcoded in the build file and doesn't let me or the tool customize it.
This means it just blocks my computer in general and makes some
timeout-related tests fail easily. (This kind of config should always be
customizable on the CI and never hardcoded, IMHO.)
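A sketch of how that concurrency could be left to the environment instead of the build file (these are standard Gradle settings, not Beam-specific):

```properties
# gradle.properties (project-level, or per machine in ~/.gradle/gradle.properties)
org.gradle.parallel=true
org.gradle.workers.max=4

# or per invocation, e.g. on Jenkins:
#   ./gradlew --continue --rerun-tasks --max-workers=4 :javaPreCommit
```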


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-09 12:00 GMT+01:00 Aljoscha Krettek <aljos...@apache.org>:

> Yes, I was just about to write about this as well. In my recent PRs this
> always failed for different reasons.
>
> Thanks for looking into this!
>
> > On 9. Feb 2018, at 11:35, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >
> > Hi guys,
> >
> > I noticed that the Gradle build on Jenkins is flaky: it almost always
> > fails for different reasons (it can't download the pentaho artifact
> > sometimes, interrupted other times).
> >
> > Jenkins is doing:
> >
> > Jenkins: ./gradlew --continue --rerun-tasks :javaPreCommit
> >
> > I'm going to investigate how to improve the situation.
> >
> > Regards
> > JB
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
>
>


Re: [VOTE] Release 2.3.0, release candidate #2

2018-02-08 Thread Romain Manni-Bucau
Since it only breaks examples, I'm not sure it's worth yet another reroll
(which already means a two-week delay on the plan). Users will be affected
the same way anyway - and in an expected way, until beam handles
classloaders per transform. A note on the side is probably fine.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-09 7:23 GMT+01:00 Chamikara Jayalath <chamik...@google.com>:

>
>
> On Thu, Feb 8, 2018 at 10:18 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> It means an RC3 then.
>>
>> Basically, we have two options:
>>
>> 1. I cancel RC2, include PR 4645, and cut an RC3. It can be done super
>> fast (today).
>>
>
> +1 for option 1 since IMO we should not release with quickstart broken for
> Spark.
>
>
>> 2. We continue RC2 vote and we add a note about shading (as I did for the
>> TextIO
>> issue with Flink runner).
>>
>> I'm more in favor of 1 as the fix is already there and cutting a release
>> is
>> super fast for me.
>>
>> Thoughts ?
>>
>> Regards
>> JB
>>
>> On 02/09/2018 07:06 AM, Chamikara Jayalath wrote:
>> > +1 for continuing the release after the immediate fix
>> > (https://github.com/apache/beam/pull/4645). I don't
>> > think https://issues.apache.org/jira/browse/BEAM-3519 is due to a
>> recent update
>> > to google-cloud-platform module so the issue likely existed in some
>> form in
>> > previous releases as well.
>> >
>> > - Cham
>> >
>> > On Thu, Feb 8, 2018 at 9:47 PM Romain Manni-Bucau <
>> rmannibu...@gmail.com
>> > <mailto:rmannibu...@gmail.com>> wrote:
>> >
>> > IMHO it is not a blocker but an incompatibility between Spark and some
>> > IO stack. A trivial workaround is to shade the IO before importing it
>> > in one's project. An alternative is to wrap IOs in custom classloaders.
>> >
>> > I didn't check for this one, but it is a common beam issue to have
>> > conflicts between runners/IOs or even two IOs, so it shouldn't block a
>> > release by itself until beam aims to solve conflicts properly - which
>> > means without shading, which breaks the IO ecosystem on the user side.
>> >
>> > Just my 2cts
>> >
>> > On 9 Feb 2018 at 06:07, "Jean-Baptiste Onofré" <j...@nanthrax.net
>> > <mailto:j...@nanthrax.net>> wrote:
>> >
>> > Is it specific to this release ? I think it was like this
>> before no ?
>> >
>> > Regards
>> > JB
>> >
>> > On 02/09/2018 12:48 AM, Kenneth Knowles wrote:
>> > > Since root cause is https://issues.apache.org/
>> jira/browse/BEAM-3519 I
>> > marked it
>> > > a blocker so we can discuss fixes or workarounds there.
>> > >
>> > > On Thu, Feb 8, 2018 at 1:24 PM, Lukasz Cwik <lc...@google.com
>> > <mailto:lc...@google.com>
>> > > <mailto:lc...@google.com <mailto:lc...@google.com>>> wrote:
>> > >
>> > > I validated several of the quickstarts and updated the
>> spreadsheet and
>> > > currently am voting -1 for this release due to Spark
>> runner
>> > failing. Filed
>> > > https://issues.apache.org/jira/browse/BEAM-3668
>> > > <https://issues.apache.org/jira/browse/BEAM-3668> with
>> the full
>> > details.
>> > >
>> > >
>> > > On Thu, Feb 8, 2018 at 10:32 AM, Valentyn Tymofieiev
>> > <valen...@google.com <mailto:valen...@google.com>
>> > > <mailto:valen...@google.com <mailto:valen...@google.com>>>
>> wrote:
>> > >
>> > > Yes (thanks
>> > Kenn!): https://s.apache.org/beam-2.3.0-release-validation
>> > > <https://s.apache.org/beam-2.3.0-release-validation>
>> > >
>> > > On Thu, Feb 8, 2018 at 10:14 

Re: [VOTE] Release 2.3.0, release candidate #2

2018-02-08 Thread Romain Manni-Bucau
IMHO it is not a blocker but an incompatibility between Spark and some IO
stack. A trivial workaround is to shade the IO before importing it in one's
project. An alternative is to wrap IOs in custom classloaders.

I didn't check for this one, but it is a common beam issue to have conflicts
between runners/IOs or even two IOs, so it shouldn't block a release by
itself until beam aims to solve conflicts properly - which means without
shading, which breaks the IO ecosystem on the user side.

Just my 2cts
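The shading workaround mentioned above would typically be done with the maven-shade-plugin in the user's own project, relocating the conflicting packages (a hedged sketch; the relocation pattern below is hypothetical and depends on which dependency actually clashes):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <!-- hypothetical: move the IO's copy of a conflicting
               dependency out of the way of the runner's copy -->
          <relocation>
            <pattern>com.google.protobuf</pattern>
            <shadedPattern>shaded.com.google.protobuf</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```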

On 9 Feb 2018 at 06:07, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:

> Is it specific to this release ? I think it was like this before no ?
>
> Regards
> JB
>
> On 02/09/2018 12:48 AM, Kenneth Knowles wrote:
> > Since root cause is https://issues.apache.org/jira/browse/BEAM-3519 I
> marked it
> > a blocker so we can discuss fixes or workarounds there.
> >
> > On Thu, Feb 8, 2018 at 1:24 PM, Lukasz Cwik <lc...@google.com
> > <mailto:lc...@google.com>> wrote:
> >
> > I validated several of the quickstarts and updated the spreadsheet
> and
> > currently am voting -1 for this release due to Spark runner failing.
> Filed
> > https://issues.apache.org/jira/browse/BEAM-3668
> > <https://issues.apache.org/jira/browse/BEAM-3668> with the full
> details.
> >
> >
> > On Thu, Feb 8, 2018 at 10:32 AM, Valentyn Tymofieiev <
> valen...@google.com
> > <mailto:valen...@google.com>> wrote:
> >
> > Yes (thanks Kenn!): https://s.apache.org/
> beam-2.3.0-release-validation
> > <https://s.apache.org/beam-2.3.0-release-validation>
> >
> > On Thu, Feb 8, 2018 at 10:14 AM, Eugene Kirpichov <
> kirpic...@google.com
> > <mailto:kirpic...@google.com>> wrote:
> >
> > Do we have a release validation spreadsheet for this one?
> >
> > On Thu, Feb 8, 2018 at 9:30 AM Ahmet Altay <al...@google.com
> > <mailto:al...@google.com>> wrote:
> >
> > +1
> >
> > I verified python quick start, mobile gaming examples,
> streaming
> > on Direct and Dataflow runners. Thank you JB!
> >
> > On Thu, Feb 8, 2018 at 2:27 AM, Romain Manni-Bucau
> > <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>>
> wrote:
> >
> > +1 (non-binding), thanks JB for the effort!
> >
> >
> > Romain Manni-Bucau
> > @rmannibucau <https://twitter.com/rmannibucau> |
>  Blog
> > <https://rmannibucau.metawerx.net/> | Old Blog
> > <http://rmannibucau.wordpress.com> | Github
> > <https://github.com/rmannibucau> | LinkedIn
> > <https://www.linkedin.com/in/rmannibucau> | Book
> > <https://www.packtpub.com/
> application-development/java-ee-8-high-performance>
> >
> > 2018-02-08 11:12 GMT+01:00 Ismaël Mejía <
> ieme...@gmail.com
> > <mailto:ieme...@gmail.com>>:
> >
> > +1 (binding)
> >
> > Validated SHAs + tag vs source.zip file.
> > Run mvn clean install -Prelease OK
> > Validated that the 3 regressions reported for
> RC1 were
> > fixed.
> > Run Nexmark on Direct/Flink runner on local
> mode, no
> > regressions now.
> > Installed python version on virtualenv and run
> wordcount
> > with success.
> >
> > On Thu, Feb 8, 2018 at 6:37 AM, Jean-Baptiste
> Onofré
> > <j...@nanthrax.net <mailto:j...@nanthrax.net>>
> wrote:
> > > Hi everyone,
> > >
> > > Please review and vote on the release
> candidate #2 for
> > the version 2.3.0, as
> > > follows:
> > >
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please
> provide
> > specific comments)
> > >
> > >
> > > The complete

Re: dependencies.txt in META-INF?

2018-02-08 Thread Romain Manni-Bucau
It was abused too much by libs and is not supported everywhere :(

On 8 Feb 2018 at 22:39, "Lukasz Cwik" <lc...@google.com> wrote:

> It is unfortunate that setting Class-Path is so broken.
>
> On Wed, Feb 7, 2018 at 10:55 PM, Romain Manni-Bucau <rmannibu...@gmail.com
> > wrote:
>
>> Not really:
>> 1. I need to have the gav
>> 2. Please never set Class-Path of the manifest. It leads to broken
>> runtime in most environments :(.
>>
>>
>> On 8 Feb 2018 at 05:19, "Lukasz Cwik" <lc...@google.com> wrote:
>>
>>> Looking at the Gradle shadow plugin, it seems like it is doing what you
>>> ask. Does this fit your usecase?
>>>
>>> From: http://imperceptiblethoughts.com/shadow/#configuring_the_runtime_classpath
>>>
>>> Additionally, Shadow automatically configures the manifest of the
>>> shadowJar task to contain a Class-Path entry in the JAR manifest. The
>>> value of the Class-Path entry is the name of all dependencies resolved
>>> in the shadow configuration for the project.
>>>
>>> dependencies {
>>>   shadow 'junit:junit:3.8.2'
>>> }
>>>
>>> Inspecting the META-INF/MANIFEST.MF entry in the JAR file will reveal
>>> the following attribute:
>>>
>>> Class-Path: junit-3.8.2.jar
>>>
>>>
>>>
>>> On Wed, Feb 7, 2018 at 9:27 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> 2018-02-07 18:21 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>>>>
>>>>> What kinds of features would this enable within the Apache Beam SDK or
>>>>> allow for users to write (looking for some reason as to why this is not
>>>>> just a one off change to support a use case)?
>>>>>
>>>>
>>>> It allows building a classpath and relying on beam without requiring
>>>> maven to resolve the poms, and it is way faster than resolving the pom
>>>> model. I fully ack it is a bit of an edge case, but it doesn't cost
>>>> beam much either, so I thought I would ask before doing something 100%
>>>> custom.
>>>>
>>>>
>>>>> Would it list all the transitive dependencies?
>>>>>
>>>>
>>>> all runtime ones (= not the test and provided ones - even if I can
>>>> live with it listing them all, I just don't see why it would)
>>>>
>>>>
>>>>>
>>>>> How would you test that it works?
>>>>>
>>>>
>>>> It is a maven plugin, so I'm not sure it requires a test in beam
>>>> itself, but on my side I already have tests for this kind of thing,
>>>> typically running a server from such a file.
>>>>
>>>>
>>>>>
>>>>> On Wed, Feb 7, 2018 at 7:23 AM, Romain Manni-Bucau <
>>>>> rmannibu...@gmail.com> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I have a use case where I would resolve beam classpath
>>>>>> programmatically. I wonder if it would be possible to add in META-INF (or
>>>>>> BEAM-INF, in the jar is the main request ;)) a dependencies.txt (or other
>>>>>> file) listing all the mandatory dependencies. I'm mainly interested by 
>>>>>> the
>>>>>> java sdk core module but can be beneficial to others as well.
>>>>>>
>>>>>> With maven it is just a matter of defining:
>>>>>>
>>>>>> <plugin>
>>>>>>   <groupId>org.apache.maven.plugins</groupId>
>>>>>>   <artifactId>maven-dependency-plugin</artifactId>
>>>>>>   <version>${dependency-plugin.version}</version>
>>>>>>   <executions>
>>>>>>     <execution>
>>>>>>       <id>create-META-INF/dependencies.txt</id>
>>>>>>       <phase>prepare-package</phase>
>>>>>>       <goals>
>>>>>>         <goal>list</goal>
>>>>>>       </goals>
>>>>>>       <configuration>
>>>>>>         <outputFile>${project.build.outputDirectory}/META-INF/dependencies.txt</outputFile>
>>>>>>       </configuration>
>>>>>>     </execution>
>>>>>>   </executions>
>>>>>> </plugin>
>>>>>>
>>>>>> With gradle it is a loop over a resolvedConfiguration which dumps
>>>>>> the artifacts in a maven format (group:name:type:version).
>>>>>>
>>>>>> My interest in having it in beam is to be able to upgrade beam
>>>>>> without having to re-release this metadata.
>>>>>>
>>>>>> Is it something the project could be interested in?
>>>>>>
>>>>>> Romain Manni-Bucau
>>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>


Re: A 15x speed-up in local Python DirectRunner execution

2018-02-08 Thread Romain Manni-Bucau
Very interesting! It sounds like a sane way forward for beam, and I'm very
happy it is consistent with the current Java experience: no need to
interlace runners in the end, and it makes the design, code, and user
experience way better than trying to put everything in the direct runner :).
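For anyone wanting to try it from Python before the default switches, a minimal sketch of opting in via pipeline options. The helper function is illustrative, not part of the SDK; the runner's fully qualified name is taken from the announcement quoted below in this thread.

```python
# Per the announcement, the FnApiRunner can be selected by passing its
# fully qualified name as the --runner pipeline option.
FN_API_RUNNER = "apache_beam.runners.portability.fn_api_runner.FnApiRunner"

def fn_api_argv(extra_args=()):
    """Build argv for a pipeline run that opts in to the FnApiRunner.

    Illustrative helper only; a real pipeline would pass the result to
    beam.Pipeline(argv=...).
    """
    return ["--runner=%s" % FN_API_RUNNER] + list(extra_args)

# e.g. fn_api_argv(["--input=kinglear.txt", "--output=counts.txt"])
```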

Le 8 févr. 2018 19:20, "María García Herrero"  a écrit :

> Amazing improvement, Charles.
> Thanks for the effort!
>
>
> On Thu, Feb 8, 2018 at 10:14 AM Eugene Kirpichov 
> wrote:
>
>> Sounds awesome, congratulations and thanks for making this happen!
>>
>> On Thu, Feb 8, 2018 at 10:07 AM Raghu Angadi  wrote:
>>
>>> This is terrific news! Thanks Charles.
>>>
>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen  wrote:
>>>
 Local execution of Beam pipelines on the Python DirectRunner currently
 suffers from performance issues, which makes it hard for pipeline authors
 to iterate, especially on medium to large size datasets.  We would like to
 optimize and make this a better experience for Beam users.

 The FnApiRunner was written as a way of leveraging the portability
 framework execution code path for local portability development. We've
 found it also provides great speedups in batch execution with no user
 changes required, so we propose to switch to use this runner by default in
 batch pipelines.  For example, WordCount on the Shakespeare dataset with a
 single CPU core now takes 50 seconds to run, compared to 12 minutes before;
 this is a 15x performance improvement that users can get for free,
 with no user pipeline changes.

 The JIRA for this change is here
 (https://issues.apache.org/jira/browse/BEAM-3644), and a candidate patch is
 available here (https://github.com/apache/beam/pull/4634). I have been
 working over the last month on making this an automatic drop-in replacement
 for the current DirectRunner when applicable. Before it becomes the
 default, you can try this runner now by manually specifying
 apache_beam.runners.portability.fn_api_runner.FnApiRunner as the runner.

 Even with this change, local Python pipeline execution can only
 effectively use one core because of the Python GIL.  A natural next step to
 further improve performance will be to refactor the FnApiRunner to allow
 for multi-process execution.  This is being tracked here (
 https://issues.apache.org/jira/browse/BEAM-3645).

 Best,

 Charles

>>>
>
> --
>
> Impact is the effect that wouldn’t have happened if you hadn’t done what you
> did.
>
>


Re: [VOTE] Release 2.3.0, release candidate #2

2018-02-08 Thread Romain Manni-Bucau
+1 (non-binding), thanks JB for the effort!


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-08 11:12 GMT+01:00 Ismaël Mejía <ieme...@gmail.com>:

> +1 (binding)
>
> Validated SHAs + tag vs source.zip file.
> Run mvn clean install -Prelease OK
> Validated that the 3 regressions reported for RC1 were fixed.
> Run Nexmark on Direct/Flink runner on local mode, no regressions now.
> Installed python version on virtualenv and run wordcount with success.
>
> On Thu, Feb 8, 2018 at 6:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
> > Hi everyone,
> >
> > Please review and vote on the release candidate #2 for the version
> 2.3.0, as
> > follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to dist.apache.org
> [2],
> > which is signed with the key with fingerprint C8282E76 [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "v2.3.0-RC2" [5],
> > * website pull request listing the release and publishing the API
> reference
> > manual [6].
> > * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
> > * Python artifacts are deployed along with the source release to the
> > dist.apache.org [2].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> approval,
> > with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > JB
> >
> > [1]
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341608
> > [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > [4] https://repository.apache.org/content/repositories/orgapachebeam-1027/
> > [5] https://github.com/apache/beam/tree/v2.3.0-RC2
> > [6] https://github.com/apache/beam-site/pull/381
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
>


Re: dependencies.txt in META-INF?

2018-02-07 Thread Romain Manni-Bucau
Not really:
1. I need to have the GAV (group:artifact:version).
2. Please never set the Class-Path entry of the manifest. It leads to broken
runtimes in most environments :(.


Le 8 févr. 2018 05:19, "Lukasz Cwik" <lc...@google.com> a écrit :

> Looking at the Gradle shadow plugin, it seems like it is doing what you
> ask. Does this fit your usecase?
>
> From: http://imperceptiblethoughts.com/shadow/#configuring_the_runtime_classpath
>
> Additionally, Shadow automatically configures the manifest of the
> shadowJar task to contain a Class-Path entry in the JAR manifest. The
> value of the Class-Path entry is the name of all dependencies resolved in
> the shadow configuration for the project.
>
> dependencies {
>   shadow 'junit:junit:3.8.2'
> }
>
> Inspecting the META-INF/MANIFEST.MF entry in the JAR file will reveal the
> following attribute:
>
> Class-Path: junit-3.8.2.jar
>
>
>
> On Wed, Feb 7, 2018 at 9:27 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>>
>>
>> 2018-02-07 18:21 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>>
>>> What kinds of features would this enable within the Apache Beam SDK or
>>> allow for users to write (looking for some reason as to why this is not
>>> just a one off change to support a use case)?
>>>
>>
>> It allows building a classpath and relying on Beam without requiring
>> Maven to resolve the POMs, and it is much faster than resolving the POM
>> model. I fully acknowledge it is a bit of an edge case, but it doesn't
>> cost Beam much either, so I thought I would ask before doing something
>> 100% custom.
>>
>>
>>> Would it list all the transitive dependencies?
>>>
>>
>> all runtime ones (= not the test and provided ones - even if I can live
>> with it listing them all, I just don't see why it would)
>>
>>
>>>
>>> How would you test that it works?
>>>
>>
>> It is a Maven plugin, so I am not sure it requires a test in Beam itself,
>> but on my side I already have tests for this kind of thing, typically
>> running a server from this kind of file.
>>
>>
>>>
>>> On Wed, Feb 7, 2018 at 7:23 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I have a use case where I would resolve the Beam classpath
>>>> programmatically. I wonder if it would be possible to add in META-INF (or
>>>> BEAM-INF; being in the jar is the main request ;)) a dependencies.txt (or
>>>> another file) listing all the mandatory dependencies. I'm mainly interested
>>>> in the java sdk core module but it can be beneficial to others as well.
>>>>
>>>> With Maven it is just a matter of defining:
>>>>
>>>> <plugin>
>>>>   <groupId>org.apache.maven.plugins</groupId>
>>>>   <artifactId>maven-dependency-plugin</artifactId>
>>>>   <version>${dependency-plugin.version}</version>
>>>>   <executions>
>>>>     <execution>
>>>>       <id>create-META-INF/dependencies.txt</id>
>>>>       <phase>prepare-package</phase>
>>>>       <goals>
>>>>         <goal>list</goal>
>>>>       </goals>
>>>>       <configuration>
>>>>         <outputFile>${project.build.outputDirectory}/META-INF/dependencies.txt</outputFile>
>>>>       </configuration>
>>>>     </execution>
>>>>   </executions>
>>>> </plugin>
>>>>
>>>> With Gradle it is a loop over a ResolvedConfiguration which dumps the
>>>> artifacts in Maven format (group:name:type:version).
>>>>
>>>> My interest in having it in Beam is to be able to upgrade Beam without
>>>> having to re-release this metadata.
>>>>
>>>> Is it something the project could be interested in?
>>>>
>>>> Romain Manni-Bucau
>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>> <http://rmannibucau.wordpress.com> | Github
>>>> <https://github.com/rmannibucau> | LinkedIn
>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>
>>>
>>>
>>
>


Re: dependencies.txt in META-INF?

2018-02-07 Thread Romain Manni-Bucau
2018-02-07 18:21 GMT+01:00 Lukasz Cwik <lc...@google.com>:

> What kinds of features would this enable within the Apache Beam SDK or
> allow for users to write (looking for some reason as to why this is not
> just a one off change to support a use case)?
>

It allows building a classpath and relying on Beam without requiring Maven
to resolve the POMs, and it is much faster than resolving the POM model. I
fully acknowledge it is a bit of an edge case, but it doesn't cost Beam
much either, so I thought I would ask before doing something 100% custom.


> Would it list all the transitive dependencies?
>

all runtime ones (= not the test and provided ones - even if I can live
with it listing them all, I just don't see why it would)


>
> How would you test that it works?
>

It is a Maven plugin, so I am not sure it requires a test in Beam itself,
but on my side I already have tests for this kind of thing, typically
running a server from this kind of file.


>
> On Wed, Feb 7, 2018 at 7:23 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> Hi guys,
>>
>> I have a use case where I would resolve the Beam classpath
>> programmatically. I wonder if it would be possible to add in META-INF (or
>> BEAM-INF; being in the jar is the main request ;)) a dependencies.txt (or
>> another file) listing all the mandatory dependencies. I'm mainly interested
>> in the java sdk core module but it can be beneficial to others as well.
>>
>> With Maven it is just a matter of defining:
>>
>> <plugin>
>>   <groupId>org.apache.maven.plugins</groupId>
>>   <artifactId>maven-dependency-plugin</artifactId>
>>   <version>${dependency-plugin.version}</version>
>>   <executions>
>>     <execution>
>>       <id>create-META-INF/dependencies.txt</id>
>>       <phase>prepare-package</phase>
>>       <goals>
>>         <goal>list</goal>
>>       </goals>
>>       <configuration>
>>         <outputFile>${project.build.outputDirectory}/META-INF/dependencies.txt</outputFile>
>>       </configuration>
>>     </execution>
>>   </executions>
>> </plugin>
>>
>> With Gradle it is a loop over a ResolvedConfiguration which dumps the
>> artifacts in Maven format (group:name:type:version).
>>
>> My interest in having it in Beam is to be able to upgrade Beam without
>> having to re-release this metadata.
>>
>> Is it something the project could be interested in?
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>
>


dependencies.txt in META-INF?

2018-02-07 Thread Romain Manni-Bucau
Hi guys,

I have a use case where I would resolve the Beam classpath programmatically.
I wonder if it would be possible to add in META-INF (or BEAM-INF; being in
the jar is the main request ;)) a dependencies.txt (or another file) listing
all the mandatory dependencies. I'm mainly interested in the java sdk core
module but it can be beneficial to others as well.

With Maven it is just a matter of defining:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <version>${dependency-plugin.version}</version>
  <executions>
    <execution>
      <id>create-META-INF/dependencies.txt</id>
      <phase>prepare-package</phase>
      <goals>
        <goal>list</goal>
      </goals>
      <configuration>
        <outputFile>${project.build.outputDirectory}/META-INF/dependencies.txt</outputFile>
      </configuration>
    </execution>
  </executions>
</plugin>

With Gradle it is a loop over a ResolvedConfiguration which dumps the
artifacts in Maven format (group:name:type:version).
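On the consuming side, a small sketch of how such a META-INF/dependencies.txt could be turned into jar names. The `group:artifact:type:version:scope` line layout mirrors the output of `dependency:list`, and the class and method names here are invented for illustration, not part of Beam:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical consumer of a META-INF/dependencies.txt written by
// maven-dependency-plugin's list goal. Lines like
// "   org.apache.beam:beam-sdks-java-core:jar:2.3.0:compile"
// are mapped to "artifact-version.jar"; header/blank lines are skipped.
public class DependencyClasspath {
    public static List<String> toJarNames(List<String> lines) {
        return lines.stream()
            .map(String::trim)
            .filter(l -> l.split(":").length >= 4)    // keep only GAV lines
            .map(l -> {
                String[] gav = l.split(":");          // group:artifact:type:version[:scope]
                return gav[1] + "-" + gav[3] + ".jar";
            })
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "The following files have been resolved:",
            "   org.apache.beam:beam-sdks-java-core:jar:2.3.0:compile",
            "   com.google.guava:guava:jar:20.0:compile");
        // prints: [beam-sdks-java-core-2.3.0.jar, guava-20.0.jar]
        System.out.println(toJarNames(lines));
    }
}
```

From such names one could assemble a classpath against a local repository or lib directory without touching the POM model, which is the speed-up described above.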

My interest in having it in Beam is to be able to upgrade Beam without
having to re-release this metadata.

Is it something the project could be interested in?

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>


Re: Schema-Aware PCollections revisited

2018-02-05 Thread Romain Manni-Bucau
I would add a use case: a single serialization mechanism across a pipeline.
JSON allows handling generic records (JsonObject) as well as POJO
serialization, and both are compatible. Compared to Avro's built-in
mechanism, it is not intrusive in the models, which is a key feature of an
API. It also increases the portability with other languages and simplifies
the cluster setup/maintenance of streams, and development - keep in mind
people can (and do) use Beam without the portable API, which has been so
intrusive lately too.

It also joins the API driven world where we live now - and will not change
soon ;).

Le 6 févr. 2018 06:06, "Kenneth Knowles" <k...@google.com> a écrit :

Joining late, but very interested. Commented on the doc. Since there's a
forked discussion between doc and thread, I want to say this on the thread:

1. I have used JSON schema in production for describing the structure of
analytics events and it is OK but not great. If you are sure your data is
only JSON, use it. For Beam the hierarchical structure is meaningful while
the atomic pieces should be existing coders. When we integrate with SQL
that can get more specific.

2. Overall, I found the discussion and doc a bit short on use cases. I can
propose a few:

 - incoming topic of events from clients (at various levels of upgrade /
schema adherence)
 - async update of client and pipeline in the above
 - archive of files that parse to a POJO of known schema, or archive of all
of the above
 - SQL integration / columnar operation with all of the above
 - autogenerated UI integration with all of the above

My impression is that the design will nail SQL integration and
autogenerated UI but will leave compatibility/evolution concerns for later.
IMO this is smart as they are much harder.

Kenn

On Mon, Feb 5, 2018 at 1:55 PM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> None; JSON-P - the spec, so no strong impl is required - as the record API,
> and a custom light wrapping for the schema - like
> https://github.com/Talend/component-runtime/blob/master/component-form/component-form-model/src/main/java/org/talend/sdk/component/form/model/jsonschema/JsonSchema.java
> (note this code is used for something else) - or a plain JsonObject, which
> should be sufficient.
>
> side note: Apache Johnzon would probably be happy to host an enriched
> schema module based on jsonp if you feel it better this way.
>
>
> Le 5 févr. 2018 21:43, "Reuven Lax" <re...@google.com> a écrit :
>
> Which json library are you thinking of? At least in Java, there's always
> been a problem of no good standard Json library.
>
>
>
> On Mon, Feb 5, 2018 at 12:03 PM, Romain Manni-Bucau <rmannibu...@gmail.com
> > wrote:
>
>>
>>
>> Le 5 févr. 2018 19:54, "Reuven Lax" <re...@google.com> a écrit :
>>
>> multiplying by 1.0 doesn't really solve the right problems. The number
>> type used by JavaScript (and by extension, the standard for JSON) only has
>> 53 bits of precision. I've seen many, many bugs caused by this - the
>> input data may easily contain numbers too large for 53 bits.
>>
>>
>> You have alternative than string at the end whatever schema you use so
>> not sure it is an issue. At least if runtime is in java or mainstream
>> languages.
>>
>>
>>
>> In addition, Beam's schema representation must be no less general than
>> other common representations. For the case of an ETL pipeline, if input
>> fields are integers the output fields should also be numbers. We shouldn't
>> turn them into floats because the schema class we used couldn't distinguish
>> between ints and floats. If anything, Avro schemas are a better fit here as
>> they are more general.
>>
>>
>> This is what the previous definition does. Avro is not better, for 2 reasons:
>>
>> 1. Its dependency stack is a clear blocker, and please don't even speak of
>> yet another uncontrolled shade in the API. Until Avro becomes an API only,
>> and not an impl, it is a bad fit for Beam.
>> 2. It must be JSON friendly, so you are back to JSON + metadata, so a
>> jsonschema+extension entry is strictly equivalent and as typed.
>>
>>
>>
>> Reuven
>>
>> On Sun, Feb 4, 2018 at 9:31 AM, Romain Manni-Bucau <rmannibu...@gmail.com
>> > wrote:
>>
>>> You can handle integers using multipleOf: 1.0 IIRC.
>>> Yes limitations are still here but it is a good starting model and to be
>>> honest it is good enough - not a single model will work good enough even if
>>> you can go a little bit further with other models a bit more complex.
>>> That said the idea is to enrich the model with a beam object which would
>>> allow to complete the metadata as required

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Romain Manni-Bucau
None; JSON-P - the spec, so no strong impl is required - as the record API,
and a custom light wrapping for the schema - like
https://github.com/Talend/component-runtime/blob/master/component-form/component-form-model/src/main/java/org/talend/sdk/component/form/model/jsonschema/JsonSchema.java
(note this code is used for something else) - or a plain JsonObject, which
should be sufficient.

side note: Apache Johnzon would probably be happy to host an enriched
schema module based on jsonp if you feel it better this way.

Le 5 févr. 2018 21:43, "Reuven Lax" <re...@google.com> a écrit :

Which json library are you thinking of? At least in Java, there's always
been a problem of no good standard Json library.



On Mon, Feb 5, 2018 at 12:03 PM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

>
>
> Le 5 févr. 2018 19:54, "Reuven Lax" <re...@google.com> a écrit :
>
> multiplying by 1.0 doesn't really solve the right problems. The number
> type used by JavaScript (and by extension, the standard for JSON) only has
> 53 bits of precision. I've seen many, many bugs caused by this -
> the input data may easily contain numbers too large for 53 bits.
>
>
> You have alternative than string at the end whatever schema you use so not
> sure it is an issue. At least if runtime is in java or mainstream languages.
>
>
>
> In addition, Beam's schema representation must be no less general than
> other common representations. For the case of an ETL pipeline, if input
> fields are integers the output fields should also be numbers. We shouldn't
> turn them into floats because the schema class we used couldn't distinguish
> between ints and floats. If anything, Avro schemas are a better fit here as
> they are more general.
>
>
> This is what the previous definition does. Avro is not better, for 2 reasons:
>
> 1. Its dependency stack is a clear blocker, and please don't even speak of
> yet another uncontrolled shade in the API. Until Avro becomes an API only,
> and not an impl, it is a bad fit for Beam.
> 2. It must be JSON friendly, so you are back to JSON + metadata, so a
> jsonschema+extension entry is strictly equivalent and as typed.
>
>
>
> Reuven
>
> On Sun, Feb 4, 2018 at 9:31 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> You can handle integers using multipleOf: 1.0 IIRC.
>> Yes limitations are still here but it is a good starting model and to be
>> honest it is good enough - not a single model will work good enough even if
>> you can go a little bit further with other models a bit more complex.
>> That said the idea is to enrich the model with a beam object which would
>> allow to complete the metadata as required when needed (never?).
>>
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>> 2018-02-04 18:21 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>>
>>> Sorry guys, I was off today. Happy to be part of the party too ;)
>>>
>>> Regards
>>> JB
>>>
>>> On 02/04/2018 06:19 PM, Reuven Lax wrote:
>>> > Romain, since you're interested maybe the two of us should put
>>> together a
>>> > proposal for how to set these things (hints, schema) on PCollections? I
>>> don't
>>> > think it'll be hard - the previous list thread on hints already agreed
>>> on a
>>> > general approach, and we would just need to flesh it out.
>>> >
>>> > BTW in the past when I looked, Json schemas seemed to have some odd
>>> limitations
>>> > inherited from Javascript (e.g. no distinction between integer and
>>> > floating-point types). Is that still true?
>>> >
>>> > Reuven
>>> >
>>> > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com
>>> > <mailto:rmannibu...@gmail.com>> wrote:
>>> >
>>> >
>>> >
>>> > 2018-02-04 17:53 GMT+01:00 Reuven Lax <re...@google.com
>>> > <mailto:re...@google.com>>:
>>> >
>>> >
>>> >
>>> > On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau
>>> > <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote:
>>> >
>>> >
>>> &

Re: coder evolutions?

2018-02-05 Thread Romain Manni-Bucau
Does it mean we would change the implicit resolution? Do you see it being
backward compatible? If so, it sounds like a good solution.

Le 5 févr. 2018 20:36, "Kenneth Knowles" <k...@google.com> a écrit :

> TL;DR: creating _new_ coders is not a problem. If you have a new idea for an
> encoding, you can build it alongside and users can use it. We also need
> data migration, and this is probably the easy way to be ready for that.
>
> We made a pretty big mistake in our naming of ListCoder, SetCoder, and
> IterableLikeCoder because they make users think it is the
> only/best/canonical encoding. We did it right with e.g. VarLongCoder and
> BigEndianLongCoder. There is a default, but it is just a default.
>
> We actually already need "SetIterableLikeCoder" (aka SetCoder) and perhaps
> "LexicallySortedBytesSetCoder" so we can change coder inference to ask for
> a deterministic coder when it is needed instead of first asking for "any"
> coder and then crashing when we get the wrong type.
>
> Kenn
>
> On Mon, Feb 5, 2018 at 11:00 AM, Robert Bradshaw <rober...@google.com>
> wrote:
>
>> Just to clarify, the issue is that for some types (byte array being
>> the simplest) one needs to know the length of the data in order to
>> decode it from the stream. In particular, the claim is that many
>> libraries out there that do encoding/decoding assume they can gather
>> this information from the end of the stream and so don't explicitly
>> record it. For nested values, someone needs to record these lengths.
>> Note that in the Fn API, nearly everything is nested, as the elements
>> are sent as a large byte stream of concatenated encoded elements.
>>
>> Your proposed solution is to require all container coders (though I
>> think your PR only considers IterableLikeCoder, there's others, and
>> there's the Elements proto itself) to prefix element encodings with
>> sizes so it can give truncated streams on decoding. I think this
>> places an undue burden (and code redundancy in) container coders, and
>> disallows optimization on those coders that don't need to be length
>> prefixed (and note that *prefixing* with length is not the only way to
>> delimit a stream, we shouldn't impose that restriction as well).
>> Instead, I'd keep things the way they are, but offer a new Coder
>> subclass that users can subclass if they want to write an "easy" Coder
>> that does the delimiting for them (on encode and decode). We would
>> point users to this for writing custom coders in the easiest way
>> possible as a good option, and keeps the current Coder API the same.
>>
>> On Mon, Feb 5, 2018 at 10:21 AM, Romain Manni-Bucau
>> <rmannibu...@gmail.com> wrote:
>> > Answered inline, but I want to highlight that Beam is a portable API on
>> > top of well-known vendor APIs which have friendly shortcuts. So the
>> > background here is to make Beam at least user friendly.
>> >
>> > I'm fine if the outcome of the discussion is that the coder concept is
>> > wrong or something like that, but I'm not fine saying we don't want to
>> > solve an API issue - not to say a bug - of a project whose added value
>> > is its API.
>> >
>> > I understand the perf concern, which must be balanced with the fact that
>> > deserialization is not used for each step/transform and that currently
>> > the coder API is already intrusive and heavy for devs but also not
>> > usable by most existing codecs out there. Even some JAXB or plain XML
>> > flavors don't work with it :(.
>> >
>> >
>> > Le 5 févr. 2018 18:46, "Robert Bradshaw" <rober...@google.com> a écrit
>> :
>> >
>> > On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
>> > <rmannibu...@gmail.com> wrote:
>> >> Hi guys,
>> >>
>> >> I submitted a PR on coders to enhance 1. the user experience 2. the
>> >> determinism and handling of coders.
>> >>
>> >> 1. the user experience is linked to what I sent some days ago: close()
>> >> handling of the streams from coder code. Long story short, I add a
>> >> SkipCloseCoder which can decorate a coder and just wraps the stream
>> >> (input or output) in flavors skipping close() calls. This avoids doing
>> >> it by default (which had my preference if you read the related thread,
>> >> but not the one of everybody) but also makes the usage of a coder with
>> >> this issue easy, since the of() of the coder just wraps itself in this
>> >> delegating coder.
>>

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Romain Manni-Bucau
Le 5 févr. 2018 19:54, "Reuven Lax" <re...@google.com> a écrit :

multiplying by 1.0 doesn't really solve the right problems. The number type
used by JavaScript (and by extension, the standard for JSON) only has 53
bits of precision. I've seen many, many bugs caused by this - the
input data may easily contain numbers too large for 53 bits.
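The 53-bit limit is easy to demonstrate in plain Java, since the JSON/JavaScript number type is backed by an IEEE-754 double:

```java
// 2^53 is the largest integer a double (and hence a JSON/JavaScript
// number) can represent exactly; 2^53 + 1 rounds back down to 2^53,
// so the two values become indistinguishable after conversion.
public class JsonNumberPrecision {
    public static void main(String[] args) {
        long exact = 1L << 53;                // 9007199254740992
        double a = (double) exact;
        double b = (double) (exact + 1);      // rounds to the same double
        System.out.println(a == b);           // prints: true
    }
}
```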


You have alternative than string at the end whatever schema you use so not
sure it is an issue. At least if runtime is in java or mainstream languages.



In addition, Beam's schema representation must be no less general than
other common representations. For the case of an ETL pipeline, if input
fields are integers the output fields should also be numbers. We shouldn't
turn them into floats because the schema class we used couldn't distinguish
between ints and floats. If anything, Avro schemas are a better fit here as
they are more general.


This is what the previous definition does. Avro is not better, for 2 reasons:

1. Its dependency stack is a clear blocker, and please don't even speak of
yet another uncontrolled shade in the API. Until Avro becomes an API only,
and not an impl, it is a bad fit for Beam.
2. It must be JSON friendly, so you are back to JSON + metadata, so a
jsonschema+extension entry is strictly equivalent and as typed.



Reuven

On Sun, Feb 4, 2018 at 9:31 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> You can handle integers using multipleOf: 1.0 IIRC.
> Yes, the limitations are still here, but it is a good starting model and,
> to be honest, it is good enough - no single model will work well enough,
> even if you can go a little further with other, slightly more complex
> models. That said, the idea is to enrich the model with a Beam object
> which would allow completing the metadata as required, when needed (never?).
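For illustration, the `multipleOf: 1.0` trick mentioned above would look like this in a JSON schema fragment (how well validators treat this as an integer constraint varies, so consider this a sketch of the idea rather than a portable guarantee):

```json
{
  "type": "object",
  "properties": {
    "count": {
      "type": "number",
      "multipleOf": 1.0
    }
  }
}
```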
>
>
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
> 2018-02-04 18:21 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>
>> Sorry guys, I was off today. Happy to be part of the party too ;)
>>
>> Regards
>> JB
>>
>> On 02/04/2018 06:19 PM, Reuven Lax wrote:
>> > Romain, since you're interested maybe the two of us should put together
>> a
>> > proposal for how to set these things (hints, schema) on PCollections? I
>> don't
>> > think it'll be hard - the previous list thread on hints already agreed
>> on a
>> > general approach, and we would just need to flesh it out.
>> >
>> > BTW in the past when I looked, Json schemas seemed to have some odd
>> limitations
>> > inherited from Javascript (e.g. no distinction between integer and
>> > floating-point types). Is that still true?
>> >
>> > Reuven
>> >
>> > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <
>> rmannibu...@gmail.com
>> > <mailto:rmannibu...@gmail.com>> wrote:
>> >
>> >
>> >
>> > 2018-02-04 17:53 GMT+01:00 Reuven Lax <re...@google.com
>> > <mailto:re...@google.com>>:
>> >
>> >
>> >
>> > On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau
>> > <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote:
>> >
>> >
>> > 2018-02-04 17:37 GMT+01:00 Reuven Lax <re...@google.com
>> > <mailto:re...@google.com>>:
>> >
>> > I'm not sure where proto comes from here. Proto is one
>> example
>> > of a type that has a schema, but only one example.
>> >
>> > 1. In the initial prototype I want to avoid modifying
>> the
>> > PCollection API. So I think it's best to create a
>> special
>> > SchemaCoder, and pass the schema into this coder. Later
>> we might
>> > targeted APIs for this instead of going through a coder.
>> > 1.a I don't see what hints have to do with this?
>> >
>> >
>> > Hints are a way to replace the new API and unify the way to
>> pass
>> > metadata in beam instead of adding a new custom way each
>> time.
>> >
>> >
>> > I don't think schema is a hint. But I hear what your saying -
>> hint is a
>> > type of PC

Re: coder evolutions?

2018-02-05 Thread Romain Manni-Bucau
Would this work for everyone? I can update the PR if so:

If the coder is not built in
  Prefix with the byte size
Else
  Current behavior

?

Le 5 févr. 2018 19:21, "Romain Manni-Bucau" <rmannibu...@gmail.com> a
écrit :

> Answered inline, but I want to highlight that Beam is a portable API on top
> of well-known vendor APIs which have friendly shortcuts. So the background
> here is to make Beam at least user friendly.
>
> I'm fine if the outcome of the discussion is that the coder concept is
> wrong or something like that, but I'm not fine saying we don't want to
> solve an API issue - not to say a bug - of a project whose added value is
> its API.
>
> I understand the perf concern, which must be balanced with the fact that
> deserialization is not used for each step/transform and that currently the
> coder API is already intrusive and heavy for devs but also not usable by
> most existing codecs out there. Even some JAXB or plain XML flavors don't
> work with it :(.
>
> Le 5 févr. 2018 18:46, "Robert Bradshaw" <rober...@google.com> a écrit :
>
> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
> <rmannibu...@gmail.com> wrote:
> > Hi guys,
> >
> > I submitted a PR on coders to enhance 1. the user experience 2. the
> > determinism and handling of coders.
> >
> > 1. the user experience is linked to what I sent some days ago: close()
> > handling of the streams from coder code. Long story short, I add a
> > SkipCloseCoder which can decorate a coder and just wraps the stream
> > (input or output) in flavors skipping close() calls. This avoids doing it
> > by default (which had my preference if you read the related thread, but
> > not the one of everybody) but also makes the usage of a coder with this
> > issue easy, since the of() of the coder just wraps itself in this
> > delegating coder.
> >
> > 2. this one is more nasty and mainly concerns IterableLikeCoders. These
> ones
> > use this kind of algorithm (keep in mind they work on a list):
> >
> > writeSize()
> > for all element e {
> > elementCoder.write(e)
> > }
> > writeMagicNumber() // this one depends the size
> >
> > The decoding is symmetric so I bypass it here.
> >
> > Indeed all these writes (reads) are done on the same stream. Therefore it
> > assumes you read as many bytes as you write...which is a huge assumption
> > for a coder which should by contract assume it can read the stream...as a
> > stream (until -1).
> >
> > The idea of the fix is to change this encoding to this kind of algorithm:
> >
> > writeSize()
> > for all element e {
> > writeElementByteCount(e)
> > elementCoder.write(e)
> > }
> > writeMagicNumber() // still optionally
>
> Regardless of the backwards incompatibility issues, I'm unconvinced
> that prefixing every element with its length is a good idea. It can
> lead to blow-up in size (e.g. a list of ints, and it should be noted
> that containers with lots of elements bias towards having small
> elements) and also writeElementByteCount(e) could be very inefficient
> for many types e (e.g. a list of lists).
>
>
> What is your proposal then, Robert? The current restriction is clearly a
> blocker for portability, users and determinism, and it is unsafe and only
> checkable at runtime, so not something we should keep.
>
> Alternative i thought about was to forbid implicit coders but it doesnt
> help users.
>
>
>
> > This way on the decode size you can wrap the stream by element to enforce
> > the limitation of the byte count.
> >
> > Side note: this indeed enforces a limitation due to the Java byte[] size
> > limit, but if you check the coder code it is already there at the higher
> > level, so it is not a big deal for now.
> >
> > In terms of implementation it uses a LengthAwareCoder which delegates to
> > another coder the encoding and just adds the byte count before the actual
> > serialization. Not perfect but should be more than enough in terms of
> > support and perf for beam if you think real pipelines (we try to avoid
> > serializations or it is done on some well known points where this algo
> > should be enough...worst case it is not a huge overhead, mainly just some
> > memory overhead).
> >
> >
> > The PR is available at https://github.com/apache/beam/pull/4594. If you
> > check you will see I put it "WIP". The main reason is that it changes the
> > encoding format for containers (lists, iterable, ...) and therefore
> breaks
> > python/go/... tests and the standard_coders.yml definition. Some help on
> > that would be very we

Re: coder evolutions?

2018-02-05 Thread Romain Manni-Bucau
Answered inline, but I want to highlight that Beam is a portable API on top
of well-known vendor APIs which have friendly shortcuts. So the background
here is to make Beam at least user friendly.

I'm fine if the outcome of the discussion is that the coder concept is
wrong or something like that, but I'm not fine saying we don't want to
solve an API issue - not to say a bug - of a project whose added value is
its API.

I understand the perf concern, which must be balanced with the fact that
deserialization is not used for each step/transform and that currently the
coder API is already intrusive and heavy for devs but also not usable by
most existing codecs out there. Even some JAXB or plain XML flavors don't
work with it :(.

Le 5 févr. 2018 18:46, "Robert Bradshaw" <rober...@google.com> a écrit :

On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
<rmannibu...@gmail.com> wrote:
> Hi guys,
>
> I submitted a PR on coders to enhance 1. the user experience 2. the
> determinism and handling of coders.
>
> 1. the user experience is linked to what I sent some days ago: close()
> handling of the streams from coder code. Long story short, I add a
> SkipCloseCoder which can decorate a coder and just wraps the stream (input
> or output) in flavors skipping close() calls. This avoids doing it by
> default (which had my preference if you read the related thread, but not
> the one of everybody) but also makes the usage of a coder with this issue
> easy, since the of() of the coder just wraps itself in this delegating
> coder.
>
> 2. this one is more nasty and mainly concerns IterableLikeCoders. These
ones
> use this kind of algorithm (keep in mind they work on a list):
>
> writeSize()
> for all element e {
> elementCoder.write(e)
> }
> writeMagicNumber() // this one depends the size
>
> The decoding is symmetric so I bypass it here.
>
> Indeed all these writes (reads) are done on the same stream. Therefore it
> assumes you read as many bytes as you write...which is a huge assumption
> for a coder which should by contract assume it can read the stream...as a
> stream (until -1).
>
> The idea of the fix is to change this encoding to this kind of algorithm:
>
> writeSize()
> for all element e {
> writeElementByteCount(e)
> elementCoder.write(e)
> }
> writeMagicNumber() // still optionally

Regardless of the backwards incompatibility issues, I'm unconvinced
that prefixing every element with its length is a good idea. It can
lead to blow-up in size (e.g. a list of ints, and it should be noted
that containers with lots of elements bias towards having small
elements) and also writeElementByteCount(e) could be very inefficient
for many types of e (e.g. a list of lists).


What is your proposal then, Robert? The current restriction is clearly a
blocker for portability, users, and determinism, and it is unsafe and only
checkable at runtime, so not something we should want to keep.

The alternative I thought about was to forbid implicit coders, but that
doesn't help users.



> This way on the decode side you can wrap the stream by element to enforce
> the limitation of the byte count.
>
> Side note: this indeed enforces a limitation due to Java's byte count
> limitation, but if you check the coder code it is already there at a
> higher level, so it is not a big deal for now.
>
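A minimal standalone sketch of that length-prefixed layout (plain Java I/O with String elements standing in for an element coder; names are illustrative, not the actual Beam API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed container encoding: each element is
// prefixed with its byte count, so the decoder can give every element a
// bounded read instead of trusting the element coder to consume exactly
// the bytes that were written.
public class LengthPrefixedListCodec {

    static byte[] encode(List<String> elements) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(elements.size());        // writeSize()
        for (String e : elements) {
            byte[] payload = e.getBytes(StandardCharsets.UTF_8);
            out.writeInt(payload.length);     // writeElementByteCount(e)
            out.write(payload);               // elementCoder.write(e)
        }
        return bytes.toByteArray();
    }

    static List<String> decode(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int size = in.readInt();
        List<String> result = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);            // bounded read per element
            result.add(new String(payload, StandardCharsets.UTF_8));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        List<String> in = Arrays.asList("a", "bc", "");
        System.out.println(decode(encode(in))); // prints [a, bc, ]
    }
}
```

This also makes the size trade-off concrete: every element pays an extra four-byte prefix here, which dominates for small elements such as ints.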
> In terms of implementation it uses a LengthAwareCoder which delegates the
> encoding to another coder and just adds the byte count before the actual
> serialization. Not perfect, but it should be more than enough in terms of
> support and perf for Beam if you think about real pipelines (we try to
> avoid serializations, or they are done at some well-known points where
> this algorithm should be enough... worst case it is not a huge overhead,
> mainly just some memory overhead).
>
>
> The PR is available at https://github.com/apache/beam/pull/4594. If you
> check you will see I put it "WIP". The main reason is that it changes the
> encoding format for containers (lists, iterables, ...) and therefore
> breaks the python/go/... tests and the standard_coders.yml definition.
> Some help on that would be very welcome.
>
> Technical side note if you wonder: UnownedInputStream doesn't even allow
> marking the stream, so there is no real fast way to read the stream as
> fast as possible with standard buffering strategies and to support this
> automatic IterableCoder wrapping which is implicit. In other words, if
> Beam wants to support any coder, including the ones not requiring to write
> the size of the output - most of the codecs - then we need to change the
> way it works to something like that, which does it for the user who
> doesn't know their coder got wrapped.
>
> Hope it makes sense, if not, don't hesitate to ask questions.
>
> Happy end of week-end.
>
> Romain Manni-Bucau
> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book


Re: coder evolutions?

2018-02-05 Thread Romain Manni-Bucau
Thanks, created https://issues.apache.org/jira/browse/BEAM-3616


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-04 22:12 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:

> Done
>
> Regards
> JB
>
> On 02/04/2018 09:14 PM, Romain Manni-Bucau wrote:
> > Works for me. So a jira with target version = 3.
> >
> > Can someone with the karma check we have a 3.0.0 in jira system please?
> >
> > On 4 Feb 2018 20:46, "Reuven Lax" <re...@google.com> wrote:
> >
> > Seems fine to me. At some point we might want to do an audit of
> existing
> > Jira issues, because I suspect there are issues that should be
> targeted to
> > 3.0 but are not yet tagged.
> >
> > On Sun, Feb 4, 2018 at 11:41 AM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> > <mailto:j...@nanthrax.net>> wrote:
> >
> > I would prefer to use Jira, with "wish"/"ideas", and adding Beam
> 3.0.0
> > version.
> >
> > WDYT ?
> >
> > Regards
> > JB
> >
> > On 02/04/2018 07:55 PM, Reuven Lax wrote:
> > > Do we have a good place to track the items for Beam 3.0, or is
> Jira the best
> > > place? Romain has a good point - if this gets forgotten when
> we do Beam 3.0,
> > > then we're stuck waiting around till Beam 4.0.
> > >
> > > Reuven
> > >
> > > On Sun, Feb 4, 2018 at 9:27 AM, Jean-Baptiste Onofré <
> j...@nanthrax.net <mailto:j...@nanthrax.net>
> > > <mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>>> wrote:
> > >
> > > That's a good point. In the roadmap for Beam 3, I think it
> makes
> > sense to add a
> > > point about this.
> > >
> > > Regards
> > > JB
> > >
> > > On 02/04/2018 06:18 PM, Eugene Kirpichov wrote:
> > > > I think doing a change that would break pipeline update
> for
> > every single user of
> > > > Flink and Dataflow needs to be postponed until a next
> major
> > version. Pipeline
> > > > update is a very frequently used feature, especially by
> the
> > largest users. We've
> > > > had those users get significantly upset even when we
> > accidentally broke update
> > > > compatibility for some special cases of individual
> transforms;
> > breaking it
> > > > intentionally and project-wide is too extreme to be
> justified by
> > the benefits of
> > > > the current change.
> > > >
> > > > That said, I think concerns about coder APIs are
> reasonable, and
> > it is
> > > > unfortunate that we effectively can't make changes to
> them right
> > now. It would
> > > > be great if in the next major version we were better
> prepared
> > for evolution of
> > > > coders, e.g. by having coders support a version marker or
> >     something like that,
> > > > with an API for detecting the version of data on wire and
> > reading or writing
> > > > data of an old version. Such a change (introducing
> versioning)
> > would also, of
> > > > course, be incompatible and would need to be postponed
> until a
> > major version -
> > > > but, at least, subsequent changes wouldn't.
> > > >
> > > > ...And as I was typing this email, seems that this is
> what the
> > thread already
> > > > came to!
> > > >
> > > > On Sun, Feb 4, 2018 at 9:16 AM Romain Manni-Bucau
> > <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>
> > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>>
> > > > <mai

Re: [DISCUSS] State of the project: Culture and governance

2018-02-04 Thread Romain Manni-Bucau
preservation purposes, google docs are really
> >> > practical, but have the issue that they can be removed without a ‘public’
> >> > copy in the open (at least the wiki will avoid this). This is still an
> >> > open subject because we haven’t finished a proposal process (see
> >> > below).
> >> >
> >> >> I don't see why and where Beam could be different from the other
> Apache
> >> >> projects
> >> > for the first three points.
> >> >
> >> > The three points are:
> >> > 1. Code of conduct: I agree with you.
> >> > 2. Proposal process: There is no Apache-agreed way to do this; every
> >> > community decides its way. And we haven’t finished this work, see
> >> > https://issues.apache.org/jira/browse/BEAM-566
> >> > 3. Governance model: We follow the Apache process, but I agree that
> >> > this is not the appropriate channel to discuss, since governance is a
> >> > subject of the PMC.
> >> >
> >> >> I disagree about publishing the criteria to earn committership
> >> >
> >> > I understand because this can be subjective, I want to clarify that I
> >> > am not in any case wanting to give a simple recipe to become committer
> >> > because this simply does not exist, but if you take a look at the link
> >> > I sent, you will see that we should make some points explicit, e.g.
> >> > the importance of community building and the Apache Way. PTAL I think
> >> > this is really good and I don't see why others could disagree:
> >> >
> >> > https://flink.apache.org/how-to-contribute.html#how-to-become-a-committer
> >> >
> >> >
> >> > Romain,
> >> >
> >> >> The more policies and rules you add around a project, the more energy
> >> >> you need to make them respected and enforced. At that stage you need
> >> >> to ask yourself: is it worth it?
> >> >
> >> > I agree with you, policy brings extra bureaucracy and this is
> >> > something to avoid, but we need to make the areas where we are unaware
> >> > of policies explicit and I think that we should link to such policies
> >> > when appropriate as part of being an open community, e.g. reminding
> >> > and respecting the code of conduct that we must follow is not a
> >> > burden, it is a must.
> >> >
> >> > I will let the discussion still open and wait for others opinions,
> >> > once the activity calms down I will wrap up and create new
> >> > threads/JIRAs so we can track progress in the future.
> >> >
> >> >
> >> >
> >> > On Wed, Jan 24, 2018 at 2:19 AM, Robert Bradshaw <rober...@google.com
> >
> >> > wrote:
> >> >> On Tue, Jan 23, 2018 at 9:29 AM, Romain Manni-Bucau
> >> >> <rmannibu...@gmail.com> wrote:
> >> >>> Hi Ismael,
> >> >>>
> >> >>> The more policies and rules you add around a project, the more
> >> >>> energy you need to make them respected and enforced. At that stage
> >> >>> you need to ask yourself: is it worth it?
> >> >>
> >> >> +1. I also agree with JB that we should be deferring to Apache for
> >> >> things like Code of Conduct, etc. (perhaps more explicitly, though
> >> >> that might not even be necessary).
> >> >>
> >> >>> I'm not sure it is worth it for Beam, and even if on PRs you can
> >> >>> sometimes find some "picky" comments (and trust me, I thought so
> >> >>> more than once ;)), it is not a bad community and people are quite
> >> >>> nice. Using GitHub is a big boost to help people open PRs without
> >> >>> having to read a doc (this is key for contributions IMHO), so the
> >> >>> best is probably to manage to review faster if possible and be
> >> >>> lighter in terms of review, even if it requires a core dev commit
> >> >>> after a merge IMHO (a "while it doesn't break and brings something,
> >> >>> it is good to merge" kind of rule).
> >> >>
> >> >> Agreed that using Github is a huge win here for lowering the bar for
> >> >> contribution. We still force people to go th

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Romain Manni-Bucau
I'm off tonight, but can we try to do it next week (tomorrow)? If not, please
answer this thread with the outcomes and I'll catch up tomorrow morning.

On 4 Feb 2018 20:23, "Reuven Lax" <re...@google.com> wrote:

Cool, let's chat about this on slack for a bit (which I realized I've been
signed out of for some time).

Reuven

On Sun, Feb 4, 2018 at 9:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Sorry guys, I was off today. Happy to be part of the party too ;)
>
> Regards
> JB
>
> On 02/04/2018 06:19 PM, Reuven Lax wrote:
> > Romain, since you're interested maybe the two of us should put together a
> > proposal for how to set these things (hints, schema) on PCollections? I
> don't
> > think it'll be hard - the previous list thread on hints already agreed
> on a
> > general approach, and we would just need to flesh it out.
> >
> > BTW in the past when I looked, Json schemas seemed to have some odd
> limitations
> > inherited from Javascript (e.g. no distinction between integer and
> > floating-point types). Is that still true?
> >
> > Reuven
> >
> > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <
> rmannibu...@gmail.com
> > <mailto:rmannibu...@gmail.com>> wrote:
> >
> >
> >
> > 2018-02-04 17:53 GMT+01:00 Reuven Lax <re...@google.com
> > <mailto:re...@google.com>>:
> >
> >
> >
> > On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau
> > <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote:
> >
> >
> > 2018-02-04 17:37 GMT+01:00 Reuven Lax <re...@google.com
> > <mailto:re...@google.com>>:
> >
> > I'm not sure where proto comes from here. Proto is one
> example
> > of a type that has a schema, but only one example.
> >
> > 1. In the initial prototype I want to avoid modifying the
> > PCollection API. So I think it's best to create a special
> > SchemaCoder, and pass the schema into this coder. Later we might add
> > targeted APIs for this instead of going through a coder.
> > 1.a I don't see what hints have to do with this?
> >
> >
> > Hints are a way to replace the new API and unify the way to
> pass
> > metadata in beam instead of adding a new custom way each
> time.
> >
> >
> > I don't think schema is a hint. But I hear what you're saying -
> hint is a
> > type of PCollection metadata as is schema, and we should have a
> unified
> > API for setting such metadata.
> >
> >
> > :), Ismael pointed out to me earlier this week that "hint" had an old
> > meaning in Beam. My usage is purely the one done in most EE specs (your
> "metadata" in
> > previous answer). But guess we are aligned on the meaning now, just
> wanted
> > to be sure.
> >
> >
> >
> >
> >
> >
> >
> > 2. BeamSQL already has a generic record type which fits
> this use
> > case very well (though we might modify it). However as
> mentioned
> > in the doc, the user is never forced to use this generic
> record
> > type.
> >
> >
> > Well, yes and no. A type already exists, but 1. it is very strictly
> > limited (flat/columns only, which is very little of what big data SQL
> > can do) and 2. it must be aligned with the convergence of generic data
> > the schema will bring (really read "aligned" as "dropped in favor of" -
> > deprecated being a smooth way to do it).
> >
> >
> > As I said the existing class needs to be modified and extended,
> and not
> > just for this schema use case. It was meant to represent Calcite
> SQL rows,
> > but doesn't quite even do that yet (Calcite supports nested
> rows).
> > However I think it's the right basis to start from.
> >
> >
> > Agree on the state. Current impl issues I hit (in addition to the
> > nested support, which would by itself require a kind of visitor
> > solution) are the fact that the record owns the schema and handles the
> > serialization field by field instead of as a whole, which is how it
> > would be handled with a schema IMHO.
> >
> > Concretely what I don't want is to do a PoC which works - the

Re: coder evolutions?

2018-02-04 Thread Romain Manni-Bucau
Works for me. So a JIRA with target version = 3.

Can someone with the karma check that we have a 3.0.0 version in the JIRA
system, please?

On 4 Feb 2018 20:46, "Reuven Lax" <re...@google.com> wrote:

> Seems fine to me. At some point we might want to do an audit of existing
> Jira issues, because I suspect there are issues that should be targeted to
> 3.0 but are not yet tagged.
>
> On Sun, Feb 4, 2018 at 11:41 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> I would prefer to use Jira, with "wish"/"ideas", and adding Beam 3.0.0
>> version.
>>
>> WDYT ?
>>
>> Regards
>> JB
>>
>> On 02/04/2018 07:55 PM, Reuven Lax wrote:
>> > Do we have a good place to track the items for Beam 3.0, or is Jira the
>> best
>> > place? Romain has a good point - if this gets forgotten when we do Beam
>> 3.0,
>> > then we're stuck waiting around till Beam 4.0.
>> >
>> > Reuven
>> >
>> > On Sun, Feb 4, 2018 at 9:27 AM, Jean-Baptiste Onofré <j...@nanthrax.net
>> > <mailto:j...@nanthrax.net>> wrote:
>> >
>> > That's a good point. In the roadmap for Beam 3, I think it makes
>> sense to add a
>> > point about this.
>> >
>> > Regards
>> > JB
>> >
>> > On 02/04/2018 06:18 PM, Eugene Kirpichov wrote:
>> > > I think doing a change that would break pipeline update for every
>> single user of
>> > > Flink and Dataflow needs to be postponed until a next major
>> version. Pipeline
>> > > update is a very frequently used feature, especially by the
>> largest users. We've
>> > > had those users get significantly upset even when we accidentally
>> broke update
>> > > compatibility for some special cases of individual transforms;
>> breaking it
>> > > intentionally and project-wide is too extreme to be justified by
>> the benefits of
>> > > the current change.
>> > >
>> > > That said, I think concerns about coder APIs are reasonable, and
>> it is
>> > > unfortunate that we effectively can't make changes to them right
>> now. It would
>> > > be great if in the next major version we were better prepared for
>> evolution of
>> > > coders, e.g. by having coders support a version marker or
>> something like that,
>> > > with an API for detecting the version of data on wire and reading
>> or writing
>> > > data of an old version. Such a change (introducing versioning)
>> would also, of
>> > > course, be incompatible and would need to be postponed until a
>> major version -
>> > > but, at least, subsequent changes wouldn't.
>> > >
>> > > ...And as I was typing this email, seems that this is what the
>> thread already
>> > > came to!
>> > >
>> > > On Sun, Feb 4, 2018 at 9:16 AM Romain Manni-Bucau <
>> rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>
>> > > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>>>
>> wrote:
>> > >
>> > > I like this idea of migration support at coder level. It
>> would require to
>> > > add a metadata in all outputs which would represent the
>> version then coders
>> > > can handle the logic properly depending the version - we can
>> assume a coder
>> > > dev upgrade the version when he breaks the representation I
>> hope ;).
>> > > With this: no runner impact at all :).
>> > >
>> > >

Re: coder evolutions?

2018-02-04 Thread Romain Manni-Bucau
yep sadly :(

how should we track it properly so we don't forget it for v3? (I don't trust
JIRA much, but if we don't have anything better...)

when do we start beam 3? next week? :)



2018-02-04 18:18 GMT+01:00 Eugene Kirpichov <kirpic...@google.com>:

> I think doing a change that would break pipeline update for every single
> user of Flink and Dataflow needs to be postponed until a next major
> version. Pipeline update is a very frequently used feature, especially by
> the largest users. We've had those users get significantly upset even when
> we accidentally broke update compatibility for some special cases of
> individual transforms; breaking it intentionally and project-wide is too
> extreme to be justified by the benefits of the current change.
>
> That said, I think concerns about coder APIs are reasonable, and it is
> unfortunate that we effectively can't make changes to them right now. It
> would be great if in the next major version we were better prepared for
> evolution of coders, e.g. by having coders support a version marker or
> something like that, with an API for detecting the version of data on wire
> and reading or writing data of an old version. Such a change (introducing
> versioning) would also, of course, be incompatible and would need to be
> postponed until a major version - but, at least, subsequent changes
> wouldn't.
>
> ...And as I was typing this email, seems that this is what the thread
> already came to!
>
> On Sun, Feb 4, 2018 at 9:16 AM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> I like this idea of migration support at the coder level. It would require
>> adding metadata to all outputs representing the version; then coders can
>> handle the logic properly depending on the version - we can assume a coder
>> dev upgrades the version when they break the representation, I hope ;).
>> With this: no runner impact at all :).
>>
>>
>>
>> 2018-02-04 18:09 GMT+01:00 Reuven Lax <re...@google.com>:
>>
>>> It would already break quite a number of users at this point.
>>>
>>> I think what we should be doing is moving forward on the snapshot/update
>>> proposal. That proposal actually provides a way forward when coders change
>>> (it proposes a way to map an old snapshot to one using the new coder), so
>>> changes to coders in the future will be much easier to make. However, much
>>> of the implementation for this will likely be at the runner level, not the
>>> SDK level.
>>>
>>> Reuven
>>>
>>> On Sun, Feb 4, 2018 at 9:04 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> I fully understand that, and this is one of the reasons solving these
>>>> issues is very important, and ASAP. My conclusion is that we must break
>>>> it now to avoid doing it later when usage will be far more developed - I
>>>> would be very happy to be wrong on that point - so I started this PR and
>>>> this thread. We can postpone it, but then it would break later, for
>>>> probably more users.
>>>>
>>>>
>>>>
>>>> 2018-02-04 17:49 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>
>>>>> Unfortunately several runners (at least Flink and Dataflow) support
>>>>> in-place update of streaming pipelines as a key feature, and changing 
>>>>> coder
>>>>&

Re: coder evolutions?

2018-02-04 Thread Romain Manni-Bucau
I like this idea of migration support at the coder level. It would require
adding metadata to all outputs representing the version; then coders can
handle the logic properly depending on the version - we can assume a coder
dev upgrades the version when they break the representation, I hope ;).
With this: no runner impact at all :).
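A minimal sketch of such a version marker (illustrative names and formats, not the actual Beam Coder API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch: the coder writes a format version before the payload,
// and decode() dispatches on the version it reads back, so data written by
// an older representation stays readable after the format changes.
public class VersionedIntCodec {
    static final byte CURRENT_VERSION = 2;

    static byte[] encode(int value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeByte(CURRENT_VERSION);
        out.writeInt(value);                   // v2 format: 4-byte int
        return bytes.toByteArray();
    }

    static int decode(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        switch (in.readByte()) {
            case 1:  return in.readShort();    // legacy v1 format: 2-byte value
            case 2:  return in.readInt();
            default: throw new IOException("unknown coder version");
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(decode(encode(1234)));         // prints 1234
        System.out.println(decode(new byte[]{1, 0, 42})); // old v1 bytes: prints 42
    }
}
```

The dispatch lives entirely inside the coder, which is the point: no runner change is needed.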



2018-02-04 18:09 GMT+01:00 Reuven Lax <re...@google.com>:

> It would already break quite a number of users at this point.
>
> I think what we should be doing is moving forward on the snapshot/update
> proposal. That proposal actually provides a way forward when coders change
> (it proposes a way to map an old snapshot to one using the new coder), so
> changes to coders in the future will be much easier to make. However, much
> of the implementation for this will likely be at the runner level, not the
> SDK level.
>
> Reuven
>
> On Sun, Feb 4, 2018 at 9:04 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> I fully understand that, and this is one of the reasons solving these
>> issues is very important, and ASAP. My conclusion is that we must break it
>> now to avoid doing it later when usage will be far more developed - I
>> would be very happy to be wrong on that point - so I started this PR and
>> this thread. We can postpone it, but then it would break later, for
>> probably more users.
>>
>>
>>
>> 2018-02-04 17:49 GMT+01:00 Reuven Lax <re...@google.com>:
>>
>>> Unfortunately several runners (at least Flink and Dataflow) support
>>> in-place update of streaming pipelines as a key feature, and changing coder
>>> format breaks this. This is a very important feature of both runners, and
>>> we should endeavor not to break them.
>>>
>>> In-place snapshot and update is also a top-level Beam proposal that was
>>> received positively, though neither of those runners yet implement the
>>> proposed interface.
>>>
>>> Reuven
>>>
>>> On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> Sadly yes, and that's why the PR is actually WIP. As mentioned, it
>>>> modifies it and requires some updates in other languages and in the
>>>> standard_coders.yml file (I didn't find how this file was generated).
>>>> Since coders must be about volatile data I don't think it is a big deal
>>>> to change it though.
>>>>
>>>>
>>>>
>>>> 2018-02-04 17:34 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>
>>>>> One question - does this change the actual byte encoding of elements?
>>>>> We've tried hard not to do that so far for reasons of compatibility.
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <
>>>>> rmannibu...@gmail.com> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I submitted a PR on coders to enhance 1. the user experience 2. the
>>>>>> determinism and handling of coders.
>>>>>>
>>>>>> 1. the user experience is linked to what i sent some days ago: close
>>>>>> handling of the streams from a coder code. Long story short I add a
>>>>>> SkipCloseCoder which can decorate a coder and just wraps the stream 

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Romain Manni-Bucau
2018-02-04 17:53 GMT+01:00 Reuven Lax <re...@google.com>:

>
>
> On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>>
>> 2018-02-04 17:37 GMT+01:00 Reuven Lax <re...@google.com>:
>>
>>> I'm not sure where proto comes from here. Proto is one example of a type
>>> that has a schema, but only one example.
>>>
>>> 1. In the initial prototype I want to avoid modifying the PCollection
>>> API. So I think it's best to create a special SchemaCoder, and pass the
>>> schema into this coder. Later we might add targeted APIs for this instead
>>> of going through a coder.
>>> 1.a I don't see what hints have to do with this?
>>>
>>
>> Hints are a way to replace the new API and unify the way to pass metadata
>> in beam instead of adding a new custom way each time.
>>
>
> I don't think schema is a hint. But I hear what you're saying - hint is a
> type of PCollection metadata as is schema, and we should have a unified API
> for setting such metadata.
>

:), Ismael pointed out to me earlier this week that "hint" had an old meaning
in Beam. My usage is purely the one done in most EE specs (your "metadata"
in previous answer). But guess we are aligned on the meaning now, just
wanted to be sure.


>
>
>>
>>
>>>
>>> 2. BeamSQL already has a generic record type which fits this use case
>>> very well (though we might modify it). However as mentioned in the doc, the
>>> user is never forced to use this generic record type.
>>>
>>>
>> Well, yes and no. A type already exists, but 1. it is very strictly
>> limited (flat/columns only, which is very little of what big data SQL can
>> do) and 2. it must be aligned with the convergence of generic data the
>> schema will bring (really read "aligned" as "dropped in favor of" -
>> deprecated being a smooth way to do it).
>>
>
> As I said the existing class needs to be modified and extended, and not
> just for this schema use case. It was meant to represent Calcite SQL rows,
> but doesn't quite even do that yet (Calcite supports nested rows). However
> I think it's the right basis to start from.
>

Agree on the state. Current impl issues I hit (in addition to the nested
support, which would by itself require a kind of visitor solution) are the
fact that the record owns the schema and handles the serialization field by
field instead of as a whole, which is how it would be handled with a schema
IMHO.

Concretely, what I don't want is to do a PoC which works - they all work,
right? - and integrate it into Beam without thinking about a global solution
for this generic record issue and its schema standardization. This is where
JSON(-P) has a lot of value IMHO, but it requires a bit more love than just
adding a schema to the model.


>
>
>>
>> So long story short the main work of this schema track is not only on
>> using schema in runners and other ways but also starting to make beam
>> consistent with itself which is probably the most important outcome since
>> it is the user facing side of this work.
>>
>>
>>>
>>> On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> @Reuven: is the proto only about passing schema or also the generic
>>>> type?
>>>>
>>>> There are 2.5 topics to solve this issue:
>>>>
>>>> 1. How to pass schema
>>>> 1.a. hints?
>>>> 2. What is the generic record type associated to a schema and how to
>>>> express a schema relatively to it
>>>>
>>>> I would be happy to help on 1.a and 2 somehow if you need.
>>>>
>>>> On 4 Feb 2018 03:30, "Reuven Lax" <re...@google.com> wrote:
>>>>
>>>>> One more thing. If anyone here has experience with various OSS
>>>>> metadata stores (e.g. Kafka Schema Registry is one example), would you 
>>>>> like
>>>>> to collaborate on implementation? I want to make sure that source schemas
>>>>> can be stored in a variety of OSS metadata stores, and be easily pulled
>>>>> into a Beam pipeline.
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> If there are no concerns, I would like to start working on a
>>>>>> prototype. It's just a prototype, so I don't think it will have the final
&

Re: coder evolutions?

2018-02-04 Thread Romain Manni-Bucau
I fully understand that, and this is one of the reasons solving these issues
is very important, and ASAP. My conclusion is that we must break it now to
avoid doing it later when usage will be far more developed - I would be very
happy to be wrong on that point - so I started this PR and this thread. We
can postpone it, but then it would break later, for probably more users.



2018-02-04 17:49 GMT+01:00 Reuven Lax <re...@google.com>:

> Unfortunately several runners (at least Flink and Dataflow) support
> in-place update of streaming pipelines as a key feature, and changing coder
> format breaks this. This is a very important feature of both runners, and
> we should endeavor not to break them.
>
> In-place snapshot and update is also a top-level Beam proposal that was
> received positively, though neither of those runners yet implement the
> proposed interface.
>
> Reuven
>
> On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> Sadly yes, and that's why the PR is actually WIP. As mentioned, it
>> modifies it and requires some updates in other languages and in the
>> standard_coders.yml file (I didn't find how this file was generated).
>> Since coders must be about volatile data I don't think it is a big deal
>> to change it though.
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>> 2018-02-04 17:34 GMT+01:00 Reuven Lax <re...@google.com>:
>>
>>> One question - does this change the actual byte encoding of elements?
>>> We've tried hard not to do that so far for reasons of compatibility.
>>>
>>> Reuven
>>>
>>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I submitted a PR on coders to enhance 1. the user experience 2. the
>>>> determinism and handling of coders.
>>>>
>>>> 1. The user experience point is linked to what I sent some days ago:
>>>> handling of close() on the streams from coder code. Long story short, I
>>>> added a SkipCloseCoder which can decorate a coder and wraps the stream
>>>> (input or output) in a flavor that skips close() calls. This avoids doing
>>>> it by default (which was my preference if you read the related thread, but
>>>> not everybody's), while still making it easy to use a coder affected by
>>>> this issue, since the coder's of() simply wraps itself in this delegating
>>>> coder.
>>>>
>>>> 2. This one is nastier and mainly concerns IterableLikeCoders. These
>>>> coders use this kind of algorithm (keep in mind they work on a list):
>>>>
>>>> writeSize()
>>>> for all elements e {
>>>>   elementCoder.write(e)
>>>> }
>>>> writeMagicNumber() // this one depends on the size
>>>>
>>>> The decoding is symmetric so I skip it here.
>>>>
>>>> Indeed, all these writes (and reads) happen on the same stream, so it
>>>> assumes you read exactly as many bytes as were written... which is a huge
>>>> assumption for a coder that should, by contract, be able to consume the
>>>> stream... as a stream (until -1).
>>>>
>>>> The idea of the fix is to change this encoding to this kind of
>>>> algorithm:
>>>>
>>>> writeSize()
>>>> for all elements e {
>>>>   writeElementByteCount(e)
>>>>   elementCoder.write(e)
>>>> }
>>>> writeMagicNumber() // still optional
>>>>
>>>> This way, on the decode side, you can wrap the stream per element to
>>>> enforce the byte-count limit.
>>>>
>>>> Side note: this indeed enforces a limitation due to Java's byte-array
>>>> size limit, but if you check the coder code that limit is already there
>>>> at a higher level, so it is
>&

Re: coder evolutions?

2018-02-04 Thread Romain Manni-Bucau
Sadly yes, which is why the PR is still WIP. As mentioned, it modifies the
encoding and requires some updates in other languages and in the
standard_coders.yml file (I didn't find how this file is generated).
Since coders must only deal with volatile data, I don't think it is a big
deal to change it, though.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-04 17:34 GMT+01:00 Reuven Lax <re...@google.com>:

> One question - does this change the actual byte encoding of elements?
> We've tried hard not to do that so far for reasons of compatibility.
>
> Reuven
>
> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> Hi guys,
>>
>> I submitted a PR on coders to enhance 1. the user experience 2. the
>> determinism and handling of coders.
>>
>> 1. The user experience point is linked to what I sent some days ago:
>> handling of close() on the streams from coder code. Long story short, I
>> added a SkipCloseCoder which can decorate a coder and wraps the stream
>> (input or output) in a flavor that skips close() calls. This avoids doing
>> it by default (which was my preference if you read the related thread, but
>> not everybody's), while still making it easy to use a coder affected by
>> this issue, since the coder's of() simply wraps itself in this delegating
>> coder.
>>
>> 2. This one is nastier and mainly concerns IterableLikeCoders. These
>> coders use this kind of algorithm (keep in mind they work on a list):
>>
>> writeSize()
>> for all elements e {
>>   elementCoder.write(e)
>> }
>> writeMagicNumber() // this one depends on the size
>>
>> The decoding is symmetric so I skip it here.
>>
>> Indeed, all these writes (and reads) happen on the same stream, so it
>> assumes you read exactly as many bytes as were written... which is a huge
>> assumption for a coder that should, by contract, be able to consume the
>> stream... as a stream (until -1).
>>
>> The idea of the fix is to change this encoding to this kind of algorithm:
>>
>> writeSize()
>> for all elements e {
>>   writeElementByteCount(e)
>>   elementCoder.write(e)
>> }
>> writeMagicNumber() // still optional
>>
>> This way, on the decode side, you can wrap the stream per element to
>> enforce the byte-count limit.
>>
>> Side note: this indeed enforces a limitation due to Java's byte-array size
>> limit, but if you check the coder code that limit is already there at a
>> higher level, so it is not a big deal for now.
>>
>> In terms of implementation, it uses a LengthAwareCoder which delegates the
>> encoding to another coder and just adds the byte count before the actual
>> serialization. Not perfect, but it should be more than enough in terms of
>> support and performance for Beam if you think of real pipelines (we try to
>> avoid serializations, or they happen at well-known points where this
>> algorithm should be enough... worst case it is not a huge overhead, mainly
>> just some memory overhead).
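[Editor's note: a rough sketch of what such a length-prefixing decorator
could look like, against a simplified coder interface. The names
(SimpleCoder, LengthPrefixingCoder) are illustrative assumptions, not the
PR's actual LengthAwareCoder code.]

```java
// Sketch of the proposed per-element length prefix. The decorator buffers
// each element to learn its size, writes the size, then the bytes; on
// decode it hands the delegate exactly that slice of the stream.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

interface SimpleCoder<T> {
  void encode(T value, OutputStream out) throws IOException;
  T decode(InputStream in) throws IOException;
}

final class LengthPrefixingCoder<T> implements SimpleCoder<T> {
  private final SimpleCoder<T> delegate;

  LengthPrefixingCoder(SimpleCoder<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public void encode(T value, OutputStream out) throws IOException {
    // Buffer the element to learn its byte count, then write count + bytes.
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    delegate.encode(value, buffer);
    byte[] bytes = buffer.toByteArray();
    new DataOutputStream(out).writeInt(bytes.length);
    out.write(bytes);
  }

  @Override
  public T decode(InputStream in) throws IOException {
    // Hand the delegate exactly the declared slice, so it may read
    // "as a stream" (until -1) without consuming the next element.
    DataInputStream data = new DataInputStream(in);
    int length = data.readInt();
    byte[] bytes = new byte[length];
    data.readFully(bytes);
    return delegate.decode(new ByteArrayInputStream(bytes));
  }
}
```

Note how a delegate that reads greedily until -1 still round-trips several
elements written on the same stream, which is exactly the property the
per-element byte count buys.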
>>
>>
>> The PR is available at https://github.com/apache/beam/pull/4594. If you
>> check it you will see I marked it "WIP". The main reason is that it
>> changes the encoding format for containers (lists, iterables, ...) and
>> therefore breaks the python/go/... tests and the standard_coders.yml
>> definition. Some help on that would be very welcome.
>>
>> A technical side note if you wonder: UnownedInputStream doesn't even
>> allow marking the stream, so there is no fast way to read it with standard
>> buffering strategies while still supporting the implicit automatic
>> IterableCoder wrapping. In other words, if Beam wants to support any
>> coder, including ones that do not write the size of their output - most
>> codecs - then we need to change the way it works to something like this,
>> which does the bookkeeping for a user who doesn't know their coder got
>> wrapped.
>>
>> Hope it makes sense, if not, don't hesitate to ask questions.
>>
>> Happy end of week-end.
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>
>


Re: Schema-Aware PCollections revisited

2018-02-04 Thread Romain Manni-Bucau
2018-02-04 17:37 GMT+01:00 Reuven Lax <re...@google.com>:

> I'm not sure where proto comes from here. Proto is one example of a type
> that has a schema, but only one example.
>
> 1. In the initial prototype I want to avoid modifying the PCollection API.
> So I think it's best to create a special SchemaCoder, and pass the schema
> into this coder. Later we might add targeted APIs for this instead of going
> through a coder.
> 1.a I don't see what hints have to do with this?
>

Hints are a way to replace the proposed new API and to unify how metadata
is passed in Beam, instead of adding a new custom mechanism each time.


>
> 2. BeamSQL already has a generic record type which fits this use case very
> well (though we might modify it). However as mentioned in the doc, the user
> is never forced to use this generic record type.
>
>
Well, yes and no. A type already exists, but 1. it is very strictly limited
(flat columns only, which covers very little of what big-data SQL can do)
and 2. it must be aligned with the convergence on generic data that the
schema work will bring (really read "aligned" as "dropped in favor of" -
deprecation being a smooth way to do it).

So, long story short, the main work of this schema track is not only using
schemas in runners and elsewhere, but also starting to make Beam consistent
with itself, which is probably the most important outcome since it is the
user-facing side of this work.


>
> On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau <rmannibu...@gmail.com
> > wrote:
>
>> @Reuven: is the proto only about passing schema or also the generic type?
>>
>> There are 2.5 topics to solve this issue:
>>
>> 1. How to pass schema
>> 1.a. hints?
>> 2. What is the generic record type associated to a schema and how to
>> express a schema relatively to it
>>
>> I would be happy to help on 1.a and 2 somehow if you need.
>>
>> On Feb 4, 2018 at 03:30, "Reuven Lax" <re...@google.com> wrote:
>>
>>> One more thing. If anyone here has experience with various OSS metadata
>>> stores (e.g. Kafka Schema Registry is one example), would you like to
>>> collaborate on implementation? I want to make sure that source schemas can
>>> be stored in a variety of OSS metadata stores, and be easily pulled into a
>>> Beam pipeline.
>>>
>>> Reuven
>>>
>>> On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <re...@google.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> If there are no concerns, I would like to start working on a prototype.
>>>> It's just a prototype, so I don't think it will have the final API (e.g.
>>>> for the prototype I'm going to avoid change the API of PCollection, and use
>>>> a "special" Coder instead). Also even once we go beyond prototype, it will
>>>> be @Experimental for some time, so the API will not be fixed in stone.
>>>>
>>>> Any more comments on this approach before we start implementing a
>>>> prototype?
>>>>
>>>> Reuven
>>>>
>>>> On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> If you need help on the json part I'm happy to help. To give a few
>>>>> hints on what is very doable: we can add an avro module to johnzon (asf
>>>>> json{p,b} impl) to back jsonp by avro (guess it will be one of the first 
>>>>> to
>>>>> be asked) for instance.
>>>>>
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>> <https://www.linkedin.com/in/rmannibucau>
>>>>>
>>>>> 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>>
>>>>>> Agree. The initial implementation will be a prototype.
>>>>>>
>>>>>> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <
>>>>>> j...@nanthrax.net> wrote:
>>>>>>
>>>>>>> Hi Reuven,
>>>>>>>
>>>>>>> Agree to be able to describe the schema with different format. The
>>>>>>> good point about json schemas is that they are described by a spec. My
>>>>>>> point is also to avoid reinventing the wheel. Just an abstract to be 
>>

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Romain Manni-Bucau
@Reuven: is the proto only about passing schema or also the generic type?

There are 2.5 topics to solve this issue:

1. How to pass schema
1.a. hints?
2. What is the generic record type associated to a schema and how to
express a schema relatively to it

I would be happy to help on 1.a and 2 somehow if you need.
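[Editor's note: to make topic 2 concrete, here is a minimal sketch of what
a schema-bound generic record could look like. The Schema and Row shapes
below are illustrative assumptions only, not the API Beam ended up with.]

```java
// Hypothetical generic record type tied to a schema: the schema names the
// fields, the row holds one value per field and resolves names to indices.
import java.util.ArrayList;
import java.util.List;

final class Schema {
  private final List<String> fieldNames;

  Schema(List<String> fieldNames) {
    this.fieldNames = new ArrayList<>(fieldNames);
  }

  int indexOf(String field) {
    return fieldNames.indexOf(field);
  }

  int size() {
    return fieldNames.size();
  }
}

final class Row {
  private final Schema schema;
  private final List<Object> values;

  Row(Schema schema, List<Object> values) {
    // The schema fully determines the shape: one value per declared field.
    if (values.size() != schema.size()) {
      throw new IllegalArgumentException("value count does not match schema");
    }
    this.schema = schema;
    this.values = new ArrayList<>(values);
  }

  Object get(String field) {
    int index = schema.indexOf(field);
    if (index < 0) {
      throw new IllegalArgumentException("no such field: " + field);
    }
    return values.get(index);
  }
}
```

Expressing "a schema relative to its record type" then amounts to the Row
only being constructible against a Schema, which is topic 2 above.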

On Feb 4, 2018 at 03:30, "Reuven Lax" <re...@google.com> wrote:

> One more thing. If anyone here has experience with various OSS metadata
> stores (e.g. Kafka Schema Registry is one example), would you like to
> collaborate on implementation? I want to make sure that source schemas can
> be stored in a variety of OSS metadata stores, and be easily pulled into a
> Beam pipeline.
>
> Reuven
>
> On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <re...@google.com> wrote:
>
>> Hi all,
>>
>> If there are no concerns, I would like to start working on a prototype.
>> It's just a prototype, so I don't think it will have the final API (e.g.
>> for the prototype I'm going to avoid change the API of PCollection, and use
>> a "special" Coder instead). Also even once we go beyond prototype, it will
>> be @Experimental for some time, so the API will not be fixed in stone.
>>
>> Any more comments on this approach before we start implementing a
>> prototype?
>>
>> Reuven
>>
>> On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> If you need help on the json part I'm happy to help. To give a few hints
>>> on what is very doable: we can add an avro module to johnzon (asf json{p,b}
>>> impl) to back jsonp by avro (guess it will be one of the first to be asked)
>>> for instance.
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau>
>>>
>>> 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
>>>
>>>> Agree. The initial implementation will be a prototype.
>>>>
>>>> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net
>>>> > wrote:
>>>>
>>>>> Hi Reuven,
>>>>>
>>>>> Agreed on being able to describe the schema with different formats. The
>>>>> good point about JSON schemas is that they are described by a spec. My
>>>>> point is also to avoid reinventing the wheel. Just an abstraction able
>>>>> to use Avro, JSON, Calcite, or custom schema descriptors would be great.
>>>>>
>>>>> Using a coder to describe a schema sounds like a smart move to implement
>>>>> quickly. However, it has to be clearly documented to avoid "side
>>>>> effects". I still think PCollection.setSchema() is better: it should
>>>>> be metadata (or a hint ;))) on the PCollection.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 31/01/2018 20:16, Reuven Lax wrote:
>>>>>
>>>>>> As to the question of how a schema should be specified, I want to
>>>>>> support several common schema formats. So if a user has a Json schema, or
>>>>>> an Avro schema, or a Calcite schema, etc. there should be adapters that
>>>>>> allow setting a schema from any of them. I don't think we should prefer 
>>>>>> one
>>>>>> over the other. While Romain is right that many people know Json, I think
>>>>>> far fewer people know Json schemas.
>>>>>>
>>>>>> Agree, schemas should not be enforced (for one thing, that wouldn't
>>>>>> be backwards compatible!). I think for the initial prototype I will
>>>>>> probably use a special coder to represent the schema (with setSchema an
>>>>>> option on the coder), largely because it doesn't require modifying
>>>>>> PCollection. However I think longer term a schema should be an optional
>>>>>> piece of metadata on the PCollection object. Similar to the previous
>>>>>> discussion about "hints," I think this can be set on the producing
>>>>>> PTransform, and a SetSchema PTransform will allow attaching a schema to 
>>>>>> any
>>>>>> PCollection (i.e. pc.apply(SetSchema.of(schema)))

Re: rename: BeamRecord -> Row

2018-02-03 Thread Romain Manni-Bucau
That is as true as the renaming not being strictly needed, so I guess the
PR owner will decide ;). Thanks for the clarification.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-03 18:36 GMT+01:00 Reuven Lax <re...@google.com>:

> Oh I agree 100%, however I'm just saying that we shouldn't ask the SQL
> effort to halt just because the schema effort overlaps. There's at least
> one other pending PR on this class (to do with automatic POJO generation).
>
> Also the name of the Record/Row class is somewhat independent of
> everything else in the schema discussion, and doesn't really need to block
> on that. That's why I started this thread: there was enough discussion on
> the PR itself that I felt the community should be aware, as I assume
> not everyone follows all PR discussions :)
>
> Reuven
>
> On Sat, Feb 3, 2018 at 9:00 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> I know, Reuven, but when you check what it does, it is exactly the same,
>> and the current work will be replaced by the schema work, so better to
>> avoid a round trip of work that will be thrown away in any case. Also note
>> that the current structure is flat and very limiting for modern SQL, so
>> aligning the two will benefit Beam in any case; better to ensure all parts
>> of the project move in the same direction instead of requiring yet another
>> conversion layer, no?
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>> 2018-02-03 16:32 GMT+01:00 Reuven Lax <re...@google.com>:
>>
>>> This is a core part of SQL which is ongoing.
>>>
>>> On Feb 2, 2018 11:45 PM, "Romain Manni-Bucau" <rmannibu...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> Shouldn't the discussion on schemas, which has a direct impact on this
>>>> generic container, be closed before any action on this?
>>>>
>>>>
>>>> On Feb 3, 2018 at 01:09, "Ankur Chauhan" <an...@malloc64.com> wrote:
>>>>
>>>>> ++
>>>>>
>>>>> On Fri, Feb 2, 2018 at 1:33 PM Rafael Fernandez <rfern...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Very strong +1
>>>>>>
>>>>>>
>>>>>> On Fri, Feb 2, 2018 at 1:24 PM Reuven Lax <re...@google.com> wrote:
>>>>>>
>>>>>>> We're looking at renaming the BeamRecord class
>>>>>>> <https://github.com/apache/beam/pull/4550>, that was used for
>>>>>>> columnar data. There was sufficient discussion on the naming, that I 
>>>>>>> want
>>>>>>> to make sure the dev list is aware of naming plans here.
>>>>>>>
>>>>>>> BeamRecord is a columnar, field-based record. Currently it's used by
>>>>>>> BeamSQL, and the plan is to use it for schemas as well. "Record" is a
>>>>>>> confusing name for this class, as all elements in the Beam model are
>>>>>>> referred to as "records," whether or not they have schemas. "Row" is a 
>>>>>>> much
>>>>>>> clearer name.
>>>>>>>
>>>>>>> There was a lot of discussion whether to name this BeamRow or just
>>>>>>> plain Row (in the org.apache.beam.values namespace). The argument in 
>>>>>>> favor
>>>>>>> of BeamRow was so that people aren't forced to qualify their type names 
>>>>>>> in
>>>>>>> the case of a conflict with a Row from another package. The argument in
>>>>>>> favor of Row was that it's a better name, it's in the Beam namespace
>>>>>>> anyway, and it's what the rest of the world (Cassandra, Hive, Spark, 
>>>>>>> etc.)
>>>>>>> calls similar classes.
>>>>>>>
>>>>>>> Right now consensus on the PR is leaning to Row. If you feel
>>>>>>> strongly, please speak up :)
>>>>>>>
>>>>>>> Reuven
>>>>>>>
>>>>>>
>>
>


Re: rename: BeamRecord -> Row

2018-02-03 Thread Romain Manni-Bucau
I know, Reuven, but when you check what it does, it is exactly the same,
and the current work will be replaced by the schema work, so better to
avoid a round trip of work that will be thrown away in any case. Also note
that the current structure is flat and very limiting for modern SQL, so
aligning the two will benefit Beam in any case; better to ensure all parts
of the project move in the same direction instead of requiring yet another
conversion layer, no?


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-03 16:32 GMT+01:00 Reuven Lax <re...@google.com>:

> This is a core part of SQL which is ongoing.
>
> On Feb 2, 2018 11:45 PM, "Romain Manni-Bucau" <rmannibu...@gmail.com>
> wrote:
>
>> Hi
>>
>> Shouldn't the discussion on schemas, which has a direct impact on this
>> generic container, be closed before any action on this?
>>
>>
>> On Feb 3, 2018 at 01:09, "Ankur Chauhan" <an...@malloc64.com> wrote:
>>
>>> ++
>>>
>>> On Fri, Feb 2, 2018 at 1:33 PM Rafael Fernandez <rfern...@google.com>
>>> wrote:
>>>
>>>> Very strong +1
>>>>
>>>>
>>>> On Fri, Feb 2, 2018 at 1:24 PM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> We're looking at renaming the BeamRecord class
>>>>> <https://github.com/apache/beam/pull/4550>, that was used for
>>>>> columnar data. There was sufficient discussion on the naming, that I want
>>>>> to make sure the dev list is aware of naming plans here.
>>>>>
>>>>> BeamRecord is a columnar, field-based record. Currently it's used by
>>>>> BeamSQL, and the plan is to use it for schemas as well. "Record" is a
>>>>> confusing name for this class, as all elements in the Beam model are
>>>>> referred to as "records," whether or not they have schemas. "Row" is a 
>>>>> much
>>>>> clearer name.
>>>>>
>>>>> There was a lot of discussion whether to name this BeamRow or just
>>>>> plain Row (in the org.apache.beam.values namespace). The argument in favor
>>>>> of BeamRow was so that people aren't forced to qualify their type names in
>>>>> the case of a conflict with a Row from another package. The argument in
>>>>> favor of Row was that it's a better name, it's in the Beam namespace
>>>>> anyway, and it's what the rest of the world (Cassandra, Hive, Spark, etc.)
>>>>> calls similar classes.
>>>>>
>>>>> Right now consensus on the PR is leaning to Row. If you feel strongly,
>>>>> please speak up :)
>>>>>
>>>>> Reuven
>>>>>
>>>>


Re: rename: BeamRecord -> Row

2018-02-02 Thread Romain Manni-Bucau
Hi

Shouldn't the discussion on schemas, which has a direct impact on this
generic container, be closed before any action on this?


On Feb 3, 2018 at 01:09, "Ankur Chauhan"  wrote:

> ++
>
> On Fri, Feb 2, 2018 at 1:33 PM Rafael Fernandez 
> wrote:
>
>> Very strong +1
>>
>>
>> On Fri, Feb 2, 2018 at 1:24 PM Reuven Lax  wrote:
>>
>>> We're looking at renaming the BeamRecord class
>>> , that was used for columnar
>>> data. There was sufficient discussion on the naming, that I want to make
>>> sure the dev list is aware of naming plans here.
>>>
>>> BeamRecord is a columnar, field-based record. Currently it's used by
>>> BeamSQL, and the plan is to use it for schemas as well. "Record" is a
>>> confusing name for this class, as all elements in the Beam model are
>>> referred to as "records," whether or not they have schemas. "Row" is a much
>>> clearer name.
>>>
>>> There was a lot of discussion whether to name this BeamRow or just plain
>>> Row (in the org.apache.beam.values namespace). The argument in favor of
>>> BeamRow was so that people aren't forced to qualify their type names in the
>>> case of a conflict with a Row from another package. The argument in favor
>>> of Row was that it's a better name, it's in the Beam namespace anyway, and
>>> it's what the rest of the world (Cassandra, Hive, Spark, etc.) calls
>>> similar classes.
>>>
>>> Right now consensus on the PR is leaning to Row. If you feel strongly,
>>> please speak up :)
>>>
>>> Reuven
>>>
>>


Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-02 Thread Romain Manni-Bucau
Well, we can disagree on the code - that's fine ;) - but the part of it
Beam needs is not huge, and in any case it can be forked without requiring
10 classes - and if it did, we would use another implementation than the
Guava one ;). This is the whole point.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-02 17:24 GMT+01:00 Reuven Lax <re...@google.com>:

> TypeToken is not trivial. I've written code to do what TypeToken does
> before (figuring out generic ancestor types). It's actually somewhat
> tricky, and the code I wrote had subtle bugs in it; eventually we removed
> this code in favor of Guava's implementation :)
>
> On Fri, Feb 2, 2018 at 7:47 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> Yep, note I never said to reinvent the wheel; we can copy it from Guava,
>> OpenWebBeans or any other implementation. The point was more to avoid
>> depending on something we don't own for what is, after all, not that much
>> code. I also think we can narrow it a lot to align with what Beam supports
>> (I'm thinking of coders here), but that can be another topic.
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>> 2018-02-02 16:33 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>
>>> On Fri, Feb 2, 2018 at 7:18 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> Don't forget Beam doesn't use much of it (mainly a few
>>>> ParameterizedType cases on the usage code path), so it is mostly about
>>>> handling parameterized types and type variables recursively. Not a lot of
>>>> work. The main concern is that it is in the API, and using a shade as an
>>>> API is never a good idea, in particular because the shade can break and
>>>> requires setting up clirr or similar, and when it breaks you are stuck
>>>> and need to fork it anyway. Limiting the dependencies of an API - as Beam
>>>> is one - is always saner, even if it requires a small fork of code.
>>>>
>>>
>>> The main thing that TypeToken is used for is capturing generics that are
>>> lost by Java reflection. It is a bit tricky, actually.
>>>
>>> Kenn
>>>
>>>
>>>
>>>>
>>>> Romain Manni-Bucau
>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>> <http://rmannibucau.wordpress.com> | Github
>>>> <https://github.com/rmannibucau> | LinkedIn
>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>
>>>> 2018-02-02 15:49 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>>>
>>>>> On Fri, Feb 2, 2018 at 6:41 AM, Romain Manni-Bucau <
>>>>> rmannibu...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> 2018-02-02 15:37 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>>>>>
>>>>>>> Another couple:
>>>>>>>
>>>>>>>  - User-facing TypeDescriptor is a very thin wrapper on Guava's
>>>>>>> TypeToken
>>>>>>>
>>>>>>
>>>>>> Technically reflect Type is enough
>>>>>>
>>>>>
>>>>> If you want to try to remove TypeToken from underneath TypeDescriptor,
>>>>> I have no objections as long as you expand the test suite to cover all the
>>>>> functionality where we trust TypeToken's tests. Good luck :-)
>>>>>
>>>>> Kenn
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>>  - ImmutableList and friends and their builders are very widely used
>>>>>>> and IMO still add a lot for readability and preventing

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-02 Thread Romain Manni-Bucau
Yep, note I never said to reinvent the wheel; we can copy it from Guava,
OpenWebBeans or any other implementation. The point was more to avoid
depending on something we don't own for what is, after all, not that much
code. I also think we can narrow it a lot to align with what Beam supports
(I'm thinking of coders here), but that can be another topic.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-02 16:33 GMT+01:00 Kenneth Knowles <k...@google.com>:

> On Fri, Feb 2, 2018 at 7:18 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> Don't forget Beam doesn't use much of it (mainly a few ParameterizedType
>> cases on the usage code path), so it is mostly about handling
>> parameterized types and type variables recursively. Not a lot of work. The
>> main concern is that it is in the API, and using a shade as an API is
>> never a good idea, in particular because the shade can break and requires
>> setting up clirr or similar, and when it breaks you are stuck and need to
>> fork it anyway. Limiting the dependencies of an API - as Beam is one - is
>> always saner, even if it requires a small fork of code.
>>
>
> The main thing that TypeToken is used for is capturing generics that are
> lost by Java reflection. It is a bit tricky, actually.
>
> Kenn
>
>
>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>> 2018-02-02 15:49 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>
>>> On Fri, Feb 2, 2018 at 6:41 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> 2018-02-02 15:37 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>>>
>>>>> Another couple:
>>>>>
>>>>>  - User-facing TypeDescriptor is a very thin wrapper on Guava's
>>>>> TypeToken
>>>>>
>>>>
>>>> Technically reflect Type is enough
>>>>
>>>
>>> If you want to try to remove TypeToken from underneath TypeDescriptor, I
>>> have no objections as long as you expand the test suite to cover all the
>>> functionality where we trust TypeToken's tests. Good luck :-)
>>>
>>> Kenn
>>>
>>>
>>>>
>>>>
>>>>>  - ImmutableList and friends and their builders are very widely used
>>>>> and IMO still add a lot for readability and preventing someone coming 
>>>>> along
>>>>> and adding mistakes to a codebase
>>>>>
>>>>
>>>> Sugar but not required. When you compare the cost of a shade versus
>>>> duplicating the parts we need, there is no real contest IMHO.
>>>>
>>>>
>>>>>
>>>>> So considering it all, I would keep a vendored Guava (but also move
>>>>> off where we can, and also have our own improvements). I hope it will be a
>>>>> near-empty build file to generate it so not a maintenance burden.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles <k...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Nice. This sounds like a great idea (or two?) and goes along with
>>>>>> what I just started for futures.
>>>>>>
>>>>>> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
>>>>>> assigned to Ismaël :-) and converted my futures thing to a subtask.
>>>>>>
>>>>>> Specific things for our micro guava:
>>>>>>
>>>>>>  - checkArgumentNotNull can throw the right exception
>>>>>>  - our own Optional because Java's is not Serializable
>>>>>>  - futures combinators since many are missing, especially Java's
>>>>>> don't do exceptions right
>>>>>>
>>>>>> Protobuf: didn't file a
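[Editor's note: the checkArgumentNotNull item in the list quoted above
could be sketched as follows - a hypothetical helper, assuming the intent
is to throw IllegalArgumentException for bad arguments where Guava's
checkNotNull throws NullPointerException.]

```java
// Hypothetical "micro guava" argument check: validates a parameter and
// throws IllegalArgumentException (the "right exception" for argument
// validation) instead of NullPointerException.
final class Preconditions2 {
  private Preconditions2() {}

  // Returns the value so the check can be inlined into assignments.
  static <T> T checkArgumentNotNull(T value, String message) {
    if (value == null) {
      throw new IllegalArgumentException(message);
    }
    return value;
  }
}
```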

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-02 Thread Romain Manni-Bucau
Don't forget Beam doesn't rely on much behind it (mainly only a few
ParameterizedType cases due to the usage code path), so it is mostly about
handling parameterized types and type variables recursively. Not a lot of
work. The main concern is that it is in the API, and using a shade as an API
is never a good idea, in particular because the shade can break, which
requires setting up clirr or similar tooling, and once it breaks you are
stuck and need to fork it anyway. Limiting the dependencies of an API - as
Beam is - is always saner, even if it requires a small fork of code.
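A minimal sketch (not Beam code; all names here are illustrative) of what "handling parameterized types and type variables recursively" with plain java.lang.reflect.Type looks like, without Guava's TypeToken:

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.lang.reflect.TypeVariable;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class TypeWalker {
    // Renders a Type, resolving type variables against a binding map.
    static String describe(Type type, Map<TypeVariable<?>, Type> bindings) {
        if (type instanceof TypeVariable) {
            Type bound = bindings.get(type);
            return bound != null ? describe(bound, bindings) : type.getTypeName();
        }
        if (type instanceof ParameterizedType) {
            ParameterizedType pt = (ParameterizedType) type;
            StringBuilder sb =
                new StringBuilder(((Class<?>) pt.getRawType()).getName()).append('<');
            Type[] args = pt.getActualTypeArguments();
            for (int i = 0; i < args.length; i++) {
                if (i > 0) {
                    sb.append(", ");
                }
                sb.append(describe(args[i], bindings)); // recurse into the type arguments
            }
            return sb.append('>').toString();
        }
        // Class, GenericArrayType, WildcardType: the default name is enough here
        return type.getTypeName();
    }

    public static void main(String[] args) throws Exception {
        // The usual trick to capture List<String> through a field, no TypeToken needed
        class Holder { List<String> field; }
        Type t = Holder.class.getDeclaredField("field").getGenericType();
        System.out.println(describe(t, Collections.emptyMap()));
        // prints: java.util.List<java.lang.String>
    }
}
```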


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-02 15:49 GMT+01:00 Kenneth Knowles <k...@google.com>:

> On Fri, Feb 2, 2018 at 6:41 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>>
>>
>> 2018-02-02 15:37 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>
>>> Another couple:
>>>
>>>  - User-facing TypeDescriptor is a very thin wrapper on Guava's TypeToken
>>>
>>
>> Technically reflect Type is enough
>>
>
> If you want to try to remove TypeToken from underneath TypeDescriptor, I
> have no objections as long as you expand the test suite to cover all the
> functionality where we trust TypeToken's tests. Good luck :-)
>
> Kenn
>
>
>>
>>
>>>  - ImmutableList and friends and their builders are very widely used and
>>> IMO still add a lot for readability and preventing someone coming along and
>>> adding mistakes to a codebase
>>>
>>
>> Sugar but not required. When you compare the cost of a shade versus of
>> duplicating the parts we need there is no real match IMHO.
>>
>>
>>>
>>> So considering it all, I would keep a vendored Guava (but also move off
>>> where we can, and also have our own improvements). I hope it will be a
>>> near-empty build file to generate it so not a maintenance burden.
>>>
>>> Kenn
>>>
>>> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles <k...@google.com> wrote:
>>>
>>>> Nice. This sounds like a great idea (or two?) and goes along with what
>>>> I just started for futures.
>>>>
>>>> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
>>>> assigned to Ismaël :-) and converted my futures thing to a subtask.
>>>>
>>>> Specific things for our micro guava:
>>>>
>>>>  - checkArgumentNotNull can throw the right exception
>>>>  - our own Optional because Java's is not Serializable
>>>>  - futures combinators since many are missing, especially Java's don't
>>>> do exceptions right
>>>>
>>>> Protobuf: didn't file an issue because I'm not sure
>>>>
>>>> I was wondering if pre-shading works. We really need to get rid of it
>>>> from public APIs in a 100% reliable way. It is also a problem for Dataflow.
>>>> I was wondering if one approach is to pre-shade gcpio-protobuf-java,
>>>> gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to take
>>>> a Message object. (and do the same for beam-model-protobuf-java since the
>>>> model bits have to depend on each other but nothing else).
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>
>>>>> Huge +1 to get rid of Guava!
>>>>>
>>>>> This solves annoying dependency issues for some IOs and allow us to
>>>>> get rid of the shading that makes current jars bigger than they should
>>>>> be.
>>>>>
>>>>> We can create our own 'micro guava' package with some classes for
>>>>> things that are hard to migrate, or that we  prefer to still have like
>>>>> the check* methods for example. Given the size of the task we should
>>>>> probably divide it into subtasks, more important is to get rid of it
>>>>> for 'sdks/java/core'. We can then attack other areas afterwards.
>>>>>
>>>>> Other important idea would be to get rid of Protobuf in public APIs
>>>>> like GCPIO and to better shade it from leaking into the runners. An
>>>>> unexpected side effect of this is a leak o

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-02 Thread Romain Manni-Bucau
2018-02-02 15:37 GMT+01:00 Kenneth Knowles <k...@google.com>:

> Another couple:
>
>  - User-facing TypeDescriptor is a very thin wrapper on Guava's TypeToken
>

Technically reflect Type is enough


>  - ImmutableList and friends and their builders are very widely used and
> IMO still add a lot for readability and preventing someone coming along and
> adding mistakes to a codebase
>

Sugar but not required. When you compare the cost of a shade versus of
duplicating the parts we need there is no real match IMHO.


>
> So considering it all, I would keep a vendored Guava (but also move off
> where we can, and also have our own improvements). I hope it will be a
> near-empty build file to generate it so not a maintenance burden.
>
> Kenn
>
> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles <k...@google.com> wrote:
>
>> Nice. This sounds like a great idea (or two?) and goes along with what I
>> just started for futures.
>>
>> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
>> assigned to Ismaël :-) and converted my futures thing to a subtask.
>>
>> Specific things for our micro guava:
>>
>>  - checkArgumentNotNull can throw the right exception
>>  - our own Optional because Java's is not Serializable
>>  - futures combinators since many are missing, especially Java's don't do
>> exceptions right
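Two of the "micro guava" items above can be sketched in a few lines. This is illustrative only; the class and method names are hypothetical, not Beam's actual API:

```java
import java.io.Serializable;
import java.util.NoSuchElementException;

public final class MicroGuava {
    // Unlike Guava's checkNotNull (which throws NullPointerException), an
    // argument check can throw IllegalArgumentException, the "right" exception.
    public static <T> T checkArgumentNotNull(T value, String message) {
        if (value == null) {
            throw new IllegalArgumentException(message);
        }
        return value;
    }

    // A Serializable Optional, since java.util.Optional is not Serializable.
    public static final class SerializableOptional<T extends Serializable>
            implements Serializable {
        private final T value; // null means absent

        private SerializableOptional(T value) { this.value = value; }

        public static <T extends Serializable> SerializableOptional<T> of(T value) {
            return new SerializableOptional<>(checkArgumentNotNull(value, "value"));
        }

        public static <T extends Serializable> SerializableOptional<T> absent() {
            return new SerializableOptional<>(null);
        }

        public boolean isPresent() { return value != null; }

        public T get() {
            if (value == null) {
                throw new NoSuchElementException("absent");
            }
            return value;
        }
    }
}
```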
>>
>> Protobuf: didn't file an issue because I'm not sure
>>
>> I was wondering if pre-shading works. We really need to get rid of it
>> from public APIs in a 100% reliable way. It is also a problem for Dataflow.
>> I was wondering if one approach is to pre-shade gcpio-protobuf-java,
>> gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to take
>> a Message object. (and do the same for beam-model-protobuf-java since the
>> model bits have to depend on each other but nothing else).
>>
>> Kenn
>>
>> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía <ieme...@gmail.com> wrote:
>>
>>> Huge +1 to get rid of Guava!
>>>
>>> This solves annoying dependency issues for some IOs and allow us to
>>> get rid of the shading that makes current jars bigger than they should
>>> be.
>>>
>>> We can create our own 'micro guava' package with some classes for
>>> things that are hard to migrate, or that we  prefer to still have like
>>> the check* methods for example. Given the size of the task we should
>>> probably divide it into subtasks, more important is to get rid of it
>>> for 'sdks/java/core'. We can then attack other areas afterwards.
>>>
>>> Other important idea would be to get rid of Protobuf in public APIs
>>> like GCPIO and to better shade it from leaking into the runners. An
>>> unexpected side effect of this is a leak of netty via gRPC/protobuf
>>> that is biting us for the Spark runner, but well that's worth a
>>> different discussion.
>>>
>>>
>>> On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
>>> <rmannibu...@gmail.com> wrote:
>>> > a map of list is fine and not a challenge we'll face long I hope ;)
>>> >
>>> >
>>> > Romain Manni-Bucau
>>> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>>> >
>>> > 2018-02-01 9:40 GMT+01:00 Reuven Lax <re...@google.com>:
>>> >>
>>> >> Not sure we'll be able to replace them all. Things like guava Table
>>> and
>>> >> Multimap don't have great replacements in Java8.
>>> >>
>>> >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <
>>> j...@nanthrax.net>
>>> >> wrote:
>>> >>>
>>> >>> +1, it was on my TODO for a while waiting the Java8 update.
>>> >>>
>>> >>> Regards
>>> >>> JB
>>> >>>
>>> >>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>>> >>> > Why not dropping guava for all beam codebase? With java 8 it is
>>> quite
>>> >>> > easy to do
>>> >>> > it and avoid a bunch of conflicts. Did it in 2 projects with quite
>>> a
>>> >>> > good result.
>>> >>> >
>>> >>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lc...@google.com
>>> >>> > <mailto:lc...@google.com>> a écrit :
>>> >>> >
>>> >>> > Make sure to include the guava version in the artifact name so
>>> tha

Re: [PROPOSAL] Switch from Guava futures vs Java 8 futures

2018-02-01 Thread Romain Manni-Bucau
+1 indeed

On 1 Feb 2018 21:34, "Eugene Kirpichov" wrote:

> Reducing dependency on Guava in favor of something Java-standard sounds
> great, +1.
>
> On Thu, Feb 1, 2018 at 11:53 AM Reuven Lax  wrote:
>
>> Unless there's something that doesn't work in Java 8 future, +1 to
>> migrating.
>>
>> On Thu, Feb 1, 2018 at 10:54 AM, Kenneth Knowles  wrote:
>>
>>> Hi all,
>>>
>>> Luke, Thomas, and I had some in-person discussions about the use of Java
>>> 8 futures and Guava futures in the portability support code. I wanted to
>>> bring our thoughts to the dev list for feedback.
>>>
>>> As background:
>>>
>>>  - Java 5+ "Future" lacks the main purpose of future, which is async
>>> chaining.
>>>  - Guava introduced ListenableFuture to do real future-oriented
>>> programming
>>>  - Java 8 added CompletionStage which is more-or-less the expected
>>> interface
>>>
>>> It is still debatable whether Java got it right [1]. But since it is
>>> standardized, doesn't need to be shaded, etc, it is worth trying to just
>>> use it carefully in the right ways. So we thought to propose that we
>>> migrate most uses of Guava futures to Java 8 futures.
>>>
>>> What do you think? Have we missed an important problem that would make
>>> this a deal-breaker?
>>>
>>> Kenn
>>>
>>> [1] e.g. https://stackoverflow.com/questions/38744943/
>>> listenablefuture-vs-completablefuture#comment72041244_39250452 and such
>>> discussions are likely to occur whenever you bring it up with someone who
>>> cares a lot about futures :-)
>>>
>>
>>
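The chaining gap described above can be seen in a small sketch using only the JDK (no Guava):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

public class FuturesDemo {
    public static void main(String[] args) throws Exception {
        // Java 5 Future: no chaining; the only way to consume it is to block.
        Future<Integer> old = CompletableFuture.completedFuture(20);
        int blocked = old.get() + 1; // get() blocks the calling thread

        // Java 8 CompletionStage/CompletableFuture: async chaining plus
        // exception handling without wrapping in ExecutionException.
        int chained = CompletableFuture.completedFuture(20)
            .thenApply(x -> x * 2)   // async combinator, runs when the value is ready
            .exceptionally(t -> -1)  // recover in-line instead of catching later
            .get();                  // used here only to observe the result

        System.out.println(blocked + " " + chained); // 21 40
    }
}
```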


Re: drop scala....version from artifact ;)

2018-02-01 Thread Romain Manni-Bucau
Flink, Gearpump, Spark, and GCE provisioning are affected by this "issue".
Dropping it if we never manage two versions is nicer for end users IMHO, but
I'm fine keeping it. I just want to ensure it is uniform across the whole
project.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-02-01 14:58 GMT+01:00 Aljoscha Krettek <aljos...@apache.org>:

> I think Kafka IO doesn't have a transitive Scala dependency anymore
> because Kafka removed that from their client code a while ago.
>
> Best,
> Aljoscha
>
> > On 1. Feb 2018, at 14:48, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >
> > I got your point Aljoscha. Flink runner is the only module using this
> suffix.
> >
> > Spark runner, Kafka IO, and others also have a scala dep but don't use
> the suffix.
> >
> > So, we have three options:
> > 1. We leave as it is right now
> > 2. We remove suffix from Flink runner
> > 3. We add suffix to other modules (Spark runner, Kafka IO, ...)
> >
> > Thoughts ?
> >
> > I'm OK to stay on 1 for now.
> >
> > Regards
> > JB
> >
> > On 02/01/2018 02:45 PM, Aljoscha Krettek wrote:
> >> I think it's not wise to remove the Scala suffix. When using the Flink
> Runner you have to make sure that the Scala version matches the Scala
> version of the Flink Cluster. And I think comparing the suffix of your
> flink-runner dependency and the suffix of your Flink dist is an easy way of
> doing that.
> >>
> >>
> >>> On 31. Jan 2018, at 16:55, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
> >>>
> >>> Hi Romain,
> >>>
> >>> AFAIR only Flink runner uses scala version in the artifactId.
> >>>
> >>> +1 for me.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On 01/31/2018 04:45 PM, Romain Manni-Bucau wrote:
> >>>> Hi guys
> >>>>
> >>>> since beam supports a single version of runners why not dropping the
> scala
> >>>> version from the artifactId?
> >>>>
> >>>> ATM upgrades are painful cause you upgrade beam version+ runner
> artifactIds.
> >>>>
> >>>> wdyt?
> >>>>
> >>>> Romain Manni-Bucau
> >>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> >>>> <https://rmannibucau.metawerx.net/> | Old Blog
> >>>> <http://rmannibucau.wordpress.com> | Github <https://github.com/
> rmannibucau> |
> >>>> LinkedIn <https://www.linkedin.com/in/rmannibucau>
> >>>
> >>> --
> >>> Jean-Baptiste Onofré
> >>> jbono...@apache.org
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
> >>
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
>
>


Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-01 Thread Romain Manni-Bucau
a map of lists is fine and not a challenge we'll face for long, I hope ;)
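The "map of lists" replacement for Guava's Multimap is straightforward with Java 8's computeIfAbsent; a sketch (it covers the common put/get use, not the full Multimap contract):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultimapDemo {
    public static void main(String[] args) {
        // Guava: Multimap<String, Integer> m = ArrayListMultimap.create(); m.put("a", 1);
        // Plain Java 8 equivalent, enough for many Multimap use cases:
        Map<String, List<Integer>> m = new HashMap<>();
        m.computeIfAbsent("a", k -> new ArrayList<>()).add(1);
        m.computeIfAbsent("a", k -> new ArrayList<>()).add(2);
        m.computeIfAbsent("b", k -> new ArrayList<>()).add(3);

        System.out.println(m.get("a")); // [1, 2]
    }
}
```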


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-02-01 9:40 GMT+01:00 Reuven Lax <re...@google.com>:

> Not sure we'll be able to replace them all. Things like guava Table and
> Multimap don't have great replacements in Java8.
>
> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> +1, it was on my TODO for a while waiting the Java8 update.
>>
>> Regards
>> JB
>>
>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>> > Why not dropping guava for all beam codebase? With java 8 it is quite
>> easy to do
>> > it and avoid a bunch of conflicts. Did it in 2 projects with quite a
>> good result.
>> >
>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" <lc...@google.com
>> > <mailto:lc...@google.com>> a écrit :
>> >
>> > Make sure to include the guava version in the artifact name so that
>> we can
>> > have multiple vendored versions.
>> >
>> > On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <k...@google.com
>> > <mailto:k...@google.com>> wrote:
>> >
>> > I didn't have time for this, but it just bit me. We definitely
>> have
>> > Guava on the API surface of runner support code in ways that get
>> > incompatibly shaded. I will probably start "1a" by making a
>> shaded
>> > library org.apache.beam:vendored-guava and starting to use it.
>> It sounds
>> > like there is generally unanimous support for that much, anyhow.
>> >
>> > Kenn
>> >
>> > On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek <
>> aljos...@apache.org
>> > <mailto:aljos...@apache.org>> wrote:
>> >
>> > Thanks Ismaël for bringing up this discussion again!
>> >
>> > I would be in favour of 1) and more specifically of 1a)
>> >
>> > Aljoscha
>> >
>> >
>> >> On 12. Dec 2017, at 18:56, Lukasz Cwik <lc...@google.com
>> >> <mailto:lc...@google.com>> wrote:
>> >>
>> >> You can always run tests on post shaded artifacts instead
>> of the
>> >> compiled classes, it just requires us to change our maven
>> surefire
>> >> / gradle test configurations but it is true that most IDEs
>> would
>> >> behave better with a dependency jar unless you delegate
>> all the
>> >> build/test actions to the build system and then it won't
>> matter.
>> >>
>> >> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <
>> k...@google.com
>> >> <mailto:k...@google.com>> wrote:
>> >>
>> >> There's also, with additional overhead,
>> >>
>> >> 1a) A relocated and shipped package for each thing we
>> want to
>> >> relocate. I think this has also been tried outside
>> Beam...
>> >>
>> >> Pros:
>> >> * all the pros of 1) plus no bloat beyond what is
>> necessary
>> >> Cons:
>> >> * abandons whitelist approach for public deps,
>> reverting to
>> >> blacklist approach for trouble things like guava, so a
>> bit
>> >> less principled
>> >>
>> >> For both 1) and 1a) I would add:
>> >>
>> >> Pros:
>> >> * clearly readable dependency since code will `import
>> >> org.apache.beam.private.guava21` and IDEs will
>> understand it
>> >> is a distinct lilbrary
>> >> * can run tests on unpackaged classes, as long as deps
>> are
>> >> shaded or provided as jars
>> >> * no mysterious action at a distance from inherited
>> configuration
>> >> Cons:
>> >> * need to adjust imports
>> >>

Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-31 Thread Romain Manni-Bucau
Yep, but it adds one more step to work in Beam - don't forget you have
already passed like 10 steps by the time you end up on coders.

My view was that skipping the first close was a win-win for Beam and users,
but technically you are right: users can always do it themselves.

On 1 Feb 2018 07:16, "Lukasz Cwik" <lc...@google.com> wrote:

> I'm not sure what you mean by it closes the door since as the caller of
> the library you can create a wrapper filter input stream that ignores close
> calls effectively overriding what happens in the UnownedInputStream.
>
> On Wed, Jan 31, 2018 at 10:08 PM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>>
>>
>> Le 1 févr. 2018 03:10, "Lukasz Cwik" <lc...@google.com> a écrit :
>>
>> Yes, people will write bad coders which is why this is there. No, I don't
>> think swallowing one close is what we want.
>>
>> In the case where you wants to pass an input/output stream to a library
>> that incorrectly assumes ownership, the input/output stream should be
>> wrapped right before the call to the library with a filter input/output
>> stream that swallows the close and not propagate ignoring this bad behavior
>> elsewhere.
>>
>>
>> Hmm,
>>
>> Elsewhere is nowhere else here since it wouldnt have any side effect
>> except not enforcing another layer and making smoothly work for most
>> mappers.
>>
>> Anyway I can live with it but I'm a bit sad it closes the door to the
>> easyness to write extensions.
>>
>>
>> On Wed, Jan 31, 2018 at 12:04 PM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> Hmm, here we are the ones owning the call since it is in a coder, no? Do
>>> we assume people will badly implement coders? In this particular case we
>>> can assume close() will be called by a framework I think.
>>> What about swallowing one close() and fail on the second?
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau>
>>>
>>> 2018-01-31 20:59 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>>>
>>>> Because people write code like:
>>>> myMethod(InputStream in) {
>>>>   InputStream child = new BufferedInputStream(in);
>>>>   child.close();
>>>> }
>>>>
>>>> InputStream in = new FileInputStream(... path ...);
>>>> myMethod(in);
>>>> myMethod(in);
>>>>
>>>> An exception will be thrown when the second myMethod call occurs.
>>>>
>>>> Unfortunately not everyone wraps their calls to a coder with an
>>>> UnownedInputStream or a filter input stream which drops close() calls is
>>>> why its a problem and in the few places it is done it is used to prevent
>>>> bugs from creeping in.
>>>>
>>>>
>>>>
>>>> On Tue, Jan 30, 2018 at 11:29 AM, Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> I get the issue but I don't get the last part. Concretely we can
>>>>> support any lib by just removing the exception in the close, no? What 
>>>>> would
>>>>> be the issue? No additional wrapper, no lib integration issue.
>>>>>
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>> <https://www.linkedin.com/in/rmannibucau>
>>>>>
>>>>> 2018-01-30 19:29 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>>>>>
>>>>>> Its common in the code base that input and output streams are passed
>>>>>> around and the caller is responsible for closing it, not the callee. The
>>>>>> UnownedInputStream is to guard against libraries that are poorly behaved
>>>>>> and assume they get ownership of the stream when it is given to them.
>>>>>>
>>>>>> In the code:
>>>>>> myMethod(InputStream in) {
>>>>>>   InputStream child = new BufferedInputStream(in);
>>>>>

Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-31 Thread Romain Manni-Bucau
On 1 Feb 2018 03:10, "Lukasz Cwik" <lc...@google.com> wrote:

Yes, people will write bad coders which is why this is there. No, I don't
think swallowing one close is what we want.

In the case where you wants to pass an input/output stream to a library
that incorrectly assumes ownership, the input/output stream should be
wrapped right before the call to the library with a filter input/output
stream that swallows the close and not propagate ignoring this bad behavior
elsewhere.


Hmm,

Elsewhere is nowhere else here, since it wouldn't have any side effect except
not enforcing another layer, and it would make things work smoothly for most
mappers.

Anyway, I can live with it, but I'm a bit sad it closes the door on the ease
of writing extensions.
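The wrapper suggested above — the caller shields its stream before handing it to a close-happy library — can be sketched like this (illustrative names; Beam's JAXBCoder linked below in the thread does the same thing):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CloseShieldDemo {
    // A FilterInputStream whose close() is a no-op, so a library that
    // "helpfully" closes the stream it was given cannot close ours.
    static final class CloseIgnoringInputStream extends FilterInputStream {
        CloseIgnoringInputStream(InputStream in) { super(in); }
        @Override public void close() { /* swallow: the caller owns the stream */ }
    }

    // Stand-in for a badly behaved library method that closes its argument.
    static int readOneAndClose(InputStream in) throws IOException {
        int b = in.read();
        in.close();
        return b;
    }

    public static void main(String[] args) throws IOException {
        InputStream owned = new ByteArrayInputStream(new byte[] {1, 2});
        int first = readOneAndClose(new CloseIgnoringInputStream(owned));
        int second = owned.read(); // the underlying stream was never closed
        System.out.println(first + " " + second); // 1 2
    }
}
```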


On Wed, Jan 31, 2018 at 12:04 PM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> Hmm, here we are the ones owning the call since it is in a coder, no? Do
> we assume people will badly implement coders? In this particular case we
> can assume close() will be called by a framework I think.
> What about swallowing one close() and fail on the second?
>
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau>
>
> 2018-01-31 20:59 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>
>> Because people write code like:
>> myMethod(InputStream in) {
>>   InputStream child = new BufferedInputStream(in);
>>   child.close();
>> }
>>
>> InputStream in = new FileInputStream(... path ...);
>> myMethod(in);
>> myMethod(in);
>>
>> An exception will be thrown when the second myMethod call occurs.
>>
>> Unfortunately not everyone wraps their calls to a coder with an
>> UnownedInputStream or a filter input stream which drops close() calls is
>> why its a problem and in the few places it is done it is used to prevent
>> bugs from creeping in.
>>
>>
>>
>> On Tue, Jan 30, 2018 at 11:29 AM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> I get the issue but I don't get the last part. Concretely we can support
>>> any lib by just removing the exception in the close, no? What would be the
>>> issue? No additional wrapper, no lib integration issue.
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau>
>>>
>>> 2018-01-30 19:29 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>>>
>>>> Its common in the code base that input and output streams are passed
>>>> around and the caller is responsible for closing it, not the callee. The
>>>> UnownedInputStream is to guard against libraries that are poorly behaved
>>>> and assume they get ownership of the stream when it is given to them.
>>>>
>>>> In the code:
>>>> myMethod(InputStream in) {
>>>>   InputStream child = new BufferedInputStream(in);
>>>>   child.close();
>>>> }
>>>>
>>>> InputStream in = ...
>>>> myMethod(in);
>>>> myMethod(in);
>>>> When should "in" be closed?
>>>>
>>>> To get around this issue, create a filter input/output stream that
>>>> ignores close calls like on the JAXB coder:
>>>> https://github.com/apache/beam/blob/master/sdks/java/io/xml/
>>>> src/main/java/org/apache/beam/sdk/io/xml/JAXBCoder.java#L181
>>>>
>>>> We can instead swap around this pattern so that the caller guards
>>>> against callees closing by wrapping with a filter input/output stream but
>>>> this costs an additional method call for each input/output stream call.
>>>>
>>>>
>>>> On Tue, Jan 30, 2018 at 10:04 AM, Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> All is in the subject ;)
>>>>>
>>>>> Rational is to support any I/O library and not fail when the close is
>>>>> encapsulated.
>>>>>
>>>>> Any blocker to swallow this close call?
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>> <https://www.linkedin.com/in/rmannibucau>
>>>>>
>>>>
>>>>
>>>
>>
>


Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Romain Manni-Bucau
@ismael: any vote can be changed from -1 to +1 (or +/-0) without additional
delay

On 1 Feb 2018 03:15, "Lukasz Cwik" wrote:

> Note that a user reported TextIO being broken on Flink.
> Thread is here: https://lists.apache.org/thread.html/
> 47b16c94032392782505415e010970fd2a9480891c55c2f7b5de92bd@%
> 3Cuser.beam.apache.org%3E
> Can someone confirm/refute?
>
> On Wed, Jan 31, 2018 at 3:36 PM, Konstantinos Katsiapis <
> katsia...@google.com> wrote:
>
>> +1 (non-binding). tensorflow.transform 0.5.0 is blocked on Apache
>> Beam 2.3
>>
>>
>> On Wed, Jan 31, 2018 at 5:59 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1 (binding)
>>>
>>> Casting my own +1 ;)
>>>
>>> Regards
>>> JB
>>>
>>> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
>>> > Hi everyone,
>>> >
>>> > Please review and vote on the release candidate #1 for the version
>>> 2.3.0, as
>>> > follows:
>>> >
>>> > [ ] +1, Approve the release
>>> > [ ] -1, Do not approve the release (please provide specific comments)
>>> >
>>> >
>>> > The complete staging area is available for your review, which includes:
>>> > * JIRA release notes [1],
>>> > * the official Apache source release to be deployed to dist.apache.org
>>> [2],
>>> > which is signed with the key with fingerprint C8282E76 [3],
>>> > * all artifacts to be deployed to the Maven Central Repository [4],
>>> > * source code tag "v2.3.0-RC1" [5],
>>> > * website pull request listing the release and publishing the API
>>> reference
>>> > manual [6].
>>> > * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
>>> > * Python artifacts are deployed along with the source release to the
>>> > dist.apache.org [2].
>>> >
>>> > The vote will be open for at least 72 hours. It is adopted by majority
>>> approval,
>>> > with at least 3 PMC affirmative votes.
>>> >
>>> > Thanks,
>>> > JB
>>> >
>>> > [1]
>>> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341608
>>> > [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
>>> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> > [4] https://repository.apache.org/content/repositories/orgapache
>>> beam-1026/
>>> > [5] https://github.com/apache/beam/tree/v2.3.0-RC1
>>> > [6] https://github.com/apache/beam-site/pull/381
>>> >
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>>
>>
>> --
>> Gus Katsiapis | Software Engineer | katsia...@google.com | 650-918-7487
>> <(650)%20918-7487>
>>
>
>


Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-01-31 Thread Romain Manni-Bucau
Why not drop Guava for the whole Beam codebase? With Java 8 it is quite easy
to do and avoids a bunch of conflicts. I did it in two projects with quite
good results.
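A few of the mechanical replacements such a migration involves; a sketch with Java 8 only, no Guava on the classpath (the commented Guava calls are the idioms being replaced):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

public class GuavaToJava8 {
    public static void main(String[] args) {
        // Guava: ImmutableList.of("a", "b")
        List<String> list = Collections.unmodifiableList(Arrays.asList("a", "b"));

        // Guava: Lists.transform(list, f)
        List<String> upper =
            list.stream().map(String::toUpperCase).collect(Collectors.toList());

        // Guava: Joiner.on(", ").join(list)
        String joined = String.join(", ", list);

        // Guava: Optional.fromNullable(x).or("fallback")
        String value = Optional.ofNullable((String) null).orElse("fallback");

        System.out.println(upper + " | " + joined + " | " + value);
        // prints: [A, B] | a, b | fallback
    }
}
```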

On 1 Feb 2018 06:50, "Lukasz Cwik" wrote:

> Make sure to include the guava version in the artifact name so that we can
> have multiple vendored versions.
>
> On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles  wrote:
>
>> I didn't have time for this, but it just bit me. We definitely have Guava
>> on the API surface of runner support code in ways that get incompatibly
>> shaded. I will probably start "1a" by making a shaded library
>> org.apache.beam:vendored-guava and starting to use it. It sounds like there
>> is generally unanimous support for that much, anyhow.
>>
>> Kenn
>>
>> On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek 
>> wrote:
>>
>>> Thanks Ismaël for bringing up this discussion again!
>>>
>>> I would be in favour of 1) and more specifically of 1a)
>>>
>>> Aljoscha
>>>
>>>
>>> On 12. Dec 2017, at 18:56, Lukasz Cwik  wrote:
>>>
>>> You can always run tests on post shaded artifacts instead of the
>>> compiled classes, it just requires us to change our maven surefire / gradle
>>> test configurations but it is true that most IDEs would behave better with
>>> a dependency jar unless you delegate all the build/test actions to the
>>> build system and then it won't matter.
>>>
>>> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles  wrote:
>>>
 There's also, with additional overhead,

 1a) A relocated and shipped package for each thing we want to relocate.
 I think this has also been tried outside Beam...

 Pros:
 * all the pros of 1) plus no bloat beyond what is necessary
 Cons:
 * abandons whitelist approach for public deps, reverting to blacklist
 approach for trouble things like guava, so a bit less principled

 For both 1) and 1a) I would add:

 Pros:
 * clearly readable dependency since code will `import
 org.apache.beam.private.guava21` and IDEs will understand it is a
 distinct lilbrary
 * can run tests on unpackaged classes, as long as deps are shaded or
 provided as jars
 * no mysterious action at a distance from inherited configuration
 Cons:
 * need to adjust imports

 Kenn

 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik  wrote:

> I would suggest that either we use:
> 1) A common deps package containing shaded dependencies allows for
> Pros
> * doesn't require the user to build an uber jar
> Risks
> * dependencies package will keep growing even if something is or isn't
> needed by all of Apache Beam leading to a large jar anyways negating any
> space savings
>
> 2) Shade within each module to a common location like
> org.apache.beam.relocated.guava
> Pros
> * you only get the shaded dependencies of the things that are required
> * its one less dependency for users to manage
> Risks
> * requires an uber jar to be built to get the space savings (either by
> a user or a distribution of Apache Beam) otherwise we negate any space
> savings.
>
> If we either use a common relocation scheme or a dependencies jar then
> each relocation should specifically contain the version number of the
> package because we would like to allow for us to be using two different
> versions of the same library.
>
> For the common deps package approach, should we check in code where
> the imports contain the relocated location (e.g. import
> org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)?
>
>
> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré  > wrote:
>
>> Thanks for bringing that back.
>>
>> Indeed guava is shaded in different uber-jar. Maybe we can have a
>> common deps module that we include once (but the user will have to
>> explicitly define the dep) ?
>>
>> Basically, what do you propose for protobuf (unfortunately, I don't
>> see an obvious) ?
>>
>> Regards
>> JB
>>
>>
>> On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>>
>>> Hello, I wanted to bring back this subject because I think we should
>>> take action on this and at least first have a shaded version of
>>> guava.
>>> I was playing with a toy project and I did the procedure we use to
>>> submit jars to a Hadoop cluster via Flink/Spark which involves
>>> creating an uber jar and I realized that the size of the jar was way
>>> bigger than I expected, and the fact that we shade guava in every
>>> module contributes to this. I found guava shaded on:
>>>
>>> sdks/java/core
>>> runners/core-construction-java
>>> runners/core-java
>>> model/job-management
>>> runners/spark
>>> 

Re: Tracking Sickbayed tests in Jira

2018-01-31 Thread Romain Manni-Bucau
If it helps, here is what I'm using on another project:

# to find @Ignore tests
$ find . -name '*.java' | xargs grep -n1 @Ignore
# to find test classes (not methods)
$ find . -name '*.java' | xargs grep @Ignore | sed 's#:.*##' | sort -u
# to find modules with @Ignore
$ find . -name '*.java' | xargs grep @Ignore | sed 's#src/.*##' | sort -u
# to count ignored tests
$ find . -name '*.java' | xargs grep @Ignore | wc -l

The last one, combined with a loop over git history, lets you follow the
evolution and check whether it grows or decreases.
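If a JVM-only variant is handy, the same count can be computed in plain Java. This is a hypothetical standalone utility (the class name and layout are illustrative, not part of Beam) that mirrors the last grep pipeline:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class IgnoreCounter {

    /** Counts lines containing "@Ignore" across all .java files under root. */
    static long countIgnores(Path root) {
        try (Stream<Path> files = Files.walk(root)) {
            return files
                .filter(p -> p.toString().endsWith(".java"))
                .flatMap(IgnoreCounter::lines)
                .filter(line -> line.contains("@Ignore"))
                .count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static Stream<String> lines(Path p) {
        try {
            return Files.readAllLines(p).stream();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Mirrors: find . -name '*.java' | xargs grep @Ignore | wc -l
        System.out.println(countIgnores(Paths.get(args.length > 0 ? args[0] : ".")));
    }
}
```

Like the grep pipeline, this counts occurrences of the literal text, including commented-out annotations, so the numbers stay comparable between the two approaches.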



Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-31 22:40 GMT+01:00 Thomas Groh <tg...@google.com>:

> Hey everyone;
>
> I've realized that although we tend to tag any test we suppress (due to
> consistent flakiness) in the codebase, and file an associated JIRA issue
> with the failure, we don't have any centralized way to track tests that
> we're currently suppressing. To try and get more visibility into our
> suppressed tests (without running `grep -r @Ignore ...` over the codebase
> over and over), I've created a label for these tests, and applied it to all
> of the issues that annotated `@Ignore` tests point to.
>
> Ideally, all of our suppressed tests would be tagged with this label, so
> we can get some visibility into which components we would normally expect
> to have coverage but don't currently.
>
> The search to look at all of these issues is
> https://issues.apache.org/jira/browse/BEAM-3583?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20sickbay
>
> If you're looking for something to do, or have other issues that should be
> labelled, feel free to jump right in.
>
> Yours,
>
> Thomas
>


Re: Schema-Aware PCollections revisited

2018-01-31 Thread Romain Manni-Bucau
If you need help on the json part I'm happy to help. To give a few hints on
what is very doable: we can add an avro module to johnzon (the ASF json{p,b}
impl) to back json-p with avro (I guess it will be one of the first to be
asked), for instance.



2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:

> Agree. The initial implementation will be a prototype.
>
> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Hi Reuven,
>>
>> Agree to be able to describe the schema with different format. The good
>> point about json schemas is that they are described by a spec. My point is
>> also to avoid reinventing the wheel. Just an abstraction able to use
>> Avro, Json, Calcite, or custom schema descriptors would be great.
>>
>> Using coder to describe a schema sounds like a smart move to implement
>> quickly. However, it has to be clear in terms of documentation to avoid
>> "side effect". I still think PCollection.setSchema() is better: it should
>> be metadata (or hint ;))) on the PCollection.
>>
>> Regards
>> JB
>>
>> On 31/01/2018 20:16, Reuven Lax wrote:
>>
>>> As to the question of how a schema should be specified, I want to
>>> support several common schema formats. So if a user has a Json schema, or
>>> an Avro schema, or a Calcite schema, etc. there should be adapters that
>>> allow setting a schema from any of them. I don't think we should prefer one
>>> over the other. While Romain is right that many people know Json, I think
>>> far fewer people know Json schemas.
>>>
>>> Agree, schemas should not be enforced (for one thing, that wouldn't be
>>> backwards compatible!). I think for the initial prototype I will probably
>>> use a special coder to represent the schema (with setSchema an option on
>>> the coder), largely because it doesn't require modifying PCollection.
>>> However I think longer term a schema should be an optional piece of
>>> metadata on the PCollection object. Similar to the previous discussion
>>> about "hints," I think this can be set on the producing PTransform, and a
>>> SetSchema PTransform will allow attaching a schema to any PCollection (i.e.
>>> pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I
>>> think schema should be similar to hints, it's just another piece of
>>> metadata on the PCollection (though something interpreted by the model,
>>> where hints are interpreted by the runner)
>>>
>>> Reuven
>>>
>>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net
>>> <mailto:j...@nanthrax.net>> wrote:
>>>
>>> Hi,
>>>
>>> I think we should avoid mixing two things in the discussion (and so
>>> the document):
>>>
>>> 1. The element of the collection and the schema itself are two
>>> different things.
>>> In essence, Beam should not enforce any schema. That's why I think
>>> it's a good
>>> idea to set the schema optionally on the PCollection
>>> (pcollection.setSchema()).
>>>
>>> 2. From point 1 comes two questions: how do we represent a schema ?
>>> How can we
>>> leverage the schema to simplify the serialization of the element in
>>> the
>>> PCollection and query ? These two questions are not directly related.
>>>
>>>   2.1 How do we represent the schema
>>> Json Schema is a very interesting idea. It could be an abstraction, and
>>> other
>>> providers, like Avro, could be bound to it. It's part of the json
>>> processing spec
>>> (javax).
>>>
>>>   2.2. How do we leverage the schema for query and serialization
>>> Also in the spec, json pointer is interesting for the querying. Regarding
>>> the serialization, jackson or other data binders can be used.
>>>
>>> It's still rough ideas in my mind, but I like Romain's idea about
>>> json-p usage.
>>>
>>> Once 2.3.0 release is out, I will start to update the document with
>>> those ideas,
>>> and PoC.
>>>
>>> Thanks !
>>> Regards
>>> JB

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Romain Manni-Bucau
Hmm, semantically it is either a hint or it is deducible from the transform.
Taking the union of both, you cover all cases. Then how it is forwarded from
the transform to the runtime belongs to the runner API, not the user
(pipeline) API, so I'm not sure I see the case you reference where it has a
semantic API. Can you detail it please?



2018-01-31 20:45 GMT+01:00 Reuven Lax <re...@google.com>:

> I don't think "hint" is the right API, as schema is not a hint (it has
> semantic meaning). However I think the API for schema should look similar
> to any "hint" API.
>
> On Wed, Jan 31, 2018 at 11:40 AM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>>
>>
>> Le 31 janv. 2018 20:16, "Reuven Lax" <re...@google.com> a écrit :
>>
>> As to the question of how a schema should be specified, I want to support
>> several common schema formats. So if a user has a Json schema, or an Avro
>> schema, or a Calcite schema, etc. there should be adapters that allow
>> setting a schema from any of them. I don't think we should prefer one over
>> the other. While Romain is right that many people know Json, I think far
>> fewer people know Json schemas.
>>
>>
>> Agree, but schema would get an API for beam usage - I don't think there is
>> a standard we can use and we can't use any vendor-specific API in beam - so
>> not a big deal IMO/not a blocker.
>>
>>
>>
>> Agree, schemas should not be enforced (for one thing, that wouldn't be
>> backwards compatible!). I think for the initial prototype I will probably
>> use a special coder to represent the schema (with setSchema an option on
>> the coder), largely because it doesn't require modifying PCollection.
>> However I think longer term a schema should be an optional piece of
>> metadata on the PCollection object. Similar to the previous discussion
>> about "hints," I think this can be set on the producing PTransform, and a
>> SetSchema PTransform will allow attaching a schema to any PCollection (i.e.
>> pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I
>> think schema should be similar to hints, it's just another piece of
>> metadata on the PCollection (though something interpreted by the model,
>> where hints are interpreted by the runner)
>>
>>
>> Schema should probably be contributable from the transform when mandatory
>> - thinking of avro io here - or a hint as a fallback when optional.
>> This sounds good to me and doesn't require another public API than hint.
>>
>>
>> Reuven
>>
>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>
>>> Hi,
>>>
>>> I think we should avoid mixing two things in the discussion (and so the
>>> document):
>>>
>>> 1. The element of the collection and the schema itself are two different
>>> things.
>>> In essence, Beam should not enforce any schema. That's why I think it's
>>> a good
>>> idea to set the schema optionally on the PCollection
>>> (pcollection.setSchema()).
>>>
>>> 2. From point 1 comes two questions: how do we represent a schema ? How
>>> can we
>>> leverage the schema to simplify the serialization of the element in the
>>> PCollection and query ? These two questions are not directly related.
>>>
>>>  2.1 How do we represent the schema
>>> Json Schema is a very interesting idea. It could be an abstraction, and
>>> other providers, like Avro, could be bound to it. It's part of the json
>>> processing spec
>>> (javax).
>>>
>>>  2.2. How do we leverage the schema for query and serialization
>>> Also in the spec, json pointer is interesting for the querying.
>>> Regarding the
>>> serialization, jackson or other data binder can be used.
>>>
>>> It's still rough ideas in my mind, but I like Romain's idea about json-p
>>> usage.
>>>
>>> Once 2.3.0 release is out, I will start to update the document with
>>> those ideas,
>>> and PoC.
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>> >
>>> >
>>> > Le 30 janv. 2018 01:09, "Reuven Lax" <re...@google.com

Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-31 Thread Romain Manni-Bucau
Hmm, here we are the ones owning the call since it is in a coder, no? Do we
assume people will badly implement coders? In this particular case we can
assume close() will be called by a framework I think.
What about swallowing one close() and failing on the second?



2018-01-31 20:59 GMT+01:00 Lukasz Cwik <lc...@google.com>:

> Because people write code like:
> myMethod(InputStream in) {
>   InputStream child = new BufferedInputStream(in);
>   child.close();
> }
>
> InputStream in = new FileInputStream(... path ...);
> myMethod(in);
> myMethod(in);
>
> An exception will be thrown when the second myMethod call occurs.
>
> Unfortunately, not everyone wraps their calls to a coder with an
> UnownedInputStream or a filter input stream that drops close() calls, which
> is why it's a problem; in the few places it is done, it is used to prevent
> bugs from creeping in.
>
>
>
> On Tue, Jan 30, 2018 at 11:29 AM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>> I get the issue but I don't get the last part. Concretely we can support
>> any lib by just removing the exception in the close, no? What would be the
>> issue? No additional wrapper, no lib integration issue.
>>
>>
>>
>> 2018-01-30 19:29 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>>
>>> Its common in the code base that input and output streams are passed
>>> around and the caller is responsible for closing it, not the callee. The
>>> UnownedInputStream is to guard against libraries that are poorly behaved
>>> and assume they get ownership of the stream when it is given to them.
>>>
>>> In the code:
>>> myMethod(InputStream in) {
>>>   InputStream child = new BufferedInputStream(in);
>>>   child.close();
>>> }
>>>
>>> InputStream in = ...
>>> myMethod(in);
>>> myMethod(in);
>>> When should "in" be closed?
>>>
>>> To get around this issue, create a filter input/output stream that
>>> ignores close calls like on the JAXB coder:
>>> https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/JAXBCoder.java#L181
>>>
>>> We can instead swap around this pattern so that the caller guards
>>> against callees closing by wrapping with a filter input/output stream but
>>> this costs an additional method call for each input/output stream call.
>>>
>>>
>>> On Tue, Jan 30, 2018 at 10:04 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> All is in the subject ;)
>>>>
>>>> Rational is to support any I/O library and not fail when the close is
>>>> encapsulated.
>>>>
>>>> Any blocker to swallow this close call?
>>>>
>>>>
>>>
>>>
>>
>
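The two options discussed in this thread, a filter stream that drops close() entirely (the approach the JAXBCoder link demonstrates) and Romain's variant that swallows one close() but fails on a second, can both be sketched in a few lines. The class names here are illustrative, not Beam APIs:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Drops close() entirely so a callee cannot close a stream it does not own. */
class CloseIgnoringInputStream extends FilterInputStream {
    CloseIgnoringInputStream(InputStream in) { super(in); }
    @Override public void close() { /* intentionally ignored; the real owner closes the stream */ }
}

/** Swallows the first close() (e.g. from a framework) but fails on a second one. */
class SingleCloseInputStream extends FilterInputStream {
    private boolean closed;
    SingleCloseInputStream(InputStream in) { super(in); }
    @Override public void close() throws IOException {
        if (closed) {
            throw new IOException("Stream closed twice");
        }
        closed = true; // swallow; the underlying stream stays open for the real owner
    }
}

class Demo {
    public static void main(String[] args) throws IOException {
        InputStream raw = new ByteArrayInputStream(new byte[] {1, 2, 3});
        InputStream guarded = new CloseIgnoringInputStream(raw);
        guarded.close();                    // the callee's close() is a no-op
        System.out.println(guarded.read()); // still readable; prints 1
    }
}
```

With the first wrapper the caller writes `myMethod(new CloseIgnoringInputStream(in))` and remains responsible for closing `in` itself; the cost is the extra method call per stream operation that the thread mentions.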


Re: Schema-Aware PCollections revisited

2018-01-31 Thread Romain Manni-Bucau
Le 31 janv. 2018 20:16, "Reuven Lax" <re...@google.com> a écrit :

As to the question of how a schema should be specified, I want to support
several common schema formats. So if a user has a Json schema, or an Avro
schema, or a Calcite schema, etc. there should be adapters that allow
setting a schema from any of them. I don't think we should prefer one over
the other. While Romain is right that many people know Json, I think far
fewer people know Json schemas.


Agree, but schema would get an API for beam usage - I don't think there is a
standard we can use and we can't use any vendor-specific API in beam - so
not a big deal IMO/not a blocker.



Agree, schemas should not be enforced (for one thing, that wouldn't be
backwards compatible!). I think for the initial prototype I will probably
use a special coder to represent the schema (with setSchema an option on
the coder), largely because it doesn't require modifying PCollection.
However I think longer term a schema should be an optional piece of
metadata on the PCollection object. Similar to the previous discussion
about "hints," I think this can be set on the producing PTransform, and a
SetSchema PTransform will allow attaching a schema to any PCollection (i.e.
pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I think
schema should be similar to hints, it's just another piece of metadata on
the PCollection (though something interpreted by the model, where hints are
interpreted by the runner)


Schema should probably be contributable from the transform when mandatory -
thinking of avro io here - or a hint as a fallback when optional.
This sounds good to me and doesn't require another public API than hint.


Reuven

On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi,
>
> I think we should avoid mixing two things in the discussion (and so the
> document):
>
> 1. The element of the collection and the schema itself are two different
> things.
> In essence, Beam should not enforce any schema. That's why I think it's a
> good
> idea to set the schema optionally on the PCollection
> (pcollection.setSchema()).
>
> 2. From point 1 comes two questions: how do we represent a schema ? How
> can we
> leverage the schema to simplify the serialization of the element in the
> PCollection and query ? These two questions are not directly related.
>
>  2.1 How do we represent the schema
> Json Schema is a very interesting idea. It could be an abstraction, and other
> providers, like Avro, could be bound to it. It's part of the json processing
> spec
> (javax).
>
>  2.2. How do we leverage the schema for query and serialization
> Also in the spec, json pointer is interesting for the querying. Regarding
> the
> serialization, jackson or other data binder can be used.
>
> It's still rough ideas in my mind, but I like Romain's idea about json-p
> usage.
>
> Once 2.3.0 release is out, I will start to update the document with those
> ideas,
> and PoC.
>
> Thanks !
> Regards
> JB
>
> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
> >
> >
> > Le 30 janv. 2018 01:09, "Reuven Lax" <re...@google.com
> > <mailto:re...@google.com>> a écrit :
> >
> >
> >
> > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <
> rmannibu...@gmail.com
> > <mailto:rmannibu...@gmail.com>> wrote:
> >
> > Hi
> >
> > I have some questions on this: how would hierarchical schemas work? It
> > seems not really supported by the ecosystem (outside of custom stuff) :(.
> > How would it integrate smoothly with other generic record types - N
> > bridges?
> >
> >
> > Do you mean nested schemas? What do you mean here?
> >
> >
> > Yes, sorry - wrote the mail too late ;). I meant hierarchical data and nested
> schemas.
> >
> >
> > Concretely I wonder if using a json API couldn't be beneficial: json-p is
> > a nice generic abstraction with a built-in querying mechanism (jsonpointer)
> > but no actual serialization (even if json and binary json are very
> > natural). The big advantage is to have a well-known ecosystem - who
> > doesn't know json today? - that beam can reuse for free: JsonObject
> > (guess we don't want the JsonValue abstraction) for the record type, the
> > jsonschema standard for the schema, jsonpointer for the
> > selection/projection etc... It doesn't enforce the actual serialization
> > (json, smile, avro, ...) but provides an expressive and already known API
> > so I see it as a big win-win for users (no need to learn a 

drop scala....version from artifact ;)

2018-01-31 Thread Romain Manni-Bucau
Hi guys

since beam supports a single version of runners, why not drop the scala
version from the artifactId?

ATM upgrades are painful because you have to upgrade the beam version plus the
runner artifactIds.

wdyt?



Re: IO plans?

2018-01-31 Thread Romain Manni-Bucau
Thanks JB, this is great news since these are heavily used IOs in the industry
and have been long awaited.



2018-01-31 14:42 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:

> By the way, short term: ParquetIO, RabbitMqIO are coming (PRs already
> open).
>
> Regards
> JB
>
> On 01/31/2018 02:41 PM, Jean-Baptiste Onofré wrote:
> > Hi Romain,
> >
> > I have some IOs locally and some idea:
> >
> > - ExecIO (it has been proposed as PR but declined)
> > - ConsoleIO (generic)
> > - SocketIO
> > - RestIO
> > - MinaIO
> >
> > I also created the other IOs Jira.
> >
> > Regards
> > JB
> >
> > On 01/31/2018 01:57 PM, Romain Manni-Bucau wrote:
> >> Hi guys,
> >>
> >> is there a plan for future IO and some tracking somewhere?
> >>
> >> I particularly wonder if there are plans for a HTTP IO and common
> server IO like
> >> SFTP, SSH, etc...
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


IO plans?

2018-01-31 Thread Romain Manni-Bucau
Hi guys,

is there a plan for future IO and some tracking somewhere?

I particularly wonder if there are plans for a HTTP IO and common server IO
like SFTP, SSH, etc...



Re: untyped pipeline API?

2018-01-30 Thread Romain Manni-Bucau
Well, I guess it was a wording issue more than anything else.

That said, it is not true for all runners, so it may still need some more love
later, but I don't have a solution for it yet. I just wondered whether a better
way to solve it already existed.

Le 30 janv. 2018 22:36, "Reuven Lax" <re...@google.com> a écrit :

> The point isn't runner constraints, the point is that runners might
> decide to break fusion at unexpected points (e.g. the decision might be
> made because the runner has profile data about previous runs of the
> pipeline, and knows it should break it at that point). The SDK has no good
> way of knowing what those decisions will be, so needs to conservatively
> assume it could happen anywhere.
>
> On Tue, Jan 30, 2018 at 1:31 PM, Romain Manni-Bucau <rmannibu...@gmail.com
> > wrote:
>
>> Hmm starts to smell like the old question "how to enforce runner
>> constraints without enforcing too much" :(.
>>
>> Anyway, that is enough for me for this topic.
>> Thanks for the clarification and reminders guys.
>>
>> Le 30 janv. 2018 22:29, "Reuven Lax" <re...@google.com> a écrit :
>>
>>> Where the split points are depends on the runner. Runners are free to
>>> split at any point (and often do to prevent cycles from appearing in the
>>> graph).
>>>
>>> On Tue, Jan 30, 2018 at 1:27 PM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> I kind of agree on all of that, and it brings me to the interesting point
>>>> of this topic: why are coders so strictly enforced if they are unused most
>>>> of the time - a flat processor chain, to caricature it?
>>>>
>>>> Shouldn't they be relaxed a bit and just enforced at split or shuffle
>>>> points?
>>>>
>>>>
>>>> Le 30 janv. 2018 22:09, "Ben Chambers" <bchamb...@apache.org> a écrit :
>>>>
>>>>> It sounds like in your specific case you're saying that the same
>>>>> encoding can be viewed by the Java type system two different ways. For
>>>>> instance, if you have an object Person that is convertible to JSON using
>>>>> Jackson, then that JSON encoding can be viewed as either a Person or a
>>>>> Map<String, Object> looking at the JSON fields. In that case, there needs
>>>>> to be some kind of "view change" transform to change the type of the
>>>>> PCollection.
>>>>>
>>>>> I'm not sure an untyped API would be better here. Requiring the "view
>>>>> change" be explicit means we can ensure the types are compatible, and also
>>>>> makes it very clear when this kind of change is desired.
>>>>>
>>>>> Some background on Coders that may be relevant:
>>>>>
>>>>> It might help to think about Coders as the specification of how
>>>>> elements in a PCollection are encoded if/when the runner needs to. If you
>>>>> are trying to read JSON or XML records from a source, that is part of the
>>>>> source transform (reading JSON or XML records) and not part of the
>>>>> collection produced by the transform.
>>>>>
>>>>> Consider further -- even if you read XML records from a source, you
>>>>> likely *wouldn't* want to use an XML Coder for those records within the
>>>>> pipeline, as every time the pipeline needed to serialize them you would
>>>>> produce much larger amounts of data (XML is not an efficient/compact
>>>>> encoding). Instead, you likely want to read XML records from the source 
>>>>> and
>>>>> then encode those within the pipeline using something more efficient. Then
>>>>> convert them to something more readable but possibly less-efficient before
>>>>> they exit the pipeline at a sink.
>>>>>
>>>>> On Tue, Jan 30, 2018 at 12:23 PM Kenneth Knowles <k...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Ah, this is a point that Robert brings up quite often: one reason we
>>>>>> put coders on PCollections instead of doing that work in PTransforms is
>>>>>> that the runner (plus SDK harness) can automatically only serialize when
>>>>>> necessary. So the default in Beam is that the thing you want to happen is
>>>>>> already done. There are some corner cases when you get to the portability
>>>>>> framework but I am pretty sure it 

Re: untyped pipeline API?

2018-01-30 Thread Romain Manni-Bucau
Hmm starts to smell like the old question "how to enforce runner
constraints without enforcing too much" :(.

Anyway, that is enough for me for this topic.
Thanks for the clarification and reminders guys.

Le 30 janv. 2018 22:29, "Reuven Lax" <re...@google.com> a écrit :

> Where the split points are depends on the runner. Runners are free to
> split at any point (and often do to prevent cycles from appearing in the
> graph).
>
> On Tue, Jan 30, 2018 at 1:27 PM, Romain Manni-Bucau <rmannibu...@gmail.com
> > wrote:
>
>> I kind of agree on all of that, and it brings me to the interesting point of
>> this topic: why are coders so strictly enforced if they are unused most of
>> the time - a flat processor chain, to caricature it?
>>
>> Shouldn't they be relaxed a bit and just enforced at split or shuffle points?
>>
>>
>> Le 30 janv. 2018 22:09, "Ben Chambers" <bchamb...@apache.org> a écrit :
>>
>>> It sounds like in your specific case you're saying that the same
>>> encoding can be viewed by the Java type system two different ways. For
>>> instance, if you have an object Person that is convertible to JSON using
>>> Jackson, then that JSON encoding can be viewed as either a Person or a
>>> Map<String, Object> looking at the JSON fields. In that case, there needs
>>> to be some kind of "view change" transform to change the type of the
>>> PCollection.
>>>
>>> I'm not sure an untyped API would be better here. Requiring the "view
>>> change" be explicit means we can ensure the types are compatible, and also
>>> makes it very clear when this kind of change is desired.
>>>
>>> Some background on Coders that may be relevant:
>>>
>>> It might help to think about Coders as the specification of how
>>> elements in a PCollection are encoded if/when the runner needs to. If you
>>> are trying to read JSON or XML records from a source, that is part of the
>>> source transform (reading JSON or XML records) and not part of the
>>> collection produced by the transform.
>>>
>>> Consider further -- even if you read XML records from a source, you
>>> likely *wouldn't* want to use an XML Coder for those records within the
>>> pipeline, as every time the pipeline needed to serialize them you would
>>> produce much larger amounts of data (XML is not an efficient/compact
>>> encoding). Instead, you likely want to read XML records from the source and
>>> then encode those within the pipeline using something more efficient. Then
>>> convert them to something more readable but possibly less-efficient before
>>> they exit the pipeline at a sink.
>>>
>>> On Tue, Jan 30, 2018 at 12:23 PM Kenneth Knowles <k...@google.com> wrote:
>>>
>>>> Ah, this is a point that Robert brings up quite often: one reason we
>>>> put coders on PCollections instead of doing that work in PTransforms is
>>>> that the runner (plus SDK harness) can automatically only serialize when
>>>> necessary. So the default in Beam is that the thing you want to happen is
>>>> already done. There are some corner cases when you get to the portability
>>>> framework but I am pretty sure it already works this way. If you show what
>>>> is a PTransform and PCollection in your example it might show where we can
>>>> fix things.
>>>>
>>>> On Tue, Jan 30, 2018 at 12:17 PM, Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> Indeed,
>>>>>
>>>>> I'll take a stupid example to make it shorter.
>>>>> I have a source emitting Person objects ({name:...,id:...}) serialized
>>>>> with jackson as JSON.
>>>>> Then my pipeline processes them with a DoFn taking a Map<String,
>>>>> String>. Here I set the coder to read json as a map.
>>>>>
>>>>> However a Map<String, String> is not a Person so my pipeline needs an
>>>>> intermediate step to convert one into the other, and has, by design, a
>>>>> useless serialization round trip.
>>>>>
>>>>> If you check the chain you have: Person -> JSON -> Map<String, String>
>>>>> -> JSON -> Map<String, String> whereas Person -> JSON -> Map<String,
>>>>> String> is fully enough because the JSON forms are equivalent in this
>>>>> example.
>>>>>
>>>>> In other words if an co

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Romain Manni-Bucau
Not sure how it fits in terms of API yet but +1 for the high level view.
Makes perfect sense.

Le 30 janv. 2018 21:41, "Jean-Baptiste Onofré" <j...@nanthrax.net> a écrit :

> Hi Robert,
>
> Good point and idea for the Composite transform. It would apply nicely on
> all transforms based on composite.
>
> I also agree that the hint is more on the transform than the PCollection
> itself.
>
> Thanks !
> Regards
> JB
>
> On 30/01/2018 21:26, Robert Bradshaw wrote:
>
>> Many hints make more sense for PTransforms (the computation itself)
>> than for PCollections. In addition, when we want properties attached
>> to PCollections themselves, it often makes sense to let these be
>> provided by the producing PTransform (e.g. coders and schemas are
>> often functions of the input metadata and the operation itself, and
>> can't just be set arbitrarily).
>>
>> Also, we already have a perfectly standard way of nesting transforms
>> (or even sets of transforms), namely composite transforms. In terms of
>> API design I would propose writing a composite transform that applies
>> constraints/hints/requirements to all its inner transforms. This
>> translates nicely to the Fn API as well.
>>
>> On Tue, Jan 30, 2018 at 12:14 PM, Kenneth Knowles <k...@google.com> wrote:
>>
>>> It seems like most of these use cases are hints on a PTransform and not a
>>> PCollection, no? CPU, memory, expected parallelism, etc. are. Then you
>>> could just have:
>>>  pc.apply(WithHints(myTransform, ))
>>>
>>> For a PCollection hints that might make sense are bits like total size,
>>> element size, and throughput. All things that the Dataflow folks have
>>> said should be measured instead of hinted. But I understand that we
>>> shouldn't force runners to do infeasible things like build a whole
>>> no-knobs service on top of a super-knobby engine.
>>>
>>> Incidentally for portability, we have this "environment" object that is
>>> basically the docker URL of an SDK harness that can execute a function.
>>> We always intended that same area of the proto (exact fields TBD) to have
>>> things like requirements for CPU, memory, GPUs, disk, etc. It is likely
>>> a good place for hints.
>>>
>>> BTW good idea to ask users@ for their pain points and bring them back to
>>> the dev list to motivate feature design discussions.
>>>
>>> Kenn
>>>
>>> On Tue, Jan 30, 2018 at 12:00 PM, Reuven Lax <re...@google.com> wrote:
>>>
>>>>
>>>> I think the hints would logically be metadata in the pcollection, just
>>>> like coder and schema.
>>>>
>>>> On Jan 30, 2018 11:57 AM, "Jean-Baptiste Onofré" <j...@nanthrax.net>
>>>> wrote:
>>>>
>>>>>
>>>>> Great idea for AddHints.of() !
>>>>>
>>>>> What would be the resulting PCollection? Just a PCollection of hints,
>>>>> or the pc elements + hints?
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 30/01/2018 20:52, Reuven Lax wrote:
>>>>>
>>>>>>
>>>>>> I think adding hints for runners is reasonable, though hints should
>>>>>> always be assumed to be optional - they shouldn't change the semantics
>>>>>> of the program (otherwise you destroy the portability promise of Beam).
>>>>>> However there are many types of hints that some runners might find
>>>>>> useful (e.g. this step needs more memory; this step runs ML algorithms
>>>>>> and should run on a machine with GPUs; etc.)
>>>>>>
>>>>>> Robert has mentioned in the past that we should try and keep
>>>>>> PCollection
>>>>>> an immutable object, and not introduce new setters on it. We slightly
>>>>>> break
>>>>>> this already today with PCollection.setCoder, and that has caused some
>>>>>> problems. Hints can be set on PTransforms though, and propagate to
>>>>>> that
>>>>>> PTransform's output PCollections. This is nearly as easy to use
>>>>>> however, as
>>>>>> we can implement a helper PTransform that can be used to set hints.

Re: untyped pipeline API?

2018-01-30 Thread Romain Manni-Bucau
I kind of agree with all of that, and it brings me to the interesting point
of this topic: why are coders enforced so strictly if they are not used most
of the time - a flat processor chain, to caricature it?

Shouldn't it be relaxed a bit and just enforced at split or shuffle points?


On Jan 30, 2018 22:09, "Ben Chambers" <bchamb...@apache.org> wrote:

> It sounds like in your specific case you're saying that the same encoding
> can be viewed by the Java type system two different ways. For instance, if
> you have an object Person that is convertible to JSON using Jackson, than
> that JSON encoding can be viewed as either a Person or a Map<String,
> Object> looking at the JSON fields. In that case, there needs to be some
> kind of "view change" change transform to change the type of the
> PCollection.
>
> I'm not sure an untyped API would be better here. Requiring the "view
> change" be explicit means we can ensure the types are compatible, and also
> makes it very clear when this kind of change is desired.
>
> Some background on Coders that may be relevant:
>
> It might help to to think about Coders as the specification of how
> elements in a PCollection are encoded if/when the runner needs to. If you
> are trying to read JSON or XML records from a source, that is part of the
> source transform (reading JSON or XML records) and not part of the
> collection produced by the transform.
>
> Consider further -- even if you read XML records from a source, you likely
> *wouldn't* want to use an XML Coder for those records within the pipeline,
> as every time the pipeline needed to serialize them you would produce much
> larger amounts of data (XML is not an efficient/compact encoding). Instead,
> you likely want to read XML records from the source and then encode those
> within the pipeline using something more efficient. Then convert them to
> something more readable but possibly less-efficient before they exit the
> pipeline at a sink.
>
> On Tue, Jan 30, 2018 at 12:23 PM Kenneth Knowles <k...@google.com> wrote:
>
>> Ah, this is a point that Robert brings up quite often: one reason we put
>> coders on PCollections instead of doing that work in PTransforms is that
>> the runner (plus SDK harness) can automatically only serialize when
>> necessary. So the default in Beam is that the thing you want to happen is
>> already done. There are some corner cases when you get to the portability
>> framework but I am pretty sure it already works this way. If you show what
>> is a PTransform and PCollection in your example it might show where we can
>> fix things.
>>
>> On Tue, Jan 30, 2018 at 12:17 PM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> Indeed,
>>>
>>> I'll take a stupid example to make it shorter.
>>> I have a source emitting Person objects ({name:...,id:...}) serialized
>>> with jackson as JSON.
>>> Then my pipeline processes them with a DoFn taking a Map<String,
>>> String>. Here I set the coder to read json as a map.
>>>
>>> However a Map<String, String> is not a Person so my pipeline needs an
>>> intermediate step to convert one into the other and has in the design an
>>> useless serialization round trip.
>>>
>>> If you check the chain you have: Person -> JSON -> Map<String, String>
>>> -> JSON -> Map<String, String> whereas Person -> JSON -> Map<String,
>>> String> is fully enough cause there is equivalence of JSON in this example.
>>>
>>> In other words if an coder output is readable from another coder input,
>>> the java strong typing doesn't know about it and can enforce some fake
>>> steps.
>>>
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau>
>>>
>>> 2018-01-30 21:07 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>>
>>>> I'm not sure I understand your question. Can you explain more?
>>>>
>>>> On Tue, Jan 30, 2018 at 11:50 AM, Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> just encountered an issue with the pipeline API and wondered if you
>>>>> thought about it.
>>>>>
>>>>> It can happen the Coders are compatible between them. Simple example
>>>>> is a text coder like JSON or XML will be able to read text. However with
>>>>> the pipeline API you can't support this directly and
>>>>> enforce the user to use an intermediate state to be typed.
>>>>>
>>>>> Is there already a way to avoid these useless round trips?
>>>>>
>>>>> Said otherwise: how to handle coders transitivity?
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>> <https://www.linkedin.com/in/rmannibucau>
>>>>>
>>>>
>>>>
>>>
>>


Re: untyped pipeline API?

2018-01-30 Thread Romain Manni-Bucau
Indeed,

I'll take a stupid example to make it shorter.
I have a source emitting Person objects ({name:...,id:...}) serialized with
jackson as JSON.
Then my pipeline processes them with a DoFn taking a Map<String, String>.
Here I set the coder to read json as a map.

However a Map<String, String> is not a Person, so my pipeline needs an
intermediate step to convert one into the other, and the design includes a
useless serialization round trip.

If you check the chain you have: Person -> JSON -> Map<String, String> ->
JSON -> Map<String, String>, whereas Person -> JSON -> Map<String, String>
is fully enough, because the two JSON forms are equivalent in this example.

In other words, if one coder's output is readable as another coder's input,
Java's strong typing doesn't know about it and can force some needless steps.
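The Person-to-Map round trip described here can be sketched in plain Java. This is only an illustration of the equivalence argument, not Beam code: a toy `key=value` wire format stands in for JSON, and the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class CoderEquivalence {
    // One wire format, "name=...;id=...", standing in for the JSON payload.
    static byte[] encodePerson(String name, String id) {
        return ("name=" + name + ";id=" + id).getBytes(StandardCharsets.UTF_8);
    }

    // A second "coder" reads the very same bytes as a Map view: no
    // intermediate Person object, no extra serialization round trip.
    static Map<String, String> decodeAsMap(byte[] bytes) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String pair : new String(bytes, StandardCharsets.UTF_8).split(";")) {
            String[] kv = pair.split("=", 2);
            map.put(kv[0], kv[1]);
        }
        return map;
    }

    public static void main(String[] args) {
        byte[] wire = encodePerson("Ada", "42");
        System.out.println(decodeAsMap(wire)); // {name=Ada, id=42}
    }
}
```

Because both "coders" agree on the bytes, the chain Person -> bytes -> Map needs exactly one encode and one decode; the typed API's extra Person -> Map step is where the wasted round trip comes from.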



Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-30 21:07 GMT+01:00 Kenneth Knowles <k...@google.com>:

> I'm not sure I understand your question. Can you explain more?
>
> On Tue, Jan 30, 2018 at 11:50 AM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>> Hi guys,
>>
>> just encountered an issue with the pipeline API and wondered if you
>> thought about it.
>>
>> It can happen the Coders are compatible between them. Simple example is a
>> text coder like JSON or XML will be able to read text. However with the
>> pipeline API you can't support this directly and
>> enforce the user to use an intermediate state to be typed.
>>
>> Is there already a way to avoid these useless round trips?
>>
>> Said otherwise: how to handle coders transitivity?
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau>
>>
>
>


Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Romain Manni-Bucau
I think so too, but `pc.apply(AddHints.of(hint1, hint2, hint3))` is a bit
ambiguous for me (does it affect the previous collection?)

Maybe AddHints.on(collection, hint1, hint2, ...) is an acceptable
compromise? Less fluent but not ambiguous (based on the same pattern as
views).
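A minimal sketch of how such hints could stay immutable, addressing the setCoder-style mutation concern raised in this thread. The `Hints` class and its method names are hypothetical, not an actual Beam API:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Immutable hint bag: "adding" a hint returns a new bag, so any object a
// runner already holds can never change underneath it.
final class Hints {
    private final Map<String, Object> values;

    private Hints(Map<String, Object> values) {
        this.values = Collections.unmodifiableMap(values);
    }

    static Hints none() {
        return new Hints(new HashMap<>());
    }

    // Copy-on-write: the receiver is left untouched.
    Hints with(String key, Object value) {
        Map<String, Object> copy = new HashMap<>(values);
        copy.put(key, value);
        return new Hints(copy);
    }

    Object get(String key) {
        return values.get(key);
    }
}

public class HintsDemo {
    public static void main(String[] args) {
        Hints base = Hints.none();
        Hints hinted = base.with("spark.persist", "MEMORY_ONLY");
        System.out.println(hinted.get("spark.persist")); // MEMORY_ONLY
        System.out.println(base.get("spark.persist"));   // null
    }
}
```

An AddHints.on(collection, ...) helper would then just pair the collection with such a bag instead of mutating the collection itself.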


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-30 21:00 GMT+01:00 Reuven Lax <re...@google.com>:

> I think the hints would logically be metadata in the pcollection, just
> like coder and schema.
>
> On Jan 30, 2018 11:57 AM, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
>
>> Great idea for AddHints.of() !
>>
>> What would be the resulting PCollection ? Just a PCollection of hints or
>> the pc elements + hints ?
>>
>> Regards
>> JB
>>
>> On 30/01/2018 20:52, Reuven Lax wrote:
>>
>>> I think adding hints for runners is reasonable, though hints should
>>> always be assumed to be optional - they shouldn't change semantics of the
>>> program (otherwise you destroy the portability promise of Beam). However
>>> there are many types of hints that some runners might find useful (e.g.
>>> this step needs more memory. this step runs ML algorithms, and should run
>>> on a machine with GPUs. etc.)
>>>
>>> Robert has mentioned in the past that we should try and keep PCollection
>>> an immutable object, and not introduce new setters on it. We slightly break
>>> this already today with PCollection.setCoder, and that has caused some
>>> problems. Hints can be set on PTransforms though, and propagate to that
>>> PTransform's output PCollections. This is nearly as easy to use however, as
>>> we can implement a helper PTransform that can be used to set hints. I.e.
>>>
>>> pc.apply(AddHints.of(hint1, hint2, hint3))
>>>
>>> Is no harder than called pc.addHint()
>>>
>>> Reuven
>>>
>>> On Tue, Jan 30, 2018 at 11:39 AM, Jean-Baptiste Onofré <j...@nanthrax.net
>>> <mailto:j...@nanthrax.net>> wrote:
>>>
>>> Maybe I should have started the discussion on the user mailing list:
>>> it would be great to have user feedback on this, even if I got your
>>> points.
>>>
>>> Sometime, I have the feeling that whatever we are proposing and
>>> discussing, it doesn't go anywhere. At some point, to attract more
>>> people, we have to get ideas from different perspective/standpoint.
>>>
>>> Thanks for the feedback anyway.
>>>
>>> Regards
>>> JB
>>>
>>> On 30/01/2018 20:27, Romain Manni-Bucau wrote:
>>>
>>>
>>>
>>> 2018-01-30 19:52 GMT+01:00 Kenneth Knowles <k...@google.com
>>> <mailto:k...@google.com> <mailto:k...@google.com
>>> <mailto:k...@google.com>>>:
>>>
>>>
>>>  I generally like having certain "escape hatches" that are
>>> well
>>>  designed and limited in scope, and anything that turns out
>>> to be
>>>  important becomes first-class. But this one I don't really
>>> like
>>>  because the use cases belong elsewhere. Of course, they
>>> creep so you
>>>  should assume they will be unbounded in how much gets
>>> stuffed into
>>>  them. And the definition of a "hint" is that deleting it
>>> does not
>>>  change semantics, just performance/monitor/UI etc but this
>>> does not
>>>  seem to be true.
>>>
>>>  "spark.persist" for idempotent replay in a sink:
>>>- this is already @RequiresStableInput
>>>- it is not a hint because if you don't persist your
>>> results are
>>>  incorrect
>>>- it is a property of a DoFn / transform not a PCollection
>>>
>>>
>>> Let's put this last point aside since we'll manage to make it
>>> working wherever we store it ;).
>>>
>>>
>>>  schema:
>>>- should be first-class
>>>
>>>
>> Except it doesn't make sense everywhere.

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Romain Manni-Bucau
Hmm, that can work for pipeline hints, but for transform hints we would need:

p.apply(AddHint.of(.).wrap(originalTransform))

Would work for me too.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-30 20:57 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:

> Great idea for AddHints.of() !
>
> What would be the resulting PCollection ? Just a PCollection of hints or
> the pc elements + hints ?
>
> Regards
> JB
>
> On 30/01/2018 20:52, Reuven Lax wrote:
>
>> I think adding hints for runners is reasonable, though hints should
>> always be assumed to be optional - they shouldn't change semantics of the
>> program (otherwise you destroy the portability promise of Beam). However
>> there are many types of hints that some runners might find useful (e.g.
>> this step needs more memory. this step runs ML algorithms, and should run
>> on a machine with GPUs. etc.)
>>
>> Robert has mentioned in the past that we should try and keep PCollection
>> an immutable object, and not introduce new setters on it. We slightly break
>> this already today with PCollection.setCoder, and that has caused some
>> problems. Hints can be set on PTransforms though, and propagate to that
>> PTransform's output PCollections. This is nearly as easy to use however, as
>> we can implement a helper PTransform that can be used to set hints. I.e.
>>
>> pc.apply(AddHints.of(hint1, hint2, hint3))
>>
>> Is no harder than called pc.addHint()
>>
>> Reuven
>>
>> On Tue, Jan 30, 2018 at 11:39 AM, Jean-Baptiste Onofré <j...@nanthrax.net
>> <mailto:j...@nanthrax.net>> wrote:
>>
>> Maybe I should have started the discussion on the user mailing list:
>> it would be great to have user feedback on this, even if I got your
>> points.
>>
>> Sometime, I have the feeling that whatever we are proposing and
>>     discussing, it doesn't go anywhere. At some point, to attract more
>> people, we have to get ideas from different perspective/standpoint.
>>
>> Thanks for the feedback anyway.
>>
>> Regards
>> JB
>>
>> On 30/01/2018 20:27, Romain Manni-Bucau wrote:
>>
>>
>>
>> 2018-01-30 19:52 GMT+01:00 Kenneth Knowles <k...@google.com
>> <mailto:k...@google.com> <mailto:k...@google.com
>>
>> <mailto:k...@google.com>>>:
>>
>>
>>  I generally like having certain "escape hatches" that are
>> well
>>  designed and limited in scope, and anything that turns out
>> to be
>>  important becomes first-class. But this one I don't really
>> like
>>  because the use cases belong elsewhere. Of course, they
>> creep so you
>>  should assume they will be unbounded in how much gets
>> stuffed into
>>  them. And the definition of a "hint" is that deleting it
>> does not
>>  change semantics, just performance/monitor/UI etc but this
>> does not
>>  seem to be true.
>>
>>  "spark.persist" for idempotent replay in a sink:
>>- this is already @RequiresStableInput
>>- it is not a hint because if you don't persist your
>> results are
>>  incorrect
>>- it is a property of a DoFn / transform not a PCollection
>>
>>
>> Let's put this last point aside since we'll manage to make it
>> working wherever we store it ;).
>>
>>
>>  schema:
>>- should be first-class
>>
>>
>> Except it doesn't make sense everywhere. It is exactly like
>> saying "implement this" and 2 lines later "it doesn't do
>> anything for you". If you think wider on schema you will want to
>> do far more - like getting them from the previous step etc... -
>> which makes it not an API thing. However, with some runner like
>> spark, being able to specifiy it will enable to optimize the
>> execution. There is a clear mismatch between a consistent and
>> user friendly generic and portable API, and a runtime, runner
>> specific, implementation.
>>
>> This is all fine as an issue for a portable API and why all EE APIs have
>> a map to pass properties somewhere.

untyped pipeline API?

2018-01-30 Thread Romain Manni-Bucau
Hi guys,

just encountered an issue with the pipeline API and wondered if you thought
about it.

It can happen that the coders are compatible with each other. A simple
example: a text-based coder like JSON or XML will be able to read text.
However, with the pipeline API you can't support this directly, and you
force the user to go through an intermediate state to stay typed.

Is there already a way to avoid these useless round trips?

Said otherwise: how to handle coders transitivity?

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>


Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-30 Thread Romain Manni-Bucau
I get the issue, but I don't get the last part. Concretely, we can support
any lib by just removing the exception in close(), no? What would the issue
be? No additional wrapper, no lib integration issue.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-30 19:29 GMT+01:00 Lukasz Cwik <lc...@google.com>:

> Its common in the code base that input and output streams are passed
> around and the caller is responsible for closing it, not the callee. The
> UnownedInputStream is to guard against libraries that are poorly behaved
> and assume they get ownership of the stream when it is given to them.
>
> In the code:
> myMethod(InputStream in) {
>   InputStream child = new BufferedInputStream(in);
>   child.close();
> }
>
> InputStream in = ...
> myMethod(in);
> myMethod(in);
> When should "in" be closed?
>
> To get around this issue, create a filter input/output stream that ignores
> close calls like on the JAXB coder: https://github.com/apache/
> beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/
> beam/sdk/io/xml/JAXBCoder.java#L181
>
> We can instead swap around this pattern so that the caller guards against
> callees closing by wrapping with a filter input/output stream but this
> costs an additional method call for each input/output stream call.
>
>
> On Tue, Jan 30, 2018 at 10:04 AM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>> Hi guys,
>>
>> All is in the subject ;)
>>
>> Rationale is to support any I/O library and not fail when the close is
>> encapsulated.
>>
>> Any blocker to swallow this close call?
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau>
>>
>
>
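The JAXBCoder workaround Lukasz links to boils down to a filter stream whose close() only flushes. A self-contained sketch (the class names here are illustrative, not Beam's actual classes):

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Wraps a stream the caller still owns: a library calling close() on the
// wrapper cannot close the underlying stream out from under the owner.
class CloseIgnoringOutputStream extends FilterOutputStream {
    CloseIgnoringOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void close() throws IOException {
        flush(); // propagate the flush, but keep the underlying stream open
    }
}

public class UnownedStreamDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream owned = new ByteArrayOutputStream();
        OutputStream guarded = new CloseIgnoringOutputStream(owned);
        guarded.write("first".getBytes());
        guarded.close(); // a badly behaved library "closing" the stream
        owned.write(" second".getBytes()); // the owner can still write
        System.out.println(owned); // first second
    }
}
```

UnownedInputStream takes the opposite stance and throws on close(), which catches the badly behaved library instead of silently tolerating it; that is the trade-off being debated in this thread.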


Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Romain Manni-Bucau
2018-01-30 19:52 GMT+01:00 Kenneth Knowles <k...@google.com>:

> I generally like having certain "escape hatches" that are well designed
> and limited in scope, and anything that turns out to be important becomes
> first-class. But this one I don't really like because the use cases belong
> elsewhere. Of course, they creep so you should assume they will be
> unbounded in how much gets stuffed into them. And the definition of a
> "hint" is that deleting it does not change semantics, just
> performance/monitor/UI etc but this does not seem to be true.
>
> "spark.persist" for idempotent replay in a sink:
>  - this is already @RequiresStableInput
>  - it is not a hint because if you don't persist your results are incorrect
>  - it is a property of a DoFn / transform not a PCollection
>

Let's put this last point aside since we'll manage to make it work
wherever we store it ;).


>
> schema:
>  - should be first-class
>

Except it doesn't make sense everywhere. It is exactly like saying
"implement this" and two lines later "it doesn't do anything for you". If
you think wider about schemas, you will want to do far more - like getting
them from the previous step, etc. - which makes it not an API thing.
However, with a runner like Spark, being able to specify it will enable
optimizing the execution. There is a clear mismatch between a consistent,
user-friendly, generic and portable API and a runtime-specific,
runner-specific implementation.

This is all fine as an issue for a portable API - it is why all EE APIs
have a map to pass properties somewhere - so I don't see why Beam wouldn't
fall into that exact same bucket, since it embraces the drawbacks of
portability, and we have already hit this over several releases.


>
> step parallelism (you didn't mention but most runners need some control):
>  - this is a property of the data and the pipeline together, not just the
> pipeline
>

Good one, but this can be configured from the pipeline or even a transform.
This doesn't mean the data is not important - and you are more than right
on that point - just that it is configurable without referencing the data
(using ranges is a trivial example, even if not the most efficient).


>
> So I just don't actually see a use case for free-form hints that we
> haven't already covered.
>

There are several cases, even in the direct runner, to be able to
industrialize it:
- use that particular executor instance
- debug this info for that transform

etc...

As a high-level design, I think it is good to bring hints to Beam, to avoid
adding an ad-hoc solution each time and risking the loss of the portability
of the main API.


>
> Kenn
>
> On Tue, Jan 30, 2018 at 9:55 AM, Romain Manni-Bucau <rmannibu...@gmail.com
> > wrote:
>
>> Lukasz, the point is that you have to choice to either bring all
>> specificities to the main API which makes most of the API not usable or
>> implemented or the opposite, not support anything. Introducing hints will
>> allow to have eagerly for some runners some features - or just some very
>> specific things - and once mainstream it can find a place in the main API.
>> This is saner than the opposite since some specificities can never find a
>> good place.
>>
>> The little thing we need to take care with that is to avoid to introduce
>> some feature flipping as support some feature not doable with another
>> runner. It should really be about runing a runner execution (like the
>> schema in spark).
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau>
>>
>> 2018-01-30 18:42 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>>
>>> Good point Luke: in that case, the hint will be ignored by the runner if
>>> the hint is not for him. The hint can be generic (not specific to a
>>> runner). It could be interesting for the schema support or IOs, not
>>> specific to a runner.
>>>
>>> What do you mean by gathering PTransforms/PCollections and where ?
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On 30/01/2018 18:35, Lukasz Cwik wrote:
>>>
>>>> If the hint is required to run the person's pipeline well, how do you
>>>> expect that the person will be able to migrate their pipeline to another
>>>> runner?
>>>>
>>>> A lot of hints like "spark.persist" are really the user trying to tell
>>>> us something about the PCollection, like it is very small.

why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-30 Thread Romain Manni-Bucau
Hi guys,

All is in the subject ;)

Rationale is to support any I/O library and not fail when the close is
encapsulated.

Any blocker to swallow this close call?

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>


Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Romain Manni-Bucau
Lukasz, the point is that you have the choice to either bring all the
specificities into the main API, which makes most of the API unusable or
unimplemented, or the opposite, support nothing. Introducing hints will
allow some runners to get some features eagerly - or just some very
specific things - and once mainstream they can find a place in the main
API. This is saner than the opposite, since some specificities can never
find a good place.

The little thing we need to take care of with this is to avoid introducing
feature flipping, i.e. supporting a feature not doable with another runner.
It should really be about tuning a runner execution (like the schema in
Spark).


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-30 18:42 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:

> Good point Luke: in that case, the hint will be ignored by the runner if
> the hint is not for him. The hint can be generic (not specific to a
> runner). It could be interesting for the schema support or IOs, not
> specific to a runner.
>
> What do you mean by gathering PTransforms/PCollections and where ?
>
> Thanks !
> Regards
> JB
>
> On 30/01/2018 18:35, Lukasz Cwik wrote:
>
>> If the hint is required to run the person's pipeline well, how do you
>> expect that the person will be able to migrate their pipeline to another
>> runner?
>>
>> A lot of hints like "spark.persist" are really the user trying to tell us
>> something about the PCollection, like it is very small. I would prefer if
>> we gathered this information about PTransforms and PCollections instead of
>> runner specific knobs since then each runner can choose how best to map
>> such a property on their internal representation.
>>
>> On Tue, Jan 30, 2018 at 2:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net
>> <mailto:j...@nanthrax.net>> wrote:
>>
>> Hi,
>>
>> As part of the discussion about schema, Romain mentioned hint. I
>> think it's
>> worth to have an explanation about that and especially it could be
>> wider than
>> schema.
>>
>> Today, to give information to the runner, we use PipelineOptions.
>> The runner can
>> use these options, and apply for all inner representation of the
>> PCollection in
>> the runner.
>>
>> For instance, for the Spark runner, the persistence storage level
>> (memory, disk,
>> ...) can be defined via pipeline options.
>>
>> Then, the Spark runner automatically defines if RDDs have to be
>> persisted (using
>> the storage level defined in the pipeline options), for instance if
>> the same
>> POutput/PCollection is read several time.
>>
>> However, the user doesn't have any way to provide indication to the
>> runner to
>> deal with a specific PCollection.
>>
>> Imagine, the user has a pipeline like this:
>> pipeline.apply().apply().apply(). We
>> have three PCollections involved in this pipeline. It's not
>> currently possible
>> to give indications how the runner should "optimized" and deal with
>> the second
>> PCollection only.
>>
>> The idea is to add a method on the PCollection:
>>
>> PCollection.addHint(String key, Object value);
>>
>> For instance:
>>
>> collection.addHint("spark.persist", StorageLevel.MEMORY_ONLY);
>>
>> I see three direct usage of this:
>>
>> 1. Related to schema: the schema definition could be a hint
>> 2. Related to the IO: add headers for the IO and the runner how to
>> specifically
>> process a collection. In Apache Camel, we have headers on the
>> message and
>> properties on the exchange similar to this. It allows to give some
>> indication
>> how to process some messages on the Camel component. We can imagine
>> the same of
>> the IO (using the PCollection hints to react accordingly).
>> 3. Related to runner optimization: I see for instance a way to use
>> RDD or
>> dataframe in Spark runner, or even specific optimization like
>> persist. I had lot
>> of questions from Spark users saying: "in my Spark job, I know where
>> and how I
>> should use persist (rdd.persist()), but I can't do such optimization
>> using
>> Beam". So it could be a good improvements.
>>
>> Thoughts ?
>>
>> Regards
>> JB
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org <mailto:jbono...@apache.org>
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>>
>>


Re: Should we have a predictable test run order?

2018-01-30 Thread Romain Manni-Bucau
What I used to do is capture the output when I identified one of these
cases. Once it is reproduced, I grep the "Running" lines from surefire.
This gives me a reproducible order. Then, with a kind of dichotomy, you can
find the "previous" test that makes your test fail, and you can configure
this sequence in IDEA.

Not perfect, but probably better than hiding the issue.

Also, running "clean" forces inodes to change and increases the probability
of reproducing it on Linux.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-30 18:03 GMT+01:00 Daniel Kulp <dk...@apache.org>:

> The biggest problem with random is that if a test fails due to an
> interaction, you have no way to reproduce it.   You could re-run with
> random 10 times and it might not fail again.   Thus, what good did it do to
> even flag the failure?  At least with alphabetical and reverse
> alphabetical, if a tests fails, you can rerun and actually have a chance to
> diagnose the failure.   A test that randomly fails once out of every 20
> times it runs tends to get @Ignored, not fixed.   I’ve seen that way too
> often.  :(
>
> Dan
>
>
> > On Jan 30, 2018, at 11:38 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
> >
> > Hi Daniel,
> >
> > As a quick fix it sounds good but doesnt it hide a leak or issue (in
> test setup or in main code)? Long story short: using a random order can
> allow to find bugs faster instead of hiding them and discover them randomly
> adding a new test.
> >
> > That said, good point to have it configurable with a -D or -P and be
> able to test quickly this flag.
> >
> >
> > Le 30 janv. 2018 17:33, "Daniel Kulp" <dk...@apache.org> a écrit :
> > I spent a couple hours this morning trying to figure out why two of the
> SQL tests are failing on my machine, but not for Jenkins or for JB.   Not
> knowing anything about the SQL stuff, it was very hard to debug and it
> wouldn’t fail within Eclipse or even if I ran that individual test from the
> command line with -Dtest= .   Thus, a real pain…
> >
> > It turns out, there is an interaction problem between it and a test that
> is running before it on my machine, but on Jenkins and JB’s machine, the
> tests are run in a different order so the problem doesn’t surface.   So
> here’s the question:
> >
> > Should the surefire configuration specify a “runOrder” so that the tests
> would run the same on all of our machines?   By default, the runOrder is
> “filesystem” so depending on the order that the filesystem returns the test
> classes to surefire, the tests would run in different order.   It looks
> like my APFS Mac returns them in a different order than JB’s Linux.But
> that also means if there is a Jenkins test failure or similar, I might not
> be able to reproduce it.   (Or a Windows person or even a Linux user using
> a different fs than Jenkins)   For most of the projects I use, we generally
> have “alphabetical” to make things completely
> predictable.   That said, by making things non-deterministic, it can find
> issues like this where tests aren’t cleaning themselves up correctly.
> Could do a runOrder=hourly to flip back and forth between alphabetical and
> reverse-alphabetical.  Predictable, but changes to detect issues.
> >
> > Thoughts?
> >
> >
> > --
> > Daniel Kulp
> > dk...@apache.org - http://dankulp.com/blog
> > Talend Community Coder - http://coders.talend.com
> >
>
> --
> Daniel Kulp
> dk...@apache.org - http://dankulp.com/blog
> Talend Community Coder - http://coders.talend.com
>
>


Re: Should we have a predictable test run order?

2018-01-30 Thread Romain Manni-Bucau
Hi Daniel,

As a quick fix it sounds good, but doesn't it hide a leak or an issue (in the
test setup or in the main code)? Long story short: a random order can surface
bugs faster instead of hiding them until they show up randomly when a new
test is added.

That said, good point to make it configurable with a -D or a -P so the flag
can be tested quickly.
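For reference, a sketch of what the surefire setting under discussion could look like in the parent pom (the runOrder values are the ones documented for maven-surefire-plugin; the placement is an assumption):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- documented values include "filesystem" (the default), "alphabetical",
         "reversealphabetical", "random" and "hourly"; "hourly" alternates
         alphabetical and reverse-alphabetical as suggested in this thread -->
    <runOrder>alphabetical</runOrder>
  </configuration>
</plugin>
```

If I read the surefire docs correctly, the parameter is also bound to the `surefire.runOrder` user property, so a quick experiment like `mvn test -Dsurefire.runOrder=hourly` should work without extra wiring.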


On Jan 30, 2018 at 17:33, "Daniel Kulp"  wrote:

> I spent a couple hours this morning trying to figure out why two of the
> SQL tests are failing on my machine, but not for Jenkins or for JB.   Not
> knowing anything about the SQL stuff, it was very hard to debug and it
> wouldn’t fail within Eclipse or even if I ran that individual test from the
> command line with -Dtest= .   Thus, a real pain…
>
> It turns out, there is an interaction problem between it and a test that
> is running before it on my machine, but on Jenkins and JB’s machine, the
> tests are run in a different order so the problem doesn’t surface.   So
> here’s the question:
>
> Should the surefire configuration specify a “runOrder” so that the tests
> would run the same on all of our machines?   By default, the runOrder is
> “filesystem” so depending on the order that the filesystem returns the test
> classes to surefire, the tests would run in different order.   It looks
> like my APFS Mac returns them in a different order than JB’s Linux.But
> that also means if there is a Jenkins test failure or similar, I might not
> be able to reproduce it.   (Or a Windows person or even a Linux user using
> a different fs than Jenkins)   For most of the projects I use, we generally
> have “alphabetical” to make things completely
> predictable.   That said, by making things non-deterministic, it can find
> issues like this where tests aren’t cleaning themselves up correctly.
> Could do a runOrder=hourly to flip back and forth between alphabetical and
> reverse-alphabetical.  Predictable, but changes to detect issues.
>
> Thoughts?
>
>
> --
> Daniel Kulp
> dk...@apache.org - http://dankulp.com/blog
> Talend Community Coder - http://coders.talend.com
>
>


Re: Schema-Aware PCollections revisited

2018-01-29 Thread Romain Manni-Bucau
On Jan 30, 2018 at 01:09, "Reuven Lax" <re...@google.com> wrote:



On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> Hi
>
> I have some questions on this: how hierarchic schemas would work? Seems it
> is not really supported by the ecosystem (out of custom stuff) :(. How
> would it integrate smoothly with other generic record types - N bridges?
>

Do you mean nested schemas? What do you mean here?


Yes, sorry - wrote the mail too late ;). I meant hierarchical data and nested
schemas.


> Concretely I wonder if using json API couldnt be beneficial: json-p is a
> nice generic abstraction with a built in querying mecanism (jsonpointer)
> but no actual serialization (even if json and binary json are very
> natural). The big advantage is to have a well known ecosystem - who doesnt
> know json today? - that beam can reuse for free: JsonObject (guess we dont
> want JsonValue abstraction) for the record type, jsonschema standard for
> the schema, jsonpointer for the delection/projection etc... It doesnt
> enforce the actual serialization (json, smile, avro, ...) but provide an
> expressive and alread known API so i see it as a big win-win for users (no
> need to learn a new API and use N bridges in all ways) and beam (impls are
> here and API design already thought).
>

I assume you're talking about the API for setting schemas, not using them.
Json has many downsides and I'm not sure it's true that everyone knows it;
there are also competing schema APIs, such as Avro etc.. However I think we
should give Json a fair evaluation before dismissing it.


It is a wider topic than schemas. Schemas are actually not the first-class
citizen here; a generic data representation is. That is where JSON beats
almost any other API. Then, when it comes to schemas, JSON has a standard for
that, so we are all good.

Also, JSON has a good indexing API compared to alternatives which are
sometimes a bit faster - for no-op transforms - but are hardly usable or make
the code less readable.

Avro is a nice competitor, and it is compatible - Avro is actually
JSON-driven by design - but its API is far from being that easy due to its
schema enforcement, which is heavy; worse, you can't work with Avro without a
schema. JSON would allow reconciling the dynamic and static cases, since the
job wouldn't change except for the setSchema call.

That is why I think JSON is a good compromise: having a standard API for it
allows fully customizing the implementation at will if needed - even using
Avro or Protobuf.
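As a concrete sketch of the JSON-P usage being argued for here (this assumes a JSON-P 1.1 implementation such as Apache Johnzon or the Glassfish RI on the classpath; the record shape and class name are made up for illustration):

```java
import javax.json.Json;
import javax.json.JsonObject;
import javax.json.JsonPointer;

public class JsonRecordSketch {
    public static void main(String[] args) {
        // A generic, schema-free record built with the standard JSON-P API
        JsonObject record = Json.createObjectBuilder()
                .add("user", Json.createObjectBuilder()
                        .add("name", "alice")
                        .add("age", 30))
                .build();

        // JSON Pointer (RFC 6901) provides the standard selection/projection
        // mechanism mentioned above, without any bespoke schema machinery
        JsonPointer pointer = Json.createPointer("/user/name");
        System.out.println(pointer.getValue(record)); // the JSON value at /user/name
    }
}
```

The point of the sketch: the same JsonObject works whether or not a schema is attached, which is the dynamic/static reconciliation claimed in the paragraph above.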

Side note on the Beam API: I don't think it is good to use the main API for
runner optimizations. It forces something to be shared across all runners
without being widely usable, and it is misleading for users. Would you set a
Flink pipeline option with Dataflow? My proposal here is to use hints -
properties - instead of something loosely defined in the API, then
standardize it once all runners support it.



> Wdyt?
>
> On Jan 29, 2018 at 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
>
>> Hi Reuven,
>>
>> Thanks for the update ! As I'm working with you on this, I fully agree
>> and great
>> doc gathering the ideas.
>>
>> It's clearly something we have to add asap in Beam, because it would
>> allow new
>> use cases for our users (in a simple way) and open new areas for the
>> runners
>> (for instance dataframe support in the Spark runner).
>>
>> By the way, while ago, I created BEAM-3437 to track the PoC/PR around
>> this.
>>
>> Thanks !
>>
>> Regards
>> JB
>>
>> On 01/29/2018 02:08 AM, Reuven Lax wrote:
>> > Previously I submitted a proposal for adding schemas as a first-class
>> concept on
>> > Beam PCollections. The proposal engendered quite a bit of discussion
>> from the
>> > community - more discussion than I've seen from almost any of our
>> proposals to
>> > date!
>> >
>> > Based on the feedback and comments, I reworked the proposal document
>> quite a
>> > bit. It now talks more explicitly about the difference between dynamic
>> schemas
>> > (where the schema is not fully known at graph-creation time), and
>> static
>> > schemas (which are fully known at graph-creation time). Proposed APIs
>> are more
>> > fleshed out now (again thanks to feedback from community members), and
>> the
>> > document talks in more detail about evolving schemas in long-running
>> streaming
>> > pipelines.
>> >
>> > Please take a look. I think this will be very valuable to Beam, and
>> welcome any
>> > feedback.
>> >
>> > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ
>> 12pHGK0QIvXS1FOTgRc/edit#
>> >
>> > Reuven
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>


Re: Schema-Aware PCollections revisited

2018-01-29 Thread Romain Manni-Bucau
Hi

I have some questions on this: how would hierarchical schemas work? It seems
they are not really supported by the ecosystem (outside of custom stuff) :(.
How would it integrate smoothly with other generic record types - N bridges?

Concretely, I wonder whether using a JSON API couldn't be beneficial: JSON-P
is a nice generic abstraction with a built-in querying mechanism (JSON
Pointer) but no mandated serialization (even if JSON and binary JSON are very
natural fits). The big advantage is a well-known ecosystem - who doesn't know
JSON today? - that Beam can reuse for free: JsonObject (I guess we don't want
the JsonValue abstraction) for the record type, the JSON Schema standard for
the schema, JSON Pointer for selection/projection, etc. It doesn't enforce
the actual serialization (JSON, Smile, Avro, ...) but provides an expressive
and already known API, so I see it as a big win-win for users (no need to
learn a new API and use N bridges in all directions) and for Beam (the
implementations are here and the API design is already thought through).

Wdyt?

On Jan 29, 2018 at 06:24, "Jean-Baptiste Onofré"  wrote:

> Hi Reuven,
>
> Thanks for the update ! As I'm working with you on this, I fully agree and
> great
> doc gathering the ideas.
>
> It's clearly something we have to add asap in Beam, because it would allow
> new
> use cases for our users (in a simple way) and open new areas for the
> runners
> (for instance dataframe support in the Spark runner).
>
> By the way, while ago, I created BEAM-3437 to track the PoC/PR around this.
>
> Thanks !
>
> Regards
> JB
>
> On 01/29/2018 02:08 AM, Reuven Lax wrote:
> > Previously I submitted a proposal for adding schemas as a first-class
> concept on
> > Beam PCollections. The proposal engendered quite a bit of discussion
> from the
> > community - more discussion than I've seen from almost any of our
> proposals to
> > date!
> >
> > Based on the feedback and comments, I reworked the proposal document
> quite a
> > bit. It now talks more explicitly about the difference between dynamic
> schemas
> > (where the schema is not fully known at graph-creation time), and
> static
> > schemas (which are fully known at graph-creation time). Proposed APIs are
> more
> > fleshed out now (again thanks to feedback from community members), and
> the
> > document talks in more detail about evolving schemas in long-running
> streaming
> > pipelines.
> >
> > Please take a look. I think this will be very valuable to Beam, and
> welcome any
> > feedback.
> >
> > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHG
> K0QIvXS1FOTgRc/edit#
> >
> > Reuven
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [PROPOSAL] Add a blog post for every new release

2018-01-29 Thread Romain Manni-Bucau
+1 to have it as a best effort - most projects do. But as JB said, if it
slows down release motivation it shouldn't be enforced, just encouraged. A
good solution, Ismaël, would be for you to take this responsibility for the
coming releases once the release manager is done with the announcement.
This way we get the best of both worlds :).


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-29 15:02 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:

> Hi Ismaël
>
> The idea is good, but the post should be pretty short. Let me explain:
>
> - We will have a release every two months now, so, some releases might be
> lighter than others, and it's normal
> - the Jira Release Notes already provides lot of details
>
> For instance, in Apache projects like Karaf, Camel, and others, we do the
> announcement of a release on the mailing lists with the release notes
> linked.
> Sometime, we do a blog to highlight some interesting new features, but
> it's not
> systematic.
>
> So, I agree: it's a good idea and I would give some highlights about what
> we are
> doing and where we are heading. However, I don't think we have to
> "enforce" such
> blog post for every single release. It's a best effort.
>
> My $0.01 ;)
>
> Regards
> JB
>
> On 01/29/2018 02:47 PM, Ismaël Mejía wrote:
> > This is a fork of a recent message I sent as part of the preparations
> > for the next release.
> >
> > [tl;dr] I would like to propose that we create a new blog post for
> > every new release and that this becomes part of the release guide.
> >
> > I think that even if we do shorter releases we need to make this part
> > of the release process. We haven’t been really consistent about
> > communication on new releases in the past. Sometimes we did a blog
> > post and sometimes we didn’t.
> >
> > In particular I was a bit upset that we didn't do a blog post for the
> > last two releases, and the list of JIRA issues sadly does not cover
> > the importance of some of the features of those releases. I am still a
> > bit upset that we didn't publicly mentioned features like the SQL
> > extension, the recent IOs, the new FileIO related improvements and
> > Nexmark. Also I think the blog format is better for ‘marketing
> > reasons’ because not everybody reads the mailing list.
> >
> > Of course the only issue about this is to decide what to put in the
> > release notes and who will do it. We can do this by sharing a google
> > doc that everyone can edit to add their highlights and then reformat
> > it for blog publication, a bit similar to the format used by Gris for
> > the newsletter. Actually if we have paced releases probably we can mix
> > both the release notes and the newsletter into one, no ?
> >
> > What do you think? Other ideas/disagreement/etc.
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [HEADS UP] Preparing Beam 2.3.0

2018-01-28 Thread Romain Manni-Bucau
Will be online this afternoon. Ping me if you need help.

On Jan 28, 2018 at 11:36, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:

> By the way, I created BEAM-3551 for the tracking.
>
> I will work on it today.
>
> Regards
> JB
>
> On 01/28/2018 09:50 AM, Romain Manni-Bucau wrote:
> > Hi guys
> >
> > Out of curiosity, can -parameters (javac) be part of 2.3 if not
> already?
> >
> > Le 27 janv. 2018 18:39, "Jean-Baptiste Onofré" <j...@nanthrax.net
> > <mailto:j...@nanthrax.net>> a écrit :
> >
> > Hi Reuven,
> >
> > I gonna bump 3392 and 3087 to 2.4.0. For the PR, yes Eugene did a
> first round
> > review, I will work on it now.
> >
> > We will be pretty close !
> >
> > Thanks !
> > Regards
> > JB
> >
> > On 01/27/2018 05:58 PM, Reuven Lax wrote:
> > > Seems that 3392 is not a blocker, and neither is 3087. Looks like
> Eugene is
> > > already reviewing the PR form BEAM-793.
> > >
> > > Reuven
> > >
> > > On Sat, Jan 27, 2018 at 4:00 AM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> > <mailto:j...@nanthrax.net>
> > > <mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>>> wrote:
> > >
> > > Hi guys,
> > >
> > > we still have 7 Jira targeted to 2.3.0.
> > >
> > > For most of them, Ismaël and I are doing the PRs/fixes and we
> have
> > review in
> > > progress.
> > >
> > > I'm a little bit concerned by BEAM-3392: it's flagged as
> blocker but it's
> > > related to a specific branch. Can you please provide an update
> asap ?
> > >
> > > However, I didn't have any update for BEAM-3087 (related to
> the Flink
> > runner).
> > > Without update soon, I will bump to 2.4.0.
> > >
> > > I would need a review on PR for BEAM-793 (PR #4500). To avoid
> to break
> > anything
> > > for existing user, I set the backoff strategy optional, the
> user has to
> > > explicitly set to use it.
> > >
> > > I'm waiting a little more before cutting the release
> (especially for
> > BEAM-3392
> > > and BEAM-3087). However, I would like to cut the release asap.
> > >
> > > Thanks,
> > > Regards
> > > JB
> > >
> > >
> > > On 01/23/2018 10:39 AM, Jean-Baptiste Onofré wrote:
> > > > Hi guys,
> > > >
> > > > Some days ago, I proposed to start Beam 2.3.0 around January
> 26th.
> > So, we are
> > > > few days from this date.
> > > >
> > > > As a best effort, can you please in Jira flag the Jira with
> fix
> > version 2.3.0
> > > > and blocker for the release. Then, I will know when I can
> start the
> > > release process.
> > > >
> > > > Thanks !
> > > >
> > > > Regards
> > > > JB
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org <mailto:jbono...@apache.org>
> > <mailto:jbono...@apache.org <mailto:jbono...@apache.org>>
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org <mailto:jbono...@apache.org>
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [HEADS UP] Preparing Beam 2.3.0

2018-01-28 Thread Romain Manni-Bucau
Hi guys

Out of curiosity, can -parameters (javac) be part of 2.3 if it isn't already?

On Jan 27, 2018 at 18:39, "Jean-Baptiste Onofré"  wrote:

> Hi Reuven,
>
> I gonna bump 3392 and 3087 to 2.4.0. For the PR, yes Eugene did a first
> round
> review, I will work on it now.
>
> We will be pretty close !
>
> Thanks !
> Regards
> JB
>
> On 01/27/2018 05:58 PM, Reuven Lax wrote:
> > Seems that 3392 is not a blocker, and neither is 3087. Looks like Eugene
> is
> > already reviewing the PR form BEAM-793.
> >
> > Reuven
> >
> > On Sat, Jan 27, 2018 at 4:00 AM, Jean-Baptiste Onofré  > > wrote:
> >
> > Hi guys,
> >
> > we still have 7 Jira targeted to 2.3.0.
> >
> > For most of them, Ismaël and I are doing the PRs/fixes and we have
> review in
> > progress.
> >
> > I'm a little bit concerned by BEAM-3392: it's flagged as blocker but
> it's
> > related to a specific branch. Can you please provide an update asap ?
> >
> > However, I didn't have any update for BEAM-3087 (related to the
> Flink runner).
> > Without update soon, I will bump to 2.4.0.
> >
> > I would need a review on PR for BEAM-793 (PR #4500). To avoid to
> break anything
> > for existing user, I set the backoff strategy optional, the user has
> to
> > explicitly set to use it.
> >
> > I'm waiting a little more before cutting the release (especially for
> BEAM-3392
> > and BEAM-3087). However, I would like to cut the release asap.
> >
> > Thanks,
> > Regards
> > JB
> >
> >
> > On 01/23/2018 10:39 AM, Jean-Baptiste Onofré wrote:
> > > Hi guys,
> > >
> > > Some days ago, I proposed to start Beam 2.3.0 around January 26th.
> So, we are
> > > few days from this date.
> > >
> > > As a best effort, can you please in Jira flag the Jira with fix
> version 2.3.0
> > > and blocker for the release. Then, I will know when I can start the
> > release process.
> > >
> > > Thanks !
> > >
> > > Regards
> > > JB
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org 
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Great work closing PRs for 2.3.0 release

2018-01-26 Thread Romain Manni-Bucau
Cleanup done on my side - and thanks to JB for catching one overnight for me.
Thanks for the heads up; it seems it is easy to forget PRs on GitHub ;).


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-25 22:52 GMT+01:00 Kenneth Knowles <k...@google.com>:

> Nice! Back under 100.
>
> On Wed, Jan 24, 2018 at 4:57 PM, Lukasz Cwik <lc...@google.com> wrote:
>
>> I would like to give praise to the community for closing about 30 PRs in
>> the past couple of days for the 2.3.0 release.
>>
>
>


Re: Gradle / Mvn diff

2018-01-25 Thread Romain Manni-Bucau
Well, it is more about consistency and reliability than speed here. The
compilation result is just corrupted :(

On Jan 25, 2018 at 20:33, "Lukasz Cwik" <lc...@google.com> wrote:

> You can only get incremental support at the build system level, not at the
> individual tool level like javac. The task the represents compilation would
> need to be broken up into smaller tasks with smaller source sets to speed
> up compilation of really large modules.
>
> On Wed, Jan 24, 2018 at 11:12 PM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>> One pitfall I noted with javac wrappers is that if you dont clean - and
>> loose javac incremental support then deleted classes stay here and are
>> considered in the output. Common example is a nested class which has been
>> deleted leading to a corrupted enclosing class or a test which is ran but
>> deleted from java/. Any known way to protect us from it and keep the
>> uncremental support for big modules?
>>
>> Le 25 janv. 2018 01:22, "Lukasz Cwik" <lc...@google.com> a écrit :
>>
>>> Dependency driven works, incremental works for most java modules.
>>> I use incremental almost all the time and just do one validation pass at
>>> the end before opening the PR where I use '--rerun-tasks' to be sure.
>>> Allows me to iterate on a task in seconds.
>>>
>>> On Wed, Jan 24, 2018 at 4:07 PM, Kenneth Knowles <k...@google.com> wrote:
>>>
>>>> These are two different things: dependency-driven build (which works)
>>>> and incremental build (which seems not to, at least right now?).
>>>>
>>>> On Wed, Jan 24, 2018 at 2:24 PM, Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> Hmm, I'll try to refine it then next time we work with Ismael but can
>>>>> be a setup issue or a human (bad command) issue at the end. Thanks for the
>>>>> help, will make next iteration way easier probably :)
>>>>>
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>> <https://www.linkedin.com/in/rmannibucau>
>>>>>
>>>>> 2018-01-24 23:05 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>>>>>
>>>>>> Tasks always run any dependencies that are required. So if you ask to
>>>>>> run test it shouldn't run javadoc/checkstyle/... but should compile the
>>>>>> code and compile the code of all dependencies. test should never have a
>>>>>> dependency on checkstyle or javadoc or similar 'check' like tasks as they
>>>>>> shouldn't be needed.
>>>>>>
>>>>>> I set up the gradle build so that everytime you run a command in
>>>>>> gradle, it generates a task dependency tree dot file (look for visteg.dot
>>>>>> inside build/reports). I uploaded this one to imgur[1] for the
>>>>>> ':sdks:java:core:build' task to show what tasks are required. Note that
>>>>>> 'sdks:java:core:test' doesn't depend on checkstyle or spotless.
>>>>>>
>>>>>> 1: https://imgur.com/a/ZvYUX
>>>>>>
>>>>>> On Wed, Jan 24, 2018 at 12:50 PM, Romain Manni-Bucau <
>>>>>> rmannibu...@gmail.com> wrote:
>>>>>>
>>>>>>> Hmm, do I miss something or it only works for iterative runs when
>>>>>>> trying to identify an issue and not for the case you rebuild due to code
>>>>>>> changes (where you would need like 5-6 tasks at least - generate, 
>>>>>>> compile,
>>>>>>> test, ...)?
>>>>>>>
>>>>>>> In case it is unclear: there are 2 needs: direct execution/task ->
>>>>>>> fulfilled and clarified now (just a doc issue I think), fast cycle 
>>>>>>> skipping
>>>>>>> not mandatory tasks like style related ones
>>>>>>>
>>>>>>>
>>>>>>> Romain Manni-Bucau
>>>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>>> <http://rmannibucau.wordpress.com> | Github
>

Re: Gradle / Mvn diff

2018-01-24 Thread Romain Manni-Bucau
One pitfall I noted with javac wrappers is that if you don't clean - and lose
javac incremental support - then deleted classes stay there and are still
considered part of the output. A common example is a nested class which has
been deleted, leading to a corrupted enclosing class, or a test which is run
even though it was deleted from java/. Any known way to protect us from this
while keeping incremental support for big modules?
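A sketch of the usual workaround for this stale-output pitfall (the task names come from the Gradle task list quoted later in this thread; treat the exact module path as an example):

```shell
# After deleting or renaming classes, wipe the stale outputs once so
# orphaned .class files cannot corrupt the result...
./gradlew :sdks:java:core:clean

# ...then resume fast incremental runs from a consistent baseline
./gradlew :sdks:java:core:test
```

The trade-off stands as described: the one-off clean costs the incremental cache for that module, but only for the first rebuild afterwards.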

On Jan 25, 2018 at 01:22, "Lukasz Cwik" <lc...@google.com> wrote:

> Dependency driven works, incremental works for most java modules.
> I use incremental almost all the time and just do one validation pass at
> the end before opening the PR where I use '--rerun-tasks' to be sure.
> Allows me to iterate on a task in seconds.
>
> On Wed, Jan 24, 2018 at 4:07 PM, Kenneth Knowles <k...@google.com> wrote:
>
>> These are two different things: dependency-driven build (which works) and
>> incremental build (which seems not to, at least right now?).
>>
>> On Wed, Jan 24, 2018 at 2:24 PM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> Hmm, I'll try to refine it then next time we work with Ismael but can be
>>> a setup issue or a human (bad command) issue at the end. Thanks for the
>>> help, will make next iteration way easier probably :)
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau>
>>>
>>> 2018-01-24 23:05 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>>>
>>>> Tasks always run any dependencies that are required. So if you ask to
>>>> run test it shouldn't run javadoc/checkstyle/... but should compile the
>>>> code and compile the code of all dependencies. test should never have a
>>>> dependency on checkstyle or javadoc or similar 'check' like tasks as they
>>>> shouldn't be needed.
>>>>
>>>> I set up the gradle build so that everytime you run a command in
>>>> gradle, it generates a task dependency tree dot file (look for visteg.dot
>>>> inside build/reports). I uploaded this one to imgur[1] for the
>>>> ':sdks:java:core:build' task to show what tasks are required. Note that
>>>> 'sdks:java:core:test' doesn't depend on checkstyle or spotless.
>>>>
>>>> 1: https://imgur.com/a/ZvYUX
>>>>
>>>> On Wed, Jan 24, 2018 at 12:50 PM, Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> Hmm, do I miss something or it only works for iterative runs when
>>>>> trying to identify an issue and not for the case you rebuild due to code
>>>>> changes (where you would need like 5-6 tasks at least - generate, compile,
>>>>> test, ...)?
>>>>>
>>>>> In case it is unclear: there are 2 needs: direct execution/task ->
>>>>> fulfilled and clarified now (just a doc issue I think), fast cycle 
>>>>> skipping
>>>>> not mandatory tasks like style related ones
>>>>>
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>> <https://www.linkedin.com/in/rmannibucau>
>>>>>
>>>>> 2018-01-24 19:50 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>>>>>
>>>>>> Gradle already has each task explicitly broken out. Kenn is pointing
>>>>>> out that your development use case shouldn't use the './gradlew
>>>>>> :sdks:java:core:build' task since it is really an aggregator that
>>>>>> represents do everything within that project. This is the current list of
>>>>>> tasks available for :sdks:java:core:
>>>>>> :sdks:java:core:assemble  - Assembles the outputs of this project.
>>>>>> :sdks:java:core:build  - Assembles and tests this project.
>>>>>> :sdks:java:core:buildDependents  - Assembles and tests this project
>>>>>> and all projects that depend on it.
>>>>>> :sdks:java:core:buildEnvironment  - Displays all buildscript
>>>>>>

Re: Gradle / Mvn diff

2018-01-24 Thread Romain Manni-Bucau
Hmm, I'll try to refine it next time we work on it with Ismaël, but it may be
a setup issue or a human (bad command) issue in the end. Thanks for the help;
it will probably make the next iteration way easier :)


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-24 23:05 GMT+01:00 Lukasz Cwik <lc...@google.com>:

> Tasks always run any dependencies that are required. So if you ask to run
> test it shouldn't run javadoc/checkstyle/... but should compile the code
> and compile the code of all dependencies. test should never have a
> dependency on checkstyle or javadoc or similar 'check' like tasks as they
> shouldn't be needed.
>
> I set up the gradle build so that everytime you run a command in gradle,
> it generates a task dependency tree dot file (look for visteg.dot inside
> build/reports). I uploaded this one to imgur[1] for the
> ':sdks:java:core:build' task to show what tasks are required. Note that
> 'sdks:java:core:test' doesn't depend on checkstyle or spotless.
>
> 1: https://imgur.com/a/ZvYUX
>
> On Wed, Jan 24, 2018 at 12:50 PM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>> Hmm, do I miss something or it only works for iterative runs when trying
>> to identify an issue and not for the case you rebuild due to code changes
>> (where you would need like 5-6 tasks at least - generate, compile, test,
>> ...)?
>>
>> In case it is unclear: there are 2 needs: direct execution/task ->
>> fulfilled and clarified now (just a doc issue I think), fast cycle skipping
>> not mandatory tasks like style related ones
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau>
>>
>> 2018-01-24 19:50 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>>
>>> Gradle already has each task explicitly broken out. Kenn is pointing out
>>> that your development use case shouldn't use the './gradlew
>>> :sdks:java:core:build' task since it is really an aggregator that
>>> represents do everything within that project. This is the current list of
>>> tasks available for :sdks:java:core:
>>> :sdks:java:core:assemble  - Assembles the outputs of this project.
>>> :sdks:java:core:build  - Assembles and tests this project.
>>> :sdks:java:core:buildDependents  - Assembles and tests this project and
>>> all projects that depend on it.
>>> :sdks:java:core:buildEnvironment  - Displays all buildscript
>>> dependencies declared in project :sdks:java:core.
>>> :sdks:java:core:buildNeeded  - Assembles and tests this project and all
>>> projects it depends on.
>>> :sdks:java:core:check  - Runs all checks.
>>> :sdks:java:core:checkstyleMain  - Run Checkstyle analysis for main
>>> classes
>>> :sdks:java:core:checkstyleTest  - Run Checkstyle analysis for test
>>> classes
>>> :sdks:java:core:classes  - Assembles main classes.
>>> :sdks:java:core:clean  - Deletes the build directory.
>>> :sdks:java:core:compileJava  - Compiles main Java source.
>>> :sdks:java:core:compileTestJava  - Compiles test Java source.
>>> :sdks:java:core:components  - Displays the components produced by
>>> project :sdks:java:core. [incubating]
>>> :sdks:java:core:dependencies  - Displays all dependencies declared in
>>> project :sdks:java:core.
>>> :sdks:java:core:dependencyInsight  - Displays the insight into a
>>> specific dependency in project :sdks:java:core.
>>> :sdks:java:core:dependencyReport  - Generates a report about your
>>> library dependencies.
>>> :sdks:java:core:dependentComponents  - Displays the dependent
>>> components of components in project :sdks:java:core. [incubating]
>>> :sdks:java:core:findbugsMain  - Run FindBugs analysis for main classes
>>> :sdks:java:core:findbugsTest  - Run FindBugs analysis for test classes
>>> :sdks:java:core:generateAvroJava  - Generates main Avro Java source
>>> files from schema/protocol definition files.
>>> :sdks:java:core:generateAvroProtocol  - Generates main Avro protocol
>>> definition files from IDL files.
>>> :sdks:java:core:generateTestAvroJava  - Generates test Avro Java source
>>> files from sche

Re: Gradle / Mvn diff

2018-01-24 Thread Romain Manni-Bucau
Hmm, do I miss something, or does it only work for iterative runs when trying
to identify an issue, and not for the case where you rebuild due to code
changes (where you would need 5-6 tasks at least - generate, compile, test,
...)?

In case it is unclear, there are two needs: direct execution of a task -
fulfilled and clarified now (just a doc issue I think) - and a fast cycle
skipping non-mandatory tasks like the style-related ones.
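The two workflows can be sketched as follows (`-x`/`--exclude-task` and `--rerun-tasks` are standard Gradle flags; the module and task names are the ones from the task list quoted below, used here as examples):

```shell
# Need 1: direct execution of a single task; Gradle pulls in its
# required dependencies (generate, compile, ...) automatically
./gradlew :sdks:java:core:test

# Need 2: fast cycle that skips non-mandatory style-related tasks
./gradlew :sdks:java:core:build \
    -x checkstyleMain -x checkstyleTest \
    -x findbugsMain -x findbugsTest

# Final validation pass before opening a PR, forcing everything to re-run
./gradlew :sdks:java:core:build --rerun-tasks
```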


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-24 19:50 GMT+01:00 Lukasz Cwik <lc...@google.com>:

> Gradle already has each task explicitly broken out. Kenn is pointing out
> that your development use case shouldn't use the './gradlew
> :sdks:java:core:build' task, since it is really an aggregator that
> represents doing everything within that project. This is the current list of
> tasks available for :sdks:java:core:
> :sdks:java:core:assemble  - Assembles the outputs of this project.
> :sdks:java:core:build  - Assembles and tests this project.
> :sdks:java:core:buildDependents  - Assembles and tests this project and
> all projects that depend on it.
> :sdks:java:core:buildEnvironment  - Displays all buildscript dependencies
> declared in project :sdks:java:core.
> :sdks:java:core:buildNeeded  - Assembles and tests this project and all
> projects it depends on.
> :sdks:java:core:check  - Runs all checks.
> :sdks:java:core:checkstyleMain  - Run Checkstyle analysis for main classes
> :sdks:java:core:checkstyleTest  - Run Checkstyle analysis for test classes
> :sdks:java:core:classes  - Assembles main classes.
> :sdks:java:core:clean  - Deletes the build directory.
> :sdks:java:core:compileJava  - Compiles main Java source.
> :sdks:java:core:compileTestJava  - Compiles test Java source.
> :sdks:java:core:components  - Displays the components produced by project
> :sdks:java:core. [incubating]
> :sdks:java:core:dependencies  - Displays all dependencies declared in
> project :sdks:java:core.
> :sdks:java:core:dependencyInsight  - Displays the insight into a specific
> dependency in project :sdks:java:core.
> :sdks:java:core:dependencyReport  - Generates a report about your library
> dependencies.
> :sdks:java:core:dependentComponents  - Displays the dependent components
> of components in project :sdks:java:core. [incubating]
> :sdks:java:core:findbugsMain  - Run FindBugs analysis for main classes
> :sdks:java:core:findbugsTest  - Run FindBugs analysis for test classes
> :sdks:java:core:generateAvroJava  - Generates main Avro Java source files
> from schema/protocol definition files.
> :sdks:java:core:generateAvroProtocol  - Generates main Avro protocol
> definition files from IDL files.
> :sdks:java:core:generateTestAvroJava  - Generates test Avro Java source
> files from schema/protocol definition files.
> :sdks:java:core:generateTestAvroProtocol  - Generates test Avro protocol
> definition files from IDL files.
> :sdks:java:core:help  - Displays a help message.
> :sdks:java:core:htmlDependencyReport  - Generates an HTML report about
> your library dependencies.
> :sdks:java:core:install  - Installs the archives artifacts into the local
> Maven repository.
> :sdks:java:core:jacocoTestCoverageVerification  - Verifies code coverage
> metrics based on specified rules for the test task.
> :sdks:java:core:jacocoTestReport  - Generates code coverage report for
> the test task.
> :sdks:java:core:jar  - Assembles a jar archive containing the main classes.
> :sdks:java:core:javadoc  - Generates Javadoc API documentation for the
> main source code.
> :sdks:java:core:knows  - Do you know who knows?
> :sdks:java:core:model  - Displays the configuration model of project
> :sdks:java:core. [incubating]
> :sdks:java:core:packageTests  -
> :sdks:java:core:processResources  - Processes main resources.
> :sdks:java:core:processTestResources  - Processes test resources.
> :sdks:java:core:projectReport  - Generates a report about your project.
> :sdks:java:core:projects  - Displays the sub-projects of project
> :sdks:java:core.
> :sdks:java:core:properties  - Displays the properties of project
> :sdks:java:core.
> :sdks:java:core:propertyReport  - Generates a report about your
> properties.
> :sdks:java:core:shadowJar  - Create a combined JAR of project and runtime
> dependencies
> :sdks:java:core:shadowTestJar  -
> :sdks:java:core:spotlessApply  - Applies code formatting steps to
> sourcecode in-place.
> :sdks:java:core:spotlessCheck  - Checks that sourcecode satisfies
> formatting steps.
> :sdks:java:core:spotlessJava  -
> :sdks:java:core:spotlessJa

Re: Gradle / Mvn diff

2018-01-23 Thread Romain Manni-Bucau
I may have misread the gradle setup, but I don't think we are there yet: if
you are not bound to a single module and work across two modules, iterating
between working on both and on one, you will likely not bypass the "checks" in
a satisfying way without a long -x command. Is there a magic flag I missed?
Also not sure about the last point and how gradle helps here - it is rather
the opposite, due to the way it loads its model IMHO - so I am not sure what
the consequence would be in terms of action(s), but I may have missed the point.



Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-24 0:20 GMT+01:00 Kenneth Knowles <k...@google.com>:

> On Tue, Jan 23, 2018 at 2:51 PM, Romain Manni-Bucau <rmannibu...@gmail.com
> > wrote:
>
> Hmm, did you read it right Kenn? I think the idea was to skip all
>> validation/sanity checks tasks at once (gradle  -Pfast) instead of
>> doing it manually (gradle -x findbugs -x checkstyle etc...)
>>
>
> Yes, I read it right. We all want the same thing - not doing a bunch of
> extra useless unrequested stuff when developing. The concept of skipping is
> backwards. We don't need configs that skip things, because in a correct
> dependency-driven build they are already not running.
>
> So since I don't want to pretend to know gradle's invocations yet I will
> call it $TOOL. Here's a collection of imaginary commands:
>
> $TOOL :sdks:java:core:unittest  # or $TOOL test :sdks:java:core or
> whatever
> $TOOL :sdks:java:core:findbugs
> $TOOL :sdks:java:core:checkstyle
> $TOOL :sdks:java:core:javadoc
>
> None of these causes any of the others to run. Anything else is a bug. The
> `findbugs` and `test` cause a build of the needed jars and nothing else.
>
> Another example:
>
> $TOOL :runners:core-java:unittest
>
> This builds the model, the core SDK, and the runners-core module, then
> runs the unit tests of the runners-core module. It does not test SDK core,
> or run any javadoc, findbugs, or checkstyle on any module. Anything else is
> a bug.
>
> Now, to build a precommit that is easy to reproduce on one line, you could
> build a compound task
>
> $TOOL :sdks:java:core:precommit  # runs a selection of targets that we
> define
>
> At this point you might want to skip things from the :verify task here.
> But really, you probably just want to run the things you are interested in
> and you don't need custom hooks in the aggregated task.
>
> My understanding is that gradle can support all of this, if we are
> disciplined. Getting to this point is the main/only reason I supported
> gradle.
>
> Kenn
>
>
>
>>
>>
>
>
>>
>>> Kenn
>>>
>>>
>>>
>>>> >>
>>>> >> diff --git a/examples/java/build.gradle b/examples/java/build.gradle
>>>> >> index 0fc0b17df..001bd8b38 100644
>>>> >> --- a/examples/java/build.gradle
>>>> >> +++ b/examples/java/build.gradle
>>>> >> @@ -130,7 +130,7 @@ def preCommitAdditionalFlags = [
>>>> >>dataflowStreamingRunner: [ "--streaming=true" ],
>>>> >>  ]
>>>> >>  for (String runner : preCommitRunners) {
>>>> >> -  tasks.create(name: runner + "PreCommit", type: Test) {
>>>> >> +  tasks.create(name: runner + "PreCommit", type: Test, description:
>>>> "Run tests
>>>> >> for runner ${runner.replace('Runner', '')}") {
>>>> >>  def preCommitBeamTestPipelineOptions = [
>>>> >> "--project=apache-beam-testing",
>>>> >> "--tempRoot=gs://temp-storage-for-end-to-end-tests",
>>>> >>
>>>> >>
>>>> >>
>>>> >> Romain Manni-Bucau
>>>> >> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>> >> <https://rmannibucau.metawerx.net/> | Old Blog
>>>> >> <http://rmannibucau.wordpress.com> | Github <
>>>> https://github.com/rmannibucau> |
>>>> >> LinkedIn <https://www.linkedin.com/in/rmannibucau>
>>>> >>
>>>> >> 2018-01-23 8:45 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net
>>>> >> <mailto:j...@nanthrax.net>>:
>>>> >>
>>>> >> Hi Romain,
>>>> >>

Re: Gradle / Mvn diff

2018-01-23 Thread Romain Manni-Bucau
2018-01-24 0:08 GMT+01:00 Lukasz Cwik <lc...@google.com>:

>
>
> On Tue, Jan 23, 2018 at 2:51 PM, Romain Manni-Bucau <rmannibu...@gmail.com
> > wrote:
>
>>
>>
>> 2018-01-23 23:44 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>
>>> Romain - really good point about docs getting out of date.
>>>
>>> > On 01/23/2018 09:29 AM, Romain Manni-Bucau wrote:
>>>> >> Yep,
>>>> >>
>>>> >> a compromise can be to ensure all custom tasks have a description
>>>> maybe:
>>>>
>>>
>>> This is a great idea.
>>>
>>> Let's do both! I think the thing that the comments in the gradle code
>>> cannot capture are the ways that you might combine them, like the way you
>>> override properties, etc.
>>>
>>> On Tue, Jan 23, 2018 at 9:13 AM, Ismaël Mejía <ieme...@gmail.com> wrote:
>>>
>>>> I used to test kafka
>>>> with a different version of the client quickly by doing this with
>>>> maven:
>>>>
>>>> mvn verify -Prelease -Dkafka.clients.version=0.10.0.1 -pl
>>>> 'sdks/java/io/kafka'
>>>>
>>>> but I don't have any idea of how to do this on Gradle.
>>>>
>>>
>>> Did you figure this out. Luke - can you suggest something?
>>>
>>>
> Some ideas:
> * We can define a static set of properties like we did in maven in the
> gradle.properties which allow users to override them.
>

Isn't it only for the daemon start, and therefore not that usable when you
run it to validate that you didn't break compatibility with previous versions?


> * We can just dynamically set versions of libraries by adding a function
> to build_rules.gradle which checks to see if there is a property defined
> for that library and use it. Many ways to map names from just the library
> name, or library name + group automatically. This way we don't have to
> maintain a large list of properties and each library can be overridden from
> the command line.
>

+1, kind of maven -D becoming gradle -P so quite natural IMHO
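Lukasz's second idea could be sketched roughly as follows in build_rules.gradle. This is a hypothetical sketch only: the `libraryVersion` helper, the `<name>.version` property convention, and the version numbers are made up for illustration, not part of the actual Beam build.

```groovy
// Hypothetical helper: resolve a library version, letting a command-line
// project property override the default, e.g.
//   ./gradlew :sdks:java:io:kafka:build -Pkafka.clients.version=0.10.0.1
def libraryVersion = { String name, String defaultVersion ->
  def key = "${name}.version"
  project.hasProperty(key) ? project.property(key) : defaultVersion
}

dependencies {
  // Uses the default unless -Pkafka.clients.version=... is passed on the
  // command line; no central property list needs to be maintained.
  compile "org.apache.kafka:kafka-clients:${libraryVersion('kafka.clients', '1.0.0')}"
}
```

This mirrors the maven `-Dkafka.clients.version=...` workflow described earlier in the thread, mapping `-D` to gradle's `-P` project properties.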


> * If we defined everything in the root build.gradle then you can edit the
> single line related to the property version.
> * Add a special property which is a list of overrides you want for library
> dependencies/versions and have it applied in build_rules.gradle.
>

Can be needed as well for "linked" dependencies, but more as a nice-to-have
than as a blocker IMHO


>
>
>> Of course everybody has a different workflow, but I am pretty sure
>>>> there are some common tasks that we can document for people (like me)
>>>> that are new to gradle.
>>>
>>>
>>> Yea, actually, it never even occurred to me that you would use the
>>> command line to test against other Kafka versions :-).
>>>
>>> I have private gists with lots of maven invocations for doing things
>>> like running ValidatesRunner or example ITs on just a particular runner. It
>>> always requires multiple maven commands that are both multiple lines. I was
>>> halfway for a while but I am now all gradle. I will start building the same
>>> thing.
>>>
>>>
>>>> - How to skip the Java/Python/Go build depending on your priorities.
>>>>
>>>
>>> IMO they should NOT run by default. Everything should be
>>> dependency-driven. When I ask to run Java SDK tests, or Java examples IT on
>>> the Flink Runner, it is *incorrect* for any Python or Go builds to run.
>>>
>>
> This is already the case. Only if you do './gradlew build' will everything
> get built, unless you're specifically saying build only Go code, which is
> vague because we have some cross-language dependencies where Go requires
> Java to build. Best to stick with Gradle's buildNeeded and buildDependents.
>
>
>>
>>>
>>>> - How to run an individual IntegrationTest or ValidatesRunner test.
>>>>
>>>
>>> Yes, #1 use case
>>>
>>>
>>>> - How to skip findbugs, checkstyle, javadoc generation, etc., to have an
>>>> ultra-quick build.
>>>>
>>>
>>> Again, I think they should be independent tasks. I should be able to run
>>> *any* of them without running *any* of the others. It is incorrect for
>>> anything to cause any other thing to run if it does not directly require
>>> its outputs.
>>>
>>> There can be an aggregated "verify" command but I will actually very
>>> rarely run that until I am done with a large chunk of work.

Re: Gradle / Mvn diff

2018-01-23 Thread Romain Manni-Bucau
2018-01-23 23:44 GMT+01:00 Kenneth Knowles <k...@google.com>:

> Romain - really good point about docs getting out of date.
>
> > On 01/23/2018 09:29 AM, Romain Manni-Bucau wrote:
>> >> Yep,
>> >>
>> >> a compromise can be to ensure all custom tasks have a description
>> maybe:
>>
>
> This is a great idea.
>
> Let's do both! I think the thing that the comments in the gradle code
> cannot capture are the ways that you might combine them, like the way you
> override properties, etc.
>
> On Tue, Jan 23, 2018 at 9:13 AM, Ismaël Mejía <ieme...@gmail.com> wrote:
>
>> I used to test kafka
>> with a different version of the client quickly by doing this with
>> maven:
>>
>> mvn verify -Prelease -Dkafka.clients.version=0.10.0.1 -pl
>> 'sdks/java/io/kafka'
>>
>> but I don't have any idea of how to do this on Gradle.
>>
>
> Did you figure this out. Luke - can you suggest something?
>
>
> Of course everybody has a different workflow, but I am pretty sure
>> there are some common tasks that we can document for people (like me)
>> that are new to gradle.
>
>
> Yea, actually, it never even occurred to me that you would use the command
> line to test against other Kafka versions :-).
>
> I have private gists with lots of maven invocations for doing things like
> running ValidatesRunner or example ITs on just a particular runner. It
> always requires multiple maven commands that are both multiple lines. I was
> halfway for a while but I am now all gradle. I will start building the same
> thing.
>
>
>> - How to skip the Java/Python/Go build depending on your priorities.
>>
>
> IMO they should NOT run by default. Everything should be
> dependency-driven. When I ask to run Java SDK tests, or Java examples IT on
> the Flink Runner, it is *incorrect* for any Python or Go builds to run.
>
>
>> - How to run an individual IntegrationTest or ValidatesRunner test.
>>
>
> Yes, #1 use case
>
>
>> - How to skip findbugs, checkstyle, javadoc generation, etc., to have an
>> ultra-quick build.
>>
>
> Again, I think they should be independent tasks. I should be able to run
> *any* of them without running *any* of the others. It is incorrect for
> anything to cause any other thing to run if it does not directly require
> its outputs.
>
> There can be an aggregated "verify" command but I will actually very
> rarely run that until I am done with a large chunk of work.
>
> As for what the aggregated "verify" command does, people kept arguing
> about what to make default. As long as we have a correct build for
> individual checks (aka not running extra things) then I am happy for the
> default to be long and slow, but we should still build profiles for both.
>

Hmm, did you read it right Kenn? I think the idea was to skip all
validation/sanity checks tasks at once (gradle  -Pfast) instead of
doing it manually (gradle -x findbugs -x checkstyle etc...)


>
> Kenn
>
>
>
>> >>
>> >> diff --git a/examples/java/build.gradle b/examples/java/build.gradle
>> >> index 0fc0b17df..001bd8b38 100644
>> >> --- a/examples/java/build.gradle
>> >> +++ b/examples/java/build.gradle
>> >> @@ -130,7 +130,7 @@ def preCommitAdditionalFlags = [
>> >>dataflowStreamingRunner: [ "--streaming=true" ],
>> >>  ]
>> >>  for (String runner : preCommitRunners) {
>> >> -  tasks.create(name: runner + "PreCommit", type: Test) {
>> >> +  tasks.create(name: runner + "PreCommit", type: Test, description:
>> "Run tests
>> >> for runner ${runner.replace('Runner', '')}") {
>> >>  def preCommitBeamTestPipelineOptions = [
>> >> "--project=apache-beam-testing",
>> >> "--tempRoot=gs://temp-storage-for-end-to-end-tests",
>> >>
>> >>
>> >>
>> >> Romain Manni-Bucau
>> >> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> >> <https://rmannibucau.metawerx.net/> | Old Blog
>> >> <http://rmannibucau.wordpress.com> | Github <
>> https://github.com/rmannibucau> |
>> >> LinkedIn <https://www.linkedin.com/in/rmannibucau>
>> >>
>> >> 2018-01-23 8:45 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net
>> >> <mailto:j...@nanthrax.net>>:
>> >>
>> >> Hi Romain,
>> >>
>> >> I think we are pretty close: agree to add some explicit tasks & projects names.

Re: [DISCUSS] State of the project: Culture and governance

2018-01-23 Thread Romain Manni-Bucau
Hi Ismael,

The more policies and rules you add around a project, the more energy you
need to make them respected and enforced. At that stage you need to ask
yourself whether it is worth it.

I'm not sure it is for Beam, and even if you can sometimes find "picky"
comments on PRs (and believe me, I have thought so more than once ;)), it is
not a bad community and people are quite nice. Using GitHub is a big boost
to help people open PRs without having to read a doc (this is key for
contributions IMHO), so the best is probably to manage to review faster if
possible and to be lighter in terms of review, even if it requires a core dev
commit after the merge IMHO (an "as long as it doesn't break anything and
brings something, it is good to merge" kind of rule).



Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-23 17:56 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:

> Hi,
>
> I would like to remind: we are an Apache project, not an isolated one.
> As an Apache member, it's really important to me.
>
> 1. The code of conduct is the one from Apache.
>
> 2. If it's not happen on the mailing list, it doesn't exist. That's the
> Apache
> rule. We already discussed about that in the past: having wiki or Google
> doc is
> not a problem as soon as a summary is sent on the mailing list.
>
> I don't see why and where Beam could be different from the other Apache
> projects
> for the first three points.
>
> A valid point is about contribution policies around CI and review. I
> disagree
> about publishing the criteria to earn committership, and even more for
> PMC. As
> already said, a contribution can have many forms, so, criteria in term of
> number
> can be inaccurate.
>
> As these subjects can be sensible, I would also prefer to discuss on the
> private
> mailing list first (to get agreement between the PMC members) before
> publishing
> publicly on the dev mailing list.
>
> My $0.01
>
> Regards
> JB
>
> On 01/23/2018 05:43 PM, Ismaël Mejía wrote:
> > This is a sub-thread of the state of the project one initiated by
> > Davor. Since this subject can be part of the community issues I would
> > like to focus on the state of the project for its contributors so we
> > don’t mix the discussion with the end-user thread.
> >
> > I hope other members of the community bring ideas or issues that we
> > have/can improve to make contribution to this project easier and
> > welcoming. I consider that this is a really important area, we need to
> > guarantee that we have a sane culture for the project, where we
> > respect contributors and anyone can feel safe to ask questions,
> > propose ideas and contribute. We have done a good job until now but of
> > course things can be still improved.
> >
> > Some ideas:
> >
> > * Code of conduct
> >
> > We don’t have a code of conduct, most communities deal with this only
> > when problems arise. I think we should discuss this in advance, we can
> > maybe write ours, or adopt one existing like the ASF one. It is
> > essential that if we accept one code of conduct we really do take it
> > into account, and respect it during all our community interactions,
> > and apply actions in the cases when someone doesn’t.
> > https://www.apache.org/foundation/policies/conduct.html
> >
> > * Proposal process
> >
> > So far we have a somehow loose but effective process with documents
> > shared on google docs, and further discussion in the mailing list, we
> > should formalize a bit more or finish the work on BEAM-566. Some
> > guidelines on blockers for new proposals should be specified, e.g.
> > backwards compatibility, etc. And most of this documents will better
> > end as part of the website or some wiki for historical preservation.
> >
> > * Governance model
> >
> > Our governance model is of course based on Apache’s meritocracy one,
> > we should encourage this, and always be aligned with the ASF policies,
> > but also we need better criteria for consensus in technical decisions,
> > so far the vote system has been a way to reach consensus but we have
> > to find better ways to balance situations that can seem arbitrary or
> > where technical decisions have to be made even with a lack of
> > consensus, transparency and clear communication are key to avoid
> > frustration.
> >
> > * Contribution policies
> >
> > So far we as a community have been welcoming on new contributions, but
> > som

Re: JUnit 5 review

2018-01-23 Thread Romain Manni-Bucau
Hi Kenneth,

The issue is that JUnit 5 is not JUnit 4 + 1; it is another project (like
TestNG is), so migration is not even an option. My goal in the PR was to
enable people using JUnit 5 as a base framework to test Beam pipelines and to
let them reuse all their tooling goodness, like this extension
https://talend.github.io/component-runtime/documentation-testing.html#_configuring_environments
or any extension allowing a server to be started before running tests. I tried
to minimize the changes to keep it very familiar for a JUnit 4 user.

In terms of usage you have an example at
https://github.com/rmannibucau/incubator-beam/blob/e61c650d57738206f3d02d949181074afccb87bb/sdks/java/core/src/test/java/org/apache/beam/sdk/testing/junit5/WithApacheBeamCustomRunnerTest.java#L42

It is aligned with the way JUnit 5 designs its extensions. The rule equivalent
was just introduced in JUnit this week, but it reuses this extension
mechanism; it just allows instantiating the extension manually in a static
field instead of using the meta-annotation, which keeps the extension of this
PR valuable.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-23 16:52 GMT+01:00 Kenneth Knowles <k...@google.com>:

> As open source, IMO it is fine to do something just because you are
> interested, as long as it works in the interest of the project. I'm not
> opposed, but there isn't enough information yet.
>
> I would like to see a design document about the differences between JUnit
> 4 and 5 and how that will affect Beam (examples: @Rule and @Runner changes)
> and maybe some information about how JUnit 5 is being received by other
> projects. Generally, mentioned also on the "automatic parameters for IOs"
> thread, sizable changes with implications for the project should be
> preceded by design documents to gather feedback from the community.
>
> Incidentally, scanning the PR, I see things that looks like they aren't
> just the JUnit 4 to 5 migration. You should narrow the focus to just the
> migration.
>
> Kenn
>
> On Tue, Jan 23, 2018 at 1:41 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Great, thanks !
>>
>> We will resume our review here once Beam 2.3.0 is out.
>>
>> Regards
>> JB
>>
>> On 01/23/2018 10:28 AM, Romain Manni-Bucau wrote:
>> > Oki JB,
>> >
>> > Will implement it on my side until beam supports it then.
>> >
>> > Thanks for the feedback.
>> >
>> >
>> > Romain Manni-Bucau
>> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> > <https://rmannibucau.metawerx.net/> | Old Blog
>> > <http://rmannibucau.wordpress.com> | Github <
>> https://github.com/rmannibucau> |
>> > LinkedIn <https://www.linkedin.com/in/rmannibucau>
>> >
>> > 2018-01-23 10:24 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net
>> > <mailto:j...@nanthrax.net>>:
>> >
>> > Hi Romain,
>> >
>> > Definitely it's not something targeted for Beam 2.3.0.
>> >
>> > It's interesting, but it sounds a bit like a lonesome cowboy effort.
>> >
>> > I think it would have been great to discuss a bit in term of
>> priority (on the
>> > mailing list) before rushing on the PR. Couple of highlights in the
>> Jira or PR
>> > would be appreciated too.
>> >
>> > So, please, keep the PR open, I will take a look asap.
>> >
>> > Regards
>> > JB
>> >
>> > On 01/23/2018 09:40 AM, Romain Manni-Bucau wrote:
>> > > Hi guys,
>> > >
>> > > Anyone able to have a look to the JUnit 5 PR
>> > > (https://github.com/apache/beam/pull/4360
>> > <https://github.com/apache/beam/pull/4360>)?
>> > >
>> > > Worse case a "yes we'll move this direction" or "no we don't care
>> about JUnit 5
>> > > for now" feedback would be very valuable for me.
>> > >
>> > > Thanks,
>> > > Romain Manni-Bucau
>> > > @rmannibucau <https://twitter.com/rmannibucau
>> > <https://twitter.com/rmannibucau>> |  Blog
>> > > <https://rmannibucau.metawerx.net/ <https://rmannibucau.metawerx.
>> net/>> |
>> > Old Blog
>> > > <http://rmannibucau.wordpress.com <http://rmannibucau.wordpress.
>> com>>
>> > | Github <https://github.com/rmannibucau <
>> https://github.com/rmannibucau>> |
>> > > LinkedIn <https://www.linkedin.com/in/rmannibucau
>> > <https://www.linkedin.com/in/rmannibucau>>
>> >
>> > --
>> > Jean-Baptiste Onofré
>> > jbono...@apache.org <mailto:jbono...@apache.org>
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>> >
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>


Re: JUnit 5 review

2018-01-23 Thread Romain Manni-Bucau
Oki JB,

Will implement it on my side until beam supports it then.

Thanks for the feedback.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-23 10:24 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:

> Hi Romain,
>
> Definitely it's not something targeted for Beam 2.3.0.
>
> It's interesting, but it sounds a bit like a lonesome cowboy effort.
>
> I think it would have been great to discuss a bit in term of priority (on
> the
> mailing list) before rushing on the PR. Couple of highlights in the Jira
> or PR
> would be appreciated too.
>
> So, please, keep the PR open, I will take a look asap.
>
> Regards
> JB
>
> On 01/23/2018 09:40 AM, Romain Manni-Bucau wrote:
> > Hi guys,
> >
> > Anyone able to have a look to the JUnit 5 PR
> > (https://github.com/apache/beam/pull/4360)?
> >
> > Worse case a "yes we'll move this direction" or "no we don't care about
> JUnit 5
> > for now" feedback would be very valuable for me.
> >
> > Thanks,
> > Romain Manni-Bucau
> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> > <https://rmannibucau.metawerx.net/> | Old Blog
> > <http://rmannibucau.wordpress.com> | Github <https://github.com/
> rmannibucau> |
> > LinkedIn <https://www.linkedin.com/in/rmannibucau>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


JUnit 5 review

2018-01-23 Thread Romain Manni-Bucau
Hi guys,

Is anyone able to have a look at the JUnit 5 PR (
https://github.com/apache/beam/pull/4360)?

Worst case, a "yes, we'll move in this direction" or "no, we don't care about
JUnit 5 for now" feedback would be very valuable for me.

Thanks,
Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>


Re: Gradle / Mvn diff

2018-01-23 Thread Romain Manni-Bucau
Yep,

a compromise can be to ensure all custom tasks have a description maybe:

diff --git a/examples/java/build.gradle b/examples/java/build.gradle
index 0fc0b17df..001bd8b38 100644
--- a/examples/java/build.gradle
+++ b/examples/java/build.gradle
@@ -130,7 +130,7 @@ def preCommitAdditionalFlags = [
   dataflowStreamingRunner: [ "--streaming=true" ],
 ]
 for (String runner : preCommitRunners) {
-  tasks.create(name: runner + "PreCommit", type: Test) {
+  tasks.create(name: runner + "PreCommit", type: Test, description: "Run
tests for runner ${runner.replace('Runner', '')}") {
 def preCommitBeamTestPipelineOptions = [
"--project=apache-beam-testing",
"--tempRoot=gs://temp-storage-for-end-to-end-tests",



Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-23 8:45 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:

> Hi Romain,
>
> I think we are pretty close: agree to add some explicit tasks & projects
> names.
>
> We can add additional tasks like skipAudit, for instance.
>
> As reminder, gradle tasks provides the list of tasks and gradle projects
> provides the list of projects/modules.
>
> Regards
> JB
>
> On 01/23/2018 08:12 AM, Romain Manni-Bucau wrote:
> > Hmm, I have to admit docs dont have my favor cause they are easily
> outdated and
> > hard to search but you hit a good point. Starting by renaming properly
> the tasks
> > and maybe writing what is done in build files - since it is code and
> even "api
> > for dev", it requires as much comments than the main api - can be better
> to start.
> >
> > Also a big switch flag to bypass checkstyle/findbugs/... can be good
> while in
> > dev since these phases cost a looot for nothing while you validates your
> code in
> > runners modules for instance.
> >
> > Le 23 janv. 2018 07:15, "Kenneth Knowles" <k...@google.com
> > <mailto:k...@google.com>> a écrit :
> >
> > On Mon, Jan 22, 2018 at 10:03 PM, Romain Manni-Bucau <
> rmannibu...@gmail.com
> > <mailto:rmannibu...@gmail.com>> wrote:
> >
> > @Kenneth: why not dropping the doc for a script with comments in
> the
> > project? A "RUNME.sh" ;).
> >
> >
> > That's cool, too, but also any shell one liner can be a gradle one
> liner or
> > mvn two/three liner :-). it is just trading one command that you
> cannot
> > guess easily for a different one that you still can't guess easily.
> >
> > For example, are the SparkRunner ValidatesRunner tests in the
> SparkRunner or
> > the core SDK or a third module that integrates the two? And why
> would you
> > know that the example ITs are called "sparkRunnerPreCommit"? It
> doesn't even
> > make sense really to have "precommit" or "postcommit" except as
> aliases to
> > make it easy to repro Jenkins' behavior - they have no other
> intrinsic meaning.
> >
> > So I was proposing a mapping from "full sentence + description" to
> one liner
> > to help people navigate the targets that we set up. Some web page or
> doc
> > that people can just quickly scan to find out to do common things,
> easier
> > than groovy or XML.
> >
> > Kenn
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Gradle / Mvn diff

2018-01-22 Thread Romain Manni-Bucau
Hmm, I have to admit docs don't have my favor, because they are easily
outdated and hard to search, but you hit a good point. Starting by renaming
the tasks properly, and maybe documenting what is done in the build files -
since it is code, and even "API for dev", it deserves as many comments as the
main API - can be a better way to start.

Also, a big switch flag to bypass checkstyle/findbugs/... can be good during
development, since these phases cost a lot for nothing while you are
validating your code in the runners modules, for instance.
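Such a switch flag could be sketched roughly as below for a root build.gradle. This is a hypothetical sketch: the property name `fast` and the exact task-name prefixes are assumptions for illustration (the prefixes match the checkstyle/findbugs/spotless tasks listed elsewhere in this thread), not an agreed Beam convention.

```groovy
// Hypothetical root build.gradle fragment: when invoked with -Pfast,
// disable the style/analysis tasks so only compile + test run, e.g.
//   ./gradlew :sdks:java:core:build -Pfast
allprojects {
  if (project.hasProperty('fast')) {
    // matching{}.all{} applies lazily, so tasks registered later by
    // plugins are also disabled.
    tasks.matching { task ->
      task.name.startsWith('checkstyle') ||
      task.name.startsWith('findbugs') ||
      task.name.startsWith('spotless') ||
      task.name == 'javadoc'
    }.all { it.enabled = false }
  }
}
```

Disabled tasks still appear in the task graph but are reported as SKIPPED, so the aggregator tasks like `build` keep working unchanged.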

Le 23 janv. 2018 07:15, "Kenneth Knowles" <k...@google.com> a écrit :

On Mon, Jan 22, 2018 at 10:03 PM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> @Kenneth: why not dropping the doc for a script with comments in the
> project? A "RUNME.sh" ;).
>

That's cool, too, but any shell one-liner can be a Gradle one-liner or an mvn
two/three-liner :-). It is just trading one command that you cannot guess
easily for a different one that you still can't guess easily.

For example, are the SparkRunner ValidatesRunner tests in the SparkRunner
or the core SDK or a third module that integrates the two? And why would
you know that the example ITs are called "sparkRunnerPreCommit"? It doesn't
even make sense really to have "precommit" or "postcommit" except as
aliases to make it easy to repro Jenkins' behavior - they have no other
intrinsic meaning.

So I was proposing a mapping from "full sentence + description" to one
liner to help people navigate the targets that we set up. Some web page or
doc that people can just quickly scan to find out to do common things,
easier than groovy or XML.

Kenn


Re: Gradle / Mvn diff

2018-01-22 Thread Romain Manni-Bucau
On Jan 22, 2018 21:46, "Lukasz Cwik" <lc...@google.com> wrote:

1. Are you trying to have version overrides in a module depend on the
parent's version and not in one global place? Doesn't this lead to
compatibility issues if you don't live with a single version of a dependency
across the entire repo (unless that dependency is shaded away, of course)?


Ismaël can detail this point more than me, but this is sadly already the
case. We were looking to override part of the tree due to incompatibilities
between the Spark and BigQuery drivers.


2. How is what you describe different from './gradlew :runners:spark:build'


I want to run only Spark in the wordcount module, for instance - not a runner
module, but a single runner execution in a multi-runner module.


3. They can be overridden on the command line or in a per-user properties
file, but I would rather have our users execute as close to what we test in
Jenkins so that the differences between a dev and a CI workflow are minimal.


Hmm, are you sure that is the case everywhere? For me the default should be
usable and CI should use overrides, but both options are valid. It is just a
very different experience compared to Maven.


4. Configuration-on-demand is enabled by default and it should only configure
the projects that are needed, so I'm not sure what you are asking for here.


Running only a submodule is slower than with Maven, so when working on a
module during dev it is quite costly, in particular when debugging through the
build.


On Mon, Jan 22, 2018 at 12:20 PM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> Hi
>
> As mentioned in another thread, I'm sending this mail to report some
> differences between the Maven and Gradle setups - perceived as regressions
> from this side of the fence:
>
> 1. Parent versions are not usable in children as variables - btw, why not
> put them in gradle.properties as is often done? (Not blocking)
> 2. Multiple executions are not all runnable one by one. A typical example:
> surefire executions are selectable using surefire:test@id with Maven, but
> the for loop in Gradle is never parameterized, so there is no way to run
> only the Spark runner test suite, for instance. (Almost blocking for
> working efficiently, but easy to fix)
> 3. Concurrency is hardcoded and way too high for most computers, leading to
> freezing the computer and preventing the user from doing anything (tested
> on an i7 with 32G of RAM and an SSD). (Blocking, but easy to fix I guess if
> we use the rule of thumb of keeping concurrency off by default)
> 4. Setup is slow when not using the daemon since it browses the whole
> project, so a lazy setup can be beneficial when working on submodules (not
> that blocking until you rely on the build.gradle setup)
>
>
>


Gradle / Mvn diff

2018-01-22 Thread Romain Manni-Bucau
Hi

As mentioned in another thread, I'm sending this mail to report some
differences between the Maven and Gradle setups - perceived as regressions from
this side of the fence:

1. Parent versions are not usable in children as variables - btw, why not
put them in gradle.properties as is often done? (Not blocking)
2. Multiple executions are not all runnable one by one. A typical example:
surefire executions are selectable using surefire:test@id with Maven, but the
for loop in Gradle is never parameterized, so there is no way to run only the
Spark runner test suite, for instance. (Almost blocking for working
efficiently, but easy to fix)
3. Concurrency is hardcoded and way too high for most computers, leading to
freezing the computer and preventing the user from doing anything (tested on an
i7 with 32G of RAM and an SSD). (Blocking, but easy to fix I guess if we use
the rule of thumb of keeping concurrency off by default)
4. Setup is slow when not using the daemon since it browses the whole project,
so a lazy setup can be beneficial when working on submodules (not that
blocking until you rely on the build.gradle setup)


Re: IO configuration and @AutoValue future

2018-01-19 Thread Romain Manni-Bucau
@Lukasz: not really, since if the parameters change then the new version
will get the new data, so this is not a constraint on Beam but on the
configuration storage, which must handle version compatibility somehow anyway -
not a big deal IMHO.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-19 18:41 GMT+01:00 Lukasz Cwik <lc...@google.com>:

> Note that using the -parameters flag in javac will require that we never
> change parameter names inside methods increasing the backwards
> compatibility burden.
>
> On Thu, Jan 18, 2018 at 12:33 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Great idea !
>>
>> It sounds good to me.
>>
>> Regards
>> JB
>>
>> On 01/18/2018 09:27 AM, Romain Manni-Bucau wrote:
>>
>>> @JB: I thought about another option which can be almost painless for beam:
>>>
>>> 1. we ensure all "config" classes are public (would avoid nasty hacks)
>>> 2. while migrating to Java 8, you activate the javac -parameters flag
>>>
>>> Does it sound better?
>>>
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <
>>> https://rmannibucau.metawerx.net/> | Old Blog <
>>> http://rmannibucau.wordpress.com> | Github <
>>> https://github.com/rmannibucau> | LinkedIn <
>>> https://www.linkedin.com/in/rmannibucau>
>>>
>>> 2018-01-14 19:25 GMT+01:00 Romain Manni-Bucau <rmannibu...@gmail.com
>>> <mailto:rmannibu...@gmail.com>>:
>>>
>>> Works for me, but it forbids the usage of the abstract class because of
>>> the final fields. That said, if Beam can get a Factory.createIO(clazz,
>>> configAsPrimitivesMap) I'm happy whatever solution is used.
>>>
>>>
>>> On Jan 14, 2018 17:19, "Jean-Baptiste Onofré" <j...@nanthrax.net
>>> <mailto:j...@nanthrax.net>> wrote:
>>>
>>>
>>> Hi Romain,
>>>
>>> I think the missing thing for automation projects is probably
>>> more
>>> around "documentation" for the setters/getters.
>>>
>>> So, why not:
>>> 1. we don't change the usage and AutoValue itself
>>> 2. we can imagine adding a new set of annotations in IO Common with a
>>> specific annotation processor that creates another POJO class, not
>>> actually used in the IO code, but "describing" the configuration for
>>> automation projects. This POJO will be public, not final.
>>>
>>> WDYT ?
>>>
>>> Regards
>>> JB
>>>
>>> On 12/01/2018 19:26, Romain Manni-Bucau wrote:
>>>
>>> Hi guys
>>>
>>> I'd like to discuss the IO configuration.
>>>
>>> My goal is to be able to instrospect (or equivalent) the IO
>>> to
>>> instantiate them programmatically in a generic manner from a
>>> generic
>>> config - this is not yet linked to the system property topic
>>> but can
>>> benefit beam on this other topic too.
>>>
>>> Auto value loosing the final fields ordering is impossible
>>> to use
>>> until you parse sources.
>>>
>>> In other words: auto value is nice for programmatic usage
>>> but is
>>> blocking for any automotion on top of it - even using unsafe
>>> is not
>>> an option ATM :(.
>>>
>>> Can we try to get something to solve that need please?
>>>
>>> Here are the solutions I see (pick just one ;)):
>>>
>>> 1. migrate IO to IOOptions (based on pipeline options kind of
>>> design). This is a breaking change - but I'm sure we can
>>> mitigate it
>>> in term of user compatibility - but it unifies the beam
>>> config and
>>> enables to not have IO specific configurations which can vary
>>> between the IO if not developped by the same guy.
>>> 2. Add an extension to @AutoValue to also ge

Re: IO configuration and @AutoValue future

2018-01-18 Thread Romain Manni-Bucau
@JB: I thought about another option which can be almost painless for beam:

1. we ensure all "config" classes are public (would avoid nasty hacks)
2. while migrating to Java 8, you activate the javac -parameters flag

Does it sound better?
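As a rough illustration of what the `-parameters` flag buys generic tooling (a sketch, not Beam code - the `create()` signature here is invented): once classes are compiled with `javac -parameters`, real parameter names become readable through reflection instead of the synthetic `arg0`, `arg1`, ...

```java
import java.lang.reflect.Method;
import java.lang.reflect.Parameter;
import java.util.ArrayList;
import java.util.List;

public class ParameterNamesDemo {
    // Stand-in for an IO's factory method; the signature is invented.
    static Object create(String address, String username, String password) {
        return address;
    }

    // Collects the parameter names of create() via reflection.
    public static List<String> parameterNames() throws Exception {
        Method create = ParameterNamesDemo.class.getDeclaredMethod(
                "create", String.class, String.class, String.class);
        List<String> names = new ArrayList<>();
        for (Parameter p : create.getParameters()) {
            // isNamePresent() is true only when compiled with "javac -parameters";
            // otherwise getName() returns synthetic names like arg0, arg1, ...
            names.add(p.getName() + (p.isNamePresent() ? "" : " (synthetic)"));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parameterNames());
    }
}
```

This is also where Lukasz's concern bites: once tooling keys off these names, renaming a parameter becomes a breaking change.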




2018-01-14 19:25 GMT+01:00 Romain Manni-Bucau <rmannibu...@gmail.com>:

> Works for me, but it forbids the usage of the abstract class because of the
> final fields. That said, if Beam can get a Factory.createIO(clazz,
> configAsPrimitivesMap) I'm happy whatever solution is used.
>
>
> On Jan 14, 2018 17:19, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
>
>> Hi Romain,
>>
>> I think the missing thing for automation projects is probably more around
>> "documentation" for the setters/getters.
>>
>> So, why not:
>> 1. we don't change the usage and AutoValue itself
>> 2. we can imagine adding a new set of annotations in IO Common with a
>> specific annotation processor that creates another POJO class, not actually
>> used in the IO code, but "describing" the configuration for automation
>> projects. This POJO will be public, not final.
>>
>> WDYT ?
>>
>> Regards
>> JB
>>
>> On 12/01/2018 19:26, Romain Manni-Bucau wrote:
>>
>>> Hi guys
>>>
>>> I'd like to discuss the IO configuration.
>>>
>>> My goal is to be able to instrospect (or equivalent) the IO to
>>> instantiate them programmatically in a generic manner from a generic config
>>> - this is not yet linked to the system property topic but can benefit beam
>>> on this other topic too.
>>>
>>> Auto value loosing the final fields ordering is impossible to use until
>>> you parse sources.
>>>
>>> In other words: auto value is nice for programmatic usage but is
>>> blocking for any automotion on top of it - even using unsafe is not an
>>> option ATM :(.
>>>
>>> Can we try to get something to solve that need please?
>>>
>>> Here are the solutions I see (pick just one ;)):
>>>
>>> 1. migrate IO to IOOptions (based on pipeline options kind of design).
>>> This is a breaking change - but I'm sure we can mitigate it in term of user
>>> compatibility - but it unifies the beam config and enables to not have IO
>>> specific configurations which can vary between the IO if not developped by
>>> the same guy.
>>> 2. Add an extension to @AutoValue to also generate the field names order
>>> in the create() (@Fields({"address","username","password"}). Cheap but
>>> the way to instantiate an IO from a config is still a pain (think
>>> Factory.createIO(clazz, properties))
>>> 3. Also generate a plain pojo from the abstract @AutoValue class - this
>>> requires to modify the source class to make it working but is feasible with
>>> a processor
>>> 4. drop autovalue and use plain pojo - writing it cause it is a
>>> technical option but it leads to break as much as 1 without getting all the
>>> benefit to have an agnostic config and the tooling we can build on top of it
>>> 5. probably others
>>>
>>>
>>> Wdyt?
>>>
>>> Personally I really like 1 cause it starts to create a clean programming
>>> model we can then build other features on.
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <
>>> https://rmannibucau.metawerx.net/> | Old Blog <
>>> http://rmannibucau.wordpress.com> | Github <
>>> https://github.com/rmannibucau> | LinkedIn <
>>> https://www.linkedin.com/in/rmannibucau>
>>>
>>


Re: IO configuration and @AutoValue future

2018-01-14 Thread Romain Manni-Bucau
Works for me, but it forbids the usage of the abstract class because of the
final fields. That said, if Beam can get a Factory.createIO(clazz,
configAsPrimitivesMap) I'm happy whatever solution is used.
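A minimal sketch of what such a `Factory.createIO(clazz, configAsPrimitivesMap)` could look like - the class names and the config POJO are hypothetical (this is not a Beam API), and it assumes String-typed setters for simplicity:

```java
import java.lang.reflect.Method;
import java.util.Map;

public class CreateIODemo {
    // Hypothetical config POJO with String setters; not a Beam class.
    public static class JdbcConfig {
        private String url;
        private String username;
        public void setUrl(String url) { this.url = url; }
        public void setUsername(String username) { this.username = username; }
        public String getUrl() { return url; }
        public String getUsername() { return username; }
    }

    // Generic factory: instantiate the config type and populate it from a
    // map of primitives (here: Strings) by calling the matching setters.
    public static <T> T createIO(Class<T> type, Map<String, String> config) throws Exception {
        T instance = type.getDeclaredConstructor().newInstance();
        for (Map.Entry<String, String> e : config.entrySet()) {
            String setter = "set" + Character.toUpperCase(e.getKey().charAt(0))
                    + e.getKey().substring(1);
            Method m = type.getMethod(setter, String.class);
            m.invoke(instance, e.getValue());
        }
        return instance;
    }

    public static void main(String[] args) throws Exception {
        JdbcConfig cfg = createIO(JdbcConfig.class,
                Map.of("url", "jdbc:h2:mem:test", "username", "sa"));
        System.out.println(cfg.getUrl() + " as " + cfg.getUsername());
    }
}
```

The point of the sketch is that a plain, public, non-final POJO is trivially introspectable, which is exactly what the AutoValue abstract classes make hard.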


On Jan 14, 2018 17:19, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:

> Hi Romain,
>
> I think the missing thing for automation projects is probably more around
> "documentation" for the setters/getters.
>
> So, why not:
> 1. we don't change the usage and AutoValue itself
> 2. we can imagine adding a new set of annotations in IO Common with a
> specific annotation processor that creates another POJO class, not actually
> used in the IO code, but "describing" the configuration for automation
> projects. This POJO will be public, not final.
>
> WDYT ?
>
> Regards
> JB
>
> On 12/01/2018 19:26, Romain Manni-Bucau wrote:
>
>> Hi guys
>>
>> I'd like to discuss the IO configuration.
>>
>> My goal is to be able to instrospect (or equivalent) the IO to
>> instantiate them programmatically in a generic manner from a generic config
>> - this is not yet linked to the system property topic but can benefit beam
>> on this other topic too.
>>
>> Auto value loosing the final fields ordering is impossible to use until
>> you parse sources.
>>
>> In other words: auto value is nice for programmatic usage but is blocking
>> for any automotion on top of it - even using unsafe is not an option ATM :(.
>>
>> Can we try to get something to solve that need please?
>>
>> Here are the solutions I see (pick just one ;)):
>>
>> 1. migrate IO to IOOptions (based on pipeline options kind of design).
>> This is a breaking change - but I'm sure we can mitigate it in term of user
>> compatibility - but it unifies the beam config and enables to not have IO
>> specific configurations which can vary between the IO if not developped by
>> the same guy.
>> 2. Add an extension to @AutoValue to also generate the field names order
>> in the create() (@Fields({"address","username","password"}). Cheap but
>> the way to instantiate an IO from a config is still a pain (think
>> Factory.createIO(clazz, properties))
>> 3. Also generate a plain pojo from the abstract @AutoValue class - this
>> requires to modify the source class to make it working but is feasible with
>> a processor
>> 4. drop autovalue and use plain pojo - writing it cause it is a technical
>> option but it leads to break as much as 1 without getting all the benefit
>> to have an agnostic config and the tooling we can build on top of it
>> 5. probably others
>>
>>
>> Wdyt?
>>
>> Personally I really like 1 cause it starts to create a clean programming
>> model we can then build other features on.
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <
>> https://rmannibucau.metawerx.net/> | Old Blog <
>> http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibuca
>> u> | LinkedIn <https://www.linkedin.com/in/rmannibucau>
>>
>


Re: IO configuration and @AutoValue future

2018-01-12 Thread Romain Manni-Bucau
Hi JB

2018-01-13 7:51 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:

> Hi Romain,
>
> Clearly AutoValue is really convenient for developers. It reduces the
> boilerplate of getters/setters for configuration.
>
> I'm not a big fan of an IOOptions, because it's a breaking change and I
> think we are going to lose some flexibility for developers.
>
> POJO is basically what we did before using AutoValue.
>
> A potential new option would be to do an improvement in AutoValue (on the
> annotations or the way it does the fields generation).
>

Preventing chaining them could be nice too, as would enforcing a single
ConfigPojo for IO configuration. Would that work?
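To make the "single ConfigPojo per IO" idea concrete, here is a sketch of an IO whose entry point takes one public, non-final config object instead of chained builder calls - all names (`ElasticConfig`, `ElasticRead`) are invented for illustration:

```java
// Hypothetical sketch: an IO configured by one public, non-final config POJO
// instead of chained withX() calls on an AutoValue builder.
public class SingleConfigIODemo {
    public static class ElasticConfig {
        public String address;
        public String username;
        public String password;
    }

    public static class ElasticRead {
        private final ElasticConfig config;
        private ElasticRead(ElasticConfig config) { this.config = config; }
        // Single entry point: the whole configuration comes in as one object,
        // which generic tooling can build from any key/value source.
        public static ElasticRead of(ElasticConfig config) { return new ElasticRead(config); }
        public String describe() { return "read from " + config.address + " as " + config.username; }
    }

    public static void main(String[] args) {
        ElasticConfig cfg = new ElasticConfig();
        cfg.address = "http://localhost:9200";
        cfg.username = "elastic";
        System.out.println(ElasticRead.of(cfg).describe());
    }
}
```

The trade-off is losing the fluent chaining, but the config becomes a plain data holder that introspection, serialization, and external configuration can all reach.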


>
> Regards
> JB
>
> On 12/01/2018 19:26, Romain Manni-Bucau wrote:
>
>> Hi guys
>>
>> I'd like to discuss the IO configuration.
>>
>> My goal is to be able to instrospect (or equivalent) the IO to
>> instantiate them programmatically in a generic manner from a generic config
>> - this is not yet linked to the system property topic but can benefit beam
>> on this other topic too.
>>
>> Auto value loosing the final fields ordering is impossible to use until
>> you parse sources.
>>
>> In other words: auto value is nice for programmatic usage but is blocking
>> for any automotion on top of it - even using unsafe is not an option ATM :(.
>>
>> Can we try to get something to solve that need please?
>>
>> Here are the solutions I see (pick just one ;)):
>>
>> 1. migrate IO to IOOptions (based on pipeline options kind of design).
>> This is a breaking change - but I'm sure we can mitigate it in term of user
>> compatibility - but it unifies the beam config and enables to not have IO
>> specific configurations which can vary between the IO if not developped by
>> the same guy.
>> 2. Add an extension to @AutoValue to also generate the field names order
>> in the create() (@Fields({"address","username","password"}). Cheap but
>> the way to instantiate an IO from a config is still a pain (think
>> Factory.createIO(clazz, properties))
>> 3. Also generate a plain pojo from the abstract @AutoValue class - this
>> requires to modify the source class to make it working but is feasible with
>> a processor
>> 4. drop autovalue and use plain pojo - writing it cause it is a technical
>> option but it leads to break as much as 1 without getting all the benefit
>> to have an agnostic config and the tooling we can build on top of it
>> 5. probably others
>>
>>
>> Wdyt?
>>
>> Personally I really like 1 cause it starts to create a clean programming
>> model we can then build other features on.
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <
>> https://rmannibucau.metawerx.net/> | Old Blog <
>> http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibuca
>> u> | LinkedIn <https://www.linkedin.com/in/rmannibucau>
>>
>


Re: IO configuration and @AutoValue future

2018-01-12 Thread Romain Manni-Bucau
On Jan 12, 2018 20:49, "Kenneth Knowles" <k...@google.com> wrote:

Would you be able to put together a doc with a little more motivation and
background on the feature, plus specific proposal? I think there's a lot of
desire to build Beam pipelines in ways that are not programmatic and we can
have a good discussion and make sure that changes meet the goals.


OK, I will try to work on it soon, but don't be surprised if you get no news
next week.


Separately, I've built *many* systems to non-programmatically build things
that were previously done via code. One lesson is that whatever starts as
reflection-based "totally automatic" tooling never suits all your needs
beyond simple prototypes. You'll almost certainly end up building a system
of annotations in the code or a system with configuration schema on the
side. So I would not invest in making things reflection friendly unless we
know this will never need to grow into more.


I saw some working, like xbean/tomee, and others failing due to too much
complexity in the config. I'm sure we'll make it if we want to. This is about
config, not yet about generic runtimes ;)



Kenn

On Fri, Jan 12, 2018 at 10:26 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> Hi guys
>
> I'd like to discuss the IO configuration.
>
> My goal is to be able to instrospect (or equivalent) the IO to instantiate
> them programmatically in a generic manner from a generic config - this is
> not yet linked to the system property topic but can benefit beam on this
> other topic too.
>
> Auto value loosing the final fields ordering is impossible to use until
> you parse sources.
>
> In other words: auto value is nice for programmatic usage but is blocking
> for any automotion on top of it - even using unsafe is not an option ATM :(.
>
> Can we try to get something to solve that need please?
>
> Here are the solutions I see (pick just one ;)):
>
> 1. migrate IO to IOOptions (based on pipeline options kind of design).
> This is a breaking change - but I'm sure we can mitigate it in term of user
> compatibility - but it unifies the beam config and enables to not have IO
> specific configurations which can vary between the IO if not developped by
> the same guy.
> 2. Add an extension to @AutoValue to also generate the field names order
> in the create() (@Fields({"address","username","password"}). Cheap but
> the way to instantiate an IO from a config is still a pain (think
> Factory.createIO(clazz, properties))
> 3. Also generate a plain pojo from the abstract @AutoValue class - this
> requires to modify the source class to make it working but is feasible with
> a processor
> 4. drop autovalue and use plain pojo - writing it cause it is a technical
> option but it leads to break as much as 1 without getting all the benefit
> to have an agnostic config and the tooling we can build on top of it
> 5. probably others
>
>
> Wdyt?
>
> Personally I really like 1 cause it starts to create a clean programming
> model we can then build other features on.
>
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau>
>


Re: jackson to parse options?

2018-01-12 Thread Romain Manni-Bucau
Yes, we moved a few steps forward - I was waiting before speaking of it, but
it seems the discussion led there.

The idea is to prepend a prefix per IO in the key. Currently the easiest
working solution is to sanitize the transform/fn name (since we validate its
uniqueness by default) and we are done. If that is not enough, we can pass the
actual prefix to use in the pipeline, but I doubt we need to be that fancy.

The logic would just be like
https://github.com/apache/tomee/blob/master/container/openejb-core/src/main/java/org/apache/openejb/config/ConfigurationFactory.java#L1542
in the end.
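A rough sketch of that prefix logic - assuming a `beam.` prefix and the `--key=value` argument form the options parser already consumes; the class and method names are invented, not Beam APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class PrefixedOptionsDemo {
    // Keep only the keys under the given prefix and rewrite them into the
    // "--key=value" String[] form that a pipeline options parser consumes.
    public static String[] toArgs(Properties props, String prefix) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<Object, Object> e : props.entrySet()) {
            String key = e.getKey().toString();
            if (key.startsWith(prefix)) {
                args.add("--" + key.substring(prefix.length()) + "=" + e.getValue());
            }
        }
        return args.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // In real use this would be System.getProperties(); a local
        // Properties keeps the sketch deterministic.
        Properties props = new Properties();
        props.setProperty("beam.tempLocation", "/tmp/beam");
        props.setProperty("unrelated.key", "ignored");
        for (String a : toArgs(props, "beam.")) {
            System.out.println(a);
        }
    }
}
```

A per-transform variant would simply use a longer prefix (e.g. the sanitized transform name) before stripping it.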

On Jan 12, 2018 20:39, "Lukasz Cwik" <lc...@google.com> wrote:

Your original proposal was too vague, so I was extrapolating on what I
thought you meant. At this point it seems like what I extrapolated and what
you're talking about are far enough apart that I can't extrapolate what you
mean, and you'll need to be significantly more detailed in what you're
suggesting.

With the system property approach, how would a user use the same IO
multiple times?

On Fri, Jan 12, 2018 at 10:18 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

>
> 2018-01-12 19:12 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>
>>
>>
>> On Fri, Jan 12, 2018 at 10:01 AM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>>
>>> 2018-01-12 18:54 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>>>
>>>> Some concerns would be:
>>>> * How does a user discover what options can be set and what values they
>>>> take?
>>>>
>>>
>>> Tempted to say "same as today", what's the regression you would see? A
>>> --verbose can be nice thought to dump the DAG at startup (before
>>> translation) - but with this impl or not ;).
>>>
>> PipelineOptionsFactory supports passing in "--help" which prints all the
>> descriptions of what options it is aware of that can be set to the
>> terminal. Also, users can find all PipelineOptions interfaces and all
>> sub-interfaces using an IDE or javadoc.
>>
>
> Yep, same except it would be enriched with the (included optional
> prefixes).
>
> But still misses the DAG dump which is valuable to go one step further and
> allows a fully configurable through options DAG - read from system
> properties or not, this is just the properties source ending in a String[].
>
>
>>
>>
>>>
>>>
>>>> * System properties are global so how would you prevent system
>>>> properties from conflicting within Apache Beam and with other non Apache
>>>> Beam libraries that may rely on system properties?
>>>>
>>>
>>> The prefixes I mentionned. Far enough in all libs so why not lib,
>>> anything special I missed?
>>>
>> Thats a good amount of additional typing per option, even if its just the
>> 5 characters "beam.".
>>
>
> You do it in your launcher but enables you to do it. Today it is up to
> each dev to store its own config without any normalization which is a lost
> of value IMHO we can easily fix.
>
>
>>
>>>
>>>> * There are users who launch more then one pipeline from the same
>>>> application, how would system properties work there?
>>>>
>>>
>>> And these users don't use the system properties but a custom way to
>>> build String[] with their own configs so no issues I think. As you can or
>>> not use the beamTestPipelineOptions system property, you would use this way
>>> to configure it or not.
>>> But having it would be nice for a lot of users + tests which wouldnt
>>> really need this beamTestPipelineOptions (thinking out loud).
>>>
>> That is only for integration testing and using multiple system properties
>> that then get mapped to PipelineOptions just for integration testing seems
>> like we will add a whole level of complexity for little gain.
>>
>
> Can be, so we keep the previous values which are way sufficient from what
> I saw in batch usages and kind of lack in a built-in fashion in Beam for
> now.
>
>
>>
>>>
>>>>
>>>> Note that the system property usage is only for integration testing and
>>>> not meant for users writing pipelines.
>>>>
>>>> On Thu, Jan 11, 2018 at 9:38 PM, Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> Hmm, here is my thought: allowing main options settings is super
>>>>> important and one of the most important user experience point. I d even 
>>>>> say
>>>>> any IO config shouldnt be an 

IO configuration and @AutoValue future

2018-01-12 Thread Romain Manni-Bucau
Hi guys

I'd like to discuss the IO configuration.

My goal is to be able to introspect (or equivalent) the IOs to instantiate
them programmatically in a generic manner from a generic config - this is
not yet linked to the system property topic, but it can benefit Beam on that
other topic too.

Since AutoValue loses the final fields' ordering, it is impossible to use
unless you parse the sources.

In other words: AutoValue is nice for programmatic usage but is blocking
for any automation on top of it - even using Unsafe is not an option ATM :(.

Can we try to get something to solve that need please?

Here are the solutions I see (pick just one ;)):

1. Migrate IOs to IOOptions (based on the pipeline options kind of design).
This is a breaking change - but I'm sure we can mitigate it in terms of user
compatibility - and it unifies the Beam config and avoids IO-specific
configurations, which can vary between IOs if not developed by the same guy.
2. Add an extension to @AutoValue to also generate the field name order in
the create() (@Fields({"address","username","password"})). Cheap, but the way
to instantiate an IO from a config is still a pain (think
Factory.createIO(clazz, properties))
3. Also generate a plain POJO from the abstract @AutoValue class - this
requires modifying the source class to make it work, but is feasible with
a processor
4. Drop AutoValue and use plain POJOs - writing it because it is a technical
option, but it breaks as much as 1 without getting all the benefit of
having an agnostic config and the tooling we can build on top of it
5. Probably others


Wdyt?

Personally I really like 1 because it starts to create a clean programming
model we can then build other features on.


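Option 2 from the list above could be sketched like this - the `@Fields` annotation and the generic caller are invented for illustration; no such AutoValue extension exists today:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;
import java.util.Map;

public class FieldsOrderDemo {
    // Invented annotation recording the parameter order of a factory method,
    // the piece of metadata AutoValue's generated create() does not expose.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Fields { String[] value(); }

    @Fields({"address", "username", "password"})
    static String create(String address, String username, String password) {
        return address + "@" + username; // stands in for building the IO
    }

    // Generic caller: look up each argument by the name recorded in @Fields,
    // so a name->value config map is enough to invoke the factory.
    public static Object callWithConfig(Method factory, Map<String, String> config)
            throws Exception {
        String[] order = factory.getAnnotation(Fields.class).value();
        Object[] argv = new Object[order.length];
        for (int i = 0; i < order.length; i++) {
            argv[i] = config.get(order[i]);
        }
        return factory.invoke(null, argv);
    }

    public static void main(String[] args) throws Exception {
        Method create = FieldsOrderDemo.class.getDeclaredMethod(
                "create", String.class, String.class, String.class);
        System.out.println(callWithConfig(create,
                Map.of("address", "host:1234", "username", "sa", "password", "secret")));
    }
}
```

In a real implementation the annotation would be emitted by the processor rather than hand-written, but the consuming side would look much like `callWithConfig`.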


Re: jackson to parse options?

2018-01-12 Thread Romain Manni-Bucau
2018-01-12 19:12 GMT+01:00 Lukasz Cwik <lc...@google.com>:

>
>
> On Fri, Jan 12, 2018 at 10:01 AM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>>
>> 2018-01-12 18:54 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>>
>>> Some concerns would be:
>>> * How does a user discover what options can be set and what values they
>>> take?
>>>
>>
>> Tempted to say "same as today", what's the regression you would see? A
>> --verbose can be nice thought to dump the DAG at startup (before
>> translation) - but with this impl or not ;).
>>
> PipelineOptionsFactory supports passing in "--help" which prints all the
> descriptions of what options it is aware of that can be set to the
> terminal. Also, users can find all PipelineOptions interfaces and all
> sub-interfaces using an IDE or javadoc.
>

Yep, the same, except it would be enriched with the (optional) prefixes.

But it still misses the DAG dump, which is valuable to go one step further and
allows a DAG fully configurable through options - read from system properties
or not, that is just the properties source ending in a String[].


>
>
>>
>>
>>> * System properties are global so how would you prevent system
>>> properties from conflicting within Apache Beam and with other non Apache
>>> Beam libraries that may rely on system properties?
>>>
>>
>> The prefixes I mentionned. Far enough in all libs so why not lib,
>> anything special I missed?
>>
> That's a good amount of additional typing per option, even if it's just the
> 5 characters "beam.".
>

You do it in your launcher, but it enables you to do it. Today it is up to
each dev to store their own config without any normalization, which is a loss
of value IMHO that we can easily fix.


>
>>
>>> * There are users who launch more then one pipeline from the same
>>> application, how would system properties work there?
>>>
>>
>> And these users don't use the system properties but a custom way to build
>> String[] with their own configs so no issues I think. As you can or not use
>> the beamTestPipelineOptions system property, you would use this way to
>> configure it or not.
>> But having it would be nice for a lot of users + tests which wouldnt
>> really need this beamTestPipelineOptions (thinking out loud).
>>
> That is only for integration testing and using multiple system properties
> that then get mapped to PipelineOptions just for integration testing seems
> like we will add a whole level of complexity for little gain.
>

Could be, so we keep the previous values, which are quite sufficient from
what I saw in batch usages and are kind of lacking in a built-in fashion in
Beam for now.


>
>>
>>>
>>> Note that the system property usage is only for integration testing and
>>> not meant for users writing pipelines.
>>>
>>> On Thu, Jan 11, 2018 at 9:38 PM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> Hmm, here is my thought: allowing main options settings is super
>>>> important and one of the most important user experience points. I'd even
>>>> say any IO config shouldn't be an AutoValue but an option, and through the
>>>> transform name you should be able to prefix the option name and override a
>>>> particular IO config through the command line - just to say how important
>>>> it is and avoid a hack per IO for that.
>>>>
>>>> The fact that there is this global system property, the JSON integration,
>>>> and in this PR another system property which is a bit weird for a user
>>>> IMHO can mean we need to update the config design. If each option has an
>>>> equivalent system property then there is no need for all these hacks.
>>>>
>>>> Ex: tempLocation would support -DtempLocation, -Dbeam.tempLocation,
>>>> -D.templocation (+ transform name prefix
>>>> if we add IO config one day)
>>>>
>>>> Then Jackson is only here to parse the options, which is already done
>>>> manually since the options use a custom serializer, so it can be dropped,
>>>> no?
>>>>
>>>> Is this reasoning wrong?
>>>>
>>>> Le 12 janv. 2018 00:11, "Lukasz Cwik" <lc...@google.com> a écrit :
>>>>
>>>>> Robert Bradshaw had the idea of migrating away from using
>>>>> main(String[] args) and just refactoring the code to test the WordCount
>>>>> PTransform allowing one to write a traditional JUnit test that didn't call
>>>>&

Re: jackson to parse options?

2018-01-12 Thread Romain Manni-Bucau
2018-01-12 18:54 GMT+01:00 Lukasz Cwik <lc...@google.com>:

> Some concerns would be:
> * How does a user discover what options can be set and what values they
> take?
>

Tempted to say "same as today"; what regression would you see? A --verbose
flag could be nice, though, to dump the DAG at startup (before translation),
with this implementation or not ;).


> * System properties are global so how would you prevent system properties
> from conflicting within Apache Beam and with other non Apache Beam
> libraries that may rely on system properties?
>

The prefixes I mentioned. That is fair enough in other libs, so why not in
this one; anything special I missed?


> * There are users who launch more than one pipeline from the same
> application, how would system properties work there?
>

And these users don't use the system properties but a custom way to build
the String[] with their own configs, so no issue I think. Just as you can use
the beamTestPipelineOptions system property or not, you would use this way to
configure it or not.
But having it would be nice for a lot of users + tests which wouldn't really
need this beamTestPipelineOptions (thinking out loud).


>
> Note that the system property usage is only for integration testing and
> not meant for users writing pipelines.
>
> On Thu, Jan 11, 2018 at 9:38 PM, Romain Manni-Bucau <rmannibu...@gmail.com
> > wrote:
>
>> Hmm, here is my thought: allowing main options settings is super
>> important and one of the most important user experience points. I'd even say
>> any IO config shouldn't be an auto value but an option, and through the
>> transform name you should be able to prefix the option name and override a
>> particular IO config through the command line - just to say how important
>> it is and to avoid per-IO hacks for that.
>>
>> The fact that there is this global system property, the json integration, and in
>> this PR another system property which is a bit weird for a user IMHO can
>> mean we need to update the config design. If each option had an equivalent
>> system property then there would be no need for all these hacks.
>>
>> Ex: tempLocation would support -DtempLocation, -Dbeam.tempLocation,
>> -D.templocation (+ transform name prefix
>> if we add IO config one day)
>>
>> Then jackson is only here to parse the options, which is already done
>> manually since options use a custom serializer, so it can be dropped, no?
>>
>> Is this reasoning wrong?
>>
>> Le 12 janv. 2018 00:11, "Lukasz Cwik" <lc...@google.com> a écrit :
>>
>>> Robert Bradshaw had the idea of migrating away from using main(String[]
>>> args) and just refactoring the code to test the WordCount PTransform
>>> allowing one to write a traditional JUnit test that didn't call
>>> main(String[] args). This would change what the contents of our examples
>>> are and make them more amenable to testing which could be good guidance for
>>> developers as well.
>>>
>>> On Thu, Jan 11, 2018 at 3:08 PM, Lukasz Cwik <lc...@google.com> wrote:
>>>
>>>> 1) TestPipeline#convertToArgs is meant to convert TestPipelineOptions
>>>> into a String[] containing --arg=value for integration tests that only have
>>>> a main(String[] arg) entry point like WordCount. There is this PR[1] that
>>>> is outstanding that is attempting to clean this up and simplify it so that
>>>> we aren't doing this haphazard conversion.
>>>> [1] https://github.com/apache/beam/pull/4346
>>>>
>>>> I haven't yet commented on the PR but I was going to suggest that the
>>>> user add String[] getMainArgs to TestPipelineOptions and not do the
>>>> convertToArgs hackiness for ITs that use main().
>>>>
>>>> On Thu, Jan 11, 2018 at 1:53 PM, Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> Some inputs - maybe comments? - would be appreciated on:
>>>>>
>>>>> 1. Why using json in https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestPipeline.java#L480 and not iterate over the options?
>>>>> 2. Also this is likely used to do a fromArgs and create another
>>>>> pipeline options later no? Is this api that useful here or does it need an
>>>>> options composition - fully object?
>>>>>
>>>>> Le 11 janv. 2018 19:09, "Lukasz Cwik" <lc...@google.com> a écrit :
>>>>>
>>>>>> Give links to the code segments that you want background on.
>

Re: jackson to parse options?

2018-01-11 Thread Romain Manni-Bucau
Hmm, here is my thought: allowing main options settings is super important
and one of the most important user experience points. I'd even say any IO
config shouldn't be an auto value but an option, and through the transform
name you should be able to prefix the option name and override a particular
IO config through the command line - just to say how important it is and to
avoid per-IO hacks for that.

The fact that there is this global system property, the json integration, and
in this PR another system property which is a bit weird for a user IMHO can
mean we need to update the config design. If each option had an equivalent
system property then there would be no need for all these hacks.

Ex: tempLocation would support -DtempLocation, -Dbeam.tempLocation,
-D.templocation (+ transform name prefix
if we add IO config one day)

Then jackson is only here to parse the options, which is already done
manually since options use a custom serializer, so it can be dropped, no?

Is this reasoning wrong?
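The -Dbeam.* idea above can be sketched outside of Beam itself. Everything here is an assumption for illustration: the `beam.` prefix, the `withSystemPropertyArgs` helper, and the merge order (properties appended after explicit args) are not existing Beam API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class SystemPropertyArgs {
    // Collect beam.* system properties as --name=value args and append them
    // after any explicit args; a real integration would pick a precedence
    // and feed the merged array to the usual options parsing.
    static String[] withSystemPropertyArgs(String[] args, Properties props) {
        List<String> merged = new ArrayList<>(Arrays.asList(args));
        for (String name : props.stringPropertyNames()) {
            if (name.startsWith("beam.")) {
                merged.add("--" + name.substring("beam.".length())
                        + "=" + props.getProperty(name));
            }
        }
        return merged.toArray(new String[0]);
    }

    public static void main(String[] mainArgs) {
        Properties props = new Properties();
        props.setProperty("beam.tempLocation", "/tmp/beam");
        String[] args = withSystemPropertyArgs(
                new String[]{"--runner=DirectRunner"}, props);
        // prints: --runner=DirectRunner --tempLocation=/tmp/beam
        System.out.println(String.join(" ", args));
    }
}
```

In production the `Properties` argument would be `System.getProperties()`; it is passed explicitly here so the mapping stays testable.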

Le 12 janv. 2018 00:11, "Lukasz Cwik" <lc...@google.com> a écrit :

> Robert Bradshaw had the idea of migrating away from using main(String[]
> args) and just refactoring the code to test the WordCount PTransform
> allowing one to write a traditional JUnit test that didn't call
> main(String[] args). This would change what the contents of our examples
> are and make them more amenable to testing which could be good guidance for
> developers as well.
>
> On Thu, Jan 11, 2018 at 3:08 PM, Lukasz Cwik <lc...@google.com> wrote:
>
>> 1) TestPipeline#convertToArgs is meant to convert TestPipelineOptions
>> into a String[] containing --arg=value for integration tests that only have
>> a main(String[] arg) entry point like WordCount. There is this PR[1] that
>> is outstanding that is attempting to clean this up and simplify it so that
>> we aren't doing this haphazard conversion.
>> [1] https://github.com/apache/beam/pull/4346
>>
>> I haven't yet commented on the PR but I was going to suggest that the
>> user add String[] getMainArgs to TestPipelineOptions and not do the
>> convertToArgs hackiness for ITs that use main().
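The conversion described above can be sketched with a simplified stand-in. Note the real TestPipeline#convertToArgs works on TestPipelineOptions via reflection; the Map-based `convertToArgs` below is an assumption for illustration only.

```java
import java.util.Map;

public class ConvertToArgs {
    // Flatten a map of option names to values into the --name=value form
    // expected by a main(String[] args) entry point like WordCount.
    static String[] convertToArgs(Map<String, String> options) {
        return options.entrySet().stream()
                .map(e -> "--" + e.getKey() + "=" + e.getValue())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] out = convertToArgs(Map.of("tempLocation", "/tmp/beam"));
        System.out.println(String.join(" ", out)); // prints: --tempLocation=/tmp/beam
    }
}
```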
>>
>> On Thu, Jan 11, 2018 at 1:53 PM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> Some inputs - maybe comments? - would be appreciated on:
>>>
>>> 1. Why using json in https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestPipeline.java#L480 and not iterate over the options?
>>> 2. Also this is likely used to do a fromArgs and create another pipeline
>>> options later no? Is this api that useful here or does it need an options
>>> composition - fully object?
>>>
>>> Le 11 janv. 2018 19:09, "Lukasz Cwik" <lc...@google.com> a écrit :
>>>
>>>> Give links to the code segments that you want background on.
>>>>
>>>> On Wed, Jan 10, 2018 at 12:44 AM, Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> updated my junit 5 PR to show that: https://github.com/apache/beam/pull/4360/files#diff-578d1770f8b47ebbc1e74a2c19de9a6aR28
>>>>>
>>>>> It doesn't remove jackson yet but exposes a nicer user interface for
>>>>> the config.
>>>>>
>>>>> I'm not fully clear on all the jackson usage yet, there are some round
>>>>> trips (PO -> json -> PO) which are weird without more knowledge.
>>>>>
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>> <https://www.linkedin.com/in/rmannibucau>
>>>>>
>>>>> 2018-01-09 22:57 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>>>>>
>>>>>> Removing large dependencies is great but I can only assume it will be
>>>>>> a lot of work so if you can get it to work in a backwards compatible way
>>>>>> great.
>>>>>>
>>>>>> On Tue, Jan 9, 2018 at 1:52 PM, Romain Manni-Bucau <
>>>>>> rmannibu...@gmail.com> wrote:
>>>>>>
>>>>>>> It conflicts easily between libs and containers. Shade is not a good
>>>>>>> option too - see the thread on this topic :(.
>>>>>>>
>>>>>>> At the end i see using the cli sol

Re: jackson to parse options?

2018-01-11 Thread Romain Manni-Bucau
Some inputs - maybe comments? - would be appreciated on:

1. Why using json in
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestPipeline.java#L480
and not iterate over the options?
2. Also, this is likely used to do a fromArgs and create another pipeline
options instance later, no? Is this API that useful here, or does it need an
options composition - fully object?

Le 11 janv. 2018 19:09, "Lukasz Cwik" <lc...@google.com> a écrit :

> Give links to the code segments that you want background on.
>
> On Wed, Jan 10, 2018 at 12:44 AM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>> updated my junit 5 PR to show that: https://github.com/apache/beam/pull/4360/files#diff-578d1770f8b47ebbc1e74a2c19de9a6aR28
>>
>> It doesn't remove jackson yet but exposes a nicer user interface for the
>> config.
>>
>> I'm not fully clear on all the jackson usage yet, there are some round
>> trips (PO -> json -> PO) which are weird without more knowledge.
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau>
>>
>> 2018-01-09 22:57 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>>
>>> Removing large dependencies is great but I can only assume it will be a
>>> lot of work so if you can get it to work in a backwards compatible way
>>> great.
>>>
>>> On Tue, Jan 9, 2018 at 1:52 PM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> It conflicts easily between libs and containers. Shade is not a good
>>>> option either - see the thread on this topic :(.
>>>>
>>>> In the end I see the CLI solution as closer to the user - vs framework
>>>> dev for json - and hurting less in terms of classpath, so probably
>>>> something to test, no?
>>>>
>>>> Le 9 janv. 2018 22:47, "Lukasz Cwik" <lc...@google.com> a écrit :
>>>>
>>>>> Romain, how has Jackson been a classpath breaker?
>>>>>
>>>>>
>>>>> On Tue, Jan 9, 2018 at 1:20 PM, Romain Manni-Bucau <
>>>>> rmannibu...@gmail.com> wrote:
>>>>>
>>>>>> Hmm, beam already owns the cli parsing - that is what I meant - it
>>>>>> only misses the arg delimiter (ie quoting) and adding it is easy no?
>>>>>>
>>>>>> Le 9 janv. 2018 21:19, "Robert Bradshaw" <rober...@google.com> a
>>>>>> écrit :
>>>>>>
>>>>>>> On Tue, Jan 9, 2018 at 11:48 AM, Romain Manni-Bucau
>>>>>>> <rmannibu...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Le 9 janv. 2018 19:50, "Robert Bradshaw" <rober...@google.com> a
>>>>>>> écrit :
>>>>>>> >
>>>>>>> > From what I understand:
>>>>>>> >
>>>>>>> > 1) The command line argument values (both simple and more complex)
>>>>>>> are
>>>>>>> > all JSON-representable.
>>>>>>> >
>>>>>>> > And must be all CLI representable
>>>>>>>
>>>>>>> Sorry, I should have said "all pipeline options." In any case, one
>>>>>>> can
>>>>>>> always do JSON -> string and the resulting string is usable as a
>>>>>>> command line argument.
>>>>>>>
>>>>>>> > 2) The command line is a mapping of keys to these values.
>>>>>>> >
>>>>>>> > Which makes your 1 not true since json supports more ;)
>>>>>>> >
>>>>>>> > As such, it seems quite natural to represent the whole set as a
>>>>>>> single
>>>>>>> > JSON map, rather than using a different, custom encoding for the
>>>>>>> top
>>>>>>> > level (whose custom escaping would have to be carried into the
>>>>>>> inner
>>>>>>> > JSON values). Note that JSON has the advantage that one never
>>>>>>> needs to
>>>>>>> > explain or define it, and parsers/serializers already exists for
>>&g

Re: Gradle status

2018-01-11 Thread Romain Manni-Bucau
Ok

Will try to create tasks from now on when seeing them.

Can we also put a kind of timeout on the switch to avoid an in-between
state lasting 6 months? Something like April?

Le 11 janv. 2018 19:05, "Jean-Baptiste Onofré" <j...@nanthrax.net> a écrit :

> Hi Luke,
>
> Great: let's eventually add new sub-tasks in this Jira !
>
> Thanks !
> Regards
> JB
>
> On 01/11/2018 07:02 PM, Lukasz Cwik wrote:
>
>> The top level JIRA is here: https://issues.apache.org/jira
>> /browse/BEAM-3249
>>
>> On Thu, Jan 11, 2018 at 9:56 AM, Kenneth Knowles <k...@google.com> wrote:
>>
>> The phase we are in is: "Once Gradle is able to replace Maven in a
>> specific
>> process (or portion thereof), Maven will no longer be maintained for
>> said
>> process (or portion thereof) and will be removed."
>>
>> Once the Gradle presubmit can replace a particular Maven presubmit,
>> we can
>> remove the Maven version.
>>
>> Can you file JIRAs for the things you suspect are shaded incorrectly
>> by our
>> gradle configurations?
>>
>> Kenn
>>
>> On Thu, Jan 11, 2018 at 8:43 AM, Romain Manni-Bucau <
>> rmannibu...@gmail.com
>> <mailto:rmannibu...@gmail.com>> wrote:
>>
>> Hi guys,
>>
>> Do you plan to solve gradle issue (even dropping gradle from beam
>> source), issue I hit is:
>>
>> 1. the gradle build is not equivalent to the maven one - I met some shaded
>> dependencies which shouldn't be shaded, like junit in some modules
>> 2. the gradle build doesn't use the same output directory as maven, so it
>> is not really smooth to have both and to maintain both
>>
>> 3. There are a lot of PRs to flush before being able to switch
>>
>> Any action plan taken to fix that and remove this ambiguity? Once again,
>> I would indeed prefer to stay on Maven but I'm fine with moving to gradle;
>> however, I'd like to stop having both builds and not really knowing what
>> to use, or losing time on the differences between the two.
>>
>> Would the plan be to either do a PR merge effort and then gradle
>> effort
>> or drop it - that are the 2 options I see to exit that state?
>>
>> wdyt?
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> | Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau>
>>
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: jackson to parse options?

2018-01-10 Thread Romain Manni-Bucau
updated my junit 5 PR to show that:
https://github.com/apache/beam/pull/4360/files#diff-578d1770f8b47ebbc1e74a2c19de9a6aR28

It doesn't remove jackson yet but exposes a nicer user interface for the
config.

I'm not fully clear on all the jackson usage yet; there are some round
trips (PO -> json -> PO) which look weird without more knowledge.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-09 22:57 GMT+01:00 Lukasz Cwik <lc...@google.com>:

> Removing large dependencies is great but I can only assume it will be a
> lot of work so if you can get it to work in a backwards compatible way
> great.
>
> On Tue, Jan 9, 2018 at 1:52 PM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> It conflicts easily between libs and containers. Shade is not a good
>> option either - see the thread on this topic :(.
>>
>> In the end I see the CLI solution as closer to the user - vs framework dev
>> for json - and hurting less in terms of classpath, so probably something to
>> test, no?
>>
>> Le 9 janv. 2018 22:47, "Lukasz Cwik" <lc...@google.com> a écrit :
>>
>>> Romain, how has Jackson been a classpath breaker?
>>>
>>>
>>> On Tue, Jan 9, 2018 at 1:20 PM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> Hmm, beam already owns the cli parsing - that is what I meant - it only
>>>> misses the arg delimiter (i.e. quoting) and adding it is easy, no?
>>>>
>>>> Le 9 janv. 2018 21:19, "Robert Bradshaw" <rober...@google.com> a
>>>> écrit :
>>>>
>>>>> On Tue, Jan 9, 2018 at 11:48 AM, Romain Manni-Bucau
>>>>> <rmannibu...@gmail.com> wrote:
>>>>> >
>>>>> > Le 9 janv. 2018 19:50, "Robert Bradshaw" <rober...@google.com> a
>>>>> écrit :
>>>>> >
>>>>> > From what I understand:
>>>>> >
>>>>> > 1) The command line argument values (both simple and more complex)
>>>>> are
>>>>> > all JSON-representable.
>>>>> >
>>>>> > And must be all CLI representable
>>>>>
>>>>> Sorry, I should have said "all pipeline options." In any case, one can
>>>>> always do JSON -> string and the resulting string is usable as a
>>>>> command line argument.
>>>>>
>>>>> > 2) The command line is a mapping of keys to these values.
>>>>> >
>>>>> > Which makes your 1 not true since json supports more ;)
>>>>> >
>>>>> > As such, it seems quite natural to represent the whole set as a
>>>>> single
>>>>> > JSON map, rather than using a different, custom encoding for the top
>>>>> > level (whose custom escaping would have to be carried into the inner
>>>>> > JSON values). Note that JSON has the advantage that one never needs to
>>>>> > explain or define it, and parsers/serializers already exist for all
>>>>> > languages (e.g. if one has a driver script in another language for
>>>>> > launching a java pipeline, it's easy to communicate all the args).
>>>>> >
>>>>> > Same reasonning applies to CLI AFAIK
>>>>>
>>>>> The spec of what a valid command line argument list is is surprisingly
>>>>> inconsistent across platforms, languages, and programs. And your
>>>>> proposal seems to be getting into what the delimiter is, and how to
>>>>> escape it, and possibly then how to escape the escape character. All
>>>>> of this is sidestepped by pointing at an existing spec.
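The escaping problem pointed at above can be seen with a tiny sketch. The `naiveSplit` helper and the sample option string are assumptions for illustration, not anything in Beam: a hand-rolled key=value,key=value encoding breaks as soon as a value contains the delimiter, which is exactly what a JSON parser already handles for nested values.

```java
public class CliEscaping {
    // Naive delimiter-based splitting: no escaping, so any value containing
    // the delimiter is torn apart.
    static String[] naiveSplit(String cli) {
        return cli.split(",");
    }

    public static void main(String[] args) {
        String cli = "tempLocation=/tmp/beam,filesToStage=[a.jar,b.jar]";
        // The list value is split in two: 3 fragments instead of 2 options.
        System.out.println(naiveSplit(cli).length); // prints: 3
    }
}
```

Fixing this naively means defining a delimiter, an escape character, and then an escape for the escape, which is the complexity the existing JSON spec sidesteps.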
>>>>>
>>>>> > We can't get rid of Jackson in the core because of (1), so there's
>>>>> > little value in adding complexity to remove it from (2). The fact that
>>>>> > Java doesn't ship anything in its expansive standard library for this
>>>>> > is unfortunate, so we have to take a dependency on something.
>>>>> >
>>>>> > We actually can as shown before
>>>>>
>>>>> How, if JSON is integral to the parsing of the argument values
>>>>> themselves? (Or is the argument that
