Re: [DISCUSS] Python SDK status and next steps

2017-01-30 Thread Davor Bonaci
Great -- congratulations to everyone who has contributed to the Python SDK!

On Mon, Jan 30, 2017 at 11:10 PM, Ahmet Altay 
wrote:

> Hi all,
>
> This merge is completed. Python SDK is now officially part of the master
> branch! Thank you all for the support. Please open an issue, if you notice
> a reference to the now obsolete python-sdk branch in the documentation.
>
> There will not be any more merges to the python-sdk branch. Going forward
> please use the master branch for Python SDK development. There are a few
> existing open PRs to the python-sdk [1]. If you are the author of one of
> those PRs, please rebase them on top of master.
>
> Thank you,
> Ahmet
>
> [1] https://github.com/pulls?utf8=✓=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>
> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles 
> wrote:
>
> > To clarify the implied criteria of that last exchange, it is "An SDK
> should
> > have at least one runner that can execute the complete model (may be a
> > direct runner)"
> >
> > I want to highlight this, because whether an _SDK_ supports unbounded
> data
> > is not particularly well-defined, and will evolve:
> >
> >  - With the Runner API, an SDK will need to support building a graph with
> > unbounded constructs, as today with probably minimal changes.
> >
> >  - With the Fn API, if any part of the Fn API is specific to unbounded
> > data, the SDK will need to implement it. I think right now there is no
> such
> > thing, and we don't want such a thing, so SDKs implementing the Fn API
> > automatically support unbounded data.
> >
> >  - There will also likely be an SDK-specific shim just as there is today,
> > to leverage idiomatic deserialized representations. The richness of this
> > shim will decrease so that it will need to "support" unbounded data but
> > that will be a ~one liner.
> >
> > Getting the Python SDK on master will accelerate our progress towards the
> > Fn API - partly technical, partly community - which is the best path
> > towards support for unbounded data across multiple runners. I think the
> > criteria are written with the completed portability framework in mind. So
> > this exchange makes me actually more convinced we should merge python-sdk
> > to master.
> >
> > On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
> > rober...@google.com.invalid> wrote:
> >
> > > On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
> > >  wrote:
> > > > I do not think that Python SDK yet meets the bar [1] for implementing
> > the
> > > > Beam model -- supporting Unbounded data is very important. That said,
> > > given
> > > > the committed and sustained set of contributors, it generally makes
> > sense
> > > > to me to make an exception in anticipation of these features being
> > > fleshed
> > > > out soon; including potentially new users/contributors that would
> > arrive
> > > > once in master.
> > > >
> > > > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
> > > > k0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com
> > >
> > > That is a valid point. The Python SDK supports all the unbounded parts
> > > of the model except for unbounded sources, which was deferred while
> > > seeing how https://s.apache.org/splittable-do-fn played out. I've been
> > > working with the team and merging/reviewing most of their code, and
> > > have full confidence this will be coming (and on that note can vouch
> > > for a healthy community and support which are much harder to add
> > > later).
> > >
> > > In short, I think it has the required maturity, and I'm in favor of
> > > merging soonish.
> > >
> > > > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
>  > >
> > > > wrote:
> > > >
> > > >> Thank you all for the comments so far. I would follow the process as
> > > >> suggested by Davor and others in this thread.
> > > >>
> > > >> Ahmet
> > > >>
> > > > >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wik...@apache.org>
> > > >> wrote:
> > > >>
> > > >> > Hi
> > > >> >
> > > >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
> > > >> > wrote:
> > > >> > >
> > > >> > > tl;dr: I would like to start a discussion about merging
> python-sdk
> > > >> branch
> > > >> > > to master branch. Python SDK is mature enough and merging it to
> > > master
> > > >> > will
> > > >> > > accelerate its development and adoption.
> > > >> > >
> > > >> >
> > > >> > Good point, Ahmet!
> > > >> >
> > > > >> > I've been closely following the development since it was
> > > > >> > imported in June. For the prototypes I've implemented so far it
> > > > >> > works quite well; I guess we'd just need to focus the next months
> > > > >> > on bringing more runner support.
> > > >> > 

Re: [DISCUSS] Python SDK status and next steps

2017-01-30 Thread Ahmet Altay
Hi all,

This merge is completed. Python SDK is now officially part of the master
branch! Thank you all for the support. Please open an issue, if you notice
a reference to the now obsolete python-sdk branch in the documentation.

There will not be any more merges to the python-sdk branch. Going forward
please use the master branch for Python SDK development. There are a few
existing open PRs to the python-sdk [1]. If you are the author of one of
those PRs, please rebase them on top of master.

Thank you,
Ahmet

[1] https://github.com/pulls?utf8=✓=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+


On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles 
wrote:

> To clarify the implied criteria of that last exchange, it is "An SDK should
> have at least one runner that can execute the complete model (may be a
> direct runner)"
>
> I want to highlight this, because whether an _SDK_ supports unbounded data
> is not particularly well-defined, and will evolve:
>
>  - With the Runner API, an SDK will need to support building a graph with
> unbounded constructs, as today with probably minimal changes.
>
>  - With the Fn API, if any part of the Fn API is specific to unbounded
> data, the SDK will need to implement it. I think right now there is no such
> thing, and we don't want such a thing, so SDKs implementing the Fn API
> automatically support unbounded data.
>
>  - There will also likely be an SDK-specific shim just as there is today,
> to leverage idiomatic deserialized representations. The richness of this
> shim will decrease so that it will need to "support" unbounded data but
> that will be a ~one liner.
>
> Getting the Python SDK on master will accelerate our progress towards the
> Fn API - partly technical, partly community - which is the best path
> towards support for unbounded data across multiple runners. I think the
> criteria are written with the completed portability framework in mind. So
> this exchange makes me actually more convinced we should merge python-sdk
> to master.
>
> On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
> rober...@google.com.invalid> wrote:
>
> > On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
> >  wrote:
> > > I do not think that Python SDK yet meets the bar [1] for implementing
> the
> > > Beam model -- supporting Unbounded data is very important. That said,
> > given
> > > the committed and sustained set of contributors, it generally makes
> sense
> > > to me to make an exception in anticipation of these features being
> > fleshed
> > > out soon; including potentially new users/contributors that would
> arrive
> > > once in master.
> > >
> > > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
> > > k0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com
> >
> > That is a valid point. The Python SDK supports all the unbounded parts
> > of the model except for unbounded sources, which was deferred while
> > seeing how https://s.apache.org/splittable-do-fn played out. I've been
> > working with the team and merging/reviewing most of their code, and
> > have full confidence this will be coming (and on that note can vouch
> > for a healthy community and support which are much harder to add
> > later).
> >
> > In short, I think it has the required maturity, and I'm in favor of
> > merging soonish.
> >
> > > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
> > > wrote:
> > >
> > >> Thank you all for the comments so far. I would follow the process as
> > >> suggested by Davor and others in this thread.
> > >>
> > >> Ahmet
> > >>
> > >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández
> > >> wrote:
> > >>
> > >> > Hi
> > >> >
> > >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
> > >> > wrote:
> > >> > >
> > >> > > tl;dr: I would like to start a discussion about merging python-sdk
> > >> branch
> > >> > > to master branch. Python SDK is mature enough and merging it to
> > master
> > >> > will
> > >> > > accelerate its development and adoption.
> > >> > >
> > >> >
> > >> > Good point, Ahmet!
> > >> >
> > > >> > I've been closely following the development since it was imported in
> > > >> > June. For the prototypes I've implemented so far it works quite well; I
> > > >> > guess we'd just need to focus the next months on bringing more runner
> > > >> > support.
> > >> >
> > >> > With a great effort from a lot of contributors(*), Python SDK [1] is
> > now
> > >> a
> > >> > > mostly complete, tested, performant Python implementation of the
> > Beam
> > >> > > model. Since June, when we first started with Python SDK in Apache
> > Beam
> > >> > we
> > >> > > have been continuously improving it.
> > >> > >
> > >> >
> > >> > I wouldn't merge during the preparation of 0.5.0 release, but after
> > that
> > >> > could be a good time to merge back into 

Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-30 Thread Davor Bonaci
It looks good to me, but let's hear Aljoscha's opinion on BEAM-1346.

A passing suite of Jenkins jobs:
* https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6870/
* https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/2474/
*
https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Apex/336/
*
https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Flink/1470/
*
https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Spark/786/
*
https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Dataflow/2130/

On Mon, Jan 30, 2017 at 4:40 PM, Dan Halperin  wrote:

> I am worried about https://issues.apache.org/jira/browse/BEAM-1346 for RC1
> and would at least wait for resolution there before proceeding.
>
> On Mon, Jan 30, 2017 at 3:48 AM, Jean-Baptiste Onofré 
> wrote:
>
> > Good catch for the PPMC, I'm upgrading the email template in the release
> > guide (it was a copy/paste).
> >
> > Regards
> > JB
> >
> >
> > On 01/30/2017 11:50 AM, Sergio Fernández wrote:
> >
> >> +1 (non-binding)
> >>
> >> So far I've successfully checked:
> >> * signatures and digests
> >> * source releases file layouts
> >> * matched git tags and commit ids
> >> * incubator suffix and disclaimer
> >> * NOTICE and LICENSE files
> >> * license headers
> >> * clean build (Java 1.8.0_91, Maven 3.3.9, Debian amd64)
> >>
> >> Two minor comments that do not block the release:
> >> * Usually I like to see the commit id referencing the rc, since git tags
> >> can be changed.
> >> * Just a formality, "PPMC" is not a committee that plays a role anymore,
> >> you're a PMC now ;-)
> >>
> >>
> >>
> >> On Fri, Jan 27, 2017 at 9:55 PM, Jean-Baptiste Onofré 
> >> wrote:
> >>
> >> Hi everyone,
> >>>
> >>> Please review and vote on the release candidate #1 for the version
> 0.5.0
> >>> as follows:
> >>>
> >>> [ ] +1, Approve the release
> >>> [ ] -1, Do not approve the release (please provide specific comments)
> >>>
> >>> The complete staging area is available for your review, which includes:
> >>>
> >>> * JIRA release notes [1],
> >>> * the official Apache source release to be deployed to dist.apache.org
> >>> [2], which is signed with the key with fingerprint C8282E76 [3],
> >>> * all artifacts to be deployed to the Maven Central Repository [4],
> >>> * source code tag "v0.5.0-RC1" [5],
> >>> * website pull request listing the release and publishing the API
> >>> reference manual [6].
> >>>
> >>> The vote will be open for at least 72 hours. It is adopted by majority
> >>> approval, with at least 3 PPMC affirmative votes.
> >>>
> >>> Thanks,
> >>> JB
> >>>
> >>> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12338859
> >>> [2] https://dist.apache.org/repos/dist/dev/beam/0.5.0/
> >>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> >>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1010/
> >>> [5] https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=refs/tags/v0.5.0-RC1
> >>> [6] https://github.com/apache/beam-site/pull/132
> >>>
> >>>
> >>
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-30 Thread Dan Halperin
I am worried about https://issues.apache.org/jira/browse/BEAM-1346 for RC1
and would at least wait for resolution there before proceeding.

On Mon, Jan 30, 2017 at 3:48 AM, Jean-Baptiste Onofré 
wrote:

> Good catch for the PPMC, I'm upgrading the email template in the release
> guide (it was a copy/paste).
>
> Regards
> JB
>
>
> On 01/30/2017 11:50 AM, Sergio Fernández wrote:
>
>> +1 (non-binding)
>>
>> So far I've successfully checked:
>> * signatures and digests
>> * source releases file layouts
>> * matched git tags and commit ids
>> * incubator suffix and disclaimer
>> * NOTICE and LICENSE files
>> * license headers
>> * clean build (Java 1.8.0_91, Maven 3.3.9, Debian amd64)
>>
>> Two minor comments that do not block the release:
>> * Usually I like to see the commit id referencing the rc, since git tags
>> can be changed.
>> * Just a formality, "PPMC" is not a committee that plays a role anymore,
>> you're a PMC now ;-)
>>
>>
>>
>> On Fri, Jan 27, 2017 at 9:55 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>> Hi everyone,
>>>
>>> Please review and vote on the release candidate #1 for the version 0.5.0
>>> as follows:
>>>
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>> The complete staging area is available for your review, which includes:
>>>
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to dist.apache.org
>>> [2], which is signed with the key with fingerprint C8282E76 [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v0.5.0-RC1" [5],
>>> * website pull request listing the release and publishing the API
>>> reference manual [6].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval, with at least 3 PPMC affirmative votes.
>>>
>>> Thanks,
>>> JB
>>>
>>> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12338859
>>> [2] https://dist.apache.org/repos/dist/dev/beam/0.5.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1010/
>>> [5] https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=refs/tags/v0.5.0-RC1
>>> [6] https://github.com/apache/beam-site/pull/132
>>>
>>>
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: TextIO binary file

2017-01-30 Thread Eugene Kirpichov
P.S. Note that this point (about coders) is also mentioned in the
now-being-reviewed PTransform Style Guide
https://github.com/apache/beam-site/pull/134
currently staged at
http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders


On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath 
wrote:

> +1 to what Eugene said.
>
> I've seen a number of Python SDK users incorrectly assuming that
> coder.decode() is needed when developing their own file-based sources
> (since many users usually refer to text source first). Probably coder
> parameter should not be configurable for text source/sink and they should
> be updated to only read/write UTF-8 encoded strings.
>
> - Cham
>
> On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
>  wrote:
>
> > The use of Coder in TextIO is a long standing design issue because coders
> > are not intended to be used for general purpose converting things from
> and
> > to bytes, their only proper use is letting the runner materialize and
> > restore objects if the runner thinks it's necessary. IMO it should have
> > been called LineIO, document that it reads lines of text as String, and
> not
> > have a withCoder parameter at all.
> >
> > The proper way to address your use case is to write a custom
> > FileBasedSource.
> > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur  wrote:
> >
> > > The Javadoc of TextIO states:
> > >
> > > * By default, {@link TextIO.Read} returns a {@link PCollection} of
> > > {@link String Strings},
> > >  * each corresponding to one line of an input UTF-8 text file. To
> convert
> > > directly from the raw
> > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
> another
> > > object of type {@code T},
> > >  * supply a {@code Coder} using {@link
> TextIO.Read#withCoder(Coder)}.
> > >
> > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > probably
> > > won't work given the hard-coded '\n' delimiter.
> > >
> > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré
> > > wrote:
> > >
> > > > Hi Aviem,
> > > >
> > > > TextIO is not designed to write/read binary file: it's pure Text, so
> > > > String.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > Hi,
> > > > >
> > > > > While trying to use TextIO to write/read a binary file rather than
> > > String
> > > > > lines from a textual file I ran into an issue - the delimiter
> TextIO
> > > uses
> > > > > seems to be hardcoded '\n'.
> > > > > See `findSeparatorBounds` -
> > > > >
> > > >
> > >
> >
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > >
> > > > > The use case is to have a file of objects, encoded into bytes
> using a
> > > > > coder. However, '\n' is not a good delimiter here, as you can
> > imagine.
> > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > >
> > > >
> > >
> >
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > where
> > > > > they use a more appropriate delimiter, to avoid such issues.
> > > > >
> > > > > I did not find any unit tests which use TextIO to read anything
> other
> > > > than
> > > > > Strings.
> > > > >
> > > >
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> >
>


Re: Build failed in Jenkins: beam_PostCommit_Java_MavenInstall #2473

2017-01-30 Thread Dan Halperin
Hey folks,

It looks like the python-sdk -> master merge went bad and, unfortunately,
we have it configured to email anyone who ever contributed a commit to the
merge, which I think devolves to "anyone who ever committed to that
branch". I've disabled further emails in this job's configuration for the
rest of the day, by which time the build will hopefully be green again.

On Mon, Jan 30, 2017 at 4:24 PM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See  >
>
> --
> [...truncated 12560 lines...]
> hard linking apache_beam/transforms/util.py -> apache-beam-sdk-0.6.0.dev/
> apache_beam/transforms
> hard linking apache_beam/transforms/window.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/transforms
> hard linking apache_beam/transforms/window_test.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/transforms
> hard linking apache_beam/transforms/write_ptransform_test.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/transforms
> hard linking apache_beam/typehints/__init__.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/typehints
> hard linking apache_beam/typehints/decorators.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/typehints
> hard linking apache_beam/typehints/opcodes.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/typehints
> hard linking apache_beam/typehints/trivial_inference.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/typehints
> hard linking apache_beam/typehints/trivial_inference_test.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/typehints
> hard linking apache_beam/typehints/typecheck.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/typehints
> hard linking apache_beam/typehints/typed_pipeline_test.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/typehints
> hard linking apache_beam/typehints/typehints.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/typehints
> hard linking apache_beam/typehints/typehints_test.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/typehints
> hard linking apache_beam/utils/__init__.py -> apache-beam-sdk-0.6.0.dev/
> apache_beam/utils
> hard linking apache_beam/utils/annotations.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/utils
> hard linking apache_beam/utils/annotations_test.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/utils
> hard linking apache_beam/utils/counters.pxd -> apache-beam-sdk-0.6.0.dev/
> apache_beam/utils
> hard linking apache_beam/utils/counters.py -> apache-beam-sdk-0.6.0.dev/
> apache_beam/utils
> hard linking apache_beam/utils/dependency.py -> apache-beam-sdk-0.6.0.dev/
> apache_beam/utils
> hard linking apache_beam/utils/dependency_test.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/utils
> hard linking apache_beam/utils/names.py -> apache-beam-sdk-0.6.0.dev/
> apache_beam/utils
> hard linking apache_beam/utils/path.py -> apache-beam-sdk-0.6.0.dev/
> apache_beam/utils
> hard linking apache_beam/utils/path_test.py -> apache-beam-sdk-0.6.0.dev/
> apache_beam/utils
> hard linking apache_beam/utils/pipeline_options.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/utils
> hard linking apache_beam/utils/pipeline_options_test.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/utils
> hard linking apache_beam/utils/pipeline_options_validator.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/utils
> hard linking apache_beam/utils/pipeline_options_validator_test.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/utils
> hard linking apache_beam/utils/processes.py -> apache-beam-sdk-0.6.0.dev/
> apache_beam/utils
> hard linking apache_beam/utils/processes_test.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/utils
> hard linking apache_beam/utils/profiler.py -> apache-beam-sdk-0.6.0.dev/
> apache_beam/utils
> hard linking apache_beam/utils/retry.py -> apache-beam-sdk-0.6.0.dev/
> apache_beam/utils
> hard linking apache_beam/utils/retry_test.py -> apache-beam-sdk-0.6.0.dev/
> apache_beam/utils
> hard linking apache_beam/utils/timestamp.py -> apache-beam-sdk-0.6.0.dev/
> apache_beam/utils
> hard linking apache_beam/utils/timestamp_test.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/utils
> hard linking apache_beam/utils/windowed_value.pxd ->
> apache-beam-sdk-0.6.0.dev/apache_beam/utils
> hard linking apache_beam/utils/windowed_value.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/utils
> hard linking apache_beam/utils/windowed_value_test.py ->
> apache-beam-sdk-0.6.0.dev/apache_beam/utils
> hard linking apache_beam_sdk.egg-info/PKG-INFO ->
> apache-beam-sdk-0.6.0.dev/apache_beam_sdk.egg-info
> hard linking apache_beam_sdk.egg-info/SOURCES.txt ->
> apache-beam-sdk-0.6.0.dev/apache_beam_sdk.egg-info
> hard linking apache_beam_sdk.egg-info/dependency_links.txt ->
> apache-beam-sdk-0.6.0.dev/apache_beam_sdk.egg-info
> hard linking apache_beam_sdk.egg-info/entry_points.txt ->
> apache-beam-sdk-0.6.0.dev/apache_beam_sdk.egg-info
> hard linking apache_beam_sdk.egg-info/not-zip-safe ->
> apache-beam-sdk-0.6.0.dev/apache_beam_sdk.egg-info
> hard linking 

Re: TextIO binary file

2017-01-30 Thread Eugene Kirpichov
The use of Coder in TextIO is a long-standing design issue: coders are not
intended for general-purpose conversion of things from and to bytes; their
only proper use is letting the runner materialize and restore objects if the
runner thinks it's necessary. IMO it should have been called LineIO, been
documented as reading lines of text as String, and not have had a withCoder
parameter at all.

The proper way to address your use case is to write a custom
FileBasedSource.
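
As a rough illustration of what such a custom source has to do, the sketch
below frames each record with a length prefix instead of relying on a
delimiter. It is plain Python, not the Beam FileBasedSource API; the names
`write_records` and `read_records` and the 4-byte big-endian length prefix
are assumptions made only for this example.

```python
import io
import struct

# Hypothetical framing for this sketch: 4-byte big-endian length, then payload.
_LENGTH = struct.Struct(">I")


def write_records(stream, records):
    """Write each bytes record preceded by its length."""
    for record in records:
        stream.write(_LENGTH.pack(len(record)))
        stream.write(record)


def read_records(stream):
    """Yield the records written by write_records, in order."""
    while True:
        header = stream.read(_LENGTH.size)
        if not header:
            return  # clean end of stream
        if len(header) < _LENGTH.size:
            raise ValueError("truncated length prefix")
        (length,) = _LENGTH.unpack(header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload


# Records may contain any byte, including '\n', and may be empty.
records = [b"plain", b"with\nnewline", b""]
buffer = io.BytesIO()
write_records(buffer, records)
buffer.seek(0)
decoded = list(read_records(buffer))
```

A real source would also have to support splitting a file into ranges, which
this sketch does not attempt.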
On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur  wrote:

> The Javadoc of TextIO states:
>
> * By default, {@link TextIO.Read} returns a {@link PCollection} of
> {@link String Strings},
>  * each corresponding to one line of an input UTF-8 text file. To convert
> directly from the raw
>  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another
> object of type {@code T},
>  * supply a {@code Coder} using {@link TextIO.Read#withCoder(Coder)}.
>
> However, as I stated, `withCoder` doesn't seem to have tests, and probably
> won't work given the hard-coded '\n' delimiter.
>
> On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré 
> wrote:
>
> > Hi Aviem,
> >
> > TextIO is not designed to write/read binary file: it's pure Text, so
> > String.
> >
> > Regards
> > JB
> >
> > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > Hi,
> > >
> > > While trying to use TextIO to write/read a binary file rather than
> String
> > > lines from a textual file I ran into an issue - the delimiter TextIO
> uses
> > > seems to be hardcoded '\n'.
> > > See `findSeparatorBounds` -
> > >
> >
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > >
> > > The use case is to have a file of objects, encoded into bytes using a
> > > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > > A similar pattern is found in Spark's `saveAsObjectFile`
> > >
> >
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > where
> > > they use a more appropriate delimiter, to avoid such issues.
> > >
> > > I did not find any unit tests which use TextIO to read anything other
> > than
> > > Strings.
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: TextIO binary file

2017-01-30 Thread Dan Halperin
Stas' comment is the right one. The "canonical" use of TextIO is using
something like a TextualIntegerCoder, but that should almost certainly be
replaced with TextIO.Read | ParDo.of(Parse integer). The `withCoder` functions
need to get removed or replaced.

For "holding a file of arbitrary records" -- simply producing a
delimiter-separated TextIO is probably not a good choice. Specifically,
splitting is broken when the delimiter might appear in the output (e.g.,
when using almost any coder). A better option is to design a file format to
hold arbitrary records. E.g., an Avro file where each record is just a
byte[].
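
A small sketch of why the delimiter approach breaks, in plain Python rather
than Beam: the fixed-width integer encoding below is only an assumed stand-in
for "almost any coder", and `encode_int` is a hypothetical helper.

```python
import struct


def encode_int(value):
    """Encode an int as 4 big-endian bytes (stand-in for a binary coder)."""
    return struct.pack(">i", value)


values = [1, 10, 266]
encoded = [encode_int(v) for v in values]

# Write the records separated by '\n', as a line-oriented sink would.
payload = b"\n".join(encoded) + b"\n"

# Read them back by splitting on '\n', as a line-oriented source would.
chunks = payload.split(b"\n")[:-1]

# The encodings of 10 and 266 both contain the byte 0x0A, so the reader
# sees more "lines" than there were records.
record_count_matches = len(chunks) == len(values)
```

Here `record_count_matches` is False: three records went in and five chunks
came out, none of which can be reliably reassembled.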

Dan

On Mon, Jan 30, 2017 at 2:52 AM, Aviem Zur  wrote:

> The Javadoc of TextIO states:
>
> * By default, {@link TextIO.Read} returns a {@link PCollection} of
> {@link String Strings},
>  * each corresponding to one line of an input UTF-8 text file. To convert
> directly from the raw
>  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another
> object of type {@code T},
>  * supply a {@code Coder} using {@link TextIO.Read#withCoder(Coder)}.
>
> However, as I stated, `withCoder` doesn't seem to have tests, and probably
> won't work given the hard-coded '\n' delimiter.
>
> On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré 
> wrote:
>
> > Hi Aviem,
> >
> > TextIO is not designed to write/read binary file: it's pure Text, so
> > String.
> >
> > Regards
> > JB
> >
> > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > Hi,
> > >
> > > While trying to use TextIO to write/read a binary file rather than
> String
> > > lines from a textual file I ran into an issue - the delimiter TextIO
> uses
> > > seems to be hardcoded '\n'.
> > > See `findSeparatorBounds` -
> > >
> > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > >
> > > The use case is to have a file of objects, encoded into bytes using a
> > > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > > A similar pattern is found in Spark's `saveAsObjectFile`
> > >
> > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > where
> > > they use a more appropriate delimiter, to avoid such issues.
> > >
> > > I did not find any unit tests which use TextIO to read anything other
> > than
> > > Strings.
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-30 Thread Etienne Chauchot

Hi,

Le 27/01/2017 à 19:44, Robert Bradshaw a écrit :

On Fri, Jan 27, 2017 at 6:55 AM, Etienne Chauchot  wrote:

Hi Robert,

Le 26/01/2017 à 18:17, Robert Bradshaw a écrit :

First off, let me say that a *correctly* batching DoFn has a lot of
value, especially because it's (too) easy to (often unknowingly)
implement it incorrectly.

I definitely agree, I put a similar comment in another email. As an example
I recall a comment from someone on Stack Overflow who said that he would have
forgotten to flush the batch in finishBundle.
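
The flushing pitfall can be sketched in plain Python. This is not the Beam
DoFn API: `BatchingFn` and `run_bundle` are hypothetical names that only
mirror the start-bundle / process / finish-bundle lifecycle for illustration.

```python
class BatchingFn:
    """Buffers elements and calls per_batch_fn whenever a batch is full."""

    def __init__(self, per_batch_fn, batch_size):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._per_batch_fn = per_batch_fn
        self._batch_size = batch_size
        self._buffer = []

    def start_bundle(self):
        self._buffer = []

    def process(self, element):
        """Buffer one element; return outputs if the batch became full."""
        self._buffer.append(element)
        if len(self._buffer) >= self._batch_size:
            return self._flush()
        return []

    def finish_bundle(self):
        """Flush the partial batch; skipping this silently drops elements."""
        return self._flush()

    def _flush(self):
        if not self._buffer:
            return []
        batch, self._buffer = self._buffer, []
        return list(self._per_batch_fn(batch))


def run_bundle(fn, elements):
    """Drive one bundle through the lifecycle and collect all outputs."""
    outputs = []
    fn.start_bundle()
    for element in elements:
        outputs.extend(fn.process(element))
    outputs.extend(fn.finish_bundle())
    return outputs


seen_batches = []


def double_all(batch):
    seen_batches.append(list(batch))
    return [2 * x for x in batch]


result = run_bundle(BatchingFn(double_all, batch_size=2), [1, 2, 3, 4, 5])
```

With five elements and a batch size of two, the last element is only emitted
because `finish_bundle` flushes the partial batch.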

My take is that a BatchingParDo should be a PTransform that takes a DoFn
mapping an Iterable of inputs to an Iterable of outputs as a parameter, as
well as some (optional?) batching criteria (probably batch size and/or batch
timeout).

This is how I implemented it, plus another perElement function that produces
an intermediary type so the user can use a type other than InputType in
perBatchFn (for example, convert elements to DTOs and call an external service
in perBatchFn using those DTOs) or do any other per-element computation before
adding elements to the batch.

I think we should omit the perElement as part of this transform as
that can be done immediately prior to this one without any loss of
generality or utility. One can always wrap this composition in a new
PTransform if desired.
You're right, it is simpler to let the user do it as a pipeline step, 
I'll remove the perElementFn.

Besides I used SimpleFunctions

SimpleFunction perElementFn;
SimpleFunction perBatchFn;

The input ArrayList in perBatchFn is the buffer of elements.

We should be as general as possible, e.g. a SimpleFunction from an Iterable of inputs to an Iterable of outputs.

Yes sure, I've updated it.

Again, letting
this be a DoFn rather than SimpleFunction allows for things such as
setup, teardown, side inputs, etc. but forces complicated delegation
so this is probably a fine start.


Yes, actually, I hesitated; I have opted for the simpler option as a start :)
I guess, as the list of possible use cases grows, we might change to DoFn
to leverage its possibilities.

The DoFn should
map the set of inputs to a set of outputs of the same size and in the
same order as the input (or, possibly, an empty list would be
acceptable). Semantically, it should be defined as

public expand(PCollection input) {
return input
  .apply(e -> SingletonList.of(e))
  .apply(parDo(batchDoFn))
  .apply(es -> Iterables.onlyElement(es));
}
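
The semantic definition above can be checked with a small plain-Python sketch.
It is not Beam code; `apply_unbatched` and `apply_batched` are hypothetical
helpers, and the sketch takes the strict reading that the batch function
returns exactly one output per input (it does not model the "empty list"
variant mentioned above).

```python
def apply_unbatched(per_batch_fn, elements):
    """Reference semantics: every element is its own batch of size one."""
    outputs = []
    for element in elements:
        (only,) = per_batch_fn([element])  # fails unless exactly one output
        outputs.append(only)
    return outputs


def apply_batched(per_batch_fn, elements, batch_size):
    """Apply per_batch_fn to consecutive batches, preserving input order."""
    outputs = []
    for start in range(0, len(elements), batch_size):
        batch = elements[start:start + batch_size]
        results = list(per_batch_fn(batch))
        if len(results) != len(batch):
            raise ValueError("per_batch_fn must return one output per input")
        outputs.extend(results)
    return outputs


def to_upper(batch):
    return [s.upper() for s in batch]


words = ["a", "b", "c", "d", "e"]
reference = apply_unbatched(to_upper, words)
batched = apply_batched(to_upper, words, batch_size=2)
```

If the batch function respects size and order, `batched` equals `reference`
for every batch size, which is the property the transform should guarantee.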

Getting this correct wrt timestamps and windowing is tricky. However,
even something that handles the most trivial case (e.g. GlobalWindows
only) and degenerates to batch sizes of 1 for other cases would allow
people to start using this code (rather than rolling their own) and we
could then continue to refine it.

Yes sure, right now the code handles only the global window case. This is
the very beginning, I'm still in the simple naive approach (no window and no
buffering trans-bundle support),

+1. We should assert on construction that the windowing is global.
Even in the global window case, we'll want to avoid mangling element
timestamps.


I plan to use the state API to buffer across bundles and the timer API (as
Kenn pointed out) to detect the end of the window in the DoFn.

Makes sense. It'd be nice if we could figure out a way to do this
across keys (and windows, when the batch computation isn't sensitive
to this of course).


Thanks for your comments Robert.

Glad to help. Thanks for taking this on.

- Robert

Thanks for your comments

Etienne



Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-30 Thread Jean-Baptiste Onofré
Good catch on the PPMC, I'm updating the email template in the release
guide (it was a copy/paste).


Regards
JB

On 01/30/2017 11:50 AM, Sergio Fernández wrote:

+1 (non-binding)

So far I've successfully checked:
* signatures and digests
* source releases file layouts
* matched git tags and commit ids
* incubator suffix and disclaimer
* NOTICE and LICENSE files
* license headers
* clean build (Java 1.8.0_91, Maven 3.3.9, Debian amd64)

Two minor comments that do not block the release:
* Usually I like to see the commit id referencing the rc, since git tags
can be changed.
* Just a formality, "PPMC" is not a committee that plays a role anymore,
you're a PMC now ;-)



On Fri, Jan 27, 2017 at 9:55 PM, Jean-Baptiste Onofré 
wrote:


Hi everyone,

Please review and vote on the release candidate #1 for the version 0.5.0
as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:

* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org
[2], which is signed with the key with fingerprint C8282E76 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v0.5.0-RC1" [5],
* website pull request listing the release and publishing the API
reference manual [6].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PPMC affirmative votes.

Thanks,
JB

[1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12338859
[2] https://dist.apache.org/repos/dist/dev/beam/0.5.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1010/
[5] https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=refs/tags/v0.5.0-RC1
[6] https://github.com/apache/beam-site/pull/132







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: TextIO binary file

2017-01-30 Thread Aviem Zur
The Javadoc of TextIO states:

 * By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
 * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
 * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object of type {@code T},
 * supply a {@code Coder} using {@link TextIO.Read#withCoder(Coder)}.

However, as I stated, `withCoder` doesn't seem to have tests, and probably
won't work given the hard-coded '\n' delimiter.
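
To illustrate why a newline delimiter is a problem for coder-encoded bytes,
here is a small plain-Java sketch. It does not use TextIO or any Beam coder:
the class NewlineFramingDemo and its methods are hypothetical, the 4-byte
big-endian integer encoding is just an assumed stand-in for a binary coder,
and only '\n' is modeled as a delimiter.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: this is not how TextIO is implemented.
final class NewlineFramingDemo {

    // Assumed stand-in for a binary coder: 4-byte big-endian integers.
    static byte[] encode(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Writes each encoded value followed by a '\n' byte.
    static byte[] writeNewlineDelimited(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int value : values) {
            byte[] encoded = encode(value);
            out.write(encoded, 0, encoded.length);
            out.write('\n');
        }
        return out.toByteArray();
    }

    // Splits the bytes on every '\n', the way a line-oriented reader would.
    static List<byte[]> splitOnNewline(byte[] file) {
        List<byte[]> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < file.length; i++) {
            if (file[i] == '\n') {
                records.add(Arrays.copyOfRange(file, start, i));
                start = i + 1;
            }
        }
        if (start < file.length) {
            records.add(Arrays.copyOfRange(file, start, file.length));
        }
        return records;
    }

    public static void main(String[] args) {
        // The value 10 encodes to 00 00 00 0A, and 0x0A is '\n'.
        int[] values = {1, 10, 300};
        List<byte[]> records = splitOnNewline(writeNewlineDelimited(values));
        // prints "3 values written, 4 records read"
        System.out.println(values.length + " values written, " + records.size() + " records read");
    }
}
```

The encoded value 10 contains the delimiter byte itself, so the reader cuts
that record in two and the record boundaries no longer match what was written.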

On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré 
wrote:

> Hi Aviem,
>
> TextIO is not designed to write/read binary file: it's pure Text, so
> String.
>
> Regards
> JB
>
> On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > Hi,
> >
> > While trying to use TextIO to write/read a binary file rather than String
> > lines from a textual file I ran into an issue - the delimiter TextIO uses
> > seems to be hardcoded '\n'.
> > See `findSeparatorBounds` -
> > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> >
> > The use case is to have a file of objects, encoded into bytes using a
> > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > A similar pattern is found in Spark's `saveAsObjectFile`
> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > where they use a more appropriate delimiter, to avoid such issues.
> >
> > I did not find any unit tests which use TextIO to read anything other than
> > Strings.
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-30 Thread Sergio Fernández
+1 (non-binding)

So far I've successfully checked:
* signatures and digests
* source releases file layouts
* matched git tags and commit ids
* incubator suffix and disclaimer
* NOTICE and LICENSE files
* license headers
* clean build (Java 1.8.0_91, Maven 3.3.9, Debian amd64)

Two minor comments that do not block the release:
* Usually I like to see the commit id referencing the rc, since git tags
can be changed.
* Just a formality, "PPMC" is not a committee that plays a role anymore,
you're a PMC now ;-)



On Fri, Jan 27, 2017 at 9:55 PM, Jean-Baptiste Onofré 
wrote:

> Hi everyone,
>
> Please review and vote on the release candidate #1 for the version 0.5.0
> as follows:
>
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
>
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org
> [2], which is signed with the key with fingerprint C8282E76 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v0.5.0-RC1" [5],
> * website pull request listing the release and publishing the API
> reference manual [6].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PPMC affirmative votes.
>
> Thanks,
> JB
>
> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12338859
> [2] https://dist.apache.org/repos/dist/dev/beam/0.5.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1010/
> [5] https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=refs/tags/v0.5.0-RC1
> [6] https://github.com/apache/beam-site/pull/132
>



-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co


Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-30 Thread Alexey Demin
Hi

All good, but the release-0.5.0 branch can no longer be built because the
last commit did not revert these files

sdks/java/maven-archetypes/examples-java8/src/main/resources/archetype-resources/pom.xml
sdks/java/maven-archetypes/examples/src/main/resources/archetype-resources/pom.xml
sdks/java/maven-archetypes/starter/src/main/resources/archetype-resources/pom.xml
sdks/java/maven-archetypes/starter/src/test/resources/projects/basic/reference/pom.xml

from 0.5.0 to
 0.5.0-SNAPSHOT

As a result, mvn can't find the necessary artifacts.

Thanks,
Alexey


2017-01-30 14:01 GMT+04:00 Ismaël Mejía :

> +1 (non-binding)
>
> - verified signatures + checksums
> - ran mvn clean verify -Prelease; all artifacts build and the tests run
> smoothly
>
> Great to see a shorter release cycle, the improvements and the new IOs.
>
>
> On Fri, Jan 27, 2017 at 9:55 PM, Jean-Baptiste Onofré 
> wrote:
>
> > Hi everyone,
> >
> > Please review and vote on the release candidate #1 for the version 0.5.0
> > as follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > The complete staging area is available for your review, which includes:
> >
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to dist.apache.org
> > [2], which is signed with the key with fingerprint C8282E76 [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "v0.5.0-RC1" [5],
> > * website pull request listing the release and publishing the API
> > reference manual [6].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PPMC affirmative votes.
> >
> > Thanks,
> > JB
> >
> > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12338859
> > [2] https://dist.apache.org/repos/dist/dev/beam/0.5.0/
> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > [4] https://repository.apache.org/content/repositories/orgapachebeam-1010/
> > [5] https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=refs/tags/v0.5.0-RC1
> > [6] https://github.com/apache/beam-site/pull/132
> >
>


Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-30 Thread Jean-Baptiste Onofré

+1 (binding)

Regards
JB

On 01/27/2017 09:55 PM, Jean-Baptiste Onofré wrote:

Hi everyone,

Please review and vote on the release candidate #1 for the version 0.5.0
as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:

* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org
[2], which is signed with the key with fingerprint C8282E76 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v0.5.0-RC1" [5],
* website pull request listing the release and publishing the API
reference manual [6].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PPMC affirmative votes.

Thanks,
JB

[1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12338859
[2] https://dist.apache.org/repos/dist/dev/beam/0.5.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1010/
[5] https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=refs/tags/v0.5.0-RC1
[6] https://github.com/apache/beam-site/pull/132


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


TextIO binary file

2017-01-30 Thread Aviem Zur
Hi,

While trying to use TextIO to write/read a binary file rather than String
lines from a textual file I ran into an issue - the delimiter TextIO uses
seems to be hardcoded '\n'.
See `findSeparatorBounds` -
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024

The use case is to have a file of objects, encoded into bytes using a
coder. However, '\n' is not a good delimiter here, as you can imagine.
A similar pattern is found in Spark's `saveAsObjectFile`
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
where they use a more appropriate delimiter, to avoid such issues.

I did not find any unit tests which use TextIO to read anything other than
Strings.
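
For illustration, one common way to store arbitrary encoded records without
relying on a delimiter byte is to prefix each record with its length. The
plain-Java sketch below is only an assumption about what such a format could
look like: LengthPrefixedFraming and its methods are hypothetical names, and
this is neither what TextIO does nor a description of Spark's file format.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: a length-prefixed record layout.
final class LengthPrefixedFraming {

    // Writes each record as a 4-byte big-endian length followed by its bytes.
    static byte[] write(List<byte[]> records) {
        int total = 0;
        for (byte[] record : records) {
            total += 4 + record.length;
        }
        ByteBuffer buffer = ByteBuffer.allocate(total);
        for (byte[] record : records) {
            buffer.putInt(record.length);
            buffer.put(record);
        }
        return buffer.array();
    }

    // Reads records back by following the length prefixes; the payload bytes
    // are never inspected, so they may contain '\n' or any other value.
    static List<byte[]> read(byte[] file) {
        List<byte[]> records = new ArrayList<>();
        ByteBuffer buffer = ByteBuffer.wrap(file);
        while (buffer.hasRemaining()) {
            byte[] record = new byte[buffer.getInt()];
            buffer.get(record);
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        List<byte[]> records = Arrays.asList(
                new byte[] {'\n', '\r'},   // payload made of delimiter-like bytes
                new byte[0],               // empty record
                new byte[] {1, 2, 3});
        List<byte[]> roundTripped = read(write(records));
        System.out.println(roundTripped.size() + " records read"); // 3 records read
    }
}
```

The trade-off is that such a file cannot be split at an arbitrary byte offset
the way a line-delimited text file can, so a real source would need some
other way to find record boundaries.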