Re: Jenkins build became unstable: beam_PostCommit_Java_RunnableOnService_Apex #363

2017-01-31 Thread Jason Kuster
This seems like it could be a legitimate flake.

Expected: <1970-01-01T00:09:59.999Z>
 but: was <2017-02-01T01:38:42.261Z>


Does anyone with more knowledge about the Apex runner have any ideas?

On Tue, Jan 31, 2017 at 5:48 PM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See  RunnableOnService_Apex/363/changes>
>
>


-- 
---
Jason Kuster
Apache Beam / Google Cloud Dataflow


Re: How to fire the global window when using GroupAlsoByWindowViaWindowSetDoFn?

2017-01-31 Thread Kenneth Knowles
Hi Shen,

Your runner should advance the watermark for the PCollection coming out of
the BoundedSource to BoundedWindow.MAX_TIMESTAMP, which is "positive
infinity" and indicates that even the global window has fired/expired (for
the global window these are the same instant).

Kenn
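Kenn's answer can be illustrated with a small self-contained model (illustrative names only, not actual Beam runner classes): a grouping stage buffers elements per key, and the global window fires only once the watermark it observes reaches the maximum timestamp, which the runner advances to when the BoundedSource is exhausted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the behavior described above. A grouping operator buffers
// elements per key and emits the (global-window) groups only when the input
// watermark reaches the window's maximum timestamp.
class ToyGroupByWindow {
    // Stand-in for the global window's maximum timestamp ("positive infinity").
    static final long MAX_TIMESTAMP = Long.MAX_VALUE;

    private final Map<String, List<Integer>> buffer = new HashMap<>();
    private final Map<String, List<Integer>> output = new HashMap<>();

    void processElement(String key, int value) {
        buffer.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Called by the runner when the upstream watermark advances. The global
    // window's end is MAX_TIMESTAMP, so it fires only when the watermark is
    // advanced all the way, i.e. once the BoundedSource is fully consumed.
    void advanceWatermark(long watermark) {
        if (watermark >= MAX_TIMESTAMP) {
            output.putAll(buffer);
            buffer.clear();
        }
    }

    Map<String, List<Integer>> getOutput() { return output; }
}
```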

On Tue, Jan 31, 2017 at 8:32 PM, Shen Li  wrote:

> Hi,
>
> My runner is translating GroupByKey using
> GroupAlsoByWindowViaWindowSetDoFn. Say I have a BoundedSource with five
> tuples all placed into a global window. When the source is depleted, how
> should the runner notify the downstream
> GroupByKey(GroupAlsoByWindowViaWindowSetDoFn) that it should fire the
> global window?
>
> Thanks,
>
> Shen
>


How to fire the global window when using GroupAlsoByWindowViaWindowSetDoFn?

2017-01-31 Thread Shen Li
Hi,

My runner is translating GroupByKey using
GroupAlsoByWindowViaWindowSetDoFn. Say I have a BoundedSource with five
tuples all placed into a global window. When the source is depleted, how
should the runner notify the downstream
GroupByKey(GroupAlsoByWindowViaWindowSetDoFn) that it should fire the
global window?

Thanks,

Shen


Re: Build failed in Jenkins: beam_PostCommit_Java_RunnableOnService_Spark #807

2017-01-31 Thread Kenneth Knowles
Issue communicating with Maven central.

On Tue, Jan 31, 2017 at 7:12 PM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See  RunnableOnService_Spark/807/changes>
>
> Changes:
>
> [kirpichov] Removes inputProvider() and outputReceiver()
>
> --
> [...truncated 28 lines...]
> maven32-agent.jar already up to date
> maven32-interceptor.jar already up to date
> maven3-interceptor-commons.jar already up to date
> [beam_PostCommit_Java_RunnableOnService_Spark] $ 
> /home/jenkins/tools/java/latest1.8/bin/java
> -Dorg.slf4j.simpleLogger.showDateTime=true -Dorg.slf4j.simpleLogger.
> dateTimeFormat=-MM-dd'T'HH:mm:ss.SSS -cp /home/jenkins/jenkins-slave/
> maven32-agent.jar:/home/jenkins/tools/maven/apache-
> maven-3.3.3/boot/plexus-classworlds-2.5.2.jar:/home/
> jenkins/tools/maven/apache-maven-3.3.3/conf/logging 
> jenkins.maven3.agent.Maven32Main
> /home/jenkins/tools/maven/apache-maven-3.3.3 
> /home/jenkins/jenkins-slave/slave.jar
> /home/jenkins/jenkins-slave/maven32-interceptor.jar
> /home/jenkins/jenkins-slave/maven3-interceptor-commons.jar 40870
> <===[JENKINS REMOTING CAPACITY]===>   channel started
> Executing Maven:  -B -f  job/beam_PostCommit_Java_RunnableOnService_Spark/ws/pom.xml>
> -Dmaven.repo.local= RunnableOnService_Spark/ws/.repository> -B -e clean verify -am -pl
> runners/spark -Prunnable-on-service-tests -Plocal-runnable-on-service-tests
> -Dspark.port.maxRetries=64 -Dspark.ui.enabled=false
> 2017-02-01T03:09:42.175 [INFO] Error stacktraces are turned on.
> 2017-02-01T03:09:42.293 [INFO] Scanning for projects...
> 2017-02-01T03:09:43.481 [INFO] Downloading: https://repo.maven.apache.org/
> maven2/kr/motd/maven/os-maven-plugin/1.4.0.Final/os-maven-
> plugin-1.4.0.Final.pom
> 2017-02-01T03:09:43.904 [INFO] Downloaded: https://repo.maven.apache.org/
> maven2/kr/motd/maven/os-maven-plugin/1.4.0.Final/os-maven-
> plugin-1.4.0.Final.pom (7 KB at 14.1 KB/sec)
> 2017-02-01T03:09:43.930 [INFO] Downloading: https://repo.maven.apache.org/
> maven2/org/sonatype/oss/oss-parent/9/oss-parent-9.pom
> 2017-02-01T03:09:43.975 [INFO] Downloaded: https://repo.maven.apache.org/
> maven2/org/sonatype/oss/oss-parent/9/oss-parent-9.pom (7 KB at 130.9
> KB/sec)
> 2017-02-01T03:09:44.000 [INFO] Downloading: https://repo.maven.apache.org/
> maven2/org/apache/maven/maven-plugin-api/3.2.1/maven-plugin-api-3.2.1.pom
> 2017-02-01T03:09:44.036 [INFO] Downloaded: https://repo.maven.apache.org/
> maven2/org/apache/maven/maven-plugin-api/3.2.1/maven-plugin-api-3.2.1.pom
> (4 KB at 94.4 KB/sec)
> 2017-02-01T03:09:44.038 [INFO] Downloading: https://repo.maven.apache.org/
> maven2/org/apache/maven/maven/3.2.1/maven-3.2.1.pom
> 2017-02-01T03:09:44.096 [INFO] Downloaded: https://repo.maven.apache.org/
> maven2/org/apache/maven/maven/3.2.1/maven-3.2.1.pom (23 KB at 380.3
> KB/sec)
> 2017-02-01T03:09:44.099 [INFO] Downloading: https://repo.maven.apache.org/
> maven2/org/apache/maven/maven-parent/23/maven-parent-23.pom
> 2017-02-01T03:09:44.145 [INFO] Downloaded: https://repo.maven.apache.org/
> maven2/org/apache/maven/maven-parent/23/maven-parent-23.pom (32 KB at
> 691.8 KB/sec)
> 2017-02-01T03:09:44.153 [INFO] Downloading: https://repo.maven.apache.org/
> maven2/org/apache/apache/13/apache-13.pom
> 2017-02-01T03:09:44.188 [INFO] Downloaded: https://repo.maven.apache.org/
> maven2/org/apache/apache/13/apache-13.pom (14 KB at 379.1 KB/sec)
> 2017-02-01T03:09:44.198 [INFO] Downloading: https://repo.maven.apache.org/
> maven2/org/apache/maven/maven-model/3.2.1/maven-model-3.2.1.pom
> 2017-02-01T03:09:44.228 [INFO] Downloaded: https://repo.maven.apache.org/
> maven2/org/apache/maven/maven-model/3.2.1/maven-model-3.2.1.pom (5 KB at
> 130.4 KB/sec)
> 2017-02-01T03:09:44.234 [INFO] Downloading: https://repo.maven.apache.org/
> maven2/org/codehaus/plexus/plexus-utils/3.0.17/plexus-utils-3.0.17.pom
> 2017-02-01T03:09:44.278 [INFO] Downloaded: https://repo.maven.apache.org/
> maven2/org/codehaus/plexus/plexus-utils/3.0.17/plexus-utils-3.0.17.pom (4
> KB at 77.1 KB/sec)
> 2017-02-01T03:09:44.281 [INFO] Downloading: https://repo.maven.apache.org/
> maven2/org/codehaus/plexus/plexus/3.3.1/plexus-3.3.1.pom
> 2017-02-01T03:09:44.322 [INFO] Downloaded: https://repo.maven.apache.org/
> maven2/org/codehaus/plexus/plexus/3.3.1/plexus-3.3.1.pom (20 KB at 487.0
> KB/sec)
> 2017-02-01T03:09:44.326 [INFO] Downloading: https://repo.maven.apache.org/
> maven2/org/sonatype/spice/spice-parent/17/spice-parent-17.pom
> 2017-02-01T03:09:44.360 [INFO] Downloaded: https://repo.maven.apache.org/
> maven2/org/sonatype/spice/spice-parent/17/spice-parent-17.pom (7 KB at
> 194.1 KB/sec)
> 2017-02-01T03:09:44.363 [INFO] Downloading: https://repo.maven.apache.org/
> maven2/org/sonatype/forge/forge-parent/10/forge-parent-10.pom
> 2017-02-01T03:09:44.396 [INFO] Downloaded: 

Re: Let's make Beam transforms comply with PTransform Style Guide

2017-01-31 Thread Eugene Kirpichov
On Mon, Jan 30, 2017 at 7:56 PM Dan Halperin 
wrote:

> On Mon, Jan 30, 2017 at 5:42 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> > Hello,
> >
> > The PTransform Style Guide is live
> > https://beam.apache.org/contribute/ptransform-style-guide/ - a natural
> > next
> > step is to audit Beam libraries for compliance and file JIRAs for places
> > that need to be fixed. It'd be great to finish these cleanups before
> > declaring Beam stable API.
> >
> > Please take a look and file JIRAs / post suggestions on this thread!
> >
> > I think it'll also make a great source of easy and useful work for new
> > contributors.
> >
> > Some things I remember off the top of my head:
> > - TextIO, KafkaIO use coders improperly - coders should not be used as a
> > general-purpose byte parsing mechanism.
> >
>
> Can you say more about Kafka? Kafka actually exports byte[] by default,
> whereas Text files are String by default. So it does not seem nearly as
> egregious for Kafka as it is for Text.
>
Agreed that KafkaIO is less egregious, but it still has methods
withKeyCoder and withValueCoder - these should be replaced with something
that doesn't take Coder.
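As a sketch of what "something that doesn't take Coder" could look like, here is a hypothetical parse-function-based shape: parsing Kafka's raw byte[] keys/values is made explicit rather than hidden behind a Coder's wire format. None of these names are real KafkaIO API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.function.Function;

// Hypothetical alternative to withKeyCoder/withValueCoder: the caller
// supplies explicit parse functions for Kafka's raw bytes. Illustrative
// names only; not the actual KafkaIO API.
class KafkaRecordParser<K, V> {
    private final Function<byte[], K> keyParseFn;
    private final Function<byte[], V> valueParseFn;

    KafkaRecordParser(Function<byte[], K> keyParseFn, Function<byte[], V> valueParseFn) {
        this.keyParseFn = keyParseFn;
        this.valueParseFn = valueParseFn;
    }

    // The connector hands over raw bytes; parsing is an explicit user choice.
    Map.Entry<K, V> parse(byte[] rawKey, byte[] rawValue) {
        return Map.entry(keyParseFn.apply(rawKey), valueParseFn.apply(rawValue));
    }
}
```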


>
> - HadoopFileSource is not packaged as a PTransform
> > - Some connectors, e.g. KafkaIO, should use AutoValue for their parameter
> > builders, but don't
> >
>
> Isn't AutoValue entirely an internal implementation detail that is not
> exposed(*) to users? I think this is irrelevant to a stable API.
>
Agreed - doesn't block stable API, but still a good thing to do because it
makes the code cleaner (for KafkaIO there's a long-standing PR that was
blocked on ratifying the style guide
https://github.com/apache/beam/pull/1048)


>
> (*) except that it makes transforms not able to be final, which is a
> regression.
>
> I think AutoValue use should generally be considered *very* optional. In
> transforms I author, I prefer not to use AutoValue because it makes the
> code more complex and less readable.
>
Yeah, guidance on when to use / not use AutoValue could be improved. I
think it makes a lot of sense when the transform has more than one or two
parameters or when the set of parameters can grow.
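For context on the trade-off, the pattern AutoValue generates is roughly the following hand-written immutable value with copy-on-change "wither" methods (AutoValue itself relies on an annotation processor; all names here are illustrative, not any real transform's configuration):

```java
// Hand-written sketch of the immutable value + builder pattern that
// AutoValue automates. The payoff grows with the number of parameters:
// adding a new one later doesn't break existing call sites.
final class ReadConfig {
    private final String topic;
    private final int maxRecords;

    private ReadConfig(String topic, int maxRecords) {
        this.topic = topic;
        this.maxRecords = maxRecords;
    }

    String topic() { return topic; }
    int maxRecords() { return maxRecords; }

    // Each "wither" returns a modified copy, keeping the config immutable.
    ReadConfig withTopic(String topic) { return new ReadConfig(topic, maxRecords); }
    ReadConfig withMaxRecords(int maxRecords) { return new ReadConfig(topic, maxRecords); }

    static ReadConfig create() { return new ReadConfig(null, Integer.MAX_VALUE); }
}
```

With one or two parameters this boilerplate (which AutoValue would generate) arguably costs more readability than it buys, which matches the guidance above.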


>
>
> > - A few connectors improperly use
> > - Some transforms expose their transform type as "Something.Bound" and
> > "Something.Unbound", e.g. TextIO.Read.Bound - such names are banned
> >
>
> "banned" is a strong word to use here. All of these are just
> recommendations.
>
In general yes; the goal of the style guide is to be the default, where if
you deviate from it, you should have a good reason. I don't think there
ever exists a good reason to name a transform Something.Bound/Unbound
though.


>
>
> >
> > I filed an umbrella JIRA https://issues.apache.org/jira/browse/BEAM-1353
> > about
> > making existing Beam transforms comply with the guide - let's crowdsource
> > this!
> >
> > Thanks.
> >
>


Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-31 Thread Aljoscha Krettek
I opened this PR with three revert commits:
https://github.com/apache/beam/pull/1883

I also started PostCommit runs for this:
 -
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_MavenInstall/2486/
 -
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Flink/1493/
 -
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Spark/803/
 -
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Apex/
(still
waiting in queue as of writing)
 -
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Dataflow/
(still
waiting in queue as of writing)

I think the MavenInstall hooks fail because the (Google-internal) Dataflow
Runner Harness doesn't work with the changed code, though I'm only guessing
here.


On Tue, 31 Jan 2017 at 21:26 Aljoscha Krettek  wrote:

Agreed, since it's a regression. Let's hope that the transitive closure of
"revert those two commits" doesn't get too large.

I'll check out the release-0.5.0 branch and see where we get with reverting.

On Tue, 31 Jan 2017 at 19:28 Kenneth Knowles  wrote:

I agree. -1 and let's do the smartest thing to undo the regression.

Those two commits are not sufficient to restore late data dropping. You'll
also need to revert the switch of the Flink runner to use new DoFn, maybe
more.

On Tue, Jan 31, 2017 at 10:21 AM, Jean-Baptiste Onofré 
wrote:

> Basically, my question is: is it a regression ? If yes, definitely a -1
> and we should cancel the release.
>
> Correct me if I'm wrong, but the commits in the LateDataDroppingDoFnRunner
> introduced a regression. So, I would cancel this vote and revert the two
> commits for RC2.
>
> WDYT ?
>
> Regards
> JB
>
>
> On 01/31/2017 07:13 PM, Dan Halperin wrote:
>
>> Should we revert the CLs that lost the functionality? I'd really not like
>> to ship a release with such a functional regression
>>
>> On Tue, Jan 31, 2017 at 10:07 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>> Fair enough. Let's do that.
>>>
>>> Thanks !
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 01/31/2017 06:58 PM, Aljoscha Krettek wrote:
>>>
I'm not sure. Properly fixing this will take some time, especially since
 we
 have to add tests to prevent breakage from happening in the future.
 Plus,
 if my analysis is correct other runners might also not have proper late
 data dropping and it's fine to have a release with some missing
 features.
 (There's more besides dropping.)

 I think we should go ahead and fix for 0.6.

 On Tue, Jan 31, 2017, 18:23 Jean-Baptiste Onofré 
 wrote:

 Hi Aljoscha,

>
> so you propose to cancel this vote to prepare a RC2 ?
>
> Regards
> JB
>
> On 01/31/2017 05:06 PM, Aljoscha Krettek wrote:
>
> It's not just an issue with the Flink Runner, if I'm not mistaken.
>>
>> Flink had late-data dropping via the LateDataDroppingDoFnRunner
(which
>>
>> got
>
> "disabled" by the two commits I mention in the issue) while I think
>> that
>> the Apex and Spark Runners might not have had dropping in the first
>>
>> place.
>
> (Not sure about this last part.)
>>
>> As I now wrote to the issue I think this could be a blocker because
we
>> don't have the correct output in some cases.
>>
>> On Tue, 31 Jan 2017 at 02:16 Davor Bonaci  wrote:
>>
>> It looks good to me, but let's hear Aljoscha's opinion on BEAM-1346.
>>
>>>
>>> A passing suite of Jenkins jobs:
>>> * https://builds.apache.org/job/beam_PreCommit_Java_MavenInsta
>>> ll/6870/
>>> * https://builds.apache.org/job/beam_PostCommit_Java_MavenInst
>>> all/2474/
>>> *
>>>
>>>
>>> https://builds.apache.org/job/beam_PostCommit_Java_RunnableO
>>>
>> nService_Apex/336/
>
> *
>>
>>>
>>>
>>> https://builds.apache.org/job/beam_PostCommit_Java_RunnableO
>>>
>> nService_Flink/1470/
>
> *
>>
>>>
>>>
>>> https://builds.apache.org/job/beam_PostCommit_Java_RunnableO
>>>
>> nService_Spark/786/
>
> *
>>
>>>
>>>
>>> https://builds.apache.org/job/beam_PostCommit_Java_RunnableO
>>>
>> nService_Dataflow/2130/
>
>
>> On Mon, Jan 30, 2017 at 4:40 PM, Dan Halperin 
>>>
>>> wrote:
>>
>
>
>> I am worried about https://issues.apache.org/jira/browse/BEAM-1346
>>> for
>>>

 RC1
>>>
>>> and would at least wait for resolution there before proceeding.

 On Mon, Jan 30, 2017 at 3:48 AM, Jean-Baptiste Onofré <
 j...@nanthrax.net


>>> wrote:
>>
>>>
 Good catch for the PPMC, I'm upgrading the email template in the

>
> release

Re: Doesn't PAssertTest.runExpectingAssertionFailure need to call waitUntilFinish?

2017-01-31 Thread Shen Li
Hi Dan,

Thanks a lot for the explanation. :)

Best,

Shen

On Tue, Jan 31, 2017 at 4:19 PM, Dan Halperin 
wrote:

> Hi Shen,
>
> Great question. The trick is that the `pipeline` object is an instance of
> TestPipeline [0], for which p.run() is the same as
> p.run().waitUntilFinish().
>
> It might be documentationally better to use p.run().waitUntilFinish() to be
> consistent with real runners, or add a method to TestPipeline
> p.runTestPipeline() to signal that this works only in tests. At the same
> time, that would complicate writing tests, which we don't really want to
> do... so it's a tradeoff that may be okay as-is.
>
> Dan
>
> [0]
> https://github.com/apache/beam/blob/master/sdks/java/
> core/src/test/java/org/apache/beam/sdk/testing/PAssertTest.java#L64
>
>
>
> On Tue, Jan 31, 2017 at 1:07 PM, Shen Li  wrote:
>
> > Hi,
> >
> > In the PAssertTest, doesn't it need to append a "waitUntilFinish()" to
> the
> > "pipeline.run()" (please see the link below)? Otherwise, the runner may
> > return the PipelineResult immediately without actually kicking off the
> > execution, and therefore the AssertionError won't be thrown. Or did I
> miss
> > anything?
> >
> > https://github.com/apache/beam/blob/master/sdks/java/
> > core/src/test/java/org/apache/beam/sdk/testing/PAssertTest.java#L399
> >
> > Thanks,
> >
> > Shen
> >
>


Re: Doesn't PAssertTest.runExpectingAssertionFailure need to call waitUntilFinish?

2017-01-31 Thread Dan Halperin
Hi Shen,

Great question. The trick is that the `pipeline` object is an instance of
TestPipeline [0], for which p.run() is the same as
p.run().waitUntilFinish().

It might be documentationally better to use p.run().waitUntilFinish() to be
consistent with real runners, or add a method to TestPipeline
p.runTestPipeline() to signal that this works only in tests. At the same
time, that would complicate writing tests, which we don't really want to
do... so it's a tradeoff that may be okay as-is.

Dan

[0]
https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/testing/PAssertTest.java#L64
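A toy model of the distinction Dan explains (illustrative only, not the Beam API): with an asynchronous runner, run() returns a result handle before execution completes, so a PAssert-style AssertionError only surfaces when the caller blocks on that handle.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

// Toy model of why waitUntilFinish() matters on a real runner: run() may
// return before the pipeline executes, so a failure raised inside the
// pipeline surfaces only when the caller blocks on the result.
class ToyRunner {
    // Like an async runner: start the work and return a handle immediately.
    static CompletableFuture<Void> run(Runnable pipeline) {
        return CompletableFuture.runAsync(pipeline);
    }

    // Like PipelineResult.waitUntilFinish(): block and surface any failure.
    static Throwable waitUntilFinish(CompletableFuture<Void> result) {
        try {
            result.get();
            return null;
        } catch (ExecutionException e) {
            return e.getCause();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        }
    }
}
```

TestPipeline hides this by making run() block, which is convenient for tests but differs from the behavior of real runners, hence the documentation concern above.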



On Tue, Jan 31, 2017 at 1:07 PM, Shen Li  wrote:

> Hi,
>
> In the PAssertTest, doesn't it need to append a "waitUntilFinish()" to the
> "pipeline.run()" (please see the link below)? Otherwise, the runner may
> return the PipelineResult immediately without actually kicking off the
> execution, and therefore the AssertionError won't be thrown. Or did I miss
> anything?
>
> https://github.com/apache/beam/blob/master/sdks/java/
> core/src/test/java/org/apache/beam/sdk/testing/PAssertTest.java#L399
>
> Thanks,
>
> Shen
>


Re: TextIO binary file

2017-01-31 Thread Robert Bradshaw
On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur  wrote:
> +1 on what Stas said.
> I think there is value in not having the user write a custom IO for a
> protocol they use which is not covered by Beam IOs. Plus having them deal
> with not only the encoding but also the IO part is not ideal.
> I think having a basic FileIO that can write to the Filesystems supported
> by Beam (GS/HDFS/Local/...) which you can use any coder with, including
> your own custom coder, can be beneficial.

What would the format of the file be? Just the concatenation of the
elements encoded according to the coder? Or is there a delimiter
needed to separate records? In that case, how does one ensure the
delimiter does not also appear in the middle of an encoded element? At
this point you're developing a file format, and might as well stick
with one of the standard ones. https://xkcd.com/927
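One standard answer to the delimiter question is length-prefix framing: prefix each encoded record with its byte length so that arbitrary bytes (including would-be delimiters) may appear inside a record. The sketch below is plain Java under assumed names, not any Beam API, and it illustrates Robert's point: this is already file-format design, and unlike e.g. Avro's sync-marker blocks, naive length prefixes don't support splitting a file at arbitrary offsets for parallel reads.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Length-prefixed record framing: no delimiter byte is needed, so encoded
// records may contain any bytes, including '\n'.
class LengthPrefixedRecords {
    static byte[] write(List<byte[]> records) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (byte[] r : records) {
            out.writeInt(r.length);  // 4-byte length header
            out.write(r);            // raw record bytes, no escaping needed
        }
        return bos.toByteArray();
    }

    static List<byte[]> read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        List<byte[]> records = new ArrayList<>();
        while (in.available() > 0) {
            byte[] r = new byte[in.readInt()];
            in.readFully(r);
            records.add(r);
        }
        return records;
    }
}
```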

> On Tue, Jan 31, 2017 at 7:56 PM Stas Levin  wrote:
>
> I believe the motivation is to have an abstraction that allows one to write
> stuff to a file in a way that is agnostic to the coder.
> If one needs to write a non-Avro protocol to a file, and this particular
> protocol does not meet the assumption made by TextIO, one might need to
> duplicate the file IO related code from AvroIO.
>
> On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
>  wrote:
>
>> Could you clarify why it would be useful to write objects to files using
>> Beam coders, as opposed to just using e.g. AvroIO?
>>
>> Coders (should) make no promise as to what their wire format is, so such
>> files could be read back only by other Beam pipelines using the same IO.
>>
>> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur  wrote:
>>
>> > So If I understand the general agreement is that TextIO should not
>> support
>> > anything but lines from files as strings.
>> > I'll go ahead and file a ticket that says the Javadoc should be changed
>> to
>> > reflect this and `withCoder` method should be removed.
>> >
>> > Is there merit for Beam to supply an IO which does allow writing objects
>> to
>> > a file using Beam coders and Beam FS (To write these files to
>> > GS/Hadoop/Local)?
>> >
>> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
>> >  wrote:
>> >
>> > P.S. Note that this point (about coders) is also mentioned in the
>> > now-being-reviewed PTransform Style Guide
>> > https://github.com/apache/beam-site/pull/134
>> > currently staged at
>> >
>> >
>>
> http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
>> >
>> >
>> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath > >
>> > wrote:
>> >
>> > > +1 to what Eugene said.
>> > >
>> > > I've seen a number of Python SDK users incorrectly assuming that
>> > > coder.decode() is needed when developing their own file-based sources
>> > > (since many users usually refer to text source first). Probably coder
>> > > parameter should not be configurable for text source/sink and they
>> should
>> > > be updated to only read/write UTF-8 encoded strings.
>> > >
>> > > - Cham
>> > >
>> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
>> > >  wrote:
>> > >
>> > > > The use of Coder in TextIO is a long standing design issue because
>> > coders
>> > > > are not intended to be used for general purpose converting things
>> from
>> > > and
>> > > > to bytes, their only proper use is letting the runner materialize
> and
>> > > > restore objects if the runner thinks it's necessary. IMO it should
>> have
>> > > > been called LineIO, document that it reads lines of text as String,
>> and
>> > > not
>> > > > have a withCoder parameter at all.
>> > > >
>> > > > The proper way to address your use case is to write a custom
>> > > > FileBasedSource.
>> > > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur 
>> wrote:
>> > > >
>> > > > > The Javadoc of TextIO states:
>> > > > >
>> > > > > * By default, {@link TextIO.Read} returns a {@link PCollection}
>> of
>> > > > > {@link String Strings},
>> > > > >  * each corresponding to one line of an input UTF-8 text file. To
>> > > convert
>> > > > > directly from the raw
>> > > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
>> > > another
>> > > > > object of type {@code T},
>> > > > >  * supply a {@code Coder} using {@link
>> > > TextIO.Read#withCoder(Coder)}.
>> > > > >
>> > > > > However, as I stated, `withCoder` doesn't seem to have tests, and
>> > > > probably
>> > > > > won't work given the hard-coded '\n' delimiter.
>> > > > >
>> > > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <
>> > j...@nanthrax.net
>> > > >
>> > > > > wrote:
>> > > > >
>> > > > > > Hi Aviem,
>> > > > > >
>> > > > > > TextIO is not designed to write/read binary file: it's pure
> Text,
>> > so
>> > > > > > String.
>> > > > > >
>> > > > > > Regards
>> > > > > > JB

Re: TextIO binary file

2017-01-31 Thread Aviem Zur
+1 on what Stas said.
I think there is value in not having the user write a custom IO for a
protocol they use which is not covered by Beam IOs. Plus having them deal
with not only the encoding but also the IO part is not ideal.
I think having a basic FileIO that can write to the Filesystems supported
by Beam (GS/HDFS/Local/...) which you can use any coder with, including
your own custom coder, can be beneficial.

On Tue, Jan 31, 2017 at 7:56 PM Stas Levin  wrote:

I believe the motivation is to have an abstraction that allows one to write
stuff to a file in a way that is agnostic to the coder.
If one needs to write a non-Avro protocol to a file, and this particular
protocol does not meet the assumption made by TextIO, one might need to
duplicate the file IO related code from AvroIO.

On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
 wrote:

> Could you clarify why it would be useful to write objects to files using
> Beam coders, as opposed to just using e.g. AvroIO?
>
> Coders (should) make no promise as to what their wire format is, so such
> files could be read back only by other Beam pipelines using the same IO.
>
> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur  wrote:
>
> > So If I understand the general agreement is that TextIO should not
> support
> > anything but lines from files as strings.
> > I'll go ahead and file a ticket that says the Javadoc should be changed
> to
> > reflect this and `withCoder` method should be removed.
> >
> > Is there merit for Beam to supply an IO which does allow writing objects
> to
> > a file using Beam coders and Beam FS (To write these files to
> > GS/Hadoop/Local)?
> >
> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> >  wrote:
> >
> > P.S. Note that this point (about coders) is also mentioned in the
> > now-being-reviewed PTransform Style Guide
> > https://github.com/apache/beam-site/pull/134
> > currently staged at
> >
> >
>
http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
> >
> >
> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath  >
> > wrote:
> >
> > > +1 to what Eugene said.
> > >
> > > I've seen a number of Python SDK users incorrectly assuming that
> > > coder.decode() is needed when developing their own file-based sources
> > > (since many users usually refer to text source first). Probably coder
> > > parameter should not be configurable for text source/sink and they
> should
> > > be updated to only read/write UTF-8 encoded strings.
> > >
> > > - Cham
> > >
> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> > >  wrote:
> > >
> > > > The use of Coder in TextIO is a long standing design issue because
> > coders
> > > > are not intended to be used for general purpose converting things
> from
> > > and
> > > > to bytes, their only proper use is letting the runner materialize
and
> > > > restore objects if the runner thinks it's necessary. IMO it should
> have
> > > > been called LineIO, document that it reads lines of text as String,
> and
> > > not
> > > > have a withCoder parameter at all.
> > > >
> > > > The proper way to address your use case is to write a custom
> > > > FileBasedSource.
> > > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur 
> wrote:
> > > >
> > > > > The Javadoc of TextIO states:
> > > > >
> > > > > * By default, {@link TextIO.Read} returns a {@link PCollection}
> of
> > > > > {@link String Strings},
> > > > >  * each corresponding to one line of an input UTF-8 text file. To
> > > convert
> > > > > directly from the raw
> > > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
> > > another
> > > > > object of type {@code T},
> > > > >  * supply a {@code Coder} using {@link
> > > TextIO.Read#withCoder(Coder)}.
> > > > >
> > > > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > > > probably
> > > > > won't work given the hard-coded '\n' delimiter.
> > > > >
> > > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <
> > j...@nanthrax.net
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Aviem,
> > > > > >
> > > > > > TextIO is not designed to write/read binary file: it's pure
Text,
> > so
> > > > > > String.
> > > > > >
> > > > > > Regards
> > > > > > JB
> > > > > >
> > > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > While trying to use TextIO to write/read a binary file rather
> > than
> > > > > String
> > > > > > > lines from a textual file I ran into an issue - the delimiter
> > > TextIO
> > > > > uses
> > > > > > > seems to be hardcoded '\n'.
> > > > > > > See `findSeparatorBounds` -
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
>
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > > > >
> > > > > > > The use case 

Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-31 Thread Kenneth Knowles
I agree. -1 and let's do the smartest thing to undo the regression.

Those two commits are not sufficient to restore late data dropping. You'll
also need to revert the switch of the Flink runner to use new DoFn, maybe
more.

On Tue, Jan 31, 2017 at 10:21 AM, Jean-Baptiste Onofré 
wrote:

> Basically, my question is: is it a regression ? If yes, definitely a -1
> and we should cancel the release.
>
> Correct me if I'm wrong, but the commits in the LateDataDroppingDoFnRunner
> introduced a regression. So, I would cancel this vote and revert the two
> commits for RC2.
>
> WDYT ?
>
> Regards
> JB
>
>
> On 01/31/2017 07:13 PM, Dan Halperin wrote:
>
>> Should we revert the CLs that lost the functionality? I'd really not like
>> to ship a release with such a functional regression
>>
>> On Tue, Jan 31, 2017 at 10:07 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>> Fair enough. Let's do that.
>>>
>>> Thanks !
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 01/31/2017 06:58 PM, Aljoscha Krettek wrote:
>>>
I'm not sure. Properly fixing this will take some time, especially since
 we
 have to add tests to prevent breakage from happening in the future.
 Plus,
 if my analysis is correct other runners might also not have proper late
 data dropping and it's fine to have a release with some missing
 features.
 (There's more besides dropping.)

 I think we should go ahead and fix for 0.6.

 On Tue, Jan 31, 2017, 18:23 Jean-Baptiste Onofré 
 wrote:

 Hi Aljoscha,

>
> so you propose to cancel this vote to prepare a RC2 ?
>
> Regards
> JB
>
> On 01/31/2017 05:06 PM, Aljoscha Krettek wrote:
>
> It's not just an issue with the Flink Runner, if I'm not mistaken.
>>
>> Flink had late-data dropping via the LateDataDroppingDoFnRunner (which
>>
>> got
>
> "disabled" by the two commits I mention in the issue) while I think
>> that
>> the Apex and Spark Runners might not have had dropping in the first
>>
>> place.
>
> (Not sure about this last part.)
>>
>> As I now wrote to the issue I think this could be a blocker because we
>> don't have the correct output in some cases.
>>
>> On Tue, 31 Jan 2017 at 02:16 Davor Bonaci  wrote:
>>
>> It looks good to me, but let's hear Aljoscha's opinion on BEAM-1346.
>>
>>>
>>> A passing suite of Jenkins jobs:
>>> * https://builds.apache.org/job/beam_PreCommit_Java_MavenInsta
>>> ll/6870/
>>> * https://builds.apache.org/job/beam_PostCommit_Java_MavenInst
>>> all/2474/
>>> *
>>>
>>>
>>> https://builds.apache.org/job/beam_PostCommit_Java_RunnableO
>>>
>> nService_Apex/336/
>
> *
>>
>>>
>>>
>>> https://builds.apache.org/job/beam_PostCommit_Java_RunnableO
>>>
>> nService_Flink/1470/
>
> *
>>
>>>
>>>
>>> https://builds.apache.org/job/beam_PostCommit_Java_RunnableO
>>>
>> nService_Spark/786/
>
> *
>>
>>>
>>>
>>> https://builds.apache.org/job/beam_PostCommit_Java_RunnableO
>>>
>> nService_Dataflow/2130/
>
>
>> On Mon, Jan 30, 2017 at 4:40 PM, Dan Halperin 
>>>
>>> wrote:
>>
>
>
>> I am worried about https://issues.apache.org/jira/browse/BEAM-1346
>>> for
>>>

 RC1
>>>
>>> and would at least wait for resolution there before proceeding.

 On Mon, Jan 30, 2017 at 3:48 AM, Jean-Baptiste Onofré <
 j...@nanthrax.net


>>> wrote:
>>
>>>
 Good catch for the PPMC, I'm upgrading the email template in the

>
> release

>>>
>>> guide (it was a copy/paste).

>
> Regards
> JB
>
>
> On 01/30/2017 11:50 AM, Sergio Fernández wrote:
>
> +1 (non-binding)
>
>>
>> So far I've successfully checked:
>> * signatures and digests
>> * source releases file layouts
>> * matched git tags and commit ids
>> * incubator suffix and disclaimer
>> * NOTICE and LICENSE files
>> * license headers
>> * clean build (Java 1.8.0_91, Maven 3.3.9, Debian amd64)
>>
>> Two minor comments that do not block the release:
>> * Usually I like to see the commit id referencing the rc, since
>> git
>>
>> tags
>

>>> can be changed.

> * Just a formality, "PPMC" is not committee that plays a role
>>
>> anymore,
>

> you're a PMC now ;-)
>>
>>>
>>
>>
>> On Fri, Jan 27, 2017 at 9:55 PM, Jean-Baptiste Onofré <
>>
>> j...@nanthrax.net>
>

>>> wrote:

>

Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-31 Thread Jean-Baptiste Onofré
Basically, my question is: is it a regression ? If yes, definitely a -1 
and we should cancel the release.


Correct me if I'm wrong, but the commits in the 
LateDataDroppingDoFnRunner introduced a regression. So, I would cancel 
this vote and revert the two commits for RC2.


WDYT ?

Regards
JB

On 01/31/2017 07:13 PM, Dan Halperin wrote:

Should we revert the CLs that lost the functionality? I'd really not like
to ship a release with such a functional regression

On Tue, Jan 31, 2017 at 10:07 AM, Jean-Baptiste Onofré 
wrote:


Fair enough. Let's do that.

Thanks !

Regards
JB


On 01/31/2017 06:58 PM, Aljoscha Krettek wrote:


I'm not sure. Properly fixing this will take some time, especially since we
have to add tests to prevent breakage from happening in the future. Plus,
if my analysis is correct other runners might also not have proper late
data dropping and it's fine to have a release with some missing features.
(There's more besides dropping.)

I think we should go ahead and fix for 0.6.

On Tue, Jan 31, 2017, 18:23 Jean-Baptiste Onofré  wrote:

Hi Aljoscha,


so you propose to cancel this vote to prepare a RC2 ?

Regards
JB

On 01/31/2017 05:06 PM, Aljoscha Krettek wrote:


It's not just an issue with the Flink Runner, if I'm not mistaken.

Flink had late-data dropping via the LateDataDroppingDoFnRunner (which got
"disabled" by the two commits I mention in the issue) while I think that
the Apex and Spark Runners might not have had dropping in the first place.


(Not sure about this last part.)

As I now wrote to the issue I think this could be a blocker because we
don't have the correct output in some cases.

On Tue, 31 Jan 2017 at 02:16 Davor Bonaci  wrote:

It looks good to me, but let's hear Aljoscha's opinion on BEAM-1346.


A passing suite of Jenkins jobs:
* https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6870/
* https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/2474/
* https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Apex/336/
* https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Flink/1470/
* https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Spark/786/
* https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Dataflow/2130/




On Mon, Jan 30, 2017 at 4:40 PM, Dan Halperin 


wrote:





I am worried about https://issues.apache.org/jira/browse/BEAM-1346 for



RC1


and would at least wait for resolution there before proceeding.

On Mon, Jan 30, 2017 at 3:48 AM, Jean-Baptiste Onofré <
j...@nanthrax.net




wrote:


Good catch for the PPMC, I'm upgrading the email template in the



release



guide (it was a copy/paste).


Regards
JB


On 01/30/2017 11:50 AM, Sergio Fernández wrote:

+1 (non-binding)


So far I've successfully checked:
* signatures and digests
* source releases file layouts
* matched git tags and commit ids
* incubator suffix and disclaimer
* NOTICE and LICENSE files
* license headers
* clean build (Java 1.8.0_91, Maven 3.3.9, Debian amd64)

Two minor comments that do not block the release:
* Usually I like to see the commit id referencing the rc, since git tags
can be changed.
* Just a formality, "PPMC" is not a committee that plays a role anymore,
you're a PMC now ;-)




On Fri, Jan 27, 2017 at 9:55 PM, Jean-Baptiste Onofré <


j...@nanthrax.net>



wrote:


Hi everyone,



Please review and vote on the release candidate #1 for the version


0.5.0



as follows:


[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific


comments)





The complete staging area is available for your review, which


includes:





* JIRA release notes [1],
* the official Apache source release to be deployed to


dist.apache.org



[2], which is signed with the key with fingerprint C8282E76 [3],

* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v0.5.0-RC1" [5],
* website pull request listing the release and publishing the API
reference manual [6].

The vote will be open for at least 72 hours. It is adopted by


majority



approval, with at least 3 PPMC affirmative votes.


Thanks,
JB

[1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338859
[2] https://dist.apache.org/repos/dist/dev/beam/0.5.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1010/
[5] https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=refs/tags/v0.5.0-RC1
[6] https://github.com/apache/beam-site/pull/132






--

Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com









--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com




Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-31 Thread Dan Halperin
Should we revert the CLs that lost the functionality? I'd really not like
to ship a release with such a functional regression

On Tue, Jan 31, 2017 at 10:07 AM, Jean-Baptiste Onofré 
wrote:

> Fair enough. Let's do that.
>
> Thanks !
>
> Regards
> JB
>
>
> On 01/31/2017 06:58 PM, Aljoscha Krettek wrote:
>
>> I'm not sure. Properly fixing this will take some time, especially since we
>> have to add tests to prevent breakage from happening in the future. Plus,
>> if my analysis is correct other runners might also not have proper late
>> data dropping and it's fine to have a release with some missing features.
>> (There's more besides dropping.)
>>
>> I think we should go ahead and fix for 0.6.
>>
>> On Tue, Jan 31, 2017, 18:23 Jean-Baptiste Onofré  wrote:
>>
>> Hi Aljoscha,
>>>
>>> so you propose to cancel this vote to prepare a RC2 ?
>>>
>>> Regards
>>> JB
>>>
>>> On 01/31/2017 05:06 PM, Aljoscha Krettek wrote:
>>>
 It's not just an issue with the Flink Runner, if I'm not mistaken.

 Flink had late-data dropping via the LateDataDroppingDoFnRunner (which

>>> got
>>>
 "disabled" by the two commits I mention in the issue) while I think that
 the Apex and Spark Runners might not have had dropping in the first

>>> place.
>>>
 (Not sure about this last part.)

 As I now wrote to the issue I think this could be a blocker because we
 don't have the correct output in some cases.

 On Tue, 31 Jan 2017 at 02:16 Davor Bonaci  wrote:

 It looks good to me, but let's hear Aljoscha's opinion on BEAM-1346.
>
> A passing suite of Jenkins jobs:
> * https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6870/
> * https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/2474/
> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Apex/336/
> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Flink/1470/
> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Spark/786/
> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Dataflow/2130/
>>>

> On Mon, Jan 30, 2017 at 4:40 PM, Dan Halperin 
>
 wrote:
>>>

> I am worried about https://issues.apache.org/jira/browse/BEAM-1346 for
>>
> RC1
>
>> and would at least wait for resolution there before proceeding.
>>
>> On Mon, Jan 30, 2017 at 3:48 AM, Jean-Baptiste Onofré <
>> j...@nanthrax.net
>>
>
 wrote:
>>
>> Good catch for the PPMC, I'm upgrading the email template in the
>>>
>> release
>
>> guide (it was a copy/paste).
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 01/30/2017 11:50 AM, Sergio Fernández wrote:
>>>
>>> +1 (non-binding)

 So far I've successfully checked:
 * signatures and digests
 * source releases file layouts
 * matched git tags and commit ids
 * incubator suffix and disclaimer
 * NOTICE and LICENSE files
 * license headers
 * clean build (Java 1.8.0_91, Maven 3.3.9, Debian amd64)

 Two minor comments that do not block the release:
 * Usually I like to see the commit id referencing the rc, since git tags
 can be changed.
 * Just a formality, "PPMC" is not a committee that plays a role anymore,
 you're a PMC now ;-)



 On Fri, Jan 27, 2017 at 9:55 PM, Jean-Baptiste Onofré <

>>> j...@nanthrax.net>
>
>> wrote:

 Hi everyone,

>
> Please review and vote on the release candidate #1 for the version
>
 0.5.0
>>
>>> as follows:
>
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific
>
 comments)
>>>

> The complete staging area is available for your review, which
>
 includes:
>
>>
> * JIRA release notes [1],
> * the official Apache source release to be deployed to
>
 dist.apache.org
>
>> [2], which is signed with the key with fingerprint C8282E76 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v0.5.0-RC1" [5],
> * website pull request listing the release and publishing the API
> reference manual [6].
>
> The vote will be open for at least 72 hours. It is adopted by
>
 majority
>
>> approval, with at least 3 PPMC affirmative votes.
>
> Thanks,
> JB
>
> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?proje

Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-31 Thread Jean-Baptiste Onofré

Fair enough. Let's do that.

Thanks !

Regards
JB

On 01/31/2017 06:58 PM, Aljoscha Krettek wrote:

I'm not sure. Properly fixing this will take some time, especially since we
have to add tests to prevent breakage from happening in the future. Plus,
if my analysis is correct other runners might also not have proper late
data dropping and it's fine to have a release with some missing features.
(There's more besides dropping.)

I think we should go ahead and fix for 0.6.

On Tue, Jan 31, 2017, 18:23 Jean-Baptiste Onofré  wrote:


Hi Aljoscha,

so you propose to cancel this vote to prepare a RC2 ?

Regards
JB

On 01/31/2017 05:06 PM, Aljoscha Krettek wrote:

It's not just an issue with the Flink Runner, if I'm not mistaken.

Flink had late-data dropping via the LateDataDroppingDoFnRunner (which got
"disabled" by the two commits I mention in the issue) while I think that
the Apex and Spark Runners might not have had dropping in the first place.

(Not sure about this last part.)

As I now wrote to the issue I think this could be a blocker because we
don't have the correct output in some cases.

On Tue, 31 Jan 2017 at 02:16 Davor Bonaci  wrote:


It looks good to me, but let's hear Aljoscha's opinion on BEAM-1346.

A passing suite of Jenkins jobs:
* https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6870/
* https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/2474/
* https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Apex/336/
* https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Flink/1470/
* https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Spark/786/
* https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Dataflow/2130/


On Mon, Jan 30, 2017 at 4:40 PM, Dan Halperin 

wrote:



I am worried about https://issues.apache.org/jira/browse/BEAM-1346 for

RC1

and would at least wait for resolution there before proceeding.

On Mon, Jan 30, 2017 at 3:48 AM, Jean-Baptiste Onofré 

wrote:

Hi everyone,


Please review and vote on the release candidate #1 for the version

0.5.0

as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific

comments)


The complete staging area is available for your review, which

includes:


* JIRA release notes [1],
* the official Apache source release to be deployed to

dist.apache.org

[2], which is signed with the key with fingerprint C8282E76 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v0.5.0-RC1" [5],
* website pull request listing the release and publishing the API
reference manual [6].

The vote will be open for at least 72 hours. It is adopted by

majority

approval, with at least 3 PPMC affirmative votes.

Thanks,
JB

[1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338859
[2] https://dist.apache.org/repos/dist/dev/beam/0.5.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1010/
[5] https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=refs/tags/v0.5.0-RC1
[6] https://github.com/apache/beam-site/pull/132







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com









--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-31 Thread Aljoscha Krettek
I'm not sure. Properly fixing this will take some time, especially since we
have to add tests to prevent breakage from happening in the future. Plus,
if my analysis is correct other runners might also not have proper late
data dropping and it's fine to have a release with some missing features.
(There's more besides dropping.)

I think we should go ahead and fix for 0.6.

On Tue, Jan 31, 2017, 18:23 Jean-Baptiste Onofré  wrote:

> Hi Aljoscha,
>
> so you propose to cancel this vote to prepare a RC2 ?
>
> Regards
> JB
>
> On 01/31/2017 05:06 PM, Aljoscha Krettek wrote:
> > It's not just an issue with the Flink Runner, if I'm not mistaken.
> >
> > Flink had late-data dropping via the LateDataDroppingDoFnRunner (which got
> > "disabled" by the two commits I mention in the issue) while I think that
> > the Apex and Spark Runners might not have had dropping in the first place.
> > (Not sure about this last part.)
> >
> > As I now wrote to the issue I think this could be a blocker because we
> > don't have the correct output in some cases.
> >
> > On Tue, 31 Jan 2017 at 02:16 Davor Bonaci  wrote:
> >
> >> It looks good to me, but let's hear Aljoscha's opinion on BEAM-1346.
> >>
> >> A passing suite of Jenkins jobs:
> >> * https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6870/
> >> * https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/2474/
> >> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Apex/336/
> >> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Flink/1470/
> >> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Spark/786/
> >> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Dataflow/2130/
> >>
> >> On Mon, Jan 30, 2017 at 4:40 PM, Dan Halperin 
> wrote:
> >>
> >>> I am worried about https://issues.apache.org/jira/browse/BEAM-1346 for
> >> RC1
> >>> and would at least wait for resolution there before proceeding.
> >>>
> >>> On Mon, Jan 30, 2017 at 3:48 AM, Jean-Baptiste Onofré  >
> >>> wrote:
> >>>
>  Good catch for the PPMC, I'm upgrading the email template in the
> >> release
>  guide (it was a copy/paste).
> 
>  Regards
>  JB
> 
> 
>  On 01/30/2017 11:50 AM, Sergio Fernández wrote:
> 
> > +1 (non-binding)
> >
> > So far I've successfully checked:
> > * signatures and digests
> > * source releases file layouts
> > * matched git tags and commit ids
> > * incubator suffix and disclaimer
> > * NOTICE and LICENSE files
> > * license headers
> > * clean build (Java 1.8.0_91, Maven 3.3.9, Debian amd64)
> >
> > Two minor comments that do not block the release:
> > * Usually I like to see the commit id referencing the rc, since git tags
> > can be changed.
> > * Just a formality, "PPMC" is not a committee that plays a role anymore,
> > you're a PMC now ;-)
> >
> >
> >
> > On Fri, Jan 27, 2017 at 9:55 PM, Jean-Baptiste Onofré <
> >> j...@nanthrax.net>
> > wrote:
> >
> > Hi everyone,
> >>
> >> Please review and vote on the release candidate #1 for the version
> >>> 0.5.0
> >> as follows:
> >>
> >> [ ] +1, Approve the release
> >> [ ] -1, Do not approve the release (please provide specific
> comments)
> >>
> >> The complete staging area is available for your review, which
> >> includes:
> >>
> >> * JIRA release notes [1],
> >> * the official Apache source release to be deployed to
> >> dist.apache.org
> >> [2], which is signed with the key with fingerprint C8282E76 [3],
> >> * all artifacts to be deployed to the Maven Central Repository [4],
> >> * source code tag "v0.5.0-RC1" [5],
> >> * website pull request listing the release and publishing the API
> >> reference manual [6].
> >>
> >> The vote will be open for at least 72 hours. It is adopted by
> >> majority
> >> approval, with at least 3 PPMC affirmative votes.
> >>
> >> Thanks,
> >> JB
> >>
> >> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338859
> >> [2] https://dist.apache.org/repos/dist/dev/beam/0.5.0/
> >> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> >> [4] https://repository.apache.org/content/repositories/orgapachebeam-1010/
> >> [5] https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=refs/tags/v0.5.0-RC1
> >> [6] https://github.com/apache/beam-site/pull/132
> >>
> >>
> >
> >
> >
>  --
>  Jean-Baptiste Onofré
>  jbono...@apache.org
>  http://blog.nanthrax.net
>  Talend - http://www.talend.com
> 
> >>>
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
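[Editorial note: the dropping rule debated in the thread above — an element is expired once the watermark passes the end of its window plus the allowed lateness — can be sketched as follows. This is a simplified, hypothetical illustration; it is not Beam's actual LateDataDroppingDoFnRunner, and `Element`, `window_end`, and `drop_late_data` are invented names for the sketch.]

```python
from collections import namedtuple

# Illustrative record type; real Beam uses WindowedValue/BoundedWindow.
Element = namedtuple("Element", ["value", "window_end"])

def drop_late_data(elements, watermark, allowed_lateness):
    """Keep only elements whose window has not yet expired.

    An element is expired (and must be dropped) once the watermark
    passes the end of its window plus the allowed lateness.
    """
    return [
        e for e in elements
        if watermark <= e.window_end + allowed_lateness
    ]

# Example: watermark at t=100, allowed lateness of 10 time units.
elems = [Element("on-time", 105),
         Element("late-but-allowed", 95),
         Element("expired", 80)]
kept = drop_late_data(elems, watermark=100, allowed_lateness=10)
# "expired" (window end 80 + lateness 10 < watermark 100) is dropped.
```

A runner that skips this check emits panes for already-expired windows, which is the incorrect-output concern raised in BEAM-1346.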


Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Ahmet Altay
Thank you Prabeesh and Sergio for fixing those!

On Tue, Jan 31, 2017 at 4:51 AM, Jean-Baptiste Onofré 
wrote:

> Awesome, thanks Sergio ! Much appreciated ;)
>
> Regards
> JB
>
>
> On 01/31/2017 01:42 PM, Sergio Fernández wrote:
>
>> PR #1879 provides the basics: https://github.com/apache/beam/pull/1879
>>
>> On Tue, Jan 31, 2017 at 1:33 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>> No, that's fine as soon as we clearly document the prerequisite for the
>>> build. IMHO, we should provide quick BUILDING instructions in the
>>> README.md.
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 01/31/2017 01:24 PM, Sergio Fernández wrote:
>>>
>>> Originally we integrate the build in Maven with the default profile.
 Do you feel like it'd be better to have it under a separated profile or
 so?

 On Tue, Jan 31, 2017 at 11:07 AM, Jean-Baptiste Onofré 
 wrote:

 Just to be clear, the prerequisite to be able to build the Python SDK
 are:

>
> apt-get install python-setuptools
> apt-get install python-pip
>
> It's also required by the default "regular" build.
>
> Regards
> JB
>
>
> On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:
>
> Just one thing I noticed (and can be helpful for others): to build Beam
>
>> we now need python setuptools installed.
>>
>> For instance, on Ubuntu, you have to do:
>>
>> apt-get install python-setuptools
>>
>> Same for the pip distribution.
>>
>> I guess (if not already done), we have to update README/Building
>> instructions.
>>
>> Correct ?
>>
>> Regards
>> JB
>>
>> On 01/31/2017 08:10 AM, Ahmet Altay wrote:
>>
>> Hi all,
>>
>>>
>>> This merge is completed. Python SDK is now officially part of the
>>> master
>>> branch! Thank you all for the support. Please open an issue, if you
>>> notice
>>> a reference to the now obsolete python-sdk branch in the
>>> documentation.
>>>
>>> There will not be any more merges to the python-sdk branch. Going
>>> forward
>>> please use the master branch for Python SDK development. There are a
>>> few
>>> existing open PRs to the python-sdk [1]. If you are the author of one
>>> of
>>> those PRs, please rebase them on top of master.
>>>
>>> Thank you,
>>> Ahmet
>>>
>>> [1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam
>>>
>>>
>>> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles
>>> 
>>> wrote:
>>>
>>> To clarify the implied criteria of that last exchange, it is "An SDK
>>>
>>> should
 have at least one runner that can execute the complete model (may
 be a
 direct runner)"

 I want to highlight this, because whether an _SDK_ supports
 unbounded
 data
 is not particularly well-defined, and will evolve:

  - With the Runner API, an SDK will need to support building a graph
 with
 unbounded constructs, as today with probably minimal changes.

  - With the Fn API, if any part of the Fn API is specific to
 unbounded
 data, the SDK will need to implement it. I think right now there is
 no such
 thing, and we don't want such a thing, so SDKs implementing the Fn
 API
 automatically support unbounded data.

  - There will also likely be an SDK-specific shim just as there is
 today,
 to leverage idiomatic deserialized representations. The richness of
 this
 shim will decrease so that it will need to "support" unbounded data
 but
 that will be a ~one liner.

 Getting the Python SDK on master will accelerate our progress
 towards
 the
 Fn API - partly technical, partly community - which is the best path
 towards support for unbounded data across multiple runners. I think
 the
 criteria are written with the completed portability framework in
 mind. So
 this exchange makes me actually more convinced we should merge
 python-sdk
 to master.

 On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
 rober...@google.com.invalid> wrote:

 On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin

  wrote:
>
> I do 
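[Editorial note: the build prerequisites mentioned in this thread (python-setuptools and python-pip) can be sanity-checked before running the Maven build. A minimal sketch; the module list and the `missing_modules` helper are assumptions for illustration, not part of the Beam build.]

```python
import importlib.util

def missing_modules(modules):
    """Return the subset of module names that cannot be imported.

    Useful as a pre-build check for the Python SDK prerequisites,
    e.g. missing_modules(["setuptools", "pip"]).
    """
    return [m for m in modules if importlib.util.find_spec(m) is None]

# Example usage before "mvn install":
#   if missing_modules(["setuptools", "pip"]):
#       print("install python-setuptools / python-pip first")
```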

Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-31 Thread Jean-Baptiste Onofré

Hi Aljoscha,

so you propose to cancel this vote to prepare a RC2 ?

Regards
JB

On 01/31/2017 05:06 PM, Aljoscha Krettek wrote:

It's not just an issue with the Flink Runner, if I'm not mistaken.

Flink had late-data dropping via the LateDataDroppingDoFnRunner (which got
"disabled" by the two commits I mention in the issue) while I think that
the Apex and Spark Runners might not have had dropping in the first place.
(Not sure about this last part.)

As I now wrote to the issue I think this could be a blocker because we
don't have the correct output in some cases.

On Tue, 31 Jan 2017 at 02:16 Davor Bonaci  wrote:


It looks good to me, but let's hear Aljoscha's opinion on BEAM-1346.

A passing suite of Jenkins jobs:
* https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6870/
* https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/2474/
* https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Apex/336/
* https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Flink/1470/
* https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Spark/786/
* https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Dataflow/2130/

On Mon, Jan 30, 2017 at 4:40 PM, Dan Halperin  wrote:


I am worried about https://issues.apache.org/jira/browse/BEAM-1346 for

RC1

and would at least wait for resolution there before proceeding.

On Mon, Jan 30, 2017 at 3:48 AM, Jean-Baptiste Onofré 
wrote:


Good catch for the PPMC, I'm upgrading the email template in the

release

guide (it was a copy/paste).

Regards
JB


On 01/30/2017 11:50 AM, Sergio Fernández wrote:


+1 (non-binding)

So far I've successfully checked:
* signatures and digests
* source releases file layouts
* matched git tags and commit ids
* incubator suffix and disclaimer
* NOTICE and LICENSE files
* license headers
* clean build (Java 1.8.0_91, Maven 3.3.9, Debian amd64)

Two minor comments that do not block the release:
* Usually I like to see the commit id referencing the rc, since git tags
can be changed.
* Just a formality, "PPMC" is not a committee that plays a role anymore,
you're a PMC now ;-)



On Fri, Jan 27, 2017 at 9:55 PM, Jean-Baptiste Onofré <

j...@nanthrax.net>

wrote:

Hi everyone,


Please review and vote on the release candidate #1 for the version

0.5.0

as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which

includes:


* JIRA release notes [1],
* the official Apache source release to be deployed to

dist.apache.org

[2], which is signed with the key with fingerprint C8282E76 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v0.5.0-RC1" [5],
* website pull request listing the release and publishing the API
reference manual [6].

The vote will be open for at least 72 hours. It is adopted by

majority

approval, with at least 3 PPMC affirmative votes.

Thanks,
JB

[1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338859
[2] https://dist.apache.org/repos/dist/dev/beam/0.5.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1010/
[5] https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=refs/tags/v0.5.0-RC1
[6] https://github.com/apache/beam-site/pull/132







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com









--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: TextIO binary file

2017-01-31 Thread Eugene Kirpichov
Could you clarify why it would be useful to write objects to files using
Beam coders, as opposed to just using e.g. AvroIO?

Coders (should) make no promise as to what their wire format is, so such
files could be read back only by other Beam pipelines using the same IO.

On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur  wrote:

> So if I understand correctly, the general agreement is that TextIO should not support
> anything but lines from files as strings.
> I'll go ahead and file a ticket that says the Javadoc should be changed to
> reflect this and `withCoder` method should be removed.
>
> Is there merit for Beam to supply an IO which does allow writing objects to
> a file using Beam coders and Beam FS (To write these files to
> GS/Hadoop/Local)?
>
> On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
>  wrote:
>
> P.S. Note that this point (about coders) is also mentioned in the
> now-being-reviewed PTransform Style Guide
> https://github.com/apache/beam-site/pull/134
> currently staged at
>
> http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
>
>
> On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath 
> wrote:
>
> > +1 to what Eugene said.
> >
> > I've seen a number of Python SDK users incorrectly assuming that
> > coder.decode() is needed when developing their own file-based sources
> > (since many users usually refer to text source first). Probably coder
> > parameter should not be configurable for text source/sink and they should
> > be updated to only read/write UTF-8 encoded strings.
> >
> > - Cham
> >
> > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> >  wrote:
> >
> > > The use of Coder in TextIO is a long standing design issue because
> coders
> > > are not intended to be used for general purpose converting things from
> > and
> > > to bytes, their only proper use is letting the runner materialize and
> > > restore objects if the runner thinks it's necessary. IMO it should have
> > > been called LineIO, document that it reads lines of text as String, and
> > not
> > > have a withCoder parameter at all.
> > >
> > > The proper way to address your use case is to write a custom
> > > FileBasedSource.
> > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur  wrote:
> > >
> > > > The Javadoc of TextIO states:
> > > >
> > > > * By default, {@link TextIO.Read} returns a {@link PCollection} of
> > > > {@link String Strings},
> > > >  * each corresponding to one line of an input UTF-8 text file. To
> > convert
> > > > directly from the raw
> > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
> > another
> > > > object of type {@code T},
> > > >  * supply a {@code Coder} using {@link
> > TextIO.Read#withCoder(Coder)}.
> > > >
> > > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > > probably
> > > > won't work given the hard-coded '\n' delimiter.
> > > >
> > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > > > wrote:
> > > >
> > > > > Hi Aviem,
> > > > >
> > > > > TextIO is not designed to write/read binary file: it's pure Text,
> so
> > > > > String.
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > > Hi,
> > > > > >
> > > > > > While trying to use TextIO to write/read a binary file rather
> than
> > > > String
> > > > > > lines from a textual file I ran into an issue - the delimiter
> > TextIO
> > > > uses
> > > > > > seems to be hardcoded '\n'.
> > > > > > See `findSeparatorBounds` -
> > > > > >
> > > > >
> > > >
> > >
> >
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > > >
> > > > > > The use case is to have a file of objects, encoded into bytes
> > using a
> > > > > > coder. However, '\n' is not a good delimiter here, as you can
> > > imagine.
> > > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > > >
> > > > >
> > > >
> > >
> >
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > > where
> > > > > > they use a more appropriate delimiter, to avoid such issues.
> > > > > >
> > > > > > I did not find any unit tests which use TextIO to read anything
> > other
> > > > > than
> > > > > > Strings.
> > > > > >
> > > > >
> > > > > --
> > > > > Jean-Baptiste Onofré
> > > > > jbono...@apache.org
> > > > > http://blog.nanthrax.net
> > > > > Talend - http://www.talend.com
> > > > >
> > > >
> > >
> >
>
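[Editorial note: the delimiter problem discussed in this thread — a '\n' delimiter is ambiguous once records are arbitrary coder-encoded bytes — can be illustrated with length-prefixed framing, the kind of approach a custom FileBasedSource could use. This is an illustrative sketch, not Beam or Spark code; `write_records`/`read_records` are invented names.]

```python
import struct
from io import BytesIO

def write_records(stream, records):
    # Length-prefixed framing: each record is preceded by a 4-byte
    # big-endian length, so payload bytes (including b"\n") are safe.
    for rec in records:
        stream.write(struct.pack(">I", len(rec)))
        stream.write(rec)

def read_records(stream):
    records = []
    while True:
        header = stream.read(4)
        if not header:
            break  # clean end of stream
        (length,) = struct.unpack(">I", header)
        records.append(stream.read(length))
    return records

# Round-trip records containing the newline byte that a line-based
# reader (like TextIO's hard-coded '\n' splitting) would mis-split on:
buf = BytesIO()
write_records(buf, [b"first\nrecord", b"second"])
buf.seek(0)
assert read_records(buf) == [b"first\nrecord", b"second"]
```

Newline-delimited text remains the right format for human-readable lines; framing like this only matters once the payload is opaque binary.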


Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-31 Thread Aljoscha Krettek
It's not just an issue with the Flink Runner, if I'm not mistaken.

Flink had late-data dropping via the LateDataDroppingDoFnRunner (which got
"disabled" by the two commits I mention in the issue) while I think that
the Apex and Spark Runners might not have had dropping in the first place.
(Not sure about this last part.)

As I now wrote to the issue I think this could be a blocker because we
don't have the correct output in some cases.

On Tue, 31 Jan 2017 at 02:16 Davor Bonaci  wrote:

> It looks good to me, but let's hear Aljoscha's opinion on BEAM-1346.
>
> A passing suite of Jenkins jobs:
> * https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6870/
> * https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/2474/
> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Apex/336/
> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Flink/1470/
> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Spark/786/
> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Dataflow/2130/
>
> On Mon, Jan 30, 2017 at 4:40 PM, Dan Halperin  wrote:
>
> > I am worried about https://issues.apache.org/jira/browse/BEAM-1346 for
> RC1
> > and would at least wait for resolution there before proceeding.
> >
> > On Mon, Jan 30, 2017 at 3:48 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > Good catch for the PPMC, I'm upgrading the email template in the
> release
> > > guide (it was a copy/paste).
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 01/30/2017 11:50 AM, Sergio Fernández wrote:
> > >
> > >> +1 (non-binding)
> > >>
> > >> So far I've successfully checked:
> > >> * signatures and digests
> > >> * source releases file layouts
> > >> * matched git tags and commit ids
> > >> * incubator suffix and disclaimer
> > >> * NOTICE and LICENSE files
> > >> * license headers
> > >> * clean build (Java 1.8.0_91, Maven 3.3.9, Debian amd64)
> > >>
> > >> Two minor comments that do not block the release:
> > >> * Usually I like to see the commit id referencing the rc, since git tags
> > >> can be changed.
> > >> * Just a formality, "PPMC" is not a committee that plays a role anymore,
> > >> you're a PMC now ;-)
> > >>
> > >>
> > >>
> > >> On Fri, Jan 27, 2017 at 9:55 PM, Jean-Baptiste Onofré <
> j...@nanthrax.net>
> > >> wrote:
> > >>
> > >> Hi everyone,
> > >>>
> > >>> Please review and vote on the release candidate #1 for the version
> > 0.5.0
> > >>> as follows:
> > >>>
> > >>> [ ] +1, Approve the release
> > >>> [ ] -1, Do not approve the release (please provide specific comments)
> > >>>
> > >>> The complete staging area is available for your review, which
> includes:
> > >>>
> > >>> * JIRA release notes [1],
> > >>> * the official Apache source release to be deployed to
> dist.apache.org
> > >>> [2], which is signed with the key with fingerprint C8282E76 [3],
> > >>> * all artifacts to be deployed to the Maven Central Repository [4],
> > >>> * source code tag "v0.5.0-RC1" [5],
> > >>> * website pull request listing the release and publishing the API
> > >>> reference manual [6].
> > >>>
> > >>> The vote will be open for at least 72 hours. It is adopted by
> majority
> > >>> approval, with at least 3 PPMC affirmative votes.
> > >>>
> > >>> Thanks,
> > >>> JB
> > >>>
> > >>> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338859
> > >>> [2] https://dist.apache.org/repos/dist/dev/beam/0.5.0/
> > >>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > >>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1010/
> > >>> [5] https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=refs/tags/v0.5.0-RC1
> > >>> [6] https://github.com/apache/beam-site/pull/132
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>


Re: Projects for Google Summer of Code 2017

2017-01-31 Thread Kenneth Knowles
I think this is a great idea. I also participated in GSOC once.

I've been particularly interested in coming up with great new applications
of Beam to new domains. In chatting with professors at the University of
Washington, I've learned that scholars of many fields would really like to
explore new and highly customized ways of processing the growing body of
publicly-available scholarly documents. This seems like a great project,
since we love doing this to Shakespeare's works, and there are thousands of
times as many public articles, so there are non-toy scale issues. And yet, it
does seem like it can be scoped appropriately.

The deadline for a mentoring organization is Feb 9 so let's put together a
proposal!

Kenn

On Fri, Jan 13, 2017 at 3:25 PM, Pablo Estrada 
wrote:

> Hi there,
> The GSOC 2017 [1] is coming soon. I figured it would be nice if we could
> find small projects that a student could implement this summer. Apache
> already takes part in this, and all we'd need to do is label Jira issues as
> GSOC projects. Any ideas for projects?
>
> As a note, during my grad school I participated in GSOC a couple of times
> and I'd say they were some of my most rewarding development experiences.
>
> [1] - https://developers.google.com/open-source/gsoc/
>


Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Jean-Baptiste Onofré

Awesome, thanks Sergio ! Much appreciated ;)

Regards
JB

On 01/31/2017 01:42 PM, Sergio Fernández wrote:

PR #1879 provides the basics: https://github.com/apache/beam/pull/1879

On Tue, Jan 31, 2017 at 1:33 PM, Jean-Baptiste Onofré 
wrote:


No, that's fine as soon as we clearly document the prerequisite for the
build. IMHO, we should provide quick BUILDING instructions in the README.md.

Regards
JB


On 01/31/2017 01:24 PM, Sergio Fernández wrote:


Originally we integrated the build in Maven with the default profile.
Do you feel like it'd be better to have it under a separate profile or
so?

On Tue, Jan 31, 2017 at 11:07 AM, Jean-Baptiste Onofré 
wrote:

Just to be clear, the prerequisites to be able to build the Python SDK are:


apt-get install python-setuptools
apt-get install python-pip

It's also required by the default "regular" build.

Regards
JB


On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:

Just one thing I noticed (and can be helpful for others): to build Beam

we now need python setuptools installed.

For instance, on Ubuntu, you have to do:

apt-get install python-setuptools

Same for the pip distribution.

I guess (if not already done), we have to update README/Building
instructions.

Correct ?

Regards
JB

On 01/31/2017 08:10 AM, Ahmet Altay wrote:

Hi all,


This merge is completed. Python SDK is now officially part of the
master
branch! Thank you all for the support. Please open an issue, if you
notice
a reference to the now obsolete python-sdk branch in the documentation.

There will not be any more merges to the python-sdk branch. Going
forward
please use the master branch for Python SDK development. There are a
few
existing open PRs to the python-sdk [1]. If you are the author of one
of
those PRs, please rebase them on top of master.

Thank you,
Ahmet

[1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam



On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles

wrote:

To clarify the implied criteria of that last exchange, it is "An SDK should
have at least one runner that can execute the complete model (may be a
direct runner)"

I want to highlight this, because whether an _SDK_ supports unbounded
data
is not particularly well-defined, and will evolve:

 - With the Runner API, an SDK will need to support building a graph
with
unbounded constructs, as today with probably minimal changes.

 - With the Fn API, if any part of the Fn API is specific to unbounded
data, the SDK will need to implement it. I think right now there is
no such
thing, and we don't want such a thing, so SDKs implementing the Fn API
automatically support unbounded data.

 - There will also likely be an SDK-specific shim just as there is
today,
to leverage idiomatic deserialized representations. The richness of
this
shim will decrease so that it will need to "support" unbounded data
but
that will be a ~one liner.

Getting the Python SDK on master will accelerate our progress towards
the
Fn API - partly technical, partly community - which is the best path
towards support for unbounded data across multiple runners. I think
the
criteria are written with the completed portability framework in
mind. So
this exchange makes me actually more convinced we should merge
python-sdk
to master.

On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
 wrote:

I do not think that Python SDK yet meets the bar [1] for implementing the
Beam model -- supporting Unbounded data is very important. That said, given
the committed and sustained set of contributors, it generally makes sense
to me to make an exception in anticipation of these features being fleshed
out soon; including potentially new users/contributors that would arrive
once in master.

[1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
k0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com



That is a valid point. The Python SDK supports all the unbounded
parts
of the model except for unbounded sources, which was deferred while
seeing how https://s.apache.org/splittable-do-fn played out. I've
been
working with the team and merging/reviewing most of their code, and
have full confidence this will be coming (and on that note can vouch
for a healthy community and support which are much harder to add
later).

In short, I think it has the required maturity, and I'm in favor of
merging soonish.

On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay




Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Sergio Fernández
PR #1879 provides the basics: https://github.com/apache/beam/pull/1879

On Tue, Jan 31, 2017 at 1:33 PM, Jean-Baptiste Onofré 
wrote:

> No, that's fine as soon as we clearly document the prerequisite for the
> build. IMHO, we should provide quick BUILDING instructions in the README.md.
>
> Regards
> JB
>
>
> On 01/31/2017 01:24 PM, Sergio Fernández wrote:
>
>> Originally we integrated the build in Maven with the default profile.
>> Do you feel like it'd be better to have it under a separate profile or
>> so?
>>
>> On Tue, Jan 31, 2017 at 11:07 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>> Just to be clear, the prerequisites to be able to build the Python SDK are:
>>>
>>> apt-get install python-setuptools
>>> apt-get install python-pip
>>>
>>> It's also required by the default "regular" build.
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:
>>>
>>> Just one thing I noticed (and can be helpful for others): to build Beam
 we now need python setuptools installed.

 For instance, on Ubuntu, you have to do:

 apt-get install python-setuptools

 Same for the pip distribution.

 I guess (if not already done), we have to update README/Building
 instructions.

 Correct ?

 Regards
 JB

 On 01/31/2017 08:10 AM, Ahmet Altay wrote:

 Hi all,
>
> This merge is completed. Python SDK is now officially part of the
> master
> branch! Thank you all for the support. Please open an issue, if you
> notice
> a reference to the now obsolete python-sdk branch in the documentation.
>
> There will not be any more merges to the python-sdk branch. Going
> forward
> please use the master branch for Python SDK development. There are a
> few
> existing open PRs to the python-sdk [1]. If you are the author of one
> of
> those PRs, please rebase them on top of master.
>
> Thank you,
> Ahmet
>
> [1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam
>
>
> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles
> 
> wrote:
>
> To clarify the implied criteria of that last exchange, it is "An SDK
>
>> should
>> have at least one runner that can execute the complete model (may be a
>> direct runner)"
>>
>> I want to highlight this, because whether an _SDK_ supports unbounded
>> data
>> is not particularly well-defined, and will evolve:
>>
>>  - With the Runner API, an SDK will need to support building a graph
>> with
>> unbounded constructs, as today with probably minimal changes.
>>
>>  - With the Fn API, if any part of the Fn API is specific to unbounded
>> data, the SDK will need to implement it. I think right now there is
>> no such
>> thing, and we don't want such a thing, so SDKs implementing the Fn API
>> automatically support unbounded data.
>>
>>  - There will also likely be an SDK-specific shim just as there is
>> today,
>> to leverage idiomatic deserialized representations. The richness of
>> this
>> shim will decrease so that it will need to "support" unbounded data
>> but
>> that will be a ~one liner.
>>
>> Getting the Python SDK on master will accelerate our progress towards
>> the
>> Fn API - partly technical, partly community - which is the best path
>> towards support for unbounded data across multiple runners. I think
>> the
>> criteria are written with the completed portability framework in
>> mind. So
>> this exchange makes me actually more convinced we should merge
>> python-sdk
>> to master.
>>
>> On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
>> rober...@google.com.invalid> wrote:
>>
>> On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>>>  wrote:
>>>
>>> I do not think that Python SDK yet meets the bar [1] for implementing the
>>> Beam model -- supporting Unbounded data is very important. That said, given
>>> the committed and sustained set of contributors, it generally makes sense
>>> to me to make an exception in anticipation of these features being fleshed
>>> out soon; including potentially new users/contributors that would arrive
>>> once in master.
>>>
>>> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y

Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Jean-Baptiste Onofré
No, that's fine as soon as we clearly document the prerequisite for the 
build. IMHO, we should provide quick BUILDING instructions in the README.md.


Regards
JB

On 01/31/2017 01:24 PM, Sergio Fernández wrote:

Originally we integrated the build in Maven with the default profile.
Do you feel like it'd be better to have it under a separate profile or so?

On Tue, Jan 31, 2017 at 11:07 AM, Jean-Baptiste Onofré 
wrote:


Just to be clear, the prerequisites to be able to build the Python SDK are:

apt-get install python-setuptools
apt-get install python-pip

It's also required by the default "regular" build.

Regards
JB


On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:


Just one thing I noticed (and can be helpful for others): to build Beam
we now need python setuptools installed.

For instance, on Ubuntu, you have to do:

apt-get install python-setuptools

Same for the pip distribution.

I guess (if not already done), we have to update README/Building
instructions.

Correct ?

Regards
JB

On 01/31/2017 08:10 AM, Ahmet Altay wrote:


Hi all,

This merge is completed. Python SDK is now officially part of the master
branch! Thank you all for the support. Please open an issue, if you
notice
a reference to the now obsolete python-sdk branch in the documentation.

There will not be any more merges to the python-sdk branch. Going forward
please use the master branch for Python SDK development. There are a few
existing open PRs to the python-sdk [1]. If you are the author of one of
those PRs, please rebase them on top of master.

Thank you,
Ahmet

[1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam



On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles

wrote:

To clarify the implied criteria of that last exchange, it is "An SDK should
have at least one runner that can execute the complete model (may be a
direct runner)"

I want to highlight this, because whether an _SDK_ supports unbounded
data
is not particularly well-defined, and will evolve:

 - With the Runner API, an SDK will need to support building a graph
with
unbounded constructs, as today with probably minimal changes.

 - With the Fn API, if any part of the Fn API is specific to unbounded
data, the SDK will need to implement it. I think right now there is
no such
thing, and we don't want such a thing, so SDKs implementing the Fn API
automatically support unbounded data.

 - There will also likely be an SDK-specific shim just as there is
today,
to leverage idiomatic deserialized representations. The richness of this
shim will decrease so that it will need to "support" unbounded data but
that will be a ~one liner.

Getting the Python SDK on master will accelerate our progress towards
the
Fn API - partly technical, partly community - which is the best path
towards support for unbounded data across multiple runners. I think the
criteria are written with the completed portability framework in
mind. So
this exchange makes me actually more convinced we should merge
python-sdk
to master.

On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
 wrote:

I do not think that Python SDK yet meets the bar [1] for implementing the
Beam model -- supporting Unbounded data is very important. That said, given
the committed and sustained set of contributors, it generally makes sense
to me to make an exception in anticipation of these features being fleshed
out soon; including potentially new users/contributors that would arrive
once in master.

[1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
k0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com



That is a valid point. The Python SDK supports all the unbounded parts
of the model except for unbounded sources, which was deferred while
seeing how https://s.apache.org/splittable-do-fn played out. I've been
working with the team and merging/reviewing most of their code, and
have full confidence this will be coming (and on that note can vouch
for a healthy community and support which are much harder to add
later).

In short, I think it has the required maturity, and I'm in favor of
merging soonish.

On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay



Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Sergio Fernández
Originally we integrated the build in Maven with the default profile.
Do you feel like it'd be better to have it under a separate profile or so?

On Tue, Jan 31, 2017 at 11:07 AM, Jean-Baptiste Onofré 
wrote:

> Just to be clear, the prerequisites to be able to build the Python SDK are:
>
> apt-get install python-setuptools
> apt-get install python-pip
>
> It's also required by the default "regular" build.
>
> Regards
> JB
>
>
> On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:
>
>> Just one thing I noticed (and can be helpful for others): to build Beam
>> we now need python setuptools installed.
>>
>> For instance, on Ubuntu, you have to do:
>>
>> apt-get install python-setuptools
>>
>> Same for the pip distribution.
>>
>> I guess (if not already done), we have to update README/Building
>> instructions.
>>
>> Correct ?
>>
>> Regards
>> JB
>>
>> On 01/31/2017 08:10 AM, Ahmet Altay wrote:
>>
>>> Hi all,
>>>
>>> This merge is completed. Python SDK is now officially part of the master
>>> branch! Thank you all for the support. Please open an issue, if you
>>> notice
>>> a reference to the now obsolete python-sdk branch in the documentation.
>>>
>>> There will not be any more merges to the python-sdk branch. Going forward
>>> please use the master branch for Python SDK development. There are a few
>>> existing open PRs to the python-sdk [1]. If you are the author of one of
>>> those PRs, please rebase them on top of master.
>>>
>>> Thank you,
>>> Ahmet
>>>
>>> [1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam
>>>
>>>
>>> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles
>>> 
>>> wrote:
>>>
>>> To clarify the implied criteria of that last exchange, it is "An SDK
 should
 have at least one runner that can execute the complete model (may be a
 direct runner)"

 I want to highlight this, because whether an _SDK_ supports unbounded
 data
 is not particularly well-defined, and will evolve:

  - With the Runner API, an SDK will need to support building a graph
 with
 unbounded constructs, as today with probably minimal changes.

  - With the Fn API, if any part of the Fn API is specific to unbounded
 data, the SDK will need to implement it. I think right now there is
 no such
 thing, and we don't want such a thing, so SDKs implementing the Fn API
 automatically support unbounded data.

  - There will also likely be an SDK-specific shim just as there is
 today,
 to leverage idiomatic deserialized representations. The richness of this
 shim will decrease so that it will need to "support" unbounded data but
 that will be a ~one liner.

 Getting the Python SDK on master will accelerate our progress towards
 the
 Fn API - partly technical, partly community - which is the best path
 towards support for unbounded data across multiple runners. I think the
 criteria are written with the completed portability framework in
 mind. So
 this exchange makes me actually more convinced we should merge
 python-sdk
 to master.

 On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
 rober...@google.com.invalid> wrote:

 On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>  wrote:
>
>> I do not think that Python SDK yet meets the bar [1] for implementing the
>> Beam model -- supporting Unbounded data is very important. That said, given
>> the committed and sustained set of contributors, it generally makes sense
>> to me to make an exception in anticipation of these features being fleshed
>> out soon; including potentially new users/contributors that would arrive
>> once in master.
>>
>> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
>> k0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com
>>
>
> That is a valid point. The Python SDK supports all the unbounded parts
> of the model except for unbounded sources, which was deferred while
> seeing how https://s.apache.org/splittable-do-fn played out. I've been
> working with the team and merging/reviewing most of their code, and
> have full confidence this will be coming (and on that note can vouch
> for a healthy community and support which are much harder to add
> later).
>
> In short, I think it has the required maturity, and I'm in favor of
> merging soonish.
>
> On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
>> >
>
> wrote:
>>
>> Thank you all for the comments so far. I would follow the process as

Re: TextIO binary file

2017-01-31 Thread Aviem Zur
So if I understand correctly, the general agreement is that TextIO should not
support anything but lines from files as strings.
I'll go ahead and file a ticket that says the Javadoc should be changed to
reflect this and the `withCoder` method should be removed.

Is there merit for Beam to supply an IO which does allow writing objects to
a file using Beam coders and Beam FS (to write these files to
GS/Hadoop/Local)?

On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
 wrote:

P.S. Note that this point (about coders) is also mentioned in the
now-being-reviewed PTransform Style Guide
https://github.com/apache/beam-site/pull/134
currently staged at
http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders


On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath 
wrote:

> +1 to what Eugene said.
>
> I've seen a number of Python SDK users incorrectly assuming that
> coder.decode() is needed when developing their own file-based sources
> (since many users usually refer to text source first). Probably coder
> parameter should not be configurable for text source/sink and they should
> be updated to only read/write UTF-8 encoded strings.
>
> - Cham
>
> On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
>  wrote:
>
> > The use of Coder in TextIO is a long-standing design issue because coders
> > are not intended to be used for general-purpose conversion of objects to
> > and from bytes; their only proper use is letting the runner materialize and
> > restore objects if the runner thinks it's necessary. IMO it should have
> > been called LineIO, documented that it reads lines of text as String, and
> > not had a withCoder parameter at all.
> >
> > The proper way to address your use case is to write a custom
> > FileBasedSource.
> > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur  wrote:
> >
> > > The Javadoc of TextIO states:
> > >
> > > * By default, {@link TextIO.Read} returns a {@link PCollection} of
> > > {@link String Strings},
> > >  * each corresponding to one line of an input UTF-8 text file. To
> convert
> > > directly from the raw
> > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
> another
> > > object of type {@code T},
> > >  * supply a {@code Coder} using {@link
> TextIO.Read#withCoder(Coder)}.
> > >
> > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > probably
> > > won't work given the hard-coded '\n' delimiter.
> > >
> > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré  >
> > > wrote:
> > >
> > > > Hi Aviem,
> > > >
> > > > TextIO is not designed to write/read binary file: it's pure Text, so
> > > > String.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > Hi,
> > > > >
> > > > > While trying to use TextIO to write/read a binary file rather than
> > > String
> > > > > lines from a textual file I ran into an issue - the delimiter
> TextIO
> > > uses
> > > > > seems to be hardcoded '\n'.
> > > > > See `findSeparatorBounds` -
> > > > >
> > > >
> > >
> >
>
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > >
> > > > > The use case is to have a file of objects, encoded into bytes
> using a
> > > > > coder. However, '\n' is not a good delimiter here, as you can
> > imagine.
> > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > >
> > > >
> > >
> >
>
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > where
> > > > > they use a more appropriate delimiter, to avoid such issues.
> > > > >
> > > > > I did not find any unit tests which use TextIO to read anything
> other
> > > > than
> > > > > Strings.
> > > > >
> > > >
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> >
>
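
The delimiter collision raised in this thread can be sketched concretely. The
following is an illustrative Python sketch, not Beam or Spark code; pickle and
the 4-byte length header are assumptions chosen purely for demonstration. It
shows why a '\n' delimiter cannot safely frame coder-encoded records, and how
length-prefix framing (in the spirit of the sequence files behind Spark's
saveAsObjectFile) avoids the problem:

```python
import pickle
import struct

# Two records whose encoded form contains newline bytes.
records = [{"id": 1, "payload": b"line1\nline2"}, {"id": 2, "payload": b"\r\n"}]
encoded = [pickle.dumps(r) for r in records]

# Newline framing (what a line-oriented sink effectively does) is lossy here:
# splitting back on '\n' yields more fragments than records, because the
# record bytes themselves contain 0x0A.
newline_framed = b"\n".join(encoded)
assert len(newline_framed.split(b"\n")) > len(records)

def write_framed(chunks):
    """Length-prefix framing: a 4-byte big-endian size before each record."""
    out = bytearray()
    for c in chunks:
        out += struct.pack(">I", len(c)) + c
    return bytes(out)

def read_framed(blob):
    """Invert write_framed: read each size header, slice out that record."""
    chunks, i = [], 0
    while i < len(blob):
        (n,) = struct.unpack_from(">I", blob, i)
        chunks.append(blob[i + 4:i + 4 + n])
        i += 4 + n
    return chunks

round_tripped = [pickle.loads(c) for c in read_framed(write_framed(encoded))]
assert round_tripped == records
print("recovered", len(round_tripped), "records intact")
```

A real Beam solution would be a custom FileBasedSource/FileBasedSink as Eugene
suggests; the point here is only that any binary record format needs framing
that cannot collide with the record bytes themselves.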


Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Jean-Baptiste Onofré

Just to be clear, the prerequisites to be able to build the Python SDK are:

apt-get install python-setuptools
apt-get install python-pip

It's also required by the default "regular" build.

Regards
JB

On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:

Just one thing I noticed (and can be helpful for others): to build Beam
we now need python setuptools installed.

For instance, on Ubuntu, you have to do:

apt-get install python-setuptools

Same for the pip distribution.

I guess (if not already done), we have to update README/Building
instructions.

Correct ?

Regards
JB

On 01/31/2017 08:10 AM, Ahmet Altay wrote:

Hi all,

This merge is completed. Python SDK is now officially part of the master
branch! Thank you all for the support. Please open an issue, if you
notice
a reference to the now obsolete python-sdk branch in the documentation.

There will not be any more merges to the python-sdk branch. Going forward
please use the master branch for Python SDK development. There are a few
existing open PRs to the python-sdk [1]. If you are the author of one of
those PRs, please rebase them on top of master.

Thank you,
Ahmet

[1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam



On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles

wrote:


To clarify the implied criteria of that last exchange, it is "An SDK
should
have at least one runner that can execute the complete model (may be a
direct runner)"

I want to highlight this, because whether an _SDK_ supports unbounded
data
is not particularly well-defined, and will evolve:

 - With the Runner API, an SDK will need to support building a graph
with
unbounded constructs, as today with probably minimal changes.

 - With the Fn API, if any part of the Fn API is specific to unbounded
data, the SDK will need to implement it. I think right now there is
no such
thing, and we don't want such a thing, so SDKs implementing the Fn API
automatically support unbounded data.

 - There will also likely be an SDK-specific shim just as there is
today,
to leverage idiomatic deserialized representations. The richness of this
shim will decrease so that it will need to "support" unbounded data but
that will be a ~one liner.

Getting the Python SDK on master will accelerate our progress towards
the
Fn API - partly technical, partly community - which is the best path
towards support for unbounded data across multiple runners. I think the
criteria are written with the completed portability framework in
mind. So
this exchange makes me actually more convinced we should merge
python-sdk
to master.

On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
rober...@google.com.invalid> wrote:


On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
 wrote:

I do not think that Python SDK yet meets the bar [1] for implementing the
Beam model -- supporting Unbounded data is very important. That said, given
the committed and sustained set of contributors, it generally makes sense
to me to make an exception in anticipation of these features being fleshed
out soon; including potentially new users/contributors that would arrive
once in master.

[1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
k0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com


That is a valid point. The Python SDK supports all the unbounded parts
of the model except for unbounded sources, which was deferred while
seeing how https://s.apache.org/splittable-do-fn played out. I've been
working with the team and merging/reviewing most of their code, and
have full confidence this will be coming (and on that note can vouch
for a healthy community and support which are much harder to add
later).

In short, I think it has the required maturity, and I'm in favor of
merging soonish.


On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay


Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Prabeesh K.
https://issues.apache.org/jira/browse/BEAM-1360

On 31 January 2017 at 12:12, Prabeesh K.  wrote:

> https://issues.apache.org/jira/browse/BAHIR-86
>
> On 31 January 2017 at 11:10, Ahmet Altay  wrote:
>
>> Hi all,
>>
>> This merge is completed. Python SDK is now officially part of the master
>> branch! Thank you all for the support. Please open an issue, if you notice
>> a reference to the now obsolete python-sdk branch in the documentation.
>>
>> There will not be any more merges to the python-sdk branch. Going forward
>> please use the master branch for Python SDK development. There are a few
>> existing open PRs to the python-sdk [1]. If you are the author of one of
>> those PRs, please rebase them on top of master.
>>
>> Thank you,
>> Ahmet
>>
>> [1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam
>>
>> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles > >
>> wrote:
>>
>> > To clarify the implied criteria of that last exchange, it is "An SDK
>> should
>> > have at least one runner that can execute the complete model (may be a
>> > direct runner)"
>> >
>> > I want to highlight this, because whether an _SDK_ supports unbounded
>> data
>> > is not particularly well-defined, and will evolve:
>> >
>> >  - With the Runner API, an SDK will need to support building a graph
>> with
>> > unbounded constructs, as today with probably minimal changes.
>> >
>> >  - With the Fn API, if any part of the Fn API is specific to unbounded
>> > data, the SDK will need to implement it. I think right now there is no
>> such
>> > thing, and we don't want such a thing, so SDKs implementing the Fn API
>> > automatically support unbounded data.
>> >
>> >  - There will also likely be an SDK-specific shim just as there is
>> today,
>> > to leverage idiomatic deserialized representations. The richness of this
>> > shim will decrease so that it will need to "support" unbounded data but
>> > that will be a ~one liner.
>> >
>> > Getting the Python SDK on master will accelerate our progress towards
>> the
>> > Fn API - partly technical, partly community - which is the best path
>> > towards support for unbounded data across multiple runners. I think the
>> > criteria are written with the completed portability framework in mind.
>> So
>> > this exchange makes me actually more convinced we should merge
>> python-sdk
>> > to master.
>> >
>> > On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
>> > rober...@google.com.invalid> wrote:
>> >
>> > > On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>> > >  wrote:
>> > > > I do not think that Python SDK yet meets the bar [1] for
>> implementing
>> > the
>> > > > Beam model -- supporting Unbounded data is very important. That
>> said,
>> > > given
>> > > > the committed and sustained set of contributors, it generally makes
>> > sense
>> > > > to me to make an exception in anticipation of these features being
>> > > fleshed
>> > > > out soon; including potentially new users/contributors that would
>> > arrive
>> > > > once in master.
>> > > >
>> > > > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
>> > > > k0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com
>> > >
>> > > That is a valid point. The Python SDK supports all the unbounded parts
>> > > of the model except for unbounded sources, which was deferred while
>> > > seeing how https://s.apache.org/splittable-do-fn played out. I've
>> been
>> > > working with the team and merging/reviewing most of their code, and
>> > > have full confidence this will be coming (and on that note can vouch
>> > > for a healthy community and support which are much harder to add
>> > > later).
>> > >
>> > > In short, I think it has the required maturity, and I'm in favor of
>> > > merging soonish.
>> > >
>> > > > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
>> > > >
>> > > > wrote:
>> > > >
>> > > >> Thank you all for the comments so far. I would follow the process
>> as
>> > > >> suggested by Davor and others in this thread.
>> > > >>
>> > > >> Ahmet
>> > > >>
>> > > >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <
>> wik...@apache.org
>> > >
>> > > >> wrote:
>> > > >>
>> > > >> > Hi
>> > > >> >
>> > > >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
>> > > >> > wrote:
>> > > >> > >
>> > > >> > > tl;dr: I would like to start a discussion about merging
>> > > >> > > python-sdk branch to master branch. Python SDK is mature enough
>> > > >> > > and merging it to master will accelerate its development and
>> > > >> > > adoption.
>> > > >> > >
>> > > >> >
>> > > >> > Good point, Ahmet!
>> > > >> >
>> > > >> > I've been closely following the development 

Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Prabeesh K.
https://issues.apache.org/jira/browse/BAHIR-86

On 31 January 2017 at 11:10, Ahmet Altay  wrote:

> Hi all,
>
> This merge is completed. Python SDK is now officially part of the master
> branch! Thank you all for the support. Please open an issue, if you notice
> a reference to the now obsolete python-sdk branch in the documentation.
>
> There will not be any more merges to the python-sdk branch. Going forward
> please use the master branch for Python SDK development. There are a few
> existing open PRs to the python-sdk [1]. If you are the author of one of
> those PRs, please rebase them on top of master.
>
> Thank you,
> Ahmet
>
> [1] https://github.com/pulls?utf8=✓=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>
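The rebase Ahmet asks PR authors to do above can be sketched end to end. This is an editorial illustration, not part of the original thread: it builds a throwaway repository, and every branch and file name (`python-sdk`, `my-pr-branch`, `sdk.py`, `feature.py`) is a hypothetical example, not a real Beam artifact.

```shell
# Sketch of "rebase your open python-sdk PR onto master" in a scratch repo.
# All names below are invented for the demo.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git config user.email dev@example.com
git config user.name Dev
main=$(git symbolic-ref --short HEAD)     # default branch name (master/main)
echo base > file.txt && git add file.txt && git commit -qm "base"
git checkout -qb python-sdk               # the long-lived SDK branch
echo sdk > sdk.py && git add sdk.py && git commit -qm "sdk work"
git checkout -qb my-pr-branch             # an open PR against python-sdk
echo feature > feature.py && git add feature.py && git commit -qm "my PR"
git checkout -q "$main"                   # the merge described in the mail
git merge -q --no-ff -m "Merge python-sdk into master" python-sdk
git checkout -q my-pr-branch              # now re-parent the PR onto master
git rebase -q "$main"
git log --oneline -n 1
```

After the rebase, the PR branch carries only its own commit on top of the merged master history, so the PR can be retargeted at master cleanly.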
> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles 
> wrote:
>
> > To clarify the implied criteria of that last exchange, it is "An SDK
> > should have at least one runner that can execute the complete model
> > (may be a direct runner)"
> >
> > I want to highlight this, because whether an _SDK_ supports unbounded
> > data is not particularly well-defined, and will evolve:
> >
> >  - With the Runner API, an SDK will need to support building a graph with
> > unbounded constructs, as today with probably minimal changes.
> >
> >  - With the Fn API, if any part of the Fn API is specific to unbounded
> > data, the SDK will need to implement it. I think right now there is no
> > such thing, and we don't want such a thing, so SDKs implementing the Fn
> > API automatically support unbounded data.
> >
> >  - There will also likely be an SDK-specific shim just as there is today,
> > to leverage idiomatic deserialized representations. The richness of this
> > shim will decrease so that it will need to "support" unbounded data but
> > that will be a ~one liner.
> >
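As an editorial aside, the idea in the quoted bullets — an SDK "building a graph with unbounded constructs" that a runner can inspect — can be made concrete with a toy model. This is not Beam's actual API; every class and name below is invented for illustration. The essential point is that each collection node in the pipeline graph carries a bounded/unbounded flag that transforms propagate:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

class Boundedness(Enum):
    BOUNDED = "bounded"
    UNBOUNDED = "unbounded"

@dataclass(frozen=True)
class PCollectionNode:
    """Toy stand-in for a collection node in a pipeline graph."""
    name: str
    boundedness: Boundedness

def apply_transform(name: str, inputs: List[PCollectionNode]) -> PCollectionNode:
    # A transform's output is unbounded if any of its inputs is unbounded,
    # so boundedness propagates through the whole graph.
    if all(p.boundedness is Boundedness.BOUNDED for p in inputs):
        return PCollectionNode(name, Boundedness.BOUNDED)
    return PCollectionNode(name, Boundedness.UNBOUNDED)

source = PCollectionNode("ReadFromUnboundedSource.out", Boundedness.UNBOUNDED)
side = PCollectionNode("Create.out", Boundedness.BOUNDED)
joined = apply_transform("Join.out", [source, side])
print(joined.boundedness.value)  # unbounded
```

With this shape, "supporting unbounded data" at graph-construction time is little more than setting and propagating the flag, which is why the shim cost described above can shrink to roughly a one-liner.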
> > Getting the Python SDK on master will accelerate our progress towards the
> > Fn API - partly technical, partly community - which is the best path
> > towards support for unbounded data across multiple runners. I think the
> > criteria are written with the completed portability framework in mind. So
> > this exchange makes me actually more convinced we should merge python-sdk
> > to master.
> >
> > On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
> > rober...@google.com.invalid> wrote:
> >
> > > On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
> > >  wrote:
> > > > I do not think that Python SDK yet meets the bar [1] for implementing
> > > > the Beam model -- supporting Unbounded data is very important. That
> > > > said, given the committed and sustained set of contributors, it
> > > > generally makes sense to me to make an exception in anticipation of
> > > > these features being fleshed out soon; including potentially new
> > > > users/contributors that would arrive once in master.
> > > >
> > > > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
> > > > k0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com
> > >
> > > That is a valid point. The Python SDK supports all the unbounded parts
> > > of the model except for unbounded sources, which was deferred while
> > > seeing how https://s.apache.org/splittable-do-fn played out. I've been
> > > working with the team and merging/reviewing most of their code, and
> > > have full confidence this will be coming (and on that note can vouch
> > > for a healthy community and support which are much harder to add
> > > later).
> > >
> > > In short, I think it has the required maturity, and I'm in favor of
> > > merging soonish.
> > >
> > > > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
> > > > wrote:
> > > >
> > > >> Thank you all for the comments so far. I would follow the process as
> > > >> suggested by Davor and others in this thread.
> > > >>
> > > >> Ahmet
> > > >>
> > > >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wik...@apache.org>
> > > >> wrote:
> > > >>
> > > >> > Hi
> > > >> >
> > > >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
> > > >> > wrote:
> > > >> > >
> > > >> > > tl;dr: I would like to start a discussion about merging
> > > >> > > python-sdk branch to master branch. Python SDK is mature enough
> > > >> > > and merging it to master will accelerate its development and
> > > >> > > adoption.
> > > >> > >
> > > >> >
> > > >> > Good point, Ahmet!
> > > >> >
> > > >> > I've been closely following the development since it was imported
> > > >> > in June. For the prototypes I've implemented so far it works quite
> > > >> > well; I guess we'd just need to focus the next months on bringing
> > > >> > more runner support.
> > > >> >
> > > >> > With a great effort from a lot of