Re: [DISCUSSION] Add hint/option on PCollection

2018-01-31 Thread Ismaël Mejía
This is a subject we have already discussed in the past. It was part
of the discussion on ‘data locality’ for the runners on top of HDFS.
At that time the argument for ‘hints’ was that a transform could
send hints to the runners so they could properly allocate the readers,
improving execution. This is similar to the case of resource
allocation (GPU) mentioned by Reuven.
https://issues.apache.org/jira/browse/BEAM-2085

What is a bit tricky about the design is the optional nature of
hints: we say that hints should not change the semantics of the
transforms (their output), but they can easily be abused to configure
how runners behave. We should limit hints to the use case of
resource allocation, cases where the runner can benefit from the hint
and pass it on to the resource allocator; runner-specific
configuration must live only in the runner options, or runners
should simply be smarter.

This is to avoid potential misuse for portability and to limit extra
knobs, and also to avoid the risky case of ending up with some sort of
runtime ‘map-like’ configuration with hundreds of options that change
behavior, as exists in Hadoop and Spark. We should avoid adding
another layer of that kind of variable on top of Beam.
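
To make the kind of API under discussion concrete, here is a minimal sketch of a
transform-level hint wrapper along the lines of Kenn's "WithHints" idea quoted
below. Every name in it (Hints, WithHints) is an assumption for illustration,
not an existing Beam API, and a runner would still be free to ignore the hints:

  import java.io.Serializable;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.beam.sdk.transforms.PTransform;
  import org.apache.beam.sdk.values.PCollection;

  // Hypothetical resource hints attached to a transform; they never change its output.
  class Hints implements Serializable {
    private final Map<String, String> values = new HashMap<>();

    static Hints of(String key, String value) {
      Hints hints = new Hints();
      hints.values.put(key, value);
      return hints;
    }
  }

  // Hypothetical composite that wraps a transform and exposes hints to the runner.
  class WithHints<InT, OutT> extends PTransform<PCollection<InT>, PCollection<OutT>> {
    private final PTransform<PCollection<InT>, PCollection<OutT>> inner;
    private final Hints hints; // a runner could read these during translation

    WithHints(PTransform<PCollection<InT>, PCollection<OutT>> inner, Hints hints) {
      this.inner = inner;
      this.hints = hints;
    }

    @Override
    public PCollection<OutT> expand(PCollection<InT> input) {
      return input.apply(inner);
    }
  }

  // Usage: pc.apply(new WithHints<>(myTransform, Hints.of("gpu", "1")));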

On Wed, Jan 31, 2018 at 7:25 AM, Jean-Baptiste Onofré  wrote:
> Hi,
>
> yeah, it sounds good to me. I will create the Jira to track this and start a 
> PoC
> on the Composite.
>
> Thanks !
> Regards
> JB
>
> On 01/30/2018 10:40 PM, Reuven Lax wrote:
>> Did we actually reach consensus here? :)
>>
>> On Tue, Jan 30, 2018 at 1:29 PM, Romain Manni-Bucau wrote:
>>
>> Not sure how it fits in terms of API yet but +1 for the high level view.
>> Makes perfect sense.
>>
>> On 30 Jan 2018 at 21:41, "Jean-Baptiste Onofré" wrote:
>>
>> Hi Robert,
>>
>> Good point and idea for the Composite transform. It would apply 
>> nicely
>> on all transforms based on composite.
>>
>> I also agree that the hint is more on the transform than the 
>> PCollection
>> itself.
>>
>> Thanks !
>> Regards
>> JB
>>
>> On 30/01/2018 21:26, Robert Bradshaw wrote:
>>
>> Many hints make more sense for PTransforms (the computation 
>> itself)
>> than for PCollections. In addition, when we want properties 
>> attached
>> to PCollections of themselves, it often makes sense to let these 
>> be
>> provided by the producing PTransform (e.g. coders and schemas are
>> often functions of the input metadata and the operation itself, 
>> and
>> can't just be set arbitrarily).
>>
>> Also, we already have a perfectly standard way of nesting 
>> transforms
>> (or even sets of transforms), namely composite transforms. In 
>> terms of
>> API design I would propose writing a composite transform that 
>> applies
>> constraints/hints/requirements to all its inner transforms. This
>> translates nicely to the Fn API as well.
>>
>> On Tue, Jan 30, 2018 at 12:14 PM, Kenneth Knowles wrote:
>>
>> It seems like most of these use cases are hints on a 
>> PTransform
>> and not a
>> PCollection, no? CPU, memory, expected parallelism, etc are.
>> Then you could
>> just have:
>>  pc.apply(WithHints(myTransform, ))
>>
>> For a PCollection hints that might make sense are bits like
>> total size,
>> element size, and throughput. All things that the Dataflow 
>> folks
>> have said
>> should be measured instead of hinted. But I understand that 
>> we
>> shouldn't
>> force runners to do infeasible things like build a whole
>> no-knobs service on
>> top of a super-knobby engine.
>>
>> Incidentally for portability, we have this "environment" 
>> object
>> that is
>> basically the docker URL of an SDK harness that can execute a
>> function. We
>> always intended that same area of the proto (exact fields 
>> TBD)
>> to have
>> things like requirements for CPU, memory, GPUs, disk, etc. 
>> It is
>> likely a
>> good place for hints.
>>
>> BTW good idea to ask users@ for their pain points and bring 
>> them
>> back to the
>> dev list to motivate feature design discussions.
>>
>> Kenn
>>
>> On Tue, Jan 30, 2018 at 12:00 PM, Reuven Lax wrote:
>>
>>
>> I think the hints would logically be metadata in the
>>   

Build failed in Jenkins: beam_Release_NightlySnapshot #671

2018-01-31 Thread Apache Jenkins Server
See 


--
[...truncated 4.17 MB of Maven build output for beam-runners-gcp-gcsproxy: mvn-golang-wrapper go builds (default and linux-amd64), checkstyle, findbugs, jar/test-jar/shade/assembly packaging, source and javadoc attachment, reproducible-build stripping, dependency analysis, install, and the beginning of the deploy step; file paths and URLs are not preserved in the archive...]

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-31 Thread Romain Manni-Bucau
Can we avoid it anyway? Not having it makes migrating away from Beam
very tempting, since the runtime difference can be significant in terms of
performance.

What about:
1. adding hints as @Experimental
2. see how it grows over a few releases (say 6 months)
3. decide whether to keep it or drop it

Whatever you do, if you intend to be portable you will need to expose
the underlying implementation features somehow at some point to enable users.
Hints are a lightweight way to do it.
The hard way - and the alternative - is to enable an "unwrap" to access the
underlying model (like SparkContext), but that is far more vicious and
unmanageable in terms of application code and maintenance IMHO.
In that context, hints are a cheap and acceptable trade-off that enables
these features without breaking users.

Am I missing something?



Romain Manni-Bucau
@rmannibucau | Blog | Old Blog | Github | LinkedIn

2018-01-31 9:31 GMT+01:00 Ismaël Mejía :

> This is a subject we have already discussed in the past. It was part
> on the discussion on ‘data locality’ for the runners on top of HDFS.
> In that moment the argument for ‘hints’ was that a transform could
> send hints to the runners so they properly allocate the readers
> improving its execution. This is similar to the case of resource
> allocation (GPU) mentioned by Reuven.
> https://issues.apache.org/jira/browse/BEAM-2085
>
> What is a bit tricky about the design is the optional characteristic
> of hints, we say that hints should not change the semantics of the
> transforms (its output), but they can easily be abused to configure
> how runners behave. We should limit hints only to the use case of
> resource allocation, cases where the runner can benefit of the hint
> info to pass it to the resource allocator, but runner specific
> configuration must be part only of the runner options, or runners
> should be smarter.
>
> This is to avoid potential misuse for portability and to limit extra
> knobs, Also to avoid the risky case of ending up with some sort of
> runtime ‘map-like’ configuration with hundreds of options that change
> behavior like they exist in Hadoop and Spark, We should avoid adding
> another level of this kind of variables now on top of Beam.
>
> On Wed, Jan 31, 2018 at 7:25 AM, Jean-Baptiste Onofré 
> wrote:
> > Hi,
> >
> > yeah, it sounds good to me. I will create the Jira to track this and
> start a PoC
> > on the Composite.
> >
> > Thanks !
> > Regards
> > JB
> >
> > On 01/30/2018 10:40 PM, Reuven Lax wrote:
> >> Did we actually reach consensus here? :)
> >>
> >> On Tue, Jan 30, 2018 at 1:29 PM, Romain Manni-Bucau <
> rmannibu...@gmail.com
> >> > wrote:
> >>
> >> Not sure how it fits in terms of API yet but +1 for the high level
> view.
> >> Makes perfect sense.
> >>
> >> Le 30 janv. 2018 21:41, "Jean-Baptiste Onofré"  >> > a écrit :
> >>
> >> Hi Robert,
> >>
> >> Good point and idea for the Composite transform. It would apply
> nicely
> >> on all transforms based on composite.
> >>
> >> I also agree that the hint is more on the transform than the
> PCollection
> >> itself.
> >>
> >> Thanks !
> >> Regards
> >> JB
> >>
> >> On 30/01/2018 21:26, Robert Bradshaw wrote:
> >>
> >> Many hints make more sense for PTransforms (the computation
> itself)
> >> than for PCollections. In addition, when we want properties
> attached
> >> to PCollections of themselves, it often makes sense to let
> these be
> >> provided by the producing PTransform (e.g. coders and
> schemas are
> >> often functions of the input metadata and the operation
> itself, and
> >> can't just be set arbitrarily).
> >>
> >> Also, we already have a perfectly standard way of nesting
> transforms
> >> (or even sets of transforms), namely composite transforms.
> In terms of
> >> API design I would propose writing a composite transform
> that applies
> >> constraints/hints/requirements to all its inner transforms.
> This
> >> translates nicely to the Fn API as well.
> >>
> >> On Tue, Jan 30, 2018 at 12:14 PM, Kenneth Knowles <
> k...@google.com
> >> > wrote:
> >>
> >> It seems like most of these use cases are hints on a
> PTransform
> >> and not a
> >> PCollection, no? CPU, memory, expected parallelism, etc
> are.
> >> Then you could
> >> just have:
> >>  pc.apply(WithHints(myTransform, ))
> >>
> >> For a PCollection hints that might make sense are bits
> like
> >> total size,
> >> element size, and throughpu

Re: [DISCUSS] State of the project: Culture and governance

2018-01-31 Thread Ismaël Mejía
Some extra comments inlined:

JB,

> 1. The Apache Beam website contains a link to the Apache Code of Conduct (on 
> the dropdown menu under the feather icon ;)). Maybe we can also add a link to 
> the contribution guide as it's really important, especially for people who 
> target the committership.

Sure, we should also link the contribution guide; my idea is to make the
code of conduct more visible so everybody knows what to follow and
what to expect.

> 3. You got me point: it's subjective. The link to Flink doesn't bring 
> anything more: "There is no strict protocol for becoming a committer". So, no 
> problem to add such section, but I don't see lot of details providing there ;)

I disagree that it doesn’t bring anything more: Beam can be the first
project a contributor approaches, and in that case he/she will not know
this stuff, so having some kind of guidance is important.

> Actually, Flink just copied the section from Apache:
> https://community.apache.org/

From a quick look I could not find the same section there, but I agree
that we can add references to the Apache one if you can point me to
the exact part where it is.


Lukasz, Kenn:

> I have to -1 reductions in the code review quality bar as this leads to test 
> problems, which leads to CI issues, which leads to gaps in coverage and then 
> to delayed, bad and broken releases.

Lowering the bar on code quality should be avoided.

We still lack some clear rules for code reviews; this is probably
worth an addition to the committer guide. For example, if a
first-time contributor fixes some error and the fix has the expected
quality, we should accept it quickly, without being picky about some
other part discovered during the review that could be improved, or at
least not without some extra encouragement to tackle it as an
additional task. Also, cases where the reviewer proposes an improvement
and then changes his mind must be absolutely avoided, or the reviewer
should at least apologize or work on that part of the fix. These
ideas are meant to encourage contribution and not disappoint
contributors, and they apply to casual / first-time contributors; with
committers and paid contributors these situations could be more
acceptable, even if not ideal.

On Tue, Jan 30, 2018 at 7:55 PM, Lukasz Cwik  wrote:
> I have to -1 reductions in the code review quality bar as this leads to test
> problems, which leads to CI issues, which leads to gaps in coverage and then
> to delayed, bad and broken releases.
>
> +1 on converting Google docs to either markdown or including them on the
> website since it is a valuable resource and becomes indexed via search
> engines.
>
> I don't have a strong opinion on the other topics discussed but many of them
> sound like good ideas.
>
> On Mon, Jan 29, 2018 at 7:21 AM, Jean-Baptiste Onofré 
> wrote:
>>
>> Hi,
>>
>> 1. The Apache Beam website contains a link to the Apache Code of Conduct
>> (on the
>> dropdown menu under the feather icon ;)). Maybe we can also add a link to
>> the
>> contribution guide as it's really important, especially for people who
>> target
>> the committership.
>>
>> 2. The retention time of Google doc is a valid point. Resume on the
>> mailing list
>> is a minimum. I like to have doc on the scm, but it means to use markdown
>> syntax
>> (it's what we are doing for some Apache projects). It's another option to
>> explore.
>>
>> 3. You got me point: it's subjective. The link to Flink doesn't bring
>> anything
>> more: "There is no strict protocol for becoming a committer". So, no
>> problem to
>> add such section, but I don't see lot of details providing there ;)
>>
>> Actually, Flink just copied the section from Apache:
>>
>> https://community.apache.org/contributors/
>>
>> Maybe we can just refer this page.
>>
>> Regards
>> JB
>>
>> On 01/29/2018 03:30 PM, Ismaël Mejía wrote:
>> > Hello again,
>> >
>> > Some comments inlined:
>> >
>> >
>> > JB,
>> >
>> >> 1. The code of conduct is the one from Apache.
>> >
>> > Yes I am not necessarily saying that we need a new one, I am just
>> > saying that we need to make this explicit, not sure everybody is aware
>> > of it https://www.apache.org/foundation/policies/conduct.html
>> >
>> >> 2. If it's not happen on the mailing list, it doesn't exist. That's the
>> >> Apache
>> > rule. We already discussed about that in the past: having wiki or Google
>> > doc is
>> > not a problem as soon as a summary is sent on the mailing list.
>> >
>> > I agree that a resume to the mailing list is a must, but I was
>> > referring more to preservation purposes, google docs are really
>> > practical, but have the issue that can be removed without a ‘public’
>> > copy in the open (at least the wiki will avoid this). This is still an
>> > open subject because we haven’t finished a proposal process (see
>> > below).
>> >
>> >> I don't see why and where Beam could be different from the other Apache
>> >> projects
>> > for the first three points.
>> >
>> > The three points are

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-31 Thread Jean-Baptiste Onofré
Hi Ismaël,

I agree that hints should not change the output of PTransforms.

However, let me illustrate why I think hints could be interesting:

- I agree with what you are saying about the runners: they should be smart.
However, to be smart enough, the runner could use some statements provided by the
pipeline designer. Let's take the Spark runner. In the first version, we didn't do
any caching of RDDs. Now, just for streaming, if a PCollection is read more than
once, I added RDD caching. That's systematic, and the user doesn't have a
choice. When you write the same process using Spark directly, the user has
control over the way the RDDs are cached, and it can depend on the use case
(see the Spark sketch after this list). So, I think users will be happy to give
some hints to the runner at certain points of the pipeline.

- Hints would open up new features in the IOs. Let me take a concrete example. On a
local branch, I created a RestIO. The RestIO Write PTransform calls a remote REST
service for each element in the PCollection. It's configured that way:

  pipeline.apply(...) // PCollection<String> where String is JSON for now
      .apply(RestIO.write().withUrl("http://localhost:8181/rest/"))

So all elements will use the same URL. Now, imagine the URL of the REST service
depends on the element. There's no easy way to achieve that today: the only way
is to change the IO to process a PCollection<RestRequest> where the RestRequest POJO
contains the message payload and the URL. It's OK to use such a declarative
approach, it's just a bit painful to change the POJO each time we need to deal
with additional data (let's say encoding, authentication, etc).
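
To make the Spark caching point in the first item above concrete, this is roughly
what explicit cache control looks like in plain Spark, next to a purely
hypothetical hint-based Beam equivalent (the WithHints/Hints names are
assumptions from this thread, not an existing API):

  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.spark.storage.StorageLevel;

  public class CachingExample {
    static void run(JavaSparkContext sc) {
      JavaRDD<String> parsed = sc.textFile("hdfs:///input").map(String::trim);

      // Plain Spark: the pipeline author decides if and how a reused RDD is cached.
      parsed.persist(StorageLevel.MEMORY_AND_DISK());

      long errors = parsed.filter(l -> l.startsWith("ERROR")).count();
      long total = parsed.count();
      System.out.println(errors + " errors out of " + total);

      // Hypothetical Beam equivalent with a hint (illustrative names only):
      // parsedPCollection.apply(new WithHints<>(nextTransform, Hints.of("cache", "MEMORY_AND_DISK")));
    }
  }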

In Camel, you can attach a header to a message. So basically, you have the data
payload (the body of the message) corresponding to the element.

That's why I was thinking about hints on PCollection first: it would allow us to
implement EIPs (Enterprise Integration Patterns).

Regards
JB

On 01/31/2018 09:31 AM, Ismaël Mejía wrote:
> This is a subject we have already discussed in the past. It was part
> on the discussion on ‘data locality’ for the runners on top of HDFS.
> In that moment the argument for ‘hints’ was that a transform could
> send hints to the runners so they properly allocate the readers
> improving its execution. This is similar to the case of resource
> allocation (GPU) mentioned by Reuven.
> https://issues.apache.org/jira/browse/BEAM-2085
> 
> What is a bit tricky about the design is the optional characteristic
> of hints, we say that hints should not change the semantics of the
> transforms (its output), but they can easily be abused to configure
> how runners behave. We should limit hints only to the use case of
> resource allocation, cases where the runner can benefit of the hint
> info to pass it to the resource allocator, but runner specific
> configuration must be part only of the runner options, or runners
> should be smarter.
> 
> This is to avoid potential misuse for portability and to limit extra
> knobs, Also to avoid the risky case of ending up with some sort of
> runtime ‘map-like’ configuration with hundreds of options that change
> behavior like they exist in Hadoop and Spark, We should avoid adding
> another level of this kind of variables now on top of Beam.
> 
> On Wed, Jan 31, 2018 at 7:25 AM, Jean-Baptiste Onofré  
> wrote:
>> Hi,
>>
>> yeah, it sounds good to me. I will create the Jira to track this and start a 
>> PoC
>> on the Composite.
>>
>> Thanks !
>> Regards
>> JB
>>
>> On 01/30/2018 10:40 PM, Reuven Lax wrote:
>>> Did we actually reach consensus here? :)
>>>
>>> On Tue, Jan 30, 2018 at 1:29 PM, Romain Manni-Bucau >> > wrote:
>>>
>>> Not sure how it fits in terms of API yet but +1 for the high level view.
>>> Makes perfect sense.
>>>
>>> Le 30 janv. 2018 21:41, "Jean-Baptiste Onofré" >> > a écrit :
>>>
>>> Hi Robert,
>>>
>>> Good point and idea for the Composite transform. It would apply 
>>> nicely
>>> on all transforms based on composite.
>>>
>>> I also agree that the hint is more on the transform than the 
>>> PCollection
>>> itself.
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On 30/01/2018 21:26, Robert Bradshaw wrote:
>>>
>>> Many hints make more sense for PTransforms (the computation 
>>> itself)
>>> than for PCollections. In addition, when we want properties 
>>> attached
>>> to PCollections of themselves, it often makes sense to let 
>>> these be
>>> provided by the producing PTransform (e.g. coders and schemas 
>>> are
>>> often functions of the input metadata and the operation itself, 
>>> and
>>> can't just be set arbitrarily).
>>>
>>> Also, we already have a perfectly standard way of nesting 
>>> transforms
>>> (or even sets of transforms), namely composite transforms. In 
>>> terms of
>>> API design I would propose writing a composite tr

Re: [DISCUSS] State of the project: Culture and governance

2018-01-31 Thread Jean-Baptiste Onofré
Sorry, but I fail to see the difference between Beam and other Apache projects.
It's not a GitHub project: Beam follows the rules defined by the Apache Software
Foundation.

The Beam website/contribution guide or whatever should just point to the Apache website
resources. They are clearly explained and detailed.

If you want to add additional details, no problem, but be sure I will double
check that they are aligned and compliant with Apache rules.

Regards
JB

On 01/31/2018 10:11 AM, Ismaël Mejía wrote:
> Some extra comments inlined:
> 
> JB,
> 
>> 1. The Apache Beam website contains a link to the Apache Code of Conduct (on 
>> the dropdown menu under the feather icon ;)). Maybe we can also add a link 
>> to the contribution guide as it's really important, especially for people 
>> who target the committership.
> 
> Sure we should put it the contribution guide, my idea is that we the
> code of conduct more visible so everybody knows what to follow and
> what to expect.
> 
>> 3. You got me point: it's subjective. The link to Flink doesn't bring 
>> anything more: "There is no strict protocol for becoming a committer". So, 
>> no problem to add such section, but I don't see lot of details providing 
>> there ;)
> 
> I disagree with the fact that doesn’t bring anything more, Beam can be
> the first project a contributor approaches and in this case he/she
> will not know this stuff, having some kind of guidance is important.
> 
>> Actually, Flink just copied the section from Apache:
>> https://community.apache.org/
> 
> From a quick look I could not find the same section there, but I agree
> that we can add references to this Apache one if you can refer to me
> the exact part where it is.
> 
> 
> Lukasz, Kenn:
> 
>> I have to -1 reductions in the code review quality bar as this leads to test 
>> problems, which leads to CI issues, which leads to gaps in coverage and then 
>> to delayed, bad and broken releases.
> 
> Lowering the bar on code quality should be avoided.
> 
> We still lack some clear rules for code reviews, this is probably
> worth of an addition to the committer guide. For example if a
> first-time contributor fixes some error and has the expected quality
> we should accept it quickly, not being picky about some other part
> that can be improved that was discovered during the review, or at
> least not doing this without some extra encouragement to do it as an
> additional task. Also cases where the reviewer proposes an improvement
> and then changes his mind must be absolutely avoided, or the reviewer
> should at least ask for excuses or work on that part of the fix. These
> ideas are to encourage contribution and not disappoint contributors,
> and they apply to casual / initial contributors, with committers and
> paid contributors these situations could be more acceptable even if
> not ideal.
> 
> On Tue, Jan 30, 2018 at 7:55 PM, Lukasz Cwik  wrote:
>> I have to -1 reductions in the code review quality bar as this leads to test
>> problems, which leads to CI issues, which leads to gaps in coverage and then
>> to delayed, bad and broken releases.
>>
>> +1 on converting Google docs to either markdown or including them on the
>> website since it is a valuable resource and becomes indexed via search
>> engines.
>>
>> I don't have a strong opinion on the other topics discussed but many of them
>> sound like good ideas.
>>
>> On Mon, Jan 29, 2018 at 7:21 AM, Jean-Baptiste Onofré 
>> wrote:
>>>
>>> Hi,
>>>
>>> 1. The Apache Beam website contains a link to the Apache Code of Conduct
>>> (on the
>>> dropdown menu under the feather icon ;)). Maybe we can also add a link to
>>> the
>>> contribution guide as it's really important, especially for people who
>>> target
>>> the committership.
>>>
>>> 2. The retention time of Google doc is a valid point. Resume on the
>>> mailing list
>>> is a minimum. I like to have doc on the scm, but it means to use markdown
>>> syntax
>>> (it's what we are doing for some Apache projects). It's another option to
>>> explore.
>>>
>>> 3. You got me point: it's subjective. The link to Flink doesn't bring
>>> anything
>>> more: "There is no strict protocol for becoming a committer". So, no
>>> problem to
>>> add such section, but I don't see lot of details providing there ;)
>>>
>>> Actually, Flink just copied the section from Apache:
>>>
>>> https://community.apache.org/contributors/
>>>
>>> Maybe we can just refer this page.
>>>
>>> Regards
>>> JB
>>>
>>> On 01/29/2018 03:30 PM, Ismaël Mejía wrote:
 Hello again,

 Some comments inlined:


 JB,

> 1. The code of conduct is the one from Apache.

 Yes I am not necessarily saying that we need a new one, I am just
 saying that we need to make this explicit, not sure everybody is aware
 of it https://www.apache.org/foundation/policies/conduct.html

> 2. If it's not happen on the mailing list, it doesn't exist. That's the
> Apache
 rule. We already discussed about that in the past: 

Build failed in Jenkins: beam_PostRelease_NightlySnapshot #12

2018-01-31 Thread Apache Jenkins Server
See 


Changes:

[kenn] Add a test for an event time timer loop in ParDo

[kedin] [SQL] Refactor Variance

[kedin] [Nexmark][SQL] Implement sql query 3

[pawel.pk.kaczmarczyk] [BEAM-2469] Handling Kinesis shards splits and merges

[tgroh] Add CoderTranslatorRegistrar

[tgroh] Add slf4j_simple to the top level Gradle build

[tgroh] Implement FnService in FnApiControlClientPoolService

[tgroh] Add a Timeout to GrpcDataService#send

[tgroh] Use a Data Service in SdkHarnessClient

[XuMingmin] [BEAM-3525] Fix KafkaIO metric (#4524)

[chamikara] Updates PTransform overriding to create a new AppliedPTransform 
object

--
Started by timer
[EnvInject] - Loading node environment variables.
Building remotely on beam4 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/beam.git # timeout=10
Fetching upstream changes from https://github.com/apache/beam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/beam.git 
 > +refs/heads/*:refs/remotes/origin/* 
 > +refs/pull/${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*
 > git rev-parse origin/master^{commit} # timeout=10
Checking out Revision 645e1b51e1c83ec4b0b54a774d6615624f3c42ed (origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 645e1b51e1c83ec4b0b54a774d6615624f3c42ed
Commit message: "Merge pull request #4448: [SQL] Refactor Variance"
 > git rev-list dc24beb9f3651dc05749b8fac9bc6f6acacf7022 # timeout=10
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
[EnvInject] - Executing scripts and injecting environment variables after the 
SCM step.
[EnvInject] - Injecting as environment variables the properties content 
SPARK_LOCAL_IP=127.0.0.1

[EnvInject] - Variables injected successfully.
[beam_PostRelease_NightlySnapshot] $ /bin/bash -xe 
/tmp/jenkins8494512276168456757.sh
+ cd src/release
+ groovy quickstart-java-direct.groovy
/tmp/jenkins8494512276168456757.sh: line 2: groovy: command not found
Build step 'Execute shell' marked build as failure


Re: [DISCUSS] State of the project

2018-01-31 Thread Etienne Chauchot

Thanks Kenn and Luke for your comments.

WDYT about my proposition (below) to add methods to the runner API to
enhance the coherence between the runners?


WDYT about my other proposition (below) of trying to avoid having
ValidatesRunner tests that are specific to a runner, like we have now?


Thanks,

Etienne


On 26/01/2018 at 21:34, Kenneth Knowles wrote:
I also think that at a high level the success of Beam as a 
project/community and as a piece of software depends on having 
multiple viable runners with healthy set of users and contributors. 
The pieces that are missing to me:


*User-focused comparison of runners (and IOs)*
+1 to Jesse's. Automated capability tests don't really help this. 
Benchmarks will be part of the story but are worth very little on 
their own. Focusing on these is just choosing to measure things that 
are easy to measure instead of addressing what is important, which is 
in the end almost always qualitative.


*Automated integration tests on clusters*
We do need to know that runners and IOs "work" in a basic yes/no 
manner on every commit/release, beyond unit tests. I am not really 
willing to strongly claim to a potential user that something "works" 
without this level of automation.


*More uniform operational experiences*
Setting up your Spark/Flink/Apex deployment should be different. 
Launching a Beam pipeline on it should not be.


*Portability: Any SDK on any runner*
We have now one SDK on master and one SDK on a dev branch that both 
support portable execution somewhat. Unfortunately we have no major 
open source runner that supports portability*. "Java on any runner" is 
not compelling enough any more, if it ever was.




Reviews: I agree our response latency is too slow. I do not agree that 
our quality bar is too high; I think we should raise it 
*significantly*. Our codebase fails tests for long periods. Our tests 
need to be green enough that we are comfortable blocking merges *even 
for unrelated failures*. We should be able to cut a release any time, 
modulo known blocker-level bugs.


Runner dev: I think Etienne's point about making it more uniform to 
add features to all runners actually is quite important, since the 
portability framework is a lot harder than "translate a Beam ParDo to 
XYZ's FlatMap" where they are both Java. And even the support code 
we've been building is not obvious to use and probably won't be for 
the foreseeable future. This fits well into the "Ben thread" on 
technical ideas so I'll comment there.


Kenn

*We do have a local batch-only portable runner in Python

On Fri, Jan 26, 2018 at 10:09 AM, Lukasz Cwik wrote:


Etienne, for the cross runner coherence, the portability framework
is attempting to create an API across all runners for job
management and job execution. A lot of work still needs to be done
to define and implement these APIs and migrate runners and SDKs to
support it since the current set of Java APIs are adhoc in usage
and purpose. In my opinion, development should really be focused
to migrate runners and SDKs to use these APIs to get developer
coherence. Work is slowly progressing on integrating them into the
Java, Python, and Go SDKs and there are several JIRA issues in
this regard but involvement from more people could help.

Some helpful pointers are:
https://s.apache.org/beam-runner-api

https://s.apache.org/beam-fn-api 

https://issues.apache.org/jira/browse/BEAM-3515?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20portability



On Fri, Jan 26, 2018 at 7:21 AM, Etienne Chauchot wrote:

Hi all,

Does anyone have comments about my point about dev coherence
across the runners?

Thanks
Etienne


On 22/01/2018 at 16:16, Etienne Chauchot wrote:

Thanks Davor for bringing this discussion up!

I particularly like that you listed the different areas of
improvement and proposed to assign people based on their
tastes.

I wanted to add a point about consistency across runners,
but through the dev point of view: I've been working on a
trans-runner feature lately (metrics push agnostic of the
runners) for which I compared the behavior of the runners
and wired up this feature into the flink and spark runners
themselves. I must admit that I had a hard time figuring
out how to wire it up in the different runners and that it
was completely different between the runners. Also, their
use (or non-use) of runner-core facilities vary. Even in
the architecture of the tests: some, like spark, own their
  

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-31 Thread Reuven Lax
On Wed, Jan 31, 2018 at 1:34 AM, Jean-Baptiste Onofré 
wrote:

> Hi Ismaël,
>
> I agree that hint should not change the output of PTransforms.
>
> However, let me illustrate why I think hint could be interesting:
>
> - I agree with what you are saying about the runners: they should be smart.
> However, to be smart enough, the runner could use some statements provided
> by
> pipeline designer. Let's take the Spark runner. In first version, we
> didn't do
> any caching of RDDs. Now, just for streaming, if a PCollection is read
> more than
> once, I added RDDs caching. That's systematic, and the user doesn't have
> the
> choice. When you write the same process using Spark directly, the user has
> control of the way the RDDs are cached, and it can depend of the use case.
> So, I
> think users will be happy to give some hint to the runner at some points
> of the
> pipeline.
>
> - Hint will open new features in the IOs. Let me take a concrete example.
> On a
> local branch, I created a RestIO. The RestIO Write PTransform call a
> remote REST
> service for each element in the PCollection. It's configured that way:
>
>   pipeline.apply(...) // PCollection where String is JSON for now
>   .apply(RestIO.write().withUrl("http://localhost:8181/rest/";))
>
> So all elements will use the same URL. Now, imagine, the URL of the REST
> service
> depends of the element. There's no easy way to achieve that today: the
> only way
> is to change the IO to process a PCollection where
> RestRequest POJO
> contains the message payload and the URL. That's OK to use such declarative
> approach, it's just a bit painful to change the POJO each time we need to
> deal
> an additional data (let's say encoding, authentication, etc).
>

However, we already solved this particular use case for FileIO - the user
can configure the file-based sink (TextIO, AvroIO, etc.) with a lambda that
maps the record to the destination where the files are written. Can't we solve
things similarly for RestIO? It would look like:

 pipeline.apply(...) // PCollection<String> where String is JSON for now
     .apply(RestIO.write().withUrlFunction((String record) -> getUrl(record)));
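
A minimal sketch of how such a per-record URL function could be carried by the
(hypothetical) RestIO.write() configuration, in the spirit of the FileIO-style
lambda above; RestIO and withUrlFunction come from this thread and are not an
existing Beam IO:

  import org.apache.beam.sdk.transforms.SerializableFunction;

  // Hypothetical configuration object returned by RestIO.write().
  class RestWrite {
    private String url;                                  // fixed URL, as in withUrl(...)
    private SerializableFunction<String, String> urlFn;  // per-record URL, as in withUrlFunction(...)

    RestWrite withUrl(String url) {
      this.url = url;
      return this;
    }

    RestWrite withUrlFunction(SerializableFunction<String, String> urlFn) {
      this.urlFn = urlFn;
      return this;
    }

    // Inside the write DoFn the destination is resolved per element,
    // falling back to the fixed URL when no function is given.
    String resolveUrl(String record) {
      return urlFn != null ? urlFn.apply(record) : url;
    }
  }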


> In Camel, you can attach a header to a message. So basically, you have the
> data
> payload (the body of the message) corresponding to the element.
>
> That's why I was thinking about hint on PCollection first: it would allow
> us to
> implement EIPs (Enterprise Integration Patterns).
>
> Regards
> JB
>
> On 01/31/2018 09:31 AM, Ismaël Mejía wrote:
> > This is a subject we have already discussed in the past. It was part
> > on the discussion on ‘data locality’ for the runners on top of HDFS.
> > In that moment the argument for ‘hints’ was that a transform could
> > send hints to the runners so they properly allocate the readers
> > improving its execution. This is similar to the case of resource
> > allocation (GPU) mentioned by Reuven.
> > https://issues.apache.org/jira/browse/BEAM-2085
> >
> > What is a bit tricky about the design is the optional characteristic
> > of hints, we say that hints should not change the semantics of the
> > transforms (its output), but they can easily be abused to configure
> > how runners behave. We should limit hints only to the use case of
> > resource allocation, cases where the runner can benefit of the hint
> > info to pass it to the resource allocator, but runner specific
> > configuration must be part only of the runner options, or runners
> > should be smarter.
> >
> > This is to avoid potential misuse for portability and to limit extra
> > knobs, Also to avoid the risky case of ending up with some sort of
> > runtime ‘map-like’ configuration with hundreds of options that change
> > behavior like they exist in Hadoop and Spark, We should avoid adding
> > another level of this kind of variables now on top of Beam.
> >
> > On Wed, Jan 31, 2018 at 7:25 AM, Jean-Baptiste Onofré 
> wrote:
> >> Hi,
> >>
> >> yeah, it sounds good to me. I will create the Jira to track this and
> start a PoC
> >> on the Composite.
> >>
> >> Thanks !
> >> Regards
> >> JB
> >>
> >> On 01/30/2018 10:40 PM, Reuven Lax wrote:
> >>> Did we actually reach consensus here? :)
> >>>
> >>> On Tue, Jan 30, 2018 at 1:29 PM, Romain Manni-Bucau <
> rmannibu...@gmail.com
> >>> > wrote:
> >>>
> >>> Not sure how it fits in terms of API yet but +1 for the high level
> view.
> >>> Makes perfect sense.
> >>>
> >>> Le 30 janv. 2018 21:41, "Jean-Baptiste Onofré"  >>> > a écrit :
> >>>
> >>> Hi Robert,
> >>>
> >>> Good point and idea for the Composite transform. It would
> apply nicely
> >>> on all transforms based on composite.
> >>>
> >>> I also agree that the hint is more on the transform than the
> PCollection
> >>> itself.
> >>>
> >>> Thanks !
> >>> Regards
> >>> JB
> >>>
> >>> On 30/01/2018 21:26, Robert Bradshaw wrote:
> >>>
> >>> Many hints make 

IO plans?

2018-01-31 Thread Romain Manni-Bucau
Hi guys,

is there a plan for future IOs and some tracking somewhere?

I particularly wonder if there are plans for an HTTP IO and common server IOs
like SFTP, SSH, etc...

Romain Manni-Bucau
@rmannibucau | Blog | Old Blog | Github | LinkedIn


Re: [DISCUSSION] Runner agnostic metrics extractor?

2018-01-31 Thread Etienne Chauchot

Hi all,

Just to let you know that I have just submitted the PR [1]:

This PR adds the MetricsPusher discussed in this [2] document, scenario 3.b.
It merges and pushes Beam metrics at a configurable (via
PipelineOptions) frequency to a configurable sink. By default the sink
is a DummySink, which is also useful for tests. There is also an HttpMetricsSink
available that pushes the JSON-serialized metrics results to an HTTP
backend. We could imagine in the future many Beam sinks writing to
Ganglia, Graphite, ... Please note that these are not IOs because they
need to be transparent to the user.
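
For readers who have not opened the PR, here is a rough sketch of the sink
contract described above; the names follow the description (MetricsSink,
DummySink) but the exact signatures in the PR may differ:

  import org.apache.beam.sdk.metrics.MetricQueryResults;

  // Called by the pusher at the configured frequency with the merged, attempted metrics.
  interface MetricsSink {
    void writeMetrics(MetricQueryResults metricQueryResults) throws Exception;
  }

  // No-op sink, handy as a default and in tests.
  class DummySink implements MetricsSink {
    private MetricQueryResults lastResults; // kept so tests can assert what was pushed

    @Override
    public void writeMetrics(MetricQueryResults metricQueryResults) {
      this.lastResults = metricQueryResults;
    }
  }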


The pusher only supports attempted metrics for now.

This feature is hosted in the runner-core module (as discussed in the
doc) to be used by all the runners. I have wired it up with the Spark and
Flink runners (this part was quite painful :) )


If the PR is merged, I hope that runner experts will wire this feature
up in the other runners. Your help is needed, guys :)


Besides, there is nothing related to portability for now in this PR.

Best!

Etienne

[1] https://github.com/apache/beam/pull/4548

[2] https://s.apache.org/runner_independent_metrics_extraction


On 11/12/2017 at 17:33, Etienne Chauchot wrote:


Hi all,

I sketched a little doc [1] about this subject. It tries to sum up the
differences between the runners regarding metrics extraction and proposes
some possible designs for a runner-agnostic extraction of the metrics.


It is a two-page doc; can you please comment on it, and correct it if
needed?


Thanks

Etienne

[1] https://s.apache.org/runner_independent_metrics_extraction


On 27/11/2017 at 18:17, Ben Chambers wrote:

I think discussing a runner agnostic way of configuring how metrics are
extracted is a great idea -- thanks for bringing it up Etienne!

Using a thread that polls the pipeline result relies on the program that
created and submitted the pipeline continuing to run (eg., no machine
faults, network problems, etc.). For many applications, this isn't a good
model (a Streaming pipeline may run for weeks, a Batch pipeline may be
automatically run every hour, etc.).

Etienne's proposal of having something the runner pushes metrics to has
the benefit of running in the same cluster as the pipeline, thus having the
same reliability benefits.

As noted, it would require runners to ensure that metrics were pushed into
the extractor but from there it would allow a general configuration of how
metrics are extracted from the pipeline and exposed to some external
services.

Providing something that the runners could push metrics into and have them
automatically exported seems like it would have several benefits:
   1. It would provide a single way to configure how metrics are actually
exported.
   2. It would allow the runners to ensure it was reliably executed.
   3. It would allow the runner to report system metrics directly (eg., if a
runner wanted to report the watermark, it could push that in directly).

-- Ben

On Mon, Nov 27, 2017 at 9:06 AM Jean-Baptiste Onofré
wrote:


Hi all,

Etienne forgot to mention that we started a PoC about that.

What I started is to wrap the Pipeline creation to include a thread that
periodically polls the metrics in the pipeline result (it's what I proposed
when I compared with Karaf Decanter some time ago).
Then, this thread marshalls the collected metrics and sends them to a sink.
At the end, it means that the harvested metrics data will be stored in a
backend (for instance Elasticsearch).

The pro of this approach is that it doesn't require any change in the core:
it's up to the user to use the PipelineWithMetric wrapper.

The con is that the user needs to explicitly use the PipelineWithMetric
wrapper.

IMHO, it's good enough as the user can decide to poll metrics for some
pipelines and not for others.

Regards
JB
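
A minimal sketch of the polling approach described above, assuming the wrapper
simply schedules a background task against the PipelineResult; the MetricsPoller
and MetricsSink names here are illustrative, not the PoC's actual code:

  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;
  import org.apache.beam.sdk.PipelineResult;
  import org.apache.beam.sdk.metrics.MetricQueryResults;
  import org.apache.beam.sdk.metrics.MetricsFilter;

  // Periodically polls attempted metrics from the PipelineResult and hands them to a sink.
  class MetricsPoller {
    interface MetricsSink {
      void write(MetricQueryResults results);
    }

    static ScheduledExecutorService start(PipelineResult result, MetricsSink sink, long periodSeconds) {
      ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
      executor.scheduleAtFixedRate(
          () -> {
            // Query all metrics; note that not every runner exposes them before the pipeline finishes.
            MetricQueryResults metrics = result.metrics().queryMetrics(MetricsFilter.builder().build());
            sink.write(metrics);
          },
          0, periodSeconds, TimeUnit.SECONDS);
      return executor;
    }
  }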

On 11/27/2017 04:56 PM, Etienne Chauchot wrote:

Hi all,

I came by this ticket https://issues.apache.org/jira/browse/BEAM-2456. I know
that the metrics subject has already been discussed a lot, but I would like to
revive the discussion.

The aim in this ticket is to avoid relying on the runner to provide the metrics
because they don't have all the same capabilities towards metrics. The idea in
the ticket is to still use the Beam metrics API (and not others like Codahale, as
has been discussed some time ago) and provide a way to extract the metrics with
a polling thread that would be forked by a PipelineWithMetrics (so, almost
invisible to the end user) and then to push to a sink (such as an HTTP REST sink
for example or a Graphite sink or anything else...). Nevertheless, a polling
thread might not work for all the runners because some might not make the
metrics available before the end of the pipeline. Also, forking a thread would
be a bit unconventional, so it could be provided as a Beam SDK extension.
Another way, to avoid polling, would be to push metrics values to a sink when
they are updated, but I don't know if it is feasible in a

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-31 Thread Jean-Baptiste Onofré
Hi Reuven,

it's also what I did in JdbcIO for the statement or column mapper.

That's fair enough.
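
For readers who have not used it, this is roughly how those JdbcIO lambdas are
configured (based on the 2.x JdbcIO API; the query, statement, and table names
below are made up for illustration):

  import java.sql.PreparedStatement;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.coders.KvCoder;
  import org.apache.beam.sdk.coders.StringUtf8Coder;
  import org.apache.beam.sdk.coders.VarIntCoder;
  import org.apache.beam.sdk.io.jdbc.JdbcIO;
  import org.apache.beam.sdk.values.KV;

  public class JdbcIOExample {
    static void build(Pipeline pipeline, JdbcIO.DataSourceConfiguration config) {
      pipeline
          // Read: a RowMapper lambda turns each ResultSet row into an element.
          .apply(
              JdbcIO.<KV<Integer, String>>read()
                  .withDataSourceConfiguration(config)
                  .withQuery("SELECT id, name FROM users")
                  .withRowMapper(rs -> KV.of(rs.getInt("id"), rs.getString("name")))
                  .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of())))
          // Write: a PreparedStatementSetter lambda maps each element onto the statement.
          .apply(
              JdbcIO.<KV<Integer, String>>write()
                  .withDataSourceConfiguration(config)
                  .withStatement("INSERT INTO users_copy (id, name) VALUES (?, ?)")
                  .withPreparedStatementSetter(
                      (KV<Integer, String> element, PreparedStatement stmt) -> {
                        stmt.setInt(1, element.getKey());
                        stmt.setString(2, element.getValue());
                      }));
    }
  }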

Regards
JB

On 01/31/2018 01:35 PM, Reuven Lax wrote:
> 
> 
> On Wed, Jan 31, 2018 at 1:34 AM, Jean-Baptiste Onofré  > wrote:
> 
> Hi Ismaël,
> 
> I agree that hint should not change the output of PTransforms.
> 
> However, let me illustrate why I think hint could be interesting:
> 
> - I agree with what you are saying about the runners: they should be 
> smart.
> However, to be smart enough, the runner could use some statements 
> provided by
> pipeline designer. Let's take the Spark runner. In first version, we 
> didn't do
> any caching of RDDs. Now, just for streaming, if a PCollection is read 
> more than
> once, I added RDDs caching. That's systematic, and the user doesn't have 
> the
> choice. When you write the same process using Spark directly, the user has
> control of the way the RDDs are cached, and it can depend of the use 
> case. So, I
> think users will be happy to give some hint to the runner at some points 
> of the
> pipeline.
> 
> - Hint will open new features in the IOs. Let me take a concrete example. 
> On a
> local branch, I created a RestIO. The RestIO Write PTransform call a 
> remote REST
> service for each element in the PCollection. It's configured that way:
> 
>   pipeline.apply(...) // PCollection where String is JSON for now
>       .apply(RestIO.write().withUrl("http://localhost:8181/rest/";))
> 
> So all elements will use the same URL. Now, imagine, the URL of the REST 
> service
> depends of the element. There's no easy way to achieve that today: the 
> only way
> is to change the IO to process a PCollection where 
> RestRequest POJO
> contains the message payload and the URL. That's OK to use such 
> declarative
> approach, it's just a bit painful to change the POJO each time we need to 
> deal
> an additional data (let's say encoding, authentication, etc).
> 
> 
> However we already solved this particular use case  for FileIO - the user can
> configure the fileIO sink (TextIO, AvroIO, etc.) with a lambda that maps the
> record to the destination to put the files. Can't we solve things similarly 
> for
> RestIO? It would look like.
> 
>  pipeline.apply(...) // PCollection where String is JSON for now
>       .apply(RestIO.write().withUrlFunction((String record) -> 
> {getUrl(record)})
> 
> 
> In Camel, you can attach a header to a message. So basically, you have 
> the data
> payload (the body of the message) corresponding to the element.
> 
> That's why I was thinking about hint on PCollection first: it would allow 
> us to
> implement EIPs (Enterprise Integration Patterns).
> 
> Regards
> JB
> 
> On 01/31/2018 09:31 AM, Ismaël Mejía wrote:
> > This is a subject we have already discussed in the past. It was part
> > on the discussion on ‘data locality’ for the runners on top of HDFS.
> > In that moment the argument for ‘hints’ was that a transform could
> > send hints to the runners so they properly allocate the readers
> > improving its execution. This is similar to the case of resource
> > allocation (GPU) mentioned by Reuven.
> > https://issues.apache.org/jira/browse/BEAM-2085
> 
> >
> > What is a bit tricky about the design is the optional characteristic
> > of hints, we say that hints should not change the semantics of the
> > transforms (its output), but they can easily be abused to configure
> > how runners behave. We should limit hints only to the use case of
> > resource allocation, cases where the runner can benefit of the hint
> > info to pass it to the resource allocator, but runner specific
> > configuration must be part only of the runner options, or runners
> > should be smarter.
> >
> > This is to avoid potential misuse for portability and to limit extra
> > knobs, Also to avoid the risky case of ending up with some sort of
> > runtime ‘map-like’ configuration with hundreds of options that change
> > behavior like they exist in Hadoop and Spark, We should avoid adding
> > another level of this kind of variables now on top of Beam.
> >
> > On Wed, Jan 31, 2018 at 7:25 AM, Jean-Baptiste Onofré  > wrote:
> >> Hi,
> >>
> >> yeah, it sounds good to me. I will create the Jira to track this and
> start a PoC
> >> on the Composite.
> >>
> >> Thanks !
> >> Regards
> >> JB
> >>
> >> On 01/30/2018 10:40 PM, Reuven Lax wrote:
> >>> Did we actually reach consensus here? :)
> >>>
> >>> On Tue, Jan 30, 2018 at 1:29 PM, Romain Manni-Bucau
> mailto:rmannibu...@gmail.com>
> >>> >> wrote:
> >>>
> >>>     Not sure how it fits 

Re: IO plans?

2018-01-31 Thread Jean-Baptiste Onofré
Hi Romain,

I have some IOs locally and some ideas:

- ExecIO (it has been proposed as a PR but declined)
- ConsoleIO (generic)
- SocketIO
- RestIO
- MinaIO

I also created the Jira issues for the other IOs.

Regards
JB

On 01/31/2018 01:57 PM, Romain Manni-Bucau wrote:
> Hi guys,
> 
> is there a plan for future IO and some tracking somewhere?
> 
> I particularly wonder if there are plans for a HTTP IO and common server IO 
> like
> SFTP, SSH, etc...
> 
> Romain Manni-Bucau
> @rmannibucau  |  Blog
>  | Old Blog
>  | Github  |
> LinkedIn 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: IO plans?

2018-01-31 Thread Jean-Baptiste Onofré
By the way, short term: ParquetIO, RabbitMqIO are coming (PRs already open).

Regards
JB

On 01/31/2018 02:41 PM, Jean-Baptiste Onofré wrote:
> Hi Romain,
> 
> I have some IOs locally and some idea:
> 
> - ExecIO (it has been proposed as PR but declined)
> - ConsoleIO (generic)
> - SocketIO
> - RestIO
> - MinaIO
> 
> I also created the other IOs Jira.
> 
> Regards
> JB
> 
> On 01/31/2018 01:57 PM, Romain Manni-Bucau wrote:
>> Hi guys,
>>
>> is there a plan for future IO and some tracking somewhere?
>>
>> I particularly wonder if there are plans for a HTTP IO and common server IO 
>> like
>> SFTP, SSH, etc...
>>
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github  
>> |
>> LinkedIn 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: IO plans?

2018-01-31 Thread Romain Manni-Bucau
Thanks JB, this is great news since these are highly used IOs in the industry
and eagerly awaited now.


Romain Manni-Bucau
@rmannibucau | Blog | Old Blog | Github | LinkedIn

2018-01-31 14:42 GMT+01:00 Jean-Baptiste Onofré :

> By the way, short term: ParquetIO, RabbitMqIO are coming (PRs already
> open).
>
> Regards
> JB
>
> On 01/31/2018 02:41 PM, Jean-Baptiste Onofré wrote:
> > Hi Romain,
> >
> > I have some IOs locally and some idea:
> >
> > - ExecIO (it has been proposed as PR but declined)
> > - ConsoleIO (generic)
> > - SocketIO
> > - RestIO
> > - MinaIO
> >
> > I also created the other IOs Jira.
> >
> > Regards
> > JB
> >
> > On 01/31/2018 01:57 PM, Romain Manni-Bucau wrote:
> >> Hi guys,
> >>
> >> is there a plan for future IO and some tracking somewhere?
> >>
> >> I particularly wonder if there are plans for a HTTP IO and common
> server IO like
> >> SFTP, SSH, etc...
> >>
> >> Romain Manni-Bucau
> >> @rmannibucau  |  Blog
> >>  | Old Blog
> >>  | Github  rmannibucau> |
> >> LinkedIn 
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Romain Manni-Bucau
+1 (non-binding). Upgraded to Spark 2 in several test suites and
integrations, and it works very well. Good job, guys!


Romain Manni-Bucau
@rmannibucau | Blog | Old Blog | Github | LinkedIn

2018-01-30 12:16 GMT+01:00 Colm O hEigeartaigh :

>
> +1 (non-binding), tested digests, signatures, built the source repo + tag.
>
> Colm.
>
> On Tue, Jan 30, 2018 at 8:04 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi everyone,
>>
>> Please review and vote on the release candidate #1 for the version 2.3.0,
>> as
>> follows:
>>
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>>
>> The complete staging area is available for your review, which includes:
>> * JIRA release notes [1],
>> * the official Apache source release to be deployed to dist.apache.org
>> [2],
>> which is signed with the key with fingerprint C8282E76 [3],
>> * all artifacts to be deployed to the Maven Central Repository [4],
>> * source code tag "v2.3.0-RC1" [5],
>> * website pull request listing the release and publishing the API
>> reference
>> manual [6].
>> * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
>> * Python artifacts are deployed along with the source release to the
>> dist.apache.org [2].
>>
>> The vote will be open for at least 72 hours. It is adopted by majority
>> approval,
>> with at least 3 PMC affirmative votes.
>>
>> Thanks,
>> JB
>>
>> [1]
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?proje
>> ctId=12319527&version=12341608
>> [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [4] https://repository.apache.org/content/repositories/orgapache
>> beam-1026/
>> [5] https://github.com/apache/beam/tree/v2.3.0-RC1
>> [6] https://github.com/apache/beam-site/pull/381
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>
>
> --
> Colm O hEigeartaigh
>
> Talend Community Coder
> http://coders.talend.com
>


Re: IO plans?

2018-01-31 Thread Jean-Baptiste Onofré
Agree !

That's why I would like to move forward for 2.4.0 on those ones.

Regards
JB

On 01/31/2018 02:44 PM, Romain Manni-Bucau wrote:
> Thanks JB, this is great news since they are highly used IO in the industry 
> and
> really awaited now.
> 
> 
> Romain Manni-Bucau
> @rmannibucau  |  Blog
>  | Old Blog
>  | Github  |
> LinkedIn 
> 
> 2018-01-31 14:42 GMT+01:00 Jean-Baptiste Onofré  >:
> 
> By the way, short term: ParquetIO, RabbitMqIO are coming (PRs already 
> open).
> 
> Regards
> JB
> 
> On 01/31/2018 02:41 PM, Jean-Baptiste Onofré wrote:
> > Hi Romain,
> >
> > I have some IOs locally and some idea:
> >
> > - ExecIO (it has been proposed as PR but declined)
> > - ConsoleIO (generic)
> > - SocketIO
> > - RestIO
> > - MinaIO
> >
> > I also created the other IOs Jira.
> >
> > Regards
> > JB
> >
> > On 01/31/2018 01:57 PM, Romain Manni-Bucau wrote:
> >> Hi guys,
> >>
> >> is there a plan for future IO and some tracking somewhere?
> >>
> >> I particularly wonder if there are plans for a HTTP IO and common 
> server
> IO like
> >> SFTP, SSH, etc...
> >>
> >> Romain Manni-Bucau
> >> @rmannibucau  > |  Blog
> >>  > |
> Old Blog
> >> >
> | Github  > |
> >> LinkedIn  >
> >
> 
> --
> Jean-Baptiste Onofré
> jbono...@apache.org 
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Jean-Baptiste Onofré
+1 (binding)

Casting my own +1 ;)

Regards
JB

On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
> Hi everyone,
> 
> Please review and vote on the release candidate #1 for the version 2.3.0, as
> follows:
> 
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
> 
> 
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org [2],
> which is signed with the key with fingerprint C8282E76 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.3.0-RC1" [5],
> * website pull request listing the release and publishing the API reference
> manual [6].
> * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
> 
> The vote will be open for at least 72 hours. It is adopted by majority 
> approval,
> with at least 3 PMC affirmative votes.
> 
> Thanks,
> JB
> 
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341608
> [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1026/
> [5] https://github.com/apache/beam/tree/v2.3.0-RC1
> [6] https://github.com/apache/beam-site/pull/381
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Kenneth Knowles
I've cloned the release validation spreadsheet:

https://s.apache.org/beam-2.3.0-release-validation

If you plan to perform a manual validation task, please sign up so multiple
people don't waste effort.

Alan & JB, as far as your pairing up to automate more, anything manual on
this sheet should be considered.

Kenn

On Wed, Jan 31, 2018 at 5:59 AM, Jean-Baptiste Onofré 
wrote:

> +1 (binding)
>
> Casting my own +1 ;)
>
> Regards
> JB
>
> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
> > Hi everyone,
> >
> > Please review and vote on the release candidate #1 for the version
> 2.3.0, as
> > follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to dist.apache.org
> [2],
> > which is signed with the key with fingerprint C8282E76 [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "v2.3.0-RC1" [5],
> > * website pull request listing the release and publishing the API
> reference
> > manual [6].
> > * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
> > * Python artifacts are deployed along with the source release to the
> > dist.apache.org [2].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> approval,
> > with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > JB
> >
> > [1]
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341608
> > [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > [4] https://repository.apache.org/content/repositories/orgapachebeam-1026/
> > [5] https://github.com/apache/beam/tree/v2.3.0-RC1
> > [6] https://github.com/apache/beam-site/pull/381
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Jean-Baptiste Onofré
Thanks Kenn,

I prepared the list of tasks I did. I will complete it once the release is out.

Regards
JB

On 01/31/2018 03:07 PM, Kenneth Knowles wrote:
> I've cloned the release validation spreadsheet:
> 
>     https://s.apache.org/beam-2.3.0-release-validation
> 
> If you plan to perform a manual validation task, please sign up so multiple
> people don't waste effort.
> 
> Alan & JB, as far as your pairing up to automate more, anything manual on this
> sheet should be considered.
> 
> Kenn
> 
> On Wed, Jan 31, 2018 at 5:59 AM, Jean-Baptiste Onofré  > wrote:
> 
> +1 (binding)
> 
> Casting my own +1 ;)
> 
> Regards
> JB
> 
> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
> > Hi everyone,
> >
> > Please review and vote on the release candidate #1 for the version 
> 2.3.0, as
> > follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to dist.apache.org
>  [2],
> > which is signed with the key with fingerprint C8282E76 [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "v2.3.0-RC1" [5],
> > * website pull request listing the release and publishing the API 
> reference
> > manual [6].
> > * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
> > * Python artifacts are deployed along with the source release to the
> > dist.apache.org  [2].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> approval,
> > with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > JB
> >
> > [1]
> >
> 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341608
> 
> 
> > [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
> 
> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> 
> > [4] 
> https://repository.apache.org/content/repositories/orgapachebeam-1026/
> 
> > [5] https://github.com/apache/beam/tree/v2.3.0-RC1
> 
> > [6] https://github.com/apache/beam-site/pull/381
> 
> >
> 
> --
> Jean-Baptiste Onofré
> jbono...@apache.org 
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] State of the project

2018-01-31 Thread Kenneth Knowles
On Wed, Jan 31, 2018 at 3:40 AM, Etienne Chauchot 
wrote:

> Thanks Kenn and Luke for your comments.
>
> WDYT about my proposition (below) to add methods to the runner API to
> enhance the coherence between the runners?
>
If I understand your point, I think I agree but maybe with even stronger
opinions. This is somewhat related to my comment on the Ben thread about
"runner == service". The methods in any programming language SDK should not
be considered "the runner API" any more. A Beam runner should be expected
to be a service hosting Beam's job management proto APIs. So there's
already more methods than run() but we don't have any runners :-)


> WDYT about my other proposition (below) of trying to avoid having
> ValidatesRunner tests that are specific to a runner like we have now?
>
Yes, I think ValidatesRunner tests should be independent of any runner. Is
this just a convenient use of the Java annotation when it probably should
just be a static call to PipelineOptions.setRunner? IMO that is just a bug.

If you consider the above point along with this, ValidatesRunner tests
should be launched as jobs against a runner service. This matches things
like SQL compliance tests.

A build starting from change to SDK X might build like this:

(build) SDK X
(build) DirectRunner X
(build) ValidatesRunner suite X
(test) SDK X (submit the NeedsRunner tests to DR X)
(test) DR X (submit the VR suite X)
(test) other runners, submit the VR suite X

The way this is organized into modules today is more history and
convenience. You could easily imagine new ways to express this build in
Maven or Gradle. Portability makes it clearer!

Kenn


Thanks,
>
> Etienne
>
> Le 26/01/2018 à 21:34, Kenneth Knowles a écrit :
>
> I also think that at a high level the success of Beam as a
> project/community and as a piece of software depends on having multiple
> viable runners with healthy set of users and contributors. The pieces that
> are missing to me:
>
> *User-focused comparison of runners (and IOs)*
> +1 to Jesse's. Automated capability tests don't really help this.
> Benchmarks will be part of the story but are worth very little on their
> own. Focusing on these is just choosing to measure things that are easy to
> measure instead of addressing what is important, which is in the end almost
> always qualitative.
>
> *Automated integration tests on clusters*
> We do need to know that runners and IOs "work" in a basic yes/no manner on
> every commit/release, beyond unit tests. I am not really willing to
> strongly claim to a potential user that something "works" without this
> level of automation.
>
> *More uniform operational experiences*
> Setting up your Spark/Flink/Apex deployment should be different. Launching
> a Beam pipeline on it should not be.
>
> *Portability: Any SDK on any runner*
> We have now one SDK on master and one SDK on a dev branch that both
> support portable execution somewhat. Unfortunately we have no major open
> source runner that supports portability*. "Java on any runner" is not
> compelling enough any more, if it ever was.
>
> 
>
> Reviews: I agree our response latency is too slow. I do not agree that our
> quality bar is too high; I think we should raise it *significantly*. Our
> codebase fails tests for long periods. Our tests need to be green enough
> that we are comfortable blocking merges *even for unrelated failures*. We
> should be able to cut a release any time, modulo known blocker-level bugs.
>
> Runner dev: I think Etienne's point about making it more uniform to add
> features to all runners actually is quite important, since the portability
> framework is a lot harder than "translate a Beam ParDo to XYZ's FlatMap"
> where they are both Java. And even the support code we've been building is
> not obvious to use and probably won't be for the foreseeable future. This
> fits well into the "Ben thread" on technical ideas so I'll comment there.
>
> Kenn
>
> *We do have a local batch-only portable runner in Python
>
> On Fri, Jan 26, 2018 at 10:09 AM, Lukasz Cwik  wrote:
>
>> Etienne, for the cross runner coherence, the portability framework is
>> attempting to create an API across all runners for job management and job
>> execution. A lot of work still needs to be done to define and implement
>> these APIs and migrate runners and SDKs to support it since the current set
>> of Java APIs are adhoc in usage and purpose. In my opinion, development
>> should really be focused to migrate runners and SDKs to use these APIs to
>> get developer coherence. Work is slowly progressing on integrating them
>> into the Java, Python, and Go SDKs and there are several JIRA issues in
>> this regard but involvement from more people could help.
>>
>> Some helpful pointers are:
>> https://s.apache.org/beam-runner-api
>> https://s.apache.org/beam-fn-api
>> https://issues.apache.org/jira/browse/BEAM-3515?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20portability
>>
>> On Fri, Jan 26, 2

drop scala....version from artifact ;)

2018-01-31 Thread Romain Manni-Bucau
Hi guys

since beam supports a single version of runners, why not drop the scala
version from the artifactId?

ATM upgrades are painful because you have to upgrade both the beam version and the runner
artifactIds.

wdyt?

Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn 


Re: drop scala....version from artifact ;)

2018-01-31 Thread Jean-Baptiste Onofré
Hi Romain,

AFAIR only Flink runner uses scala version in the artifactId.

+1 for me.

Regards
JB

On 01/31/2018 04:45 PM, Romain Manni-Bucau wrote:
> Hi guys
> 
> since beam supports a single version of runners why not dropping the scala
> version from the artifactId?
> 
> ATM upgrades are painful cause you upgrade beam version+ runner 
> artifactIds.
> 
> wdyt?
> 
> Romain Manni-Bucau
> @rmannibucau  |  Blog
>  | Old Blog
>  | Github  |
> LinkedIn 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: ***UNCHECKED*** Re: Samza Runner

2018-01-31 Thread xinyu liu
Thanks, Kenneth, for merging the Samza Beam runner to the feature branch! We
will work on the other items (docs, examples, capability matrix, ...) to get
it to master.

Thanks,
Xinyu

On Fri, Jan 26, 2018 at 9:28 AM, Kenneth Knowles  wrote:

> Regarding merging directly to master, I agree that the code itself is
> probably just as OK as our other runners were when they joined master. So
> we should watch the rest of the bits from https://beam.apache.org/
> contribute/feature-branches/ here is how I think things are:
>
>  [ ] Have at least 2 contributors interested in maintaining it, and 1
> committer interested in supporting it
>  [ ] Provide both end-user and developer-facing documentation
>  [x] Have at least a basic level of unit test coverage
>  [x] Run all existing applicable integration tests with other Beam
> components and create additional tests as appropriate
>  [~] Be able to handle a subset of the model that addresses a significant
> set of use cases, such as ‘traditional batch’ or ‘processing time
> streaming’.
>  [ ] Update the capability matrix with the current status
>  [ ] Add a webpage under documentation/runners
>
> So I would like to merge to the feature branch while we get docs together,
> identify interested contributors, and make sure of the integration with the
> build/test system, and maybe run through manual verification of the
> streaming game example at medium scale (anything bigger than local
> one-node).
>
> I also want to get it on a feature branch for this reason: we could
> continue to perfect it on the PR, but it is easier to have separate reviews
> for different issues identified on the big PR.
>
> Kenn
>
> On Thu, Jan 25, 2018 at 11:32 PM, Jean-Baptiste Onofré 
> wrote:
>
>> That's awesome ! Happy to see new runner.
>>
>> As the build is OK and the runner contains validation, why not simply
>> merge the
>> PR on master once 2.3.0 release branch is there ?
>>
>> It would give better visibility to this new feature and maybe attract
>> contribution in early stage.
>>
>> Regards
>> JB
>>
>> On 01/26/2018 05:37 AM, Kenneth Knowles wrote:
>> > Hi all,
>> >
>> > In case you haven't noticed or followed, there's a new runner in PR:
>> Samza!
>> >
>> > https://github.com/apache/beam/pull/4340
>> >
>> > It has been under review and revision for some time. In local mode it
>> passes a
>> > solid suite of ValidatesRunner tests (I don't have a Samza deployment
>> handy to
>> > test non-local).
>> >
>> > Given all this, I am ready to put it on a feature branch where it can
>> mature
>> > further, and we can build out our CI for it, etc, until we agree it is
>> ready for
>> > master.
>> >
>> > Kenn
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>


Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Chamikara Jayalath
Good point. There's always the chance of the step that performs the final rename
being retried. So we'll have to ignore this error at the sink level. We
don't necessarily have to do this at the FileSystem level though. I think
the proper behavior might be to raise an error for the rename at the
FileSystem level if the destination already exists (or source doesn't
exist) while ignoring that error (and possibly logging a warning) at the
sink level.

- Cham
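
For illustration, a minimal sketch of that split of responsibilities, using plain
java.nio.file as a stand-in for the Beam FileSystem API (the helper name and the use
of local-file calls are assumptions for the example, not actual Beam sink code):

import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Sketch only: the low-level move fails loudly, while the sink's finalize step
// decides whether the failure is benign (e.g. a retried bundle already moved the file).
class IdempotentRename {
  static void renameForFinalize(Path src, Path dst) throws IOException {
    try {
      Files.move(src, dst); // "FileSystem level": error if dst already exists or src is gone
    } catch (FileAlreadyExistsException | NoSuchFileException e) {
      // "Sink level": a retried finalize may already have moved this file; warn and continue.
      System.err.println("Ignoring rename failure " + src + " -> " + dst + ": " + e);
    }
  }
}

The same shape would apply on top of Beam's FileSystems: raise at the FileSystem layer,
catch and log at the sink layer.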

On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax  wrote:

> I think the idea was to ignore "already exists" errors. The reason being
> that any step in Beam can be executed multiple times, including the rename
> step. If the rename step gets run twice, the second run should succeed
> vacuously.
>
>
> On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri  wrote:
>
>> Hi,
>> I've been working on HDFS code for the Python SDK and I've noticed some
>> behaviors which are surprising. I wanted to know if these behaviors are
>> known and intended.
>>
>> 1. When renaming files during finalize_write, rename errors are ignored.
>> For example, if I run wordcount twice using HDFS code I get a warning the
>> second time because the file already exists:
>>
>> WARNING:root:Rename not successful:
>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>> -> hdfs://counts2-0-of-1, libhdfs error in renaming
>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>> to hdfs://counts2-0-of-1 with exceptions Unable to rename
>> '/beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2'
>> to '/counts2-0-of-1'.
>>
>> For GCS and local files there are no rename errors (in this case), since
>> the rename operation silently overwrites existing destination files.
>> However, blindly ignoring these errors might make the pipeline report
>> success even though output files are missing.
>>
>> 2. Output files (--output) overwrite existing files.
>>
>> 3. The Python SDK doesn't use Filesystems.copy(). The Java SDK doesn't
>> use Filesystem.Rename().
>>
>> Thanks,
>> - Udi
>>
>
>


Re: Schema-Aware PCollections revisited

2018-01-31 Thread Reuven Lax
As to the question of how a schema should be specified, I want to support
several common schema formats. So if a user has a Json schema, or an Avro
schema, or a Calcite schema, etc. there should be adapters that allow
setting a schema from any of them. I don't think we should prefer one over
the other. While Romain is right that many people know Json, I think far
fewer people know Json schemas.

Agree, schemas should not be enforced (for one thing, that wouldn't be
backwards compatible!). I think for the initial prototype I will probably
use a special coder to represent the schema (with setSchema an option on
the coder), largely because it doesn't require modifying PCollection.
However I think longer term a schema should be an optional piece of
metadata on the PCollection object. Similar to the previous discussion
about "hints," I think this can be set on the producing PTransform, and a
SetSchema PTransform will allow attaching a schema to any PCollection (i.e.
pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I think
schema should be similar to hints, it's just another piece of metadata on
the PCollection (though something interpreted by the model, where hints are
interpreted by the runner)

Reuven
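
To make the shape concrete, here is a rough sketch of what such a SetSchema transform
could look like. Everything below (Schema, SchemaCoder, SetSchema) is hypothetical and
only follows the "special coder" prototype idea described above; it is not existing Beam API.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.Serializable;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Placeholder for whatever schema representation the prototype settles on (Avro, Json Schema, ...).
interface Schema extends Serializable {}

// Hypothetical coder that carries a schema next to the real element coder.
class SchemaCoder<T> extends CustomCoder<T> {
  private final Coder<T> delegate;
  private final Schema schema;

  SchemaCoder(Coder<T> delegate, Schema schema) {
    this.delegate = delegate;
    this.schema = schema;
  }

  Schema getSchema() { return schema; }

  @Override public void encode(T value, OutputStream out) throws IOException {
    delegate.encode(value, out);
  }

  @Override public T decode(InputStream in) throws IOException {
    return delegate.decode(in);
  }
}

// Hypothetical SetSchema transform: attaches a schema to any PCollection without changing elements.
class SetSchema<T> extends PTransform<PCollection<T>, PCollection<T>> {
  private final Schema schema;

  private SetSchema(Schema schema) { this.schema = schema; }

  public static <T> SetSchema<T> of(Schema schema) { return new SetSchema<>(schema); }

  @Override public PCollection<T> expand(PCollection<T> input) {
    PCollection<T> output = input.apply("PassThrough", ParDo.of(new DoFn<T, T>() {
      @ProcessElement public void process(ProcessContext c) { c.output(c.element()); }
    }));
    // The schema travels with the output as coder metadata until it becomes a first-class field.
    return output.setCoder(new SchemaCoder<>(input.getCoder(), schema));
  }
}

// Usage as described above: pc.apply(SetSchema.of(schema));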

On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré 
wrote:

> Hi,
>
> I think we should avoid to mix two things in the discussion (and so the
> document):
>
> 1. The element of the collection and the schema itself are two different
> things.
> By essence, Beam should not enforce any schema. That's why I think it's a
> good
> idea to set the schema optionally on the PCollection
> (pcollection.setSchema()).
>
> 2. From point 1 comes two questions: how do we represent a schema ? How
> can we
> leverage the schema to simplify the serialization of the element in the
> PCollection and query ? These two questions are not directly related.
>
>  2.1 How do we represent the schema
> Json Schema is a very interesting idea. It could be an abstract and other
> providers, like Avro, can be bind on it. It's part of the json processing
> spec
> (javax).
>
>  2.2. How do we leverage the schema for query and serialization
> Also in the spec, json pointer is interesting for the querying. Regarding
> the
> serialization, jackson or other data binder can be used.
>
> It's still rough ideas in my mind, but I like Romain's idea about json-p
> usage.
>
> Once 2.3.0 release is out, I will start to update the document with those
> ideas,
> and PoC.
>
> Thanks !
> Regards
> JB
>
> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
> >
> >
> > Le 30 janv. 2018 01:09, "Reuven Lax"  > > a écrit :
> >
> >
> >
> > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <
> rmannibu...@gmail.com
> > > wrote:
> >
> > Hi
> >
> > I have some questions on this: how hierarchic schemas would
> work? Seems
> > it is not really supported by the ecosystem (out of custom
> stuff) :(.
> > How would it integrate smoothly with other generic record types
> - N bridges?
> >
> >
> > Do you mean nested schemas? What do you mean here?
> >
> >
> > Yes, sorry - wrote the mail too late ;). Was hierarchic data and nested
> schemas.
> >
> >
> > Concretely I wonder if using json API couldnt be beneficial:
> json-p is a
> > nice generic abstraction with a built in querying mecanism
> (jsonpointer)
> > but no actual serialization (even if json and binary json are
> very
> > natural). The big advantage is to have a well known ecosystem -
> who
> > doesnt know json today? - that beam can reuse for free:
> JsonObject
> > (guess we dont want JsonValue abstraction) for the record type,
> > jsonschema standard for the schema, jsonpointer for the
> > delection/projection etc... It doesnt enforce the actual
> serialization
> > (json, smile, avro, ...) but provide an expressive and alread
> known API
> > so i see it as a big win-win for users (no need to learn a new
> API and
> > use N bridges in all ways) and beam (impls are here and API
> design
> > already thought).
> >
> >
> > I assume you're talking about the API for setting schemas, not using
> them.
> > Json has many downsides and I'm not sure it's true that everyone
> knows it;
> > there are also competing schema APIs, such as Avro etc.. However I
> think we
> > should give Json a fair evaluation before dismissing it.
> >
> >
> > It is a wider topic than schema. Actually schema are not the first
> citizen but a
> > generic data representation is. That is where json hits almost any other
> API.
> > Then, when it comes to schema, json has a standard for that so we are
> all good.
> >
> > Also json has a good indexing API compared to alternatives which are
> sometimes a
> > bit faster - for noop transforms - but are hardly usable or make the
> code not
> > that readable.
> >
> > Avro is a nice compe

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Romain Manni-Bucau
On 31 Jan 2018 at 20:16, "Reuven Lax"  wrote:

As to the question of how a schema should be specified, I want to support
several common schema formats. So if a user has a Json schema, or an Avro
schema, or a Calcite schema, etc. there should be adapters that allow
setting a schema from any of them. I don't think we should prefer one over
the other. While Romain is right that many people know Json, I think far
fewer people know Json schemas.


Agree, but schema would get an API for beam usage - I don't think there is a
standard we can use and we can't use any vendor-specific API in beam - so
not a big deal IMO/not a blocker.



Agree, schemas should not be enforced (for one thing, that wouldn't be
backwards compatible!). I think for the initial prototype I will probably
use a special coder to represent the schema (with setSchema an option on
the coder), largely because it doesn't require modifying PCollection.
However I think longer term a schema should be an optional piece of
metadata on the PCollection object. Similar to the previous discussion
about "hints," I think this can be set on the producing PTransform, and a
SetSchema PTransform will allow attaching a schema to any PCollection (i.e.
pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I think
schema should be similar to hints, it's just another piece of metadata on
the PCollection (though something interpreted by the model, where hints are
interpreted by the runner)


Schema should probably be contributable from the transform when mandatory -
thinking of avro io here - or a hint as a fallback when optional.
This sounds good to me and doesn't require another public API beyond hints.


Reuven

On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré 
wrote:

> Hi,
>
> I think we should avoid to mix two things in the discussion (and so the
> document):
>
> 1. The element of the collection and the schema itself are two different
> things.
> By essence, Beam should not enforce any schema. That's why I think it's a
> good
> idea to set the schema optionally on the PCollection
> (pcollection.setSchema()).
>
> 2. From point 1 comes two questions: how do we represent a schema ? How
> can we
> leverage the schema to simplify the serialization of the element in the
> PCollection and query ? These two questions are not directly related.
>
>  2.1 How do we represent the schema
> Json Schema is a very interesting idea. It could be an abstract and other
> providers, like Avro, can be bind on it. It's part of the json processing
> spec
> (javax).
>
>  2.2. How do we leverage the schema for query and serialization
> Also in the spec, json pointer is interesting for the querying. Regarding
> the
> serialization, jackson or other data binder can be used.
>
> It's still rough ideas in my mind, but I like Romain's idea about json-p
> usage.
>
> Once 2.3.0 release is out, I will start to update the document with those
> ideas,
> and PoC.
>
> Thanks !
> Regards
> JB
>
> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
> >
> >
> > Le 30 janv. 2018 01:09, "Reuven Lax"  > > a écrit :
> >
> >
> >
> > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <
> rmannibu...@gmail.com
> > > wrote:
> >
> > Hi
> >
> > I have some questions on this: how hierarchic schemas would
> work? Seems
> > it is not really supported by the ecosystem (out of custom
> stuff) :(.
> > How would it integrate smoothly with other generic record types
> - N bridges?
> >
> >
> > Do you mean nested schemas? What do you mean here?
> >
> >
> > Yes, sorry - wrote the mail too late ;). Was hierarchic data and nested
> schemas.
> >
> >
> > Concretely I wonder if using json API couldnt be beneficial:
> json-p is a
> > nice generic abstraction with a built in querying mecanism
> (jsonpointer)
> > but no actual serialization (even if json and binary json are
> very
> > natural). The big advantage is to have a well known ecosystem -
> who
> > doesnt know json today? - that beam can reuse for free:
> JsonObject
> > (guess we dont want JsonValue abstraction) for the record type,
> > jsonschema standard for the schema, jsonpointer for the
> > delection/projection etc... It doesnt enforce the actual
> serialization
> > (json, smile, avro, ...) but provide an expressive and alread
> known API
> > so i see it as a big win-win for users (no need to learn a new
> API and
> > use N bridges in all ways) and beam (impls are here and API
> design
> > already thought).
> >
> >
> > I assume you're talking about the API for setting schemas, not using
> them.
> > Json has many downsides and I'm not sure it's true that everyone
> knows it;
> > there are also competing schema APIs, such as Avro etc.. However I
> think we
> > should give Json a fair evaluation before dismissing it.
> >
> >
> > It is a wider top

Re: How to get split location from HadoopInputFormatBoundedSource

2018-01-31 Thread Lukasz Cwik
What about using the split() method:
https://github.com/apache/beam/blob/db60d37266c2ad6c4e2b5681221cc055d5c02eab/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L435

Note, it's probably a good idea to read the javadoc for BoundedSource:
https://github.com/apache/beam/blob/db60d37266c2ad6c4e2b5681221cc055d5c02eab/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java#L32
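
For illustration, a minimal sketch of the split() route, assuming the engine already
holds a reference to the BoundedSource (how that reference is obtained is outside the
sketch); the method name splitForScheduling is ours:

import java.util.List;
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

class SplitExample {
  // Splits a source into per-bundle sub-sources; for HadoopInputFormatIO each
  // sub-source typically wraps a single Hadoop InputSplit.
  static <T> List<? extends BoundedSource<T>> splitForScheduling(
      BoundedSource<T> source, long desiredBundleSizeBytes) throws Exception {
    PipelineOptions options = PipelineOptionsFactory.create();
    List<? extends BoundedSource<T>> bundles = source.split(desiredBundleSizeBytes, options);
    for (BoundedSource<T> bundle : bundles) {
      // Size estimates are available without reflection and can feed the scheduler.
      System.out.println("bundle of ~" + bundle.getEstimatedSizeBytes(options) + " bytes");
    }
    return bundles;
  }
}

Note that the Hadoop InputSplit (and thus its location hints) is still held privately by
each sub-source, so location-aware scheduling may still need an accessor on the source
itself, as the question below points out.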

On Tue, Jan 30, 2018 at 7:07 PM, JangHo Seo  wrote:

> Hello Beam dev,
>
> I'm working on a distributed data processing engine that supports Beam
> dataflow programs,
> and I'm investigating how to take split location into consideration when
> scheduling 'read' tasks for an HDFS source.
>
> Is there any way to get split location information from
> HadoopInputFormatBoundedSource,
> without using Java reflection? Since the 'inputSplit' field in the
> 'HadoopInputFormatBoundedSource' class is
> a private one, I can see no way to access Hadoop split information other
> than using reflection.
>
> Thanks.
>


Re: Schema-Aware PCollections revisited

2018-01-31 Thread Reuven Lax
I don't think "hint" is the right API, as schema is not a hint (it has
semantic meaning). However I think the API for schema should look similar
to any "hint" API.

On Wed, Jan 31, 2018 at 11:40 AM, Romain Manni-Bucau 
wrote:

>
>
> Le 31 janv. 2018 20:16, "Reuven Lax"  a écrit :
>
> As to the question of how a schema should be specified, I want to support
> several common schema formats. So if a user has a Json schema, or an Avro
> schema, or a Calcite schema, etc. there should be adapters that allow
> setting a schema from any of them. I don't think we should prefer one over
> the other. While Romain is right that many people know Json, I think far
> fewer people know Json schemas.
>
>
> Agree but schema would get an API for beam usage - dont think there is a
> standard we can use and we cant use any vendor specific api in beam - so
> not a big deal IMO/not a blocker.
>
>
>
> Agree, schemas should not be enforced (for one thing, that wouldn't be
> backwards compatible!). I think for the initial prototype I will probably
> use a special coder to represent the schema (with setSchema an option on
> the coder), largely because it doesn't require modifying PCollection.
> However I think longer term a schema should be an optional piece of
> metadata on the PCollection object. Similar to the previous discussion
> about "hints," I think this can be set on the producing PTransform, and a
> SetSchema PTransform will allow attaching a schema to any PCollection (i.e.
> pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I
> think schema should be similar to hints, it's just another piece of
> metadata on the PCollection (though something interpreted by the model,
> where hints are interpreted by the runner)
>
>
> Schema should probably be contributable from the transform when mandatory
> - thinking of avro io here - or an hint as fallback when optional probably.
> This sounds good to me and doesnt require another public API than hint.
>
>
> Reuven
>
> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi,
>>
>> I think we should avoid to mix two things in the discussion (and so the
>> document):
>>
>> 1. The element of the collection and the schema itself are two different
>> things.
>> By essence, Beam should not enforce any schema. That's why I think it's a
>> good
>> idea to set the schema optionally on the PCollection
>> (pcollection.setSchema()).
>>
>> 2. From point 1 comes two questions: how do we represent a schema ? How
>> can we
>> leverage the schema to simplify the serialization of the element in the
>> PCollection and query ? These two questions are not directly related.
>>
>>  2.1 How do we represent the schema
>> Json Schema is a very interesting idea. It could be an abstract and other
>> providers, like Avro, can be bind on it. It's part of the json processing
>> spec
>> (javax).
>>
>>  2.2. How do we leverage the schema for query and serialization
>> Also in the spec, json pointer is interesting for the querying. Regarding
>> the
>> serialization, jackson or other data binder can be used.
>>
>> It's still rough ideas in my mind, but I like Romain's idea about json-p
>> usage.
>>
>> Once 2.3.0 release is out, I will start to update the document with those
>> ideas,
>> and PoC.
>>
>> Thanks !
>> Regards
>> JB
>>
>> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>> >
>> >
>> > Le 30 janv. 2018 01:09, "Reuven Lax" > > > a écrit :
>> >
>> >
>> >
>> > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <
>> rmannibu...@gmail.com
>> > > wrote:
>> >
>> > Hi
>> >
>> > I have some questions on this: how hierarchic schemas would
>> work? Seems
>> > it is not really supported by the ecosystem (out of custom
>> stuff) :(.
>> > How would it integrate smoothly with other generic record types
>> - N bridges?
>> >
>> >
>> > Do you mean nested schemas? What do you mean here?
>> >
>> >
>> > Yes, sorry - wrote the mail too late ;). Was hierarchic data and nested
>> schemas.
>> >
>> >
>> > Concretely I wonder if using json API couldnt be beneficial:
>> json-p is a
>> > nice generic abstraction with a built in querying mecanism
>> (jsonpointer)
>> > but no actual serialization (even if json and binary json are
>> very
>> > natural). The big advantage is to have a well known ecosystem -
>> who
>> > doesnt know json today? - that beam can reuse for free:
>> JsonObject
>> > (guess we dont want JsonValue abstraction) for the record type,
>> > jsonschema standard for the schema, jsonpointer for the
>> > delection/projection etc... It doesnt enforce the actual
>> serialization
>> > (json, smile, avro, ...) but provide an expressive and alread
>> known API
>> > so i see it as a big win-win for users (no need to learn a new
>> API and
>> > use N bridges in all ways) and beam (impls are here and API
>> design
>

Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-31 Thread Lukasz Cwik
Because people write code like:
myMethod(InputStream in) {
  InputStream child = new InputStream(in);
  child.close();
}

InputStream in = new FileInputStream(... path ...);
myMethod(in);
myMethod(in);

An exception will be thrown when the second myMethod call occurs.

Unfortunately, not everyone wraps their calls to a coder with an
UnownedInputStream or a filter input stream that drops close() calls, which
is why this is a problem; in the few places where it is done, it is there to
prevent bugs from creeping in.
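
For reference, a minimal sketch of the close-dropping wrapper mentioned below
(the class name is ours; JAXBCoder does something along these lines):

import java.io.FilterInputStream;
import java.io.InputStream;

class CloseIgnoringInputStream extends FilterInputStream {
  CloseIgnoringInputStream(InputStream delegate) {
    super(delegate);
  }

  @Override
  public void close() {
    // Intentionally a no-op: the caller owns the underlying stream and closes it itself.
  }
}

// A coder handing a stream to a library that insists on closing it can do:
//   library.readFrom(new CloseIgnoringInputStream(in));
// while the caller still closes 'in' exactly once.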



On Tue, Jan 30, 2018 at 11:29 AM, Romain Manni-Bucau 
wrote:

> I get the issue but I don't get the last part. Concretely we can support
> any lib by just removing the exception in the close, no? What would be the
> issue? No additional wrapper, no lib integration issue.
>
>
> Romain Manni-Bucau
> @rmannibucau  |  Blog
>  | Old Blog
>  | Github
>  | LinkedIn
> 
>
> 2018-01-30 19:29 GMT+01:00 Lukasz Cwik :
>
>> Its common in the code base that input and output streams are passed
>> around and the caller is responsible for closing it, not the callee. The
>> UnownedInputStream is to guard against libraries that are poorly behaved
>> and assume they get ownership of the stream when it is given to them.
>>
>> In the code:
>> myMethod(InputStream in) {
>>   InputStream child = new InputStream(in);
>>   child.close();
>> }
>>
>> InputStream in = ...
>> myMethod(in);
>> myMethod(in);
>> When should "in" be closed?
>>
>> To get around this issue, create a filter input/output stream that
>> ignores close calls like on the JAXB coder:
>> https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/JAXBCoder.java#L181
>>
>> We can instead swap around this pattern so that the caller guards against
>> callees closing by wrapping with a filter input/output stream but this
>> costs an additional method call for each input/output stream call.
>>
>>
>> On Tue, Jan 30, 2018 at 10:04 AM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> Hi guys,
>>>
>>> All is in the subject ;)
>>>
>>> Rational is to support any I/O library and not fail when the close is
>>> encapsulated.
>>>
>>> Any blocker to swallow this close call?
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau  |  Blog
>>>  | Old Blog
>>>  | Github
>>>  | LinkedIn
>>> 
>>>
>>
>>
>


Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-31 Thread Romain Manni-Bucau
Hmm, here we are the ones owning the call since it is in a coder, no? Do we
assume people will badly implement coders? In this particular case we can
assume close() will be called by a framework I think.
What about swallowing one close() and failing on the second?


Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn 

2018-01-31 20:59 GMT+01:00 Lukasz Cwik :

> Because people write code like:
> myMethod(InputStream in) {
>   InputStream child = new InputStream(in);
>   child.close();
> }
>
> InputStream in = new FileInputStream(... path ...);
> myMethod(in);
> myMethod(in);
>
> An exception will be thrown when the second myMethod call occurs.
>
> Unfortunately not everyone wraps their calls to a coder with an
> UnownedInputStream or a filter input stream which drops close() calls is
> why its a problem and in the few places it is done it is used to prevent
> bugs from creeping in.
>
>
>
> On Tue, Jan 30, 2018 at 11:29 AM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>> I get the issue but I don't get the last part. Concretely we can support
>> any lib by just removing the exception in the close, no? What would be the
>> issue? No additional wrapper, no lib integration issue.
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github
>>  | LinkedIn
>> 
>>
>> 2018-01-30 19:29 GMT+01:00 Lukasz Cwik :
>>
>>> Its common in the code base that input and output streams are passed
>>> around and the caller is responsible for closing it, not the callee. The
>>> UnownedInputStream is to guard against libraries that are poorly behaved
>>> and assume they get ownership of the stream when it is given to them.
>>>
>>> In the code:
>>> myMethod(InputStream in) {
>>>   InputStream child = new InputStream(in);
>>>   child.close();
>>> }
>>>
>>> InputStream in = ...
>>> myMethod(in);
>>> myMethod(in);
>>> When should "in" be closed?
>>>
>>> To get around this issue, create a filter input/output stream that
>>> ignores close calls like on the JAXB coder:
>>> https://github.com/apache/beam/blob/master/sdks/java/io/xml/
>>> src/main/java/org/apache/beam/sdk/io/xml/JAXBCoder.java#L181
>>>
>>> We can instead swap around this pattern so that the caller guards
>>> against callees closing by wrapping with a filter input/output stream but
>>> this costs an additional method call for each input/output stream call.
>>>
>>>
>>> On Tue, Jan 30, 2018 at 10:04 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
 Hi guys,

 All is in the subject ;)

 Rational is to support any I/O library and not fail when the close is
 encapsulated.

 Any blocker to swallow this close call?

 Romain Manni-Bucau
 @rmannibucau  |  Blog
  | Old Blog
  | Github
  | LinkedIn
 

>>>
>>>
>>
>


Re: Schema-Aware PCollections revisited

2018-01-31 Thread Romain Manni-Bucau
Hmm, it is a hint semantically, or it is deducible from the transform. Doing
the union of both you cover all cases. Then how it is forwarded from the
transform to the runtime is in the runner API, not the user (pipeline) API, so
I'm not sure I see the case you reference where it carries semantic meaning. Can
you detail it please?


Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn 

2018-01-31 20:45 GMT+01:00 Reuven Lax :

> I don't think "hint" is the right API, as schema is not a hint (it has
> semantic meaning). However I think the API for schema should look similar
> to any "hint" API.
>
> On Wed, Jan 31, 2018 at 11:40 AM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>>
>>
>> Le 31 janv. 2018 20:16, "Reuven Lax"  a écrit :
>>
>> As to the question of how a schema should be specified, I want to support
>> several common schema formats. So if a user has a Json schema, or an Avro
>> schema, or a Calcite schema, etc. there should be adapters that allow
>> setting a schema from any of them. I don't think we should prefer one over
>> the other. While Romain is right that many people know Json, I think far
>> fewer people know Json schemas.
>>
>>
>> Agree but schema would get an API for beam usage - dont think there is a
>> standard we can use and we cant use any vendor specific api in beam - so
>> not a big deal IMO/not a blocker.
>>
>>
>>
>> Agree, schemas should not be enforced (for one thing, that wouldn't be
>> backwards compatible!). I think for the initial prototype I will probably
>> use a special coder to represent the schema (with setSchema an option on
>> the coder), largely because it doesn't require modifying PCollection.
>> However I think longer term a schema should be an optional piece of
>> metadata on the PCollection object. Similar to the previous discussion
>> about "hints," I think this can be set on the producing PTransform, and a
>> SetSchema PTransform will allow attaching a schema to any PCollection (i.e.
>> pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I
>> think schema should be similar to hints, it's just another piece of
>> metadata on the PCollection (though something interpreted by the model,
>> where hints are interpreted by the runner)
>>
>>
>> Schema should probably be contributable from the transform when mandatory
>> - thinking of avro io here - or an hint as fallback when optional probably.
>> This sounds good to me and doesnt require another public API than hint.
>>
>>
>> Reuven
>>
>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi,
>>>
>>> I think we should avoid to mix two things in the discussion (and so the
>>> document):
>>>
>>> 1. The element of the collection and the schema itself are two different
>>> things.
>>> By essence, Beam should not enforce any schema. That's why I think it's
>>> a good
>>> idea to set the schema optionally on the PCollection
>>> (pcollection.setSchema()).
>>>
>>> 2. From point 1 comes two questions: how do we represent a schema ? How
>>> can we
>>> leverage the schema to simplify the serialization of the element in the
>>> PCollection and query ? These two questions are not directly related.
>>>
>>>  2.1 How do we represent the schema
>>> Json Schema is a very interesting idea. It could be an abstract and other
>>> providers, like Avro, can be bind on it. It's part of the json
>>> processing spec
>>> (javax).
>>>
>>>  2.2. How do we leverage the schema for query and serialization
>>> Also in the spec, json pointer is interesting for the querying.
>>> Regarding the
>>> serialization, jackson or other data binder can be used.
>>>
>>> It's still rough ideas in my mind, but I like Romain's idea about json-p
>>> usage.
>>>
>>> Once 2.3.0 release is out, I will start to update the document with
>>> those ideas,
>>> and PoC.
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>> >
>>> >
>>> > Le 30 janv. 2018 01:09, "Reuven Lax" >> > > a écrit :
>>> >
>>> >
>>> >
>>> > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com
>>> > > wrote:
>>> >
>>> > Hi
>>> >
>>> > I have some questions on this: how hierarchic schemas would
>>> work? Seems
>>> > it is not really supported by the ecosystem (out of custom
>>> stuff) :(.
>>> > How would it integrate smoothly with other generic record
>>> types - N bridges?
>>> >
>>> >
>>> > Do you mean nested schemas? What do you mean here?
>>> >
>>> >
>>> > Yes, sorry - wrote the mail too late ;). Was hierarchic data and
>>> nested schemas.
>>> >
>>> >
>>> > Concretely I wonder if using json API couldnt be beneficial:
>>> json-p is a
>>> > nice generic abstraction with a built in querying mecan

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Jean-Baptiste Onofré

Hi Reuven,

Agree on being able to describe the schema with different formats. The good
point about json schemas is that they are described by a spec. My point
is also to avoid reinventing the wheel. Just an abstraction able to
use Avro, Json, Calcite, or custom schema descriptors would be great.


Using a coder to describe a schema sounds like a smart move to implement
quickly. However, it has to be clear in terms of documentation to avoid
"side effects". I still think PCollection.setSchema() is better: it
should be metadata (or a hint ;)) on the PCollection.


Regards
JB
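
As a side note, a minimal sketch of the json-p querying mentioned in the quoted
message below, using the standard javax.json API (assuming JSON-P 1.1, which adds
JsonPointer); this is just the spec API, not a proposed Beam schema API:

import java.io.StringReader;
import javax.json.Json;
import javax.json.JsonObject;
import javax.json.JsonPointer;
import javax.json.JsonValue;

class JsonPointerExample {
  public static void main(String[] args) {
    JsonObject record = Json.createReader(
        new StringReader("{\"user\": {\"name\": \"alice\", \"age\": 30}}")).readObject();

    // RFC 6901 JSON Pointer provides the querying/projection mechanism discussed in the thread.
    JsonPointer namePointer = Json.createPointer("/user/name");
    JsonValue name = namePointer.getValue(record);

    System.out.println(name); // prints "alice"
  }
}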

On 31/01/2018 20:16, Reuven Lax wrote:
As to the question of how a schema should be specified, I want to 
support several common schema formats. So if a user has a Json schema, 
or an Avro schema, or a Calcite schema, etc. there should be adapters 
that allow setting a schema from any of them. I don't think we should 
prefer one over the other. While Romain is right that many people know 
Json, I think far fewer people know Json schemas.


Agree, schemas should not be enforced (for one thing, that wouldn't be 
backwards compatible!). I think for the initial prototype I will 
probably use a special coder to represent the schema (with setSchema an 
option on the coder), largely because it doesn't require modifying 
PCollection. However I think longer term a schema should be an optional 
piece of metadata on the PCollection object. Similar to the previous 
discussion about "hints," I think this can be set on the producing 
PTransform, and a SetSchema PTransform will allow attaching a schema to 
any PCollection (i.e. pc.apply(SetSchema.of(schema))). This part isn't 
designed yet, but I think schema should be similar to hints, it's just 
another piece of metadata on the PCollection (though something 
interpreted by the model, where hints are interpreted by the runner)


Reuven

On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré > wrote:


Hi,

I think we should avoid to mix two things in the discussion (and so
the document):

1. The element of the collection and the schema itself are two
different things.
By essence, Beam should not enforce any schema. That's why I think
it's a good
idea to set the schema optionally on the PCollection
(pcollection.setSchema()).

2. From point 1 comes two questions: how do we represent a schema ?
How can we
leverage the schema to simplify the serialization of the element in the
PCollection and query ? These two questions are not directly related.

  2.1 How do we represent the schema
Json Schema is a very interesting idea. It could be an abstract and
other
providers, like Avro, can be bind on it. It's part of the json
processing spec
(javax).

  2.2. How do we leverage the schema for query and serialization
Also in the spec, json pointer is interesting for the querying.
Regarding the
serialization, jackson or other data binder can be used.

It's still rough ideas in my mind, but I like Romain's idea about
json-p usage.

Once 2.3.0 release is out, I will start to update the document with
those ideas,
and PoC.

Thanks !
Regards
JB

On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>
>
> Le 30 janv. 2018 01:09, "Reuven Lax" mailto:re...@google.com>
 > >> a écrit :
>
>
>
>     On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau mailto:rmannibu...@gmail.com>
 >     >> wrote:
 >
 >         Hi
 >
 >         I have some questions on this: how hierarchic schemas
would work? Seems
 >         it is not really supported by the ecosystem (out of
custom stuff) :(.
 >         How would it integrate smoothly with other generic record
types - N bridges?
 >
 >
 >     Do you mean nested schemas? What do you mean here?
 >
 >
 > Yes, sorry - wrote the mail too late ;). Was hierarchic data and
nested schemas.
 >
 >
 >         Concretely I wonder if using json API couldnt be
beneficial: json-p is a
 >         nice generic abstraction with a built in querying
mecanism (jsonpointer)
 >         but no actual serialization (even if json and binary json
are very
 >         natural). The big advantage is to have a well known
ecosystem - who
 >         doesnt know json today? - that beam can reuse for free:
JsonObject
 >         (guess we dont want JsonValue abstraction) for the record
type,
 >         jsonschema standard for the schema, jsonpointer for the
 >         delection/projection etc... It doesnt enforce the actual
serialization
 >         (json, smile, avro, ...) but provide an expressive and
alread known API
 >         so i see it as a big win-win for users (no need to learn
a new 

Re: Should we have a predictable test run order?

2018-01-31 Thread Ismaël Mejía
Is the conclusion of this thread that we should then make the test
execution random? Remember that currently it uses the default order,
which is filesystem-based as Dan mentioned, and that produces some minor
inconsistencies between mac/linux.

It is going to be interesting to see how much extra flakiness we find
just by defaulting to -Dsurefire.runOrder=random, any volunteer to
open pandora's box?



On Tue, Jan 30, 2018 at 7:30 PM, Reuven Lax  wrote:
> To expand on what Robert says, many other things in our test framework are
> randomized. e.g. PCollection elements are shuffled randomly, bundle sizes
> are determined randomly, etc. All of this should be repeatable if there's a
> failure. The test should print the seed used to generate the random numbers,
> and you should be able to pass that seed back into the run to recreate those
> exact conditions.
>
> On Tue, Jan 30, 2018 at 10:27 AM, Robert Bradshaw 
> wrote:
>>
>> Agreed, any leakage of state between tests is a bug, and giving things
>> a deterministic order just hides these bugs. I'd be in favor of
>> enforcing random ordering (with a published seed for reproduciblity of
>> course).
>>
>> On Tue, Jan 30, 2018 at 9:21 AM, Lukasz Cwik  wrote:
>> > The order should be random to ferret out issues but the test order seed
>> > should be printed and configurable so it allows replaying a test run
>> > because
>> > you can specify the order in which it should execute.
>> >
>> > I don't like having a strict order since it hides poorly written tests
>> > and
>> > people have a tendency to just work around the poorly written test
>> > instead
>> > of fixing it.
>> >
>> > On Tue, Jan 30, 2018 at 9:13 AM, Kenneth Knowles  wrote:
>> >>
>> >> What was the problem in this case?
>> >>
>> >> On Tue, Jan 30, 2018 at 9:12 AM, Romain Manni-Bucau
>> >>  wrote:
>> >>>
>> >>> What I was used to do is to capture the output when I identified some
>> >>> of
>> >>> these cases. Once it is reproduced I grep the "Running" lines from
>> >>> surefire.
>> >>> This gives me a reproducible order. Then with a kind of dichotomy you
>> >>> can
>> >>> find the "previous" test making your test failing and you can
>> >>> configure this
>> >>> sequence in idea.
>> >>>
>> >>> Not perfect but better than hiding the issue probably.
>> >>>
>> >>> Also running "clean" enforces inodes to change and increase the
>> >>> probability to reproduce it on linux.
>> >>>
>> >>>
>> >>> Romain Manni-Bucau
>> >>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>> >>>
>> >>> 2018-01-30 18:03 GMT+01:00 Daniel Kulp :
>> 
>>  The biggest problem with random is that if a test fails due to an
>>  interaction, you have no way to reproduce it.   You could re-run with
>>  random
>>  10 times and it might not fail again.   Thus, what good did it do to
>>  even
>>  flag the failure?  At least with alphabetical and reverse
>>  alphabetical, if a
>>  tests fails, you can rerun and actually have a chance to diagnose the
>>  failure.   A test that randomly fails once out of every 20 times it
>>  runs
>>  tends to get @Ignored, not fixed.   I’ve seen that way too often.  :(
>> 
>>  Dan
>> 
>> 
>>  > On Jan 30, 2018, at 11:38 AM, Romain Manni-Bucau
>>  >  wrote:
>>  >
>>  > Hi Daniel,
>>  >
>>  > As a quick fix it sounds good but doesnt it hide a leak or issue
>>  > (in
>>  > test setup or in main code)? Long story short: using a random order
>>  > can
>>  > allow to find bugs faster instead of hiding them and discover them
>>  > randomly
>>  > adding a new test.
>>  >
>>  > That said, good point to have it configurable with a -D or -P and
>>  > be
>>  > able to test quickly this flag.
>>  >
>>  >
>>  > Le 30 janv. 2018 17:33, "Daniel Kulp"  a écrit :
>>  > I spent a couple hours this morning trying to figure out why two of
>>  > the SQL tests are failing on my machine, but not for Jenkins or for
>>  > JB.
>>  > Not knowing anything about the SQL stuff, it was very hard to debug
>>  > and it
>>  > wouldn’t fail within Eclipse or even if I ran that individual test
>>  > from the
>>  > command line with -Dtest= .   Thus, a real pain…
>>  >
>>  > It turns out, there is an interaction problem between it and a test
>>  > that is running before it on my machine, but on Jenkins and JB’s
>>  > machine,
>>  > the tests are run in a different order so the problem doesn’t
>>  > surface.   So
>>  > here’s the question:
>>  >
>>  > Should the surefire configuration specify a “runOrder” so that the
>>  > tests would run the same on all of our machines?   By default, the
>>  > runOrder
>>  > is “filesystem” so depending on the order that the filesystem
>>  > returns the
>>  > test classes to surefire, the tests would run in different order.
>>  > It looks
>>  > like my APFS Mac returns them in a differe

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Udi Meiri
Yeah, another round of refactoring is needed to move the rename-via-copy+delete
logic up to the file-based sink level.

On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath  wrote:

> Good point. There's always the chance of step that performs final rename
> being retried. So we'll have to ignore this error at the sink level. We
> don't necessarily have to do this at the FileSystem level though. I think
> the proper behavior might be to raise an error for the rename at the
> FileSystem level if the destination already exists (or source doesn't
> exist) while ignoring that error (and possibly logging a warning) at the
> sink level.
>
> - Cham
>
>
> On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax  wrote:
>
>> I think the idea was to ignore "already exists" errors. The reason being
>> that any step in Beam can be executed multiple times, including the rename
>> step. If the rename step gets run twice, the second run should succeed
>> vacuously.
>>
>>
>> On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri  wrote:
>>
>>> Hi,
>>> I've been working on HDFS code for the Python SDK and I've noticed some
>>> behaviors which are surprising. I wanted to know if these behaviors are
>>> known and intended.
>>>
>>> 1. When renaming files during finalize_write, rename errors are ignored.
>>> For example, if I run wordcount twice using HDFS code I get a warning the
>>> second time because the file already exists:
>>>
>>> WARNING:root:Rename not successful:
>>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>>> -> hdfs://counts2-0-of-1, libhdfs error in renaming
>>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>>> to hdfs://counts2-0-of-1 with exceptions Unable to rename
>>> '/beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2'
>>> to '/counts2-0-of-1'.
>>>
>>> For GCS and local files there are no rename errors (in this case), since
>>> the rename operation silently overwrites existing destination files.
>>> However, blindly ignoring these errors might make the pipeline report
>>> success even though output files are missing.
>>>
>>> 2. Output files (--output) overwrite existing files.
>>>
>>> 3. The Python SDK doesn't use Filesystems.copy(). The Java SDK doesn't
>>> use Filesystem.Rename().
>>>
>>> Thanks,
>>> - Udi
>>>
>>
>>




Re: Should we have a predictable test run order?

2018-01-31 Thread Romain Manni-Bucau
2018-01-31 21:31 GMT+01:00 Ismaël Mejía :

> Is the conclusion of this thread is that we should then make the test
> execution random, remember that currently it uses the default order
> that is filesystem-based as Dan mentioned and that produces some minor
> inconsistencies between mac/linux.
>
> It is going to be interesting to see how much extra flakiness we find
> just by defaulting to -Dsurefire.runOrder=random, any volunteer to
> open pandora's box?
>

Hehe, didn't you just get designated volunteer? ;)
Anyway, since it uses inodes ATM it is quite random on UNIX at least.


>
>
>
> On Tue, Jan 30, 2018 at 7:30 PM, Reuven Lax  wrote:
> > To expand on what Robert says, many other things in our test framework
> are
> > randomized. e.g. PCollection elements are shuffled randomly, bundle sizes
> > are determined randomly, etc. All of this should be repeatable if
> there's a
> > failure. The test should print the seed used to generate the random
> numbers,
> > and you should be able to pass that seed back into the run to recreate
> those
> > exact conditions.
> >
> > On Tue, Jan 30, 2018 at 10:27 AM, Robert Bradshaw 
> > wrote:
> >>
> >> Agreed, any leakage of state between tests is a bug, and giving things
> >> a deterministic order just hides these bugs. I'd be in favor of
> >> enforcing random ordering (with a published seed for reproduciblity of
> >> course).
> >>
> >> On Tue, Jan 30, 2018 at 9:21 AM, Lukasz Cwik  wrote:
> >> > The order should be random to ferret out issues but the test order
> seed
> >> > should be printed and configurable so it allows replaying a test run
> >> > because
> >> > you can specify the order in which it should execute.
> >> >
> >> > I don't like having a strict order since it hides poorly written tests
> >> > and
> >> > people have a tendency to just work around the poorly written test
> >> > instead
> >> > of fixing it.
> >> >
> >> > On Tue, Jan 30, 2018 at 9:13 AM, Kenneth Knowles 
> wrote:
> >> >>
> >> >> What was the problem in this case?
> >> >>
> >> >> On Tue, Jan 30, 2018 at 9:12 AM, Romain Manni-Bucau
> >> >>  wrote:
> >> >>>
> >> >>> What I was used to do is to capture the output when I identified
> some
> >> >>> of
> >> >>> these cases. Once it is reproduced I grep the "Running" lines from
> >> >>> surefire.
> >> >>> This gives me a reproducible order. Then with a kind of dichotomy
> you
> >> >>> can
> >> >>> find the "previous" test making your test failing and you can
> >> >>> configure this
> >> >>> sequence in idea.
> >> >>>
> >> >>> Not perfect but better than hiding the issue probably.
> >> >>>
> >> >>> Also running "clean" enforces inodes to change and increase the
> >> >>> probability to reproduce it on linux.
> >> >>>
> >> >>>
> >> >>> Romain Manni-Bucau
> >> >>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn
> >> >>>
> >> >>> 2018-01-30 18:03 GMT+01:00 Daniel Kulp :
> >> 
> >>  The biggest problem with random is that if a test fails due to an
> >>  interaction, you have no way to reproduce it.   You could re-run
> with
> >>  random
> >>  10 times and it might not fail again.   Thus, what good did it do
> to
> >>  even
> >>  flag the failure?  At least with alphabetical and reverse
> >>  alphabetical, if a
> >>  tests fails, you can rerun and actually have a chance to diagnose
> the
> >>  failure.   A test that randomly fails once out of every 20 times it
> >>  runs
> >>  tends to get @Ignored, not fixed.   I’ve seen that way too often.
> :(
> >> 
> >>  Dan
> >> 
> >> 
> >>  > On Jan 30, 2018, at 11:38 AM, Romain Manni-Bucau
> >>  >  wrote:
> >>  >
> >>  > Hi Daniel,
> >>  >
> >>  > As a quick fix it sounds good but doesnt it hide a leak or issue
> >>  > (in
> >>  > test setup or in main code)? Long story short: using a random
> order
> >>  > can
> >>  > allow to find bugs faster instead of hiding them and discover
> them
> >>  > randomly
> >>  > adding a new test.
> >>  >
> >>  > That said, good point to have it configurable with a -D or -P and
> >>  > be
> >>  > able to test quickly this flag.
> >>  >
> >>  >
> >>  > Le 30 janv. 2018 17:33, "Daniel Kulp"  a
> écrit :
> >>  > I spent a couple hours this morning trying to figure out why two
> of
> >>  > the SQL tests are failing on my machine, but not for Jenkins or
> for
> >>  > JB.
> >>  > Not knowing anything about the SQL stuff, it was very hard to
> debug
> >>  > and it
> >>  > wouldn’t fail within Eclipse or even if I ran that individual
> test
> >>  > from the
> >>  > command line with -Dtest= .   Thus, a real pain…
> >>  >
> >>  > It turns out, there is an interaction problem between it and a
> test
> >>  > that is running before it on my machine, but on Jenkins and JB’s
> >>  > machine,
> >>  > the tests are run in a different order so the problem doesn’t
> >>  > surface.   So
> >>  > here’s the 

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Reuven Lax
Agree. The initial implementation will be a prototype.

On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré 
wrote:

> Hi Reuven,
>
> Agree to be able to describe the schema with different format. The good
> point about json schemas is that they are described by a spec. My point is
> also to avoid the reinvent the wheel. Just an abstract to be able to use
> Avro, Json, Calcite, custom schema descriptors would be great.
>
> Using coder to describe a schema sounds like a smart move to implement
> quickly. However, it has to be clear in term of documentation to avoid
> "side effect". I still think PCollection.setSchema() is better: it should
> be metadata (or hint ;))) on the PCollection.
>
> Regards
> JB
>
> On 31/01/2018 20:16, Reuven Lax wrote:
>
>> As to the question of how a schema should be specified, I want to support
>> several common schema formats. So if a user has a Json schema, or an Avro
>> schema, or a Calcite schema, etc. there should be adapters that allow
>> setting a schema from any of them. I don't think we should prefer one over
>> the other. While Romain is right that many people know Json, I think far
>> fewer people know Json schemas.
>>
>> Agree, schemas should not be enforced (for one thing, that wouldn't be
>> backwards compatible!). I think for the initial prototype I will probably
>> use a special coder to represent the schema (with setSchema an option on
>> the coder), largely because it doesn't require modifying PCollection.
>> However I think longer term a schema should be an optional piece of
>> metadata on the PCollection object. Similar to the previous discussion
>> about "hints," I think this can be set on the producing PTransform, and a
>> SetSchema PTransform will allow attaching a schema to any PCollection (i.e.
>> pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I
>> think schema should be similar to hints, it's just another piece of
>> metadata on the PCollection (though something interpreted by the model,
>> where hints are interpreted by the runner)
>>
>> Reuven
>>
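
A rough sketch of what the SetSchema transform described above could look like. Neither SetSchema nor a schema setter existed in Beam at the time; the class below, and the use of a plain JSON-schema string as the schema representation, are assumptions for illustration only.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical sketch of the proposed SetSchema transform; not existing Beam API.
public class SetSchema<T> extends PTransform<PCollection<T>, PCollection<T>> {

  // Placeholder: the schema is carried as a JSON schema document here purely
  // for illustration; the real representation is still under discussion above.
  private final String jsonSchema;

  private SetSchema(String jsonSchema) {
    this.jsonSchema = jsonSchema;
  }

  public static <T> SetSchema<T> of(String jsonSchema) {
    return new SetSchema<>(jsonSchema);
  }

  @Override
  public PCollection<T> expand(PCollection<T> input) {
    PCollection<T> output =
        input
            .apply("PropagateElements", ParDo.of(new DoFn<T, T>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                c.output(c.element());
              }
            }))
            .setCoder(input.getCoder());
    // In the prototype the schema would be attached via a special coder; longer
    // term it would become metadata on the PCollection itself.
    return output;
  }
}

With something like this, pc.apply(SetSchema.of(schema)) would pass elements through unchanged while leaving a place to record the schema once the metadata side of the design is settled.
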
>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré > > wrote:
>>
>> Hi,
>>
>> I think we should avoid to mix two things in the discussion (and so
>> the document):
>>
>> 1. The element of the collection and the schema itself are two
>> different things.
>> By essence, Beam should not enforce any schema. That's why I think
>> it's a good
>> idea to set the schema optionally on the PCollection
>> (pcollection.setSchema()).
>>
>> 2. From point 1 comes two questions: how do we represent a schema ?
>> How can we
>> leverage the schema to simplify the serialization of the element in
>> the
>> PCollection and query ? These two questions are not directly related.
>>
>>   2.1 How do we represent the schema
>> Json Schema is a very interesting idea. It could be an abstract and
>> other
>> providers, like Avro, can be bind on it. It's part of the json
>> processing spec
>> (javax).
>>
>>   2.2. How do we leverage the schema for query and serialization
>> Also in the spec, json pointer is interesting for the querying.
>> Regarding the
>> serialization, jackson or other data binder can be used.
>>
>> It's still rough ideas in my mind, but I like Romain's idea about
>> json-p usage.
>>
>> Once 2.3.0 release is out, I will start to update the document with
>> those ideas,
>> and PoC.
>>
>> Thanks !
>> Regards
>> JB
>>
>> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>> >
>> >
>> > Le 30 janv. 2018 01:09, "Reuven Lax" > re...@google.com>
>>  > >> a écrit :
>> >
>> >
>> >
>> > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <
>> rmannibu...@gmail.com 
>>  > >
>> >> wrote:
>>  >
>>  > Hi
>>  >
>>  > I have some questions on this: how hierarchic schemas
>> would work? Seems
>>  > it is not really supported by the ecosystem (out of
>> custom stuff) :(.
>>  > How would it integrate smoothly with other generic record
>> types - N bridges?
>>  >
>>  >
>>  > Do you mean nested schemas? What do you mean here?
>>  >
>>  >
>>  > Yes, sorry - wrote the mail too late ;). Was hierarchic data and
>> nested schemas.
>>  >
>>  >
>>  > Concretely I wonder if using json API couldnt be
>> beneficial: json-p is a
>>  > nice generic abstraction with a built in querying
>> mecanism (jsonpointer)
>>  > but no actual serialization (even if json and binary json
>> are very
>>  > natural). The big advantage is to have a well known
>> ecosystem - who
>>  > doesnt know json today? - that beam can re

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Romain Manni-Bucau
If you need help on the json part I'm happy to help. To give a few hints on
what is very doable: we could, for instance, add an Avro module to Johnzon (the
ASF JSON-P/JSON-B implementation) to back JSON-P with Avro (I would guess it
will be one of the first things to be asked for).
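
For readers unfamiliar with the JSON Pointer querying mentioned in this thread, a small self-contained sketch follows. It assumes only a JSON-P 1.1 implementation such as Johnzon on the classpath; the document structure and field names are made up for illustration.

import java.io.StringReader;

import javax.json.Json;
import javax.json.JsonObject;
import javax.json.JsonPointer;
import javax.json.JsonValue;

public class JsonPointerExample {
  public static void main(String[] args) {
    // A nested "record" expressed as plain JSON.
    JsonObject record = Json.createReader(new StringReader(
        "{\"user\": {\"name\": \"alice\", \"scores\": [3, 5, 8]}}")).readObject();

    // JSON Pointer (RFC 6901) gives a generic way to query nested fields,
    // which is the querying mechanism referred to in this thread.
    JsonPointer namePointer = Json.createPointer("/user/name");
    JsonPointer secondScore = Json.createPointer("/user/scores/1");

    JsonValue name = namePointer.getValue(record);   // "alice"
    JsonValue score = secondScore.getValue(record);  // 5

    System.out.println(name + " " + score);
  }
}
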


Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn 

2018-01-31 22:06 GMT+01:00 Reuven Lax :

> Agree. The initial implementation will be a prototype.
>
> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi Reuven,
>>
>> Agree to be able to describe the schema with different format. The good
>> point about json schemas is that they are described by a spec. My point is
>> also to avoid the reinvent the wheel. Just an abstract to be able to use
>> Avro, Json, Calcite, custom schema descriptors would be great.
>>
>> Using coder to describe a schema sounds like a smart move to implement
>> quickly. However, it has to be clear in term of documentation to avoid
>> "side effect". I still think PCollection.setSchema() is better: it should
>> be metadata (or hint ;))) on the PCollection.
>>
>> Regards
>> JB
>>
>> On 31/01/2018 20:16, Reuven Lax wrote:
>>
>>> As to the question of how a schema should be specified, I want to
>>> support several common schema formats. So if a user has a Json schema, or
>>> an Avro schema, or a Calcite schema, etc. there should be adapters that
>>> allow setting a schema from any of them. I don't think we should prefer one
>>> over the other. While Romain is right that many people know Json, I think
>>> far fewer people know Json schemas.
>>>
>>> Agree, schemas should not be enforced (for one thing, that wouldn't be
>>> backwards compatible!). I think for the initial prototype I will probably
>>> use a special coder to represent the schema (with setSchema an option on
>>> the coder), largely because it doesn't require modifying PCollection.
>>> However I think longer term a schema should be an optional piece of
>>> metadata on the PCollection object. Similar to the previous discussion
>>> about "hints," I think this can be set on the producing PTransform, and a
>>> SetSchema PTransform will allow attaching a schema to any PCollection (i.e.
>>> pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I
>>> think schema should be similar to hints, it's just another piece of
>>> metadata on the PCollection (though something interpreted by the model,
>>> where hints are interpreted by the runner)
>>>
>>> Reuven
>>>
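
As an illustration of the coder-based prototype mentioned just above, a coder could simply wrap the element coder and carry the schema alongside it without changing the encoded bytes. This is a sketch of that idea only, not an existing Beam class.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CustomCoder;

// Illustrative sketch only: a coder that carries a schema next to the element
// coder, as in the prototype idea above. The String schema is a placeholder.
public class SchemaCarryingCoder<T> extends CustomCoder<T> {

  private final Coder<T> elementCoder;
  private final String jsonSchema;

  public SchemaCarryingCoder(Coder<T> elementCoder, String jsonSchema) {
    this.elementCoder = elementCoder;
    this.jsonSchema = jsonSchema;
  }

  public String getSchema() {
    return jsonSchema;
  }

  @Override
  public void encode(T value, OutputStream outStream) throws IOException {
    // The wire format is unchanged; the schema is metadata only.
    elementCoder.encode(value, outStream);
  }

  @Override
  public T decode(InputStream inStream) throws IOException {
    return elementCoder.decode(inStream);
  }
}
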
>>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré >> > wrote:
>>>
>>> Hi,
>>>
>>> I think we should avoid to mix two things in the discussion (and so
>>> the document):
>>>
>>> 1. The element of the collection and the schema itself are two
>>> different things.
>>> By essence, Beam should not enforce any schema. That's why I think
>>> it's a good
>>> idea to set the schema optionally on the PCollection
>>> (pcollection.setSchema()).
>>>
>>> 2. From point 1 comes two questions: how do we represent a schema ?
>>> How can we
>>> leverage the schema to simplify the serialization of the element in
>>> the
>>> PCollection and query ? These two questions are not directly related.
>>>
>>>   2.1 How do we represent the schema
>>> Json Schema is a very interesting idea. It could be an abstract and
>>> other
>>> providers, like Avro, can be bind on it. It's part of the json
>>> processing spec
>>> (javax).
>>>
>>>   2.2. How do we leverage the schema for query and serialization
>>> Also in the spec, json pointer is interesting for the querying.
>>> Regarding the
>>> serialization, jackson or other data binder can be used.
>>>
>>> It's still rough ideas in my mind, but I like Romain's idea about
>>> json-p usage.
>>>
>>> Once 2.3.0 release is out, I will start to update the document with
>>> those ideas,
>>> and PoC.
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>> >
>>> >
>>> > Le 30 janv. 2018 01:09, "Reuven Lax" >> re...@google.com>
>>>  > >> a écrit :
>>> >
>>> >
>>> >
>>> > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com 
>>>  > >>
>>> >> wrote:
>>>  >
>>>  > Hi
>>>  >
>>>  > I have some questions on this: how hierarchic schemas
>>> would work? Seems
>>>  > it is not really supported by the ecosystem (out of
>>> custom stuff) :(.
>>>  > How would it integrate smoothly with other generic record
>>> types - N bridges?
>>>  >
>>>  

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Alan Myrvold
-1 (for now, hope to change this)

Dataflow runner jobs are failing for me with 2.3.0 RC1, for both Java and
Python.

This is not an issue with the 2.3.0 RC1 SDK; we (Google) need to release
worker images.

I have assigned these bugs to myself, and will be working to help get these
workers released.

[BEAM-3584] Java dataflow job fails with 2.3.0 RC1, due to missing worker
image
[BEAM-3585] Python dataflow job fails with 2.3.0 RC1, due to missing worker
image

On Wed, Jan 31, 2018 at 6:12 AM, Jean-Baptiste Onofré 
wrote:

> Thanks Kenn,
>
> I prepared the list of tasks I did. I will complete where release is out.
>
> Regards
> JB
>
> On 01/31/2018 03:07 PM, Kenneth Knowles wrote:
> > I've cloned the release validation spreadsheet:
> >
> > https://s.apache.org/beam-2.3.0-release-validation
> >
> > If you plan to perform a manual validation task, please sign up so
> multiple
> > people don't waste effort.
> >
> > Alan & JB, as far as your pairing up to automate more, anything manual
> on this
> > sheet should be considered.
> >
> > Kenn
> >
> > On Wed, Jan 31, 2018 at 5:59 AM, Jean-Baptiste Onofré  > > wrote:
> >
> > +1 (binding)
> >
> > Casting my own +1 ;)
> >
> > Regards
> > JB
> >
> > On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
> > > Hi everyone,
> > >
> > > Please review and vote on the release candidate #1 for the version
> 2.3.0, as
> > > follows:
> > >
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please provide specific
> comments)
> > >
> > >
> > > The complete staging area is available for your review, which
> includes:
> > > * JIRA release notes [1],
> > > * the official Apache source release to be deployed to
> dist.apache.org
> >  [2],
> > > which is signed with the key with fingerprint C8282E76 [3],
> > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > * source code tag "v2.3.0-RC1" [5],
> > > * website pull request listing the release and publishing the API
> reference
> > > manual [6].
> > > * Java artifacts were built with Maven 3.3.9 and Oracle JDK
> 1.8.0_111.
> > > * Python artifacts are deployed along with the source release to
> the
> > > dist.apache.org  [2].
> > >
> > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > approval,
> > > with at least 3 PMC affirmative votes.
> > >
> > > Thanks,
> > > JB
> > >
> > > [1]
> > >
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> projectId=12319527&version=12341608
> >  projectId=12319527&version=12341608>
> > > [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
> > 
> > > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > 
> > > [4] https://repository.apache.org/content/repositories/
> orgapachebeam-1026/
> >  orgapachebeam-1026/>
> > > [5] https://github.com/apache/beam/tree/v2.3.0-RC1
> > 
> > > [6] https://github.com/apache/beam-site/pull/381
> > 
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org 
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Jean-Baptiste Onofré

Hi Alan

Does it relate to a change in the codebase or in a dependency/related
project?


I mean: is it something we have to fix/change in Beam ?

Just curious as I'm not sure what you mean by "worker images" ;)

Thanks !
Regards
JB

On 31/01/2018 22:18, Alan Myrvold wrote:

-1 (for now, hope to change this)

Dataflow runner jobs are failing for me with 2.3.0 RC1, for both Java 
and Python.


This is not an issues with the 2.3.0 RC1 SDK, we (google) need to 
release worker images.


I have assigned these bugs to myself, and will be working to help get 
these workers released.


[BEAM-3584] Java dataflow job fails with 2.3.0 RC1, due to missing 
worker image
[BEAM-3585] Python dataflow job fails with 2.3.0 RC1, due to missing 
worker image


On Wed, Jan 31, 2018 at 6:12 AM, Jean-Baptiste Onofré > wrote:


Thanks Kenn,

I prepared the list of tasks I did. I will complete where release is
out.

Regards
JB

On 01/31/2018 03:07 PM, Kenneth Knowles wrote:
> I've cloned the release validation spreadsheet:
>
> https://s.apache.org/beam-2.3.0-release-validation

>
> If you plan to perform a manual validation task, please sign up so 
multiple
> people don't waste effort.
>
> Alan & JB, as far as your pairing up to automate more, anything manual on 
this
> sheet should be considered.
>
> Kenn
>
> On Wed, Jan 31, 2018 at 5:59 AM, Jean-Baptiste Onofré mailto:j...@nanthrax.net>
> >> wrote:
>
>     +1 (binding)
>
>     Casting my own +1 ;)
>
>     Regards
>     JB
>
>     On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
>     > Hi everyone,
>     >
>     > Please review and vote on the release candidate #1 for the version 
2.3.0, as
>     > follows:
>     >
>     > [ ] +1, Approve the release
>     > [ ] -1, Do not approve the release (please provide specific 
comments)
>     >
>     >
>     > The complete staging area is available for your review, which 
includes:
>     > * JIRA release notes [1],
>     > * the official Apache source release to be deployed to dist.apache.org 

 >      [2],
>     > which is signed with the key with fingerprint C8282E76 [3],
>     > * all artifacts to be deployed to the Maven Central Repository [4],
>     > * source code tag "v2.3.0-RC1" [5],
>     > * website pull request listing the release and publishing the API 
reference
>     > manual [6].
>     > * Java artifacts were built with Maven 3.3.9 and Oracle JDK 
1.8.0_111.
>     > * Python artifacts are deployed along with the source release to the
 >     > dist.apache.org 
 [2].
>     >
>     > The vote will be open for at least 72 hours. It is adopted by 
majority
>     approval,
>     > with at least 3 PMC affirmative votes.
>     >
>     > Thanks,
>     > JB
>     >
>     > [1]
>     >
> 
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341608


>     
>
>     > [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/

>     >
>     > [3] https://dist.apache.org/repos/dist/release/beam/KEYS

>     >
>     > [4] 
https://repository.apache.org/content/repositories/orgapachebeam-1026/

>     
>
>     > [5] https://github.com/apache/beam/tree/v2.3.0-RC1

>     >
>     > [6] https://github.com/apache/beam-site/pull/381

>     >
>     >
>
>     --
>     Jean-Baptiste Onofré
 > jbono...@apache.org 


Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Reuven Lax
It's just a step that needs to be performed before the new release works on
Dataflow. Alan is saying that we've been unable to validate Dataflow so
far, as worker images are not yet built. Hopefully they'll be built soon,
and we'll be able to validate.

On Wed, Jan 31, 2018 at 1:31 PM, Jean-Baptiste Onofré 
wrote:

> Hi Alan
>
> does it related to change in the codebase or in a dependency/related
> project ?
>
> I mean: is it something we have to fix/change in Beam ?
>
> Just curious as I'm not sure what you mean by "worker images" ;)
>
> Thanks !
> Regards
> JB
>
> On 31/01/2018 22:18, Alan Myrvold wrote:
>
>> -1 (for now, hope to change this)
>>
>> Dataflow runner jobs are failing for me with 2.3.0 RC1, for both Java and
>> Python.
>>
>> This is not an issues with the 2.3.0 RC1 SDK, we (google) need to release
>> worker images.
>>
>> I have assigned these bugs to myself, and will be working to help get
>> these workers released.
>>
>> [BEAM-3584] Java dataflow job fails with 2.3.0 RC1, due to missing worker
>> image
>> [BEAM-3585] Python dataflow job fails with 2.3.0 RC1, due to missing
>> worker image
>>
>> On Wed, Jan 31, 2018 at 6:12 AM, Jean-Baptiste Onofré > > wrote:
>>
>> Thanks Kenn,
>>
>> I prepared the list of tasks I did. I will complete where release is
>> out.
>>
>> Regards
>> JB
>>
>> On 01/31/2018 03:07 PM, Kenneth Knowles wrote:
>> > I've cloned the release validation spreadsheet:
>> >
>> > https://s.apache.org/beam-2.3.0-release-validation
>> 
>> >
>> > If you plan to perform a manual validation task, please sign up so
>> multiple
>> > people don't waste effort.
>> >
>> > Alan & JB, as far as your pairing up to automate more, anything
>> manual on this
>> > sheet should be considered.
>> >
>> > Kenn
>> >
>> > On Wed, Jan 31, 2018 at 5:59 AM, Jean-Baptiste Onofré <
>> j...@nanthrax.net 
>> > >> wrote:
>> >
>> > +1 (binding)
>> >
>> > Casting my own +1 ;)
>> >
>> > Regards
>> > JB
>> >
>> > On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
>> > > Hi everyone,
>> > >
>> > > Please review and vote on the release candidate #1 for the
>> version 2.3.0, as
>> > > follows:
>> > >
>> > > [ ] +1, Approve the release
>> > > [ ] -1, Do not approve the release (please provide specific
>> comments)
>> > >
>> > >
>> > > The complete staging area is available for your review, which
>> includes:
>> > > * JIRA release notes [1],
>> > > * the official Apache source release to be deployed to
>> dist.apache.org 
>>  >  [2],
>> > > which is signed with the key with fingerprint C8282E76 [3],
>> > > * all artifacts to be deployed to the Maven Central
>> Repository [4],
>> > > * source code tag "v2.3.0-RC1" [5],
>> > > * website pull request listing the release and publishing the
>> API reference
>> > > manual [6].
>> > > * Java artifacts were built with Maven 3.3.9 and Oracle JDK
>> 1.8.0_111.
>> > > * Python artifacts are deployed along with the source release
>> to the
>>  > > dist.apache.org 
>>  [2].
>> > >
>> > > The vote will be open for at least 72 hours. It is adopted by
>> majority
>> > approval,
>> > > with at least 3 PMC affirmative votes.
>> > >
>> > > Thanks,
>> > > JB
>> > >
>> > > [1]
>> > >
>> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?proj
>> ectId=12319527&version=12341608
>> > ectId=12319527&version=12341608>
>> > > jectId=12319527&version=12341608
>> > ectId=12319527&version=12341608>>
>> > > [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
>> 
>> > > >
>> > > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> 
>> > > >
>> > > [4] https://repository.apache.org/
>> content/repositories/orgapachebeam-1026/
>> > ebeam-1026/>
>> > 

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Jean-Baptiste Onofré

OK, I think I understood ;)

So it's not "directly" related to Beam itself (it's more a Dataflow step 
to perform).


@Alan, I think it's better to test first and then cast the vote. This
kind of test is valuable to validate the release and makes sense. But the
vote should represent the state of the Beam release. So I think a -1 vote
is a bit too early before the test.


Thanks !
Regards
JB

On 31/01/2018 22:33, Reuven Lax wrote:
It's just a step that needs to be peformed before the new release works 
on Dataflow. Alan is saying that we've been unable to validate Dataflow 
so far, as worker images are not yet built. Hopefully they'll be built 
soon, and we'll be able to validate.


On Wed, Jan 31, 2018 at 1:31 PM, Jean-Baptiste Onofré > wrote:


Hi Alan

does it related to change in the codebase or in a dependency/related
project ?

I mean: is it something we have to fix/change in Beam ?

Just curious as I'm not sure what you mean by "worker images" ;)

Thanks !
Regards
JB

On 31/01/2018 22:18, Alan Myrvold wrote:

-1 (for now, hope to change this)

Dataflow runner jobs are failing for me with 2.3.0 RC1, for both
Java and Python.

This is not an issues with the 2.3.0 RC1 SDK, we (google) need
to release worker images.

I have assigned these bugs to myself, and will be working to
help get these workers released.

[BEAM-3584] Java dataflow job fails with 2.3.0 RC1, due to
missing worker image
[BEAM-3585] Python dataflow job fails with 2.3.0 RC1, due to
missing worker image

On Wed, Jan 31, 2018 at 6:12 AM, Jean-Baptiste Onofré
mailto:j...@nanthrax.net>
>> wrote:

     Thanks Kenn,

     I prepared the list of tasks I did. I will complete where
release is
     out.

     Regards
     JB

     On 01/31/2018 03:07 PM, Kenneth Knowles wrote:
     > I've cloned the release validation spreadsheet:
     >
     > https://s.apache.org/beam-2.3.0-release-validation

     >
     >
     > If you plan to perform a manual validation task, please
sign up so multiple
     > people don't waste effort.
     >
     > Alan & JB, as far as your pairing up to automate more,
anything manual on this
     > sheet should be considered.
     >
     > Kenn
     >
     > On Wed, Jan 31, 2018 at 5:59 AM, Jean-Baptiste Onofré
mailto:j...@nanthrax.net>
>
     > 

     >     +1 (binding)
     >
     >     Casting my own +1 ;)
     >
     >     Regards
     >     JB
     >
     >     On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
     >     > Hi everyone,
     >     >
     >     > Please review and vote on the release candidate #1
for the version 2.3.0, as
     >     > follows:
     >     >
     >     > [ ] +1, Approve the release
     >     > [ ] -1, Do not approve the release (please provide
specific comments)
     >     >
     >     >
     >     > The complete staging area is available for your
review, which includes:
     >     > * JIRA release notes [1],
     >     > * the official Apache source release to be deployed
to dist.apache.org  
      >      [2],
     >     > which is signed with the key with fingerprint
C8282E76 [3],
     >     > * all artifacts to be deployed to the Maven Central
Repository [4],
     >     > * source code tag "v2.3.0-RC1" [5],
     >     > * website pull request listing the release and
publishing the API reference
     >     > manual [6].
     >     > * Java artifacts were built with Maven 3.3.9 and
Oracle JDK 1.8.0_111.
     >     > * Python artifacts are deployed along with the
source release to the
      >     > dist.apache.org 

      [2].
     >     >
     >     > The vote will be open for at least 72 hours. It is
adopted by majority
     >     approval,
     >     > with at least 3 PMC affirmative votes.
     >     >

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Alan Myrvold
Yes, it is a dataflow step. Happy to test this again when they are
available.

On Wed, Jan 31, 2018 at 1:39 PM, Jean-Baptiste Onofré 
wrote:

> OK, I think I understood ;)
>
> So it's not "directly" related to Beam itself (it's more a Dataflow step
> to perform).
>
> @Alan, I think it's better to test first and then cast the vote. This kind
> of tests are valuable to validate the release and make sense. But vote
> should represent the state of the Beam release. So I think -1 vote is a bit
> too early before the test.
>
> Thanks !
> Regards
> JB
>
> On 31/01/2018 22:33, Reuven Lax wrote:
>
>> It's just a step that needs to be peformed before the new release works
>> on Dataflow. Alan is saying that we've been unable to validate Dataflow so
>> far, as worker images are not yet built. Hopefully they'll be built soon,
>> and we'll be able to validate.
>>
>> On Wed, Jan 31, 2018 at 1:31 PM, Jean-Baptiste Onofré > > wrote:
>>
>> Hi Alan
>>
>> does it related to change in the codebase or in a dependency/related
>> project ?
>>
>> I mean: is it something we have to fix/change in Beam ?
>>
>> Just curious as I'm not sure what you mean by "worker images" ;)
>>
>> Thanks !
>> Regards
>> JB
>>
>> On 31/01/2018 22:18, Alan Myrvold wrote:
>>
>> -1 (for now, hope to change this)
>>
>> Dataflow runner jobs are failing for me with 2.3.0 RC1, for both
>> Java and Python.
>>
>> This is not an issues with the 2.3.0 RC1 SDK, we (google) need
>> to release worker images.
>>
>> I have assigned these bugs to myself, and will be working to
>> help get these workers released.
>>
>> [BEAM-3584] Java dataflow job fails with 2.3.0 RC1, due to
>> missing worker image
>> [BEAM-3585] Python dataflow job fails with 2.3.0 RC1, due to
>> missing worker image
>>
>> On Wed, Jan 31, 2018 at 6:12 AM, Jean-Baptiste Onofré
>> mailto:j...@nanthrax.net>
>> >> wrote:
>>
>>  Thanks Kenn,
>>
>>  I prepared the list of tasks I did. I will complete where
>> release is
>>  out.
>>
>>  Regards
>>  JB
>>
>>  On 01/31/2018 03:07 PM, Kenneth Knowles wrote:
>>  > I've cloned the release validation spreadsheet:
>>  >
>>  > https://s.apache.org/beam-2.3.0-release-validation
>> 
>>  > >
>>  >
>>  > If you plan to perform a manual validation task, please
>> sign up so multiple
>>  > people don't waste effort.
>>  >
>>  > Alan & JB, as far as your pairing up to automate more,
>> anything manual on this
>>  > sheet should be considered.
>>  >
>>  > Kenn
>>  >
>>  > On Wed, Jan 31, 2018 at 5:59 AM, Jean-Baptiste Onofré
>> mailto:j...@nanthrax.net>
>> >
>>  > 
>> >  >
>>  > +1 (binding)
>>  >
>>  > Casting my own +1 ;)
>>  >
>>  > Regards
>>  > JB
>>  >
>>  > On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
>>  > > Hi everyone,
>>  > >
>>  > > Please review and vote on the release candidate #1
>> for the version 2.3.0, as
>>  > > follows:
>>  > >
>>  > > [ ] +1, Approve the release
>>  > > [ ] -1, Do not approve the release (please provide
>> specific comments)
>>  > >
>>  > >
>>  > > The complete staging area is available for your
>> review, which includes:
>>  > > * JIRA release notes [1],
>>  > > * the official Apache source release to be deployed
>> to dist.apache.org  <
>> http://dist.apache.org>
>>   >  [2],
>>  > > which is signed with the key with fingerprint
>> C8282E76 [3],
>>  > > * all artifacts to be deployed to the Maven Central
>> Repository [4],
>>  > > * source code tag "v2.3.0-RC1" [5],
>>  > > * website pull request listing the release and
>> publishing the API reference
>>  > > manual [6].
>>  > > * Java artifacts were built with Maven 3.3.9 and
>> Oracle JDK 1

Tracking Sickbayed tests in Jira

2018-01-31 Thread Thomas Groh
Hey everyone;

I've realized that although we tend to tag any test we suppress (due to
consistent flakiness) in the codebase, and file an associated JIRA issue
with the failure, we don't have any centralized way to track tests that
we're currently suppressing. To try and get more visibility into our
suppressed tests (without running `grep -r @Ignore ...` over the codebase
over and over), I've created a label for these tests, and applied it to all
of the issues that annotated `@Ignore` tests point to.

Ideally, all of our suppressed tests would be tagged with this label, so we
can get some visibility into which components we would normally expect to
have coverage but don't currently.

The search to look at all of these issues is
https://issues.apache.org/jira/browse/BEAM-3583?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20sickbay

If you're looking for something to do, or have other issues that should be
labelled, feel free to jump right in.

Yours,

Thomas


Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Raghu Angadi
The original mail mentions that the output from a second run of word_count is
ignored. That does not seem as safe as ignoring an error from a second attempt
of a step. How do we know the second run didn't produce different output?
Overwriting seems more accurate than ignoring. Does handling this error at the
sink level distinguish between the two (another run vs a second attempt)?


On Wed, Jan 31, 2018 at 12:32 PM, Udi Meiri  wrote:

> Yeah, another round of refactoring is due to move the rename via
> copy+delete logic up to the file-based sink level.
>
> On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath 
> wrote:
>
>> Good point. There's always the chance of step that performs final rename
>> being retried. So we'll have to ignore this error at the sink level. We
>> don't necessarily have to do this at the FileSystem level though. I think
>> the proper behavior might be to raise an error for the rename at the
>> FileSystem level if the destination already exists (or source doesn't
>> exist) while ignoring that error (and possibly logging a warning) at the
>> sink level.
>>
>> - Cham
>>
>>
>> On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax  wrote:
>>
>>> I think the idea was to ignore "already exists" errors. The reason being
>>> that any step in Beam can be executed multiple times, including the rename
>>> step. If the rename step gets run twice, the second run should succeed
>>> vacuously.
>>>
>>>
>>> On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri  wrote:
>>>
 Hi,
 I've been working on HDFS code for the Python SDK and I've noticed some
 behaviors which are surprising. I wanted to know if these behaviors are
 known and intended.

 1. When renaming files during finalize_write, rename errors are ignored
 .
 For example, if I run wordcount twice using HDFS code I get a warning the
 second time because the file already exists:

 WARNING:root:Rename not successful: hdfs://beam-temp-counts2-
 7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
 -> hdfs://counts2-0-of-1, libhdfs error in renaming
 hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da2
 45/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2 to
 hdfs://counts2-0-of-1 with exceptions Unable to rename
 '/beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da2
 45/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2' to
 '/counts2-0-of-1'.

 For GCS and local files there are no rename errors (in this case),
 since the rename operation silently overwrites existing destination files.
 However, blindly ignoring these errors might make the pipeline to report
 success even though output files are missing.

 2. Output files (--ouput) overwrite existing files.

 3. The Python SDK doesn't use Filesystems.copy(). The Java SDK doesn't
 use Filesystem.Rename().

 Thanks,
 - Udi

>>>
>>>


Re: Tracking Sickbayed tests in Jira

2018-01-31 Thread Romain Manni-Bucau
If it helps I'm using on another project:

# to find @Ignore tests
$ find . -name '*.java' | xargs grep -n1 @Ignore
# to find test classes (not methods)
$ find . -name '*.java' | xargs grep @Ignore | sed 's#:.*##' | sort -u
# to find modules with @Ignore
$ find . -name '*.java' | xargs grep @Ignore | sed 's#src/.*##' | sort -u
# to count ignored tests
$ find . -name '*.java' | xargs grep  @Ignore  | wc -l

The last one, combined with a loop over git history, allows following the
evolution and checking whether the count grows or decreases.



Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn 

2018-01-31 22:40 GMT+01:00 Thomas Groh :

> Hey everyone;
>
> I've realized that although we tend to tag any test we suppress (due to
> consistent flakiness) in the codebase, and file an associated JIRA issue
> with the failure, we don't have any centralized way to track tests that
> we're currently suppressing. To try and get more visibility into our
> suppressed tests (without running `grep -r @Ignore ...` over the codebase
> over and over), I've created a label for these tests, and applied it to all
> of the issues that annotated `@Ignore` tests point to.
>
> Ideally, all of our suppressed tests would be tagged with this label, so
> we can get some visibility into which components we would normally expect
> to have coverage but don't currently.
>
> The search to look at all of these issues is
> https://issues.apache.org/jira/browse/BEAM-3583?jql=
> project%20%3D%20BEAM%20AND%20labels%20%3D%20sickbay
>
> If you're looking for something to do, or have other issues that should be
> labelled, feel free to jump right in.
>
> Yours,
>
> Thomas
>


Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Ismaël Mejía
What is the common procedure in cases like this? Because it doesn't
seem that it needs a re-vote, just an extra day or two for
validation. Any ideas JB?

On Wed, Jan 31, 2018 at 10:41 PM, Alan Myrvold  wrote:
> Yes, it is a dataflow step. Happy to test this again when they are
> available.
>
> On Wed, Jan 31, 2018 at 1:39 PM, Jean-Baptiste Onofré 
> wrote:
>>
>> OK, I think I understood ;)
>>
>> So it's not "directly" related to Beam itself (it's more a Dataflow step
>> to perform).
>>
>> @Alan, I think it's better to test first and then cast the vote. This kind
>> of tests are valuable to validate the release and make sense. But vote
>> should represent the state of the Beam release. So I think -1 vote is a bit
>> too early before the test.
>>
>> Thanks !
>> Regards
>> JB
>>
>> On 31/01/2018 22:33, Reuven Lax wrote:
>>>
>>> It's just a step that needs to be peformed before the new release works
>>> on Dataflow. Alan is saying that we've been unable to validate Dataflow so
>>> far, as worker images are not yet built. Hopefully they'll be built soon,
>>> and we'll be able to validate.
>>>
>>> On Wed, Jan 31, 2018 at 1:31 PM, Jean-Baptiste Onofré >> > wrote:
>>>
>>> Hi Alan
>>>
>>> does it related to change in the codebase or in a dependency/related
>>> project ?
>>>
>>> I mean: is it something we have to fix/change in Beam ?
>>>
>>> Just curious as I'm not sure what you mean by "worker images" ;)
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On 31/01/2018 22:18, Alan Myrvold wrote:
>>>
>>> -1 (for now, hope to change this)
>>>
>>> Dataflow runner jobs are failing for me with 2.3.0 RC1, for both
>>> Java and Python.
>>>
>>> This is not an issues with the 2.3.0 RC1 SDK, we (google) need
>>> to release worker images.
>>>
>>> I have assigned these bugs to myself, and will be working to
>>> help get these workers released.
>>>
>>> [BEAM-3584] Java dataflow job fails with 2.3.0 RC1, due to
>>> missing worker image
>>> [BEAM-3585] Python dataflow job fails with 2.3.0 RC1, due to
>>> missing worker image
>>>
>>> On Wed, Jan 31, 2018 at 6:12 AM, Jean-Baptiste Onofré
>>> mailto:j...@nanthrax.net>
>>> >> wrote:
>>>
>>>  Thanks Kenn,
>>>
>>>  I prepared the list of tasks I did. I will complete where
>>> release is
>>>  out.
>>>
>>>  Regards
>>>  JB
>>>
>>>  On 01/31/2018 03:07 PM, Kenneth Knowles wrote:
>>>  > I've cloned the release validation spreadsheet:
>>>  >
>>>  > https://s.apache.org/beam-2.3.0-release-validation
>>> 
>>>  >> >
>>>  >
>>>  > If you plan to perform a manual validation task, please
>>> sign up so multiple
>>>  > people don't waste effort.
>>>  >
>>>  > Alan & JB, as far as your pairing up to automate more,
>>> anything manual on this
>>>  > sheet should be considered.
>>>  >
>>>  > Kenn
>>>  >
>>>  > On Wed, Jan 31, 2018 at 5:59 AM, Jean-Baptiste Onofré
>>> mailto:j...@nanthrax.net>
>>> >
>>>  > 
>>> >>  >
>>>  > +1 (binding)
>>>  >
>>>  > Casting my own +1 ;)
>>>  >
>>>  > Regards
>>>  > JB
>>>  >
>>>  > On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
>>>  > > Hi everyone,
>>>  > >
>>>  > > Please review and vote on the release candidate #1
>>> for the version 2.3.0, as
>>>  > > follows:
>>>  > >
>>>  > > [ ] +1, Approve the release
>>>  > > [ ] -1, Do not approve the release (please provide
>>> specific comments)
>>>  > >
>>>  > >
>>>  > > The complete staging area is available for your
>>> review, which includes:
>>>  > > * JIRA release notes [1],
>>>  > > * the official Apache source release to be deployed
>>> to dist.apache.org 
>>> 
>>>   >  [2],
>>>  > > which is signed with the key with fingerprint
>>> C8282E76 [3],
>>>  > > * all artifacts to be deployed to the 

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Reuven Lax
Hopefully we can validate soon. I believe some of the delays are because of
integrating major changes done over the last week (e.g. Java 8 migration).

On Wed, Jan 31, 2018 at 2:04 PM, Ismaël Mejía  wrote:

> What is the common procedure in cases like this ? Because it doesn't
> seems that it needs a re-vote, just an extra day or two for
> validation, any ideas JB ?
>
> On Wed, Jan 31, 2018 at 10:41 PM, Alan Myrvold 
> wrote:
> > Yes, it is a dataflow step. Happy to test this again when they are
> > available.
> >
> > On Wed, Jan 31, 2018 at 1:39 PM, Jean-Baptiste Onofré 
> > wrote:
> >>
> >> OK, I think I understood ;)
> >>
> >> So it's not "directly" related to Beam itself (it's more a Dataflow step
> >> to perform).
> >>
> >> @Alan, I think it's better to test first and then cast the vote. This
> kind
> >> of tests are valuable to validate the release and make sense. But vote
> >> should represent the state of the Beam release. So I think -1 vote is a
> bit
> >> too early before the test.
> >>
> >> Thanks !
> >> Regards
> >> JB
> >>
> >> On 31/01/2018 22:33, Reuven Lax wrote:
> >>>
> >>> It's just a step that needs to be peformed before the new release works
> >>> on Dataflow. Alan is saying that we've been unable to validate
> Dataflow so
> >>> far, as worker images are not yet built. Hopefully they'll be built
> soon,
> >>> and we'll be able to validate.
> >>>
> >>> On Wed, Jan 31, 2018 at 1:31 PM, Jean-Baptiste Onofré  >>> > wrote:
> >>>
> >>> Hi Alan
> >>>
> >>> does it related to change in the codebase or in a
> dependency/related
> >>> project ?
> >>>
> >>> I mean: is it something we have to fix/change in Beam ?
> >>>
> >>> Just curious as I'm not sure what you mean by "worker images" ;)
> >>>
> >>> Thanks !
> >>> Regards
> >>> JB
> >>>
> >>> On 31/01/2018 22:18, Alan Myrvold wrote:
> >>>
> >>> -1 (for now, hope to change this)
> >>>
> >>> Dataflow runner jobs are failing for me with 2.3.0 RC1, for
> both
> >>> Java and Python.
> >>>
> >>> This is not an issues with the 2.3.0 RC1 SDK, we (google) need
> >>> to release worker images.
> >>>
> >>> I have assigned these bugs to myself, and will be working to
> >>> help get these workers released.
> >>>
> >>> [BEAM-3584] Java dataflow job fails with 2.3.0 RC1, due to
> >>> missing worker image
> >>> [BEAM-3585] Python dataflow job fails with 2.3.0 RC1, due to
> >>> missing worker image
> >>>
> >>> On Wed, Jan 31, 2018 at 6:12 AM, Jean-Baptiste Onofré
> >>> mailto:j...@nanthrax.net>
> >>> >> wrote:
> >>>
> >>>  Thanks Kenn,
> >>>
> >>>  I prepared the list of tasks I did. I will complete where
> >>> release is
> >>>  out.
> >>>
> >>>  Regards
> >>>  JB
> >>>
> >>>  On 01/31/2018 03:07 PM, Kenneth Knowles wrote:
> >>>  > I've cloned the release validation spreadsheet:
> >>>  >
> >>>  > https://s.apache.org/beam-2.3.0-release-validation
> >>> 
> >>>   >>> >
> >>>  >
> >>>  > If you plan to perform a manual validation task, please
> >>> sign up so multiple
> >>>  > people don't waste effort.
> >>>  >
> >>>  > Alan & JB, as far as your pairing up to automate more,
> >>> anything manual on this
> >>>  > sheet should be considered.
> >>>  >
> >>>  > Kenn
> >>>  >
> >>>  > On Wed, Jan 31, 2018 at 5:59 AM, Jean-Baptiste Onofré
> >>> mailto:j...@nanthrax.net>
> >>> >
> >>>  > 
> >>>  >>>  >
> >>>  > +1 (binding)
> >>>  >
> >>>  > Casting my own +1 ;)
> >>>  >
> >>>  > Regards
> >>>  > JB
> >>>  >
> >>>  > On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
> >>>  > > Hi everyone,
> >>>  > >
> >>>  > > Please review and vote on the release candidate #1
> >>> for the version 2.3.0, as
> >>>  > > follows:
> >>>  > >
> >>>  > > [ ] +1, Approve the release
> >>>  > > [ ] -1, Do not approve the release (please provide
> >>> specific comments)
> >>>  > >
> >>>  > >
> >>>  > > The complete staging area is available for yo

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Ahmet Altay
This will require a change in the Beam code, because image names are
hardcoded into the code (Python) and configuration (Java). RC1 as it is will
not work correctly with Cloud Dataflow.

On Wed, Jan 31, 2018 at 2:08 PM, Reuven Lax  wrote:

> Hopefully we can validate soon. I believe some of the delays are because
> of integrating major changes done over the last week (e.g. Java 8
> migration).
>
> On Wed, Jan 31, 2018 at 2:04 PM, Ismaël Mejía  wrote:
>
>> What is the common procedure in cases like this ? Because it doesn't
>> seems that it needs a re-vote, just an extra day or two for
>> validation, any ideas JB ?
>>
>> On Wed, Jan 31, 2018 at 10:41 PM, Alan Myrvold 
>> wrote:
>> > Yes, it is a dataflow step. Happy to test this again when they are
>> > available.
>> >
>> > On Wed, Jan 31, 2018 at 1:39 PM, Jean-Baptiste Onofré 
>> > wrote:
>> >>
>> >> OK, I think I understood ;)
>> >>
>> >> So it's not "directly" related to Beam itself (it's more a Dataflow
>> step
>> >> to perform).
>> >>
>> >> @Alan, I think it's better to test first and then cast the vote. This
>> kind
>> >> of tests are valuable to validate the release and make sense. But vote
>> >> should represent the state of the Beam release. So I think -1 vote is
>> a bit
>> >> too early before the test.
>> >>
>> >> Thanks !
>> >> Regards
>> >> JB
>> >>
>> >> On 31/01/2018 22:33, Reuven Lax wrote:
>> >>>
>> >>> It's just a step that needs to be peformed before the new release
>> works
>> >>> on Dataflow. Alan is saying that we've been unable to validate
>> Dataflow so
>> >>> far, as worker images are not yet built. Hopefully they'll be built
>> soon,
>> >>> and we'll be able to validate.
>> >>>
>> >>> On Wed, Jan 31, 2018 at 1:31 PM, Jean-Baptiste Onofré <
>> j...@nanthrax.net
>> >>> > wrote:
>> >>>
>> >>> Hi Alan
>> >>>
>> >>> does it related to change in the codebase or in a
>> dependency/related
>> >>> project ?
>> >>>
>> >>> I mean: is it something we have to fix/change in Beam ?
>> >>>
>> >>> Just curious as I'm not sure what you mean by "worker images" ;)
>> >>>
>> >>> Thanks !
>> >>> Regards
>> >>> JB
>> >>>
>> >>> On 31/01/2018 22:18, Alan Myrvold wrote:
>> >>>
>> >>> -1 (for now, hope to change this)
>> >>>
>> >>> Dataflow runner jobs are failing for me with 2.3.0 RC1, for
>> both
>> >>> Java and Python.
>> >>>
>> >>> This is not an issues with the 2.3.0 RC1 SDK, we (google) need
>> >>> to release worker images.
>> >>>
>> >>> I have assigned these bugs to myself, and will be working to
>> >>> help get these workers released.
>> >>>
>> >>> [BEAM-3584] Java dataflow job fails with 2.3.0 RC1, due to
>> >>> missing worker image
>> >>> [BEAM-3585] Python dataflow job fails with 2.3.0 RC1, due to
>> >>> missing worker image
>> >>>
>> >>> On Wed, Jan 31, 2018 at 6:12 AM, Jean-Baptiste Onofré
>> >>> mailto:j...@nanthrax.net>
>> >>> >> wrote:
>> >>>
>> >>>  Thanks Kenn,
>> >>>
>> >>>  I prepared the list of tasks I did. I will complete where
>> >>> release is
>> >>>  out.
>> >>>
>> >>>  Regards
>> >>>  JB
>> >>>
>> >>>  On 01/31/2018 03:07 PM, Kenneth Knowles wrote:
>> >>>  > I've cloned the release validation spreadsheet:
>> >>>  >
>> >>>  > https://s.apache.org/beam-2.3.0-release-validation
>> >>> 
>> >>>  > >>> >
>> >>>  >
>> >>>  > If you plan to perform a manual validation task, please
>> >>> sign up so multiple
>> >>>  > people don't waste effort.
>> >>>  >
>> >>>  > Alan & JB, as far as your pairing up to automate more,
>> >>> anything manual on this
>> >>>  > sheet should be considered.
>> >>>  >
>> >>>  > Kenn
>> >>>  >
>> >>>  > On Wed, Jan 31, 2018 at 5:59 AM, Jean-Baptiste Onofré
>> >>> mailto:j...@nanthrax.net>
>> >>> >
>> >>>  > 
>> >>> > >>>  >
>> >>>  > +1 (binding)
>> >>>  >
>> >>>  > Casting my own +1 ;)
>> >>>  >
>> >>>  > Regards
>> >>>  > JB
>> >>>  >
>> >>>  > On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
>> >>>  > > Hi everyone,
>> >>>  > >
>> >>>  > > Please review and vote on the release candidate

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Ahmet Altay
On Wed, Jan 31, 2018 at 2:09 PM, Ahmet Altay  wrote:

> This will require a change in the Beam code, because image names are
> hardcoded in to code (python) and configuration (java). RC1 as it is will
> not work correctly with Cloud Dataflow.
>

Please ignore this. Looking at the issues Alan created, this does not seem
to be correct. RC1 should work as expected once the Google-related step is
completed.


>
> On Wed, Jan 31, 2018 at 2:08 PM, Reuven Lax  wrote:
>
>> Hopefully we can validate soon. I believe some of the delays are because
>> of integrating major changes done over the last week (e.g. Java 8
>> migration).
>>
>> On Wed, Jan 31, 2018 at 2:04 PM, Ismaël Mejía  wrote:
>>
>>> What is the common procedure in cases like this ? Because it doesn't
>>> seems that it needs a re-vote, just an extra day or two for
>>> validation, any ideas JB ?
>>>
>>> On Wed, Jan 31, 2018 at 10:41 PM, Alan Myrvold 
>>> wrote:
>>> > Yes, it is a dataflow step. Happy to test this again when they are
>>> > available.
>>> >
>>> > On Wed, Jan 31, 2018 at 1:39 PM, Jean-Baptiste Onofré >> >
>>> > wrote:
>>> >>
>>> >> OK, I think I understood ;)
>>> >>
>>> >> So it's not "directly" related to Beam itself (it's more a Dataflow
>>> step
>>> >> to perform).
>>> >>
>>> >> @Alan, I think it's better to test first and then cast the vote. This
>>> kind
>>> >> of tests are valuable to validate the release and make sense. But vote
>>> >> should represent the state of the Beam release. So I think -1 vote is
>>> a bit
>>> >> too early before the test.
>>> >>
>>> >> Thanks !
>>> >> Regards
>>> >> JB
>>> >>
>>> >> On 31/01/2018 22:33, Reuven Lax wrote:
>>> >>>
>>> >>> It's just a step that needs to be peformed before the new release
>>> works
>>> >>> on Dataflow. Alan is saying that we've been unable to validate
>>> Dataflow so
>>> >>> far, as worker images are not yet built. Hopefully they'll be built
>>> soon,
>>> >>> and we'll be able to validate.
>>> >>>
>>> >>> On Wed, Jan 31, 2018 at 1:31 PM, Jean-Baptiste Onofré <
>>> j...@nanthrax.net
>>> >>> > wrote:
>>> >>>
>>> >>> Hi Alan
>>> >>>
>>> >>> does it related to change in the codebase or in a
>>> dependency/related
>>> >>> project ?
>>> >>>
>>> >>> I mean: is it something we have to fix/change in Beam ?
>>> >>>
>>> >>> Just curious as I'm not sure what you mean by "worker images" ;)
>>> >>>
>>> >>> Thanks !
>>> >>> Regards
>>> >>> JB
>>> >>>
>>> >>> On 31/01/2018 22:18, Alan Myrvold wrote:
>>> >>>
>>> >>> -1 (for now, hope to change this)
>>> >>>
>>> >>> Dataflow runner jobs are failing for me with 2.3.0 RC1, for
>>> both
>>> >>> Java and Python.
>>> >>>
>>> >>> This is not an issues with the 2.3.0 RC1 SDK, we (google)
>>> need
>>> >>> to release worker images.
>>> >>>
>>> >>> I have assigned these bugs to myself, and will be working to
>>> >>> help get these workers released.
>>> >>>
>>> >>> [BEAM-3584] Java dataflow job fails with 2.3.0 RC1, due to
>>> >>> missing worker image
>>> >>> [BEAM-3585] Python dataflow job fails with 2.3.0 RC1, due to
>>> >>> missing worker image
>>> >>>
>>> >>> On Wed, Jan 31, 2018 at 6:12 AM, Jean-Baptiste Onofré
>>> >>> mailto:j...@nanthrax.net>
>>> >>> >> wrote:
>>> >>>
>>> >>>  Thanks Kenn,
>>> >>>
>>> >>>  I prepared the list of tasks I did. I will complete
>>> where
>>> >>> release is
>>> >>>  out.
>>> >>>
>>> >>>  Regards
>>> >>>  JB
>>> >>>
>>> >>>  On 01/31/2018 03:07 PM, Kenneth Knowles wrote:
>>> >>>  > I've cloned the release validation spreadsheet:
>>> >>>  >
>>> >>>  > https://s.apache.org/beam-2.3.0-release-validation
>>> >>> 
>>> >>>  >> >>> >
>>> >>>  >
>>> >>>  > If you plan to perform a manual validation task,
>>> please
>>> >>> sign up so multiple
>>> >>>  > people don't waste effort.
>>> >>>  >
>>> >>>  > Alan & JB, as far as your pairing up to automate more,
>>> >>> anything manual on this
>>> >>>  > sheet should be considered.
>>> >>>  >
>>> >>>  > Kenn
>>> >>>  >
>>> >>>  > On Wed, Jan 31, 2018 at 5:59 AM, Jean-Baptiste Onofré
>>> >>> mailto:j...@nanthrax.net>
>>> >>> >
>>> >>>  > 
>>> >>> >> >>>  >
>>> >>>  > +1 (binding)
>>> >>>  >
>>> >>>

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Udi Meiri
I agree that overwriting is more in line with user expectations.
I believe that the sink should not ignore errors from the filesystem layer.
Instead, the FileSystem API should be better defined.
Examples: rename() and copy() should overwrite existing files at the
destination; copy() should have an ignore_missing flag.
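
A minimal sketch of what such a tightened contract could look like; the interface, option names, and the use of Java here are assumptions for illustration, not the actual Beam FileSystem API.

import java.io.IOException;
import java.io.Serializable;
import java.util.List;

// Illustration only: a filesystem contract with the semantics proposed above,
// i.e. rename/copy always overwrite existing destinations and copy can be told
// to ignore missing sources so that retries stay idempotent.
public interface WellDefinedFileSystem {

  /** Copies each srcPaths.get(i) to dstPaths.get(i), overwriting existing destinations. */
  void copy(List<String> srcPaths, List<String> dstPaths, CopyOptions options)
      throws IOException;

  /** Renames each srcPaths.get(i) to dstPaths.get(i), overwriting existing destinations. */
  void rename(List<String> srcPaths, List<String> dstPaths) throws IOException;

  /** Options for copy(); ignoreMissingSources corresponds to the ignore_missing flag above. */
  final class CopyOptions implements Serializable {
    public final boolean ignoreMissingSources;

    public CopyOptions(boolean ignoreMissingSources) {
      this.ignoreMissingSources = ignoreMissingSources;
    }
  }
}
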

On Wed, Jan 31, 2018 at 1:49 PM Raghu Angadi  wrote:

> Original mail mentions that output from second run of word_count is
> ignored. That does not seem as safe as ignoring error from a second attempt
> of a step. How do we know second run didn't run on different output?
> Overwriting seems more accurate than ignoring. Does handling this error at
> sink level distinguish between the two (another run vs second attempt)?
>
>
> On Wed, Jan 31, 2018 at 12:32 PM, Udi Meiri  wrote:
>
>> Yeah, another round of refactoring is due to move the rename via
>> copy+delete logic up to the file-based sink level.
>>
>> On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath 
>> wrote:
>>
>>> Good point. There's always the chance of step that performs final rename
>>> being retried. So we'll have to ignore this error at the sink level. We
>>> don't necessarily have to do this at the FileSystem level though. I think
>>> the proper behavior might be to raise an error for the rename at the
>>> FileSystem level if the destination already exists (or source doesn't
>>> exist) while ignoring that error (and possibly logging a warning) at the
>>> sink level.
>>>
>>> - Cham
>>>
>>>
>>> On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax  wrote:
>>>
 I think the idea was to ignore "already exists" errors. The reason
 being that any step in Beam can be executed multiple times, including the
 rename step. If the rename step gets run twice, the second run should
 succeed vacuously.


 On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri  wrote:

> Hi,
> I've been working on HDFS code for the Python SDK and I've noticed
> some behaviors which are surprising. I wanted to know if these behaviors
> are known and intended.
>
> 1. When renaming files during finalize_write, rename errors are
> ignored
> .
> For example, if I run wordcount twice using HDFS code I get a warning the
> second time because the file already exists:
>
> WARNING:root:Rename not successful:
> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
> -> hdfs://counts2-0-of-1, libhdfs error in renaming
> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
> to hdfs://counts2-0-of-1 with exceptions Unable to rename
> '/beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2'
> to '/counts2-0-of-1'.
>
> For GCS and local files there are no rename errors (in this case),
> since the rename operation silently overwrites existing destination files.
> However, blindly ignoring these errors might make the pipeline to report
> success even though output files are missing.
>
> 2. Output files (--ouput) overwrite existing files.
>
> 3. The Python SDK doesn't use Filesystems.copy(). The Java SDK doesn't
> use Filesystem.Rename().
>
> Thanks,
> - Udi
>


>




Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Eugene Kirpichov
As far as I know, the current implementation of file sinks is the only
reason why the flag IGNORE_MISSING for copying even exists - there's no
other compelling reason to justify it. We implement "rename" as "copy, then
delete" (in a single DoFn), so for idempodency of this operation we need to
ignore the copying of a non-existent file.

I think the right way to go would be to change the implementation of
renaming to have a @RequiresStableInput (or reshuffle) in the middle, so
it's made of 2 individually idempotent operations:
1) copy, which would fail if input is missing, and would overwrite output
if it exists
-- reshuffle --
2) delete, which would not fail if input is missing.

That way first everything is copied (possibly via multiple attempts), and
then old files are deleted (possibly via multiple attempts).
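
A rough Beam Java sketch of that two-stage shape. The FileSystems calls are the existing Java API, but the transform itself is only an illustration of the idea (the real file sink finalization is more involved), and using plain string paths for the source/destination pairs is an assumption made to keep the example self-contained.

import java.util.Collections;

import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MoveOptions.StandardMoveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Illustrative sketch of "copy, reshuffle, delete": each stage is individually
// idempotent, and the Reshuffle acts as the stable-input barrier between them.
public class CopyThenDelete
    extends PTransform<PCollection<KV<String, String>>, PCollection<String>> {

  @Override
  public PCollection<String> expand(PCollection<KV<String, String>> srcToDst) {
    return srcToDst
        .apply("CopyToFinalLocation", ParDo.of(new DoFn<KV<String, String>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) throws Exception {
            ResourceId src = FileSystems.matchNewResource(c.element().getKey(), false);
            ResourceId dst = FileSystems.matchNewResource(c.element().getValue(), false);
            // Stage 1: copy. Fails if the source is missing; overwrite-on-retry
            // semantics for the destination are exactly what the thread proposes to pin down.
            FileSystems.copy(
                Collections.singletonList(src), Collections.singletonList(dst));
            c.output(c.element().getKey());
          }
        }))
        // Barrier: all copies are checkpointed before any delete runs.
        .apply(Reshuffle.viaRandomKey())
        .apply("DeleteTempFiles", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) throws Exception {
            ResourceId src = FileSystems.matchNewResource(c.element(), false);
            // Stage 2: delete. Ignoring missing files makes a second attempt
            // succeed vacuously.
            FileSystems.delete(
                Collections.singletonList(src), StandardMoveOptions.IGNORE_MISSING_FILES);
            c.output(c.element());
          }
        }));
  }
}
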

On Wed, Jan 31, 2018 at 2:26 PM Udi Meiri  wrote:

> I agree that overwriting is more in line with user expectations.
> I believe that the sink should not ignore errors from the filesystem
> layer. Instead, the FileSystem API should be more well defined.
> Examples: rename() and copy() should overwrite existing files at the
> destination, copy() should have an ignore_missing flag.
>
> On Wed, Jan 31, 2018 at 1:49 PM Raghu Angadi  wrote:
>
>> Original mail mentions that output from second run of word_count is
>> ignored. That does not seem as safe as ignoring error from a second attempt
>> of a step. How do we know second run didn't run on different output?
>> Overwriting seems more accurate than ignoring. Does handling this error at
>> sink level distinguish between the two (another run vs second attempt)?
>>
>>
>> On Wed, Jan 31, 2018 at 12:32 PM, Udi Meiri  wrote:
>>
>>> Yeah, another round of refactoring is due to move the rename via
>>> copy+delete logic up to the file-based sink level.
>>>
>>> On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath 
>>> wrote:
>>>
 Good point. There's always the chance of step that performs final
 rename being retried. So we'll have to ignore this error at the sink level.
 We don't necessarily have to do this at the FileSystem level though. I
 think the proper behavior might be to raise an error for the rename at the
 FileSystem level if the destination already exists (or source doesn't
 exist) while ignoring that error (and possibly logging a warning) at the
 sink level.

 - Cham


 On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax  wrote:

> I think the idea was to ignore "already exists" errors. The reason
> being that any step in Beam can be executed multiple times, including the
> rename step. If the rename step gets run twice, the second run should
> succeed vacuously.
>
>
> On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri  wrote:
>
>> Hi,
>> I've been working on HDFS code for the Python SDK and I've noticed
>> some behaviors which are surprising. I wanted to know if these behaviors
>> are known and intended.
>>
>> 1. When renaming files during finalize_write, rename errors are
>> ignored
>> .
>> For example, if I run wordcount twice using HDFS code I get a warning the
>> second time because the file already exists:
>>
>> WARNING:root:Rename not successful:
>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>> -> hdfs://counts2-0-of-1, libhdfs error in renaming
>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>> to hdfs://counts2-0-of-1 with exceptions Unable to rename
>> '/beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2'
>> to '/counts2-0-of-1'.
>>
>> For GCS and local files there are no rename errors (in this case),
>> since the rename operation silently overwrites existing destination 
>> files.
>> However, blindly ignoring these errors might make the pipeline report
>> success even though output files are missing.
>>
>> 2. Output files (--output) overwrite existing files.
>>
>> 3. The Python SDK doesn't use Filesystems.copy(). The Java SDK
>> doesn't use Filesystem.Rename().
>>
>> Thanks,
>> - Udi
>>
>
>
>>


Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Robert Bradshaw
For very large filesets, it may be too much to assume that the copy
succeeds in its entirety on the first try. (I suppose we could chunk
copies into individual retryable bundles, but this may not respect the
filesystem's default chunking/bulk operations.) The other downside of
copying entirely before any deletion is that unless the filesystem is
smart about copies, it may double the required intermediate storage
size (vs. deleting once a particular shard has been copied). Also,
some filesystems may support rename (even bulk rename) that's cheaper
than copy + delete. For these reasons I think an (optionally
retry-tolerant) bulk rename makes sense as an operation on the
filesystem API rather than implemented as a composite operation built
on lower-level filesystem primitives.
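
To make that concrete, a retry-tolerant bulk rename on the FileSystem
abstraction might have roughly the shape sketched below; the interface and
option names are invented for illustration and are not an existing Beam API:

import java.io.IOException;
import java.util.List;

import org.apache.beam.sdk.io.fs.ResourceId;

interface BulkRenameCapableFileSystem {
  enum RenameOption {
    // Overwrite an existing destination instead of failing.
    OVERWRITE_EXISTING,
    // Treat "source missing but destination present" as an already-applied rename.
    TOLERATE_ALREADY_RENAMED
  }

  // Renames each source to the corresponding destination in one call, so a
  // filesystem with a native (possibly atomic or bulk) rename can use it
  // directly, and only filesystems without one fall back to copy + delete.
  void rename(List<ResourceId> sources, List<ResourceId> destinations,
      RenameOption... options) throws IOException;
}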

On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov  wrote:
> As far as I know, the current implementation of file sinks is the only
> reason why the flag IGNORE_MISSING for copying even exists - there's no
> other compelling reason to justify it. We implement "rename" as "copy, then
> delete" (in a single DoFn), so for idempodency of this operation we need to
> ignore the copying of a non-existent file.
>
> I think the right way to go would be to change the implementation of
> renaming to have a @RequiresStableInput (or reshuffle) in the middle, so
> it's made of 2 individually idempotent operations:
> 1) copy, which would fail if input is missing, and would overwrite output if
> it exists
> -- reshuffle --
> 2) delete, which would not fail if input is missing.
>
> That way first everything is copied (possibly via multiple attempts), and
> then old files are deleted (possibly via multiple attempts).
>
> On Wed, Jan 31, 2018 at 2:26 PM Udi Meiri  wrote:
>>
>> I agree that overwriting is more in line with user expectations.
>> I believe that the sink should not ignore errors from the filesystem
>> layer. Instead, the FileSystem API should be more well defined.
>> Examples: rename() and copy() should overwrite existing files at the
>> destination, copy() should have an ignore_missing flag.
>>
>> On Wed, Jan 31, 2018 at 1:49 PM Raghu Angadi  wrote:
>>>
>>> Original mail mentions that output from second run of word_count is
>>> ignored. That does not seem as safe as ignoring error from a second attempt
>>> of a step. How do we know second run didn't run on different output?
>>> Overwriting seems more accurate than ignoring. Does handling this error at
>>> sink level distinguish between the two (another run vs second attempt)?
>>>
>>>
>>> On Wed, Jan 31, 2018 at 12:32 PM, Udi Meiri  wrote:

 Yeah, another round of refactoring is due to move the rename via
 copy+delete logic up to the file-based sink level.


 On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath 
 wrote:
>
> Good point. There's always the chance of step that performs final
> rename being retried. So we'll have to ignore this error at the sink 
> level.
> We don't necessarily have to do this at the FileSystem level though. I 
> think
> the proper behavior might be to raise an error for the rename at the
> FileSystem level if the destination already exists (or source doesn't 
> exist)
> while ignoring that error (and possibly logging a warning) at the sink
> level.
>
> - Cham
>
>
> On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax  wrote:
>>
>> I think the idea was to ignore "already exists" errors. The reason
>> being that any step in Beam can be executed multiple times, including the
>> rename step. If the rename step gets run twice, the second run should
>> succeed vacuously.
>>
>>
>> On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri  wrote:
>>>
>>> Hi,
>>> I've been working on HDFS code for the Python SDK and I've noticed
>>> some behaviors which are surprising. I wanted to know if these 
>>> behaviors are
>>> known and intended.
>>>
>>> 1. When renaming files during finalize_write, rename errors are
>>> ignored. For example, if I run wordcount twice using HDFS code I get a
>>> warning the second time because the file already exists:
>>>
>>> WARNING:root:Rename not successful:
>>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>>> -> hdfs://counts2-0-of-1, libhdfs error in renaming
>>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>>> to hdfs://counts2-0-of-1 with exceptions Unable to rename
>>> '/beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2'
>>> to '/counts2-0-of-1'.
>>>
>>> For GCS and local files there are no rename errors (in this case),
>>> since the rename operation silently overwrites existing destination 
>>> files.
>>> However, blindly ignoring these errors might make the pipeline report

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Konstantinos Katsiapis
+1 (non-binding). tensorflow.transform 0.5.0 is blocked on Apache Beam 2.3.


On Wed, Jan 31, 2018 at 5:59 AM, Jean-Baptiste Onofré 
wrote:

> +1 (binding)
>
> Casting my own +1 ;)
>
> Regards
> JB
>
> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
> > Hi everyone,
> >
> > Please review and vote on the release candidate #1 for the version
> 2.3.0, as
> > follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to dist.apache.org
> [2],
> > which is signed with the key with fingerprint C8282E76 [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "v2.3.0-RC1" [5],
> > * website pull request listing the release and publishing the API
> reference
> > manual [6].
> > * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
> > * Python artifacts are deployed along with the source release to the
> > dist.apache.org [2].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> approval,
> > with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > JB
> >
> > [1]
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> projectId=12319527&version=12341608
> > [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > [4] https://repository.apache.org/content/repositories/
> orgapachebeam-1026/
> > [5] https://github.com/apache/beam/tree/v2.3.0-RC1
> > [6] https://github.com/apache/beam-site/pull/381
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



-- 
Gus Katsiapis | Software Engineer | katsia...@google.com | 650-918-7487


Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Chamikara Jayalath
Agree with what Robert said. We have a rename() operation in the FileSystem
abstraction and some file-systems might be able to implement this more
efficiently than copy+delete. Also note that the same issue could arise in
any other usage of the rename operation. So I agree that a retry-tolerant
version of rename will be useful. Note that we can do this without making
all FileSystem.rename() implementations unsafe. For example, in Java,
the IGNORE_MISSING_FILES option is implemented by filtering out non-existing
files in FileSystems.rename() before invoking FileSystem.rename().

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L316
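
In simplified form (helper names invented here, and condensed from the linked
implementation), that filtering layer amounts to something like:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.ResourceId;

class IgnoreMissingRenameSketch {
  static void renameIgnoringMissingSources(List<ResourceId> srcs, List<ResourceId> dsts)
      throws IOException {
    List<ResourceId> keptSrcs = new ArrayList<>();
    List<ResourceId> keptDsts = new ArrayList<>();
    List<MatchResult> matches = FileSystems.matchResources(srcs);
    for (int i = 0; i < srcs.size(); i++) {
      if (matches.get(i).status() == MatchResult.Status.OK) {
        keptSrcs.add(srcs.get(i));
        keptDsts.add(dsts.get(i));
      }
      // else: the source is gone; assume an earlier attempt already moved it.
    }
    // The per-filesystem rename itself stays strict.
    FileSystems.rename(keptSrcs, keptDsts);
  }
}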

- Cham

On Wed, Jan 31, 2018 at 3:14 PM Robert Bradshaw  wrote:

> For very large filesets, it may be too much to assume that the copy
> succeed in its entirety on the first try. (I suppose we could chunk
> copies into individual retryable bundles, but this may not respect the
> filesystem's default chunking/bulk operations.) The other downside of
> copying entirely before any deletion is that unless the filesystem is
> smart about copies, it may double the required intermediate storage
> size (v.s. deleting once a particular shard has been copied). Also,
> some filesystems may support rename (even bulk rename) that's cheaper
> than copy + delete. For these reasons I think a (optionally
> retry-tolerant) bulk rename makes sense as an operation on the
> filesystem API rather than implemented as a composite operation built
> on lower-level filesystem primitives.
>
> On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov 
> wrote:
> > As far as I know, the current implementation of file sinks is the only
> > reason why the flag IGNORE_MISSING for copying even exists - there's no
> > other compelling reason to justify it. We implement "rename" as "copy,
> then
> > delete" (in a single DoFn), so for idempodency of this operation we need
> to
> > ignore the copying of a non-existent file.
> >
> > I think the right way to go would be to change the implementation of
> > renaming to have a @RequiresStableInput (or reshuffle) in the middle, so
> > it's made of 2 individually idempotent operations:
> > 1) copy, which would fail if input is missing, and would overwrite
> output if
> > it exists
> > -- reshuffle --
> > 2) delete, which would not fail if input is missing.
> >
> > That way first everything is copied (possibly via multiple attempts), and
> > then old files are deleted (possibly via multiple attempts).
> >
> > On Wed, Jan 31, 2018 at 2:26 PM Udi Meiri  wrote:
> >>
> >> I agree that overwriting is more in line with user expectations.
> >> I believe that the sink should not ignore errors from the filesystem
> >> layer. Instead, the FileSystem API should be more well defined.
> >> Examples: rename() and copy() should overwrite existing files at the
> >> destination, copy() should have an ignore_missing flag.
> >>
> >> On Wed, Jan 31, 2018 at 1:49 PM Raghu Angadi 
> wrote:
> >>>
> >>> Original mail mentions that output from second run of word_count is
> >>> ignored. That does not seem as safe as ignoring error from a second
> attempt
> >>> of a step. How do we know second run didn't run on different output?
> >>> Overwriting seems more accurate than ignoring. Does handling this
> error at
> >>> sink level distinguish between the two (another run vs second attempt)?
> >>>
> >>>
> >>> On Wed, Jan 31, 2018 at 12:32 PM, Udi Meiri  wrote:
> 
>  Yeah, another round of refactoring is due to move the rename via
>  copy+delete logic up to the file-based sink level.
> 
> 
>  On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath 
>  wrote:
> >
> > Good point. There's always the chance of step that performs final
> > rename being retried. So we'll have to ignore this error at the sink
> level.
> > We don't necessarily have to do this at the FileSystem level though.
> I think
> > the proper behavior might be to raise an error for the rename at the
> > FileSystem level if the destination already exists (or source
> doesn't exist)
> > while ignoring that error (and possibly logging a warning) at the
> sink
> > level.
> >
> > - Cham
> >
> >
> > On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax  wrote:
> >>
> >> I think the idea was to ignore "already exists" errors. The reason
> >> being that any step in Beam can be executed multiple times,
> including the
> >> rename step. If the rename step gets run twice, the second run
> should
> >> succeed vacuously.
> >>
> >>
> >> On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri 
> wrote:
> >>>
> >>> Hi,
> >>> I've been working on HDFS code for the Python SDK and I've noticed
> >>> some behaviors which are surprising. I wanted to know if these
> behaviors are
> >>> known and intended.
> >>>
> >>> 1. When renaming files during finalize_write, rename errors are
> >>> ignored. For example, 

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Eugene Kirpichov
I agree that using an atomic rename operation is even better. I'm mainly
opposed to having a copy option that ignores missing files, and to our
implementation of rename using that option, because it's unsafe.
Unfortunately GCS doesn't have an atomic rename, so I'm not sure what's the
best way to go for GCS without introducing the unsafe operations.

On Wed, Jan 31, 2018, 3:39 PM Chamikara Jayalath 
wrote:

> Agree with what Robert said. We have a rename() operation in the
> FileSystem abstraction and some file-systems might be able to implement
> this more efficiently than copy+delete. Also note that the same issue could
> arise in any other usage of rename operation. So I agree that a
> retry-tolerant version of rename will be useful. Note that we can do this
> without making all FileSystem.rename() implementations unsafe. For example,
> in Java, IGNORE_MISSING_FILES options is implemented by filtering out
> non-existing files in FileSystems.rename() before invoking
> FileSystem.rename().
>
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L316
>
> - Cham
>
>
> On Wed, Jan 31, 2018 at 3:14 PM Robert Bradshaw 
> wrote:
>
>> For very large filesets, it may be too much to assume that the copy
>> succeed in its entirety on the first try. (I suppose we could chunk
>> copies into individual retryable bundles, but this may not respect the
>> filesystem's default chunking/bulk operations.) The other downside of
>> copying entirely before any deletion is that unless the filesystem is
>> smart about copies, it may double the required intermediate storage
>> size (v.s. deleting once a particular shard has been copied). Also,
>> some filesystems may support rename (even bulk rename) that's cheaper
>> than copy + delete. For these reasons I think a (optionally
>> retry-tolerant) bulk rename makes sense as an operation on the
>> filesystem API rather than implemented as a composite operation built
>> on lower-level filesystem primitives.
>>
>> On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov 
>> wrote:
>> > As far as I know, the current implementation of file sinks is the only
>> > reason why the flag IGNORE_MISSING for copying even exists - there's no
>> > other compelling reason to justify it. We implement "rename" as "copy,
>> then
>> > delete" (in a single DoFn), so for idempodency of this operation we
>> need to
>> > ignore the copying of a non-existent file.
>> >
>> > I think the right way to go would be to change the implementation of
>> > renaming to have a @RequiresStableInput (or reshuffle) in the middle, so
>> > it's made of 2 individually idempotent operations:
>> > 1) copy, which would fail if input is missing, and would overwrite
>> output if
>> > it exists
>> > -- reshuffle --
>> > 2) delete, which would not fail if input is missing.
>> >
>> > That way first everything is copied (possibly via multiple attempts),
>> and
>> > then old files are deleted (possibly via multiple attempts).
>> >
>> > On Wed, Jan 31, 2018 at 2:26 PM Udi Meiri  wrote:
>> >>
>> >> I agree that overwriting is more in line with user expectations.
>> >> I believe that the sink should not ignore errors from the filesystem
>> >> layer. Instead, the FileSystem API should be more well defined.
>> >> Examples: rename() and copy() should overwrite existing files at the
>> >> destination, copy() should have an ignore_missing flag.
>> >>
>> >> On Wed, Jan 31, 2018 at 1:49 PM Raghu Angadi 
>> wrote:
>> >>>
>> >>> Original mail mentions that output from second run of word_count is
>> >>> ignored. That does not seem as safe as ignoring error from a second
>> attempt
>> >>> of a step. How do we know second run didn't run on different output?
>> >>> Overwriting seems more accurate than ignoring. Does handling this
>> error at
>> >>> sink level distinguish between the two (another run vs second
>> attempt)?
>> >>>
>> >>>
>> >>> On Wed, Jan 31, 2018 at 12:32 PM, Udi Meiri  wrote:
>> 
>>  Yeah, another round of refactoring is due to move the rename via
>>  copy+delete logic up to the file-based sink level.
>> 
>> 
>>  On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath > >
>>  wrote:
>> >
>> > Good point. There's always the chance of step that performs final
>> > rename being retried. So we'll have to ignore this error at the
>> sink level.
>> > We don't necessarily have to do this at the FileSystem level
>> though. I think
>> > the proper behavior might be to raise an error for the rename at the
>> > FileSystem level if the destination already exists (or source
>> doesn't exist)
>> > while ignoring that error (and possibly logging a warning) at the
>> sink
>> > level.
>> >
>> > - Cham
>> >
>> >
>> > On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax 
>> wrote:
>> >>
>> >> I think the idea was to ignore "already exists" errors. The reason
>> >> being that any step in Beam can be executed multiple times,
>>

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Raghu Angadi
On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov 
wrote:

> As far as I know, the current implementation of file sinks is the only
> reason why the flag IGNORE_MISSING for copying even exists - there's no
> other compelling reason to justify it. We implement "rename" as "copy, then
> delete" (in a single DoFn), so for idempodency of this operation we need to
> ignore the copying of a non-existent file.
>
> I think the right way to go would be to change the implementation of
> renaming to have a @RequiresStableInput (or reshuffle) in the middle, so
> it's made of 2 individually idempotent operations:
> 1) copy, which would fail if input is missing, and would overwrite output
> if it exists
> -- reshuffle --
> 2) delete, which would not fail if input is missing.
>

Something like this is needed only in streaming, right?

Raghu.


> That way first everything is copied (possibly via multiple attempts), and
> then old files are deleted (possibly via multiple attempts).
>
> On Wed, Jan 31, 2018 at 2:26 PM Udi Meiri  wrote:
>
>> I agree that overwriting is more in line with user expectations.
>> I believe that the sink should not ignore errors from the filesystem
>> layer. Instead, the FileSystem API should be more well defined.
>> Examples: rename() and copy() should overwrite existing files at the
>> destination, copy() should have an ignore_missing flag.
>>
>> On Wed, Jan 31, 2018 at 1:49 PM Raghu Angadi  wrote:
>>
>>> Original mail mentions that output from second run of word_count is
>>> ignored. That does not seem as safe as ignoring error from a second attempt
>>> of a step. How do we know second run didn't run on different output?
>>> Overwriting seems more accurate than ignoring. Does handling this error at
>>> sink level distinguish between the two (another run vs second attempt)?
>>>
>>>
>>> On Wed, Jan 31, 2018 at 12:32 PM, Udi Meiri  wrote:
>>>
 Yeah, another round of refactoring is due to move the rename via
 copy+delete logic up to the file-based sink level.

 On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath 
 wrote:

> Good point. There's always the chance of step that performs final
> rename being retried. So we'll have to ignore this error at the sink 
> level.
> We don't necessarily have to do this at the FileSystem level though. I
> think the proper behavior might be to raise an error for the rename at the
> FileSystem level if the destination already exists (or source doesn't
> exist) while ignoring that error (and possibly logging a warning) at the
> sink level.
>
> - Cham
>
>
> On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax  wrote:
>
>> I think the idea was to ignore "already exists" errors. The reason
>> being that any step in Beam can be executed multiple times, including the
>> rename step. If the rename step gets run twice, the second run should
>> succeed vacuously.
>>
>>
>> On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri  wrote:
>>
>>> Hi,
>>> I've been working on HDFS code for the Python SDK and I've noticed
>>> some behaviors which are surprising. I wanted to know if these behaviors
>>> are known and intended.
>>>
>>> 1. When renaming files during finalize_write, rename errors are
>>> ignored
>>> .
>>> For example, if I run wordcount twice using HDFS code I get a warning 
>>> the
>>> second time because the file already exists:
>>>
>>> WARNING:root:Rename not successful: hdfs://beam-temp-counts2-
>>> 7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>>> -> hdfs://counts2-0-of-1, libhdfs error in renaming
>>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da2
>>> 45/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2 to
>>> hdfs://counts2-0-of-1 with exceptions Unable to rename
>>> '/beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da2
>>> 45/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2' to
>>> '/counts2-0-of-1'.
>>>
>>> For GCS and local files there are no rename errors (in this case),
>>> since the rename operation silently overwrites existing destination 
>>> files.
>>> However, blindly ignoring these errors might make the pipeline report
>>> success even though output files are missing.
>>>
>>> 2. Output files (--output) overwrite existing files.
>>>
>>> 3. The Python SDK doesn't use Filesystems.copy(). The Java SDK
>>> doesn't use Filesystem.Rename().
>>>
>>> Thanks,
>>> - Udi
>>>
>>
>>
>>>


Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-31 Thread Lukasz Cwik
Yes, people will write bad coders which is why this is there. No, I don't
think swallowing one close is what we want.

In the case where you want to pass an input/output stream to a library
that incorrectly assumes ownership, the input/output stream should be
wrapped right before the call to the library with a filter input/output
stream that swallows the close, rather than ignoring close() globally and
propagating this bad behavior elsewhere.
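
A minimal example of such a wrapper (the class name is made up; the JAXBCoder
filter stream linked later in this thread follows the same pattern):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class CloseIgnoringInputStream extends FilterInputStream {
  CloseIgnoringInputStream(InputStream delegate) {
    super(delegate);
  }

  @Override
  public void close() throws IOException {
    // Deliberately a no-op: the caller still owns and closes the delegate.
  }
}

// Usage, right at the call into the badly behaved library:
//   library.parse(new CloseIgnoringInputStream(in));
// `in` remains open for the next caller.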

On Wed, Jan 31, 2018 at 12:04 PM, Romain Manni-Bucau 
wrote:

> Hmm, here we are the ones owning the call since it is in a coder, no? Do
> we assume people will badly implement coders? In this particular case we
> can assume close() will be called by a framework I think.
> What about swallowing one close() and fail on the second?
>
>
> Romain Manni-Bucau
> @rmannibucau | Blog | Old Blog | Github | LinkedIn
> 
>
> 2018-01-31 20:59 GMT+01:00 Lukasz Cwik :
>
>> Because people write code like:
>> myMethod(InputStream in) {
>>   InputStream child = new InputStream(in);
>>   child.close();
>> }
>>
>> InputStream in = new FileInputStream(... path ...);
>> myMethod(in);
>> myMethod(in);
>>
>> An exception will be thrown when the second myMethod call occurs.
>>
>> Unfortunately not everyone wraps their calls to a coder with an
>> UnownedInputStream or a filter input stream that drops close() calls, which
>> is why it's a problem; in the few places it is done, it is used to prevent
>> bugs from creeping in.
>>
>>
>>
>> On Tue, Jan 30, 2018 at 11:29 AM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> I get the issue but I don't get the last part. Concretely we can support
>>> any lib by just removing the exception in the close, no? What would be the
>>> issue? No additional wrapper, no lib integration issue.
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau | Blog | Old Blog | Github | LinkedIn
>>> 
>>>
>>> 2018-01-30 19:29 GMT+01:00 Lukasz Cwik :
>>>
 It's common in the code base that input and output streams are passed
 around and the caller is responsible for closing it, not the callee. The
 UnownedInputStream is to guard against libraries that are poorly behaved
 and assume they get ownership of the stream when it is given to them.

 In the code:
 myMethod(InputStream in) {
   InputStream child = new InputStream(in);
   child.close();
 }

 InputStream in = ...
 myMethod(in);
 myMethod(in);
 When should "in" be closed?

 To get around this issue, create a filter input/output stream that
 ignores close calls like on the JAXB coder:
 https://github.com/apache/beam/blob/master/sdks/java/io/xml/
 src/main/java/org/apache/beam/sdk/io/xml/JAXBCoder.java#L181

 We can instead swap around this pattern so that the caller guards
 against callees closing by wrapping with a filter input/output stream but
 this costs an additional method call for each input/output stream call.


 On Tue, Jan 30, 2018 at 10:04 AM, Romain Manni-Bucau <
 rmannibu...@gmail.com> wrote:

> Hi guys,
>
> All is in the subject ;)
>
> Rational is to support any I/O library and not fail when the close is
> encapsulated.
>
> Any blocker to swallow this close call?
>
> Romain Manni-Bucau
> @rmannibucau | Blog | Old Blog | Github | LinkedIn
> 
>


>>>
>>
>


Re: TextIO operation not supported by Flink for Beam master

2018-01-31 Thread Lukasz Cwik
Thanks for the feedback, we are currently in the middle of releasing 2.3.0
from pretty close to what is on Apache Beam master so your issue should be
investigated before release.

On Wed, Jan 31, 2018 at 5:36 PM, Thomas Pelletier 
wrote:

> Hi,
>
> I'm trying to run a pipeline containing just a TextIO.read() step on a
> Flink cluster, using the latest Beam git revision (ff37337
> ).
> The job fails to start with the Exception:
>
>   java.lang.UnsupportedOperationException: The transform  is currently
> not supported.
>
> It does work with Beam 2.2.0 though. All code, logs, and reproduction
> steps on this repository 
> .
>
> Any idea what might be going on?
>
> Thanks,
> Thomas
>


Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Lukasz Cwik
Note that a user reported TextIO being broken on Flink.
Thread is here:
https://lists.apache.org/thread.html/47b16c94032392782505415e010970fd2a9480891c55c2f7b5de92bd@%3Cuser.beam.apache.org%3E
Can someone confirm/refute?

On Wed, Jan 31, 2018 at 3:36 PM, Konstantinos Katsiapis <
katsia...@google.com> wrote:

> +1 (non-binding). tensorflow.transform 0.5.0 is blocked on Apache Beam 2.3.
>
>
> On Wed, Jan 31, 2018 at 5:59 AM, Jean-Baptiste Onofré 
> wrote:
>
>> +1 (binding)
>>
>> Casting my own +1 ;)
>>
>> Regards
>> JB
>>
>> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
>> > Hi everyone,
>> >
>> > Please review and vote on the release candidate #1 for the version
>> 2.3.0, as
>> > follows:
>> >
>> > [ ] +1, Approve the release
>> > [ ] -1, Do not approve the release (please provide specific comments)
>> >
>> >
>> > The complete staging area is available for your review, which includes:
>> > * JIRA release notes [1],
>> > * the official Apache source release to be deployed to dist.apache.org
>> [2],
>> > which is signed with the key with fingerprint C8282E76 [3],
>> > * all artifacts to be deployed to the Maven Central Repository [4],
>> > * source code tag "v2.3.0-RC1" [5],
>> > * website pull request listing the release and publishing the API
>> reference
>> > manual [6].
>> > * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
>> > * Python artifacts are deployed along with the source release to the
>> > dist.apache.org [2].
>> >
>> > The vote will be open for at least 72 hours. It is adopted by majority
>> approval,
>> > with at least 3 PMC affirmative votes.
>> >
>> > Thanks,
>> > JB
>> >
>> > [1]
>> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?proje
>> ctId=12319527&version=12341608
>> > [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
>> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> > [4] https://repository.apache.org/content/repositories/orgapache
>> beam-1026/
>> > [5] https://github.com/apache/beam/tree/v2.3.0-RC1
>> > [6] https://github.com/apache/beam-site/pull/381
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>
>
> --
> Gus Katsiapis | Software Engineer | katsia...@google.com | 650-918-7487
>


Re: TextIO operation not supported by Flink for Beam master

2018-01-31 Thread Thomas Pelletier
Let me know if there is anything I can do to help! Should I file a jira
ticket?

On Wed 31 Jan 2018 at 18:14, Lukasz Cwik  wrote:

> Thanks for the feedback, we are currently in the middle of releasing 2.3.0
> from pretty close to what is on Apache Beam master so your issue should be
> investigated before release.
>
> On Wed, Jan 31, 2018 at 5:36 PM, Thomas Pelletier 
> wrote:
>
>> Hi,
>>
>> I'm trying to run a pipeline containing just a TextIO.read() step on a
>> Flink cluster, using the latest Beam git revision (ff37337
>> ).
>> The job fails to start with the Exception:
>>
>>   java.lang.UnsupportedOperationException: The transform  is currently
>> not supported.
>>
>> It does work with Beam 2.2.0 though. All code, logs, and reproduction
>> steps on this repository
>> .
>>
>> Any idea what might be going on?
>>
>> Thanks,
>> Thomas
>>
>
>


Re: TextIO operation not supported by Flink for Beam master

2018-01-31 Thread Kenneth Knowles
Yes, a Jira ticket will help to keep track of things.

I have a guess that this has to do with switching to running from a
portable pipeline representation, and it looks like there's a non-composite
transform with an empty URN and it threw a bad error message. We can try to
root cause but may also mitigate short-term by removing the round-trip
through pipeline proto for now.

What is curious is that the ValidatesRunner and WordCountIT are working -
they only run on a local Flink, yet this seems to be a translation issue
that would occur for local or distributed runs.

I'll copy all this onto the Jira ticket but I wanted to communicate it
immediately.

Kenn

On Wed, Jan 31, 2018 at 6:25 PM, Thomas Pelletier 
wrote:

> Let me know if there is anything I can do to help! Should I file a jira
> ticket?
>
> On Wed 31 Jan 2018 at 18:14, Lukasz Cwik  wrote:
>
>> Thanks for the feedback, we are currently in the middle of releasing
>> 2.3.0 from pretty close to what is on Apache Beam master so your issue
>> should be investigated before release.
>>
>> On Wed, Jan 31, 2018 at 5:36 PM, Thomas Pelletier > > wrote:
>>
>>> Hi,
>>>
>>> I'm trying to run a pipeline containing just a TextIO.read() step on a
>>> Flink cluster, using the latest Beam git revision (ff37337
>>> ).
>>> The job fails to start with the Exception:
>>>
>>>   java.lang.UnsupportedOperationException: The transform  is currently
>>> not supported.
>>>
>>> It does work with Beam 2.2.0 though. All code, logs, and reproduction
>>> steps on this repository
>>> .
>>>
>>> Any idea what might be going on?
>>>
>>> Thanks,
>>> Thomas
>>>
>>
>>


Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Udi Meiri
Eugene, if I get this right, you're saying that IGNORE_MISSING_FILES is
unsafe because it will skip (src, dst) pairs where neither exists? (It only
checks whether src exists.)

For GCS, we can construct a safe retryable rename() operation, assuming
that copy() and delete() are atomic for a single file or pair of files.
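
As a sketch of that idea for a single (src, dst) pair, with invented method
names and assuming single-file copy() and delete() are atomic on the
filesystem in question:

import java.io.IOException;
import java.util.Collections;

import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.ResourceId;

class RetryableRenameSketch {
  static void rename(ResourceId src, ResourceId dst) throws IOException {
    if (exists(src)) {
      // First (or repeated) attempt: copy over dst, then remove src.
      FileSystems.copy(Collections.singletonList(src), Collections.singletonList(dst));
      FileSystems.delete(Collections.singletonList(src));
    } else if (exists(dst)) {
      // A previous attempt already finished the rename; nothing to do.
    } else {
      throw new IOException("Neither " + src + " nor " + dst + " exists; refusing to report success.");
    }
  }

  private static boolean exists(ResourceId resource) throws IOException {
    MatchResult match = FileSystems.matchResources(Collections.singletonList(resource)).get(0);
    return match.status() == MatchResult.Status.OK;
  }
}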



On Wed, Jan 31, 2018 at 4:00 PM Raghu Angadi  wrote:

> On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov 
> wrote:
>
>> As far as I know, the current implementation of file sinks is the only
>> reason why the flag IGNORE_MISSING for copying even exists - there's no
>> other compelling reason to justify it. We implement "rename" as "copy, then
>> delete" (in a single DoFn), so for idempodency of this operation we need to
>> ignore the copying of a non-existent file.
>>
>> I think the right way to go would be to change the implementation of
>> renaming to have a @RequiresStableInput (or reshuffle) in the middle, so
>> it's made of 2 individually idempotent operations:
>> 1) copy, which would fail if input is missing, and would overwrite output
>> if it exists
>> -- reshuffle --
>> 2) delete, which would not fail if input is missing.
>>
>
> Something like this is needed only in streaming, right?
>
> Raghu.
>
>
>> That way first everything is copied (possibly via multiple attempts), and
>> then old files are deleted (possibly via multiple attempts).
>>
>> On Wed, Jan 31, 2018 at 2:26 PM Udi Meiri  wrote:
>>
>>> I agree that overwriting is more in line with user expectations.
>>> I believe that the sink should not ignore errors from the filesystem
>>> layer. Instead, the FileSystem API should be more well defined.
>>> Examples: rename() and copy() should overwrite existing files at the
>>> destination, copy() should have an ignore_missing flag.
>>>
>>> On Wed, Jan 31, 2018 at 1:49 PM Raghu Angadi  wrote:
>>>
 Original mail mentions that output from second run of word_count is
 ignored. That does not seem as safe as ignoring error from a second attempt
 of a step. How do we know second run didn't run on different output?
 Overwriting seems more accurate than ignoring. Does handling this error at
 sink level distinguish between the two (another run vs second attempt)?


 On Wed, Jan 31, 2018 at 12:32 PM, Udi Meiri  wrote:

> Yeah, another round of refactoring is due to move the rename via
> copy+delete logic up to the file-based sink level.
>
> On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath 
> wrote:
>
>> Good point. There's always the chance of step that performs final
>> rename being retried. So we'll have to ignore this error at the sink 
>> level.
>> We don't necessarily have to do this at the FileSystem level though. I
>> think the proper behavior might be to raise an error for the rename at 
>> the
>> FileSystem level if the destination already exists (or source doesn't
>> exist) while ignoring that error (and possibly logging a warning) at the
>> sink level.
>>
>> - Cham
>>
>>
>> On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax  wrote:
>>
>>> I think the idea was to ignore "already exists" errors. The reason
>>> being that any step in Beam can be executed multiple times, including 
>>> the
>>> rename step. If the rename step gets run twice, the second run should
>>> succeed vacuously.
>>>
>>>
>>> On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri  wrote:
>>>
 Hi,
 I've been working on HDFS code for the Python SDK and I've noticed
 some behaviors which are surprising. I wanted to know if these 
 behaviors
 are known and intended.

 1. When renaming files during finalize_write, rename errors are
 ignored
 .
 For example, if I run wordcount twice using HDFS code I get a warning 
 the
 second time because the file already exists:

 WARNING:root:Rename not successful:
 hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
 -> hdfs://counts2-0-of-1, libhdfs error in renaming
 hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
 to hdfs://counts2-0-of-1 with exceptions Unable to rename
 '/beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2'
 to '/counts2-0-of-1'.

 For GCS and local files there are no rename errors (in this case),
 since the rename operation silently overwrites existing destination 
 files.
 However, blindly ignoring these errors might make the pipeline to 
 report
 success even though output files are missing.

 2. Output files 

Re: TextIO operation not supported by Flink for Beam master

2018-01-31 Thread Kenneth Knowles
Filed https://issues.apache.org/jira/browse/BEAM-3587

On Wed, Jan 31, 2018 at 6:31 PM, Kenneth Knowles  wrote:

> Yes, a Jira ticket will help to keep track of things.
>
> I have a guess that this has to do with switching to running from a
> portable pipeline representation, and it looks like there's a non-composite
> transform with an empty URN and it threw a bad error message. We can try to
> root cause but may also mitigate short-term by removing the round-trip
> through pipeline proto for now.
>
> What is curious is that the ValidatesRunner and WordCountIT are working -
> they only run on a local Flink, yet this seems to be a translation issue
> that would occur for local or distributed runs.
>
> I'll copy all this onto the Jira ticket but I wanted to communicate it
> immediately.
>
> Kenn
>
> On Wed, Jan 31, 2018 at 6:25 PM, Thomas Pelletier 
> wrote:
>
>> Let me know if there is anything I can do to help! Should I file a jira
>> ticket?
>>
>> On Wed 31 Jan 2018 at 18:14, Lukasz Cwik  wrote:
>>
>>> Thanks for the feedback, we are currently in the middle of releasing
>>> 2.3.0 from pretty close to what is on Apache Beam master so your issue
>>> should be investigated before release.
>>>
>>> On Wed, Jan 31, 2018 at 5:36 PM, Thomas Pelletier <
>>> tpellet...@zendesk.com> wrote:
>>>
 Hi,

 I'm trying to run a pipeline containing just a TextIO.read() step on a
 Flink cluster, using the latest Beam git revision (ff37337
 ).
 The job fails to start with the Exception:

   java.lang.UnsupportedOperationException: The transform  is currently
 not supported.

 It does work with Beam 2.2.0 though. All code, logs, and reproduction
 steps on this repository
 .

 Any idea what might be going on?

 Thanks,
 Thomas

>>>
>>>
>


FOSDEM mini office hour?

2018-01-31 Thread Holden Karau
Hi BEAM Friends,

If any folks are around for FOSDEM this year I was planning on doing a
coffee office hour on the last day after my talks. Maybe like 6pm?
I'm also going to see if any Spark folks are around and interested :)

Cheers,

Holden :)


-- 
Twitter: https://twitter.com/holdenkarau


Re: Can Window PTransform drop tuples that violate allowed lateness?

2018-01-31 Thread Kenneth Knowles
On Mon, Jan 22, 2018 at 11:42 AM, Shen Li  wrote:

> Hi Kenn,
>
> Thanks for the explanation.
>
> > So now elements are droppable if they belong to an expired window.
>
> Say I have two consecutive window transforms with FixedWindows WindowFn
> (just an example, most likely won't appear in real pipeline). The first
> windowFn says the element belongs to an expired window. But according to
> the second windowFn, the element's window is not yet expired. In this case,
> can the first Window transform drop the element?
>

Yes, it is permitted to drop the expired data at any point. The reason I
think this is OK is that the runner also completely controls the watermark.
So there is arbitrary runner-owned behavior in terms of dropping either
way. It hasn't come up, since windows are hardly useful until you have an
aggregation, where they provide the notion of completeness. Do you have an
example in mind where this gets weird?
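
To make the scenario concrete, the two consecutive Window transforms Shen
describes might look roughly like this (durations and names are illustrative
only):

import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class ConsecutiveWindowsSketch {
  static PCollection<String> apply(PCollection<String> input) {
    return input
        // An element arriving 5 minutes behind the watermark is in an expired
        // window here (1-minute windows, zero allowed lateness)...
        .apply("ShortWindows",
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
                .withAllowedLateness(Duration.ZERO))
        // ...but would still fall inside an open window here. Per the answer
        // above, the runner may nevertheless drop it at the first step.
        .apply("LongWindows",
            Window.<String>into(FixedWindows.of(Duration.standardHours(1)))
                .withAllowedLateness(Duration.standardHours(1)));
  }
}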

Kenn




> Best,
> Shen
>
> On Mon, Jan 22, 2018 at 2:07 PM, Kenneth Knowles  wrote:
>
>> Hi Shen,
>>
>> This is a documentation issue. The Beam model switched from dropping
>> individual elements to expiring windows. So now elements are droppable if
>> they belong to an expired window. This works a little better with the
>> purpose of windowing and allowed lateness: to say when an aggregation is
>> "complete". Any element that manages to make it to an aggregation before
>> the accumulator is expired is allowed to be included now and only after the
>> whole window expires we drop any further incoming elements for that window.
>>
>> Kenn
>>
>> On Mon, Jan 22, 2018 at 10:52 AM, Shen Li  wrote:
>>
>>> Hi,
>>>
>>> The Window#withAllowedLateness(Duration) doc says "Any elements that
>>> are later than this as decided by the system-maintained watermark will be
>>> dropped". Can the runner safely discard a tuple that violates the allowed
>>> lateness in the Window operator? Or does it have to drop it in the
>>> downstream GBK operator just in case that there could be another Window
>>> transform in between overriding the allowed lateness (or other
>>> configurations)?
>>>
>>> Thanks,
>>> Shen
>>>
>>
>>
>


Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-01-31 Thread Kenneth Knowles
I didn't have time for this, but it just bit me. We definitely have Guava
on the API surface of runner support code in ways that get incompatibly
shaded. I will probably start "1a" by making a shaded library
org.apache.beam:vendored-guava and starting to use it. It sounds like there
is generally unanimous support for that much, anyhow.
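
A toy illustration of the breakage (class and package names invented): a
support method that leaks a Guava type through its signature stops linking
once each module relocates Guava differently.

import com.google.common.collect.ImmutableList;

class RunnerSupport {
  // After per-module shading, one artifact sees this return type as, say,
  //   org.apache.beam.runners.foo.repackaged.com.google.common.collect.ImmutableList
  // while its caller was compiled against a differently relocated copy,
  // so the call fails to link at runtime.
  static ImmutableList<String> defaultStepNames() {
    return ImmutableList.of("Read", "Parse", "Write");
  }
}

// With a single vendored artifact, both sides would import one shared
// relocated package instead, e.g. (hypothetical name):
//   import org.apache.beam.vendor.guava.v20.com.google.common.collect.ImmutableList;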

Kenn

On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek 
wrote:

> Thanks Ismaël for bringing up this discussion again!
>
> I would be in favour of 1) and more specifically of 1a)
>
> Aljoscha
>
>
> On 12. Dec 2017, at 18:56, Lukasz Cwik  wrote:
>
> You can always run tests on post shaded artifacts instead of the compiled
> classes, it just requires us to change our maven surefire / gradle test
> configurations but it is true that most IDEs would behave better with a
> dependency jar unless you delegate all the build/test actions to the build
> system and then it won't matter.
>
> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles  wrote:
>
>> There's also, with additional overhead,
>>
>> 1a) A relocated and shipped package for each thing we want to relocate. I
>> think this has also been tried outside Beam...
>>
>> Pros:
>> * all the pros of 1) plus no bloat beyond what is necessary
>> Cons:
>> * abandons whitelist approach for public deps, reverting to blacklist
>> approach for trouble things like guava, so a bit less principled
>>
>> For both 1) and 1a) I would add:
>>
>> Pros:
>> * clearly readable dependency since code will `import
>> org.apache.beam.private.guava21` and IDEs will understand it is a
>> distinct library
>> * can run tests on unpackaged classes, as long as deps are shaded or
>> provided as jars
>> * no mysterious action at a distance from inherited configuration
>> Cons:
>> * need to adjust imports
>>
>> Kenn
>>
>> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik  wrote:
>>
>>> I would suggest that either we use:
>>> 1) A common deps package containing shaded dependencies allows for
>>> Pros
>>> * doesn't require the user to build an uber jar
>>> Risks
>>> * dependencies package will keep growing even if something is or isn't
>>> needed by all of Apache Beam leading to a large jar anyways negating any
>>> space savings
>>>
>>> 2) Shade within each module to a common location like
>>> org.apache.beam.relocated.guava
>>> Pros
>>> * you only get the shaded dependencies of the things that are required
>>> * its one less dependency for users to manage
>>> Risks
>>> * requires an uber jar to be built to get the space savings (either by a
>>> user or a distribution of Apache Beam) otherwise we negate any space
>>> savings.
>>>
>>> If we either use a common relocation scheme or a dependencies jar then
>>> each relocation should specifically contain the version number of the
>>> package because we would like to allow for us to be using two different
>>> versions of the same library.
>>>
>>> For the common deps package approach, should we check in code where the
>>> imports contain the relocated location (e.g. import
>>> org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)?
>>>
>>>
>>> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré 
>>> wrote:
>>>
 Thanks for bringing that back.

 Indeed guava is shaded in different uber-jar. Maybe we can have a
 common deps module that we include once (but the user will have to
 explicitly define the dep) ?

 Basically, what do you propose for protobuf (unfortunately, I don't see
 an obvious) ?

 Regards
 JB


 On 12/11/2017 05:35 PM, Ismaël Mejía wrote:

> Hello, I wanted to bring back this subject because I think we should
> take action on this and at least first have a shaded version of guava.
> I was playing with a toy project and I did the procedure we use to
> submit jars to a Hadoop cluster via Flink/Spark which involves
> creating an uber jar and I realized that the size of the jar was way
> bigger than I expected, and the fact that we shade guava in every
> module contributes to this. I found guava shaded on:
>
> sdks/java/core
> runners/core-construction-java
> runners/core-java
> model/job-management
> runners/spark
> sdks/java/io/hadoop-file-system
> sdks/java/io/kafka
>
> This means at least 6 times more of the size it should which counts in
> around 15MB more (2.4MB*6 deps) of extra weight that we can simply
> reduce by using a shaded version of the library.
>
> I add this point to the previous ones mentioned during the discussion
> because this goes against the end user experience and affects us all
> (devs/users).
>
> Another question is if we should shade (and how) protocol buffers
> because now with the portability work we are exposing it closer to the
> end users. I say this because I also found an issue while running a
> job on YARN with the spark runner because hadoop-common includes
> protobuf-java 2 and I had to e

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-01-31 Thread Lukasz Cwik
Make sure to include the guava version in the artifact name so that we can
have multiple vendored versions.

On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles  wrote:

> I didn't have time for this, but it just bit me. We definitely have Guava
> on the API surface of runner support code in ways that get incompatibly
> shaded. I will probably start "1a" by making a shaded library
> org.apache.beam:vendored-guava and starting to use it. It sounds like there
> is generally unanimous support for that much, anyhow.
>
> Kenn
>
> On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek 
> wrote:
>
>> Thanks Ismaël for bringing up this discussion again!
>>
>> I would be in favour of 1) and more specifically of 1a)
>>
>> Aljoscha
>>
>>
>> On 12. Dec 2017, at 18:56, Lukasz Cwik  wrote:
>>
>> You can always run tests on post shaded artifacts instead of the compiled
>> classes, it just requires us to change our maven surefire / gradle test
>> configurations but it is true that most IDEs would behave better with a
>> dependency jar unless you delegate all the build/test actions to the build
>> system and then it won't matter.
>>
>> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles  wrote:
>>
>>> There's also, with additional overhead,
>>>
>>> 1a) A relocated and shipped package for each thing we want to relocate.
>>> I think this has also been tried outside Beam...
>>>
>>> Pros:
>>> * all the pros of 1) plus no bloat beyond what is necessary
>>> Cons:
>>> * abandons whitelist approach for public deps, reverting to blacklist
>>> approach for trouble things like guava, so a bit less principled
>>>
>>> For both 1) and 1a) I would add:
>>>
>>> Pros:
>>> * clearly readable dependency since code will `import
>>> org.apache.beam.private.guava21` and IDEs will understand it is a
>>> distinct library
>>> * can run tests on unpackaged classes, as long as deps are shaded or
>>> provided as jars
>>> * no mysterious action at a distance from inherited configuration
>>> Cons:
>>> * need to adjust imports
>>>
>>> Kenn
>>>
>>> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik  wrote:
>>>
 I would suggest that either we use:
 1) A common deps package containing shaded dependencies allows for
 Pros
 * doesn't require the user to build an uber jar
 Risks
 * dependencies package will keep growing even if something is or isn't
 needed by all of Apache Beam leading to a large jar anyways negating any
 space savings

 2) Shade within each module to a common location like
 org.apache.beam.relocated.guava
 Pros
 * you only get the shaded dependencies of the things that are required
 * its one less dependency for users to manage
 Risks
 * requires an uber jar to be built to get the space savings (either by
 a user or a distribution of Apache Beam) otherwise we negate any space
 savings.

 If we either use a common relocation scheme or a dependencies jar then
 each relocation should specifically contain the version number of the
 package because we would like to allow for us to be using two different
 versions of the same library.

 For the common deps package approach, should we check in code where the
 imports contain the relocated location (e.g. import
 org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)?


 On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré 
 wrote:

> Thanks for bringing that back.
>
> Indeed guava is shaded in different uber-jar. Maybe we can have a
> common deps module that we include once (but the user will have to
> explicitly define the dep) ?
>
> Basically, what do you propose for protobuf (unfortunately, I don't
> see an obvious) ?
>
> Regards
> JB
>
>
> On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>
>> Hello, I wanted to bring back this subject because I think we should
>> take action on this and at least first have a shaded version of guava.
>> I was playing with a toy project and I did the procedure we use to
>> submit jars to a Hadoop cluster via Flink/Spark which involves
>> creating an uber jar and I realized that the size of the jar was way
>> bigger than I expected, and the fact that we shade guava in every
>> module contributes to this. I found guava shaded on:
>>
>> sdks/java/core
>> runners/core-construction-java
>> runners/core-java
>> model/job-management
>> runners/spark
>> sdks/java/io/hadoop-file-system
>> sdks/java/io/kafka
>>
>> This means at least 6 times more of the size it should which counts in
>> around 15MB more (2.4MB*6 deps) of extra weight that we can simply
>> reduce by using a shaded version of the library.
>>
>> I add this point to the previous ones mentioned during the discussion
>> because this goes against the end user experience and affects us all
>> (devs/users).
>>
>> Another question is if

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-01-31 Thread Romain Manni-Bucau
Why not drop guava from the whole Beam codebase? With Java 8 it is quite easy
to do and it avoids a bunch of conflicts. I did it in 2 projects with quite a
good result.

Le 1 févr. 2018 06:50, "Lukasz Cwik"  a écrit :

> Make sure to include the guava version in the artifact name so that we can
> have multiple vendored versions.
>
> On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles  wrote:
>
>> I didn't have time for this, but it just bit me. We definitely have Guava
>> on the API surface of runner support code in ways that get incompatibly
>> shaded. I will probably start "1a" by making a shaded library
>> org.apache.beam:vendored-guava and starting to use it. It sounds like there
>> is generally unanimous support for that much, anyhow.
>>
>> Kenn
>>
>> On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek 
>> wrote:
>>
>>> Thanks Ismaël for bringing up this discussion again!
>>>
>>> I would be in favour of 1) and more specifically of 1a)
>>>
>>> Aljoscha
>>>
>>>
>>> On 12. Dec 2017, at 18:56, Lukasz Cwik  wrote:
>>>
>>> You can always run tests on post shaded artifacts instead of the
>>> compiled classes, it just requires us to change our maven surefire / gradle
>>> test configurations but it is true that most IDEs would behave better with
>>> a dependency jar unless you delegate all the build/test actions to the
>>> build system and then it won't matter.
>>>
>>> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles  wrote:
>>>
 There's also, with additional overhead,

 1a) A relocated and shipped package for each thing we want to relocate.
 I think this has also been tried outside Beam...

 Pros:
 * all the pros of 1) plus no bloat beyond what is necessary
 Cons:
 * abandons whitelist approach for public deps, reverting to blacklist
 approach for trouble things like guava, so a bit less principled

 For both 1) and 1a) I would add:

 Pros:
 * clearly readable dependency since code will `import
 org.apache.beam.private.guava21` and IDEs will understand it is a
 distinct library
 * can run tests on unpackaged classes, as long as deps are shaded or
 provided as jars
 * no mysterious action at a distance from inherited configuration
 Cons:
 * need to adjust imports

 Kenn

 On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik  wrote:

> I would suggest that either we use:
> 1) A common deps package containing shaded dependencies allows for
> Pros
> * doesn't require the user to build an uber jar
> Risks
> * dependencies package will keep growing even if something is or isn't
> needed by all of Apache Beam leading to a large jar anyways negating any
> space savings
>
> 2) Shade within each module to a common location like
> org.apache.beam.relocated.guava
> Pros
> * you only get the shaded dependencies of the things that are required
> * its one less dependency for users to manage
> Risks
> * requires an uber jar to be built to get the space savings (either by
> a user or a distribution of Apache Beam) otherwise we negate any space
> savings.
>
> If we either use a common relocation scheme or a dependencies jar then
> each relocation should specifically contain the version number of the
> package because we would like to allow for us to be using two different
> versions of the same library.
>
> For the common deps package approach, should we check in code where
> the imports contain the relocated location (e.g. import
> org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)?
>
>
> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré  > wrote:
>
>> Thanks for bringing that back.
>>
>> Indeed guava is shaded in different uber-jar. Maybe we can have a
>> common deps module that we include once (but the user will have to
>> explicitly define the dep) ?
>>
>> Basically, what do you propose for protobuf (unfortunately, I don't
>> see an obvious) ?
>>
>> Regards
>> JB
>>
>>
>> On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>>
>>> Hello, I wanted to bring back this subject because I think we should
>>> take action on this and at least first have a shaded version of
>>> guava.
>>> I was playing with a toy project and I did the procedure we use to
>>> submit jars to a Hadoop cluster via Flink/Spark which involves
>>> creating an uber jar and I realized that the size of the jar was way
>>> bigger than I expected, and the fact that we shade guava in every
>>> module contributes to this. I found guava shaded on:
>>>
>>> sdks/java/core
>>> runners/core-construction-java
>>> runners/core-java
>>> model/job-management
>>> runners/spark
>>> sdks/java/io/hadoop-file-system
>>> sdks/java/io/kafka
>>>
>>> This means at least 6 times more of the size it should which counts
>>

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-01-31 Thread Lukasz Cwik
That is an even better idea. A lot of guava constructs have been superseded
by stuff that was added to Java 8.
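
A few illustrative pairs (not exhaustive) of Guava calls next to their plain
Java 8 equivalents:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Optional;
import java.util.stream.Collectors;

class GuavaToJava8Examples {
  static void examples(String value, List<String> words) {
    // Guava: Preconditions.checkNotNull(value, "value must not be null")
    Objects.requireNonNull(value, "value must not be null");

    // Guava: Optional.fromNullable(value).or("default")
    String withDefault = Optional.ofNullable(value).orElse("default");

    // Guava: Lists.newArrayList("a", "b")
    List<String> list = new ArrayList<>(Arrays.asList("a", "b"));

    // Guava: Joiner.on(", ").join(words)
    String joined = String.join(", ", words);

    // Guava: FluentIterable.from(words).filter(...).transform(...).toList()
    List<String> shouting = words.stream()
        .filter(w -> !w.isEmpty())
        .map(String::toUpperCase)
        .collect(Collectors.toList());
  }
}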

On Wed, Jan 31, 2018 at 9:56 PM, Romain Manni-Bucau 
wrote:

> Why not drop guava from the whole Beam codebase? With Java 8 it is quite easy
> to do and it avoids a bunch of conflicts. I did it in 2 projects with quite a
> good result.
>
> Le 1 févr. 2018 06:50, "Lukasz Cwik"  a écrit :
>
>> Make sure to include the guava version in the artifact name so that we
>> can have multiple vendored versions.
>>
>> On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles  wrote:
>>
>>> I didn't have time for this, but it just bit me. We definitely have
>>> Guava on the API surface of runner support code in ways that get
>>> incompatibly shaded. I will probably start "1a" by making a shaded library
>>> org.apache.beam:vendored-guava and starting to use it. It sounds like there
>>> is generally unanimous support for that much, anyhow.
>>>
>>> Kenn
>>>
>>> On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek 
>>> wrote:
>>>
 Thanks Ismaël for bringing up this discussion again!

 I would be in favour of 1) and more specifically of 1a)

 Aljoscha


 On 12. Dec 2017, at 18:56, Lukasz Cwik  wrote:

 You can always run tests on post shaded artifacts instead of the
 compiled classes, it just requires us to change our maven surefire / gradle
 test configurations but it is true that most IDEs would behave better with
 a dependency jar unless you delegate all the build/test actions to the
 build system and then it won't matter.

 On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles 
 wrote:

> There's also, with additional overhead,
>
> 1a) A relocated and shipped package for each thing we want to
> relocate. I think this has also been tried outside Beam...
>
> Pros:
> * all the pros of 1) plus no bloat beyond what is necessary
> Cons:
> * abandons whitelist approach for public deps, reverting to blacklist
> approach for troublesome things like guava, so a bit less principled
>
> For both 1) and 1a) I would add:
>
> Pros:
> * clearly readable dependency since code will `import
> org.apache.beam.private.guava21` and IDEs will understand it is a
> distinct library
> * can run tests on unpackaged classes, as long as deps are shaded or
> provided as jars
> * no mysterious action at a distance from inherited configuration
> Cons:
> * need to adjust imports
>
> Kenn
>
> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik  wrote:
>
>> I would suggest that either we use:
>> 1) A common deps package containing shaded dependencies allows for
>> Pros
>> * doesn't require the user to build an uber jar
>> Risks
>> * dependencies package will keep growing even if something is or
>> isn't needed by all of Apache Beam leading to a large jar anyways 
>> negating
>> any space savings
>>
>> 2) Shade within each module to a common location like
>> org.apache.beam.relocated.guava
>> Pros
>> * you only get the shaded dependencies of the things that are required
>> * its one less dependency for users to manage
>> Risks
>> * requires an uber jar to be built to get the space savings (either
>> by a user or a distribution of Apache Beam) otherwise we negate any space
>> savings.
>>
>> If we either use a common relocation scheme or a dependencies jar
>> then each relocation should specifically contain the version number of 
>> the
>> package because we would like to allow for us to be using two different
>> versions of the same library.
>>
>> For the common deps package approach, should we check in code where
>> the imports contain the relocated location (e.g. import
>> org.apache.beam.guava.20.0.com.google.common.collect.ImmutableList)?
>>
>>
>> On Mon, Dec 11, 2017 at 8:47 AM, Jean-Baptiste Onofré <
>> j...@nanthrax.net> wrote:
>>
>>> Thanks for bringing that back.
>>>
>>> Indeed guava is shaded in different uber-jar. Maybe we can have a
>>> common deps module that we include once (but the user will have to
>>> explicitly define the dep) ?
>>>
>>> Basically, what do you propose for protobuf (unfortunately, I don't
>>> see an obvious) ?
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 12/11/2017 05:35 PM, Ismaël Mejía wrote:
>>>
 Hello, I wanted to bring back this subject because I think we should
 take action on this and at least first have a shaded version of
 guava.
 I was playing with a toy project and I did the procedure we use to
 submit jars to a Hadoop cluster via Flink/Spark which involves
 creating an uber jar and I realized that the size of the jar was way
 bigger than I expected, and the fact that we shade guava in every
 module contrib

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Romain Manni-Bucau
@ismael: any vote can be changed from -1 to +1 (or +/-0) without additional
delay.

Le 1 févr. 2018 03:15, "Lukasz Cwik"  a écrit :

> Note that a user reported TextIO being broken on Flink.
> Thread is here: https://lists.apache.org/thread.html/
> 47b16c94032392782505415e010970fd2a9480891c55c2f7b5de92bd@%
> 3Cuser.beam.apache.org%3E
> Can someone confirm/refute?
>
> On Wed, Jan 31, 2018 at 3:36 PM, Konstantinos Katsiapis <
> katsia...@google.com> wrote:
>
>> +1 (non-binding). tensorflow.transform
>>  0.5.0 is blocked on Apache
>> Beam 2.3
>>
>>
>> On Wed, Jan 31, 2018 at 5:59 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1 (binding)
>>>
>>> Casting my own +1 ;)
>>>
>>> Regards
>>> JB
>>>
>>> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
>>> > Hi everyone,
>>> >
>>> > Please review and vote on the release candidate #1 for the version
>>> 2.3.0, as
>>> > follows:
>>> >
>>> > [ ] +1, Approve the release
>>> > [ ] -1, Do not approve the release (please provide specific comments)
>>> >
>>> >
>>> > The complete staging area is available for your review, which includes:
>>> > * JIRA release notes [1],
>>> > * the official Apache source release to be deployed to dist.apache.org
>>> [2],
>>> > which is signed with the key with fingerprint C8282E76 [3],
>>> > * all artifacts to be deployed to the Maven Central Repository [4],
>>> > * source code tag "v2.3.0-RC1" [5],
>>> > * website pull request listing the release and publishing the API
>>> reference
>>> > manual [6].
>>> > * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
>>> > * Python artifacts are deployed along with the source release to the
>>> > dist.apache.org [2].
>>> >
>>> > The vote will be open for at least 72 hours. It is adopted by majority
>>> approval,
>>> > with at least 3 PMC affirmative votes.
>>> >
>>> > Thanks,
>>> > JB
>>> >
>>> > [1]
>>> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?proje
>>> ctId=12319527&version=12341608
>>> > [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
>>> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> > [4] https://repository.apache.org/content/repositories/orgapache
>>> beam-1026/
>>> > [5] https://github.com/apache/beam/tree/v2.3.0-RC1
>>> > [6] https://github.com/apache/beam-site/pull/381
>>> >
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>>
>>
>> --
>> Gus Katsiapis | Software Engineer | katsia...@google.com | 650-918-7487
>> <(650)%20918-7487>
>>
>
>


Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-31 Thread Romain Manni-Bucau
Le 1 févr. 2018 03:10, "Lukasz Cwik"  a écrit :

Yes, people will write bad coders which is why this is there. No, I don't
think swallowing one close is what we want.

In the case where you want to pass an input/output stream to a library
that incorrectly assumes ownership, the input/output stream should be
wrapped, right before the call to the library, with a filter input/output
stream that swallows the close, rather than propagating this bad behavior
elsewhere.


Hmm,

Elsewhere is nowhere else here, since it wouldn't have any side effect except
not enforcing another layer, and it would make things work smoothly for most mappers.

Anyway, I can live with it, but I'm a bit sad it closes the door on the
ease of writing extensions.


On Wed, Jan 31, 2018 at 12:04 PM, Romain Manni-Bucau 
wrote:

> Hmm, here we are the ones owning the call since it is in a coder, no? Do
> we assume people will badly implement coders? In this particular case we
> can assume close() will be called by a framework I think.
> What about swallowing one close() and failing on the second?
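A minimal sketch of that variant, purely for illustration (the class name is hypothetical and this is not what Beam does):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch of the proposal: tolerate a first close(), fail on a second. */
class SingleCloseTolerantInputStream extends FilterInputStream {
  private boolean closedOnce = false;

  SingleCloseTolerantInputStream(InputStream delegate) {
    super(delegate);
  }

  @Override
  public void close() throws IOException {
    if (closedOnce) {
      throw new IOException("close() called twice on a stream the caller still owns");
    }
    // The first close() is swallowed so a framework-invoked close does not
    // shut down the underlying stream the caller still needs.
    closedOnce = true;
  }
}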
>
>
> Romain Manni-Bucau
> @rmannibucau  |  Blog
>  | Old Blog
>  | Github
>  | LinkedIn
> 
>
> 2018-01-31 20:59 GMT+01:00 Lukasz Cwik :
>
>> Because people write code like:
>> myMethod(InputStream in) {
>>   InputStream child = new InputStream(in);
>>   child.close();
>> }
>>
>> InputStream in = new FileInputStream(... path ...);
>> myMethod(in);
>> myMethod(in);
>>
>> An exception will be thrown when the second myMethod call occurs.
>>
>> Unfortunately, not everyone wraps their calls to a coder with an
>> UnownedInputStream or a filter input stream which drops close() calls, which is
>> why it's a problem; in the few places it is done, it is used to prevent
>> bugs from creeping in.
>>
>>
>>
>> On Tue, Jan 30, 2018 at 11:29 AM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> I get the issue but I don't get the last part. Concretely we can support
>>> any lib by just removing the exception in the close, no? What would be the
>>> issue? No additional wrapper, no lib integration issue.
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau  |  Blog
>>>  | Old Blog
>>>  | Github
>>>  | LinkedIn
>>> 
>>>
>>> 2018-01-30 19:29 GMT+01:00 Lukasz Cwik :
>>>
 Its common in the code base that input and output streams are passed
 around and the caller is responsible for closing it, not the callee. The
 UnownedInputStream is to guard against libraries that are poorly behaved
 and assume they get ownership of the stream when it is given to them.

 In the code:
 myMethod(InputStream in) {
   InputStream child = new InputStream(in);
   child.close();
 }

 InputStream in = ...
 myMethod(in);
 myMethod(in);
 When should "in" be closed?

 To get around this issue, create a filter input/output stream that
 ignores close calls like on the JAXB coder:
 https://github.com/apache/beam/blob/master/sdks/java/io/xml/
 src/main/java/org/apache/beam/sdk/io/xml/JAXBCoder.java#L181

 We can instead swap around this pattern so that the caller guards
 against callees closing by wrapping with a filter input/output stream but
 this costs an additional method call for each input/output stream call.


 On Tue, Jan 30, 2018 at 10:04 AM, Romain Manni-Bucau <
 rmannibu...@gmail.com> wrote:

> Hi guys,
>
> All is in the subject ;)
>
> Rationale is to support any I/O library and not fail when the close is
> encapsulated.
>
> Any blocker to swallow this close call?
>
> Romain Manni-Bucau
> @rmannibucau  |  Blog
>  | Old Blog
>  | Github
>  | LinkedIn
> 
>


>>>
>>
>


Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-01-31 Thread Jean-Baptiste Onofré
+1, it was on my TODO list for a while, waiting for the Java 8 update.

Regards
JB

On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
> Why not dropping guava for all beam codebase? With java 8 it is quite easy to 
> do
> it and avoid a bunch of conflicts. Did it in 2 projects with quite a good 
> result.
> 
> Le 1 févr. 2018 06:50, "Lukasz Cwik"  > a écrit :
> 
> Make sure to include the guava version in the artifact name so that we can
> have multiple vendored versions.
> 
> On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles  > wrote:
> 
> I didn't have time for this, but it just bit me. We definitely have
> Guava on the API surface of runner support code in ways that get
> incompatibly shaded. I will probably start "1a" by making a shaded
> library org.apache.beam:vendored-guava and starting to use it. It 
> sounds
> like there is generally unanimous support for that much, anyhow.
> 
> Kenn
> 
> On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek  > wrote:
> 
> Thanks Ismaël for bringing up this discussion again!
> 
> I would be in favour of 1) and more specifically of 1a)
> 
> Aljoscha
> 
> 
>> On 12. Dec 2017, at 18:56, Lukasz Cwik > > wrote:
>>
>> You can always run tests on post shaded artifacts instead of the
>> compiled classes, it just requires us to change our maven 
>> surefire
>> / gradle test configurations but it is true that most IDEs would
>> behave better with a dependency jar unless you delegate all the
>> build/test actions to the build system and then it won't matter.
>>
>> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles > > wrote:
>>
>> There's also, with additional overhead,
>>
>> 1a) A relocated and shipped package for each thing we want to
>> relocate. I think this has also been tried outside Beam...
>>
>> Pros:
>> * all the pros of 1) plus no bloat beyond what is necessary
>> Cons:
>> * abandons whitelist approach for public deps, reverting to
>> blacklist approach for troublesome things like guava, so a bit
>> less principled
>>
>> For both 1) and 1a) I would add:
>>
>> Pros:
>> * clearly readable dependency since code will `import
>> org.apache.beam.private.guava21` and IDEs will understand it
>> is a distinct library
>> * can run tests on unpackaged classes, as long as deps are
>> shaded or provided as jars
>> * no mysterious action at a distance from inherited 
>> configuration
>> Cons:
>> * need to adjust imports
>>
>> Kenn
>>
>> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik 
>> > > wrote:
>>
>> I would suggest that either we use:
>> 1) A common deps package containing shaded dependencies
>> allows for 
>> Pros
>> * doesn't require the user to build an uber jar
>> Risks
>> * dependencies package will keep growing even if 
>> something
>> is or isn't needed by all of Apache Beam leading to a
>> large jar anyways negating any space savings
>>
>> 2) Shade within each module to a common location like
>> org.apache.beam.relocated.guava
>> Pros
>> * you only get the shaded dependencies of the things that
>> are required
>> * its one less dependency for users to manage
>> Risks
>> * requires an uber jar to be built to get the space
>> savings (either by a user or a distribution of Apache
>> Beam) otherwise we negate any space savings.
>>
>> If we either use a common relocation scheme or a
>> dependencies jar then each relocation should specifically
>> contain the version number of the package because we 
>> would
>> like to allow for us to be using two different versions 
>> of
>> the same library.
>>
>> For the common deps package approach, should we check in
>> code where the imports contain the relocated location
>> (e.g. import org.apache.beam.guava.20.0.com.google.common.collect.Imm

Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-31 Thread Lukasz Cwik
I'm not sure what you mean by "it closes the door": as the caller of the
library, you can create a wrapper filter input stream that ignores close
calls, effectively overriding what happens in the UnownedInputStream.
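A minimal sketch of such a caller-side wrapper (the class name is hypothetical; it is not an existing Beam utility):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch: forwards all reads but swallows close(). */
public class CloseIgnoringInputStream extends FilterInputStream {
  public CloseIgnoringInputStream(InputStream delegate) {
    super(delegate);
  }

  @Override
  public void close() throws IOException {
    // Intentionally a no-op: the caller keeps ownership of the underlying
    // stream and decides when it is actually closed.
  }
}

// Usage: wrap right before handing the stream to a library that assumes
// ownership, e.g. library.read(new CloseIgnoringInputStream(in));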

On Wed, Jan 31, 2018 at 10:08 PM, Romain Manni-Bucau 
wrote:

>
>
> Le 1 févr. 2018 03:10, "Lukasz Cwik"  a écrit :
>
> Yes, people will write bad coders which is why this is there. No, I don't
> think swallowing one close is what we want.
>
> In the case where you wants to pass an input/output stream to a library
> that incorrectly assumes ownership, the input/output stream should be
> wrapped right before the call to the library with a filter input/output
> stream that swallows the close and not propagate ignoring this bad behavior
> elsewhere.
>
>
> Hmm,
>
> Elsewhere is nowhere else here since it wouldnt have any side effect
> except not enforcing another layer and making smoothly work for most
> mappers.
>
> Anyway I can live with it but I'm a bit sad it closes the door to the
> easyness to write extensions.
>
>
> On Wed, Jan 31, 2018 at 12:04 PM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>> Hmm, here we are the ones owning the call since it is in a coder, no? Do
>> we assume people will badly implement coders? In this particular case we
>> can assume close() will be called by a framework I think.
>> What about swallowing one close() and fail on the second?
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github
>>  | LinkedIn
>> 
>>
>> 2018-01-31 20:59 GMT+01:00 Lukasz Cwik :
>>
>>> Because people write code like:
>>> myMethod(InputStream in) {
>>>   InputStream child = new InputStream(in);
>>>   child.close();
>>> }
>>>
>>> InputStream in = new FileInputStream(... path ...);
>>> myMethod(in);
>>> myMethod(in);
>>>
>>> An exception will be thrown when the second myMethod call occurs.
>>>
>>> Unfortunately not everyone wraps their calls to a coder with an
>>> UnownedInputStream or a filter input stream which drops close() calls is
>>> why its a problem and in the few places it is done it is used to prevent
>>> bugs from creeping in.
>>>
>>>
>>>
>>> On Tue, Jan 30, 2018 at 11:29 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
 I get the issue but I don't get the last part. Concretely we can
 support any lib by just removing the exception in the close, no? What would
 be the issue? No additional wrapper, no lib integration issue.


 Romain Manni-Bucau
 @rmannibucau  |  Blog
  | Old Blog
  | Github
  | LinkedIn
 

 2018-01-30 19:29 GMT+01:00 Lukasz Cwik :

> Its common in the code base that input and output streams are passed
> around and the caller is responsible for closing it, not the callee. The
> UnownedInputStream is to guard against libraries that are poorly behaved
> and assume they get ownership of the stream when it is given to them.
>
> In the code:
> myMethod(InputStream in) {
>   InputStream child = new InputStream(in);
>   child.close();
> }
>
> InputStream in = ...
> myMethod(in);
> myMethod(in);
> When should "in" be closed?
>
> To get around this issue, create a filter input/output stream that
> ignores close calls like on the JAXB coder:
> https://github.com/apache/beam/blob/master/sdks/java/io/xml/
> src/main/java/org/apache/beam/sdk/io/xml/JAXBCoder.java#L181
>
> We can instead swap around this pattern so that the caller guards
> against callees closing by wrapping with a filter input/output stream but
> this costs an additional method call for each input/output stream call.
>
>
> On Tue, Jan 30, 2018 at 10:04 AM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>> Hi guys,
>>
>> All is in the subject ;)
>>
>>> Rationale is to support any I/O library and not fail when the close
>> encapsulated.
>>
>> Any blocker to swallow this close call?
>>
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github
>>  | LinkedIn
>> 
>>
>
>

>>>
>>
>
>


Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Jean-Baptiste Onofré
Hi all,

just a quick reminder about the vote process:

1. Any vote can be changed during the vote period. A -1 vote has to be justified
(especially if there is no change to make in the project codebase).
2. For the convenience of the release manager, please indicate whether your vote is
binding or non-binding (votes from PMC members are binding).
3. It's not possible to "veto" a release: if we have at least 3 binding votes,
the vote can pass.
4. Please keep only votes in this thread. If you have tests in progress,
please use another thread. It would be great if the thread only contained
concrete votes.
5. The vote duration can be extended on request.

So, I'm extending this vote by 72 more hours to give us time to review,
in particular, the Dataflow worker image tests and the potential Flink TextIO issue.

Thanks !
Regards
JB

On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
> Hi everyone,
> 
> Please review and vote on the release candidate #1 for the version 2.3.0, as
> follows:
> 
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
> 
> 
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org [2],
> which is signed with the key with fingerprint C8282E76 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.3.0-RC1" [5],
> * website pull request listing the release and publishing the API reference
> manual [6].
> * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
> 
> The vote will be open for at least 72 hours. It is adopted by majority 
> approval,
> with at least 3 PMC affirmative votes.
> 
> Thanks,
> JB
> 
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341608
> [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1026/
> [5] https://github.com/apache/beam/tree/v2.3.0-RC1
> [6] https://github.com/apache/beam-site/pull/381
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-31 Thread Romain Manni-Bucau
Yep, but it adds one more step to working in Beam - don't forget you have
already passed through about 10 steps by the time you end up at coders.

My view was that skipping the first close was a win-win for Beam and users, but
technically you are right: users can always do it themselves.

Le 1 févr. 2018 07:16, "Lukasz Cwik"  a écrit :

> I'm not sure what you mean by it closes the door since as the caller of
> the library you can create a wrapper filter input stream that ignores close
> calls effectively overriding what happens in the UnownedInputStream.
>
> On Wed, Jan 31, 2018 at 10:08 PM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>>
>>
>> Le 1 févr. 2018 03:10, "Lukasz Cwik"  a écrit :
>>
>> Yes, people will write bad coders which is why this is there. No, I don't
>> think swallowing one close is what we want.
>>
>> In the case where you wants to pass an input/output stream to a library
>> that incorrectly assumes ownership, the input/output stream should be
>> wrapped right before the call to the library with a filter input/output
>> stream that swallows the close and not propagate ignoring this bad behavior
>> elsewhere.
>>
>>
>> Hmm,
>>
>> Elsewhere is nowhere else here since it wouldnt have any side effect
>> except not enforcing another layer and making smoothly work for most
>> mappers.
>>
>> Anyway I can live with it but I'm a bit sad it closes the door to the
>> easyness to write extensions.
>>
>>
>> On Wed, Jan 31, 2018 at 12:04 PM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> Hmm, here we are the ones owning the call since it is in a coder, no? Do
>>> we assume people will badly implement coders? In this particular case we
>>> can assume close() will be called by a framework I think.
>>> What about swallowing one close() and fail on the second?
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau  |  Blog
>>>  | Old Blog
>>>  | Github
>>>  | LinkedIn
>>> 
>>>
>>> 2018-01-31 20:59 GMT+01:00 Lukasz Cwik :
>>>
 Because people write code like:
 myMethod(InputStream in) {
   InputStream child = new InputStream(in);
   child.close();
 }

 InputStream in = new FileInputStream(... path ...);
 myMethod(in);
 myMethod(in);

 An exception will be thrown when the second myMethod call occurs.

 Unfortunately not everyone wraps their calls to a coder with an
 UnownedInputStream or a filter input stream which drops close() calls is
 why its a problem and in the few places it is done it is used to prevent
 bugs from creeping in.



 On Tue, Jan 30, 2018 at 11:29 AM, Romain Manni-Bucau <
 rmannibu...@gmail.com> wrote:

> I get the issue but I don't get the last part. Concretely we can
> support any lib by just removing the exception in the close, no? What 
> would
> be the issue? No additional wrapper, no lib integration issue.
>
>
> Romain Manni-Bucau
> @rmannibucau  |  Blog
>  | Old Blog
>  | Github
>  | LinkedIn
> 
>
> 2018-01-30 19:29 GMT+01:00 Lukasz Cwik :
>
>> Its common in the code base that input and output streams are passed
>> around and the caller is responsible for closing it, not the callee. The
>> UnownedInputStream is to guard against libraries that are poorly behaved
>> and assume they get ownership of the stream when it is given to them.
>>
>> In the code:
>> myMethod(InputStream in) {
>>   InputStream child = new InputStream(in);
>>   child.close();
>> }
>>
>> InputStream in = ...
>> myMethod(in);
>> myMethod(in);
>> When should "in" be closed?
>>
>> To get around this issue, create a filter input/output stream that
>> ignores close calls like on the JAXB coder:
>> https://github.com/apache/beam/blob/master/sdks/java/io/xml/
>> src/main/java/org/apache/beam/sdk/io/xml/JAXBCoder.java#L181
>>
>> We can instead swap around this pattern so that the caller guards
>> against callees closing by wrapping with a filter input/output stream but
>> this costs an additional method call for each input/output stream call.
>>
>>
>> On Tue, Jan 30, 2018 at 10:04 AM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> Hi guys,
>>>
>>> All is in the subject ;)
>>>
>>> Rationale is to support any I/O library and not fail when the close
>>> is encapsulated.
>>>
>>> Any blocker to swallow this close call?
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau  |  Blog
>>>  | Old Blog
>>>