Re: [VOTE] Release 2.16.0, release candidate #1

2019-10-03 Thread Ahmet Altay
I see most of the release validations have been completed and marked in the
spreadsheet. Thank you all for doing that. If you have not validated/voted
yet, please take a look at the release candidate.

On Thu, Oct 3, 2019 at 7:59 AM Thomas Weise  wrote:

> I think there is a different reason why the release manager should
> probably merge/approve all PRs that go into the release branch while the
> release is in progress:
>
> If/when the need arises for another RC, then only those changes should be
> included that are deemed blockers or explicitly agreed. Otherwise the
> release can potentially be delayed by modifications that invalidate prior
> verification or introduce new instability.
>

I agree with this reasoning. It expresses my concern more clearly.


>
> Thomas
>
>
> On Thu, Oct 3, 2019 at 3:12 AM Maximilian Michels  wrote:
>
>>  > For the next time, may I suggest asking the release manager to do the
>>  > merging to the release branch. We do not know whether there will be an
>>  > RC2 or not. And if there is no RC2, the release branch as of now
>>  > does not directly correspond to what will be released.
>>
>> The ground truth for releases are the release tags, not the release
>> branches. Downstream projects should not depend on the release branches.
>> Release branches are merely important for the process of creating a
>> release, but they lose validity after the RC has been created and
>> released.
>>
>> On 02.10.19 11:45, Ahmet Altay wrote:
>> > +1 (validated python quickstarts). Thank you Mark.
>> >
>> > On Wed, Oct 2, 2019 at 10:49 AM Maximilian Michels wrote:
>> >
>> > Thanks for preparing the release, Mark! I would like to address
>> > https://issues.apache.org/jira/browse/BEAM-8303 in the release.
>> I've
>> > already merged the fix to the release-2.16.0 branch. If we do
>> another
>> > RC, we could include it. As a user is blocked on this, I would not
>> vote
>> > +1 for this RC, but I also do not want to block the release process.
>> >
>> >
>> > Max, thank you for the clear communication for the importance and at
>> the
>> > same time non-blocking status of the issue.
>> >
>> > For the next time, may I suggest asking the release manager to do the
>> > merging to the release branch. We do not know whether there will be an
>> > RC2 or not. And if there is no RC2, the release branch as of now
>> > does not directly correspond to what will be released.
>> >
>> >
>> > On 01.10.19 09:18, Mark Liu wrote:
>> >  > Hi everyone,
>> >  >
>> >  > Please review and vote on the release candidate #1 for the
>> version
>> >  > 2.16.0, as follows:
>> >  > [ ] +1, Approve the release
>> >  > [ ] -1, Do not approve the release (please provide specific
>> comments)
>> >  >
>> >  >
>> >  > The complete staging area is available for your review, which
>> > includes:
>> >  > * JIRA release notes [1],
>> >  > * the official Apache source release to be deployed to
>> > dist.apache.org [2], which is signed with the key with
>> >  > fingerprint C110B1C82074883A4241D977599D6305FF3ABB32 [3],
>> >  > * all artifacts to be deployed to the Maven Central Repository
>> [4],
>> >  > * source code tag "v2.16.0-RC1" [5],
>> >  > * website pull request listing the release [6], publishing the
>> API
>> >  > reference manual [7], and the blog post [8].
>> >  > * Python artifacts are deployed along with the source release to
>> the
>> >  > dist.apache.org [2].
>> >  > * Validation sheet with a tab for 2.16.0 release to help with
>> > validation
>> >  > [9].
>> >  > * Docker images published to Docker Hub [10].
>> >  >
>> >  > The vote will be open for at least 72 hours. It is adopted by
>> > majority
>> >  > approval, with at least 3 PMC affirmative votes.
>> >  >
>> >  > Thanks,
>> >  > Mark Liu, Release Manager
>> >  >
>> >  > [1]
>> >  >
>> >
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345494
>> >  > [2] https://dist.apache.org/repos/dist/dev/beam/2.16.0/
>> >  > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> >  > [4]
>> >
>> https://repository.apache.org/content/repositories/orgapachebeam-1085/
>> >  > [5] https://github.com/apache/beam/tree/v2.16.0-RC1
>> >  > [6] https://github.com/apache/beam/pull/9667
>> >  > [7] https://github.com/apache/beam-site/pull/593
>> >  > [8] https://github.com/apache/beam/pull/9671
>> >  > [9]
>> >  >
>> >
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=890914284
>> >  > [10] https://hub.docker.com/u/apachebeam
>> >
>>
>


Feature addition to java CassandraIO connector

2019-10-03 Thread Vincent Marquez
Currently the CassandraIO connector allows a user to specify a table, and
the CassandraSource object generates a list of queries based on token
ranges of the table, along with grouping them by the token ranges.

I often need to run (generated, sometimes a million+) queries against a
subset of a table. Rather than providing a filter, it is easier and much
more performant to supply a collection of queries along with their tokens
to both partition and group by, instead of letting CassandraIO naively run
over the entire table or with a simple filter.

I propose that, in addition to the current method of supplying a table and
filter, we also allow the user to pass in a collection of queries and
tokens. The current way CassandraSource breaks up the table could be
modified to build on top of the proposed implementation to reduce code
duplication as well. If this sounds like an acceptable alternative way of
using the CassandraIO connector, I don't mind giving it a shot with a pull
request.
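To illustrate the proposal, the grouping idea can be sketched with plain Java collections. This is only a hedged sketch of the proposed input shape; all names here (TokenQuery, groupByTokenRange, rangeSize) are hypothetical and are not part of the actual CassandraIO API:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of the proposed input: user-supplied queries, each
// tagged with a token, grouped into partitions by token range. This mirrors
// how CassandraSource groups its generated queries, but for user queries.
public class TokenGroupedQueries {
    // A query plus the token used for partitioning/grouping.
    record TokenQuery(String cql, long token) {}

    // Bucket queries so each bucket can map to one source split.
    static Map<Long, List<String>> groupByTokenRange(List<TokenQuery> queries,
                                                     long rangeSize) {
        return queries.stream().collect(Collectors.groupingBy(
            q -> Math.floorDiv(q.token(), rangeSize),
            Collectors.mapping(TokenQuery::cql, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<TokenQuery> queries = List.of(
            new TokenQuery("SELECT * FROM t WHERE k = 1", 5L),
            new TokenQuery("SELECT * FROM t WHERE k = 2", 7L),
            new TokenQuery("SELECT * FROM t WHERE k = 3", 25L));
        Map<Long, List<String>> grouped = groupByTokenRange(queries, 10L);
        System.out.println(grouped.get(0L).size()); // tokens 5 and 7 share a range
        System.out.println(grouped.get(2L).size()); // token 25 is alone in its range
    }
}
```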

If there is a better way of doing this, I'm eager to hear and learn.
Thanks for reading!


Re: Share Outreachy project progress

2019-10-03 Thread Kenneth Knowles
Some links for reference:

https://www.outreachy.org/docs/applicant/#outreachy-schedule

https://www.outreachy.org/communities/cfp/apache/

Kenn

On Wed, Oct 2, 2019 at 11:37 AM Rui Wang  wrote:

> Hi Community,
>
> I submitted an Outreachy project proposal on behalf of Apache Beam (it's
> related to BeamSQL, yeah!). The workflow of the Outreachy project is: in Oct
> 2019, applicants should try to make contributions to projects (for us it's
> Beam). After that, there will be a final match between applicants and
> projects. In the current step some applicants will ask to contribute to
> Beam, so don't be surprised by it. It's by design.
>
>
> -Rui
>


Re: Multiple iterations after GroupByKey with SparkRunner

2019-10-03 Thread Kenneth Knowles
On Tue, Oct 1, 2019 at 5:35 PM Robert Bradshaw  wrote:

> For this specific usecase, I would suggest this be done via
> PTranform URNs. E.g. one could have a GroupByKeyOneShot whose
> implementation is
>
> input
>     .apply(GroupByKey.of())
>     .apply(kv -> KV.of(kv.key(), kv.iterator()))
>

This is dual to what I clumsily was trying to say in my last paragraph. But
I agree that ReduceByKey is better, if we were to add any new primitive
transform. I very much dislike PCollection<KV<K, Iterator<V>>> for just the
reasons you also mention.

I think the annotation route where @ProcessElement can accept a different
type of element seems less intrusive and more flexible.


> On Tue, Oct 1, 2019 at 2:16 AM Jan Lukavský  wrote:
>
>> The car analogy was meant to say, that in real world you have to make
>> decision before you take any action. There is no retroactivity possible.
>>
> Reuven pointed out, that it is possible (although it seems a little weird
>> to me, but that is the only thing I can tell against it :-)), that the way
>> a grouped PCollection is produced might be out of control of a consuming
>> operator. One example of this might be, that the grouping is produced in a
>> submodule (some library), but still, the consumer wants to be able to
>> specify if he wants or doesn't want reiterations.
>>
Exactly. The person choosing to GroupByKey and the person writing the
one-shot ParDo must be assumed to be different people, in general.

FWIW I always think of the pipeline as a program and the runner as a
planner/optimizer, so it is always responsible for reordering and making
physical planning decisions like whether to create an iterable
materialization or just some streamed iterator.

If we move on, our next option might be to specify the annotation on the
>> consumer (as suggested), but that has all the "not really nice" properties
>> of being counter-intuitive, ignoring strong types, etc., etc., for which
>> reason I think that this should be ruled out as well.
>>
> In Beam we have taken the position that type checking the graph after it
is constructed is an early enough place to catch type errors (speaking for
Java). The validation that ParDo does on the DoFn is basically lightweight,
local, type checking. This is how we detect and type check stateful ParDo
transforms as well as splittable ParDo transforms. We also catch errors
that are not expressible in Java's type system.
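The "lightweight, local type checking" of a DoFn can be illustrated with a toy reflection check. This is a deliberately simplified sketch; the names below are hypothetical, and Beam's real DoFnSignatures validation is far more involved:

```java
import java.lang.annotation.*;
import java.lang.reflect.*;

// Toy illustration of graph-construction-time validation: scan a DoFn-like
// class for its annotated process method and reject shapes the runner
// cannot handle, catching errors Java's type system cannot express.
public class SignatureCheck {
    @Retention(RetentionPolicy.RUNTIME)
    @interface ProcessElement {}

    static class GoodFn {
        @ProcessElement
        public void process(String element) {}
    }

    // Returns the single @ProcessElement method, or throws at pipeline
    // construction time if the class does not have exactly one.
    static Method validate(Class<?> fnClass) {
        Method found = null;
        for (Method m : fnClass.getDeclaredMethods()) {
            if (m.isAnnotationPresent(ProcessElement.class)) {
                if (found != null)
                    throw new IllegalStateException("multiple @ProcessElement");
                found = m;
            }
        }
        if (found == null)
            throw new IllegalStateException("missing @ProcessElement");
        return found;
    }

    public static void main(String[] args) {
        System.out.println(validate(GoodFn.class).getName()); // process
    }
}
```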

If we were discussing just this Spark limitation/optimization this is a
very natural fit with what we already have: Give the runner all the
information about the nature of the transforms and user functions, and let
it make the best plan it can.

So to me the interesting part is that there is a DSL that wants to support
primitives that are strictly weaker than Beam's, in order to *only* allow
the oneshot path. Annotations are quite annoying for DSLs, as you may have
noticed for state & timers, so that is not a good fit. But the concepts
still work. I would suggest pivoting this thread into how to allow a DSL
builder to directly provide a DoFnInvoker with a DoFnSignature, in order to
programmatically provide the same information that annotations are used
for. Essentially this exposes an IR to DSL authors rather than forcing them
to work with the source language meant for end users. Do you already have a
solution for this today?
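The idea of handing the runner an explicit signature instead of annotations can be sketched in plain Java. All names here (Signature, InputMode, planMaterialization) are hypothetical stand-ins; Beam's real DoFnSignature and DoFnInvoker classes are considerably richer:

```java
import java.util.function.*;

// Hedged sketch: a DSL builds a "signature" object stating directly what
// the annotations would have declared, with no reflection involved, and a
// runner consults it to pick a physical plan.
public class ExplicitSignature {
    enum InputMode { ITERABLE, ONE_SHOT_ITERATOR }

    record Signature<T>(InputMode inputMode, Consumer<T> processElement) {}

    // A DSL constructs the signature programmatically.
    static <T> Signature<T> oneShot(Consumer<T> fn) {
        return new Signature<>(InputMode.ONE_SHOT_ITERATOR, fn);
    }

    // A "runner" consults the declared mode to pick a materialization.
    static String planMaterialization(Signature<?> sig) {
        return sig.inputMode() == InputMode.ONE_SHOT_ITERATOR
            ? "streamed-iterator" : "buffered-iterable";
    }

    public static void main(String[] args) {
        Signature<String> sig = oneShot(s -> {});
        System.out.println(planMaterialization(sig)); // streamed-iterator
    }
}
```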

Kenn

This leaves us with a single option (at least I have not figured out any
>> other) - which is we can bundle GBK and associated ParDo into atomic
>> PTransform, which can then be overridden by runners that need special
>> handling of this situation - these are all runners that need buffer data to
>> memory in order to support reiterations (spark and flink, note that this
>> problem arises only for batch case, because in streaming case, one can
>> reasonably assume that the data resides in a state that supports
>> reiterations). But - we already have this PTransform in Euphoria, it is
>> called ReduceByKey, and has all the required properties (technically, it is
>> not a PTransform now, but that is a minor detail and can be changed
>> trivially).
>>
>> So, the direction I was trying to take this discussion was - what could
>> be the best way for a runner to natively support a PTransform from a DSL? I
>> can imagine several options:
>>
>>  a) support it directly and let runners depend on the DSL (compileOnly
>> dependency might suffice, because users will include the DSL into their
>> code to be able to use it)
>>
>>  b) create an interface in runners for user-code to be able to provide
>> translation for user-specified operators (this could be absolutely generic,
>> DSLs might just use this feature the same way any user could), after all
>> runners already use a concept of Translator, but that is pretty much
>> copy-pasted, not abstracted into a general purpose one
>>
>>  c) move the operators that need to be translated into core
>>
>> The option (c) then leaves open questions related to - if we would want
>> to move other operators to core, would this be the right time to ask
>> questions if our current set of "core" operators is the ideal one? Or
>> could this be optimized?

Re: Outreachy applicant

2019-10-03 Thread Rui Wang
Hi Ismaël,

Sorry, I wasn't aware that you also have a project. Carlos has contacted me
about SQL before. Next time I will ask people to also include their project
of interest in introduction emails.


-Rui

On Thu, Oct 3, 2019 at 9:47 AM Rui Wang  wrote:

> Hi Ismaël,
>
> Carlos is an Outreachy applicant so I will take care of starter tasks.
>
>
> -Rui
>
> On Thu, Oct 3, 2019 at 6:31 AM Ismaël Mejía  wrote:
>
>> Hello Carlos!
>>
>> Just added you as a contributor so you can self-assign the tickets you
>> want to work on.
>> What project are you interested in?
>>
>> Regards,
>> Ismaël
>>
>>
>> On Thu, Oct 3, 2019 at 8:27 AM Carlos Oceguera 
>> wrote:
>> >
>> > I forgot to add my jira user: "chefoce", sorry
>> >
>> > -- Forwarded message -
>> > From: Carlos Oceguera 
>> > Date: Thu., October 3, 2019, 12:24 a.m.
>> > Subject: Outreachy applicant
>> > To: 
>> >
>> >
>> > Hi, my name is Carlos. I'm from Mexico. I want to contribute to the
>> project, and I would be very grateful for the opportunity.
>>
>


Re: Outreachy applicant

2019-10-03 Thread Rui Wang
Hi Ismaël,

Carlos is an Outreachy applicant so I will take care of starter tasks.


-Rui

On Thu, Oct 3, 2019 at 6:31 AM Ismaël Mejía  wrote:

> Hello Carlos!
>
> Just added you as a contributor so you can self-assign the tickets you
> want to work on.
> What project are you interested in?
>
> Regards,
> Ismaël
>
>
> On Thu, Oct 3, 2019 at 8:27 AM Carlos Oceguera 
> wrote:
> >
> > I forgot to add my jira user: "chefoce", sorry
> >
> > -- Forwarded message -
> > From: Carlos Oceguera 
> > Date: Thu., October 3, 2019, 12:24 a.m.
> > Subject: Outreachy applicant
> > To: 
> >
> >
> > Hi, my name is Carlos. I'm from Mexico. I want to contribute to the
> project, and I would be very grateful for the opportunity.
>


Re: Multiple iterations after GroupByKey with SparkRunner

2019-10-03 Thread Reuven Lax
Ok - now I see what you're talking about. You are focusing on the Java
types in the Java SDK, where the output of GBK is an Iterable type (which
should always be reiterable). I was talking more abstractly about the
programming model, i.e. the portability representation of the graph.

In this case I think that Robert's suggestion for the Java SDK is the right
one. Create a new transform, and have runners optimize it away if necessary.

On Wed, Oct 2, 2019 at 2:19 AM Jan Lukavský  wrote:

>
> On 10/2/19 4:30 AM, Reuven Lax wrote:
>
>
>
> On Mon, Sep 30, 2019 at 2:02 AM Jan Lukavský  wrote:
>
>> > The fact that the annotation on the ParDo "changes" the GroupByKey
>> implementation is very specific to the Spark runner implementation.
>>
>> I don't quite agree. It is not very specific to Spark, it is specific to
>> generally all runners, that produce grouped elements in a way that is not
>> reiterable. That is the key property. The example you gave with HDFS does
>> not satisfy this condition (files on HDFS are certainly reiterable), and
>> that's why no change to the GBK is needed (it actually already has the
>> required property). A quick look at the FlinkRunner (at least the
>> non-portable one) shows that it implements GBK by reducing elements into
>> a List. That is going to crash on a big PCollection, which is even nicely
>> documented:
>>
>>* For internal use to translate {@link GroupByKey}. For a large {@link 
>> PCollection} this is
>>* expected to crash!
>>
>>
>> If this is fixed, then it is likely to start behave the same as Spark. So
>> actually I think the opposite is true - Dataflow is a special case, because
>> of how its internal shuffle service works.
>>
>
> I think you misunderstood - I was not trying to dish on the Spark runner.
> Rather my point is that whether the GroupByKey implementation is affected
> or not is runner dependent. In some runners it is and in others it isn't.
> However in all cases the *semantics* of the ParDo is affected. Since Beam
> tries as much as possible to be runner agnostic, we should default to
> making the change where there is an obvious semantic difference.
>
>
> I understand that, but I just don't think that the semantics should be
> affected by this. If the outcome of GBK is an Iterable, then it should be
> reiterable; that is how Iterable works. So I now lean more towards the
> conclusion that the current behavior of the Spark runner simply breaks
> this contract. A solution would be to introduce the proposed
> GroupByKeyOneShot or Reduce(By|Per)Key.
>
>
>
> > In general I sympathize with the worry about non-local effects. Beam is
>> already full of them (e.g. a Window.into statement effects downstream
>> GroupByKeys). In each case where they were added there was extensive debate
>> and discussion (Windowing semantics were debated for many months), exactly
>> because there was concern over adding these non-local effects. In every
>> case, no other good solution could be found. For the case of windowing for
>> example, it was often easy to propose simple local APIs (e.g. just pass the
>> window fn as a parameter to GroupByKey), however all of these local
>> solutions ended up not working for important use cases when we analyzed
>> them more deeply.
>>
>> That is very interesting. Could you elaborate more about some examples of
>> the use cases which didn't work? I'd like to try to match it against how
>> Euphoria is structured; it should be more resistant to these non-local
>> effects, because it very often bundles multiple Beam primitives into a
>> single transform. ReduceByKey is one example of this: it is actually a
>> mix of Window.into() + GBK + ParDo. It might look like this transform is
>> not primitive because it can be broken down into something else (Euphoria
>> has no native equivalent of GBK itself), but it has several other nice
>> implications - namely that Combine now becomes a special case of RBK. It
>> now becomes only a question of where and how you can "run" the reduce
>> function; the logic is absolutely equal. This can be worked out in more
>> detail to actually show that even Combine and RBK can be described by a
>> more general stateful operation (ReduceStateByKey), and so finally
>> Euphoria actually has only two really "primitive" operations - these are
>> FlatMap (basically stateless ParDo) and RSBK. As I already mentioned on
>> some other thread, when stateful ParDo supports merging windows, it can
>> be shown that both Combine and GBK become special cases of it.
>>
>> > As you mentioned below, I do think it's perfectly reasonable for a DSL
>> to impose its own semantics. Scio already does this - the raw Beam API is
>> used by a DSL as a substrate, but the DSL does not need to blindly mirror
>> the semantics of the raw Beam API - at least in my opinion!
>>
>> Sure, but currently, there is no way for DSL to "hook" into runner, so it
>> has to use raw Beam SDK, and so this will fail in cases like this - where
>> Beam actuall
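The claim above that Combine is just a special case of ReduceByKey can be sketched in plain Java. This is a hedged illustration with hypothetical names, not Euphoria's actual API: a single reduce-by-key primitive that only ever hands the reduce function a one-shot stream, with "Combine" being nothing more than a particular choice of reduce function:

```java
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

// Sketch of a ReduceByKey-style primitive: group by key, then apply a
// user-supplied reduce function to a one-shot stream of values per key.
public class ReduceByKeySketch {
    static <K, V, O> Map<K, O> reduceByKey(List<Map.Entry<K, V>> input,
                                           Function<Stream<V>, O> reduceFn) {
        Map<K, List<V>> grouped = input.stream().collect(Collectors.groupingBy(
            Map.Entry::getKey,
            Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        Map<K, O> out = new HashMap<>();
        grouped.forEach((k, vs) -> out.put(k, reduceFn.apply(vs.stream())));
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = List.of(
            Map.entry("a", 1), Map.entry("a", 2), Map.entry("b", 3));
        // A Combine (here, a sum) is just one particular reduce function.
        Map<String, Integer> sums =
            reduceByKey(input, s -> s.mapToInt(Integer::intValue).sum());
        System.out.println(sums.get("a")); // 3
        System.out.println(sums.get("b")); // 3
    }
}
```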

Re: [DISCUSS] Beam SQL filter push-down

2019-10-03 Thread Kenneth Knowles
** Bumping this thread especially if you are an IO author **

Really glad you are working on this. The basic idea in your doc seems good.

It seems that mostly Beam SQL contributors have commented on it. There are
many more people who may be interested in this and have valuable feedback,
such as authors of IO connectors. Your examples showing the great
difference between BigQuery and MongoDB made me think of this. Right now
very few IO connectors have SQL adapters. But at some point almost all of
them should have SQL adapters and should do their best to support pushdown.
This may require changes to the pure Java connector to unlock some
capabilities in the underlying storage system.

Kenn

On Mon, Sep 30, 2019 at 11:04 AM Kirill Kozlov 
wrote:

> The objective is to create a universal way for Beam SQL IO APIs to support
> filter/project push-down.
> A proposed way to achieve that is by introducing an interface
> responsible for identifying what portion(s) of a Calc can be moved down to
> IO layer. Also, adding following methods to a BeamSqlTable interface to
> pass necessary parameters to IO APIs:
> - BeamSqlTableFilter supportsFilter(RexNode program, RexNode filter)
> - Boolean supportsProjects()
> - PCollection<Row> buildIOReader(PBegin begin, BeamSqlTableFilter
> filters, List<String> fieldNames)
>
> Please feel free to provide feedback and suggestions on this proposal.
> Thank you!
>
> Here is a more complete design doc:
> https://docs.google.com/document/d/1-ysD7U7qF3MAmSfkbXZO_5PLJBevAL9bktlLCerd_jE/edit?usp=sharing
>
> --
> Kirill Kozlov
>
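The core push-down decision described in the proposal (the table declares which parts of a filter it can handle; the rest stays in the Calc) can be illustrated with a much-simplified sketch. The clause representation and names below are hypothetical and far simpler than Calcite's RexNode:

```java
import java.util.*;

// Hedged sketch: split a conjunction of filter clauses into the part the
// table can evaluate natively and the residual the Calc must still apply.
public class FilterPushDownSketch {
    record Clause(String column, String op) {}

    // Stand-in for a supportsFilter-style check: this toy "table" can only
    // push equality predicates on indexed columns.
    static boolean canPush(Clause c, Set<String> indexedColumns) {
        return c.op().equals("=") && indexedColumns.contains(c.column());
    }

    static Map<Boolean, List<Clause>> split(List<Clause> filter,
                                            Set<String> indexedColumns) {
        Map<Boolean, List<Clause>> out = new HashMap<>();
        out.put(true, new ArrayList<>());
        out.put(false, new ArrayList<>());
        for (Clause c : filter) out.get(canPush(c, indexedColumns)).add(c);
        return out;
    }

    public static void main(String[] args) {
        List<Clause> filter = List.of(
            new Clause("id", "="), new Clause("name", "LIKE"));
        Map<Boolean, List<Clause>> parts = split(filter, Set.of("id"));
        System.out.println(parts.get(true).size());  // 1 pushed to the source
        System.out.println(parts.get(false).size()); // 1 left in the Calc
    }
}
```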


Re: Multiple iterations after GroupByKey with SparkRunner

2019-10-03 Thread Reuven Lax
Putting a stateful DoFn after a GBK is not completely redundant - the
element type changes, so it is different from just having a stateful DoFn.
However it is a weird thing to do, and usually not optimal (especially
because many runners might insert two shuffles in this case).

On Wed, Oct 2, 2019 at 2:24 AM Jan Lukavský  wrote:

> +1
>
> The difference between GroupByKeyOneShot and Reduce(By|Per)Key is probably
> only that in the first case one can pass the result to a stateful ParDo.
> The latter has stricter semantics, so the user is a little limited in what
> he can do with the result of the grouping. It seems to me, though, that
> applying a stateful operation to the result of a grouping makes little
> sense, because a stateful operation performs this grouping (keying)
> automatically, so the preceding GroupByKeyOneShot would be somewhat
> redundant. But maybe someone can provide a different insight.
> On 10/2/19 2:34 AM, Robert Bradshaw wrote:
>
> For this specific usecase, I would suggest this be done via
> PTranform URNs. E.g. one could have a GroupByKeyOneShot whose
> implementation is
>
> input
>     .apply(GroupByKey.of())
>     .apply(kv -> KV.of(kv.key(), kv.iterator()))
>
> A runner would be free to recognize and optimize this in the graph (based
> on its urn) and swap out a more efficient implementation. Of course a
> Coder for Iterators would have to be introduced, and the semantics of a
> PCollection of Iterators are a bit odd due to the inherently mutable
> nature of Iterators. (Possibly a ReducePerKey transform would be a better
> abstraction.)
>
>
> On Tue, Oct 1, 2019 at 2:16 AM Jan Lukavský  wrote:
>
>> The car analogy was meant to say, that in real world you have to make
>> decision before you take any action. There is no retroactivity possible.
>>
>> Reuven pointed out, that it is possible (although it seems a little weird
>> to me, but that is the only thing I can tell against it :-)), that the way
>> a grouped PCollection is produced might be out of control of a consuming
>> operator. One example of this might be, that the grouping is produced in a
>> submodule (some library), but still, the consumer wants to be able to
>> specify if he wants or doesn't want reiterations. There still is a
>> "classical" solution to this - the library might expose an interface to
>> specify a factory for the grouped PCollection, so that the user of the
>> library will be able to specify what he wants. But we can say, that we
>> don't want to force users (or authors of libraries) to do that. That's okay
>> for me.
>>
>> If we move on, our next option might be to specify the annotation on the
>> consumer (as suggested), but that has all the "not really nice" properties
>> of being counter-intuitive, ignoring strong types, etc., etc., for which
>> reason I think that this should be ruled out as well.
>>
>> This leaves us with a single option (at least I have not figured out any
>> other) - which is we can bundle GBK and associated ParDo into atomic
>> PTransform, which can then be overridden by runners that need special
>> handling of this situation - these are all runners that need buffer data to
>> memory in order to support reiterations (spark and flink, note that this
>> problem arises only for batch case, because in streaming case, one can
>> reasonably assume that the data resides in a state that supports
>> reiterations). But - we already have this PTransform in Euphoria, it is
>> called ReduceByKey, and has all the required properties (technically, it is
>> not a PTransform now, but that is a minor detail and can be changed
>> trivially).
>>
>> So, the direction I was trying to take this discussion was - what could
>> be the best way for a runner to natively support a PTransform from a DSL? I
>> can imagine several options:
>>
>>  a) support it directly and let runners depend on the DSL (compileOnly
>> dependency might suffice, because users will include the DSL into their
>> code to be able to use it)
>>
>>  b) create an interface in runners for user-code to be able to provide
>> translation for user-specified operators (this could be absolutely generic,
>> DSLs might just use this feature the same way any user could), after all
>> runners already use a concept of Translator, but that is pretty much
>> copy-pasted, not abstracted into a general purpose one
>>
>>  c) move the operators that need to be translated into core
>>
>> The option (c) then leaves open questions related to - if we would want
>> to move other operators to core, would this be the right time to ask
>> questions if our current set of "core" operators is the ideal one? Or could
>> this be optimized?
>>
>> Jan
>> On 10/1/19 12:32 AM, Kenneth Knowles wrote:
>>
>> In the car analogy, you have something this:
>>
>> Iterable: car
>> Iterator: taxi ride
>>
>> They are related, but not as variations of a common concept.
>>
>> In the discussion of Combine vs RSBK, if the reducer is required to be an
>> associative and commutative operator, then it i

Re: [VOTE] Release 2.16.0, release candidate #1

2019-10-03 Thread Thomas Weise
I think there is a different reason why the release manager should probably
merge/approve all PRs that go into the release branch while the release is
in progress:

If/when the need arises for another RC, then only those changes should be
included that are deemed blockers or explicitly agreed. Otherwise the
release can potentially be delayed by modifications that invalidate prior
verification or introduce new instability.

Thomas


On Thu, Oct 3, 2019 at 3:12 AM Maximilian Michels  wrote:

>  > For the next time, may I suggest asking the release manager to do the
>  > merging to the release branch. We do not know whether there will be an
>  > RC2 or not. And if there is no RC2, the release branch as of now
>  > does not directly correspond to what will be released.
>
> The ground truth for releases are the release tags, not the release
> branches. Downstream projects should not depend on the release branches.
> Release branches are merely important for the process of creating a
> release, but they lose validity after the RC has been created and released.
>
> On 02.10.19 11:45, Ahmet Altay wrote:
> > +1 (validated python quickstarts). Thank you Mark.
> >
> > On Wed, Oct 2, 2019 at 10:49 AM Maximilian Michels wrote:
> >
> > Thanks for preparing the release, Mark! I would like to address
> > https://issues.apache.org/jira/browse/BEAM-8303 in the release. I've
> > already merged the fix to the release-2.16.0 branch. If we do another
> > RC, we could include it. As a user is blocked on this, I would not
> vote
> > +1 for this RC, but I also do not want to block the release process.
> >
> >
> > Max, thank you for the clear communication for the importance and at the
> > same time non-blocking status of the issue.
> >
> > For the next time, may I suggest asking the release manager to do the
> > merging to the release branch. We do not know whether there will be an
> > RC2 or not. And if there is no RC2, the release branch as of now
> > does not directly correspond to what will be released.
> >
> >
> > On 01.10.19 09:18, Mark Liu wrote:
> >  > Hi everyone,
> >  >
> >  > Please review and vote on the release candidate #1 for the version
> >  > 2.16.0, as follows:
> >  > [ ] +1, Approve the release
> >  > [ ] -1, Do not approve the release (please provide specific
> comments)
> >  >
> >  >
> >  > The complete staging area is available for your review, which
> > includes:
> >  > * JIRA release notes [1],
> >  > * the official Apache source release to be deployed to
> > dist.apache.org [2], which is signed with the key with
> >  > fingerprint C110B1C82074883A4241D977599D6305FF3ABB32 [3],
> >  > * all artifacts to be deployed to the Maven Central Repository
> [4],
> >  > * source code tag "v2.16.0-RC1" [5],
> >  > * website pull request listing the release [6], publishing the API
> >  > reference manual [7], and the blog post [8].
> >  > * Python artifacts are deployed along with the source release to
> the
> >  > dist.apache.org [2].
> >  > * Validation sheet with a tab for 2.16.0 release to help with
> > validation
> >  > [9].
> >  > * Docker images published to Docker Hub [10].
> >  >
> >  > The vote will be open for at least 72 hours. It is adopted by
> > majority
> >  > approval, with at least 3 PMC affirmative votes.
> >  >
> >  > Thanks,
> >  > Mark Liu, Release Manager
> >  >
> >  > [1]
> >  >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345494
> >  > [2] https://dist.apache.org/repos/dist/dev/beam/2.16.0/
> >  > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> >  > [4]
> >
> https://repository.apache.org/content/repositories/orgapachebeam-1085/
> >  > [5] https://github.com/apache/beam/tree/v2.16.0-RC1
> >  > [6] https://github.com/apache/beam/pull/9667
> >  > [7] https://github.com/apache/beam-site/pull/593
> >  > [8] https://github.com/apache/beam/pull/9671
> >  > [9]
> >  >
> >
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=890914284
> >  > [10] https://hub.docker.com/u/apachebeam
> >
>


Re: Introduction + Support in Comms for Beam!

2019-10-03 Thread Cyrus Maden
Welcome, Maria!

On Thu, Oct 3, 2019 at 9:40 AM Ismaël Mejía  wrote:

> Hello and welcome Maria!
>
> Great to see you at dev@. Thanks for sharing the link on the comm
> framework. Now I am curious on what's next and how this will adapt to
> our community.
>
> Ismaël.
>
>
> On Tue, Oct 1, 2019 at 12:15 AM María Cruz  wrote:
> >
> > Hi everyone,
> > my name is María Cruz, I am from Buenos Aires but I live in the Bay
> Area. I recently became acquainted with the Apache Beam project, and I got a
> chance to meet some of the Beam community at Apache Con North America this
> past September. I'm testing out a communications framework for Open Source
> communities. I'm emailing the list now because I'd like to work on a
> communications strategy for Beam, to make the most of the content you
> produce during Beam Summits.
> >
> > A little bit more about me. I am a communications strategist with 11
> years of experience in the field, 8 of which are in the non-profit sector.
> I started working in Open Source in 2013, when I joined Wikimedia, the
> social movement behind Wikipedia. I now work to support Google Open Source
> projects, and I also volunteer in the communications team of the Apache
> Software Foundation, working closely with Sally (for those of you who know
> her).
> >
> > I will be sending the list a proposal in the coming days. Looking
> forward to hearing from you!
> >
> > Best,
> >
> > María
>


Re: Introduction + Support in Comms for Beam!

2019-10-03 Thread Ismaël Mejía
Hello and welcome Maria!

Great to see you at dev@. Thanks for sharing the link on the comm
framework. Now I am curious on what's next and how this will adapt to
our community.

Ismaël.


On Tue, Oct 1, 2019 at 12:15 AM María Cruz  wrote:
>
> Hi everyone,
> my name is María Cruz, I am from Buenos Aires but I live in the Bay Area. I 
> recently became acquainted with the Apache Beam project, and I got a chance to 
> meet some of the Beam community at Apache Con North America this past 
> September. I'm testing out a communications framework for Open Source 
> communities. I'm emailing the list now because I'd like to work on a 
> communications strategy for Beam, to make the most of the content you produce 
> during Beam Summits.
>
> A little bit more about me. I am a communications strategist with 11 years of 
> experience in the field, 8 of which are in the non-profit sector. I started 
> working in Open Source in 2013, when I joined Wikimedia, the social movement 
> behind Wikipedia. I now work to support Google Open Source projects, and I 
> also volunteer in the communications team of the Apache Software Foundation, 
> working closely with Sally (for those of you who know her).
>
> I will be sending the list a proposal in the coming days. Looking forward to 
> hearing from you!
>
> Best,
>
> María


Re: Outreachy applicant

2019-10-03 Thread Ismaël Mejía
Hello Carlos!

Just added you as a contributor so you can self-assign the tickets you
want to work on.
What project are you interested in?

Regards,
Ismaël


On Thu, Oct 3, 2019 at 8:27 AM Carlos Oceguera  wrote:
>
> I forgot to add my jira user: "chefoce", sorry
>
> -- Forwarded message -
> De: Carlos Oceguera 
> Date: jue., 3 de octubre de 2019 12:24 a. m.
> Subject: Outreachy applicant
> To: 
>
>
> Hi, my name is Carlos. I'm from Mexico and I want to contribute to the project. I 
> would be very grateful for the opportunity.


Re: Beam 2.15.0 SparkRunner issues

2019-10-03 Thread Jan Lukavský

Hi Tim,

can you please elaborate more about some parts?

1) What actually happens in your case? What are the specific settings you 
use?


3) Can you share stacktrace? Is it always the same, or does it change?

The mentioned GroupCombineFunctions.java:202 comes from a Reshuffle, 
which makes little sense to me given the logic you described. Do you use 
the Reshuffle transform or does it expand from some other transform?


Jan

On 10/3/19 9:24 AM, Tim Robertson wrote:

Hi all,

We haven't dug enough into this to know where to log issues, but I'll 
start by sharing here.


After upgrading from Beam 2.10.0 to 2.15.0 we see issues on 
SparkRunner - we suspect all of these are related.


1. spark.default.parallelism is not respected

2. File writing (Avro) with dynamic destinations (grouped into folders 
by a field name) consistently fail with
org.apache.beam.sdk.util.UserCodeException: 
java.nio.file.FileAlreadyExistsException: Unable to rename resource 
hdfs://ha-nn/pipelines/export-20190930-0854/.temp-beam-d4fd89ed-fc7a-4b1e-aceb-68f9d72d50f0/6e086f60-8bda-4d0e-b29d-1b47fdfc88c0 
to 
hdfs://ha-nn/pipelines/export-20190930-0854/7c9d2aec-f762-11e1-a439-00145eb45e9a/verbatimHBaseExport-0-of-1.avro 
as destination already exists and couldn't be deleted.


3. GBK operations that run over 500M small records consistently fail 
with OOM. We tried different configs with 48GB, 60GB, 80GB executor 
memory


Our pipelines are batch: simple transformations with either an 
HBaseSnapshot to Avro files or a merge of records in Avro (the GBK 
issue) pushed to Elasticsearch (it fails upstream of the 
ElasticsearchIO in the GBK stage).


We notice operations that were mapToPair in 2.10.0 become repartition 
operations (mapToPair at GroupCombineFunctions.java:68 becomes 
repartition at GroupCombineFunctions.java:202), which might be related 
to this and looks surprising.


I'll report more as we learn. If anyone has any immediate ideas based 
on their commits or reviews, or if you wish any tests run on other Beam 
versions, please say.


Thanks,
Tim





Re: Beam 2.15.0 SparkRunner issues

2019-10-03 Thread Jozef Vilcek
We do have 2.15.0 Beam batch jobs running on the Spark runner. I did have a
bit of a tricky time with spark.default.parallelism, but in the end it works
fine for us (custom parallelism on source stages and spark.default.parallelism
on all other stages after shuffles).

The tricky part in my case was the interaction between `spark.default.parallelism`
and `beam.bundleSize`. I had a problem where the default parallelism was
enforced on inputs too, therefore splitting them too much or too little.
Configuring bundleSize and custom config on inputs (e.g. Hadoop input
format max/min split size) did the trick. TransformTranslator does make a
decision on the partitioner based on bundleSize, however I am not sure how it
is used later on:
https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L571
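For illustration, the two knobs above might be set together roughly like this on a Spark runner job (the jar name, values, and launcher setup are placeholders, not an exact recipe):

```shell
# Placeholder invocation: spark.default.parallelism drives post-shuffle
# stages, while --bundleSize (a SparkPipelineOptions flag) influences how
# batch sources are split into bundles.
spark-submit \
  --conf spark.default.parallelism=500 \
  my-beam-pipeline.jar \
  --runner=SparkRunner \
  --bundleSize=67108864
```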

On Thu, Oct 3, 2019 at 9:25 AM Tim Robertson 
wrote:

> Hi all,
>
> We haven't dug enough into this to know where to log issues, but I'll
> start by sharing here.
>
> After upgrading from Beam 2.10.0 to 2.15.0 we see issues on SparkRunner -
> we suspect all of these are related.
>
> 1. spark.default.parallelism is not respected
>
> 2. File writing (Avro) with dynamic destinations (grouped into folders by
> a field name) consistently fail with
> org.apache.beam.sdk.util.UserCodeException:
> java.nio.file.FileAlreadyExistsException: Unable to rename resource
> hdfs://ha-nn/pipelines/export-20190930-0854/.temp-beam-d4fd89ed-fc7a-4b1e-aceb-68f9d72d50f0/6e086f60-8bda-4d0e-b29d-1b47fdfc88c0
> to
> hdfs://ha-nn/pipelines/export-20190930-0854/7c9d2aec-f762-11e1-a439-00145eb45e9a/verbatimHBaseExport-0-of-1.avro
> as destination already exists and couldn't be deleted.
>
> 3. GBK operations that run over 500M small records consistently fail with
> OOM. We tried different configs with 48GB, 60GB, 80GB executor memory
>
> Our pipelines are batch: simple transformations with either an
> HBaseSnapshot to Avro files or a merge of records in Avro (the GBK issue)
> pushed to Elasticsearch (it fails upstream of the ElasticsearchIO in the
> GBK stage).
>
> We notice operations that were mapToPair in 2.10.0 become repartition
> operations (mapToPair at GroupCombineFunctions.java:68 becomes
> repartition at GroupCombineFunctions.java:202), which might be related to
> this and looks surprising.
>
> I'll report more as we learn. If anyone has any immediate ideas based on
> their commits or reviews, or if you wish any tests run on other Beam
> versions, please say.
>
> Thanks,
> Tim
>
>
>
>


Re: Introduction + Support in Comms for Beam!

2019-10-03 Thread Maximilian Michels

Hi Mujuzi Moses,

Welcome! I've given you contributor permissions in JIRA.

Cheers,
Max

On 03.10.19 01:07, Mujuzi Moses wrote:
Hello, I am requesting to be added to the contributors list; I am an 
Outreachy applicant.


Regards,

Mujuzi Moses
JIRA Username: iamMujuziMoses

On Wed, Oct 2, 2019, 10:10 PM Kenneth Knowles wrote:


Welcome to dev@beam! And thanks for the interesting link.

Kenn

On Tue, Oct 1, 2019 at 10:11 AM Pablo Estrada wrote:

Welcome Maria! : )

On Tue, Oct 1, 2019 at 8:32 AM Ahmet Altay wrote:

Welcome!

On Tue, Oct 1, 2019 at 3:26 AM Jesse Anderson wrote:

Excellent and welcome!

Jesse Anderson
Managing Director
Big Data Institute
(775) 393 9122 | je...@bigdatainstitute.io

bigdatainstitute.io



On Tue, Oct 1, 2019 at 10:46 AM Łukasz Gajowy wrote:

Welcome! :)

Tue., 1 Oct 2019 at 11:30 Maximilian Michels wrote:

Welcome Maria! Looking forward to your proposal.

Cheers,
Max

On 01.10.19 00:33, Reza Rokni wrote:
 > Welcome!
 >
 > On Tue, 1 Oct 2019 at 11:18, Lukasz Cwik wrote:
 >
 >     Welcome to the community.
 >
 >     On Mon, Sep 30, 2019 at 3:15 PM María Cruz wrote:
 >
 >         Hi everyone,
 >         my name is María Cruz, I am from
Buenos Aires but I live in the
 >         Bay Area. I recently became
acquainted with Apache Beam project,
 >         and I got a chance to meet some of
the Beam community at Apache
 >         Con North America this past
September. I'm testing out a
 >         communications framework
 >         for Open Source communities. I'm
emailing the list now because
 >         I'd like to work on a communications
strategy for Beam, to make
 >         the most of the content you
produce during Beam Summits.
 >
 >         A little bit more about me. I am a
communications strategist
 >         with 11 years of experience in the
field, 8 of which are in the
 >         non-profit sector. I started working
in Open Source in 2013,
 >         when I joined Wikimedia, the social
movement behind Wikipedia. I
 >         now work to support Google Open
Source projects, and I also
 >         volunteer in the communications team
of the Apache Software
 >         Foundation, working closely with
Sally (for those of you who
 >         know her).
 >
 >         I will be sending the list a proposal
in the coming days.
 >         Looking forward to hearing from you!
 >
 >         Best,
 >
 >         María
 >
 >
 >
 

Re: [VOTE] Release 2.16.0, release candidate #1

2019-10-03 Thread Maximilian Michels

> For next time, may I suggest asking the release manager to do the
> merging to the release branch. We do not know whether there will be an
> RC2 or not. And if there is no RC2, the release branch as of now
> does not directly correspond to what will be released.

The ground truth for releases are the release tags, not the release 
branches. Downstream projects should not depend on the release branches. 
Release branches are merely important for the process of creating a 
release, but they lose validity after the RC has been created and released.
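In practice that means downstream consumers should pin a tag rather than track a release branch; for example (tag name taken from this thread, repository setup assumed):

```shell
# Build against the tagged RC, not the moving release-2.16.0 branch.
git fetch origin --tags
git checkout v2.16.0-RC1
```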


On 02.10.19 11:45, Ahmet Altay wrote:

+1 (validated python quickstarts). Thank you Mark.

On Wed, Oct 2, 2019 at 10:49 AM Maximilian Michels wrote:


Thanks for preparing the release, Mark! I would like to address
https://issues.apache.org/jira/browse/BEAM-8303 in the release. I've
already merged the fix to the release-2.16.0 branch. If we do another
RC, we could include it. As a user is blocked on this, I would not vote
+1 for this RC, but I also do not want to block the release process.


Max, thank you for clearly communicating both the importance and, at the 
same time, the non-blocking status of the issue.


For next time, may I suggest asking the release manager to do the 
merging to the release branch. We do not know whether there will be an 
RC2 or not. And if there is no RC2, the release branch as of now 
does not directly correspond to what will be released.



On 01.10.19 09:18, Mark Liu wrote:
 > Hi everyone,
 >
 > Please review and vote on the release candidate #1 for the version
 > 2.16.0, as follows:
 > [ ] +1, Approve the release
 > [ ] -1, Do not approve the release (please provide specific comments)
 >
 >
 > The complete staging area is available for your review, which
includes:
 > * JIRA release notes [1],
 > * the official Apache source release to be deployed to dist.apache.org
 > [2], which is signed with the key with
 > fingerprint C110B1C82074883A4241D977599D6305FF3ABB32 [3],
 > * all artifacts to be deployed to the Maven Central Repository [4],
 > * source code tag "v2.16.0-RC1" [5],
 > * website pull request listing the release [6], publishing the API
 > reference manual [7], and the blog post [8].
 > * Python artifacts are deployed along with the source release to the
 > dist.apache.org [2].
 > * Validation sheet with a tab for 2.16.0 release to help with
validation
 > [9].
 > * Docker images published to Docker Hub [10].
 >
 > The vote will be open for at least 72 hours. It is adopted by
majority
 > approval, with at least 3 PMC affirmative votes.
 >
 > Thanks,
 > Mark Liu, Release Manager
 >
 > [1]
 >

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345494
 > [2] https://dist.apache.org/repos/dist/dev/beam/2.16.0/
 > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 > [4]
https://repository.apache.org/content/repositories/orgapachebeam-1085/
 > [5] https://github.com/apache/beam/tree/v2.16.0-RC1
 > [6] https://github.com/apache/beam/pull/9667
 > [7] https://github.com/apache/beam-site/pull/593
 > [8] https://github.com/apache/beam/pull/9671
 > [9]
 >

https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=890914284
 > [10] https://hub.docker.com/u/apachebeam
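For anyone still validating: a typical signature check of the source release against [2] and [3] looks roughly like this (the artifact filename is a placeholder):

```shell
# Placeholder filenames: fetch the KEYS file [3], import it, then verify
# the detached signature shipped alongside the source archive [2].
wget https://dist.apache.org/repos/dist/release/beam/KEYS
gpg --import KEYS
gpg --verify apache-beam-2.16.0-source-release.zip.asc \
             apache-beam-2.16.0-source-release.zip
```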



Re: Introduction + Support in Comms for Beam!

2019-10-03 Thread Mujuzi Moses
Hello, I am requesting to be added to the contributors list; I am an
Outreachy applicant.

Regards,

Mujuzi Moses
JIRA Username: iamMujuziMoses

On Wed, Oct 2, 2019, 10:10 PM Kenneth Knowles  wrote:

> Welcome to dev@beam! And thanks for the interesting link.
>
> Kenn
>
> On Tue, Oct 1, 2019 at 10:11 AM Pablo Estrada  wrote:
>
>> Welcome Maria! : )
>>
>> On Tue, Oct 1, 2019 at 8:32 AM Ahmet Altay  wrote:
>>
>>> Welcome!
>>>
>>> On Tue, Oct 1, 2019 at 3:26 AM Jesse Anderson 
>>> wrote:
>>>
 Excellent and welcome!

 Jesse Anderson
 Managing Director
 Big Data Institute
 (775) 393 9122 | je...@bigdatainstitute.io
 bigdatainstitute.io 


 On Tue, Oct 1, 2019 at 10:46 AM Łukasz Gajowy 
 wrote:

> Welcome! :)
>
> Tue., 1 Oct 2019 at 11:30 Maximilian Michels wrote:
>
>> Welcome Maria! Looking forward to your proposal.
>>
>> Cheers,
>> Max
>>
>> On 01.10.19 00:33, Reza Rokni wrote:
>> > Welcome!
>> >
>> > On Tue, 1 Oct 2019 at 11:18, Lukasz Cwik wrote:
>> >
>> > Welcome to the community.
>> >
>> > On Mon, Sep 30, 2019 at 3:15 PM María Cruz wrote:
>> >
>> > Hi everyone,
>> > my name is María Cruz, I am from Buenos Aires but I live in
>> the
>> > Bay Area. I recently became acquainted with Apache Beam
>> project,
>> > and I got a chance to meet some of the Beam community at
>> Apache
>> > Con North America this past September. I'm testing out a
>> > communications framework
>> > <https://medium.com/@marianarra_/designing-a-communications-framework-for-community-engagement-e087312f9b83>
>> > for Open Source communities. I'm emailing the list now
>> because
>> > I'd like to work on a communications strategy for Beam, to
>> make
>> > the most of the content you produce during Beam Summits.
>> >
>> > A little bit more about me. I am a communications strategist
>> > with 11 years of experience in the field, 8 of which are in
>> the
>> > non-profit sector. I started working in Open Source in 2013,
>> > when I joined Wikimedia, the social movement behind
>> Wikipedia. I
>> > now work to support Google Open Source projects, and I also
>> > volunteer in the communications team of the Apache Software
>> > Foundation, working closely with Sally (for those of you who
>> > know her).
>> >
>> > I will be sending the list a proposal in the coming days.
>> > Looking forward to hearing from you!
>> >
>> > Best,
>> >
>> > María
>> >
>> >
>> >
>> > --
>> >
>> > This email may be confidential and privileged. If you received this
>> > communication by mistake, please don't forward it to anyone else,
>> please
>> > erase all copies and attachments, and please let me know that it
>> has
>> > gone to the wrong person.
>> >
>> > The above terms reflect a potential business arrangement, are
>> provided
>> > solely as a basis for further discussion, and are not intended to
>> be and
>> > do not constitute a legally binding obligation. No legally binding
>> > obligations will be created, implied, or inferred until an
>> agreement in
>> > final form is executed in writing by all parties involved.
>> >
>>
>


Beam 2.15.0 SparkRunner issues

2019-10-03 Thread Tim Robertson
Hi all,

We haven't dug enough into this to know where to log issues, but I'll start
by sharing here.

After upgrading from Beam 2.10.0 to 2.15.0 we see issues on SparkRunner -
we suspect all of these are related.

1. spark.default.parallelism is not respected

2. File writing (Avro) with dynamic destinations (grouped into folders by a
field name) consistently fail with
org.apache.beam.sdk.util.UserCodeException:
java.nio.file.FileAlreadyExistsException: Unable to rename resource
hdfs://ha-nn/pipelines/export-20190930-0854/.temp-beam-d4fd89ed-fc7a-4b1e-aceb-68f9d72d50f0/6e086f60-8bda-4d0e-b29d-1b47fdfc88c0
to
hdfs://ha-nn/pipelines/export-20190930-0854/7c9d2aec-f762-11e1-a439-00145eb45e9a/verbatimHBaseExport-0-of-1.avro
as destination already exists and couldn't be deleted.

3. GBK operations that run over 500M small records consistently fail with
OOM. We tried different configs with 48GB, 60GB, 80GB executor memory

Our pipelines are batch: simple transformations with either an
HBaseSnapshot to Avro files or a merge of records in Avro (the GBK issue)
pushed to Elasticsearch (it fails upstream of the ElasticsearchIO in the
GBK stage).

We notice operations that were mapToPair in 2.10.0 become repartition
operations (mapToPair at GroupCombineFunctions.java:68 becomes
repartition at GroupCombineFunctions.java:202), which might be related to
this and looks surprising.

I'll report more as we learn. If anyone has any immediate ideas based on
their commits or reviews, or if you wish any tests run on other Beam
versions, please say.

Thanks,
Tim


Re: Outreachy applicant

2019-10-03 Thread Mujuzi Moses
Username: iamMujuziMoses

On Wed, Oct 2, 2019, 11:47 PM Rui Wang  wrote:

> Can you copy your username? That link directs me to my own profile.
>
> -Rui
>
> On Wed, Oct 2, 2019 at 1:34 PM Mujuzi Moses  wrote:
>
>> JIRA: https://issues.apache.org/jira/secure/ViewProfile.jspa
>>
>> On Wed, Oct 2, 2019, 9:00 PM Rui Wang  wrote:
>>
>>> Welcome! Welcome!
>>>
>>> Have you already created an account on [1]? We can add your JIRA account
>>> id to apache beam as the first step.
>>>
>>>
>>> [1]: https://jira.apache.org/jira/secure/Dashboard.jspa
>>>
>>> -Rui
>>>
>>> On Wed, Oct 2, 2019 at 6:26 AM Mujuzi Moses 
>>> wrote:
>>>
 Hello there, my name is Mujuzi Moses from Uganda and I am 22 years old. I am
 an Outreachy applicant and I would like to contribute with my host to
 the [Improve Apache BeamSQL to allow users to better write big data processing
 pipelines] project as part of the application process. Looking forward to
 contributing with you.

>>>