Re: [ANNOUNCE] Apache Beam 2.13.0 released!

2019-06-07 Thread Chad Dombrova
I saw this and was particularly excited about the new support for
"external" transforms in portable runners like python (i.e. the ability to
use the Java KafkaIO transforms, with presumably more to come in the
future).  While the release notes are useful, I will say that it takes a
lot of time and effort to sift through the release notes to find relevant
issues.  They're not grouped by sdk/component, and for example, not all of
the python issues include the word "python" in their title.  It would be
great to have a blurb on the Beam blog explaining the highlights.  An
example of a project that I think does this very well is mypy:
http://mypy-lang.blogspot.com/

thanks!
chad





On Fri, Jun 7, 2019 at 2:58 PM Kyle Weaver  wrote:

> Awesome! Thanks for leading the release Ankur.
>
> On Fri, Jun 7, 2019 at 2:57 PM Ankur Goenka  wrote:
>
>> The Apache Beam team is pleased to announce the release of version 2.13
>> .0!
>>
>> Apache Beam is an open source unified programming model to define and
>> execute data processing pipelines, including ETL, batch and stream
>> (continuous) processing. See https://beam.apache.org
>>
>> You can download the release here:
>>
>> https://beam.apache.org/get-started/downloads/
>>
>> This release includes bugfixes, features, and improvements detailed on
>> the Beam blog: https://beam.apache.org/blog/2019/05/22/beam-2.13.0.html
>>
>> Thanks to everyone who contributed to this release, and we hope you enjoy
>> using Beam 2.13.0.
>>
>> -- Ankur Goenka, on behalf of The Apache Beam team
>>
> --
> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
> | +1650203
>


Re: [ANNOUNCEMENT] Common Pipeline Patterns - new section in the documentation + contributions welcome

2019-06-07 Thread Sergei Sokolenko
and now the news in on the twitterwebs
https://twitter.com/datancoffee/status/1137160729386074113

On Fri, Jun 7, 2019 at 5:52 PM Reza Rokni  wrote:

> +1 on the pattern Tim!
>
> Please raise a Jira with the label pipeline-patterns, details are here:
>
> https://beam.apache.org/documentation/patterns/overview/#contributing-a-pattern
>
>
>
> On Sat, 8 Jun 2019 at 05:04, Tim Robertson 
> wrote:
>
>> This is great. Thanks Pablo and all
>>
>> I've seen several folk struggle with writing avro to dynamic locations
>> which I think might be a good addition. If you agree I'll offer a PR unless
>> someone gets there first - I have an example here:
>>
>> https://github.com/gbif/pipelines/blob/master/pipelines/export-gbif-hbase/src/main/java/org/gbif/pipelines/hbase/beam/ExportHBase.java#L81
>>
>>
>> On Fri, Jun 7, 2019 at 10:52 PM Pablo Estrada  wrote:
>>
>>> Hello everyone,
>>> A group of community members has been working on gathering and providing
>>> common pipeline patterns for pipelines in Beam. These are examples on how
>>> to perform certain operations, and useful ways of using Beam in your
>>> pipelines. Some of them relate to processing of files, use of side inputs,
>>> sate/timers, etc. Check them out[1].
>>>
>>> These initial patterns have been chosen based on evidence gathered from
>>> StackOverflow, and from talking to users of Beam.
>>>
>>> It would be great if this section could grow, and be useful to many Beam
>>> users. For that reason, we invite anyone to share patterns, and pipeline
>>> examples that they have used in the past. If you are interested in
>>> contributing, please submit a pull request, or get in touch with Cyrus
>>> Maden, Reza Rokni, Melissa Pashniak or myself.
>>>
>>> Thanks!
>>> Best
>>> -P.
>>>
>>> [1] https://beam.apache.org/documentation/patterns/overview/
>>>
>>
>
> --
>
> This email may be confidential and privileged. If you received this
> communication by mistake, please don't forward it to anyone else, please
> erase all copies and attachments, and please let me know that it has gone
> to the wrong person.
>
> The above terms reflect a potential business arrangement, are provided
> solely as a basis for further discussion, and are not intended to be and do
> not constitute a legally binding obligation. No legally binding obligations
> will be created, implied, or inferred until an agreement in final form is
> executed in writing by all parties involved.
>


Re: [ANNOUNCEMENT] Common Pipeline Patterns - new section in the documentation + contributions welcome

2019-06-07 Thread Reza Rokni
+1 on the pattern Tim!

Please raise a Jira with the label pipeline-patterns, details are here:
https://beam.apache.org/documentation/patterns/overview/#contributing-a-pattern



On Sat, 8 Jun 2019 at 05:04, Tim Robertson 
wrote:

> This is great. Thanks Pablo and all
>
> I've seen several folk struggle with writing avro to dynamic locations
> which I think might be a good addition. If you agree I'll offer a PR unless
> someone gets there first - I have an example here:
>
> https://github.com/gbif/pipelines/blob/master/pipelines/export-gbif-hbase/src/main/java/org/gbif/pipelines/hbase/beam/ExportHBase.java#L81
>
>
> On Fri, Jun 7, 2019 at 10:52 PM Pablo Estrada  wrote:
>
>> Hello everyone,
>> A group of community members has been working on gathering and providing
>> common pipeline patterns for pipelines in Beam. These are examples on how
>> to perform certain operations, and useful ways of using Beam in your
>> pipelines. Some of them relate to processing of files, use of side inputs,
>> sate/timers, etc. Check them out[1].
>>
>> These initial patterns have been chosen based on evidence gathered from
>> StackOverflow, and from talking to users of Beam.
>>
>> It would be great if this section could grow, and be useful to many Beam
>> users. For that reason, we invite anyone to share patterns, and pipeline
>> examples that they have used in the past. If you are interested in
>> contributing, please submit a pull request, or get in touch with Cyrus
>> Maden, Reza Rokni, Melissa Pashniak or myself.
>>
>> Thanks!
>> Best
>> -P.
>>
>> [1] https://beam.apache.org/documentation/patterns/overview/
>>
>

-- 

This email may be confidential and privileged. If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it has gone
to the wrong person.

The above terms reflect a potential business arrangement, are provided
solely as a basis for further discussion, and are not intended to be and do
not constitute a legally binding obligation. No legally binding obligations
will be created, implied, or inferred until an agreement in final form is
executed in writing by all parties involved.


Re: [ANNOUNCE] Apache Beam 2.13.0 released!

2019-06-07 Thread Kyle Weaver
Awesome! Thanks for leading the release Ankur.

On Fri, Jun 7, 2019 at 2:57 PM Ankur Goenka  wrote:

> The Apache Beam team is pleased to announce the release of version 2.13.0!
>
> Apache Beam is an open source unified programming model to define and
> execute data processing pipelines, including ETL, batch and stream
> (continuous) processing. See https://beam.apache.org
>
> You can download the release here:
>
> https://beam.apache.org/get-started/downloads/
>
> This release includes bugfixes, features, and improvements detailed on
> the Beam blog: https://beam.apache.org/blog/2019/05/22/beam-2.13.0.html
>
> Thanks to everyone who contributed to this release, and we hope you enjoy
> using Beam 2.13.0.
>
> -- Ankur Goenka, on behalf of The Apache Beam team
>
-- 
Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com |
+1650203


Re: [ANNOUNCEMENT] Common Pipeline Patterns - new section in the documentation + contributions welcome

2019-06-07 Thread Tim Robertson
This is great. Thanks Pablo and all

I've seen several folk struggle with writing avro to dynamic locations
which I think might be a good addition. If you agree I'll offer a PR unless
someone gets there first - I have an example here:

https://github.com/gbif/pipelines/blob/master/pipelines/export-gbif-hbase/src/main/java/org/gbif/pipelines/hbase/beam/ExportHBase.java#L81


On Fri, Jun 7, 2019 at 10:52 PM Pablo Estrada  wrote:

> Hello everyone,
> A group of community members has been working on gathering and providing
> common pipeline patterns for pipelines in Beam. These are examples on how
> to perform certain operations, and useful ways of using Beam in your
> pipelines. Some of them relate to processing of files, use of side inputs,
> sate/timers, etc. Check them out[1].
>
> These initial patterns have been chosen based on evidence gathered from
> StackOverflow, and from talking to users of Beam.
>
> It would be great if this section could grow, and be useful to many Beam
> users. For that reason, we invite anyone to share patterns, and pipeline
> examples that they have used in the past. If you are interested in
> contributing, please submit a pull request, or get in touch with Cyrus
> Maden, Reza Rokni, Melissa Pashniak or myself.
>
> Thanks!
> Best
> -P.
>
> [1] https://beam.apache.org/documentation/patterns/overview/
>


[ANNOUNCEMENT] Common Pipeline Patterns - new section in the documentation + contributions welcome

2019-06-07 Thread Pablo Estrada
Hello everyone,
A group of community members has been working on gathering and providing
common pipeline patterns for pipelines in Beam. These are examples on how
to perform certain operations, and useful ways of using Beam in your
pipelines. Some of them relate to processing of files, use of side inputs,
sate/timers, etc. Check them out[1].

These initial patterns have been chosen based on evidence gathered from
StackOverflow, and from talking to users of Beam.

It would be great if this section could grow, and be useful to many Beam
users. For that reason, we invite anyone to share patterns, and pipeline
examples that they have used in the past. If you are interested in
contributing, please submit a pull request, or get in touch with Cyrus
Maden, Reza Rokni, Melissa Pashniak or myself.

Thanks!
Best
-P.

[1] https://beam.apache.org/documentation/patterns/overview/


Re: Design Proposal for Cost Estimation

2019-06-07 Thread Kenneth Knowles
Thanks for the doc. This is really clear and readable. It all looks like a
good improvement, whatever the result of the various open threads. And nice
bonus that you've pointed to more good reading material.

Kenn

On Fri, Jun 7, 2019 at 12:25 PM Alireza Samadian 
wrote:

> Thank you so much.
>
> Best,
> Alireza
>
> On Fri, Jun 7, 2019 at 11:48 AM Pablo Estrada  wrote:
>
>> I've added you as a contributor! : )
>>
>> On Fri, Jun 7, 2019 at 11:20 AM Alireza Samadian 
>> wrote:
>>
>>> Hi,
>>>
>>> I am going to create Issues in Jira and start implementing row
>>> estimation of each source separately. I will appreciate if someone gives me
>>> the permission to assign Jira Issues to myself. My Jira id is riazela.
>>>
>>> Best,
>>> Alireza
>>>
>>> On Fri, May 31, 2019 at 3:54 PM Alireza Samadian 
>>> wrote:
>>>
 Dear Members of Apache Beam Dev List,

 My name is Alireza; I am a Software Engineer Intern at Google, and I am
 working closely with Anton on Beam SQL query optimizer. Currently, it uses
 Apache Calcite without any cost estimation; I am proposing to implement the
 cost estimator for it.
 The first step would be implementing cost estimator for the sources;
 this is my design proposal for this implementation. I will appreciate your
 comments and suggestions.


 https://docs.google.com/document/d/1vi1PBBu5IqSy-qZl1Gk-49CcANOpbNs1UAud6LnOaiY/edit#heading=h.6rlkpwwx7gvf

 Best,
 Alireza Samadian

>>>


Re: Help triaging Jira issues

2019-06-07 Thread Kenneth Knowles
Nice. I noticed the huge drop in untriaged issues. Both of those ideas for
automation sound reasonable.

I think the other things that are harder to optimize can probably be
addressed by re-triaging stale bugs. We will probably find those that
should have been closed and those that are just sitting on an inactive
contributor.

Kenn

On Fri, Jun 7, 2019 at 12:53 AM Ismaël Mejía  wrote:

> I took a look and reduced the untriaged issues to around 100. I
> noticed however some patterns that are producing more untriaged issues
> that we should have. Those can be probably automated (if JIRA has ways
> to do it):
>
> 1. Issues created and assigned on creation can be marked as open.
> 2. Once an issue has an associated PR it could be marked as open if it
> was in Triaged state.
>
> Other common case that is probably harder to automate are issues that
> are in Triaged state because we forgot to resolve/close them. I don’t
> know how we can improve these, apart of reminding people to look that
> they do not have untriaged assigned issues.
>
> Another interesting triage to do are the issues that are Open and
> assigned to members of the community that are not active anymore in
> the project, but that’s probably worth of another discussion, as well
> as how can we more effectively track open unassigned issues (which are
> currently around 1600).
>
> On Wed, Jun 5, 2019 at 7:03 PM Tanay Tummalapalli 
> wrote:
> >
> > Hi Kenneth,
> >
> > I already follow the issues@ mailing list pretty much daily.
> > I'd like to help with triaging issues, especially ones related to the
> Python SDK since I'm most familiar with it.
> >
> > On Wed, Jun 5, 2019 at 10:26 PM Alex Van Boxel  wrote:
> >>
> >> Hey Kenneth, I help out. I'm planning to contribute more on Beam and it
> seems to be ideal to keep up-to-date with the project.
> >>
> >>  _/
> >> _/ Alex Van Boxel
> >>
> >>
> >> On Wed, Jun 5, 2019 at 6:46 PM Kenneth Knowles  wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I am requesting help in triaging incoming issues. I made a search
> here: https://issues.apache.org/jira/issues/?filter=12345682
> >>>
> >>> I have a daily email subscription to this filter as a reminder, but
> rarely can really sit down to do triage for very long. It has grown from
> just under 200 to just over 200. The rate is actually pretty low but there
> is a backlog. I also want to start re-triaging stale bugs but priority
> would be (1) keep up with new bugs (2) clear backlog (3) re-triage stale
> bugs.
> >>>
> >>> Just FYI what I look for before I clicked "Triaged" is:
> >>>
> >>>  - correct component
> >>>  - correct priority
> >>>  - maybe ping someone in a comment or assign
> >>>  - write to dev@ if it is a major problem
> >>>
> >>> If I can't figure that out, then I ask the reporter for clarification
> and "Start Watching" the issue so I will receive their response.
> >>>
> >>> To avoid duplicate triage work it may help to assign to yourself
> temporarily during triage phase.
> >>>
> >>> Any help greatly appreciated!
> >>>
> >>> Kenn
>


Re: Design Proposal for Cost Estimation

2019-06-07 Thread Alireza Samadian
Thank you so much.

Best,
Alireza

On Fri, Jun 7, 2019 at 11:48 AM Pablo Estrada  wrote:

> I've added you as a contributor! : )
>
> On Fri, Jun 7, 2019 at 11:20 AM Alireza Samadian 
> wrote:
>
>> Hi,
>>
>> I am going to create Issues in Jira and start implementing row estimation
>> of each source separately. I will appreciate if someone gives me the
>> permission to assign Jira Issues to myself. My Jira id is riazela.
>>
>> Best,
>> Alireza
>>
>> On Fri, May 31, 2019 at 3:54 PM Alireza Samadian 
>> wrote:
>>
>>> Dear Members of Apache Beam Dev List,
>>>
>>> My name is Alireza; I am a Software Engineer Intern at Google, and I am
>>> working closely with Anton on Beam SQL query optimizer. Currently, it uses
>>> Apache Calcite without any cost estimation; I am proposing to implement the
>>> cost estimator for it.
>>> The first step would be implementing cost estimator for the sources;
>>> this is my design proposal for this implementation. I will appreciate your
>>> comments and suggestions.
>>>
>>>
>>> https://docs.google.com/document/d/1vi1PBBu5IqSy-qZl1Gk-49CcANOpbNs1UAud6LnOaiY/edit#heading=h.6rlkpwwx7gvf
>>>
>>> Best,
>>> Alireza Samadian
>>>
>>


Re: Design Proposal for Cost Estimation

2019-06-07 Thread Pablo Estrada
I've added you as a contributor! : )

On Fri, Jun 7, 2019 at 11:20 AM Alireza Samadian 
wrote:

> Hi,
>
> I am going to create Issues in Jira and start implementing row estimation
> of each source separately. I will appreciate if someone gives me the
> permission to assign Jira Issues to myself. My Jira id is riazela.
>
> Best,
> Alireza
>
> On Fri, May 31, 2019 at 3:54 PM Alireza Samadian 
> wrote:
>
>> Dear Members of Apache Beam Dev List,
>>
>> My name is Alireza; I am a Software Engineer Intern at Google, and I am
>> working closely with Anton on Beam SQL query optimizer. Currently, it uses
>> Apache Calcite without any cost estimation; I am proposing to implement the
>> cost estimator for it.
>> The first step would be implementing cost estimator for the sources; this
>> is my design proposal for this implementation. I will appreciate your
>> comments and suggestions.
>>
>>
>> https://docs.google.com/document/d/1vi1PBBu5IqSy-qZl1Gk-49CcANOpbNs1UAud6LnOaiY/edit#heading=h.6rlkpwwx7gvf
>>
>> Best,
>> Alireza Samadian
>>
>


Re: Plan for dropping python 2 support

2019-06-07 Thread Ahmet Altay
I agree with you. A more recent LTS release with python 2 support will be
good. Cost of maintaining python 2 support is also fairly low (maybe zero
actually besides keeping some pre-existing compatibility code).

I believe we are referring to two separate things with support:
- Supporting existing releases for patches - I agree that we need to give
users a long enough window to upgrade. Great if it happens with an LTS
release. Even if it does not, I think it will be fair to offer patches on
the last python 2 supporting release during some part of 2020 if that
becomes necessary.
- Making new releases with python 2 support - Each new Beam release with
python 2 support will implicitly extend the lifetime of beam's python 2
support. I do not think we need to extend this to beyond 2019. 2 releases
(~ 3 months) after solid python 3 support will very likely put the last
python 2 supporting release to last quarter of 2019 already.

On Fri, Jun 7, 2019 at 2:15 AM Robert Bradshaw  wrote:

> I don't think the second release with robust/recommended Python 3
> support should be the last release with Python 2 support--that is
> simply not enough time for people to migrate. (Look at how long it
> took us...) It does make a lot of sense to at least have one LTS
> release with support for both.
>
> Regarding timeline, I think we could safely say we expect to support
> Python 2 through 2019, likely for some of 2020 (possibly only via an
> LTS release), and (very) unlikely beyond 2020.
>
> On Wed, Jun 5, 2019 at 6:34 PM Ahmet Altay  wrote:
> >
> > I agree with the sentiment on this thread. Our priority needs to be
> offering good python 3 support that we can comfortably recommend users to
> switch. Progress on that so far has been promising and I do anticipate that
> we will reach there in the near future.
> >
> > My proposal would be, once we reach to that state, we can mark the first
> subsequent Beam release as the last Beam release that supports Python 2.
> (Alternatively: in line with the previous experimental/deprecated
> discussion we can make 2 more release with python 2 support rather than
> just 1 more.) With the current state, we would not give users plenty of
> time to upgrade python 3. So in addition, I would suggest we can consider
> and upgrade relief by offering something like a 6-month support on the last
> python 2 compatible release. We might do that in the context of an LTS
> release.
> >
> > I do not believe we have a timeline we can share with users at this
> point. However if we go with this suggestion, we will probably support
> python 2 approximately until mid-2020.
> >
> > Ahmet
> >
> > On Wed, Jun 5, 2019 at 4:53 AM Tanay Tummalapalli 
> wrote:
> >>
> >> We can support Python 2 for some time in 2020, but, we should target a
> date no later than 2020 to drop support.
> >> If we do plan to drop support for Python 2 in 2020, we should sign the
> Python 3 statement[1], declaring that we will "drop support for Python 2.7
> no later than 2020".
> >>
> >> In addition to the statement, keeping a target release and date(if
> possible) or timeline to drop support would also help users to decide when
> they need to work on migrating to Python 3.
> >>
> >> Regards,
> >> - TT
> >>
> >> [1] https://python3statement.org/
> >>
> >> On Wed, Jun 5, 2019 at 4:37 PM Robert Bradshaw 
> wrote:
> >>>
> >>> Until Python 3 support for Beam is officially out of beta and
> >>> recommended, I don't think we can tell people to stop using Python 2.
> >>> Given that 2020 is just over 6 months away, that seems a short
> >>> transition time, so I would guess we'll have to continue supporting
> >>> Python 2 sometime into 2020.
> >>>
> >>> A quick survey of users would be valuable here. But first priority is
> >>> making Python 3 rock solid so we can unconditionally recommend it over
> >>> Python 2.
> >>>
> >>> On Wed, Jun 5, 2019 at 12:27 PM Ismaël Mejía 
> wrote:
> >>> >
> >>> > Python 2 won't be maintained after 2020 [1]. I was wondering what
> will
> >>> > be our (Beam) plan for this. Other projects [2] have started to alert
> >>> > users that support will be removed so maybe we should decide or
> policy
> >>> > for this too.
> >>> >
> >>> > [1] https://pythonclock.org/
> >>> > [2]
> https://spark.apache.org/news/plan-for-dropping-python-2-support.html
>


Re: Design Proposal for Cost Estimation

2019-06-07 Thread Alireza Samadian
Hi,

I am going to create Issues in Jira and start implementing row estimation
of each source separately. I will appreciate if someone gives me the
permission to assign Jira Issues to myself. My Jira id is riazela.

Best,
Alireza

On Fri, May 31, 2019 at 3:54 PM Alireza Samadian 
wrote:

> Dear Members of Apache Beam Dev List,
>
> My name is Alireza; I am a Software Engineer Intern at Google, and I am
> working closely with Anton on Beam SQL query optimizer. Currently, it uses
> Apache Calcite without any cost estimation; I am proposing to implement the
> cost estimator for it.
> The first step would be implementing cost estimator for the sources; this
> is my design proposal for this implementation. I will appreciate your
> comments and suggestions.
>
>
> https://docs.google.com/document/d/1vi1PBBu5IqSy-qZl1Gk-49CcANOpbNs1UAud6LnOaiY/edit#heading=h.6rlkpwwx7gvf
>
> Best,
> Alireza Samadian
>


Re: [DISCUSS] Portability representation of schemas

2019-06-07 Thread Anton Kedin
The topic of schema registries probably does not block the design and
implementation of logical types and portable schemas by themselves, however
I think we should spend some time discussing it (probably in a separate
thread) so that all SDKs have similar mechanisms for schema registration
and lookup.
Current Java SDK allows registering schemas for Java-types of the elements
enabling automatic conversions from Pojos/AutoValues/etc to Rows. This
approach is helpful within Java SDK but it will need to be generalized and
extended. E.g. it should allow the lookup of schemas/types using some other
logic (customizable), not just Java type of the elements, or maybe even
dynamic schemas (not just Union, don't know if there is a use case for
this). This should also include an understanding of how external
schema/metadata sources (Hive Metastore, Data Catalog) can be used in
different SDKs.
And maybe some general reflection mechanisms?

Regards,
Anton


On Fri, Jun 7, 2019 at 4:35 AM Robert Burke  wrote:

> Wouldn't SDK specific types always be under the "coders" component instead
> of the logical type listing?
>
> Offhand, having a separate normalized listing of logical schema types in
> the pipeline components message of the types seems about right. Then
> they're unambiguous, but can also either refer to other logical types or
> existing coders as needed. When SDKs don't understand a given coder, the
> field could be just represented by a blob of bytes.
>
>
>
> On Wed, Jun 5, 2019, 11:29 PM Brian Hulette  wrote:
>
>> If we want to have a Pipeline level registry, we could add it to
>> Components [1].
>>
>> message Components {
>>   ...
>>   map logical_types;
>> }
>>
>> And in FieldType reference the logical types by id:
>> oneof field_type {
>>   AtomicType atomic_type;
>>   ArrayType array_type;
>>   ...
>>   string logical_type_id;// was LogicalType logical_type;
>> }
>>
>> I'm not sure I like this idea though. The reason we started discussing a
>> "registry" was just to separate the SDK-specific bits from the
>> representation type, and this doesn't accomplish that, it just de-dupes
>> logical types used
>> across the pipeline.
>>
>> I think instead I'd rather just come back to the message we have now in
>> the doc, used directly in FieldType's oneof:
>>
>> message LogicalType {
>>   FieldType representation = 1;
>>   string logical_urn = 2;
>>   bytes logical_payload = 3;
>> }
>>
>> We can have a URN for SDK-specific types (user type aliases), like
>> "beam:logical:javasdk", and the logical_payload could itself be a protobuf
>> with attributes of 1) a serialized class and 2/3) to/from functions. For
>> truly portable types it would instead have a well-known URN and optionally
>> a logical_payload with some agreed-upon representation of parameters.
>>
>> It seems like maybe SdkFunctionSpec/Environment should be used for this
>> somehow, but I can't find a good example of this in the Runner API to use
>> as a model. For example, what we're trying to accomplish is basically the
>> same as Java custom coders vs. standard coders. But that is accomplished
>> with a magic "javasdk" URN, as I suggested here, not with Environment
>> [2,3]. There is a "TODO: standardize such things" where that URN is
>> defined, is it possible that Environment is that standard and just hasn't
>> been utilized for custom coders yet?
>>
>> Brian
>>
>> [1]
>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L54
>> [2]
>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L542
>> [3]
>> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/CoderTranslation.java#L121
>>
>> On Tue, Jun 4, 2019 at 2:24 PM Brian Hulette  wrote:
>>
>>> Yeah that's what I meant. It does seem logical reasonable to scope any
>>> registry by pipeline and not by PCollection. Then it seems we would want
>>> the entire LogicalType (including the `FieldType representation` field) as
>>> the value type, and not just LogicalTypeConversion. Otherwise we're
>>> separating the representations from the conversions, and duplicating the
>>> representations. You did say a "registry of logical types", so maybe that
>>> is what you meant.
>>>
>>> Brian
>>>
>>> On Tue, Jun 4, 2019 at 1:21 PM Reuven Lax  wrote:
>>>


 On Tue, Jun 4, 2019 at 9:20 AM Brian Hulette 
 wrote:

>
>
> On Mon, Jun 3, 2019 at 10:04 PM Reuven Lax  wrote:
>
>>
>>
>> On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette 
>> wrote:
>>
>>> > It has to go into the proto somewhere (since that's the only way
>>> the SDK can get it), but I'm not sure they should be considered integral
>>> parts of the type.
>>> Are you just advocating for an approach where any SDK-specific
>>> information is stored outside of the Schema message itself so that 
>>> Schema
>>> really does 

Re: Testing code in extensions against runner

2019-06-07 Thread Lukasz Cwik
We have been currently been having every runner define and manage its own
suite/tests so yes modifying flink_runner.gradle is currently the correct
thing to do.

There is a larger discussion about whether this is the right way since we
would like to capture things like perf benchmarks and validates runner
tests so we can add information to the website about how well a feature is
supported by each runner automatically.



On Thu, Jun 6, 2019 at 8:36 PM Reza Rokni  wrote:

> Hi,
>
> I would like to validate some code that I am building under
> extensions against different runners. It makes use of some caches in a DoFn
> which are a little off the beaten path.
>
> I have added @ValidatesRunner to the class and by adding the right values
> to the gradle file in flink_runner have got the tests to run. However it
> does not feel right for me to change the flink_runner.gradle file to
> achieve this, especially as this is all experimental and under extensions.
>
> I could copy over all the bits needed from the gradle file over to my
> extensions gradle, but then I would need to do that for all runners , which
> also feels a bit heavy weight. Is there a way, or should there be a way of
> having a task added to my gradle file which will do tests against all
> runners for me?
>
> Cheers
> Reza
>
> --
>
> This email may be confidential and privileged. If you received this
> communication by mistake, please don't forward it to anyone else, please
> erase all copies and attachments, and please let me know that it has gone
> to the wrong person.
>
> The above terms reflect a potential business arrangement, are provided
> solely as a basis for further discussion, and are not intended to be and do
> not constitute a legally binding obligation. No legally binding obligations
> will be created, implied, or inferred until an agreement in final form is
> executed in writing by all parties involved.
>


Re: [Discuss] Ideas for Apache Beam presence in social media

2019-06-07 Thread Thomas Weise
Here is an idea how this could be done: Create a JIRA ticket that will
always remain open. Have folks append their suggested tweets as comments.
Interested PMC members can watch that ticket.

Thomas

On Thu, Jun 6, 2019 at 10:41 AM Thomas Weise  wrote:

> Pinging individual PMC members doesn't work. There needs to be visibility
> to proposed actions to anyone that is interested. That would require a form
> of subscribe/notification mechanism (as exists for PRs and JIRAs).
>
>
> On Thu, Jun 6, 2019 at 10:33 AM Aizhamal Nurmamat kyzy <
> aizha...@apache.org> wrote:
>
>> With the spreadsheet in http://s.apache.org/beam-tweets, anyone can
>> propose tweets. I will check it every few days, and ping/tag PMC members to
>> review tweets and publish. Does that sound fine?
>> If you have ideas on how to make the process better, please let me know.
>>
>> Thanks,
>> Aizhamal
>>
>> On Wed, Jun 5, 2019 at 4:10 AM Thomas Weise  wrote:
>>
>>> +1
>>>
>>> What would be the mechanism to notify the PMC that there is something to
>>> review?
>>>
>>>
>>> On Tue, Jun 4, 2019 at 9:55 PM Kenneth Knowles  wrote:
>>>
 Bringing the PMC's conclusion back to this list, we are happy to start
 with the following arrangement:

  - Doc/spreadsheet/etc readable by dev@ (aka the public), writable by
 some group of contributors to set up a queue of news
  - Any member of PMC approves and executes the posts, with enough time
 elapsing to consider it lazy consensus

 Any mistake transcribing this conclusion is my own. And of course
 nothing is permanent, but we try and iterate.

 Kenn

 On Mon, Jun 3, 2019 at 2:18 PM Aizhamal Nurmamat kyzy <
 aizha...@google.com> wrote:

> Hello folks,
>
> I have created a spreadsheet where people can suggest tweets [1]. It
> contains a couple of tweets that have been tweeted as examples. Also, 
> there
> are a couple others that I will ask PMC members to review in the next few
> days.
>
> I have also created a blog post[2] to invite community members to
> participate by proposing tweets / retweets.
>
> Does this look OK to everyone? I’d love to try it out and see if it
> drives engagement in the community. If not we can always change the
> processes.
>
> Thanks,
> aizhamal
>
> [1] s.apache.org/beam-tweets
> [2] https://github.com/apache/beam/pull/8747
>
> On Fri, May 24, 2019 at 4:26 PM Kenneth Knowles 
> wrote:
>
>> Thanks for taking on this work!
>>
>> Kenn
>>
>> On Fri, May 24, 2019 at 2:52 PM Aizhamal Nurmamat kyzy <
>> aizha...@google.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I'd like to pilot this if that's okay by everyone. I'll set up a
>>> spreadsheet, write a blog post publicizing it, and perhaps send out a
>>> tweet. We can improve the process later with tools if necessary.
>>>
>>> Thanks all and have a great weekend!
>>> Aizhamal
>>>
>>> On Tue, May 21, 2019 at 8:37 PM Kenneth Knowles 
>>> wrote:
>>>
 Great idea.

 Austin - point well taken about whether the PMC really has to
 micro-manage here. The stakes are potentially very high, but so are the
 stakes for code and website changes.

 I know that comdev votes authoring privileges to people who are not
 committers, but they are not speaking on behalf of comdev but under 
 their
 own name.

 Let's definitely find a way to be effective on social media.

 Kenn

 On Tue, May 21, 2019 at 4:14 AM Maximilian Michels 
 wrote:

> Hi Aizhamal,
>
> This is a great idea. I think it would help Beam to be more
> prominent on
> social media.
>
> We need to discuss this also on the private@ mailing list but I
> don't
> see anything standing in the way if the PMC always gets to approve
> the
> proposed social media postings.
>
> I could even imagine that the PMC gives rights to a Beam community
> member to post in their name.
>
> Thanks,
> Max
>
> On 21.05.19 03:09, Austin Bennett wrote:
> > Is PMC definitely in charge of this (approving, communication
> channel,
> > etc)?
> >
> > There could even be a more concrete pull-request-like function
> even for
> > things like tweets (to minimize cut/paste operations)?
> >
> > I remember a bit of a mechanism having been proposed some time
> ago (in
> > another circumstance), though doesn't look like it made it
> terribly far:
> >
> http://www.redhenlab.org/home/the-cognitive-core-research-topics-in-red-hen/the-barnyard/-slick-tweeting
> > (I haven't otherwise seen such 

Re: Removing shading by default within BeamModulePlugin.groovy

2019-06-07 Thread Lukasz Cwik
I also noticed that the build takes significantly less time on my machine,
several mins saved.

On Fri, Jun 7, 2019 at 9:54 AM Lukasz Cwik  wrote:

> Guava was the only thing that we shaded everywhere but the original intent
> was for us to shade more and more by default until we decided to do
> vendoring (which is a better solution).
>
> So yes, this really only removed shading of Guava, we still have shading
> in all these other places:
> model/*
> sdks/java/core
> sdks/java/extensions/kryo
> sdks/java/extensions/sql
> sdks/java/extensions/sql/jdbc
> sdks/java/harness
> runners/spark/job-server
> runners/direct-java
> runners/samza/job-server
> runners/google-cloud-dataflow-java/worker
> runners/google-cloud-dataflow-java/worker/legacy-worker
> runners/google-cloud-dataflow-java/worker/windmill
> vendor/*
>
> On Fri, Jun 7, 2019 at 1:05 AM Ismaël Mejía  wrote:
>
>> This is fantastic. Took a look at the PR and did not see anything that
>> jump to my eyes and also validated with two external projects with
>> today’s snapshots (after merge) without issues so far. Great that we
>> finally tackle this on, thanks Luke!
>>
>> Have one minor comment because the title of the thread may be
>> confusing, after checking sdks-java-core I noticed we are still
>> shading other dependencies: protobuf, bytebuddy, antlr, apache
>> commons, so I suppose this was mostly around shading guava, isn’t it?
>>
>> On Wed, Jun 5, 2019 at 10:09 PM Lukasz Cwik  wrote:
>> >
>> > I am able to pass several runners validates runner tests and the Java
>> PostCommit.
>> >
>> > I also was able to publish a release into the staging repository[1] and
>> compared the newly generated poms artifact-2.14.0-20190605.*-30.pom against
>> the previously nightly snapshot of artifact-2.14.0-20190605.*-28.pom for
>> the following projects as a spot check and found no differences in those
>> poms:
>> > beam-sdks-java-core
>> > beam-sdks-java-fn-execution
>> > beam-runners-spark
>> >
>> > I believe my PR is now ready for review.
>> >
>> > 1:
>> https://repository.apache.org/content/groups/snapshots/org/apache/beam/
>> >
>> > On Tue, Jun 4, 2019 at 7:18 PM Kenneth Knowles  wrote:
>> >>
>> >> Nice! This is a huge step. One thing that showed up in the last big
>> gradle change was needing to check the generated poms.
>> >>
>> >> Kenn
>> >>
>> >> On Tue, Jun 4, 2019 at 5:07 PM Lukasz Cwik  wrote:
>> >>>
>> >>> Since we have been migrating to using vendoring instead of shading[1]
>> and due to previous efforts in vendoring[2, 3] I have opened up PR 8762[4]
>> which migrates all projects that weren't doing anything shading wise to not
>> perform any shading. This required me to fix up all intra project
>> dependencies and release publishing.
>> >>>
>> >>> The following is a list of all project paths which are still using
>> shading for some reason:
>> >>> model/*
>> >>> sdks/java/core
>> >>> sdks/java/extensions/kryo
>> >>> sdks/java/extensions/sql
>> >>> sdks/java/extensions/sql/jdbc
>> >>> sdks/java/harness
>> >>> runners/spark/job-server
>> >>> runners/direct-java
>> >>> runners/samza/job-server
>> >>> runners/google-cloud-dataflow-java/worker
>> >>> runners/google-cloud-dataflow-java/worker/legacy-worker
>> >>> runners/google-cloud-dataflow-java/worker/windmill
>> >>> vendor/*
>> >>>
>> >>> Out of the list above, migrating sdks/java/core and
>> runners/direct-java (in that order) would provide the most benefit to
>> moving away from shading within our project. Many of the others are either
>> shaded proto classes or applications (e.g. job-servers, harness, sql jdbc)
>> and either require shading to be compatible with vendoring or aren't meant
>> to be used as dependencies.
>> >>>
>> >>> Since this is a larger change that cuts across so many projects there
>> is risk for breakage. I'm looking for people to help test the change and
>> validate any scenarios that they are specifically interested in. I'm
>> planning to run several of the postcommits on my PR and check that we can
>> build a release in addition to any efforts others provide before looking to
>> have the change merged.
>> >>>
>> >>> The following guidance should help those who edit Gradle build files
>> (after this change is merged):
>> >>> * For projects that don't perform any shading, those projects have
>> been migrated to use the default configurations that the Gradle Java plugin
>> uses[5]. Note that the default configurations we use have been deprecated.
>> >>> * For projects that depend on another project that isn't shaded, the
>> intra project configuration has been swapped to use compile / testRuntime
>> instead of shadow and shadowTest
>> >>> * Existing projects that are still shaded should use the shadow and
>> shadowTest configurations as before.
>> >>>
>> >>> 1:
>> https://lists.apache.org/thread.html/4c12db35b40a6d56e170cd6fc8bb0ac4c43a99aa3cb7dbae54176815@%3Cdev.beam.apache.org%3E
>> >>> 2:
>> 

Re: Removing shading by default within BeamModulePlugin.groovy

2019-06-07 Thread Lukasz Cwik
Guava was the only thing that we shaded everywhere but the original intent
was for us to shade more and more by default until we decided to do
vendoring (which is a better solution).

So yes, this really only removed shading of Guava, we still have shading in
all these other places:
model/*
sdks/java/core
sdks/java/extensions/kryo
sdks/java/extensions/sql
sdks/java/extensions/sql/jdbc
sdks/java/harness
runners/spark/job-server
runners/direct-java
runners/samza/job-server
runners/google-cloud-dataflow-java/worker
runners/google-cloud-dataflow-java/worker/legacy-worker
runners/google-cloud-dataflow-java/worker/windmill
vendor/*

On Fri, Jun 7, 2019 at 1:05 AM Ismaël Mejía  wrote:

> This is fantastic. Took a look at the PR and did not see anything that
> jump to my eyes and also validated with two external projects with
> today’s snapshots (after merge) without issues so far. Great that we
> finally tackle this on, thanks Luke!
>
> Have one minor comment because the title of the thread may be
> confusing, after checking sdks-java-core I noticed we are still
> shading other dependencies: protobuf, bytebuddy, antlr, apache
> commons, so I suppose this was mostly around shading guava, isn’t it?
>
> On Wed, Jun 5, 2019 at 10:09 PM Lukasz Cwik  wrote:
> >
> > I am able to pass several runners validates runner tests and the Java
> PostCommit.
> >
> > I also was able to publish a release into the staging repository[1] and
> compared the newly generated poms artifact-2.14.0-20190605.*-30.pom against
> the previously nightly snapshot of artifact-2.14.0-20190605.*-28.pom for
> the following projects as a spot check and found no differences in those
> poms:
> > beam-sdks-java-core
> > beam-sdks-java-fn-execution
> > beam-runners-spark
> >
> > I believe my PR is now ready for review.
> >
> > 1:
> https://repository.apache.org/content/groups/snapshots/org/apache/beam/
> >
> > On Tue, Jun 4, 2019 at 7:18 PM Kenneth Knowles  wrote:
> >>
> >> Nice! This is a huge step. One thing that showed up in the last big
> gradle change was needing to check the generated poms.
> >>
> >> Kenn
> >>
> >> On Tue, Jun 4, 2019 at 5:07 PM Lukasz Cwik  wrote:
> >>>
> >>> Since we have been migrating to using vendoring instead of shading[1]
> and due to previous efforts in vendoring[2, 3] I have opened up PR 8762[4]
> which migrates all projects that weren't doing anything shading wise to not
> perform any shading. This required me to fix up all intra project
> dependencies and release publishing.
> >>>
> >>> The following is a list of all project paths which are still using
> shading for some reason:
> >>> model/*
> >>> sdks/java/core
> >>> sdks/java/extensions/kryo
> >>> sdks/java/extensions/sql
> >>> sdks/java/extensions/sql/jdbc
> >>> sdks/java/harness
> >>> runners/spark/job-server
> >>> runners/direct-java
> >>> runners/samza/job-server
> >>> runners/google-cloud-dataflow-java/worker
> >>> runners/google-cloud-dataflow-java/worker/legacy-worker
> >>> runners/google-cloud-dataflow-java/worker/windmill
> >>> vendor/*
> >>>
> >>> Out of the list above, migrating sdks/java/core and
> runners/direct-java (in that order) would provide the most benefit to
> moving away from shading within our project. Many of the others are either
> shaded proto classes or applications (e.g. job-servers, harness, sql jdbc)
> and either require shading to be compatible with vendoring or aren't meant
> to be used as dependencies.
> >>>
> >>> Since this is a larger change that cuts across so many projects there
> is risk for breakage. I'm looking for people to help test the change and
> validate any scenarios that they are specifically interested in. I'm
> planning to run several of the postcommits on my PR and check that we can
> build a release in addition to any efforts others provide before looking to
> have the change merged.
> >>>
> >>> The following guidance should help those who edit Gradle build files
> (after this change is merged):
> >>> * For projects that don't perform any shading, those projects have
> been migrated to use the default configurations that the Gradle Java plugin
> uses[5]. Note that the default configurations we use have been deprecated.
> >>> * For projects that depend on another project that isn't shaded, the
> intra project configuration has been swapped to use compile / testRuntime
> instead of shadow and shadowTest
> >>> * Existing projects that are still shaded should use the shadow and
> shadowTest configurations as before.
> >>>
> >>> 1:
> https://lists.apache.org/thread.html/4c12db35b40a6d56e170cd6fc8bb0ac4c43a99aa3cb7dbae54176815@%3Cdev.beam.apache.org%3E
> >>> 2:
> https://lists.apache.org/thread.html/4c12db35b40a6d56e170cd6fc8bb0ac4c43a99aa3cb7dbae54176815@%3Cdev.beam.apache.org%3E
> >>> 3:
> https://lists.apache.org/thread.html/972b5175641f4eaf7ec92870cc0ff72fa52e6f0bbaccc384a3814e45@%3Cdev.beam.apache.org%3E
> >>> 4: https://github.com/apache/beam/pull/8762
> >>> 5:
> 

Re: I'm thinking about new features, what do you think?

2019-06-07 Thread Lukasz Cwik
Even though we don't support iteration, one could have a known upperbound
and "unroll" the loop to a fixed number of iterations statically before the
pipeline is run but I agree with Eugene on his other points.







On Fri, Jun 7, 2019 at 3:59 AM Robert Burke  wrote:

> I'm not sure I understand the desired properties of GroupByMultiKey.
>
> Offhand, am I right interpreting GroupByMultiKey as essentially forming a
> graph of the keys based on the MultiKeys nodes, and the number of resulting
> iterables is based on the components of the graph.
>
> If that's the case then, what does the integer do when creating the
> GroupByMultiKey?
>
> In the example, it seems to be saying "I'd like 3 groups" but wouldn't
> that be a property of the implicit connected graphs of MultiKeys?
>
> Thank you very much!
>
>
> On Fri, Jun 7, 2019, 10:14 AM Jan Lukavský  wrote:
>
>> Hi,
>>
>> that sounds interesting, but it seems to be computationally intensive
>> and might not be well scalable, if I understand it correctly. It looks
>> like it needs a transitive closure, am I right?
>>
>>   Jan
>>
>> On 6/7/19 11:17 AM, i.am.moai wrote:
>> > Hello everyone, nice to meet you
>> >
>> > I am Naoki Hyu(日宇尚記). a developer live in Tokyo. I often use scala and
>> > python as my favorite language .
>> >
>> > I have no experience with OSS development, but as I use DataFlow at
>> > work, I want to contribute to the development of Beam.
>> >
>> > In fact, there is a feature I want to develop, and now I have the
>> > source code on my local PC.
>> >
>> > The feature I want to create is an extension of GroupBy to a multiple
>> > key, which realizes more complex grouping.
>> >
>> > https://issues.apache.org/jira/browse/BEAM-7358
>> >
>> > Everyone, could you give me an opinion on this intent?
>> >
>>
>


Re: [DISCUSS] Portability representation of schemas

2019-06-07 Thread Robert Burke
Wouldn't SDK specific types always be under the "coders" component instead
of the logical type listing?

Offhand, having a separate normalized listing of logical schema types in
the pipeline components message of the types seems about right. Then
they're unambiguous, but can also either refer to other logical types or
existing coders as needed. When SDKs don't understand a given coder, the
field could be just represented by a blob of bytes.



On Wed, Jun 5, 2019, 11:29 PM Brian Hulette  wrote:

> If we want to have a Pipeline level registry, we could add it to
> Components [1].
>
> message Components {
>   ...
>   map logical_types;
> }
>
> And in FieldType reference the logical types by id:
> oneof field_type {
>   AtomicType atomic_type;
>   ArrayType array_type;
>   ...
>   string logical_type_id;// was LogicalType logical_type;
> }
>
> I'm not sure I like this idea though. The reason we started discussing a
> "registry" was just to separate the SDK-specific bits from the
> representation type, and this doesn't accomplish that, it just de-dupes
> logical types used
> across the pipeline.
>
> I think instead I'd rather just come back to the message we have now in
> the doc, used directly in FieldType's oneof:
>
> message LogicalType {
>   FieldType representation = 1;
>   string logical_urn = 2;
>   bytes logical_payload = 3;
> }
>
> We can have a URN for SDK-specific types (user type aliases), like
> "beam:logical:javasdk", and the logical_payload could itself be a protobuf
> with attributes of 1) a serialized class and 2/3) to/from functions. For
> truly portable types it would instead have a well-known URN and optionally
> a logical_payload with some agreed-upon representation of parameters.
>
> It seems like maybe SdkFunctionSpec/Environment should be used for this
> somehow, but I can't find a good example of this in the Runner API to use
> as a model. For example, what we're trying to accomplish is basically the
> same as Java custom coders vs. standard coders. But that is accomplished
> with a magic "javasdk" URN, as I suggested here, not with Environment
> [2,3]. There is a "TODO: standardize such things" where that URN is
> defined, is it possible that Environment is that standard and just hasn't
> been utilized for custom coders yet?
>
> Brian
>
> [1]
> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L54
> [2]
> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L542
> [3]
> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/CoderTranslation.java#L121
>
> On Tue, Jun 4, 2019 at 2:24 PM Brian Hulette  wrote:
>
>> Yeah that's what I meant. It does seem logical reasonable to scope any
>> registry by pipeline and not by PCollection. Then it seems we would want
>> the entire LogicalType (including the `FieldType representation` field) as
>> the value type, and not just LogicalTypeConversion. Otherwise we're
>> separating the representations from the conversions, and duplicating the
>> representations. You did say a "registry of logical types", so maybe that
>> is what you meant.
>>
>> Brian
>>
>> On Tue, Jun 4, 2019 at 1:21 PM Reuven Lax  wrote:
>>
>>>
>>>
>>> On Tue, Jun 4, 2019 at 9:20 AM Brian Hulette 
>>> wrote:
>>>


 On Mon, Jun 3, 2019 at 10:04 PM Reuven Lax  wrote:

>
>
> On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette 
> wrote:
>
>> > It has to go into the proto somewhere (since that's the only way
>> the SDK can get it), but I'm not sure they should be considered integral
>> parts of the type.
>> Are you just advocating for an approach where any SDK-specific
>> information is stored outside of the Schema message itself so that Schema
>> really does just represent the type? That seems reasonable to me, and
>> alleviates my concerns about how this applies to columnar encodings a bit
>> as well.
>>
>
> Yes, that's exactly what I'm advocating.
>
>
>>
>> We could lift all of the LogicalTypeConversion messages out of the
>> Schema and the LogicalType like this:
>>
>> message SchemaCoder {
>>   Schema schema = 1;
>>   LogicalTypeConversion root_conversion = 2;
>>   map attribute_conversions = 3; //
>> only necessary for user type aliases, portable logical types by 
>> definition
>> have nothing SDK-specific
>> }
>>
>
> I'm not sure what the map is for? I think we have status quo wihtout
> it.
>

 My intention was that the SDK-specific information (to/from functions)
 for any nested fields that are themselves user type aliases would be stored
 in this map. That was the motivation for my next question, if we don't
 allow user types to be nested within other user types we may not need it.

>>>
>>> Oh, is this meant to contain the ids of all the logical 

Re: I'm thinking about new features, what do you think?

2019-06-07 Thread Robert Burke
I'm not sure I understand the desired properties of GroupByMultiKey.

Offhand, am I right interpreting GroupByMultiKey as essentially forming a
graph of the keys based on the MultiKeys nodes, and the number of resulting
iterables is based on the components of the graph.

If that's the case then, what does the integer do when creating the
GroupByMultiKey?

In the example, it seems to be saying "I'd like 3 groups" but wouldn't that
be a property of the implicit connected graphs of MultiKeys?

Thank you very much!


On Fri, Jun 7, 2019, 10:14 AM Jan Lukavský  wrote:

> Hi,
>
> that sounds interesting, but it seems to be computationally intensive
> and might not be well scalable, if I understand it correctly. It looks
> like it needs a transitive closure, am I right?
>
>   Jan
>
> On 6/7/19 11:17 AM, i.am.moai wrote:
> > Hello everyone, nice to meet you
> >
> > I am Naoki Hyu(日宇尚記). a developer live in Tokyo. I often use scala and
> > python as my favorite language .
> >
> > I have no experience with OSS development, but as I use DataFlow at
> > work, I want to contribute to the development of Beam.
> >
> > In fact, there is a feature I want to develop, and now I have the
> > source code on my local PC.
> >
> > The feature I want to create is an extension of GroupBy to a multiple
> > key, which realizes more complex grouping.
> >
> > https://issues.apache.org/jira/browse/BEAM-7358
> >
> > Everyone, could you give me an opinion on this intent?
> >
>


Re: I'm thinking about new features, what do you think?

2019-06-07 Thread Eugene Kirpichov
It looks like you want to take a PCollection of lists of items of the same
type (but not necessarily of the same length - in your example you pad them
to the same length but that's unnecessary), induce an undirected graph on
them where there's an edge between XS and YS if they have an element in
common*, and compute connected components on that graph.

*you can make the problem somewhat simpler if for each list you also add
nodes for its individual elements + edges from element to list. Then the
total size of the graph increases only linearly, and the connected
components are the same, but the number of edges is no longer quadratic in
the worst case.

This looks like a really specialized use case (I've never seen a
sufficiently similar problem in my career) so contributing it to the Beam
SDK might not be the best way to go, unless more people chime in that
they'd find it useful.

Unfortunately it is also likely not possible to implement in a scalable way
using Beam primitives, because Beam does not yet support iterative
computations, and computing connected components provably requires at least
O(log N) iterations. It is also easy to prove that your original problem
can not be solved faster: any connected components problem can be reduced
to yours by creating a PCollection with 1 element per edge, where the
element is {source, target}.

If you can elaborate why you think you need this algorithm, the community
might help you find a different way to accomplish the original task.

On Fri, Jun 7, 2019 at 12:14 PM Jan Lukavský  wrote:

> Hi,
>
> that sounds interesting, but it seems to be computationally intensive
> and might not be well scalable, if I understand it correctly. It looks
> like it needs a transitive closure, am I right?
>
>   Jan
>
> On 6/7/19 11:17 AM, i.am.moai wrote:
> > Hello everyone, nice to meet you
> >
> > I am Naoki Hyu(日宇尚記). a developer live in Tokyo. I often use scala and
> > python as my favorite language .
> >
> > I have no experience with OSS development, but as I use DataFlow at
> > work, I want to contribute to the development of Beam.
> >
> > In fact, there is a feature I want to develop, and now I have the
> > source code on my local PC.
> >
> > The feature I want to create is an extension of GroupBy to a multiple
> > key, which realizes more complex grouping.
> >
> > https://issues.apache.org/jira/browse/BEAM-7358
> >
> > Everyone, could you give me an opinion on this intent?
> >
>


Re: I'm thinking about new features, what do you think?

2019-06-07 Thread Jan Lukavský

Hi,

that sounds interesting, but it seems to be computationally intensive 
and might not be well scalable, if I understand it correctly. It looks 
like it needs a transitive closure, am I right?


 Jan

On 6/7/19 11:17 AM, i.am.moai wrote:

Hello everyone, nice to meet you

I am Naoki Hyu(日宇尚記). a developer live in Tokyo. I often use scala and 
python as my favorite language .


I have no experience with OSS development, but as I use DataFlow at 
work, I want to contribute to the development of Beam.


In fact, there is a feature I want to develop, and now I have the 
source code on my local PC.


The feature I want to create is an extension of GroupBy to a multiple 
key, which realizes more complex grouping.


https://issues.apache.org/jira/browse/BEAM-7358

Everyone, could you give me an opinion on this intent?



Re: [DISCUSS] Cookbooks for users with knowledge in other frameworks

2019-06-07 Thread Maximilian Michels

Sounds like a good idea. I think the same can be done for Flink; Flink's and 
Spark's APIs are similar to a large degree.

Here also a link to the transforms: 
https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/operators/

-Max

On 04.06.19 03:20, Ahmet Altay wrote:

Thank you for the feedback so far. It seems like this will be generally
helpful :)

I guess next step would be, would anyone be interested in working in
this area? We can potentially break this down into starter tasks.

On Sat, Jun 1, 2019 at 7:00 PM Ankur Goenka mailto:goe...@google.com>> wrote:

    +1 for the proposal.
    Compatibility Matrix
     can
    be a good place to show case parity between different runners.


+1

    Do you think we should write 2 way examples [Spark, Flink, ..]<=>Beam?


Both ways, would be most useful I believe.




    On Sat, Jun 1, 2019 at 4:31 PM Reza Rokni mailto:r...@google.com>> wrote:

    For layer 1, what about working through this link as a starting
    point :
    
https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations?


+1


    On Sat, 1 Jun 2019 at 09:21, Ahmet Altay mailto:al...@google.com>> wrote:

    Thank you Reza. That separation makes sense to me.

    On Wed, May 29, 2019 at 6:26 PM Reza Rokni mailto:r...@google.com>> wrote:

    +1

    I think there will be at least two layers of this;

    Layer 1 - Using primitives : I do join, GBK,
    Aggregation... with system x this way, what is the
    canonical equivalent in Beam.
    Layer 2 - Patterns : I read and join Unbounded and
    Bounded Data in system x this way, what is the canonical
    equivalent in Beam.

    I suspect as a first pass Layer 1 is reasonably well
    bounded work, there would need to be agreement on
    "canonical" version of how to do something in Beam as
    this could be seen to be opinionated. As there are often
    a multitude of ways of doing x


    Once we identify a set of layer 1 items, we could crowd
    source the canonical implementations. I believe we can use
    our usual code review process to settle on a version that is
    agreeable. (Examples have the same issue, they are
    probably opinionated today based on the author but it works
    out.)



    On Thu, 30 May 2019 at 08:56, Ahmet Altay
    mailto:al...@google.com>> wrote:

    Hi all,

    Inspired by the user asking about a Spark feature in
    Beam [1] in the release thread, I searched the user@
    list and noticed a few instances of people asking
    for question like "I can do X in Spark, how can I do
    that in Beam?" Would it make sense to add
    documentation to explain how certain tasks that can
    be accomplished in Beam with side by side examples
    of doing the same task in Beam/Spark etc. It could
    help with on-boarding because it will be easier for
    people to leverage their existing knowledge. It
    could also help other frameworks as well, because it
    will serve as a Rosetta stone with two translations.

    Questions I have are:
    - Would such a thing be a helpful?
    - Is it feasible? Would a few pages worth of
    examples can cover enough use cases?

    Thank you!
    Ahmet

    [1]
    
https://lists.apache.org/thread.html/b73a54aa1e6e9933628f177b04a8f907c26cac854745fa081c478eff@%3Cdev.beam.apache.org%3E



    --

    This email may be confidential and privileged. If you
    received this communication by mistake, please don't
    forward it to anyone else, please erase all copies and
    attachments, and please let me know that it has gone to
    the wrong person.

    The above terms reflect a potential business
    arrangement, are provided solely as a basis for further
    discussion, and are not intended to be and do not
    constitute a legally binding obligation. No legally
    binding obligations will be created, implied, or
    inferred until an agreement in final form is executed in
    writing by all parties involved.



    --

    This email may be confidential and privileged. If you received
    this communication by mistake, please don't forward it to anyone
    else, please erase all copies and attachments, and please 

Re: [PROPOSAL] Prepare for LTS bugfix release 2.7.1

2019-06-07 Thread Maximilian Michels

Created an up-to-date version of the Flink backports for 2.7.1: 
https://github.com/apache/beam/pull/8787

Some of the Gradle task names have changed which makes testing via Jenkins 
hard. Will have to run them manually before merging.

-Max

On 06.06.19 17:41, Kenneth Knowles wrote:

Hi all,

Re-raising this thread. I got busy for the last month, and also did not
want to overlap the 2.13.0 release process. Now I want to pick up 2.7.1
again.

Can everyone check on any bug they have targeted to 2.7.1 [1] and get
the backports merged to release-2.7.1 and the tickets resolved?

Kenn

[1]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%202.7.1%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC

On Fri, Apr 26, 2019 at 11:19 AM Ahmet Altay mailto:al...@google.com>> wrote:

    I agree with both keeping 2.7.x going until a new LTS is declared
    and declaring LTS spost-release after some use. 2.12 might actually
    be a good candidate, with multiple RCs/validations it presumably is
    well tested. We can consider that after it gets some real world use.

    On Fri, Apr 26, 2019 at 6:29 AM Robert Bradshaw mailto:rober...@google.com>> wrote:

    IIRC, there was some talk on making 2.12 the next LTS, but the
    consensus is to decide on a LTS after having had some experience
    with
    it, not at or before the release itself.


    On Fri, Apr 26, 2019 at 3:04 PM Alexey Romanenko
    mailto:aromanenko@gmail.com>> wrote:
 >
 > Thanks for working on this, Kenn.
 >
 > Perhaps, I missed this but has it been already
    discussed/decided what will be the next LTS release?
 >
 > On 26 Apr 2019, at 08:02, Kenneth Knowles mailto:k...@apache.org>> wrote:
 >
 > Since it is all trivially reversible if there is some other
    feeling about this thread, I have gone ahead and started the work:
 >
 >  - I made release-2.7.1 branch point to the same commit as
    release-2.7.0 so there is something to target PRs
 >  - I have opened the first PR, cherry-picking the set_version
    script and using it to set the version on the branch to 2.7.1:
    https://github.com/apache/beam/pull/8407 (found bug in the new
    script right away :-)
 >
 > Here is the release with list of issues:
    https://issues.apache.org/jira/projects/BEAM/versions/12344458.
    So anyone can grab a ticket and volunteer to open a backport PR
    to the release-2.7.1 branch.
 >
 > I don't have a strong opinion about how long we should
    support the 2.7.x line. I am curious about different
    perspectives on user / vendor needs. I have two very basic
    thoughts: (1) we surely need to keep it going until some time
    after we have another LTS designated, to make sure there is a
    clear path for anyone only using LTS releases and (2) if we
    decide to end support of 2.7.x but then someone volunteers to
    backport and release, of course I would not expect anyone to
    block them, so it has no maximum lifetime, but we just need
    consensus on a minimum. And of course that consensus cannot
    force anyone to do the work, but is just a resolution of the
    community.
 >
 > Kenn
 >
 > On Thu, Apr 25, 2019 at 10:29 PM Jean-Baptiste Onofré
    mailto:j...@nanthrax.net>> wrote:
 >>
 >> +1 it sounds good to me.
 >>
 >> Thanks !
 >>
 >> Regards
 >> JB
 >>
 >> On 26/04/2019 02:42, Kenneth Knowles wrote:
 >> > Hi all,
 >> >
 >> > Since the release of 2.7.0 we have identified some serious
    bugs:
 >> >
 >> >  - There are 8 (non-dupe) issues* tagged with Fix Version
    2.7.1
 >> >  - 2 are rated "Blocker" (aka P0) but I think the others
    may be underrated
 >> >  - If you know of a critical bug that is not on that list,
    please file
 >> > an LTS backport ticket for it
 >> >
 >> > If a user is on an old version and wants to move to the
    LTS, there are
 >> > some real blockers. I propose that we perform a 2.7.1
    release starting now.
 >> >
 >> > I volunteer to manage the release. What do you think?
 >> >
 >> > Kenn
 >> >
 >> > *Some are "resolved" but this is not accurate as the LTS
    2.7.1 branch is
 >> > not created yet. I suggest filing a ticket to track just
    the LTS
 >> > backport when you hit a bug that merits it.
 >> >
 >
 >





I'm thinking about new features, what do you think?

2019-06-07 Thread i.am.moai
Hello everyone, nice to meet you

I am Naoki Hyu(日宇尚記). a developer live in Tokyo. I often use scala and
python as my favorite language .

I have no experience with OSS development, but as I use DataFlow at work, I
want to contribute to the development of Beam.

In fact, there is a feature I want to develop, and now I have the source
code on my local PC.

The feature I want to create is an extension of GroupBy to a multiple key,
which realizes more complex grouping.

https://issues.apache.org/jira/browse/BEAM-7358

Everyone, could you give me an opinion on this intent?


Re: Plan for dropping python 2 support

2019-06-07 Thread Robert Bradshaw
I don't think the second release with robust/recommended Python 3
support should be the last release with Python 2 support--that is
simply not enough time for people to migrate. (Look at how long it
took us...) It does make a lot of sense to at least have one LTS
release with support for both.

Regarding timeline, I think we could safely say we expect to support
Python 2 through 2019, likely for some of 2020 (possibly only via an
LTS release), and (very) unlikely beyond 2020.

On Wed, Jun 5, 2019 at 6:34 PM Ahmet Altay  wrote:
>
> I agree with the sentiment on this thread. Our priority needs to be offering 
> good python 3 support that we can comfortably recommend users to switch. 
> Progress on that so far has been promising and I do anticipate that we will 
> reach there in the near future.
>
> My proposal would be, once we reach to that state, we can mark the first 
> subsequent Beam release as the last Beam release that supports Python 2. 
> (Alternatively: in line with the previous experimental/deprecated discussion 
> we can make 2 more release with python 2 support rather than just 1 more.) 
> With the current state, we would not give users plenty of time to upgrade 
> python 3. So in addition, I would suggest we can consider and upgrade relief 
> by offering something like a 6-month support on the last python 2 compatible 
> release. We might do that in the context of an LTS release.
>
> I do not believe we have a timeline we can share with users at this point. 
> However if we go with this suggestion, we will probably support python 2 
> approximately until mid-2020.
>
> Ahmet
>
> On Wed, Jun 5, 2019 at 4:53 AM Tanay Tummalapalli  wrote:
>>
>> We can support Python 2 for some time in 2020, but, we should target a date 
>> no later than 2020 to drop support.
>> If we do plan to drop support for Python 2 in 2020, we should sign the 
>> Python 3 statement[1], declaring that we will "drop support for Python 2.7 
>> no later than 2020".
>>
>> In addition to the statement, keeping a target release and date(if possible) 
>> or timeline to drop support would also help users to decide when they need 
>> to work on migrating to Python 3.
>>
>> Regards,
>> - TT
>>
>> [1] https://python3statement.org/
>>
>> On Wed, Jun 5, 2019 at 4:37 PM Robert Bradshaw  wrote:
>>>
>>> Until Python 3 support for Beam is officially out of beta and
>>> recommended, I don't think we can tell people to stop using Python 2.
>>> Given that 2020 is just over 6 months away, that seems a short
>>> transition time, so I would guess we'll have to continue supporting
>>> Python 2 sometime into 2020.
>>>
>>> A quick survey of users would be valuable here. But first priority is
>>> making Python 3 rock solid so we can unconditionally recommend it over
>>> Python 2.
>>>
>>> On Wed, Jun 5, 2019 at 12:27 PM Ismaël Mejía  wrote:
>>> >
>>> > Python 2 won't be maintained after 2020 [1]. I was wondering what will
>>> > be our (Beam) plan for this. Other projects [2] have started to alert
>>> > users that support will be removed so maybe we should decide or policy
>>> > for this too.
>>> >
>>> > [1] https://pythonclock.org/
>>> > [2] https://spark.apache.org/news/plan-for-dropping-python-2-support.html


Re: Removing shading by default within BeamModulePlugin.groovy

2019-06-07 Thread Ismaël Mejía
This is fantastic. Took a look at the PR and did not see anything that
jump to my eyes and also validated with two external projects with
today’s snapshots (after merge) without issues so far. Great that we
finally tackle this on, thanks Luke!

Have one minor comment because the title of the thread may be
confusing, after checking sdks-java-core I noticed we are still
shading other dependencies: protobuf, bytebuddy, antlr, apache
commons, so I suppose this was mostly around shading guava, isn’t it?

On Wed, Jun 5, 2019 at 10:09 PM Lukasz Cwik  wrote:
>
> I am able to pass several runners validates runner tests and the Java 
> PostCommit.
>
> I also was able to publish a release into the staging repository[1] and 
> compared the newly generated poms artifact-2.14.0-20190605.*-30.pom against 
> the previously nightly snapshot of artifact-2.14.0-20190605.*-28.pom for the 
> following projects as a spot check and found no differences in those poms:
> beam-sdks-java-core
> beam-sdks-java-fn-execution
> beam-runners-spark
>
> I believe my PR is now ready for review.
>
> 1: https://repository.apache.org/content/groups/snapshots/org/apache/beam/
>
> On Tue, Jun 4, 2019 at 7:18 PM Kenneth Knowles  wrote:
>>
>> Nice! This is a huge step. One thing that showed up in the last big gradle 
>> change was needing to check the generated poms.
>>
>> Kenn
>>
>> On Tue, Jun 4, 2019 at 5:07 PM Lukasz Cwik  wrote:
>>>
>>> Since we have been migrating to using vendoring instead of shading[1] and 
>>> due to previous efforts in vendoring[2, 3] I have opened up PR 8762[4] 
>>> which migrates all projects that weren't doing anything shading wise to not 
>>> perform any shading. This required me to fix up all intra project 
>>> dependencies and release publishing.
>>>
>>> The following is a list of all project paths which are still using shading 
>>> for some reason:
>>> model/*
>>> sdks/java/core
>>> sdks/java/extensions/kryo
>>> sdks/java/extensions/sql
>>> sdks/java/extensions/sql/jdbc
>>> sdks/java/harness
>>> runners/spark/job-server
>>> runners/direct-java
>>> runners/samza/job-server
>>> runners/google-cloud-dataflow-java/worker
>>> runners/google-cloud-dataflow-java/worker/legacy-worker
>>> runners/google-cloud-dataflow-java/worker/windmill
>>> vendor/*
>>>
>>> Out of the list above, migrating sdks/java/core and runners/direct-java (in 
>>> that order) would provide the most benefit to moving away from shading 
>>> within our project. Many of the others are either shaded proto classes or 
>>> applications (e.g. job-servers, harness, sql jdbc) and either require 
>>> shading to be compatible with vendoring or aren't meant to be used as 
>>> dependencies.
>>>
>>> Since this is a larger change that cuts across so many projects there is 
>>> risk for breakage. I'm looking for people to help test the change and 
>>> validate any scenarios that they are specifically interested in. I'm 
>>> planning to run several of the postcommits on my PR and check that we can 
>>> build a release in addition to any efforts others provide before looking to 
>>> have the change merged.
>>>
>>> The following guidance should help those who edit Gradle build files (after 
>>> this change is merged):
>>> * For projects that don't perform any shading, those projects have been 
>>> migrated to use the default configurations that the Gradle Java plugin 
>>> uses[5]. Note that the default configurations we use have been deprecated.
>>> * For projects that depend on another project that isn't shaded, the intra 
>>> project configuration has been swapped to use compile / testRuntime instead 
>>> of shadow and shadowTest
>>> * Existing projects that are still shaded should use the shadow and 
>>> shadowTest configurations as before.
>>>
>>> 1: 
>>> https://lists.apache.org/thread.html/4c12db35b40a6d56e170cd6fc8bb0ac4c43a99aa3cb7dbae54176815@%3Cdev.beam.apache.org%3E
>>> 2: 
>>> https://lists.apache.org/thread.html/4c12db35b40a6d56e170cd6fc8bb0ac4c43a99aa3cb7dbae54176815@%3Cdev.beam.apache.org%3E
>>> 3: 
>>> https://lists.apache.org/thread.html/972b5175641f4eaf7ec92870cc0ff72fa52e6f0bbaccc384a3814e45@%3Cdev.beam.apache.org%3E
>>> 4: https://github.com/apache/beam/pull/8762
>>> 5: 
>>> https://docs.gradle.org/current/userguide/java_plugin.html#sec:java_plugin_and_dependency_management


Re: Help triaging Jira issues

2019-06-07 Thread Ismaël Mejía
I took a look and reduced the untriaged issues to around 100. I
noticed however some patterns that are producing more untriaged issues
that we should have. Those can be probably automated (if JIRA has ways
to do it):

1. Issues created and assigned on creation can be marked as open.
2. Once an issue has an associated PR it could be marked as open if it
was in Triaged state.

Other common case that is probably harder to automate are issues that
are in Triaged state because we forgot to resolve/close them. I don’t
know how we can improve these, apart of reminding people to look that
they do not have untriaged assigned issues.

Another interesting triage to do are the issues that are Open and
assigned to members of the community that are not active anymore in
the project, but that’s probably worth of another discussion, as well
as how can we more effectively track open unassigned issues (which are
currently around 1600).

On Wed, Jun 5, 2019 at 7:03 PM Tanay Tummalapalli  wrote:
>
> Hi Kenneth,
>
> I already follow the issues@ mailing list pretty much daily.
> I'd like to help with triaging issues, especially ones related to the Python 
> SDK since I'm most familiar with it.
>
> On Wed, Jun 5, 2019 at 10:26 PM Alex Van Boxel  wrote:
>>
>> Hey Kenneth, I help out. I'm planning to contribute more on Beam and it 
>> seems to be ideal to keep up-to-date with the project.
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Wed, Jun 5, 2019 at 6:46 PM Kenneth Knowles  wrote:
>>>
>>> Hi all,
>>>
>>> I am requesting help in triaging incoming issues. I made a search here: 
>>> https://issues.apache.org/jira/issues/?filter=12345682
>>>
>>> I have a daily email subscription to this filter as a reminder, but rarely 
>>> can really sit down to do triage for very long. It has grown from just 
>>> under 200 to just over 200. The rate is actually pretty low but there is a 
>>> backlog. I also want to start re-triaging stale bugs but priority would be 
>>> (1) keep up with new bugs (2) clear backlog (3) re-triage stale bugs.
>>>
>>> Just FYI what I look for before I clicked "Triaged" is:
>>>
>>>  - correct component
>>>  - correct priority
>>>  - maybe ping someone in a comment or assign
>>>  - write to dev@ if it is a major problem
>>>
>>> If I can't figure that out, then I ask the reporter for clarification and 
>>> "Start Watching" the issue so I will receive their response.
>>>
>>> To avoid duplicate triage work it may help to assign to yourself 
>>> temporarily during triage phase.
>>>
>>> Any help greatly appreciated!
>>>
>>> Kenn


Re: @RequireTimeSortedInput design draft

2019-06-07 Thread Jan Lukavský

Hi Reza, interesting suggestions, thanks.

When you mentioned join, I recalled an older issue (which apparently was 
not yet transfered to Beam's JIRA)  [1]. Is this anyhow related to what 
you are implementing? Would you like to make your implementation 
accessible via Euphoria DSL [2]?


 Jan

[1] https://github.com/seznam/euphoria/issues/143

[2] 
https://github.com/apache/beam/blob/master/sdks/java/extensions/euphoria/src/main/java/org/apache/beam/sdk/extensions/euphoria/core/client/operator/Join.java


On 6/7/19 7:06 AM, Reza Rokni wrote:

Hi Jan,

I have been working on a timeseries extension which makes use of many 
of these techniques for joining two temporal streams, it's almost 
ready for the PR, will ping it here when it is as it might be useful 
for you. In general, I borrowed a lot of techniques from CoGroupBy code.


/1) need to figure out how to get Coder of input PCollection 
of stateful ParDo inside StatefulDoFnRunner/
My join takes in a  , in the outer transform I use things 
like leftCollection.getCoder()).getValueCoder(); Then when creating 
the Join transform I can defer the StateSpec object creation until the 
constructor is called.


/2) there are performance considerations, that can be solved 
probably only by Sorted Map State [2]/
Sorted Map is going to be awesome, until then the only option is to 
create a Cache in the DoFn to make it more efficient. For the cache to 
work you need to key on Window + key and do things like clear the 
cache @Startbundle. Better to wait for Sorted Map if this is not time 
critical.


/3) additional work is needed for allowedLateness to work 
correctly (and there are at least two ways how to solve this), see the 
design doc [3]/
Yup, in my case I can support this by not GC the right side of the 
join for now, but that is a compromise.


/4) more tests (for batch and validatesRunner) are needed/
I just posted a question on the best way to make use of 
the @ValidateRunner annotation on this list, sounds like it might be 
useful to you as well :-)


On Thu, 6 Jun 2019 at 23:03, Jan Lukavský > wrote:


Hi,

I have written a PoC implementation of this in [1] and I'd like to
discuss some implementation details. First of all, I'd appreciate any
feedback about this. There are some known issues:

  1) need to figure out how to get Coder of input PCollection of
stateful ParDo inside StatefulDoFnRunner

  2) there are performance considerations, that can be solved
probably
only by Sorted Map State [2]

  3) additional work is needed for allowedLateness to work correctly
(and there are at least two ways how to solve this), see the
design doc [3]

  4) more tests (for batch and validatesRunner) are needed

I have come across a few bugs in DirectRunner, which I tried to solve:

  a) timers seem to be broken in stateful pardo with side inputs

  b) timers need to be sorted by timestamp, otherwise state might be
cleared before it gets chance to be flushed


Thanks for feedback,

  Jan


[1] https://github.com/apache/beam/pull/8774

[2]

http://mail-archives.apache.org/mod_mbox/beam-dev/201905.mbox/%3ccalstk6+ldemtjmnuysn3vcufywjkhmgv1isfbdmxthoqh91...@mail.gmail.com%3e

[3]

https://docs.google.com/document/d/1ObLVUFsf1NcG8ZuIZE4aVy2RYKx2FfyMhkZYWPnI9-c/


On 5/23/19 4:40 PM, Robert Bradshaw wrote:
> Thanks for writing this up.
>
> I think the justification for adding this to the model needs to be
> that it is useful (you have this covered, though some examples would
> be nice) and that it's something that can't easily be done by users
> themselves (specifically, though it can be (relatively) cheaply done
> in streaming and batch, it's done in very different ways, and also
> that it's hard to do via composition).
>
> On Thu, May 23, 2019 at 4:10 PM Jan Lukavský mailto:je...@seznam.cz>> wrote:
>> Hi,
>>
>> I have written a very brief draft of how it might be possible to
>> implement @RequireTimeSortedInput discussed in [1]. I see the
document
>> [2] a starting point for a discussion. There are several open
questions,
>> which I believe can be resolved by this great community. :-)
>>
>> Jan
>>
>> [1]
http://mail-archives.apache.org/mod_mbox/beam-dev/201905.mbox/browser
>>
>> [2]
>>

https://docs.google.com/document/d/1ObLVUFsf1NcG8ZuIZE4aVy2RYKx2FfyMhkZYWPnI9-c/
>>



--

This email may be confidential and privileged. If you received this 
communication by mistake, please don't forward it to anyone else, 
please erase all copies and attachments, and please let me know that 
it has gone to the wrong person.


The above terms reflect a potential business arrangement, are provided 
solely as a basis for further discussion, and are not intended to be 
and do not constitute a legally binding obligation. No legally binding 
obligations will