Re: [Discuss] Retractions in Beam

2019-08-12 Thread Rui Wang
Hello!

I have also been building a proof of concept (PR), which implements the
streaming wordcount example from the design doc.

What is still missing in the PoC is the ordering-guarantee implementation
in the sink (which I am working on).
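
To make the semantics concrete, here is a small sketch in plain Java (not
the Beam API; the input stream and output format are made up for
illustration) of what a retracting streaming wordcount emits: each
refinement first retracts the previously emitted count, then adds the new
one.

import java.util.HashMap;
import java.util.Map;

public class RetractingWordCountSketch {
  public static void main(String[] args) {
    Map<String, Long> counts = new HashMap<>();
    for (String word : new String[] {"cat", "dog", "cat"}) {
      Long previous = counts.get(word);
      if (previous != null) {
        // Withdraw the stale count before publishing the refinement.
        System.out.println("RETRACT (" + word + ", " + previous + ")");
      }
      long updated = (previous == null) ? 1 : previous + 1;
      counts.put(word, updated);
      System.out.println("ADD     (" + word + ", " + updated + ")");
    }
  }
}

A consumer that applies the RETRACT/ADD pairs in order always holds exact
counts, which is also why the sink needs the ordering guarantee mentioned
above.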


-Rui

On Wed, Jul 24, 2019 at 1:37 PM Rui Wang  wrote:

> Hello!
>
> In case you are not aware, I have added a modified streaming wordcount
> example at the end of the doc to illustrate retractions.
>
>
> -Rui
>
> On Wed, Jul 10, 2019 at 10:58 AM Rui Wang  wrote:
>
>> Hi Community,
>>
>> Retractions are part of the core Beam model [1]. I have put together a doc
>> to discuss retractions: use cases, model, and API (see the link below).
>> This is a very early discussion on retractions, but I do hope we can
>> reach a consensus and eventually get retractions implemented in a useful
>> way.
>>
>>
>> doc link:
>> https://docs.google.com/document/d/14WRfxwk_iLUHGPty3C6ZenddPsp_d6jhmx0vuafXqmE/edit?usp=sharing
>>
>>
>> [1]: https://issues.apache.org/jira/browse/BEAM-91
>>
>>
>> -Rui
>>
>


Re: [VOTE] Support ZetaSQL as another SQL dialect for BeamSQL in Beam repo

2019-08-12 Thread Manu Zhang
+1

On Tue, Aug 13, 2019 at 11:55 AM Mingmin Xu  wrote:

> +1
>
> On Mon, Aug 12, 2019 at 8:53 PM Ryan McDowell 
> wrote:
>
>> +1
>>
>> On Mon, Aug 12, 2019 at 8:30 PM Reza Rokni  wrote:
>>
>>> +1
>>>
>>> On Tue, 13 Aug 2019 at 09:28, Ahmet Altay  wrote:
>>>
 +1

 On Mon, Aug 12, 2019 at 6:27 PM Kenneth Knowles 
 wrote:

> +1
>
> On Mon, Aug 12, 2019 at 4:43 PM Rui Wang  wrote:
>
>> Hi Community,
>>
>> I am using this separate thread to collect votes on contributing Beam
>> ZetaSQL(my way to say ZetaSQL as a dialect supported by BeamSQL) to Beam
>> repo.
>>
>> There are discussions related to benefits, technical design and
>> others on Beam ZetaSQL in [1]. The Beam ZetaSQL code lives in [2]. Note
>> that this vote is not about merging the PR, which should be decided by 
>> code
>> review. This vote is only to vote if Beam ZetaSQL should live in Beam 
>> repo.
>>
>> +1: Beam repo can host Beam ZetaSQL
>> -1: Beam repo should not host Beam ZetaSQL
>>
>> If there are more questions related to Beam ZetaSQL, please discuss
>> it in [1].
>>
>> [1]:
>> https://lists.apache.org/thread.html/eab70bb99218aaedfd506e979967379c0efa05ea56a792a1486f9c74@%3Cdev.beam.apache.org%3E
>> [2]: https://github.com/apache/beam/pull/9210
>>
>> -Rui
>>
>
>>>
>>>
>>
>
> --
> 
> Mingmin
>


Re: [VOTE] Support ZetaSQL as another SQL dialect for BeamSQL in Beam repo

2019-08-12 Thread Mingmin Xu
+1

On Mon, Aug 12, 2019 at 8:53 PM Ryan McDowell 
wrote:

> +1
>
> On Mon, Aug 12, 2019 at 8:30 PM Reza Rokni  wrote:
>
>> +1
>>
>> On Tue, 13 Aug 2019 at 09:28, Ahmet Altay  wrote:
>>
>>> +1
>>>
>>> On Mon, Aug 12, 2019 at 6:27 PM Kenneth Knowles  wrote:
>>>
 +1

 On Mon, Aug 12, 2019 at 4:43 PM Rui Wang  wrote:

> Hi Community,
>
> I am using this separate thread to collect votes on contributing Beam
> ZetaSQL(my way to say ZetaSQL as a dialect supported by BeamSQL) to Beam
> repo.
>
> There are discussions related to benefits, technical design and others
> on Beam ZetaSQL in [1]. The Beam ZetaSQL code lives in [2]. Note that this
> vote is not about merging the PR, which should be decided by code review.
> This vote is only to vote if Beam ZetaSQL should live in Beam repo.
>
> +1: Beam repo can host Beam ZetaSQL
> -1: Beam repo should not host Beam ZetaSQL
>
> If there are more questions related to Beam ZetaSQL, please discuss it
> in [1].
>
> [1]:
> https://lists.apache.org/thread.html/eab70bb99218aaedfd506e979967379c0efa05ea56a792a1486f9c74@%3Cdev.beam.apache.org%3E
> [2]: https://github.com/apache/beam/pull/9210
>
> -Rui
>

>>
>>
>

-- 

Mingmin


Re: [VOTE] Support ZetaSQL as another SQL dialect for BeamSQL in Beam repo

2019-08-12 Thread Ryan McDowell
+1

On Mon, Aug 12, 2019 at 8:30 PM Reza Rokni  wrote:

> +1
>
> On Tue, 13 Aug 2019 at 09:28, Ahmet Altay  wrote:
>
>> +1
>>
>> On Mon, Aug 12, 2019 at 6:27 PM Kenneth Knowles  wrote:
>>
>>> +1
>>>
>>> On Mon, Aug 12, 2019 at 4:43 PM Rui Wang  wrote:
>>>
 Hi Community,

 I am using this separate thread to collect votes on contributing Beam
 ZetaSQL(my way to say ZetaSQL as a dialect supported by BeamSQL) to Beam
 repo.

 There are discussions related to benefits, technical design and others
 on Beam ZetaSQL in [1]. The Beam ZetaSQL code lives in [2]. Note that this
 vote is not about merging the PR, which should be decided by code review.
 This vote is only to vote if Beam ZetaSQL should live in Beam repo.

 +1: Beam repo can host Beam ZetaSQL
 -1: Beam repo should not host Beam ZetaSQL

 If there are more questions related to Beam ZetaSQL, please discuss it
 in [1].

 [1]:
 https://lists.apache.org/thread.html/eab70bb99218aaedfd506e979967379c0efa05ea56a792a1486f9c74@%3Cdev.beam.apache.org%3E
 [2]: https://github.com/apache/beam/pull/9210

 -Rui

>>>
>
>


Re: [VOTE] Support ZetaSQL as another SQL dialect for BeamSQL in Beam repo

2019-08-12 Thread Reza Rokni
+1

On Tue, 13 Aug 2019 at 09:28, Ahmet Altay  wrote:

> +1
>
> On Mon, Aug 12, 2019 at 6:27 PM Kenneth Knowles  wrote:
>
>> +1
>>
>> On Mon, Aug 12, 2019 at 4:43 PM Rui Wang  wrote:
>>
>>> Hi Community,
>>>
>>> I am using this separate thread to collect votes on contributing Beam
>>> ZetaSQL(my way to say ZetaSQL as a dialect supported by BeamSQL) to Beam
>>> repo.
>>>
>>> There are discussions related to benefits, technical design and others
>>> on Beam ZetaSQL in [1]. The Beam ZetaSQL code lives in [2]. Note that this
>>> vote is not about merging the PR, which should be decided by code review.
>>> This vote is only to vote if Beam ZetaSQL should live in Beam repo.
>>>
>>> +1: Beam repo can host Beam ZetaSQL
>>> -1: Beam repo should not host Beam ZetaSQL
>>>
>>> If there are more questions related to Beam ZetaSQL, please discuss it
>>> in [1].
>>>
>>> [1]:
>>> https://lists.apache.org/thread.html/eab70bb99218aaedfd506e979967379c0efa05ea56a792a1486f9c74@%3Cdev.beam.apache.org%3E
>>> [2]: https://github.com/apache/beam/pull/9210
>>>
>>> -Rui
>>>
>>



Re: [VOTE] Support ZetaSQL as another SQL dialect for BeamSQL in Beam repo

2019-08-12 Thread Ahmet Altay
+1

On Mon, Aug 12, 2019 at 6:27 PM Kenneth Knowles  wrote:

> +1
>
> On Mon, Aug 12, 2019 at 4:43 PM Rui Wang  wrote:
>
>> Hi Community,
>>
>> I am using this separate thread to collect votes on contributing Beam
>> ZetaSQL(my way to say ZetaSQL as a dialect supported by BeamSQL) to Beam
>> repo.
>>
>> There are discussions related to benefits, technical design and others on
>> Beam ZetaSQL in [1]. The Beam ZetaSQL code lives in [2]. Note that this
>> vote is not about merging the PR, which should be decided by code review.
>> This vote is only to vote if Beam ZetaSQL should live in Beam repo.
>>
>> +1: Beam repo can host Beam ZetaSQL
>> -1: Beam repo should not host Beam ZetaSQL
>>
>> If there are more questions related to Beam ZetaSQL, please discuss it in
>> [1].
>>
>> [1]:
>> https://lists.apache.org/thread.html/eab70bb99218aaedfd506e979967379c0efa05ea56a792a1486f9c74@%3Cdev.beam.apache.org%3E
>> [2]: https://github.com/apache/beam/pull/9210
>>
>> -Rui
>>
>


Re: [VOTE] Support ZetaSQL as another SQL dialect for BeamSQL in Beam repo

2019-08-12 Thread Kenneth Knowles
+1

On Mon, Aug 12, 2019 at 4:43 PM Rui Wang  wrote:

> Hi Community,
>
> I am using this separate thread to collect votes on contributing Beam
> ZetaSQL(my way to say ZetaSQL as a dialect supported by BeamSQL) to Beam
> repo.
>
> There are discussions related to benefits, technical design and others on
> Beam ZetaSQL in [1]. The Beam ZetaSQL code lives in [2]. Note that this
> vote is not about merging the PR, which should be decided by code review.
> This vote is only to vote if Beam ZetaSQL should live in Beam repo.
>
> +1: Beam repo can host Beam ZetaSQL
> -1: Beam repo should not host Beam ZetaSQL
>
> If there are more questions related to Beam ZetaSQL, please discuss it in
> [1].
>
> [1]:
> https://lists.apache.org/thread.html/eab70bb99218aaedfd506e979967379c0efa05ea56a792a1486f9c74@%3Cdev.beam.apache.org%3E
> [2]: https://github.com/apache/beam/pull/9210
>
> -Rui
>


Re: [Discuss] Ideas for Apache Beam presence in social media

2019-08-12 Thread Thomas Weise
Yes, everyone should have comment access for this to make sense. Sorry for
the confusion.


On Mon, Aug 12, 2019 at 5:30 PM Kenneth Knowles  wrote:

> Thanks for setting this up. It is nice to start building up a system for
> this so everyone can participate.
>
> Regarding Jira versus notifications, how are people with only view access
> to make suggestions for tweets? When I suggested gdocs, I meant for
> everyone to have "comment" access, so then anyone can subscribe to all
> comments, which would include suggestions. This allows anyone to suggest
> tweets and anyone to subscribe to suggestions.
>
> Kenn
>
> On Wed, Aug 7, 2019 at 4:07 PM Aizhamal Nurmamat kyzy 
> wrote:
>
>> Thanks Thomas, changed the doc to view only and granted you and Ahmet
>> edit access.
>> @all - please send requests for access with your google accounts. I will
>> update the thread once I document the process and submit the PR to the
>> website.
>>
>> Thank you,
>> Aizhamal
>>
>> On Wed, Aug 7, 2019 at 3:12 PM Thomas Weise  wrote:
>>
>>> I was able to subscribe now.
>>>
>>> Reminder for others that the spreadsheet of interest can be found here:
>>> s.apache.org/beam-tweets
>>>
>>> Aizhamal,
>>>
>>> Can you help with a couple changes to bring this closer to how similar
>>> gdoc resources are handled?
>>>
>>> * Make the document view only. *PMC members* that care to help with
>>> this can request edit access.
>>> * Document the process for other contributors. Maybe here?
>>> https://beam.apache.org/contribute/
>>>
>>> Thanks!
>>>
>>>
>>>
>>> On Wed, Aug 7, 2019 at 2:39 PM Ahmet Altay  wrote:
>>>
 I am able to subscribe to notifications now. Thomas does it work for
 you?

 On Wed, Aug 7, 2019 at 2:23 PM Aizhamal Nurmamat kyzy <
 aizha...@apache.org> wrote:

> Hi all,
>
> I set the access to 'anyone can edit'. Let me know if notifications
> work now.
>
> Thanks,
>
> On Wed, Aug 7, 2019 at 2:00 PM Ahmet Altay  wrote:
>
>> You are probably right and it is an access issue.
>>
>> Aizhamal, could you give us edit access? And we can see if
>> notifications work after that.
>>
>> On Wed, Aug 7, 2019 at 1:41 PM Thomas Weise  wrote:
>>
>>> The use of JIRA was also suggested before, but why do the
>>> notifications not work? I wasn't able to subscribe and I suspect that 
>>> was
>>> due to not having sufficient access to the spreadsheet?
>>>
>>>
>>>
>>> On Wed, Aug 7, 2019 at 1:26 PM Ahmet Altay  wrote:
>>>
 As far as I understand we have not resolved this discussion and the
 sticking issue is that there is no good way of subscribing to changes 
 (i.e.
 proposals for tweets) for interested parties. The method suggested in 
 this
 thread (e.g. Tools and then Notification rules.) does not work for some
 reason for a few of us including myself.

 Could we try to use any of our existing tools? For example, could
 the proposals be done in the form of filing a new JIRA issue under a
 specific component. All of us should be able to get notifications for 
 that.
 And then we can follow the lazy consensus and alternative approval 
 options
 as written down by Robert (1 week or 3 PMC +1s).

 Ahmet

 On Mon, Jun 24, 2019 at 3:47 AM Robert Bradshaw <
 rober...@google.com> wrote:

> On Fri, Jun 21, 2019 at 1:02 PM Thomas Weise 
> wrote:
> >
> > From what I understand, spreadsheets (not docs) provide the
> functionality that we need:
> https://support.google.com/docs/answer/91588
> >
> > Interested PMC members can subscribe and react to changes in the
> spreadsheet.
> >
> > Lazy consensus requires a minimum wait. How much should that be?
>
> 72 hours is a pretty typical minimum, but I'd go a bit longer. A
> week?
> I'd want at least two pairs of eyes though, so if it's a PMC member
> that proposes the message someone else on the PMC should approve
> before it goes out on an official channel.
>
> > Should there be an alternative approval option (like minimum
> number of +1s) for immediate post (in case it is time sensitive)?
>
> +1. I'd say three is probably sufficient, five at most.
>
> For both of these, let's just do something conservative and see
> how it goes.
>
> > On Fri, Jun 7, 2019 at 7:28 PM Kenneth Knowles 
> wrote:
> >>
> >> GDocs also have the ability to subscribe to all comments so
> that works as well.
> >>
> >> This does add another system to our infrastructure, versus
> using Jira. But I think a spreadsheet for tracking what has been sent 
> and
> when, it could be helpful 

Re: [Discuss] Ideas for Apache Beam presence in social media

2019-08-12 Thread Kenneth Knowles
Thanks for setting this up. It is nice to start building up a system for
this so everyone can participate.

Regarding Jira versus notifications, how are people with only view access
supposed to make suggestions for tweets? When I suggested gdocs, I meant for
everyone to have "comment" access, so then anyone can subscribe to all
comments, which would include suggestions. This allows anyone to suggest
tweets and anyone to subscribe to suggestions.

Kenn

On Wed, Aug 7, 2019 at 4:07 PM Aizhamal Nurmamat kyzy 
wrote:

> Thanks Thomas, changed the doc to view only and granted you and Ahmet edit
> access.
> @all - please send requests for access with your google accounts. I will
> update the thread once I document the process and submit the PR to the
> website.
>
> Thank you,
> Aizhamal
>
> On Wed, Aug 7, 2019 at 3:12 PM Thomas Weise  wrote:
>
>> I was able to subscribe now.
>>
>> Reminder for others that the spreadsheet of interest can be found here:
>> s.apache.org/beam-tweets
>>
>> Aizhamal,
>>
>> Can you help with a couple changes to bring this closer to how similar
>> gdoc resources are handled?
>>
>> * Make the document view only. *PMC members* that care to help with this
>> can request edit access.
>> * Document the process for other contributors. Maybe here?
>> https://beam.apache.org/contribute/
>>
>> Thanks!
>>
>>
>>
>> On Wed, Aug 7, 2019 at 2:39 PM Ahmet Altay  wrote:
>>
>>> I am able to subscribe to notifications now. Thomas does it work for you?
>>>
>>> On Wed, Aug 7, 2019 at 2:23 PM Aizhamal Nurmamat kyzy <
>>> aizha...@apache.org> wrote:
>>>
 Hi all,

 I set the access to 'anyone can edit'. Let me know if notifications
 work now.

 Thanks,

 On Wed, Aug 7, 2019 at 2:00 PM Ahmet Altay  wrote:

> You are probably right and it is an access issue.
>
> Aizhamal, could you give us edit access? And we can see if
> notifications work after that.
>
> On Wed, Aug 7, 2019 at 1:41 PM Thomas Weise  wrote:
>
>> The use of JIRA was also suggested before, but why do the
>> notifications not work? I wasn't able to subscribe and I suspect that was
>> due to not having sufficient access to the spreadsheet?
>>
>>
>>
>> On Wed, Aug 7, 2019 at 1:26 PM Ahmet Altay  wrote:
>>
>>> As far as I understand we have not resolved this discussion and the
>>> sticking issue is that there is no good way of subscribing to changes 
>>> (i.e.
>>> proposals for tweets) for interested parties. The method suggested in 
>>> this
>>> thread (e.g. Tools and then Notification rules.) does not work for some
>>> reason for a few of us including myself.
>>>
>>> Could we try to use any of our existing tools? For example, could
>>> the proposals be done in the form of filing a new JIRA issue under a
>>> specific component. All of us should be able to get notifications for 
>>> that.
>>> And then we can follow the lazy consensus and alternative approval 
>>> options
>>> as written down by Robert (1 week or 3 PMC +1s).
>>>
>>> Ahmet
>>>
>>> On Mon, Jun 24, 2019 at 3:47 AM Robert Bradshaw 
>>> wrote:
>>>
 On Fri, Jun 21, 2019 at 1:02 PM Thomas Weise 
 wrote:
 >
 > From what I understand, spreadsheets (not docs) provide the
 functionality that we need:
 https://support.google.com/docs/answer/91588
 >
 > Interested PMC members can subscribe and react to changes in the
 spreadsheet.
 >
 > Lazy consensus requires a minimum wait. How much should that be?

 72 hours is a pretty typical minimum, but I'd go a bit longer. A
 week?
 I'd want at least two pairs of eyes though, so if it's a PMC member
 that proposes the message someone else on the PMC should approve
 before it goes out on an official channel.

 > Should there be an alternative approval option (like minimum
 number of +1s) for immediate post (in case it is time sensitive)?

 +1. I'd say three is probably sufficient, five at most.

 For both of these, let's just do something conservative and see how
 it goes.

 > On Fri, Jun 7, 2019 at 7:28 PM Kenneth Knowles 
 wrote:
 >>
 >> GDocs also have the ability to subscribe to all comments so that
 works as well.
 >>
 >> This does add another system to our infrastructure, versus using
 Jira. But I think a spreadsheet for tracking what has been sent and 
 when,
 it could be helpful to have the added structure.
 >>
 >> Kenn
 >>
 >> On Fri, Jun 7, 2019 at 10:05 AM Thomas Weise 
 wrote:
 >>>
 >>> Here is an idea how this could be done: Create a JIRA ticket
 that will always remain open. Have folks append their suggested tweets 
 as

[VOTE] Support ZetaSQL as another SQL dialect for BeamSQL in Beam repo

2019-08-12 Thread Rui Wang
Hi Community,

I am using this separate thread to collect votes on contributing Beam
ZetaSQL (my way of saying ZetaSQL as a dialect supported by BeamSQL) to the
Beam repo.

There are discussions of the benefits, technical design, and other aspects
of Beam ZetaSQL in [1]. The Beam ZetaSQL code lives in [2]. Note that this
vote is not about merging the PR, which should be decided by code review.
This vote is only on whether Beam ZetaSQL should live in the Beam repo.

+1: Beam repo can host Beam ZetaSQL
-1: Beam repo should not host Beam ZetaSQL

If there are more questions related to Beam ZetaSQL, please discuss them in
[1].

[1]:
https://lists.apache.org/thread.html/eab70bb99218aaedfd506e979967379c0efa05ea56a792a1486f9c74@%3Cdev.beam.apache.org%3E
[2]: https://github.com/apache/beam/pull/9210

-Rui


Re: Support ZetaSQL as a new SQL dialect in BeamSQL

2019-08-12 Thread Rui Wang
Thanks Kenneth.

I will start a vote for the Beam ZetaSQL contribution.

-Rui

On Mon, Aug 12, 2019 at 4:11 PM Kenneth Knowles  wrote:

> Nice explanations of the reasoning. I think two things will stay
> approximately the same even as the ecosystem develops: (1) ZetaSQL has
> pretty clear semantics so we will have a compliant parser, whether it is
> the official one or another like Calcite Babel, and (2) we will need a way
> to implement all the standard ZetaSQL functions and this will be the same
> no matter the frontend.
>
> For a contribution this large where i.p. clearance is necessary, a vote is
> appropriate. It can happen at the same time or even after i.p. clearance.
>
> Kenn
>
> On Wed, Aug 7, 2019 at 1:08 PM Mingmin Xu  wrote:
>
>> Thanks to highlight the parts of types/operators/functions/..., that does
>> make things more complicated. +1 that as a short/middle term solution, the
>> proposal is reasonable. We could follow up in future to handle it in
>> Calcite Babel if possible.
>>
>> Mingmin
>>
>> On Tue, Aug 6, 2019 at 3:57 PM Rui Wang  wrote:
>>
>>> Hi Mingmin,
>>>
>>> Honestly I don't have an answer to it: a SQL dialect is complicated and
>>> I don't have enough understanding on Calcite (Calcite has a big repo).
>>> Based on my read from CALCITE-2280
>>> , the closer to
>>> standard sql that a dialect is, the less blockers that we will have to
>>> support this dialect in Calcite babel parser.
>>>
>>> However, this is a good question, which raises a good aspect that I
>>> found people usually ignore: supporting a SQL dialect is not only support a
>>> type of syntax. It also includes data types, built-in sql functions,
>>> operators and many other stuff.
>>>
>>> I especially found the following incompatibilities between Calcite and
>>> ZetaSQL during the development:
>>> 1. Calcite does not support Struct/Row type well because Calcite
>>> flattens Rows when reading from tables by adding an extra Projection on top
>>> of tables.
>>> 2. I had trouble in supporting DATETIME(or timestamp without time zone)
>>> type.
>>> 3. Huge incompatibilities on SQL functions. E.g. return type is
>>> different for AVG(long), and many many more.
>>> 4. I am not sure if Calcite has the same set of type casting rules as
>>> BigQuery(my impression is there are differences).
>>>
>>>
>>> I would say in the short/mid term, it's much easier to use logical plan
>>> as IR to implement another SQL dialect for BeamSQL (Linkedin has
>>> similar practice, see their blog post
>>> 
>>> ).
>>>
>>> For the longer term, it would be interesting to see how we can add
>>> BigQuery syntax (plus its data types and sql functions) to Calcite babel
>>> parser.
>>>
>>>
>>>
>>> -Rui
>>>
>>>
>>> On Tue, Aug 6, 2019 at 2:49 PM Mingmin Xu  wrote:
>>>
 Just take a look at https://issues.apache.org/jira/browse/CALCITE-2280
 which introduced Babel parser in Calcite to support varied dialects, this
 may be an easier way to support BigQuery syntax. @Rui do you notice any big
 difference between Calcite engine and ZetaSQL, like parsing, optimization?
 If that's the case, it make sense to build the alternative switch in Beam
 side.

 On Sun, Aug 4, 2019 at 4:47 PM Rui Wang  wrote:

> Mingmin - it sounds like an awesome idea to translate from SparkSQL.
> It's even more exciting to know if we could translate Spark
> Structured Streaming code by a similar way, which enables existing Spark
> SQL/Structure Streaming pipelines run on Beam.
>
> Reuven - Thanks for bringing it up. I tried to search dev@calcite and
> only found[1]. From that thread, I see that adding ZetaSQL to Calcite
> itself is still a discussion. I am also looking for if anyone knows more
> progress on this work than the thread.
>
>
> [1]:
> http://mail-archives.apache.org/mod_mbox/calcite-dev/201905.mbox/%3CCAMj=j=-sPWgxzAgusnx8OYvYDYDcDY=dupe6poytrxhjri9...@mail.gmail.com%3E
>
> -Rui
>
> On Sun, Aug 4, 2019 at 3:54 PM Reuven Lax  wrote:
>
>> I hear rumours that the Calcite project is planning on adding a
>> zeta-SQL compatible parser to Calcite itself, in which case there will 
>> be a
>> Java parser we can use as well. Does anyone know if this work is still
>> going on?
>>
>> On Sat, Aug 3, 2019 at 8:41 PM Manu Zhang 
>> wrote:
>>
>>> A question to the community, does the size of the change require any
 process besides the usual PR reviews?

>>>
>>> I think so. This is a big change and has come as kind of a surprise
>>> (sorry if I've missed previous discussions).
>>>
>>> Rui, could you explain more on how things will play out between
>>> BeamSQL and ZetaSQL (A design doc including the pluggable interface 
>>> would
>>> be perfect). 

Re: Support ZetaSQL as a new SQL dialect in BeamSQL

2019-08-12 Thread Kenneth Knowles
Nice explanations of the reasoning. I think two things will stay
approximately the same even as the ecosystem develops: (1) ZetaSQL has
pretty clear semantics so we will have a compliant parser, whether it is
the official one or another like Calcite Babel, and (2) we will need a way
to implement all the standard ZetaSQL functions and this will be the same
no matter the frontend.
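
As a concrete example of (2), take AVG: as Rui notes below, the return
type of AVG(long) differs between engines. ZetaSQL specifies AVG over
INT64 as returning a DOUBLE, so whichever parser produces the plan, the
function has to be implemented to that spec. A minimal sketch (plain
Java, illustrative only, not the Beam SQL implementation):

public class ZetaSqlAvgSketch {
  // ZetaSQL semantics: floating-point mean, even for integer input.
  static double avgInt64(long[] values) {
    double sum = 0;
    for (long v : values) {
      sum += v;
    }
    return sum / values.length;
  }

  public static void main(String[] args) {
    System.out.println(avgInt64(new long[] {1, 2})); // prints 1.5, not 1
  }
}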

For a contribution this large, where IP clearance is necessary, a vote is
appropriate. It can happen at the same time as, or even after, IP clearance.

Kenn

On Wed, Aug 7, 2019 at 1:08 PM Mingmin Xu  wrote:

> Thanks to highlight the parts of types/operators/functions/..., that does
> make things more complicated. +1 that as a short/middle term solution, the
> proposal is reasonable. We could follow up in future to handle it in
> Calcite Babel if possible.
>
> Mingmin
>
> On Tue, Aug 6, 2019 at 3:57 PM Rui Wang  wrote:
>
>> Hi Mingmin,
>>
>> Honestly I don't have an answer to it: a SQL dialect is complicated and I
>> don't have enough understanding on Calcite (Calcite has a big repo). Based
>> on my read from CALCITE-2280
>> , the closer to
>> standard sql that a dialect is, the less blockers that we will have to
>> support this dialect in Calcite babel parser.
>>
>> However, this is a good question, and it raises an aspect that I find
>> people usually ignore: supporting a SQL dialect is not only supporting a
>> type of syntax. It also includes data types, built-in SQL functions,
>> operators, and many other things.
>>
>> I especially found the following incompatibilities between Calcite and
>> ZetaSQL during the development:
>> 1. Calcite does not support Struct/Row type well because Calcite flattens
>> Rows when reading from tables by adding an extra Projection on top of
>> tables.
>> 2. I had trouble in supporting DATETIME(or timestamp without time zone)
>> type.
>> 3. Huge incompatibilities on SQL functions. E.g. return type is different
>> for AVG(long), and many many more.
>> 4. I am not sure if Calcite has the same set of type casting rules as
>> BigQuery(my impression is there are differences).
>>
>>
>> I would say in the short/mid term, it's much easier to use logical plan
>> as IR to implement another SQL dialect for BeamSQL (Linkedin has
>> similar practice, see their blog post
>> 
>> ).
>>
>> For the longer term, it would be interesting to see how we can add
>> BigQuery syntax (plus its data types and sql functions) to Calcite babel
>> parser.
>>
>>
>>
>> -Rui
>>
>>
>> On Tue, Aug 6, 2019 at 2:49 PM Mingmin Xu  wrote:
>>
>>> Just take a look at https://issues.apache.org/jira/browse/CALCITE-2280
>>> which introduced Babel parser in Calcite to support varied dialects, this
>>> may be an easier way to support BigQuery syntax. @Rui do you notice any big
>>> difference between Calcite engine and ZetaSQL, like parsing, optimization?
>>> If that's the case, it make sense to build the alternative switch in Beam
>>> side.
>>>
>>> On Sun, Aug 4, 2019 at 4:47 PM Rui Wang  wrote:
>>>
 Mingmin - it sounds like an awesome idea to translate from SparkSQL.
 It's even more exciting to know if we could translate Spark
 Structured Streaming code by a similar way, which enables existing Spark
 SQL/Structure Streaming pipelines run on Beam.

 Reuven - Thanks for bringing it up. I tried to search dev@calcite and
 only found[1]. From that thread, I see that adding ZetaSQL to Calcite
 itself is still a discussion. I am also looking for if anyone knows more
 progress on this work than the thread.


 [1]:
 http://mail-archives.apache.org/mod_mbox/calcite-dev/201905.mbox/%3CCAMj=j=-sPWgxzAgusnx8OYvYDYDcDY=dupe6poytrxhjri9...@mail.gmail.com%3E

 -Rui

 On Sun, Aug 4, 2019 at 3:54 PM Reuven Lax  wrote:

> I hear rumours that the Calcite project is planning on adding a
> zeta-SQL compatible parser to Calcite itself, in which case there will be 
> a
> Java parser we can use as well. Does anyone know if this work is still
> going on?
>
> On Sat, Aug 3, 2019 at 8:41 PM Manu Zhang 
> wrote:
>
>> A question to the community, does the size of the change require any
>>> process besides the usual PR reviews?
>>>
>>
>> I think so. This is a big change and has come as kind of a surprise
>> (sorry if I've missed previous discussions).
>>
>> Rui, could you explain more on how things will play out between
>> BeamSQL and ZetaSQL (A design doc including the pluggable interface would
>> be perfect). From GitHub, ZetaSQL is mainly in C++ so what you are doing 
>> is
>> a port or a connector to ZetaSQL ? Do we need to depend on
>> https://github.com/google/zetasql ? ZetaSQL looks interesting but I
>> could barely find any doc for end users.

Re: Hello :)

2019-08-12 Thread Ahmet Altay
Done. Added your username as a contributor. You should be able to
self-assign issues now.

On Mon, Aug 12, 2019 at 12:12 PM Johan Hansson 
wrote:

> Hi again,
>
> By the way my jira name is :  Johhan
>
> Sorry if I made it complicated.
>
> Kind regards
> Johan Hansson
>
> On Sat, Aug 10, 2019 at 20:22, Johan Hansson <
> johan.eric.hans...@gmail.com> wrote:
>
>> Hi everyone!
>>
>> Sorry sent the first email from the wrong email address.
>>
>> My name is Johan and I work for a small startup(fintech) in Stockholm. I
>> would like to make some contributions with more examples for the Python
>> API. Could some one please add me to Jira so I can create and assign my
>> self to user stories.
>>
>> Thank you!
>>
>> Best regards
>>
>


Re: Hello :)

2019-08-12 Thread Johan Hansson
Hi again,

By the way, my Jira username is: Johhan

Sorry if I made it complicated.

Kind regards
Johan Hansson

On Sat, Aug 10, 2019 at 20:22, Johan Hansson <
johan.eric.hans...@gmail.com> wrote:

> Hi everyone!
>
> Sorry sent the first email from the wrong email address.
>
> My name is Johan and I work for a small startup(fintech) in Stockholm. I
> would like to make some contributions with more examples for the Python
> API. Could some one please add me to Jira so I can create and assign my
> self to user stories.
>
> Thank you!
>
> Best regards
>


Re: [FLINK-12653] and system state

2019-08-12 Thread Jan Lukavský
I've managed to fix that by introducing an (optional) method on DoFnRunner
called getSystemStateTags() (the default implementation returns
Collections.emptyList()), and then using that list to bind states early in
Flink's DoFnOperator ([1]).


@Max, WDYT?

Jan

[1] 
https://github.com/je-ik/beam/commit/1360fb6bd35192d443f97068c7ba2155f79e8802
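
For reference, a minimal sketch of the shape of that hook, reduced to the
relevant part (StateTag is the runners-core tag type; the interface name
here is illustrative, not the actual change):

import java.util.Collection;
import java.util.Collections;
import org.apache.beam.runners.core.StateTag;

public interface DoFnRunnerWithSystemState {
  // A runner such as Flink's DoFnOperator can call this during open() to
  // bind system state (e.g. the @RequiresTimeSortedInput buffer) before
  // the first element arrives. The default keeps existing runners
  // unaffected.
  default Collection<StateTag<?>> getSystemStateTags() {
    return Collections.emptyList();
  }
}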


On 8/12/19 4:00 PM, Jan Lukavský wrote:

Hi,

I have come across an issue that is very likely caused by [1]. The 
issue is that Beam's state is (generally) created lazily, after an 
element is received (as Max described in the Flink JIRA). Max also 
created a workaround [2], but that seems to work for user state only 
(i.e., state that has been created in user code by declaring @StateId; 
please correct me if I'm wrong). In my work, however, I created a 
system state (that holds elements before they are output, due to the 
@RequiresTimeSortedInput annotation, but that's probably not 
important), and this state is not covered by the workaround. This 
might be the case for all system states, generally, because they are 
not visible until an element arrives and the state is actually created.


Is my analysis correct? If so, would anyone have any suggestions on how 
to fix that?


Jan

[1] https://jira.apache.org/jira/browse/FLINK-12653

[2] https://issues.apache.org/jira/browse/BEAM-7144



Re: Write-through-cache in State logic

2019-08-12 Thread Lukasz Cwik
On Mon, Aug 12, 2019 at 10:09 AM Thomas Weise  wrote:

>
> On Mon, Aug 12, 2019 at 8:53 AM Maximilian Michels  wrote:
>
>> Thanks for starting this discussion Rakesh. An efficient cache layer is
>> one of the missing pieces for good performance in stateful pipelines.
>> The good news are that there is a level of caching already present in
>> Python which batches append requests until the bundle is finished.
>>
>> Thomas, in your example indeed we would have to profile to see why CPU
>> utilization is high on the Flink side but not in the Python SDK harness.
>> For example, older versions of Flink (<=1.5) have a high cost of
>> deleting existing instances of a timer when setting a timer.
>> Nevertheless, cross-bundle caching would likely result in increased
>> performance.
>>
>
> CPU on the Flink side was unchanged, and that's important. The throughput
> improvement comes from the extended bundle caching on the SDK side. That's
> what tells me that cross-bundle caching is needed. Of course, it will
> require a good solution for the write also and I like your idea of using
> the checkpoint boundary for that, especially since that already aligns with
> the bundle boundary and is under runner control. Of course we also want to
> be careful to not cause overly bursty writes.
>
> Profiling will be useful for the timer processing, that is also on my list
> of suspects.
>
>
>> Luke, I think the idea to merge pending state requests could be
>> complementary to caching across bundles.
>>
>> Question: Couldn't we defer flushing back state from the SDK to the
>> Runner indefinitely, provided that we add a way to flush the state in
>> case of a checkpoint?
>>
>
Flushing is needed to prevent the SDK from running out of memory. With a
fixed budget for state inside the SDK, flushing would happen under certain
state-usage scenarios.
I could also see that flushing only at checkpoints may lead to slow
checkpoint performance, so we may want to flush state that hasn't been used
in a while as well.
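
A rough sketch of that flushing policy (all names illustrative, not the
Fn API; appends are modeled as whole-value writes for brevity): keep
dirty state in an access-ordered map, spill the least-recently-used
entries once a budget is exceeded, and flush everything at a checkpoint.

import java.util.LinkedHashMap;
import java.util.Map;

public class StateWriteBackCache {
  private final int maxEntries;
  private final LinkedHashMap<String, byte[]> dirty;

  public StateWriteBackCache(int maxEntries) {
    this.maxEntries = maxEntries;
    // Access-ordered map: iteration starts at the least recently used key.
    this.dirty = new LinkedHashMap<>(16, 0.75f, true);
  }

  public void append(String stateKey, byte[] value) {
    dirty.put(stateKey, value);
    while (dirty.size() > maxEntries) {
      Map.Entry<String, byte[]> lru = dirty.entrySet().iterator().next();
      flushToRunner(lru.getKey(), lru.getValue());
      dirty.remove(lru.getKey());
    }
  }

  // Called at a checkpoint barrier so the runner sees a consistent snapshot.
  public void flushAll() {
    dirty.forEach(this::flushToRunner);
    dirty.clear();
  }

  private void flushToRunner(String stateKey, byte[] value) {
    // Stand-in for the actual state append request over the Fn API.
    System.out.println("flush " + stateKey + " (" + value.length + " bytes)");
  }
}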


>> Another performance improvement would be caching read requests because
>> these first go to the Runner regardless of already cached appends.
>>
>> -Max
>>
>> On 09.08.19 17:12, Lukasz Cwik wrote:
>> >
>> >
>> > On Fri, Aug 9, 2019 at 2:32 AM Robert Bradshaw > > > wrote:
>> >
>> > The question is whether the SDK needs to wait for the StateResponse
>> to
>> > come back before declaring the bundle done. The proposal was to not
>> > send the cache token back as part of an append StateResponse [1],
>> but
>> > pre-provide it as part of the bundle request.
>> >
>> >
>> > Agree, the purpose of the I'm Blocked message is to occur during bundle
>> > processing.
>> >
>> >
>> > Thinking about this some more, if we assume the state response was
>> > successfully applied, there's no reason for the SDK to block the
>> > bundle until it has its hands on the cache token--we can update the
>> > cache once the StateResponse comes back whether or not the bundle is
>> > still active. On the other hand, the runner needs a way to assert it
>> > has received and processed all StateRequests from the SDK associated
>> > with a bundle before it can declare the bundle complete (regardless
>> of
>> > the cache tokens), so this might not be safe without some extra
>> > coordination (e.g. the ProcessBundleResponse indicating the number
>> of
>> > state requests associated with a bundle).
>> >
>> >
>> > Since the state request stream is ordered, we can add the id of the last
>> > state request as part of the ProcessBundleResponse.
>> >
>> >
>> > [1]
>> >
>> https://github.com/apache/beam/blob/release-2.14.0/model/fn-execution/src/main/proto/beam_fn_api.proto#L627
>> >
>> > On Thu, Aug 8, 2019 at 6:57 PM Lukasz Cwik > > > wrote:
>> > >
>> > > The purpose of the new state API call in BEAM-7000 is to tell the
>> > runner that the SDK is now blocked waiting for the result of a
>> > specific state request and it should be used for fetches (not
>> > updates) and is there to allow for SDKs to differentiate readLater
>> > (I will need this data at some point in time in the future) from
>> > read (I need this data now). This comes up commonly where the user
>> > prefetches multiple state cells and then looks at their content
>> > allowing the runner to batch up those calls on its end.
>> > >
>> > > The way it can be used for clear+append is that the runner can
>> > store requests in memory up until some time/memory limit or until it
>> > gets its first "blocked" call and then issue all the requests
>> together.
>> > >
>> > >
>> > > On Thu, Aug 8, 2019 at 9:42 AM Robert Bradshaw
>> > mailto:rober...@google.com>> wrote:
>> > >>
>> > >> On Tue, Aug 6, 2019 at 12:07 AM Thomas Weise > > > wrote:
>> > >> >
>> > >> > That would add a 

Re: Write-through-cache in State logic

2019-08-12 Thread Thomas Weise
On Mon, Aug 12, 2019 at 8:53 AM Maximilian Michels  wrote:

> Thanks for starting this discussion Rakesh. An efficient cache layer is
> one of the missing pieces for good performance in stateful pipelines.
> The good news are that there is a level of caching already present in
> Python which batches append requests until the bundle is finished.
>
> Thomas, in your example indeed we would have to profile to see why CPU
> utilization is high on the Flink side but not in the Python SDK harness.
> For example, older versions of Flink (<=1.5) have a high cost of
> deleting existing instances of a timer when setting a timer.
> Nevertheless, cross-bundle caching would likely result in increased
> performance.
>

CPU on the Flink side was unchanged, and that's important. The throughput
improvement comes from the extended bundle caching on the SDK side. That's
what tells me that cross-bundle caching is needed. Of course, it will also
require a good solution for the writes, and I like your idea of using the
checkpoint boundary for that, especially since it already aligns with the
bundle boundary and is under runner control. Of course, we also want to be
careful not to cause overly bursty writes.

Profiling will be useful for the timer processing, that is also on my list
of suspects.


> Luke, I think the idea to merge pending state requests could be
> complementary to caching across bundles.
>
> Question: Couldn't we defer flushing back state from the SDK to the
> Runner indefinitely, provided that we add a way to flush the state in
> case of a checkpoint?
>
> Another performance improvement would be caching read requests because
> these first go to the Runner regardless of already cached appends.
>
> -Max
>
> On 09.08.19 17:12, Lukasz Cwik wrote:
> >
> >
> > On Fri, Aug 9, 2019 at 2:32 AM Robert Bradshaw  > > wrote:
> >
> > The question is whether the SDK needs to wait for the StateResponse
> to
> > come back before declaring the bundle done. The proposal was to not
> > send the cache token back as part of an append StateResponse [1], but
> > pre-provide it as part of the bundle request.
> >
> >
> > Agree, the purpose of the I'm Blocked message is to occur during bundle
> > processing.
> >
> >
> > Thinking about this some more, if we assume the state response was
> > successfully applied, there's no reason for the SDK to block the
> > bundle until it has its hands on the cache token--we can update the
> > cache once the StateResponse comes back whether or not the bundle is
> > still active. On the other hand, the runner needs a way to assert it
> > has received and processed all StateRequests from the SDK associated
> > with a bundle before it can declare the bundle complete (regardless
> of
> > the cache tokens), so this might not be safe without some extra
> > coordination (e.g. the ProcessBundleResponse indicating the number of
> > state requests associated with a bundle).
> >
> >
> > Since the state request stream is ordered, we can add the id of the last
> > state request as part of the ProcessBundleResponse.
> >
> >
> > [1]
> >
> https://github.com/apache/beam/blob/release-2.14.0/model/fn-execution/src/main/proto/beam_fn_api.proto#L627
> >
> > On Thu, Aug 8, 2019 at 6:57 PM Lukasz Cwik  > > wrote:
> > >
> > > The purpose of the new state API call in BEAM-7000 is to tell the
> > runner that the SDK is now blocked waiting for the result of a
> > specific state request and it should be used for fetches (not
> > updates) and is there to allow for SDKs to differentiate readLater
> > (I will need this data at some point in time in the future) from
> > read (I need this data now). This comes up commonly where the user
> > prefetches multiple state cells and then looks at their content
> > allowing the runner to batch up those calls on its end.
> > >
> > > The way it can be used for clear+append is that the runner can
> > store requests in memory up until some time/memory limit or until it
> > gets its first "blocked" call and then issue all the requests
> together.
> > >
> > >
> > > On Thu, Aug 8, 2019 at 9:42 AM Robert Bradshaw
> > mailto:rober...@google.com>> wrote:
> > >>
> > >> On Tue, Aug 6, 2019 at 12:07 AM Thomas Weise  > > wrote:
> > >> >
> > >> > That would add a synchronization point that forces extra
> > latency especially in streaming mode.
> > >> >
> > >> > Wouldn't it be possible for the runner to assign the token when
> > starting the bundle and for the SDK to pass it along the state
> > requests? That way, there would be no need to batch and wait for a
> > flush.
> > >>
> > >> I think it makes sense to let the runner pre-assign these state
> > update
> > >> tokens rather than forcing a synchronization point.
> > >>
> > >> 

Re: [DISCUSS] Backwards compatibility of @Experimental features

2019-08-12 Thread Anton Kedin
Concrete user feedback:
https://stackoverflow.com/questions/57453473/was-the-beamrecord-type-removed-from-apache-beam/57463708#57463708
Short version: we moved BeamRecord from Beam SQL to core Beam and renamed
it to Row (still @Experimental, BTW). But we never mentioned it anywhere
users could easily find. Highlighting deprecations and major shifts of
public APIs in the release blog post (and in Javadoc) would at the very
least help make such changes traceable.
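
For instance, a hedged sketch of what a traceable rename could look like
in code (the stub below is hypothetical; BeamRecord was in fact removed
outright, which is exactly the problem described above):

import org.apache.beam.sdk.annotations.Experimental;

/**
 * Hypothetical stub kept for a release or two so the rename is
 * discoverable via deprecation warnings and Javadoc.
 *
 * @deprecated BeamRecord moved to the core SDK and was renamed; use
 *     {@link org.apache.beam.sdk.values.Row} instead.
 */
@Deprecated
@Experimental
public final class BeamRecord {
  private BeamRecord() {} // exists only to carry the pointer to Row
}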

Regards,
Anton

On Wed, May 8, 2019 at 1:42 PM Kenneth Knowles  wrote:

>
>
> On Wed, May 8, 2019 at 9:29 AM Ahmet Altay  wrote:
>
>>
>>
>> *From: *Kenneth Knowles 
>> *Date: *Wed, May 8, 2019 at 9:24 AM
>> *To: *dev
>>
>>
>>>
>>> On Fri, Apr 19, 2019 at 3:09 AM Ismaël Mejía  wrote:
>>>
 It seems we mostly agree that @Experimental is important, and that API
 changes (removals) on experimental features should happen quickly but still
 give some time to users so the Experimental purpose is not lost.

 Ahmet proposal given our current release calendar is close to 2
 releases. Can we settle this on 2 releases as a 'minimum time' before
 removal? (This will let maintainers the option to choose to support it more
 time if they want as discussed in the related KafkaIO thread but still be
 friendly with users).

 Do we agree?

>>>
>>> This sounds pretty good to me.
>>>
>>
>> Sounds good to me too.
>>
>>
>>> How can we manage this? Right now we tie most activities (like
>>> re-triaging flakes) to the release process, since it is the only thing that
>>> happens regularly for the community. If we don't have some forcing then I
>>> expect the whole thing will just be forgotten.
>>>
>>
>> Can we pre-create a list of future releases in JIRA, and for each
>> experimental feature require that a JIRA issue is created for resolving the
>> experimental status and tag it with the release that will happen after the
>> minimum time period?
>>
>
> Great idea. I just created the 2.15.0 release so it reaches far enough
> ahead for right now.
>
> Kenn
>
>
>>
>>> Kenn
>>>
>>>

 Note: for the other subjects (e.g. when an Experimental feature should
 become not experimental) I think we will hardly find an agreement so I
 think this should be treated in a per case basis by the maintainers, but if
 you want to follow up on that discussion we can open another thread for
 this.



 On Sat, Apr 6, 2019 at 1:04 AM Ahmet Altay  wrote:

> I agree that Experimental feature is still very useful. I was trying
> to argue that we diluted its value so +1 to reclaim that.
>
> Back to the original question, in my opinion removing existing
> "experimental and deprecated" features in n=1 release will confuse users.
> This will likely be a surprise to them because we have been maintaining
> this state release after release now. I would propose in the next release
> warning users of such a change happening and give them at least 3 months 
> to
> upgrade to suggested newer paths. In the future we can have a shorter
> timelines assuming that we will set the user expectations right.
>
> On Fri, Apr 5, 2019 at 3:01 PM Ismaël Mejía  wrote:
>
>> I agree 100% with Kenneth on the multiple advantages that the
>> Experimental feature gave us. I also can count multiple places where this
>> has been essential in other modules than core. I disagree on the fact 
>> that
>> the @Experimental annotation has lost sense, it is simply ill defined, 
>> and
>> probably it is by design because its advantages come from it.
>>
>> Most of the topics in this thread are a consequence of the this loose
>> definition, e.g. (1) not defining how a feature becomes stable, and (2)
>> what to do when we want to remove an experimental feature, are ideas that
>> we need to decide if we define just continue to handle as we do today.
>>
>> Defining a target for graduating an Experimental feature is a bit too
>> aggressive with not much benefit, in this case we could be losing the
>> advantages of Experimental (save if we could change the proposed version 
>> in
>> the future). This probably makes sense for the removal of features but
>> makes less sense to decide when some feature becomes stable. Of course in
>> the case of the core SDKs packages this is probably more critical but
>> nothing guarantees that things will be ready when we expect too. When 
>> will
>> we tag for stability things like SDF or portability APIs?. We cannot
>> predict the future for completion of features.
>>
>> Nobody has mentioned the LTS releases couldn’t be these like the
>> middle points for these decisions? That at least will give LTS some value
>> because so far I still have issues to understand the value of this idea
>> given that we can do a minor release of any pre-released version.

Re: Write-through-cache in State logic

2019-08-12 Thread Maximilian Michels
Thanks for starting this discussion Rakesh. An efficient cache layer is
one of the missing pieces for good performance in stateful pipelines.
The good news is that there is a level of caching already present in
Python, which batches append requests until the bundle is finished.

Thomas, in your example indeed we would have to profile to see why CPU
utilization is high on the Flink side but not in the Python SDK harness.
For example, older versions of Flink (<=1.5) have a high cost of
deleting existing instances of a timer when setting a timer.
Nevertheless, cross-bundle caching would likely result in increased
performance.

Luke, I think the idea to merge pending state requests could be
complementary to caching across bundles.

Question: Couldn't we defer flushing back state from the SDK to the
Runner indefinitely, provided that we add a way to flush the state in
case of a checkpoint?

Another performance improvement would be caching read requests because
these first go to the Runner regardless of already cached appends.
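
A small sketch of what such read caching could look like (illustrative
names, not the Fn API): cache reads keyed by the runner-issued cache
token, so entries stay valid across bundles until the runner hands out a
new token.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class StateReadCache {
  private String currentToken = "";
  private final Map<String, byte[]> cache = new HashMap<>();

  public byte[] read(
      String cacheToken, String stateKey, Function<String, byte[]> fetchFromRunner) {
    if (!cacheToken.equals(currentToken)) {
      // A new token means state ownership may have changed; drop old entries.
      cache.clear();
      currentToken = cacheToken;
    }
    // Serve from cache, falling back to a runner round trip on a miss.
    return cache.computeIfAbsent(stateKey, fetchFromRunner);
  }
}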

-Max

On 09.08.19 17:12, Lukasz Cwik wrote:
> 
> 
> On Fri, Aug 9, 2019 at 2:32 AM Robert Bradshaw  > wrote:
> 
> The question is whether the SDK needs to wait for the StateResponse to
> come back before declaring the bundle done. The proposal was to not
> send the cache token back as part of an append StateResponse [1], but
> pre-provide it as part of the bundle request.
> 
> 
> Agree, the purpose of the I'm Blocked message is to occur during bundle
> processing. 
>  
> 
> Thinking about this some more, if we assume the state response was
> successfully applied, there's no reason for the SDK to block the
> bundle until it has its hands on the cache token--we can update the
> cache once the StateResponse comes back whether or not the bundle is
> still active. On the other hand, the runner needs a way to assert it
> has received and processed all StateRequests from the SDK associated
> with a bundle before it can declare the bundle complete (regardless of
> the cache tokens), so this might not be safe without some extra
> coordination (e.g. the ProcessBundleResponse indicating the number of
> state requests associated with a bundle).
> 
>  
> Since the state request stream is ordered, we can add the id of the last
> state request as part of the ProcessBundleResponse.
>  
> 
> [1]
> 
> https://github.com/apache/beam/blob/release-2.14.0/model/fn-execution/src/main/proto/beam_fn_api.proto#L627
> 
> On Thu, Aug 8, 2019 at 6:57 PM Lukasz Cwik  > wrote:
> >
> > The purpose of the new state API call in BEAM-7000 is to tell the
> runner that the SDK is now blocked waiting for the result of a
> specific state request and it should be used for fetches (not
> updates) and is there to allow for SDKs to differentiate readLater
> (I will need this data at some point in time in the future) from
> read (I need this data now). This comes up commonly where the user
> prefetches multiple state cells and then looks at their content
> allowing the runner to batch up those calls on its end.
> >
> > The way it can be used for clear+append is that the runner can
> store requests in memory up until some time/memory limit or until it
> gets its first "blocked" call and then issue all the requests together.
> >
> >
> > On Thu, Aug 8, 2019 at 9:42 AM Robert Bradshaw
> mailto:rober...@google.com>> wrote:
> >>
> >> On Tue, Aug 6, 2019 at 12:07 AM Thomas Weise  > wrote:
> >> >
> >> > That would add a synchronization point that forces extra
> latency especially in streaming mode.
> >> >
> >> > Wouldn't it be possible for the runner to assign the token when
> starting the bundle and for the SDK to pass it along the state
> requests? That way, there would be no need to batch and wait for a
> flush.
> >>
> >> I think it makes sense to let the runner pre-assign these state
> update
> >> tokens rather than forcing a synchronization point.
> >>
> >> Here's some pointers for the Python implementation:
> >>
> >> Currently, when a DoFn needs UserState, a StateContext object is used
> >> that converts from a StateSpec to the actual value. When running
> >> portably, this is FnApiUserStateContext [1]. The state handles
> >> themselves are cached at [2] but this context only lives for the
> >> lifetime of a single bundle. Logic could be added here to use the
> >> token to share these across bundles.
> >>
> >> Each of these handles in turn invokes state_handler.get* methods when
> >> its read is called. (Here state_handler is a thin wrapper around the
> >> service itself) and constructs the appropriate result from the
> >> StateResponse. We would need to implement caching at this level as
> >> well, 

Re: Hello :)

2019-08-12 Thread Ahmet Altay
Welcome. What is your JIRA user name?

On Mon, Aug 12, 2019 at 8:27 AM Johan Hansson 
wrote:

> Hi everyone!
>
> Sorry sent the first email from the wrong email address.
>
> My name is Johan and I work for a small startup(fintech) in Stockholm. I
> would like to make some contributions with more examples for the Python
> API. Could some one please add me to Jira so I can create and assign my
> self to user stories.
>
> Thank you!
>
> Best regards
>


Hello :)

2019-08-12 Thread Johan Hansson
Hi everyone!

Sorry, I sent the first email from the wrong email address.

My name is Johan and I work for a small startup (fintech) in Stockholm. I
would like to make some contributions with more examples for the Python
API. Could someone please add me to Jira so I can create and assign myself
to user stories.

Thank you!

Best regards


Hello :)

2019-08-12 Thread Johan Hansson
Hi everyone!

My name is Johan and I work for a small startup (fintech) in Stockholm. I
would like to make some contributions with more examples for the Python
API. Could someone please add me to Jira so I can create and assign myself
to user stories.

Thank you!

Best regards
Johan


Re: Can not test Timer with processing time domain

2019-08-12 Thread Jan Lukavský

Hi,

If I understand it correctly, the issue there was that 
OnTimerContext.timestamp() is currently used for two things:


 a) it tells the caller what the current timestamp actually is

 b) it is used as the event timestamp when an element is output from the 
OnTimer method


That implies that it cannot easily be changed to processing time, 
because elements output from such a method would then carry the wrong 
timestamp. Maybe it would be better to add 
OnTimerContext.processingTimestamp() instead?
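
A sketch of that shape (reduced signature; processingTimestamp() is the
proposed addition, not an existing Beam method):

import org.joda.time.Instant;

public abstract class OnTimerContextSketch {
  // Existing semantics: event-time timestamp, also used for elements
  // output from the @OnTimer method.
  public abstract Instant timestamp();

  // Proposed addition: for timers in TimeDomain.PROCESSING_TIME, the
  // processing-time instant at which the timer actually fired.
  public abstract Instant processingTimestamp();
}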


Jan

On 8/12/19 4:54 PM, marek-simu...@seznam.cz wrote:

Hi,
  I bumped into the same issue as in BEAM-5265 [1], where I can't test a 
Timer with TimeDomain.PROCESSING_TIME.
Jozef tried to fix it, but the PR [2] got forgotten. Are there any 
concerns about how it would behave on different runners, or was it just 
waiting for reviewer feedback?


Just in case, here is my minimal example to reproduce [3]:
- set the element timestamp to 1970-01-01T00:00:01.000Z
- set timer.offset(TEN_SECONDS).setRelative();
- advanceProcessingTime for 20 seconds
- the onTimer method is called, where:
  OnTimerContext context.timestamp() = -290308-12-21T19:59:05.225Z
I would expect context.timestamp() to be around 
1970-01-01T00:00:11.000Z

[1] https://issues.apache.org/jira/browse/BEAM-5265
[2] https://github.com/apache/beam/pull/6305
[3] 
https://github.com/seznam/beam/commit/79de4a72e35274e4d92c726eb196d662080e5020


Re: Sort Merge Bucket - Action Items

2019-08-12 Thread Claire McGinty
Hi! Wanted to bump this thread with some updated PRs that reflect these
discussions (updated IOs that parameterize FileIO#Sink, and re-use
ReadableFile). The base pull requests are:

- https://github.com/apache/beam/pull/8823 (BucketMetadata implementation)
- https://github.com/apache/beam/pull/8824/ (FileOperations/IO
implementations)

And the actual PTransform implementations build on top of those PRs:
- https://github.com/apache/beam/pull/9250/ (SMB Sink transform)
- https://github.com/apache/beam/pull/9251 (SMB Source transform)

Finally we have some benchmarks/style changes (using AutoValue/Builder
pattern) for those PTransforms:
- https://github.com/apache/beam/pull/9253/ (high-level API classes/style
fixes)
- https://github.com/apache/beam/pull/9279 (benchmarks for SMB sink and
source transform)

I know it's a lot of pull requests at once -- let us know if there's
anything else we can clarify or streamline. Thanks!

- Claire/Neville



On Fri, Jul 26, 2019 at 12:45 PM Kenneth Knowles  wrote:

> There is still considerable value in knowing data sources statically so
> you can do things like fetch sizes and other metadata and adjust pipeline
> shape. I would not expect to delete these, but to implement them on top of
> SDF while still giving them a clear URN and payload so runners can know
> that it is a statically-specified source.
>
> Kenn
>
> On Fri, Jul 26, 2019 at 3:23 AM Robert Bradshaw 
> wrote:
>
>> On Thu, Jul 25, 2019 at 11:09 PM Eugene Kirpichov 
>> wrote:
>> >
>> > Hi Gleb,
>> >
>> > Regarding the future of io.Read: ideally things would go as follows
>> > - All runners support SDF at feature parity with Read (mostly this is
>> just the Dataflow runner's liquid sharding and size estimation for bounded
>> sources, and backlog for unbounded sources, but I recall that a couple of
>> other runners also used size estimation)
>> > - Bounded/UnboundedSource APIs are declared "deprecated" - it is
>> forbidden to add any new implementations to SDK, and users shouldn't use
>> them either (note: I believe it's already effectively forbidden to use them
>> for cases where a DoFn/SDF at the current level of support will be
>> sufficient)
>> > - People one by one rewrite existing Bounded/UnboundedSource based
>> PTransforms in the SDK to use SDFs instead
>> > - Read.from() is rewritten to use a wrapper SDF over the given Source,
>> and explicit support for Read is deleted from runners
>> > - In the next major version of Beam - presumably 3.0 - the Read
>> transform itself is deleted
>> >
>> > I don't know what's the current status of SDF/Read feature parity,
>> maybe Luke or Cham can comment. An alternative path is offered in
>> http://s.apache.org/sdf-via-source.
>>
>> Python supports initial splitting for SDF of all sources on portable
>> runners. Dataflow support for batch SDF is undergoing testing, not yet
>> rolled out. Dataflow support for streaming SDF is awaiting portable
>> state/timer support.
>>
>> > On Thu, Jul 25, 2019 at 6:39 AM Gleb Kanterov  wrote:
>> >>
>> >> What is the long-term plan for org.apache.beam.sdk.io.Read? Is it
>> going away in favor of SDF, or are we always going to have both?
>> >>
>> >> I was looking into AvroIO.read and AvroIO.readAll, both of them use
>> AvroSource. AvroIO.readAll is using SDF, and it's implemented with
>> ReadAllViaFileBasedSource that takes AvroSource as a parameter. Looking at
>> ReadAllViaFileBasedSource I find it not necessary to use Source, it
>> should be enough to have something like (KV<ReadableFile, OffsetRange>,
>> OutputReceiver<T>), as we have discussed in this thread, and that should be
>> fine for SMB as well. It would require duplicating code from AvroSource,
>> but in the end, I don't see it as a problem if AvroSource is going away.
>> >>
>> >> I'm attaching a small diagram I put together for myself to better
>> understand the code.
>> >>
>> >> AvroIO.readAll :: PTransform<PCollection<String>, PCollection<T>> ->
>> >>
>> >> FileIO.matchAll :: PTransform<PCollection<String>,
>> PCollection<Metadata>>
>> >> FileIO.readMatches :: PTransform<PCollection<Metadata>,
>> PCollection<ReadableFile>>
>> >> AvroIO.readFiles :: PTransform<PCollection<ReadableFile>,
>> PCollection<T>> ->
>> >>
>> >> ReadAllViaFileBasedSource :: PTransform<PCollection<ReadableFile>,
>> PCollection<T>> ->
>> >>
>> >> ParDo.of(SplitIntoRangesFn :: DoFn<ReadableFile, KV<ReadableFile,
>> OffsetRange>>) (splittable do fn)
>> >>
>> >> Reshuffle.viaRandomKey()
>> >>
>> >> ParDo.of(ReadFileRangesFn(createSource) :: DoFn<KV<ReadableFile,
>> OffsetRange>, T>) where
>> >>
>> >> createSource :: String -> FileBasedSource<T>
>> >>
>> >> createSource = AvroSource
>> >>
>> >>
>> >> AvroIO.read without getHintMatchedManyFiles() :: PTransform<PBegin,
>> PCollection<T>> ->
>> >>
>> >> Read.Bounded.from(createSource) where
>> >>
>> >> createSource :: String -> FileBasedSource<T>
>> >>
>> >> createSource = AvroSource
>> >>
>> >>
>> >> Gleb
>> >>
>> >>
>> >> On Thu, Jul 25, 2019 at 2:41 PM Robert Bradshaw 
>> wrote:
>> >>>
>> >>> On Thu, Jul 25, 2019 at 12:35 AM Kenneth Knowles 
>> wrote:
>> >>> >
>> >>> > From the peanut gallery, keeping a separate implementation for SMB
>> seems fine. Dependencies are serious liabilities for both upstream and
>> downstream. 

[FLINK-12653] and system state

2019-08-12 Thread Jan Lukavský

Hi,

I have come across an issue that is very likely caused by [1]. The
issue is that Beam's state is (generally) created lazily, after an
element is received (as Max described in the Flink JIRA). Max also
created a workaround [2], but that seems to work for user state only
(i.e. state that has been created in user code by declaring @StateId -
please correct me if I'm wrong). In my work, however, I created a
system state (which holds elements before they are output, due to the
@RequiresTimeSortedInput annotation, but that's probably not
important), and this state is not covered by the workaround. This might
be the case for all system states in general, because they are not
visible until an element arrives and the state is actually created.
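
For illustration, this is the kind of user state the workaround does
cover, because the state is declared statically on the DoFn via
@StateId (a minimal sketch, names are illustrative):

    import org.apache.beam.sdk.coders.VarIntCoder;
    import org.apache.beam.sdk.state.BagState;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    static class BufferingFn extends DoFn<KV<String, Integer>, Integer> {
      // Declared up front, so the declaration is visible to the runner
      // (and to the workaround in [2]) before any element arrives.
      @StateId("buffer")
      private final StateSpec<BagState<Integer>> bufferSpec =
          StateSpecs.bag(VarIntCoder.of());

      @ProcessElement
      public void process(
          @Element KV<String, Integer> element,
          @StateId("buffer") BagState<Integer> buffer) {
        buffer.add(element.getValue());
      }
    }

A system state like the sorting buffer behind @RequiresTimeSortedInput
has no such static declaration, so nothing announces it before the
first element creates it.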


Is my analysis correct? If so, would anyone have any suggestions how to 
fix that?


Jan

[1] https://jira.apache.org/jira/browse/FLINK-12653

[2] https://issues.apache.org/jira/browse/BEAM-7144



StateNamespaces.global() != StateNamespaces.window(GlobalWindowCoder, GlobalWindow)

2019-08-12 Thread Jan Lukavský

Hi,

I noticed that StateNamespaces.global() generates a different stringKey
than StateNamespaces.window(GlobalWindowCoder, GlobalWindow). In the
first case, the stringKey will be simply '/', in the other it will be
'//'. That has further implications, e.g.
StateNamespaces.fromString(StateNamespaces.window(GlobalWindowCoder,
GlobalWindow).stringKey()) != StateNamespaces.global().
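
A minimal snippet demonstrating the asymmetry (return values as
described above):

    import org.apache.beam.runners.core.StateNamespace;
    import org.apache.beam.runners.core.StateNamespaces;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindow;

    StateNamespace global = StateNamespaces.global();
    StateNamespace windowed =
        StateNamespaces.window(GlobalWindow.Coder.INSTANCE, GlobalWindow.INSTANCE);

    global.stringKey();    // "/"
    windowed.stringKey();  // "//" - the empty encoding of GlobalWindow
                           // between two slashes

    // Round-tripping the windowed key does not give back the global
    // namespace:
    StateNamespaces.fromString(windowed.stringKey(), GlobalWindow.Coder.INSTANCE)
        .equals(global);   // false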


It looks like a bug, but maybe I'm wrong - is this intentional?

Jan



Beam Dependency Check Report (2019-08-12)

2019-08-12 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:

  Dependency Name     | Current Version | Latest Version | Current Version Release Date | Latest Version Release Date | JIRA Issue
  --------------------+-----------------+----------------+------------------------------+-----------------------------+-----------
  google-cloud-pubsub | 0.39.1          | 0.45.0         | 2019-01-21                   | 2019-08-05                  | BEAM-5539
  mock                | 2.0.0           | 3.0.5          | 2019-05-20                   | 2019-05-20                  | BEAM-7369
  oauth2client        | 3.0.0           | 4.1.3          | 2018-12-10                   | 2018-12-10                  | BEAM-6089
  Sphinx              | 1.8.5           | 2.1.2          | 2019-05-20                   | 2019-06-24                  | BEAM-7370
High Priority Dependency Updates Of Beam Java SDK:

  Dependency Name                                                            | Current Version  | Latest Version | Current Version Release Date | Latest Version Release Date | JIRA Issue
  ---------------------------------------------------------------------------+------------------+----------------+------------------------------+-----------------------------+-----------
  com.github.ben-manes.versions:com.github.ben-manes.versions.gradle.plugin  | 0.17.0           | 0.22.0         | 2019-02-11                   | 2019-08-12                  | BEAM-6645
  com.gradle.build-scan:com.gradle.build-scan.gradle.plugin                  | 2.1              | 2.4            | 2019-02-11                   | 2019-08-12                  | BEAM-6647
  org.conscrypt:conscrypt-openjdk                                            | 1.1.3            | 2.2.1          | 2018-06-04                   | 2019-08-08                  | BEAM-5748
  javax.servlet:javax.servlet-api                                            | 3.1.0            | 4.0.1          | 2013-04-25                   | 2018-04-20                  | BEAM-5750
  org.eclipse.jetty:jetty-server                                             | 9.2.10.v20150310 | 10.0.0-alpha0  | 2015-03-10                   | 2019-07-11                  | BEAM-5752
  org.eclipse.jetty:jetty-servlet                                            | 9.2.10.v20150310 | 10.0.0-alpha0  | 2015-03-10                   | 2019-07-11                  | BEAM-5753
  junit:junit                                                                | 4.13-beta-1      | 4.13-beta-3    | 2018-11-25                   | 2019-05-05                  | BEAM-6127
  com.github.spotbugs:spotbugs                                               | 3.1.10           | 4.0.0-beta3    | 2018-12-18                   | 2019-06-23                  | BEAM-7792
  com.github.spotbugs:spotbugs-annotations                                   | 3.1.11           | 4.0.0-beta3    | 2019-01-21                   | 2019-06-23                  | BEAM-6951

 A dependency update is high priority if it satisfies one of the following criteria:

 - It has a major version update available, e.g. org.assertj:assertj-core 2.5.0 -> 3.10.0;

 - It is over 3 minor versions behind the latest version, e.g. org.tukaani:xz 1.5 -> 1.8;

 - The current version is behind the latest version by over 180 days, e.g. com.google.auto.service:auto-service 2014-10-24 -> 2017-12-11.

 In Beam, we make a best-effort attempt at keeping all dependencies up-to-date.
 In the future, issues will be filed and tracked for these automatically,
 but in the meantime you can search for existing issues or open a new one.

 For more information: Beam Dependency Guide

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-08-12 Thread jincheng sun
Hi all,

Thanks for your confirmation, Robert! :)

Thanks for sharing the details about the state discussion, Luke! :)

MapState is a bit complex; I think it's better to add a detailed
design doc when we work on MapState support.

I will create JIRAs and follow up on subsequent developments. If there
are big changes, I will provide detailed design documentation and bring
up the discussion on the ML.

Thanks everyone for joining this discussion.

Best,
Jincheng

Lukasz Cwik  于2019年8月7日周三 下午8:19写道:

> I wanted to add some more details about the state discussion.
>
> BEAM-7000 is about adding support for a gRPC message saying that the SDK
> is now blocked on one of its requests. This would allow for an easy
> optimization on the runner side where it gathers requests and is able to
> batch them knowing that the SDK is only blocked once it sees one of the
> blocked gRPC messages. This would make it easy for the runner to gather up
> clear + append calls and convert them to sets internally.
>
> Also, most of the reason map state doesn't exist yet is that we
> haven't discussed the changes to the gRPC APIs that we need (things like:
> can you lookup/clear/append to ranges? map or multimap? should we really
> just get rid of bag state in favor of a multimap state? can you enumerate
> keys? know how many keys there are? ...)
>
> On Wed, Aug 7, 2019 at 9:52 AM Robert Bradshaw 
> wrote:
>
>> The list looks good to me. Thanks for summarizing. Feel free to dive
>> into any of these issues yourself :).
>>
>> On Fri, Aug 2, 2019 at 6:24 PM jincheng sun 
>> wrote:
>> >
>> > Hi all,
>> >
>> >
>> > Thanks a lot for sharing your thoughts!
>> >
>> >
>> > It seems that we have already reached consensus on the following
>> items. Could you please read through them again and double-check that you
>> all agree with these? If yes, then I will start creating JIRA issues for
>> those that don’t yet have one.
>> >
>> >
>> > 1. Items that require improvements of Beam:
>> >
>> >
>> > 1) The configuration of "semi_persist_dir" should be configurable.
>> (TODO)
>> >
>> >
>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go#L48
>> >
>> >
>> > 2) Time-based cache threshold should be supported. (TODO)
>> >
>> >
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/data_plane.py#L259
>> >
>> >
>> https://github.com/apache/beam/blob/master/sdks/java/fn-execution/src/main/java/org/apache/beam/sdk/fn/data/BeamFnDataBufferingOutboundObserver.java
>> >
>> >
>> > 3) Cross-bundle cache should be supported. (
>> https://issues.apache.org/jira/browse/BEAM-5428)
>> >
>> >
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/bundle_processor.py#L349
>> >
>> >
>> > 4) Allow configuring the log level. (TODO)
>> >
>> > https://issues.apache.org/jira/browse/BEAM-5468
>> >
>> >
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L102
>> >
>> >
>> > 5) Improve the interfaces of classes such as FnDataService,
>> BundleProcessor, ActiveBundle, etc. to change the parameter type from
>> WindowedValue<T> to T. (TODO)
>> >
>> >
>> > 6) Python 3 is already supported in Beam. The warning should be
>> removed. (TODO)
>> >
>> > https://github.com/apache/beam/blob/master/sdks/python/setup.py#L179
>> >
>> >
>> > 7) The coder of WindowedValue should be configurable, which makes it
>> possible to use a customized coder such as ValueOnlyWindowedValueCoder.
>> (TODO)
>> >
>> >
>> https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/wire/WireCoders.java#L91
>> >
>> >
>> > 8) The schema work can be used to solve the performance issue of the
>> extra length prefix in the encoding. However, it should also be supported in
>> Python. (https://github.com/apache/beam/pull/9188)
>> >
>> >
>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto
>> >
>> >
>> > 9) MapState should be supported in the gRPC protocol. (TODO)
>> >
>> >
>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L662
>> >
>> >
>> >
>> >
>> > 2. Items where we don’t need to do anything for now:
>> >
>> >
>> > 1) The default buffer size is enough for most cases and there is no
>> need to make it configurable for now.
>> >
>> > 2) Do not support ValueState in the gRPC protocol for now unless we
>> have evidence it matters.
>> >
>> >
>> >
>> > If there is any incorrect understanding, please feel free to correct
>> me :)
>> >
>> >
>> > 
>> >
>> >
>> > There are also some items that I didn’t bring up earlier which require
>> further discussion:
>> >
>> > 1) The input queue of the input buffer in the Python SDK harness is
>> not size-limited. We should give it a reasonable default size.
>> >
>> >
>>