Re: Snowflake account donation for ongoing testing

2020-06-03 Thread Tyler Akidau
Thanks, Pablo! +Harsha Kapre , who appears to
have been accidentally dropped from the original thread.

-Tyler

On Mon, May 18, 2020, 15:21 Pablo Estrada  wrote:

> Hello Harsha!
> Thanks for contributing the connector, and the support for testing it.
> This is great.
> I am a PMC member, and would be happy to help figure this out - if that's
> what you need.
> Let me know!
> Best
> -P.
>
> On Mon, May 18, 2020 at 2:13 PM Harsha Kapre 
> wrote:
>
>> Hello Dev Community,
>>
>> We have been working with a partner to develop and publish a Beam
>> connector for Snowflake. As we get through the final reviews, we wanted to
>> offer a Snowflake account to use for ongoing testing against the connector.
>>
>> I was hoping to identify someone to work with on this as we may need a
>> signature for our standard partner agreement in order to provide the
>> account with a monthly free usage tier.
>>
>> I look forward to your response.
>>
>> -Harsha Kapre
>> Senior Product Manager, Snowflake Ecosystem
>>
>


Re: Snowflake connector

2020-03-10 Thread Tyler Akidau
On Tue, Mar 10, 2020 at 1:27 AM Elias Djurfeldt 
wrote:

> From what I can tell, the only difference is that the Python connector is
> a pure Python implementation and doesn't rely on ODBC or JDBC (it's just a
> pip installable). Whereas the Java version needs JDBC. But that seems to be
> the only difference.
>

Correct me if I'm wrong, but this sounds like a concern around having to
install Java dependencies for the cross-language transform. If so, I think
the question is: how frictionless can we make the user experience here? If
it can be relatively straightforward, even for a Python user with zero Java
familiarity, it's going to be a win from a maintainability perspective to
only have one implementation (Java, in this case) to keep up to date, as
Cham pointed out. Kasia, do you have a sense yet for what the experience
for a Python user would be for using the Python-wrapped Java SnowflakeIO
connector?

-Tyler


>
> I don't know enough about the Java side of Beam (or Java in general
> really) to say if that's an issue or not though :)
>
> Cheers,
>
> On Mon, 9 Mar 2020 at 18:06, Chamikara Jayalath 
> wrote:
>
>> Thank you. Elias and Shashanka, do you think the Python connector (and
>> API) can offer some additional benefits that a Java cross-language
>> connector cannot? It's fine to develop Java and Python versions if it
>> makes sense, but if the cross-language Java version offers the same
>> benefits as Python, just having one implementation will reduce the
>> maintenance burden.
>>
>> Thanks,
>> Cham
>>
>> On Mon, Mar 9, 2020 at 5:41 AM Katarzyna Kucharczyk <
>> ka.kucharc...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> My colleague Dariusz and I are currently working on the Java connector,
>>> and we are planning to use cross-language support to add Python as well.
>>> The proposal should arrive on the dev list in the near future.
>>> We would also be happy to help with your current work if needed.
>>>
>>> Cheers,
>>> Kasia
>>>
>>> On Mon, Mar 9, 2020 at 9:41 AM Elias Djurfeldt <
>>> elias.djurfe...@mirado.com> wrote:
>>>
 Cool Shashanka! Feel free to tag me in the JIRA and update me on any
 progress / ponderings.

 Cheers,
 Elias

 On Sat, 7 Mar 2020 at 03:43, Chamikara Jayalath 
 wrote:

> Absolutely. Please create a JIRA and coordinate with Elias and any
> others that would like to contribute to this.
>
> Thanks,
> Cham
>
> On Fri, Mar 6, 2020 at 10:46 AM Shashanka Balakuntala <
> shbalakunt...@gmail.com> wrote:
>
>> Hi Chamikara and Elias,
>> This seems like an interesting feature. Can I start working on this?
>> *Regards*
>>   Shashanka Balakuntala Srinivasa
>>
>>
>>
>> On Sat, Mar 7, 2020 at 12:00 AM Chamikara Jayalath <
>> chamik...@google.com> wrote:
>>
>>> I don't think we have this but contributions are welcome.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Tue, Mar 3, 2020 at 4:46 AM Elias Djurfeldt <
>>> elias.djurfe...@mirado.com> wrote:
>>>
 Hi all,

 I've stumbled upon a use case where I might need a SnowflakeIO in
 Python. Has anyone worked on this before or are there any discussions
 surrounding it?

 There is a Snowflake Python library available [1], so it looks
 feasible to implement in Beam.

 [1]
 https://docs.snowflake.net/manuals/user-guide/python-connector.html

 Cheers,
 Elias

>>>
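
For context, the Java connector under review would be consumed roughly as
below. This is a sketch only, written before the connector merged, so every
builder method name here is an assumption rather than the reviewed API; the
cross-language question in this thread is whether a Python user can invoke
an equivalent wrapper without ever touching these Java dependencies.

    // Sketch only -- names are assumptions, not the reviewed SnowflakeIO API.
    SnowflakeIO.DataSourceConfiguration config =
        SnowflakeIO.DataSourceConfiguration.create()
            .withUsernamePasswordAuth("user", "password")
            .withServerName("account.snowflakecomputing.com")
            .withDatabase("MY_DB")
            .withSchema("PUBLIC");

    PCollection<MyRow> rows =                      // MyRow is a hypothetical POJO
        pipeline.apply(
            SnowflakeIO.<MyRow>read()
                .withDataSourceConfiguration(config)
                .fromTable("MY_TABLE")
                .withCsvMapper(parts -> new MyRow(parts))
                .withCoder(SerializableCoder.of(MyRow.class)));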


Re: [VOTE] Release Vendored gRPC 1.13.1 v0.2, release candidate #1

2018-12-20 Thread Tyler Akidau
+1, Approve the release.

-Tyler

On Thu, Dec 20, 2018 at 9:49 AM Ahmet Altay  wrote:

> I meant BEAM-6249 in my last sentence. It should read: "BEAM-6249 has a
> comment about user building the libraries themselves, I am not sure if they
> are using the release 2.9 version directly or not."
>
> On Thu, Dec 20, 2018 at 9:48 AM Ahmet Altay  wrote:
>
>> +1
>>
>> I don't think there is a need for a hotfix release. The reason is that the
>> initial vendoring PR (https://github.com/apache/beam/pull/7024) that
>> started the issue was not cherry-picked to the release branch. BEAM-6056
>> has a comment about user building the libraries themselves, I am not sure
>> if they are using the release 2.9 version directly or not.
>>
>> On Thu, Dec 20, 2018 at 9:37 AM Kenneth Knowles  wrote:
>>
>>> I don't know yet about 2.9.1. There's a bit more context on BEAM-6249.
>>>
>>> Kenn
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-6249
>>>
>>> On Thu, Dec 20, 2018 at 12:02 PM Scott Wegner  wrote:
>>>
 Releasing new vendored artifacts won't generally imply a full Beam
 release. The plan is to pick up the new artifact version at HEAD which will
 roll into the next release.

 For this particular case, the question is whether the Dataflow issue that
 this fixes (BEAM-6056) warrants a hotfix release (2.9.1). I don't know the
 answer -- Ahmet/Kenn, do you have any thoughts?

 On Thu, Dec 20, 2018 at 2:18 AM Ismaël Mejía  wrote:

> Does this imply that we need a subsequent full release afterwards?
> I am assuming this new release is related to the reported issues with
> the Dataflow worker, or is this something different?
>
> On Thu, Dec 20, 2018 at 2:51 AM Kenneth Knowles 
> wrote:
> >
> > +1
> >
> >  - sigs good
> >  - `jar tf` looks good
> >
> > On Wed, Dec 19, 2018 at 7:54 PM Scott Wegner 
> wrote:
> >>
> >> Please review and vote on the release candidate #1 for the vendored
> artifact gRPC 1.13.1 v0.2
> >> [ ] +1, Approve the release
> >> [ ] -1, Do not approve the release (please provide specific
> comments)
> >>
> >> This is a follow-up to the previous thread about vendoring updates
> [1]
> >>
> >> The complete staging area is available for your review, which
> includes:
> >> * all artifacts to be deployed to the Maven Central Repository [2],
> >> * commit hash "3b8abca3ca3352e6bf20e059f17324049a2eae0a" [3],
> >> * artifacts which are signed with the key with fingerprint
> >> 5F47BD54C52008007288FF4D3593BA6C25ABF71F [4]
> >>
> >> The vote will be open for at least 72 hours. It is adopted by
> majority approval, with at least 3 PMC affirmative votes.
> >>
> >> Thanks,
> >> Scott
> >>
> >> [1]
> https://lists.apache.org/thread.html/9a55d12000cb3b1b61620b7dc4009d1351e6b8c70951f70aeb358583@%3Cdev.beam.apache.org%3E
> >> [2]
> https://repository.apache.org/content/repositories/orgapachebeam-1055/
> >> [3] https://github.com/apache/beam/pull/7328
> >> [4] https://dist.apache.org/repos/dist/release/beam/KEYS
> >> --
> >>
> >>
> >>
> >>
> >> Got feedback? tinyurl.com/swegner-feedback
>


 --




 Got feedback? tinyurl.com/swegner-feedback

>>>


Re: [ANNOUNCEMENT] New Beam chair: Kenneth Knowles

2018-09-19 Thread Tyler Akidau
Thanks Davor, and congrats Kenn!

-Tyler

On Wed, Sep 19, 2018 at 2:43 PM Yifan Zou  wrote:

> Congratulations Kenn!
>
> On Wed, Sep 19, 2018 at 2:36 PM Robert Burke  wrote:
>
>> Congrats Kenn! :D
>>
>> On Wed, Sep 19, 2018, 2:21 PM Ismaël Mejía  wrote:
>>
>>> Congratulations and welcome Kenn as new chair!
>>> Thanks Davor for your hard work too.
>>>
>>> On Wed, Sep 19, 2018 at 11:14 PM Rui Wang  wrote:
>>>
 Congrats!

 -Rui

 On Wed, Sep 19, 2018 at 2:12 PM Chamikara Jayalath <
 chamik...@google.com> wrote:

> Congrats!
>
> On Wed, Sep 19, 2018 at 2:05 PM Ahmet Altay  wrote:
>
>> Congratulations, Kenn! And thank you Davor.
>>
>> On Wed, Sep 19, 2018 at 1:44 PM, Anton Kedin 
>> wrote:
>>
>>> Congrats!
>>>
>>> On Wed, Sep 19, 2018 at 1:36 PM Ankur Goenka 
>>> wrote:
>>>
 Congrats Kenn!

 On Wed, Sep 19, 2018 at 1:35 PM Amit Sela 
 wrote:

> Well deserved! Congrats Kenn.
>
> On Wed, Sep 19, 2018 at 4:25 PM Kai Jiang 
> wrote:
>
>> Congrats, Kenn!
>>
>> On Wed, Sep 19, 2018 at 1:23 PM Alan Myrvold 
>> wrote:
>>
>>> Congrats, Kenn.
>>>
>>> On Wed, Sep 19, 2018 at 1:08 PM Maximilian Michels <
>>> m...@apache.org> wrote:
>>>
 Congrats!

 On 19.09.18 22:07, Robin Qiu wrote:
 > Congratulations, Kenn!
 >
 > On Wed, Sep 19, 2018 at 1:05 PM Lukasz Cwik >>> > > wrote:
 >
 > Congrats Kenn.
 >
 > On Wed, Sep 19, 2018 at 12:54 PM Davor Bonaci <
 da...@apache.org
 > > wrote:
 >
 > Hi everyone --
 > It is with great pleasure that I announce that at
 today's
 > meeting of the Foundation's Board of Directors, the
 Board has
 > appointed Kenneth Knowles as the second chair of the
 Apache Beam
 > project.
 >
 > Kenn has served on the PMC since its inception, and
 is very
 > active and effective in growing the community. His
 exemplary
 > posts have been cited in other projects. I'm super
 happy to have
 > Kenn accept the nomination, and I'm confident that
 he'll serve
 > with distinction.
 >
 > As for myself, I'm not going anywhere. I'm still
 around and will
 > be as active as I have recently been. Thrilled to be
 able to
 > pass the baton to such a key member of this community
 and to
 > have less administrative work to do ;-).
 >
 > Please join me in welcoming Kenn to his new role, and
 I ask that
 > you support him as much as possible. As always,
 please let me
 > know if you have any questions.
 >
 > Davor
 >

>>>
>>


Re: [DISCUSS] Communicating in dev@

2018-08-10 Thread Tyler Akidau
+1 to all points made so far. I think docs are great for collaborating on
bigger proposals, but mailing list summaries are very helpful since it's
often impractical to click through on each and every one (plus I can search
archives more effectively that way). And email for shorter proposals is
also much nicer.

-Tyler

On Thu, Aug 9, 2018 at 11:59 AM Reuven Lax  wrote:

> Thank you, this is something that has come up several times and is worth
> addressing.
>
> I tend to agree with your main points here. Google Docs makes a good tool
> for collaborative editing and commenting, and is great while iterating on a
> design. However once the iterative process is done, the final doc should be
>> shared (on the dev list and on the Beam site) in a directly-readable
>> format. I assume best intentions and believe that people sometimes simply
>> forget to circle back and share the final design doc (I know I have
>> forgotten to in the past), so we need to make a stronger effort to track
> these things.
>
> Things that don't require collaborative editing and commenting or things
> that are short enough to capture in a one-page email should go directly to
> the dev list. There's really no good need to share such documents in Google
> Doc form. Again I assume best intentions - I think people often use docs as
> a tool to write the proposals, and then simply share out the doc when they
> are done. When docs is not adding value, take a second to copy/paste the
> proposal into an email instead!
>
> Reuven
>
> On Thu, Aug 9, 2018 at 9:11 AM Rafael Fernandez 
> wrote:
>
>> Hi,
>>
>> I think it's important we all discuss in the community what we want to
>> do to make sure we communicate effectively. I ask because I've seen
>> preferences expressed in dev@, and want to make sure we're conscious
>> about our practices.
>>
>> I think we all want discussions to be accessible to all members of the
>> community, and we need to make sure decisions are recorded in the dev@
>> list. If we are not doing this well, we need to flag this. I hope this
>> thread allows us to do so.
>>
>> Some questions I have:
>>
>> - Are we sharing docs that require installing software or forcing
>> creation of accounts? I don't think this is happening, but let's make
>> sure this is not the case.
>> - Are we having technical discussions and collaborating in tools such
>> as Google Docs without circling back to record decisions in the dev
>> list? If so, let's try our best to circle back to the dev list.
>> - Are we sharing trivially short information in doc form when an email
>> would suffice? If so, we can try our best to avoid that. Save a Beamer
>> a click and a new tab! :)
>>
>> Thoughts?
>> r
>>
>


Re: Wait.on() - "Do this, then that" transform

2018-05-14 Thread Tyler Akidau
Nice! I like the clean integration with streaming.

-Tyler

On Mon, May 14, 2018 at 2:48 PM Eugene Kirpichov 
wrote:

> Hi folks,
>
> Wanted to give a heads up about the existence of a commonly requested
> feature and its first successful production usage.
>
> The feature is the Wait.on() transform [1] , and the first successful
> production usage is in Spanner [2] .
>
> The Wait.on() transform allows you to "do this, then that" - in the sense
> that a.apply(Wait.on(signal)) re-emits PCollection "a", but only after the
> PCollection "signal" is "done" in the same window (i.e. when no more
> elements can arrive into the same window of "signal"). The PCollection
> "signal" is typically a collection of results of some operation - so
> Wait.on(signal) allows you to wait until that operation is done. It
> transparently works correctly in streaming pipelines too.
>
> This may sound a little convoluted, so the example from the documentation
> should help.
>
> PCollection<Void> firstWriteResults =
>     data.apply(ParDo.of(...write to first database...));
> data.apply(Wait.on(firstWriteResults))
>     // Windows of this intermediate PCollection will be processed no
>     // earlier than when the respective window of firstWriteResults closes.
>     .apply(ParDo.of(...write to second database...));
>
> This is indeed what Spanner folks have done, and AFAIK they intend this
> for importing multiple dependent database tables - e.g. first import a
> parent table; when it's done, import the child table - all within one
> pipeline. You can see example code in the tests [3].
>
> Please note that this kind of stuff requires support from the IO connector
> - IO.write() has to return a result that can be waited on. The code of
> SpannerIO is a great example; another example is FileIO.write().
>
> People have expressed wishes for similar support in Bigtable and BigQuery
> connectors but it's not there yet. It would be really cool if somebody
> added it to these connectors or others (I think there was a recent thread
> discussing how to add it to BigQueryIO).
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Wait.java
> [2] https://github.com/apache/beam/pull/4264
> [3]
> https://github.com/apache/beam/blob/a3ce091b3bbebf724c63be910bd3bc4cede4d11f/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/spanner/SpannerWriteIT.java#L158
>
>
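
A self-contained sketch of the parent/child table pattern described above,
with hypothetical stand-in DoFns for the two writes (a real connector such
as SpannerIO returns a dedicated result type instead; see the test in [3]):

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Wait;
    import org.apache.beam.sdk.values.PCollection;

    // Hypothetical DoFn standing in for "write parent rows to the database".
    class WriteParentsFn extends DoFn<String, Void> {
      @ProcessElement
      public void process(ProcessContext c) {
        // ... issue the parent-table write for c.element() ...
      }
    }

    // Hypothetical DoFn standing in for "write child rows to the database".
    class WriteChildrenFn extends DoFn<String, Void> {
      @ProcessElement
      public void process(ProcessContext c) {
        // ... issue the child-table write for c.element() ...
      }
    }

    PCollection<String> parents = ...;   // parent-table rows
    PCollection<String> children = ...;  // child-table rows

    PCollection<Void> parentResults =
        parents.apply("WriteParents", ParDo.of(new WriteParentsFn()));

    // Wait.on holds each window of `children` until the corresponding
    // window of parentResults closes, i.e. the parent import is done.
    children
        .apply(Wait.on(parentResults))
        .apply("WriteChildren", ParDo.of(new WriteChildrenFn()));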


Re: [ANNOUNCEMENT] New Foundation members!

2018-04-17 Thread Tyler Akidau
Congratulations!

On Tue, Apr 3, 2018 at 10:31 AM Daniel Oliveira 
wrote:

> Congrats!
>
> On Tue, Apr 3, 2018 at 2:05 AM Etienne Chauchot 
> wrote:
>
>> Congrats
>> On Tuesday, April 3, 2018 at 10:41 +0200, Kostas Kloudas wrote:
>>
>> Congratulations to everyone!
>>
>> On Apr 2, 2018, at 9:14 PM, Kenneth Knowles  wrote:
>>
>> Congratulations!
>>
>> On Mon, Apr 2, 2018 at 11:44 AM Alan Myrvold  wrote:
>>
>> Congratulations!
>>
>> On Mon, Apr 2, 2018 at 9:14 AM Scott Wegner  wrote:
>>
>> Congrats!
>>
>> On Sat, Mar 31, 2018 at 12:18 PM Robert Burke  wrote:
>>
>> Congratulations!
>>
>> On Sat, 31 Mar 2018 at 11:53 Ekrem Aksoy  wrote:
>>
>> Congrats!
>>
>> On Sat, Mar 31, 2018 at 2:08 AM, Davor Bonaci  wrote:
>>
>> Now that this is public... please join me in welcoming three newly
>> elected members of the Apache Software Foundation with ties to this
>> community, who were elected during the most recent Members' Meeting.
>>
>> * Ismaël Mejía (Beam PMC)
>>
>> * Josh Wills (Crunch Chair; Beam, DataFu PMC)
>>
>> * Holden Karau (Spark, SystemML PMC; Mahout, Subversion committer; Beam
>> contributor)
>>
>> These individuals demonstrated merit in the Foundation's growth, evolution,
>> and progress. They were recognized, nominated, and elected by the existing
>> membership for their significant impact on the Foundation as a whole,
>> through project-related and cross-project activities alike.
>>
>> As members, they now become legal owners and shareholders of the
>> Foundation. They can vote for the Board, incubate new projects, nominate
>> new members, participate in any PMC-private discussions, and contribute to
>> any project.
>>
>> (For the Beam community, this election nearly doubles the number of
>> Foundation members. The new members are joining Jean-Baptiste Onofré,
>> Stephan Ewen, Romain Manni-Bucau and myself in this role.)
>>
>> I'm happy to be able to call all three of you my fellow members.
>> Congratulations!
>>
>> Davor
>>
>>
>>
>>
>> --
>>
>>
>> Got feedback? http://go/swegner-feedback
>>
>>
>>
>>


Re: [PROPOSAL] Scripting extension based on Java JSR-223

2018-03-23 Thread Tyler Akidau
+1, I like it. Thanks!

On Fri, Mar 23, 2018 at 9:03 AM Ahmet Altay  wrote:

> Thank you Ismaël, this looks really cool.
>
> On Fri, Mar 23, 2018 at 5:33 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi,
>>
>> it sounds like a very good extension mechanism to PTransform.
>>
>> +1
>>
>> Regards
>> JB
>>
>> On 03/23/2018 12:03 PM, Ismaël Mejía wrote:
>> > This is a really simple proposal to add an extension with transforms
>> > that package the Java Scripting API (JSR-223) [1] to allow users to
>> > specialize some transforms via a scripting language. This work was
>> > initially created by Romain [2] and I just took it with his
>> > authorization and refined it to make it pass all the Beam validations
>> > + style. I also added ValueProviders that now allow users to template
>> > scripts in Dataflow as well.
>> >
>> > Notice that Dataflow recently added something similar to create really
>> > simple data movement pipelines [3], so maybe the rest of the community
>> > can benefit from a similar extension (and eventually Dataflow may
>> > converge to this implementation).
>> >
>> > I hope there is interest in this extension, so far we have a
>> > ScriptingParDo transform to show the idea, hopefully we can expand
>> > this to other transforms.
>> >
>> > For those interested in more details you can check the Jira issue [4]
>> > and the PR [5].
>> >
>> > [1] https://www.jcp.org/en/jsr/detail?id=223
>> > [2] https://github.com/rmannibucau/beam-jsr223
>> > [3]
>> https://cloud.google.com/blog/big-data/2018/03/pre-built-cloud-dataflow-templates-kiss-for-data-movement
>> > [4] https://issues.apache.org/jira/browse/BEAM-3921
>> > [5] https://github.com/apache/beam/pull/4944
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>
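
For readers unfamiliar with JSR-223: it is the standard javax.script API,
which the proposed extension packages into transforms. A minimal standalone
example of the underlying mechanism, independent of Beam (a ScriptingParDo
would presumably evaluate a user script like this once per element):

    import javax.script.ScriptEngine;
    import javax.script.ScriptEngineManager;

    public class Jsr223Demo {
      public static void main(String[] args) throws Exception {
        // Look up an engine by name; Java 8 bundles "nashorn" for JavaScript.
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");
        // Bind a variable into the script's scope, then evaluate a script
        // against it -- the per-element hook a scripting transform exposes.
        engine.put("element", 21);
        System.out.println(engine.eval("element * 2")); // prints 42
      }
    }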


Re: [SQL] Windowing and triggering changes proposal

2018-01-19 Thread Tyler Akidau
I'm late to the party as usual, but also added some comments. Overall
supportive of this work. Thanks for the clear analysis, Anton.

On Tue, Jan 16, 2018 at 10:58 AM Mingmin Xu  wrote:

> Thanks @Anton for the proposal. Window(w/ trigger) support in SQL is
> limited now, you're very welcome to join the improvement.
>
> There's a balance between the injected DSL mode and the CLI mode that we
> struck when implementing BeamSQL overall, not only windowing. Many default
> behaviors were introduced to make it workable in the pure SQL CLI scenario.
> If that limits the potential of the DSL mode, we should absolutely adjust it.
>
> Mingmin
>
> On Tue, Jan 16, 2018 at 9:56 AM, Kenneth Knowles  wrote:
>
>> I've commented on the doc. This is a really nice analysis and I think the
>> proposal is good for making SQL work with Beam windowing and triggering in
>> a way that will make sense to users.
>>
>> Kenn
>>
>> On Thu, Jan 11, 2018 at 4:05 PM, Anton Kedin  wrote:
>>
>>> Hi,
>>>
>>> Wanted to gather feedback on changes I propose to the behavior of some
>>> aspects of windowing and triggering in Beam SQL.
>>>
>>> In short:
>>>
>>> Beam SQL currently overrides input PCollections' windowing/triggering
>>> configuration in a few cases. For example, if a query has a simple GROUP BY
>>> clause, we would apply GlobalWindows. And it's not configurable by the
>>> user, it happens under the hood of SQL.
>>>
>>> The proposal is to update the Beam SQL implementation in these cases to
>>> avoid changing the input PCollections' configuration as much as possible.
>>>
>>> More details here:
>>> https://docs.google.com/document/d/1RmyV9e1Qab-axsLI1WWpw5oGAJDv0X7y9OSnPnrZWJk
>>>
>>> Regards,
>>> Anton
>>>
>>
>>
>
>
> --
> 
> Mingmin
>


Re: Happy new year

2018-01-02 Thread Tyler Akidau
+1 all around. 2017 was excellent, and we're in a great position for 2018
to be even better. Looking forward to it. :-)

On Tue, Jan 2, 2018 at 10:34 AM Kenneth Knowles  wrote:

> Happy new year! Very excited for 2018's possibilities. And nice work, JB.
>
> Kenn
>
> On Tue, Jan 2, 2018 at 6:42 AM, Ismaël Mejía  wrote:
>
>> Happy new year everyone !
>>
>> Agree 100% with Davor. 2017 was a good year for the project and it is
>> worth thanking everyone who helped to make the project better.
>> Now is the time to work to have a great project in 2018 too !
>>
>> Best wishes to all (and kudos to JB too for the top 5) !
>>
>> Ismaël
>>
>>
>>
>> On Mon, Jan 1, 2018 at 9:29 PM, David Sabater Dinter
>>  wrote:
>> > Happy New year all!
>> >
>> > On Mon, 1 Jan 2018 at 19:43, Davor Bonaci  wrote:
>> >>
>> >> Hi everyone --
>> >> As we begin the new year, I wanted to send the best wishes in 2018 to
>> >> everyone in the Beam community -- users, contributors and observers
>> alike!
>> >>
>> >> There's so much to be proud of in 2017, including graduation to a
>> >> top-level project and the availability of the first stable release.
>> Thanks
>> >> to everyone for making this possible!
>> >>
>> >> Finally, I'd also like to pass along some fun facts compiled by others
>> >> [1]. Beam mailing lists had the 9th highest volume among all user@
>> +dev@
>> >> lists. Our very own, Jean-Baptiste Onofre, has once again finished in
>> the
>> >> top 5 committers across all projects in the Apache Software
>> Foundation. This
>> >> year, JB finished as #3, with 2,142 commits, among 6,504 committers.
>> >> Congrats JB!
>> >>
>> >> Happy New Year -- and I hope to see you out and about in the next few
>> >> months!
>> >>
>> >> Davor
>> >>
>> >> [1] https://blogs.apache.org/foundation/entry/apache-in-2017-by-the
>> >>
>> >> On Mon, Jan 1, 2018 at 8:41 AM, Jesse Anderson <
>> je...@bigdatainstitute.io>
>> >> wrote:
>> >>>
>> >>> Happy New Year!
>> >>>
>> >>>
>> >>> On Sun, Dec 31, 2017, 11:09 PM Jean-Baptiste Onofré 
>> >>> wrote:
>> 
>>  Hi beamers,
>> 
>>  I wish you a great and happy new year !
>> 
>>  Regards
>>  JB
>>  --
>>  Jean-Baptiste Onofré
>>  jbono...@apache.org
>>  http://blog.nanthrax.net
>>  Talend - http://www.talend.com
>> >>
>> >>
>> >
>>
>
>


Re: Euphoria Java 8 DSL - proposal

2018-01-02 Thread Tyler Akidau
+1, I'm supportive of seeing this move forward. What remaining concrete
concerns are there?

-Tyler


On Tue, Jan 2, 2018 at 8:35 AM David Morávek 
wrote:

> Hello JB,
>
> can we help in any way to move things forward?
>
> Thanks,
> D.
>
> On Mon, Dec 18, 2017 at 4:28 PM, Jean-Baptiste Onofré 
> wrote:
>
>> Thanks Jan,
>>
>> It makes sense.
>>
>> Let me take a look on the code to understand the "interaction".
>>
>> Regards
>> JB
>>
>>
>> On 12/18/2017 04:26 PM, Jan Lukavský wrote:
>>
>>> Hi JB,
>>>
>>> basically you are not wrong. The project started about three or four
>>> years ago with the goal of unifying batch and streaming processing into a
>>> single portable, executor-independent API. Because of that, it is currently
>>> "close" to Beam in this sense. But we don't see much added value in keeping
>>> this as a separate project, with one of the key differences being the API
>>> (not the model itself), so we would like to focus on translation from the
>>> Euphoria API to Beam's SDK. That's why we would like to see it as a DSL, so
>>> that it would be possible to use the Euphoria API with Beam's runners as
>>> natively as possible.
>>>
>>> I hope I didn't make the subject even more unclear, if so, I'll be happy
>>> to explain anything in more detail. :-)
>>>
>>>Jan
>>>
>>>
>>> On 12/18/2017 04:08 PM, Jean-Baptiste Onofré wrote:
>>>
 Hi Jan,

 Thanks for your answers.

 However, they confused me ;)

 Regarding what you replied, Euphoria seems like a programming model/SDK
 "close" to Beam more than a DSL on top of an existing Beam SDK.

 Am I wrong ?

 Regards
 JB

 On 12/18/2017 03:44 PM, Jan Lukavský wrote:

> Hi Ismael,
>
> basically we adopted Beam's design regarding partitioning (
> https://github.com/seznam/euphoria/issues/160) and implemented the
> sorting manually (https://github.com/seznam/euphoria/issues/158). I'm
> not aware of any time model differences (Euphoria supports ingestion and
> event time; we don't support processing time by decision). Regarding other
> differences (looking at the Beam capability matrix), I'd say that:
>
>   - we don't support stateful FlatMap (i.e. ParDo) for now (
> https://github.com/seznam/euphoria/issues/192)
>
>   - we don't support side inputs (by decision now, but might be
> reconsidered) and outputs (
> https://github.com/seznam/euphoria/issues/124)
>
>   - we support complete event-time windows (non-merging, merging,
> aligned, unaligned) and time control
>
>   - we don't support processing time by decision (might be
> reconsidered if a valid use-case is found)
>
>   - we support window triggering based on both time and data,
> including discarding and accumulating (without accumulating & retracting)
>
> All our executors (runners) - Flink, Spark and Local - implement the
> complete model, which we enforce using "operator test kit" that all
> executors must pass. Spark executor supports bounded sources only (for
> now). As David said, we currently don't have serialization abstraction, so
> there is some work to be done in that regard.
>
> Our intention is to completely supersede Euphoria. We would like to
> consider the possibility of using executors that would not rely on Beam,
> but that is optional for now and should be straightforward.
>
> We'd be happy to answer any more questions you might have and thanks a
> lot!
>
> Best,
>
>   Jan
>
>
> On 12/18/2017 03:19 PM, Ismaël Mejía wrote:
>
>> Hi,
>>
>> It is great to see that you guys have reached the point of maturity to
>> propose this. Congratulations on your work and the idea to contribute
>> it to Beam.
>>
>> I remember a previous discussion with Jan about the model
>> mismatch between Euphoria and Beam, caused by some design decisions
>> of both projects. I remember you guys had some issues with the way
>> Beam's sources do partitioning, as well as Beam's lack of sorted data
>> (on shuffle, a la Hadoop). Also, if I remember well, the 'time' model of
>> Euphoria was simpler than Beam's. I talk about all of this because I
>> am curious about what parts of the Euphoria model you guys had to
>> sacrifice to support Beam, and what parts of Beam's model should still
>> be integrated into Euphoria (and if there is a straightforward path to
>> do it).
>>
>> If I understand correctly, this getting merged into Apache means that
>> Euphoria's current implementation would be superseded by this DSL? I
>> am curious because I would like to understand your level of investment
>> in supporting the future of this DSL.
>>
>> Thanks and congrats again !
>> Ismaël
>>
>> On Mon, Dec 18, 2017 at 10:12 AM, Jean-Baptiste Onofré <
>> 
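
To give a taste of the API being discussed, here is a word-count-style
fragment in the pre-Beam Euphoria style; the operator names (FlatMap,
CountByKey) match the project's examples, but the exact builder methods and
packages should be treated as illustrative, since they changed as the code
moved between the standalone project and the proposed Beam DSL:

    // Illustrative Euphoria-style pipeline fragment (imports elided; the
    // packages differ between the standalone project and the Beam DSL).
    Dataset<String> lines = ...;
    Dataset<String> words = FlatMap.named("TOKENIZE")
        .of(lines)
        .using((String line, Collector<String> c) -> {
          for (String w : line.split("\\s+")) {
            c.collect(w);
          }
        })
        .output();
    Dataset<Pair<String, Long>> counted = CountByKey.named("COUNT")
        .of(words)
        .keyBy(w -> w)
        .output();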

Re: Schema-Aware PCollections

2017-11-30 Thread Tyler Akidau
On Wed, Nov 29, 2017 at 6:38 PM Reuven Lax  wrote:

> There has been a lot of conversation about schemas on PCollections
> recently. There are a number of reasons for this. Schemas as first-class
> objects in Beam provide a nice base for building BeamSQL. Spark has
> provided schema-support via Dataframes for over two years, and it has
> proved to be very popular among Spark users; it turns out that FlumeJava -
> the original inspiration for the Beam API - has had schema support for even
> longer, though this feature was not included in the Beam (at that time
> Dataflow) API. It turns out that most records have structure, and allowing
> the system to understand record structure can both simplify usage of the
> system and allow for new performance optimizations.
>
> After discussion with JB, Eugene, Kenn, Robert, and a number of others on
> the list, I've started a proposal document here
> 
> describing how schemas can be added to Beam in a manner that integrates
> with the existing Beam API. The goal is not blindly copy existing systems
> that have schemas, but rather to ensure that we get the best fit for Beam.
> Please comment on this proposal - as much feedback as possible is valuable.
>
> In addition, you may notice this document is incomplete. While it does
> sketch out how schemas can fit into Beam semantically, many portions of
> this design remain to be fleshed out. In particular, the API signatures are
> only sketched out at a high level; exactly what all these APIs will look
> like has not yet been defined. I would welcome help from interested members
> of the community to define these APIs, and to make sure we're covering all
> relevant use cases.
>

Thanks for sharing this Reuven, I'm excited to see this being discussed.
One global comment: all of the existing examples are in Java. It would be
great if we could design this with Python in mind (and how it could
interact cleanly with Pandas) at the same time. +Robert Bradshaw
 , +Holden Karau  , and +Ahmet Altay
 , all of whom I've spoken with regarding this and other
Python things recently, just to be sure they see it. But of course it'd be
great if anyone working on Python could jump in.

-Tyler



>
> Thanks all,
>
> Reuven
>
>
>
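
Since the proposal deliberately leaves the API signatures open, the
following is a purely hypothetical sketch of the kind of field-level
operations a schema-aware PCollection could enable; every transform name
below is invented for illustration:

    // Hypothetical API -- the proposal defines no signatures yet. The idea:
    // once a PCollection carries a schema, the SDK understands record
    // structure, so filtering and projection can address fields directly
    // instead of going through hand-written POJOs and coders.
    PCollection<Row> users = ...;  // rows with fields (name, age, country)

    PCollection<Row> adults =
        users.apply(FilterFields.where("age", (Integer age) -> age >= 18));

    PCollection<Row> projected =
        adults.apply(SelectFields.of("name", "country"));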


Re: [VOTE] Choose the "new" Spark runner

2017-11-19 Thread Tyler Akidau
[ ] Use Spark 1 & Spark 2 Support Branch
[X] Use Spark 2 Only Branch

On Sun, Nov 19, 2017 at 1:46 PM Amit Sela  wrote:

> [X] Use Spark 2 Only Branch
>
> On Sun, Nov 19, 2017, 02:46 Reuven Lax  wrote:
>
> > [ ] Use Spark 1 & Spark 2 Support Branch
> >  [X] Use Spark 2 Only Branch
> >
> > On Sat, Nov 18, 2017 at 1:54 AM, Ben Sidhom 
> > wrote:
> >
> > > [ ] Use Spark 1 & Spark 2 Support Branch
> > > [X] Use Spark 2 Only Branch
> > >
> > > On Fri, Nov 17, 2017 at 9:46 AM, Ted Yu  wrote:
> > >
> > > > [ ] Use Spark 1 & Spark 2 Support Branch
> > > > [X] Use Spark 2 Only Branch
> > > >
> > > > On Thu, Nov 16, 2017 at 5:08 AM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > > > wrote:
> > > >
> > > > > Hi guys,
> > > > >
> > > > > To illustrate the current discussion about Spark versions support,
> > you
> > > > can
> > > > > take a look on:
> > > > >
> > > > > --
> > > > > Spark 1 & Spark 2 Support Branch
> > > > >
> > > > > https://github.com/jbonofre/beam/tree/BEAM-1920-SPARK2-MODULES
> > > > >
> > > > > This branch contains a Spark runner common module compatible with
> > both
> > > > > Spark 1.x and 2.x. For convenience, we introduced spark1 & spark2
> > > > > modules/artifacts containing just a pom.xml to define the
> > dependencies
> > > > set.
> > > > >
> > > > > --
> > > > > Spark 2 Only Branch
> > > > >
> > > > > https://github.com/jbonofre/beam/tree/BEAM-1920-SPARK2-ONLY
> > > > >
> > > > > This branch is an upgrade to Spark 2.x and "drops" support for Spark
> > 1.x.
> > > > >
> > > > > As I'm ready to merge one or the other in the PR, I would like to
> > > > complete
> > > > > the vote/discussion pretty soon.
> > > > >
> > > > > Correct me if I'm wrong, but it seems that the preference is to
> drop
> > > > Spark
> > > > > 1.x to focus only on Spark 2.x (for the Spark 2 Only Branch).
> > > > >
> > > > > I would like to call a final vote to confirm the merge I will do:
> > > > >
> > > > > [ ] Use Spark 1 & Spark 2 Support Branch
> > > > > [ ] Use Spark 2 Only Branch
> > > > >
> > > > > This informal vote is open for 48 hours.
> > > > >
> > > > > Please, let me know what your preference is.
> > > > >
> > > > > Thanks !
> > > > > Regards
> > > > > JB
> > > > >
> > > > > On 11/13/2017 09:32 AM, Jean-Baptiste Onofré wrote:
> > > > >
> > > > >> Hi Beamers,
> > > > >>
> > > > >> I'm forwarding this discussion & vote from the dev mailing list to
> > the
> > > > >> user mailing list.
> > > > >> The goal is to have your feedback as user.
> > > > >>
> > > > >> Basically, we have two options:
> > > > >> 1. Right now, in the PR, we support both Spark 1.x and 2.x using
> > three
> > > > >> artifacts (common, spark1, spark2). You, as users, pick up spark1
> or
> > > > spark2
> > > > >> in your dependencies set depending on the Spark target version you
> > want.
> > > > >> 2. The other option is to upgrade and focus on Spark 2.x in Beam
> > > 2.3.0.
> > > > >> If you still want to use Spark 1.x, then you will be stuck at
> > Beam
> > > > >> 2.2.0.
> > > > >>
> > > > >> Thoughts ?
> > > > >>
> > > > >> Thanks !
> > > > >> Regards
> > > > >> JB
> > > > >>
> > > > >>
> > > > >>  Forwarded Message 
> > > > >> Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
> > > > >> Date: Wed, 8 Nov 2017 08:27:58 +0100
> > > > >> From: Jean-Baptiste Onofré 
> > > > >> Reply-To: dev@beam.apache.org
> > > > >> To: dev@beam.apache.org
> > > > >>
> > > > >> Hi all,
> > > > >>
> > > > >> as you might know, we are working on Spark 2.x support in the
> Spark
> > > > >> runner.
> > > > >>
> > > > >> I'm working on a PR about that:
> > > > >>
> > > > >> https://github.com/apache/beam/pull/3808
> > > > >>
> > > > >> Today, we have something working with both Spark 1.x and 2.x from
> a
> > > code
> > > > >> standpoint, but I have to deal with dependencies. It's the first
> > step
> > > of
> > > > >> the update as I'm still using RDD, the second step would be to
> > support
> > > > >> DataFrames (but for that, I would need PCollection elements with
> > > schemas,
> > > > >> that's another topic on which Eugene, Reuven and I are
> discussing).
> > > > >>
> > > > >> However, as all major distributions now ship Spark 2.x, I don't
> > think
> > > > >> it's required anymore to support Spark 1.x.
> > > > >>
> > > > >> If we agree, I will update and cleanup the PR to only support and
> > > focus
> > > > >> on Spark 2.x.
> > > > >>
> > > > >> So, that's why I'm calling for a vote:
> > > > >>
> > > > >>[ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
> > > > >>[ ] 0 (I don't care ;))
> > > > >>[ ] -1, I would like to still support Spark 1.x, and so having
> > > > support
> > > > >> of both Spark 1.x and 2.x (please provide specific comment)
> > > > >>
> > > > >> This vote is open for 48 hours (I have the commits ready, just
> > waiting
> > > > >> the end of the 

Re: [Proposal] Apache Beam Swag Store

2017-10-27 Thread Tyler Akidau
One additional note: for the logos w/ the name below them, it would be nice
to not have quite so much whitespace between the logo and the name. Otherwise,
trademark validation aside, this looks great.

On Fri, Oct 27, 2017 at 10:15 AM Griselda Cuevas 
wrote:

> Hi Dan - thanks for bringing this to my attention. I haven't raised this
> with the tradema...@apache.org people. Can I just reach out to them to
> get
> the proposal or should one of the PMCs do this?
>
>
>
> Gris Cuevas Zambrano
>
> g...@google.com
>
> Open Source Strategy
>
> 345 Spear Street, San Francisco, 94105
> 
>
>
>
> On 26 October 2017 at 18:30, Daniel Kulp  wrote:
>
> >
> > Have you run this through tradema...@apache.org yet?
> >
> > I bring this up for two reasons:
> >
> > 1) We would need to make sure the appearance and such of the logo is
> > “correct”
> >
> > 2) A few years ago, Apache did have a partnership with a company that
> > would produce various swag things and then donate a portion back to
> > Apache. I don't know what the state of that agreement is and whether
> that
> > would restrict going to another vendor or something.
> >
> > Dan
> >
> >
> > > On Oct 25, 2017, at 8:51 AM, Griselda Cuevas 
> > wrote:
> > >
> > > Hi Everyone,
> > >
> > > I'd like to propose the creation of an online swag store for Apache
> Beam
> > where anyone could order swag from a wide selection and get it delivered to
> > their home or office. I got in touch with a provider who could include
> this
> > service as part of an order I'm placing to send swag for the Meetups
> we're
> > organizing this year. What do you think?
> > >
> > > I'd also like to get your feedback on the swag I'm requesting (you can
> > see it in the pdf I attached), what do you think of the colors, design,
> > etc.?
> > >
> > > Lastly, I'll be ordering swag for Meetups organized this year, so if
> > you're hosting one or speaking at one get in touch with me to send some!
> > >
> > > Cheers,
> > > G
> >
> > --
> > Daniel Kulp
> > dk...@apache.org - http://dankulp.com/blog
> > Talend Community Coder - http://coders.talend.com
> >
> >
>


Re: [DISCUSS] Switch to gitbox

2017-10-06 Thread Tyler Akidau
+1

On Fri, Oct 6, 2017 at 8:54 AM Reuven Lax  wrote:

> +1
>
> On Oct 6, 2017 4:51 PM, "Lukasz Cwik"  wrote:
>
> > I think it's a great idea and find that the mergebot works well on the
> > website.
> > Since gitbox enforces that the precommit checks pass, it would also be a
> > good forcing function for the community to maintain reliably passing
> tests.
> >
> > On Fri, Oct 6, 2017 at 4:58 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > Hi guys,
> > >
> > > We use Apache gitbox for the website and it works fine (as long as you
> > > have linked your Apache and GitHub accounts with 2FA enabled).
> > >
> > > What do you think about moving to gitbox for the codebase itself ?
> > >
> > > It could speed up the review and merge for the PRs.
> > >
> > > Thoughts ?
> > >
> > > Regards
> > > JB
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> > >
> >
>


Re: Support for window analytic functions in SQL DSL

2017-10-05 Thread Tyler Akidau
I'm not aware of analytic window support. +Mingmin Xu 
 or +James  could speak to any plans they might have
regarding adding support.

-Tyler

On Mon, Oct 2, 2017 at 3:23 AM Kobi Salant  wrote:

> Hi,
>
> The Calcite streaming documentation includes examples of using SQL window
> analytic functions
> https://calcite.apache.org/docs/stream.html#sliding-windows
>
> In the Beam documentation https://beam.apache.org/documentation/dsls/sql/
> there is no mention of this functionality.
>
> Does the DSL_SQL branch support it, or do we have future plans for it?
>
> Thanks
> Kobi
>
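
For concreteness, the Calcite feature Kobi refers to uses an OVER clause
rather than the GROUP BY windowing Beam SQL handled at the time; a sketch of
the two query shapes, embedded in Java strings for illustration (the
analytic form was not accepted by Beam SQL when this thread was written):

    // Calcite-style sliding-window analytic query, per the linked docs.
    String analytic =
        "SELECT STREAM rowtime, productId, units, "
            + "  SUM(units) OVER ( "
            + "    PARTITION BY productId "
            + "    ORDER BY rowtime "
            + "    RANGE INTERVAL '1' HOUR PRECEDING) AS unitsLastHour "
            + "FROM Orders";

    // The GROUP BY windowing form Beam SQL did support (TUMBLE et al.).
    String supported =
        "SELECT productId, COUNT(*) "
            + "FROM Orders "
            + "GROUP BY productId, TUMBLE(rowtime, INTERVAL '1' HOUR)";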


Re: Merge branch DSL_SQL to master

2017-09-12 Thread Tyler Akidau
Congrats, and thanks!

-Tyler

On Tue, Sep 12, 2017 at 5:49 AM Etienne Chauchot <echauc...@gmail.com>
wrote:

> Great work guys!
>
> Etienne
>
>
> > On 11/09/2017 at 23:51, Mingmin Xu wrote:
> > Now it's merged to master. Thanks to everyone!
> >
> > Mingmin
> >
> > On Thu, Sep 7, 2017 at 10:09 AM, Ahmet Altay <al...@google.com.invalid>
> > wrote:
> >
> >> +1 Thanks to all contributors/reviewers!
> >>
> >> On Thu, Sep 7, 2017 at 9:55 AM, Kai Jiang <jiang...@gmail.com> wrote:
> >>
> >>> +1 looking forward to this.
> >>>
> >>> On Thu, Sep 7, 2017, 09:53 Tyler Akidau <taki...@google.com.invalid>
> >>> wrote:
> >>>
> >>>> +1, thanks for all the hard work to everyone that contributed!
> >>>>
> >>>> -Tyler
> >>>>
> >>>> On Thu, Sep 7, 2017 at 2:39 AM Ismaël Mejía <ieme...@gmail.com>
> wrote:
> >>>>
> >>>>> +1
> >>>>> A nice feature to have on Beam. Great work guys !
> >>>>>
> >>>>> On Thu, Sep 7, 2017 at 10:21 AM, Pei HE <pei...@gmail.com> wrote:
> >>>>>> +1
> >>>>>>
> >>>>>> On Thu, Sep 7, 2017 at 4:03 PM, tarush grover <
> >>> tarushappt...@gmail.com
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Thank you all, it was a great learning experience!
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Tarush
> >>>>>>>
> >>>>>>> On Thu, 7 Sep 2017 at 1:05 PM, Jean-Baptiste Onofré <
> >>> j...@nanthrax.net>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> +1
> >>>>>>>>
> >>>>>>>> Great work guys !
> >>>>>>>> Ready to help for the merge and maintain !
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>> JB
> >>>>>>>>
> >>>>>>>> On 09/07/2017 08:48 AM, Mingmin Xu wrote:
> >>>>>>>>> Hi all,
> >>>>>>>>>
> >>>>>>>>> On behalf of the virtual Beam SQL team[1], I'd like to propose
> >>> to
> >>>>> merge
> >>>>>>>>> DSL_SQL branch into master (PR #3782 [2]) and include it in
> >>>> release
> >>>>>>>> version
> >>>>>>>>> 2.2.0, which will give it more visibility to other
> >> contributors
> >>>> and
> >>>>>>>> users.
> >>>>>>>>> The SQL feature satisfies the following criteria outlined in the
> >>>>>>> contribution
> >>>>>>>>> guide [3].
> >>>>>>>>>
> >>>>>>>>> 1. Have at least 2 contributors interested in maintaining it,
> >>> and
> >>>> 1
> >>>>>>>>> committer interested in supporting it
> >>>>>>>>>
> >>>>>>>>> * James and I will continue to add new features and maintain it;
> >>>>>>>>>
> >>>>>>>>> Tyler, James, and I will support it as committers;
> >>>>>>>>>
> >>>>>>>>> 2. Provide both end-user and developer-facing documentation
> >>>>>>>>>
> >>>>>>>>> * A web page[4] is added to describe the usage of SQL DSL and
> >>> how
> >>>> it
> >>>>>>>> works;
> >>>>>>>>>
> >>>>>>>>> 3. Have at least a basic level of unit test coverage
> >>>>>>>>>
> >>>>>>>>> * In total, 230 unit/integration tests, with 83.4% code coverage;
> >>>>>>>>> 4. Run all existing applicable integration tests with other
> >> Beam
> >>>>>>>> components
> >>>>>>>>> and create additional tests as appropriate
> >>>>>>>>>
> >>>>>>>>> * Besides the integration tests in package
> >>>>>>>> org.apache.beam.sdk.extensions.sql,
> >>>>>>>>> there's another example in
> >>>>> org.apache.beam.sdk.extensions.sql.example.
> >>>>>>>>> BeamSqlExample.
> >>>>>>>>>
> >>>>>>>>> [1]. Special thanks to all contributors/reviewers:
> >>>>>>>>>
> >>>>>>>>>Tyler Akidau
> >>>>>>>>>
> >>>>>>>>>Davor Bonaci
> >>>>>>>>>
> >>>>>>>>>Robert Bradshaw
> >>>>>>>>>
> >>>>>>>>>Lukasz Cwik
> >>>>>>>>>
> >>>>>>>>>Tarush Grover
> >>>>>>>>>
> >>>>>>>>>Kai Jiang
> >>>>>>>>>
> >>>>>>>>>Kenneth Knowles
> >>>>>>>>>
> >>>>>>>>>Jingsong Lee
> >>>>>>>>>
> >>>>>>>>>Ismaël Mejía
> >>>>>>>>>
> >>>>>>>>>Jean-Baptiste Onofré
> >>>>>>>>>
> >>>>>>>>>James Xu
> >>>>>>>>>
> >>>>>>>>>Mingmin Xu
> >>>>>>>>>
> >>>>>>>>> [2]. https://github.com/apache/beam/pull/3782
> >>>>>>>>>
> >>>>>>>>> [3]. https://beam.apache.org/contribute/contribution-guide/
> >>>>>>>>> #merging-into-master
> >>>>>>>>>
> >>>>>>>>> [4]. https://beam.apache.org/documentation/dsls/sql/
> >>>>>>>>>
> >>>>>>>>> Thanks!
> >>>>>>>>> 
> >>>>>>>>> Mingmin
> >>>>>>>>>
> >>>>>>>> --
> >>>>>>>> Jean-Baptiste Onofré
> >>>>>>>> jbono...@apache.org
> >>>>>>>> http://blog.nanthrax.net
> >>>>>>>> Talend - http://www.talend.com
> >>>>>>>>
> >
> >
>
>


Re: [DISCUSSION] using NexMark for Beam

2017-08-24 Thread Tyler Akidau
Awesome news, thank you! :-D

On Thu, Aug 24, 2017 at 12:40 AM Etienne Chauchot 
wrote:

> Hi all,
>
> I wanted to let you know that the Nexmark PR is merged into master. Feel
> free to use it (e.g. performance testing, release testing ...).
>
> Etienne
>
> On 12/05/2017 at 10:55, Etienne Chauchot wrote:
> > Hi guys,
> >
> > I wanted to let you know that I have just submitted a PR around
> > NexMark. This is a port of the NexMark queries to Beam, to be used as
> > integration tests.
> > This can also be used for A-B testing (non-regression or performance
> > comparison between two versions of the same engine or the same runner)
> >
> > This a continuation of the previous PR (#99) from Mark Shields.
> > The code has changed quite a bit: some queries have changed to use new
> > Beam APIs and there were some big refactorings. More importantly, we
> > can now run all the queries in all the runners.
> >
> > Nevertheless, there are still some open issues in Nexmark
> > (https://github.com/iemejia/beam/issues) and in Beam upstream (see
> > issue links in https://issues.apache.org/jira/browse/BEAM-160)
> >
> > I wanted to submit the PR before our (Ismaël's and my) NexMark talk at
> > ApacheCon. The PR is not perfect, but it is in good shape to
> > share.
> >
> > Best,
> >
> > Etienne
> >
> >
> >
> > Le 22/03/2017 à 04:51, Kenneth Knowles a écrit :
> >> This is great! Having a variety of realistic-ish pipelines running on
> >> all
> >> runners complements the validation suite and IO IT work.
> >>
> >> If I recall, some of these involve heavy and esoteric uses of state, so
> >> definitely give me a ping if you hit any trouble.
> >>
> >> Kenn
> >>
> >> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot 
> >> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Ismael and I are working on upgrading the Nexmark implementation for
> >>> Beam.
> >>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
> >>> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
> >>> work done by Mark Shields. See https://github.com/apache/beam/pull/366
> >>> for the original PR.
> >>>
> >>> The PR contains queries that have a wide coverage of the Beam model and
> >>> that represent realistic end-user use cases (some come from client
> >>> experience on Google Cloud Dataflow).
> >>>
> >>> So far, we have upgraded the implementation to the latest Beam
> >>> snapshot.
> >>> And we are able to execute a good subset of the queries in the
> >>> different
> >>> runners. We upgraded the nexmark drivers to do so: direct driver
> >>> (upgraded
> >>> from inProcessDriver) and flink driver and we added a new one for
> >>> spark.
> >>>
> >>> There is still a good amount of work to do and we would like to know if
> >>> you think that this contribution can have its place in Beam
> >>> eventually.
> >>>
> >>> The interests of having Nexmark on Beam that we have seen so far are:
> >>>
> >>> - Rich batch/streaming test
> >>>
> >>> - A-B testing of runners or runtimes (non-regression, performance
> >>> comparison between versions ...)
> >>>
> >>> - Integration testing (sdk/runners, runner/runtime, ...)
> >>>
> >>> - Validate beam capability matrix
> >>>
> >>> - It can be used as part of the ongoing PerfKit work (if there is any
> >>> interest).
> >>>
> >>> As a final note, we are tracking the issues in the same repo. If
> >>> someone
> >>> is interested in contributing, or have more ideas, you are welcome :)
> >>>
> >>> Etienne
> >>>
> >>>
> >
>
>


Re: [DISCUSS] Capability Matrix revamp

2017-08-21 Thread Tyler Akidau
Is there any way we could add quantitative runner metrics to this as well?
Like by having some benchmarks that process X amount of data, and then
detailing in the matrix the latency, throughput, and (where possible) cost
numbers for each of the given runners? Semantic support is one thing,
but there are other differences between runners that aren't captured by
just checking feature boxes. I'd be curious if anyone has other ideas in
this vein as well. The benchmark idea might not be the best way to go about
it.

-Tyler

On Sun, Aug 20, 2017 at 9:43 AM Jesse Anderson 
wrote:

> It'd be awesome to see these updated. I'd add two more:
>
>1. A plain English summary of the runner's support in Beam. People who
>are new to Beam won't understand the in-depth coverage and need a
> general
>idea of how it is supported.
>2. The production readiness of the runner. Does the maintainer think
>this runner is production ready?
>
>
>
> On Sun, Aug 20, 2017 at 8:03 AM Kenneth Knowles 
> wrote:
>
> > Hi all,
> >
> > I want to revamp
> > https://beam.apache.org/documentation/runners/capability-matrix/
> >
> > When Beam first started, we didn't work on feature branches for the core
> > runners, and they had a lot more gaps compared to what goes on `master`
> > today, so this tracked our progress in a way that was easy for users to
> > read. Now it is still our best/only comparison page for users, but I
> think
> > we could improve its usefulness.
> >
> > For the benefit of the thread, let me inline all the capabilities fully
> > here:
> >
> > 
> >
> > "What is being computed?"
> >  - ParDo
> >  - GroupByKey
> >  - Flatten
> >  - Combine
> >  - Composite Transforms
> >  - Side Inputs
> >  - Source API
> >  - Splittable DoFn
> >  - Metrics
> >  - Stateful Processing
> >
> > "Where in event time?"
> >  - Global windows
> >  - Fixed windows
> >  - Sliding windows
> >  - Session windows
> >  - Custom windows
> >  - Custom merging windows
> >  - Timestamp control
> >
> > "When in processing time?"
> >  - Configurable triggering
> >  - Event-time triggers
> >  - Processing-time triggers
> >  - Count triggers
> >  - [Meta]data driven triggers
> >  - Composite triggers
> >  - Allowed lateness
> >  - Timers
> >
> > "How do refinements relate?"
> >  - Discarding
> >  - Accumulating
> >  - Accumulating & Retracting
> >
> > 
> >
> > Here are some issues I'd like to improve:
> >
> >  - Rows that are impossible to not support (ParDo)
> >  - Rows where "support" doesn't really make sense (Composite transforms)
> >  - Rows that are actually the same model feature (non-merging windowfns)
> >  - Rows that represent optimizations (Combine)
> >  - Rows in the wrong place (Timers)
> >  - Rows that have not been designed ([Meta]data driven triggers)
> >  - Rows with names that appear nowhere else (Timestamp control)
> >  - No place to compare non-model differences between runners
> >
> > I'm still pondering how to improve this, but I thought I'd send the
> notion
> > out for discussion. Some imperfect ideas I've had:
> >
> > 1. Lump all the basic stuff (ParDo, GroupByKey, Read, Window) into one
> row
> > 2. Make sections as users see them, like "ParDo" / "side Inputs" not
> > "What?" / "side inputs"
> > 3. Add rows for non-model things, like portability framework support,
> > metrics backends, etc
> > 4. Drop rows that are not informative, like Composite transforms, or not
> > designed
> > 5. Reorganize the windowing section to be just support for merging /
> > non-merging windowing.
> > 6. Switch to a more distinct color scheme than the solid vs faded colors
> > currently used.
> > 7. Find a web design to get short descriptions into the foreground to
> make
> > it easier to grok.
> >
> > These are just a few thoughts, and not necessarily compatible with each
> > other. What do you think?
> >
> > Kenn
> >
> --
> Thanks,
>
> Jesse
>


Re: Hello from a newbie to the data world living in the city by the bay!

2017-08-17 Thread Tyler Akidau
Welcome all! :-)

On Wed, Aug 16, 2017 at 4:17 PM María García Herrero
 wrote:

> Welcome Gris, Umang, and Justin!
>
> On Wed, Aug 16, 2017 at 1:15 AM, Jean-Baptiste Onofré 
> wrote:
>
> > Welcome !
> >
> > Regards
> > JB
> >
> > On Aug 16, 2017, 08:54, at 08:54, "Ismaël Mejía" 
> > wrote:
> > >Hello and welcome Griselda, Umang, Justin
> > >
> > >Apart from the links provided by Ahmet, you might read Beam-related
> > >material on the website (See Documentation > Programming Guide and
> > >Documentation > Additional Resources among others).
> > >
> > >But probably as important as improving your Beam related knowledge is
> > >to understand the principles of an open source project and more
> > >concretely the way the Apache projects work (in case this is your
> > >first Apache project), concepts like How projects are structured
> > >(PMCs, committers, votes, etc) and the most important ones Community
> > >over Code and Meritocracy.
> > >
> > >https://www.apache.org/foundation/how-it-works.html
> > >https://blogs.apache.org/foundation/entry/asf_15_community_over_code
> > >
> > >Welcome all and don't hesitate to ask questions, we are all here to
> > >make this project better so for sure we can help.
> > >Ismaël
> > >
> > >
> > >On Tue, Aug 15, 2017 at 11:04 PM, Justin T  wrote:
> > >> Hello Beam community,
> > >>
> > >> I am also a new member, and I feel a little better knowing that there
> > >> are others in the same boat :)
> > >>
> > >> My name is Justin and I work as a full stack engineer for Neustar, a
> > >> marketing analytics company in San Diego. Over the past few weeks I
> > >have
> > >> been getting more familiar with Beam via documentation, papers,
> > >videos, and
> > >> the old email archives and I am very excited to start making
> > >contributions.
> > >> Thank you Altay for the useful links!
> > >>
> > >> -Justin Tumale
> > >>
> > >> On Tue, Aug 15, 2017 at 11:19 AM, Ahmet Altay
> > >
> > >> wrote:
> > >>
> > >>> Welcome both of you!
> > >>>
> > >>> Some helpful starting points:
> > >>> - Contribution guide [1]
> > >>> - Unassigned starter issues in JIRA [2]
> > >>>
> > >>> Ahmet
> > >>>
> > >>> [1] https://beam.apache.org/contribute/contribution-guide/
> > >>> [2]
> > >>> https://issues.apache.org/jira/browse/BEAM-2632?jql=
> > >>>
> > >project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20Reopened)%20AND%
> > >>>
> > >20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20starter%20AND%
> > >>> 20assignee%20in%20(EMPTY)%20ORDER%20BY%20created%20DESC%
> > >>> 2C%20priority%20DESC
> > >>>
> > >>> On Tue, Aug 15, 2017 at 11:13 AM, Umang Sharma 
> > >>> wrote:
> > >>>
> > >>> > Hi Gris,
> > >>> > Nice to meet you.
> > >>> >
> > >>> > I'd like to take this opportunity to introduce myself to you and
> > >everyone
> > >>> else
> > >>> > in the dev team.
> > >>> >
> > >>> > I'm Umang Sharma. I'm an associate in Data Science and
> > >Applications at
> > >>> > Accenture Digital.
> > >>> >
> > >>> >
> > >>> > I write in Python, Java, and a number of other languages.
> > >>> > I'd love to contribute to Beam. It'd be great if someone guides me
> > >to get
> > >>> > started with contributing :)
> > >>> >
> > >>> > Among the other things I like are polo, golf, giving talks and
> > >talking
> > >>> about
> > >>> > my work.
> > >>> >
> > >>> > Thanks,
> > >>> > Umang
> > >>> >
> > >>> >
> > >>> > On Aug 15, 2017 22:40, "Griselda Cuevas" 
> > >>> wrote:
> > >>> >
> > >>> > Hi Beam community,
> > >>> >
> > >>> > I’m Griselda (Gris) Cuevas and I’m very excited to join the
> > >community,
> > >>> I’m
> > >>> > looking forward to learning awesome things from you and to getting
> > >the
> > >>> > chance to collaborate on great initiatives.
> > >>> >
> > >>> > I’m currently working at Google and I’m studying a masters in
> > >operations
> > >>> > research and data science at UC Berkeley. I’m interested in
> > >Natural
> > >>> > Language Processing, Information Retrieval and Online Communities.
> > >Some
> > >>> > other fun topics I love are juggling, camping and -just getting
> > >into it-
> > >>> >  listening to podcasts, so if you ever want to discuss and talk
> > >about any
> > >>> > of these topics, here I am!
> > >>> >
> > >>> > Another reason why I’m here is because I want to help this project
> > >grow
> > >>> and
> > >>> > thrive. This means that you’ll see me contributing to the project,
> > >>> reaching
> > >>> > out to ask questions as I get familiar with our community, and I
> > >also
> > >>> > helping evangelize Apache Beam by organizing meetups, hangouts,
> > >etc.
> > >>> >
> > >>> > I say bye for now, I’ll see you around,
> > >>> >
> > >>> > Cheers,
> > >>> >
> > >>> > G
> > >>> >
> > >>>
> >
>


Re: [ANNOUNCEMENT] New PMC members, August 2017 edition!

2017-08-14 Thread Tyler Akidau
Congrats!

On Sat, Aug 12, 2017 at 1:51 AM Pei HE  wrote:

> Congratulations!
> --
> Pei
>
> On Sat, Aug 12, 2017 at 3:10 PM, Aljoscha Krettek 
> wrote:
>
> > Congratulations! :-)
> >
> > Best,
> > Aljoscha
> >
> > > On 12. Aug 2017, at 06:39, Robert Bradshaw  >
> > wrote:
> > >
> > > Congratulations!
> > >
> > > On Fri, Aug 11, 2017 at 2:23 PM, Jean-Baptiste Onofré  >
> > wrote:
> > >> Congrats !
> > >>
> > >> Regards
> > >> JB
> > >>
> > >>
> > >> On 08/11/2017 07:40 PM, Davor Bonaci wrote:
> > >>>
> > >>> Please join me and the rest of Beam PMC in welcoming the following
> > >>> committers as our newest PMC members. They have significantly
> > contributed
> > >>> to the project in different ways, and we look forward to many more
> > >>> contributions in the future.
> > >>>
> > >>> * Ahmet Altay
> > >>> Beyond significant work to drive the Python SDK to the master branch,
> > >>> Ahmet
> > >>> has worked project-wide, driving releases, improving processes and
> > >>> testing,
> > >>> and growing the community.
> > >>>
> > >>> * Aviem Zur
> > >>> Beyond significant work in the Spark runner, Aviem has worked to
> > improve
> > >>> how the project operates, leading discussions on inclusiveness and
> > >>> openness.
> > >>>
> > >>> Congratulations to both! Welcome!
> > >>>
> > >>> Davor
> > >>>
> > >>
> > >> --
> > >> Jean-Baptiste Onofré
> > >> jbono...@apache.org
> > >> http://blog.nanthrax.net
> > >> Talend - http://www.talend.com
> >
> >
>


Re: [ANNOUNCEMENT] New committers, August 2017 edition!

2017-08-14 Thread Tyler Akidau
Congrats and thanks all around!

On Sat, Aug 12, 2017 at 12:09 AM Aljoscha Krettek 
wrote:

> Congrats, everyone! It's well deserved.
>
> Best,
> Aljoscha
>
> > On 12. Aug 2017, at 08:06, Pei HE  wrote:
> >
> > Congratulations to all!
> > --
> > Pei
> >
> > On Sat, Aug 12, 2017 at 10:50 AM, James  wrote:
> >
> >> Thank you guys, glad to contribute to this great project, congratulate
> to
> >> all the new committers!
> >>
> >> On Sat, Aug 12, 2017 at 8:36 AM Manu Zhang 
> >> wrote:
> >>
> >>> Thanks everyone !!! It's a great journey.
> >>> Congrats to other new committers !
> >>>
> >>> Thanks,
> >>> Manu
> >>>
> >>> On Sat, Aug 12, 2017 at 5:23 AM Jean-Baptiste Onofré 
> >>> wrote:
> >>>
>  Congrats and welcome !
> 
>  Regards
>  JB
> 
>  On 08/11/2017 07:40 PM, Davor Bonaci wrote:
> > Please join me and the rest of Beam PMC in welcoming the following
> > contributors as our newest committers. They have significantly
>  contributed
> > to the project in different ways, and we look forward to many more
> > contributions in the future.
> >
> > * Reuven Lax
> > Reuven has been with the project since the very beginning,
> >> contributing
> > mostly to the core SDK and the GCP IO connectors. He accumulated 52
>  commits
> > (19,824 ++ / 12,039 --). Most recently, Reuven re-wrote several IO
> > connectors that significantly expanded their functionality.
> >>> Additionally,
> > Reuven authored important new design documents relating to update and
> > snapshot functionality.
> >
> > * Jingsong Lee
> > Jingsong has been contributing to Apache Beam since the beginning of
> >>> the
> > year, particularly to the Flink runner. He has accumulated 34 commits
> > (11,214 ++ / 6,314 --) of deep, fundamental changes that
> >> significantly
> > improved the quality of the runner. Additionally, Jingsong has
>  contributed
> > to the project in other ways too -- reviewing contributions, and
> > participating in discussions on the mailing list, design documents,
> >> and
> > JIRA issue tracker.
> >
> > * Mingmin Xu
> > Mingmin started the SQL DSL effort, and has driven it to the point of
> > merging to the master branch. In this effort, he extended the project
> >>> to
> > the significant new user community.
> >
> > * Mingming (James) Xu
> > James joined the SQL DSL effort, contributing some of the trickier
> >>> parts,
> > such as the Join functionality. Additionally, he's consistently shown
> > himself to be an insightful code reviewer, significantly impacting
> >> the
> > project’s code quality and ensuring the success of the new major
>  component.
> >
> > * Manu Zhang
> > Manu initiated and developed a runner for the Apache Gearpump
>  (incubating)
> > engine, and has driven it to the point of merging to the master
> >> branch.
>  In
> > this effort, he accumulated 65 commits (7,812 ++ / 4,882 --) and
> >>> extended
> > the project to the new user community.
> >
> > Congratulations to all five! Welcome!
> >
> > Davor
> >
> 
>  --
>  Jean-Baptiste Onofré
>  jbono...@apache.org
>  http://blog.nanthrax.net
>  Talend - http://www.talend.com
> 
> >>>
> >>
>
>


Re: [PROPOSAL] "Requires deterministic input"

2017-08-10 Thread Tyler Akidau
+1 to the annotation idea, and to having it on processTimer.
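
For concreteness, a minimal sketch of the kind of side-effecting DoFn this
would serve, assuming the RequiresStableInput spelling floated below (Event,
the sink, and the id helper are hypothetical):

class WriteToExternalSystemFn extends DoFn<Event, Void> {
  @RequiresStableInput  // strawman spelling, per the naming discussion below
  @ProcessElement
  public void process(ProcessContext ctx) {
    // The runner guarantees the observed element is stable: every retry
    // sees exactly the same value, so the external write can be made
    // idempotent by keying it on the element itself.
    externalSink.write(idFor(ctx.element()), ctx.element());
  }
}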

-Tyler

On Thu, Aug 10, 2017 at 2:15 AM Aljoscha Krettek 
wrote:

> +1 to the annotation approach. I outlined how implementing this would work
> in the Flink runner in the Thread about the exactly-once Kafka Sink.
>
> > On 9. Aug 2017, at 23:03, Reuven Lax  wrote:
> >
> > Yes - I don't think we should try and make any deterministic guarantees
> > about what is in a bundle. Stability guarantees are per element only.
> >
> > On Wed, Aug 9, 2017 at 1:30 PM, Thomas Groh 
> > wrote:
> >
> >> +1 to the annotation-on-ProcessElement approach. ProcessElement is the
> >> minimum implementation requirement of a DoFn, and should be where the
> >> processing logic which depends on characteristics of the inputs lies.
> It's a
> >> good way of signalling the requirements of the Fn, and letting the
> runner
> >> decide.
> >>
> >> I have a minor concern that this may not work as expected for users that
> >> try to batch remote calls in `FinishBundle` - we should make sure we
> >> document that it is explicitly the input elements that will be replayed,
> >> and bundles and other operational details are still arbitrary.
> >>
> >>
> >>
> >> On Wed, Aug 9, 2017 at 10:37 AM, Reuven Lax 
> >> wrote:
> >>
> >>> I think deterministic here means deterministically replayable, i.e., no
> >>> matter how many times the element is retried, it will always be the
> same.
> >>>
> >>> I think we should also allow specifying this on processTimer. This
> would
> >>> mean that any keyed state written in a previous processElement must be
> >>> guaranteed durable before processTimer is called.
> >>>
> >>>
> >>> On Wed, Aug 9, 2017 at 10:10 AM, Ben Chambers 
> >>> wrote:
> >>>
>  I strongly agree with this proposal. I think moving away from "just
> >>> insert
>  a GroupByKey for one of the 3 different reasons you may want it"
> >> towards
>  APIs that allow code to express the requirements they have and the
> >> runner
>  to choose the best way to meet this is a major step forward in terms
> >> of
>  portability.
> 
>  I think "deterministic" may be misleading. The actual contents of the
>  collection aren't deterministic if upstream computations aren't. The
>  property we really need is that once an input may have been observed
> by
> >>> the
>  side-effecting code it will never be observed with a different value.
> 
>  I would propose something like RequiresStableInput, to indicate that the
> >> input
>  must be stable as observed by the function. I could also see something
>  hinting at the fact we don't recompute the input, such as
>  RequiresMemoizedInput or RequiresNoRecomputation.
> 
>  -- Ben
> 
>  P.S. For anyone interested, other uses of GroupByKey that we may want to
>  discuss APIs for would be preventing retry across steps (e.g., preventing
>  fusion) and redistributing inputs across workers.
> 
>  On Wed, Aug 9, 2017 at 9:53 AM Kenneth Knowles  >>>
>  wrote:
> 
> > This came up again, so I wanted to push it along by proposing a
> >>> specific
> > API for Java that could have a derived API in Python. I am writing
> >> this
> > quickly to get something out there, so I welcome suggestions for
>  revision.
> >
> > Today a DoFn has a @ProcessElement annotated method with various
>  automated
> > parameters, but most fundamentally this:
> >
> > @ProcessElement
> > public void process(ProcessContext ctx) {
> >  ctx.element() // to access the current input element
> >  ctx.output(something) // to write to default output collection
> >  ctx.output(tag, something) // to write to other output collections
> > }
> >
> > For some time, we have hoped to unpack the context - it is a
> > backwards-compatibility pattern made obsolete by the new DoFn design.
> >>> So
> > instead it would look like this:
> >
> > @ProcessElement
> > public void process(Element element, MainOutput mainOutput, ...) {
> >  element.get() // to access the current input element
> >  mainOutput.output(something) // to write to the default output
>  collection
> >  other.output(something) // to write to other output collection
> > }
> >
> > I've deliberately left out the undecided syntax for side outputs. But
> >>> it
> > would be nice for the tag to be built into the parameter so it
> >> doesn't
> > have to be used when calling output().
> >
> > One way to enhance this for deterministic input would just be this:
> >
> > @ProcessElement
> > @RequiresDeterministicInput
> > public void process(Element element, MainOutput mainOutput, ...) {
> >  element.get() // to access the current input element
> >  mainOutput.output(something) // to write to the default output collection

Re: Towards a spec for robust streaming SQL, Part 2

2017-07-25 Thread Tyler Akidau
+d...@apex.apache.org, since I'm told Apex has a Calcite integration as
well. If anyone on the Apex side wants to join in on the fun, your input
would be welcomed!

-Tyler


On Mon, Jul 24, 2017 at 4:34 PM Tyler Akidau <taki...@apache.org> wrote:

> Hello Flink, Calcite, and Beam dev lists!
>
> Linked below is the second document I promised way back in April regarding
> a collaborative spec for streaming SQL in Beam/Calcite/Flink (& apologies
> for the delay; I thought I was nearly done a while back and then temporal
> joins expanded to something much larger than expected).
>
> To repeat what it says in the doc, my hope is that it can serve various
> purposes over its lifetime:
>
>- A discussion ground for ironing out any remaining features necessary
>for supporting robust streaming semantics in Calcite SQL.
>
>- A rough, high-level source of truth for tracking efforts underway in
>support of this, currently spanning the Calcite, Flink, and Beam projects.
>
>- A written specification of the changes that were made, for the sake
>of understanding the delta after the fact.
>
> The first and third points are, IMO, the most important. AFAIK, there are
> still a few missing features that need to be defined (e.g., a triggers
> equivalent via EMIT, robust temporal join support). I'm also proposing a
> clear distinction between streams and tables, which I think is important, but
> which I believe is not the approach most folks have been taking in this
> area. Sorting out these open issues and then having a concise record of the
> solutions adopted will be important for providing a solid streaming
> experience and teaching folks how to use it.
>
> At any rate, I would much appreciate it if anyone with an interest in this
> stuff could please take a look and add comments/suggestions/references to
> related work in flight/etc as appropriate. For now please use
> comments/suggestions, but if you really want to dive in with edit access,
> let me know.
>
> The doc: http://s.apache.org/streaming-sql-spec
>
> -Tyler
>
>
>


Towards a spec for robust streaming SQL, Part 2

2017-07-24 Thread Tyler Akidau
Hello Flink, Calcite, and Beam dev lists!

Linked below is the second document I promised way back in April regarding
a collaborative spec for streaming SQL in Beam/Calcite/Flink (& apologies
for the delay; I thought I was nearly done a while back and then temporal
joins expanded to something much larger than expected).

To repeat what it says in the doc, my hope is that it can serve various
purposes over its lifetime:

   - A discussion ground for ironing out any remaining features necessary
   for supporting robust streaming semantics in Calcite SQL.

   - A rough, high-level source of truth for tracking efforts underway in
   support of this, currently spanning the Calcite, Flink, and Beam projects.

   - A written specification of the changes that were made, for the sake of
   understanding the delta after the fact.

The first and third points are, IMO, the most important. AFAIK, there are
still a few missing features that need to be defined (e.g., a triggers
equivalent via EMIT, robust temporal join support). I'm also proposing a
clear distinction between streams and tables, which I think is important, but
which I believe is not the approach most folks have been taking in this
area. Sorting out these open issues and then having a concise record of the
solutions adopted will be important for providing a solid streaming
experience and teaching folks how to use it.
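
As a small taste of the sort of gap in question: a windowed aggregation with
an explicit trigger analogue might read something like the strawman below.
The EMIT syntax is purely illustrative; pinning down the real grammar is
exactly one of the open items.

  SELECT team, SUM(score) AS total
  FROM Scores
  GROUP BY team, TUMBLE(rowtime, INTERVAL '1' HOUR)
  EMIT AFTER WATERMARK;  -- strawman: emit once the watermark passes the window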

At any rate, I would much appreciate it if anyone with an interest in this
stuff could please take a look and add comments/suggestions/references to
related work in flight/etc as appropriate. For now please use
comments/suggestions, but if you really want to dive in with edit access,
let me know.

The doc: http://s.apache.org/streaming-sql-spec

-Tyler


Re: BeamSQL status and merge to master

2017-07-21 Thread Tyler Akidau
There are still open items in the merge to master doc. We're close to being
ready, but let's please address those first.

-Tyler


On Thu, Jul 20, 2017 at 9:34 PM Mingmin Xu <mingm...@gmail.com> wrote:

> Quick update:
>
> The final PR[1] is open to review now. Please leave your comment or create
> a sub-task in [2] for any question.
>
> [1]. https://github.com/apache/beam/pull/3606
> [2]. https://issues.apache.org/jira/browse/BEAM-2651
>
>
> On Wed, Jul 5, 2017 at 3:34 PM, Jesse Anderson <je...@bigdatainstitute.io>
> wrote:
>
> > So excited to start using this!
> >
> > On Wed, Jul 5, 2017, 3:34 PM Mingmin Xu <mingm...@gmail.com> wrote:
> >
> > > Thanks for everybody's effort; we're very close to finishing the existing
> > > tasks. Here's a status update on the SQL DSL; feel free to have a try and
> > > share any comments:
> > >
> > > *1. what's done*
> > >   DSL feature is done, with basic
> filter/project/aggregation/union/join,
> > > built-in functions/UDF/UDAF(pending on #3491)
> > >
> > > *2. what's on-going*
> > >   more unit tests, and documentation in the README and on the Beam website.
> > >
> > > *3. open questions*
> > >   BEAM-2441 <https://issues.apache.org/jira/browse/BEAM-2441> wants
> > > suggestions on the proper module name for the SQL work. As mentioned in the
> > > task, '*dsl/sql* is for the Java SDK and also prevents alternative language
> > > implementations; however, there's another SQL client that would not be a
> > > good fit as a Java SDK extension'.
> > >
> > > ---
> > > *How to run the example* beam/dsls/sql/example/BeamSqlExample.java
> > > <
> > > https://github.com/apache/beam/blob/DSL_SQL/dsls/sql/src/mai
> > n/java/org/apache/beam/dsls/sql/example/BeamSqlExample.java
> > > >
> > > 1. run 'mvn install' to avoid the error in #3439
> > > <https://github.com/apache/beam/pull/3439>
> > > 2. run 'mvn -pl dsls/sql compile exec:java
> > > -Dexec.mainClass=org.apache.beam.dsls.sql.example.BeamSqlExample
> > > -Dexec.args="--runner=DirectRunner" -Pdirect-runner'
> > >
> > > FYI:
> > > 1. burn-down list in google doc
> > >
> > > https://docs.google.com/document/d/1EHZgSu4Jd75iplYpYT_K_JwS
> > ZxL2DWG8kv_EmQzNXFc/edit?usp=sharing
> > > 2. JIRA tasks with label 'dsl_sql_merge'
> > >
> > > https://issues.apache.org/jira/browse/BEAM-2555?jql=labels%2
> > 0%3D%20dsl_sql_merge
> > >
> > >
> > > Mingmin
> > >
> > > On Tue, Jun 13, 2017 at 8:51 AM, Lukasz Cwik <lc...@google.com.invalid
> >
> > > wrote:
> > >
> > > > Nevermind, I merged it into #2 about usability.
> > > >
> > > > On Tue, Jun 13, 2017 at 8:50 AM, Lukasz Cwik <lc...@google.com>
> wrote:
> > > >
> > > > > I added a section about maven module structure/packaging (#6).
> > > > >
> > > > > On Tue, Jun 13, 2017 at 8:30 AM, Tyler Akidau
> > > <taki...@google.com.invalid
> > > > >
> > > > > wrote:
> > > > >
> > > > >> Thanks Mingmin. I've copied your list into a doc[1] to make it
> > easier
> > > to
> > > > >> collaborate on comments and edits.
> > > > >>
> > > > >> [1] https://s.apache.org/beam-dsl-sql-burndown
> > > > >>
> > > > >> -Tyler
> > > > >>
> > > > >>
> > > > >> On Mon, Jun 12, 2017 at 10:09 PM Jean-Baptiste Onofré <
> > > j...@nanthrax.net>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Mingmin
> > > > >> >
> > > > >> > Sorry, the meeting was in the middle of the night for me and I
> > > wasn't
> > > > >> able
> > > > >> > to
> > > > >> > make it.
> > > > >> >
> > > > >> > The timing and checklist look good to me.
> > > > >> >
> > > > >> > We plan to do a Beam release end of June, so, merging in July
> > means
> > > we
> > > > >> can
> > > > >> > include it in the next release.
> > > > >> >
> > > > >> > Thanks !
> > > > >> > Regards
> > > > >> > JB
> > > > >

Re: BeamSQL status and merge to master

2017-06-13 Thread Tyler Akidau
Thanks Mingmin. I've copied your list into a doc[1] to make it easier to
collaborate on comments and edits.

[1] https://s.apache.org/beam-dsl-sql-burndown

-Tyler


On Mon, Jun 12, 2017 at 10:09 PM Jean-Baptiste Onofré 
wrote:

> Hi Mingmin
>
> Sorry, the meeting was in the middle of the night for me and I wasn't able
> to
> make it.
>
> The timing and checklist look good to me.
>
> We plan to do a Beam release end of June, so, merging in July means we can
> include it in the next release.
>
> Thanks !
> Regards
> JB
>
> On 06/13/2017 03:06 AM, Mingmin Xu wrote:
> > Hi all,
> >
> > Thanks for joining the meeting. As discussed, we're planning to merge the
> > DSL_SQL branch back to master, targeted for the middle of July. A tag,
> > 'dsl_sql_merge'[1], has been created to track all todo tasks.
> >
> > *What's added in Beam SQL?*
> > BeamSQL provides the capability to execute SQL queries with the Beam Java
> > SDK, either by translating SQL to a PTransform or by running queries with a
> > standalone CLI client.
> >
> > *Checklist for merge:*
> > 1. functionality
> >1.1. SQL grammar:
> >  1.1.1. basic query with SELECT/FILTER/PROJECT;
> >  1.1.2. AGGREGATION with global window;
> >  1.1.3. AGGREGATION with FIX_TIME/SLIDING_TIME/SESSION window (see the
> > sketch below);
> >  1.1.4. JOIN
> >1.2. UDF/UDAF support;
> >1.3. support predefined String/Math/Date functions, see[2];
> >
> > 2. DSL interface to convert SQL to a PTransform;
> >
> > 3. JUnit tests;
> >
> > 4. Javadoc;
> >
> > 5. Documentation of the SQL feature on the website;
> >
> > Any comments/suggestions are very welcome.
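> >
> > For instance, item 1.1.3 together with the DSL interface (item 2) could be
> > exercised roughly as below. This is a sketch only; the entry point and row
> > type follow the current branch and may still change before the merge:
> >
> > // input rows have fields: team (VARCHAR), score (INT), rowtime (TIMESTAMP)
> > PCollection<BeamSqlRow> scores = ...;
> > PCollection<BeamSqlRow> totals = scores.apply(
> >     BeamSql.simpleQuery(
> >         "SELECT team, COUNT(*) AS cnt FROM PCOLLECTION "
> >             + "GROUP BY team, TUMBLE(rowtime, INTERVAL '1' HOUR)"));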
> >
> > Note:
> > [1].
> >
> https://issues.apache.org/jira/browse/BEAM-2436?jql=labels%20%3D%20dsl_sql_merge
> >
> > [2]. https://calcite.apache.org/docs/reference.html
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Apache Beam SQL: DDL

2017-06-08 Thread Tyler Akidau
Thanks James, added some initial comments.

-Tyler

On Thu, Jun 8, 2017 at 3:11 AM James  wrote:

> Hi team, I am working on BeamSQL now and would like to add DDL for
> BeamSQL, so I prepared a design doc[1]; I would like to hear your opinions :)
>
> [1]
>
> https://docs.google.com/document/d/162_cuYlZ5pC_8PzGWX844tlLOsSvQmoSGjJAgol4ipE/edit?usp=sharing
>


Re: First stable release completed!

2017-05-18 Thread Tyler Akidau
Well done, everyone. Thanks to all who helped make this happen. :-)

-Tyler

On Thu, May 18, 2017 at 6:52 PM Etienne Chauchot 
wrote:

> This is awesome! Congrats to everyone!
>
> Etienne
>
>
> Le 17/05/2017 à 18:34, Thomas Groh a écrit :
> > Well done all!
> >
> > On Wed, May 17, 2017 at 9:31 AM, Sourabh Bajaj <
> > sourabhba...@google.com.invalid> wrote:
> >
> >> Congrats !!
> >> On Wed, May 17, 2017 at 9:02 AM Mingmin Xu  wrote:
> >>
> >>> Congratulations to everyone!
> >>>
> >>> On Wed, May 17, 2017 at 8:36 AM, Dan Halperin
> >>  >>> wrote:
> >>>
>  Great job, folks. What an amazing amount of work, and I'd like to
>  especially thank the community for participating in hackathons and
>  extensive release validation over the last few weeks! We caught some
>  crucial issues in time and really pushed a much better release as a
> >>> result.
>  Thanks everyone!
>  Dan
> 
>  On Wed, May 17, 2017 at 11:31 AM, Jesse Anderson <
>  je...@bigdatainstitute.io>
>  wrote:
> 
> > Awesome!
> >
> > On Wed, May 17, 2017, 8:30 AM Ahmet Altay 
> > wrote:
> >
> >> Congratulations everyone, this is great!
> >>
> >> On Wed, May 17, 2017 at 7:26 AM, Kenneth Knowles
>   >> wrote:
> >>
> >>> Awesome. A huge step.
> >>>
> >>> On Wed, May 17, 2017 at 6:30 AM, Andrew Psaltis <
> >> psaltis.and...@gmail.com>
> >>> wrote:
> >>>
>  This is fantastic.  Great job!
>  On Wed, May 17, 2017 at 08:20 Jean-Baptiste Onofré <
>  j...@nanthrax.net>
>  wrote:
> 
> > Huge congrats to everyone who helped reaching this important
> >> milestone
> >>> !
> > Honestly, we are a great team, WE ROCK ! ;)
> >
> > Regards
> > JB
> >
> > On 05/17/2017 01:28 PM, Davor Bonaci wrote:
> >> The first stable release is now complete!
> >>
> >> Release artifacts are available through various
> >> repositories,
> >>> including
> >> dist.apache.org, Maven Central, and PyPI. The website is
> > updated,
> >>> and
> >> announcements are published.
> >>
> >> Apache Software Foundation press release:
> >>
> > http://globenewswire.com/news-release/2017/05/17/986839/0/
>  en/The-Apache-Software-Foundation-Announces-Apache-
>  Beam-v2-0-0.html
> >> Beam blog:
> >> https://beam.apache.org/blog/2017/05/17/beam-first-stable-
> >>> release.html
> >> Congratulations to everyone -- this is a really big
> >> milestone
>  for
> >> the
> >> project, and I'm proud to be a part of this great
> >> community.
> >> Davor
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>  --
>  Thanks,
>  Andrew
> 
>  Subscribe to my book: Streaming Data <
> >> http://manning.com/psaltis
>  
>  twiiter: @itmdata  > user?screen_name=itmdata>
> > --
> > Thanks,
> >
> > Jesse
> >
> >>>
> >>>
> >>> --
> >>> 
> >>> Mingmin
> >>>
>
>


Re: Towards a spec for robust streaming SQL, Part 1

2017-05-12 Thread Tyler Akidau
Being able to support an EMIT config independent of the query itself sounds
great for compatible use cases (which should be many :-).

Shaoxuan, can you please refresh my memory on what a dynamic table means in
Flink? It's basically just a state table, right? The "dynamic" part of the
name is simply to imply that it can evolve and change over time?

-Tyler


On Fri, May 12, 2017 at 1:59 AM Shaoxuan Wang <wshaox...@gmail.com> wrote:

> Thanks to Tyler and Fabian for sharing your thoughts.
>
> Regarding the early/late update control in Flink: IMO, each dynamic
> table can have an EMIT config. For the Flink Table API, this can be easily
> implemented in different manners, case by case. For instance, in a window
> aggregate, we could define "when to EMIT a result" via a windowConf per
> window when we create windows. Unfortunately we do not have such
> flexibility (as compared with the Table API) in a SQL query, so we need to
> find a way to embed this EMIT config.
>
> Regards,
> Shaoxuan
>
>
> On Fri, May 12, 2017 at 4:28 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
> > 2017-05-11 7:14 GMT+02:00 Tyler Akidau <taki...@google.com.invalid>:
> >
> > > On Tue, May 9, 2017 at 3:06 PM Fabian Hueske <fhue...@gmail.com>
> wrote:
> > >
> > > > Hi Tyler,
> > > >
> > > > thank you very much for this excellent write-up and the super nice
> > > > visualizations!
> > > > You are discussing a lot of the things that we have been thinking
> about
> > > as
> > > > well from a different perspective.
> > > > IMO, yours and the Flink model are pretty much aligned although we
> use
> > a
> > > > different terminology (which is not yet completely established). So
> > there
> > > > should be room for unification ;-)
> > > >
> > >
> > > Good to hear, thanks for giving it a look. :-)
> > >
> > >
> > > > Allow me a few words on the current state in Flink. In the upcoming
> > 1.3.0
> > > > release, we will have support for group window (TUMBLE, HOP,
> SESSION),
> > > OVER
> > > > RANGE/ROW window (without FOLLOWING), and non-windowed GROUP BY
> > > > aggregations. The group windows are triggered by watermark and the
> over
> > > > window and non-windowed aggregations emit for each input record
> > > > (AtCount(1)). The window aggregations do not support early or late
> > firing
> > > > (late records are dropped), so no updates here. However, the
> > non-windowed
> > > > aggregates produce updates (in acc and acc/retract mode). Based on
> this
> > > we
> > > > will work on better control for late updates and early firing as well
> > as
> > > > joins in the next release.
> > > >
> > >
> > > Impressive. At this rate there's a good chance we'll just be doing
> > catchup
> > > and thanking you for building everything. ;-) Do you have ideas for
> what
> > > you want your early/late updates control to look like? That's one of
> the
> > > areas I'd like to see better defined for us. And how deep are you going
> > > with joins?
> > >
> >
> > Right now (well actually I merged the change 1h ago) we are using a
> > QueryConfig object to specify state retention intervals to be able to
> clean
> > up state for inactive keys.
> > Running a query looks like this:
> >
> > // ---
> > val env = StreamExecutionEnvironment.getExecutionEnvironment
> > val tEnv = TableEnvironment.getTableEnvironment(env)
> >
> > val qConf = tEnv.queryConfig.withIdleStateRetentionTime(Time.hours(12))
> > // state of inactive keys is kept for 12 hours
> >
> > val t: Table = tEnv.sql("SELECT a, COUNT(*) AS cnt FROM MyTable GROUP BY a")
> > val stream: DataStream[(Boolean, Row)] = t.toRetractStream[Row](qConf)
> > // boolean flag for acc/retract
> >
> > env.execute()
> > // ---
> >
> > We plan to use the QueryConfig also to specify early/late updates. Our
> main
> > motivation is to have a uniform and standard SQL for batch and streaming.
> > Hence, we have to move the configuration out of the query. But I agree
> > that it would be very nice to be able to include it in the query. I think
> > it should not be difficult to optionally support an EMIT clause as well.
> >
> >
> > >
> > > > Reading the document, I did not find any major difference in our
> > > concepts.
> > > > In fact, we are aim

Re: Towards a spec for robust streaming SQL, Part 1

2017-05-10 Thread Tyler Akidau
On Tue, May 9, 2017 at 3:06 PM Fabian Hueske <fhue...@gmail.com> wrote:

> Hi Tyler,
>
> thank you very much for this excellent write-up and the super nice
> visualizations!
> You are discussing a lot of the things that we have been thinking about as
> well from a different perspective.
> IMO, yours and the Flink model are pretty much aligned although we use a
> different terminology (which is not yet completely established). So there
> should be room for unification ;-)
>

Good to hear, thanks for giving it a look. :-)


> Allow me a few words on the current state in Flink. In the upcoming 1.3.0
> release, we will have support for group window (TUMBLE, HOP, SESSION), OVER
> RANGE/ROW window (without FOLLOWING), and non-windowed GROUP BY
> aggregations. The group windows are triggered by watermark and the over
> window and non-windowed aggregations emit for each input record
> (AtCount(1)). The window aggregations do not support early or late firing
> (late records are dropped), so no updates here. However, the non-windowed
> aggregates produce updates (in acc and acc/retract mode). Based on this we
> will work on better control for late updates and early firing as well as
> joins in the next release.
>

Impressive. At this rate there's a good chance we'll just be doing catchup
and thanking you for building everything. ;-) Do you have ideas for what
you want your early/late updates control to look like? That's one of the
areas I'd like to see better defined for us. And how deep are you going
with joins?


> Reading the document, I did not find any major difference in our concepts.
> In fact, we are aiming to support the cases you are describing as well.
> I have a question though. Would you classify an OVER aggregation as a
> stream -> stream or stream -> table operation? It collects records to
> aggregate them, but emits one record for each input row. Depending on the
> window definition (i.e., with FOLLOWING CURRENT ROW), you can compute and
> emit the result record when the input record is received.
>

I would call it a composite stream → stream operation (since SQL, like the
standard Beam/Flink APIs, is a higher level set of constructs than raw
streams/tables operations) consisting of a stream → table windowed grouping
followed by a table → stream triggering on every element, basically as you
described in the previous paragraph.
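
To make that concrete, the kind of OVER aggregation in question looks
something like this sketch, emitting one updated aggregate per input row as
it arrives:

  SELECT user_id, amount,
    SUM(amount) OVER (
      PARTITION BY user_id
      ORDER BY rowtime
      RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW) AS last_hour
  FROM Payments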

-Tyler


>
> I'm looking forward to the second part.
>
> Cheers, Fabian
>
>
>
> 2017-05-09 0:34 GMT+02:00 Tyler Akidau <taki...@google.com.invalid>:
>
> > Any thoughts here Fabian? I'm planning to start sending out some more
> > emails towards the end of the week.
> >
> > -Tyler
> >
> >
> > On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <taki...@google.com> wrote:
> >
> > > No worries, thanks for the heads up. Good luck wrapping all that stuff
> > up.
> > >
> > > -Tyler
> > >
> > > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <fhue...@gmail.com>
> > wrote:
> > >
> > >> Hi Tyler,
> > >>
> > >> thanks for pushing this effort and including the Flink list.
> > >> I haven't managed to read the doc yet, but just wanted to thank you
> for
> > >> the
> > >> write-up and let you know that I'm very interested in this discussion.
> > >>
> > >> We are very close to the feature freeze of Flink 1.3 and I'm quite
> busy
> > >> getting as many contributions merged before the release is forked off.
> > >> When that happened, I'll have more time to read and comment.
> > >>
> > >> Thanks,
> > >> Fabian
> > >>
> > >>
> > >> 2017-04-22 0:16 GMT+02:00 Tyler Akidau <taki...@google.com.invalid>:
> > >>
> > >> > Good point, when you start talking about anything less than a full
> > join,
> > >> > triggers get involved to describe how one actually achieves the
> > desired
> > >> > semantics, and they may end up being tied to just one of the inputs
> > >> (e.g.,
> > >> > you may only care about the watermark for one side of the join). Am
> > >> > expecting us to address these sorts of details more precisely in doc
> > #2.
> > >> >
> > >> > -Tyler
> > >> >
> > >> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles
> > <k...@google.com.invalid
> > >> >
> > >> > wrote:
> > >> >
> > >> > > There's something to be said about having different triggering
> > >> depending
> > >> > on

Re: Towards a spec for robust streaming SQL, Part 1

2017-04-26 Thread Tyler Akidau
No worries, thanks for the heads up. Good luck wrapping all that stuff up.

-Tyler

On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <fhue...@gmail.com> wrote:

> Hi Tyler,
>
> thanks for pushing this effort and including the Flink list.
> I haven't managed to read the doc yet, but just wanted to thank you for the
> write-up and let you know that I'm very interested in this discussion.
>
> We are very close to the feature freeze of Flink 1.3 and I'm quite busy
> getting as many contributions merged before the release is forked off.
> When that happened, I'll have more time to read and comment.
>
> Thanks,
> Fabian
>
>
> 2017-04-22 0:16 GMT+02:00 Tyler Akidau <taki...@google.com.invalid>:
>
> > Good point, when you start talking about anything less than a full join,
> > triggers get involved to describe how one actually achieves the desired
> > semantics, and they may end up being tied to just one of the inputs
> (e.g.,
> > you may only care about the watermark for one side of the join). Am
> > expecting us to address these sorts of details more precisely in doc #2.
> >
> > -Tyler
> >
> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles <k...@google.com.invalid>
> > wrote:
> >
> > > There's something to be said about having different triggering
> depending
> > on
> > > which side of a join data comes from, perhaps?
> > >
> > > (delightful doc, as usual)
> > >
> > > Kenn
> > >
> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
> <taki...@google.com.invalid
> > >
> > > wrote:
> > >
> > > > Thanks for reading, Luke. The simple answer is that CoGBK is
> basically
> > > > flatten + GBK. Flatten is a non-grouping operation that merges the
> > input
> > > > streams into a single output stream. GBK then groups the data within
> > that
> > > > single union stream as you might otherwise expect, yielding a single
> > > table.
> > > > So I think it doesn't really impact things much. Grouping,
> aggregation,
> > > > window merging etc all just act upon the merged stream and generate
> > what
> > > is
> > > > effectively a merged table.
> > > >
> > > > -Tyler
> > > >
> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
> <lc...@google.com.invalid
> > >
> > > > wrote:
> > > >
> > > > > The doc is a good read.
> > > > >
> > > > > I think you do a great job of explaining table -> stream, stream ->
> > > > stream,
> > > > > and stream -> table when there is only one stream.
> > > > > But when there are multiple streams reading/writing to a table, how
> > > does
> > > > > that impact what occurs?
> > > > > For example, with CoGBK you have multiple streams writing to a
> table,
> > > how
> > > > > does that impact window merging?
> > > > >
> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> > > <taki...@google.com.invalid
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Hello Beam, Calcite, and Flink dev lists!
> > > > > >
> > > > > > Apologies for the big cross post, but I thought this might be
> > > something
> > > > > all
> > > > > > three communities would find relevant.
> > > > > >
> > > > > > Beam is finally making progress on a SQL DSL utilizing Calcite,
> > > thanks
> > > > to
> > > > > > Mingmin Xu. As you can imagine, we need to come to some
> conclusion
> > > > about
> > > > > > how to elegantly support the full suite of streaming
> functionality
> > in
> > > > the
> > > > > > Beam model in via Calcite SQL. You folks in the Flink community
> > have
> > > > been
> > > > > > pushing on this (e.g., adding windowing constructs, amongst
> others,
> > > > thank
> > > > > > you! :-), but from my understanding we still don't have a full
> spec
> > > for
> > > > > how
> > > > > > to support robust streaming in SQL (including but not limited to,
> > > > e.g., a
> > > > > > triggers analogue such as EMIT).
> > > > > >
> > > > > > I've been spending a lot of time thinking about this and have
> some

Re: Towards a spec for robust streaming SQL, Part 1

2017-04-21 Thread Tyler Akidau
Good point, when you start talking about anything less than a full join,
triggers get involved to describe how one actually achieves the desired
semantics, and they may end up being tied to just one of the inputs (e.g.,
you may only care about the watermark for one side of the join). Am
expecting us to address these sorts of details more precisely in doc #2.

-Tyler

On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles <k...@google.com.invalid>
wrote:

> There's something to be said about having different triggering depending on
> which side of a join data comes from, perhaps?
>
> (delightful doc, as usual)
>
> Kenn
>
> On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau <taki...@google.com.invalid>
> wrote:
>
> > Thanks for reading, Luke. The simple answer is that CoGBK is basically
> > flatten + GBK. Flatten is a non-grouping operation that merges the input
> > streams into a single output stream. GBK then groups the data within that
> > single union stream as you might otherwise expect, yielding a single
> table.
> > So I think it doesn't really impact things much. Grouping, aggregation,
> > window merging etc all just act upon the merged stream and generate what
> is
> > effectively a merged table.
> >
> > -Tyler
> >
> > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik <lc...@google.com.invalid>
> > wrote:
> >
> > > The doc is a good read.
> > >
> > > I think you do a great job of explaining table -> stream, stream ->
> > stream,
> > > and stream -> table when there is only one stream.
> > > But when there are multiple streams reading/writing to a table, how
> does
> > > that impact what occurs?
> > > For example, with CoGBK you have multiple streams writing to a table,
> how
> > > does that impact window merging?
> > >
> > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> <taki...@google.com.invalid
> > >
> > > wrote:
> > >
> > > > Hello Beam, Calcite, and Flink dev lists!
> > > >
> > > > Apologies for the big cross post, but I thought this might be
> something
> > > all
> > > > three communities would find relevant.
> > > >
> > > > Beam is finally making progress on a SQL DSL utilizing Calcite,
> thanks
> > to
> > > > Mingmin Xu. As you can imagine, we need to come to some conclusion
> > about
> > > > how to elegantly support the full suite of streaming functionality in
> > the
> > > > Beam model in via Calcite SQL. You folks in the Flink community have
> > been
> > > > pushing on this (e.g., adding windowing constructs, amongst others,
> > thank
> > > > you! :-), but from my understanding we still don't have a full spec
> for
> > > how
> > > > to support robust streaming in SQL (including but not limited to,
> > e.g., a
> > > > triggers analogue such as EMIT).
> > > >
> > > > I've been spending a lot of time thinking about this and have some
> > > opinions
> > > > about how I think it should look that I've already written down, so I
> > > > volunteered to try to drive forward agreement on a general streaming
> > SQL
> > > > spec between our three communities (well, technically I volunteered
> to
> > do
> > > > that w/ Beam and Calcite, but I figured you Flink folks might want to
> > > join
> > > > in since you're going that direction already anyway and will have
> > useful
> > > > insights :-).
> > > >
> > > > My plan was to do this by sharing two docs:
> > > >
> > > >1. The Beam Model : Streams & Tables - This one is for context,
> and
> > > >really only mentions SQL in passing. But it describes the
> > relationship
> > > >between the Beam Model and the "streams & tables" way of thinking,
> > > which
> > > >turns out to be useful in understanding what robust streaming in
> SQL
> > > > might
> > > >look like. Many of you probably already know some or all of what's
> > in
> > > > here,
> > > >but I felt it was necessary to have it all written down in order
> to
> > > > justify
> > > >some of the proposals I wanted to make in the second doc.
> > > >
> > > >2. A streaming SQL spec for Calcite - The goal for this doc is
> that
> > it
> > > >would become a general specification for what robust streaming SQL
> 

Re: Towards a spec for robust streaming SQL, Part 1

2017-04-21 Thread Tyler Akidau
Thanks for reading, Luke. The simple answer is that CoGBK is basically
flatten + GBK. Flatten is a non-grouping operation that merges the input
streams into a single output stream. GBK then groups the data within that
single union stream as you might otherwise expect, yielding a single table.
So I think it doesn't really impact things much. Grouping, aggregation,
window merging etc all just act upon the merged stream and generate what is
effectively a merged table.
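
For reference, a minimal sketch of the CoGBK shape in the Java SDK (the input
collections are illustrative); the two keyed inputs are flattened into a
single keyed stream before the grouping:

  TupleTag<Integer> clicksTag = new TupleTag<Integer>() {};
  TupleTag<String> viewsTag = new TupleTag<String>() {};
  PCollection<KV<String, CoGbkResult>> joined =
      KeyedPCollectionTuple.of(clicksTag, clicks)  // PCollection<KV<String, Integer>>
          .and(viewsTag, views)                    // PCollection<KV<String, String>>
          .apply(CoGroupByKey.create());
  // Per key (and window), each tag's values are then read back out of the
  // single grouped result: result.getAll(clicksTag), result.getAll(viewsTag).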

-Tyler

On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik <lc...@google.com.invalid>
wrote:

> The doc is a good read.
>
> I think you do a great job of explaining table -> stream, stream -> stream,
> and stream -> table when there is only one stream.
> But when there are multiple streams reading/writing to a table, how does
> that impact what occurs?
> For example, with CoGBK you have multiple streams writing to a table, how
> does that impact window merging?
>
> On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau <taki...@google.com.invalid>
> wrote:
>
> > Hello Beam, Calcite, and Flink dev lists!
> >
> > Apologies for the big cross post, but I thought this might be something
> all
> > three communities would find relevant.
> >
> > Beam is finally making progress on a SQL DSL utilizing Calcite, thanks to
> > Mingmin Xu. As you can imagine, we need to come to some conclusion about
> > how to elegantly support the full suite of streaming functionality in the
> > Beam model in via Calcite SQL. You folks in the Flink community have been
> > pushing on this (e.g., adding windowing constructs, amongst others, thank
> > you! :-), but from my understanding we still don't have a full spec for
> how
> > to support robust streaming in SQL (including but not limited to, e.g., a
> > triggers analogue such as EMIT).
> >
> > I've been spending a lot of time thinking about this and have some
> opinions
> > about how I think it should look that I've already written down, so I
> > volunteered to try to drive forward agreement on a general streaming SQL
> > spec between our three communities (well, technically I volunteered to do
> > that w/ Beam and Calcite, but I figured you Flink folks might want to
> join
> > in since you're going that direction already anyway and will have useful
> > insights :-).
> >
> > My plan was to do this by sharing two docs:
> >
> >1. The Beam Model : Streams & Tables - This one is for context, and
> >really only mentions SQL in passing. But it describes the relationship
> >between the Beam Model and the "streams & tables" way of thinking,
> which
> >turns out to be useful in understanding what robust streaming in SQL
> > might
> >look like. Many of you probably already know some or all of what's in
> > here,
> >but I felt it was necessary to have it all written down in order to
> > justify
> >some of the proposals I wanted to make in the second doc.
> >
> >2. A streaming SQL spec for Calcite - The goal for this doc is that it
> >would become a general specification for what robust streaming SQL in
> >Calcite should look like. It would start out as a basic proposal of
> what
> >things *could* look like (combining both what things look like now as
> > well
> >as a set of proposed changes for the future), and we could all iterate
> > on
> >it together until we get to something we're happy with.
> >
> > At this point, I have doc #1 ready, and it's a bit of a monster, so I
> > figured I'd share it and let folks hack at it with comments if they have
> > any, while I try to get the second doc ready in the meantime. As part of
> > getting doc #2 ready, I'll be starting a separate thread to try to gather
> > input on what things are already in flight for streaming SQL across the
> > various communities, to make sure the proposal captures everything that's
> > going on as accurately as it can.
> >
> > If you have any questions or comments, I'm interested to hear them.
> > Otherwise, here's doc #1, "The Beam Model : Streams & Tables":
> >
> >   http://s.apache.org/beam-streams-tables
> >
> > -Tyler
> >
>


Towards a spec for robust streaming SQL, Part 1

2017-04-20 Thread Tyler Akidau
Hello Beam, Calcite, and Flink dev lists!

Apologies for the big cross post, but I thought this might be something all
three communities would find relevant.

Beam is finally making progress on a SQL DSL utilizing Calcite, thanks to
Mingmin Xu. As you can imagine, we need to come to some conclusion about
how to elegantly support the full suite of streaming functionality in the
Beam model in via Calcite SQL. You folks in the Flink community have been
pushing on this (e.g., adding windowing constructs, amongst others, thank
you! :-), but from my understanding we still don't have a full spec for how
to support robust streaming in SQL (including but not limited to, e.g., a
triggers analogue such as EMIT).

I've been spending a lot of time thinking about this and have some opinions
about how I think it should look that I've already written down, so I
volunteered to try to drive forward agreement on a general streaming SQL
spec between our three communities (well, technically I volunteered to do
that w/ Beam and Calcite, but I figured you Flink folks might want to join
in since you're going that direction already anyway and will have useful
insights :-).

My plan was to do this by sharing two docs:

   1. The Beam Model : Streams & Tables - This one is for context, and
   really only mentions SQL in passing. But it describes the relationship
   between the Beam Model and the "streams & tables" way of thinking, which
   turns out to be useful in understanding what robust streaming in SQL might
   look like. Many of you probably already know some or all of what's in here,
   but I felt it was necessary to have it all written down in order to justify
   some of the proposals I wanted to make in the second doc.

   2. A streaming SQL spec for Calcite - The goal for this doc is that it
   would become a general specification for what robust streaming SQL in
   Calcite should look like. It would start out as a basic proposal of what
   things *could* look like (combining both what things look like now as well
   as a set of proposed changes for the future), and we could all iterate on
   it together until we get to something we're happy with.

At this point, I have doc #1 ready, and it's a bit of a monster, so I
figured I'd share it and let folks hack at it with comments if they have
any, while I try to get the second doc ready in the meantime. As part of
getting doc #2 ready, I'll be starting a separate thread to try to gather
input on what things are already in flight for streaming SQL across the
various communities, to make sure the proposal captures everything that's
going on as accurately as it can.

If you have any questions or comments, I'm interested to hear them.
Otherwise, here's doc #1, "The Beam Model : Streams & Tables":

  http://s.apache.org/beam-streams-tables

-Tyler


Re: [PROPOSAL]: a new feature branch for SQL DSL

2017-04-11 Thread Tyler Akidau
Hi 陈竞,

I'm doubtful there will be an explicit equivalent of the State API in SQL,
at least not in the SQL portion of the DSL itself (it might make sense to
expose one within UDFs). The State API is an imperative interface for
accessing an underlying persistent state table, whereas SQL operates more
functionally. There's no good way I'm aware of to expose the
characteristics provided by the State API (logic-driven, fine- and
coarse-grained reads/writes of potentially multiple fields of state
utilizing potentially multiple data types) in raw SQL cleanly.

On the upside, SQL has the advantage of making it very easy and natural to
materialize new state tables. In the proposal I'll be sharing for how I
think we should integrate streaming into SQL robustly, any time you perform
some grouping operation (GROUP BY, JOIN, CUBE, etc) you're transforming
your stream into a table. That table is effectively a persistent state
table. So there exists a large suite of functionality in standard SQL that
gives you a lot of powerful tools for creating state.
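
For example, even a query as simple as the one below effectively declares a
persistent, continuously updated state table keyed by user (sketch only):

  SELECT user_id, COUNT(*) AS actions
  FROM UserEvents
  GROUP BY user_id

The per-key running counts behind that GROUP BY are exactly the kind of state
a handwritten stateful DoFn would otherwise manage imperatively.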

It may also be possible for the different access patterns of more
complicated data structures (e.g., bags or lists) to be captured by
different data types supported by the underlying systems. But I don't
expect there to be an imperative State access API built into SQL itself.

All that said, I'm curious to hear ideas otherwise if anyone has them. :-)

-Tyler

On Mon, Apr 10, 2017 at 10:19 PM 陈竞 <cj.mag...@gmail.com> wrote:

> I just want to know what the State API equivalent is for SQL, since
> Beam already supports stateful processing using a stateful DoFn
>
> 2017-04-11 2:12 GMT+08:00 Tyler Akidau <taki...@google.com.invalid>:
>
> > 陈竞, what are you specifically curious about regarding state? Are you
> > wanting to know what the State API equivalent is for SQL? Or are you
> > asking an operational question about where the state for a given SQL
> > pipeline will live?
> >
> > -Tyler
> >
> >
> > On Sun, Apr 9, 2017 at 12:39 PM Mingmin Xu <mingm...@gmail.com> wrote:
> >
> > > Thanks @JB, will get the initial PR out soon.
> > >
> > > On Sun, Apr 9, 2017 at 12:28 PM, Jean-Baptiste Onofré <j...@nanthrax.net
> >
> > > wrote:
> > >
> > > > As discussed, I created the DSL_SQL branch with the skeleton. Mingmin
> > is
> > > > rebasing on this branch to submit the PR.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > >
> > > > On 04/09/2017 08:02 PM, Mingmin Xu wrote:
> > > >
> > > >> State is not touched yet, welcome to add it.
> > > >>
> > > >> On Sun, Apr 9, 2017 at 2:40 AM, 陈竞 <cj.mag...@gmail.com> wrote:
> > > >>
> > > >> How will this SQL support state in both streaming and batch modes?
> > > >>>
> > > >>> 2017-04-07 4:54 GMT+08:00 Mingmin Xu <mingm...@gmail.com>:
> > > >>>
> > > >>>> @Tyler, there's no big change in the previous design doc; I added
> > > >>>> some details in the chapter 'Part 2. DML ([INSERT] SELECT)',
> > > >>>> describing the steps to process a query. Feel free to leave a comment.
> > > >>>>
> > > >>>> I've read through your 'EMIT' doc; it's awesome from my perspective.
> > > >>>> I have some tests on GroupBy with default triggers/allowed_lateness
> > > >>>> now; EMIT syntax can be added to fill the gap.
> > > >>>>
> > > >>>> On Thu, Apr 6, 2017 at 1:04 PM, Tyler Akidau <taki...@apache.org>
> > > >>>> wrote:
> > > >>>>
> > > >>>> I'm very excited by this development as well, thanks for
> continuing
> > to
> > > >>>>>
> > > >>>> push
> > > >>>>
> > > >>>>> this forward, Mingmin. :-)
> > > >>>>>
> > > >>>>> I noticed you'd made some changes to your design doc
> > > >>>>> <
> https://docs.google.com/document/d/1Uc5xYTpO9qsLXtT38OfuoqSLimH_
> > > >>>>> 0a1Bz5BsCROMzCU/edit>.
> > > >>>>> Is it ready for another review? How reflective is it currently of
> > the
> > > >>>>>
> > > >>>> work
> > > >>>>
> > > >>>>> that going into the feature branch?
> > > >>>>>
> >

Re: [PROPOSAL]: a new feature branch for SQL DSL

2017-04-06 Thread Tyler Akidau
I'm very excited by this development as well, thanks for continuing to push
this forward, Mingmin. :-)

I noticed you'd made some changes to your design doc.
Is it ready for another review? How reflective is it currently of the work
that's going into the feature branch?

In parallel, I'd also like to continue helping push forward the definition
of unified model semantics for SQL so we can get Calcite to a point where
it supports the full Beam model. I added a comment on the JIRA suggesting I
create a doc with a specification proposal for
EMIT (and any other necessary semantic changes) that we can then iterate on
in public with the Calcite folks. I already have most of the content
written (and there's a significant amount of background needed to justify
some aspects of the proposal), so it'll mostly be a matter of pulling it
all together into something coherent. Does that sound reasonable to
everyone?

-Tyler


On Thu, Apr 6, 2017 at 10:26 AM Kenneth Knowles 
wrote:

> Very cool! I'm really excited about this integration.
>
> On Thu, Apr 6, 2017 at 9:39 AM, Jean-Baptiste Onofré 
> wrote:
>
> > Hi,
> >
> > Mingmin and I prepared a new branch to have the SQL DSL in dsls/sql
> > location.
> >
> > Any help is welcome !
> >
> > Thanks,
> > Regards
> > JB
> >
> >
> > On 04/06/2017 06:36 PM, Mingmin Xu wrote:
> >
> >> @Tarush, you're very welcome to join the effort.
> >>
> >> On Thu, Apr 6, 2017 at 7:22 AM, tarush grover 
> >> wrote:
> >>
> >> Hi,
> >>>
> >>> Can I be also part of this feature development.
> >>>
> >>> Regards,
> >>> Tarush Grover
> >>>
> >>> On Thu, Apr 6, 2017 at 3:17 AM, Ted Yu  wrote:
> >>>
> >>> I compiled BEAM-301 branch with calcite 1.12 - passed.
> 
>  Julian tries to not break existing things, but he will if there's a
> 
> >>> reason
> >>>
>  to do so :-)
> 
>  On Wed, Apr 5, 2017 at 2:36 PM, Mingmin Xu 
> wrote:
> 
>  @Ted, thanks for the note. I intend to stick with one version, Beam
> >
>  0.6.0
> >>>
>  and Calcite 1.11 so far, unless impacted by API change. Before it's
> >
>  merged
> 
> > back to master, will upgrade to the latest version.
> >
> > On Wed, Apr 5, 2017 at 2:14 PM, Ted Yu  wrote:
> >
> > Working in feature branch is good - you may want to periodically sync
> >>
> > up
> 
> > with master.
> >>
> >> I noticed that you are using 1.11.0 of calcite.
> >> 1.12 is out, FYI
> >>
> >> On Wed, Apr 5, 2017 at 2:05 PM, Mingmin Xu 
> >>
> > wrote:
> >>>
> 
> >> Hi all,
> >>>
> >>> I'm working on https://issues.apache.org/jira/browse/BEAM-301 (Add
> >>>
> >> a
> >>>
>  Beam
> >
> >> SQL DSL). The skeleton is already in
> >>> https://github.com/XuMingmin/beam/tree/BEAM-301, using Java SDK in
> >>>
> >> the
> 
> > back-end. The goal is to provide a SQL interface over Beam, based
> >>>
> >> on
> >>>
>  Calcite, including:
> >>> 1). a translator to create Beam pipeline from SQL,
> >>> (SELECT/INSERT/FILTER/GROUP-BY/JOIN/...);
> >>> 2). an interactive client to submit queries;  (All-SQL mode)
> >>> 3). a SQL API which reduces the work to create a Pipeline; (Semi-SQL
> >>>
> >> mode)
> >
> >>
> >>> As we see many folks are interested in this feature, would like to
> >>>
> >> create a
> >>
> >>> feature branch to have more involvement.
> >>> Looking for comments and feedback.
> >>>
> >>> Thanks!
> >>> 
> >>> Mingmin
> >>>
> >>>
> >>
> >
> >
> > --
> > 
> > Mingmin
> >
> >
> 
> >>>
> >>
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: [PROPOSAL] @OnWindowExpiration

2017-03-30 Thread Tyler Akidau
+1. I'm assuming this is much easier for the user in the case of merging
windows (or do we currently disallow user-specified timers for merging
windows, similar to value state)?
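
For concreteness, a rough sketch of the buffering use case this targets,
assuming the annotation lands roughly as proposed (the names and the exact
callback signature here are illustrative only):

class BatchingFn extends DoFn<KV<String, String>, String> {
  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec =
      StateSpecs.bag(StringUtf8Coder.of());

  @ProcessElement
  public void process(
      ProcessContext ctx, @StateId("buffer") BagState<String> buffer) {
    buffer.add(ctx.element().getValue());
    // ... emit and clear a batch here once it reaches some target size ...
  }

  @OnWindowExpiration  // flush leftovers before the window's state is deleted
  public void onExpiry(
      @StateId("buffer") BagState<String> buffer, OutputReceiver<String> out) {
    for (String s : buffer.read()) {
      out.output(s);
    }
  }
}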

-Tyler

On Wed, Mar 29, 2017 at 2:38 PM Aljoscha Krettek 
wrote:

> +1 I had also already commented on the issue a while back ;-)
>
> On Wed, Mar 29, 2017, at 21:23, Kenneth Knowles wrote:
> > I had totally forgotten that this was filed as
> > https://issues.apache.org/jira/browse/BEAM-1589 already, which I have
> now
> > assigned to myself.
> >
> > And, of course, there have been many discussions that mentioned the
> > feature, so my initial phrasing as though it was a new idea probably
> > seemed
> > a bit odd.
> >
> > I was just finally putting it forward as a formal proposal to the list to
> > get feedback such as Robert's as well as any objections.
> >
> > Kenn
> >
> > On Wed, Mar 29, 2017 at 9:35 AM, Thomas Groh 
> > wrote:
> >
> > > +1
> > >
> > > The fact that we have this ability already (including all of the
> required
> > > information), just in a roundabout way by manually dredging in the
> allowed
> > > lateness, means that this isn't a huge burden to implement on an SDK or
> > > runner side; meanwhile, this much more strongly communicates what a
> user is
> > > trying to accomplish (in the general case, flush anything left over).
> > >
> > > I think having this annotation present and available also makes it more
> > > obvious that if there's no window-expiration cleanup then any remaining
> > > buffered state will be lost, and that there's a recommended way to
> flush
> > > any remaining state.
> > >
> > > On Wed, Mar 29, 2017 at 9:14 AM, Kenneth Knowles
> 
> > > wrote:
> > >
> > > > On Wed, Mar 29, 2017 at 12:16 AM, JingsongLee <
> lzljs3620...@aliyun.com>
> > > > wrote:
> > > >
> > > > > If a user has a WordCount stateful DoFn, the result of the
> > > > > counts is always changing before the expiration of the window.
> > > > > Maybe the user wants a signal to know the count is the final value,
> > > > > and can then archive the value to the timing database or somewhere
> > > > > else.
> > > > > best,
> > > > > JingsongLee
> > > > >
> > > >
> > > > This is a good point to bring up, but actually already required to be
> > > > handled by the runner. This issue exists with timers already. The
> runner
> > > > must sequence these:
> > > >
> > > > 1. Expire the window and start dropping any more input
> > > > 2. Fire the user's expiration callback
> > > > 3. Delete the state for the window
> > > >
> > > > This actually made me think of a special property of
> @OnWindowExpiration:
> > > > we can forbid Timer parameters. If we followed Robert's idea we
> could do
> > > > static analysis and enforce the same thing.
> > > >
> > > > This is a pretty good motivation for the special feature. It is more
> than
> > > > convenience.
> > > >
> > > > Kenn
> > > >
> > > >
> > > > > 
> > > > --From:Kenneth
> > > > > Knowles Time:2017 Mar 29 (Wed)
> 09:07To:dev <
> > > > > dev@beam.apache.org>Subject:Re: [PROPOSAL] @OnWindowExpiration
> > > > > On Tue, Mar 28, 2017 at 2:47 PM, Eugene Kirpichov <
> > > > > kirpic...@google.com.invalid> wrote:
> > > > >
> > > > > > Kenn, can you quote some use cases for this, to make
> > > > > it more clear what are
> > > > > > the consequences of having this API in this form?
> > > > > >
> > > > > > I recall that one of the main use cases was batching DoFn, right?
> > > > > >
> > > > >
> > > > > I believe every stateful DoFn where the data stored in state
> represents
> > > > > some accumulation of the input and/or buffering of output requires
> > > this.
> > > > > So, yes:
> > > > >
> > > > >  - batching DoFn and the many variants that may spring up
> > > > >  - combine-like stateful DoFns that require state, like blended
> > > > > accumulation modes or selective composed combines
> > > > >  - trigger-like stateful DoFns that output based on some complex
> > > > > user-defined criteria
> > > > >
> > > > > The stateful DoFns that do not require such a timer are those
> where the
> > > > > stored data is a phase transition or side-input-like enrichment,
> and I
> > > > > think also common join algorithms.
> > > > >
> > > > > I don't have a sense of which of these will be more prevalent. Both
> > > > > categories represent common user needs.
> > > > >
> > > > > Kenn
> > > > >
> > > > >
> > > > > > On Tue, Mar 28, 2017 at 1:37 PM Kenneth Knowles
> > >  > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > On Tue, Mar 28, 2017 at 1:32 PM, Robert Bradshaw <
> > > > > > > rober...@google.com.invalid> wrote:
> > > > > > >
> > > > > > > > Another alternative is to be able to set special timers, e.g.
> > > > > > > > end of window and expiration of window. That at least
> > > > > > > > addresses (2).
> > > > > > > >
> > > > > > >
> > > > > 
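
To make the "flush anything left over" pattern concrete, here is a minimal
sketch of a buffering stateful DoFn using the proposed annotation. It assumes
@OnWindowExpiration accepts state and output parameters the way
@ProcessElement does; the parameter list and package paths are assumptions,
not a settled API.

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Buffers values per key and window; flushes on window expiration so nothing
// is silently dropped when the runner deletes the window's state.
class BufferingFn extends DoFn<KV<String, String>, String> {

  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec =
      StateSpecs.bag(StringUtf8Coder.of());

  @ProcessElement
  public void process(
      @Element KV<String, String> element,
      @StateId("buffer") BagState<String> buffer) {
    buffer.add(element.getValue());
    // ... emit and clear here once a batch-size threshold is reached ...
  }

  // Invoked between "stop accepting input for the window" and "delete the
  // window's state" -- steps 1 and 3 of the sequencing described above.
  // Note: no Timer parameters, per the proposal.
  @OnWindowExpiration
  public void flush(
      @StateId("buffer") BagState<String> buffer,
      OutputReceiver<String> out) {
    for (String value : buffer.read()) {
      out.output(value);
    }
  }
}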

Re: [ANNOUNCEMENT] New committers, March 2017 edition!

2017-03-20 Thread Tyler Akidau
Welcome!

On Mon, Mar 20, 2017, 02:25 Jean-Baptiste Onofré  wrote:

> Welcome aboard, and congrats !
>
> Really happy to count you all in the team ;)
>
> Regards
> JB
>
> On 03/17/2017 10:13 PM, Davor Bonaci wrote:
> > Please join me and the rest of Beam PMC in welcoming the following
> > contributors as our newest committers. They have significantly contributed
> > to the project in different ways, and we look forward to many more
> > contributions in the future.
> >
> > * Chamikara Jayalath
> > Chamikara has been contributing to Beam since inception, and previously to
> > Google Cloud Dataflow, accumulating a total of 51 commits (8,301 ++ / 3,892
> > --) since February 2016 [1]. He contributed broadly to the project, but
> > most significantly to the Python SDK, building the IO framework in this SDK
> > [2], [3].
> >
> > * Eugene Kirpichov
> > Eugene has been contributing to Beam since inception, and previously to
> > Google Cloud Dataflow, accumulating a total of 95 commits (22,122 ++ /
> > 18,407 --) since February 2016 [1]. In recent months, he's been driving the
> > Splittable DoFn effort [4]. A true expert on the IO subsystem, Eugene has
> > reviewed nearly every IO contributed to Beam. Finally, Eugene contributed
> > the Beam Style Guide, and is championing it across the project.
> >
> > * Ismaël Mejia
> > Ismaël has been contributing to Beam since mid-2016, accumulating a total
> > of 35 commits (3,137 ++ / 1,328 --) [1]. He authored the HBaseIO
> > connector, helped on the Spark runner, and contributed in other areas as
> > well, including cross-project collaboration with Apache Zeppelin. Ismaël
> > reported 24 Jira issues.
> >
> > * Aviem Zur
> > Aviem has been contributing to Beam since early fall, accumulating a total
> > of 49 commits (6,471 ++ / 3,185 --) [1]. He reported 43 Jira issues, and
> > resolved ~30 issues. Aviem improved the stability of the Spark runner a
> > lot, and introduced support for metrics. Finally, Aviem is championing
> > dependency management across the project.
> >
> > Congratulations to all four! Welcome!
> >
> > Davor
> >
> > [1]
> >
> https://github.com/apache/beam/graphs/contributors?from=2016-02-01&to=2017-03-17&type=c
> > [2]
> >
> https://github.com/apache/beam/blob/v0.6.0/sdks/python/apache_beam/io/iobase.py#L70
> > [3]
> >
> https://github.com/apache/beam/blob/v0.6.0/sdks/python/apache_beam/io/iobase.py#L561
> > [4] https://s.apache.org/splittable-do-fn
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [BEAM-301] Add a Beam SQL DSL

2017-02-28 Thread Tyler Akidau
Hi Mingmin,

Thanks for your interest in helping out on this task, and for your initial
proposal. I'm also very happy to work with you on this, and excited to see
some progress made here. Added a few more comments on the doc, but will
summarize them below as well.

As far as the DSL point goes, I agree with JB that any sort of interface to
Beam that uses SQL will be creating a DSL. Having the initial interface be
an interactive SQL prompt is a perfectly valid approach, but at the end of
the day, there's still a DSL under the covers. As such, there are a lot of
questions that will need to be addressed in designing such a DSL (and the
Jira lists some resources discussing those already).

That said, it's possible to make progress on a Beam DSL without addressing
them all (e.g., by tackling only a small subset of functionality first,
such as project and filter). But the current phases as listed in the doc
will require addressing some of the big ones.

So a good first step might be trying to scope the proposal to have a more
modest initial set of functionality, or else providing more detail on how
you propose to address the issues that will come up with various features
currently listed in phase 1, particularly grouping w/ streams.

-Tyler
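
To make the modest-initial-subset idea concrete, a project-and-filter query
might look roughly like this from the Java side. This is a hypothetical
sketch: the SqlTransform name, the PCOLLECTION table convention, and the
schema of `rows` are illustrative assumptions, not part of the proposal.

import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// `rows` is an existing PCollection<Row> whose schema has user_id and score
// columns; the query projects two columns and filters rows.
PCollection<Row> filtered =
    rows.apply(
        SqlTransform.query(
            "SELECT user_id, score FROM PCOLLECTION WHERE score > 100"));

Even this small subset forces decisions about the mapping between SQL column
types and Beam element types, which is part of why starting small seems
prudent.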

On Mon, Feb 27, 2017 at 10:44 PM Jean-Baptiste Onofré 
wrote:

> Hi Mingmin,
>
> The idea is actually both:
>
> 1. an interactive SQL prompt where we can express a pipeline directly
> using SQL.
> 2. a SQL DSL to describe a pipeline in SQL and create the corresponding
> Java code under the hood.
>
> I provided a couple of comments on the doc. Ready and happy to help you on
> this (as I created the Jira ;)).
>
> Regards
> JB
>
> On 02/27/2017 10:33 PM, Mingmin Xu wrote:
> > Hello all,
> >
> > Would like to pop up this task, to see any interest to move it forward.
> >
> > I have a project that runs SQL queries with an interactive interface, and
> > would like to share my ideas. A draft doc is available describing how it
> > works with Calcite. --A little different from BEAM-301, in that I chose an
> > interactive CLI rather than a SQL DSL.
> >
> > Doc link:
> >
> https://docs.google.com/document/d/1Uc5xYTpO9qsLXtT38OfuoqSLimH_0a1Bz5BsCROMzCU/edit?usp=sharing
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [jira] [Created] (BEAM-1261) State API should allow state to be managed in different windows

2017-01-11 Thread Tyler Akidau
On Wed, Jan 11, 2017 at 9:43 AM Robert Bradshaw 
wrote:

> On Wed, Jan 11, 2017 at 8:59 AM, Lukasz Cwik 
> wrote:
> > I was under the impression that user state was scoped to a ParDo and was
> > not shareable across multiple ParDos. Wouldn't rewindowing require the
> > usage of multiple ParDos and hence not allow for state to be shared?
>
> No, you'd do something like
>
> pc.apply(WindowInto(grouping_windowing))
>   .apply(GroupByKey())
>   .apply(WindowInto(state_windowing))
>   .apply(ParDo(state_using_dofn))
>
> You could reify the window after GroupByKey if you need to inspect it.
>
> However, I'm liking the idea of being able to associate different
> WindowFns with particular state tags similar to side inputs (though
> the default would be the windowing of the main input).
>

Can you expand upon what you mean by this? I'm not sure I understand what
you're getting at yet.

-Tyler


>
> > On Tue, Jan 10, 2017 at 10:51 PM, Robert Bradshaw <
> > rober...@google.com.invalid> wrote:
> >
> >> Possibly this could be handled by rewindowing and the current
> semantics. If
> >> not, maybe treat state like a side input with its own windowing and
> window
> >> mapping fn.
> >>
> >> On Jan 10, 2017 3:14 PM, "Ben Chambers (JIRA)"  wrote:
> >>
> >> > Ben Chambers created BEAM-1261:
> >> > --
> >> >
> >> >  Summary: State API should allow state to be managed in
> >> > different windows
> >> >  Key: BEAM-1261
> >> >  URL: https://issues.apache.org/jira/browse/BEAM-1261
> >> >  Project: Beam
> >> >   Issue Type: Bug
> >> >   Components: beam-model, sdk-java-core
> >> > Reporter: Ben Chambers
> >> > Assignee: Kenneth Knowles
> >> >
> >> >
> >> > For example, even if the elements are being processed in fixed windows
> >> > of an hour, it may be desirable for the state to "roll over" between
> >> > windows (or be available to all windows).
> >> >
> >> > It will also be necessary to figure out when this state should be
> >> > deleted (TTL? maximum retention?)
> >> >
> >> > Another problem is how to deal with out of order data. If data comes
> >> > in from the 10:00 AM window, should its state changes be visible to
> >> > the data in the 9:00 AM window?
> >> >
> >> >
> >> >
> >> > --
> >> > This message was sent by Atlassian JIRA
> >> > (v6.3.4#6332)
> >> >
> >>
>
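
For concreteness, Robert's rewindowing sketch might render in the Java SDK
roughly as follows. This is a sketch under assumptions: the window choices,
the input types, and the stateful DoFn are placeholders, not a definitive
recipe.

import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// `input` is an existing PCollection<KV<String, Integer>>; StateUsingDoFn is
// a placeholder stateful DoFn over the grouped elements.
PCollection<KV<String, Iterable<Integer>>> grouped =
    input
        .apply("GroupingWindow",
            Window.into(FixedWindows.of(Duration.standardHours(1))))
        .apply(GroupByKey.create());

// Rewindow so the stateful DoFn's state is scoped to the global window and
// "rolls over" between the hourly grouping windows, per BEAM-1261.
PCollection<String> results =
    grouped
        .apply("StateWindow", Window.into(new GlobalWindows()))
        .apply(ParDo.of(new StateUsingDoFn()));

If the grouping window itself needs to be inspected downstream, it can be
attached to each element after the GroupByKey (a small DoFn reading its
BoundedWindow parameter would do) before rewindowing, as Robert notes.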