Re: [DISCUSS] Multiple-triggering SQL Join with retractions support

2019-08-21 Thread Mingmin Xu
@Rui In my cases, we have some complex queries like
SELECT ...
FROM ( SELECT ... FROM PRE_A GROUP BY id, TUMBLE(1 HOUR) ) A
JOIN ( SELECT ... FROM PRE_B GROUP BY id, TUMBLE(1 HOUR) ) B
ON A.id=B.id
// A emits every minute in accumulating mode and B emits every minute in
// discarding mode.

I would be interested to know how this can be supported with retraction in
SQL; currently this operation is simply blocked.
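
For reference, a rough Java sketch of this pipeline shape, assuming preA and
preB are schema-aware PCollection<Row>s; the names and exact trigger settings
are illustrative, not the blocked implementation itself:

import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.transforms.windowing.*;
import org.apache.beam.sdk.values.*;
import org.joda.time.Duration;

// Pre-aggregated inputs with per-input triggers, joined in SQL.
Trigger everyMinute = Repeatedly.forever(
    AfterProcessingTime.pastFirstElementInPane()
        .plusDelayOf(Duration.standardMinutes(1)));

PCollection<Row> a = preA.apply("WindowA",
    Window.<Row>into(FixedWindows.of(Duration.standardHours(1)))
        .triggering(everyMinute)
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes());   // A: every minute, accumulating

PCollection<Row> b = preB.apply("WindowB",
    Window.<Row>into(FixedWindows.of(Duration.standardHours(1)))
        .triggering(everyMinute)
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes());     // B: every minute, discarding

PCollection<Row> joined =
    PCollectionTuple.of(new TupleTag<>("A"), a)
        .and(new TupleTag<>("B"), b)
        .apply(SqlTransform.query(
            "SELECT * FROM A JOIN B ON A.id = B.id"));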

Mingmin

On Wed, Aug 21, 2019 at 11:21 AM Kenneth Knowles  wrote:

> These all sound useful. One thing is that the EMIT syntax is an
> early-stage idea, and more likely subject to change. The problem with EMIT
> anywhere except the top level is that it is not very composable. It really
> belongs as part of an INSERT statement, just like sink triggers.
>
> Maybe a first step is to do the basics for retractions in Beam itself.
> This is already a lot of work (I just reviewed your prototype and Anton's
> together so I have a very good idea where it is at). Once we have the
> basics, then SqlTransform can have triggers set on its input and still work
> with grouping and joins. That will let us explore retractions in SQL
> without depending on EMIT.
>
> Kenn
>
> On Mon, Aug 19, 2019 at 7:02 PM Rui Wang  wrote:
>
>> I am also asking about TVF windowing and EMIT syntax support in dev@calcite.
>> See [1].
>>
>>
>>
>> [1]:
>> https://lists.apache.org/thread.html/71724f8a9079be11c04c70c64097491822323f560a79a7fa1321711d@%3Cdev.calcite.apache.org%3E
>>
>> -Rui
>>
>> On Mon, Aug 19, 2019 at 4:40 PM Rui Wang  wrote:
>>
>>> Hi Mingmin,
>>>
>>> Thanks for adding "INSERT INTO" (which I missed from the example)
>>>
>>> I am not sure if I understand the question:
>>>
>>> 1. multiple GBK with retraction is solved by [1].
>>> 2. In terms of SQL and its view, the output is defined by the last GBK.
>>>
>>> [1]:
>>> https://docs.google.com/document/d/14WRfxwk_iLUHGPty3C6ZenddPsp_d6jhmx0vuafXqmE/edit?usp=sharing
>>>
>>>
>>> -Rui
>>>
>>> On Mon, Aug 19, 2019 at 4:02 PM Mingmin Xu  wrote:
>>>
>>>> +1 to supporting EMIT on the Beam side first if we cannot include it in
>>>> Calcite in a short time (see #1, #2). I'm open to using any format, the
>>>> one above or something like the one below. The tricky question is: what's
>>>> the expected behavior for a complex query with more than one GBK operator?
>>>>
>>>> EMIT   |  [ACCUMULATE|DISCARD]
>>>> [INSERT INTO ...]
>>>> SELECT ...
>>>>
>>>> #1.
>>>> https://sematext.com/opensee/m/Calcite/FR3K9JVAl32VULr6?subj=Towards+a+spec+for+robust+streaming+SQL+Part+1
>>>> #2
>>>> https://sematext.com/opensee/m/Beam/gfKHFFDd4i1I3nZc2?subj=Towards+a+spec+for+robust+streaming+SQL+Part+2
>>>>
>>>> On Mon, Aug 19, 2019 at 12:02 PM Rui Wang  wrote:
>>>>
>>>>> To update this idea, I think we can go a step further and support the
>>>>> EMIT syntax from the one-sql-to-rule-them-all paper [1].
>>>>>
>>>>> EMIT will allow periodically delayed stream materialization. For a stream
>>>>> view, it means we will add support to sinks to keep generating a changelog
>>>>> table. For a table view, it means we will add support to sinks to
>>>>> periodically generate a compacted table from the changelog table.
>>>>>
>>>>> Regarding SQL, a typical query like the following should run:
>>>>>
>>>>>
>>>>> *WITH joined_table AS (SELECT * FROM S1 JOIN S2)*
>>>>> *SELECT XX FROM HOP(joined_table)*
>>>>> *EMIT [STREAM] AFTER DELAY INTERVAL '1' HOUR*
>>>>>
>>>>>
>>>>> By doing so, retractions will be much more useful for SQL from a product
>>>>> perspective, in which we can have a meaningful end-to-end SQL pipeline.
>>>>>
>>>>> [1]: https://arxiv.org/pdf/1905.12133.pdf
>>>>>
>>>>> -Rui
>>>>>
>>>>> On Mon, Aug 12, 2019 at 11:30 PM Rui Wang  wrote:
>>>>>
>>>>>> Hi Community,
>>>>>>
>>>>>> BeamSQL currently does not support unbounded-unbounded join with a
>>>>>> non-default trigger. It is because:
>>>>>>
>>>>>> - Discarding mode does not work for outer joins because it lacks the
>>>>>> ability to retract pre-emitted values. Think of an example in which a
>>>>>> tuple of (left_row, null) needs to be retracted if the matching right_row
>>>>>> appears after the last trigger fired.

Re: [DISCUSS] Multiple-triggering SQL Join with retractions support

2019-08-19 Thread Mingmin Xu
+1 to supporting EMIT on the Beam side first if we cannot include it in
Calcite in a short time (see #1, #2). I'm open to using any format, the one
above or something like the one below. The tricky question is: what's the
expected behavior for a complex query with more than one GBK operator?

EMIT   |  [ACCUMULATE|DISCARD]
[INSERT INTO ...]
SELECT ...

#1.
https://sematext.com/opensee/m/Calcite/FR3K9JVAl32VULr6?subj=Towards+a+spec+for+robust+streaming+SQL+Part+1
#2
https://sematext.com/opensee/m/Beam/gfKHFFDd4i1I3nZc2?subj=Towards+a+spec+for+robust+streaming+SQL+Part+2
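
For reference, the paper attaches EMIT to the end of the query, so an
INSERT-scoped form might look roughly like this (a sketch only, not
implemented syntax):

INSERT INTO sink_table
SELECT ... FROM TABLE Tumble(...)
EMIT STREAM AFTER DELAY INTERVAL '1' MINUTE;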

On Mon, Aug 19, 2019 at 12:02 PM Rui Wang  wrote:

> To update this idea, I think we can go a step further and support the EMIT
> syntax from the one-sql-to-rule-them-all paper [1].
>
> EMIT will allow periodically delayed stream materialization. For a stream
> view, it means we will add support to sinks to keep generating a changelog
> table. For a table view, it means we will add support to sinks to
> periodically generate a compacted table from the changelog table.
>
> Regarding SQL, a typical query like the following should run:
>
>
> *WITH joined_table AS (SELECT * FROM S1 JOIN S2)*
> *SELECT XX FROM HOP(joined_table)*
> *EMIT [STREAM] AFTER DELAY INTERVAL '1' HOUR*
>
>
> By doing so, retractions will be much more useful for SQL from a product
> perspective, in which we can have a meaningful end-to-end SQL pipeline.
>
> [1]: https://arxiv.org/pdf/1905.12133.pdf
>
> -Rui
>
> On Mon, Aug 12, 2019 at 11:30 PM Rui Wang  wrote:
>
>> Hi Community,
>>
>> BeamSQL currently does not support unbounded-unbounded join with a
>> non-default trigger. It is because:
>>
>> - Discarding mode does not work for outer joins because it lacks the
>> ability to retract pre-emitted values. Think of an example in which a
>> tuple of (left_row, null) needs to be retracted if the matching right_row
>> appears after the last trigger fired.
>> - Accumulating mode *theoretically* can support unbounded-unbounded join
>> because it's supposed to always "overwrite" the previous result. In
>> practice, however, such overwriting is too expensive for join use cases.
>> It would be much more efficient if small changes in the join's inputs
>> caused only small changes for downstream to compute.
>> - Neither discarding mode nor accumulating mode is sufficient to refine
>> materialized data.
>>
>> Meanwhile, [1] has kicked off a discussion on retractions in Beam model.
>> I have been collecting people's feedback and generally speaking people
>> agree that retractions are useful for some use cases.
>>
>> Thus I propose to combine SQL join with retractions to
>> support multiple-triggering SQL Join.
>>
>> I think SQL join is a good start for supporting retraction in Beam, with
>> the following considerations:
>> 1. multiple-triggering SQL Join is a useful feature.
>> 2. SQL join is an opportunity for us to figure out the implementation
>> details of retraction by building it for a well-defined use case.
>> 3. Supporting retraction should not cause performance regression on
>> existing pipelines, or require changes on existing pipelines.
>>
>>
>> What do you think?
>>
>> [1]:
>> https://lists.apache.org/thread.html/bb2d40b1bea8b21fbbb7caf599fabba823da357768ceca8ea2363789@%3Cdev.beam.apache.org%3E
>>
>>
>> -Rui
>>
>

-- 

Mingmin
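
To make the retraction requirement above concrete, a hedged illustration
(not Beam API, just the changelog a multi-triggered LEFT JOIN would need to
emit when a match arrives after the first firing):

  firing 1: left_row arrives, no match yet  ->  emit    (left_row, null)
  firing 2: matching right_row arrives      ->  retract (left_row, null)
                                                emit    (left_row, right_row)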


Re: [VOTE] Support ZetaSQL as another SQL dialect for BeamSQL in Beam repo

2019-08-12 Thread Mingmin Xu
+1

On Mon, Aug 12, 2019 at 8:53 PM Ryan McDowell 
wrote:

> +1
>
> On Mon, Aug 12, 2019 at 8:30 PM Reza Rokni  wrote:
>
>> +1
>>
>> On Tue, 13 Aug 2019 at 09:28, Ahmet Altay  wrote:
>>
>>> +1
>>>
>>> On Mon, Aug 12, 2019 at 6:27 PM Kenneth Knowles  wrote:
>>>
 +1

 On Mon, Aug 12, 2019 at 4:43 PM Rui Wang  wrote:

> Hi Community,
>
> I am using this separate thread to collect votes on contributing Beam
> ZetaSQL (my way of saying "ZetaSQL as a dialect supported by BeamSQL") to
> the Beam repo.
>
> There are discussions related to benefits, technical design and others
> on Beam ZetaSQL in [1]. The Beam ZetaSQL code lives in [2]. Note that this
> vote is not about merging the PR, which should be decided by code review.
> This vote is only on whether Beam ZetaSQL should live in the Beam repo.
>
> +1: Beam repo can host Beam ZetaSQL
> -1: Beam repo should not host Beam ZetaSQL
>
> If there are more questions related to Beam ZetaSQL, please discuss it
> in [1].
>
> [1]:
> https://lists.apache.org/thread.html/eab70bb99218aaedfd506e979967379c0efa05ea56a792a1486f9c74@%3Cdev.beam.apache.org%3E
> [2]: https://github.com/apache/beam/pull/9210
>
> -Rui
>

>>
>>
>

-- 

Mingmin


Re: Support ZetaSQL as a new SQL dialect in BeamSQL

2019-08-07 Thread Mingmin Xu
Thanks for highlighting the types/operators/functions/... parts; those do
make things more complicated. +1 that the proposal is reasonable as a
short/middle-term solution. We could follow up in the future to handle it in
Calcite Babel if possible.

Mingmin

On Tue, Aug 6, 2019 at 3:57 PM Rui Wang  wrote:

> Hi Mingmin,
>
> Honestly I don't have an answer to it: a SQL dialect is complicated and I
> don't have enough understanding of Calcite (Calcite has a big repo). Based
> on my reading of CALCITE-2280
> <https://issues.apache.org/jira/browse/CALCITE-2280>, the closer a dialect
> is to standard SQL, the fewer blockers we will have to support that
> dialect in the Calcite Babel parser.
>
> However, this is a good question, and it raises an aspect that I find
> people usually ignore: supporting a SQL dialect is not only supporting a
> syntax. It also includes data types, built-in SQL functions, operators
> and many other things.
>
> I especially found the following incompatibilities between Calcite and
> ZetaSQL during the development:
> 1. Calcite does not support the Struct/Row type well, because Calcite
> flattens Rows when reading from tables by adding an extra Projection on
> top of the tables.
> 2. I had trouble supporting the DATETIME (or timestamp without time zone)
> type.
> 3. Huge incompatibilities in SQL functions. E.g. the return type is
> different for AVG(long), and many many more.
> 4. I am not sure if Calcite has the same set of type-casting rules as
> BigQuery (my impression is there are differences).
>
>
> I would say in the short/mid term, it's much easier to use the logical
> plan as an IR to implement another SQL dialect for BeamSQL (LinkedIn has a
> similar practice; see their blog post
> <https://engineering.linkedin.com/blog/2019/01/bridging-offline-and-nearline-computations-with-apache-calcite>
> ).
>
> For the longer term, it would be interesting to see how we can add
> BigQuery syntax (plus its data types and SQL functions) to the Calcite
> Babel parser.
>
>
>
> -Rui
>
>
> On Tue, Aug 6, 2019 at 2:49 PM Mingmin Xu  wrote:
>
>> Just took a look at https://issues.apache.org/jira/browse/CALCITE-2280,
>> which introduced the Babel parser in Calcite to support varied dialects;
>> this may be an easier way to support BigQuery syntax. @Rui do you notice
>> any big differences between the Calcite engine and ZetaSQL, like parsing
>> or optimization? If that's the case, it makes sense to build the
>> alternative switch on the Beam side.
>>
>> On Sun, Aug 4, 2019 at 4:47 PM Rui Wang  wrote:
>>
>>> Mingmin - it sounds like an awesome idea to translate from SparkSQL.
>>> It's even more exciting to know if we could translate Spark
>>> Structured Streaming code in a similar way, which would enable existing
>>> Spark SQL/Structured Streaming pipelines to run on Beam.
>>>
>>> Reuven - Thanks for bringing it up. I tried to search dev@calcite and
>>> only found [1]. From that thread, I see that adding ZetaSQL to Calcite
>>> itself is still under discussion. I am also asking in case anyone knows
>>> of more progress on this work than the thread shows.
>>>
>>>
>>> [1]:
>>> http://mail-archives.apache.org/mod_mbox/calcite-dev/201905.mbox/%3CCAMj=j=-sPWgxzAgusnx8OYvYDYDcDY=dupe6poytrxhjri9...@mail.gmail.com%3E
>>>
>>> -Rui
>>>
>>> On Sun, Aug 4, 2019 at 3:54 PM Reuven Lax  wrote:
>>>
>>>> I hear rumours that the Calcite project is planning on adding a
>>>> zeta-SQL compatible parser to Calcite itself, in which case there will be a
>>>> Java parser we can use as well. Does anyone know if this work is still
>>>> going on?
>>>>
>>>> On Sat, Aug 3, 2019 at 8:41 PM Manu Zhang 
>>>> wrote:
>>>>
>>>>> A question to the community, does the size of the change require any
>>>>>> process besides the usual PR reviews?
>>>>>>
>>>>>
>>>>> I think so. This is a big change and has come as kind of a surprise
>>>>> (sorry if I've missed previous discussions).
>>>>>
>>>>> Rui, could you explain more on how things will play out between
>>>>> BeamSQL and ZetaSQL (a design doc including the pluggable interface
>>>>> would be perfect)? From GitHub, ZetaSQL is mainly in C++, so is what
>>>>> you are doing a port or a connector to ZetaSQL? Do we need to depend on
>>>>> https://github.com/google/zetasql ? ZetaSQL looks interesting but I
>>>>> could barely find any doc for end users.
>>>>>
>>>>

Re: Support ZetaSQL as a new SQL dialect in BeamSQL

2019-08-06 Thread Mingmin Xu
Just took a look at https://issues.apache.org/jira/browse/CALCITE-2280,
which introduced the Babel parser in Calcite to support varied dialects;
this may be an easier way to support BigQuery syntax. @Rui do you notice any
big differences between the Calcite engine and ZetaSQL, like parsing or
optimization? If that's the case, it makes sense to build the alternative
switch on the Beam side.

On Sun, Aug 4, 2019 at 4:47 PM Rui Wang  wrote:

> Mingmin - it sounds like an awesome idea to translate from SparkSQL. It's
> even more exciting to know if we could translate Spark Structured Streaming
> code in a similar way, which would enable existing Spark SQL/Structured
> Streaming pipelines to run on Beam.
>
> Reuven - Thanks for bringing it up. I tried to search dev@calcite and
> only found [1]. From that thread, I see that adding ZetaSQL to Calcite
> itself is still under discussion. I am also asking in case anyone knows of
> more progress on this work than the thread shows.
>
>
> [1]:
> http://mail-archives.apache.org/mod_mbox/calcite-dev/201905.mbox/%3CCAMj=j=-sPWgxzAgusnx8OYvYDYDcDY=dupe6poytrxhjri9...@mail.gmail.com%3E
>
> -Rui
>
> On Sun, Aug 4, 2019 at 3:54 PM Reuven Lax  wrote:
>
>> I hear rumours that the Calcite project is planning on adding a zeta-SQL
>> compatible parser to Calcite itself, in which case there will be a Java
>> parser we can use as well. Does anyone know if this work is still going on?
>>
>> On Sat, Aug 3, 2019 at 8:41 PM Manu Zhang 
>> wrote:
>>
>>> A question to the community, does the size of the change require any
 process besides the usual PR reviews?

>>>
>>> I think so. This is a big change and has come as kind of a surprise
>>> (sorry if I've missed previous discussions).
>>>
>>> Rui, could you explain more on how things will play out between BeamSQL
>>> and ZetaSQL (A design doc including the pluggable interface would be
>>> perfect). From GitHub, ZetaSQL is mainly in C++ so what you are doing is a
>>> port or a connector to ZetaSQL ? Do we need to depend on
>>> https://github.com/google/zetasql ? ZetaSQL looks interesting but I
>>> could barely find any doc for end users.
>>>
>>> Also, I'd prefer the PR to be split into two, one for the pluggable
>>> interface and one for the ZetaSQL.
>>>
>>> Thanks,
>>> Manu
>>>
>>>
>>>
>>> On Sat, Aug 3, 2019 at 10:06 AM Ahmet Altay  wrote:
>>>
 Thank you Rui for the heads up.

 A question to the community, does the size of the change require any
 process besides the usual PR reviews?

 On Fri, Aug 2, 2019 at 10:23 AM Rui Wang  wrote:

> Hi community,
>
> I have been working on supporting ZetaSQL[1] as a SQL dialect in
> BeamSQL. ZetaSQL is a SQL analyzer open sourced by Google. Here is
> ZetaSQL's documentation[2].
>
> Briefly, the design of integrating ZetaSQL with BeamSQL is: I made a
> pluggable query planner interface in BeamSQL, and we can easily plug in a
> new planner [3] (in my case, the ZetaSQL planner). Actually, anyone can add
> new planners this way (e.g. for a PostgreSQL dialect).
>
> I want to contribute the ZetaSQL planner and its related code (~10k lines)
> to the Beam repo (#9210 ). This
> contribution barely touches existing Beam code (because the idea is a
> pluggable planner).
>
>
> *Acknowledgement*
> Thanks to all the people who provided help during Beam ZetaSQL
> development: Matthew Brown, Brian Hulette, Andrew Pilloud, Kenneth Knowles,
> Anton Kedin and Mikhail Gryzykhin. This list is not exhaustive; thanks also
> for contributions that are not listed.
>
>
> [1]: https://github.com/google/zetasql
> [2]: https://github.com/google/zetasql/tree/master/docs
> [3]:
> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/QueryPlanner.java
>
>
> -Rui
>


-- 

Mingmin


Re: Support ZetaSQL as a new SQL dialect in BeamSQL

2019-08-04 Thread Mingmin Xu
Interesting feature; thanks Rui for bringing the new option. Please keep me
in the loop, I'll take a look when back home tomorrow. It also seems like a
chance to support other dialects; we see lots of interest in translating
from dialects like SparkSQL.

Mingmin 
Sent from my iPhone

> On Aug 4, 2019, at 2:43 PM, Rui Wang  wrote:
> 
> Hi David,
> 
> That's a good point. I just added a section discussing the benefits in the
> doc (link).
> 
> 
> -Rui
> 
>> On Sun, Aug 4, 2019 at 2:01 PM David Morávek  wrote:
>> Hi Rui,
>> 
>> This is definitely an interesting topic! Can you please elaborate a little
>> bit more on the benefits that this will bring to the end user? All the
>> documents only cover technical details and I'm still not sure what you're
>> trying to achieve product-wise.
>> 
>> Best,
>> D.
>> 
>>> On Sun, Aug 4, 2019 at 8:07 PM Rui Wang  wrote:
>>> I created a google doc to explain basic design on Beam ZetaSQL: 
>>> https://docs.google.com/document/d/14Yi4oEMzqS3n9-LfSNi6Q6kQpEP3gWTHzX0HxqUksdc/edit?usp=sharing
>>> 
>>> 
>>> 
>>> -Rui
>>> 
 On Sun, Aug 4, 2019 at 10:02 AM Rui Wang  wrote:
 Thanks Manu for your feedback! Some comments inline:
 
 
 On Sat, Aug 3, 2019 at 8:41 PM Manu Zhang  wrote:
>> A question to the community, does the size of the change require any 
>> process besides the usual PR reviews?
> 
> I think so. This is a big change and has come as kind of a surprise 
> (sorry if I've missed previous discussions). 
> 
> Rui, could you explain more on how things will play out between BeamSQL 
> and ZetaSQL (A design doc including the pluggable interface would be 
> perfect).
 
 I see. I will write a document about the basic ideas of Beam ZetaSQL (this
 is my way of referring to "ZetaSQL as a SQL dialect in BeamSQL"; I usually
 use Beam CalciteSQL to refer to Calcite's SQL dialect).
 
 At least from the user's perspective, it's simple to use: set the planner
 name in BeamSqlPipelineOptions and BeamSQL will initialize the corresponding
 planner; either Calcite or ZetaSQL is supported now.
  
> From GitHub, ZetaSQL is mainly in C++ so what you are doing is a port or 
> a connector to ZetaSQL? Do we need to depend on 
> https://github.com/google/zetasql ? ZetaSQL looks interesting but I could 
> barely find any doc for end users.
 
 ZetaSQL provides a Java interface which calls the C++ binary through JNI.
 For using ZetaSQL in BeamSQL, we only need to depend on the ZetaSQL jars in
 Maven Central (https://mvnrepository.com/search?q=zetasql). These jars
 contain all we need to call the ZetaSQL analyzer from Java.
 
> 
> Also, I'd prefer the PR to be split into two, one for the pluggable 
> interface and one for the ZetaSQL.
> 
 Pluggable planner is already a separate PR merged before: 
 https://github.com/apache/beam/pull/7745  
 
 
 -Rui
 
  
> Thanks,
> Manu
> 
>  
> 
>> On Sat, Aug 3, 2019 at 10:06 AM Ahmet Altay  wrote:
>> Thank you Rui for the heads up.
>> 
>> A question to the community, does the size of the change require any 
>> process besides the usual PR reviews?
>> 
>> On Fri, Aug 2, 2019 at 10:23 AM Rui Wang  wrote:
>>> Hi community,
>>> 
>>> I have been working on supporting ZetaSQL[1] as a SQL dialect in 
>>> BeamSQL. ZetaSQL is a SQL analyzer open sourced by Google. Here is 
>>> ZetaSQL's documentation[2].
>>> 
>>> Briefly, the design of integrating ZetaSQL with BeamSQL is: I made a
>>> pluggable query planner interface in BeamSQL, and we can easily plug in
>>> a new planner [3] (in my case, the ZetaSQL planner). Actually anyone can
>>> add new planners this way (e.g. for a PostgreSQL dialect).
>>>
>>> I want to contribute the ZetaSQL planner and its related code (~10k lines)
>>> to the Beam repo (#9210). This contribution barely touches existing Beam
>>> code (because the idea is a pluggable planner).
>>> 
>>> 
>>> Acknowledgement
>>> Thanks to all the people who provided help during Beam ZetaSQL 
>>> development: Matthew Brown, Brian Hulette, Andrew Pilloud, Kenneth 
>>> Knowles, Anton Kedin and Mikhail Gryzykhin. This list is not exhaustive;
>>> thanks also for contributions that are not listed.
>>> 
>>> 
>>> [1]: https://github.com/google/zetasql
>>> [2]: https://github.com/google/zetasql/tree/master/docs
>>> [3]: 
>>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/QueryPlanner.java
>>> 
>>> 
>>> -Rui
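
A minimal sketch of the planner switch Rui describes above, assuming the
pluggable-planner option lives in BeamSqlPipelineOptions as mentioned; the
ZetaSQL planner class name below is an assumption based on this thread:

import org.apache.beam.sdk.extensions.sql.impl.BeamSqlPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Select the query planner per pipeline; BeamSQL instantiates it by name.
BeamSqlPipelineOptions options =
    PipelineOptionsFactory.create().as(BeamSqlPipelineOptions.class);
options.setPlannerName(
    "org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner");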


Re: [PROPOSAL] Revised streaming extensions for Beam SQL

2019-07-24 Thread Mingmin Xu
+1 to removing those magic words in Calcite streaming SQL, just because
they're not standard SQL. The idea of replacing HOP/TUMBLE with
table-valued functions makes it concise; my only question is, is it (or will
it be) part of the SQL standard? --I'm a big fan of aligning with standards :lol

PS: although the concept of `window` used here is different from window
functions in SQL, the syntax gives some insight. Take the example of
`ROW_NUMBER() OVER (PARTITION BY COL1 ORDER BY COL2) AS row_number`:
`ROW_NUMBER()` assigns a sequence value to the records in the subgroup with
key 'COL1'. We could introduce another function, like TUMBLE(), which would
assign a window instance (more instances for HOP()) to the record, as in the
sketch below.
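
For concreteness, a sketch of the table-valued function form, roughly
following the paper's examples (table and column names are illustrative):

SELECT id, wstart, wend, COUNT(*)
FROM TABLE Tumble(
  data => TABLE Bid,
  timecol => DESCRIPTOR(bidtime),
  dur => INTERVAL '1' HOUR)
GROUP BY id, wstart, wend;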

Mingmin


On Sun, Jul 21, 2019 at 9:42 PM Manu Zhang  wrote:

> Thanks Kenn,
> great paper and left some newbie questions on the proposal.
>
> Manu
>
> On Fri, Jul 19, 2019 at 1:51 AM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> I recently had the great privilege to work with others from Beam plus
>> Calcite and Flink SQL contributors to build a new and minimal proposal for
>> adding streaming extensions to standard SQL: event time, watermarks,
>> windowing, triggers, stream materialization.
>>
>> We hope this will influence the standard body and also Calcite and Flink
>> and other projects working on the streaming SQL.
>>
>> I would like to start implementing these extensions in Beam, moving from
>> our current streaming extensions to the new proposal.
>>
>>The whole paper is https://arxiv.org/abs/1905.12133
>>
>>My small proposal to start in Beam:
>> https://s.apache.org/streaming-beam-sql
>>
>> TL;DR: replace `GROUP BY Tumble/Hop/Session` with table functions that do
>> Tumble, Hop, Session. The details of why to make this change are explained
>> in the appendix to my proposal. For the big picture of how it fits in, the
>> full paper is best.
>>
>> Kenn
>>
>

-- 

Mingmin


Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-06 Thread Mingmin Xu
Good point to reject DISTINCT operations currently, as they're not handled
now. There could be more similar cases that need to be revised and
documented well.

Regarding how to support DISTINCT, I was confused by the stateful CombineFn
at first. To keep it simple, we can extend step by step, e.g. rejecting
DISTINCT+nonDISTINCT cases:
1. a customized CombineFn looks good for even the DISTINCT+nonDISTINCT case;
I'm just not sure about the performance impact, and it looks odd when users
write a UDAF;
2. for those with only one DISTINCT, it's not hard to handle with a Calcite
rule or during the compile step, with two GBKs (see the sketch below). To
handle the DISTINCT+nonDISTINCT case this way, maybe we can try
JOIN(GBK_for_non_DISTINCT, 2_GBK_for_DISTINCT); this needs more tests to
confirm;
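
A hedged sketch of the two-GBK rewrite for a single DISTINCT aggregation
(schema names are illustrative):

-- COUNT(DISTINCT user_id) per key, rewritten as two GROUP BYs:
SELECT key, COUNT(*) AS distinct_users
FROM (SELECT key, user_id FROM events GROUP BY key, user_id)
GROUP BY key;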

Mingmin

On Fri, May 3, 2019 at 2:38 PM Rui Wang  wrote:

> A compromise solution would be using SELECT DISTINCT or GROUP BY to
> deduplicate before applying aggregations. It's two shuffles and works on
> non-floating-point columns. The good thing is no code change is needed, but
> the downsides are that users need to write a more complicated query and
> floating-point data is not supported.
>
>
> -Rui
>
> On Fri, May 3, 2019 at 1:23 PM Rui Wang  wrote:
>
>> Fair point. BeamSQL lacks proper benchmarks to test the performance
>> and scalability of implementations.
>>
>>
>> -Rui
>>
>> On Fri, May 3, 2019 at 12:56 PM Reuven Lax  wrote:
>>
>>> Back to the original point: I'm very skeptical of adding something that
>>> does not scale at all. In our experience, users get far more upset with an
>>> advertised feature that doesn't work for them (e.g. their workers OOM) than
>>> with a missing feature.
>>>
>>> Reuven
>>>
>>> On Fri, May 3, 2019 at 12:41 PM Kenneth Knowles  wrote:
>>>
 All good points. My version of the two shuffle approach does not work
 at all.

 On Fri, May 3, 2019 at 11:38 AM Brian Hulette 
 wrote:

> Rui's point about FLOAT/DOUBLE columns is interesting as well. We
> couldn't support distinct aggregations on floating point columns with the
> two-shuffle approach, but we could with the CombineFn approach. I'm not
> sure if that's a good thing or not, it seems like an anti-pattern to do a
> distinct aggregation on floating point numbers but I suppose the spec
> allows it.
>

 I can't find the Jira, but grouping on doubles has been discussed at
 some length before. Many DBMSs do not provide this, so it is not generally
 expected by SQL users. That is good, because mathematically it is
 questionable - floating point is usually used as a stand-in for real
 numbers, where computing equality is not generally possible. So any code
 that actually depends on equality of floating points is likely susceptible
 to rounding errors, other quirks of floating point, and also is probably
 misguided because the underlying thing that floats are approximating
 already cannot be checked for equality.

 Kenn
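
A one-line reminder of why equality on doubles is questionable (standard
Java floating-point semantics):

System.out.println(0.1 + 0.2 == 0.3); // false: 0.1 + 0.2 is 0.30000000000000004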


>
> Brian
>
>
> On Fri, May 3, 2019 at 10:52 AM Rui Wang  wrote:
>
>> To clarify what I said ("So the two-shuffle approach will lead to two
>> different implementations for tables with and without FLOAT/DOUBLE
>> columns."):
>>
>> Basically I wanted to say that the two-shuffle approach will be an
>> implementation for some cases, and it will co-exist with the CombineFn
>> approach. In the future, when we start cost-based optimization in
>> BeamSQL, the CBO is supposed to compare the different plans.
>>
>> -Rui
>>
>> On Fri, May 3, 2019 at 10:40 AM Rui Wang  wrote:
>>
>>>
 As to the distinct aggregations: At the least, these queries should
 be rejected, not evaluated incorrectly.

>>>
>>> Yes. The least is not to support it and throw a clear message saying
>>> no (the current implementation ignores DISTINCT and executes all
>>> aggregations as ALL).
>>>
>>>
 The term "stateful CombineFn" is not one I would use, as the nature
 of state is linearity and the nature of CombineFn is parallelism. So I
 don't totally understand this proposal. If I replace stateful CombineFn
 with stateful DoFn with one combining state per column, then I think I
 understand. FWIW on a runner with scalable SetState or MapState it 
 will not
 be any risk at all.

 I see. "Stateful" is indeed misleading. In this thread, it was all
>>> about using simple CombineFn to achieve DISTINCT aggregation with 
>>> massive
>>> parallelism.
>>>
>>> But if you go the two shuffle route, you don't have to separate the
 aggregations and re-join them. You just have to incur the cost of the 
 GBK +
 DISTINCT for all columns, and just drop the secondary key for the 
 second
 shuffle, no?

 Two shuffle approach cannot be the unified approach because it
>>> requires to 

Re: kafka 0.9 support

2019-04-02 Thread Mingmin Xu
We're still using Kafka 0.10 a lot, which is similar to 0.9 IMO. Supporting
multiple versions in KafkaIO is quite complex now, and it confuses users as
to which is supported and which is not. I would prefer to support only Kafka
2.0+ in the latest version. For old versions, there are some options:
1) document the Kafka versions Beam supports, like what we do in FlinkRunner;
2) maintain separate KafkaIOs for old versions.

1) would be easy to maintain, and I assume there should be no issue using
Beam-Core 3.0 together with KafkaIO 2.0.

Any thoughts?

Mingmin

On Tue, Apr 2, 2019 at 9:56 AM Reuven Lax  wrote:

> KafkaIO is marked as Experimental, and the comment already warns that 0.9
> support might be removed. I think that if users still rely on Kafka 0.9 we
> should leave a fork (renamed) of the IO in the tree for 0.9, but we can
> definitely remove 0.9 support from the main IO if we want, especially if
> it's complicated changes to that IO. If we do though, we should fail with a
> clear error message telling users to use the Kafka 0.9 IO.
>
> On Tue, Apr 2, 2019 at 9:34 AM Alexey Romanenko 
> wrote:
>
>> > How are multiple versions of Kafka supported? Are they all in one
>> client, or is there a case for forks like ElasticSearchIO?
>>
>> They are supported in one client, but we have an additional "ConsumerSpEL"
>> adapter which unifies the interface differences among different Kafka
>> client versions (mostly to support the old ones, 0.9-0.10.0).
>>
>> On the other hand, we warn user in Javadoc of KafkaIO (which is Unstable,
>> btw) by the following:
>> *“KafkaIO relies on kafka-clients for all its interactions with the Kafka
>> cluster.**kafka-clients versions 0.10.1 and newer are supported at
>> runtime. The older versions 0.9.x **- 0.10.0.0 are also supported, but
>> are deprecated and likely be removed in near future.”*
>>
>> Personally, I'd prefer to have only one unified client interface, but
>> since people still use Beam with old Kafka instances, we likely should
>> stick with it till Beam 3.0.
>>
>> WDYT?
>>
>> On 2 Apr 2019, at 02:27, Austin Bennett 
>> wrote:
>>
>> FWIW --
>>
>> On my (desired, not explicitly job-function) roadmap is to tap into a
>> bunch of our corporate Kafka queues to ingest that data to places I can
>> use.  Those are 'stuck' on 0.9, with no upgrade in sight (I am told the
>> upgrade path isn't trivial, these are very critical flows, and they are
>> scared of it breaking, so it just sits behind firewalls, etc). But I
>> wouldn't begin that for probably at least another quarter.
>>
>> I don't contribute to nor understand the burden of maintaining the
>> support for the older version, so can't reasonably lobby for that continued
>> pain.
>>
>> Anecdotally, this could be a place many enterprises are at (though I also
>> wonder whether many of the people that would be 'stuck' on such versions
>> would also have Beam on their current radar).
>>
>>
>> On Mon, Apr 1, 2019 at 2:29 PM Kenneth Knowles  wrote:
>>
>>> This could be a backward-incompatible change, though that notion has
>>> many interpretations. What matters is user pain. Technically if we don't
>>> break the core SDK, users should be able to use Java SDK >=2.11.0 with
>>> KafkaIO 2.11.0 forever.
>>>
>>> How are multiple versions of Kafka supported? Are they all in one
>>> client, or is there a case for forks like ElasticSearchIO?
>>>
>>> Kenn
>>>
>>> On Mon, Apr 1, 2019 at 10:37 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
 +1 to remove 0.9 support.

 I think it's more interesting to test and verify Kafka 2.2.0 than 0.9 ;)

 Regards
 JB

 On 01/04/2019 19:36, David Morávek wrote:
 > Hello,
 >
 > is there still a reason to keep Kafka 0.9 support? This unfortunately
 > adds lot of complexity to KafkaIO implementation.
 >
 > Kafka 0.9 was released on Nov 2015.
 >
 > My first shot on removing Kafka 0.9 support would remove second
 > consumer, which is used for fetching offsets.
 >
 > WDYT? Is this support worth keeping?
 >
 > https://github.com/apache/beam/pull/8186
 >
 > D.

 --
 Jean-Baptiste Onofré
 jbono...@apache.org
 http://blog.nanthrax.net
 Talend - http://www.talend.com

>>>
>>

-- 

Mingmin


Re: Migrating Beam SQL to Calcite's code generation

2018-11-15 Thread Mingmin Xu
1. Window start/end: Actually this is already provided in other ways and
the window in the SQL environment is unused and just waiting to be deleted.
So you can still access TUMBLE_START, etc. This is well-defined as a part
of the row so there's no semantic problem, but I think it should already
work.
*MM: Others work except SESSION_END();*

2. Pane information: I don't think access to pane info is enough for
correct results for a SQL join that triggers more than once. The pane info
is part of a Beam element, but these records just represent a kind of
changelog of the aggregation/join. The general solution is retractions.
Until we finish that, you need to follow the Join/CoGBK with custom logic ,
often a stateful DoFn to get the join results right. For example, if both
inputs are append-only relations and it is an equijoin, then you can do
this with a dedupe when you unpack the CoGbkResult. I am guessing this is
the main use case for BEAM-5204. Is it your use case?
*MM: my case is a self-join with SQL-only, written as [DISCARD_Pane JOIN
ACCU_Pane];*
*These UDFs are not a blocker; the limitation in BEAM-5204 should be removed
directly IMO. With multiple triggers assigned, developers need to handle the
output, which is not complex with the Java SDK but very hard for SQL-only
cases.*
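
For reference, a rough sketch of the dedupe Kenn describes below, assuming an
equijoin over append-only inputs; the key/pair encoding is illustrative, and
SetState support varies by runner:

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.SetState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Follows the Join/CoGBK: drops join results that were already emitted
// by an earlier trigger firing for the same key.
class DedupeJoinResults extends DoFn<KV<String, String>, KV<String, String>> {
  @StateId("emitted")
  private final StateSpec<SetState<String>> emittedSpec =
      StateSpecs.set(StringUtf8Coder.of());

  @ProcessElement
  public void process(ProcessContext c,
      @StateId("emitted") SetState<String> emitted) {
    String pair = c.element().getValue(); // encoded (left, right) pair
    if (!emitted.contains(pair).read()) {
      emitted.add(pair);
      c.output(c.element());
    }
  }
}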


On Thu, Nov 15, 2018 at 10:54 AM Kenneth Knowles  wrote:

> From https://issues.apache.org/jira/browse/BEAM-5204 it seems like what
> you most care about is to have joins that trigger more than once per
> window. To accomplish it you hope to build an "escape hatch" from
> SQL/relational semantics to specialized Beam SQL semantics. It could make
> sense with extreme care.
>
> Separating the two parts:
>
> 1. Window start/end: Actually this is already provided in other ways and
> the window in the SQL environment is unused and just waiting to be deleted.
> So you can still access TUMBLE_START, etc. This is well-defined as a part
> of the row so there's no semantic problem, but I think it should already
> work.
>
> 2. Pane information: I don't think access to pane info is enough for
> correct results for a SQL join that triggers more than once. The pane info
> is part of a Beam element, but these records just represent a kind of
> changelog of the aggregation/join. The general solution is retractions.
> Until we finish that, you need to follow the Join/CoGBK with custom logic ,
> often a stateful DoFn to get the join results right. For example, if both
> inputs are append-only relations and it is an equijoin, then you can do
> this with a dedupe when you unpack the CoGbkResult. I am guessing this is
> the main use case for BEAM-5204. Is it your use case?
>
> Kenn
>
> On Thu, Nov 15, 2018 at 10:08 AM Mingmin Xu  wrote:
>
>> Raising this thread again.
>> It seems there are more changes in how a FUNCTION is executed in the
>> backend, as noticed in #6996
>> <https://github.com/apache/beam/pull/6996>:
>> 1. BeamSqlExpression and BeamSqlExpressionExecutor are removed;
>> 2. BeamSqlExpressionEnvironment is removed;
>>
>> Then,
>> 1. for Calcite-defined FUNCTIONS, it uses Calcite-generated code (which
>> is great; duplicating that work would be pointless);
>> *2. there is no way to access the Beam context now;*
>>
>> For *#2*, I think we need to find a way to expose it; at least our
>> UDFs/UDAFs should be able to access it to leverage the advantages of the
>> Beam module.
>>
>> Any comments?
>>
>>
>> On Wed, Sep 19, 2018 at 2:55 PM Rui Wang  wrote:
>>
>>> This is such an exciting change!
>>>
>>> Since we cannot mix the current implementation with Calcite code
>>> generation, is there any case that Calcite code generation does not
>>> support but our current implementation does, such that switching to
>>> Calcite code generation will have some impact on existing usage?
>>>
>>> -Rui
>>>
>>> On Wed, Sep 19, 2018 at 11:53 AM Andrew Pilloud 
>>> wrote:
>>>
>>>> To follow up on this, the PR is now in a reviewable state and I've
>>>> added more tests for FLOOR and CEIL. Both work with a more extensive set of
>>>> arguments after this change. There are now 4 outstanding calcite PRs that
>>>> get all the tests passing.
>>>>
>>>> Unfortunately there is no easy way to mix our current implementation
>>>> and using Calcite's code generator.
>>>>
>>>> Andrew
>>>>
>>>> On Mon, Sep 17, 2018 at 3:22 PM Mingmin Xu  wrote:
>>>>
>>>>> Awesome work, we should call Calcite operator functions if available.
>>>>>
>>>>> I haven't had time to read the PR yet; for those operators impacted, we
>>>>> would keep the existing implementation.

Re: Migrating Beam SQL to Calcite's code generation

2018-11-15 Thread Mingmin Xu
Raising this thread again.
It seems there are more changes in how a FUNCTION is executed in the
backend, as noticed in #6996 <https://github.com/apache/beam/pull/6996>:

1. BeamSqlExpression and BeamSqlExpressionExecutor are removed;
2. BeamSqlExpressionEnvironment is removed;

Then,
1. for Calcite-defined FUNCTIONS, it uses Calcite-generated code (which is
great; duplicating that work would be pointless);
*2. there is no way to access the Beam context now;*

For *#2*, I think we need to find a way to expose it; at least our UDFs/UDAFs
should be able to access it to leverage the advantages of the Beam module.

Any comments?


On Wed, Sep 19, 2018 at 2:55 PM Rui Wang  wrote:

> This is such an exciting change!
>
> Since we cannot mix the current implementation with Calcite code
> generation, is there any case that Calcite code generation does not support
> but our current implementation does, such that switching to Calcite code
> generation will have some impact on existing usage?
>
> -Rui
>
> On Wed, Sep 19, 2018 at 11:53 AM Andrew Pilloud 
> wrote:
>
>> To follow up on this, the PR is now in a reviewable state and I've added
>> more tests for FLOOR and CEIL. Both work with a more extensive set of
>> arguments after this change. There are now 4 outstanding calcite PRs that
>> get all the tests passing.
>>
>> Unfortunately there is no easy way to mix our current implementation and
>> using Calcite's code generator.
>>
>> Andrew
>>
>> On Mon, Sep 17, 2018 at 3:22 PM Mingmin Xu  wrote:
>>
>>> Awesome work; we should call Calcite operator functions if available.
>>>
>>> I haven't had time to read the PR yet; for those operators impacted, we
>>> would keep the existing implementation. One example: I noticed recently
>>> that FLOOR/CEIL only support months/years, which is quite a surprise to
>>> me.
>>>
>>> Mingmin
>>>
>>> On Mon, Sep 17, 2018 at 3:03 PM Anton Kedin  wrote:
>>>
>>>> This is pretty amazing! Thank you for doing this!
>>>>
>>>> Regards,
>>>> Anton
>>>>
>>>> On Mon, Sep 17, 2018 at 2:27 PM Andrew Pilloud 
>>>> wrote:
>>>>
>>>>> I've adapted Calcite's EnumerableCalc code generation to generate the
>>>>> BeamCalc DoFn. The primary purpose behind this change is so we can take
>>>>> advantage of Calcite's extensive SQL operator implementation. This deletes
>>>>> ~11000 lines of code from Beam (with ~350 added), significantly increases
>>>>> the set of supported SQL operators, and improves performance and
>>>>> correctness of currently supported operators. Here is my work in progress:
>>>>> https://github.com/apache/beam/pull/6417
>>>>>
>>>>> There are a few bugs in Calcite that this has exposed:
>>>>>
>>>>> Fixed in Calcite master:
>>>>>
>>>>>- CALCITE-2321 <https://issues.apache.org/jira/browse/CALCITE-2321>
>>>>>- The type of a union of CHAR columns of different lengths should be 
>>>>> VARCHAR
>>>>>- CALCITE-2447 <https://issues.apache.org/jira/browse/CALCITE-2447>
>>>>>- Some POWER, ATAN2 functions fail with NoSuchMethodException
>>>>>
>>>>> Pending PRs:
>>>>>
>>>>>- CALCITE-2529 <https://issues.apache.org/jira/browse/CALCITE-2529>
>>>>>- linq4j should promote integer to floating point when generating 
>>>>> function
>>>>>calls
>>>>>- CALCITE-2530 <https://issues.apache.org/jira/browse/CALCITE-2530>
>>>>>- TRIM function does not throw exception when the length of trim 
>>>>> character
>>>>>is not 1(one)
>>>>>
>>>>> More work:
>>>>>
>>>>>- CALCITE-2404 <https://issues.apache.org/jira/browse/CALCITE-2404>
>>>>>- Accessing structured-types is not implemented by the runtime
>>>>>- (none yet) - Support multi character TRIM extension in Calcite
>>>>>
>>>>> I would like to push these changes in with these minor regressions. Do
>>>>> any of these Calcite bugs block this functionality being adding to Beam?
>>>>>
>>>>> Andrew
>>>>>
>>>>
>>>
>>> --
>>> 
>>> Mingmin
>>>
>>

-- 

Mingmin


Re: Migrating Beam SQL to Calcite's code generation

2018-09-17 Thread Mingmin Xu
Awesome work; we should call Calcite operator functions if available.

I haven't had time to read the PR yet; for those operators impacted, we
would keep the existing implementation. One example: I noticed recently that
FLOOR/CEIL only support months/years, which is quite a surprise to me.

Mingmin

On Mon, Sep 17, 2018 at 3:03 PM Anton Kedin  wrote:

> This is pretty amazing! Thank you for doing this!
>
> Regards,
> Anton
>
> On Mon, Sep 17, 2018 at 2:27 PM Andrew Pilloud 
> wrote:
>
>> I've adapted Calcite's EnumerableCalc code generation to generate the
>> BeamCalc DoFn. The primary purpose behind this change is so we can take
>> advantage of Calcite's extensive SQL operator implementation. This deletes
>> ~11000 lines of code from Beam (with ~350 added), significantly increases
>> the set of supported SQL operators, and improves performance and
>> correctness of currently supported operators. Here is my work in progress:
>> https://github.com/apache/beam/pull/6417
>>
>> There are a few bugs in Calcite that this has exposed:
>>
>> Fixed in Calcite master:
>>
>>- CALCITE-2321 
>>- The type of a union of CHAR columns of different lengths should be 
>> VARCHAR
>>- CALCITE-2447  -
>>Some POWER, ATAN2 functions fail with NoSuchMethodException
>>
>> Pending PRs:
>>
>>- CALCITE-2529 
>>- linq4j should promote integer to floating point when generating function
>>calls
>>- CALCITE-2530 
>>- TRIM function does not throw exception when the length of trim character
>>is not 1(one)
>>
>> More work:
>>
>>- CALCITE-2404  -
>>Accessing structured-types is not implemented by the runtime
>>- (none yet) - Support multi character TRIM extension in Calcite
>>
>> I would like to push these changes in with these minor regressions. Do
>> any of these Calcite bugs block this functionality being adding to Beam?
>>
>> Andrew
>>
>

-- 

Mingmin


Re: [SQL] Create External Schema

2018-08-13 Thread Mingmin Xu
Awesome proposal to integrate with existing external schemas; added some
comments in the doc.

On Mon, Aug 13, 2018 at 4:13 PM, Reuven Lax  wrote:

> Is it possible to extend Beam's SchemaRegistry to do this?
>
> On Mon, Aug 13, 2018 at 4:06 PM Anton Kedin  wrote:
>
>> Hi,
>>
>> I am planning to work on implementing a support for external schema
>> providers for Beam SQL and wanted to share a high level idea how I think
>> this can work.
>>
>> *Short Version*
>> Implement CREATE FOREIGN SCHEMA statement:
>>
>> CREATE FOREIGN SCHEMA
>>   TYPE 'bigquery'
>>   LOCATION 'dataset_example'
>>   AS bq;
>>
>> CREATE FOREIGN SCHEMA
>>   TYPE 'hcatalog'
>>   LOCATION 'hive-server:2341'
>>   AS hive;
>>
>> SELECT *
>> FROM
>>   bq.table_example_bq AS bq_table1
>> JOIN
>>   hive.table_example_hive AS hive_table1
>> ON
>>   bq_table1.some_field = hive_table1.some_other_field;
>>
>> *A Bit Longer Version:*
>> https://docs.google.com/document/d/1Ilk3OpDxrp3bHNlcnYDoj29tt9bd1E0EXt8i0WytNmQ
>>
>> Thoughts, ideas?
>>
>> Regards,
>> Anton
>>
>


-- 

Mingmin


Re: [VOTE] Apache Beam, version 2.6.0, release candidate #1

2018-08-02 Thread Mingmin Xu
+1
Verified with SQL component.

On Thu, Aug 2, 2018 at 10:05 AM, Thomas Weise  wrote:

> It does include *some* of the portable Flink runner (you will be able to
> run wordcount as documented on
> https://beam.apache.org/contribute/portability/#status).
>
> I would recommend to continue using master though, as we are still not
> fully at MVP, adding test coverage and also Flink version update.
>
>
> On Thu, Aug 2, 2018 at 9:52 AM Suneel Marthi  wrote:
>
>> Does this release include the Portability runner for Flink ? - Sorry I
>> have not read the release notes yet, pardon my asking again.
>>
>> On Wed, Aug 1, 2018 at 7:03 PM, Boyuan Zhang  wrote:
>>
>>> +1
>>> Tested Dataflow related items in:
>>> https://s.apache.org/beam-release-validation
>>>
>>> On Wed, Aug 1, 2018 at 11:40 AM Yifan Zou  wrote:
>>>
 +1
 Tested Python quickstarts and mobile gaming examples against tar and
 wheel versions.
 https://builds.apache.org/job/beam_PostRelease_Python_Candidate/123/

 On Wed, Aug 1, 2018 at 8:27 AM Andrew Pilloud 
 wrote:

> +1 tested the Beam SQL jar from the Maven Central repo, it worked.
>
> On Wed, Aug 1, 2018 at 7:37 AM Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>> Hi Pablo,
>>
>> +1, tested on my apps and libs and words after some fixed due to some
>> breaking changes in ArgProvider - but guess it is not "public" to need to
>> be reported.
>>
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github
>>  | LinkedIn
>>  | Book
>> 
>>
>>
>> Le mer. 1 août 2018 à 01:50, Pablo Estrada  a
>> écrit :
>>
>>> Hello everyone!
>>>
>>> I have been able to prepare a release candidate for Beam 2.6.0. : D
>>>
>>> Please review and vote on the release candidate #1 for the version
>>> 2.6.0, as follows:
>>>
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>> The complete staged set of artifacts is available for your review,
>>> which includes:
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to
>>> dist.apache.org [2], which is signed with the key with fingerprint
>>> 2F1FEDCDF6DD7990422F482F65224E0292DD8A51 [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v2.6.0-RC1" [5],
>>> * website pull request listing the release and publishing the API
>>> reference manual [6].
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by
>>> majority approval, with at least 3 PMC affirmative votes.
>>>
>>> Regards
>>> -Pablo.
>>>
>>> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?
>>> projectId=12319527=12343392
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.6.0/
>>> [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
>>> [4] https://repository.apache.org/content/repositories/
>>> orgapachebeam-1044/
>>> [5] https://github.com/apache/beam/tree/v2.6.0-RC1
>>> [6] https://github.com/apache/beam-site/pull/518
>>>
>>> --
>>> Got feedback? go/pabloem-feedback
>>> 
>>>
>>
>>


-- 

Mingmin


POM beam-sdks-java-extensions-parent is not available in repository.apache.org

2018-07-06 Thread Mingmin Xu
Hello,

It seems some versions are missing from the repository; can someone help to
deploy them?

[Releases]: 2.5.0
[Snapshots]: 2.6.0-SNAPSHOT

Thanks!

Mingmin


Re: Building and visualizing the Beam SQL graph

2018-06-14 Thread Mingmin Xu
Is there a guideline about how the name provided in
`PCollection.apply(String name, PTransform<PCollection<Row>,
PCollection<Row>> t)` is adopted by different runners? I suppose that should
be the option, to get a readable graph for all runners, instead of
'adjusting' it to make only the Dataflow runner work.
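
For illustration, a named apply looks like this (stage names here are
hypothetical examples of what the SQL-to-PTransform conversion could emit):

PCollection<Row> result = input
    .apply("BeamIOSourceRel_4", sourceTransform)
    .apply("BeamCalcRel_12", calcTransform);
// Most runner UIs surface these names, but how they render composite
// transforms varies by runner.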

On Thu, Jun 14, 2018 at 8:53 AM, Reuven Lax  wrote:

> There was a previous discussion about having generic attributes on
> PCollection. Maybe this is a good driving use case?
>
> On Wed, Jun 13, 2018 at 4:36 PM Kenneth Knowles  wrote:
>
>> Another thing to consider is that we might return something like a
>> "SqlPCollection" that is the PCollection plus additional metadata that
>> is useful to the shell / enumerable converter (such as if the PCollection
>> has a known finite size due to LIMIT, even if it is "unbounded", and the
>> shell can return control to the user once it receives enough rows). After
>> your proposed change this will be much more natural to do, so that's
>> another point in favor of the refactor.
>>
>> Kenn
>>
>> On Wed, Jun 13, 2018 at 10:22 AM Andrew Pilloud 
>> wrote:
>>
>>> One of my goals is to make the graph easier to read and map back to the
>>> SQL EXPLAIN output. The way the graph is currently built (`toPTransform` vs
>>> `toPCollection`) does make a big difference in that graph. I think it is
>>> also important to have a common function to do the apply with consistent
>>> naming. I think that will greatly help with ease of understanding. It
>>> sounds like what really want is this in the BeamRelNode interface:
>>>
>>> PInput buildPInput(Pipeline pipeline);
>>> PTransform<PInput, PCollection<Row>> buildPTransform();
>>>
>>> default PCollection<Row> toPCollection(Pipeline pipeline) {
>>>   return buildPInput(pipeline).apply(getStageName(), buildPTransform());
>>> }
>>>
>>> Andrew
>>>
>>> On Mon, Jun 11, 2018 at 2:27 PM Mingmin Xu  wrote:
>>>
>>>> EXPLAIN shows the execution plan from the SQL perspective only. After
>>>> converting to a Beam composite PTransform, there are more steps
>>>> underneath, and each runner re-organizes the Beam PTransforms again,
>>>> which makes the final pipeline hard to read. In the SQL module itself, I
>>>> don't see any difference between `toPTransform` and `toPCollection`. We
>>>> could have an easy-to-understand step name when converting RelNodes, but
>>>> runners show the graph to developers.
>>>>
>>>> Mingmin
>>>>
>>>> On Mon, Jun 11, 2018 at 2:06 PM, Andrew Pilloud 
>>>> wrote:
>>>>
>>>>> That sounds correct. And because each rel node might have a different
>>>>> input, there isn't a standard interface (like
>>>>> PTransform<PCollection<Row>, PCollection<Row>> toPTransform());
>>>>>
>>>>> Andrew
>>>>>
>>>>> On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles 
>>>>> wrote:
>>>>>
>>>>>> Agree with that. It will be kind of tricky to generalize. I think
>>>>>> there are some criteria in this case that might apply in other cases:
>>>>>>
>>>>>> 1. Each rel node (or construct of a DSL) should have a PTransform for
>>>>>> how it computes its result from its inputs.
>>>>>> 2. The inputs to that PTransform should actually be the inputs to the
>>>>>> rel node!
>>>>>>
>>>>>> So I tried to improve #1 but I probably made #2 worse.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin 
>>>>>> wrote:
>>>>>>
>>>>>>> Not answering the original question, but doesn't "explain" satisfy
>>>>>>> the SQL use case?
>>>>>>>
>>>>>>> Going forward we probably want to solve this in a more general way.
>>>>>>> We have at least 3 ways to represent the pipeline:
>>>>>>>  - how runner executes it;
>>>>>>>  - what it looks like when constructed;
>>>>>>>  - what the user was describing in DSL;
>>>>>>> And there will probably be more, if extra layers are built on top of
>>>>>>> DSLs.
>>>>>>>
>>>>>>> If possible, we probably should be able to map any level of
>>>>>>> abstraction to any other to better understand and debug the pipelines.

Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Mingmin Xu
EXPLAIN shows the execution plan from the SQL perspective only. After
converting to a Beam composite PTransform, there are more steps underneath,
and each runner re-organizes the Beam PTransforms again, which makes the
final pipeline hard to read. In the SQL module itself, I don't see any
difference between `toPTransform` and `toPCollection`. We could have an
easy-to-understand step name when converting RelNodes, but runners show the
graph to developers.

Mingmin

On Mon, Jun 11, 2018 at 2:06 PM, Andrew Pilloud  wrote:

> That sounds correct. And because each rel node might have a different
> input, there isn't a standard interface (like
> PTransform<PCollection<Row>, PCollection<Row>> toPTransform());
>
> Andrew
>
> On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles  wrote:
>
>> Agree with that. It will be kind of tricky to generalize. I think there
>> are some criteria in this case that might apply in other cases:
>>
>> 1. Each rel node (or construct of a DSL) should have a PTransform for how
>> it computes its result from its inputs.
>> 2. The inputs to that PTransform should actually be the inputs to the rel
>> node!
>>
>> So I tried to improve #1 but I probably made #2 worse.
>>
>> Kenn
>>
>> On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin  wrote:
>>
>>> Not answering the original question, but doesn't "explain" satisfy the
>>> SQL use case?
>>>
>>> Going forward we probably want to solve this in a more general way. We
>>> have at least 3 ways to represent the pipeline:
>>>  - how runner executes it;
>>>  - what it looks like when constructed;
>>>  - what the user was describing in DSL;
>>> And there will probably be more, if extra layers are built on top of
>>> DSLs.
>>>
>>> If possible, we probably should be able to map any level of abstraction
>>> to any other to better understand and debug the pipelines.
>>>
>>>
>>> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles  wrote:
>>>
 In other words, revert https://github.com/apache/beam/pull/4705/files,
 at least in spirit? I agree :-)

 Kenn

 On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud 
 wrote:

> We are currently converting the Calcite Rel tree to Beam by
> recursively building a tree of nested PTransforms. This results in a weird
> nested graph in the dataflow UI where each node contains its inputs nested
> inside of it. I'm going to change the internal data structure for
> converting the tree from a PTransform to a PCollection, which will result
> in a more accurate representation of the tree structure being built and
> should simplify the code as well. This will not change the public 
> interface
> to SQL, which will remain a PTransform. Any thoughts or objections?
>
> I was also wondering if there are tools for visualizing the Beam graph
> aside from the dataflow runner UI. What other tools exist?
>
> Andrew
>



-- 

Mingmin


Re: Merge options in Github UI are confusing

2018-04-17 Thread Mingmin Xu
Not strongly against `*Create a merge commit*`, but I use `squash and
merge` by default. I understand the potential impact mentioned by Andrew;
it's still the better option IMO:
1. if a PR contains several parts, they can be documented in the commit
message instead of several commits --if it's a big task, let's split it into
several PRs if possible;
2. when several PRs change the same file, I would ask the contributor to
fix it;
3. most commits are introduced by reviewer asks, so it's not necessary for
contributors to do another squash before merging;

On Tue, Apr 17, 2018 at 1:09 PM, Robert Burke  wrote:

> +1 Having made a few web commits and been frustrated by the options,
> anything to standardize on a single option seems good to me.
>
> On Tue, 17 Apr 2018 at 01:49 Etienne Chauchot 
> wrote:
>
>> +1 to enforce the behavior recommended in the committer guide. I usually
>> ask the author to manually squash before committing.
>>
>> Etienne
>>
>> Le lundi 16 avril 2018 à 22:19 +, Robert Bradshaw a écrit :
>>
>> +1, though I'll admit I've been an occasional user of the "squash and
>> merge" button when a small PR has a huge number of small, fixup changes
>> piled on it.
>>
>> On Mon, Apr 16, 2018 at 3:07 PM Kenneth Knowles  wrote:
>>
>> It is no secret that I agree with this. When you don't rewrite history,
>> distributed git "just works". I didn't realize we could mechanically
>> enforce it.
>>
>> Kenn
>>
>> On Mon, Apr 16, 2018 at 2:55 PM Andrew Pilloud 
>> wrote:
>>
>>
>>
>>
>> The Github UI provides several options for merging a PR hidden behind
>> the “Merge pull request” button. Only the “Create a merge commit” option
>> does what most users expect, which is to merge by creating a new merge
>> commit. This is the option recommended in the Beam committer’s guide, but
>> it is not necessarily the default behavior of the merge button.
>>
>> A small cleanup PR I made was recently merged via the merge button which
>> generated a squash merge instead of a merge commit, breaking two other PRs
>> which were based on it. See https://github.com/apache/beam/pull/4991
>>
>> I would propose that we disable the options for both rebase and squash
>> merging via the Github UI. This will make the behavior of the merge button
>> unambiguous and consistent with our documentation, but will not prevent a
>> committer from performing these operations from the git cli if they
>> desire.
>>
>> Andrew
>>
>>
>>


-- 

Mingmin


Re: SQL in Python SDK

2018-04-13 Thread Mingmin Xu
With the current implementation we're not able to extend it for Python, as
Calcite has a Java API only. A separate Python-based SQL implementation would
be the solution. Based on our practice, we write lots of UDFs/UDAFs and
customized TABLEs to fit our own data sources/storage. For the former it could
be possible with an adaptor like https://github.com/ninia/jep (just a rough
idea, not verified); for the latter I don't see an option so far.
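
For reference, the Java-side UDF shape that such an adaptor would have to plug
into looks roughly like this (a minimal sketch against the Java SQL extension;
`CubicUdf` is an illustrative name, and registration details vary by version):

```java
import org.apache.beam.sdk.extensions.sql.BeamSqlUdf;

/** A scalar UDF: the SQL extension binds the static eval() method. */
public class CubicUdf implements BeamSqlUdf {
  public static Double eval(Double input) {
    return input * input * input;
  }
}
```

A jep-based adaptor would presumably delegate from eval() into an embedded
Python interpreter; that is the unverified part.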

On Fri, Apr 13, 2018 at 9:11 AM, Robert Bradshaw 
wrote:

> On Fri, Apr 13, 2018 at 8:16 AM Andrew Pilloud 
> wrote:
>
>> Hi Gabor,
>>
>> Are Python UDFs (User-defined functions) something that might work for
>> you? If all you really need to write in Python is your DoFn this is
>> probably your best option.
>>
>
> +1. Note that since Python has tuples as builtin objects, this is a bit
> easier than in Java.
>
>
>> It is still a bit of work but we support Java UDFs today, so all you
>> would need to do is write a Java wrapper to call your Python function.
>>
>
> This is easier said than done (in an efficient manner at least), and
> probably the most compelling reason to implement SQL in pure Python (though
> I'm not convinced it outweighs the downsides).
>
>
>>
>> Andrew
>>
>>
>> On Fri, Apr 13, 2018, 7:58 AM Kenneth Knowles  wrote:
>>
>>> The most recent work on cross-language pipeline authoring is the design
>>> brainstorming at https://s.apache.org/beam-mixed-language-pipelines so
>>> it is still in the preliminary stages. There's no basic mystery, but there
>>> are a lot of practical considerations about what is easy to run on a
>>> pipeline author's machine.
>>>
>>> Regarding Apache Calcite - it is a Java library. It doesn't really make
>>> sense to bind it to Python. Today we don't use most of its capabilities. We
>>> just use it as a parser mostly. It would be easy to find an existing parser
>>> in Python or write your own (with ply, the basics could be done within a
>>> day). But still I don't think it makes sense to reimplement and maintain
>>> the SQL-to-Beam translation in multiple languages.
>>>
>>> Kenn
>>>
>>> On Fri, Apr 13, 2018 at 2:43 AM Reuven Lax  wrote:
>>>
 If someone implemented it directly in Python then it would be supported
 directly in Python. I don't know if anyone is actively working on that -
 the current implementation uses Apache Calcite, and I don't know whether
 they have a Python API.

 On Fri, Apr 13, 2018 at 9:40 AM Prabeesh K. 
 wrote:

> What about supporting SQL in Python SDK?
>
> On 13 April 2018 at 13:32, Reuven Lax  wrote:
>
>> The portability work will allow the Python and Java SDKs to be used
>> in the same pipeline, though this work is not yet complete.
>>
>>
> This would be an interesting feature.
>
> On Fri, Apr 13, 2018 at 9:15 AM Gabor Hermann 
>> wrote:
>>
>>> Hey all,
>>>
>>> Are there any efforts towards supporting SQL from the Python SDK,
>>> not
>>> just from Java? I couldn't find any info about this in JIRA or
>>> mailing
>>> lists.
>>>
>>> How much effort do you think it would take to implement this? Are
>>> there
>>> some dependencies like supporting more features in Python? I know
>>> that
>>> the Python SDK is experimental.
>>>
>>> As an alternative, is there a way to combine Python and Java SDKs in
>>> the
>>> same pipeline?
>>>
>>> Thanks for your answers in advance!
>>>
>>> Cheers,
>>> Gabor
>>>
>>>
>
>


-- 

Mingmin


daily build is not consistent

2018-03-27 Thread Mingmin Xu
Hi all,

I've found that the daily snapshot build can succeed only partially, which
causes failures with SNAPSHOT dependencies. Is it possible to have a
consistent 'deploy' action, so that all modules are published from the same
build?

Here's one example:
https://github.com/apache/beam/pull/4918 changes both
`beam-runners-flink_2.11` and `beam-sdks-java-core`. In the Apache repository,
the latest build for `beam-runners-flink_2.11` is `2.5.0-20180325.083704-16`,
while for `beam-sdks-java-core` it is `2.5.0-20180327.070739-24`. Now it fails
with an error because #4918 renamed some methods:

java.lang.NoSuchMethodError:
org.apache.beam.sdk.metrics.MetricQueryResults.counters()Ljava/lang/Iterable;
at
org.apache.beam.runners.flink.metrics.FlinkMetricContainer.updateMetrics(FlinkMetricContainer.java:91)
at
org.apache.beam.runners.flink.metrics.ReaderInvocationUtil.invokeStart(ReaderInvocationUtil.java:55)
at
org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:232)
at
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:86)
at
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
at
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:94)
at
org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:40)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:745)

Thank you!

Mingmin


Re: [SQL] Windowing and triggering changes proposal

2018-01-16 Thread Mingmin Xu
Thanks @Anton for the proposal. Window (w/ trigger) support in SQL is limited
now; you're very welcome to join the improvement.

There's a balance between the injected DSL mode and the CLI mode that we had
to strike when implementing BeamSQL overall, not only for windowing. Many
default behaviors were introduced to make it workable in a pure SQL CLI
scenario. If they limit the potential of the DSL mode, we should absolutely
adjust them.
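
For concreteness, a minimal sketch of the DSL-mode case the proposal targets,
assuming `events` is an input PCollection of rows (`BeamSql.query` and
`BeamRecord` are the SQL extension names as I recall them from recent
releases; the field names are made up):

```java
import org.apache.beam.sdk.extensions.sql.BeamSql;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.BeamRecord;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

PCollection<BeamRecord> windowed =
    events.apply(Window.<BeamRecord>into(FixedWindows.of(Duration.standardMinutes(5))));

// Today a simple GROUP BY applies GlobalWindows under the hood; under the
// proposal it would keep the 5-minute windows configured above.
PCollection<BeamRecord> counts = windowed.apply(
    BeamSql.query("SELECT user_id, COUNT(*) AS c FROM PCOLLECTION GROUP BY user_id"));
```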

Mingmin

On Tue, Jan 16, 2018 at 9:56 AM, Kenneth Knowles  wrote:

> I've commented on the doc. This is a really nice analysis and I think the
> proposal is good for making SQL work with Beam windowing and triggering in
> a way that will make sense to users.
>
> Kenn
>
> On Thu, Jan 11, 2018 at 4:05 PM, Anton Kedin  wrote:
>
>> Hi,
>>
>> Wanted to gather feedback on changes I propose to the behavior of some
>> aspects of windowing and triggering in Beam SQL.
>>
>> In short:
>>
> >> Beam SQL currently overrides the input PCollections' windowing/triggering
> >> configuration in a few cases. For example, if a query has a simple GROUP BY
>> clause, we would apply GlobalWindows. And it's not configurable by the
>> user, it happens under the hood of SQL.
>>
>> Proposal is to update the Beam SQL implementation in these cases to avoid
>> changing the input PCollections' configuration as much as possible.
>>
>> More details here: https://docs.google.com/docume
>> nt/d/1RmyV9e1Qab-axsLI1WWpw5oGAJDv0X7y9OSnPnrZWJk
>>
>> Regards,
>> Anton
>>
>
>


-- 

Mingmin


Re: KafkaIO reading from latest offset when pipeline fails on FlinkRunner

2018-01-10 Thread Mingmin Xu
@Sushil, I have several jobs running on KafkaIO + FlinkRunner; I hope my
experience can help you a bit.

In short, `ENABLE_AUTO_COMMIT_CONFIG` doesn't meet your requirement; you need
to leverage exactly-once checkpoints/savepoints in Flink. The reason is that
with `ENABLE_AUTO_COMMIT_CONFIG`, KafkaIO commits the offset after data is
read, and once the job is restarted KafkaIO reads from the last committed
offset.

In my jobs, I enable externalized checkpoints (externalized should be
optional, I think?) in exactly-once mode on the Flink cluster. When the job
auto-restarts on failures it doesn't lose data. When I need to manually
redeploy the job, I use a savepoint to cancel and relaunch it.
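
For reference, a minimal sketch of that setup (the broker, topic, and interval
values are placeholders, and `args` is the usual main() argument array):

```java
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

FlinkPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
// Checkpoint periodically, so a restart resumes from checkpointed reader
// state instead of the last committed (or latest) Kafka offset.
options.setCheckpointingInterval(60_000L);

Pipeline p = Pipeline.create(options);
p.apply(KafkaIO.<byte[], byte[]>read()
    .withBootstrapServers("broker1:9092")
    .withTopic("events")
    .withKeyDeserializer(ByteArrayDeserializer.class)
    .withValueDeserializer(ByteArrayDeserializer.class));
```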

Mingmin

On Wed, Jan 10, 2018 at 10:34 AM, Raghu Angadi  wrote:

> How often does your pipeline checkpoint/snapshot? If the failure happens
> before the first checkpoint, the pipeline could restart without any state,
> in which case KafkaIO would read from latest offset. There is probably some
> way to verify if pipeline is restarting from a checkpoint.
>
> On Sun, Jan 7, 2018 at 10:57 PM, Sushil Ks  wrote:
>
>> HI Aljoscha,
>>The issue is let's say I consumed 100 elements in 5
>> mins Fixed Window with *GroupByKey* and later I applied *ParDO* for all
>> those elements. If there is an issue while processing element 70 in
>> *ParDo *and the pipeline restarts with *UserCodeException *it's skipping
>> the rest 30 elements. Wanted to know if this is expected? In case if you
>> still having doubt let me know will share a code snippet.
>>
>> Regards,
>> Sushil Ks
>>
>
>


-- 

Mingmin


Re: Is upgrading to Kafka Client 0.10.2.0+ in the roadmap?

2017-10-30 Thread Mingmin Xu
Thanks for the feedback, glad to know that it works now.

Mingmin

On Mon, Oct 30, 2017 at 11:10 AM, Shen Li <cs.she...@gmail.com> wrote:

> Dear All,
>
> Thanks a lot for the information. I am using Beam-2.0.
> https://github.com/apache/beam/blob/release-2.0.0/sdks/
> java/io/kafka/pom.xml#L33
>
> I have just verified that adding Kafka-Client 0.11 in the application
> pom.xml works fine for me. I can now avoid the JAAS configuration file by
> using the "java.security.auth.login.config" property.
>
> Best,
> Shen
>
> On Mon, Oct 30, 2017 at 1:41 PM, Mingmin Xu <mingm...@gmail.com> wrote:
>
> > Hi Shen,
> >
> > Can you share which Beam version you are using? I just checked the master
> > code; the default Kafka client version is `0.11.0.1`.
> > I cannot recall the usage for old versions; my application (2.2.0-SNAPSHOT)
> > works with a customized Kafka version based on 0.10.00-SASL. What you need
> > to do is
> > 1). exclude the kafka-client in KafkaIO, and add your own Kafka client
> > library in pom.xml;
> > 2). add your configuration like:
> > ```
> > Map<String, Object> consumerPara = new HashMap<String, Object>();
> > //consumerPara.put(ConsumerConfig.GROUP_ID_CONFIG, consumerName);
> > //consumerPara.put(ConsumerConfig.CLIENT_ID_CONFIG, clientId);
> > if (secureEnabled) {
> >     consumerPara.put("sasl.mechanism", "IAF");
> >     consumerPara.put("security.protocol", "SASL_PLAINTEXT");
> >     consumerPara.put("sasl.login.class", ".");
> >     consumerPara.put("sasl.callback.handler.class", "...");
> > }
> >
> > KafkaIO.<byte[], byte[]>read()
> >     .updateConsumerProperties(consumerPara);
> > ```
> >
> > Mingmin
> >
> > On Mon, Oct 30, 2017 at 10:21 AM, Raghu Angadi
> <rang...@google.com.invalid
> > >
> > wrote:
> >
> > > >  https://issues.apache.org/jira/browse/BEAM-307
> > >
> > > This should be closed.
> > >
> > > On Mon, Oct 30, 2017 at 9:00 AM, Lukasz Cwik <lc...@google.com.invalid
> >
> > > wrote:
> > >
> > > > There has been some discussion about getting Kafka 0.10.x working on
> > > > BEAM-307[1].
> > > >
> > > > As an immediate way to unblock yourself, modify your local copy of
> the
> > > > KafkaIO source to include setting the system property in a static
> block
> > > > before the class is loaded or before the Kafka client is instantiated
> > and
> > > > used.
> > > >
> > > > Also consider contributing to the Kafka connector to getting 0.10.x
> > > > working.
> > > >
> > > > 1: https://issues.apache.org/jira/browse/BEAM-307
> > > >
> > > > On Mon, Oct 30, 2017 at 8:14 AM, Shen Li <cs.she...@gmail.com>
> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > To use KafkaIO in secure mode, I need to set
> > > > > -Djava.security.auth.login.config to point to a JAAS configuration
> > > file.
> > > > > It
> > > > > works fine for local execution. But how can I configure the
> > > > > "java.security.auth.login.config" property in the Beam app when
> the
> > > > > pipeline is submitted to a cluster/cloud-service? Even if I use a
> > ParDo
> > > > to
> > > > > set the system property, there is no guarantee that the ParDo will
> > run
> > > on
> > > > > the same server with the KafkaIO source.
> > > > >
> > > > > For this specific problem, it would be helpful to upgrade to Kafka
> > > Client
> > > > > 0.10.2.0, which provides a "sasl.jaas.config" property that can be
> > > > updated
> > > > > programmatically. Or is there any other work around?
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > 85%3A+Dynamic+JAAS+
> > > > > configuration+for+Kafka+clients
> > > > >
> > > > > Thanks,
> > > > > Shen
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > 
> > Mingmin
> >
>



-- 

Mingmin


Re: Is upgrading to Kafka Client 0.10.2.0+ in the roadmap?

2017-10-30 Thread Mingmin Xu
Hi Shen,

Can you share which Beam version you are using? I just checked the master
code; the default Kafka client version is `0.11.0.1`.
I cannot recall the usage for old versions; my application (2.2.0-SNAPSHOT)
works with a customized Kafka version based on 0.10.00-SASL. What you need
to do is
1). exclude the kafka-client in KafkaIO, and add your own Kafka client
library in pom.xml;
2). add your configuration like:
```
Map<String, Object> consumerPara = new HashMap<String, Object>();
//consumerPara.put(ConsumerConfig.GROUP_ID_CONFIG, consumerName);
//consumerPara.put(ConsumerConfig.CLIENT_ID_CONFIG, clientId);
if (secureEnabled) {
    consumerPara.put("sasl.mechanism", "IAF");
    consumerPara.put("security.protocol", "SASL_PLAINTEXT");
    consumerPara.put("sasl.login.class", ".");
    consumerPara.put("sasl.callback.handler.class", "...");
}

KafkaIO.<byte[], byte[]>read()
    .updateConsumerProperties(consumerPara);
```
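
For completeness: once the client is on 0.10.2+, the JAAS settings from the
original question can go into the same map via KIP-85's `sasl.jaas.config`
property, avoiding the system property entirely. A sketch, with a placeholder
login module and credentials:

```java
consumerPara.put("sasl.jaas.config",
    "org.apache.kafka.common.security.plain.PlainLoginModule required "
        + "username=\"alice\" password=\"secret\";");
```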

Mingmin

On Mon, Oct 30, 2017 at 10:21 AM, Raghu Angadi 
wrote:

> >  https://issues.apache.org/jira/browse/BEAM-307
>
> This should be closed.
>
> On Mon, Oct 30, 2017 at 9:00 AM, Lukasz Cwik 
> wrote:
>
> > There has been some discussion about getting Kafka 0.10.x working on
> > BEAM-307[1].
> >
> > As an immediate way to unblock yourself, modify your local copy of the
> > KafkaIO source to include setting the system property in a static block
> > before the class is loaded or before the Kafka client is instantiated and
> > used.
> >
> > Also consider contributing to the Kafka connector to getting 0.10.x
> > working.
> >
> > 1: https://issues.apache.org/jira/browse/BEAM-307
> >
> > On Mon, Oct 30, 2017 at 8:14 AM, Shen Li  wrote:
> >
> > > Hi,
> > >
> > > To use KafkaIO in secure mode, I need to set
> > > -Djava.security.auth.login.config to point to a JAAS configuration
> file.
> > > It
> > > works fine for local execution. But how can I configure the
> > > "java.security.auth.login.config" property in the Beam app when the
> > > pipeline is submitted to a cluster/cloud-service? Even if I use a ParDo
> > to
> > > set the system property, there is no guarantee that the ParDo will run
> on
> > > the same server with the KafkaIO source.
> > >
> > > For this specific problem, it would be helpful to upgrade to Kafka
> Client
> > > 0.10.2.0, which provides a "sasl.jaas.config" property that can be
> > updated
> > > programmatically. Or is there any other work around?
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 85%3A+Dynamic+JAAS+
> > > configuration+for+Kafka+clients
> > >
> > > Thanks,
> > > Shen
> > >
> >
>



-- 

Mingmin


Re: Support for window analytic functions in SQL DSL

2017-10-05 Thread Mingmin Xu
@Kobi,

Currently we don't support window analytic functions; feel free to create a
new-feature JIRA ticket.


On Thu, Oct 5, 2017 at 12:07 PM, Tyler Akidau <taki...@google.com> wrote:

> I'm not aware of analytic window support. +Mingmin Xu <mingm...@gmail.com>
>  or +James <xumingmi...@gmail.com> could speak to any plans they might
> have regarding adding support.
>
> -Tyler
>
> On Mon, Oct 2, 2017 at 3:23 AM Kobi Salant <kobi.sal...@gmail.com> wrote:
>
>> Hi,
>>
>> Calcite streaming documentation includes examples for using SQL window
>> analytic functions
>> https://calcite.apache.org/docs/stream.html#sliding-windows
>>
>> In the beam documentation https://beam.apache.org/documentation/dsls/sql/
>> there is no mention for this functionality.
>>
>> Does the DSL_SQL branch supports it or do we have future plans for it?
>>
>> Thanks
>> Kobi
>>
>


-- 

Mingmin


Re: Beam 2.2.0 release

2017-09-11 Thread Mingmin Xu
I don't think any particular permissions are involved; #3782 is merged to
master. So from the SQL perspective I'm good with the 2.2.0 release, and I'll
do a quick POC job for verification purposes.

Mingmin

On Mon, Sep 11, 2017 at 9:46 AM, Reuven Lax <re...@google.com.invalid>
wrote:

> Are these permissions that only you have, or does anyone on the PMC have
> these permissions? I'm asking so that in the future if you are unavailable,
> we know who has these permissions. We should also make sure this is all
> documented on the Beam release guide.
>
> Reuven
>
> On Wed, Sep 6, 2017 at 9:25 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> > It sounds good to me.
> >
> > By the way, you will need my help to complete the release process (as you
> > need some permissions that you don't have).
> >
> > Regards
> > JB
> >
> >
> > On 09/07/2017 01:00 AM, Reuven Lax wrote:
> >
> >> It sounds like SQL is still not in, and there are a couple of other PRs
> >> that people have requested in 2.2.0. I am mostly out next week, so let's
> >> set September 18 as a target date for cutting the first RC. That should
> >> hopefully give plenty of time to get SQL and the remaining PRs merged
> into
> >> master.
> >>
> >> Reuven
> >>
> >> On Thu, Aug 31, 2017 at 3:04 PM, Mingmin Xu <mingm...@gmail.com> wrote:
> >>
> >> Add https://issues.apache.org/jira/browse/BEAM-2833 which is a blocker
> to
> >>> merge DSL_SQL. There may be something wrong in the back-end (maybe
> >>> RunnerApi) in handling a parameterized CustomCoder in TestPipeline.
> >>>
> >>> On Thu, Aug 31, 2017 at 10:38 AM, Jean-Baptiste Onofré <
> j...@nanthrax.net>
> >>> wrote:
> >>>
> >>> Fair enough.
> >>>>
> >>>> That's fine for me.
> >>>>
> >>>> Regards
> >>>> JB
> >>>>
> >>>> On Aug 31, 2017, 19:03, at 19:03, Steve Niemitz <sniem...@apache.org>
> >>>> wrote:
> >>>>
> >>>>> I'll chime in as a user who would love to see 2.2.0 sooner than
> later,
> >>>>> specifically for the file IO Eugene mentioned.  We're using the
> AvroIO
> >>>>> enhancements extensively, but I am hesitant to run from HEAD in
> master
> >>>>> in
> >>>>> production.
> >>>>>
> >>>>> On Thu, Aug 31, 2017 at 12:42 PM, Eugene Kirpichov <
> >>>>> kirpic...@google.com.invalid> wrote:
> >>>>>
> >>>>> There are a lot of users including very large production customers
> >>>>>>
> >>>>> who have
> >>>>>
> >>>>>> been asking specifically for the features that are in 2.2.0 (most of
> >>>>>>
> >>>>> them
> >>>>>
> >>>>>> accumulated while 2.1.0 was being iterated on) - mostly I'm
> referring
> >>>>>>
> >>>>> to
> >>>>>
> >>>>>> the vastly improved file IO - and they have been hesitant to use
> Beam
> >>>>>>
> >>>>> at
> >>>>>
> >>>>>> HEAD in production. I think the slight unusualness of having a
> >>>>>>
> >>>>> release
> >>>>>
> >>>>>> published soon after the previous release is a small price to pay
> for
> >>>>>> helping those users :)
> >>>>>>
> >>>>>> On Wed, Aug 30, 2017, 11:30 PM Jean-Baptiste Onofré <
> j...@nanthrax.net>
> >>>>>> wrote:
> >>>>>>
> >>>>>> As we released 2.1.0 couple of weeks ago, it could sound weird to
> >>>>>>>
> >>>>>> the
> >>>>>
> >>>>>> users to
> >>>>>>> do a 2.2.0 so fast. If we have a blocking issue, we can do a 2.1.1
> >>>>>>>
> >>>>>> If
> >>>>>
> >>>>>> it's
> >>>>>>
> >>>>>>> new
> >>>>>>> features, why not having a release pace in October (2.2.0) ?
> >>>>>>>
> >>>>>>> Thoughts ?
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> JB
> >>>>>>&

Re: Merge branch DSL_SQL to master

2017-09-11 Thread Mingmin Xu
Now it's merged to master. Thanks to everyone!

Mingmin

On Thu, Sep 7, 2017 at 10:09 AM, Ahmet Altay <al...@google.com.invalid>
wrote:

> +1 Thanks to all contributors/reviewers!
>
> On Thu, Sep 7, 2017 at 9:55 AM, Kai Jiang <jiang...@gmail.com> wrote:
>
> > +1 looking forward to this.
> >
> > On Thu, Sep 7, 2017, 09:53 Tyler Akidau <taki...@google.com.invalid>
> > wrote:
> >
> > > +1, thanks for all the hard work to everyone that contributed!
> > >
> > > -Tyler
> > >
> > > On Thu, Sep 7, 2017 at 2:39 AM Ismaël Mejía <ieme...@gmail.com> wrote:
> > >
> > > > +1
> > > > A nice feature to have on Beam. Great work guys !
> > > >
> > > > On Thu, Sep 7, 2017 at 10:21 AM, Pei HE <pei...@gmail.com> wrote:
> > > > > +1
> > > > >
> > > > > On Thu, Sep 7, 2017 at 4:03 PM, tarush grover <
> > tarushappt...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > >> Thank you all, it was a great learning experience!
> > > > >>
> > > > >> Regards,
> > > > >> Tarush
> > > > >>
> > > > >> On Thu, 7 Sep 2017 at 1:05 PM, Jean-Baptiste Onofré <
> > j...@nanthrax.net>
> > > > >> wrote:
> > > > >>
> > > > >> > +1
> > > > >> >
> > > > >> > Great work guys !
> > > > >> > Ready to help for the merge and maintain !
> > > > >> >
> > > > >> > Regards
> > > > >> > JB
> > > > >> >
> > > > >> > On 09/07/2017 08:48 AM, Mingmin Xu wrote:
> > > > >> > > Hi all,
> > > > >> > >
> > > > >> > > On behalf of the virtual Beam SQL team[1], I'd like to propose
> > to
> > > > merge
> > > > >> > > DSL_SQL branch into master (PR #3782 [2]) and include it in
> > > release
> > > > >> > version
> > > > >> > > 2.2.0, which will give it more visibility to other
> contributors
> > > and
> > > > >> > users.
> > > > >> > > The SQL feature satisfies the following criteria outlined in
> > > > >> contribution
> > > > >> > > guide[3].
> > > > >> > >
> > > > >> > > 1. Have at least 2 contributors interested in maintaining it,
> > and
> > > 1
> > > > >> > > committer interested in supporting it
> > > > >> > >
> > > > >> > > * James and I will continue to add new features and maintain it;
> > > > >> > >
> > > > >> > >   Tyler, James and I will support it as committers;
> > > > >> > >
> > > > >> > > 2. Provide both end-user and developer-facing documentation
> > > > >> > >
> > > > >> > > * A web page[4] is added to describe the usage of SQL DSL and
> > how
> > > it
> > > > >> > works;
> > > > >> > >
> > > > >> > >
> > > > >> > > 3. Have at least a basic level of unit test coverage
> > > > >> > >
> > > > >> > > * 230 unit/integration tests in total, with code coverage of
> > > > >> > > 83.4%;
> > > > >> > >
> > > > >> > > 4. Run all existing applicable integration tests with other
> Beam
> > > > >> > components
> > > > >> > > and create additional tests as appropriate
> > > > >> > >
> > > > >> > > * Besides of integration tests in package
> > > > >> > org.apache.beam.sdk.extensions.sql,
> > > > >> > > there's another example in
> > > > org.apache.beam.sdk.extensions.sql.example.
> > > > >> > > BeamSqlExample.
> > > > >> > >
> > > > >> > > [1]. Special thanks to all contributors/reviewers:
> > > > >> > >
> > > > >> > >   Tyler Akidau
> > > > >> > >
> > > > >> > >   Davor Bonaci
> > > > >> > >
> > > > >> > >   Robert Bradshaw
> > > > >> > >
> > > > >> > >   Lukasz Cwik
> > > > >> > >
> > > > >> > >   Tarush Grover
> > > > >> > >
> > > > >> > >   Kai Jiang
> > > > >> > >
> > > > >> > >   Kenneth Knowles
> > > > >> > >
> > > > >> > >   Jingsong Lee
> > > > >> > >
> > > > >> > >   Ismaël Mejía
> > > > >> > >
> > > > >> > >   Jean-Baptiste Onofré
> > > > >> > >
> > > > >> > >   James Xu
> > > > >> > >
> > > > >> > >   Mingmin Xu
> > > > >> > >
> > > > >> > > [2]. https://github.com/apache/beam/pull/3782
> > > > >> > >
> > > > >> > > [3]. https://beam.apache.org/contribute/contribution-guide/
> > > > >> > > #merging-into-master
> > > > >> > >
> > > > >> > > [4]. https://beam.apache.org/documentation/dsls/sql/
> > > > >> > >
> > > > >> > > Thanks!
> > > > >> > > 
> > > > >> > > Mingmin
> > > > >> > >
> > > > >> >
> > > > >> > --
> > > > >> > Jean-Baptiste Onofré
> > > > >> > jbono...@apache.org
> > > > >> > http://blog.nanthrax.net
> > > > >> > Talend - http://www.talend.com
> > > > >> >
> > > > >>
> > > >
> > >
> >
>



-- 

Mingmin


Re: Beam 2.2.0 release

2017-09-07 Thread Mingmin Xu
Thanks @JB, I've added you and Tyler as reviewers and am waiting for the Jenkins job.

On Thu, Sep 7, 2017 at 12:25 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> It sounds good to me.
>
> By the way, you will need my help to complete the release process (as you
> need some permissions that you don't have).
>
> Regards
> JB
>
>
> On 09/07/2017 01:00 AM, Reuven Lax wrote:
>
>> It sounds like SQL is still not in, and there are a couple of other PRs
>> that people have requested in 2.2.0. I am mostly out next week, so let's
>> set September 18 as a target date for cutting the first RC. That should
>> hopefully give plenty of time to get SQL and the remaining PRs merged into
>> master.
>>
>> Reuven
>>
>> On Thu, Aug 31, 2017 at 3:04 PM, Mingmin Xu <mingm...@gmail.com> wrote:
>>
>> Add https://issues.apache.org/jira/browse/BEAM-2833 which is a blocker to
>>> merge DSL_SQL. There may be something wrong in the back-end (maybe
>>> RunnerApi) in handling a parameterized CustomCoder in TestPipeline.
>>>
>>> On Thu, Aug 31, 2017 at 10:38 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>>
>>> Fair enough.
>>>>
>>>> That's fine for me.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On Aug 31, 2017, 19:03, at 19:03, Steve Niemitz <sniem...@apache.org>
>>>> wrote:
>>>>
>>>>> I'll chime in as a user who would love to see 2.2.0 sooner than later,
>>>>> specifically for the file IO Eugene mentioned.  We're using the AvroIO
>>>>> enhancements extensively, but I am hesitant to run from HEAD in master
>>>>> in
>>>>> production.
>>>>>
>>>>> On Thu, Aug 31, 2017 at 12:42 PM, Eugene Kirpichov <
>>>>> kirpic...@google.com.invalid> wrote:
>>>>>
>>>>> There are a lot of users including very large production customers
>>>>>>
>>>>> who have
>>>>>
>>>>>> been asking specifically for the features that are in 2.2.0 (most of
>>>>>>
>>>>> them
>>>>>
>>>>>> accumulated while 2.1.0 was being iterated on) - mostly I'm referring
>>>>>>
>>>>> to
>>>>>
>>>>>> the vastly improved file IO - and they have been hesitant to use Beam
>>>>>>
>>>>> at
>>>>>
>>>>>> HEAD in production. I think the slight unusualness of having a
>>>>>>
>>>>> release
>>>>>
>>>>>> published soon after the previous release is a small price to pay for
>>>>>> helping those users :)
>>>>>>
>>>>>> On Wed, Aug 30, 2017, 11:30 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>>> wrote:
>>>>>>
>>>>>> As we released 2.1.0 couple of weeks ago, it could sound weird to
>>>>>>>
>>>>>> the
>>>>>
>>>>>> users to
>>>>>>> do a 2.2.0 so fast. If we have a blocking issue, we can do a 2.1.1
>>>>>>>
>>>>>> If
>>>>>
>>>>>> it's
>>>>>>
>>>>>>> new
>>>>>>> features, why not having a release pace in October (2.2.0) ?
>>>>>>>
>>>>>>> Thoughts ?
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On 08/31/2017 08:27 AM, Eugene Kirpichov wrote:
>>>>>>>
>>>>>>>> I'd suggest to do 2.2.0 as quickly as possible, and target 2.3.0
>>>>>>>>
>>>>>>> for
>>>>>
>>>>>> October. I don't see a reason to delay 2.2.0 until October:
>>>>>>>>
>>>>>>> there's a
>>>>>
>>>>>> huge
>>>>>>>
>>>>>>>> amount of features worth releasing between when 2.1.0 was cut and
>>>>>>>>
>>>>>>> the
>>>>>
>>>>>> current HEAD.
>>>>>>>>
>>>>>>>> On Wed, Aug 30, 2017 at 10:18 PM Jean-Baptiste Onofré
>>>>>>>>
>>>>>>> <j...@nanthrax.net
>>>>>
>>>>>>

Re: Beam 2.2.0 release

2017-09-06 Thread Mingmin Xu
Regarding BEAM-2833, a PR is up for review
(https://github.com/apache/beam/pull/3803). Once it's completed, I'll move
forward with the SQL merge.

Mingmin

On Wed, Sep 6, 2017 at 4:00 PM, Reuven Lax <re...@google.com.invalid> wrote:

> It sounds like SQL is still not in, and there are a couple of other PRs
> that people have requested in 2.2.0. I am mostly out next week, so let's
> set September 18 as a target date for cutting the first RC. That should
> hopefully give plenty of time to get SQL and the remaining PRs merged into
> master.
>
> Reuven
>
> On Thu, Aug 31, 2017 at 3:04 PM, Mingmin Xu <mingm...@gmail.com> wrote:
>
> > Add https://issues.apache.org/jira/browse/BEAM-2833 which is a blocker
> to
> > merge DSL_SQL. There may be something wrong in the back-end (maybe
> > RunnerApi) in handling a parameterized CustomCoder in TestPipeline.
> >
> > On Thu, Aug 31, 2017 at 10:38 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> > > Fair enough.
> > >
> > > That's fine for me.
> > >
> > > Regards
> > > JB
> > >
> > > On Aug 31, 2017, 19:03, at 19:03, Steve Niemitz <sniem...@apache.org>
> > > wrote:
> > > >I'll chime in as a user who would love to see 2.2.0 sooner than later,
> > > >specifically for the file IO Eugene mentioned.  We're using the AvroIO
> > > >enhancements extensively, but I am hesitant to run from HEAD in master
> > > >in
> > > >production.
> > > >
> > > >On Thu, Aug 31, 2017 at 12:42 PM, Eugene Kirpichov <
> > > >kirpic...@google.com.invalid> wrote:
> > > >
> > > >> There are a lot of users including very large production customers
> > > >who have
> > > >> been asking specifically for the features that are in 2.2.0 (most of
> > > >them
> > > >> accumulated while 2.1.0 was being iterated on) - mostly I'm
> referring
> > > >to
> > > >> the vastly improved file IO - and they have been hesitant to use
> Beam
> > > >at
> > > >> HEAD in production. I think the slight unusualness of having a
> > > >release
> > > >> published soon after the previous release is a small price to pay
> for
> > > >> helping those users :)
> > > >>
> > > >> On Wed, Aug 30, 2017, 11:30 PM Jean-Baptiste Onofré <
> j...@nanthrax.net>
> > > >> wrote:
> > > >>
> > > >> > As we released 2.1.0 couple of weeks ago, it could sound weird to
> > > >the
> > > >> > users to
> > > >> > do a 2.2.0 so fast. If we have a blocking issue, we can do a 2.1.1
> > > >If
> > > >> it's
> > > >> > new
> > > >> > features, why not having a release pace in October (2.2.0) ?
> > > >> >
> > > >> > Thoughts ?
> > > >> >
> > > >> > Regards
> > > >> > JB
> > > >> >
> > > >> > On 08/31/2017 08:27 AM, Eugene Kirpichov wrote:
> > > >> > > I'd suggest to do 2.2.0 as quickly as possible, and target 2.3.0
> > > >for
> > > >> > > October. I don't see a reason to delay 2.2.0 until October:
> > > >there's a
> > > >> > huge
> > > >> > > amount of features worth releasing between when 2.1.0 was cut
> and
> > > >the
> > > >> > > current HEAD.
> > > >> > >
> > > >> > > On Wed, Aug 30, 2017 at 10:18 PM Jean-Baptiste Onofré
> > > ><j...@nanthrax.net
> > > >> >
> > > >> > > wrote:
> > > >> > >
> > > >> > >> With a 2.2.0 in October, I think we can try to move forward on
> > > >> RedisIO.
> > > >> > >>
> > > >> > >> I'm now back from vacation and I will resume the work on this
> > > >IO.
> > > >> > >>
> > > >> > >> Regards
> > > >> > >> JB
> > > >> > >>
> > > >> > >> On 08/30/2017 11:27 PM, Eugene Kirpichov wrote:
> > > >> > >>> RedisIO in 2.2.0 is very unlikely. There's still a lot of
> > > >review
> > > >> > >> remaining
> > > >> > >>> last time I checked on the PR.
> > > >> > >>>

Re: Beam 2.2.0 release

2017-08-31 Thread Mingmin Xu
> >>>>>
> >> > >>>>>
> >> > >>>>> On Wed, Aug 30, 2017 at 1:27 PM, Eugene Kirpichov <
> >> > >>>>> kirpic...@google.com.invalid> wrote:
> >> > >>>>>
> >> > >>>>>> Thanks Ismael. I've marked these two issues for fix in
> >2.2.0.
> >> > >>>> Definitely
> >> > >>>>>> agree that at least the first one must be fixed.
> >> > >>>>>>
> >> > >>>>>> Here's the current burndown list
> >> > >>>>>>
> >https://issues.apache.org/jira/projects/BEAM/versions/12341044 -
> >> we
> >> > >>>>> should
> >> > >>>>>> clean it up.
> >> > >>>>>>
> >> > >>>>>> On Wed, Aug 30, 2017 at 1:20 PM Ismaël Mejía
> ><ieme...@gmail.com>
> >> > >>>> wrote:
> >> > >>>>>>
> >> > >>>>>>> The current master has accumulated a good amount of nice
> >features
> >> > >>>>>>> since 2.1.0 so a new release is welcomed. I have two
> >JIRAs/PR
> >> that
> >> > I
> >> > >>>>>>> think are important to check/solve before the cut:
> >> > >>>>>>>
> >> > >>>>>>> BEAM-2516 (this is a regression on the performance of
> >Direct
> >> runner
> >> > >>>> on
> >> > >>>>>>> Java). We had never really defined if a performance
> >regression is
> >> > >>>>>>> critical to be a blocker. I executed WordCount with the
> >> > kinglear.txt
> >> > >>>>>>> (170KB) file in version 2.1.0 vs the current 2.2.0-SNAPSHOT
> >and I
> >> > >>>>>>> found that the execution time passed from 5s to 126s. So
> >maybe we
> >> > >>>> need
> >> > >>>>>>> to review this one before the release. I can understand if
> >others
> >> > >>>>>>> consider this a minor issue because the Direct runner is
> >not
> >> > supposed
> >> > >>>>>>> to be used for production, but this performance regression
> >can
> >> > cause
> >> > >>>> a
> >> > >>>>>>> bad impression for a casual user starting with Beam.
> >> > >>>>>>>
> >> > >>>>>>> BEAM-2790 (fix reading from Amazon S3 via
> >HadoopFileSystem). I
> >> > think
> >> > >>>>>>> this one is a nice to have. I am not sure that I can tackle
> >it
> >> for
> >> > >>>> the
> >> > >>>>>>> wednesday cut. I’m OOO until the beginning of next week,
> >but
> >> maybe
> >> > >>>>>>> someone else can take a look. In the worst case this is not
> >a
> >> > release
> >> > >>>>>>> blocker but definitely a really nice fix to include.
> >> > >>>>>>>
> >> > >>>>>>>
> >> > >>>>>>> On Wed, Aug 30, 2017 at 8:49 PM, Eugene Kirpichov
> >> > >>>>>>> <kirpic...@google.com.invalid> wrote:
> >> > >>>>>>>> I'd like to get the following PRs into 2.2.0:
> >> > >>>>>>>>
> >> > >>>>>>>> #3765 <https://github.com/apache/beam/pull/3765>
> >[BEAM-2753
> >> > >>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2753>] Fixes
> >> > >>>> translation
> >> > >>>>>> of
> >> > >>>>>>>> WriteFiles side inputs (important bugfix for
> >DynamicDestinations
> >> > in
> >> > >>>>>>> files)
> >> > >>>>>>>> #3725 <https://github.com/apache/beam/pull/3725>
> >[BEAM-2827
> >> > >>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2827>]
> >Introduces
> >> > >>>>>>>> AvroIO.watchForNewFiles (parity for AvroIO with TextIO in
> >a few
> >> > >>>>>> important
> >> > >>>>>>>> features)
> >> > >>>>>>>> #3759 <https://github.com/apache/beam/pull/3759>
> >[BEAM-2828
> >> > >>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2828>] Moves
> >Match
> >> > >>>> into
> >> > >>>>>>>> FileIO.match()/matchAll() (to prevent releasing current
> >> > >>>>>>>> Match.filepatterns() into 2.2.0 and then having to keep it
> >under
> >> > >>>> that
> >> > >>>>>>> name)
> >> > >>>>>>>>
> >> > >>>>>>>> On Wed, Aug 30, 2017 at 11:31 AM Mingmin Xu
> ><mingm...@gmail.com
> >> >
> >> > >>>>>> wrote:
> >> > >>>>>>>>
> >> > >>>>>>>>> Glad to see that 2.2.0 is coming. Can we include SQL
> >feature in
> >> > >>>> next
> >> > >>>>>>>>> release? We're in the final stage and expect to merge
> >back to
> >> > >>>> master
> >> > >>>>>>> this
> >> > >>>>>>>>> week.
> >> > >>>>>>>>>
> >> > >>>>>>>>> On Wed, Aug 30, 2017 at 11:27 AM, Reuven Lax
> >> > >>>>> <re...@google.com.invalid
> >> > >>>>>>>
> >> > >>>>>>>>> wrote:
> >> > >>>>>>>>>
> >> > >>>>>>>>>> Now that Beam 2.1.0 has finally completed, I think we
> >should
> >> cut
> >> > >>>>>> Beam
> >> > >>>>>>>>> 2.2.0
> >> > >>>>>>>>>> soon. I volunteer to coordinate this release.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Are there any pending pull requests that people think
> >should
> >> be
> >> > >>>>>> merged
> >> > >>>>>>>>>> before we cut 2.2.0? If so, please let me know soon, as
> >I
> >> would
> >> > >>>>> like
> >> > >>>>>>> to
> >> > >>>>>>>>> cut
> >> > >>>>>>>>>> by Wednesday of next week.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Thanks,
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Reuven
> >> > >>>>>>>>>>
> >> > >>>>>>>>>
> >> > >>>>>>>>>
> >> > >>>>>>>>>
> >> > >>>>>>>>> --
> >> > >>>>>>>>> 
> >> > >>>>>>>>> Mingmin
> >> > >>>>>>>>>
> >> > >>>>>>>
> >> > >>>>>>
> >> > >>>>>
> >> > >>>>
> >> > >>>
> >> > >>
> >> > >> --
> >> > >> Jean-Baptiste Onofré
> >> > >> jbono...@apache.org
> >> > >> http://blog.nanthrax.net
> >> > >> Talend - http://www.talend.com
> >> > >>
> >> > >
> >> >
> >> > --
> >> > Jean-Baptiste Onofré
> >> > jbono...@apache.org
> >> > http://blog.nanthrax.net
> >> > Talend - http://www.talend.com
> >> >
> >>
>



-- 

Mingmin


Re: Beam 2.2.0 release

2017-08-30 Thread Mingmin Xu
Glad to see that 2.2.0 is coming. Can we include the SQL feature in the next
release? We're in the final stage and expect to merge back to master this
week.

On Wed, Aug 30, 2017 at 11:27 AM, Reuven Lax 
wrote:

> Now that Beam 2.1.0 has finally completed, I think we should cut Beam 2.2.0
> soon. I volunteer to coordinate this release.
>
> Are there any pending pull requests that people think should be merged
> before we cut 2.2.0? If so, please let me know soon, as I would like to cut
> by Wednesday of next week.
>
> Thanks,
>
> Reuven
>



-- 

Mingmin


Re: [PROPOSAL] External Join with KV Stores

2017-08-30 Thread Mingmin Xu
Besides data size, I think data freshness is the BIG barrier, especially for
streaming jobs. In most cases the lookup data set is updated periodically
while the streaming job is running. I like the idea of SeekableIO; it can
still be integrated with ExternalKvStore as a lower-level API, as illustrated
below.
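
To illustrate the userland shape, a per-element lookup DoFn might look like
this (a minimal sketch; `KvClient`, `Event`, and `Profile` are hypothetical
stand-ins for a store client and record types, not Beam APIs):

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class EnrichByLookupFn extends DoFn<Event, KV<Event, Profile>> {
  private transient KvClient client; // hypothetical KV-store client

  @Setup
  public void setup() {
    client = KvClient.connect("kvstore.example.com"); // placeholder endpoint
  }

  @ProcessElement
  public void process(ProcessContext c) {
    // Each element reads the store's current value, so periodic updates to
    // the lookup data are visible without rebuilding a side input.
    Profile p = client.get(c.element().getUserId());
    c.output(KV.of(c.element(), p));
  }
}
```

A SeekableIO-style API could standardize the `KvClient` part while keeping
this per-element access pattern.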

On Mon, Aug 28, 2017 at 7:32 PM, JingsongLee <lzljs3620...@aliyun.com>
wrote:

> Yes, the runner can hold the entire side input in the right way. But it
> will be somewhat wasteful in the case of large amounts of data.
> Best, Jingsong Lee
>
> ------------------------------------------------------------------
> From: Lukasz Cwik <lc...@google.com.INVALID>
> Time: 2017 Aug 25 (Fri) 23:26
> To: dev <dev@beam.apache.org>
> Cc: JingsongLee <jingsongl...@gmail.com>
> Subject: Re: [PROPOSAL] External Join with KV Stores
> Jinsong, what do you mean by the batch data is too large?
>
> To my knowledge, nothing requires an SDK/runner to hold the entire side
> input in memory. Lists, maps, iterables, ... can all be broken up into
> smaller segments which can be loaded, cached and discarded separately.
>
> On Thu, Aug 24, 2017 at 5:10 PM, Mingmin Xu <mingm...@gmail.com> wrote:
>
> > wanna bring up this thread as we're looking for similar feature in SQL.
> > --Please point me if something is there, I don't find any JIRA task.
> >
> > Now the streaming+batch/batch+batch join is implemented with sideInput.
> > It's not a one-fit-all rule as Jingsong mentioned, the batch data may be
> > too large, and it would be changed periodically. A userland PTransform
> > sounds a more straight-forward option, as it doesn't require support in
> > runner level.
> >
> > Mingmin
> >
> > On Mon, Jul 17, 2017 at 8:56 PM, JingsongLee <lzljs3620...@aliyun.com>
> > wrote:
> >
> > > Sorry for taking so long to reply.
> > > Hi Aljoscha, I think the Async I/O operator and Batch are the same, and
> > > Async is the better interface. All IO-related operations may be more
> > > appropriate for asynchronous use. Just like you said, at the beginning it
> > > would work without any special support from the runners.
> > > I really like Luke's idea: let the user see a SeekableRead + SideInput
> > > interface, and let the runner layer optimize it into direct access to the
> > > external store. This requires a suitable SeekableRead interface and more
> > > efficient compiler optimization.
> > > Kenn's idea is exciting. If we can have an interface similar to FileSystem
> > > (maybe like SeekableRead) that abstracts and unifies an interface for
> > > multiple KV stores, we can let users see only the concepts of Beam rather
> > > than the specific KV store.
> > > Best, Jingsong Lee
> > > ------------------------------------------------------------------
> > > From: Kenneth Knowles <k...@google.com.INVALID>
> > > Time: 2017 Jul 7 (Fri) 11:43
> > > To: dev <dev@beam.apache.org>
> > > Cc: JingsongLee <lzljs3620...@aliyun.com>
> > > Subject: Re: [PROPOSAL] External Join with KV Stores
> > > In the streams/tables way of talking, side inputs are
> tables. External KV
> > > stores are basically also [globally windowed] tables. Both
> > > are time-varying.
> > >
> > > I think it makes perfect sense to access an external
> KV store in userland
> > > directly rather than listen to its changelog and
> reproduce the same table
> > > as a multimap side input. I'm sure many users are
> already doing this. I'm
> > > sure users will always do this. Providing a common interface (simpler
> > than
> > > Filesystem) and helpful transform(s) in an extension
> module seems nice.
> > > Does it require any support in the core SDK?
> > >
> > > If I understand, Luke & Robert, you favor adding
> metadata to Read/SDF so
> > > that a user _does_ write it as a changelog listener
> that is observed as a
> > > multimap side input, and each runner optimizes it if they can to just
> > > directly access the KV store? A runner is free to
> use any kind of storage
> > > they like to materialize a side input anyhow, so this
> is surely possible,
> > > but it is a "sufficiently smart compiler" issue. As for semantics, I'm
> > not
> > > worried about availability - it is globally windowed and always
> > available.
> > > But I think this requires retractions to be correctly equivalent to
> > direct
> > > access.
> > >
> > > I think we can have a userland PTransform in much
> less time than a model
> > > concept, so I favor it.
> > >
> > > Kenn
> > >
> > >
> >
> >
> > --
> > 
> > Mingmin
> >
>
>


-- 

Mingmin


Re: [PROPOSAL] External Join with KV Stores

2017-08-24 Thread Mingmin Xu
I want to bring up this thread, as we're looking for a similar feature in SQL.
(Please point me to it if something already exists; I didn't find any JIRA
task.)

Currently the streaming+batch / batch+batch join is implemented with side
inputs. It's not a one-size-fits-all approach: as Jingsong mentioned, the
batch data may be too large, and it may change periodically. A userland
PTransform sounds like a more straightforward option, as it doesn't require
runner-level support; see the sketch below.
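
Here is a minimal sketch of that side-input join, which also shows why size
matters: the entire lookup table is materialized as a map. It assumes
`mainStream` is a PCollection<KV<String, String>> and `lookupTable` holds the
batch data as key/value pairs:

```java
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// The whole lookup table becomes a map-valued side input.
final PCollectionView<Map<String, String>> lookupView =
    lookupTable.apply(View.<String, String>asMap());

PCollection<String> joined = mainStream.apply(
    ParDo.of(new DoFn<KV<String, String>, String>() {
      @ProcessElement
      public void process(ProcessContext c) {
        String rhs = c.sideInput(lookupView).get(c.element().getKey());
        c.output(c.element().getValue() + "," + rhs);
      }
    }).withSideInputs(lookupView));
```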

Mingmin

On Mon, Jul 17, 2017 at 8:56 PM, JingsongLee 
wrote:

> Sorry for taking so long to reply.
> Hi Aljoscha, I think the Async I/O operator and Batch are the same, and
> Async is the better interface. All IO-related operations may be more
> appropriate for asynchronous use. Just like you said, at the beginning it
> would work without any special support from the runners.
> I really like Luke's idea: let the user see a SeekableRead + SideInput
> interface, and let the runner layer optimize it into direct access to the
> external store. This requires a suitable SeekableRead interface and more
> efficient compiler optimization.
> Kenn's idea is exciting. If we can have an interface similar to FileSystem
> (maybe like SeekableRead) that abstracts and unifies an interface for
> multiple KV stores, we can let users see only the concepts of Beam rather
> than the specific KV store.
> Best, Jingsong Lee
> ------------------------------------------------------------------
> From: Kenneth Knowles <k...@google.com.INVALID>
> Time: 2017 Jul 7 (Fri) 11:43
> To: dev <dev@beam.apache.org>
> Cc: JingsongLee <lzljs3620...@aliyun.com>
> Subject: Re: [PROPOSAL] External Join with KV Stores
> In the streams/tables way of talking, side inputs are tables. External KV
> stores are basically also [globally windowed] tables. Both
> are time-varying.
>
> I think it makes perfect sense to access an external KV store in userland
> directly rather than listen to its changelog and reproduce the same table
> as a multimap side input. I'm sure many users are already doing this. I'm
> sure users will always do this. Providing a common interface (simpler than
> Filesystem) and helpful transform(s) in an extension module seems nice.
> Does it require any support in the core SDK?
>
> If I understand, Luke & Robert, you favor adding metadata to Read/SDF so
> that a user _does_ write it as a changelog listener that is observed as a
> multimap side input, and each runner optimizes it if they can to just
> directly access the KV store? A runner is free to use any kind of storage
> they like to materialize a side input anyhow, so this is surely possible,
> but it is a "sufficiently smart compiler" issue. As for semantics, I'm not
> worried about availability - it is globally windowed and always available.
> But I think this requires retractions to be correctly equivalent to direct
> access.
>
> I think we can have a userland PTransform in much less time than a model
> concept, so I favor it.
>
> Kenn
>
>


-- 

Mingmin


Re: [DISCUSS] Capability Matrix revamp

2017-08-23 Thread Mingmin Xu
I would like to have API compatibility testing. AFAIK there's still a gap to
achieving our goal (one job for any runner), which means developers have to be
aware of runner limitations when writing a job. For example, PCollectionView
is not well supported in FlinkRunner (not quite sure about the current status,
as my test job is broken) or in SparkRunner streaming.

> 5. Reorganize the windowing section to be just support for merging /
non-merging windowing.
sliding/fixed_window/session is more straightforward to me;
merging/non-merging is more about the backend implementation.


On Tue, Aug 22, 2017 at 7:28 PM, Kenneth Knowles 
wrote:

> Oh, I missed
>
> 11. Quantitative properties. This seems like an interesting and important
> project all on its own. Since Beam is so generic, we need pretty diverse
> measurements for a user to have a hope of extrapolating to their use case.
>
> Kenn
>
> On Tue, Aug 22, 2017 at 7:22 PM, Kenneth Knowles  wrote:
>
> > OK, so adding these good ideas to the list:
> >
> > 8. Plain-English summary that comes before the nitty-gritty.
> > 9. Comment on production readiness from maintainers. Maybe testimonials
> > are helpful if they can be obtained?
> > 10. Versioning of all of the above
> >
> > Any more thoughts? I'll summarize in a JIRA in a bit.
> >
> > Kenn
> >
> > On Tue, Aug 22, 2017 at 10:45 AM, Griselda Cuevas
>  > > wrote:
> >
> >> Hi, I'd also like to ask if versioning as proposed in BEAM-166 <
> >> https://issues.apache.org/jira/browse/BEAM-166> is still relevant? If
> it
> >> is, would this be something we want to add to this proposal?
> >>
> >> G
> >>
> >> On 21 August 2017 at 08:31, Tyler Akidau 
> >> wrote:
> >>
> >> > Is there any way we could add quantitative runner metrics to this as
> >> well?
> >> > Like by having some benchmarks that process X amount of data, and then
> >> > detailing in the matrix latency, throughput, and (where possible)
> cost,
> >> > etc, numbers for each of the given runners? Semantic support is one
> >> thing,
> >> > but there are other differences between runners that aren't captured
> by
> >> > just checking feature boxes. I'd be curious if anyone has other ideas
> in
> >> > this vein as well. The benchmark idea might not be the best way to go
> >> about
> >> > it.
> >> >
> >> > -Tyler
> >> >
> >> > On Sun, Aug 20, 2017 at 9:43 AM Jesse Anderson <
> >> je...@bigdatainstitute.io>
> >> > wrote:
> >> >
> >> > > It'd be awesome to see these updated. I'd add two more:
> >> > >
> >> > >1. A plain English summary of the runner's support in Beam.
> People
> >> who
> >> > >are new to Beam won't understand the in-depth coverage and need a
> >> > > general
> >> > >idea of how it is supported.
> >> > >2. The production readiness of the runner. Does the maintainer
> >> think
> >> > >this runner is production ready?
> >> > >
> >> > >
> >> > >
> >> > > On Sun, Aug 20, 2017 at 8:03 AM Kenneth Knowles
> >> 
> >> > > wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > I want to revamp
> >> > > > https://beam.apache.org/documentation/runners/capability-matrix/
> >> > > >
> >> > > > When Beam first started, we didn't work on feature branches for
> the
> >> > core
> >> > > > runners, and they had a lot more gaps compared to what goes on
> >> `master`
> >> > > > today, so this tracked our progress in a way that was easy for
> >> users to
> >> > > > read. Now it is still our best/only comparison page for users,
> but I
> >> > > think
> >> > > > we could improve its usefulness.
> >> > > >
> >> > > > For the benefit of the thread, let me inline all the capabilities
> >> fully
> >> > > > here:
> >> > > >
> >> > > > 
> >> > > >
> >> > > > "What is being computed?"
> >> > > >  - ParDo
> >> > > >  - GroupByKey
> >> > > >  - Flatten
> >> > > >  - Combine
> >> > > >  - Composite Transforms
> >> > > >  - Side Inputs
> >> > > >  - Source API
> >> > > >  - Splittable DoFn
> >> > > >  - Metrics
> >> > > >  - Stateful Processing
> >> > > >
> >> > > > "Where in event time?"
> >> > > >  - Global windows
> >> > > >  - Fixed windows
> >> > > >  - Sliding windows
> >> > > >  - Session windows
> >> > > >  - Custom windows
> >> > > >  - Custom merging windows
> >> > > >  - Timestamp control
> >> > > >
> >> > > > "When in processing time?"
> >> > > >  - Configurable triggering
> >> > > >  - Event-time triggers
> >> > > >  - Processing-time triggers
> >> > > >  - Count triggers
> >> > > >  - [Meta]data driven triggers
> >> > > >  - Composite triggers
> >> > > >  - Allowed lateness
> >> > > >  - Timers
> >> > > >
> >> > > > "How do refinements relate?"
> >> > > >  - Discarding
> >> > > >  - Accumulating
> >> > > >  - Accumulating & Retracting
> >> > > >
> >> > > > 
> >> > > >
> >> > > > Here are some issues I'd like to improve:
> >> > > >
> >> > > >  - Rows that are impossible to not support (ParDo)
> >> > > >  - 

Re: [ANNOUNCEMENT] New PMC members, August 2017 edition!

2017-08-11 Thread Mingmin Xu
Congratulations to Ahmet and Aviem!

On Fri, Aug 11, 2017 at 11:30 AM, Thomas Groh 
wrote:

> Congratulations to both of you! Looking forwards to both of your continued
> contributions.
>
> On Fri, Aug 11, 2017 at 10:40 AM, Davor Bonaci  wrote:
>
> > Please join me and the rest of Beam PMC in welcoming the following
> > committers as our newest PMC members. They have significantly contributed
> > to the project in different ways, and we look forward to many more
> > contributions in the future.
> >
> > * Ahmet Altay
> > Beyond significant work to drive the Python SDK to the master branch,
> Ahmet
> > has worked project-wide, driving releases, improving processes and
> testing,
> > and growing the community.
> >
> > * Aviem Zur
> > Beyond significant work in the Spark runner, Aviem has worked to improve
> > how the project operates, leading discussions on inclusiveness and
> > openness.
> >
> > Congratulations to both! Welcome!
> >
> > Davor
> >
>



-- 

Mingmin


Re: DSL_SQL branch API review

2017-08-03 Thread Mingmin Xu
Thank you @Tyler for gathering the APIs introduced in the SQL DSL; I've added
some comments in the doc.

On Thu, Aug 3, 2017 at 4:21 PM, Tyler Akidau 
wrote:

> Hello Beam dev listers!
>
> TL;DR - DSL_SQL API review happening at
> https://s.apache.org/beam-sql-dsl-api-review
>
> As one of the last steps towards merging the DSL_SQL branch to master [1],
> we are now conducting a holistic API review. As part of that, I've created
> a document [2] that lists out the public API surface, to provide a place
> where offline review and discussions can occur. I've also added an initial
> round of my own comments.
>
> If you're interested in helping out, useful things you can do include:
>
>- [all] Add any comments you might have on the various pieces of the
> API.
>- [DSL_SQL contributors] Verify there are no missing API surfaces (and
>add them if there are; let me know what email address to add for edit
>access, or use suggestions and I'll incorporate).
>- [DSL_SQL contributors] Respond to questions/comments as appropriate.
>
> Thank you!
>
> [1] https://s.apache.org/beam-dsl-sql-burndown
> [2] https://s.apache.org/beam-sql-dsl-api-review
>
> -Tyler
>



-- 

Mingmin


Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Mingmin Xu
I second that "it's the responsibility of the transform". For the case when a
PTransform doesn't have enough information (the PTransform developer should
have that knowledge), I would prefer a strict approach so users can't forget
to call withSomethingCoder(), e.g.:
- a Coder is required to construct the PTransform (see the sketch below);
- or an interface 'getOutputCoder' must be implemented;
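
A minimal sketch of the first option, requiring the Coder at construction time
(the transform is illustrative, not an existing Beam API):

```java
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

/** A parse transform that cannot be constructed without an output coder. */
public class ParseAs<T> extends PTransform<PCollection<String>, PCollection<T>> {
  private final DoFn<String, T> parseFn;
  private final Coder<T> outputCoder;

  public ParseAs(DoFn<String, T> parseFn, Coder<T> outputCoder) {
    this.parseFn = parseFn;
    this.outputCoder = outputCoder; // required up front, no setCoder() fix-up later
  }

  @Override
  public PCollection<T> expand(PCollection<String> input) {
    return input.apply(ParDo.of(parseFn)).setCoder(outputCoder);
  }
}
```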

On Wed, Jul 26, 2017 at 11:21 AM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Hm, can you elaborate? I'm not sure how this relates to my suggestion, the
> gist of which is "PTransform's should set the coder on all of their
> outputs, and the user should never have to .setCoder() on a PCollection
> obtained from a PTransform"
>
> On Wed, Jul 26, 2017 at 7:38 AM Lukasz Cwik 
> wrote:
>
> > I'm split between our current one pass model of pipeline construction
> and a
> > two pass model where all information is gathered and then PTransform
> > expansions are performed.
> >
> >
> > On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov <
> > kirpic...@google.com.invalid> wrote:
> >
> > > Hello,
> > >
> > > I've worked on a few different things recently and ran repeatedly into
> > the
> > > same issue: that we do not have clear guidance on who should set the
> > Coder
> > > on a PCollection: is it responsibility of the PTransform that outputs
> it,
> > > or is it responsibility of the user, or is it sometimes one and
> sometimes
> > > the other?
> > >
> > > I believe that the answer is "it's responsibility of the transform" and
> > > moreover that  ideally PCollection.setCoder() should not exist.
> Instead:
> > >
> > > - Require that all transforms set a Coder on the PCollection's they
> > produce
> > > - i.e. it should never be responsibility of the user to "fix up" a
> coder
> > on
> > > a PCollection produced by a transform.
> > >
> > > - Since all transforms are composed of primitive transforms, saying
> > > "transforms must set a Coder" means simply that all *primitive*
> > transforms
> > > must set a Coder on their output.
> > >
> > > - In some cases, a primitive PTransform currently doesn't have enough
> > > information to infer a coder for its output collection - e.g.
> > > ParDo.of(DoFn) might be unable to infer a coder for
> > > OutputT. In that case such transforms should allow the user to provide
> a
> > > coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs
> from
> > > requiring the user to set a coder on the resulting collection]
> > >
> > > - Corollary: composite transforms need to only configure their
> primitive
> > > transforms (and composite sub-transforms) properly, and give them a
> Coder
> > > if needed.
> > >
> > > - Corollary: a PTransform with type parameters <FooT, BarT, BazT> needs to
> > > be configurable with coders for all of these, because the implementation of
> > > the transform may change and it may introduce intermediate collections
> > > involving these types. However, in many cases, some of these type
> > > parameters appear in the type of the transform's input, e.g. a
> > > PTransform<PCollection<KV<FooT, BarT>>, PCollection<BazT>> will always be
> > > able to extract the coders for FooT and BarT from the input PCollection, so
> > > the user does not need to provide them. However, a coder for BazT must be
> > > provided. I think in most cases the transform needs to be configurable only
> > > with coders for its output.
> > >
> > > Here's a smooth migration path to accomplish the above:
> > > - Make PCollection.createPrimitiveOutputInternal() take a Coder.
> > > - Make all primitive transforms optionally configurable with a coder for
> > > their outputs, such as ParDo.of(DoFn<InputT, OutputT>).withOutputCoder().
> > > - By using the above, make all composite transforms shipped with the
> SDK
> > > set a Coder on the collections they produce; in some cases, this will
> > > require adding a withSomethingCoder() option to the transform and
> > > propagating that coder to its sub-transforms. If the option is unset,
> > > that's fine for now.
> > > - As a result of the above, get rid of all setCoder() calls in the Beam
> > > repo. The call will still be there, but it will just not be used
> anywhere
> > > in the SDK or examples, and we can mark it deprecated.
> > > - Add guidance to PTransform Style Guide in line with the above
> > >
> > > Does this sound like a good idea? I'm not sure how urgent it would be
> to
> > > actually do this, but I'd like to know whether people agree that this
> is
> > a
> > > good goal in general.
> > >
> >
>



-- 

Mingmin


Re: [NEED HELP] how to revert a PR in branch DSL_SQL

2017-07-20 Thread Mingmin Xu
Thanks @Kenn, awesome explanation! Following option (1) now.


On Thu, Jul 20, 2017 at 10:14 AM, Kenneth Knowles <k...@google.com.invalid>
wrote:

> On Wed, Jul 19, 2017 at 9:35 PM, Mingmin Xu <mingm...@gmail.com> wrote:
>
> > Merging with conflicts is not a good choice for me either, as lots of
> > files are impacted.
> >
>
> Yes, don't do this. You don't need to. I think reverting the merge commit
> is a good idea, then you can start again on this process.
>
>
> > @Kenn, one more question: what do you mean by 'but use git merge'? Is there
> > any difference from how it was processed in #3553?
> >
>
> In #3553 all of the commits from master were duplicated. The effect of
> whatever rebase happened was to bulk cherrypick them onto the
> github/pr/3553 branch, and then merge those new commits to DSL_SQL.
>
> Specifically, it looks like something like this happened:
>
> git checkout github/pr/3553
> git rebase github/DSL_SQL
> git push apache DSL_SQL
>
> At this point, all of the commits from master that were in #3553 have been
> copied to new, identical, commits that are based on the DSL_SQL branch.
> Maybe the rebase was on something else but anything that is not already in
> the history of github/pr/3553 will cause the same problem.
>
> Instead do this:
>
> git checkout github/DSL_SQL
> git merge --no-ff github/pr/3553
> git push apache DSL_SQL
>
> This will not duplicate any commits.
>
> I find it helpful to remember that rebase is basically bulk cherrypick. It
> is sort of OK to rebase normal PRs because those commits are "dead" after
> the commits have been cherrypicked to master. But these problems don't
> occur if you don't use rebase or cherrypick.
>
> I could be wrong about the exact sequence that led to the situation, but I
> am pretty certain that reverting the merge commit and then doing a normal
> merge will work.
>
> Kenn
>
> On Wed, Jul 19, 2017 at 7:05 PM, Kenneth Knowles <k...@google.com.invalid>
> > wrote:
> >
> > > Yes, merging by rebase doesn't work when you have two branches that
> > > interact.
> > >
> > > I checked `git log github/DSL_SQL ^github/master` to understand what is
> > > going on.
> > >
> > > Here are some ideas:
> > >
> > > I think your option (1) is fine, but you should revert and then replay
> > > #3553 but use git merge. This will work because the duplicated commits
> > that
> > > get reverted are not the ones on master. Then you can do any updates
> you
> > > need and propose the merge DSL_SQL -> master.
> > >
> > > Kenn
> > >
> > > On Wed, Jul 19, 2017 at 5:56 PM, Kai Jiang <jiang...@gmail.com> wrote:
> > >
> > > > Or another option is to solve the merge conflicts. This might not be
> > > > the best way, because once the master branch has some changes, we
> > > > still need to do the same thing again.
> > > >
> > > > I tried the merge locally. We could resolve the conflicting files and
> > > > open a PR for that.
> > > >
> > > > conflict files are:
> > > > both modified:   examples/java/src/main/java/org/apache/beam/examples/common/WriteOneFilePerWindow.java
> > > > both modified:   examples/java8/src/main/java/org/apache/beam/examples/complete/game/utils/WriteToText.java
> > > > both modified:   runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/WriteFilesTranslationTest.java
> > > > both modified:   sdks/java/core/src/main/java/org/apache/beam/sdk/io/DefaultFilenamePolicy.java
> > > > both modified:   sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java
> > > > both modified:   sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java
> > > > both modified:   sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java
> > > > both modified:   sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollectionViews.java
> > > > both modified:   sdks/java/core/src/test/java/org/apache/beam/sdk/io/AvroIOTest.java
> > > > both modified:   sdks/java/core/src/test/java/org/apache/beam/sdk/io/FileBasedSinkTest.java
> > > > both modified:   sdks/java/core/src/test/java/org/apache/b

[NEED HELP] how to revert a PR in branch DSL_SQL

2017-07-19 Thread Mingmin Xu
Hi there,

It seems branch DSL_SQL is broken after #3553,
as I cannot create a PR to the master branch; it fails with the error
message '*Can’t automatically merge.*'.

Googled and found two solutions:
1.  submit a revert PR with Git
https://stackoverflow.com/questions/2389361/undo-a-git-merge-that-hasnt-been-pushed-yet/6217372#6217372


I followed this approach; here are the steps (adjust the targets for the
real case):
  a. the revert PR https://github.com/XuMingmin/beam/pull/14;
  b. branch which has applied PR in a)
https://github.com/XuMingmin/beam/tree/revert_3553_test;
  c. Now I can create a PR from XuMingmin/beam/revert_3553_test to
apache/beam/master, see link;

2. reverting a pull request in GitHub
https://help.github.com/articles/reverting-a-pull-request/
This is a GitHub feature; I cannot see the '*revert*' button, maybe because
of permissions.

For both options, I think #3553 will need to be redone in the end.

Any suggestion which is the right way to go, or any other options?

-- 

Mingmin


Re: [VOTE] Release 2.1.0, release candidate #2

2017-07-18 Thread Mingmin Xu
Thanks Kenn. SQL DSL should be ready in the next version, 2.2.0, and I agree
to have an overall row "Add SQL DSL" instead of listing all the detailed
tasks.

On Tue, Jul 18, 2017 at 3:54 PM, Kenneth Knowles <k...@google.com.invalid>
wrote:

> Done.
>
> Since it is all on a feature branch and the release notes when it goes to
> master will include "Add SQL DSL" I did not associate the little bits with
> a release.
>
> On Tue, Jul 18, 2017 at 2:51 PM, Mingmin Xu <mingm...@gmail.com> wrote:
>
> > The tasks of SQL should not be labeled as 2.1.0, I've updated some with
> > 2.2.0, fail to change the 'closed' ones. Can anyone with the permission
> > update these tasks
> > https://issues.apache.org/jira/browse/BEAM-2171?jql=
> > project%20%3D%20BEAM%20AND%20fixVersion%20%3D%202.1.0%
> > 20AND%20component%20%3D%20dsl-sql?
> >
> >
> > Thanks!
> > Mingmin
> >
> > On Tue, Jul 18, 2017 at 2:23 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> > > Yeah, indeed, the issue like BEAM-2171 should not have "Fix Version"
> set
> > > to 2.1.0.
> > >
> > > Regards
> > > JB
> > >
> > > On 07/18/2017 06:52 PM, James wrote:
> > >
> > >> Just noticed that some of the DSL_SQL issues are included in this
> > release?
> > >> e.g. The first one: BEAM-2171, this is not expected,right?
> > >> On Wed, 19 Jul 2017 at 12:30 AM Jean-Baptiste Onofré <j...@nanthrax.net
> >
> > >> wrote:
> > >>
> > >> Hi everyone,
> > >>>
> > >>> Please review and vote on the release candidate #2 for the version
> > 2.1.0,
> > >>> as
> > >>> follows:
> > >>>
> > >>> [ ] +1, Approve the release
> > >>> [ ] -1, Do not approve the release (please provide specific comments)
> > >>>
> > >>>
> > >>> The complete staging area is available for your review, which
> includes:
> > >>> * JIRA release notes [1],
> > >>> * the official Apache source release to be deployed to
> dist.apache.org
> > >>> [2],
> > >>> which is signed with the key with fingerprint C8282E76 [3],
> > >>> * all artifacts to be deployed to the Maven Central Repository [4],
> > >>> * source code tag "v2.1.0-RC2" [5],
> > >>> * website pull request listing the release and publishing the API
> > >>> reference
> > >>> manual [6].
> > >>> * Python artifacts are deployed along with the source release to the
> > >>> dist.apache.org [2].
> > >>>
> > >>> The vote will be open for at least 72 hours. It is adopted by
> majority
> > >>> approval,
> > >>> with at least 3 PMC affirmative votes.
> > >>>
> > >>> Thanks,
> > >>> JB
> > >>>
> > >>> [1]
> > >>>
> > >>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12340528
> > >>> [2] https://dist.apache.org/repos/dist/dev/beam/2.1.0/
> > >>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > >>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1019/
> > >>> [5] https://github.com/apache/beam/tree/v2.1.0-RC2
> > >>> [6] https://github.com/apache/beam-site/pull/270
> > >>>
> > >>>
> > >>
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
> >
> >
> > --
> > 
> > Mingmin
> >
>



-- 

Mingmin


Re: [VOTE] Release 2.1.0, release candidate #2

2017-07-18 Thread Mingmin Xu
The tasks of SQL should not be labeled as 2.1.0, I've updated some with
2.2.0, fail to change the 'closed' ones. Can anyone with the permission
update these tasks
https://issues.apache.org/jira/browse/BEAM-2171?jql=project%20%3D%20BEAM%20AND%20fixVersion%20%3D%202.1.0%20AND%20component%20%3D%20dsl-sql?


Thanks!
Mingmin

On Tue, Jul 18, 2017 at 2:23 PM, Jean-Baptiste Onofré 
wrote:

> Yeah, indeed, the issue like BEAM-2171 should not have "Fix Version" set
> to 2.1.0.
>
> Regards
> JB
>
> On 07/18/2017 06:52 PM, James wrote:
>
>> Just noticed that some of the DSL_SQL issues are included in this release?
>> e.g. The first one: BEAM-2171, this is not expected,right?
>> On Wed, 19 Jul 2017 at 12:30 AM Jean-Baptiste Onofré 
>> wrote:
>>
>> Hi everyone,
>>>
>>> Please review and vote on the release candidate #2 for the version 2.1.0,
>>> as
>>> follows:
>>>
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>>
>>> The complete staging area is available for your review, which includes:
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to dist.apache.org
>>> [2],
>>> which is signed with the key with fingerprint C8282E76 [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v2.1.0-RC2" [5],
>>> * website pull request listing the release and publishing the API
>>> reference
>>> manual [6].
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval,
>>> with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>> JB
>>>
>>> [1]
>>>
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12340528
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.1.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1019/
>>> [5] https://github.com/apache/beam/tree/v2.1.0-RC2
>>> [6] https://github.com/apache/beam-site/pull/270
>>>
>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



-- 

Mingmin


Re: BeamSQL status and merge to master

2017-07-05 Thread Mingmin Xu
Thanks for everybody's effort; we're very close to finishing the existing
tasks. Here's a status update on the SQL DSL. Feel free to give it a try and
share any comments:

*1. what's done*
  The DSL feature is done, with basic filter/project/aggregation/union/join,
and built-in functions/UDF/UDAF (pending on #3491).

*2. what's on-going*
  More unit tests, and documentation for the README/Beam website.

*3. open questions*
  BEAM-2441 <https://issues.apache.org/jira/browse/BEAM-2441> wants
suggestions on a proper module name for the SQL work. As mentioned in the
task, '*dsl/sql* is for the Java SDK and also prevents alternative language
implementations; however, there's another SQL client, which is not a good
fit to be included as a Java SDK extension'.

---
*How to run the example* beam/dsls/sql/example/BeamSqlExample.java
<https://github.com/apache/beam/blob/DSL_SQL/dsls/sql/src/main/java/org/apache/beam/dsls/sql/example/BeamSqlExample.java>
1. run 'mvn install' to avoid the error in #3439
<https://github.com/apache/beam/pull/3439>
2. run 'mvn -pl dsls/sql compile exec:java
-Dexec.mainClass=org.apache.beam.dsls.sql.example.BeamSqlExample
-Dexec.args="--runner=DirectRunner" -Pdirect-runner'

FYI:
1. burn-down list in google doc
https://docs.google.com/document/d/1EHZgSu4Jd75iplYpYT_K_JwSZxL2DWG8kv_EmQzNXFc/edit?usp=sharing
2. JIRA tasks with label 'dsl_sql_merge'
https://issues.apache.org/jira/browse/BEAM-2555?jql=labels%20%3D%20dsl_sql_merge


Mingmin

On Tue, Jun 13, 2017 at 8:51 AM, Lukasz Cwik <lc...@google.com.invalid>
wrote:

> Nevermind, I merged it into #2 about usability.
>
> On Tue, Jun 13, 2017 at 8:50 AM, Lukasz Cwik <lc...@google.com> wrote:
>
> > I added a section about maven module structure/packaging (#6).
> >
> > On Tue, Jun 13, 2017 at 8:30 AM, Tyler Akidau <taki...@google.com.invalid
> >
> > wrote:
> >
> >> Thanks Mingmin. I've copied your list into a doc[1] to make it easier to
> >> collaborate on comments and edits.
> >>
> >> [1] https://s.apache.org/beam-dsl-sql-burndown
> >>
> >> -Tyler
> >>
> >>
> >> On Mon, Jun 12, 2017 at 10:09 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> >> wrote:
> >>
> >> > Hi Mingmin
> >> >
> >> > Sorry, the meeting was in the middle of the night for me and I wasn't
> >> able
> >> > to
> >> > make it.
> >> >
> >> > The timing and checklist look good to me.
> >> >
> >> > We plan to do a Beam release end of June, so, merging in July means we
> >> can
> >> > include it in the next release.
> >> >
> >> > Thanks !
> >> > Regards
> >> > JB
> >> >
> >> > On 06/13/2017 03:06 AM, Mingmin Xu wrote:
> >> > > Hi all,
> >> > >
> >> > > Thanks to join the meeting. As discussed, we're planning to merge
> >> DSL_SQL
> >> > > branch back to master, targeted in the middle of July. A tag
> >> > > 'dsl_sql_merge'[1] is created to track all todo tasks.
> >> > >
> >> > > *What's added in Beam SQL?*
> >> > > BeamSQL provides the capability to execute SQL queries with Beam
> Java
> >> > SDK,
> >> > > either by translating SQL to a PTransform, or run with a standalone
> >> CLI
> >> > > client.
> >> > >
> >> > > *Checklist for merge:*
> >> > > 1. functionality
> >> > >1.1. SQL grammar:
> >> > >  1.1.1. basic query with SELECT/FILTER/PROJECT;
> >> > >  1.1.2. AGGREGATION with global window;
> >> > >  1.1.3. AGGREGATION with FIX_TIME/SLIDING_TIME/SESSION window;
> >> > >  1.1.4. JOIN
> >> > >1.2. UDF/UDAF support;
> >> > >1.3. support predefined String/Math/Date functions, see[2];
> >> > >
> >> > > 2. DSL interface to convert SQL as PTransform;
> >> > >
> >> > > 3. junit test;
> >> > >
> >> > > 4. Java document;
> >> > >
> >> > > 5. Document of SQL feature in website;
> >> > >
> >> > > Any comments/suggestions are very welcomed.
> >> > >
> >> > > Note:
> >> > > [1].
> >> > >
> >> > https://issues.apache.org/jira/browse/BEAM-2436?jql=labels%
> >> 20%3D%20dsl_sql_merge
> >> > >
> >> > > [2]. https://calcite.apache.org/docs/reference.html
> >> > >
> >> >
> >> > --
> >> > Jean-Baptiste Onofré
> >> > jbono...@apache.org
> >> > http://blog.nanthrax.net
> >> > Talend - http://www.talend.com
> >> >
> >>
> >
> >
>



-- 

Mingmin


BeamSQL status and merge to master

2017-06-12 Thread Mingmin Xu
Hi all,

Thanks for joining the meeting. As discussed, we're planning to merge the
DSL_SQL branch back to master, targeted for the middle of July. A tag
'dsl_sql_merge' [1] has been created to track all the TODO tasks.

*What's added in Beam SQL?*
BeamSQL provides the capability to execute SQL queries with the Beam Java
SDK, either by translating SQL to a PTransform, or by running a standalone
CLI client.

*Checklist for merge:*
1. functionality
  1.1. SQL grammar:
1.1.1. basic query with SELECT/FILTER/PROJECT;
1.1.2. AGGREGATION with global window;
1.1.3. AGGREGATION with FIX_TIME/SLIDING_TIME/SESSION window;
1.1.4. JOIN
  1.2. UDF/UDAF support;
  1.3. support predefined String/Math/Date functions, see[2];

2. DSL interface to convert SQL as PTransform;

3. junit test;

4. Java document;

5. Document of SQL feature in website;
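
For example, item 1.1.3 covers queries like the following (a made-up
illustration; TUMBLE/HOP/SESSION follow the Calcite streaming-SQL style
referenced in [2]):

SELECT site_id, COUNT(*) AS order_cnt
FROM order_stream
GROUP BY site_id, HOP(order_time, INTERVAL '1' MINUTE, INTERVAL '1' HOUR);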

Any comments/suggestions are very welcome.

Note:
[1].
https://issues.apache.org/jira/browse/BEAM-2436?jql=labels%20%3D%20dsl_sql_merge

[2]. https://calcite.apache.org/docs/reference.html
-- 

Mingmin


Re: First stable release completed!

2017-05-17 Thread Mingmin Xu
Congratulations to everyone!

On Wed, May 17, 2017 at 8:36 AM, Dan Halperin 
wrote:

> Great job, folks. What an amazing amount of work, and I'd like to
> especially thank the community for participating in hackathons and
> extensive release validation over the last few weeks! We caught some
> crucial issues in time and really pushed a much better release as a result.
>
> Thanks everyone!
> Dan
>
> On Wed, May 17, 2017 at 11:31 AM, Jesse Anderson <
> je...@bigdatainstitute.io>
> wrote:
>
> > Awesome!
> >
> > On Wed, May 17, 2017, 8:30 AM Ahmet Altay 
> > wrote:
> >
> > > Congratulations everyone, this is great!
> > >
> > > On Wed, May 17, 2017 at 7:26 AM, Kenneth Knowles
>  > >
> > > wrote:
> > >
> > > > Awesome. A huge step.
> > > >
> > > > On Wed, May 17, 2017 at 6:30 AM, Andrew Psaltis <
> > > psaltis.and...@gmail.com>
> > > > wrote:
> > > >
> > > > > This is fantastic.  Great job!
> > > > > On Wed, May 17, 2017 at 08:20 Jean-Baptiste Onofré <
> j...@nanthrax.net>
> > > > > wrote:
> > > > >
> > > > > > Huge congrats to everyone who helped reaching this important
> > > milestone
> > > > !
> > > > > >
> > > > > > Honestly, we are a great team, WE ROCK ! ;)
> > > > > >
> > > > > > Regards
> > > > > > JB
> > > > > >
> > > > > > On 05/17/2017 01:28 PM, Davor Bonaci wrote:
> > > > > > > The first stable release is now complete!
> > > > > > >
> > > > > > > Release artifacts are available through various repositories,
> > > > including
> > > > > > > dist.apache.org, Maven Central, and PyPI. The website is
> > updated,
> > > > and
> > > > > > > announcements are published.
> > > > > > >
> > > > > > > Apache Software Foundation press release:
> > > > > > >
> > > > > > http://globenewswire.com/news-release/2017/05/17/986839/0/
> > > > > en/The-Apache-Software-Foundation-Announces-Apache-
> Beam-v2-0-0.html
> > > > > > >
> > > > > > > Beam blog:
> > > > > > > https://beam.apache.org/blog/2017/05/17/beam-first-stable-
> > > > release.html
> > > > > > >
> > > > > > > Congratulations to everyone -- this is a really big milestone
> for
> > > the
> > > > > > > project, and I'm proud to be a part of this great community.
> > > > > > >
> > > > > > > Davor
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Jean-Baptiste Onofré
> > > > > > jbono...@apache.org
> > > > > > http://blog.nanthrax.net
> > > > > > Talend - http://www.talend.com
> > > > > >
> > > > > --
> > > > > Thanks,
> > > > > Andrew
> > > > >
> > > > > Subscribe to my book: Streaming Data
> > > > > twitter: @itmdata
> > > > >
> > > >
> > >
> > --
> > Thanks,
> >
> > Jesse
> >
>



-- 

Mingmin


Re: [VOTE] First stable release: release candidate #4

2017-05-13 Thread Mingmin Xu
+1

Test beam-examples with FlinkRunner, and several cases of KafkaIO/JdbcIO.

Thanks!
Mingmin

On Sat, May 13, 2017 at 7:38 PM, Ahmet Altay 
wrote:

> +1
>
> - Tested Python wordcount with DirectRunner & DataflowRunner on
> Windows/Mac/Linux, and python mobile gaming examples with DirectRunner &
> DataflowRunner on Mac/Linux
> - Verified that generated pydocs are accurate.
> - Python zip file has valid metadata and contains LICENSE, NOTICE and
> README.
>
> Ahmet
>
> On Sat, May 13, 2017 at 1:12 AM, María García Herrero <
> mari...@google.com.invalid> wrote:
>
> > +1 -- validated python quickstart and mobile game for DirectRunner and
> > DataflowRunner on Linux (RC3) and Mac (RC4).
> >
> > Go Beam!
> >
> > On Fri, May 12, 2017 at 11:12 PM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > +1 (binding)
> > >
> > > Tested on beam-samples, especially focus on HDFS support, etc.
> > >
> > > Thanks !
> > > Regards
> > > JB
> > >
> > >
> > > On 05/13/2017 06:47 AM, Davor Bonaci wrote:
> > >
> > >> Hi everyone --
> > >> After going through several release candidates, setting and validating
> > >> acceptance criteria, running a hackathon, and polishing the release,
> now
> > >> is
> > >> the time to vote!
> > >>
> > >> Please review and vote on the release candidate #4 for the version
> > 2.0.0,
> > >> as follows:
> > >> [ ] +1, Approve the release
> > >> [ ] -1, Do not approve the release (please provide specific comments)
> > >>
> > >> The complete staging area is available for review, which includes:
> > >> * JIRA release notes [1],
> > >> * the official Apache source release to be deployed to
> dist.apache.org
> > >> [2],
> > >> which is signed with the key with fingerprint 8F0D334F [3],
> > >> * all artifacts to be deployed to the Maven Central Repository [4],
> > >> * source code tag "v2.0.0-RC4" [5],
> > >> * website pull request listing the release and publishing the API
> > >> reference
> > >> manual [6],
> > >> * Python artifacts are deployed along with the source release to the
> > >> dist.apache.org [2].
> > >>
> > >> Jenkins suites:
> > >> * https://builds.apache.org/job/beam_PreCommit_Java_
> MavenInstall/11439/
> > >> * https://builds.apache.org/job/beam_PostCommit_Java_
> MavenInstall/3801/
> > >> * https://builds.apache.org/job/beam_PostCommit_Python_Verify/2216/
> > >> *
> > >> https://builds.apache.org/job/beam_PostCommit_Java_Validates
> > >> Runner_Apex/1461/
> > >> *
> > >> https://builds.apache.org/job/beam_PostCommit_Java_Validates
> > >> Runner_Dataflow/3123/
> > >> *
> > >> https://builds.apache.org/job/beam_PostCommit_Java_Validates
> > >> Runner_Flink/2808/
> > >> *
> > >> https://builds.apache.org/job/beam_PostCommit_Java_Validates
> > >> Runner_Spark/2060/
> > >>
> > >> The vote will be open for at least 72 hours. It is adopted by majority
> > >> approval of qualified votes, with at least 3 PMC affirmative votes.
> > >>
> > >> Thanks!
> > >>
> > >> Davor
> > >>
> > >> [1]
> > >> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12339746
> > >> [2] https://dist.apache.org/repos/dist/dev/beam/2.0.0-RC4/
> > >> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > >> [4] https://repository.apache.org/content/repositories/orgapachebeam-1017/
> > >> [5] https://github.com/apache/beam/tree/v2.0.0-RC4
> > >> [6] https://github.com/apache/beam-site/pull/231
> > >> [7]
> > >> https://lists.apache.org/thread.html/981c2f13c0daf29876059b1
> > >> 4dbe97e75bcc9e40d3ac38e33a6ecf3f9@%3Cdev.beam.apache.org%3E
> > >> [8]
> > >> https://issues.apache.org/jira/issues/?jql=project%20%3D%
> > >> 20BEAM%20AND%20fixVersion%20%3D%202.0.0%20AND%20resolution%
> > >> 20%3D%20Unresolved%20ORDER%20BY%20component%20ASC%2C%20su
> > >> mmary%20ASC%2C%20assignee%20ASC%2C%20due%20ASC%2C%20priority
> > >> %20DESC%2C%20created%20ASC
> > >>
> > >>
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>



-- 

Mingmin


[PROPOSAL] design of DSL SQL interface

2017-05-12 Thread Mingmin Xu
Hi all,

As you may know, we're working on BeamSQL to execute SQL queries as a Beam
pipeline. This is a valuable feature, shipped not only as a packaged CLI,
but also as part of the SDK for assembling a pipeline.

I've prepared a document [1] listing the high-level APIs, to show how SQL
queries can be added to a pipeline. Below is a snippet of pseudocode for
quick reference:

PipelineOptions options = PipelineOptionsFactory...
Pipeline pipeline = Pipeline.create(options);

//prepare the environment of BeamSQL
BeamSQLEnvironment sqlEnv = BeamSQLEnvironment.create(pipeline);
//register table metadata
sqlEnv.addTableMetadata(String tableName, BeamSqlTable tableMetadata);
//register a UDF
sqlEnv.registerUDF(String functionName, Method udfMethod);

//explain a SQL statement, SELECT only, and return it as a PCollection
PCollection phase1Stream = sqlEnv.explainSQL(String sqlStatement);
//a PCollection explained by BeamSQL can be registered as a table,
//and further queries applied to it
sqlEnv.registerPCollectionAsTable(String tableName, phase1Stream);

//apply more queries, even based on phase1Stream

pipeline.run().waitUntilFinish();

Any feedback is very welcome!

[1]
https://docs.google.com/document/d/1uWXL_yF3UUO5GfCxbL6kWsmC8xCWfICU3RwiQKsk7Mk/edit?usp=sharing

-- 

Mingmin


Re: Pull request - power function

2017-05-12 Thread Mingmin Xu
Thanks @Tarush, will also take a look.

On Fri, May 12, 2017 at 7:19 AM, Jean-Baptiste Onofré 
wrote:

> Thanks,
>
> we gonna take a look.
>
> Regards
> JB
>
>
> On 05/12/2017 04:12 PM, tarush grover wrote:
>
>> Hi Team,
>>
>> I have opened a pull request Beam-2171 power function #3092.
>>
>> Kindly review and verify.
>>
>> Regards,
>> Tarush
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



-- 

Mingmin


Re: Congratulations Davor!

2017-05-04 Thread Mingmin Xu
Congratulations @Davor!


> On May 4, 2017, at 7:08 AM, Amit Sela  wrote:
> 
> Congratulations Davor!
> 
>> On Thu, May 4, 2017, 10:02 JingsongLee  wrote:
>> 
>> Congratulations!
>> --
>> From:Jesse Anderson 
>> Time:2017 May 4 (Thu) 21:36
>> To:dev 
>> Subject:Re: Congratulations Davor!
>> Congrats!
>> 
>>> On Thu, May 4, 2017, 6:20 AM Aljoscha Krettek  wrote:
>>> 
>>> Congrats! :-)
 On 4. May 2017, at 14:34, Kenneth Knowles 
>>> wrote:
 
 Awesome!
 
> On Thu, May 4, 2017 at 1:19 AM, Ted Yu  wrote:
> 
> Congratulations, Davor!
> 
> On Thu, May 4, 2017 at 12:45 AM, Aviem Zur >> wrote:
> 
>> Congrats Davor! :)
>> 
>> On Thu, May 4, 2017 at 10:42 AM Jean-Baptiste Onofré <
>> j...@nanthrax.net>
>> wrote:
>> 
>>> Congrats ! Well deserved ;)
>>> 
>>> Regards
>>> JB
>>> 
 On 05/04/2017 09:30 AM, Jason Kuster wrote:
 Hi all,
 
 The ASF has just published a blog post[1] welcoming new members of
> the
 Apache Software Foundation, and our own Davor Bonaci is among them!
 Congratulations and thank you to Davor for all of your work for the
>> Beam
 community, and the ASF at large. Well deserved.
 
 Best,
 
 Jason
 
 [1] https://blogs.apache.org/foundation/entry/the-apache-sof
 tware-foundation-welcomes
 
 P.S. I dug through the list to make sure I wasn't missing any other
>> Beam
 community members; if I have, my sincerest apologies and please
>> recognize
 them on this or a new thread.
 
>>> 
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>> 
>> 
> 
>>> 
>>> --
>> Thanks,
>> 
>> Jesse
>> 
>> 


Re: [DISCUSSION] Encouraging more contributions

2017-04-24 Thread Mingmin Xu
Many design documents are scattered across the mailing list and JIRA
comments; it would be a big help to put them in a centralized list. I would
also expect more wiki/blog posts providing in-depth analysis, like the
translation from a pipeline to a runner-specific topology, or the
window/trigger implementation. Without this knowledge, it's hard to grasp
the core concepts.

On Mon, Apr 24, 2017 at 6:03 AM, Jean-Baptiste Onofré 
wrote:

> Got it. By experience on other Apache projects, it's really hard to
> maintain ;)
>
> Regards
> JB
>
>
> On 04/24/2017 02:56 PM, Etienne Chauchot wrote:
>
>> Hi JB,
>>
>> I was proposing a FAQ (or another form), not something about IDE setup.
>> The FAQ
>> could group in the same place Q/A like for example "what is a source, how
>> do I
>> use it to implement an IO"
>>
>> Etienne
>>
>>
>> Le 24/04/2017 à 14:19, Jean-Baptiste Onofré a écrit :
>>
>>> Hi Etienne,
>>>
>>> What about the contribution guide ? I think it's covered in the IntelliJ
>>> and
>>> Eclipse setup sections.
>>>
>>> Regards
>>> JB
>>>
>>> On 04/24/2017 02:12 PM, Etienne Chauchot wrote:
>>>
 Hi all,

 I definitely agree with everything that is said in this thread.

 I might suggest another good to have:

 to ease the work of a new contributor, it would be nice to have some
 sort of
 programming guide but not oriented to pipeline writers but to
 sdk/runner/io/...
 writers.

 I know that new contributors have the docs available in the google
 drive, the
 ML, the code base, and the availability of beamers, but maybe having
 key points
 in a common place (like FAQ for sdk/runner/io/... writers, for example)
 would be
 interesting.

 Best,

 Etienne


 Le 24/04/2017 à 09:14, Jean-Baptiste Onofré a écrit :

> Hi,
>
> I think we already tag the newbie jira ("low hanging fruit" ;)).
>
> Good idea for domain of interest/concept.
>
> Regards
> JB
>
> On 04/24/2017 09:01 AM, Ankur Chauhan wrote:
>
>> Might I suggest adding tags to projects based on area of intetest,
>> concept
>> and if it's a good "first bug".
>>
>> Sent from my iPhone
>>
>> On Apr 23, 2017, at 23:03, Davor Bonaci  wrote:
>>
>>
 1. Have people unassign themselves from issues they're not actively
 working on.
 2. Have the community engage more in triage, improving tickets
 descriptions and raising concerns.
 3. Clean house - apply (2) to currently open issues (over 800).
 Perhaps
 some can be closed.


>>> +1 on all three of these, and will do my part shortly!
>>>
>>> Also, it is worth noting that we have improved as a project in
>>> tracking
>>> issues in the last 1-2 months. There are more resolved issues than
>>> opened
>>> in this period, whereas in the past we'd have a hundred more opened
>>> than
>>> resolved.
>>>
>>> I would also propose to not assign new Jira automatically: now, the
>>> Jira is
>>>
 automatically assigned to the Jira component leader.


>>> Imagine a user discovering an issue and filing a new JIRA issue. It
>>> wouldn't be assigned to anyone, significantly reducing the chance
>>> somebody
>>> will actually help.
>>>
>>> Of course, somebody could search for new issues periodically, etc.
>>> -- but
>>> that just won't happen. The final outcome would be -- instead of a
>>> lot of
>>> issues assigned to component leads, we'd have (much) more unassigned
>>> issues, which were *never* looked at. Assigning an issue just sets a
>>> community expectation that a committer should look -- and it does
>>> help move
>>> things along!
>>>
>>> I think a better approach of addressing the current state would be
>>> increase
>>> the number of components / component leads. With more people
>>> involved and
>>> lower per-person load, I think we'd be more effective.
>>>
>>
>

>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



-- 

Mingmin


Re: [DISCUSSION] Encouraging more contributions

2017-04-22 Thread Mingmin Xu
Good point. We could also disable the auto-assignment when creating a JIRA
ticket; right now it goes to the component leader directly.

Sent from my iPhone

> On Apr 22, 2017, at 7:34 AM, Ted Yu  wrote:
> 
> +1
> 
>> On Sat, Apr 22, 2017 at 7:31 AM, Aviem Zur  wrote:
>> 
>> Hi all,
>> 
>> I wanted to start a discussion about actions we can take to encourage more
>> contributions to the project.
>> 
>> A few points I've been thinking of:
>> 
>> 1. Have people unassign themselves from issues they're not actively working
>> on.
>> 2. Have the community engage more in triage, improving tickets descriptions
>> and raising concerns.
>> 3. Clean house - apply (2) to currently open issues (over 800). Perhaps
>> some can be closed.
>> 
>> Thoughts? Ideas?
>> 


Re: [PROPOSAL]: a new feature branch for SQL DSL

2017-04-14 Thread Mingmin Xu
It's more about how the State API can be introduced in SQL; the snapshot of
state converts a stream into a table, which is very helpful. The SQL keyword
INSERT INTO may be an option to do that, but I have no confidence in it so
far.

On Fri, Apr 14, 2017 at 3:03 PM, Tyler Akidau <taki...@apache.org> wrote:

> Tarush: I don't think it depends upon the time frame (although you may be
> interested in only a specific timeframe materialized within the table).
> Stream to table conversion is purely a byproduct of grouping a stream. I
> have a doc I'm getting some initial reviews on currently that I hope to
> send out next week to hopefully give some more background here. And
> windowing is really just an additional dimension in grouping. An important
> one, to be sure, but still just grouping.
>
> Mingmin: can you expand upon those statements? I'm not sure I fully
> understand what you're saying.
>
> -Tyler
>
> On Wed, Apr 12, 2017 at 9:38 PM Mingmin Xu <mingm...@gmail.com> wrote:
>
> > Exposing a streaming snapshot via STATE is attractive in the Beam model,
> > but I doubt it's the right way in SQL. IMO, there's 'INSERT INTO' to
> > persist streaming output.
> >
> >
> > On Wed, Apr 12, 2017 at 8:37 PM, tarush grover <tarushappt...@gmail.com>
> > wrote:
> >
> > > Hi Tyler,
> > >
> > > Transforming stream into a table will also depend on the time frame in
> > the
> > > stream or what windows we choose for the stream.
> > >
> > > Regards,
> > > Tarush
> > >
> > >
> > > On Tue, 11 Apr 2017 at 11:29 PM, Tyler Akidau
> <taki...@google.com.invalid
> > >
> > > wrote:
> > >
> > > > Hi 陈竞,
> > > >
> > > > I'm doubtful there will be an explicit equivalent of the State API in
> > > SQL,
> > > > at least not in the SQL portion of the DSL itself (it might make
> sense
> > to
> > > > expose one within UDFs). The State API is an imperative interface for
> > > > accessing an underlying persistent state table, whereas SQL operates
> > more
> > > > functionally. There's no good way I'm aware of to expose the
> > > > characteristics provided by the State API (logic-driven, fine- and
> > > > coarse-grained reads/writes of potentially multiple fields of state
> > > > utilizing potentially multiple data types) in raw SQL cleanly.
> > > >
> > > > On the upside, SQL has the advantage of making it very easy to
> > > materialize
> > > > new state tables very naturally. In the proposal I'll be sharing for
> > how
> > > I
> > > > think we should integrate streaming into SQL robustly, any time you
> > > perform
> > > > some grouping operation (GROUP BY, JOIN, CUBE, etc) you're
> transforming
> > > > your stream into a table. That table is effectively a persistent
> state
> > > > table. So there exists a large suite of functionality in standard SQL
> > > that
> > > > gives you a lot of powerful tools for creating state.
> > > >
> > > > It may also be possible for the different access patterns of more
> > > > complicated data structures (e.g., bags or lists) to be captured by
> > > > different data types supported by the underlying systems. But I don't
> > > > expect there to be an imperative State access API built into SQL
> > itself.
> > > >
> > > > All that said, I'm curious to hear ideas otherwise if anyone has
> them.
> > > :-)
> > > >
> > > > -Tyler
> > > >
> > > > On Mon, Apr 10, 2017 at 10:19 PM 陈竞 <cj.mag...@gmail.com> wrote:
> > > >
> > > > > i just want to know what the SQL State API equivalent is for SQL,
> > since
> > > > > beam has already support stateful processing using state DoFn
> > > > >
> > > > > 2017-04-11 2:12 GMT+08:00 Tyler Akidau <taki...@google.com.invalid
> >:
> > > > >
> > > > > > 陈竞, what are you specifically curious about regarding state? Are
> > you
> > > > > > wanting to know what the SQL State API equivalent is for SQL? Or
> > are
> > > > you
> > > > > > asking an operational question about where the state for a given
> > SQL
> > > > > > pipeline will live?
> > > > > >
> > > > > > -Tyler
> > > > > >
> > > > > >
> > > > > > On Sun, Apr 9, 20

Re: [PROPOSAL]: a new feature branch for SQL DSL

2017-04-12 Thread Mingmin Xu
Exposing a streaming snapshot via STATE is attractive in the Beam model, but
I doubt it's the right way in SQL. IMO, there's 'INSERT INTO' to persist
streaming output.


On Wed, Apr 12, 2017 at 8:37 PM, tarush grover <tarushappt...@gmail.com>
wrote:

> Hi Tyler,
>
> Transforming stream into a table will also depend on the time frame in the
> stream or what windows we choose for the stream.
>
> Regards,
> Tarush
>
>
> On Tue, 11 Apr 2017 at 11:29 PM, Tyler Akidau <taki...@google.com.invalid>
> wrote:
>
> > Hi 陈竞,
> >
> > I'm doubtful there will be an explicit equivalent of the State API in
> SQL,
> > at least not in the SQL portion of the DSL itself (it might make sense to
> > expose one within UDFs). The State API is an imperative interface for
> > accessing an underlying persistent state table, whereas SQL operates more
> > functionally. There's no good way I'm aware of to expose the
> > characteristics provided by the State API (logic-driven, fine- and
> > coarse-grained reads/writes of potentially multiple fields of state
> > utilizing potentially multiple data types) in raw SQL cleanly.
> >
> > On the upside, SQL has the advantage of making it very easy to
> materialize
> > new state tables very naturally. In the proposal I'll be sharing for how
> I
> > think we should integrate streaming into SQL robustly, any time you
> perform
> > some grouping operation (GROUP BY, JOIN, CUBE, etc) you're transforming
> > your stream into a table. That table is effectively a persistent state
> > table. So there exists a large suite of functionality in standard SQL
> that
> > gives you a lot of powerful tools for creating state.
> >
> > It may also be possible for the different access patterns of more
> > complicated data structures (e.g., bags or lists) to be captured by
> > different data types supported by the underlying systems. But I don't
> > expect there to be an imperative State access API built into SQL itself.
> >
> > All that said, I'm curious to hear ideas otherwise if anyone has them.
> :-)
> >
> > -Tyler
> >
> > On Mon, Apr 10, 2017 at 10:19 PM 陈竞 <cj.mag...@gmail.com> wrote:
> >
> > > i just want to know what the SQL State API equivalent is for SQL, since
> > > beam has already support stateful processing using state DoFn
> > >
> > > 2017-04-11 2:12 GMT+08:00 Tyler Akidau <taki...@google.com.invalid>:
> > >
> > > > 陈竞, what are you specifically curious about regarding state? Are you
> > > > wanting to know what the SQL State API equivalent is for SQL? Or are
> > you
> > > > asking an operational question about where the state for a given SQL
> > > > pipeline will live?
> > > >
> > > > -Tyler
> > > >
> > > >
> > > > On Sun, Apr 9, 2017 at 12:39 PM Mingmin Xu <mingm...@gmail.com>
> wrote:
> > > >
> > > > > Thanks @JB, will come out the initial PR soon.
> > > > >
> > > > > On Sun, Apr 9, 2017 at 12:28 PM, Jean-Baptiste Onofré <
> > j...@nanthrax.net
> > > >
> > > > > wrote:
> > > > >
> > > > > > As discussed, I created the DSL_SQL branch with the skeleton.
> > Mingmin
> > > > is
> > > > > > rebasing on this branch to submit the PR.
> > > > > >
> > > > > > Regards
> > > > > > JB
> > > > > >
> > > > > >
> > > > > > On 04/09/2017 08:02 PM, Mingmin Xu wrote:
> > > > > >
> > > > > >> State is not touched yet, welcome to add it.
> > > > > >>
> > > > > >> On Sun, Apr 9, 2017 at 2:40 AM, 陈竞 <cj.mag...@gmail.com> wrote:
> > > > > >>
> > > > > >> how will this sql support state both in streaming and batch mode
> > > > > >>>
> > > > > >>> 2017-04-07 4:54 GMT+08:00 Mingmin Xu <mingm...@gmail.com>:
> > > > > >>>
> > > > > >>> @Tyler, there's no big change in the previous design doc, I
> added
> > > > some
> > > > > >>>> details in chapter 'Part 2. DML( [INSERT] SELECT )' ,
> describing
> > > > steps
> > > > > >>>> to
> > > > > >>>> process a query, feel free to leave a comment.
> > > > > >>>>
> > > > > >>>> Come through your doc of 'EMIT'

Re: [PROPOSAL]: a new feature branch for SQL DSL

2017-04-11 Thread Mingmin Xu
It's not; you can use the feature branch
https://github.com/apache/beam/tree/DSL_SQL after
https://github.com/apache/beam/pull/2479 is merged. Stay tuned.

On Tue, Apr 11, 2017 at 10:00 AM, tarush grover <tarushappt...@gmail.com>
wrote:

> Wanted to know active branch for the beam SQL feature, is it beam - 301
> where the development for this feature would happen?
>
> Regards,
> Tarush
>
> On Tue, 11 Apr 2017 at 10:49 AM, 陈竞 <cj.mag...@gmail.com> wrote:
>
> > i just want to know what the SQL State API equivalent is for SQL, since
> > beam has already support stateful processing using state DoFn
> >
> > 2017-04-11 2:12 GMT+08:00 Tyler Akidau <taki...@google.com.invalid>:
> >
> > > 陈竞, what are you specifically curious about regarding state? Are you
> > > wanting to know what the SQL State API equivalent is for SQL? Or are
> you
> > > asking an operational question about where the state for a given SQL
> > > pipeline will live?
> > >
> > > -Tyler
> > >
> > >
> > > On Sun, Apr 9, 2017 at 12:39 PM Mingmin Xu <mingm...@gmail.com> wrote:
> > >
> > > > Thanks @JB, will come out the initial PR soon.
> > > >
> > > > On Sun, Apr 9, 2017 at 12:28 PM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > > > wrote:
> > > >
> > > > > As discussed, I created the DSL_SQL branch with the skeleton.
> Mingmin
> > > is
> > > > > rebasing on this branch to submit the PR.
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > >
> > > > > On 04/09/2017 08:02 PM, Mingmin Xu wrote:
> > > > >
> > > > >> State is not touched yet, welcome to add it.
> > > > >>
> > > > >> On Sun, Apr 9, 2017 at 2:40 AM, 陈竞 <cj.mag...@gmail.com> wrote:
> > > > >>
> > > > >> how will this sql support state both in streaming and batch mode
> > > > >>>
> > > > >>> 2017-04-07 4:54 GMT+08:00 Mingmin Xu <mingm...@gmail.com>:
> > > > >>>
> > > > >>> @Tyler, there's no big change in the previous design doc, I added
> > > some
> > > > >>>> details in chapter 'Part 2. DML( [INSERT] SELECT )' , describing
> > > steps
> > > > >>>> to
> > > > >>>> process a query, feel free to leave a comment.
> > > > >>>>
> > > > >>>> Come through your doc of 'EMIT', it's awesome from my
> perspective.
> > > > I've
> > > > >>>> some tests on GroupBy with default triggers/allowed_lateness
> now.
> > > EMIT
> > > > >>>> syntax can be added to fill the gap.
> > > > >>>>
> > > > >>>> On Thu, Apr 6, 2017 at 1:04 PM, Tyler Akidau <
> taki...@apache.org>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>> I'm very excited by this development as well, thanks for
> > continuing
> > > to
> > > > >>>>>
> > > > >>>> push
> > > > >>>>
> > > > >>>>> this forward, Mingmin. :-)
> > > > >>>>>
> > > > >>>>> I noticed you'd made some changes to your design doc
> > > > >>>>> <
> > https://docs.google.com/document/d/1Uc5xYTpO9qsLXtT38OfuoqSLimH_
> > > > >>>>> 0a1Bz5BsCROMzCU/edit>.
> > > > >>>>> Is it ready for another review? How reflective is it currently
> of
> > > the
> > > > >>>>>
> > > > >>>> work
> > > > >>>>
> > > > >>>>> that going into the feature branch?
> > > > >>>>>
> > > > >>>>> In parallel, I'd also like to continue helping push forward the
> > > > >>>>>
> > > > >>>> definition
> > > > >>>>
> > > > >>>>> of unified model semantics for SQL so we can get Calcite to a
> > point
> > > > >>>>>
> > > > >>>> where
> > > > >>>
> > > > >>>> it supports the full Beam model. I added a comment
> > > > >>>>> <https://issues.apache.org/j

Re: [PROPOSAL]: a new feature branch for SQL DSL

2017-04-09 Thread Mingmin Xu
Thanks @JB, will come out the initial PR soon.

On Sun, Apr 9, 2017 at 12:28 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> As discussed, I created the DSL_SQL branch with the skeleton. Mingmin is
> rebasing on this branch to submit the PR.
>
> Regards
> JB
>
>
> On 04/09/2017 08:02 PM, Mingmin Xu wrote:
>
>> State is not touched yet, welcome to add it.
>>
>> On Sun, Apr 9, 2017 at 2:40 AM, 陈竞 <cj.mag...@gmail.com> wrote:
>>
>> how will this sql support state both in streaming and batch mode
>>>
>>> 2017-04-07 4:54 GMT+08:00 Mingmin Xu <mingm...@gmail.com>:
>>>
>>> @Tyler, there's no big change in the previous design doc, I added some
>>>> details in chapter 'Part 2. DML( [INSERT] SELECT )' , describing steps
>>>> to
>>>> process a query, feel free to leave a comment.
>>>>
>>>> Come through your doc of 'EMIT', it's awesome from my perspective. I've
>>>> some tests on GroupBy with default triggers/allowed_lateness now. EMIT
>>>> syntax can be added to fill the gap.
>>>>
>>>> On Thu, Apr 6, 2017 at 1:04 PM, Tyler Akidau <taki...@apache.org>
>>>> wrote:
>>>>
>>>> I'm very excited by this development as well, thanks for continuing to
>>>>>
>>>> push
>>>>
>>>>> this forward, Mingmin. :-)
>>>>>
>>>>> I noticed you'd made some changes to your design doc
>>>>> <https://docs.google.com/document/d/1Uc5xYTpO9qsLXtT38OfuoqSLimH_
>>>>> 0a1Bz5BsCROMzCU/edit>.
>>>>> Is it ready for another review? How reflective is it currently of the
>>>>>
>>>> work
>>>>
>>>>> that going into the feature branch?
>>>>>
>>>>> In parallel, I'd also like to continue helping push forward the
>>>>>
>>>> definition
>>>>
>>>>> of unified model semantics for SQL so we can get Calcite to a point
>>>>>
>>>> where
>>>
>>>> it supports the full Beam model. I added a comment
>>>>> <https://issues.apache.org/jira/browse/BEAM-301?
>>>>>
>>>> focusedCommentId=15959621&
>>>>
>>>>> page=com.atlassian.jira.plugin.system.issuetabpanels:
>>>>> comment-tabpanel#comment-15959621>
>>>>> on the JIRA suggesting I create a doc with a specification proposal for
>>>>> EMIT (and any other necessary semantic changes) that we can then
>>>>>
>>>> iterate
>>>
>>>> on
>>>>
>>>>> in public with the Calcite folks. I already have most of the content
>>>>> written (and there's a significant amount of background needed to
>>>>>
>>>> justify
>>>
>>>> some aspects of the proposal), so it'll mostly be a matter of pulling
>>>>>
>>>> it
>>>
>>>> all together into something coherent. Does that sound reasonable to
>>>>> everyone?
>>>>>
>>>>> -Tyler
>>>>>
>>>>>
>>>>> On Thu, Apr 6, 2017 at 10:26 AM Kenneth Knowles <k...@google.com.invalid
>>>>>
>>>>
>>>> wrote:
>>>>>
>>>>> Very cool! I'm really excited about this integration.
>>>>>>
>>>>>> On Thu, Apr 6, 2017 at 9:39 AM, Jean-Baptiste Onofré <
>>>>>>
>>>>> j...@nanthrax.net>
>>>
>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>>
>>>>>>> Mingmin and I prepared a new branch to have the SQL DSL in dsls/sql
>>>>>>> location.
>>>>>>>
>>>>>>> Any help is welcome !
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>>
>>>>>>> On 04/06/2017 06:36 PM, Mingmin Xu wrote:
>>>>>>>
>>>>>>> @Tarush, you're very welcome to join the effort.
>>>>>>>>
>>>>>>>> On Thu, Apr 6, 2017 at 7:22 AM, tarush grover <
>>>>>>>>
>>>>>>> tarushappt...@gmail.com>
>>>>>
>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>

Re: [PROPOSAL]: a new feature branch for SQL DSL

2017-04-09 Thread Mingmin Xu
State is not touched yet; you're welcome to add it.

On Sun, Apr 9, 2017 at 2:40 AM, 陈竞 <cj.mag...@gmail.com> wrote:

> how will this sql support state both in streaming and batch mode
>
> 2017-04-07 4:54 GMT+08:00 Mingmin Xu <mingm...@gmail.com>:
>
> > @Tyler, there's no big change in the previous design doc, I added some
> > details in chapter 'Part 2. DML( [INSERT] SELECT )' , describing steps to
> > process a query, feel free to leave a comment.
> >
> > Come through your doc of 'EMIT', it's awesome from my perspective. I've
> > some tests on GroupBy with default triggers/allowed_lateness now. EMIT
> > syntax can be added to fill the gap.
> >
> > On Thu, Apr 6, 2017 at 1:04 PM, Tyler Akidau <taki...@apache.org> wrote:
> >
> > > I'm very excited by this development as well, thanks for continuing to
> > push
> > > this forward, Mingmin. :-)
> > >
> > > I noticed you'd made some changes to your design doc
> > > <https://docs.google.com/document/d/1Uc5xYTpO9qsLXtT38OfuoqSLimH_
> > > 0a1Bz5BsCROMzCU/edit>.
> > > Is it ready for another review? How reflective is it currently of the
> > work
> > > that going into the feature branch?
> > >
> > > In parallel, I'd also like to continue helping push forward the
> > definition
> > > of unified model semantics for SQL so we can get Calcite to a point
> where
> > > it supports the full Beam model. I added a comment
> > > <https://issues.apache.org/jira/browse/BEAM-301?
> > focusedCommentId=15959621&
> > > page=com.atlassian.jira.plugin.system.issuetabpanels:
> > > comment-tabpanel#comment-15959621>
> > > on the JIRA suggesting I create a doc with a specification proposal for
> > > EMIT (and any other necessary semantic changes) that we can then
> iterate
> > on
> > > in public with the Calcite folks. I already have most of the content
> > > written (and there's a significant amount of background needed to
> justify
> > > some aspects of the proposal), so it'll mostly be a matter of pulling
> it
> > > all together into something coherent. Does that sound reasonable to
> > > everyone?
> > >
> > > -Tyler
> > >
> > >
> > > On Thu, Apr 6, 2017 at 10:26 AM Kenneth Knowles <k...@google.com.invalid
> >
> > > wrote:
> > >
> > > > Very cool! I'm really excited about this integration.
> > > >
> > > > On Thu, Apr 6, 2017 at 9:39 AM, Jean-Baptiste Onofré <
> j...@nanthrax.net>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Mingmin and I prepared a new branch to have the SQL DSL in dsls/sql
> > > > > location.
> > > > >
> > > > > Any help is welcome !
> > > > >
> > > > > Thanks,
> > > > > Regards
> > > > > JB
> > > > >
> > > > >
> > > > > On 04/06/2017 06:36 PM, Mingmin Xu wrote:
> > > > >
> > > > >> @Tarush, you're very welcome to join the effort.
> > > > >>
> > > > >> On Thu, Apr 6, 2017 at 7:22 AM, tarush grover <
> > > tarushappt...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> Hi,
> > > > >>>
> > > > >>> Can I be also part of this feature development.
> > > > >>>
> > > > >>> Regards,
> > > > >>> Tarush Grover
> > > > >>>
> > > > >>> On Thu, Apr 6, 2017 at 3:17 AM, Ted Yu <yuzhih...@gmail.com>
> > wrote:
> > > > >>>
> > > > >>> I compiled BEAM-301 branch with calcite 1.12 - passed.
> > > > >>>>
> > > > >>>> Julian tries to not break existing things, but he will if
> there's
> > a
> > > > >>>>
> > > > >>> reason
> > > > >>>
> > > > >>>> to do so :-)
> > > > >>>>
> > > > >>>> On Wed, Apr 5, 2017 at 2:36 PM, Mingmin Xu <mingm...@gmail.com>
> > > > wrote:
> > > > >>>>
> > > > >>>> @Ted, thanks for the note. I intend to stick with one version,
> > Beam
> > > > >>>>>
> > > > >>>> 0.6.0
> > > > >>>
> > > > >>>> and C

Re: [PROPOSAL]: a new feature branch for SQL DSL

2017-04-06 Thread Mingmin Xu
@Tyler, there's no big change in the previous design doc; I added some
details in the chapter 'Part 2. DML( [INSERT] SELECT )', describing the
steps to process a query. Feel free to leave a comment.

I came across your doc on 'EMIT'; it's awesome from my perspective. I've run
some tests on GroupBy with the default triggers/allowed_lateness now. The
EMIT syntax can be added to fill the gap.

On Thu, Apr 6, 2017 at 1:04 PM, Tyler Akidau <taki...@apache.org> wrote:

> I'm very excited by this development as well, thanks for continuing to push
> this forward, Mingmin. :-)
>
> I noticed you'd made some changes to your design doc
> <https://docs.google.com/document/d/1Uc5xYTpO9qsLXtT38OfuoqSLimH_
> 0a1Bz5BsCROMzCU/edit>.
> Is it ready for another review? How reflective is it currently of the work
> that going into the feature branch?
>
> In parallel, I'd also like to continue helping push forward the definition
> of unified model semantics for SQL so we can get Calcite to a point where
> it supports the full Beam model. I added a comment
> <https://issues.apache.org/jira/browse/BEAM-301?focusedCommentId=15959621&
> page=com.atlassian.jira.plugin.system.issuetabpanels:
> comment-tabpanel#comment-15959621>
> on the JIRA suggesting I create a doc with a specification proposal for
> EMIT (and any other necessary semantic changes) that we can then iterate on
> in public with the Calcite folks. I already have most of the content
> written (and there's a significant amount of background needed to justify
> some aspects of the proposal), so it'll mostly be a matter of pulling it
> all together into something coherent. Does that sound reasonable to
> everyone?
>
> -Tyler
>
>
> On Thu, Apr 6, 2017 at 10:26 AM Kenneth Knowles <k...@google.com.invalid>
> wrote:
>
> > Very cool! I'm really excited about this integration.
> >
> > On Thu, Apr 6, 2017 at 9:39 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> > > Hi,
> > >
> > > Mingmin and I prepared a new branch to have the SQL DSL in dsls/sql
> > > location.
> > >
> > > Any help is welcome !
> > >
> > > Thanks,
> > > Regards
> > > JB
> > >
> > >
> > > On 04/06/2017 06:36 PM, Mingmin Xu wrote:
> > >
> > >> @Tarush, you're very welcome to join the effort.
> > >>
> > >> On Thu, Apr 6, 2017 at 7:22 AM, tarush grover <
> tarushappt...@gmail.com>
> > >> wrote:
> > >>
> > >> Hi,
> > >>>
> > >>> Can I be also part of this feature development.
> > >>>
> > >>> Regards,
> > >>> Tarush Grover
> > >>>
> > >>> On Thu, Apr 6, 2017 at 3:17 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >>>
> > >>> I compiled BEAM-301 branch with calcite 1.12 - passed.
> > >>>>
> > >>>> Julian tries to not break existing things, but he will if there's a
> > >>>>
> > >>> reason
> > >>>
> > >>>> to do so :-)
> > >>>>
> > >>>> On Wed, Apr 5, 2017 at 2:36 PM, Mingmin Xu <mingm...@gmail.com>
> > wrote:
> > >>>>
> > >>>> @Ted, thanks for the note. I intend to stick with one version, Beam
> > >>>>>
> > >>>> 0.6.0
> > >>>
> > >>>> and Calcite 1.11 so far, unless impacted by API change. Before it's
> > >>>>>
> > >>>> merged
> > >>>>
> > >>>>> back to master, will upgrade to the latest version.
> > >>>>>
> > >>>>> On Wed, Apr 5, 2017 at 2:14 PM, Ted Yu <yuzhih...@gmail.com>
> wrote:
> > >>>>>
> > >>>>> Working in feature branch is good - you may want to periodically
> sync
> > >>>>>>
> > >>>>> up
> > >>>>
> > >>>>> with master.
> > >>>>>>
> > >>>>>> I noticed that you are using 1.11.0 of calcite.
> > >>>>>> 1.12 is out, FYI
> > >>>>>>
> > >>>>>> On Wed, Apr 5, 2017 at 2:05 PM, Mingmin Xu <mingm...@gmail.com>
> > >>>>>>
> > >>>>> wrote:
> > >>>
> > >>>>
> > >>>>>> Hi all,
> > >>>>>>>
> > >>>>>>> I'm working on https://issues.apache.org

Re: Kafka Offset handling for Restart/failure scenarios.

2017-03-21 Thread Mingmin Xu
Moving the discussion to the dev list.

Savepoints in Flink, and also checkpoints in Spark, should be good enough to
handle this case.

When people don't enable these features, for example when they only need
at-most-once semantics, each unbounded IO should try its best to restore
from the last offset, even though the CheckpointMark is null. Any ideas?
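
One best-effort fallback is to lean on Kafka's own consumer-group offset
tracking when there is no CheckpointMark to restore from. A rough sketch
(broker, topic, and group names are made up, and the builder methods are as
in recent KafkaIO, so adjust for your SDK version; this is best-effort, not
exactly-once):

import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ResumeFromKafkaOffsets {
  public static void configure(Pipeline pipeline) {
    Map<String, Object> consumerProps = new HashMap<>();
    consumerProps.put("group.id", "my-beam-pipeline");  // stable consumer group
    consumerProps.put("enable.auto.commit", true);      // let Kafka track offsets
    consumerProps.put("auto.offset.reset", "earliest"); // only when the group has no offset

    pipeline.apply(KafkaIO.<byte[], byte[]>read()
        .withBootstrapServers("broker1:9092")
        .withTopic("events")
        .withKeyDeserializer(ByteArrayDeserializer.class)
        .withValueDeserializer(ByteArrayDeserializer.class)
        .updateConsumerProperties(consumerProps));
  }
}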

Mingmin

On Tue, Mar 21, 2017 at 9:39 AM, Dan Halperin <dhalp...@apache.org> wrote:

> hey,
>
> The native Beam UnboundedSource API supports resuming from checkpoint --
> that specifically happens here
> <https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L674>
>  when
> the KafkaCheckpointMark is non-null.
>
> The FlinkRunner should be providing the KafkaCheckpointMark from the most
> recent savepoint upon restore.
>
> There shouldn't be any "special" Flink runner support needed, nor is the
> State API involved.
>
> Dan
>
> On Tue, Mar 21, 2017 at 9:01 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Would not it be Flink runner specific ?
>>
>> Maybe the State API could do the same in a runner agnostic way (just
>> thinking loud) ?
>>
>> Regards
>> JB
>>
>> On 03/21/2017 04:56 PM, Mingmin Xu wrote:
>>
>>> From KafkaIO itself, looks like it either start_from_beginning or
>>> start_from_latest. It's designed to leverage
>>> `UnboundedSource.CheckpointMark`
>>> during initialization, but so far I don't see it's provided by runners.
>>> At the
>>> moment Flink savepoints is a good option, created a JIRA(BEAM-1775
>>> <https://issues.apache.org/jira/browse/BEAM-1775>)  to handle it in
>>> KafkaIO.
>>>
>>> Mingmin
>>>
>>> On Tue, Mar 21, 2017 at 3:40 AM, Aljoscha Krettek <aljos...@apache.org
>>> <mailto:aljos...@apache.org>> wrote:
>>>
>>> Hi,
>>> Are you using Flink savepoints [1] when restoring your application?
>>> If you
>>> use this the Kafka offset should be stored in state and it should
>>> restart
>>> from the correct position.
>>>
>>> Best,
>>> Aljoscha
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/
>>> setup/savepoints.html
>>> <https://ci.apache.org/projects/flink/flink-docs-release-1.3
>>> /setup/savepoints.html>
>>> > On 21 Mar 2017, at 01:50, Jins George <jins.geo...@aeris.net
>>> <mailto:jins.geo...@aeris.net>> wrote:
>>> >
>>> > Hello,
>>> >
>>> > I am writing a Beam pipeline(streaming) with Flink runner to
>>> consume data
>>> from Kafka and apply some transformations and persist to Hbase.
>>> >
>>> > If I restart the application ( due to failure/manual restart),
>>> consumer
>>> does not resume from the offset where it was prior to restart. It
>>> always
>>> resume from the latest offset.
>>> >
>>> > If I enable Flink checkpionting with hdfs state back-end, system
>>> appears
>>> to be resuming from the earliest offset
>>> >
>>> > Is there a recommended way to resume from the offset where it was
>>> stopped ?
>>> >
>>> > Thanks,
>>> > Jins George
>>>
>>>
>>>
>>>
>>> --
>>> 
>>> Mingmin
>>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>


-- 

Mingmin


Re: [ANNOUNCEMENT] New committers, March 2017 edition!

2017-03-17 Thread Mingmin Xu
Congratulations to all!

On Fri, Mar 17, 2017 at 2:29 PM, Jason Kuster <
jasonkus...@google.com.invalid> wrote:

> Congratulations to the new committers!
>
> On Fri, Mar 17, 2017 at 2:16 PM, Kenneth Knowles 
> wrote:
>
> > Congrats all!
> >
> > On Fri, Mar 17, 2017 at 2:13 PM, Davor Bonaci  wrote:
> >
> > > Please join me and the rest of Beam PMC in welcoming the following
> > > contributors as our newest committers. They have significantly
> > contributed
> > > to the project in different ways, and we look forward to many more
> > > contributions in the future.
> > >
> > > * Chamikara Jayalath
> > > Chamikara has been contributing to Beam since inception, and previously
> > to
> > > Google Cloud Dataflow, accumulating a total of 51 commits (8,301 ++ /
> > 3,892
> > > --) since February 2016 [1]. He contributed broadly to the project, but
> > > most significantly to the Python SDK, building the IO framework in this
> > SDK
> > > [2], [3].
> > >
> > > * Eugene Kirpichov
> > > Eugene has been contributing to Beam since inception, and previously to
> > > Google Cloud Dataflow, accumulating a total of 95 commits (22,122 ++ /
> > > 18,407 --) since February 2016 [1]. In recent months, he’s been driving
> > the
> > > Splittable DoFn effort [4]. A true expert on IO subsystem, Eugene has
> > > reviewed nearly every IO contributed to Beam. Finally, Eugene
> contributed
> > > the Beam Style Guide, and is championing it across the project.
> > >
> > > * Ismaël Mejia
> > > Ismaël has been contributing to Beam since mid-2016, accumulating a total
> > > of 35 commits (3,137 ++ / 1,328 --) [1]. He authored the HBaseIO
> > > connector, helped on the Spark runner, and contributed in other areas as
> > > well, including cross-project collaboration with Apache Zeppelin. Ismaël
> > > reported 24 Jira issues.
> > >
> > > * Aviem Zur
> > > Aviem has been contributing to Beam since early fall, accumulating a total
> > > of 49 commits (6,471 ++ / 3,185 --) [1]. He reported 43 Jira issues, and
> > > resolved ~30 of them. Aviem greatly improved the stability of the Spark
> > > runner, and introduced support for metrics. Finally, Aviem is championing
> > > dependency management across the project.
> > >
> > > Congratulations to all four! Welcome!
> > >
> > > Davor
> > >
> > > [1]
> > > https://github.com/apache/beam/graphs/contributors?from=2016-02-01&to=2017-03-17&type=c
> > > [2]
> > > https://github.com/apache/beam/blob/v0.6.0/sdks/python/apache_beam/io/iobase.py#L70
> > > [3]
> > > https://github.com/apache/beam/blob/v0.6.0/sdks/python/apache_beam/io/iobase.py#L561
> > > [4] https://s.apache.org/splittable-do-fn
> > >
> >
>
>
>
> --
> ---
> Jason Kuster
> Apache Beam / Google Cloud Dataflow
>



-- 

Mingmin


Re: [BEAM-301] Add a Beam SQL DSL

2017-02-28 Thread Mingmin Xu
Thanks @Tyler, @JB. The first question for me is how it will be used: as a SQL
prompt like StormSQL, or as a DSL API like Flink's. My project goes with the
first option, as it's targeted at self-service for analysts; it looks odd to me
to mix Java code and SQL strings.
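
To make the contrast concrete, the DSL style would embed the query in Java
roughly as in the sketch below, while the prompt style would accept the same
SELECT typed directly at a CLI. "BeamSql.query" and the Row element type are
hypothetical names for illustration only, since no Beam SQL API exists yet:

    // A sketch only: "BeamSql.query" is a hypothetical API, not an existing
    // Beam transform; "input" is a PCollection of SQL rows from any source.
    PCollection<Row> result = input.apply(
        BeamSql.query("SELECT id, amount FROM PCOLLECTION WHERE amount > 100"));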

Regarding the scope, that's a good point about finding a proper subset for the
first stage. The items I listed may be too much for phase 1, especially
GROUP-BY. More details will be added.
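
GROUP-BY is the hard part because an aggregation over an unbounded stream only
becomes well-defined once a window bounds it. In Beam terms, phase-1 grouping
support would need something like the sketch below; the windowing calls are the
real Beam Java API, while the SQL entry point is again a hypothetical name:

    // A streaming GROUP BY needs a window to terminate; here, hourly fixed
    // windows bound the per-key counts. "input" is the PCollection<Row> from
    // the previous sketch; Duration is org.joda.time.Duration.
    PCollection<Row> hourlyCounts = input
        .apply(Window.<Row>into(FixedWindows.of(Duration.standardHours(1))))
        .apply(BeamSql.query(
            "SELECT id, COUNT(*) AS cnt FROM PCOLLECTION GROUP BY id"));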

Btw, this doc mostly talks about streaming; there are already plenty of options
for running batch workloads with SQL.

Mingmin

On Tue, Feb 28, 2017 at 10:01 AM, Neelesh Salian <neeleshssal...@gmail.com>
wrote:

> Hi Mingmin,
>
> Thanks for writing it up.
> I haven't had the chance to start work on it.
> Happy to help with tasks for building it.
>
> Feel free to assign BEAM-301 to yourself.
>
>
>
> On Tue, Feb 28, 2017 at 9:38 AM, Tyler Akidau <taki...@google.com.invalid>
> wrote:
>
> > Hi Mingmin,
> >
> > Thanks for your interest in helping out on this task, and for your initial
> > proposal. I'm also very happy to work with you on this, and excited to see
> > some progress made here. Added a few more comments on the doc, but will
> > summarize them below as well.
> >
> > As far as the DSL point goes, I agree with JB that any sort of interface to
> > Beam that uses SQL will be creating a DSL. Having the initial interface be
> > an interactive SQL prompt is a perfectly valid approach, but at the end of
> > the day, there's still a DSL under the covers. As such, there are a lot of
> > questions that will need to be addressed in designing such a DSL (and the
> > Jira lists some resources discussing those already).
> >
> > That said, it's possible to make progress on a Beam DSL without addressing
> > them all (e.g., by tackling only a small subset of functionality first,
> > such as project and filter). But the current phases as listed in the doc
> > will require addressing some of the big ones.
> >
> > So a good first step might be trying to scope the proposal to have a more
> > modest initial set of functionality, or else providing more detail on how
> > you propose to address the issues that will come up with various features
> > currently listed in phase 1, particularly grouping w/ streams.
> >
> > -Tyler
> >
> > On Mon, Feb 27, 2017 at 10:44 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> > > Hi Mingmin,
> > >
> > > The idea is actually both:
> > >
> > > 1. an interactive SQL prompt where we can express a pipeline directly
> > > using SQL.
> > > 2. a SQL DSL to describe a pipeline in SQL and create the corresponding
> > > Java code under the hood.
> > >
> > > I provided a couple of comments on the doc. Ready and happy to help you on
> > > this (as I created the Jira ;)).
> > >
> > > Regards
> > > JB
> > >
> > > On 02/27/2017 10:33 PM, Mingmin Xu wrote:
> > > > Hello all,
> > > >
> > > > Would like to pop up this task, to see if there is any interest in
> > > > moving it forward.
> > > >
> > > > I have a project that runs SQL queries through an interactive
> > > > interface, and would like to share my ideas. A draft doc describes how
> > > > it works with Calcite. A slight difference from BEAM-301: I chose a
> > > > CLI interactive approach, not a SQL DSL.
> > > >
> > > > Doc link:
> > > >
> > > > https://docs.google.com/document/d/1Uc5xYTpO9qsLXtT38OfuoqSLimH_0a1Bz5BsCROMzCU/edit?usp=sharing
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>
>
>
> --
> Regards,
> Neelesh S. Salian
>



-- 

Mingmin


[BEAM-301] Add a Beam SQL DSL

2017-02-27 Thread Mingmin Xu
Hello all,

Would like to pop up this task, to see if there is any interest in moving it forward.

I have a project that runs SQL queries through an interactive interface, and
would like to share my ideas. A draft doc describes how it works with Calcite.
A slight difference from BEAM-301: I chose a CLI interactive approach, not a
SQL DSL.

Doc link:
https://docs.google.com/document/d/1Uc5xYTpO9qsLXtT38OfuoqSLimH_0a1Bz5BsCROMzCU/edit?usp=sharing

-- 

Mingmin Xu