Jenkins build is back to stable : beam_Release_NightlySnapshot #385

2017-04-11 Thread Apache Jenkins Server
See

HDFS and Google Cloud Storage

2017-04-11 Thread Shen Li
Hi,

Is there any reason why HDFS IO is implemented as a BoundedSource while
Google Cloud Storage is implemented as a scheme ("gs://") for TextIO? When
contributing a new IO connector, how can I determine whether it should be
implemented as a source transform or as a scheme for TextIO?

Thanks,

Shen


Re: HDFS and Google Cloud Storage

2017-04-11 Thread Jean-Baptiste Onofré

Hi Shen,

We are refactoring the file IO (IOChannelFactory). Thanks to this
refactoring, you will be able to use a scheme for HDFS (or S3, ...) with
different formats (Avro, text, Hadoop input format, ...).


It means that HdfsIO will be deprecated (to be removed at some point). I'm
working on a couple of PRs to leverage the new file IO layer.
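[Editor's note: the refactoring JB describes boils down to registering one
filesystem implementation per URI scheme, so format transforms like TextIO can
dispatch on the path instead of hard-coding a backend. A minimal sketch of the
idea, in plain Python rather than Beam's actual API; the registry and the
`LocalFS`/`HadoopFS` names are purely illustrative:]

```python
from urllib.parse import urlparse

class FileSystem:
    """Base class: one implementation per storage backend."""
    def open(self, path):
        raise NotImplementedError

class LocalFS(FileSystem):
    def open(self, path):
        return f"reading {path} from local disk"

class HadoopFS(FileSystem):
    def open(self, path):
        return f"reading {path} from HDFS"

# Scheme -> implementation registry. A TextIO-style transform would
# look up the backend here rather than being tied to one storage system.
REGISTRY = {"file": LocalFS(), "hdfs": HadoopFS()}

def open_path(path):
    # Paths with no scheme fall back to the local filesystem.
    scheme = urlparse(path).scheme or "file"
    return REGISTRY[scheme].open(path)

print(open_path("hdfs://namenode/logs/part-0"))
print(open_path("/tmp/input.txt"))
```

With this shape, adding S3 support is just one more registry entry; no format
transform has to change.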


Regards
JB


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com



Re: Apache Storm/JStorm Runner(s) for Apache Beam

2017-04-11 Thread Pei HE
Hi Taylor,
I am very glad to see the interest in pushing the Beam Storm runner forward.

However, I cannot convince myself of the benefits of having one runner to
support all.

Beam has three types of users: pipeline writers, library writers, and
runner implementers.

I see the pros and cons as follows:
Pros:
1. For pipeline writers and library writers, I don't see any benefits,
because they use the Beam API directly.
2. For runner implementers: (I am not that familiar with the current
similarities and differences between Storm and JStorm; maybe you can help
me fill this in.)

Cons:
For pipeline writers and library writers:
1. It means delaying delivery. We already have a working prototype, and
there are lots of JStorm users who eagerly want a JStorm API.
2. "One runner to support all" may increase complexity and compromise the
quality of the runner.

From my point of view, the cons clearly outweigh the pros unless I am
missing something.

Let me know what you think.
Thanks
--
Pei


On Tue, Apr 11, 2017 at 1:47 AM, P. Taylor Goetz  wrote:

> Note: cross-posting to dev@beam and dev@storm
>
> I’ve seen at least two threads on the dev@ list discussing the JStorm
> runner and my hope is we can expand on that discussion and cross-pollinate
> with the Storm/JStorm/Beam communities as well.
>
> A while back I created a very preliminary proof of concept of getting a
> Storm Beam runner working [1]. That was mainly an exercise for me to
> familiarize myself with the Beam API and discover what it would take to
> develop a Beam runner on top of Storm. That code is way out of date (I was
> targeting Beam’s HEAD before the 0.2.0 release, and a lot of changes have
> since taken place) and didn't really work, as Jian Liu pointed out. It was a
> start, that perhaps could be further built upon, or parts harvested, etc. I
> don’t have any particular attachment to that code and wouldn’t be upset if
> it were completely discarded in favor of a better or more extensible
> implementation.
>
> What I would like to see, and I think this is a great opportunity to do
> so, is a closer collaboration between the Apache Storm and JStorm
> communities. For those who aren’t familiar with those projects’
> relationship, I’ll start with a little history…
>
> JStorm began at Alibaba as a fork of Storm (pre-Apache?) with Storm’s
> Clojure code reimplemented in Java. The rationale behind that move was that
> Alibaba had a large number of Java developers but very few who were
> proficient with Clojure. Moving to pure Java made sense as it would expand
> the base of potential contributors.
>
> In late 2015 Alibaba donated the JStorm codebase to the Apache Storm
> project, and the Apache Storm PMC committed to converting its Clojure code
> to Java in order to incorporate the code donation. At the time there was
> one catch — Apache Storm had implemented comprehensive security features
> such as Kerberos authentication/authorization and multi-tenancy in its
> Clojure code, which greatly complicated the move to Java and incorporation
> of the JStorm code. JStorm did not have the same security features. A
> number of JStorm developers have also become Storm PMC members.
>
> Fast forward to today. The Storm community has completed the bulk of the
> move to Java and the next major release (presumably 2.0, which is currently
> under discussion) will be largely Java-based. We are now in a much better
> position to begin incorporating JStorm’s features, as well as implementing
> new features necessary to support the Beam API (such as support for bounded
> pipelines, among other features).
>
> Having separate Apache Storm and JStorm Beam runner implementations
> doesn’t feel appropriate in my personal opinion, especially since both
> projects have expressed an ongoing commitment to bringing JStorm’s
> additional features, and just as important, community, to Apache Storm.
>
> One final note: when the Storm community initially discussed developing a
> Beam runner, the general consensus was to do so within the Storm repository.
> My current thinking is that such an effort should take place within the
> Beam community, not only since that is the development pattern followed by
> other runner implementations (Flink, Apex, etc.), but also because it would
> serve to increase collaboration between Apache projects (always a good
> thing!).
>
> I would love to hear opinions from others in the Storm/JStorm/Beam
> communities.
>
> -Taylor


Re: Apache Storm/JStorm Runner(s) for Apache Beam

2017-04-11 Thread Kenneth Knowles
Hi Taylor,

Thanks immensely for taking the time to write such rich detail. I have a
lot to learn about the relationship between Storm and JStorm as software
and as communities.

Your final note I can immediately agree with and reinforce. The fruits of
this endeavor should reside in the Beam repository. It is good for all the
reasons you mention. Even more specifically: any runner gains tremendous
benefit from our automated testing, both for achieving maturity and for not
getting broken as the project evolves.

Kenn



Re: HDFS and Google Cloud Storage

2017-04-11 Thread Shen Li
Hi JB,

Thanks a lot for your response. Does it mean that all file-based IO will be
added as schemes using IOChannelFactory (or its new name, FileSystem), while
all others, e.g., HTTP, TCP, KV stores, databases, and message queues, should
be source/sink transforms?

Thanks,

Shen



Re: HDFS and Google Cloud Storage

2017-04-11 Thread Jean-Baptiste Onofré
Yes, FileSystem "plugins" will use a scheme. Other connectors will use (as is
already the case) DoFn/Source transforms.


Regards
JB


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: HDFS and Google Cloud Storage

2017-04-11 Thread Shen Li
Thanks!

Shen


Re: HDFS and Google Cloud Storage

2017-04-11 Thread Stephen Sisk
This is a great question! I filed
https://issues.apache.org/jira/browse/BEAM-1929 to update the I/O docs to
make sure they answer this.

S



Re: [PROPOSAL]: a new feature branch for SQL DSL

2017-04-11 Thread tarush grover
I wanted to know the active branch for the Beam SQL feature: is it BEAM-301,
where development for this feature will happen?

Regards,
Tarush

On Tue, 11 Apr 2017 at 10:49 AM, 陈竞  wrote:

> i just want to know what the SQL State API equivalent is for SQL, since
> beam has already support stateful processing using state DoFn
>
> 2017-04-11 2:12 GMT+08:00 Tyler Akidau :
>
> > 陈竞, what are you specifically curious about regarding state? Are you
> > wanting to know what the SQL State API equivalent is for SQL? Or are you
> > asking an operational question about where the state for a given SQL
> > pipeline will live?
> >
> > -Tyler
> >
> > On Sun, Apr 9, 2017 at 12:39 PM Mingmin Xu  wrote:
> >
> > > Thanks @JB, will come out the initial PR soon.
> > >
> > > On Sun, Apr 9, 2017 at 12:28 PM, Jean-Baptiste Onofré  wrote:
> > >
> > > > As discussed, I created the DSL_SQL branch with the skeleton. Mingmin
> > > > is rebasing on this branch to submit the PR.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 04/09/2017 08:02 PM, Mingmin Xu wrote:
> > > >
> > > > > State is not touched yet, welcome to add it.
> > > > >
> > > > > On Sun, Apr 9, 2017 at 2:40 AM, 陈竞  wrote:
> > > > >
> > > > > > how will this sql support state both in streaming and batch mode
> > > > > >
> > > > > > 2017-04-07 4:54 GMT+08:00 Mingmin Xu :
> > > > > >
> > > > > > > @Tyler, there's no big change in the previous design doc, I added
> > > > > > > some details in chapter 'Part 2. DML( [INSERT] SELECT )',
> > > > > > > describing steps to process a query, feel free to leave a comment.
> > > > > > >
> > > > > > > Come through your doc of 'EMIT', it's awesome from my perspective.
> > > > > > > I've some tests on GroupBy with default triggers/allowed_lateness
> > > > > > > now. EMIT syntax can be added to fill the gap.
> > > > > > >
> > > > > > > On Thu, Apr 6, 2017 at 1:04 PM, Tyler Akidau  wrote:
> > > > > > >
> > > > > > > > I'm very excited by this development as well, thanks for
> > > > > > > > continuing to push this forward, Mingmin. :-)
> > > > > > > >
> > > > > > > > I noticed you'd made some changes to your design doc
> > > > > > > > <https://docs.google.com/document/d/1Uc5xYTpO9qsLXtT38OfuoqSLimH_0a1Bz5BsCROMzCU/edit>.
> > > > > > > > Is it ready for another review? How reflective is it currently
> > > > > > > > of the work that going into the feature branch?
> > > > > > > >
> > > > > > > > In parallel, I'd also like to continue helping push forward the
> > > > > > > > definition of unified model semantics for SQL so we can get
> > > > > > > > Calcite to a point where it supports the full Beam model. I
> > > > > > > > added a comment
> > > > > > > > <...focusedCommentId=15959621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15959621>
> > > > > > > > on the JIRA suggesting I create a doc with a specification
> > > > > > > > proposal for EMIT (and any other necessary semantic changes)
> > > > > > > > that we can then iterate on in public with the Calcite folks. I
> > > > > > > > already have most of the content written (and there's a
> > > > > > > > significant amount of background needed to justify some aspects
> > > > > > > > of the proposal), so it'll mostly be a matter of pulling it all
> > > > > > > > together into something coherent. Does that sound reasonable to
> > > > > > > > everyone?
> > > > > > > >
> > > > > > > > -Tyler
> > > > > > > >
> > > > > > > > On Thu, Apr 6, 2017 at 10:26 AM Kenneth Knowles wrote:
> > > > > > > >
> > > > > > > > > Very cool! I'm really excited about this integration.
> > > > > > > > >
> > > > > > > > > On Thu, Apr 6, 2017 at 9:39 AM, Jean-Baptiste Onofré
> > > > > > > > > <j...@nanthrax.net> wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > Mingmin and I prepared a new branch to have the SQL DSL in
> > > > > > > > > > dsls/sql location.
> > > > > > > > > >
> > > > > > > > > > Any help is welcome !
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Regards
> > > > > > > > > > JB
> > > > > > > > > >
> > > > > > > > > > On 04/06/2017 06:36 PM, Mingmin Xu wrote:
> > > > > > > > > >
> > > > > > > > > > > @Tarush, you're very welcome to join the effort.
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Apr 6, 2017 at 7:22 AM, tarush grover
> > > > > > > > > > > <tarushappt...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > Can I be also part of this feature development.
> > > > > > > > > > > >
> > > > > > > > > > > > Regards,
> > > > > > > > > > > > Tarush Grover
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Apr 6, 2017 at 3:17 AM, Ted Yu  wrote:
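[Editor's note: for readers wondering what the "stateful processing using
state DoFn" mentioned in the quoted thread refers to: Beam's State API lets a
DoFn keep durable per-key state across elements, which is the kind of building
block a SQL aggregation (e.g. GROUP BY with a running SUM) could use. A
minimal sketch of the per-key-state idea in plain Python, not the Beam API;
`process_element` and the dict-backed state cell are illustrative only:]

```python
from collections import defaultdict

# One state cell per key, loosely mimicking Beam's per-key ValueState.
state = defaultdict(int)

def process_element(key, value):
    """Accumulate a running per-key sum, as a stateful DoFn might do
    for SELECT key, SUM(value) ... GROUP BY key."""
    state[key] += value          # read-modify-write of the key's state
    return (key, state[key])     # emit the updated running total

events = [("a", 1), ("b", 5), ("a", 2)]
outputs = [process_element(k, v) for k, v in events]
print(outputs)  # running sums, one output per input element
```

In Beam the runner, not a process-global dict, owns this state and scopes it
per key and window, which is what makes the same logic work in both streaming
and batch mode.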

Re: [PROPOSAL]: a new feature branch for SQL DSL

2017-04-11 Thread Mingmin Xu
It's not; you can use the feature branch
https://github.com/apache/beam/tree/DSL_SQL after
https://github.com/apache/beam/pull/2479 is merged. Stay tuned.


Re: [PROPOSAL]: a new feature branch for SQL DSL

2017-04-11 Thread tarush grover
Thanks Mingmin.

Regards,
Tarush


Re: [PROPOSAL]: a new feature branch for SQL DSL

2017-04-11 Thread Jean-Baptiste Onofré

Gonna merge tonight.

Regards
JB

On 04/11/2017 07:09 PM, Mingmin Xu wrote:

It's not, you can use the feature branch
https://github.com/apache/beam/tree/DSL_SQL after
https://github.com/apache/beam/pull/2479 is merged, stay tuned.

On Tue, Apr 11, 2017 at 10:00 AM, tarush grover 
wrote:


Wanted to know active branch for the beam SQL feature, is it beam - 301
where the development for this feature would happen?

Regards,
Tarush

On Tue, 11 Apr 2017 at 10:49 AM, 陈竞  wrote:

i just want to know what the SQL State API equivalent is for SQL, since
beam has already support stateful processing using state DoFn

2017-04-11 2:12 GMT+08:00 Tyler Akidau :

陈竞, what are you specifically curious about regarding state? Are you
wanting to know what the SQL State API equivalent is for SQL? Or are you
asking an operational question about where the state for a given SQL
pipeline will live?

-Tyler

On Sun, Apr 9, 2017 at 12:39 PM Mingmin Xu  wrote:

Thanks @JB, will come out the initial PR soon.

On Sun, Apr 9, 2017 at 12:28 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

As discussed, I created the DSL_SQL branch with the skeleton. Mingmin is
rebasing on this branch to submit the PR.

Regards
JB

On 04/09/2017 08:02 PM, Mingmin Xu wrote:

State is not touched yet, welcome to add it.

On Sun, Apr 9, 2017 at 2:40 AM, 陈竞  wrote:

how will this sql support state both in streaming and batch mode

2017-04-07 4:54 GMT+08:00 Mingmin Xu :

@Tyler, there's no big change in the previous design doc, I added some
details in chapter 'Part 2. DML( [INSERT] SELECT )', describing steps to
process a query, feel free to leave a comment.

Come through your doc of 'EMIT', it's awesome from my perspective. I've
some tests on GroupBy with default triggers/allowed_lateness now. EMIT
syntax can be added to fill the gap.

On Thu, Apr 6, 2017 at 1:04 PM, Tyler Akidau <taki...@apache.org> wrote:

I'm very excited by this development as well, thanks for continuing to
push this forward, Mingmin. :-)

I noticed you'd made some changes to your design doc
<https://docs.google.com/document/d/1Uc5xYTpO9qsLXtT38OfuoqSLimH_0a1Bz5BsCROMzCU/edit>.
Is it ready for another review? How reflective is it currently of the
work that going into the feature branch?

In parallel, I'd also like to continue helping push forward the
definition of unified model semantics for SQL so we can get Calcite to a
point where it supports the full Beam model. I added a comment on the
JIRA suggesting I create a doc with a specification proposal for EMIT
(and any other necessary semantic changes) that we can then iterate on in
public with the Calcite folks. I already have most of the content written
(and there's a significant amount of background needed to justify some
aspects of the proposal), so it'll mostly be a matter of pulling it all
together into something coherent. Does that sound reasonable to everyone?

-Tyler

On Thu, Apr 6, 2017 at 10:26 AM Kenneth Knowles wrote:

Very cool! I'm really excited about this integration.

On Thu, Apr 6, 2017 at 9:39 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi,

Mingmin and I prepared a new branch to have the SQL DSL in dsls/sql
location.

Any help is welcome !

Thanks,
Regards
JB

On 04/06/2017 06:36 PM, Mingmin Xu wrote:

@Tarush, you're very welcome to join the effort.

On Thu, Apr 6, 2017 at 7:22 AM, tarush grover <tarushappt...@gmail.com> wrote:

Hi,

Can I be also part of this feature development.

Regards,
Tarush Grover

On Thu, Apr 6, 2017 at 3:17 AM, Ted Yu <yuzhih...@gmail.com> wrote:

I compiled BEAM-301 branch with calcite 1.12 - passed.

Julian tries to not break existing things, but he will if there's a
reason to do so :-)

On Wed, Apr 5, 2017 at 2:36 PM, Mingmin Xu <mingm...@gmail.com> wrote:

@Ted, thanks for the note. I intend to stick with one version, Beam 0.6.0
and Calcite 1.11 so far, unless impacted by API change. Before it's
merged back to master, will upgrade to the latest version.

On Wed, Apr 5, 2017 at 2:14 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Working in feature branch is good - you may want to periodically sync up
with master.

I noticed that you are using 1.11.0 of calcite.
1.12 is out, FYI

On Wed, Apr 5, 2017 at 2:05 PM, Mingmin Xu <mingm...@gmail.com> wrote:

Hi all,

I'm working on https://issues.apache.org/jira/browse/BEAM-301 (Add a Beam
SQL DSL). The skeleton is already in
https://github.com/XuMingmin/beam/tree/BEAM-301, using Java SDK in the
back-end. The goal is to provide a SQL interface over Beam, based on
Calcite, including:
1). a translator to creat

Re: [PROPOSAL]: a new feature branch for SQL DSL

2017-04-11 Thread Tyler Akidau
Hi 陈竞,

I'm doubtful there will be an explicit equivalent of the State API in SQL,
at least not in the SQL portion of the DSL itself (it might make sense to
expose one within UDFs). The State API is an imperative interface for
accessing an underlying persistent state table, whereas SQL operates more
functionally. There's no good way I'm aware of to expose the
characteristics provided by the State API (logic-driven, fine- and
coarse-grained reads/writes of potentially multiple fields of state
utilizing potentially multiple data types) in raw SQL cleanly.

On the upside, SQL has the advantage of making it very easy to materialize
new state tables very naturally. In the proposal I'll be sharing for how I
think we should integrate streaming into SQL robustly, any time you perform
some grouping operation (GROUP BY, JOIN, CUBE, etc) you're transforming
your stream into a table. That table is effectively a persistent state
table. So there exists a large suite of functionality in standard SQL that
gives you a lot of powerful tools for creating state.

It may also be possible for the different access patterns of more
complicated data structures (e.g., bags or lists) to be captured by
different data types supported by the underlying systems. But I don't
expect there to be an imperative State access API built into SQL itself.

All that said, I'm curious to hear ideas otherwise if anyone has them. :-)

-Tyler

On Mon, Apr 10, 2017 at 10:19 PM 陈竞  wrote:

> i just want to know what the SQL State API equivalent is for SQL, since
> beam has already support stateful processing using state DoFn
>
> 2017-04-11 2:12 GMT+08:00 Tyler Akidau :
>
> > 陈竞, what are you specifically curious about regarding state? Are you
> > wanting to know what the SQL State API equivalent is for SQL? Or are you
> > asking an operational question about where the state for a given SQL
> > pipeline will live?
> >
> > -Tyler
> >
> >
> > On Sun, Apr 9, 2017 at 12:39 PM Mingmin Xu  wrote:
> >
> > > Thanks @JB, will come out the initial PR soon.
> > >
> > > On Sun, Apr 9, 2017 at 12:28 PM, Jean-Baptiste Onofré  >
> > > wrote:
> > >
> > > > As discussed, I created the DSL_SQL branch with the skeleton. Mingmin
> > is
> > > > rebasing on this branch to submit the PR.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > >
> > > > On 04/09/2017 08:02 PM, Mingmin Xu wrote:
> > > >
> > > >> State is not touched yet, welcome to add it.
> > > >>
> > > >> On Sun, Apr 9, 2017 at 2:40 AM, 陈竞  wrote:
> > > >>
> > > >> how will this sql support state both in streaming and batch mode
> > > >>>
> > > >>> 2017-04-07 4:54 GMT+08:00 Mingmin Xu :
> > > >>>
> > > >>> @Tyler, there's no big change in the previous design doc, I added
> > some
> > >  details in chapter 'Part 2. DML( [INSERT] SELECT )' , describing
> > steps
> > >  to
> > >  process a query, feel free to leave a comment.
> > > 
> > >  Come through your doc of 'EMIT', it's awesome from my perspective.
> > > I've
> > >  some tests on GroupBy with default triggers/allowed_lateness now.
> > EMIT
> > >  syntax can be added to fill the gap.
> > > 
> > >  On Thu, Apr 6, 2017 at 1:04 PM, Tyler Akidau 
> > >  wrote:
> > > 
> > >  I'm very excited by this development as well, thanks for
> continuing
> > to
> > > >
> > >  push
> > > 
> > > > this forward, Mingmin. :-)
> > > >
> > > > I noticed you'd made some changes to your design doc
> > > > <https://docs.google.com/document/d/1Uc5xYTpO9qsLXtT38OfuoqSLimH_0a1Bz5BsCROMzCU/edit>.
> > > > Is it ready for another review? How reflective is it currently of the
> > > > work that going into the feature branch?
> > > >
> > > > In parallel, I'd also like to continue helping push forward the
> > > > definition of unified model semantics for SQL so we can get Calcite to
> > > > a point where it supports the full Beam model. I added a comment
> > > > <focusedCommentId=15959621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15959621>
> > > > on the JIRA suggesting I create a doc with a specification proposal
> > > > for EMIT (and any other necessary semantic changes) that we can then
> > > > iterate on in public with the Calcite folks. I already have most of
> > > > the content written (and there's a significant amount of background
> > > > needed to justify some aspects of the proposal), so it'll mostly be a
> > > > matter of pulling it all together into something coherent. Does that
> > > > sound reasonable to everyone?
> > > >
> > > > -Tyler
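Tyler's point above — that a grouping operation like GROUP BY effectively materializes a persistent state table — can be illustrated with ordinary SQL outside of Beam. The sketch below uses Python's built-in sqlite3 module; the events table and its data are invented for illustration, and this is neither Beam SQL nor the Beam State API:

```python
import sqlite3

# Toy illustration: a stream of per-key events is folded into a
# "state table" by a GROUP BY aggregation. Table and data are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (key TEXT, value INTEGER)")
stream = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
conn.executemany("INSERT INTO events VALUES (?, ?)", stream)

# The grouped result behaves like persistent per-key state:
# each row is the current accumulated value for one key.
state = dict(
    conn.execute("SELECT key, SUM(value) FROM events GROUP BY key")
)
print(state)  # {'a': 9, 'b': 6}
```

The grouped rows play the role of per-key state: re-running the query after more inserts yields the updated accumulators, which is the sense in which standard SQL grouping "creates state" without an imperative State API.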

Re: Python build artifacts seem to be misconfigured

2017-04-11 Thread Robert Bradshaw
We should also ignore them: https://github.com/apache/beam/pull/2494

On Thu, Apr 6, 2017 at 6:45 PM, Kenneth Knowles  wrote:
> Thanks for the pointer. I'll dig in to tox docs to see why this isn't
> happening. Probably something to do with unclean shutdowns.
>
> On Thu, Apr 6, 2017 at 6:10 PM, Vikas RK  wrote:
>
>> Those are cython generated files that should be deleted according to
>> https://github.com/apache/beam/blob/master/sdks/python/tox.ini#L54
>>
>>
>>
>> On 6 April 2017 at 17:58, Kenneth Knowles  wrote:
>>
>> > Hi all,
>> >
>> > It appears that the Python build process creates quite a few files that
>> are
>> > not accounted for in our .gitignore and that also trip the RAT check next
>> > time around. These should be set up so that RAT and git both ignore the
>> > files.
>> >
>> > It is possible that others have defaults that differ from mine, but
>> > droppings from a recent `mvn verify` include:
>> >
>> > sdks/python/apache_beam/coders/coder_impl.c
>> > sdks/python/apache_beam/coders/coder_impl.so
>> > sdks/python/apache_beam/coders/stream.c
>> > sdks/python/apache_beam/coders/stream.so
>> > sdks/python/apache_beam/metrics/execution.c
>> > sdks/python/apache_beam/metrics/execution.so
>> > sdks/python/apache_beam/runners/common.c
>> > sdks/python/apache_beam/runners/common.so
>> > sdks/python/apache_beam/transforms/cy_combiners.c
>> > sdks/python/apache_beam/transforms/cy_combiners.so
>> > sdks/python/apache_beam/utils/counters.c
>> > sdks/python/apache_beam/utils/counters.so
>> > sdks/python/apache_beam/utils/windowed_value.c
>> > sdks/python/apache_beam/utils/windowed_value.so
>> > sdks/python/nose-1.3.7-py2.7.egg/
>> >
>> > Can someone who knows the Python SDK build process rectify?
>> >
>> > Kenn
>> >
>>
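For illustration only, ignore rules covering the generated files listed above could look like the following sketch (the patterns actually adopted in the PR linked at the top of this thread may differ):

```
# Cython-generated C sources and compiled extension modules
sdks/python/**/*.c
sdks/python/**/*.so

# Eggs fetched during the build (e.g. nose)
sdks/python/*.egg/
```

The same patterns would typically be mirrored in the RAT exclusion list so that license checks also skip generated files.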


Renaming SideOutput

2017-04-11 Thread Thomas Groh
Hey everyone:

I'd like to rename DoFn.Context#sideOutput to #output (in the Java SDK).

Having two methods, both named output, one which takes the "main output
type" and one that takes a tag to specify the type more clearly
communicates the actual behavior - sideOutput isn't a "special" way to
output, it's the same as output(T), just to a specified PCollection. This
will help pipeline authors understand the actual behavior of outputting to
a tag, and detangle it from "sideInput", which is a special way to receive
input. Giving them the same name means that it's not even strange to call
output and provide the main output type, which is what we want - it's a
more specific way to output, but does not have different restrictions or
capabilities.

This is also a pretty small change within the SDK - it touches about 20
files, and the changes are pretty automatic.

Thanks,

Thomas
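As a toy model of the proposed semantics (plain Python, not the Beam SDK — the process/run helpers and the "odds" tag are invented for illustration), outputting to a tag is just a more specific form of the single output call:

```python
from collections import defaultdict

MAIN = "main"  # tag implicitly used by the plain output(value) form

def process(element, out):
    """Toy DoFn: emit to the main output, or to a tag, via one method."""
    out(element * 2)                 # output(T): goes to the main output
    if element % 2:
        out(element, tag="odds")     # output(tag, T): same call, tagged

def run(elements):
    outputs = defaultdict(list)
    def out(value, tag=MAIN):
        outputs[tag].append(value)
    for e in elements:
        process(e, out)
    return dict(outputs)

result = run([1, 2, 3])
print(result)  # {'main': [2, 4, 6], 'odds': [1, 3]}
```

The point mirrors the proposal: there is one way to output, and a tag merely selects which output collection (here, which list) receives the value.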


Re: Renaming SideOutput

2017-04-11 Thread Stephen Sisk
strong +1 for changing the name away from sideOutput - the fact that
sideInput and sideOutput are not really related was definitely a source of
confusion for me when learning beam.

S

On Tue, Apr 11, 2017 at 1:56 PM Thomas Groh 
wrote:

> Hey everyone:
>
> I'd like to rename DoFn.Context#sideOutput to #output (in the Java SDK).
>
> Having two methods, both named output, one which takes the "main output
> type" and one that takes a tag to specify the type more clearly
> communicates the actual behavior - sideOutput isn't a "special" way to
> output, it's the same as output(T), just to a specified PCollection. This
> will help pipeline authors understand the actual behavior of outputting to
> a tag, and detangle it from "sideInput", which is a special way to receive
> input. Giving them the same name means that it's not even strange to call
> output and provide the main output type, which is what we want - it's a
> more specific way to output, but does not have different restrictions or
> capabilities.
>
> This is also a pretty small change within the SDK - it touches about 20
> files, and the changes are pretty automatic.
>
> Thanks,
>
> Thomas
>


Re: Renaming SideOutput

2017-04-11 Thread Robert Bradshaw
+1, I think this is a lot clearer.

On Tue, Apr 11, 2017 at 2:24 PM, Stephen Sisk  wrote:
> strong +1 for changing the name away from sideOutput - the fact that
> sideInput and sideOutput are not really related was definitely a source of
> confusion for me when learning beam.
>
> S
>
> On Tue, Apr 11, 2017 at 1:56 PM Thomas Groh 
> wrote:
>
>> Hey everyone:
>>
>> I'd like to rename DoFn.Context#sideOutput to #output (in the Java SDK).
>>
>> Having two methods, both named output, one which takes the "main output
>> type" and one that takes a tag to specify the type more clearly
>> communicates the actual behavior - sideOutput isn't a "special" way to
>> output, it's the same as output(T), just to a specified PCollection. This
>> will help pipeline authors understand the actual behavior of outputting to
>> a tag, and detangle it from "sideInput", which is a special way to receive
>> input. Giving them the same name means that it's not even strange to call
>> output and provide the main output type, which is what we want - it's a
>> more specific way to output, but does not have different restrictions or
>> capabilities.
>>
>> This is also a pretty small change within the SDK - it touches about 20
>> files, and the changes are pretty automatic.
>>
>> Thanks,
>>
>> Thomas
>>


Public in-progress I/O Transform list

2017-04-11 Thread Stephen Sisk
Hi!

We occasionally get questions about whether or not an I/O is planned to be
added to Beam. I've added a list of known in-progress I/O Transforms to the
"Built-in Transforms" page (where in-progress is defined as "has a JIRA
issue"). The Built-In Transforms page is the publicly visible list of I/O
transforms, so it's a natural place to let users know what's also
in-progress.

You can find the current list here:
https://beam.apache.org/documentation/io/built-in/

I may have missed one or two I/Os (there's no good way to find them in JIRA
as far as I can tell), so if you're working on an I/O Transform that you
intend to contribute to beam that's not listed, feel free to send a PR
adding it to the list.

Thanks,
Stephen


Re: Renaming SideOutput

2017-04-11 Thread Kenneth Knowles
+1 ditto about sideInput and sideOutput not actually being related

On Tue, Apr 11, 2017 at 3:52 PM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> +1, I think this is a lot clearer.
>
> On Tue, Apr 11, 2017 at 2:24 PM, Stephen Sisk 
> wrote:
> > strong +1 for changing the name away from sideOutput - the fact that
> > sideInput and sideOutput are not really related was definitely a source
> of
> > confusion for me when learning beam.
> >
> > S
> >
> > On Tue, Apr 11, 2017 at 1:56 PM Thomas Groh 
> > wrote:
> >
> >> Hey everyone:
> >>
> >> I'd like to rename DoFn.Context#sideOutput to #output (in the Java SDK).
> >>
> >> Having two methods, both named output, one which takes the "main output
> >> type" and one that takes a tag to specify the type more clearly
> >> communicates the actual behavior - sideOutput isn't a "special" way to
> >> output, it's the same as output(T), just to a specified PCollection.
> This
> >> will help pipeline authors understand the actual behavior of outputting
> to
> >> a tag, and detangle it from "sideInput", which is a special way to
> receive
> >> input. Giving them the same name means that it's not even strange to
> call
> >> output and provide the main output type, which is what we want - it's a
> >> more specific way to output, but does not have different restrictions or
> >> capabilities.
> >>
> >> This is also a pretty small change within the SDK - it touches about 20
> >> files, and the changes are pretty automatic.
> >>
> >> Thanks,
> >>
> >> Thomas
> >>
>


Re: Renaming SideOutput

2017-04-11 Thread Robert Bradshaw
We should do some renaming in Python too. Right now we have
SideOutputValue which I'd propose naming TaggedOutput or something
like that.

Should the docs change too?
https://beam.apache.org/documentation/programming-guide/#transforms-sideio

On Tue, Apr 11, 2017 at 5:25 PM, Kenneth Knowles  
wrote:
> +1 ditto about sideInput and sideOutput not actually being related
>
> On Tue, Apr 11, 2017 at 3:52 PM, Robert Bradshaw <
> rober...@google.com.invalid> wrote:
>
>> +1, I think this is a lot clearer.
>>
>> On Tue, Apr 11, 2017 at 2:24 PM, Stephen Sisk 
>> wrote:
>> > strong +1 for changing the name away from sideOutput - the fact that
>> > sideInput and sideOutput are not really related was definitely a source
>> of
>> > confusion for me when learning beam.
>> >
>> > S
>> >
>> > On Tue, Apr 11, 2017 at 1:56 PM Thomas Groh 
>> > wrote:
>> >
>> >> Hey everyone:
>> >>
>> >> I'd like to rename DoFn.Context#sideOutput to #output (in the Java SDK).
>> >>
>> >> Having two methods, both named output, one which takes the "main output
>> >> type" and one that takes a tag to specify the type more clearly
>> >> communicates the actual behavior - sideOutput isn't a "special" way to
>> >> output, it's the same as output(T), just to a specified PCollection.
>> This
>> >> will help pipeline authors understand the actual behavior of outputting
>> to
>> >> a tag, and detangle it from "sideInput", which is a special way to
>> receive
>> >> input. Giving them the same name means that it's not even strange to
>> call
>> >> output and provide the main output type, which is what we want - it's a
>> >> more specific way to output, but does not have different restrictions or
>> >> capabilities.
>> >>
>> >> This is also a pretty small change within the SDK - it touches about 20
>> >> files, and the changes are pretty automatic.
>> >>
>> >> Thanks,
>> >>
>> >> Thomas
>> >>
>>


Re: Renaming SideOutput

2017-04-11 Thread Thomas Groh
I think that's a good idea. I would call the outputs of a ParDo the "Main
Output" and "Additional Outputs" - it seems like an easy way to make it
clear that there's one output that is always expected, and there may be
more.

On Tue, Apr 11, 2017 at 5:29 PM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> We should do some renaming in Python too. Right now we have
> SideOutputValue which I'd propose naming TaggedOutput or something
> like that.
>
> Should the docs change too?
> https://beam.apache.org/documentation/programming-guide/#transforms-sideio
>
> On Tue, Apr 11, 2017 at 5:25 PM, Kenneth Knowles 
> wrote:
> > +1 ditto about sideInput and sideOutput not actually being related
> >
> > On Tue, Apr 11, 2017 at 3:52 PM, Robert Bradshaw <
> > rober...@google.com.invalid> wrote:
> >
> >> +1, I think this is a lot clearer.
> >>
> >> On Tue, Apr 11, 2017 at 2:24 PM, Stephen Sisk 
> >> wrote:
> >> > strong +1 for changing the name away from sideOutput - the fact that
> >> > sideInput and sideOutput are not really related was definitely a
> source
> >> of
> >> > confusion for me when learning beam.
> >> >
> >> > S
> >> >
> >> > On Tue, Apr 11, 2017 at 1:56 PM Thomas Groh  >
> >> > wrote:
> >> >
> >> >> Hey everyone:
> >> >>
> >> >> I'd like to rename DoFn.Context#sideOutput to #output (in the Java
> SDK).
> >> >>
> >> >> Having two methods, both named output, one which takes the "main
> output
> >> >> type" and one that takes a tag to specify the type more clearly
> >> >> communicates the actual behavior - sideOutput isn't a "special" way
> to
> >> >> output, it's the same as output(T), just to a specified PCollection.
> >> This
> >> >> will help pipeline authors understand the actual behavior of
> outputting
> >> to
> >> >> a tag, and detangle it from "sideInput", which is a special way to
> >> receive
> >> >> input. Giving them the same name means that it's not even strange to
> >> call
> >> >> output and provide the main output type, which is what we want -
> it's a
> >> >> more specific way to output, but does not have different
> restrictions or
> >> >> capabilities.
> >> >>
> >> >> This is also a pretty small change within the SDK - it touches about
> 20
> >> >> files, and the changes are pretty automatic.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Thomas
> >> >>
> >>
>


Re: Renaming SideOutput

2017-04-11 Thread Aljoscha Krettek
+1

On Wed, Apr 12, 2017, at 02:34, Thomas Groh wrote:
> I think that's a good idea. I would call the outputs of a ParDo the "Main
> Output" and "Additional Outputs" - it seems like an easy way to make it
> clear that there's one output that is always expected, and there may be
> more.
> 
> On Tue, Apr 11, 2017 at 5:29 PM, Robert Bradshaw <
> rober...@google.com.invalid> wrote:
> 
> > We should do some renaming in Python too. Right now we have
> > SideOutputValue which I'd propose naming TaggedOutput or something
> > like that.
> >
> > Should the docs change too?
> > https://beam.apache.org/documentation/programming-guide/#transforms-sideio
> >
> > On Tue, Apr 11, 2017 at 5:25 PM, Kenneth Knowles 
> > wrote:
> > > +1 ditto about sideInput and sideOutput not actually being related
> > >
> > > On Tue, Apr 11, 2017 at 3:52 PM, Robert Bradshaw <
> > > rober...@google.com.invalid> wrote:
> > >
> > >> +1, I think this is a lot clearer.
> > >>
> > >> On Tue, Apr 11, 2017 at 2:24 PM, Stephen Sisk 
> > >> wrote:
> > >> > strong +1 for changing the name away from sideOutput - the fact that
> > >> > sideInput and sideOutput are not really related was definitely a
> > source
> > >> of
> > >> > confusion for me when learning beam.
> > >> >
> > >> > S
> > >> >
> > >> > On Tue, Apr 11, 2017 at 1:56 PM Thomas Groh  > >
> > >> > wrote:
> > >> >
> > >> >> Hey everyone:
> > >> >>
> > >> >> I'd like to rename DoFn.Context#sideOutput to #output (in the Java
> > SDK).
> > >> >>
> > >> >> Having two methods, both named output, one which takes the "main
> > output
> > >> >> type" and one that takes a tag to specify the type more clearly
> > >> >> communicates the actual behavior - sideOutput isn't a "special" way
> > to
> > >> >> output, it's the same as output(T), just to a specified PCollection.
> > >> This
> > >> >> will help pipeline authors understand the actual behavior of
> > outputting
> > >> to
> > >> >> a tag, and detangle it from "sideInput", which is a special way to
> > >> receive
> > >> >> input. Giving them the same name means that it's not even strange to
> > >> call
> > >> >> output and provide the main output type, which is what we want -
> > it's a
> > >> >> more specific way to output, but does not have different
> > restrictions or
> > >> >> capabilities.
> > >> >>
> > >> >> This is also a pretty small change within the SDK - it touches about
> > 20
> > >> >> files, and the changes are pretty automatic.
> > >> >>
> > >> >> Thanks,
> > >> >>
> > >> >> Thomas
> > >> >>
> > >>
> >


Re: Renaming SideOutput

2017-04-11 Thread Ankur Chauhan
+1 this is pretty much the topmost thing that I found odd when starting with
the Beam model. It would definitely be more intuitive to have a consistent
name.

Sent from my iPhone

> On Apr 11, 2017, at 18:29, Aljoscha Krettek  wrote:
> 
> +1
> 
>> On Wed, Apr 12, 2017, at 02:34, Thomas Groh wrote:
>> I think that's a good idea. I would call the outputs of a ParDo the "Main
>> Output" and "Additional Outputs" - it seems like an easy way to make it
>> clear that there's one output that is always expected, and there may be
>> more.
>> 
>> On Tue, Apr 11, 2017 at 5:29 PM, Robert Bradshaw <
>> rober...@google.com.invalid> wrote:
>> 
>>> We should do some renaming in Python too. Right now we have
>>> SideOutputValue which I'd propose naming TaggedOutput or something
>>> like that.
>>> 
>>> Should the docs change too?
>>> https://beam.apache.org/documentation/programming-guide/#transforms-sideio
>>> 
>>> On Tue, Apr 11, 2017 at 5:25 PM, Kenneth Knowles 
>>> wrote:
 +1 ditto about sideInput and sideOutput not actually being related
 
 On Tue, Apr 11, 2017 at 3:52 PM, Robert Bradshaw <
 rober...@google.com.invalid> wrote:
 
> +1, I think this is a lot clearer.
> 
> On Tue, Apr 11, 2017 at 2:24 PM, Stephen Sisk 
> wrote:
>> strong +1 for changing the name away from sideOutput - the fact that
>> sideInput and sideOutput are not really related was definitely a
>>> source
> of
>> confusion for me when learning beam.
>> 
>> S
>> 
>> On Tue, Apr 11, 2017 at 1:56 PM Thomas Groh >>> 
>> wrote:
>> 
>>> Hey everyone:
>>> 
>>> I'd like to rename DoFn.Context#sideOutput to #output (in the Java
>>> SDK).
>>> 
>>> Having two methods, both named output, one which takes the "main
>>> output
>>> type" and one that takes a tag to specify the type more clearly
>>> communicates the actual behavior - sideOutput isn't a "special" way
>>> to
>>> output, it's the same as output(T), just to a specified PCollection.
> This
>>> will help pipeline authors understand the actual behavior of
>>> outputting
> to
>>> a tag, and detangle it from "sideInput", which is a special way to
> receive
>>> input. Giving them the same name means that it's not even strange to
> call
>>> output and provide the main output type, which is what we want -
>>> it's a
>>> more specific way to output, but does not have different
>>> restrictions or
>>> capabilities.
>>> 
>>> This is also a pretty small change within the SDK - it touches about
>>> 20
>>> files, and the changes are pretty automatic.
>>> 
>>> Thanks,
>>> 
>>> Thomas
>>> 
> 
>>> 


Re: Renaming SideOutput

2017-04-11 Thread Tang Jijun (上海_技术部_数据平台_唐觊隽)
+1, much clearer


-----Original Message-----
From: Ankur Chauhan [mailto:an...@malloc64.com]
Sent: April 12, 2017 10:36
To: dev@beam.apache.org
Subject: Re: Renaming SideOutput

+1 this is pretty much the topmost things that I found odd when starting with 
the beam model. It would definitely be more intuitive to have a consistent 
name. 

Sent from my iPhone

> On Apr 11, 2017, at 18:29, Aljoscha Krettek  wrote:
> 
> +1
> 
>> On Wed, Apr 12, 2017, at 02:34, Thomas Groh wrote:
>> I think that's a good idea. I would call the outputs of a ParDo the 
>> "Main Output" and "Additional Outputs" - it seems like an easy way to 
>> make it clear that there's one output that is always expected, and 
>> there may be more.
>> 
>> On Tue, Apr 11, 2017 at 5:29 PM, Robert Bradshaw < 
>> rober...@google.com.invalid> wrote:
>> 
>>> We should do some renaming in Python too. Right now we have 
>>> SideOutputValue which I'd propose naming TaggedOutput or something 
>>> like that.
>>> 
>>> Should the docs change too?
>>> https://beam.apache.org/documentation/programming-guide/#transforms-
>>> sideio
>>> 
>>> On Tue, Apr 11, 2017 at 5:25 PM, Kenneth Knowles 
>>> 
>>> wrote:
 +1 ditto about sideInput and sideOutput not actually being related
 
 On Tue, Apr 11, 2017 at 3:52 PM, Robert Bradshaw < 
 rober...@google.com.invalid> wrote:
 
> +1, I think this is a lot clearer.
> 
> On Tue, Apr 11, 2017 at 2:24 PM, Stephen Sisk 
> 
> wrote:
>> strong +1 for changing the name away from sideOutput - the fact 
>> that sideInput and sideOutput are not really related was 
>> definitely a
>>> source
> of
>> confusion for me when learning beam.
>> 
>> S
>> 
>> On Tue, Apr 11, 2017 at 1:56 PM Thomas Groh 
>> >>> 
>> wrote:
>> 
>>> Hey everyone:
>>> 
>>> I'd like to rename DoFn.Context#sideOutput to #output (in the 
>>> Java
>>> SDK).
>>> 
>>> Having two methods, both named output, one which takes the "main
>>> output
>>> type" and one that takes a tag to specify the type more clearly 
>>> communicates the actual behavior - sideOutput isn't a "special" 
>>> way
>>> to
>>> output, it's the same as output(T), just to a specified PCollection.
> This
>>> will help pipeline authors understand the actual behavior of
>>> outputting
> to
>>> a tag, and detangle it from "sideInput", which is a special way 
>>> to
> receive
>>> input. Giving them the same name means that it's not even 
>>> strange to
> call
>>> output and provide the main output type, which is what we want -
>>> it's a
>>> more specific way to output, but does not have different
>>> restrictions or
>>> capabilities.
>>> 
>>> This is also a pretty small change within the SDK - it touches 
>>> about
>>> 20
>>> files, and the changes are pretty automatic.
>>> 
>>> Thanks,
>>> 
>>> Thomas
>>> 
> 
>>> 


Re: Renaming SideOutput

2017-04-11 Thread JingsongLee
strong +1
best,
JingsongLee

------------------------------------------------------------------
From: Tang Jijun (上海_技术部_数据平台_唐觊隽)
Time: 2017 Apr 12 (Wed) 10:39
To: dev@beam.apache.org
Subject: Re: Renaming SideOutput

+1, much clearer


-----Original Message-----
From: Ankur Chauhan [mailto:an...@malloc64.com]
Sent: April 12, 2017 10:36
To: dev@beam.apache.org
Subject: Re: Renaming SideOutput

+1 this is pretty much the topmost things that I found odd when starting with 
the beam model. It would definitely be more intuitive to have a consistent 
name. 

Sent from my iPhone

> On Apr 11, 2017, at 18:29, Aljoscha Krettek  wrote:
> 
> +1
> 
>> On Wed, Apr 12, 2017, at 02:34, Thomas Groh wrote:
>> I think that's a good idea. I would call the outputs of a ParDo the 
>> "Main Output" and "Additional Outputs" - it seems like an easy way to 
>> make it clear that there's one output that is always expected, and 
>> there may be more.
>> 
>> On Tue, Apr 11, 2017 at 5:29 PM, Robert Bradshaw < 
>> rober...@google.com.invalid> wrote:
>> 
>>> We should do some renaming in Python too. Right now we have 
>>> SideOutputValue which I'd propose naming TaggedOutput or something 
>>> like that.
>>> 
>>> Should the docs change too?
>>> https://beam.apache.org/documentation/programming-guide/#transforms-
>>> sideio
>>> 
>>> On Tue, Apr 11, 2017 at 5:25 PM, Kenneth Knowles 
>>> 
>>> wrote:
 +1 ditto about sideInput and sideOutput not actually being related
 
 On Tue, Apr 11, 2017 at 3:52 PM, Robert Bradshaw < 
 rober...@google.com.invalid> wrote:
 
> +1, I think this is a lot clearer.
> 
> On Tue, Apr 11, 2017 at 2:24 PM, Stephen Sisk 
> 
> wrote:
>> strong +1 for changing the name away from sideOutput - the fact 
>> that sideInput and sideOutput are not really related was 
>> definitely a
>>> source
> of
>> confusion for me when learning beam.
>> 
>> S
>> 
>> On Tue, Apr 11, 2017 at 1:56 PM Thomas Groh 
>> >>> 
>> wrote:
>> 
>>> Hey everyone:
>>> 
>>> I'd like to rename DoFn.Context#sideOutput to #output (in the 
>>> Java
>>> SDK).
>>> 
>>> Having two methods, both named output, one which takes the "main
>>> output
>>> type" and one that takes a tag to specify the type more clearly 
>>> communicates the actual behavior - sideOutput isn't a "special" 
>>> way
>>> to
>>> output, it's the same as output(T), just to a specified PCollection.
> This
>>> will help pipeline authors understand the actual behavior of
>>> outputting
> to
>>> a tag, and detangle it from "sideInput", which is a special way 
>>> to
> receive
>>> input. Giving them the same name means that it's not even 
>>> strange to
> call
>>> output and provide the main output type, which is what we want -
>>> it's a
>>> more specific way to output, but does not have different
>>> restrictions or
>>> capabilities.
>>> 
>>> This is also a pretty small change within the SDK - it touches 
>>> about
>>> 20
>>> files, and the changes are pretty automatic.
>>> 
>>> Thanks,
>>> 
>>> Thomas
>>> 
> 
>>> 


Re: Renaming SideOutput

2017-04-11 Thread Aviem Zur
+1

On Wed, Apr 12, 2017 at 6:06 AM JingsongLee  wrote:

> strong +1
>
> best,
> JingsongLee
>
> ------------------------------------------------------------------
> From: Tang Jijun (上海_技术部_数据平台_唐觊隽)
> Time: 2017 Apr 12 (Wed) 10:39
> To: dev@beam.apache.org
> Subject: Re: Renaming SideOutput
>
> +1, much clearer
>
>
> -----Original Message-----
> From: Ankur Chauhan [mailto:an...@malloc64.com]
> Sent: April 12, 2017 10:36
> To: dev@beam.apache.org
> Subject: Re: Renaming SideOutput
>
>
> +1 this is pretty much the topmost things that I found odd when starting with 
> the beam model. It would definitely be more intuitive to have a consistent 
> name.
>
> Sent from my iPhone
>
> > On Apr 11, 2017, at 18:29, Aljoscha Krettek  wrote:
> >
> > +1
> >
> >> On Wed, Apr 12, 2017, at 02:34, Thomas Groh wrote:
> >> I think that's a good idea. I would call the outputs of a ParDo the
> >> "Main Output" and "Additional Outputs" - it seems like an easy way to
> >> make it clear that there's one output that is always expected, and
> >> there may be more.
> >>
> >> On Tue, Apr 11, 2017 at 5:29 PM, Robert Bradshaw <
> >> rober...@google.com.invalid> wrote:
> >>
> >>> We should do some renaming in Python too. Right now we have
> >>> SideOutputValue which I'd propose naming TaggedOutput or something
> >>> like that.
> >>>
> >>> Should the docs change too?
> >>> https://beam.apache.org/documentation/programming-guide/#transforms-sideio
> >>>
> >>> On Tue, Apr 11, 2017 at 5:25 PM, Kenneth Knowles wrote:
>  +1 ditto about sideInput and sideOutput not actually being related
> 
>  On Tue, Apr 11, 2017 at 3:52 PM, Robert Bradshaw <
>  rober...@google.com.invalid> wrote:
> 
> > +1, I think this is a lot clearer.
> >
> >> On Tue, Apr 11, 2017 at 2:24 PM, Stephen Sisk wrote:
> >> strong +1 for changing the name away from sideOutput - the fact
> >> that sideInput and sideOutput are not really related was
> >> definitely a
> >>> source
> > of
> >> confusion for me when learning beam.
> >>
> >> S
> >>
> >> On Tue, Apr 11, 2017 at 1:56 PM Thomas Groh wrote:
> >>
> >>> Hey everyone:
> >>>
> >>> I'd like to rename DoFn.Context#sideOutput to #output (in the
> >>> Java
> >>> SDK).
> >>>
> >>> Having two methods, both named output, one which takes the "main
> >>> output
> >>> type" and one that takes a tag to specify the type more clearly
> >>> communicates the actual behavior - sideOutput isn't a "special"
> >>> way
> >>> to
>
> >>> output, it's the same as output(T), just to a specified PCollection.
> > This
> >>> will help pipeline authors understand the actual behavior of
> >>> outputting
> > to
> >>> a tag, and detangle it from "sideInput", which is a special way
> >>> to
> > receive
> >>> input. Giving them the same name means that it's not even
> >>> strange to
> > call
> >>> output and provide the main output type, which is what we want -
> >>> it's a
> >>> more specific way to output, but does not have different
> >>> restrictions or
> >>> capabilities.
> >>>
> >>> This is also a pretty small change within the SDK - it touches
> >>> about
> >>> 20
> >>> files, and the changes are pretty automatic.
> >>>
> >>> Thanks,
> >>>
> >>> Thomas
> >>>
> >
> >>>
>


RE: Apache Storm/JStorm Runner(s) for Apache Beam

2017-04-11 Thread 刘键(Basti Liu)
Hi Taylor,

It is great to see your opinion.
Since Beam was open sourced, there has been a lot of interest in it from our 
internal users at Alibaba and from other companies in China, which prompted us 
to provide support for a JStorm runner. But because the existing Storm runner 
implementation is out of date, and over the past year many new features and 
different solutions (especially for exactly-once and state) were introduced in 
JStorm, we had to start separate development of a JStorm runner.
Currently, we have finished a prototype (supporting most of Beam's PTransforms, 
windowing, and triggers), as Pei mentioned in another email, and full testing is 
still ongoing. Some users in Alibaba have built trial topologies on it. For 
further improvement, we still need reviews from the Beam community to ensure 
correctness, and to be notified of any breaking or incompatible changes as Beam 
evolves. That is why we decided to commit the JStorm runner into the Beam 
repository.

In my personal understanding, the JStorm runner is not a duplicated effort. The 
major part of the JStorm runner can probably be reused for Storm; some other 
parts, like exactly-once and state, would need to be ported over. When the Storm 
community plans to restart development of the Storm runner, we'd like to help 
with it, as part of the JStorm feature merge planned before. At that time, we 
can discuss whether merging the JStorm features or porting them is required.
Looking forward to better collaboration between Beam, Storm, and JStorm.

Regards
Jian Liu(Basti)

-Original Message-
From: P. Taylor Goetz [mailto:ptgo...@apache.org] 
Sent: Tuesday, April 11, 2017 1:48 AM
To: dev@beam.apache.org; d...@storm.apache.org
Subject: Apache Storm/JStorm Runner(s) for Apache Beam

Note: cross-posting to dev@beam and dev@storm

I’ve seen at least two threads on the dev@ list discussing the JStorm runner 
and my hope is we can expand on that discussion and cross-pollinate with the 
Storm/JStorm/Beam communities as well.

A while back I created a very preliminary proof of concept of getting a Storm 
Beam runner working [1]. That was mainly an exercise for me to familiarize 
myself with the Beam API and discover what it would take to develop a Beam 
runner on top of Storm. That code is way out of date (I was targeting Beam’s 
HEAD before the 0.2.0 release, and a lot of changes have since taken place) and 
didn’t really work as Jian Liu pointed out. It was a start, that perhaps could 
be further built upon, or parts harvested, etc. I don’t have any particular 
attachment to that code and wouldn’t be upset if it were completely discarded 
in favor of a better or more extensible implementation.

What I would like to see, and I think this is a great opportunity to do so, is 
a closer collaboration between the Apache Storm and JStorm communities. For 
those who aren’t familiar with those projects’ relationship, I’ll start with a 
little history…

JStorm began at Alibaba as a fork of Storm (pre-Apache?) with Storm’s Clojure 
code reimplemented in Java. The rationale behind that move was that Alibaba had 
a large number of Java developers but very few who were proficient with 
Clojure. Moving to pure Java made sense as it would expand the base of 
potential contributors.

In late 2015 Alibaba donated the JStorm codebase to the Apache Storm project, 
and the Apache Storm PMC committed to converting its Clojure code to Java in 
order to incorporate the code donation. At the time there was one catch — 
Apache Storm had implemented comprehensive security features such as Kerberos 
authentication/authorization and multi-tenancy in its Clojure code, which 
greatly complicated the move to Java and incorporation of the JStorm code. 
JStorm did not have the same security features. A number of JStorm developers 
have also become Storm PMC members.

Fast forward to today. The Storm community has completed the bulk of the move 
to Java and the next major release (presumably 2.0, which is currently under 
discussion) will be largely Java-based. We are now in a much better position to 
begin incorporating JStorm’s features, as well as implementing new features 
necessary to support the Beam API (such as support for bounded pipelines, among 
other features).

Having separate Apache Storm and JStorm Beam runner implementations doesn't 
feel appropriate in my personal opinion, especially since both projects have 
expressed an ongoing commitment to bringing JStorm's additional features (and, 
just as important, its community) to Apache Storm.

One final note, when the Storm community initially discussed developing a Beam 
runner, the general consensus was do so within the Storm repository. My current 
thinking is that such an effort should take place within the Beam community, 
not only since that is the development pattern followed by other runner 
implementations (Flink, Apex, etc.), but also because it would serve to 
increase collaboration between Apache projects (always a good thing!)

RE: Renaming SideOutput

2017-04-11 Thread 刘键(Basti Liu)
+1. 
SideInput and SideOutput can easily confuse new users, since they are entirely 
different behaviors.
BTW, would it also be better to change "main output" to "default output" for 
the case where the user does not explicitly specify an output tag?

Regards
Jian Liu(Basti)

-Original Message-
From: Thomas Groh [mailto:tg...@google.com.INVALID] 
Sent: Wednesday, April 12, 2017 4:56 AM
To: dev@beam.apache.org
Subject: Renaming SideOutput

Hey everyone:

I'd like to rename DoFn.Context#sideOutput to #output (in the Java SDK).

Having two methods, both named output, one which takes the "main output type" 
and one that takes a tag to specify the type more clearly communicates the 
actual behavior - sideOutput isn't a "special" way to output, it's the same as 
output(T), just to a specified PCollection. This will help pipeline authors 
understand the actual behavior of outputting to a tag, and detangle it from 
"sideInput", which is a special way to receive input. Giving them the same name 
means that it's not even strange to call output and provide the main output 
type, which is what we want - it's a more specific way to output, but does not 
have different restrictions or capabilities.

This is also a pretty small change within the SDK - it touches about 20 files, 
and the changes are pretty automatic.

Thanks,

Thomas
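[Editorial note] To make the proposed shape concrete, here is a minimal, self-contained sketch of the pattern under discussion. This is illustrative only, not Beam's actual `DoFn.Context` API: the `Tag` and `Context` classes below are stand-ins for Beam's `TupleTag` and processing context.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for Beam's TupleTag<T>: identifies one output collection.
class Tag<T> {
    final String id;
    Tag(String id) { this.id = id; }
}

// Sketch of a processing context after the rename: both methods are named
// "output"; the tagged overload is just a more specific way to emit.
class Context<MainT> {
    final Tag<MainT> mainTag;
    final Map<Tag<?>, List<Object>> outputs = new HashMap<>();

    Context(Tag<MainT> mainTag) { this.mainTag = mainTag; }

    // Emits to the main output collection.
    void output(MainT value) { output(mainTag, value); }

    // Emits to the collection identified by the given tag.
    <T> void output(Tag<T> tag, T value) {
        outputs.computeIfAbsent(tag, k -> new ArrayList<>()).add(value);
    }
}

class RenameSketch {
    public static void main(String[] args) {
        Tag<String> words = new Tag<>("words");     // main output
        Tag<Integer> lengths = new Tag<>("lengths"); // additional output
        Context<String> c = new Context<>(words);

        for (String word : new String[] {"beam", "storm"}) {
            c.output(word);                   // main output
            c.output(lengths, word.length()); // additional (formerly "side") output
        }

        System.out.println(c.outputs.get(words));   // [beam, storm]
        System.out.println(c.outputs.get(lengths)); // [4, 5]
    }
}
```

With both methods named `output`, emitting to the main collection and emitting to a tagged collection read as the same operation, which is the symmetry the rename is after.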



Re: Renaming SideOutput

2017-04-11 Thread Ted Yu
+1

> On Apr 11, 2017, at 5:34 PM, Thomas Groh  wrote:
> 
> I think that's a good idea. I would call the outputs of a ParDo the "Main
> Output" and "Additional Outputs" - it seems like an easy way to make it
> clear that there's one output that is always expected, and there may be
> more.
> 
> On Tue, Apr 11, 2017 at 5:29 PM, Robert Bradshaw <
> rober...@google.com.invalid> wrote:
> 
>> We should do some renaming in Python too. Right now we have
>> SideOutputValue which I'd propose naming TaggedOutput or something
>> like that.
>> 
>> Should the docs change too?
>> https://beam.apache.org/documentation/programming-guide/#transforms-sideio
>> 
>> On Tue, Apr 11, 2017 at 5:25 PM, Kenneth Knowles 
>> wrote:
>>> +1 ditto about sideInput and sideOutput not actually being related
>>> 
>>> On Tue, Apr 11, 2017 at 3:52 PM, Robert Bradshaw <
>>> rober...@google.com.invalid> wrote:
>>> 
 +1, I think this is a lot clearer.
 
 On Tue, Apr 11, 2017 at 2:24 PM, Stephen Sisk 
 wrote:
> strong +1 for changing the name away from sideOutput - the fact that
> sideInput and sideOutput are not really related was definitely a
>> source
 of
> confusion for me when learning beam.
> 
> S
> 
> On Tue, Apr 11, 2017 at 1:56 PM Thomas Groh wrote:
> 
>> Hey everyone:
>> 
>> I'd like to rename DoFn.Context#sideOutput to #output (in the Java
>> SDK).
>> 
>> Having two methods, both named output, one which takes the "main
>> output
>> type" and one that takes a tag to specify the type more clearly
>> communicates the actual behavior - sideOutput isn't a "special" way
>> to
>> output, it's the same as output(T), just to a specified PCollection.
 This
>> will help pipeline authors understand the actual behavior of
>> outputting
 to
>> a tag, and detangle it from "sideInput", which is a special way to
 receive
>> input. Giving them the same name means that it's not even strange to
 call
>> output and provide the main output type, which is what we want -
>> it's a
>> more specific way to output, but does not have different
>> restrictions or
>> capabilities.
>> 
>> This is also a pretty small change within the SDK - it touches about
>> 20
>> files, and the changes are pretty automatic.
>> 
>> Thanks,
>> 
>> Thomas
>>