Re: Hackathon @BeamSummit @ApacheCon

2019-08-22 Thread Kenneth Knowles
I will be at Beam Summit / ApacheCon NA and would love to drop by a
hackathon room if one is arranged. Really excited for both my first
ApacheCon and Beam Summit (finally!)

Kenn

On Thu, Aug 22, 2019 at 10:18 AM Austin Bennett 
wrote:

> And, for clarity, especially focused on Hackathon times on Monday and/or
> Tuesday of ApacheCon, to not conflict with BeamSummit sessions.
>
> On Thu, Aug 22, 2019 at 9:47 AM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> Less than 3 weeks till Beam Summit @ApacheCon!
>>
>> We are to be in Vegas for BeamSummit and ApacheCon in a few weeks.
>>
>> Likely to reserve space in the Hackathon Room to accomplish some tasks:
>> * Help Users
>> * Build Beam
>> * Collaborate with other projects
>> * etc
>>
>> If you're to be around (or not) let us know how you'd like to be
>> involved.  Also, please share and surface anything that would be good for
>> us to look at (and, esp. any beginner tasks, in case we can entice some new
>> contributors).
>>
>>
>> P.S.  See BeamSummit.org, if you're thinking of attending - there's a
>> discount code.
>>
>


Re: [RESULT] [VOTE] Release 2.15.0, release candidate #2

2019-08-22 Thread Kenneth Knowles
My +1 is also binding :-)

Nice work on the release!

Kenn

On Wed, Aug 21, 2019 at 9:51 AM Yifan Zou  wrote:

> Hi all,
>
> I'm happy to announce that we have unanimously approved this release.
>
> There are 4 approving votes, 3 of which are binding (in order):
> * Ahmet (al...@google.com);
> * Pablo (pabl...@google.com);
> * Lukasz (lc...@google.com);
>
> There are no disapproving votes.
>
> Thanks everyone!
>
> Next step is to finalize the release (merge the docs/website/blog PRs,
> publish artifacts). Please let me know if you have any questions.
>
> Regards,
> Yifan
>


Re: SqlTransform Metadata

2019-08-22 Thread Kenneth Knowles
I think this thread is the only discussion of it. I still favor a separate
transform SqlTransform.withMetadata().query(...). That way, there's no
change to SQL.

Kenn

On Wed, Aug 21, 2019 at 12:53 AM Reza Rokni  wrote:

> @Kenn / @Rob  has there been any other discussions on how the timestamp
> value can be accessed from within the SQL since this thread in May?
>
> If not my vote  is for a convenience method  that gives access to the
> timestamp as a function call within the SQL statement.
>
> Reza
>
> On Wed, 22 May 2019 at 10:06, Reza Rokni  wrote:
>
>> Hi,
>>
>> Coming back to this do we have enough of a consensus to say that in
>> principle this is a good idea? If yes I will raise a Jira for this.
>>
>> Cheers
>>
>> Reza
>>
>> On Thu, 16 May 2019 at 02:58, Robert Bradshaw 
>> wrote:
>>
>>> On Wed, May 15, 2019 at 8:51 PM Kenneth Knowles  wrote:
>>> >
>>> > On Wed, May 15, 2019 at 3:05 AM Robert Bradshaw 
>>> wrote:
>>> >>
>>> >> Isn't there an API for concisely computing new fields from old ones?
>>> >> Perhaps these expressions could contain references to metadata value
>>> >> such as timestamp. Otherwise,
>>> >
>>> > Even so, being able to refer to the timestamp implies something about
>>> its presence in a namespace, shared with other user-decided names.
>>>
>>> I was thinking that functions may live in a different namespace than
>>> fields.
>>>
>>> > And it may be nice for users to use that API within the composite
>>> SqlTransform. I think there are a lot of options.
>>> >
>>> >> Rather than withMetadata reifying the value as a nested field, with
>>> >> the timestamp, window, etc. at the top level, one could let it take a
>>> >> field name argument that attaches all the metadata as an extra
>>> >> (struct-like) field. This would be like attachX, but without having to
>>> >> have a separate method for every X.
>>> >
>>> > If you leave the input field names at the top level, then any "attach"
>>> style API requires choosing a name that doesn't conflict with input field
>>> names. You can't write a generic transform that works with all inputs. I
>>> think it is much simpler to move the input field all into a nested
>>> row/struct. Putting all the metadata in a second nested row/struct is just
>>> as good as top-level, perhaps. But moving the input into the struct/row is
>>> important.
>>>
>>> Very good point about writing generic transforms. It does mean a lot
>>> of editing if one decides one wants to access the metadata field(s)
>>> after-the-fact. (I also don't think we need to put the metadata in a
>>> nested struct if the value is.)
>>>
>>> >> It seems restrictive to only consider this a a special mode for
>>> >> SqlTransform rather than a more generic operation. (For SQL, my first
>>> >> instinct would be to just make this a special function like
>>> >> element_timestamp(), but there is some ambiguity there when there are
>>> >> multiple tables in the expression.)
>>> >
>>> > I would propose it as both: we already have some Reify transforms, and
>>> you could make a general operation that does this small data preparation
>>> easily. I think the proposal is just to add a convenience build method on
>>> SqlTransform to include the underlying functionality as part of the
>>> composite, which we really already have.
>>> >
>>> > I don't think we should extend SQL with built-in functions for
>>> element_timestamp() and things like that, because SQL already has TIMESTAMP
>>> columns and it is very natural to use SQL on unbounded relations where the
>>> timestamp is just part of the data.
>>>
>>> That's why I was suggesting a single element_metadata() rather than
>>> exploding each one out.
>>>
>>> Do you have a pointer to what the TIMESTAMP columns are? (I'm assuming
>>> this is a special field, but distinct from the metadata timestamp?)
>>>
>>> >> On Wed, May 15, 2019 at 5:03 AM Reza Rokni  wrote:
>>> >> >
>>> >> > Hi,
>>> >> >
>>> >> > One use case would be when dealing with the windowing functions for
>>> example:
>>> >> >
>>> >> > SELECT f_int, COUNT(*) , TUMBLE_START(f_timestamp, INTERVAL '1'
>>> HOUR) tumble_start
>>> >> >   FROM PCOLLECTION
>>> >> >   GROUP BY
>>> >> > f_int,
>>> >> > TUMBLE(f_timestamp, INTERVAL '1' HOUR)
>>> >> >
>>> >> > For an element which is using Metadata to inform the EvenTime of
>>> the element, rather than data within the element itself, I would need to
>>> create a new schema which added the timestamp as a field. I think other
>>> examples which maybe interesting is getting the value of a row with the
>>> max/min timestamp. None of this would be difficult but it does feel a
>>> little on the verbose side and also makes the pipeline a little harder to
>>> read.
>>> >> >
>>> >> > Cheers
>>> >> > Reza
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> > From: Kenneth Knowles 
>>> >> > Date: Wed, 15 May 2019 at 01:15
>>> >> > To: dev
>>> >> >
>>> >> >> We have support for nested rows so this should be easy. The
>>> .withMetadata would reify the struct, moving f

Re: [ANN] Seeking volunteer mentors from all Apache projects to help mentor under-represented contributors

2019-08-22 Thread Pablo Estrada
Hi Kenn,
I'd be happy to do something like this. I have a couple projects in mind as
well.
I'll do these tasks by the deadline.
Best
-P.

On Thu, Aug 22, 2019 at 12:29 PM Rui Wang  wrote:

> Thanks Kenn for forwarding this. I do have several BeamSQL related
> projects and I could apply for bing a mentor.
>
>>
>> 1. Ensure you can host an Outreachy intern
>>
>> Understand the commitment to be a mentor [2] [4], we’ll also ask you that
>> you connect with the ASF D&I committee to ensure we capture a contribution
>> friction log from your interns. If you can fulfill this commitment, then
>> move forward to get consensus from your project’s PMC and move to the next
>> step.
>>
>
> In Beam community specifically, what would be our process to get consensus
> from Beam's PMC and move to the next step?
>
>
> -Rui
>
>
>> 2. Register your project with Outreachy
>>
>> Register your project in the Outreachy website [6]. Please be as specific
>> as possible when describing your idea. Include the programming language,
>> the tools and skills required, but try not to scare potential students
>> away. They are supposed to learn what's required before the program starts.
>> Use labels, e.g. for the programming language (java, c, c++, erlang,
>> python, brainfuck, ...) or technology area (cloud, xml, web, foo, bar, ...).
>>
>> If you want help crafting your project proposal, please contact the
>> Apache coordinators: "Matt Sicker" ,. "Awasum Yannick"
>> ,. "Katia Rojas" .
>>
>> 3. Curate a list of tasks for your Outreachy project
>>
>> Add an “outreachy19dec” label to issues related to your project. You
>> should include links to search filters listing these issues in your project
>> application. It’s also useful to use a “newbie-friendly” label to
>> distinguish the starter tasks from the larger or more complex project
>> tasks. This will provide tasks for applicants to complete during the
>> application process.
>>
>> If your project doesn't use JIRA (e.g.httpd, ooo), you can use the
>> Diversity & Inclusion board to coordinate with your applicants, just use
>> the “Outreachy” component.
>>
>> [4] Contains some additional information (this could be the FAQ page).
>>
>> P.S.: this email is free to be shared publicly if you want to.
>>
>> References:
>>
>> [1] https://www.outreachy.org
>>
>> [2] https://www.outreachy.org/mentor/mentor-faq/#define-a-project
>>
>> [3] https://www.outreachy.org/communities/cfp/apache/
>>
>> [4] https://www.outreachy.org/mentor/mentor-faq/
>>
>> [5] https://issues.apache.org/jira/projects/DI
>>
>> [6]
>> https://www.outreachy.org/december-2019-to-march-2020-internship-round/communities/apache/submit-project/
>>
>> https://www.outreachy.org/communities/cfp/
>>
>>
>>


Re: Add a JIRA component: dsl-sql-zetasql?

2019-08-22 Thread Rui Wang
Thank you Kenneth. I have filed 35 JIRA in this component as the following
up work for Beam ZetaSQL.


-Rui

On Thu, Aug 22, 2019 at 1:25 PM Kenneth Knowles  wrote:

> Makes sense to me. I've done it.
>
> On Thu, Aug 22, 2019 at 11:10 AM Rui Wang  wrote:
>
>> Hi Community,
>>
>> As Beam ZetaSQL is already merged, can we add a new component so I can
>> file JIRAs related to Beam ZetaSQL under that component? It's mostly for
>> the project management purpose.
>>
>>
>> The component can be named as: dsl-sql-zetasql.
>>
>>
>> -Rui
>>
>


Re: Add a JIRA component: dsl-sql-zetasql?

2019-08-22 Thread Kenneth Knowles
Makes sense to me. I've done it.

On Thu, Aug 22, 2019 at 11:10 AM Rui Wang  wrote:

> Hi Community,
>
> As Beam ZetaSQL is already merged, can we add a new component so I can
> file JIRAs related to Beam ZetaSQL under that component? It's mostly for
> the project management purpose.
>
>
> The component can be named as: dsl-sql-zetasql.
>
>
> -Rui
>


Re: (mini-doc) Beam (Flink) portable job templates

2019-08-22 Thread Kyle Weaver
Following up on discussion in this morning's OSS runners meeting, I have
uploaded a draft PR for the full implementation (job creation + execution):
https://github.com/apache/beam/pull/9408

Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com


On Tue, Aug 20, 2019 at 1:24 PM Robert Bradshaw  wrote:

> The point of expansion services is to run at pipeline construction
> time so that the caller can build on top of the outputs. E.g. we're
> hoping to expose Beam's SQL transforms to other languages via an
> expansion service and *not* duplicate the logic of parsing the SQL
> statements to determine the type(s) of the outputs. Even for simpler
> IOs, we would like to take advantage of schema information (e.g.
> looked up at construction time) to produce results and validate (or
> even inform) subsequent construction.
>
> I think we're also making a mistake in talking about "the" expansion
> service here, as if there was only one well defined service that all
> pipenes used. If we go the route of deferring some expansion to the
> runner, we need a way of naming expansion services. It seems like this
> proposal is simply isomorphic to defining new primitive transforms
> which some (all?) runners are just expected to understand.
>
> On Tue, Aug 20, 2019 at 10:11 AM Thomas Weise  wrote:
> >
> >
> >
> > On Tue, Aug 20, 2019 at 8:56 AM Lukasz Cwik  wrote:
> >>
> >>
> >>
> >> On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay  wrote:
> >>>
> >>>
> >>>
> >>> On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise  wrote:
> 
>  There is a PR open for this: https://github.com/apache/beam/pull/9331
> 
>  (it wasn't tagged with the JIRA and therefore not linked)
> 
>  I think it is worthwhile to explore how we could further detangle the
> client side Python and Java dependencies.
> 
>  The expansion service is one more dependency to consider in a build
> environment. Is it really necessary to expand external transforms prior to
> submission to the job service?
> >>>
> >>>
> >>> +1, this will make it easier to use external transforms from the
> already familiar client environments.
> >>>
> >>
> >>
> >> The intent is to make it so that you CAN (not MUST) run an expansion
> service separate from a Runner. Creating a single endpoint that hosts both
> the Job and Expansion service is something that gRPC does very easily since
> you can host multiple service definitions on a single port.
> >
> >
> > Yes, that's fine. The point here is when the expansion occurs. I believe
> the runner can also invoke the expansion service, thereby eliminating the
> expansion service interaction from the client side.
> >
> >
> >>
> >>
> 
> 
>  Can we come up with a partially constructed proto that can be
> produced by just running the Python entry point? Note this would also
> require pushing the pipeline options parsing into the job service.
> >>>
> >>>
> >>> Why would this require pushing the pipeline options parsing to the job
> service. Assuming that python will have enough idea about the external
> transform what options it will need. The necessary bit could be converted
> to arguments and be part of that partially constructed proto.
> >>>
> 
> 
>  On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri <
> ecanzoni...@gmail.com> wrote:
> >
> > I found the tracking ticket at BEAM-7966
> >
> > On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <
> ecanzoni...@gmail.com> wrote:
> >>
> >> Is this alternative still being considered? Creating a portable jar
> sounds like a good solution to re-use the existing runner specific
> deployment mechanism (e.g. Flink k8s operator) and in general simplify the
> deployment story.
> >>
> >> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw <
> rober...@google.com> wrote:
> >>>
> >>> The expansion service is a separate service. (The flink jar
> happens to
> >>> bring both up.) However, there is negotiation to receive/validate
> the
> >>> pipeline options.
> >>>
> >>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise 
> wrote:
> >>> >
> >>> > We would also need to consider cross-language pipelines that
> (currently) assume the interaction with an expansion service at
> construction time.
> >>> >
> >>> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver 
> wrote:
> >>> >>
> >>> >> > It might also be useful to have the option to just output the
> proto and artifacts, as alternative to the jar file.
> >>> >>
> >>> >> Sure, that wouldn't be too big a change if we were to decide to
> go the SDK route.
> >>> >>
> >>> >> > For the Flink entry point we would need to allow for the job
> server to be used as a library.
> >>> >>
> >>> >> We don't need the whole job server, we only need to add a main
> method to FlinkPipelineRunner [1] as the entry point, which would basically
> just do the setup described in the doc then call FlinkPipelineRunner::run.
> >>> >>
> >>> >> [1]

Re: [ANN] Seeking volunteer mentors from all Apache projects to help mentor under-represented contributors

2019-08-22 Thread Rui Wang
Thanks Kenn for forwarding this. I do have several BeamSQL related projects
and I could apply for bing a mentor.

>
> 1. Ensure you can host an Outreachy intern
>
> Understand the commitment to be a mentor [2] [4], we’ll also ask you that
> you connect with the ASF D&I committee to ensure we capture a contribution
> friction log from your interns. If you can fulfill this commitment, then
> move forward to get consensus from your project’s PMC and move to the next
> step.
>

In Beam community specifically, what would be our process to get consensus
from Beam's PMC and move to the next step?


-Rui


> 2. Register your project with Outreachy
>
> Register your project in the Outreachy website [6]. Please be as specific
> as possible when describing your idea. Include the programming language,
> the tools and skills required, but try not to scare potential students
> away. They are supposed to learn what's required before the program starts.
> Use labels, e.g. for the programming language (java, c, c++, erlang,
> python, brainfuck, ...) or technology area (cloud, xml, web, foo, bar, ...).
>
> If you want help crafting your project proposal, please contact the Apache
> coordinators: "Matt Sicker" ,. "Awasum Yannick" <
> yannickawa...@gmail.com>,. "Katia Rojas" .
>
> 3. Curate a list of tasks for your Outreachy project
>
> Add an “outreachy19dec” label to issues related to your project. You
> should include links to search filters listing these issues in your project
> application. It’s also useful to use a “newbie-friendly” label to
> distinguish the starter tasks from the larger or more complex project
> tasks. This will provide tasks for applicants to complete during the
> application process.
>
> If your project doesn't use JIRA (e.g.httpd, ooo), you can use the
> Diversity & Inclusion board to coordinate with your applicants, just use
> the “Outreachy” component.
>
> [4] Contains some additional information (this could be the FAQ page).
>
> P.S.: this email is free to be shared publicly if you want to.
>
> References:
>
> [1] https://www.outreachy.org
>
> [2] https://www.outreachy.org/mentor/mentor-faq/#define-a-project
>
> [3] https://www.outreachy.org/communities/cfp/apache/
>
> [4] https://www.outreachy.org/mentor/mentor-faq/
>
> [5] https://issues.apache.org/jira/projects/DI
>
> [6]
> https://www.outreachy.org/december-2019-to-march-2020-internship-round/communities/apache/submit-project/
>
> https://www.outreachy.org/communities/cfp/
>
>
>


Add a JIRA component: dsl-sql-zetasql?

2019-08-22 Thread Rui Wang
Hi Community,

As Beam ZetaSQL is already merged, can we add a new component so I can file
JIRAs related to Beam ZetaSQL under that component? It's mostly for the
project management purpose.


The component can be named as: dsl-sql-zetasql.


-Rui


Mentorship Program

2019-08-22 Thread sridhar inuog
Beam Community,
 I am new to this community and It has been a great
experience so far. I am especially thankful to Rui Wang for making this
experience smooth and enjoyable. I want others to have the same experience
as well. For that, I suggest Beam community have formal Mentorship program
or at least have an alias for newbies. As this is a virtual team it is not
possible to walk up to someone and ask some basic questions when they are
not busy. I am a bit shy sending an email with simple questions to the same
alias where architecture level discussions are happening and having an
avenue for that will help people like me. I have seen such emails only once
on this alias and no further emails from that user. I also see some jira
issues assigned but not worked, granted there could be many reasons for
that but I am sure more people will benefit from having mentors.

   Just to be clear I am suggesting permanent
mentorship program or alias for newbies in this community which is separate
from what is proposed in the email below. If such a thing exists, can you
please point me to it. Some of the activities that can be done under this
are code walkthroughs. It was great that Pablo Estrada organized such an
event once. Some of these may be part of local beam summits but not
everyone is able to attend those. I think facilitating new developers
adoption to the community will make this community strong and more
productive.

Thanks,
Sridhar


On Thu, Aug 22, 2019 at 10:33 AM Kenneth Knowles  wrote:

> Seeking mentors and proposals. This could be a great opportunity to build
> and diversify Beam's community.
>
> If you are not familiar with Outreachy, you may start by thinking of it as
> "like Google Summer of Code for groups under-represented in tech". You
> mentor someone remotely and Outreachy pays them a stipend to contribute to
> Beam.
>
> Kenn
>
> -- Forwarded message -
> From: Katia Rojas 
> Date: Sat, Aug 17, 2019 at 8:16 AM
> Subject: [ANN] Seeking volunteer mentors from all Apache projects to help
> mentor under-represented contributors
> To: 
>
>
> Hello folks,
>
> The ASF has successfully been accepted as a participating FOSS community
> in the Outreachy Program [1] to work with Outreachy organizers to offer
> remote internships to applicants around the world.  With this program, we
> are looking forward to improving inclusion in our communities by
> understanding what are the barriers that underrepresented groups in the
> tech industry have while trying to start their journey.
>
> Outreachy's goal is to support people from groups underrepresented in the
> technology industry. Outreachy interns will work remotely with mentors on
> projects ranging from programming, user experience, documentation,
> illustration and graphic design, to data science. Outreachy interns will
> receive stipends for developing said projects full-time for three months.
>
> Mentors will provide mentoring and project ideas and in return have the
> opportunity to get new participants - most importantly - to identify and
> bring in new committers from underrepresented groups.
>
> If you are an ASF committer and you want to participate with your project,
> we ask you to do the following things by no later than 2019-Sep-17 23:00
> UTC (project list submission are due a week later):
>
> 1. Ensure you can host an Outreachy intern
>
> Understand the commitment to be a mentor [2] [4], we’ll also ask you that
> you connect with the ASF D&I committee to ensure we capture a contribution
> friction log from your interns. If you can fulfill this commitment, then
> move forward to get consensus from your project’s PMC and move to the next
> step.
>
> 2. Register your project with Outreachy
>
> Register your project in the Outreachy website [6]. Please be as specific
> as possible when describing your idea. Include the programming language,
> the tools and skills required, but try not to scare potential students
> away. They are supposed to learn what's required before the program starts.
> Use labels, e.g. for the programming language (java, c, c++, erlang,
> python, brainfuck, ...) or technology area (cloud, xml, web, foo, bar, ...).
>
> If you want help crafting your project proposal, please contact the Apache
> coordinators: "Matt Sicker" ,. "Awasum Yannick" <
> yannickawa...@gmail.com>,. "Katia Rojas" .
>
> 3. Curate a list of tasks for your Outreachy project
>
> Add an “outreachy19dec” label to issues related to your project. You
> should include links to search filters listing these issues in your project
> application. It’s also useful to use a “newbie-friendly” label to
> distinguish the starter tasks from the larger or more complex project
> tasks. This will provide tasks for applicants to complete during the
> application process.
>
> If your project doesn't use JIRA (e.g.httpd, ooo), you can use the
> Diversity & Inclusion board to coordinate with your applicants, just use
> the “

Re: [Discuss] Propose Calcite Vendor Release

2019-08-22 Thread Rui Wang
I will wait until next Monday(08/26) to kick off the release if there is no
objection.


-Rui

On Thu, Aug 22, 2019 at 8:49 AM Lukasz Cwik  wrote:

> +1 for release
>
> On Thu, Aug 22, 2019 at 8:20 AM Kenneth Knowles  wrote:
>
>> +1 to doing this release. There is no risk since nothing will use the 0.1
>> version and if it has problems we just make 0.2, etc, etc.
>>
>> And big thanks to Rui for volunteering.
>>
>> On Wed, Aug 21, 2019 at 11:11 PM Kai Jiang  wrote:
>>
>>> Thanks Rui! For sure, any objections should be resolved before releasing.
>>>
>>> On Wed, Aug 21, 2019 at 10:24 PM Rui Wang  wrote:
>>>
 I can be the release manager to help release vendor calcite. Per [1],
 before we start a release, we have to reach consensus before starting a
 release.


 [1]: https://s.apache.org/beam-release-vendored-artifacts

 -Rui

 On Wed, Aug 21, 2019 at 5:00 PM Kai Jiang  wrote:

> Hi Community,
>
> As a part of effort to unblock for vendor calcite in SQL module, we
> broke it into pull/9333  for
> going through vendored dependencies release process separately.
>
> I want to propose Calcite vendor release and look for a release
> manager to help with the release process.
>
> Best,
> Kai
>



Re: Hackathon @BeamSummit @ApacheCon

2019-08-22 Thread Austin Bennett
And, for clarity, especially focused on Hackathon times on Monday and/or
Tuesday of ApacheCon, to not conflict with BeamSummit sessions.

On Thu, Aug 22, 2019 at 9:47 AM Austin Bennett 
wrote:

> Less than 3 weeks till Beam Summit @ApacheCon!
>
> We are to be in Vegas for BeamSummit and ApacheCon in a few weeks.
>
> Likely to reserve space in the Hackathon Room to accomplish some tasks:
> * Help Users
> * Build Beam
> * Collaborate with other projects
> * etc
>
> If you're to be around (or not) let us know how you'd like to be
> involved.  Also, please share and surface anything that would be good for
> us to look at (and, esp. any beginner tasks, in case we can entice some new
> contributors).
>
>
> P.S.  See BeamSummit.org, if you're thinking of attending - there's a
> discount code.
>


Re: Write-through-cache in State logic

2019-08-22 Thread Maximilian Michels
Just to give a quick update here. Rakesh, Thomas, and I had a discussion
about async writes from the Python SDK to the Runner. Robert was also
present for some parts of the discussion.

We concluded that blocking writes with the need to refresh the cache
token each time are not going to provide enough throughput/latency.

We figured that it will be enough to use a single cache token per
Runner<=>SDK Harness connection. This cache token will be provided by
the Runner in the ProcessBundleRequest. Writes will not yield a new
cache token. The advantage is that we can use one cache token for the
life time of the bundle and also across bundles, unless the Runner
switches to a new Runner<=>SDK Harness connection; then the Runner would
have to generate a new cache token.

We might require additional cache tokens for the side inputs. For now,
I'm planning to only tackle user state which seems to be the area where
users have expressed the most need for caching.

-Max

On 21.08.19 20:05, Maximilian Michels wrote:
>> There is probably a misunderstanding here: I'm suggesting to use a worker ID 
>> instead of cache tokens, not additionally.
> 
> Ah! Misread that. We need a changing token to indicate that the cache is
> stale, e.g. checkpoint has failed / restoring from an old checkpoint. If
> the _Runner_ generates a new unique token/id for workers which outlast
> the Runner, then this should work fine. I don't think it is safe for the
> worker to supply the id. The Runner should be in control of cache tokens
> to avoid invalid tokens.
> 
>> In the PR the token is modified as part of updating the state. Doesn't the 
>> SDK need the new token to update it's cache entry also? That's where it 
>> would help the SDK to know the new token upfront.
> 
> If the state is updated in the Runner, a new token has to be generated.
> The old one is not valid anymore. The SDK will use the updated token to
> store the new value in the cache. I understand that it would be nice to
> know the token upfront. That could be possible with some token
> generation scheme. On the other hand, writes can be asynchronous and
> thus not block the UDF.
> 
>> But I believe there is no need to change the token in first place, unless 
>> bundles for the same key (ranges) can be processed by different workers.
> 
> That's certainly possible, e.g. two workers A and B take turn processing
> a certain key range, one bundle after another:
> 
> You process a bundle with a token T with A, then worker B takes over.
> Both have an entry with cache token T. So B goes on to modify the state
> and uses the same cache token T. Then A takes over again. A would have a
> stale cache entry but T would still be a valid cache token.
> 
>> Indeed the fact that Dataflow can dynamically split and merge these ranges 
>> is what makes it trickier. If Flink does not repartition the ranges, then 
>> things are much easier.
> 
> Flink does not dynamically repartition key ranges (yet). If it started
> to support that, we would invalidate the cache tokens for the changed
> partitions.
> 
> 
> I'd suggest the following cache token generation scheme:
> 
> One cache token per key range for user state and one cache token for
> each side input. On writes to user state or changing side input, the
> associated cache token will be renewed.
> 
> On the SDK side, it should be sufficient to let the SDK re-associate all
> its cached data belonging to a valid cache token with a new cache token
> returned by a successful write. This has to happen in the active scope
> (i.e. user state, or a particular side input).
> 
> If the key range changes, new cache tokens have to generated. This
> should happen automatically because the Runner does not checkpoint cache
> tokens and will generate new ones when it restarts from an earlier
> checkpoint.
> 
> The current PR needs to be changed to (1) only keep a single cache token
> per user state and key range (2) add support for cache tokens for each
> side input.
> 
> Hope that makes sense.
> 
> -Max
> 
> On 21.08.19 17:27, Reuven Lax wrote:
>>
>>
>> On Wed, Aug 21, 2019 at 2:16 AM Maximilian Michels > > wrote:
>>
>> Appreciate all your comments! Replying below.
>>
>>
>> @Luke:
>>
>> > Having cache tokens per key would be very expensive indeed and I
>> believe we should go with a single cache token "per" bundle.
>>
>> Thanks for your comments on the PR. I was thinking to propose something
>> along this lines of having cache tokens valid for a particular
>> checkpointing "epoch". That would require even less token renewal than
>> the per-bundle approach.
>>
>>
>> @Thomas, thanks for the input. Some remarks:
>>
>> > Wouldn't it be simpler to have the runner just track a unique ID
>> for each worker and use that to communicate if the cache is valid or
>> not?
>>
>> We do not need a unique id per worker. If a cache token is valid for a
>> particular worker, it is also valid for anot

Re: Hackathon @BeamSummit @ApacheCon

2019-08-22 Thread Rakesh Kumar
Hi Austin,

I am attending Beam Summit and also presenting on a use case of Beam. I
would be more than happy to provide you some help in the hackathon.



On Thu, Aug 22, 2019 at 9:48 AM Austin Bennett 
wrote:

> Less than 3 weeks till Beam Summit @ApacheCon!
>
> We are to be in Vegas for BeamSummit and ApacheCon in a few weeks.
>
> Likely to reserve space in the Hackathon Room to accomplish some tasks:
> * Help Users
> * Build Beam
> * Collaborate with other projects
> * etc
>
> If you're to be around (or not) let us know how you'd like to be
> involved.  Also, please share and surface anything that would be good for
> us to look at (and, esp. any beginner tasks, in case we can entice some new
> contributors).
>
>
> P.S.  See BeamSummit.org, if you're thinking of attending - there's a
> discount code.
>


Hackathon @BeamSummit @ApacheCon

2019-08-22 Thread Austin Bennett
Less than 3 weeks till Beam Summit @ApacheCon!

We are to be in Vegas for BeamSummit and ApacheCon in a few weeks.

Likely to reserve space in the Hackathon Room to accomplish some tasks:
* Help Users
* Build Beam
* Collaborate with other projects
* etc

If you're to be around (or not) let us know how you'd like to be involved.
Also, please share and surface anything that would be good for us to look
at (and, esp. any beginner tasks, in case we can entice some new
contributors).


P.S.  See BeamSummit.org, if you're thinking of attending - there's a
discount code.


Re: [Discuss] Propose Calcite Vendor Release

2019-08-22 Thread Lukasz Cwik
+1 for release

On Thu, Aug 22, 2019 at 8:20 AM Kenneth Knowles  wrote:

> +1 to doing this release. There is no risk since nothing will use the 0.1
> version and if it has problems we just make 0.2, etc, etc.
>
> And big thanks to Rui for volunteering.
>
> On Wed, Aug 21, 2019 at 11:11 PM Kai Jiang  wrote:
>
>> Thanks Rui! For sure, any objections should be resolved before releasing.
>>
>> On Wed, Aug 21, 2019 at 10:24 PM Rui Wang  wrote:
>>
>>> I can be the release manager to help release vendor calcite. Per [1],
>>> before we start a release, we have to reach consensus before starting a
>>> release.
>>>
>>>
>>> [1]: https://s.apache.org/beam-release-vendored-artifacts
>>>
>>> -Rui
>>>
>>> On Wed, Aug 21, 2019 at 5:00 PM Kai Jiang  wrote:
>>>
 Hi Community,

 As a part of effort to unblock for vendor calcite in SQL module, we
 broke it into pull/9333  for
 going through vendored dependencies release process separately.

 I want to propose Calcite vendor release and look for a release manager
 to help with the release process.

 Best,
 Kai

>>>


Fwd: [ANN] Seeking volunteer mentors from all Apache projects to help mentor under-represented contributors

2019-08-22 Thread Kenneth Knowles
Seeking mentors and proposals. This could be a great opportunity to build
and diversify Beam's community.

If you are not familiar with Outreachy, you may start by thinking of it as
"like Google Summer of Code for groups under-represented in tech". You
mentor someone remotely and Outreachy pays them a stipend to contribute to
Beam.

Kenn

-- Forwarded message -
From: Katia Rojas 
Date: Sat, Aug 17, 2019 at 8:16 AM
Subject: [ANN] Seeking volunteer mentors from all Apache projects to help
mentor under-represented contributors
To: 


Hello folks,

The ASF has successfully been accepted as a participating FOSS community in
the Outreachy Program [1] to work with Outreachy organizers to offer remote
internships to applicants around the world.  With this program, we are
looking forward to improving inclusion in our communities by understanding
what are the barriers that underrepresented groups in the tech industry
have while trying to start their journey.

Outreachy's goal is to support people from groups underrepresented in the
technology industry. Outreachy interns will work remotely with mentors on
projects ranging from programming, user experience, documentation,
illustration and graphic design, to data science. Outreachy interns will
receive stipends for developing said projects full-time for three months.

Mentors will provide mentoring and project ideas and in return have the
opportunity to get new participants - most importantly - to identify and
bring in new committers from underrepresented groups.

If you are an ASF committer and you want to participate with your project,
we ask you to do the following things by no later than 2019-Sep-17 23:00
UTC (project list submission are due a week later):

1. Ensure you can host an Outreachy intern

Understand the commitment to be a mentor [2] [4], we’ll also ask you that
you connect with the ASF D&I committee to ensure we capture a contribution
friction log from your interns. If you can fulfill this commitment, then
move forward to get consensus from your project’s PMC and move to the next
step.

2. Register your project with Outreachy

Register your project in the Outreachy website [6]. Please be as specific
as possible when describing your idea. Include the programming language,
the tools and skills required, but try not to scare potential students
away. They are supposed to learn what's required before the program starts.
Use labels, e.g. for the programming language (java, c, c++, erlang,
python, brainfuck, ...) or technology area (cloud, xml, web, foo, bar, ...).

If you want help crafting your project proposal, please contact the Apache
coordinators: "Matt Sicker" ,. "Awasum Yannick" <
yannickawa...@gmail.com>,. "Katia Rojas" .

3. Curate a list of tasks for your Outreachy project

Add an “outreachy19dec” label to issues related to your project. You should
include links to search filters listing these issues in your project
application. It’s also useful to use a “newbie-friendly” label to
distinguish the starter tasks from the larger or more complex project
tasks. This will provide tasks for applicants to complete during the
application process.

If your project doesn't use JIRA (e.g.httpd, ooo), you can use the
Diversity & Inclusion board to coordinate with your applicants, just use
the “Outreachy” component.

[4] Contains some additional information (this could be the FAQ page).

P.S.: this email is free to be shared publicly if you want to.

References:

[1] https://www.outreachy.org

[2] https://www.outreachy.org/mentor/mentor-faq/#define-a-project

[3] https://www.outreachy.org/communities/cfp/apache/

[4] https://www.outreachy.org/mentor/mentor-faq/

[5] https://issues.apache.org/jira/projects/DI

[6]
https://www.outreachy.org/december-2019-to-march-2020-internship-round/communities/apache/submit-project/

https://www.outreachy.org/communities/cfp/


Re: [Discuss] Propose Calcite Vendor Release

2019-08-22 Thread Kenneth Knowles
+1 to doing this release. There is no risk since nothing will use the 0.1
version and if it has problems we just make 0.2, etc, etc.

And big thanks to Rui for volunteering.

On Wed, Aug 21, 2019 at 11:11 PM Kai Jiang  wrote:

> Thanks Rui! For sure, any objections should be resolved before releasing.
>
> On Wed, Aug 21, 2019 at 10:24 PM Rui Wang  wrote:
>
>> I can be the release manager to help release vendor calcite. Per [1],
>> before we start a release, we have to reach consensus before starting a
>> release.
>>
>>
>> [1]: https://s.apache.org/beam-release-vendored-artifacts
>>
>> -Rui
>>
>> On Wed, Aug 21, 2019 at 5:00 PM Kai Jiang  wrote:
>>
>>> Hi Community,
>>>
>>> As a part of effort to unblock for vendor calcite in SQL module, we
>>> broke it into pull/9333  for
>>> going through vendored dependencies release process separately.
>>>
>>> I want to propose Calcite vendor release and look for a release manager
>>> to help with the release process.
>>>
>>> Best,
>>> Kai
>>>
>>


Re: Support ZetaSQL as a new SQL dialect in BeamSQL

2019-08-22 Thread Alex Van Boxel
This is a very informative thread. I would love that a lot of this
information and reasoning end up in the documentation.

 _/
_/ Alex Van Boxel


On Wed, Aug 21, 2019 at 9:17 PM Rui Wang  wrote:

> Thanks everyone! Now Beam ZetaSQL is merged into Beam repo!
>
>
> -Rui
>
> On Mon, Aug 19, 2019 at 8:36 AM Ahmet Altay  wrote:
>
>> Thank you both!
>>
>> On Mon, Aug 19, 2019 at 8:01 AM Kenneth Knowles  wrote:
>>
>>> The i.p. clearance is complete:
>>> https://lists.apache.org/thread.html/239be048e7748f079dc34b06020e0c8f094859cb4a558b361f6b8eb5@
>>>
>>> Kenn
>>>
>>> On Mon, Aug 12, 2019 at 4:25 PM Rui Wang  wrote:
>>>
 Thanks Kenneth.

 I will start a vote for Beam ZetaSQL contribution.

 -Rui

 On Mon, Aug 12, 2019 at 4:11 PM Kenneth Knowles 
 wrote:

> Nice explanations of the reasoning. I think two things will stay
> approximately the same even as the ecosystem develops: (1) ZetaSQL has
> pretty clear semantics so we will have a compliant parser, whether it is
> the official one or another like Calcite Babel, and (2) we will need a way
> to implement all the standard ZetaSQL functions and this will be the same
> no matter the frontend.
>
> For a contribution this large where i.p. clearance is necessary, a
> vote is appropriate. It can happen at the same time or even after i.p.
> clearance.
>
> Kenn
>
> On Wed, Aug 7, 2019 at 1:08 PM Mingmin Xu  wrote:
>
>> Thanks to highlight the parts of types/operators/functions/..., that
>> does make things more complicated. +1 that as a short/middle term 
>> solution,
>> the proposal is reasonable. We could follow up in future to handle it in
>> Calcite Babel if possible.
>>
>> Mingmin
>>
>> On Tue, Aug 6, 2019 at 3:57 PM Rui Wang  wrote:
>>
>>> Hi Mingmin,
>>>
>>> Honestly I don't have an answer to it: a SQL dialect is complicated
>>> and I don't have enough understanding on Calcite (Calcite has a big 
>>> repo).
>>> Based on my read from CALCITE-2280
>>> , the closer to
>>> standard sql that a dialect is, the less blockers that we will have to
>>> support this dialect in Calcite babel parser.
>>>
>>> However, this is a good question, which raises a good aspect that I
>>> found people usually ignore: supporting a SQL dialect is not only 
>>> support a
>>> type of syntax. It also includes data types, built-in sql functions,
>>> operators and many other stuff.
>>>
>>> I especially found the following incompatibilities between Calcite
>>> and ZetaSQL during the development:
>>> 1. Calcite does not support Struct/Row type well because Calcite
>>> flattens Rows when reading from tables by adding an extra Projection on 
>>> top
>>> of tables.
>>> 2. I had trouble in supporting DATETIME(or timestamp without
>>> time zone) type.
>>> 3. Huge incompatibilities on SQL functions. E.g. return type is
>>> different for AVG(long), and many many more.
>>> 4. I am not sure if Calcite has the same set of type casting rules
>>> as BigQuery(my impression is there are differences).
>>>
>>>
>>> I would say in the short/mid term, it's much easier to use logical
>>> plan as IR to implement another SQL dialect for BeamSQL (Linkedin has
>>> similar practice, see their blog post
>>> 
>>> ).
>>>
>>> For the longer term, it would be interesting to see how we can add
>>> BigQuery syntax (plus its data types and sql functions) to Calcite babel
>>> parser.
>>>
>>>
>>>
>>> -Rui
>>>
>>>
>>> On Tue, Aug 6, 2019 at 2:49 PM Mingmin Xu 
>>> wrote:
>>>
 Just take a look at
 https://issues.apache.org/jira/browse/CALCITE-2280 which
 introduced Babel parser in Calcite to support varied dialects, this 
 may be
 an easier way to support BigQuery syntax. @Rui do you notice any big
 difference between Calcite engine and ZetaSQL, like parsing, 
 optimization?
 If that's the case, it make sense to build the alternative switch in 
 Beam
 side.

 On Sun, Aug 4, 2019 at 4:47 PM Rui Wang  wrote:

> Mingmin - it sounds like an awesome idea to translate from
> SparkSQL. It's even more exciting to know if we could translate Spark
> Structured Streaming code by a similar way, which enables existing 
> Spark
> SQL/Structure Streaming pipelines run on Beam.
>
> Reuven - Thanks for bringing it up. I tried to search dev@calcite
> and only found[1]. From that thread, I see that adding ZetaSQL to 
> Calcite
> itself is still a discussion. I am also looking for if anyone

Re: [DISCUSS] Making consistent use of Optionals

2019-08-22 Thread Ismaël Mejía
Thanks Jan for bringing this subject to the mailing list.

I tend to lean towards the java Optional. Optionals are mostly useful
as return types and since return types usually end up in the API. I am
afraid we end up leaking Guava's Optional as part of the public API.
We can define however a rule to not allow Optional in the public APIs
(and enforce it via checkstyle) and in that case it will be ok to use
either of them.

Ismaël



On Thu, Aug 22, 2019 at 1:29 AM Kenneth Knowles  wrote:
>
> As mentioned on PR, I'm not convinced by Flink's discussion and all evidence 
> I know has shown it to have non-measurable performance impact.
>
> I'm OK either way, though, at this point.
>
>  - Whatever the consensus, let us set up checkstyle/analysis so that we 
> maintain compatible across the codebase.
>  - We should see what we can do to enforce not using non-serializable 
> Optional in fields.
>  - In special cases where you really do want an Optional stored or 
> transmitted we could make our own SerializableOptional to still keep it 
> compatible (example: the common Map> representing the three 
> states of --foo=v --foo=null and no foo specified)
>  - Always using null & @Nullable for possible-empty fields will make the 
> codebase poorer overall, and NPEs are still a user-facing problem that hits 
> us pretty often. We should redouble our efforts to have a correct and fully 
> strict analysis of these.
>
> Kenn
>
> On Wed, Aug 21, 2019 at 1:09 PM Jan Lukavský  wrote:
>>
>> Sorry, forgot to add link to the Flink discussion [1].
>>
>> [1]
>> https://lists.apache.org/thread.html/f5f8ce92f94c9be6774340fbd7ae5e4afe07386b6765ad3cfb13aec0@%3Cdev.flink.apache.org%3E
>>
>> On 8/21/19 10:08 PM, Jan Lukavský wrote:
>> > Hi,
>> >
>> > sorry if this discussion have been already taken, but I'd like to know
>> > others opinions about how we use Optionals. The state in current
>> > master is as follows:
>> >
>> > $ git grep "import" | grep "java.util.Optional" | wc -l
>> > 85
>> > $ git grep "import" | grep "Optional" | grep guava | wc -l
>> > 45
>> >
>> > I'd like to propose that we use only one Optional, for consistency.
>> > There are arguments towards using each one of them, if I try to sum
>> > these up:
>> >
>> > Pros for java.util:
>> >
>> >  * Part of standard lib
>> >
>> >  * Will (in the future) probably better integrate with other standard
>> > APIs (e.g. Optional.stream in JDK 9, but probably more to come)
>> >
>> > Pros for guava:
>> >
>> >  * Guava's Optional is Serializable
>> >
>> > There was recently a discussion on Flink's mailing list [1], which
>> > arrived at a conclusion, that using Optional as a field should be
>> > discouraged (in favor of @Nullable). That would imply that
>> > non-serializability of java.util.Optional is not an issue. But maybe
>> > we can arrive at a different conclusion.
>> >
>> > WDYT?
>> >
>> >  Jan
>> >