FirestoreIO connector [JavaSDK]

2019-11-04 Thread Stefan Djelekar
Hi beam devs,

I'm Stefan from Redbox. We are a customer of GCP and we are in need of a Beam 
Java connector for Firestore.

There is a pending JIRA item for this. 
https://issues.apache.org/jira/browse/BEAM-8376

The team inside the company has been working on this for a while, and we 
would like to contribute!
What is the best way to do this? Can we perhaps get a review from an 
experienced Beam member?
Let me know what you think.

All the best,

Stefan Đelekar
Software Engineer
stefan.djele...@redbox.com
djelekar.com



Re: FirestoreIO connector [JavaSDK]

2019-11-04 Thread Ismaël Mejía
Hello,

Please open a PR with the code of the new connector rebased against
the latest master.

It is worth taking a look at Beam's contribution guide [1] and the
PTransform style guide [2] in advance.
In any case, we can guide you on any issue during the PR review, so
please go ahead.

Regards,
Ismaël

[1] https://beam.apache.org/contribute/
[2] https://beam.apache.org/contribute/ptransform-style-guide/


On Mon, Nov 4, 2019 at 10:48 AM Stefan Djelekar
 wrote:
>
> Hi beam devs,
>
>
>
> I'm Stefan from Redbox. We are a customer of GCP and we are in need of beam 
> Java connector for Firestore.
>
>
>
> There is a pending JIRA item for this. 
> https://issues.apache.org/jira/browse/BEAM-8376
>
>
>
> The team inside of the company has been working on this for a while and we 
> would like to contribute!
>
> What is the best way to do this? Can we perhaps get a review from an 
> experienced beam member?
>
> Let me know what do you think.
>
>
>
> All the best,
>
>
>
> Stefan Đelekar
>
> Sofware Engineer
>
> stefan.djele...@redbox.com
>
> djelekar.com
>
>


Re: Proposal: @RequiresTimeSortedInput

2019-11-04 Thread Jan Lukavský

Hi,

there has been some development around this [1], which essentially 
concludes that currently this feature can be safely supported only by the 
direct runner, the Flink runner (both batch and streaming, non-portable 
only), and Spark (batch, legacy only). This is because time sorting relies 
heavily on timers being strictly ordered. Failing to ensure that might 
result in unpredictable data loss, because window-cleanup of state can 
occur before all elements have been emitted (note that this generally 
might happen even in current user pipelines!). I can link issues [2], [3] 
and [4] to [5], but the question is: with only so few runners able to 
support this, what is the best way to incorporate it into an upcoming 
release (assuming the proposal passes a vote, which is not known yet)? 
I'd say the best way would be for the affected runners to fail to execute 
such pipelines until the respective issues are resolved. Another option 
would be to block this feature until the issues are resolved in the 
runners, but that might delay its availability for an unknown time.


Thanks for any opinions,

Jan

[1] 
https://lists.apache.org/thread.html/71a8f48ca518f1f2e6e9b1284114624670884775d209b0097f68264b@%3Cdev.beam.apache.org%3E


[2] https://issues.apache.org/jira/browse/BEAM-8459

[3] https://issues.apache.org/jira/browse/BEAM-8460

[4] https://issues.apache.org/jira/browse/BEAM-8543

[5] https://issues.apache.org/jira/browse/BEAM-8550
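
(For readers new to the thread, here is a usage sketch of the proposed 
annotation on a stateful DoFn. The annotation name and its placement follow 
the design doc referenced below; the final API may differ.)

    // Sketch: with time-sorted input, elements for a key arrive ordered by
    // event timestamp, which makes transition-style logic straightforward.
    class TransitionsFn extends DoFn<KV<String, Long>, String> {
      @StateId("previous")
      private final StateSpec<ValueState<Long>> previousSpec = StateSpecs.value();

      @RequiresTimeSortedInput  // proposed annotation; placement per the design doc
      @ProcessElement
      public void process(ProcessContext ctx,
          @StateId("previous") ValueState<Long> previous) {
        Long prev = previous.read();
        previous.write(ctx.element().getValue());
        if (prev != null) {
          // "prev" is the value of the element with the previous timestamp
          // for this key, thanks to the time-sorted delivery.
          ctx.output(prev + " -> " + ctx.element().getValue());
        }
      }
    }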

On 10/31/19 2:59 PM, Jan Lukavský wrote:

Hi,

as a follow-up from previous design draft, I'd like to promote the 
document [1] and associated PR [2] to proposal.


The PR contains working implementation for:

 - non-portable batch flink and batch spark (legacy)

 - all non-portable streaming runners that use StatefulDoFnRunner 
(direct, samza, dataflow)


 - portable flink (batch, streaming)

There are still some unresolved issues:

 a) no way to specify allowed lateness (currently it is simply zero, so late 
data is dropped)


 b) need a way to specify a user UDF for extracting the timestamp (according 
to [3] it would be useful to have that option)


 c) need to add more tests (e.g. late data)

The plan is to postpone resolution of issues a) and b) until after the 
proposal is merged. I'd like to gather some more feedback on the 
proposal, iterate over that again, add more tests, and then pass this 
to a vote.


Unrelated - during implementation a bug [4] in Samza runner was found.

Looking forward to any comments!

Jan

[1] 
https://docs.google.com/document/d/1ObLVUFsf1NcG8ZuIZE4aVy2RYKx2FfyMhkZYWPnI9-c/ 



[2] https://github.com/apache/beam/pull/8774

[3] 
https://lists.apache.org/thread.html/813429e78b895a336d4f5507e3b2330282e2904fa25d52d6d441741a@%3Cdev.beam.apache.org%3E


[4] https://issues.apache.org/jira/browse/BEAM-8529


On 5/23/19 4:10 PM, Jan Lukavský wrote:

Hi,

I have written a very brief draft of how it might be possible to 
implement @RequiresTimeSortedInput, discussed in [1]. I see the 
document [2] as a starting point for a discussion. There are several 
open questions, which I believe can be resolved by this great 
community. :-)


Jan

[1] 
https://lists.apache.org/thread.html/4609a1bb1662690d67950e76d2f1108b51327b8feaf9580de659552e@%3Cdev.beam.apache.org%3E


[2] 
https://docs.google.com/document/d/1ObLVUFsf1NcG8ZuIZE4aVy2RYKx2FfyMhkZYWPnI9-c/




Beam Dependency Check Report (2019-11-04)

2019-11-04 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:

  Dependency Name | Current Version | Latest Version | Release Date Of the Current Used Version | Release Date Of The Latest Release | JIRA Issue
  mock | 2.0.0 | 3.0.5 | 2019-05-20 | 2019-05-20 | BEAM-7369
  oauth2client | 3.0.0 | 4.1.3 | 2018-12-10 | 2018-12-10 | BEAM-6089
  Sphinx | 1.8.5 | 2.2.1 | 2019-05-20 | 2019-10-28 | BEAM-7370

High Priority Dependency Updates Of Beam Java SDK:

  Dependency Name | Current Version | Latest Version | Release Date Of the Current Used Version | Release Date Of The Latest Release | JIRA Issue
  com.github.ben-manes.versions:com.github.ben-manes.versions.gradle.plugin | 0.20.0 | 0.27.0 | 2019-02-11 | 2019-10-21 | BEAM-6645
  com.github.spotbugs:spotbugs | 3.1.12 | 4.0.0-beta4 | 2019-03-01 | 2019-09-18 | BEAM-7792
  com.github.spotbugs:spotbugs-annotations | 3.1.12 | 4.0.0-beta4 | 2019-03-01 | 2019-09-18 | BEAM-6951
  javax.servlet:javax.servlet-api | 3.1.0 | 4.0.1 | 2013-04-25 | 2018-04-20 | BEAM-5750
  org.conscrypt:conscrypt-openjdk | 1.1.3 | 2.2.1 | 2018-06-04 | 2019-08-08 | BEAM-5748
  org.eclipse.jetty:jetty-server | 9.2.10.v20150310 | 10.0.0-alpha0 | 2015-03-10 | 2019-07-11 | BEAM-5752
  org.eclipse.jetty:jetty-servlet | 9.2.10.v20150310 | 10.0.0-alpha0 | 2015-03-10 | 2019-07-11 | BEAM-5753
  Gradle | 5.2.1 | 5.6.4 | 2019-08-19 | 2019-11-04 | BEAM-8002

 A dependency update is high priority if it satisfies one of the following criteria: 

 - It has a major version update available, e.g. org.assertj:assertj-core 2.5.0 -> 3.10.0; 
 - It is over 3 minor versions behind the latest version, e.g. org.tukaani:xz 1.5 -> 1.8; 
 - The current version is more than 180 days behind the latest version, e.g. com.google.auto.service:auto-service 2014-10-24 -> 2017-12-11. 

 In Beam, we make a best-effort attempt at keeping all dependencies up-to-date.
 In the future, issues will be filed and tracked for these automatically,
 but in the meantime you can search for existing issues or open a new one.

 For more information:  Beam Dependency Guide  

Re: [spark structured streaming runner] merge to master?

2019-11-04 Thread Alexey Romanenko
Like Kenn, I'm strongly against 3 jars, but I don't see a big difference 
between 1 or 2 jars, provided the 2 jars don't contain the same classes or 
cross-dependencies between them. I don't have strong arguments "for" or 
"against" either option but, as I mentioned before, achieving this with one 
jar would perhaps be easier and less error-prone.
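
(For readers following along: the "switch runners with a flag" idea discussed 
in the quoted thread below refers to the standard pipeline-options mechanism. 
A rough sketch, assuming both runner classes end up on the classpath of the 
user's uber jar; the structured-streaming runner class name follows the branch 
under discussion and may still change:)

    // Runner selection happens at submit time via pipeline options, so a single
    // uber jar containing both Spark runners lets users pick one with a flag, e.g.
    //   --runner=SparkRunner                      (classic RDD-based runner)
    //   --runner=SparkStructuredStreamingRunner   (new runner; name per the branch)
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline as usual ...
    pipeline.run();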

> On 31 Oct 2019, at 02:06, Kenneth Knowles  wrote:
> 
> Very good points. We definitely ship a lot of code/features in very early 
> stages, and there seems to be no problem.
> 
> I intend mostly to leave this judgment to people like you who know better 
> about Spark users.
> 
> But I do think 1 or 2 jars is better than 3. I really don't like "3 jars" and 
> I did give two reasons:
> 
> 1. diamond deps where things overlap
> 2. figuring out which thing to depend on
> 
> Both are annoying for users. I am not certain if it could lead to a real 
> unsolvable situation. This is just a Java ecosystem problem so I feel 
> qualified to comment.
> 
> I did also ask if there were major dependency differences between the two 
> that could cause problems for users. This question was dropped and no one 
> cares to comment so I assume it is not an issue. So then I favor having just 
> 1 jar with both runners.
> 
> Kenn
> 
> On Wed, Oct 30, 2019 at 2:46 PM Ismaël Mejía  > wrote:
> I am still a bit lost about why we are discussing options without giving any
> arguments or reasons for the options? Why is 2 modules better than 3 or 3 
> better
> than 2, or even better, what forces us to have something different than a 
> single
> module?
> 
> What are the reasons for wanting to have separate jars? If the issue is that 
> the
> code is unfinished or not passing the tests, the impact for end users is 
> minimal
> because they cannot accidentally end up running the new runner, and if they
> decide to do so we can warn them it is at their own risk and not ready for
> production in the documentation + runner.
> 
> If the fear is that new code may end up being intertwined with the classic and
> portable runners and have some side effects. We have the ValidatesRunner +
> Nexmark in the CI to cover this so again I do not see what is the problem that
> requires modules to be separate.
> 
> If the issue is being uncomfortable about having in-progress code in released
> artifacts we have been doing this in Beam forever, for example most of the 
> work
> on portability and Schema/SQL, and all of those were still part of artifacts
> long time before they were ready for prime use, so I still don't see why this
> case is different to require different artifacts.
> 
> I have the impression we are trying to solve a non-issue by adding a lot of
> artificial complexity (in particular to the users), or am I missing something
> else?
> 
> On Wed, Oct 30, 2019 at 7:40 PM Kenneth Knowles  > wrote:
> >
> > Oh, I mean that we ship just 2 jars.
> >
> > And since Spark users always build an uber jar, they can still depend on 
> > both of ours and be able to switch runners with a flag.
> >
> > I really dislike projects shipping overlapping jars. It is confusing and 
> > causes major diamond dependency problems.
> >
> > Kenn
> >
> > On Wed, Oct 30, 2019 at 11:12 AM Alexey Romanenko  > > wrote:
> >>
> >> Yes, agree, two jars included in uber jar will work in the similar way. 
> >> Though having 3 jars looks still quite confusing for me.
> >>
> >> On 29 Oct 2019, at 23:54, Kenneth Knowles  >> > wrote:
> >>
> >> Is it just as easy to have two jars and build an uber jar with both 
> >> included? Then the runner can still be toggled with a flag.
> >>
> >> Kenn
> >>
> >> On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko  >> > wrote:
> >>>
> >>> Hmm, I don’t think that jar size should play a big role comparing to the 
> >>> whole size of shaded jar of users job. Even more, I think it will be 
> >>> quite confusing for users to choose which jar to use if we will have 3 
> >>> different ones for similar purposes. Though, let’s see what others think.
> >>>
> >>> On 29 Oct 2019, at 15:32, Etienne Chauchot  >>> > wrote:
> >>>
> >>> Hi Alexey,
> >>>
> >>> Thanks for your opinion !
> >>>
> >>> Comments inline
> >>>
> >>> Etienne
> >>>
> >>> On 28/10/2019 17:34, Alexey Romanenko wrote:
> >>>
> >>> Let me share some of my thoughts on this.
> >>>
> >>> - shall we filter out the package name from the release?
> >>>
> >>> Until new runner is not ready to be used in production (or, at least, be 
> >>> used for beta testing but users should be clearly warned about that in 
> >>> this case), I believe we need to filter out its classes from published 
> >>> jar to avoid a confusion.
> >>>
> >>> Yes that is what I think also
> >>>
> >>> - should we release 2 jars: one for the old and one for the new ?
> >>>
> >>> - should we release 3 jars: one for th

Re: The state of external transforms in Beam

2019-11-04 Thread Chamikara Jayalath
Makes sense.

I can look into expanding on what we have at the following location and adding
links to some of the existing work as a first step.
https://beam.apache.org/roadmap/connectors-multi-sdk/

Created https://issues.apache.org/jira/browse/BEAM-8553

We also need more detailed documentation for cross-language transforms but
that can be separate (and hopefully with help from tech writers who have
been helping with Beam documentation in general).

Thanks,
Cham


On Sun, Nov 3, 2019 at 7:16 PM Thomas Weise  wrote:

> This thread was very helpful to find more detail in
> https://jira.apache.org/jira/browse/BEAM-7870
>
> It would be great to have cross-language current state mentioned as top
> level entry on https://beam.apache.org/roadmap/
>
>
> On Mon, Sep 16, 2019 at 6:07 PM Chamikara Jayalath 
> wrote:
>
>> Thanks for the nice write up Chad.
>>
>> On Mon, Sep 16, 2019 at 12:17 PM Robert Bradshaw 
>> wrote:
>>
>>> Thanks for bringing this up again. My thoughts on the open questions
>>> below.
>>>
>>> On Mon, Sep 16, 2019 at 11:51 AM Chad Dombrova 
>>> wrote:
>>> > That commit solves 2 problems:
>>> >
>>> > Adds the pubsub Java deps so that they’re available in our portable
>>> pipeline
>>> > Makes the coder for the PubsubIO message-holder type, PubsubMessage,
>>> available as a standard coder. This is required because both PubsubIO.Read
>>> and PubsubIO.Write expand to ParDos which pass along these PubsubMessage
>>> objects, but only “standard” (i.e. portable) coders can be used, so we have
>>> to hack it to make PubsubMessage appear as a standard coder.
>>> >
>>> > More details:
>>> >
>>> > There’s a similar magic commit required for Kafka external transforms
>>> > The Jira issue for this problem is here:
>>> https://jira.apache.org/jira/browse/BEAM-7870
>>> > For problem #2 above there seems to be some consensus forming around
>>> using Avro or schema/row coders to send compound types in a portable way.
>>> Here’s the PR for making row coders portable
>>> > https://github.com/apache/beam/pull/9188
>>>
>>> +1. Note that this doesn't mean that the IO itself must produce rows;
>>> part of the Schema work in Java is to make it easy to automatically
>>> convert from various Java classes to schemas transparently, so this
>>> same logic that would allow one to apply an SQL filter directly to a
>>> Kafka/PubSub read would allow cross-language. Even if that doesn't
>>> work, we need not uglify the Java API; we can have an
>>> option/alternative transform that appends the convert-to-Row DoFn for
>>> easier use by external (though the goal of the former work is to make
>>> this step unnecessary).
>>>
>>
>> Updating all IO connectors / transforms to have a version that
>> produces/consumes a PCollection<Row> is infeasible, so I agree that we need
>> an automatic conversion to/from PCollection<Row>, possibly by injecting
>> PTransforms during ExternalTransform expansion.
>>
>>>
>>> > I don’t really have any ideas for problem #1
>>>
>>> The crux of the issue here is that the jobs API was not designed with
>>> cross-language in mind, and so the artifact API ties artifacts to jobs
>>> rather than to environments. To solve this we need to augment the
>>> notion of environment to allow the specification of additional
>>> dependencies (e.g. jar files in this specific case, or better as
>>> maven/pypi/... dependencies (with version ranges) such that
>>> environment merging and dependency resolution can be sanely done), and
>>> a way for the expansion service to provide such dependencies.
>>>
>>> Max wrote up a summary of the prior discussions at
>>>
>>> https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit#heading=h.900gc947qrw8
>>>
>>> In the short term, one can build a custom docker image that has all
>>> the requisite dependencies installed.
>>>
>>> This touches on a related but separable issue that one may want to run
>>> some of these transforms "natively" in the same process as the runner
>>> (e.g. a Java IO in the Flink Java Runner) rather than via docker.
>>> (Similarly with subprocess.) Exactly how that works with environment
>>> specifications is also a bit TBD, but my proposal has been that these
>>> are best viewed as runner-specific substitutions of standard
>>> environments.
>>>
>>
>> We need a permanent solution for this but for now we have a temporary
>> solution where additional jar files can be specified through an experiment
>> when running a Python pipeline:
>> https://github.com/apache/beam/blob/9678149872de2799ea1643f834f2bec88d346af8/sdks/python/apache_beam/io/external/xlang_parquetio_test.py#L55
>>
>> Thanks,
>> Cham
>>
>>
>>>
>>> > So the portability expansion system works, and now it’s time to sand
>>> off some of the rough corners. I’d love to hear others’ thoughts on how to
>>> resolve some of these remaining issues.
>>>
>>> +1
>>>
>>>
>>> On Mon, Sep 16, 2019 at 11:51 AM Chad Dombrova 
>>> wrote:
>>> >
>>> > Hi all,
>>> > There was some interest in this topic at the Beam Summi

Re: Python Beam pipelines on Flink on Kubernetes

2019-11-04 Thread Chad Dombrova
Done:  https://github.com/lyft/flinkk8soperator/issues/123



On Fri, Nov 1, 2019 at 5:08 PM Thomas Weise  wrote:

> That's a good idea. Probably best to add an example in:
> https://github.com/lyft/flinkk8soperator
>
> Do you want to add an issue?
>
> (It will have to wait for 2.18 release though.)
>
>
> On Fri, Nov 1, 2019 at 11:37 AM Chad Dombrova  wrote:
>
>> Hi Thomas,
>> Do you have an example Dockerfile demonstrating best practices for
>> building an image that contains both Flink and Beam SDK dependencies?  That
>> would be useful.
>>
>> -chad
>>
>>
>> On Fri, Nov 1, 2019 at 10:18 AM Thomas Weise  wrote:
>>
>>> For folks looking to run Beam on Flink on k8s, see update in [1]
>>>
>>> I also updated [2]
>>>
>>> TLDR: at this time best option to run portable pipelines on k8s is to
>>> create container images that have both Flink and the SDK dependencies.
>>>
>>> I'm curious how much interest there is to use the official SDK container
>>> images and keep Flink and portable pipeline separate as far as the image
>>> build goes? Deployment can be achieved with the sidecar container approach.
>>> Most of mechanics are in place already, one addition would be an
>>> abstraction for where the SDK entry point executes (process, external),
>>> similar to how we have it for the workers.
>>>
>>> Thanks,
>>> Thomas
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/10aada9c200d4300059d02e4baa47dda0cc2c6fc9432f950194a7f5e@%3Cdev.beam.apache.org%3E
>>>
>>> [2]
>>> https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI
>>>
>>> On Wed, Aug 21, 2019 at 9:44 PM Thomas Weise  wrote:
>>>
 The changes to containerize the Python SDK worker pool are nearly
 complete. I also updated the document for next implementation steps.

 The favored approach (initially targeted option) for pipeline
 submission is support for the (externally created) fat jar. It will keep
 changes to the operator to a minimum and is applicable for any other Flink
 job as well.

 Regarding fine grained resource scheduling, that would happen within
 the pods scheduled by the k8s operator (or other external orchestration
 tool) or, further down the road, in a completely elastic/dynamic fashion
 with active execution mode (where Flink would request resources directly
 from k8s, similar to how it would work on YARN).


 On Tue, Aug 13, 2019 at 10:59 AM Chad Dombrova 
 wrote:

> Hi Thomas,
> Nice work!  It's really clearly presented.
>
> What's the current favored approach for pipeline submission?
>
> I'm also interested to know how this plan overlaps (if at all) with
> the work on Fine-Grained Resource Scheduling [1][2] that's being done for
> Flink 1.9+, which has implications for creation of task managers in
> kubernetes.
>
> [1]
> https://docs.google.com/document/d/1h68XOG-EyOFfcomd2N7usHK1X429pJSMiwZwAXCwx1k/edit#heading=h.72szmms7ufza
> [2] https://issues.apache.org/jira/browse/FLINK-12761
>
> -chad
>
>
> On Tue, Aug 13, 2019 at 9:14 AM Thomas Weise  wrote:
>
>> There have been a few comments in the doc and I'm going to start
>> working on the worker execution part.
>>
>> Instead of running a Docker container for each worker, the preferred
>> option is to use the worker pool:
>>
>>
>> https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.bzs6y6ms0898
>>
>> There are some notes at the end of the doc regarding the
>> implementation. It would be great if those interested in the environments
>> and Python docker container could take a look. In a nutshell, the 
>> proposal
>> is to make few changes to the Python container so that it (optionally) 
>> can
>> be used to run the worker pool.
>>
>> Thanks,
>> Thomas
>>
>>
>> On Wed, Jul 24, 2019 at 9:42 PM Pablo Estrada 
>> wrote:
>>
>>> I am very happy to see this. I'll take a look, and leave my comments.
>>>
>>> I think this is something we'd been needing, and it's great that you
>>> guys are putting thought into it. Thanks!<3
>>>
>>> On Wed, Jul 24, 2019 at 9:01 PM Thomas Weise  wrote:
>>>
 Hi,

 Recently Lyft open sourced *FlinkK8sOperator,* a Kubernetes
 operator to manage Flink deployments on Kubernetes:

 https://github.com/lyft/flinkk8soperator/

 We are now discussing how to extend this operator to also support
 deployment of Python Beam pipelines with the Flink runner. I would 
 like to
 share the proposal with the Beam community to enlist feedback as well 
 as
 explore opportunities for collaboration:


 https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/

 Looking forward to your comments and sugge

Embedding expansion service for cross language in the runner

2019-11-04 Thread Hai Lu
Hi,

We're looking into leveraging the cross-language pipeline feature in our
Beam pipelines on the Samza runner. While the feature seems to work well, the
PTransform expansion as a standalone service isn't very convenient,
particularly because the Python pipeline needs to specify the address of the
expansion service.

I'm wondering why we couldn't embed the expansion service into the runner
itself. I understand the cross-language feature wants to be runner
independent, but does it make sense to at least provide the option of letting
the runner use the expansion service as a library and make it transparent to
the portable pipeline?

Thanks,
Hai


Query Beam internal state

2019-11-04 Thread Dengpan Y
Hi, I have recently seen several Beam users on the Samza runner asking for a
tool/feature to query the internal state of a PTransform, especially a
DoFn. For instance, a user can use @StateId to define named state,
and they would like to query the content of that state for a specific
key.

Does Beam have any plan to support this feature from the Beam SDK, such as
developing a PTransform that can be wired into the pipeline so that the user
can query the state on demand?
Or, if this has to be done on the runner side, is there any best
practice/idea that we can follow?
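
(For concreteness, the kind of state in question uses the standard Beam Java
state API; a minimal sketch:)

    // A stateful DoFn with named state declared via @StateId. The question
    // above is how the per-key value held in "count" could be queried on
    // demand from outside the pipeline.
    class CountFn extends DoFn<KV<String, Long>, Void> {
      @StateId("count")
      private final StateSpec<ValueState<Long>> countSpec =
          StateSpecs.value(VarLongCoder.of());

      @ProcessElement
      public void process(ProcessContext ctx,
          @StateId("count") ValueState<Long> count) {
        Long current = count.read();
        count.write((current == null ? 0L : current) + 1);
      }
    }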

Thanks
Dengpan


Re: Embedding expansion service for cross language in the runner

2019-11-04 Thread Chamikara Jayalath
On Mon, Nov 4, 2019 at 11:01 AM Hai Lu  wrote:

> Hi,
>
> We're looking into leveraging the cross language pipeline feature in our
> Beam pipelines on Samza runner. While the feature seems to work well, the
> PTransform expansion as a standalone service isn't very convenient.
> Particularly that the Python pipeline needs to specify the address of the
> expansion service.
>
> I'm wondering why we couldn't embed the expansion service into runner
> itself. I understand the cross language feature wants to be runner
> independent, but does it make sense to at least provide the option to allow
> runner to use the expansion service as a library and make it transparent to
> the portable pipeline?
>

Beam composite transforms are expanded before defining the portable job
definition (and before submitting the job to the runner). So naturally
this is something that has to be done on the Beam side. As an added
benefit, as you identified, this allows us to keep this logic runner
independent.
I think there were discussions regarding automatically starting up a local
expansion service if one is not specified. Will this address your concerns?

Thanks,
Cham


>
> Thanks,
> Hai
>


Re: Embedding expansion service for cross language in the runner

2019-11-04 Thread Robert Bradshaw
On Mon, Nov 4, 2019 at 11:54 AM Chamikara Jayalath  wrote:
>
> On Mon, Nov 4, 2019 at 11:01 AM Hai Lu  wrote:
>>
>> Hi,
>>
>> We're looking into leveraging the cross language pipeline feature in our 
>> Beam pipelines on Samza runner. While the feature seems to work well, the 
>> PTransform expansion as a standalone service isn't very convenient. 
>> Particularly that the Python pipeline needs to specify the address of the 
>> expansion service.
>>
>> I'm wondering why we couldn't embed the expansion service into runner 
>> itself. I understand the cross language feature wants to be runner 
>> independent, but does it make sense to at least provide the option to allow 
>> runner to use the expansion service as a library and make it transparent to 
>> the portable pipeline?
>
>
> Beam composite transforms are expanded before defining the portable job 
> definition (and before submitting the jobs to the runner). So naturally this 
> is something that has to be done in the Beam side. As an added benefit, as 
> you identified, this allows us to keep this logic runner independent.
> I think there were discussions regarding automatically starting up a local 
> expansion service if one is not specified. Will this address your concerns ?

Just to add to this: if you have a pipeline A -> B -> C, the expansion
of B often needs to be evaluated before C can be applied (e.g. we're
planning on exposing the SQL transforms cross-language, and many
cross-language IOs can query and supply their own schemas for
downstream type checking), so one cannot construct the "whole"
pipeline, pass it to the runner, and let the runner do the expansion.


Re: Python Precommit duration pushing 2 hours

2019-11-04 Thread Ahmet Altay
Python precommits are still timing out on #9925. I am guessing that means
this change would not be enough.

I am proposing cutting down the number of test variants we run in
precommits. Currently, for each version, we run the following variants
serially:
- base: Runs all unit tests with tox.
- Cython: Installs Cython and runs all unit tests as in the base version. The
original purpose was to ensure that tests pass with or without Cython.
There is probably a huge overlap with base. (IIRC only a few coders have
different slow vs fast tests.)
- GCP: Installs GCP dependencies and runs all base + additional GCP-specific
tests. The original purpose was to ensure that GCP is an optional
component and all non-GCP tests still work without GCP components.

We can reduce the list to Cython + GCP tests only. This will cover the same
group of tests and will check that tests pass with or without Cython or GCP
dependencies. This could reduce the precommit time by ~30 minutes.

What do you think?

Ahmet


On Tue, Oct 29, 2019 at 11:15 AM Robert Bradshaw 
wrote:

> https://github.com/apache/beam/pull/9925
>
> On Tue, Oct 29, 2019 at 10:24 AM Udi Meiri  wrote:
> >
> > I don't have the bandwidth right now to tackle this. Feel free to take
> it.
> >
> > On Tue, Oct 29, 2019 at 10:16 AM Robert Bradshaw 
> wrote:
> >>
> >> The Python SDK does as well. These calls are coming from
> >> to_runner_api, is_stateful_dofn, and validate_stateful_dofn which are
> >> invoked once per pipene or bundle. They are, however, surprisingly
> >> expensive. Even memoizing across those three calls should save a
> >> significant amount of time. Udi, did you want to tackle this?
> >>
> >> Looking at the profile, Pipeline.to_runner_api() is being called 30
> >> times in this test, and [Applied]PTransform.to_fn_api being called
> >> 3111 times, so that in itself might be interesting to investigate.
> >>
> >> On Tue, Oct 29, 2019 at 8:26 AM Robert Burke 
> wrote:
> >> >
> >> > As does the Go SDK. Invokers are memoized and when possible code is
> generated to avoid reflection.
> >> >
> >> > On Tue, Oct 29, 2019, 6:46 AM Kenneth Knowles  wrote:
> >> >>
> >> >> Noting for the benefit of the thread archive in case someone goes
> digging and wonders if this affects other SDKs: the Java SDK memoizes
> DoFnSignatures and generated DoFnInvoker classes.
> >> >>
> >> >> Kenn
> >> >>
> >> >> On Mon, Oct 28, 2019 at 6:59 PM Udi Meiri  wrote:
> >> >>>
> >> >>> Re: #9283 slowing down tests, ideas for slowness:
> >> >>> 1. I added a lot of test cases, some with locally run pipelines.
> >> >>> 2. The PR somehow changed how coders are selected, and now we're
> using less efficient ones.
> >> >>> 3. New dependency funcsigs is slowing things down? (py2 only)
> >> >>>
> >> >>> I ran "pytest -k PipelineAnalyzerTest --profile-svg" on 2.7 and 3.7
> and got these cool graphs (attached).
> >> >>> 2.7: core:294:get_function_arguments takes 56.66% of CPU time
> (IIUC), gets called ~230k times
> >> >>> 3.7: core:294:get_function_arguments 30.88%, gets called ~200k times
> >> >>>
> >> >>> After memoization of get_function_args_defaults:
> >> >>> 2.7: core:294:get_function_arguments 20.02%
> >> >>> 3.7: core:294:get_function_arguments 8.11%
> >> >>>
> >> >>>
> >> >>> On Mon, Oct 28, 2019 at 5:38 PM Pablo Estrada 
> wrote:
> >> 
> >>  *not deciles, but 9-percentiles : )
> >> 
> >>  On Mon, Oct 28, 2019 at 5:31 PM Pablo Estrada 
> wrote:
> >> >
> >> > I've ran the tests in Python 2 (without cython), and used a
> utility to track runtime for each test method. I found some of the
> following things:
> >> > - Total test methods run: 2665
> >> > - Total test runtime: 990 seconds
> >> > - Deciles of time spent:
> >> >   - 1949 tests run in the first 9% of time
> >> >   - 173 in the 9-18% rang3e
> >> >   - 130 in the 18-27% range
> >> >   - 95 in the 27-36% range
> >> >   - 77
> >> >   - 66
> >> >   - 55
> >> >   - 46
> >> >   - 37
> >> >   - 24
> >> >   - 13 tests run in the last 9% of time. This represents about 1
> minute and a half.
> >> >
> >> > We may be able to look at the slowest X tests, and get gradual
> improvements from there. Although it seems .. not dramatic ones : )
> >> >
> >> > FWIW I uploaded the results here:
> https://storage.googleapis.com/apache-beam-website-pull-requests/python-tests/nosetimes.json
> >> >
> >> > The slowest 13 tests were:
> >> >
> >> >
> [('apache_beam.runners.interactive.pipeline_analyzer_test.PipelineAnalyzerTest.test_basic',
> >> >   5.253582000732422),
> >> >
> ('apache_beam.runners.interactive.interactive_runner_test.InteractiveRunnerTest.test_wordcount',
> >> >   7.907713890075684),
> >> >
> ('apache_beam.io.gcp.bigquery_test.PipelineBasedStreamingInsertTest.test_failure_has_same_insert_ids',
> >> >   5.237942934036255),
> >> >
> ('apache_beam.transforms.combiners_test.CombineTest.test_global_sample',
> >> >   5.56

Re: Python Precommit duration pushing 2 hours

2019-11-04 Thread Robert Bradshaw
+1, this seems like a good step with a clear win.

On Mon, Nov 4, 2019 at 12:06 PM Ahmet Altay  wrote:
>
> Python precommits are still timing out on #9925. I am guessing that means 
> this change would not be enough.
>
> I am proposing cutting down the number of test variants we run in precommits. 
> Currently for each version we ran the following variants serially:
> - base: Runs all unit tests with tox
> - Cython: Installs cython and runs all unit tests as base version. The 
> original purpose was to ensure that tests pass with or without cython. There 
> is probably a huge overlap with base. (IIRC only a few coders have different 
> slow vs fast tests.)
> - GCP: Installs GCP dependencies and tests all base + additional gcp specific 
> tests. The original purpose was to ensure that GCP is an optional component 
> and all non-GCP tests still works without GCP components.
>
> We can reduce the list to cython + GCP tests only. This will cover the same 
> group of tests and will check that tests pass with or without cython or GCP 
> dependencies. This could reduce the precommit time by ~30 minutes.
>
> What do you think?
>
> Ahmet
>
>
> On Tue, Oct 29, 2019 at 11:15 AM Robert Bradshaw  wrote:
>>
>> https://github.com/apache/beam/pull/9925
>>
>> On Tue, Oct 29, 2019 at 10:24 AM Udi Meiri  wrote:
>> >
>> > I don't have the bandwidth right now to tackle this. Feel free to take it.
>> >
>> > On Tue, Oct 29, 2019 at 10:16 AM Robert Bradshaw  
>> > wrote:
>> >>
>> >> The Python SDK does as well. These calls are coming from
>> >> to_runner_api, is_stateful_dofn, and validate_stateful_dofn which are
>> >> invoked once per pipene or bundle. They are, however, surprisingly
>> >> expensive. Even memoizing across those three calls should save a
>> >> significant amount of time. Udi, did you want to tackle this?
>> >>
>> >> Looking at the profile, Pipeline.to_runner_api() is being called 30
>> >> times in this test, and [Applied]PTransform.to_fn_api being called
>> >> 3111 times, so that in itself might be interesting to investigate.
>> >>
>> >> On Tue, Oct 29, 2019 at 8:26 AM Robert Burke  wrote:
>> >> >
>> >> > As does the Go SDK. Invokers are memoized and when possible code is 
>> >> > generated to avoid reflection.
>> >> >
>> >> > On Tue, Oct 29, 2019, 6:46 AM Kenneth Knowles  wrote:
>> >> >>
>> >> >> Noting for the benefit of the thread archive in case someone goes 
>> >> >> digging and wonders if this affects other SDKs: the Java SDK memoizes 
>> >> >> DoFnSignatures and generated DoFnInvoker classes.
>> >> >>
>> >> >> Kenn
>> >> >>
>> >> >> On Mon, Oct 28, 2019 at 6:59 PM Udi Meiri  wrote:
>> >> >>>
>> >> >>> Re: #9283 slowing down tests, ideas for slowness:
>> >> >>> 1. I added a lot of test cases, some with locally run pipelines.
>> >> >>> 2. The PR somehow changed how coders are selected, and now we're 
>> >> >>> using less efficient ones.
>> >> >>> 3. New dependency funcsigs is slowing things down? (py2 only)
>> >> >>>
>> >> >>> I ran "pytest -k PipelineAnalyzerTest --profile-svg" on 2.7 and 3.7 
>> >> >>> and got these cool graphs (attached).
>> >> >>> 2.7: core:294:get_function_arguments takes 56.66% of CPU time (IIUC), 
>> >> >>> gets called ~230k times
>> >> >>> 3.7: core:294:get_function_arguments 30.88%, gets called ~200k times
>> >> >>>
>> >> >>> After memoization of get_function_args_defaults:
>> >> >>> 2.7: core:294:get_function_arguments 20.02%
>> >> >>> 3.7: core:294:get_function_arguments 8.11%
>> >> >>>
>> >> >>>
>> >> >>> On Mon, Oct 28, 2019 at 5:38 PM Pablo Estrada  
>> >> >>> wrote:
>> >> 
>> >>  *not deciles, but 9-percentiles : )
>> >> 
>> >>  On Mon, Oct 28, 2019 at 5:31 PM Pablo Estrada  
>> >>  wrote:
>> >> >
>> >> > I've ran the tests in Python 2 (without cython), and used a utility 
>> >> > to track runtime for each test method. I found some of the 
>> >> > following things:
>> >> > - Total test methods run: 2665
>> >> > - Total test runtime: 990 seconds
>> >> > - Deciles of time spent:
>> >> >   - 1949 tests run in the first 9% of time
>> >> >   - 173 in the 9-18% rang3e
>> >> >   - 130 in the 18-27% range
>> >> >   - 95 in the 27-36% range
>> >> >   - 77
>> >> >   - 66
>> >> >   - 55
>> >> >   - 46
>> >> >   - 37
>> >> >   - 24
>> >> >   - 13 tests run in the last 9% of time. This represents about 1 
>> >> > minute and a half.
>> >> >
>> >> > We may be able to look at the slowest X tests, and get gradual 
>> >> > improvements from there. Although it seems .. not dramatic ones : )
>> >> >
>> >> > FWIW I uploaded the results here: 
>> >> > https://storage.googleapis.com/apache-beam-website-pull-requests/python-tests/nosetimes.json
>> >> >
>> >> > The slowest 13 tests were:
>> >> >
>> >> > [('apache_beam.runners.interactive.pipeline_analyzer_test.PipelineAnalyzerTest.test_basic',
>> >> >   5.253582000732422),
>> >> >  
>> >> > ('apache_beam.r

Re: Python Precommit duration pushing 2 hours

2019-11-04 Thread Udi Meiri
+1

On Mon, Nov 4, 2019 at 12:09 PM Robert Bradshaw  wrote:

> +1, this seems like a good step with a clear win.
>
> On Mon, Nov 4, 2019 at 12:06 PM Ahmet Altay  wrote:
> >
> > Python precommits are still timing out on #9925. I am guessing that
> means this change would not be enough.
> >
> > I am proposing cutting down the number of test variants we run in
> precommits. Currently for each version we ran the following variants
> serially:
> > - base: Runs all unit tests with tox
> > - Cython: Installs cython and runs all unit tests as base version. The
> original purpose was to ensure that tests pass with or without cython.
> There is probably a huge overlap with base. (IIRC only a few coders have
> different slow vs fast tests.)
> > - GCP: Installs GCP dependencies and tests all base + additional gcp
> specific tests. The original purpose was to ensure that GCP is an optional
> component and all non-GCP tests still works without GCP components.
> >
> > We can reduce the list to cython + GCP tests only. This will cover the
> same group of tests and will check that tests pass with or without cython
> or GCP dependencies. This could reduce the precommit time by ~30 minutes.
> >
> > What do you think?
> >
> > Ahmet
> >
> >
> > On Tue, Oct 29, 2019 at 11:15 AM Robert Bradshaw 
> wrote:
> >>
> >> https://github.com/apache/beam/pull/9925
> >>
> >> On Tue, Oct 29, 2019 at 10:24 AM Udi Meiri  wrote:
> >> >
> >> > I don't have the bandwidth right now to tackle this. Feel free to
> take it.
> >> >
> >> > On Tue, Oct 29, 2019 at 10:16 AM Robert Bradshaw 
> wrote:
> >> >>
> >> >> The Python SDK does as well. These calls are coming from
> >> >> to_runner_api, is_stateful_dofn, and validate_stateful_dofn which are
> >> >> invoked once per pipene or bundle. They are, however, surprisingly
> >> >> expensive. Even memoizing across those three calls should save a
> >> >> significant amount of time. Udi, did you want to tackle this?
> >> >>
> >> >> Looking at the profile, Pipeline.to_runner_api() is being called 30
> >> >> times in this test, and [Applied]PTransform.to_fn_api being called
> >> >> 3111 times, so that in itself might be interesting to investigate.
> >> >>
> >> >> On Tue, Oct 29, 2019 at 8:26 AM Robert Burke 
> wrote:
> >> >> >
> >> >> > As does the Go SDK. Invokers are memoized and when possible code
> is generated to avoid reflection.
> >> >> >
> >> >> > On Tue, Oct 29, 2019, 6:46 AM Kenneth Knowles 
> wrote:
> >> >> >>
> >> >> >> Noting for the benefit of the thread archive in case someone goes
> digging and wonders if this affects other SDKs: the Java SDK memoizes
> DoFnSignatures and generated DoFnInvoker classes.
> >> >> >>
> >> >> >> Kenn
> >> >> >>
> >> >> >> On Mon, Oct 28, 2019 at 6:59 PM Udi Meiri 
> wrote:
> >> >> >>>
> >> >> >>> Re: #9283 slowing down tests, ideas for slowness:
> >> >> >>> 1. I added a lot of test cases, some with locally run pipelines.
> >> >> >>> 2. The PR somehow changed how coders are selected, and now we're
> using less efficient ones.
> >> >> >>> 3. New dependency funcsigs is slowing things down? (py2 only)
> >> >> >>>
> >> >> >>> I ran "pytest -k PipelineAnalyzerTest --profile-svg" on 2.7 and
> 3.7 and got these cool graphs (attached).
> >> >> >>> 2.7: core:294:get_function_arguments takes 56.66% of CPU time
> (IIUC), gets called ~230k times
> >> >> >>> 3.7: core:294:get_function_arguments 30.88%, gets called ~200k
> times
> >> >> >>>
> >> >> >>> After memoization of get_function_args_defaults:
> >> >> >>> 2.7: core:294:get_function_arguments 20.02%
> >> >> >>> 3.7: core:294:get_function_arguments 8.11%
> >> >> >>>
> >> >> >>>
> >> >> >>> On Mon, Oct 28, 2019 at 5:38 PM Pablo Estrada <
> pabl...@google.com> wrote:
> >> >> 
> >> >>  *not deciles, but 9-percentiles : )
> >> >> 
> >> >>  On Mon, Oct 28, 2019 at 5:31 PM Pablo Estrada <
> pabl...@google.com> wrote:
> >> >> >
> >> >> > I've ran the tests in Python 2 (without cython), and used a
> utility to track runtime for each test method. I found some of the
> following things:
> >> >> > - Total test methods run: 2665
> >> >> > - Total test runtime: 990 seconds
> >> >> > - Deciles of time spent:
> >> >> >   - 1949 tests run in the first 9% of time
> >> >> >   - 173 in the 9-18% rang3e
> >> >> >   - 130 in the 18-27% range
> >> >> >   - 95 in the 27-36% range
> >> >> >   - 77
> >> >> >   - 66
> >> >> >   - 55
> >> >> >   - 46
> >> >> >   - 37
> >> >> >   - 24
> >> >> >   - 13 tests run in the last 9% of time. This represents about
> 1 minute and a half.
> >> >> >
> >> >> > We may be able to look at the slowest X tests, and get gradual
> improvements from there. Although it seems .. not dramatic ones : )
> >> >> >
> >> >> > FWIW I uploaded the results here:
> https://storage.googleapis.com/apache-beam-website-pull-requests/python-tests/nosetimes.json
> >> >> >
> >> >> > The slowest 13 tests were:
> >> >> >
> >> >> >
> 

Re: Python Precommit duration pushing 2 hours

2019-11-04 Thread Ahmet Altay
PR for the proposed change: https://github.com/apache/beam/pull/9985

On Mon, Nov 4, 2019 at 1:35 PM Udi Meiri  wrote:

> +1
>
> On Mon, Nov 4, 2019 at 12:09 PM Robert Bradshaw 
> wrote:
>
>> +1, this seems like a good step with a clear win.
>>
>> On Mon, Nov 4, 2019 at 12:06 PM Ahmet Altay  wrote:
>> >
>> > Python precommits are still timing out on #9925. I am guessing that
>> means this change would not be enough.
>> >
>> > I am proposing cutting down the number of test variants we run in
>> precommits. Currently for each version we ran the following variants
>> serially:
>> > - base: Runs all unit tests with tox
>> > - Cython: Installs cython and runs all unit tests as base version. The
>> original purpose was to ensure that tests pass with or without cython.
>> There is probably a huge overlap with base. (IIRC only a few coders have
>> different slow vs fast tests.)
>> > - GCP: Installs GCP dependencies and tests all base + additional gcp
>> specific tests. The original purpose was to ensure that GCP is an optional
>> component and all non-GCP tests still works without GCP components.
>> >
>> > We can reduce the list to cython + GCP tests only. This will cover the
>> same group of tests and will check that tests pass with or without cython
>> or GCP dependencies. This could reduce the precommit time by ~30 minutes.
>> >
>> > What do you think?
>> >
>> > Ahmet
>> >
>> >
>> > On Tue, Oct 29, 2019 at 11:15 AM Robert Bradshaw 
>> wrote:
>> >>
>> >> https://github.com/apache/beam/pull/9925
>> >>
>> >> On Tue, Oct 29, 2019 at 10:24 AM Udi Meiri  wrote:
>> >> >
>> >> > I don't have the bandwidth right now to tackle this. Feel free to
>> take it.
>> >> >
>> >> > On Tue, Oct 29, 2019 at 10:16 AM Robert Bradshaw <
>> rober...@google.com> wrote:
>> >> >>
>> >> >> The Python SDK does as well. These calls are coming from
>> >> >> to_runner_api, is_stateful_dofn, and validate_stateful_dofn which
>> are
>> >> >> invoked once per pipene or bundle. They are, however, surprisingly
>> >> >> expensive. Even memoizing across those three calls should save a
>> >> >> significant amount of time. Udi, did you want to tackle this?
>> >> >>
>> >> >> Looking at the profile, Pipeline.to_runner_api() is being called 30
>> >> >> times in this test, and [Applied]PTransform.to_fn_api being called
>> >> >> 3111 times, so that in itself might be interesting to investigate.
>> >> >>
>> >> >> On Tue, Oct 29, 2019 at 8:26 AM Robert Burke 
>> wrote:
>> >> >> >
>> >> >> > As does the Go SDK. Invokers are memoized and when possible code
>> is generated to avoid reflection.
>> >> >> >
>> >> >> > On Tue, Oct 29, 2019, 6:46 AM Kenneth Knowles 
>> wrote:
>> >> >> >>
>> >> >> >> Noting for the benefit of the thread archive in case someone
>> goes digging and wonders if this affects other SDKs: the Java SDK memoizes
>> DoFnSignatures and generated DoFnInvoker classes.
>> >> >> >>
>> >> >> >> Kenn
>> >> >> >>
>> >> >> >> On Mon, Oct 28, 2019 at 6:59 PM Udi Meiri 
>> wrote:
>> >> >> >>>
>> >> >> >>> Re: #9283 slowing down tests, ideas for slowness:
>> >> >> >>> 1. I added a lot of test cases, some with locally run pipelines.
>> >> >> >>> 2. The PR somehow changed how coders are selected, and now
>> we're using less efficient ones.
>> >> >> >>> 3. New dependency funcsigs is slowing things down? (py2 only)
>> >> >> >>>
>> >> >> >>> I ran "pytest -k PipelineAnalyzerTest --profile-svg" on 2.7 and
>> 3.7 and got these cool graphs (attached).
>> >> >> >>> 2.7: core:294:get_function_arguments takes 56.66% of CPU time
>> (IIUC), gets called ~230k times
>> >> >> >>> 3.7: core:294:get_function_arguments 30.88%, gets called ~200k
>> times
>> >> >> >>>
>> >> >> >>> After memoization of get_function_args_defaults:
>> >> >> >>> 2.7: core:294:get_function_arguments 20.02%
>> >> >> >>> 3.7: core:294:get_function_arguments 8.11%
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> On Mon, Oct 28, 2019 at 5:38 PM Pablo Estrada <
>> pabl...@google.com> wrote:
>> >> >> 
>> >> >>  *not deciles, but 9-percentiles : )
>> >> >> 
>> >> >>  On Mon, Oct 28, 2019 at 5:31 PM Pablo Estrada <
>> pabl...@google.com> wrote:
>> >> >> >
>> >> >> > I've ran the tests in Python 2 (without cython), and used a
>> utility to track runtime for each test method. I found some of the
>> following things:
>> >> >> > - Total test methods run: 2665
>> >> >> > - Total test runtime: 990 seconds
>> >> >> > - Deciles of time spent:
>> >> >> >   - 1949 tests run in the first 9% of time
>> >> >> >   - 173 in the 9-18% rang3e
>> >> >> >   - 130 in the 18-27% range
>> >> >> >   - 95 in the 27-36% range
>> >> >> >   - 77
>> >> >> >   - 66
>> >> >> >   - 55
>> >> >> >   - 46
>> >> >> >   - 37
>> >> >> >   - 24
>> >> >> >   - 13 tests run in the last 9% of time. This represents
>> about 1 minute and a half.
>> >> >> >
>> >> >> > We may be able to look at the slowest X tests, and get
>> gradual improvements from there. Although it seem

Re: Embedding expansion service for cross language in the runner

2019-11-04 Thread Thomas Weise
The expansion service can be provided by the job server, as done in the
Flink runner. It needs to be available at pipeline construction time, but
there is no need to run a separate service.

Thomas

On Mon, Nov 4, 2019 at 12:03 PM Robert Bradshaw  wrote:

> On Mon, Nov 4, 2019 at 11:54 AM Chamikara Jayalath 
> wrote:
> >
> > On Mon, Nov 4, 2019 at 11:01 AM Hai Lu  wrote:
> >>
> >> Hi,
> >>
> >> We're looking into leveraging the cross language pipeline feature in
> our Beam pipelines on Samza runner. While the feature seems to work well,
> the PTransform expansion as a standalone service isn't very convenient.
> Particularly that the Python pipeline needs to specify the address of the
> expansion service.
> >>
> >> I'm wondering why we couldn't embed the expansion service into runner
> itself. I understand the cross language feature wants to be runner
> independent, but does it make sense to at least provide the option to allow
> runner to use the expansion service as a library and make it transparent to
> the portable pipeline?
> >
> >
> > Beam composite transforms are expanded before defining the portable job
> definition (and before submitting the jobs to the runner). So naturally
> this is something that has to be done in the Beam side. As an added
> benefit, as you identified, this allows us to keep this logic runner
> independent.
> > I think there were discussions regarding automatically starting up a
> local expansion service if one is not specified. Will this address your
> concerns ?
>
> Just to add to this, If you have a pipeline A -> B -> C, the expansion
> of B often needs to be evaluated before C can be applied (e.g. we're
> planning on exposing the SQL transforms cross language, and many
> cross-language IOs can query and supply their own schemas for
> downstream type checking), so one cannot construct the "whole"
> pipeline, pass it to the runner, and let the runner do the expansion.
>


Re: Embedding expansion service for cross language in the runner

2019-11-04 Thread Robert Bradshaw
To clarify, starting up the Flink Job Server by default starts up an
Expansion Service on the hard-coded, default port 8097.

On Mon, Nov 4, 2019 at 2:02 PM Thomas Weise  wrote:
>
> The expansion service can be provided by the job server, as done in the Flink 
> runner. It needs to be available at pipeline construction time, but there is 
> no need to run a separate service.
>
> Thomas
>
> On Mon, Nov 4, 2019 at 12:03 PM Robert Bradshaw  wrote:
>>
>> On Mon, Nov 4, 2019 at 11:54 AM Chamikara Jayalath  
>> wrote:
>> >
>> > On Mon, Nov 4, 2019 at 11:01 AM Hai Lu  wrote:
>> >>
>> >> Hi,
>> >>
>> >> We're looking into leveraging the cross language pipeline feature in our 
>> >> Beam pipelines on Samza runner. While the feature seems to work well, the 
>> >> PTransform expansion as a standalone service isn't very convenient. 
>> >> Particularly that the Python pipeline needs to specify the address of the 
>> >> expansion service.
>> >>
>> >> I'm wondering why we couldn't embed the expansion service into runner 
>> >> itself. I understand the cross language feature wants to be runner 
>> >> independent, but does it make sense to at least provide the option to 
>> >> allow runner to use the expansion service as a library and make it 
>> >> transparent to the portable pipeline?
>> >
>> >
>> > Beam composite transforms are expanded before defining the portable job 
>> > definition (and before submitting the jobs to the runner). So naturally 
>> > this is something that has to be done in the Beam side. As an added 
>> > benefit, as you identified, this allows us to keep this logic runner 
>> > independent.
>> > I think there were discussions regarding automatically starting up a local 
>> > expansion service if one is not specified. Will this address your concerns 
>> > ?
>>
>> Just to add to this, If you have a pipeline A -> B -> C, the expansion
>> of B often needs to be evaluated before C can be applied (e.g. we're
>> planning on exposing the SQL transforms cross language, and many
>> cross-language IOs can query and supply their own schemas for
>> downstream type checking), so one cannot construct the "whole"
>> pipeline, pass it to the runner, and let the runner do the expansion.


[Discuss] Beam mascot

2019-11-04 Thread Aizhamal Nurmamat kyzy
Hi everybody,

I think the idea of creating a Beam mascot has been brought up a couple
times here in the past, but I would like us to go through with it this time
if we are all in agreement:)

We can brainstorm in this thread what the mascot should be given Beam’s
characteristics and principles. What do you all think?

For example, I am proposing a beaver as a mascot, because:
1. Beavers build dams out of logs for streams
2. The name is close to Beam
3. And with the right imagination, you can make a really cute beaver :D
https://imgur.com/gallery/RLo05M9

WDYT? If you don’t like the beaver, what are the other options that you
think could be appropriate? I would like to invite others to propose ideas
and get the discussions going.

Thanks,
Aizhamal


Re: Query Beam internal state

2019-11-04 Thread Luke Cwik
This has come up for Dataflow customers as well, where people would like to
directly serve the content that is stored in state; I'm not aware of any
current plans to expose this from Beam at the moment.

You could implement this yourself by writing the value to state and also
writing it out to memcache (or another distributed key+value store)
using window + user key as the memcache key, but you may make the "state"
viewable before it becomes "committed" in a runner, since processing of the
bundle could be slow or could fail. If you need a point-in-time view of
what the runner knows, then you would need support from runners to
be able to do this.
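
(A rough sketch of that approach using the standard Beam Java state API; the
KeyValueClient below is a hypothetical stand-in for memcache or any other
distributed key+value store client, not a specific library:)

    // Sketch: keep the running value in Beam state and mirror it to an external
    // KV store, keyed by window + user key, so it can be served/queried outside
    // the pipeline.
    class ServeableStateFn extends DoFn<KV<String, Long>, Void> {
      @StateId("total")
      private final StateSpec<ValueState<Long>> totalSpec =
          StateSpecs.value(VarLongCoder.of());

      private transient KeyValueClient kvClient;  // hypothetical external client

      @Setup
      public void setup() {
        kvClient = KeyValueClient.connect("memcache-host:11211");  // illustrative
      }

      @ProcessElement
      public void process(ProcessContext ctx, BoundedWindow window,
          @StateId("total") ValueState<Long> total) {
        Long current = total.read();
        long updated = (current == null ? 0L : current) + ctx.element().getValue();
        total.write(updated);
        // Caveat from above: this write may become visible before the bundle
        // commits, so it is not a point-in-time view of the runner's state.
        kvClient.put(window.toString() + "/" + ctx.element().getKey(), updated);
      }
    }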


On Mon, Nov 4, 2019 at 11:46 AM Dengpan Y  wrote:

> Hi,  I recently see several Beam users using Samza Runner asking for a
> tool/feature to query the internal state of PTransform, especially the
> DoFn, for instance, the user can use the @StateId to define named state,
> and the user would like to query the content of the state for a specific
> key of that state.
>
> Does Beam have any plan to support this feature from Beam SDK, such as
> develop a PTransform that can be wired into the pipeline so that the user
> can query the state on demand?
> Or If this has to be done on the Runner side, is there any best
> practice/idea that we can follow?
>
> Thanks
> Dengpan
>
>


Re: [Discuss] Beam mascot

2019-11-04 Thread David Cavazos
I like it, a beaver could be a cute mascot :)

On Mon, Nov 4, 2019 at 3:33 PM Aizhamal Nurmamat kyzy 
wrote:

> Hi everybody,
>
> I think the idea of creating a Beam mascot has been brought up a couple
> times here in the past, but I would like us to go through with it this time
> if we are all in agreement:)
>
> We can brainstorm in this thread what the mascot should be given Beam’s
> characteristics and principles. What do you all think?
>
> For example, I am proposing a beaver as a mascot, because:
> 1. Beavers build dams out of logs for streams
> 2. The name is close to Beam
> 3. And with the right imagination, you can make a really cute beaver :D
> https://imgur.com/gallery/RLo05M9
>
> WDYT? If you don’t like the beaver, what are the other options that you
> think could be appropriate? I would like to invite others to propose ideas
> and get the discussions going.
>
> Thanks,
> Aizhamal
>


Re: Triggers still finish and drop all data

2019-11-04 Thread Kenneth Knowles
By the way, adding this guard uncovered two bugs in Beam's Java codebase,
luckily only in benchmarks and tests. There were *no* non-buggy instances of a
finishing trigger. Both declare an allowed lateness that is never used.

Nexmark query 10:

    // Clear fancy triggering from above.
    .apply(
        Window.<...>into(...)
            .triggering(AfterWatermark.pastEndOfWindow())
            // We expect no late data here, but we'll assume the worst
            // so we can detect any.
            .withAllowedLateness(Duration.standardDays(1))
            .discardingFiredPanes())

This is nonsensical: the trigger will fire once and close, never firing
again. So the allowed lateness has no effect except to change counters from
"dropped due to lateness" to "dropped due to trigger closing". The intent
would appear to be to restore the default triggering, but it failed.
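
(For illustration, explicitly restoring the default trigger, which never
finishes, would look roughly like this; a sketch only, reusing the same window
and lateness settings as above:)

    .apply(
        Window.<...>into(...)  // same WindowFn as above
            // DefaultTrigger is equivalent to
            // Repeatedly.forever(AfterWatermark.pastEndOfWindow()), so it keeps
            // firing for late data instead of finishing and closing the window.
            .triggering(DefaultTrigger.of())
            .withAllowedLateness(Duration.standardDays(1))
            .discardingFiredPanes())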

PipelineTranslationTest:

    Window.into(FixedWindows.of(Duration.standardMinutes(7)))
        .triggering(
            AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterPane.elementCountAtLeast(19)))
        .accumulatingFiredPanes()
        .withAllowedLateness(Duration.standardMinutes(3L)));

Again, the allowed lateness has no effect. This test is just to test
portable proto round-trip. But still it is odd to write a nonsensical
pipeline for this.

Takeaway: experienced Beam developers never intend to use this pattern, yet
they still get it wrong and create pipelines that would have data-loss bugs
because of it.

Since there is no other discussion here, I will trust the community is OK
with this change and follow Jan's review of my implementation of his idea.

Kenn


On Thu, Oct 31, 2019 at 4:06 PM Kenneth Knowles  wrote:

> Opened https://github.com/apache/beam/pull/9960 for this idea. This will
> alert users to broken pipelines and force them to alter them.
>
> Kenn
>
> On Thu, Oct 31, 2019 at 2:12 PM Kenneth Knowles  wrote:
>
>> On Thu, Oct 31, 2019 at 2:11 AM Jan Lukavský  wrote:
>>
>>> Hi Kenn,
>>>
>>> does there still remain some use for trigger to finish? If we don't drop
>>> data, would it still be of any use to users? If not, would it be better
>>> to just remove the functionality completely, so that users who use it
>>> (and it will possibly break for them) are aware of it at compile time?
>>>
>>> Jan
>>>
>>
>> Good point. I believe there is no good use for a top-level trigger
>> finishing. As mentioned, the intended uses aren't really met by triggers,
>> but are met by stateful DoFn.
>>
>> Eugene's bug even has this title :-). We could not change any behavior
>> but just reject pipelines with broken top-level triggers. This is probably
>> a better solution. Because if a user has a broken trigger, the new behavior
>> is probably not enough to magically fix their pipeline. They are better off
>> knowing that they are broken and fixing it.
>>
>> And at that point, there is a lot of dead code and my PR is really just
>> cleaning it up as a simplification.
>>
>> Kenn
>>
>>
>>
>>> On 10/30/19 11:26 PM, Kenneth Knowles wrote:
>>> > Problem: a trigger can "finish" which causes a window to "close" and
>>> > drop all remaining data arriving for that window.
>>> >
>>> > This has been discussed many times and I thought fixed, but it seems
>>> > to not be fixed. It does not seem to have its own Jira or thread that
>>> > I can find. But here are some pointers:
>>> >
>>> >  - data loss bug:
>>> >
>>> https://lists.apache.org/thread.html/ce413231d0b7d52019668765186ef27a7ffb69b151fdb34f4bf80b0f@%3Cdev.beam.apache.org%3E
>>> >  - user hitting the bug:
>>> >
>>> https://lists.apache.org/thread.html/28879bc80cd5c7ef1a3e38cb1d2c063165d40c13c02894bbccd66aca@%3Cuser.beam.apache.org%3E
>>> >  - user confusion:
>>> >
>>> https://lists.apache.org/thread.html/2707aa449c8c6de1c6e3e8229db396323122304c14931c44d0081449@%3Cuser.beam.apache.org%3E
>>> >  - thread from 2016 on the topic:
>>> >
>>> https://lists.apache.org/thread.html/5f44b62fdaf34094ccff8da2a626b7cd344d29a8a0fff6eac8e148ea@%3Cdev.beam.apache.org%3E
>>> >
>>> > In theory, trigger finishing was intended for users who can get their
>>> > answers from a smaller amount of data and then drop the rest. In
>>> > practice, triggers aren't really expressive enough for this. Stateful
>>> > DoFn is the solution for these cases.
>>> >
>>> > I've opened https://github.com/apache/beam/pull/9942 which makes the
>>> > following changes:
>>> >
>>> >  - when a trigger says it is finished, it never fires again but data
>>> > is still kept
>>> >  - at GC time the final output will be emitted
>>> >
>>> > As with all bugfixes, this is backwards-incompatible (if your pipeline
>>> > relies on buggy behavior, it will stop working). So this is a major
>>> > change that I wanted to discuss on dev@.
>>> >
>>> > Kenn
>>> >
>>>
>>


Re: FirestoreIO connector [JavaSDK]

2019-11-04 Thread Chamikara Jayalath
Thanks for the contribution. Happy to help with the review.
Also, it would probably be good to follow the patterns employed by the
existing Datastore connector when applicable:
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1.java
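
For orientation, a very rough skeleton of the shape such a connector usually
takes (FirestoreIO, Write, WriteFn, and the choice of the Firestore Document
proto as element type below are placeholders for illustration, not an agreed
API; the real connector should mirror DatastoreV1's structure and per-bundle
batching):

import com.google.firestore.v1.Document;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PDone;

public class FirestoreIO {

  public static Write write() {
    return new Write(null);
  }

  /** Writes documents to Firestore, batching RPCs per bundle like DatastoreV1.Write. */
  public static class Write extends PTransform<PCollection<Document>, PDone> {
    private final String projectId;

    private Write(String projectId) {
      this.projectId = projectId;
    }

    public Write withProjectId(String projectId) {
      return new Write(projectId);
    }

    @Override
    public PDone expand(PCollection<Document> input) {
      input.apply("WriteToFirestore", ParDo.of(new WriteFn(projectId)));
      return PDone.in(input.getPipeline());
    }
  }

  /** Placeholder DoFn; a real one would buffer writes and flush them in @FinishBundle. */
  static class WriteFn extends DoFn<Document, Void> {
    private final String projectId;

    WriteFn(String projectId) {
      this.projectId = projectId;
    }

    @ProcessElement
    public void processElement(@Element Document doc) {
      // Buffer `doc` here and commit batched writes via the Firestore client.
    }
  }
}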

Thanks,
Cham

On Mon, Nov 4, 2019 at 2:37 AM Ismaël Mejía  wrote:

> Hello,
>
> Please open a PR with the code of the new connector rebased against
> the latest master.
>
> It is worth to take a look at Beam's contribution guide [1] and
> PTransform style guide [2] in advance.
> In any case we can guide you on any issue during the PR review so
> please go ahead.
>
> Regards,
> Ismaël
>
> [1] https://beam.apache.org/contribute/
> [2] https://beam.apache.org/contribute/ptransform-style-guide/
>
>
> On Mon, Nov 4, 2019 at 10:48 AM Stefan Djelekar
>  wrote:
> >
> > Hi beam devs,
> >
> >
> >
> > I'm Stefan from Redbox. We are a customer of GCP and we are in need of
> beam Java connector for Firestore.
> >
> >
> >
> > There is a pending JIRA item for this.
> https://issues.apache.org/jira/browse/BEAM-8376
> >
> >
> >
> > The team inside of the company has been working on this for a while and
> we would like to contribute!
> >
> > What is the best way to do this? Can we perhaps get a review from an
> experienced beam member?
> >
> > Let me know what do you think.
> >
> >
> >
> > All the best,
> >
> >
> >
> > Stefan Đelekar
> >
> > Sofware Engineer
> >
> > stefan.djele...@redbox.com
> >
> > djelekar.com
> >
> >
>


Re: [Discuss] Beam mascot

2019-11-04 Thread Robert Burke
As both a Canadian, and the resident fan of a programming language with a
rodent mascot, I endorse this mascot.

On Mon, Nov 4, 2019, 4:11 PM David Cavazos  wrote:

> I like it, a beaver could be a cute mascot :)
>
> On Mon, Nov 4, 2019 at 3:33 PM Aizhamal Nurmamat kyzy 
> wrote:
>
>> Hi everybody,
>>
>> I think the idea of creating a Beam mascot has been brought up a couple
>> times here in the past, but I would like us to go through with it this time
>> if we are all in agreement:)
>>
>> We can brainstorm in this thread what the mascot should be given Beam’s
>> characteristics and principles. What do you all think?
>>
>> For example, I am proposing a beaver as a mascot, because:
>> 1. Beavers build dams out of logs for streams
>> 2. The name is close to Beam
>> 3. And with the right imagination, you can make a really cute beaver :D
>> https://imgur.com/gallery/RLo05M9
>>
>> WDYT? If you don’t like the beaver, what are the other options that you
>> think could be appropriate? I would like to invite others to propose ideas
>> and get the discussions going.
>>
>> Thanks,
>> Aizhamal
>>
>


Re: [Discuss] Beam mascot

2019-11-04 Thread Kenneth Knowles
Yes! Let's have a mascot!

Direct connections often have duplicates. For example in the log processing
space, there is https://www.linkedin.com/in/hooverbeaver

I like a flying squirrel, but Flink already is a squirrel.

Hedgehog? I could not find any source of confusion for this one.

Kenn


On Mon, Nov 4, 2019 at 6:02 PM Robert Burke  wrote:

> As both a Canadian, and the resident fan of a programming language with a
> rodent mascot, I endorse this mascot.
>
> On Mon, Nov 4, 2019, 4:11 PM David Cavazos  wrote:
>
>> I like it, a beaver could be a cute mascot :)
>>
>> On Mon, Nov 4, 2019 at 3:33 PM Aizhamal Nurmamat kyzy <
>> aizha...@apache.org> wrote:
>>
>>> Hi everybody,
>>>
>>> I think the idea of creating a Beam mascot has been brought up a couple
>>> times here in the past, but I would like us to go through with it this time
>>> if we are all in agreement:)
>>>
>>> We can brainstorm in this thread what the mascot should be given Beam’s
>>> characteristics and principles. What do you all think?
>>>
>>> For example, I am proposing a beaver as a mascot, because:
>>> 1. Beavers build dams out of logs for streams
>>> 2. The name is close to Beam
>>> 3. And with the right imagination, you can make a really cute beaver :D
>>> https://imgur.com/gallery/RLo05M9
>>>
>>> WDYT? If you don’t like the beaver, what are the other options that you
>>> think could be appropriate? I would like to invite others to propose ideas
>>> and get the discussions going.
>>>
>>> Thanks,
>>> Aizhamal
>>>
>>


Re: Query Beam internal state

2019-11-04 Thread Kenneth Knowles
Augmenting Luke's suggestion with @RequiresStableInput is our design for
ensuring values written to the key+value store are stable. This will have
some cost. It is also not supported by all runners.
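
Roughly, that combination might look like the sketch below (the ExternalCache
client is hypothetical and only illustrates the memcache idea; the state and
@RequiresStableInput pieces are the real Beam API):

import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;

/** Stand-in for a memcache/redis-style client; purely hypothetical. */
interface ExternalCache {
  static ExternalCache connect() {
    throw new UnsupportedOperationException("sketch only");
  }

  void put(String key, Object value);
}

class MirrorStateToCacheFn extends DoFn<KV<String, Long>, Void> {

  @StateId("latest")
  private final StateSpec<ValueState<Long>> latestSpec = StateSpecs.value();

  private transient ExternalCache cache;

  @Setup
  public void setup() {
    cache = ExternalCache.connect();
  }

  // Stable input means a retried bundle replays identical elements, so the value
  // mirrored to the external store matches what eventually commits to Beam state.
  @RequiresStableInput
  @ProcessElement
  public void process(
      @Element KV<String, Long> element,
      BoundedWindow window,
      @StateId("latest") ValueState<Long> latest) {
    latest.write(element.getValue());
    // Serve-side key: window + user key, as suggested below.
    cache.put(window + "/" + element.getKey(), element.getValue());
  }
}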

Kenn

On Mon, Nov 4, 2019 at 3:44 PM Luke Cwik  wrote:

> This has come up for Dataflow customers as well where people would like to
> directly serve the content that is stored in state and I'm not aware of any
> current plans to expose this from Beam at the moment.
>
> You could implement this yourself by writing the value to state and also
> writing the value out to memcache (or another distributed key+value store),
> using window + user key as the memcache key. However, you may make the "state"
> viewable before it becomes "committed" in a runner, since processing of the
> bundle could be slow or could fail. If you need a point-in-time view of what
> the runner knows, then you would need support from runners to be able to do
> this.
>
>
> On Mon, Nov 4, 2019 at 11:46 AM Dengpan Y  wrote:
>
>> Hi, I have recently seen several Beam users on the Samza runner asking for a
>> tool/feature to query the internal state of a PTransform, especially a
>> DoFn. For instance, a user can use @StateId to define named state, and
>> they would like to query the content of that state for a specific key.
>>
>> Does Beam have any plan to support this feature from Beam SDK, such as
>> develop a PTransform that can be wired into the pipeline so that the user
>> can query the state on demand?
>> Or If this has to be done on the Runner side, is there any best
>> practice/idea that we can follow?
>>
>> Thanks
>> Dengpan
>>
>>


Re: Python Precommit duration pushing 2 hours

2019-11-04 Thread Kenneth Knowles
+1 to moving forward with this

Could we move GCP tests outside the core? Then only code changes touching or
affecting GCP would cause them to run in precommit. We could still run them in
postcommit in their own suite. If the core has reasonably stable abstractions
that the connectors are built on, this should not change coverage much.

Kenn

On Mon, Nov 4, 2019 at 1:55 PM Ahmet Altay  wrote:

> PR for the proposed change: https://github.com/apache/beam/pull/9985
>
> On Mon, Nov 4, 2019 at 1:35 PM Udi Meiri  wrote:
>
>> +1
>>
>> On Mon, Nov 4, 2019 at 12:09 PM Robert Bradshaw 
>> wrote:
>>
>>> +1, this seems like a good step with a clear win.
>>>
>>> On Mon, Nov 4, 2019 at 12:06 PM Ahmet Altay  wrote:
>>> >
>>> > Python precommits are still timing out on #9925. I am guessing that
>>> means this change would not be enough.
>>> >
>>> > I am proposing cutting down the number of test variants we run in
>>> precommits. Currently for each version we ran the following variants
>>> serially:
>>> > - base: Runs all unit tests with tox
>>> > - Cython: Installs cython and runs all unit tests as base version. The
>>> original purpose was to ensure that tests pass with or without cython.
>>> There is probably a huge overlap with base. (IIRC only a few coders have
>>> different slow vs fast tests.)
>>> > - GCP: Installs GCP dependencies and tests all base + additional gcp
>>> specific tests. The original purpose was to ensure that GCP is an optional
>>> component and all non-GCP tests still works without GCP components.
>>> >
>>> > We can reduce the list to cython + GCP tests only. This will cover the
>>> same group of tests and will check that tests pass with or without cython
>>> or GCP dependencies. This could reduce the precommit time by ~30 minutes.
>>> >
>>> > What do you think?
>>> >
>>> > Ahmet
>>> >
>>> >
>>> > On Tue, Oct 29, 2019 at 11:15 AM Robert Bradshaw 
>>> wrote:
>>> >>
>>> >> https://github.com/apache/beam/pull/9925
>>> >>
>>> >> On Tue, Oct 29, 2019 at 10:24 AM Udi Meiri  wrote:
>>> >> >
>>> >> > I don't have the bandwidth right now to tackle this. Feel free to
>>> take it.
>>> >> >
>>> >> > On Tue, Oct 29, 2019 at 10:16 AM Robert Bradshaw <
>>> rober...@google.com> wrote:
>>> >> >>
>>> >> >> The Python SDK does as well. These calls are coming from
>>> >> >> to_runner_api, is_stateful_dofn, and validate_stateful_dofn, which are
>>> >> >> invoked once per pipeline or bundle. They are, however, surprisingly
>>> >> >> expensive. Even memoizing across those three calls should save a
>>> >> >> significant amount of time. Udi, did you want to tackle this?
>>> >> >>
>>> >> >> Looking at the profile, Pipeline.to_runner_api() is being called 30
>>> >> >> times in this test, and [Applied]PTransform.to_fn_api being called
>>> >> >> 3111 times, so that in itself might be interesting to investigate.
>>> >> >>
>>> >> >> On Tue, Oct 29, 2019 at 8:26 AM Robert Burke 
>>> wrote:
>>> >> >> >
>>> >> >> > As does the Go SDK. Invokers are memoized and when possible code
>>> is generated to avoid reflection.
>>> >> >> >
>>> >> >> > On Tue, Oct 29, 2019, 6:46 AM Kenneth Knowles 
>>> wrote:
>>> >> >> >>
>>> >> >> >> Noting for the benefit of the thread archive in case someone
>>> goes digging and wonders if this affects other SDKs: the Java SDK memoizes
>>> DoFnSignatures and generated DoFnInvoker classes.
>>> >> >> >>
>>> >> >> >> Kenn
>>> >> >> >>
>>> >> >> >> On Mon, Oct 28, 2019 at 6:59 PM Udi Meiri 
>>> wrote:
>>> >> >> >>>
>>> >> >> >>> Re: #9283 slowing down tests, ideas for slowness:
>>> >> >> >>> 1. I added a lot of test cases, some with locally run
>>> pipelines.
>>> >> >> >>> 2. The PR somehow changed how coders are selected, and now
>>> we're using less efficient ones.
>>> >> >> >>> 3. New dependency funcsigs is slowing things down? (py2 only)
>>> >> >> >>>
>>> >> >> >>> I ran "pytest -k PipelineAnalyzerTest --profile-svg" on 2.7
>>> and 3.7 and got these cool graphs (attached).
>>> >> >> >>> 2.7: core:294:get_function_arguments takes 56.66% of CPU time
>>> (IIUC), gets called ~230k times
>>> >> >> >>> 3.7: core:294:get_function_arguments 30.88%, gets called ~200k
>>> times
>>> >> >> >>>
>>> >> >> >>> After memoization of get_function_args_defaults:
>>> >> >> >>> 2.7: core:294:get_function_arguments 20.02%
>>> >> >> >>> 3.7: core:294:get_function_arguments 8.11%
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>> On Mon, Oct 28, 2019 at 5:38 PM Pablo Estrada <
>>> pabl...@google.com> wrote:
>>> >> >> 
>>> >> >>  *not deciles, but 9-percentiles : )
>>> >> >> 
>>> >> >>  On Mon, Oct 28, 2019 at 5:31 PM Pablo Estrada <
>>> pabl...@google.com> wrote:
>>> >> >> >
>>> >> >> > I ran the tests in Python 2 (without cython) and used a utility to
>>> >> >> > track runtime for each test method. I found some of the following
>>> >> >> > things:
>>> >> >> > - Total test methods run: 2665
>>> >> >> > - Total test runtime: 990 seconds
>>> >> >> > - Deciles of time spent:
>>> >> >> >   - 1949 tests run in the first 9%

Re: [Discuss] Beam mascot

2019-11-04 Thread Eugene Kirpichov
Feels like "Beam" would go well with an animal that has glowing bright eyes
(with beams of light shooting out of them), such as a lemur [1] or an owl.

[1] https://www.cnn.com/travel/article/madagascar-lemurs/index.html

On Mon, Nov 4, 2019 at 7:33 PM Kenneth Knowles  wrote:

> Yes! Let's have a mascot!
>
> Direct connections often have duplicates. For example in the log
> processing space, there is https://www.linkedin.com/in/hooverbeaver
>
> I like a flying squirrel, but Flink already is a squirrel.
>
> Hedgehog? I could not find any source of confusion for this one.
>
> Kenn
>
>
> On Mon, Nov 4, 2019 at 6:02 PM Robert Burke  wrote:
>
>> As both a Canadian, and the resident fan of a programming language with a
>> rodent mascot, I endorse this mascot.
>>
>> On Mon, Nov 4, 2019, 4:11 PM David Cavazos  wrote:
>>
>>> I like it, a beaver could be a cute mascot :)
>>>
>>> On Mon, Nov 4, 2019 at 3:33 PM Aizhamal Nurmamat kyzy <
>>> aizha...@apache.org> wrote:
>>>
 Hi everybody,

 I think the idea of creating a Beam mascot has been brought up a couple
 times here in the past, but I would like us to go through with it this time
 if we are all in agreement:)

 We can brainstorm in this thread what the mascot should be given Beam’s
 characteristics and principles. What do you all think?

 For example, I am proposing a beaver as a mascot, because:
 1. Beavers build dams out of logs for streams
 2. The name is close to Beam
 3. And with the right imagination, you can make a really cute beaver :D
 https://imgur.com/gallery/RLo05M9

 WDYT? If you don’t like the beaver, what are the other options that you
 think could be appropriate? I would like to invite others to propose ideas
 and get the discussions going.

 Thanks,
 Aizhamal

>>>