Re: Possible Python SDK performance regression

2019-09-06 Thread Thomas Weise
The issue is only visible with Python 3.6, not 2.7.

If there is a framework in place to add a streaming test, that would be
great. We would use what we have internally as a starting point.

On Thu, Sep 5, 2019 at 5:00 PM Ahmet Altay  wrote:

>
>
> On Thu, Sep 5, 2019 at 4:15 PM Thomas Weise  wrote:
>
>> The workload is quite different. What I have is streaming with state and
>> timers.
>>
>>
>>
>> On Thu, Sep 5, 2019 at 3:47 PM Pablo Estrada  wrote:
>>
>>> We only recently started running Chicago Taxi Example. +Michał Walenia
>>>  I don't see it in the dashboards. Do you
>>> know if it's possible to see any trends in the data?
>>>
>>> We have a few tests running now:
>>> - Combine tests:
>>> https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373
>>> - GBK tests:
>>> https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373
>>>
>>> They don't seem to show a very drastic jump either, but they aren't very
>>> old.
>>>
>>> There is also work ongoing to add alerting for this sort of regressions
>>> by Kasia and Kamil (added). The work is not there yet (it's in progress).
>>> Best
>>> -P.
>>>
>>> On Thu, Sep 5, 2019 at 3:35 PM Thomas Weise  wrote:
>>>
>>>> It probably won't be practical to do a bisect due to the high cost of
>>>> each iteration with our fork/deploy setup.
>>>>
>>>> Perhaps it is time to setup something with the synthetic source that
>>>> works just with Beam as dependency.

>>>
> I agree with this.
>
> Pablo, Kasia, Kamil, do the new benchmarks give us an easy-to-use
> framework for using the synthetic source in benchmarks?
>
>
>>
 On Thu, Sep 5, 2019 at 3:23 PM Ahmet Altay  wrote:

> There are a few in this dashboard [1], but they are not very useful in this
> case because they do not go back more than a month and are not very
> comprehensive. I do not see a jump there. Thomas, would it be possible to
> bisect to find what commit caused the regression?
>
> +Pablo Estrada  do we have any python on flink
> benchmarks for chicago example?
> +Alan Myrvold  +Yifan Zou  It
> would be good to have alerts on benchmarks. Do we have such an ability
> today?
>
> [1] https://apache-beam-testing.appspot.com/dashboard-admin
>
> On Thu, Sep 5, 2019 at 3:15 PM Thomas Weise  wrote:
>
>> Hi,
>>
>> Are there any performance tests run for the Python SDK as part of
>> release verification (or otherwise as well)?
>>
>> I see what appears to be a regression in master (compared to 2.14)
>> with our in-house application (~25% jump in CPU utilization and a
>> corresponding drop in throughput).
>>
>> I wanted to see if there is anything available to verify that within
>> Beam.
>>
>> Thanks,
>> Thomas
>>
>>


Re: Possible Python SDK performance regression

2019-09-06 Thread Ahmet Altay
+Valentyn Tymofieiev  do we have benchmarks for
different Python versions? Was there a recent change that is specific to
Python 3.x?

On Fri, Sep 6, 2019 at 8:36 AM Thomas Weise  wrote:

> The issue is only visible with Python 3.6, not 2.7.
>
> If there is a framework in place to add a streaming test, that would be
> great. We would use what we have internally as starting point.
>
> On Thu, Sep 5, 2019 at 5:00 PM Ahmet Altay  wrote:
>
>>
>>
>> On Thu, Sep 5, 2019 at 4:15 PM Thomas Weise  wrote:
>>
>>> The workload is quite different. What I have is streaming with state and
>>> timers.
>>>
>>>
>>>
>>> On Thu, Sep 5, 2019 at 3:47 PM Pablo Estrada  wrote:
>>>
 We only recently started running Chicago Taxi Example. +Michał Walenia
  I don't see it in the dashboards. Do you
 know if it's possible to see any trends in the data?

 We have a few tests running now:
 - Combine tests:
 https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373
 - GBK tests:
 https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373

 They don't seem to show a very drastic jump either, but they aren't
 very old.

 There is also work ongoing to add alerting for this sort of regressions
 by Kasia and Kamil (added). The work is not there yet (it's in progress).
 Best
 -P.

 On Thu, Sep 5, 2019 at 3:35 PM Thomas Weise  wrote:

> It probably won't be practical to do a bisect due to the high cost of
> each iteration with our fork/deploy setup.
>
> Perhaps it is time to setup something with the synthetic source that
> works just with Beam as dependency.
>

>> I agree with this.
>>
>> Pablo, Kasia, Kamil, does the new benchmarks give us a easy to use
>> framework for using synthetic source in benchmarks?
>>
>>
>>>
> On Thu, Sep 5, 2019 at 3:23 PM Ahmet Altay  wrote:
>
>> There are a few in this dashboard [1], but not very useful in this
>> case because they do not go back more than a month and not very
>> comprehensive. I do not see a jump there. Thomas, would it be possible to
>> bisect to find what commit caused the regression?
>>
>> +Pablo Estrada  do we have any python on flink
>> benchmarks for chicago example?
>> +Alan Myrvold  +Yifan Zou  It
>> would be good to have alerts on benchmarks. Do we have such an ability
>> today?
>>
>> [1] https://apache-beam-testing.appspot.com/dashboard-admin
>>
>> On Thu, Sep 5, 2019 at 3:15 PM Thomas Weise  wrote:
>>
>>> Hi,
>>>
>>> Are there any performance tests run for the Python SDK as part of
>>> release verification (or otherwise as well)?
>>>
>>> I see what appears to be a regression in master (compared to 2.14)
>>> with our in-house application (~ 25% jump in cpu utilization and
>>> corresponds drop in throughput).
>>>
>>> I wanted to see if there is anything available to verify that within
>>> Beam.
>>>
>>> Thanks,
>>> Thomas
>>>
>>>


Re: installing Apache Beam on Pycharm with Python 3.7

2019-09-06 Thread Rakesh Kumar
Hi Priti,

It would be helpful if you could provide more information about your
environment and the error message. You can also ask this question on
StackOverflow with the 'apache-beam' tag for better visibility.
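
For anyone gathering those details, here is a minimal sketch that prints the
interpreter, platform, and pip versions, which are the facts usually worth
pasting alongside the full error message. The `environment_report` helper is
a hypothetical name used only for illustration; it is not part of Beam or pip.

```python
import platform
import sys


def environment_report():
    """Collect the basics worth including in an installation bug report."""
    try:
        import pip  # imported lazily: pip may be missing from the environment
        pip_version = pip.__version__
    except ImportError:
        pip_version = "not installed"
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "pip": pip_version,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running this inside the same interpreter that PyCharm is configured to use
should show whether pip and Python there match what was expected.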


On Thu, Sep 5, 2019 at 9:43 AM Priti Badami <
pbadami.srdataengin...@gmail.com> wrote:

> Hi Dev Team,
>
> I am trying to install Apache Beam. I have pip 19.2.3 but I am facing
> issues while installing Beam.
>
> Please advise,
>
> Thanks,
> Priti Badami
>


[report] Understanding how people use the Apache Beam website!

2019-09-06 Thread Pablo Estrada
Hello all,
I've put together a report analyzing how people have been using the Apache
Beam website. The report is relatively simple, but it does show some
interesting insights about how users navigate the website and which pages
are most popular, and it compiles a few action items (ideas) to improve
the documentation.

Check it out: https://s.apache.org/beam-ga-report

And please leave comments : )

For the curious, the action items are:
- Always cross-link with the programming guide
- (Continuously) Invest in high-traffic pages
- Move de-facto documentation into official documentation (I'm looking at
you State&Timers : ))
- Encourage use of the Blog

Thanks!
-P.


Re: [report] Understanding how people use the Apache Beam website!

2019-09-06 Thread Kenneth Knowles
On Fri, Sep 6, 2019 at 10:48 AM Pablo Estrada  wrote:

>
> - Move de-facto documentation into official documentation (I'm looking at
> you State&Timers : ))
> - Encourage use of the Blog
>

Well, which is it?!? :-p

 - Encourage use of the Blog [but not for reference documentation] :-)

Kenn

>


[discuss] How we support our users on Slack / Mailing list / StackOverflow

2019-09-06 Thread Pablo Estrada
Hello all,

THE SITUATION:
It was brought to my attention recently that Python users in Slack are not
getting much support, because most of the Beam Python-knowledgeable people
are not on Slack. Unfortunately, in the Beam site, we do refer people to
Slack for assistance[1].

Java users do receive reasonable support, because there are enough Beam
Java-knowledgeable people online, and willing to answer.

On the other hand, at Google we do have a number of people who are
responsible for answering questions on StackOverflow[2], and we do our best
to answer promptly. I think we do a reasonable job overall.

SO LET'S DISCUSS:
How should we advise the community to ask questions about Beam?
- Perhaps we should encourage people to try the mailing list first
- Perhaps we should encourage people to try StackOverflow first
- Perhaps we should write a bot that encourages Python users to go to
StackOverflow
- something else?

My personal opinion is that a mailing list is not great: It's intimidating,
it does not provide great indexing or searchability.

WHAT I PROPOSE:

I think explicitly encouraging everyone to go to StackOverflow first will
be the best alternative: It's indexed, searchable, less intimidating than
the mailing list. We can add that they can try Slack as well - without any
guarantees.

What do others think?
-P.

[1] https://beam.apache.org/community/contact-us/
[2] https://stackoverflow.com/questions/tagged/apache-beam?tab=Newest


Interactive Beam - support for caching and introspection of PCollections

2019-09-06 Thread Alexey Strokach
Hi everyone,

I have recently finished my internship at Google, which involved doing some
work with Apache Beam in a Jupyter Notebook environment. One limitation
that I encountered with my workflow is the lack of support for
introspecting the contents of a PCollection and excessive boilerplate
required to move data between a Beam Pipeline and the Python interpreter.

With guidance from Vanya Tarasonv and Harsh Vardhan, I have created a
design document which describes those limitations:
https://docs.google.com/document/d/1sISjl4Q60mR1V22R1UZd417wVEn_EmZT-SalTHXG4H0/
.

I also have two PRs outstanding, which add support for materializing and
accessing bounded and unbounded PCollections both from a Beam Pipeline and
from the Python interpreter.
- https://github.com/apache/beam/pull/8884
- https://github.com/apache/beam/pull/8961

I am aware of the work being carried out by +Ning Kang and +David Yan on
[Interactive Beam](
https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/),
and upon discussion, it does not appear that our PRs would conflict with
their vision.

Any feedback from the Apache Beam community would be very much appreciated
:).

Thank you,
Alexey


clickhouse tests failing

2019-09-06 Thread Elliotte Rusty Harold
At head I noticed the following:


$ ./gradlew -p sdks/java/io/ check
Configuration on demand is an incubating feature.

> Task :sdks:java:io:clickhouse:test

org.apache.beam.sdk.io.clickhouse.ClickHouseIOTest > classMethod FAILED
java.lang.IllegalStateException

org.apache.beam.sdk.io.clickhouse.ClickHouseIOTest > classMethod FAILED
java.lang.NullPointerException

org.apache.beam.sdk.io.clickhouse.AtomicInsertTest > classMethod FAILED
java.lang.IllegalStateException

org.apache.beam.sdk.io.clickhouse.AtomicInsertTest > classMethod FAILED
java.lang.NullPointerException

29 tests completed, 4 failed

> Task :sdks:java:io:clickhouse:test FAILED

FAILURE: Build failed with an exception.


Is anyone else seeing this? Are the tests expected to pass, or is
there some requirement (e.g. Java 11) that I might be missing?

-- 
Elliotte Rusty Harold
elh...@ibiblio.org


Re: [discuss] How we support our users on Slack / Mailing list / StackOverflow

2019-09-06 Thread Udi Meiri
I don't go on Slack, but I will be notified of mentions. It has the
advantage of being an informal space.
SO can feel just as intimidating as the mailing list IMO. Unlike the
others, it doesn't lend itself very well to discussions (you can only post
comments or answers).



On Fri, Sep 6, 2019 at 10:55 AM Pablo Estrada  wrote:

> Hello all,
>
> THE SITUATION:
> It was brought to my attention recently that Python users in Slack are not
> getting much support, because most of the Beam Python-knowledgeable people
> are not on Slack. Unfortunately, in the Beam site, we do refer people to
> Slack for assistance[1].
>
> Java users do receive reasonable support, because there are enough Beam
> Java-knowledgeable people online, and willing to answer.
>
> On the other hand, at Google we do have a number of people who are
> responsible to answer questions on StackOverflow[2], and we do our best to
> answer promptly. I think we do a reasonable job overall.
>
> SO LET'S DISCUSS:
> How should we advise the community to ask questions about Beam?
> - Perhaps we should encourage people to try the mailing list first
> - Perhaps we should encourage people to try StackOverflow first
> - Perhaps we should write a bot that encourages Python users to go to
> StackOverflow
> - something else?
>
> My personal opinion is that a mailing list is not great: It's
> intimidating, it does not provide great indexing or searchability.
>
> WHAT I PROPOSE:
>
> I think explicitly encouraging everyone to go to StackOverflow first will
> be the best alternative: It's indexed, searchable, less intimidating than
> the mailing list. We can add that they can try Slack as well - without any
> guarantees.
>
> What do others think?
> -P.
>
> [1] https://beam.apache.org/community/contact-us/
> [2] https://stackoverflow.com/questions/tagged/apache-beam?tab=Newest
>


Re: Improve container support

2019-09-06 Thread Hannah Jiang
Hi team

I haven't received any objections, so I will proceed with the settings
mentioned in a previous email.

A reminder to PMC members: please let me know your Docker Hub ID if you
want to be an admin.

Thanks,
Hannah
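
To make the repository:tag naming from the quoted proposal below concrete,
here is a small sketch. The `image_name` and `release_tags` helpers are
hypothetical and exist only for illustration; they are not part of the Beam
build, and the actual publishing logic may differ.

```python
def image_name(language, tag, language_version=""):
    """Build a 'repository:tag' name for an SDK image.

    The repository is '{language}{language_version}_sdk'; the language
    version is only present for SDKs that ship several versions.
    """
    if not language or not tag:
        raise ValueError("language and tag are required")
    return f"{language}{language_version}_sdk:{tag}"


def release_tags(version):
    """Return the release-candidate tag followed by the final release tag."""
    return [f"{version}_rc", version]


if __name__ == "__main__":
    print(image_name("python", "2.15.0", language_version="2.7"))
    print(image_name("java", "2.15.0_rc"))
    print(image_name("go", "latest"))
```

The three printed names match the examples listed in the proposal.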

On Thu, Sep 5, 2019 at 5:02 PM Ankur Goenka  wrote:

> Please ignore the previous email. I was looking at the older document in
> the mail thread.
>
> On Thu, Sep 5, 2019 at 4:58 PM Ankur Goenka  wrote:
>
>> I think sdk in the name is obsolete as they are all under sdks name space.
>>
>> On Thu, Sep 5, 2019 at 3:26 PM Hannah Jiang 
>> wrote:
>>
>>> Hi Team
>>>
>>> Thanks for all the comments about beam containers.
>>> After considering various opinions and investigating gcr and docker hub,
>>> we decided to push images to docker hub.
>>>
>>> Each image will have two tags, {version}_rc and {version}. The {version}
>>> tag will be added after the release candidate image is verified.
>>> Meanwhile, we will have a *latest* tag for each repository, which always
>>> points to the most recent verified release image, so users can pull it by
>>> default.
>>>
>>> Docker Hub doesn't support multi-level repositories, which means we should
>>> follow the *repository:tag* format.
>>> It's too general if we use {language_version} as the repository for SDK
>>> images. (The version is added when we support multiple versions.)
>>> So I would like to include *sdk* in the repository name. Images generated
>>> locally will also have the same name.
>>> Here are some examples:
>>>
>>>    - python2.7_sdk:2.15.0
>>>    - java_sdk:2.15.0_rc
>>>    - go_sdk:latest
>>>
>>> I will proceed with this format if there is no strong opposition by
>>> tomorrow noon(PST).
>>>
>>> *To PMC members*:
>>> Permission control will follow the pypi model. All interested PMC
>>> members will be added as admins and release managers will be granted push
>>> permission.
>>> Please let me know your *docker id* if you want to be added as an admin.
>>>
>>> Thanks,
>>> Hannah
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Sep 4, 2019 at 3:47 PM Thomas Weise  wrote:
>>>
 This will greatly simplify trying out portable runners:
 https://beam.apache.org/documentation/runners/flink/#executing-a-beam-pipeline-on-a-flink-cluster

 Can't wait for following to disappear from the instructions page: ./gradlew
 :sdks:python:container:docker

 On Wed, Sep 4, 2019 at 3:35 PM Thomas Weise  wrote:

> Awesome, thank you!
>
>
> On Wed, Sep 4, 2019 at 3:22 PM Hannah Jiang 
> wrote:
>
>> Hi Thomas
>>
>> I created snapshot images from head as of around 2PM today.
>> You can pull images from
>> gcr.io/apache-beam-testing/beam/sdks/snapshot.
>>
>> Thanks,
>> Hannah
>>
>> On Wed, Sep 4, 2019 at 1:41 PM Thomas Weise  wrote:
>>
>>> Hi Hannah,
>>>
>>> Thank you, I know how to build the containers locally, but not how
>>> to publish them!
>>>
>>> The cwiki says "Publishing images to gcr.io/beam requires
>>> permissions in apache-beam-testing project."
>>>
>>> Can I get access to the testing project (at least temporarily) and
>>> what would I need to setup to run the publish target that is shown on 
>>> cwiki?
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>> On Wed, Sep 4, 2019 at 11:06 AM Hannah Jiang 
>>> wrote:
>>>
>>>> Hi Thomas
>>>>
>>>> I haven't uploaded any snapshot images yet. Here is how you can
>>>> create one from head.
>>>> > cd [...]/beam/
>>>> # For Python, where {version} is one of 2, 35, 36, 37
>>>> > ./gradlew :sdks:python:container:py{version}:docker
>>>> # For Java
>>>> > ./gradlew -p sdks/java/container docker
>>>> # For Go
>>>> > ./gradlew -p sdks/go/container docker
>>>>
>>>> The 2.15 one is just for testing, not a real 2.15.0, nor a snapshot
>>>> from head.
>>>>
>>>> Please let me know if you have any questions.
>>>> Hannah

 On Wed, Sep 4, 2019 at 10:57 AM Thomas Weise 
 wrote:

> I actually found something in [1], but it is 2.15 unfortunately.
>
> [1]
> https://console.cloud.google.com/gcr/images/apache-beam-testing/GLOBAL/beam/sdks/release/python2.7?gcrImageListsize=30
>
> On Wed, Sep 4, 2019 at 10:35 AM Thomas Weise 
> wrote:
>
>> Thanks for working on this. Do you happen to have publicly
>> accessible snapshots published for your testing currently (even when 
>> the
>> final location isn't sorted out)?
>>
>> I would like to use a 2.16 based Python SDK image for working on
>> my downstream project, but could not find anything in
>> gcr.io/apache-beam-testing/beam/sdks/rc/snapshot
>>
>> Thanks,
>> Thomas
>>
>> On Fri, Aug 30, 2019 at 10:56 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> On Tue, Aug 27, 2019 at 3:35 PM 

Re: Interactive Beam - support for caching and introspection of PCollections

2019-09-06 Thread Ning Kang
Thanks Alexey! The materialization of PCollection data directly from cache
instead of going through the pipeline result would be very helpful for what
we want to achieve!

On Fri, Sep 6, 2019 at 12:31 PM Alexey Strokach  wrote:

> Hi everyone,
>
> I have recently finished my internship at Google, which involved doing
> some work with Apache Beam in a Jupyter Notebook environment. One
> limitation that I encountered with my workflow is the lack of support for
> introspecting the contents of a PCollection and excessive boilerplate
> required to move data between a Beam Pipeline and the Python interpreter.
>
> With guidance from Vanya Tarasonv and Harsh Vardhan, I have created a
> design document which describes those limitations:
> https://docs.google.com/document/d/1sISjl4Q60mR1V22R1UZd417wVEn_EmZT-SalTHXG4H0/
> .
>
> I also have two PRs outstanding, which add support for materializing and
> accessing bounded and unbounded PCollections both from a Beam Pipeline and
> from the Python interpreter.
> - https://github.com/apache/beam/pull/8884
> - https://github.com/apache/beam/pull/8961
>
> I am aware of the work being carried out by +Ning Kang and +David Yan on
> [Interactive Beam](
> https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/),
> and upon discussion, it does not appear that our PRs would conflict with
> their vision.
>
> Any feedback from the Apache Beam community would be very much appreciated
> :).
>
> Thank you,
> Alexey
>
>
>
>
>


Re: [discuss] How we support our users on Slack / Mailing list / StackOverflow

2019-09-06 Thread Robert Bradshaw
I would also suggest SO as the best alternative, especially due to its
indexability and searchability. If discussion is needed, the users
list (my preference) or slack can be good options, and ideally the
resolution is brought back to SO.

On Fri, Sep 6, 2019 at 1:10 PM Udi Meiri  wrote:
>
> I don't go on Slack, but I will be notified of mentions. It has the advantage 
> of being an informal space.
> SO can feel just as intimidating as the mailing list IMO. Unlike the others, 
> it doesn't lend itself very well to discussions (you can only post comments 
> or answers).
>
>
>
> On Fri, Sep 6, 2019 at 10:55 AM Pablo Estrada  wrote:
>>
>> Hello all,
>>
>> THE SITUATION:
>> It was brought to my attention recently that Python users in Slack are not 
>> getting much support, because most of the Beam Python-knowledgeable people 
>> are not on Slack. Unfortunately, in the Beam site, we do refer people to 
>> Slack for assistance[1].
>>
>> Java users do receive reasonable support, because there are enough Beam 
>> Java-knowledgeable people online, and willing to answer.
>>
>> On the other hand, at Google we do have a number of people who are 
>> responsible to answer questions on StackOverflow[2], and we do our best to 
>> answer promptly. I think we do a reasonable job overall.
>>
>> SO LET'S DISCUSS:
>> How should we advise the community to ask questions about Beam?
>> - Perhaps we should encourage people to try the mailing list first
>> - Perhaps we should encourage people to try StackOverflow first
>> - Perhaps we should write a bot that encourages Python users to go to 
>> StackOverflow
>> - something else?
>>
>> My personal opinion is that a mailing list is not great: It's intimidating, 
>> it does not provide great indexing or searchability.
>>
>> WHAT I PROPOSE:
>>
>> I think explicitly encouraging everyone to go to StackOverflow first will be 
>> the best alternative: It's indexed, searchable, less intimidating than the 
>> mailing list. We can add that they can try Slack as well - without any 
>> guarantees.
>>
>> What do others think?
>> -P.
>>
>> [1] https://beam.apache.org/community/contact-us/
>> [2] https://stackoverflow.com/questions/tagged/apache-beam?tab=Newest


Re: Possible Python SDK performance regression

2019-09-06 Thread Valentyn Tymofieiev
+Mark Liu  has added some benchmarks running across
multiple Python versions. Specifically, we run a 1 GB wordcount job on
the Dataflow runner on Python 2.7 and 3.5-3.7. The benchmarks do not have
alerting configured and, to my knowledge, are not actively monitored yet.

The zoom buttons on the dashboard [1] seem to be malfunctioning, as it is
not readily possible to extend the range for some reason. However, the data
is available in BigQuery, and by adjusting the SQL query we can see a
regression in benchmark performance on July 4 [2].
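
As a rough illustration of the kind of check an alert could run on such
data, here is a minimal sketch. It assumes the runtimes have already been
exported as (date, seconds) pairs; the numbers below are made up, and the
`first_regression` helper, the window of 3 runs, and the 10% threshold are
all hypothetical choices, not what the Beam dashboards do.

```python
from statistics import mean


def first_regression(runs, window=3, threshold=0.10):
    """Return the first (label, value) whose value exceeds the mean of the
    previous `window` values by more than `threshold`, or None if none does."""
    values = [value for _, value in runs]
    for i in range(window, len(values)):
        baseline = mean(values[i - window:i])
        if values[i] > baseline * (1 + threshold):
            return runs[i]
    return None


# Made-up runtimes in seconds, for illustration only.
runs = [
    ("2019-07-01", 100.0),
    ("2019-07-02", 101.0),
    ("2019-07-03", 99.0),
    ("2019-07-04", 125.0),
    ("2019-07-05", 126.0),
]

if __name__ == "__main__":
    print(first_regression(runs))
```

Comparing a single run against a short trailing mean is sensitive to noise;
a real alert would probably need a longer window or several consecutive
slow runs before firing.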

I looked at merge commits on July 4, and only saw changes to loadtest
infrastructure [3]. AFAIK that change affects a different set of
performance tests than the 1 GB wordcount benchmark; however, I may be wrong.
Lukasz, Kamil, or Mark can correct me. Either way, it is not clear why
only the py35-37 benchmarks were affected, but not py27. It is also possible
that a new version of some Beam dependency was released that affected
benchmark performance.

> Was there a recent change that is specific to python 3.x ?


We had some changes related to type inference that are specific to the Python
version, for example https://github.com/apache/beam/pull/8893.

Thomas, is it possible for you to do the bisection using SDK code from
master at various commits to narrow down the regression on your end?
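
Because each iteration is expensive, it may help to note that a bisection
needs only about log2(N) benchmark runs for N candidate commits. A minimal
sketch of the idea (which `git bisect` automates), assuming a hypothetical
`is_bad` callback that deploys a given commit and reports whether the
regression shows up:

```python
def first_bad_commit(commits, is_bad):
    """Binary-search an oldest-to-newest list of commits for the first bad one.

    Assumes commits[0] is known good, commits[-1] is known bad, and that once
    a commit is bad every later commit is bad as well.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # regression already present at mid
        else:
            good = mid  # mid is still fine; look at later commits
    return commits[bad]
```

The commit names and the `is_bad` check are placeholders; in practice the
check would be a benchmark run against the SDK built at that commit.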

[1]
https://apache-beam-testing.appspot.com/explore?dashboard=5691127080419328
[2] https://drive.google.com/file/d/1ERlnN8bA2fKCUPBHTnid1l__81qpQe2W/view
[3]
https://github.com/apache/beam/commit/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5



On Fri, Sep 6, 2019 at 8:38 AM Ahmet Altay  wrote:

> +Valentyn Tymofieiev  do we have benchmarks in
> different python versions? Was there a recent change that is specific to
> python 3.x ?
>
> On Fri, Sep 6, 2019 at 8:36 AM Thomas Weise  wrote:
>
>> The issue is only visible with Python 3.6, not 2.7.
>>
>> If there is a framework in place to add a streaming test, that would be
>> great. We would use what we have internally as starting point.
>>
>> On Thu, Sep 5, 2019 at 5:00 PM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Thu, Sep 5, 2019 at 4:15 PM Thomas Weise  wrote:
>>>
 The workload is quite different. What I have is streaming with state
 and timers.



 On Thu, Sep 5, 2019 at 3:47 PM Pablo Estrada 
 wrote:

> We only recently started running Chicago Taxi Example. +Michał Walenia
>  I don't see it in the dashboards. Do you
> know if it's possible to see any trends in the data?
>
> We have a few tests running now:
> - Combine tests:
> https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373
> - GBK tests:
> https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373
>
> They don't seem to show a very drastic jump either, but they aren't
> very old.
>
> There is also work ongoing to add alerting for this sort of
> regressions by Kasia and Kamil (added). The work is not there yet (it's in
> progress).
> Best
> -P.
>
> On Thu, Sep 5, 2019 at 3:35 PM Thomas Weise  wrote:
>
>> It probably won't be practical to do a bisect due to the high cost of
>> each iteration with our fork/deploy setup.
>>
>> Perhaps it is time to setup something with the synthetic source that
>> works just with Beam as dependency.
>>
>
>>> I agree with this.
>>>
>>> Pablo, Kasia, Kamil, does the new benchmarks give us a easy to use
>>> framework for using synthetic source in benchmarks?
>>>
>>>

>> On Thu, Sep 5, 2019 at 3:23 PM Ahmet Altay  wrote:
>>
>>> There are a few in this dashboard [1], but not very useful in this
>>> case because they do not go back more than a month and not very
>>> comprehensive. I do not see a jump there. Thomas, would it be possible 
>>> to
>>> bisect to find what commit caused the regression?
>>>
>>> +Pablo Estrada  do we have any python on flink
>>> benchmarks for chicago example?
>>> +Alan Myrvold  +Yifan Zou  It
>>> would be good to have alerts on benchmarks. Do we have such an ability
>>> today?
>>>
>>> [1] https://apache-beam-testing.appspot.com/dashboard-admin
>>>
>>> On Thu, Sep 5, 2019 at 3:15 PM Thomas Weise  wrote:
>>>
 Hi,

 Are there any performance tests run for the Python SDK as part of
 release verification (or otherwise as well)?

 I see what appears to be a regression in master (compared to 2.14)
 with our in-house application (~ 25% jump in cpu utilization and
 corresponds drop in throughput).

 I wanted to see if there is anything available to verify that
 within Beam.

 Thanks,
 Thomas




Re: Interactive Beam - support for caching and introspection of PCollections

2019-09-06 Thread Ahmet Altay
(I believe you wanted to add +David Yan )

I am happy to see there are multiple related efforts. Both are introducing
concepts. I would hope that, beyond avoiding conflicts, we are not creating
duplication and are building a coherent experience. Could you point to the
discussions where this was agreed upon?

On Fri, Sep 6, 2019 at 2:15 PM Ning Kang  wrote:

> Thanks Alexey! The materialization of PCollection data directly from cache
> instead of going through the pipeline result would be very helpful for what
> we want to achieve!
>
> On Fri, Sep 6, 2019 at 12:31 PM Alexey Strokach 
> wrote:
>
>> Hi everyone,
>>
>> I have recently finished my internship at Google, which involved doing
>> some work with Apache Beam in a Jupyter Notebook environment. One
>> limitation that I encountered with my workflow is the lack of support for
>> introspecting the contents of a PCollection and excessive boilerplate
>> required to move data between a Beam Pipeline and the Python interpreter.
>>
>> With guidance from Vanya Tarasonv and Harsh Vardhan, I have created a
>> design document which describes those limitations:
>> https://docs.google.com/document/d/1sISjl4Q60mR1V22R1UZd417wVEn_EmZT-SalTHXG4H0/
>> .
>>
>> I also have two PRs outstanding, which add support for materializing and
>> accessing bounded and unbounded PCollections both from a Beam Pipeline and
>> from the Python interpreter.
>> - https://github.com/apache/beam/pull/8884
>> - https://github.com/apache/beam/pull/8961
>>
>> I am aware of the work being carried out by +Ning Kang and +David Yan on
>> [Interactive Beam](
>> https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/),
>> and upon discussion, it does not appear that our PRs would conflict with
>> their vision.
>>
>> Any feedback from the Apache Beam community would be very much
>> appreciated :).
>>
>> Thank you,
>> Alexey
>>
>>
>>
>>
>>


Re: [discuss] How we support our users on Slack / Mailing list / StackOverflow

2019-09-06 Thread Kenneth Knowles
+1 to StackOverflow first, though I'm not important for Beam Python users.
Udi has a good point about discussions. If an SO question has a lot of back
and forth, or no response, then it is good to point to other channels the
user might try next.

Kenn

On Fri, Sep 6, 2019 at 2:20 PM Robert Bradshaw  wrote:

> I would also suggest SO as the best alternative, especially due to its
> indexability and searchability. If discussion is needed, the users
> list (my preference) or slack can be good options, and ideally the
> resolution is brought back to SO.
>
> On Fri, Sep 6, 2019 at 1:10 PM Udi Meiri  wrote:
> >
> > I don't go on Slack, but I will be notified of mentions. It has the
> advantage of being an informal space.
> > SO can feel just as intimidating as the mailing list IMO. Unlike the
> others, it doesn't lend itself very well to discussions (you can only post
> comments or answers).
> >
> >
> >
> > On Fri, Sep 6, 2019 at 10:55 AM Pablo Estrada 
> wrote:
> >>
> >> Hello all,
> >>
> >> THE SITUATION:
> >> It was brought to my attention recently that Python users in Slack are
> not getting much support, because most of the Beam Python-knowledgeable
> people are not on Slack. Unfortunately, in the Beam site, we do refer
> people to Slack for assistance[1].
> >>
> >> Java users do receive reasonable support, because there are enough Beam
> Java-knowledgeable people online, and willing to answer.
> >>
> >> On the other hand, at Google we do have a number of people who are
> responsible to answer questions on StackOverflow[2], and we do our best to
> answer promptly. I think we do a reasonable job overall.
> >>
> >> SO LET'S DISCUSS:
> >> How should we advise the community to ask questions about Beam?
> >> - Perhaps we should encourage people to try the mailing list first
> >> - Perhaps we should encourage people to try StackOverflow first
> >> - Perhaps we should write a bot that encourages Python users to go to
> StackOverflow
> >> - something else?
> >>
> >> My personal opinion is that a mailing list is not great: It's
> intimidating, it does not provide great indexing or searchability.
> >>
> >> WHAT I PROPOSE:
> >>
> >> I think explicitly encouraging everyone to go to StackOverflow first
> will be the best alternative: It's indexed, searchable, less intimidating
> than the mailing list. We can add that they can try Slack as well - without
> any guarantees.
> >>
> >> What do others think?
> >> -P.
> >>
> >> [1] https://beam.apache.org/community/contact-us/
> >> [2] https://stackoverflow.com/questions/tagged/apache-beam?tab=Newest
>


Re: Hackathon @BeamSummit @ApacheCon

2019-09-06 Thread Mikhail Gryzykhin
I'll be in most of the week and will join gladly.

On Thu, Sep 5, 2019, 14:32 Chad Dombrova  wrote:

> Has a date and time been picked for this?  I'll be there for part of the
> week and would love to join.
>
> On Tue, Sep 3, 2019 at 11:31 AM Brian Hulette  wrote:
>
>> I will be around all week as well and would love to help with a Beam
>> hackathon in any way :)
>>
>> On Thu, Aug 29, 2019 at 9:46 AM Maximilian Michels 
>> wrote:
>>
>>> Hey,
>>>
>>> I'm in as well! Austin and I recently talked about how we could organize
>>> the hackathon. Likely it will be an hour per day for exchanging ideas
>>> and learning about Beam. For example, there has been interest from the
>>> Apache Streams project to discuss points for collaboration.
>>>
>>> We will soon announce the exact hours.
>>>
>>> Cheers,
>>> Max
>>>
>>> On 23.08.19 05:06, Kenneth Knowles wrote:
>>> > I will be at Beam Summit / ApacheCon NA and would love to drop by a
>>> > hackathon room if one is arranged. Really excited for both my first
>>> > ApacheCon and Beam Summit (finally!)
>>> >
>>> > Kenn
>>> >
>>> > On Thu, Aug 22, 2019 at 10:18 AM Austin Bennett wrote:
>>> >
>>> > And, for clarity, especially focused on Hackathon times on Monday
>>> > and/or Tuesday of ApacheCon, to not conflict with BeamSummit
>>> sessions.
>>> >
>>> > On Thu, Aug 22, 2019 at 9:47 AM Austin Bennett wrote:
>>> >
>>> > Less than 3 weeks till Beam Summit @ApacheCon!
>>> >
>>> > We are to be in Vegas for BeamSummit and ApacheCon in a few
>>> weeks.
>>> >
>>> > Likely to reserve space in the Hackathon Room to accomplish
>>> some
>>> > tasks:
>>> > * Help Users
>>> > * Build Beam
>>> > * Collaborate with other projects
>>> > * etc
>>> >
>>> > If you're to be around (or not) let us know how you'd like to
>>> be
>>> > involved.  Also, please share and surface anything that would
>>> be
>>> > good for us to look at (and, esp. any beginner tasks, in case
>>> we
>>> > can entice some new contributors).
>>> >
>>> >
>>> > P.S.  See BeamSummit.org, if you're thinking of attending -
>>> > there's a discount code.
>>> >
>>>
>>


Re: [discuss] How we support our users on Slack / Mailing list / StackOverflow

2019-09-06 Thread Ahmet Altay
Both StackOverflow and the mailing lists have better answer rates for Python
questions. Suggesting either one of them makes sense. I also find
StackOverflow easier to use, but that is a personal preference. The
original problem is the lack of support within Slack. Both the mailing list
and StackOverflow are already listed on the support page above Slack. How
are we going to redirect these folks from Slack to SO or ML?

Also, what is the profile of people on Slack in general? I had the
impression that it is geared more toward developers working on Beam
interacting with each other than toward users asking Beam questions. Is
this accurate?

Ahmet

On Fri, Sep 6, 2019 at 4:41 PM Kenneth Knowles  wrote:

> +1 to StackOverflow first, though I'm not important for Beam Python users.
> Udi has a good point about discussions. If an SO question has a lot of back
> and forth, or no response, then it is good to point to other channels the
> user might try next.
>
> Kenn
>
> On Fri, Sep 6, 2019 at 2:20 PM Robert Bradshaw 
> wrote:
>
>> I would also suggest SO as the best alternative, especially due to its
>> indexability and searchability. If discussion is needed, the users
>> list (my preference) or slack can be good options, and ideally the
>> resolution is brought back to SO.
>>
>> On Fri, Sep 6, 2019 at 1:10 PM Udi Meiri  wrote:
>> >
>> > I don't go on Slack, but I will be notified of mentions. It has the
>> advantage of being an informal space.
>> > SO can feel just as intimidating as the mailing list IMO. Unlike the
>> others, it doesn't lend itself very well to discussions (you can only post
>> comments or answers).
>> >
>> >
>> >
>> > On Fri, Sep 6, 2019 at 10:55 AM Pablo Estrada 
>> wrote:
>> >>
>> >> Hello all,
>> >>
>> >> THE SITUATION:
>> >> It was brought to my attention recently that Python users in Slack are
>> not getting much support, because most of the Beam Python-knowledgeable
>> people are not on Slack. Unfortunately, in the Beam site, we do refer
>> people to Slack for assistance[1].
>> >>
>> >> Java users do receive reasonable support, because there are enough
>> Beam Java-knowledgeable people online, and willing to answer.
>> >>
>> >> On the other hand, at Google we do have a number of people who are
>> responsible to answer questions on StackOverflow[2], and we do our best to
>> answer promptly. I think we do a reasonable job overall.
>> >>
>> >> SO LET'S DISCUSS:
>> >> How should we advise the community to ask questions about Beam?
>> >> - Perhaps we should encourage people to try the mailing list first
>> >> - Perhaps we should encourage people to try StackOverflow first
>> >> - Perhaps we should write a bot that encourages Python users to go to
>> StackOverflow
>> >> - something else?
>> >>
>> >> My personal opinion is that a mailing list is not great: It's
>> intimidating, it does not provide great indexing or searchability.
>> >>
>> >> WHAT I PROPOSE:
>> >>
>> >> I think explicitly encouraging everyone to go to StackOverflow first
>> will be the best alternative: It's indexed, searchable, less intimidating
>> than the mailing list. We can add that they can try Slack as well - without
>> any guarantees.
>> >>
>> >> What do others think?
>> >> -P.
>> >>
>> >> [1] https://beam.apache.org/community/contact-us/
>> >> [2] https://stackoverflow.com/questions/tagged/apache-beam?tab=Newest
>>
>


Re: Hackathon @BeamSummit @ApacheCon

2019-09-06 Thread Austin Bennett
Ah, yes.  We'll definitely be in Hackathon space 2-3p on Monday and Tuesday
(and can stay longer if needed).  We aren't scheduling anything official on
Wed and Thurs, given the multiple Beam tracks that are occurring.

On Fri, Sep 6, 2019 at 4:46 PM Mikhail Gryzykhin  wrote:

> I'll be in most of the week and will join gladly.
>
> On Thu, Sep 5, 2019, 14:32 Chad Dombrova  wrote:
>
>> Has a date and time been picked for this?  I'll be there for part of the
>> week and would love to join.
>>
>> On Tue, Sep 3, 2019 at 11:31 AM Brian Hulette 
>> wrote:
>>
>>> I will be around all week as well and would love to help with a Beam
>>> hackathon in any way :)
>>>
>>> On Thu, Aug 29, 2019 at 9:46 AM Maximilian Michels 
>>> wrote:
>>>
 Hey,

 I'm in as well! Austin and I recently talked about how we could
 organize
 the hackathon. Likely it will be an hour per day for exchanging ideas
 and learning about Beam. For example, there has been interest from the
 Apache Streams project to discuss points for collaboration.

 We will soon announce the exact hours.

 Cheers,
 Max

 On 23.08.19 05:06, Kenneth Knowles wrote:
 > I will be at Beam Summit / ApacheCon NA and would love to drop by a
 > hackathon room if one is arranged. Really excited for both my first
 > ApacheCon and Beam Summit (finally!)
 >
 > Kenn
 >
 > On Thu, Aug 22, 2019 at 10:18 AM Austin Bennett wrote:
 >
 > And, for clarity, especially focused on Hackathon times on Monday
 > and/or Tuesday of ApacheCon, to not conflict with BeamSummit
 sessions.
 >
 > On Thu, Aug 22, 2019 at 9:47 AM Austin Bennett wrote:
 >
 > Less than 3 weeks till Beam Summit @ApacheCon!
 >
 > We are to be in Vegas for BeamSummit and ApacheCon in a few
 weeks.
 >
 > Likely to reserve space in the Hackathon Room to accomplish
 some
 > tasks:
 > * Help Users
 > * Build Beam
 > * Collaborate with other projects
 > * etc
 >
 > If you're to be around (or not) let us know how you'd like to
 be
 > involved.  Also, please share and surface anything that would
 be
 > good for us to look at (and, esp. any beginner tasks, in case
 we
 > can entice some new contributors).
 >
 >
 > P.S.  See BeamSummit.org, if you're thinking of attending -
 > there's a discount code.
 >

>>>


Re: Hackathon @BeamSummit @ApacheCon

2019-09-06 Thread Austin Bennett
+u...@beam.apache.org 

On Fri, Sep 6, 2019 at 5:24 PM Austin Bennett 
wrote:

> Ah, yes.  We'll definitely be in Hackathon space 2-3p on Monday and
> Tuesday (and can stay longer if needed).  We aren't scheduling anything
> official on Wed and Thurs, given the multiple Beam tracks that are
> occurring.
>
> On Fri, Sep 6, 2019 at 4:46 PM Mikhail Gryzykhin 
> wrote:
>
>> I'll be in most of the week and will join gladly.
>>
>> On Thu, Sep 5, 2019, 14:32 Chad Dombrova  wrote:
>>
>>> Has a date and time been picked for this?  I'll be there for part of the
>>> week and would love to join.
>>>
>>> On Tue, Sep 3, 2019 at 11:31 AM Brian Hulette 
>>> wrote:
>>>
 I will be around all week as well and would love to help with a Beam
 hackathon in any way :)

 On Thu, Aug 29, 2019 at 9:46 AM Maximilian Michels 
 wrote:

> Hey,
>
> I'm in as well! Austin and I recently talked about how we could
> organize
> the hackathon. Likely it will be an hour per day for exchanging ideas
> and learning about Beam. For example, there has been interest from the
> Apache Streams project to discuss points for collaboration.
>
> We will soon announce the exact hours.
>
> Cheers,
> Max
>
> On 23.08.19 05:06, Kenneth Knowles wrote:
> > I will be at Beam Summit / ApacheCon NA and would love to drop by a
> > hackathon room if one is arranged. Really excited for both my first
> > ApacheCon and Beam Summit (finally!)
> >
> > Kenn
> >
> > On Thu, Aug 22, 2019 at 10:18 AM Austin Bennett wrote:
> >
> > And, for clarity, especially focused on Hackathon times on Monday
> > and/or Tuesday of ApacheCon, to not conflict with BeamSummit
> sessions.
> >
> > On Thu, Aug 22, 2019 at 9:47 AM Austin Bennett wrote:
> >
> > Less than 3 weeks till Beam Summit @ApacheCon!
> >
> > We are to be in Vegas for BeamSummit and ApacheCon in a few
> weeks.
> >
> > Likely to reserve space in the Hackathon Room to accomplish
> some
> > tasks:
> > * Help Users
> > * Build Beam
> > * Collaborate with other projects
> > * etc
> >
> > If you're to be around (or not) let us know how you'd like
> to be
> > involved.  Also, please share and surface anything that
> would be
> > good for us to look at (and, esp. any beginner tasks, in
> case we
> > can entice some new contributors).
> >
> >
> > P.S.  See BeamSummit.org, if you're thinking of attending -
> > there's a discount code.
> >
>



Re: [discuss] How we support our users on Slack / Mailing list / StackOverflow

2019-09-06 Thread Austin Bennett
I see no reason Slack can't be suitable for Beam users -- other open source
projects use Slack for user chatter, too. What it could be, though, is
different from how it is currently used. There are 173 accounts in
#beam-python, and a decent portion of recent conversations (at a quick
glance) look like users asking for advice (which maybe should be
pointed to the Google Cloud Slack account...)

I suggest we meet users wherever they are (don't abandon Slack), but that is
from a community standpoint. If people are metrics-focused, that might be
harder in Slack, and/or we can find ways to measure things for those who
have benchmarks to hit. I am willing to dig into Slack's API, if desired,
to surface/forward messages as useful. I'm not sure how all that would
look, but I'm open to figuring it out.



On Fri, Sep 6, 2019 at 4:47 PM Ahmet Altay  wrote:

> Both StackOverflow and mailing lists have better answer rates for python
> questions. Suggesting either one of them makes sense. I also find
> StackOverflow easier to use but that is a personal preference.  The
> original problem is that lack of support within Slack. Both mailing list
> and stackoverflow are already listed in the support page above Slack. How
> are we going to redirect these folks from Slack to SO or ML?
>
> Also, what is the profile of people on slack in general. I had the
> impression that it is more tuned for developer working on Beam to interact
> rather than for users to ask Beam questions. Is this accurate?
>
> Ahmet
>
> On Fri, Sep 6, 2019 at 4:41 PM Kenneth Knowles  wrote:
>
>> +1 to StackOverflow first, though I'm not important for Beam Python
>> users. Udi has a good point about discussions. If an SO question has a lot
>> of back and forth, or no response, then it is good to point to other
>> channels the user might try next.
>>
>> Kenn
>>
>> On Fri, Sep 6, 2019 at 2:20 PM Robert Bradshaw 
>> wrote:
>>
>>> I would also suggest SO as the best alternative, especially due to its
>>> indexability and searchability. If discussion is needed, the users
>>> list (my preference) or slack can be good options, and ideally the
>>> resolution is brought back to SO.
>>>
>>> On Fri, Sep 6, 2019 at 1:10 PM Udi Meiri  wrote:
>>> >
>>> > I don't go on Slack, but I will be notified of mentions. It has the
>>> advantage of being an informal space.
>>> > SO can feel just as intimidating as the mailing list IMO. Unlike the
>>> others, it doesn't lend itself very well to discussions (you can only post
>>> comments or answers).
>>> >
>>> >
>>> >
>>> > On Fri, Sep 6, 2019 at 10:55 AM Pablo Estrada 
>>> wrote:
>>> >>
>>> >> Hello all,
>>> >>
>>> >> THE SITUATION:
>>> >> It was brought to my attention recently that Python users in Slack
>>> are not getting much support, because most of the Beam Python-knowledgeable
>>> people are not on Slack. Unfortunately, in the Beam site, we do refer
>>> people to Slack for assistance[1].
>>> >>
>>> >> Java users do receive reasonable support, because there are enough
>>> Beam Java-knowledgeable people online, and willing to answer.
>>> >>
>>> >> On the other hand, at Google we do have a number of people who are
>>> responsible to answer questions on StackOverflow[2], and we do our best to
>>> answer promptly. I think we do a reasonable job overall.
>>> >>
>>> >> SO LET'S DISCUSS:
>>> >> How should we advise the community to ask questions about Beam?
>>> >> - Perhaps we should encourage people to try the mailing list first
>>> >> - Perhaps we should encourage people to try StackOverflow first
>>> >> - Perhaps we should write a bot that encourages Python users to go to
>>> StackOverflow
>>> >> - something else?
>>> >>
>>> >> My personal opinion is that a mailing list is not great: It's
>>> intimidating, it does not provide great indexing or searchability.
>>> >>
>>> >> WHAT I PROPOSE:
>>> >>
>>> >> I think explicitly encouraging everyone to go to StackOverflow first
>>> will be the best alternative: It's indexed, searchable, less intimidating
>>> than the mailing list. We can add that they can try Slack as well - without
>>> any guarantees.
>>> >>
>>> >> What do others think?
>>> >> -P.
>>> >>
>>> >> [1] https://beam.apache.org/community/contact-us/
>>> >> [2] https://stackoverflow.com/questions/tagged/apache-beam?tab=Newest
>>>
>>


Re: Possible Python SDK performance regression

2019-09-06 Thread Thomas Weise
On Fri, Sep 6, 2019 at 2:24 PM Valentyn Tymofieiev 
wrote:

> +Mark Liu  has added some benchmarks running across
> multiple Python versions. Specifically we run 1 GB wordcount job on
> Dataflow runner on Python 2.7, 3.5-3.7. The benchmarks do not have
> configured alerting and to my knowledge are not actively monitored yet.
>

Are there any benchmarks for streaming? Streaming and batch take quite
different runtime paths, and some of the issues can only be identified
through metrics on longer-running processes. It would be good to verify
utilization of memory, CPU, etc.

I additionally discovered that our 2.16 upgrade exhibits a memory leak in
the Python worker (Py 2.7).


> Thomas, is it possible for you to do the bisection using SDK code from
> master at various commits to narrow down the regression on your end?
>

I don't know how soon I will get to it. It's of course possible, but
expensive due to having to rebase the fork, build and deploy an entire
stack of stuff for each iteration. The pipeline itself is super simple. We
need this testbed as part of Beam. It would be nice to be able to pick an
update and have more confidence that the baseline has not slipped.
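
For reference, a minimal sketch of the bisection idea discussed above. It is
not tied to Beam or to any real tooling: the commit names and the `is_bad`
probe are hypothetical stand-ins for "rebase the fork, build, deploy, and
check CPU utilization". It assumes the commits are ordered oldest to newest,
the first is known good, the last is known bad, and the regression persists
once introduced (the same assumption `git bisect` makes).

```python
from typing import Callable, List, Sequence


def first_bad_commit(commits: Sequence[str],
                     is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is True.

    commits[0] is assumed good and commits[-1] is assumed bad, so neither
    endpoint is probed; only commits in between are tested.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Hypothetical history of 16 commits; the regression lands in "c11".
commits = ["c%02d" % i for i in range(16)]
probes: List[str] = []


def is_bad(commit: str) -> bool:
    # Stand-in for the expensive build/deploy/measure cycle.
    probes.append(commit)
    return commit >= "c11"


culprit = first_bad_commit(commits, is_bad)
```

With 16 candidate commits this needs 4 expensive iterations rather than 14,
which is why the cost per iteration matters so much here.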


>
> [1]
> https://apache-beam-testing.appspot.com/explore?dashboard=5691127080419328
> [2] https://drive.google.com/file/d/1ERlnN8bA2fKCUPBHTnid1l__81qpQe2W/view
> [3]
> https://github.com/apache/beam/commit/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5
>
>
>
> On Fri, Sep 6, 2019 at 8:38 AM Ahmet Altay  wrote:
>
>> +Valentyn Tymofieiev  do we have benchmarks in
>> different python versions? Was there a recent change that is specific to
>> python 3.x ?
>>
>> On Fri, Sep 6, 2019 at 8:36 AM Thomas Weise  wrote:
>>
>>> The issue is only visible with Python 3.6, not 2.7.
>>>
>>> If there is a framework in place to add a streaming test, that would be
>>> great. We would use what we have internally as starting point.
>>>
>>> On Thu, Sep 5, 2019 at 5:00 PM Ahmet Altay  wrote:
>>>


 On Thu, Sep 5, 2019 at 4:15 PM Thomas Weise  wrote:

> The workload is quite different. What I have is streaming with state
> and timers.
>
>
>
> On Thu, Sep 5, 2019 at 3:47 PM Pablo Estrada 
> wrote:
>
>> We only recently started running Chicago Taxi Example. +Michał
>> Walenia  I don't see it in the
>> dashboards. Do you know if it's possible to see any trends in the data?
>>
>> We have a few tests running now:
>> - Combine tests:
>> https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373
>> - GBK tests:
>> https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373
>>
>> They don't seem to show a very drastic jump either, but they aren't
>> very old.
>>
>> There is also work ongoing to add alerting for this sort of
>> regressions by Kasia and Kamil (added). The work is not there yet (it's 
>> in
>> progress).
>> Best
>> -P.
>>
>> On Thu, Sep 5, 2019 at 3:35 PM Thomas Weise  wrote:
>>
>>> It probably won't be practical to do a bisect due to the high cost
>>> of each iteration with our fork/deploy setup.
>>>
>>> Perhaps it is time to setup something with the synthetic source that
>>> works just with Beam as dependency.
>>>
>>
 I agree with this.

 Pablo, Kasia, Kamil, does the new benchmarks give us a easy to use
 framework for using synthetic source in benchmarks?


>
>>> On Thu, Sep 5, 2019 at 3:23 PM Ahmet Altay  wrote:
>>>
 There are a few in this dashboard [1], but not very useful in this
 case because they do not go back more than a month and not very
 comprehensive. I do not see a jump there. Thomas, would it be possible 
 to
 bisect to find what commit caused the regression?

 +Pablo Estrada  do we have any python on flink
 benchmarks for chicago example?
 +Alan Myrvold  +Yifan Zou
  It would be good to have alerts on
 benchmarks. Do we have such an ability today?

 [1] https://apache-beam-testing.appspot.com/dashboard-admin

 On Thu, Sep 5, 2019 at 3:15 PM Thomas Weise  wrote:

> Hi,
>
> Are there any performance tests run for the Python SDK as part of
> release verification (or otherwise as well)?
>
> I see what appears to be a regression in master (compared to 2.14)
> with our in-house application (~ 25% jump in cpu utilization and
> corresponds drop in throughput).
>
> I wanted to see if there is anything available to verify that
> within Beam.
>
> Thanks,
> Thomas
>
>


Re: Possible Python SDK performance regression

2019-09-06 Thread Ahmet Altay
On Fri, Sep 6, 2019 at 6:17 PM Thomas Weise  wrote:

>
>
> On Fri, Sep 6, 2019 at 2:24 PM Valentyn Tymofieiev 
> wrote:
>
>> +Mark Liu  has added some benchmarks running across
>> multiple Python versions. Specifically we run 1 GB wordcount job on
>> Dataflow runner on Python 2.7, 3.5-3.7. The benchmarks do not have
>> configured alerting and to my knowledge are not actively monitored yet.
>>
>
> Are there any benchmarks for streaming? Streaming and batch are quite
> different runtime paths. And some of the issues can only be identified
> with longer running processes through metrics. It would be good to verify
> utilization of memory, cpu etc.
>
> I additionally discovered that our 2.16 upgrade exhibits a memory leak in
> the Python worker (Py 2.7).
>

Do you have more details on this one?


>
>
>> Thomas, is it possible for you to do the bisection using SDK code from
>> master at various commits to narrow down the regression on your end?
>>
>
> I don't know how soon I will get to it. It's of course possible, but
> expensive due to having to rebase the fork, build and deploy an entire
> stack of stuff for each iteration. The pipeline itself is super simple. We
> need this testbed as part of Beam. It would be nice to be able to pick an
> update and have more confidence that the baseline has not slipped.
>
>
>>
>> [1]
>> https://apache-beam-testing.appspot.com/explore?dashboard=5691127080419328
>> [2]
>> https://drive.google.com/file/d/1ERlnN8bA2fKCUPBHTnid1l__81qpQe2W/view
>> [3]
>> https://github.com/apache/beam/commit/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5
>>
>>
>>
>> On Fri, Sep 6, 2019 at 8:38 AM Ahmet Altay  wrote:
>>
>>> +Valentyn Tymofieiev  do we have benchmarks in
>>> different python versions? Was there a recent change that is specific to
>>> python 3.x ?
>>>
>>> On Fri, Sep 6, 2019 at 8:36 AM Thomas Weise  wrote:
>>>
 The issue is only visible with Python 3.6, not 2.7.

 If there is a framework in place to add a streaming test, that would be
 great. We would use what we have internally as starting point.

 On Thu, Sep 5, 2019 at 5:00 PM Ahmet Altay  wrote:

>
>
> On Thu, Sep 5, 2019 at 4:15 PM Thomas Weise  wrote:
>
>> The workload is quite different. What I have is streaming with state
>> and timers.
>>
>>
>>
>> On Thu, Sep 5, 2019 at 3:47 PM Pablo Estrada 
>> wrote:
>>
>>> We only recently started running Chicago Taxi Example. +Michał
>>> Walenia  I don't see it in the
>>> dashboards. Do you know if it's possible to see any trends in the data?
>>>
>>> We have a few tests running now:
>>> - Combine tests:
>>> https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373
>>> - GBK tests:
>>> https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373
>>>
>>> They don't seem to show a very drastic jump either, but they aren't
>>> very old.
>>>
>>> There is also work ongoing to add alerting for this sort of
>>> regressions by Kasia and Kamil (added). The work is not there yet (it's 
>>> in
>>> progress).
>>> Best
>>> -P.
>>>
>>> On Thu, Sep 5, 2019 at 3:35 PM Thomas Weise  wrote:
>>>
 It probably won't be practical to do a bisect due to the high cost
 of each iteration with our fork/deploy setup.

 Perhaps it is time to setup something with the synthetic source
 that works just with Beam as dependency.

>>>
> I agree with this.
>
> Pablo, Kasia, Kamil, does the new benchmarks give us a easy to use
> framework for using synthetic source in benchmarks?
>
>
>>
 On Thu, Sep 5, 2019 at 3:23 PM Ahmet Altay 
 wrote:

> There are a few in this dashboard [1], but not very useful in this
> case because they do not go back more than a month and not very
> comprehensive. I do not see a jump there. Thomas, would it be 
> possible to
> bisect to find what commit caused the regression?
>
> +Pablo Estrada  do we have any python on
> flink benchmarks for chicago example?
> +Alan Myrvold  +Yifan Zou
>  It would be good to have alerts on
> benchmarks. Do we have such an ability today?
>
> [1] https://apache-beam-testing.appspot.com/dashboard-admin
>
> On Thu, Sep 5, 2019 at 3:15 PM Thomas Weise 
> wrote:
>
>> Hi,
>>
>> Are there any performance tests run for the Python SDK as part of
>> release verification (or otherwise as well)?
>>
>> I see what appears to be a regression in master (compared to
>> 2.14) with our in-house application (~ 25% jump in cpu utilization 
>> and
>> corresponds drop in throughput).
>>
>> I wanted to see

Re: Possible Python SDK performance regression

2019-09-06 Thread Valentyn Tymofieiev
On Fri, Sep 6, 2019 at 6:23 PM Ahmet Altay  wrote:

>
>
> On Fri, Sep 6, 2019 at 6:17 PM Thomas Weise  wrote:
>
>>
>>
>> On Fri, Sep 6, 2019 at 2:24 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> +Mark Liu  has added some benchmarks running across
>>> multiple Python versions. Specifically we run 1 GB wordcount job on
>>> Dataflow runner on Python 2.7, 3.5-3.7. The benchmarks do not have
>>> configured alerting and to my knowledge are not actively monitored yet.
>>>
>>
>> Are there any benchmarks for streaming? Streaming and batch are quite
>> different runtime paths. And some of the issues can only be identified
>> with longer running processes through metrics. It would be good to verify
>> utilization of memory, cpu etc.
>>
>
Fully agree. I don't think we have a comprehensive set of streaming
benchmarks that span multiple python versions yet.

>
>> I additionally discovered that our 2.16 upgrade exhibits a memory leak in
>> the Python worker (Py 2.7).
>>
>
> Do you have more details on this one?
>
>
>>
>>
>>> Thomas, is it possible for you to do the bisection using SDK code from
>>> master at various commits to narrow down the regression on your end?
>>>
>>
>> I don't know how soon I will get to it. It's of course possible, but
>> expensive due to having to rebase the fork, build and deploy an entire
>> stack of stuff for each iteration. The pipeline itself is super simple. We
>> need this testbed as part of Beam. It would be nice to be able to pick an
>> update and have more confidence that the baseline has not slipped.
>>
>>
>>>
>>> [1]
>>> https://apache-beam-testing.appspot.com/explore?dashboard=5691127080419328
>>> [2]
>>> https://drive.google.com/file/d/1ERlnN8bA2fKCUPBHTnid1l__81qpQe2W/view
>>> [3]
>>> https://github.com/apache/beam/commit/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5
>>>
>>>
>>>
>>> On Fri, Sep 6, 2019 at 8:38 AM Ahmet Altay  wrote:
>>>
 +Valentyn Tymofieiev  do we have benchmarks in
 different python versions? Was there a recent change that is specific to
 python 3.x ?

 On Fri, Sep 6, 2019 at 8:36 AM Thomas Weise  wrote:

> The issue is only visible with Python 3.6, not 2.7.
>
> If there is a framework in place to add a streaming test, that would
> be great. We would use what we have internally as starting point.
>
> On Thu, Sep 5, 2019 at 5:00 PM Ahmet Altay  wrote:
>
>>
>>
>> On Thu, Sep 5, 2019 at 4:15 PM Thomas Weise  wrote:
>>
>>> The workload is quite different. What I have is streaming with state
>>> and timers.
>>>
>>>
>>>
>>> On Thu, Sep 5, 2019 at 3:47 PM Pablo Estrada 
>>> wrote:
>>>
 We only recently started running Chicago Taxi Example. +Michał
 Walenia  I don't see it in the
 dashboards. Do you know if it's possible to see any trends in the data?

 We have a few tests running now:
 - Combine tests:
 https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373
 - GBK tests:
 https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373

 They don't seem to show a very drastic jump either, but they aren't
 very old.

 There is also work ongoing to add alerting for this sort of
 regressions by Kasia and Kamil (added). The work is not there yet 
 (it's in
 progress).
 Best
 -P.

 On Thu, Sep 5, 2019 at 3:35 PM Thomas Weise  wrote:

> It probably won't be practical to do a bisect due to the high cost
> of each iteration with our fork/deploy setup.
>
> Perhaps it is time to setup something with the synthetic source
> that works just with Beam as dependency.
>

>> I agree with this.
>>
>> Pablo, Kasia, Kamil, does the new benchmarks give us a easy to use
>> framework for using synthetic source in benchmarks?
>>
>>
>>>
> On Thu, Sep 5, 2019 at 3:23 PM Ahmet Altay 
> wrote:
>
>> There are a few in this dashboard [1], but not very useful in
>> this case because they do not go back more than a month and not very
>> comprehensive. I do not see a jump there. Thomas, would it be 
>> possible to
>> bisect to find what commit caused the regression?
>>
>> +Pablo Estrada  do we have any python on
>> flink benchmarks for chicago example?
>> +Alan Myrvold  +Yifan Zou
>>  It would be good to have alerts on
>> benchmarks. Do we have such an ability today?
>>
>> [1] https://apache-beam-testing.appspot.com/dashboard-admin
>>
>> On Thu, Sep 5, 2019 at 3:15 PM Thomas Weise 
>> wrote:
>>
>>> Hi,
>>>
>>> Are there any performance tests run for the Python SDK as part
>

Re: Possible Python SDK performance regression

2019-09-06 Thread Thomas Weise
On Fri, Sep 6, 2019 at 6:23 PM Ahmet Altay  wrote:

>
>
> On Fri, Sep 6, 2019 at 6:17 PM Thomas Weise  wrote:
>
>>
>>
>> On Fri, Sep 6, 2019 at 2:24 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> +Mark Liu  has added some benchmarks running across
>>> multiple Python versions. Specifically we run 1 GB wordcount job on
>>> Dataflow runner on Python 2.7, 3.5-3.7. The benchmarks do not have
>>> configured alerting and to my knowledge are not actively monitored yet.
>>>
>>
>> Are there any benchmarks for streaming? Streaming and batch are quite
>> different runtime paths. And some of the issues can only be identified
>> with longer running processes through metrics. It would be good to verify
>> utilization of memory, cpu etc.
>>
>> I additionally discovered that our 2.16 upgrade exhibits a memory leak in
>> the Python worker (Py 2.7).
>>
>
> Do you have more details on this one?
>

Unfortunately only that at the moment. The workers eat up all memory and
eventually crash. Reverted back to 2.14 / Py 3.6 and the issue is gone.
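
A rough sketch of how steady memory growth could be confirmed in a
long-running Python process, using only the standard-library `tracemalloc`
module. This is an illustration under assumptions, not how the Beam worker
is instrumented: `leaky_step` and `clean_step` are hypothetical workloads,
and `tracemalloc` only sees allocations made through Python's allocators, so
a leak in native code would not show up this way.

```python
import tracemalloc
from typing import Callable, List


def sample_traced_heap(step: Callable[[], None],
                       iterations: int = 5,
                       rounds: int = 200) -> List[int]:
    """Run `step` in batches and record the traced heap size after each batch."""
    tracemalloc.start()
    try:
        samples = []
        for _ in range(iterations):
            for _ in range(rounds):
                step()
            current, _peak = tracemalloc.get_traced_memory()
            samples.append(current)
        return samples
    finally:
        tracemalloc.stop()


leaked = []


def leaky_step() -> None:
    # Hypothetical leak: every call retains another 1 KiB buffer.
    leaked.append(bytearray(1024))


def clean_step() -> None:
    # Allocates the same buffer but drops it immediately.
    bytearray(1024)


leaky = sample_traced_heap(leaky_step)
clean = sample_traced_heap(clean_step)

# Growth between the first and last sample of each run, in bytes.
leaky_delta = leaky[-1] - leaky[0]
clean_delta = clean[-1] - clean[0]
```

A heap that keeps growing across batches, as in the leaky run, is the
pattern to look for; `tracemalloc.take_snapshot()` and `compare_to()` can
then narrow it down to the allocating source lines.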


>
>
>>
>>
>>> Thomas, is it possible for you to do the bisection using SDK code from
>>> master at various commits to narrow down the regression on your end?
>>>
>>
>> I don't know how soon I will get to it. It's of course possible, but
>> expensive due to having to rebase the fork, build and deploy an entire
>> stack of stuff for each iteration. The pipeline itself is super simple. We
>> need this testbed as part of Beam. It would be nice to be able to pick an
>> update and have more confidence that the baseline has not slipped.
>>
>>
>>>
>>> [1]
>>> https://apache-beam-testing.appspot.com/explore?dashboard=5691127080419328
>>> [2]
>>> https://drive.google.com/file/d/1ERlnN8bA2fKCUPBHTnid1l__81qpQe2W/view
>>> [3]
>>> https://github.com/apache/beam/commit/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5
>>>
>>>
>>>
>>> On Fri, Sep 6, 2019 at 8:38 AM Ahmet Altay  wrote:
>>>
 +Valentyn Tymofieiev  do we have benchmarks in
 different python versions? Was there a recent change that is specific to
 python 3.x ?

 On Fri, Sep 6, 2019 at 8:36 AM Thomas Weise  wrote:

> The issue is only visible with Python 3.6, not 2.7.
>
> If there is a framework in place to add a streaming test, that would
> be great. We would use what we have internally as starting point.
>
> On Thu, Sep 5, 2019 at 5:00 PM Ahmet Altay  wrote:
>
>>
>>
>> On Thu, Sep 5, 2019 at 4:15 PM Thomas Weise  wrote:
>>
>>> The workload is quite different. What I have is streaming with state
>>> and timers.
>>>
>>>
>>>
>>> On Thu, Sep 5, 2019 at 3:47 PM Pablo Estrada 
>>> wrote:
>>>
 We only recently started running Chicago Taxi Example. +Michał
 Walenia  I don't see it in the
 dashboards. Do you know if it's possible to see any trends in the data?

 We have a few tests running now:
 - Combine tests:
 https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373
 - GBK tests:
 https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373

 They don't seem to show a very drastic jump either, but they aren't
 very old.

 There is also work ongoing to add alerting for this sort of
 regressions by Kasia and Kamil (added). The work is not there yet 
 (it's in
 progress).
 Best
 -P.

 On Thu, Sep 5, 2019 at 3:35 PM Thomas Weise  wrote:

> It probably won't be practical to do a bisect due to the high cost
> of each iteration with our fork/deploy setup.
>
> Perhaps it is time to setup something with the synthetic source
> that works just with Beam as dependency.
>

>> I agree with this.
>>
>> Pablo, Kasia, Kamil, does the new benchmarks give us a easy to use
>> framework for using synthetic source in benchmarks?
>>
>>
>>>
> On Thu, Sep 5, 2019 at 3:23 PM Ahmet Altay 
> wrote:
>
>> There are a few in this dashboard [1], but not very useful in
>> this case because they do not go back more than a month and not very
>> comprehensive. I do not see a jump there. Thomas, would it be 
>> possible to
>> bisect to find what commit caused the regression?
>>
>> +Pablo Estrada  do we have any python on
>> flink benchmarks for chicago example?
>> +Alan Myrvold  +Yifan Zou
>>  It would be good to have alerts on
>> benchmarks. Do we have such an ability today?
>>
>> [1] https://apache-beam-testing.appspot.com/dashboard-admin
>>
>> On Thu, Sep 5, 2019 at 3:15 PM Thomas Weise 
>> wrote:
>>
>>> Hi,
>>>
>>> Are there any performance tests run fo

Re: Possible Python SDK performance regression

2019-09-06 Thread Valentyn Tymofieiev
Sounds like these regressions need to be investigated ahead of the 2.16.0
release.

On Fri, Sep 6, 2019 at 6:44 PM Thomas Weise  wrote:

>
>
> On Fri, Sep 6, 2019 at 6:23 PM Ahmet Altay  wrote:
>
>>
>>
>> On Fri, Sep 6, 2019 at 6:17 PM Thomas Weise  wrote:
>>
>>>
>>>
>>> On Fri, Sep 6, 2019 at 2:24 PM Valentyn Tymofieiev 
>>> wrote:
>>>
 +Mark Liu  has added some benchmarks running
 across multiple Python versions. Specifically we run 1 GB wordcount job on
 Dataflow runner on Python 2.7, 3.5-3.7. The benchmarks do not have
 configured alerting and to my knowledge are not actively monitored yet.

>>>
>>> Are there any benchmarks for streaming? Streaming and batch are quite
>>> different runtime paths. And some of the issues can only be identified
>>> with longer running processes through metrics. It would be good to verify
>>> utilization of memory, cpu etc.
>>>
>>> I additionally discovered that our 2.16 upgrade exhibits a memory leak
>>> in the Python worker (Py 2.7).
>>>
>>
>> Do you have more details on this one?
>>
>
> Unfortunately only that at the moment. The workers eat up all memory and
> eventually crash. Reverted back to 2.14 / Py 3.6 and the issue is gone.
>
>
>>
>>
>>>
>>>
 Thomas, is it possible for you to do the bisection using SDK code from
 master at various commits to narrow down the regression on your end?

>>>
>>> I don't know how soon I will get to it. It's of course possible, but
>>> expensive due to having to rebase the fork, build and deploy an entire
>>> stack of stuff for each iteration. The pipeline itself is super simple. We
>>> need this testbed as part of Beam. It would be nice to be able to pick an
>>> update and have more confidence that the baseline has not slipped.
>>>
>>>

 [1]
 https://apache-beam-testing.appspot.com/explore?dashboard=5691127080419328
 [2]
 https://drive.google.com/file/d/1ERlnN8bA2fKCUPBHTnid1l__81qpQe2W/view
 [3]
 https://github.com/apache/beam/commit/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5



 On Fri, Sep 6, 2019 at 8:38 AM Ahmet Altay  wrote:

> +Valentyn Tymofieiev  do we have benchmarks in
> different python versions? Was there a recent change that is specific to
> python 3.x ?
>
> On Fri, Sep 6, 2019 at 8:36 AM Thomas Weise  wrote:
>
>> The issue is only visible with Python 3.6, not 2.7.
>>
>> If there is a framework in place to add a streaming test, that would
>> be great. We would use what we have internally as starting point.
>>
>> On Thu, Sep 5, 2019 at 5:00 PM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Thu, Sep 5, 2019 at 4:15 PM Thomas Weise  wrote:
>>>
 The workload is quite different. What I have is streaming with
 state and timers.



 On Thu, Sep 5, 2019 at 3:47 PM Pablo Estrada 
 wrote:

> We only recently started running Chicago Taxi Example. +Michał
> Walenia  I don't see it in the
> dashboards. Do you know if it's possible to see any trends in the data?
>
> We have a few tests running now:
> - Combine tests:
> https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373
> - GBK tests:
> https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373
>
> They don't seem to show a very drastic jump either, but they
> aren't very old.
>
> There is also work ongoing to add alerting for this sort of
> regressions by Kasia and Kamil (added). The work is not there yet
> (it's in progress).
> Best
> -P.
>
> On Thu, Sep 5, 2019 at 3:35 PM Thomas Weise 
> wrote:
>
>> It probably won't be practical to do a bisect due to the high
>> cost of each iteration with our fork/deploy setup.
>>
>> Perhaps it is time to setup something with the synthetic source
>> that works just with Beam as dependency.
>>
>
>>> I agree with this.
>>>
>>> Pablo, Kasia, Kamil, does the new benchmarks give us a easy to use
>>> framework for using synthetic source in benchmarks?
>>>
>>>

>> On Thu, Sep 5, 2019 at 3:23 PM Ahmet Altay 
>> wrote:
>>
>>> There are a few in this dashboard [1], but not very useful in
>>> this case because they do not go back more than a month and not very
>>> comprehensive. I do not see a jump there. Thomas, would it be possible
>>> to bisect to find what commit caused the regression?
>>>
>>> +Pablo Estrada  do we have any python on
>>> flink benchmarks for chicago example?
>>> +Alan Myrvold  +Yifan Zou
>>>  It would be good to have alerts on
>>> benchm

Re: [discuss] How we support our users on Slack / Mailing list / StackOverflow

2019-09-06 Thread Ahmet Altay
I agree Slack can be used by Beam users and it would be good to meet users
where they are. If I understand correctly, the issue Pablo is raising is
that there are not enough people online in Slack who can answer Python
questions. We also need to help people who ask questions and people who can
answer them find a common platform. Perhaps simply adding a subject in the
Slack chat rooms that suggests SO as an alternative place to ask questions
might improve the situation.

On Fri, Sep 6, 2019 at 5:45 PM Austin Bennett 
wrote:

> I see no reason slack can't be suitable for Beam users -- other open
> source projects do utilize Slack for user chatter, too.  Though what it
> could be is different from how currently used.  There are 173 accounts in
> #beam-python, and a decent portion of recent conversations (at quick
> glance) look like they are users asking for advice (which maybe should be
> pointed to the Google Cloud Slack account...)
>
> I suggest meet users wherever they are (don't abandon slack), but that is
> from a community standpoint.  If people are metrics focused, that might be
> harder in slack and/or we can find ways to measure things for those that
> have benchmarks to hit.  I am willing to dig into Slack's API if desired,
> to surface/forward messages as useful.  Not sure how all that would look,
> open to figure it out.
>
>
>
> On Fri, Sep 6, 2019 at 4:47 PM Ahmet Altay  wrote:
>
>> Both StackOverflow and mailing lists have better answer rates for python
>> questions. Suggesting either one of them makes sense. I also find
>> StackOverflow easier to use but that is a personal preference.  The
>> original problem is that lack of support within Slack. Both mailing list
>> and stackoverflow are already listed in the support page above Slack. How
>> are we going to redirect these folks from Slack to SO or ML?
>>
>> Also, what is the profile of people on slack in general. I had the
>> impression that it is more tuned for developer working on Beam to interact
>> rather than for users to ask Beam questions. Is this accurate?
>>
>> Ahmet
>>
>> On Fri, Sep 6, 2019 at 4:41 PM Kenneth Knowles  wrote:
>>
>>> +1 to StackOverflow first, though I'm not important for Beam Python
>>> users. Udi has a good point about discussions. If an SO question has a lot
>>> of back and forth, or no response, then it is good to point to other
>>> channels the user might try next.
>>>
>>> Kenn
>>>
>>> On Fri, Sep 6, 2019 at 2:20 PM Robert Bradshaw 
>>> wrote:
>>>
 I would also suggest SO as the best alternative, especially due to its
 indexability and searchability. If discussion is needed, the users
 list (my preference) or slack can be good options, and ideally the
 resolution is brought back to SO.

 On Fri, Sep 6, 2019 at 1:10 PM Udi Meiri  wrote:
 >
 > I don't go on Slack, but I will be notified of mentions. It has the
 advantage of being an informal space.
 > SO can feel just as intimidating as the mailing list IMO. Unlike the
 others, it doesn't lend itself very well to discussions (you can only post
 comments or answers).
 >
 >
 >
 > On Fri, Sep 6, 2019 at 10:55 AM Pablo Estrada 
 wrote:
 >>
 >> Hello all,
 >>
 >> THE SITUATION:
 >> It was brought to my attention recently that Python users in Slack
 are not getting much support, because most of the Beam Python-knowledgeable
 people are not on Slack. Unfortunately, in the Beam site, we do refer
 people to Slack for assistance[1].
 >>
 >> Java users do receive reasonable support, because there are enough
 Beam Java-knowledgeable people online, and willing to answer.
 >>
 >> On the other hand, at Google we do have a number of people who are
 responsible to answer questions on StackOverflow[2], and we do our best to
 answer promptly. I think we do a reasonable job overall.
 >>
 >> SO LET'S DISCUSS:
 >> How should we advise the community to ask questions about Beam?
 >> - Perhaps we should encourage people to try the mailing list first
 >> - Perhaps we should encourage people to try StackOverflow first
 >> - Perhaps we should write a bot that encourages Python users to go
 to StackOverflow
 >> - something else?
 >>
 >> My personal opinion is that a mailing list is not great: It's
 intimidating, it does not provide great indexing or searchability.
 >>
 >> WHAT I PROPOSE:
 >>
 >> I think explicitly encouraging everyone to go to StackOverflow first
 will be the best alternative: It's indexed, searchable, less intimidating
 than the mailing list. We can add that they can try Slack as well - without
 any guarantees.
 >>
 >> What do others think?
 >> -P.
 >>
 >> [1] https://beam.apache.org/community/contact-us/
 >> [2]
 https://stackoverflow.com/questions/tagged/apache-beam?tab=Newest

>>>


Re: Possible Python SDK performance regression

2019-09-06 Thread Ahmet Altay
I agree, let's investigate. Thomas, could you file JIRAs once you have
additional information?

Valentyn, I think the performance regression could be investigated now by
running whatever benchmarks are available against 2.14, 2.15 and head
to see whether the same regression can be reproduced.
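
As a rough illustration of that comparison, here is a minimal sketch in plain
Python. It is not an existing Beam tool. The function name, the 10% threshold
and the sample timings are all assumptions made up for the example; the real
numbers would come from running the same benchmark against each SDK version.

```python
import statistics


def find_regressions(timings, baseline, threshold=0.10):
    """Return {version: slowdown} for versions slower than the baseline.

    timings   -- dict mapping a version label to a list of runtimes (seconds)
    baseline  -- the version label to compare against (e.g. "2.14")
    threshold -- relative slowdown that counts as a regression (assumed 10%)
    """
    base = statistics.median(timings[baseline])
    regressions = {}
    for version, samples in timings.items():
        if version == baseline:
            continue
        # Relative slowdown of this version's median versus the baseline median.
        slowdown = statistics.median(samples) / base - 1.0
        if slowdown > threshold:
            regressions[version] = slowdown
    return regressions


# Hypothetical runtimes in seconds; replace with real benchmark results.
timings = {
    "2.14": [100, 102, 98],
    "2.15": [101, 99, 103],
    "head": [130, 128, 135],
}
regressions = find_regressions(timings, baseline="2.14")
```

Using the median rather than the mean keeps a single noisy run from
triggering a false alarm. A version that shows up in the result would be the
starting point for a bisect.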

On Fri, Sep 6, 2019 at 7:11 PM Valentyn Tymofieiev 
wrote:

> Sounds like these regressions need to be investigated ahead of 2.16.0
> release.
>
> On Fri, Sep 6, 2019 at 6:44 PM Thomas Weise  wrote:
>
>>
>>
>> On Fri, Sep 6, 2019 at 6:23 PM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Fri, Sep 6, 2019 at 6:17 PM Thomas Weise  wrote:
>>>


 On Fri, Sep 6, 2019 at 2:24 PM Valentyn Tymofieiev 
 wrote:

> +Mark Liu  has added some benchmarks running
> across multiple Python versions. Specifically we run 1 GB wordcount job on
> Dataflow runner on Python 2.7, 3.5-3.7. The benchmarks do not have
> configured alerting and to my knowledge are not actively monitored yet.
>

 Are there any benchmarks for streaming? Streaming and batch are quite
 different runtime paths. And some of the issues can only be identified
 with longer running processes through metrics. It would be good to verify
 utilization of memory, cpu etc.

 I additionally discovered that our 2.16 upgrade exhibits a memory leak
 in the Python worker (Py 2.7).

>>>
>>> Do you have more details on this one?
>>>
>>
>> Unfortunately only that at the moment. The workers eat up all memory and
>> eventually crash. Reverted back to 2.14 / Py 3.6 and the issue is gone.
>>
>>
>>>
>>>


> Thomas, is it possible for you to do the bisection using SDK code from
> master at various commits to narrow down the regression on your end?
>

 I don't know how soon I will get to it. It's of course possible, but
 expensive due to having to rebase the fork, build and deploy an entire
 stack of stuff for each iteration. The pipeline itself is super simple. We
 need this testbed as part of Beam. It would be nice to be able to pick an
 update and have more confidence that the baseline has not slipped.


>
> [1]
> https://apache-beam-testing.appspot.com/explore?dashboard=5691127080419328
> [2]
> https://drive.google.com/file/d/1ERlnN8bA2fKCUPBHTnid1l__81qpQe2W/view
> [3]
> https://github.com/apache/beam/commit/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5
>
>
>
> On Fri, Sep 6, 2019 at 8:38 AM Ahmet Altay  wrote:
>
>> +Valentyn Tymofieiev  do we have benchmarks in
>> different python versions? Was there a recent change that is specific to
>> python 3.x ?
>>
>> On Fri, Sep 6, 2019 at 8:36 AM Thomas Weise  wrote:
>>
>>> The issue is only visible with Python 3.6, not 2.7.
>>>
>>> If there is a framework in place to add a streaming test, that would
>>> be great. We would use what we have internally as starting point.
>>>
>>> On Thu, Sep 5, 2019 at 5:00 PM Ahmet Altay  wrote:
>>>


 On Thu, Sep 5, 2019 at 4:15 PM Thomas Weise  wrote:

> The workload is quite different. What I have is streaming with
> state and timers.
>
>
>
> On Thu, Sep 5, 2019 at 3:47 PM Pablo Estrada 
> wrote:
>
>> We only recently started running Chicago Taxi Example. +Michał
>> Walenia  I don't see it in the
>> dashboards. Do you know if it's possible to see any trends in the data?
>>
>> We have a few tests running now:
>> - Combine tests:
>> https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373
>> - GBK tests:
>> https://apache-beam-testing.appspot.com/explore?dashboard=5763764733345792&widget=201943890&container=1334074373
>>
>> They don't seem to show a very drastic jump either, but they
>> aren't very old.
>>
>> There is also work ongoing to add alerting for this sort of
>> regressions by Kasia and Kamil (added). The work is not there yet
>> (it's in progress).
>> Best
>> -P.
>>
>> On Thu, Sep 5, 2019 at 3:35 PM Thomas Weise 
>> wrote:
>>
>>> It probably won't be practical to do a bisect due to the high
>>> cost of each iteration with our fork/deploy setup.
>>>
>>> Perhaps it is time to setup something with the synthetic source
>>> that works just with Beam as dependency.
>>>
>>
 I agree with this.

 Pablo, Kasia, Kamil, does the new benchmarks give us a easy to use
 framework for using synthetic source in benchmarks?


>
>>> On Thu, Sep 5, 2019 at 3:23 PM Ahmet Altay 
>>> wrote:
>>>
 There are a few in this dashb