SDK Harness Memory Usage

2022-12-08 Thread Arwin Tio via dev
Hi Beam Team,

Can somebody help me understand what are the factors behind SDK Harness
memory usage? My first guess is that the SDK Harness memory usage depends
on:

1. User code (i.e. DoFns)
2. Bundle size

Basically, the maximum memory usage an SDK Harness needs is however much
memory it takes for the user DoFn to process the largest bundle size. And
the bundle size is determined by the Runner. So to limit SDK Harness memory
usage, we have to ensure that our Runner selects small bundle sizes.

However, looking through some design docs and the code, it seems like:

   - sdk_worker.py seems to have multiple active bundle processors at the
     same time
   - The "Fn API: How to send and receive data" design doc seems to describe
     multiplexing multiple logical streams over a gRPC connection

Does this mean that the SDK Harnesses process multiple bundles at the same
time? If so, how is the number of concurrent bundles limited?

Or in general, what suggestions do you have to reduce memory usage of SDK
Harnesses?

Thanks,

Arwin
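
If a harness does process several bundles at once, its peak memory is roughly the per-bundle footprint multiplied by the number of bundles in flight, so one generic way to bound it is to cap that concurrency. The following is only a minimal sketch of that idea using a fixed-size thread pool. It is not how sdk_worker.py is implemented, and `MAX_CONCURRENT_BUNDLES`, `process_bundle`, and `run_bundles` are hypothetical names introduced for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_BUNDLES = 2  # hypothetical cap, not a Beam setting


class ConcurrencyTracker:
    """Records the highest number of bundles in flight at once."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.active -= 1
        return False


def process_bundle(bundle, tracker):
    # Stand-in for running a DoFn over every element of one bundle.
    with tracker:
        time.sleep(0.01)
        return sum(bundle)


def run_bundles(bundles, max_concurrent=MAX_CONCURRENT_BUNDLES):
    # The pool size is the only thing limiting how many bundles overlap.
    tracker = ConcurrencyTracker()
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        results = list(pool.map(lambda b: process_bundle(b, tracker), bundles))
    return results, tracker.peak


results, peak = run_bundles([[1, 2], [3, 4], [5, 6], [7, 8]])
print(results, "peak bundles in flight:", peak)
```

With a cap like this, the worst case is bounded by the cap times the largest bundle's footprint, whatever the total number of bundles submitted.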



Re: Achievement unlocked: fully triaged

2022-12-08 Thread Kenneth Knowles
Merged it. Please be on the lookout for bugs I have introduced, since they
could result in issues slipping through the cracks.

On Wed, Dec 7, 2022 at 3:31 PM Kenneth Knowles  wrote:

> OK I did my best in https://github.com/apache/beam/pull/24585
>
> Yet another thought I had: Do we really need multiple templates? Do people
> like the tags in the titles? After creation all these things are "just
> labels" so there's no particular reason we need to conceptualize issues
> as partitioned into these classes, unless it is helpful.
>
> Kenn
>
> On Tue, Dec 6, 2022 at 12:28 PM Robert Burke  wrote:
>
>> +1 to letting conjunctions form naturally.
>>
>> In the bikeshedding discussion:
>>
>> That would mean I'm biased toward having the reduced label set use
>> reduced colours for the general category.
>>
>> E.g. SDK colour, Runner colour, beam resources colour, and IO being its
>> own special unique colour, and awaiting triage being unique as well.
>>
>> This would make triage checking a bit more glanceable, except for
>> very particular issues that might warrant "several SDKs" or "several
>> runners".
>>
>> On Tue, Dec 6, 2022, 11:13 AM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> I like that idea (and the list) as well.
>>>
>>> On Tue, Dec 6, 2022 at 1:59 PM Kerry Donny-Clark 
>>> wrote:
>>>
>>>> I really like the idea of multi-select and automatic "awaiting triage".
>>>> Kenn, I think the list you have looks good to me.
>>>>
>>>> On Tue, Dec 6, 2022 at 1:55 PM Kenneth Knowles  wrote:

> Noting that what you've listed are the options in the issue template,
> which are then expanded to multiple labels. So focusing on the issue
> template, I like the general idea, but maybe we can simplify it even more:
>
> When a user is filing a bug, I think a good outcome is for it to get
> into the right person's saved search (like Go, Python, etc) while still
> having the "awaiting triage" label on it.
>
> What if we just went all the way simple and had checkboxes for just
> the highest level. Something like the following:
>
> Which language SDK or feature is related to your report? (check all
> that apply)
> [ ] Python
> [ ] Java
> [ ] Go
> [ ] Typescript
> [ ] IO connector
> [ ] Beam examples
> [ ] Beam playground
> [ ] Beam katas
> [ ] Website
> [ ] Spark Runner
> [ ] Flink Runner
> [ ] Samza Runner
> [ ] Twister2 Runner
> [ ] Hazelcast Jet Runner
> [ ] Google Cloud Dataflow Runner
>
> We could even trim it even further to just language, and let the
> person doing triage handle the rest.
>
> Kenn
>
> On Tue, Dec 6, 2022 at 9:11 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> > Is it possible to not have a default option?
>>
>> Sadly, no AFAIK. I agree this would help. We could try things like
>> making the default " " and auto-closing issues that don't pick something
>> other than the default, but that's a pretty rough experience and not
>> worth it IMO.
>>
>> > I definitely think reducing the label zoo could help.
>>
>> What's our desired end state here? I put together a doc with my
>> suggested labels -
>> https://docs.google.com/document/d/1FpaFr_Sdg217ogd5oMDRX4uLIMSatKLF_if9CzLg9tM/edit?usp=sharing
>> - listed below as well for convenience. Please comment in the doc if you
>> have thoughts/labels you care about, or continue the email thread if you
>> have bigger ideas (e.g. getting rid of labels, changing our templates
>> entirely instead, etc...).
>>
>> *Danny's Proposed Labels:*
>>
>>
>>    - beam-community
>>    - beam-playground
>>    - community-metrics
>>    - cross-language
>>    - examples-java
>>    - examples-python
>>    - extensions
>>    - infrastructure
>>    - io-go
>>    - io-ideas
>>    - io-java
>>    - io-py
>>    - katas
>>    - release
>>    - run-inference
>>    - runner
>>    - runner-dataflow
>>    - runner-direct
>>    - runner-flink
>>    - runner-samza
>>    - runner-spark
>>    - runner-universal
>>    - sdk-go
>>    - sdk-ideas
>>    - sdk-java
>>    - sdk-py
>>    - sdk-typescript
>>    - test-failures
>>    - website
>>
>>
>> On Tue, Dec 6, 2022 at 11:17 AM Bjorn Pedersen <
>> bjornpeder...@g
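
The thread describes template checkboxes being expanded into labels, with "awaiting triage" applied automatically. A rough sketch of that expansion might look like the following. The label names come from the proposed list in the thread, but the pairing of checkbox text to label, the `CHECKBOX_TO_LABEL` table, and the `labels_for` helper are assumptions made for illustration, not the project's actual automation.

```python
import re

# Hypothetical mapping from template checkbox text to labels. The label
# names appear in the proposed list; the pairing itself is an assumption.
CHECKBOX_TO_LABEL = {
    "Python": "sdk-py",
    "Java": "sdk-java",
    "Go": "sdk-go",
    "Typescript": "sdk-typescript",
    "Website": "website",
    "Spark Runner": "runner-spark",
    "Flink Runner": "runner-flink",
    "Samza Runner": "runner-samza",
    "Google Cloud Dataflow Runner": "runner-dataflow",
}

# Matches a ticked checkbox such as "[x] Python" or "- [X] Python".
CHECKED = re.compile(r"^\s*(?:[-*]\s*)?\[[xX]\]\s*(.+?)\s*$")


def labels_for(issue_body):
    """Return the labels for an issue body, always starting with triage."""
    labels = ["awaiting triage"]
    for line in issue_body.splitlines():
        match = CHECKED.match(line)
        if match and match.group(1) in CHECKBOX_TO_LABEL:
            labels.append(CHECKBOX_TO_LABEL[match.group(1)])
    return labels


body = "[x] Python\n[ ] Java\n[X] Flink Runner"
print(labels_for(body))
```

Because several boxes can be ticked, an issue lands in every matching saved search while still carrying the triage label.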

Beam Dependency Check Report (2022-12-08)

2022-12-08 Thread Apache Jenkins Server


Re: Gradle Task Configuration Avoidance

2022-12-08 Thread Daniel Collins via dev
We could probably add a lint fairly easily that rejects the spelling
`task("`; that would catch most of these.
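
As a rough illustration of such a lint, the sketch below scans build script text for the `task("` spelling and reports the offending lines. It is only a sketch under assumptions: the regular expression, the `find_eager_tasks` helper, and the sample task names are hypothetical, and it only catches this one spelling, not every way of creating a task eagerly.

```python
import re

# Matches the eager spelling task("name") or task('name').
EAGER_TASK = re.compile(r"""\btask\s*\(\s*["']""")


def find_eager_tasks(build_script):
    """Return (line_number, line) pairs that use the eager task("...") form."""
    hits = []
    for number, line in enumerate(build_script.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("//"):
            continue  # ignore single-line comments
        if EAGER_TASK.search(stripped):
            hits.append((number, stripped))
    return hits


# Hypothetical build.gradle contents used only to exercise the check.
sample = '''
task("terraformInit") {
    doLast { println "init" }
}
tasks.register("terraformPlan") {
    doLast { println "plan" }
}
// task("commentedOut")
'''

for number, line in find_eager_tasks(sample):
    print(f"line {number}: {line}")
```

A check like this could fail a precommit whenever the returned list is non-empty.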

On Thu, Dec 8, 2022 at 11:34 AM Luke Cwik via dev 
wrote:

> I have found the Gradle build reports very useful for enumerating
> deprecations, and easier to read than the command-line output.
>
> On Thu, Dec 8, 2022 at 8:26 AM Damon Douglas via dev 
> wrote:
>
>> Thank you, Kerry, for your kind and encouraging words!
>>
>> Kenn, I wondered as well whether there exist proactive options.  I know
>> that gradle will warn of soon-to-be deprecated syntax in the build.gradle
>> files when executing gradle tasks on the command-line.  Perhaps we can
>> start there.  Not to sound cliche, but with any process improvement,
>> awareness is the first step.
>>
>> On Mon, Dec 5, 2022 at 3:54 PM Kenneth Knowles  wrote:
>>
>>> Nice!
>>>
>>> I believe at some point in the past we made a pass to try to convert our
>>> stuff to this model. I wonder if we can prevent it proactively somehow,
>>> like disabling the legacy way of creating tasks or something.
>>>
>>> Kenn
>>>
>>> On Mon, Dec 5, 2022 at 6:25 AM Kerry Donny-Clark via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> Thanks Damon! I really appreciate how clear your emails are here.
>>>> Instead of my usual feeling of "I don't quite understand, and don't have
>>>> time to get context" I can read all the context in the mail.
>>>> This error message had confused me, so I really appreciate the cleanup
>>>> and explanation.
>>>>
>>>> On Fri, Dec 2, 2022, 7:28 PM Damon Douglas via dev
>>>> wrote:

> Hello Everyone,
>
> *If you are new to Beam and coming from non-Java language conventions,
> it is likely you are new to gradle.  At the end of this email is a list of
> definitions and references to help understand this email.*
>
> *Short Version (For those who know gradle)*:
> A pull request [1] may fix the continual error message "Error: Backend
> initialization required, please run "terraform init"".  The PR applies
> Task Configuration Avoidance [2] by changing a few tasks from
> task(String) to tasks.register(String).
>
> *Long Version (For those who are not as familiar with gradle)*:
>
> I write this not as an expert but as someone still learning.  Gradle
> [3] is the software we use in the Beam repository to automate many of the
> tasks associated with building and testing code.  It is typically used in
> Java projects but can be extended for other purposes.  For example, the
> code for our Beam Playground [4] also uses gradle, though it is not mainly
> a Java project.  The unit of work for Gradle is called a task.  To run a
> task, you open a terminal and type "./gradlew nameOfMyTask".  There are
> two main ways to create a custom task in our build.gradle files.  One is
> writing task("doSomething") and the other is
> tasks.register("doSomethingElse").  According to [2], the recommendation
> is to use tasks.register("doSomething").  This defers the task's
> configuration work until one runs the doSomething task or another task we
> are running depends on it.
>
> So why were we seeing this "Error: Backend initialization required"
> message all the time?  The reason is that the tasks were created with
> task("doSomething"), so their configuration ran on every gradle
> invocation, even when an unrelated task was requested.  All I had to do
> was change this to tasks.register("doSomething") and the message went
> away.
>
> *Definitions/References*
>
> 1. https://github.com/apache/beam/pull/24509
> 2. https://docs.gradle.org/current/userguide/task_configuration_avoidance.html
> 3. https://docs.gradle.org/current/userguide/what_is_gradle.html
> 4. https://play.beam.apache.org/
>
> *Suggested Learning Path To Understand This Email*
> 1. https://docs.gradle.org/current/samples/sample_building_java_libraries.html
> 2. https://docs.gradle.org/current/userguide/build_lifecycle.html
> 3. https://docs.gradle.org/current/userguide/tutorial_using_tasks.html
> 4. https://docs.gradle.org/current/userguide/task_configuration_avoidance.html
>
> Best,
>
> Damon
>
>
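
The difference between the two ways of creating a task described in the message above can be illustrated with a small analogy. The sketch below is plain Python, not Gradle's API: `TaskContainer`, `create`, and `register` are hypothetical stand-ins chosen to mirror eager creation versus lazy registration, and the task names are made up.

```python
class TaskContainer:
    """Toy stand-in for a build tool's task container (not Gradle's API)."""

    def __init__(self):
        self._configured = {}
        self._pending = {}

    def create(self, name, configure):
        # Eager: the configuration callback runs immediately.
        self._configured[name] = configure()

    def register(self, name, configure):
        # Lazy: remember the callback and run it only when needed.
        self._pending[name] = configure

    def run(self, name):
        if name not in self._configured:
            self._configured[name] = self._pending.pop(name)()
        return self._configured[name]()


log = []


def configure_terraform():
    log.append("configuring terraform")
    return lambda: "terraform ran"


def configure_build():
    log.append("configuring build")
    return lambda: "build ran"


# Eager creation configures every task, even ones that are never run.
eager = TaskContainer()
eager.create("terraform", configure_terraform)
eager.create("build", configure_build)
eager_result = eager.run("build")
eager_log = list(log)

# Lazy registration configures only the task that is actually requested.
log.clear()
lazy = TaskContainer()
lazy.register("terraform", configure_terraform)
lazy.register("build", configure_build)
lazy_result = lazy.run("build")
lazy_log = list(log)

print(eager_log)
print(lazy_log)
```

In the eager case the terraform configuration runs even though only the build task was requested, which is the kind of side effect that registration avoids.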


Re: Gradle Task Configuration Avoidance

2022-12-08 Thread Luke Cwik via dev
I have found the Gradle build reports very useful for enumerating
deprecations, and easier to read than the command-line output.



Re: Gradle Task Configuration Avoidance

2022-12-08 Thread Damon Douglas via dev
Thank you, Kerry, for your kind and encouraging words!

Kenn, I wondered as well whether there exist proactive options.  I know
that gradle will warn of soon-to-be deprecated syntax in the build.gradle
files when executing gradle tasks on the command-line.  Perhaps we can
start there.  Not to sound cliche, but with any process improvement,
awareness is the first step.



Beam High Priority Issue Report (37)

2022-12-08 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/24415 [Bug]: Cannot find a matching 
Calcite SqlTypeName for Beam type: LOGICAL_TYPE seen in 2.44.0 SNAPSHOT
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24383 [Bug]: Daemon will be stopped at 
the end of the build after the daemon was no longer found in the daemon registry
https://github.com/apache/beam/issues/24367 [Bug]: workflow.tar.gz cannot be 
passed to flink runner
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/24267 [Failing Test]: Timeout waiting to 
lock gradle
https://github.com/apache/beam/issues/24263 [Bug]: Remote call on 
apache-beam-jenkins-3 failed. The channel is closing down or has closed down
https://github.com/apache/beam/issues/23944  beam_PreCommit_Python_Cron 
regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23286 [Bug]: 
beam_PerformanceTests_InfluxDbIO_IT Flaky > 50 % Fail 
https://github.com/apache/beam/issues/22969 Discrepancy in behavior of 
`DoFn.process()` when `yield` is combined with `return` statement, or vice versa
https://github.com/apache/beam/issues/22961 [Bug]: WriteToBigQuery silently 
skips most of records without job fail
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/22321 
PortableRunnerTestWithExternalEnv.test_pardo_large_input is regularly failing 
on jenkins
https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't get output 
to Failed Inserts PCollection
https://github.com/apache/beam/issues/21695 DataflowPipelineResult does not 
raise exception for unsuccessful states.
https://github.com/apache/beam/issues/21561 
ExternalPythonTransformTest.trivialPythonTransform flaky
https://github.com/apache/beam/issues/21480 flake: 
FlinkRunnerTest.testEnsureStdoutStdErrIsRestored
https://github.com/apache/beam/issues/21474 Flaky tests: Gradle build daemon 
disappeared unexpectedly
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21462 Flake in 
org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in use
https://github.com/apache/beam/issues/21333 Flink testParDoRequiresStableInput 
flaky
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21261 
org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
 is flaky
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey
https://github.com/apache/beam/issues/21113 
testTwoTimersSettingEachOtherWithCreateAsInputBounded flaky
https://github.com/apache/beam/issues/20976 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky
https://github.com/apache/beam/issues/20975 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky
https://github.com/apache/beam/issues/20974 Python GHA PreCommits flake with 
grpc.FutureTimeoutError on SDK harness startup
https://github.com/apache/beam/issues/20689 Kafka commitOffsetsInFinalize OOM 
on Flink
https://github.com/apache/beam/issues/20108 Python direct runner doesn't emit 
empty pane when it should
https://github.com/apache/beam/issues/19814 Flink streaming flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
https://github.com/apache/beam/issues/19734 
WatchTest.testMultiplePollsWithManyResults flake: Outputs must be in timestamp 
order (sickbayed)
https://github.com/apache/beam/issues/19465 Explore possibilities to lower 
in-use IP address quota foot