Re: Custom Inference Fns in RunInference

2022-10-04 Thread Jack McCluskey via dev
Thank you to everyone who chimed in on the doc! The discussion was very
productive, and I have since updated the design doc with some more detail
based on feedback and some suggestions. Any and all feedback is welcome!

Thanks,

Jack McCluskey

On Fri, Sep 16, 2022 at 2:45 PM Jack McCluskey 
wrote:

> Hey everyone,
>
> I'm back with a brief design doc discussing ways that users could provide
> custom inference functions for RunInference model handlers, which is
> available at
>  
> https://docs.google.com/document/d/1YYGsF20kminz7j9ifFdCD5WQwVl8aTeCo0cgPjbdFNU/edit?usp=sharing
> 
>  now.
>
> It's not a huge code change or a significantly long doc, but it's
> establishing a convention for model handlers moving forward and that
> warrants some discussion.
>
> Thanks,
>
> Jack McCluskey
>
> --
>
>
> Jack McCluskey
> SWE - DataPLS PLAT/ Beam Go
> RDU
> jrmcclus...@gmail.com
>
>
>


Help on Apache Beam Pipeline Optimization

2022-10-04 Thread Yi En Ong
Hi,


I am trying to optimize my Apache Beam pipeline on Google Cloud Platform
Dataflow, and I would really appreciate your help and advice.


Background information: I am trying to read data from PubSub Messages, and
aggregate them based on 3 time windows: 1 min, 5 min and 60 min. Such
aggregations consists of summing, averaging, finding the maximum or
minimum, etc. For example, for all data collected from 1200 to 1201, I want
to aggregate them and write the output into BigTable's 1-min column family.
And for all data collected from 1200 to 1205, I want to similarly aggregate
them and write the output into BigTable's 5-min column. Same goes for 60min.


The current approach I took is to have 3 separate dataflow jobs (i.e. 3
separate Beam Pipelines), each one having a different window duration
(1min, 5min and 60min). See
https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/windowing/Window.html.
And the outputs of all 3 dataflow jobs are written to the same BigTable,
but on different column families. Other than that, the function and
aggregations of the data are the same for the 3 jobs.


However, this seems to be very computationally inefficient, and cost
inefficient, as the 3 jobs are essentially doing the same function, with
the only exception being the window time duration and output column family.


Some challenges and limitations we faced was that from the documentation,
it seems like we are unable to create multiple windows of different periods
in a singular dataflow job. Also, when we write the final data into big
table, we would have to define the table, column family, column, and
rowkey. And unfortunately, the column family is a fixed property (i.e. it
cannot be redefined or changed given the window period).


Hence, I am writing to ask if there is a way to only use 1 dataflow job
that fulfils the objective of this project? Which is to aggregate data on
different window periods, and write them to different column families of
the same BigTable.


Thank you


Re: Beam Website Feedback

2022-10-04 Thread Brian Hulette via dev
On Tue, Oct 4, 2022 at 8:58 AM Alexey Romanenko 
wrote:

> Thanks for your feedback.
>
> At the time, using a Google website search was a simplest solution since,
> before, we didn’t have a search at all. I agree that it could be
> frustrating to have ad links before the actual results (not sure that we
> can avoid them there) but "it is what it is” and it's still possible to
> have the correct links further which is better than nothing.
>
> Beam community is always welcome for suggestions and, especially,
> contributions to improve the project in any possible way. I’d be happy to
> assist on this topic if someone will decide to improve Beam website search.
>

+1, PRs welcome :)
I put some specific suggestions for a replacement in the issue, based on
recommendations from the hugo docs [1].

[1] https://gohugo.io/tools/search/


>
> —
> Alexey
>
> On 3 Oct 2022, at 23:21, Borris  wrote:
>
> This is my experience of trying the search capability.
>
>- I know I want to read about dataframes (I was reading this 10
>minutes ago but browsing history didn't take me back to where I wanted)
>- I search for "dataframes"
>- I am presented with a whole load of pages that are elsewhere (other
>sites) - maybe what I want is some pages below, but I stop at this point as
>I think its a fundamental failure of what I expect from the search dialogue
>- If I enter "beam.apache.org: dataframe" to the search dialogue then
>the sensible relevant page is now visible, only 5 links down
>- I know this may be a penalty of getting a "free" search service from
>your viewpoint
>- But from my viewpoint this is a failure. Your search capability
>fails to understand that by searching for something on your site, rather
>than generically through a search engine, I am massively predisposed to the
>pages on your site, whereas the search results are more predisposed to
>offering advertising opportunities.
>- It is very frustrating that something as simple as, on the Beam
>site, going to the page about Beam Dataframes takes such a level of hoop
>jumping
>
> That is my feedback offering. Thank you for taking the time to read it.
>
>
>
>
>


Re: [VOTE] Release 2.42.0, release candidate #1

2022-10-04 Thread Robert Burke
And the missing version substitutions are
Gradle 7.5.1
JDK Version: AdoptOpen JDK 1.8.0_292

On Tue, Oct 4, 2022, 9:01 AM Robert Burke  wrote:

> Agreed that the Python results appeared to be missing. The comment history
> indicates they were invoked however (and they appear in the Mass Comment
> script).
>
> Re-runs are failing for either unclear reasons, timeouts or other apparent
> infrastructure flakes.
>
> While we have 3 binding +1 votes for RC1, please hold while this is looked
> into, it may require an RC2 to resolve.
>
> On Mon, Oct 3, 2022, 8:15 PM Ahmet Altay via dev 
> wrote:
>
>> +1 (binding) - I validated python quick starts on direct runner.
>>
>> Thank you for working on the release!
>>
>> Ahmet
>>
>> On Mon, Oct 3, 2022 at 9:06 AM Valentyn Tymofieiev via dev <
>> dev@beam.apache.org> wrote:
>>
>>> I validated that Dataflow and Beam Python containers have dependencies
>>> that match Beam requirements.
>>>
>>> I came across https://github.com/apache/beam/pull/23200 - there are
>>> failed tests and I don't see test results for Python PostCommit suites. Do
>>> you know what's the status of both?
>>>
>>> Minor nits: missing substitution in  * Java artifacts were built with
>>> Gradle GRADLE_VERSION and OpenJDK/Oracle JDK JDK_VERSION.
>>>
>>> Thanks!
>>>
>>>
>>>
>>> On Mon, Oct 3, 2022 at 7:21 AM Ritesh Ghorse via dev <
>>> dev@beam.apache.org> wrote:
>>>
 +1 (non-binding)
 Validated Go SDK Quickstart on Direct and Dataflow runner


 On Mon, Oct 3, 2022 at 9:38 AM Alexey Romanenko <
 aromanenko@gmail.com> wrote:

> +1 (binding)
>
> Tested with  https://github.com/Talend/beam-samples/
> (Java SDK v8 & v11, Spark 3 runner).
>
> ---
> Alexey
>
> On 3 Oct 2022, at 14:32, Chamikara Jayalath via dev <
> dev@beam.apache.org> wrote:
>
> +1 (binding)
>
> Verified checksums and signatures of artifacts.
> Validated some multi-language pipelines.
>
> Thanks,
> Cham
>
> On Thu, Sep 29, 2022 at 6:12 PM Robert Burke via dev <
> dev@beam.apache.org> wrote:
>
>> Hi everyone,
>> Please review and vote on the release candidate #1 for the version
>> 2.42.0, as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>> Reviewers are encouraged to test their own use cases with the release
>> candidate, and vote +1 if no issues are found.
>>
>> The complete staging area is available for your review, which
>> includes:
>> * GitHub Release notes [1],
>> * the official Apache source release to be deployed to
>> dist.apache.org [2], which is signed with the key with fingerprint
>> A52F5C83BAE26160120EC25F3D56ACFBFB2975E1 [3],
>> * all artifacts to be deployed to the Maven Central Repository [4],
>> * source code tag "v2.42.0-RC1" [5],
>> * website pull request listing the release [6], the blog post [6],
>> and publishing the API reference manual [7].
>> * Java artifacts were built with Gradle GRADLE_VERSION and
>> OpenJDK/Oracle JDK JDK_VERSION.
>> * Python artifacts are deployed along with the source release to the
>> dist.apache.org [2] and PyPI [8]
>> * Go Package information and SDK RC  [9]
>> * Validation sheet with a tab for 2.42.0 release to help with
>> validation [10].
>> * Docker images published to Docker Hub [11].
>>
>> The vote will be open for at least 72 hours. It is adopted by
>> majority approval, with at least 3 PMC affirmative votes.
>>
>> For guidelines on how to try the release in your projects, check out
>> our blog post at https://beam.apache.org/blog/validate-beam-release/.
>>
>> Thanks,
>> Robert Burke
>> 2.42.0 Release Manager
>>
>> [1] https://github.com/apache/beam/milestone/4
>> [2] https://dist.apache.org/repos/dist/dev/beam/2.42.0/
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1285/
>> [5] https://github.com/apache/beam/tree/v2.42.0-RC1
>> [6] https://github.com/apache/beam/pull/23406
>> [7] https://github.com/apache/beam-site/pull/634
>> [8] https://pypi.org/project/apache-beam/2.42.0rc1/
>> [9]
>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.42.0-RC1/go/pkg/beam
>>
>> [10]
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=265602293
>> [11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>
>>
>


Re: [VOTE] Release 2.42.0, release candidate #1

2022-10-04 Thread Robert Burke
Agreed that the Python results appeared to be missing. The comment history
indicates they were invoked however (and they appear in the Mass Comment
script).

Re-runs are failing for either unclear reasons, timeouts or other apparent
infrastructure flakes.

While we have 3 binding +1 votes for RC1, please hold while this is looked
into, it may require an RC2 to resolve.

On Mon, Oct 3, 2022, 8:15 PM Ahmet Altay via dev 
wrote:

> +1 (binding) - I validated python quick starts on direct runner.
>
> Thank you for working on the release!
>
> Ahmet
>
> On Mon, Oct 3, 2022 at 9:06 AM Valentyn Tymofieiev via dev <
> dev@beam.apache.org> wrote:
>
>> I validated that Dataflow and Beam Python containers have dependencies
>> that match Beam requirements.
>>
>> I came across https://github.com/apache/beam/pull/23200 - there are
>> failed tests and I don't see test results for Python PostCommit suites. Do
>> you know what's the status of both?
>>
>> Minor nits: missing substitution in  * Java artifacts were built with
>> Gradle GRADLE_VERSION and OpenJDK/Oracle JDK JDK_VERSION.
>>
>> Thanks!
>>
>>
>>
>> On Mon, Oct 3, 2022 at 7:21 AM Ritesh Ghorse via dev 
>> wrote:
>>
>>> +1 (non-binding)
>>> Validated Go SDK Quickstart on Direct and Dataflow runner
>>>
>>>
>>> On Mon, Oct 3, 2022 at 9:38 AM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
 +1 (binding)

 Tested with  https://github.com/Talend/beam-samples/
 (Java SDK v8 & v11, Spark 3 runner).

 ---
 Alexey

 On 3 Oct 2022, at 14:32, Chamikara Jayalath via dev <
 dev@beam.apache.org> wrote:

 +1 (binding)

 Verified checksums and signatures of artifacts.
 Validated some multi-language pipelines.

 Thanks,
 Cham

 On Thu, Sep 29, 2022 at 6:12 PM Robert Burke via dev <
 dev@beam.apache.org> wrote:

> Hi everyone,
> Please review and vote on the release candidate #1 for the version
> 2.42.0, as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> Reviewers are encouraged to test their own use cases with the release
> candidate, and vote +1 if no issues are found.
>
> The complete staging area is available for your review, which includes:
> * GitHub Release notes [1],
> * the official Apache source release to be deployed to dist.apache.org 
> [2],
> which is signed with the key with fingerprint
> A52F5C83BAE26160120EC25F3D56ACFBFB2975E1 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.42.0-RC1" [5],
> * website pull request listing the release [6], the blog post [6], and
> publishing the API reference manual [7].
> * Java artifacts were built with Gradle GRADLE_VERSION and
> OpenJDK/Oracle JDK JDK_VERSION.
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2] and PyPI [8]
> * Go Package information and SDK RC  [9]
> * Validation sheet with a tab for 2.42.0 release to help with
> validation [10].
> * Docker images published to Docker Hub [11].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> For guidelines on how to try the release in your projects, check out
> our blog post at https://beam.apache.org/blog/validate-beam-release/.
>
> Thanks,
> Robert Burke
> 2.42.0 Release Manager
>
> [1] https://github.com/apache/beam/milestone/4
> [2] https://dist.apache.org/repos/dist/dev/beam/2.42.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4]
> https://repository.apache.org/content/repositories/orgapachebeam-1285/
> [5] https://github.com/apache/beam/tree/v2.42.0-RC1
> [6] https://github.com/apache/beam/pull/23406
> [7] https://github.com/apache/beam-site/pull/634
> [8] https://pypi.org/project/apache-beam/2.42.0rc1/
> [9]
> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.42.0-RC1/go/pkg/beam
>
> [10]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=265602293
> [11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>
>



Re: Beam Website Feedback

2022-10-04 Thread Alexey Romanenko
Thanks for your feedback. 

At the time, using a Google website search was a simplest solution since, 
before, we didn’t have a search at all. I agree that it could be frustrating to 
have ad links before the actual results (not sure that we can avoid them there) 
but "it is what it is” and it's still possible to have the correct links 
further which is better than nothing. 

Beam community is always welcome for suggestions and, especially, contributions 
to improve the project in any possible way. I’d be happy to assist on this 
topic if someone will decide to improve Beam website search.

—
Alexey

> On 3 Oct 2022, at 23:21, Borris  wrote:
> 
> This is my experience of trying the search capability.
> 
> I know I want to read about dataframes (I was reading this 10 minutes ago but 
> browsing history didn't take me back to where I wanted)
> I search for "dataframes"
> I am presented with a whole load of pages that are elsewhere (other sites) - 
> maybe what I want is some pages below, but I stop at this point as I think 
> its a fundamental failure of what I expect from the search dialogue
> If I enter "beam.apache.org: dataframe" to the search dialogue then the 
> sensible relevant page is now visible, only 5 links down
> I know this may be a penalty of getting a "free" search service from your 
> viewpoint
> But from my viewpoint this is a failure. Your search capability fails to 
> understand that by searching for something on your site, rather than 
> generically through a search engine, I am massively predisposed to the pages 
> on your site, whereas the search results are more predisposed to offering 
> advertising opportunities.
> It is very frustrating that something as simple as, on the Beam site, going 
> to the page about Beam Dataframes takes such a level of hoop jumping
> That is my feedback offering. Thank you for taking the time to read it.
> 
> 
> 
> 
> 



Beam High Priority Issue Report (68)

2022-10-04 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/23350 [Bug]: 
apache_beam.io.gcp.bigquery_file_loads_test.BigQueryFileLoadsIT.test_bqfl_streaming
 - failing test
https://github.com/apache/beam/issues/23306 [Bug]: BigQueryBatchFileLoads in 
python loses data when using WRITE_TRUNCATE
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22303 [Task]: Add tests to Kafka SDF and 
fix known and discovered issues
https://github.com/apache/beam/issues/22299 [Bug]: JDBCIO Write freeze at 
getConnection() in WriteFn
https://github.com/apache/beam/issues/21794 Dataflow runner creates a new timer 
whenever the output timestamp is change
https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't get output 
to Failed Inserts PCollection
https://github.com/apache/beam/issues/21704 beam_PostCommit_Java_DataflowV2 
failures parent bug
https://github.com/apache/beam/issues/21701 beam_PostCommit_Java_DataflowV1 
failing with a variety of flakes and errors
https://github.com/apache/beam/issues/21700 
--dataflowServiceOptions=use_runner_v2 is broken
https://github.com/apache/beam/issues/21696 Flink Tests failure :  
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.beam.runners.core.construction.SerializablePipelineOptions 
https://github.com/apache/beam/issues/21480 flake: 
FlinkRunnerTest.testEnsureStdoutStdErrIsRestored
https://github.com/apache/beam/issues/21472 Dataflow streaming tests failing 
new AfterSynchronizedProcessingTime test
https://github.com/apache/beam/issues/21471 Flakes: Failed to load cache entry
https://github.com/apache/beam/issues/21470 Test flake: test_split_half_sdf
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21468 
beam_PostCommit_Python_Examples_Dataflow failing
https://github.com/apache/beam/issues/21467 GBK and CoGBK streaming Java load 
tests failing
https://github.com/apache/beam/issues/21463 NPE in Flink Portable 
ValidatesRunner streaming suite
https://github.com/apache/beam/issues/21462 Flake in 
org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in use
https://github.com/apache/beam/issues/21271 pubsublite.ReadWriteIT flaky in 
beam_PostCommit_Java_DataflowV2  
https://github.com/apache/beam/issues/21270 
org.apache.beam.sdk.transforms.CombineTest$WindowingTests.testWindowedCombineGloballyAsSingletonView
 flaky on Dataflow Runner V2
https://github.com/apache/beam/issues/21266 
org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
 is flaky in Java ValidatesRunner Flink suite.
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21261 
org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
 is flaky
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21257 Either Create or DirectRunner fails 
to produce all elements to the following transform
https://github.com/apache/beam/issues/21123 Multiple jobs running on Flink 
session cluster reuse the persistent Python environment.
https://github.com/apache/beam/issues/21121 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey
https://github.com/apache/beam/issues/21118 
PortableRunnerTestWithExternalEnv.test_pardo_timers flaky
https://github.com/apache/beam/issues/21114 Already Exists: Dataset 
apache-beam-testing:python_bq_file_loads_NNN
https://github.com/apache/beam/issues/21113 
testTwoTimersSettingEachOtherWithCreateAsInputBounded flaky
https://github.com/apache/beam/issues/2 Java creates an incorrect pipeline 
proto when core-construction-java jar is not in the CLASSPATH
https://github.com/apache/beam/issues/20981 Python precommit flaky: Failed to 
read inputs in the data plane
https://github.com/apache/beam/issues/20977 SamzaStoreStateInternalsTest is 
flaky
https://github.com/apache/beam/issues/20976 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky
https://github.com/apache/beam/issues/20975 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky
https://github.com/apache/beam/issues/20974 Python GHA PreCommits flake with 
grpc.FutureTimeoutError on SDK harness startup
https://github.com/apache/beam/issues/20815 
testTeardownCalledAfterExceptionInPro