Re: [Discuss] Idea to increase RC voting participation

2023-12-05 Thread Svetak Sundhar via dev
Hi all,

Following up on this thread: some tips on validating RCs are now
documented [1]. Please do add any instructions, especially for more
SDK/runner-specific combos.

I'll take a closer look now into the automation discussed above on this
thread.

Thanks,
Svetak

[1] https://github.com/apache/beam/pull/29595


On Wed, Oct 25, 2023 at 9:52 AM Danny McCormick via dev 
wrote:

> > One easy and standard way to make it more resilient would be to make it
> idempotent instead of counting on uptime or receiving any particular event.
>
> Yep, agreed that this wouldn't be super hard if someone wants to take it
> on. Basically it would just be updating the tool to run on a schedule and
> look for issues that have been closed as completed in the last N days (more
> or less this query -
> https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aclosed+reason%3Acompleted+created%3A%3E2023-01-01+).
> I have seen some milestones intentionally removed from issues after the bot
> adds them (probably because it's non-obvious that you can mark an issue as
> not planned instead), so we'd probably want to account for that and no-op
> if a milestone was removed post-close.
>
> One downside of this approach is that you significantly increase the
> chances of an issue getting assigned to the wrong milestone if it comes
> in around the cut; you'd need to either account for this by checking out
> the repo to get the version at the time the issue was closed
> (expensive/non-trivial) or live with this downside. It's probably an ok
> downside to live with.
>
> You could also do a hybrid approach where you run on issue close and run a
> scheduled or manual pre-release step to clean up any stragglers. This would
> be the most robust option.
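> A minimal sketch of the idempotent cleanup pass described above. The
> issue dicts mimic a few fields of the GitHub REST API response; the
> function name and the `removed_post_close` log are hypothetical, not
> part of the existing assign_milestone workflow:

```python
def should_assign_milestone(issue, removed_post_close):
    """Decide whether a scheduled cleanup run should attach the upcoming
    milestone to an issue. Safe to re-run: already-milestoned issues and
    issues whose milestone was deliberately removed are skipped."""
    if issue.get("state") != "closed":
        return False
    # Only issues closed as completed get a milestone, not "not planned".
    if issue.get("state_reason") != "completed":
        return False
    if issue.get("milestone") is not None:
        return False  # already assigned; re-running is a no-op
    # No-op if a human removed the milestone after the issue was closed.
    return issue["number"] not in removed_post_close
```

> Because every branch short-circuits to a no-op, the pass can run on a
> cron schedule (or be re-run after a flake) without double-assigning.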
>
> On Wed, Oct 25, 2023 at 7:43 AM Kenneth Knowles  wrote:
>
>> Agree. As long as we are getting enough of them, then our records, as well
>> as any automation depending on them, are fine. One easy and standard way to
>> make it more resilient would be to make it idempotent instead of counting
>> on uptime or receiving any particular event.
>>
>> Kenn
>>
>> On Tue, Oct 24, 2023 at 2:58 PM Danny McCormick <
>> dannymccorm...@google.com> wrote:
>>
>>> Looks like for some reason the workflow didn't trigger. This is running
>>> on GitHub's hosted runners, so my best guess is an outage.
>>>
>>> Looking at a more refined query, this year there have been 14 issues
>>> that were missed by the automation (3 had their milestone manually removed) -
>>> https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed+reason%3Acompleted+created%3A%3E2023-01-01
>>> - out of 605 total -
>>> https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aclosed+reason%3Acompleted+created%3A%3E2023-01-01
>>> As best I can tell, there were a small number of workflow flakes, and then
>>> GHA didn't correctly trigger a few.
>>>
>>> If we wanted, we could set up some recurring automation to go through
>>> and try to pick up the ones without milestones (or modify our existing
>>> automation to be more tolerant to failures), but it doesn't seem super
>>> urgent to me (feel free to disagree). I don't think this piece needs to be
>>> perfect.
>>>
>>> On Tue, Oct 24, 2023 at 2:40 PM Kenneth Knowles  wrote:
>>>
 Just grabbing one at random for an example,
 https://github.com/apache/beam/issues/28635 seems like it was closed
 as completed but not tagged.

 I'm happy to see that the bot reads the version from the repo to find
 the appropriate milestone, rather than using the nearest open one. Just
 recording that for the thread since I first read the description as the
 latter.

 Kenn

 On Tue, Oct 24, 2023 at 2:34 PM Danny McCormick via dev <
 dev@beam.apache.org> wrote:

> We do tag issues to milestones when the issue is marked as "completed"
> (as opposed to "not planned") -
> https://github.com/apache/beam/blob/master/.github/workflows/assign_milestone.yml.
> So I think using issues is probably about as accurate as using commits.
>
> > It looks like we have 820 with no milestone
> https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed
>
> Most predate the automation, though maybe not all? Some of those may
> have been closed as "not planned".
>
> > This could (should) be automatically discoverable. A (closed) issues
> is associated with commits which are associated with a release.
>
> Today, we just tag issues to the upcoming milestone when they're
> closed. In theory you could do something more sophisticated using linked
> commits, but in practice people aren't clean enough about linking commits
> to issues. Again, this is fixable by automation/enforcement, but I don't
> think it actually gives us much value beyond what we have today.
>
> On Tue, Oct 24, 2023 at 1:54 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>

Re: Build python Beam from source

2023-12-05 Thread Robert Bradshaw via dev
To use cross-language capabilities from a non-release branch, you'll
have to build the cross-language bits yourself as well. This can be
done by

(1) Making sure Java (for the Java dependencies) is installed.
(2) In the top level of the repository, running ./gradlew
sdks:java:io:expansion-service:shadowJar

For released versions of Beam, the SDK will automatically fetch the
pre-built, released artifacts for you from Maven. You can manually
request those of a previous release by passing something like

--beam_services='{"sdks:java:extensions:sql:expansion-service:shadowJar":
"https://repository.apache.org/content/repositories/orgapachebeam-1361/org/apache/beam/beam-sdks-java-extensions-sql-expansion-service/2.52.0/beam-sdks-java-extensions-sql-expansion-service-2.52.0.jar"}'

which basically says "when looking for this target, use that jar."
Since this substitutes an out-of-date copy of the libraries, it may
not always work.
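As a concrete illustration, the flag value is a JSON object mapping a
Gradle target to a jar location. A small sketch of assembling it
programmatically (the target and staging-repository URL are the ones
quoted above; the variable names are just for illustration):

```python
import json

# Map the Gradle target to the released expansion-service jar
# (URL as quoted above, from the 2.52.0 staging repository).
target = "sdks:java:extensions:sql:expansion-service:shadowJar"
jar_url = (
    "https://repository.apache.org/content/repositories/orgapachebeam-1361/"
    "org/apache/beam/beam-sdks-java-extensions-sql-expansion-service/2.52.0/"
    "beam-sdks-java-extensions-sql-expansion-service-2.52.0.jar"
)

# Serialize the mapping as the --beam_services pipeline flag.
flag = "--beam_services=" + json.dumps({target: jar_url})
print(flag)
```

Building the string with json.dumps avoids the quoting mistakes that are
easy to make when writing the JSON by hand inside shell quotes.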


On Tue, Dec 5, 2023 at 6:14 AM Поротиков Станислав Вячеславович via
dev  wrote:
>
> Hello!
> How do I properly install/build the apache-beam Python package from source?
>
> I've tried running:
>
> pip install .
>
> from the sdks/python directory
>
> It installs successfully, but when I try to run a Python Beam pipeline, it
> complains:
> RuntimeError: 
> /lib/sdks/java/io/expansion-service/build/libs/beam-sdks-java-io-expansion-service-2.52.0-SNAPSHOT.jar
>  not found. Please build the server with
>
>  cd /lib; ./gradlew 
> sdks:java:io:expansion-service:shadowJar
>
>
>
> Glad for any help!
>
>
>
> Best regards,
>
> Stanislav Porotikov
>
>


Build python Beam from source

2023-12-05 Thread Поротиков Станислав Вячеславович via dev
Hello!
How do I properly install/build the apache-beam Python package from source?
I've tried running:
pip install .
from the sdks/python directory
It installs successfully, but when I try to run a Python Beam pipeline, it
complains:
RuntimeError: 
/lib/sdks/java/io/expansion-service/build/libs/beam-sdks-java-io-expansion-service-2.52.0-SNAPSHOT.jar
 not found. Please build the server with
 cd /lib; ./gradlew 
sdks:java:io:expansion-service:shadowJar

Glad for any help!

Best regards,
Stanislav Porotikov



Re: Embeddings generation in MLTransform

2023-12-05 Thread Alexey Romanenko
You need to send a blank email to dev-unsubscr...@beam.apache.org 


—
Alexey


> On 5 Dec 2023, at 11:57, Divya Sanghi  wrote:
> 
> Can someone suggest how to unsubscribe?
> 
> On Mon, Oct 30, 2023 at 7:33 PM Anand Inguva via dev wrote:
>> Hi all,
>> 
>> In Apache Beam 2.50.0 Python SDK, we added MLTransform, which is used to
>> pre/post-process data using common ML operations. Now, we are planning to
>> generate embeddings with ML models using MLTransform.
>> 
>> I have created a doc on how we can do this. Please go through the doc if
>> interested and let me know of any feedback.
>> 
>> Thanks,
>> Anand
>> 
>> Doc: 
>> https://docs.google.com/document/d/1En4bfbTu4rvu7LWJIKV3G33jO-xJfTdbaSFSURmQw_s/edit#heading=h.wskna8eurvjv



Re: Embeddings generation in MLTransform

2023-12-05 Thread Divya Sanghi
Can someone suggest how to unsubscribe?

On Mon, Oct 30, 2023 at 7:33 PM Anand Inguva via dev 
wrote:

> Hi all,
>
> In Apache Beam 2.50.0 Python SDK, we added MLTransform, which is used to
> pre/post-process data using common ML operations. Now, we are planning to
> generate embeddings with ML models using MLTransform.
>
> I have created a doc on how we can do this. Please go through the doc if
> interested and let me know of any feedback.
>
> Thanks,
> Anand
>
> Doc:
> https://docs.google.com/document/d/1En4bfbTu4rvu7LWJIKV3G33jO-xJfTdbaSFSURmQw_s/edit#heading=h.wskna8eurvjv
>


Beam High Priority Issue Report (48)

2023-12-05 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/29413 [Bug]: Can not use Avro over 1.8.2 
with Beam 2.52.0
https://github.com/apache/beam/issues/29099 [Bug]: FnAPI Java SDK Harness 
doesn't update user counters in OnTimer callback functions
https://github.com/apache/beam/issues/29022 [Failing Test]: Python Github 
actions tests are failing due to update of pip 
https://github.com/apache/beam/issues/28760 [Bug]: EFO Kinesis IO reader 
provided by apache beam does not pick the event time for watermarking
https://github.com/apache/beam/issues/28715 [Bug]: Python WriteToBigtable get 
stuck for large jobs due to client dead lock
https://github.com/apache/beam/issues/28383 [Failing Test]: 
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testMaxThreadMetric
https://github.com/apache/beam/issues/28339 Fix failing 
"beam_PostCommit_XVR_GoUsingJava_Dataflow" job
https://github.com/apache/beam/issues/28326 Bug: 
apache_beam.io.gcp.pubsublite.ReadFromPubSubLite not working
https://github.com/apache/beam/issues/28142 [Bug]: [Go SDK] Memory seems to be 
leaking on 2.49.0 with Dataflow
https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not 
working when using CreateDisposition.CREATE_IF_NEEDED 
https://github.com/apache/beam/issues/27648 [Bug]: Python SDFs (e.g. 
PeriodicImpulse) running in Flink and polling using tracker.defer_remainder 
have checkpoint size growing indefinitely 
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use 
applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with 
inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: 
bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when 
using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested 
ROW (described below)
https://github.com/apache/beam/issues/26343 [Bug]: 
apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is 
flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not 
propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create 
exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK 
Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944  beam_PreCommit_Python_Cron 
regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder 
will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 
PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit 
test action 
StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial 
(order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table 
destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) 
failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey