[beam-starter-typescript]: Missing place to create issue

2023-06-12 Thread david-kh...@hotmail.com
Hi Beam community,

I am David, and I am new to the community. After trying to tweak some code from
beam-starter-ts, I found some issues that I would like to raise. However, I
cannot find a way to create a GitHub issue in that project
(apache/beam-starter-typescript on github.com).

I also double-checked Contribute.md but still could not find an answer.

Would you mind pointing me in the right direction?

Regards,
David L.


Re: [Proposal] Kaskada DSL and FnHarness for Temporal Queries

2023-06-12 Thread Ben Chambers
Hey Daniel -- Great question!

Kaskada was designed to be similar to SQL but with a few differences.
The most significant is the assumption of both ordering and grouping.
Kaskada uses this to automatically merge multiple input collections,
and to allow data-dependent windows that identify a range of time. For
instance, the query `Purchases.amount | sum(window = since(Login))`
sums the amount spent since the last login. In user studies, we've
heard that these make it much easier to compose queries analyzing the
entire "journey" or "funnel" for each user.
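
To make the semantics concrete, here is a plain-Python sketch of what a query
like `Purchases.amount | sum(window = since(Login))` computes. It is not Kaskada
code: the `Event` type, the `sum_since_login` function, and the choice to start
from zero for users with no earlier login are all assumptions made for
illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Event:
    time: int            # event time; any orderable value would do
    user: str            # grouping key
    kind: str            # "login" or "purchase"
    amount: float = 0.0  # only meaningful for purchases


def sum_since_login(events: Iterable[Event]) -> List[Tuple[int, str, float]]:
    """Emit (time, user, running total) for each purchase, per user.

    The running total is reset whenever that user logs in, which mimics a
    data-dependent window that spans "since the last login".
    """
    running: Dict[str, float] = {}
    results: List[Tuple[int, str, float]] = []
    # Ordering assumption: events are processed in time order.
    for event in sorted(events, key=lambda e: e.time):
        if event.kind == "login":
            running[event.user] = 0.0  # a login opens a new window
        elif event.kind == "purchase":
            running[event.user] = running.get(event.user, 0.0) + event.amount
            results.append((event.time, event.user, running[event.user]))
    return results
```

Both the ordering (the sort by time) and the grouping (the per-user dictionary)
are implicit in the query text, which is the difference from SQL described
above.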

There are also cases where the ordering assumption *isn't* a good fit
-- queries that aren't as sensitive to time. Having both options
readily available would allow a user to choose what is most natural to
them and their use case.

-- Ben

On Mon, Jun 12, 2023 at 12:14 PM Daniel Collins wrote:
>
> How does this mechanism differ from Beam SQL, which already offers windowing
> via SQL over PCollections?
>
> https://beam.apache.org/documentation/dsls/sql/extensions/windowing-and-triggering/
>
> -Daniel


Re: [Proposal] Kaskada DSL and FnHarness for Temporal Queries

2023-06-12 Thread Daniel Collins via dev
How does this mechanism differ from Beam SQL, which already offers windowing
via SQL over PCollections?

https://beam.apache.org/documentation/dsls/sql/extensions/windowing-and-triggering/

-Daniel



Re: [Proposal] Kaskada DSL and FnHarness for Temporal Queries

2023-06-12 Thread Ryan Michael
Hello, Beam (also)!

Just introducing myself - I'm Ryan and I've been working with Ben on the
Kaskada project for the past few years. As Ben mentioned, I think there's a
great opportunity to bring together some of the work we've done to make
time-based computation easier to reason about with the Beam community's
work on scalable streaming computation.

I'll be at the Beam Summit in NYC starting Wednesday and presenting a short
overview of how we see Kaskada fitting into the Generative AI world at the
"Generative AI Meetup" Wednesday afternoon - if the doc Ben linked to (or
GenAI) is interesting to you and you'll be at the conference I'd love to
touch base in person!

-Ryan



-- 
Ryan Michael
keri...@gmail.com | 512.466.3662


Re: Ensuring a task does not get executed concurrently

2023-06-12 Thread Robert Bradshaw via dev
If you absolutely cannot tolerate concurrency, an external locking mechanism
is required. While a distributed system often waits for a work item to fail
before retrying it, this is not always the case (e.g. backup workers may be
scheduled, and whoever finishes first is deemed the successful attempt).
Worse, the master may think a task has failed (e.g. due to loss of contact)
and reassign the item to another worker when in fact the original worker has
not fully died (e.g. it simply lost network connectivity, or entered a
bad-but-not-fatal state where a user-code thread continues on). We call these
zombie workers, and though they're uncommon, they're nigh impossible to rule
out.
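
As a rough illustration of an external lock, here is a minimal sketch that uses
an exclusively created lock file. The `exclusive_lock` helper and its parameters
are hypothetical, and it assumes every worker sees the same filesystem and that
exclusive file creation is atomic there, which does not hold for every network
or object store. It also has no lease or expiry, so a crashed holder leaves a
stale lock behind; a real deployment would more likely rely on a dedicated lock
service.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path, timeout=10.0, poll=0.1):
    """Hold an exclusive lock represented by the file at `path`.

    Raises TimeoutError if the lock cannot be acquired within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only one
            # caller can create it and thereby own the lock.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire lock {path!r}")
            time.sleep(poll)
    try:
        # Record the owner's PID to help with manual debugging of stale locks.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.remove(path)
```

The non-concurrency-safe setup would then run inside
`with exclusive_lock("/shared/setup.lock"):`, where the path is a placeholder.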



[Proposal] Kaskada DSL and FnHarness for Temporal Queries

2023-06-12 Thread Ben Chambers
Hello Beam!

Kaskada has created a query language for expressing temporal queries,
making it easy to work with multiple streams and perform temporally
correct joins. We’re looking at taking our native, columnar execution
engine and making it available as a PTransform and FnHarness for use
with Apache Beam.

We’ve drafted a [short document][proposal] outlining our planned
approach and the potential benefits to Kaskada and Beam users. It
would be super helpful to get some feedback on this approach and any
ways that it could be improved / better integrated with Beam to
provide more value!

Could you see yourself using (or contributing to) this work? Let us know!

Thanks!

Ben

[proposal]: https://docs.google.com/document/d/1w6DYpYCi1c521AOh83JN3CB3C9pZwBPruUbawqH-NsA/edit


Re: Ensuring a task does not get executed concurrently

2023-06-12 Thread Bruno Volpato via dev
Hi Stephan,

I am not sure if this is the best way to achieve this, but I've seen
parallelism limited by using state / KV and limiting the number of keys.
In your case, you could use the same key for both non-concurrency-safe
operations; when using state, the Beam model will guarantee that they
aren't executed concurrently.

This blog post may be helpful:
https://beam.apache.org/blog/stateful-processing/






Ensuring a task does not get executed concurrently

2023-06-12 Thread Stephan Hoyer via dev
Can the Beam data model (specifically the Python SDK) support executing
functions that are idempotent but not concurrency-safe?

I am thinking of a task like setting up a database (or in my case, a Zarr
store in Xarray-Beam) where it is not safe to run setup concurrently, but if
the whole operation fails it is safe to retry.

I recognize that a better model would be to use entirely atomic operations,
but sometimes this can be challenging to guarantee for tools that were not
designed with parallel computing in mind.

Cheers,
Stephan


Tour of Beam - an interactive Apache Beam learning guide

2023-06-12 Thread Alex Panin
Hi Beam community!

We invite you to try the Tour Of Beam [1] - an interactive Apache Beam
learning guide.

Please share your feedback [2]!

Key features available for preview:

* Java, Python, and Go SDK learning journeys covering such topics as:
  * Introduction to Beam
  * Common and Core Transforms
  * Windowing
  * Triggers
  * IO Connectors
  * Cross-Language Transforms
* Trying built-in examples and solving challenges
* Tracking learning progress (for authenticated users)

Planned/in-progress items:
* Support for cross-language examples execution
* Learning content contribution guide
* PR 26964 [3] - UI refinement
* PR 26861 [4] - Final challenge content


Please submit the feedback form [2] or the built-in “Enjoying Tour Of Beam?”
form to provide your feedback and suggestions!

Thank you for trying out the Tour Of Beam!

Thanks,
Alex, on behalf of Tour Of Beam team

[1] - https://tour.beam.apache.org
[2] - https://docs.google.com/forms/d/e/1FAIpQLSdI3yTmsCtk_neVPt0zQOPSmxDBlz3uX2AcmUpoNT6iGEwkUQ/viewform?usp=pp_url
[3] - https://github.com/apache/beam/pull/26964
[4] - https://github.com/apache/beam/pull/26861



Beam High Priority Issue Report (37)

2023-06-12 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/27019 [Failing Test]: Azure Integration test is failing in python 3.7 PostCommit
https://github.com/apache/beam/issues/27012 [Bug]: Beam Website cannot run locally on Mac
https://github.com/apache/beam/issues/26981 [Bug]: Getting an error related to SchemaCoder after upgrading to 2.48
https://github.com/apache/beam/issues/26969 [Failing Test]: Python PostCommit is failing due to exceeded rate limits
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested ROW (described below)
https://github.com/apache/beam/issues/26547 [Failing Test]: beam_PostCommit_Java_DataflowV2
https://github.com/apache/beam/issues/26354 [Bug]: BigQueryIO direct read not reading all rows when set --setEnableBundling=true
https://github.com/apache/beam/issues/26343 [Bug]: apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26272 [Failing Test]: Python 3.7 postcommit is red
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/25975 [Bug]: Reducing parallelism in FlinkRunner leads to a data loss
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944 beam_PreCommit_Python_Cron regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/22913 [Bug]: beam_PostCommit_Java_ValidatesRunner_Flink is flakes in org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21708 beam_PostCommit_Java_DataflowV2, testBigQueryStorageWrite30MProto failing consistently
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit test action StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial (order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit data at GC time
https://github.com/apache/beam/issues/21121 apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it flakey
https://github.com/apache/beam/issues/21104 Flaky: apache_beam.runners.portability.fn_api_runner.fn_runner_test.FnApiRunnerTestWithGrpcAndMultiWorkers
https://github.com/apache/beam/issues/20976 apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics is flaky
https://github.com/apache/beam/issues/20108 Python direct runner doesn't emit empty pane when it should
https://github.com/apache/beam/issues/19814 Flink streaming flakes in ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful and ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
https://github.com/apache/beam/issues/19465 Explore possibilities to lower in-use IP address quota footprint.


P1 Issues with no update in the last week:

https://github.com/apache/beam/issues/26723 [Failing Test]: Tour of Beam Frontend Test suite is perma-red on master
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder will drop message id and orderingKey