Re: [PROPOSAL] Preparing for 2.50.0 Release

2023-08-14 Thread Robert Burke
+1 to what XQ says.

There will be a voting email thread once I've done the appropriate due
diligence on the branch and finished with the Dataflow artifacts.

Generally speaking, the best validation is to run something you're already
using, to make sure that the new version of Beam works for your usage.


On Mon, Aug 14, 2023, 2:41 PM XQ Hu via dev  wrote:

> Welcome to the Beam community! Our release managers usually follow this
> https://beam.apache.org/contribute/release-guide/#10-vote-and-validate-release-candidate
> to send the votes out and ask for any feedback regarding the release
> candidate. If you could help run any validation on your side and cast your
> vote, it would be greatly appreciated and helpful for the community.
>
> On Mon, Aug 14, 2023 at 12:23 PM Hong  wrote:
>
>> I see, thanks for clarifying, Robert!
>>
>> Is there anything I can help with validation? Is there a wiki page with
>> the expected validations I can help with?
>>
>> Best
>> Hong
>>
>> On 14 Aug 2023, at 14:34, Robert Burke  wrote:
>>
>> 
>> The release branch was cut. Before the weekend, I was working on getting
>> the non-portable Dataflow Java worker built and available before producing
>> the RC1. The actual building bit doesn't take that long, but there's a
>> bunch of additional validation that goes along with it.
>>
>> The current target date for 2.50.0 is September 13th, but ultimately it's
>> as soon as we have a validated and voted on RC.
>>
>> On Mon, Aug 14, 2023, 3:43 AM Hong Liang  wrote:
>>
>>> Thanks for driving this Robert!
>>>
>>> It seems the two PRs specified have been merged. A little new to Beam,
>>> do we have an expected release date for the 2.50 release?
>>>
>>> Best,
>>> Hong
>>>
>>> On Thu, Aug 10, 2023 at 3:08 AM Robert Burke 
>>> wrote:
>>>
>>>> I'm in the process of producing the Cut branch, but due to various
>>>> delays on my part, it will not be cut today.
>>>>
>>>> There are two outstanding PRs blocking the cut,
>>>> https://github.com/apache/beam/pull/27947 and
>>>> https://github.com/apache/beam/pull/27939, but once those are in, I'll
>>>> proceed. Remaining new issues will be cherry picked as required.
>>>>
>>>> Thanks
>>>> Robert Burke
>>>> Beam 2.50.0 Release Manager
>>>>
>>>> On 2023/07/26 15:49:37 Robert Burke wrote:
>>>> > Hey Beam community,
>>>> >
>>>> > The next release (2.50.0) branch cut is scheduled on August 9th, 2023,
>>>> > according to
>>>> > the release calendar [1].
>>>> >
>>>> > I volunteer to perform this release. My plan is to cut the branch on that
>>>> > date, and cherrypick release-blocking fixes afterwards, if any.
>>>> >
>>>> > Please help me make sure the release goes smoothly by:
>>>> > - Making sure that any unresolved release blocking issues for 2.50.0 should
>>>> > have their "Milestone" marked as "2.50.0 Release" as soon as possible.
>>>> > - Reviewing the current release blockers [2] and remove the Milestone if
>>>> > they don't meet the criteria at [3].
>>>> >
>>>> > Let me know if you have any comments/objections/questions.
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Robert Burke (he/him)
>>>> > Beam Go Busybody
>>>> >
>>>> > [1]
>>>> > https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>>>> > [2] https://github.com/apache/beam/milestone/14
>>>> > [3] https://beam.apache.org/contribute/release-blocking/
>>>> >
>>>>
>>>


Re: [PROPOSAL] Preparing for 2.50.0 Release

2023-08-14 Thread XQ Hu via dev
Welcome to the Beam community! Our release managers usually follow this
https://beam.apache.org/contribute/release-guide/#10-vote-and-validate-release-candidate
to send the votes out and ask for any feedback regarding the release
candidate. If you could help run any validation on your side and cast your
vote, it would be greatly appreciated and helpful for the community.

On Mon, Aug 14, 2023 at 12:23 PM Hong  wrote:

> I see, thanks for clarifying, Robert!
>
> Is there anything I can help with validation? Is there a wiki page with
> the expected validations I can help with?
>
> Best
> Hong
>
> On 14 Aug 2023, at 14:34, Robert Burke  wrote:
>
> 
> The release branch was cut. Before the weekend, I was working on getting
> the non-portable Dataflow Java worker built and available before producing
> the RC1. The actual building bit doesn't take that long, but there's a
> bunch of additional validation that goes along with it.
>
> The current target date for 2.50.0 is September 13th, but ultimately it's
> as soon as we have a validated and voted on RC.
>
> On Mon, Aug 14, 2023, 3:43 AM Hong Liang  wrote:
>
>> Thanks for driving this Robert!
>>
>> It seems the two PRs specified have been merged. A little new to Beam, do
>> we have an expected release date for the 2.50 release?
>>
>> Best,
>> Hong
>>
>> On Thu, Aug 10, 2023 at 3:08 AM Robert Burke  wrote:
>>
>>> I'm in the process of producing the Cut branch, but due to various
>>> delays on my part, it will not be cut today.
>>>
>>> There are two outstanding PRs blocking the cut,
>>> https://github.com/apache/beam/pull/27947 and
>>> https://github.com/apache/beam/pull/27939, but once those are in, I'll
>>> proceed. Remaining new issues will be cherry picked as required.
>>>
>>> Thanks
>>> Robert Burke
>>> Beam 2.50.0 Release Manager
>>>
>>> On 2023/07/26 15:49:37 Robert Burke wrote:
>>> > Hey Beam community,
>>> >
>>> > The next release (2.50.0) branch cut is scheduled on August 9th, 2023,
>>> > according to
>>> > the release calendar [1].
>>> >
>>> > I volunteer to perform this release. My plan is to cut the branch on that
>>> > date, and cherrypick release-blocking fixes afterwards, if any.
>>> >
>>> > Please help me make sure the release goes smoothly by:
>>> > - Making sure that any unresolved release blocking issues for 2.50.0 should
>>> > have their "Milestone" marked as "2.50.0 Release" as soon as possible.
>>> > - Reviewing the current release blockers [2] and remove the Milestone if
>>> > they don't meet the criteria at [3].
>>> >
>>> > Let me know if you have any comments/objections/questions.
>>> >
>>> > Thanks,
>>> >
>>> > Robert Burke (he/him)
>>> > Beam Go Busybody
>>> >
>>> > [1]
>>> > https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>>> > [2] https://github.com/apache/beam/milestone/14
>>> > [3] https://beam.apache.org/contribute/release-blocking/
>>> >
>>>
>>


Re: Mechanism for "Beam Website Feedback"

2023-08-14 Thread Ahmet Altay via dev
On Mon, Aug 14, 2023 at 1:48 PM Svetak Sundhar 
wrote:

> Hi Ahmet,
>
> I'm +1 on the idea-- one clarification question:
>
> Do you propose that when feedback is sent, it gets forwarded to the dev
> list? If not, we will need to ensure that the backend (eg a Google sheet)
> is monitored.
>

I imagine we can convert it to an email. IIRC this was possible as a native
feature of Google Sheets, but if not we can use some other form -> email
product.


>
>
>
> Svetak Sundhar
>
>   Data Engineer
> s vetaksund...@google.com
>
>
>
> On Mon, Aug 14, 2023 at 4:29 PM Ahmet Altay via dev 
> wrote:
>
>> Hi all,
>>
>> We regularly get emails with "Subject: Beam Website Feedback"; they are
>> filtered out before they reach this mailing list. I believe the reason is
>> that people have some feedback to share, but clicking the feedback button
>> opens their default email application, which surprises them, so they close
>> it. Sometimes they close it by sending us an empty email.
>>
>> My proposal: we can replace the feedback button with a simple embedded
>> form (a text box + submit button). We can use something like Google Forms
>> to implement this without making a more complex backend change.
>>
>> For reference, the feedback button is a partial, implemented here [1].
>>
>> What do you think?
>>
>> Ahmet
>>
>> [1]
>> https://github.com/apache/beam/blob/bbaa7ebd3eec614832d76cfc577858638a96a11d/website/www/site/layouts/partials/feedback.html#L21
>>
>


Re: Mechanism for "Beam Website Feedback"

2023-08-14 Thread Svetak Sundhar via dev
Hi Ahmet,

I'm +1 on the idea-- one clarification question:

Do you propose that when feedback is sent, it gets forwarded to the dev
list? If not, we will need to ensure that the backend (eg a Google sheet)
is monitored.



Svetak Sundhar

  Data Engineer
s vetaksund...@google.com



On Mon, Aug 14, 2023 at 4:29 PM Ahmet Altay via dev 
wrote:

> Hi all,
>
> We regularly get emails with "Subject: Beam Website Feedback"; they are
> filtered out before they reach this mailing list. I believe the reason is
> that people have some feedback to share, but clicking the feedback button
> opens their default email application, which surprises them, so they close
> it. Sometimes they close it by sending us an empty email.
>
> My proposal: we can replace the feedback button with a simple embedded
> form (a text box + submit button). We can use something like Google Forms
> to implement this without making a more complex backend change.
>
> For reference, the feedback button is a partial, implemented here [1].
>
> What do you think?
>
> Ahmet
>
> [1]
> https://github.com/apache/beam/blob/bbaa7ebd3eec614832d76cfc577858638a96a11d/website/www/site/layouts/partials/feedback.html#L21
>


Re: Beam IO Connector

2023-08-14 Thread Devon Peticolas
Sure thing Jeremy,

Generally, our workflow is that every one of our jobs takes a
--consumerMode and/or --producerMode option, and we pass those options to
Read and Write PTransforms, which wrap the standard IO PTransforms and call
the correct one in expand based on the option's value.
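
As a rough illustration of that dispatch, here is a plain-Java sketch with
no Beam dependency. Everything in it is a stand-in: ModeDispatchSketch, the
ProducerMode values, the default mode, and the "topic:"/"file:" prefixes
are made up for this example, and its expand method only mimics the role
that a PTransform's expand plays in the real code (see the gist below for
that).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Plain-Java analogue of a Write wrapper that picks a sink from --producerMode. */
class ModeDispatchSketch {

  /** Hypothetical modes; a real job would name the sinks it actually supports. */
  enum ProducerMode { STREAMING, BATCH }

  /** Stand-in for the wrapping transform: one entry point, several concrete sinks. */
  static final class Write {
    private final ProducerMode mode;
    /** Records what each branch "wrote" so the demo has something to show. */
    final List<String> written = new ArrayList<>();

    Write(ProducerMode mode) {
      this.mode = mode;
    }

    /** Reads an option such as "--producerMode=batch"; the default is an assumption. */
    static Write fromArgs(String... args) {
      String prefix = "--producerMode=";
      for (String arg : args) {
        if (arg.startsWith(prefix)) {
          String value = arg.substring(prefix.length()).toUpperCase(Locale.ROOT);
          return new Write(ProducerMode.valueOf(value));
        }
      }
      return new Write(ProducerMode.STREAMING);
    }

    /** Chooses the concrete sink from the mode, as the real expand would. */
    void expand(List<String> input) {
      switch (mode) {
        case STREAMING:
          for (String element : input) {
            written.add("topic:" + element); // stand-in for a streaming sink
          }
          break;
        case BATCH:
          for (String element : input) {
            written.add("file:" + element); // stand-in for a file sink
          }
          break;
        default:
          throw new IllegalStateException("Unknown mode: " + mode);
      }
    }
  }

  public static void main(String[] args) {
    Write write = Write.fromArgs("--producerMode=batch");
    write.expand(Arrays.asList("a", "b"));
    System.out.println(write.written); // prints [file:a, file:b]
  }
}
```

In an actual pipeline the two branches would apply different IO transforms
to the input PCollection rather than append to a list; the point is only
that callers apply a single Write and the option decides which sink runs.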

I have a simplified example in slide 26 of my deck from my talk at the most
recent Beam Summit, here:
https://github.com/x/slides/blob/master/beam-summit-2023/Apache%20Beam%20Summit%20-%20Avro%20and%20Beam%20Schemas.pdf

And I have an example of the current version of the code we use at my
company, Oden, in a gist here:
https://gist.github.com/x/948ac95b768671d342cc3856a3d7c681

The main use case for us is that all of our Dataflow jobs run in both a
Streaming (normal) and Batch (recovery) mode.


On Aug 14, 2023 at 3:09:51 PM, Jeremy Bloom  wrote:

> Thanks. Is there a github link to Devon's code?
>
> On Mon, Aug 14, 2023 at 8:49 AM John Casey 
> wrote:
>
>> I believe Devon Peticolas wrote a similar tool to create an IO that wrote
>> to configurable sinks that might fit your use case
>>
>> On Sat, Aug 12, 2023 at 12:18 PM Bruno Volpato via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hi Jeremy,
>>>
>>> Apparently you are trying to use Beam's DirectRunner, which is mostly
>>> focused on small pipelines / testing purposes.
>>> Even if it runs in the JVM, there are protections in place to make sure
>>> your pipeline will be able to be distributed correctly when choosing a
>>> production-ready runner (e.g., Dataflow, Spark, Flink), from the link above:
>>>
>>> - enforcing immutability of elements
>>> - enforcing encodability of elements
>>>
>>> There are ways to disable those checks (--enforceEncodability=false,
>>> --enforceImmutability=false), but to make sure you take the best out of
>>> Beam and can run the pipeline in one of the runners in the future, I
>>> believe the best way would be to write to a file, and read it back in the
>>> GUI application (for the sink part).
>>>
>>> For the source part, you may want to use Create to create a PCollection
>>> with specific elements for the in-memory scenario.
>>>
>>> If you are getting exceptions for the supported scenarios that you've
>>> mentioned, there are a few things to check. For example, if you are using
>>> a lambda, Java will sometimes try to serialize the entire instance that
>>> holds the members being used. Creating your own DoFn classes and passing
>>> in only the Serializable objects that you need may resolve this.
>>>
>>>
>>> Best,
>>> Bruno
>>>
>>>
>>>
>>>
>>> On Sat, Aug 12, 2023 at 11:34 AM Jeremy Bloom 
>>> wrote:
>>>
>>>> Hello-
>>>> I am fairly new to Beam but have been working with Apache Spark for a
>>>> number of years. The application I am developing uses a data pipeline to
>>>> ingest JSON with a particular schema, uses it to prepare data for a service
>>>> that I do not control (a mathematical optimization solver), runs the
>>>> application and recovers its results, and then publishes the results in
>>>> JSON (same schema).  Although I work in Java, colleagues of mine are
>>>> implementing in Python. This is an open-source, non-commercial project.
>>>>
>>>> The application has three kinds of IO sources/sinks: file system files
>>>> (using Windows now, but Unix in the future), URL, and in-memory (string,
>>>> byte buffer, etc). The last is primarily used for debugging, displayed in a
>>>> JTextArea.
>>>>
>>>> I have not found a Beam IO connector that handles all three data
>>>> sources/sinks, particularly the in-memory sink. I have tried adapting
>>>> FileIO and TextIO, however, I continually run up against objects that are
>>>> not serializable, particularly Java OutputStream and its subclasses. I have
>>>> looked at the code for FileIO and TextIO as well as several other custom IO
>>>> implementations, but none of them addresses this particular bug.
>>>>
>>>> The CSVSink example in the FileIO Javadoc uses a PrintWriter, which is
>>>> not serializable; when I tried the same thing, I got a not-serializable
>>>> exception. How does this example actually avoid this error? In the code for
>>>> TextIO.Sink, the PrintWriter field is marked transient, meaning that it is
>>>> not serialized, but again, when I tried the same thing, I got an exception.
>>>>
>>>> Please explain, in particular, how to write a Sink that avoids the not
>>>> serializable exception. In general, please explain how I can use a Beam IO
>>>> connector for the three kinds of data sources/sinks I want to use (file
>>>> system, url, and in-memory).
>>>>
>>>> After the frustrations I had with Spark, I have high hopes for Beam.
>>>> This issue is a blocker for me.
>>>>
>>>> Thank you.
>>>> Jeremy Bloom
>>>>
>>>


Mechanism for "Beam Website Feedback"

2023-08-14 Thread Ahmet Altay via dev
Hi all,

We regularly get emails with "Subject: Beam Website Feedback"; they are
filtered out before they reach this mailing list. I believe the reason is
that people have some feedback to share, but clicking the feedback button
opens their default email application, which surprises them, so they close
it. Sometimes they close it by sending us an empty email.

My proposal: we can replace the feedback button with a simple embedded
form (a text box + submit button). We can use something like Google Forms
to implement this without making a more complex backend change.

For reference, the feedback button is a partial, implemented here [1].

What do you think?

Ahmet

[1]
https://github.com/apache/beam/blob/bbaa7ebd3eec614832d76cfc577858638a96a11d/website/www/site/layouts/partials/feedback.html#L21


Re: Beam IO Connector

2023-08-14 Thread Jeremy Bloom
Thanks. Is there a github link to Devon's code?

On Mon, Aug 14, 2023 at 8:49 AM John Casey  wrote:

> I believe Devon Peticolas wrote a similar tool to create an IO that wrote
> to configurable sinks that might fit your use case
>
> On Sat, Aug 12, 2023 at 12:18 PM Bruno Volpato via dev <
> dev@beam.apache.org> wrote:
>
>> Hi Jeremy,
>>
>> Apparently you are trying to use Beam's DirectRunner, which is mostly
>> focused on small pipelines / testing purposes.
>> Even if it runs in the JVM, there are protections in place to make sure
>> your pipeline will be able to be distributed correctly when choosing a
>> production-ready runner (e.g., Dataflow, Spark, Flink), from the link above:
>>
>> - enforcing immutability of elements
>> - enforcing encodability of elements
>>
>> There are ways to disable those checks (--enforceEncodability=false,
>> --enforceImmutability=false), but to make sure you take the best out of
>> Beam and can run the pipeline in one of the runners in the future, I
>> believe the best way would be to write to a file, and read it back in the
>> GUI application (for the sink part).
>>
>> For the source part, you may want to use Create to create a PCollection
>> with specific elements for the in-memory scenario.
>>
>> If you are getting exceptions for the supported scenarios that you've
>> mentioned, there are a few things to check. For example, if you are using
>> a lambda, Java will sometimes try to serialize the entire instance that
>> holds the members being used. Creating your own DoFn classes and passing
>> in only the Serializable objects that you need may resolve this.
>>
>>
>> Best,
>> Bruno
>>
>>
>>
>>
>> On Sat, Aug 12, 2023 at 11:34 AM Jeremy Bloom 
>> wrote:
>>
>>> Hello-
>>> I am fairly new to Beam but have been working with Apache Spark for a
>>> number of years. The application I am developing uses a data pipeline to
>>> ingest JSON with a particular schema, uses it to prepare data for a service
>>> that I do not control (a mathematical optimization solver), runs the
>>> application and recovers its results, and then publishes the results in
>>> JSON (same schema).  Although I work in Java, colleagues of mine are
>>> implementing in Python. This is an open-source, non-commercial project.
>>>
>>> The application has three kinds of IO sources/sinks: file system files
>>> (using Windows now, but Unix in the future), URL, and in-memory (string,
>>> byte buffer, etc). The last is primarily used for debugging, displayed in a
>>> JTextArea.
>>>
>>> I have not found a Beam IO connector that handles all three data
>>> sources/sinks, particularly the in-memory sink. I have tried adapting
>>> FileIO and TextIO, however, I continually run up against objects that are
>>> not serializable, particularly Java OutputStream and its subclasses. I have
>>> looked at the code for FileIO and TextIO as well as several other custom IO
>>> implementations, but none of them addresses this particular bug.
>>>
>>> The CSVSink example in the FileIO Javadoc uses a PrintWriter, which is
>>> not serializable; when I tried the same thing, I got a not-serializable
>>> exception. How does this example actually avoid this error? In the code for
>>> TextIO.Sink, the PrintWriter field is marked transient, meaning that it is
>>> not serialized, but again, when I tried the same thing, I got an exception.
>>>
>>> Please explain, in particular, how to write a Sink that avoids the not
>>> serializable exception. In general, please explain how I can use a Beam IO
>>> connector for the three kinds of data sources/sinks I want to use (file
>>> system, url, and in-memory).
>>>
>>> After the frustrations I had with Spark, I have high hopes for Beam.
>>> This issue is a blocker for me.
>>>
>>> Thank you.
>>> Jeremy Bloom
>>>
>>


Re: Seeking Assistance to Resolve Issues/bug with Flink Runner on Kubernetes

2023-08-14 Thread Kenneth Knowles
There is a Slack channel, #beam on the-asf.slack.com, linked from
https://beam.apache.org/community/contact-us/

(you find this via beam.apache.org -> Community -> Contact Us)

It sounds like an issue with running a multi-language pipeline on the
portable Flink runner (something that I am not equipped to help with in
detail).

Kenn

On Wed, Aug 9, 2023 at 2:51 PM kapil singh  wrote:

> Hey,
>
> I've been grappling with this issue for the past five days and, despite my
> continuous efforts, I haven't found a resolution. Additionally, I've been
> unable to locate a Slack channel for Beam where I might seek assistance.
>
> issue
>
> *RuntimeError: Pipeline construction environment and pipeline runtime
> environment are not compatible. If you use a custom container image, check
> that the Python interpreter minor version and the Apache Beam version in
> your image match the versions used at pipeline construction time.
> Submission environment: beam:version:sdk_base:apache/beam_java8_sdk:2.48.0.
> Runtime environment:
> beam:version:sdk_base:apache/beam_python3.8_sdk:2.48.0.*
>
>
> Here is what I am trying to do:
>
> I am running the job from a Kubernetes container that hits the job server
> and then the job manager and task manager.
> The task manager and job manager are one container.
>
> Here is my custom Dockerfile (name: custom-flink):
>
> # Starting with the base Flink image
> FROM apache/flink:1.16-java11
> ARG FLINK_VERSION=1.16
> ARG KAFKA_VERSION=2.8.0
>
> # Install python3.8 and its associated dependencies, followed by pyflink
> RUN set -ex; \
>     apt-get update && \
>     apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev libffi-dev lzma liblzma-dev && \
>     wget https://www.python.org/ftp/python/3.8.0/Python-3.8.0.tgz && \
>     tar -xvf Python-3.8.0.tgz && \
>     cd Python-3.8.0 && \
>     ./configure --without-tests --enable-shared && \
>     make -j4 && \
>     make install && \
>     ldconfig /usr/local/lib && \
>     cd .. && rm -f Python-3.8.0.tgz && rm -rf Python-3.8.0 && \
>     ln -s /usr/local/bin/python3.8 /usr/local/bin/python && \
>     ln -s /usr/local/bin/pip3.8 /usr/local/bin/pip && \
>     apt-get clean && \
>     rm -rf /var/lib/apt/lists/* && \
>     python -m pip install --upgrade pip; \
>     pip install apache-flink==${FLINK_VERSION}; \
>     pip install kafka-python
>
> RUN pip install --no-cache-dir apache-beam[gcp]==2.48.0
>
> # Copy files from official SDK image, including script/dependencies.
> COPY --from=apache/beam_python3.8_sdk:2.48.0 /opt/apache/beam/ /opt/apache/beam/
>
> # java SDK
> COPY --from=apache/beam_java11_sdk:2.48.0 /opt/apache/beam/ /opt/apache/beam_java/
>
> RUN apt-get update && apt-get install -y python3-venv && rm -rf /var/lib/apt/lists/*
>
> # Give permissions to the /opt/apache/beam-venv directory
> RUN mkdir -p /opt/apache/beam-venv && chown -R : /opt/apache/beam-venv
>
> Here is my Deployment file for Job manager,Task manager plus worker-pool
> and job server
>
>
> apiVersion: v1
> kind: Service
> metadata:
>   name: flink-jobmanager
>   namespace: flink
> spec:
>   type: ClusterIP
>   ports:
>     - name: rpc
>       port: 6123
>     - name: blob-server
>       port: 6124
>     - name: webui
>       port: 8081
>   selector:
>     app: flink
>     component: jobmanager
> ---
> apiVersion: v1
> kind: Service
> metadata:
>   name: beam-worker-pool
>   namespace: flink
> spec:
>   selector:
>     app: flink
>     component: taskmanager
>   ports:
>     - protocol: TCP
>       port: 5
>       targetPort: 5
>       name: pool
> ---
> apiVersion: apps/v1
> kind: Deployment
> metadata:
>   name: flink-jobmanager
>   namespace: flink
> spec:
>   replicas: 1
>   selector:
>     matchLabels:
>       app: flink
>       component: jobmanager
>   template:
>     metadata:
>       labels:
>         app: flink
>         component: jobmanager
>     spec:
>       containers:
>         - name: jobmanager
>           image: custom-flink:latest
>           imagePullPolicy: IfNotPresent
>           args: ["jobmanager"]
>           ports:
>             - containerPort: 6123
>               name: rpc
>             - containerPort: 6124
>               name: blob-server
>             - containerPort: 8081
>               name: webui
>           livenessProbe:
>             tcpSocket:
>               port: 6123
>             initialDelaySeconds: 30
>             periodSeconds: 60
>           volumeMounts:
>             - name: flink-config-volume
>               mountPath: /opt/flink/conf
>             - name: flink-staging
>               mountPath: /tmp/beam-artifact-staging
>           securityContext:
>             runAsUser: 
>           resources:
>             requests:
>               memory: "1Gi"
>               cpu: "1"
>             limits:
>               memory: "1Gi"
>               cpu: "1"
>       volumes:
>         - name: flink-config-volume
>           configMap:
>             name: flink-config
>             items:
>               - key: flink-conf.yaml
>                 path: flink-conf.yaml
>               - key: log4j-console.properties
>                 path: log4j-console.properties
>         - name: flink-staging
>           persistentVolumeClaim:
>             claimName: staging-artifacts-claim
> ---
> apiVersion: apps/v1
> kind: Deployment
> metadata:
>   name: flink-taskmanager
>   namespace: flink
> spec:
>   replicas: 1
>   selector:
>     matchLabels:
>       app: flink
>       component: taskmanager
>   template:
>     metadata:
>       labels:
>         app: flink
>         component: taskmanager
>     spec:
>       containers:
>         - name: taskmanager-beam-worker
>           image: custom-flink:latest
>           imagePullPolicy: IfNotPresent
>           args:
>             - /bin/bash
>             - -c
>             - 

Re: Propose to add the new security section to the Beam releases

2023-08-14 Thread Kenneth Knowles
Great idea.

On Fri, Aug 11, 2023 at 2:18 PM XQ Hu via dev  wrote:

> Hi All,
>
> We are proposing to explicitly add the security fixes to the Beam release
> notes. https://github.com/apache/beam/pull/27976 modified the template in
> CHANGES.md by adding this new section.
>
> Please let us know if you have any questions, or feel free to comment on
> the PR: https://github.com/apache/beam/pull/27976.
>
> Thanks.
>
> Best,
> XQ
>


Re: [PROPOSAL] Preparing for 2.50.0 Release

2023-08-14 Thread Hong
I see, thanks for clarifying, Robert!

Is there anything I can help with for validation? Is there a wiki page with
the expected validations I can help with?

Best
Hong

On 14 Aug 2023, at 14:34, Robert Burke  wrote:

The release branch was cut. Before the weekend, I was working on getting
the non-portable Dataflow Java worker built and available before producing
the RC1. The actual building bit doesn't take that long, but there's a
bunch of additional validation that goes along with it.

The current target date for 2.50.0 is September 13th, but ultimately it's
as soon as we have a validated and voted on RC.

On Mon, Aug 14, 2023, 3:43 AM Hong Liang  wrote:

Thanks for driving this Robert!

It seems the two PRs specified have been merged. A little new to Beam, do
we have an expected release date for the 2.50 release?

Best,
Hong

On Thu, Aug 10, 2023 at 3:08 AM Robert Burke  wrote:

I'm in the process of producing the Cut branch, but due to various delays
on my part, it will not be cut today.

There are two outstanding PRs blocking the cut, https://github.com/apache/beam/pull/27947 and https://github.com/apache/beam/pull/27939, but once those are in, I'll proceed. Remaining new issues will be cherry picked as required.

Thanks
Robert Burke
Beam 2.50.0 Release Manager

On 2023/07/26 15:49:37 Robert Burke wrote:
> Hey Beam community,
> 
> The next release (2.50.0) branch cut is scheduled on August 9th, 2023,
> according to
> the release calendar [1].
> 
> I volunteer to perform this release. My plan is to cut the branch on that
> date, and cherrypick release-blocking fixes afterwards, if any.
> 
> Please help me make sure the release goes smoothly by:
> - Making sure that any unresolved release blocking issues for 2.50.0 should
> have their "Milestone" marked as "2.50.0 Release" as soon as possible.
> - Reviewing the current release blockers [2] and remove the Milestone if
> they don't meet the criteria at [3].
> 
> Let me know if you have any comments/objections/questions.
> 
> Thanks,
> 
> Robert Burke (he/him)
> Beam Go Busybody
> 
> [1]
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
> [2] https://github.com/apache/beam/milestone/14
> [3] https://beam.apache.org/contribute/release-blocking/
> 




Re: Beam IO Connector

2023-08-14 Thread John Casey via dev
I believe Devon Peticolas wrote a similar tool to create an IO that wrote
to configurable sinks that might fit your use case
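
On the not-serializable question quoted below, a plain-Java sketch (no
Beam dependency) may help show one way this kind of error arises and is
avoided. The class and method names (TransientWriterSketch, EagerSink,
LazySink, roundTrip, serializes) are made up for illustration. The
assumption, which I have not checked against the runner code, is that the
sink object is serialized before its open method is called, so a writer
that is only created inside open is never part of the serialized state.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.io.Serializable;

class TransientWriterSketch {

  /** Fails to serialize: the PrintWriter is created eagerly and is object state. */
  static final class EagerSink implements Serializable {
    private static final long serialVersionUID = 1L;
    private final PrintWriter writer = new PrintWriter(new ByteArrayOutputStream());
  }

  /** Serializes cleanly: only the header is state; the writer is created in open(). */
  static final class LazySink implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String header;
    private transient PrintWriter writer; // skipped by serialization, null until open()

    LazySink(String header) {
      this.header = header;
    }

    void open(OutputStream out) {
      writer = new PrintWriter(out);
      writer.println(header);
    }

    void write(String line) {
      writer.println(line);
    }

    void flush() {
      writer.flush();
    }
  }

  /** Serializes and deserializes a value, roughly what shipping it to a worker involves. */
  static <T extends Serializable> T roundTrip(T value)
      throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(value);
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      @SuppressWarnings("unchecked")
      T copy = (T) in.readObject();
      return copy;
    }
  }

  /** Returns false when Java serialization rejects the value's current state. */
  static boolean serializes(Serializable value) {
    try {
      roundTrip(value);
      return true;
    } catch (NotSerializableException e) {
      return false;
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(serializes(new EagerSink()));     // false
    System.out.println(serializes(new LazySink("a,b"))); // true

    // The deserialized copy has no writer yet; open() creates it where it is used.
    LazySink sink = roundTrip(new LazySink("a,b"));
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    sink.open(out);
    sink.write("1,2");
    sink.flush();
    System.out.print(out.toString());
  }
}
```

If a custom sink still throws after its writer field is handled this way,
it may be worth checking whether some other field, or an enclosing
instance captured by a lambda or anonymous class, holds the stream.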

On Sat, Aug 12, 2023 at 12:18 PM Bruno Volpato via dev 
wrote:

> Hi Jeremy,
>
> Apparently you are trying to use Beam's DirectRunner, which is mostly
> focused on small pipelines / testing purposes.
> Even if it runs in the JVM, there are protections in place to make sure
> your pipeline will be able to be distributed correctly when choosing a
> production-ready runner (e.g., Dataflow, Spark, Flink), from the link above:
>
> - enforcing immutability of elements
> - enforcing encodability of elements
>
> There are ways to disable those checks (--enforceEncodability=false,
> --enforceImmutability=false), but to make sure you take the best out of
> Beam and can run the pipeline in one of the runners in the future, I
> believe the best way would be to write to a file, and read it back in the
> GUI application (for the sink part).
>
> For the source part, you may want to use Create to create a PCollection
> with specific elements for the in-memory scenario.
>
> If you are getting exceptions for the supported scenarios that you've
> mentioned, there are a few things to check. For example, if you are using
> a lambda, Java will sometimes try to serialize the entire instance that
> holds the members being used. Creating your own DoFn classes and passing
> in only the Serializable objects that you need may resolve this.
>
>
> Best,
> Bruno
>
>
>
>
> On Sat, Aug 12, 2023 at 11:34 AM Jeremy Bloom 
> wrote:
>
>> Hello-
>> I am fairly new to Beam but have been working with Apache Spark for a
>> number of years. The application I am developing uses a data pipeline to
>> ingest JSON with a particular schema, uses it to prepare data for a service
>> that I do not control (a mathematical optimization solver), runs the
>> application and recovers its results, and then publishes the results in
>> JSON (same schema).  Although I work in Java, colleagues of mine are
>> implementing in Python. This is an open-source, non-commercial project.
>>
>> The application has three kinds of IO sources/sinks: file system files
>> (using Windows now, but Unix in the future), URL, and in-memory (string,
>> byte buffer, etc). The last is primarily used for debugging, displayed in a
>> JTextArea.
>>
>> I have not found a Beam IO connector that handles all three data
>> sources/sinks, particularly the in-memory sink. I have tried adapting
>> FileIO and TextIO, however, I continually run up against objects that are
>> not serializable, particularly Java OutputStream and its subclasses. I have
>> looked at the code for FileIO and TextIO as well as several other custom IO
>> implementations, but none of them addresses this particular bug.
>>
>> The CSVSink example in the FileIO Javadoc uses a PrintWriter, which is
>> not serializable; when I tried the same thing, I got a not-serializable
>> exception. How does this example actually avoid this error? In the code for
>> TextIO.Sink, the PrintWriter field is marked transient, meaning that it is
>> not serialized, but again, when I tried the same thing, I got an exception.
>>
>> Please explain, in particular, how to write a Sink that avoids the not
>> serializable exception. In general, please explain how I can use a Beam IO
>> connector for the three kinds of data sources/sinks I want to use (file
>> system, url, and in-memory).
>>
>> After the frustrations I had with Spark, I have high hopes for Beam. This
>> issue is a blocker for me.
>>
>> Thank you.
>> Jeremy Bloom
>>
>


Re: [PROPOSAL] Preparing for 2.50.0 Release

2023-08-14 Thread Robert Burke
The release branch was cut. Before the weekend, I was working on getting
the non-portable Dataflow Java worker built and available before producing
the RC1. The actual building bit doesn't take that long, but there's a
bunch of additional validation that goes along with it.

The current target date for 2.50.0 is September 13th, but ultimately it's
as soon as we have a validated and voted on RC.

On Mon, Aug 14, 2023, 3:43 AM Hong Liang  wrote:

> Thanks for driving this Robert!
>
> It seems the two PRs specified have been merged. A little new to Beam, do
> we have an expected release date for the 2.50 release?
>
> Best,
> Hong
>
> On Thu, Aug 10, 2023 at 3:08 AM Robert Burke  wrote:
>
>> I'm in the process of producing the Cut branch, but due to various delays
>> on my part, it will not be cut today.
>>
>> There are two outstanding PRs blocking the cut,
>> https://github.com/apache/beam/pull/27947 and
>> https://github.com/apache/beam/pull/27939, but once those are in, I'll
>> proceed. Remaining new issues will be cherry picked as required.
>>
>> Thanks
>> Robert Burke
>> Beam 2.50.0 Release Manager
>>
>> On 2023/07/26 15:49:37 Robert Burke wrote:
>> > Hey Beam community,
>> >
>> > The next release (2.50.0) branch cut is scheduled on August 9th, 2023,
>> > according to
>> > the release calendar [1].
>> >
>> > I volunteer to perform this release. My plan is to cut the branch on that
>> > date, and cherrypick release-blocking fixes afterwards, if any.
>> >
>> > Please help me make sure the release goes smoothly by:
>> > - Making sure that any unresolved release blocking issues for 2.50.0 should
>> > have their "Milestone" marked as "2.50.0 Release" as soon as possible.
>> > - Reviewing the current release blockers [2] and remove the Milestone if
>> > they don't meet the criteria at [3].
>> >
>> > Let me know if you have any comments/objections/questions.
>> >
>> > Thanks,
>> >
>> > Robert Burke (he/him)
>> > Beam Go Busybody
>> >
>> > [1]
>> > https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>> > [2] https://github.com/apache/beam/milestone/14
>> > [3] https://beam.apache.org/contribute/release-blocking/
>> >
>>
>


Re: [PROPOSAL] Preparing for 2.50.0 Release

2023-08-14 Thread Hong Liang
Thanks for driving this Robert!

It seems the two PRs specified have been merged. A little new to Beam, do
we have an expected release date for the 2.50 release?

Best,
Hong

On Thu, Aug 10, 2023 at 3:08 AM Robert Burke  wrote:

> I'm in the process of producing the Cut branch, but due to various delays
> on my part, it will not be cut today.
>
> There are two outstanding PRs blocking the cut,
> https://github.com/apache/beam/pull/27947 and
> https://github.com/apache/beam/pull/27939, but once those are in, I'll
> proceed. Remaining new issues will be cherry picked as required.
>
> Thanks
> Robert Burke
> Beam 2.50.0 Release Manager
>
> On 2023/07/26 15:49:37 Robert Burke wrote:
> > Hey Beam community,
> >
> > The next release (2.50.0) branch cut is scheduled on August 9th, 2023,
> > according to
> > the release calendar [1].
> >
> > I volunteer to perform this release. My plan is to cut the branch on that
> > date, and cherrypick release-blocking fixes afterwards, if any.
> >
> > Please help me make sure the release goes smoothly by:
> > - Making sure that any unresolved release-blocking issues for 2.50.0
> > have their "Milestone" marked as "2.50.0 Release" as soon as possible.
> > - Reviewing the current release blockers [2] and removing the Milestone if
> > they don't meet the criteria at [3].
> >
> > Let me know if you have any comments/objections/questions.
> >
> > Thanks,
> >
> > Robert Burke (he/him)
> > Beam Go Busybody
> >
> > [1]
> > https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
> > [2] https://github.com/apache/beam/milestone/14
> > [3] https://beam.apache.org/contribute/release-blocking/
> >
>


Beam High Priority Issue Report (39)

2023-08-14 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not 
working when using CreateDisposition.CREATE_IF_NEEDED 
https://github.com/apache/beam/issues/27648 [Bug]: Python SDFs (e.g. 
PeriodicImpulse) running in Flink and polling using tracker.defer_remainder 
have checkpoint size growing indefinitely 
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use 
applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with 
inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: 
bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when 
using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26981 [Bug]: Getting an error related to 
SchemaCoder after upgrading to 2.48
https://github.com/apache/beam/issues/26969 [Failing Test]: Python PostCommit 
is failing due to exceeded rate limits
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested 
ROW (described below)
https://github.com/apache/beam/issues/26354 [Bug]: BigQueryIO direct read not 
reading all rows when set --setEnableBundling=true
https://github.com/apache/beam/issues/26343 [Bug]: 
apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is 
flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not 
propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create 
exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/25975 [Bug]: Reducing parallelism in 
FlinkRunner leads to a data loss
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK 
Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944 beam_PreCommit_Python_Cron 
regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder 
will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 
PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21708 beam_PostCommit_Java_DataflowV2, 
testBigQueryStorageWrite30MProto failing consistently
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit 
test action 
StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial 
(order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table 
destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) 
failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey
https://github.com/apache/beam/issues/21104 Flaky: 
apache_beam.runners.portability.fn_api_runner.fn_runner_test.FnApiRunnerTestWithGrpcAndMultiWorkers
https://github.com/apache/beam/issues/20976 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky
https://github.com/apache/beam/issues/20108 Python direct runner doesn't emit 
empty pane when it should
https://github.com/apache/beam/issues/19814 Flink streaming flakes in