Cross Language

2022-10-11 Thread phani geeth
Hi,

Does the present cross-language functionality support creating custom Java
transforms and calling them from Python on the Dataflow runner?

Use case: use an existing Java IO as a cross-language transform and call it in
a Python pipeline on the Dataflow runner.


Regards
Phani Geeth


Re: Cross Language

2022-10-11 Thread Alexey Romanenko
Yes, it’s possible, though the Java IO connector has to support being used via 
cross-language (X-language).

For details on which connectors support this, you may want to take a look at 
the IO connectors table on this page [1] and check whether the required 
connector is supported "via X-language" for the Python SDK. 

[1] https://beam.apache.org/documentation/io/connectors/ 

—
Alexey




Re: Cross Language

2022-10-11 Thread Chamikara Jayalath via user
Is there a specific I/O connector you are hoping to use?

Thanks,
Cham



Re: Java + Python Xlang pipeline

2022-10-11 Thread Alexey Romanenko
I’m not sure I understand correctly. What do you mean by “worker pool” in your 
case?

—
Alexey

> On 8 Oct 2022, at 03:24, Xiao Ma  wrote:
> 
> Hello, 
> 
> I would like to run a pipeline with Java as the main language and Python 
> transforms embedded. The Beam pipeline runs on a Flink cluster. 
> Currently, I can run it with a taskmanager + Java worker pool and a Python 
> worker pool. Could I ask if there is a way to run the Java code on the task 
> manager directly and keep the Python worker pool?
> 
> Current: taskmanager + java worker pool + python worker pool
> Desired: taskmanager + python worker pool
> 
> Thank you very much.
> 
> Mark Ma
> 



[CFP] In-person Beam meetups

2022-10-11 Thread Aizhamal Nurmamat kyzy
Hey Beam community,

We are organizing a few in-person community meetups for Apache Beam and
opening up calls for proposals. Please send me directly the topics / ideas
you'd like to present at the following dates and locations:

November 2, Santa Clara, CA
November 10, NYC
November 12, Bangalore, India

The typical agenda will look as follows:
5:30pm~6pm: check in and food
6:00pm~8pm: 2~3 tech talks
8pm~8:30pm: mingle and close

Thanks!


Re: Java + Python Xlang pipeline

2022-10-11 Thread Xiao Ma
By "worker pool" I mean starting a Java or Python SDK harness that accepts
the Java or Python parts of the pipeline for execution. For example, to
execute the Python pipeline, we have to start a Python worker pool with the
`--worker_pool` argument. For the Java code, besides the docker mode (the
default), are there better ways to start a Java worker pool?

For now, our Flink cluster runs on k8s. If we choose the default SDK harness
mode (docker), we end up with Docker (the Java SDK harness) inside Docker
(the Flink taskmanager). So what we are doing is calling the
org.apache.beam.fn.harness.ExternalWorkerService class with pipeline
options passed as environment variables; we also fixed two small issues in
the FnHarness class to make sure the Java SDK harness can run smoothly.

Thank you.
*Mark Ma*




Re: Help on Apache Beam Pipeline Optimization

2022-10-11 Thread Evan Galpin
If I’m not mistaken, you could create a PCollection from the Pub/Sub read
operation and then apply 3 different windowing strategies in different
“chains” of the graph. For example:

PCollection<PubsubMessage> msgs = p.apply(PubsubIO.readMessages(…));

msgs.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1)))).apply(allMyTransforms);

msgs.apply(Window.into(FixedWindows.of(Duration.standardMinutes(5)))).apply(allMyTransforms);

msgs.apply(Window.into(FixedWindows.of(Duration.standardMinutes(60)))).apply(allMyTransforms);


Similarly, this could be done with a loop if preferred.

On Tue, Oct 4, 2022 at 14:15 Yi En Ong  wrote:

> Hi,
>
>
> I am trying to optimize my Apache Beam pipeline on Google Cloud Platform
> Dataflow, and I would really appreciate your help and advice.
>
>
> Background information: I am trying to read data from PubSub Messages, and
> aggregate them based on 3 time windows: 1 min, 5 min and 60 min. Such
> aggregations consists of summing, averaging, finding the maximum or
> minimum, etc. For example, for all data collected from 1200 to 1201, I want
> to aggregate them and write the output into BigTable's 1-min column family.
> And for all data collected from 1200 to 1205, I want to similarly aggregate
> them and write the output into BigTable's 5-min column. Same goes for 60min.
>
>
> The current approach I took is to have 3 separate dataflow jobs (i.e. 3
> separate Beam Pipelines), each one having a different window duration
> (1min, 5min and 60min). See
> https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/windowing/Window.html.
> And the outputs of all 3 dataflow jobs are written to the same BigTable,
> but on different column families. Other than that, the function and
> aggregations of the data are the same for the 3 jobs.
>
>
> However, this seems to be very computationally inefficient, and cost
> inefficient, as the 3 jobs are essentially doing the same function, with
> the only exception being the window time duration and output column family.
>
>
>
> Some challenges and limitations we faced were that, from the documentation,
> it seems we are unable to create multiple windows of different periods
> in a single Dataflow job. Also, when we write the final data into Bigtable,
> we have to define the table, column family, column, and
> rowkey. And unfortunately, the column family is a fixed property (i.e. it
> cannot be redefined or changed given the window period).
>
>
> Hence, I am writing to ask if there is a way to only use 1 dataflow job
> that fulfils the objective of this project? Which is to aggregate data on
> different window periods, and write them to different column families of
> the same BigTable.
>
>
> Thank you
>


Re: Java + Python Xlang pipeline

2022-10-11 Thread Lydian
You should be able to provide extra_args to the expansion service:
```

--defaultEnvironmentType=PROCESS
--defaultEnvironmentConfig={"command":"/opt/apache/beam_java/boot"}

```

I am also running an Xlang pipeline on a Flink k8s cluster. After setting
defaultEnvironmentType to PROCESS, I don't need to use DinD or DooD at
all.
You can also find my full settings at:
https://stackoverflow.com/a/74035035/19259600



Sincerely,
Lydian Lee


