Re: [Java] [ParquetIO] How to determine required dependencies

2021-06-11 Thread Kyle Weaver
As far as I know, the only dependency you need to manage directly is beam-sdks-java-io-parquet [1]. Can you make sure the version of that dependency is correct (i.e. matches the version of your other Beam dependencies)? [1] https://beam.apache.org/documentation/io/built-in/parquet/ On Fri, Jun 11
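For reference, the dependency named above as a hedged Maven snippet — the version shown is a placeholder; pin it to the same version as your other Beam artifacts:

```xml
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-parquet</artifactId>
  <!-- placeholder: use the same version as beam-sdks-java-core etc. -->
  <version>2.30.0</version>
</dependency>
```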

Re: No data sinks have been created yet.

2021-06-10 Thread Kyle Weaver
't use Flink's sink API. I recall from a very long time ago that > we attached a noop sink to each PCollection to avoid this error. +Kyle > Weaver might know something about how this applies > to Python on Flink. > > Kenn > > On Wed, Jun 9, 2021 at 4:41 PM Trevor Kramer

Re: Error with Beam/Flink Python pipeline with Kafka

2021-05-24 Thread Kyle Weaver
The artifact staging directory (configurable via the "--artifacts_dir" pipeline option) needs to be accessible to workers, which in this case are containerized. There are a couple of workarounds: 1. Don't stage files through the file system at all, i.e. use --runner=FlinkRunner in Python instead of -
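Workaround 1 sketched as a command line, assuming the pipeline is submitted from Python; the script name and flink_master address are placeholders. The FlinkRunner entry point manages the job server itself, avoiding a shared artifact directory:

```shell
python pipeline.py \
  --runner=FlinkRunner \
  --flink_master=localhost:8081 \
  --environment_type=DOCKER
```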

Re: Does SnowflakeIO support spark runner

2021-05-06 Thread Kyle Weaver
Yeah, I'm pretty sure that documentation is just misleading. All of the options from --runner onward are runner-specific and don't have anything to do with Snowflake, so they should probably be removed from the doc. On Thu, May 6, 2021 at 12:06 PM Tao Li wrote: > Hi

Re: Does SnowflakeIO support spark runner

2021-05-06 Thread Kyle Weaver
As far as I know, it should be supported (Beam's abstract model means IOs usually "just work" on all runners). What makes you think it isn't supported? On Thu, May 6, 2021 at 11:52 AM Tao Li wrote: > Hi Beam community, > > > > Does SnowflakeIO support spark runner? Seems like only direct runner

Re: [Question] Docker and File Errors with Spark PortableRunner

2021-04-02 Thread Kyle Weaver
Hi Michael, Your problems in 1. and 2. are related to the artifact staging workflow, where Beam tries to copy your pipeline’s dependencies to the workers. When artifacts cannot be fetched because of file system or other issues, the workers cannot be started successfully. In this case, your pipelin

Re: using python sdk+kafka under k8s

2021-03-04 Thread Kyle Weaver
:8081", >> environment_type="EXTERNAL", >> environment_config="localhost:5")) as p: >> (p >> | 'Read from Kafka' >> >> ReadFromKafka(consumer_config={'bootstrap.servers':

Re: using python sdk+kafka under k8s

2021-02-26 Thread Kyle Weaver
In Beam, the Kafka connector does not know anything about the underlying execution engine (here Flink). It is instead translated by the runner into a user defined function in Flink. So it is expected that the resulting DAG does not look the same as it would with a native Flink source. On Fri, Feb

Re: Using the Beam Python SDK and PortableRunner with Flink to connect to Kafka with SSL

2021-02-02 Thread Kyle Weaver
on sdk > harness, but I was wondering if there is something similar for overriding > the Java sdk harness? > > Thanks again. > > On Tue, Feb 2, 2021 at 9:08 PM Kyle Weaver wrote: > >> AFAIK sdk_harness_container_image_overrides only works for the Dataflow >> runner. Fo

Re: Using the Beam Python SDK and PortableRunner with Flink to connect to Kafka with SSL

2021-02-02 Thread Kyle Weaver
AFAIK sdk_harness_container_image_overrides only works for the Dataflow runner. For other runners I think you will have to change the default environment config: https://github.com/apache/beam/blob/0078bb35ba4aef9410d681d8b4e2c16d9f56433d/sdks/java/core/src/main/java/org/apache/beam/sdk/options/Por
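As a sketch of what changing the default environment config looks like from the submission side — the flag names come from Beam's portable pipeline options, while the image name is hypothetical and stands in for your rebuilt Java SDK harness image:

```shell
--environment_type=DOCKER \
--environment_config=myregistry/beam-java-sdk-with-ssl-certs:2.27.0
```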

Re: Difference between Flink runner and portable runner on Python

2021-01-14 Thread Kyle Weaver
It's mostly just a different entry point. With (Python / Go / Java), the user has to start a job server themselves before submitting a job. With (Python), the job server is managed automatically in the background. Aside from that, though, there's very little difference. We recommend (Python) to new

Re: Is there an array explode function/transform?

2021-01-12 Thread Kyle Weaver
> > @Reuven Lax yes I am aware of that transform, but > that’s different from the explode operation I was referring to: > https://spark.apache.org/docs/latest/api/sql/index.html#explode > How is it different? It'd help if you could provide the signature (input and output PCollection types) of the

Re: Beam on Flink version compatibility

2021-01-11 Thread Kyle Weaver
The Java portion of the Flink runner was upgraded to 1.12, but Python and job server containers were not [1]. This will be fixed in the next Beam release (2.28.0). If you want to use Flink 1.12 with Python now, you can follow the instructions in the "Portable (Java/Python/Go)" tab [2]. But since o

Re: [VOTE] Release 2.27.0, release candidate #1

2020-12-28 Thread Kyle Weaver
Estrada wrote: > >> Good catch Ismael. Thanks! >> >> I've created https://issues.apache.org/jira/browse/INFRA-21238 to >> request the repositories to be created. >> >> I am not sure what's the status of this work - should this block the >> releas

Re: using beam with flink runner

2020-12-28 Thread Kyle Weaver
Using Docker workers along with the local filesystem I/O is not recommended because the Docker workers will use their own filesystems instead of the host filesystem. See https://issues.apache.org/jira/browse/BEAM-5440 On Sun, Dec 27, 2020 at 5:01 AM Günter Hipler wrote: > Hi, > > I just tried to

Re: beam and compatible Flink Runner versions

2020-12-28 Thread Kyle Weaver
The versions Ismaël linked are accurate for Java, but Python has lagged behind. I filed an issue for this recently, but haven't had time to fix it [1]. As a workaround, you can follow the instructions on the Flink runner page, selecting the tab "Adapt for: Portable (Java/Python/Go)" [2]. Since off

Re: [VOTE] Release 2.27.0, release candidate #1

2020-12-23 Thread Kyle Weaver
+1 (non-binding) Validated wordcount with Python source + Flink and Spark job server jars. Also checked that the ...:sql:udf jar was added and includes our cherry-picks. Thanks Pablo :) On Wed, Dec 23, 2020 at 12:02 PM Ahmet Altay wrote: > +1 (binding). > > I validated python quickstarts. Thank

Re: Potential issue with the flink runner in streaming mode

2020-11-24 Thread Kyle Weaver
Yeah, it looks like a regression. I filed a JIRA issue to track it. https://issues.apache.org/jira/browse/BEAM-11341 On Tue, Nov 24, 2020 at 2:07 PM Tao Li wrote: > Yep it works with “--experiments=use_deprecated_read”. Is this a > regression? > > > > *From: *Kyle W

Re: Potential issue with the flink runner in streaming mode

2020-11-24 Thread Kyle Weaver
I wonder if this issue is related to the migration to Splittable DoFn [1]. Can you try running your pipeline again with the option --experiments=use_deprecated_read? [1] https://beam.apache.org/blog/beam-2.25.0/ On Tue, Nov 24, 2020 at 10:19 AM Tao Li wrote: > Hi Beam community, > > > > I am ru

Re: Colab vs Local IDE

2020-10-28 Thread Kyle Weaver
> Is there any difference in running the spark or Flink runners from Colab vs Local. Google Colab is hosted in a Linux virtual machine. Docker for Windows is missing some features, including host networking. > 4. python "filename.py" should run but getting raise grpc.FutureTimeoutError() Can you

Re: Issues with python's external ReadFromPubSub

2020-10-28 Thread Kyle Weaver
Are you able to run streaming word count on the same setup? On Tue, Oct 27, 2020 at 5:39 PM Sam Bourne wrote: > We updated from beam 2.18.0 to 2.24.0 and have been having issues using > the python ReadFromPubSub external transform in flink 1.10. It seems like > it starts up just fine, but it doe

Re: Beam + Spark Portable Runner

2020-10-27 Thread Kyle Weaver
On the Spark runner webpage [1], we recommend running the job server container with --net=host so all the job server's ports are exposed to the host. However, host networking is not available if you're using Windows, so you would have to expose individual ports instead. The job server uses ports 80
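The two setups contrasted above can be sketched as commands; the image tag is a placeholder, and the ports listed (8099 job endpoint, 8098 artifact endpoint, 8097 expansion service) are the job server's usual defaults:

```shell
# Linux: host networking exposes all job server ports at once
docker run --net=host apache/beam_spark_job_server:2.25.0

# Windows: host networking is unavailable, so publish ports individually
docker run -p 8099:8099 -p 8098:8098 -p 8097:8097 \
  apache/beam_spark_job_server:2.25.0
```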

Re: Beam Python SDK java.io.IOException: Cannot run program "docker"

2020-10-21 Thread Kyle Weaver
Are you able to SSH into your TaskManager nodes and run `docker run hello-world` successfully? On Wed, Oct 21, 2020 at 11:04 AM Mike Lo wrote: > Hi all, > > I'm following these instructions > to try to run a > Python SDK Beam job on a distri

Re: Querying Dataflow job status via Java SDK

2020-10-12 Thread Kyle Weaver
> I think the answer is to use a DataflowClient in the second service, but creating one requires DataflowPipelineOptions. Are these options supposed to be exactly the same as those used by the first service? Or do only some of the fields have to be the same? Most options are not necessary for retr

Re: Beam Flink Kafka example: issues with docker

2020-10-05 Thread Kyle Weaver
On Tue, Sep 29, 2020 at 7:39 PM Kyle Weaver wrote: > > > @Kyle, BEAM-5440 mentions a workaround ("For local testing, users may > want to mount

Re: Java/Flink - Flink's shaded Netty and Beam's Netty clash

2020-10-01 Thread Kyle Weaver
Can you provide your Beam and Flink versions as well? On Thu, Oct 1, 2020 at 5:59 AM Tomo Suzuki wrote: > To fix the problem we need to identify which JAR file contains > io.grpc.netty.shaded.io.netty.util.collection.IntObjectHashMap. Can you > check which version of which artifact (I suspect i

Re: Beam Flink Kafka example: issues with docker

2020-09-29 Thread Kyle Weaver
worked with Dataflow yet). Will any of the issues with > xlang kafka also be an issue when writing an MQTT transform? > > > On Mon, Sep 28, 2020 at 8:34 PM Kyle Weaver wrote: > >> Sorry, didn't read closely.. LOOPBACK won't work if you're doing >> c

Re: Beam Flink Kafka example: issues with docker

2020-09-28 Thread Kyle Weaver
> is resolved). On Mon, Sep 28, 2020 at 11:32 AM Kyle Weaver wrote: > > This looks to me like an issue with artifact staging. It looks like the > worker is trying to start the apache/beam_java_sdk:2.24.0 environment, but > can't find the jar that we staged that contains the code

Re: Beam Flink Kafka example: issues with docker

2020-09-28 Thread Kyle Weaver
> This looks to me like an issue with artifact staging. It looks like the worker is trying to start the apache/beam_java_sdk:2.24.0 environment, but can't find the jar that we staged that contains the code for the Java KafkaIO. Yeah. This kind of error most often happens when the job server and Be

Re: Flink JobService on k8s

2020-09-22 Thread Kyle Weaver
-6mhtq > :/tmp/beam-artifact-staging/3024e5d862fef831e830945b2d3e4e9511e0423bfb9c48de75aa2b3b67decce4 > > Do the jobserver and the taskmanager need to share the artifact staging > volume? > > On Tue, Sep 22, 2020 at 4:04 PM Kyle Weaver wrote: > >> > rpc error: c

Re: Flink JobService on k8s

2020-09-22 Thread Kyle Weaver
> The issue is that the jobserver does not provide the proper endpoints to the SDK harness when it submits the job to flink. I would not be surprised if there was something weird going on with Docker in Docker. The defaults mostly work fine when an external SDK harness is used [1]. Can you provid

Re: OOM issue on Dataflow Worker by doing string manipulation

2020-09-02 Thread Kyle Weaver
> It looks like `writer.setLength(0)` may actually allocate a new buffer, and then the buffer may also need to be resized as the String grows, so you could be creating a lot of orphaned buffers very quickly. I'm not that familiar with StringBuilder; is there a way to reset it and re-use the existin

Re: OOM issue on Dataflow Worker by doing string manipulation

2020-09-02 Thread Kyle Weaver
You can try scoping the string builder instance to processElement, instead of making it a member of your DoFn. The same DoFn instance can be used for a bundle of many elements, or possibly even across multiple bundles. https://beam.apache.org/releases/javadoc/2.23.0/org/apache/beam/sdk/transforms/
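The scoping advice above can be illustrated in plain Python (this is not the Beam API, just a sketch of instance reuse): a DoFn instance is reused for a whole bundle of elements, and possibly across bundles, so a buffer kept as an instance attribute accumulates indefinitely, while one created inside process() is eligible for garbage collection after every element.

```python
# Illustration only: a DoFn-like object reused across many elements.
class LeakyFn:
    def __init__(self):
        self.builder = []            # instance-scoped: grows across elements

    def process(self, element):
        self.builder.append(element)
        return ''.join(self.builder)

class ScopedFn:
    def process(self, element):
        builder = [element]          # element-scoped: freed after each call
        return ''.join(builder)

leaky, scoped = LeakyFn(), ScopedFn()
results = [(leaky.process(e), scoped.process(e)) for e in 'abc']
print(results[-1])  # ('abc', 'c') -- the instance-scoped buffer kept everything
```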

Re: Getting Beam(Python)-on-Flink-on-k8s to work

2020-08-28 Thread Kyle Weaver
>>>> - name: taskmanger >>>> image: myregistry:5000/docker-flink:1.10 >>>> env: >>>> - name: DOCKER_HOST >>>> value: tcp://localhost:2375 >>>> ... >>>> >>>> I quickly threw all these pieces

Re: Getting Beam(Python)-on-Flink-on-k8s to work

2020-08-26 Thread Kyle Weaver
> - With the Flink operator, I was able to submit a Beam job, but hit the issue that I need Docker installed on my Flink nodes. I haven't yet tried changing the operator's yaml files to add Docker inside them. Running Beam workers via Docker on the Flink nodes is not recommended (and probably not

Re: Need Support for ElasticSearch 7.x for beam

2020-08-24 Thread Kyle Weaver
This ticket indicates Elasticsearch 7.x has been supported since Beam 2.19: https://issues.apache.org/jira/browse/BEAM-5192 Are there any specific features you need that aren't supported? On Mon, Aug 24, 2020 at 11:33 AM Mohil Khare wrote: > Hello, > > Firstly I am on java sdk 2.23.0 and we hea

Re: [ANNOUNCE] Beam 2.23.0 Released

2020-07-30 Thread Kyle Weaver
Hi Eleanore, there have been no changes to Beam's supported Flink versions since Beam 2.21.0. Beam supports Flink 1.8, 1.9, and 1.10. If you are looking for Flink 1.11 support, I didn't find an existing issue, so I filed https://issues.apache.org/jira/browse/BEAM-10612. On Thu, Jul 30, 2020 at 2:

Re: Testing Apache Beam pipelines / python SDK

2020-07-17 Thread Kyle Weaver
'Map to String' >> beam.Map(add_year) assert_that(lines, equal_to(["expected_value1", "expected_value2", ...])) On Fri, Jul 17, 2020 at 3:02 PM Kyle Weaver wrote: > > I had a look at the util_test.py, and i see that in those tests > pipelines are being cr

Re: Testing Apache Beam pipelines / python SDK

2020-07-17 Thread Kyle Weaver
> I had a look at the util_test.py, and I see that in those tests pipelines are being created as part of tests, and in these tests what are being tested are Beam functions - e.g. beam.Map etc. assert_that checks the results of an entire pipeline, not individual transforms. You should be able to a

Re: Not able to create a checkpoint path

2020-07-10 Thread Kyle Weaver
For properties not exposed through Beam's FlinkPipelineOptions, Flink's usual configuration management (usually conf/flink-conf.yaml) applies. You should be able to set the checkpoint directory via state.checkpoints.dir per https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoint
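A minimal sketch of the Flink-side setting; the directory URI is a placeholder to adapt to your storage:

```yaml
# conf/flink-conf.yaml
state.checkpoints.dir: hdfs:///flink/checkpoints   # placeholder URI
```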

Re: Building Dataflow Worker

2020-06-15 Thread Kyle Weaver
> Looks like there is an issue with Gradle Errorprone's plugin. I saw the same error earlier today: https://jira.apache.org/jira/browse/BEAM-10263 I haven't done a full investigation yet, but it seems this broke the build for several (?) past Beam source releases. I'm not sure what Beam's policy

Re: Streaming Beam jobs keep restarting on Spark/Kubernetes?

2020-06-08 Thread Kyle Weaver
> There is no error Are you sure? That sounds like a crash loop to me. It might take some digging through various Kubernetes logs to find the cause. Can you provide more information about how you're running the job? On Mon, Jun 8, 2020 at 1:50 PM Joseph Zack wrote: > Anybody out there running

Re: Issue while submitting python beam pipeline on flink - local

2020-06-05 Thread Kyle Weaver
don't think this is access > > issue as the instance on which the flink cluster is running has full > > access to gcs. > > > > > > I tried following this > > https://stackoverflow.com/questions/59429897/beam-running-on-flink-with-python-sdk-and-us

Re: Cross-Language pipeline fails with PROCESS SDK Harness

2020-06-01 Thread Kyle Weaver
etting *virtualenv* for a worker console > where it should be running. > > It would be useful to print out such errors with Error level log, I think. > > On 29 May 2020, at 18:55, Kyle Weaver wrote: > > That's probably a problem with your worker. You'll need to get

Re: Worker pool question

2020-05-29 Thread Kyle Weaver
> Does this mean, as an end user of Beam, I can start a worker pool and have my Pipeline executed by this pool of workers (or, are these options strictly internal)? What should be the runner value in PipelineOptions in that case? Not exactly. The runner and Beam workers work together to execute y

Re: Cross-Language pipeline fails with PROCESS SDK Harness

2020-05-29 Thread Kyle Weaver
il.concurrent.UncheckedExecutionException: > java.lang.IllegalStateException: Process died with exit code 1 > > If it’s unknown issue, I’ll create a Jira for that. > > On 29 May 2020, at 16:46, Kyle Weaver wrote: > > Alexey, can you try adding --experiments=beam_fn_api to you

Re: Cross-Language pipeline fails with PROCESS SDK Harness

2020-05-29 Thread Kyle Weaver
beam:transform:read:v1 primitives, but > transform Create.Values/Read(CreateSource) executes in environment > Optional[urn: "beam:env:docker:v1" > payload: "\n\033apache/beam_java_sdk:2.20.0" > > Do you think it’s a bug or I miss something in configuration? > >

Re: HDFS I/O with Beam on Spark and Flink runners - consistent Error messages

2020-05-28 Thread Kyle Weaver
Hi Buvana, I suspect this is a bug. If you can try running your pipeline again with these changes: 1. Remove `--spark-master-url spark://:7077` from your Docker run command. 2. Add `--environment_type=LOOPBACK` to your pipeline options. It will help us confirm the cause of the issue. On

Re: Cross-Language pipeline fails with PROCESS SDK Harness

2020-05-28 Thread Kyle Weaver
Can you try removing the cross-language component(s) from the pipeline and see if it still has the same error? On Thu, May 28, 2020 at 4:15 PM Alexey Romanenko wrote: > For testing purposes, it’s just “Create.of(“Name1”, “Name2”, ...)" > > On 28 May 2020, at 19:29, Kyle Weaver w

Re: Pipeline Processing Time

2020-05-28 Thread Kyle Weaver
Which runner are you using? On Thu, May 28, 2020 at 1:43 PM Talat Uyarer wrote: > Hi, > > I have a pipeline which has 5 steps. What is the best way to measure > processing time for my pipeline? > > Thanks >

Re: Cross-Language pipeline fails with PROCESS SDK Harness

2020-05-28 Thread Kyle Weaver
What source are you using? On Thu, May 28, 2020 at 1:24 PM Alexey Romanenko wrote: > Hello, > > I’m trying to run a Cross-Language pipeline (Beam 2.21, Java pipeline with > an external Python transform) with a PROCESS SDK Harness and Spark Portable > Runner but it fails. > To do that I have a ru

Re: Issue while submitting python beam pipeline on flink - local

2020-05-28 Thread Kyle Weaver
> You are using the LOOPBACK environment which requires that the Flink > cluster can connect back to your local machine. Since the loopback > environment by default binds to localhost that should not be possible. On the Flink runner page, we recommend using --net=host to avoid the kinds of networ

Re: Issue while submitting python beam pipeline on flink - local

2020-05-28 Thread Kyle Weaver
On Thu, May 28, 2020 at 7:31 AM Kyle Weaver wrote: > 2.21.0 should be available now: > https://hub.docker.com/layers/apache/beam_flink1.9_job_server/2.21.0/images/sha256-eeac6dd4571794a8f985e9967fa0c1522aa56a28b5b0a0a34490a600065f096d?context=explore > > On Thu, May 28, 2020 at 7:

Re: Issue while submitting python beam pipeline on flink - local

2020-05-28 Thread Kyle Weaver
Latest Version available on docker hub is 2.20.0 > > > > > > *From:* Kyle Weaver > *Sent:* 28 May 2020 16:51 > *To:* user@beam.apache.org > *Subject:* Re: Issue while submitting python beam pipeline on flink - > local

Re: Issue while submitting python beam pipeline on flink - local

2020-05-28 Thread Kyle Weaver
Hi Ashish, can you check to make sure apachebeam/flink1.9_job_server is also on version 2.21.0? On Thu, May 28, 2020 at 7:13 AM Ashish Raghav wrote: > Hello Guys , > > > > I am trying to run a python beam pipeline on flink. I am trying to run > apache_beam.examples.wordcount_minimal but with Pip

[ANNOUNCE] Beam 2.21.0 Released

2020-05-28 Thread Kyle Weaver
here: https://beam.apache.org/get-started/downloads/ This release includes bug fixes, features, and improvements detailed on the Beam blog: https://beam.apache.org/blog/beam-2.21.0/ Thanks to everyone who contributed to this release, and we hope you enjoy using Beam 2.21.0. -- Kyle Weaver, on

Re: Portable Runner performance optimisation

2020-05-15 Thread Kyle Weaver
Note also that a worker pool should only retrieve artifacts once: https://github.com/apache/beam/pull/9398 On Fri, May 15, 2020 at 12:15 PM Luke Cwik wrote: > > > On Fri, May 15, 2020 at 9:01 AM Kyle Weaver wrote: > >> > Yes, you can start docker containers beforehand

Re: Portable Runner performance optimisation

2020-05-15 Thread Kyle Weaver
> Yes, you can start docker containers beforehand using the worker_pool option: However, it only works for Python. Java doesn't have it yet: https://issues.apache.org/jira/browse/BEAM-8137 On Fri, May 15, 2020 at 12:00 PM Kyle Weaver wrote: > > 2. Is it possible to pre-r

Re: Portable Runner performance optimisation

2020-05-15 Thread Kyle Weaver
> 2. Is it possible to pre-run SDK Harness containers and reuse them for every Portable Runner pipeline? I could save quite a lot of time on this for more complicated pipelines. Yes, you can start docker containers beforehand using the worker_pool option: docker run -p=5:5 apachebeam/pyth
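A sketch of starting a Python SDK worker pool and pointing a pipeline at it; the port 50000 is an assumption (it is commonly used for this purpose) and the image tag is a placeholder:

```shell
# Start a long-lived SDK harness worker pool
docker run -p=50000:50000 apachebeam/python3.7_sdk:2.21.0 --worker_pool

# Then submit pipelines with an external environment pointing at it
--environment_type=EXTERNAL \
--environment_config=localhost:50000
```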

Re: beam python on spark-runner

2020-05-14 Thread Kyle Weaver
Keep in mind that those instructions about spark-submit are meant only to apply to the Java-only runner. For Python, running spark-submit in this manner is not going to work. See https://issues.apache.org/jira/browse/BEAM-8970 On Thu, May 14, 2020 at 2:55 PM Heejong Lee wrote: > How did you sta

Re: Upgrading from 2.15 to 2.19 makes compilation fail on trigger

2020-05-05 Thread Kyle Weaver
> Maybe we should add a statement like "did you mean to wrap it in Repeatedly.forever?" to the error message +1. IMO the less indirection between the user and the fix, the better. On Tue, May 5, 2020 at 12:08 PM Luke Cwik wrote: > Pointing users to the website with additional details in the err

Re: Warsaw Beam Meetup - May 14th!

2020-05-04 Thread Kyle Weaver
Looks like some cool talks! Thanks for sharing Brittany. On Mon, May 4, 2020 at 7:52 PM Brittany Hermann wrote: > Happy Monday, folks! > > I just wanted to share that I am partnering with Polidea to host the > second Digital Warsaw Apache Beam Meetup on May 14th. Save the date and > check out th

Re: Beam + Flink + Docker - Write to host system

2020-05-01 Thread Kyle Weaver

Re: Beam + Flink + Docker - Write to host system

2020-04-30 Thread Kyle Weaver

Re: Set parallelism for each operator

2020-04-29 Thread Kyle Weaver
Which runner are you using? On Wed, Apr 29, 2020 at 1:32 PM Eleanore Jin wrote: > Hi all, > > I just wonder can Beam allow to set parallelism for each operator > (PTransform) separately? Flink provides such feature. > > The usecase I have is the source is kafka topics, which has less > partition

Re: Beam + Flink + Docker - Write to host system

2020-04-29 Thread Kyle Weaver
> This seems to have worked, as the output file is created on the host system. However the pipeline silently fails, and the output file remains empty. Have you checked the SDK container logs? They are most likely to contain relevant failure information. > I don't know if this is a result of me re

Re: Kafka IO: value of expansion_service

2020-04-27 Thread Kyle Weaver
>> at >> org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContext.outputWithTimestamp(FnApiDoFnRunner.java:1411) >> at >> org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContext.output(FnApiDoFnRunner.java:1406) >> at >> org.apache.beam.sdk.transforms.D

Re: SparkRunner on k8s

2020-04-24 Thread Kyle Weaver
ns to the job runner that would eventually > translate to ' --volume /storage1:/storage1 ' while the docker container is > being run by Flink? Even if it means code changes and building from source, > its fine. Please point me in the right direction. > > Thanks, > Buvana > --

Re: Kafka IO: value of expansion_service

2020-04-22 Thread Kyle Weaver
Apr 22, 2020 at 1:47 PM Kyle Weaver wrote: > >> You can build the Java SDK image from source by running the following >> command: ./gradlew :sdks:java:container:docker >> >> On Wed, Apr 22, 2020 at 4:43 PM Piotr Filipiuk >> wrote: >> >>> Thanks for q

Re: Kafka IO: value of expansion_service

2020-04-22 Thread Kyle Weaver
You can build the Java SDK image from source by running the following command: ./gradlew :sdks:java:container:docker On Wed, Apr 22, 2020 at 4:43 PM Piotr Filipiuk wrote: > Thanks for quick response. > > Since Beam 2.21.0 is not yet available via pip >

Re: SparkRunner on k8s

2020-04-16 Thread Kyle Weaver
id a git checkout of release-2.19.0 branch and > executed the portable runner and I still encounter this error. ☹ > > > > -Buvana > > > > *From: *Kyle Weaver > *Reply-To: *"user@beam.apache.org" > *Date: *Wednesday, April 15, 2020 at 2:48 PM > *To: *&qu

Re: Sending email from Apache Beam

2020-04-15 Thread Kyle Weaver
The easiest way is to do something like this (assuming you are using Python): CombineGlobally(lambda elements: ','.join(elements)) For more info on Combines, check out the programming guide: https://beam.apache.org/documentation/programming-guide/#core-beam-transforms On Wed, Apr 15, 2020 at 4:4
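The combine function suggested above is an ordinary callable, shown here outside Beam so it can be run directly. The sorted() call is an addition to make the output deterministic, since Beam does not guarantee element order in a global combine:

```python
# Join all elements of a collection into one comma-separated string;
# sorted() makes the result deterministic for this standalone example.
def join_elements(elements):
    return ','.join(sorted(elements))

print(join_elements(['b@example.com', 'a@example.com']))
# a@example.com,b@example.com
```

Inside a pipeline, the same callable would be passed to CombineGlobally as in the snippet above.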

Re: SparkRunner on k8s

2020-04-15 Thread Kyle Weaver
e > "/usr/local/lib/python3.6/site-packages/apache_beam/io/filebasedsink.py", > line 134, in open > > return FileSystems.create(temp_path, self.mime_type, > self.compression_type) > > File > "/usr/local/lib/python3.6/site-packages/apache_beam/io/files

Re: SparkRunner on k8s

2020-04-13 Thread Kyle Weaver
type) > > File > "/usr/local/lib/python3.6/site-packages/apache_beam/io/localfilesystem.py", > line 137, in _path_open > > raw_file = open(path, mode) > > FileNotFoundError: [Errno 2] No such file or directory: > '/tmp/beam-temp-result.txt-43eab494

Re: SparkRunner on k8s

2020-04-13 Thread Kyle Weaver
Hi Buvana, Running Beam Python on Spark on Kubernetes is more complicated, because Beam has its own solution for running Python code [1]. Unfortunately there's no guide that I know of for Spark yet, however we do have instructions for Flink [2]. Beam's Flink and Spark runners, and I assume GCP's (

Re: GCS numShards doubt

2020-03-02 Thread Kyle Weaver
As Luke and Robert indicated, unsetting num shards _may_ cause the runner to optimize it automatically. For example, the Flink [1] and Dataflow [2] runners override num shards. However, in the Spark runner, I don't see any such override. So I have two questions: 1. Does the Spark runner override

Re: RDD Caching in SparkRunner

2020-02-26 Thread Kyle Weaver
> Persisting is usually the right thing to do. +1, cacheDisabled should only be used if you're certain that *in aggregate* recomputation is faster than writing to and reading from the cache. Keep in mind that cacheDisabled applies to the whole pipeline, meaning you're out of luck if you want to re

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-02-20 Thread Kyle Weaver
Hi Tobi, This seems like a bug with Beam 2.19. I filed https://issues.apache.org/jira/browse/BEAM-9345 to track the issue. > What puzzles me is that the session cluster should be allowed to have multiple environments in detached mode - or am I wrong? It looks like that check is removed in Flink

Re: Unable to reliably have multiple cores working on a dataset with DirectRunner

2020-01-29 Thread Kyle Weaver
> I also tried briefly SparkRunner with version 2.16 but was not able to achieve any throughput. What do you mean by this? On Wed, Jan 29, 2020 at 1:20 PM Julien Lafaye wrote: > I confirm the situation gets better after the commit: 4 cores used for 18 > seconds rather than one core used for 50 s

Re: runShadow: prebuild and build in read-only directory

2020-01-09 Thread Kyle Weaver
You can build the job server jar using: ./gradlew runners:flink:1.8:job-server:shadowJar The output jar will be located in: runners/flink/1.8/job-server/build/libs/ You can run the jar using `java -jar`. Hope that helps. On Thu, Jan 9, 2020 at 10:47 AM Robert Lugg wrote: > I am able to run
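The steps above as a command sequence; the exact jar file name depends on your checkout, hence the wildcard:

```shell
./gradlew runners:flink:1.8:job-server:shadowJar
java -jar runners/flink/1.8/job-server/build/libs/beam-runners-flink-1.8-job-server-*.jar
```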

Re: Worker pool dies with error: context deadline exceeded

2020-01-02 Thread Kyle Weaver
This is the root cause: > python-sdk_1 | 2019/12/31 02:59:45 Failed to obtain provisioning > information: failed to dial server at localhost:45759 The Flink task manager and Beam SDK harness use connections over `localhost` to communicate. You will have to put `taskmanager` and `python-sdk` on
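One way to realize this in a docker-compose deployment (service names follow the log above; image tags are placeholders) is to let the SDK container join the task manager's network namespace so that `localhost` refers to the same host in both:

```yaml
services:
  taskmanager:
    image: flink:1.9-scala_2.11             # placeholder
  python-sdk:
    image: apachebeam/python3.7_sdk:2.17.0  # placeholder
    command: ["--worker_pool"]
    network_mode: "service:taskmanager"     # share taskmanager's localhost
```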

Re: Protocol message had invalid UTF-8

2019-12-30 Thread Kyle Weaver
This error can happen when the job server and sdk versions are mismatched (due to protobuf incompatibilities). The sdk and job server containers should use the same beam version. On Mon, Dec 30, 2019 at 11:47 AM Yu Watanabe wrote: > Hello. > > I would like to get help with issue having in job-se

Re: Beam on Flink with Python SDK and using GCS as artifacts directory

2019-12-23 Thread Kyle Weaver
> It will be great if you can get the error from the failing process. Note that you will have to set the log level to DEBUG to get output from the process. On Fri, Dec 20, 2019 at 6:23 PM Ankur Goenka wrote: > Hi Matthew, > > It will be great if you can get the error from the failing process. >

Re: Please assist; how do i use a Sample transform ?

2019-12-17 Thread Kyle Weaver
eFn(...)), , > etc. > > On Tue, Dec 17, 2019 at 2:10 PM Kyle Weaver wrote: > > > > Looks like you need to choose a subclass of sample. Probably > FixedSizeGlobally in your case. For example, > > > > beam.transforms.combiners.Sample.FixedSizeGlobally(5)

Re: Please assist; how do i use a Sample transform ?

2019-12-17 Thread Kyle Weaver
Looks like you need to choose a subclass of sample. Probably FixedSizeGlobally in your case. For example, beam.transforms.combiners.Sample.FixedSizeGlobally(5) Source: https://github.com/apache/beam/blob/df376164fee1a8f54f3ad00c45190b813ffbdd34/sdks/python/apache_beam/transforms/combiners.py#L6
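Conceptually, Sample.FixedSizeGlobally(5) draws a uniform sample of fixed size without replacement from the whole collection; a standard-library sketch of that behavior (not the Beam API itself):

```python
import random

# Uniform sample of 5 elements without replacement -- the idea behind
# Sample.FixedSizeGlobally(5), shown with the standard library.
population = list(range(100))
sample = random.sample(population, 5)
print(len(sample))  # 5
```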

Re: What's every part's responsiblity for python sdk with flink?

2019-12-12 Thread Kyle Weaver
tor" > ? > It sounds like we need to deploy it on every flink cluster node? > > -- Original Message -- > *From:* "Kyle Weaver"; > *Sent:* Friday, December 13, 2019, 12:26 AM > *To:* "user"; > *Cc:* "Maximilian Michels";

Re: What's every part's responsiblity for python sdk with flink?

2019-12-12 Thread Kyle Weaver
The order is: user python code -> job server -> flink cluster -> SDK harness 1. User python code defines the Beam pipeline. 2. The job server executes the Beam pipeline on the Flink cluster. To do so, it must translate Beam operations into Flink native operations. 3. The Flink cluster executes

Re: Beam portable runner

2019-12-02 Thread Kyle Weaver
Hi David, I recall gradlew isn't included in some of our source archives, but it should be included if you download the source from github: https://github.com/apache/beam On Mon, Dec 2, 2019 at 1:40 PM David Edwards wrote: > Hi all, > > Looking for help trying to get the beam portable runner for

Re: Installing system dependencies in a DataFlow worker - how

2019-12-02 Thread Kyle Weaver
age": "(24f8c9b6e647d55d): The workflow could not be created. >> Causes: (24f8c9b6e647de48): Invalid worker harness container image: >> my_image. Custom images are not yet supported.", >> "status": "INVALID_ARGUMENT" >> } >> > >

Re: Installing system dependencies in a DataFlow worker - how

2019-11-27 Thread Kyle Weaver
You can also configure your own Docker images if you like, instructions here: https://beam.apache.org/documentation/runtime/environments/ On Wed, Nov 27, 2019 at 12:38 AM Carl Thomé wrote: > Hi, > > I have a Beam pipeline written in the Python SDK that decodes audio files > into TFRecord:s. I'd
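Following the environments page linked above, a custom image is selected through the portability environment options. A minimal sketch — the image name is hypothetical, and whether a given runner honors it depends on its portability support:

```python
# Hypothetical custom SDK harness image: build it FROM the matching
# apache/beam_python*_sdk base image and push it somewhere workers can pull.
custom_image = "gcr.io/my-project/my-beam-python-sdk:latest"

environment_options = [
    "--environment_type=DOCKER",
    f"--environment_config={custom_image}",
]
```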

Re: Beam with Flink without containers

2019-11-07 Thread Kyle Weaver
Hi Robert, I replied to your main question on SO. (This is becoming a frequently asked question, so we're going to look for ways to improve or at least document this process. Stay tuned.) > Generally, I’d like to have some sort of state diagram to describe who calls what when, if anything like

Re: Running Python Beam Functions on Spark Kubernetes Cluster

2019-10-31 Thread Kyle Weaver
> Looks like for Java, a standalone Job Service is not required to run beam functions on Spark, and spark-submit handles everything in cluster mode. But this is not the case for Python runner. That's correct. > Are you aware of any example in Python that runs in a (i.e. Kubernetes) cluster? Not

Re: How to import external module inside ParDo using Apache Flink ?

2019-09-26 Thread Kyle Weaver
Did you try moving the imports from the process function to the top of main.py? Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com On Wed, Sep 25, 2019 at 11:27 PM Yu Watanabe wrote: > Hello. > > I would like to ask for help with resolving dependency issue for

Re: Word-count example

2019-09-25 Thread Kyle Weaver
rtify Flink taskmanager with docker? Docker in docker (dind)? Just install > in image? Docker-out-of-docker? Any working example folks could point me to > would be great. > > > > Thanks All, > > Matt > > > > [dind?]( > https://medium.com/hootsuite-engineering/b

Re: How do you call user defined ParDo class from the pipeline in Portable Runner (Apache Flink) ?

2019-09-25 Thread Kyle Weaver
You will need to set the save_main_session pipeline option to True. Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com On Wed, Sep 25, 2019 at 3:44 PM Yu Watanabe wrote: > Hello. > > I would like to ask question for ParDo . > > I am getting below error ins

Re: How to reference manifest from apache flink worker node ?

2019-09-23 Thread Kyle Weaver
The relevant configuration flag for the job server is `--artifacts-dir`. @Robert Bradshaw I added this info to the log message: https://github.com/apache/beam/pull/9646 Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com On Mon, Sep 23, 2019 at 11:36 AM Robert Bradshaw

Re: Word-count example

2019-09-23 Thread Kyle Weaver
would start by making sure there are no unneeded Flink clusters, job servers, etc. left running on your machine, as port conflicts can cause silent failures. Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com On Mon, Sep 23, 2019 at 12:58 PM Matthew Patterson wrote: > K

Re: Word-count example

2019-09-23 Thread Kyle Weaver
"WARNING:root:No unique name set for transform ..." should not affect the pipeline's ability to complete successfully. Is the pipeline failing? If so, could you share more logs? Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com On Mon, Sep 23, 2019 at 1

Re: Word-count example

2019-09-19 Thread Kyle Weaver
I'm guessing you need to install virtualenv: `pip install virtualenv` Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com On Thu, Sep 19, 2019 at 11:27 AM Matthew Patterson wrote: > Kyle, > > > > Excellent, will do: unfortunately switch to 2.16 was o

Re: Word-count example

2019-09-19 Thread Kyle Weaver
You should probably use 2.15, since 2.16 release artifacts have not been published yet. Just follow the instructions that say --runner=PortableRunner, not --runner=FlinkRunner, otherwise you'll hit that other deserialization bug that was mentioned. Kyle Weaver | Software Engineer | githu
