Re: ElasticsearchIO Write Batching Problems

2018-12-07 Thread Evan Galpin
Thanks for confirming that this is unexpected behaviour Tim; certainly the EsIO code looks to handle bundling. For the record, I've also confirmed via debugger that `flushBatch()` is not being triggered by large document size. I'm sourcing records from Google's BigQuery. I have 2 queries which
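
For readers skimming this thread, the batching behaviour under discussion can be pictured as a buffer that flushes once either a document-count or a byte-size threshold is reached. What follows is a minimal plain-Python sketch of that idea, not ElasticsearchIO's actual code; `Batcher`, `max_docs`, and `max_bytes` are hypothetical names chosen for illustration.

```python
class Batcher:
    """Buffers serialized JSON documents; flushes on a count or byte threshold.

    Hypothetical illustration only; this is not how EsIO is implemented.
    """

    def __init__(self, max_docs, max_bytes):
        self.max_docs = max_docs
        self.max_bytes = max_bytes
        self.buffer = []
        self.buffered_bytes = 0
        self.flushed = []  # each entry stands in for one bulk request

    def add(self, doc):
        # Track both the number of documents and their encoded size.
        self.buffer.append(doc)
        self.buffered_bytes += len(doc.encode("utf-8"))
        if len(self.buffer) >= self.max_docs or self.buffered_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        # Emit whatever is buffered as a single batch, then reset the counters.
        if self.buffer:
            self.flushed.append(self.buffer)
            self.buffer = []
            self.buffered_bytes = 0
```

With a buffer like this, a symptom such as "every request contains a single document" would suggest that `flush()` is being reached through some path other than the two thresholds, for example at the end of very small bundles.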

Re: ElasticsearchIO Write Batching Problems

2018-12-07 Thread Evan Galpin
believe it is correct to have a PCollection of String On 2018/12/07 14:33:17, Evan Galpin wrote: > Thanks for confirming that this is unexpected behaviour Tim; certainly the > EsIO code looks to handle bundling. For the record, I've also confirmed via > debugger that `flushBatch()

Re: ElasticsearchIO Write Batching Problems

2018-12-07 Thread Evan Galpin
hanks! On 2018/12/07 14:36:44, Evan Galpin wrote: > I forgot to reiterate that the PCollection on which EsIO operates is of type > String, where each element is a valid JSON document serialized _without_ > pretty printing (i.e. without line breaks). If the PCollection should be of a >

Re: Does Beam DynamoDBIO support DynamoDB Streams?

2021-04-03 Thread Evan Galpin
Maybe others can answer more directly from personal experience about reading streams using the existing DynamoDBIO connector, but, looking at the source and docs, it doesn’t seem to support streams. I haven’t done this myself, but it looks as though Kinesis is “the recommended way to consume streams

Re: Rate Limiting in Beam

2021-04-15 Thread Evan Galpin
Could you possibly use a side input with fixed interval triggering[1] to query the Dataflow API to get the most recent log statement of scaling as suggested here[2]? [1] https://beam.apache.org/documentation/patterns/side-inputs/ [2] https://stackoverflow.com/a/54406878/6432284 On Thu, Apr 15, 20

Re: How avoid blocking when decompressing large GZIP files.

2021-04-23 Thread Evan Galpin
I could be wrong but I believe that if your large file is being read by a DoFn, it’s likely that the file is being processed atomically inside that DoFn, which cannot be parallelized further by the runner. One purpose-built way around that constraint is by using Splittable DoFn[1][2] which could b

Re: How avoid blocking when decompressing large GZIP files.

2021-04-23 Thread Evan Galpin
soon as the sequence is applied to it. The issue is that the > entire step of applying the sequence number appear to be blocking. Also of > note, I am using a @DoFn.StateId. > > I'll look at SplittableDoFns, thanks. > > > On Fri, Apr 23, 2021 at 12:50 PM Evan Galpin > wr

Re: Batch load with BigQueryIO fails because of a few bad records.

2021-05-06 Thread Evan Galpin
Hey Matthew, I believe you might also need to use the “ignoreUnknownValues”[1] or skipInvalidRows[2] options depending on your use case if your goal is to allow valid entities to succeed even if invalid entities exist and separately process failed entities via “getFailedResults”. You could also co

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E >> >> We should rollback using the SDF wrapper by default because of the >> usability and performance issues reported. >> >> >> On Sat, May 8, 2021 at 12:57 AM Evan Galpin >>

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
son *May 12, 2021 11:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4 processElement INFO: Got file contents for document_id my-file2.json Any thoughts on what could be causing this? Thanks, Evan On Wed, May 12, 2021 at 9:53 AM Evan Galpin wrote: > > > On Mon, May 10, 2021 at 2:09 PM Boyu

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
I forgot to also mention that in all tests I was setting --experiments=use_deprecated_read Thanks, Evan On Wed, May 12, 2021 at 1:39 PM Evan Galpin wrote: > Hmm, I think I spoke too soon. I'm still seeing an issue of overall > DirectRunner slowness, not just pubsub. I have a pipel

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
Steve Niemitz wrote: > use_deprecated_read was broken in 2.19 on the direct runner and didn't do > anything. [1] I don't think the fix is in 2.20 either, but will be in 2.21. > > [1] https://github.com/apache/beam/pull/14469 > > On Wed, May 12, 2021 at 1:41 PM Evan Galpin wrote: >

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
.19. > > On Wed, May 12, 2021 at 2:55 PM Evan Galpin wrote: > >> Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the "faster" >> behavior, as did v2.23.0. But that "fast" behavior stopped at v2.25.0 (for >> my use case at least) regardless of

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
the slow step is not the read that use_deprecated_read > targets for. Would you like to share your pipeline code if possible? > > On Wed, May 12, 2021 at 1:35 PM Evan Galpin wrote: > >> I just tried with v2.29.0 and use_deprecated_read but unfortunately I >> observed slow behavio

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
; which will fix your performance issue. > > On Wed, May 12, 2021 at 4:40 PM Boyuan Zhang wrote: > >> Hi Evan, >> >> It seems like the slow step is not the read that use_deprecated_read >> targets for. Would you like to share your pipeline code if possible? >> >>

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
d, May 12, 2021 at 5:53 PM Evan Galpin wrote: > >> Ah ok thanks for that. Do you mean use_deprecated_reads is broken >> specifically in 2.29.0 (regression) or broken in all versions up to and >> including 2.29.0 (ie never worked)? >> >> Thanks, >> Evan >>

Re: Extremely Slow DirectRunner

2021-05-14 Thread Evan Galpin
Any further thoughts here? Or tips on profiling Beam DirectRunner? Thanks, Evan On Wed, May 12, 2021 at 6:22 PM Evan Galpin wrote: > Ok gotcha. In my tests, all sdk versions 2.25.0 and higher exhibit slow > behaviour regardless of use_deprecated_reads. Not sure if that points to >

Re: GroupIntoShards not sending bytes further when dealing with huge amount of data

2021-06-14 Thread Evan Galpin
I believe that by default windows will only trigger one time [1]. This has definitely caught me by surprise before. I think that default strategy might be fine for a batch pipeline, but it typically is not for streaming (which I assume you’re using because you mentioned Flink). I believe you’ll want

Re: GroupIntoShards not sending bytes further when dealing with huge amount of data

2021-06-14 Thread Evan Galpin
There have been varied reports of slowness loosely attributed to SDF default wrapper change from 2.25.0. Ex https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-

Re: GroupIntoShards not sending bytes further when dealing with huge amount of data

2021-06-14 Thread Evan Galpin
@robert I have a pipeline which consistently shows a major slowdown (10 seconds vs. 10 minutes) between versions <=2.23.0 and >=2.25.0 that can be boiled down to: - Read GCS file patterns from PubSub - Window into Fixed windows (repeating every 15 seconds) - Deduplicate/distinct (have tried both) -

Re: GroupIntoShards not sending bytes further when dealing with huge amount of data

2021-06-14 Thread Evan Galpin
ts the same issues?) > > On Mon, Jun 14, 2021 at 2:15 PM Evan Galpin wrote: > > > > @robert I have a pipeline which consistently shows a major slowdown (10 > seconds Vs 10 minutes) between version <=2.23.0 and >=2.25.0 that can be > boiled down to: > > > > -

Re: GroupIntoShards not sending bytes further when dealing with huge amount of data

2021-07-09 Thread Evan Galpin
gging efforts. Thanks, Evan On Mon, Jun 14, 2021 at 8:39 PM Robert Bradshaw wrote: > Awesome, thanks! > > On Mon, Jun 14, 2021 at 5:36 PM Evan Galpin wrote: > > > > I’ll try to create something as small as possible from the pipeline I > mentioned 👍 I should

Can I write to disk (using Dataflow)? Handling large files?

2021-07-16 Thread Evan Galpin
Hi there, I imagine the answer to this question might depend on the underlying runner, but simply put: can I write files, temporarily, to disk? I'm currently using the DataflowRunner if that's a helpful detail. Relatedly, how does Beam handle large files? Say that my pipeline reads files from a d

Re: Can I write to disk (using Dataflow)? Handling large files?

2021-07-16 Thread Evan Galpin
Thanks Luke for the info regarding GCS compose. Do you know if the GCS FileIO implementation supports the "compose" API that you mentioned? Thanks, Evan On Fri, Jul 16, 2021 at 12:17 PM Luke Cwik wrote: > > > On Fri, Jul 16, 2021 at 6:53 AM Evan Galpin wrote: > >>

Re: How to test ReadFromPubSub

2021-07-27 Thread Evan Galpin
This SO question is a bit dated, but there's also a pubsub emulator[1] that you might find useful. [1] https://stackoverflow.com/questions/43354599/dataflow-pipeline-and-pubsub-emulator On Tue, Jul 27, 2021 at 1:00 PM Ning Kang wrote: > In the example test, the `mock_pubsub` is from `patch`: >

[Dataflow][Java][2.30.0] Best practice for clearing stuck data in streaming pipeline

2021-08-10 Thread Evan Galpin
Hi all, I recently had an experience where a streaming pipeline became "clogged" due to invalid data reaching the final step in my pipeline such that the data was causing a non-transient error when writing to my Sink. Since the job is a streaming job, the element (bundle) was continuously retryin

Re: [Dataflow][Java][2.30.0] Best practice for clearing stuck data in streaming pipeline

2021-08-10 Thread Evan Galpin
-in IO module of Beam, modification of the Sink may not be immediately feasible. Is the only recourse in that case to drain a job and start a new one? On Tue, Aug 10, 2021 at 12:54 PM Luke Cwik wrote: > > > On Tue, Aug 10, 2021 at 8:54 AM Evan Galpin wrote: > >> Hi all, >> &

Re: [Dataflow][Java][2.30.0] Best practice for clearing stuck data in streaming pipeline

2021-08-10 Thread Evan Galpin
Good to know this works! Thanks Luke 🙏 On Tue, Aug 10, 2021 at 1:19 PM Luke Cwik wrote: > > > On Tue, Aug 10, 2021 at 10:11 AM Evan Galpin > wrote: > >> Thanks for your responses Luke. One point I have confusion over: >> >> * Modify the sink implementation

Re: Processing historical data

2021-08-24 Thread Evan Galpin
Hi Kishore, You may be able to introduce a new pipeline which reads from BQ and publishes to PubSub like you mentioned. By default, elements read via PubSub will have a timestamp which is equal to the publish time of the message (internally established by PubSub service). Are you using a custom w

[Python] Heterogeneous TaggedOutput Type Hints

2021-09-07 Thread Evan Galpin
Hi all, What is the recommended way to write type hints for a tagged output DoFn where the outputs to different tags have different types? I tried using a Union to describe each of the possible output types, but that resulted in mismatched coder errors where only the last entry in the Union was u

Re: [Python] Heterogeneous TaggedOutput Type Hints

2021-09-08 Thread Evan Galpin
e a bug. Do > you have a minimal repro?) > > On Tue, Sep 7, 2021 at 1:23 PM Evan Galpin wrote: > > > > Hi all, > > > > What is the recommended way to write type hints for a tagged output DoFn > where the outputs to different tags have different types? > &

Re: [Python] Heterogeneous TaggedOutput Type Hints

2021-09-17 Thread Evan Galpin
range. Yes, the exact error on the service would be helpful. > > On Wed, Sep 8, 2021 at 10:12 AM Evan Galpin wrote: > > > > Thanks for the response. I've created a gist here to demonstrate a > minimal repro: > https://gist.github.com/egalpin/2d6ad2210cf9f66108ff48a9c7

Re: [Python] Heterogeneous TaggedOutput Type Hints

2021-09-21 Thread Evan Galpin
" hidden in the harness logs. It seems to be consistently repeatable with any TaggedOutput + GBK afterwards. Any advice on how to proceed? Thanks, Evan On Fri, Sep 17, 2021 at 11:20 AM Evan Galpin wrote: > The Dataflow error logs only showed 1 error which was: "The job failed > be

Re: [Python] Heterogeneous TaggedOutput Type Hints

2021-09-22 Thread Evan Galpin
.com/apache/beam/blob/ebf2aacf37b97fc85b167271f184f61f5b06ddc3/sdks/python/apache_beam/pvalue.py#L99 > 2: > https://github.com/apache/beam/blob/ebf2aacf37b97fc85b167271f184f61f5b06ddc3/sdks/python/apache_beam/pvalue.py#L234 > > On Tue, Sep 21, 2021 at 10:29 AM Evan Galpin > wrote:

Re: Limit the concurrency of a Beam Step (or all the steps)

2021-09-24 Thread Evan Galpin
This has been mentioned a few times and seems to me to be a fairly common requirement. I think that a rate limit could be accomplished through stateful processing, using a combination of bagState and Timers. GroupIntoBatches.java would be a good example. I wonder if this would be a good built-in
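
The "buffer in state, release on a timer" idea mentioned above can be sketched outside of Beam. This is a plain-Python illustration under stated assumptions, not Beam's state/timer API and not the GroupIntoBatches implementation: `BufferedRateLimiter` and `max_per_tick` are hypothetical names, a `deque` stands in for the bag state, and `on_timer` is assumed to be called once per rate-limit interval.

```python
from collections import deque


class BufferedRateLimiter:
    """Buffers elements and releases at most `max_per_tick` per timer firing."""

    def __init__(self, max_per_tick):
        self.max_per_tick = max_per_tick
        # Stand-in for bag state. A real BagState does not promise ordering;
        # the FIFO behaviour here is an assumption of this sketch.
        self.buffer = deque()

    def process(self, element):
        # Incoming elements are only buffered, never emitted directly.
        self.buffer.append(element)

    def on_timer(self):
        # Called once per interval: drain up to the allowed number of elements.
        released = []
        while self.buffer and len(released) < self.max_per_tick:
            released.append(self.buffer.popleft())
        return released
```

The design choice worth noting is that throughput is bounded by how often the timer fires rather than by how quickly elements arrive, which is why a timer is needed in addition to the buffered state.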

Re: coder in ReadFromBigQuery doesn't do "anything"

2021-10-17 Thread Evan Galpin
Is “the result” being printed or viewed via debugger? Is there a chance that the __repr__ or similar method for proto produces a dict strictly for printing/serialization? Thanks, Evan On Sun, Oct 17, 2021 at 01:50 Mark Striebeck wrote: > Hi, > > I have the following BigQuery table: > > Name: Da

[Dataflow][Java] Guidance on Transform Mapping Streaming Update

2021-10-21 Thread Evan Galpin
Hi all, I'm looking for any help regarding updating streaming jobs which are already running on Dataflow. Specifically I'm seeking guidance for situations where Fusion is involved, and trying to decipher which old steps should be mapped to which new steps. I have a case where I updated the steps

Re: [Dataflow][Java] Guidance on Transform Mapping Streaming Update

2021-10-22 Thread Evan Galpin
ind steps that have been added/removed/renamed. > > The PipelineDotRenderer[1] might be of use as well. > > 1: > https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/renderer/PipelineDotRenderer.java > >

Re: Running Query Containing Aggregation Using ElasticsearchIO Read

2022-03-09 Thread Evan Galpin
You're correct in your assessment that ElasticsearchIO does not currently support queries with aggregations. There's a large difference between scrolling over large sets of documents (which has a common interface provided by ES) Vs aggregations where user-code in the query will impact the output f

Re: Running Query Containing Aggregation Using ElasticsearchIO Read

2022-03-09 Thread Evan Galpin
kick off a single batch > transform to group/aggregate/filter the data in Elasticsearch, so the > workers will only need to search the result without needing to run > aggregation. But this can get ugly too because the workers need to wait > for the Elasticsearch transform to finish. >

Re: [Python] Heterogeneous TaggedOutput Type Hints

2022-03-23 Thread Evan Galpin
pdate the internal >> bug with details and/or you could reach out to GCP support with job ids >> and/or minimal repros to get support as well. >> >> >> >> On Wed, Sep 22, 2021 at 6:57 AM Evan Galpin >> wrote: >> >>> >> >>> Thanks for the

Re: Trino Runner

2022-03-30 Thread Evan Galpin
Though this might not be what's being asked, might it be possible to integrate Presto using the JDBC driver[1] along with Beam's JdbcIO[2] ? [1] https://prestodb.io/docs/current/installation/jdbc.html [2] https://beam.apache.org/releases/javadoc/2.37.0/org/apache/beam/sdk/io/jdbc/JdbcIO.html - Ev

Re: [Question] Is message order preserved in windows?

2022-05-17 Thread Evan Galpin
Hi Gyorgy, To my knowledge, ordering of elements is not guaranteed. To order your elements in each window by time, you would need to sort the iterables produced by the groupByKey. You might use SortValues[1] to do that. [1] https://beam.apache.org/documentation/sdks/java-extensions/#example-usage
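
To make the "group, then sort each iterable" step concrete, here is a small plain-Python sketch. It is not the SortValues API; `group_and_sort` and the `(key, timestamp, value)` tuple layout are assumptions made for illustration.

```python
from collections import defaultdict


def group_and_sort(events):
    """Groups (key, timestamp, value) tuples by key and sorts each group by timestamp.

    Mirrors the idea of a GroupByKey followed by a per-key sort of the values.
    """
    grouped = defaultdict(list)
    for key, timestamp, value in events:
        grouped[key].append((timestamp, value))
    # Arrival order within a group is arbitrary, so sort explicitly.
    return {key: sorted(values, key=lambda pair: pair[0])
            for key, values in grouped.items()}
```

For example, `group_and_sort([("a", 3, "x"), ("a", 1, "z")])` yields the values for key `"a"` in timestamp order regardless of the order in which they arrived.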

Re: Chained Job Graph Apache Beam | Dataflow

2022-06-15 Thread Evan Galpin
It may also be helpful to explore CoGroupByKey as a way of joining data, though, depending on the shape of the data, doing so may not fit in memory: https://beam.apache.org/documentation/transforms/java/aggregation/cogroupbykey/ - Evan On Wed, Jun 15, 2022 at 3:45 PM Bruno Volpato wrote: > Hello, >

Re: [Dataflow][Java] Guidance on Transform Mapping Streaming Update

2022-07-05 Thread Evan Galpin
ace" is not present in the job file, maybe an internal Dataflow thing?), so I'm confident that the coder hasn't actually changed. I'm not sure how to proceed in updating the running pipeline, and I'd really prefer not to drain. Any ideas? Thanks, Evan On Fri, Oct 2

Re: [Dataflow][Java] Guidance on Transform Mapping Streaming Update

2022-07-08 Thread Evan Galpin
ou want to do an update > to get the latest version? > > Feel free to share the job files with GCP support. It could be something > internal but the coders for ephemeral steps that Dataflow adds are based > upon existing coders within the graph. > > On Tue, Jul 5, 2022 at 8:03 AM E

Re: [Dataflow][Java] Guidance on Transform Mapping Streaming Update

2022-07-08 Thread Evan Galpin
he job files with GCP support. It could be something >>> internal but the coders for ephemeral steps that Dataflow adds are based >>> upon existing coders within the graph. >>> >>> On Tue, Jul 5, 2022 at 8:03 AM Evan Galpin wrote: >>> >>>> +

[Dataflow][Java][stateful] Workflow Failed when trying to introduce stateful RateLimit

2022-08-05 Thread Evan Galpin
Hi all, I'm trying to create a RateLimit[1] transform that's based fairly heavily on GroupIntoBatches[2]. I've been able to run unit tests using TestPipeline to verify desired behaviour and have also run successfully using DirectRunner. However, when I submit the same job to Dataflow it completel

Re: [Dataflow][Java][stateful] Workflow Failed when trying to introduce stateful RateLimit

2022-08-05 Thread Evan Galpin
logs should have some details as to why the work items > are failing. > > > On Fri, Aug 5, 2022 at 7:36 AM Evan Galpin wrote: > >> Hi all, >> >> I'm trying to create a RateLimit[1] transform that's based fairly heavily >> on GroupIntoBatches[2]. I

Re: [Dataflow][Java][stateful] Workflow Failed when trying to introduce stateful RateLimit

2022-08-05 Thread Evan Galpin
erence in the implementation may > allow you to work around what is impacting the pipelines. > > On Fri, Aug 5, 2022 at 9:40 AM Evan Galpin wrote: > >> Thanks Luke, I've opened a support case as well but thought it would be >> prudent to ask here in case there was somethi

Re: [JAVA] Handling repeated elements when merging two pcollections

2022-08-10 Thread Evan Galpin
Hi Shivam, When you say "merge the PCollections" do you mean Flatten, or somehow join? CoGroupByKey[1] would be a good choice if you need to join based on key. You would then be able to implement application logic to keep 1 of the 2 records if there is a way to decipher an element from CollectionA
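
The join-then-pick-one approach can be sketched in plain Python. This is only an illustration of the idea, not Beam's CoGroupByKey API: `co_group_by_key` and `prefer_a` are hypothetical helpers, and "prefer the record from collection A" is an assumed rule standing in for whatever application logic applies.

```python
from collections import defaultdict


def co_group_by_key(coll_a, coll_b):
    """Joins two lists of (key, value) pairs into {key: (values_from_a, values_from_b)}."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in coll_a:
        grouped[key][0].append(value)
    for key, value in coll_b:
        grouped[key][1].append(value)
    return dict(grouped)


def prefer_a(grouped):
    """Keeps one record per key: the one from A if present, otherwise the one from B."""
    return {key: (from_a[0] if from_a else from_b[0])
            for key, (from_a, from_b) in grouped.items()}
```

Keys that appear in only one collection pass through unchanged, while keys present in both resolve to the record from A.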

Re: [question] Good Course to learn beam

2022-08-30 Thread Evan Galpin
+dev for additional visibility/input On Mon, Aug 29, 2022 at 11:10 AM Leandro Nahabedian via user < user@beam.apache.org> wrote: > Hi community! > > I'm looking for a good course to learn apache beam and I saw this one >

[troubleshooting] KafkaIO#write gets stuck "since the associated topicId changed from null to "

2022-09-13 Thread Evan Galpin
Hi all, I've recently started using the KafkaIO connector as a sink, and am new to Kafka in general. My kafka clusters are hosted by Confluent Cloud. I'm using Beam SDK 2.41.0. At least daily, the producers in my Beam pipeline are getting stuck in a loop frantically logging this message: Node

Re: [troubleshooting] KafkaIO#write gets stuck "since the associated topicId changed from null to "

2022-09-13 Thread Evan Galpin
t;))) Thanks, Evan On Tue, Sep 13, 2022 at 10:30 AM John Casey wrote: > Hi Evan, > > I haven't seen this before. Can you share your Kafka write configuration, > and any other stack traces that could be relevant? > > John > > On Tue, Sep 13, 2022 at 10:23 AM Evan G

Re: [troubleshooting] KafkaIO#write gets stuck "since the associated topicId changed from null to "

2022-09-13 Thread Evan Galpin
gs when this disconnect starts, on the Beam or Kafka side > of things? > > On Tue, Sep 13, 2022 at 10:38 AM Evan Galpin wrote: > >> Thanks for the quick reply John! I should also add that the root issue >> is not so much the logging, rather that these log messages seem to b

Re: [troubleshooting] KafkaIO#write gets stuck "since the associated topicId changed from null to "

2022-09-13 Thread Evan Galpin
esting findings or insights. Thanks, Evan On Tue, Sep 13, 2022 at 11:38 AM John Casey via user wrote: > In principle yes, but I don't see any Beam level code to handle that. I'm > a bit surprised it isn't handled in the Kafka producer layer itself. > > On Tue, S

Re: [troubleshooting] KafkaIO#write gets stuck "since the associated topicId changed from null to "

2022-09-13 Thread Evan Galpin
ache/kafka/clients/NetworkClient.java#L937 On Tue, Sep 13, 2022 at 12:17 PM Alexey Romanenko wrote: > Do you have by any chance the full stacktrace of this error? > > — > Alexey > > On 13 Sep 2022, at 18:05, Evan Galpin wrote: > > Ya likewise, I'd expect this to be handl

Re: [troubleshooting] KafkaIO#write gets stuck "since the associated topicId changed from null to "

2022-09-16 Thread Evan Galpin
size to where there are no longer OOMKills has removed any kafka "issue". Thanks all for your time and willingness to help. Evan On Tue, Sep 13, 2022 at 12:33 PM Evan Galpin wrote: > There's a few related log lines, but there isn't a full stacktrace as the > info originat

Re: [Question] Beam 2.42.0 Release Date Confirmation

2022-09-28 Thread Evan Galpin
Hi there :-) You can follow the vote on the first release candidate for 2.42.0 here: https://lists.apache.org/thread/ho2mvvgc253my1ovqmrxjpql8gvz0285 Thanks, Evan On Tue, Sep 27, 2022 at 7:16 AM Varun Chopra via user wrote: > Classification: Public > > > > Hi Team, > > > > We at Deutsche Bank

Re: Help on Apache Beam Pipeline Optimization

2022-10-11 Thread Evan Galpin
If I’m not mistaken, you could create a PCollection from the PubSub read operation, and then apply 3 different windowing strategies in different “chains” of the graph. Ex: PCollection msgs = PubsubIO.read(…); msgs.apply(Window.into(FixedWindows.of(1 min))).apply(allMyTransforms); msgs.apply(Window.i
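
The branching idea can be illustrated without Beam: the same input is windowed independently several times, and each branch sees every element. This is a plain-Python sketch under assumptions, not Beam's windowing API; `fixed_window`, `window_into`, and the use of integer second timestamps are all choices made for the example.

```python
def fixed_window(timestamp, size_seconds):
    """Returns the [start, end) fixed window that contains `timestamp`."""
    start = timestamp - (timestamp % size_seconds)
    return (start, start + size_seconds)


def window_into(elements, size_seconds):
    """Assigns (timestamp, value) pairs to fixed windows of the given size."""
    windows = {}
    for timestamp, value in elements:
        windows.setdefault(fixed_window(timestamp, size_seconds), []).append(value)
    return windows


# One source collection feeding two independent "chains" with different windowing.
msgs = [(5, "a"), (65, "b"), (130, "c")]
one_minute = window_into(msgs, 60)
five_minutes = window_into(msgs, 300)
```

Because each branch re-windows the original `msgs`, the one-minute grouping has no effect on the five-minute grouping, which matches the intent of applying several windowing strategies to a single read.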

[Question][Dataflow][Java][pubsub] Streaming Pipeline Stall Scenarios

2022-11-03 Thread Evan Galpin
Hi folks, Hoping to get some definitive answers with respect to streaming pipeline bundle retry semantics on Dataflow. I understand that a bundle containing a "poison pill" (bad data, let's say it causes a null pointer exception when processing in DoFn) will be retried indefinitely. What I'm not

[KafkaIO] Use of sinkGroupId with Exactly Once Semantics

2022-11-10 Thread Evan Galpin
Hey folks, I can see in the docs for "withEOS"[1] in the KafkaIO#Write section that the sinkGroupId is recommended to be unique per job. I'm wondering about a case where a single job outputs to multiple topics. Would it be advisable to have a unique sinkGroupId per instance of KafkaIO#Write tran

Re: [KafkaIO] Use of sinkGroupId with Exactly Once Semantics

2022-11-14 Thread Evan Galpin
> >> Kafka Group Ids are essentially used to track where a logical (aka >> application level, not thread/machine level) producer / consumer is at. As >> such, I think it would be fine to use just one group id, even when writing >> to multiple topics >> >> On Th

Re: [Question][Dataflow][Java][pubsub] Streaming Pipeline Stall Scenarios

2022-11-17 Thread Evan Galpin
friendly bump 🙃 Anyone have thoughts or answers? Thanks! On Thu, Nov 3, 2022 at 3:07 PM Evan Galpin wrote: > Hi folks, > > Hoping to get some definitive answers with respect to streaming pipeline > bundle retry semantics on Dataflow. I understand that a bundle containing >

Re: [EXTERNAL] Re: ElasticsearchIO write to trigger a percolator

2023-02-02 Thread Evan Galpin
I believe that 2.41.0 is the "oldest" safe version[1] to use as there were initially some bugs introduced when migrating from PDone to outputting the write results. [1] https://github.com/apache/beam/commit/2cb2ee2ba3b5efb0f08880a9325f092485b3ccf2 On Thu, Feb 2, 2023 at 3:16 PM Kaggal, Vinod C. (

Re: Q: Apache Beam IOElasticsearchIO.read() method (Java), which expects a PBegin input and a means to handle a collection of queries

2023-04-19 Thread Evan Galpin
Yes, unfortunately the ES IO connector is not built in a way that can take inputs from a PCollection to issue Reads. The most scalable way to support this would be to revisit the implementation of the Elasticsearch Read transform and instead implement it as a SplittableDoFn. On Wed, Apr 19, 2023 at

Re: Q: Apache Beam IOElasticsearchIO.read() method (Java), which expects a PBegin input and a means to handle a collection of queries

2023-04-20 Thread Evan Galpin
For more info on Splittable DoFn, there is a good resource on the Beam blog[1]. Alexey has also shown a great alternative! [1] https://beam.apache.org/blog/splittable-do-fn/ On Thu, Apr 20, 2023 at 9:08 AM Alexey Romanenko wrote: > Some Java IO-connectors implement a class something like "class

[java] Trouble with gradle and using ParquetIO

2023-04-20 Thread Evan Galpin
Hi all, I'm trying to make use of ParquetIO. Based on what's documented in maven central, I'm including the artifact in "compileOnly" mode (or in maven parlance, 'provided' scope). I can successfully compile my pipeline, but when I run it I (intuitively?) am met with a ClassNotFound exception fo

Re: [java] Trouble with gradle and using ParquetIO

2023-04-21 Thread Evan Galpin
21, 2023 at 2:34 AM Alexey Romanenko wrote: > Just curious. where it was documented like this? > > I briefly checked it on Maven Central [1] and the provided code snippet > for Gradle uses “implementation” scope. > > — > Alexey > > [1] > https://search.maven.org/artifact

Re: [EXTERNAL] Re: Q: Apache Beam IOElasticsearchIO.read() method (Java), which expects a PBegin input and a means to handle a collection of queries

2023-04-24 Thread Evan Galpin
AM Murphy, Sean P. wrote: > Any other thoughts? I’ve run out of ideas. Thanks, ~Sean > > > > *From: *Murphy, Sean P. > *Date: *Friday, April 21, 2023 at 11:00 AM > *To: *Alexey Romanenko , Evan Galpin < > egal...@apache.org> > *Cc: *Anthony Samy, Charl

Re: [EXTERNAL] Re: Re: Q: Apache Beam IOElasticsearchIO.read() method (Java), which expects a PBegin input and a means to handle a collection of queries

2023-04-24 Thread Evan Galpin
approach you provided below; couldn’t the esQueryResults still be > determined at runtime? > > > > ~Sean > > > > *From: *Evan Galpin > > > *Date: *Monday, April 24, 2023 at 2:18 PM > *To: *user > *Cc: *Anthony Samy, Charles , Murphy, Sean >

Re: [java] Trouble with gradle and using ParquetIO

2023-04-25 Thread Evan Galpin
ementation "org.apache.hadoop:hadoop-common:3.2.4" On Fri, Apr 21, 2023 at 10:38 AM Evan Galpin wrote: > Oops, I was looking at the "bootleg" mvnrepository search engine, which > shows `compileOnly` in the copy-pastable dependency installation > prompts[1]. When I received the "ClassNo

[Dataflow][Stateful] Bypass Dataflow Overrides?

2023-05-25 Thread Evan Galpin
Hi all, I'm running into a scenario where I feel that Dataflow Overrides (specifically BatchStatefulParDoOverrides.GbkBeforeStatefulParDo ) are unnecessarily causing a batch pipeline to "pause" throughput since a GBK needs to have processed all the data in a window before it can output. Is it str

Re: [Dataflow][Stateful] Bypass Dataflow Overrides?

2023-05-25 Thread Evan Galpin
il used to send all > elements with the same key to the same worker (so that they can share > state, which is itself partitioned by worker). This does cause a global > barrier in batch pipelines. > > On Thu, May 25, 2023 at 2:15 PM Evan Galpin wrote: > >> Hi all, >> >>

[Dataflow][Java][2.52.0] Upgrading to 2.52.0 Surfaces Pubsub Coder Error

2023-12-12 Thread Evan Galpin
Hi all, When attempting to upgrade a running Dataflow pipeline from SDK 2.51.0 to 2.52.0, an incompatibility warning is surfaced that prevents pipeline upgrade: > The Coder or type for step .../PubsubUnboundedSource has changed Was there an intentional coder change introduced for PubsubMessage

Re: [Dataflow][Java][2.52.0] Upgrading to 2.52.0 Surfaces Pubsub Coder Error

2023-12-12 Thread Evan Galpin
r wrote: > Are you setting the enable_custom_pubsub_source experiment by any chance? > > On Tue, Dec 12, 2023 at 3:24 PM Evan Galpin wrote: > >> Hi all, >> >> When attempting to upgrade a running Dataflow pipeline from SDK 2.51.0 to >> 2.52.0, an incompatibili

Re: [Dataflow][Java][2.52.0] Upgrading to 2.52.0 Surfaces Pubsub Coder Error

2023-12-14 Thread Evan Galpin
The pipeline in question is using Dataflow v1 Runner (Runner v2: Disabled) in case that's an important detail. On Tue, Dec 12, 2023 at 4:22 PM Evan Galpin wrote: > I've checked the source code and deployment command for cases of setting > experiments. I don't see "e

Re: Environmental variables not accessible in Dataflow pipeline

2023-12-22 Thread Evan Galpin
I assume from the previous messages that GCP Dataflow is being used as the pipeline runner. Even without Flex Templates, the v2 runner can use docker containers to install all dependencies from various sources[1]. I have used docker containers to solve the same problem you mention: installing a p