Re: [PROPOSAL] Preparing for Beam release 2.40.0

2022-06-15 Thread Vincent Marquez
Hello.  I have a fix for https://github.com/apache/beam/issues/21715 here:

https://github.com/apache/beam/pull/21786

Kenn mentioned that someone might want to get a fix in for it for the next
release.  It has not been reviewed or given feedback yet, but I thought I
would let everyone know.

*~Vincent*


On Wed, Jun 15, 2022 at 5:03 PM Pablo Estrada  wrote:

> Hello all!
> I've cut the release branch:
> https://github.com/apache/beam/tree/release-2.40.0
>
> If you want your changes in the branch please let me know, as we'll have
> to cherry pick.
>
> Changes I'm aware of for cherry-picks:
> - https://github.com/apache/beam/pull/21898
> - https://github.com/apache/beam/pull/21902
> - Resolution for https://github.com/apache/beam/issues/21690
>
> On Wed, Jun 15, 2022 at 12:05 PM Yiru Tang  wrote:
>
>> Yes, I think so.
>>
>> On Wed, Jun 15, 2022 at 11:12 AM Reuven Lax  wrote:
>>
>>> +Yiru Tang  You've already fixed 21625, correct?
>>>
>>> On Tue, Jun 14, 2022 at 5:19 PM Ahmet Altay  wrote:
>>>
 Thank you Kenn.

 To bring visibility, there are 2 blockers:
 https://github.com/apache/beam/issues/21625 - assigned to Reuven
 https://github.com/apache/beam/issues/21690 - assigned to Evan

 Reuven, Evan, could you please comment on the open issues? Should they
 be blocking the 2.40.0 release?

 Thank you!
 Ahmet

 On Tue, Jun 14, 2022 at 7:35 AM Kenneth Knowles 
 wrote:

> I did a pass this morning. I believe there is only one release blocker
> that doesn't already have a fix. If I closed your issue or moved it off 
> the
> milestone, feel free to have a different opinion and revert my action.
>
> Kenn
>
> On Mon, Jun 13, 2022 at 5:04 PM Ahmet Altay  wrote:
>
>>
>>
>> On Tue, Jun 7, 2022 at 7:17 PM Ahmet Altay  wrote:
>>
>>> I apologize for digressing the release thread.  To bring it back,
>>> please help Pablo with the release blockers (
>>> https://github.com/apache/beam/milestone/2) :)
>>>
>>
>> @Pablo Estrada  - there are still 10 blockers on
>> that list. Could we move the non-critical items to the next release? Do 
>> you
>> need any help?
>>
>>
>>>
>>> On Tue, Jun 7, 2022 at 6:47 PM Danny McCormick <
>>> dannymccorm...@google.com> wrote:
>>>
 > A question related to priorities (P0, P1, etc.) as labels. Does
 it mean that not all issues will have priority unless we explicitly set
 these labels?

 Technically yes, but this is a required field on all issue
 templates (for example, our bug template
 ).
 Our automation will automatically apply a label based on the response 
 to
 that field.

>>>
>>> Nice. Thank you.
>>>
>>>

 On Tue, Jun 7, 2022 at 7:46 PM Ahmet Altay 
 wrote:

>
>
> On Tue, Jun 7, 2022 at 10:19 AM Danny McCormick <
> dannymccorm...@google.com> wrote:
>
>> I'm good with that, that's consistent with the previous doc
>> behavior which pointed to the fix version page (e.g.
>> https://issues.apache.org/jira/projects/BEAM/versions/12351171).
>> I closed my pr since that approach is consistent with the current 
>> state of
>> the docs, if we decide we just want the release manager to look at 
>> P1s we
>> can reopen.
>>
>
> That makes sense. Release managers usually move most P0 and P1
> issues out of the blockers lists for good reasons and the two lists 
> tend to
> look very similar closer to the release.
>
> A question related to priorities (P0, P1, etc.) as labels. Does it
> mean that not all issues will have priority unless we explicitly set 
> these
> labels?
>
>
>> Thanks,
>> Danny
>>
>> On Tue, Jun 7, 2022 at 12:30 PM Chamikara Jayalath <
>> chamik...@google.com> wrote:
>>
>>>
>>>
>>> On Tue, Jun 7, 2022 at 7:33 AM Danny McCormick <
>>> dannymccorm...@google.com> wrote:
>>>
 If we want to filter to P0/P1 issues, we can do that with this
 query - https://github.com/apache/beam/issues?q=milestone:"2.40.0
 Release" label:P1,P0 is:open is:issue - I'll update the
 release guide to point to that url instead of just the milestones 
 page. PR
 to do this here - https://github.com/apache/beam/pull/21732

>>>
>>> Nit: I think we'd want all open issues tagged for that release.
>>> Release manager can decide to bump P2s to the next release or raise 
>>> the
>>> priority.
>>>

Beam Summit is around the corner!

2022-06-15 Thread Mara Ruvalcaba


The 2022 edition of Beam Summit will take place from July 18th to 20th,
2022. It will be a hybrid event with an onsite component in Austin, TX,
and all sessions will also be streamed (except workshops, which will be
onsite only). If you feel like taking some time to get to know the city,
there are a lot of things to do!



You will:

 * Hear from leading organizations on how they use Apache Beam for
   solving complex data processing scenarios.
 * Get a clear understanding of Apache Beam's role in the data
   processing lifecycle and if it is a good fit for your organization.
 * Understand deep technical details of how Apache Beam works and how
   to apply advanced techniques.
 * Improve your Beam skills through in-depth workshops, which are only
   available to on-site attendees!
 * Interact with key developers of Apache Beam as well as with other
   users and contributors from all around the world.
 * Be part of a fun & inclusive experience!



Pipeline Doctor

Do you have questions about your Beam pipeline? Bring any concerns to the
experts and they will try to find the cure. You are welcome to share any
improvement ideas you might have!


If you would like to attend in person but cannot afford a ticket, please
apply for a scholarship.

Register now!

Here is a sample of some sessions that we will have. If you are hungry
for more, check out the entire program here.



 How the sausage gets made: Dataflow under the covers
 


 The team that builds the Dataflow runner likes to say it “just
 works”. But how does it really work? How does it decide to
 autoscale? What happens to failures? Where is state data stored?
 Can I have infinite state data? What goes into Dataflow’s latency?
 What’s the checkpoint logic?





 Powering Real-time Data at Intuit: A Look at Golden Signals powered by
 Beam
 

We will discuss how we’ve built an internal, self-serve stream 
processing platform with Apache Beam SDK, running on Kubernetes. 
Specifically, we will highlight our stream processing platform’s 
benefits and tech stack, where Apache Beam serves as a critical
technology in how we do real-time data processing at Intuit.





Re: Fun with WebAssembly transforms

2022-06-15 Thread Robert Burke
Obligatory mention that WASM is basically an architecture that any
well-meaning compiler can target, e.g. the Go compiler:

https://www.bradcypert.com/an-introduction-to-targeting-web-assembly-with-golang/

(Among many articles for the last few years)

Robert Burke
Beam Go Busybody

On Wed, Jun 15, 2022, 2:04 PM Sean Jensen-Grey 
wrote:

> Heh, my stage fright was so strong, I didn't realize that the talk was
> recorded. :)
>
> Steven, I'd love to chat about Wasm in Beam. This email is a bit rough.
>
> I haven't explored Wasm in Beam much since that talk. I think the most
> compelling use is in the portability of logic between data processing
> systems. Esp in the use of probabilistic data structures like Bloom
> Filters, Count-Min-Sketch, HyperLogLog, where it is nice to persist the
> data structure and use it on a different system. Like generating a bloom
> filter in Beam and using it inside of a BQ query w/o having to reimplement
> and test across many platforms.
>
> I have used Wasm in BQ, as BQ UDFs are driven by V8. Anywhere V8 exists,
> Wasm support exists for free unless the embedder goes out of their way to
> disable it. So it is supported in Deno/Node as well. In Python, Wasm
> support via Wasmtime  is
> really good.  There are *many* options for execution environments. One of
> the downsides of passing through a JS engine is string and number
> support (float/int64) issues, afaik. I could be wrong, maybe JS has fixed
> all this by now.
>
> The qualities in order of importance (for me) are
>
>1. Portability, run the same code everywhere
>2. Security, memory safety for the caller. Running Wasm inside of
>Python should never crash your Python interpreter. The capability model
>ensures that the Wasm module can only do what you allow it to
>3. Performance (portable), compile once and run everywhere within some
>margin of native.  Python makes this look good :)
>
> I think something worth exploring is moving opaque-ish Arrow objects
> around via Beam, so that Beam is now mostly in the control plane and
> computation happens in Wasm; this should reduce the serialization overhead
> and also get Python out of the datapath.
>
> I see someone exploring Wasm+Arrow here,
> https://github.com/domoritz/arrow-wasm
>
> Another possibly interesting avenue to explore is compiling command line
> programs to Wasi (WebAssembly System Interface), the POSIX like shim, so
> that they can be run inprocess without the fork/exec/pipe overhead of
> running a subprocess. A neat demo might be running something like Jq
>  inside of a Beam job.
>
> Not to make Wasm sound like a Python only technology, it can be used via
> Java/JVM via
>
>- https://www.graalvm.org/22.1/reference-manual/wasm/
>- https://github.com/kawamuray/wasmtime-java
>
> Sean
>
>
>
> On Wed, Jun 15, 2022 at 9:35 AM Pablo Estrada  wrote:
>
>> adding Steven in case he didn't get the replies : )
>>
>> On Wed, Jun 15, 2022 at 9:29 AM Daniel Collins 
>> wrote:
>>
>>> If we ever do anything with the JS runtime, this would seem to be the
>>> best place to run WASM.
>>>
>>> On Tue, Jun 14, 2022 at 8:13 PM Brian Hulette 
>>> wrote:
>>>
 FYI: @Sean Jensen-Grey  gave a talk back in
 2020 where he had integrated Rust with the Python SDK. I thought he used
 WebAssembly for that, but it looks like he used some other approaches, and
 his talk mentioned WebAssembly as future work. Not sure if that was ever
 explored.

 https://www.youtube.com/watch?v=fZK_Tiu7q1o
 https://github.com/seanjensengrey/beam-rust-python-java

 Brian


 On Tue, Jun 14, 2022 at 5:05 PM Ahmet Altay  wrote:

> Adding @Lukasz Cwik  - he was interested in the
> WebAssembly topic.
>
> On Tue, Jun 14, 2022 at 3:09 PM Pablo Estrada 
> wrote:
>
>> Would you open a pull request for it? Or at least share a branch? : )
>> Even if we don't want to merge it, it would be great to have a PR as
>> a way to showcase the work, its usefulness, and receive comments on this
>> thread once we can see something more specific.
>>
>> On Tue, Jun 14, 2022 at 3:05 PM Steven van Rossum <
>> sjvanros...@google.com> wrote:
>>
>>> Hi folks,
>>>
>>> I had some spare time yesterday and thought it'd be fun to implement
>>> a transform which runs WebAssembly modules as a lightweight way to
>>> implement cross language transforms for languages which don't (yet) 
>>> have a
>>> SDK implementation.
>>>
>>> I've got a small proof of concept running in the Python SDK as a
>>> DoFn with Wasmer as the WebAssembly runtime and simple support for
>>> marshalling between the host and guest environment with the RowCoder. 
>>> The
>>> module I've constructed is mostly useless, but demonstrates the host
>>> copying the encoded element into the guest's memory, the guest copying

Re: Questions regarding contribution: Support for reading Kafka topics from any startReadTime in Java

2022-06-15 Thread Balázs Németh
Well, a sanity check to only allow startReadTime in the past seemed obvious
to me. I can't really think of any common use-case where you want to start
processing after a specific time in the future and idle until then, but the
other direction has multiple.

The question regarding this IMO is how much of a "safety delay" from "now"
- if any - we have to keep if we were unable to find any newer offset. I
mean, a message older than the startReadTime might still be in processing
and might not be available at the kafka cluster yet when we fetch the
offset. Either we skip any such check and let the developers worry about
this, or we need a filter that drops messages prior to the timestamp.

Daniel Collins  ezt írta (időpont: 2022. jún. 16.,
Cs, 0:19):

> This may or may not be reasonable. Let's assume I pick startReadTime =
> now + 1 minute. Your logic will start returning a lot of records with
> values == now + 1ms. Is that reasonable for users of this API? Maybe, maybe
> not.
>
> -Daniel
>
> On Wed, Jun 15, 2022 at 6:16 PM Balázs Németh 
> wrote:
>
>> Not a single objection means my idea seems okay then? :)
>>
>> Balázs Németh  ezt írta (időpont: 2022. máj. 26.,
>> Cs, 5:59):
>>
>>> https://issues.apache.org/jira/browse/BEAM-14518
>>>
>>>
>>> https://github.com/apache/beam/blob/fd8546355523f67eaddc22249606fdb982fe4938/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ConsumerSpEL.java#L180-L198
>>>
>>> Right now the 'startReadTime' config for KafkaIO.Read looks up an offset
>>> in every topic partition that is newer or equal to that timestamp. The
>>> problem is that if we use a timestamp so new that we don't have any
>>> newer/equal message in the partition, the code fails with an
>>> exception. In certain cases that makes no sense, as we could actually
>>> make it work.
>>>
>>> If we don't get an offset from calling `consumer.offsetsForTimes`, we
>>> should call `endOffsets`, and use the returned offset + 1. That is actually
>>> the offset we will have to read next time.
>>>
>>> Even if `endOffsets` can't return an offset we could use 0 as the offset
>>> to read from.
>>>
>>>
>>>
>>> Am I missing something here? Is it okay to contribute this?
>>>
>>
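The fallback Balázs describes is small enough to sketch in plain Java. The sketch below is an illustrative stand-in, not KafkaIO's actual code: the parameters model the results of Kafka's offsetsForTimes/endOffsets lookups, with null meaning "no offset found", and it mirrors the "+ 1" fallback exactly as proposed in the quoted message.

```java
// Illustrative model of the proposed startReadTime fallback; names and
// signature are hypothetical, not Beam's or Kafka's actual API.
public class StartOffsetFallback {
    /**
     * @param offsetForTime result of an offsetsForTimes-style lookup, or
     *                      null if no message newer/equal to the timestamp exists
     * @param endOffset     result of an endOffsets-style lookup, or null if unknown
     * @return the offset to start reading from
     */
    static long resolveStartOffset(Long offsetForTime, Long endOffset) {
        if (offsetForTime != null) {
            return offsetForTime;   // normal case: a newer/equal message exists
        }
        if (endOffset != null) {
            return endOffset + 1;   // proposal: the offset we will read next time
        }
        return 0L;                  // nothing known: read from the beginning
    }

    public static void main(String[] args) {
        System.out.println(resolveStartOffset(null, 100L)); // prints 101
    }
}
```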



Re: Chained Job Graph Apache Beam | Dataflow

2022-06-15 Thread Evan Galpin
It may also be helpful to explore CoGroupByKey as a way of joining data,
though depending on the shape of the data doing so may not fit in mem:
https://beam.apache.org/documentation/transforms/java/aggregation/cogroupbykey/
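As a reference for the CoGroupByKey approach, a join of two keyed inputs looks roughly like the sketch below. The variable names aRows/bRows are hypothetical inputs of type PCollection<KV<String, TableRow>>; the tuple-tag pattern follows the Beam Java API.

```java
// Hypothetical tags identifying the two join sides.
final TupleTag<TableRow> aTag = new TupleTag<>();
final TupleTag<TableRow> bTag = new TupleTag<>();

// aRows, bRows: PCollection<KV<String, TableRow>> keyed by the join key.
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(aTag, aRows)
        .and(bTag, bRows)
        .apply(CoGroupByKey.create());

// Per key, the CoGbkResult exposes the matching rows from each side via
// result.getAll(aTag) and result.getAll(bTag).
```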

- Evan

On Wed, Jun 15, 2022 at 3:45 PM Bruno Volpato 
wrote:

> Hello,
>
> I am not sure what is the context behind your join, but I just wanted to
> point out that Beam SQL [1] or the Join-library extension [2] may be
> helpful in your scenario to avoid changing semantics or the need to
> orchestrate your jobs outside Beam.
>
> [1] https://beam.apache.org/documentation/dsls/sql/extensions/joins/
> [2] https://beam.apache.org/documentation/sdks/java-extensions/
>
>
> Best,
> Bruno
>
>
>
> On Wed, Jun 15, 2022 at 3:35 PM Jack McCluskey 
> wrote:
>
>> Hey Ravi,
>>
>> The problem you're running into is that the act of writing data to a
>> table and reading from it are not joined actions in the Beam model. There's
>> no connecting PCollection tying those together, so they are split and run
>> in parallel. If you want to do this and need the data written to C, you
>> should re-use the PCollection written to C in your filtering step instead
>> of reading the data from C again. That should produce the graph you're
>> looking for in a batch context.
>>
>> Thanks,
>>
>> Jack McCluskey
>>
>> On Wed, Jun 15, 2022 at 3:30 PM Ravi Kapoor 
>> wrote:
>>
>>> FYI
>>>
>>> On Thu, Jun 16, 2022 at 12:56 AM Ravi Kapoor 
>>> wrote:
>>>
 Hi Daniel,

 I have a use case where I join two tables say A and B and write the
 joined Collection to C.
 Then I would like to filter some records on C and put it to another
 table say D.
 So, the pipeline on Dataflow UI should look like this right?

 A
\
 C -> D
/
 B

 However, the pipeline is writing C -> D in parallel.
 How can this pipeline run in parallel as data has not been pushed yet
 to C by the previous pipeline?

 Even when I ran this pipeline, Table D did not get any records inserted
 as well which is apparent.
 Can you help me with this use case?

 Thanks,
 Ravi



 On Tue, Jun 14, 2022 at 9:01 PM Daniel Collins 
 wrote:

> Can you speak to what specifically you want to be different? The job
> graph you see, with the A -> B and B-> C being separate is an accurate
> reflection of your pipeline. table_B is outside of the beam model, by
> pushing your data there, Dataflow has no ability to identify that no
> manipulation of data is happening at table_B.
>
> If you want to just process data from A to destinations D and E, while
> writing an intermediate output to table_B, you should just remove the read
> from table B and modify table_A_records again directly. If that is not 
> what
> you want, you would need to explain more specifically what you want that 
> is
> different. Is it a pure UI change? Is it a functional change?
>
> -Daniel
>
> On Tue, Jun 14, 2022 at 11:12 AM Ravi Kapoor 
> wrote:
>
>> Team,
>> Any update on this?
>>
>> On Mon, Jun 13, 2022 at 8:39 PM Ravi Kapoor 
>> wrote:
>>
>>> Hi Team,
>>>
>>> I am currently using Beam in my project with Dataflow Runner.
>>> I am trying to create a pipeline where the data flows from the
>>> source to staging then to target such as:
>>>
>>> A (Source) -> B(Staging) -> C (Target)
>>>
>>> When I create a pipeline as below:
>>>
>>> PCollection<TableRow> table_A_records = 
>>> p.apply(BigQueryIO.readTableRows()
>>> .from("project:dataset.table_A"));
>>>
>>> table_A_records.apply(BigQueryIO.writeTableRows().
>>> to("project:dataset.table_B")
>>> 
>>> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>>> 
>>> .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
>>>
>>> PCollection<TableRow> table_B_records = 
>>> p.apply(BigQueryIO.readTableRows()
>>> .from("project:dataset.table_B"));
>>> table_B_records.apply(BigQueryIO.writeTableRows().
>>> to("project:dataset.table_C")
>>> 
>>> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>>> 
>>> .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
>>> p.run().waitUntilFinish();
>>>
>>>
>>> It basically creates two parallel job graphs in Dataflow instead of
>>> creating a transformation as expected:
>>> A -> B
>>> B -> C
>>> I needed to create data pipeline which flows the data in chain like:
>>>  D
>>>/
>>> A -> B -> C
>>>   \
>>> E
>>> Is there a way to achieve this transformation in between source and
>>> target tables?
>>>
>>> Thanks,
>>> Ravi
>>>
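Following Jack's advice above, the fix is to reuse the PCollection that was written to C instead of issuing a second read. A rough sketch (table names from the thread; the filter predicate is a hypothetical placeholder, and disposition options are elided):

```java
PCollection<TableRow> joined = ...; // result of joining A and B

// Write the joined rows to the staging table C.
joined.apply(BigQueryIO.writeTableRows()
    .to("project:dataset.table_C"));

// Reuse the same PCollection (rather than re-reading table C), so the
// runner sees the dependency and builds A,B -> C -> D as one connected graph.
joined
    .apply(Filter.by(row -> /* predicate selecting rows for D */ true))
    .apply(BigQueryIO.writeTableRows()
        .to("project:dataset.table_D"));
```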

Re: Clean Up GitHub Labels

2022-06-15 Thread Aizhamal Nurmamat kyzy
Thank you, Danny!

On Wed, Jun 15, 2022 at 8:31 AM Danny McCormick 
wrote:

> Given the general consensus here, I put up a PR to enforce this for new
> issues here - https://github.com/apache/beam/pull/21888.
>
> Once that's in, I'll run a script to update all the existing labels to the
> new preferred scheme and we can delete the old labels that we don't need
> anymore.
>
> Thanks,
> Danny
>
> On Tue, Jun 14, 2022 at 4:53 PM Kenneth Knowles  wrote:
>
>> +1 sounds good to me
>>
>> One thing I did a lot of when triaging Jiras was moving them from one
>> component to another, after which people who cared about those components
>> would go through them. Making the labels more straightforward for users
>> would streamline that.
>>
>> Kenn
>>
>> On Sun, Jun 12, 2022 at 9:04 PM Chamikara Jayalath 
>> wrote:
>>
>>> +1 for this in general. Also, as noted in the proposal, decomposing
>>> labels should be done on a case by case basis since in some cases that
>>> might result in creating labels that do not have proper context.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Fri, Jun 10, 2022 at 8:35 AM Robert Burke  wrote:
>>>
 +1. I like this consolidation proposal, but I also like thinking
 through conjunctions. :)

 On Fri, Jun 10, 2022, 6:42 AM Danny McCormick <
 dannymccorm...@google.com> wrote:

> Hey everyone,
>
> After migrating over from Jira, our labels are somewhat messy and not
> as helpful as they could be. Specifically, there are 2 sets of problems:
>
>
> 1. There is significant overlap between the labels imported from Jira
> and the labels we already had in GitHub for our PRs. For example, there 
> was
> already a “Go” GitHub label, and as part of the migration we imported a
> “sdk-go” label.
>
>
> 2. Because GitHub doesn’t provide an OR syntax in its searching, it is
> much harder to search for things like “all io labels” because the io 
> issues
> are sharded across a number of io tags (e.g. io-java-aws,
> io-java-amqp, io-py-avro, etc…). This applies to other areas like runner
> issues, portability issues, and issues by language as well.
>
> I put together a quick 1 page proposal on how we can remove the label
> duplication and make searching easier by decomposing our labels into their
> smallest components. Please let me know if you have any thoughts or
> suggestions!
> https://docs.google.com/document/d/14S5coM_vfRrwygoQ9_NClJWmY5s30_L_J5yCurLW-XU/edit?usp=sharing
>
> Thanks,
> Danny
>



Re: Chained Job Graph Apache Beam | Dataflow

2022-06-15 Thread Bruno Volpato
Hello,

I am not sure what is the context behind your join, but I just wanted to
point out that Beam SQL [1] or the Join-library extension [2] may be
helpful in your scenario to avoid changing semantics or the need to
orchestrate your jobs outside Beam.

[1] https://beam.apache.org/documentation/dsls/sql/extensions/joins/
[2] https://beam.apache.org/documentation/sdks/java-extensions/


Best,
Bruno



On Wed, Jun 15, 2022 at 3:35 PM Jack McCluskey 
wrote:

> Hey Ravi,
>
> The problem you're running into is that the act of writing data to a table
> and reading from it are not joined actions in the Beam model. There's no
> connecting PCollection tying those together, so they are split and run in
> parallel. If you want to do this and need the data written to C, you should
> re-use the PCollection written to C in your filtering step instead of
> reading the data from C again. That should produce the graph you're looking
> for in a batch context.
>
> Thanks,
>
> Jack McCluskey
>
> On Wed, Jun 15, 2022 at 3:30 PM Ravi Kapoor 
> wrote:
>
>> FYI
>>
>> On Thu, Jun 16, 2022 at 12:56 AM Ravi Kapoor 
>> wrote:
>>
>>> Hi Daniel,
>>>
>>> I have a use case where I join two tables say A and B and write the
>>> joined Collection to C.
>>> Then I would like to filter some records on C and put it to another
>>> table say D.
>>> So, the pipeline on Dataflow UI should look like this right?
>>>
>>> A
>>>\
>>> C -> D
>>>/
>>> B
>>>
>>> However, the pipeline is writing C -> D in parallel.
>>> How can this pipeline run in parallel as data has not been pushed yet to
>>> C by the previous pipeline?
>>>
>>> Even when I ran this pipeline, Table D did not get any records inserted
>>> as well which is apparent.
>>> Can you help me with this use case?
>>>
>>> Thanks,
>>> Ravi
>>>
>>>
>>>
>>> On Tue, Jun 14, 2022 at 9:01 PM Daniel Collins 
>>> wrote:
>>>
 Can you speak to what specifically you want to be different? The job
 graph you see, with the A -> B and B-> C being separate is an accurate
 reflection of your pipeline. table_B is outside of the beam model, by
 pushing your data there, Dataflow has no ability to identify that no
 manipulation of data is happening at table_B.

 If you want to just process data from A to destinations D and E, while
 writing an intermediate output to table_B, you should just remove the read
 from table B and modify table_A_records again directly. If that is not what
 you want, you would need to explain more specifically what you want that is
 different. Is it a pure UI change? Is it a functional change?

 -Daniel

 On Tue, Jun 14, 2022 at 11:12 AM Ravi Kapoor 
 wrote:

> Team,
> Any update on this?
>
> On Mon, Jun 13, 2022 at 8:39 PM Ravi Kapoor 
> wrote:
>
>> Hi Team,
>>
>> I am currently using Beam in my project with Dataflow Runner.
>> I am trying to create a pipeline where the data flows from the
>> source to staging then to target such as:
>>
>> A (Source) -> B(Staging) -> C (Target)
>>
>> When I create a pipeline as below:
>>
>> PCollection table_A_records = 
>> p.apply(BigQueryIO.readTableRows()
>> .from("project:dataset.table_A"));
>>
>> table_A_records.apply(BigQueryIO.writeTableRows().
>> to("project:dataset.table_B")
>> 
>> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>> 
>> .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
>>
>> PCollection table_B_records = 
>> p.apply(BigQueryIO.readTableRows()
>> .from("project:dataset.table_B"));
>> table_B_records.apply(BigQueryIO.writeTableRows().
>> to("project:dataset.table_C")
>> 
>> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>> 
>> .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
>> p.run().waitUntilFinish();
>>
>>
>> It basically creates two parallel job graphs in dataflow instead
>> creating a transformation as expected:
>> A -> B
>> B -> C
>> I needed to create data pipeline which flows the data in chain like:
>>  D
>>/
>> A -> B -> C
>>   \
>> E
>> Is there a way to achieve this transformation in between source and
>> target tables?
>>
>> Thanks,
>> Ravi
>>
>
>
> --
> Thanks,
> Ravi Kapoor
> +91-9818764564 <+91%2098187%2064564>
> kapoorrav...@gmail.com
>

>>>
>>> --
>>> Thanks,
>>> Ravi Kapoor
>>> +91-9818764564 <+91%2098187%2064564>
>>> kapoorrav...@gmail.com
>>>
>>
>>
>> --
>> Thanks,
>> Ravi Kapoor
>> +91-9818764564 <+91%2098187%2064564>
>> kapoorrav...@gmail.com
>>
>


Re: Chained Job Graph Apache Beam | Dataflow

2022-06-15 Thread Ravi Kapoor
FYI

On Thu, Jun 16, 2022 at 12:56 AM Ravi Kapoor  wrote:

> Hi Daniel,
>
> I have a use case where I join two tables say A and B and write the joined
> Collection to C.
> Then I would like to filter some records on C and put it to another table
> say D.
> So, the pipeline on Dataflow UI should look like this right?
>
> A
>\
> C -> D
>/
> B
>
> However, the pipeline is writing C -> D in parallel.
> How can this pipeline run in parallel as data has not been pushed yet to C
> by the previous pipeline?
>
> Even when I ran this pipeline, Table D did not get any records inserted as
> well which is apparent.
> Can you help me with this use case?
>
> Thanks,
> Ravi
>
>
>
> On Tue, Jun 14, 2022 at 9:01 PM Daniel Collins 
> wrote:
>
>> Can you speak to what specifically you want to be different? The job
>> graph you see, with the A -> B and B-> C being separate is an accurate
>> reflection of your pipeline. table_B is outside of the beam model, by
>> pushing your data there, Dataflow has no ability to identify that no
>> manipulation of data is happening at table_B.
>>
>> If you want to just process data from A to destinations D and E, while
>> writing an intermediate output to table_B, you should just remove the read
>> from table B and modify table_A_records again directly. If that is not what
>> you want, you would need to explain more specifically what you want that is
>> different. Is it a pure UI change? Is it a functional change?
>>
>> -Daniel
>>
>> On Tue, Jun 14, 2022 at 11:12 AM Ravi Kapoor 
>> wrote:
>>
>>> Team,
>>> Any update on this?
>>>
>>> On Mon, Jun 13, 2022 at 8:39 PM Ravi Kapoor 
>>> wrote:
>>>
 Hi Team,

 I am currently using Beam in my project with Dataflow Runner.
 I am trying to create a pipeline where the data flows from the
 source to staging then to target such as:

 A (Source) -> B(Staging) -> C (Target)

 When I create a pipeline as below:

 PCollection table_A_records = p.apply(BigQueryIO.readTableRows()
 .from("project:dataset.table_A"));

 table_A_records.apply(BigQueryIO.writeTableRows().
 to("project:dataset.table_B")
 
 .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
 
 .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

 PCollection table_B_records = p.apply(BigQueryIO.readTableRows()
 .from("project:dataset.table_B"));
 table_B_records.apply(BigQueryIO.writeTableRows().
 to("project:dataset.table_C")
 
 .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
 
 .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
 p.run().waitUntilFinish();


 It basically creates two parallel job graphs in dataflow instead
 creating a transformation as expected:
 A -> B
 B -> C
 I needed to create data pipeline which flows the data in chain like:
  D
/
 A -> B -> C
   \
 E
 Is there a way to achieve this transformation in between source and
 target tables?

 Thanks,
 Ravi

>>>
>>>
>>> --
>>> Thanks,
>>> Ravi Kapoor
>>> +91-9818764564 <+91%2098187%2064564>
>>> kapoorrav...@gmail.com
>>>
>>
>
> --
> Thanks,
> Ravi Kapoor
> +91-9818764564
> kapoorrav...@gmail.com
>


-- 
Thanks,
Ravi Kapoor
+91-9818764564
kapoorrav...@gmail.com


Re: Chained Job Graph Apache Beam | Dataflow

2022-06-15 Thread Ravi Kapoor
Hi Daniel,

I have a use case where I join two tables say A and B and write the joined
Collection to C.
Then I would like to filter some records on C and put it to another table
say D.
So, the pipeline on Dataflow UI should look like this right?

A
   \
C -> D
   /
B

However, the pipeline is writing C -> D in parallel.
How can this pipeline run in parallel as data has not been pushed yet to C
by the previous pipeline?

Even when I ran this pipeline, Table D did not get any records inserted as
well which is apparent.
Can you help me with this use case?

Thanks,
Ravi



On Tue, Jun 14, 2022 at 9:01 PM Daniel Collins  wrote:

> Can you speak to what specifically you want to be different? The job graph
> you see, with the A -> B and B-> C being separate is an accurate reflection
> of your pipeline. table_B is outside of the beam model, by pushing your
> data there, Dataflow has no ability to identify that no manipulation of
> data is happening at table_B.
>
> If you want to just process data from A to destinations D and E, while
> writing an intermediate output to table_B, you should just remove the read
> from table B and modify table_A_records again directly. If that is not what
> you want, you would need to explain more specifically what you want that is
> different. Is it a pure UI change? Is it a functional change?
>
> -Daniel
>
> On Tue, Jun 14, 2022 at 11:12 AM Ravi Kapoor 
> wrote:
>
>> Team,
>> Any update on this?
>>
>> On Mon, Jun 13, 2022 at 8:39 PM Ravi Kapoor 
>> wrote:
>>
>>> Hi Team,
>>>
>>> I am currently using Beam in my project with Dataflow Runner.
>>> I am trying to create a pipeline where the data flows from the source to
>>> staging then to target such as:
>>>
>>> A (Source) -> B(Staging) -> C (Target)
>>>
>>> When I create a pipeline as below:
>>>
>>> PCollection<TableRow> table_A_records =
>>>     p.apply(BigQueryIO.readTableRows().from("project:dataset.table_A"));
>>>
>>> table_A_records.apply(BigQueryIO.writeTableRows()
>>>     .to("project:dataset.table_B")
>>>     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>>>     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
>>>
>>> PCollection<TableRow> table_B_records =
>>>     p.apply(BigQueryIO.readTableRows().from("project:dataset.table_B"));
>>>
>>> table_B_records.apply(BigQueryIO.writeTableRows()
>>>     .to("project:dataset.table_C")
>>>     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>>>     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
>>>
>>> p.run().waitUntilFinish();
>>>
>>>
>>> It basically creates two parallel job graphs in Dataflow instead of
>>> creating the chained transformation I expected:
>>> A -> B
>>> B -> C
>>> I need to create a data pipeline that chains the data flow like:
>>>  D
>>>/
>>> A -> B -> C
>>>   \
>>> E
>>> Is there a way to achieve this transformation in between source and
>>> target tables?
>>>
>>> Thanks,
>>> Ravi
>>>
>>
>>
>> --
>> Thanks,
>> Ravi Kapoor
>> +91-9818764564
>> kapoorrav...@gmail.com
>>
>

-- 
Thanks,
Ravi Kapoor
+91-9818764564
kapoorrav...@gmail.com
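Daniel's suggestion in the quoted thread (drop the read of table_B and continue from the in-memory PCollection) can be sketched as pseudocode; the table names and dispositions are carried over from the original snippet, and the intermediate transform is a placeholder:

```
// Pseudocode sketch, not a tested pipeline.
PCollection<TableRow> tableARecords =
    p.apply(BigQueryIO.readTableRows().from("project:dataset.table_A"));

// Leaf write into the staging table B.
tableARecords.apply(BigQueryIO.writeTableRows()
    .to("project:dataset.table_B")
    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));

// Chain the rest of the graph from the same PCollection instead of
// re-reading table_B, so Dataflow sees A -> {B, C} as one connected graph.
tableARecords
    .apply(/* placeholder transform for B -> C */ ...)
    .apply(BigQueryIO.writeTableRows().to("project:dataset.table_C"));
```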


Re: Dataflow java job with java transforms in expansion service

2022-06-15 Thread Sahith Nallapareddy
Hello,

I ran a job today and it seems to be working on the latest version of Beam;
jobs are actually running now! If I run into any issues I'll update this
thread, but thank you. It looks like the earlier issues have been resolved
as of 2.39.

Thanks,

Sahith

On Tue, Jun 14, 2022 at 4:14 PM Sahith Nallapareddy 
wrote:

> Hello,
>
> I will run another one on the latest beam today and let you know what
> happens. The last version I tried this on was I think 2.35. I believe there
> were no errors on the dataflow page, but some issues with getting the
> workers started. I will try on the latest beam and update to see what
> happens!
>
> Thanks,
>
> Sahith
>
> On Tue, Jun 14, 2022 at 12:26 PM Chamikara Jayalath 
> wrote:
>
>>
>>
>> On Tue, Jun 14, 2022 at 8:47 AM Sahith Nallapareddy 
>> wrote:
>>
>>> Hello,
>>>
>>> I was wondering if anyone has run a java job with java external
>>> transforms in dataflow? We have had python beam jobs run great with java
>>> external transforms. However, we tried to run a java job with java external
>>> transforms but this seemed to stall on dataflow (this was done a while ago,
>>> have to verify this is still happening). We are working on a system and
>>> having transforms in an expansion service is a component of it. Is this
>>> something that would be supported or is it only going to work cross
>>> language?
>>>
>>
>> Do you have any specific errors that your job ran into? In theory this
>> should work (with the latest Beam version) but I don't believe we
>> currently have tests for this.
>> One thing I can think of that could result in jobs getting stuck is
>> forgetting to include "--experiments=use_runner_v2".
>>
>> +Heejong Lee 
>>
>> Thanks,
>> Cham
>>
>>
>>>
>>> Thanks,
>>>
>>> Sahith
>>>
>>
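Cham's note about the `use_runner_v2` experiment can be made concrete with a launch-command sketch; only the `--experiments` flag comes from the message above, and the jar name, project, and region are hypothetical placeholders:

```shell
# Hypothetical Dataflow launch; substitute your own jar, project, and region.
# Only --experiments=use_runner_v2 is taken from the thread above.
java -jar my-pipeline-bundled.jar \
  --runner=DataflowRunner \
  --project=my-gcp-project \
  --region=us-central1 \
  --experiments=use_runner_v2
```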


Re: [PROPOSAL] Preparing for Beam release 2.40.0

2022-06-15 Thread Evan Galpin
>
> Thank you Evan! How confident are you about the fix?
>

I validated the fix contained in #21895 against a job running in Dataflow
that previously failed immediately (with the exact issue described in
#21690) when upgrading to 2.39.0, so that gives me good confidence that the
issue is patched. I will say, though, that I don't have exceptional
confidence in my own understanding of all windowing implications (e.g.
`setWindowingStrategyInternal` vs. `Window.into`), FWIW.

- Evan

On Wed, Jun 15, 2022 at 1:46 PM Ahmet Altay  wrote:

>
>
> On Wed, Jun 15, 2022 at 10:39 AM Evan Galpin  wrote:
>
>> RE https://github.com/apache/beam/issues/21690, I've just opened a PR
>> with a fix:  https://github.com/apache/beam/pull/21895.  If that feels
>> too rushed, I've also opened a PR to revert the change that introduced the
>> regression in the first place:  https://github.com/apache/beam/pull/21884
>> .
>>
>> For ElasticsearchIO to work at all in 2.40.0, one of
>> https://github.com/apache/beam/pull/21895 or
>> https://github.com/apache/beam/pull/21884 will need to be merged first.
>>
>
> Thank you Evan! How confident are you about the fix?
>
> I will defer the decision to release manager ( @Pablo Estrada
> ).
>
>
>>
>> Thanks,
>> Evan
>>
>>
>> On Tue, Jun 14, 2022 at 8:19 PM Ahmet Altay  wrote:
>>
>>> Thank you Kenn.
>>>
>>> To bring visibility, there are 2 blockers:
>>> https://github.com/apache/beam/issues/21625 - assigned to Reuven
>>>
>>
> @Reuven Lax  - ping on this one - Is this a release
> blocker?
>
>
>> https://github.com/apache/beam/issues/21690 - assigned to Evan
>>>
>>> Reuven, Evan, could you please comment on the open issues? Should they
>>> be blocking the 2.40.0 release?
>>>
>>> Thank you!
>>> Ahmet
>>>
>>> On Tue, Jun 14, 2022 at 7:35 AM Kenneth Knowles  wrote:
>>>
 I did a pass this morning. I believe there is only one release blocker
 that doesn't already have a fix. If I closed your issue or moved it off the
 milestone, feel free to have a different opinion and revert my action.

 Kenn

 On Mon, Jun 13, 2022 at 5:04 PM Ahmet Altay  wrote:

>
>
> On Tue, Jun 7, 2022 at 7:17 PM Ahmet Altay  wrote:
>
>> I apologize for digressing the release thread.  To bring it back,
>> please help Pablo with the release blockers (
>> https://github.com/apache/beam/milestone/2) :)
>>
>
> @Pablo Estrada  - there are still 10 blockers on
> that list. Could we move the non-critical items to the next release? Do 
> you
> need any help?
>
>
>>
>> On Tue, Jun 7, 2022 at 6:47 PM Danny McCormick <
>> dannymccorm...@google.com> wrote:
>>
>>> > A question related to priorities (P0, P1, etc.) as labels. Does it
>>> mean that not all issues will have priority unless we explicitly set 
>>> these
>>> labels?
>>>
>>> Technically yes, but this is a required field on all issue templates
>>> (for example, our bug template
>>> ).
>>> Our automation will automatically apply a label based on the response to
>>> that field.
>>>
>>
>> Nice. Thank you.
>>
>>
>>>
>>> On Tue, Jun 7, 2022 at 7:46 PM Ahmet Altay  wrote:
>>>


 On Tue, Jun 7, 2022 at 10:19 AM Danny McCormick <
 dannymccorm...@google.com> wrote:

> I'm good with that, that's consistent with the previous doc
> behavior which pointed to the fix version page (e.g.
> https://issues.apache.org/jira/projects/BEAM/versions/12351171).
> I closed my pr since that approach is consistent with the current 
> state of
> the docs, if we decide we just want the release manager to look at 
> P1s we
> can reopen.
>

 That makes sense. Release managers usually move most P0 and P1
 issues out of the blockers lists for good reasons and the two lists 
 tend to
 look very similar closer to the release.

 A question related to priorities (P0, P1, etc.) as labels. Does it
 mean that not all issues will have priority unless we explicitly set 
 these
 labels?


> Thanks,
> Danny
>
> On Tue, Jun 7, 2022 at 12:30 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
>
>>
>>
>> On Tue, Jun 7, 2022 at 7:33 AM Danny McCormick <
>> dannymccorm...@google.com> wrote:
>>
>>> If we want to filter to P0/P1 issues, we can do that with this
>>> query - https://github.com/apache/beam/issues?q=milestone:"2.40.0
>>> Release" label:P1,P0 is:open is:issue - I'll update the release
>>> guide to point to that url instead of just the 

Re: [PROPOSAL] Preparing for Beam release 2.40.0

2022-06-15 Thread Evan Galpin
RE https://github.com/apache/beam/issues/21690, I've just opened a PR with
a fix:  https://github.com/apache/beam/pull/21895.  If that feels too
rushed, I've also opened a PR to revert the change that introduced the
regression in the first place:  https://github.com/apache/beam/pull/21884.

For ElasticsearchIO to work at all in 2.40.0, one of
https://github.com/apache/beam/pull/21895 or
https://github.com/apache/beam/pull/21884 will need to be merged first.

Thanks,
Evan


On Tue, Jun 14, 2022 at 8:19 PM Ahmet Altay  wrote:

> Thank you Kenn.
>
> To bring visibility, there are 2 blockers:
> https://github.com/apache/beam/issues/21625 - assigned to Reuven
> https://github.com/apache/beam/issues/21690 - assigned to Evan
>
> Reuven, Evan, could you please comment on the open issues? Should they be
> blocking the 2.40.0 release?
>
> Thank you!
> Ahmet
>
> On Tue, Jun 14, 2022 at 7:35 AM Kenneth Knowles  wrote:
>
>> I did a pass this morning. I believe there is only one release blocker
>> that doesn't already have a fix. If I closed your issue or moved it off the
>> milestone, feel free to have a different opinion and revert my action.
>>
>> Kenn
>>
>> On Mon, Jun 13, 2022 at 5:04 PM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Tue, Jun 7, 2022 at 7:17 PM Ahmet Altay  wrote:
>>>
 I apologize for digressing the release thread.  To bring it back,
 please help Pablo with the release blockers (
 https://github.com/apache/beam/milestone/2) :)

>>>
>>> @Pablo Estrada  - there are still 10 blockers on
>>> that list. Could we move the non-critical items to the next release? Do you
>>> need any help?
>>>
>>>

 On Tue, Jun 7, 2022 at 6:47 PM Danny McCormick <
 dannymccorm...@google.com> wrote:

> > A question related to priorities (P0, P1, etc.) as labels. Does it
> mean that not all issues will have priority unless we explicitly set these
> labels?
>
> Technically yes, but this is a required field on all issue templates
> (for example, our bug template
> ).
> Our automation will automatically apply a label based on the response to
> that field.
>

 Nice. Thank you.


>
> On Tue, Jun 7, 2022 at 7:46 PM Ahmet Altay  wrote:
>
>>
>>
>> On Tue, Jun 7, 2022 at 10:19 AM Danny McCormick <
>> dannymccorm...@google.com> wrote:
>>
>>> I'm good with that, that's consistent with the previous doc behavior
>>> which pointed to the fix version page (e.g.
>>> https://issues.apache.org/jira/projects/BEAM/versions/12351171). I
>>> closed my pr since that approach is consistent with the current state of
>>> the docs, if we decide we just want the release manager to look at P1s 
>>> we
>>> can reopen.
>>>
>>
>> That makes sense. Release managers usually move most P0 and P1 issues
>> out of the blockers lists for good reasons and the two lists tend to look
>> very similar closer to the release.
>>
>> A question related to priorities (P0, P1, etc.) as labels. Does it
>> mean that not all issues will have priority unless we explicitly set 
>> these
>> labels?
>>
>>
>>> Thanks,
>>> Danny
>>>
>>> On Tue, Jun 7, 2022 at 12:30 PM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>>


 On Tue, Jun 7, 2022 at 7:33 AM Danny McCormick <
 dannymccorm...@google.com> wrote:

> If we want to filter to P0/P1 issues, we can do that with this
> query - https://github.com/apache/beam/issues?q=milestone:"2.40.0
> Release" label:P1,P0 is:open is:issue - I'll update the release
> guide to point to that url instead of just the milestones page. PR to 
> do
> this here - https://github.com/apache/beam/pull/21732
>

 Nit: I think we'd want all open issues tagged for that release.
 Release manager can decide to bump P2s to the next release or raise the
 priority.

 Thanks,
 Cham


>
> Thanks,
> Danny
>
> On Mon, Jun 6, 2022 at 7:56 PM Ahmet Altay 
> wrote:
>
>> Is this (https://github.com/apache/beam/milestone/2) the link
>> for 2.40.0 release blockers? If yes, are those 9 issues hard release
>> blockers? (We used to limit release blockers to P1s and above, not 
>> sure
>> what would be the right URL with the filter.)
>>
>> On Fri, Jun 3, 2022 at 10:41 AM Danny McCormick <
>> dannymccorm...@google.com> wrote:
>>
>>> The existing release blockers won't have their fix version
>>> automatically migrated, I'll make sure to go through and make sure 
>>> those
>>> get updated once the migration is done though. There are only 10 of 

Re: Null PCollection errors in v2.40 unit tests

2022-06-15 Thread Kenneth Knowles
This looks to be clearly a bug in AutoValue. The PCollection field is
clearly labeled as @Nullable. AutoValue generates the
`checkArgument(pcollection != null)` guard only for fields that are not
marked @Nullable, so it should not do so here.

The bug sounds like it is stateful: some global state in the AutoValue
annotation processor changes depending on which tasks were run prior to
generating the code.

You could dig further by reproducing the bad build and then opening up the
autovalue generated code (`find
runners/direct-java/src
AutoValue_ImmutableListBundleFactory_CommittedImmutableListBundle.java`
).

Kenn
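For readers unfamiliar with the generated code, here is a hand-written approximation of the two constructor shapes involved (an assumed sketch of the pattern, not the actual generated `AutoValue_...` file): a guard for a non-@Nullable field, and no guard for an @Nullable one. The bug described above is the first shape being emitted where the second was expected.

```java
// Hand-written sketch of AutoValue-style generated constructors (assumed
// shapes, not the real generated file).
public class GeneratedShapes {

    // Field NOT annotated @Nullable: the generated ctor rejects null with an
    // NPE whose message is "Null " + fieldName -- matching the test failures.
    static final class WithGuard {
        final Object pcollection;
        WithGuard(Object pcollection) {
            if (pcollection == null) {
                throw new NullPointerException("Null PCollection");
            }
            this.pcollection = pcollection;
        }
    }

    // Field annotated @Nullable: no guard is generated, null is legal.
    static final class WithoutGuard {
        final Object pcollection;
        WithoutGuard(Object pcollection) {
            this.pcollection = pcollection;
        }
    }

    public static void main(String[] args) {
        new WithoutGuard(null); // fine: nullable field accepts null
        try {
            new WithGuard(null); // throws, like the failing tests
        } catch (NullPointerException expected) {
            System.out.println(expected.getMessage()); // prints "Null PCollection"
        }
    }
}
```

If the generated file for the bad build contains the first shape for the @Nullable field, that would confirm the annotation processor lost track of the @Nullable marker.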

On Wed, Jun 15, 2022 at 12:50 AM Niel Markwick  wrote:

> Thank you!  `--rerun-tasks` fixed it for me as well!
>
> Bizarre issue, but then to misquote Clarke, a sufficiently advanced build
> system is indistinguishable from magic!
> I had tried ./gradlew clean and rm ~/.gradle, but that obviously didn't do
> enough cleaning!
>
> Now to get back to what I was doing before I disappeared down this
> rabbit hole - diagnosing/fixing flakey tests!
>
> --
> 
> Niel Markwick
> Cloud Solutions Architect
> Google Belgium
> ni...@google.com
> +32 2 894 6771
>
>
> Google Belgium NV/SA, Steenweg op Etterbeek 180, 1040 Brussel, Belgie. RPR: 
> 0878.065.378
>
> If you have received this communication by mistake, please don't forward
> it to anyone else (it may contain confidential or privileged information),
> please erase all copies of it, including all attachments, and please let
> the sender know it went to the wrong person. Thanks
>
>
> On Tue, 14 Jun 2022 at 23:27, Steve Niemitz  wrote:
>
>> I had brought up a weird issues I was having with AutoValue awhile ago
>> that looks actually very similar to this:
>> https://lists.apache.org/thread/0sbkykop2gsw71jpf3ln6forbnwq3j4o
>>
>> I never got to the bottom of it, but `--rerun-tasks` always fixes it for
>> me.
>>
>>
>> On Tue, Jun 14, 2022 at 5:11 PM Danny McCormick <
>> dannymccorm...@google.com> wrote:
>>
>>> It seems like this may be specifically caused by jumping around to
>>> different commits, and Evan's solution seems like the right one. I got a
>>> clean vm and did:
>>>
>>> sudo apt install git openjdk-11-jdk
>>> git clone https://github.com/apache/beam.git
>>> cd beam
>>> ./gradlew :sdks:java:io:google-cloud-platform:test
>>>
>>> tests pass
>>>
>>>
>>> git checkout b0d964c430
>>> ./gradlew :sdks:java:io:google-cloud-platform:test
>>>
>>> tests fail (this is the one we would expect to pass)
>>>
>>>
>>> git checkout 4ffeae4d
>>> ./gradlew :sdks:java:io:google-cloud-platform:test
>>>
>>> tests fail
>>>
>>> ./gradlew :sdks:java:io:google-cloud-platform:test --rerun-tasks
>>>
>>> tests passed (this is still on the "bad commit")
>>>
>>> Thanks,
>>> Danny
>>>
>>>
>>>
>>> On Tue, Jun 14, 2022 at 3:56 PM Evan Galpin  wrote:
>>>
 I had this happen to me recently as well.  After `git bisecting` led to
 confusing results, I ran my tests again via gradlew adding `--rerun-tasks`
 to the command.  This is an expensive operation, but after I ran that I was
 able to test again with expected results. YMMV

 Thanks,
 Evan


 On Tue, Jun 14, 2022 at 2:12 PM Niel Markwick  wrote:

> I agree that it is very strange!
>
> I have also just repro'd it on the cleanest possible environment: a
> brand new GCE debian 11 VM...
>
> sudo apt install git openjdk-11-jdk
> git clone https://github.com/apache/beam.git
> cd beam
> git checkout b0d964c430
> ./gradlew :sdks:java:io:google-cloud-platform:test
>
> tests pass
>
> git checkout 4ffeae4d
> ./gradlew :sdks:java:io:google-cloud-platform:test
>
> tests fail.
>
>
> The test failure stack traces are pretty much identical - the only
> difference being the test being run.
>
> They all complain about a Null PCollection from the directRunner (a
> couple complain due to incorrect expected exceptions, or asserts in a
> finally block, but they are failing because of the Null PCollection)
>
> I am not sure but I think the common ground _could_ be that a side
> input is used in the failing tests.
>
>
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
> java.lang.NullPointerException: Null PCollection
>   at app//org.apache.beam.sdk.Pipeline.run(Pipeline.java:329)
>   at 
> app//org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:399)
>   at 
> app//org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:335)
>   at 
> app//org.apache.beam.sdk.io.gcp.spanner.SpannerIOWriteTest.deadlineExceededFailsAfterRetries(SpannerIOWriteTest.java:734)
>   at 
> java.base@11.0.13/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>  Method)
>   at 
> 

Re: Changing the interface in CassandraIO Mapper

2022-06-15 Thread Kenneth Knowles
Minor but important correction: Beam does *not* "shade" Guava. That tends
to refer to build-time re-namespacing and/or bundling. Beam does neither of
those things. What Beam has done is to create the equivalent of a totally
independent fork. It has no impact on whether various libraries or IOs use
Guava directly.

Regarding CassandraIO: I believe if the user commits to using CassandraIO
or a Cassandra client library then they already commit to having a
compatible version of Guava in their dependency set. So I think it is fine
to expose it on the API surface. It is not part of the core SDK, so it only
impacts CassandraIO users, who already don't have a choice.

Kenn

On Tue, Jun 14, 2022 at 5:48 PM Chamikara Jayalath 
wrote:

>
>
> On Tue, Jun 14, 2022 at 5:11 PM Vincent Marquez 
> wrote:
>
>>
>>
>>
>> On Mon, May 16, 2022 at 11:29 PM Chamikara Jayalath 
>> wrote:
>>
>>>
>>>
>>> On Mon, May 16, 2022 at 12:35 PM Ahmet Altay  wrote:
>>>
 Adding folks who might have an opinion : @Alexey Romanenko
  @Chamikara Jayalath 

 On Wed, May 11, 2022 at 5:47 PM Vincent Marquez <
 vincent.marq...@gmail.com> wrote:

>
>
>
> On Wed, May 11, 2022 at 3:12 PM Daniel Collins 
> wrote:
>
>> ListenableFuture has the additional problem that beam shades guava,
>> so its very unlikely you would be able to put it into the public 
>> interface.
>>
>>
> I'm not sure why this would be the case, there are other places that
> make use of ListenableFuture such as the BigQuery IO, I would just need to
> use the vendored guava, no?
>

>>> I don't think this is exposed through the public API of BigQueryIO
>>> though.
>>>
>>>

>
>
>
>> Can you describe in more detail the changes you want to make and why
>> they require ListenableFuture for this interface?
>>
>
> Happy to go into detail:
>
> Currently, writes to Cassandra are executed asynchronously, up to 100 per
> instance of the DoFn (which I believe on most/all runners would be 1 per
> core).
>
> 1. That number should be configurable, this would entirely depend on
> the size of the Cassandra/Scylla cluster to determine if 100 async queries
> per core/node of a beam job is sufficient.
>
> 2. Once 100 async queries are queued up, the processElement *blocks*
> until all 100 queries finish.  This isn't efficient and will prevent more
> queries from being queued up until the slowest one finishes.  We've found
> it much better to have a steady rate of async queries in flight (to better
> saturate the cores on the database). However, doing so would require some
> sort of semaphore-style system: we need to know when one query finishes so
> that another can be added. Hence the need for a ListenableFuture, some
> mechanism that can signal an onComplete to release a semaphore (or latch
> or whatever).
>
> Does that make sense?  Thoughts/comments/criticism welcome.  Happy to
> put this up in a design doc if it seems like something worth doing.
>
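The throttling scheme described above can be sketched with plain `java.util.concurrent` primitives (a toy model, not CassandraIO code; the short sleep stands in for an async write, and the task/permit counts are arbitrary): acquire a permit before submitting, and release it from the completion callback, so a steady number of requests stay in flight instead of batching and blocking.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of semaphore-throttled async writes: at most `permits` are in
// flight at once, and each completion callback frees a slot for the next.
public class ThrottledWrites {

    /** Runs `tasks` fake async queries with at most `permits` in flight.
     *  Returns {completedCount, maxObservedInFlight}. */
    static int[] runDemo(int tasks, int permits) {
        ExecutorService pool = Executors.newFixedThreadPool(permits * 2);
        Semaphore inFlight = new Semaphore(permits);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger maxActive = new AtomicInteger();
        AtomicInteger completed = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(tasks);
        try {
            for (int i = 0; i < tasks; i++) {
                inFlight.acquire(); // block only until one slot frees up
                int now = active.incrementAndGet();
                maxActive.accumulateAndGet(now, Math::max);
                CompletableFuture
                    .runAsync(() -> {
                        // Stand-in for the async write to the database.
                        try { Thread.sleep(2); } catch (InterruptedException ignored) {}
                    }, pool)
                    .whenComplete((v, t) -> {
                        // Completion callback releases the slot -- the role
                        // a ListenableFuture listener would play.
                        active.decrementAndGet();
                        completed.incrementAndGet();
                        inFlight.release();
                        done.countDown();
                    });
            }
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return new int[] {completed.get(), maxActive.get()};
    }

    public static void main(String[] args) {
        int[] r = runDemo(50, 8);
        System.out.println("completed=" + r[0] + " maxInFlight=" + r[1]);
    }
}
```

The key property is that `maxObservedInFlight` never exceeds the permit count, while submission never waits for the slowest of a whole batch.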

>>> Does this have to be more complicated than maintaining a threadpool to
>>> manage async requests and adding incoming requests to the pool (which will
>>> be processed when threads become available)? I don't understand why
>>> you need to block accepting incoming requests till all 100 queries are
>>> finished.
>>>
>>>
>>
>> Apologies that I missed your reply!  The issue isn't that the threads
>> can't process the requests fast enough, the issue is we don't want to send
>> off the requests to the server until the server has finished processing.
>> We're trying to throttle sending too many queries to that particular
>> partition.
>>
>> Make sense?
>>
>
> Yeah, sounds like a reasonable performance improvement to me (minus the
> vendored guava issue Daniel pointed out).
> For completeness I believe this is the location where you are requesting
> to change the interface:
> https://github.com/apache/beam/blob/ac20321008e51c401731895ea934642b4648efd3/sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/Mapper.java#L65
>
> It might be possible to make this available as an option (a new
> CassandraIO.withMapperFactoryFn with a new Mapper) to preserve backwards
> compatibility.
>
> Thanks,
> Cham
>
>
>>
>>> Thanks,
>>> Cham
>>>
>>>
>
>
>>
>> On Wed, May 11, 2022 at 5:11 PM Vincent Marquez <
>> vincent.marq...@gmail.com> wrote:
>>
>>> I would like to do some additional performance related changes to
>>> the CassandraIO module, but it would necessitate changing the Mapper
>>> interface to return ListenableFuture opposed to
>>> java.util.concurrent.Future.  I'm not sure why the Mapper interface
>>> specifies the former, as the datastax driver itself returns a
>>> ListenableFuture for any async queries.
>>>
>>> How are