Minor but important correction: Beam does *not* "shade" Guava. That tends to refer to build-time re-namespacing and/or bundling. Beam does neither of those things. What Beam has done is to create the equivalent of a totally independent fork. It has no impact on whether various libraries or IOs use Guava directly.
Regarding CassandraIO: I believe if the user commits to using CassandraIO or a Cassandra client library then they already commit to having a compatible version of Guava in their dependency set. So I think it is fine to expose it on the API surface. It is not part of the core SDK, so it only impacts CassandraIO users, who already don't have a choice. Kenn On Tue, Jun 14, 2022 at 5:48 PM Chamikara Jayalath <[email protected]> wrote: > > > On Tue, Jun 14, 2022 at 5:11 PM Vincent Marquez <[email protected]> > wrote: > >> >> >> >> On Mon, May 16, 2022 at 11:29 PM Chamikara Jayalath <[email protected]> >> wrote: >> >>> >>> >>> On Mon, May 16, 2022 at 12:35 PM Ahmet Altay <[email protected]> wrote: >>> >>>> Adding folks who might have an opinion : @Alexey Romanenko >>>> <[email protected]> @Chamikara Jayalath <[email protected]> >>>> >>>> On Wed, May 11, 2022 at 5:47 PM Vincent Marquez < >>>> [email protected]> wrote: >>>> >>>>> >>>>> >>>>> >>>>> On Wed, May 11, 2022 at 3:12 PM Daniel Collins <[email protected]> >>>>> wrote: >>>>> >>>>>> ListenableFuture has the additional problem that beam shades guava, >>>>>> so its very unlikely you would be able to put it into the public >>>>>> interface. >>>>>> >>>>>> >>>>> I'm not sure why this would be the case, there are other places that >>>>> make use of ListenableFuture such as the BigQuery IO, I would just need to >>>>> use the vendored guava, no? >>>>> >>>> >>> I don't think this is exposed through the public API of BigQueryIO >>> though. >>> >>> >>>> >>>>> >>>>> >>>>> >>>>>> Can you describe in more detail the changes you want to make and why >>>>>> they require ListenableFuture for this interface? >>>>>> >>>>> >>>>> Happy to go into detail: >>>>> >>>>> Currently writes to Cassandra are executed asynchronous up to 100 per >>>>> instance of the DoFn (which I believe on most/all runners would be 1 per >>>>> core). >>>>> >>>>> 1. That number should be configurable, this would entirely depend on >>>>> the size of the Cassandra/Scylla cluster to determine if 100 async queries >>>>> per core/node of a beam job is sufficient. >>>>> >>>>> 2. Once 100 async queries are queued up, the processElement *blocks* >>>>> until all 100 queries finish. This isn't efficient and will prevent more >>>>> queries from being queued up until the slowest one finishes. We've found >>>>> it much better to have a steady rate of async queries in flight (to better >>>>> saturate the cores on the database). However, to do so would require >>>>> some >>>>> sort of semaphore type system in that we need to know when one query >>>>> finishes that means we can add another. Hence the need for a >>>>> ListenableFuture, some mechanism that can signal an onComplete to release >>>>> a >>>>> semaphore (or latch or whatever). >>>>> >>>>> Does that make sense? Thoughts/comments/criticism welcome. Happy to >>>>> put this up in a design doc if it seems like something worth doing. >>>>> >>>> >>> Does this have to be more complicated than maintaining threadpool to >>> manage async requests and adding incoming requests to the pool (which will >>> be processed when the threads become available) ? I don't understand why >>> you need to block accepting incoming requests till all 100 queries are >>> finished. >>> >>> >> >> Apologies that I missed your reply! The issue isn't that the threads >> can't process the requests fast enough, the issue is we don't want to send >> off the requests to the server until the server has finished processing. >> We're trying to throttle sending too many queries to that particular >> partition. >> >> Make sense? >> > > Yeah, sounds like a reasonable performance improvement to me (minus the > vendored guave issue Daniel pointed out). > For completeness I believe this is the location where you are requesting > to change the interface: > https://github.com/apache/beam/blob/ac20321008e51c401731895ea934642b4648efd3/sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/Mapper.java#L65 > > It might be possible to make this available as an option (a new > CassandraIO.withMapperFactoryFn with a new Mapper) to preserve backwards > compatibility. > > Thanks, > Cham > > >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >>> Thanks, >>> Cham >>> >>> >>>>> >>>>> >>>>>> >>>>>> On Wed, May 11, 2022 at 5:11 PM Vincent Marquez < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> I would like to do some additional performance related changes to >>>>>>> the CassandraIO module, but it would necessitate changing the Mapper >>>>>>> interface to return ListenableFuture opposed to >>>>>>> java.util.concurrent.Future. I'm not sure why the Mapper interface >>>>>>> specifies the former, as the datastax driver itself returns a >>>>>>> ListenableFuture for any async queries. >>>>>>> >>>>>>> How are changes to user facing interfaces handled (however minor >>>>>>> they would be) for Beam? If this is something that can be done, I'm >>>>>>> happy >>>>>>> to create a ticket, but supporting full backwards compatibility might be >>>>>>> too much work. >>>>>>> >>>>>>> >>>>>>> *~Vincent* >>>>>>> >>>>>>
