Minor but important correction: Beam does *not* "shade" Guava. That tends
to refer to build-time re-namespacing and/or bundling. Beam does neither of
those things. What Beam has done is to create the equivalent of a totally
independent fork. It has no impact on whether various libraries or IOs use
Guava directly.

Regarding CassandraIO: I believe if the user commits to using CassandraIO
or a Cassandra client library then they already commit to having a
compatible version of Guava in their dependency set. So I think it is fine
to expose it on the API surface. It is not part of the core SDK, so it only
impacts CassandraIO users, who already don't have a choice.

Kenn

On Tue, Jun 14, 2022 at 5:48 PM Chamikara Jayalath <[email protected]>
wrote:

>
>
> On Tue, Jun 14, 2022 at 5:11 PM Vincent Marquez <[email protected]>
> wrote:
>
>>
>>
>>
>> On Mon, May 16, 2022 at 11:29 PM Chamikara Jayalath <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Mon, May 16, 2022 at 12:35 PM Ahmet Altay <[email protected]> wrote:
>>>
>>>> Adding folks who might have an opinion : @Alexey Romanenko
>>>> <[email protected]> @Chamikara Jayalath <[email protected]>
>>>>
>>>> On Wed, May 11, 2022 at 5:47 PM Vincent Marquez <
>>>> [email protected]> wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 11, 2022 at 3:12 PM Daniel Collins <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> ListenableFuture has the additional problem that beam shades guava,
>>>>>> so its very unlikely you would be able to put it into the public 
>>>>>> interface.
>>>>>>
>>>>>>
>>>>> I'm not sure why this would be the case, there are other places that
>>>>> make use of ListenableFuture such as the BigQuery IO, I would just need to
>>>>> use the vendored guava, no?
>>>>>
>>>>
>>> I don't think this is exposed through the public API of BigQueryIO
>>> though.
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Can you describe in more detail the changes you want to make and why
>>>>>> they require ListenableFuture for this interface?
>>>>>>
>>>>>
>>>>> Happy to go into detail:
>>>>>
>>>>> Currently writes to Cassandra are executed asynchronous up to 100 per
>>>>> instance of the DoFn (which I believe on most/all runners would be 1 per
>>>>> core).
>>>>>
>>>>> 1. That number should be configurable, this would entirely depend on
>>>>> the size of the Cassandra/Scylla cluster to determine if 100 async queries
>>>>> per core/node of a beam job is sufficient.
>>>>>
>>>>> 2. Once 100 async queries are queued up, the processElement *blocks*
>>>>> until all 100 queries finish.  This isn't efficient and will prevent more
>>>>> queries from being queued up until the slowest one finishes.  We've found
>>>>> it much better to have a steady rate of async queries in flight (to better
>>>>> saturate the cores on the database).   However, to do so would require 
>>>>> some
>>>>> sort of semaphore type system in that we need to know when one query
>>>>> finishes that means we can add another.  Hence the need for a
>>>>> ListenableFuture, some mechanism that can signal an onComplete to release 
>>>>> a
>>>>> semaphore (or latch or whatever).
>>>>>
>>>>> Does that make sense?  Thoughts/comments/criticism welcome.  Happy to
>>>>> put this up in a design doc if it seems like something worth doing.
>>>>>
>>>>
>>> Does this have to be more complicated than maintaining threadpool to
>>> manage async requests and adding incoming requests to the pool (which will
>>> be processed when the threads become available) ? I don't understand why
>>> you need to block accepting incoming requests till all 100 queries are
>>> finished.
>>>
>>>
>>
>> Apologies that I missed your reply!  The issue isn't that the threads
>> can't process the requests fast enough, the issue is we don't want to send
>> off the requests to the server until the server has finished processing.
>> We're trying to throttle sending too many queries to that particular
>> partition.
>>
>> Make sense?
>>
>
> Yeah, sounds like a reasonable performance improvement to me (minus the
> vendored guave issue Daniel pointed out).
> For completeness I believe this is the location where you are requesting
> to change the interface:
> https://github.com/apache/beam/blob/ac20321008e51c401731895ea934642b4648efd3/sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/Mapper.java#L65
>
> It might be possible to make this available as an option (a new
> CassandraIO.withMapperFactoryFn with a new Mapper) to preserve backwards
> compatibility.
>
> Thanks,
> Cham
>
>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>> Thanks,
>>> Cham
>>>
>>>
>>>>>
>>>>>
>>>>>>
>>>>>> On Wed, May 11, 2022 at 5:11 PM Vincent Marquez <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I would like to do some additional performance related changes to
>>>>>>> the CassandraIO module, but it would necessitate changing the Mapper
>>>>>>> interface to return ListenableFuture opposed to
>>>>>>> java.util.concurrent.Future.  I'm not sure why the Mapper interface
>>>>>>> specifies the former, as the datastax driver itself returns a
>>>>>>> ListenableFuture for any async queries.
>>>>>>>
>>>>>>> How are changes to user facing interfaces handled (however minor
>>>>>>> they would be) for Beam?  If this is something that can be done, I'm 
>>>>>>> happy
>>>>>>> to create a ticket, but supporting full backwards compatibility might be
>>>>>>> too much work.
>>>>>>>
>>>>>>>
>>>>>>> *~Vincent*
>>>>>>>
>>>>>>

Reply via email to