Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

Doug Rohrer Mon, 10 Apr 2023 15:37:50 -0700

I’ve updated the CEP with two overview diagrams of the interactions between 
Sidecar, Cassandra, and the Bulk Analytics library.  Hope this helps folks 
better understand how things work, and thanks for the patience as it took a bit 
longer than expected for me to find the time for this.


Doug

> On Apr 5, 2023, at 11:18 AM, Doug Rohrer <[email protected]> wrote:
> 
> Sorry for the delay in responding here - yes, we can add some diagrams to the 
> CEP - I’ll try to get that done by end-of-week.
> 
> Thanks,
> 
> Doug
> 
>> On Mar 28, 2023, at 1:14 PM, J. D. Jordan <[email protected]> wrote:
>> 
>> Maybe some data flow diagrams could be added to the CEP showing some example 
>> operations for read/write?
>> 
>>> On Mar 28, 2023, at 11:35 AM, Yifan Cai <[email protected]> wrote:
>>> 
>>> 
>>> A lot of great discussions! 
>>> 
>>> On the Sidecar front, especially the role Sidecar plays in terms of 
>>> this CEP, I feel there might be some confusion. Once the code is published, 
>>> we should have clarity.
>>> Sidecar does not read SSTables, nor does it do any coordination for 
>>> analytics queries. It is local to the companion Cassandra instance. For 
>>> bulk read, it takes snapshots and streams SSTables to Spark workers to 
>>> read. For bulk write, it imports the SSTables uploaded from Spark workers. 
>>> All commands are existing JMX/nodetool functionality from Cassandra. 
>>> Sidecar adds the HTTP interface to them. This might be an oversimplified 
>>> description. The complex computation is performed in Spark clusters only.
>>> 
>>> In the long run, Cassandra might evolve into a database that does both OLTP 
>>> and OLAP. (That is not what this thread aims for.) 
>>> At the current stage, Spark is very well suited for analytic purposes. 
>>> 
>>> On Tue, Mar 28, 2023 at 9:06 AM Benedict <[email protected]> wrote:
>>>> I disagree with the first claim, as the process has all the information it 
>>>> chooses to utilise about which resources it’s using and what it’s using 
>>>> those resources for.
>>>> 
>>>> The inability to isolate GC domains is something we cannot address, but 
>>>> also probably not a problem if we were doing everything with memory 
>>>> management as well as we could be.
>>>> 
>>>> But, not worth derailing this thread for. Today we do very little well on 
>>>> this front within the process, and a separate process is well justified 
>>>> given the state of play.
>>>> 
>>>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker <[email protected]> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch <[email protected]> wrote:
>>>>> ...
>>>>> 
>>>>>> I think we might be underselling how valuable JVM isolation is,
>>>>>> especially for analytics queries that are going to pass the entire
>>>>>> dataset through heap somewhat constantly. 
>>>>> 
>>>>> Big +1 here. The JVM simply does not have significant granularity of 
>>>>> control for resource utilization, but this is explicitly a feature of 
>>>>> separate processes. Add in being able to separate GC domains and you can 
>>>>> avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Derek
>>>>> 
>>>>> 
>>>>> -- 
>>>>> +---------------------------------------------------------------+
>>>>> | Derek Chen-Becker                                             |
>>>>> | GPG Key available at https://keybase.io/dchenbecker and       |
>>>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>>>> +---------------------------------------------------------------+
>>>>> 
>
