Doug, thanks for the diagrams, really helpful.
Do you think there might be some extension to this CEP (it does not need to be included from the very beginning; this is just food for thought at this point) which would read data from the commit log / CDC?

The main motivation is that, when one looks at what is currently possible with Spark, Cassandra often exists as a sink only when it comes to streaming. For example, with the Spark Kafka connector (1), data arrives in Kafka, is streamed to Spark as RDDs, and Spark saves it to Cassandra via the Spark Cassandra Connector. Such a pipeline is indeed possible. We also have the Cassandra + Ignite integration (2, 3), where Ignite acts as an in-memory caching layer on top of Cassandra, enabling users to do transformations over IgniteRDD and queries which are not normally possible (e.g. SQL joins in Ignite over these caches). Very handy. But there is no Ignite streamer which would treat Cassandra as a realtime / near-realtime source. So, as far as I know (correct me if I am wrong), there is currently no integration which has Cassandra as a _real time_ source.

Looking at these diagrams: since you are able to load data from Cassandra SSTables, would it be possible to continually fetch the offset in the CDC index file (these changes were done in 4.0 for the first time, I think; ask Josh McKenzie about the details), read those mutations, and send them via Sidecar to Spark? Currently, the only solution I know of which does realtime-ish streaming of mutations from CDC is the Debezium Cassandra connector (4), but it pushes these mutations straight to Kafka only. I would love to have it in Spark first, and then I can do whatever I want with the data.
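To illustrate what I have in mind, here is a rough sketch of the polling side. The index-file format assumed here (first line = highest flushed offset, optional second line "COMPLETED") is my recollection of the 4.0 CDC index files; the `handle` callback and all names are made up for illustration, not an actual Sidecar API:

```python
import time

def poll_cdc_index(idx_path):
    """Read a Cassandra 4.0-style CDC index file (CommitLog-*_cdc.idx).

    Assumed format: the first line holds the highest durable offset into the
    matching commit log segment; an optional second line "COMPLETED" marks
    the segment as finished. Returns (offset, completed)."""
    with open(idx_path) as f:
        lines = f.read().splitlines()
    offset = int(lines[0]) if lines else 0
    completed = len(lines) > 1 and lines[1] == "COMPLETED"
    return offset, completed

def tail_cdc_index(idx_path, handle, poll_interval=0.1, max_polls=100):
    """Poll the index file and hand newly durable byte ranges to `handle`.

    handle(start, end) would deserialize the mutations in that range of the
    commit log segment and ship them (e.g. via Sidecar) to Spark -- that
    shipping part is exactly the hypothetical piece this thread is about."""
    last = 0
    for _ in range(max_polls):
        offset, completed = poll_cdc_index(idx_path)
        if offset > last:
            handle(last, offset)  # only bytes up to `offset` are safe to read
            last = offset
        if completed:
            return
        time.sleep(poll_interval)
```

This is essentially what Debezium's Cassandra connector does before handing mutations to Kafka; the question is whether the Sidecar could own this loop and expose the mutations to Spark directly.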
(1) https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
(2) https://ignite.apache.org/docs/latest/extensions-and-integrations/cassandra/overview
(3) https://ignite.apache.org/docs/latest/extensions-and-integrations/ignite-for-spark/ignitecontext-and-rdd
(4) https://github.com/debezium/debezium-connector-cassandra

________________________________________
From: Doug Rohrer <droh...@apple.com>
Sent: Tuesday, April 11, 2023 0:37
To: dev@cassandra.apache.org
Subject: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

I’ve updated the CEP with two overview diagrams of the interactions between Sidecar, Cassandra, and the Bulk Analytics library. Hope this helps folks better understand how things work, and thanks for the patience as it took a bit longer than expected for me to find the time for this.

Doug

On Apr 5, 2023, at 11:18 AM, Doug Rohrer <droh...@apple.com> wrote:

Sorry for the delay in responding here - yes, we can add some diagrams to the CEP - I’ll try to get that done by end-of-week.

Thanks,

Doug

On Mar 28, 2023, at 1:14 PM, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:

Maybe some data flow diagrams could be added to the CEP showing some example operations for read/write?

On Mar 28, 2023, at 11:35 AM, Yifan Cai <yc25c...@gmail.com> wrote:

A lot of great discussions! On the Sidecar front, especially what role the Sidecar plays in terms of this CEP, I feel there might be some confusion. Once the code is published, we should have clarity.

Sidecar does not read sstables nor do any coordination for analytics queries. It is local to the companion Cassandra instance. For bulk read, it takes snapshots and streams sstables to Spark workers to read. For bulk write, it imports the sstables uploaded from Spark workers.
All commands are existing JMX/nodetool functionality from Cassandra; Sidecar adds the HTTP interface on top of them. That may be an oversimplified description. The complex computation is performed in Spark clusters only.

In the long run, Cassandra might evolve into a database that does both OLTP and OLAP (not what this thread aims for). At the current stage, Spark is very well suited for analytics purposes.

On Tue, Mar 28, 2023 at 9:06 AM Benedict <bened...@apache.org> wrote:

I disagree with the first claim, as the process has all the information it chooses to utilise about which resources it’s using and what it’s using those resources for. The inability to isolate GC domains is something we cannot address, but it is also probably not a problem if we were doing everything with memory management as well as we could be. But this is not worth derailing the thread for. Today we do very little well on this front within the process, and a separate process is well justified given the state of play.

On 28 Mar 2023, at 16:38, Derek Chen-Becker <de...@chen-becker.org> wrote:

On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch <joe.e.ly...@gmail.com> wrote:

... I think we might be underselling how valuable JVM isolation is, especially for analytics queries that are going to pass the entire dataset through heap somewhat constantly.

Big +1 here. The JVM simply does not have significant granularity of control for resource utilization, but this is explicitly a feature of separate processes. Add in being able to separate GC domains and you can avoid a lot of noisy-neighbor in-VM behavior for the disparate workloads.
Cheers,

Derek

--
+---------------------------------------------------------------+
| Derek Chen-Becker                                             |
| GPG Key available at https://keybase.io/dchenbecker and       |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC   |
+---------------------------------------------------------------+