Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Sebastian Estevez
Hey Dinesh,

Yeah it makes sense that the sstable streaming is network bound since it's
mostly just moving files.

Do you have any performance stats on the sstable parsing side inside spark?

--Seb

On Tue, May 2, 2023 at 3:31 PM Dinesh Joshi  wrote:

> It is line rate / network bound. We have a patch out in vert.x that should
> use the zero copy path for it. But it's not a strict prereq for it.
>
> On 2023/05/02 15:39:02 Sebastian Estevez wrote:
> > Hi folks,
> >
> > Great stuff thanks for sharing.
> >
> > The performance numbers I've seen so far are for the sidecar streaming
> > sstables (seems like this is just network bound?). What kind of perf are
> > you seeing at the Spark executors (at the per task level)?
> >
> > --Seb
> >
> > On Mon, May 1, 2023 at 3:50 PM Dinesh Joshi  wrote:
> >
> > > Does anybody have any questions that we could answer about this
> > > proposal?
> > >
> > > On Apr 27, 2023, at 1:24 PM, Francisco Guerrero
> > > <frank.guerr...@gmail.com> wrote:
> > >
> > > Hi folks,
> > >
> > > We have updated the confluence page with the source code for CEP-28.
> > > There are two repositories with contributions. One is the patch [1]
> > > for Cassandra Sidecar with the bulk APIs that enable the Cassandra
> > > Spark Analytics library. The second is a new repository [2] with
> > > contributions to the Cassandra Spark Analytics code
> > >
> > > We also have a README markdown file that you can follow to give the
> > > code a try:
> > >
> > >
> > > https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md
> > >
> > > Best,
> > > - Francisco
> > >
> > > [1] Apache Cassandra Sidecar bulk APIs source code:
> > > https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
> > > [2] Apache Cassandra Spark Analytics source code:
> > > https://github.com/frankgh/cassandra-analytics
> > >
> > >
> > > On 2023/04/05 15:18:07 Doug Rohrer wrote:
> > > > Sorry for the delay in responding here - yes, we can add some
> > > > diagrams to the CEP - I’ll try to get that done by end-of-week.
> > > >
> > > > Thanks,
> > > >
> > > > Doug
> > > >
> > > > > On Mar 28, 2023, at 1:14 PM, J. D. Jordan wrote:
> > > > >
> > > > > Maybe some data flow diagrams could be added to the cep showing
> > > > > some example operations for read/write?
> > > > >
> > > > >> On Mar 28, 2023, at 11:35 AM, Yifan Cai wrote:
> > > > >>
> > > > >> A lot of great discussions!
> > > > >>
> > > > >> On the sidecar front, especially what the role sidecar plays in
> > > > >> terms of this CEP, I feel there might be some confusion. Once
> > > > >> the code is published, we should have clarity.
> > > > >> Sidecar does not read sstables nor do any coordination for
> > > > >> analytics queries. It is local to the companion Cassandra
> > > > >> instance. For bulk read, it takes snapshots and streams sstables
> > > > >> to spark workers to read. For bulk write, it imports the
> > > > >> sstables uploaded from spark workers. All commands are existing
> > > > >> jmx/nodetool functionalities from Cassandra. Sidecar adds the
> > > > >> http interface to them. It might be an over simplified
> > > > >> description. The complex computation is performed in spark
> > > > >> clusters only.
> > > > >>
> > > > >> In the long run, Cassandra might evolve into a database that
> > > > >> does both OLTP and OLAP. (Not what this thread aims for)
> > > > >> At the current stage, Spark is very suited for analytic
> > > > >> purposes.
> > > > >>
> > > > >> On Tue, Mar 28, 2023 at 9:06 AM Benedict <bened...@apache.org>
> > > > >> wrote:
> > > > >>> I disagree with the first claim, as the process has all the
> > > > >>> information it chooses to utilise about which resources it’s
> > > > >>> using and what it’s using those resources for.
> > > > >>>
> > > > >>> The inability to isolate GC domains is something we cannot
> > > > >>> address, but also probably not a problem if we were doing
> > > > >>> everything with memory management as well as we could be.
> > > > >>>
> > > > >>> But, not worth detailing this thread for. Today we do very
> > > > >>> little well on this front within the process, and a separate
> > > > >>> process is well justified given the state of play.
> > > >

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Sebastian Estevez
Hi folks,

Great stuff thanks for sharing.

The performance numbers I've seen so far are for the sidecar streaming
sstables (seems like this is just network bound?). What kind of perf are
you seeing at the Spark executors (at the per task level)?

--Seb

On Mon, May 1, 2023 at 3:50 PM Dinesh Joshi  wrote:

> Does anybody have any questions that we could answer about this proposal?
>
> On Apr 27, 2023, at 1:24 PM, Francisco Guerrero wrote:
>
> Hi folks,
>
> We have updated the confluence page with the source code for CEP-28.
> There are two repositories with contributions. One is the patch [1]
> for Cassandra Sidecar with the bulk APIs that enable the Cassandra
> Spark Analytics library. The second is a new repository [2] with
> contributions to the Cassandra Spark Analytics code
>
> We also have a README markdown file that you can follow to give the
> code a try:
>
>
> https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md
>
> Best,
> - Francisco
>
> [1] Apache Cassandra Sidecar bulk APIs source code:
> https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
> [2] Apache Cassandra Spark Analytics source code:
> https://github.com/frankgh/cassandra-analytics
>
>
> On 2023/04/05 15:18:07 Doug Rohrer wrote:
> > Sorry for the delay in responding here - yes, we can add some diagrams
> > to the CEP - I’ll try to get that done by end-of-week.
> >
> > Thanks,
> >
> > Doug
> >
> > > On Mar 28, 2023, at 1:14 PM, J. D. Jordan wrote:
> > >
> > > Maybe some data flow diagrams could be added to the cep showing some
> > > example operations for read/write?
> > >
> > >> On Mar 28, 2023, at 11:35 AM, Yifan Cai wrote:
> > >>
> > >> A lot of great discussions!
> > >>
> > >> On the sidecar front, especially what the role sidecar plays in
> > >> terms of this CEP, I feel there might be some confusion. Once the
> > >> code is published, we should have clarity.
> > >> Sidecar does not read sstables nor do any coordination for analytics
> > >> queries. It is local to the companion Cassandra instance. For bulk
> > >> read, it takes snapshots and streams sstables to spark workers to
> > >> read. For bulk write, it imports the sstables uploaded from spark
> > >> workers. All commands are existing jmx/nodetool functionalities from
> > >> Cassandra. Sidecar adds the http interface to them. It might be an
> > >> over simplified description. The complex computation is performed in
> > >> spark clusters only.
> > >>
> > >> In the long run, Cassandra might evolve into a database that does
> > >> both OLTP and OLAP. (Not what this thread aims for)
> > >> At the current stage, Spark is very suited for analytic purposes.
> > >>
> > >> On Tue, Mar 28, 2023 at 9:06 AM Benedict <bened...@apache.org>
> > >> wrote:
> > >>> I disagree with the first claim, as the process has all the
> > >>> information it chooses to utilise about which resources it’s using
> > >>> and what it’s using those resources for.
> > >>>
> > >>> The inability to isolate GC domains is something we cannot address,
> > >>> but also probably not a problem if we were doing everything with
> > >>> memory management as well as we could be.
> > >>>
> > >>> But, not worth detailing this thread for. Today we do very little
> > >>> well on this front within the process, and a separate process is
> > >>> well justified given the state of play.
> > >>>
> > >>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker
> > >>>> <de...@chen-becker.org> wrote:
> > >>>>
> > >>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch
> > >>>> <joe.e.ly...@gmail.com> wrote:
> > >>>> ...
> > >>>>> I think we might be underselling how valuable JVM isolation is,
> > >>>>> especially for analytics queries that are going to pass the
> > >>>>> entire dataset through heap somewhat constantly.
> > >>>>
> > >>>> Big +1 here. The JVM simply does not have significant granularity
> > >>>> of control for resource utilization, but this is explicitly a
> > >>>> feature of separate processes. Add in being able to separate GC
> > >>>> domains and you can avoid a lot of noisy neighbor in-VM behavior
> > >>>> for the disparate workloads.
> > >>>>
> > >>>> Cheers,
> > >>>>
> > >>>> Derek
> > >>>>
> > >>>> --
> > >>>> +---+
> > >>>> | Derek Chen-Becker |
> > >>>> | GPG Key available at https://keybase.io/dchenbecker and |
> > >>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> > >>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC |
> > >>>> +---+
>
> --
> Francisco Guerrero
>
>
>

-- 
All the best,

Sebastián


Re: Info for storage management / merge algorithms

2016-06-28 Thread Sebastian Estevez
Check out https://issues.apache.org/jira/browse/CASSANDRA-8099 and
https://github.com/pcmanus/cassandra/blob/8099_engine_refactor/guide_8099.md
for info on the latest storage layer changes in c*.

All the best,


Sebastián Estévez

Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com

DataStax is the fastest, most scalable distributed database technology,
delivering Apache Cassandra to the world’s most innovative enterprises.
Datastax is built to be agile, always-on, and predictably scalable to any
size. With more than 500 customers in 45 countries, DataStax is the
database technology and transactional backbone of choice for the worlds
most innovative companies such as Netflix, Adobe, Intuit, and eBay.

On Tue, Jun 28, 2016 at 3:17 AM, Andrew Springer <
matt.andrew.sprin...@gmail.com> wrote:

> Hi all,
>
> I'm doing my master's thesis on the storage layer of *SQL databases and more
> specifically on the algorithms used to combine the disk files.
> I would kindly like to ask for a few pointers describing the storage layer
> / merge algorithms of Cassandra.
> I found this research article, but it is a bit old (Nov '14) and I don't
> know if the information presented is still up to date for the current
> version of Cassandra:
>
> https://www.researchgate.net/publication/275657574_Efficient_Range-Based_Storage_Management_for_Scalable_Datastores
>
> Thanks a lot (and sorry if this is not the correct list for such a
> request).
>


Re: Unable to remove dead node from cluster.

2015-09-21 Thread Sebastian Estevez
Order is decommission, remove, assassinate.

Which have you tried?
On Sep 21, 2015 10:47 AM, "Dikang Gu"  wrote:

> Hi there,
>
> I have a dead node in our cluster, which is in a weird state right now and
> cannot be removed from the cluster.
>
> The nodestatus shows:
> Datacenter: DC1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load  Tokens  Owns  Host ID  Rack
> DN  10.210.165.55  ?     256     ?     null     r1
>
> I tried unsafeAssassinateEndpoint, but got an exception like:
> 2015-09-18_23:21:40.79760 INFO  23:21:40 InetAddress /10.210.165.55 is
> now DOWN
> 2015-09-18_23:21:40.80667 ERROR 23:21:40 Exception in thread
> Thread[GossipStage:1,5,main]
> 2015-09-18_23:21:40.80668 java.lang.NullPointerException: null
> 2015-09-18_23:21:40.80669   at
> org.apache.cassandra.service.StorageService.getApplicationStateValue(StorageService.java:1584)
> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80669   at
> org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1592)
> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80670   at
> org.apache.cassandra.service.StorageService.handleStateLeft(StorageService.java:1822)
> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80671   at
> org.apache.cassandra.service.StorageService.onChange(StorageService.java:1495)
> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80671   at
> org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2121)
> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80672   at
> org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1009)
> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80673   at
> org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1113)
> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80673   at
> org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49)
> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80673   at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62)
> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80674   at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> ~[na:1.7.0_45]
> 2015-09-18_23:21:40.80674   at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> ~[na:1.7.0_45]
> 2015-09-18_23:21:40.80674   at java.lang.Thread.run(Thread.java:744)
> ~[na:1.7.0_45]
> 2015-09-18_23:21:40.85812 WARN  23:21:40 Not marking nodes down due to
> local pause of 10852378435 > 50
>
> Any suggestions about how to remove it?
> Thanks.
>
> --
> Dikang
>
>


Re: Nodes failed to bootstrap, no nodetool info but system.peer populated.

2015-05-11 Thread Sebastian Estevez
I hit this issue today with the C# driver. I still think the drivers should
handle peers inconsistencies better and maybe even output warnings about
them.

I opened CSHARP-296. @rolo, it's probably a good idea to open a similar one
for Java.
On May 11, 2015 11:24 AM, "Carlos Rolo"  wrote:

> Thanks!
>
> Regards,
>
> Carlos Juzarte Rolo
> Cassandra Consultant
>
> Pythian - Love your data
>
> rolo@pythian | Twitter: cjrolo | Linkedin: *
> linkedin.com/in/carlosjuzarterolo
> *
> Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
> www.pythian.com
>
> On Mon, May 11, 2015 at 4:29 PM, Brandon Williams 
> wrote:
>
> > https://issues.apache.org/jira/browse/CASSANDRA-9180
> >
> > On Mon, May 11, 2015 at 4:17 AM, Carlos Rolo  wrote:
> >
> > > Hi all,
> > >
> > > > I just wanted to know if this would be worth filing a bug for or not
> > > > (I couldn't find any similar one).
> > >
> > > I have a 3 node cluster (2.0.14). Decided to add 3 new ones. 2 failed
> > > because of hardware failure (virtualized environment).
> > >
> > > The process was automated, so what was supposed to happen was:
> > >
> > > - Node 4 joins
> > > - wait until status is UN and then 2min more
> > > - Node 5 joins
> > > - wait until status is UN and then 2min more
> > > - Node 6 joins
> > > - wait until status is UN and then 2min more
> > >
> > > What happened:
> > > - Node 4 joins
> > > - Wait...
> > > - Node 5 joins
> > > - VM fails while node is starting.
> > > - VM 6 starts, no node with UN, waits 2min
> > > - Node 6 joins
> > > - VM fails while node is starting.
> > >
> > > After this, nodetool reports 4 nodes all UN
> > > While trying an application (Datastax Java Driver 2.1) the debug log
> > > reports that it tries to connect to Node 5 and 6 and fails.
> > >
> > > > Checking system.peers table, I see both nodes there. So I tried
> > > > "nodetool removenode " with the IDs in the table.
> > > >
> > > > It blows up with the following exception:
> > > > Exception in thread "main" java.lang.UnsupportedOperationException:
> > > > Host ID not found.
> > >
> > > Then I decided to do the following:
> > > DELETE from peers where ID in (ID1, ID2);
> > >
> > > All good, cluster still happy and driver not complaining anymore.
> > > Is this expected behavior?
> > >
> > >
> > >
> > > Regards,
> > >
> > > Carlos Juzarte Rolo
> > > Cassandra Consultant
> > >
> > > Pythian - Love your data
> > >
> > > rolo@pythian | Twitter: cjrolo | Linkedin: *
> > > linkedin.com/in/carlosjuzarterolo
> > > *
> > > Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
> > > www.pythian.com
> > >
> > > --
> > >
> > >
> > > --
> > >
> > >
> > >
> > >
> >
>
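The join automation quoted above moves on after a fixed wait. A minimal sketch of a stricter check follows, assuming the text printed by `nodetool status` has already been captured into a string. It reports whether one specific address is listed as UN, so a script could refuse to start the next node until the previous one has actually joined. The function name `node_is_up_normal` is hypothetical and not part of any Cassandra tooling.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Returns true only if `status_output` contains a line whose first column
 * is "UN" and whose second column equals `address`. The two-column layout
 * (state, then address) is assumed from the nodetool output quoted in
 * this archive. */
static bool node_is_up_normal(const char *status_output, const char *address)
{
    const char *line = status_output;

    while (*line != '\0') {
        size_t len = strcspn(line, "\n");
        char buf[256];

        /* Copy one line so sscanf cannot read past the line break. */
        if (len < sizeof buf) {
            char state[8];
            char addr[64];

            memcpy(buf, line, len);
            buf[len] = '\0';
            if (sscanf(buf, "%7s %63s", state, addr) == 2 &&
                strcmp(addr, address) == 0) {
                return strcmp(state, "UN") == 0;
            }
        }

        line += len;
        if (*line == '\n') {
            line++;
        }
    }
    return false; /* address not listed at all */
}
```

A wrapper script would poll this for the node it just started and stop the rollout if the node never appears, instead of continuing after the timer alone.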


Re: How to integrate Cassandra C driver in another open source server

2015-04-17 Thread Sebastian Estevez
You may want to try the c++ driver mailing list:

https://groups.google.com/a/lists.datastax.com/forum/#!forum/cpp-driver-user

All the best,


Sebastián Estévez

Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com

On Fri, Apr 17, 2015 at 10:32 AM, TOURON, BENOIT 
wrote:

> Hi,
>
> We are working on a project which consists of a DB migration (a move from a
> proprietary solution to Cassandra).
> Among the modules communicating with the DB, there is a Radius Server,
> based on the open source FreeRadius 2.2. So we have modified the source
> code of this server and added a specific module to call Cassandra, but we
> have an issue.
> We are using the DataStax C driver. The FreeRadius is in multithread mode,
> with pthreads.
>
> Our modifications are the following :
>
> *   In the instantiate part of the module (called once by the main
> process at the beginning), we run the following code:
>
> ape_main_context = (APE_MAIN_CONTEXT *) malloc(sizeof(APE_MAIN_CONTEXT));
> ape_main_context->cluster = cass_cluster_new();
> cass_cluster_set_contact_points(ape_main_context->cluster, "dse5");
>
> The APE_MAIN_CONTEXT is a structure used to pass info between the threads,
> which contains among other information, the Cassandra cluster.
> Once the cluster is created (with only 1 contact point, called "dse5"), we
> continue with the following code :
>
> ape_main_context->session = cass_session_new();
> CassFuture* connect_future = cass_session_connect_keyspace(
>     ape_main_context->session, ape_main_context->cluster, "ape01");
> CassError rc = cass_future_error_code(connect_future);
> cass_future_free(connect_future);
>
> This will create the session in the main process.
>
> *   In the authorize part of the module (which is called in a worker
> thread, spawned by the main process), we actually call the Cassandra DB
> with the following code:
>
> CassString query = cass_string_init(
>     "select * from ape01.session where ip_user = ?;");
> CassStatement* statement = cass_statement_new(query, 1);
> cass_statement_bind_string(statement, 0, cass_string_init("1FF1FF11"));
> CassFuture* query_future = cass_session_execute(
>     request->thread_context->main_context->session, statement);
> /* The statement is no longer needed once it has been executed. */
> cass_statement_free(statement);
> /* Blocks until the query completes; NULL means the query failed. */
> const CassResult* result = cass_future_get_result(query_future);
> cass_future_free(query_future);
> if (result == NULL) {
>     printf("rlm_cassandra : Query result KO\n");
>     return RLM_MODULE_HANDLED;
> }
> const CassRow* row = cass_result_first_row(result);
> if (row != NULL) {
>     CassString key;
>     cass_value_get_string(cass_row_get_column_by_name(row, "ip_user"), &key);
>     /* CassString carries an explicit length and may not be
>        NUL-terminated, so print it with a bounded format. */
>     printf("rlm_cassandra : resultat ip_user : <%.*s>\n",
>            (int) key.length, key.data);
> }
> cass_result_free(result);
>
>
> When we start the server, we notice that the instantiate part of the
> modules is executed without error (we have added some logs).
> However, when we use a Radius client to send an authorize request, the
> module function is launched in a worker thread, but remains blocked
> somewhere.
> We suspect a signal issue. The call flow would be the following :
>
> *   Call the instantiate function of the module (create the Cassandra
> cluster and session)
> *   Initialize other things (free radius code), erasing a signal
> handler which is useful for the Cassandra driver
> *   Spawn the worker threads and enter an event loop in each thread
> *   On reception of a request, a worker thread handles it and gets
> blocked
>
> As a workaround, we have moved the creation of the Cassandra session to
> the beginning of the worker thread (before entering the event loop). We
> create only 1 session (we use a mutex and a static Boolean, first=true at
> the beginning and set to false in the protected code).
> So the call flow becomes :
>
> *   Call the instantiate function of the module (create the Cassandra
> cluster only)
> *   Initialize other things (free radius code)
> *   Spawn the worker threads, create 1 Cassandra session for all
> threads (static Boolean + mutex), and enter an event loop in each thread
> *   On reception of a request, a worker thread handles it and it works
>
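The "static Boolean + mutex" guard described in the workaround can also be written with `pthread_once`, which POSIX provides for exactly this one-time initialization. The sketch below is not the poster's code: `shared_session`, `init_session` and `run_workers` are hypothetical stand-ins, and the real driver connect call is replaced by a placeholder so that the example compiles without the Cassandra driver.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared driver session; in the module this
 * would be the session pointer kept in APE_MAIN_CONTEXT. */
typedef struct {
    int connected;
} shared_session;

static shared_session g_session;
static int g_init_calls = 0;
static pthread_once_t g_session_once = PTHREAD_ONCE_INIT;

/* Runs exactly once, in whichever worker thread gets here first. */
static void init_session(void)
{
    g_init_calls++;
    g_session.connected = 1; /* placeholder for the real connect call */
}

/* Called by every worker thread before it enters its event loop. */
static shared_session *get_session(void)
{
    pthread_once(&g_session_once, init_session);
    return &g_session;
}

static void *worker(void *arg)
{
    (void) arg;
    return get_session();
}

/* Spawns up to 16 workers and returns how many times init_session ran. */
static int run_workers(int n)
{
    pthread_t threads[16];
    int started = 0;

    if (n > 16) {
        n = 16;
    }
    for (int i = 0; i < n; i++) {
        if (pthread_create(&threads[started], NULL, worker, NULL) == 0) {
            started++;
        }
    }
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
    }
    return g_init_calls;
}
```

Every thread that calls `get_session` blocks until the first initialization has finished, so no worker can see a half-created session. That is the same guarantee the mutex-protected flag gives, with less code to get wrong.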
> So, we manage to make a Cassandra