Hi guys,

We're testing out a Spark/Cassandra cluster, and we're very impressed with
what we've seen so far. However, I'd very much like some advice from the
shiny brains on the mailing list.

We have a large collection of Python code that we're in the process of
adapting to move into Spark/Cassandra, and I have some misgivings about
using Python for any further development.

As a concrete example, we have a Python class (part of a fairly large class
library) which, as part of its constructor, also creates a record of itself
in the Cassandra keyspace. So we get an initialised object and a row in a
table on the cluster. My problem is this: should we even be doing this?
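
To show what I mean, stripped right down the pattern is something like this
(a minimal sketch: the keyspace, table, and column names are made up, and it
assumes the DataStax cassandra-driver with a reachable cluster):

    import uuid
    from cassandra.cluster import Cluster

    class Transaction(object):
        """Domain object that also persists itself on construction."""
        def __init__(self, session, payload):
            self.tx_id = uuid.uuid4()
            self.payload = payload
            # Side effect in the constructor: one new object, one new row
            session.execute(
                "INSERT INTO transactions (tx_id, payload) VALUES (%s, %s)",
                (self.tx_id, self.payload))

    # cluster = Cluster(['127.0.0.1'])
    # session = cluster.connect('demo_ks')
    # tx = Transaction(session, 'raw transaction data')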

By this I mean, we could be facing an increasing number of transactions,
which we (naturally) would like to process as quickly as possible. The
input transactions themselves may well be routed to a number of processes,
e.g. starting an agent, writing to a log file, etc. So it seems wrong to be
putting the 'INSERT INTO ...' code into the class constructor: it would
seem more sensible to split this into a number of separate Spark processes,
with input handling, database insertion, Python object creation, and log
updates all happening on the Spark cluster, and all written as atomically
as possible.
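
Roughly, I'm imagining something like this (again only a sketch, against
the PySpark RDD API; the contact point, keyspace, and the
parse_to_transaction helper are all made up):

    from cassandra.cluster import Cluster

    class Transaction(object):
        """Pure domain object: the constructor does no I/O at all."""
        def __init__(self, tx_id, payload):
            self.tx_id = tx_id
            self.payload = payload

    def save_partition(transactions):
        # One connection per Spark partition rather than per record;
        # connections can't be serialised, so open them on the worker.
        cluster = Cluster(['127.0.0.1'])      # illustrative contact point
        session = cluster.connect('demo_ks')  # illustrative keyspace
        insert = session.prepare(
            "INSERT INTO transactions (tx_id, payload) VALUES (?, ?)")
        for tx in transactions:
            session.execute(insert, (tx.tx_id, tx.payload))
        cluster.shutdown()

    # On the driver, persistence becomes a separate stage:
    # rdd = sc.parallelize(raw_events).map(parse_to_transaction)
    # rdd.foreachPartition(save_partition)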

But I think my reservations here are more fundamental. Is Python the wrong
choice for this sort of thing? Would it not be better to use Scala?
Shouldn't we be dividing these tasks into atomic processes which execute as
rapidly as possible? What about streaming events to the cluster: wouldn't
Python be a bottleneck here rather than Scala, with its more robust support
for multithreading? Is streaming even supported in Python?

What do people think?

Best regards,

Johnny

-- 
Johnny Kelsey
Chief Technology Officer
Semblent
jkkel...@semblent.com
