Hi guys,

We're testing out a Spark/Cassandra cluster, and we're very impressed with what we've seen so far. However, I'd very much like some advice from the shiny brains on the mailing list.

We have a large collection of Python code that we're in the process of adapting to run on Spark/Cassandra, and I have some misgivings about using Python for any further development. As a concrete example, we have a Python class (part of a fairly large class library) which, as part of its constructor, also creates a record of itself in the Cassandra keyspace. So we get an initialised object and a row in a table on the cluster.
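To make that concrete, the pattern looks roughly like this (simplified sketch using the DataStax Python driver; the contact point, keyspace, table, and column names here are all invented for illustration):

    # Simplified sketch of the current pattern, using the DataStax
    # Python driver. Keyspace, table, and column names are placeholders.
    from cassandra.cluster import Cluster

    cluster = Cluster(['10.0.0.1'])        # contact point: placeholder
    session = cluster.connect('semblent')  # keyspace: placeholder

    class Agent(object):
        def __init__(self, agent_id, name):
            self.agent_id = agent_id
            self.name = name
            # The side effect in question: constructing the object
            # also inserts a row into the cluster.
            session.execute(
                "INSERT INTO agents (agent_id, name) VALUES (%s, %s)",
                (agent_id, name))

The question is whether that session.execute() belongs in the constructor at all.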
My problem is this: should we even be doing this? We could be facing an increasing number of transactions, which we (naturally) would like to process as quickly as possible. The input transactions themselves may well be routed to a number of processes, e.g. starting an agent, writing to a log file, and so on. So it seems wrong to put the 'INSERT ... INTO ...' code into the class instantiation: it would seem more sensible to split this into a number of separate Spark processes, with an input handler, database insertion, creation of the new Python object, and log update all happening on the Spark cluster, each written to be as atomic as possible. (There's a rough sketch of what I mean in the P.S. below.)

But I think my reservations here are more fundamental. Is Python the wrong choice for this sort of thing? Would it not be better to use Scala? Shouldn't we be dividing these tasks into atomic processes which execute as rapidly as possible? What about streaming events to the cluster: wouldn't Python be a bottleneck here compared with Scala and its more robust support for multithreading? Is streaming even supported in Python?

What do people think?

Best regards,

Johnny

--
Johnny Kelsey
Chief Technology Officer
*Semblent*
*jkkel...@semblent.com*
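P.S. To make the "separate Spark processes" idea concrete, here's a rough sketch of the decoupled version using PySpark's StreamingContext (assuming a Spark version that ships the Python streaming API; the feed host and port, keyspace, and table names are invented, and the input is assumed to be comma-separated "agent_id,name" lines):

    # Rough sketch: transactions arrive as comma-separated lines on a
    # socket; each micro-batch is parsed and written to Cassandra by
    # the workers, outside any constructor. Host/port, keyspace, and
    # table names are placeholders.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from cassandra.cluster import Cluster

    sc = SparkContext(appName="transaction-pipeline")
    ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

    lines = ssc.socketTextStream("tx-feed.internal", 9999)

    def save_partition(rows):
        # One driver session per partition, opened on the worker
        # rather than the Spark driver.
        session = Cluster(['10.0.0.1']).connect('semblent')
        for agent_id, name in rows:
            session.execute(
                "INSERT INTO agents (agent_id, name) VALUES (%s, %s)",
                (agent_id, name))

    parsed = lines.map(lambda line: tuple(line.split(',')))
    parsed.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))

    ssc.start()
    ssc.awaitTermination()

The logging and object-creation steps would hang off the same DStream in the same way, so each concern stays in its own stage rather than being buried in a constructor.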