Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-06-02 Thread Timur Shenkao
Did you use RDDs or DataFrames? What is the Spark version?

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-05-28 Thread Saulo Sobreiro
Hi, I ran a few more tests and found that even with a lot more operations on the Scala side, Python is outperformed... Dataset stream duration: ~3 minutes (CSV-formatted data messages read from Kafka). Scala process/store time: ~3 minutes (map with split + metrics calculations + store raw +
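To make the benchmarked "map with split + metrics" step concrete, here is a minimal, hypothetical sketch of that per-message work in plain Scala: splitting a CSV-formatted Kafka message and computing a simple metric over a micro-batch. The field layout, the `Reading` case class, and the mean-value metric are assumptions for illustration, not the original code.

```scala
object SensorParse {
  final case class Reading(sensorId: String, ts: Long, value: Double)

  // Split one CSV-formatted message into a typed record.
  def parse(line: String): Reading = {
    val f = line.split(",")
    Reading(f(0), f(1).trim.toLong, f(2).trim.toDouble)
  }

  // One plausible "metric calculation": mean value over a micro-batch.
  def meanValue(batch: Seq[Reading]): Double =
    batch.map(_.value).sum / batch.size

  def main(args: Array[String]): Unit = {
    val batch = Seq("s1,1000,2.0", "s1,1001,4.0").map(parse)
    println(meanValue(batch)) // prints 3.0
  }
}
```

When this kind of logic runs natively on the JVM inside a Scala `map`, there is no per-record serialization cost, which is consistent with the timing gap reported above.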

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-05-21 Thread Russell Spitzer
The answer is most likely that when you use cross Java-Python code you incur a penalty for every object that is converted from a Java object into a Python object (and then back again into a Java object) as data is passed in and out of your functions. A way around this would probably be
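The truncated suggestion most likely points at the usual workaround: express the per-row work with built-in DataFrame column expressions so it executes entirely inside the JVM and rows are never serialized across the Py4J boundary. A hedged Scala sketch of that idea follows; the column names and CSV layout are assumptions, and it assumes Spark is on the classpath:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object JvmOnlyParse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("jvm-only-parse").getOrCreate()
    import spark.implicits._

    // Stand-in for the Kafka "value" column; in the real job this
    // would come from the Kafka source.
    val raw = Seq("s1,1000,2.0", "s2,1001,4.5").toDF("value")

    // Built-in expressions (split/cast) are executed by Catalyst in
    // the JVM, so no per-row Java <-> Python conversion occurs even
    // if an equivalent pipeline is driven from PySpark.
    val parsed = raw
      .select(split(col("value"), ",").as("f"))
      .select(
        $"f".getItem(0).as("sensor_id"),
        $"f".getItem(1).cast("long").as("ts"),
        $"f".getItem(2).cast("double").as("value"))

    parsed.show()
    spark.stop()
  }
}
```

The design point is that only the query plan crosses the language boundary, not the data.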

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-05-21 Thread Alonso Isidoro Roman
Spark is developed mainly in Scala, so new features come first to Scala, then Java, and finally Python. I'm not surprised by the results; we have seen the same at Stratio since the first versions of Spark. At the beginning of development, some of our engineers build the prototype with

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-05-20 Thread Saulo Sobreiro
Hi Javier, Thank you for the feedback. Indeed the CPU is a huge limitation. I had a lot of trouble trying to run this use case in yarn-client mode; I managed to run it in standalone (local master) mode only. I do not have the hardware available to run this setup in a cluster yet, so I

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-04-30 Thread Javier Pareja
Hi Saulo, If the CPU is close to 100% then you are hitting the limit. I don't think that moving to Scala will make a difference. Both Spark and Cassandra are CPU hungry, your setup is small in terms of CPUs. Try running Spark on another (physical) machine so that the 2 cores are dedicated to

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-04-30 Thread Saulo Sobreiro
Hi Javier, I will try to implement this in Scala then. As far as I can see in the documentation, there is no saveToCassandra in the Python interface unless you are working with DataFrames, and the kafkaStream instance does not provide a method to convert an RDD into a DataFrame. Regarding my table, it is
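For reference, the usual workaround for the "no RDD-to-DataFrame method on the stream" problem is to do the conversion inside foreachRDD, where a SparkSession is available for each micro-batch. A hedged Scala sketch under assumed names (`kafkaStream`, a hypothetical `parse` row parser, and the keyspace/table names are all illustrative), assuming Spark Streaming and the Cassandra connector are on the classpath:

```scala
import org.apache.spark.sql.SparkSession

// Each micro-batch RDD can be turned into a DataFrame and written
// with the connector's DataFrame API.
kafkaStream.foreachRDD { rdd =>
  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._
  val df = rdd.map(parse).toDF() // `parse` is a hypothetical row parser
  df.write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "sensors", "table" -> "readings")) // assumed names
    .mode("append")
    .save()
}
```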

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-04-29 Thread Javier Pareja
Hi Saulo, I meant using this to save: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md#writing-to-cassandra-from-a-stream But the bottleneck might then move to a different area. Another point is that Cassandra and Spark running on the same machine might compete for
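The linked page boils down to a one-liner once the connector's streaming implicits are in scope. A minimal Scala sketch, where `parsedStream` (a DStream of case classes or tuples matching the table columns) and the keyspace/table names are assumptions:

```scala
import com.datastax.spark.connector.streaming._

// With the streaming implicits imported, each micro-batch of the
// DStream is written to Cassandra as it arrives.
parsedStream.saveToCassandra("sensors", "readings") // assumed names
```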

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-04-29 Thread Saulo Sobreiro
Hi Javier, I removed the transform and used map directly, but the kafkaStream is created with KafkaUtils, which does not have a method to save to Cassandra directly. Do you know any workaround for this? Thank you for the suggestion. Best regards, On 29/04/2018

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-04-29 Thread Javier Pareja
Hi Saulo, I'm no expert, but I will give it a try. I would remove the rdd2.count(); I can't see its point, and you will gain performance right away. Because of this, I would not use transform, just the map directly. I have not used Python, but in Scala the cassandra-spark connector can save
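A hypothetical reconstruction of these two suggestions, since the original job was not posted in full (`kafkaStream`, `parse`, and `rdd2` are assumed names):

```scala
// Before: a transform whose count forces an extra full evaluation
// of every micro-batch, with the result never used.
val rdd2 = kafkaStream.transform { rdd =>
  val mapped = rdd.map(parse)
  mapped.count() // extra pass over the whole batch, result discarded
  mapped
}

// After: drop the count and apply map directly on the DStream.
val parsed = kafkaStream.map(parse)
```

Since `count()` is an action, the "before" version materializes each batch once for the count and again downstream, so removing it roughly halves the work per batch.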

[Spark2.1] SparkStreaming to Cassandra performance problem

2018-04-28 Thread Saulo Sobreiro
Hi all, I am implementing a use case where I read some sensor data from Kafka with the Spark Streaming interface (KafkaUtils.createDirectStream) and, after some transformations, write the output (RDD) to Cassandra. Everything is working properly, but I am having some trouble with the performance.
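The pipeline described above can be sketched end to end in Scala roughly as follows. This is a hedged illustration, not the poster's code: it uses the Kafka 0.10 direct-stream API, and the topic, group id, keyspace/table names, batch interval, and the `parse` function are all assumptions. It also assumes an existing SparkContext `sc` and the spark-streaming-kafka-0-10 and spark-cassandra-connector dependencies.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import com.datastax.spark.connector.streaming._

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "sensor-readers") // assumed settings

val ssc = new StreamingContext(sc, Seconds(5)) // `sc`: existing SparkContext

// Direct stream from Kafka, then parse and write each micro-batch.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("sensors"), kafkaParams))

val parsed = stream.map(r => parse(r.value())) // `parse`: hypothetical parser
parsed.saveToCassandra("sensors", "readings")  // assumed keyspace/table

ssc.start()
ssc.awaitTermination()
```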