Did you use RDDs or DataFrames?
What is the Spark version?
On Mon, May 28, 2018 at 10:32 PM, Saulo Sobreiro wrote:
> Hi,
> I ran a few more tests and found that, even with a lot more operations on
> the Scala side, Python is outperformed...
>
> Dataset Stream duration: ~3 minutes (csv formatted
Hi,
I ran a few more tests and found that, even with a lot more operations on the
Scala side, Python is outperformed...
Dataset Stream duration: ~3 minutes (csv formatted data messages read from
Kafka)
Scala process/store time: ~3 minutes (map with split + metrics calculations +
store raw +
The answer is most likely that when you mix Java and Python code you incur a
penalty for every object that is converted from a Java object into a Python
object (and then back again into a Java object) when data is passed in and
out of your functions. A way around this would probably be
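The conversion penalty described above can be felt even without Spark. The snippet below is a rough, plain-Python analogy (not PySpark's actual machinery): a pickle round-trip per record stands in for the serialize/deserialize step each record goes through on its way into and out of a Python function.

```python
import pickle
import time

# Illustrative data shaped like the sensor records in this thread.
records = [("sensor-%d" % i, float(i), "2018-05-28") for i in range(100_000)]

# Direct pass over the records: no per-record conversion.
start = time.perf_counter()
total = sum(v for _, v, _ in records)
direct = time.perf_counter() - start

# Same pass, but every record is serialized and deserialized first,
# standing in for the Java <-> Python object conversion cost.
start = time.perf_counter()
total2 = sum(v for _, v, _ in (pickle.loads(pickle.dumps(r)) for r in records))
round_trip = time.perf_counter() - start

print("direct: %.4fs  with serialize/deserialize: %.4fs" % (direct, round_trip))
```

The arithmetic result is identical either way; only the per-record overhead differs, which is why pipelines that stay on the JVM tend to win.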
Spark is developed mainly in Scala, so new features land first in Scala, then
Java, and finally Python. I'm not surprised by the results; we've seen this at
Stratio since the first versions of Spark. At the beginning of development,
some of our engineers made the prototype with
Hi Javier,
Thank you very much for the feedback.
Indeed, the CPU is a huge limitation. I had a lot of trouble trying to run
this use case in yarn-client mode; I managed to run it in standalone (local
master) mode only.
I do not have the hardware available to run this setup in a cluster yet, so I
Hi Saulo,
If the CPU is close to 100%, then you are hitting the limit. I don't think
that moving to Scala will make a difference. Both Spark and Cassandra are
CPU-hungry, and your setup is small in terms of CPUs. Try running Spark on
another (physical) machine so that the 2 cores are dedicated to
Hi Javier,
I will try to implement this in Scala then. As far as I can see in the
documentation, there is no saveToCassandra in the Python interface unless you
are working with DataFrames, and the kafkaStream instance does not provide
methods to convert an RDD into a DataFrame.
Regarding my table, it is
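One possible shape for a Python-side workaround, sketched under assumptions (function, column names, and schema are illustrative, not from the thread): drain the DStream with foreachRDD, turn each batch into a DataFrame, and write it through the spark-cassandra-connector's DataFrame writer.

```python
# Hedged sketch, assuming PySpark 2.x with the spark-cassandra-connector
# package on the classpath. All names here are placeholders.
def save_stream_to_cassandra(kafka_stream, spark, keyspace, table):
    def write_batch(batch_time, rdd):
        # Imports live inside the function so this file also loads
        # in an environment without a Spark installation.
        from pyspark.sql import Row
        if rdd.isEmpty():
            return
        # Assumed record shape: already-split CSV fields per message.
        df = spark.createDataFrame(
            rdd.map(lambda parts: Row(sensor=parts[0], value=float(parts[1])))
        )
        (df.write
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace=keyspace, table=table)
           .mode("append")
           .save())

    kafka_stream.foreachRDD(write_batch)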
Hi Saulo,
I meant using this to save:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md#writing-to-cassandra-from-a-stream
But it might be slow in a different area.
Another point is that Cassandra and spark running on the same machine might
compete for
Hi Javier,
I removed the count and used "map" directly instead of using transform, but the
kafkaStream is created with KafkaUtils, which does not have a method to save to
Cassandra directly.
Do you know any workaround for this?
Thank you for the suggestion.
Best Regards,
On 29/04/2018
Hi Saulo,
I'm no expert, but I will give it a try.
I would remove the rdd2.count(); I can't see the point of it, and you will
gain performance right away. With that gone, I would not use transform, just
map directly.
I have not used Python, but in Scala the cassandra-spark connector can save
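The point about dropping the count can be shown without Spark. The plain-Python sketch below is only an analogy: an extra action such as rdd2.count() means a second full pass over every batch, so the per-record work runs more often than necessary.

```python
# Track how many times the per-record function actually runs.
calls = {"n": 0}

def process(x):
    calls["n"] += 1
    return x * 2

batch = list(range(1000))

# A separate "count" step does the work once just to count it...
counted = len([process(x) for x in batch])   # first pass: 1000 calls

# ...and the real output pass then does it all again.
stored = [process(x) for x in batch]         # second pass: 1000 more calls

print("process() was called %d times for one batch" % calls["n"])
```

In Spark the analogue is that an uncached RDD's lineage is recomputed for each action, so count() plus the actual write means two computations of the same map.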
Hi all,
I am implementing a use case where I read some sensor data from Kafka with the
Spark Streaming interface (KafkaUtils.createDirectStream) and, after some
transformations, write the output (RDD) to Cassandra.
Everything is working properly, but I am having some trouble with performance.
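For reference, the ingestion described above could look roughly like this. It is a sketch only, assuming PySpark with the Kafka 0.8 direct-stream integration; the topic name, broker address, and function name are placeholders, not taken from the thread.

```python
# Hedged sketch of the KafkaUtils.createDirectStream setup described above.
def build_stream(ssc):
    # Import inside the function so this file loads without a Spark install.
    from pyspark.streaming.kafka import KafkaUtils

    stream = KafkaUtils.createDirectStream(
        ssc,
        topics=["sensor-data"],                              # assumed topic
        kafkaParams={"metadata.broker.list": "localhost:9092"},  # assumed broker
    )
    # Each record arrives as a (key, value) pair; parse the CSV payload.
    return stream.map(lambda kv: kv[1].split(","))
```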