Hello, 

I have a question about using Spark Streaming to consume data from Kafka and
insert it into a Cassandra database.

5 AWS instances (each with 8 cores and 30 GB of memory) for Spark, Hadoop,
and Cassandra:
Scala: 2.10.5
Spark: 1.2.2
Hadoop: 1.2.1
Cassandra: 2.0.18

3 AWS instances for the Kafka cluster (each with 8 cores and 30 GB of memory):
Kafka: 0.8.2.1
Zookeeper: 3.4.6

Other configurations:
batchInterval = 6 seconds
blockInterval = 1500 millis
spark.locality.wait = 500 millis
#Consumers = 10
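
These are applied roughly as follows (a minimal sketch; the app name is a
placeholder, and I am assuming blockInterval and the locality wait are set
through SparkConf):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch of how the settings above are wired up; the app name is a
// placeholder. In Spark 1.2 both interval settings are millisecond values.
val conf = new SparkConf()
  .setAppName("KafkaToCassandra")
  .set("spark.streaming.blockInterval", "1500")  // blockInterval = 1500 millis
  .set("spark.locality.wait", "500")             // spark.locality.wait = 500 millis
val ssc = new StreamingContext(conf, Seconds(6)) // batchInterval = 6 seconds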

The Cassandra table keySpaceOfTopicA.tableOfTopicA has two columns,
"createdtime" and "log".

Here is a snippet of the code:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

// Create one Kafka receiver per consumer and union the resulting streams.
@transient val kstreams = (1 to numConsumers.toInt).map { _ =>
  KafkaUtils.createStream(ssc, zkeeper, groupId, Map("topicA" -> 1),
      StorageLevel.MEMORY_AND_DISK_SER)
    .map(_._2)                                      // keep only the message value
    .map(log => (System.currentTimeMillis(), log))  // pair it with an insert timestamp
}
@transient val unifiedMessage = ssc.union(kstreams)

// Write the (createdtime, log) pairs to Cassandra.
unifiedMessage.saveToCassandra("keySpaceOfTopicA", "tableOfTopicA",
  SomeColumns("createdtime", "log"))

I created a producer that sends messages to the brokers (1,000 messages at a
time).
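
The producer is essentially the following (a sketch using the Kafka 0.8
Scala producer API; the broker hosts and payloads are placeholders, not my
exact code):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

// Sketch of the test producer; broker hosts and payloads are placeholders.
val props = new Properties()
props.put("metadata.broker.list", "kafka1:9092,kafka2:9092,kafka3:9092")
props.put("serializer.class", "kafka.serializer.StringEncoder")

val producer = new Producer[String, String](new ProducerConfig(props))
(1 to 1000).foreach { i =>
  producer.send(new KeyedMessage[String, String]("topicA", s"test message $i"))
}
producer.close()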

However, only about 100 messages make it into Cassandra in each round of
testing.
Can anybody advise me on why the other messages (about 900) aren't being
consumed?
How should I configure and tune the parameters to improve the consumers'
throughput?

Thank you very much in advance for reading and for your suggestions.

Jerry Wong


