Hi,

Thanks, guys. I was using spark-cassandra-connector 1.4.0-M1, and there is an issue in this version of the connector: the parameter spark.cassandra.input.split.size_in_mb, which is supposed to have a default value of 64 MB, is being interpreted as 64 bytes. This causes far too many partitions to be created.
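As a rough sanity check of the explanation above, the sketch below compares how many splits a fixed amount of data yields when the split size is 64 bytes versus 64 MB. It is back-of-the-envelope arithmetic, not the connector's actual split logic: the class SplitEstimate, the method estimatePartitions, and the assumed data size (the 770870 tasks reported in this thread, times 64 bytes each) are all hypothetical and chosen only for illustration.

```java
// Back-of-the-envelope sketch only; this is NOT how the connector computes splits.
final class SplitEstimate {
    static final long BYTES_PER_MB = 1024L * 1024L;

    /** Ceiling division: number of splits of splitBytes needed to cover dataBytes. */
    static long estimatePartitions(long dataBytes, long splitBytes) {
        if (dataBytes < 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("dataBytes must be >= 0 and splitBytes > 0");
        }
        return (dataBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        // Task count reported in this thread.
        long tasksObserved = 770_870L;
        // Assumption: each task covered about 64 bytes, so the table is roughly 49 MB.
        long assumedDataBytes = tasksObserved * 64L;

        // Split size misread as 64 bytes (the reported behaviour).
        long misread = estimatePartitions(assumedDataBytes, 64L);
        // Split size read as 64 MB (the documented default).
        long intended = estimatePartitions(assumedDataBytes, 64L * BYTES_PER_MB);

        System.out.println("split size 64 bytes -> " + misread + " partitions");
        System.out.println("split size 64 MB    -> " + intended + " partitions");
    }
}
```

Under that assumption the whole table would fit into a single 64 MB split, which is consistent with the task count dropping sharply once the unit is read correctly.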
Following is the Jira link:
https://datastax-oss.atlassian.net/browse/SPARKC-208?jql=project%20%3D%20SPARKC%20AND%20fixVersion%20%3D%201.4.0-M2

Thanks,
Prateek

From: Matthias Niehoff [mailto:matthias.nieh...@codecentric.de]
Sent: Thursday, March 10, 2016 9:28 PM
To: Bryan Jeffrey <bryan.jeff...@gmail.com>
Cc: Prateek . <prat...@aricent.com>; user@spark.apache.org
Subject: Re: Spark job for Reading time series data from Cassandra

Hi,

the Spark connector docs say (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md):

"The number of Spark partitions(tasks) created is directly controlled by the setting spark.cassandra.input.split.size_in_mb. This number reflects the approximate amount of Cassandra Data in any given Spark partition. To increase the number of Spark Partitions decrease this number from the default (64mb) to one that will sufficiently break up your C* token range."

So, maybe your partitions are quite big?

2016-03-10 16:46 GMT+01:00 Bryan Jeffrey <bryan.jeff...@gmail.com>:

Prateek,

I believe that one task is created per Cassandra partition. How is your data partitioned?

Regards,
Bryan Jeffrey

On Thu, Mar 10, 2016 at 10:36 AM, Prateek . <prat...@aricent.com> wrote:

Hi,

I have a Spark batch job for reading time-series data from Cassandra, which has 50,000 rows.

    JavaRDD<String> cassandraRowsRDD = javaFunctions.cassandraTable("iotdata", "coordinate")
        .map(new Function<CassandraRow, String>() {
            @Override
            public String call(CassandraRow cassandraRow) throws Exception {
                return cassandraRow.toString();
            }
        });
    List<String> lm = cassandraRowsRDD.collect();

I am testing in local mode, where I observe Spark creating 770870 tasks (one job, one stage), which is taking many hours to complete. Can anyone please suggest what the possible issues could be?
Stage Id:                0
Description:             collect at CassandraSpark.java:94
Submitted:               2016/03/10 21:01:15
Duration:                9 s
Tasks: Succeeded/Total:  137/770870
Input, Output, Shuffle Read, Shuffle Write: (empty)

Thank you,
Prateek

"DISCLAIMER: This message is proprietary to Aricent and is intended solely for the use of the individual to whom it is addressed. It may contain privileged or confidential information and should not be circulated or used for any purpose other than for what it is intended. If you have received this message in error, please notify the originator immediately. If you are not the intended recipient, you are notified that you are strictly prohibited from using, copying, altering, or disclosing the contents of this message. Aricent accepts no responsibility for loss or damage arising from the use of the information transmitted by this email including damage from virus."

--
Matthias Niehoff | IT-Consultant | Agile Software Factory | Consulting
codecentric AG | Zeppelinstr 2 | 76185 Karlsruhe | Germany
tel: +49 (0) 721.9595-681 | fax: +49 (0) 721.9595-666 | mobile: +49 (0) 172.1702676
www.codecentric.de | blog.codecentric.de | www.meettheexperts.de | www.more4fi.de

Registered office: Solingen | HRB 25917 | Amtsgericht Wuppertal
Executive board: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
Supervisory board: Patric Fedlmeier (chairman) . Klaus Jäger . Jürgen Schütz

This e-mail, including any attached files, contains confidential and/or legally protected information. If you are not the intended recipient or have received this e-mail in error, please inform the sender immediately and delete this e-mail and any attached files without delay. Unauthorized copying, use, or opening of any attached files, as well as unauthorized forwarding of this e-mail, is not permitted.