Hi,

the Spark Cassandra Connector FAQ (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md) says:

"The number of Spark partitions(tasks) created is directly controlled by the setting spark.cassandra.input.split.size_in_mb. This number reflects the approximate amount of Cassandra Data in any given Spark partition. To increase the number of Spark Partitions decrease this number from the default (64mb) to one that will sufficiently break up your C* token range."

So, maybe your partitions are quite big?

2016-03-10 16:46 GMT+01:00 Bryan Jeffrey <bryan.jeff...@gmail.com>:

> Prateek,
>
> I believe that one task is created per Cassandra partition. How is your
> data partitioned?
>
> Regards,
>
> Bryan Jeffrey
>
> On Thu, Mar 10, 2016 at 10:36 AM, Prateek . <prat...@aricent.com> wrote:
>
>> Hi,
>>
>> I have a Spark batch job for reading time series data from Cassandra which
>> has 50,000 rows.
>>
>> JavaRDD<String> cassandraRowsRDD = javaFunctions.cassandraTable("iotdata", "coordinate")
>>         .map(new Function<CassandraRow, String>() {
>>             @Override
>>             public String call(CassandraRow cassandraRow) throws Exception {
>>                 return cassandraRow.toString();
>>             }
>>         });
>>
>> List<String> lm = cassandraRowsRDD.collect();
>>
>> I am testing in local mode, where I am observing that Spark is creating
>> 770870 tasks (one job, one stage), which is taking many hours to complete.
>> Can anyone please suggest what the possible issues could be?
>>
>> Stage Id:               0
>> Description:            collect at CassandraSpark.java:94
>> Submitted:              2016/03/10 21:01:15
>> Duration:               9 s
>> Tasks: Succeeded/Total: 137/770870
>> Input:                  (empty)
>> Output:                 (empty)
>> Shuffle Read:           (empty)
>> Shuffle Write:          (empty)
>>
>> Thank You
>>
>> Prateek

-- 
Matthias Niehoff | IT-Consultant | Agile Software Factory | Consulting
codecentric AG | Zeppelinstr 2 | 76185 Karlsruhe | Germany
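
To make the split-size setting from the FAQ quote a bit more concrete, here is a rough back-of-the-envelope sketch in plain Java. It assumes that the number of Spark partitions is approximately the estimated table size divided by the split size, and that "mb" means 1024 * 1024 bytes. Both are my assumptions, not something the FAQ states, and the class and method names (SplitSizeEstimate, estimatePartitions) are made up for illustration. The connector works from its own size estimates, so real task counts can differ a lot from this.

```java
// Hypothetical helper, not part of Spark or the Cassandra connector.
class SplitSizeEstimate {

    // Assumption: partitions ~ ceil(estimated table size / split size),
    // with the split size given in MiB (1024 * 1024 bytes).
    static long estimatePartitions(long tableSizeBytes, int splitSizeMb) {
        if (tableSizeBytes < 0 || splitSizeMb <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        long splitBytes = splitSizeMb * 1024L * 1024L;
        // Ceiling division, and never report fewer than one partition.
        return Math.max(1L, (tableSizeBytes + splitBytes - 1) / splitBytes);
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024L * 1024L;
        // Default split size from the FAQ quote (64) versus a smaller one.
        System.out.println(estimatePartitions(oneGiB, 64)); // prints 16
        System.out.println(estimatePartitions(oneGiB, 16)); // prints 64
    }
}
```

Under these assumptions, a smaller split size gives more and smaller tasks, and a larger one gives fewer. The setting is an ordinary Spark configuration property, so it can be set on the SparkConf or passed with --conf on spark-submit. Whether this exact property name applies depends on the connector version in use.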