Hi,

the Spark Cassandra Connector FAQ
(https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md)
says:

"The number of Spark partitions(tasks) created is directly controlled by
the setting spark.cassandra.input.split.size_in_mb. This number reflects
the approximate amount of Cassandra Data in any given Spark partition. To
increase the number of Spark Partitions decrease this number from the
default (64mb) to one that will sufficiently break up your C* token range."

So, maybe your partitions are quite big?
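
To make the quoted FAQ paragraph a bit more concrete, here is a rough sketch of the arithmetic it implies. It assumes the number of tasks is roughly the estimated table size divided by the split size; the connector's real split logic works on token ranges and size estimates, so actual numbers can differ. The class name, method name and the 6400 MB table size are made up for illustration. In a real job the value would go into the Spark configuration under the key from the FAQ, e.g. via SparkConf.set("spark.cassandra.input.split.size_in_mb", "64").

```java
// Rough estimate only: the connector splits by token range using size
// estimates, so real task counts will not match this exactly.
final class SplitSizeEstimate {

    // Approximate number of Spark partitions (tasks) for a full table scan,
    // assuming tasks ~= ceil(estimated table size / split size), at least 1.
    static long estimateTasks(long tableSizeMb, long splitSizeMb) {
        if (tableSizeMb < 0 || splitSizeMb <= 0) {
            throw new IllegalArgumentException(
                    "table size must be >= 0 and split size must be > 0");
        }
        long tasks = (tableSizeMb + splitSizeMb - 1) / splitSizeMb; // ceiling division
        return Math.max(1, tasks);
    }

    public static void main(String[] args) {
        long tableSizeMb = 6400; // hypothetical table size, not from the thread
        for (long splitSizeMb : new long[] {16, 64, 256}) {
            System.out.println("split.size_in_mb=" + splitSizeMb
                    + " -> about " + estimateTasks(tableSizeMb, splitSizeMb) + " tasks");
        }
    }
}
```

Under this assumption, a larger split size gives fewer tasks and a smaller one gives more, so if you see far more tasks than the data volume would suggest, it may be worth checking both this setting and the size the connector estimates for the table.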

2016-03-10 16:46 GMT+01:00 Bryan Jeffrey <bryan.jeff...@gmail.com>:

> Prateek,
>
> I believe that one task is created per Cassandra partition.  How is your
> data partitioned?
>
> Regards,
>
> Bryan Jeffrey
>
> On Thu, Mar 10, 2016 at 10:36 AM, Prateek . <prat...@aricent.com> wrote:
>
>> Hi,
>>
>>
>>
>> I have a Spark Batch job for reading timeseries data from Cassandra which
>> has 50,000 rows.
>>
>>
>>
>>
>>
>> JavaRDD<String> cassandraRowsRDD = javaFunctions.cassandraTable("iotdata", "coordinate")
>>         .map(new Function<CassandraRow, String>() {
>>             @Override
>>             public String call(CassandraRow cassandraRow) throws Exception {
>>                 return cassandraRow.toString();
>>             }
>>         });
>>
>> List<String> lm = cassandraRowsRDD.collect();
>>
>>
>>
>>
>>
>> I am testing in local mode, where I observe Spark creating 770870 tasks
>> (one job, one stage), which is taking many hours to complete. Can anyone
>> please suggest what the possible issues could be?
>>
>>
>>
>>
>>
>> Stage Id:                0
>> Description:             collect at CassandraSpark.java:94
>>                          <http://localhost:4040/stages/stage?id=0&attempt=0>
>> Submitted:               2016/03/10 21:01:15
>> Duration:                9 s
>> Tasks: Succeeded/Total:  137/770870
>> Input, Output, Shuffle Read, Shuffle Write: (no values shown)
>>
>>
>>
>>
>>
>> Thank You
>>
>>
>>
>> Prateek
>>
>
>


-- 
Matthias Niehoff | IT-Consultant | Agile Software Factory  | Consulting
codecentric AG | Zeppelinstr 2 | 76185 Karlsruhe | Deutschland
