Hi,

Thanks, guys. I was using spark-cassandra-connector 1.4.0-M1, and there is an 
issue in this version of the connector: the parameter 
spark.cassandra.input.split.size_in_mb, which is supposed to have a default 
value of 64 MB, is being interpreted as 64 bytes. This causes far too many 
partitions to be created.
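
To give a feel for the scale, here is a rough sketch of the arithmetic. It 
assumes a simplified model in which the number of splits is roughly the data 
size divided by the split size; this is not the connector's actual algorithm. 
The estimateSplits helper and the 48 MB table size are hypothetical, chosen 
only for illustration.

```java
public class SplitCountSketch {

    // Simplified model (an assumption, not the connector's real logic):
    // splits ~= ceil(totalBytes / splitSizeBytes).
    static long estimateSplits(long totalBytes, long splitSizeBytes) {
        return (totalBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        // Hypothetical table size of 48 MB, for illustration only.
        long totalBytes = 48L * 1024 * 1024;

        // Intended default: 64 MB expressed in bytes.
        long intendedSplitBytes = 64L * 1024 * 1024;

        // Buggy interpretation: the value 64 read as bytes.
        long buggySplitBytes = 64L;

        System.out.println("splits at 64 MB:    "
                + estimateSplits(totalBytes, intendedSplitBytes));
        System.out.println("splits at 64 bytes: "
                + estimateSplits(totalBytes, buggySplitBytes));
    }
}
```

Under this model a few tens of megabytes of data turn into hundreds of 
thousands of splits when the split size is read as 64 bytes, the same order 
of magnitude as the task count I was seeing.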

Here is the Jira link:

https://datastax-oss.atlassian.net/browse/SPARKC-208?jql=project%20%3D%20SPARKC%20AND%20fixVersion%20%3D%201.4.0-M2

Thanks,
Prateek



From: Matthias Niehoff [mailto:matthias.nieh...@codecentric.de]
Sent: Thursday, March 10, 2016 9:28 PM
To: Bryan Jeffrey <bryan.jeff...@gmail.com>
Cc: Prateek . <prat...@aricent.com>; user@spark.apache.org
Subject: Re: Spark job for Reading time series data from Cassandra

Hi,

the spark connector docs say: 
(https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md)

"The number of Spark partitions(tasks) created is directly controlled by the 
setting spark.cassandra.input.split.size_in_mb. This number reflects the 
approximate amount of Cassandra Data in any given Spark partition. To increase 
the number of Spark Partitions decrease this number from the default (64mb) to 
one that will sufficiently break up your C* token range."

So, maybe your partitions are quite big?

2016-03-10 16:46 GMT+01:00 Bryan Jeffrey <bryan.jeff...@gmail.com>:
Prateek,

I believe that one task is created per Cassandra partition.  How is your data 
partitioned?

Regards,

Bryan Jeffrey

On Thu, Mar 10, 2016 at 10:36 AM, Prateek . <prat...@aricent.com> wrote:
Hi,

I have a Spark batch job that reads time series data from a Cassandra table 
with 50,000 rows.


JavaRDD<String> cassandraRowsRDD = javaFunctions.cassandraTable("iotdata", "coordinate")
                .map(new Function<CassandraRow, String>() {
                    @Override
                    public String call(CassandraRow cassandraRow) throws Exception {
                        return cassandraRow.toString();
                    }
                });

List<String> lm = cassandraRowsRDD.collect();


I am testing in local mode, where I observe Spark creating 770870 tasks 
(one job, one stage), which is taking many hours to complete. Can anyone 
please suggest what the possible issues could be?


Stage Id:                0
Description:             collect at CassandraSpark.java:94
Submitted:               2016/03/10 21:01:15
Duration:                9 s
Tasks (Succeeded/Total): 137/770870
Input, Output, Shuffle Read, Shuffle Write: (empty)



Thank You

Prateek
"DISCLAIMER: This message is proprietary to Aricent and is intended solely for 
the use of the individual to whom it is addressed. It may contain privileged or 
confidential information and should not be circulated or used for any purpose 
other than for what it is intended. If you have received this message in error, 
please notify the originator immediately. If you are not the intended 
recipient, you are notified that you are strictly prohibited from using, 
copying, altering, or disclosing the contents of this message. Aricent accepts 
no responsibility for loss or damage arising from the use of the information 
transmitted by this email including damage from virus."




--
Matthias Niehoff | IT-Consultant | Agile Software Factory  | Consulting
codecentric AG | Zeppelinstr 2 | 76185 Karlsruhe | Deutschland
tel: +49 (0) 721.9595-681 | fax: +49 (0) 721.9595-666 | mobile: +49 (0) 172.1702676
www.codecentric.de | blog.codecentric.de | www.meettheexperts.de | www.more4fi.de

Registered office: Solingen | HRB 25917 | Wuppertal District Court
Executive Board: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
Supervisory Board: Patric Fedlmeier (Chairman) . Klaus Jäger . Jürgen Schütz

This e-mail, including any attached files, contains confidential and/or 
legally protected information. If you are not the intended recipient or have 
received this e-mail in error, please inform the sender immediately and delete 
this e-mail and any attached files without delay. Unauthorized copying, use, 
or opening of any attached files, as well as unauthorized forwarding of this 
e-mail, is not permitted.
