RDD saveAsTextFile() to local disk

2015-07-08 Thread Vijay Pawnarkar
Getting an exception when writing an RDD to local disk using the following call:

 saveAsTextFile(file:home/someuser/dir2/testupload/20150708/)

The dir (/home/someuser/dir2/testupload/) was created before running the
job. The error message is misleading.


org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
0.0 (TID 6, xxx.yyy.com): org.apache.hadoop.fs.ParentNotDirectoryException:
Parent path is not a directory: file:/home/someuser/dir2
        at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:418)
        at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
        at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
        at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
        at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
        at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
        at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:588)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:439)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:426)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:799)
        at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
        at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1060)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1051)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
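
For reference, a minimal sketch of the write path described above, assuming the Spark Java API; the sample data and class name are placeholders, and only the output path is taken from the report. Two things worth noting: ParentNotDirectoryException from RawLocalFileSystem.mkdirs is thrown when an ancestor of the output path exists as a regular file rather than a directory, and with a file: URI that check runs on whichever node performs the write (the lost task here ran on xxx.yyy.com), not necessarily the machine where the directory was pre-created.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SaveToLocalDisk {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("save-to-local-disk");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Placeholder data standing in for the real RDD from the report.
            JavaRDD<String> lines = sc.parallelize(Arrays.asList("a", "b", "c"));

            // Fully qualified file: URI. The final directory (20150708) must not
            // exist yet, but every ancestor must already exist as a directory on
            // the node doing the write; if any ancestor exists there as a regular
            // file, RawLocalFileSystem.mkdirs fails with ParentNotDirectoryException.
            lines.saveAsTextFile("file:///home/someuser/dir2/testupload/20150708/");

            sc.stop();
        }
    }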

-- 
-Vijay


Re: Cassandra number of Tasks

2015-05-12 Thread Vijay Pawnarkar
Thanks! We can roughly approximate the number of rows returned by where(), and
therefore the number of partitions, so the repartition approach will work.
If .where() had returned a widely varying number of rows, we would not have
been able to approximate the number of partitions, and that would have caused
inefficiencies.
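
As a concrete illustration of that repartition approach, here is a rough sketch against the connector's Java API. The simplified MyClass bean, the hard-coded keyspace/table strings, the output table name, and the choice of 10 partitions are assumptions for illustration, not details from the original job.

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

    import java.io.Serializable;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RepartitionBeforeWrite {

        // Simplified stand-in for the MyClass bean referenced in the thread.
        public static class MyClass implements Serializable {
            private Integer id;
            public Integer getId() { return id; }
            public void setId(Integer id) { this.id = id; }
        }

        public static void run(JavaSparkContext sparkContext,
                               String cqlDataFilter, Object... cqlFilterParams) {
            // Server-side filter with .where(); the splits are still planned from the
            // whole table, so the result comes back in many small partitions.
            JavaRDD<MyClass> filtered = javaFunctions(sparkContext)
                    .cassandraTable("key_space", "my_table", mapRowTo(MyClass.class))
                    .where(cqlDataFilter, cqlFilterParams);

            // Collapse the ~3,000 mostly-empty partitions into 10 before further work.
            // coalesce(10) avoids the shuffle if a full repartition is not needed.
            JavaRDD<MyClass> tenPartitions = filtered.repartition(10);

            // Write the 10 partitions back to Cassandra ("my_output_table" is assumed).
            javaFunctions(tenPartitions)
                    .writerBuilder("key_space", "my_output_table", mapToRow(MyClass.class))
                    .saveToCassandra();
        }
    }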

On Mon, May 11, 2015 at 4:50 AM, ayan guha guha.a...@gmail.com wrote:

 Hi

 I think pushing the filter up would be best. Essentially, I would suggest
 having smallish partitions and filtering the data, then repartitioning the
 10k records with numPartitions=10 and writing to Cassandra.

 Best
 Ayan

 On Mon, May 11, 2015 at 5:03 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Did you try repartitioning? You might end up spending a lot of time on GC,
 though.

 Thanks
 Best Regards

 On Fri, May 8, 2015 at 11:59 PM, Vijay Pawnarkar 
 vijaypawnar...@gmail.com wrote:

 I am using the Spark Cassandra connector to work with a table with 3
 million records, using the .where() API to work with only certain rows in
 this table. The where clause filters the data down to 10,000 rows.

 CassandraJavaUtil.javaFunctions(sparkContext)
     .cassandraTable(KEY_SPACE, MY_TABLE, CassandraJavaUtil.mapRowTo(MyClass.class))
     .where(cqlDataFilter, cqlFilterParams)


 We are also using the parameter spark.cassandra.input.split.size=1000.

 As this job is processed by the Spark cluster, it creates 3,000 partitions
 instead of 10, and 3,000 tasks are executed on the cluster. As the data in
 our table grows to 30 million rows, this will create 30,000 tasks instead
 of 10.

 Is there a better way to process these 10,000 records with 10 tasks?

 Thanks!





 --
 Best Regards,
 Ayan Guha




-- 
-Vijay


Cassandra number of Tasks

2015-05-08 Thread Vijay Pawnarkar
I am using the Spark Cassandra connector to work with a table with 3
million records, using the .where() API to work with only certain rows in
this table. The where clause filters the data down to 10,000 rows.

CassandraJavaUtil.javaFunctions(sparkContext)
    .cassandraTable(KEY_SPACE, MY_TABLE, CassandraJavaUtil.mapRowTo(MyClass.class))
    .where(cqlDataFilter, cqlFilterParams)


We are also using the parameter spark.cassandra.input.split.size=1000.

As this job is processed by the Spark cluster, it creates 3,000 partitions
instead of 10, and 3,000 tasks are executed on the cluster. As the data in
our table grows to 30 million rows, this will create 30,000 tasks instead
of 10.

Is there a better way to process these 10,000 records with 10 tasks?

Thanks!
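
One knob worth mentioning alongside the repartition approach settled on in the replies above: as the thread itself observes, the partition count tracks the total table size (3 million rows / split.size 1000 ≈ 3,000 tasks) rather than the rows matching .where(), so raising the split size is another way to get fewer tasks. A sketch of that configuration, assuming a connector 1.2-era property name; the host is a placeholder and 300000 is simply 3,000,000 / 10 for illustration, not a tuned value.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SplitSizeConfig {
        public static void main(String[] args) {
            // Roughly (estimated table rows / input.split.size) partitions are planned
            // from the whole table, so with 3M rows and split.size=1000 the job runs
            // ~3,000 tasks no matter what .where() returns. Raising the split size
            // reduces the task count without a repartition step.
            SparkConf conf = new SparkConf()
                    .setAppName("cassandra-split-size")
                    .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
                    .set("spark.cassandra.input.split.size", "300000");

            JavaSparkContext sparkContext = new JavaSparkContext(conf);
            // ... build the cassandraTable(...).where(...) RDD as in the snippet above ...
            sparkContext.stop();
        }
    }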