Re: converting DStream[String] into RDD[String] in spark streaming

2015-03-22 Thread deenar.toraskar
Sean, DStream.saveAsTextFiles internally calls foreachRDD and saveAsTextFile for each interval:

    def saveAsTextFiles(prefix: String, suffix: String = "") {
      val saveFunc = (rdd: RDD[T], time: Time) => {
        val file = rddToFileName(prefix, suffix, time)
        rdd.saveAsTextFile(file)
      }
      this.foreachRDD(saveFunc)
    }
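
The original question is answered by the same pattern: foreachRDD hands you a plain RDD[String] for each batch interval. A minimal sketch, assuming lines is an existing DStream[String] on a running StreamingContext:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time

    lines.foreachRDD { (rdd: RDD[String], time: Time) =>
      // inside the closure this is an ordinary RDD[String]
      val words = rdd.flatMap(_.split(" "))
      words.saveAsTextFile(s"words-${time.milliseconds}")
    }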

DataFrame saveAsTable - partitioned tables

2015-03-22 Thread deenar.toraskar
Hi, I wanted to store DataFrames as partitioned Hive tables. Is there a way to do this via the saveAsTable call? The set of options does not seem to be documented.

    def saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit

(Scala-specific) Creates a
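
For context, a hedged sketch of what a call to that overload looks like; df and the option keys are assumptions ("path" is understood by the built-in parquet source), and partition columns are not among the documented options here:

    import org.apache.spark.sql.SaveMode

    df.saveAsTable(
      "events",                          // Hive table name
      "parquet",                         // data source
      SaveMode.Overwrite,                // behaviour if the table already exists
      Map("path" -> "/warehouse/events"))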

HiveServer1 and SparkSQL

2014-10-07 Thread deenar.toraskar
Hi, Shark supported both the HiveServer1 and HiveServer2 thrift interfaces (using $ bin/shark -service sharkserver[1 or 2]). SparkSQL seems to support only HiveServer2. I was wondering what is involved in adding support for HiveServer1. Is this something straightforward to do that I can embark on

Re: Strategies for reading large numbers of files

2014-10-07 Thread deenar.toraskar
Hi Landon, I had a problem very similar to yours, where we had to process around 5 million relatively small files on NFS. After trying various options, we did something similar to what Matei suggested. 1) take the original path and find the subdirectories under that path and then parallelize the
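
A minimal sketch of that first step, under assumptions: sc is the SparkContext, rootPath is an NFS directory visible to every executor, and processFile is a hypothetical per-file parser:

    import java.io.File

    val subDirs = new File(rootPath).listFiles.filter(_.isDirectory).map(_.getPath)

    val results = sc.parallelize(subDirs.toSeq, subDirs.length).flatMap { dir =>
      // each task lists and parses the files of one subdirectory locally,
      // avoiding millions of tiny input splits being planned on the driver
      new File(dir).listFiles.filter(_.isFile).map(f => processFile(f.getPath))
    }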

Re: Equally weighted partitions in Spark

2014-05-15 Thread deenar.toraskar
This is my first implementation. There are a few rough edges, but when I run this I get the following exception. The class extends Partitioner, which in turn extends Serializable. Any idea what I am doing wrong?

    scala> res156.partitionBy(new EqualWeightPartitioner(1000, res156, weightFunction))
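
For comparison, a minimal sketch of a weight-aware partitioner that avoids holding the RDD inside the partitioner itself: the key-to-partition assignment (e.g. from greedy bin packing of the weights) is computed on the driver first, so only a plain Map has to be serialized. This is not the poster's EqualWeightPartitioner, just an assumed shape:

    import org.apache.spark.Partitioner

    class PrecomputedPartitioner(override val numPartitions: Int,
                                 assignment: Map[Any, Int]) extends Partitioner {
      // assignment maps each key to the partition its weight was packed into
      override def getPartition(key: Any): Int = assignment.getOrElse(key, 0)
    }

    // usage, assuming keyedRdd: RDD[(K, V)] and a precomputed assignment map:
    // keyedRdd.partitionBy(new PrecomputedPartitioner(1000, assignment))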

Equally weighted partitions in Spark

2014-05-01 Thread deenar.toraskar
Hi, I am using Spark to distribute computationally intensive tasks across the cluster. Currently I partition my RDD of tasks randomly. There is a large variation in how long each of the jobs takes to complete, leading to most partitions being processed quickly and a couple of partitions take

Spark behaviour when executor JVM crashes

2014-04-11 Thread deenar.toraskar
Hi, I am calling a C++ library from Spark using JNI. Occasionally the C++ library causes the JVM to crash. The task terminates on the MASTER, but the driver does not return. I am not sure why the driver does not terminate. I also notice that after such an occurrence, I lose some workers

Re: Running a task once on each executor

2014-03-27 Thread deenar.toraskar
Christopher, sorry, I might be missing the obvious, but how do I get my function called on all executors used by the app? I don't want to use RDDs unless necessary. Once I start my shell or app, how do I get TaskNonce.getSingleton().doThisOnce() executed on each executor? @dmpour
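
The usual workaround does still go through an RDD: spread many trivial tasks over the cluster and let the per-JVM singleton from the thread make the body run only once per executor. A best-effort sketch (there is no hard guarantee that every executor receives one of the tasks), assuming sc and the TaskNonce class discussed in the thread:

    val slots = 1000  // assumption: comfortably more tasks than total executor cores
    sc.parallelize(1 to slots, slots).foreachPartition { _ =>
      TaskNonce.getSingleton().doThisOnce()
    }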

Running a task once on each executor

2014-03-25 Thread deenar.toraskar
Hi, is there a way in Spark to run a function on each executor just once? I have a couple of use cases. a) I use an external library that is a singleton. It keeps some global state and provides some functions to manipulate it (e.g. reclaim memory, etc.). I want to check the global state of this
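
A common once-per-JVM pattern for use case a) is a lazily initialised Scala object touched from inside a task; sketched here under assumptions (NativeLib is a hypothetical stand-in for the external singleton library, rdd for any RDD being processed):

    object ExecutorOnce {
      // a Scala object is created at most once per executor JVM, so this
      // initialiser runs the first time any task on that executor touches it
      lazy val init: Unit = NativeLib.initialize()
    }

    rdd.mapPartitions { iter =>
      ExecutorOnce.init              // no-op after the first call in this JVM
      iter.map(NativeLib.process)
    }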

Re: Reload RDD saved with saveAsObjectFile

2014-03-21 Thread deenar.toraskar
Jaonary,

    val loadedData: RDD[(String, (String, Array[Byte]))] = sc.objectFile(yourObjectFileName)

Deenar
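
A small round-trip sketch for completeness (the path and sample data are assumptions): saveAsObjectFile writes Java-serialised records, and sc.objectFile[T] reads them back, so the element type has to be restated at load time:

    import org.apache.spark.rdd.RDD

    val data: RDD[(String, Array[Byte])] =
      sc.parallelize(Seq(("a", Array[Byte](1, 2)), ("b", Array[Byte](3))))
    data.saveAsObjectFile("/tmp/blobs")

    val reloaded: RDD[(String, Array[Byte])] =
      sc.objectFile[(String, Array[Byte])]("/tmp/blobs")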

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-21 Thread deenar.toraskar
Matei, it turns out that saveAsObjectFile(), saveAsSequenceFile() and saveAsHadoopFile() currently do not pick up the Hadoop settings, as Aureliano found out in this post: http://apache-spark-user-list.1001560.n3.nabble.com/Turning-kryo-on-does-not-decrease-binary-output-tp212p249.html Deenar

Re: How to save as a single file efficiently?

2014-03-21 Thread deenar.toraskar
Aureliano, apologies for hijacking this thread. Matei, on the subject of processing lots (millions) of small input files on HDFS, what are the best practices to follow in Spark? Currently my code looks something like this. Without coalesce there is one task and one output file per input file.
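
A sketch of the coalesce pattern the post alludes to (paths are assumptions): without coalesce, each small input file becomes its own task and its own output part file, while coalescing to one partition yields a single part file at the cost of a single writer task:

    val lines = sc.textFile("hdfs:///input/dir/*")
    lines.coalesce(1).saveAsTextFile("hdfs:///output/single")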