Split RDD along columns

2015-01-29 Thread Schein, Sagi
Hi, I have the following use case: assuming my data lives in, e.g., HDFS as a single sequence file containing rows of CSV entries, I can split each row and build an RDD of arrays of (smaller) strings. What I want to do is build two RDDs, where the first RDD contains a subset of columns and
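A minimal PySpark sketch of one way to do this (the HDFS path and column indices below are hypothetical, and textFile is used in place of the sequence-file input for simplicity):

    from pyspark import SparkContext

    sc = SparkContext(appName="split-columns")

    # Parse each CSV row into an array of fields; cache so both projections
    # below reuse the parsed rows instead of re-reading the file.
    rows = sc.textFile("hdfs:///user/example/input.csv") \
             .map(lambda line: line.split(",")) \
             .cache()

    left_cols = [0, 1, 2]   # assumed column indices for the first RDD
    right_cols = [3, 4]     # assumed column indices for the second RDD

    left_rdd = rows.map(lambda fields: [fields[i] for i in left_cols])
    right_rdd = rows.map(lambda fields: [fields[i] for i in right_cols])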

python worker crash in spark 1.0

2014-06-19 Thread Schein, Sagi
Hi, I am trying to upgrade from Spark v0.9.1 to v1.0.0 and running into some weird behavior. When, in pyspark, I invoke sc.textFile(hdfs://hadoop-ha01:/user/x/events_2.1).take(1), the call crashes with the stack trace below. The file resides in Hadoop 2.2; it is a large set of event data,
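For reference, this is roughly how the reported call looks when typed into the PySpark shell (sc is the context provided by the shell; the path is taken verbatim from the report):

    # PySpark shell session; the crash reportedly occurs while evaluating take(1),
    # which is the first action that actually reads data from HDFS.
    rdd = sc.textFile("hdfs://hadoop-ha01:/user/x/events_2.1")
    first = rdd.take(1)
    print(first)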

moving SparkContext around

2014-04-13 Thread Schein, Sagi
A few questions about the resilience of the client side of Spark. What would happen if the client process crashes; can it reconstruct its state? Suppose I just want to serialize it and reload it later; is this possible? A more advanced use case: is there a way to move a SparkContext between
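As a hedged sketch only: a SparkContext itself is not serializable, so a common pattern is to persist intermediate results to reliable storage and rebuild the context in a new client process (the paths and app names below are hypothetical):

    from pyspark import SparkContext

    # First client process: compute and persist results to reliable storage.
    sc = SparkContext(appName="client-a")
    result = sc.textFile("hdfs:///user/example/input").map(lambda s: s.upper())
    result.saveAsTextFile("hdfs:///user/example/saved-result")
    sc.stop()

    # A later (or replacement) client process creates a fresh SparkContext
    # and reloads the saved data rather than the context object itself.
    sc2 = SparkContext(appName="client-b")
    restored = sc2.textFile("hdfs:///user/example/saved-result")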