Hi,
I have the following use case: assume my data lives in, e.g., HDFS as a
single sequence file containing rows of CSV entries, which I can split to
build an RDD of arrays of (smaller) strings.
What I want to do is build two RDDs, where the first RDD contains a subset of
the columns and
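A minimal sketch of one way to set this up in pyspark (the path, column
indices, and variable names are assumptions; the rows are read with textFile
for simplicity, whereas an actual sequence file would go through
sc.sequenceFile): parse each row into fields once, cache the parsed RDD, and
project each column subset out of it.

# Hedged sketch: split each CSV row into fields once, then derive two
# RDDs, each projecting a different subset of the columns.
from pyspark import SparkContext

sc = SparkContext(appName="column-subsets")

rows = sc.textFile("hdfs:///path/to/events.csv") \
         .map(lambda line: line.split(","))
rows.cache()  # the file is read and split only once for both projections

first = rows.map(lambda fields: fields[0:3])  # first RDD: columns 0-2
second = rows.map(lambda fields: fields[3:])  # second RDD: the remaining columns

Caching the parsed rows means both derived RDDs share one pass over the
input instead of re-reading and re-splitting the file.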
Hi,
I am trying to upgrade from Spark v0.9.1 to v1.0.0 and am running into some
weird behavior.
When, in pyspark, I invoke
sc.textFile("hdfs://hadoop-ha01:/user/x/events_2.1").take(1), the
call crashes with the stack trace below.
The file resides in Hadoop 2.2; it is a large event data file,
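For reference, here is the failing call as a complete pyspark snippet (the
quoting of the path is an assumption; an unquoted URI would not parse in
Python):

# Minimal standalone form of the call that crashes under Spark 1.0.0.
from pyspark import SparkContext

sc = SparkContext(appName="textfile-take")
first_record = sc.textFile("hdfs://hadoop-ha01:/user/x/events_2.1").take(1)
print(first_record)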
A few questions about the resilience of the client side of Spark:
What would happen if the client process crashes? Can it reconstruct its state?
Suppose I just want to serialize it and reload it later; is this possible?
A more advanced use case: is there a way to move a SparkContext between
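On the serialize-and-reload part, a hedged sketch of the closest thing the
stock API supports: a SparkContext itself is not serializable, but an
application can persist its computed RDDs and have a restarted client
rebuild from them (the paths and names below are hypothetical).

# Sketch: persist computed data so a restarted client can reload it.
# The SparkContext object cannot be serialized; this saves the RDD
# contents instead.
from pyspark import SparkContext

sc = SparkContext(appName="before-crash")
squares = sc.parallelize(range(1000)).map(lambda x: x * x)
squares.saveAsTextFile("hdfs:///tmp/squares")  # durable copy of the state
sc.stop()  # the client process goes away here

# In a new client process, rebuild the RDD from the saved copy:
sc2 = SparkContext(appName="after-restart")
reloaded = sc2.textFile("hdfs:///tmp/squares").map(int)
print(reloaded.count())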