Hello,
Similar to the thread below [1], when I tried to create an RDD from a 4 GB
pandas DataFrame, I encountered this error:

TypeError: cannot create an RDD from type: <type 'list'>
However, looking into the code shows this is raised from a generic "except
Exception:" clause (pyspark/sql/context.py:238 in spark-1.4.1). A debugging
session reveals the real error is that SPARK_LOCAL_DIRS ran out of space:
-> rdd = self._sc.parallelize(data)
(Pdb)
IOError: (28, 'No space left on device')
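
For reference, the handler in question looks roughly like this (paraphrased
from spark-1.4.1; the comment is mine):

    # Any failure inside parallelize() -- including the IOError from a full
    # SPARK_LOCAL_DIRS -- is swallowed and re-raised as the misleading
    # TypeError above.
    if not isinstance(data, RDD):
        try:
            rdd = self._sc.parallelize(data)
        except Exception:
            raise TypeError("cannot create an RDD from type: %s" % type(data))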
In this case, creating an RDD from a large matrix (~50 million rows) is a
requirement for us. I'm a bit concerned about Spark's process here, which is
roughly (see the paraphrase below):
a. turning the DataFrame into records (data.to_records),
b. writing them to a temp file, and
c. reading the file back again in Scala.
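
If I'm reading pyspark/context.py correctly, parallelize does roughly the
following (paraphrased, with error handling and cleanup elided):

    # The serialized batch goes through a temp file under spark.local.dir /
    # SPARK_LOCAL_DIRS, hence the ENOSPC on a ~4 GB input.
    tempFile = NamedTemporaryFile(delete=False, dir=self._temp_dir)
    serializer.dump_stream(data, tempFile)          # step (b): write to tmp
    tempFile.close()
    jrdd = self._jvm.PythonRDD.readRDDFromFile(     # step (c): read it back
        self._jsc, tempFile.name, numSlices)        # on the JVM side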
Is there a better way? The intention is to operate on slices of this large
DataFrame using NumPy operations via Spark's transformations and actions,
e.g. along the lines of the sketch below.
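
One workaround we are considering (a rough, untested sketch; the path, the
slice size, and my_numpy_op are placeholders, not established names): save
the frame once to storage every executor can read, then parallelize only the
(start, stop) row ranges and let each task memory-map its own slice:

    import numpy as np

    path = "/shared/df.npy"            # placeholder: a shared filesystem
    np.save(path, df.to_records(index=False))

    step = 500000                      # slice size, illustrative
    spans = [(i, min(i + step, len(df))) for i in range(0, len(df), step)]

    def process(span):
        start, stop = span
        recs = np.load(path, mmap_mode="r")   # mmap: only this slice is read
        return my_numpy_op(recs[start:stop])  # my_numpy_op is hypothetical

    results = sc.parallelize(spans).map(process).collect()

This would keep the driver from pushing all ~50 million rows through a
single temp file, at the cost of requiring shared storage.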
Thanks,
FDS
[1] https://www.mail-archive.com/[email protected]/msg35139.html