Why not attach a bigger hard disk to the machines and point your SPARK_LOCAL_DIRS to it?
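A minimal sketch of that fix, assuming a hypothetical larger mount at
/mnt/bigdisk (in cluster deployments SPARK_LOCAL_DIRS is usually set
per-node in conf/spark-env.sh, and it overrides the spark.local.dir conf):

    import os

    # Hypothetical mount with ample space; adjust to your environment.
    SCRATCH = "/mnt/bigdisk/spark-tmp"

    # Must be set before the JVM / SparkContext starts.
    os.environ["SPARK_LOCAL_DIRS"] = SCRATCH

    from pyspark import SparkConf, SparkContext

    # spark.local.dir is the equivalent conf key for local mode
    # (ignored when the cluster manager sets SPARK_LOCAL_DIRS).
    conf = SparkConf().set("spark.local.dir", SCRATCH)
    sc = SparkContext(conf=conf)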
Thanks
Best Regards

On Sat, Aug 29, 2015 at 1:13 AM, fsacerdoti <fsacerd...@jumptrading.com> wrote:

> Hello,
>
> Similar to the thread below [1], when I tried to create an RDD from a 4GB
> pandas DataFrame I encountered the error
>
>     TypeError: cannot create an RDD from type: <type 'list'>
>
> However, looking into the code shows this is raised from a generic
> "except Exception:" handler (pyspark/sql/context.py:238 in Spark 1.4.1).
> A debugging session reveals the true error is that SPARK_LOCAL_DIRS ran
> out of space:
>
>     -> rdd = self._sc.parallelize(data)
>     (Pdb)
>     IOError: (28, 'No space left on device')
>
> In our case, creating an RDD from a large matrix (~50 million rows) is
> required. I'm a bit concerned about Spark's process here:
>
> a. turning the DataFrame into records (data.to_records)
> b. writing it to tmp
> c. reading it back again in Scala
>
> Is there a better way? The intention is to operate on slices of this
> large DataFrame using numpy operations via Spark's transformations and
> actions.
>
> Thanks,
> FDS
>
> 1. https://www.mail-archive.com/user@spark.apache.org/msg35139.html
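As for a better way: rather than funneling the whole DataFrame through the
driver with parallelize, you could write it to storage the executors can
read and load it in parallel. A rough sketch, assuming a newer stack than
the 1.4 in this thread (pandas with pyarrow for to_parquet, and a Spark 2.x
SparkSession; on 1.4 the analogous read is SQLContext.read.parquet) and a
hypothetical shared path /mnt/shared visible to all nodes:

    import numpy as np
    import pandas as pd
    from pyspark.sql import SparkSession

    # Hypothetical path readable by both the driver and all executors.
    PATH = "/mnt/shared/big_df.parquet"

    # Toy stand-in for the ~50-million-row DataFrame from the thread.
    pdf = pd.DataFrame({"x": np.arange(1_000_000),
                        "y": np.random.rand(1_000_000)})

    # Written once by the driver; no to_records() and no round trip
    # through SPARK_LOCAL_DIRS.
    pdf.to_parquet(PATH)

    spark = SparkSession.builder.getOrCreate()
    sdf = spark.read.parquet(PATH)  # executors read file splits in parallel

Each resulting partition then maps to a slice of the original data, so the
numpy work can run per slice via mapPartitions or similar.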