Re: IOError on createDataFrame

2015-08-31 Thread fsacerdoti
There are two issues here:

1. Suppression of the true reason for failure. The Spark runtime reports
"TypeError", but that is not why the operation failed.

2. The low performance of loading a pandas dataframe.


DISCUSSION

Number (1) is easily fixed, and it is the primary purpose of my post.
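
For concreteness, a minimal self-contained sketch of the masking pattern and
one possible fix; load() stands in for sc.parallelize(data), and this is a
paraphrase, not the actual pyspark source:

    def load(data):
        # simulate the real failure seen in the debugger
        raise IOError(28, "No space left on device")

    def create_rdd_current(data):
        try:
            return load(data)
        except Exception:
            # current behaviour: the underlying IOError is discarded entirely
            raise TypeError("cannot create an RDD from type: %s" % type(data))

    def create_rdd_proposed(data):
        try:
            return load(data)
        except Exception as e:
            # proposed: carry the original error along in the message
            # (no exception chaining in Python 2, so embed it by hand)
            raise TypeError("cannot create an RDD from type: %s (caused by %s: %s)"
                            % (type(data), type(e).__name__, e))
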
Number (2) is harder, and may lead us to abandon Spark. To answer Akhil, the
process is too slow. Yes, it will work, but with large, dense datasets, the
line

data = [r.tolist() for r in data.to_records(index=False)]

is basically a brick wall. It will take longer to load the RDD than to do
all operations on it, by a large margin.
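
One direction we are considering is to keep the rows as numpy blocks instead
of converting every row to a Python list. A minimal sketch (the helper name
and slice count are placeholders; the blocks are still pickled through the
driver, but the per-row conversion cost goes away):

    import numpy as np

    def pandas_to_block_rdd(sc, pdf, num_slices=64):
        # split the underlying 2-D array into num_slices blocks along axis 0
        blocks = np.array_split(pdf.values, num_slices)
        # each partition then holds one dense numpy block, not Python lists
        return sc.parallelize(blocks, num_slices)

    # rows, where needed, can be unpacked lazily on the executors:
    # rows_rdd = pandas_to_block_rdd(sc, df).flatMap(lambda b: (tuple(r) for r in b))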

Any help or guidance (should we write some custom loader?) would be
appreciated.

FDS






IOError on createDataFrame

2015-08-28 Thread fsacerdoti
Hello,

Similar to the thread below [1], when I tried to create an RDD from a 4GB
pandas dataframe I encountered the error

TypeError: cannot create an RDD from type: <type 'list'>

However, looking into the code shows this is raised from a generic "except
Exception:" clause (pyspark/sql/context.py:238 in spark-1.4.1). A debugging
session reveals that the true error is SPARK_LOCAL_DIRS running out of space:

-> rdd = self._sc.parallelize(data)
(Pdb)
IOError: (28, 'No space left on device')

In this case, creating an RDD from a large matrix (~50 million rows) is required
for us. I'm a bit concerned about Spark's process here:

   a. turning the dataframe into records (data.to_records)
   b. writing them to a temp file under SPARK_LOCAL_DIRS
   c. reading them back again in Scala.

Is there a better way? The intention would be to operate on slices of this
large dataframe using numpy operations via Spark's transformations and
actions.
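
For context, the per-slice work we have in mind looks roughly like this,
assuming the frame could be loaded as an RDD of 2-D numpy blocks (block_rdd
here is hypothetical):

    import numpy as np

    def summarize(block):
        # pure-numpy work on one slice of the large matrix
        return block.mean(axis=0), block.std(axis=0)

    # per_slice_stats = block_rdd.map(summarize).collect()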

Thanks,
FDS
 
1. https://www.mail-archive.com/user@spark.apache.org/msg35139.html




