Re: rdd.saveAsTextFile blows up

2014-07-25 Thread Akhil Das
Most likely you are closing the connection to HDFS somewhere in your code. Can
you paste the piece of code that you are executing?

We were having a similar problem when we closed the FileSystem object in our
code.
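
For illustration, here is a minimal Scala sketch of the pattern that bit us (the
path and the exists() check are made up, not your code; the point is that
FileSystem.get returns a shared, cached handle):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)                      // shared, cached instance
val ok = fs.exists(new Path("/some/output/path"))  // hypothetical side check
fs.close()                                         // closes the cached instance for the whole JVM
// Any later HDFS read in the same JVM (e.g. the Parquet record reader) now
// fails with java.io.IOException: Filesystem closed.

If you really do need to close it, get a private handle with
FileSystem.newInstance(conf) instead of the cached one, or just leave it open.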

Thanks
Best Regards


On Thu, Jul 24, 2014 at 11:00 PM, Eric Friedman eric.d.fried...@gmail.com
wrote:

 I'm trying to run a simple pipeline using PySpark, version 1.0.1.

 I've created an RDD over a parquetFile and am mapping the contents with a
 transformer function. I now wish to write the data out to HDFS.

 All of the executors fail with the same stack trace (below).

 I do get a directory on HDFS, but it's empty except for a file named
 _temporary.

 Any ideas?

 java.io.IOException (java.io.IOException: Filesystem closed)
 org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
 org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:735)
 org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:793)
 java.io.DataInputStream.readFully(DataInputStream.java:195)
 java.io.DataInputStream.readFully(DataInputStream.java:169)
 parquet.hadoop.ParquetFileReader$Chunk.<init>(ParquetFileReader.java:369)
 parquet.hadoop.ParquetFileReader$Chunk.<init>(ParquetFileReader.java:362)
 parquet.hadoop.ParquetFileReader.readColumnChunkPages(ParquetFileReader.java:411)
 parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:349)
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:100)
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:966)
 scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:293)
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)




Re: rdd.saveAsTextFile blows up

2014-07-25 Thread Eric Friedman
I ported the same code to Scala. No problems. But in PySpark, this fails
consistently:

ctx = SQLContext(sc)
pf = ctx.parquetFile(...)
rdd = pf.map(lambda x: x)
crdd = ctx.inferSchema(rdd)
crdd.saveAsParquetFile(...)
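
For reference, a rough Scala sketch of what I mean (placeholder paths; on the
Scala side the identity map and inferSchema are not needed, since parquetFile
already returns a SchemaRDD carrying its schema):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val pf = sqlContext.parquetFile("hdfs:///path/to/input")   // placeholder path
pf.saveAsParquetFile("hdfs:///path/to/output")             // placeholder path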

If I do
rdd = sc.parallelize(["hello", "world"])
rdd.saveAsTextFile(...)

It works. 

Ideas?


