I ported the same code to Scala and it runs with no problems. But in PySpark, this fails
consistently:

from pyspark.sql import SQLContext

ctx = SQLContext(sc)
pf = ctx.parquetFile("...")            # read the existing parquet data
rdd = pf.map(lambda x: x)              # identity map (stand-in for the real transformer)
crdd = ctx.inferSchema(rdd)
crdd.saveAsParquetFile("...")          # every executor fails here with "Filesystem closed"

If I do

rdd = sc.parallelize(["hello", "world"])
rdd.saveAsTextFile("...")

it works.
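
I suppose a narrowing test like the following (placeholder paths, not part of the failing job) would show whether the trouble is on the parquet read side or the parquet write side -- read the same input and write it straight back out as text, skipping saveAsParquetFile entirely:

# hypothetical narrowing test, not in the original job
pf = ctx.parquetFile("...")
pf.map(lambda row: str(row)).saveAsTextFile("...")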

Ideas?
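
Re the FileSystem question below: nothing in the snippet above touches HDFS directly. As far as I know, the only way to even reach the Hadoop FileSystem from PySpark is through the py4j gateway, something like the purely hypothetical sketch below, and the job does nothing of the sort:

# hypothetical -- NOT in my code; just the kind of pattern that would explain the error.
# FileSystem.get() hands back a cached, JVM-wide instance, so closing it would break
# every later HDFS read/write in the same JVM with "Filesystem closed".
hadoop_conf = sc._jsc.hadoopConfiguration()
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
fs.close()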



> On Jul 24, 2014, at 11:05 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> 
> Most likely you are closing the connection to HDFS. Can you paste the piece 
> of code that you are executing?
> 
> We were having a similar problem when we closed the FileSystem object in our 
> code.
> 
> Thanks
> Best Regards
> 
> 
>> On Thu, Jul 24, 2014 at 11:00 PM, Eric Friedman <eric.d.fried...@gmail.com> 
>> wrote:
>> I'm trying to run a simple pipeline using PySpark, version 1.0.1
>> 
>> I've created an RDD over a parquetFile, am mapping the contents with a 
>> transformer function, and now wish to write the data out to HDFS.
>> 
>> All of the executors fail with the same stack trace (below).
>> 
>> I do get a directory on HDFS, but it's empty except for a file named 
>> _temporary.
>> 
>> Any ideas?
>> 
>> java.io.IOException (java.io.IOException: Filesystem closed)
>> org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
>> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:735)
>> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:793)
>> java.io.DataInputStream.readFully(DataInputStream.java:195)
>> java.io.DataInputStream.readFully(DataInputStream.java:169)
>> parquet.hadoop.ParquetFileReader$Chunk.<init>(ParquetFileReader.java:369)
>> parquet.hadoop.ParquetFileReader$Chunk.<init>(ParquetFileReader.java:362)
>> parquet.hadoop.ParquetFileReader.readColumnChunkPages(ParquetFileReader.java:411)
>> parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:349)
>> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:100)
>> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
>> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:966)
>> scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:293)
>> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
>> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
>> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
>> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
>> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
> 
