[ https://issues.apache.org/jira/browse/SPARK-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Patrick Wendell updated SPARK-1323:
-----------------------------------
    Fix Version/s: 0.9.1
                   1.0.0

> Job hangs with java.io.UTFDataFormatException when reading strings > 65536 bytes.
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-1323
>                 URL: https://issues.apache.org/jira/browse/SPARK-1323
>             Project: Apache Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 0.9.0
>            Reporter: Karthik
>              Labels: pyspark
>             Fix For: 1.0.0, 0.9.1
>
>
> Steps to reproduce in Python:
> {code:borderStyle=solid}
> st = ''.join(['1' for i in range(65537)])
> sc.parallelize([st]).saveAsTextFile("testfile")
> sc.textFile('testfile').count()
> {code}
> The last line never completes. Looking at the logs (with DEBUG enabled)
> reveals the exception; here is the stack trace:
> {code:borderStyle=solid}
> 14/03/25 15:03:34 INFO PythonRDD: stdin writer to Python finished early
> 14/03/25 15:03:34 DEBUG PythonRDD: stdin writer to Python finished early
> java.io.UTFDataFormatException: encoded string too long: 65537 bytes
>     at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
>     at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
>     at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:222)
>     at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:221)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>     at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:221)
>     at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:81)
{code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
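For context on the failure mode in the stack trace above: `java.io.DataOutputStream.writeUTF` prefixes the encoded string with a 2-byte unsigned length field, so any string whose modified-UTF-8 encoding exceeds 65535 bytes cannot be written and raises `UTFDataFormatException`. A minimal standalone sketch (not Spark code) of that same 16-bit length-field limit, using Python's `struct` module:

```python
import struct

# Same payload size as the reproduction in the issue: 65537 one-byte chars.
data = b"1" * 65537

# writeUTF stores the encoded length as a big-endian unsigned 16-bit value
# ('>H'), whose maximum is 65535 -- so 65537 cannot be represented.
try:
    struct.pack(">H", len(data))
except struct.error as e:
    print("cannot encode length %d: %s" % (len(data), e))

# A 65535-byte payload is the largest that fits in the length field.
print(struct.pack(">H", 65535))  # two bytes, both 0xFF
```

This is why the reproduction hangs only for strings longer than 65535 bytes: payloads at or below that size fit the length prefix and are written normally.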