[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077452#comment-14077452 ]
Josh Rosen commented on SPARK-1630:
-----------------------------------

We aren't passing completely arbitrary iterators of Java objects to {{writeIteratorToStream}}; instead, we only handle iterators of strings and byte arrays. Nulls in data read from Hadoop input formats should already be converted to None by the Java pickling code.

Do you have an example where PythonRDD receives a null element and it's not due to a bug? I'm worried that this patch will mask the presence of other errors.

> PythonRDDs don't handle nulls gracefully
> ----------------------------------------
>
>                 Key: SPARK-1630
>                 URL: https://issues.apache.org/jira/browse/SPARK-1630
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>   Affects Versions: 0.9.0, 0.9.1
>            Reporter: Kalpit Shah
>            Assignee: Davies Liu
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> If PythonRDDs receive a null element in iterators, they currently NPE. It
> would be better to log a DEBUG message and skip the write of NULL elements.
> Here are the 2 stack traces:
>
> 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main]
> java.lang.NullPointerException
>   at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267)
>   at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88)
> -------------------------------------------------------------------------------------
> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.writeToFile.
> : java.lang.NullPointerException
>   at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273)
>   at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247)
>   at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246)
>   at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285)
>   at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280)
>   at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:744)

--
This message was sent by Atlassian JIRA
(v6.2#6252)
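
To make the trade-off in this thread concrete, here is a minimal Python sketch of the two policies under discussion: skipping null elements with a DEBUG message (the reporter's proposal) versus failing fast so an upstream bug stays visible (the concern raised in the comment). This is not Spark's code. The function name `write_framed`, the `skip_nulls` flag, and the 4-byte length-prefixed framing are all assumptions made for illustration only.

```python
import io
import logging
import struct

log = logging.getLogger("null_handling_sketch")  # hypothetical logger name


def write_framed(items, out, skip_nulls=False):
    """Write strings / byte strings to `out` as length-prefixed frames.

    Hypothetical helper, not part of Spark. Each element is written as a
    4-byte big-endian length followed by the payload bytes.
    Returns the number of elements actually written.
    """
    written = 0
    for index, item in enumerate(items):
        if item is None:
            if skip_nulls:
                # Policy A (reporter's proposal): log at DEBUG and skip.
                log.debug("skipping null element at index %d", index)
                continue
            # Policy B (fail fast): surface the null with a clear message
            # instead of an opaque NullPointerException-style failure.
            raise ValueError(f"unexpected null element at index {index}")
        if isinstance(item, str):
            data = item.encode("utf-8")
        elif isinstance(item, (bytes, bytearray)):
            data = bytes(item)
        else:
            raise TypeError(f"unsupported element type: {type(item).__name__}")
        out.write(struct.pack(">i", len(data)))
        out.write(data)
        written += 1
    return written


# Usage: one null among two valid elements, skipped under policy A.
buf = io.BytesIO()
count = write_framed(["a", None, b"bc"], buf, skip_nulls=True)
print(count, buf.getvalue())
```

With `skip_nulls=False` the same input raises immediately and names the offending index, which keeps an unexpected null visible rather than silently dropping data.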