[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077492#comment-14077492 ]
Josh Rosen commented on SPARK-1630:
-----------------------------------

In the current Spark codebase, the PythonRDD constructor is only called from Python code. For a user-created Scala/Java RDD<String> to be passed to PySpark, the user would need to have dug deep into PySpark's private APIs, calling out to Java/Scala code to create a transformed RDD and wrap it in a PythonRDD. Given this, is it fair to say that any in-the-wild NPEs encountered here through Spark's public APIs are due to bugs in Spark/PySpark, or is there a case that I'm overlooking (e.g. is TextInputFormat allowed to return nulls)?

> PythonRDDs don't handle nulls gracefully
> ----------------------------------------
>
>                 Key: SPARK-1630
>                 URL: https://issues.apache.org/jira/browse/SPARK-1630
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 0.9.0, 0.9.1
>            Reporter: Kalpit Shah
>            Assignee: Davies Liu
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> If PythonRDDs receive a null element in an iterator, they currently throw a
> NullPointerException. It would be better to log a DEBUG message and skip the
> write of null elements.
> Here are the 2 stack traces:
> 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main]
> java.lang.NullPointerException
> 	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267)
> 	at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88)
> -------------------------------------------------------------------------------------
> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.writeToFile.
> : java.lang.NullPointerException
> 	at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273)
> 	at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247)
> 	at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246)
> 	at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285)
> 	at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280)
> 	at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> 	at py4j.Gateway.invoke(Gateway.java:259)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:207)
> 	at java.lang.Thread.run(Thread.java:744)

--
This message was sent by Atlassian JIRA
(v6.2#6252)
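The fix the report proposes (log at DEBUG and skip null elements rather than NPE) is small. Below is a minimal, hypothetical Java sketch of such a null-skipping write loop; the class and method names are illustrative, not Spark's actual code, and the length-prefixed framing is an assumption modeled on what writeUTF produces:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Iterator;

public class NullSafeWriter {
    // Hypothetical sketch of the proposed behavior: write each element as
    // a length-prefixed UTF-8 record, but skip nulls instead of
    // dereferencing them. Returns the number of elements actually written.
    static int writeIteratorToStream(Iterator<String> iter, DataOutputStream out)
            throws IOException {
        int written = 0;
        while (iter.hasNext()) {
            String s = iter.next();
            if (s == null) {
                // Proposed SPARK-1630 behavior: log at DEBUG and skip.
                System.err.println("DEBUG: skipping null element");
                continue;
            }
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length); // 4-byte length prefix
            out.write(bytes);           // raw UTF-8 payload
            written++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        int n = writeIteratorToStream(
                Arrays.asList("a", null, "bc").iterator(), out);
        // Two non-null elements survive: (4 + 1) + (4 + 2) = 11 bytes.
        System.out.println(n + " " + buf.size());
    }
}
```

A follow-up question for any real patch is whether silently dropping elements is acceptable, since the Python side would then see fewer records than the JVM iterator produced; that is exactly the design point the comment above is probing.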