[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077899#comment-14077899 ]
Kalpit Shah commented on SPARK-1630:
------------------------------------

Here's the case that led me to file this bug and a patch: I have a custom RDD implemented in Java. It implements the compute() and partitions() APIs with semantics specific to our application. In some of our cases, a CustomRDD<String> can contain NULL values, and we had no way to access those values from Python. IMO, this patch serves two purposes:

1. If a CustomRDD<String> is implemented in Java or Scala and a user wishes to access that RDD from Python, R, or some other language, they can do so without loss of information (NULLs are preserved).
2. It preserves the cardinality and order of elements within a partition.

> PythonRDDs don't handle nulls gracefully
> ----------------------------------------
>
>                 Key: SPARK-1630
>                 URL: https://issues.apache.org/jira/browse/SPARK-1630
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 0.9.0, 0.9.1
>            Reporter: Kalpit Shah
>            Assignee: Davies Liu
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> If PythonRDDs receive a null element in iterators, they currently throw a
> NullPointerException. It would be better to log a DEBUG message and skip the
> write of NULL elements.
> Here are the 2 stack traces:
>
> 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main]
> java.lang.NullPointerException
>         at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267)
>         at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88)
> -------------------------------------------------------------------------------------
> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.writeToFile.
> : java.lang.NullPointerException
>         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273)
>         at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247)
>         at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246)
>         at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285)
>         at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280)
>         at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>         at py4j.Gateway.invoke(Gateway.java:259)
>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>         at py4j.GatewayConnection.run(GatewayConnection.java:207)
>         at java.lang.Thread.run(Thread.java:744)

--
This message was sent by Atlassian JIRA
(v6.2#6252)
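[Editor's note] The NPE above comes from writeUTF being invoked on a null element. The null-preserving approach the comment argues for can be sketched in plain Java as a length-prefixed framing that reserves a sentinel length for null. This is an illustrative sketch only: the class `NullSafeFraming`, the method names, and the `-1` sentinel are assumptions for demonstration, not Spark's actual wire protocol or API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative framing only; not Spark's actual PythonRDD protocol.
public class NullSafeFraming {
    static final int NULL_MARKER = -1; // assumed sentinel length for a null element

    // Write each element as [int length][UTF-8 bytes]; a null becomes the sentinel
    // instead of triggering a NullPointerException.
    static void writeIteratorToStream(Iterable<String> elems, DataOutputStream out)
            throws IOException {
        for (String s : elems) {
            if (s == null) {
                out.writeInt(NULL_MARKER);
            } else {
                byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
                out.writeInt(bytes.length);
                out.write(bytes);
            }
        }
    }

    // Read records back; the sentinel length is reconstructed as a null element,
    // so cardinality and order within the partition are preserved.
    static List<String> readStream(DataInputStream in, int count) throws IOException {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int len = in.readInt();
            if (len == NULL_MARKER) {
                result.add(null);
            } else {
                byte[] bytes = new byte[len];
                in.readFully(bytes);
                result.add(new String(bytes, StandardCharsets.UTF_8));
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        List<String> partition = Arrays.asList("a", null, "b");
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeIteratorToStream(partition, new DataOutputStream(buf));
        List<String> roundTripped = readStream(
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray())), 3);
        System.out.println(roundTripped); // prints [a, null, b]
    }
}
```

On the Python side, the same sentinel would deserialize to None, which is why the comment frames this as preserving information rather than skipping nulls.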