[jira] [Commented] (SPARK-1630) PythonRDDs don't handle nulls gracefully

Josh Rosen (JIRA) Tue, 29 Jul 2014 00:03:07 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077452#comment-14077452
 ]


Josh Rosen commented on SPARK-1630:
-----------------------------------

We aren't passing completely arbitrary iterators of Java objects to 
{{writeIteratorToStream}}; instead, we only handle iterators of strings and 
byte arrays.  Nulls in data read from Hadoop input formats should already be 
converted to None by the Java pickling code.  Do you have an example where 
PythonRDD receives a null element and it's not due to a bug?  I'm worried that 
this patch will mask the presence of other errors.
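
To make the trade-off concrete, here is a minimal Python sketch of the two
behaviours under discussion. It is not Spark's Scala code: the function name
{{write_framed}}, the {{skip_nulls}} flag, and the 4-byte length-prefix
framing are all assumptions chosen for illustration. Strict mode fails on a
null so an upstream bug stays visible; skip mode logs at DEBUG and moves on,
which is what the issue description proposes.

```python
import io
import logging
import struct

log = logging.getLogger("null_handling_sketch")  # hypothetical logger name


def write_framed(items, stream, skip_nulls=False):
    """Write each str/bytes item as a 4-byte big-endian length plus payload.

    Hypothetical helper: only strings and byte arrays are accepted.
    Returns the number of None elements that were skipped.
    """
    skipped = 0
    for index, item in enumerate(items):
        if item is None:
            if not skip_nulls:
                # Strict mode: fail loudly so an upstream bug is not hidden.
                raise ValueError("unexpected null element at index %d" % index)
            # Skip mode: the behaviour proposed in the issue description.
            log.debug("skipping null element at index %d", index)
            skipped += 1
            continue
        if isinstance(item, str):
            payload = item.encode("utf-8")
        elif isinstance(item, (bytes, bytearray)):
            payload = bytes(item)
        else:
            raise TypeError("unsupported element type: %s" % type(item).__name__)
        stream.write(struct.pack(">i", len(payload)))
        stream.write(payload)
    return skipped


if __name__ == "__main__":
    buf = io.BytesIO()
    print(write_framed(["ab", None, b"\x01"], buf, skip_nulls=True))
```

In skip mode the caller only learns about dropped elements from the return
value or the DEBUG log, which is the masking risk raised above.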

> PythonRDDs don't handle nulls gracefully
> ----------------------------------------
>
>                 Key: SPARK-1630
>                 URL: https://issues.apache.org/jira/browse/SPARK-1630
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 0.9.0, 0.9.1
>            Reporter: Kalpit Shah
>            Assignee: Davies Liu
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> If a PythonRDD receives a null element in an iterator, it currently throws a 
> NullPointerException. It would be better to log a DEBUG message and skip 
> writing null elements.
> Here are the two stack traces:
> 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main]
> java.lang.NullPointerException
>   at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267)
>   at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88)
> -------------------------------------------------------------------------------------
> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.writeToFile.
> : java.lang.NullPointerException
>   at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273)
>   at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247)
>   at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246)
>   at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285)
>   at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280)
>   at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
