[jira] [Commented] (SPARK-1630) PythonRDDs don't handle nulls gracefully
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270097#comment-14270097 ]

Davies Liu commented on SPARK-1630:
-----------------------------------

We hit this issue with the Kafka Python API; it will be fixed in https://github.com/apache/spark/pull/3715

> PythonRDDs don't handle nulls gracefully
> ----------------------------------------
>
>                 Key: SPARK-1630
>                 URL: https://issues.apache.org/jira/browse/SPARK-1630
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 0.9.0, 0.9.1
>            Reporter: Kalpit Shah
>            Assignee: Davies Liu
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> If PythonRDDs receive a null element in iterators, they currently NPE. It
> would be better to log a DEBUG message and skip the write of NULL elements.
> Here are the 2 stack traces:
>
> 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main]
> java.lang.NullPointerException
>         at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267)
>         at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88)
>
> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.writeToFile.
> : java.lang.NullPointerException
>         at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273)
>         at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247)
>         at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246)
>         at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285)
>         at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280)
>         at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>         at py4j.Gateway.invoke(Gateway.java:259)
>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>         at py4j.GatewayConnection.run(GatewayConnection.java:207)
>         at java.lang.Thread.run(Thread.java:744)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
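The skip-and-log behavior the description asks for can be sketched in plain Python. This is a minimal simulation, not Spark's code: the 4-byte big-endian length prefix mirrors what PythonRDD.writeUTF writes, but the function names here are illustrative.

```python
import io
import logging
import struct

log = logging.getLogger("writeIteratorToStream")

def write_utf(s: str, stream) -> None:
    # Mirror PythonRDD.writeUTF: a 4-byte big-endian length prefix
    # followed by the UTF-8 bytes of the string.
    data = s.encode("utf-8")
    stream.write(struct.pack(">i", len(data)))
    stream.write(data)

def write_iterator_to_stream(it, stream) -> int:
    """Write each string element; skip nulls (None) with a DEBUG log
    instead of raising, as the issue proposes. Returns elements written."""
    written = 0
    for elem in it:
        if elem is None:
            log.debug("skipping null element")
            continue
        write_utf(elem, stream)
        written += 1
    return written

buf = io.BytesIO()
n = write_iterator_to_stream(["a", None, "bc"], buf)
# Two elements written: (4 + 1) + (4 + 2) = 11 bytes in the buffer.
```

With this change the None element above is silently dropped instead of triggering the NullPointerException in the stack traces.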
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077945#comment-14077945 ]

Josh Rosen commented on SPARK-1630:
-----------------------------------

Hi Kalpit,

Thanks for sharing your use case; it seems like a reasonable thing that we should support. As part of [~mlnick]'s patch for [SPARK-1416], we now have {{SerDeUtil.rddToPython}} for converting pair RDDs of arbitrary Java objects into RDDs that can be read by PySpark. One alternative to the fix proposed here would be to add a similar converter from non-pair RDDs to PythonRDDs that uses the Java-side pickling library to pickle the strings and nulls. However, this could have a negative performance impact, since we'd be passing pickled objects instead of UTF-8 strings. Given that the currently proposed fix only affects RDD and seems unlikely to mask serious bugs, I'm inclined to merge it.
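The trade-off Josh mentions (pickled objects versus raw UTF-8 strings) can be made concrete with a small sketch. This is a plain-Python illustration of the two wire formats, not Spark's actual serialization path; the record contents are made up.

```python
import pickle
import struct

records = [f"record-{i}" for i in range(1000)]

# Today's path: raw length-prefixed UTF-8 strings, which are cheap for the
# Python worker to decode.
utf8_payload = b"".join(
    struct.pack(">i", len(b)) + b
    for b in (s.encode("utf-8") for s in records)
)

# The alternative path: pickling on the JVM side means the Python worker
# receives pickled objects, which it must unpickle element by element.
pickled_payload = pickle.dumps(records)

# Pickle round-trips arbitrary objects (including None), which is what
# makes it attractive for null handling despite the extra cost.
restored = pickle.loads(pickled_payload)
```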
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077899#comment-14077899 ]

Kalpit Shah commented on SPARK-1630:
------------------------------------

Here's the case that led me to file this bug and submit a patch: I have a custom RDD which is implemented in Java. It implements the compute() and partitions() APIs with semantics specific to our application. In some of our cases, the CustomRDD can contain NULL values, and we had no way to access them in Python. IMO, this patch serves two purposes:
1. If a CustomRDD is implemented in Java or Scala and a user wishes to access it from Python, R, or some other language, they will be able to do so without loss of information (NULLs preserved).
2. It preserves the cardinality and order of elements within a partition.
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077492#comment-14077492 ]

Josh Rosen commented on SPARK-1630:
-----------------------------------

In the current Spark codebase, the PythonRDD constructor is only called from Python code. For a user-created Scala/Java RDD to be passed to PySpark, the user would need to have dug deep into PySpark's private APIs to call out to Java/Scala code to create a transformed RDD and wrap it in a PythonRDD. Given this, is it fair to say that any in-the-wild NPEs encountered here through Spark's public APIs are due to bugs in Spark/PySpark, or is there a case that I'm overlooking (e.g. is TextInputFormat allowed to return nulls)?
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077465#comment-14077465 ]

Davies Liu commented on SPARK-1630:
-----------------------------------

If an RDD is generated in Scala/Java by user code, such as rdd.map(user_func), it's possible for it to contain nulls (depending on corner cases), which will then cause an NPE. Given an RDD[String], it's valid for some rows to be null, so it's better to handle them gracefully. This issue cannot be reproduced in pure Python code.
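A plain-Python analogue of the scenario Davies describes: a user map function that legitimately yields null for corner-case inputs. The log-parsing function and sample lines are invented for illustration.

```python
import re

def extract_host(log_line):
    """A user transformation akin to rdd.map(user_func) on the JVM side:
    for a malformed line there is no host to extract, so the function
    returns None (null), by design rather than by bug."""
    m = re.match(r"(\S+) ", log_line)
    return m.group(1) if m else None

lines = ["10.0.0.1 GET /", "malformed", "10.0.0.2 GET /a"]
hosts = [extract_host(l) for l in lines]
# hosts now contains a None between two valid host strings; feeding such a
# collection to a writer that assumes non-null strings is what triggers the
# NPE on the JVM side.
```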
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077452#comment-14077452 ]

Josh Rosen commented on SPARK-1630:
-----------------------------------

We aren't passing completely arbitrary iterators of Java objects to {{writeIteratorToStream}}; instead, we only handle iterators of strings and byte arrays. Nulls in data read from Hadoop input formats should already be converted to None by the Java pickling code. Do you have an example where PythonRDD receives a null element and it's not due to a bug? I'm worried that this patch would mask the presence of other errors.
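The dispatch Josh describes can be sketched as follows. The shape of the real Scala {{writeIteratorToStream}} is an assumption here (branching on the first element's type), and Python's AttributeError on None plays the role of the JVM-side NullPointerException.

```python
import io
import struct

def write_iterator_to_stream(it, stream):
    """Sketch: the writer only handles iterators of strings or of byte
    arrays, choosing the branch from the first element's type."""
    items = list(it)
    if not items:
        return
    head = items[0]
    if isinstance(head, str):
        for s in items:
            # A None element here blows up on .encode, analogous to the NPE
            # in PythonRDD.writeUTF.
            data = s.encode("utf-8")
            stream.write(struct.pack(">i", len(data)))
            stream.write(data)
    elif isinstance(head, (bytes, bytearray)):
        for b in items:
            stream.write(struct.pack(">i", len(b)))
            stream.write(bytes(b))
    else:
        raise TypeError(f"unexpected element type: {type(head).__name__}")

# Byte-array iterators are handled fine...
buf_ok = io.BytesIO()
write_iterator_to_stream([b"\x00\x01", b"ab"], buf_ok)

# ...but a null element in a string iterator crashes the writer.
buf_bad = io.BytesIO()
try:
    write_iterator_to_stream(["ok", None], buf_bad)
    crashed = False
except AttributeError:
    crashed = True
```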
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072235#comment-14072235 ]

Apache Spark commented on SPARK-1630:
-------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/1551
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067729#comment-14067729 ]

Apache Spark commented on SPARK-1630:
-------------------------------------

User 'kalpit' has created a pull request for this issue:
https://github.com/apache/spark/pull/554
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981458#comment-13981458 ]

Kalpit Shah commented on SPARK-1630:
------------------------------------

https://github.com/apache/spark/pull/554