[ https://issues.apache.org/jira/browse/SPARK-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen updated SPARK-5063:
------------------------------
    Target Version/s: 1.2.1

> Display more helpful error messages for several invalid operations
> ------------------------------------------------------------------
>
>                 Key: SPARK-5063
>                 URL: https://issues.apache.org/jira/browse/SPARK-5063
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>
> Spark does not support nested RDDs or performing Spark actions inside of
> transformations; this usually leads to NullPointerExceptions (see SPARK-718
> for one example). The confusing NPE is one of the most common sources of
> Spark questions on Stack Overflow:
> - https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534
> - https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399
> - https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674
> (those are just a sample of the ones that I've answered personally; there
> are many others).
> I think we can detect these errors by adding logic to {{RDD}} that checks
> whether {{sc}} is null (e.g. by turning {{sc}} into a getter function); we
> can use this to raise a better error message (a rough sketch appears at the
> end of this description).
> In PySpark, these errors manifest themselves slightly differently:
> attempting to nest RDDs or perform actions inside of transformations
> results in pickle-time errors. For example,
> {code}
> rdd1 = sc.parallelize(range(100))
> rdd2 = sc.parallelize(range(100))
> rdd1.mapPartitions(lambda x: [rdd2.map(lambda x: x)])
> {code}
> produces
> {code}
> [...]
>   File "/Users/joshrosen/anaconda/lib/python2.7/pickle.py", line 306, in save
>     rv = reduce(self.proto)
>   File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 304, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling o21.__getnewargs__. Trace:
> py4j.Py4JException: Method __getnewargs__([]) does not exist
> 	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
> 	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
> 	at py4j.Gateway.invoke(Gateway.java:252)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:207)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> We get the same error when attempting to broadcast an RDD in PySpark. For
> Python, improved error reporting could be as simple as overriding the
> {{__getnewargs__}} method to raise an {{UnsupportedOperation}} exception
> with a more helpful error message.
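> As a minimal sketch of what that override might look like on PySpark's
> {{RDD}} (the exception type and message wording here are illustrative,
> not final):
> {code}
> class RDD(object):
>     # ... existing RDD methods ...
>
>     def __getnewargs__(self):
>         # Pickle only reaches this method when an RDD is being serialized,
>         # i.e. it was captured inside a closure or passed to broadcast(),
>         # which Spark does not support. Fail loudly with an explanation
>         # instead of surfacing the opaque Py4J error shown above.
>         raise Exception(
>             "It appears that you are attempting to broadcast an RDD or "
>             "reference an RDD from an action or transformation. RDD "
>             "transformations and actions can only be invoked by the "
>             "driver, not inside of other transformations; see SPARK-5063.")
> {code}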
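> Similarly, here is a rough illustration of the getter idea from above,
> written in Python for consistency with the other snippets here; the real
> check would live in the Scala {{RDD}} class, and the attribute names are
> hypothetical:
> {code}
> class RDD(object):
>     def __init__(self, ctx):
>         # Store the context under a private name so that every access
>         # goes through the checked property below.
>         self._ctx = ctx
>
>     @property
>     def ctx(self):
>         # A missing context means this RDD was deserialized inside a task,
>         # i.e. the user nested RDDs or invoked an action within a
>         # transformation, so raise a descriptive error up front instead of
>         # letting a NullPointerException surface later.
>         if self._ctx is None:
>             raise Exception(
>                 "This RDD lacks a SparkContext. It could be that RDD "
>                 "transformations were nested, or an action was invoked "
>                 "inside a transformation; see SPARK-5063.")
>         return self._ctx
> {code}
> With a guard like this, the failing example above would report the
> descriptive message instead of a confusing NPE.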