[ https://issues.apache.org/jira/browse/SPARK-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen updated SPARK-5063:
------------------------------
    Target Version/s: 1.2.1

> Display more helpful error messages for several invalid operations
> ------------------------------------------------------------------
>
>                 Key: SPARK-5063
>                 URL: https://issues.apache.org/jira/browse/SPARK-5063
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>
> Spark does not support nested RDDs or performing Spark actions inside of
> transformations; this usually leads to NullPointerExceptions (see SPARK-718
> for one example). The confusing NPE is one of the most common sources of
> Spark questions on Stack Overflow:
> - https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534
> - https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399
> - https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674
> (those are just a sample of the ones that I've answered personally; there
> are many others).
> I think we can detect these errors by adding logic to {{RDD}} that checks
> whether {{sc}} is null (e.g. by turning {{sc}} into a getter function); we
> can use this to raise a better error message (a rough sketch appears at the
> end of this description).
> In PySpark, these errors manifest themselves slightly differently:
> attempting to nest RDDs or perform actions inside of transformations
> results in pickle-time errors. For example,
> {code}
> rdd1 = sc.parallelize(range(100))
> rdd2 = sc.parallelize(range(100))
> rdd1.mapPartitions(lambda x: [rdd2.map(lambda x: x)])
> {code}
> produces
> {code}
> [...]
>   File "/Users/joshrosen/anaconda/lib/python2.7/pickle.py", line 306, in save
>     rv = reduce(self.proto)
>   File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 304, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling o21.__getnewargs__. Trace:
> py4j.Py4JException: Method __getnewargs__([]) does not exist
> 	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
> 	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
> 	at py4j.Gateway.invoke(Gateway.java:252)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:207)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> We get the same error when attempting to broadcast an RDD in PySpark. For
> Python, improved error reporting could be as simple as overriding the
> {{__getnewargs__}} method to raise an {{UnsupportedOperation}} exception
> with a more helpful error message.
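> As a minimal sketch of what that override might look like on PySpark's
> {{RDD}} (the exception type and message wording here are illustrative,
> not final):
> {code}
> class RDD(object):
>     # ... existing RDD methods ...
>
>     def __getnewargs__(self):
>         # Pickle only reaches this method when an RDD is being serialized,
>         # i.e. it was captured inside a closure or passed to broadcast(),
>         # which Spark does not support. Fail loudly with an explanation
>         # instead of surfacing the opaque Py4J error shown above.
>         raise Exception(
>             "It appears that you are attempting to broadcast an RDD or "
>             "reference an RDD from an action or transformation. RDD "
>             "transformations and actions can only be invoked by the "
>             "driver, not inside of other transformations; see SPARK-5063.")
> {code}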
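> Similarly, here is a rough illustration of the getter idea from above,
> written in Python for consistency with the other snippets here; the real
> check would live in the Scala {{RDD}} class, and the attribute names are
> hypothetical:
> {code}
> class RDD(object):
>     def __init__(self, ctx):
>         # Store the context under a private name so that every access
>         # goes through the checked property below.
>         self._ctx = ctx
>
>     @property
>     def ctx(self):
>         # A missing context means this RDD was deserialized inside a task,
>         # i.e. the user nested RDDs or invoked an action within a
>         # transformation, so raise a descriptive error up front instead of
>         # letting a NullPointerException surface later.
>         if self._ctx is None:
>             raise Exception(
>                 "This RDD lacks a SparkContext. It could be that RDD "
>                 "transformations were nested, or an action was invoked "
>                 "inside a transformation; see SPARK-5063.")
>         return self._ctx
> {code}
> With a guard like this, the failing example above would report the
> descriptive message instead of a confusing NPE.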