[ https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297913#comment-14297913 ]
Josh Rosen commented on SPARK-5462:
-----------------------------------

I'm working on a patch for this now. It looks like the problem crops up when trying to select columns from DataFrames that are returned by SQL queries, as opposed to ones created by applying or inferring a schema. Here's a regression test demonstrating this:

{code}
def test_column_selection_on_dataframes_created_by_queries(self):
    # Regression test for SPARK-5462
    df = self.df
    df.registerTempTable("test")
    df_from_query = self.sqlCtx.sql("select key, value from test")
    df_from_query.key
    # Throws exception
    df_from_query.value
{code}

> Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in Python DataFrame
> ------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-5462
> URL: https://issues.apache.org/jira/browse/SPARK-5462
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 1.3.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Blocker
>
> When trying to access fields on a Python DataFrame created via inferSchema, I ran into a confusing Catalyst Py4J error. Here's a reproduction:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, Row
> sc = SparkContext("local", "test")
> sqlContext = SQLContext(sc)
> # Load a text file and convert each line to a Row.
> lines = sc.textFile("examples/src/main/resources/people.txt")
> parts = lines.map(lambda l: l.split(","))
> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
> # Infer the schema, and register the SchemaRDD as a table.
> schemaPeople = sqlContext.inferSchema(people)
> schemaPeople.registerTempTable("people")
> # SQL can be run over SchemaRDDs that have been registered as a table.
> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
> print teenagers.name
> {code}
> This fails with the following error:
> {code}
> Traceback (most recent call last):
>   File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in <module>
>     print teenagers.name
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, in __getattr__
>     return Column(self._jdf.apply(name))
>   File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply.
> : org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to qualifiers on unresolved object, tree: 'name
>     at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50)
>     at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46)
>     at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143)
>     at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140)
>     at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>     at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>     at scala.collection.immutable.List.foreach(List.scala:318)
>     at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>     at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>     at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140)
>     at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126)
>     at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122)
>     at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>     at py4j.Gateway.invoke(Gateway.java:259)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> This is distinct from the helpful error message that I get when trying to access a non-existent column. This error didn't occur when I tried the same thing with a DataFrame created via jsonRDD.