[ 
https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696378#comment-14696378
 ] 

Michael Armbrust commented on SPARK-8670:
-----------------------------------------

I think that all DataFrame methods that take identifiers as strings should 
behave consistently.  We changed Scala so that they are "mostly quoted": spaces 
and other special characters behave as though the name were in backticks in 
SQL, but dots are an exception and need to be escaped explicitly (i.e. 
{{df["structColumn.`field.with.dots`"]}}).  The rationale is that using dots to 
go into a struct or to qualify an attribute is more common than column names 
with dots in them.
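To make the rule above concrete, here is a minimal sketch of how such a name could be split: an unquoted dot separates path segments (struct access or qualification), while a backtick-quoted segment is kept whole. The helper {{split_attribute}} is hypothetical, for illustration only, and is not Spark's actual parser.

```python
def split_attribute(name):
    """Split a column reference on unquoted dots.

    Backtick-quoted segments are treated as a single part, so
    "structColumn.`field.with.dots`" yields two parts, not four.
    Hypothetical sketch of the rule described above.
    """
    parts = []
    buf = []
    in_backticks = False
    for ch in name:
        if ch == '`':
            # Toggle quoting; the backticks themselves are dropped.
            in_backticks = not in_backticks
        elif ch == '.' and not in_backticks:
            # Unquoted dot: close the current segment.
            parts.append(''.join(buf))
            buf = []
        else:
            buf.append(ch)
    parts.append(''.join(buf))
    return parts

# An unquoted dot descends into a struct:
#   split_attribute("stats.age") -> ["stats", "age"]
# Backticks protect literal dots in a field name:
#   split_attribute("structColumn.`field.with.dots`")
#     -> ["structColumn", "field.with.dots"]
```

Under this rule {{df["stats.age"]}} would resolve the nested field rather than looking for a top-level column literally named "stats.age", which is the behavior the issue below is asking for.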

> Nested columns can't be referenced (but they can be selected)
> -------------------------------------------------------------
>
>                 Key: SPARK-8670
>                 URL: https://issues.apache.org/jira/browse/SPARK-8670
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 1.4.0, 1.4.1, 1.5.0
>            Reporter: Nicholas Chammas
>            Assignee: Wenchen Fan
>            Priority: Blocker
>
> This is strange and looks like a regression from 1.3.
> {code}
> import json
> daterz = [
>   {
>     'name': 'Nick',
>     'stats': {
>       'age': 28
>     }
>   },
>   {
>     'name': 'George',
>     'stats': {
>       'age': 31
>     }
>   }
> ]
> df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
> df.select('stats.age').show()
> df['stats.age']  # 1.4 fails on this line
> {code}
> On 1.3 this works and yields:
> {code}
> age
> 28 
> 31 
> Out[1]: Column<stats.age AS age#2958L>
> {code}
> On 1.4, however, this gives an error on the last line:
> {code}
> +---+
> |age|
> +---+
> | 28|
> | 31|
> +---+
> ---------------------------------------------------------------------------
> IndexError                                Traceback (most recent call last)
> <ipython-input-1-04bd990e94c6> in <module>()
>      19 
>      20 df.select('stats.age').show()
> ---> 21 df['stats.age']
> /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
>     678         if isinstance(item, basestring):
>     679             if item not in self.columns:
> --> 680                 raise IndexError("no such column: %s" % item)
>     681             jc = self._jdf.apply(item)
>     682             return Column(jc)
> IndexError: no such column: stats.age
> {code}
> This means, among other things, that you can't join DataFrames on nested 
> columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
