[jira] [Updated] (SPARK-8670) Nested columns can't be referenced (but they can be selected)
[ https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-8670:
    Assignee: Wenchen Fan

Nested columns can't be referenced (but they can be selected)

    Key: SPARK-8670
    URL: https://issues.apache.org/jira/browse/SPARK-8670
    Project: Spark
    Issue Type: Sub-task
    Components: PySpark, SQL
    Affects Versions: 1.4.0, 1.4.1, 1.5.0
    Reporter: Nicholas Chammas
    Assignee: Wenchen Fan
    Priority: Blocker

This is strange and looks like a regression from 1.3.

{code}
import json

daterz = [
    {
        'name': 'Nick',
        'stats': {
            'age': 28
        }
    },
    {
        'name': 'George',
        'stats': {
            'age': 31
        }
    }
]

df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))

df.select('stats.age').show()
df['stats.age']  # 1.4 fails on this line
{code}

On 1.3 this works and yields:

{code}
age
28
31

Out[1]: Column<stats.age AS age#2958L>
{code}

On 1.4, however, this gives an error on the last line:

{code}
+---+
|age|
+---+
| 28|
| 31|
+---+

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-04bd990e94c6> in <module>()
     19
     20 df.select('stats.age').show()
---> 21 df['stats.age']

/path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
    678         if isinstance(item, basestring):
    679             if item not in self.columns:
--> 680                 raise IndexError("no such column: %s" % item)
    681             jc = self._jdf.apply(item)
    682             return Column(jc)

IndexError: no such column: stats.age
{code}

This means, among other things, that you can't join DataFrames on nested columns.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
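The traceback in the report points at the membership test `item not in self.columns`. A plausible reading is that `self.columns` lists only top-level column names, so a dotted path such as `stats.age` never matches. The sketch below models that in plain Python, with no Spark involved. The names `getitem_1_4`, `resolve_path`, and the dict-based `schema` are hypothetical and exist only for illustration; this is not Spark's actual resolution logic.

```python
# Hypothetical, Spark-free model of the check shown in the traceback.
# `columns` stands in for DataFrame.columns (assumed: top-level names only).

def getitem_1_4(columns, item):
    """Mimic the membership test at line 679 of the traceback."""
    if item not in columns:
        raise IndexError("no such column: %s" % item)
    return item


def resolve_path(schema, path):
    """Walk a dotted path through a nested dict schema (illustrative only)."""
    node = schema
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            raise IndexError("no such column: %s" % path)
        node = node[part]
    return node


# Shape of the example data: 'stats' is a struct with a single field 'age'.
# The type names are placeholders chosen for this sketch.
schema = {"name": "string", "stats": {"age": "long"}}
columns = list(schema)  # top-level names only

error = None
try:
    getitem_1_4(columns, "stats.age")
except IndexError as exc:
    error = str(exc)

# Resolving the path field by field finds the nested column.
resolved = resolve_path(schema, "stats.age")
```

Under these assumptions the flat membership test rejects `stats.age` with the same message as in the report, while a field-by-field walk finds it. That matches the observation that `select('stats.age')` succeeds while `df['stats.age']` does not.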
[jira] [Updated] (SPARK-8670) Nested columns can't be referenced (but they can be selected)
[ https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-8670:
    Priority: Critical (was: Major)
[jira] [Updated] (SPARK-8670) Nested columns can't be referenced (but they can be selected)
[ https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-8670:
    Issue Type: Sub-task (was: Bug)
    Parent: SPARK-9564
[jira] [Updated] (SPARK-8670) Nested columns can't be referenced (but they can be selected)
[ https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-8670:
    Target Version/s: 1.5.0
    Priority: Blocker (was: Critical)
[jira] [Updated] (SPARK-8670) Nested columns can't be referenced (but they can be selected)
[ https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-8670:
    Affects Version/s: 1.5.0
                       1.4.1
[jira] [Updated] (SPARK-8670) Nested columns can't be referenced (but they can be selected)
[ https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-8670:
    Description:

This is strange and looks like a regression from 1.3.

{code}
import json

daterz = [
    {
        'name': 'Nick',
        'stats': {
            'age': 28
        }
    },
    {
        'name': 'George',
        'stats': {
            'age': 31
        }
    }
]

df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))

df.select('stats.age').show()
df['stats.age']  # 1.4 fails on this line
{code}

On 1.3 this works and yields:

{code}
age
28
31

Out[1]: Column<stats.age AS age#2958L>
{code}

On 1.4, however, this gives an error on the last line:

{code}
+---+
|age|
+---+
| 28|
| 31|
+---+

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-04bd990e94c6> in <module>()
     19
     20 df.select('stats.age').show()
---> 21 df['stats.age']

/path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
    678         if isinstance(item, basestring):
    679             if item not in self.columns:
--> 680                 raise IndexError("no such column: %s" % item)
    681             jc = self._jdf.apply(item)
    682             return Column(jc)

IndexError: no such column: stats.age
{code}

This means, among other things, that you can't join DataFrames on nested columns.

    was:

This is strange and looks like a regression from 1.3.

{code}
import json

daterz = [
    {
        'name': 'Nick',
        'stats': {
            'age': 28
        }
    },
    {
        'name': 'George',
        'stats': {
            'age': 31
        }
    }
]

df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))

df.select('stats.age').show()
df['stats.age']  # 1.4 fails on this line
{code}

On 1.3 this works and yields:

{code}
age
28
31

Out[1]: Column<stats.age AS age#2958L>
{code}

On 1.4, however, this gives an error on the last line:

{code}
+---+
|age|
+---+
| 28|
| 31|
+---+

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-04bd990e94c6> in <module>()
     19
     20 df.select('stats.age').show()
---> 21 df['stats.age']

/path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
    678         if isinstance(item, basestring):
    679             if item not in self.columns:
--> 680                 raise IndexError("no such column: %s" % item)
    681             jc = self._jdf.apply(item)
    682             return Column(jc)

IndexError: no such column: stats.age
{code}

Nested columns can't be referenced (but they can be selected)

    Key: SPARK-8670
    URL: https://issues.apache.org/jira/browse/SPARK-8670
    Project: Spark
    Issue Type: Bug
    Components: PySpark, SQL
    Affects Versions: 1.4.0
    Reporter: Nicholas Chammas