[jira] [Updated] (SPARK-33184) spark doesn't read data source column if it is used as an index to an array under a struct

2020-10-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33184:
-
Component/s: (was: Spark Core)
 SQL

> spark doesn't read data source column if it is used as an index to an array 
> under a struct
> --
>
> Key: SPARK-33184
> URL: https://issues.apache.org/jira/browse/SPARK-33184
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: colin fang
>Priority: Minor
>
> {code:python}
> df = spark.createDataFrame([[1, [[1, 2]]]],
> schema='x:int,y:struct<a:array<int>>')
> df.write.mode('overwrite').parquet('test')
> {code}
> {code:python}
> # This causes an error "Caused by: java.lang.RuntimeException: Couldn't find 
> x#720 in [y#721]"
> spark.read.parquet('test').select(F.expr('y.a[x]')).show()
> # explain() itself succeeds; note that x is absent from ReadSchema
> spark.read.parquet('test').select(F.expr('y.a[x]')).explain()
> == Physical Plan ==
> *(1) !Project [y#713.a[x#712] AS y.a AS `a`[x]#717]
> +- FileScan parquet [y#713] Batched: false, DataFilters: [], Format: Parquet, 
> Location: InMemoryFileIndex, PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<y:struct<a:array<int>>>
> {code}
> The code works if I do either of the following:
> {code:python}
> # manually select the column it misses
> spark.read.parquet('test').select(F.expr('y.a[x]'), F.col('x')).show()
> # use element_at function
> spark.read.parquet('test').select(F.element_at('y.a', F.col('x') + 1)).show()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33184) spark doesn't read data source column if it is used as an index to an array under a struct

2020-10-19 Thread colin fang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

colin fang updated SPARK-33184:
---
Issue Type: Bug  (was: Improvement)







[jira] [Updated] (SPARK-33184) spark doesn't read data source column if it is used as an index to an array under a struct

2020-10-19 Thread colin fang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

colin fang updated SPARK-33184:
---
Summary: spark doesn't read data source column if it is used as an index to 
an array under a struct  (was: spark doesn't read data source column if it is 
needed as an index to an array in a nested struct)



