[jira] [Updated] (SPARK-33184) spark doesn't read data source column if it is used as an index to an array under a struct
[ https://issues.apache.org/jira/browse/SPARK-33184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-33184:
---------------------------------
    Component/s: SQL
                 (was: Spark Core)

> spark doesn't read data source column if it is used as an index to an array
> under a struct
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-33184
>                 URL: https://issues.apache.org/jira/browse/SPARK-33184
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: colin fang
>            Priority: Minor
>
> {code:python}
> df = spark.createDataFrame([[1, [[1, 2, 3]]]],
>                            schema='x:int,y:struct<a:array<int>>')
> df.write.mode('overwrite').parquet('test')
> {code}
> {code:python}
> # This causes an error "Caused by: java.lang.RuntimeException:
> # Couldn't find x#720 in [y#721]"
> spark.read.parquet('test').select(F.expr('y.a[x]')).show()
> # Explain works fine; note that it doesn't read x in ReadSchema
> spark.read.parquet('test').select(F.expr('y.a[x]')).explain()
> == Physical Plan ==
> *(1) !Project [y#713.a[x#712] AS y.a AS `a`[x]#717]
> +- FileScan parquet [y#713] Batched: false, DataFilters: [], Format: Parquet,
>    Location: InMemoryFileIndex, PartitionFilters: [], PushedFilters: [],
>    ReadSchema: struct<y:struct<a:array<int>>>
> {code}
> The code works well if I:
> {code:python}
> # manually select the column it misses
> spark.read.parquet('test').select(F.expr('y.a[x]'), F.col('x')).show()
> # use the element_at function (1-based indexing, hence the + 1)
> spark.read.parquet('test').select(F.element_at('y.a', F.col('x') + 1)).show()
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
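The two workarounds in the report differ in indexing convention: the bracket expression `y.a[x]` is 0-based, while Spark SQL's `element_at` is 1-based (and supports negative indices counting from the end), which is why the second workaround adds 1 to `x`. A minimal pure-Python sketch of that semantic difference — `element_at` below is a hypothetical stand-in mimicking the SQL function, not Spark code:

```python
def element_at(arr, idx):
    """Mimic Spark SQL element_at semantics: 1-based indexing;
    a negative index counts from the end of the array."""
    if idx == 0:
        raise ValueError("SQL array index cannot be 0 (element_at is 1-based)")
    return arr[idx - 1] if idx > 0 else arr[idx]

a = [10, 20, 30]
x = 1                                  # 0-based index, as used in y.a[x]
assert a[x] == element_at(a, x + 1)    # both select the second element
assert element_at(a, -1) == 30         # negative index: last element
```

So rewriting `y.a[x]` as `element_at(y.a, x + 1)` is only equivalent for non-negative `x`; the `+ 1` compensates for the shift in base.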
[ https://issues.apache.org/jira/browse/SPARK-33184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

colin fang updated SPARK-33184:
-------------------------------
    Issue Type: Bug  (was: Improvement)
[ https://issues.apache.org/jira/browse/SPARK-33184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

colin fang updated SPARK-33184:
-------------------------------
    Summary: spark doesn't read data source column if it is used as an index
             to an array under a struct
             (was: spark doesn't read data source column if it is needed as
             an index to an array in a nested struct)