colin fang created SPARK-33184:
----------------------------------

             Summary: Spark doesn't read a data source column if it is needed as an index into an array in a nested struct
                 Key: SPARK-33184
                 URL: https://issues.apache.org/jira/browse/SPARK-33184
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: colin fang


```
df = spark.createDataFrame([[1, [[1, 2]]]], schema='x:int,y:struct<a:array<int>>')
df.write.mode('overwrite').parquet('test')
```

```
from pyspark.sql import functions as F

# This fails with: "Caused by: java.lang.RuntimeException: Couldn't find x#720 in [y#721]"
spark.read.parquet('test').select(F.expr('y.a[x]')).show()

# explain() runs fine, but note that x is missing from ReadSchema:
spark.read.parquet('test').select(F.expr('y.a[x]')).explain()

== Physical Plan ==
*(1) !Project [y#713.a[x#712] AS y.a AS `a`[x]#717]
+- FileScan parquet [y#713] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<y:struct<a:array<int>>>

```


The code works if I either:

- manually select the missing column:
  `spark.read.parquet('test').select(F.expr('y.a[x]'), F.col('x')).show()`
- or use the `F.element_at` function:
  `spark.read.parquet('test').select(F.element_at('y.a', F.col('x') + 1)).show()`

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
