Super! That’d be great. Lemme know if I can help in any way. Sent from my iPhone
> On Aug 30, 2019, at 6:30 PM, Anton Okolnychyi <aokolnyc...@apple.com.invalid> > wrote: > > Hi Gautam, > > Iceberg does support nested schema pruning but Spark doesn’t request this for > DS V2 in 2.4. Internally, we had to modify Spark 2.4 to make this work > end-to-end. > One of the options is to extend DataSourceV2Strategy with logic similar to > what we have in ParquetSchemaPruning in 2.4.0. I think we can share that part > if needed. > > I am planning to check whether Spark master already has this functionality. > If that’s not implemented and nobody is working on it yet, I can fix it. > > - Anton > > >> On 30 Aug 2019, at 15:42, Gautam <gautamkows...@gmail.com> wrote: >> >> Hello Devs, >> I was measuring perf on structs between V1 and V2 >> datasources. Found that although Iceberg Reader supports >> `SupportsPushDownRequiredColumns` it doesn't seem to prune nested column >> projections. I want to be able to prune on nested fields. How does V2 >> datasource have provision to be able to let Iceberg decide this? The >> `SupportsPushDownRequiredColumns` mix-in gives the entire struct field even >> if a sub-field is requested. >> >> Here's an illustration .. >> >> scala> spark.sql("select location.lat from iceberg_people_struct").show() >> +-------+ >> | lat| >> +-------+ >> | null| >> |101.123| >> |175.926| >> +-------+ >> >> >> The pruning gets the entire struct instead of just `location.lat` .. >> >> public void pruneColumns(StructType newRequestedSchema) >> >> 19/08/30 16:25:38 WARN Reader: => Prune columns : { >> "type" : "struct", >> "fields" : [ { >> "name" : "location", >> "type" : { >> "type" : "struct", >> "fields" : [ { >> "name" : "lat", >> "type" : "double", >> "nullable" : true, >> "metadata" : { } >> }, { >> "name" : "lon", >> "type" : "double", >> "nullable" : true, >> "metadata" : { } >> } ] >> }, >> "nullable" : true, >> "metadata" : { } >> } ] >> } >> >> Is there information I can use in the IcebergSource (or add some) that can >> be used to prune the exact sub-field here? What's a good way to approach >> this? For dense/wide struct fields this affects performance significantly. >> >> >> Sample gist: >> https://gist.github.com/prodeezy/001cf155ff0675be7d307e9f842e1dac >> >> >> thanks and regards, >> -Gautam. >