Re: Nested Column Pruning in Iceberg (DSV2) ..

Gautam Kowshik Fri, 30 Aug 2019 21:58:58 -0700

Super! That’d be great. Lemme know if I can help in any way. 

Sent from my iPhone


> On Aug 30, 2019, at 6:30 PM, Anton Okolnychyi <aokolnyc...@apple.com.invalid> 
> wrote:
> 
> Hi Gautam,
> 
> Iceberg does support nested schema pruning but Spark doesn’t request this for 
> DS V2 in 2.4. Internally, we had to modify Spark 2.4 to make this work 
> end-to-end.
> One of the options is to extend DataSourceV2Strategy with logic similar to 
> what we have in ParquetSchemaPruning in 2.4.0. I think we can share that part 
> if needed.
> 
> I am planning to check whether Spark master already has this functionality.
> If that’s not implemented and nobody is working on it yet, I can fix it.
> 
> - Anton
> 
> 
>> On 30 Aug 2019, at 15:42, Gautam <gautamkows...@gmail.com> wrote:
>> 
>> Hello Devs,
>>                    I was measuring perf on structs between V1 and V2 
>> datasources. Found that although Iceberg Reader supports 
>> `SupportsPushDownRequiredColumns` it doesn't seem to prune nested column 
>> projections. I want to be able to prune on nested fields. How does V2 
>> datasource have provision to be able to let Iceberg decide this? The 
>> `SupportsPushDownRequiredColumns` mix-in gives the entire struct field even 
>> if a sub-field is requested.
>> 
>> Here's an illustration .. 
>> 
>> scala> spark.sql("select location.lat from iceberg_people_struct").show()
>> +-------+
>> |    lat|
>> +-------+
>> |   null|
>> |101.123|
>> |175.926|
>> +-------+
>> 
>> 
>> The pruning gets the entire struct instead of just `location.lat`  ..
>> 
>> public void pruneColumns(StructType newRequestedSchema) 
>> 
>> 19/08/30 16:25:38 WARN Reader: => Prune columns : {
>>  "type" : "struct",
>>  "fields" : [ {
>>    "name" : "location",
>>    "type" : {
>>      "type" : "struct",
>>      "fields" : [ {
>>        "name" : "lat",
>>        "type" : "double",
>>        "nullable" : true,
>>        "metadata" : { }
>>      }, {
>>        "name" : "lon",
>>        "type" : "double",
>>        "nullable" : true,
>>        "metadata" : { }
>>      } ]
>>    },
>>    "nullable" : true,
>>    "metadata" : { }
>>  } ]
>> }
>> 
>> Is there information I can use in the IcebergSource (or add some) that can 
>> be used to prune the exact sub-field here?  What's a good way to approach 
>> this? For dense/wide struct fields this affects performance significantly.
>> 
>> 
>> Sample gist: 
>> https://gist.github.com/prodeezy/001cf155ff0675be7d307e9f842e1dac
>> 
>> 
>> thanks and regards,
>> -Gautam.
>

Re: Nested Column Pruning in Iceberg (DSV2) ..

Reply via email to