aokolnychyi commented on issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - foundation
URL: https://github.com/apache/spark/pull/21320#issuecomment-446020655

@mallman @dbtsai @gatorsmile One question on non-deterministic expressions. For example, let's consider a non-deterministic UDF.

```
import org.apache.spark.sql.functions.{col, udf}

val nonDeterministicUdf = udf((first: String) => first + " " + Math.random()).asNondeterministic()
val query = data.select(col("id"), nonDeterministicUdf(col("name.first")))
```

As it is today, there will be no schema pruning because of how `collectProjectsAndFilters` is defined in `PhysicalOperation`.

```
== Analyzed Logical Plan ==
id: int, UDF(name.first): string
Project [id#222, UDF(name#223.first) AS UDF(name.first)#246]
+- Project [id#222, name#223, address#224, pets#225, friends#226, relatives#227, employer#228, p#229]
   +- SubqueryAlias `contacts`
      +- Relation[id#222,name#223,address#224,pets#225,friends#226,relatives#227,employer#228,p#229] parquet

== Optimized Logical Plan ==
Project [id#222, UDF(name#223.first) AS UDF(name.first)#246]
+- Relation[id#222,name#223,address#224,pets#225,friends#226,relatives#227,employer#228,p#229] parquet

== Physical Plan ==
*(1) Project [id#222, UDF(name#223.first) AS UDF(name.first)#246]
+- *(1) FileScan parquet [id#222,name#223,address#224,pets#225,friends#226,relatives#227,employer#228,p#229] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/f3/6jyczfzd15ndvh49zq0d_sg80000gn/T/spark-6b69e4e9-c6..., PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:struct<first:string,middle:string,last:string>,address:string,pets:int,friends...
```

To me, it seems valid to apply schema pruning in this case. What do you think?
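
To make the question concrete, here is a toy model of the behaviour described above. It is only a sketch and does not use Spark's classes: `Expr`, `Relation`, `Project`, `collectProjects` and `readColumns` are hypothetical stand-ins, and it assumes the `Project` case in `collectProjectsAndFilters` is guarded by something like `fields.forall(_.deterministic)`, so that a non-deterministic field stops the traversal before it reaches the relation.

```scala
// Toy model only: these are hypothetical stand-ins, not Spark's Catalyst classes.
object PruningSketch {
  // An expression records the column path it reads and whether it is deterministic.
  final case class Expr(path: String, deterministic: Boolean = true)

  sealed trait Plan
  final case class Relation(columns: Seq[String]) extends Plan
  final case class Project(fields: Seq[Expr], child: Plan) extends Plan

  // Assumed guard: a Project is collected only if every field is deterministic.
  // Otherwise the traversal stops and the Project itself is returned as the leaf.
  def collectProjects(plan: Plan): (Option[Seq[Expr]], Plan) = plan match {
    case Project(fields, child) if fields.forall(_.deterministic) =>
      val (_, leaf) = collectProjects(child)
      (Some(fields), leaf)
    case other => (None, other)
  }

  // The read schema can be narrowed only when the collected leaf is the relation.
  // If the leaf is a Project, the rule does not match there and is retried on its child.
  def readColumns(plan: Plan): Seq[String] = collectProjects(plan) match {
    case (Some(fields), Relation(_)) => fields.map(_.path).distinct
    case (_, Relation(columns))      => columns
    case (_, Project(_, child))      => readColumns(child)
  }

  def main(args: Array[String]): Unit = {
    val contacts = Relation(Seq("id", "name", "address"))
    val deterministic = Project(Seq(Expr("id"), Expr("name.first")), contacts)
    val nonDeterministic =
      Project(Seq(Expr("id"), Expr("name.first", deterministic = false)), contacts)

    println(readColumns(deterministic))    // List(id, name.first)
    println(readColumns(nonDeterministic)) // List(id, name, address)
  }
}
```

In this model the non-deterministic projection reads every column of the relation, which mirrors the full `ReadSchema` in the physical plan above. Whether relaxing the guard is safe for pruning alone is exactly the open question.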