[ https://issues.apache.org/jira/browse/SPARK-34204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270326#comment-17270326 ]
Nick Hryhoriev commented on SPARK-34204:
----------------------------------------

Looks like `spark_partition_id` has the same issue.

> When using the input_file_name() function, all columns from the file appear in the physical plan of the query, not only the projection.
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34204
>                 URL: https://issues.apache.org/jira/browse/SPARK-34204
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.7
>            Reporter: Nick Hryhoriev
>            Priority: Major
>
> The input_file_name() function breaks projection pruning in the physical plan of a query.
> If this function is used to add a new column, column-oriented formats like Parquet and ORC put all of the file's columns into the physical plan,
> while without it only the selected columns are read.
> In my case the performance impact is 30x.
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions._
>
> object TestSize {
>   def main(args: Array[String]): Unit = {
>     implicit val spark: SparkSession = SparkSession.builder()
>       .master("local")
>       .config("spark.sql.shuffle.partitions", "5")
>       .getOrCreate()
>     import spark.implicits._
>
>     // Query 1: input_file_name() defeats column pruning.
>     val query1 = spark.read.parquet(
>       "s3a:/part-00040-a19f0d20-eab3-48ef-be5a-602c7f9a8e58.c000.gz.parquet"
>     )
>       .select($"app_id", $"idfa", input_file_name().as("fileName"))
>       .distinct()
>       .count()
>
>     // Query 2: plain projection, only the selected columns are read.
>     val query2 = spark.read.parquet(
>       "s3a:/part-00040-a19f0d20-eab3-48ef-be5a-602c7f9a8e58.c000.gz.parquet"
>     )
>       .select($"app_id", $"idfa")
>       .distinct()
>       .count()
>
>     // Keep the application alive, e.g. to inspect the Spark UI.
>     Thread.sleep(10000000000L)
>   }
> }
> {code}
> Query 1 has all of the file's columns in the physical plan, while query 2 has only the two selected ones.
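A minimal sketch for checking both behaviors, assuming a wide Parquet file at a placeholder path with at least the columns app_id and idfa (the path, object name, and column names below are illustrative, not from the report): printing the physical plan and comparing the ReadSchema of the FileScan node shows which columns are actually read. spark_partition_id() is included to check the claim in the comment above.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{input_file_name, spark_partition_id}

object PruningCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .getOrCreate()
    import spark.implicits._

    // Placeholder path: any Parquet file with more columns than the projection.
    val df = spark.read.parquet("/tmp/wide.parquet")

    // Baseline: ReadSchema should list only app_id and idfa.
    df.select($"app_id", $"idfa").explain()

    // Per this issue: adding input_file_name() is expected to make
    // ReadSchema list every column of the file.
    df.select($"app_id", $"idfa", input_file_name().as("fileName")).explain()

    // Per the comment above: spark_partition_id() reportedly causes the
    // same loss of column pruning.
    df.select($"app_id", $"idfa", spark_partition_id().as("pid")).explain()

    spark.stop()
  }
}
{code}

On an affected version (2.4.7 here), the first plan's ReadSchema should contain only the two projected columns, while the other two plans should contain the file's full schema.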