Nick Hryhoriev created SPARK-34204:
--------------------------------------

             Summary: input_file_name() func damage applying projection to physical plan of query.
                 Key: SPARK-34204
                 URL: https://issues.apache.org/jira/browse/SPARK-34204
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.7
            Reporter: Nick Hryhoriev

The input_file_name() function breaks projection (column pruning) in the physical plan of a query. When a query selects this function as a new column, the scan over a column-oriented format such as Parquet or ORC puts all columns into the physical plan. Without it, only the selected columns are loaded. In my case the performance impact is about 30x.

```
package com.appsflyer.datalake.s3.index

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TestSize {

  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession = SparkSession.builder()
      .master("local")
      .config("spark.sql.shuffle.partitions", "5")
      .getOrCreate()
    import spark.implicits._

    // With input_file_name(): all columns end up in the physical plan.
    val query1 = spark.read.parquet(
      "s3a:/part-00040-a19f0d20-eab3-48ef-be5a-602c7f9a8e58.c000.gz.parquet"
    )
      .select($"app_id", $"idfa", input_file_name().as("fileName"))
      .distinct()
      .count()

    // Without input_file_name(): only the selected columns are loaded.
    val query2 = spark.read.parquet(
      "s3a:/part-00040-a19f0d20-eab3-48ef-be5a-602c7f9a8e58.c000.gz.parquet"
    )
      .select($"app_id", $"idfa")
      .distinct()
      .count()

    // Keep the JVM alive (presumably to inspect both queries in the Spark UI).
    Thread.sleep(10000000000L)
  }
}
```
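
One way to check which columns each query actually reads is to compare the ReadSchema entry that explain() prints for the file scan node. The following is a minimal sketch, not the reporter's code: the object name CompareReadSchema, the temporary directory, and the extra payload column are hypothetical, and whether the two plans differ depends on the Spark version in use.

```
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

object CompareReadSchema {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("compare-read-schema")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: "payload" stands in for the columns
    // that the queries below never select.
    val path = Files.createTempDirectory("spark-34204").resolve("data").toString
    Seq(("app1", "idfa1", "x"), ("app2", "idfa2", "y"))
      .toDF("app_id", "idfa", "payload")
      .write.parquet(path)

    val withFileName = spark.read.parquet(path)
      .select($"app_id", $"idfa", input_file_name().as("fileName"))
    val withoutFileName = spark.read.parquet(path)
      .select($"app_id", $"idfa")

    // Print both physical plans; compare the ReadSchema of each FileScan node.
    withFileName.explain()
    withoutFileName.explain()

    spark.stop()
  }
}
```

If "payload" appears in the ReadSchema of the first plan but not the second, the scan is reading columns that the query never selected.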