Liangcai li created SPARK-45879:
-----------------------------------

             Summary: Number check for InputFileBlockSources is missing for V2 source (BatchScan)?
                 Key: SPARK-45879
                 URL: https://issues.apache.org/jira/browse/SPARK-45879
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.0
         Environment: Tried on Spark 3.2.3 and Spark 3.4.1; both reproduce this issue.
            Reporter: Liangcai li
When doing a join with the "input_file_name()" function, the query blows up with an AnalysisException when using the V1 data source (FileSourceScan). That is expected. But if we switch to the V2 data source (BatchScan), the expected exception is gone and the join passes. Is this number check for InputFileBlockSources missing for the V2 data source?

Repro steps:

```
scala> spark.range(100).withColumn("const1", lit("from_t1")).write.parquet("/data/tmp/t1")
scala> spark.range(100).withColumn("const2", lit("from_t2")).write.parquet("/data/tmp/t2")

scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
scala> spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), "id", "inner").selectExpr("*", "input_file_name()").show(5, false)
org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more than one sources.; line 1 pos 0;
Project [id#376L, const1#377, const2#381, input_file_name() AS input_file_name()#389]
+- Project [id#376L, const1#377, const2#381]
   +- Join Inner, (id#376L = id#380L)
      :- Relation [id#376L,const1#377] parquet
      +- Relation [id#380L,const2#381] parquet

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
  at org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)
  at org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2(rules.scala:472)
  at org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2$adapted(rules.scala:472)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
```

```
scala> spark.conf.set("spark.sql.sources.useV1SourceList", "")
scala> spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), "id", "inner").selectExpr("*", "input_file_name()").show(5, false)
+---+-------+-------+---------------------------------------------------------------------------------------+
|id |const1 |const2 |input_file_name()                                                                      |
+---+-------+-------+---------------------------------------------------------------------------------------+
|91 |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|92 |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|93 |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|94 |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|95 |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
+---+-------+-------+---------------------------------------------------------------------------------------+
only showing top 5 rows
```

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
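For context, the behavior described above can be illustrated with a minimal, self-contained sketch of what a "count the input file sources" pre-read check does. This is NOT the actual Spark source: the toy types `V1Relation`, `V2Relation`, `Join`, and the object `InputFileCheck` are hypothetical stand-ins for Spark's `LogicalRelation`, `DataSourceV2Relation`, and `PreReadCheck`. The point is that if the count only matches the V1 leaf, a V2 scan contributes zero and the multi-source error is never raised:

```scala
// Hedged sketch, not Spark internals: toy plan nodes standing in for
// Spark's logical plan. The real check lives in PreReadCheck
// (checkNumInputFileBlockSources in rules.scala, per the stack trace above).
sealed trait Plan { def children: Seq[Plan] }
case class Join(left: Plan, right: Plan) extends Plan { def children = Seq(left, right) }
case class V1Relation(path: String) extends Plan { def children = Nil }  // FileSourceScan-style
case class V2Relation(path: String) extends Plan { def children = Nil }  // BatchScan-style

object InputFileCheck {
  // Recursively count leaf data sources under a plan.
  // If the V2 case below were missing (as the report suggests),
  // a V1/V2 or V2/V2 join would count at most one source and pass.
  def countSources(p: Plan): Int = p match {
    case _: V1Relation => 1
    case _: V2Relation => 1 // the piece the report says is missing for BatchScan
    case other         => other.children.map(countSources).sum
  }

  // Fail analysis when input_file_name() would be ambiguous.
  def check(p: Plan): Unit =
    if (countSources(p) > 1)
      throw new IllegalStateException(
        "'input_file_name' does not support more than one sources")
}
```

With both leaf cases counted, a join of any two file sources trips the check, which matches the V1 behavior shown in the first repro block.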