Vishal Donderia created SPARK-28563: ---------------------------------------
Summary: Spark 2.4 | Reading all the data inside partition like directory. Key: SPARK-28563 URL: https://issues.apache.org/jira/browse/SPARK-28563 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.4.1 Reporter: Vishal Donderia We have upgraded your cluster from Spark 2.3 to 2.4 and currently, we are observing different behavior while reading data. In Spark 2.3 spark.read.('basePath','output/model').orc('output/model/abc=4') Expected: We will get "abc" column in schema Similarly: spark.read.('basePath','output/model/abc=4').orc('output/model/abc=4') Expected : It will only read data inside parition abc=4 and abc will not be part of schema even "output/model" has different schema of files inside In Spark2.4 spark.read.('basePath','output/model/abc=4').orc('output/model/abc=4') It is trying to get the schema from "output/model/" instead of output/model/abc=4 and job is getting failed because of different schema {code} For partitioned table directories, data files should only live in leaf directories. And directories at the same level should have the same partition column name. Please check the following directories for unexpected files or inconsistent partition column names: at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:364) at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:165) at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:100) at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:131) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:71) at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:144) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) at org.apache.spark.sql.DataFrameReader.orc(DataFrameReader.scala:662) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org