Vishal Donderia created SPARK-28563:
---------------------------------------

             Summary: Spark 2.4 | Reading all the data inside partition like directory.
                 Key: SPARK-28563
                 URL: https://issues.apache.org/jira/browse/SPARK-28563
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.4.1
            Reporter: Vishal Donderia


We have upgraded our cluster from Spark 2.3 to 2.4 and are now observing 
different behavior while reading data. 

 

In Spark 2.3:
      spark.read.option('basePath','output/model').orc('output/model/abc=4')

Expected: The "abc" column appears in the schema.


Similarly:

 spark.read.option('basePath','output/model/abc=4').orc('output/model/abc=4')

Expected: Only the data inside partition abc=4 is read, and abc is not 
part of the schema, even if other files under "output/model" have a different schema.
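
The two expectations above can be illustrated with a minimal sketch in plain Python. It is a simplified model of how partition columns follow from the path segments below basePath, not Spark's actual implementation; the function name infer_partition_columns is hypothetical.

```python
from typing import List, Tuple


def infer_partition_columns(base_path: str, leaf_path: str) -> List[Tuple[str, str]]:
    """Return (column, value) pairs for the `name=value` directory segments
    that lie below base_path on the way to leaf_path.

    Simplified model only: segments at or above base_path are never
    treated as partition columns.
    """
    base = [s for s in base_path.split("/") if s]
    leaf = [s for s in leaf_path.split("/") if s]
    if leaf[:len(base)] != base:
        raise ValueError(f"{leaf_path!r} is not under base path {base_path!r}")

    columns = []
    for segment in leaf[len(base):]:
        name, sep, value = segment.partition("=")
        if sep and name:
            columns.append((name, value))
    return columns


# basePath is the table root: abc is expected to become a partition column.
with_parent_base = infer_partition_columns("output/model", "output/model/abc=4")
# basePath is the partition directory itself: no partition column is expected.
with_leaf_base = infer_partition_columns("output/model/abc=4", "output/model/abc=4")

print(with_parent_base)  # [('abc', '4')]
print(with_leaf_base)    # []
```

In this model, setting basePath to the partition directory itself leaves no name=value segment below it, which is why abc is expected to drop out of the schema in the second call.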


In Spark 2.4:

spark.read.option('basePath','output/model/abc=4').orc('output/model/abc=4')

Spark tries to infer the schema from "output/model/" instead of 
output/model/abc=4, and the job fails because of the different schema:

{code}
For partitioned table directories, data files should only live in leaf directories.
And directories at the same level should have the same partition column name.
Please check the following directories for unexpected files or inconsistent partition column names:

at scala.Predef$.assert(Predef.scala:170)
 at org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:364)
 at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:165)
 at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:100)
 at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:131)
 at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:71)
 at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
 at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:144)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
 at org.apache.spark.sql.DataFrameReader.orc(DataFrameReader.scala:662)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
 at py4j.Gateway.invoke(Gateway.java:282)
 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
 at py4j.commands.CallCommand.execute(CallCommand.java:79)
 at py4j.GatewayConnection.run(GatewayConnection.java:238)
 at java.lang.Thread.run(Thread.java:745)

{code} 
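
The message in the trace above says that directories at the same level should have the same partition column name. A rough sketch of that kind of consistency check, in plain Python, is given here. It is an assumption about the shape of the check, not the code in PartitioningUtils, and the directory names in the usage lines (abc=5, xyz=1) are hypothetical, since the actual contents of output/model are not listed in this report.

```python
from typing import Dict, List, Tuple


def partition_column_names(relative_dir: str) -> Tuple[str, ...]:
    """Extract the column names from the `name=value` segments of a path."""
    return tuple(
        segment.split("=", 1)[0]
        for segment in relative_dir.split("/")
        if "=" in segment
    )


def check_consistent_partitions(relative_dirs: List[str]) -> Tuple[str, ...]:
    """Return the shared partition column names, or raise if directories disagree."""
    dirs_by_names: Dict[Tuple[str, ...], List[str]] = {}
    for directory in relative_dirs:
        dirs_by_names.setdefault(partition_column_names(directory), []).append(directory)

    if len(dirs_by_names) > 1:
        raise AssertionError(
            f"Conflicting partition column names detected: {dirs_by_names}"
        )
    # Empty input has no partition columns at all.
    return next(iter(dirs_by_names), ())


# Hypothetical layout where every sibling directory uses the same column.
print(check_consistent_partitions(["abc=4", "abc=5"]))  # ('abc',)

# Hypothetical layout with mismatched siblings, which this check rejects.
try:
    check_consistent_partitions(["abc=4", "xyz=1"])
except AssertionError as error:
    print("rejected:", error)
```

If Spark 2.4 scans "output/model/" rather than only output/model/abc=4, sibling directories with other column names or stray files would trip a check of this kind, which is consistent with the failure reported here.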

