[ https://issues.apache.org/jira/browse/HUDI-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sivabalan narayanan updated HUDI-3041:
--------------------------------------
    Summary: spark.read fails when drop partitions is used  (was: drop partitions fails with spark.read)

> spark.read fails when drop partitions is used
> ---------------------------------------------
>
>                 Key: HUDI-3041
>                 URL: https://issues.apache.org/jira/browse/HUDI-3041
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>    Affects Versions: 0.9.0
>            Reporter: sivabalan narayanan
>            Priority: Major
>              Labels: sev:critical, user-support-issues
>
> In 0.9.0, we added support for dropping partition columns after the partition path has been generated for hoodie records, but there appear to be gaps in the end-to-end flow.
>
> Hudi serializes the partition path fields to table properties and reuses them when needed. So, at query time, the field may no longer be part of the table schema, since Hudi would have removed it (trimming the schema) entirely.
>
> Stack trace for the issue:
> {code:java}
> scala> val tripsSnapshotDF = spark.read.format("hudi").load(basePath)
> java.lang.IllegalArgumentException: Cannot find column: 'partitionId' in the schema[StructField(_hoodie_commit_time,StringType,true),StructField(_hoodie_commit_seqno,StringType,true),StructField(_hoodie_record_key,StringType,true),StructField(_hoodie_partition_path,StringType,true),StructField(_hoodie_file_name,StringType,true),StructField(rowId,StringType,true),StructField(preComb,LongType,true),StructField(name,StringType,true),StructField(versionId,StringType,true),StructField(toBeDeletedStr,StringType,true),StructField(intToLong,IntegerType,true),StructField(longToInt,LongType,true)]
>   at org.apache.hudi.HoodieFileIndex$$anonfun$5$$anonfun$apply$1.apply(HoodieFileIndex.scala:106)
>   at org.apache.hudi.HoodieFileIndex$$anonfun$5$$anonfun$apply$1.apply(HoodieFileIndex.scala:106)
>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>   at org.apache.hudi.HoodieFileIndex$$anonfun$5.apply(HoodieFileIndex.scala:106)
>   at org.apache.hudi.HoodieFileIndex$$anonfun$5.apply(HoodieFileIndex.scala:105)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at org.apache.hudi.HoodieFileIndex._partitionSchemaFromProperties$lzycompute(HoodieFileIndex.scala:105)
>   at org.apache.hudi.HoodieFileIndex._partitionSchemaFromProperties(HoodieFileIndex.scala:99)
>   at org.apache.hudi.HoodieFileIndex.getAllQueryPartitionPaths(HoodieFileIndex.scala:348)
>   at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:420)
>   at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:214)
>   at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:149)
>   at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:116)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:67)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   ... 63 elided
> {code}
>
> Steps to reproduce:
> [https://gist.github.com/nsivabalan/570a96004e41f84565c99d8994b12d57]

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
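For context, a minimal write-then-read sequence in spark-shell that can hit this code path, sketched from the description above rather than copied from the linked gist: the table path, schema, and table name are illustrative assumptions, and it assumes the {{hoodie.datasource.write.drop.partition.columns}} option is what enables the partition-column dropping described in the issue.

{code:scala}
// Illustrative spark-shell sketch; paths, column names, and table name are assumptions.
import org.apache.spark.sql.SaveMode
import spark.implicits._

val basePath = "file:///tmp/hudi_drop_partition_cols"

// Write a table partitioned on partitionId, asking Hudi to drop the
// partition column from the data files once the partition path is generated.
val df = Seq(("row1", 1L, "p1"), ("row2", 2L, "p2")).toDF("rowId", "preComb", "partitionId")
df.write.format("hudi").
  option("hoodie.table.name", "hudi_trips").
  option("hoodie.datasource.write.recordkey.field", "rowId").
  option("hoodie.datasource.write.precombine.field", "preComb").
  option("hoodie.datasource.write.partitionpath.field", "partitionId").
  option("hoodie.datasource.write.drop.partition.columns", "true").
  mode(SaveMode.Overwrite).
  save(basePath)

// The snapshot read then fails as in the stack trace: partitionId is still
// recorded in the table properties as the partition field, but it was trimmed
// from the table schema on write, so HoodieFileIndex cannot resolve it.
val tripsSnapshotDF = spark.read.format("hudi").load(basePath)
{code}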