Hello Spark Devs!
We are from Uber's Spark team. Our ETL jobs use Spark to read from and write to Hive datasets stored in HDFS. The freshness of an output partition depends on the freshness of the data in the input partition(s) it is derived from. We monitor this freshness score so that partitions in our critical tables always contain fresh data.

We are looking for a helper function/utility built into the Spark engine through which we can programmatically get the list of partitions read and written by an execution. We looked for this in the plan, and our initial study of the code did not turn up any such method. So far we have relied on indirect sources such as storage audit logs, HMS audit logs, etc., which we find difficult to use and to scale. However, the Spark code does contain the list of partitions read and written. The following files carry the partition data for the given file formats:

1. Input partitions from HiveTableScanExec.scala (Text format)
2. Input partitions from DataSourceScanExec.scala (Parquet/Hudi/ORC)
3. Output partitions from InsertIntoHiveTable.scala (Text format)
4. Output partitions from InsertIntoHadoopFsRelationCommand.scala (Parquet/Hudi/ORC)

We did come up with some code that gathers this information in a programmatically friendly way: we maintained the information in the plan and wrapped the plan with convenience classes and methods to extract the partition details (a rough sketch of the idea is in the P.S. below). We feel that such a programmatic interface could serve other purposes as well, for example surfacing a new set of statistics in the Spark History Server (SHS) to aid troubleshooting.

I wanted to ask the dev community: is there already something implemented (or previously implemented) in Spark that can solve this requirement? If not, we would love to share how we have implemented this and contribute it to the community.

Regards,
Aditya Sohoni
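P.S. To make the idea more concrete, below is a minimal, illustrative sketch (not our actual implementation) of how the partition details could be pulled out of the executed plan with a QueryExecutionListener. It covers only the Parquet/ORC/Hudi path (FileSourceScanExec and InsertIntoHadoopFsRelationCommand); the Hive text-format nodes (HiveTableScanExec and InsertIntoHiveTable) would be handled analogously. It assumes a recent Spark 3.x build, the class name PartitionAuditListener is just a placeholder, and for simplicity it ignores AQE (where the final plan is nested inside AdaptiveSparkPlanExec) and dynamic partition pruning.

import org.apache.spark.sql.catalyst.expressions.DynamicPruningExpression
import org.apache.spark.sql.execution.{FileSourceScanExec, QueryExecution, SparkPlan}
import org.apache.spark.sql.execution.command.DataWritingCommandExec
import org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand
import org.apache.spark.sql.util.QueryExecutionListener

class PartitionAuditListener extends QueryExecutionListener {

  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    val plan: SparkPlan = qe.executedPlan

    // Input side: Parquet/ORC/Hudi scans appear as FileSourceScanExec
    // (defined in DataSourceScanExec.scala). Re-resolve the partitions the
    // scan would select; dynamic partition pruning filters are dropped, so
    // this over-approximates the partitions actually read under DPP.
    val reads = plan.collect {
      case scan: FileSourceScanExec =>
        val table = scan.tableIdentifier.map(_.unquotedString).getOrElse("<path-based>")
        val staticFilters = scan.partitionFilters
          .filterNot(_.find(_.isInstanceOf[DynamicPruningExpression]).isDefined)
        val partitions = scan.relation.location
          .listFiles(staticFilters, scan.dataFilters)
          .map(_.values.toSeq(scan.relation.partitionSchema))
        s"READ  $table -> ${partitions.mkString("[", ", ", "]")}"
    }

    // Output side: inserts into Parquet/ORC/Hudi tables appear as
    // InsertIntoHadoopFsRelationCommand wrapped in DataWritingCommandExec.
    // Only the static partition spec is visible on the command itself;
    // capturing dynamically written partitions needs extra bookkeeping in
    // the plan, which is what our wrapper classes add.
    val writes = plan.collect {
      case DataWritingCommandExec(cmd: InsertIntoHadoopFsRelationCommand, _) =>
        val target = cmd.catalogTable.map(_.identifier.unquotedString)
          .getOrElse(cmd.outputPath.toString)
        s"WRITE $target -> static=${cmd.staticPartitions}, " +
          s"partitionColumns=${cmd.partitionColumns.map(_.name).mkString(",")}"
    }

    (reads ++ writes).foreach(println)
  }

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
}

// Registration, either programmatically or via spark.sql.queryExecutionListeners:
//   spark.listenerManager.register(new PartitionAuditListener)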