Hi devs,
Question: how can I convert a Hive output format to a Spark SQL data source format?
Spark version: 2.3.0
Scenario: many small files are generated on HDFS (Hive tables) by Spark SQL
applications when dynamic partitioning is enabled or when
spark.sql.shuffle.partitions is set above 200. So I am trying to develop a new
feature: after the temporary files have been written to HDFS but before they
are moved to the final path, compute the ideal file count from dfs.blocksize
and the temporary files' total length, then merge (coalesce/repartition) down
to that ideal file count, roughly as sketched below.
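A minimal sketch of that computation, where fs, tempDataPath, and blockSize are
hypothetical stand-ins for whatever is actually in scope at that point of the
commit path:
```
import org.apache.hadoop.fs.{FileSystem, Path}

// Ideal file count = ceil(total length of temporary files / dfs.blocksize),
// so each merged file ends up at roughly one HDFS block.
def idealFileNumber(fs: FileSystem, tempDataPath: Path, blockSize: Long): Int = {
  // Sum the lengths of the temporary files under the staging directory
  // (non-recursive here for brevity; partitioned layouts need recursion).
  val totalLength = fs.listStatus(tempDataPath).map(_.getLen).sum
  math.max(1, math.ceil(totalLength.toDouble / blockSize).toInt)
}
```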
But I have hit a difficulty: the temporary files are written in the output
format (e.g. org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat) defined in the
Hive TableDesc, and I can't load the temporary files with
```
sparkSession.read
  .format(TableDesc.getInputFormatClassName)
  .load(tempDataPath)            // throws here
  .repartition(idealFileNumber)  // the ideal file count from above
  .write
  .format(TableDesc.getOutputFormatClassName)
```
It throws an exception, "xxx is not a valid Spark SQL Data Source", from
DataSource#resolveRelation.
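For contrast, a quick check (assuming the temporary files are ORC) shows that
only the built-in short name resolves:
```
// The Hive output format class name is rejected by DataSource#resolveRelation:
sparkSession.read
  .format("org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat")
  .load(tempDataPath) // throws: not a valid Spark SQL Data Source

// ...while the short data source name works on the same files:
sparkSession.read.format("orc").load(tempDataPath)
```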
I also tried
```
sparkSession.read
  .option("inputFormat", TableDesc.getInputFormatClassName)
  .option("outputFormat", TableDesc.getOutputFormatClassName)
  .load(tempDataPath)
  ...
```
but that does not work either: the options are ignored and the Spark SQL
DataSource falls back to the default format, parquet.
So, how can I convert a Hive output format to a Spark SQL data source format?
Is there any better way than hardcoding a Map<Hive output format, Spark SQL
data source>, roughly like the sketch below?
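To be concrete about that map (the ORC entry is from my case above; the
parquet class name and the short data source names are assumptions on my part,
and the map would clearly need many more entries):
```
// Hypothetical hardcoded mapping from Hive output format class names
// to Spark SQL data source short names (incomplete).
val hiveOutputFormatToDataSource: Map[String, String] = Map(
  "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat" -> "orc",
  "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat" -> "parquet"
)
```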
Thanks in advance