Hi devs,
Question: how can I convert a Hive output format to a Spark SQL data source format?
Spark version: 2.3.0
Scenario: many small files are generated on HDFS (Hive tables) by Spark SQL
applications when dynamic partitioning is enabled or when
spark.sql.shuffle.partitions is set above 200. So I am trying to develop a new
feature: after the temporary files have been written to HDFS but before they
are moved to the final path, compute the ideal file count from dfs.blocksize
and the temporary files' total length, then merge (coalesce/repartition) down
to that ideal file count, roughly as sketched below.
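A minimal sketch of that computation, where fs, tempDataPath, and blockSize are
hypothetical stand-ins for whatever is actually in scope at that point of the
commit path:
```
import org.apache.hadoop.fs.{FileSystem, Path}

// Ideal file count = ceil(total length of temporary files / dfs.blocksize),
// so each merged file ends up at roughly one HDFS block.
def idealFileNumber(fs: FileSystem, tempDataPath: Path, blockSize: Long): Int = {
  // Sum the lengths of the temporary files under the staging directory
  // (non-recursive here for brevity; partitioned layouts need recursion).
  val totalLength = fs.listStatus(tempDataPath).map(_.getLen).sum
  math.max(1, math.ceil(totalLength.toDouble / blockSize).toInt)
}
```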
But I have hit a difficulty: the temporary files are written in the output
format (e.g. org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat) defined in the
Hive TableDesc, and I can't load the temporary files with
```
sparkSession.read
  .format(TableDesc.getInputFormatClassName)
  .load(tempDataPath)            // throws here
  .repartition(idealFileNumber)  // the ideal file count from above
  .write
  .format(TableDesc.getOutputFormatClassName)
```
It throws an exception, "xxx is not a valid Spark SQL Data Source", from
DataSource#resolveRelation.
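For contrast, a quick check (assuming the temporary files are ORC) shows that
only the built-in short name resolves:
```
// The Hive output format class name is rejected by DataSource#resolveRelation:
sparkSession.read
  .format("org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat")
  .load(tempDataPath) // throws: not a valid Spark SQL Data Source

// ...while the short data source name works on the same files:
sparkSession.read.format("orc").load(tempDataPath)
```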
I also tried
```
sparkSession.read
  .option("inputFormat", TableDesc.getInputFormatClassName)
  .option("outputFormat", TableDesc.getOutputFormatClassName)
  .load(tempDataPath)
  ...
```
but that does not work either: the options are ignored and the Spark SQL
DataSource falls back to the default format, parquet.
So, how can I convert a Hive output format to a Spark SQL data source format?
Is there any better way than hardcoding a Map<Hive output format, Spark SQL
data source>, roughly like the sketch below?
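To be concrete about that map (the ORC entry is from my case above; the
parquet class name and the short data source names are assumptions on my part,
and the map would clearly need many more entries):
```
// Hypothetical hardcoded mapping from Hive output format class names
// to Spark SQL data source short names (incomplete).
val hiveOutputFormatToDataSource: Map[String, String] = Map(
  "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat" -> "orc",
  "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat" -> "parquet"
)
```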
Thanks in advance