Hi, I am converting a Hive job to a Spark job. I have tested both on a small data set, and the logic produces identical results in Hive and Spark.
When I started testing on large data, Spark is very slow compared to Hive; the shuffle write phase in particular is taking a long time. Any suggestions? I am registering a temporary table in Spark and then overwriting a partitioned Hive table from that temporary table:

    dataframe_transposed.registerTempTable(srcTable)

    import sqlContext._
    import sqlContext.implicits._

    val query = s"INSERT OVERWRITE TABLE ${destTable} SELECT * FROM ${srcTable}"
    println(query)
    logger.info(s"Executing query: ${query}")
    sqlContext.sql(query)

The total size of the DataFrame is around 190 GB. In this case the Spark job runs forever, while the Hive job completes in about 4 hours.

Thanks,
Asmath.
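P.S. In case a runnable repro helps, here is a self-contained sketch of roughly what the job does. The table names, the source of the DataFrame, and the dynamic-partition settings are placeholders and assumptions on my side, not the exact production code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object InsertOverwriteSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hive-to-spark"))
        val sqlContext = new HiveContext(sc)

        // Assumption: the destination table uses dynamic partitioning, so
        // INSERT OVERWRITE without a static PARTITION clause needs these set.
        sqlContext.setConf("hive.exec.dynamic.partition", "true")
        sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

        val srcTable = "tmp_transposed"    // placeholder name
        val destTable = "mydb.dest_table"  // placeholder name

        // Placeholder source: in the real job this DataFrame is the result
        // of the transpose logic.
        val dataframe_transposed = sqlContext.table("mydb.source_table")
        dataframe_transposed.registerTempTable(srcTable)

        sqlContext.sql(s"INSERT OVERWRITE TABLE $destTable SELECT * FROM $srcTable")
      }
    }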