Hi All, I have a cluster of 3 nodes [each with 8 cores / 32 GB memory].
My program uses Spark Streaming with Spark SQL [Spark 1.1] and writes incoming JSON to Elasticsearch and HBase. I receive JSON files [input data varies from 30 MB to 300 MB] every 10 seconds. Whether I run on 3 nodes or 1 node, the processing time is about the same; for example, 15 files take around 5 minutes to process end to end.

I have set spark.streaming.unpersist to true, repartitioned the stream to 3, enabled Kryo serialization and UseCompressedOops, and applied some more GC and performance tuning mentioned in the Spark docs. Still there is not much improvement. Is there any other way I can tune my code to improve performance? Appreciate your help. Thanks in advance.

Below is my code:

    inputStream.foreachRDD { rdd =>
      // Infer the schema and get a SchemaRDD
      val inputJson = sqlContext.jsonRDD(rdd)
      // Register it as a Spark SQL temp table
      inputJson.registerTempTable("tableA")
      // Parse the JSON by selecting the needed columns
      val outputJson = sqlContext.sql("select colA, colB etc from tableA")
      // Write to Elasticsearch and HBase [both use foreachPartition]
      ESUtil.writeToES(outputJson, coreName)
      HBaseUtils.saveAsHBaseTable(outputJson, "hbasetable")
    }
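In case it matters, here is roughly how I apply the settings mentioned above (a sketch of my configuration; property names are from the Spark docs, the exact values and file layout are from my setup and may need adjusting):

```
# spark-defaults.conf (sketch; values assumed from my setup)
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.streaming.unpersist        true
spark.executor.extraJavaOptions  -XX:+UseCompressedOops
```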