Hi All,

I have a cluster of 3 nodes [each with 8 cores / 32 GB memory].

My program uses Spark Streaming with Spark SQL [Spark 1.1] and writes
incoming JSON to Elasticsearch and HBase. My code is below; I receive
JSON files [input data varies from 30 MB to 300 MB] every 10 seconds.
Whether I run on 3 nodes or 1 node, the processing time is almost the
same: for example, 15 files take about 5 minutes end to end.

I have set spark.streaming.unpersist to true, repartitioned the stream to 3,
enabled Kryo serialization, and applied UseCompressedOops along with some of
the other GC and performance tuning options mentioned in the Spark docs.
Still, there is not much improvement. Is there any other way I can tune my
code to improve performance? I appreciate your help; thanks in advance.
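
For reference, this is roughly how those settings are wired up (a sketch: the
app name and inputDir are placeholders, and the property names are taken from
the Spark 1.1 configuration docs):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("json-to-es-hbase") // placeholder app name
      // Kryo serialization
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // discard the stream's old RDDs once they are no longer needed
      .set("spark.streaming.unpersist", "true")
      // pass the JVM flag down to the executors
      .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")

    // 10-second batches, matching the file arrival interval
    val ssc = new StreamingContext(conf, Seconds(10))

    // inputDir is a placeholder for the directory the JSON files land in
    val inputStream = ssc.textFileStream(inputDir).repartition(3)

The per-batch logic then runs in foreachRDD: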

    inputStream.foreachRDD(rdd => {

      // Convert the batch of JSON strings into a SchemaRDD
      val inputJson = sqlContext.jsonRDD(rdd)

      // Register it as a Spark SQL table
      inputJson.registerTempTable("tableA")

      // Project the columns of interest from the JSON
      val outputJson = sqlContext.sql("select colA, colB, ... from tableA")

      // Write to ES and HBase [both writers use foreachPartition]
      ESUtil.writeToES(outputJson, coreName)
      HBaseUtils.saveAsHBaseTable(outputJson, "hbasetable")
    })
