I have three text files in HDFS that I am reading with Spark SQL and registering as tables. I then run five or six operations on them, including joins and group-bys, and that whole process takes barely 6-7 seconds (source size: 3 GB, about 20 million rows). As the final step of my computation, I expect only one record in my final RDD, named acctNPIScr in the code snippet below.
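For context, the setup looks roughly like this. This is only a minimal sketch of what I described above; the file path, delimiter, case class, and field names here are illustrative assumptions, not my actual job:

```scala
// Sketch of the setup (Spark 1.2-era API); paths and schema are illustrative.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical schema for the input rows.
case class AcctNPI(party_id: String, npi_int: Double, device_priority_new: Double)

val sc = new SparkContext(new SparkConf().setAppName("NPIScore"))
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD conversion

// Read the raw text files from HDFS and map each line onto the case class.
val acctRDD = sc.textFile("hdfs:///data/acct_npi/*.txt")
  .map(_.split("\\|"))
  .map(f => AcctNPI(f(0), f(1).toDouble, f(2).toDouble))

// Register as a table so it can be queried with Spark SQL.
acctRDD.registerTempTable("AcctNPIScoreTemp")
```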
My question is this: when I try to print this RDD, either by registering it as a table and selecting its records, or with acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println), it takes a very long time, almost 1.5 minutes, to print one record. Can someone please tell me whether I am doing something wrong in the printing step? What is the best way to print the final result from a SchemaRDD?

    val acctNPIScr = sqlContext.sql(
      "SELECT party_id, sum(npi_int)/sum(device_priority_new) as npi_score " +
      "FROM AcctNPIScoreTemp group by party_id")
    acctNPIScr.registerTempTable("AcctNPIScore")

    val endtime = System.currentTimeMillis()
    logger.info("Total sql Time :" + (endtime - st)) // this time is hardly 5 secs

    println("start printing")
    val result = sqlContext.sql("SELECT * FROM AcctNPIScore").collect().foreach(println)
    //acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println)

    logger.info("Total printing Time :" + (System.currentTimeMillis() - endtime))
    // printing one record is taking almost 1.5 minutes

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-taking-long-time-to-print-records-from-a-table-tp21493.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.