I have 3 text files in HDFS, which I read using Spark SQL and register as
tables. After that I run 5-6 operations, including joins, group by, etc.
This whole process takes only 6-7 seconds. (Source file size: 3 GB, with
almost 20 million rows.)
As the final step of my computation, I expect only 1 record in my final
RDD, named acctNPIScr in the code snippet below.
My question is this: when I try to print this RDD, either by registering it
as a table and printing the records from that table, or with
acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println), it takes
a very long time, almost 1.5 minutes to print 1 record.
Can someone please tell me whether I am doing something wrong when printing?
What is the best way to print the final result from a SchemaRDD?
.....
val acctNPIScr = sqlContext.sql("""SELECT party_id,
  sum(npi_int)/sum(device_priority_new) as npi_score FROM AcctNPIScoreTemp
  group by party_id""")
acctNPIScr.registerTempTable("AcctNPIScore")
val endtime = System.currentTimeMillis()
logger.info("Total sql Time :" + (endtime - st)) // this time is only about 5 secs
println("start printing")
sqlContext.sql("SELECT * FROM AcctNPIScore").collect().foreach(println)
//acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println)
// printing one record takes almost 1.5 minutes
logger.info("Total printing Time :" + (System.currentTimeMillis() - endtime))
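
In case it helps with diagnosing, this is roughly how I could time the
collect separately from the println. It is only a sketch: timed is a helper
I made up for this mail (it is not part of Spark), and the view in main is
just a stand-in for a lazily evaluated query so that the snippet runs without
a cluster. I am assuming here that sqlContext.sql() only builds the query and
that the real work happens when collect() is called.

```scala
object TimingSketch {
  // Hypothetical helper (not part of Spark): evaluates `body` once and
  // returns its result together with the elapsed wall-clock milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for sqlContext.sql(...): a view only records the map step.
    val (pipeline, defineMs) = timed((1 to 5000000).view.map(_.toLong * 2))
    // Stand-in for collect(): summing forces the whole pipeline to run.
    val (total, runMs) = timed(pipeline.sum)
    println("define: " + defineMs + " ms, run: " + runMs + " ms, total = " + total)

    // Against the snippet above this would look like (untested):
    //   val (rows, collectMs) = timed(acctNPIScr.collect())
    //   val (_, printMs)      = timed(rows.foreach(println))
  }
}
```

If collectMs accounts for nearly all of the 1.5 minutes and printMs is close
to zero, then the time is going into running the query rather than into
printing the record.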