I have 3 text files in HDFS which I am reading using Spark SQL and
registering as tables. After that I am performing 5-6 operations,
including joins, group by, etc., and this whole process takes barely 6-7
secs. (Source file size: 3 GB, with almost 20 million rows.)
As the final step of my computation, I am expecting only 1 record in my
final RDD, named acctNPIScr in the code snippet below.
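
For reference, the setup looks roughly like this (a simplified sketch of
the Spark 1.2-era API; the file path, delimiter, and the Acct case class
fields are placeholders, not my real schema):

```scala
// Simplified sketch of the setup (Spark 1.2-era SchemaRDD API).
// The path, delimiter and Acct fields are placeholders, not the real schema.
case class Acct(party_id: String, npi_int: Double, device_priority_new: Double)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit: RDD[Acct] -> SchemaRDD

val accts = sc.textFile("hdfs:///data/accts.txt")   // one of the 3 source files
  .map(_.split("\\|"))
  .map(a => Acct(a(0), a(1).toDouble, a(2).toDouble))

accts.registerTempTable("AcctNPIScoreTemp")   // queried by the SQL below
```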

My question is: when I try to print this RDD, either by registering it
as a table and printing records from the table, or by this method -
acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println) - it
takes a very long time, almost 1.5 minutes, to print 1 record.

Can someone please help me figure out if I am doing something wrong in the
printing step? What is the best way to print the final result from a
SchemaRDD?

    .....
    val acctNPIScr = sqlContext.sql("""SELECT party_id,
        sum(npi_int) / sum(device_priority_new) AS npi_score
        FROM AcctNPIScoreTemp GROUP BY party_id""")
    acctNPIScr.registerTempTable("AcctNPIScore")

    val endtime = System.currentTimeMillis()
    logger.info("Total sql Time :" + (endtime - st))   // this takes barely 5 secs

    println("start printing")

    sqlContext.sql("SELECT * FROM AcctNPIScore").collect().foreach(println)

    //acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println)

    logger.info("Total printing Time :" + (System.currentTimeMillis() -
    endtime))   // printing one record takes almost 1.5 minutes
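
One thing I may be missing (this is a guess on my part): Spark
transformations are lazy, so sqlContext.sql(...) might only build the
query plan, with the actual file reads, joins and group by running when
collect() is called. If that is right, the first timer measures plan
construction and the second one the whole job, not just the printing.
The same timing pitfall can be sketched in plain Scala, no Spark needed:

```scala
// Building a lazy pipeline is nearly free; the cost is paid only when
// the result is forced, so timing the definition alone is misleading.
val defineStart = System.currentTimeMillis()
val pipeline = (1 to 20000000).view   // lazy: nothing computed yet
  .map(_.toLong)
  .filter(_ % 2 == 0)
println("define: " + (System.currentTimeMillis() - defineStart) + " ms") // ~0 ms

val forceStart = System.currentTimeMillis()
val total = pipeline.sum              // all 20M elements are processed here
println("force : " + (System.currentTimeMillis() - forceStart) + " ms")
```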




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-taking-long-time-to-print-records-from-a-table-tp21493.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
