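Many operations in Spark are lazy; most likely your collect() statement is actually forcing evaluation of several steps earlier in the pipeline. If so, the ~5 seconds you measured for the SQL covers only building the query plan, and the real work happens during the print step. The logs and the UI should give you some info about all the stages that are being run when you get to collect().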
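To see where the time actually goes, one option is to force the computation with an action such as count() before calling collect(), and time each step separately. The following is only a minimal sketch of that idea, not your actual job: the object name, the `timed` helper, the local master, and the stand-in map/filter pipeline are all assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; none of these names come from the original job.
object LazyTimingSketch {

  // Runs `body` and returns its result plus the elapsed wall-clock time in ms.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just so the sketch runs on one machine.
    val conf = new SparkConf().setAppName("lazy-timing-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Transformations only record lineage; no data is processed here.
      val (scores, defineMs) = timed {
        sc.parallelize(1 to 1000000)
          .map(i => math.sqrt(i.toDouble)) // stand-in for the expensive steps
          .filter(_ > 999.9)               // stand-in for narrowing the result
      }

      // count() is an action: this is where the pipeline really executes.
      // cache() keeps the computed partitions in memory for the next action.
      val (n, computeMs) = timed { scores.cache().count() }

      // With the result cached, collect() only ships the rows to the driver.
      val (rows, collectMs) = timed { scores.collect() }

      println(s"define: $defineMs ms, compute: $computeMs ms, collect: $collectMs ms")
      println(s"rows counted: $n, rows collected: ${rows.length}")

      // The lineage shows which upstream steps an action has to run.
      println(scores.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

If the "compute" time dominates and "collect" is small, the cost is in the upstream joins and aggregations rather than in printing. The same pattern (cache, then count, then collect) should apply to the SchemaRDD returned by sqlContext.sql(...), since it is also an RDD.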
I think collect() is just fine if you are trying to pull just one record to the driver; that shouldn't be a bottleneck.

On Wed, Feb 4, 2015 at 1:32 AM, jguliani <jasminkguli...@gmail.com> wrote:

> I have 3 text files in hdfs which I am reading using spark sql and
> registering them as table. After that I am doing almost 5-6 operations -
> including joins , group by etc.. And this whole process is taking hardly
> 6-7 secs. ( Source File size - 3 GB with almost 20 million rows ).
> As a final step of my computation, I am expecting only 1 record in my final
> rdd - named as acctNPIScr in below code snippet.
>
> My question here is that when I am trying to print this rdd either by
> registering as table and printing records from table or by this method -
> acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println). It is
> taking very long time - almost 1.5 minute to print 1 record.
>
> Can someone pls help me if I am doing something wrong in printing. What is
> the best way to print final result from schemardd.
>
> .....
> val acctNPIScr = sqlContext.sql("SELECT party_id,
>   sum(npi_int)/sum(device_priority_new) as npi_score FROM AcctNPIScoreTemp
>   group by party_id ")
> acctNPIScr.registerTempTable("AcctNPIScore")
>
> val endtime = System.currentTimeMillis()
> logger.info("Total sql Time :" + (endtime - st)) // this time is hardly 5 secs
>
> println("start printing")
>
> val result = sqlContext.sql("SELECT * FROM AcctNPIScore").collect().foreach(println)
>
> //acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println)
>
> logger.info("Total printing Time :" + (System.currentTimeMillis() - endtime))
> // print one record is taking almost 1.5 minute
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-taking-long-time-to-print-records-from-a-table-tp21493.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.