Many operations in Spark are lazy -- most likely your collect() statement
is actually forcing evaluation of several steps earlier in the pipeline.
The logs and the Spark UI should show you which stages actually run when
you get to collect().
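
For example, here is a rough sketch (reusing sqlContext, the table
AcctNPIScoreTemp and the acctNPIScr name from your snippet below, against
the 1.x SchemaRDD API -- an illustration, not a drop-in fix): sql() returns
immediately without touching the data, so the "sql time" you log is mostly
plan construction. Forcing the pipeline with cache() and count() before the
final timer isolates what collect() itself costs:

    val acctNPIScr = sqlContext.sql(
      "SELECT party_id, sum(npi_int)/sum(device_priority_new) AS npi_score " +
      "FROM AcctNPIScoreTemp GROUP BY party_id")

    acctNPIScr.cache()   // keep the aggregated result in memory
    acctNPIScr.count()   // action: the joins / group-by run here, not at collect()

    val t0 = System.currentTimeMillis()
    acctNPIScr.collect().foreach(println)   // should now only ship the cached row(s)
    println("collect time: " + (System.currentTimeMillis() - t0) + " ms")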

collect() itself is fine if you are only pulling a single record to the
driver; that step shouldn't be the bottleneck.
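
And if you literally only need that one row, something like take(1) would
also do it (same caveat: the upstream stages still have to run the first
time this is evaluated):

    // Pull just a single Row to the driver; field 1 is npi_score in the query.
    val one = acctNPIScr.take(1)
    one.foreach(row => println("Score: " + row(1)))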

On Wed, Feb 4, 2015 at 1:32 AM, jguliani <jasminkguli...@gmail.com> wrote:

> I have 3 text files in HDFS which I am reading using Spark SQL and
> registering as tables. After that I am doing about 5-6 operations,
> including joins, group by, etc., and this whole process takes hardly 6-7
> secs. (Source file size - 3 GB with almost 20 million rows.)
> As a final step of my computation, I am expecting only 1 record in my
> final RDD, named acctNPIScr in the code snippet below.
>
> My question is that when I try to print this RDD, either by registering it
> as a table and printing records from that table, or by this method -
> acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println) - it
> takes a very long time: almost 1.5 minutes to print 1 record.
>
> Can someone please help me see if I am doing something wrong in printing?
> What is the best way to print the final result from a SchemaRDD?
>
>     .....
>     val acctNPIScr = sqlContext.sql(
>       "SELECT party_id, sum(npi_int)/sum(device_priority_new) AS npi_score " +
>       "FROM AcctNPIScoreTemp GROUP BY party_id")
>     acctNPIScr.registerTempTable("AcctNPIScore")
>
>     val endtime = System.currentTimeMillis()
>     logger.info("Total sql Time :" + (endtime - st))  // this time is hardly 5 secs
>
>     println("start printing")
>
>     val result = sqlContext.sql("SELECT * FROM AcctNPIScore").collect()
>     result.foreach(println)
>
>     //acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println)
>
>     logger.info("Total printing Time :" + (System.currentTimeMillis() -
> endtime)) // print one record is taking almost 1.5 minute
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-taking-long-time-to-print-records-from-a-table-tp21493.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
