Can you try running it directly on Hive to see the timing, or perhaps
through spark-sql?
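
For a rough wall-clock comparison from spark-shell itself, something like
the sketch below may help (a minimal sketch, assuming a 1.4.x spark-shell
with a Hive-backed sqlContext in scope; the time helper is purely
illustrative, and the hive and spark-sql CLIs print their own timings):

  // Hypothetical helper: wall-clock timing for any block of code.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime
    val result = body
    println(s"$label took ${(System.nanoTime - start) / 1e9} s")
    result
  }

  // Running the statement through an action forces full evaluation,
  // so the reported time covers the whole job.
  time("count on user") {
    sqlContext.sql("select count(*) from user").show()
  }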

 

Spark does what Hive does, that is, it processes large sets of data, but it
attempts to do the intermediate iterations in memory if it can (i.e. if
there is enough memory available to keep the data set in memory); otherwise
it will have to spill to disk. So it boils down to how much memory you
have. 
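
If memory is tight you can make the storage level explicit, so that
partitions which do not fit are spilled to local disk rather than dropped.
A sketch, assuming the same spark-shell session and "user" table as in the
question below:

  import org.apache.spark.storage.StorageLevel

  val df = sqlContext.sql("select * from user")
  // MEMORY_AND_DISK keeps what fits in memory and writes the
  // remaining partitions to local disk, avoiding recomputation.
  df.persist(StorageLevel.MEMORY_AND_DISK)
  // persist is lazy: the first action materializes the cache.
  df.count()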

 

HTH

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 

Co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4.

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 

http://talebzadehmich.wordpress.com

 


From: hxw黄祥为 [mailto:huang...@ctrip.com] 
Sent: 03 December 2015 10:29
To: user@spark.apache.org
Subject: spark1.4.1 extremely slow for take(1) or head() or first() or show

 

Dear All,

 

I have a Hive table with 100 million rows, and I just ran some very simple
operations on this dataset, like:

 

  val df = sqlContext.sql("select * from user").toDF
  df.cache
  df.registerTempTable("tb")
  val b = sqlContext.sql(
    "select 'uid', max(length(uid)), count(distinct(uid)), count(uid), " +
    "sum(case when uid is null then 0 else 1 end), " +
    "sum(case when uid is null then 1 else 0 end), " +
    "sum(case when uid is null then 1 else 0 end)/count(uid) from tb")
  b.show  // the result is just one line, but this step is extremely slow

 

Is this expected? Why is show so slow for a DataFrame? Is it a bug in the
optimizer, or did I do something wrong?

 

 

Best Regards,

tylor
