RE: RDD.join vs spark SQL join

2015-08-15 Thread Xiao JIANG
Thank you Akhil!

Date: Fri, 14 Aug 2015 14:51:56 +0530
Subject: Re: RDD.join vs spark SQL join
From: ak...@sigmoidanalytics.com
To: jiangxia...@outlook.com
CC: user@spark.apache.org

Both work the same way, but with Spark SQL you also get the optimizations
done by Catalyst. One important thing to consider when doing RDD.join is the
number of partitions and the key distribution: if the keys are not evenly
distributed across machines, you can see the process choking on a single
task (i.e. one task takes far longer to execute than the others in that
stage).

Thanks
Best Regards
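To make the two approaches concrete, here is a minimal sketch, using placeholder
data and column names (nothing below is specific to the original poster's
datasets), of the same join expressed once with RDD.join and once through the
DataFrame API:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JoinExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("join-example"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val right = sc.parallelize(Seq((1, "x"), (3, "y")))

    // Plain RDD join: partitioning is under your control, e.g. you can pass
    // an explicit partition count (or a custom Partitioner) to spread skewed keys.
    val rddJoined = left.join(right, 8)
    rddJoined.collect().foreach(println)

    // The same join through Spark SQL / DataFrames: Catalyst plans and
    // optimizes it (e.g. choosing a broadcast join for small inputs).
    val leftDF  = left.toDF("id", "lval")
    val rightDF = right.toDF("id", "rval")
    leftDF.join(rightDF, "id").show()

    sc.stop()
  }
}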

On Fri, Aug 14, 2015 at 1:25 AM, Xiao JIANG jiangxia...@outlook.com wrote:



Hi,
May I know the performance difference between the rdd.join function and the
Spark SQL join operation? If I want to join several big RDDs, how should I
decide which one to use? What are the factors to consider here? Thanks!

Re: Can't understand the size of raw RDD and its DataFrame

2015-08-15 Thread Rishi Yadav
Why are you expecting the footprint of the DataFrame to be lower when it
contains more information (RDD + schema)?

On Sat, Aug 15, 2015 at 6:35 PM, Todd bit1...@163.com wrote:

 Hi,
 With the following code snippet, I cached the raw RDD (which is already in
 memory, but just for illustration) and its DataFrame.
 I thought that the df cache would take less space than the rdd cache, but
 apparently that is wrong: from the UI I see that the rdd cache takes 168B,
 while the df cache takes 272B.
 What data is cached when df.cache is called, and does it actually cache the
 data? It looks like the df only caches the avg(age), which should be much
 smaller in size.

 val conf = new SparkConf().setMaster("local").setAppName("SparkSQL_Cache")
 val sc = new SparkContext(conf)
 val sqlContext = new SQLContext(sc)
 import sqlContext.implicits._
 val rdd = sc.parallelize(Array(Student("Jack", 21), Student("Mary", 22)))
 rdd.cache
 rdd.toDF().registerTempTable("TBL_STUDENT")
 val df = sqlContext.sql("select avg(age) from TBL_STUDENT")
 df.cache()
 df.show




Can't understand the size of raw RDD and its DataFrame

2015-08-15 Thread Todd
Hi,
With the following code snippet, I cached the raw RDD (which is already in
memory, but just for illustration) and its DataFrame.
I thought that the df cache would take less space than the rdd cache, but
apparently that is wrong: from the UI I see that the rdd cache takes 168B,
while the df cache takes 272B.
What data is cached when df.cache is called, and does it actually cache the
data? It looks like the df only caches the avg(age), which should be much
smaller in size.

val conf = new SparkConf().setMaster("local").setAppName("SparkSQL_Cache")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val rdd = sc.parallelize(Array(Student("Jack", 21), Student("Mary", 22)))
rdd.cache
rdd.toDF().registerTempTable("TBL_STUDENT")
val df = sqlContext.sql("select avg(age) from TBL_STUDENT")
df.cache()
df.show
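
One way to see exactly what got cached, and how big each cached RDD is, without
going through the UI is to print the storage info programmatically. A minimal
sketch that could be appended after df.show above (getRDDStorageInfo is a
developer API, so treat this purely as illustrative):

// Print the name and in-memory footprint of every cached RDD, including the
// RDD that backs the cached DataFrame.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize} bytes in memory, " +
    s"${info.numCachedPartitions} cached partitions")
}

Note that df.cache stores the query result in Spark SQL's in-memory columnar
format, which carries its own per-batch metadata, so its size is not directly
comparable to the raw object cache of the RDD.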



Re: Difference between Sort based and Hash based shuffle

2015-08-15 Thread Ravi Kiran
Have a look at this presentation:
http://www.slideshare.net/colorant/spark-shuffle-introduction . It may be of
help to you.
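
For experimenting with the two, the shuffle implementation a job uses can be
selected with the spark.shuffle.manager setting (sort-based has been the
default since Spark 1.2). A minimal sketch, assuming a simple local run with
synthetic data, that runs the same wide operation under each manager so the
shuffle metrics can be compared in the web UI:

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleCompare {
  // Run one small shuffle-heavy job under the given shuffle manager.
  def run(manager: String): Unit = {
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName(s"shuffle-$manager")
      .set("spark.shuffle.manager", manager) // "sort" or "hash"
    val sc = new SparkContext(conf)
    val distinctKeys = sc.parallelize(1 to 1000000)
      .map(i => (i % 1000, 1))
      .reduceByKey(_ + _) // forces a shuffle
      .count()
    println(s"$manager shuffle -> $distinctKeys keys")
    sc.stop()
  }

  def main(args: Array[String]): Unit = Seq("sort", "hash").foreach(run)
}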

On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed 
11besemja...@seecs.edu.pk wrote:

 What are the major differences between how sort-based and hash-based
 shuffle operate, and what is it that causes sort shuffle to perform better
 than hash?
 Are there any talks that discuss both shuffles in detail, how they are
 implemented, and the performance gains?



TestSQLContext compilation error when run SparkPi in Intellij ?

2015-08-15 Thread canan chen
I imported the Spark source code into IntelliJ and want to run SparkPi in
IntelliJ, but I hit the following weird compilation error. I googled it, and
sbt clean doesn't work for me. I am not sure whether anyone else has met this
issue as well; any help is appreciated.

Error:scalac:
 while compiling:
/Users/root/github/spark/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala
during phase: jvm
 library version: version 2.10.4
compiler version: version 2.10.4
  reconstructed args: -nobootcp -javabootclasspath : -deprecation -feature
-classpath


spark on yarn is slower than spark-ec2 standalone, how to tune?

2015-08-15 Thread AlexG
I'm using a manual installation of Spark on YARN to run a 30-node r3.8xlarge
EC2 cluster (each node has 244GB RAM and 600GB SSD). All my code runs much
faster on a cluster launched with the spark-ec2 script, but there's a
mysterious problem with nodes becoming inaccessible, so I switched to running
Spark on YARN because I figured YARN wouldn't let Spark eat up all the
resources and render a machine inaccessible. So far, this seems to be the
case. Now my code runs to completion, but much more slowly, so I'm wondering
how I can tune my Spark-on-YARN installation to make it as fast as the
standalone Spark install.

The current code I'm interested in speeding up just loads a dense 1TB matrix
from Parquet format and then computes a low-rank approximation by essentially
doing a bunch of distributed matrix multiplies. Before, my code completed in
half an hour from loading to writing the output; now I expect it to take 4 or
so hours to complete.

My spark-submit options are 
  --master yarn \
  --num-executors 29 \
  --driver-memory 180G \
  --executor-memory 180G \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=$LOGDIR \
  --conf spark.driver.maxResultSize=50G \
  --conf spark.task.maxFailures=4 \
  --conf spark.worker.timeout=120 \
  --conf spark.network.timeout=120 \
The huge timeouts were necessary on EC2 to avoid losing executors; I'm not
sure whether they've remained necessary after switching to YARN.

My yarn-site.xml has the following settings:

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>236000</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>59000</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>22</value>
  </property>

Any suggestions?
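
Not advice specific to this job, but two knobs that often differ between
standalone and YARN deployments are the executor core count (on YARN each
executor gets only 1 core unless --executor-cores is set) and the off-heap
overhead YARN reserves on top of --executor-memory. A sketch of how they could
appear in the submit command; the core count and overhead values below are
illustrative assumptions, not recommendations:

  --master yarn \
  --num-executors 29 \
  --executor-cores 32 \
  --executor-memory 180G \
  --conf spark.yarn.executor.memoryOverhead=16384 \
  ... (remaining options as above)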




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-on-yarn-is-slower-than-spark-ec2-standalone-how-to-tune-tp24282.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re:Re: Can't understand the size of raw RDD and its DataFrame

2015-08-15 Thread Todd
I thought that the df only contains one column, and actually only one
resulting row (select avg(age) from the table).
So I would think that it would take less space; it looks like my
understanding is wrong?







At 2015-08-16 12:34:31, Rishi Yadav ri...@infoobjects.com wrote:

Why are you expecting the footprint of the DataFrame to be lower when it
contains more information (RDD + schema)?


On Sat, Aug 15, 2015 at 6:35 PM, Todd bit1...@163.com wrote:

Hi,
With the following code snippet, I cached the raw RDD (which is already in
memory, but just for illustration) and its DataFrame.
I thought that the df cache would take less space than the rdd cache, but
apparently that is wrong: from the UI I see that the rdd cache takes 168B,
while the df cache takes 272B.
What data is cached when df.cache is called, and does it actually cache the
data? It looks like the df only caches the avg(age), which should be much
smaller in size.

val conf = new SparkConf().setMaster("local").setAppName("SparkSQL_Cache")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val rdd = sc.parallelize(Array(Student("Jack", 21), Student("Mary", 22)))
rdd.cache
rdd.toDF().registerTempTable("TBL_STUDENT")
val df = sqlContext.sql("select avg(age) from TBL_STUDENT")
df.cache()
df.show





Re: TestSQLContext compilation error when run SparkPi in Intellij ?

2015-08-15 Thread Andrew Or
Hi Canan, TestSQLContext is no longer a singleton but is now a class. It was
never meant to be a fully public API, but if you wish to use it you can just
instantiate a new one:

val sqlContext = new TestSQLContext

or just create a new SQLContext from a SparkContext.
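
For the second option, a minimal sketch (the master and app name are just
placeholders):

val conf = new SparkConf().setMaster("local[*]").setAppName("sql-debug")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)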

-Andrew

2015-08-15 20:33 GMT-07:00 canan chen ccn...@gmail.com:

 I am not sure about other people's Spark debugging environments (I mean for
 the master branch). Can anyone share their experience?


 On Sun, Aug 16, 2015 at 10:40 AM, canan chen ccn...@gmail.com wrote:

 I imported the Spark source code into IntelliJ and want to run SparkPi in
 IntelliJ, but I hit the following weird compilation error. I googled it, and
 sbt clean doesn't work for me. I am not sure whether anyone else has met this
 issue as well; any help is appreciated.

 Error:scalac:
  while compiling:
 /Users/root/github/spark/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala
 during phase: jvm
  library version: version 2.10.4
 compiler version: version 2.10.4
   reconstructed args: -nobootcp -javabootclasspath : -deprecation
 -feature -classpath





Re: TestSQLContext compilation error when run SparkPi in Intellij ?

2015-08-15 Thread canan chen
I am not sure about other people's Spark debugging environments (I mean for
the master branch). Can anyone share their experience?


On Sun, Aug 16, 2015 at 10:40 AM, canan chen ccn...@gmail.com wrote:

 I imported the Spark source code into IntelliJ and want to run SparkPi in
 IntelliJ, but I hit the following weird compilation error. I googled it, and
 sbt clean doesn't work for me. I am not sure whether anyone else has met this
 issue as well; any help is appreciated.

 Error:scalac:
  while compiling:
 /Users/root/github/spark/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala
 during phase: jvm
  library version: version 2.10.4
 compiler version: version 2.10.4
   reconstructed args: -nobootcp -javabootclasspath : -deprecation -feature
 -classpath



Re: Can't find directory after resetting REPL state

2015-08-15 Thread Ted Yu
I tried with the master branch and got the following:

http://pastebin.com/2nhtMFjQ

FYI

On Sat, Aug 15, 2015 at 1:03 AM, Kevin Jung itsjb.j...@samsung.com wrote:

 Spark shell can't find the base directory of the class server after running
 the :reset command.
 scala> :reset
 scala> 1
 uncaught exception during compilation: java.lang.AssertionError
 java.lang.AssertionError: assertion failed: Tried to find '$line33' in
 '/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory
 ~~~impossible to command anymore~~~
 I figured out that the reset() method in SparkIMain tries to delete
 virtualDirectory and then create it again, but virtualDirectory.create()
 makes a file, not a directory.
 Does anyone face the same problem under Spark 1.4.0?

 Kevin


How to run spark in standalone mode on cassandra with high availability?

2015-08-15 Thread Vikram Kone
Hi,
We are planning to install Spark in standalone mode on a Cassandra cluster.
The problem is that since Cassandra has a no-SPOF architecture, i.e. any node
can become the master for the cluster, this creates a problem for the Spark
master, since Spark standalone is not a peer-to-peer architecture where any
node can become the master.

What are our options here? Are there any frameworks or tools out there that
would allow any application to run on a cluster of machines with high
availability?
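
For what it's worth, one commonly used option for the Spark standalone master
itself is ZooKeeper-based recovery with one or more standby masters. A minimal
sketch of the relevant spark-env.sh setting, where the zk1/zk2/zk3 hostnames
are placeholders for an existing ZooKeeper ensemble:

# Set on every machine that may run a Master (active or standby).
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"

Workers and applications can then be pointed at a multi-master URL such as
spark://master1:7077,master2:7077 (master1/master2 likewise being
placeholders), so they fail over to whichever master ZooKeeper elects leader.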


Can't find directory after resetting REPL state

2015-08-15 Thread Kevin Jung
Spark shell can't find the base directory of the class server after running
the :reset command.
scala> :reset
scala> 1
uncaught exception during compilation: java.lang.AssertionError
java.lang.AssertionError: assertion failed: Tried to find '$line33' in
'/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory
~~~impossible to command anymore~~~
I figured out that the reset() method in SparkIMain tries to delete
virtualDirectory and then create it again, but virtualDirectory.create()
makes a file, not a directory.
Does anyone face the same problem under Spark 1.4.0?

Kevin

Difference between Sort based and Hash based shuffle

2015-08-15 Thread Muhammad Haseeb Javed
What are the major differences between how sort-based and hash-based shuffle
operate, and what is it that causes sort shuffle to perform better than hash?
Are there any talks that discuss both shuffles in detail, how they are
implemented, and the performance gains?