RE: RDD.join vs spark SQL join

2015-08-15 Thread Xiao JIANG
Thank you Akhil! Date: Fri, 14 Aug 2015 14:51:56 +0530 Subject: Re: RDD.join vs spark SQL join From: ak...@sigmoidanalytics.com To: jiangxia...@outlook.com CC: user@spark.apache.org Both work the same way, but with Spark SQL you also get the optimizations done by Catalyst. One important

Re: Can't understand the size of raw RDD and its DataFrame

2015-08-15 Thread Rishi Yadav
Why are you expecting the footprint of the DataFrame to be lower when it contains more information (RDD + schema)? On Sat, Aug 15, 2015 at 6:35 PM, Todd bit1...@163.com wrote: Hi, With the following code snippet, I cached the raw RDD (which is already in memory, but just for illustration) and its

Can't understand the size of raw RDD and its DataFrame

2015-08-15 Thread Todd
Hi, With the following code snippet, I cached the raw RDD (which is already in memory, but just for illustration) and its DataFrame. I thought that the DataFrame cache would take less space than the RDD cache, but that is wrong: in the UI I see that the RDD cache takes 168B, while the DataFrame cache takes

Re: Difference between Sort based and Hash based shuffle

2015-08-15 Thread Ravi Kiran
Have a look at this presentation: http://www.slideshare.net/colorant/spark-shuffle-introduction . It may be of help to you. On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed 11besemja...@seecs.edu.pk wrote: What are the major differences between how Sort based and Hash based shuffle operate

TestSQLContext compilation error when run SparkPi in Intellij ?

2015-08-15 Thread canan chen
I imported the Spark source code into IntelliJ and want to run SparkPi in IntelliJ, but I get the following weird compilation error. I googled it, and sbt clean doesn't work for me. I am not sure whether anyone else has met this issue; any help is appreciated. Error:scalac: while compiling:

spark on yarn is slower than spark-ec2 standalone, how to tune?

2015-08-15 Thread AlexG
I'm using a manual installation of Spark under YARN to run a 30-node r3.8xlarge EC2 cluster (each node has 244 GB RAM, 600 GB SSD). All my code runs much faster on a cluster launched with the spark-ec2 script, but there's a mysterious problem with nodes becoming inaccessible, so I switched to using

Re:Re: Can't understand the size of raw RDD and its DataFrame

2015-08-15 Thread Todd
I thought that the DataFrame contains only one column, and actually only one resulting row (select avg(age) from theTable). So I would think that it would take less space. It looks like my understanding is wrong? At 2015-08-16 12:34:31, Rishi Yadav ri...@infoobjects.com wrote: why are you

Re: TestSQLContext compilation error when run SparkPi in Intellij ?

2015-08-15 Thread Andrew Or
Hi Canan, TestSQLContext is no longer a singleton but now a class. It was never meant to be a fully public API, but if you wish to use it you can just instantiate a new one: val sqlContext = new TestSQLContext or just create a new SQLContext from a SparkContext. -Andrew 2015-08-15 20:33
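
The second option in the reply above (creating a SQLContext from a SparkContext) might look roughly like the sketch below. It is only an illustration under assumptions, not code from the thread: it uses the Spark 1.x SQLContext constructor that was current at the time, and the object name, the application name, the local[2] master and the sample data are all hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical object name; not part of the Spark source tree.
object SqlContextSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two threads, just for trying things out in an IDE.
    val conf = new SparkConf().setAppName("SqlContextSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // A plain SQLContext built from the SparkContext, instead of TestSQLContext.
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      // Made-up sample rows, only to show that the context is usable.
      val df = sc.parallelize(Seq(("a", 1), ("b", 2))).toDF("name", "value")
      df.show()
    } finally {
      // Always release the SparkContext, even if the job fails.
      sc.stop()
    }
  }
}
```

Unlike TestSQLContext, a plain SQLContext is part of the public API, so code written this way does not depend on test-only classes.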

Re: TestSQLContext compilation error when run SparkPi in Intellij ?

2015-08-15 Thread canan chen
I am not sure what other people's Spark debugging environment looks like (I mean for the master branch). Can anyone share their experience? On Sun, Aug 16, 2015 at 10:40 AM, canan chen ccn...@gmail.com wrote: I import the spark source code to intellij, and want to run SparkPi in intellij, but meet the

Re: Can't find directory after resetting REPL state

2015-08-15 Thread Ted Yu
I tried with the master branch and got the following: http://pastebin.com/2nhtMFjQ FYI On Sat, Aug 15, 2015 at 1:03 AM, Kevin Jung itsjb.j...@samsung.com wrote: Spark shell can't find base directory of class server after running :reset command. scala> :reset scala> 1 uncaught exception during

How to run spark in standalone mode on cassandra with high availability?

2015-08-15 Thread Vikram Kone
Hi, We are planning to install Spark in standalone mode on a Cassandra cluster. The problem is that Cassandra has a no-SPOF architecture, i.e. any node can become the master for the cluster, which creates a problem for the Spark master, since Spark is not a peer-to-peer architecture where any node can become the

Can't find directory after resetting REPL state

2015-08-15 Thread Kevin Jung
Spark shell can't find base directory of class server after running :reset command. scala> :reset scala> 1 uncaught exception during compilation: java.lang.AssertionError java.lang.AssertionError: assertion failed: Tried to find '$line33' in '/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3'

Difference between Sort based and Hash based shuffle

2015-08-15 Thread Muhammad Haseeb Javed
What are the major differences between how sort-based and hash-based shuffle operate, and what causes sort shuffle to perform better than hash shuffle? Are there any talks that discuss both shuffles in detail, including how they are implemented and the performance gains?