RE: RDD.join vs Spark SQL join
Thank you Akhil!

Date: Fri, 14 Aug 2015 14:51:56 +0530
Subject: Re: RDD.join vs Spark SQL join
From: ak...@sigmoidanalytics.com
To: jiangxia...@outlook.com
CC: user@spark.apache.org

Both work the same way, but with Spark SQL you also get the optimizations done by Catalyst. One important thing to consider when you are doing RDD.join is the number of partitions and the key distribution: if the keys are not evenly distributed across machines, you can see the process choking on a single task (that is, one task takes far longer to execute than the others in its stage).

Thanks
Best Regards

On Fri, Aug 14, 2015 at 1:25 AM, Xiao JIANG jiangxia...@outlook.com wrote:
> Hi,
> May I know the performance difference between the rdd.join function and the Spark SQL join operation? If I want to join several big RDDs, how should I decide which one to use? What are the factors to consider here? Thanks!
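To make the comparison concrete, here is a minimal sketch of both joins (Spark 1.4-era API; the data and column names are made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("join-demo"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (2, "c")))
    val right = sc.parallelize(Seq((1, "x"), (2, "y")))

    // RDD join: a plain shuffle on the key. No optimizer is involved, so the
    // partition count and any key skew are entirely your responsibility.
    val rddJoined = left.join(right, 8)   // 8 = number of result partitions

    // DataFrame join: goes through the Catalyst optimizer, which chooses the
    // physical strategy (e.g. broadcasting a small side) for you.
    val dfJoined = left.toDF("id", "l").join(right.toDF("id", "r"), "id")
    dfJoined.explain()                    // print the plan Catalyst picked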
Re: Can't understand the size of raw RDD and its DataFrame
Why are you expecting the footprint of the DataFrame to be lower when it contains more information (RDD + schema)?

On Sat, Aug 15, 2015 at 6:35 PM, Todd bit1...@163.com wrote:
> Hi,
> With the following code snippet, I cached the raw RDD (which is already in memory, but just for illustration) and its DataFrame. I thought the DataFrame cache would take less space than the RDD cache, but from the UI I see the RDD cache takes 168B while the DataFrame cache takes 272B. [...]
Can't understand the size of raw RDD and its DataFrame
Hi,
With the following code snippet, I cached the raw RDD (which is already in memory, but just for illustration) and its DataFrame. I thought the DataFrame cache would take less space than the RDD cache, but that turns out to be wrong: from the UI I see the RDD cache takes 168B while the DataFrame cache takes 272B. What data is cached when df.cache is called, and when is the data actually cached? It looks like the df only caches the avg(age), which should be much smaller in size.

    val conf = new SparkConf().setMaster("local").setAppName("SparkSQL_Cache")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val rdd = sc.parallelize(Array(Student("Jack", 21), Student("Mary", 22)))
    rdd.cache
    rdd.toDF().registerTempTable("TBL_STUDENT")
    val df = sqlContext.sql("select avg(age) from TBL_STUDENT")
    df.cache()
    df.show
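The snippet assumes a Student case class whose definition wasn't included in the mail; a minimal definition (an assumption) would be:

    case class Student(name: String, age: Int)

Note also that both rdd.cache and df.cache() are lazy: nothing is materialized until an action such as df.show runs, which is why the cached sizes only appear in the UI after the job executes.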
Re: Difference between Sort based and Hash based shuffle
Have a look at this presentation: http://www.slideshare.net/colorant/spark-shuffle-introduction. It can be of help to you.

On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed 11besemja...@seecs.edu.pk wrote:
> What are the major differences between how sort-based and hash-based shuffle operate, and what is it that causes sort shuffle to perform better than hash? Are there any talks that discuss both shuffles in detail, how they are implemented, and the performance gains?
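For hands-on comparison, Spark 1.x also exposes the choice via the spark.shuffle.manager setting (sort has been the default since 1.2); a minimal sketch:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-compare")
      .setMaster("local[4]")
      .set("spark.shuffle.manager", "hash")          // or "sort", the default since Spark 1.2
      .set("spark.shuffle.consolidateFiles", "true") // hash-shuffle option that merges map output files

    val sc = new SparkContext(conf)
    // Any wide transformation exercises the configured shuffle:
    sc.parallelize(1 to 1000000).map(i => (i % 100, 1)).reduceByKey(_ + _).count()

Roughly speaking, hash shuffle writes one file per map task per reduce partition (hence many small files), while sort shuffle writes a single sorted, indexed file per map task, which is much gentler on the filesystem at scale.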
TestSQLContext compilation error when running SparkPi in IntelliJ?
I imported the Spark source code into IntelliJ and want to run SparkPi in IntelliJ, but I hit the following weird compilation error. I googled it, and sbt clean doesn't work for me. I am not sure whether anyone else has hit this issue; any help is appreciated.

    Error:scalac: while compiling: /Users/root/github/spark/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala
    during phase: jvm
    library version: version 2.10.4
    compiler version: version 2.10.4
    reconstructed args: -nobootcp -javabootclasspath : -deprecation -feature -classpath
Spark on YARN is slower than spark-ec2 standalone, how to tune?
I'm using a manual installation of Spark on YARN on a 30-node r3.8xlarge EC2 cluster (each node has 244GB RAM, 600GB SSD). All my code runs much faster on a cluster launched with the spark-ec2 script, but there's a mysterious problem with nodes becoming inaccessible, so I switched to Spark on YARN because I figured YARN wouldn't let Spark eat up all the resources and render a machine inaccessible. So far, this seems to be the case. Now my code runs to completion, but much slower, so I'm wondering how I can tune my Spark-on-YARN installation to make it as fast as the standalone Spark install.

The current code I'm interested in speeding up just loads a dense 1TB matrix from Parquet format and then computes a low-rank approximation by essentially doing a bunch of distributed matrix multiplies. Before, my code completed in half an hour from loading to writing the output; now I expect it to take 4 or so hours.

My spark-submit options are:

    --master yarn \
    --num-executors 29 \
    --driver-memory 180G \
    --executor-memory 180G \
    --conf spark.eventLog.enabled=true \
    --conf spark.eventLog.dir=$LOGDIR \
    --conf spark.driver.maxResultSize=50G \
    --conf spark.task.maxFailures=4 \
    --conf spark.worker.timeout=120 \
    --conf spark.network.timeout=120 \

The huge timeouts were necessary on EC2 to avoid losing executors; I'm not sure they've remained necessary after switching to YARN. My yarn-site.xml has the following settings:

    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>236000</value>
    </property>
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>59000</value>
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>22</value>
    </property>

Any suggestions?
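One interaction worth double-checking (a general Spark-on-YARN sizing rule, not something stated in this thread): each executor container must fit --executor-memory plus spark.yarn.executor.memoryOverhead under yarn.scheduler.maximum-allocation-mb, or YARN will reject or cap the request. A sketch of the arithmetic:

    // Spark-on-YARN container sizing (Spark 1.x; the overhead default is a
    // fraction of executor memory, roughly 7-10% depending on version, with
    // a 384 MB floor).
    val executorMemoryMb = 180 * 1024                              // --executor-memory 180G
    val overheadMb       = math.max((0.07 * executorMemoryMb).toInt, 384)
    val containerMb      = executorMemoryMb + overheadMb           // what YARN must grant

    // containerMb must be <= yarn.scheduler.maximum-allocation-mb and
    // <= yarn.nodemanager.resource.memory-mb on every node.
    println(s"each executor container needs about $containerMb MB")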
Re: Re: Can't understand the size of raw RDD and its DataFrame
I thought that the df contains only one column, and actually only one resulting row (select avg(age) from the table). So I would think it would take less space; it looks like my understanding is wrong?

At 2015-08-16 12:34:31, Rishi Yadav ri...@infoobjects.com wrote:
> Why are you expecting the footprint of the DataFrame to be lower when it contains more information (RDD + schema)? [...]
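One plausible explanation (an inference, not confirmed in the thread): df.cache() stores the query result in Spark SQL's in-memory columnar format, which carries fixed per-batch metadata such as column statistics, so for a single-row result that overhead can exceed the raw row data and make the DataFrame cache report larger than the RDD cache. The cached operator is visible in the physical plan:

    // After an action has materialized the cache, the plan should show an
    // in-memory scan (InMemoryColumnarTableScan in Spark 1.x):
    df.cache()
    df.show()
    println(df.queryExecution.executedPlan)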
Re: TestSQLContext compilation error when running SparkPi in IntelliJ?
Hi Canan,

TestSQLContext is no longer a singleton but now a class. It was never meant to be a fully public API, but if you wish to use it you can just instantiate a new one:

    val sqlContext = new TestSQLContext

or just create a new SQLContext from a SparkContext.

-Andrew

2015-08-15 20:33 GMT-07:00 canan chen ccn...@gmail.com:
> I am not sure about other people's Spark debugging environment (I mean for the master branch). Can anyone share their experience? [...]
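A minimal sketch of the second suggestion, building a plain SQLContext instead of relying on the test class (the app name here is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("repro"))
    val sqlContext = new SQLContext(sc)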
Re: TestSQLContext compilation error when running SparkPi in IntelliJ?
I am not sure about other people's Spark debugging environment (I mean for the master branch). Can anyone share their experience?

On Sun, Aug 16, 2015 at 10:40 AM, canan chen ccn...@gmail.com wrote:
> I imported the Spark source code into IntelliJ and want to run SparkPi in IntelliJ, but I hit the following weird compilation error. I googled it, and sbt clean doesn't work for me. [...]
Re: Can't find directory after resetting REPL state
I tried with the master branch and got the following: http://pastebin.com/2nhtMFjQ

FYI

On Sat, Aug 15, 2015 at 1:03 AM, Kevin Jung itsjb.j...@samsung.com wrote:
> Spark shell can't find the base directory of the class server after running the :reset command. [...]
How to run Spark in standalone mode on Cassandra with high availability?
Hi,

We are planning to install Spark in standalone mode on a Cassandra cluster. The problem is that Cassandra has a no-SPOF architecture (any node can become the master for the cluster), and this creates a problem for the Spark master, since Spark is not a peer-to-peer architecture where any node can become the master. What are our options here? Are there any frameworks or tools out there that would allow any application to run on a cluster of machines with high availability?
Can't find directory after resetting REPL state
Spark shell can't find the base directory of the class server after running the :reset command.

    scala> :reset
    scala> 1
    uncaught exception during compilation: java.lang.AssertionError
    java.lang.AssertionError: assertion failed: Tried to find '$line33' in '/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory

~~~impossible to run any command anymore~~~

I figured out that the reset() method in SparkIMain tries to delete virtualDirectory and then create it again, but virtualDirectory.create() makes a file, not a directory. Has anyone faced the same problem under Spark 1.4.0?

Kevin
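To illustrate the file-versus-directory distinction Kevin describes, here is a plain java.io.File analogy (a hypothetical sketch; SparkIMain actually uses the Scala compiler's virtual-directory abstraction, not java.io.File):

    import java.io.File

    val base = new File("/tmp/spark-demo/classes")
    base.getParentFile.mkdirs()
    base.createNewFile()        // creates a *file* named "classes"
    println(base.isDirectory)   // false -> lookups of "$lineNN" inside it fail,
                                // matching "... but it is not a directory"
    base.delete()
    base.mkdirs()               // what the class server actually needs
    println(base.isDirectory)   // true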
Difference between Sort based and Hash based shuffle
What are the major differences between how sort-based and hash-based shuffle operate, and what is it that causes sort shuffle to perform better than hash? Are there any talks that discuss both shuffles in detail, how they are implemented, and the performance gains?