Re: Submitting spark jobs through yarn-client
Took me just about all night (it's 3am here in EST) but I finally figured out how to get this working. I pushed up my example code for others who may be struggling with this same problem. It really took an understanding of how the classpath needs to be configured both in YARN and in the client driver application. Here's the example code on github: https://github.com/cjnolet/spark-jetty-server

On Fri, Jan 2, 2015 at 11:35 PM, Corey Nolet cjno...@gmail.com wrote: So looking @ the actual code- I see where it looks like --class 'notused' --jar null is set in ClientBase.scala when YARN is being run in client mode. One thing I noticed is that the jar is being set by trying to grab the jar's URI from the classpath resources- in this case I think it's finding the spark-yarn jar instead of spark-assembly, so when it tries to run ExecutorLauncher.scala, none of the core classes (like org.apache.spark.Logging) are going to be available on the classpath. I hope this is the root of the issue. I'll keep this thread updated with my findings.

On Fri, Jan 2, 2015 at 5:46 PM, Corey Nolet cjno...@gmail.com wrote: ... and looking even further, it looks like the actual command that's executed to start up the JVM running org.apache.spark.deploy.yarn.ExecutorLauncher is passing in --class 'notused' --jar null. I would assume this isn't expected, but I don't see where to set these properties or why they aren't making it through.

On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet cjno...@gmail.com wrote: Looking a little closer @ the launch_container.sh file, it appears to be adding a $PWD/__app__.jar to the classpath, but there is no __app__.jar in the directory pointed to by PWD. Any ideas?

On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to get a SparkContext going in a web container which is being submitted through yarn-client. I'm trying two different approaches and both seem to result in the same error from the YARN nodemanagers:

1) I'm newing up a SparkContext directly, manually adding all the lib jars from Spark and Hadoop to the setJars() method on the SparkConf.
2) I'm using SparkSubmit.main() to pass the classname and the jar containing my code.

When YARN tries to create the container, I get an exception in the driver: "Yarn application already ended, might be killed or not able to launch application master." When I look into the logs for the nodemanager, I see NoClassDefFoundError: org/apache/spark/Logging. Looking closer @ the contents of the nodemanagers, I see that the Spark YARN jar was renamed to __spark__.jar and placed in the app cache, while the rest of the libraries I specified via setJars() were all placed in the file cache. Any ideas as to what may be happening? I even tried adding the spark-core dependency and uber-jarring my own classes so that the dependencies would be there when YARN tries to create the container.
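For reference, a minimal sketch of the kind of yarn-client setup being described, i.e. making sure the YARN containers can see the Spark assembly while setJars() ships only the application's own classes. The jar paths, config values and object name below are illustrative assumptions, not taken from the linked project:

import org.apache.spark.{SparkConf, SparkContext}

object EmbeddedSparkContext {
  def create(): SparkContext = {
    val conf = new SparkConf()
      .setMaster("yarn-client")
      .setAppName("jetty-embedded-spark")
      // Point the YARN containers at the Spark assembly so classes like
      // org.apache.spark.Logging are on their classpath (path is hypothetical).
      .set("spark.yarn.jar", "hdfs:///libs/spark-assembly-1.2.0-hadoop2.4.0.jar")
      // Ship the web app's own uber-jar so executors can load the application classes.
      .setJars(Seq("/opt/webapp/lib/my-app-uber.jar"))
    new SparkContext(conf)
  }
}

The point of the split is that the assembly jar covers Spark's own classes for the ExecutorLauncher, while setJars() only needs to cover the application code.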
Re: Is it possible to do incremental training using ALSModel (MLlib)?
Yes, it is easy to simply start a new factorization from the current model solution. It works well. That's more like incremental *batch* rebuilding of the model. That is not in MLlib but fairly trivial to add. You can certainly 'fold in' new data to approximately update with one new datum too, which you can find online. This is not quite the same idea as streaming SGD. I'm not sure this fits the RDD model well since it entails updating one element at a time, but mini-batch could be reasonable.

On Jan 3, 2015 5:29 AM, Peng Cheng rhw...@gmail.com wrote: I was under the impression that ALS wasn't designed for it :- The famous eBay online recommender uses SGD. However, you can try using the previous model as a starting point, and gradually reduce the number of iterations after the model stabilizes. I never verified this idea, so you need to at least cross-validate it before putting it into production.

On 2 January 2015 at 04:40, Wouter Samaey wouter.sam...@storefront.be wrote: Hi all, I'm curious about MLlib and whether it is possible to do incremental training on the ALSModel. Usually training is run first, and then you can query. But in my case, data is collected in real-time and I want the predictions of my ALSModel to consider the latest data without a complete re-training phase. I've checked out these resources, but could not find any info on how to solve this: https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html My question fits in a larger picture where I'm using Prediction IO, and this in turn is based on Spark. Thanks in advance for any advice! Wouter -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-do-incremental-training-using-ALSModel-MLlib-tp20942.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
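A minimal sketch of the "incremental batch" idea described above, assuming MLlib's standard ALS API. Warm-starting from the previous factors is not exposed by the public API, so this simply re-runs the factorization over old plus new data; the rank, iteration and lambda values are illustrative:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

def rebuildModel(history: RDD[Rating], fresh: RDD[Rating]): MatrixFactorizationModel = {
  val all = history.union(fresh).cache()
  // Fewer iterations may suffice once the data only changes slightly
  // between rebuilds, per the suggestion in the thread.
  ALS.train(all, 10, 5, 0.01)
}

Run on a schedule, this gives periodic rebuilds that fold in the latest data, rather than true online updates.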
Re: DAG info
Hi, You can turn off these messages using log4j.properties. On Fri, Jan 2, 2015 at 1:51 PM, Robineast robin.e...@xense.co.uk wrote: Do you have some example code of what you are trying to do? Robin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DAG-info-tp20940p20941.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Regards, Madhukara Phatak http://www.madhukaraphatak.com
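A minimal sketch of the log4j.properties change being suggested, assuming the default conf/log4j.properties template; the exact logger to quieten depends on which class emits the DAG messages:

# conf/log4j.properties
log4j.rootCategory=WARN, console
# or target only the scheduler output:
log4j.logger.org.apache.spark.scheduler.DAGScheduler=WARN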
getting number of partition per topic in kafka
Hi experts! I am currently working on Spark Streaming with Kafka. I have a couple of questions related to this task. 1) Is there a way to find the number of partitions given a topic name? 2) Is there a way to detect whether the Kafka server is running or not? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/getting-number-of-partition-per-topic-in-kafka-tp20952.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Is it possible to do incremental training using ALSModel (MLlib)?
Do you know a place where I could find a sample or tutorial for this? I'm still very new at this. And struggling a bit... Thanks in advance Wouter Sent from my iPhone. On 03 Jan 2015, at 10:36, Sean Owen so...@cloudera.com wrote: Yes, it is easy to simply start a new factorization from the current model solution. It works well. That's more like incremental *batch* rebuilding of the model. That is not in MLlib but fairly trivial to add. You can certainly 'fold in' new data to approximately update with one new datum too, which you can find online. This is not quite the same idea as streaming SGD. I'm not sure this fits the RDD model well since it entails updating one element at a time but mini batch could be reasonable. On Jan 3, 2015 5:29 AM, Peng Cheng rhw...@gmail.com wrote: I was under the impression that ALS wasn't designed for it :- The famous ebay online recommender uses SGD However, you can try using the previous model as starting point, and gradually reduce the number of iteration after the model stablize. I never verify this idea, so you need to at least cross-validate it before putting into productio On 2 January 2015 at 04:40, Wouter Samaey wouter.sam...@storefront.be wrote: Hi all, I'm curious about MLlib and if it is possible to do incremental training on the ALSModel. Usually training is run first, and then you can query. But in my case, data is collected in real-time and I want the predictions of my ALSModel to consider the latest data without complete re-training phase. I've checked out these resources, but could not find any info on how to solve this: https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html My question fits in a larger picture where I'm using Prediction IO, and this in turn is based on Spark. Thanks in advance for any advice! Wouter -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-do-incremental-training-using-ALSModel-MLlib-tp20942.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Unable to build spark from source
Hi Sean, Initially I thought on the lines of bit1...@163.com but I just changed how I connect to the internet. I ran sc.parallelize(1 to 1000).count() and it seemed to work. Another quick question on the development workflow. What is the best way to rebuild once I make a few modifications to the source code? On Sat, Jan 3, 2015 at 4:09 PM, bit1...@163.com bit1...@163.com wrote: The error hints that the maven module scala-compiler can't be fetched from repo1.maven.org. Should some repositoy urls be added to the Maven's settings file? -- bit1...@163.com *From:* Manoj Kumar manojkumarsivaraj...@gmail.com *Date:* 2015-01-03 18:46 *To:* user user@spark.apache.org *Subject:* Unable to build spark from source Hello, I tried to build Spark from source using this command (all dependencies installed) but it fails this error. Any help would be appreciated. mvn -DskipTests clean package [INFO] Spark Project Parent POM .. FAILURE [28:14.408s] [INFO] Spark Project Networking .. SKIPPED [INFO] Spark Project Shuffle Streaming Service ... SKIPPED [INFO] Spark Project Core SKIPP INFO] BUILD FAILURE [INFO] [INFO] Total time: 28:15.136s [INFO] Finished at: Sat Jan 03 15:26:31 IST 2015 [INFO] Final Memory: 20M/143M [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-parent: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed: Plugin net.alchim31.maven:scala-maven-plugin:3.2.0 or one of its dependencies could not be resolved: The following artifacts could not be resolved: org.scala-lang:scala-compiler:jar:2.10.3, org.scala-lang:scala-reflect:jar:2.10.3: Could not transfer artifact org.scala-lang:scala-compiler:jar:2.10.3 from/to central ( https://repo1.maven.org/maven2): GET request of: org/scala-lang/scala-compiler/2.10.3/scala-compiler-2.10.3.jar from central failed: Connection reset - [Help 1] -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
Re: Unable to build spark from source
This indicates a network problem in getting third party artifacts. Is there a proxy you need to go through? On Jan 3, 2015 11:17 AM, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: Hello, I tried to build Spark from source using this command (all dependencies installed) but it fails this error. Any help would be appreciated. mvn -DskipTests clean package [INFO] Spark Project Parent POM .. FAILURE [28:14.408s] [INFO] Spark Project Networking .. SKIPPED [INFO] Spark Project Shuffle Streaming Service ... SKIPPED [INFO] Spark Project Core SKIPP INFO] BUILD FAILURE [INFO] [INFO] Total time: 28:15.136s [INFO] Finished at: Sat Jan 03 15:26:31 IST 2015 [INFO] Final Memory: 20M/143M [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-parent: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed: Plugin net.alchim31.maven:scala-maven-plugin:3.2.0 or one of its dependencies could not be resolved: The following artifacts could not be resolved: org.scala-lang:scala-compiler:jar:2.10.3, org.scala-lang:scala-reflect:jar:2.10.3: Could not transfer artifact org.scala-lang:scala-compiler:jar:2.10.3 from/to central ( https://repo1.maven.org/maven2): GET request of: org/scala-lang/scala-compiler/2.10.3/scala-compiler-2.10.3.jar from central failed: Connection reset - [Help 1] -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
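If a proxy is indeed the issue, the standard Maven fix is a proxy block in ~/.m2/settings.xml; a sketch with placeholder host and port:

<settings>
  <proxies>
    <proxy>
      <id>corp-http</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>proxy.example.com</host>
      <port>8080</port>
      <nonProxyHosts>localhost|127.0.0.1</nonProxyHosts>
    </proxy>
  </proxies>
</settings>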
Re: Unable to build spark from source
The error hints that the maven module scala-compiler can't be fetched from repo1.maven.org. Should some repositoy urls be added to the Maven's settings file? bit1...@163.com From: Manoj Kumar Date: 2015-01-03 18:46 To: user Subject: Unable to build spark from source Hello, I tried to build Spark from source using this command (all dependencies installed) but it fails this error. Any help would be appreciated. mvn -DskipTests clean package [INFO] Spark Project Parent POM .. FAILURE [28:14.408s] [INFO] Spark Project Networking .. SKIPPED [INFO] Spark Project Shuffle Streaming Service ... SKIPPED [INFO] Spark Project Core SKIPP INFO] BUILD FAILURE [INFO] [INFO] Total time: 28:15.136s [INFO] Finished at: Sat Jan 03 15:26:31 IST 2015 [INFO] Final Memory: 20M/143M [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-parent: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed: Plugin net.alchim31.maven:scala-maven-plugin:3.2.0 or one of its dependencies could not be resolved: The following artifacts could not be resolved: org.scala-lang:scala-compiler:jar:2.10.3, org.scala-lang:scala-reflect:jar:2.10.3: Could not transfer artifact org.scala-lang:scala-compiler:jar:2.10.3 from/to central (https://repo1.maven.org/maven2): GET request of: org/scala-lang/scala-compiler/2.10.3/scala-compiler-2.10.3.jar from central failed: Connection reset - [Help 1] -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
RE: Publishing streaming results to web interface
Is this through a streaming app? I've done this before by publishing results out to a queue or message bus, with a web app listening on the other end. If it's just batch or infrequent you could save the results out to a file. From: tfrisk tfris...@gmail.com Sent: 1/2/2015 5:47 PM To: user@spark.apache.org Subject: Publishing streaming results to web interface Hi, New to Spark so just feeling my way in using it on a standalone server under Linux. I'm using Scala to store running count totals of certain tokens in my streaming data and publishing a top 10 list, e.g. (TokenX,count) (TokenY,count) ... At the moment this is just being printed to stdout using print(), but I want to be able to view these running counts from a web page - I'm just not sure where to start with this. Can anyone advise or point me to examples of how this might be achieved? Many Thanks, Thomas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Publishing-streaming-results-to-web-interface-tp20948.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
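A minimal sketch of the simplest variant, writing each batch's top 10 somewhere a web page can poll. The output path is a placeholder, and a message queue or key-value store would slot in where the PrintWriter is:

import java.io.PrintWriter
import org.apache.spark.streaming.dstream.DStream

def publishTopTen(counts: DStream[(String, Long)]): Unit = {
  counts.foreachRDD { rdd =>
    val top = rdd.top(10)(Ordering.by[(String, Long), Long](_._2))  // computed on the cluster
    val writer = new PrintWriter("/var/www/data/top-tokens.txt")    // runs on the driver
    try top.foreach { case (token, n) => writer.println(s"$token\t$n") }
    finally writer.close()
  }
}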
Unable to build spark from source
Hello, I tried to build Spark from source using this command (all dependencies installed) but it fails with this error. Any help would be appreciated.

mvn -DskipTests clean package

[INFO] Spark Project Parent POM .......................... FAILURE [28:14.408s]
[INFO] Spark Project Networking .......................... SKIPPED
[INFO] Spark Project Shuffle Streaming Service ........... SKIPPED
[INFO] Spark Project Core ................................ SKIPPED
[INFO] BUILD FAILURE
[INFO] Total time: 28:15.136s
[INFO] Finished at: Sat Jan 03 15:26:31 IST 2015
[INFO] Final Memory: 20M/143M
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-parent: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed: Plugin net.alchim31.maven:scala-maven-plugin:3.2.0 or one of its dependencies could not be resolved: The following artifacts could not be resolved: org.scala-lang:scala-compiler:jar:2.10.3, org.scala-lang:scala-reflect:jar:2.10.3: Could not transfer artifact org.scala-lang:scala-compiler:jar:2.10.3 from/to central (https://repo1.maven.org/maven2): GET request of: org/scala-lang/scala-compiler/2.10.3/scala-compiler-2.10.3.jar from central failed: Connection reset -> [Help 1]

-- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
Re: sqlContext is undefined in the Spark Shell
This is noise, please ignore. I figured out what happened... bit1...@163.com From: bit1...@163.com Date: 2015-01-03 19:03 To: user Subject: sqlContext is undefined in the Spark Shell Hi, In the spark shell, I do the following two things: 1. scala> val cxt = new org.apache.spark.sql.SQLContext(sc) 2. scala> import sqlContext._ The 1st one succeeds while the 2nd one fails with the following error: console:10: error: not found: value sqlContext import sqlContext._ Is there something missing? I am using Spark 1.2.0. Thanks. bit1...@163.com
Re: saveAsTextFile
If you can paste the code here I can certainly help. Also confirm the version of spark you are using Regards Pankaj Infoshore Software India -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p20953.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
sqlContext is undefined in the Spark Shell
Hi, In the spark shell, I do the following two things: 1. scala> val cxt = new org.apache.spark.sql.SQLContext(sc) 2. scala> import sqlContext._ The 1st one succeeds while the 2nd one fails with the following error: console:10: error: not found: value sqlContext import sqlContext._ Is there something missing? I am using Spark 1.2.0. Thanks. bit1...@163.com
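For reference, the error is consistent with the import naming a value that was never defined: step 1 creates a val named cxt, so there is no sqlContext to import from. A sketch of the two ways to line them up:

// import from the val that was actually created...
val cxt = new org.apache.spark.sql.SQLContext(sc)
import cxt._

// ...or name it sqlContext so the usual snippet works
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._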
Re: Submitting spark jobs through yarn-client
thats great. i tried this once and gave up after a few hours. On Sat, Jan 3, 2015 at 2:59 AM, Corey Nolet cjno...@gmail.com wrote: Took me just about all night (it's 3am here in EST) but I finally figured out how to get this working. I pushed up my example code for others who may be struggling with this same problem. It really took an understanding of how the classpath needs to be configured both in YARN and in the client driver application. Here's the example code on github: https://github.com/cjnolet/spark-jetty-server On Fri, Jan 2, 2015 at 11:35 PM, Corey Nolet cjno...@gmail.com wrote: So looking @ the actual code- I see where it looks like --class 'notused' --jar null is set on the ClientBase.scala when yarn is being run in client mode. One thing I noticed is that the jar is being set by trying to grab the jar's uri from the classpath resources- in this case I think it's finding the spark-yarn jar instead of spark-assembly so when it tries to runt the ExecutorLauncher.scala, none of the core classes (like org.apache.spark.Logging) are going to be available on the classpath. I hope this is the root of the issue. I'll keep this thread updated with my findings. On Fri, Jan 2, 2015 at 5:46 PM, Corey Nolet cjno...@gmail.com wrote: .. and looking even further, it looks like the actual command tha'ts executed starting up the JVM to run the org.apache.spark.deploy.yarn.ExecutorLauncher is passing in --class 'notused' --jar null. I would assume this isn't expected but I don't see where to set these properties or why they aren't making it through. On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet cjno...@gmail.com wrote: Looking a little closer @ the launch_container.sh file, it appears to be adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in the directory pointed to by PWD. Any ideas? On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to get a SparkContext going in a web container which is being submitted through yarn-client. I'm trying two different approaches and both seem to be resulting in the same error from the yarn nodemanagers: 1) I'm newing up a spark context direct, manually adding all the lib jars from Spark and Hadoop to the setJars() method on the SparkConf. 2) I'm using SparkSubmit,main() to pass the classname and jar containing my code. When yarn tries to create the container, I get an exception in the driver Yarn application already ended, might be killed or not able to launch application master. When I look into the logs for the nodemanager, I see NoClassDefFoundError: org/apache/spark/Logging. Looking closer @ the contents of the nodemanagers, I see that the spark yarn jar was renamed to __spark__.jar and placed in the app cache while the rest of the libraries I specified via setJars() were all placed in the file cache. Any ideas as to what may be happening? I even tried adding the spark-core dependency and uber-jarring my own classes so that the dependencies would be there when Yarn tries to create the container.
Re: Unable to build spark from source
No, that is part of every Maven build by default. The repo is fine and I (and I assume everyone else) can reach it. How can you run Spark if you can't build it? Are you running something else or did it succeed? The error hints that the maven module scala-compiler can't be fetched from repo1.maven.org. Should some repositoy urls be added to the Maven's settings file? -- bit1...@163.com *From:* Manoj Kumar manojkumarsivaraj...@gmail.com *Date:* 2015-01-03 18:46 *To:* user user@spark.apache.org *Subject:* Unable to build spark from source Hello, I tried to build Spark from source using this command (all dependencies installed) but it fails this error. Any help would be appreciated. mvn -DskipTests clean package [INFO] Spark Project Parent POM .. FAILURE [28:14.408s] [INFO] Spark Project Networking .. SKIPPED [INFO] Spark Project Shuffle Streaming Service ... SKIPPED [INFO] Spark Project Core SKIPP INFO] BUILD FAILURE [INFO] [INFO] Total time: 28:15.136s [INFO] Finished at: Sat Jan 03 15:26:31 IST 2015 [INFO] Final Memory: 20M/143M [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-parent: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed: Plugin net.alchim31.maven:scala-maven-plugin:3.2.0 or one of its dependencies could not be resolved: The following artifacts could not be resolved: org.scala-lang:scala-compiler:jar:2.10.3, org.scala-lang:scala-reflect:jar:2.10.3: Could not transfer artifact org.scala-lang:scala-compiler:jar:2.10.3 from/to central ( https://repo1.maven.org/maven2): GET request of: org/scala-lang/scala-compiler/2.10.3/scala-compiler-2.10.3.jar from central failed: Connection reset - [Help 1] -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
Re: different akka versions and spark
hey Ted, i am aware of the upgrade efforts for akka. however if spark 1.2 forces me to upgrade all our usage of akka to 2.3.x while spark 1.0 and 1.1 force me to use akka 2.2.x then we cannot build one application that runs on all spark 1.x versions, which i would consider a major incompatibility. best, koert On Sat, Jan 3, 2015 at 12:11 AM, Ted Yu yuzhih...@gmail.com wrote: Please see http://akka.io/news/2014/05/22/akka-2.3.3-released.html which points to http://doc.akka.io/docs/akka/2.3.3/project/migration-guide-2.2.x-2.3.x.html?_ga=1.35212129.1385865413.1420220234 Cheers On Fri, Jan 2, 2015 at 9:11 AM, Koert Kuipers ko...@tresata.com wrote: i noticed spark 1.2.0 bumps the akka version. since spark uses it's own akka version, does this mean it can co-exist with another akka version in the same JVM? has anyone tried this? we have some spark apps that also use akka (2.2.3) and spray. if different akka versions causes conflicts then spark 1.2.0 would not be backwards compatible for us... thanks. koert
Re: getting number of partition per topic in kafka
You can use the lowlevel consumer http://github.com/dibbhatt/kafka-spark-consumer for this, it has an API call https://github.com/dibbhatt/kafka-spark-consumer/blob/master/src/main/java/consumer/kafka/DynamicBrokersReader.java#L81 to retrieve the number of partitions from a topic. Easiest way would be to count the depth in: yourZkPath + /topics/ + topicName + /partitions Thanks Best Regards On Sat, Jan 3, 2015 at 1:54 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi experts! I am currently working on spark streaming with kafka. I have couple of questions related to this task. 1) Is there a way to find number of partitions given a topic name? 2)Is there a way to detect whether kafka server is running or not ? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/getting-number-of-partition-per-topic-in-kafka-tp20952.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
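A rough sketch of doing that from Scala with a plain ZkClient. The connection string is a placeholder, and the znode layout assumed is the standard Kafka 0.8 one referenced in the reply above:

import org.I0Itec.zkclient.ZkClient

val zk = new ZkClient("zkhost:2181", 10000, 10000)

def numPartitions(topic: String): Int =
  zk.countChildren(s"/brokers/topics/$topic/partitions")

// crude liveness check: registered broker ids under /brokers/ids
def kafkaLooksAlive(): Boolean = zk.countChildren("/brokers/ids") > 0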
Re: saveAsTextFile
@laila: Based on the error you mentioned in the nabble link below, it seems like there are no permissions to write to HDFS, so this is possibly why saveAsTextFile is failing. From: Pankaj Narang pankajnaran...@gmail.com To: user@spark.apache.org Sent: Saturday, January 3, 2015 4:07 AM Subject: Re: saveAsTextFile If you can paste the code here I can certainly help. Also confirm the version of spark you are using Regards Pankaj Infoshore Software India -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p20953.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark MLlib for Kaggle's CTR challenge
Hi, I entered Kaggle's CTR challenge using the scikit-learn Python framework. Although it gave me a reasonable score, I am wondering about exploring Spark MLlib, which I haven't used before. I also tried Vowpal Wabbit. Can someone who has already worked with MLlib tell me whether Spark MLlib supports online learning or batch SGD, and if so how it performs? I don't have a Spark cluster, just my laptop. Any suggestions? The training data has close to 45 million rows in CSV format and the test data close to 4.2 million rows in the same format. Thanks, Niranjan
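For context, MLlib's linear models train in batch (gradient descent over RDDs) rather than online. A rough local-mode sketch of a logistic-regression CTR baseline; the CSV parsing / feature hashing is left as a placeholder and the iteration count is illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("ctr-baseline"))

def parse(line: String): LabeledPoint = ???  // placeholder: hash the categorical fields into a feature vector

val train = sc.textFile("train.csv").map(parse).cache()
val model = LogisticRegressionWithSGD.train(train, 50)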
Re: Unable to build spark from source
My question was: once I make changes to a source file, is there another command I can rebuild with so that it picks up only the changes (because a full rebuild takes a lot of time)? On Sat, Jan 3, 2015 at 10:40 PM, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: Yes, I've built Spark successfully using the same command (mvn -DskipTests clean package); it built because I no longer work behind a proxy. Thanks. -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
Re: Unable to build spark from source
Yes, I've built Spark successfully using the same command (mvn -DskipTests clean package); it built because I no longer work behind a proxy. Thanks.
[no subject]
Best Regards, Sujeevan. N
Re: Unable to build spark from source
You can use the same build commands, but it's well worth setting up a zinc server if you're doing a lot of builds. That will allow incremental scala builds, which speeds up the process significantly. SPARK-4501 might be of interest too. Simon On 3 Jan 2015, at 17:27, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: My question was that if once I make changes in the source code to a file, do I rebuild it using any other command, such that it takes in only the changes (because it takes a lot of time)? On Sat, Jan 3, 2015 at 10:40 PM, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: Yes, I've built spark successfully, using the same command mvn -DskipTests clean package but it built because now I do not work behind a proxy. Thanks. -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
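A sketch of what that workflow can look like; module names are as they appear in the source tree and the exact zinc invocation may differ by version:

zinc -start                               # run once; Maven's scala plugin will find the running server
mvn -DskipTests -pl core package          # rebuild only spark-core
mvn -DskipTests -pl mllib -am package     # rebuild mllib plus the modules it depends on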
Joining by values
I have two pair RDDs in spark like this

rdd1 =
(1 -> [4,5,6,7])
(2 -> [4,5])
(3 -> [6,7])

rdd2 =
(4 -> [1001,1000,1002,1003])
(5 -> [1004,1001,1006,1007])
(6 -> [1007,1009,1005,1008])
(7 -> [1011,1012,1013,1010])

I would like to combine them to look like this.

joinedRdd =
(1 -> [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013])
(2 -> [1000,1001,1002,1003,1004,1006,1007])
(3 -> [1005,1007,1008,1009,1010,1011,1012,1013])

Can someone suggest me how to do this. Thanks Dilip -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-values-tp20954.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
save rdd to ORC file
Hi Experts, Like saveAsParquetFile on SchemaRDD, is there an equivalent to store to an ORC file? I am using Spark 1.2.0. As per the link below, it looks like it's not part of 1.2.0, so any update on the latest status would be great. https://issues.apache.org/jira/browse/SPARK-2883 Until the next release, is there a workaround to read/write ORC files? Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/save-rdd-to-ORC-file-tp20956.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Joining by values
This is my design. Now let me try and code it in Spark.

rdd1.txt
========
1~4,5,6,7
2~4,5
3~6,7

rdd2.txt
========
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010

TRANSFORM 1
===========
map each value to key (like an inverted index)
4~1
5~1
6~1
7~1
5~2
4~2
6~3
7~3

TRANSFORM 2
===========
Join keys in transform 1 and rdd2
4~1,1001,1000,1002,1003
4~2,1001,1000,1002,1003
5~1,1004,1001,1006,1007
5~2,1004,1001,1006,1007
6~1,1007,1009,1005,1008
6~3,1007,1009,1005,1008
7~1,1011,1012,1013,1010
7~3,1011,1012,1013,1010

TRANSFORM 3
===========
Split key in transform 2 with ~ and keep key(1) i.e. 1,2,3
1~1001,1000,1002,1003
2~1001,1000,1002,1003
1~1004,1001,1006,1007
2~1004,1001,1006,1007
1~1007,1009,1005,1008
3~1007,1009,1005,1008
1~1011,1012,1013,1010
3~1011,1012,1013,1010

TRANSFORM 4
===========
join by key
1~1001,1000,1002,1003,1004,1001,1006,1007,1007,1009,1005,1008,1011,1012,1013,1010
2~1001,1000,1002,1003,1004,1001,1006,1007
3~1007,1009,1005,1008,1011,1012,1013,1010

From: dcmovva dilip.mo...@gmail.com To: user@spark.apache.org Sent: Saturday, January 3, 2015 10:10 AM Subject: Joining by values I have two pair RDDs in spark like this rdd1 = (1 -> [4,5,6,7]) (2 -> [4,5]) (3 -> [6,7]) rdd2 = (4 -> [1001,1000,1002,1003]) (5 -> [1004,1001,1006,1007]) (6 -> [1007,1009,1005,1008]) (7 -> [1011,1012,1013,1010]) I would like to combine them to look like this. joinedRdd = (1 -> [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013]) (2 -> [1000,1001,1002,1003,1004,1006,1007]) (3 -> [1005,1007,1008,1009,1010,1011,1012,1013]) Can someone suggest me how to do this. Thanks Dilip -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-values-tp20954.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
spark.sql.shuffle.partitions parameter
Hello everyone, I'm using Spark SQL and would like to understand how to determine the right value for the spark.sql.shuffle.partitions parameter. For example, if I'm joining two RDDs where the first has 10 partitions and the second 60, how big should this parameter be? Thank you, Yuri
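For reference, spark.sql.shuffle.partitions sets the number of partitions produced by a SQL shuffle (join or aggregation); it is not derived from the 10 or 60 input partitions. A common starting point is a small multiple of the total executor cores, then tune by watching task sizes. A sketch, with an illustrative value:

sqlContext.setConf("spark.sql.shuffle.partitions", "120")
// or when building the SparkConf:
val conf = new org.apache.spark.SparkConf().set("spark.sql.shuffle.partitions", "120")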
Re: spark.akka.frameSize limit error
Which version of Spark are you using? It seems like the issue here is that the map output statuses are too large to fit in the Akka frame size. This issue has been fixed in Spark 1.2 by using a different encoding for map outputs for jobs with many reducers (https://issues.apache.org/jira/browse/SPARK-3613). On earlier Spark versions, your options are either reducing the number of reducers (e.g. by explicitly specifying the number of reducers in the reduceByKey() call) or increasing the Akka frame size (via the spark.akka.frameSize configuration option).

On Sat, Jan 3, 2015 at 10:40 AM, Saeed Shahrivari saeed.shahriv...@gmail.com wrote: Hi, I am trying to get the frequency of each Unicode char in a document collection using Spark. Here is the code snippet that does the job:

JavaPairRDD<LongWritable, Text> rows = sc.sequenceFile(args[0], LongWritable.class, Text.class);
rows = rows.coalesce(1);
JavaPairRDD<Character, Long> pairs = rows.flatMapToPair(t -> {
    String content = t._2.toString();
    Multiset<Character> chars = HashMultiset.create();
    for (int i = 0; i < content.length(); i++) chars.add(content.charAt(i));
    List<Tuple2<Character, Long>> list = new ArrayList<Tuple2<Character, Long>>();
    for (Character ch : chars.elementSet()) {
        list.add(new Tuple2<Character, Long>(ch, (long) chars.count(ch)));
    }
    return list;
});
JavaPairRDD<Character, Long> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.printf("MapCount %,d\n", counts.count());

But, I get the following exception:

15/01/03 21:51:34 ERROR MapOutputTrackerMasterActor: Map output statuses were 11141547 bytes which exceeds spark.akka.frameSize (10485760 bytes). org.apache.spark.SparkException: Map output statuses were 11141547 bytes which exceeds spark.akka.frameSize (10485760 bytes). at org.apache.spark.MapOutputTrackerMasterActor$$anonfun$receiveWithLogging$1.applyOrElse(MapOutputTracker.scala:59) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53) at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.MapOutputTrackerMasterActor.aroundReceive(MapOutputTracker.scala:42) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Would you please tell me where the fault is? If I process fewer rows, there is no problem. However, when the number of rows is large I always get this exception. Thanks beforehand.
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-akka-frameSize-limit-error-tp20955.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
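A sketch of the two workarounds mentioned above; values are illustrative, and the Java API takes the same extra argument (e.g. pairs.reduceByKey(func, 200)):

// 1) fewer/explicit reducers: pass a partition count to reduceByKey
//    (pairs here stands for the pair RDD built in the quoted snippet)
val counts = pairs.reduceByKey(_ + _, 200)

// 2) a larger Akka frame size (value in MB), set on the SparkConf
//    or via --conf spark.akka.frameSize=128 on spark-submit
val conf = new org.apache.spark.SparkConf().set("spark.akka.frameSize", "128")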
Better way of measuring custom application metrics
I have a hack to gather custom application metrics in a Streaming job, but I wanted to know if there is any better way of doing this. My hack consists of this singleton:

object Metriker extends Serializable {
  @transient lazy val mr: MetricRegistry = {
    val metricRegistry = new MetricRegistry()
    val graphiteEndpoint = new InetSocketAddress("ec2-54-220-56-229.eu-west-1.compute.amazonaws.com", 2003)
    GraphiteReporter
      .forRegistry(metricRegistry)
      .build(new Graphite(graphiteEndpoint))
      .start(5, TimeUnit.SECONDS)
    metricRegistry
  }

  @transient lazy val processId = ManagementFactory.getRuntimeMXBean.getName

  @transient lazy val hostId = {
    try {
      InetAddress.getLocalHost.getHostName
    } catch {
      case e: UnknownHostException => "localhost"
    }
  }

  def metricName(name: String): String = {
    "%s.%s.%s".format(name, hostId, processId)
  }
}

which I then use in my jobs like so:

stream.map { i =>
  Metriker.mr.meter(Metriker.metricName("testmetric123")).mark(i)
  i * 2
}

Then I aggregate the metrics on Graphite. This works, but I was curious to know if anyone has a less hacky way.
Spark for core business-logic? - Replacing: MongoDB?
In the middle of doing the architecture for a new project, which has various machine learning and related components, including: recommender systems, search engines and sequence [common intersection] matching. Usually I use: MongoDB (as db), Redis (as cache) and celery (as queue, backed by Redis). Though I don't have experience with Hadoop, I was thinking of using Hadoop for the machine-learning (as this will become a Big Data problem quite quickly). To push the data into Hadoop, I would use a connector of some description, or push the MongoDB backups into HDFS at set intervals. However I was thinking that it might be better to put the whole thing in Hadoop, store all persistent data in Hadoop, and maybe do all the layers in Apache Spark (with caching remaining in Redis). Is that a viable option? - Most of what I see discusses Spark (and Hadoop in general) for analytics only. Apache Phoenix exposes a nice interface for read/write over HBase, so I might use that if Spark ends up being the wrong solution. Thanks for all suggestions, Alec Taylor PS: I need this for both Big and Small data. Note that I am using the Cloudera definition of Big Data referring to processing/storage across more than 1 machine. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: save rdd to ORC file
One way is to use Spark SQL:

scala> sqlContext.sql("create table orc_table(key INT, value STRING) stored as orc")
scala> sqlContext.sql("insert into table orc_table select * from schema_rdd_temp_table")
scala> sqlContext.sql("FROM orc_table select *")

On 4 January 2015 at 00:57, SamyaMaiti samya.maiti2...@gmail.com wrote: Hi Experts, Like saveAsParquetFile on schemaRDD, there is a equivalent to store in ORC file. I am using spark 1.2.0. As per the link below, looks like its not part of 1.2.0, so any latest update would be great. https://issues.apache.org/jira/browse/SPARK-2883 Till the next release, is there a workaround to read/write ORC file. Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/save-rdd-to-ORC-file-tp20956.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Better way of measuring custom application metrics
Hi, I think there’s a StreamingSource in Spark Streaming which exposes the Spark Streaming running status to the metrics sink, you can connect it with Graphite sink to expose metrics to Graphite. I’m not sure is this what you want. Besides you can customize the Source and Sink of the MetricsSystem to build your own and configure it in metrics.properties with class name to let it loaded by metrics system, for the details you can refer to http://spark.apache.org/docs/latest/monitoring.html or source code. Thanks Jerry From: Enno Shioji [mailto:eshi...@gmail.com] Sent: Sunday, January 4, 2015 7:47 AM To: user@spark.apache.org Subject: Better way of measuring custom application metrics I have a hack to gather custom application metrics in a Streaming job, but I wanted to know if there is any better way of doing this. My hack consists of this singleton: object Metriker extends Serializable { @transient lazy val mr: MetricRegistry = { val metricRegistry = new MetricRegistry() val graphiteEndpoint = new InetSocketAddress(ec2-54-220-56-229.eu-west-1.compute.amazonaws.comhttp://ec2-54-220-56-229.eu-west-1.compute.amazonaws.com, 2003) GraphiteReporter .forRegistry(metricRegistry) .build(new Graphite(graphiteEndpoint)) .start(5, TimeUnit.SECONDS) metricRegistry } @transient lazy val processId = ManagementFactory.getRuntimeMXBean.getName @transient lazy val hostId = { try { InetAddress.getLocalHost.getHostName } catch { case e: UnknownHostException = localhost } } def metricName(name: String): String = { %s.%s.%s.format(name, hostId, processId) } } which I then use in my jobs like so: stream .map { i = Metriker.mr.meter(Metriker.metricName(testmetric123)).mark(i) i * 2 } Then I aggregate the metrics on Graphite. This works, but I was curious to know if anyone has a less hacky way. [https://mailfoogae.appspot.com/t?sender=aZXNoaW9qaUBnbWFpbC5jb20%3Dtype=zerocontentguid=29916861-9b4d-423b-8e45-c731deddd43b]ᐧ
Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.
I am running into similar problem. Have you found any resolution to this issue ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Elastic-allocation-spark-dynamicAllocation-enabled-results-in-task-never-being-executed-tp18969p20957.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: building spark1.2 meet error
Thanks, it built successfully. But where is the built zip file? I cannot find a finished .zip or .tar.gz package.

2014-12-31 19:22 GMT+08:00 xhudik [via Apache Spark User List] ml-node+s1001560n20927...@n3.nabble.com: Hi J_soft, for me it is working, I didn't put -Dscala-2.10 -X parameters. I got only one warning, since I don't have hadoop 2.5 it didn't activate this profile:

larix@kovral:~/sources/spark-1.2.0 mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0 -DskipTests clean package
Found 0 infos
Finished in 3 ms
[INFO] Reactor Summary:
[INFO] Spark Project Parent POM ... SUCCESS [ 14.177 s]
[INFO] Spark Project Networking ... SUCCESS [ 14.670 s]
[INFO] Spark Project Shuffle Streaming Service SUCCESS [ 9.030 s]
[INFO] Spark Project Core . SUCCESS [04:42 min]
[INFO] Spark Project Bagel SUCCESS [ 26.184 s]
[INFO] Spark Project GraphX ... SUCCESS [01:07 min]
[INFO] Spark Project Streaming SUCCESS [01:35 min]
[INFO] Spark Project Catalyst . SUCCESS [01:48 min]
[INFO] Spark Project SQL .. SUCCESS [01:55 min]
[INFO] Spark Project ML Library ... SUCCESS [02:17 min]
[INFO] Spark Project Tools SUCCESS [ 15.527 s]
[INFO] Spark Project Hive . SUCCESS [01:43 min]
[INFO] Spark Project REPL . SUCCESS [ 45.154 s]
[INFO] Spark Project YARN Parent POM .. SUCCESS [ 3.885 s]
[INFO] Spark Project YARN Stable API .. SUCCESS [01:00 min]
[INFO] Spark Project Assembly . SUCCESS [ 50.812 s]
[INFO] Spark Project External Twitter . SUCCESS [ 21.401 s]
[INFO] Spark Project External Flume Sink .. SUCCESS [ 25.207 s]
[INFO] Spark Project External Flume ... SUCCESS [ 34.734 s]
[INFO] Spark Project External MQTT SUCCESS [ 22.617 s]
[INFO] Spark Project External ZeroMQ .. SUCCESS [ 22.444 s]
[INFO] Spark Project External Kafka ... SUCCESS [ 33.566 s]
[INFO] Spark Project Examples . SUCCESS [01:23 min]
[INFO] Spark Project YARN Shuffle Service . SUCCESS [ 4.873 s]
[INFO] BUILD SUCCESS
[INFO] Total time: 23:20 min
[INFO] Finished at: 2014-12-31T12:02:32+01:00
[INFO] Final Memory: 76M/855M
[WARNING] The requested profile hadoop-2.5 could not be activated because it does not exist.

If it won't work for you, I'd try to delete all sources, download the source code once more and try again ... good luck, Tomas

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/building-spark1-2-meet-error-tp20853p20958.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Joining by values
hi Take a look at the code here I wrotehttps://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/PairRddJoin.scala /*rdd1.txt 1~4,5,6,7 2~4,5 3~6,7 rdd2.txt 4~1001,1000,1002,1003 5~1004,1001,1006,1007 6~1007,1009,1005,1008 7~1011,1012,1013,1010 */ val sconf = new SparkConf().setMaster(local).setAppName(MedicalSideFx-PairRddJoin) val sc = new SparkContext(sconf) val rdd1 = /path/to/rdd1.txt val rdd2 = /path/to/rdd2.txt val rdd1InvIndex = sc.textFile(rdd1).map(x = (x.split('~')(0), x.split('~')(1))).flatMapValues(str = str.split(',')).map(str = (str._2, str._1)) val rdd2Pair = sc.textFile(rdd2).map(str = (str.split('~')(0), str.split('~')(1))) rdd1InvIndex.join(rdd2Pair).map(str = str._2).groupByKey().collect().foreach(println) This outputs the following . I think this may be essentially what u r looking for(I have to understand how to NOT print as CompactBuffer)(2,CompactBuffer(1001,1000,1002,1003, 1004,1001,1006,1007)) (3,CompactBuffer(1011,1012,1013,1010, 1007,1009,1005,1008)) (1,CompactBuffer(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008)) From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID To: dcmovva dilip.mo...@gmail.com; user@spark.apache.org user@spark.apache.org Sent: Saturday, January 3, 2015 12:19 PM Subject: Re: Joining by values This is my design. Now let me try and code it in Spark. rdd1.txt =1~4,5,6,72~4,53~6,7 rdd2.txt 4~1001,1000,1002,10035~1004,1001,1006,10076~1007,1009,1005,10087~1011,1012,1013,1010 TRANSFORM 1===map each value to key (like an inverted index)4~15~16~17~15~24~26~37~3 TRANSFORM 2===Join keys in transform 1 and rdd24~1,1001,1000,1002,10034~2,1001,1000,1002,10035~1,1004,1001,1006,10075~2,1004,1001,1006,10076~1,1007,1009,1005,10086~3,1007,1009,1005,10087~1,1011,1012,1013,10107~3,1011,1012,1013,1010 TRANSFORM 3===Split key in transform 2 with ~ and keep key(1) i.e. 1,2,31~1001,1000,1002,10032~1001,1000,1002,10031~1004,1001,1006,10072~1004,1001,1006,10071~1007,1009,1005,10083~1007,1009,1005,10081~1011,1012,1013,10103~1011,1012,1013,1010 TRANSFORM 4===join by key 1~1001,1000,1002,1003,1004,1001,1006,1007,1007,1009,1005,1008,1011,1012,1013,10102~1001,1000,1002,1003,1004,1001,1006,10073~1007,1009,1005,1008,1011,1012,1013,1010 From: dcmovva dilip.mo...@gmail.com To: user@spark.apache.org Sent: Saturday, January 3, 2015 10:10 AM Subject: Joining by values I have a two pair RDDs in spark like this rdd1 = (1 - [4,5,6,7]) (2 - [4,5]) (3 - [6,7]) rdd2 = (4 - [1001,1000,1002,1003]) (5 - [1004,1001,1006,1007]) (6 - [1007,1009,1005,1008]) (7 - [1011,1012,1013,1010]) I would like to combine them to look like this. joinedRdd = (1 - [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013]) (2 - [1000,1001,1002,1003,1004,1006,1007]) (3 - [1005,1007,1008,1009,1010,1011,1012,1013]) Can someone suggest me how to do this. Thanks Dilip -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-values-tp20954.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Joining by values
call `map(_.toList)` to convert `CompactBuffer` to `List` Best Regards, Shixiong Zhu 2015-01-04 12:08 GMT+08:00 Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid: hi Take a look at the code here I wrote https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/PairRddJoin.scala /*rdd1.txt 1~4,5,6,7 2~4,5 3~6,7 rdd2.txt 4~1001,1000,1002,1003 5~1004,1001,1006,1007 6~1007,1009,1005,1008 7~1011,1012,1013,1010 */ val sconf = new SparkConf().setMaster(local).setAppName(MedicalSideFx-PairRddJoin) val sc = new SparkContext(sconf) val rdd1 = /path/to/rdd1.txt val rdd2 = /path/to/rdd2.txt val rdd1InvIndex = sc.textFile(rdd1).map(x = (x.split('~')(0), x.split('~')(1))).flatMapValues(str = str.split(',')).map(str = (str._2, str._1)) val rdd2Pair = sc.textFile(rdd2).map(str = (str.split('~')(0), str.split('~')(1))) rdd1InvIndex.join(rdd2Pair).map(str = str._2).groupByKey().collect().foreach(println) This outputs the following . I think this may be essentially what u r looking for (I have to understand how to NOT print as CompactBuffer) (2,CompactBuffer(1001,1000,1002,1003, 1004,1001,1006,1007)) (3,CompactBuffer(1011,1012,1013,1010, 1007,1009,1005,1008)) (1,CompactBuffer(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008)) -- *From:* Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID *To:* dcmovva dilip.mo...@gmail.com; user@spark.apache.org user@spark.apache.org *Sent:* Saturday, January 3, 2015 12:19 PM *Subject:* Re: Joining by values This is my design. Now let me try and code it in Spark. rdd1.txt = 1~4,5,6,7 2~4,5 3~6,7 rdd2.txt 4~1001,1000,1002,1003 5~1004,1001,1006,1007 6~1007,1009,1005,1008 7~1011,1012,1013,1010 TRANSFORM 1 === map each value to key (like an inverted index) 4~1 5~1 6~1 7~1 5~2 4~2 6~3 7~3 TRANSFORM 2 === Join keys in transform 1 and rdd2 4~1,1001,1000,1002,1003 4~2,1001,1000,1002,1003 5~1,1004,1001,1006,1007 5~2,1004,1001,1006,1007 6~1,1007,1009,1005,1008 6~3,1007,1009,1005,1008 7~1,1011,1012,1013,1010 7~3,1011,1012,1013,1010 TRANSFORM 3 === Split key in transform 2 with ~ and keep key(1) i.e. 1,2,3 1~1001,1000,1002,1003 2~1001,1000,1002,1003 1~1004,1001,1006,1007 2~1004,1001,1006,1007 1~1007,1009,1005,1008 3~1007,1009,1005,1008 1~1011,1012,1013,1010 3~1011,1012,1013,1010 TRANSFORM 4 === join by key 1~1001,1000,1002,1003,1004,1001,1006,1007,1007,1009,1005,1008,1011,1012,1013,1010 2~1001,1000,1002,1003,1004,1001,1006,1007 3~1007,1009,1005,1008,1011,1012,1013,1010 -- *From:* dcmovva dilip.mo...@gmail.com *To:* user@spark.apache.org *Sent:* Saturday, January 3, 2015 10:10 AM *Subject:* Joining by values I have a two pair RDDs in spark like this rdd1 = (1 - [4,5,6,7]) (2 - [4,5]) (3 - [6,7]) rdd2 = (4 - [1001,1000,1002,1003]) (5 - [1004,1001,1006,1007]) (6 - [1007,1009,1005,1008]) (7 - [1011,1012,1013,1010]) I would like to combine them to look like this. joinedRdd = (1 - [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013]) (2 - [1000,1001,1002,1003,1004,1006,1007]) (3 - [1005,1007,1008,1009,1010,1011,1012,1013]) Can someone suggest me how to do this. Thanks Dilip -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-values-tp20954.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: building spark1.2 meet error
it should be under ls assembly/target/scala-2.10/* On Sat, Jan 3, 2015 at 10:11 PM, j_soft zsof...@gmail.com wrote: - thanks, it is success builded - .but where is builded zip file? I not find finished .zip or .tar.gz package 2014-12-31 19:22 GMT+08:00 xhudik [via Apache Spark User List] [hidden email] http:///user/SendEmail.jtp?type=nodenode=20958i=0: Hi J_soft, for me it is working, I didn't put -Dscala-2.10 -X parameters. I got only one warning, since I don't have hadoop 2.5 it didn't activate this profile: *larix@kovral:~/sources/spark-1.2.0 mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0 -DskipTests clean package Found 0 infos Finished in 3 ms [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 14.177 s] [INFO] Spark Project Networking ... SUCCESS [ 14.670 s] [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 9.030 s] [INFO] Spark Project Core . SUCCESS [04:42 min] [INFO] Spark Project Bagel SUCCESS [ 26.184 s] [INFO] Spark Project GraphX ... SUCCESS [01:07 min] [INFO] Spark Project Streaming SUCCESS [01:35 min] [INFO] Spark Project Catalyst . SUCCESS [01:48 min] [INFO] Spark Project SQL .. SUCCESS [01:55 min] [INFO] Spark Project ML Library ... SUCCESS [02:17 min] [INFO] Spark Project Tools SUCCESS [ 15.527 s] [INFO] Spark Project Hive . SUCCESS [01:43 min] [INFO] Spark Project REPL . SUCCESS [ 45.154 s] [INFO] Spark Project YARN Parent POM .. SUCCESS [ 3.885 s] [INFO] Spark Project YARN Stable API .. SUCCESS [01:00 min] [INFO] Spark Project Assembly . SUCCESS [ 50.812 s] [INFO] Spark Project External Twitter . SUCCESS [ 21.401 s] [INFO] Spark Project External Flume Sink .. SUCCESS [ 25.207 s] [INFO] Spark Project External Flume ... SUCCESS [ 34.734 s] [INFO] Spark Project External MQTT SUCCESS [ 22.617 s] [INFO] Spark Project External ZeroMQ .. SUCCESS [ 22.444 s] [INFO] Spark Project External Kafka ... SUCCESS [ 33.566 s] [INFO] Spark Project Examples . SUCCESS [01:23 min] [INFO] Spark Project YARN Shuffle Service . SUCCESS [ 4.873 s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 23:20 min [INFO] Finished at: 2014-12-31T12:02:32+01:00 [INFO] Final Memory: 76M/855M [INFO] [WARNING] The requested profile hadoop-2.5 could not be activated because it does not exist.* If it won't work for you. I'd try to delete all sources, download source code once more and try again ... good luck, Tomas -- If you reply to this email, your message will be added to the discussion below: http://apache-spark-user-list.1001560.n3.nabble.com/building-spark1-2-meet-error-tp20853p20927.html To unsubscribe from building spark1.2 meet error, click here. NAML http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewerid=instant_html%21nabble%3Aemail.namlbase=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespacebreadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml -- View this message in context: Re: building spark1.2 meet error http://apache-spark-user-list.1001560.n3.nabble.com/building-spark1-2-meet-error-tp20853p20958.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
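If what you want is a distributable .tgz like the released binaries (rather than just the assembly jar under assembly/target/scala-2.10/), the source tree ships a packaging script; a sketch using the same profiles as the build above:

./make-distribution.sh --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0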
Re: Joining by values
so I changed the code to rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().map(str => (str._1, str._2.toList)).collect().foreach(println) Now it prints Lists. Don't worry, I will work on this so it does not output as List(...). But I am hoping that the JOIN question that @Dilip asked is hopefully answered :-)
(2,List(1001,1000,1002,1003, 1004,1001,1006,1007))
(3,List(1011,1012,1013,1010, 1007,1009,1005,1008))
(1,List(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))
From: Shixiong Zhu zsxw...@gmail.com To: Sanjay Subramanian sanjaysubraman...@yahoo.com Cc: dcmovva dilip.mo...@gmail.com; user@spark.apache.org user@spark.apache.org Sent: Saturday, January 3, 2015 8:15 PM Subject: Re: Joining by values
call `map(_.toList)` to convert `CompactBuffer` to `List` Best Regards, Shixiong Zhu
2015-01-04 12:08 GMT+08:00 Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid: hi Take a look at the code I wrote here: https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/PairRddJoin.scala
/* rdd1.txt
1~4,5,6,7
2~4,5
3~6,7
rdd2.txt
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010
*/
val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-PairRddJoin")
val sc = new SparkContext(sconf)
val rdd1 = "/path/to/rdd1.txt"
val rdd2 = "/path/to/rdd2.txt"
val rdd1InvIndex = sc.textFile(rdd1).map(x => (x.split('~')(0), x.split('~')(1))).flatMapValues(str => str.split(',')).map(str => (str._2, str._1))
val rdd2Pair = sc.textFile(rdd2).map(str => (str.split('~')(0), str.split('~')(1)))
rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().collect().foreach(println)
This outputs the following. I think this may be essentially what you are looking for (I have to understand how to NOT print as CompactBuffer):
(2,CompactBuffer(1001,1000,1002,1003, 1004,1001,1006,1007))
(3,CompactBuffer(1011,1012,1013,1010, 1007,1009,1005,1008))
(1,CompactBuffer(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))
From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID To: dcmovva dilip.mo...@gmail.com; user@spark.apache.org user@spark.apache.org Sent: Saturday, January 3, 2015 12:19 PM Subject: Re: Joining by values
This is my design. Now let me try and code it in Spark.
rdd1.txt =
1~4,5,6,7
2~4,5
3~6,7
rdd2.txt
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010
TRANSFORM 1
===
map each value to key (like an inverted index)
4~1
5~1
6~1
7~1
5~2
4~2
6~3
7~3
TRANSFORM 2
===
Join keys in transform 1 and rdd2
4~1,1001,1000,1002,1003
4~2,1001,1000,1002,1003
5~1,1004,1001,1006,1007
5~2,1004,1001,1006,1007
6~1,1007,1009,1005,1008
6~3,1007,1009,1005,1008
7~1,1011,1012,1013,1010
7~3,1011,1012,1013,1010
TRANSFORM 3
===
Split key in transform 2 with ~ and keep key(1), i.e. 1,2,3
1~1001,1000,1002,1003
2~1001,1000,1002,1003
1~1004,1001,1006,1007
2~1004,1001,1006,1007
1~1007,1009,1005,1008
3~1007,1009,1005,1008
1~1011,1012,1013,1010
3~1011,1012,1013,1010
TRANSFORM 4
===
join by key
1~1001,1000,1002,1003,1004,1001,1006,1007,1007,1009,1005,1008,1011,1012,1013,1010
2~1001,1000,1002,1003,1004,1001,1006,1007
3~1007,1009,1005,1008,1011,1012,1013,1010
From: dcmovva dilip.mo...@gmail.com To: user@spark.apache.org Sent: Saturday, January 3, 2015 10:10 AM Subject: Joining by values
I have two pair RDDs in Spark like this
rdd1 = (1 -> [4,5,6,7]) (2 -> [4,5]) (3 -> [6,7])
rdd2 = (4 -> [1001,1000,1002,1003]) (5 -> [1004,1001,1006,1007]) (6 -> [1007,1009,1005,1008]) (7 -> [1011,1012,1013,1010])
I would like to combine them to look like this.
joinedRdd = (1 -> [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013]) (2 -> [1000,1001,1002,1003,1004,1006,1007]) (3 -> [1005,1007,1008,1009,1010,1011,1012,1013])
Can someone suggest how to do this? Thanks Dilip
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-values-tp20954.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
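Dilip's desired joinedRdd also has the values de-duplicated and sorted per key, which the code above does not do yet. A minimal sketch of one way to finish the job, repeating the rdd1InvIndex and rdd2Pair setup from Sanjay's snippet for completeness (untested against the thread's files; the splitting, trimming and sorting details are assumptions, not part of the original code):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits on Spark 1.x
val sconf = new SparkConf().setMaster("local").setAppName("PairRddJoinDistinct")
val sc = new SparkContext(sconf)
val rdd1InvIndex = sc.textFile("/path/to/rdd1.txt").map(x => (x.split('~')(0), x.split('~')(1))).flatMapValues(_.split(',')).map(str => (str._2, str._1))
val rdd2Pair = sc.textFile("/path/to/rdd2.txt").map(str => (str.split('~')(0), str.split('~')(1)))
// Same join as above, then split the comma-separated value strings,
// de-duplicate, and sort so each key maps to one ordered list of ids.
val joined = rdd1InvIndex.join(rdd2Pair)                        // (value, (origKey, "1001,1000,..."))
  .map { case (_, (origKey, csv)) => (origKey, csv) }
  .groupByKey()
  .mapValues(csvs => csvs.flatMap(_.split(',')).map(_.trim.toInt).toSeq.distinct.sorted)
joined.collect().foreach(println)
// roughly: (2,List(1000, 1001, 1002, 1003, 1004, 1006, 1007)) and so on for keys 1 and 3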
Re: Spark for core business-logic? - Replacing: MongoDB?
Alec, Good questions. Suggestions:
1. Refactor the problem into layers, viz. DFS, Data Store, DB, SQL Layer, Cache, Queue, App Server, App (Interface), App (backend ML) et al.
2. Then slot in the appropriate technologies - maybe even multiple technologies for the same layer - and work thru the pros and cons.
3. Looking at the layers (moving from the easy to the difficult, the mundane to the esoteric ;o)):
- Cache, Queue - stick with what you are comfortable with, i.e. Redis et al. Also take a look at Kafka
- App Server - Tomcat et al
- App (Interface) - JavaScript et al
- DB, SQL Layer - Better off with MongoDB. You can explore HBase, but it is not the same - the same way as MongoDB != MySQL, HBase != MongoDB
- Machine Learning Server/Layer - Spark would fit very well here
- Machine Learning DFS, Data Store - HDFS. The idea of pushing the data to Hadoop for ML is good, but you need to think thru things like incremental data load and semantics like at-least-once, at-most-once et al.
4. You could architect all of it with the Hadoop ecosystem. It might work, depending on the system - but I would use caution. Most probably many of the elements would rather be implemented in the appropriate technologies.
5. Double-click a couple more times on the design; think thru the functionality, scaling requirements et al. Draw 3 or 4 alternatives and jot down the top 5 requirements, pros and cons, the knowns and the unknowns. The optimum design will fall thru.
Cheers k/
On Sat, Jan 3, 2015 at 4:43 PM, Alec Taylor alec.tayl...@gmail.com wrote: In the middle of doing the architecture for a new project, which has various machine learning and related components, including: recommender systems, search engines and sequence [common intersection] matching. Usually I use MongoDB (as db), Redis (as cache) and Celery (as queue, backed by Redis). Though I don't have experience with Hadoop, I was thinking of using Hadoop for the machine learning (as this will become a Big Data problem quite quickly). To push the data into Hadoop, I would use a connector of some description, or push the MongoDB backups into HDFS at set intervals. However I was thinking that it might be better to put the whole thing in Hadoop, store all persistent data in Hadoop, and maybe do all the layers in Apache Spark (with caching remaining in Redis). Is that a viable option? Most of what I see discusses Spark (and Hadoop in general) for analytics only. Apache Phoenix exposes a nice interface for read/write over HBase, so I might use that if Spark ends up being the wrong solution. Thanks for all suggestions, Alec Taylor PS: I need this for both Big and Small data. Note that I am using the Cloudera definition of Big Data, referring to processing/storage across more than 1 machine. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark.akka.frameSize limit error
I use the 1.2 version. On Sun, Jan 4, 2015 at 3:01 AM, Josh Rosen rosenvi...@gmail.com wrote: Which version of Spark are you using? It seems like the issue here is that the map output statuses are too large to fit in the Akka frame size. This issue has been fixed in Spark 1.2 by using a different encoding for map outputs for jobs with many reducers ( https://issues.apache.org/jira/browse/SPARK-3613). On earlier Spark versions, your options are either reducing the number of reducers (e.g. by explicitly specifying the number of reducers in the reduceByKey() call) or increasing the Akka frame size (via the spark.akka.frameSize configuration option). On Sat, Jan 3, 2015 at 10:40 AM, Saeed Shahrivari saeed.shahriv...@gmail.com wrote: Hi, I am trying to get the frequency of each Unicode char in a document collection using Spark. Here is the code snippet that does the job:
JavaPairRDD<LongWritable, Text> rows = sc.sequenceFile(args[0], LongWritable.class, Text.class);
rows = rows.coalesce(1);
JavaPairRDD<Character, Long> pairs = rows.flatMapToPair(t -> {
    String content = t._2.toString();
    Multiset<Character> chars = HashMultiset.create();
    for (int i = 0; i < content.length(); i++)
        chars.add(content.charAt(i));
    List<Tuple2<Character, Long>> list = new ArrayList<Tuple2<Character, Long>>();
    for (Character ch : chars.elementSet()) {
        list.add(new Tuple2<Character, Long>(ch, (long) chars.count(ch)));
    }
    return list;
});
JavaPairRDD<Character, Long> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.printf("MapCount %,d\n", counts.count());
But I get the following exception:
15/01/03 21:51:34 ERROR MapOutputTrackerMasterActor: Map output statuses were 11141547 bytes which exceeds spark.akka.frameSize (10485760 bytes).
org.apache.spark.SparkException: Map output statuses were 11141547 bytes which exceeds spark.akka.frameSize (10485760 bytes).
at org.apache.spark.MapOutputTrackerMasterActor$$anonfun$receiveWithLogging$1.applyOrElse(MapOutputTracker.scala:59)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.MapOutputTrackerMasterActor.aroundReceive(MapOutputTracker.scala:42)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Would you please tell me where the fault is? If I process fewer rows, there is no problem. However, when the number of rows is large I always get this exception. Thanks beforehand.
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-akka-frameSize-limit-error-tp20955.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
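For anyone on a pre-1.2 build hitting the same limit, here is a minimal sketch of the two workarounds Josh describes, translated to Scala (the frame size of 64 MB and the 200 partitions are arbitrary example values, and the tiny parallelize input only stands in for the real sequence-file data):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits on Spark 1.x
// Workaround 1: raise the Akka frame size (the value is in MB; the default 10
// corresponds to the 10485760 bytes in the error message).
val conf = new SparkConf()
  .setAppName("char-frequency")
  .setMaster("local[*]")
  .set("spark.akka.frameSize", "64")
val sc = new SparkContext(conf)
// A tiny stand-in for the (Character, Long) pairs built in the Java snippet above.
val pairs = sc.parallelize("example text".toSeq).map(c => (c, 1L))
// Workaround 2: reduce the number of reducers by passing an explicit partition
// count to reduceByKey, which shrinks the map output status payload.
val counts = pairs.reduceByKey(_ + _, 200)
println(counts.count())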
Re: Joining by values
Thanks Sanjay. I will give it a try. Thanks Dilip On Sat, Jan 3, 2015 at 11:25 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com wrote: so I changed the code to rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().map(str => (str._1, str._2.toList)).collect().foreach(println) Now it prints Lists. Don't worry, I will work on this so it does not output as List(...). But I am hoping that the JOIN question that @Dilip asked is hopefully answered :-)
(2,List(1001,1000,1002,1003, 1004,1001,1006,1007))
(3,List(1011,1012,1013,1010, 1007,1009,1005,1008))
(1,List(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))
-- From: Shixiong Zhu zsxw...@gmail.com To: Sanjay Subramanian sanjaysubraman...@yahoo.com Cc: dcmovva dilip.mo...@gmail.com; user@spark.apache.org user@spark.apache.org Sent: Saturday, January 3, 2015 8:15 PM Subject: Re: Joining by values
call `map(_.toList)` to convert `CompactBuffer` to `List` Best Regards, Shixiong Zhu
2015-01-04 12:08 GMT+08:00 Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid: hi Take a look at the code I wrote here: https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/PairRddJoin.scala
/* rdd1.txt
1~4,5,6,7
2~4,5
3~6,7
rdd2.txt
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010
*/
val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-PairRddJoin")
val sc = new SparkContext(sconf)
val rdd1 = "/path/to/rdd1.txt"
val rdd2 = "/path/to/rdd2.txt"
val rdd1InvIndex = sc.textFile(rdd1).map(x => (x.split('~')(0), x.split('~')(1))).flatMapValues(str => str.split(',')).map(str => (str._2, str._1))
val rdd2Pair = sc.textFile(rdd2).map(str => (str.split('~')(0), str.split('~')(1)))
rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().collect().foreach(println)
This outputs the following. I think this may be essentially what you are looking for (I have to understand how to NOT print as CompactBuffer):
(2,CompactBuffer(1001,1000,1002,1003, 1004,1001,1006,1007))
(3,CompactBuffer(1011,1012,1013,1010, 1007,1009,1005,1008))
(1,CompactBuffer(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))
-- From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID To: dcmovva dilip.mo...@gmail.com; user@spark.apache.org user@spark.apache.org Sent: Saturday, January 3, 2015 12:19 PM Subject: Re: Joining by values
This is my design. Now let me try and code it in Spark.
rdd1.txt =
1~4,5,6,7
2~4,5
3~6,7
rdd2.txt
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010
TRANSFORM 1
===
map each value to key (like an inverted index)
4~1
5~1
6~1
7~1
5~2
4~2
6~3
7~3
TRANSFORM 2
===
Join keys in transform 1 and rdd2
4~1,1001,1000,1002,1003
4~2,1001,1000,1002,1003
5~1,1004,1001,1006,1007
5~2,1004,1001,1006,1007
6~1,1007,1009,1005,1008
6~3,1007,1009,1005,1008
7~1,1011,1012,1013,1010
7~3,1011,1012,1013,1010
TRANSFORM 3
===
Split key in transform 2 with ~ and keep key(1), i.e. 1,2,3
1~1001,1000,1002,1003
2~1001,1000,1002,1003
1~1004,1001,1006,1007
2~1004,1001,1006,1007
1~1007,1009,1005,1008
3~1007,1009,1005,1008
1~1011,1012,1013,1010
3~1011,1012,1013,1010
TRANSFORM 4
===
join by key
1~1001,1000,1002,1003,1004,1001,1006,1007,1007,1009,1005,1008,1011,1012,1013,1010
2~1001,1000,1002,1003,1004,1001,1006,1007
3~1007,1009,1005,1008,1011,1012,1013,1010
-- From: dcmovva dilip.mo...@gmail.com To: user@spark.apache.org Sent: Saturday, January 3, 2015 10:10 AM Subject: Joining by values
I have two pair RDDs in Spark like this
rdd1 = (1 -> [4,5,6,7]) (2 -> [4,5]) (3 -> [6,7])
rdd2 = (4 -> [1001,1000,1002,1003]) (5 -> [1004,1001,1006,1007]) (6 -> [1007,1009,1005,1008]) (7 -> [1011,1012,1013,1010])
I would like to combine them to look like this.
joinedRdd = (1 -> [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013]) (2 -> [1000,1001,1002,1003,1004,1006,1007]) (3 -> [1005,1007,1008,1009,1010,1011,1012,1013])
Can someone suggest how to do this? Thanks Dilip
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-values-tp20954.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: simple job + futures timeout
so it appears that I need to be on the same network, which is fine. Now I would like some advice on the best way to use the shell: is running the shell from the master or a worker fine, or should I create a new EC2 instance? Bobby -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/simple-job-futures-timeout-tp20946p20959.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Unable to build spark from source
Hi, I compiled using sbt and it takes less time. Thanks for the tip. I'm able to run the MLlib examples ( https://spark.apache.org/docs/latest/mllib-linear-methods.html ) in the pyspark shell. However I got some errors related to Spark SQL while compiling. Is that a reason to worry? I have posted the errors as a gist over here: https://gist.github.com/MechCoder/7a9c89ee38e194513080 Thanks. On Sat, Jan 3, 2015 at 11:54 PM, Michael Armbrust mich...@databricks.com wrote: I'll add that most of the Spark developers I know use sbt for day-to-day development, as it can be much faster for incremental compilation and it has several nice features. On Sat, Jan 3, 2015 at 9:59 AM, Simon Elliston Ball si...@simonellistonball.com wrote: You can use the same build commands, but it's well worth setting up a zinc server if you're doing a lot of builds. That will allow incremental Scala builds, which speeds up the process significantly. SPARK-4501 might be of interest too. Simon On 3 Jan 2015, at 17:27, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: My question was: once I make changes to a file in the source code, can I rebuild using some other command so that it picks up only the changes (because a full build takes a lot of time)? On Sat, Jan 3, 2015 at 10:40 PM, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: Yes, I've built Spark successfully using the same command mvn -DskipTests clean package, but it built because I no longer work behind a proxy. Thanks. -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
does calling cache()/persist() on an RDD trigger its immediate evaluation?
Hi Pro, I have a question regarding calling cache()/persist() on an RDD. All RDDs in Spark are lazily evaluated, but does calling cache()/persist() on an RDD trigger its immediate evaluation? My spark app is something like this:
var rdd = sc.textFile().map()
rdd.persist()
while (true) {
  val count = rdd.filter().count
  if (count == 0) break
  val newRdd = /* some code that uses `rdd` several times and produces a new RDD */
  rdd.unpersist()
  rdd = newRdd.persist()
}
In each iteration, I persist `rdd` and unpersist it at the end of the iteration, replacing `rdd` with the persisted `newRdd`. My concern is that, if the RDD is not evaluated and persisted when persist() is called, I need to change where persist()/unpersist() are called to make this more efficient. Thanks, Pengcheng - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
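For what it's worth, persist()/cache() are lazy: they only mark the RDD for storage, and nothing is computed or cached until an action runs on it. A minimal sketch of the loop with the materialization points made explicit (the variable names, input path and the filter/transform bodies are placeholders, not the original application's code):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
val sc = new SparkContext(new SparkConf().setAppName("iterate").setMaster("local[*]"))
var rdd = sc.textFile("/path/to/input.txt").map(_.trim)   // placeholder input and transform
rdd.persist(StorageLevel.MEMORY_ONLY)                     // lazy: only registers the intent to cache
var done = false
while (!done) {
  // count() is the action that actually computes `rdd` and fills the cache.
  val remaining = rdd.filter(_.nonEmpty).count()           // placeholder predicate
  if (remaining == 0) {
    done = true
  } else {
    val newRdd = rdd.map(_.dropRight(1)).persist(StorageLevel.MEMORY_ONLY)  // placeholder update
    newRdd.count()      // materialize the new RDD while the old one is still cached
    rdd.unpersist()     // only now drop the old copy
    rdd = newRdd
  }
}
The extra count() is there so that unpersist() on the old RDD happens only after an action has materialized the new one; otherwise the first real use of newRdd would have to recompute its lineage from an uncached parent.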