Re: Submitting spark jobs through yarn-client
Took me just about all night (it's 3am here in EST) but I finally figured out how to get this working. I pushed up my example code for others who may be struggling with this same problem. It really took an understanding of how the classpath needs to be configured both in YARN and in the client driver application. Here's the example code on github: https://github.com/cjnolet/spark-jetty-server

On Fri, Jan 2, 2015 at 11:35 PM, Corey Nolet cjno...@gmail.com wrote: So looking @ the actual code- I see where it looks like --class 'notused' --jar null is set in ClientBase.scala when YARN is being run in client mode. One thing I noticed is that the jar is being set by trying to grab the jar's URI from the classpath resources- in this case I think it's finding the spark-yarn jar instead of spark-assembly, so when it tries to run ExecutorLauncher.scala, none of the core classes (like org.apache.spark.Logging) are going to be available on the classpath. I hope this is the root of the issue. I'll keep this thread updated with my findings.

On Fri, Jan 2, 2015 at 5:46 PM, Corey Nolet cjno...@gmail.com wrote: ... and looking even further, it looks like the actual command that's executed to start up the JVM running org.apache.spark.deploy.yarn.ExecutorLauncher is passing in --class 'notused' --jar null. I would assume this isn't expected, but I don't see where to set these properties or why they aren't making it through.

On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet cjno...@gmail.com wrote: Looking a little closer @ the launch_container.sh file, it appears to be adding a $PWD/__app__.jar to the classpath, but there is no __app__.jar in the directory pointed to by PWD. Any ideas?

On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to get a SparkContext going in a web container which is being submitted through yarn-client. I'm trying two different approaches and both seem to result in the same error from the YARN nodemanagers:

1) I'm newing up a SparkContext directly, manually adding all the lib jars from Spark and Hadoop to the setJars() method on the SparkConf.
2) I'm using SparkSubmit.main() to pass the classname and the jar containing my code.

When YARN tries to create the container, I get an exception in the driver: "Yarn application already ended, might be killed or not able to launch application master." When I look into the logs for the nodemanager, I see NoClassDefFoundError: org/apache/spark/Logging. Looking closer @ the contents of the nodemanagers, I see that the Spark YARN jar was renamed to __spark__.jar and placed in the app cache, while the rest of the libraries I specified via setJars() were all placed in the file cache. Any ideas as to what may be happening? I even tried adding the spark-core dependency and uber-jarring my own classes so that the dependencies would be there when YARN tries to create the container.
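For reference, a minimal sketch of the kind of yarn-client setup being described, i.e. making sure the YARN containers can see the Spark assembly while setJars() ships only the application's own classes. The jar paths, config values and object name below are illustrative assumptions, not taken from the linked project:

import org.apache.spark.{SparkConf, SparkContext}

object EmbeddedSparkContext {
  def create(): SparkContext = {
    val conf = new SparkConf()
      .setMaster("yarn-client")
      .setAppName("jetty-embedded-spark")
      // Point the YARN containers at the Spark assembly so classes like
      // org.apache.spark.Logging are on their classpath (path is hypothetical).
      .set("spark.yarn.jar", "hdfs:///libs/spark-assembly-1.2.0-hadoop2.4.0.jar")
      // Ship the web app's own uber-jar so executors can load the application classes.
      .setJars(Seq("/opt/webapp/lib/my-app-uber.jar"))
    new SparkContext(conf)
  }
}

The point of the split is that the assembly jar covers Spark's own classes for the ExecutorLauncher, while setJars() only needs to cover the application code.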
Re: Is it possible to do incremental training using ALSModel (MLlib)?
Yes, it is easy to simply start a new factorization from the current model solution. It works well. That's more like incremental *batch* rebuilding of the model. That is not in MLlib but fairly trivial to add. You can certainly 'fold in' new data to approximately update with one new datum too, which you can find online. This is not quite the same idea as streaming SGD. I'm not sure this fits the RDD model well since it entails updating one element at a time, but mini-batch could be reasonable.

On Jan 3, 2015 5:29 AM, Peng Cheng rhw...@gmail.com wrote: I was under the impression that ALS wasn't designed for it :- The famous eBay online recommender uses SGD. However, you can try using the previous model as a starting point, and gradually reduce the number of iterations after the model stabilizes. I never verified this idea, so you need to at least cross-validate it before putting it into production.

On 2 January 2015 at 04:40, Wouter Samaey wouter.sam...@storefront.be wrote: Hi all, I'm curious about MLlib and whether it is possible to do incremental training on the ALSModel. Usually training is run first, and then you can query. But in my case, data is collected in real-time and I want the predictions of my ALSModel to consider the latest data without a complete re-training phase. I've checked out these resources, but could not find any info on how to solve this: https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html My question fits in a larger picture where I'm using Prediction IO, and this in turn is based on Spark. Thanks in advance for any advice! Wouter -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-do-incremental-training-using-ALSModel-MLlib-tp20942.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
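A minimal sketch of the "incremental batch" idea described above, assuming MLlib's standard ALS API. Warm-starting from the previous factors is not exposed by the public API, so this simply re-runs the factorization over old plus new data; the rank, iteration and lambda values are illustrative:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

def rebuildModel(history: RDD[Rating], fresh: RDD[Rating]): MatrixFactorizationModel = {
  val all = history.union(fresh).cache()
  // Fewer iterations may suffice once the data only changes slightly
  // between rebuilds, per the suggestion in the thread.
  ALS.train(all, 10, 5, 0.01)
}

Run on a schedule, this gives periodic rebuilds that fold in the latest data, rather than true online updates.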
Re: DAG info
Hi, You can turn off these messages using log4j.properties. On Fri, Jan 2, 2015 at 1:51 PM, Robineast robin.e...@xense.co.uk wrote: Do you have some example code of what you are trying to do? Robin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DAG-info-tp20940p20941.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Regards, Madhukara Phatak http://www.madhukaraphatak.com
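A minimal sketch of the log4j.properties change being suggested, assuming the default conf/log4j.properties template; the exact logger to quieten depends on which class emits the DAG messages:

# conf/log4j.properties
log4j.rootCategory=WARN, console
# or target only the scheduler output:
log4j.logger.org.apache.spark.scheduler.DAGScheduler=WARN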
getting number of partition per topic in kafka
Hi experts! I am currently working on Spark Streaming with Kafka. I have a couple of questions related to this task. 1) Is there a way to find the number of partitions given a topic name? 2) Is there a way to detect whether the Kafka server is running or not? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/getting-number-of-partition-per-topic-in-kafka-tp20952.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Is it possible to do incremental training using ALSModel (MLlib)?
Do you know a place where I could find a sample or tutorial for this? I'm still very new at this. And struggling a bit... Thanks in advance Wouter Sent from my iPhone. On 03 Jan 2015, at 10:36, Sean Owen so...@cloudera.com wrote: Yes, it is easy to simply start a new factorization from the current model solution. It works well. That's more like incremental *batch* rebuilding of the model. That is not in MLlib but fairly trivial to add. You can certainly 'fold in' new data to approximately update with one new datum too, which you can find online. This is not quite the same idea as streaming SGD. I'm not sure this fits the RDD model well since it entails updating one element at a time but mini batch could be reasonable. On Jan 3, 2015 5:29 AM, Peng Cheng rhw...@gmail.com wrote: I was under the impression that ALS wasn't designed for it :- The famous ebay online recommender uses SGD However, you can try using the previous model as starting point, and gradually reduce the number of iteration after the model stablize. I never verify this idea, so you need to at least cross-validate it before putting into productio On 2 January 2015 at 04:40, Wouter Samaey wouter.sam...@storefront.be wrote: Hi all, I'm curious about MLlib and if it is possible to do incremental training on the ALSModel. Usually training is run first, and then you can query. But in my case, data is collected in real-time and I want the predictions of my ALSModel to consider the latest data without complete re-training phase. I've checked out these resources, but could not find any info on how to solve this: https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html My question fits in a larger picture where I'm using Prediction IO, and this in turn is based on Spark. Thanks in advance for any advice! Wouter -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-do-incremental-training-using-ALSModel-MLlib-tp20942.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Unable to build spark from source
Hi Sean, Initially I thought on the lines of bit1...@163.com but I just changed how I connect to the internet. I ran sc.parallelize(1 to 1000).count() and it seemed to work. Another quick question on the development workflow. What is the best way to rebuild once I make a few modifications to the source code? On Sat, Jan 3, 2015 at 4:09 PM, bit1...@163.com bit1...@163.com wrote: The error hints that the maven module scala-compiler can't be fetched from repo1.maven.org. Should some repositoy urls be added to the Maven's settings file? -- bit1...@163.com *From:* Manoj Kumar manojkumarsivaraj...@gmail.com *Date:* 2015-01-03 18:46 *To:* user user@spark.apache.org *Subject:* Unable to build spark from source Hello, I tried to build Spark from source using this command (all dependencies installed) but it fails this error. Any help would be appreciated. mvn -DskipTests clean package [INFO] Spark Project Parent POM .. FAILURE [28:14.408s] [INFO] Spark Project Networking .. SKIPPED [INFO] Spark Project Shuffle Streaming Service ... SKIPPED [INFO] Spark Project Core SKIPP INFO] BUILD FAILURE [INFO] [INFO] Total time: 28:15.136s [INFO] Finished at: Sat Jan 03 15:26:31 IST 2015 [INFO] Final Memory: 20M/143M [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-parent: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed: Plugin net.alchim31.maven:scala-maven-plugin:3.2.0 or one of its dependencies could not be resolved: The following artifacts could not be resolved: org.scala-lang:scala-compiler:jar:2.10.3, org.scala-lang:scala-reflect:jar:2.10.3: Could not transfer artifact org.scala-lang:scala-compiler:jar:2.10.3 from/to central ( https://repo1.maven.org/maven2): GET request of: org/scala-lang/scala-compiler/2.10.3/scala-compiler-2.10.3.jar from central failed: Connection reset - [Help 1] -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
Re: Unable to build spark from source
This indicates a network problem in getting third party artifacts. Is there a proxy you need to go through? On Jan 3, 2015 11:17 AM, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: Hello, I tried to build Spark from source using this command (all dependencies installed) but it fails this error. Any help would be appreciated. mvn -DskipTests clean package [INFO] Spark Project Parent POM .. FAILURE [28:14.408s] [INFO] Spark Project Networking .. SKIPPED [INFO] Spark Project Shuffle Streaming Service ... SKIPPED [INFO] Spark Project Core SKIPP INFO] BUILD FAILURE [INFO] [INFO] Total time: 28:15.136s [INFO] Finished at: Sat Jan 03 15:26:31 IST 2015 [INFO] Final Memory: 20M/143M [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-parent: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed: Plugin net.alchim31.maven:scala-maven-plugin:3.2.0 or one of its dependencies could not be resolved: The following artifacts could not be resolved: org.scala-lang:scala-compiler:jar:2.10.3, org.scala-lang:scala-reflect:jar:2.10.3: Could not transfer artifact org.scala-lang:scala-compiler:jar:2.10.3 from/to central ( https://repo1.maven.org/maven2): GET request of: org/scala-lang/scala-compiler/2.10.3/scala-compiler-2.10.3.jar from central failed: Connection reset - [Help 1] -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
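If a proxy is indeed the issue, the standard Maven fix is a proxy block in ~/.m2/settings.xml; a sketch with placeholder host and port:

<settings>
  <proxies>
    <proxy>
      <id>corp-http</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>proxy.example.com</host>
      <port>8080</port>
      <nonProxyHosts>localhost|127.0.0.1</nonProxyHosts>
    </proxy>
  </proxies>
</settings>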
Re: Unable to build spark from source
The error hints that the maven module scala-compiler can't be fetched from repo1.maven.org. Should some repositoy urls be added to the Maven's settings file? bit1...@163.com From: Manoj Kumar Date: 2015-01-03 18:46 To: user Subject: Unable to build spark from source Hello, I tried to build Spark from source using this command (all dependencies installed) but it fails this error. Any help would be appreciated. mvn -DskipTests clean package [INFO] Spark Project Parent POM .. FAILURE [28:14.408s] [INFO] Spark Project Networking .. SKIPPED [INFO] Spark Project Shuffle Streaming Service ... SKIPPED [INFO] Spark Project Core SKIPP INFO] BUILD FAILURE [INFO] [INFO] Total time: 28:15.136s [INFO] Finished at: Sat Jan 03 15:26:31 IST 2015 [INFO] Final Memory: 20M/143M [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-parent: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed: Plugin net.alchim31.maven:scala-maven-plugin:3.2.0 or one of its dependencies could not be resolved: The following artifacts could not be resolved: org.scala-lang:scala-compiler:jar:2.10.3, org.scala-lang:scala-reflect:jar:2.10.3: Could not transfer artifact org.scala-lang:scala-compiler:jar:2.10.3 from/to central (https://repo1.maven.org/maven2): GET request of: org/scala-lang/scala-compiler/2.10.3/scala-compiler-2.10.3.jar from central failed: Connection reset - [Help 1] -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
RE: Publishing streaming results to web interface
Is this through a streaming app? I've done this before by publishing results out to a queue or message bus, with a web app listening on the other end. If it's just batch or infrequent you could save the results out to a file. From: tfrisk tfris...@gmail.com Sent: 1/2/2015 5:47 PM To: user@spark.apache.org Subject: Publishing streaming results to web interface Hi, New to Spark so just feeling my way in using it on a standalone server under Linux. I'm using Scala to store running count totals of certain tokens in my streaming data and publishing a top 10 list, e.g. (TokenX,count) (TokenY,count) ... At the moment this is just being printed to stdout using print(), but I want to be able to view these running counts from a web page - I'm just not sure where to start with this. Can anyone advise or point me to examples of how this might be achieved? Many Thanks, Thomas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Publishing-streaming-results-to-web-interface-tp20948.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
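A minimal sketch of the simplest variant, writing each batch's top 10 somewhere a web page can poll. The output path is a placeholder, and a message queue or key-value store would slot in where the PrintWriter is:

import java.io.PrintWriter
import org.apache.spark.streaming.dstream.DStream

def publishTopTen(counts: DStream[(String, Long)]): Unit = {
  counts.foreachRDD { rdd =>
    val top = rdd.top(10)(Ordering.by[(String, Long), Long](_._2))  // computed on the cluster
    val writer = new PrintWriter("/var/www/data/top-tokens.txt")    // runs on the driver
    try top.foreach { case (token, n) => writer.println(s"$token\t$n") }
    finally writer.close()
  }
}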
Unable to build spark from source
Hello, I tried to build Spark from source using this command (all dependencies installed) but it fails with this error. Any help would be appreciated.

mvn -DskipTests clean package

[INFO] Spark Project Parent POM .......................... FAILURE [28:14.408s]
[INFO] Spark Project Networking .......................... SKIPPED
[INFO] Spark Project Shuffle Streaming Service ........... SKIPPED
[INFO] Spark Project Core ................................ SKIPPED
[INFO] BUILD FAILURE
[INFO] Total time: 28:15.136s
[INFO] Finished at: Sat Jan 03 15:26:31 IST 2015
[INFO] Final Memory: 20M/143M
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-parent: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed: Plugin net.alchim31.maven:scala-maven-plugin:3.2.0 or one of its dependencies could not be resolved: The following artifacts could not be resolved: org.scala-lang:scala-compiler:jar:2.10.3, org.scala-lang:scala-reflect:jar:2.10.3: Could not transfer artifact org.scala-lang:scala-compiler:jar:2.10.3 from/to central (https://repo1.maven.org/maven2): GET request of: org/scala-lang/scala-compiler/2.10.3/scala-compiler-2.10.3.jar from central failed: Connection reset -> [Help 1]

-- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
Re: sqlContext is undefined in the Spark Shell
This is noise, please ignore. I figured out what happened... bit1...@163.com From: bit1...@163.com Date: 2015-01-03 19:03 To: user Subject: sqlContext is undefined in the Spark Shell Hi, In the spark shell, I do the following two things: 1. scala> val cxt = new org.apache.spark.sql.SQLContext(sc) 2. scala> import sqlContext._ The 1st one succeeds while the 2nd one fails with the following error: console:10: error: not found: value sqlContext import sqlContext._ Is there something missing? I am using Spark 1.2.0. Thanks. bit1...@163.com
Re: saveAsTextFile
If you can paste the code here I can certainly help. Also confirm the version of spark you are using Regards Pankaj Infoshore Software India -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p20953.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
sqlContext is undefined in the Spark Shell
Hi, In the spark shell, I do the following two things: 1. scala> val cxt = new org.apache.spark.sql.SQLContext(sc) 2. scala> import sqlContext._ The 1st one succeeds while the 2nd one fails with the following error: console:10: error: not found: value sqlContext import sqlContext._ Is there something missing? I am using Spark 1.2.0. Thanks. bit1...@163.com
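For reference, the error is consistent with the import naming a value that was never defined: step 1 creates a val named cxt, so there is no sqlContext to import from. A sketch of the two ways to line them up:

// import from the val that was actually created...
val cxt = new org.apache.spark.sql.SQLContext(sc)
import cxt._

// ...or name it sqlContext so the usual snippet works
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._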
Re: Submitting spark jobs through yarn-client
thats great. i tried this once and gave up after a few hours. On Sat, Jan 3, 2015 at 2:59 AM, Corey Nolet cjno...@gmail.com wrote: Took me just about all night (it's 3am here in EST) but I finally figured out how to get this working. I pushed up my example code for others who may be struggling with this same problem. It really took an understanding of how the classpath needs to be configured both in YARN and in the client driver application. Here's the example code on github: https://github.com/cjnolet/spark-jetty-server On Fri, Jan 2, 2015 at 11:35 PM, Corey Nolet cjno...@gmail.com wrote: So looking @ the actual code- I see where it looks like --class 'notused' --jar null is set on the ClientBase.scala when yarn is being run in client mode. One thing I noticed is that the jar is being set by trying to grab the jar's uri from the classpath resources- in this case I think it's finding the spark-yarn jar instead of spark-assembly so when it tries to runt the ExecutorLauncher.scala, none of the core classes (like org.apache.spark.Logging) are going to be available on the classpath. I hope this is the root of the issue. I'll keep this thread updated with my findings. On Fri, Jan 2, 2015 at 5:46 PM, Corey Nolet cjno...@gmail.com wrote: .. and looking even further, it looks like the actual command tha'ts executed starting up the JVM to run the org.apache.spark.deploy.yarn.ExecutorLauncher is passing in --class 'notused' --jar null. I would assume this isn't expected but I don't see where to set these properties or why they aren't making it through. On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet cjno...@gmail.com wrote: Looking a little closer @ the launch_container.sh file, it appears to be adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in the directory pointed to by PWD. Any ideas? On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to get a SparkContext going in a web container which is being submitted through yarn-client. I'm trying two different approaches and both seem to be resulting in the same error from the yarn nodemanagers: 1) I'm newing up a spark context direct, manually adding all the lib jars from Spark and Hadoop to the setJars() method on the SparkConf. 2) I'm using SparkSubmit,main() to pass the classname and jar containing my code. When yarn tries to create the container, I get an exception in the driver Yarn application already ended, might be killed or not able to launch application master. When I look into the logs for the nodemanager, I see NoClassDefFoundError: org/apache/spark/Logging. Looking closer @ the contents of the nodemanagers, I see that the spark yarn jar was renamed to __spark__.jar and placed in the app cache while the rest of the libraries I specified via setJars() were all placed in the file cache. Any ideas as to what may be happening? I even tried adding the spark-core dependency and uber-jarring my own classes so that the dependencies would be there when Yarn tries to create the container.
Re: Unable to build spark from source
No, that is part of every Maven build by default. The repo is fine and I (and I assume everyone else) can reach it. How can you run Spark if you can't build it? Are you running something else or did it succeed? The error hints that the maven module scala-compiler can't be fetched from repo1.maven.org. Should some repositoy urls be added to the Maven's settings file? -- bit1...@163.com *From:* Manoj Kumar manojkumarsivaraj...@gmail.com *Date:* 2015-01-03 18:46 *To:* user user@spark.apache.org *Subject:* Unable to build spark from source Hello, I tried to build Spark from source using this command (all dependencies installed) but it fails this error. Any help would be appreciated. mvn -DskipTests clean package [INFO] Spark Project Parent POM .. FAILURE [28:14.408s] [INFO] Spark Project Networking .. SKIPPED [INFO] Spark Project Shuffle Streaming Service ... SKIPPED [INFO] Spark Project Core SKIPP INFO] BUILD FAILURE [INFO] [INFO] Total time: 28:15.136s [INFO] Finished at: Sat Jan 03 15:26:31 IST 2015 [INFO] Final Memory: 20M/143M [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-parent: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed: Plugin net.alchim31.maven:scala-maven-plugin:3.2.0 or one of its dependencies could not be resolved: The following artifacts could not be resolved: org.scala-lang:scala-compiler:jar:2.10.3, org.scala-lang:scala-reflect:jar:2.10.3: Could not transfer artifact org.scala-lang:scala-compiler:jar:2.10.3 from/to central ( https://repo1.maven.org/maven2): GET request of: org/scala-lang/scala-compiler/2.10.3/scala-compiler-2.10.3.jar from central failed: Connection reset - [Help 1] -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
Re: different akka versions and spark
hey Ted, i am aware of the upgrade efforts for akka. however if spark 1.2 forces me to upgrade all our usage of akka to 2.3.x while spark 1.0 and 1.1 force me to use akka 2.2.x then we cannot build one application that runs on all spark 1.x versions, which i would consider a major incompatibility. best, koert On Sat, Jan 3, 2015 at 12:11 AM, Ted Yu yuzhih...@gmail.com wrote: Please see http://akka.io/news/2014/05/22/akka-2.3.3-released.html which points to http://doc.akka.io/docs/akka/2.3.3/project/migration-guide-2.2.x-2.3.x.html?_ga=1.35212129.1385865413.1420220234 Cheers On Fri, Jan 2, 2015 at 9:11 AM, Koert Kuipers ko...@tresata.com wrote: i noticed spark 1.2.0 bumps the akka version. since spark uses it's own akka version, does this mean it can co-exist with another akka version in the same JVM? has anyone tried this? we have some spark apps that also use akka (2.2.3) and spray. if different akka versions causes conflicts then spark 1.2.0 would not be backwards compatible for us... thanks. koert
Re: getting number of partition per topic in kafka
You can use the lowlevel consumer http://github.com/dibbhatt/kafka-spark-consumer for this, it has an API call https://github.com/dibbhatt/kafka-spark-consumer/blob/master/src/main/java/consumer/kafka/DynamicBrokersReader.java#L81 to retrieve the number of partitions from a topic. Easiest way would be to count the depth in: yourZkPath + /topics/ + topicName + /partitions Thanks Best Regards On Sat, Jan 3, 2015 at 1:54 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi experts! I am currently working on spark streaming with kafka. I have couple of questions related to this task. 1) Is there a way to find number of partitions given a topic name? 2)Is there a way to detect whether kafka server is running or not ? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/getting-number-of-partition-per-topic-in-kafka-tp20952.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
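A rough sketch of doing that from Scala with a plain ZkClient. The connection string is a placeholder, and the znode layout assumed is the standard Kafka 0.8 one referenced in the reply above:

import org.I0Itec.zkclient.ZkClient

val zk = new ZkClient("zkhost:2181", 10000, 10000)

def numPartitions(topic: String): Int =
  zk.countChildren(s"/brokers/topics/$topic/partitions")

// crude liveness check: registered broker ids under /brokers/ids
def kafkaLooksAlive(): Boolean = zk.countChildren("/brokers/ids") > 0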
Re: saveAsTextFile
@laila: Based on the error you mentioned in the nabble link below, it seems like there are no permissions to write to HDFS, so this is possibly why saveAsTextFile is failing. From: Pankaj Narang pankajnaran...@gmail.com To: user@spark.apache.org Sent: Saturday, January 3, 2015 4:07 AM Subject: Re: saveAsTextFile If you can paste the code here I can certainly help. Also confirm the version of spark you are using Regards Pankaj Infoshore Software India -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p20953.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark MLlib for Kaggle's CTR challenge
Hi, I entered Kaggle's CTR challenge using the scikit-learn Python framework. Although it gave me a reasonable score, I am wondering about exploring Spark MLlib, which I haven't used before. I also tried Vowpal Wabbit. Can someone who has already worked with MLlib tell me whether Spark MLlib supports online learning or batch SGD, and if so how it performs? I don't have a Spark cluster, just my laptop. Any suggestions? The training data has close to 45 million rows in CSV format and the test data close to 4.2 million rows in the same format. Thanks, Niranjan
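For context, MLlib's linear models train in batch (gradient descent over RDDs) rather than online. A rough local-mode sketch of a logistic-regression CTR baseline; the CSV parsing / feature hashing is left as a placeholder and the iteration count is illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("ctr-baseline"))

def parse(line: String): LabeledPoint = ???  // placeholder: hash the categorical fields into a feature vector

val train = sc.textFile("train.csv").map(parse).cache()
val model = LogisticRegressionWithSGD.train(train, 50)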
Re: Unable to build spark from source
My question was: once I make changes to a source file, is there another command I can rebuild with so that it picks up only the changes (because a full rebuild takes a lot of time)? On Sat, Jan 3, 2015 at 10:40 PM, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: Yes, I've built Spark successfully using the same command (mvn -DskipTests clean package); it built because I no longer work behind a proxy. Thanks. -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
Re: Unable to build spark from source
Yes, I've built Spark successfully using the same command (mvn -DskipTests clean package); it built because I no longer work behind a proxy. Thanks.
[no subject]
Best Regards, Sujeevan. N
Re: Unable to build spark from source
You can use the same build commands, but it's well worth setting up a zinc server if you're doing a lot of builds. That will allow incremental scala builds, which speeds up the process significantly. SPARK-4501 might be of interest too. Simon On 3 Jan 2015, at 17:27, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: My question was that if once I make changes in the source code to a file, do I rebuild it using any other command, such that it takes in only the changes (because it takes a lot of time)? On Sat, Jan 3, 2015 at 10:40 PM, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: Yes, I've built spark successfully, using the same command mvn -DskipTests clean package but it built because now I do not work behind a proxy. Thanks. -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
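A sketch of what that workflow can look like; module names are as they appear in the source tree and the exact zinc invocation may differ by version:

zinc -start                               # run once; Maven's scala plugin will find the running server
mvn -DskipTests -pl core package          # rebuild only spark-core
mvn -DskipTests -pl mllib -am package     # rebuild mllib plus the modules it depends on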
Joining by values
I have two pair RDDs in spark like this

rdd1 =
(1 -> [4,5,6,7])
(2 -> [4,5])
(3 -> [6,7])

rdd2 =
(4 -> [1001,1000,1002,1003])
(5 -> [1004,1001,1006,1007])
(6 -> [1007,1009,1005,1008])
(7 -> [1011,1012,1013,1010])

I would like to combine them to look like this.

joinedRdd =
(1 -> [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013])
(2 -> [1000,1001,1002,1003,1004,1006,1007])
(3 -> [1005,1007,1008,1009,1010,1011,1012,1013])

Can someone suggest me how to do this. Thanks Dilip -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-values-tp20954.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
save rdd to ORC file
Hi Experts, Like saveAsParquetFile on SchemaRDD, is there an equivalent to store to an ORC file? I am using Spark 1.2.0. As per the link below, it looks like it's not part of 1.2.0, so any update on the latest status would be great. https://issues.apache.org/jira/browse/SPARK-2883 Until the next release, is there a workaround to read/write ORC files? Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/save-rdd-to-ORC-file-tp20956.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Joining by values
This is my design. Now let me try and code it in Spark.

rdd1.txt
========
1~4,5,6,7
2~4,5
3~6,7

rdd2.txt
========
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010

TRANSFORM 1
===========
map each value to key (like an inverted index)
4~1
5~1
6~1
7~1
5~2
4~2
6~3
7~3

TRANSFORM 2
===========
Join keys in transform 1 and rdd2
4~1,1001,1000,1002,1003
4~2,1001,1000,1002,1003
5~1,1004,1001,1006,1007
5~2,1004,1001,1006,1007
6~1,1007,1009,1005,1008
6~3,1007,1009,1005,1008
7~1,1011,1012,1013,1010
7~3,1011,1012,1013,1010

TRANSFORM 3
===========
Split key in transform 2 with ~ and keep key(1) i.e. 1,2,3
1~1001,1000,1002,1003
2~1001,1000,1002,1003
1~1004,1001,1006,1007
2~1004,1001,1006,1007
1~1007,1009,1005,1008
3~1007,1009,1005,1008
1~1011,1012,1013,1010
3~1011,1012,1013,1010

TRANSFORM 4
===========
join by key
1~1001,1000,1002,1003,1004,1001,1006,1007,1007,1009,1005,1008,1011,1012,1013,1010
2~1001,1000,1002,1003,1004,1001,1006,1007
3~1007,1009,1005,1008,1011,1012,1013,1010

From: dcmovva dilip.mo...@gmail.com To: user@spark.apache.org Sent: Saturday, January 3, 2015 10:10 AM Subject: Joining by values I have two pair RDDs in spark like this rdd1 = (1 -> [4,5,6,7]) (2 -> [4,5]) (3 -> [6,7]) rdd2 = (4 -> [1001,1000,1002,1003]) (5 -> [1004,1001,1006,1007]) (6 -> [1007,1009,1005,1008]) (7 -> [1011,1012,1013,1010]) I would like to combine them to look like this. joinedRdd = (1 -> [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013]) (2 -> [1000,1001,1002,1003,1004,1006,1007]) (3 -> [1005,1007,1008,1009,1010,1011,1012,1013]) Can someone suggest me how to do this. Thanks Dilip -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-values-tp20954.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
spark.sql.shuffle.partitions parameter
Hello everyone, I'm using Spark SQL and would like to understand how to determine the right value for the spark.sql.shuffle.partitions parameter. For example, if I'm joining two RDDs where the first has 10 partitions and the second 60, how big should this parameter be? Thank you, Yuri
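For reference, spark.sql.shuffle.partitions sets the number of partitions produced by a SQL shuffle (join or aggregation); it is not derived from the 10 or 60 input partitions. A common starting point is a small multiple of the total executor cores, then tune by watching task sizes. A sketch, with an illustrative value:

sqlContext.setConf("spark.sql.shuffle.partitions", "120")
// or when building the SparkConf:
val conf = new org.apache.spark.SparkConf().set("spark.sql.shuffle.partitions", "120")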
Re: spark.akka.frameSize limit error
Which version of Spark are you using? It seems like the issue here is that the map output statuses are too large to fit in the Akka frame size. This issue has been fixed in Spark 1.2 by using a different encoding for map outputs for jobs with many reducers (https://issues.apache.org/jira/browse/SPARK-3613). On earlier Spark versions, your options are either reducing the number of reducers (e.g. by explicitly specifying the number of reducers in the reduceByKey() call) or increasing the Akka frame size (via the spark.akka.frameSize configuration option).

On Sat, Jan 3, 2015 at 10:40 AM, Saeed Shahrivari saeed.shahriv...@gmail.com wrote: Hi, I am trying to get the frequency of each Unicode char in a document collection using Spark. Here is the code snippet that does the job:

JavaPairRDD<LongWritable, Text> rows = sc.sequenceFile(args[0], LongWritable.class, Text.class);
rows = rows.coalesce(1);
JavaPairRDD<Character, Long> pairs = rows.flatMapToPair(t -> {
    String content = t._2.toString();
    Multiset<Character> chars = HashMultiset.create();
    for (int i = 0; i < content.length(); i++) chars.add(content.charAt(i));
    List<Tuple2<Character, Long>> list = new ArrayList<Tuple2<Character, Long>>();
    for (Character ch : chars.elementSet()) {
        list.add(new Tuple2<Character, Long>(ch, (long) chars.count(ch)));
    }
    return list;
});
JavaPairRDD<Character, Long> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.printf("MapCount %,d\n", counts.count());

But, I get the following exception:

15/01/03 21:51:34 ERROR MapOutputTrackerMasterActor: Map output statuses were 11141547 bytes which exceeds spark.akka.frameSize (10485760 bytes). org.apache.spark.SparkException: Map output statuses were 11141547 bytes which exceeds spark.akka.frameSize (10485760 bytes). at org.apache.spark.MapOutputTrackerMasterActor$$anonfun$receiveWithLogging$1.applyOrElse(MapOutputTracker.scala:59) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53) at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.MapOutputTrackerMasterActor.aroundReceive(MapOutputTracker.scala:42) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Would you please tell me where the fault is? If I process fewer rows, there is no problem. However, when the number of rows is large I always get this exception. Thanks beforehand.
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-akka-frameSize-limit-error-tp20955.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
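A sketch of the two workarounds mentioned above; values are illustrative, and the Java API takes the same extra argument (e.g. pairs.reduceByKey(func, 200)):

// 1) fewer/explicit reducers: pass a partition count to reduceByKey
//    (pairs here stands for the pair RDD built in the quoted snippet)
val counts = pairs.reduceByKey(_ + _, 200)

// 2) a larger Akka frame size (value in MB), set on the SparkConf
//    or via --conf spark.akka.frameSize=128 on spark-submit
val conf = new org.apache.spark.SparkConf().set("spark.akka.frameSize", "128")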
Better way of measuring custom application metrics
I have a hack to gather custom application metrics in a Streaming job, but I wanted to know if there is any better way of doing this. My hack consists of this singleton:

object Metriker extends Serializable {
  @transient lazy val mr: MetricRegistry = {
    val metricRegistry = new MetricRegistry()
    val graphiteEndpoint = new InetSocketAddress("ec2-54-220-56-229.eu-west-1.compute.amazonaws.com", 2003)
    GraphiteReporter
      .forRegistry(metricRegistry)
      .build(new Graphite(graphiteEndpoint))
      .start(5, TimeUnit.SECONDS)
    metricRegistry
  }

  @transient lazy val processId = ManagementFactory.getRuntimeMXBean.getName

  @transient lazy val hostId = {
    try {
      InetAddress.getLocalHost.getHostName
    } catch {
      case e: UnknownHostException => "localhost"
    }
  }

  def metricName(name: String): String = {
    "%s.%s.%s".format(name, hostId, processId)
  }
}

which I then use in my jobs like so:

stream.map { i =>
  Metriker.mr.meter(Metriker.metricName("testmetric123")).mark(i)
  i * 2
}

Then I aggregate the metrics on Graphite. This works, but I was curious to know if anyone has a less hacky way.
Spark for core business-logic? - Replacing: MongoDB?
In the middle of doing the architecture for a new project, which has various machine learning and related components, including: recommender systems, search engines and sequence [common intersection] matching. Usually I use: MongoDB (as db), Redis (as cache) and celery (as queue, backed by Redis). Though I don't have experience with Hadoop, I was thinking of using Hadoop for the machine-learning (as this will become a Big Data problem quite quickly). To push the data into Hadoop, I would use a connector of some description, or push the MongoDB backups into HDFS at set intervals. However I was thinking that it might be better to put the whole thing in Hadoop, store all persistent data in Hadoop, and maybe do all the layers in Apache Spark (with caching remaining in Redis). Is that a viable option? - Most of what I see discusses Spark (and Hadoop in general) for analytics only. Apache Phoenix exposes a nice interface for read/write over HBase, so I might use that if Spark ends up being the wrong solution. Thanks for all suggestions, Alec Taylor PS: I need this for both Big and Small data. Note that I am using the Cloudera definition of Big Data referring to processing/storage across more than 1 machine. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: save rdd to ORC file
One way is to use Spark SQL:

scala> sqlContext.sql("create table orc_table(key INT, value STRING) stored as orc")
scala> sqlContext.sql("insert into table orc_table select * from schema_rdd_temp_table")
scala> sqlContext.sql("FROM orc_table select *")

On 4 January 2015 at 00:57, SamyaMaiti samya.maiti2...@gmail.com wrote: Hi Experts, Like saveAsParquetFile on schemaRDD, there is a equivalent to store in ORC file. I am using spark 1.2.0. As per the link below, looks like its not part of 1.2.0, so any latest update would be great. https://issues.apache.org/jira/browse/SPARK-2883 Till the next release, is there a workaround to read/write ORC file. Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/save-rdd-to-ORC-file-tp20956.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Better way of measuring custom application metrics
Hi, I think there’s a StreamingSource in Spark Streaming which exposes the Spark Streaming running status to the metrics sink, you can connect it with Graphite sink to expose metrics to Graphite. I’m not sure is this what you want. Besides you can customize the Source and Sink of the MetricsSystem to build your own and configure it in metrics.properties with class name to let it loaded by metrics system, for the details you can refer to http://spark.apache.org/docs/latest/monitoring.html or source code. Thanks Jerry From: Enno Shioji [mailto:eshi...@gmail.com] Sent: Sunday, January 4, 2015 7:47 AM To: user@spark.apache.org Subject: Better way of measuring custom application metrics I have a hack to gather custom application metrics in a Streaming job, but I wanted to know if there is any better way of doing this. My hack consists of this singleton: object Metriker extends Serializable { @transient lazy val mr: MetricRegistry = { val metricRegistry = new MetricRegistry() val graphiteEndpoint = new InetSocketAddress(ec2-54-220-56-229.eu-west-1.compute.amazonaws.comhttp://ec2-54-220-56-229.eu-west-1.compute.amazonaws.com, 2003) GraphiteReporter .forRegistry(metricRegistry) .build(new Graphite(graphiteEndpoint)) .start(5, TimeUnit.SECONDS) metricRegistry } @transient lazy val processId = ManagementFactory.getRuntimeMXBean.getName @transient lazy val hostId = { try { InetAddress.getLocalHost.getHostName } catch { case e: UnknownHostException = localhost } } def metricName(name: String): String = { %s.%s.%s.format(name, hostId, processId) } } which I then use in my jobs like so: stream .map { i = Metriker.mr.meter(Metriker.metricName(testmetric123)).mark(i) i * 2 } Then I aggregate the metrics on Graphite. This works, but I was curious to know if anyone has a less hacky way. [https://mailfoogae.appspot.com/t?sender=aZXNoaW9qaUBnbWFpbC5jb20%3Dtype=zerocontentguid=29916861-9b4d-423b-8e45-c731deddd43b]ᐧ
Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.
I am running into similar problem. Have you found any resolution to this issue ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Elastic-allocation-spark-dynamicAllocation-enabled-results-in-task-never-being-executed-tp18969p20957.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: building spark1.2 meet error
Thanks, it built successfully. But where is the built zip file? I cannot find a finished .zip or .tar.gz package.

2014-12-31 19:22 GMT+08:00 xhudik [via Apache Spark User List] ml-node+s1001560n20927...@n3.nabble.com: Hi J_soft, for me it is working, I didn't put -Dscala-2.10 -X parameters. I got only one warning, since I don't have hadoop 2.5 it didn't activate this profile:

larix@kovral:~/sources/spark-1.2.0 mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0 -DskipTests clean package
Found 0 infos
Finished in 3 ms
[INFO] Reactor Summary:
[INFO] Spark Project Parent POM ... SUCCESS [ 14.177 s]
[INFO] Spark Project Networking ... SUCCESS [ 14.670 s]
[INFO] Spark Project Shuffle Streaming Service SUCCESS [ 9.030 s]
[INFO] Spark Project Core . SUCCESS [04:42 min]
[INFO] Spark Project Bagel SUCCESS [ 26.184 s]
[INFO] Spark Project GraphX ... SUCCESS [01:07 min]
[INFO] Spark Project Streaming SUCCESS [01:35 min]
[INFO] Spark Project Catalyst . SUCCESS [01:48 min]
[INFO] Spark Project SQL .. SUCCESS [01:55 min]
[INFO] Spark Project ML Library ... SUCCESS [02:17 min]
[INFO] Spark Project Tools SUCCESS [ 15.527 s]
[INFO] Spark Project Hive . SUCCESS [01:43 min]
[INFO] Spark Project REPL . SUCCESS [ 45.154 s]
[INFO] Spark Project YARN Parent POM .. SUCCESS [ 3.885 s]
[INFO] Spark Project YARN Stable API .. SUCCESS [01:00 min]
[INFO] Spark Project Assembly . SUCCESS [ 50.812 s]
[INFO] Spark Project External Twitter . SUCCESS [ 21.401 s]
[INFO] Spark Project External Flume Sink .. SUCCESS [ 25.207 s]
[INFO] Spark Project External Flume ... SUCCESS [ 34.734 s]
[INFO] Spark Project External MQTT SUCCESS [ 22.617 s]
[INFO] Spark Project External ZeroMQ .. SUCCESS [ 22.444 s]
[INFO] Spark Project External Kafka ... SUCCESS [ 33.566 s]
[INFO] Spark Project Examples . SUCCESS [01:23 min]
[INFO] Spark Project YARN Shuffle Service . SUCCESS [ 4.873 s]
[INFO] BUILD SUCCESS
[INFO] Total time: 23:20 min
[INFO] Finished at: 2014-12-31T12:02:32+01:00
[INFO] Final Memory: 76M/855M
[WARNING] The requested profile hadoop-2.5 could not be activated because it does not exist.

If it won't work for you, I'd try to delete all sources, download the source code once more and try again ... good luck, Tomas

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/building-spark1-2-meet-error-tp20853p20958.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Joining by values
hi Take a look at the code here I wrotehttps://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/PairRddJoin.scala /*rdd1.txt 1~4,5,6,7 2~4,5 3~6,7 rdd2.txt 4~1001,1000,1002,1003 5~1004,1001,1006,1007 6~1007,1009,1005,1008 7~1011,1012,1013,1010 */ val sconf = new SparkConf().setMaster(local).setAppName(MedicalSideFx-PairRddJoin) val sc = new SparkContext(sconf) val rdd1 = /path/to/rdd1.txt val rdd2 = /path/to/rdd2.txt val rdd1InvIndex = sc.textFile(rdd1).map(x = (x.split('~')(0), x.split('~')(1))).flatMapValues(str = str.split(',')).map(str = (str._2, str._1)) val rdd2Pair = sc.textFile(rdd2).map(str = (str.split('~')(0), str.split('~')(1))) rdd1InvIndex.join(rdd2Pair).map(str = str._2).groupByKey().collect().foreach(println) This outputs the following . I think this may be essentially what u r looking for(I have to understand how to NOT print as CompactBuffer)(2,CompactBuffer(1001,1000,1002,1003, 1004,1001,1006,1007)) (3,CompactBuffer(1011,1012,1013,1010, 1007,1009,1005,1008)) (1,CompactBuffer(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008)) From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID To: dcmovva dilip.mo...@gmail.com; user@spark.apache.org user@spark.apache.org Sent: Saturday, January 3, 2015 12:19 PM Subject: Re: Joining by values This is my design. Now let me try and code it in Spark. rdd1.txt =1~4,5,6,72~4,53~6,7 rdd2.txt 4~1001,1000,1002,10035~1004,1001,1006,10076~1007,1009,1005,10087~1011,1012,1013,1010 TRANSFORM 1===map each value to key (like an inverted index)4~15~16~17~15~24~26~37~3 TRANSFORM 2===Join keys in transform 1 and rdd24~1,1001,1000,1002,10034~2,1001,1000,1002,10035~1,1004,1001,1006,10075~2,1004,1001,1006,10076~1,1007,1009,1005,10086~3,1007,1009,1005,10087~1,1011,1012,1013,10107~3,1011,1012,1013,1010 TRANSFORM 3===Split key in transform 2 with ~ and keep key(1) i.e. 1,2,31~1001,1000,1002,10032~1001,1000,1002,10031~1004,1001,1006,10072~1004,1001,1006,10071~1007,1009,1005,10083~1007,1009,1005,10081~1011,1012,1013,10103~1011,1012,1013,1010 TRANSFORM 4===join by key 1~1001,1000,1002,1003,1004,1001,1006,1007,1007,1009,1005,1008,1011,1012,1013,10102~1001,1000,1002,1003,1004,1001,1006,10073~1007,1009,1005,1008,1011,1012,1013,1010 From: dcmovva dilip.mo...@gmail.com To: user@spark.apache.org Sent: Saturday, January 3, 2015 10:10 AM Subject: Joining by values I have a two pair RDDs in spark like this rdd1 = (1 - [4,5,6,7]) (2 - [4,5]) (3 - [6,7]) rdd2 = (4 - [1001,1000,1002,1003]) (5 - [1004,1001,1006,1007]) (6 - [1007,1009,1005,1008]) (7 - [1011,1012,1013,1010]) I would like to combine them to look like this. joinedRdd = (1 - [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013]) (2 - [1000,1001,1002,1003,1004,1006,1007]) (3 - [1005,1007,1008,1009,1010,1011,1012,1013]) Can someone suggest me how to do this. Thanks Dilip -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-values-tp20954.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Joining by values
call `map(_.toList)` to convert `CompactBuffer` to `List` Best Regards, Shixiong Zhu 2015-01-04 12:08 GMT+08:00 Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid: hi Take a look at the code here I wrote https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/PairRddJoin.scala /*rdd1.txt 1~4,5,6,7 2~4,5 3~6,7 rdd2.txt 4~1001,1000,1002,1003 5~1004,1001,1006,1007 6~1007,1009,1005,1008 7~1011,1012,1013,1010 */ val sconf = new SparkConf().setMaster(local).setAppName(MedicalSideFx-PairRddJoin) val sc = new SparkContext(sconf) val rdd1 = /path/to/rdd1.txt val rdd2 = /path/to/rdd2.txt val rdd1InvIndex = sc.textFile(rdd1).map(x = (x.split('~')(0), x.split('~')(1))).flatMapValues(str = str.split(',')).map(str = (str._2, str._1)) val rdd2Pair = sc.textFile(rdd2).map(str = (str.split('~')(0), str.split('~')(1))) rdd1InvIndex.join(rdd2Pair).map(str = str._2).groupByKey().collect().foreach(println) This outputs the following . I think this may be essentially what u r looking for (I have to understand how to NOT print as CompactBuffer) (2,CompactBuffer(1001,1000,1002,1003, 1004,1001,1006,1007)) (3,CompactBuffer(1011,1012,1013,1010, 1007,1009,1005,1008)) (1,CompactBuffer(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008)) -- *From:* Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID *To:* dcmovva dilip.mo...@gmail.com; user@spark.apache.org user@spark.apache.org *Sent:* Saturday, January 3, 2015 12:19 PM *Subject:* Re: Joining by values This is my design. Now let me try and code it in Spark. rdd1.txt = 1~4,5,6,7 2~4,5 3~6,7 rdd2.txt 4~1001,1000,1002,1003 5~1004,1001,1006,1007 6~1007,1009,1005,1008 7~1011,1012,1013,1010 TRANSFORM 1 === map each value to key (like an inverted index) 4~1 5~1 6~1 7~1 5~2 4~2 6~3 7~3 TRANSFORM 2 === Join keys in transform 1 and rdd2 4~1,1001,1000,1002,1003 4~2,1001,1000,1002,1003 5~1,1004,1001,1006,1007 5~2,1004,1001,1006,1007 6~1,1007,1009,1005,1008 6~3,1007,1009,1005,1008 7~1,1011,1012,1013,1010 7~3,1011,1012,1013,1010 TRANSFORM 3 === Split key in transform 2 with ~ and keep key(1) i.e. 1,2,3 1~1001,1000,1002,1003 2~1001,1000,1002,1003 1~1004,1001,1006,1007 2~1004,1001,1006,1007 1~1007,1009,1005,1008 3~1007,1009,1005,1008 1~1011,1012,1013,1010 3~1011,1012,1013,1010 TRANSFORM 4 === join by key 1~1001,1000,1002,1003,1004,1001,1006,1007,1007,1009,1005,1008,1011,1012,1013,1010 2~1001,1000,1002,1003,1004,1001,1006,1007 3~1007,1009,1005,1008,1011,1012,1013,1010 -- *From:* dcmovva dilip.mo...@gmail.com *To:* user@spark.apache.org *Sent:* Saturday, January 3, 2015 10:10 AM *Subject:* Joining by values I have a two pair RDDs in spark like this rdd1 = (1 - [4,5,6,7]) (2 - [4,5]) (3 - [6,7]) rdd2 = (4 - [1001,1000,1002,1003]) (5 - [1004,1001,1006,1007]) (6 - [1007,1009,1005,1008]) (7 - [1011,1012,1013,1010]) I would like to combine them to look like this. joinedRdd = (1 - [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013]) (2 - [1000,1001,1002,1003,1004,1006,1007]) (3 - [1005,1007,1008,1009,1010,1011,1012,1013]) Can someone suggest me how to do this. Thanks Dilip -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-values-tp20954.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: building spark1.2 meet error
it should be under ls assembly/target/scala-2.10/* On Sat, Jan 3, 2015 at 10:11 PM, j_soft zsof...@gmail.com wrote: - thanks, it is success builded - .but where is builded zip file? I not find finished .zip or .tar.gz package 2014-12-31 19:22 GMT+08:00 xhudik [via Apache Spark User List] [hidden email] http:///user/SendEmail.jtp?type=nodenode=20958i=0: Hi J_soft, for me it is working, I didn't put -Dscala-2.10 -X parameters. I got only one warning, since I don't have hadoop 2.5 it didn't activate this profile: *larix@kovral:~/sources/spark-1.2.0 mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0 -DskipTests clean package Found 0 infos Finished in 3 ms [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 14.177 s] [INFO] Spark Project Networking ... SUCCESS [ 14.670 s] [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 9.030 s] [INFO] Spark Project Core . SUCCESS [04:42 min] [INFO] Spark Project Bagel SUCCESS [ 26.184 s] [INFO] Spark Project GraphX ... SUCCESS [01:07 min] [INFO] Spark Project Streaming SUCCESS [01:35 min] [INFO] Spark Project Catalyst . SUCCESS [01:48 min] [INFO] Spark Project SQL .. SUCCESS [01:55 min] [INFO] Spark Project ML Library ... SUCCESS [02:17 min] [INFO] Spark Project Tools SUCCESS [ 15.527 s] [INFO] Spark Project Hive . SUCCESS [01:43 min] [INFO] Spark Project REPL . SUCCESS [ 45.154 s] [INFO] Spark Project YARN Parent POM .. SUCCESS [ 3.885 s] [INFO] Spark Project YARN Stable API .. SUCCESS [01:00 min] [INFO] Spark Project Assembly . SUCCESS [ 50.812 s] [INFO] Spark Project External Twitter . SUCCESS [ 21.401 s] [INFO] Spark Project External Flume Sink .. SUCCESS [ 25.207 s] [INFO] Spark Project External Flume ... SUCCESS [ 34.734 s] [INFO] Spark Project External MQTT SUCCESS [ 22.617 s] [INFO] Spark Project External ZeroMQ .. SUCCESS [ 22.444 s] [INFO] Spark Project External Kafka ... SUCCESS [ 33.566 s] [INFO] Spark Project Examples . SUCCESS [01:23 min] [INFO] Spark Project YARN Shuffle Service . SUCCESS [ 4.873 s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 23:20 min [INFO] Finished at: 2014-12-31T12:02:32+01:00 [INFO] Final Memory: 76M/855M [INFO] [WARNING] The requested profile hadoop-2.5 could not be activated because it does not exist.* If it won't work for you. I'd try to delete all sources, download source code once more and try again ... good luck, Tomas -- If you reply to this email, your message will be added to the discussion below: http://apache-spark-user-list.1001560.n3.nabble.com/building-spark1-2-meet-error-tp20853p20927.html To unsubscribe from building spark1.2 meet error, click here. NAML http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewerid=instant_html%21nabble%3Aemail.namlbase=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespacebreadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml -- View this message in context: Re: building spark1.2 meet error http://apache-spark-user-list.1001560.n3.nabble.com/building-spark1-2-meet-error-tp20853p20958.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
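If what you want is a distributable .tgz like the released binaries (rather than just the assembly jar under assembly/target/scala-2.10/), the source tree ships a packaging script; a sketch using the same profiles as the build above:

./make-distribution.sh --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0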
Re: Joining by values
so I changed the code to rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().map(str => (str._1, str._2.toList)).collect().foreach(println) Now it prints Lists. Don't worry, I will work on this so it does not output as List(...). But I am hoping that the JOIN question that @Dilip asked is hopefully answered :-)
(2,List(1001,1000,1002,1003, 1004,1001,1006,1007))
(3,List(1011,1012,1013,1010, 1007,1009,1005,1008))
(1,List(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))
From: Shixiong Zhu zsxw...@gmail.com To: Sanjay Subramanian sanjaysubraman...@yahoo.com Cc: dcmovva dilip.mo...@gmail.com; user@spark.apache.org user@spark.apache.org Sent: Saturday, January 3, 2015 8:15 PM Subject: Re: Joining by values
call `map(_.toList)` to convert `CompactBuffer` to `List` Best Regards, Shixiong Zhu
2015-01-04 12:08 GMT+08:00 Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid: hi Take a look at the code I wrote here: https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/PairRddJoin.scala
/* rdd1.txt
1~4,5,6,7
2~4,5
3~6,7
rdd2.txt
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010
*/
val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-PairRddJoin")
val sc = new SparkContext(sconf)
val rdd1 = "/path/to/rdd1.txt"
val rdd2 = "/path/to/rdd2.txt"
val rdd1InvIndex = sc.textFile(rdd1).map(x => (x.split('~')(0), x.split('~')(1))).flatMapValues(str => str.split(',')).map(str => (str._2, str._1))
val rdd2Pair = sc.textFile(rdd2).map(str => (str.split('~')(0), str.split('~')(1)))
rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().collect().foreach(println)
This outputs the following. I think this may be essentially what you are looking for (I have to understand how to NOT print as CompactBuffer):
(2,CompactBuffer(1001,1000,1002,1003, 1004,1001,1006,1007))
(3,CompactBuffer(1011,1012,1013,1010, 1007,1009,1005,1008))
(1,CompactBuffer(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))
From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID To: dcmovva dilip.mo...@gmail.com; user@spark.apache.org user@spark.apache.org Sent: Saturday, January 3, 2015 12:19 PM Subject: Re: Joining by values
This is my design. Now let me try and code it in Spark.
rdd1.txt =
1~4,5,6,7
2~4,5
3~6,7
rdd2.txt
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010
TRANSFORM 1
===
map each value to key (like an inverted index)
4~1
5~1
6~1
7~1
5~2
4~2
6~3
7~3
TRANSFORM 2
===
Join keys in transform 1 and rdd2
4~1,1001,1000,1002,1003
4~2,1001,1000,1002,1003
5~1,1004,1001,1006,1007
5~2,1004,1001,1006,1007
6~1,1007,1009,1005,1008
6~3,1007,1009,1005,1008
7~1,1011,1012,1013,1010
7~3,1011,1012,1013,1010
TRANSFORM 3
===
Split key in transform 2 with ~ and keep key(1), i.e. 1,2,3
1~1001,1000,1002,1003
2~1001,1000,1002,1003
1~1004,1001,1006,1007
2~1004,1001,1006,1007
1~1007,1009,1005,1008
3~1007,1009,1005,1008
1~1011,1012,1013,1010
3~1011,1012,1013,1010
TRANSFORM 4
===
join by key
1~1001,1000,1002,1003,1004,1001,1006,1007,1007,1009,1005,1008,1011,1012,1013,1010
2~1001,1000,1002,1003,1004,1001,1006,1007
3~1007,1009,1005,1008,1011,1012,1013,1010
From: dcmovva dilip.mo...@gmail.com To: user@spark.apache.org Sent: Saturday, January 3, 2015 10:10 AM Subject: Joining by values
I have two pair RDDs in Spark like this
rdd1 = (1 -> [4,5,6,7]) (2 -> [4,5]) (3 -> [6,7])
rdd2 = (4 -> [1001,1000,1002,1003]) (5 -> [1004,1001,1006,1007]) (6 -> [1007,1009,1005,1008]) (7 -> [1011,1012,1013,1010])
I would like to combine them to look like this.
joinedRdd = (1 -> [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013]) (2 -> [1000,1001,1002,1003,1004,1006,1007]) (3 -> [1005,1007,1008,1009,1010,1011,1012,1013])
Can someone suggest how to do this? Thanks Dilip
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-values-tp20954.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
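Dilip's desired joinedRdd also has the values de-duplicated and sorted per key, which the code above does not do yet. A minimal sketch of one way to finish the job, repeating the rdd1InvIndex and rdd2Pair setup from Sanjay's snippet for completeness (untested against the thread's files; the splitting, trimming and sorting details are assumptions, not part of the original code):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits on Spark 1.x
val sconf = new SparkConf().setMaster("local").setAppName("PairRddJoinDistinct")
val sc = new SparkContext(sconf)
val rdd1InvIndex = sc.textFile("/path/to/rdd1.txt").map(x => (x.split('~')(0), x.split('~')(1))).flatMapValues(_.split(',')).map(str => (str._2, str._1))
val rdd2Pair = sc.textFile("/path/to/rdd2.txt").map(str => (str.split('~')(0), str.split('~')(1)))
// Same join as above, then split the comma-separated value strings,
// de-duplicate, and sort so each key maps to one ordered list of ids.
val joined = rdd1InvIndex.join(rdd2Pair)                        // (value, (origKey, "1001,1000,..."))
  .map { case (_, (origKey, csv)) => (origKey, csv) }
  .groupByKey()
  .mapValues(csvs => csvs.flatMap(_.split(',')).map(_.trim.toInt).toSeq.distinct.sorted)
joined.collect().foreach(println)
// roughly: (2,List(1000, 1001, 1002, 1003, 1004, 1006, 1007)) and so on for keys 1 and 3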
Re: Spark for core business-logic? - Replacing: MongoDB?
Alec, Good questions. Suggestions:
1. Refactor the problem into layers, viz. DFS, Data Store, DB, SQL Layer, Cache, Queue, App Server, App (Interface), App (backend ML) et al.
2. Then slot in the appropriate technologies - maybe even multiple technologies for the same layer - and work thru the pros and cons.
3. Looking at the layers (moving from the easy to the difficult, the mundane to the esoteric ;o)):
- Cache, Queue - stick with what you are comfortable with, i.e. Redis et al. Also take a look at Kafka
- App Server - Tomcat et al
- App (Interface) - JavaScript et al
- DB, SQL Layer - Better off with MongoDB. You can explore HBase, but it is not the same - the same way as MongoDB != MySQL, HBase != MongoDB
- Machine Learning Server/Layer - Spark would fit very well here
- Machine Learning DFS, Data Store - HDFS. The idea of pushing the data to Hadoop for ML is good, but you need to think thru things like incremental data load and semantics like at-least-once, at-most-once et al.
4. You could architect all of it with the Hadoop ecosystem. It might work, depending on the system - but I would use caution. Most probably many of the elements would rather be implemented in the appropriate technologies.
5. Double-click a couple more times on the design; think thru the functionality, scaling requirements et al. Draw 3 or 4 alternatives and jot down the top 5 requirements, pros and cons, the knowns and the unknowns. The optimum design will fall thru.
Cheers k/
On Sat, Jan 3, 2015 at 4:43 PM, Alec Taylor alec.tayl...@gmail.com wrote: In the middle of doing the architecture for a new project, which has various machine learning and related components, including: recommender systems, search engines and sequence [common intersection] matching. Usually I use MongoDB (as db), Redis (as cache) and Celery (as queue, backed by Redis). Though I don't have experience with Hadoop, I was thinking of using Hadoop for the machine learning (as this will become a Big Data problem quite quickly). To push the data into Hadoop, I would use a connector of some description, or push the MongoDB backups into HDFS at set intervals. However I was thinking that it might be better to put the whole thing in Hadoop, store all persistent data in Hadoop, and maybe do all the layers in Apache Spark (with caching remaining in Redis). Is that a viable option? Most of what I see discusses Spark (and Hadoop in general) for analytics only. Apache Phoenix exposes a nice interface for read/write over HBase, so I might use that if Spark ends up being the wrong solution. Thanks for all suggestions, Alec Taylor PS: I need this for both Big and Small data. Note that I am using the Cloudera definition of Big Data, referring to processing/storage across more than 1 machine. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark.akka.frameSize limit error
I use the 1.2 version. On Sun, Jan 4, 2015 at 3:01 AM, Josh Rosen rosenvi...@gmail.com wrote: Which version of Spark are you using? It seems like the issue here is that the map output statuses are too large to fit in the Akka frame size. This issue has been fixed in Spark 1.2 by using a different encoding for map outputs for jobs with many reducers ( https://issues.apache.org/jira/browse/SPARK-3613). On earlier Spark versions, your options are either reducing the number of reducers (e.g. by explicitly specifying the number of reducers in the reduceByKey() call) or increasing the Akka frame size (via the spark.akka.frameSize configuration option). On Sat, Jan 3, 2015 at 10:40 AM, Saeed Shahrivari saeed.shahriv...@gmail.com wrote: Hi, I am trying to get the frequency of each Unicode char in a document collection using Spark. Here is the code snippet that does the job:
JavaPairRDD<LongWritable, Text> rows = sc.sequenceFile(args[0], LongWritable.class, Text.class);
rows = rows.coalesce(1);
JavaPairRDD<Character, Long> pairs = rows.flatMapToPair(t -> {
    String content = t._2.toString();
    Multiset<Character> chars = HashMultiset.create();
    for (int i = 0; i < content.length(); i++)
        chars.add(content.charAt(i));
    List<Tuple2<Character, Long>> list = new ArrayList<Tuple2<Character, Long>>();
    for (Character ch : chars.elementSet()) {
        list.add(new Tuple2<Character, Long>(ch, (long) chars.count(ch)));
    }
    return list;
});
JavaPairRDD<Character, Long> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.printf("MapCount %,d\n", counts.count());
But I get the following exception:
15/01/03 21:51:34 ERROR MapOutputTrackerMasterActor: Map output statuses were 11141547 bytes which exceeds spark.akka.frameSize (10485760 bytes).
org.apache.spark.SparkException: Map output statuses were 11141547 bytes which exceeds spark.akka.frameSize (10485760 bytes).
at org.apache.spark.MapOutputTrackerMasterActor$$anonfun$receiveWithLogging$1.applyOrElse(MapOutputTracker.scala:59)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.MapOutputTrackerMasterActor.aroundReceive(MapOutputTracker.scala:42)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Would you please tell me where the fault is? If I process fewer rows, there is no problem. However, when the number of rows is large I always get this exception. Thanks beforehand.
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-akka-frameSize-limit-error-tp20955.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
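For anyone on a pre-1.2 build hitting the same limit, here is a minimal sketch of the two workarounds Josh describes, translated to Scala (the frame size of 64 MB and the 200 partitions are arbitrary example values, and the tiny parallelize input only stands in for the real sequence-file data):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits on Spark 1.x
// Workaround 1: raise the Akka frame size (the value is in MB; the default 10
// corresponds to the 10485760 bytes in the error message).
val conf = new SparkConf()
  .setAppName("char-frequency")
  .setMaster("local[*]")
  .set("spark.akka.frameSize", "64")
val sc = new SparkContext(conf)
// A tiny stand-in for the (Character, Long) pairs built in the Java snippet above.
val pairs = sc.parallelize("example text".toSeq).map(c => (c, 1L))
// Workaround 2: reduce the number of reducers by passing an explicit partition
// count to reduceByKey, which shrinks the map output status payload.
val counts = pairs.reduceByKey(_ + _, 200)
println(counts.count())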
Re: Joining by values
Thanks Sanjay. I will give it a try. Thanks Dilip On Sat, Jan 3, 2015 at 11:25 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com wrote: so I changed the code to rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().map(str => (str._1, str._2.toList)).collect().foreach(println) Now it prints Lists. Don't worry, I will work on this so it does not output as List(...). But I am hoping that the JOIN question that @Dilip asked is hopefully answered :-)
(2,List(1001,1000,1002,1003, 1004,1001,1006,1007))
(3,List(1011,1012,1013,1010, 1007,1009,1005,1008))
(1,List(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))
-- From: Shixiong Zhu zsxw...@gmail.com To: Sanjay Subramanian sanjaysubraman...@yahoo.com Cc: dcmovva dilip.mo...@gmail.com; user@spark.apache.org user@spark.apache.org Sent: Saturday, January 3, 2015 8:15 PM Subject: Re: Joining by values
call `map(_.toList)` to convert `CompactBuffer` to `List` Best Regards, Shixiong Zhu
2015-01-04 12:08 GMT+08:00 Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid: hi Take a look at the code I wrote here: https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/PairRddJoin.scala
/* rdd1.txt
1~4,5,6,7
2~4,5
3~6,7
rdd2.txt
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010
*/
val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-PairRddJoin")
val sc = new SparkContext(sconf)
val rdd1 = "/path/to/rdd1.txt"
val rdd2 = "/path/to/rdd2.txt"
val rdd1InvIndex = sc.textFile(rdd1).map(x => (x.split('~')(0), x.split('~')(1))).flatMapValues(str => str.split(',')).map(str => (str._2, str._1))
val rdd2Pair = sc.textFile(rdd2).map(str => (str.split('~')(0), str.split('~')(1)))
rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().collect().foreach(println)
This outputs the following. I think this may be essentially what you are looking for (I have to understand how to NOT print as CompactBuffer):
(2,CompactBuffer(1001,1000,1002,1003, 1004,1001,1006,1007))
(3,CompactBuffer(1011,1012,1013,1010, 1007,1009,1005,1008))
(1,CompactBuffer(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))
-- From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID To: dcmovva dilip.mo...@gmail.com; user@spark.apache.org user@spark.apache.org Sent: Saturday, January 3, 2015 12:19 PM Subject: Re: Joining by values
This is my design. Now let me try and code it in Spark.
rdd1.txt =
1~4,5,6,7
2~4,5
3~6,7
rdd2.txt
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010
TRANSFORM 1
===
map each value to key (like an inverted index)
4~1
5~1
6~1
7~1
5~2
4~2
6~3
7~3
TRANSFORM 2
===
Join keys in transform 1 and rdd2
4~1,1001,1000,1002,1003
4~2,1001,1000,1002,1003
5~1,1004,1001,1006,1007
5~2,1004,1001,1006,1007
6~1,1007,1009,1005,1008
6~3,1007,1009,1005,1008
7~1,1011,1012,1013,1010
7~3,1011,1012,1013,1010
TRANSFORM 3
===
Split key in transform 2 with ~ and keep key(1), i.e. 1,2,3
1~1001,1000,1002,1003
2~1001,1000,1002,1003
1~1004,1001,1006,1007
2~1004,1001,1006,1007
1~1007,1009,1005,1008
3~1007,1009,1005,1008
1~1011,1012,1013,1010
3~1011,1012,1013,1010
TRANSFORM 4
===
join by key
1~1001,1000,1002,1003,1004,1001,1006,1007,1007,1009,1005,1008,1011,1012,1013,1010
2~1001,1000,1002,1003,1004,1001,1006,1007
3~1007,1009,1005,1008,1011,1012,1013,1010
-- From: dcmovva dilip.mo...@gmail.com To: user@spark.apache.org Sent: Saturday, January 3, 2015 10:10 AM Subject: Joining by values
I have two pair RDDs in Spark like this
rdd1 = (1 -> [4,5,6,7]) (2 -> [4,5]) (3 -> [6,7])
rdd2 = (4 -> [1001,1000,1002,1003]) (5 -> [1004,1001,1006,1007]) (6 -> [1007,1009,1005,1008]) (7 -> [1011,1012,1013,1010])
I would like to combine them to look like this.
joinedRdd = (1 -> [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013]) (2 -> [1000,1001,1002,1003,1004,1006,1007]) (3 -> [1005,1007,1008,1009,1010,1011,1012,1013])
Can someone suggest how to do this? Thanks Dilip
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-values-tp20954.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: simple job + futures timeout
so it appears that I need to be on the same network, which is fine. Now I would like some advice on the best way to use the shell: is running the shell from the master or a worker fine, or should I create a new EC2 instance? Bobby -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/simple-job-futures-timeout-tp20946p20959.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Unable to build spark from source
Hi, I compiled using sbt and it takes less time. Thanks for the tip. I'm able to run the MLlib examples ( https://spark.apache.org/docs/latest/mllib-linear-methods.html ) in the pyspark shell. However I got some errors related to Spark SQL while compiling. Is that a reason to worry? I have posted the errors as a gist over here: https://gist.github.com/MechCoder/7a9c89ee38e194513080 Thanks. On Sat, Jan 3, 2015 at 11:54 PM, Michael Armbrust mich...@databricks.com wrote: I'll add that most of the Spark developers I know use sbt for day-to-day development, as it can be much faster for incremental compilation and it has several nice features. On Sat, Jan 3, 2015 at 9:59 AM, Simon Elliston Ball si...@simonellistonball.com wrote: You can use the same build commands, but it's well worth setting up a zinc server if you're doing a lot of builds. That will allow incremental Scala builds, which speeds up the process significantly. SPARK-4501 might be of interest too. Simon On 3 Jan 2015, at 17:27, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: My question was: once I make changes to a file in the source code, can I rebuild using some other command so that it picks up only the changes (because a full build takes a lot of time)? On Sat, Jan 3, 2015 at 10:40 PM, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: Yes, I've built Spark successfully using the same command mvn -DskipTests clean package, but it built because I no longer work behind a proxy. Thanks. -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com -- Godspeed, Manoj Kumar, Intern, Telecom ParisTech Mech Undergrad http://manojbits.wordpress.com
does calling cache()/persist() on an RDD trigger its immediate evaluation?
Hi Pro, I have a question regarding calling cache()/persist() on an RDD. All RDDs in Spark are lazily evaluated, but does calling cache()/persist() on an RDD trigger its immediate evaluation? My spark app is something like this:
var rdd = sc.textFile().map()
rdd.persist()
while (true) {
  val count = rdd.filter().count
  if (count == 0) break
  val newRdd = /* some code that uses `rdd` several times and produces a new RDD */
  rdd.unpersist()
  rdd = newRdd.persist()
}
In each iteration, I persist `rdd` and unpersist it at the end of the iteration, replacing `rdd` with the persisted `newRdd`. My concern is that, if the RDD is not evaluated and persisted when persist() is called, I need to change where persist()/unpersist() are called to make this more efficient. Thanks, Pengcheng - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
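For what it's worth, persist()/cache() are lazy: they only mark the RDD for storage, and nothing is computed or cached until an action runs on it. A minimal sketch of the loop with the materialization points made explicit (the variable names, input path and the filter/transform bodies are placeholders, not the original application's code):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
val sc = new SparkContext(new SparkConf().setAppName("iterate").setMaster("local[*]"))
var rdd = sc.textFile("/path/to/input.txt").map(_.trim)   // placeholder input and transform
rdd.persist(StorageLevel.MEMORY_ONLY)                     // lazy: only registers the intent to cache
var done = false
while (!done) {
  // count() is the action that actually computes `rdd` and fills the cache.
  val remaining = rdd.filter(_.nonEmpty).count()           // placeholder predicate
  if (remaining == 0) {
    done = true
  } else {
    val newRdd = rdd.map(_.dropRight(1)).persist(StorageLevel.MEMORY_ONLY)  // placeholder update
    newRdd.count()      // materialize the new RDD while the old one is still cached
    rdd.unpersist()     // only now drop the old copy
    rdd = newRdd
  }
}
The extra count() is there so that unpersist() on the old RDD happens only after an action has materialized the new one; otherwise the first real use of newRdd would have to recompute its lineage from an uncached parent.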