Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream

2015-01-02 Thread Akhil Das
Missed the $


export SPARK_CLASSPATH=/home/christophe/Development/spark-streaming3/config-1.2.1.jar:$SPARK_CLASSPATH


Thanks
Best Regards

On Fri, Jan 2, 2015 at 4:57 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Can you try:

 export SPARK_CLASSPATH=/home/christophe/Development/spark-streaming3/config-1.2.1.jar:SPARK_CLASSPATH

 Also put the jar overriding Spark's version at the end.

 Thanks
 Best Regards

 On Fri, Jan 2, 2015 at 4:46 PM, Christophe Billiard 
 christophe.billi...@gmail.com wrote:

 Thank you Akhil for your idea.

 In spark-env.sh, I set
 export
 SPARK_CLASSPATH=/home/christophe/Development/spark-streaming3/config-1.2.1.jar

 When I run  bin/compute-classpath.sh
 I get Spark's classpath:

 /home/christophe/Development/spark-streaming3/config-1.2.1.jar::/home/christophe/Development/spark-streaming3/conf:/home/christophe/Development/spark-streaming3/lib/spark-assembly-1.1.1-hadoop2.4.0.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-api-jdo-3.2.1.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-rdbms-3.2.1.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-core-3.2.2.jar

 A few errors later, Spark's classpath looks like:

 /home/christophe/Development/spark-streaming3/config-1.2.1.jar:akka-stream-experimental_2.10-1.0-M2.jar:reactive-streams-1.0.0.M3.jar::/home/christophe/Development/spark-streaming3/conf:/home/christophe/Development/spark-streaming3/lib/spark-assembly-1.1.1-hadoop2.4.0.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-api-jdo-3.2.1.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-rdbms-3.2.1.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-core-3.2.2.jar

 And the error is now:
 Exception in thread main java.lang.NoSuchMethodError:
 akka.actor.ExtendedActorSystem.systemActorOf(Lakka/actor/Props;Ljava/lang/String;)Lakka/actor/ActorRef;
 at akka.stream.scaladsl.StreamTcp.init(StreamTcp.scala:147)
 at
 akka.stream.scaladsl.StreamTcp$.createExtension(StreamTcp.scala:140)
 at akka.stream.scaladsl.StreamTcp$.createExtension(StreamTcp.scala:32)
 at akka.actor.ActorSystemImpl.registerExtension(ActorSystem.scala:654)
 at akka.actor.ExtensionId$class.apply(Extension.scala:79)
 at akka.stream.scaladsl.StreamTcp$.apply(StreamTcp.scala:134)
 at akka.http.HttpExt.bind(Http.scala:33)
 at SimpleAppStreaming3$.main(SimpleAppStreaming3.scala:250)
 at SimpleAppStreaming3.main(SimpleAppStreaming3.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 I then tried adding akka-actor_2.10-2.3.7.jar to Spark's classpath,
 and the error got worse (Spark can't even start anymore):
 [ERROR] [01/02/2015 12:08:14.807]
 [sparkDriver-akka.actor.default-dispatcher-4] [ActorSystem(sparkDriver)]
 Uncaught fatal error from thread
 [sparkDriver-akka.actor.default-dispatcher-4] shutting down ActorSystem
 [sparkDriver]
 java.lang.AbstractMethodError
 at akka.actor.ActorCell.create(ActorCell.scala:580)
 at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
 at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
 at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 [ERROR] [01/02/2015 12:08:14.811]
 [sparkDriver-akka.actor.default-dispatcher-3] [ActorSystem(sparkDriver)]
 Uncaught fatal error from thread
 [sparkDriver-akka.actor.default-dispatcher-3] shutting down ActorSystem
 [sparkDriver]
 java.lang.AbstractMethodError:
 akka.event.slf4j.Slf4jLogger.aroundReceive(Lscala/PartialFunction;Ljava/lang/Object;)V
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
 at akka.dispatch.Mailbox.run(Mailbox.scala:221)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 

Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream

2015-01-02 Thread Christophe Billiard
Thank you Akhil for your idea.

In spark-env.sh, I set
export
SPARK_CLASSPATH=/home/christophe/Development/spark-streaming3/config-1.2.1.jar

When I run  bin/compute-classpath.sh
I get Spark's classpath:
/home/christophe/Development/spark-streaming3/config-1.2.1.jar::/home/christophe/Development/spark-streaming3/conf:/home/christophe/Development/spark-streaming3/lib/spark-assembly-1.1.1-hadoop2.4.0.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-api-jdo-3.2.1.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-rdbms-3.2.1.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-core-3.2.2.jar

A few errors later, Spark's classpath looks like:
/home/christophe/Development/spark-streaming3/config-1.2.1.jar:akka-stream-experimental_2.10-1.0-M2.jar:reactive-streams-1.0.0.M3.jar::/home/christophe/Development/spark-streaming3/conf:/home/christophe/Development/spark-streaming3/lib/spark-assembly-1.1.1-hadoop2.4.0.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-api-jdo-3.2.1.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-rdbms-3.2.1.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-core-3.2.2.jar

And the error is now:
Exception in thread main java.lang.NoSuchMethodError:
akka.actor.ExtendedActorSystem.systemActorOf(Lakka/actor/Props;Ljava/lang/String;)Lakka/actor/ActorRef;
at akka.stream.scaladsl.StreamTcp.init(StreamTcp.scala:147)
at akka.stream.scaladsl.StreamTcp$.createExtension(StreamTcp.scala:140)
at akka.stream.scaladsl.StreamTcp$.createExtension(StreamTcp.scala:32)
at akka.actor.ActorSystemImpl.registerExtension(ActorSystem.scala:654)
at akka.actor.ExtensionId$class.apply(Extension.scala:79)
at akka.stream.scaladsl.StreamTcp$.apply(StreamTcp.scala:134)
at akka.http.HttpExt.bind(Http.scala:33)
at SimpleAppStreaming3$.main(SimpleAppStreaming3.scala:250)
at SimpleAppStreaming3.main(SimpleAppStreaming3.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I then tried adding akka-actor_2.10-2.3.7.jar to Spark's classpath,
and the error got worse (Spark can't even start anymore):
[ERROR] [01/02/2015 12:08:14.807]
[sparkDriver-akka.actor.default-dispatcher-4] [ActorSystem(sparkDriver)]
Uncaught fatal error from thread
[sparkDriver-akka.actor.default-dispatcher-4] shutting down ActorSystem
[sparkDriver]
java.lang.AbstractMethodError
at akka.actor.ActorCell.create(ActorCell.scala:580)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

[ERROR] [01/02/2015 12:08:14.811]
[sparkDriver-akka.actor.default-dispatcher-3] [ActorSystem(sparkDriver)]
Uncaught fatal error from thread
[sparkDriver-akka.actor.default-dispatcher-3] shutting down ActorSystem
[sparkDriver]
java.lang.AbstractMethodError:
akka.event.slf4j.Slf4jLogger.aroundReceive(Lscala/PartialFunction;Ljava/lang/Object;)V
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

My guess is that akka-actor_2.10-2.3.7.jar is overriding Spark's own akka-actor
version. But if I don't override it, I can't use akka-http/akka-stream.
Is there a way to work around this problem?
Is there a way to work around this problem?

Thanks,
Best regards


On Thu, Jan 1, 2015 at 9:28 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 It's a Typesafe config jar conflict; you will need to put the jar with the
 getDuration method in the first position of your classpath.

 Thanks
 Best Regards

 On Wed, Dec 31, 2014 at 4:38 

pyspark executor PYTHONPATH

2015-01-02 Thread Antony Mayi
Hi,
I am running spark 1.1.0 on yarn. I have a custom set of modules installed in the
same location on each executor node, and I am wondering how I can pass PYTHONPATH to
the executors so that they can use those modules.
I've tried this:
spark-env.sh: export PYTHONPATH=/tmp/test/

spark-defaults.conf: spark.executorEnv.PYTHONPATH=/tmp/test/

/tmp/test/pkg contains __init__.py and mod.py, where mod.py is:
  def test(x):
      return x
from the pyspark shell I can import the module pkg.mod without any issues:
>>> import pkg.mod
>>> print pkg.mod.test(1)
1
also the path is correctly set:
>>> print os.environ['PYTHONPATH']
/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark/python/:/tmp/test/
>>> print sys.path
['', '/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip', '/usr/lib/spark/python', '/tmp/test/', ... ]

it is even seen by the executors:
>>> sc.parallelize(range(4)).map(lambda x: os.environ['PYTHONPATH']).collect()
['/u01/yarn/local/usercache/user/filecache/24/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar:/tmp/test/:/tmp/test/',
 '/u01/yarn/local/usercache/user/filecache/24/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar:/tmp/test/:/tmp/test/',
 '/u01/yarn/local/usercache/user/filecache/24/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar:/tmp/test/:/tmp/test/',
 '/u02/yarn/local/usercache/user/filecache/24/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar:/tmp/test/:/tmp/test/']
yet it fails when actually using the module on the executor:
>>> sc.parallelize(range(4)).map(pkg.mod.test).collect()
...
ImportError: No module named mod
...
Any idea how to achieve this? I don't want to use sc.addPyFile as these are big
packages and they are installed everywhere anyway...
thank you, Antony.




Is it possible to do incremental training using ALSModel (MLlib)?

2015-01-02 Thread Wouter Samaey
Hi all,

I'm curious about MLlib and if it is possible to do incremental training on
the ALSModel.

Usually training is run first, and then you can query. But in my case, data
is collected in real-time and I want the predictions of my ALSModel to
consider the latest data without a complete re-training phase.

I've checked out these resources, but could not find any info on how to
solve this:
https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html

My question fits in a larger picture where I'm using Prediction IO, and this
in turn is based on Spark.

Thanks in advance for any advice!

Wouter






Re: DAG info

2015-01-02 Thread Robineast
Do you have some example code of what you are trying to do?

Robin






JdbcRdd for Python

2015-01-02 Thread elliott cordo
Hi All -

Is JdbcRdd currently supported for Python? I'm having trouble finding any info or
examples.


Re: pyspark executor PYTHONPATH

2015-01-02 Thread Antony Mayi
OK, I see now what's happening - pkg.mod.test is serialized by reference, and
nothing on the executors actually tries to import pkg.mod, so the reference is
broken.
So how can I get pkg.mod imported on the executors?
thanks, Antony.

 On Friday, 2 January 2015, 13:49, Antony Mayi antonym...@yahoo.com wrote:
   
 

 Hi,
I am running spark 1.1.0 on yarn. I have a custom set of modules installed in the
same location on each executor node, and I am wondering how I can pass PYTHONPATH to
the executors so that they can use those modules.
I've tried this:
spark-env.sh: export PYTHONPATH=/tmp/test/

spark-defaults.conf: spark.executorEnv.PYTHONPATH=/tmp/test/

/tmp/test/pkg contains __init__.py and mod.py, where mod.py is:
  def test(x):
      return x
from the pyspark shell I can import the module pkg.mod without any issues:
>>> import pkg.mod
>>> print pkg.mod.test(1)
1
also the path is correctly set:
>>> print os.environ['PYTHONPATH']
/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark/python/:/tmp/test/
>>> print sys.path
['', '/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip', '/usr/lib/spark/python', '/tmp/test/', ... ]

it is even seen by the executors:
>>> sc.parallelize(range(4)).map(lambda x: os.environ['PYTHONPATH']).collect()
['/u01/yarn/local/usercache/user/filecache/24/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar:/tmp/test/:/tmp/test/',
 '/u01/yarn/local/usercache/user/filecache/24/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar:/tmp/test/:/tmp/test/',
 '/u01/yarn/local/usercache/user/filecache/24/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar:/tmp/test/:/tmp/test/',
 '/u02/yarn/local/usercache/user/filecache/24/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar:/tmp/test/:/tmp/test/']
yet it fails when actually using the module on the executor:
>>> sc.parallelize(range(4)).map(pkg.mod.test).collect()
...
ImportError: No module named mod
...
Any idea how to achieve this? I don't want to use sc.addPyFile as these are big
packages and they are installed everywhere anyway...
thank you, Antony.




 
   

KafkaReceiver executor in spark streaming job on YARN suddenly killed by ResourceManager

2015-01-02 Thread Junki Kim
Hi, guys

I tried to run job of spark streaming with kafka on YARN.
My business logic is very simple:
just listen on a Kafka topic and write the DStream to HDFS on each batch
iteration.
For a few hours after launching the streaming job it works well. However, it is
suddenly killed by the ResourceManager.
The ResourceManager sends SIGTERM to the Kafka receiver executor, and the other
executors also die without any error messages in the logs.
No OutOfMemory exception, no other exceptions... it just died.
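For reference, a minimal sketch of the kind of pipeline described above (listen on a Kafka topic, write each batch to HDFS). The topic name, ZooKeeper quorum, batch interval and output path are assumptions for illustration, not the poster's actual code, and it requires the spark-streaming-kafka artifact:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-hdfs"), Seconds(60))
    // Receiver-based Kafka stream of (key, message) pairs; keep only the message.
    val lines = KafkaUtils.createStream(
      ssc, "zk1:2181", "kafka-to-hdfs-group", Map("events" -> 1),
      StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
    // One directory of text files per batch interval under the given prefix.
    lines.saveAsTextFiles("hdfs:///data/events/batch")
    ssc.start()
    ssc.awaitTermination()
  }
}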

Do you know about this issue??
Please help me to resolve this problem.

My environments are below described.

* Scala : 2.10
* Spark : 1.1
* YARN : 2.5.0-cdh5.2.1
* CDH : 5.2.1 version

Thanks in advance.

Have a nice day~






MLLIB and Openblas library in non-default dir

2015-01-02 Thread xhudik
Hi
I have compiled the OpenBLAS library into a non-standard directory and I want to
inform the Spark app about it via:
-Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so
which is a standard option in netlib-java
(https://github.com/fommil/netlib-java).

I tried 2 ways:

1. via the --conf parameter:
bin/spark-submit -v --class org.apache.spark.examples.mllib.LinearRegression --conf -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar data/mllib/sample_libsvm_data.txt

2. via the --driver-java-options parameter:
bin/spark-submit -v --driver-java-options -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so --class org.apache.spark.examples.mllib.LinearRegression examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar data/mllib/sample_libsvm_data.txt

How can I force spark-submit to propagate the non-standard location of the
OpenBLAS library to the netlib-java library?

thanks, Tomas






Re: sparkContext.textFile does not honour the minPartitions argument

2015-01-02 Thread Aniket Bhatnagar
Thanks everyone. I studied the source code and realized minPartitions is
passed over to Hadoop's InputFormat, and it's up to the InputFormat
implementation to use the parameter as a hint.

Thanks,
Aniket
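A minimal sketch of that behaviour (an illustration, not code from the thread): since minPartitions is only a hint to the InputFormat, forcing a single partition needs an explicit coalesce; the path here is hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

object SinglePartitionRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("single-partition-read"))
    // minPartitions = 1 is only a hint; the InputFormat may still create one partition per split.
    val hinted = sc.textFile("hdfs:///tmp/input.txt", 1)
    // Force exactly one partition regardless of how many splits were created.
    val single = hinted.coalesce(1)
    println(s"partitions after hint: ${hinted.partitions.length}, after coalesce: ${single.partitions.length}")
    sc.stop()
  }
}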

On Fri, Jan 2, 2015, 7:13 AM Rishi Yadav ri...@infoobjects.com wrote:

 Hi Ankit,

 The optional number-of-partitions value is meant to increase the number of
 partitions, not reduce it below the default.

 On Thu, Jan 1, 2015 at 10:43 AM, Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 I am trying to read a file into a single partition but it seems like
 sparkContext.textFile ignores the passed minPartitions value. I know I can
 repartition the RDD but I was curious to know if this is expected or if
 this is a bug that needs to be further investigated?






Re: NoClassDefFoundError when trying to run spark application

2015-01-02 Thread Pankaj Narang
Do you assemble the uber jar?

You can use sbt assembly to build the jar and then run it. That should fix the
issue.






different akka versions and spark

2015-01-02 Thread Koert Kuipers
i noticed spark 1.2.0 bumps the akka version. since spark uses its own
akka version, does this mean it can co-exist with another akka version in
the same JVM? has anyone tried this?

we have some spark apps that also use akka (2.2.3) and spray. if different
akka versions cause conflicts then spark 1.2.0 would not be backwards
compatible for us...

thanks. koert


Re: (send this email to subscribe)

2015-01-02 Thread Ted Yu
There is no need to include user@spark.apache.org in subscription request.

FYI

On Fri, Jan 2, 2015 at 7:36 AM, Pankaj pankajnaran...@gmail.com wrote:





(send this email to subscribe)

2015-01-02 Thread Pankaj



Re: Spark or Tachyon: capture data lineage

2015-01-02 Thread Sven Krasser
Agreed with Jerry. Aside from Tachyon, seeing this for general debugging
would be very helpful.

Haoyuan, is that feature you are referring to related to
https://issues.apache.org/jira/browse/SPARK-975?

In the interim, I've found the toDebugString() method useful (but it
renders execution as a tree and not as a more general DAG and therefore
doesn't always capture the flow in the way I'd like to review it). Example:

>>> a = sc.parallelize(range(1,1000)).map(lambda x: (x, x*x)).filter(lambda x: x[1] > 1000)
>>> b = a.join(a)
>>> print b.toDebugString()
(16) PythonRDD[19] at RDD at PythonRDD.scala:43
 |   MappedRDD[17] at values at NativeMethodAccessorImpl.java:-2
 |   ShuffledRDD[16] at partitionBy at NativeMethodAccessorImpl.java:-2
 +-(16) PairwiseRDD[15] at RDD at PythonRDD.scala:261
    |   PythonRDD[14] at RDD at PythonRDD.scala:43
    |   UnionRDD[13] at union at NativeMethodAccessorImpl.java:-2
    |   PythonRDD[11] at RDD at PythonRDD.scala:43
    |   ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:315
    |   PythonRDD[12] at RDD at PythonRDD.scala:43
    |   ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:315

Best,
-Sven

On Fri, Jan 2, 2015 at 12:32 PM, Haoyuan Li haoyuan...@gmail.com wrote:

 Jerry,

 Great question. Spark and Tachyon capture lineage information at different
 granularities. We are working on an integration between Spark/Tachyon about
 this. Hope to get it ready to be released soon.

 Best,

 Haoyuan

 On Fri, Jan 2, 2015 at 12:24 PM, Jerry Lam chiling...@gmail.com wrote:

 Hi spark developers,

 I was thinking it would be nice to extract the data lineage information
 from a data processing pipeline. I assume that spark/tachyon keeps this
 information somewhere. For instance, a data processing pipeline uses
 datasource A and B to produce C. C is then used by another process to
 produce D and E. Assuming A, B, C, D, E are stored on disk, it would be so
 useful if there is a way to capture this information when we are using
 spark/tachyon to query this data lineage information. For example, give me
 datasets that produce E. It should give me a graph like (A and B) -> C -> E.

 Is this something already possible with spark/tachyon? If not, do you
 think it is possible? Does anyone mind to share their experience in
 capturing the data lineage in a data processing pipeline?

 Best Regards,

 Jerry




 --
 Haoyuan Li
 AMPLab, EECS, UC Berkeley
 http://www.cs.berkeley.edu/~haoyuan/




-- 
http://sites.google.com/site/krasser/?utm_source=sig


Re: Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
Looking a little closer @ the launch_container.sh file, it appears to be
adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in
the directory pointed to by PWD. Any ideas?

On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote:

 I'm trying to get a SparkContext going in a web container which is being
 submitted through yarn-client. I'm trying two different approaches and both
 seem to be resulting in the same error from the yarn nodemanagers:

 1) I'm newing up a spark context direct, manually adding all the lib jars
 from Spark and Hadoop to the setJars() method on the SparkConf.

 2) I'm using SparkSubmit,main() to pass the classname and jar containing
 my code.


 When yarn tries to create the container, I get an exception in the driver
 Yarn application already ended, might be killed or not able to launch
 application master. When I look into the logs for the nodemanager, I see
 NoClassDefFoundError: org/apache/spark/Logging.

 Looking closer @ the contents of the nodemanagers, I see that the spark
 yarn jar was renamed to __spark__.jar and placed in the app cache while the
 rest of the libraries I specified via setJars() were all placed in the file
 cache. Any ideas as to what may be happening? I even tried adding the
 spark-core dependency and uber-jarring my own classes so that the
 dependencies would be there when Yarn tries to create the container.



Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
I'm trying to get a SparkContext going in a web container which is being
submitted through yarn-client. I'm trying two different approaches and both
seem to be resulting in the same error from the yarn nodemanagers:

1) I'm newing up a SparkContext directly, manually adding all the lib jars
from Spark and Hadoop to the setJars() method on the SparkConf.

2) I'm using SparkSubmit.main() to pass the classname and jar containing my
code.


When yarn tries to create the container, I get an exception in the driver
Yarn application already ended, might be killed or not able to launch
application master. When I look into the logs for the nodemanager, I see
NoClassDefFoundError: org/apache/spark/Logging.

Looking closer @ the contents of the nodemanagers, I see that the spark
yarn jar was renamed to __spark__.jar and placed in the app cache while the
rest of the libraries I specified via setJars() were all placed in the file
cache. Any ideas as to what may be happening? I even tried adding the
spark-core dependency and uber-jarring my own classes so that the
dependencies would be there when Yarn tries to create the container.


Re: Spark or Tachyon: capture data lineage

2015-01-02 Thread Haoyuan Li
Jerry,

Great question. Spark and Tachyon capture lineage information at different
granularities. We are working on an integration between Spark/Tachyon about
this. Hope to get it ready to be released soon.

Best,

Haoyuan

On Fri, Jan 2, 2015 at 12:24 PM, Jerry Lam chiling...@gmail.com wrote:

 Hi spark developers,

 I was thinking it would be nice to extract the data lineage information
 from a data processing pipeline. I assume that spark/tachyon keeps this
 information somewhere. For instance, a data processing pipeline uses
 datasource A and B to produce C. C is then used by another process to
 produce D and E. Assuming A, B, C, D, E are stored on disk, it would be so
 useful if there is a way to capture this information when we are using
 spark/tachyon to query this data lineage information. For example, give me
 datasets that produce E. It should give me a graph like (A and B) -> C -> E.

 Is this something already possible with spark/tachyon? If not, do you
 think it is possible? Does anyone mind to share their experience in
 capturing the data lineage in a data processing pipeline?

 Best Regards,

 Jerry




-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Publishing streaming results to web interface

2015-01-02 Thread tfrisk

Hi,

New to spark so just feeling my way in using it on a standalone server under
linux.

I'm using scala to store running count totals of certain tokens in my
streaming data and publishing a top 10 list.
eg
(TokenX,count)
(TokenY,count)
..

At the moment this is just being printed to std out using print() but I want
to be able to view these running counts from a web page - but not sure where
to start with this.

Can anyone advise or point me to examples of how this might be achieved ?

Many Thanks,

Thomas







Spark or Tachyon: capture data lineage

2015-01-02 Thread Jerry Lam
Hi spark developers,

I was thinking it would be nice to extract the data lineage information
from a data processing pipeline. I assume that spark/tachyon keeps this
information somewhere. For instance, a data processing pipeline uses
datasource A and B to produce C. C is then used by another process to
produce D and E. Assuming A, B, C, D, E are stored on disk, it would be so
useful if there is a way to capture this information when we are using
spark/tachyon to query this data lineage information. For example, give me
datasets that produce E. It should give me a graph like (A and B) -> C -> E.

Is this something already possible with spark/tachyon? If not, do you think
it is possible? Would anyone mind sharing their experience in capturing the
data lineage in a data processing pipeline?

Best Regards,

Jerry


Re: JdbcRdd for Python

2015-01-02 Thread elliott cordo
yeah.. i went through the source, and unless i'm missing something it's
not.. agreed, i'd love to see it implemented!
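For reference, a minimal sketch of the existing Scala JdbcRDD API that a Python binding would mirror; this assumes a SparkContext sc (e.g. in spark-shell), and the JDBC URL, credentials, table and bounds are hypothetical. The SQL must contain two '?' placeholders that JdbcRDD fills with the partition bounds:

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val rows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://localhost/test", "user", "pass"),
  "SELECT id, name FROM people WHERE id >= ? AND id <= ?",
  1L, 1000L, 4,
  (rs: ResultSet) => (rs.getLong("id"), rs.getString("name")))

rows.take(5).foreach(println)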

On Fri, Jan 2, 2015 at 3:59 PM, Tim Schweichler 
tim.schweich...@healthination.com wrote:

  Doesn't look like it is at the moment. If that's the case I'd love to
 see it implemented.

   From: elliott cordo elliottco...@gmail.com
 Date: Friday, January 2, 2015 at 8:17 AM
 To: user@spark.apache.org user@spark.apache.org
 Subject: JdbcRdd for Python

   Hi All -

  Is JdbcRdd currently supported?  Having trouble finding any info or
 examples?





Re: Apache Spark, Hadoop 2.2.0 without Yarn Integration

2015-01-02 Thread Moep
Well that's confusing. I have the same issue. So you're saying I have to
compile Spark with Yarn set to true to make it work with Hadoop 2.2.0 in
Standalone mode?






Re: JdbcRdd for Python

2015-01-02 Thread Tim Schweichler
Doesn't look like it is at the moment. If that's the case I'd love to see it 
implemented.

From: elliott cordo elliottco...@gmail.commailto:elliottco...@gmail.com
Date: Friday, January 2, 2015 at 8:17 AM
To: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org
Subject: JdbcRdd for Python

Hi All -

Is JdbcRdd currently supported?  Having trouble finding any info or examples?




Re: Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
So looking @ the actual code- I see where it looks like --class 'notused'
--jar null is set on the ClientBase.scala when yarn is being run in client
mode. One thing I noticed is that the jar is being set by trying to grab
the jar's uri from the classpath resources- in this case I think it's
finding the spark-yarn jar instead of spark-assembly so when it tries to
run the ExecutorLauncher.scala, none of the core classes (like
org.apache.spark.Logging) are going to be available on the classpath.

I hope this is the root of the issue. I'll keep this thread updated with my
findings.

On Fri, Jan 2, 2015 at 5:46 PM, Corey Nolet cjno...@gmail.com wrote:

 .. and looking even further, it looks like the actual command tha'ts
 executed starting up the JVM to run the
 org.apache.spark.deploy.yarn.ExecutorLauncher is passing in --class
 'notused' --jar null.

 I would assume this isn't expected but I don't see where to set these
 properties or why they aren't making it through.

 On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet cjno...@gmail.com wrote:

 Looking a little closer @ the launch_container.sh file, it appears to be
 adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in
 the directory pointed to by PWD. Any ideas?

 On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote:

 I'm trying to get a SparkContext going in a web container which is being
 submitted through yarn-client. I'm trying two different approaches and both
 seem to be resulting in the same error from the yarn nodemanagers:

 1) I'm newing up a spark context direct, manually adding all the lib
 jars from Spark and Hadoop to the setJars() method on the SparkConf.

 2) I'm using SparkSubmit,main() to pass the classname and jar containing
 my code.


 When yarn tries to create the container, I get an exception in the
 driver Yarn application already ended, might be killed or not able to
 launch application master. When I look into the logs for the nodemanager,
 I see NoClassDefFoundError: org/apache/spark/Logging.

 Looking closer @ the contents of the nodemanagers, I see that the spark
 yarn jar was renamed to __spark__.jar and placed in the app cache while the
 rest of the libraries I specified via setJars() were all placed in the file
 cache. Any ideas as to what may be happening? I even tried adding the
 spark-core dependency and uber-jarring my own classes so that the
 dependencies would be there when Yarn tries to create the container.






Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream

2015-01-02 Thread Pankaj Narang
Like before I get a java.lang.NoClassDefFoundError:
akka/stream/FlowMaterializer$

This can be solved using the assembly plugin. You need to enable the assembly
plugin in your global plugins directory

C:\Users\infoshore\.sbt\0.13\plugins

by adding a line to plugins.sbt:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.0")



 and then add the following lines in build.sbt 

import AssemblyKeys._ // put this at the top of the file

seq(assemblySettings: _*)

Also, at the bottom, don't forget to add

assemblySettings

mergeStrategy in assembly := {
  case m if m.toLowerCase.endsWith("manifest.mf")          => MergeStrategy.discard
  case m if m.toLowerCase.matches("meta-inf.*\\.sf$")      => MergeStrategy.discard
  case "log4j.properties"                                  => MergeStrategy.discard
  case m if m.toLowerCase.startsWith("meta-inf/services/") => MergeStrategy.filterDistinctLines
  case "reference.conf"                                    => MergeStrategy.concat
  case _                                                   => MergeStrategy.first
}


Now run sbt assembly; that will create a jar which can be run without the
--jars option, as it will be an uber jar containing all the jars.



Also, a NoSuchMethodError is thrown when the compiled and runtime versions
differ.

What version of Spark are you using? You need to use the same version in
build.sbt.


Here is your build.sbt


libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1"
// exclude("com.typesafe", "config")

libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.1.1"

libraryDependencies += "com.datastax.cassandra" % "cassandra-driver-core" % "2.1.3"

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0" withSources() withJavadoc()

libraryDependencies += "org.apache.cassandra" % "cassandra-thrift" % "2.0.5"

libraryDependencies += "joda-time" % "joda-time" % "2.6"


and your error is:

Exception in thread "main" java.lang.NoSuchMethodError:
com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/TimeUnit;)J
at akka.stream.StreamSubscriptionTimeoutSettings$.apply(FlowMaterializer.scala:256)

I think there is a version mismatch in the jars you use at runtime.
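One other option worth trying (a hedged sketch, not from the thread): pin the Typesafe config version in build.sbt so the compile-time and runtime versions agree. The version shown is an assumption based on the config-1.2.1.jar mentioned earlier in this thread:

// build.sbt
dependencyOverrides += "com.typesafe" % "config" % "1.2.1"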


 If you need more help add me on skype pankaj.narang


---Pankaj








Re: Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
.. and looking even further, it looks like the actual command tha'ts
executed starting up the JVM to run the
org.apache.spark.deploy.yarn.ExecutorLauncher is passing in --class
'notused' --jar null.

I would assume this isn't expected but I don't see where to set these
properties or why they aren't making it through.

On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet cjno...@gmail.com wrote:

 Looking a little closer @ the launch_container.sh file, it appears to be
 adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in
 the directory pointed to by PWD. Any ideas?

 On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote:

 I'm trying to get a SparkContext going in a web container which is being
 submitted through yarn-client. I'm trying two different approaches and both
 seem to be resulting in the same error from the yarn nodemanagers:

 1) I'm newing up a spark context direct, manually adding all the lib jars
 from Spark and Hadoop to the setJars() method on the SparkConf.

 2) I'm using SparkSubmit,main() to pass the classname and jar containing
 my code.


 When yarn tries to create the container, I get an exception in the driver
 Yarn application already ended, might be killed or not able to launch
 application master. When I look into the logs for the nodemanager, I see
 NoClassDefFoundError: org/apache/spark/Logging.

 Looking closer @ the contents of the nodemanagers, I see that the spark
 yarn jar was renamed to __spark__.jar and placed in the app cache while the
 rest of the libraries I specified via setJars() were all placed in the file
 cache. Any ideas as to what may be happening? I even tried adding the
 spark-core dependency and uber-jarring my own classes so that the
 dependencies would be there when Yarn tries to create the container.





Re: Is it possible to do incremental training using ALSModel (MLlib)?

2015-01-02 Thread Reza Zadeh
There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-4981

On Fri, Jan 2, 2015 at 8:28 PM, Peng Cheng rhw...@gmail.com wrote:

 I was under the impression that ALS wasn't designed for it :- the famous
 eBay online recommender uses SGD.
 However, you can try using the previous model as a starting point, and
 gradually reduce the number of iterations after the model stabilizes. I never
 verified this idea, so you need to at least cross-validate it before putting it
 into production.

 On 2 January 2015 at 04:40, Wouter Samaey wouter.sam...@storefront.be
 wrote:

 Hi all,

 I'm curious about MLlib and if it is possible to do incremental training
 on
 the ALSModel.

 Usually training is run first, and then you can query. But in my case,
 data
 is collected in real-time and I want the predictions of my ALSModel to
 consider the latest data without complete re-training phase.

 I've checked out these resources, but could not find any info on how to
 solve this:
 https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

 http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html

 My question fits in a larger picture where I'm using Prediction IO, and
 this
 in turn is based on Spark.

 Thanks in advance for any advice!

 Wouter








Re: Is it possible to do incremental training using ALSModel (MLlib)?

2015-01-02 Thread Peng Cheng
I was under the impression that ALS wasn't designed for it :- the famous
eBay online recommender uses SGD.
However, you can try using the previous model as a starting point, and
gradually reduce the number of iterations after the model stabilizes. I never
verified this idea, so you need to at least cross-validate it before putting it
into production.

On 2 January 2015 at 04:40, Wouter Samaey wouter.sam...@storefront.be
wrote:

 Hi all,

 I'm curious about MLlib and if it is possible to do incremental training on
 the ALSModel.

 Usually training is run first, and then you can query. But in my case, data
 is collected in real-time and I want the predictions of my ALSModel to
 consider the latest data without complete re-training phase.

 I've checked out these resources, but could not find any info on how to
 solve this:
 https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

 http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html

 My question fits in a larger picture where I'm using Prediction IO, and
 this
 in turn is based on Spark.

 Thanks in advance for any advice!

 Wouter







Re: Publishing streaming results to web interface

2015-01-02 Thread Sathish Kumaran Vairavelu
Try and see if this helps. http://zeppelin-project.org/

-Sathish


On Fri Jan 02 2015 at 8:20:54 PM Pankaj Narang pankajnaran...@gmail.com
wrote:

 Thomas,

 Spark does not provide any web interface for this directly. There might be
 third-party apps providing dashboards, but I am not aware of any for this
 purpose.

 You can use some methods so that this data is saved on the file system instead
 of being printed on screen.

 Some of the methods you can use on an RDD are saveAsObjectFile and
 saveAsTextFile.

 Now you can read these files to show them on a web interface in any language
 of your choice.

 Regards
 Pankaj










Re: Publishing streaming results to web interface

2015-01-02 Thread Pankaj Narang
Thomas,

Spark does not provide any web interface for this directly. There might be
third-party apps providing dashboards, but I am not aware of any for this
purpose.

You can use some methods so that this data is saved on the file system instead
of being printed on screen.

Some of the methods you can use on an RDD are saveAsObjectFile and
saveAsTextFile.

Now you can read these files to show them on a web interface in any language
of your choice.
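A minimal sketch of that approach (an illustration, not code from the thread): compute running token counts in the streaming job and write the top 10 out on each batch, so a separate web app can read the files. The source, batch interval and paths are assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // pair DStream operations in Spark 1.x

object TopTokensToFiles {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("top-tokens"), Seconds(10))
    ssc.checkpoint("/tmp/streaming-checkpoint") // required by updateStateByKey

    // Hypothetical source: lines of text arriving on a socket.
    val tokens = ssc.socketTextStream("localhost", 9999).flatMap(_.split("\\s+")).map((_, 1L))

    // Running count per token across batches.
    val counts = tokens.updateStateByKey[Long]((values: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + values.sum))

    // On each batch, keep the top 10 and write them where a web app can pick them up.
    counts.foreachRDD { rdd =>
      val top10 = rdd.top(10)(Ordering.by[(String, Long), Long](_._2))
      rdd.sparkContext.parallelize(top10, 1)
        .map { case (token, count) => token + "," + count }
        .saveAsTextFile("/tmp/top-tokens/" + System.currentTimeMillis())
    }

    ssc.start()
    ssc.awaitTermination()
  }
}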

Regards
Pankaj









Re: How to convert String data to RDD.

2015-01-02 Thread Ted Yu
Please see http://search-hadoop.com/m/JW1q53L9PJ
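A minimal sketch of one common approach (an assumption on my part, not necessarily what the linked thread describes): parallelize the string into an RDD[String] and pass it to jsonRDD. This assumes Spark 1.2, a SparkContext sc and SQLContext sqlContext as in the shell, and one JSON object per line in jsonURLData:

val jsonDataRDD = sc.parallelize(jsonURLData.split("\n").toSeq)
val json1 = sqlContext.jsonRDD(jsonDataRDD)
json1.printSchema()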

On Fri, Jan 2, 2015 at 4:31 PM, RP hadoo...@outlook.com wrote:

  Hello Guys,
  Spark noob here. I am trying to create RDD from  JSON data  fetched from
 URL parsing.
 My URL parsing function gives me JSON in string format.
 How do I convert JSON string to JSONRDD so that I can use it in SparkSQL.



 // get json data in string format
 val jsonURLData = getUrlAsString("https://somehost.com/test.json")

 val jsonDataRDD = ?

 val json1 = sqlContext.jsonRDD(jsonDataRDD)

 Thanks,
 RP







How to convert String data to RDD.

2015-01-02 Thread RP

Hello Guys,
 Spark noob here. I am trying to create RDD from  JSON data  fetched 
from URL parsing.

My URL parsing function gives me JSON in string format.
How do I convert JSON string to JSONRDD so that I can use it in SparkSQL.



// get json data in string format
val jsonURLData = getUrlAsString("https://somehost.com/test.json")

val jsonDataRDD = ?

val  json1  =  sqlContext.jsonRDD(jsonDataRDD)

Thanks,
RP






Re: FlatMapValues

2015-01-02 Thread Sanjay Subramanian
OK, this is how I solved it. Not elegant at all, but it works and I need to move
ahead at this time. Converting to a pair RDD is now not required.

reacRdd.map(line => line.split(',')).map(fields => {
  if (fields.length >= 10 && !fields(0).contains("VAERS_ID")) {
    (fields(0) + "," + fields(1) + "\t" + fields(0) + "," + fields(3) + "\t" + fields(0) + "," + fields(5) + "\t" + fields(0) + "," + fields(7) + "\t" + fields(0) + "," + fields(9))
  }
  else {
    ""
  }
}).flatMap(str => str.split('\t')).filter(line =>
  line.toString.length() > 0).saveAsTextFile("/data/vaers/msfx/reac/" + outFile)

  From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID
 To: Hitesh Khamesra hiteshk...@gmail.com 
Cc: Kapil Malik kma...@adobe.com; Sean Owen so...@cloudera.com; 
user@spark.apache.org user@spark.apache.org 
 Sent: Thursday, January 1, 2015 12:39 PM
 Subject: Re: FlatMapValues
   
thanks let me try that out
 

 From: Hitesh Khamesra hiteshk...@gmail.com
 To: Sanjay Subramanian sanjaysubraman...@yahoo.com 
Cc: Kapil Malik kma...@adobe.com; Sean Owen so...@cloudera.com; 
user@spark.apache.org user@spark.apache.org 
 Sent: Thursday, January 1, 2015 9:46 AM
 Subject: Re: FlatMapValues
   
How about this: apply flatMap per line, and in that function parse each
line and return all the columns as per your need.
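A minimal sketch of that suggestion, based on the sample rows quoted further down in this thread (it assumes the reacRdd and outFile from the original code, and is an illustration rather than code from the thread):

val pairs = reacRdd.flatMap { line =>
  val fields = line.split(',')
  if (fields.length >= 2 && !fields(0).contains("VAERS_ID")) {
    // fields(1), fields(3), fields(5), ... are the symptom columns;
    // emit one "id,symptom" line per symptom.
    (1 until fields.length by 2).map(i => fields(0) + "," + fields(i))
  } else {
    Seq.empty[String]
  }
}
pairs.saveAsTextFile("/data/vaers/msfx/reac/" + outFile)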


On Wed, Dec 31, 2014 at 10:16 AM, Sanjay Subramanian 
sanjaysubraman...@yahoo.com.invalid wrote:

hey guys
Some of you may care :-) but this is just to give you some background on where I am
going with this. I have an iOS medical side effects app called MedicalSideFx. I
built the entire underlying data layer aggregation using Hadoop, and currently
the search is based on Lucene. I am re-architecting the data layer by replacing
Hadoop with Spark and integrating FDA data, Canadian side-effects data and vaccine
side-effects data.

@Kapil, sorry, but flatMapValues is being reported as undefined.
To give you a complete picture of the code (it's inside IntelliJ, but that's only
for testing; the real code runs in spark-shell on my cluster):
https://github.com/sanjaysubramanian/msfx_scala/blob/master/src/main/scala/org/medicalsidefx/common/utils/AersReacColumnExtractor.scala

If you were to assume the dataset is

025003,Delirium,8.10,Hypokinesia,8.10,Hypotonia,8.10
025005,Arthritis,8.10,Injection site oedema,8.10,Injection site reaction,8.10

the present version of the code's flatMap works, but it only gives me the values

Delirium
Hypokinesia
Hypotonia
Arthritis
Injection site oedema
Injection site reaction

What I need is

025003,Delirium
025003,Hypokinesia
025003,Hypotonia
025005,Arthritis
025005,Injection site oedema
025005,Injection site reaction

thanks
sanjay
  From: Kapil Malik kma...@adobe.com
 To: Sean Owen so...@cloudera.com; Sanjay Subramanian 
sanjaysubraman...@yahoo.com 
Cc: user@spark.apache.org user@spark.apache.org 
 Sent: Wednesday, December 31, 2014 9:35 AM
 Subject: RE: FlatMapValues
   
Hi Sanjay,

Oh yes .. on flatMapValues, it's defined in PairRDDFunctions, and you need to
import org.apache.spark.SparkContext._ to use them
(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions).
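A tiny sketch of what that looks like (the sample pair is made up): with the SparkContext implicits imported, flatMapValues becomes available on an RDD of pairs.

import org.apache.spark.SparkContext._   // brings in PairRDDFunctions for RDDs of (K, V) in Spark 1.x

val pairs = sc.parallelize(Seq(("025003", "Delirium\tHypokinesia\tHypotonia")))
val exploded = pairs.flatMapValues(_.split('\t'))
// exploded: (025003,Delirium), (025003,Hypokinesia), (025003,Hypotonia)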

@Sean, yes indeed flatMap / flatMapValues both can be used.

Regards,

Kapil 



-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: 31 December 2014 21:16
To: Sanjay Subramanian
Cc: user@spark.apache.org
Subject: Re: FlatMapValues

From the clarification below, the problem is that you are calling 
flatMapValues, which is only available on an RDD of key-value tuples.
Your map function returns a tuple in one case but a String in the other, so 
your RDD is a bunch of Any, which is not at all what you want. You need to 
return a tuple in both cases, which is what Kapil pointed out.

However it's still not quite what you want. Your input is basically [key value1 
value2 value3] so you want to flatMap that to (key,value1)
(key,value2) (key,value3). flatMapValues does not come into play.

On Wed, Dec 31, 2014 at 3:25 PM, Sanjay Subramanian 
sanjaysubraman...@yahoo.com wrote:
 My understanding is as follows

 STEP 1 (This would create a pair RDD)
 ===

 reacRdd.map(line => line.split(',')).map(fields => {
   if (fields.length >= 11 && !fields(0).contains("VAERS_ID")) {
     (fields(0), (fields(1) + "\t" + fields(3) + "\t" + fields(5) + "\t" + fields(7) + "\t" + fields(9)))
   }
   else {
     ("")
   }
 })

 STEP 2
 ===
 Since the previous step created a pair RDD, I thought the flatMapValues method
 would be applicable.
 But the code does not even compile, saying that flatMapValues is not
 applicable to RDD :-(


 reacRdd.map(line => line.split(',')).map(fields => {
   if (fields.length >= 11 && !fields(0).contains("VAERS_ID")) {
     (fields(0), (fields(1) + "\t" + fields(3) + "\t" + fields(5) + "\t" + fields(7) + "\t" + fields(9)))
   }
   else {
     ("")
   }
 }).flatMapValues(skus =>
   skus.split('\t')).saveAsTextFile("/data/vaers/msfx/reac/" + outFile)


 SUMMARY
 ===
 when a dataset looks like the 

Re: SparkSQL 1.2.0 sources API error

2015-01-02 Thread Cheng Lian
Most of the time a NoSuchMethodError means wrong classpath settings, and
some jar file is overridden by a wrong version. In your case it could be
netty.
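One quick way to check that kind of conflict (a hedged sketch, not from the thread) is to print which jar the suspect class is actually loaded from, e.g. in spark-shell or in the application itself:

val cls = Class.forName("org.jboss.netty.channel.socket.nio.NioWorkerPool")
println(cls.getProtectionDomain.getCodeSource.getLocation)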


On 1/3/15 1:36 PM, Niranda Perera wrote:

Hi all,

I am evaluating the Spark sources API released with Spark 1.2.0. But
I'm getting a java.lang.NoSuchMethodError:
org.jboss.netty.channel.socket.nio.NioWorkerPool.<init>(Ljava/util/concurrent/Executor;I)V
error running the program.


Error log:
15/01/03 10:41:30 ERROR ActorSystemImpl: Uncaught fatal error from 
thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting 
down ActorSystem [sparkDriver]
java.lang.NoSuchMethodError: 
org.jboss.netty.channel.socket.nio.NioWorkerPool.init(Ljava/util/concurrent/Executor;I)V
at 
akka.remote.transport.netty.NettyTransport.init(NettyTransport.scala:283)
at 
akka.remote.transport.netty.NettyTransport.init(NettyTransport.scala:240)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)

at scala.util.Try$.apply(Try.scala:161)
at 
akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
at 
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
at 
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)

at scala.util.Success.flatMap(Try.scala:200)
at 
akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84)

at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:692)
at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:684)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at 
scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at 
akka.remote.EndpointManager.akka$remote$EndpointManager$$listens(Remoting.scala:684)
at 
akka.remote.EndpointManager$$anonfun$receive$2.applyOrElse(Remoting.scala:492)

at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.EndpointManager.aroundReceive(Remoting.scala:395)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at 
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




Following is my simple Java code:

public class AvroSparkTest {

public static void main(String[] args) throws Exception {

SparkConf sparkConf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("avro-spark-test")
    .setSparkHome("/home/niranda/software/spark-1.2.0-bin-hadoop1");

JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

JavaSQLContext sqlContext = new JavaSQLContext(sparkContext);
JavaSchemaRDD episodes = AvroUtils.avroFile(sqlContext,
    "/home/niranda/projects/avro-spark-test/src/test/resources/episodes.avro");

episodes.printSchema();
}

}

Dependencies:
<dependencies>
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-avro_2.10</artifactId>
        <version>0.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>1.2.0</version>
    </dependency>
</dependencies>

I'm using Java 1.7, IntelliJ IDEA and Maven as the build tool.

What might cause this error and what may be the remedy?

Cheers

--
Niranda




SparkSQL 1.2.0 sources API error

2015-01-02 Thread Niranda Perera
Hi all,

I am evaluating the Spark sources API released with Spark 1.2.0. But I'm
getting a java.lang.NoSuchMethodError:
org.jboss.netty.channel.socket.nio.NioWorkerPool.<init>(Ljava/util/concurrent/Executor;I)V
error running the program.

Error log:
15/01/03 10:41:30 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down
ActorSystem [sparkDriver]
java.lang.NoSuchMethodError:
org.jboss.netty.channel.socket.nio.NioWorkerPool.<init>(Ljava/util/concurrent/Executor;I)V
at
akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:283)
at
akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:240)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
at scala.util.Try$.apply(Try.scala:161)
at
akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
at
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
at
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
at scala.util.Success.flatMap(Try.scala:200)
at
akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84)
at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:692)
at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:684)
at
scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at
scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at
akka.remote.EndpointManager.akka$remote$EndpointManager$$listens(Remoting.scala:684)
at
akka.remote.EndpointManager$$anonfun$receive$2.applyOrElse(Remoting.scala:492)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.EndpointManager.aroundReceive(Remoting.scala:395)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



Following is my simple Java code:

public class AvroSparkTest {

public static void main(String[] args) throws Exception {

SparkConf sparkConf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("avro-spark-test")
    .setSparkHome("/home/niranda/software/spark-1.2.0-bin-hadoop1");

JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

JavaSQLContext sqlContext = new JavaSQLContext(sparkContext);
JavaSchemaRDD episodes = AvroUtils.avroFile(sqlContext,
    "/home/niranda/projects/avro-spark-test/src/test/resources/episodes.avro");

episodes.printSchema();
}

}

Dependencies:
<dependencies>
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-avro_2.10</artifactId>
        <version>0.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>1.2.0</version>
    </dependency>
</dependencies>

I'm using Java 1.7, IntelliJ IDEA and Maven as the build tool.

What might cause this error and what may be the remedy?

Cheers

-- 
Niranda


Re: different akka versions and spark

2015-01-02 Thread Ted Yu
Please see http://akka.io/news/2014/05/22/akka-2.3.3-released.html which
points to
http://doc.akka.io/docs/akka/2.3.3/project/migration-guide-2.2.x-2.3.x.html?_ga=1.35212129.1385865413.1420220234

Cheers

On Fri, Jan 2, 2015 at 9:11 AM, Koert Kuipers ko...@tresata.com wrote:

 i noticed spark 1.2.0 bumps the akka version. since spark uses it's own
 akka version, does this mean it can co-exist with another akka version in
 the same JVM? has anyone tried this?

 we have some spark apps that also use akka (2.2.3) and spray. if different
 akka versions causes conflicts then spark 1.2.0 would not be backwards
 compatible for us...

 thanks. koert



Spark-1.2.0 build error

2015-01-02 Thread rapelly kartheek
Hi,

I get the following error when I build spark using sbt:

[error] Nonzero exit code (128): git clone
https://github.com/ScrapCodes/sbt-pom-reader.git
/home/karthik/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader
[error] Use 'last' for the full log.


Any help please?