1.0.0 Release Date?
Can anyone comment on the anticipated date or worst-case timeframe for when Spark 1.0.0 will be released?
Re: streaming on hdfs can detect all new files, but the sum of all the rdd.count() does not equal what was detected
Thanks for the reply~~ I have solved the problem and found the reason: I was using the Master node to upload files to HDFS, and that can take up a lot of the Master's network resources. When I switched to uploading the files from another machine that is not part of the cluster, I got the correct result. QingFeng
Re: Spark on Yarn - A small issue !
You need to look at the logs files for yarn. Generally this can be done with yarn logs -applicationId your_app_id. That only works if you have log aggregation enabled though. You should be able to see atleast the application master logs through the yarn resourcemanager web ui. I would try that first. If that doesn't work you can turn on debug in the nodemanager: To review per-container launch environment, increase yarn.nodemanager.delete.debug-delay-sec to a large value (e.g. 36000), and then access the application cache through yarn.nodemanager.local-dirs on the nodes on which containers are launched. This directory contains the launch script, jars, and all environment variables used for launching each container. This process is useful for debugging classpath problems in particular. (Note that enabling this requires admin privileges on cluster settings and a restart of all node managers. Thus, this is not applicable to hosted clusters). Tom On Monday, May 12, 2014 9:38 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi All, I wanted to launch Spark on Yarn, interactive - yarn client mode. With default settings of yarn-site.xml and spark-env.sh, i followed the given link http://spark.apache.org/docs/0.8.1/running-on-yarn.html I get the pi value correct when i run without launching the shell. When i launch the shell, with following command, SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar \ SPARK_YARN_APP_JAR=examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar \ MASTER=yarn-client ./spark-shell And try to create RDDs and do some action on it, i get nothing. After sometime tasks fails. LogFile of spark: 519095 14/05/12 13:30:40 INFO YarnClientClusterScheduler: YarnClientClusterScheduler.postStartHook done 519096 14/05/12 13:30:40 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager s1:38355 with 324.4 MB RAM 519097 14/05/12 13:31:38 INFO MemoryStore: ensureFreeSpace(202584) called with curMem=0, maxMem=340147568 519098 14/05/12 13:31:38 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 197.8 KB, free 324.2 MB) 519099 14/05/12 13:31:49 INFO FileInputFormat: Total input paths to process : 1 519100 14/05/12 13:31:49 INFO NetworkTopology: Adding a new node: /default-rack/192.168.1.100:50010 519101 14/05/12 13:31:49 INFO SparkContext: Starting job: top at console:15 519102 14/05/12 13:31:49 INFO DAGScheduler: Got job 0 (top at console:15) with 4 output partitions (allowLocal=false) 519103 14/05/12 13:31:49 INFO DAGScheduler: Final stage: Stage 0 (top at console:15) 519104 14/05/12 13:31:49 INFO DAGScheduler: Parents of final stage: List() 519105 14/05/12 13:31:49 INFO DAGScheduler: Missing parents: List() 519106 14/05/12 13:31:49 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[2] at top at console:15), which has no missing par ents 519107 14/05/12 13:31:49 INFO DAGScheduler: Submitting 4 missing tasks from Stage 0 (MapPartitionsRDD[2] at top at console:15) 519108 14/05/12 13:31:49 INFO YarnClientClusterScheduler: Adding task set 0.0 with 4 tasks 519109 14/05/12 13:31:49 INFO RackResolver: Resolved s1 to /default-rack 519110 14/05/12 13:31:49 INFO ClusterTaskSetManager: Starting task 0.0:3 as TID 0 on executor 1: s1 (PROCESS_LOCAL) 519111 14/05/12 13:31:49 INFO ClusterTaskSetManager: Serialized task 0.0:3 as 1811 bytes in 4 ms 519112 14/05/12 13:31:49 INFO ClusterTaskSetManager: Starting task 0.0:0 as TID 1 on executor 1: s1 (NODE_LOCAL) 519113 14/05/12 13:31:49 INFO 
ClusterTaskSetManager: Serialized task 0.0:0 as 1811 bytes in 1 ms 519114 14/05/12 13:32:18 INFO YarnClientSchedulerBackend: Executor 1 disconnected, so removing it 519115 14/05/12 13:32:18 ERROR YarnClientClusterScheduler: Lost executor 1 on s1: remote Akka client shutdown 519116 14/05/12 13:32:18 INFO ClusterTaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 519117 14/05/12 13:32:18 WARN ClusterTaskSetManager: Lost TID 1 (task 0.0:0) 519118 14/05/12 13:32:18 WARN ClusterTaskSetManager: Lost TID 0 (task 0.0:3) 519119 14/05/12 13:32:18 INFO DAGScheduler: Executor lost: 1 (epoch 0) 519120 14/05/12 13:32:18 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 519121 14/05/12 13:32:18 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor Do I need to set any other environment variable specifically for Spark on YARN? What could be the issue? Can anyone please help me in this regard. Thanks in advance!
Re: How to run shark?
My configuration is as follows; the slave node has been configured, but I don't know what has happened to Shark. Can you help me?

shark-env.sh:

export SPARK_USER_HOME=/root
export SPARK_MEM=2g
export SCALA_HOME=/root/scala-2.11.0-RC4
export SHARK_MASTER_MEM=1g
export HIVE_CONF_DIR=/usr/lib/hive/conf
export HIVE_HOME=/usr/lib/hive
export HADOOP_HOME=/usr/lib/hadoop
export SPARK_HOME=/root/spark-0.9.1
export MASTER=spark://192.168.10.220:7077
export SHARK_EXEC_MODE=yarn
SPARK_JAVA_OPTS=" -Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS
export SPARK_ASSEMBLY_JAR=/root/spark-0.9.1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
export SHARK_ASSEMBLY_JAR=/root/shark-0.9.1-bin-hadoop2/target/scala-2.10/shark_2.10-0.9.1.jar

Best regards,
EndpointWriter: AssociationError
Hi, I've been trying to run my newly created spark job on my local master instead of just runing it using maven and i haven't been able to make it work. My main issue seems to be related to that error: 14/05/14 09:34:26 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@devsrv:7077] - [akka.tcp://driverClient@ devsrv .mydomain.priv:50237]: Error [Association failed with [akka.tcp://driverClient@ devsrv . mydomain.priv :50237]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://driverClient@ devsrv . mydomain.priv :50237] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: devsrv . mydomain.priv /172.16.202.246:50237 ] FYI, the port 50237 is always changing so i'm not sure what it's supposed to be. I get this kind of error from many commands including ./bin/spark-class org.apache.spark.deploy.Client kill spark:// devsrv :7077 driver-20140513165819-0001 and ./bin/spark-class org.apache.spark.deploy.Client launch spark:// devsrv :7077 file:///path/to/my/rspark-jobs-1.0.0.0-jar-with-dependencies.jar my.jobs.spark.Indexer I get this error from the kill even if the driver has already finished or does not exist. The launch actually works (or seems to) as i can see my driver appearing on the web UI as SUBMITTED When i then deploy a worker my job starts running and i have no error in it's log. The job, though, never ends and holds after starting to spill on disk. 2014-05-14 09:46:31 INFO BlockFetcherIterator$BasicBlockFetcherIterator:50 - Started 0 remote gets in 45 ms 2014-05-14 09:46:33 WARN ExternalAppendOnlyMap:62 - Spilling in-memory map of 147 MB to disk (1 time so far) 2014-05-14 09:46:33 WARN ExternalAppendOnlyMap:62 - Spilling in-memory map of 130 MB to disk (1 time so far) 2014-05-14 09:46:33 WARN ExternalAppendOnlyMap:62 - Spilling in-memory map of 118 MB to disk (1 time so far) and the worker ends up crashing with those errors: 14/05/14 09:51:23 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val) scala.MatchError: FAILED (of class scala.Enumeration$Val) at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/14 09:51:24 ERROR EndpointWriter: AssociationError [akka.tcp://sparkwor...@devsrv.mydomain.priv:35607] - [akka.tcp://dri...@devsrv.mydomain.priv:45792]: Error [Association failed with [akka.tcp://dri...@devsrv.mydomain.priv:45792]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://dri...@devsrv.mydomain.priv:45792] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: devsrv.mydomain.priv/172.XXX.XXX.XXX:45792 ] 14/05/14 09:51:24 ERROR EndpointWriter: AssociationError [akka.tcp://sparkwor...@devsrv.mydomain.priv:35607] - [akka.tcp://dri...@devsrv.mydomain.priv:45792]: Error [Association failed with 
[akka.tcp://dri...@devsrv.mydomain.priv:45792]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://dri...@devsrv.mydomain.priv:45792] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: devsrv.mydomain.priv/172.XXX.XXX.XXX:45792 ] 14/05/14 09:51:24 ERROR EndpointWriter: AssociationError [akka.tcp://sparkwor...@devsrv.mydomain.priv:35607] - [akka.tcp://dri...@devsrv.mydomain.priv:45792]: Error [Association failed with [akka.tcp://dri...@devsrv.mydomain.priv:45792]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://dri...@devsrv.mydomain.priv:45792] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: devsrv.mydomain.priv/172.XXX.XXX.XXX:45792 I'm almost sure those errors (or at least part of them) have nothing to do with our jar (as we get them even when killing an inexistant driver). We're using spark 0.9.1 for hadoop 1. Any suggestions ? Thanks Regards Laurent
Re: 1.0.0 Release Date?
Hey Brian, We've had a fairly stable 1.0 branch for a while now. I started a vote on the dev list last night... voting can take some time, but it usually wraps up anywhere from a few days to a few weeks. However, you can get started right now with the release candidates. These are likely to be almost identical to the final release. - Patrick On Tue, May 13, 2014 at 9:40 AM, bhusted brian.hus...@gmail.com wrote: Can anyone comment on the anticipated date or worst-case timeframe for when Spark 1.0.0 will be released?
Re: Distribute jar dependencies via sc.AddJar(fileName)
I don't know whether this would fix the problem. In v0.9, you need `yarn-standalone` instead of `yarn-cluster`. See https://github.com/apache/spark/commit/328c73d037c17440c2a91a6c88b4258fbefa0c08 On Tue, May 13, 2014 at 11:36 PM, Xiangrui Meng men...@gmail.com wrote: Does v0.9 support yarn-cluster mode? I checked SparkContext.scala in v0.9.1 and didn't see special handling of `yarn-cluster`. -Xiangrui On Mon, May 12, 2014 at 11:14 AM, DB Tsai dbt...@stanford.edu wrote: We're deploying Spark in yarn-cluster mode (Spark 0.9), and we add jar dependencies in command line with --addJars option. However, those external jars are only available in the driver (application running in hadoop), and not available in the executors (workers). After doing some research, we realize that we've to push those jars to executors in driver via sc.AddJar(fileName). Although in the driver's log (see the following), the jar is successfully added in the http server in the driver, and I confirm that it's downloadable from any machine in the network, I still get `java.lang.NoClassDefFoundError` in the executors. 14/05/09 14:51:41 INFO spark.SparkContext: Added JAR analyticshadoop-eba5cdce1.jar at http://10.0.0.56:42522/jars/analyticshadoop-eba5cdce1.jar with timestamp 1399672301568 Then I check the log in the executors, and I don't find anything `Fetching file with timestamp timestamp`, which implies something is wrong; the executors are not downloading the external jars. Any suggestion what we can look at? After digging into how spark distributes external jars, I wonder the scalability of this approach. What if there are thousands of nodes downloading the jar from single http server in the driver? Why don't we push the jars into HDFS distributed cache by default instead of distributing them via http server? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai
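For readers following the thread, here is a minimal, self-contained sketch of the sc.addJar pattern under discussion. The master URL, jar path, and class names are illustrative placeholders, not DB's actual job.

import org.apache.spark.{SparkConf, SparkContext}

object AddJarExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("add-jar-example")
      .setMaster("spark://master-host:7077")   // illustrative

    val sc = new SparkContext(conf)

    // The driver serves the jar over its embedded HTTP server; each executor is expected
    // to fetch it (look for "Fetching ... with timestamp" in the executor logs) before
    // running tasks that need the classes inside it.
    sc.addJar("/path/to/external-dependency.jar")

    sc.parallelize(1 to 10).map(_ * 2).collect().foreach(println)
    sc.stop()
  }
}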
accessing partition i+1 from mapper of partition i
Hi, I am trying to find a way to fill in missing values in an RDD. The RDD is a sorted sequence. For example, (1, 2, 3, 5, 8, 11, ...) I need to fill in the missing numbers and get (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11).

One way to do this is to slide and zip: rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11, ...)) x = rdd1.first rdd2 = rdd1 filter (_ != x) rdd3 = rdd2 zip rdd1 rdd4 = rdd3 flatmap { (x, y) => generate missing elements between x and y }

Another method, which I think is more efficient, is to use mapPartitions() on rdd1 to iterate over the elements of rdd1 in each partition. However, that leaves the boundaries of the partitions unfilled. *Is there a way, within the function passed to mapPartitions, to read the first element of the next partition?*

The latter approach also appears to work for a general sliding-window calculation on the RDD. The former technique requires a lot of sliding and zipping and I believe it is not efficient. If only I could read the next partition... I have tried passing a pointer to rdd1 into the function passed to mapPartitions, but the rdd1 pointer turns out to be NULL, I guess because Spark cannot deal with a mapper calling another mapper (since it happens on a worker, not the driver). Mohit.
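A sketch of one way to answer the boundary question without reading partition i+1 directly: collect the first element of every partition to the driver (a tiny amount of data), broadcast that map, and let each partition fill the gap up to the head of the next partition. The names and the 3-partition example are illustrative, and it assumes a sorted RDD[Int] with no duplicates.

val rdd = sc.parallelize(List(1, 2, 3, 5, 8, 11), 3)

// First element of each non-empty partition, keyed by partition index.
val heads = rdd.mapPartitionsWithIndex { (i, it) =>
  if (it.hasNext) Iterator((i, it.next())) else Iterator.empty
}.collect().toMap

val headsBc = sc.broadcast(heads)

val filled = rdd.mapPartitionsWithIndex { (i, it) =>
  val elems = it.toList
  if (elems.isEmpty) Iterator.empty
  else {
    // Head of the next non-empty partition, if any.
    val nextHead = headsBc.value.toSeq.filter(_._1 > i).sortBy(_._1).headOption.map(_._2)
    // Fill gaps between consecutive elements within this partition...
    val within = elems.zip(elems.drop(1)).flatMap { case (x, y) => x until y }
    // ...and bridge the gap to the next partition (or keep the final element).
    val tail = nextHead.map(h => elems.last until h).getOrElse(Seq(elems.last))
    (within ++ tail).iterator
  }
}

filled.collect()   // Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)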
RE: How to use Mahout VectorWritable in Spark.
The issue of console:12: error: not found: type Text is resolved by import statement.. But still facing issue with imports of VectorWritable. Mahout math jar is added to classpath as I can check on WebUI as well on shell scala System.getenv res1: java.util.Map[String,String] = {TERM=xterm, JAVA_HOME=/usr/lib/jvm/java-6-openjdk, SHLVL=2, SHELL_JARS=/home/hduser/installations/work-space/mahout-math-0.7.jar, SPARK_MASTER_WEBUI_PORT=5050, LESSCLOSE=/usr/bin/lesspipe %s %s, SSH_CLIENT=10.112.67.149 55123 22, SPARK_HOME=/home/hduser/installations/spark-0.9.0, MAIL=/var/mail/hduser, SPARK_WORKER_DIR=/tmp/spark-hduser-worklogs/work, XDG_SESSION_COOKIE=fbd2e4304c8c75dd606c36100186-1400039480.256868-916349946, https_proxy=https://DS-1078D2486320:3128/, NICKNAME=vm01, JAVA_OPTS= -Djava.library.path= -Xms512m -Xmx512m, PWD=/home/hduser/installations/work-space/KMeansClustering_1, SSH_TTY=/dev/pts/0, SPARK_MASTER_PORT=7077, LOGNAME=hduser, MASTER=spark://VM-52540048731A:7077, SPARK_WORKER_MEMORY=2g, HADOOP_HOME=/usr/lib/hadoop, SS... Still not able to import Mahout Classes.. Any ideas ?? Thanks Stuti Awasthi -Original Message- From: Stuti Awasthi Sent: Wednesday, May 14, 2014 1:13 PM To: user@spark.apache.org Subject: RE: How to use Mahout VectorWritable in Spark. Hi Xiangrui, Thanks for the response .. I tried few ways to include mahout-math jar while launching Spark shell.. but no success.. Can you please point what I am doing wrong 1. mahout-math.jar exported in CLASSPATH, and PATH 2. Tried Launching Spark Shell by : MASTER=spark://HOSTNAME:PORT ADD_JARS=~/installations/work-space/mahout-math-0.7.jar park-0.9.0/bin/spark-shell After launching, I checked the environment details on WebUi: It looks like mahout-math jar is included. spark.jars /home/hduser/installations/work-space/mahout-math-0.7.jar Then I try : scala import org.apache.mahout.math.VectorWritable console:10: error: object mahout is not a member of package org.apache import org.apache.mahout.math.VectorWritable scala val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable]) console:12: error: not found: type Text val data = sc.sequenceFile(/stuti/ML/Clustering/KMeans/HAR/KMeans_dataset_seq/part-r-0, classOf[Text], classOf[VectorWritable]) ^ Im using Spark 0.9 and Hadoop 1.0.4 and Mahout 0.7 Thanks Stuti -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Wednesday, May 14, 2014 11:56 AM To: user@spark.apache.org Subject: Re: How to use Mahout VectorWritable in Spark. You need val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWriteable]) to load the data. After that, you can do val data = raw.values.map(_.get) To get an RDD of mahout's Vector. You can use `--jar mahout-math.jar` when you launch spark-shell to include mahout-math. Best, Xiangrui On Tue, May 13, 2014 at 10:37 PM, Stuti Awasthi stutiawas...@hcl.com wrote: Hi All, I am very new to Spark and trying to play around with Mllib hence apologies for the basic question. I am trying to run KMeans algorithm using Mahout and Spark MLlib to see the performance. Now initial datasize was 10 GB. Mahout converts the data in Sequence File Text,VectorWritable which is used for KMeans Clustering. The Sequence File crated was ~ 6GB in size. Now I wanted if I can use the Mahout Sequence file to be executed in Spark MLlib for KMeans . I have read that SparkContext.sequenceFile may be used here. 
Hence I tried to read my sequence file as below but am getting the error: Command on the Spark shell: scala> val data = sc.sequenceFile[String,VectorWritable](/ KMeans_dataset_seq/part-r-0,String,VectorWritable) console:12: error: not found: type VectorWritable val data = sc.sequenceFile[String,VectorWritable]( /KMeans_dataset_seq/part-r-0,String,VectorWritable) Here I have 2 questions: 1. Mahout has “Text” as the key, but Spark is printing “not found: type: Text”, hence I changed it to String. Is this correct? 2. How will VectorWritable be found in Spark? Do I need to include the Mahout jar in the classpath, or is there another option? Please suggest. Regards Stuti Awasthi
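A sketch of the full incantation Xiangrui describes, assuming mahout-math is on the classpath of both the shell and the workers (e.g. via ADD_JARS as above); the HDFS path here is a placeholder.

import org.apache.hadoop.io.Text
import org.apache.mahout.math.VectorWritable

// Key/value classes are given explicitly, since VectorWritable is not a Spark type.
val raw = sc.sequenceFile("hdfs:///path/to/KMeans_dataset_seq",
                          classOf[Text], classOf[VectorWritable])

// Unwrap the Writable to get Mahout Vector objects.
val data = raw.values.map(_.get)

data.first()   // should be a Mahout Vector if the jar is visible to the executors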
Proper way to create standalone app with custom Spark version
We can create a standalone Spark application by simply adding spark-core_2.x to build.sbt/pom.xml and connecting it to the Spark master. We can also compile a custom version of Spark (e.g. compiled against Hadoop 2.x) from source and deploy it to the cluster manually. But what is the proper way to use a _custom version_ of Spark in a _standalone application_? We can't simply include the custom version in build.sbt/pom.xml, since it's not in the central repository. I'm currently trying to deploy the custom version to a local Maven repository and add it to an SBT project. Another option is to add Spark as a local jar to every project. But both of these ways look overcomplicated and, in general, wrong. What is the intended way to solve this issue? Thanks, Andrei
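A sketch of what the local-publish route can look like with sbt, under the assumption that the custom build declares version 1.0.0-SNAPSHOT; adjust to whatever your build actually publishes.

// build.sbt of the application

name := "my-spark-app"

scalaVersion := "2.10.4"

// After `sbt publishLocal` in the custom Spark checkout, the artifact resolves from the local Ivy cache.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT"

// If the custom build was installed with `mvn install` instead, point sbt at the local Maven repository:
resolvers += "Local Maven Repository" at "file://" + Path.userHome.absolutePath + "/.m2/repository"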
Re: logging in pyspark
foreach vs. map isn't the issue. Both require serializing the called function, so the pickle error would still apply, yes? And at the moment, I'm just testing. Definitely wouldn't want to log something for each element, but may want to detect something and log for SOME elements. So my question is: how are other people doing logging from distributed tasks, given the serialization issues? The same issue actually exists in Scala, too. I could work around it by creating a small serializable object that provides a logger, but it seems kind of kludgy to me, so I'm wondering if other people are logging from tasks, and if so, how? Diana On Tue, May 6, 2014 at 6:24 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I think you're looking for RDD.foreach()http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#foreach . According to the programming guidehttp://spark.apache.org/docs/latest/scala-programming-guide.html : Run a function func on each element of the dataset. This is usually done for side effects such as updating an accumulator variable (see below) or interacting with external storage systems. Do you really want to log something for each element of your RDD? Nick On Tue, May 6, 2014 at 3:31 PM, Diana Carroll dcarr...@cloudera.comwrote: What should I do if I want to log something as part of a task? This is what I tried. To set up a logger, I followed the advice here: http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off logger = logging.getLogger(py4j) logger.setLevel(logging.INFO) logger.addHandler(logging.StreamHandler()) This works fine when I call it from my driver (ie pyspark): logger.info(this works fine) But I want to try logging within a distributed task so I did this: def logTestMap(a): logger.info(test) return a myrdd.map(logTestMap).count() and got: PicklingError: Can't pickle 'lock' object So it's trying to serialize my function and can't because of a lock object used in logger, presumably for thread-safeness. But then...how would I do it? Or is this just a really bad idea? Thanks Diana
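For the Scala side of this, the "small serializable object" workaround Diana mentions can be as simple as the sketch below; myRdd and looksSuspicious are placeholders. The logger is created lazily inside each executor JVM, so nothing unserializable travels with the closure, and the output ends up in the executors' logs rather than on the driver.

import java.util.logging.Logger

object TaskLogging extends Serializable {
  // @transient + lazy: never shipped with a closure, re-created on first use in each JVM.
  @transient lazy val log: Logger = Logger.getLogger("task-logging")
}

val flagged = myRdd.map { x =>
  if (looksSuspicious(x)) TaskLogging.log.info("suspicious element: " + x)
  x
}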
Re: Distribute jar dependencies via sc.AddJar(fileName)
Hi Xiangrui, I actually used `yarn-standalone`, sorry for misleading. I did debugging in the last couple days, and everything up to updateDependency in executor.scala works. I also checked the file size and md5sum in the executors, and they are the same as the one in driver. Gonna do more testing tomorrow. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, May 13, 2014 at 11:41 PM, Xiangrui Meng men...@gmail.com wrote: I don't know whether this would fix the problem. In v0.9, you need `yarn-standalone` instead of `yarn-cluster`. See https://github.com/apache/spark/commit/328c73d037c17440c2a91a6c88b4258fbefa0c08 On Tue, May 13, 2014 at 11:36 PM, Xiangrui Meng men...@gmail.com wrote: Does v0.9 support yarn-cluster mode? I checked SparkContext.scala in v0.9.1 and didn't see special handling of `yarn-cluster`. -Xiangrui On Mon, May 12, 2014 at 11:14 AM, DB Tsai dbt...@stanford.edu wrote: We're deploying Spark in yarn-cluster mode (Spark 0.9), and we add jar dependencies in command line with --addJars option. However, those external jars are only available in the driver (application running in hadoop), and not available in the executors (workers). After doing some research, we realize that we've to push those jars to executors in driver via sc.AddJar(fileName). Although in the driver's log (see the following), the jar is successfully added in the http server in the driver, and I confirm that it's downloadable from any machine in the network, I still get `java.lang.NoClassDefFoundError` in the executors. 14/05/09 14:51:41 INFO spark.SparkContext: Added JAR analyticshadoop-eba5cdce1.jar at http://10.0.0.56:42522/jars/analyticshadoop-eba5cdce1.jar with timestamp 1399672301568 Then I check the log in the executors, and I don't find anything `Fetching file with timestamp timestamp`, which implies something is wrong; the executors are not downloading the external jars. Any suggestion what we can look at? After digging into how spark distributes external jars, I wonder the scalability of this approach. What if there are thousands of nodes downloading the jar from single http server in the driver? Why don't we push the jars into HDFS distributed cache by default instead of distributing them via http server? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai
Re: java.lang.StackOverflowError when calling count()
Would cache() + count() every N iterations work just as well as checkPoint() + count() to get around this issue? We're basically trying to get Spark to avoid working on too lengthy a lineage at once, right? Nick On Tue, May 13, 2014 at 12:04 PM, Xiangrui Meng men...@gmail.com wrote: After checkPoint, call count directly to materialize it. -Xiangrui On Tue, May 13, 2014 at 4:20 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: We are running into same issue. After 700 or so files the stack overflows, cache, persist checkpointing dont help. Basically checkpointing only saves the RDD when it is materialized it only materializes in the end, then it runs out of stack. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Tue, May 13, 2014 at 11:40 AM, Xiangrui Meng men...@gmail.com wrote: You have a long lineage that causes the StackOverflow error. Try rdd.checkPoint() and rdd.count() for every 20~30 iterations. checkPoint can cut the lineage. -Xiangrui On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan gh...@lanl.gov wrote: Dear Sparkers: I am using Python spark of version 0.9.0 to implement some iterative algorithm. I got some errors shown at the end of this email. It seems that it's due to the Java Stack Overflow error. The same error has been duplicated on a mac desktop and a linux workstation, both running the same version of Spark. The same line of code works correctly after quite some iterations. At the line of error, rdd__new.count() could be 0. (In some previous rounds, this was also 0 without any problem). Any thoughts on this? Thank you very much, - Guanhua CODE:print round, round, rdd__new.count() File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py, line 542, in count 14/05/12 16:20:28 INFO TaskSetManager: Loss was due to java.lang.StackOverflowError [duplicate 1] return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() 14/05/12 16:20:28 ERROR TaskSetManager: Task 8419.0:0 failed 1 times; aborting job File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py, line 533, in sum 14/05/12 16:20:28 INFO TaskSchedulerImpl: Ignoring update with state FAILED from TID 1774 because its task set is gone return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py, line 499, in reduce vals = self.mapPartitions(func).collect() File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py, line 463, in collect bytesInJava = self._jrdd.collect().iterator() File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o4317.collect. 
: org.apache.spark.SparkException: Job aborted: Task 8419.0:1 failed 1 times (most recent failure: Exception failure: java.lang.StackOverflowError) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at
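A sketch of the checkpoint-every-N-iterations pattern suggested in this thread, with placeholder names (initialRdd, step, numIterations); the important part is that count() runs right after checkpoint() so the lineage is actually truncated before it grows further.

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // any reliable shared storage

var rdd = initialRdd
val N = 25

for (i <- 1 to numIterations) {
  rdd = step(rdd)              // one iteration of the algorithm (placeholder)
  if (i % N == 0) {
    rdd.cache()                // avoids computing the RDD twice for the checkpoint write
    rdd.checkpoint()           // marks the RDD; data is written when the next job runs
    rdd.count()                // forces materialization, cutting the lineage here
  }
}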
Re: java.lang.StackOverflowError when calling count()
If we do cache() + count() after say every 50 iterations. The whole process becomes very slow. I have tried checkpoint() , cache() + count(), saveAsObjectFiles(). Nothing works. Materializing RDD's lead to drastic decrease in performance if we don't materialize, we face stackoverflowerror. On Wed, May 14, 2014 at 10:25 AM, Nick Chammas [via Apache Spark User List] ml-node+s1001560n5683...@n3.nabble.com wrote: Would cache() + count() every N iterations work just as well as checkPoint() + count() to get around this issue? We're basically trying to get Spark to avoid working on too lengthy a lineage at once, right? Nick On Tue, May 13, 2014 at 12:04 PM, Xiangrui Meng [hidden email]http://user/SendEmail.jtp?type=nodenode=5683i=0 wrote: After checkPoint, call count directly to materialize it. -Xiangrui On Tue, May 13, 2014 at 4:20 AM, Mayur Rustagi [hidden email]http://user/SendEmail.jtp?type=nodenode=5683i=1 wrote: We are running into same issue. After 700 or so files the stack overflows, cache, persist checkpointing dont help. Basically checkpointing only saves the RDD when it is materialized it only materializes in the end, then it runs out of stack. Regards Mayur Mayur Rustagi Ph: a href=tel:%2B1%20%28760%29%20203%203257 value=+17602033257+1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Tue, May 13, 2014 at 11:40 AM, Xiangrui Meng [hidden email]http://user/SendEmail.jtp?type=nodenode=5683i=2 wrote: You have a long lineage that causes the StackOverflow error. Try rdd.checkPoint() and rdd.count() for every 20~30 iterations. checkPoint can cut the lineage. -Xiangrui On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan [hidden email]http://user/SendEmail.jtp?type=nodenode=5683i=3 wrote: Dear Sparkers: I am using Python spark of version 0.9.0 to implement some iterative algorithm. I got some errors shown at the end of this email. It seems that it's due to the Java Stack Overflow error. The same error has been duplicated on a mac desktop and a linux workstation, both running the same version of Spark. The same line of code works correctly after quite some iterations. At the line of error, rdd__new.count() could be 0. (In some previous rounds, this was also 0 without any problem). Any thoughts on this? 
Thank you very much, - Guanhua CODE:print round, round, rdd__new.count() File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py, line 542, in count 14/05/12 16:20:28 INFO TaskSetManager: Loss was due to java.lang.StackOverflowError [duplicate 1] return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() 14/05/12 16:20:28 ERROR TaskSetManager: Task 8419.0:0 failed 1 times; aborting job File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py, line 533, in sum 14/05/12 16:20:28 INFO TaskSchedulerImpl: Ignoring update with state FAILED from TID 1774 because its task set is gone return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py, line 499, in reduce vals = self.mapPartitions(func).collect() File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py, line 463, in collect bytesInJava = self._jrdd.collect().iterator() File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File /home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o4317.collect. : org.apache.spark.SparkException: Job aborted: Task 8419.0:1 failed 1 times (most recent failure: Exception failure: java.lang.StackOverflowError) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619) at
Re: How to run shark?
Is your Spark working .. can you try running spark shell? http://spark.apache.org/docs/0.9.1/quick-start.html If spark is working we can move this to shark user list(copied here) Also I am anything but a sir :) Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, May 14, 2014 at 12:49 PM, Sophia sln-1...@163.com wrote: My configuration is just like this,the slave's node has been configuate,but I donnot know what's happened to the shark?Can you help me Sir? shark-env.sh export SPARK_USER_HOME=/root export SPARK_MEM=2g export SCALA_HOME=/root/scala-2.11.0-RC4 export SHARK_MASTER_MEM=1g export HIVE_CONF_DIR=/usr/lib/hive/conf export HIVE_HOME=/usr/lib/hive export HADOOP_HOME=/usr/lib/hadoop export SPARK_HOME=/root/spark-0.9.1 export MASTER=spark://192.168.10.220:7077 export SHARK_EXEC_MODE=yarn SPARK_JAVA_OPTS= -Dspark.local.dir=/tmp SPARK_JAVA_OPTS+=-Dspark.kryoserializer.buffer.mb=10 SPARK_JAVA_OPTS+=-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps export SPARK_JAVA_OPTS export SPARK_ASSEMBLY_JAR=/root/spark-0.9.1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar export SHARK_ASSEMBLY_JAR=/root/shark-0.9.1-bin-hadoop2/target/scala-2.10/shark_2.10-0.9.1.jar Best regards, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-shark-tp5581p5688.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: NoSuchMethodError: breeze.linalg.DenseMatrix
Hi, DB i've add breeze jars to workers using sc.addJar() breeze jars include : breeze-natives_2.10-0.7.jar breeze-macros_2.10-0.3.jar breeze-macros_2.10-0.3.1.jar breeze_2.10-0.8-SNAPSHOT.jar breeze_2.10-0.7.jar almost all the jars about breeze i can find, but still NoSuchMethodError: breeze.linalg.DenseMatrix from the executor stderr, you can see the executor successsully fetches these jars, what's wrong about my method? thank you! 14/05/14 20:36:02 INFO Executor: Fetching http://192.168.0.106:42883/jars/breeze-natives_2.10-0.7.jar with timestamp 1400070957376 14/05/14 20:36:02 INFO Utils: Fetching http://192.168.0.106:42883/jars/breeze-natives_2.10-0.7.jar to /tmp/fetchFileTemp7468892065227766972.tmp 14/05/14 20:36:02 INFO Executor: Adding file:/home/wxhsdp/spark/spark/tags/v1.0.0-rc3/work/app-20140514203557-/0/./breeze-natives_2.10-0.7.jar to class loader 14/05/14 20:36:02 INFO Executor: Fetching http://192.168.0.106:42883/jars/breeze-macros_2.10-0.3.jar with timestamp 1400070957441 14/05/14 20:36:02 INFO Utils: Fetching http://192.168.0.106:42883/jars/breeze-macros_2.10-0.3.jar to /tmp/fetchFileTemp2324565598765584917.tmp 14/05/14 20:36:02 INFO Executor: Adding file:/home/wxhsdp/spark/spark/tags/v1.0.0-rc3/work/app-20140514203557-/0/./breeze-macros_2.10-0.3.jar to class loader 14/05/14 20:36:02 INFO Executor: Fetching http://192.168.0.106:42883/jars/breeze_2.10-0.8-SNAPSHOT.jar with timestamp 1400070957358 14/05/14 20:36:02 INFO Utils: Fetching http://192.168.0.106:42883/jars/breeze_2.10-0.8-SNAPSHOT.jar to /tmp/fetchFileTemp8730123100104850193.tmp 14/05/14 20:36:02 INFO Executor: Adding file:/home/wxhsdp/spark/spark/tags/v1.0.0-rc3/work/app-20140514203557-/0/./breeze_2.10-0.8-SNAPSHOT.jar to class loader 14/05/14 20:36:02 INFO Executor: Fetching http://192.168.0.106:42883/jars/breeze-macros_2.10-0.3.1.jar with timestamp 1400070957414 14/05/14 20:36:02 INFO Utils: Fetching http://192.168.0.106:42883/jars/breeze-macros_2.10-0.3.1.jar to /tmp/fetchFileTemp3473404556989515218.tmp 14/05/14 20:36:02 INFO Executor: Adding file:/home/wxhsdp/spark/spark/tags/v1.0.0-rc3/work/app-20140514203557-/0/./breeze-macros_2.10-0.3.1.jar to class loader 14/05/14 20:36:02 INFO Executor: Fetching http://192.168.0.106:42883/jars/build-project_2.10-1.0.jar with timestamp 1400070956753 14/05/14 20:36:02 INFO Utils: Fetching http://192.168.0.106:42883/jars/build-project_2.10-1.0.jar to /tmp/fetchFileTemp1289055585501269156.tmp 14/05/14 20:36:02 INFO Executor: Adding file:/home/wxhsdp/spark/spark/tags/v1.0.0-rc3/work/app-20140514203557-/0/./build-project_2.10-1.0.jar to class loader 14/05/14 20:36:02 INFO Executor: Fetching http://192.168.0.106:42883/jars/breeze_2.10-0.7.jar with timestamp 1400070957228 14/05/14 20:36:02 INFO Utils: Fetching http://192.168.0.106:42883/jars/breeze_2.10-0.7.jar to /tmp/fetchFileTemp1287317286108432726.tmp 14/05/14 20:36:02 INFO Executor: Adding file:/home/wxhsdp/spark/spark/tags/v1.0.0-rc3/work/app-20140514203557-/0/./breeze_2.10-0.7.jar to class loader DB Tsai-2 wrote Since the breeze jar is brought into spark by mllib package, you may want to add mllib as your dependency in spark 1.0. For bring it from your application yourself, you can either use sbt assembly in ur build project to generate a flat myApp-assembly.jar which contains breeze jar, or use spark add jar api like Yadid said. 
Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, May 4, 2014 at 10:24 PM, wxhsdp lt; wxhsdp@ gt; wrote: Hi, DB, i think it's something related to sbt publishLocal if i remove the breeze dependency in my sbt file, breeze can not be found [error] /home/wxhsdp/spark/example/test/src/main/scala/test.scala:5: not found: object breeze [error] import breeze.linalg._ [error]^ here's my sbt file: name := Build Project version := 1.0 scalaVersion := 2.10.4 libraryDependencies += org.apache.spark %% spark-core % 1.0.0-SNAPSHOT resolvers += Akka Repository at http://repo.akka.io/releases/; i run sbt publishLocal on the Spark tree. but if i manully put spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar in /lib directory, sbt package is ok, i can run my app in workers without addJar what's the difference between add dependency in sbt after sbt publishLocal and manully put spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar in /lib directory? why can i run my app in worker without addJar this time? DB Tsai-2 wrote If you add the breeze dependency in your build.sbt project, it will not be available to all the workers. There are couple options, 1) use sbt assembly to package breeze into your application jar. 2) manually copy breeze jar into all the nodes, and have them in the classpath. 3) spark 1.0 has breeze jar in the spark flat assembly jar, so you don't need to add breeze dependency
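A sketch of DB's first suggestion (depend on mllib, which already brings breeze in transitively) rather than adding individual breeze jars by hand; the version string assumes the locally published 1.0.0 snapshot discussed in this thread.

// build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.0.0-SNAPSHOT",
  "org.apache.spark" %% "spark-mllib" % "1.0.0-SNAPSHOT"
)

With that in place, the breeze classes come from the same binaries the cluster runs, so shipping individual breeze jars with addJar should no longer be necessary.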
Re: Packaging a spark job using maven
Hi, Thanks François, but this didn't change much. I'm not even sure what this reference.conf is; it isn't mentioned anywhere in the Spark documentation. Should I have one in my resources? Thanks Laurent
Re: Spark LIBLINEAR
Hi Professor Lin, On our internal datasets, I am getting accuracy at par with glmnet-R for sparse feature selection from liblinear. The default mllib based gradient descent was way off. I did not tune learning rate but I run with varying lambda. Ths feature selection was weak. I used liblinear code. Next I will explore the distributed liblinear. Adding the code on github will definitely help for collaboration. I am experimenting if a bfgs / owlqn based sparse logistic in spark mllib give us accuracy at par with liblinear. If liblinear solver outperforms them (either accuracy/performance) we have to bring tron to mllib and let other algorithms benefit from it as well. We are using Bfgs and Owlqn solvers from breeze opt. Thanks. Deb On May 12, 2014 9:07 PM, DB Tsai dbt...@stanford.edu wrote: It seems that the code isn't managed in github. Can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/spark/spark-liblinear-1.94.zip It will be easier to track the changes in github. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, May 12, 2014 at 7:53 AM, Xiangrui Meng men...@gmail.com wrote: Hi Chieh-Yen, Great to see the Spark implementation of LIBLINEAR! We will definitely consider adding a wrapper in MLlib to support it. Is the source code on github? Deb, Spark LIBLINEAR uses BSD license, which is compatible with Apache. Best, Xiangrui On Sun, May 11, 2014 at 10:29 AM, Debasish Das debasish.da...@gmail.com wrote: Hello Prof. Lin, Awesome news ! I am curious if you have any benchmarks comparing C++ MPI with Scala Spark liblinear implementations... Is Spark Liblinear apache licensed or there are any specific restrictions on using it ? Except using native blas libraries (which each user has to manage by pulling in their best proprietary BLAS package), all Spark code is Apache licensed. Thanks. Deb On Sun, May 11, 2014 at 3:01 AM, DB Tsai dbt...@stanford.edu wrote: Dear Prof. Lin, Interesting! We had an implementation of L-BFGS in Spark and already merged in the upstream now. We read your paper comparing TRON and OWL-QN for logistic regression with L1 (http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf), but it seems that it's not in the distributed setup. Will be very interesting to know the L2 logistic regression benchmark result in Spark with your TRON optimizer and the L-BFGS optimizer against different datasets (sparse, dense, and wide, etc). I'll try your TRON out soon. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, May 11, 2014 at 1:49 AM, Chieh-Yen r01944...@csie.ntu.edu.tw wrote: Dear all, Recently we released a distributed extension of LIBLINEAR at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/ Currently, TRON for logistic regression and L2-loss SVM is supported. We provided both MPI and Spark implementations. This is very preliminary so your comments are very welcome. Thanks, Chieh-Yen
RE: How to use Mahout VectorWritable in Spark.
Hi Xiangrui, Thanks for the response .. I tried few ways to include mahout-math jar while launching Spark shell.. but no success.. Can you please point what I am doing wrong 1. mahout-math.jar exported in CLASSPATH, and PATH 2. Tried Launching Spark Shell by : MASTER=spark://HOSTNAME:PORT ADD_JARS=~/installations/work-space/mahout-math-0.7.jar park-0.9.0/bin/spark-shell After launching, I checked the environment details on WebUi: It looks like mahout-math jar is included. spark.jars /home/hduser/installations/work-space/mahout-math-0.7.jar Then I try : scala import org.apache.mahout.math.VectorWritable console:10: error: object mahout is not a member of package org.apache import org.apache.mahout.math.VectorWritable scala val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable]) console:12: error: not found: type Text val data = sc.sequenceFile(/stuti/ML/Clustering/KMeans/HAR/KMeans_dataset_seq/part-r-0, classOf[Text], classOf[VectorWritable]) ^ Im using Spark 0.9 and Hadoop 1.0.4 and Mahout 0.7 Thanks Stuti -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Wednesday, May 14, 2014 11:56 AM To: user@spark.apache.org Subject: Re: How to use Mahout VectorWritable in Spark. You need val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWriteable]) to load the data. After that, you can do val data = raw.values.map(_.get) To get an RDD of mahout's Vector. You can use `--jar mahout-math.jar` when you launch spark-shell to include mahout-math. Best, Xiangrui On Tue, May 13, 2014 at 10:37 PM, Stuti Awasthi stutiawas...@hcl.com wrote: Hi All, I am very new to Spark and trying to play around with Mllib hence apologies for the basic question. I am trying to run KMeans algorithm using Mahout and Spark MLlib to see the performance. Now initial datasize was 10 GB. Mahout converts the data in Sequence File Text,VectorWritable which is used for KMeans Clustering. The Sequence File crated was ~ 6GB in size. Now I wanted if I can use the Mahout Sequence file to be executed in Spark MLlib for KMeans . I have read that SparkContext.sequenceFile may be used here. Hence I tried to read my sequencefile as below but getting the error : Command on Spark Shell : scala val data = sc.sequenceFile[String,VectorWritable](/ KMeans_dataset_seq/part-r-0,String,VectorWritable) console:12: error: not found: type VectorWritable val data = sc.sequenceFile[String,VectorWritable]( /KMeans_dataset_seq/part-r-0,String,VectorWritable) Here I have 2 ques: 1. Mahout has “Text” as Key but Spark is printing “not found: type:Text” hence I changed it to String.. Is this correct ??? 2. How will VectorWritable be found in Spark. Do I need to include Mahout jar in Classpath or any other option ?? Please Suggest Regards Stuti Awasthi ::DISCLAIMER:: -- -- The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only. E-mail transmission is not guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or may contain viruses in transmission. The e mail and its contents (with or without referred errors) shall therefore not attach any liability on the originator or HCL or its affiliates. Views or opinions, if any, presented in this email are solely those of the author and may not necessarily reflect the views or opinions of HCL or its affiliates. 
saveAsTextFile with replication factor in HDFS
Hi, Can we override the default file-replication factor when using saveAsTextFile() to HDFS? My default replication factor is 1, but the intermediate files that I want to put in HDFS while running a Spark query need not be replicated, so is there a way? Thanks !
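One commonly suggested approach, sketched below under the assumption that the output goes to HDFS: set dfs.replication on the SparkContext's Hadoop configuration before the save. Note that it affects every subsequent Hadoop write from this context, so set it back to the cluster default afterwards if later output should be replicated normally. intermediateRdd and the path are placeholders.

sc.hadoopConfiguration.set("dfs.replication", "1")   // files written below get replication 1

intermediateRdd.saveAsTextFile("hdfs:///tmp/intermediate-output")

sc.hadoopConfiguration.set("dfs.replication", "3")   // restore the usual default afterwards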
Worker re-spawn and dynamic node joining
Hi all, Just 2 questions: 1. Is there a way to automatically re-spawn Spark workers? We've had situations where an executor OOM causes the worker process to be DEAD, and it does not come back automatically. 2. How can we dynamically add (or remove) worker machines to (from) the cluster? We'd like to leverage the auto-scaling group in EC2, for example. We're using spark-standalone. Thanks a lot. -- *JU Han* Data Engineer @ Botify.com +33 061960
Re: Spark unit testing best practices
There's an undocumented mode that looks like it simulates a cluster. From SparkContext.scala:

// Regular expression for simulating a Spark cluster of [N, cores, memory] locally
val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r

Can you try running your tests with a master URL of local-cluster[2,2,512] to see if that exercises serialization? On Wed, May 14, 2014 at 3:34 AM, Andras Nemeth andras.nem...@lynxanalytics.com wrote: Hi, Spark's local mode is great for creating simple unit tests for our Spark logic. The disadvantage, however, is that certain types of problems are never exposed in local mode because things never need to be put on the wire. E.g. if I accidentally use a closure which has something non-serializable in it, then my test will happily succeed in local mode but go down in flames on a real cluster. Another example is Kryo: I'd like to use setRegistrationRequired(true) to avoid any hidden performance problems due to forgotten registration, and of course I'd like things to fail in tests. But that won't happen because we never actually need to serialize the RDDs in local mode. So, is there some good solution to the above problems? Is there some local-like mode which simulates serialization as well? Or is there an easy way to start up *from code* a standalone Spark cluster on the machine running the unit test? Thanks, Andras
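A sketch of what such a test could look like, assuming a built Spark distribution is available locally (local-cluster mode launches real worker JVMs, so closures and results genuinely cross process boundaries and serialization problems should surface).

import org.apache.spark.{SparkConf, SparkContext}

// 2 workers, 2 cores each, 512 MB each — mirrors the regex above.
val conf = new SparkConf()
  .setMaster("local-cluster[2,2,512]")
  .setAppName("serialization-smoke-test")

val sc = new SparkContext(conf)
try {
  val result = sc.parallelize(1 to 100, 4).map(_ * 2).reduce(_ + _)
  assert(result == 10100)
} finally {
  sc.stop()
}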
Re: Spark unit testing best practices
Have you actually found this to be true? I have found Spark local mode to be quite good about blowing up if there is something non-serializable and so my unit tests have been great for detecting this. I have never seen something that worked in local mode that didn't work on the cluster because of different serialization requirements between the two. Perhaps it is different when using Kryo On 05/14/2014 04:34 AM, Andras Nemeth wrote: E.g. if I accidentally use a closure which has something non-serializable in it, then my test will happily succeed in local mode but go down in flames on a real cluster.
little confused about SPARK_JAVA_OPTS alternatives
I have some settings that I think are relevant for my application. They are spark.akka settings, so I assume they are relevant for both the executors and my driver program. I used to do: SPARK_JAVA_OPTS=-Dspark.akka.frameSize=1 Now this is deprecated. The alternatives mentioned are:
* some spark-submit settings, which are not relevant to me since I do not use spark-submit (I launch Spark jobs from an existing application)
* spark.executor.extraJavaOptions to set -X options. I am not sure what -X options are, but it doesn't sound like what I need, since it's only for executors
* SPARK_DAEMON_OPTS to set java options for standalone daemons (i.e. master, worker); that sounds like something I should not use, since I am trying to change settings for an app, not a daemon.
Am I missing the correct setting to use? Should I set -Dspark.akka.frameSize=1 on my application launch directly, and then also set spark.executor.extraJavaOptions? So basically repeat it?
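Since the job is launched from existing code rather than spark-submit, one option (a sketch, with an illustrative master URL and frame size) is to put the spark.* property straight onto the SparkConf; properties set there travel with the application to the executors, whereas spark.executor.extraJavaOptions is meant for real JVM flags (the "-X..." options) rather than spark.* settings.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("spark://master-host:7077")     // illustrative
  .set("spark.akka.frameSize", "100")        // MB; use whatever value you previously passed via SPARK_JAVA_OPTS

val sc = new SparkContext(conf)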
Re: Packaging a spark job using maven
I have a similar objective to use maven as our build tool and ran into the same issue. The idea is that your config file is actually not found: your fat jar assembly does not contain the reference.conf resource. I added the following to the resources section of my pom to make it work:

<resource>
  <directory>src/main/resources</directory>
  <includes>
    <include>*.conf</include>
  </includes>
  <targetPath>${project.build.directory}/classes</targetPath>
</resource>

I think Paul's gist also achieves a similar effect by specifying a proper appender in the shading conf. cheers François

On Tue, May 13, 2014 at 4:09 AM, Laurent Thoulon laurent.thou...@ldmobile.net wrote: (I've never actually received my previous mail so I'm resending it. Sorry if it creates a duplicate.) Hi, I'm quite new to spark (and scala) but has anyone ever successfully compiled and run a spark job using java and maven? Packaging seems to go fine, but when I try to execute the job using

mvn package
java -Xmx4g -cp target/jobs-1.4.0.0-jar-with-dependencies.jar my.jobs.spark.TestJob

I get the following error:

Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:115)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:136)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:142)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:150)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:155)
at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:197)
at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:136)
at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:470)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:139)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:47)
at my.jobs.spark.TestJob.run(TestJob.java:56)

Here's the code right up to line 56:

SparkConf conf = new SparkConf()
    .setMaster("local[" + cpus + "]")
    .setAppName(this.getClass().getSimpleName())
    .setSparkHome("/data/spark")
    .setJars(JavaSparkContext.jarOfClass(this.getClass()))
    .set("spark.default.parallelism", String.valueOf(cpus * 2))
    .set("spark.executor.memory", "4g")
    .set("spark.storage.memoryFraction", "0.6")
    .set("spark.shuffle.memoryFraction", "0.3");
JavaSparkContext sc = new JavaSparkContext(conf);

Thanks Regards, Laurent -- François /fly Le Lay Data Infra Chapter Lead NYC +1 (646)-656-0075
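For reference, the shading-based fix mentioned above usually looks something like the snippet below: a maven-shade-plugin transformer that concatenates all reference.conf files (Akka ships its defaults there, including akka.version) into the fat jar. This is a hedged sketch rather than Paul's actual gist; the plugin version and the rest of the pom are omitted.

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- Concatenate every reference.conf on the classpath so akka.version survives shading -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
            <resource>reference.conf</resource>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>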
NotSerializableException in Spark Streaming
Hey all, trying to set up a pretty simple streaming app and getting some weird behavior. First, a non-streaming job that works fine: I'm trying to pull out lines of a log file that match a regex, for which I've set up a function:

def getRequestDoc(s: String): String = {
  "KBDOC-[0-9]*".r.findFirstIn(s).orNull
}

logs = sc.textFile(logfiles)
logs.map(getRequestDoc).take(10)

That works, but I want to run that on the same data, but streaming, so I tried this:

val logs = ssc.socketTextStream("localhost", ...)
logs.map(getRequestDoc).print()
ssc.start()

From this code, I get: 14/05/08 03:32:08 ERROR JobScheduler: Error running job streaming job 1399545128000 ms.0 org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext

But if I do the map function inline instead of calling a separate function, it works:

logs.map("KBDOC-[0-9]*".r.findFirstIn(_).orNull).print()

So why is it able to serialize my little function in regular spark, but not in streaming? Thanks, Diana
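One common workaround, sketched below (not necessarily what Diana ended up doing): move the helper into a standalone object so the closure only references that object and no longer drags in the enclosing class or shell line that holds the StreamingContext. The host and port are illustrative.

object LogParsing extends Serializable {   // Serializable is belt-and-braces; the object is not captured by the closure
  def getRequestDoc(s: String): String =
    "KBDOC-[0-9]*".r.findFirstIn(s).orNull
}

val logs = ssc.socketTextStream("localhost", 9999)   // illustrative host/port
logs.map(LogParsing.getRequestDoc).print()
ssc.start()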
No configuration setting found for key 'akka.zeromq'
Hi all, When I run the ZeroMQWordCount example on the cluster, the worker log says: Caused by: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.zeromq' Actually, I can see that the reference.conf in spark-examples-assembly-0.9.1.jar contains the configuration below. Does anyone know what is happening?

# Akka ZeroMQ Reference Config File
#
# This is the reference config file that contains all the default settings.
# Make your edits/overrides in your application.conf.

akka {
  zeromq {
    # The default timeout for a poll on the actual zeromq socket.
    poll-timeout = 100ms

    # Timeout for creating a new socket
    new-socket-timeout = 5s

    socket-dispatcher {
      # A zeromq socket needs to be pinned to the thread that created it.
      # Changing this value results in weird errors and race conditions within
      # zeromq
      executor = thread-pool-executor
      type = PinnedDispatcher
      thread-pool-executor.allow-core-timeout = off
    }
  }
}

Exception in worker:
akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:218) Caused by: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.zeromq' 14/05/06 21:26:19 ERROR actor.ActorCell: changing Recreate into Create after akka.actor.ActorInitializationException: exception during creation

Thanks, Francis.Hu
Re: How to use spark-submit
I used spark-submit to run the MovieLensALS example from the examples module. Here is the command:

$ spark-submit --master local /home/phoenix/spark/spark-dev/examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop1.0.4.jar --class org.apache.spark.examples.mllib.MovieLensALS u.data

Also, you can check the parameters of spark-submit with:

$ spark-submit --help

Hope this helps!

On Wed, May 7, 2014 at 9:27 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Doesn't the run-example script work for you? Also, are you on the latest commit of branch-1.0? TD

On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Yes, I'm struggling with a similar problem where my classes are not found on the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if someone could provide some documentation on the usage of spark-submit. Thanks

On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com wrote: I have a spark streaming application that uses the external streaming modules (e.g. kafka, mqtt, ..) as well. It is not clear how to properly invoke the spark-submit script: what --driver-class-path and/or -Dspark.executor.extraClassPath parameters are required? For reference, the following error is proving difficult to resolve: java.lang.ClassNotFoundException: org.apache.spark.streaming.examples.StreamingExamples
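[Editor's note, not part of the original message: spark-submit expects its own options before the application jar; everything after the jar is passed as arguments to the application's main class. A sketch of the usual form, with paths and the dataset argument as placeholders:]

$ spark-submit \
    --class org.apache.spark.examples.mllib.MovieLensALS \
    --master local[4] \
    /path/to/spark-examples-assembly.jar \
    u.data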
spark on yarn-standalone throws StackOverflowError and fails sometimes, succeeds the rest of the time
Hi all, My spark code is running on yarn-standalone. The last three lines of the code are as below:

val result = model.predict(prdctpairs)
result.map(x => x.user + "," + x.product + "," + x.rating).saveAsTextFile(output)
sc.stop()

The same code is sometimes able to run successfully and gives the right result, while from time to time it throws a StackOverflowError and fails, and I don't have a clue how I should debug it. Below is the error (the start and end portions, to be exact):

14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-17] MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 44 to sp...@rxx43.mc10.site.net:43885 14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-17] MapOutputTrackerMaster: Size of output statuses for shuffle 44 is 148 bytes 14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-35] MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 45 to sp...@rxx43.mc10.site.net:43885 14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-35] MapOutputTrackerMaster: Size of output statuses for shuffle 45 is 453 bytes 14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-20] MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 44 to sp...@rxx43.mc10.site.net:56767 14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-29] MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 45 to sp...@rxx43.mc10.site.net:56767 14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-29] MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 44 to sp...@rxx43.mc10.site.net:49879 14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-29] MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 45 to sp...@rxx43.mc10.site.net:49879 14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-17] TaskSetManager: Starting task 946.0:17 as TID 146 on executor 6: rx15.mc10.site.net (PROCESS_LOCAL) 14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-17] TaskSetManager: Serialized task 946.0:17 as 6414 bytes in 0 ms 14-05-09 17:55:51 WARN [Result resolver thread-0] TaskSetManager: Lost TID 133 (task 946.0:4) 14-05-09 17:55:51 WARN [Result resolver thread-0] TaskSetManager: Loss was due to java.lang.StackOverflowError java.lang.StackOverflowError at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631) at java.lang.ClassLoader.defineClass(ClassLoader.java:615) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) at java.net.URLClassLoader.access$000(URLClassLoader.java:58) at java.net.URLClassLoader$1.run(URLClassLoader.java:197) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631) at java.lang.ClassLoader.defineClass(ClassLoader.java:615) at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969) at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1848) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1946) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1870) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1946) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1870) 14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-5] TaskSetManager: Starting task 946.0:4 as TID 147 on executor 6: r15.mc10.site.net (PROCESS_LOCAL) 14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-5] TaskSetManager: Serialized task 946.0:4 as 6414 bytes in 0 ms 14-05-09 17:55:51 WARN [Result resolver thread-1] TaskSetManager: Lost TID 139 (task 946.0:10) 14-05-09 17:55:51 INFO [Result resolver thread-1] TaskSetManager: Loss was due to java.lang.StackOverflowError [duplicate 1] 14-05-09 17:55:51 INFO [spark-akka.actor.default-dispatcher-5] CoarseGrainedSchedulerBackend: Executor 4 disconnected, so removing it 14-05-09 17:55:51 ERROR
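[Editor's note, not part of the original thread: with iterative jobs like ALS, a StackOverflowError during task serialization or deserialization is often a symptom of a very long RDD lineage, and the usual mitigation is to truncate the lineage with checkpointing. A minimal sketch, assuming a checkpoint directory the executors can write to; the helper name is made up for illustration:]

import org.apache.spark.rdd.RDD

// Checkpoints must go to a directory visible to all executors (this path is an assumption).
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

// Call this every few iterations on the RDD that keeps being re-derived:
def truncateLineage[T](rdd: RDD[T]): RDD[T] = {
  rdd.checkpoint()  // mark the RDD for checkpointing
  rdd.count()       // force materialization so the checkpoint happens and the lineage is cut
  rdd
}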
Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.
Hi Jacob, Thanks for the helpful answer on the docker question. Have you already experimented with the new link feature in Docker? That does not help the HDFS issue, as the DataNode needs the namenode and vice-versa, but it does facilitate simpler client-server interactions. My issue described at the beginning is related to networking between the host and the docker images, but I was losing too much time tracking down the exact problem, so I moved my Spark job driver onto the mesos node and it started working. Sadly, my Mesos UI is partially crippled as workers are not addressable (therefore spark job logs are hard to gather). Your discussion about dynamic port allocation is very relevant to understanding why some components cannot talk with each other. I'll need to have a more in-depth read of that discussion to find a better solution for my local development environment. regards, Gerard.

On Tue, May 6, 2014 at 3:30 PM, Jacob Eisinger jeis...@us.ibm.com wrote: Howdy, You might find the discussion Andrew and I have been having about Docker and network security [1] applicable. Also, I posted an answer [2] to your stackoverflow question. [1] http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-driver-interacting-with-Workers-in-YARN-mode-firewall-blocking-communication-tp5237p5441.html [2] http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns/23495100#23495100 Jacob D. Eisinger IBM Emerging Technologies jeis...@us.ibm.com - (512) 286-6075

From: Gerard Maas gerard.m...@gmail.com To: user@spark.apache.org Date: 05/05/2014 04:18 PM Subject: Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs. -- Hi Benjamin, Yes, we initially used a modified version of the AmpLabs docker scripts [1]. The amplab docker images are a good starting point. One of the biggest hurdles has been HDFS, which requires reverse-DNS, and I didn't want to go the dnsmasq route, to keep the containers relatively simple to use without the need for external scripts. Ended up running a 1-node setup (namenode + datanode). I'm still looking for a better solution for HDFS [2]. Our use case for docker is to easily create local dev environments both for development and for automated functional testing (using cucumber). My aim is to strongly reduce the time of the develop-deploy-test cycle. That also means that we run the minimum number of instances required to have a functionally working setup, e.g. 1 Zookeeper, 1 Kafka broker, ... For the actual cluster deployment we have a Chef-based devops toolchain that puts things in place on public cloud providers. Personally, I think Docker rocks and would like to replace those complex cookbooks with Dockerfiles once the technology is mature enough. -greetz, Gerard. [1] https://github.com/amplab/docker-scripts [2] http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns

On Mon, May 5, 2014 at 11:00 PM, Benjamin bboui...@gmail.com wrote: Hi, Before considering running on Mesos, did you try to submit the application on Spark deployed without Mesos on Docker containers ?
Currently investigating this idea to quickly deploy a complete set of clusters with Docker, I'm interested in your findings on sharing the settings of Kafka and Zookeeper across nodes. How many brokers and zookeepers do you use? Regards,

On Mon, May 5, 2014 at 10:11 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi all, I'm currently working on creating a set of docker images to facilitate local development with Spark/streaming on Mesos (+zk, hdfs, kafka). After solving the initial hurdles to get things working together in docker containers, now everything seems to start up correctly and the mesos UI shows slaves as they are started. I'm trying to submit a job from IntelliJ and the job submissions seem to get lost in Mesos translation. The logs are not helping me figure out what's wrong, so I'm posting them here in the hope that they ring a bell and somebody could provide me a hint on what's wrong/missing with my setup.

DRIVER (IntelliJ running a Job.scala main)
14/05/05 21:52:31 INFO MetadataCleaner: Ran metadata cleaner for SHUFFLE_BLOCK_MANAGER
14/05/05 21:52:31 INFO BlockManager: Dropping broadcast blocks older than 1399319251962
14/05/05 21:52:31 INFO
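[Editor's note, not part of the original thread: one thing worth ruling out when the driver runs outside the cluster (IntelliJ on the host, executors in containers) is that the executors cannot open connections back to the driver. A hedged sketch of pinning the driver's address and port in the SparkConf so the containers can reach it; the master URL, IP, and port below are placeholders:]

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("mesos://zk://172.17.0.2:2181/mesos")   // placeholder Mesos master
  .setAppName("docker-dev-job")
  .set("spark.driver.host", "172.17.42.1")           // host address routable from the containers (assumption)
  .set("spark.driver.port", "51000")                 // fixed port so it can be exposed explicitly
val sc = new SparkContext(conf)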
Re: master attempted to re-register the worker and then took all workers as unregistered
Hi Cheney, Which mode are you running, YARN or standalone? I got the same exception when I ran spark on YARN.

On Tue, May 6, 2014 at 10:06 PM, Cheney Sun sun.che...@gmail.com wrote: Hi Nan, In the worker's log, I see the following exception thrown when it tries to launch an executor. (The SPARK_HOME is wrongly specified on purpose, so there is no such file /usr/local/spark1/bin/compute-classpath.sh.) After the exception was thrown several times, the worker was requested to kill the executor. Following the killing, the worker tries to register again with the master, but the master rejects the registration with the WARN message "Got heartbeat from unregistered worker worker-20140504140005-host-spark-online001". Looks like the issue wasn't fixed in 0.9.1. Do you know of any pull request addressing this issue? Thanks.

java.io.IOException: Cannot run program /usr/local/spark1/bin/compute-classpath.sh (in directory .): error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:600) at org.apache.spark.deploy.worker.CommandUtils$.buildJavaOpts(CommandUtils.scala:58) at org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:37) at org.apache.spark.deploy.worker.ExecutorRunner.getCommandSeq(ExecutorRunner.scala:104) at org.apache.spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:119) at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:59) Caused by: java.io.IOException: error=2, No such file or directory at java.lang.UNIXProcess.forkAndExec(Native Method) at java.lang.UNIXProcess.init(UNIXProcess.java:135) at java.lang.ProcessImpl.start(ProcessImpl.java:130) at java.lang.ProcessBuilder.start(ProcessBuilder.java:1021) ... 6 more ..

14/05/04 21:35:45 INFO Worker: Asked to kill executor app-20140504213545-0034/18 14/05/04 21:35:45 INFO Worker: Executor app-20140504213545-0034/18 finished with state FAILED message class java.io.IOException: Cannot run program /usr/local/spark1/bin/compute-classpath.sh (in directory .): error=2, No such file or directory 14/05/04 21:35:45 ERROR OneForOneStrategy: key not found: app-20140504213545-0034/18 java.util.NoSuchElementException: key not found: app-20140504213545-0034/18 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:232) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/04 21:35:45 INFO Worker: Starting Spark worker host-spark-online001:7078 with 10 cores, 28.0 GB RAM 14/05/04 21:35:45 INFO Worker: Spark home: /usr/local/spark-0.9.1-cdh4.2.0 14/05/04 21:35:45 INFO WorkerWebUI: Started Worker web UI at http://host-spark-online001:8081 14/05/04 21:35:45 INFO Worker: Connecting to master spark://host-spark-online001:7077...
14/05/04 21:35:45 INFO Worker: Successfully registered with master spark://host-spark-online001:7077
Re: Unable to load native-hadoop library problem
Hello Sophia, You are only providing the Spark jar here (admittedly a Spark jar that contains hadoop libraries in it, but that is not sufficient). Where is your hadoop installed? (Most probably /usr/lib/hadoop/*.) So you need to add that to your classpath (by using -cp), I guess. Let me know if that works. shivani

On Tue, May 6, 2014 at 6:25 PM, Sophia sln-1...@163.com wrote: Hi, everyone,

[root@CHBM220 spark-0.9.1]# SPARK_JAR=.assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar ./bin/spark-class org.apache.spark.deploy.yarn.Client --jar examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3 --master-memory 2g --worker-memory 2g --worker-cores 1
14/05/07 09:05:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/05/07 09:05:14 INFO RMProxy: Connecting to ResourceManager at CHBM220/192.168.10.220:8032

Then it stopped. My hadoop_conf_dir has been configured well; what should I do? Wish you happy everyday. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-load-native-hadoop-library-problem-tp5469.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- Software Engineer Analytics Engineering Team @ Box Mountain View, CA
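[Editor's note, not part of the original thread: following Shivani's suggestion, a hedged sketch of what adding the Hadoop install to the classpath could look like in spark-env.sh; the paths are assumptions for a /usr/lib/hadoop layout, SPARK_CLASSPATH / SPARK_LIBRARY_PATH were the spark-env.sh hooks in the 0.9.x scripts if memory serves, and the NativeCodeLoader warning itself is usually harmless:]

export SPARK_CLASSPATH="/etc/hadoop/conf:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*"
export SPARK_LIBRARY_PATH="/usr/lib/hadoop/lib/native"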
Express VMs - good idea?
Hi, I've wanted to play with Spark. I wanted to fast-track things and just use one of the vendors' express VMs. I've tried Cloudera CDH 5.0 and Hortonworks HDP 2.1. I've not written down all of my issues, but for certain, when I try to run spark-shell it doesn't work. Cloudera seems to crash, and both complain when I try to use SparkContext in a simple Scala command. So, just a basic question on whether anyone has had success getting these express VMs to work properly with Spark *out of the box* (HDP does require you to install Spark manually). I know Cloudera recommends 8GB of RAM, but I've been running it with 4GB. Could it be that 4GB is just not enough and is causing issues, or have others had success using these Hadoop 2.x pre-built VMs with Spark 0.9.x? Marco
Re: instalation de spark
I've forgotten most of my French. You can download a Spark binary or build from source. This is how I build from source:

Download and install sbt: http://www.scala-sbt.org/ (I installed in C:\sbt). Check C:\sbt\conf\sbtconfig.txt and use these options:

-Xmx512M
-XX:MaxPermSize=256m
-XX:ReservedCodeCacheSize=128m

Download and install git: http://git-scm.com/downloads and make sure git is in your PATH. Download and untar Spark into some directory, e.g. C:\spark-0.9.1, then:

cd C:\spark-0.9.1
C:\bin\sbt assembly

(do something else, this takes some time)

Enjoy!

bin\spark-shell.cmd

This assumes you have Java installed; I have Java 7. You can install Scala 2.10.x for Scala development. I have Python 2.7.6 (I think) for pySpark. I use the ScalaIDE Eclipse plugin. Let me know how it works out. - Madhu https://www.linkedin.com/in/msiddalingaiah -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/instalation-de-spark-tp5689p5724.html Sent from the Apache Spark User List mailing list archive at Nabble.com.