Status stays at ACCEPTED
Hi, I'm new to Spark and trying to test my first Spark program. I'm running SparkPi successfully in yarn-client mode, but when running the same in yarn mode, the app gets stuck in the ACCEPTED phase. I've spent hours hunting down the reason, but the outcome is always the same. Any hints what to look for next?

cheers, -jan

vagrant@vm-cluster-node1:~$ ./run_pi.sh
14/05/20 06:24:04 INFO RMProxy: Connecting to ResourceManager at vm-cluster-node2/10.211.55.101:8032
14/05/20 06:24:05 INFO Client: Got Cluster metric info from ApplicationsManager (ASM), number of NodeManagers: 2
14/05/20 06:24:05 INFO Client: Queue info ... queueName: root.default, queueCurrentCapacity: 0.0, queueMaxCapacity: -1.0, queueApplicationCount = 0, queueChildQueueCount = 0
14/05/20 06:24:05 INFO Client: Max mem capabililty of a single resource in this cluster 2048
14/05/20 06:24:05 INFO Client: Preparing Local resources
14/05/20 06:24:05 INFO Client: Uploading file:/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/assembly/lib/spark-assembly_2.10-0.9.0-cdh5.0.0-hadoop2.3.0-cdh5.0.0.jar to hdfs://vm-cluster-node2:8020/user/vagrant/.sparkStaging/application_1400563733088_0012/spark-assembly_2.10-0.9.0-cdh5.0.0-hadoop2.3.0-cdh5.0.0.jar
14/05/20 06:24:07 INFO Client: Setting up the launch environment
14/05/20 06:24:07 INFO Client: Setting up container launch context
14/05/20 06:24:07 INFO Client: Command for starting the Spark ApplicationMaster: java -server -Xmx1024m -Djava.io.tmpdir=$PWD/tmp org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.examples.SparkPi --jar /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/assembly/lib/spark-assembly_2.10-0.9.0-cdh5.0.0-hadoop2.3.0-cdh5.0.0.jar --args 'yarn-standalone' --args '10' --worker-memory 500 --worker-cores 1 --num-workers 1 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
14/05/20 06:24:07 INFO Client: Submitting application to ASM
14/05/20 06:24:07 INFO YarnClientImpl: Submitted application application_1400563733088_0012
14/05/20 06:24:08 INFO Client: Application report from ASM: (THIS PART KEEPS REPEATING FOREVER)
	application identifier: application_1400563733088_0012
	appId: 12
	clientToAMToken: null
	appDiagnostics:
	appMasterHost: N/A
	appQueue: root.vagrant
	appMasterRpcPort: -1
	appStartTime: 1400567047343
	yarnAppState: ACCEPTED
	distributedFinalState: UNDEFINED
	appTrackingUrl: http://vm-cluster-node2:8088/proxy/application_1400563733088_0012/
	appUser: vagrant

Log files give me no additional help. The latest log entry just acknowledges the status change:

hadoop-yarn/hadoop-cmf-yarn-RESOURCEMANAGER-vm-cluster-node2.log.out:2014-05-20 06:24:07,347 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1400563733088_0012 State change from SUBMITTED to ACCEPTED

I'm running the example in a local test environment with three virtual nodes in Cloudera (CDH5). Below is run_pi.sh:

#!/bin/bash
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark
export STANDALONE_SPARK_MASTER_HOST=vm-cluster-node2
export SPARK_MASTER_PORT=7077
export DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hadoop
export SPARK_JAR_HDFS_PATH=/user/spark/share/lib/spark-assembly.jar
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib
export SCALA_LIBRARY_PATH=${SPARK_HOME}/lib
export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST
export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP_HOME}
if [ -n "$HADOOP_HOME" ]; then
  export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi
export SPARK_JAR=hdfs://vm-cluster-node2:8020/user/spark/share/lib/spark-assembly.jar
APP_JAR=/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/assembly/lib/spark-assembly_2.10-0.9.0-cdh5.0.0-hadoop2.3.0-cdh5.0.0.jar
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar $APP_JAR \
  --class org.apache.spark.examples.SparkPi \
  --args yarn-standalone \
  --args 10 \
  --num-workers 1 \
  --master-memory 1g \
  --worker-memory 500m \
  --worker-cores 1
Re: combinebykey throw classcastexception
You asked off-list, and provided a more detailed example there:

val random = new Random()
val testdata = (1 to 1).map(_ => (random.nextInt(), random.nextInt()))
sc.parallelize(testdata).combineByKey[ArrayBuffer[Int]](
  (instant: Int) => { new ArrayBuffer[Int]() },
  (bucket: ArrayBuffer[Int], instant: Int) => { bucket += instant },
  (bucket1: ArrayBuffer[Int], bucket2: ArrayBuffer[Int]) => { bucket1 ++= bucket2 }
).collect()

I can't reproduce this with Spark 0.9.0 / CDH5 or Spark 1.0.0 RC9. Your definition looks fine too. (Except that you are dropping the first value, but that's a different problem.)

On Tue, May 20, 2014 at 2:05 AM, xiemeilong xiemeilong...@gmail.com wrote:

I am using CDH5 on a three-machine cluster. I map data from HBase as (String, V) pairs, then call combineByKey like this:

.combineByKey[C](
  (v: V) => new C(v),  // this line throws java.lang.ClassCastException: C cannot be cast to V
  (v: C, v: V) => C,
  (c1: C, c2: C) => C)

I am very confused by this; there isn't any C-to-V casting at all. What's wrong?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/combinebykey-throw-classcastexception-tp6059.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
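For readers following along, the contract combineByKey expects can be sketched without Spark at all. This is a hypothetical plain-Python model (not Spark's implementation) of how the three functions — createCombiner, mergeValue, mergeCombiners — are applied per key, and it also shows why the original example's createCombiner, which ignores its argument, drops the first value:

```python
def combine_by_key(pairs, create_combiner, merge_value, merge_combiners):
    # Toy model of combineByKey semantics over a list of (key, value) pairs.
    # Real Spark applies merge_value within a partition and merge_combiners
    # across partitions; here we simulate a single partition, so
    # merge_combiners is accepted but never invoked.
    combined = {}
    for k, v in pairs:
        if k in combined:
            combined[k] = merge_value(combined[k], v)
        else:
            # The FIRST value for a key goes through create_combiner.
            # A combiner like `lambda v: []` would silently discard it.
            combined[k] = create_combiner(v)
    return combined

# Collect values into lists, the analogue of the ArrayBuffer example above.
data = [(1, 10), (2, 20), (1, 30)]
result = combine_by_key(
    data,
    create_combiner=lambda v: [v],          # keeps the first value
    merge_value=lambda acc, v: acc + [v],
    merge_combiners=lambda a, b: a + b,
)
print(result)  # {1: [10, 30], 2: [20]}
```

The key type never changes in this model, which matches Sean's point: a correct set of combiner functions gives the ClassCastException no opportunity to arise, so the error had to come from somewhere else.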
Re: Status stays at ACCEPTED
Hi Jan,

How much memory capacity is configured for each node? If you go to the ResourceManager web UI, does it indicate any containers are running?

-Sandy

On May 19, 2014, at 11:43 PM, Jan Holmberg jan.holmb...@perigeum.fi wrote:

> Hi, I'm new to Spark and trying to test my first Spark program. I'm running SparkPi successfully in yarn-client mode but when running the same in yarn mode, the app gets stuck in the ACCEPTED phase. [...]
Re: combinebykey throw classcastexception
This issue turned out to be caused by a version mismatch between the driver (0.9.1) and the server (0.9.0-cdh5.0.1). Every other function worked fine before, but combineByKey did not. Thank you very much for your reply.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/combinebykey-throw-classcastexception-tp6060p6087.html
Re: Status stays at ACCEPTED
Hi,

Each node has 4 GB of memory. After a total reboot and re-run of SparkPi, the resource manager shows no running containers and 1 pending container.

-jan

On 20 May 2014, at 10:24, sandy.r...@cloudera.com wrote:

> Hi Jan, How much memory capacity is configured for each node? If you go to the ResourceManager web UI, does it indicate any containers are running? [...]
Re: Worker re-spawn and dynamic node joining
Thank you guys for the detailed answer. Akhil, yes, I would like to have a try of your tool. Is it open-sourced?

2014-05-17 17:55 GMT+02:00 Mayur Rustagi mayur.rust...@gmail.com:

A better way would be to use Mesos (and quite possibly YARN in 1.0.0). That will allow you to add nodes on the fly and leverage them for Spark. Frankly, standalone mode is not meant to handle those issues. That said, we use our deployment tool, as stopping the cluster for adding nodes is not really an issue at the moment.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi

On Sat, May 17, 2014 at 9:05 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Thanks for the info about adding/removing nodes dynamically. That's valuable.

On Friday, May 16, 2014, Akhil Das ak...@sigmoidanalytics.com wrote:

Hi Han :)

1. Is there a way to automatically re-spawn Spark workers? We have situations where an executor OOM causes the worker process to be DEAD, and it does not come back automatically.

=> Yes. You can either add an OOM killer exception (http://backdrift.org/how-to-create-oom-killer-exceptions) on all of your Spark processes, or you can have a cronjob that keeps monitoring your worker processes; if they go down, the cronjob will bring them back.

2. How to dynamically add (or remove) some worker machines to (from) the cluster? We'd like to leverage the auto-scaling group in EC2, for example.

=> You can add/remove worker nodes on the fly by spawning a new machine, adding that machine's IP address on the master node, and then rsyncing the Spark directory to all worker machines, including the one you added. Then you can simply use the *start-all.sh* script on the master node to bring the new worker into action. Removing a worker machine can be done the same way: remove the worker's IP address from the master's *slaves* file, then restart your slaves, and that worker will be removed.

FYI, we have a deployment tool (a web-based UI) that we use for internal purposes. It is built on top of the spark-ec2 script (with some changes) and has a module for adding/removing worker nodes on the fly. It looks like the attached screenshot. If you want, I can give you some access.

Thanks
Best Regards

On Wed, May 14, 2014 at 9:52 PM, Han JU ju.han.fe...@gmail.com wrote:

Hi all,

Just 2 questions:

1. Is there a way to automatically re-spawn Spark workers? We have situations where an executor OOM causes the worker process to be DEAD, and it does not come back automatically.

2. How to dynamically add (or remove) some worker machines to (from) the cluster? We'd like to leverage the auto-scaling group in EC2, for example.

We're using spark-standalone.

Thanks a lot.

--
*JU Han*
Data Engineer @ Botify.com
+33 061960
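The cron-based monitoring Akhil describes can be sketched as a small watchdog script. This is a hypothetical example, not the tool mentioned above; the worker class name matches Spark standalone, but the restart command is commented out because it is site-specific (SPARK_HOME, master URL):

```shell
#!/bin/bash
# Hypothetical watchdog for a Spark standalone worker.
# Run it from cron, e.g.:
#   * * * * * /opt/spark/check_worker.sh >> /var/log/spark-watchdog.log 2>&1
WORKER_CLASS="org.apache.spark.deploy.worker.Worker"

# pgrep -f matches against the full java command line, which contains
# the worker's main class when the process is alive.
if pgrep -f "$WORKER_CLASS" > /dev/null; then
    echo "$(date '+%F %T') worker is running"
else
    echo "$(date '+%F %T') worker is DOWN, restarting"
    # Restart command is deployment-specific; adjust before using:
    # "$SPARK_HOME"/sbin/start-slave.sh spark://master-host:7077
fi
```

The same shape works for monitoring any long-lived JVM process: only the pgrep pattern and the restart line change.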
question about the license of akka and Spark
Hi

I just learned that Akka is under a commercial license; however, Spark is under the Apache license. Is there any problem?

Regards
Re: question about the license of akka and Spark
Akka is under the Apache 2 license too. http://doc.akka.io/docs/akka/snapshot/project/licenses.html

On Tue, May 20, 2014 at 2:16 AM, YouPeng Yang yypvsxf19870...@gmail.com wrote:

Hi

I just learned that Akka is under a commercial license; however, Spark is under the Apache license. Is there any problem?

Regards
Re: question about the license of akka and Spark
The page says Akka is "Open Source and available under the Apache 2 License." It may also be available under another license, but that does not change the fact that it may be used by adhering to the terms of the AL2. The section you point to is referring to commercial support that Typesafe sells. I am not even sure there is a second license; they may merely mean a commercial support contract, since the target page doesn't refer to any other license. That is, it's not a clear dual-license model or anything.

On Tue, May 20, 2014 at 11:15 AM, YouPeng Yang yypvsxf19870...@gmail.com wrote:

Hi

Well, maybe I got the wrong reference: http://doc.akka.io/docs/akka/2.3.2/intro/what-is-akka.html

On that page, the last bold heading, "Commercial Support", suggested to me that Akka is under a commercial license. By the way, the version is 2.3.2.

2014-05-20 17:30 GMT+08:00 Tathagata Das tathagata.das1...@gmail.com:

Akka is under the Apache 2 license too. http://doc.akka.io/docs/akka/snapshot/project/licenses.html [...]
Re: Yarn configuration file doesn't work when run with yarn-client mode
A few more details I would like to provide (sorry, I should have included these in the previous post):

- Spark version = 0.9.1 (using pre-built spark-0.9.1-bin-hadoop2)
- Hadoop version = 2.4.0 (Hortonworks)
- I am trying to execute a Spark Streaming program

Because I am using Hortonworks Hadoop (HDP), YARN is configured with different port numbers than Apache's defaults. For example, yarn.resourcemanager.address is IP:8050 in HDP, whereas it defaults to IP:8032. When I run the Spark examples using bin/run-example, I can see in the console logs that it is connecting to the right port configured by HDP, i.e., 8050. Please refer to the console log below:

[root@host spark-0.9.1-bin-hadoop2]# SPARK_YARN_MODE=true SPARK_JAR=assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar SPARK_YARN_APP_JAR=examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar bin/run-example org.apache.spark.examples.HdfsTest yarn-client /user/root/test
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/spark-0.9.1-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-0.9.1-bin-hadoop2/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/05/20 06:55:29 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/05/20 06:55:29 INFO Remoting: Starting remoting
14/05/20 06:55:29 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@IP:60988]
14/05/20 06:55:29 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@IP:60988]
14/05/20 06:55:29 INFO spark.SparkEnv: Registering BlockManagerMaster
14/05/20 06:55:29 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140520065529-924f
14/05/20 06:55:29 INFO storage.MemoryStore: MemoryStore started with capacity 4.2 GB.
14/05/20 06:55:29 INFO network.ConnectionManager: Bound socket to port 35359 with id = ConnectionManagerId(IP,35359)
14/05/20 06:55:29 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/05/20 06:55:29 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager IP:35359 with 4.2 GB RAM
14/05/20 06:55:29 INFO storage.BlockManagerMaster: Registered BlockManager
14/05/20 06:55:29 INFO spark.HttpServer: Starting HTTP Server
14/05/20 06:55:29 INFO server.Server: jetty-7.x.y-SNAPSHOT
14/05/20 06:55:29 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:59418
14/05/20 06:55:29 INFO broadcast.HttpBroadcast: Broadcast server started at http://IP:59418
14/05/20 06:55:29 INFO spark.SparkEnv: Registering MapOutputTracker
14/05/20 06:55:29 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-fc34fdc8-d940-420b-b184-fc7a8a65501a
14/05/20 06:55:29 INFO spark.HttpServer: Starting HTTP Server
14/05/20 06:55:29 INFO server.Server: jetty-7.x.y-SNAPSHOT
14/05/20 06:55:29 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:53425
14/05/20 06:55:29 INFO server.Server: jetty-7.x.y-SNAPSHOT
14/05/20 06:55:29 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage/rdd,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/stage,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/pool,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/environment,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/executors,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/metrics/json,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/static,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/,null}
14/05/20 06:55:29 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
14/05/20 06:55:29 INFO ui.SparkUI: Started Spark Web UI at http://IP:4040
14/05/20 06:55:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/05/20 06:55:29 INFO spark.SparkContext: Added JAR /usr/local/spark-0.9.1-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar at http://IP:53425/jars/spark-examples_2.10-assembly-0.9.1.jar with timestamp 1400586929921
14/05/20 06:55:30 INFO client.RMProxy: Connecting to ResourceManager at IP:8050
14/05/20 06:55:30 INFO yarn.Client: Got Cluster metric info from ApplicationsManager (ASM), number of
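For readers unfamiliar with where the HDP override picked up in the log above lives: it is set in yarn-site.xml on the cluster. A sketch of the relevant property, with the host name as a placeholder (8050 taken from the log, 8032 being the stock Apache default):

```xml
<!-- yarn-site.xml: HDP ships the ResourceManager on port 8050
     instead of the Apache default 8032. -->
<property>
  <name>yarn.resourcemanager.address</name>
  <value>resourcemanager-host:8050</value>
</property>
```

A client only connects to the right port if this file is on its classpath (e.g. via HADOOP_CONF_DIR or YARN_CONF_DIR), which is exactly what the yarn-client-mode question in this thread hinges on.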
Re: Status stays at ACCEPTED
Still the same. I increased the memory of the node holding the resource manager to 5 GB. I also spotted an HDFS alert about a replication factor of 3, which I have now dropped to the number of data nodes. I also shut down all services not in use. Still the issue remains.

I have noticed the following two events that are fired when I start the Spark run:

Zookeeper: caught end of stream exception
Yarn: The specific max attempts: 0 for application: 1 is invalid, because it is out of the range [1, 2]. Use the global max attempts

-jan

On 20 May 2014, at 11:14, Jan Holmberg jan.holmb...@perigeum.fi wrote:

> Hi, each node has 4 GB of memory. After a total reboot and re-run of SparkPi, the resource manager shows no running containers and 1 pending container. [...]
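As context for this thread: an application that sits in ACCEPTED with one pending container typically means YARN has queued the ApplicationMaster but cannot find a NodeManager with enough free resources to launch it. The cluster-side knobs live in yarn-site.xml; the values below are illustrative assumptions, not read from this cluster (except the 2048 MB maximum, which appears in the Client log above):

```xml
<!-- yarn-site.xml: memory YARN may hand out on each NodeManager.
     The AM here needs 1024 MB heap plus overhead, so these limits
     must leave room for at least one container of that size. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
```

If the requested container exceeds yarn.scheduler.maximum-allocation-mb, or no single NodeManager has that much unreserved memory, the request stays pending indefinitely, which matches the symptoms described in this thread.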
Re: life of an executor
If they are tied to the SparkContext, then why can the subprocess not be started up with the extra jars (sc.addJars) already on the classpath? That way, a switch like user-jars-first would be a simple rearranging of the classpath for the subprocess, and the messing with classloaders that is currently done in Executor (which forces people to use reflection in certain situations, and is broken if you want user jars first) would be history.

On May 20, 2014 1:07 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

They're tied to the SparkContext (application) that launched them.

Matei

On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote:

From looking at the source code, I see executors run in their own JVM subprocesses. How long do they live for? As long as the worker/slave? Or are they tied to the SparkContext and live/die with it?

thx
Re: life of an executor
Just for my clarification: off-heap cannot be Java objects, correct? So we are always talking about serialized off-heap storage?

On May 20, 2014 1:27 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

That's one of the main motivations for using Tachyon ;) http://tachyon-project.org/ It gives off-heap in-memory caching. And starting with Spark 0.9, you can cache any RDD in Tachyon just by specifying the appropriate StorageLevel.

TD

On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.com wrote:

I guess it needs to be this way to benefit from caching of RDDs in memory. It would be nice, however, if the RDD cache could be dissociated from the JVM heap, so that in cases where garbage collection is difficult to tune, one could choose to discard the JVM and run the next operation in a fresh one.

On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

They're tied to the SparkContext (application) that launched them. [...]
Ignoring S3 0 files exception
Hi, I'm trying to get data from S3 using sc.textFile("s3n://" + filenamePattern). It seems that if a pattern gives no result, I get an exception like so: org.apache.hadoop.mapred.InvalidInputException: Input Pattern s3n://bucket/20140512/* matches 0 files at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208) I'm actually doing a union of multiple RDDs, so some may have data in them but some won't. Is there any way to say ignore empty patterns, so that it can work with the RDDs that actually found files? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Ignoring-S3-0-files-exception-tp6101.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
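A hedged sketch of the usual workaround (all names below are mine, not Spark API): since sc.textFile is lazy, the "matches 0 files" exception only surfaces when the job runs, so the safest fix is to pre-filter the patterns before building any RDDs. With Hadoop's FileSystem API the check would be something like Option(fs.globStatus(new Path(p))).exists(_.nonEmpty) — verify that against your Hadoop version; the filtering step itself is just:

```scala
object PatternFilter {
  // Keep only the glob patterns that match at least one file.
  // `matchesSomething` stands in for the Hadoop FileSystem.globStatus
  // check described above; here it is any String => Boolean predicate.
  def nonEmptyPatterns(patterns: Seq[String])(matchesSomething: String => Boolean): Seq[String] =
    patterns.filter(matchesSomething)
}
```

The surviving patterns can then be loaded and unioned as before, and patterns that found nothing never produce an RDD at all.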
Re: Advanced log processing
Thanks for the advice. I think you're right. I'm not sure we're going to use HBase, but starting by partitioning the data into multiple buckets will be a first step. I'll see how it performs on large datasets. My original question though was more like: is there a Spark trick I don't know about? Currently here's what I'm doing:

JavaPairRDD originalData = ...;
JavaPairRDD incompleteData = originalData
    .filter(KeepIncompleteData)
    .map(CleanData)
    .cache();
List pathList = incompleteData
    .flatMap(GetPossibleConciliationPaths)
    .distinct()
    .collect();
JavaPairRDD conciliationRDD = null;
for (String filePath : pathList) {
    JavaPairRDD fileData = sc
        .textFile(filePath)
        .flatMap(ProcessData);
    if (conciliationRDD == null) {
        conciliationRDD = fileData;
    } else {
        conciliationRDD = conciliationRDD.union(fileData);
    }
}
JavaPairRDD finalData = originalData.filter(KeepCompleteData)
    .union(conciliationRDD.join(incompleteData));
finalData.saveAsTextFile(dir);

The collect part is what's frightening me the most, as there may be a lot of different paths. Does that seem fine? Would an approach with HBase allow me to simply join the incomplete data with the stored state using a key? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Advanced-log-processing-tp5743p6102.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
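One Spark trick that may help here (hedged; verify against your Spark version): sc.textFile accepts a comma-separated list of paths, so the collect-then-union loop can collapse into a single call. Only the path handling is runnable below; `PathJoin` is my own name, and the textFile call itself needs a live SparkContext.

```scala
object PathJoin {
  // Join the collected conciliation paths into the comma-separated
  // form that sc.textFile accepts, dropping duplicates first.
  def joined(paths: Seq[String]): String = paths.distinct.mkString(",")
}
// With Spark (sketch, not verified here):
//   val conciliationRDD = sc.textFile(PathJoin.joined(pathList)).flatMap(ProcessData)
```

This removes the repeated union calls but not the collect itself; the driver still has to hold the full path list in memory.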
reading large XML files
We are trying to read some large GraphML files to use in spark. Is there an easy way to read XML-based files like this that accounts for partition boundaries and the like? Thanks, Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com
Re: issue with Scala, Spark and Akka
This error message says it can't find the config for the Akka subsystem. That is typically included in the Spark assembly. First, build your Spark distro by running sbt/sbt assembly in the SPARK_HOME dir. Then point SPARK_HOME (through env or configuration) at that dir. See running a standalone app here: http://spark.apache.org/docs/latest/quick-start.html -kr, Gerard. On Tue, May 20, 2014 at 5:01 PM, Greg g...@zooniverse.org wrote: Hi, I have the following Scala code:
===---
import org.apache.spark.SparkContext

class test {
  def main() {
    val sc = new SparkContext("local", "Scala Word Count")
  }
}
===---
and the following build.sbt file
===---
name := "test"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "1.0.4"

libraryDependencies += "org.mongodb" % "mongo-java-driver" % "2.11.4"

libraryDependencies += "org.mongodb" % "mongo-hadoop-core" % "1.0.0"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
===---
I get the following error: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version' at com.typesafe.config.impl.SimpleConfig.findKey(test.sc3587202794988350330.tmp:111) at com.typesafe.config.impl.SimpleConfig.find(test.sc3587202794988350330.tmp:132) Any suggestions on how to fix this? thanks, Greg -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issue-with-Scala-Spark-and-Akka-tp6103.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: filling missing values in a sequence
Xiangrui, Thanks for the pointer. I think it should work... for now I did cook up my own which is similar, but on top of the Spark core APIs. I would suggest moving the sliding window RDD to the core Spark library. It seems quite general to me, and a cursory look at the code indicates nothing specific to machine learning. Mohit. On Mon, May 19, 2014 at 10:13 PM, Xiangrui Meng men...@gmail.com wrote: Actually there is a sliding method implemented in mllib.rdd.RDDFunctions. Since this is not for general use cases, we didn't include it in spark-core. You can take a look at the implementation there and see whether it fits. -Xiangrui On Mon, May 19, 2014 at 10:06 PM, Mohit Jaggi mohitja...@gmail.com wrote: Thanks Sean. Yes, your solution works :-) I did oversimplify my real problem, which has other parameters that go along with the sequence. On Fri, May 16, 2014 at 3:03 AM, Sean Owen so...@cloudera.com wrote: Not sure if this is feasible, but this literally does what I think you are describing: sc.parallelize(rdd1.first to rdd1.last) On Tue, May 13, 2014 at 4:56 PM, Mohit Jaggi mohitja...@gmail.com wrote: Hi, I am trying to find a way to fill in missing values in an RDD. The RDD is a sorted sequence. For example, (1, 2, 3, 5, 8, 11, ...) I need to fill in the missing numbers and get (1,2,3,4,5,6,7,8,9,10,11). One way to do this is to slide and zip:

rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11, ...))
x = rdd1.first
rdd2 = rdd1 filter (_ != x)
rdd3 = rdd2 zip rdd1
rdd4 = rdd3 flatMap { case (x, y) => /* generate missing elements between x and y */ }

Another method, which I think is more efficient, is to use mapPartitions() on rdd1 to be able to iterate on elements of rdd1 in each partition. However, that leaves the boundaries of the partitions unfilled. Is there a way, within the function passed to mapPartitions, to read the first element in the next partition? The latter approach also appears to work for a general sliding window calculation on the RDD. 
The former technique requires a lot of sliding and zipping and I believe it is not efficient. If only I could read the next partition... I have tried passing a pointer to rdd1 to the function passed to mapPartitions, but the rdd1 pointer turns out to be NULL, I guess because Spark cannot deal with a mapper calling another mapper (since it happens on a worker, not the driver). Mohit.
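For what it's worth, the zip approach is easy to check on a plain collection before moving it to an RDD. A minimal sketch (plain Scala, no Spark; `GapFill` is my own name): pair each element with its successor and emit the half-open range between them, then append the last element — this is the rdd2 zip rdd1 step described in the thread.

```scala
object GapFill {
  // Fill missing integers in a sorted sequence: each (a, b) successor
  // pair contributes the half-open range a until b, and the last
  // element is appended at the end.
  def fill(xs: List[Int]): List[Int] = xs match {
    case Nil => Nil
    case _   => xs.zip(xs.tail).flatMap { case (a, b) => a until b } :+ xs.last
  }
}
```

On an RDD the same pairing needs the shifted copy to line up partition-wise, which is exactly the boundary problem discussed above.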
Re: life if an executor
One issue is that new jars can be added during the lifetime of a SparkContext, which can mean after executors are already started. Off-heap storage is always serialized, correct. On Tue, May 20, 2014 at 6:48 AM, Koert Kuipers ko...@tresata.com wrote: just for my clarification: off heap cannot be java objects, correct? so we are always talking about serialized off-heap storage? On May 20, 2014 1:27 AM, Tathagata Das tathagata.das1...@gmail.com wrote: That's one of the main motivations in using Tachyon ;) http://tachyon-project.org/ It gives off-heap in-memory caching. And starting Spark 0.9, you can cache any RDD in Tachyon just by specifying the appropriate StorageLevel. TD On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.com wrote: I guess it needs to be this way to benefit from caching of RDDs in memory. It would be nice however if the RDD cache could be dissociated from the JVM heap, so that in cases where garbage collection is difficult to tune, one could choose to discard the JVM and run the next operation in a fresh one. On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia matei.zaha...@gmail.com wrote: They’re tied to the SparkContext (application) that launched them. Matei On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: from looking at the source code i see executors run in their own jvm subprocesses. how long do they live for? as long as the worker/slave? or are they tied to the sparkcontext and live/die with it? thx
Re: reading large XML files
Try sc.wholeTextFiles(). It reads the entire file into a string record. -Xiangrui On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: We are trying to read some large GraphML files to use in spark. Is there an easy way to read XML-based files like this that accounts for partition boundaries and the like? Thanks, Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com
Evaluating Spark just for Cluster Computing
Hi - We have a use case for batch processing for which we are trying to figure out if Apache Spark would be a good fit or not. We have a universe of identifiers sitting in an RDBMS, for which we need to fetch input data from the RDBMS, pass that input to analytical models that generate some output numbers, and store those back to the database. This is one unit of work for us. So basically we are looking at whether we can do this processing in parallel across the universe of identifiers that we have. All the data is in the RDBMS and is not sitting in a file system. Can we use Spark for this kind of work, and would it be a good fit? Thanks for your help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Evaluating-Spark-just-for-Cluster-Computing-tp6110.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
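If Spark were used for this, the shape of the job is roughly: parallelize the universe of identifiers, and let each task pull its inputs from the RDBMS, run the model, and write results back. A minimal sketch of that unit of work (all names here are my assumptions; the three functions stand in for the JDBC reads, the analytical model, and the write-back):

```scala
object BatchUnit {
  // Run the per-identifier unit of work over a slice of identifiers,
  // the way one Spark task would process one partition. Returns the
  // number of identifiers processed.
  def process[I, In, Out](ids: Seq[I])(loadInput: I => In)
                         (runModel: In => Out)
                         (saveOutput: (I, Out) => Unit): Int = {
    ids.foreach(id => saveOutput(id, runModel(loadInput(id))))
    ids.size
  }
}
```

With Spark this would become something like sc.parallelize(ids, numPartitions).foreachPartition(...) with one database connection per partition; whether it is a good fit still depends on how heavy the models are and how much concurrent load the RDBMS can take.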
Re: reading large XML files
Unfortunately, I don't have a bunch of moderately big xml files; I have one, really big file - big enough that reading it into memory as a single string is not feasible. On Tue, May 20, 2014 at 1:24 PM, Xiangrui Meng men...@gmail.com wrote: Try sc.wholeTextFiles(). It reads the entire file into a string record. -Xiangrui On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: We are trying to read some large GraphML files to use in spark. Is there an easy way to read XML-based files like this that accounts for partition boundaries and the like? Thanks, Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com
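One approach worth trying (an assumption to verify, not something confirmed in this thread): Hadoop's TextInputFormat can be given a custom record delimiter via the textinputformat.record.delimiter property — e.g. the closing tag of the element you care about — so each record Spark hands back is one element even when the file is split across partitions. The record extraction itself, shown here on a plain string with a hypothetical helper:

```scala
object XmlRecords {
  // Pull every <tag ...>...</tag> element out of an XML body. (?s)
  // lets the non-greedy match span newlines; this mimics what a
  // delimiter-based input format would hand back per record.
  def records(xml: String, tag: String): List[String] =
    s"(?s)<$tag\\b.*?</$tag>".r.findAllIn(xml).toList
}
```

This only works when the element of interest is not nested inside itself (true for GraphML node/edge elements), and a regex is a stand-in here — in Spark the splitting would be done by the input format, not in driver memory.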
Spark and Hadoop
I'm a first time user and need to try just the hello world kind of program in spark. Now on downloads page, I see following 3 options for Pre-built packages that I can download: - Hadoop 1 (HDP1, CDH3) - CDH4 - Hadoop 2 (HDP2, CDH5) I'm confused about which one I need to download. I need to try just the hello world kind of program which has nothing to do with Hadoop. Or is it that I can only use Spark with Hadoop? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Hadoop-tp6113.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark and Hadoop
Hi Puneet, If you're not going to read/write data in HDFS from your Spark cluster, then it doesn't matter which one you download. Just go with Hadoop 2, as that's the newer API line and more likely to match an HDFS cluster you connect to in the future, if you ever do decide to use HDFS. Cheers, Andrew On Tue, May 20, 2014 at 10:43 AM, pcutil puneet.ma...@gmail.com wrote: I'm a first time user and need to try just the hello world kind of program in spark. Now on downloads page, I see following 3 options for Pre-built packages that I can download: - Hadoop 1 (HDP1, CDH3) - CDH4 - Hadoop 2 (HDP2, CDH5) I'm confused which one do I need to download. I need to try just the hello world kind of program which has nothing to do with Hadoop. Or is it that I can only use Spark with Hadoop? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Hadoop-tp6113.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark and Hadoop
You can download any of them, I would go with the latest versions, or just download the source and build it yourself. For experimenting with basic things you can just launch the REPL and start right away in spark local mode not using any hadoop stuff. 2014-05-20 19:43 GMT+02:00 pcutil puneet.ma...@gmail.com: I'm a first time user and need to try just the hello world kind of program in spark. Now on downloads page, I see following 3 options for Pre-built packages that I can download: - Hadoop 1 (HDP1, CDH3) - CDH4 - Hadoop 2 (HDP2, CDH5) I'm confused which one do I need to download. I need to try just the hello world kind of program which has nothing to do with Hadoop. Or is it that I can only use Spark with Hadoop? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Hadoop-tp6113.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.
Howdy Gerard, Yeah, the docker link feature seems to work well for client-server interaction. But, peer-to-peer architectures need more for service discovery. As for your addressing requirements, I don't completely understand what you are asking for... you may also want to check out xip.io. Their wildcard domains sometimes make for an easy, neat hack. Finally, for the ports, Docker's new host networking [1] feature helps out with making a Spark Docker container. (Security is still an issue.) Jacob [1] http://blog.docker.io/2014/05/docker-0-11-release-candidate-for-1-0/ Jacob D. Eisinger IBM Emerging Technologies jeis...@us.ibm.com - (512) 286-6075 From: Gerard Maas gerard.m...@gmail.com To: user@spark.apache.org Date: 05/16/2014 10:26 AM Subject: Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs. Hi Jacob, Thanks for the answer on the docker question. Have you already experimented with the new link feature in Docker? That does not help the HDFS issue, as the DataNode needs the namenode and vice-versa, but it does facilitate simpler client-server interactions. My issue described at the beginning is related to networking between the host and the docker images, but I was losing too much time tracking down the exact problem, so I moved my Spark job driver into the mesos node and it started working. Sadly, my Mesos UI is partially crippled as workers are not addressable (therefore spark job logs are hard to gather). Your discussion about dynamic port allocation is very relevant to understanding why some components cannot talk with each other. I'll need to have a more in-depth read of that discussion to find a better solution for my local development environment. regards, Gerard. On Tue, May 6, 2014 at 3:30 PM, Jacob Eisinger jeis...@us.ibm.com wrote: Howdy, You might find the discussion Andrew and I have been having about Docker and network security [1] applicable. Also, I posted an answer [2] to your stackoverflow question. 
[1] http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-driver-interacting-with-Workers-in-YARN-mode-firewall-blocking-communication-tp5237p5441.html [2] http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns/23495100#23495100 Jacob D. Eisinger IBM Emerging Technologies jeis...@us.ibm.com - (512) 286-6075 From: Gerard Maas gerard.m...@gmail.com To: user@spark.apache.org Date: 05/05/2014 04:18 PM Subject: Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs. Hi Benjamin, Yes, we initially used a modified version of the AmpLabs docker scripts [1]. The amplab docker images are a good starting point. One of the biggest hurdles has been HDFS, which requires reverse-DNS, and I didn't want to go the dnsmasq route, to keep the containers relatively simple to use without the need of external scripts. Ended up running a 1-node setup nnode+dnode. I'm still looking for a better solution for HDFS [2]. Our use case for docker is to easily create local dev environments both for development and for automated functional testing (using cucumber). My aim is to strongly reduce the time of the develop-deploy-test cycle. That also means that we run the minimum number of instances required to have a functionally working setup. E.g. 1 Zookeeper, 1 Kafka broker, ... For the actual cluster deployment we have a Chef-based devops toolchain that puts things in place on public cloud providers. Personally, I think Docker rocks and would like to replace those complex cookbooks with Dockerfiles once the technology is mature enough. -greetz, Gerard. 
[1] https://github.com/amplab/docker-scripts [2] http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns On Mon, May 5, 2014 at 11:00 PM, Benjamin bboui...@gmail.com wrote: Hi, Before considering running on Mesos, did you try to submit the application on Spark deployed without Mesos on Docker containers ? Currently investigating this idea to deploy quickly a complete set of clusters with Docker, I'm interested by your findings on sharing the settings of Kafka and Zookeeper across nodes. How many broker and zookeeper do you use ? Regards, On Mon, May 5, 2014 at 10:11 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi all, I'm currently working on creating a set of docker images to facilitate local development with Spark/streaming on Mesos (+zk, hdfs, kafka) After solving the initial hurdles to get things
Re: Yarn configuration file doesn't work when run with yarn-client mode
I was actually able to get this to work. I was NOT setting the classpath properly originally. Simply running java -cp /etc/hadoop/conf/: (plus the yarn and hadoop jars) com.domain.JobClass and setting yarn-client as the spark master worked for me. Originally I had not put the configuration on the classpath. Also, I used $SPARK_HOME/bin/compute_classpath.sh to get all of the relevant jars. The job properly connects to the AM at the correct port. Is there any intuition on how Spark executors map to YARN workers, or how the different memory settings interplay, SPARK_MEM vs YARN_WORKER_MEM? Thanks, Arun On Tue, May 20, 2014 at 2:25 PM, Andrew Or and...@databricks.com wrote: Hi Gaurav and Arun, Your settings seem reasonable; as long as YARN_CONF_DIR or HADOOP_CONF_DIR is properly set, the application should be able to find the correct RM port. Have you tried running the examples in yarn-client mode, and your custom application in yarn-standalone (now yarn-cluster) mode? 2014-05-20 5:17 GMT-07:00 gaurav.dasgupta gaurav.d...@gmail.com: Few more details I would like to provide (Sorry as I should have provided with the previous post): *- Spark Version = 0.9.1 (using pre-built spark-0.9.1-bin-hadoop2) - Hadoop Version = 2.4.0 (Hortonworks) - I am trying to execute a Spark Streaming program* Because I am using Hortonworks Hadoop (HDP), YARN is configured with different port numbers than Apache's default configurations. For example, *resourcemanager.address* is IP:8050 in HDP whereas it defaults to IP:8032. When I run the Spark examples using bin/run-example, I can see in the console logs that it is connecting to the right port configured by HDP, i.e., 8050. 
Please refer the below console log: */[root@host spark-0.9.1-bin-hadoop2]# SPARK_YARN_MODE=true SPARK_JAR=assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar SPARK_YARN_APP_JAR=examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar bin/run-example org.apache.spark.examples.HdfsTest yarn-client /user/root/test SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/local/spark-0.9.1-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/local/spark-0.9.1-bin-hadoop2/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 14/05/20 06:55:29 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/05/20 06:55:29 INFO Remoting: Starting remoting 14/05/20 06:55:29 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@IP:60988] 14/05/20 06:55:29 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@lt;IP:60988] 14/05/20 06:55:29 INFO spark.SparkEnv: Registering BlockManagerMaster 14/05/20 06:55:29 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140520065529-924f 14/05/20 06:55:29 INFO storage.MemoryStore: MemoryStore started with capacity 4.2 GB. 
14/05/20 06:55:29 INFO network.ConnectionManager: Bound socket to port 35359 with id = ConnectionManagerId(IP,35359) 14/05/20 06:55:29 INFO storage.BlockManagerMaster: Trying to register BlockManager 14/05/20 06:55:29 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager IP:35359 with 4.2 GB RAM 14/05/20 06:55:29 INFO storage.BlockManagerMaster: Registered BlockManager 14/05/20 06:55:29 INFO spark.HttpServer: Starting HTTP Server 14/05/20 06:55:29 INFO server.Server: jetty-7.x.y-SNAPSHOT 14/05/20 06:55:29 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:59418 14/05/20 06:55:29 INFO broadcast.HttpBroadcast: Broadcast server started at http://IP:59418 14/05/20 06:55:29 INFO spark.SparkEnv: Registering MapOutputTracker 14/05/20 06:55:29 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-fc34fdc8-d940-420b-b184-fc7a8a65501a 14/05/20 06:55:29 INFO spark.HttpServer: Starting HTTP Server 14/05/20 06:55:29 INFO server.Server: jetty-7.x.y-SNAPSHOT 14/05/20 06:55:29 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:53425 14/05/20 06:55:29 INFO server.Server: jetty-7.x.y-SNAPSHOT 14/05/20 06:55:29 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage/rdd,null} 14/05/20 06:55:29 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage,null} 14/05/20 06:55:29 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/stage,null} 14/05/20 06:55:29 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/pool,null} 14/05/20 06:55:29 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages,null} 14/05/20 06:55:29 INFO handler.ContextHandler: started
Re: advice on maintaining a production spark cluster?
Hi Matei, Unfortunately, I don't have more detailed information, but we have seen the loss of workers in standalone mode as well. If a job is killed through CTRL-C, we will often see in the Spark Master page the number of workers and cores decrease. They are still alive and well in the Cloudera Manager page, but not visible on the Spark master. Simply restarting the workers usually resolves this, but we have often seen workers disappear after a failed or killed job. If we see this occur again, I'll try and provide some logs. On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Which version is this with? I haven’t seen standalone masters lose workers. Is there other stuff on the machines that’s killing them, or what errors do you see? Matei On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote: Hey folks, I'm wondering what strategies other folks are using for maintaining and monitoring the stability of stand-alone spark clusters. Our master very regularly loses workers, and they (as expected) never rejoin the cluster. This is the same behavior I've seen using akka cluster (if that's what spark is using in stand-alone mode) -- are there configuration options we could be setting to make the cluster more robust? We have a custom script which monitors the number of workers (through the web interface) and restarts the cluster when necessary, as well as resolving other issues we face (like spark shells left open permanently claiming resources), and it works, but it's nowhere close to a great solution. What are other folks doing? Is this something that other folks observe as well? I suspect that the loss of workers is tied to jobs that run out of memory on the client side or our use of very large broadcast variables, but I don't have an isolated test case. I'm open to general answers here: for example, perhaps we should simply be using mesos or yarn instead of stand-alone mode. --j
Re: advice on maintaining a production spark cluster?
I'd just like to point out that, along with Matei, I have not seen workers drop even under the most exotic job failures. We're running pretty close to master, though; perhaps it is related to an uncaught exception in the Worker from a prior version of Spark. On Tue, May 20, 2014 at 11:36 AM, Arun Ahuja aahuj...@gmail.com wrote: Hi Matei, Unfortunately, I don't have more detailed information, but we have seen the loss of workers in standalone mode as well. If a job is killed through CTRL-C we will often see in the Spark Master page the number of workers and cores decrease. They are still alive and well in the Cloudera Manager page, but not visible on the Spark master, simply restarting the workers usually resolves this, but we often seen workers disappear after a failed or killed job. If we see this occur again, I'll try and provide some logs. On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia matei.zaha...@gmail.comwrote: Which version is this with? I haven’t seen standalone masters lose workers. Is there other stuff on the machines that’s killing them, or what errors do you see? Matei On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote: Hey folks, I'm wondering what strategies other folks are using for maintaining and monitoring the stability of stand-alone spark clusters. Our master very regularly loses workers, and they (as expected) never rejoin the cluster. This is the same behavior I've seen using akka cluster (if that's what spark is using in stand-alone mode) -- are there configuration options we could be setting to make the cluster more robust? We have a custom script which monitors the number of workers (through the web interface) and restarts the cluster when necessary, as well as resolving other issues we face (like spark shells left open permanently claiming resources), and it works, but it's no where close to a great solution. What are other folks doing? Is this something that other folks observe as well? 
I suspect that the loss of workers is tied to jobs that run out of memory on the client side or our use of very large broadcast variables, but I don't have an isolated test case. I'm open to general answers here: for example, perhaps we should simply be using mesos or yarn instead of stand-alone mode. --j
Re: Yarn configuration file doesn't work when run with yarn-client mode
I'm assuming you're running Spark 0.9.x, because in the latest version of Spark you shouldn't have to add the HADOOP_CONF_DIR to the java class path manually. I tested this out on my own YARN cluster and was able to confirm that. In Spark 1.0, SPARK_MEM is deprecated and should not be used. Instead, you should set the per-executor memory through spark.executor.memory, which has the same effect but takes higher priority. By YARN_WORKER_MEM, do you mean SPARK_EXECUTOR_MEMORY? It also does the same thing. In Spark 1.0, the priority hierarchy is as follows: spark.executor.memory (set through spark-defaults.conf) > SPARK_EXECUTOR_MEMORY > SPARK_MEM (deprecated). In Spark 0.9, the hierarchy is very similar: spark.executor.memory (set through SPARK_JAVA_OPTS in spark-env) > SPARK_MEM. For more information: http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/configuration.html http://spark.apache.org/docs/0.9.1/configuration.html 2014-05-20 11:30 GMT-07:00 Arun Ahuja aahuj...@gmail.com: I was actually able to get this to work. I was NOT setting the classpath properly originally. Simply running java -cp /etc/hadoop/conf/: (plus the yarn and hadoop jars) com.domain.JobClass and setting yarn-client as the spark master worked for me. Originally I had not put the configuration on the classpath. Also, I used $SPARK_HOME/bin/compute_classpath.sh to get all of the relevant jars. The job properly connects to the AM at the correct port. Is there any intuition on how Spark executors map to YARN workers, or how the different memory settings interplay, SPARK_MEM vs YARN_WORKER_MEM? Thanks, Arun On Tue, May 20, 2014 at 2:25 PM, Andrew Or and...@databricks.com wrote: Hi Gaurav and Arun, Your settings seem reasonable; as long as YARN_CONF_DIR or HADOOP_CONF_DIR is properly set, the application should be able to find the correct RM port. Have you tried running the examples in yarn-client mode, and your custom application in yarn-standalone (now yarn-cluster) mode? 
2014-05-20 5:17 GMT-07:00 gaurav.dasgupta gaurav.d...@gmail.com: Few more details I would like to provide (Sorry as I should have provided with the previous post): *- Spark Version = 0.9.1 (using pre-built spark-0.9.1-bin-hadoop2) - Hadoop Version = 2.4.0 (Hortonworks) - I am trying to execute a Spark Streaming program* Because I am using Hortornworks Hadoop (HDP), YARN is configured with different port numbers than the default Apache's default configurations. For example, *resourcemanager.address* is IP:8050 in HDP whereas it defaults to IP:8032. When I run the Spark examples using bin/run-example, I can see in the console logs, that it is connecting to the right port configured by HDP, i.e., 8050. Please refer the below console log: */[root@host spark-0.9.1-bin-hadoop2]# SPARK_YARN_MODE=true SPARK_JAR=assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar SPARK_YARN_APP_JAR=examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar bin/run-example org.apache.spark.examples.HdfsTest yarn-client /user/root/test SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/local/spark-0.9.1-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/local/spark-0.9.1-bin-hadoop2/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 
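The Spark 1.0 hierarchy described above can be pinned in one place; a minimal sketch of the spark-defaults.conf form (the 2g value is only a placeholder):

```
# conf/spark-defaults.conf -- highest priority of the three
spark.executor.memory   2g
```

Equivalently, exporting SPARK_EXECUTOR_MEMORY=2g in spark-env.sh takes effect when the property above is unset, and the deprecated SPARK_MEM is consulted last.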
Re: advice on maintaining a production spark cluster?
Are you guys both using Cloudera Manager? Maybe there’s also an issue with the integration with that.

Matei

On May 20, 2014, at 11:44 AM, Aaron Davidson ilike...@gmail.com wrote: I'd just like to point out that, along with Matei, I have not seen workers drop even under the most exotic job failures. We're running pretty close to master, though; perhaps it is related to an uncaught exception in the Worker from a prior version of Spark.

On Tue, May 20, 2014 at 11:36 AM, Arun Ahuja aahuj...@gmail.com wrote: Hi Matei, Unfortunately, I don't have more detailed information, but we have seen the loss of workers in standalone mode as well. If a job is killed through CTRL-C, we will often see the number of workers and cores decrease on the Spark master page. They are still alive and well in the Cloudera Manager page, but not visible on the Spark master. Simply restarting the workers usually resolves this, but we have often seen workers disappear after a failed or killed job. If we see this occur again, I'll try to provide some logs.

On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Which version is this with? I haven’t seen standalone masters lose workers. Is there other stuff on the machines that’s killing them, or what errors do you see? Matei

On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote: Hey folks, I'm wondering what strategies other folks are using for maintaining and monitoring the stability of stand-alone spark clusters. Our master very regularly loses workers, and they (as expected) never rejoin the cluster. This is the same behavior I've seen using akka cluster (if that's what spark is using in stand-alone mode) -- are there configuration options we could be setting to make the cluster more robust?

We have a custom script which monitors the number of workers (through the web interface) and restarts the cluster when necessary, as well as resolving other issues we face (like spark shells left open permanently claiming resources), and it works, but it's nowhere close to a great solution. What are other folks doing? Is this something that other folks observe as well? I suspect that the loss of workers is tied to jobs that run out of memory on the client side or our use of very large broadcast variables, but I don't have an isolated test case. I'm open to general answers here: for example, perhaps we should simply be using mesos or yarn instead of stand-alone mode. --j
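For archive readers: the kind of watchdog script Josh describes can be sketched briefly. This is an illustrative sketch only; it assumes the standalone master serves its cluster state as JSON at http://<master>:8080/json (worth verifying against your Spark version), and the master URL and expected worker count below are hypothetical.

```python
import json
import urllib.request  # a 2014-era cluster would use urllib2 instead


def alive_workers(status_json):
    """Count the workers the master currently considers ALIVE."""
    status = json.loads(status_json)
    return sum(1 for w in status.get("workers", []) if w.get("state") == "ALIVE")


def check_cluster(master_url="http://spark-master:8080/json", expected=4):
    # hypothetical master URL and worker count; wire the "restart" branch
    # up to whatever init/cluster scripts you already use
    with urllib.request.urlopen(master_url) as resp:
        n = alive_workers(resp.read())
    if n < expected:
        print("only %d/%d workers alive; restart needed" % (n, expected))
    return n
```

Run from cron, this replaces scraping the HTML status page by hand; the JSON view carries the same worker list the web UI renders.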
Re: advice on maintaining a production spark cluster?
We're using spark 0.9.0, and we're using it out of the box -- not using Cloudera Manager or anything similar. There are warnings from the master that there continue to be heartbeats from the unregistered workers. I will see if there are particular telltale errors on the worker side. We've had occasional problems with running out of memory on the driver side (esp. with large broadcast variables) so that may be related.

--j

On Tuesday, May 20, 2014, Matei Zaharia matei.zaha...@gmail.com wrote: Are you guys both using Cloudera Manager? Maybe there’s also an issue with the integration with that. Matei [earlier messages in the thread quoted in full, snipped]
Re: life of an executor
interesting, so it sounds to me like spark is forced to choose between the ability to add jars during its lifetime and the ability to run tasks with the user classpath first (which is important for the ability to run jobs on spark clusters not under your control, and so for the viability of 3rd party spark apps)

On Tue, May 20, 2014 at 1:06 PM, Aaron Davidson ilike...@gmail.com wrote: One issue is that new jars can be added during the lifetime of a SparkContext, which can mean after executors are already started. Off-heap storage is always serialized, correct.

On Tue, May 20, 2014 at 6:48 AM, Koert Kuipers ko...@tresata.com wrote: just for my clarification: off heap cannot be java objects, correct? so we are always talking about serialized off-heap storage?

On May 20, 2014 1:27 AM, Tathagata Das tathagata.das1...@gmail.com wrote: That's one of the main motivations in using Tachyon ;) http://tachyon-project.org/ It gives off-heap in-memory caching. And starting Spark 0.9, you can cache any RDD in Tachyon just by specifying the appropriate StorageLevel. TD

On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.com wrote: I guess it needs to be this way to benefit from caching of RDDs in memory. It would be nice, however, if the RDD cache could be dissociated from the JVM heap so that in cases where garbage collection is difficult to tune, one could choose to discard the JVM and run the next operation in a fresh one.

On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia matei.zaha...@gmail.com wrote: They’re tied to the SparkContext (application) that launched them. Matei

On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: from looking at the source code i see executors run in their own jvm subprocesses. how long do they live for? as long as the worker/slave? or are they tied to the sparkcontext and live/die with it? thx
Re: advice on maintaining a production spark cluster?
So, for example, I have two disassociated worker machines at the moment. The last messages in the spark logs are akka association error messages, like the following:

14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] -> [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]: Error [Association failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]] [akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hdn3.int.meetup.com/10.3.6.23:46288 ]

On the master side, there are lots and lots of messages of the form:

14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038

--j

On Tue, May 20, 2014 at 3:28 PM, Josh Marcus jmar...@meetup.com wrote: [previous messages in the thread quoted in full, snipped]
Spark Streaming using Flume body size limitation
Hi, I am trying to send events to spark streaming via flume. It's working fine up to a certain point. I have problems when the size of the body is over 1020 characters. Basically:

- up to 1020 characters, it works;
- from 1021 through 1024, the event will be accepted and there is no exception, but the channel seems to be corrupted, or at least no more events can make it through;
- at 1025 and up, I am seeing some exceptions: java.io.StreamCorruptedException: invalid stream header

I created a test using the Embedded Agent and a local spark to expose the problem. I am using the cdh5 distribution:

<cdh-flume.version>1.4.0-cdh5.0.0</cdh-flume.version>
<cdh-spark.version>0.9.0-cdh5.0.0</cdh-spark.version>
<cdh-avro.version>1.7.5-cdh5.0.0</cdh-avro.version>

File is attached. Has anybody ever seen this? Any suggestions?

/D

SparkFlumeStreamTest.java (11K) http://apache-spark-user-list.1001560.n3.nabble.com/attachment/6127/0/SparkFlumeStreamTest.java

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-using-Flume-body-size-limitation-tp6127.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Yarn configuration file doesn't work when run with yarn-client mode
Yes, we are on Spark 0.9.0, so that explains the first piece, thanks! Also, yes, I meant SPARK_WORKER_MEMORY. Thanks for the hierarchy. Similarly, is there some best practice on setting SPARK_WORKER_INSTANCES and spark.default.parallelism?

Thanks, Arun

On Tue, May 20, 2014 at 3:04 PM, Andrew Or and...@databricks.com wrote: I'm assuming you're running Spark 0.9.x, because in the latest version of Spark you shouldn't have to add the HADOOP_CONF_DIR to the java class path manually. I tested this out on my own YARN cluster and was able to confirm that. In Spark 1.0, SPARK_MEM is deprecated and should not be used. Instead, you should set the per-executor memory through spark.executor.memory, which has the same effect but takes higher priority. By YARN_WORKER_MEM, do you mean SPARK_EXECUTOR_MEMORY? It also does the same thing. In Spark 1.0, the priority hierarchy is as follows:

1. spark.executor.memory (set through spark-defaults.conf)
2. SPARK_EXECUTOR_MEMORY
3. SPARK_MEM (deprecated)

In Spark 0.9, the hierarchy is very similar:

1. spark.executor.memory (set through SPARK_JAVA_OPTS in spark-env)
2. SPARK_MEM

For more information: http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/configuration.html http://spark.apache.org/docs/0.9.1/configuration.html

2014-05-20 11:30 GMT-07:00 Arun Ahuja aahuj...@gmail.com: I was actually able to get this to work. I was NOT setting the classpath properly originally. Simply running java -cp /etc/hadoop/conf/:<yarn, hadoop jars> com.domain.JobClass and setting yarn-client as the spark master worked for me. Originally I had not put the configuration on the classpath. Also, I used $SPARK_HOME/bin/compute_classpath.sh to get all of the relevant jars. The job properly connects to the AM at the correct port. Is there any intuition on how Spark executors map to YARN workers, or how the different memory settings interplay, SPARK_MEM vs YARN_WORKER_MEM?
Thanks, Arun

On Tue, May 20, 2014 at 2:25 PM, Andrew Or and...@databricks.com wrote: Hi Gaurav and Arun, Your settings seem reasonable; as long as YARN_CONF_DIR or HADOOP_CONF_DIR is properly set, the application should be able to find the correct RM port. Have you tried running the examples in yarn-client mode, and your custom application in yarn-standalone (now yarn-cluster) mode?

2014-05-20 5:17 GMT-07:00 gaurav.dasgupta gaurav.d...@gmail.com: [earlier message and console log quoted in full, snipped]
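Putting Andrew's advice together, a typical Spark 0.9-on-YARN environment might look like the following sketch. The paths are hypothetical; the key point is that HADOOP_CONF_DIR/YARN_CONF_DIR must point at the directory containing yarn-site.xml, so the client picks up a non-default ResourceManager port such as HDP's 8050.

```shell
# Hypothetical paths; adjust to your HDP installation.
export HADOOP_CONF_DIR=/etc/hadoop/conf   # must contain yarn-site.xml with
export YARN_CONF_DIR=/etc/hadoop/conf     # yarn.resourcemanager.address = <IP>:8050

# Spark 0.9-era per-executor memory, per the priority hierarchy Andrew describes:
export SPARK_JAVA_OPTS="-Dspark.executor.memory=2g"
```

If the custom application still connects to 8032, it almost always means these variables were not visible to the process that launched it.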
Re: reading large XML files
Thanks, that sounds perfect

On Tue, May 20, 2014 at 1:38 PM, Xiangrui Meng men...@gmail.com wrote: You can search for XMLInputFormat on Google. There are some implementations that allow you to specify the tag to split on, e.g.: https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/collection/XMLInputFormat.java

On Tue, May 20, 2014 at 10:31 AM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: Unfortunately, I don't have a bunch of moderately big xml files; I have one really big file - big enough that reading it into memory as a single string is not feasible.

On Tue, May 20, 2014 at 1:24 PM, Xiangrui Meng men...@gmail.com wrote: Try sc.wholeTextFiles(). It reads the entire file into a string record. -Xiangrui

On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: We are trying to read some large GraphML files to use in spark. Is there an easy way to read XML-based files like this that accounts for partition boundaries and the like? Thanks, Nathan

-- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com
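The idea behind the XMLInputFormat that Xiangrui points to can be sketched outside Hadoop as well: scan the file as a stream and emit everything between a configured start and end tag, so that only one record (plus a small buffer) is ever held in memory. A rough Python sketch of the splitting logic, not the Hadoop API itself; the tag names are whatever encloses one record in your GraphML:

```python
def xml_records(path, start_tag, end_tag, chunk_size=64 * 1024):
    """Yield substrings between start_tag and end_tag, streaming the file."""
    buf = ""
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            while True:
                s = buf.find(start_tag)
                e = buf.find(end_tag, s) if s != -1 else -1
                if s == -1 or e == -1:
                    if s != -1:
                        # partial record: keep it and wait for more input
                        buf = buf[s:]
                    elif len(buf) > len(start_tag):
                        # keep a small tail in case a tag spans two chunks
                        buf = buf[-len(start_tag):]
                    break
                yield buf[s:e + len(end_tag)]
                buf = buf[e + len(end_tag):]
```

A Hadoop XMLInputFormat does the same thing per split, with extra care at split boundaries; that boundary handling is exactly what the Cloud9 implementation provides.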
java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable
This is the first time I'm trying Spark. I just downloaded it and am trying the SimpleApp Java program using maven. I added 2 maven dependencies -- spark-core and scala-library. Even though my program is in java, I was forced to add the scala dependency. Is that really required? Now I'm able to build, but while executing the below line I get the error:

JavaSparkContext sc = new JavaSparkContext("local", "Simple App", "C:\\tmp\\spark-0.9.1-bin-cdh4", new String[]{"target\\spark-hello-world-1.1.jar"});

java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:81) at SimpleApp.main(SimpleApp.java:8) at TestSimpleApp.testMain(TestSimpleApp.java:14) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at junit.framework.TestCase.runTest(TestCase.java:164) at junit.framework.TestCase.runBare(TestCase.java:130) at junit.framework.TestResult$1.protect(TestResult.java:110) at junit.framework.TestResult.runProtected(TestResult.java:128) at junit.framework.TestResult.run(TestResult.java:113) at junit.framework.TestCase.run(TestCase.java:120) at junit.framework.TestSuite.runTest(TestSuite.java:228) at junit.framework.TestSuite.run(TestSuite.java:223) at org.junit.internal.runners.OldTestClassRunner.run(OldTestClassRunner.java:35) at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:59) at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:120) at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:103) at org.apache.maven.surefire.Surefire.run(Surefire.java:169) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:350) at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1021) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.Writable at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276) at java.lang.ClassLoader.loadClass(ClassLoader.java:251) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319) ... 26 more -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-NoClassDefFoundError-org-apache-hadoop-io-Writable-tp6131.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
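A likely cause: org.apache.hadoop.io.Writable lives in the Hadoop client libraries. They are normally pulled in transitively by spark-core, so if the class is missing, something in the build has excluded them; one hedged fix is to declare hadoop-client explicitly next to spark-core. Version numbers below are illustrative for a CDH4-era setup and should be matched to your cluster. And yes, scala-library is genuinely needed at runtime even for Java programs, because Spark itself is written in Scala (it too is usually satisfied transitively via spark-core).

```xml
<!-- Illustrative versions; match them to your cluster (CDH4 sketched here). -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>0.9.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.0.0-cdh4.6.0</version>
</dependency>
```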
Re: Evaluating Spark just for Cluster Computing
My $0.02: If you are simply reading input records, running a model, and outputting the result, then it's a simple map-only problem and you're mostly looking for a process to baby-sit these operations. Lots of things work -- Spark, M/R (+ Crunch), Hadoop Streaming, etc. I'd choose whatever is simplest to integrate with the RDBMS and analytical model; all could work.

Keep in mind that the failure recovery processes in these various frameworks don't necessarily interact cleanly with your external systems. For example, if a Spark worker dies while doing some work on some of your IDs, it will happily be restarted, but if your job inserts results into another table, you may find it has inserted them twice, of course.

On Tue, May 20, 2014 at 6:26 PM, pcutil puneet.ma...@gmail.com wrote: Hi - We have a use case for batch processing for which we are trying to figure out if Apache Spark would be a good fit or not. We have a universe of identifiers sitting in an RDBMS, for which we need to go get input data from the RDBMS and then pass that input to analytical models that generate some output numbers and store them back to the database. This is one unit of work for us. So basically we are looking at whether we can do this processing in parallel for the universe of identifiers that we have. All the data is in the RDBMS and is not sitting in a file system. Can we use spark for this kind of work, and would it be a good fit for that? Thanks for your help.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Evaluating-Spark-just-for-Cluster-Computing-tp6110.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
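The "inserted twice" caveat above is usually addressed by making the write idempotent: key each output row by the input identifier and overwrite on retry instead of blindly appending. A toy sketch of the pattern, with a dict standing in for the results table and hypothetical names throughout:

```python
def run_model(record):
    # stand-in for the analytical model
    return record["value"] * 2


def process_partition(records, results_table):
    """Idempotent writes: re-running after a failure overwrites, never duplicates."""
    for rec in records:
        # keyed upsert instead of a blind INSERT
        results_table[rec["id"]] = run_model(rec)


table = {}
batch = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
process_partition(batch, table)
process_partition(batch, table)  # simulated retry after a worker failure
```

Against a real RDBMS the same idea is an UPSERT (or DELETE-then-INSERT within a transaction) keyed on the identifier, so a restarted task converges to the same final state.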
Re: Spark stalling during shuffle (maybe a memory issue)
If the distribution of the keys in your groupByKey is skewed (some keys appear way more often than others) you should consider modifying your job to use reduceByKey instead wherever possible.

On May 20, 2014 12:53 PM, Jon Keebler jkeeble...@gmail.com wrote: So we upped the spark.akka.frameSize value to 128 MB and still observed the same behavior. It's happening not necessarily when data is being sent back to the driver, but when there is an inter-cluster shuffle, for example during a groupByKey. Is it possible we should focus on tuning these parameters: spark.storage.memoryFraction, spark.shuffle.memoryFraction?

On Tue, May 20, 2014 at 12:09 AM, Aaron Davidson ilike...@gmail.com wrote: This is very likely because the serialized map output locations buffer exceeds the akka frame size. Please try setting spark.akka.frameSize (default 10 MB) to some higher number, like 64 or 128. In the newest version of Spark, this would throw a better error, for what it's worth.

On Mon, May 19, 2014 at 8:39 PM, jonathan.keebler jkeeble...@gmail.com wrote: Has anyone observed Spark worker threads stalling during a shuffle phase with the following message (one per worker host) being echoed to the terminal on the driver thread?

INFO spark.MapOutputTrackerActor: Asked to send map output locations for shuffle 0 to [worker host]...

At this point Spark-related activity on the hadoop cluster completely halts: there's no network activity, disk IO or CPU activity, and individual tasks are not completing; the job just sits in this state. At this point we just kill the job; a re-start of the Spark server service is required. Using identical jobs we were able to bypass this halt point by increasing available heap memory to the workers, but it's odd we don't get an out-of-memory error or any error at all. Upping the memory available isn't a very satisfying answer to what may be going on :) We're running Spark 0.9.0 on CDH5.0 in stand-alone mode. Thanks for any help or ideas you may have!

Cheers, Jonathan

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-tp6067.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
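For reference, the knobs discussed in this thread are plain Spark properties; on 0.9 they are typically passed through SPARK_JAVA_OPTS in spark-env.sh. The values below are examples only, not recommendations:

```shell
# Example values only -- tune against your own workload.
export SPARK_JAVA_OPTS="\
  -Dspark.akka.frameSize=128 \
  -Dspark.storage.memoryFraction=0.5 \
  -Dspark.shuffle.memoryFraction=0.3"
```

Note that storage and shuffle fractions come out of the same executor heap, so raising one usually means lowering the other.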
Imports that need to be specified in a Spark application jar?
Hello All, I am learning that there are certain imports done by the Spark REPL, used to invoke and run code in a spark shell, that I would have to import explicitly if I need the same functionality in a spark jar run from the command line. I am running into a repeated serialization error on an RDD that contains a Map. The exact RDD is org.apache.spark.rdd.RDD[(String, Seq[Map[String,MetaData]])], where MetaData is a case class. This serialization error is not encountered when I try to write out the RDD to disk using saveAsTextFile, or when I try to count the number of elements in the RDD using the count() function. Sadly, when I run the same commands in the spark-shell, I do not encounter any error. I appreciate your help in advance.

Thanks Shivani

-- Software Engineer Analytics Engineering Team@ Box Mountain View, CA
Re: Setting queue for spark job on yarn
Hi Ron, What version are you using? For 0.9, you need to set it outside your code with the SPARK_YARN_QUEUE environment variable. -Sandy

On Mon, May 19, 2014 at 9:29 PM, Ron Gonzalez zlgonza...@yahoo.com wrote: Hi, How does one submit a spark job to yarn and specify a queue? The code that successfully submits to yarn is:

val conf = new SparkConf()
val sc = new SparkContext("yarn-client", "Simple App", conf)

Where do I need to specify the queue? Thanks in advance for any help on this... Thanks, Ron
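Concretely, on Spark 0.9 the queue is read from the environment at submit time, so nothing changes in the SparkContext code; a sketch (the queue name is hypothetical):

```shell
# Spark 0.9: select the YARN queue via the environment before launching the job.
export SPARK_YARN_QUEUE=analytics   # hypothetical queue name; unset, the default queue is used
```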
Re: Spark stalling during shuffle (maybe a memory issue)
Thanks for the suggestion, Andrew. We have also implemented our solution using reduceByKey, but observe the same behavior. For example, if we do the following:

map1 -> groupByKey -> map2 -> saveAsTextFile

then the stalling will occur during the map1+groupByKey execution. If we do

map1 -> reduceByKey -> map2 -> saveAsTextFile

then the reduceByKey finishes successfully, but the stalling will occur during the map2+saveAsTextFile execution.

On Tue, May 20, 2014 at 4:22 PM, Andrew Ash [via Apache Spark User List] ml-node+s1001560n6134...@n3.nabble.com wrote: If the distribution of the keys in your groupByKey is skewed (some keys appear way more often than others) you should consider modifying your job to use reduceByKey instead wherever possible.

On May 20, 2014 12:53 PM, Jon Keebler [hidden email] wrote: So we upped the spark.akka.frameSize value to 128 MB and still observed the same behavior. It's happening not necessarily when data is being sent back to the driver, but when there is an inter-cluster shuffle, for example during a groupByKey. Is it possible we should focus on tuning these parameters: spark.storage.memoryFraction, spark.shuffle.memoryFraction?

On Tue, May 20, 2014 at 12:09 AM, Aaron Davidson [hidden email] wrote: This is very likely because the serialized map output locations buffer exceeds the akka frame size. Please try setting spark.akka.frameSize (default 10 MB) to some higher number, like 64 or 128. In the newest version of Spark, this would throw a better error, for what it's worth.

On Mon, May 19, 2014 at 8:39 PM, jonathan.keebler [hidden email] wrote: Has anyone observed Spark worker threads stalling during a shuffle phase with the following message (one per worker host) being echoed to the terminal on the driver thread?

INFO spark.MapOutputTrackerActor: Asked to send map output locations for shuffle 0 to [worker host]...
At this point Spark-related activity on the hadoop cluster completely halts .. there's no network activity, disk IO or CPU activity, and individual tasks are not completing and the job just sits in this state. At this point we just kill the job a re-start of the Spark server service is required. Using identical jobs we were able to by-pass this halt point by increasing available heap memory to the workers, but it's odd we don't get an out-of-memory error or any error at all. Upping the memory available isn't a very satisfying answer to what may be going on :) We're running Spark 0.9.0 on CDH5.0 in stand-alone mode. Thanks for any help or ideas you may have! Cheers, Jonathan

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-tp6067p6137.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
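The reason reduceByKey behaves so differently from groupByKey under skew is map-side combining: each map task collapses its local values per key before anything is shuffled, so a hot key contributes one partial value per partition rather than one record per occurrence. A small Python sketch of the shuffle arithmetic (not the Spark API, just the idea):

```python
from collections import defaultdict


def map_side_combine(partition, combine):
    """What reduceByKey does before shuffling: one partial value per key."""
    acc = {}
    for key, value in partition:
        acc[key] = combine(acc[key], value) if key in acc else value
    return acc  # these partials, not the raw records, cross the network


partitions = [
    [("hot", 1)] * 1000 + [("rare", 1)],  # skewed toward one hot key
    [("hot", 1)] * 1000,
]

shuffled = [map_side_combine(p, lambda a, b: a + b) for p in partitions]
# groupByKey would shuffle all 2001 raw records; here only 3 partial sums move

totals = defaultdict(int)
for part in shuffled:  # the reduce side merges the partials
    for k, v in part.items():
        totals[k] += v
```

This is also why reduceByKey only helps when the operation is associative and can be expressed as a combine; a true "collect every value" grouping cannot avoid shuffling the raw records.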
Re: Spark Streaming and Shark | Streaming Taking All CPUs
Thanks Mayur, it is working :) -- Anish Sneh http://in.linkedin.com/in/anishsneh
Re: Spark stalling during shuffle (maybe a memory issue)
So the current stalling is simply sitting there with no log output? Have you jstack'd an Executor to see where it may be hanging? Are you observing memory or disk pressure (df and df -i)?

On Tue, May 20, 2014 at 2:03 PM, jonathan.keebler jkeeble...@gmail.com wrote: Thanks for the suggestion, Andrew. We have also implemented our solution using reduceByKey, but observe the same behavior. For example, if we do the following: map1, groupByKey, map2, saveAsTextFile, then the stalling occurs during the map1 + groupByKey execution. If we do map1, reduceByKey, map2, saveAsTextFile, then the reduceByKey finishes successfully, but the stalling occurs during the map2 + saveAsTextFile execution.

On Tue, May 20, 2014 at 4:22 PM, Andrew Ash [via Apache Spark User List] wrote: If the distribution of the keys in your groupByKey is skewed (some keys appear far more often than others), you should consider modifying your job to use reduceByKey instead wherever possible.

On May 20, 2014 12:53 PM, Jon Keebler wrote: So we upped the spark.akka.frameSize value to 128 MB and still observed the same behavior. It's happening not necessarily when data is being sent back to the driver, but when there is an inter-cluster shuffle, for example during a groupByKey. Is it possible we should focus on tuning these parameters: spark.storage.memoryFraction, spark.shuffle.memoryFraction?

On Tue, May 20, 2014 at 12:09 AM, Aaron Davidson wrote: This is very likely because the serialized map output locations buffer exceeds the Akka frame size. Please try setting spark.akka.frameSize (default 10 MB) to some higher number, like 64 or 128. In the newest version of Spark, this would throw a better error, for what it's worth. 
On Mon, May 19, 2014 at 8:39 PM, jonathan.keebler wrote: Has anyone observed Spark worker threads stalling during a shuffle phase, with the following message (one per worker host) being echoed to the terminal on the driver thread?

INFO spark.MapOutputTrackerActor: Asked to send map output locations for shuffle 0 to [worker host]...

At this point Spark-related activity on the Hadoop cluster completely halts: no network activity, disk I/O, or CPU activity, individual tasks stop completing, and the job just sits in this state. We kill the job, and a restart of the Spark service is required. With identical jobs we were able to bypass this halt point by increasing the heap memory available to the workers, but it's odd we don't get an out-of-memory error, or any error at all. We're running Spark 0.9.0 on CDH5.0 in stand-alone mode. Thanks for any help or ideas you may have! Cheers, Jonathan 
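The reduceByKey suggestion in the thread above rests on map-side combining: reduceByKey pre-aggregates values within each partition before the shuffle, so only one partial result per key crosses the network, whereas groupByKey ships every individual record. A plain-Python sketch of that idea (no Spark required; the names here are illustrative, not Spark's API):

```python
# Illustrates why reduceByKey shuffles less data than groupByKey:
# combine values per key within each partition first, then shuffle
# only the partial sums.
from collections import defaultdict

def map_side_combine(partition):
    """Pre-aggregate (key, value) pairs within one partition, as reduceByKey does."""
    combined = defaultdict(int)
    for key, value in partition:
        combined[key] += value
    return list(combined.items())

# Two "partitions" of (key, value) pairs with a skewed key 'a'
p1 = [("a", 1)] * 5 + [("b", 1)]
p2 = [("a", 1)] * 4 + [("c", 1)]

# groupByKey would shuffle every record: 11 pairs cross the network
shuffled_group = p1 + p2

# reduceByKey shuffles only the per-partition partial sums: 4 pairs
shuffled_reduce = map_side_combine(p1) + map_side_combine(p2)
print(len(shuffled_group), len(shuffled_reduce))  # 11 4

# The final reduce on the receiving side gives the same totals either way
totals = defaultdict(int)
for k, v in shuffled_reduce:
    totals[k] += v
print(dict(totals))  # {'a': 9, 'b': 1, 'c': 1}
```

With a heavily skewed key, the savings scale with the number of records for that key, which is why skewed groupByKey workloads are the first candidates for a reduceByKey rewrite.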
Spark Performace Comparison Spark on YARN vs Spark Standalone
Hi All I need to analyse performance of Spark YARN vs Spark Standalone Please suggest if we have some pre-published comparison statistics available. TIA -- Anish Sneh http://in.linkedin.com/in/anishsneh
Re: facebook data mining with Spark
Hello Joe, The first step is acquiring some data, either through the Facebook API https://developers.facebook.com/ or a third-party service like Datasift https://datasift.com/ (paid). Once you've acquired some data and got it somewhere Spark can access it (like HDFS), you can load and manipulate it just like any other data. Here is a pretty-printed example JSON message I got from a Datasift https://datasift.com/ stream this morning; it illustrates an anonymised someone with *clearly too much time on their hands* having reached *level 576* on Candy Crush Saga.

{
  "demographic": { "gender": "mostly_female" },
  "facebook": {
    "application": "Candy Crush Saga",
    "author": { "type": "user", "hash_id": "" },
    "caption": "I just completed level 576, scored 494020 points and got 3 stars.",
    "created_at": "Tue, 20 May 2014 03:08:09 +0000",
    "description": "Click here to follow my progress!",
    "id": "100_123456789012345",
    "link": "http://apps.facebook.com/candycrush/?urlMessage= ",
    "name": "Yay, I completed level 576 in Candy Crush Saga!",
    "source": "Candy Crush Saga (123456789012345)",
    "type": "link"
  },
  "interaction": {
    "schema": { "version": 3 },
    "type": "facebook",
    "id": "",
    "created_at": "Tue, 20 May 2014 03:08:09 +0000",
    "received_at": 1400555303.6832,
    "author": { "type": "user", "hash_id": "" },
    "title": "Yay, I completed level 576 in Candy Crush Saga!",
    "link": "http://www.facebook.com/100_123456789012345",
    "subtype": "link",
    "content": "Click here to follow my progress!",
    "source": "Candy Crush Saga (123456789012345)"
  },
  "language": { "tag": "en", "tag_extended": "en", "confidence": 97 }
}

Much like processing Twitter streams, the data arrives as a single JSON object on each line, so you need to pass the RDD[String] you get from opening the text file through a JSON parser. Spark has the json4s https://github.com/json4s/json4s and Jackson JSON parsers embedded in the assembly, so you can basically use those for 'free' without having to bundle them in your JAR. Here is an example Spark job which answers the age-old question: who is better at Candy Crush, boys or girls? 
// We want to extract the level number from "Yay, I completed level 576 in Candy Crush Saga!"
// The actual text will change based on the user's language, but parsing the last number works.
import org.json4s._
import org.json4s.jackson.JsonMethods._

val pattern = """(\d+)""".r

// Produces an RDD[String]
val lines = sc.textFile("facebook-2014-05-19.json")

lines.map(line => {
  // Parse the JSON
  parse(line)
}).filter(json => {
  // Keep only 'Candy Crush Saga' activity
  json \ "facebook" \ "application" == JString("Candy Crush Saga")
}).map(json => {
  // Extract the 'level', or default to zero
  var level = 0
  pattern.findAllIn(compact(json \ "interaction" \ "title")).matchData.foreach(m => {
    level = m.group(1).toInt
  })
  // Extract the gender
  val gender = compact(json \ "demographic" \ "gender")
  // Return a tuple: RDD[(gender: String, (level: Int, count: Int))]
  (gender, (level, 1))
}).filter(a => {
  // Filter out entries with a level of zero
  a._2._1 > 0
}).reduceByKey((a, b) => {
  // Sum the levels and counts so we can average later
  (a._1 + b._1, a._2 + b._2)
}).collect().foreach(entry => {
  // Print the results
  val gender = entry._1
  val values = entry._2
  val average = values._1 / values._2
  println(gender + ": average=" + average + ", count=" + values._2)
})

See more: https://gist.github.com/cotdp/fda64b4248e43a3c8f46 If you run this on a small sample of data you get results like this: female: average=114, count=15422; male: average=104, count=14727. Which basically says the average level achieved by women is slightly higher than the guys'. Best of luck fishing through Facebook data! MC

Michael Cutler
Founder, CTO
Mobile: +44 789 990 7847
Email: mich...@tumra.com
Web: tumra.com
Registered in England & Wales, 07916412. VAT No. 130595328 
On 20 May 2014 05:07, Joe L selme...@yahoo.com wrote: Is there any way to get facebook data into Spark and filter the content of it? -- View this message in context:
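The extract-and-aggregate logic in the Scala job above can be sketched outside Spark with nothing but the Python standard library, which makes the parsing steps easy to test before porting them to an RDD pipeline. The sample lines below are made up for illustration; real Datasift messages carry many more fields:

```python
# Parse one JSON object per line, pull the level out of the title with a
# regex (taking the last number, as in the Scala job), and average the
# levels per gender.
import json
import re
from collections import defaultdict

lines = [
    '{"demographic": {"gender": "female"}, "interaction": {"title": "Yay, I completed level 576 in Candy Crush Saga!"}}',
    '{"demographic": {"gender": "male"}, "interaction": {"title": "Yay, I completed level 100 in Candy Crush Saga!"}}',
    '{"demographic": {"gender": "female"}, "interaction": {"title": "Yay, I completed level 24 in Candy Crush Saga!"}}',
]

pattern = re.compile(r"(\d+)")
sums = defaultdict(lambda: [0, 0])  # gender -> [level_sum, count]

for line in lines:
    msg = json.loads(line)
    numbers = pattern.findall(msg["interaction"]["title"])
    level = int(numbers[-1]) if numbers else 0  # default to zero if no number
    if level > 0:
        entry = sums[msg["demographic"]["gender"]]
        entry[0] += level
        entry[1] += 1

averages = {g: s // c for g, (s, c) in sums.items()}
print(averages)  # {'female': 300, 'male': 100}
```

The dict-of-[sum, count] plays the role of the (level, 1) tuples fed into reduceByKey; the division at the end mirrors the average computed in the collect() loop.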
Python, Spark and HBase
Hello, This seems like a basic question, but I have been unable to find an answer in the archives or other online sources. I would like to know if there is any way to load an RDD from HBase in Python. In Java/Scala I can do this by initializing a NewAPIHadoopRDD with a TableInputFormat class. Is there any equivalent in Python? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Using Spark to analyze complex JSON
The Apache Drill http://incubator.apache.org/drill/ home page has an interesting heading: Liberate Nested Data. Is there any current or planned functionality in Spark SQL or Shark to enable SQL-like querying of complex JSON? Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-to-analyze-complex-JSON-tp6146.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: count()-ing gz files gives java.io.IOException: incorrect header check
Any tips on how to troubleshoot this? On Thu, May 15, 2014 at 4:15 PM, Nick Chammas nicholas.cham...@gmail.com wrote: I'm trying to do a simple count() on a large number of GZipped files in S3. My job is failing with the following message:

14/05/15 19:12:37 WARN scheduler.TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: incorrect header check
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:82)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)
    at java.io.InputStream.read(InputStream.java:101)
    [snipped]

I traced this down to HADOOP-5281 https://issues.apache.org/jira/browse/HADOOP-5281, but I'm not sure 1) whether it's the same issue, or 2) how to go about resolving it. I gather I need to update some Hadoop jar? Any tips on where to look / what to do? I'm running Spark on an EC2 cluster created by spark-ec2 with no special options used. Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Unsubscribe
Re: count()-ing gz files gives java.io.IOException: incorrect header check
I have read gzip files from S3 successfully. It sounds like a file is corrupt or not a valid gzip file. Does it work with fewer gzip files? How are you reading the files? - Madhu https://www.linkedin.com/in/msiddalingaiah -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768p6149.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
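An "incorrect header check" from zlib typically means a stream does not begin with the gzip magic bytes, i.e. the file is not actually gzip despite its name. As a Spark-independent sketch, suspect files can be pre-screened locally by checking that two-byte signature (0x1f 0x8b) before feeding them to a job:

```python
# Detect files that are not valid gzip by checking the two-byte magic
# number (0x1f 0x8b) that every gzip stream must start with.
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"

def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC

# Build one real gzip file and one impostor with a .gz name
tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, "good.gz")
bad = os.path.join(tmpdir, "bad.gz")

with gzip.open(good, "wb") as f:
    f.write(b"hello\n")
with open(bad, "wb") as f:
    f.write(b"plain text, not gzip\n")

print(looks_like_gzip(good), looks_like_gzip(bad))  # True False
```

Note this only catches files with a wrong header; a file truncated mid-stream would pass this check and still fail during decompression.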
How to Unsubscribe from the Spark user list
Send an email to this address to unsubscribe from the Spark user list: user-unsubscr...@spark.apache.org Sending an email to the Spark user list itself (i.e. this list) *does not do anything*, even if you put unsubscribe as the subject. We will all just see your email. Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Unsubscribe-from-the-Spark-user-list-tp6150.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: advice on maintaining a production spark cluster?
Unfortunately, those errors are actually due to an Executor that exited, such that the connection between the Worker and Executor failed. This is not a fatal issue, unless there are analogous messages from the Worker to the Master (which should be present, if they exist, at around the same point in time). Do you happen to have the logs from the Master that indicate that the Worker terminated? Is it just an Akka disassociation, or some exception? On Tue, May 20, 2014 at 12:53 PM, Sean Owen so...@cloudera.com wrote: This isn't helpful of me to say, but, I see the same sorts of problem and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight into when it happens, but usually after heavy use and after running for a long time. I had figured I'd see if the changes since 0.9.0 addressed it and revisit later. On Tue, May 20, 2014 at 8:37 PM, Josh Marcus jmar...@meetup.com wrote: So, for example, I have two disassociated worker machines at the moment. The last messages in the spark logs are akka association error messages, like the following: 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] - [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]: Error [Association failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hdn3.int.meetup.com/10.3.6.23:46288 ] On the master side, there are lots and lots of messages of the form: 14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038 --j
Re: count()-ing gz files gives java.io.IOException: incorrect header check
Yes, it does work with fewer GZipped files. I am reading the files in using sc.textFile() and a pattern string. For example:

a = sc.textFile('s3n://bucket/2014-??-??/*.gz')
a.count()

Nick On Tue, May 20, 2014 at 10:09 PM, Madhu ma...@madhu.com wrote: I have read gzip files from S3 successfully. It sounds like a file is corrupt or not a valid gzip file. Does it work with fewer gzip files? How are you reading the files? - Madhu https://www.linkedin.com/in/msiddalingaiah -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768p6149.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
IllegalStateException when creating Job from shell
Hi, I'm trying to work with Spark from the shell and create a Hadoop Job instance. I get the exception you see below because Job.toString doesn't like to be called until the job has been submitted. I tried using the :silent command, but that didn't seem to have any impact.

scala> import org.apache.hadoop.mapreduce.Job
scala> val job = new Job()
java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
    at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
    at org.apache.hadoop.mapreduce.Job.toString(Job.java:462)
    at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
    at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
    at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
    ...

Any help would be greatly appreciated! Thanks, Alex
Re: Python, Spark and HBase
Unfortunately this is not yet possible. There’s a patch in progress posted here though: https://github.com/apache/spark/pull/455 — it would be great to get your feedback on it. Matei On May 20, 2014, at 4:21 PM, twizansk twiza...@gmail.com wrote: Hello, This seems like a basic question but I have been unable to find an answer in the archives or other online sources. I would like to know if there is any way to load a RDD from HBase in python. In Java/Scala I can do this by initializing a NewAPIHadoopRDD with a TableInputFormat class. Is there any equivalent in python? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: advice on maintaining a production spark cluster?
Aaron: I see this in the Master's logs:

14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038
14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038

There was an executor that launched that did fail, such as:

14/05/20 01:16:05 INFO Master: Launching executor app-20140520011605-0001/2 on worker worker-20140519155427-hdn3.int.meetup.com-50038
14/05/20 01:17:37 INFO Master: Removing executor app-20140520011605-0001/2 because it is FAILED

... but other executors on other machines also failed without permanently disassociating. There are these messages, which I don't know if they are related:

14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

On Tue, May 20, 2014 at 10:13 PM, Aaron Davidson ilike...@gmail.com wrote: Unfortunately, those errors are actually due to an Executor that exited, such that the connection between the Worker and Executor failed. 
This is not a fatal issue, unless there are analogous messages from the Worker to the Master (which should be present, if they exist, at around the same point in time). Do you happen to have the logs from the Master that indicate that the Worker terminated? Is it just an Akka disassociation, or some exception? On Tue, May 20, 2014 at 12:53 PM, Sean Owen so...@cloudera.com wrote: This isn't helpful of me to say, but, I see the same sorts of problem and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight into when it happens, but usually after heavy use and after running for a long time. I had figured I'd see if the changes since 0.9.0 addressed it and revisit later. On Tue, May 20, 2014 at 8:37 PM, Josh Marcus jmar...@meetup.com wrote: So, for example, I have two disassociated worker machines at the moment. The last messages in the spark logs are akka association error messages, like the following: 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] - [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]: Error [Association failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hdn3.int.meetup.com/10.3.6.23:46288 ] On the master side, there are lots and lots of messages of the form: 14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038 --j
any way to control memory usage when streaming input arrives faster than Spark Streaming can process it?
sparkers, Is there a better way to control memory usage when streaming input arrives faster than Spark Streaming can process it? Thanks, Francis.Hu