Spark and ActorSystem
Hi, I'd like to know where I could find more information related to the deprecation of the actor system in Spark (from 1.4.x). I'm interested in the reasons for this decision. Cheers -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-ActorSystem-tp24321.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: scheduler delay time
You'd need to provide information such as the executor configuration (number of cores, memory size). You might see less scheduler delay with smaller but more numerous executors than with the opposite configuration. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/scheduler-delay-time-tp6003p24133.html
Re: How to make my spark implementation parallel?
If you want to properly exploit the 8 nodes of your cluster, you should use ~2 times that number of partitions. You can specify the number of partitions when calling parallelize, as follows: JavaRDD<Point> pnts = sc.parallelize(points, 16); -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-my-spark-implementation-parallel-tp23804p23808.html
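Spelled out a bit more fully, the call could look like the sketch below. The Point class, the points list, and the loadPoints() helper are assumptions taken from the original question, and the code needs Spark on the classpath to run:

```java
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelizeExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("ParallelizeExample"));
        // Hypothetical helper returning the input data as a list of Point
        List<Point> points = loadPoints();
        // Rule of thumb from above: ~2 partitions per node on an 8-node cluster
        JavaRDD<Point> pnts = sc.parallelize(points, 16);
        System.out.println(pnts.partitions().size()); // 16 partitions
        sc.stop();
    }
}
```

Without the second argument, parallelize falls back to spark.default.parallelism, which may be far too low to keep all nodes busy.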
HDFS performance + unexpected death of executors.
Hi, I have several issues related to HDFS that may have different roots. I'm posting as much information as I can, with the hope that I can get your opinion on at least some of them. Basically the cases are:
- HDFS classes not found
- Connections with some datanodes seem to be slow / unexpectedly closed.
- Executors become lost (and cannot be relaunched due to an out-of-memory error)

What I'm looking for:
- HDFS misconfiguration / tuning advice
- Global setup flaws (impact of VMs and NUMA mismatch, for example)
- For the last category of issue, I'd like to know why, when the executor dies, the JVM's memory is not freed, thus not allowing a new executor to be launched.

My setup is the following: 1 hypervisor with 32 cores and 50 GB of RAM, with 5 VMs running in this hypervisor. Each VM has 5 cores and 7 GB. Each node has 1 worker set up with 4 cores and 6 GB available (the remaining resources are intended to be used by HDFS/OS). I run a WordCount workload with a dataset of 4 GB, on a Spark 1.4.0 / HDFS 2.5.2 setup. I got the binaries from the official websites (no local compiling). (1) and 2) are logged on the worker, in the work/app-id/exec-id/stderr file.)

1) Hadoop class related issues

15:34:32: DEBUG HadoopRDD: SplitLocationInfo and other new Hadoop classes are unavailable. Using the older Hadoop location info code.
java.lang.ClassNotFoundException: org.apache.hadoop.mapred.InputSplitWithLocationInfo

15:40:46: DEBUG SparkHadoopUtil: Couldn't find method for retrieving thread-level FileSystem input data
java.lang.NoSuchMethodException: org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()

2) HDFS performance related issues

The following errors arise:

15:43:16: ERROR TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=284992323013, chunkIndex=2}, buffer=FileSegmentManagedBuffer{file=/tmp/spark-b17f3299-99f3-4147-929f-1f236c812d0e/executor-d4ceae23-b9d9-4562-91c2-2855baeb8664/blockmgr-10da9c53-c20a-45f7-a430-2e36d799c7e1/2f/shuffle_0_14_0.data, offset=15464702, length=998530}} to /192.168.122.168:59299; closing connection
java.io.IOException: Broken pipe

15:43:16 ERROR TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=284992323013, chunkIndex=0}, buffer=FileSegmentManagedBuffer{file=/tmp/spark-b17f3299-99f3-4147-929f-1f236c812d0e/executor-d4ceae23-b9d9-4562-91c2-2855baeb8664/blockmgr-10da9c53-c20a-45f7-a430-2e36d799c7e1/31/shuffle_0_12_0.data, offset=15238441, length=980944}} to /192.168.122.168:59299; closing connection
java.io.IOException: Broken pipe

15:44:28 WARN TransportChannelHandler: Exception in connection from /192.168.122.15:50995
java.io.IOException: Connection reset by peer

(note that it's on another executor) Some time later:

15:44:52 DEBUG DFSClient: DFSClient seqno: -2 status: SUCCESS status: ERROR downstreamAckTimeNanos: 0
15:44:52 WARN DFSClient: DFSOutputStream ResponseProcessor exception for block BP-845049430-155.99.144.31-1435598542277:blk_1073742427_1758
java.io.IOException: Bad response ERROR for block BP-845049430-155.99.144.31-1435598542277:blk_1073742427_1758 from datanode x.x.x.x:50010
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)

The following two errors appear several times:

15:51:05 ERROR Executor: Exception in task 19.0 in stage 1.0 (TID 51)
java.nio.channels.ClosedChannelException
at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1528)
at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:98)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:81)
at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:102)
at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:95)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1110)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
Re: How to make my spark implementation parallel?
Can you please share your application code? I suspect that you're not making good use of the cluster because a wrong number of partitions is configured in your RDDs. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-my-spark-implementation-parallel-tp23804p23805.html
Re: Is it possible to change the default port number 7077 for spark?
Q1: You can change the master's port by setting SPARK_MASTER_PORT in conf/spark-env.sh (or with the --port option when starting the master). I don't know what the impact would be on a Cloudera distro, though. Q2: Yes: a Spark worker needs to be present on each node that you want to make available to the driver. Q3: You can submit an application from your laptop to the master with the spark-submit script. You don't need to contact the workers directly. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-change-the-default-port-number-7077-for-spark-tp23774p23781.html
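To illustrate Q3, a submission from a laptop typically looks like the following; the host name, class name, and jar name are placeholders:

```shell
# Submit directly to the standalone master; the workers are never contacted directly.
spark-submit \
  --master spark://master-host:7077 \
  --class com.example.MyApp \
  my-app-assembly.jar
```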
Re: Issues when combining Spark and a third party java library
I'm using hadoop 2.5.2 with spark 1.4.0 and I can also see in my logs:

15/07/09 06:39:02 DEBUG HadoopRDD: SplitLocationInfo and other new Hadoop classes are unavailable. Using the older Hadoop location info code.
java.lang.ClassNotFoundException: org.apache.hadoop.mapred.InputSplitWithLocationInfo
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.spark.rdd.HadoopRDD$SplitInfoReflections.<init>(HadoopRDD.scala:386)
at org.apache.spark.rdd.HadoopRDD$.liftedTree1$1(HadoopRDD.scala:396)
at org.apache.spark.rdd.HadoopRDD$.<init>(HadoopRDD.scala:395)
at org.apache.spark.rdd.HadoopRDD$.<clinit>(HadoopRDD.scala)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:165)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:289)
at WordCount$.main(WordCount.scala:13)
at WordCount.main(WordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The application I launch through spark-submit can access data on hdfs though, and I launch the script with HADOOP_HOME being set. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issues-when-combining-Spark-and-a-third-party-java-library-tp21367p23765.html
Re: Issues when combining Spark and a third party java library
Also, it's worth noting that I'm using the prebuilt version for Hadoop 2.4 and higher from the official website. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issues-when-combining-Spark-and-a-third-party-java-library-tp21367p23770.html
Re: Spark standalone cluster - Output file stored in temporary directory in worker
I think the properties that you have in your hdfs-site.xml should go in core-site.xml (at least the namenode.name and datanode.data ones). I might be wrong here, but that's what I have in my setup. You should also add hadoop.tmp.dir to your core-site.xml; that might be the source of your inconsistency. As for hadoop-env.sh, I just use it to export variables such as HADOOP_PREFIX, LOG_DIR, CONF_DIR and JAVA_HOME. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-cluster-Output-file-stored-in-temporary-directory-in-worker-tp23653p23697.html
Master doesn't start, no logs
Hi, I've been compiling Spark 1.4.0 with SBT, from the source tarball available on the official website. I cannot run Spark's master, even though I have built and run several other instances of Spark on the same machine (Spark 1.3, master branch, prebuilt 1.4, ...):

starting org.apache.spark.deploy.master.Master, logging to /mnt/spark-1.4.0/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-xx.out
failed to launch org.apache.spark.deploy.master.Master:
full log in /mnt/spark-1.4.0/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-xx.out

But the log file is empty. After digging down to ./bin/spark-class, and finally trying to start the master with:

./bin/spark-class org.apache.spark.deploy.master.Master --host 155.99.144.31

I still get the same result. Here is the strace output for this command: http://pastebin.com/bkJVncBm I'm using a 64-bit Xeon, CentOS 6.5, Spark 1.4.0, compiled against hadoop 2.5.2. Any idea? :-) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Master-doesn-t-start-no-logs-tp23651.html
Re: Spark standalone cluster - Output file stored in temporary directory in worker
Can you share your Hadoop configuration files please?
- etc/hadoop/core-site.xml
- etc/hadoop/hdfs-site.xml
- etc/hadoop/hadoop-env.sh

AFAIK, the following properties should be configured: hadoop.tmp.dir, dfs.namenode.name.dir, dfs.datanode.data.dir and dfs.namenode.checkpoint.dir. Otherwise, an HDFS slave will use its default temporary folder to save blocks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-cluster-Output-file-stored-in-temporary-directory-in-worker-tp23653p23656.html
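For reference, a common placement of those properties looks like the sketch below; the directory paths and host name are placeholders, not recommendations:

```xml
<!-- etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/hadoop/data</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>file:///data/hadoop/checkpoint</value>
  </property>
</configuration>
```

If these directories are left unset, HDFS falls back to defaults derived from hadoop.tmp.dir, which usually points somewhere under /tmp.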
Directory creation failed leads to job fail (should it?)
Hi there, I have some traces from my master and some workers where, for some reason, the ./work directory of an application cannot be created on the workers. There is also an issue with the master's temp directory creation.

master logs: http://pastebin.com/v3NCzm0u
worker's logs: http://pastebin.com/Ninkscnx

It seems that some of the executors can create the directories, but as some others are repeatedly failing, the job ends up failing. Shouldn't Spark manage to keep working with a smaller number of executors instead of failing? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Directory-creation-failed-leads-to-job-fail-should-it-tp23531.html
Re: How to get the memory usage information of a spark application
You can see the amount of memory consumed by each executor in the web UI (go to the application page, and click on the Executors tab). Otherwise, for finer-grained monitoring, I can only think of correlating a system monitoring tool like Ganglia with the event timeline of your job. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-memory-usage-infomation-of-a-spark-application-tp23494p23495.html
Viewing old applications in the web UI with JSON logs
Is it possible to recreate, from the log files, the same views given in the web UI for completed applications after rebooting the master? I just tried to change the URL of the form http://w.x.y.z:8080/history/app-2-0036 by giving the appID, but it redirected me to the master's homepage. As far as I know, the logs for an app can be found in the master event logs, and in each executor's ./work/ and event log directories. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Vision-old-applications-in-webui-with-json-logs-tp23498.html
Exception in thread main java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
Basically, here's a dump of the SO question I opened (http://stackoverflow.com/questions/31033724/spark-1-4-0-java-lang-nosuchmethoderror-com-google-common-base-stopwatch-elapse)

I'm using spark 1.4.0 and when running the Scala SparkPageRank example (examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala), I encounter the following error:

Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.RDD$$anonfun$distinct$2.apply(RDD.scala:329)
at org.apache.spark.rdd.RDD$$anonfun$distinct$2.apply(RDD.scala:329)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.distinct(RDD.scala:328)
at org.apache.spark.examples.SparkPageRank$.main(SparkPageRank.scala:60)
at org.apache.spark.examples.SparkPageRank.main(SparkPageRank.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:621)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I'm not extremely familiar with java, but it seems that it's a guava version issue. The following information could be helpful:

$ find ./spark -name "*.jar" | grep guava
./lib_managed/bundles/guava-16.0.1.jar
./lib_managed/bundles/guava-14.0.1.jar

Part of the examples/pom.xml file:

...
<dependency>
  <groupId>org.apache.cassandra</groupId>
  <artifactId>cassandra-all</artifactId>
  <version>1.2.6</version>
  <exclusions>
    <exclusion>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
    </exclusion>
...

And indeed it seems that the class does not contain the problematic method:

$ javap -p /mnt/spark/examples/target/streams/\$global/assemblyOption/\$global/streams/assembly/7850cb6d36b2a6589a4d27ce027a65a2da72c9df_5fa98cd1a63c99a44dd8d3b77e4762b066a5d0c5/com/google/common/base/Stopwatch.class
Compiled from "Stopwatch.java"
public final class com.google.common.base.Stopwatch {
  private final com.google.common.base.Ticker ticker;
  private boolean isRunning;
  private long elapsedNanos;
  private long startTick;
  public static com.google.common.base.Stopwatch createUnstarted();
  public static com.google.common.base.Stopwatch createUnstarted(com.google.common.base.Ticker);
  public static com.google.common.base.Stopwatch createStarted();
  public static com.google.common.base.Stopwatch createStarted(com.google.common.base.Ticker);
  public com.google.common.base.Stopwatch();
  public com.google.common.base.Stopwatch(com.google.common.base.Ticker);
  public boolean isRunning();
  public com.google.common.base.Stopwatch start();
  public com.google.common.base.Stopwatch stop();
  public com.google.common.base.Stopwatch reset();
  private long elapsedNanos();
  public long elapsed(java.util.concurrent.TimeUnit);
Should I keep memory dedicated for HDFS and Spark on cluster nodes?
I'm wondering if there is a real benefit to splitting my memory in two for the datanodes/workers. Datanodes and the OS need memory to perform their business. I suppose there could be a loss of performance if they were to compete for memory with the worker(s). Any opinion? :-) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Should-I-keep-memory-dedicated-for-HDFS-and-Spark-on-cluster-nodes-tp23451.html
Re: Submitting Spark Applications using Spark Submit
You can specify the jars of your application to be included with spark-submit using the --jars switch. Otherwise, are you sure that your newly compiled Spark jar assembly is in assembly/target/scala-2.10/? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-Applications-using-Spark-Submit-tp23352p23400.html
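As an illustration of the switch, with placeholder class and jar names:

```shell
spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --jars extra-lib-1.jar,extra-lib-2.jar \
  my-app.jar
```

The jars in the comma-separated --jars list are placed on both the driver's and the executors' classpaths, which is usually what you want for third-party dependencies that aren't bundled into the application jar.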
Re: Can we increase the space of spark standalone cluster
For 1): In standalone mode, you can increase the workers' resource allocation in their local conf/spark-env.sh with the following variables: SPARK_WORKER_CORES, SPARK_WORKER_MEMORY. At application submit time, you can tune the amount of resources allocated to executors with --executor-cores and --executor-memory. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-increase-the-space-of-spark-standalone-cluster-tp23368p23372.html
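As a sketch, on each worker's conf/spark-env.sh (the values here are illustrative, not recommendations):

```shell
# Resources this worker is allowed to offer to executors
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=6g
```

The worker must be restarted for the new limits to take effect; executors requested at submit time can then use up to these amounts.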
Re: Can we increase the space of spark standalone cluster
Also, still for 1), in conf/spark-defaults.conf you can set the following properties to tune the driver's resources: spark.driver.cores, spark.driver.memory. Not sure if you can pass them at submit time, but it should be possible. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-increase-the-space-of-spark-standalone-cluster-tp23368p23373.html
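For instance, in conf/spark-defaults.conf (the values are illustrative):

```
spark.driver.cores   2
spark.driver.memory  4g
```

spark-submit also accepts these at submit time, as --driver-memory and (in cluster deploy mode) --driver-cores.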
Re: Determining number of executors within RDD
Note that this property is only available for YARN. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Determining-number-of-executors-within-RDD-tp15554p23256.html
Re: Determining number of executors within RDD
Actually this is somewhat confusing, for two reasons:
- First, the option 'spark.executor.instances', which seems to be handled only in the YARN case in the source code of SparkSubmit.scala, is also present in the conf/spark-env.sh file under the standalone section, which would indicate that it is also available for this mode.
- Second, a post from Andrew Or states that this property defines the number of workers in the cluster, not the number of executors on a given worker. (http://apache-spark-user-list.1001560.n3.nabble.com/clarification-for-some-spark-on-yarn-configuration-options-td13692.html)

Could anyone clarify this? :-) Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Determining-number-of-executors-within-RDD-tp15554p23262.html
Re: Determining number of executors within RDD
You should try to issue a get from the SparkConf object. I don't have the exact name of the matching key, but from reading the code in SparkSubmit.scala, it should be something like: conf.get("spark.executor.instances") -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Determining-number-of-executors-within-RDD-tp15554p23234.html
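In application code that could look like the sketch below (Scala, since the files cited above are Scala sources). The "0" default is an assumption: the key is only present when it has actually been configured, so reading it without a default would throw.

```scala
// Read the configured executor count from the running SparkContext's conf.
val conf = sc.getConf
val numExecutors = conf.get("spark.executor.instances", "0").toInt
println(s"spark.executor.instances = $numExecutors")
```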
Re: How does lineage get passed down in RDDs
If I read the code correctly, in RDD.scala, each RDD keeps track of its own dependencies (from Dependency.scala), and has methods to access its ancestors' dependencies, thus being able to recompute the lineage (see getNarrowAncestors() or getDependencies() in some RDDs like UnionRDD). So it doesn't look like an RDD knows the whole lineage graph without having to compute it, nor does an RDD give more than its own identity as a parent to a child RDD. As a new user I may be mistaken, so any veteran confirmation would be appreciated :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-does-lineage-get-passed-down-in-RDDs-tp23196p23212.html
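A quick way to observe this from user code is toDebugString, which walks exactly those per-RDD dependencies to reconstruct and print the lineage. A minimal sketch, assuming an existing SparkContext sc and an input file path:

```scala
val words  = sc.textFile("input.txt").flatMap(_.split(" "))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
// Prints the chain of parent RDDs, rebuilt by following each RDD's dependencies.
println(counts.toDebugString)
```

Each RDD only ever reports its direct parents; the full graph shown by toDebugString is recomputed on demand by recursion, matching the reading above.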