Spark and ActorSystem

2015-08-18 Thread maxdml
Hi, I'd like to know where I could find more information related to the deprecation of the actor system in Spark (from 1.4.x). I'm interested in the reasons for this decision. Cheers

Re: scheduler delay time

2015-08-04 Thread maxdml
You'd need to provide information such as the executor configuration (#cores, memory size). You might see less scheduler delay with smaller but more numerous executors than with fewer, larger ones.
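For illustration, a hedged sketch of how executor sizing can be set at submit time on a standalone cluster; the class name, jar, master URL and values are placeholders, not taken from this thread:

    # Many small executors: 2 cores / 4g each, capped at 16 cores in total
    ./bin/spark-submit \
      --class com.example.MyApp \
      --master spark://master-host:7077 \
      --executor-cores 2 \
      --executor-memory 4g \
      --total-executor-cores 16 \
      my-app.jar

Fewer, larger executors would instead use something like --executor-cores 8 --executor-memory 16g.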

Re: How to make my spark implementation parallel?

2015-07-13 Thread maxdml
If you want to exploit the 8 nodes of your cluster properly, you should use roughly 2x that number of partitions. You can specify the number of partitions when calling parallelize, as follows: JavaRDD<Point> pnts = sc.parallelize(points, 16);
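As a hedged illustration (the data and names are made up, not from this thread), the same idea in Scala, with a check of how many partitions the RDD actually has:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("partitioning-sketch"))

    // Dummy data standing in for the points of the original question
    val points = (1 to 100000).map(i => (i.toDouble, i.toDouble))

    val pnts = sc.parallelize(points, 16)   // ~2x the number of worker nodes
    println(pnts.partitions.length)         // prints 16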

HDFS performances + unexpected death of executors.

2015-07-13 Thread maxdml
Hi, I have several issues related to HDFS that may have different roots. I'm posting as much information as I can, in the hope that I can get your opinion on at least some of them. Basically the cases are: - HDFS classes not found - Connections with some datanodes seem to be slow/

Re: How to make my spark implementation parallel?

2015-07-13 Thread maxdml
Can you please share your application code? I suspect that you're not making good use of the cluster because of a wrong number of partitions in your RDDs.

Re: Is it possible to change the default port number 7077 for spark?

2015-07-12 Thread maxdml
Q1: You can change the port number on the master in the file conf/spark-defaults.conf. I don't know what the impact would be on a Cloudera distro though. Q2: Yes: a Spark worker needs to be present on each node which you want to make available to the driver. Q3: You can submit an application
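As a hedged sketch (placeholder host and port, and assuming a plain standalone deployment rather than a Cloudera-managed one), the standalone master port can also be set through SPARK_MASTER_PORT in conf/spark-env.sh, with clients then pointing at the new port:

    # conf/spark-env.sh on the master
    export SPARK_MASTER_PORT=7078

    # conf/spark-defaults.conf on the client/driver side
    spark.master   spark://master-host:7078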

Re: Issues when combining Spark and a third party java library

2015-07-10 Thread maxdml
I'm using hadoop 2.5.2 with spark 1.4.0 and I can also see in my logs: 15/07/09 06:39:02 DEBUG HadoopRDD: SplitLocationInfo and other new Hadoop classes are unavailable. Using the older Hadoop location info code. java.lang.ClassNotFoundException:

Re: Issues when combining Spark and a third party java library

2015-07-10 Thread maxdml
Also, it's worth noting that I'm using the prebuilt version for hadoop 2.4 and higher from the official website.

Re: Spark standalone cluster - Output file stored in temporary directory in worker

2015-07-07 Thread maxdml
I think the properties that you have in your hdfs-site.xml should go in core-site.xml (at least the namenode.name and datanode.data ones). I might be wrong here, but that's what I have in my setup. You should also add hadoop.tmp.dir to your core-site.xml. That might be the source of your

Master doesn't start, no logs

2015-07-06 Thread maxdml
Hi, I've been compiling Spark 1.4.0 with SBT, from the source tarball available on the official website. I cannot run Spark's master, even though I have built and run several other instances of Spark on the same machine (spark 1.3, master branch, pre-built 1.4, ...). /starting

Re: Spark standalone cluster - Output file stored in temporary directory in worker

2015-07-06 Thread maxdml
Can you share your Hadoop configuration files please? - etc/hadoop/core-site.xml - etc/hadoop/hdfs-site.xml - etc/hadoop/hadoop-env.sh AFAIK, the following properties should be configured: hadoop.tmp.dir, dfs.namenode.name.dir, dfs.datanode.data.dir and dfs.namenode.checkpoint.dir. Otherwise, an
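For illustration only, a minimal sketch of where these properties usually live (hadoop.tmp.dir in core-site.xml, the dfs.* ones in hdfs-site.xml); the paths are placeholders:

    <!-- etc/hadoop/core-site.xml -->
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop/tmp</value>
      </property>
    </configuration>

    <!-- etc/hadoop/hdfs-site.xml -->
    <configuration>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/hadoop/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data/hadoop/datanode</value>
      </property>
      <property>
        <name>dfs.namenode.checkpoint.dir</name>
        <value>/data/hadoop/checkpoint</value>
      </property>
    </configuration>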

Directory creation failed leads to job fail (should it?)

2015-06-29 Thread maxdml
Hi there, I have some traces from my master and some workers where, for some reason, the ./work directory of an application cannot be created on the workers. There is also an issue with the master's temp directory creation. master logs: http://pastebin.com/v3NCzm0u worker's logs:

Re: How to get the memory usage infomation of a spark application

2015-06-25 Thread maxdml
You can see the amount of memory consumed by each executor in the web UI (go to the application page and click on the Executors tab). Otherwise, for finer-grained monitoring, I can only think of correlating a system monitoring tool like Ganglia with the event timeline of your job.

Viewing old applications in the web UI with JSON logs

2015-06-25 Thread maxdml
Is it possible to recreate the same views given in the web UI for completed applications, after rebooting the master, from the log files? I just tried to change the URL of the form http://w.x.y.z:8080/history/app-2-0036 by giving the appID, but it redirected me to the master's

Exception in thread main java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-06-24 Thread maxdml
Basically, here's a dump of the SO question I opened (http://stackoverflow.com/questions/31033724/spark-1-4-0-java-lang-nosuchmethoderror-com-google-common-base-stopwatch-elapse). I'm using Spark 1.4.0, and when running the Scala SparkPageRank example

Should I keep memory dedicated for HDFS and Spark on cluster nodes?

2015-06-23 Thread maxdml
I'm wondering if there is a real benefit to splitting my memory in two for the datanode/workers. Datanodes and the OS need memory to do their business. I suppose there could be a loss of performance if they came to compete for memory with the worker(s). Any opinion? :-)
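As a hedged sketch of one way to leave that headroom (the sizes are made-up assumptions, not from this thread), cap the worker's memory in conf/spark-env.sh so the HDFS datanode and the OS keep a slice of the node's RAM:

    # conf/spark-env.sh on each worker node, assuming ~64g of RAM per node
    export SPARK_WORKER_MEMORY=48g   # leaves ~16g for the datanode and the OS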

Re: Submitting Spark Applications using Spark Submit

2015-06-18 Thread maxdml
You can specify the jars of your application to be included with spark-submit using the --jars switch. Otherwise, are you sure that your newly compiled Spark jar assembly is in assembly/target/scala-2.10/?
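For illustration, a minimal sketch of the --jars switch; the class, jar names and master URL are placeholders:

    ./bin/spark-submit \
      --class com.example.MyApp \
      --master spark://master-host:7077 \
      --jars lib/dep-one.jar,lib/dep-two.jar \
      my-app.jar

The jars listed after --jars are shipped to the cluster and added to the executors' classpath.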

Re: Can we increase the space of spark standalone cluster

2015-06-17 Thread maxdml
For 1), in standalone mode you can increase the workers' resource allocation in their local conf/spark-env.sh with the following variables: SPARK_WORKER_CORES, SPARK_WORKER_MEMORY. At application submit time, you can tune the amount of resources allocated to executors with --executor-cores and
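A hedged sketch of both knobs, with placeholder values:

    # conf/spark-env.sh on each worker
    export SPARK_WORKER_CORES=8
    export SPARK_WORKER_MEMORY=16g

    # per-executor sizing at submit time
    ./bin/spark-submit --executor-cores 2 --executor-memory 4g --class com.example.MyApp --master spark://master-host:7077 my-app.jar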

Re: Can we increase the space of spark standalone cluster

2015-06-17 Thread maxdml
Also, still for 1), in conf/spark-defaults.conf you can set the following properties to tune the driver's resources: spark.driver.cores, spark.driver.memory. Not sure if you can pass them at submit time, but it should be possible.
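As a hedged sketch (placeholder values; as far as I know --driver-cores is only honored in cluster deploy mode), the same settings in conf/spark-defaults.conf and the matching spark-submit switches:

    # conf/spark-defaults.conf
    spark.driver.cores    2
    spark.driver.memory   4g

    # or the equivalent spark-submit switches
    ./bin/spark-submit --driver-cores 2 --driver-memory 4g --class com.example.MyApp my-app.jar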

Re: Determining number of executors within RDD

2015-06-10 Thread maxdml
Note that this property is only available for YARN.

Re: Determining number of executors within RDD

2015-06-10 Thread maxdml
Actually this is somewhat confusing, for two reasons: - First, the option 'spark.executor.instances', which seems to be handled only in the YARN case in the source code of SparkSubmit.scala, is also present in the conf/spark-env.sh file under the standalone section, which would indicate that

Re: Determining number of executors within RDD

2015-06-09 Thread maxdml
You should try, from the SparkConf object, to issue a get. I don't have the exact name for the matching key, but from reading the code in SparkSubmit.scala, it should be something like: conf.get("spark.executor.instances")
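A hedged Scala sketch of that idea (the key name follows the thread above, the default value is an assumption, and the property is typically only set when running on YARN):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("executor-count-sketch"))

    // Returns the configured value, or the fallback if the key was never set
    val numExecutors = sc.getConf.get("spark.executor.instances", "not set")
    println(s"spark.executor.instances = $numExecutors")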

Re: How does lineage get passed down in RDDs

2015-06-08 Thread maxdml
If I read the code correctly, in RDD.scala each RDD keeps track of its own dependencies (from Dependency.scala) and has methods to access its ancestors' dependencies, thus being able to recompute the lineage (see getNarrowAncestors() or getDependencies() in some RDDs like UnionRDD). So it
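A hedged Scala sketch of inspecting that lineage from user code, through the public dependencies method and toDebugString (the data is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("lineage-sketch"))
    val a = sc.parallelize(1 to 10)
    val b = sc.parallelize(11 to 20)
    val u = a.union(b).map(_ * 2)                  // UnionRDD wrapped in a MapPartitionsRDD

    println(u.toDebugString)                       // prints the full lineage
    u.dependencies.foreach(d => println(d.rdd))    // direct parent RDDs only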