Spark and ActorSystem

2015-08-18 Thread maxdml
Hi,

I'd like to know where I can find more information related to the
deprecation of the actor system in Spark (from 1.4.x).

I'm interested in the reasons for this decision.

Cheers






Re: scheduler delay time

2015-08-04 Thread maxdml
You'd need to provide information such as the executor configuration (number
of cores, memory size). You might see less scheduler delay with smaller but
more numerous executors than with fewer, larger ones.






Re: How to make my spark implementation parallel?

2015-07-13 Thread maxdml
If you want to properly exploit the 8 nodes of your cluster, you should use
roughly 2 times that number of partitions.

You can specify the number of partitions when calling parallelize, as
follows:


JavaRDD<Point> pnts = sc.parallelize(points, 16);
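
For reference, here is a minimal Scala sketch of the same idea (the input
collection below is made up for illustration), which also shows how to verify
the resulting partition count:

    val pnts = sc.parallelize(1 to 1000, 16)   // hypothetical data, 16 partitions requested
    println(pnts.partitions.length)            // prints 16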






HDFS performances + unexpected death of executors.

2015-07-13 Thread maxdml
Hi,

I have several issues related to HDFS that may have different roots. I'm
posting as much information as I can, in the hope of getting your opinion on
at least some of them. Basically the cases are:

- HDFS classes not found
- Connections with some datanodes seem to be slow or are closed unexpectedly
- Executors become lost (and cannot be relaunched due to an out-of-memory
error)

What I'm looking for:
- HDFS misconfiguration / tuning advice
- Global setup flaws (impact of VMs and NUMA mismatch, for example)
- For the last category of issues, I'd like to know why, when an executor
dies, the JVM's memory is not freed, which prevents a new executor from being
launched.

My setup is the following:
1 hypervisor with 32 cores and 50 GB of RAM, with 5 VMs running on it. Each
VM has 5 cores and 7 GB.
Each node has 1 worker set up with 4 cores and 6 GB available (the remaining
resources are intended for HDFS and the OS).

I run a WordCount workload with a 4 GB dataset, on a Spark 1.4.0 / HDFS
2.5.2 setup. I got the binaries from the official websites (no local compiling).

(Issues 1 and 2 below are logged on the worker, in the work/app-id/exec-id/stderr file.)

1) Hadoop class-related issues

15:34:32: DEBUG HadoopRDD: SplitLocationInfo and other new Hadoop classes
are unavailable. Using the older Hadoop location info code.
java.lang.ClassNotFoundException:
org.apache.hadoop.mapred.InputSplitWithLocationInfo

15:40:46: DEBUG SparkHadoopUtil: Couldn't find method for retrieving
thread-level FileSystem input data
java.lang.NoSuchMethodException:
org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()


2) HDFS performance-related issues

The following errors arise:

15:43:16: ERROR TransportRequestHandler: Error sending result
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=284992323013,
chunkIndex=2},
buffer=FileSegmentManagedBuffer{file=/tmp/spark-b17f3299-99f3-4147-929f-1f236c812d0e/executor-d4ceae23-b9d9-4562-91c2-2855baeb8664/blockmgr-10da9c53-c20a-45f7-a430-2e36d799c7e1/2f/shuffle_0_14_0.data,
offset=15464702, length=998530}} to /192.168.122.168:59299; closing
connection
java.io.IOException: Broken pipe

15:43:16 ERROR TransportRequestHandler: Error sending result
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=284992323013,
chunkIndex=0},
buffer=FileSegmentManagedBuffer{file=/tmp/spark-b17f3299-99f3-4147-929f-1f236c812d0e/executor-d4ceae23-b9d9-4562-91c2-2855baeb8664/blockmgr-10da9c53-c20a-45f7-a430-2e36d799c7e1/31/shuffle_0_12_0.data,
offset=15238441, length=980944}} to /192.168.122.168:59299; closing
connection
java.io.IOException: Broken pipe


15:44:28: WARN TransportChannelHandler: Exception in connection from
/192.168.122.15:50995
java.io.IOException: Connection reset by peer (note that it's on another
executor)

Some time later:

15:44:52 DEBUG DFSClient: DFSClient seqno: -2 status: SUCCESS status: ERROR
downstreamAckTimeNanos: 0
15:44:52 WARN DFSClient: DFSOutputStream ResponseProcessor exception for
block BP-845049430-155.99.144.31-1435598542277:blk_1073742427_1758
java.io.IOException: Bad response ERROR for block
BP-845049430-155.99.144.31-1435598542277:blk_1073742427_1758 from datanode
x.x.x.x:50010
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)

The following two errors appear several times:

15:51:05 ERROR Executor: Exception in task 19.0 in stage 1.0 (TID 51)
java.nio.channels.ClosedChannelException
at
org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1528)
at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:98)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at
org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:81)
at
org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:102)
at
org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:95)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1110)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
  

Re: How to make my spark implementation parallel?

2015-07-13 Thread maxdml
Can you please share your application code?

I suspect that you're not making good use of the cluster because the number of
partitions in your RDDs is misconfigured.
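
If the RDD already exists (e.g., it was loaded from a file), here is a minimal
Scala sketch of how you could inspect and adjust its partitioning; it assumes
an existing SparkContext sc and a hypothetical input path:

    val rdd = sc.textFile("hdfs:///data/points.txt")   // hypothetical input
    println(rdd.partitions.length)                     // current number of partitions
    val tuned = rdd.repartition(16)                    // full shuffle into 16 partitions
    // coalesce(n) can also be used to reduce the partition count without a full shuffle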






Re: Is it possible to change the default port number 7077 for spark?

2015-07-12 Thread maxdml
Q1: You can change the master's port number, for instance by setting
SPARK_MASTER_PORT in conf/spark-env.sh. I don't know what the impact would be
on a Cloudera distro, though.

Q2: Yes: a Spark worker needs to be present on each node that you want to
make available to the driver.

Q3: You can submit an application from your laptop to the master with the
spark-submit script. You don't need to contact the workers directly.









Re: Issues when combining Spark and a third party java library

2015-07-10 Thread maxdml
I'm using Hadoop 2.5.2 with Spark 1.4.0, and I can also see the following in my logs:

15/07/09 06:39:02 DEBUG HadoopRDD: SplitLocationInfo and other new Hadoop
classes are unavailable. Using the older Hadoop location info code.
java.lang.ClassNotFoundException:
org.apache.hadoop.mapred.InputSplitWithLocationInfo
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:264)
  at
org.apache.spark.rdd.HadoopRDD$SplitInfoReflections.<init>(HadoopRDD.scala:386)
  at org.apache.spark.rdd.HadoopRDD$.liftedTree1$1(HadoopRDD.scala:396)
  at org.apache.spark.rdd.HadoopRDD$.<init>(HadoopRDD.scala:395)
  at org.apache.spark.rdd.HadoopRDD$.<clinit>(HadoopRDD.scala)
  at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:165)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
  at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
  at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
  at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
  at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
  at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290)
  at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290)
  at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
  at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
  at
org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:289)
  at WordCount$.main(WordCount.scala:13)
  at WordCount.main(WordCount.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:497)
  at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


The application I launch through spark-submit can access data on HDFS though,
and I launch the script with HADOOP_HOME set.






Re: Issues when combining Spark and a third party java library

2015-07-10 Thread maxdml
Also, it's worth noting that I'm using the prebuilt version for hadoop 2.4
and higher from the official website.






Re: Spark standalone cluster - Output file stored in temporary directory in worker

2015-07-07 Thread maxdml
I think the properties that you have in your hdfs-site.xml should go in
core-site.xml (at least the namenode.name and datanode.data ones). I
might be wrong here, but that's what I have in my setup.

You should also add hadoop.tmp.dir to your core-site.xml. That might be the
source of your inconsistency.

As for hadoop-env.sh, I just use it to export variables such as
HADOOP_PREFIX, LOG_DIR, CONF_DIR and JAVA_HOME.







Master doesn't start, no logs

2015-07-06 Thread maxdml
Hi,

I've been compiling Spark 1.4.0 with SBT, from the source tarball available
on the official website. I cannot run Spark's master, even though I have built
and run several other instances of Spark on the same machine (Spark 1.3,
master branch, pre-built 1.4, ...).

starting org.apache.spark.deploy.master.Master, logging to
/mnt/spark-1.4.0/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-xx.out
failed to launch org.apache.spark.deploy.master.Master:
full log in
/mnt/spark-1.4.0/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-xx.out

But the log file is empty.

After digging down to ./bin/spark-class and finally trying to start the
master with:

./bin/spark-class org.apache.spark.deploy.master.Master --host 155.99.144.31

I still have the same result. Here is the strace output for this command:

http://pastebin.com/bkJVncBm

I'm using a 64-bit Xeon, CentOS 6.5, and Spark 1.4.0 compiled against Hadoop
2.5.2.

Any idea? :-)






Re: Spark standalone cluster - Output file stored in temporary directory in worker

2015-07-06 Thread maxdml
Can you share your Hadoop configuration files, please?

- etc/hadoop/core-site.xml
- etc/hadoop/hdfs-site.xml
- etc/hadoop/hadoop-env.sh

AFAIK, the following properties should be configured:

hadoop.tmp.dir, dfs.namenode.name.dir, dfs.datanode.data.dir and
dfs.namenode.checkpoint.dir

Otherwise, an HDFS slave will use its default temporary folder to save
blocks.






Directory creation failed leads to job fail (should it?)

2015-06-29 Thread maxdml
Hi there,

I have some traces from my master and some workers where, for some reason,
the ./work directory of an application cannot be created on the workers.
There is also an issue with the master's temp directory creation.

master logs: http://pastebin.com/v3NCzm0u
worker's logs: http://pastebin.com/Ninkscnx

It seems that some of the executors can create the directories, but as some
others are repeatedly failing, the job ends up failing. Shouldn't Spark
manage to keep working with a smaller number of executors instead of
failing?








Re: How to get the memory usage infomation of a spark application

2015-06-25 Thread maxdml
You can see the amount of memory consumed by each executor in the web UI (go
to the application page and click on the Executors tab).

Otherwise, for finer-grained monitoring, I can only think of correlating a
system monitoring tool like Ganglia with the event timeline of your job.






Vision old applications in webui with json logs

2015-06-25 Thread maxdml
Is it possible to recreate, from the log files, the same views the web UI gives
for completed applications after rebooting the master? I just tried changing a
URL of the form http://w.x.y.z:8080/history/app-2-0036 by giving the appID,
but it redirected me to the master's homepage.

As far as I know, the logs for an app can be found in the master's event logs,
and in each executor's ./work/ and event log directories.







Exception in thread main java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-06-24 Thread maxdml
Basically, here's a dump of the SO question I opened
(http://stackoverflow.com/questions/31033724/spark-1-4-0-java-lang-nosuchmethoderror-com-google-common-base-stopwatch-elapse)

I'm using Spark 1.4.0, and when running the Scala SparkPageRank example
(examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala), I
encounter the following error:

Exception in thread "main" java.lang.NoSuchMethodError:
com.google.common.base.Stopwatch.elapsedMillis()J
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.RDD$$anonfun$distinct$2.apply(RDD.scala:329)
at org.apache.spark.rdd.RDD$$anonfun$distinct$2.apply(RDD.scala:329)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.distinct(RDD.scala:328)
at
org.apache.spark.examples.SparkPageRank$.main(SparkPageRank.scala:60)
at org.apache.spark.examples.SparkPageRank.main(SparkPageRank.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:621)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


I'm not extremely familiar with Java, but it seems that it's a Guava
version issue.

The following information could be helpful:

$ find ./spark -name '*.jar' | grep guava
./lib_managed/bundles/guava-16.0.1.jar
./lib_managed/bundles/guava-14.0.1.jar

part of the examples/pom.xml file: 

...
<dependency>
  <groupId>org.apache.cassandra</groupId>
  <artifactId>cassandra-all</artifactId>
  <version>1.2.6</version>
  <exclusions>
    <exclusion>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
    </exclusion>
...

And indeed it seems that the class does not contain the problematic method:

$ javap -p
/mnt/spark/examples/target/streams/\$global/assemblyOption/\$global/streams/assembly/7850cb6d36b2a6589a4d27ce027a65a2da72c9df_5fa98cd1a63c99a44dd8d3b77e4762b066a5d0c5/com/google/common/base/Stopwatch.class

Compiled from "Stopwatch.java"
public final class com.google.common.base.Stopwatch {
  private final com.google.common.base.Ticker ticker;
  private boolean isRunning;
  private long elapsedNanos;
  private long startTick;
  public static com.google.common.base.Stopwatch createUnstarted();
  public static com.google.common.base.Stopwatch
createUnstarted(com.google.common.base.Ticker);
  public static com.google.common.base.Stopwatch createStarted();
  public static com.google.common.base.Stopwatch
createStarted(com.google.common.base.Ticker);
  public com.google.common.base.Stopwatch();
  public
com.google.common.base.Stopwatch(com.google.common.base.Ticker);
  public boolean isRunning();
  public com.google.common.base.Stopwatch start();
  public com.google.common.base.Stopwatch stop();
  public com.google.common.base.Stopwatch reset();
  private long elapsedNanos();
  public long elapsed(java.util.concurrent.TimeUnit);
  

Should I keep memory dedicated for HDFS and Spark on cluster nodes?

2015-06-23 Thread maxdml
I'm wondering if there is a real benefit to splitting my memory in two between
the datanode and the workers.

Datanodes and the OS need memory to do their work. I suppose there could be a
loss of performance if they had to compete for memory with the worker(s).

Any opinion? :-)






Re: Submitting Spark Applications using Spark Submit

2015-06-18 Thread maxdml
You can specify the jars of your application to be included with spark-submit
using the --jars switch.

Otherwise, are you sure that your newly compiled Spark assembly jar is in
assembly/target/scala-2.10/?






Re: Can we increase the space of spark standalone cluster

2015-06-17 Thread maxdml
For 1)

In standalone mode, you can increase the worker's resource allocation in
their local conf/spark-env.sh with the following variables:

SPARK_WORKER_CORES,
SPARK_WORKER_MEMORY

At application submit time, you can tune the resources allocated to executors
with --executor-cores and --executor-memory.
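
The same executor settings can also be set programmatically on the SparkConf
before the context is created. A minimal sketch (the app name and values are
made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("resource-tuning-example")   // hypothetical app name
      .set("spark.executor.cores", "2")        // cores per executor
      .set("spark.executor.memory", "4g")      // heap per executor
    val sc = new SparkContext(conf)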






Re: Can we increase the space of spark standalone cluster

2015-06-17 Thread maxdml
Also, still for 1), in conf/spark-defaults.conf you can set the following
properties to tune the driver's resources:

spark.driver.cores
spark.driver.memory

They can also be passed at submit time, e.g., via spark-submit's
--driver-memory and --driver-cores options, or with --conf.






Re: Determining number of executors within RDD

2015-06-10 Thread maxdml
Note that this property is only available for YARN






Re: Determining number of executors within RDD

2015-06-10 Thread maxdml
Actually this is somewhat confusing, for two reasons:

- First, the option 'spark.executor.instances', which seems to be handled only
in the YARN case in the source code of SparkSubmit.scala, is also present in
the conf/spark-env.sh file under the standalone section, which would indicate
that it is also available for this mode.

- Second, a post from Andrew Or states that this property defines the number
of workers in the cluster, not the number of executors on a given worker.
(http://apache-spark-user-list.1001560.n3.nabble.com/clarification-for-some-spark-on-yarn-configuration-options-td13692.html)

Could anyone clarify this? :-)

Thanks.






Re: Determining number of executors within RDD

2015-06-09 Thread maxdml
You should try issuing a get on the SparkConf object.

I don't have the exact name of the matching key, but from reading the code
in SparkSubmit.scala, it should be something like:

conf.get("spark.executor.instances")
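
A minimal sketch, assuming you already have a SparkContext called sc (the
two-argument get avoids an exception when the key is not set):

    val instances = sc.getConf.get("spark.executor.instances", "<not set>")
    println(s"spark.executor.instances = $instances")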






Re: How does lineage get passed down in RDDs

2015-06-08 Thread maxdml
If I read the code correctly, in RDD.scala each RDD keeps track of its own
dependencies (from Dependency.scala) and has methods to access its ancestors'
dependencies, and is thus able to recompute the lineage (see
getNarrowAncestors() or getDependencies() in an RDD like UnionRDD).

So it doesn't look like an RDD knows the whole lineage graph without having
to compute it, nor does an RDD give more than its own identity as a
parent to a child RDD.
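
A quick way to see this from the user side is to print an RDD's lineage and
its direct parents. A minimal sketch, assuming an existing SparkContext sc and
a hypothetical input path:

    val words  = sc.textFile("hdfs:///data/input.txt").flatMap(_.split(" "))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    println(counts.toDebugString)   // textual dump of the whole lineage
    println(counts.dependencies)    // only this RDD's direct (parent) dependencies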

As a new user I may be mistaken so any veteran confirmation would be
appreciated :)


