Re: Deadlock running multiple Spark jobs on Mesos

2014-05-12 Thread Andrew Ash
Are you setting a core limit with spark.cores.max? If you don't, in coarse mode each Spark job uses all available cores on Mesos and doesn't let them go until the job is terminated, at which point the other job can access the cores. https://spark.apache.org/docs/latest/running-on-mesos.html
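For reference, a minimal sketch of how that cap might be set (Spark 1.0-style SparkConf; the Mesos master URL, app name, and core count below are placeholders, not anything from this thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("mesos://zk://zk1:2181/mesos")   // placeholder Mesos master URL
      .setAppName("job-a")
      .set("spark.mesos.coarse", "true")          // coarse-grained mode
      .set("spark.cores.max", "8")                // without this, the job holds every offered core
    val sc = new SparkContext(conf)
    try {
      // ... job body ...
    } finally {
      sc.stop()   // releases the cores so the other job can get offers
    }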

Re: Accuracy in mllib BinaryClassificationMetrics

2014-05-12 Thread Xiangrui Meng
Hi Deb, feel free to add accuracy along with precision and recall. -Xiangrui On Mon, May 12, 2014 at 1:26 PM, Debasish Das wrote: > Hi, > > I see precision and recall but no accuracy in mllib.evaluation.binary. > > Is it already under development or it needs to be added ? > > Thanks. > Deb >
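Until such a metric lands, accuracy can be computed by hand from an RDD of (prediction, label) pairs; a minimal sketch with toy data on a local master (all names here are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("accuracy-sketch"))
    // toy (predicted label, true label) pairs standing in for real model output
    val predictionAndLabel = sc.parallelize(Seq((1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0)))
    val accuracy = predictionAndLabel.filter { case (p, l) => p == l }.count.toDouble /
                   predictionAndLabel.count
    println("accuracy = " + accuracy)   // 0.5 for the toy data above
    sc.stop()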

Spark's Behavior 2

2014-05-12 Thread Eduardo Costa Alfaia
Hi TD, I have sent more information now using 8 workers. The gap is now 27 seconds. Have you seen it? Thanks BR -- Privacy notice: http://www.unibs.it/node/8155

Re: Forcing spark to send exactly one element to each worker node

2014-05-12 Thread Tathagata Das
From the logs, it seems that your tasks are being started in parallel. If they were being executed serially, you would have seen the following in the logs:

    Starting task 1
    Finished task 1
    Starting task 2
    Finished task 2
    Starting task 3
    Finished task 3
    ...

Instead you are seeing:

    Starting task 1

Re: streaming on HDFS can detect all new files, but the sum of all the rdd.count() does not equal what was detected

2014-05-12 Thread zzzzzqf12345
I solved the problem and found the reason: I was using the master node to upload files to HDFS, and this can take up a lot of the master's network resources. When I changed to uploading these files from another computer that is not part of the cluster, I got the correct result. QingFeng Tathagata Das w

Re: missing method in my slf4j after excluding Spark ZK log4j

2014-05-12 Thread Paul Brown
Hi, Adrian -- If my memory serves, you need 1.7.7 of the various slf4j modules to avoid that issue. Best. -- Paul — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Mon, May 12, 2014 at 7:51 AM, Adrian Mocanu wrote: > Hey guys, > > I've asked before, in Spark 0.9 - I now

Re: pySpark memory usage

2014-05-12 Thread Matei Zaharia
Hey Jim, unfortunately external spilling is not implemented in Python right now. While it would be possible to update combineByKey to do smarter stuff here, one simple workaround you can try is to launch more map tasks (or more reduce tasks). To set the minimum number of map tasks, you can pass

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-12 Thread Tim St Clair
Jacob & Gerard - You might find the link below useful: http://rrati.github.io/blog/2014/05/07/apache-hadoop-plus-docker-plus-fedora-running-images/ For non-reverse-dns apps, NAT is your friend. Cheers, Tim - Original Message - > From: "Jacob Eisinger" > To: user@spark.apache.o

Re: Proper way to stop Spark stream processing

2014-05-12 Thread Tathagata Das
Since you are using the latest Spark code and not Spark 0.9.1 (guessed from the log messages), you can actually do a graceful shutdown of a streaming context. This ensures that the receivers are properly stopped and all received data is processed, and then the system terminates (stop() stays blocked u
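The message is truncated; below is a minimal sketch of the pattern being described, assuming a build recent enough to have the two-argument stop (the socket source, port, and timeout are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("graceful-stop-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
    lines.count().print()                                  // some output operation
    ssc.start()
    ssc.awaitTermination(30000)                            // let it run for ~30 seconds (millis)
    // stop the receivers, finish processing data already received, then stop the SparkContext
    ssc.stop(true, true)                                   // (stopSparkContext, stopGracefully)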

Re: Average of each RDD in Stream

2014-05-12 Thread Tathagata Das
Use DStream.foreachRDD to do an operation on the final RDD of every batch.

    val sumandcount = numbers.map(n => (n.toDouble, 1)).reduce { (a, b) => (a._1 + b._1, a._2 + b._2) }
    sumandcount.foreachRDD { rdd =>
      val first: (Double, Int) = rdd.first()   // take(1) returns an Array, so use first() or take(1)(0)
      ...
    }

DStream.reduce creates a DStream whose RDDs h
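A slightly fuller sketch of the same pattern, printing each batch's average (the monitored directory and batch interval are placeholders; the empty-batch check is the only addition):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("batch-average-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    val numbers = ssc.textFileStream("hdfs:///tmp/numbers")   // placeholder monitored directory
    val sumAndCount = numbers.map(n => (n.toDouble, 1L))
                             .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
    sumAndCount.foreachRDD { rdd =>
      // reduce() leaves a single element per batch; the RDD is empty if the batch had no data
      rdd.collect().foreach { case (sum, count) => println("batch average = " + sum / count) }
    }
    ssc.start()
    ssc.awaitTermination()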

Re: pySpark memory usage

2014-05-12 Thread Jim Blomo
Thanks, Aaron, this looks like a good solution! Will be trying it out shortly. I noticed that the S3 exception seems to occur more frequently when the box is swapping. Why is the box swapping? combineByKey seems to make the assumption that it can fit an entire partition in memory when doing the

Re: build shark(hadoop CDH5) on hadoop2.0.0 CDH4

2014-05-12 Thread Sophia
Hi, why do I always run into remoting errors: akka.remote.RemoteTransportException and java.util.concurrent.TimeoutException? Best Regards, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/build-shark-hadoop-CDH5-on-hadoop2-0-0-CDH4-tp5574p5629.html Sent from the

Re: Bug when zip with longs and too many partitions?

2014-05-12 Thread Michael Malak
I've discovered that it was noticed a year ago that RDD zip() does not work when the number of partitions does not evenly divide the total number of elements in the RDD: https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ I will enter a JIRA ticket just as soon as the A
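Until that is fixed, one hedged workaround is to pair elements by their global position instead of relying on partition layout, assuming a Spark build recent enough to have RDD.zipWithIndex:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD functions (join, sortByKey)

    val sc = new SparkContext("local", "zip-workaround-sketch")
    val left  = sc.parallelize(1L to 2L, 4).zipWithIndex().map(_.swap)   // (position, value)
    val right = sc.parallelize(11 to 12, 4).zipWithIndex().map(_.swap)
    val zipped = left.join(right).sortByKey().values
    println(zipped.collect().toSeq)   // expected: (1,11), (2,12)
    sc.stop()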

Re: Spark LIBLINEAR

2014-05-12 Thread DB Tsai
It seems that the code isn't managed on GitHub; it can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/spark/spark-liblinear-1.94.zip It would be easier to track the changes on GitHub. Sincerely, DB Tsai ---

Re: How to read a multipart s3 file?

2014-05-12 Thread Aaron Davidson
One way to ensure Spark writes more partitions is by using RDD#repartition() to make each partition smaller. One Spark partition always corresponds to one file in the underlying store, and it's usually a good idea to have each partition size range somewhere between 64 MB to 256 MB. Too few partitio
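A minimal sketch of that idea (the S3 paths and the partition count are placeholders; pick a count that gives roughly 64-256 MB per partition):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("repartition-sketch"))
    val data = sc.textFile("s3n://my-bucket/input/")    // placeholder bucket and path
    data.repartition(100)                               // one output file per partition
        .saveAsTextFile("s3n://my-bucket/output/")
    sc.stop()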

Re: java.lang.ClassNotFoundException

2014-05-12 Thread Archit Thakur
Hi Joe, Your messages are going into spam folder for me. Thx, Archit_Thakur. On Fri, May 2, 2014 at 9:22 AM, Joe L wrote: > Hi, You should include the jar file of your project. for example: > conf.set("yourjarfilepath.jar") > > Joe > On Friday, May 2, 2014 7:39 AM, proofmoore [via Apache Sp

Re: streaming on HDFS can detect all new files, but the sum of all the rdd.count() does not equal what was detected

2014-05-12 Thread Tathagata Das
A very crucial thing to remember when using a file stream is that the files must be written to the monitored directory "atomically". That is, once the file system shows the file in its listing, the file should not be appended to or updated after that. That often causes this kind of issue, as spark streami
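One common way to get that atomicity is to write the file somewhere else on the same filesystem and rename it into the monitored directory once it is complete; a sketch using the Hadoop FileSystem API (both paths are placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    val staged    = new Path("/tmp/staging/part-0001.txt")     // fully written and closed first
    val monitored = new Path("/data/incoming/part-0001.txt")   // directory watched by the file stream
    // a rename within one HDFS filesystem is atomic, so the stream never sees a half-written file
    fs.rename(staged, monitored)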

Re: missing method in my slf4j after excluding Spark ZK log4j

2014-05-12 Thread Tathagata Das
This gives the dependency tree in SBT (Spark uses this). https://github.com/jrudolph/sbt-dependency-graph TD On Mon, May 12, 2014 at 4:55 PM, Sean Owen wrote: > It sounds like you are doing everything right. > > NoSuchMethodError suggests it's finding log4j, just not the right > version. That meth

Re: Spark to utilize HDFS's mmap caching

2014-05-12 Thread Marcelo Vanzin
Is that true? I believe that API Chanwit is talking about requires explicitly asking for files to be cached in HDFS. Spark automatically benefits from the kernel's page cache (i.e. if some block is in the kernel's page cache, it will be read more quickly). But the explicit HDFS cache is a differen

Re: ERROR: Unknown Spark version

2014-05-12 Thread wxhsdp
Thank you Madhu, it's a great help for me! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-Unknown-Spark-version-tp5500p5519.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: No space left on device error when pulling data from s3

2014-05-12 Thread Han JU
Setting `hadoop.tmp.dir` in `spark-env.sh` solved the problem. The Spark job no longer writes tmp files in /tmp/hadoop-root/.

    SPARK_JAVA_OPTS+=" -Dspark.local.dir=/mnt/spark,/mnt2/spark -Dhadoop.tmp.dir=/mnt/ephemeral-hdfs"
    export SPARK_JAVA_OPTS

I'm wondering if we need to permanently add this in th

Unexpected results when caching data

2014-05-12 Thread paul
I have been experimenting with a data set with and without persisting the RDD and have come across some unexpected results. The files we are reading are Avro files, so we are using the following to define the RDD; what we end up with is an RDD[CleansedLogFormat]: val f = new NewHadoopRDD(sc,

Re: missing method in my slf4j after excluding Spark ZK log4j

2014-05-12 Thread Sean Owen
It sounds like you are doing everything right. NoSuchMethodError suggests it's finding log4j, just not the right version. That method is definitely in 1.2; it might have been removed in 2.x? (http://logging.apache.org/log4j/2.x/manual/migration.html) So I wonder if something is sneaking in log4j

Re: java.lang.NoSuchMethodError on Java API

2014-05-12 Thread Madhu
I was able to compile your code in Eclipse. I ran it using the data in your comments, but I also see the NoSuchMethodError you mentioned. It seems to run fine until the call to calculateZVector(...) It appears that org.apache.commons.math3.util.Pair is not Serializable, so that's one potential pro

Re: Doubts regarding Shark

2014-05-12 Thread Nicholas Chammas
To answer your first question, caching in Spark is lazy, meaning that Spark will not actually try to cache the RDD you've targeted until you take some sort of action on that RDD (like a count). That might be why you don't see any error at first. On Thu, May 8, 2014 at 2:46 AM, vinay Bajaj wrote
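In other words, something like the sketch below: a bad input path, for example, only fails at the count, not at the cache call (the path is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("lazy-cache-sketch"))
    val rdd = sc.textFile("hdfs:///some/path")   // placeholder path
    rdd.cache()            // only marks the RDD for caching; nothing is read or stored yet
    val n1 = rdd.count()   // first action: the data is actually read, computed and cached here
    val n2 = rdd.count()   // answered from the cache, assuming the partitions fit in memory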

Re: Forcing spark to send exactly one element to each worker node

2014-05-12 Thread NevinLi158
A few more data points: my current theory is now that spark's piping mechanism is considerably slower than just running the C++ app directly on the node. I ran the C++ application directly on a node in the cluster, and timed the execution of various parts of the program, and got ~10 seconds to run

Average of each RDD in Stream

2014-05-12 Thread Laeeq Ahmed
Hi, I use the following code for calculating the average. The problem is that the reduce operation returns a DStream here, and not a tuple as it normally does without Streaming. So how can we get the sum and the count from the DStream? Can we cast it to a tuple? val numbers = ssc.textFileStream(args(

Re: Average of each RDD in Stream

2014-05-12 Thread Sean Owen
You mean you normally get an RDD, right? A DStream is a sequence of RDDs. It kind of depends on what you are trying to accomplish here? sum/count for each RDD in the stream? On Wed, May 7, 2014 at 6:43 PM, Laeeq Ahmed wrote: > Hi, > > I use the following code for calculating average. The problem

Re: Preferred RDD Size

2014-05-12 Thread Andrew Ash
At the minimum to get decent parallelization you'd want to have some data on every machine. If you're reading from HDFS, then the smallest you'd want is one HDFS block per server in your cluster. Note that Spark will work at smaller sizes, but in order to make use of all your machines when your p

java.lang.StackOverflowError when calling count()

2014-05-12 Thread Guanhua Yan
Dear Sparkers: I am using the Python API of Spark 0.9.0 to implement some iterative algorithm. I got some errors, shown at the end of this email. It seems to be due to a Java StackOverflowError. The same error has been duplicated on a Mac desktop and a Linux workstation, both running the sa

Re: Forcing spark to send exactly one element to each worker node

2014-05-12 Thread NevinLi158
I can't seem to get Spark to run the tasks in parallel. My Spark code is the following:

    // Create commands to be piped into a C++ program
    List commandList = makeCommandList(Integer.parseInt(step.first()), 100);
    JavaRDD commandListRDD = ctx.parallelize(commandList, commandList.size());
    // Run the C+

Re: cant get tests to pass anymore on master master

2014-05-12 Thread Tathagata Das
Can you also send us the error you are seeing in the streaming suites? TD On Sun, May 11, 2014 at 11:50 AM, Koert Kuipers wrote: > resending because the list didnt seem to like my email before > > > On Wed, May 7, 2014 at 5:01 PM, Koert Kuipers wrote: > >> i used to be able to get all tests t

Re: Spark to utilize HDFS's mmap caching

2014-05-12 Thread Matei Zaharia
That API is something the HDFS administrator uses outside of any application to tell HDFS to cache certain files or directories. But once you’ve done that, any existing HDFS client accesses them directly from the cache. Matei On May 12, 2014, at 11:10 AM, Marcelo Vanzin wrote: > Is that true?

Accuracy in mllib BinaryClassificationMetrics

2014-05-12 Thread Debasish Das
Hi, I see precision and recall but no accuracy in mllib.evaluation.binary. Is it already under development, or does it need to be added? Thanks. Deb

Re: spark+mesos: configure mesos 'callback' port?

2014-05-12 Thread Tim St Clair
Does `mesos-slave --master=zk://host1:port1,host2:port2 --port=54321` not work? Cheers, Tim - Original Message - > From: "Scott Clasen" > To: u...@spark.incubator.apache.org > Sent: Tuesday, May 6, 2014 11:39:34 PM > Subject: spark+mesos: configure mesos 'callback' port? > > Is any

Is there a way to create a SparkContext object?

2014-05-12 Thread yh18190
Hi, Could anyone suggest how we can create a SparkContext object in other classes or functions where we need to convert a Scala collection to an RDD using the sc object, like sc.makeRDD(list), instead of using the Main class's SparkContext object? Is there a way to pass the sc object as a parameter to functio

Re: Is there a way to create a SparkContext object?

2014-05-12 Thread Matei Zaharia
You can just pass it around as a parameter. On May 12, 2014, at 12:37 PM, yh18190 wrote: > Hi, > > Could anyone suggest an idea how can we create sparkContext object in other > classes or fucntions where we need to convert a scala collection to RDD > using sc object.like sc.makeRDD(list).instea
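For example, a trivial sketch (the helper name is made up; any class or function that needs the context can simply accept it as an argument):

    import scala.reflect.ClassTag
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    // hypothetical helper that needs the context: take it as a parameter
    def toRDD[T: ClassTag](sc: SparkContext, xs: Seq[T]): RDD[T] = sc.makeRDD(xs)

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("pass-sc-sketch"))
    val rdd = toRDD(sc, List(1, 2, 3))
    println(rdd.count())
    sc.stop()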

Re: why is Spark 0.9.1 (context creation?) so slow on my OSX laptop?

2014-05-12 Thread Madhu
There is an HTTP server started on port 4040, but that seems to be OK, from what I can see in your logs. Does the log tell you anything about what it's doing just before the long delay? Have you tried reducing the log level to see more detail? - Madhu https://www.linkedin.com/in/msiddalinga

Deadlock running multiple Spark jobs on Mesos

2014-05-12 Thread Martin Weindel
I'm using a current Spark 1.0.0-SNAPSHOT for Hadoop 2.2.0 on Mesos 0.17.0. If I run a single Spark job, it runs fine on Mesos. Running multiple Spark jobs also works if I'm using the coarse-grained mode ("spark.mesos.coarse" = true). But if I run two Spark jobs in parallel using the fin

Re: Packaging a spark job using maven

2014-05-12 Thread Paul Brown
Hi, Laurent -- That's the way we package our Spark jobs (i.e., with Maven). You'll need something like this: https://gist.github.com/prb/d776a47bd164f704eecb That packages separate driver (which you can run with java -jar ...) and worker JAR files. Cheers. -- Paul — p...@mult.ifario.us | Mult

Re: Turn BLAS on MacOSX

2014-05-12 Thread Xiangrui Meng
Those are warning messages instead of errors. You need to add netlib-java:all to use native BLAS/LAPACK. But it won't work if you include netlib-java:all in an assembly jar. It has to be a separate jar when you submit your job. For SGD, we only use level-1 BLAS, so I don't think native code is call
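For an sbt build, the extra dependency looks roughly like this (a sketch assuming netlib-java 1.1.x; as noted above, ship it as a separate jar rather than folding it into the assembly):

    // build.sbt (sketch)
    libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()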

Re: Forcing spark to send exactly one element to each worker node

2014-05-12 Thread Matei Zaharia
How many elements do you have in total? If they are fairly few (say less than a few thousand), do a collect() to bring them to the master, then do sc.parallelize(elements, numElements) to get an RDD with exactly one element per partition. Matei On May 12, 2014, at 10:29 AM, NevinLi158 wrote:
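A sketch of that pattern in Scala, combined with pipe (the command strings and the "./my_cpp_app" binary are placeholders; the thread itself uses the Java API, but the idea is the same):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("one-element-per-partition-sketch"))
    val commands = Seq("--input a.dat", "--input b.dat", "--input c.dat")   // placeholder arguments
    // numSlices == number of elements, so each partition holds exactly one element
    val oneEach = sc.parallelize(commands, commands.size)
    // pipe() starts one external process per partition and feeds it the partition's
    // elements on stdin, one per line
    val output = oneEach.pipe("./my_cpp_app")   // hypothetical binary
    println(output.count())
    sc.stop()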

Distribute jar dependencies via sc.AddJar(fileName)

2014-05-12 Thread DB Tsai
We're deploying Spark in yarn-cluster mode (Spark 0.9), and we add jar dependencies on the command line with the "--addJars" option. However, those external jars are only available in the driver (the application running in Hadoop), and not available in the executors (workers). After doing some research, we re

Re: Forcing spark to send exactly one element to each worker node

2014-05-12 Thread NevinLi158
Fixed the problem as soon as I sent this out, sigh. Apparently you can do this by changing the number of slices to cut the dataset into: I thought that was identical to the number of partitions, but apparently not. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble

Forcing spark to send exactly one element to each worker node

2014-05-12 Thread NevinLi158
Hi all, I'm currently trying to use pipe to run C++ code on each worker node, and I have an RDD of essentially command line arguments that I'm passing to each node. I want to send exactly one element to each node, but when I run my code, Spark ends up sending multiple elements to a node: is there

Re: logging in pyspark

2014-05-12 Thread Nicholas Chammas
Ah, yes, that is correct. You need a serializable object one way or the other. An alternative suggestion would be to use a combination of RDD.sample() and collect() to take a look at some small amount of data and just
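For example, a minimal sketch (the fraction and seed are arbitrary):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("peek-sketch"))
    val rdd = sc.parallelize(1 to 100000)
    // pull roughly 0.1% of the data back to the driver just to eyeball it
    val peek = rdd.sample(false, 0.001, 42).collect()
    peek.foreach(println)
    sc.stop()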

Job failed: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-05-12 Thread yh18190
Hi, I am facing the above exception when I try to apply a method (ComputeDwt) to an RDD[(Int, ArrayBuffer[(Int, Double)])] input. I am even using the extends Serializable option to serialize objects in Spark. Here is the code snippet. Could anyone suggest what the problem could be and what should be d
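A frequent cause of this exception is a closure that drags the SparkContext along (often via an enclosing class). A hedged sketch of the usual shape of the fix for a compiled job, with every name below made up for illustration:

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // mapValues on pair RDDs
    import org.apache.spark.rdd.RDD

    // a pure function object with no SparkContext field, so closures using it stay serializable
    object Dwt {
      def computeDwt(series: ArrayBuffer[(Int, Double)]): Array[Double] =
        series.map(_._2).toArray   // placeholder for the real transform
    }

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("no-sc-in-closure"))
    val input: RDD[(Int, ArrayBuffer[(Int, Double)])] =
      sc.parallelize(Seq((1, ArrayBuffer((0, 1.0), (1, 2.0)))))
    val result = input.mapValues(s => Dwt.computeDwt(s))   // only Dwt is referenced; sc is not captured
    println(result.count())
    sc.stop()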

Packaging a spark job using maven

2014-05-12 Thread Laurent Thoulon
Hi, I'm quite new to Spark (and Scala), but has anyone ever successfully compiled and run a Spark job using Java and Maven? Packaging seems to go fine, but when I try to execute the job using

    mvn package
    java -Xmx4g -cp target/jobs-1.4.0.0-jar-with-dependencies.jar my.jobs.spark.TestJob

I

Re: Variables outside of mapPartitions scope

2014-05-12 Thread pedro
Right now I am not using any class variables (references to this). All my variables are created within the scope of the method I am running. I did more debugging and found this strange behavior:

    variables here
    for loop
      mapPartitions call
        use variables here
      end mapPartitions
    endfor

missing method in my slf4j after excluding Spark ZK log4j

2014-05-12 Thread Adrian Mocanu
Hey guys, I've asked before (in Spark 0.9; I now use 0.9.1) about removing the log4j dependency and was told that it was gone. However, I still find it as part of the zookeeper imports. This is fine since I exclude it myself in the sbt file, but another issue arises. I wonder if anyone else has run into th

Re: Spark LIBLINEAR

2014-05-12 Thread Xiangrui Meng
Hi Chieh-Yen, Great to see the Spark implementation of LIBLINEAR! We will definitely consider adding a wrapper in MLlib to support it. Is the source code on github? Deb, Spark LIBLINEAR uses BSD license, which is compatible with Apache. Best, Xiangrui On Sun, May 11, 2014 at 10:29 AM, Debasish

Bug when zip with longs and too many partitions?

2014-05-12 Thread Michael Malak
Is this a bug?

    scala> sc.parallelize(1 to 2,4).zip(sc.parallelize(11 to 12,4)).collect
    res0: Array[(Int, Int)] = Array((1,11), (2,12))

    scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
    res1: Array[(Long, Int)] = Array((2,11))

spark-env.sh do not take effect.

2014-05-12 Thread lihu
Hi, I set up a small cluster with 3 machines, each with 64GB RAM and 11 cores, and I use Spark 0.9. I have set spark-env.sh as follows:

    SPARK_MASTER_IP=192.168.35.2
    SPARK_MASTER_PORT=7077
    SPARK_MASTER_WEBUI_PORT=12306
    SPARK_WORKER_CORES=3
    SPARK_WORKER_MEMORY=2

Re: Is there any problem on the spark mailing list?

2014-05-12 Thread Sean Owen
Note the mails are coming out of order in some cases. I am getting current messages but a sprinkling of old replies too. On May 12, 2014 12:16 PM, "ankurdave" wrote: > I haven't been getting mail either. This was the last message I received: > > http://apache-spark-user-list.1001560.n3.nabble.com

Spark on Yarn - A small issue !

2014-05-12 Thread Sai Prasanna
Hi All, I wanted to launch Spark on YARN interactively, in yarn-client mode. With the default settings of yarn-site.xml and spark-env.sh, I followed the given link http://spark.apache.org/docs/0.8.1/running-on-yarn.html I get the correct pi value when I run without launching the shell. When I launch t

Proper way to stop Spark stream processing

2014-05-12 Thread Tobias Pfeiffer
Hello, I am trying to implement something like "process a stream for N seconds, then return a result" with Spark Streaming (built from git head). My approach (which is probably not very elegant) is:

    val ssc = new StreamingContext(...)
    ssc.start()
    future {
      Thread.sleep(Seconds(N))

How to run shark?

2014-05-12 Thread Sophia
When I run the Shark command line, the output looks like this, and I never get a "shark>" prompt. What can I do? The log:

    Starting the Shark Command Line Client
    14/05/12 16:32:49 WARN conf.Configuration: mapred.max.split.size is deprecated. Instead, use mapreduc

Client cannot authenticate via:[TOKEN]

2014-05-12 Thread innowireless TaeYun Kim
I'm trying to run spark-shell on Hadoop YARN. Specifically, the environment is as follows:
- Client
  - OS: Windows 7
  - Spark version: 1.0.0-SNAPSHOT (git cloned 2014.5.8)
- Server
  - Platform: Hortonworks Sandbox 2.1
I modified the Spark code to apply https://issues.apache.org/jira/browse/YAR

Re: Spark to utilize HDFS's mmap caching

2014-05-12 Thread Matei Zaharia
Yes, Spark goes through the standard HDFS client and will automatically benefit from this. Matei On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi wrote: > Hi all, > > Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via > sc.textFile() and other HDFS-related APIs? > > http://hadoop.apac

Re: build shark(hadoop CDH5) on hadoop2.0.0 CDH4

2014-05-12 Thread Sean Owen
There was never a Hadoop "2.0.0". There was a Hadoop "2.0.0-alpha" as far as Maven artifacts are concerned. The latest in that series is 2.0.6-alpha. On Mon, May 12, 2014 at 4:29 AM, Sophia wrote: > I have built shark in sbt way,but the sbt exception turn out: > [error] sbt.resolveException:unres

Re: How to use spark-submit

2014-05-12 Thread Sonal Goyal
I am creating a jar with only my dependencies and running spark-submit through my project's mvn build. I have configured the mvn exec goal to point to the location of the script. Here is how I have set it up for my app. The mainClass is my driver program, and I am able to send my custom args too. Hope this helps.

Re: How to read a multipart s3 file?

2014-05-12 Thread Nicholas Chammas
On Wed, May 7, 2014 at 4:00 AM, Han JU wrote: But in my experience, when reading directly from s3n, spark create only 1 > input partition per file, regardless of the file size. This may lead to > some performance problem if you have big files. You can (and perhaps should) always repartition() t

Re: How to use spark-submit

2014-05-12 Thread Stephen Boesch
@Sonal - makes sense. Is the maven shade plugin runnable within sbt? If so, would you care to share those build.sbt (or .scala) lines? If not, are you aware of a similar plugin for sbt? 2014-05-11 23:53 GMT-07:00 Sonal Goyal : > Hi Stephen, > > I am using maven shade plugin for creating my u
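The shade plugin itself is Maven-only, but sbt-assembly plays the same role for sbt; a rough sketch (plugin and Spark versions here are from memory and may need adjusting):

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    // build.sbt
    import sbtassembly.Plugin._
    import AssemblyKeys._

    assemblySettings

    // keep Spark itself out of the fat jar, then build with `sbt assembly`
    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1" % "provided"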

Re: java.lang.NoSuchMethodError on Java API

2014-05-12 Thread Alessandro De Carli
Sure, I uploaded the code on pastebin: http://pastebin.com/90Hynrjh On Mon, May 12, 2014 at 12:27 AM, Madhu wrote: > No, you don't need to do anything special to get it to run in Eclipse. > Just add the assembly jar to the build path, create a main method, add your > code, and click the green "ru

Re: File present but file not found exception

2014-05-12 Thread Sai Prasanna
I found that if a file is present on all the nodes at the given path in the local FS, then reading is possible. But is there a way to read it if the file is present only on certain nodes?? [There should be a way!!] NEED: I wanted to do some filter ops on an HDFS file, create a local file of the result, cre