Re: How to use spark-submit

2014-05-12 Thread Sonal Goyal
Hi Stephen, I am using the maven shade plugin for creating my uber jar. I have marked the Spark dependencies as provided. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Mon, May 12, 2014 at 1:04 AM, Stephen Boesch java...@gmail.com wrote: HI

Re: File present but file not found exception

2014-05-12 Thread Sai Prasanna
I found that if a file is present on all the nodes at the given path in the local FS, then reading is possible. But is there a way to read if the file is present only on certain nodes? [There should be a way!] NEED: Wanted to do some filter ops on an HDFS file, create a local file of the result,

Re: java.lang.NoSuchMethodError on Java API

2014-05-12 Thread Alessandro De Carli
Sure, I uploaded the code on pastebin: http://pastebin.com/90Hynrjh On Mon, May 12, 2014 at 12:27 AM, Madhu ma...@madhu.com wrote: No, you don't need to do anything special to get it to run in Eclipse. Just add the assembly jar to the build path, create a main method, add your code, and click

Re: How to use spark-submit

2014-05-12 Thread Stephen Boesch
@Sonal - makes sense. Is the maven shade plugin runnable within sbt? If so, would you care to share those build.sbt (or .scala) lines? If not, are you aware of a similar plugin for sbt? 2014-05-11 23:53 GMT-07:00 Sonal Goyal sonalgoy...@gmail.com: Hi Stephen, I am using maven shade

Re: How to read a multipart s3 file?

2014-05-12 Thread Nicholas Chammas
On Wed, May 7, 2014 at 4:00 AM, Han JU ju.han.fe...@gmail.com wrote: But in my experience, when reading directly from s3n, Spark creates only 1 input partition per file, regardless of the file size. This may lead to some performance problems if you have big files. You can (and perhaps should)

Re: Spark to utilize HDFS's mmap caching

2014-05-12 Thread Matei Zaharia
Yes, Spark goes through the standard HDFS client and will automatically benefit from this. Matei On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi chan...@gmail.com wrote: Hi all, Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via sc.textFile() and other HDFS-related APIs?

Re: build shark(hadoop CDH5) on hadoop2.0.0 CDH4

2014-05-12 Thread Sean Owen
There was never a Hadoop 2.0.0. There was a Hadoop 2.0.0-alpha as far as Maven artifacts are concerned. The latest in that series is 2.0.6-alpha. On Mon, May 12, 2014 at 4:29 AM, Sophia sln-1...@163.com wrote: I have built shark in sbt way,but the sbt exception turn out: [error]

Client cannot authenticate via:[TOKEN]

2014-05-12 Thread innowireless TaeYun Kim
I'm trying to run spark-shell on Hadoop yarn. Specifically, the environment is as follows: - Client - OS: Windows 7 - Spark version: 1.0.0-SNAPSHOT (git cloned 2014.5.8) - Server - Platform: hortonworks sandbox 2.1 I modified the spark code to apply

Spark on Yarn - A small issue !

2014-05-12 Thread Sai Prasanna
Hi All, I wanted to launch Spark on YARN in interactive yarn-client mode. With the default settings of yarn-site.xml and spark-env.sh, I followed the given link http://spark.apache.org/docs/0.8.1/running-on-yarn.html I get the Pi value correct when I run without launching the shell. When I launch

Re: Is there any problem on the spark mailing list?

2014-05-12 Thread Sean Owen
Note the mails are coming out of order in some cases. I am getting current messages but a sprinkling of old replies too. On May 12, 2014 12:16 PM, ankurdave ankurd...@gmail.com wrote: I haven't been getting mail either. This was the last message I received:

spark-env.sh do not take effect.

2014-05-12 Thread lihu
Hi, I set up a small cluster with 3 machines; every machine has 64GB RAM and 11 cores, and I use Spark 0.9. I have set spark-env.sh as follows: SPARK_MASTER_IP=192.168.35.2 SPARK_MASTER_PORT=7077 SPARK_MASTER_WEBUI_PORT=12306 SPARK_WORKER_CORES=3

missing method in my slf4j after excluding Spark ZK log4j

2014-05-12 Thread Adrian Mocanu
Hey guys, I've asked before, in Spark 0.9 (I now use 0.9.1), about removing the log4j dependency and was told that it was gone. However, I still find it pulled in through the ZooKeeper imports. This is fine since I exclude it myself in the sbt file, but another issue arises. I wonder if anyone else has run into

Re: Variables outside of mapPartitions scope

2014-05-12 Thread pedro
Right now I am not using any class variables (references to this). All my variables are created within the scope of the method I am running. I did more debugging and found this strange behavior:

    variables here
    for loop
      mapPartitions call
        use variables here
      end mapPartitions
    endfor

Re: logging in pyspark

2014-05-12 Thread Nicholas Chammas
Ah, yes, that is correct. You need a serializable object one way or the other. An alternate suggestion would be to use a combination of RDD.sample() (http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#sample) and collect() to take a look at some small amount of data and just

Forcing spark to send exactly one element to each worker node

2014-05-12 Thread NevinLi158
Hi all, I'm currently trying to use pipe to run C++ code on each worker node, and I have an RDD of essentially command line arguments that I'm passing to each node. I want to send exactly one element to each node, but when I run my code, Spark ends up sending multiple elements to a node: is there

Re: Forcing spark to send exactly one element to each worker node

2014-05-12 Thread NevinLi158
Fixed the problem as soon as I sent this out, sigh. Apparently you can do this by changing the number of slices to cut the dataset into: I thought that was identical to the number of partitions, but apparently not.

Re: Turn BLAS on MacOSX

2014-05-12 Thread Xiangrui Meng
Those are warning messages instead of errors. You need to add netlib-java:all to use native BLAS/LAPACK. But it won't work if you include netlib-java:all in an assembly jar. It has to be a separate jar when you submit your job. For SGD, we only use level-1 BLAS, so I don't think native code is

Re: Is there a way to create a SparkContext object?

2014-05-12 Thread Matei Zaharia
You can just pass it around as a parameter. On May 12, 2014, at 12:37 PM, yh18190 yh18...@gmail.com wrote: Hi, Could anyone suggest how we can create a SparkContext object in other classes or functions where we need to convert a Scala collection to an RDD using the sc object, like

Is there a way to create a SparkContext object?

2014-05-12 Thread yh18190
Hi, Could anyone suggest how we can create a SparkContext object in other classes or functions where we need to convert a Scala collection to an RDD using the sc object, like sc.makeRDD(list), instead of using the main class's SparkContext object? Is there a way to pass the sc object as a parameter to

java.lang.StackOverflowError when calling count()

2014-05-12 Thread Guanhua Yan
Dear Sparkers: I am using the Python API of Spark 0.9.0 to implement some iterative algorithm. I got the errors shown at the end of this email; it seems to be due to a Java stack overflow. The same error has been reproduced on a Mac desktop and a Linux workstation, both running the

Re: Average of each RDD in Stream

2014-05-12 Thread Sean Owen
You mean you normally get an RDD, right? A DStream is a sequence of RDDs. It kind of depends on what you are trying to accomplish here: sum/count for each RDD in the stream? On Wed, May 7, 2014 at 6:43 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: Hi, I use the following code for calculating

Average of each RDD in Stream

2014-05-12 Thread Laeeq Ahmed
Hi, I use the following code for calculating the average. The problem is that the reduce operation returns a DStream here, and not a tuple as it normally does without streaming. So how can we get the sum and the count from the DStream? Can we cast it to a tuple? val numbers =

Re: Forcing spark to send exactly one element to each worker node

2014-05-12 Thread NevinLi158
A few more data points: my current theory is now that Spark's piping mechanism is considerably slower than just running the C++ app directly on the node. I ran the C++ application directly on a node in the cluster, and timed the execution of various parts of the program, and got ~10 seconds to

Re: java.lang.NoSuchMethodError on Java API

2014-05-12 Thread Madhu
I was able to compile your code in Eclipse. I ran it using the data in your comments, but I also see the NoSuchMethodError you mentioned. It seems to run fine until the call to calculateZVector(...). It appears that org.apache.commons.math3.util.Pair is not Serializable, so that's one potential

Re: missing method in my slf4j after excluding Spark ZK log4j

2014-05-12 Thread Sean Owen
It sounds like you are doing everything right. NoSuchMethodError suggests it's finding log4j, just not the right version. That method is definitely in 1.2; it might have been removed in 2.x? (http://logging.apache.org/log4j/2.x/manual/migration.html) So I wonder if something is sneaking in log4j

Unexpected results when caching data

2014-05-12 Thread paul
I have been experimenting with a data set with and without persisting the RDD and have come across some unexpected results. The files we are reading are Avro files, so we use the following to define the RDD; what we end up with is an RDD[CleansedLogFormat]: val f = new NewHadoopRDD(sc,

Re: Spark to utilize HDFS's mmap caching

2014-05-12 Thread Marcelo Vanzin
Is that true? I believe the API Chanwit is talking about requires explicitly asking for files to be cached in HDFS. Spark automatically benefits from the kernel's page cache (i.e. if some block is in the kernel's page cache, it will be read more quickly). But the explicit HDFS cache is a

Re: missing method in my slf4j after excluding Spark ZK log4j

2014-05-12 Thread Tathagata Das
This gives the dependency tree in sbt (which Spark uses): https://github.com/jrudolph/sbt-dependency-graph TD On Mon, May 12, 2014 at 4:55 PM, Sean Owen so...@cloudera.com wrote: It sounds like you are doing everything right. NoSuchMethodError suggests it's finding log4j, just not the right

Re: streaming on hdfs can detected all new file, but the sum of all the rdd.count() not equals which had detected

2014-05-12 Thread Tathagata Das
A very crucial thing to remember when using a file stream is that the files must be written to the monitored directory atomically. That is, when the file system shows the file in its listing, the file should not be appended to / updated after that. That often causes this kind of issue, as Spark

Re: java.lang.ClassNotFoundException

2014-05-12 Thread Archit Thakur
Hi Joe, Your messages are going into the spam folder for me. Thx, Archit_Thakur. On Fri, May 2, 2014 at 9:22 AM, Joe L selme...@yahoo.com wrote: Hi, You should include the jar file of your project. For example: conf.set(yourjarfilepath.jar) Joe On Friday, May 2, 2014 7:39 AM, proofmoore

Re: How to read a multipart s3 file?

2014-05-12 Thread Aaron Davidson
One way to ensure Spark writes more partitions is by using RDD#repartition() to make each partition smaller. One Spark partition always corresponds to one file in the underlying store, and it's usually a good idea to have each partition's size range somewhere between 64 MB and 256 MB. Too few

Re: Spark LIBLINEAR

2014-05-12 Thread DB Tsai
It seems that the code isn't managed on GitHub. It can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/spark/spark-liblinear-1.94.zip It would be easier to track the changes on GitHub. Sincerely, DB Tsai

Re: Bug when zip with longs and too many partitions?

2014-05-12 Thread Michael Malak
I've discovered that it was noticed a year ago that RDD zip() does not work when the number of partitions does not evenly divide the total number of elements in the RDD: https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ I will enter a JIRA ticket just as soon as the

Re: build shark(hadoop CDH5) on hadoop2.0.0 CDH4

2014-05-12 Thread Sophia
Hi, Why do I always encounter the remoting errors akka.remote.RemoteTransportException and java.util.concurrent.TimeoutException? Best Regards,

Re: pySpark memory usage

2014-05-12 Thread Jim Blomo
Thanks, Aaron, this looks like a good solution! Will be trying it out shortly. I noticed that the S3 exceptions seem to occur more frequently when the box is swapping. Why is the box swapping? combineByKey seems to make the assumption that it can fit an entire partition in memory when doing the

Re: Average of each RDD in Stream

2014-05-12 Thread Tathagata Das
Use DStream.foreachRDD to do an operation on the final RDD of every batch. val sumandcount = numbers.map(n => (n.toDouble, 1)).reduce { (a, b) => (a._1 + b._1, a._2 + b._2) } sumandcount.foreachRDD { rdd => val first: (Double, Int) = rdd.first() ; ... } DStream.reduce creates a DStream whose RDDs

Re: Proper way to stop Spark stream processing

2014-05-12 Thread Tathagata Das
Since you are using the latest Spark code and not Spark 0.9.1 (guessed from the log messages), you can actually do a graceful shutdown of a streaming context. This ensures that the receivers are properly stopped and all received data is processed, and then the system terminates (stop() stays blocked

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-12 Thread Tim St Clair
Jacob Gerard - You might find the link below useful: http://rrati.github.io/blog/2014/05/07/apache-hadoop-plus-docker-plus-fedora-running-images/ For non-reverse-dns apps, NAT is your friend. Cheers, Tim - Original Message - From: Jacob Eisinger jeis...@us.ibm.com To:

Re: pySpark memory usage

2014-05-12 Thread Matei Zaharia
Hey Jim, unfortunately external spilling is not implemented in Python right now. While it would be possible to update combineByKey to do smarter stuff here, one simple workaround you can try is to launch more map tasks (or more reduce tasks). To set the minimum number of map tasks, you can pass

Re: missing method in my slf4j after excluding Spark ZK log4j

2014-05-12 Thread Paul Brown
Hi, Adrian -- If my memory serves, you need 1.7.7 of the various slf4j modules to avoid that issue. Best. -- Paul — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Mon, May 12, 2014 at 7:51 AM, Adrian Mocanu amoc...@verticalscope.com wrote: Hey guys, I've asked