Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-15 Thread wxhsdp
Hi DB, I tried including the breeze library using Spark 1.0 and it works. But how can I call the native library in standalone cluster mode? In local mode: 1. I include the org.scalanlp % breeze-natives_2.10 % 0.7 dependency in the sbt build file. 2. I install OpenBLAS. It works in standalone

A new resource for getting examples of Spark RDD API calls

2014-05-15 Thread zhen
Hi Everyone, I found it quite difficult to find good examples of Spark RDD API calls. So my student and I decided to go through the entire API and write examples for the vast majority of calls (basically, examples for anything that is remotely interesting). I think these examples may be useful

same log4j slf4j error in spark 0.9.1

2014-05-15 Thread Adrian Mocanu
I recall someone from the Spark team (TD?) saying that Spark 0.9.1 would change the logger so that the circular-loop error between slf4j and log4j wouldn't show up. Yet on Spark 0.9.1 I still get: SLF4J: Detected both log4j-over-slf4j.jar AND slf4j-log4j12.jar on the class path, preempting

Preferred RDD Size

2014-05-15 Thread Sai Prasanna
Hi, Is there any lower bound on the size of an RDD for optimally utilizing Spark's in-memory framework? Say creating an RDD for a very small data set of some 64 MB is not as efficient as one of some 256 MB; then the application could be tuned accordingly. So is there a soft lower bound related to

Re: No space left on device error when pulling data from s3

2014-05-15 Thread darkjh
Setting `hadoop.tmp.dir` in `spark-env.sh` solved the problem; the Spark job no longer writes tmp files to /tmp/hadoop-root/.

  SPARK_JAVA_OPTS+=" -Dspark.local.dir=/mnt/spark,/mnt2/spark -Dhadoop.tmp.dir=/mnt/ephemeral-hdfs"
  export SPARK_JAVA_OPTS

I'm wondering if we need to permanently add this in the

Re: How to use Mahout VectorWritable in Spark.

2014-05-15 Thread Dmitriy Lyubimov
PS: A spark shell with all the proper imports is also supported natively in Mahout (the mahout spark-shell command). See MAHOUT-1489 for specifics. There's also a tutorial somewhere, but I suspect it has not been finished/published via a public link yet. Again, you need trunk to use the spark shell there. On Wed,

Re: How to use Mahout VectorWritable in Spark.

2014-05-15 Thread Dmitriy Lyubimov
PPS: The shell/Spark tutorial I mentioned is actually being developed in MAHOUT-1542. As it stands, I believe it is now complete at its core. On Wed, May 14, 2014 at 5:48 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: PS spark shell with all proper imports are also supported natively in

problem about broadcast variable in iteration

2014-05-15 Thread randylu
My code looks like the following:

  var rdd1 = ...
  var rdd2 = ...
  var kv = ...
  for (i <- 0 until n) {
    var kvGlobal = sc.broadcast(kv) // broadcast kv
    rdd1 = rdd2.map {
      case t => doSomething(t, kvGlobal.value)
    }
    var tmp =

Re: Equally weighted partitions in Spark

2014-05-15 Thread Syed A. Hashmi
I took a stab at it and wrote a partitioner (https://github.com/syedhashmi/spark/commit/4ca94cc155aea4be36505d5f37d037e209078196) that I intend to contribute back to the main repo some time later. The partitioner takes in a parameter which governs the minimum number of keys / partition, and once all partition

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-05-15 Thread Koert Kuipers
Hey Patrick, I have a SparkConf I can add them to. I was looking for a way to do this where they are not hardwired within Scala, which is what SPARK_JAVA_OPTS used to do. I guess if I just set -Dspark.akka.frameSize=1 on my Java app launch, then it will get picked up by the SparkConf too

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-15 Thread wxhsdp
Finally I fixed it. The previous failure was caused by the lack of some jars. I pasted the classpath from local mode to the workers by using `show compile:dependencyClasspath`, and it works! -- View this message in context:

File present but file not found exception

2014-05-15 Thread Sai Prasanna
Hi Everyone, I think everyone is pretty busy; the response time in this group has increased slightly. Anyway, this is a pretty silly problem, but I could not get past it. I have a file in my local FS, but when I try to create an RDD out of it, the task fails with a file-not-found exception thrown at

Re: run spark0.9.1 on yarn with hadoop CDH4

2014-05-15 Thread Arpit Tak
Also try this out; we have already done this and it will help you: http://docs.sigmoidanalytics.com/index.php/Setup_hadoop_2.0.0-cdh4.2.0_and_spark_0.9.0_on_ubuntu_12.04 On Tue, May 6, 2014 at 10:17 PM, Andrew Lee alee...@hotmail.com wrote: Please check JAVA_HOME. Usually it should point to

Test

2014-05-15 Thread Matei Zaharia

Re: 0.9 wont start cluster on ec2, SSH connection refused?

2014-05-15 Thread wxhsdp
Hi Mayur, I've met the same problem. The instances are up; I can see them from the EC2 console and connect to them: wxhsdp@ubuntu:~/spark/spark/tags/v1.0.0-rc3/ec2$ ssh -i wxhsdp-us-east.pem root@54.86.181.108 The authenticity of host '54.86.181.108 (54.86.181.108)' can't be established. ECDSA key

pyspark python exceptions / py4j exceptions

2014-05-15 Thread Patrick Donovan
Hello, I'm trying to write a python function that does something like:

  def foo(line):
      try:
          return stuff(line)
      except Exception:
          raise MoreInformativeException(line)

and then use it in a map like so: rdd.map(foo), and have my MoreInformativeException make it back if/when
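The same wrap-and-rethrow idea can be sketched in Scala (the thread itself is about PySpark; `stuff` and `MoreInformativeException` are the poster's illustrative names, not a real API). The wrapper keeps the original error as the cause while attaching the offending record:

```scala
// Hypothetical exception carrying the record that caused the failure.
class MoreInformativeException(line: String, cause: Throwable)
  extends RuntimeException(s"failed on record: $line", cause)

// Wrap any per-record function so failures report which record broke.
def foo[A](stuff: String => A)(line: String): A =
  try stuff(line)
  catch { case e: Exception => throw new MoreInformativeException(line, e) }

// In Spark one would then write rdd.map(foo(parse)) and the bad record
// shows up in the task failure message on the driver.
```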

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-15 Thread DB Tsai
Hi Wxhsdp, I also have some difficulties with sc.addJar(). Since we include the breeze library via Spark 1.0, we don't have the problem you ran into. However, when we add external jars via sc.addJar(), I found that the executors actually fetch the jars but the classloader still doesn't

Re: Equally weighted partitions in Spark

2014-05-15 Thread deenar.toraskar
This is my first implementation. There are a few rough edges, but when I run this I get the following exception. The class extends Partitioner, which in turn extends Serializable. Any idea what I am doing wrong? scala> res156.partitionBy(new EqualWeightPartitioner(1000, res156, weightFunction))

Re: Unable to load native-hadoop library problem

2014-05-15 Thread Andrew Or
This seems unrelated to not being able to load the native-hadoop library. Is it failing to connect to the ResourceManager? Have you verified that there is an RM process listening on port 8032 at the specified IP? On Tue, May 6, 2014 at 6:25 PM, Sophia sln-1...@163.com wrote: Hi, everyone,

filling missing values in a sequence

2014-05-15 Thread Mohit Jaggi
Hi, I am trying to find a way to fill in missing values in an RDD. The RDD is a sorted sequence, for example (1, 2, 3, 5, 8, 11, ...). I need to fill in the missing numbers and get (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11). One way to do this is to slide and zip: rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11,
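The slide-and-zip idea can be sketched on a plain Scala collection (a local stand-in for the RDD; on Spark the same pairing would come from zipping the RDD with a shifted copy of itself):

```scala
val xs = List(1, 2, 3, 5, 8, 11)

// Pair each element with its successor, emit the half-open range between
// them (which fills any gap), then append the final element.
val filled = xs.zip(xs.tail).flatMap { case (a, b) => a until b } :+ xs.last
// filled == List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
```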

Re: sbt run with spark.ContextCleaner ERROR

2014-05-15 Thread Nan Zhu
Same problem +1, though it does not change the program result. -- Nan Zhu On Tuesday, May 6, 2014 at 11:58 PM, Tathagata Das wrote: Okay, this needs to be fixed. Thanks for reporting this! On Mon, May 5, 2014 at 11:00 PM, wxhsdp wxh...@gmail.com wrote: Hi,

Re: is Mesos falling out of favor?

2014-05-15 Thread deric
I'm also using SPARK_EXECUTOR_URI right now, though I would prefer distributing Spark as a binary package. Running examples with `./bin/run-example ...` works fine; however, tasks from spark-shell are getting lost. Error: Could not find or load main class

Re: Easy one

2014-05-15 Thread Laeeq Ahmed
Hi Ian, Don't use SPARK_MEM in spark-env.sh; it will set it for all of your jobs. The better way is to use only the second option, sconf.setExecutorEnv("spark.executor.memory", "4g"), i.e. set it in the driver program. That way every job will have memory according to its requirement. For example

Re: Equivalent of collect() on DStream

2014-05-15 Thread Stephen Boesch
It seems the concept I had been missing is to invoke the DStream foreach method. This method takes a function expecting an RDD and applies the function to each RDD within the DStream. 2014-05-14 21:33 GMT-07:00 Stephen Boesch java...@gmail.com: Looking further it appears the functionality I

Re: How to use Mahout VectorWritable in Spark.

2014-05-15 Thread Dmitriy Lyubimov
Mahout now supports doing its distributed linalg natively on Spark, so the problem of loading sequence-file input into Spark is already solved there (trunk, http://mahout.apache.org/users/sparkbindings/home.html, the drmFromHDFS() call), and then you can access the underlying RDD directly via the matrix's rdd property

Re: How to run shark?

2014-05-15 Thread Mayur Rustagi
Most likely your Shark server is not started. Are you connecting to the cluster or running in local mode? What is the lowest error on the stack? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, May 12, 2014 at 2:07 PM,

Average of each RDD in Stream

2014-05-15 Thread Laeeq Ahmed
Hi, I use the following code for calculating an average. The problem is that the reduce operation returns a DStream here, not a tuple as it normally does without streaming. So how can we get the sum and the count from the DStream? Can we cast it to a tuple? val numbers =
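The usual trick is to reduce over (value, count) pairs rather than bare values; here is that pattern on a plain Scala collection (a local stand-in — on a DStream the same map/reduce would run per batch, e.g. inside foreachRDD):

```scala
val numbers = Seq(3.0, 5.0, 7.0, 9.0)

// Carry the count alongside the sum so a single reduce yields both.
val (sum, count) = numbers
  .map(v => (v, 1L))
  .reduce((a, b) => (a._1 + b._1, a._2 + b._2))

val average = sum / count
// average == 6.0
```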

spark+mesos: configure mesos 'callback' port?

2014-05-15 Thread Scott Clasen
Is anyone aware of a way to configure the Mesos GroupProcess port on the Mesos slave/task, which the Mesos master calls back on? The log line that shows this port looks like the one below (Mesos 0.17.0): I0507 02:37:20.893334 11638 group.cpp:310] Group process ((2)@1.2.3.4:54321) connected to ZooKeeper.

Re: 1.0.0 Release Date?

2014-05-15 Thread Madhu
Spark 1.0.0 rc5 is available and open for voting. Give it a try and vote on it on the dev mailing list. - Madhu https://www.linkedin.com/in/msiddalingaiah

Re: pySpark memory usage

2014-05-15 Thread Jim Blomo
I should add that I had to tweak the numbers a bit to stay above the swap threshold but below the Too many open files error (`ulimit -n` is 32768). On Wed, May 14, 2014 at 10:47 AM, Jim Blomo jim.bl...@gmail.com wrote: That worked amazingly well, thank you Matei! Numbers that worked for me were 400

Re: Job failed: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-05-15 Thread Shivani Rao
This is something that I have bumped into time and again: the object that contains your main() should also be serializable; then you won't have this issue. For example:

  object Test extends Serializable {
    def main() {
      // set up spark context
      // read your data
      // create your RDDs (grouped by key)
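The underlying rule is that Spark Java-serializes the task closure and everything reachable from it, so any referenced object that isn't Serializable triggers the exception. A quick local check of that reachability rule, with no Spark required (class names here are illustrative):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Returns true iff Java serialization (what Spark uses for closures by
// default) can write the object.
def isSerializable(obj: AnyRef): Boolean =
  try { new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj); true }
  catch { case _: NotSerializableException => false }

class Config                                  // plain class: NOT Serializable
case class GoodTask(key: String)              // case classes are Serializable
case class BadTask(key: String, conf: Config) // drags a non-serializable field along
```

BadTask fails to serialize even though BadTask itself is a case class, because the Config it references is not Serializable; the same reachability argument applies to a closure that captures the enclosing (non-serializable) driver object.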

Re: pySpark memory usage

2014-05-15 Thread Jim Blomo
That worked amazingly well, thank you Matei! Numbers that worked for me were 400 for the textFile()s, 1500 for the join()s. On Mon, May 12, 2014 at 7:58 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Jim, unfortunately external spilling is not implemented in Python right now. While it

Re: Equivalent of collect() on DStream

2014-05-15 Thread Stephen Boesch
Looking further, it appears the functionality I am seeking is in the following private[spark] class ForEachDStream (version 0.8.1; yes, we are presently using an older release):

  private[streaming] class ForEachDStream[T: ClassManifest] (
    parent: DStream[T],
    foreachFunc: (RDD[T],

Re: pySpark memory usage

2014-05-15 Thread Matei Zaharia
Cool, that’s good to hear. We’d also like to add spilling in Python itself, or at least make it exit with a good message if it can’t do it. Matei On May 14, 2014, at 10:47 AM, Jim Blomo jim.bl...@gmail.com wrote: That worked amazingly well, thank you Matei! Numbers that worked for me were

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-05-15 Thread Patrick Wendell
Just wondering - how are you launching your application? If you want to set values like this, the right way is to add them to the SparkConf when you create a SparkContext:

  val conf = new SparkConf()
    .set("spark.akka.frameSize", "1")
    .setAppName(...)
    .setMaster(...)
  val sc = new SparkContext(conf)

-

Re: problem about broadcast variable in iteration

2014-05-15 Thread Earthson
Is the RDD not cached? Because recomputation may be required, every broadcast object is included in the dependencies of the RDDs; this may also cause memory issues (when n and kv are too large, in your case).

Re: spark 0.9.1 textFile hdfs unknown host exception

2014-05-15 Thread Eugen Cepoi
Solved: putting HADOOP_CONF_DIR in the spark-env of the workers solved the problem. The difference between HadoopRDD and NewHadoopRDD is that the old one creates the JobConf on the worker side, whereas the new one creates an instance of JobConf on the driver side and then broadcasts it. I tried creating

Schema view of HadoopRDD

2014-05-15 Thread Debasish Das
Hi, For each line that we read as a textLine from HDFS, we have a schema. If there is an API that takes the schema as a List[Symbol] and maps each token to its Symbol, it would be helpful. Do RDDs provide a schema view of the dataset on HDFS? Thanks. Deb
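Pre-Spark SQL, this kind of schema view is easy to hand-roll: zip the tokens of each delimited line with the caller-supplied List[Symbol]. A minimal local sketch (field names and the comma delimiter are assumptions; in Spark the function would just be passed to textFile(...).map(...)):

```scala
// Caller-supplied schema, one Symbol per column.
val schema = List(Symbol("name"), Symbol("age"), Symbol("city"))

// Map each token of a line to its schema Symbol.
def toRecord(line: String): Map[Symbol, String] =
  schema.zip(line.split(",")).toMap

val rec = toRecord("alice,30,paris")
// rec(Symbol("age")) == "30"
```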

Re: problem about broadcast variable in iteration

2014-05-15 Thread randylu
rdd1 is cached, but it has no effect:

  var rdd1 = ...
  var rdd2 = ...
  var kv = ...
  for (i <- 0 until n) {
    var kvGlobal = sc.broadcast(kv) // broadcast kv
    rdd1 = rdd2.map {
      case t => doSomething(t, kvGlobal.value)
    }.cache()
    var tmp =

os buffer cache does not cache shuffle output file

2014-05-15 Thread wxhsdp
Hi, Patrick said: The intermediate shuffle output gets written to disk, but it often hits the OS buffer cache since it's not explicitly fsync'ed, so in many cases it stays entirely in memory. The behavior of the shuffle is agnostic to whether the base RDD is in cache or on disk. I

Re: Schema view of HadoopRDD

2014-05-15 Thread rxin
The new Spark SQL component is designed for exactly this!

Is there any problem on the spark mailing list?

2014-05-15 Thread Cheney Sun
I haven't been able to receive any spark-user mail since yesterday. Can you guys receive any new mail? -- Cheney

Re: problem about broadcast variable in iteration

2014-05-15 Thread randylu
But when I put the broadcast variable outside the for loop, it works well (if not concerned about the memory issue you pointed out):

  var rdd1 = ...
  var rdd2 = ...
  var kv = ...
  var kvGlobal = sc.broadcast(kv) // broadcast kv
  for (i <- 0 until n) {
    rdd1 =

Re: Task not serializable?

2014-05-15 Thread pedro
I'm still fairly new to this, but I found problems using classes in maps when they used instance variables in part of the map function. It seems that for maps and such to work correctly, the code needs to be purely functional.

Spark unit testing best practices

2014-05-15 Thread Andras Nemeth
Hi, Spark's local mode is great for creating simple unit tests for our Spark logic. The disadvantage, however, is that certain types of problems are never exposed in local mode because things never need to be put on the wire. E.g. if I accidentally use a closure which has something non-serializable

Re: is Mesos falling out of favor?

2014-05-15 Thread Scott Clasen
Curious what the bug is and what it breaks. I have Spark 0.9.0 running on Mesos 0.17.0 and it seems to work correctly.

Not getting mails from user group

2014-05-15 Thread Laeeq Ahmed
Hi all, There seems to be a problem: I have not been getting mails from the Spark user group for two days. Regards, Laeeq

Spark to utilize HDFS's mmap caching

2014-05-15 Thread Chanwit Kaewkasi
Hi all, Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via sc.textFile() and other HDFS-related APIs? http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html Best regards, -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit