Re: Compile SimpleApp.scala encountered error, can anyone please help?

2014-04-11 Thread prabeesh k
Ensure there is only one SimpleApp object in your project, and check whether there is any copy of SimpleApp.scala. Normally the file SimpleApp.scala lives in src/main/scala or in the project root folder. On Sat, Apr 12, 2014 at 11:07 AM, jni2000 wrote: > Hi > > I am a new Spark user and try to test run it from

Compile SimpleApp.scala encountered error, can anyone please help?

2014-04-11 Thread jni2000
Hi, I am a new Spark user and am trying to test-run it from scratch. I followed the documentation and was able to build the Spark package and run the spark shell. However, when I move on to building the standalone sample "SimpleApp.scala", I see the following errors: Loading /usr/share/sbt/bin/sbt-laun
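For reference, a minimal sketch of the standalone app from the quick start guide, assuming the file sits at src/main/scala/SimpleApp.scala and no second SimpleApp object exists anywhere on the sbt source path:

    /* SimpleApp.scala */
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object SimpleApp {
      def main(args: Array[String]) {
        val logFile = "README.md" // placeholder: any local text file
        val sc = new SparkContext("local", "Simple App")
        val logData = sc.textFile(logFile).cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
      }
    }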

Re: Huge matrix

2014-04-11 Thread Reza Zadeh
Hi Xiaoli, There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf The PR is at https://github.com/apache/spark/pull/336, though it will need refactoring given the recent changes to the matrix interface in MLlib. You may

Re: Huge matrix

2014-04-11 Thread Xiaoli Li
Hi Andrew, Thanks for your suggestion. I have tried the method. I used 8 nodes, and every node has 8G of memory. The program just stopped at one stage for several hours without any further information. Maybe I need to find a more efficient way. On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash wr

Re: SVD under spark/mllib/linalg

2014-04-11 Thread Xiangrui Meng
It was moved to mllib.linalg.distributed.RowMatrix. With RowMatrix, you can compute column summary statistics, the Gram matrix, covariance, SVD, and PCA. We will provide multiplication for distributed matrices, but not in v1.0. -Xiangrui On Fri, Apr 11, 2014 at 9:12 PM, wxhsdp wrote: > Hi, all > the
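A minimal sketch against the v1.0 API described above, assuming an existing SparkContext named sc; the matrix values are placeholders:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Build a RowMatrix from an RDD[Vector], one record per matrix row.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 9.0)))
    val mat = new RowMatrix(rows)

    // Top-2 singular values/vectors; computeU = true also materializes U.
    val svd = mat.computeSVD(2, computeU = true)
    val pca = mat.computePrincipalComponents(2)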

SVD under spark/mllib/linalg

2014-04-11 Thread wxhsdp
Hi all, the code under https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/linalg has changed. The previous matrix classes, like MatrixEntry and MatrixSVD, have all been removed; Breeze matrix definitions appear instead. Do we move to Breeze Linear Algebra when we do linear algor

Re: Behaviour of caching when dataset does not fit into memory

2014-04-11 Thread Mayur Rustagi
One reason could be that Spark uses scratch disk space for intermediate calculations, so as you perform calculations, that data needs to be flushed before you can leverage memory for operations. A second issue could be that large intermediate data may push more of the RDD onto disk (something I see in wareh

Re: Spark on YARN performance

2014-04-11 Thread Mayur Rustagi
I am using Mesos right now & it works great. Mesos has fine-grained as well as coarse-grained allocation & is really useful for prioritizing different pipelines. On Apr 11, 2014 1:19 PM, "Patrick Wendell" wrote: > To reiterate what Tom was saying - the code that runs inside of Spark on > YARN is exa

Re: Huge matrix

2014-04-11 Thread Andrew Ash
The naive way would be to put all the users and their attributes into an RDD, then cartesian-product that with itself. Run the similarity score on every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)), and take the .top(k) for each user. I doubt that you'll be able to take this appr
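A sketch of that naive approach, with a placeholder similarity function and hypothetical (id, attributes) records; as noted, the 1T intermediate pairs make this impractical at 1M users:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Placeholder: swap in whatever similarity measure fits the attributes.
    def similarity(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => x * y }.sum

    def topKSimilar(users: RDD[(Long, Array[Double])], k: Int)
        : RDD[(Long, Seq[(Double, Long)])] =
      users.cartesian(users)
        .filter { case ((u1, _), (u2, _)) => u1 != u2 }   // drop self-pairs
        .map { case ((u1, a1), (u2, a2)) => (u1, (similarity(a1, a2), u2)) }
        .groupByKey()                                     // all scores per user
        .mapValues(_.toSeq.sortBy(-_._1).take(k))         // keep the top k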

Huge matrix

2014-04-11 Thread Xiaoli Li
Hi all, I am implementing an algorithm using Spark. I have one million users and need to compute the similarity between each pair of users using some of their attributes. For each user, I need to get the top k most similar users. What is the best way to implement this? Thanks.

Re: Spark on YARN performance

2014-04-11 Thread Patrick Wendell
To reiterate what Tom was saying - the code that runs inside of Spark on YARN is exactly the same code that runs in any deployment mode. There shouldn't be any performance difference once your application starts (assuming you are comparing apples-to-apples in terms of hardware). The differences ar

Shutdown with streaming driver running in cluster broke master web UI permanently

2014-04-11 Thread Paul Mogren
I had a cluster running with a streaming driver deployed into it. I shut down the cluster using sbin/stop-all.sh. Upon restarting (and restarting, and restarting), the master web UI cannot respond to requests. The cluster seems to be otherwise functional. Below is the master's log, showing stack

Re: 0.9 won't start cluster on EC2, SSH connection refused?

2014-04-11 Thread Alton Alexander
No, not anymore, but it was at the time. Thanks, but I also just found a thread from two days ago discussing the root and ec2-user workaround. For now I'll just go back to using the AMI provided. Thanks! On Fri, Apr 11, 2014 at 1:39 PM, Mayur Rustagi wrote: > is the machine booted up & reachable? >

Re: 0.9 won't start cluster on EC2, SSH connection refused?

2014-04-11 Thread Mayur Rustagi
Is the machine booted up & reachable? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Fri, Apr 11, 2014 at 12:37 PM, Alton Alexander wrote: > I run the following command and it correctly starts one head and one > master

0.9 won't start cluster on EC2, SSH connection refused?

2014-04-11 Thread Alton Alexander
I run the following command and it correctly starts one head and one master, but then it fails because it can't log onto the head with the SSH key. The weird thing is that I can log onto the head with that same public key. (ssh -i myamazonkey.pem r...@ec2-54-86-3-208.compute-1.amazonaws.com) Thanks

Re: Is Branch 1.0 build broken?

2014-04-11 Thread Chester Chen
Sean, yes, you are right, I did not pay attention to the details: [error] Server access Error: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty url=https://repository.apache.org/content/repositories/

Re: Setting properties in core-site.xml for Spark and Hadoop to access

2014-04-11 Thread Nicholas Chammas
Digging up this thread to ask a follow-up question: What is the intended use for /root/spark/conf/core-site.xml? It seems that both /root/spark/bin/pyspark and /root/ephemeral-hdfs/bin/hadoop point to /root/ephemeral-hdfs/conf/core-site.xml. If I specify S3 access keys in spark/conf, Spark doesn
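One way to sidestep the core-site.xml lookup order entirely is to set the S3 credentials programmatically on the job's Hadoop configuration; a sketch, assuming an existing SparkContext sc and placeholder credentials and bucket:

    // These are the standard s3n property names; the values are placeholders.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
    val data = sc.textFile("s3n://your-bucket/path/to/data")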

Spark behaviour when executor JVM crashes

2014-04-11 Thread deenar.toraskar
Hi, I am calling a C++ library from Spark using JNI. Occasionally the C++ library causes the JVM to crash. The task terminates on the MASTER, but the driver does not return. I am not sure why the driver does not terminate. I also notice that after such an occurrence, I lose some workers perma

GraphX

2014-04-11 Thread Ghufran Malik
Hi, I was wondering if there is an implementation of the breadth-first search algorithm in GraphX? Cheers, Ghufran
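As far as I know there is no built-in BFS, but one can be sketched on top of the Pregel API by treating BFS as single-source shortest paths with unit edge weights; the graph and source vertex below are assumptions:

    import org.apache.spark.graphx._

    // Assumed inputs: an existing `graph` and a hypothetical start vertex.
    val sourceId: VertexId = 42L
    val initial = graph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val bfs = initial.pregel(Double.PositiveInfinity)(
      (id, dist, msg) => math.min(dist, msg),              // keep smaller hop count
      triplet =>                                           // relax edges by one hop
        if (triplet.srcAttr + 1.0 < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + 1.0))
        else Iterator.empty,
      (a, b) => math.min(a, b))                            // merge incoming messages
    // bfs.vertices holds hop distances from sourceId (infinity if unreachable)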

Re: Behaviour of caching when dataset does not fit into memory

2014-04-11 Thread Pierre Borckmans
Hi Matei, Could you enlighten us on this please? Thanks Pierre On 11 Apr 2014, at 14:49, Jérémy Subtil wrote: > Hi Xusen, > > I was convinced the cache() method would involve in-memory only operations > and has nothing to do with disks as the underlying default cache strategy is > MEMORY_O

Re: Spark - ready for prime time?

2014-04-11 Thread Surendranauth Hiraman
Excellent, thank you. On Fri, Apr 11, 2014 at 12:09 PM, Matei Zaharia wrote: > It's not a new API, it just happens underneath the current one if you have > spark.shuffle.spill set to true (which it is by default). Take a look at > the config settings that mention "spill" in > http://spark.incu

Re: Spark - ready for prime time?

2014-04-11 Thread Matei Zaharia
It’s not a new API, it just happens underneath the current one if you have spark.shuffle.spill set to true (which it is by default). Take a look at the config settings that mention “spill” in http://spark.incubator.apache.org/docs/latest/configuration.html. Matei On Apr 11, 2014, at 7:02 AM, S
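For illustration, a sketch that sets those knobs explicitly on a SparkConf; the values shown are the documented defaults, and the app name is a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("spill-demo") // placeholder
      .set("spark.shuffle.spill", "true")          // spilling is on by default
      .set("spark.shuffle.memoryFraction", "0.3")  // memory used before spilling
    val sc = new SparkContext(conf)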

Re: Spark on YARN performance

2014-04-11 Thread Tom Graves
I haven't run on Mesos before, but I do run on YARN. The performance differences are going to be in how long it takes to get the executors allocated. On YARN that is going to depend on the cluster setup. If you have dedicated resources to a queue where you are running your spark job the ov

Re: shuffle memory requirements

2014-04-11 Thread Ameet Kini
A typo: I meant section 2.1.2.5 "ulimit and nproc" of https://hbase.apache.org/book.html Ameet On Fri, Apr 11, 2014 at 10:32 AM, Ameet Kini wrote: > > Turns out that my ulimit settings were too low. I bumped up and the job > successfully completes. Here's what I have now: > > $ ulimit -u

Re: shuffle memory requirements

2014-04-11 Thread Ameet Kini
Turns out that my ulimit settings were too low. I bumped them up and the job now completes successfully. Here's what I have now: $ ulimit -u // for max user processes 81920 $ ulimit -n // for open files 81920 I was thrown off by the OutOfMemoryError into thinking it is Spark running out of memory in t

Re: Hybrid GPU CPU computation

2014-04-11 Thread Jaonary Rabarisoa
In fact the idea is to run some parts of the code on the GPU, as Patrick described, and extend the RDD structure so that it can also be distributed on GPUs. The following article http://www.wired.com/2013/06/andrew_ng/ describes a hybrid GPU/CPU implementation (with MPI) that outperforms 16,000 cores

Re: Spark - ready for prime time?

2014-04-11 Thread Surendranauth Hiraman
Matei, Where is the functionality in 0.9 to spill data within a task (separately from persist)? My apologies if this is something obvious, but I don't see it in the API docs. -Suren On Thu, Apr 10, 2014 at 3:59 PM, Matei Zaharia wrote: > To add onto the discussion about memory working space, 0

Re: Hybrid GPU CPU computation

2014-04-11 Thread Patrick Grinaway
I've actually done it using PySpark and Python libraries which call CUDA code, though I've never done it from Scala directly. The only major challenge I've hit is assigning tasks to GPUs on multi-GPU machines. Sent from my iPhone > On Apr 11, 2014, at 8:38 AM, Jaonary Rabarisoa wrote: > >

Re: Hybrid GPU CPU computation

2014-04-11 Thread Pascal Voitot Dev
On Fri, Apr 11, 2014 at 3:34 PM, Dean Wampler wrote: > I've thought about this idea, although I haven't tried it, but I think the > right approach is to pick your granularity boundary and use Spark + JVM for > large-scale parts of the algorithm, then use the GPGPU API for number > crunching large

Re: Hybrid GPU CPU computation

2014-04-11 Thread Dean Wampler
I've thought about this idea, although I haven't tried it, but I think the right approach is to pick your granularity boundary and use Spark + JVM for the large-scale parts of the algorithm, then use the GPGPU API for number crunching large chunks at a time. No need to run the JVM and Spark on the GPU,

Too many tasks in reduceByKey() when do PageRank iteration

2014-04-11 Thread 张志齐
Hi all, I am now implementing a simple PageRank. Unlike the PageRank example in Spark, I divided the matrix into blocks and the rank vector into slices. Here is my code: https://github.com/gowithqi/PageRankOnSpark/blob/master/src/PageRank/PageRank.java I supposed that the complexity of each it

Re: Using Spark for Divide-and-Conquer Algorithms

2014-04-11 Thread Dean Wampler
There is a handy parallelize method for running independent computations. The examples page (http://spark.apache.org/examples.html) on the website uses it to estimate Pi. You can join the results at the end of the parallel calculations. On Fri, Apr 11, 2014 at 7:52 AM, Yanzhe Chen wrote: > Hi
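The Pi estimator from that examples page, lightly adapted and assuming an existing SparkContext sc: each sample is computed independently in parallel and the results are combined at the end.

    val NUM_SAMPLES = 100000
    val count = sc.parallelize(1 to NUM_SAMPLES).map { _ =>
      val x = math.random // sample a point in the unit square
      val y = math.random
      if (x * x + y * y < 1) 1 else 0 // inside the quarter circle?
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)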

Re: Spark 0.9.1 PySpark ImportError

2014-04-11 Thread aazout
Matei, thanks. So including the PYTHONPATH in spark-env.sh seemed to work. I am now faced with this issue: I am doing a large GroupBy in PySpark and the process fails (at the driver, it seems). There is not much of a stack trace here to see where the issue is happening. This process works locally.

Using Spark for Divide-and-Conquer Algorithms

2014-04-11 Thread Yanzhe Chen
Hi all, Is Spark suitable for applications like the Convex Hull algorithm, which has some classic divide-and-conquer approaches like QuickHull? More generally, is there a way to express divide-and-conquer algorithms in Spark? Thanks! -- Yanzhe Chen Institute of Parallel and Distributed Systems

Re: Behaviour of caching when dataset does not fit into memory

2014-04-11 Thread Jérémy Subtil
Hi Xusen, I was convinced the cache() method would involve in-memory-only operations and have nothing to do with disks, as the underlying default cache strategy is MEMORY_ONLY. Am I missing something? 2014-04-11 11:44 GMT+02:00 尹绪森 : > Hi Pierre, > > 1. cache() would cost time to carry stuffs fro

Re: Hybrid GPU CPU computation

2014-04-11 Thread Saurabh Jha
There is a Scala implementation for GPGPUs (NVIDIA CUDA, to be precise), but you would also need to port Mesos for GPUs. I am not sure about Mesos. Also, the current Scala GPU version is not stable enough to be used commercially. Hope this helps. Thanks, Saurabh. *Saurabh Jha* Intl. Exchange Student School o

Re: Hybrid GPU CPU computation

2014-04-11 Thread Pascal Voitot Dev
This is a bit crazy :) I suppose you would have to run Java code on the GPU! I heard there are some funny projects to do that... Pascal On Fri, Apr 11, 2014 at 2:38 PM, Jaonary Rabarisoa wrote: > Hi all, > > I'm just wondering if hybrid GPU/CPU computation is something that is > feasible with sp

Hybrid GPU CPU computation

2014-04-11 Thread Jaonary Rabarisoa
Hi all, I'm just wondering if hybrid GPU/CPU computation is something that is feasible with Spark, and what would be the best way to do it. Cheers, Jaonary

[GraphX] Cast error when comparing a vertex attribute after its type has changed

2014-04-11 Thread Pierre-Alexandre Fonta
Hi, Testing in mapTriplets whether a vertex attribute, which is defined as Integer in the first VertexRDD but has been changed to Double by mapVertices, is greater than a number throws "java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double". If second elements of vertex
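A sketch of the reported pattern, assuming a hypothetical Graph[Int, Int] named graph; per the report, the comparison inside mapTriplets is what throws:

    import org.apache.spark.graphx._

    // Vertex attributes start as Int and are mapped to Double...
    val doubled: Graph[Double, Int] = graph.mapVertices((_, attr) => attr.toDouble)
    // ...and comparing them in mapTriplets reportedly raises
    // java.lang.ClassCastException: Integer cannot be cast to Double.
    val flagged = doubled.mapTriplets(t => t.srcAttr > 2.0)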

Re: Behaviour of caching when dataset does not fit into memory

2014-04-11 Thread 尹绪森
Hi Pierre, 1. cache() costs time to carry stuff from disk to memory, so please do not use cache() if your job is not an iterative one. 2. If your dataset is larger than the amount of memory, then there will be a replacement strategy to exchange data between memory and disk. 2014-04-11 0:07 GMT+08:0
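To make point 2 concrete: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), under which partitions that do not fit are recomputed on use rather than written to disk; picking a storage level explicitly lets Spark spill instead. A sketch, assuming an existing SparkContext sc and a placeholder input path:

    import org.apache.spark.storage.StorageLevel

    val inMemory  = sc.textFile("data.txt").persist(StorageLevel.MEMORY_ONLY)
    val spillable = sc.textFile("data.txt").persist(StorageLevel.MEMORY_AND_DISK)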

Re: Error when I use spark-streaming

2014-04-11 Thread Hahn Jiang
I found it. I should run "nc -lk " first and then run NetworkWordCount. Thanks On Fri, Apr 11, 2014 at 4:13 PM, Schein, Sagi wrote: > I would check the DNS setting. > > Akka seems to pick configuration from FQDN on my system > > > > Sagi > > > > *From:* Hahn Jiang [mailto:hahn.jian
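A minimal NetworkWordCount-style sketch for reference; the local master and port 9999 are assumptions, and the listener (e.g. nc -lk 9999) must be started before the job:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999) // connect to the nc listener
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()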

RE: Error when I use spark-streaming

2014-04-11 Thread Schein, Sagi
I would check the DNS setting. Akka seems to pick its configuration from the FQDN on my system. Sagi From: Hahn Jiang [mailto:hahn.jiang@gmail.com] Sent: Friday, April 11, 2014 10:56 AM To: user Subject: Error when I use spark-streaming hi all, When I run spark-streaming use NetworkWordCount in

Error when I use spark-streaming

2014-04-11 Thread Hahn Jiang
Hi all, When I run the NetworkWordCount streaming example, it always throws this exception. I don't understand why it can't connect; I haven't restricted the port. 14/04/11 15:38:56 ERROR SocketReceiver: Error receiving data in receiver 0 java.net.ConnectException: Connection refuse