Error when I use spark-streaming

2014-04-11 Thread Hahn Jiang
Hi all, when I run the NetworkWordCount example in spark-streaming, it always throws this exception. I don't understand why it can't connect; I don't restrict the port. 14/04/11 15:38:56 ERROR SocketReceiver: Error receiving data in receiver 0 java.net.ConnectException: Connection

Re: Error when I use spark-streaming

2014-04-11 Thread Hahn Jiang
I found it: I should run nc -lk first and then run NetworkWordCount. Thanks. On Fri, Apr 11, 2014 at 4:13 PM, Schein, Sagi sagi.sch...@hp.com wrote: I would check the DNS settings. Akka seems to pick its configuration from the FQDN on my system. Sagi *From:* Hahn Jiang
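
For reference, a minimal sketch of that order of operations in Spark Streaming's Scala API (port 9999 and the local master are placeholder choices, not from the thread):

    // Run `nc -lk 9999` in another terminal BEFORE starting this job,
    // so the socket receiver has something to connect to.
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object NetworkWordCountSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
        // Connects to localhost:9999; throws java.net.ConnectException
        // if nothing is listening there yet.
        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }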

Re: Behaviour of caching when dataset does not fit into memory

2014-04-11 Thread 尹绪森
Hi Pierre, 1. cache() costs time to carry data from disk to memory, so please do not use cache() if your job is not iterative. 2. If your dataset is larger than the amount of memory, a replacement strategy exchanges data between memory and disk. 2014-04-11 0:07

[GraphX] Cast error when comparing a vertex attribute after its type has changed

2014-04-11 Thread Pierre-Alexandre Fonta
Hi, Testing in mapTriplets whether a vertex attribute, which is defined as Integer in the first VertexRDD but was later changed to Double by mapVertices, is greater than a number throws java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double. If the second elements of vertex
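
A minimal sketch of the pattern being described (the graph data here is made up for illustration; the poster's actual code is in the thread):

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx._

    def sketch(sc: SparkContext): Unit = {
      // The vertex attribute starts life as Int ...
      val vertices = sc.parallelize(Seq((1L, 10), (2L, 20)))
      val edges = sc.parallelize(Seq(Edge(1L, 2L, "e")))
      val graph: Graph[Int, String] = Graph(vertices, edges)

      // ... is widened to Double by mapVertices ...
      val widened: Graph[Double, String] =
        graph.mapVertices((_, attr) => attr.toDouble)

      // ... and is then compared as a Double inside mapTriplets, which is
      // where the reported ClassCastException surfaces.
      val flagged = widened.mapTriplets(t => t.srcAttr > 15.0)
      flagged.edges.collect().foreach(println)
    }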

Hybrid GPU CPU computation

2014-04-11 Thread Jaonary Rabarisoa
Hi all, I'm just wondering if hybrid GPU/CPU computation is something that is feasible with Spark, and what the best way to do it would be. Cheers, Jaonary

Re: Hybrid GPU CPU computation

2014-04-11 Thread Pascal Voitot Dev
This is a bit crazy :) I suppose you would have to run Java code on the GPU! I heard there are some funny projects to do that... Pascal On Fri, Apr 11, 2014 at 2:38 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I'm just wondering if hybrid GPU/CPU computation is something that is

Re: Hybrid GPU CPU computation

2014-04-11 Thread Saurabh Jha
There is a Scala implementation for GPGPUs (NVIDIA CUDA, to be precise), but you would also need to port Mesos for GPUs; I am not sure about Mesos. Also, the current Scala GPU version is not stable enough for commercial use. Hope this helps. Thanks, Saurabh. *Saurabh Jha* Intl. Exchange Student School

Re: Behaviour of caching when dataset does not fit into memory

2014-04-11 Thread Jérémy Subtil
Hi Xusen, I was convinced the cache() method involves in-memory-only operations and has nothing to do with disk, since the underlying default cache strategy is MEMORY_ONLY. Am I missing something? 2014-04-11 11:44 GMT+02:00 尹绪森 yinxu...@gmail.com: Hi Pierre, 1. cache() costs time to
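
For concreteness, a short sketch of the two storage behaviours under discussion (the dataset path is hypothetical):

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    def cachingSketch(sc: SparkContext): Unit = {
      // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
      // partitions that don't fit in memory are simply not cached and
      // get recomputed from the lineage when needed again.
      val inMemory = sc.textFile("hdfs:///big/dataset").cache()

      // MEMORY_AND_DISK instead spills partitions that don't fit to
      // local disk, trading recomputation for disk I/O.
      val spillable = sc.textFile("hdfs:///big/dataset")
        .persist(StorageLevel.MEMORY_AND_DISK)

      println(inMemory.count() + spillable.count())
    }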

Re: Spark 0.9.1 PySpark ImportError

2014-04-11 Thread aazout
Matei, thanks. So including the PYTHONPATH in spark-env.sh seemed to work. I am faced with this issue now: I am doing a large groupBy in PySpark and the process fails (at the driver, it seems). There is not much of a stack trace to see where the issue is happening. This process works locally.

Re: Using Spark for Divide-and-Conquer Algorithms

2014-04-11 Thread Dean Wampler
There is a handy parallelize method for running independent computations. The examples page (http://spark.apache.org/examples.html) on the website uses it to estimate Pi. You can join the results at the end of the parallel calculations. On Fri, Apr 11, 2014 at 7:52 AM, Yanzhe Chen
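
That Pi estimation is roughly the following (adapted from the examples page; the sample count is arbitrary):

    import scala.math.random
    import org.apache.spark.SparkContext

    // Monte Carlo Pi: every sample is independent, so parallelize spreads
    // them across the cluster and count() joins the results at the end.
    def estimatePi(sc: SparkContext, numSamples: Int = 100000): Double = {
      val inCircle = sc.parallelize(1 to numSamples).filter { _ =>
        val x = random
        val y = random
        x * x + y * y < 1
      }.count()
      4.0 * inCircle / numSamples
    }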

Too many tasks in reduceByKey() when doing PageRank iteration

2014-04-11 Thread 张志齐
Hi all, I am now implementing a simple PageRank. Unlike the PageRank example in Spark, I divided the matrix into blocks and the rank vector into slices. Here is my code: https://github.com/gowithqi/PageRankOnSpark/blob/master/src/PageRank/PageRank.java I supposed that the complexity of each
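
If the task count is the problem, note that reduceByKey accepts an explicit number of partitions; a sketch in Scala (whether this matches the linked Java code is my assumption, and the pair shape is a guess at one iteration's contributions):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Hypothetical (pageId, contribution) pairs from one iteration.
    def aggregateRanks(contribs: RDD[(Long, Double)]): RDD[(Long, Double)] =
      // Pin the reduce side to 64 tasks instead of inheriting the
      // (possibly much larger) parent partition count.
      contribs.reduceByKey(_ + _, 64)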

Re: Hybrid GPU CPU computation

2014-04-11 Thread Dean Wampler
I've thought about this idea, although I haven't tried it, but I think the right approach is to pick your granularity boundary and use Spark + JVM for the large-scale parts of the algorithm, then use the GPGPU API for number-crunching large chunks at a time. No need to run the JVM and Spark on the
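
One way to sketch that granularity boundary (the GpuKernel object below is a hypothetical stand-in for a real JNI or JCuda binding, not an existing API):

    import org.apache.spark.rdd.RDD

    // Stand-in for a native wrapper that copies an array to the GPU,
    // runs a kernel, and copies the result back.
    object GpuKernel {
      def run(chunk: Array[Float]): Array[Float] =
        chunk.map(x => x * x) // CPU placeholder for the actual kernel
    }

    // Spark handles distribution at the partition level; each partition
    // is handed to the GPU as one large chunk to amortize transfer cost.
    def gpuMap(input: RDD[Float]): RDD[Float] =
      input.mapPartitions(iter => GpuKernel.run(iter.toArray).iterator)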

Re: Hybrid GPU CPU computation

2014-04-11 Thread Pascal Voitot Dev
On Fri, Apr 11, 2014 at 3:34 PM, Dean Wampler deanwamp...@gmail.com wrote: I've thought about this idea, although I haven't tried it, but I think the right approach is to pick your granularity boundary and use Spark + JVM for the large-scale parts of the algorithm, then use the GPGPU API for

Re: Hybrid GPU CPU computation

2014-04-11 Thread Patrick Grinaway
I've actually done it using PySpark and Python libraries which call CUDA code, though I've never done it from Scala directly. The only major challenge I've hit is assigning tasks to GPUs on multi-GPU machines. Sent from my iPhone On Apr 11, 2014, at 8:38 AM, Jaonary Rabarisoa

Re: Hybrid GPU CPU computation

2014-04-11 Thread Jaonary Rabarisoa
In fact the idea is to run some parts of the code on the GPU, as Patrick described, and extend the RDD structure so that it can also be distributed on GPUs. The following article http://www.wired.com/2013/06/andrew_ng/ describes a hybrid GPU/CPU implementation (with MPI) that outperforms a 16,000-core

Re: shuffle memory requirements

2014-04-11 Thread Ameet Kini
Turns out that my ulimit settings were too low. I bumped them up and the job successfully completes. Here's what I have now: $ ulimit -u // for max user processes 81920 $ ulimit -n // for open files 81920 I was thrown off by the OutOfMemoryError into thinking it was Spark running out of memory in

Re: shuffle memory requirements

2014-04-11 Thread Ameet Kini
A typo: I meant section 2.1.2.5, ulimit and nproc, of https://hbase.apache.org/book.html Ameet On Fri, Apr 11, 2014 at 10:32 AM, Ameet Kini ameetk...@gmail.com wrote: Turns out that my ulimit settings were too low. I bumped them up and the job successfully completes. Here's what I have now: $

Re: Spark - ready for prime time?

2014-04-11 Thread Matei Zaharia
It’s not a new API, it just happens underneath the current one if you have spark.shuffle.spill set to true (which it is by default). Take a look at the config settings that mention “spill” in http://spark.incubator.apache.org/docs/latest/configuration.html. Matei On Apr 11, 2014, at 7:02 AM,
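
For reference, a sketch of where those settings are applied (the memory fraction value shown is illustrative; see the configuration page for the defaults):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("SpillExample")
      // On by default: aggregations spill to disk past the memory limit.
      .set("spark.shuffle.spill", "true")
      // Fraction of the heap usable for shuffle aggregation buffers.
      .set("spark.shuffle.memoryFraction", "0.3")
    val sc = new SparkContext(conf)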

Re: Spark - ready for prime time?

2014-04-11 Thread Surendranauth Hiraman
Excellent, thank you. On Fri, Apr 11, 2014 at 12:09 PM, Matei Zaharia matei.zaha...@gmail.com wrote: It's not a new API, it just happens underneath the current one if you have spark.shuffle.spill set to true (which it is by default). Take a look at the config settings that mention spill in

GraphX

2014-04-11 Thread Ghufran Malik
Hi, I was wondering if there is an implementation of the breadth-first search algorithm in GraphX? Cheers, Ghufran
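
Not as a built-in, as far as I know, but BFS is short to express with GraphX's Pregel API; a sketch, assuming an unweighted graph and hop counts held as Double:

    import org.apache.spark.graphx._

    // BFS via Pregel: each vertex holds its distance from the source;
    // messages carry candidate distances one hop further out.
    def bfs[VD, ED](graph: Graph[VD, ED], source: VertexId): Graph[Double, ED] = {
      val init = graph.mapVertices((id, _) =>
        if (id == source) 0.0 else Double.PositiveInfinity)
      init.pregel(Double.PositiveInfinity)(
        (id, dist, msg) => math.min(dist, msg),        // vertex program
        t =>                                           // send messages
          if (t.srcAttr + 1 < t.dstAttr) Iterator((t.dstId, t.srcAttr + 1))
          else Iterator.empty,
        (a, b) => math.min(a, b)                       // merge messages
      )
    }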

Spark behaviour when executor JVM crashes

2014-04-11 Thread deenar.toraskar
Hi, I am calling a C++ library from Spark using JNI. Occasionally the C++ library causes the JVM to crash. The task terminates on the master, but the driver does not return, and I am not sure why. I also notice that after such an occurrence, I lose some workers

Re: Setting properties in core-site.xml for Spark and Hadoop to access

2014-04-11 Thread Nicholas Chammas
Digging up this thread to ask a follow-up question: What is the intended use of /root/spark/conf/core-site.xml? It seems that both /root/spark/bin/pyspark and /root/ephemeral-hdfs/bin/hadoop point to /root/ephemeral-hdfs/conf/core-site.xml. If I specify S3 access keys in spark/conf, Spark

0.9 won't start cluster on ec2, SSH connection refused?

2014-04-11 Thread Alton Alexander
I run the following command and it correctly starts one head and one master, but then it fails because it can't log onto the head with the SSH key. The weird thing is that I can log onto the head with that same public key. (ssh -i myamazonkey.pem r...@ec2-54-86-3-208.compute-1.amazonaws.com)

Re: 0.9 won't start cluster on ec2, SSH connection refused?

2014-04-11 Thread Mayur Rustagi
Is the machine that booted up reachable? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Apr 11, 2014 at 12:37 PM, Alton Alexander alexanderal...@gmail.com wrote: I run the following command and it correctly starts one

Re: 0.9 won't start cluster on ec2, SSH connection refused?

2014-04-11 Thread Alton Alexander
No, not anymore, but it was at the time. Thanks, but I also just found a thread from two days ago discussing the root and ec2-user workaround. For now I'll just go back to using the AMI provided. Thanks! On Fri, Apr 11, 2014 at 1:39 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Is the machine

Shutdown with streaming driver running in cluster broke master web UI permanently

2014-04-11 Thread Paul Mogren
I had a cluster running with a streaming driver deployed into it. I shut down the cluster using sbin/stop-all.sh. Upon restarting (and restarting, and restarting), the master web UI cannot respond to requests. The cluster seems to be otherwise functional. Below is the master's log, showing

Re: Spark on YARN performance

2014-04-11 Thread Patrick Wendell
To reiterate what Tom was saying - the code that runs inside of Spark on YARN is exactly the same code that runs in any deployment mode. There shouldn't be any performance difference once your application starts (assuming you are comparing apples-to-apples in terms of hardware). The differences

Huge matrix

2014-04-11 Thread Xiaoli Li
Hi all, I am implementing an algorithm using Spark. I have one million users. I need to compute the similarity between each pair of users using some user attributes. For each user, I need to get the top k most similar users. What is the best way to implement this? Thanks.

Re: Huge matrix

2014-04-11 Thread Andrew Ash
The naive way would be to put all the users and their attributes into an RDD, then take the cartesian product of that with itself. Run the similarity score on every pair (1M * 1M = 1T scores), map to (user, (score, otherUser)), and take the .top(k) for each user. I doubt that you'll be able to take this
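
A sketch of that naive approach (the feature vectors and the choice of cosine similarity are my assumptions; the groupByKey/top-k step is exactly the part that will struggle at 1M x 1M):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    def cosine(a: Array[Double], b: Array[Double]): Double = {
      val dot = (a, b).zipped.map(_ * _).sum
      val norms = math.sqrt(a.map(x => x * x).sum) *
                  math.sqrt(b.map(x => x * x).sum)
      if (norms == 0) 0.0 else dot / norms
    }

    // users: (userId, attribute vector); returns the top-k matches per user.
    def topKSimilar(users: RDD[(Long, Array[Double])],
                    k: Int): RDD[(Long, Seq[(Double, Long)])] =
      users.cartesian(users)                           // 1M * 1M = 1T pairs
        .filter { case ((id1, _), (id2, _)) => id1 != id2 }
        .map { case ((id1, v1), (id2, v2)) => (id1, (cosine(v1, v2), id2)) }
        .groupByKey()
        .mapValues(_.toSeq.sortBy(-_._1).take(k))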

Re: Behaviour of caching when dataset does not fit into memory

2014-04-11 Thread Mayur Rustagi
One reason could be that Spark uses scratch disk space for intermediate calculations, so as you perform calculations that data needs to be flushed before you can leverage memory for operations. A second issue could be that large intermediate data may push more of the RDD onto disk (something I see in

SVD under spark/mllib/linalg

2014-04-11 Thread wxhsdp
Hi, all the code under https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/linalg has changed. The previous matrix classes, like MatrixEntry and MatrixSVD, are all removed; instead, Breeze matrix definitions appear. Do we move to Breeze Linear Algebra when doing linear

Re: SVD under spark/mllib/linalg

2014-04-11 Thread Xiangrui Meng
It was moved to mllib.linalg.distributed.RowMatrix. With RowMatrix, you can compute column summary statistics, gram matrix, covariance, SVD, and PCA. We will provide multiplication for distributed matrices, but not in v1.0. -Xiangrui On Fri, Apr 11, 2014 at 9:12 PM, wxhsdp wxh...@gmail.com wrote:
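
A minimal sketch of that interface (the matrix contents are made up; computeSVD and the summary methods are per the MLlib API described above):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    def svdSketch(sc: SparkContext): Unit = {
      val rows = sc.parallelize(Seq(
        Vectors.dense(1.0, 2.0, 3.0),
        Vectors.dense(4.0, 5.0, 6.0)))
      val mat = new RowMatrix(rows)

      // Top-2 singular values/vectors; U is distributed, s and V are local.
      val svd = mat.computeSVD(2, computeU = true)
      println(svd.s) // the singular values
    }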

Re: Huge matrix

2014-04-11 Thread Xiaoli Li
Hi Andrew, Thanks for your suggestion. I have tried the method. I used 8 nodes and every node has 8 GB of memory. The program just stopped at a stage for several hours without any further information. Maybe I need to find a more efficient way. On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash

Re: Huge matrix

2014-04-11 Thread Reza Zadeh
Hi Xiaoli, There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf The PR is at https://github.com/apache/spark/pull/336, though it will need refactoring given the recent changes to the matrix interface in MLlib. You