hi all,
When I run the Spark Streaming NetworkWordCount example, it always
throws this exception. I don't understand why it can't connect; I haven't
restricted the port.
14/04/11 15:38:56 ERROR SocketReceiver: Error receiving data in receiver 0
java.net.ConnectException: Connection
I found it: I should run `nc -lk` first and then run
NetworkWordCount.
Thanks
On Fri, Apr 11, 2014 at 4:13 PM, Schein, Sagi sagi.sch...@hp.com wrote:
I would check the DNS setting.
Akka seems to pick configuration from FQDN on my system
Sagi
*From:* Hahn Jiang
Hi Pierre,
1. cache() costs time to bring data from disk into memory, so please do
not use cache() unless your job is iterative.
2. If your dataset is larger than the available memory, a replacement
strategy exchanges data between memory and disk.
2014-04-11 0:07
Hi,
Testing in mapTriplets whether a vertex attribute (defined as Integer in the
first VertexRDD but later changed to Double by mapVertices) is
greater than a number throws java.lang.ClassCastException:
java.lang.Integer cannot be cast to java.lang.Double.
If second elements of vertex
Hi all,
I'm just wondering whether hybrid GPU/CPU computation is something that is
feasible with Spark, and what the best way to do it would be.
Cheers,
Jaonary
This is a bit crazy :)
I suppose you would have to run Java code on the GPU!
I heard there are some funny projects to do that...
Pascal
On Fri, Apr 11, 2014 at 2:38 PM, Jaonary Rabarisoa jaon...@gmail.comwrote:
Hi all,
I'm just wondering if hybrid GPU/CPU computation is something that is
There is a Scala implementation for GPGPUs (NVIDIA CUDA, to be precise), but
you would also need to port Mesos for GPUs; I am not sure about Mesos. Also, the
current Scala GPU version is not stable enough to be used commercially.
Hope this helps.
Thanks
saurabh.
*Saurabh Jha*
Intl. Exchange Student
School
Hi Xusen,
I was convinced the cache() method involves in-memory-only operations
and has nothing to do with disk, since the underlying default cache strategy
is MEMORY_ONLY. Am I missing something?
2014-04-11 11:44 GMT+02:00 尹绪森 yinxu...@gmail.com:
Hi Pierre,
1. cache() would cost time to
Matei, thanks. So including the PYTHONPATH in spark-env.sh seemed to work. I
am faced with this issue now. I am doing a large GroupBy in pyspark and the
process fails (at the driver it seems). There is not much of a stack trace
here to see where the issue is happening. This process works locally.
There is a handy parallelize method for running independent computations.
The examples page (http://spark.apache.org/examples.html) on the website
uses it to estimate Pi. You can join the results at the end of the parallel
calculations.
On Fri, Apr 11, 2014 at 7:52 AM, Yanzhe Chen
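The Pi example on that page works by sampling random points in the unit square and counting how many land inside the quarter circle; a plain-Python sketch of the same estimate (names here are illustrative, not the website's exact code):

```python
import random

def estimate_pi(num_samples, seed=0):
    """Monte Carlo Pi: the fraction of random points in the unit square
    that land inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples
```

In Spark, the loop body becomes a function mapped over sc.parallelize(range(num_samples)), and the per-partition counts are joined with a reduce at the end.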
Hi all,
I am now implementing a simple PageRank. Unlike the PageRank example in spark,
I divided the matrix into blocks and the rank vector into slices.
Here is my code:
https://github.com/gowithqi/PageRankOnSpark/blob/master/src/PageRank/PageRank.java
I supposed that the complexity of each
I've thought about this idea, although I haven't tried it, but I think the
right approach is to pick your granularity boundary and use Spark + the JVM for
the large-scale parts of the algorithm, then use the GPGPU API for number
crunching large chunks at a time. No need to run the JVM and Spark on the
On Fri, Apr 11, 2014 at 3:34 PM, Dean Wampler deanwamp...@gmail.com wrote:
I've thought about this idea, although I haven't tried it, but I think the
right approach is to pick your granularity boundary and use Spark + JVM for
large-scale parts of the algorithm, then use the GPGPU API for
I've actually done it using PySpark and Python libraries which call CUDA code,
though I've never done it from scala directly. The only major challenge I've
hit is assigning tasks to gpus on multiple gpu machines.
Sent from my iPhone
On Apr 11, 2014, at 8:38 AM, Jaonary Rabarisoa
In fact the idea is to run some part of the code on GPU as Patrick
described and extend the RDD structure so that it can also be distributed
on GPU's. The following article
http://www.wired.com/2013/06/andrew_ng/ describes a hybrid GPU/CPU
implementation (with MPI) that outperforms a
16,000-core
Turns out that my ulimit settings were too low. I bumped them up and the job
successfully completes. Here's what I have now:
$ ulimit -u // for max user processes
81920
$ ulimit -n // for open files
81920
I was thrown off by the OutOfMemoryError into thinking it is Spark running
out of memory in
A typo: I meant section 2.1.2.5, ulimit and nproc, of
https://hbase.apache.org/book.html
Ameet
On Fri, Apr 11, 2014 at 10:32 AM, Ameet Kini ameetk...@gmail.com wrote:
Turns out that my ulimit settings were too low. I bumped up and the job
successfully completes. Here's what I have now:
$
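For reference, that HBase book section suggests raising these limits persistently; on a typical Linux box with PAM limits, that means lines like the following in /etc/security/limits.conf (the user name and values here are illustrative, not prescribed by the thread):

```
# /etc/security/limits.conf
sparkuser  soft  nofile  81920
sparkuser  hard  nofile  81920
sparkuser  soft  nproc   81920
sparkuser  hard  nproc   81920
```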
It’s not a new API, it just happens underneath the current one if you have
spark.shuffle.spill set to true (which it is by default). Take a look at the
config settings that mention “spill” in
http://spark.incubator.apache.org/docs/latest/configuration.html.
Matei
On Apr 11, 2014, at 7:02 AM,
Excellent, thank you.
On Fri, Apr 11, 2014 at 12:09 PM, Matei Zaharia matei.zaha...@gmail.comwrote:
It's not a new API, it just happens underneath the current one if you have
spark.shuffle.spill set to true (which it is by default). Take a look at
the config settings that mention spill in
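A minimal sketch of the setting in question, in conf/spark-defaults.conf (the property names come from the configuration page linked above; the values shown are illustrative, so check that page for your version's defaults):

```
# enable on-disk spilling during shuffles (the default)
spark.shuffle.spill             true
# fraction of the heap used for in-memory shuffle data before spilling
spark.shuffle.memoryFraction    0.3
```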
Hi
I was wondering if there is an implementation of the breadth-first search
(BFS) algorithm in GraphX?
Cheers,
Ghufran
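GraphX's Pregel operator can express BFS as iterative message passing; independent of GraphX, the level-synchronous idea is just the following (a plain-Python sketch, assuming an adjacency-list dict):

```python
from collections import deque

def bfs_distances(adj, source):
    """Return a dict of shortest hop counts from source, expanding the
    frontier one level at a time."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, []):
            if v not in dist:       # first visit is the shortest path
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist
```

In GraphX the same loop would become: initialize the source's attribute to 0 and all others to infinity, then run Pregel with dist + 1 as the message and min as the merge function.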
Hi
I am calling a C++ library from Spark using JNI. Occasionally the C++
library causes the JVM to crash. The task terminates on the master, but the
driver does not return. I am not sure why the driver does not terminate. I
also notice that after such an occurrence, I lose some workers
Digging up this thread to ask a follow-up question:
What is the intended use for /root/spark/conf/core-site.xml?
It seems that both /root/spark/bin/pyspark and /root/
ephemeral-hdfs/bin/hadoop point to /root/ephemeral-hdfs/conf/core-site.xml.
If I specify S3 access keys in spark/conf, Spark
I run the following command and it correctly starts one head and one
master, but then it fails because it can't log onto the head with the
ssh key. The weird thing is that I can log onto the head with that
same public key. (ssh -i myamazonkey.pem
r...@ec2-54-86-3-208.compute-1.amazonaws.com)
is the machine booted up reachable?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Apr 11, 2014 at 12:37 PM, Alton Alexander
alexanderal...@gmail.comwrote:
I run the following command and it correctly starts one
No, not anymore, but it was at the time. Thanks, but I also just found a
thread from two days ago discussing the root and ec2-user workaround.
For now I'll just go back to using the AMI provided.
Thanks!
On Fri, Apr 11, 2014 at 1:39 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:
is the machine
I had a cluster running with a streaming driver deployed into it. I shut down
the cluster using sbin/stop-all.sh. Upon restarting (and restarting, and
restarting), the master web UI cannot respond to requests. The cluster seems to
be otherwise functional. Below is the master's log, showing
To reiterate what Tom was saying - the code that runs inside of Spark on
YARN is exactly the same code that runs in any deployment mode. There
shouldn't be any performance difference once your application starts
(assuming you are comparing apples-to-apples in terms of hardware).
The differences
Hi all,
I am implementing an algorithm using Spark. I have one million users. I
need to compute the similarity between each pair of users using some user
attributes. For each user, I need to get the top k most similar users. What is
the best way to implement this?
Thanks.
The naive way would be to put all the users and their attributes into an
RDD, then take the cartesian product of that with itself. Run the similarity score on
every pair (1M * 1M = 1T scores), map to (user, (score, otherUser)) and
take the .top(k) for each user.
I doubt that you'll be able to take this
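The pairing-and-top-k step described above can be sketched locally in plain Python (cosine similarity is assumed as the score here; in Spark, the itertools.product call below corresponds to rdd.cartesian(rdd)):

```python
import heapq
import itertools
import math

def cosine(a, b):
    """Cosine similarity between two equal-length attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_similar(users, k):
    """users: dict of user -> attribute vector.
    Returns user -> list of (score, other_user), highest score first."""
    best = {u: [] for u in users}
    for u, v in itertools.product(users, users):  # the 'cartesian' step
        if u == v:
            continue
        heapq.heappush(best[u], (cosine(users[u], users[v]), v))
        if len(best[u]) > k:
            heapq.heappop(best[u])  # drop the current worst of the k kept
    return {u: sorted(pairs, reverse=True) for u, pairs in best.items()}
```

The bounded heap per user is what .top(k) does per key on the scored pairs; the quadratic blow-up (1M * 1M pairs) is exactly why this naive version does not scale.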
One reason could be that Spark uses scratch disk space for intermediate
calculations, so as you perform calculations, that data needs to be flushed
before you can leverage memory for operations.
A second issue could be that large intermediate data may push more RDD data
onto disk (something I see in
Hi, all
the code under
https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/linalg
has changed. The previous matrix classes, like MatrixEntry and
MatrixSVD, are all removed; instead, Breeze matrix definitions appear. Are we
moving to Breeze linear algebra when doing linear
It was moved to mllib.linalg.distributed.RowMatrix. With RowMatrix,
you can compute column summary statistics, gram matrix, covariance,
SVD, and PCA. We will provide multiplication for distributed matrices,
but not in v1.0. -Xiangrui
On Fri, Apr 11, 2014 at 9:12 PM, wxhsdp wxh...@gmail.com wrote:
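For context, the column summary statistics mentioned for RowMatrix boil down to per-column aggregates over the rows; a plain-Python sketch of mean and variance per column (this is just the idea, not the MLlib API, and it uses the population variance for simplicity):

```python
def column_summary(rows):
    """rows: list of equal-length numeric lists (a dense row matrix).
    Returns a list of (mean, variance) per column; variance uses n in
    the denominator (population variance)."""
    n = len(rows)
    num_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(num_cols)]
    variances = [
        sum((r[j] - means[j]) ** 2 for r in rows) / n
        for j in range(num_cols)
    ]
    return list(zip(means, variances))
```

In MLlib this runs distributed: each partition aggregates partial sums and sums of squares, and a reduce merges them, so no row ever leaves its worker.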
Hi Andrew,
Thanks for your suggestion. I have tried the method. I used 8 nodes, each
with 8G of memory. The program just stopped at a stage for
several hours without any further information. Maybe I need to find
a more efficient way.
On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash
Hi Xiaoli,
There is a PR currently in progress to allow this, via the sampling scheme
described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
The PR is at https://github.com/apache/spark/pull/336 though it will need
refactoring given the recent changes to matrix interface in MLlib. You