Dear all,
I'm testing double-precision matrix multiplication in Spark on EC2
m1.large machines.
I use the Breeze linalg library, which internally calls a native
library (single-threaded OpenBLAS built for Nehalem).
m1.large:
model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
cpu MHz :
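For context, a minimal sketch of this kind of Breeze double-precision multiplication timing (the matrix size and harness are illustrative):

import breeze.linalg._

object GemmTiming {
  def main(args: Array[String]): Unit = {
    val n = 2048
    // Random double-precision matrices; the multiply dispatches to native
    // BLAS (e.g. OpenBLAS) when the native bindings can be loaded.
    val a = DenseMatrix.rand(n, n)
    val b = DenseMatrix.rand(n, n)

    val start = System.nanoTime()
    val c = a * b
    val seconds = (System.nanoTime() - start) / 1e9
    // A dense n x n multiply costs roughly 2 * n^3 floating-point operations.
    val gflops = 2.0 * n * n * n / seconds / 1e9
    println(f"n=$n: $seconds%.3f s, $gflops%.2f GFLOPS (c(0,0)=${c(0, 0)})")
  }
}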
What we are doing is:
1. Installing Spark 0.9.1 according to the documentation on the website,
along with the CDH4 (and, on another cluster, CDH5) distros of Hadoop/HDFS.
2. Building a fat jar of a Spark app with sbt, then trying to run it on the
cluster (a build sketch follows below).
I've also included code snippets and the sbt build definition.
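A minimal sketch of what such an sbt build definition might look like (Spark 0.9.1 was built against Scala 2.10; everything else here is illustrative, and the fat jar itself comes from a plugin such as sbt-assembly):

// build.sbt (sketch)
name := "spark-app"

version := "0.1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // "provided" keeps Spark out of the fat jar, since the cluster already supplies it.
  "org.apache.spark" %% "spark-core" % "0.9.1" % "provided"
)

// With the sbt-assembly plugin configured, `sbt assembly` produces the fat jar.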
Thanks, Jay. I honestly think I just had a senior moment or something. I was
getting HiBench and YCSB confused. Has anyone attempted to port HiBench to
Spark? HiBench performs a lot of map/reduce, and it would be a very
interesting comparison for us.
Where's your driver code (the code interacting with the RDDs)? Are you
getting serialization errors?
On Saturday, May 17, 2014, Samarth Mailinglist <mailinglistsama...@gmail.com> wrote:
Hi all,
I am trying to store the results of a reduce into mongo.
I want to share the variable collection in the
Thanks for the info about adding/removing nodes dynamically. That's
valuable.
On Friday, May 16, 2014, Akhil Das <ak...@sigmoidanalytics.com> wrote:
Hi Han :)
1. Is there a way to automatically re-spawn Spark workers? We have
situations where an executor OOM causes the worker process to be DEAD and it does
A better way would be to use Mesos (and quite possibly YARN in 1.0.0).
That will allow you to add nodes on the fly and leverage them for Spark.
Frankly, standalone mode is not meant to handle those issues. That said, we
use our own deployment tool, as stopping the cluster to add nodes is not
really an issue.
Ideally you have to pass the MongoClient object along with your data in the
mapper (Python should try to serialize your MongoClient, but explicit is
better).
If the client is serializable then all should end well; if not, then you are
better off using mapPartitions, initializing the driver in each partition.
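The question above is about PySpark, but the per-partition initialization pattern is the same in any API. A minimal Scala sketch using foreachPartition, with a stand-in class in place of a real MongoDB driver (the connection string and record layout are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

// Stand-in for a real database client (e.g. the MongoDB driver); it only
// prints, so the sketch stays self-contained.
class FakeDocumentStore(uri: String) {
  def insert(key: String, count: Long): Unit = println(s"$uri <- ($key, $count)")
  def close(): Unit = ()
}

object SaveCountsPerPartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SaveCountsPerPartition"))

    val counts = sc.textFile("hdfs:///data/input")
      .flatMap(_.split(" "))
      .map((_, 1L))
      .reduceByKey(_ + _)

    // Open one connection per partition on the executor, instead of trying
    // to serialize a client that was created on the driver.
    counts.foreachPartition { records =>
      val store = new FakeDocumentStore("mongodb://host:27017")
      records.foreach { case (word, n) => store.insert(word, n) }
      store.close()
    }

    sc.stop()
  }
}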
The real question is why you are looking to consume the file as a stream:
1. Is it too big to load as an RDD?
2. Do you need to operate on it in a sequential manner?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Sat, May 17, 2014 at 5:12 AM, Soumya Simanta
@Soumya Simanta
Right now it's just a proof of concept. Later I will have a real stream. It's EEG
files of the brain. Later it can be used for real-time analysis of EEG streams.
@Mayur
The size is huge, yes. So it's better to do it in a distributed manner, and as I said
above, I want to read it as a stream.
I have had a lot of success with Spark on large datasets,
both in terms of performance and flexibility.
However, I hit a wall with reduceByKey when the RDD contains billions of
items.
I am reducing with simple functions like addition to build histograms,
so the reduction process should be
Hi, I want to do some benchmarking tests (run time and memory) for one of the
GraphX examples, let's say PageRank, on my single-processor PC to start with.
a) Is there a way to get the total time taken for the execution from start
to finish? (See the sketch below.)
b) Which log4j properties need to be modified to turn off logging?
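For (a), a minimal sketch that simply brackets the PageRank call with timers (the edge-list path and iteration count are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

object PageRankTiming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PageRankTiming").setMaster("local[1]")
    val sc = new SparkContext(conf)
    // Hypothetical edge-list file: one "srcId dstId" pair per line.
    val graph = GraphLoader.edgeListFile(sc, "data/edges.txt")

    val start = System.nanoTime()
    // 10 PageRank iterations; count() forces the lazy computation to run.
    val numVertices = graph.staticPageRank(10).vertices.count()
    val seconds = (System.nanoTime() - start) / 1e9
    println(s"PageRank over $numVertices vertices took $seconds seconds")

    sc.stop()
  }
}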
Daniel,
How many partitions do you have?
Are they more or less uniformly distributed?
We have similar data volume currently running well on Hadoop MapReduce with
roughly 30 nodes.
I was planning to test it with Spark.
I'm very interested in your findings.
-
Madhu
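If partitioning turns out to be the issue, a minimal sketch of passing an explicit partition count to reduceByKey (the input path, record format, and the figure of 2000 partitions are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object HistogramReduce {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HistogramReduce"))

    // Hypothetical input: one tab-separated "bucket<TAB>count" pair per line.
    val pairs = sc.textFile("hdfs:///data/pairs").map { line =>
      val fields = line.split("\t")
      (fields(0), fields(1).toLong)
    }

    // Pass an explicit partition count so billions of records are spread over
    // many reduce tasks instead of a handful of very large ones.
    val histogram = pairs.reduceByKey(_ + _, 2000)

    histogram.saveAsTextFile("hdfs:///data/histogram")
    sc.stop()
  }
}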
@Laeeq - please see this example.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala#L47-L49
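In the spirit of that example, a minimal sketch of streaming text files from a directory (the directory path and batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectoryStreamCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectoryStreamCount")
    // 10-second batches; tune to how quickly new files arrive.
    val ssc = new StreamingContext(conf, Seconds(10))

    // Files atomically moved into this directory become part of the stream.
    val lines = ssc.textFileStream("hdfs:///data/eeg/incoming")
    val counts = lines.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}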
I think maybe it's related to m1.large, because I also tested on my laptop,
and the two cases cost nearly
the same amount of time.
my laptop:
model name : Intel(R) Core(TM) i5-3380M CPU @ 2.90GHz
cpu MHz : 2893.549
os:
Linux ubuntu 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC
Hi,
I'm new to Spark and I wanted to understand a few things conceptually so that I
can optimize my Spark job. I have a large text file (~14 GB, 200k lines). This
file is available on each worker node of my Spark cluster. The job I run calls
sc.textFile(...).flatMap(...). The function that I
You need to include breeze-natives or netlib:all to load the native
libraries. Check the log messages to ensure the native libraries are used,
especially on the worker nodes. The easiest way to use OpenBLAS is to
copy the shared library to /usr/lib/libblas.so.3 and
/usr/lib/liblapack.so.3. -Xiangrui
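A minimal sketch of the sbt dependencies that pull in the native bindings (the version is illustrative; match it to the Breeze version already in use):

// build.sbt (fragment)
libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze"         % "0.8.1",
  // Pulls in the netlib-java loaders so a system BLAS/LAPACK (e.g. OpenBLAS) can be used.
  "org.scalanlp" %% "breeze-natives" % "0.8.1"
)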