Hi
I am running a simple word count program on a Spark standalone cluster.
The cluster is made up of 6 nodes; each node runs 4 workers, each worker owns 10G
memory, and each node has 16 cores, so 96 cores and 240G memory in total. (Well, it also used to be
configured as 1 worker with 40G memory on each node.)
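For reference, that layout corresponds to spark-env.sh settings like the following on each node; a sketch using just the numbers above, and assuming the node's 16 cores are split evenly across the 4 workers:

SPARK_WORKER_INSTANCES=4   # 4 worker processes per node
SPARK_WORKER_MEMORY=10g    # memory offered by each worker
SPARK_WORKER_CORES=4       # assumed even split of the node's 16 cores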
That is definitely a
bit strange; the data gets compressed when written to disk, but unless you have
a weird dataset (e.g. all zeros) I wouldn't expect it to compress _that_ much.
On Mon, Apr 28, 2014 at 1:18 AM, Liu, Raymond raymond@intel.com wrote:
Hi
I am running a simple word count program
per word occurrence.
On Tue, Apr 29, 2014 at 7:48 AM, Liu, Raymond raymond@intel.com wrote:
Hi Patrick
I am just doing a simple word count; the data is generated by the Hadoop
random text writer.
This seems to me not quite related to compression. If I turn off compression
on shuffle
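For anyone trying the same experiment: shuffle compression is toggled through the conf. A minimal sketch, assuming everything else is left at its defaults:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.compress", "false")        // stop compressing map output files
  .set("spark.shuffle.spill.compress", "false")  // stop compressing data spilled during shuffles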
Hi
I am running a WordCount program which counts words from HDFS, and I
noticed that the serializer part of the code takes a lot of CPU time. On a
16 core/32 thread node, the total throughput is around 50MB/s with JavaSerializer,
and if I switch to KryoSerializer, it doubles to around 100MB/s.
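Switching serializers is a small conf change. A minimal sketch, where the app name and the Record class are hypothetical placeholders:

import org.apache.spark.{SparkConf, SparkContext}

case class Record(word: String, count: Int)  // hypothetical record type to register

val conf = new SparkConf()
  .setAppName("WordCountKryo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Record]))  // Spark 1.2+; lets Kryo write small class IDs instead of full class names
val sc = new SparkContext(conf)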
For all the tasks, say 32 tasks in total
Best Regards,
Raymond Liu
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Is this the serialization throughput per task or the serialization throughput
for all the tasks?
On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond
directly instead of reading from HDFS, with similar throughput results)
Best Regards,
Raymond Liu
-Original Message-
From: Liu, Raymond [mailto:raymond@intel.com]
For all the tasks, say 32 tasks in total
Best Regards,
Raymond Liu
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Is this the serialization throughput per task or the serialization throughput for all the tasks?
So it seems to me that when running the full code path in my
previous case, 32 cores with 50MB/s total throughput is reasonable?
Best Regards,
Raymond Liu
-Original Message-
From: Liu, Raymond [mailto:raymond@intel.com]
The latter case: the total throughput aggregated from all cores.
Across the cores, they are not that different.
In standalone mode, you have the Spark master and Spark workers, which allocate the driver
and executors for your Spark app.
In YARN mode, the YARN resource manager and node managers do this work.
Once the driver and executors have been launched, the rest of the application runs in the same way.
It seems you are asking whether the Spark-related jars need to be deployed to the YARN
cluster manually before you launch the application?
Then no, you don't, just like any other YARN application. And it doesn't matter
whether it is yarn-client or yarn-cluster mode.
Best Regards,
Raymond Liu
-Original Message-
If some tasks have no locality preference, they will also show up as
PROCESS_LOCAL; I think we probably need to name it NO_PREFER to make it
clearer. Not sure if this is your case.
Best Regards,
Raymond Liu
From: coded...@gmail.com [mailto:coded...@gmail.com] On Behalf Of Sung Hwan
Chung
I think there is a shuffle stage involved, and the future count job will
depend directly on the first job's shuffle stage's output data, as long as it
is still available. Thus it will be much faster.
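A quick way to see this effect; a minimal sketch with a made-up input path:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ShuffleReuse"))
val counts = sc.textFile("hdfs:///tmp/words")  // hypothetical input
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _)   // introduces a shuffle stage
counts.count()          // first job: runs the map side and writes shuffle output
counts.count()          // second job: skips the map stage while the shuffle output is still available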
Best Regards,
Raymond Liu
From: tomsheep...@gmail.com [mailto:tomsheep...@gmail.com]
Sent:
You can try to manipulate the string you want to output before saveAsTextFile,
something like

rdd.flatMap(x => x).map { x =>
  val s = x.toString
  s.subSequence(1, s.length - 1)
}

There should be a more optimized way.
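If the records are tuples, formatting the fields directly avoids the toString-then-trim round trip; a rough sketch, assuming (key, value) pairs and a hypothetical output path:

rdd.map { case (k, v) => s"$k,$v" }.saveAsTextFile("hdfs:///tmp/out")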
Best Regards,
Raymond Liu
-Original Message-
From: yh18190
Basically, a Block Manager manages the storage for most of the data in Spark;
to name a few: blocks that represent a cached RDD partition, intermediate shuffle
data, broadcast data, etc. It is per executor, while in standalone mode,
normally, you have one executor per worker.
You don't control how
Best Regards,
Raymond Liu
From: Victor Tso-Guillen [mailto:v...@paxata.com]
Sent: Wednesday, August 27, 2014 1:40 PM
To: Liu, Raymond
Cc: user@spark.apache.org
Subject: Re: What is a Block Manager?
We're a single-app deployment so we want to launch as many executors as the
system has workers. We accomplish
You could use cogroup to combine two RDDs into one RDD for cross-reference processing.
e.g.

a.cogroup(b)
  .filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }
  .map { case (k, (l, r)) => (k, l) }
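For concreteness, a runnable sketch with made-up sample data; only keys present in both RDDs survive, paired with a's values:

val a = sc.parallelize(Seq((1, "a1"), (2, "a2")))
val b = sc.parallelize(Seq((2, "b2"), (3, "b3")))
val crossRef = a.cogroup(b)
  .filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }
  .map { case (k, (l, _)) => (k, l) }
crossRef.collect()  // only key 2 remains, carrying a's values for that key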
Best Regards,
Raymond Liu
-Original Message-
From: marylucy [mailto:qaz163wsx_...@hotmail.com]
Sent:
Not sure what you referred to when saying replicated RDD. If you actually mean
RDD, then yes, read the API doc and paper as Tobias mentioned.
If you actually focus on the word replicated, then that is for fault
tolerance, and is probably mostly used in the streaming case for receiver-created data.
AFAIK, No.
Best Regards,
Raymond Liu
From: 牛兆捷 [mailto:nzjem...@gmail.com]
Sent: Thursday, September 04, 2014 11:30 AM
To: user@spark.apache.org
Subject: resize memory size for caching RDD
Dear all:
Spark uses memory to cache RDDs, and the memory size is specified by spark.storage.memoryFraction.
Actually, a replicated RDD and a parallel job on the same RDD are two
concepts that are not related at all.
A replicated RDD just stores its data on multiple nodes; it helps with HA and provides
a better chance for data locality. It is still one RDD, not two separate RDDs.
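In API terms, replication is chosen through the storage level when persisting; a minimal sketch, with a hypothetical RDD named data:

import org.apache.spark.storage.StorageLevel

data.persist(StorageLevel.MEMORY_ONLY_2)  // each cached partition is replicated on two nodes
data.count()                              // first action materializes the replicated cache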
While regarding running two jobs on the same RDD
You don't need to. The memory is not statically allocated to the RDD cache; it is just an upper
limit.
If you don't use up the memory with the RDD cache, it is always available for other
usage, except for those parts also controlled by some memoryFraction conf, e.g.
spark.shuffle.memoryFraction, for which you also set the upper limit.
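To make it concrete, both limits are plain conf entries; a sketch showing the Spark 1.x-era keys with their default values:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.6")  // upper limit for the RDD cache (default 0.6)
  .set("spark.shuffle.memoryFraction", "0.2")  // upper limit for shuffle aggregation buffers (default 0.2)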
Regards,
Raymond Liu
From: 牛兆捷 [mailto:nzjem...@gmail.com]
Sent: Thursday, September 04, 2014 2:57 PM
To: Liu, Raymond
Cc: Patrick Wendell; user@spark.apache.org; d...@spark.apache.org
Subject: Re: memory size for caching RDD
Oh I see.
I want to implement something like this: sometimes I need