Shuffle Spill Issue

2014-04-28 Thread Liu, Raymond
Hi, I am running a simple word count program on a Spark standalone cluster. The cluster is made up of 6 nodes, each running 4 workers, and each worker owns 10G memory and 16 cores, thus 96 cores and 240G memory in total. (Well, it was also previously configured as 1 worker with 40G memory on each node.)
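For reference, a minimal Scala word count of the kind described might look like the sketch below (the master URL and HDFS paths are placeholders, not the original job's settings):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // brings reduceByKey onto pair RDDs in Spark 1.x

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Placeholder master URL and paths; adjust for your own standalone cluster.
        val conf = new SparkConf().setMaster("spark://master-host:7077").setAppName("WordCount")
        val sc = new SparkContext(conf)

        sc.textFile("hdfs:///path/to/input")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)                 // the shuffle stage where spill can occur
          .saveAsTextFile("hdfs:///path/to/output")

        sc.stop()
      }
    }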

RE: Shuffle Spill Issue

2014-04-28 Thread Liu, Raymond
is definitely a bit strange, the data gets compressed when written to disk, but unless you have a weird dataset (e.g. all zeros) I wouldn't expect it to compress _that_ much. On Mon, Apr 28, 2014 at 1:18 AM, Liu, Raymond raymond@intel.com wrote: Hi, I am running a simple word count program

RE: Shuffle Spill Issue

2014-04-29 Thread Liu, Raymond
per word occurrence. On Tue, Apr 29, 2014 at 7:48 AM, Liu, Raymond raymond@intel.com wrote: Hi Patrick, I am just doing a simple word count; the data is generated by the hadoop random text writer. This does not seem to me quite related to compression. If I turn off compress on shuffle
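For context, the shuffle compression settings being discussed can be toggled roughly like this (a sketch; the values shown are only for illustration):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.compress", "false")        // compress map output files
      .set("spark.shuffle.spill.compress", "false")  // compress data spilled to disk during shuffles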

How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
Hi, I am running a WordCount program which counts words from HDFS, and I noticed that the serializer part of the code takes a lot of CPU time. On a 16-core/32-thread node, the total throughput is around 50MB/s with JavaSerializer, and if I switch to KryoSerializer, it doubles to around
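For reference, switching serializers is a one-line configuration change; a minimal sketch (the application name is a placeholder):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("WordCount")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Registering frequently serialized classes (e.g. via spark.kryo.registrator)
    // can further reduce serialized size and CPU cost.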

RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
For all the tasks, say 32 tasks in total. Best Regards, Raymond Liu -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Is this the serialization throughput per task or the serialization throughput for all the tasks? On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond

RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
directly instead of reading from HDFS, similar throughput result) Best Regards, Raymond Liu -Original Message- From: Liu, Raymond [mailto:raymond@intel.com] For all the tasks, say 32 tasks in total. Best Regards, Raymond Liu -Original Message- From: Patrick Wendell

RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
, 2014 at 10:14 PM, Liu, Raymond raymond@intel.com wrote: For all the tasks, say 32 tasks in total. Best Regards, Raymond Liu -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Is this the serialization throughput per task or the serialization throughput

RE: How fast would you expect shuffle serialize to be?

2014-04-30 Thread Liu, Raymond
So it seems to me that when running the full code path in my previous case, 32 cores with 50MB/s total throughput is reasonable? Best Regards, Raymond Liu -Original Message- From: Liu, Raymond [mailto:raymond@intel.com] Latter case, total throughput aggregated from all cores
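As a back-of-the-envelope check (assuming the load is spread evenly): 50 MB/s aggregated over 32 cores is roughly 50 / 32 ≈ 1.6 MB/s of serialized output per core.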

RE: different in spark on yarn mode and standalone mode

2014-05-04 Thread Liu, Raymond
At the core, they are not very different. In standalone mode, you have the Spark master and Spark workers, which allocate the driver and executors for your Spark app. In YARN mode, the YARN ResourceManager and NodeManagers do this work. Once the driver and executors have been launched, the rest of
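A minimal sketch of the difference from the application's point of view: only the master URL changes (the host name below is a placeholder).

    import org.apache.spark.SparkConf

    // Standalone mode: the Spark master and workers allocate the driver and executors.
    val standaloneConf = new SparkConf().setAppName("MyApp").setMaster("spark://master-host:7077")

    // YARN mode: the ResourceManager and NodeManagers do the same job.
    val yarnConf = new SparkConf().setAppName("MyApp").setMaster("yarn-client")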

RE: yarn-client mode question

2014-05-21 Thread Liu, Raymond
It seems you are asking whether the Spark-related jars need to be deployed to the YARN cluster manually before you launch the application? Then no, you don't, just like any other YARN application. And it doesn't matter whether it is yarn-client or yarn-cluster mode. Best Regards, Raymond Liu -Original

RE: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Liu, Raymond
If a task has no locality preference, it will also show up as PROCESS_LOCAL; I think we probably need to name it NO_PREFER to make it clearer. Not sure if this is your case. Best Regards, Raymond Liu From: coded...@gmail.com [mailto:coded...@gmail.com] On Behalf Of Sung Hwan Chung
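Related: the time the scheduler waits at one locality level before falling back to the next is configurable; a sketch (the value is illustrative, in milliseconds for the 1.x releases of that time):

    import org.apache.spark.SparkConf

    // How long to wait for a PROCESS_LOCAL slot before trying NODE_LOCAL, then RACK_LOCAL, then ANY.
    val conf = new SparkConf().set("spark.locality.wait", "3000")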

RE: About StorageLevel

2014-06-26 Thread Liu, Raymond
I think there is a shuffle stage involved. And the future count job will depend on the first job's shuffle stage's output data directly, as long as it is still available. Thus it will be much faster. Best Regards, Raymond Liu From: tomsheep...@gmail.com [mailto:tomsheep...@gmail.com] Sent:
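A small sketch of the pattern being described (spark-shell style, path is a placeholder): the second count job can skip the map stage and read the first job's shuffle output directly, as long as those shuffle files are still around.

    val counts = sc.textFile("hdfs:///path/to/input")
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)   // shuffle stage

    counts.count()   // first job: runs the map stage and the shuffle
    counts.count()   // later job: reuses the existing shuffle output, so it is much faster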

RE: Request for help in writing to Textfile

2014-08-25 Thread Liu, Raymond
You can try to manipulate the string you want to output before saveAsTextFile, something like modify.flatMap(x => x).map { x => val s = x.toString; s.subSequence(1, s.length - 1) }. There should be a more optimized way. Best Regards, Raymond Liu -Original Message- From: yh18190
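If the records are key/value pairs and the goal is just to drop the surrounding parentheses that Tuple2.toString adds, a more direct variant is to format the line yourself (a sketch; pairs and the output path are placeholders):

    val lines = pairs.map { case (k, v) => s"$k,$v" }   // build the output line explicitly
    lines.saveAsTextFile("hdfs:///path/to/output")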

RE: What is a Block Manager?

2014-08-26 Thread Liu, Raymond
Basically, a Block Manager manages the storage for most of the data in Spark, to name a few: blocks that represent cached RDD partitions, intermediate shuffle data, broadcast data, etc. It is per executor, while in standalone mode you normally have one executor per worker. You don't control how
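One way to see the per-executor block managers from the driver (a sketch in the spark-shell, using the 1.x SparkContext.getExecutorStorageStatus developer API):

    // One entry per block manager: each executor has one, plus one for the driver.
    sc.getExecutorStorageStatus.foreach { status =>
      println(s"executor ${status.blockManagerId.executorId} on ${status.blockManagerId.host}: max memory ${status.maxMem} bytes")
    }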

RE: What is a Block Manager?

2014-08-27 Thread Liu, Raymond
, Raymond Liu From: Victor Tso-Guillen [mailto:v...@paxata.com] Sent: Wednesday, August 27, 2014 1:40 PM To: Liu, Raymond Cc: user@spark.apache.org Subject: Re: What is a Block Manager? We're a single-app deployment so we want to launch as many executors as the system has workers. We accomplish

RE: how to filter value in spark

2014-08-31 Thread Liu, Raymond
You could use cogroup to combine the RDDs into one RDD for cross-reference processing, e.g. a.cogroup(b).filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }.map { case (k, (l, r)) => (k, l) }. Best Regards, Raymond Liu -Original Message- From: marylucy [mailto:qaz163wsx_...@hotmail.com] Sent:
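Filled out into a tiny self-contained example (spark-shell style; the sample data is invented for illustration): keep only the keys that appear in both RDDs, carrying along the values from a.

    val a = sc.parallelize(Seq((1, "a1"), (2, "a2"), (3, "a3")))
    val b = sc.parallelize(Seq((2, "b2"), (3, "b3")))

    val matched = a.cogroup(b)
      .filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }  // key present in both RDDs
      .map { case (k, (l, r)) => (k, l) }                       // keep a's values only

    // result: keys 2 and 3, each paired with the matching values from a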

RE: RDDs

2014-09-03 Thread Liu, Raymond
Not sure what you were referring to by replicated RDD. If you actually mean RDD, then yes, read the API doc and the paper as Tobias mentioned. If you actually focus on the word replicated, then that is for fault tolerance, and it is probably mostly used in the streaming case for receiver created

RE: resize memory size for caching RDD

2014-09-03 Thread Liu, Raymond
AFAIK, No. Best Regards, Raymond Liu From: 牛兆捷 [mailto:nzjem...@gmail.com] Sent: Thursday, September 04, 2014 11:30 AM To: user@spark.apache.org Subject: resize memory size for caching RDD Dear all: Spark uses memory to cache RDD and the memory size is specified by

RE: RDDs

2014-09-04 Thread Liu, Raymond
Actually, a replicated RDD and a parallel job on the same RDD are two unrelated concepts. A replicated RDD just stores data on multiple nodes; it helps with HA and provides a better chance for data locality. It is still one RDD, not two separate RDDs. As for running two jobs on
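At the API level, "replicated" corresponds to the _2 storage levels, which keep each cached partition on two nodes; a sketch (spark-shell style, the path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("hdfs:///path/to/input")
    data.persist(StorageLevel.MEMORY_ONLY_2)   // each cached partition is stored on two nodes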

RE: memory size for caching RDD

2014-09-04 Thread Liu, Raymond
You don't need to. It is not statically allocated to the RDD cache; it is just an upper limit. If the RDD cache does not use up the memory, it is always available for other usage, except for those portions also controlled by some memoryFraction conf, e.g. spark.shuffle.memoryFraction, for which you also set the up
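The settings mentioned, as they might appear in code (a sketch; the values are illustrative, not recommendations):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.storage.memoryFraction", "0.6")   // upper limit for the RDD cache
      .set("spark.shuffle.memoryFraction", "0.2")   // upper limit for in-memory shuffle buffers before spilling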

RE: memory size for caching RDD

2014-09-04 Thread Liu, Raymond
Regards, Raymond Liu From: 牛兆捷 [mailto:nzjem...@gmail.com] Sent: Thursday, September 04, 2014 2:57 PM To: Liu, Raymond Cc: Patrick Wendell; user@spark.apache.org; d...@spark.apache.org Subject: Re: memory size for caching RDD Oh I see. I want to implement something like this: sometimes I need