Re: RE: Spark checkpoint problem

2015-11-26 Thread eric wong
I don't think it is a deliberate design.

So you may need to run an action on the data RDD before the action on the
temp RDD, if you want to explicitly checkpoint the data RDD.
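
For example, a minimal spark-shell sketch of that ordering (reusing the names
from the code quoted below; not verified here):

  scala> val data = sc.parallelize(List("a", "b", "c"))
  scala> sc.setCheckpointDir("/tmp/checkpoint")
  scala> data.checkpoint
  scala> data.count   // action directly on data, so data gets checkpointed
  scala> val temp = data.map(item => (item, 1))
  scala> temp.checkpoint
  scala> temp.count   // action on temp, so temp gets checkpointed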


2015-11-26 13:23 GMT+08:00 wyphao.2007 :

> Spark 1.5.2.
>
> On 2015-11-26 13:19:39, "张志强(旺轩)" wrote:
>
> What's your Spark version?
>
> *From:* wyphao.2007 [mailto:wyphao.2...@163.com]
> *Sent:* November 26, 2015 10:04
> *To:* user
> *Cc:* d...@spark.apache.org
> *Subject:* Spark checkpoint problem
>
> I am testing checkpoint to understand how it works. My code is as follows:
>
>
>
> scala> val data = sc.parallelize(List("a", "b", "c"))
>
> data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at
> parallelize at <console>:15
>
>
>
> scala> sc.setCheckpointDir("/tmp/checkpoint")
>
> 15/11/25 18:09:07 WARN spark.SparkContext: Checkpoint directory must be
> non-local if Spark is running on a cluster: /tmp/checkpoint1
>
>
>
> scala> data.checkpoint
>
>
>
> scala> val temp = data.map(item => (item, 1))
>
> temp: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map
> at <console>:17
>
>
>
> scala> temp.checkpoint
>
>
>
> scala> temp.count
>
>
>
> but I found that only the temp RDD is checkpointed in the /tmp/checkpoint
> directory; the data RDD is not checkpointed! I found the doCheckpoint
> function in the org.apache.spark.rdd.RDD class:
>
>
>
>   private[spark] def doCheckpoint(): Unit = {
>     RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false,
>         ignoreParent = true) {
>       if (!doCheckpointCalled) {
>         doCheckpointCalled = true
>         if (checkpointData.isDefined) {
>           checkpointData.get.checkpoint()
>         } else {
>           dependencies.foreach(_.rdd.doCheckpoint())
>         }
>       }
>     }
>   }
>
>
>
> From the code above, only the last RDD (in my case, temp) will be
> checkpointed. My question: is this deliberately designed, or is it a bug?
>
>
>
> Thank you.



-- 
王海华


Re: when cached RDD will unpersist its data

2015-06-23 Thread eric wong
If memory cannot hold all of the cached RDD blocks, the BlockManager will
evict some older blocks to make room for the new RDD blocks.


Hope that helps.
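
For example, a minimal spark-shell sketch of the explicit route (the data and
numbers below are just placeholders):

  import org.apache.spark.storage.StorageLevel

  val cached = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY)
  cached.count()      // materializes the blocks in executor memory
  // ... reuse `cached` across jobs ...
  cached.unpersist()  // frees the blocks now; otherwise they stay until evicted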

2015-06-24 13:22 GMT+08:00 bit1...@163.com bit1...@163.com:

 I am kind of confused about when a cached RDD will unpersist its data. I
 know we can explicitly unpersist it with RDD.unpersist, but can it be
 unpersisted automatically by the Spark framework?
 Thanks.

 --
 bit1...@163.com




-- 
王海华


How to set DEBUG level log of spark executor on Standalone deploy mode

2015-04-29 Thread eric wong
Hi,
I want to check the DEBUG log of the Spark executor in Standalone deploy mode,
but:
1. Set log4j.properties in the spark/conf folder on the master node and
restarted the cluster.
2. Used spark-submit --properties-file log4j.properties.
Neither of the above works; the second just prints debug logs to the screen,
while the executor logs still seem to be at INFO level.

So how could I set the log level of the Spark executor on Standalone to DEBUG?
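
(For reference, a sketch of a third thing to try, assuming Spark 1.1 honors
spark.executor.extraJavaOptions in standalone mode and that the log4j file
exists at the same path on every worker node; the path and app name below are
hypothetical:)

  import org.apache.spark.{SparkConf, SparkContext}

  // Point the executor JVMs at a DEBUG-level log4j.properties.
  val conf = new SparkConf()
    .setAppName("kmeans-debug-logging")
    .setMaster("spark://master:7077")
    .set("spark.executor.extraJavaOptions",
         "-Dlog4j.configuration=file:/path/to/log4j-debug.properties")
  val sc = new SparkContext(conf)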

Env Info---
spark 1.1.0

Standalone deploy mode.

Submit shell:
 bin/spark-submit --class org.apache.spark.examples.mllib.JavaKMeans
--master spark://master:7077 --executor-memory 600m --properties-file
log4j.properties lib/spark-examples-1.1.0-hadoop2.3.0.jar
hdfs://master:8000/kmeans/data-Kmeans-5.3g 8 1


Thanks!
Wang Haihua


WebUI shows poor locality during task scheduling

2015-04-21 Thread eric wong
Hi,

When running an experimental KMeans job, the cached RDD holds the original
points data.

I saw poor locality in the task details on the WebUI: almost half of the
task input comes over the network instead of from memory.

A task with network input takes almost the same time as a task with
Hadoop (disk) input, and about twice as long as a task with memory input,
e.g.:
Task(Memory): 16s
Task(Network): 9s
Task(Hadoop): 9s


I see that fetching a 30MB RDD block from a remote node takes about 5 seconds
in the executor logs, like below:

15/03/31 04:08:52 INFO CoarseGrainedExecutorBackend: Got assigned task 58
15/03/31 04:08:52 INFO Executor: Running task 15.0 in stage 1.0 (TID 58)
15/03/31 04:08:52 INFO HadoopRDD: Input split:
hdfs://master:8000/kmeans/data-Kmeans-5.3g:2013265920+134217728
15/03/31 04:08:52 INFO BlockManager: Found block rdd_3_15 locally
15/03/31 04:08:58 INFO Executor: Finished task 15.0 in stage 1.0 (TID 58).
1920 bytes result sent to driver
15/03/31 04:08:58 INFO CoarseGrainedExecutorBackend: Got assigned task 60
-Task60
15/03/31 04:08:58 INFO Executor: Running task 17.0 in stage 1.0 (TID 60)
15/03/31 04:08:58 INFO HadoopRDD: Input split:
hdfs://master:8000/kmeans/data-Kmeans-5.3g:2281701376+134217728
15/03/31 04:09:02 INFO BlockManager: Found block rdd_3_17 remotely
15/03/31 04:09:12 INFO Executor: Finished task 17.0 in stage 1.0 (TID 60).
1920 bytes result sent to driver


So:
1) Does that mean I should persist the RDD with MEMORY_AND_DISK instead of
memory only? (See the sketch below.)
2) And should I expand network capacity, or tune the scheduling locality
parameters?
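
A sketch of those two knobs (spark-shell; the HDFS path is the one from the
executor logs above, and the 10-second locality wait is just an illustrative
value):

  import org.apache.spark.storage.StorageLevel

  // 1) Keep evicted partitions on local disk instead of refetching or
  //    recomputing them later.
  val points = sc.textFile("hdfs://master:8000/kmeans/data-Kmeans-5.3g")
    .persist(StorageLevel.MEMORY_AND_DISK)
  points.count()   // materialize the cache

  // 2) Make the scheduler wait longer for a node-local slot before falling
  //    back to a non-local one (Spark 1.1 reads the value in milliseconds;
  //    the default is 3000). Set it before the SparkContext is created, e.g.
  //    spark.locality.wait=10000 in conf/spark-defaults.conf.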


Any suggestion will be welcome.


--Env info---

Cluster: 4 workers, each with 1 core and 2 GB executor memory

Spark version: 1.1.0

Network: 30MB/s

Submit shell:
bin/spark-submit --class org.apache.spark.examples.mllib.JavaKMeans
--master spark://master:7077 --executor-memory 1g
lib/spark-examples-1.1.0-hadoop2.3.0.jar
hdfs://master:8000/kmeans/data-Kmeans-7g 8 1


Thanks very much, and please forgive my poor English.

-- 
Wang Haihua


Re: How does Spark honor data locality when allocating computing resources for an application

2015-03-14 Thread eric wong
You seem not to have noticed the configuration variable spreadOutApps.

And its comment:
  // As a temporary workaround before better ways of configuring memory, we
allow users to set
  // a flag that will perform round-robin scheduling across the nodes
(spreading out each app
  // among all the nodes) instead of trying to consolidate each app onto a
small # of nodes.
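
For reference, a runnable sketch of the user-facing side of that flag (to the
best of my recollection the standalone Master reads it as
spark.deploy.spreadOut, defaulting to true, so the packing code quoted below
only runs when it is explicitly set to false):

  import org.apache.spark.SparkConf

  // Mirrors how the Master reads the flag from its own configuration.
  val conf = new SparkConf()
  val spreadOutApps = conf.getBoolean("spark.deploy.spreadOut", true)
  println(s"spreadOutApps = $spreadOutApps")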

2015-03-14 10:41 GMT+08:00 bit1...@163.com bit1...@163.com:

 Hi, sparkers,
 When I read the code about computing resource allocation for a newly
 submitted application in the Master#schedule method, I got a question
 about data locality:

 // Pack each app into as few nodes as possible until we've assigned all
 // its cores
 for (worker <- workers if worker.coresFree > 0 && worker.state ==
     WorkerState.ALIVE) {
   for (app <- waitingApps if app.coresLeft > 0) {
     if (canUse(app, worker)) {
       val coresToUse = math.min(worker.coresFree, app.coresLeft)
       if (coresToUse > 0) {
         val exec = app.addExecutor(worker, coresToUse)
         launchExecutor(worker, exec)
         app.state = ApplicationState.RUNNING
       }
     }
   }
 }

 It looks like the resource allocation policy here is that the Master will
 assign as few workers as possible, as long as those few workers have enough
 resources for the application.
 My question is: assuming that the data the application will process is
 spread across all the worker nodes, isn't data locality lost under the
 above policy?
 Not sure whether I have understood this correctly or have missed something.


 --
 bit1...@163.com




-- 
王海华


Re: Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread eric wong
A good place to start if you read Chinese:

https://github.com/JerryLead/SparkInternals/tree/master/markdown

2015-01-07 10:13 GMT+08:00 bit1...@163.com bit1...@163.com:

 Thank you, Tobias. I will look into the Spark paper. But it looks like the
 paper has been moved:
 http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
 A "Resource not found" page is returned when I access it.

 --
 bit1...@163.com


 *From:* Tobias Pfeiffer t...@preferred.jp
 *Date:* 2015-01-07 09:24
 *To:* Todd bit1...@163.com
 *CC:* user user@spark.apache.org
 *Subject:* Re: I think I am almost lost in the internals of Spark
 Hi,

 On Tue, Jan 6, 2015 at 11:24 PM, Todd bit1...@163.com wrote:

 I am a bit new to Spark; I have only tried simple things like word count
 and the examples given in the Spark SQL programming guide.
 Now I am investigating the internals of Spark, but I think I am almost
 lost, because I cannot grasp a whole picture of what Spark does when it
 executes the word count.


 I recommend understanding what an RDD is and how it is processed, using

 http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
 and probably also
   http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
   (once the server is back).
 Understanding how an RDD is processed is probably the most helpful way to
 understand Spark as a whole.
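
 For example, the word count mentioned above boils down to a couple of RDD
 transformations plus one action; a minimal spark-shell sketch (made-up
 input):

   val lines  = sc.parallelize(Seq("a b a", "b c"))
   val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
   counts.collect()  // e.g. Array((a,2), (b,2), (c,1)) -- ordering may vary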

 Tobias




-- 
王海华


Re: how to set log level of spark executor on YARN(using yarn-cluster mode)

2014-10-20 Thread eric wong
Thanks for your reply!

Sorry, I forgot to mention that the Spark version I'm using is *Version 1.0.2*
instead of 1.1.0.

Also, the 1.0.2 documentation does not seem to be the same as 1.1.0's:
http://spark.apache.org/docs/1.0.2/running-on-yarn.html

And I tried your suggestion (uploading log4j.properties with --files), but it
did not work:

*1. Set my copy of log4j.properties like this:*

*log4j.rootCategory=DEBUG, console*


*2. Uploaded it when using the spark-submit script:*

*./bin/spark-submit --class edu.bjut.spark.SparkPageRank --master
yarn-cluster --num-executors 5 --executor-memory 2g
--executor-cores 1 /data/hadoopspark/MySparkTest.jar
hdfs://master:8000/srcdata/searchengine/* 5 5
hdfs://master:8000/resultdata/searchengine/2014102001/ *
*--files log4j.properties*

So please point out my mistake; any suggestion would be welcome.

Thanks!



2014-10-16 9:45 GMT+08:00 Marcelo Vanzin van...@cloudera.com:

 Hi Eric,

 Check the Debugging Your Application section at:
 http://spark.apache.org/docs/latest/running-on-yarn.html

 Long story short: upload your log4j.properties using the --files
 argument of spark-submit.

 (Mental note: we could make the log level configurable via a system
 property...)


 On Wed, Oct 15, 2014 at 5:58 PM, eric wong win19...@gmail.com wrote:
  Hi,
 
  I want to check the DEBUG log of the Spark executor on YARN (using
  yarn-cluster mode), but:
 
  1. yarn daemonlog setlevel DEBUG YarnChild.class
  2. set log4j.properties in the spark/conf folder on the client node.
 
  Neither of the above works.
 
  So how could I set the log level of the Spark executor in a YARN container
  to DEBUG?
 
  Thanks!
 
 
 
 
  --
  Wang Haihua
 



 --
 Marcelo




-- 
王海华


how to submit multiple jar files when using spark-submit script in shell?

2014-10-17 Thread eric wong
Hi,

I am using the comma-separated style to submit multiple jar files with the
following shell command, but it does not work:

bin/spark-submit --class org.apache.spark.examples.mllib.JavaKMeans
--master yarn-cluster --executor-memory 2g *--jars
lib/spark-examples-1.0.2-hadoop2.2.0.jar,lib/spark-mllib_2.10-1.0.0.jar
*hdfs://master:8000/srcdata/kmeans
8 4


Thanks!


-- 
WangHaihua


how to set log level of spark executor on YARN(using yarn-cluster mode)

2014-10-15 Thread eric wong
Hi,

I want to check the DEBUG log of the Spark executor on YARN (using
yarn-cluster mode), but:

1. yarn daemonlog setlevel DEBUG YarnChild.class
2. set log4j.properties in the spark/conf folder on the client node.

Neither of the above works.

So how could I set the log level of the Spark executor *on a YARN container to
DEBUG*?

Thanks!




-- 
Wang Haihua