[ 
https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangdenghui updated SPARK-19007:
---------------------------------
    Description: 
Test data: 80 GB of CTR training data from Criteo Labs
(http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/);
I used 1 of the 24 days' data. Some features needed to be replaced by newly
generated continuous features; the way these features were generated follows
the approach described in the XGBoost paper.

Resources allocated: Spark on YARN, 20 executors, 8 GB memory and 2 cores per
executor.

Parameters: numIterations = 10, maxDepth = 8; the remaining parameters are left
at their defaults.
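
For reference, a minimal sketch of this setup against the standard MLlib API
(the input path and the LibSVM loading are my assumptions; the report does not
say how the data was loaded):

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.util.MLUtils

// Hypothetical input path and format; adjust to however the data was prepared.
val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/criteo_day_0")

val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 10          // as in the test above
boostingStrategy.treeStrategy.maxDepth = 8   // as in the test above

val model = GradientBoostedTrees.train(data, boostingStrategy)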

I tested the GradientBoostedTrees algorithm in MLlib using the 80 GB CTR data
mentioned above.

The whole run took about 1.5 hours, and I found many task failures after 6 or 7
GBT rounds. Without these task failures and the resulting task retries it could
be much faster, saving about half the time. I think the failures are caused by
the RDD named predError in the while loop of the boost method in
GradientBoostedTrees.scala: the lineage of predError grows after every GBT
round, and this eventually causes failures like this:

(ExecutorLostFailure (executor 6 exited caused by one of the running tasks) 
Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.).  

I tried boosting spark.yarn.executor.memoryOverhead, but the amount of memory
needed is too large (even increasing the memory by half does not solve the
problem), so I don't think that is a proper fix.
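
For example, the overhead can be raised through SparkConf (the 4096 MB value is
only illustrative):

import org.apache.spark.SparkConf

// Illustrative only: raise the per-executor YARN memory overhead (value in MB).
// In this case even much larger values did not make the failures go away.
val conf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "4096")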

The checkpoint interval (checkpointInterval) can also be set smaller to cut the
lineage, but that increases the IO cost a lot.
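
For example, continuing the sketch above (the directory and the interval value
are illustrative):

// Illustrative: a smaller interval checkpoints predError more often, which
// truncates its lineage but writes the RDD out to HDFS every few rounds.
sc.setCheckpointDir("hdfs:///tmp/gbt-checkpoints")    // hypothetical directory
boostingStrategy.treeStrategy.checkpointInterval = 2  // MLlib default is 10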

I tried another way to solve this problem: persist the predError RDD every
round, keep a reference (pre_predError) to the previous round's RDD, and
unpersist it because it is no longer needed. A sketch of the pattern follows.
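
A minimal sketch of the idea (this is not the actual code in
GradientBoostedTrees.scala; oneRound stands in for a full GBT round, and the
storage level is an assumption):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch only: persist each round's predError and drop the previous round's
// copy once the new one has been materialized.
def boostLoop(
    initialPredError: RDD[(Double, Double)],
    numIterations: Int)(
    oneRound: (RDD[(Double, Double)], Int) => RDD[(Double, Double)])
  : RDD[(Double, Double)] = {
  var predError = initialPredError.persist(StorageLevel.MEMORY_AND_DISK)
  var m = 1
  while (m < numIterations) {
    val prePredError = predError                 // remember the previous RDD
    predError = oneRound(predError, m)           // one GBT round (caller-supplied)
    predError.persist(StorageLevel.MEMORY_AND_DISK)
    predError.count()                            // materialize before unpersisting
    prePredError.unpersist(blocking = false)     // the old result is useless now
    m += 1
  }
  predError
}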

With this change the same job finishes in about 45 minutes, with no task
failures and no extra memory added.

So when the data is much larger than memory, this small improvement can speed
up GradientBoostedTrees by roughly 1-2x.


> Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
> ------------------------------------------------------------------------
>
>                 Key: SPARK-19007
>                 URL: https://issues.apache.org/jira/browse/SPARK-19007
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 
> 2.0.1, 2.0.2, 2.1.0
>         Environment: A CDH cluster consisting of 3 Red Hat servers (120 GB 
> memory, 40 cores, 43 TB disk per server).
>            Reporter: zhangdenghui
>             Fix For: 2.1.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
