[jira] [Commented] (SPARK-4094) checkpoint should still be available after rdd actions

2014-12-21 Thread Zhang, Liye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255367#comment-14255367
 ] 

Zhang, Liye commented on SPARK-4094:


Hi [~sowen], I'm also wondering why the original behavior is restricted this 
way. But for real-world cases, especially graph algorithms with many 
iterations, the restriction is too tight.

Hi [~matei], the Spark code has had this restriction from the start. Can you 
tell us whether there were considerations other than the complexity of 
traversing the whole lineage? And are there any problems with allowing 
checkpoint after RDD actions?

> checkpoint should still be available after rdd actions
> --
>
> Key: SPARK-4094
> URL: https://issues.apache.org/jira/browse/SPARK-4094
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.1.0
> Reporter: Zhang, Liye
> Assignee: Zhang, Liye
>
> rdd.checkpoint() must be called before any action on this RDD. If any 
> action has already run, the checkpoint never succeeds. Take the following 
> code as an example:
> *rdd = sc.makeRDD(...)*
> *rdd.collect()*
> *rdd.checkpoint()*
> *rdd.count()*
> This RDD is never checkpointed. This is a problem for algorithms with many 
> iterations. Graph algorithms, for example, run many iterations, which makes 
> the RDD lineage very long, so the RDD may need to be checkpointed after a 
> certain number of iterations. If there is also an action inside the 
> iteration loop, checkpoint() never works for any iteration after the one 
> that calls the action.
> This does not happen with RDD caching. cache() always takes effect on the 
> next action, no matter whether any actions ran before cache() was called.
> So rdd.checkpoint() should behave the same way as rdd.cache().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4094) checkpoint should still be available after rdd actions

2014-12-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253191#comment-14253191
 ] 

Sean Owen commented on SPARK-4094:
--

[~liyezhang556520] But this is exactly what the doc says is not permitted. By 
invoking action C, you necessarily execute the job for RDD B, after which 
you can't checkpoint it.

My question: if you're proposing to loosen the restriction, what problem 
originally prevented allowing this, and why does the change resolve it?




[jira] [Commented] (SPARK-4094) checkpoint should still be available after rdd actions

2014-10-28 Thread Zhang, Liye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187982#comment-14187982
 ] 

Zhang, Liye commented on SPARK-4094:


[SPARK-3625|https://issues.apache.org/jira/browse/SPARK-3625] did something 
similar to this issue, but it currently does not support a case like this:
*rdd0 = sc.makeRDD(...)*
*rdd1 = rdd0.flatMap(...)*
*rdd1.collect()*
*rdd0.checkpoint()*
*rdd1.count()*
Here *rdd0* would not be checkpointed.
In this JIRA, we always traverse the whole RDD lineage on every RDD action, 
stopping at RDDs that have already been checkpointed. Since the traversal 
only checks the status of each RDD, it should not have much impact on 
performance.
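
The traversal rule described above can be sketched with a small stand-alone 
model. This is only an illustration in plain Python, not Spark code and not 
the code in the pull request: the class LineageNode, the function run_action, 
and the flags marked and checkpointed are hypothetical names chosen for this 
sketch. It also assumes that traversal continues past an RDD that is 
checkpointed during the current action and stops only at RDDs that were 
checkpointed earlier.

```python
class LineageNode:
    """Hypothetical stand-in for an RDD in a lineage graph; not a Spark class."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)
        self.marked = False        # checkpoint() has been requested
        self.checkpointed = False  # checkpoint data has been "written"

    def checkpoint(self):
        self.marked = True


def run_action(target):
    """Simulate an action on `target` under the proposed rule.

    Walks the lineage from `target` towards its ancestors, checkpoints every
    marked node, and does not look past nodes that were already checkpointed.
    Returns the names of the nodes checkpointed by this call.
    """
    newly_checkpointed = []
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if id(node) in seen or node.checkpointed:
            continue  # already visited, or the lineage is cut here
        seen.add(id(node))
        if node.marked:
            node.checkpointed = True
            newly_checkpointed.append(node.name)
        stack.extend(node.parents)
    return newly_checkpointed


rdd0 = LineageNode("rdd0")
rdd1 = LineageNode("rdd1", parents=[rdd0])

first = run_action(rdd1)    # like rdd1.collect(): nothing is marked yet
rdd0.checkpoint()           # requested after an action has already run
second = run_action(rdd1)   # like rdd1.count(): rdd0 is found and checkpointed
third = run_action(rdd1)    # traversal stops at rdd0, now checkpointed
```

In this model the walk only reads and sets flags on each node, which is the 
basis for the expectation that the extra traversal is cheap.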




[jira] [Commented] (SPARK-4094) checkpoint should still be available after rdd actions

2014-10-27 Thread Jie Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186418#comment-14186418
 ] 

Jie Huang commented on SPARK-4094:
--

Yes, we found a similar issue as well. According to the documentation, 
checkpoint is supported only before an action. But the problem is what 
happens with a lineage like the one below.
{noformat}
A-- B--C(action)
    |--D(action)
{noformat}
If you submit action C and then checkpoint B before action D, like
*C*
*B.checkpoint*
*D*

you cannot checkpoint RDD B. This doesn't align with the documentation or the 
original design.
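
The ordering problem in this comment can be reproduced with a toy model in 
plain Python. It is a sketch under stated assumptions, not Spark's 
implementation: ToyRDD, run_action, and the flag job_has_run are hypothetical 
names, and the model assumes that an RDD is considered for checkpointing only 
the first time a job covers it. C and D are modeled as two actions on B.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD; not Spark's actual class."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.marked = False        # checkpoint() has been requested
        self.checkpointed = False  # checkpoint data has been "written"
        self.job_has_run = False   # an earlier action already covered this RDD

    def checkpoint(self):
        self.marked = True


def run_action(rdd):
    """Simulate an action under the restriction discussed above.

    Each RDD is considered for checkpointing only the first time a job
    covers it; later actions skip it and everything above it.
    """
    node = rdd
    while node is not None and not node.job_has_run:
        node.job_has_run = True
        if node.marked:
            node.checkpointed = True
        node = node.parent


# Order from the comment: action C, then B.checkpoint, then action D.
a = ToyRDD("A")
b = ToyRDD("B", parent=a)
run_action(b)    # action C runs the job for B
b.checkpoint()   # requested after B's job has already run
run_action(b)    # action D: B is skipped
late_result = b.checkpointed

# Contrast: checkpoint requested before the first action.
a2 = ToyRDD("A")
b2 = ToyRDD("B", parent=a2)
b2.checkpoint()
run_action(b2)
early_result = b2.checkpointed
```

In the model, late_result stays False while early_result becomes True, which 
matches the behavior reported in this thread.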




[jira] [Commented] (SPARK-4094) checkpoint should still be available after rdd actions

2014-10-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184897#comment-14184897
 ] 

Apache Spark commented on SPARK-4094:
-

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/2956
