[jira] [Commented] (TINKERPOP-2081) PersistedOutputRDD materialises rdd lazily with Spark 2.x

2018-10-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TINKERPOP-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667468#comment-16667468
 ] 

ASF GitHub Bot commented on TINKERPOP-2081:
---

spmallette closed pull request #973: TINKERPOP-2081: Fix PersistedOutputRDD to 
eager persist RDD
URL: https://github.com/apache/tinkerpop/pull/973
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/CHANGELOG.asciidoc b/CHANGELOG.asciidoc
index 9a47f3c9c9..c67f1bea11 100644
--- a/CHANGELOG.asciidoc
+++ b/CHANGELOG.asciidoc
@@ -23,6 +23,7 @@ image::https://raw.githubusercontent.com/apache/tinkerpop/master/docs/static/ima
 [[release-3-3-5]]
 === TinkerPop 3.3.5 (Release Date: NOT OFFICIALLY RELEASED YET)
 
+* Fixed `PersistedOutputRDD` to eager persist RDD by adding `count()` action calls.
 
 [[release-3-3-4]]
 === TinkerPop 3.3.4 (Release Date: October 15, 2018)
diff --git a/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/structure/io/PersistedOutputRDD.java b/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/structure/io/PersistedOutputRDD.java
index c9fc684fbf..6eb6673bcc 100644
--- a/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/structure/io/PersistedOutputRDD.java
+++ b/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/structure/io/PersistedOutputRDD.java
@@ -55,9 +55,13 @@ public void writeGraphRDD(final Configuration configuration, final JavaPairRDD {
 vertex.get().dropEdges(Direction.BOTH);
 return vertex;
-}).setName(Constants.getGraphLocation(configuration.getString(Constants.GREMLIN_HADOOP_OUTPUT_LOCATION))).persist(storageLevel);
+}).setName(Constants.getGraphLocation(configuration.getString(Constants.GREMLIN_HADOOP_OUTPUT_LOCATION))).persist(storageLevel)
+// call action to eager store rdd
+.count();
 else
-graphRDD.setName(Constants.getGraphLocation(configuration.getString(Constants.GREMLIN_HADOOP_OUTPUT_LOCATION))).persist(storageLevel);
+graphRDD.setName(Constants.getGraphLocation(configuration.getString(Constants.GREMLIN_HADOOP_OUTPUT_LOCATION))).persist(storageLevel)
+// call action to eager store rdd
+.count();
 Spark.refresh(); // necessary to do really fast so the Spark GC doesn't clear out the RDD
 }
 
@@ -69,7 +73,9 @@ public void writeGraphRDD(final Configuration configuration, 
final JavaPairRDD new 
KeyValue<>(tuple._1(), tuple._2()));
 }


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> PersistedOutputRDD materialises rdd lazily with Spark 2.x
> -
>
> Key: TINKERPOP-2081
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2081
> Project: TinkerPop
> Issue Type: Bug
> Components: hadoop
> Affects Versions: 3.3.4
> Reporter: Artem Aliev
> Assignee: stephen mallette
> Priority: Major
>
> PersistedOutputRDD does not actually persist the RDD in Spark memory; it only 
> marks it for lazy caching in the future. It looks like caching was eager in 
> Spark 1.6, but in Spark 2.0 it is lazy.
> Lazy caching is wrong for this case: the source graph could be changed 
> after the snapshot is created, and the snapshot should not be affected by those 
> changes.
> The fix itself is simple: PersistedOutputRDD should call any Spark action to 
> trigger eager caching, for example count().
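
The snapshot problem described above can be illustrated without a Spark
cluster. The sketch below is a toy model in plain Java, not Spark code:
SnapshotDemo, LazySnapshot and their methods are hypothetical names invented
for this illustration. They only mimic the behaviour the report relies on,
namely that persist() marks data for caching and that an action such as
count() is what forces it to be computed and stored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Toy model only: these classes are hypothetical and are NOT part of Spark or TinkerPop.
public class SnapshotDemo {

    /** Stand-in for a persisted RDD: reads its source on first access, then caches the copy. */
    static final class LazySnapshot<T> {
        private final Supplier<List<T>> source;
        private List<T> cached; // stays null until something forces materialisation

        LazySnapshot(Supplier<List<T>> source) {
            this.source = source;
        }

        /** Analogue of persist(): marks the data for caching but computes nothing. */
        LazySnapshot<T> persist() {
            return this;
        }

        /** Analogue of an action such as count(): forces computation and fills the cache. */
        long count() {
            return collect().size();
        }

        /** Returns the cached copy, taking it from the source on the first call. */
        List<T> collect() {
            if (cached == null) {
                cached = new ArrayList<>(source.get());
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        List<String> graph = new ArrayList<>(Arrays.asList("v1", "v2"));

        // Marked for caching only; nothing has been copied yet.
        LazySnapshot<String> lazy = new LazySnapshot<String>(() -> graph).persist();

        // Same marking, followed by an action that materialises the snapshot now.
        LazySnapshot<String> eager = new LazySnapshot<String>(() -> graph).persist();
        eager.count();

        graph.add("v3"); // the source graph changes after both "snapshots" were taken

        System.out.println(lazy.collect());  // [v1, v2, v3] - the late change leaked in
        System.out.println(eager.collect()); // [v1, v2]     - fixed at snapshot time
    }
}
```

In this model the lazily cached snapshot picks up the vertex added after it was
requested, while the one followed by count() does not, which is the behaviour
the proposed fix aims for.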



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TINKERPOP-2081) PersistedOutputRDD materialises rdd lazily with Spark 2.x

2018-10-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TINKERPOP-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16665290#comment-16665290
 ] 

ASF GitHub Bot commented on TINKERPOP-2081:
---

artem-aliev opened a new pull request #973: TINKERPOP-2081: Fix 
PersistedOutputRDD to eager persist RDD
URL: https://github.com/apache/tinkerpop/pull/973
 
 
   Call the rdd.count() action to trigger caching.




> PersistedOutputRDD materialises rdd lazily with Spark 2.x
> -
>
> Key: TINKERPOP-2081
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2081
> Project: TinkerPop
> Issue Type: Bug
> Affects Versions: 3.3.4
> Reporter: Artem Aliev
> Priority: Major
>
> PersistedOutputRDD does not actually persist the RDD in Spark memory; it only 
> marks it for lazy caching in the future. It looks like caching was eager in 
> Spark 1.6, but in Spark 2.0 it is lazy.
> Lazy caching is wrong for this case: the source graph could be changed 
> after the snapshot is created, and the snapshot should not be affected by those 
> changes.
> The fix itself is simple: PersistedOutputRDD should call any Spark action to 
> trigger eager caching, for example count().


