Repository: spark
Updated Branches:
  refs/heads/master 7e16c94f1 -> 4bafacaa5


[SPARK-17417][CORE] Fix # of partitions for Reliable RDD checkpointing

## What changes were proposed in this pull request?
Currently the number of partition files is limited to 10000 files (%05d format). 
If there are more than 10000 part files, the logic for recreating the RDD breaks 
because the part-file names are sorted as strings. More details can be found in 
the JIRA description [here](https://issues.apache.org/jira/browse/SPARK-17417).
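
For illustration only (not part of this patch), the sketch below uses hypothetical part-file names to show how a lexicographic sort can disagree with the numeric order of the partition indices, and how sorting by the stripped numeric index, as the fix does, restores the expected order:

```scala
// Standalone sketch: why sorting part-file names as strings can break once the
// index outgrows the zero-padded width, and how a numeric sort behaves instead.
object PartFileSortSketch {
  def main(args: Array[String]): Unit = {
    val names = Seq("part-99999", "part-100000", "part-00001")

    // Lexicographic order: "part-100000" sorts before "part-99999".
    val byString = names.sortBy(identity)

    // Numeric order, mirroring the fix: strip the prefix and compare as Int.
    val byIndex = names.sortBy(_.stripPrefix("part-").toInt)

    println(byString) // List(part-00001, part-100000, part-99999)
    println(byIndex)  // List(part-00001, part-99999, part-100000)
  }
}
```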

## How was this patch tested?
I tested this patch by checkpointing an RDD, manually renaming the part files to 
the old format, and then accessing the RDD; it was successfully recreated from 
the old format. I also verified loading a sample Parquet file, saving it in 
multiple formats (CSV, JSON, Text, Parquet, ORC), and reading them back 
successfully from the saved files. I couldn't launch the unit test from my 
local box, so I will wait for the Jenkins output.
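
As a rough sketch of the manual check described above (the app name, checkpoint directory, data size, and partition count are placeholder values), checkpointing an RDD so that the part-NNNNN files get written could look like:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Placeholder checkpoint directory; an HDFS path works the same way.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    val rdd = sc.parallelize(1 to 1000000, numSlices = 200)
    rdd.checkpoint() // mark the RDD for reliable checkpointing
    rdd.count()      // run a job so the part-NNNNN files are actually written

    // At this point the part files under the checkpoint directory can be
    // renamed to the old naming scheme and the RDD accessed again to verify
    // backward compatibility, as described above.
    sc.stop()
  }
}
```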

Author: Dhruve Ashar <dhruveas...@gmail.com>

Closes #15370 from dhruve/bug/SPARK-17417.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4bafacaa
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4bafacaa
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4bafacaa

Branch: refs/heads/master
Commit: 4bafacaa5f50a3e986c14a38bc8df9bae303f3a0
Parents: 7e16c94
Author: Dhruve Ashar <dhruveas...@gmail.com>
Authored: Mon Oct 10 10:55:57 2016 -0500
Committer: Tom Graves <tgra...@yahoo-inc.com>
Committed: Mon Oct 10 10:55:57 2016 -0500

----------------------------------------------------------------------
 .../main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala  | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/4bafacaa/core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala b/core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala
index ab6554f..eac901d 100644
--- a/core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala
@@ -69,10 +69,10 @@ private[spark] class ReliableCheckpointRDD[T: ClassTag](
     val inputFiles = fs.listStatus(cpath)
       .map(_.getPath)
       .filter(_.getName.startsWith("part-"))
-      .sortBy(_.toString)
+      .sortBy(_.getName.stripPrefix("part-").toInt)
     // Fail fast if input files are invalid
     inputFiles.zipWithIndex.foreach { case (path, i) =>
-      if (!path.toString.endsWith(ReliableCheckpointRDD.checkpointFileName(i))) {
+      if (path.getName != ReliableCheckpointRDD.checkpointFileName(i)) {
         throw new SparkException(s"Invalid checkpoint file: $path")
       }
     }

