[GitHub] spark pull request #18962: [SPARK-21714][CORE][YARN] Avoiding re-uploading r...

jerryshao Fri, 18 Aug 2017 01:37:07 -0700

Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18962#discussion_r133907328
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
    @@ -330,18 +330,45 @@ object SparkSubmit extends CommandLineUtils {
         args.archives = Option(args.archives).map(resolveGlobPaths(_, 
hadoopConf)).orNull
     
         // In client mode, download remote files.
    +    var localPrimaryResource: String = null
    +    var localJars: String = null
    +    var localPyFiles: String = null
         if (deployMode == CLIENT) {
    -      args.primaryResource = Option(args.primaryResource).map {
    +      localPrimaryResource = Option(args.primaryResource).map {
             downloadFile(_, targetDir, args.sparkProperties, hadoopConf)
           }.orNull
    -      args.jars = Option(args.jars).map {
    +      localJars = Option(args.jars).map {
             downloadFileList(_, targetDir, args.sparkProperties, hadoopConf)
           }.orNull
    -      args.pyFiles = Option(args.pyFiles).map {
    +      localPyFiles = Option(args.pyFiles).map {
             downloadFileList(_, targetDir, args.sparkProperties, hadoopConf)
           }.orNull
         }
     
    +    if (clusterManager == YARN) {
    +      def isNoneFsFileExist(paths: String): Boolean = {
    +        Option(paths).exists { p =>
    +          p.split(",").map(_.trim).filter(_.nonEmpty).exists { path =>
    +            val url = Utils.resolveURI(path)
    +            url.getScheme match {
    +              case "http" | "https" | "ftp" => true
    +              case _ => false
    +            }
    +          }
    +        }
    +      }
    +
    +      // Spark on YARN doesn't support upload remote resources from http, 
https or ftp server
    +      // directly to distributed cache, so print a warning and exit the 
process.
    +      if (isNoneFsFileExist(args.jars) ||
    --- End diff --
    
    From my understanding, to solve this issue, there's some solutions:
    
    1. Filtering resource from http(s) and add to spark.jars/spark.files, using 
Spark's internal fileServer to distribute. This should be worked in yarn client 
mode, but in yarn cluster mode it cannot be worked, since AM doesn't have such 
resources.
    2. Filtering resources from http(s) and download them to local both in yarn 
client and cluster mode. This will re-upload resources to Hadoop FS, so the 
problem here is resources are downloaded and uploaded unnecessarily. But this 
could solve the whole problem here, also if there's no internet access other 
than gateway nodes, this should be a possible solution. I assume not so many 
users will use resources on http(s), so as a compromise solution it looks 
acceptable.
    
    What do you think? @vanzin @tgravescs .




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #18962: [SPARK-21714][CORE][YARN] Avoiding re-uploading r...

Reply via email to