Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20611#discussion_r202429058
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
    @@ -303,94 +303,44 @@ case class LoadDataCommand(
               s"partitioned, but a partition spec was provided.")
           }
         }
    -
    -    val loadPath =
    +    val loadPath = {
           if (isLocal) {
    -        val uri = Utils.resolveURI(path)
    -        val file = new File(uri.getPath)
    -        val exists = if (file.getAbsolutePath.contains("*")) {
    -          val fileSystem = FileSystems.getDefault
    -          val dir = file.getParentFile.getAbsolutePath
    -          if (dir.contains("*")) {
    -            throw new AnalysisException(
    -              s"LOAD DATA input path allows only filename wildcard: $path")
    -          }
    -
     -          // Note that special characters such as "*" on Windows are not allowed as a path.
     -          // Calling `WindowsFileSystem.getPath` throws an exception if they are in the path.
    -          val dirPath = fileSystem.getPath(dir)
     -          val pathPattern = new File(dirPath.toAbsolutePath.toString, file.getName).toURI.getPath
    -          val safePathPattern = if (Utils.isWindows) {
     -            // On Windows, the pattern should not start with slashes for absolute file paths.
    -            pathPattern.stripPrefix("/")
    -          } else {
    -            pathPattern
    -          }
    -          val files = new File(dir).listFiles()
    -          if (files == null) {
    -            false
    -          } else {
     -            val matcher = fileSystem.getPathMatcher("glob:" + safePathPattern)
     -            files.exists(f => matcher.matches(fileSystem.getPath(f.getAbsolutePath)))
    -          }
    -        } else {
    -          new File(file.getAbsolutePath).exists()
    -        }
    -        if (!exists) {
    -          throw new AnalysisException(s"LOAD DATA input path does not 
exist: $path")
    -        }
    -        uri
    +        val localFS = FileContext.getLocalFSFileContext()
    +        localFS.makeQualified(new Path(path))
           } else {
    -        val uri = new URI(path)
     -        val hdfsUri = if (uri.getScheme() != null && uri.getAuthority() != null) {
    -          uri
    -        } else {
    -          // Follow Hive's behavior:
    -          // If no schema or authority is provided with non-local inpath,
    -          // we will use hadoop configuration "fs.defaultFS".
     -          val defaultFSConf = sparkSession.sessionState.newHadoopConf().get("fs.defaultFS")
    -          val defaultFS = if (defaultFSConf == null) {
    -            new URI("")
    -          } else {
    -            new URI(defaultFSConf)
    -          }
    -
    -          val scheme = if (uri.getScheme() != null) {
    --- End diff --
    
    Suppose the user provides a path like "load data inpath 'hdfs://hacluster/user/su?ith.txt' into table t1".
    
    The meaning of this wildcard usage is: 'su', followed by any single character, followed by 'ith'.
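    
    A minimal REPL-style sketch (hypothetical, not from this PR; the paths are made-up local examples) of the '?' glob semantics via java.nio's PathMatcher, the same mechanism the removed code used for local wildcard checks:
    
        import java.nio.file.FileSystems
    
        val fs = FileSystems.getDefault
        // In a glob, '?' matches exactly one character.
        val matcher = fs.getPathMatcher("glob:/user/su?ith.txt")
        matcher.matches(fs.getPath("/user/sujith.txt"))   // true: 'j' fills the '?'
        matcher.matches(fs.getPath("/user/suith.txt"))    // false: no character for the '?'
        matcher.matches(fs.getPath("/user/sujjith.txt"))  // false: '?' matches only one character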
    
    As per the old logic, when we form a URI the characters after '?' are ignored, because java.net.URI interprets the string after '?' as a query parameter (which has no significance in this scenario).
    makeQualified() works on a Hadoop Path instead, and Path does not split off a query-parameter part, so the entire path string, including the wildcard, is considered when making the Path instance.
    I discussed the usage of this API with a couple of Hadoop PMC members (Vinay Kumar and Brahma Reddy), and they said there is no harm in this usage.
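    
    To make the difference concrete, here is a small REPL-style sketch (hypothetical, not from the PR; the host and file names are just examples) contrasting java.net.URI, which the old code built, with org.apache.hadoop.fs.Path, which the new code qualifies via makeQualified():
    
        import java.net.URI
        import org.apache.hadoop.fs.Path
    
        // Old logic: java.net.URI treats everything after '?' as a query string,
        // so the wildcard portion is silently dropped from the path component.
        val uri = new URI("hdfs://hacluster/user/su?ith.txt")
        uri.getPath    // "/user/su"
        uri.getQuery   // "ith.txt"
    
        // New logic: Hadoop's Path keeps '?' as an ordinary path character, so the
        // full pattern survives qualification and can later be resolved as a glob.
        val p = new Path("hdfs://hacluster/user/su?ith.txt")
        p.toUri.getPath   // "/user/su?ith.txt"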
    
    If we revert this change, wildcard '?' usage in paths will lead to incorrect path validation in this context, whereas Hive supports this usage, so it would become a limitation.

