[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory
teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r584401119

File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala

```diff
@@ -84,6 +88,26 @@ class DefaultSource extends RelationProvider
     val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
     log.info("Obtained hudi table path: " + tablePath)

+    if (path.nonEmpty) {
+      val _path = path.get.stripSuffix("/")
+      val pathTmp = new Path(_path).makeQualified(fs.getUri, fs.getWorkingDirectory)
+      // If the user specifies the table path, the data path is automatically inferred
+      if (pathTmp.toString.equals(tablePath)) {
+        val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+        val fsBackedTableMetadata =
+          new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+        val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
```

Review comment: @lw309637554 Thank you for the review. The hudi table path, which was previously inferred, can also be obtained through configuration instead of inference.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
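The guard in the hunk above (strip a trailing slash, qualify the user-supplied path, and only infer the data path when it equals the resolved table path) can be sketched without Hadoop dependencies. The `qualify` helper and the sample `hdfs://` paths below are illustrative stand-ins for `Path.makeQualified`, not Hudi or Hadoop API:

```java
import java.net.URI;

public class PathCheck {
    // Simplified stand-in for Path.makeQualified: strip a trailing slash and
    // resolve the string against a base URI so equivalent spellings compare equal.
    static String qualify(String userPath, URI base) {
        String stripped = userPath.endsWith("/")
                ? userPath.substring(0, userPath.length() - 1)
                : userPath;
        return base.resolve(stripped).normalize().toString();
    }

    public static void main(String[] args) {
        URI fsBase = URI.create("hdfs://nn:8020/");
        String tablePath = "hdfs://nn:8020/warehouse/hudi_table";

        // A trailing slash still matches the table path, so inference would trigger.
        System.out.println(qualify("hdfs://nn:8020/warehouse/hudi_table/", fsBase).equals(tablePath));
        // A sub-directory (e.g. a partition) does not match, so no inference happens.
        System.out.println(qualify("hdfs://nn:8020/warehouse/hudi_table/2021/01", fsBase).equals(tablePath));
    }
}
```

This is the behavior the reviewer asked for: a user who deliberately points at a partition or glob keeps that path untouched, while a bare table path gets the inferred data path.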
teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r580728819

File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala

```diff
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
     val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
     log.info("Obtained hudi table path: " + tablePath)

+    val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+    val fsBackedTableMetadata =
+      new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+    val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+    val onePartitionPath = if (!partitionPaths.isEmpty && !StringUtils.isEmpty(partitionPaths.get(0))) {
+      tablePath + "/" + partitionPaths.get(0)
+    } else {
+      tablePath
+    }
+    val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+    log.info("Obtained hudi data path: " + dataPath)
+    parameters += "path" -> dataPath
```

Review comment: @vinothchandar This now supports your use case: if the path specified by the user is a table path, the data path is inferred automatically; otherwise no inference is performed.
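`DataSourceUtils.getDataPath` is not shown in the hunk above, but given a table path and one concrete partition path it presumably derives a glob covering every partition. The sketch below is a hypothetical reconstruction under that assumption (counting partition depth and emitting one `*` per level), not the actual Hudi implementation:

```java
public class DataPathSketch {
    // Hypothetical sketch of DataSourceUtils.getDataPath: derive a glob that
    // covers every partition by replacing each partition level with "*".
    static String getDataPath(String tablePath, String onePartitionPath) {
        if (onePartitionPath.equals(tablePath)) {
            // Non-partitioned table: data files sit directly under the table path.
            return tablePath;
        }
        // Relative partition path, e.g. "2021/01/05" -> depth 3 -> "/*/*/*"
        String relative = onePartitionPath.substring(tablePath.length() + 1);
        StringBuilder glob = new StringBuilder(tablePath);
        for (int i = 0; i < relative.split("/").length; i++) {
            glob.append("/*");
        }
        return glob.toString();
    }

    public static void main(String[] args) {
        System.out.println(getDataPath("/warehouse/t", "/warehouse/t/2021/01/05")); // /warehouse/t/*/*/*
        System.out.println(getDataPath("/warehouse/t", "/warehouse/t"));            // /warehouse/t
    }
}
```

This also shows why a single partition path suffices: only the partition depth matters for building the glob, which is what makes the early-return lookup discussed later in this thread viable.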
teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r579910252

File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala

```diff
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
     val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
     log.info("Obtained hudi table path: " + tablePath)

+    val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+    val fsBackedTableMetadata =
+      new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+    val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+    val onePartitionPath = if (!partitionPaths.isEmpty && !StringUtils.isEmpty(partitionPaths.get(0))) {
+      tablePath + "/" + partitionPaths.get(0)
+    } else {
+      tablePath
+    }
+    val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+    log.info("Obtained hudi data path: " + dataPath)
+    parameters += "path" -> dataPath
```

Review comment: I will see whether I can keep the automatic inference while still meeting your needs.
teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r579907953

File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala

```diff
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
     val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
     log.info("Obtained hudi table path: " + tablePath)

+    val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+    val fsBackedTableMetadata =
+      new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+    val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+    val onePartitionPath = if (!partitionPaths.isEmpty && !StringUtils.isEmpty(partitionPaths.get(0))) {
+      tablePath + "/" + partitionPaths.get(0)
+    } else {
+      tablePath
+    }
+    val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+    log.info("Obtained hudi data path: " + dataPath)
+    parameters += "path" -> dataPath
```

Review comment: As written, the path specified by the user is unconditionally overwritten by the automatically inferred data directory, so your use case cannot be met.
teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r569889546

File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java

```diff
@@ -84,6 +86,39 @@ public static String getTablePath(FileSystem fs, Path[] userProvidedPaths) throw
     throw new TableNotFoundException("Unable to find a hudi table for the user provided paths.");
   }

+  public static Option<String> getOnePartitionPath(FileSystem fs, Path tablePath) throws IOException {
+    // When the table is not partitioned
+    if (HoodiePartitionMetadata.hasPartitionMetadata(fs, tablePath)) {
+      return Option.of(tablePath.toString());
+    }
+    FileStatus[] statuses = fs.listStatus(tablePath);
+    for (FileStatus status : statuses) {
+      if (status.isDirectory()) {
+        if (HoodiePartitionMetadata.hasPartitionMetadata(fs, status.getPath())) {
+          return Option.of(status.getPath().toString());
+        } else {
+          Option<String> partitionPath = getOnePartitionPath(fs, status.getPath());
+          if (partitionPath.isPresent()) {
+            return partitionPath;
```

Review comment: Thank you for your review. This way of obtaining a partition is very fast: as soon as one partition path is found it returns immediately, whereas FSUtils.getAllPartitionPaths enumerates every partition path, which is very time-consuming.
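The early-return search above can be demonstrated with plain `java.nio` in place of Hadoop's `FileSystem` and `HoodiePartitionMetadata`. The marker-file name mirrors Hudi's `.hoodie_partition_metadata`, and the toy directory layout is illustrative:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

public class OnePartitionFinder {
    // Depth-first search that stops at the FIRST directory containing the
    // partition marker file, instead of enumerating all partitions.
    static Optional<Path> getOnePartitionPath(Path dir) throws IOException {
        if (Files.exists(dir.resolve(".hoodie_partition_metadata"))) {
            // Non-partitioned table root, or a partition directory.
            return Optional.of(dir);
        }
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                if (Files.isDirectory(child)) {
                    Optional<Path> found = getOnePartitionPath(child);
                    if (found.isPresent()) {
                        // Early return: the remaining siblings are never scanned.
                        return found;
                    }
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // Toy layout: table/2021/01 is a partition; table/2021/02 has no marker yet.
        Path table = Files.createTempDirectory("hudi_table");
        Path partition = Files.createDirectories(table.resolve("2021").resolve("01"));
        Files.createFile(partition.resolve(".hoodie_partition_metadata"));
        Files.createDirectories(table.resolve("2021").resolve("02"));

        System.out.println(getOnePartitionPath(table)
                .map(p -> table.relativize(p).toString())
                .orElse("none"));
    }
}
```

The cost here is proportional to the depth of one partition branch rather than to the total number of partitions, which is the speedup the comment describes versus listing all partition paths.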