[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

2021-02-28 Thread GitBox


teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r584401119



##
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##
@@ -84,6 +88,26 @@ class DefaultSource extends RelationProvider
 val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
 log.info("Obtained hudi table path: " + tablePath)
 
+if (path.nonEmpty) {
+  val _path = path.get.stripSuffix("/")
+  val pathTmp = new Path(_path).makeQualified(fs.getUri, fs.getWorkingDirectory)
+  // If the user specifies the table path, the data path is automatically inferred
+  if (pathTmp.toString.equals(tablePath)) {
+    val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+    val fsBackedTableMetadata =
+      new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+    val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths

Review comment:
   @lw309637554 Thank you for your review. The hudi table path that was previously obtained by inference can now also be supplied through configuration instead of being inferred.
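
   The hunk above infers the data path only when the user-supplied path, after normalization, equals the table path. A minimal standalone sketch of that decision (hypothetical names, no Hadoop dependency; `firstPartition` stands in for the partition path the metadata lookup would return):

```java
// Hypothetical sketch, not Hudi's actual API: honor the user's path unless it
// points at the table root, in which case infer the data path under it.
public class PathInference {

    // Mirrors path.get.stripSuffix("/") in the Scala hunk above.
    static String normalize(String path) {
        return path.endsWith("/") ? path.substring(0, path.length() - 1) : path;
    }

    // Returns the path the datasource should read: an inferred data path when
    // the user pointed at the table root, otherwise the user's path unchanged.
    static String resolveReadPath(String userPath, String tablePath, String firstPartition) {
        if (normalize(userPath).equals(normalize(tablePath))) {
            return normalize(tablePath) + "/" + firstPartition;
        }
        return userPath;
    }
}
```

   With this shape, an explicit glob such as `/tbl/2021/*` passes through untouched, while `/tbl` or `/tbl/` triggers inference.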





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

2021-02-22 Thread GitBox


teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r580728819



##
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
 val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
 log.info("Obtained hudi table path: " + tablePath)
 
+val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+val fsBackedTableMetadata =
+  new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+val onePartitionPath = if (!partitionPaths.isEmpty && !StringUtils.isEmpty(partitionPaths.get(0))) {
+    tablePath + "/" + partitionPaths.get(0)
+  } else {
+    tablePath
+  }
+val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+log.info("Obtained hudi data path: " + dataPath)
+parameters += "path" -> dataPath

Review comment:
   @vinothchandar This now supports your use case: if the path specified by the user is the table path, the data path is inferred automatically; otherwise no inference is performed.









[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

2021-02-21 Thread GitBox


teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r579910252



##
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
 val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
 log.info("Obtained hudi table path: " + tablePath)
 
+val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+val fsBackedTableMetadata =
+  new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+val onePartitionPath = if (!partitionPaths.isEmpty && !StringUtils.isEmpty(partitionPaths.get(0))) {
+    tablePath + "/" + partitionPaths.get(0)
+  } else {
+    tablePath
+  }
+val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+log.info("Obtained hudi data path: " + dataPath)
+parameters += "path" -> dataPath

Review comment:
   I will see whether the path can still be inferred automatically while also meeting your needs.









[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

2021-02-21 Thread GitBox


teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r579907953



##
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
 val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
 log.info("Obtained hudi table path: " + tablePath)
 
+val sparkEngineContext = new HoodieSparkEngineContext(sqlContext.sparkContext)
+val fsBackedTableMetadata =
+  new FileSystemBackedTableMetadata(sparkEngineContext, new SerializableConfiguration(fs.getConf), tablePath, false)
+val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+val onePartitionPath = if (!partitionPaths.isEmpty && !StringUtils.isEmpty(partitionPaths.get(0))) {
+    tablePath + "/" + partitionPaths.get(0)
+  } else {
+    tablePath
+  }
+val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+log.info("Obtained hudi data path: " + dataPath)
+parameters += "path" -> dataPath

Review comment:
   As written, the path specified by the user is overwritten by the automatically inferred data directory, so your use case cannot be met.









[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

2021-02-03 Thread GitBox


teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r569889546



##
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##
@@ -84,6 +86,39 @@ public static String getTablePath(FileSystem fs, Path[] userProvidedPaths) throw
 throw new TableNotFoundException("Unable to find a hudi table for the user provided paths.");
   }
 
+  public static Option<String> getOnePartitionPath(FileSystem fs, Path tablePath) throws IOException {
+    // When the table is not partitioned
+    if (HoodiePartitionMetadata.hasPartitionMetadata(fs, tablePath)) {
+      return Option.of(tablePath.toString());
+    }
+    FileStatus[] statuses = fs.listStatus(tablePath);
+    for (FileStatus status : statuses) {
+      if (status.isDirectory()) {
+        if (HoodiePartitionMetadata.hasPartitionMetadata(fs, status.getPath())) {
+          return Option.of(status.getPath().toString());
+        } else {
+          Option<String> partitionPath = getOnePartitionPath(fs, status.getPath());
+          if (partitionPath.isPresent()) {
+            return partitionPath;

Review comment:
   Thank you for your review. This way of obtaining a partition is very fast: it returns as soon as a single partition path is found. FSUtils.getAllPartitionPaths, by contrast, lists every partition path, which is very time-consuming.
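
   The early-return search described above can be sketched without Hadoop as a recursive walk over an in-memory directory tree. This is a hypothetical illustration, not Hudi's code: the `children` map stands in for `FileSystem.listStatus` and the `hasPartitionMetadata` set stands in for `HoodiePartitionMetadata.hasPartitionMetadata`.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Sketch of the early-return search: stop at the FIRST directory that carries
// partition metadata instead of enumerating every partition.
public class FirstPartitionFinder {

    static Optional<String> findOnePartition(String dir,
                                             Map<String, List<String>> children,
                                             Set<String> hasPartitionMetadata) {
        // Non-partitioned table: metadata sits at the table root itself.
        if (hasPartitionMetadata.contains(dir)) {
            return Optional.of(dir);
        }
        for (String child : children.getOrDefault(dir, List.of())) {
            if (hasPartitionMetadata.contains(child)) {
                return Optional.of(child); // found one partition: return immediately
            }
            // Descend into nested partition columns (e.g. year/month/day).
            Optional<String> deeper = findOnePartition(child, children, hasPartitionMetadata);
            if (deeper.isPresent()) {
                return deeper;
            }
        }
        return Optional.empty();
    }
}
```

   Because the recursion unwinds as soon as any branch yields a partition, only a handful of directory listings are needed, versus one listing per partition for a full enumeration.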




