[jira] [Commented] (HUDI-1307) spark datasource load path format is confused for snapshot and increment read mode

Raymond Xu (Jira) Sun, 26 Sep 2021 11:35:04 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420358#comment-17420358
 ]


Raymond Xu commented on HUDI-1307:
----------------------------------

[~309637554] Any update on this improvement? It would definitely be useful to 
align on the pattern. In fact, today, to enable HoodieFileIndex, users should 
avoid passing in a glob path. Is it still necessary to keep the glob path 
pattern around?

> spark datasource load path format is confused for snapshot and increment read 
> mode
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-1307
>                 URL: https://issues.apache.org/jira/browse/HUDI-1307
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Spark Integration
>            Reporter: liwei
>            Assignee: liwei
>            Priority: Critical
>              Labels: sev:high, user-support-issues
>
> When reading a Hudi table through the Spark datasource:
> 1. Snapshot mode
> {code:java}
>  // "/*" must be appended, otherwise the load fails:
>  // org.apache.hudi.DefaultSource.createRelation() uses fs.globStatus(), and
>  // without "/*" the glob does not return .hoodie and the default dir.
>  val readHudi = spark.read.format("org.apache.hudi").load(basePath + "/*")
>  // The call in question:
>  val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs){code}
>  
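
The effect described for snapshot mode can be sketched without Spark or Hudi. This is only an illustration using `java.io.File`, not Hadoop's `fs.globStatus()`; `GlobSketch` and `expand` are hypothetical names, and only a trailing `/*` is handled. It shows that the bare base path resolves to the table directory itself, while `basePath + "/*"` resolves to its children.

```scala
import java.io.File
import java.nio.file.Files

object GlobSketch {
  // Simplified stand-in for a glob lister (assumption: only a trailing "/*"
  // is supported). A trailing "/*" lists the children of the directory;
  // any other path resolves to the path itself, if it exists.
  def expand(pathOrGlob: String): List[String] = {
    if (pathOrGlob.endsWith("/*")) {
      val dir = new File(pathOrGlob.stripSuffix("/*"))
      // listFiles() returns null when dir is not a directory.
      Option(dir.listFiles()).map(_.map(_.getName).sorted.toList).getOrElse(Nil)
    } else {
      val f = new File(pathOrGlob)
      if (f.exists) List(f.getName) else Nil
    }
  }

  def main(args: Array[String]): Unit = {
    // Build a directory layout that resembles a table with one partition dir.
    val base = Files.createTempDirectory("hudi_table")
    Files.createDirectory(base.resolve(".hoodie"))
    Files.createDirectory(base.resolve("default"))

    println(expand(base.toString))        // only the table directory itself
    println(expand(base.toString + "/*")) // the .hoodie and default entries
  }
}
```
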
> 2. Incremental mode
> Both basePath and basePath + "/*" work. This is because 
> org.apache.hudi.DefaultSource uses DataSourceUtils.getTablePath, which 
> supports both formats.
> {code:java}
>  val incViewDF = spark.read.format("org.apache.hudi").
>  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
>  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
>  option(END_INSTANTTIME_OPT_KEY, endTime).
>  load(basePath){code}
>  
> {code:java}
>  val incViewDF = spark.read.format("org.apache.hudi").
>  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
>  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
>  option(END_INSTANTTIME_OPT_KEY, endTime).
>  load(basePath + "/*")
>  {code}
>  
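
One plausible reason both forms can work in incremental mode is that the table path is resolved from whatever the user passed in. The sketch below is not Hudi's `DataSourceUtils.getTablePath`; `TablePathSketch` and `findTablePath` are hypothetical names. It assumes a table's base directory is the nearest ancestor that contains a `.hoodie` directory, and it drops trailing glob segments before searching.

```scala
import java.io.File
import java.nio.file.Files

object TablePathSketch {
  private val globChars = Set('*', '?', '[', '{')

  // Hypothetical helper: strip glob segments, then walk up the directory
  // tree until a directory containing ".hoodie" is found.
  def findTablePath(pathOrGlob: String): Option[String] = {
    // Keep only the leading segments that contain no glob characters.
    val concrete = pathOrGlob.split("/").takeWhile(seg => !seg.exists(globChars)).mkString("/")
    var current: File = new File(if (concrete.isEmpty) "." else concrete)
    while (current != null && !new File(current, ".hoodie").isDirectory) {
      current = current.getParentFile // null once the root has been passed
    }
    Option(current).map(_.getPath)
  }

  def main(args: Array[String]): Unit = {
    val base = Files.createTempDirectory("hudi_table")
    Files.createDirectory(base.resolve(".hoodie"))
    Files.createDirectory(base.resolve("default"))

    // Both spellings resolve to the same table directory.
    println(findTablePath(base.toString))
    println(findTablePath(base.toString + "/*"))
  }
}
```

With a resolver of this kind, the caller never needs to know how deep the partition directories are, which is the consistency the issue asks for.
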
> Because incremental mode and snapshot mode do not behave the same way, users 
> get confused. Loading with basePath + "/*"  *or "/***/*"* is also confusing. 
> I know this is to support partitions, but I think the following API would be 
> clearer for users:
>  
> {code:java}
>  // Load from the table path and select the partition with a filter.
>  val partition = "year = '2019'"
>  spark.read.format("hudi").load(path).where(partition){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
