[ https://issues.apache.org/jira/browse/HUDI-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380980#comment-17380980 ]
ASF GitHub Bot commented on HUDI-2086:
--------------------------------------

xiarixiaoyao opened a new pull request #3203:
URL: https://github.com/apache/hudi/pull/3203

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*

## What is the purpose of the pull request

Redo the logic of mor_incremental_view for Hive to fix several bugs in the MOR incremental view for Hive/Spark SQL. Purpose of the pull request:
1) support reading the latest incremental data that is stored only in log files
2) support reading incremental data written before a replacecommit
3) support reading file groups that contain only log files
4) keep the logic of mor_incremental_view consistent with the Spark datasource

## Brief change log

*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

New UT added.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> redo the logical of mor_incremental_view for hive
> -------------------------------------------------
>
>                 Key: HUDI-2086
>                 URL: https://issues.apache.org/jira/browse/HUDI-2086
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Hive Integration
>         Environment: spark3.1.1
> hive3.1.1
> hadoop3.1.1
> os: suse
>            Reporter: tao meng
>            Assignee: tao meng
>            Priority: Major
>              Labels: pull-request-available
>
> There are currently several problems with mor_incremental_view for Hive.
> For example:
>
> 1) *Hudi cannot read the latest incremental data that is stored only in log files.*
> Consider: create a MOR table with bulk_insert, then do an upsert on that table.
> Now we want to query the latest incremental data through Hive/Spark SQL; however, because the latest incremental data is stored only in log files, the query returns nothing.
> Step 1: prepare data
> val df = spark.sparkContext.parallelize(0 to 20, 2).map(x => testCase(x, x + "jack", Random.nextInt(2))).toDF()
>   .withColumn("col3", expr("keyid + 3000"))
>   .withColumn("p", lit(1))
> Step 2: do bulk_insert
> mergePartitionTable(df, 4, "default", "inc", tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
> Step 3: do upsert
> mergePartitionTable(df, 4, "default", "inc", tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
> Step 4: check the latest commit time and query
> spark.sql("set hoodie.inc.consume.mode=INCREMENTAL")
> spark.sql("set hoodie.inc.consume.max.commits=1")
> spark.sql("set hoodie.inc.consume.start.timestamp=20210628103935")
> spark.sql("select keyid, col3 from inc_rt where `_hoodie_commit_time` > '20210628103935' order by keyid").show(100, false)
> +-----+----+
> |keyid|col3|
> +-----+----+
> +-----+----+
>
> 2) *If we do insert_overwrite/insert_overwrite_table on a Hudi MOR table, the incremental query result is wrong when we want to query the data that existed before the insert_overwrite/insert_overwrite_table.*
> Step 1: do bulk_insert
> mergePartitionTable(df, 4, "default", "overInc", tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
> Now the commits are:
> [20210628160614.deltacommit]
> Step 2: do insert_overwrite_table
> mergePartitionTable(df, 4, "default", "overInc", tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert_overwrite_table")
> Now the commits are:
> [20210628160614.deltacommit, 20210628160923.replacecommit]
> Step 3: query the data before the insert_overwrite_table
> spark.sql("set hoodie.overInc.consume.mode=INCREMENTAL")
> spark.sql("set hoodie.overInc.consume.max.commits=1")
> spark.sql("set hoodie.overInc.consume.start.timestamp=0")
> spark.sql("select keyid, col3 from overInc_rt where `_hoodie_commit_time` > '0' order by keyid").show(100, false)
> +-----+----+
> |keyid|col3|
> +-----+----+
> +-----+----+
>
> 3) *Hive/Presto/Flink cannot read file groups that contain only log files.*
> When we use the HBase or in-memory index, a MOR table produces log files instead of parquet files, but Hive/Presto currently cannot read those file groups because they contain only log files.
> *HUDI-2048* mentions this problem.
>
> However, when we use the Spark datasource to execute an incremental query, none of the problems above occur. It is therefore necessary to keep the logic of mor_incremental_view for Hive consistent with the Spark datasource.
> We redo the logic of mor_incremental_view for Hive to solve the problems above and keep it consistent with the Spark datasource.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
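[Editor's note] For contrast with the Hive session in step 4 of problem 1, a minimal sketch (not from the issue) of the Spark-datasource incremental query that the reporter says behaves correctly. The table base path is hypothetical; the option keys are Hudi's standard datasource read options, and the begin instant reuses the timestamp from the repro above.

```scala
// Sketch: Spark-datasource incremental read of the same MOR table.
// Unlike the Hive incremental view, this path also returns records
// whose latest version lives only in log files.
val incDf = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20210628103935")
  .load("/path/to/inc") // hypothetical base path of the "inc" table

incDf.select("keyid", "col3").orderBy("keyid").show(100, false)
```

Under this query type the datasource scans file slices committed after the begin instant, including slices that consist of log files only, which is the behavior the PR brings to the Hive incremental view.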