[ 
https://issues.apache.org/jira/browse/HUDI-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006991#comment-17006991
 ] 

Vinoth Chandar commented on HUDI-485:
-------------------------------------

Here is a brain dump and you can then take over :) 

 

Let's say the incremental query asks for all records with
`_hoodie_commit_time > t1`.

In a nutshell, what the commit metadata (the .commit and .deltacommit files) 
actually gives us is the file slice (a base parquet file written at an instant 
time, plus a set of log files generated as deltas on top of the base). The 
parquet file and the logs can contain records that were written before time 
t1, so the incremental query filters at two levels.

- First, it gets all the latest file slices that were written to after time t1.

- Next, within these file slices, it keeps only the records with 
`_hoodie_commit_time > t1`.
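
The two levels above can be sketched in plain Java. This is only an 
illustration, not Hudi's actual code: `Row`, `FileSlice` and 
`incrementalKeys` are hypothetical names, and the sketch assumes instant times 
are fixed-width timestamp strings, so that string comparison matches 
chronological order.

```java
import java.util.ArrayList;
import java.util.List;

class IncrementalQuerySketch {

    /** Hypothetical stand-in for a record carrying the _hoodie_commit_time field. */
    static final class Row {
        final String key;
        final String commitTime;

        Row(String key, String commitTime) {
            this.key = key;
            this.commitTime = commitTime;
        }
    }

    /** Hypothetical stand-in for a file slice: the last instant it was written to, plus its rows. */
    static final class FileSlice {
        final String lastWrittenInstant;
        final List<Row> rows;

        FileSlice(String lastWrittenInstant, List<Row> rows) {
            this.lastWrittenInstant = lastWrittenInstant;
            this.rows = rows;
        }
    }

    /** Returns the keys of all rows with commitTime > t1, filtering at two levels. */
    static List<String> incrementalKeys(List<FileSlice> slices, String t1) {
        List<String> keys = new ArrayList<>();
        for (FileSlice slice : slices) {
            // Level 1: skip file slices that were not written to after t1.
            if (slice.lastWrittenInstant.compareTo(t1) <= 0) {
                continue;
            }
            for (Row row : slice.rows) {
                // Level 2: a surviving slice can still hold older rows, so check each one.
                if (row.commitTime.compareTo(t1) > 0) {
                    keys.add(row.key);
                }
            }
        }
        return keys;
    }
}
```

Level 1 only prunes work; level 2 is what makes the result correct, since a 
slice written after t1 usually still carries rows from earlier commits.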

 

(P.S: This sort of record level metadata is what differentiates Hudi as a true 
streaming system from others) 

I will take copy-on-write to explain this, since it's easier, but it 
generalizes to MOR as well. For copy-on-write, the commit metadata points to 
all the parquet files that were written (either new files or new versions of 
an existing file) at that commit. So, by reading all the .commit files after a 
given time t1, we know all the parquet files containing records written after 
time t1 (a superset). But this set of files will also have older records, and 
thus we need to push a filter down to the InputFormat level (see 
IncrementalRelation.scala in hudi-spark for the logic that does this 
automatically in Spark), so that only the rows matching the 
`_hoodie_commit_time > t1` criterion are returned. Pushing this down to 
parquet is the most efficient way.
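
As a rough sketch of the "read all the .commit files after t1" step: assume 
the commit metadata has already been parsed into a sorted map from commit 
instant to the parquet files written at that commit. The map, the class name 
and `filesWrittenAfter` are hypothetical, not Hudi APIs, and the sketch does 
not collapse multiple versions of the same file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;

class CommitTimelineSketch {

    /**
     * commitToFiles is a hypothetical in-memory view of the .commit metadata:
     * commit instant -> parquet files written (new or new version) at that commit.
     */
    static List<String> filesWrittenAfter(NavigableMap<String, List<String>> commitToFiles,
                                          String t1) {
        List<String> files = new ArrayList<>();
        // tailMap(t1, false) keeps only the commits strictly after t1.
        for (List<String> written : commitToFiles.tailMap(t1, false).values()) {
            files.addAll(written);
        }
        // This is only the superset: rows inside these files still need the
        // _hoodie_commit_time > t1 filter applied when they are read.
        return files;
    }
}
```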

 

When we tried to do this before (see 
[https://github.com/apache/incubator-hudi/blob/hoodie-0.3.0/hoodie-hadoop-mr/src/main/java/com/uber/hoodie/hadoop/HoodieInputFormat.java#L192]),
the predicate did not actually work.


> Check for where clause is wrong in HiveIncrementalPuller
> --------------------------------------------------------
>
>                 Key: HUDI-485
>                 URL: https://issues.apache.org/jira/browse/HUDI-485
>             Project: Apache Hudi (incubating)
>          Issue Type: Sub-task
>          Components: Incremental Pull, newbie
>            Reporter: Pratyaksh Sharma
>            Assignee: Pratyaksh Sharma
>            Priority: Major
>
> HiveIncrementalPuller checks the clause in incrementalSqlFile like this:
>
>   if (!incrementalSQL.contains("`_hoodie_commit_time` > '%targetBasePath'")) {
>     LOG.info("Incremental SQL : " + incrementalSQL
>         + " does not contain `_hoodie_commit_time` > %targetBasePath. Please add "
>         + "this clause for incremental to work properly.");
>     throw new HoodieIncrementalPullSQLException(
>         "Incremental SQL does not have clause `_hoodie_commit_time` > '%targetBasePath', which "
>         + "means its not pulling incrementally");
>   }
>
> Basically, we are trying to add a placeholder here, which is later replaced 
> with config.fromCommitTime here:
>
>   incrementalPullSQLtemplate.add("incrementalSQL",
>       String.format(incrementalSQL, config.fromCommitTime));
>
> Hence, the above check needs to be replaced with `_hoodie_commit_time` > %s
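
A minimal sketch of the check proposed in the description, outside of Hudi. 
`IncrementalSqlCheckSketch` and `render` are hypothetical names, 
`IllegalArgumentException` stands in for `HoodieIncrementalPullSQLException`, 
and the single quotes around `%s` are an assumption carried over from the 
existing check.

```java
class IncrementalSqlCheckSketch {

    // Assumed form of the clause: the %s placeholder that String.format fills in.
    static final String CLAUSE = "`_hoodie_commit_time` > '%s'";

    /** Validates that the SQL carries the placeholder clause, then substitutes the commit time. */
    static String render(String incrementalSQL, String fromCommitTime) {
        if (!incrementalSQL.contains(CLAUSE)) {
            // Stand-in for HoodieIncrementalPullSQLException in the real code.
            throw new IllegalArgumentException(
                "Incremental SQL does not have clause " + CLAUSE
                    + ", which means it is not pulling incrementally");
        }
        // Same substitution as in the description: %s is replaced by the commit time.
        return String.format(incrementalSQL, fromCommitTime);
    }
}
```

For example, `render("select * from t where `_hoodie_commit_time` > '%s'", 
"20200101000000")` substitutes the commit time into the clause, while SQL 
without the placeholder is rejected before any formatting happens.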



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
