[GitHub] [hudi] n3nash commented on issue #2095: Inserts in partitioned MoR RO view visible without compaction

2020-09-21 Thread GitBox


n3nash commented on issue #2095:
URL: https://github.com/apache/hudi/issues/2095#issuecomment-696212548







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on issue #2095: Inserts in partitioned MoR RO view visible without compaction

2020-09-21 Thread GitBox


n3nash commented on issue #2095:
URL: https://github.com/apache/hudi/issues/2095#issuecomment-696259058


   @vishalpathak1986 Hudi internally maintains an index to tag each incoming
record with the fileId it maps to. If inserts were written to log files, the
index would need a way to know which log file a particular record was written
to. Indexes such as BloomIndex are written to the parquet files, so we can
figure out whether a record is present in a parquet file or not, but we don't
have such an index for log files.
   There is no config to turn off writing inserts as parquet; you have to use
an index implementation that can index log files. Currently, only the
HBaseIndex can index log files ->
https://github.com/apache/hudi/blob/c8e19e2def0c33415bc3945ffb81f524c484c924/hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java#L483.
 In the future, the record level index I pointed out earlier will be able to
index log files, which will eliminate the need for an external K-V store for
this feature.
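
   To make the tagging argument concrete, here is a minimal toy model in
Python. It is not Hudi code: `BaseFile`, `ToyIndex` and `route` are
hypothetical names invented for this illustration, and the only assumption
carried over from the comment is that the index can look inside base (parquet)
files but not inside log files.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class BaseFile:
    """Stand-in for a parquet base file whose record keys the index can check."""
    file_id: str
    keys: Set[str] = field(default_factory=set)


@dataclass
class ToyIndex:
    """Hypothetical index that only sees base files, never log files."""
    base_files: List[BaseFile] = field(default_factory=list)

    def tag(self, record_key: str) -> Optional[str]:
        # Return the fileId holding this key, or None if no base file has it.
        for base_file in self.base_files:
            if record_key in base_file.keys:
                return base_file.file_id
        return None


def route(index: ToyIndex, record_keys: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Split incoming keys into updates (tagged with a fileId) and inserts (untagged)."""
    updates: Dict[str, str] = {}
    inserts: List[str] = []
    for key in record_keys:
        file_id = index.tag(key)
        if file_id is None:
            inserts.append(key)
        else:
            updates[key] = file_id
    return updates, inserts


index = ToyIndex([BaseFile("file-1", {"a", "b"})])

# First batch: "a" is found in a base file (update), "c" is new (insert).
updates, inserts = route(index, ["a", "c"])

# Suppose "c" had been written to a log file only. The toy index cannot see
# log files, so a later write of "c" is still treated as an insert.
log_file_keys = {"c"}
updates_again, inserts_again = route(index, ["c"])
```

   In this sketch the second write of "c" is not tagged as an update, which
is the gap an index that covers log files would close.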







[GitHub] [hudi] n3nash commented on issue #2095: Inserts in partitioned MoR RO view visible without compaction

2020-09-21 Thread GitBox


n3nash commented on issue #2095:
URL: https://github.com/apache/hudi/issues/2095#issuecomment-696212548


   @vishalpathak1986 Currently, Hudi supports writing inserts in columnar file
format (parquet) for MOR tables. All inserts go to parquet files, while
updates go to Avro log files.
   This is done for 2 reasons: a) if you only have inserts, you don't have to
compact later, because your data is written in columnar file format to start
with; b) there is no index that can index log files.
   Writing inserts to log files will soon be supported with ->
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+08+%3A+Record+level+indexing+mechanisms+for+Hudi+datasets
 or you can try the
[HBaseIndex](https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java)
 in the meantime, which requires an HBase cluster.
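
   As a rough sketch of what trying the HBaseIndex might look like from a
Spark job, the snippet below only builds a dictionary of writer options. The
helper `hbase_index_options` and the ZooKeeper hosts and table name are
hypothetical, and the option keys are the Hudi config names as best recalled;
please check them against the configuration reference for your Hudi version
before relying on them.

```python
from typing import Dict


def hbase_index_options(zk_quorum: str, zk_port: int, hbase_table: str) -> Dict[str, str]:
    """Build writer options selecting the HBase index for a MOR table.

    Key names are assumptions to verify against your Hudi release.
    """
    return {
        # Merge-on-read table type (the key name has differed across releases).
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        # Use the HBase-backed index instead of the default bloom index.
        "hoodie.index.type": "HBASE",
        # Connection details for the HBase cluster that stores the index.
        "hoodie.index.hbase.zkquorum": zk_quorum,
        "hoodie.index.hbase.zkport": str(zk_port),
        "hoodie.index.hbase.table": hbase_table,
    }


# Hypothetical ZooKeeper quorum and HBase table name, for illustration only.
opts = hbase_index_options("zk1.example.com,zk2.example.com", 2181, "hudi_record_index")
```

   With PySpark, a dictionary like this would typically be passed along with
the usual table name, record key and precombine options, for example via
`df.write.format("hudi").options(**opts)`.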


