Thanks for the quick response, Balaji! I think there is a lot here to continue with:

1. I did see that recent pull request for the delete API. I think collaborating to support another delete API with just the record key would be a great next step. I'll begin looking into it. Additionally, the scenario of using HBase as the global index is definitely something we'd be interested in understanding further.

2. Actually, I was speaking to the case of ComplexKeyGenerator. Currently, if any single component of it is null, it will throw an exception. If this is not the intended behavior, I'd be happy to fix this bug, as fixing it looks like it would solve our use case.

3. Thanks for the update on this. The Spark upgrade is definitely a large undertaking that I'd be happy to help with.

Thanks again,
Brandon

On 11/8/19, 3:52 PM, "Balaji Varadarajan" <varadar...@gmail.com> wrote:

Brandon,

Great initiative and thoughts. Thanks for writing a detailed description of what you are looking to achieve. Here are some of my comments/thoughts:

1. HUDI-326: There is some work happening in this direction, but we should be able to collaborate on this. Siva has opened a PR (https://github.com/apache/incubator-hudi/pull/1004) to support delete using only HoodieKey (partitionPath, recordKey). Technically, we can support an interface for delete with only recordKeys if the index is of type global (the current implementation supports HoodieGlobalBloomIndex). Within Uber, we use HBase as the global Hudi index to support partition-agnostic record-key lookups. In other words, we can have 2 flavors of delete APIs: one with input being RDD<HoodieKeys> (works for all index types) and another with input RDD<RecordKey> that works with a global index. Our vision is to support an external clustered index (global) as the de facto index that resides in DFS along with the dataset.

2. HUDI-327: IIUC, just like ComplexKeyGenerator, the new key generator would need composite keys (in this case primary and secondary, for breaking the "null" tie). Are you concerned about the record-key footprint for each key when using the key generated by ComplexKeyGenerator? In that case, it makes sense to me. Otherwise, ComplexKeyGenerator should be able to handle cases when some component of it is null, right?

3. As for HUDI-83, at least on the write side, we have tied this to the spark-2.4 upgrade. There is ongoing work in this regard. I will request the folks who are working on this to provide status. Last I knew, we were running into some test failures when doing this upgrade. But yes, as this is a massive upgrade, we would need your help in reviewing, debugging and testing this change :)

Others, thoughts?

Thanks,
Balaji.V

On Fri, Nov 8, 2019 at 2:49 PM Scheller, Brandon <bsche...@amazon.com.invalid> wrote:

> Hi Hudi community,
>
> We at AWS EMR are interested in starting work on a few different usability
> improvements for Hudi and we’re interested to hear your feedback.
>
> Here are some of our ideas:
> https://issues.apache.org/jira/browse/HUDI-326
> https://issues.apache.org/jira/browse/HUDI-327
>
> Additionally, we were hoping to help drive:
> https://issues.apache.org/jira/browse/HUDI-83 and its associated Hive
> Jira: https://issues.apache.org/jira/browse/HIVE-22224
>
> I am looking forward to improving Hudi with you all. And feel free to let
> us know if there is anything specific, you’d like us to look at.
>
> Thanks,
> Brandon
>