hudi-bot opened a new issue, #14817: URL: https://github.com/apache/hudi/issues/14817
Sometimes users have a requirement where there are multiple producers and each producer writes only a subset of the columns, e.g.:

- Producer 1: rec_key, colA, colB, colC
- Producer 2: rec_key, colC, colD, colE, colF
- Producer 3: rec_key, colB, colF, colI, colK

Expectation from Hudi: keep merging in new columns and inject default values for all missing columns. So for the use case above, the final Hudi table's schema is expected to be: rec_key, colA, colB, colC, colD, colE, colF, colI, colK

## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-2175
- Type: Improvement

---

## Comments

**15/Mar/22 01:24 — x1q1j1:** hi [~shivnarayan] I think this implementation is compatible with queries, but it is not good enough. This scenario is very common in machine learning and feature engineering, where each run of a machine learning algorithm computes several features (data columns). When reading only the required columns, we should avoid loading all the data and then filtering; we should support column-oriented storage first. For example, we could add column families like HBase: write separate data files per column group when writing data, and read only the relevant columns when reading.
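The schema-merge behavior requested above can be sketched as follows. This is a minimal plain-Python simulation, not Hudi's actual implementation: the table schema is the union of all columns seen so far, and records (including previously written ones) are padded with a default (`None` here) for any column they do not carry. The function and variable names are illustrative only.

```python
# Minimal sketch of the requested behavior (not Hudi code):
# each producer writes only a subset of columns; the table schema grows
# to the union of all columns, and missing columns get a default (None).

def merge_schema(table_schema, batch_columns):
    """Extend the table schema with any new columns, preserving order."""
    merged = list(table_schema)
    for col in batch_columns:
        if col not in merged:
            merged.append(col)
    return merged

def ingest(table_rows, table_schema, batch):
    """Append a batch of records, injecting None for missing columns."""
    batch_columns = list(batch[0].keys())
    new_schema = merge_schema(table_schema, batch_columns)
    # Backfill previously written rows with defaults for new columns,
    # and pad the incoming batch the same way.
    rows = [{col: row.get(col) for col in new_schema} for row in table_rows]
    rows += [{col: rec.get(col) for col in new_schema} for rec in batch]
    return rows, new_schema

rows, schema = [], []
rows, schema = ingest(rows, schema, [{"rec_key": 1, "colA": 1, "colB": 2, "colC": 3}])
rows, schema = ingest(rows, schema, [{"rec_key": 2, "colC": 4, "colD": 5, "colE": 6, "colF": 7}])
rows, schema = ingest(rows, schema, [{"rec_key": 3, "colB": 8, "colF": 9, "colI": 10, "colK": 11}])
print(schema)
# ['rec_key', 'colA', 'colB', 'colC', 'colD', 'colE', 'colF', 'colI', 'colK']
```

In Hudi terms, this corresponds to evolving the table's Avro schema additively on each commit while readers see `null` (or a declared default) for columns absent from older file groups.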
