hudi-bot opened a new issue, #14817: URL: https://github.com/apache/hudi/issues/14817
Sometimes users have a requirement where there are multiple producers and each producer writes only a subset of the columns, e.g.:

- Producer 1: rec_key, colA, colB, colC
- Producer 2: rec_key, colC, colD, colE, colF
- Producer 3: rec_key, colB, colF, colI, colK

Expectation from Hudi: keep merging in new columns and inject default values for all missing columns. So for the use case above, the final Hudi table's schema is expected to be: rec_key, colA, colB, colC, colD, colE, colF, colI, colK

## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-2175
- Type: Improvement

---

## Comments

**15/Mar/22 01:24 — x1q1j1:** hi [~shivnarayan] I think this implementation is compatible with queries, but it is not good enough. This scenario is very common in machine learning and feature engineering, where each run of a machine learning algorithm computes several features (data columns). When reading only the required columns, we should avoid loading all the data and then filtering; we should support column-oriented storage first. For example, we could add column families like HBase: write separate data files per column group when writing data, and read only the relevant columns when reading.
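The schema-merge behavior requested above can be sketched as follows. This is a minimal plain-Python simulation, not Hudi's actual implementation: the table schema is the union of all columns seen so far, and records (including previously written ones) are padded with a default (`None` here) for any column they do not carry. The function and variable names are illustrative only.

```python
# Minimal sketch of the requested behavior (not Hudi code):
# each producer writes only a subset of columns; the table schema grows
# to the union of all columns, and missing columns get a default (None).

def merge_schema(table_schema, batch_columns):
    """Extend the table schema with any new columns, preserving order."""
    merged = list(table_schema)
    for col in batch_columns:
        if col not in merged:
            merged.append(col)
    return merged

def ingest(table_rows, table_schema, batch):
    """Append a batch of records, injecting None for missing columns."""
    batch_columns = list(batch[0].keys())
    new_schema = merge_schema(table_schema, batch_columns)
    # Backfill previously written rows with defaults for new columns,
    # and pad the incoming batch the same way.
    rows = [{col: row.get(col) for col in new_schema} for row in table_rows]
    rows += [{col: rec.get(col) for col in new_schema} for rec in batch]
    return rows, new_schema

rows, schema = [], []
rows, schema = ingest(rows, schema, [{"rec_key": 1, "colA": 1, "colB": 2, "colC": 3}])
rows, schema = ingest(rows, schema, [{"rec_key": 2, "colC": 4, "colD": 5, "colE": 6, "colF": 7}])
rows, schema = ingest(rows, schema, [{"rec_key": 3, "colB": 8, "colF": 9, "colI": 10, "colK": 11}])
print(schema)
# ['rec_key', 'colA', 'colB', 'colC', 'colD', 'colE', 'colF', 'colI', 'colK']
```

In Hudi terms, this corresponds to evolving the table's Avro schema additively on each commit while readers see `null` (or a declared default) for columns absent from older file groups.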
