[GitHub] [hudi] yanghua commented on pull request #2410: [HUDI-1510] Move HoodieEngineContext and its dependencies to hudi-common
yanghua commented on pull request #2410:
URL: https://github.com/apache/hudi/pull/2410#issuecomment-755878176

I agree with this change, so that it does not block progress. However, in terms of project layout, I have always felt that things would be clearer if we treated writing and reading (or querying) equally. Currently, we clearly define the "client" as a write client; on the query side, we just rely on the capabilities of an external engine. But if we hope to introduce a native Java query client and a native Python query client in the future, where would they live? The same question applies if we consider introducing file-level APIs. How can we lay out `common`, `client`, `write`, `query` (`read`), `engine`, ... more clearly and intuitively?

I still think that introducing query-side things into common is not elegant enough. Although Hudi is just a library today, we could also regard it as a "service" in the future. For any service, `Input` and `Output` are clear and of equal standing. Of course, we can discuss these in depth in the future. For now, let us merge it and complete the work.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yanghua commented on pull request #2410: [HUDI-1510] Move HoodieEngineContext and its dependencies to hudi-common
yanghua commented on pull request #2410:
URL: https://github.com/apache/hudi/pull/2410#issuecomment-755391853

> @yanghua Did mull this a lot, along similar lines. I was thinking about Engine as a general construct that provides parallel execution, rather than being tied to the client/writing. For example, we can use parallelized listing even on the InputFormat implementations.
>
> The core issue is we want to parallelize `FSUtils.getAllPartitionPaths()` (and the underlying call to `HoodieTableMetadata#getAllPartitionPaths()`). If you can think of a better way, please let us know.

I may not have a better way, but I would personally prefer to keep common a little simpler, so that it is not mixed with the engine. The engine need not be tied to client/writing, but could it become a standalone module, for example one that sits between common and client (`common <- engine <- client`)? The operations to be parallelized would then live in the engine module. I don't know whether this is feasible.
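To make the `common <- engine <- client` idea above more concrete, here is a minimal sketch of what a standalone engine layer could look like. It is only an illustration under stated assumptions: `EngineContext`, `LocalEngineContext`, and `listAllPartitionPaths` are hypothetical names invented for this example, not Hudi's actual classes or signatures, and the "listing" is a plain function rather than a real filesystem call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class EngineLayeringSketch {

    // Hypothetical "engine" layer: a generic parallel-execution construct
    // that knows nothing about writing or querying.
    interface EngineContext {
        <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism);
    }

    // Hypothetical local implementation backed by Java parallel streams.
    // The parallelism hint is ignored here; a Spark- or Flink-backed
    // implementation would presumably honor it.
    static final class LocalEngineContext implements EngineContext {
        @Override
        public <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism) {
            // Collectors.toList() keeps encounter order even for parallel streams.
            return data.parallelStream().map(func).collect(Collectors.toList());
        }
    }

    // Hypothetical stand-in for the operation to be parallelized: list the
    // partitions under each top-level directory and flatten the results.
    // The lister is injected, so this code depends only on EngineContext.
    static List<String> listAllPartitionPaths(EngineContext ctx,
                                              List<String> topLevelDirs,
                                              Function<String, List<String>> lister) {
        List<List<String>> perDir = ctx.map(topLevelDirs, lister, topLevelDirs.size());
        List<String> all = new ArrayList<>();
        for (List<String> paths : perDir) {
            all.addAll(paths);
        }
        return all;
    }

    public static void main(String[] args) {
        EngineContext ctx = new LocalEngineContext();
        // Fake lister: pretend every directory contains two sub-partitions.
        List<String> paths = listAllPartitionPaths(
                ctx,
                Arrays.asList("2021/01", "2021/02"),
                dir -> Arrays.asList(dir + "/01", dir + "/02"));
        System.out.println(paths);
    }
}
```

In this sketch, both a write client and a query-side reader could call `listAllPartitionPaths` through the same `EngineContext`, which is the property the quoted comment asks for; whether the interface lives in common or in a separate engine module is exactly the open layout question.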
[GitHub] [hudi] yanghua commented on pull request #2410: [HUDI-1510] Move HoodieEngineContext and its dependencies to hudi-common
yanghua commented on pull request #2410:
URL: https://github.com/apache/hudi/pull/2410#issuecomment-755261406

@vinothchandar I have a concern about this change. The common module does not seem like a good place to hold the client engine context; doing so would break the abstraction. The write client's engine context should live in the client common package.