[GitHub] [hudi] yanghua commented on pull request #2410: [HUDI-1510] Move HoodieEngineContext and its dependencies to hudi-common

2021-01-06 Thread GitBox


yanghua commented on pull request #2410:
URL: https://github.com/apache/hudi/pull/2410#issuecomment-755878176


   I agree with this change so that we do not block progress. However, I still 
feel that the project layout would be clearer if we treated writing and reading 
(or call it querying) equally. Currently, we clearly define the "client" as a 
write client, and on the query side we simply rely on the capabilities of an 
external engine. But if we hope to introduce a native Java query client and a 
native Python query client in the future, where should they live? And if we 
consider introducing file-level APIs, how do we state the layout more clearly 
and intuitively: `common`, `client`, `write`, `query` (`read`), `engine`...?
   
   I still think that pulling query-side concerns into common is not elegant. 
Although Hudi is just a library today, we can also regard it as a "service" in 
the future, and for any service, `Input` and `Output` are equally important and 
clearly separated.
   
   Of course, we can discuss all of this in depth later. For now, let us merge 
it and complete the work.
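
   To make the symmetry point concrete, here is a purely hypothetical sketch; 
the `WriteClient` and `QueryClient` interfaces below are made-up names for 
illustration and do not exist in Hudi:

   ```java
   import java.util.List;

   // Hypothetical sketch only: a write-side client and a native query-side
   // client as peers (e.g. under symmetric "write" and "query" modules),
   // instead of "client" implicitly meaning the write path alone.
   interface WriteClient<T> {
     void upsert(List<T> records, String instantTime);
   }

   interface QueryClient<T> {
     List<T> snapshotRead(String tablePath);
     List<T> incrementalRead(String tablePath, String beginInstantTime);
   }
   ```

   The point here is only the symmetry of the layout, not the specific method 
names.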



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on pull request #2410: [HUDI-1510] Move HoodieEngineContext and its dependencies to hudi-common

2021-01-06 Thread GitBox


yanghua commented on pull request #2410:
URL: https://github.com/apache/hudi/pull/2410#issuecomment-755391853


   > @yanghua I did mull this over a lot, along similar lines. I was thinking 
about Engine as a general construct that provides parallel execution, rather 
than something tied to the client/writing. For example, we can use parallelized 
listing even in the InputFormat implementations.
   > 
   > The core issue is that we want to parallelize 
`FSUtils.getAllPartitionPaths()` (and the underlying call to 
`HoodieTableMetadata#getAllPartitionPaths()`). If you can think of a better 
way, please let us know.
   
   Maybe I don't have a better way, but I personally tend to keep common a 
little simpler, so that it does not blend with the engine. The engine need not 
be tied to the client/writing, but could it become a standalone module, for 
example sitting between common and client: `common <- engine <- client`? The 
operations to be parallelized would then live in the engine module; I am not 
sure whether that is feasible.
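
   To make that idea a little more concrete, here is a minimal sketch under 
those assumptions. The `EngineContext` interface, `LocalEngineContext` class, 
and `listAllPartitionPaths` helper below are hypothetical names for 
illustration, not the actual `HoodieEngineContext` API:

   ```java
   import java.util.List;
   import java.util.function.Function;
   import java.util.stream.Collectors;

   // Hypothetical sketch: a standalone "engine" module could own an abstraction
   // like this, sitting between common and client (common <- engine <- client).
   // This is not the real HoodieEngineContext API.
   interface EngineContext {
     // Apply func to each input element, potentially in parallel on the engine.
     <I, O> List<O> map(List<I> inputs, Function<I, O> func, int parallelism);
   }

   // A plain-Java fallback that ignores the parallelism hint and just uses a
   // parallel stream.
   class LocalEngineContext implements EngineContext {
     @Override
     public <I, O> List<O> map(List<I> inputs, Function<I, O> func, int parallelism) {
       return inputs.parallelStream().map(func).collect(Collectors.toList());
     }
   }

   class PartitionListingSketch {
     // Hypothetical stand-in for the parallelized listing: expand each
     // top-level folder via whatever engine is plugged in.
     static List<String> listAllPartitionPaths(EngineContext engine, List<String> folders) {
       return engine.map(folders, folder -> folder + "/partition", 10);
     }

     public static void main(String[] args) {
       EngineContext engine = new LocalEngineContext();
       System.out.println(listAllPartitionPaths(engine, List.of("2021/01/05", "2021/01/06")));
     }
   }
   ```

   Under this layering, the parallelized listing helper would live in the 
engine module rather than in common, and each engine integration (Spark, Flink, 
plain Java) could supply its own `map` implementation.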







[GitHub] [hudi] yanghua commented on pull request #2410: [HUDI-1510] Move HoodieEngineContext and its dependencies to hudi-common

2021-01-06 Thread GitBox


yanghua commented on pull request #2410:
URL: https://github.com/apache/hudi/pull/2410#issuecomment-755261406


   @vinothchandar I have a concern about this change. The common module does 
not seem like a good place to hold the client engine context; it would break 
the abstraction. The write client engine context should be hosted in the client 
common package.


