Dmitriy, I am not fully confident that storing the partition ID is the best approach in all cases. Even if we have full access to the database structure, there are other problems.

Assume we have a table PERSON (ID NUMBER, NAME VARCHAR, SURNAME VARCHAR, AGE NUMBER, EMPL_DATE DATE), and we add our column PART NUMBER. We already have the indexes IDX1(NAME), IDX2(SURNAME), IDX3(AGE), IDX4(EMPL_DATE), but we have to add a new two-column index IDX5(PART, EMPL_DATE) to preload at startup, for example, recently employed persons. And if we'd like to query filtered data from the database, we'd also have to create the other compound indexes IDX6(PART, NAME), IDX7(PART, SURNAME), IDX8(PART, AGE). So we double the overhead caused by indexes.

After these modifications to the database have been made and the PART column is filled, what should we do to preload the data? We have to perform as many database queries as there are partitions stored on the nodes. With the default settings of the affinity functions that is 1024 queries. Some calls may not return any data at all, and each of those is a wasted network round-trip. It may also be a problem for some databases to execute that many parallel queries efficiently without degrading total throughput.

The DataStreamer approach may be faster, but it should be tested.

2016-11-16 16:40 GMT+03:00 Dmitriy Setrakyan <dsetrak...@apache.org>:

> On Wed, Nov 16, 2016 at 1:54 PM, Yakov Zhdanov <yzhda...@apache.org> wrote:
>
> > > On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov <yzhda...@apache.org> wrote:
> > >
> > > > > Yakov, I agree that such scenario should be avoided. I also think that
> > > > > loadCache(...) method, as it is right now, provides a way to avoid it.
> > > >
> > > > No, it does not.
> > >
> > > Yes it does :)
> >
> > No it doesn't. Load cache should either send a query to DB that filters all
> > the data on server side which, in turn, may result to full-scan of 2 Tb
> > data set dozens of times (equal to node count) or send a query that brings
> > the whole dataset to each node which is unacceptable as well.
>
> Why not store the partition ID in the database and query only local
> partitions? Whatever approach we design with a DataStreamer will be slower
> than this.

--
Thanks,
Alexandr Kuramshin
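
P.S. To make the query count concrete, here is a rough sketch of the per-partition preload. It is plain Java with no Ignite calls. The class and method names are hypothetical, and the modulo assignment of partitions to nodes is only a stand-in for the real affinity function, which distributes partitions differently. The SQL text assumes the PERSON table and the IDX5(PART, EMPL_DATE) index described at the top of this message.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionPreloadSketch {
    /** Partition count taken from the message above (default affinity settings). */
    static final int TOTAL_PARTITIONS = 1024;

    /**
     * Hypothetical assignment: partition p lives on node (p % nodeCount).
     * This is an assumption for illustration, not the real affinity function.
     */
    static List<Integer> localPartitions(int nodeIndex, int nodeCount) {
        List<Integer> parts = new ArrayList<>();
        for (int p = 0; p < TOTAL_PARTITIONS; p++) {
            if (p % nodeCount == nodeIndex) {
                parts.add(p);
            }
        }
        return parts;
    }

    /**
     * Builds one query per local partition. Each query is meant to be served
     * by IDX5(PART, EMPL_DATE); the date is left as a bind parameter.
     */
    static List<String> preloadQueries(List<Integer> parts) {
        List<String> queries = new ArrayList<>();
        for (int p : parts) {
            queries.add("SELECT ID, NAME, SURNAME, AGE, EMPL_DATE FROM PERSON"
                + " WHERE PART = " + p + " AND EMPL_DATE >= ?");
        }
        return queries;
    }

    public static void main(String[] args) {
        int nodeCount = 4;
        int total = 0;
        for (int node = 0; node < nodeCount; node++) {
            total += preloadQueries(localPartitions(node, nodeCount)).size();
        }
        // Every partition is queried exactly once, whatever the node count.
        System.out.println("Queries issued across the cluster: " + total);
    }
}
```

Whatever the number of nodes, the database still receives one query per partition, and any query for a partition with no recently employed persons comes back empty.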