Dmitriy,

I am not fully confident that the partition ID is the best approach in all
cases. Even if we have full access to the database structure, there are
other problems.

Assume we have a table PERSON (ID NUMBER, NAME VARCHAR, SURNAME VARCHAR,
AGE NUMBER, EMPL_DATE DATE), and we add our own column PART NUMBER.

Although we already have indexes IDX1(NAME), IDX2(SURNAME), IDX3(AGE),
IDX4(EMPL_DATE), we have to add a new two-column index IDX5(PART, EMPL_DATE)
to preload, for example, recently employed persons at startup.

And if we'd like to query filtered data from the database, we'd also have
to create the other compound indexes IDX6(PART, NAME), IDX7(PART, SURNAME),
IDX8(PART, AGE). So we double the overhead caused by indexes.
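
To make the doubling concrete, here is a minimal sketch that only generates
the extra DDL for the PERSON example above. The class and method names are
hypothetical, and the CREATE INDEX syntax is the generic form, so it may
need adjusting for a particular database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PartIndexDdl {
    // Builds one compound (PART, column) index statement per already-indexed column.
    // Index names continue the IDXn numbering used in the example.
    static List<String> compoundIndexDdl(String table, String partColumn,
                                         List<String> indexedColumns, int firstIndexNo) {
        List<String> ddl = new ArrayList<>();
        int n = firstIndexNo;
        for (String col : indexedColumns) {
            ddl.add("CREATE INDEX IDX" + n + " ON " + table
                + " (" + partColumn + ", " + col + ")");
            n++;
        }
        return ddl;
    }

    public static void main(String[] args) {
        // Four single-column indexes exist already, so four compound ones are added.
        List<String> cols = Arrays.asList("EMPL_DATE", "NAME", "SURNAME", "AGE");
        for (String stmt : compoundIndexDdl("PERSON", "PART", cols, 5)) {
            System.out.println(stmt);
        }
    }
}
```

Every column we filter on gets a second index with PART as the leading
column, and each of them has to be maintained on every insert and update.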

After these modifications to the database have been made and the PART column
is filled, what should we do to preload the data?

We should perform as many database queries as there are partitions stored on
the nodes. With the default settings of the affinity functions, that is
1024 queries. Some calls may not return any data at all, and each of those
is a wasted network round-trip. It may also be a problem for some databases
to execute a large number of parallel queries efficiently without degrading
the total throughput.
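
A rough sketch of the query count, assuming one query per partition and no
backups. The round-robin assignment below is only a stand-in for the real
affinity function, and the query text and class name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionPreloadPlan {
    // Hypothetical preload query over the PERSON example; one execution per partition.
    static final String QUERY =
        "SELECT ID, NAME, SURNAME, AGE, EMPL_DATE FROM PERSON "
        + "WHERE PART = ? AND EMPL_DATE >= ?";

    // Round-robin stand-in for the affinity function (an assumption, not the real algorithm).
    static List<List<Integer>> assign(int partitions, int nodes) {
        List<List<Integer>> byNode = new ArrayList<>();
        for (int n = 0; n < nodes; n++) {
            byNode.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            byNode.get(p % nodes).add(p);
        }
        return byNode;
    }

    // Each owned partition costs one query, and so one network round-trip.
    static int totalQueries(List<List<Integer>> byNode) {
        int total = 0;
        for (List<Integer> parts : byNode) {
            total += parts.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Integer>> byNode = assign(1024, 4);
        System.out.println("Per-partition query: " + QUERY);
        for (int n = 0; n < byNode.size(); n++) {
            System.out.println("node " + n + " issues " + byNode.get(n).size() + " queries");
        }
        System.out.println("total round-trips: " + totalQueries(byNode));
    }
}
```

The total stays at 1024 no matter how many nodes there are; adding nodes
only changes how many of those queries hit the database at the same time.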

The DataStreamer approach may be faster, but that should be tested.

2016-11-16 16:40 GMT+03:00 Dmitriy Setrakyan <dsetrak...@apache.org>:

> On Wed, Nov 16, 2016 at 1:54 PM, Yakov Zhdanov <yzhda...@apache.org>
> wrote:
>
> > > On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov <yzhda...@apache.org>
> > wrote:
> >
> > > > > Yakov, I agree that such scenario should be avoided. I also think
> > that
> >
> > > > > loadCache(...) method, as it is right now, provides a way to avoid
> > it.
> >
> > > >
> >
> > > > No, it does not.
> >
> > > >
> > > Yes it does :)
> >
> > No it doesn't. Load cache should either send a query to DB that filters
> all
> > the data on server side which, in turn, may result to full-scan of 2 Tb
> > data set dozens of times (equal to node count) or send a query that
> brings
> > the whole dataset to each node which is unacceptable as well.
> >
>
> Why not store the partition ID in the database and query only local
> partitions? Whatever approach we design with a DataStreamer will be slower
> than this.
>



-- 
Thanks,
Alexandr Kuramshin
