Looks good to me.

But I would suggest considering one more use case:

If the user knows their data, they could split the loading manually.
For example: the Persons table contains 10M rows.
The user could provide something like:
cache.loadCache(null,
    "Person", "select * from Person where id < 1_000_000",
    "Person", "select * from Person where id >= 1_000_000 and id < 2_000_000",
    ....
    "Person", "select * from Person where id >= 9_000_000 and id < 10_000_000"
);

or maybe it could be some descriptor object like:

{
    sql: "select * from Person where id >= ? and id < ?",
    range: 0...10_000_000
}

In this case the provided queries will be sent to as many nodes as there are
queries. The data will be loaded in parallel, and for keys that are not local
a data streamer should be used (as described in Alexandr's description below).
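
To make the second variant a bit more concrete, here is a rough sketch of what
such a descriptor could look like (purely hypothetical: neither this class nor
a matching loadCache overload exist today, all names are made up):

    // Hypothetical descriptor: one parameterized query plus the key range
    // and a step that defines how the range is split into sub-queries.
    public class RangeQueryDescriptor {
        final String sql;  // query with two positional parameters: lower and upper bound
        final long from;   // inclusive start of the full id range
        final long to;     // exclusive end of the full id range
        final long step;   // size of each sub-range, e.g. 1_000_000

        public RangeQueryDescriptor(String sql, long from, long to, long step) {
            this.sql = sql;
            this.from = from;
            this.to = to;
            this.step = step;
        }
    }

    // Possible usage (again, hypothetical API):
    // cache.loadCache(null, new RangeQueryDescriptor(
    //     "select * from Person where id >= ? and id < ?", 0, 10_000_000, 1_000_000));

The node that receives such a descriptor could then generate one sub-query per
step and distribute the sub-queries across the cluster.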

I think it is a good issue for Ignite 2.0.

Vova, Val - what do you think?


On Mon, Nov 14, 2016 at 4:01 PM, Alexandr Kuramshin <ein.nsk...@gmail.com>
wrote:

> All right,
>
> Let's assume a simple scenario. When IgniteCache.loadCache is invoked, we
> check whether the cache is local, and if it is not, we initiate the new
> loading logic.
>
> First, we pick a "streamer" node. This could be done by
> utilizing LoadBalancingSpi, or it may be configured statically, for example
> so that the streamer node runs on the same host as the persistence storage
> provider.
>
> After that we start the loading task on the streamer node, which creates
> an IgniteDataStreamer and loads the cache with CacheStore.loadCache. Every
> call to IgniteBiInClosure.apply simply invokes IgniteDataStreamer.addData.
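>
> To make this concrete, a rough sketch of the loading task in terms of the
> public APIs (pseudo-code, not the actual implementation; the cache name and
> method name are illustrative):
>
>     // Runs on the selected streamer node. `ignite` is the local Ignite
>     // instance, `store` is the CacheStore configured for cache "myCache".
>     void loadViaStreamer(Ignite ignite, CacheStore<Object, Object> store) {
>         try (IgniteDataStreamer<Object, Object> streamer = ignite.dataStreamer("myCache")) {
>             store.loadCache(new IgniteBiInClosure<Object, Object>() {
>                 @Override public void apply(Object key, Object val) {
>                     // Route every loaded entry to its owner node(s) instead
>                     // of keeping only the locally-mapped part.
>                     streamer.addData(key, val);
>                 }
>             });
>         }
>     }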
>
> This implementation completely relieves the persistence storage provider of
> the redundant queries. Network overhead is also decreased in the case of
> partitioned caches. For two nodes we get 1 1/2 times the amount of data
> transferred over the network (1 part will be transferred from the
> persistence storage to the streamer, and then 1/2 from the streamer node to
> the other node). For three nodes it will be 1 2/3, and so on, up to two
> times the amount of data on big clusters.
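>
> (Putting these numbers into a general form: with N nodes and no backups the
> total transfer is 1 + (N - 1)/N = 2 - 1/N times the data set size, which
> tends to 2 as N grows.)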
>
> I'd like to propose an additional optimization here. If we have the
> streamer node on the same machine as the persistence storage provider, then
> we remove the network overhead of the storage-to-streamer transfer as well.
> It could be some special daemon node assigned for cache loading in the
> cache configuration, or an ordinary server node.
>
> Certainly these calculations were done under the assumption of an evenly
> partitioned cache with only primary copies (no backups). In the case of one
> backup (the most frequent case, I think), we get 2 times the amount of data
> transferred over the network on two nodes, 2 1/3 on three, 2 1/2 on four,
> and so on, up to three times the amount of data on big clusters. Hence it's
> still better than the current implementation. In the worst case of a fully
> replicated cache we get N+1 times the amount of data transferred over the
> network (where N is the number of nodes in the cluster). But that's not a
> problem in small clusters, and only a small extra overhead in big clusters.
> And we still gain the persistence storage provider optimization.
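>
> (In the same general form, with B backups the transfer is
> 1 + (B + 1)(N - 1)/N = (B + 2) - (B + 1)/N times the data set, i.e. about
> B + 2 on big clusters; B = 1 gives the 3 - 2/N figures above.)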
>
> Now let's take a more complex scenario. To achieve some level of
> parallelism, we could split our cluster into several groups. It could be a
> parameter of the IgniteCache.loadCache method or a cache configuration
> option. The number of groups could be a fixed value, or it could be
> calculated dynamically from the maximum number of nodes per group.
>
> After splitting the whole cluster into groups, we take a streamer node in
> each group and submit a cache loading task similar to the single-streamer
> scenario, except that only the keys mapped to nodes of the cluster group
> where the streamer node is running are passed to the
> IgniteDataStreamer.addData method.
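>
> A sketch of the per-group variant (again pseudo-code against the public
> APIs, not the actual implementation):
>
>     // Same loading task as above, but scoped to one cluster group.
>     // `group` is the ClusterGroup this streamer node is responsible for.
>     void loadForGroup(Ignite ignite, CacheStore<Object, Object> store, ClusterGroup group) {
>         Affinity<Object> aff = ignite.affinity("myCache");
>
>         try (IgniteDataStreamer<Object, Object> streamer = ignite.dataStreamer("myCache")) {
>             store.loadCache(new IgniteBiInClosure<Object, Object>() {
>                 @Override public void apply(Object key, Object val) {
>                     // Stream only the keys whose primary node belongs to this
>                     // group; the other groups' streamers handle the rest.
>                     if (group.nodes().contains(aff.mapKeyToNode(key)))
>                         streamer.addData(key, val);
>                 }
>             });
>         }
>     }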
>
> In this case the overhead grows with the number of groups (the level of
> parallelism), not with the total number of nodes in the cluster.
>
> 2016-11-11 15:37 GMT+03:00 Alexey Kuznetsov <akuznet...@apache.org>:
>
> > Alexandr,
> >
> > Could you describe your proposal in more detail?
> > Especially in the case of several nodes.
> >
> > On Fri, Nov 11, 2016 at 6:34 PM, Alexandr Kuramshin <
> ein.nsk...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > You know the CacheStore API that is commonly used for a
> > > read/write-through relationship between the in-memory data and the
> > > persistence storage.
> > >
> > > There is also the IgniteCache.loadCache method for hot-loading the
> > > cache on startup. Invocation of this method causes execution of
> > > CacheStore.loadCache on all nodes storing the cache partitions. Because
> > > no keys are passed to the CacheStore.loadCache method, the underlying
> > > implementation is forced to read all the data from the persistence
> > > storage, but only part of the data will be stored on each node.
> > >
> > > So, the current implementation has two general drawbacks:
> > >
> > > 1. The persistence storage is forced to perform as many identical
> > > queries as there are nodes in the cluster. Each query may involve a lot
> > > of additional computation on the persistence storage server.
> > >
> > > 2. The network is forced to transfer much more data, which is obviously
> > > a big disadvantage on large systems.
> > >
> > > The partition-aware data loading approach, described in
> > > https://apacheignite.readme.io/docs/data-loading#section-partition-aware-data-loading,
> > > is not an option. It requires persisting volatile data that depends on
> > > the affinity function implementation and settings.
> > >
> > > I propose using something like IgniteDataStreamer inside the
> > > IgniteCache.loadCache implementation.
> > >
> > >
> > > --
> > > Thanks,
> > > Alexandr Kuramshin
> > >
> >
> >
> >
> > --
> > Alexey Kuznetsov
> >
>
>
>
> --
> Thanks,
> Alexandr Kuramshin
>



-- 
Alexey Kuznetsov
GridGain Systems
www.gridgain.com
