Hi all! I think we do not need to change the API at all.

public void loadCache(@Nullable IgniteBiPredicate<K, V> p, @Nullable Object... args) throws CacheException;

We can pass any args to loadCache(). So we could create a class IgniteCacheLoadDescriptor { some fields that describe how to load } and modify the POJO store to detect and use such arguments. All we need is to implement this and write good documentation and examples. Thoughts?
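To make this concrete, here is a rough sketch of what such a descriptor could look like. All class, field and method names below are illustrative only, not a proposed final API:

import java.io.Serializable;

/** Illustrative descriptor telling the store how to split the load. */
public class IgniteCacheLoadDescriptor implements Serializable {
    /** Table to load, e.g. "Person". */
    private final String table;

    /** Parametrized query, e.g. "select * from Person where id >= ? and id < ?". */
    private final String sql;

    /** Key range start (inclusive) and end (exclusive). */
    private final long minId;
    private final long maxId;

    /** Number of sub-ranges to split the range into. */
    private final int parts;

    public IgniteCacheLoadDescriptor(String table, String sql, long minId, long maxId, int parts) {
        this.table = table;
        this.sql = sql;
        this.minId = minId;
        this.maxId = maxId;
        this.parts = parts;
    }

    public String table() { return table; }
    public String sql() { return sql; }
    public long minId() { return minId; }
    public long maxId() { return maxId; }
    public int parts() { return parts; }
}

The POJO store would then check whether args[0] instanceof IgniteCacheLoadDescriptor and, if so, run one sub-query per range instead of a single full scan, e.g.:

personCache.loadCache(null,
    new IgniteCacheLoadDescriptor("Person",
        "select * from Person where id >= ? and id < ?", 0, 10_000_000, 10));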
On Tue, Nov 15, 2016 at 5:22 PM, Alexandr Kuramshin <ein.nsk...@gmail.com> wrote:

> Hi Vladimir,
>
> I don't offer any changes in the API. The usage scenario is the same as described in https://apacheignite.readme.io/docs/persistent-store#section-loadcache-
>
> The preload logic invokes IgniteCache.loadCache() with some additional arguments, depending on the CacheStore implementation, and then the loading occurs in the way I've already described.
>
> 2016-11-15 11:26 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>
> > Hi Alex,
> >
> > >>> Let's give the user the reusable code which is convenient, reliable and fast.
> >
> > Convenience - this is why I asked for an example of how the API can look and how users are going to use it.
> >
> > Vladimir.
> >
> > On Tue, Nov 15, 2016 at 11:18 AM, Alexandr Kuramshin <ein.nsk...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I think the discussion is going in the wrong direction. Certainly it's not a big deal to implement some custom user logic to load the data into caches. But the Ignite framework gives the user reusable code built on top of the basic system.
> > >
> > > So the main question is: why let the user load caches in a convenient way only with a totally non-optimal solution?
> > >
> > > We could talk at length about different persistence storage types, but whenever we initiate the loading with IgniteCache.loadCache, the current implementation imposes much overhead on the network.
> > >
> > > Partition-aware data loading may be used in some scenarios to avoid this network overhead, but users are compelled to take additional steps to achieve this optimization: adding a column to tables, adding compound indices that include the added column, writing a piece of repeatable code to load the data into different caches in a fault-tolerant fashion, etc.
> > >
> > > Let's give the user reusable code which is convenient, reliable and fast.
> > >
> > > 2016-11-14 20:56 GMT+03:00 Valentin Kulichenko <valentin.kuliche...@gmail.com>:
> > >
> > > > Hi Aleksandr,
> > > >
> > > > The data streamer is already outlined as one of the possible approaches to loading the data [1]. Basically, you start a designated client node or choose a leader among the server nodes [2] and then use the IgniteDataStreamer API to load the data. With this approach there is no need to have a CacheStore implementation at all. Can you please elaborate on what additional value you are trying to add here?
> > > >
> > > > [1] https://apacheignite.readme.io/docs/data-loading#ignitedatastreamer
> > > > [2] https://apacheignite.readme.io/docs/leader-election
> > > >
> > > > -Val
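For reference, the manual streamer-based loading Val describes might look roughly like the sketch below. The cache name, value class, SQL query, config path and JDBC connection string are all placeholders, assuming a simple Person table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class ManualStreamerLoad {
    public static void main(String[] args) throws SQLException {
        Ignition.setClientMode(true); // designated loader joins as a client node

        try (Ignite ignite = Ignition.start("ignite-config.xml");
             IgniteDataStreamer<Long, Person> streamer = ignite.dataStreamer("personCache");
             Connection conn = DriverManager.getConnection("jdbc:h2:tcp://db-host/persons");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select id, name from Person")) {
            // Stream every row straight into the cluster; the streamer batches
            // entries and routes them to the nodes that own the keys.
            while (rs.next())
                streamer.addData(rs.getLong("id"), new Person(rs.getLong("id"), rs.getString("name")));
        }
    }

    /** Placeholder value class. */
    static class Person implements java.io.Serializable {
        final long id;
        final String name;

        Person(long id, String name) { this.id = id; this.name = name; }
    }
}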
> > > > On Mon, Nov 14, 2016 at 8:23 AM, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I just want to clarify a couple of API details from the original email to make sure that we are making the right assumptions here.
> > > > >
> > > > > *"Because none of the keys are passed to the CacheStore.loadCache methods, the underlying implementation is forced to read all the data from the persistence storage"*
> > > > >
> > > > > According to the javadoc, the loadCache(...) method receives optional arguments from the user. You can pass anything you like, including a list of keys, an SQL where clause, etc.
> > > > >
> > > > > *"The partition-aware data loading approach is not a choice. It requires persistence of volatile data that depends on the affinity function implementation and settings."*
> > > > >
> > > > > This is only partially true. While Ignite allows plugging in custom affinity functions, the affinity function is not something that changes dynamically and should always return the same partition for the same key. So, the partition assignments are not volatile at all. If, in some very rare case, the partition assignment logic needs to change, then you could also update the partition assignments that you may have persisted elsewhere, e.g. in a database.
> > > > >
> > > > > D.
> > > > >
> > > > > On Mon, Nov 14, 2016 at 10:23 AM, Vladimir Ozerov <voze...@gridgain.com> wrote:
> > > > >
> > > > > > Alexandr, Alexey,
> > > > > >
> > > > > > While I agree with you that the current cache loading logic is far from ideal, it would be cool to see API drafts based on your suggestions to get a better understanding of your ideas. How exactly are users going to use your suggestions?
> > > > > >
> > > > > > My main concern is that initial load is not a very trivial task in the general case. Some users have centralized RDBMS systems, some have NoSQL, others work with distributed persistent stores (e.g. HDFS). Sometimes we have Ignite nodes "near" the persistent data, sometimes we don't. Sharding, affinity, co-location, etc. If we try to support all (or many) cases out of the box, we may end up with a very messy and difficult API. So we should carefully balance between simplicity, usability and feature richness here.
> > > > > >
> > > > > > Personally, I think that if a user is not satisfied with the "loadCache()" API, he just writes a simple closure with a streamer and queries and sends it to whatever node he finds convenient. Not a big deal. Only very common cases should be added to the Ignite API.
> > > > > >
> > > > > > Vladimir.
> > > > > >
> > > > > > On Mon, Nov 14, 2016 at 12:43 PM, Alexey Kuznetsov <akuznet...@gridgain.com> wrote:
> > > > > >
> > > > > > > Looks good to me.
> > > > > > >
> > > > > > > But I would suggest considering one more use case:
> > > > > > >
> > > > > > > If the user knows his data, he could split the loading manually. For example: table Person contains 10M rows. The user could provide something like:
> > > > > > >
> > > > > > > cache.loadCache(null,
> > > > > > >     "Person", "select * from Person where id < 1_000_000",
> > > > > > >     "Person", "select * from Person where id >= 1_000_000 and id < 2_000_000",
> > > > > > >     ....
> > > > > > >     "Person", "select * from Person where id >= 9_000_000 and id < 10_000_000"
> > > > > > > );
> > > > > > >
> > > > > > > or maybe it could be some descriptor object like
> > > > > > >
> > > > > > > {
> > > > > > >     sql: "select * from Person where id >= ? and id < ?"
> > > > > > >     range: 0...10_000_000
> > > > > > > }
> > > > > > >
> > > > > > > In this case the provided queries will be sent to as many nodes as there are queries. The data will be loaded in parallel, and for keys that are not local a data streamer should be used (as Alexandr described).
> > > > > > >
> > > > > > > I think it is a good issue for Ignite 2.0.
> > > > > > >
> > > > > > > Vova, Val - what do you think?
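As an aside, a hypothetical helper could expand the { sql, range } descriptor form above into per-range queries before handing them to loadCache(). Nothing here exists in Ignite; the names and signature are made up:

import java.util.ArrayList;
import java.util.List;

public class RangeSplitter {
    /**
     * Expands (table, template, range) into loadCache-style varargs:
     * ["Person", "... id >= 0 and id < 1000000", "Person", ...].
     */
    public static Object[] splitArgs(String table, String sqlTemplate, long min, long max, int parts) {
        List<Object> args = new ArrayList<>();
        long step = (max - min + parts - 1) / parts; // ceiling division covers the whole range

        for (long lo = min; lo < max; lo += step) {
            long hi = Math.min(lo + step, max);

            args.add(table);
            args.add(sqlTemplate.replaceFirst("\\?", Long.toString(lo))
                                .replaceFirst("\\?", Long.toString(hi)));
        }

        return args.toArray();
    }
}

// Usage:
// cache.loadCache(null, RangeSplitter.splitArgs("Person",
//     "select * from Person where id >= ? and id < ?", 0, 10_000_000, 10));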
> > > > > > > On Mon, Nov 14, 2016 at 4:01 PM, Alexandr Kuramshin <ein.nsk...@gmail.com> wrote:
> > > > > > >
> > > > > > > > All right,
> > > > > > > >
> > > > > > > > Let's assume a simple scenario. When IgniteCache.loadCache is invoked, we check whether the cache is not local, and if so, we initiate the new loading logic.
> > > > > > > >
> > > > > > > > First, we take a "streamer" node. This could be done by utilizing LoadBalancingSpi, or it may be configured statically, for the reason that the streamer node is running on the same host as the persistence storage provider.
> > > > > > > >
> > > > > > > > After that we start the loading task on the streamer node, which creates an IgniteDataStreamer and loads the cache with CacheStore.loadCache. Every call to IgniteBiInClosure.apply simply invokes IgniteDataStreamer.addData.
> > > > > > > >
> > > > > > > > This implementation completely relieves the overhead on the persistence storage provider. Network overhead is also decreased in the case of partitioned caches. For two nodes we get 1-1/2 times the amount of data transferred over the network (1 part will be transferred from the persistence storage to the streamer, and then 1/2 from the streamer node to the other node). For three nodes it will be 1-2/3, and so on, up to two times the amount of data on big clusters.
> > > > > > > >
> > > > > > > > I'd like to propose an additional optimization here. If we have the streamer node on the same machine as the persistence storage provider, then we completely relieve the network overhead as well. It could be a special daemon node for cache loading assigned in the cache configuration, or an ordinary server node as well.
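A minimal sketch of that single-streamer flow, assuming the streamer node has already been chosen and a serializable store factory is available (all names here are illustrative, not an actual Ignite API):

import javax.cache.configuration.Factory;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.cache.store.CacheStore;
import org.apache.ignite.lang.IgniteRunnable;
import org.apache.ignite.resources.IgniteInstanceResource;

/** Runs on the chosen streamer node and fans entries out to their owners. */
class StreamerLoadTask implements IgniteRunnable {
    @IgniteInstanceResource
    private transient Ignite ignite; // injected on the executing node

    private final String cacheName;
    private final Factory<CacheStore<Object, Object>> storeFactory;
    private final Object[] args;

    StreamerLoadTask(String cacheName, Factory<CacheStore<Object, Object>> storeFactory, Object... args) {
        this.cacheName = cacheName;
        this.storeFactory = storeFactory;
        this.args = args;
    }

    @Override public void run() {
        try (IgniteDataStreamer<Object, Object> streamer = ignite.dataStreamer(cacheName)) {
            // Every (k, v) pair the store produces goes straight to the streamer,
            // which batches entries and routes them to their primary/backup nodes.
            storeFactory.create().loadCache(streamer::addData, args);
        }
    }
}

// Submitting the task to the chosen node:
// ignite.compute(ignite.cluster().forNode(streamerNode))
//     .run(new StreamerLoadTask("personCache", storeFactory));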
> > > > > > > > Certainly these calculations assume an evenly partitioned cache with only primary nodes (no backups). In the case of one backup (the most frequent case, I think), we get 2 times the amount of data transferred over the network on two nodes, 2-1/3 on three, 2-1/2 on four, and so on, up to three times the amount of data on big clusters. Hence it's still better than the current implementation. In the worst case, with a fully replicated cache, we get N+1 times the amount of data transferred over the network (where N is the number of nodes in the cluster). But that's not a problem in small clusters, and only a little overhead in big clusters. And we still gain the persistence storage provider optimization.
> > > > > > > >
> > > > > > > > Now let's take a more complex scenario. To achieve some level of parallelism, we could split our cluster into several groups. It could be a parameter of the IgniteCache.loadCache method or a cache configuration option. The number of groups could be a fixed value, or it could be calculated dynamically from the maximum number of nodes per group.
> > > > > > > >
> > > > > > > > After splitting the whole cluster into groups, we take a streamer node in each group and submit a task for loading the cache similar to the single-streamer scenario, except that the only keys passed to the IgniteDataStreamer.addData method are those that belong to the cluster group where that streamer node is running.
> > > > > > > >
> > > > > > > > In this case the overhead is proportional to the parallelism, rather than to the total number of nodes in the cluster.
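The per-group filtering step might look like the following sketch; the grouping itself and all names are assumptions on top of the idea described above:

import java.util.Collection;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.cache.affinity.Affinity;
import org.apache.ignite.cache.store.CacheStore;
import org.apache.ignite.cluster.ClusterNode;

class GroupFilteredLoader {
    /** Runs on one group's streamer node; forwards only keys owned by this group. */
    static void load(Ignite ignite, String cacheName, Collection<ClusterNode> groupNodes,
                     CacheStore<Object, Object> store, Object... args) {
        Affinity<Object> aff = ignite.affinity(cacheName);

        try (IgniteDataStreamer<Object, Object> streamer = ignite.dataStreamer(cacheName)) {
            store.loadCache((k, v) -> {
                // Keep only the entries whose primary node is in our group.
                if (groupNodes.contains(aff.mapKeyToNode(k)))
                    streamer.addData(k, v);
            }, args);
        }
    }
}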
> > > > > > > > 2016-11-11 15:37 GMT+03:00 Alexey Kuznetsov <akuznet...@apache.org>:
> > > > > > > >
> > > > > > > > > Alexandr,
> > > > > > > > >
> > > > > > > > > Could you describe your proposal in more detail? Especially the case with several nodes.
> > > > > > > > >
> > > > > > > > > On Fri, Nov 11, 2016 at 6:34 PM, Alexandr Kuramshin <ein.nsk...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > You know the CacheStore API that is commonly used for the read/write-through relationship between the in-memory data and the persistence storage.
> > > > > > > > > >
> > > > > > > > > > There is also the IgniteCache.loadCache method for hot-loading the cache on startup. Invocation of this method causes execution of CacheStore.loadCache on all the nodes storing the cache's partitions. Because none of the keys are passed to the CacheStore.loadCache methods, the underlying implementation is forced to read all the data from the persistence storage, but only part of the data will be stored on each node.
> > > > > > > > > >
> > > > > > > > > > So, the current implementation has two general drawbacks:
> > > > > > > > > >
> > > > > > > > > > 1. The persistence storage is forced to perform as many identical queries as there are nodes in the cluster. Each query may involve much additional computation on the persistence storage server.
> > > > > > > > > >
> > > > > > > > > > 2. The network is forced to transfer much more data, which is obviously a big disadvantage on large systems.
> > > > > > > > > >
> > > > > > > > > > The partition-aware data loading approach, described in https://apacheignite.readme.io/docs/data-loading#section-partition-aware-data-loading , is not a choice. It requires persistence of volatile data that depends on the affinity function implementation and settings.
> > > > > > > > > >
> > > > > > > > > > I propose using something like IgniteDataStreamer inside the IgniteCache.loadCache implementation.
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Thanks,
> > > > > > > > > > Alexandr Kuramshin
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Alexey Kuznetsov
> > > > > > > >
> > > > > > > > --
> > > > > > > > Thanks,
> > > > > > > > Alexandr Kuramshin
> > > > > > >
> > > > > > > --
> > > > > > > Alexey Kuznetsov
> > > > > > > GridGain Systems
> > > > > > > www.gridgain.com
> > >
> > > --
> > > Thanks,
> > > Alexandr Kuramshin
>
> --
> Thanks,
> Alexandr Kuramshin

--
Alexey Kuznetsov