Well, that's clear. However, with localLoadCache the user still has to take
care of fault tolerance if the node that loads the data goes down. What if
we provide an overloaded version of loadCache that accepts the number of
nodes where the closure has to be executed? If that number decreases, the
engine will re-execute the closure on a node that is still alive.
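
To make the idea concrete, a hypothetical overload could look roughly like
this (the nodesCnt parameter and the fail-over behavior are only a sketch of
the proposal, not an existing API):

// Hypothetical API sketch, not part of the current IgniteCache API.
// Executes the store's load closure on at most nodesCnt nodes; if one
// of them dies before finishing, the engine re-executes its part of
// the work on a node that is still alive.
public void loadCache(@Nullable IgniteBiPredicate<K, V> p,
    int nodesCnt,
    @Nullable Object... args) throws CacheException;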

—
Denis 


> On Nov 15, 2016, at 2:06 PM, Valentin Kulichenko 
> <[email protected]> wrote:
> 
> You can use the localLoadCache method for this (it should be overloaded
> as well, of course). Basically, if you provide a closure based on
> IgniteDataStreamer and call localLoadCache on one of the nodes (client
> or server), it's the same approach as described in [1], but with the
> possibility to reuse existing persistence code. Makes sense?
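> 
> A minimal usage sketch, assuming the proposed closure-based overload of
> localLoadCache is added (only localLoadCache(p, args) exists today, and
> Person is just an example value class):
> 
> Ignite ignite = Ignition.ignite();
> IgniteCache<Long, Person> cache = ignite.cache("persons");
> 
> try (IgniteDataStreamer<Long, Person> streamer = ignite.dataStreamer("persons")) {
>     // The store reads the data once on this node; every loaded entry
>     // is routed through the streamer to its primary node.
>     cache.localLoadCache(null, (IgniteBiInClosure<Long, Person>)streamer::addData);
> }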
> 
> [1] https://apacheignite.readme.io/docs/data-loading#ignitedatastreamer
> 
> -Val
> 
> On Tue, Nov 15, 2016 at 1:15 PM, Denis Magda <[email protected]> wrote:
> 
>> How would your proposal resolve the main point Aleksandr is trying to
>> convey, which is excessive network utilization?
>> 
>> As I see it, the loadCache method will still be triggered on every
>> node and, as before, all the nodes will pre-load the whole data set
>> from the database. That was Aleksandr's reasonable concern.
>> 
>> If we come up with a way to call loadCache on a specific node only and
>> implement some fault-tolerance mechanism, then your suggestion should
>> work perfectly fine.
>> 
>> —
>> Denis
>> 
>>> On Nov 15, 2016, at 12:05 PM, Valentin Kulichenko <[email protected]> wrote:
>>> 
>>> It sounds like Aleksandr is basically proposing to support automatic
>>> persistence [1] for loading through the data streamer, and we really
>>> don't have this. However, I think I have a more generic solution in
>>> mind.
>>> 
>>> What if we add one more IgniteCache.loadCache overload like this:
>>> 
>>> loadCache(@Nullable IgniteBiPredicate<K, V> p,
>>>     IgniteBiInClosure<K, V> clo,
>>>     @Nullable Object... args)
>>> 
>>> It's the same as the existing one, but with the key-value closure
>>> provided as a parameter. This closure will be passed to
>>> CacheStore.loadCache along with the arguments and will allow the user
>>> to override the logic that actually saves the loaded entry in the
>>> cache (currently this logic is always provided by the cache itself
>>> and the user can't control it).
>>> 
>>> We can then provide an implementation of this closure that creates a
>>> data streamer and calls addData() within its apply() method.
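>>> 
>>> For illustration, that streamer-backed closure could be sketched like
>>> this (the class name and constructor are hypothetical):
>>> 
>>> import org.apache.ignite.Ignite;
>>> import org.apache.ignite.IgniteDataStreamer;
>>> import org.apache.ignite.lang.IgniteBiInClosure;
>>> 
>>> // Hypothetical sketch of the out-of-the-box closure implementation.
>>> class DataStreamerClosure<K, V> implements IgniteBiInClosure<K, V> {
>>>     private final IgniteDataStreamer<K, V> streamer;
>>> 
>>>     DataStreamerClosure(Ignite ignite, String cacheName) {
>>>         streamer = ignite.dataStreamer(cacheName);
>>>     }
>>> 
>>>     @Override public void apply(K key, V val) {
>>>         // Route the loaded entry through the streamer instead of
>>>         // storing it directly in the local cache.
>>>         streamer.addData(key, val);
>>>     }
>>> }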
>>> 
>>> I see the following advantages:
>>> 
>>>  - Any existing CacheStore implementation can be reused to load
>>>    through the streamer (our JDBC and Cassandra stores, or anything
>>>    else the user has).
>>>  - Loading code is always part of the CacheStore implementation, so
>>>    it's very easy to switch between different ways of loading.
>>>  - The user is not limited to the two approaches we provide out of
>>>    the box; they can always implement a new one.
>>> 
>>> Thoughts?
>>> 
>>> [1] https://apacheignite.readme.io/docs/automatic-persistence
>>> 
>>> -Val
>>> 
>>> On Tue, Nov 15, 2016 at 2:27 AM, Alexey Kuznetsov <[email protected]> wrote:
>>> 
>>>> Hi, All!
>>>> 
>>>> I think we do not need to change the API at all.
>>>> 
>>>> public void loadCache(@Nullable IgniteBiPredicate<K, V> p, @Nullable
>>>> Object... args) throws CacheException;
>>>> 
>>>> We could pass any args to loadCache();
>>>> 
>>>> So we could create a class:
>>>> 
>>>> class IgniteCacheLoadDescriptor {
>>>>     // some fields that describe how to load
>>>> }
>>>> 
>>>> 
>>>> and modify POJO store to detect and use such arguments.
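>>>> 
>>>> For example, the detection inside the store could look roughly like
>>>> this (a hypothetical sketch, not the actual CacheJdbcPojoStore code):
>>>> 
>>>> @Override public void loadCache(IgniteBiInClosure<K, V> clo, Object... args) {
>>>>     for (Object arg : args) {
>>>>         if (arg instanceof IgniteCacheLoadDescriptor) {
>>>>             IgniteCacheLoadDescriptor desc = (IgniteCacheLoadDescriptor)arg;
>>>> 
>>>>             // Run the query described by desc and pass every loaded
>>>>             // key-value pair to the closure.
>>>>         }
>>>>     }
>>>> }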
>>>> 
>>>> 
>>>> All we need is to implement this and write good documentation and
>>>> examples.
>>>> 
>>>> Thoughts?
>>>> 
>>>> On Tue, Nov 15, 2016 at 5:22 PM, Alexandr Kuramshin <[email protected]> wrote:
>>>> 
>>>>> Hi Vladimir,
>>>>> 
>>>>> I don't propose any changes to the API. The usage scenario is the
>>>>> same as described in
>>>>> https://apacheignite.readme.io/docs/persistent-store#section-loadcache-
>>>>> 
>>>>> The cache preload logic invokes IgniteCache.loadCache() with some
>>>>> additional arguments, depending on the CacheStore implementation,
>>>>> and then the loading occurs in the way I've already described.
>>>>> 
>>>>> 
>>>>> 2016-11-15 11:26 GMT+03:00 Vladimir Ozerov <[email protected]>:
>>>>> 
>>>>>> Hi Alex,
>>>>>> 
>>>>>>>>> Let's give the user the reusable code which is convenient,
>>>>>>>>> reliable and fast.
>>>>>> 
>>>>>> Convenience - this is why I asked for an example of how the API can
>>>>>> look and how users are going to use it.
>>>>>> 
>>>>>> Vladimir.
>>>>>> 
>>>>>> On Tue, Nov 15, 2016 at 11:18 AM, Alexandr Kuramshin <[email protected]> wrote:
>>>>>> 
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> I think the discussion is going in the wrong direction. Certainly
>>>>>>> it's not a big deal to implement some custom user logic to load
>>>>>>> the data into caches. But the Ignite framework gives the user
>>>>>>> reusable code built on top of the basic system.
>>>>>>> 
>>>>>>> So the main question is: why do the developers let the user use a
>>>>>>> convenient way to load caches that is a totally non-optimal
>>>>>>> solution?
>>>>>>> 
>>>>>>> We could talk at length about different persistence storage types,
>>>>>>> but whenever we initiate the loading with IgniteCache.loadCache,
>>>>>>> the current implementation imposes a lot of overhead on the
>>>>>>> network.
>>>>>>> 
>>>>>>> Partition-aware data loading may be used in some scenarios to
>>>>>>> avoid this network overhead, but users are compelled to take
>>>>>>> additional steps to achieve this optimization: adding a column to
>>>>>>> the tables, adding compound indices that include the added column,
>>>>>>> writing a piece of repeatable code to load the data into different
>>>>>>> caches in a fault-tolerant fashion, etc.
>>>>>>> 
>>>>>>> Let's give the user reusable code which is convenient, reliable
>>>>>>> and fast.
>>>>>>> 
>>>>>>> 2016-11-14 20:56 GMT+03:00 Valentin Kulichenko <
>>>>>>> [email protected]>:
>>>>>>> 
>>>>>>>> Hi Aleksandr,
>>>>>>>> 
>>>>>>>> The data streamer is already outlined as one of the possible
>>>>>>>> approaches for loading the data [1]. Basically, you start a
>>>>>>>> designated client node or choose a leader among the server nodes
>>>>>>>> [2] and then use the IgniteDataStreamer API to load the data.
>>>>>>>> With this approach there is no need to have a CacheStore
>>>>>>>> implementation at all. Can you please elaborate on what
>>>>>>>> additional value you are trying to add here?
>>>>>>>> 
>>>>>>>> [1] https://apacheignite.readme.io/docs/data-loading#ignitedatastreamer
>>>>>>>> [2] https://apacheignite.readme.io/docs/leader-election
>>>>>>>> 
>>>>>>>> -Val
>>>>>>>> 
>>>>>>>> On Mon, Nov 14, 2016 at 8:23 AM, Dmitriy Setrakyan <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> I just want to clarify a couple of API details from the original
>>>>>>>>> email to make sure that we are making the right assumptions here.
>>>>>>>>> 
>>>>>>>>> *"Because of none keys are passed to the CacheStore.loadCache
>>>>>> methods,
>>>>>>>> the
>>>>>>>>>> underlying implementation is forced to read all the data from
>>>> the
>>>>>>>>>> persistence storage"*
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> According to the javadoc, the loadCache(...) method receives
>>>>>>>>> optional arguments from the user. You can pass anything you
>>>>>>>>> like, including a list of keys, an SQL where clause, etc.
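>>>>>>>>> 
>>>>>>>>> For instance (a sketch assuming a custom CacheStore that
>>>>>>>>> interprets the argument as a WHERE clause; the clause below is
>>>>>>>>> made up):
>>>>>>>>> 
>>>>>>>>> // Each node's store loads only the rows matching the clause.
>>>>>>>>> cache.loadCache(null, "id >= 1_000_000 and id < 2_000_000");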
>>>>>>>>> 
>>>>>>>>> *"The partition-aware data loading approach is not a choice. It
>>>>>>> requires
>>>>>>>>>> persistence of the volatile data depended on affinity function
>>>>>>>>>> implementation and settings."*
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> This is only partially true. While Ignite allows plugging in
>>>>>>>>> custom affinity functions, the affinity function is not
>>>>>>>>> something that changes dynamically, and it should always return
>>>>>>>>> the same partition for the same key. So the partition
>>>>>>>>> assignments are not volatile at all. If, in some very rare case,
>>>>>>>>> the partition assignment logic needs to change, then you could
>>>>>>>>> update the partition assignments that you may have persisted
>>>>>>>>> elsewhere as well, e.g. in the database.
>>>>>>>>> 
>>>>>>>>> D.
>>>>>>>>> 
>>>>>>>>> On Mon, Nov 14, 2016 at 10:23 AM, Vladimir Ozerov <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> Alexandr, Alexey,
>>>>>>>>>> 
>>>>>>>>>> While I agree with you that the current cache loading logic is
>>>>>>>>>> far from ideal, it would be cool to see API drafts based on
>>>>>>>>>> your suggestions to get a better understanding of your ideas.
>>>>>>>>>> How exactly are users going to use your suggestions?
>>>>>>>>>> 
>>>>>>>>>> My main concern is that the initial load is not a trivial task
>>>>>>>>>> in the general case. Some users have centralized RDBMS systems,
>>>>>>>>>> some have NoSQL, others work with distributed persistent stores
>>>>>>>>>> (e.g. HDFS). Sometimes we have Ignite nodes "near" the
>>>>>>>>>> persistent data, sometimes we don't. Sharding, affinity,
>>>>>>>>>> co-location, etc. If we try to support all (or many) cases out
>>>>>>>>>> of the box, we may end up with a very messy and difficult API.
>>>>>>>>>> So we should carefully balance simplicity, usability and
>>>>>>>>>> feature richness here.
>>>>>>>>>> 
>>>>>>>>>> Personally, I think that if a user is not satisfied with the
>>>>>>>>>> "loadCache()" API, he just writes a simple closure, with
>>>>>>>>>> blackjack, streamer and queries, and sends it to whatever node
>>>>>>>>>> he finds convenient. Not a big deal. Only very common cases
>>>>>>>>>> should be added to the Ignite API.
>>>>>>>>>> 
>>>>>>>>>> Vladimir.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Mon, Nov 14, 2016 at 12:43 PM, Alexey Kuznetsov <
>>>>>>>>>> [email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Looks good to me.
>>>>>>>>>>> 
>>>>>>>>>>> But I would suggest considering one more use case:
>>>>>>>>>>> 
>>>>>>>>>>> If the user knows their data, they could manually split the
>>>>>>>>>>> loading. For example, the Person table contains 10M rows. The
>>>>>>>>>>> user could provide something like:
>>>>>>>>>>> cache.loadCache(null,
>>>>>>>>>>>     "Person", "select * from Person where id < 1_000_000",
>>>>>>>>>>>     "Person", "select * from Person where id >= 1_000_000 and id < 2_000_000",
>>>>>>>>>>>     ....
>>>>>>>>>>>     "Person", "select * from Person where id >= 9_000_000 and id < 10_000_000"
>>>>>>>>>>> );
>>>>>>>>>>> 
>>>>>>>>>>> or maybe it could be some descriptor object like
>>>>>>>>>>> 
>>>>>>>>>>> {
>>>>>>>>>>>   sql: "select * from Person where id >= ? and id < ?",
>>>>>>>>>>>   range: 0...10_000_000
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> In this case the provided queries will be sent to as many
>>>>>>>>>>> nodes as there are queries. The data will be loaded in
>>>>>>>>>>> parallel, and for keys that are not local the data streamer
>>>>>>>>>>> should be used (as described in Alexandr's description).
>>>>>>>>>>> 
>>>>>>>>>>> I think it is a good issue for Ignite 2.0.
>>>>>>>>>>> 
>>>>>>>>>>> Vova, Val - what do you think?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Nov 14, 2016 at 4:01 PM, Alexandr Kuramshin <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> All right,
>>>>>>>>>>>> 
>>>>>>>>>>>> Let's assume a simple scenario. When IgniteCache.loadCache is
>>>>>>>>>>>> invoked, we check whether the cache is not local, and if so,
>>>>>>>>>>>> we initiate the new loading logic.
>>>>>>>>>>>> 
>>>>>>>>>>>> First, we pick a "streamer" node. This could be done by
>>>>>>>>>>>> utilizing LoadBalancingSpi, or it may be configured
>>>>>>>>>>>> statically, for instance when the streamer node is running on
>>>>>>>>>>>> the same host as the persistence storage provider.
>>>>>>>>>>>> 
>>>>>>>>>>>> After that we start the loading task on the streamer node,
>>>>>>>>>>>> which creates an IgniteDataStreamer and loads the cache with
>>>>>>>>>>>> CacheStore.loadCache. Every call to IgniteBiInClosure.apply
>>>>>>>>>>>> simply invokes IgniteDataStreamer.addData.
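>>>>>>>>>>>> 
>>>>>>>>>>>> A rough sketch of that loading task (the wiring and the
>>>>>>>>>>>> cacheName and store variables are illustrative only):
>>>>>>>>>>>> 
>>>>>>>>>>>> // Runs on the selected streamer node.
>>>>>>>>>>>> try (IgniteDataStreamer<Object, Object> streamer =
>>>>>>>>>>>>     ignite.dataStreamer(cacheName)) {
>>>>>>>>>>>>     // The store reads the data from the persistence storage
>>>>>>>>>>>>     // exactly once; the streamer routes each loaded entry to
>>>>>>>>>>>>     // its primary (and backup) nodes.
>>>>>>>>>>>>     store.loadCache(streamer::addData);
>>>>>>>>>>>> }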
>>>>>>>>>>>> 
>>>>>>>>>>>> This implementation will completely relieve the overhead on
>>>>>>>>>>>> the persistence storage provider. Network overhead is also
>>>>>>>>>>>> decreased in the case of partitioned caches. For two nodes we
>>>>>>>>>>>> get 1-1/2 times the amount of data transferred over the
>>>>>>>>>>>> network (1 part will be transferred from the persistence
>>>>>>>>>>>> storage to the streamer, and then 1/2 from the streamer node
>>>>>>>>>>>> to the other node). For three nodes it will be 1-2/3, and so
>>>>>>>>>>>> on (1 + (N-1)/N for N nodes), up to two times the amount of
>>>>>>>>>>>> data on big clusters.
>>>>>>>>>>>> 
>>>>>>>>>>>> I'd like to propose an additional optimization at this point.
>>>>>>>>>>>> If we have the streamer node on the same machine as the
>>>>>>>>>>>> persistence storage provider, then we relieve that part of
>>>>>>>>>>>> the network overhead completely as well. It could be some
>>>>>>>>>>>> special daemon node assigned for cache loading in the cache
>>>>>>>>>>>> configuration, or an ordinary server node.
>>>>>>>>>>>> 
>>>>>>>>>>>> Certainly these calculations have been done under the
>>>>>>>>>>>> assumption that we have an evenly partitioned cache with only
>>>>>>>>>>>> primary nodes (without backups). In the case of one backup
>>>>>>>>>>>> (the most frequent case, I think), we get 2 times the amount
>>>>>>>>>>>> of data transferred over the network on two nodes, 2-1/3 on
>>>>>>>>>>>> three, 2-1/2 on four, and so on, up to three times the amount
>>>>>>>>>>>> of data on big clusters. Hence it's still better than the
>>>>>>>>>>>> current implementation. In the worst case, with a fully
>>>>>>>>>>>> replicated cache, we get N+1 times the amount of data
>>>>>>>>>>>> transferred over the network (where N is the number of nodes
>>>>>>>>>>>> in the cluster). But it's not a problem in small clusters,
>>>>>>>>>>>> and only a little overhead in big clusters. And we still gain
>>>>>>>>>>>> the persistence storage provider optimization.
>>>>>>>>>>>> 
>>>>>>>>>>>> Now let's take a more complex scenario. To achieve some level
>>>>>>>>>>>> of parallelism, we could split our cluster into several
>>>>>>>>>>>> groups. It could be a parameter of the IgniteCache.loadCache
>>>>>>>>>>>> method or a cache configuration option. The number of groups
>>>>>>>>>>>> could be a fixed value, or it could be calculated dynamically
>>>>>>>>>>>> from the maximum number of nodes in a group.
>>>>>>>>>>>> 
>>>>>>>>>>>> After splitting the whole cluster into groups, we will pick a
>>>>>>>>>>>> streamer node in each group and submit a cache-loading task
>>>>>>>>>>>> similar to the single-streamer scenario, except that only the
>>>>>>>>>>>> keys that belong to the cluster group where the streamer node
>>>>>>>>>>>> is running will be passed to the IgniteDataStreamer.addData
>>>>>>>>>>>> method (see the sketch below).
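>>>>>>>>>>>> 
>>>>>>>>>>>> A sketch of that filtering (illustrative only; "group" is the
>>>>>>>>>>>> cluster group of this streamer node, and "streamer" and
>>>>>>>>>>>> "store" are as in the previous sketch):
>>>>>>>>>>>> 
>>>>>>>>>>>> Affinity<Object> aff = ignite.affinity(cacheName);
>>>>>>>>>>>> 
>>>>>>>>>>>> store.loadCache((k, v) -> {
>>>>>>>>>>>>     // Stream only entries whose primary node belongs to this
>>>>>>>>>>>>     // streamer's cluster group.
>>>>>>>>>>>>     if (group.nodes().contains(aff.mapKeyToNode(k)))
>>>>>>>>>>>>         streamer.addData(k, v);
>>>>>>>>>>>> });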
>>>>>>>>>>>> 
>>>>>>>>>>>> In this case the overhead grows with the level of parallelism
>>>>>>>>>>>> (the number of groups), not with the total number of nodes in
>>>>>>>>>>>> the cluster.
>>>>>>>>>>>> 
>>>>>>>>>>>> 2016-11-11 15:37 GMT+03:00 Alexey Kuznetsov <[email protected]>:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Alexandr,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Could you describe your proposal in more detail?
>>>>>>>>>>>>> Especially the case with several nodes.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Fri, Nov 11, 2016 at 6:34 PM, Alexandr Kuramshin <[email protected]> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> You know the CacheStore API that is commonly used for the
>>>>>>>>>>>>>> read/write-through relationship between the in-memory data
>>>>>>>>>>>>>> and the persistence storage.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> There is also the IgniteCache.loadCache method for
>>>>>>>>>>>>>> hot-loading the cache on startup. Invocation of this method
>>>>>>>>>>>>>> causes execution of CacheStore.loadCache on all the nodes
>>>>>>>>>>>>>> storing the cache partitions. Because no keys are passed to
>>>>>>>>>>>>>> the CacheStore.loadCache methods, the underlying
>>>>>>>>>>>>>> implementation is forced to read all the data from the
>>>>>>>>>>>>>> persistence storage, but only part of the data will be
>>>>>>>>>>>>>> stored on each node.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So, the current implementation has two general drawbacks:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1. The persistence storage is forced to perform as many
>>>>>>>>>>>>>> identical queries as there are nodes in the cluster. Each
>>>>>>>>>>>>>> query may involve a lot of additional computation on the
>>>>>>>>>>>>>> persistence storage server.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2. The network is forced to transfer much more data, which
>>>>>>>>>>>>>> is obviously a big disadvantage on large systems.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The partition-aware data loading approach, described in
>>>>>>>>>>>>>> https://apacheignite.readme.io/docs/data-loading#section-partition-aware-data-loading
>>>>>>>>>>>>>> , is not a choice. It requires persistence of volatile data
>>>>>>>>>>>>>> that is dependent on the affinity function implementation
>>>>>>>>>>>>>> and settings.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I propose using something like IgniteDataStreamer inside
>>>>>>>>>>>>>> the IgniteCache.loadCache implementation.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Alexandr Kuramshin
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Alexey Kuznetsov
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Alexandr Kuramshin
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Alexey Kuznetsov
>>>>>>>>>>> GridGain Systems
>>>>>>>>>>> www.gridgain.com
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> Alexandr Kuramshin
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Thanks,
>>>>> Alexandr Kuramshin
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Alexey Kuznetsov
>>>> 
>> 
>> 
