Hi Randy,

Currently, indexes are built only based on what is in cache, i.e. contents of the backing store that are not present in cache are not represented in the index in any way, and hence yes, indexing blocks scanning. Moreover, even non-indexed columns in Ignite tables contain only data actually loaded into the cache.
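To illustrate the point with a minimal sketch (the cache name, the Person type and the store wiring below are assumptions for the example, not anything from your setup): SQL and index lookups see only entries already in memory, so the cache has to be warmed, e.g. with loadCache(), before the query runs.

    import java.util.List;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.query.QueryCursor;
    import org.apache.ignite.cache.query.SqlFieldsQuery;
    import org.apache.ignite.cache.query.annotations.QuerySqlField;

    public class CacheOnlySqlExample {
        /** Illustrative value type; queryable fields are declared with @QuerySqlField. */
        public static class Person {
            @QuerySqlField
            public String name;

            @QuerySqlField(index = true)
            public int age;
        }

        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                // "personCache" is a hypothetical cache assumed to be defined in the
                // node configuration with a CacheStore and Person as its query type.
                IgniteCache<Long, Person> cache = ignite.cache("personCache");

                // SQL runs only over entries currently in memory, so warm the cache
                // from the backing store first; a null predicate loads everything the
                // store's loadCache() implementation returns.
                cache.loadCache(null);

                // This query sees whatever loadCache() brought in; rows that exist
                // only in the backing store are invisible to it.
                SqlFieldsQuery qry =
                    new SqlFieldsQuery("SELECT name FROM Person WHERE age > ?").setArgs(30);

                try (QueryCursor<List<?>> cur = cache.query(qry)) {
                    for (List<?> row : cur)
                        System.out.println(row.get(0));
                }
            }
        }
    }
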
Significant changes in this area should be expected with the arrival of Ignite 2.0, but that is not going to happen until some time in 2017.

Regards,
Alex

2016-12-17 2:16 GMT+03:00 Randy Harmon <rjharmon0...@gmail.com>:
> Hi all,
>
> I'm seeking a fuller understanding of how Apache Ignite manages datasets, both for indexes and for the underlying data.
>
> In particular, I'm looking at what practical constraints exist for overall data size (beyond the obvious "how much memory do you have?"), and functional characteristics when working near the constraint boundaries.
>
> My assumptions (corrections welcome) include:
>
> The underlying objects (Value part of the cache) do not need to be in-memory on any cache nodes (performance is naturally affected if they were evicted from the cache) to execute an indexed query.
>
> The indexed keys need to be in-memory for all indexed lookups. If the referenced Value is not in-memory, it will be loaded by a call to the backing store when that value is needed: load(key).
>
> Indexed keys do not need to be in-memory for any table-scan queries to work, but loadCache() (?) is called to bring these data into memory. This may result in eviction of other values. Once the queries on these data are complete, the keys (at least) will tend to remain in-memory (how to forcibly remove?)
>
> In this latter case, can large datasets be queried, with earlier records in the dataset progressively evicted to make room for later records in the dataset (e.g. SUM(x) GROUP BY y)?
>
> A sample use case might include a set of metadata objects (megabytes to gigabytes, in various Ignite caches) and a much larger set of operational metrics with fine-grained slicing, or even fully-granular facts (GB/TB/PB). In this use case, the metadata might well have "hot" subsets that (we hope) are not evicted by an LFU cache, as well as some less-frequently-used data; meanwhile, the operational metrics may also have tiers, even to the extent where the least frequently used metrics should be evicted after a rather short idle time, recovering both Value memory and Key memory.
>
> In this case ^, can "small" data and "big" data co-exist within an Ignite cluster, and are there any particular techniques needed to assure operational performance, particularly for keeping hot data hot, when total data size exceeds total available memory?
>
> a) Can "indexed" queries be executed across datasets that need to be loaded with loadCache(), or would they execute as table scans?
>
> b) Would such a query run incrementally with progressive eviction of data, in the case of big data?
>
> I guess I'm unclear on the sequence of data-loading vs data-scanning - are they parallel operations, or would we expect the data-loading phase to block the data-scanning phase?
>
> Hopefully these questions and sample scenario are clear enough to get experienced perspective & input from y'all... thanks in advance.
>
> R
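P.S. To make the load(key) vs. loadCache() distinction asked about above concrete, here is a minimal read-through sketch. The PersonStore class, the "personCache" name and the hard-coded rows are purely illustrative, not Ignite built-ins; load(key) is the per-key read-through path on a cache miss, while loadCache() is the bulk warm-up path.

    import javax.cache.Cache;
    import javax.cache.configuration.FactoryBuilder;

    import org.apache.ignite.cache.store.CacheStoreAdapter;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.lang.IgniteBiInClosure;

    public class ReadThroughSketch {
        /** Hypothetical store standing in for a real database with two rows. */
        public static class PersonStore extends CacheStoreAdapter<Long, String> {
            /** Called on a cache miss when read-through is enabled: the load(key) path. */
            @Override public String load(Long key) {
                return "person-" + key; // stand-in for SELECT ... WHERE id = ?
            }

            /** Called by IgniteCache.loadCache() to bulk-load data into memory. */
            @Override public void loadCache(IgniteBiInClosure<Long, String> clo, Object... args) {
                for (long id = 1; id <= 2; id++)
                    clo.apply(id, "person-" + id);
            }

            @Override public void write(Cache.Entry<? extends Long, ? extends String> e) {
                // no-op in this sketch (write-through not shown)
            }

            @Override public void delete(Object key) {
                // no-op in this sketch
            }
        }

        public static CacheConfiguration<Long, String> cacheConfig() {
            CacheConfiguration<Long, String> cfg = new CacheConfiguration<>("personCache");

            // Read-through: a cache.get(key) miss falls through to PersonStore.load(key).
            cfg.setCacheStoreFactory(FactoryBuilder.factoryOf(PersonStore.class));
            cfg.setReadThrough(true);

            return cfg;
        }
    }

Note that, per the answer above, neither path changes what SQL sees: only entries that have actually landed in the cache are visible to queries and indexes.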