Re: Indexing, operational performance and data eviction

Randy Harmon Mon, 19 Dec 2016 10:51:06 -0800

Thanks, Alex.  I understand that any SQL query relying on indexed data
would need to block until the data is loaded, else it could miss index
entries for important rows.


Do I understand correctly that all Ignite SQL queries must use some index?
In the context of a classic example scenario (
https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/datagrid/CacheQueryExample.java):
if a SQL query for Person.salary > 1000 can use a predefined index on the
salary column, we can expect it to seek to 1000 in that index, then scan
through the remainder of the index (all this within each distributed
partition).  Without an index on salary, would that same SQL execute by
scanning through the index on Long Person.id to find candidate rows
(evaluating the salary > 1000 expression on each candidate row)?
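
To make the two access paths I'm asking about concrete, here is a toy sketch
in plain Java.  It is not Ignite code and makes no claim about how Ignite's
SQL engine actually plans the query: the class SalaryQuerySketch and all of
its fields and methods are hypothetical, and a TreeMap merely stands in for a
sorted index within a single partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only (hypothetical names, not Ignite APIs): one partition's rows
// with a primary "index" on id and a secondary index on salary.
class SalaryQuerySketch {
    // id -> salary, sorted by id; stands in for the rows of one partition.
    private final TreeMap<Long, Double> rowsById = new TreeMap<>();
    // salary -> ids with that salary, sorted by salary; the secondary index.
    private final TreeMap<Double, List<Long>> idsBySalary = new TreeMap<>();

    void put(long id, double salary) {
        Double old = rowsById.put(id, salary);
        if (old != null) {
            // Drop the stale index entry when a row is overwritten.
            idsBySalary.get(old).remove(Long.valueOf(id));
        }
        idsBySalary.computeIfAbsent(salary, s -> new ArrayList<>()).add(id);
    }

    // "salary > min" via the salary index: seek past min, then walk the tail.
    SortedSet<Long> viaSalaryIndex(double min) {
        SortedSet<Long> ids = new TreeSet<>();
        for (List<Long> bucket : idsBySalary.tailMap(min, false).values()) {
            ids.addAll(bucket);
        }
        return ids;
    }

    // Same predicate without the salary index: visit every row in id order
    // and evaluate the expression on each candidate row.
    SortedSet<Long> viaFullScan(double min) {
        SortedSet<Long> ids = new TreeSet<>();
        for (Map.Entry<Long, Double> row : rowsById.entrySet()) {
            if (row.getValue() > min) {
                ids.add(row.getKey());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        SalaryQuerySketch partition = new SalaryQuerySketch();
        partition.put(1L, 500.0);
        partition.put(2L, 1500.0);
        partition.put(3L, 1000.0);
        partition.put(4L, 2500.0);
        // Both paths return the same ids; only the amount of work differs.
        System.out.println(partition.viaSalaryIndex(1000.0));
        System.out.println(partition.viaFullScan(1000.0));
    }
}
```

Both methods return the same ids; the difference is that the first touches
only index entries above the threshold, while the second evaluates the
predicate on every row.  My question is whether the second path is what
happens when no salary index exists.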

Is that also true for the ScanQuery case?  IOW: does loadCache() from the
backing store (-> distributed localLoadCache) have to be *completed* before
the ScanQuery's predicate gets any records?  Or will the IgniteBiPredicate
<https://ignite.apache.org/releases/mobile/org/apache/ignite/lang/IgniteBiPredicate.html>
for that ScanQuery get to process some batches of records in one thread, while
a separate data-loading thread continues to load more records?

I found a description along that line in javadocs for
CacheLoadOnlyStoreAdapter, but it's not clear which typical client-facing
use-cases, if any, can take advantage of the batched/parallel-processing
behavior described there.

Thanks again,

R




On Mon, Dec 19, 2016 at 7:10 AM, Alexander Paschenko <
alexander.a.pasche...@gmail.com> wrote:

> Hi Randy,
>
> Currently, indexes are built only from what is in the cache, i.e. the
> contents of the backing store that are not present in the cache are not
> represented in the index in any way, and hence yes, indexing blocks
> scanning. Moreover, even non-indexed columns in Ignite tables contain only
> data actually loaded into the cache.
>
> Significant changes in this area should be expected with the arrival of
> Ignite 2.0, but that is not expected until some time in 2017.
>
> Regards,
> Alex
>
> 2016-12-17 2:16 GMT+03:00 Randy Harmon <rjharmon0...@gmail.com>:
> > Hi all,
> >
> > I'm seeking a fuller understanding of how Apache Ignite manages datasets,
> > both for indexes and for the underlying data.
> >
> > In particular, I'm looking at what practical constraints exist for overall
> > data size (beyond the obvious 'how much memory do you have?'), and
> > functional characteristics when working near the constraint boundaries.
> >
> > My assumptions (corrections welcome) include:
> >
> > The underlying objects (Value part of cache) do not need to be in-memory on
> > any cache nodes (performance is naturally affected if they were evicted from
> > the cache) to execute an indexed query.
> >
> > The indexed keys need to be in-memory for all indexed lookups.  If the
> > referenced Value is not in-memory, it will be loaded by a call to the
> > backing store when that value is needed: load(key)
> >
> > Indexed keys do not need to be in-memory for any table-scan queries to work,
> > but loadCache() (?) is called to bring these data into memory.  This may
> > result in eviction of other values. Once the queries on these data are
> > complete, the keys (at least) will tend to remain in-memory (how to forcibly
> > remove?)
> >
> > In this latter case, can large datasets be queried, with earlier records in
> > the dataset progressively evicted to make room for later records in the
> > dataset (e.g. SUM(x) GROUP BY y)?
> >
> > A sample use case might include a set of metadata objects (megabytes to
> > gigabytes, in various Ignite caches) and a much larger set of operational
> > metrics with fine-grained slicing, or even fully-granular facts (GB/TB/PB).
> > In this use-case, the metadata might well have "hot" subsets that (we hope)
> > are not evicted by an LFU cache, as well as some less-frequently-used data;
> > meanwhile, the operational metrics may also have tiers, even to the extent
> > where the least frequently-used metrics should be evicted after a rather
> > short idle time, recovering both Value memory as well as Key memory.
> >
> > In this case ^, can "small" data and "big" data co-exist within an Ignite
> > cluster, and are there any particular techniques needed to assure
> > operational performance, particularly for keeping hot data hot, when total
> > data-size exceeds total-available-memory?
> >
> > a) Can "indexed" queries be executed across datasets that need to be loaded
> > with loadCache() or would they execute as table-scans?
> >
> > b) Would such a query run incrementally with progressive eviction of data,
> > in the case of big data?
> >
> > I guess I'm unclear on the sequence of data-loading vs data-scanning - are
> > they parallel operations, or would we expect the data-loading phase to block
> > the data-scanning phase?
> >
> > Hopefully these questions and sample scenario are clear enough to get
> > experienced perspective & input from y'all... thanks in advance.
> >
> > R
> >
> >
>
