My apologies for the confusion about index storage in Geode; I thought I had heard this somewhere in the context of GemFire/Geode before. No doubt I confused it with other data stores I work with. (So) much to learn yet.
On Fri, Aug 19, 2016 at 4:16 PM, Michael Stolz <[email protected]> wrote:

> Unfortunately the indexes are not stored. They need to be rebuilt on
> restart. For that reason, on startup, the whole disk store needs to be read.
>
> --
> Mike Stolz
> Principal Engineer, GemFire Product Manager
> Mobile: 631-835-4771
>
> On Fri, Aug 19, 2016 at 5:30 PM, John Blum <[email protected]> wrote:
>
>> *Jason, Mike*: first, thank you.
>>
>> > *In order to target the nodes that would supposedly hold the data of
>> > interest you need to know the keys you are looking for. If you know the
>> > keys why are you querying in the first place? Just do getAll(keys).*
>>
>> Two reasons...
>>
>> 1. I want to apply some "additional filtering" that can only be handled
>> elegantly by an OQL query predicate after a subset of the data has been
>> identified/targeted (using keys). I have an example of this somewhere (doh)
>> after working with a customer on this exact use case.
>>
>> 2. I don't want the entire object (i.e. row); I only need a specific
>> "projection" of the (object) data. This is particularly important if I
>> have a very large and complex object graph and I am streaming data across
>> the wire (client/server).
>>
>> > *The trouble happens when you are NOT hitting your indices.*
>>
>> Yes, good point.
>>
>> > *If you do a query that requires a full table scan, then every row in
>> > the database table needs to be examined, and to examine it, it has to be
>> > in memory at least briefly.*
>>
>> Of course.
>>
>> *Denis*-
>>
>> > *The disk entries that are mentioned by John were located in memory
>> > before and were overflowed to disk at some point in time.
>> > It means that if you start your cluster from scratch and want to run OQL
>> > queries over the indexed data then you have to preload all the data from
>> > persistence.*
>>
>> I don't specifically recall how much persistent data Geode reloads on
>> restart (Geode is a shared-nothing architecture though, so each data node
>> has its own persistence; additionally, primaries must come online before
>> secondaries are accessible). The question is how much data gets reloaded
>> on restart. It would seem silly, if the disk store contained more data than
>> would fit in memory, to reload everything knowing that some of the data
>> would OVERFLOW again during preload because it would not all fit. Geode
>> will reload the Index though, which is stored as well.
>>
>> I'll let the experts answer this one.
>>
>>
>> On Fri, Aug 19, 2016 at 2:04 PM, Denis Magda <[email protected]> wrote:
>>
>>> Hi John, Jason,
>>>
>>> To expand on this:
>>>
>>> *If an index can be used, the index lookup is executed and entries
>>> added to the result set. If any of the entries that match the predicates
>>> are actually on disk, those values will need to be loaded to memory before
>>> being returned as a result.*
>>>
>>> The disk entries that are mentioned by John were located in memory
>>> before and were overflowed to disk at some point in time. It means that if
>>> you start your cluster from scratch and want to run OQL queries over the
>>> indexed data then you have to preload all the data from persistence.
>>> Yes, some of the data may be overflowed back to disk during the preloading
>>> but you'll have your indexes in a valid state.
>>>
>>> Correct me if I'm still missing something.
>>>
>>> --
>>> Denis
>>>
>>>
>>> On Fri, Aug 19, 2016 at 1:52 PM, Jason Huynh <[email protected]> wrote:
>>>
>>>> Hi John,
>>>>
>>>> I think you were referring to Mike's explanation of:
>>>> "If, however, you ever resort to hitting the disk-based data for a
>>>> query it is going to have to read every record that isn't in memory from
>>>> disk which is going to be extremely slow. I personally would never use
>>>> Geode that way."
>>>>
>>>> When stating:
>>>> "Additionally, assuming the Indexes were defined properly based on the
>>>> predicates in the queries (most often) used, that it would target the data
>>>> on disk matching the predicate and load only the data required (no data
>>>> store, RDBMS or otherwise, especially disk-bound stores, should have to
>>>> load the entire table/Region/Map/whatever to access the data matching the
>>>> predicate; that's absurd, OOMEs galore)."
>>>>
>>>> Let me try to clear things up slightly...hopefully not causing more
>>>> confusion...
>>>> If an index can be used, the index lookup is executed and entries
>>>> added to the result set. If any of the entries that match the predicates
>>>> are actually on disk, those values will need to be loaded to memory before
>>>> being returned as a result.
>>>> I think what Mike was saying was that if an index is not used, then the
>>>> query itself would execute across the entire region, which means loading
>>>> every entry into memory. We would need to inspect each entry to see if
>>>> it fulfills the criteria.
>>>>
>>>> -Jason
>>>>
>>>>
>>>> On Fri, Aug 19, 2016 at 1:39 PM John Blum <[email protected]> wrote:
>>>>
>>>>> Hi All-
>>>>>
>>>>> DISCLAIMER: I am no expert in querying and index
>>>>> architecture/implementation; mostly a consumer.
>>>>>
>>>>> Perhaps *Anil* or *Jason* can shed more light on the subject, but for
>>>>> my own understanding/sanity, it would seem we could do better than this,
>>>>> meaning...
>>>>>
>>>>> I would think any use case partially depends on the organization of your
>>>>> data in the grid as well. If you used a PARTITION data management
>>>>> policy [1], for instance, then, of course, your data would be distributed
>>>>> and partitioned across all the data nodes in the grid (cluster) holding
>>>>> the data (i.e. data nodes that have declared the same PARTITION Region).
>>>>> It should then be possible to make this more optimal by having a
>>>>> redundancy level of 1 or more (depending on the frequency of transactions
>>>>> and data changes) to parallelize the data access.
>>>>>
>>>>> Not only does having more nodes mean better (or more optimal)
>>>>> organization, but also more memory. Still, given a very large data set,
>>>>> clearly some of the data will need to OVERFLOW (to disk).
>>>>>
>>>>> But, by combining the Function Execution service with querying (on
>>>>> PARTITIONED data) [2], you could target the nodes that would
>>>>> supposedly hold the data of interest, and execute the queries there.
>>>>>
>>>>> Additionally, assuming the Indexes were defined properly based on the
>>>>> predicates in the queries (most often) used, that it would target the data
>>>>> on disk matching the predicate and load only the data required (no data
>>>>> store, RDBMS or otherwise, especially disk-bound stores, should have to
>>>>> load the entire table/Region/Map/whatever to access the data matching the
>>>>> predicate; that's absurd, OOMEs galore).
>>>>>
>>>>> To my knowledge, Geode keeps Indexes in memory (even loads them on
>>>>> startup) and updates them (either sync/async depending on your
>>>>> configuration) as the data changes. You would assume the data would not
>>>>> be changing in the OVERFLOW, disk-based data set. If the data did change,
>>>>> then wouldn't you also assume that that data would then have to be
>>>>> in memory? (I think so.)
>>>>>
>>>>> Please let me know if I am way off base here, but I would think Geode
>>>>> gives you enough options that particular use cases could be made, with
>>>>> nominal effort, more optimal.
>>>>>
>>>>> Additional references...
>>>>>
>>>>> * Query Partitioned Regions [3]
>>>>> * Working with Indexes [4], and then...
>>>>> * Tips and Guidelines on Using Indexes [5], but also important...
>>>>> * Using Indexes with Overflow Regions [6]
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Cheers!
>>>>> -John
>>>>>
>>>>>
>>>>> [1] http://geode.docs.pivotal.io/docs/developing/region_options/region_types.html
>>>>> [2] http://geode.docs.pivotal.io/docs/developing/querying_basics/performance_considerations.html
>>>>> [3] http://geode.docs.pivotal.io/docs/developing/querying_basics/querying_partitioned_regions.html
>>>>> [4] http://geode.docs.pivotal.io/docs/developing/query_index/query_index.html
>>>>> [5] http://geode.docs.pivotal.io/docs/developing/query_index/indexing_guidelines.html
>>>>> [6] http://geode.docs.pivotal.io/docs/developing/query_index/indexes_with_overflow_regions.html
>>>>>
>>>>>
>>>>> On Fri, Aug 19, 2016 at 12:55 PM, Denis Magda <[email protected]> wrote:
>>>>>
>>>>>> Thanks, now I see.
>>>>>>
>>>>>> This works the same way as in Ignite then. If you set up an eviction
>>>>>> policy in Ignite, the data may be evicted to swap at some point in time,
>>>>>> and if a query is executed right after that, it may swap the data back
>>>>>> into memory. However, the indexes must always be in memory.
>>>>>>
>>>>>> --
>>>>>> Denis
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 19, 2016 at 12:43 PM, Michael Stolz <[email protected]> wrote:
>>>>>>
>>>>>>> There is a notion of data aging out in Geode. We call it overflow to
>>>>>>> disk.
>>>>>>>
>>>>>>> The idea is that as data gets old you can have the records in memory
>>>>>>> expire, and that expiry can be to disk. That's the cold data.
>>>>>>>
>>>>>>> You may have built an index while you were initially loading the
>>>>>>> data, and if your predicates only hit the indexes you will still get
>>>>>>> really fast queries if the result sets aren't large.
>>>>>>>
>>>>>>> If, however, you ever resort to hitting the disk-based data for a
>>>>>>> query it is going to have to read every record that isn't in memory from
>>>>>>> disk which is going to be extremely slow. I personally would never use
>>>>>>> Geode that way.
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Mike Stolz
>>>>>>> Principal Engineer, GemFire Product Manager
>>>>>>> Mobile: 631-835-4771
>>>>>>>
>>>>>>> On Fri, Aug 19, 2016 at 3:35 PM, Denis Magda <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Mike,
>>>>>>>>
>>>>>>>> Thanks a lot for the explanation! It makes perfect sense to me.
>>>>>>>>
>>>>>>>> I just thought that you were able to do something with indexes in
>>>>>>>> such a way that there is no need to preload everything from disk into
>>>>>>>> memory when a query is executed over cold data.
>>>>>>>>
>>>>>>>> Then what does "execution over cold data" mean? I'm referring to
>>>>>>>> the following sentence from the main page:
>>>>>>>>
>>>>>>>> *Object Query Language allows distributed query execution on hot
>>>>>>>> and cold data, with SQL-like capabilities, including joins.*
>>>>>>>>
>>>>>>>> --
>>>>>>>> Denis
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Aug 19, 2016 at 12:27 PM, Michael Stolz <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Here's the thing...
>>>>>>>>>
>>>>>>>>> On any in-memory data grid, if you run a query before the data has
>>>>>>>>> been loaded into memory, it is going to cause the exact same amount
>>>>>>>>> of disk I/O to do the query as it will take to load everything into
>>>>>>>>> memory.
>>>>>>>>>
>>>>>>>>> And the system will still have to go ahead and load everything
>>>>>>>>> into memory anyway so you're going to end up doing all that disk I/O
>>>>>>>>> TWICE.
>>>>>>>>>
>>>>>>>>> Geode DOES have a nice feature for key-based access though. We
>>>>>>>>> actually store the keys in a separate file from the data and we can
>>>>>>>>> load that file very quickly. Then if you go after the data for one of
>>>>>>>>> those keys we can lazily load it from disk on demand if it hasn't yet
>>>>>>>>> been loaded into memory.
>>>>>>>>>
>>>>>>>>> The Lucene integration work that is going on in Geode might also
>>>>>>>>> make it possible to load the indexes first and lazily load the data
>>>>>>>>> based on queries against the indexes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Mike Stolz
>>>>>>>>> Principal Engineer, GemFire Product Manager
>>>>>>>>> Mobile: 631-835-4771
>>>>>>>>>
>>>>>>>>> On Fri, Aug 19, 2016 at 2:59 PM, Denis Magda <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Geode community,
>>>>>>>>>>
>>>>>>>>>> I've been investigating the possibilities of Geode Persistence for a
>>>>>>>>>> while and it is still not clear to me whether I need to have all my
>>>>>>>>>> data in memory if I want to execute OQL queries, or whether the OQL
>>>>>>>>>> engine works over the persistence as well.
>>>>>>>>>>
>>>>>>>>>> My use case is the following. During the cluster startup I don't
>>>>>>>>>> want to wait until all the data has been pre-loaded from the
>>>>>>>>>> persistence to RAM; I want to execute OQL queries right away. Is it
>>>>>>>>>> feasible to implement with Geode? Please provide me with the links
>>>>>>>>>> where I can read more about this.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Denis
>>>>>>>>
>>>>>>>> --
>>>>>>>> Good luck,
>>>>>>>> Denis Magda
>>>>>>
>>>>>> --
>>>>>> Good luck,
>>>>>> Denis Magda
>>>>>
>>>>> --
>>>>> -John
>>>>> 503-504-8657
>>>>> john.blum10101 (skype)
>>>
>>> --
>>> Good luck,
>>> Denis Magda
>>
>> --
>> -John
>> 503-504-8657
>> john.blum10101 (skype)

--
-John
503-504-8657
john.blum10101 (skype)
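
The distinction Jason and Mike draw in the thread (an index lookup loads only the matching overflowed values, while a query that cannot use an index has to bring every value into memory at least briefly) can be sketched with a small toy model. This is not Geode code and calls no Geode API: `ToyRegion`, its methods, the indexed `status` field, the first-in-first-out overflow rule, and the capacity of 2 are all hypothetical choices made only for illustration.

```python
class ToyRegion:
    """Toy model of a region that overflows values to disk (not Geode code)."""

    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.keys = []        # all keys, always available (cf. the separate key file)
        self.memory = {}      # key -> value currently held in memory
        self.disk = {}        # key -> value overflowed to disk
        self.index = {}       # "status" value -> list of keys, kept in memory
        self.disk_reads = 0   # number of values faulted in from disk

    def put(self, key, value):
        # Assumption: each key is inserted once; updates are not modelled.
        self.keys.append(key)
        self.memory[key] = value
        self.index.setdefault(value["status"], []).append(key)
        self._overflow()

    def _overflow(self):
        # Move the earliest-inserted in-memory values to disk while over capacity.
        while len(self.memory) > self.memory_capacity:
            oldest = next(iter(self.memory))
            self.disk[oldest] = self.memory.pop(oldest)

    def get(self, key):
        # Key-based access: fault the value in from disk only when needed.
        if key not in self.memory:
            self.disk_reads += 1
            self.memory[key] = self.disk.pop(key)
        value = self.memory[key]
        self._overflow()
        return value

    def query_with_index(self, status):
        # The index names the matching keys, so only those values are loaded.
        return [self.get(k) for k in self.index.get(status, [])]

    def query_full_scan(self, predicate):
        # Without an index every value must be examined, so every value
        # has to be in memory at least briefly.
        return [v for v in map(self.get, list(self.keys)) if predicate(v)]


def build_region():
    # Six entries, room for two in memory; only entry 0 has status "open".
    region = ToyRegion(memory_capacity=2)
    for i in range(6):
        region.put(i, {"id": i, "status": "open" if i == 0 else "closed"})
    return region


indexed = build_region()
indexed_result = indexed.query_with_index("open")

scanned = build_region()
scanned_result = scanned.query_full_scan(lambda v: v["status"] == "open")

print("index lookup disk reads:", indexed.disk_reads)
print("full scan disk reads:", scanned.disk_reads)
```

Both queries return the same single entry; the difference is in how many values have to be read back from disk to answer them. The model says nothing about restart behaviour, where, per Mike's reply, the indexes themselves have to be rebuilt.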
