Hi Galder, you make many interesting points, but I am not interested in discussing my specific design ideas in detail; I sketched them quickly only as an example of a requirements description. My intervention in this thread is all about asking what the use case for such a synchronous cache store actually is.
It is my understanding that lots of people are currently discussing how best to make a synchronous cache store efficient, but I have yet to see what the requirements are. My use case - although it expects an asynchronous store - is just an example of how I would like to see the general architecture described first, before we waste time on a useless component. I am not claiming that nobody wants a strictly synchronous CacheStore, but I would like to challenge the arguments that made someone (who?) ask for it, as I believe there are many other things that would need to be addressed first. My suggestion to describe the main use cases we expect to support is therefore not off-topic at all: it's the first thing any engineer would have requested, and I would not have our engineers spend another minute coding without a clear description of the expected architecture, expected reliability and expected operations.

Sanne

On 3 July 2013 14:59, Galder Zamarreño <gal...@redhat.com> wrote:
> Sanne, let me comment on some of the points you raised that I didn't comment on earlier...
>
> On Jul 2, 2013, at 2:47 PM, Sanne Grinovero <sa...@infinispan.org> wrote:
>
>> It would be nice to have a deck of "cheat sheets" on the expected use cases and guarantees: to me it looks like everyone is trying to solve a different problem / has a different problem in mind.
>>
>> My own take on it:
>>
>> Scenario #1
>> I'll primarily use Infinispan with DIST, and I don't care much for other options. Reliability is guaranteed via numOwners>1, NOT by persisting to disk: if a node fails, I kill the VM (the machine, not the Java process) and start new ones to compensate. I'm assuming cloud nodes, so it's likely that when a failed node is gone, the disk is gone as well, with all the carefully stored data.
>> I will use Infinispan primarily to absorb write spikes - so a "synch flush" is no good for me - and to boost read performance with as much memory as I can throw at it.
>> The CacheStore is used for two reasons:
>> - overflow (LIRS+passivation) for when the memory is not enough
>> - clean shutdown: you can think of it as a way to be able to upgrade some component in the system (Infinispan or my own); I would expect some kind of "JMX flush" operation to do a clean shutdown without data loss.
>
> ^ Should this really be implemented at the Infinispan level? In the AS/EAP/WildFly case, they make sure that all transactions have finished before shutting down, and Infinispan benefits from that.
>
>> Given such a scenario, I am not interested at all in synchronous storage. Before we commit to a design which basically assumes the need for synchronous storage guarantees, I'd like to understand what kind of use case it's aiming to solve.
>
> Sanne, any **strict** synchronous storage guarantees (e.g. whether to force or not) will be configurable and most likely disabled by default, just as in LevelDB JNI or Karsten's file cache store. A case where someone might want to enable this is when they just have a local cache and want to persist data for recovery. Of course, the whole node and its disk could die…, but this is not so far-fetched IMO.
>
> The whole discussion about **strict** synchronous storage guarantees in this thread is about making sure we're comparing apples with apples. IOW, it doesn't make sense to compare performance when each store has different **strict** synchronous storage guarantee settings.
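To make Scenario #1 concrete, a minimal configuration sketch follows. It assumes the Infinispan 6.x ConfigurationBuilder API that the new persistence SPI is being designed against; exact builder method names may differ between versions, so treat this as illustrative rather than definitive:

    import org.infinispan.configuration.cache.CacheMode;
    import org.infinispan.configuration.cache.Configuration;
    import org.infinispan.configuration.cache.ConfigurationBuilder;
    import org.infinispan.eviction.EvictionStrategy;

    public class Scenario1Config {
        public static Configuration build() {
            return new ConfigurationBuilder()
                // reliability comes from extra owners, not from disk
                .clustering().cacheMode(CacheMode.DIST_SYNC).hash().numOwners(2)
                // keep as much as possible in memory; LIRS decides what overflows
                .eviction().strategy(EvictionStrategy.LIRS).maxEntries(1_000_000)
                // the store is overflow only: entries are passivated on eviction
                .persistence().passivation(true)
                    .addSingleFileStore().location("/var/infinispan/store")
                .build();
        }
    }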
>
>> It would be great to document each such use case and put down a table of the things which can be expected, the features which should not be expected (being very explicit on the limitations), and how basic operations are expected to be performed in the scenario: like, how do you do a rolling upgrade in Scenario #1? How do you do a backup? And of course some configurations & code examples.
>
> ^ Hmmm, these operations are not really specific to the file cache store per se. They are valid points, for sure, but out of scope here IMO.
>
>> Only then would we be able to pick a design (or multiple ones); for my use case the proposal from Karsten seems excellent, so I'm wondering why I should be looking for alternatives, and wondering why everyone is still wasting time on different discussions :-D
>>
>> I'm pretty sure there are people looking forward to a synch-CacheStore too: if you nailed down such a scenario, however, I'm pretty sure that some other considerations would turn out not to have been taken into account (like consistency of data when reactivating a dormant node), so I suspect that just implementing such a component would not actually make any new architecture possible, as you would get blocked by other problems which need to be solved too.. better to define all expectations ASAP!
>>
>> To me this thread smells of needing the off-heap Direct Memory buffers which I suggested [a long time ago] to efficiently offload internal buffers,
>
> ^ Hmmm, the point of having a file-based cache store is to provide data survival beyond shutting down a machine or it crashing (assuming no disk failure). So I can't see how these off-heap memory buffers help here? Unless you've got them mapped to a file or something else?
>
>> but by failing to recognise this we're pushing the responsibility onto an epically complex CacheStore.. guys, let's not forget that a major bottleneck of CacheStores today is the SPI they have to implement; we identified several limitations in that contract in the past which prevent better efficiency. We're working towards a major release now, so I'd rather focus on the API changes which will make it possible to get decent performance even without changing any storage engine..
>
> If you haven't already done so, the place to suggest/comment on this is for sure [1].
>
> [1] https://community.jboss.org/wiki/CacheLoaderAndCacheStoreSPIRedesign
>
>> I'm pretty sure Cassandra (to pick one) doesn't scale too badly.
>
> ^ That requires a separate process and is much more complex to set up. Not really what we're looking for in a simple local cache store that you can use, for example, to passivate EJB3 SFSBs or HTTP sessions.
>
>>
>> Cheers,
>> Sanne
>>
>> On 2 July 2013 10:09, Radim Vansa <rva...@redhat.com> wrote:
>>> Hi,
>>>
>>> I've written down this proposal for the implementation of the new cache store.
>>>
>>> https://community.jboss.org/wiki/BrnoCacheStoreDesignProposal
>>>
>>> WDYT?
>>>
>>> Radim
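As an aside, the off-heap direct-buffer idea Sanne raises above can be sketched with plain JDK NIO: serialized entries live in a direct ByteBuffer outside the Java heap, and only integer offsets stay on-heap. This is purely an illustrative sketch, not an existing or proposed Infinispan API:

    import java.nio.ByteBuffer;

    public class OffHeapSlab {
        // 64 MB of native (off-heap) memory, invisible to the GC'd heap
        private final ByteBuffer slab = ByteBuffer.allocateDirect(64 * 1024 * 1024);

        /** Appends a serialized entry and returns its offset within the slab. */
        public synchronized int append(byte[] payload) {
            int offset = slab.position();
            slab.putInt(payload.length);
            slab.put(payload);
            return offset;
        }

        /** Reads a serialized entry back from the given offset. */
        public synchronized byte[] read(int offset) {
            ByteBuffer view = slab.duplicate();  // independent position, same memory
            view.position(offset);
            byte[] out = new byte[view.getInt()];
            view.get(out);
            return out;
        }
    }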
>>>
>>> ----- Original Message -----
>>> | From: "Radim Vansa" <rva...@redhat.com>
>>> | To: "infinispan -Dev List" <infinispan-dev@lists.jboss.org>
>>> | Sent: Thursday, June 27, 2013 2:37:43 PM
>>> | Subject: Re: [infinispan-dev] Cachestores performance
>>> |
>>> | ----- Original Message -----
>>> | | From: "Galder Zamarreño" <gal...@redhat.com>
>>> | | To: "infinispan -Dev List" <infinispan-dev@lists.jboss.org>
>>> | | Sent: Thursday, June 27, 2013 1:52:11 PM
>>> | | Subject: Re: [infinispan-dev] Cachestores performance
>>> | |
>>> | | > As for Karsten's FCS implementation, I too have issues with the key set and value offsets being solely in memory. However, I think that could be improved by storing only a certain number of keys/offsets in memory, and flushing the rest to disk again into an index file.
>>> | |
>>> | | ^ Karsten's implementation makes this relatively easy to achieve because it already keeps this mapping in a LinkedHashMap (with a given max-entries limit [1]), assuming removeEldestEntry() is overridden to flush older entries to disk. Some extra logic would be needed to bring data back from the disk too… but your suggestion below is also quite interesting...
>>> |
>>> | I certainly wouldn't call this an easy task, because the most problematic part is what we do when the whole entry (both key and value) is gone from memory and we want to read it - that requires keeping some searchable structure on disk. And that's the hard stuff.
>>> |
>>> | | > I believe LevelDB follows a similar design, but I think Karsten's FCS will perform better than LevelDB since it doesn't attempt to maintain a sorted structure on disk.
>>> | |
>>> | | ^ In memory, the structure can optionally be ordered if it's bounded [1], otherwise it's just a normal map. How would we store it at the disk level? A B+ tree with hashes of keys and then linked lists?
>>> |
>>> | Before choosing "I love B#@& trees, let's use B#@& trees!", I'd find out what requirements we have for the structure. I believe that the index itself should not be considered persistent, as it can be rebuilt when preloading the data (sequentially reading the data is fast, therefore we can afford to do this indexing preload); the reason for the index being on disk is that we don't have enough memory to store all the keys, or even the key hashes. Therefore it does not have to be updated synchronously with the writes. It should then be mostly read-optimized, because reads are where we need synchronous access to this structure.
>>> |
>>> | | > One approach to maintaining keys and offsets in memory could be a WeakReference that points to the key stored in the in-memory DataContainer. Once evicted from the DC, the CacheStore impl would need to fetch the key again from the index file before looking up the value in the actual store.
>>> | |
>>> | | ^ Hmmm, interesting idea… it has the potential to save memory by not having to keep that extra data structure in the cache store.
>>> |
>>> | You mean to mix the DataContainer with the xCacheEntry implementation and the cache store implementation? Is that possible from a design perspective?
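The LinkedHashMap mechanism Galder refers to is worth spelling out. Below is a minimal sketch of a bounded key-to-offset index, assuming a hypothetical spillToIndexFile() hook for the on-disk part (which, as Radim notes, is the genuinely hard piece):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class BoundedOffsetIndex<K> extends LinkedHashMap<K, Long> {
        private final int maxEntries;

        public BoundedOffsetIndex(int maxEntries) {
            super(16, 0.75f, true);  // access order: the eldest entry is the least recently used
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, Long> eldest) {
            if (size() > maxEntries) {
                spillToIndexFile(eldest.getKey(), eldest.getValue());
                return true;  // drop from memory; a later miss must consult the index file
            }
            return false;
        }

        private void spillToIndexFile(K key, long offset) {
            // hypothetical: append the (key, offset) pair to a searchable on-disk index
        }
    }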
>>> | Speaking about different kinds of references, we might even optimize not-well-tuned eviction with SoftReferences, so that even if an entry was evicted from the main DataContainer, we'd keep the value referenced from the cache store (and it would not have to be loaded from disk if referenced again before garbage collection). But such a thought may be premature optimization. To have eviction managed in relation to GC we should rather combine this with PhantomReferences, where entries would be written to the cache store upon finalization.
>>> |
>>> | | > This way we have hot items always in memory, semi-hot items with offsets in memory and values on disk, and cold items needing to be read off disk entirely (both offset and value). Also, for write-through and write-behind, as long as the item is hot or warm (key and offset in memory), writing will be pretty fast.
>>> | |
>>> | | My worry about Karsten's impl is actually writing. If you look at the last performance numbers in [2], where we see the performance difference between force=true and force=false in Karsten's cache store compared with LevelDB JNI, you see that force=false is fastest, then JNI LevelDB, and then force=true. Makes me wonder what kind of write guarantees LevelDB JNI provides (and the Java version)...
>>> |
>>> | Just for clarification: the fast implementation is without any force at all; the slower one is with force(false). force(true) additionally forces metadata updates (such as access times?), which is not required for a cache store.
>>> | But the numbers suggest that random access with syncing is really not a good option, and that we should rather use a temporary append-only log, which would be persisted into the structured DB by a different thread (as LevelDB does, I suppose).
>>> |
>>> | Thinking about all the levels and cache structures optimizing read access, I can see four levels of search structures: key + value (the usual DataContainer), key + offset, hash + offset, and everything on disk. The "hash + offset" level may seem superfluous, but for some use cases with big keys it may be worth sparing a few disk look-ups.
>>> |
>>> | Radim
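For readers unfamiliar with the distinction Radim draws here: in JDK terms, FileChannel.force(false) forces only the file's content to the storage device, while force(true) also forces metadata updates (such as modification times) - the extra cost a cache store rarely needs to pay. A minimal sketch:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class SyncedAppender {
        /** Appends one record and makes it durable before returning. */
        public static void append(String file, byte[] record, boolean alsoMetadata)
                throws IOException {
            try (FileChannel ch = FileChannel.open(Paths.get(file),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                ch.write(ByteBuffer.wrap(record));
                // false: force data only; true: force data and file metadata
                ch.force(alsoMetadata);
            }
        }
    }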
>>> |
>>> | | >
>>> | | > On 27 Jun 2013, at 10:33, Radim Vansa <rva...@redhat.com> wrote:
>>> | | >
>>> | | >> Oops, by the cache store I mean the previously-superfast KarstenFileCacheStore implementation.
>>> | | >>
>>> | | >> ----- Original Message -----
>>> | | >> | From: "Radim Vansa" <rva...@redhat.com>
>>> | | >> | To: "infinispan -Dev List" <infinispan-dev@lists.jboss.org>
>>> | | >> | Sent: Thursday, June 27, 2013 11:30:53 AM
>>> | | >> | Subject: Re: [infinispan-dev] Cachestores performance
>>> | | >> |
>>> | | >> | I have added FileChannel.force(false) flushes after all write operations in the cache store, and the comparison is now updated with these values as well.
>>> | | >> |
>>> | | >> | Radim
>>> | | >> |
>>> | | >> | ----- Original Message -----
>>> | | >> | | From: "Radim Vansa" <rva...@redhat.com>
>>> | | >> | | To: "infinispan -Dev List" <infinispan-dev@lists.jboss.org>
>>> | | >> | | Sent: Thursday, June 27, 2013 8:54:25 AM
>>> | | >> | | Subject: Re: [infinispan-dev] Cachestores performance
>>> | | >> | |
>>> | | >> | | Yep, write-through. LevelDB JAVA used the FileChannelTable implementation (-Dleveldb.mmap), because mmapping is not implemented very well and causes JVM crashes (I believe it's because of calling non-public API via reflection - I've found a post from the Oracle JVM guys discouraging the particular trick it uses). After writing the record to the log, it calls FileChannel.force(true); therefore, it should really be on the disk by that moment. I have not looked into the JNI implementation, but I expect the same.
>>> | | >> | |
>>> | | >> | | By the way, I have updated [1] with numbers from running on more data (2 GB instead of 100 MB). I won't retype them here, so look there. The performance is much lower. I may yet try increasing the JVM heap size and running with a bit more data.
>>> | | >> | |
>>> | | >> | | Radim
>>> | | >> | |
>>> | | >> | | [1] https://community.jboss.org/wiki/FileCacheStoreRedesign
>>> | | >> | |
>>> | | >> | | ----- Original Message -----
>>> | | >> | | | From: "Erik Salter" <an1...@hotmail.com>
>>> | | >> | | | To: "infinispan -Dev List" <infinispan-dev@lists.jboss.org>
>>> | | >> | | | Sent: Wednesday, June 26, 2013 7:40:19 PM
>>> | | >> | | | Subject: Re: [infinispan-dev] Cachestores performance
>>> | | >> | | |
>>> | | >> | | | These were write-through cache stores, right? And with LevelDB, this was through to the database file itself?
>>> | | >> | | |
>>> | | >> | | | Erik
>>> | | >> | | |
>>> | | >> | | | -----Original Message-----
>>> | | >> | | | From: infinispan-dev-boun...@lists.jboss.org [mailto:infinispan-dev-boun...@lists.jboss.org] On Behalf Of Radim Vansa
>>> | | >> | | | Sent: Wednesday, June 26, 2013 11:24 AM
>>> | | >> | | | To: infinispan -Dev List
>>> | | >> | | | Subject: [infinispan-dev] Cachestores performance
>>> | | >> | | |
>>> | | >> | | | Hi all,
>>> | | >> | | |
>>> | | >> | | | As per [1], I've created a comparison of cache store performance in stress tests.
>>> | | >> | | |
>>> | | >> | | | All setups used a local cache, and the benchmark was executed via Radargun (actually a version not yet merged into master [2]). I've used 4 nodes just to get more data - each slave was absolutely independent of the others.
>>> | | >> | | |
>>> | | >> | | | The first test was preloading performance - the cache started and tried to load 1 GB of data from the hard drive. Without a cache store the startup takes about 2-4 seconds; the average numbers for the cache stores are below:
>>> | | >> | | |
>>> | | >> | | | FileCacheStore: 9.8 s
>>> | | >> | | | KarstenFileCacheStore: 14 s
>>> | | >> | | | LevelDB-JAVA impl.: 12.3 s
>>> | | >> | | | LevelDB-JNI impl.: 12.9 s
>>> | | >> | | |
>>> | | >> | | | IMO nothing special; all times seem affordable.
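The write path Radim describes for LevelDB - append a record to a sequential log, force it, and let another thread migrate records into the structured store - is also the shape of his earlier append-only-log suggestion. A hypothetical, JDK-only sketch of that architecture:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class AppendOnlyLog implements AutoCloseable {
        private final FileChannel log;
        private final BlockingQueue<byte[]> pending = new LinkedBlockingQueue<>();

        public AppendOnlyLog(String path) throws IOException {
            log = FileChannel.open(Paths.get(path),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            Thread drainer = new Thread(this::drainToStructuredStore, "log-drainer");
            drainer.setDaemon(true);
            drainer.start();
        }

        /** Synchronous part: one sequential append plus one force() per record. */
        public synchronized void write(byte[] record) throws IOException {
            log.write(ByteBuffer.wrap(record));
            log.force(false);   // the record is durable once this returns
            pending.add(record);
        }

        private void drainToStructuredStore() {
            try {
                while (true) {
                    byte[] record = pending.take();
                    // hypothetical: fold the record into a read-optimized on-disk
                    // structure, after which the log prefix could be truncated
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        @Override
        public void close() throws IOException {
            log.close();
        }
    }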
>>> | | >> | | | We don't benchmark exactly the storing of the data into the cache store, but here FileCacheStore took about 44 minutes, while Karsten's took about 38 seconds, LevelDB-JAVA 4 minutes and LevelDB-JNI 96 seconds. The units are right: it's minutes compared to seconds. But we all know that FileCacheStore is bloody slow.
>>> | | >> | | |
>>> | | >> | | | The second test is a stress test (5 minutes, preceded by a 2-minute warmup) where each of 10 threads works on 10k entries with 1kB values (~100 MB in total). 20 % writes, 80 % reads, as usual. No eviction is configured; therefore the cache store works as persistent storage only for the case of a crash.
>>> | | >> | | |
>>> | | >> | | | FileCacheStore: 3.1M reads/s 112 writes/s // on one node the performance was only 2.96M reads/s 75 writes/s
>>> | | >> | | | KarstenFileCacheStore: 9.2M reads/s 226k writes/s // yikes!
>>> | | >> | | | LevelDB-JAVA impl.: 3.9M reads/s 5100 writes/s
>>> | | >> | | | LevelDB-JNI impl.: 6.6M reads/s 14k writes/s // on one node the performance was 3.9M/8.3k - about half of the others
>>> | | >> | | | Without cache store: 15.5M reads/s 4.4M writes/s
>>> | | >> | | |
>>> | | >> | | | The Karsten implementation pretty much rules here, for two reasons. First of all, it does not flush the data (it calls only RandomAccessFile.write()). The other cheat is that it stores in memory the keys and the offsets of the data values in the database file. Therefore, it's definitely the best choice for this scenario, but it does not allow the cache store to scale, especially in cases where the keys are big and the values small. However, this performance boost is definitely worth checking - I could imagine caching the disk offsets in memory and querying the persistent index only in case of a missing record, with part of the persistent index flushed asynchronously (the index can always be rebuilt during preloading in case of a crash).
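The two "cheats" credited above are easy to see in a sketch: the whole key-to-offset index lives in memory, and writes are plain RandomAccessFile.write() calls with no force() to disk. A minimal, hypothetical reconstruction of that design (not Karsten's actual code):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.Map;

    public class KarstenStyleStore<K> {
        private final RandomAccessFile data;
        private final Map<K, Long> offsets = new HashMap<>();  // in-memory index: cheat #2

        public KarstenStyleStore(String file) throws IOException {
            data = new RandomAccessFile(file, "rw");
        }

        public synchronized void put(K key, byte[] value) throws IOException {
            long offset = data.length();
            data.seek(offset);
            data.writeInt(value.length);
            data.write(value);            // no force(): the OS flushes eventually - cheat #1
            offsets.put(key, offset);
        }

        public synchronized byte[] get(K key) throws IOException {
            Long offset = offsets.get(key);
            if (offset == null) return null;  // with an on-disk index, we would search it here
            data.seek(offset);
            byte[] value = new byte[data.readInt()];
            data.readFully(value);
            return value;
        }
    }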
>>> | | >> | | | The third test covered the scenario with more data to store than fits in memory - the stressors operated on 100k entries (~100 MB of data), but eviction was set to 10k entries (9216 entries ended up in memory after the test had ended).
>>> | | >> | | |
>>> | | >> | | | FileCacheStore: 750 reads/s 285 writes/s // one node had only 524 reads and 213 writes per second
>>> | | >> | | | KarstenFileCacheStore: 458k reads/s 137k writes/s
>>> | | >> | | | LevelDB-JAVA impl.: 21k reads/s 9k writes/s // somewhat varying performance
>>> | | >> | | | LevelDB-JNI impl.: 13k-46k reads/s 6.6k-15.2k writes/s // the performance varied a lot!
>>> | | >> | | |
>>> | | >> | | | 100 MB of data is not much, but it takes so long to push it into FileCacheStore that I won't use more unless we exclude this loser from the comparison :)
>>> | | >> | | |
>>> | | >> | | | Radim
>>> | | >> | | |
>>> | | >> | | | [1] https://community.jboss.org/wiki/FileCacheStoreRedesign
>>> | | >> | | | [2] https://github.com/rvansa/radargun/tree/t_keygen
>>> | | >> | | |
>>> | | >> | | | -----------------------------------------------------------
>>> | | >> | | | Radim Vansa
>>> | | >> | | | Quality Assurance Engineer
>>> | | >> | | | JBoss Datagrid
>>> | | >> | | | tel. +420532294559 ext. 62559
>>> | | >> | | |
>>> | | >> | | | Red Hat Czech, s.r.o.
>>> | | >> | | | Brno, Purkyňova 99/71, PSČ 612 45
>>> | | >> | | | Czech Republic
>>> | | >
>>> | | > --
>>> | | > Manik Surtani
>>> | | > ma...@jboss.org
>>> | | > twitter.com/maniksurtani
>>> | | >
>>> | | > Platform Architect, JBoss Data Grid
>>> | | > http://red.ht/data-grid
>>> | |
>>> | | --
>>> | | Galder Zamarreño
>>> | | gal...@redhat.com
>>> | | twitter.com/galderz
>>> | |
>>> | | Project Lead, Escalante
>>> | | http://escalante.io
>>> | |
>>> | | Engineer, Infinispan
>>> | | http://infinispan.org
>
> --
> Galder Zamarreño
> gal...@redhat.com
> twitter.com/galderz
>
> Project Lead, Escalante
> http://escalante.io
>
> Engineer, Infinispan
> http://infinispan.org

_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev