On 2 Jul 2013, at 19:39, Erik Salter <an1...@hotmail.com> wrote:

> I concur with part of the below, but with a few changes:
>
> - The cache is the primary storage, similar to Sanne's case. (DIST mode)
> - My customers are not interested in extra components to the system, like databases or Cassandra nodes. They wonder why they can't simply use the existing file system on the nodes they have.

+1 for a no-dep cache store impl :-)

> - I'm only going to be using the filesystem to recover in the case of upgrades and catastrophic failures. So during normal operation, flushes to disk cannot impact cluster performance.
> - Most importantly, there needs to be a way, scripted or otherwise, to recover the keys from local storage in a DIST-mode cache. I cannot guarantee anything regarding node ordering, so anything about persisting segment info/previous CHs is out. If that means copying all LevelDB SST files to all nodes and restarting them, that's fine.
>
> At the executive levels of my customer, they don't see (or really care about) the differentiation between data grids and MySQL -- only that one has file-based persistence and the other doesn't.
>
> In production, we've already taken a massive outage where an unbelievable series of coincidences occurred to reveal a JBoss AS bug that ended up deadlocking all threads in the cluster, and we had to restart all nodes. And I'm sure it'll happen again.
>
> Hope this offers some user perspective.
>
> Erik
>
> -----Original Message-----
> From: infinispan-dev-boun...@lists.jboss.org [mailto:infinispan-dev-boun...@lists.jboss.org] On Behalf Of Sanne Grinovero
> Sent: Tuesday, July 02, 2013 8:47 AM
> To: infinispan -Dev List
> Subject: Re: [infinispan-dev] Cachestores performance
>
> It would be nice to have a deck of "cheat sheets" on the expected use cases and guarantees: to me it looks like everyone is trying to solve a different problem / has a different problem in mind.
>
> My own take on it:
>
> Scenario #1
> I'll primarily use Infinispan with DIST, and I don't care much for other options. Reliability is guaranteed via numOwners>1, NOT by persisting to disk: if a node fails, I kill the VM (the machine, not the Java process) and start new ones to compensate. I'm assuming cloud nodes, so it's likely that when a failed node is gone, the disk is gone as well, with all the carefully stored data.
> I will use Infinispan primarily to absorb write spikes - so a "synch flush" is no good for me - and to boost read performance with as much memory as I can throw at it.
> CacheStore is used for two reasons:
> - overflow (LIRS+passivation) for when the memory is not enough
> - clean shutdown: you can think of it as a way to be able to upgrade some component in the system (Infinispan or my own); I would expect some kind of "JMX flush" operation to do a clean shutdown without data loss.
>
> Given such a scenario, I am not interested at all in synchronous storage. Before we commit to a design which basically assumes the need for synchronous storage guarantees, I'd like to understand what kind of use case it's aiming to solve.
>
> It would be great to document each such use case and put down a table of things which can be expected, which features should not be expected (be very explicit on the limitations), and how basic operations are expected to be performed in the scenario: like how do you do a rolling upgrade in Scenario #1? How do you do a backup? And of course some configurations & code examples.
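For concreteness, the overflow half of Scenario #1 (LIRS eviction plus passivation to a file store, DIST with numOwners=2) might be wired up roughly like this - a minimal sketch written against my recollection of the 5.x programmatic API, so the exact builder methods may differ, and the store location is made up:

    import org.infinispan.configuration.cache.CacheMode;
    import org.infinispan.configuration.cache.Configuration;
    import org.infinispan.configuration.cache.ConfigurationBuilder;
    import org.infinispan.eviction.EvictionStrategy;

    // Scenario #1: reliability via numOwners > 1, memory bounded by LIRS
    // eviction, and passivation so the store only ever holds the overflow.
    Configuration cfg = new ConfigurationBuilder()
          .clustering().cacheMode(CacheMode.DIST_SYNC)
             .hash().numOwners(2)
          .eviction().strategy(EvictionStrategy.LIRS).maxEntries(1_000_000)
          .loaders().passivation(true)
             .addFileCacheStore().location("/var/data/ispn") // hypothetical path
          .build();

The "JMX flush" for clean shutdown would then amount to passivating the remaining in-memory entries before stopping the cache manager.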
>
> Only then would we be able to pick a design (or multiple ones); for my use case the proposal from Karsten seems excellent, so I'm wondering why I should be looking for alternatives, and wondering why everyone is still wasting time on different discussions :-D
>
> I'm pretty sure there are people looking forward to a synch CacheStore too: if you could nail down such a scenario, however, I'm pretty sure that some other considerations would not be taken into account (like consistency of data when reactivating a dormant node), so I suspect that just implementing such a component would actually not make any new architecture possible, as you would get blocked by other problems which need to be solved too.. better define all expectations asap!
>
> To me this thread smells of needing the off-heap Direct Memory buffers which I suggested [a long time ago] to efficiently offload internal buffers, but failing to recognise this we're pushing responsibility to an epically complex CacheStore.. guys, let's not forget that a major bottleneck of CacheStores today is the SPI it has to implement; we identified several limitations in the contract in the past which prevent superior efficiency. We're working towards a major release now, so I'd rather focus on the API changes which will make it possible to get decent performance even without changing any storage engine..
> I'm pretty sure Cassandra (to pick one) doesn't scale too badly.
>
> Cheers,
> Sanne
>
> On 2 July 2013 10:09, Radim Vansa <rva...@redhat.com> wrote:
>> Hi,
>>
>> I've written down this proposal for the implementation of the new cache store:
>>
>> https://community.jboss.org/wiki/BrnoCacheStoreDesignProposal
>>
>> WDYT?
>>
>> Radim
>>
>> ----- Original Message -----
>> | From: "Radim Vansa" <rva...@redhat.com>
>> | To: "infinispan -Dev List" <infinispan-dev@lists.jboss.org>
>> | Sent: Thursday, June 27, 2013 2:37:43 PM
>> | Subject: Re: [infinispan-dev] Cachestores performance
>> |
>> | ----- Original Message -----
>> | | From: "Galder Zamarreño" <gal...@redhat.com>
>> | | To: "infinispan -Dev List" <infinispan-dev@lists.jboss.org>
>> | | Sent: Thursday, June 27, 2013 1:52:11 PM
>> | | Subject: Re: [infinispan-dev] Cachestores performance
>> | |
>> | | > As for Karsten's FCS implementation, I too have issues with the key set and value offsets being solely in memory. However, I think that could be improved by storing only a certain number of keys/offsets in memory, and flushing the rest to disk again into an index file.
>> | |
>> | | ^ Karsten's implementation makes this relatively easy to achieve because it already keeps this mapping in a LinkedHashMap (with a given max entries limit [1]), assuming removeEldestEntry() is overridden to flush older entries to disk. Some extra logic would be needed to bring back data from the disk too… but your suggestion below is also quite interesting...
>> |
>> | I certainly wouldn't call this an easy task, because the most problematic part is what we do when the whole entry (both key and value) is gone from memory and we want to read it - that requires keeping some searchable structure on-disk. And that's the hard stuff.
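As a concrete illustration of the removeEldestEntry() idea above: a minimal sketch in which a bounded in-memory map spills the eldest key/offset pair to a hypothetical OffsetIndex - the on-disk structure that, as noted above, is the genuinely hard part:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical on-disk index; its design is exactly the open question here.
    interface OffsetIndex {
        void write(Object key, long offset);
        Long lookup(Object key); // consulted on an in-memory miss
    }

    class BoundedOffsetMap extends LinkedHashMap<Object, Long> {
        private final int maxEntries;
        private final OffsetIndex diskIndex;

        BoundedOffsetMap(int maxEntries, OffsetIndex diskIndex) {
            super(16, 0.75f, true); // access order: the eldest entry is the LRU one
            this.maxEntries = maxEntries;
            this.diskIndex = diskIndex;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Object, Long> eldest) {
            if (size() > maxEntries) {
                diskIndex.write(eldest.getKey(), eldest.getValue()); // spill before dropping
                return true;
            }
            return false;
        }
    }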
>> |
>> | | > I believe LevelDB follows a similar design, but I think Karsten's FCS will perform better than LevelDB, since it doesn't attempt to maintain a sorted structure on disk.
>> | |
>> | | ^ In-memory, the structure can optionally be ordered if it's bound [1], otherwise it's just a normal map. How would we store it at the disk level? A B+ tree with hashes of keys and then linked lists?
>> |
>> | Before choosing "I love B#@& trees, let's use B#@& trees!", I'd find out what requirements we have for the structure. I believe that the index itself should not be considered persistent, as it can be rebuilt when preloading the data (sequentially reading the data is fast, therefore we can afford to do this indexing preload); the reason for the index being on-disk is that we don't have enough memory to store all keys, or even key hashes. Therefore it does not have to be updated synchronously with the writes. It should be mostly read-optimized then, because reads are where we need synchronous access to this structure.
>> |
>> | | > One approach to maintaining keys and offsets in memory could be a WeakReference that points to the key stored in the in-memory DataContainer. Once evicted from the DC, the CacheStore impl would need to fetch the key again from the index file before looking up the value in the actual store.
>> | |
>> | | ^ Hmmm, interesting idea… has the potential to save memory space by not having to keep that extra data structure in the cache store.
>> |
>> | You mean to mix the DataContainer with the xCacheEntry implementation and the cache store implementation? Is that possible from a design perspective?
>> | Speaking about different kinds of references, we may even optimize not-well-tuned eviction with SoftReferences, so that even if the entry was evicted from the main DataContainer, we'd keep the value referenced from the cache-store (and it would not have to be loaded from disk if referenced before garbage collection). But such a thought may be premature optimization.
>> | To have eviction managed in relation to GC, we should rather combine this with PhantomReferences, where entries would be written to the cache upon finalization.
>> |
>> | | > This way we have hot items always in memory, semi-hot items with offsets in memory and values on disk, and cold items needing to be read off disk entirely (both offset and value). Also for write-through and write-behind, as long as the item is hot or warm (key and offset in memory), writing will be pretty fast.
>> | |
>> | | My worry about Karsten's impl is actually writing. If you look at the last performance numbers in [2], where we see the performance difference of force=true and force=false in Karsten's cache store compared with LevelDB JNI, you see that force=false is fastest, then JNI LevelDB, and then force=true. Me wonders what kind of write guarantees LevelDB JNI provides (and the JAVA version)...
>> |
>> | Just for clarification: the fast implementation is without force at all, the slower one is with force(false). Force(true) additionally updates metadata (such as access times?), which is not required for a cache-store.
>> | But the numbers suggest that random access with syncing is really not a good option, and that we should rather use a temporary append-only log, which would be persisted into a structured DB by a different thread (as LevelDB does, I suppose).
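To make the force() distinction concrete, here is a minimal append-only log sketch in plain NIO (nothing Infinispan-specific, and the record format is left out):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    class AppendLog {
        private final FileChannel log;

        AppendLog(String path) throws IOException {
            log = FileChannel.open(Paths.get(path),
                  StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                  StandardOpenOption.APPEND);
        }

        // Appends one record and syncs the data only: force(false) skips the
        // file metadata (timestamps etc.), which a cache store does not care
        // about; force(true) would flush that too, at extra cost.
        long append(byte[] record) throws IOException {
            long offset = log.position();
            log.write(ByteBuffer.wrap(record));
            log.force(false);
            return offset;
        }
    }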
>> |
>> | Thinking about all the levels and cache structures optimizing read access, I can see four levels of search structures: key + value (the usual DataContainer), key + offset, hash + offset, and all on disk. The "hash + offset" level may seem superfluous, but for some use-cases with big keys it may be worth saving a few disk look-ups.
>> |
>> | Radim
>> |
>> | | > On 27 Jun 2013, at 10:33, Radim Vansa <rva...@redhat.com> wrote:
>> | | >
>> | | >> Oops, by the cache store I mean the previously-superfast KarstenFileCacheStore implementation.
>> | | >>
>> | | >> ----- Original Message -----
>> | | >> | From: "Radim Vansa" <rva...@redhat.com>
>> | | >> | To: "infinispan -Dev List" <infinispan-dev@lists.jboss.org>
>> | | >> | Sent: Thursday, June 27, 2013 11:30:53 AM
>> | | >> | Subject: Re: [infinispan-dev] Cachestores performance
>> | | >> |
>> | | >> | I have added FileChannel.force(false) flushes after all write operations in the cache store, and the comparison is now also updated with these values.
>> | | >> |
>> | | >> | Radim
>> | | >> |
>> | | >> | ----- Original Message -----
>> | | >> | | From: "Radim Vansa" <rva...@redhat.com>
>> | | >> | | To: "infinispan -Dev List" <infinispan-dev@lists.jboss.org>
>> | | >> | | Sent: Thursday, June 27, 2013 8:54:25 AM
>> | | >> | | Subject: Re: [infinispan-dev] Cachestores performance
>> | | >> | |
>> | | >> | | Yep, write-through. LevelDB JAVA used the FileChannelTable implementation (-Dleveldb.mmap), because mmapping is not implemented very well and causes JVM crashes (I believe it's because of calling non-public API via reflection - I've found a post from the Oracle JVM guys discouraging the particular trick it uses). After writing the record to the log, it calls FileChannel.force(true); therefore, it should really be on the disk by that moment.
>> | | >> | | I have not looked into the JNI implementation, but I expect the same.
>> | | >> | |
>> | | >> | | By the way, I have updated [1] with numbers from running on more data (2 GB instead of 100 MB). I won't retype them here, so look there. The performance is much lower.
>> | | >> | | I may also try increasing the JVM heap size and a bit more data yet.
>> | | >> | |
>> | | >> | | Radim
>> | | >> | |
>> | | >> | | [1] https://community.jboss.org/wiki/FileCacheStoreRedesign
>> | | >> | |
>> | | >> | | ----- Original Message -----
>> | | >> | | | From: "Erik Salter" <an1...@hotmail.com>
>> | | >> | | | To: "infinispan -Dev List" <infinispan-dev@lists.jboss.org>
>> | | >> | | | Sent: Wednesday, June 26, 2013 7:40:19 PM
>> | | >> | | | Subject: Re: [infinispan-dev] Cachestores performance
>> | | >> | | |
>> | | >> | | | These were write-through cache stores, right? And with LevelDB, this was through to the database file itself?
>> | | >> | | |
>> | | >> | | | Erik
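The four levels Radim lists above could chain into a read path like this sketch; all the field names and helpers here (dataContainer, keyOffsets, hashOffsets, diskIndex, readValueAt(), keyMatchesAt()) are hypothetical:

    // Tiered lookup: cheapest structure first, full on-disk search last.
    byte[] get(Object key) throws IOException {
        byte[] value = dataContainer.get(key);           // 1. key + value in memory
        if (value != null) return value;

        Long offset = keyOffsets.get(key);               // 2. key -> offset in memory
        if (offset != null) return readValueAt(offset);

        offset = hashOffsets.get(key.hashCode());        // 3. hash -> offset; spares a
        if (offset != null && keyMatchesAt(offset, key)) //    disk look-up for big keys
            return readValueAt(offset);

        offset = diskIndex.lookup(key);                  // 4. everything on disk
        return offset == null ? null : readValueAt(offset);
    }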
>> | | >> | | |
>> | | >> | | | -----Original Message-----
>> | | >> | | | From: infinispan-dev-boun...@lists.jboss.org [mailto:infinispan-dev-boun...@lists.jboss.org] On Behalf Of Radim Vansa
>> | | >> | | | Sent: Wednesday, June 26, 2013 11:24 AM
>> | | >> | | | To: infinispan -Dev List
>> | | >> | | | Subject: [infinispan-dev] Cachestores performance
>> | | >> | | |
>> | | >> | | | Hi all,
>> | | >> | | |
>> | | >> | | | according to [1] I've created a comparison of performance in stress-tests.
>> | | >> | | |
>> | | >> | | | All setups used a local cache; the benchmark was executed via RadarGun (actually a version not merged into master yet [2]). I've used 4 nodes just to get more data - each slave was absolutely independent of the others.
>> | | >> | | |
>> | | >> | | | The first test was preloading performance - the cache started and tried to load 1 GB of data from the hard drive. Without a cachestore the startup takes about 2-4 seconds; average numbers for the cachestores are below:
>> | | >> | | |
>> | | >> | | | FileCacheStore:        9.8 s
>> | | >> | | | KarstenFileCacheStore: 14 s
>> | | >> | | | LevelDB-JAVA impl.:    12.3 s
>> | | >> | | | LevelDB-JNI impl.:     12.9 s
>> | | >> | | |
>> | | >> | | | IMO nothing special, all times seem affordable. We don't benchmark storing the data into the cachestore exactly, but here FileCacheStore took about 44 minutes, while Karsten took about 38 seconds, LevelDB-JAVA 4 minutes and LevelDB-JNI 96 seconds. The units are right: it's minutes compared to seconds. But we all know that FileCacheStore is bloody slow.
>> | | >> | | |
>> | | >> | | | The second test is a stress test (5 minutes, preceded by a 2-minute warmup) where each of 10 threads works on 10k entries with 1kB values (~100 MB in total). 20 % writes, 80 % reads, as usual. No eviction is configured, therefore the cache-store works as persistent storage only for the case of a crash.
>> | | >> | | |
>> | | >> | | | FileCacheStore:        3.1M reads/s   112 writes/s   // on one node the performance was only 2.96M reads/s, 75 writes/s
>> | | >> | | | KarstenFileCacheStore: 9.2M reads/s   226k writes/s  // yikes!
>> | | >> | | | LevelDB-JAVA impl.:    3.9M reads/s   5100 writes/s
>> | | >> | | | LevelDB-JNI impl.:     6.6M reads/s   14k writes/s   // on one node the performance was 3.9M/8.3k - about half of the others
>> | | >> | | | Without cache store:   15.5M reads/s  4.4M writes/s
>> | | >> | | |
>> | | >> | | | The Karsten implementation pretty much rules here, for two reasons. First of all, it does not flush the data (it calls only RandomAccessFile.write()). The other cheat is that it stores in memory the keys and the offsets of data values in the database file. It's therefore definitely the best choice for this scenario, but it does not allow the cache-store to scale, especially in cases where the keys are big and the values small. However, this performance boost is definitely worth checking - I could think of caching the disk offsets in memory and querying the persistent index only in case of a missing record, with part of the persistent index flushed asynchronously (the index can always be rebuilt during preloading in case of a crash).
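The two "cheats" just described boil down to something like the following sketch: append with RandomAccessFile.write() and no flush, and keep every key -> offset pair in memory. The class and layout here are made up; the real KarstenFileCacheStore surely differs in its details:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    class OffsetFileStore {
        private final RandomAccessFile file;
        // key -> {offset, length}; every key must fit in memory: the scaling limit
        private final ConcurrentMap<String, long[]> offsets = new ConcurrentHashMap<>();

        OffsetFileStore(String path) throws IOException {
            file = new RandomAccessFile(path, "rw");
        }

        synchronized void put(String key, byte[] value) throws IOException {
            long offset = file.length();
            file.seek(offset);
            file.write(value); // buffered by the OS, no fsync: hence the speed
            offsets.put(key, new long[] { offset, value.length });
        }

        synchronized byte[] get(String key) throws IOException {
            long[] loc = offsets.get(key);
            if (loc == null) return null;
            file.seek(loc[0]);
            byte[] value = new byte[(int) loc[1]];
            file.readFully(value);
            return value;
        }
    }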
>> | | >> | | |
>> | | >> | | | The third test was meant to test the scenario with more data to be stored than fits in memory - the stressors operated on 100k entries (~100 MB of data), but eviction was set to 10k entries (9216 entries ended up in memory after the test ended).
>> | | >> | | |
>> | | >> | | | FileCacheStore:        750 reads/s      285 writes/s         // one node had only 524 reads and 213 writes per second
>> | | >> | | | KarstenFileCacheStore: 458k reads/s     137k writes/s
>> | | >> | | | LevelDB-JAVA impl.:    21k reads/s      9k writes/s          // a bit varying performance
>> | | >> | | | LevelDB-JNI impl.:     13k-46k reads/s  6.6k-15.2k writes/s  // the performance varied a lot!
>> | | >> | | |
>> | | >> | | | 100 MB of data is not much, but it takes so long to push it into FileCacheStore that I won't use more unless we exclude this loser from the comparison :)
>> | | >> | | |
>> | | >> | | | Radim
>> | | >> | | |
>> | | >> | | | [1] https://community.jboss.org/wiki/FileCacheStoreRedesign
>> | | >> | | | [2] https://github.com/rvansa/radargun/tree/t_keygen
>> | | >> | | |
>> | | >> | | | -----------------------------------------------------------
>> | | >> | | | Radim Vansa
>> | | >> | | | Quality Assurance Engineer, JBoss Datagrid
>> | | >> | | | tel. +420532294559 ext. 62559
>> | | >> | | |
>> | | >> | | | Red Hat Czech, s.r.o.
>> | | >> | | | Brno, Purkyňova 99/71, PSČ 612 45, Czech Republic
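Stepping back, the rebuild-during-preload idea that recurs in this thread (the reason the index never has to be written synchronously with the data) could look like this sketch, assuming a hypothetical [keyLen][key][valueLen][value] record layout:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;

    // Sequentially scan the data file at startup and repopulate the in-memory
    // key -> {offset, length} index; sequential reads are fast, so this is an
    // affordable substitute for a durable index.
    static void rebuildIndex(RandomAccessFile file, Map<String, long[]> offsets)
            throws IOException {
        file.seek(0);
        long end = file.length();
        while (file.getFilePointer() < end) {
            int keyLen = file.readInt();
            byte[] keyBytes = new byte[keyLen];
            file.readFully(keyBytes);
            int valueLen = file.readInt();
            long valueOffset = file.getFilePointer();
            file.seek(valueOffset + valueLen); // skip the value itself
            offsets.put(new String(keyBytes, StandardCharsets.UTF_8),
                        new long[] { valueOffset, valueLen });
        }
    }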
>> | | >
>> | | > --
>> | | > Manik Surtani
>> | | > ma...@jboss.org
>> | | > twitter.com/maniksurtani
>> | | >
>> | | > Platform Architect, JBoss Data Grid
>> | | > http://red.ht/data-grid
>> | |
>> | | --
>> | | Galder Zamarreño
>> | | gal...@redhat.com
>> | | twitter.com/galderz
>> | |
>> | | Project Lead, Escalante
>> | | http://escalante.io
>> | |
>> | | Engineer, Infinispan
>> | | http://infinispan.org
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)

_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev