On 2 Jul 2013, at 19:39, Erik Salter <an1...@hotmail.com> wrote:

> I concur with part of the below, but with a few changes:
>
> - The cache is the primary storage, similar to Sanne's case. (DIST mode)
> - My customers are not interested in extra components to the system, like databases or Cassandra nodes. They wonder why they can't simply use the existing file system on the nodes they have.

+1 for a no-dep cache store impl :-)

> - I'm only going to be using the filesystem to recover in the case of upgrades and catastrophic failures. So during normal operation, flushes to disk cannot impact cluster performance.
> - Most importantly, there needs to be a way, scripted or otherwise, to recover the keys from local storage in a DIST-mode cache. I cannot guarantee anything regarding node ordering, so anything about persisting segment info/previous CHs is out. If that means copying all LevelDB SST files to all nodes and restarting them, that's fine.
>
> At the executive levels of my customer, they don't see (or really care about) the differentiation between data grids and MySQL -- only that one has file-based persistence and the other doesn't.
>
> In production, we've already taken a massive outage where an unbelievable series of coincidences occurred to reveal a JBoss AS bug that ended up deadlocking all threads in the cluster, and we had to restart all nodes. And I'm sure it'll happen again.
>
> Hope this offers some user perspective.
>
> Erik
>
> -----Original Message-----
> From: infinispan-dev-boun...@lists.jboss.org [mailto:infinispan-dev-boun...@lists.jboss.org] On Behalf Of Sanne Grinovero
> Sent: Tuesday, July 02, 2013 8:47 AM
> To: infinispan -Dev List
> Subject: Re: [infinispan-dev] Cachestores performance
>
> It would be nice to have a deck of "cheat sheets" on the expected use cases and guarantees: to me it looks like everyone is trying to solve a different problem / has a different problem in mind.
>
> My own take on it:
>
> Scenario #1
> I'll primarily use Infinispan with DIST, and I don't care much for other options. Reliability is guaranteed via numOwners>1, NOT by persisting to disk: if a node fails, I kill the VM (the machine, not the Java process) and start new ones to compensate. I'm assuming cloud nodes, so it's likely that when a failed node is gone, the disk is gone as well, with all the carefully stored data.
> I will use Infinispan primarily to absorb write spikes - so a "synch flush" is no good for me - and to boost read performance with as much memory as I can throw at it.
> CacheStore is used for two reasons:
> - overflow (LIRS+passivation) for when the memory is not enough
> - clean shutdown: you can think of it as a way to be able to upgrade some component in the system (Infinispan or my own); I would expect some kind of "JMX flush" operation to do a clean shutdown without data loss.
>
> Given such a scenario, I am not interested at all in synchronous storage. Before we commit to a design which basically assumes the need for synchronous storage guarantees, I'd like to understand what kind of use case it's aiming to solve.
>
> It would be great to document each such use case and put down a table of things which can be expected, which features should not be expected (be very explicit on the limitations), and how basic operations are expected to be performed in the scenario: like how do you do a rolling upgrade in Scenario #1? How do you do a backup? And of course some configurations & code examples.
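For concreteness, the overflow half of Scenario #1 (LIRS eviction plus passivation to a file store, DIST with numOwners=2) might be wired up roughly like this - a minimal sketch written against my recollection of the 5.x programmatic API, so the exact builder methods may differ, and the store location is made up:

    import org.infinispan.configuration.cache.CacheMode;
    import org.infinispan.configuration.cache.Configuration;
    import org.infinispan.configuration.cache.ConfigurationBuilder;
    import org.infinispan.eviction.EvictionStrategy;

    // Scenario #1: reliability via numOwners > 1, memory bounded by LIRS
    // eviction, and passivation so the store only ever holds the overflow.
    Configuration cfg = new ConfigurationBuilder()
          .clustering().cacheMode(CacheMode.DIST_SYNC)
             .hash().numOwners(2)
          .eviction().strategy(EvictionStrategy.LIRS).maxEntries(1_000_000)
          .loaders().passivation(true)
             .addFileCacheStore().location("/var/data/ispn") // hypothetical path
          .build();

The "JMX flush" for clean shutdown would then amount to passivating the remaining in-memory entries before stopping the cache manager.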
>
> Only then would we be able to pick a design (or multiple ones); for my use case the proposal from Karsten seems excellent, so I'm wondering why I should be looking for alternatives, and wondering why everyone is still wasting time on different discussions :-D
>
> I'm pretty sure there are people looking forward to a synch CacheStore too: if you could nail down such a scenario, however, I'm pretty sure that some other considerations would not be taken into account (like consistency of data when reactivating a dormant node), so I suspect that just implementing such a component would actually not make any new architecture possible, as you would get blocked by other problems which need to be solved too.. better define all expectations asap!
>
> To me this thread smells of needing the off-heap Direct Memory buffers which I suggested [a long time ago] to efficiently offload internal buffers, but failing to recognise this we're pushing responsibility to an epically complex CacheStore.. guys, let's not forget that a major bottleneck of CacheStores today is the SPI it has to implement; we identified several limitations in the contract in the past which prevent superior efficiency. We're working towards a major release now, so I'd rather focus on the API changes which will make it possible to get decent performance even without changing any storage engine..
> I'm pretty sure Cassandra (to pick one) doesn't scale too badly.
>
> Cheers,
> Sanne
>
> On 2 July 2013 10:09, Radim Vansa <rva...@redhat.com> wrote:
>> Hi,
>>
>> I've written down this proposal for the implementation of the new cache store:
>>
>> https://community.jboss.org/wiki/BrnoCacheStoreDesignProposal
>>
>> WDYT?
>>
>> Radim
>>
>> ----- Original Message -----
>> | From: "Radim Vansa" <rva...@redhat.com>
>> | To: "infinispan -Dev List" <infinispan-dev@lists.jboss.org>
>> | Sent: Thursday, June 27, 2013 2:37:43 PM
>> | Subject: Re: [infinispan-dev] Cachestores performance
>> |
>> | ----- Original Message -----
>> | | From: "Galder Zamarreño" <gal...@redhat.com>
>> | | To: "infinispan -Dev List" <infinispan-dev@lists.jboss.org>
>> | | Sent: Thursday, June 27, 2013 1:52:11 PM
>> | | Subject: Re: [infinispan-dev] Cachestores performance
>> | |
>> | | > As for Karsten's FCS implementation, I too have issues with the key set and value offsets being solely in memory. However, I think that could be improved by storing only a certain number of keys/offsets in memory, and flushing the rest to disk again into an index file.
>> | |
>> | | ^ Karsten's implementation makes this relatively easy to achieve because it already keeps this mapping in a LinkedHashMap (with a given max entries limit [1]), assuming removeEldestEntry() is overridden to flush older entries to disk. Some extra logic would be needed to bring back data from the disk too… but your suggestion below is also quite interesting...
>> |
>> | I certainly wouldn't call this an easy task, because the most problematic part is what we do when the whole entry (both key and value) is gone from memory and we want to read it - that requires keeping some searchable structure on-disk. And that's the hard stuff.
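As a concrete illustration of the removeEldestEntry() idea above: a minimal sketch in which a bounded in-memory map spills the eldest key/offset pair to a hypothetical OffsetIndex - the on-disk structure that, as noted above, is the genuinely hard part:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical on-disk index; its design is exactly the open question here.
    interface OffsetIndex {
        void write(Object key, long offset);
        Long lookup(Object key); // consulted on an in-memory miss
    }

    class BoundedOffsetMap extends LinkedHashMap<Object, Long> {
        private final int maxEntries;
        private final OffsetIndex diskIndex;

        BoundedOffsetMap(int maxEntries, OffsetIndex diskIndex) {
            super(16, 0.75f, true); // access order: the eldest entry is the LRU one
            this.maxEntries = maxEntries;
            this.diskIndex = diskIndex;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Object, Long> eldest) {
            if (size() > maxEntries) {
                diskIndex.write(eldest.getKey(), eldest.getValue()); // spill before dropping
                return true;
            }
            return false;
        }
    }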
>> |
>> | | > I believe LevelDB follows a similar design, but I think Karsten's FCS will perform better than LevelDB, since it doesn't attempt to maintain a sorted structure on disk.
>> | |
>> | | ^ In-memory, the structure can optionally be ordered if it's bound [1], otherwise it's just a normal map. How would we store it at the disk level? A B+ tree with hashes of keys and then linked lists?
>> |
>> | Before choosing "I love B#@& trees, let's use B#@& trees!", I'd find out what requirements we have for the structure. I believe that the index itself should not be considered persistent, as it can be rebuilt when preloading the data (sequentially reading the data is fast, therefore we can afford to do this indexing preload); the reason for the index being on-disk is that we don't have enough memory to store all keys, or even key hashes. Therefore it does not have to be updated synchronously with the writes. It should be mostly read-optimized then, because reads are where we need synchronous access to this structure.
>> |
>> | | > One approach to maintaining keys and offsets in memory could be a WeakReference that points to the key stored in the in-memory DataContainer. Once evicted from the DC, the CacheStore impl would need to fetch the key again from the index file before looking up the value in the actual store.
>> | |
>> | | ^ Hmmm, interesting idea… has the potential to save memory space by not having to keep that extra data structure in the cache store.
>> |
>> | You mean to mix the DataContainer with the xCacheEntry implementation and the cache store implementation? Is that possible from a design perspective?
>> | Speaking about different kinds of references, we may even optimize not-well-tuned eviction with SoftReferences, so that even if the entry was evicted from the main DataContainer, we'd keep the value referenced from the cache-store (and it would not have to be loaded from disk if referenced before garbage collection). But such a thought may be premature optimization.
>> | To have eviction managed in relation to GC, we should rather combine this with PhantomReferences, where entries would be written to the cache upon finalization.
>> |
>> | | > This way we have hot items always in memory, semi-hot items with offsets in memory and values on disk, and cold items needing to be read off disk entirely (both offset and value). Also for write-through and write-behind, as long as the item is hot or warm (key and offset in memory), writing will be pretty fast.
>> | |
>> | | My worry about Karsten's impl is actually writing. If you look at the last performance numbers in [2], where we see the performance difference of force=true and force=false in Karsten's cache store compared with LevelDB JNI, you see that force=false is fastest, then JNI LevelDB, and then force=true. Me wonders what kind of write guarantees LevelDB JNI provides (and the JAVA version)...
>> |
>> | Just for clarification: the fast implementation is without force at all, the slower one is with force(false). Force(true) additionally updates metadata (such as access times?), which is not required for a cache-store.
>> | But the numbers suggest that random access with syncing is really not a good option, and that we should rather use a temporary append-only log, which would be persisted into a structured DB by a different thread (as LevelDB does, I suppose).
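To make the force() distinction concrete, here is a minimal append-only log sketch in plain NIO (nothing Infinispan-specific, and the record format is left out):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    class AppendLog {
        private final FileChannel log;

        AppendLog(String path) throws IOException {
            log = FileChannel.open(Paths.get(path),
                  StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                  StandardOpenOption.APPEND);
        }

        // Appends one record and syncs the data only: force(false) skips the
        // file metadata (timestamps etc.), which a cache store does not care
        // about; force(true) would flush that too, at extra cost.
        long append(byte[] record) throws IOException {
            long offset = log.position();
            log.write(ByteBuffer.wrap(record));
            log.force(false);
            return offset;
        }
    }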
>> |
>> | Thinking about all the levels and cache structures optimizing read access, I can see four levels of search structures: key + value (the usual DataContainer), key + offset, hash + offset, and all on disk. The "hash + offset" level may seem superfluous, but for some use-cases with big keys it may be worth saving a few disk look-ups.
>> |
>> | Radim
>> |
>> | | > On 27 Jun 2013, at 10:33, Radim Vansa <rva...@redhat.com> wrote:
>> | | >
>> | | >> Oops, by the cache store I mean the previously-superfast KarstenFileCacheStore implementation.
>> | | >>
>> | | >> ----- Original Message -----
>> | | >> | From: "Radim Vansa" <rva...@redhat.com>
>> | | >> | To: "infinispan -Dev List" <infinispan-dev@lists.jboss.org>
>> | | >> | Sent: Thursday, June 27, 2013 11:30:53 AM
>> | | >> | Subject: Re: [infinispan-dev] Cachestores performance
>> | | >> |
>> | | >> | I have added FileChannel.force(false) flushes after all write operations in the cache store, and the comparison is now also updated with these values.
>> | | >> |
>> | | >> | Radim
>> | | >> |
>> | | >> | ----- Original Message -----
>> | | >> | | From: "Radim Vansa" <rva...@redhat.com>
>> | | >> | | To: "infinispan -Dev List" <infinispan-dev@lists.jboss.org>
>> | | >> | | Sent: Thursday, June 27, 2013 8:54:25 AM
>> | | >> | | Subject: Re: [infinispan-dev] Cachestores performance
>> | | >> | |
>> | | >> | | Yep, write-through. LevelDB JAVA used the FileChannelTable implementation (-Dleveldb.mmap), because mmapping is not implemented very well and causes JVM crashes (I believe it's because of calling non-public API via reflection - I've found a post from the Oracle JVM guys discouraging the particular trick it uses). After writing the record to the log, it calls FileChannel.force(true); therefore, it should really be on the disk by that moment.
>> | | >> | | I have not looked into the JNI implementation, but I expect the same.
>> | | >> | |
>> | | >> | | By the way, I have updated [1] with numbers from running on more data (2 GB instead of 100 MB). I won't retype them here, so look there. The performance is much lower.
>> | | >> | | I may also try increasing the JVM heap size and a bit more data yet.
>> | | >> | |
>> | | >> | | Radim
>> | | >> | |
>> | | >> | | [1] https://community.jboss.org/wiki/FileCacheStoreRedesign
>> | | >> | |
>> | | >> | | ----- Original Message -----
>> | | >> | | | From: "Erik Salter" <an1...@hotmail.com>
>> | | >> | | | To: "infinispan -Dev List" <infinispan-dev@lists.jboss.org>
>> | | >> | | | Sent: Wednesday, June 26, 2013 7:40:19 PM
>> | | >> | | | Subject: Re: [infinispan-dev] Cachestores performance
>> | | >> | | |
>> | | >> | | | These were write-through cache stores, right? And with LevelDB, this was through to the database file itself?
>> | | >> | | |
>> | | >> | | | Erik
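The four levels Radim lists above could chain into a read path like this sketch; all the field names and helpers here (dataContainer, keyOffsets, hashOffsets, diskIndex, readValueAt(), keyMatchesAt()) are hypothetical:

    // Tiered lookup: cheapest structure first, full on-disk search last.
    byte[] get(Object key) throws IOException {
        byte[] value = dataContainer.get(key);           // 1. key + value in memory
        if (value != null) return value;

        Long offset = keyOffsets.get(key);               // 2. key -> offset in memory
        if (offset != null) return readValueAt(offset);

        offset = hashOffsets.get(key.hashCode());        // 3. hash -> offset; spares a
        if (offset != null && keyMatchesAt(offset, key)) //    disk look-up for big keys
            return readValueAt(offset);

        offset = diskIndex.lookup(key);                  // 4. everything on disk
        return offset == null ? null : readValueAt(offset);
    }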
>> | | >> | | |
>> | | >> | | | -----Original Message-----
>> | | >> | | | From: infinispan-dev-boun...@lists.jboss.org [mailto:infinispan-dev-boun...@lists.jboss.org] On Behalf Of Radim Vansa
>> | | >> | | | Sent: Wednesday, June 26, 2013 11:24 AM
>> | | >> | | | To: infinispan -Dev List
>> | | >> | | | Subject: [infinispan-dev] Cachestores performance
>> | | >> | | |
>> | | >> | | | Hi all,
>> | | >> | | |
>> | | >> | | | according to [1] I've created a comparison of performance in stress-tests.
>> | | >> | | |
>> | | >> | | | All setups used a local cache; the benchmark was executed via RadarGun (actually a version not merged into master yet [2]). I've used 4 nodes just to get more data - each slave was absolutely independent of the others.
>> | | >> | | |
>> | | >> | | | The first test was preloading performance - the cache started and tried to load 1 GB of data from the hard drive. Without a cachestore the startup takes about 2-4 seconds; average numbers for the cachestores are below:
>> | | >> | | |
>> | | >> | | | FileCacheStore:        9.8 s
>> | | >> | | | KarstenFileCacheStore: 14 s
>> | | >> | | | LevelDB-JAVA impl.:    12.3 s
>> | | >> | | | LevelDB-JNI impl.:     12.9 s
>> | | >> | | |
>> | | >> | | | IMO nothing special, all times seem affordable. We don't benchmark storing the data into the cachestore exactly, but here FileCacheStore took about 44 minutes, while Karsten took about 38 seconds, LevelDB-JAVA 4 minutes and LevelDB-JNI 96 seconds. The units are right: it's minutes compared to seconds. But we all know that FileCacheStore is bloody slow.
>> | | >> | | |
>> | | >> | | | The second test is a stress test (5 minutes, preceded by a 2-minute warmup) where each of 10 threads works on 10k entries with 1kB values (~100 MB in total). 20 % writes, 80 % reads, as usual. No eviction is configured, therefore the cache-store works as persistent storage only for the case of a crash.
>> | | >> | | |
>> | | >> | | | FileCacheStore:        3.1M reads/s   112 writes/s   // on one node the performance was only 2.96M reads/s, 75 writes/s
>> | | >> | | | KarstenFileCacheStore: 9.2M reads/s   226k writes/s  // yikes!
>> | | >> | | | LevelDB-JAVA impl.:    3.9M reads/s   5100 writes/s
>> | | >> | | | LevelDB-JNI impl.:     6.6M reads/s   14k writes/s   // on one node the performance was 3.9M/8.3k - about half of the others
>> | | >> | | | Without cache store:   15.5M reads/s  4.4M writes/s
>> | | >> | | |
>> | | >> | | | The Karsten implementation pretty much rules here, for two reasons. First of all, it does not flush the data (it calls only RandomAccessFile.write()). The other cheat is that it stores in memory the keys and the offsets of data values in the database file. It's therefore definitely the best choice for this scenario, but it does not allow the cache-store to scale, especially in cases where the keys are big and the values small. However, this performance boost is definitely worth checking - I could think of caching the disk offsets in memory and querying the persistent index only in case of a missing record, with part of the persistent index flushed asynchronously (the index can always be rebuilt during preloading in case of a crash).
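The two "cheats" just described boil down to something like the following sketch: append with RandomAccessFile.write() and no flush, and keep every key -> offset pair in memory. The class and layout here are made up; the real KarstenFileCacheStore surely differs in its details:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    class OffsetFileStore {
        private final RandomAccessFile file;
        // key -> {offset, length}; every key must fit in memory: the scaling limit
        private final ConcurrentMap<String, long[]> offsets = new ConcurrentHashMap<>();

        OffsetFileStore(String path) throws IOException {
            file = new RandomAccessFile(path, "rw");
        }

        synchronized void put(String key, byte[] value) throws IOException {
            long offset = file.length();
            file.seek(offset);
            file.write(value); // buffered by the OS, no fsync: hence the speed
            offsets.put(key, new long[] { offset, value.length });
        }

        synchronized byte[] get(String key) throws IOException {
            long[] loc = offsets.get(key);
            if (loc == null) return null;
            file.seek(loc[0]);
            byte[] value = new byte[(int) loc[1]];
            file.readFully(value);
            return value;
        }
    }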
>> | | >> | | |
>> | | >> | | | The third test was meant to test the scenario with more data to be stored than fits in memory - the stressors operated on 100k entries (~100 MB of data), but eviction was set to 10k entries (9216 entries ended up in memory after the test ended).
>> | | >> | | |
>> | | >> | | | FileCacheStore:        750 reads/s      285 writes/s         // one node had only 524 reads and 213 writes per second
>> | | >> | | | KarstenFileCacheStore: 458k reads/s     137k writes/s
>> | | >> | | | LevelDB-JAVA impl.:    21k reads/s      9k writes/s          // a bit varying performance
>> | | >> | | | LevelDB-JNI impl.:     13k-46k reads/s  6.6k-15.2k writes/s  // the performance varied a lot!
>> | | >> | | |
>> | | >> | | | 100 MB of data is not much, but it takes so long to push it into FileCacheStore that I won't use more unless we exclude this loser from the comparison :)
>> | | >> | | |
>> | | >> | | | Radim
>> | | >> | | |
>> | | >> | | | [1] https://community.jboss.org/wiki/FileCacheStoreRedesign
>> | | >> | | | [2] https://github.com/rvansa/radargun/tree/t_keygen
>> | | >> | | |
>> | | >> | | | -----------------------------------------------------------
>> | | >> | | | Radim Vansa
>> | | >> | | | Quality Assurance Engineer, JBoss Datagrid
>> | | >> | | | tel. +420532294559 ext. 62559
>> | | >> | | |
>> | | >> | | | Red Hat Czech, s.r.o.
>> | | >> | | | Brno, Purkyňova 99/71, PSČ 612 45, Czech Republic
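Stepping back, the rebuild-during-preload idea that recurs in this thread (the reason the index never has to be written synchronously with the data) could look like this sketch, assuming a hypothetical [keyLen][key][valueLen][value] record layout:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;

    // Sequentially scan the data file at startup and repopulate the in-memory
    // key -> {offset, length} index; sequential reads are fast, so this is an
    // affordable substitute for a durable index.
    static void rebuildIndex(RandomAccessFile file, Map<String, long[]> offsets)
            throws IOException {
        file.seek(0);
        long end = file.length();
        while (file.getFilePointer() < end) {
            int keyLen = file.readInt();
            byte[] keyBytes = new byte[keyLen];
            file.readFully(keyBytes);
            int valueLen = file.readInt();
            long valueOffset = file.getFilePointer();
            file.seek(valueOffset + valueLen); // skip the value itself
            offsets.put(new String(keyBytes, StandardCharsets.UTF_8),
                        new long[] { valueOffset, valueLen });
        }
    }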
>> | | >
>> | | > --
>> | | > Manik Surtani
>> | | > ma...@jboss.org
>> | | > twitter.com/maniksurtani
>> | | >
>> | | > Platform Architect, JBoss Data Grid
>> | | > http://red.ht/data-grid
>> | |
>> | | --
>> | | Galder Zamarreño
>> | | gal...@redhat.com
>> | | twitter.com/galderz
>> | |
>> | | Project Lead, Escalante
>> | | http://escalante.io
>> | |
>> | | Engineer, Infinispan
>> | | http://infinispan.org
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)

_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev