Agreed, it does seem that explicit factory names are better. I'll use that approach.
Thanks for all the comments!

C
________________________________________
From: Martin Kleppmann (JIRA) [[email protected]]
Sent: Tuesday, July 01, 2014 2:42 AM
To: [email protected]
Subject: [jira] [Commented] (SAMZA-256) Provide in-memory data store implementation

    [ https://issues.apache.org/jira/browse/SAMZA-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048696#comment-14048696 ]

Martin Kleppmann commented on SAMZA-256:
----------------------------------------

I would prefer approach (1), a separate factory for each type of storage engine. I fear that a generic key-value interface that abstracts across multiple storage engines would be a leaky abstraction; users would still have to think about which storage engine is being used under the hood. Some of the subtle differences that may arise:

- LevelDB and RocksDB use a sorted log-structured representation which allows efficient range queries, but a HashMap would not allow range queries.
- Perhaps the in-memory store should use a TreeMap instead, but then it's limited to keys that implement Comparable.
- For the in-memory storage engine, serdes may be optional. For on-disk storage, serdes are required.
- LevelDB has no mechanism for expiry; RocksDB supports pluggable compaction filters which allow expiry to be implemented; Guava collections have lots of cache-replacement and expiry options. We should be able to give the user access to whatever options the underlying storage engine provides.
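
To make the first two differences concrete, here is a minimal Java sketch. It is not Samza code: the class and method names (RangeQuerySketch, rangeByFullScan, rangeBySortedMap) are hypothetical and only the java.util collections are real. It contrasts a range lookup over a HashMap, which has to examine every key, with the same lookup over a TreeMap, which can return a sorted sub-view directly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names exist in Samza.
public class RangeQuerySketch {

    // A hash-based store keeps no key order, so a range lookup must
    // examine every key and sort the matches afterwards.
    static List<String> rangeByFullScan(Map<String, String> store, String from, String to) {
        List<String> keys = new ArrayList<>();
        for (String key : store.keySet()) {
            if (key.compareTo(from) >= 0 && key.compareTo(to) < 0) {
                keys.add(key);
            }
        }
        Collections.sort(keys);
        return keys;
    }

    // A sorted store can hand back the range [from, to) as an ordered view.
    static List<String> rangeBySortedMap(NavigableMap<String, String> store, String from, String to) {
        return new ArrayList<>(store.subMap(from, true, to, false).keySet());
    }

    public static void main(String[] args) {
        Map<String, String> hashStore = new HashMap<>();
        NavigableMap<String, String> treeStore = new TreeMap<>();
        for (String key : new String[] {"d", "a", "c", "b"}) {
            hashStore.put(key, "value-" + key);
            treeStore.put(key, "value-" + key);
        }
        System.out.println(rangeByFullScan(hashStore, "b", "d"));  // prints [b, c]
        System.out.println(rangeBySortedMap(treeStore, "b", "d")); // prints [b, c]
    }
}
```

Both calls return the same keys, but only the TreeMap version avoids touching keys outside the range. The cost is the restriction mentioned above: a TreeMap needs keys that implement Comparable (or an explicit Comparator), whereas a HashMap only needs hashCode and equals.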
I also think we should name the factories after the particular storage engine being used (LevelDBStorageEngineFactory, RocksDBStorageEngineFactory, HashMapStorageEngineFactory, etc) not after their persistence characteristics (PersistentKeyValueStorageEngineFactory, InMemoryKeyValueStorageEngineFactory), because:

# It's misleading: an in-memory storage engine can still be durable if changelog replication is enabled, and an on-disk storage engine can still lose data if you don't have changelog replication enabled. The difference between on-disk and in-memory storage determines whether you can store state larger than memory, not whether the state is durable.
# Leaky abstraction: RocksDB has different features and different performance characteristics from LevelDB, so I don't think it makes sense to abstract over them.
# Explicit is better than implicit: users will need to know what storage engine is being used, so the factory name shouldn't hide it from them.

For compatibility, making KeyValueStorageEngineFactory an alias for LevelDBStorageEngineFactory sounds good to me.

> Provide in-memory data store implementation
> -------------------------------------------
>
>                 Key: SAMZA-256
>                 URL: https://issues.apache.org/jira/browse/SAMZA-256
>             Project: Samza
>          Issue Type: Improvement
>          Components: kv
>    Affects Versions: 0.6.0
>            Reporter: Jakob Homan
>            Assignee: Chinmay Soman
>             Fix For: 0.8.0
>
>
> The sole current kv store, LevelDbKeyValueStore, works well when the amount
> of data to be stored is prohibitively large to keep it all in memory.
> However, in cases where the state is small enough to comfortably fit in
> whatever memory is available, it would be better to provide an in-memory
> implementation. This can be backed by either a native Java class, or perhaps
> a Guava class, if that is found to scale better (or, of course, the backing
> implementation could be configurable).

--
This message was sent by Atlassian JIRA
(v6.2#6252)
