Hi guys, I'm new to CouchDB. I'm planning to use CouchDB 2.0 to transfer usage logs from a number of devices on customer premises to a cloud solution for analysis. By logs I mean a constant stream of small JSON documents (up to 1 KB each); I expect up to 100K such documents from each device daily.
My plan:

1. Set up one-way replication from the CouchDB on each device to the CouchDB cluster (up to 8 nodes) in the cloud.
2. As logs are replicated, handle the data stream on the cloud side by subscribing to the _changes feed and passing the device logs to a pipeline for further processing.
3. Once new log entries have been picked up and sent to the pipeline, I basically don't need them in CouchDB anymore, so I'm going to implement data expiration by maintaining an index on ctime and running a separate job that bulk-deletes documents older than a few days.

Now, I know that in CouchDB documents are not really deleted, just marked as 'deleted', so the database will grow permanently. I have the option either to run _purge periodically (which I've heard may be unsafe, especially in a clustered environment) or to implement monthly rotating databases (which is more complex, and I'd rather not go that route).

My questions are:

- Is this a valid use case for CouchDB? I want to use it primarily for its good replication capabilities, especially in unreliable environments with periods of being offline etc. Otherwise I'd have to write the whole set of data sync APIs with buffering, retries, etc. myself.
- I find conflicting opinions on the internet: some people say CouchDB is a perfect tool for replication over the internet, others say to use replication ONLY on a local network. What is the truth? Should I use CouchDB as a tool to sync data over an unreliable network?
- Is it recommended practice to set up a chain of replications? For security reasons I want each customer device to replicate to its own database in the cloud, and then have those databases replicate to a single central log database whose _changes feed I'd subscribe to. The reason is that it's easier for me to consume a single _changes feed than to follow multiple databases.
- Is using _purge safe in my case?
From the official docs I read: "In clustered or replicated environments it is very difficult to guarantee that a particular purged document has been removed from all replicas". I don't think this is a problem for me, as I primarily care about database size, so it shouldn't be critical if a few documents fail to be purged.
- Considering I may have up to 10 GB of new data daily (~10 million new entries with the 100-device estimate), I'm probably going to expire documents older than 48 hours. How often do I need to run _purge?
- Does _purge block the database for reads/writes while running? Does _compact (which I have to run after _purge) block?

thanks,
--Vovan
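To make step 1 of my plan concrete, here's a rough sketch of the persistent replication document I'd write to the _replicator database on each device (all URLs, database names and credentials below are placeholders, not my real setup):

```python
import json

def make_replication_doc(source, target):
    """Build a persistent replication document for the _replicator
    database (CouchDB 2.x). One-way push: device -> cloud."""
    return {
        "source": source,        # local logs database on the device
        "target": target,        # per-device database in the cloud
        "continuous": True,      # keep pushing as new logs arrive
    }

doc = make_replication_doc(
    "http://localhost:5984/logs",
    "https://user:secret@cloud.example.com/device-42-logs",
)
print(json.dumps(doc, indent=2))
```

My understanding is that PUTting such a document into the device's _replicator database makes the replication persistent across restarts, which is what I want for flaky connectivity.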
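For step 2, this is roughly how I'd consume one batch of the _changes feed on the cloud side; 'sample' stands for the parsed JSON body of a GET /central-logs/_changes?since=<seq> response, and the sequence strings and doc ids are made up:

```python
def process_changes_batch(batch):
    """Extract doc ids from one _changes batch and return them
    together with last_seq, so the caller can checkpoint."""
    ids = [row["id"] for row in batch["results"]
           if not row.get("deleted")]      # skip deletion events
    return ids, batch["last_seq"]          # last_seq -> next 'since'

sample = {
    "results": [
        {"seq": "1-g1AAAA", "id": "log-0001", "changes": [{"rev": "1-a"}]},
        {"seq": "2-g1AAAB", "id": "log-0002", "changes": [{"rev": "1-b"}]},
        {"seq": "3-g1AAAC", "id": "log-0003", "deleted": True,
         "changes": [{"rev": "2-c"}]},
    ],
    "last_seq": "3-g1AAAC",
}
ids, checkpoint = process_changes_batch(sample)
print(ids, checkpoint)   # ids go to the processing pipeline
```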
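For step 3's expiration job, assuming my ctime index gives me the id and current rev of each expired document, the _bulk_docs body that marks them deleted would look something like this (ids, revs and timestamps are made up):

```python
import json

def make_bulk_delete_body(rows):
    """Turn index rows (each carrying a doc id and its current rev)
    into a POST /{db}/_bulk_docs body that marks the docs deleted."""
    return {"docs": [
        {"_id": row["id"], "_rev": row["rev"], "_deleted": True}
        for row in rows
    ]}

rows = [
    {"id": "log-0001", "rev": "1-a", "ctime": 1469000000},
    {"id": "log-0002", "rev": "1-b", "ctime": 1469000060},
]
body = make_bulk_delete_body(rows)
print(json.dumps(body))
```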
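And for the _purge question, my understanding is that POST /{db}/_purge takes a body mapping each doc id to the list of revs to purge (for my case, the rev of the deletion tombstone). A sketch with made-up ids/revs:

```python
def make_purge_body(deleted):
    """Build the body for POST /{db}/_purge: doc id -> revs to purge.
    'deleted' rows are assumed to carry the tombstone rev."""
    return {d["id"]: [d["rev"]] for d in deleted}

deleted = [
    {"id": "log-0001", "rev": "2-deadbeef"},
    {"id": "log-0002", "rev": "2-cafebabe"},
]
print(make_purge_body(deleted))
```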