Hi guys, 

I'm new to CouchDB. I'm planning to use CouchDB 2.0 to transfer usage logs 
from a number of devices located on customer premises to a cloud solution for 
analysis. By logs I mean a constant stream of small JSON documents (up to 1 KB 
each); I expect up to 100K such documents from each device daily.

1. I'm planning to implement this by setting up one-way replication from the 
CouchDB instance on each device to a CouchDB cluster (up to 8 nodes) in the 
cloud (a replication sketch follows this list).
2. As logs are replicated, on the cloud side I'm going to handle the data 
stream by subscribing to the _changes feed and passing the device logs to a 
pipeline for further processing (consumer sketch below).
3. Once new log entries have been picked up and sent to the pipeline I 
basically don't need them in CouchDB anymore, so I'm going to expire data by 
maintaining an index on ctime and running a separate job that bulk-deletes 
documents older than a few days (expiration job sketch below).
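
To make the plan concrete, here's roughly what I have in mind for step 1. It's 
only a minimal sketch; the host names, database names, and credentials are 
placeholders, not anything I've actually deployed:

    # Sketch of step 1: persistent push replication from the device's local
    # CouchDB to its own per-device database in the cloud cluster.
    # All hosts, database names and credentials below are placeholders.
    import requests

    DEVICE_COUCH = "http://localhost:5984"
    CLOUD_COUCH = "https://device-0001:secret@couch.example.com"  # placeholder
    DEVICE_ID = "device-0001"

    repl_doc = {
        "_id": f"push-logs-{DEVICE_ID}",
        "source": f"{DEVICE_COUCH}/logs",
        "target": f"{CLOUD_COUCH}/logs_{DEVICE_ID}",
        "continuous": True,  # keep retrying while the link is down
    }

    # Writing the document into _replicator makes the replication persistent,
    # so it survives restarts of the device's CouchDB.
    resp = requests.put(
        f"{DEVICE_COUCH}/_replicator/{repl_doc['_id']}",
        json=repl_doc,
        auth=("admin", "secret"),  # placeholder local admin credentials
    )
    resp.raise_for_status()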
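
For step 2, the consumer would look something like the sketch below, with 
process() standing in for the real pipeline hand-off and without any 
checkpointing of the sequence number:

    # Sketch of step 2: follow the central database's _changes feed and hand
    # each new (non-deleted) log document to the processing pipeline.
    import requests

    COUCH = "https://couch.example.com"  # placeholder
    DB = "logs_central"
    AUTH = ("admin", "secret")           # placeholder

    def process(doc):
        # placeholder for the real pipeline hand-off
        print(doc["_id"])

    since = "0"
    while True:
        resp = requests.get(
            f"{COUCH}/{DB}/_changes",
            params={
                "feed": "longpoll",     # block until something changes
                "include_docs": "true",
                "since": since,
                "timeout": 60000,
            },
            auth=AUTH,
        )
        resp.raise_for_status()
        changes = resp.json()
        for row in changes["results"]:
            if "doc" in row and not row.get("deleted"):
                process(row["doc"])
        since = changes["last_seq"]     # remember where we left off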
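
And for step 3, the expiration job, assuming ctime is stored as an ISO-8601 
string and that a Mango index on ctime already exists (created with 
POST /{db}/_index and {"index": {"fields": ["ctime"]}}):

    # Sketch of step 3: periodically find documents whose ctime is older than
    # the retention window and delete them in batches.
    import requests
    from datetime import datetime, timedelta, timezone

    COUCH = "https://couch.example.com"  # placeholder
    DB = "logs_central"
    AUTH = ("admin", "secret")           # placeholder

    cutoff = (datetime.now(timezone.utc) - timedelta(hours=48)).isoformat()

    while True:
        # Fetch a batch of expired documents via the Mango index on ctime.
        found = requests.post(
            f"{COUCH}/{DB}/_find",
            json={
                "selector": {"ctime": {"$lt": cutoff}},
                "fields": ["_id", "_rev"],
                "limit": 1000,
            },
            auth=AUTH,
        ).json()
        docs = found["docs"]
        if not docs:
            break

        # Mark the whole batch as deleted in a single _bulk_docs call.
        deletions = [
            {"_id": d["_id"], "_rev": d["_rev"], "_deleted": True}
            for d in docs
        ]
        resp = requests.post(
            f"{COUCH}/{DB}/_bulk_docs", json={"docs": deletions}, auth=AUTH
        )
        resp.raise_for_status()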

Now, I know that in CouchDB documents are not really deleted, only marked as 
'deleted', so the database will grow permanently. I have the option either to 
run a periodic _purge (which I've heard may not be safe, especially in a 
clustered environment) or to implement monthly rotating databases (which is 
more complex, and I'd rather not go down that route).

My questions are:

- Is this a valid use case for CouchDB? I want to use it primarily for its 
replication capabilities, especially in unreliable environments with periods 
of being offline, etc. Otherwise I'd have to write the whole set of data sync 
APIs with buffering, retries, etc. myself.
- I find conflicting opinions on the internet: some people say CouchDB is a 
perfect tool for replication over the internet, others say to use replication 
ONLY on a local network. What is the truth? Should I use CouchDB as a tool to 
sync data over an unreliable network?
- Is it recommended practice to set up a chain of replications? For security 
reasons I want each customer device to replicate to its own database in the 
cloud. Then I want those per-device databases to replicate into a single 
central log database whose _changes feed I'd subscribe to. The reason is that 
it's easier for me to follow a single _changes feed than one per database.
- Is using _purge safe in my case? The official docs say: "In clustered 
or replicated environments it is very difficult to guarantee that a particular 
purged document has been removed from all replicas". I don't think this is a 
problem for me, since I primarily care about database size, so it shouldn't be 
critical if some documents fail to get purged. (A rough _purge sketch follows 
this list.)
- Considering I may have up to 10 GB of new data daily (or ~10 million new 
entries daily, estimating 100 devices), I'm probably going to expire documents 
older than 48 hours. How often would I need to run _purge?
- Does _purge block the database for reads/writes while running? Does _compact 
(which I have to run after _purge) block?
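
In case it helps to show what I mean by periodic _purge: here's a rough 
sketch, assuming the _purge endpoint is available and works as described in 
the docs on my version (I understand clustered support may be limited), with 
placeholder names, revisions, and credentials:

    # Rough sketch of the periodic purge: take documents that the deletion job
    # already removed (it knows their ids and revs) and purge their tombstones.
    import requests

    COUCH = "https://couch.example.com"  # placeholder
    DB = "logs_central"
    AUTH = ("admin", "secret")           # placeholder

    # map of doc id -> list of revisions to purge (placeholder values)
    doc_revs = {"log-0001": ["2-7051cbe5c8faecd085a3fa619e6e6337"]}

    resp = requests.post(f"{COUCH}/{DB}/_purge", json=doc_revs, auth=AUTH)
    resp.raise_for_status()

    # Compaction afterwards is what would actually reclaim the disk space.
    resp = requests.post(
        f"{COUCH}/{DB}/_compact",
        headers={"Content-Type": "application/json"},
        auth=AUTH,
    )
    resp.raise_for_status()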

thanks,
--Vovan
