Hi guys,

I'm new to CouchDB. I'm planning to use CouchDB 2.0 to transfer usage
logs from a number of devices located on customer premises to a cloud
solution for analysis. By logs I mean a constant stream of small JSON
documents (up to 1 KB each); I expect up to 100K such documents from each
device daily.

1. I'm planning to implement this by setting up one-way replication from
CouchDB on each device to a CouchDB cluster (up to 8 nodes) in the cloud
(rough sketch below).
2. As logs arrive on the cloud side, I'm going to handle the data stream
by subscribing to the _changes feed and passing device logs to a pipeline
for further processing (sketch below).
3. Once new log entries have been picked up and sent to the pipeline I
basically don't need them in CouchDB anymore, so I'm going to expire data
by maintaining an index on ctime and running a separate job that
bulk-deletes documents older than a few days (sketch below).

Now, I know that in CouchDB documents are not really deleted, just marked
as 'deleted', so the database will grow permanently. I have the option to
either use periodic _purge (which I've heard may not be safe, especially in
a clustered environment) or implement monthly rotating databases (which is
more complex, and I don't really want to go down that route).
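
For reference, the purge call I had in mind looks like this (doc ID and
revision are invented, and I'm not sure how well this endpoint behaves on
clustered databases in 2.0, which is partly what I'm asking below):

    import requests

    COUCH = "http://admin:admin-password@couch.example.com:5984"
    DB = "logs-all"

    # Unlike a normal delete, _purge erases the revisions completely and
    # leaves no tombstone behind. The body maps doc ID -> revisions to purge.
    requests.post(
        COUCH + "/" + DB + "/_purge",
        json={"some-log-doc-id": ["1-abc123def456"]},
    )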

My questions are:

- Is this a valid use case for CouchDB? I want to use it primarily because
of its good replication capabilities, especially in unreliable environments
with periods of being offline, etc. Otherwise I'll have to write a whole
set of data sync APIs with buffering, retries, etc. myself.
- I find conflicting opinions on the internet: some people say CouchDB is a
perfect tool for replication over the internet, others say to use
replication ONLY on a local network. What is the truth? Should I use
CouchDB as a tool to sync data over an unreliable network?
- Is it recommended practice to set up a chain of replications? For
security reasons I want each customer device to replicate to its own
database in the cloud, and then I want those databases to replicate into a
single central log database whose _changes feed I'd subscribe to (see the
sketch after this list). The reason is that it's easier for me to consume a
single _changes feed than to follow multiple databases.
- Is using _purge safe in my case? In the official docs I read: "In
clustered or replicated environments it is very difficult to guarantee that
a particular purged document has been removed from all replicas". I don't
think this is a problem for me, as I primarily care about database size, so
it shouldn't be critical if a few documents fail to be purged.
- Considering I may have up to 10 GB of new data daily (or ~10 million new
entries daily at the 100-device estimate), I'm probably going to expire
documents older than 48 hours. How often would I need to run _purge?
- Does _purge block the database for reads/writes while running? Does
_compact (which I'd have to run after _purge) block?
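
To make the chained-replication question concrete, this is roughly how I'd
wire the fan-in on the cluster (device list and names are invented):

    import requests

    COUCH = "http://admin:admin-password@couch.example.com:5984"
    CENTRAL = "logs-all"
    DEVICES = ["device-001", "device-002", "device-003"]

    # Each per-device database replicates continuously into one central
    # database whose single _changes feed I would then consume.
    for dev in DEVICES:
        requests.put(
            COUCH + "/_replicator/" + dev + "-to-central",
            json={
                "source": dev + "-logs",  # database the device pushes into
                "target": CENTRAL,        # single central log database
                "continuous": True,
            },
        )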

thanks,
--Vovan
