Hello,

As much as I believe CouchDB is a great tool, my opinion is that it is not the right one for this job. Couch would probably be a perfect tool to store and search the metadata of your documents, but for the documents themselves I believe it wouldn't be practical. I'm still a newbie, but as others pointed out, a CouchDB database is a single file, which would grow very quickly (millions of documents), and the features of CouchDB (JSON documents, map-reduce views, etc.) would not even be used, since you only have a single key<=>document association. You could, as some suggested, build your own database "sharding" layer, but it really seems to be a tough fight for little gain.
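To give an idea of what such a home-made "sharding" layer involves, here is a minimal sketch in Python of the first step only: spreading document keys over several databases so that no single database file holds everything. The function names, the `docs_` prefix and the shard count are all assumptions made up for illustration; nothing here is part of CouchDB itself.

```python
import hashlib


def shard_for_key(doc_key: str, shard_count: int) -> int:
    """Map a document key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because the latter
    is not guaranteed to be stable between interpreter runs.
    """
    digest = hashlib.sha1(doc_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def database_name(doc_key: str, shard_count: int, prefix: str = "docs_") -> str:
    """Return the (hypothetical) database name that holds this key."""
    return f"{prefix}{shard_for_key(doc_key, shard_count):03d}"


if __name__ == "__main__":
    # Example: decide which of 16 databases stores document "000123".
    print(database_name("000123", 16))
```

Note that this is the easy part. Changing the number of shards later means moving documents around, and backup, restore and replication still have to be handled per database, which is why it looks like a tough fight.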
What you describe sounds more like a distributed, fault-tolerant, parallel filesystem, akin to what Google, Flickr and Amazon have designed for themselves. I have seen a presentation somewhere of Flickr's filesystem design and it really seemed to fit (lots of docs, few deletions, etc.), but I can't find it anymore (sorry). If I were you I'd have a look at Hadoop DFS (another Apache project). I don't know how good it is, but it seems closer to what you're looking for.

Ref: http://hadoop.apache.org/core/docs/current/hdfs_design.html

Cheers,
Boyd

-----Original Message-----
From: André Warnier [mailto:[email protected]]
Sent: Friday, April 17, 2009 04:02
To: [email protected]
Subject: Not-even-yet-newbie question

Hi good people on this list.

I was recently at ApacheCon Europe, where I followed the spirited and spiritual Introduction to CouchDB by J. Chris Anderson and Jan Lehnardt. I also browsed the CouchDB section on the ASF website. I don't know Erlang, although I followed the brief tutorial linked to from the website. It looked simple, which makes me suspect I missed quite a lot. In fact, I have the impression that I missed a whole lot more than Erlang, so I thank in advance whoever has the patience to read this and provide some answers to my questions.

I very much like the "Relax" motto. What I am still trying to figure out, mainly, is whether CouchDB would be an appropriate tool for the following.

We basically manage information and documents for other people, as an ASP service.
We provide various easy ways for companies to upload their electronic documents of all kinds to a dedicated Internet server; we then process these documents à la Tika (but not with Tika) to extract meta-data and content, automatically index them, and store on the one side the meta-data and text content in a search engine à la Lucene (but not Lucene), and on the other side the original electronic document in a special passive file structure that we developed, and which has proven capable of reliably storing a few million documents so far. In that file structure, each document is identified by a unique "logical number", which we store along with the meta-data in the search engine. (So far in our case, once a document is stored, it never changes.) Then we provide means for the customer to search and find their documents through a web interface to the search engine, and to retrieve the corresponding original documents.

It works well and is very reliable, but we are slowly getting into a management issue due to the volume of original electronic documents, which keeps increasing. That is because our customers never throw away old documents, and they give us ever more varied data to handle. So we are concerned about the increasing volumes to back up, and even more about the volumes to restore in case something went seriously wrong.

All the above is to indicate that when we ourselves talk about "documents", we talk about, on the one hand, a searchable index (which works very well, takes comparatively very little space and which we do not want to change for now), and on the other hand, the corresponding stored electronic documents (blobs), identified and accessible via one single "key".

I would be interested to understand whether CouchDB would provide a reliable and efficient replacement for our self-developed and self-maintained storage structure.

The first question is whether the notion of "document" in CouchDB is compatible with our own notion of document.
I mean, could I define in CouchDB a document as consisting of a single text "key" (a globally unique document-id), plus a "blob" of indeterminate size (e.g. an MS-Word document, or a PDF, or an image, or a CAD drawing, or an email or whatever)? And would I then be able to generate, for example, a search result webpage where next to a document summary I can display a PDF icon, which when clicked retrieves the corresponding electronic document from CouchDB and sends it to the browser?

Another aspect that seems particularly interesting - if I got this right - is the self-replicating nature of CouchDB, which would allow us to define, say, 3 "repositories" located in different places, which would automatically synchronise themselves. Yes? I also seem to have understood that if one of these repositories suddenly became unavailable because the big one just hit, a document request would automatically be satisfied by the next available one in line. Yes?

Would there be some way in CouchDB to store one such document in some logical group containing the original version (say OpenOffice text), along with its PDF/A version (which we generate when the document is originally stored) and an image of the first page (ditto), in such a way that by using the "main key" plus some additional parameter, I can retrieve whichever version I need at that moment?

Would I need to become proficient in Erlang before I can store a new document or retrieve a stored one, or can this be done using some simple call from some interface routine in any programming language? (For example, a click on a PDF icon generates a call to a mod_perl add-on Apache module, which then retrieves the document from CouchDB and returns it to the browser; Perl can "do JSON" or "do XML", for example.) To generalise the above question, for what kind of action would I necessarily need to know Erlang?

I'll no doubt have more questions if the answers to the above do not discourage me, but I promise they will be shorter.

Thanks in advance.
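
On the two questions about retrieving a given version with the "main key" plus an additional parameter, and about needing Erlang: CouchDB is accessed over plain HTTP, and a document can carry named attachments addressed as /database/document-id/attachment-name, so any language with an HTTP client can be used. A minimal sketch in Python follows; the server address, the database name `documents` and the attachment names are assumptions chosen for illustration. The sketch only builds the requests and sends nothing.

```python
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://localhost:5984"  # assumption: a CouchDB server on its default port
DATABASE = "documents"              # hypothetical database name


def attachment_url(doc_key: str, version_name: str) -> str:
    """URL of one stored version (attachment) of a document."""
    return (
        f"{BASE_URL}/{DATABASE}/"
        f"{quote(doc_key, safe='')}/{quote(version_name, safe='')}"
    )


def build_fetch_request(doc_key: str, version_name: str) -> Request:
    """GET request retrieving one version, e.g. the PDF/A rendition."""
    return Request(attachment_url(doc_key, version_name), method="GET")


def build_store_request(
    doc_key: str,
    version_name: str,
    content: bytes,
    content_type: str,
    rev: Optional[str] = None,
) -> Request:
    """PUT request storing one version as an attachment.

    When the document already exists, CouchDB expects its current
    revision to be passed in the `rev` query parameter.
    """
    url = attachment_url(doc_key, version_name)
    if rev is not None:
        url += "?" + urlencode({"rev": rev})
    return Request(
        url,
        data=content,
        method="PUT",
        headers={"Content-Type": content_type},
    )


if __name__ == "__main__":
    # "000123" stands in for the logical number used as the main key.
    print(build_fetch_request("000123", "pdfa.pdf").full_url)
```

Passing one of these requests to `urllib.request.urlopen` would send it to a running server. The same URLs could be fetched from a mod_perl handler with any Perl HTTP client, which is the point: no Erlang is involved in storing or retrieving.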
