RE: Not-even-yet-newbie question

Boyd Ebsworthy Sun, 19 Apr 2009 09:50:55 -0700

Hello,

As much as I believe CouchDB is a great tool, my opinion is that it is not 
the right one for this job. 
CouchDB would probably be a perfect tool to store and search the metadata of your 
documents, but for the documents themselves I believe it wouldn't be practical. I'm 
still a newbie, but as others pointed out, a CouchDB database is a single file, 
which would grow very quickly (millions of documents), and the features of CouchDB 
(JSON documents, map-reduce views, etc.) would not even be used, since you only 
have a single key<=>document association.
You could, as some suggested, build your own database "sharding" layer, but it 
really seems to be a tough fight for little gain.
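
To make the "sharding" idea concrete, here is a minimal sketch of what such a layer might do: hash each document key and use the result to pick one of several databases, so that no single database file holds everything. The function names, the "docs_" prefix and the shard count are all made up for illustration; nothing here is part of CouchDB itself.

```python
import hashlib


def shard_for(doc_id: str, shard_count: int) -> int:
    """Map a document key to a shard number in the range [0, shard_count)."""
    digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer, then reduce it.
    return int.from_bytes(digest[:8], "big") % shard_count


def database_for(doc_id: str, shard_count: int, prefix: str = "docs_") -> str:
    """Return the (hypothetical) database name that should hold this key."""
    return f"{prefix}{shard_for(doc_id, shard_count):03d}"


if __name__ == "__main__":
    for key in ("000123", "000124", "000125"):
        print(key, "->", database_for(key, shard_count=16))
```

The same key always maps to the same database, but changing the shard count later moves most keys around, which is part of why this looks like a tough fight.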


What you describe sounds more like you are looking for a distributed, 
fault-tolerant, parallel filesystem akin to what Google, Flickr and Amazon have 
designed for themselves. I have seen a presentation somewhere of Flickr's 
filesystem design and it really seemed to fit (lots of docs, few deletions, 
etc.), but I can't find it anymore (sorry).

If I were you I'd have a look at Hadoop DFS (another Apache project). I don't 
know how good it is, but it seems closer to what you're looking for.

Ref:
http://hadoop.apache.org/core/docs/current/hdfs_design.html


Cheers,
Boyd




-----Original Message-----
From: André Warnier [mailto:[email protected]] 
Sent: Friday, April 17, 2009 04:02
To: [email protected]
Subject: Not-even-yet-newbie question

Hi good people on this list.

I was recently at ApacheCON Europe, where I followed the spirited and 
spiritual Introduction to CouchDB by J. Chris Anderson and Jan Lehnardt. 
  I also browsed the CouchDB section on the ASF website. I don't know 
Erlang, although I followed the brief tutorial linked to from the 
website.  It looked simple, which makes me suspect I missed quite a lot.

In fact, I have the impression that I missed a whole lot more than 
Erlang, so I thank in advance whoever has the patience to read this and 
provide some answers to my questions.

I very much like the "Relax" motto.

What I am still trying to figure out mainly, is if CouchDB would be an 
appropriate tool for the following.

We basically manage information and documents for other people, as an 
ASP service.  We provide various easy ways for companies to upload their 
electronic documents of all kinds to a dedicated Internet server; we 
then process these documents à la Tika (but not with Tika) to extract 
meta-data and content, automatically index them, and store on the one 
side the meta-data and text content in a search engine à la Lucene (but 
not Lucene), and on the other side we store the original electronic 
document into a special passive file structure that we developed, and 
which has proven capable of storing reliably a few million documents so 
far.  In that file structure, each document is identified by a unique 
"logical number", which we store along with the meta-data in the search 
engine.  (So far in our case, once a document is stored, it never changes).
Then we provide means for the customer to search and find their 
documents through a web interface to the search engine, and to retrieve 
the corresponding original documents.

It works well and is very reliable, but slowly we are getting into a 
management issue due to the volume of original electronic documents,
which keeps increasing. That is because our customers never throw away 
old documents, and they give us ever more varied data to handle.
So we are concerned about increasing volumes to back up, and even more 
about volumes to restore in case something goes seriously wrong.

All the above to indicate that when we ourselves talk about "documents", 
we talk about on the one hand a searchable index (which works very well, 
takes comparatively very little space and which we do not want to change 
for now), and on the other hand, stored corresponding electronic 
documents (blobs) identified and accessible via one single "key".

I would be interested to understand if CouchDB would provide a reliable 
and efficient replacement for our self-developed and self-maintained 
storage structure.

The first question is whether the notion of "document" in CouchDB is 
compatible with our own notion of document.  I mean, could I define in 
CouchDB a document as consisting of a single text "key" (a globally 
unique document-id), plus a "blob" of indeterminate size (e.g. an MS-Word 
document, or a PDF, or an image, or a CAD drawing, or an email or 
whatever). And would I then be able to generate for example a search 
result webpage, where next to a document summary I can display a PDF 
icon, which when clicked retrieves the corresponding electronic document 
from CouchDB and sends it to the browser ?

Another aspect that seems particularly interesting - if I got this right 
- is the self-replicating nature of CouchDB, which would allow us to 
define say 3 "repositories" located in different places, and which would 
automatically synchronise themselves. Yes ?
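
As far as I understand, CouchDB replication is triggered by POSTing a small JSON body with a "source" and a "target" to the server's /_replicate resource, and each such request copies changes in one direction only. Keeping 3 repositories in sync would therefore take one request per ordered pair. The sketch below only builds those request bodies without sending anything; the host names are invented.

```python
import json
from itertools import permutations


def replication_requests(databases):
    """Return one (path, JSON body) pair per ordered pair of databases."""
    return [
        ("/_replicate", json.dumps({"source": source, "target": target}))
        for source, target in permutations(databases, 2)
    ]


if __name__ == "__main__":
    # Hypothetical repository URLs, one per site.
    sites = [
        "http://site-a.example.com:5984/docs",
        "http://site-b.example.com:5984/docs",
        "http://site-c.example.com:5984/docs",
    ]
    for path, body in replication_requests(sites):
        print("POST", path, body)
```

With 3 sites this gives 6 one-way replications, each of which would be POSTed to one of the servers.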

I also seem to have understood that if one of these repositories 
suddenly became unavailable because the big one just hit, a document 
request would automatically be satisfied by the next available one in 
line. Yes ?

Would there be some way in CouchDB to store one such document, in some 
logical group containing the original version (say OpenOffice text), 
along with its PDF/A version (which we generate when the document is 
originally stored) and with an image of the first page (ditto), in such 
a way that by using the "main key" plus some additional parameter, I can 
retrieve whichever version I need now ?
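
One way this grouping could look, assuming CouchDB attachments are used: a single document carries the "main key" as its _id, and each version is a named attachment under _attachments. The sketch below builds such a document as a Python dict with the attachments inlined as base64; the attachment names and the helper function are only illustrative.

```python
import base64
import json


def document_with_versions(doc_id, versions):
    """Build a document dict; versions maps name -> (content_type, raw bytes)."""
    return {
        "_id": doc_id,
        "_attachments": {
            name: {
                "content_type": content_type,
                # Inline attachment data must be base64-encoded text.
                "data": base64.b64encode(raw).decode("ascii"),
            }
            for name, (content_type, raw) in versions.items()
        },
    }


if __name__ == "__main__":
    doc = document_with_versions(
        "000123",
        {
            "original.odt": ("application/vnd.oasis.opendocument.text", b"..."),
            "archive.pdf": ("application/pdf", b"..."),
            "page1.png": ("image/png", b"..."),
        },
    )
    print(json.dumps(doc, indent=2))
```

The "additional parameter" would then simply be the attachment name. For large files, inlining base64 in JSON is probably wasteful, and uploading each attachment separately would likely be the better route.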

Would I need to become proficient in Erlang before I can store a new 
document or retrieve a stored one, or can this be done using some simple 
call from some interface routine in any programming language ?
(For example, a click on a PDF icon generates a call to a mod_perl 
add-on Apache module, which then retrieves the document from CouchDB and 
returns it to the browser)(perl can "do JSON" or "do XML" e.g.).
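
For what it is worth, CouchDB is spoken to over plain HTTP, so fetching a stored attachment comes down to a GET on /database/document-id/attachment-name from whatever language the interface routine is written in. A small sketch that builds such a request follows; the database name, document id and attachment name are invented, and 5984 is CouchDB's default port.

```python
from urllib.parse import quote
from urllib.request import Request


def attachment_request(base_url, database, doc_id, name):
    """Build a GET request for one attachment of one document."""
    parts = [quote(part, safe="") for part in (database, doc_id, name)]
    url = "/".join([base_url.rstrip("/")] + parts)
    return Request(url, method="GET")


if __name__ == "__main__":
    req = attachment_request(
        "http://localhost:5984", "docs", "000123", "archive.pdf"
    )
    print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen and reading the response would return the raw bytes, given a running server. A mod_perl handler would do the equivalent with any Perl HTTP client.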

To generalise the above question, for what kind of action would I 
necessarily need to know Erlang ?

I'll no doubt have more questions if the answers to the above do not 
discourage me, but I promise they will be shorter.

Thanks in advance.
