[jug-discussion] storing blobs on file system or in db
I'm writing this web app that allows users to upload documents, such as word docs, images, etc, and then to download those documents again on request. the documents are not searched, interpreted, processed, version controlled, or anything else. just upload and download. i wonder if there's a general rule on whether one should stick such things into a db or onto the file system. i currently favor sticking them in the db. putting them on the fs seems to interfere with clustering (different files would be on different filesystems). it's also another thing to back up and generally maintain. on the other hand putting them in the db puts extra load on the db and the network. there are a bunch of other issues too. Any ideas? Thanks for any help.

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: [jug-discussion] storing blobs on file system or in db
The answer is partially dependent on your db, but if you want a rule of thumb, then I suggest the fs. Some dbs really don't perform well when moving blobs in and out of the DB. Also, you need to fine-tune your db and where you place the tables that will hold the blobs to minimize IO interference when pulling BLOBs and non-blobs. Think about it this way: if your DB needs to read a 10 MB file at the same time as it needs to read 100 1 KB rows for other requests, you are going to affect throughput if both sets of data live on the same spindles. I would be interested to hear what others say.

-Original Message- From: Andrew Huntwork [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 16, 2005 3:21 PM To: jug-discussion@tucson-jug.org Subject: [jug-discussion] storing blobs on file system or in db

[...]
Re: [jug-discussion] storing blobs on file system or in db
Interesting question. You could consider a shared file system. I hesitate to recommend that documents be stored in a database. You don't need the transactional capabilities (correct?), and an RDBMS is not really a great blob storage device (yes, they can do it, but I don't reach for an RDBMS to store things like this unless I really need to).

Randy

On Mar 16, 2005, at 3:21 PM, Andrew Huntwork wrote:

[...]
Re: [jug-discussion] storing blobs on file system or in db
it looks like the clear consensus is file system. that's what 2 of my co-workers said before i asked here, but now i actually basically believe them. I still have my doubts though...if someone has done this the db way and actually seen real scalability problems, i'd love to hear about it. Thanks for the responses.

Andrew Huntwork wrote:

[...]
RE: [jug-discussion] storing blobs on file system or in db
I forgot to mention this in my previous post, but I did a lot of benchmarking around this a few years ago and did several consulting gigs where people needed to rip out the blob-in-db infrastructure they had built because it was performing like a dog (a bad dog, not a good dog).

landon

-Original Message- From: Andrew Huntwork [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 16, 2005 4:40 PM To: jug-discussion@tucson-jug.org Subject: Re: [jug-discussion] storing blobs on file system or in db

[...]
Re: [jug-discussion] storing blobs on file system or in db
Andrew Huntwork wrote:

[...]

I'm all in favor of storing large documents, images, etc. in the filesystem and storing metadata in the db. I've implemented web-based systems using both a pure db approach and a combination of db and filesystem for storing data. I've found that the db route is, as you say, easier to administer in terms of backing up and access across multiple instances of applications, and easier to configure to get to the data. But the performance penalty can be severe, especially in a heavily loaded application.

I've done performance analysis on the db-based application, and during peak loads up to 40% of the runtime of my application is spent serving up the BLOBs as images (I store image data in the DB and access it through a special servlet that reads the BLOB from the database along with the image metadata like length and MIME type). It is just silly to tie up a servlet engine doing work that Apache does more efficiently.

My setup now is more complicated, but much more performant. By complicated I mean that I have a Spring-configured manager for db-external assets. This coordinates the usage of the filesystem with the db.
Also, backing up now has to include the virtual root of the filesystem where external resources are configured (the Spring-configured manager has a property that is set to this virtual root). The other complication is the setup of the Apache server to point to the resource directory. This is not so bad because I had another servlet serving this content anyway; the serving has now just moved to Apache instead of the servlet.

I'm not just uploading documents and serving them, however, so my setup is probably more complicated than yours would be. My application has uploaded images that are thumbnailed on demand to various sizes.

Just my opinion, FWIW.

- Drew

--
Drew Davidson | OGNL Technology
Email: [EMAIL PROTECTED]
Web: http://www.ognl.org
Vox: (520) 531-1966
Fax: (520) 531-1965
Mobile: (520) 405-2967
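For readers who want to picture the split described in this message (bytes on the filesystem under a virtual root, metadata in the db), here is a minimal Java sketch. It is not Drew's Spring-configured manager: the names `ExternalAssetStore` and `AssetMeta` are hypothetical, and an in-memory map stands in for the metadata table so the example runs without a database.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: file bytes live under a virtual root, metadata lives in the "db". */
final class ExternalAssetStore {

    /** One metadata row. In a real system this would be a table in the database. */
    static final class AssetMeta {
        final String id;
        final String originalName;
        final String mimeType;
        final long length;
        final String relativePath; // path below the virtual root, also usable in a URL

        AssetMeta(String id, String originalName, String mimeType, long length, String relativePath) {
            this.id = id;
            this.originalName = originalName;
            this.mimeType = mimeType;
            this.length = length;
            this.relativePath = relativePath;
        }
    }

    private final Path virtualRoot; // the directory a web server such as Apache would also point at
    // Assumption: this map is only a stand-in for the db table that holds the metadata.
    private final Map<String, AssetMeta> metaTable = new ConcurrentHashMap<>();

    ExternalAssetStore(Path virtualRoot) throws IOException {
        this.virtualRoot = Files.createDirectories(virtualRoot);
    }

    /** Writes the bytes to disk and records the metadata row. */
    AssetMeta store(String originalName, String mimeType, byte[] content) throws IOException {
        String id = UUID.randomUUID().toString();
        // Fan out by the first two characters so no single directory grows too large.
        String relativePath = id.substring(0, 2) + "/" + id;
        Path target = virtualRoot.resolve(relativePath);
        Files.createDirectories(target.getParent());
        Files.write(target, content);
        AssetMeta meta = new AssetMeta(id, originalName, mimeType, content.length, relativePath);
        metaTable.put(id, meta);
        return meta;
    }

    /** Reads the bytes back through the metadata row. */
    byte[] read(String id) throws IOException {
        AssetMeta meta = metaTable.get(id);
        if (meta == null) {
            throw new IOException("unknown asset " + id);
        }
        return Files.readAllBytes(virtualRoot.resolve(meta.relativePath));
    }
}
```

In a setup like the one described, downloads would normally bypass `read` entirely: the application would hand out a URL built from `relativePath` and let the web server serve the file. Backups then have to cover both the db and the virtual root.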
Re: [jug-discussion] storing blobs on file system or in db
I have, but then again I was using a @!#$!$ MS DBS at the time ;)

On Wed, 16 Mar 2005, Andrew Huntwork wrote:

[...] if someone has done this the db way and actually seen real scalability problems, i'd love to hear about it. [...]
[jug-discussion] storing blobs on file system or in db
Andrew == Andrew Huntwork [EMAIL PROTECTED] writes:

[...]

> I'm writing this web app that allows users to upload documents, such as word docs, images, etc, and then to download those documents again on request. the documents are not searched, interpreted, processed, version controlled, or anything else.

Hmm... So, how do you clean them up? Or do you just let the data storage grow without bound? How does the user deal with deletions, duplicates, and/or multiple versions? How is resource consumption (e.g., storage space) controlled? I.e., what about huge files?

> just upload and download. i wonder if there's a general rule on whether one should stick such things into a db or onto the file system.

How often are these files subsequently going to be downloaded? I.e., are these the usual "downloaded a few times and then forgotten" files, or are they going to be hammered? What are the robustness and reliability expectations of the users? I.e., what happens when disks go bad at various points in time? What are the needs in terms of the privacy and security of these files?

> i currently favor sticking them in the db. putting them on the fs seems to interfere with clustering (different files would be on different filesystems). it's also another thing to back up and generally maintain. on the other hand putting them in the db puts extra load on the db and the network. there are a bunch of other issues too.

You can use a clustering file system for these static files, or you can do the replication as part of the upload process. [If you do a lot of heavy, static file serving then I'd suggest that you look into serving them up using one of the lightweight, high-performing web servers that tie into OS-level services.] If you're going to do this seriously, you might want to consider separating these resources onto their own machines and/or disks. Check out, e.g., using a NetApp box for the storage -- they have some nice FS snapshotting to allow for on-the-fly backups.
If you have a lot of them and/or the files are large, stay well away from the database. The performance sucks because you're likely hosed unless you get much more complicated in your caching (but if you go that far you might as well put them in the FS in the first place). Use the database for the meta-data used to manage the files.

Have fun,
John
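Two of the questions in this message (huge files and duplicates) can be handled while the upload is streamed to disk. The following Java sketch is one possible approach, not something proposed in the thread: it enforces a size cap while copying and names the stored file by its SHA-256 digest, so a second upload of identical content lands on the same file. The class name `BoundedUpload` and the `maxBytes` parameter are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch: size-capped upload stored under its content hash. */
final class BoundedUpload {

    /**
     * Streams the upload into root, rejecting anything larger than maxBytes.
     * Returns the hex SHA-256 digest, which is also the stored file name.
     */
    static String save(InputStream in, Path root, long maxBytes) throws IOException {
        MessageDigest sha256;
        try {
            sha256 = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
        Files.createDirectories(root);
        Path tmp = Files.createTempFile(root, "upload-", ".part");
        boolean moved = false;
        try {
            long total = 0;
            byte[] buf = new byte[8192];
            try (OutputStream out = Files.newOutputStream(tmp)) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    total += n;
                    if (total > maxBytes) {
                        throw new IOException("upload exceeds " + maxBytes + " bytes");
                    }
                    sha256.update(buf, 0, n);
                    out.write(buf, 0, n);
                }
            }
            String hex = toHex(sha256.digest());
            Path target = root.resolve(hex);
            if (!Files.exists(target)) {
                Files.move(tmp, target); // first copy of this content
                moved = true;
            }
            return hex; // duplicates fall through and the temp file is removed below
        } finally {
            if (!moved) {
                Files.deleteIfExists(tmp);
            }
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

The digest would go into the metadata row in the database, together with the owner and original file name. Deleting a document then means removing the row and deleting the file only once no other row refers to the same digest.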
Re: [jug-discussion] storing blobs on file system or in db
Taking into account all the responses I've seen so far (down to J. D. Mitchell), there is relatively little consideration being given to the transactional issues. I suggest that before you settle on a solution where performance is the priority, you need to examine the issue of "What is the correct behavior?", where "correct behavior" in this case (i.e., where DBs are concerned) is to avoid inconsistencies. And of course its corollary, which is "What is the cost of incorrect behavior?"

Putting BLOB assets into the same DB as the data with which they are associated gives you the simplest implementation of "correct behavior". From past experience with Oracle I know that good performance with BLOB assets can be achieved. I can't speak specifically to other DBs, but historically the performance problems started with not having enough control over how table spaces were allocated and managed, as well as the general failure of the vendor to build good BLOB support.

I think it is a given that BLOB assets are always associated with other data elements. Putting BLOB assets onto the file system really splits the data into two DBs -- BLOBs on the file system and other, conventional records in the primary DB. Immediately this presents transactional problems. Without getting into every specific case, let me generalize some of the issues:

Each file upload to the file system has to be in the same transactional scope as the associated transaction with the primary DB. Upload failures (successes are easy) in all forms -- dropped connection, system failure, etc. -- need to be managed in a manner which includes rollback and cleanup on the file system as well as rollback of the transaction with the primary DB. Furthermore, operations on the primary DB, like backups, need to be in lock-step with operations on the file-system DB.
One uninformed sysadmin who does a DB backup without a lock-step backup of the file-system assets, followed by a disk failure, will ruin your whole day (probably month; prepare to give up your life for some time).

Then, as already mentioned, the burden of clustering (and replication) falls to you to implement. One solution that has been presented is a clustered file system or network file system. The issue here is that any file system that is not on the local disk puts BLOB assets back into play being slung around the network, with all the same performance problems you were trying to get away from in the first place.

Having said all that, if I had my druthers, I would put BLOB assets into the primary DB. This solves all my correctness issues and easily keeps me in the game with respect to DB clustering, replication and backups. I would deal with the performance issues in two ways: by ensuring that I am designing/configuring my DB BLOB support as efficiently as possible (I suggest that the reputation of BLOB support in DBs suffers from early problems and many people have not gone back to do the due diligence to see if the reputation is still warranted), and by implementing caching on the Apache/Tomcat server side to allow Apache to do its thing.

Caching to the local disk, even with the event mechanism to handle an update to the DB that was initiated on a different system, is easier to implement and prove than maintaining correctness in the same configuration. Incorrect caching means you may serve an old document. You can solve this in seconds by flushing the cache and still be out the door in time for Happy Hour. An inconsistent DB means you don't even have the correct document to begin with. Solving this, at the point at which you discover it, will be extremely difficult (that's the best case) if not impossible.

One final solution I would consider is to see if my DB would allow me to "slice" my data.
This could take a couple of different forms but the gist of it would be that the BLOB table spaces would be on the local disk/system with Apache/Tomcat and the other "conventional" data on the DB server. Perhaps the local disk is holding only the replication of the BLOB data? This particular analysis may not bear great fruit but it would be worth not leaving that stone unturned.

Just an opinion.

-J

Andrew Huntwork wrote:

[...]
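The rollback-and-cleanup requirement raised in this message can be sketched in Java as a compensating action: write the file first, run the primary-DB transaction second, and delete the file again if that transaction fails. This is only an illustration under stated assumptions, not a true distributed transaction. The class `CompensatingUpload` and the `MetadataTx` hook are hypothetical, and `finalPath` is assumed to have a parent directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch: file write plus DB commit, with cleanup when the DB side fails. */
final class CompensatingUpload {

    /** Stand-in for the primary-DB work (insert the metadata row, then commit). */
    interface MetadataTx {
        void commit(Path storedFile) throws Exception;
    }

    static void upload(byte[] content, Path finalPath, MetadataTx tx) throws Exception {
        Path dir = finalPath.getParent(); // assumption: finalPath is not a bare file name
        Files.createDirectories(dir);
        // Write to a temp file in the same directory so the final move stays on one filesystem.
        Path tmp = Files.createTempFile(dir, "upload-", ".part");
        try {
            Files.write(tmp, content);
            // Readers never observe a half-written file under the final name.
            Files.move(tmp, finalPath, StandardCopyOption.ATOMIC_MOVE);
            try {
                tx.commit(finalPath);
            } catch (Exception dbFailure) {
                // Compensate: the DB rolled back, so remove the file as well.
                Files.deleteIfExists(finalPath);
                throw dbFailure;
            }
        } finally {
            // Covers failures before the move, such as a dropped connection mid-write.
            Files.deleteIfExists(tmp);
        }
    }
}
```

A crash between the move and the commit would still leave an orphaned file behind, which is exactly the kind of inconsistency the message warns about. A periodic sweep that compares the directory with the metadata table would be needed to catch those cases, and backups of the two stores would still have to be taken in lock-step.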