Hello everybody,

Every now and then somebody raises questions about clock speed, disk space and the like on this list, during planning for a new project or when going from development to production. I just skimmed through the archived messages of the last year and found several requests for advice. Generally, the answer is "well, it depends". All in all, threads on this topic remain quite superficial and people rarely come back to them later. My conclusion is that it is really not much of an issue. What do you think?
Still, as I am just planning a new project, I'd like to raise this issue again, maybe in a slightly different form than before. Maybe we can find some aspects worth remembering and some questions to ask the next time we as a community start to gather information on running instances. There have already been several initiatives in this direction. I remember Valorie asking for information about new instances two years ago; there is the DSpace instances list in the DSpace Wiki, and some information is in the Fedora Commons Examples and Solution Communities sections, but none of these resources looks really comprehensive to me. Let's keep that in mind.

About my new project: I really have little information at hand for serious planning as of now, so I won't come up with item counts, gigabytes of storage requirements and so on here. What I do know already is that I will have a considerable number of automatic processes running, such as mass ingest. There will be a lot of scanned and OCRed, say *huge*, PDFs to index, which will make me learn about the intricacies of PDFBox, I suspect. My focus is on building a really reliable setup, rather than tweaking performance. Though performance, such as database query performance, is one aspect of reliability.

I'd like to hear your thoughts on both architecture and specific components, rather than numbers. Here is a list of areas I am pondering.

Conventional setups seem to use two separate boxes, one for development, one for production, each running the whole software stack including database and JSP container. But what about a dedicated database machine and a dedicated Tomcat machine? Maybe with optimized mass storage for each purpose, i.e. smaller but faster, expensive drives such as SAS or even solid state drives for the database, and huge cheap SATA drives for the Tomcat machine, but with a hardware RAID controller (RAID 5, RAID 10?), whereas the database machine is fine with a software RAID?
Or is more RAM always a better choice for the database host compared to fast drives? RAM seems to be the most limiting factor. And Java performance seems to be more limiting than Postgres performance, at least up to a certain table size.

Now I am going into numbers again. 2 gigabytes has seemingly been a standard for small servers for quite a while now. Isn't this outdated? Should one go for a standard of, let's say, 8 GB or even much more and tweak Tomcat/Postgres settings to make use of it? Is there a simple rule like giving Postgres enough RAM to keep the whole DB in RAM and assigning everything left to Tomcat?

Are there still advantages of a physical machine over a virtualized environment, provided the virtual machine gets the same amount of dedicated memory?

Are there preferences anywhere to move towards a different JSP container, say Jetty over the preferred Tomcat? In the area of HTTP servers I see a turn to lighttpd, favouring speed and simplicity over features. I'd expect something similar to happen here. If there is no need to host anything besides DSpace on the machine, might one skip Apache completely, handling even HTTPS through Tomcat itself?

Has anybody given thought to, or gathered practical experience with, load balancing setups, or is a single machine of a quality brand always sufficient? Would one start with clustering databases or duplicating the Tomcat box? I guess duplicating the frontend is probably more demanding in terms of session handling, compared to configuring and maintaining a database cluster. On the other hand, the JSP container is the place where performance problems usually strike first, I guess. I do not consider such a setup for my project as of now, but I would like to hear whether DSpace reaches the ceiling when it comes to HA environments.

OK, I guess this is enough as a starter?
Bye,
Christian

_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech