Hey Sebastian-

It may help to consider the other pieces you will need aside from compute
nodes, such as nodes for proxies and databases, networking gear (switches
and cables), and so on.
http://usegalaxy.org/production has some details, and there are
high-level pieces explained at 
http://wiki.g2.bx.psu.edu/Events/GDC2010?action=AttachFile&do=get&target=GDC2010_building_scalable.pdf

You should also talk to your institution's IT folks about power
requirements, how those costs are passed on, off-site backup storage
(though it sounds like you're counting on RAID 5/6), etc.

It may also help if folks could share their experiences benchmarking
their own systems, along with the tools they've been using. The Galaxy
Czars conference call could help here - you could bring this up at the
next meeting.

I've answered inline, but in general I think the bottleneck for your
planned architecture will be disk I/O. The next bottleneck is likely the
network - if you have a disk farm behind a 1 Gbps (~125 MB/s) connection,
then it doesn't matter whether your disks can write 400+ MB/s. (Nate also
covered this in his presentation.) You may want to consider InfiniBand
rather than Ethernet - I think the Galaxy Czars call would be really
helpful in this respect.
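
To make the arithmetic concrete, here is a tiny sketch (the 400 MB/s disk
figure is just the example above, not a measurement; the link speeds are
typical options):

    # The slower of the disk array and the network link bounds what a node
    # can actually stream to or from shared storage.
    def link_mb_per_s(gbps):
        # 1 Gbps ~= 125 MB/s, ignoring protocol overhead
        return gbps * 1000.0 / 8

    disk_write_mb_per_s = 400                 # example aggregate disk rate
    for gbps in (1, 10, 40):                  # GigE, 10 GbE, ~QDR InfiniBand
        effective = min(disk_write_mb_per_s, link_mb_per_s(gbps))
        print("%2d Gbps link -> effective ~%d MB/s" % (gbps, effective))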

> 1. Using the described bioinformatics software: where are the potential
> system bottlenecks? (connections between CPUs, RAM, HDDs)

One way to get a better idea is to start with existing resources,
create a sample workflow or two, and measure performance. Again, the
Galaxy Czars call could be a good bet.
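
If it helps, a minimal timing harness like the following is often enough to
get first numbers; the command line is a placeholder, so substitute whichever
mapper or clustering tool you actually plan to run:

    import subprocess, time

    # Placeholder command - replace with the real tool and real input files.
    cmd = ["your_mapper", "reference.fa", "reads.fastq"]

    start = time.time()
    subprocess.check_call(cmd)
    print("wall clock: %.1f s" % (time.time() - start))

Run it a few times with representative input sizes and watch iostat/top in
parallel to see whether the run is CPU- or I/O-bound.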

> 2. What is the expected relation of integer-based and floating point
> based calculations, which will be loading the CPU cores?

This also depends on the tools being used. It might matter more if your
architecture were to use specialized hardware (such as GPUs or FPGAs),
but it should be a secondary concern.

> 3. Regarding the architectural differences (strengths, weaknesses):
> Would an AMD- or an Intel-System be more suitable?

I really can't say which processor line is more suitable, but I think
that having enough RAM per core matters more. Nate's presentation shows
that main.g2.bx.psu.edu has 4 GB of RAM per core.
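
As a rough sizing sketch (the 4 GB/core figure is just the number above, and
the core counts are arbitrary examples):

    ram_per_core_gb = 4                       # figure from Nate's slides
    for cores in (8, 16, 32):                 # example node sizes
        print("%2d cores -> %3d GB RAM per node"
              % (cores, cores * ram_per_core_gb))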

> 4. How much I/O (read and write) can be expected at the memory
> controllers? Which tasks are most I/O intensive (regarding RAM and/or
> HDDs)?

Workflows currently write all output to disk and read all input from
disk. This gets back to previous questions on benchmarking.
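
If you want a rough picture of how much a tool actually hits the disk, a
Linux-oriented sketch like this can help; getrusage(RUSAGE_CHILDREN) reports
block I/O operations charged to finished child processes, and the command is
again a placeholder:

    import resource, subprocess

    subprocess.check_call(["your_tool", "input.dat"])   # placeholder command
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    print("block input ops:  %d" % usage.ru_inblock)
    print("block output ops: %d" % usage.ru_oublock)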
 
> 5. Roughly separated in mapping and clustering jobs: which amounts of
> main memory can be expected to be required by a single job (given e.g.
> Illumina exome data, 50x coverage)? As far as I know mapping should be
> around 4 GB, clustering much more (may reach high double digits).

Nate's presentation shows that main.g2.bx.psu.edu allocates 24 to 48 GB
per 8-core reservation, and, as noted above, roughly 4 GB of RAM per core.
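
Assuming your own ~4 GB per mapping job estimate, a quick sketch of how many
jobs fit on such a reservation (numbers are illustrative, not a recommendation):

    def jobs_per_node(node_ram_gb, cores, ram_per_job_gb=4):
        # Limited by whichever runs out first: cores or memory.
        return min(cores, node_ram_gb // ram_per_job_gb)

    for ram_gb in (24, 48):
        print("8 cores, %2d GB -> %d concurrent ~4 GB mapping jobs"
              % (ram_gb, jobs_per_node(ram_gb, 8)))

A clustering job needing tens of GB would instead occupy most of a node on
its own, which argues for scheduling memory as well as cores.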

> 6. HDD access (R/W) is mainly in bigger blocks instead of masses of
> short operations - correct?

Again, this all depends on the tool being used, and benchmarking your own
workflows would help answer it. This question sounds like it's mostly
related to choosing the filesystem - is that right? If so, then you may
want to consider a compressing filesystem such as ZFS or BtrFS. You may
also want to consider distributed filesystems like Ceph or Gluster (now
Red Hat). I know that Ceph can run on top of XFS and BtrFS, but you should
look into BtrFS's churn rate - it might still be evolving quickly. Again, a
ping to the Galaxy Czars call may help with any and possibly all of these
questions.
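
For the access-pattern question specifically, even a toy comparison on the
candidate filesystem can be informative; a dedicated tool such as fio is
better for real numbers, but a quick sketch along these lines shows the shape:

    import os, time

    def timed_write(path, block_size, count):
        # Write block_size * count bytes, fsync, and return MB/s.
        buf = b"x" * block_size
        start = time.time()
        with open(path, "wb") as f:
            for _ in range(count):
                f.write(buf)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.time() - start
        os.remove(path)
        return (block_size * count) / elapsed / 1e6

    print("1 MB blocks: %6.1f MB/s" % timed_write("bench.tmp", 1 << 20, 256))
    print("4 KB blocks: %6.1f MB/s" % timed_write("bench.tmp", 4096, 65536))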

Good luck!

-Scott