RE: Confused about two development utils [EXT]

James Smith Wed, 23 Dec 2020 17:40:31 -0800

We don’t use Perl for everything. Yes, we use it for web data, and yes, we 
still use it as the glue language in a lot of cases, but the most complex stuff 
is done in C (not even C++, as that is too slow). Others on site use Python, 
Java, Rust, Go and PHP, and we are looking at using GPUs in cases where code 
can be highly parallelised.

It is not just one application – but many, many applications… All with a common 
goal of understanding the human genome, and using it to assist in developing 
new understanding and techniques which can advance health care.

We are a very large sequencing centre (one of the largest in the world). What 
I was pointing out is that you can’t just throw memory, CPUs and power at a 
problem. You have to ask "how can I do what I need to do with the least 
resources?" rather than "what resources can I throw at the problem?".

Currently we are acting as the central repository for all COVID-19 sequencing 
in the UK, along with one of the largest “wet” labs sequencing data for it – 
and that is half the sequenced samples in the whole world. The UK is sequencing 
more COVID-19 genomes a day than most other countries have sequenced since the 
start of the pandemic in Feb/Mar. This has led to us discovering a new, more 
transmissible version of the virus, and in what part of the country the 
different strains are present – no other country in the world has the 
information, technology or infrastructure in place to achieve this.

But this is just a small part of the genomic sequencing we are looking at – we 
work on:
* other pathogens – e.g. Plasmodium (Malaria);
* cancer genomes (and how effective drugs are);
* the Human Cell Atlas, of which we are a major part, which is looking at how 
the expression of genes (in the simplest terms, which ones are switched on and 
switched off) differs between tissues;
* sequencing the genomes of other animals to understand their evolution;
* and looking at some other species in detail, to see what we can learn from 
them when they have defective genes.

All of these are currently scaled back, though, so that we can work 
relentlessly to help the medical teams and other researchers get on top of 
COVID-19.

What is interesting is that many of the developers we have on campus (well, all 
working from home at the moment) are (relatively) old, and learnt to develop 
code on machines with limited CPU and limited memory, where things had to be 
efficient and had to be compact. That is as important now as it was 20 or 30 
years ago – the data we handle is going up faster than Moore’s Law! Many of us 
take pride in doing things as efficiently as possible.

It took around 10 years to sequence and assemble the first human genome {well 
we are still tinkering with it and filling in the gaps} – now at the institute 
we can sequence and assemble around 400 human genomes in a day – to the same 
quality!

So most of our issues are due to the scale of the problems we face. For 
example, the human genome has 3 billion base pairs (As, Cs, Gs and Ts), so 
normal solutions don’t scale to that. Once, many years ago, we looked at 
setting up an Oracle database with at least 1 row for every base pair, 
recording all variants for that base pair (think of them as spelling mistakes: 
for example a T rather than an A, or an extra letter inserted or deleted). The 
schema was set up, and then they realised it would take 12 months to load the 
data which we had then (which is probably less than a millionth of what we 
have now)!

Moving compute off site is a problem because of the sheer volume of data we 
would have to transfer – you can’t easily move all the data to the compute, so 
you have to bring the compute to the data.

The site I worked on before I became a more general developer was doing that – 
and the code that was written 12-15 years ago is actually still going strong – 
it has seen a few changes over the years – many displays have had to be 
redeveloped as the scale of the data has got so big that even the summary pages 
we produced 10 years ago have to be summarised because they are so large.

From: Mithun Bhattacharya <[email protected]>
Sent: 24 December 2020 00:06
To: mod_perl list <[email protected]>
Subject: Re: Confused about two development utils [EXT]

James, would you be able to share more info about your setup?
1. What exactly is your application doing which requires so much memory and 
CPU? Is it something like gene splicing (no, I don't know much about it beyond 
Jurassic Park :D )?
2. Do you feel Perl was the best choice for whatever you are doing, and if yes, 
then why? How much of your stuff is using mod_perl, considering you mentioned 
not much is web related?
3. What are the challenges you are currently facing with your implementation?

On Wed, Dec 23, 2020 at 6:58 AM James Smith <[email protected]> wrote:
Oh but memory is a problem – but not if you have just a small cluster of 
machines!

Our boxes are larger than that, but they all run virtual machines {only a small 
proportion web related}, and machines/memory would rapidly become a problem in 
our data centre - we run VMware [995 hosts] and OpenStack [10,000s of hosts] + 
a selection of large memory machines {measured in TBs of memory per machine}.

We would be looking at somewhere between 0.5 PB and 1 PB of memory. It is not 
just the price of buying that amount of memory (for many machines we need the 
fastest memory money can buy for the workload): we would also need a lot more 
CPUs than we currently have, as we would need a larger number of machines to 
host 64GB virtual machines {we would get 2 VMs per host}. We currently have 
approx. 1-2000 CPUs running our hardware (last time I had a figure) – it would 
probably need to go to approximately 5-10,000!
It is not just the initial outlay but the environmental and financial cost of 
running that number of machines, and finding space to run them without putting 
the cooling costs through the roof!! That is without considering what 
additional constraints on storage having the extra machines may have (at the 
last count a year ago we had over 30 PBytes of storage on site, and a large 
amount of offsite backup).
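
A rough sanity check of the memory figures above, as a minimal sketch only. It 
assumes binary units and reads the 0.5-1 PB total as the memory needed to give 
every VM 64GB; the function names are illustrative and not part of any tooling 
mentioned in this thread.

```python
# Back-of-envelope check of the memory figures quoted above.
# Assumption: binary units (1 PB = 2**50 bytes, 1 GB = 2**30 bytes).
GB = 2**30
PB = 2**50


def vms_for_memory(total_bytes, vm_bytes=64 * GB):
    """Number of VMs of size vm_bytes that total_bytes of memory corresponds to."""
    return total_bytes // vm_bytes


def hosts_needed(vm_count, vms_per_host=2):
    """Hosts required at vms_per_host VMs each, rounding up."""
    return -(-vm_count // vms_per_host)


low_vms = vms_for_memory(PB // 2)   # 0.5 PB of memory in 64GB VMs
high_vms = vms_for_memory(PB)       # 1 PB of memory in 64GB VMs
print(low_vms, high_vms)                               # prints "8192 16384"
print(hosts_needed(low_vms), hosts_needed(high_vms))   # prints "4096 8192"
```

Under those assumptions the totals correspond to roughly 8,000 to 16,000 VMs, 
and at 2 VMs per host that is several thousand physical machines.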

We would also stretch the amount of power we can get from the national grid to 
power it all - we currently have 3 feeds from different parts of the national 
grid (we are fortunately in a position where this is possible) and the dedicated 
link we would need to add more power would be at least 50 miles long!

So - managing cores/memory is vitally important to us – moving to the cloud is 
an option we are looking at – but that is more than 4 times the price of our 
onsite set-up (with substantial discounts from AWS) and would require an 
upgrade of our existing link to the internet – which is currently 40Gbit of 
data (I think).
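
To put the link size in context, here is a hedged back-of-envelope calculation 
of how long it would take to push the 30 PBytes of on-site storage mentioned 
above through a 40Gbit link. It assumes decimal units, reads "40Gbit" as 40 
Gbit per second, and assumes the link runs at full line rate with no protocol 
overhead and no other traffic, so a real transfer would take longer. The 
function name is illustrative.

```python
# Assumptions: 1 PB = 10**15 bytes, link sustained at 40 * 10**9 bits/second,
# no protocol overhead and nothing else sharing the link.
def transfer_days(data_bytes, link_bits_per_second):
    """Days needed to push data_bytes through a link at full line rate."""
    seconds = data_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds in a day


days = transfer_days(30 * 10**15, 40 * 10**9)
print(f"{days:.1f} days")  # prints "69.4 days"
```

Even in this best case it is over two months of saturating the link, which is 
the practical reason for bringing the compute to the data.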

Currently we are analysing very large amounts of data directly linked to the 
current major world problem – this is why the UK is currently being isolated, 
as we have discovered and can track a new strain in near real time. Other 
countries have no ability to do this – in a day we can and do handle, sequence 
and analyse more samples than the whole of France has sequenced since February. 
We probably don’t have more of the new variant strain than other areas of the 
world – it is just that we know we have it because of the amount of sequencing 
and analysis that we in the UK have done.

From: Matthias Peng <[email protected]>
Sent: 23 December 2020 12:02
To: mod_perl list <[email protected]>
Subject: Re: Confused about two development utils [EXT]

Today memory is not a serious problem; each of our servers has 64GB of memory.

Forgot to add - so our FCGI servers need a lot (and I mean a lot) more memory 
than the mod_perl servers to serve the same level of content (just in case 
memory blows up with FCGI backends)

-----Original Message-----
From: James Smith <[email protected]>
Sent: 23 December 2020 11:34
To: André Warnier (tomcat/perl) <[email protected]>; [email protected]
Subject: RE: Confused about two development utils [EXT]

> This costs memory, and all the more since many perl modules are not 
> thread-safe, so if you use them in your code, at this moment the only safe 
> way to do it is to use the Apache httpd prefork model. This means that each 
> Apache httpd child process has its own copy of the perl interpreter, which 
> means that the memory used by this embedded perl interpreter has to be 
> counted n times (as many times as there are Apache httpd child processes 
> running at any one time).

This isn’t quite true - if you load modules before the process forks then the 
child processes can share the same parts of memory (copy-on-write) {this is the 
case in Linux anyway}. It is useful to be able to "pre-load" core functionality 
which is used across all functions. It also speeds up child process generation 
as the modules are already in memory and converted to byte code.
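
The pre-load idea can be illustrated outside mod_perl with a small Python 
sketch (Python is used here only so the example is self-contained; it is not 
how Apache or mod_perl implement it). The table PRELOADED and the function 
serve_in_child are hypothetical stand-ins: the parent builds the data once, 
forks, and the child answers from what it inherited rather than loading it 
again. On Linux the kernel keeps those pages shared between parent and child 
until one of them writes to them.

```python
import os

# Stand-in for "core functionality" loaded once in the parent before forking.
# In mod_perl this would be modules pulled in at server startup; here it is
# just a large lookup table (an illustrative assumption).
PRELOADED = {i: i * i for i in range(100_000)}


def serve_in_child(key):
    """Fork a child that answers from the preloaded table; return its reply."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: PRELOADED is already in memory, nothing is loaded again.
        os.close(read_fd)
        os.write(write_fd, str(PRELOADED[key]).encode())
        os.close(write_fd)
        os._exit(0)
    # Parent: close our copy of the write end, read the reply, reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        reply = reader.read()
    os.waitpid(pid, 0)
    return int(reply)


if __name__ == "__main__":
    print(serve_in_child(12))  # prints "144"
```

This needs a Unix-like system, since os.fork is not available on Windows.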

One of the great advantages of mod_perl is Apache2::SizeLimit, which can blow 
away large child processes - and then if needed create new ones. This is not 
the case with some of the FCGI solutions, as the individual processes can grow 
if there is a memory leak or a request that retrieves a large amount of content 
(even if not served), but perl can't give the memory back. So FCGI processes 
only get bigger and bigger and eventually blow up memory (or hit swap first).
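
The recycling idea described above can be sketched generically as follows. 
This is not Apache2::SizeLimit's code or API, just a minimal Python 
illustration of a worker that checks its own size after each request and stops 
once it is over a limit, so that a parent process can replace it with a fresh 
one. All names are hypothetical, and it assumes Linux, where ru_maxrss is 
reported in kilobytes.

```python
import resource


def peak_rss_kb():
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def should_recycle(rss_kb, limit_kb):
    """True when the process has grown past the configured limit."""
    return rss_kb > limit_kb


def worker_loop(requests, handle, limit_kb, measure=peak_rss_kb):
    """Handle requests until the size limit is exceeded.

    Returns the number of requests handled. The size check runs only after a
    request has finished, so no request is cut off half way.
    """
    handled = 0
    for request in requests:
        handle(request)
        handled += 1
        if should_recycle(measure(), limit_kb):
            break  # a real worker would exit here and be replaced by its parent
    return handled
```

The measure parameter exists only so the loop can be exercised with fake sizes; 
a long-running FCGI-style process without any such check has no point at which 
the memory is ever handed back.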

--
 The Wellcome Sanger Institute is operated by Genome Research Limited, a 
charity registered in England with number 1021457 and a company registered in 
England with number 2742969, whose registered office is 215 Euston Road, 
London, NW1 2BE.
