Re: [Genome] parallel blat

Hiram Clawson Tue, 15 Jun 2010 17:55:12 -0700

Absolutely.  Break your target genome up into several hundred
overlapping pieces, each on the order of 5 to 10 million bases or
even smaller.  Partition your 10 million short sequences into several
hundred multi-record fasta files.  Run one job for each target genome
chunk against each query fasta file.  These are all separate processes.
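
As a rough illustration of that scheme, here is a minimal Python sketch
that cuts each target sequence into overlapping windows and emits one
blat command line per (target chunk, query file) pair.  Everything named
here is hypothetical: the function names, the file names, the default
chunk size and overlap, and the output naming.  It also assumes your
blat build accepts a target subrange written as file.2bit:seq:start-end,
as its usage message describes; please check that against your version.
The blat options are simply the ones from the commands quoted below.

```python
from itertools import product
from pathlib import Path


def overlapping_windows(length, size, overlap):
    """Yield half-open (start, end) windows that cover [0, length).

    Each window is at most `size` bases long, and consecutive windows
    share `overlap` bases.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if length <= 0:
        return
    step = size - overlap
    start = 0
    while True:
        end = min(start + size, length)
        yield start, end
        if end >= length:
            break
        start += step


def blat_jobs(chrom_sizes, query_files, two_bit,
              size=5_000_000, overlap=10_000):
    """Return one blat command line per (target chunk, query file) pair.

    chrom_sizes maps sequence name to length.  The 2bit subrange syntax
    in the command is an assumption about the installed blat.
    """
    chunks = [
        (chrom, start, end)
        for chrom, length in chrom_sizes.items()
        for start, end in overlapping_windows(length, size, overlap)
    ]
    jobs = []
    for (chrom, start, end), query in product(chunks, query_files):
        # Hypothetical output name: target chunk plus query file stem.
        out = f"{chrom}_{start}_{end}.{Path(query).stem}.psl"
        jobs.append(
            "blat -t=dna -q=dna -tileSize=11 -stepSize=5 "
            f"{two_bit}:{chrom}:{start}-{end} {query} {out}"
        )
    return jobs


if __name__ == "__main__":
    # Toy sizes so the output stays short; real chunks would be megabases.
    for line in blat_jobs({"chrA": 25}, ["q0.fa", "q1.fa"], "toy.2bit",
                          size=10, overlap=2):
        print(line)
```

Each printed line is an independent job, so the list can be handed to
whatever batch or cluster scheduler you have.  A hit that falls inside
an overlap region can show up in the output of two neighboring chunks,
so the per-job results need to be merged afterwards.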

Please note that blat is not necessarily the best tool for short
sequence alignment; other tools are much better suited to it.
See also:

http://en.wikipedia.org/wiki/List_of_sequence_alignment_software

--Hiram

----- Original Message -----
From: "Peng Yu" <[email protected]>
To: "Hiram Clawson" <[email protected]>
Cc: [email protected]
Sent: Tuesday, June 15, 2010 5:24:01 PM GMT -08:00 Tijuana / Baja California
Subject: Re: [Genome] parallel blat

I'm not sure I follow what you described, although I thought I understood.

Suppose I have 10 million short sequences to be aligned to the human
genome. It makes sense to split the 10 million sequences into 10
files (1 million each). Then I run 10 blat commands simultaneously.
Each blat command will load all the chromosomes. Are you suggesting
that I break all_human_chromosomes.list into a number of smaller lists?

blat -t=dna -q=dna -tileSize=11 -stepSize=5 \
    all_human_chromosomes.list short_seq0.fa short_seq0.psl
...
blat -t=dna -q=dna -tileSize=11 -stepSize=5 \
    all_human_chromosomes.list short_seq9.fa short_seq9.psl

On Tue, Jun 15, 2010 at 7:14 PM, Hiram Clawson <[email protected]> wrote:
> No, this is not what I described.  Only a tiny portion of the
> target genome is loaded and a tiny portion of the query genome
> is loaded.  Nothing is duplicated between processes.  We regularly
> do this with genomes here and can get perhaps 100,000 processes
> running on a 1,000 CPU core supercomputer and get the complete
> genome-to-genome alignment done in a few hours.  This is much
> simpler and more efficient than trying to write a complicated
> parallel functional program that would be difficult to operate
> on a variety of operating systems.  The operating system
> itself is optimized to manage the separate threads of the
> individual processes that it manages.  We don't have to
> duplicate that complication.

-- 
Regards,
Peng

_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
