Absolutely. Break your target genome up into several hundred overlapping pieces, each on the order of 5 to 10 million bases, or even smaller. Partition your 10 million short sequences into several hundred multi-record FASTA files. Run one job for each target genome chunk against each query FASTA file. These are all separate processes.
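As a rough illustration of the chunk-by-chunk job layout described above, here is a minimal Python sketch that enumerates overlapping target windows and pairs each one with every query file. The names `chunk_windows`, `job_list`, `hg.2bit`, the 10,000-base overlap, and the example chromosome size are all assumptions made for illustration; none of them come from this thread. The `file.2bit:seq:start-end` target form is how I understand blat's subrange syntax, so please check it against the usage message of your blat build before relying on it.

```python
from itertools import product

def chunk_windows(chrom_sizes, chunk_size=5_000_000, overlap=10_000):
    """Yield (chrom, start, end) windows; neighbours share `overlap` bases."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    for chrom, size in chrom_sizes.items():
        start = 0
        while start < size:
            end = min(start + chunk_size, size)
            yield chrom, start, end
            if end == size:
                break
            start += step

def job_list(chrom_sizes, query_files, two_bit="hg.2bit"):
    """Yield one blat command line per (target window, query file) pair."""
    windows = chunk_windows(chrom_sizes)
    for (chrom, start, end), query in product(windows, query_files):
        out = f"{chrom}_{start}_{end}.{query}.psl"
        # Subrange target spec (file.2bit:seq:start-end) is an assumption here.
        target = f"{two_bit}:{chrom}:{start}-{end}"
        yield (
            "blat -t=dna -q=dna -tileSize=11 -stepSize=5 "
            f"{target} {query} {out}"
        )

if __name__ == "__main__":
    # Hypothetical 12 Mb sequence and two query files, for illustration only.
    for cmd in job_list({"chrA": 12_000_000}, ["short_seq0.fa", "short_seq1.fa"]):
        print(cmd)
```

Each printed line is an independent command, so the list can be handed to whatever batch or cluster scheduler is available. The resulting psl files would still need their coordinates interpreted relative to the whole sequence and duplicates from the overlap regions reconciled.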
Please note, blat is not necessarily the best tool for short sequence alignment. There are other, much better tools for that purpose. See also:
http://en.wikipedia.org/wiki/List_of_sequence_alignment_software

--Hiram

----- Original Message -----
From: "Peng Yu" <[email protected]>
To: "Hiram Clawson" <[email protected]>
Cc: [email protected]
Sent: Tuesday, June 15, 2010 5:24:01 PM GMT -08:00 Tijuana / Baja California
Subject: Re: [Genome] parallel blat

I'm not sure what you described, although I thought I understood.

Suppose I have 10 million short sequences to be aligned to the human genome. It makes sense to split the 10 million sequences into 10 files (1 million each). Then I run 10 blat commands simultaneously. Each blat command will load all the chromosomes. Are you suggesting breaking all_human_chromosomes.list into a number of smaller lists?

blat -t=dna -q=dna -tileSize=11 -stepSize=5 all_human_chromosomes.list short_seq0.fa short_seq0.psl
...
blat -t=dna -q=dna -tileSize=11 -stepSize=5 all_human_chromosomes.list short_seq9.fa short_seq9.psl

On Tue, Jun 15, 2010 at 7:14 PM, Hiram Clawson <[email protected]> wrote:
> No, this is not what I describe. Only the tiny portion of the
> target genome is loaded and the tiny portion of the query genome
> is loaded. Nothing is duplicated between processes. We regularly
> do this with genomes here and can get perhaps 100,000 processes
> running on a 1,000 CPU core super computer and get the complete
> genome to genome alignment done in a few hours. This is much
> simpler and more efficient than trying to write a complicated
> parallel functional program that would be difficult to operate
> in a variety of operating systems. The operating system
> itself is optimized to manage the separate threads of the
> individual processes that it manages. We don't have to
> duplicate that complication.
--
Regards,
Peng
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
