Re: [Genome] LASTZ pipeline

Angie Hinrichs Wed, 03 Feb 2010 12:27:10 -0800

Hi Bremen,

We use lastz as a drop-in replacement for blastz, so we run lastz with the same 
parameters as blastz (usually H, Y, L, K, Q, whose values are set in the DEF 
file quoted in makeDb/doc/$db.txt).  I can't estimate the memory usage, but I 
suggest breaking up your query into N groups of sequences and your target into 
M groups, then running lastz N*M times so that each query group is aligned to 
every target group.  Which N and M are appropriate for your system and 
sequences may take some experimentation.
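
A minimal sketch of that N*M scheme in Python, not the UCSC pipeline itself. 
The function names, the greedy packing rule and the toy lengths are all 
assumptions made for illustration; it only enumerates the (target group, 
query group) pairs and leaves the actual lastz invocation to you.

```python
from itertools import product


def group_by_size(seq_lengths, max_bases):
    """Greedily pack (name, length) pairs into groups of at most max_bases.

    A sequence longer than max_bases ends up in a group of its own.
    """
    groups, current, total = [], [], 0
    for name, length in seq_lengths:
        # Start a new group when adding this sequence would overflow the cap.
        if current and total + length > max_bases:
            groups.append(current)
            current, total = [], 0
        current.append(name)
        total += length
    if current:
        groups.append(current)
    return groups


def job_pairs(target_groups, query_groups):
    """One job per (target group index, query group index): M*N jobs in total."""
    return list(product(range(len(target_groups)), range(len(query_groups))))


# Toy example with made-up sequence names and lengths.
targets = group_by_size([("t1", 4), ("t2", 5), ("t3", 3), ("t4", 12)], max_bases=10)
queries = group_by_size([("q1", 6), ("q2", 6)], max_bases=10)
jobs = job_pairs(targets, queries)
print(len(targets), "target groups x", len(queries), "query groups =", len(jobs), "jobs")
```

Smaller groups mean less memory per job but more jobs, so the cap is the knob 
to experiment with.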


We have a large compute cluster, and break up sequences into chunks of 10 
Mbases for target and anywhere from 10 to >400 Mbases for query.  We chop up 
sequences larger than 10 Mbases into chunks that overlap by 10000 bases, and 
concatenate small sequences together to get a total of ~10Mbases.  Chopping up 
large sequences means that the resulting coordinates have to be fixed after 
the run, which is nice to avoid if you can.
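
To illustrate the chopping and the coordinate fix-up, here is a rough Python 
sketch. It assumes 0-based, half-open coordinates and that each chunk is at 
most chunk_size long with consecutive chunks sharing `overlap` bases; the 
function names are hypothetical and the real pipeline may lay out its chunks 
differently.

```python
def chunk_ranges(seq_len, chunk_size=10_000_000, overlap=10_000):
    """Return half-open (start, end) ranges covering [0, seq_len).

    Each range is at most chunk_size long; consecutive ranges share
    `overlap` bases.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    ranges = []
    start = 0
    while True:
        end = min(start + chunk_size, seq_len)
        ranges.append((start, end))
        if end == seq_len:
            break
        start += step
    return ranges


def lift_to_sequence(chunk_start, pos_in_chunk):
    """Translate a position reported within a chunk back to the full sequence."""
    return chunk_start + pos_in_chunk


# Small numbers so the layout is easy to see: 25 bases, chunks of 10, overlap 2.
print(chunk_ranges(25, chunk_size=10, overlap=2))
# An alignment starting at offset 3 of the chunk that begins at base 8:
print(lift_to_sequence(8, 3))
```

The fix-up is just adding the chunk's start offset back to every coordinate, 
but it has to be applied consistently to every alignment record, which is why 
unchopped sequences are easier to work with.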

Each job runs on a 1.9GHz core sharing 4G RAM and 10G swap with another core.  
I don't know how much memory the jobs use (only that they don't run out of 
memory :), but here are some runtime stats from recent alignments with human as 
target, and sequences stored on local disk:

human vs. stickleback
---------------------
target size: 3.1Gbases (338 chunks of ~10Mbases)
query size: 0.46Gbases (58 chunks of ~10Mbases, some much smaller)
number of jobs: 19604
average run time: <2min
max run time: 34min
total cpu time: 511hr
total i/o time:  61hr

human vs. tetraodon
-------------------
target size: 3.1Gbases (338 chunks of ~10Mbases)
query size: 0.4Gbases (one chunk)
number of jobs: 338
average run time: 56min
max run time: 82min
total cpu time: 314hr
total i/o time:   2hr

The genome sizes and the percentages of repetitive sequence soft-masked in 
each are pretty comparable.  The total output file size was about 50% larger 
for stickleback than for tetraodon.  I can only speculate about the differences 
in total cpu time and i/o usage... YMMV!
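
As a quick sanity check on the stats above, the job counts are just target 
chunks times query chunks, and the average run times are roughly total cpu 
time divided by number of jobs. The short Python sketch below redoes that 
arithmetic; treating average run time as cpu time per job (ignoring i/o) is an 
assumption, and the helper name is made up.

```python
def summarize(target_chunks, query_chunks, total_cpu_hours):
    """Return (number of jobs, average cpu minutes per job)."""
    jobs = target_chunks * query_chunks
    avg_minutes = total_cpu_hours * 60 / jobs
    return jobs, avg_minutes


# Chunk counts and cpu hours taken from the two runs listed above.
runs = [("stickleback", 338, 58, 511), ("tetraodon", 338, 1, 314)]
for name, target_chunks, query_chunks, cpu_hours in runs:
    jobs, avg = summarize(target_chunks, query_chunks, cpu_hours)
    print(f"{name}: {jobs} jobs, ~{avg:.1f} min average")
# stickleback: 19604 jobs, ~1.6 min average
# tetraodon: 338 jobs, ~55.7 min average
```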

Hope that helps,

Angie


----- "Bremen Braun" <[email protected]> wrote:

> From: "Bremen Braun" <[email protected]>
> To: [email protected]
> Sent: Wednesday, February 3, 2010 8:43:48 AM GMT -08:00 US/Canada Pacific
> Subject: [Genome] LASTZ pipeline
>
> Hi all,
> I'm curious as to how the pipeline for aligning two chromosomes has changed
> with the replacement of blastz with lastz. I notice lastz has quite a bit
> more functionality than blastz, such as the ability to perform chaining. Are
> there additional tools you recommend for filtering the output of lastz?
> 
> When using lastz for whole genome alignments, can you intuitively estimate
> the expected memory usage for running the following command on a query of
> 184,000 sequences with each sequence having an average size of ~1000 bases
> and a target of 1,200 sequences with each sequence having an average length
> of 845,000 bases (expected % identity ~95)?
>     lastz query.fasta[multiple] target.fasta --chain --gapped --ambiguous=n 
> --format=sam --output=query-target_aligned.sam
> 
> Upon running, I get this message after some time:
>     FAILURE: call to realloc failed to allocate 135051136 bytes, for 
> add_segment
> I am running this on a 32bit Linux virtual machine which has 3.5GB of memory
> allocated to it. Are there different recommended parameters when dealing
> with whole genome alignment vs. interspecies chromosomal alignment?
> 
> Thanks,
> -- 
> Bremen Braun
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
