Hi Bremen,

We use lastz as a drop-in replacement for blastz, so we run it with the same parameters we used for blastz (usually H, Y, L, K and Q, whose values are set in the DEF file quoted in makeDb/doc/$db.txt).

I can't estimate the memory usage, but I suggest breaking up your query into N groups of sequences and your target into M groups, then running lastz N*M times so that each query group is aligned to every target group. Finding values of N and M that suit your system and sequences may take some experimentation.
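
To make the N*M idea concrete, here is a minimal Python sketch of how the job list could be planned. It is only an illustration, not what our pipeline does: the helper names (split_into_groups, plan_jobs) and the round-robin grouping are assumptions, and it only produces the list of (query group, target group) pairs. Writing each group to a FASTA file and building the actual lastz command lines is left out.

```python
from itertools import product


def split_into_groups(names, n_groups):
    """Deal sequence names round-robin into at most n_groups lists.

    Hypothetical helper: round-robin is just one simple way to group;
    grouping by total sequence length may balance the jobs better.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    groups = [names[i::n_groups] for i in range(n_groups)]
    # Drop empty groups, which appear when there are fewer names than groups.
    return [g for g in groups if g]


def plan_jobs(query_names, target_names, n, m):
    """Return one (query_group, target_group) pair per planned lastz run."""
    query_groups = split_into_groups(query_names, n)
    target_groups = split_into_groups(target_names, m)
    # Every query group is paired with every target group: N*M jobs.
    return list(product(query_groups, target_groups))


if __name__ == "__main__":
    queries = [f"q{i}" for i in range(6)]   # stand-ins for query sequence names
    targets = [f"t{i}" for i in range(4)]   # stand-ins for target sequence names
    for q_group, t_group in plan_jobs(queries, targets, n=3, m=2):
        print(q_group, "vs", t_group)
```

Each pair corresponds to one independent lastz run, so raising N or M lowers the amount of sequence each run has to handle at the cost of more jobs.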

We have a large compute cluster, and break up sequences into chunks of 10 Mbases for target and from 10 up to >400 Mbases for query. We chop up sequences larger than 10 Mbases into chunks that overlap by 10000 bases, and concatenate small sequences together to get a total of ~10Mbases. Chopping up large sequences means that the resulting coordinates have to be fixed after the run -- nice to avoid that if you can.

Each job runs on a 1.9GHz core sharing 4G RAM and 10G swap with another core. I don't know how much memory the jobs use (only that they don't run out of memory :), but here are some runtime stats from recent alignments with human as query, and sequences stored on local disk:

human vs. stickleback
---------------------
target size:      3.1Gbases (338 chunks of ~10Mbases)
query size:       4.6Gbases (58 chunks of ~10Mbases (some chunks much smaller))
number of jobs:   19604
average run time: <2min
max run time:     34min
total cpu time:   511hr
total i/o time:   61hr

human vs. tetraodon
-------------------
target size:      3.1Gbases (338 chunks of ~10Mbases)
query size:       4Gbases (one chunk)
number of jobs:   338
average run time: 56min
max run time:     82min
total cpu time:   314hr
total i/o time:   2hr

The genome sizes, and the percentage of repetitive sequence soft-masked in each, are pretty comparable. The total output file size was about 50% larger for stickleback than for tetraodon. I can only speculate about the differences in total cpu time and i/o usage... YMMV!

Hope that helps,
Angie

----- "Bremen Braun" <[email protected]> wrote:

> From: "Bremen Braun" <[email protected]>
> To: [email protected]
> Sent: Wednesday, February 3, 2010 8:43:48 AM GMT -08:00 US/Canada Pacific
> Subject: [Genome] LASTZ pipeline
>
> Hi all,
> I'm curious as to how the pipeline for aligning two chromosomes has changed
> with the replacement of blastz with lastz. I notice lastz has quite a bit
> more functionality than blastz, such as the ability to perform chaining. Are
> there additional tools you recommend for filtering the output of lastz?
>
> When using lastz for whole genome alignments, can you intuitively estimate
> the expected memory usage for running the following command on a query of
> 184,000 sequences with each sequence having an average size of ~1000 bases
> and a target of 1,200 sequences with each sequence having an average length
> of 845,000 bases (expected % identity ~95)?
> lastz query.fasta[multiple] target.fasta --chain --gapped --ambiguous=n
> --format=sam --output=query-target_aligned.sam
>
> Upon running, I get this message after some time:
> FAILURE: call to realloc failed to allocate 135051136 bytes, for
> add_segment
> I am running this on a 32bit Linux virtual machine which has 3.5GB of memory
> allocated to it. Are there different recommended parameters when dealing
> with whole genome alignment vs. interspecies chromosomal alignment?
>
> Thanks,
> --
> Bremen Braun
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
