Re: [EMBOSS] run parameters for inexact matching with EMBOSS needle

David Mathog Mon, 04 Feb 2019 09:57:53 -0800

On 03-Feb-2019 04:18, Anandkumar Surendrarao wrote:

- I want to search with FASTA formatted DNA sequence queries thatare <=

   100nt in length,
   - against FASTA formatted genomic DNA sequence,
   - for global end-to-end matches  (using Needleman Wunsch type global
   search, not Smith Waterman type local matching)
   - but allowing >= 80% identity and >= 80% of query length coverage


Not sure about needle, except that I expect it would be very slow.

However, I made a modified version of Lastal that added command lineparameters for identity/query/target cutoff for almost exactly the sameapplication. It is available here in the "modified programs" directory.


  https://sourceforge.net/projects/lemonade-assemble/

There is a Centos 6 64 bit binary there, for any other platform you willneed to recompile. Lastal does not have NW exactly, but it does have a-T parameter which does something which in many cases is similar. -T 0is a local alignment, and -T 1 "extends to the end" (starting from thelocal alignment). My use case involved Illumina reads, and what wouldhappen is if there was a mismatch at 3 or 4 bp from the end of that readin the alignment against the target (in this case mRNA transcripts) -T 0would truncate the alignment at the position before the mismatch, so theread would not be seen as a full length match. With -T 1 it would pickup the extra matching bases. Anyway, let's say you have


  reads.fasta   # bunch of 100bp reads
  genome.fasta  # self explanatory

CPUS=40
lastdb -P $CPUS -w 10 genome_10 genome.fasta
lastal -P $CPUS -T1 -I80 -Y80 -m250 -O50 -8 1 -9 1 \
  -f BlastTab genome_10 reads.fasta > results.txt 2>errors.log

The lastdb command makes a database but uses only every 10th position.This will make the database smaller and the search much faster. Thesepositions are only used as seeds, so if the expected match is very good-w10 gives the same results as -w1 (default). If the match is expectedto be poor use -w 1 and expect much longer run times. Set CPUS to matchyour system.

The lastal command looks for 80% identity (-I80), 80% coverage on query(-Y80), allows up to 250 overlapping matches (-m250, see below), allowsoffset overlap alignment aslong as the overlap is at least 50bp (-O50), requires overlap alignmentsto go rightto the ends (-8 1 -9 1), and -f BlastTab gives a blast tabular formatoutput file (except when it is on the other strand the query positionsare flipped rather than the target). Offset overlap is like this:


    -------------------------
              |||||||||||||||  <- -O50 == this must be >= 50bp
              ------------------------------

This only concerns alignments which overlap the ends. The -m parameteris a standard lastal command line option:


  -m: maximum initial matches per query position (10)

What it means is that if there are a lot of alignments in exactly thesame place by default one only retrieves a few of them. Increasing thevalue to 250 means many more will be retrieved. Conceivably if there issuper high coverage you might want to use an even larger number. Bestto first try a small subset to determine the appropriate -m value foryour case.

I don't know if it would be faster if the query/database were the otherway around.


Regards,

David Mathog
[email protected]
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
EMBOSS mailing list
[email protected]
http://mailman.open-bio.org/mailman/listinfo/emboss

Re: [EMBOSS] run parameters for inexact matching with EMBOSS needle

Reply via email to