On 03-Feb-2019 04:18, Anandkumar Surendrarao wrote:

- I want to search with FASTA formatted DNA sequence queries that are <=
   100nt in length,
   - against FASTA formatted genomic DNA sequence,
   - for global end-to-end matches  (using Needleman Wunsch type global
   search, not Smith Waterman type local matching)
   - but allowing >= 80% identity and >= 80% of query length coverage

Not sure about needle, except that I expect it would be very slow.

However, I made a modified version of Lastal that added command line parameters for identity/query/target cutoff for almost exactly the same application. It is available here in the "modified programs" directory.

  https://sourceforge.net/projects/lemonade-assemble/

There is a Centos 6 64 bit binary there, for any other platform you will need to recompile. Lastal does not have NW exactly, but it does have a -T parameter which does something which in many cases is similar. -T 0 is a local alignment, and -T 1 "extends to the end" (starting from the local alignment). My use case involved Illumina reads, and what would happen is if there was a mismatch at 3 or 4 bp from the end of that read in the alignment against the target (in this case mRNA transcripts) -T 0 would truncate the alignment at the position before the mismatch, so the read would not be seen as a full length match. With -T 1 it would pick up the extra matching bases. Anyway, let's say you have

  reads.fasta   # bunch of 100bp reads
  genome.fasta  # self explanatory

CPUS=40
lastdb -P $CPUS -w 10 genome_10 genome.fasta
lastal -P $CPUS -T1 -I80 -Y80 -m250 -O50 -8 1 -9 1 \
  -f BlastTab genome_10 reads.fasta > results.txt 2>errors.log


The lastdb command makes a database but uses only every 10th position. This will make the database smaller and the search much faster. These positions are only used as seeds, so if the expected match is very good -w10 gives the same results as -w1 (default). If the match is expected to be poor use -w 1 and expect much longer run times. Set CPUS to match your system.

The lastal command looks for 80% identity (-I80), 80% coverage on query (-Y80), allows up to 250 overlapping matches (-m250, see below), allows offset overlap alignment as long as the overlap is at least 50bp (-O50), requires overlap alignments to go right to the ends (-8 1 -9 1), and -f BlastTab gives a blast tabular format output file (except when it is on the other strand the query positions are flipped rather than the target). Offset overlap is like this:

    -------------------------
              |||||||||||||||  <- -O50 == this must be >= 50bp
              ------------------------------

This only concerns alignments which overlap the ends. The -m parameter is a standard lastal command line option:

  -m: maximum initial matches per query position (10)

What it means is that if there are a lot of alignments in exactly the same place by default one only retrieves a few of them. Increasing the value to 250 means many more will be retrieved. Conceivably if there is super high coverage you might want to use an even larger number. Best to first try a small subset to determine the appropriate -m value for your case.

I don't know if it would be faster if the query/database were the other way around.

Regards,

David Mathog
[email protected]
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
EMBOSS mailing list
[email protected]
http://mailman.open-bio.org/mailman/listinfo/emboss

Reply via email to