On 01/20/11 10:36, Stephen Taylor wrote:
On 20/01/2011 09:36, Peter Rice wrote:
Possibly scope to do more there. What would you like to see in SAM
output for fuzznuc?

My motivation was to build BAM tracks showing matches of lots of
patterns in the genome sequence. I hadn't thought about proteins but I
guess you could do something similar.

The SAM file would show the position of each match per line and the
CIGAR string containing the matched pattern and SEQ (col 10) containing
the query pattern expanded to show the match. The original pattern could
be in the OPT field.
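
To make the proposed layout concrete, here is a minimal sketch in Python of one SAM line per match. It is not fuzznuc output and not an EMBOSS API: the function name sam_line, the use of a plain "<length>M" CIGAR for an ungapped match, and the XP:Z: tag for the original pattern are all assumptions for illustration (the SAM spec leaves tags starting with X free for local use).

```python
def sam_line(name, ref, start0, matched, pattern, strand="+"):
    """Build one SAM alignment line for a pattern match (illustrative only).

    name    -- hypothetical identifier for the hit (QNAME)
    ref     -- name of the sequence that was searched (RNAME)
    start0  -- 0-based start of the match; SAM POS is 1-based
    matched -- the sequence text that the pattern matched (SEQ, col 10)
    pattern -- the original pattern, stored in an assumed XP:Z: tag
    """
    flag = 16 if strand == "-" else 0   # 16 = reverse strand
    fields = [
        name,
        str(flag),
        ref,
        str(start0 + 1),                # convert to 1-based POS
        "255",                          # MAPQ 255 = not available
        f"{len(matched)}M",             # assumed CIGAR: ungapped match
        "*", "0", "0",                  # RNEXT, PNEXT, TLEN unused
        matched,
        "*",                            # no base qualities
        f"XP:Z:{pattern}",              # user-defined optional field
    ]
    return "\t".join(fields)


print(sam_line("hit1", "chr1", 99, "CCCCCTGCAAC", "[CG](5)TG{A}N(1,5)C"))
```

A real implementation would also need an @SQ header line for each reference sequence before the file could be converted to BAM.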

Interesting. The sequence becomes the reference. We would have to do a little extra work to generate the CIGAR string for various patterns, but that should be possible by modifying the pattern matching code.

I see there is a tag for Mismatching positions (MD)
which would work for regex style matches (so good for 'dreg'), but I am
not sure it would be strictly legal for a PROSITE like pattern.

e.g. for [CG](5)TG{A}N(1,5)C

Could you have

MD:Z:[CG](5)TG{A}N(1,5)C

That will need some investigation. Maybe PROSITE patterns can be translated to regex for this purpose; many will convert easily.

It looks like {,} is not allowed. So perhaps you would have to translate
the pattern to a regex or generate an alternative optional tag. I am not
a SAM expert so apologies if I am proposing to violate the format rules!

N(1,5) is equivalent to NN?N?N?N? ... though PROSITE ranges can go over 100 positions.
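
As a rough illustration of the translation, here is a small Python sketch that converts the simple pattern elements used in this thread into a regular expression. It is not the EMBOSS pattern code: the function name prosite_to_regex is hypothetical, it uses the regex range form {m,n} rather than spelling out optional positions, it treats N as "any character", and it ignores anchors and other PROSITE syntax.

```python
import re

# One pattern element: [set], {excluded set} or a single letter,
# optionally followed by a repeat count (n) or range (m,n).
_ELEMENT = re.compile(
    r"(\[[A-Za-z]+\]|\{[A-Za-z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern, wildcard="N"):
    """Translate a simple PROSITE-style pattern into a regex string.

    wildcard is the letter meaning 'any residue' (assumed N for
    nucleotides; protein patterns would use X).
    """
    pattern = pattern.replace("-", "")   # PROSITE separators carry no meaning here
    out = []
    pos = 0
    while pos < len(pattern):
        m = _ELEMENT.match(pattern, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at position {pos}: {pattern!r}")
        elem, lo, hi = m.groups()
        if elem.startswith("{"):
            atom = "[^" + elem[1:-1] + "]"   # {A} means anything except A
        elif elem == wildcard:
            atom = "."
        else:
            atom = elem                      # literal letter or [set]
        if lo is not None:
            atom += "{" + lo + ("," + hi if hi else "") + "}"
        out.append(atom)
        pos = m.end()
    return "".join(out)


regex = prosite_to_regex("[CG](5)TG{A}N(1,5)C")
print(regex)
```

The resulting string could then go into a user-defined optional tag instead of the original pattern, if the braces turn out to be a problem.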

Incidentally, I would use dreg but it doesn't allow mismatches to be
easily specified.

True, that's a regular expression library issue.

regards,

Peter
_______________________________________________
EMBOSS mailing list
EMBOSS@lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/emboss
