[EMBOSS] Many-to-many with needle and water
Hi Peter R. et al,

I gather EMBOSS is looking for feedback for new applications (given the recent funding from the BBSRC - congratulations again). How about suggestions for extensions to existing EMBOSS applications?

I've used bits of EMBOSS for several years now (thank you!). Something I have sometimes wanted to do is a many-to-many pairwise sequence alignment with the EMBOSS tools needle and water.

Right now, needle and water take two files (here referred to as A and B): file A has just one sequence, and file B can have one or more sequences. I'd like to be able to supply two files, both with multiple entries, and have needle/water do pairwise alignments between all the sequences in A and all the sequences in B. This might be useful for finding reciprocal best hits in comparative genomics (as a slower but exact alternative to FASTA or BLAST).

From an implementation point of view, I might imagine doing sequence A1 against all of B, then sequence A2 against all of B, etc. This would require looping over file B many times (easy if on disk). This would also work if the A input was stdin, but having the B input on stdin would require caching the data if A has more than one sequence :(

It may sometimes also be useful to have an all-against-all pairwise comparison for a single set of sequences. The above suggested enhancement would let you do this by comparing file A to file A. However, here you only really need to do half the possible combinations (as aligning sequence A1 to sequence A2 should be the same as A2 to A1). This could be useful for implementing a basic clustering algorithm, or maybe as part of a worked example in building a simple NJ tree?

So, does supporting many-to-many comparisons sound like a useful enhancement to needle and water?

I should stress this isn't something I need right now. Also, it can be worked around with a wrapper script to call needle/water once for each sequence in file A (against all the sequences in file B), with the added bonus that these one-to-many comparisons can then be shared across multiple CPU cores.

Regards,

Peter C.
___
EMBOSS mailing list
EMBOSS@lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/emboss
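The wrapper-script workaround described in this message could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the function names, the temporary file naming scheme and the gap penalties are all made up for the example, and only the needle qualifiers themselves (-asequence, -bsequence, -gapopen, -gapextend, -outfile, -auto) are taken from EMBOSS.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def split_fasta(text):
    """Return the text of each FASTA record in `text`, one string per record."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append(line + "\n")
        elif records and line.strip():
            records[-1] += line + "\n"
    return records


def needle_command(a_path, b_path, out_path, gapopen=10.0, gapextend=0.5):
    """Build one one-to-many needle call: a single sequence in A against all of B."""
    return ["needle", "-asequence", a_path, "-bsequence", b_path,
            "-gapopen", str(gapopen), "-gapextend", str(gapextend),
            "-outfile", out_path, "-auto"]


def plan_jobs(a_fasta_text, b_path, workdir="."):
    """Return (a_path, record_text, command) for every sequence in file A."""
    jobs = []
    for index, record in enumerate(split_fasta(a_fasta_text), start=1):
        a_path = f"{workdir}/a_{index}.fasta"      # hypothetical naming scheme
        out_path = f"{workdir}/a_{index}.needle"
        jobs.append((a_path, record, needle_command(a_path, b_path, out_path)))
    return jobs


def run_jobs(jobs, workers=4):
    """Write each single-sequence file, then run the needle calls in parallel."""
    for a_path, record, _command in jobs:
        with open(a_path, "w") as handle:
            handle.write(record)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda job: subprocess.run(job[2], check=True), jobs))


def unordered_pairs(identifiers):
    """All-against-all within one set needs each unordered pair only once."""
    return list(combinations(identifiers, 2))
```

Planning the jobs is kept separate from running them so that the command lines can be inspected (or handed to a cluster scheduler) without needing EMBOSS installed; swapping "needle" for "water" in needle_command should be the only change needed for local alignments.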
Re: [EMBOSS] Many-to-many with needle and water
On Mon, Jul 6, 2009 at 11:35 AM, Peter Rice wrote:
>
> Peter C wrote:
> > Hi Peter R. et al,
> >
> > I gather EMBOSS is looking for feedback for new applications (given
> > the recent funding from the BBSRC - congratulations again). How about
> > suggestions for extensions to existing EMBOSS applications?
> >
> > I've used bits of EMBOSS for several years now (thank you!). Something
> > I have sometimes wanted to do is a many-to-many pairwise sequence
> > alignment with the EMBOSS tools needle and water.
> >
> > Right now, needle and water take two files (here referred to as A and
> > B), file A has just one sequence, and file B can have one or more
> > sequences. I'd like to be able to supply two files both with multiple
> > entries, and have needle/water do pairwise alignments between all the
> > sequences in A against all the sequences in B. This might be useful
> > for finding reciprocal best hits in comparative genomics (as a slower
> > but exact alternative to FASTA or BLAST).
>
> The application is easy to add (after the release)
>
> The usual problem with all-against-all is that it involves loading one
> of the inputs as a sequence set entirely in memory - to avoid reading
> one input many times over.

Right - and it would be difficult to decide whether holding the sequences in memory or reading the file many times is best in general without some specific use cases. [I suppose you could do something a bit more cunning, like start by caching the sequences as you read them, ready for re-use, but if the number of sequences crosses a threshold, stop caching and switch to re-reading the file for subsequent loops?]

> We have an application supermatcher which does this - the first sequence
> is streamed through, the second is a sequence set loaded into memory. It
> uses word matching to find seed alignments then runs a limited alignment
> around the hits.
>
> superwater would be a possible name (or superneedle).

If you see many-to-many versions of water and needle as separate applications, then those names sound fine.

> How popular would such a program be?

I don't know - as I said, this is more of a suggestion than a request. I don't *need* this tool, but there have been occasions in the past where I would have tried using it if it had existed. Perhaps others on the list can think of better uses for this tool idea?

> How large would the smaller input set be?

Hard to say without specific examples in mind. For some hand waving upper limits, for comparative genomics of bacteria using protein sequences, you might have a few thousand in each file. If I was trying this as part of an ad-hoc clustering algorithm (all-against-all), again maybe a few thousand sequences.

In practice, a heuristic tool like supermatcher (or FASTA or BLAST) would probably be more sensible for large datasets like this due to the computational time. I see needle and water as most useful on smaller datasets where the runtime cost of using an exact algorithm isn't too high. Therefore many-to-many needle/water searches may be best targeted at smaller sequence files. Things might be different with a multicore or GPU/OpenCL version of needle and water ;)

Anyway, unless someone else thinks a many-to-many version of needle and water would be useful, I wouldn't expect you to implement this. I'm just putting the idea forward for discussion.

Regards,

Peter C.
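The "cache until a threshold, then fall back to re-reading" idea floated in this message can be sketched in a few lines of Python. This is not how EMBOSS or supermatcher work; the class name, the max_cached parameter and the read_records callable are all hypothetical, chosen only to make the trade-off concrete.

```python
class ThresholdCache:
    """Iterate over a record source repeatedly, caching only small inputs.

    `read_records` is a zero-argument callable that returns a fresh iterator
    over the records (for example, one that re-opens and re-parses file B).
    """

    def __init__(self, read_records, max_cached=1000):
        self._read_records = read_records
        self._max_cached = max_cached
        self._cache = None        # filled after a complete pass that stayed small
        self._gave_up = False     # set once the input proved too large to cache
        self.passes_from_source = 0

    def __iter__(self):
        if self._cache is not None:
            yield from self._cache          # serve later loops from memory
            return
        self.passes_from_source += 1
        kept = None if self._gave_up else []
        for record in self._read_records():
            if kept is not None:
                kept.append(record)
                if len(kept) > self._max_cached:
                    kept = None             # crossed the threshold: stop caching
                    self._gave_up = True
            yield record
        if kept is not None:
            self._cache = kept
```

Each sequence from file A would then loop over one shared ThresholdCache for file B: small B files are read once, large ones are re-read on every loop without ever holding more than max_cached + 1 records in memory.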
[EMBOSS] Probabilistic versions of needle/water?
Hi all,

I have another suggestion for new or enhanced EMBOSS applications, again related to the existing pairwise sequence alignment tools needle and water.

The FASTQ file format (or others) contains quality scores (often PHRED scores) representing the probability of an error in the associated nucleotide. Solexa/Illumina machines also provide another file with a more precise breakdown of the likelihood of each of the four bases. In some cases both sequences could have probability scores (e.g. trying to align the ends of contigs to each other), but often one sequence will be taken as fact (e.g. mapping reads onto a reference).

It is possible to take these probabilities into account when considering the matches in needle (or water) by using a probabilistic version of the Needleman‐Wunsch sequence alignment algorithm (or a probabilistic Smith-Waterman). As an example of this idea, did you (Peter R) see the GNUMAP talk/poster at ISMB 2009? See http://dna.cs.byu.edu/gnumap/

I am aware of people using EMBOSS tools (I assume water) to identify (known) adaptor sequences in raw Solexa/Illumina data. I considered doing something similar myself when trying to remove primer sequences from 454 data. Such a pipeline using the current EMBOSS water would be doing this matching at a purely fixed nucleotide level (ignoring the qualities), which isn't ideal. Upgrading to a probabilistic version of water should be an improvement.

Peter C.
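To make the idea of quality-aware matching concrete, here is a minimal Python sketch of how a PHRED score could feed into a per-position score. The conversion P = 10^(-Q/10) is the standard PHRED definition; everything else is an assumption made for illustration (the reference base is taken as fact, a miscalled base is assumed equally likely to be any of the other three, and the match/mismatch values are arbitrary). It is not the GNUMAP method and not anything EMBOSS implements.

```python
def phred_to_error_probability(quality):
    """Standard PHRED definition: Q = -10 * log10(P_error)."""
    return 10 ** (-quality / 10)


def expected_score(read_base, quality, ref_base, match=5.0, mismatch=-4.0):
    """Expected substitution score for one read base against a trusted reference.

    Assumption: if the base call is wrong, the true base is equally likely
    to be any of the three other nucleotides.
    """
    p_error = phred_to_error_probability(quality)
    if read_base == ref_base:
        p_true_match = 1.0 - p_error      # the call is right and it matches
    else:
        p_true_match = p_error / 3.0      # the call is wrong and the truth matches
    return p_true_match * match + (1.0 - p_true_match) * mismatch
```

In a probabilistic needle or water these expected scores would replace the fixed matrix lookup inside the usual dynamic programming recurrence, so a low-quality mismatch is penalised less than a high-quality one.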
Re: [EMBOSS] Probabilistic versions of needle/water?
On Mon, Jul 6, 2009 at 1:32 PM, Peter Rice wrote:
>
>> I am aware of people using EMBOSS tools (I assume water) to identify
>> (known) adaptor sequences in raw Solexa/Illumina data. I considered
>> doing something similar myself when trying to remove primer sequences
>> from 454 data. Such a pipeline using the current EMBOSS water would be
>> doing this matching at a purely fixed nucleotide level (ignoring the
>> qualities), which isn't ideal. Upgrading to a probabilistic version of
>> water should be an improvement.
>
> Would be interesting.
>
> Where can I look up adaptor calling methods?

The particular example I had in mind was the thread with Giles Weaver on the BioPerl mailing list, which I see you have just replied to:

http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html
http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030404.html

I think I made a typo earlier (needle versus water). If you are comparing a short but complete adaptor sequence to a read (which you expect may contain the full adaptor), doing a global alignment is more sensible than a local one. On re-reading, Giles did actually say he was using needle:

http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030411.html

Peter
[EMBOSS] transeq and ambiguous codons
Hi all,

Something I mentioned to Peter Rice in passing at BOSC/ISMB 2009 was that I'd found an oddity in transeq with certain ambiguous codons while testing Biopython's translations. Here is a specific example (but I suspect there are more). For reference, I am expecting EMBOSS transeq to be using the NCBI tables:
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

First consider the following example, the codon TAN, which can be TAA, TAC, TAG or TAT which translate to stop or Y. Therefore the translation of TAN should be "* or Y", and EMBOSS transeq opts for "X". Which is fine:

$ transeq asis:TAATACTAGTATTAN -stdout -auto
>asis_1
*Y*YX

Similarly for the codon TNN, again EMBOSS transeq opts for "X" because this could be a stop codon, or W, or F, or L, or S, or Y or C! Again, this is fine:

$ transeq asis:TNN -stdout -auto
>asis_1
X

However, consider the codon TRR. R means A or G, so this can mean TAA, TGA, TAG or TGG which translate to stop or W (both EMBOSS and the NCBI standard table agree here). Therefore the translation of TRR should be "* or W", which I would expect based on the above examples to result in "X". But instead EMBOSS transeq gives "*":

$ transeq asis:TAATGATAGTGGTRRTNN -stdout -auto
>asis_1
***W*X

I think this is a bug.

However, I am aware that the machine I tried this on is rather old, and I don't actually know which version of EMBOSS it is. How can I find out? As far as I know, there is no "-version" or "-v" or "--version" switch, and the "-help" information doesn't include this important piece of information. Nor is this in the FAQ:
http://emboss.sourceforge.net/docs/faq.html

So that makes two questions - how should transeq translate "TRR", and how do I check the version of EMBOSS?

Thanks,

Peter C.
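The translation rule argued for in this message (expand the ambiguous codon, translate every possibility, and only report a specific letter if they all agree) can be written down as a short Python sketch. This is an independent illustration, not the transeq or Biopython implementation; the function names are made up, while the table is NCBI standard table 1 and the nucleotide codes are the usual IUPAC ones.

```python
from itertools import product

# NCBI standard genetic code (table 1), bases in the NCBI order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon):
    """Translate one (possibly ambiguous) codon; return X unless all options agree."""
    expansions = product(*(IUPAC[base] for base in codon.upper()))
    options = {STANDARD_TABLE["".join(bases)] for bases in expansions}
    return options.pop() if len(options) == 1 else "X"


def translate(sequence):
    """Translate a nucleotide sequence codon by codon in frame 1."""
    usable = len(sequence) - len(sequence) % 3
    return "".join(translate_codon(sequence[i:i + 3]) for i in range(0, usable, 3))
```

Under this rule translate("TAATGATAGTGGTRRTNN") gives "***WXX", i.e. TRR becomes X because it covers both stop codons and W.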
Re: [EMBOSS] transeq and ambiguous codons
On Thu, Jul 9, 2009 at 12:53 AM, Scott Markel wrote:
>
> Peter,
>
> Answer to question #2: run the program embossversion.
>
>> embossversion
> Writes the current EMBOSS version number to a file
> 6.0.1
>
> Scott

Thanks Scott (& Thomas) for pointing out the embossversion program.

I would still question why the EMBOSS tools don't also support the Unix convention of a version switch. Hypothetically, aren't some (many?) of the tools standalone and couldn't they be installed individually (e.g. as part of someone else's software bundle)? i.e. Can EMBOSS really guarantee that the needle tool and the embossversion tool are in sync?

Peter
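For scripts that need to check the installed version, the embossversion output quoted above is easy to parse. The sketch below is one possible way to do it in Python; the function names are hypothetical, and the assumption that the version number sits on a line of its own (after an optional banner line) is based only on the output shown in this thread.

```python
import re
import subprocess

# A line consisting only of dot-separated integers, e.g. "6.0.1".
VERSION_PATTERN = re.compile(r"^\d+(?:\.\d+)+$")


def parse_emboss_version(output):
    """Return the version as a tuple of ints, or None if no version line is found."""
    for line in output.splitlines():
        candidate = line.strip()
        if VERSION_PATTERN.match(candidate):
            return tuple(int(part) for part in candidate.split("."))
    return None


def installed_emboss_version():
    """Run embossversion (requires EMBOSS on the PATH) and parse what it prints."""
    result = subprocess.run(["embossversion", "-auto"],
                            capture_output=True, text=True, check=True)
    # The banner may go to stderr, so search both streams.
    return parse_emboss_version(result.stdout + "\n" + result.stderr)
```

Returning a tuple makes comparisons such as installed_emboss_version() >= (6, 1, 0) straightforward. This only reports what the embossversion binary says, so it does not address the concern about standalone builds being out of sync.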
Re: [EMBOSS] transeq and ambiguous codons
On Thu, Jul 9, 2009 at 10:16 AM, Peter Rice wrote:
>
> Peter C. wrote:
>
>> Thanks Scott (& Thomas) for pointing out the embossversion program.
>>
>> I would still question why the EMBOSS tools don't also support the
>> Unix convention of a version switch. Hypothetically, aren't some
>> (many?) of the tools standalone and couldn't they be installed
>> individually (e.g. as part of someone else's software bundle)? i.e.
>> Can EMBOSS really guarantee that the needle tool and the
>> embossversion tool are in sync?
>
> We could easily add a -version global qualifier ... for the next release.
>
> We can guarantee that embossversion and needle are in sync - assuming
> they are built using the same libraries as that is where the version is
> recorded. Standalone builds are an issue though and it would help debug
> in a few cases.

That sounds good to me :)

Peter C.
Re: [EMBOSS] transeq and ambiguous codons
On Wed, Jul 8, 2009 at 10:50 PM, Peter wrote:
> Hi all,
>
> Something I mentioned to Peter Rice in passing at BOSC/ISMB 2009 was
> I'd found an oddity in transeq with certain ambiguous codons while
> testing Biopython's translations. Here is a specific example (but I
> suspect there are more). For reference, I am expecting EMBOSS transeq
> to be using the NCBI tables:
> http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
>
> First consider the following example, the codon TAN, which can be TAA,
> TAC, TAG or TAT which translate to stop or Y. Therefore the
> translation of TAN should be "* or Y", and EMBOSS transeq opts for
> "X". Which is fine:

Using raw output instead of the default FASTA works better in emails:

$ transeq asis:TAATACTAGTATTAN -stdout -auto -osformat raw
*Y*YX

> Similarly for the codon TNN, again EMBOSS transeq opts for "X" because
> this could be a stop codon, or W, or F, or L, or S, or Y or C! Again,
> this is fine:

Again, using raw output works better in emails:

$ transeq asis:TNN -stdout -auto -osformat raw
X

> However, consider the codon TRR. R means A or G, so this can mean TAA,
> TGA, TAG or TGG which translate to stop or W (both EMBOSS and the NCBI
> standard table agree here). Therefore the translation of TRR should be
> "* or W", which I would expect based on the above examples to result
> in "X". But instead EMBOSS transeq gives "*":

Again, using raw output works better in emails:

$ transeq asis:TAATGATAGTGGTRR -stdout -auto -osformat raw
***W*

> I think this is a bug.
>
> However, I am aware that the machine I tried this on is rather old,
> and I don't actually know which version of EMBOSS it is.

I can check the old machine later, but I just retested on a Mac using EMBOSS 6.0.1 (the current release), and see the same behaviour.

Peter C.
Re: [EMBOSS] transeq and ambiguous codons
On Thu, Jul 9, 2009 at 10:08 AM, Peter Rice wrote: > > Peter C. wrote: >> However, consider the codon TRR. R means A or G, so this can mean TAA, >> TGA, TAG or TGG which translate to stop or W (both EMBOSS and the NCBI >> standard table agree here). Therefore the translation of TRR should be >> "* or W", which I would expect based on the above examples to result >> in "X". But instead EMBOSS transeq gives "*": > > This is a side effect of the way backtranslation works... OK, leaving TRR aside for the moment (I'm not sure I'd have done it that way, but I think I follow your logic), I have some more problem cases for you to consider (all using the default standard NCBI table 1). Most of these are 'unambiguous ambiguous codons' as you put it, and I would agree using X when a more specific letter is possible isn't ideal but isn't actually wrong. The "ATS" and related codons (see below) however are simply wrong. -- TRA means TAA or TGA, which are both stop codons. Therefore TRA should translate as a stop, not as an X: $ transeq asis:TAATGATRA -stdout -auto -osformat raw **X -- Now look at YTA, which means CTA or TTA which encode L, so YTA should be L not X: $ transeq asis:CTATTAYTA -stdout -auto -osformat raw LLX Likewise for YTG and YTR, and YTN. -- Another example, ATW means ATA or ATT, which both translate as I, so ATW should translate as I not X: $ transeq asis:ATAATTATW -stdout -auto -osformat raw IIX -- Conversely, ATS which means ATC or ATG which translate as I and M. Remember S means G or C. Therefore ATS should translate as X, and not I: $ transeq asis:ATCATGATS -stdout -auto -osformat raw IMI Likewise H means A, G or C, so ATH shows the same bug, as do some other AT* codons: $ transeq asis:ATAATCATGATH -stdout -auto -osformat raw IIMI [*** This one strikes me as a clear bug ***] -- Now for another debatable one, RAT means AAT or GAT which code for N and D. So, you could use B (Asx) here rather than the broader X. 

$ transeq asis:AATGATRAT -stdout -auto -osformat raw
NDX

Again, the same thing for others like RAC -> X not B, and RAY -> X not B.

Similarly, you don't use J to mean leucine (L) or isoleucine (I), and opt for X (again, this is justifiable). e.g. WTA

$ transeq asis:ATATTAWTA -stdout -auto -osformat raw
ILX

------

This list is only partial, and only for the standard table.

Peter C.
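The expectations listed in this message can be checked mechanically. The Python sketch below expands each ambiguous codon against NCBI table 1 and, optionally, maps the "debatable" cases onto the protein ambiguity codes B, Z and J. It is an illustration only: the function name and the use_bzj switch are hypothetical, and nothing here reflects how ajtranslate.c is written.

```python
from itertools import product

# NCBI standard genetic code (table 1), bases in the NCBI order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Protein ambiguity codes: B = D or N, Z = E or Q, J = I or L.
PROTEIN_AMBIGUITY = {
    frozenset("DN"): "B",
    frozenset("EQ"): "Z",
    frozenset("IL"): "J",
}


def translate_codon(codon, use_bzj=False):
    """Return a specific residue when every expansion agrees, else B/Z/J or X."""
    expansions = product(*(IUPAC[base] for base in codon.upper()))
    options = frozenset(STANDARD_TABLE["".join(bases)] for bases in expansions)
    if len(options) == 1:
        return next(iter(options))
    if use_bzj and options in PROTEIN_AMBIGUITY:
        return PROTEIN_AMBIGUITY[options]
    return "X"
```

With use_bzj left off, the sketch gives * for TRA, L for YTA, I for ATW and X for ATS, RAT and WTA; turning it on changes RAT to B and WTA to J, which is the debatable part.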
Re: [EMBOSS] transeq and ambiguous codons
On Fri, Jul 10, 2009 at 10:30 AM, Peter Rice wrote: > > Peter C. wrote: >> >> OK, leaving TRR aside for the moment (I'm not sure I'd have done it that >> way, but I think I follow your logic), I have some more problem cases for >> you to consider (all using the default standard NCBI table 1). >> >> Most of these are 'unambiguous ambiguous codons' as you put it, and >> I would agree using X when a more specific letter is possible isn't ideal >> but isn't actually wrong. The "ATS" and related codons (see below) >> however are simply wrong. > > They do look wrong. The "X when it could pick a residue" ones I knew of. > > The others need a closer look. The plan is to work through all possible > codons and all the NCBI genetic codes as soon as the release is out. > > It should be a simple patch to ajtranslate.c when I'm done. > OK - I appreciate this is too last minute for the imminent EMBOSS release. >> -- >> >> Now for another debatable one, RAT means AAT or GAT which code >> for N and D. So, you could use B (Asx) here rather than the broader X. >> >> Similarly, you don't use J to mean leucine (L) or to isoleucine (I), and >> opt for X (again, this is justifiable). e.g. WTA > > Hmmm ... B and Z are ambiguity codes for amino acid analyser where all the > amide bonds are broken and that includes N->D and Q->E. We used to have one > of those in the lab. Similarly, J is for mass spec where I and L have the > same molecular weight. I don't consider them appropriate for translation. Well, as I said, this is debatable. On the one hand B and Z are IUPAC standards (although J isn't yet), but amino acids don't have the full ambiguous alphabet that we have for nucleotides so some might find such a translation surprising. http://www.chem.qmul.ac.uk/iupac/AminoAcid/A2021.html > So I plan to go for unique amino acids where possible with the ambiguity > codes. Good :) Peter C. ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
Re: [EMBOSS] transeq and ambiguous codons
On Thu, Jul 9, 2009 at 10:21 AM, Peter wrote: > On Thu, Jul 9, 2009 at 10:16 AM, Peter Rice wrote: >> >> Peter C. wrote: >> >>> Thanks Scott (& Thomas) for pointing out the embossversion program. >>> >>> I would still question why the EMBOSS tools don't also support the >>> Unix convention of a version switch. Hypothetically, aren't some >>> (many?) of the tools standalone and couldn't they be installed >>> individually (e.g. as part of someone else's software bundle)? i.e. >>> Can EMBOSS really guarantee that the needle tool and the >>> embossversion tool are in sync? >> >> We could easily add a -version global qualifier ... for the next release. >> >> We can guarantee that embossversion and needle are in sync - assuming >> they are built using the same libraries as that is where the version is >> recorded. Standalone build are an issue though and it would help debug >> in a few cases. > > That sounds good to me :) > Thinking about this again, rather than adding a whole new argument (-version), why not just include the program version as the first line of the help output (from -help)? This should also solve the corner case of standalone builds, and makes it very easy to find the version (without having to know about the embossversion tool). Thanks, Peter C. ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
[EMBOSS] EMBOSS seqret : IntelliGenetics and new DOS lines
aggtgaccggccaggaaac ggtcgcccagatcaaggtaaggctgcttggtgcgccctgggttccacttgtgctct tcactctcgcggcccgagggaacgcttacgagccttatctttccctgtaggctcatgtag cctcactggagggcattgggaagatcaagtcgtgctcctggcaggcgcgtgg aggatgaggccactctgggccagtgctggaggccctgactaccctggaagtagcag gccgcatgcttggaggtgagtgagagaggaatgttctttgaagtaccggtaagcgtctag tgagtgttgcatagtcctgacagctgagtgtcacacctatggtaatagagtacttc tcactgtcttcagttcagagtgattcttcctgtttacatccctcatgttgaacacagacg tccatgggagactgagccagagtgtagttgtatttcagtcacatcacgagatcctagtct ggttatcagcttccacactattaggtcagaccaggaaagtgctctataaatt agaagctggaagatcctgaaatgaaacttaagatttcaaggtcaaatatctgcaactttg ttctcattacctattgggcgcagcttctctttaaaggcttgaattgaggatt ctgctgggtggcaccttcttgctcttacctgctggtgccttcctttcccactacaggtaa agtccatggttccctggcccgtgctgggtgagaggtcagactcctaaggtgagtga gagtattagtggtcatggtgttaggactttcctttcacagctaaaccaagtccctg ggctcttactcggtttgccttctccctccctggagatgagcctgagggaagggatgctag gtgtggaagacaggaaccagggcctgattaaccttcccttctccaggtggccaaacagga gaagaagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgt caacgttgtgcccacctttggcaagaagaagggaatgccaactcttaagtcg taattctggctttctctaataagccacttagttcagtcatcgcattgtttcatctt tacttgcaaggcctcagggagaggtgtgcttctcgg i.e. There was a problem with this example file in EMBOSS 6.0.1, but things look fine in EMBOSS 6.1.0. 
Great :) However, if we now convert this input file to use DOS/Windows newlines, and repeat the test (on Mac OS X, so Unix): $ embossversionReports the current EMBOSS version number 6.1.0 $ seqret -sequence emboss_ig.txt -sformat ig -osformat fasta -auto -filter H.sapiens fau mRNA, 518 bases ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg cccagatcaaggctcatgtagcctcactggagggcattgggaagatcaagtcgtgc tcctggcaggcgcgtggaggatgaggccactctgggccagtgctggaggccc tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc gtgctgggtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca cctttggcaagaagaagggaatgccaactcttaagtcgtaattctggctttc tctaataagccacttagttcagtcaa H.sapiens fau 1 gene, 2016 bases ctaccaccctctcgattctatatgtacactcgggacaagttctcctgatcgc ggcctaaggaagtaggaatgccttagcttaacaatgattaacac tgagcctcacacccacgcgatgccctcagctcctcgctcagcgctctcaccaacagccgt agcccgcaggctggacaccggttctccatgcagcgtagcccggaacatggta gctgccatctttacctgctacgccagccttctgtgcgcgcaactgtctggtcccgg tcctgcgcgagctgctgcccaggcaggttcgccggtgcgagcgtaaacggagctag gactgccttgggcggtacaaatagcagggaaccgcgcggtcgctcagcagtgacgtgaca cgcagcccacggtctgtactgacgcgccctcgcttcttcctctttctcgactccatcttc gcggtagctgggaccgccgttcaggtaagaatccttggctggatccgaagggcttg tagcaggttggctgctcagaaggcgcggaaccgaagaaccctgctccg tggccctgctccagtccctatccgaactccttgggaggcactggccttccgcacgtgagc cgccgcgaccaccatcccgtcgcgatcgtttctggaccgctttccactcccaaatctcct ttatcccagagcatttcttggcttctcttacaagccgtcctttactcagtcgccaa tatgcagctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaac ggtcgcccagatcaaggtaaggctgcttggtgcgccctgggttccacttgtgctct tcactctcgcggcccgagggaacgcttacgagccttatctttccctgtaggctcatgtag cctcactggagggcattgggaagatcaagtcgtgctcctggcaggcgcgtgg aggatgaggccactctgggccagtgctggaggccctgactaccctggaagtagcag gccgcatgcttggaggtgagtgagagaggaatgttctttgaagtaccggtaagcgtctag tgagtgttgcatagtcctgacagctgagtgtcacacctatggtaatagagtacttc 
tcactgtcttcagttcagagtgattcttcctgtttacatccctcatgttgaacacagacg tccatgggagactgagccagagtgtagttgtatttcagtcacatcacgagatcctagtct ggttatcagcttccacactattaggtcagaccaggaaagtgctctataaatt agaagctggaagatcctgaaatgaaacttaagatttcaaggtcaaatatctgcaactttg ttctcattacctattgggcgcagcttctctttaaaggcttgaattgaggatt ctgctgggtggcaccttcttgctcttacctgctggtgccttcctttcccactacaggtaa agtccatggttccctggcccgtgctgggtgagaggtcagactcctaaggtgagtga gagtattagtggtcatggtgttaggactttcctttcacagctaaaccaagtccctg ggctcttactcggtttgccttctccctccctggagatgagcctgagggaagggatgctag gtgtggaagacaggaaccagggcctgattaaccttcccttctccaggtggccaaacagga gaagaagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgt caacgttgtgcccacctttggcaagaagaagggaatgccaactcttaagtcg taattctggctttctctaataagccacttagttcagtcatcgcattgtttcatctt tacttgcaaggcctcagggagaggtgtgcttctcgg i.e. The ">" is missing on all the FASTA sequences. So, it looks like EMBOSS 6.1.0 fixed one problem with IntelliGenetics files, but that there is still an issue here. Peter C. P.S. Should I have reported this possible bug via sourceforge? P.P.S. Back in 2006, I reported a similar issue with a data corruption reading stockholm/pfam with DOS newlines (Sourceforge Bug #1588956, long since fixed). It seems to me that EMBOSS would benefit from explicit testing of all the file formats using DOS/Windows newlines when run on Unix, and vice versa. Does that sound feasible, or just hopelessly ambitious? ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
[EMBOSS] FASTQ format documentation
Hi all, I was just trying to double check the names EMBOSS 6.1.0 supports for the various FASTQ file formats, and none of them are listed here: http://emboss.sourceforge.net/docs/themes/SequenceFormats.html Does this need updating, or should I be looking elsewhere? Thanks Peter C. ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
Re: [EMBOSS] EMBOSS seqret : IntelliGenetics and new DOS lines
On Mon, Jul 20, 2009 at 5:16 PM, Peter Rice wrote:
>
> Peter C. wrote:
>> Hi all,
>>
>> I've just updated my Mac to EMBOSS 6.1.0, and have found an
>> issue with seqret conversion of IntelliGenetics files. After some
>> digging, I think this problem relates to having DOS new lines in
>> a file on Unix (in my case, Mac OS X).
>
> we have an application "noreturn" to fix things like this.

That's basically an EMBOSS variant of unix2dos and dos2unix (or similar existing Unix command line tools)? I'm more interested in having all the EMBOSS tools handle either new line format themselves automatically. These days I am mostly working on Unix (including Mac OS X), but I do have to cope with Windows style text files quite often.

> If you send me your file I will try to take a look at whether we should
> be catching the funny newline characters.

For this bug report I was using:
http://emboss.sourceforge.net/docs/themes/seqformats/ig

There are another three example files used in the Biopython unit tests here:
http://biopython.open-bio.org/SRC/biopython/Tests/IntelliGenetics/

>> P.S. Should I have reported this possible bug via sourceforge?
>
> The emboss-...@emboss.open-bio.org list is the best way to get
> our attention

Great, another mailing list to sign up to... but if that is your preferred route, that's fine.

>> P.P.S. Back in 2006, I reported a similar issue with a data
>> corruption reading stockholm/pfam with DOS newlines
>> (Sourceforge Bug #1588956, long since fixed). It seems to
>> me that EMBOSS would benefit from explicit testing of all
>> the file formats using DOS/Windows newlines when run on
>> Unix, and vice versa. Does that sound feasible, or just
>> hopelessly ambitious?
>
> We can try ... how well does biopython handle these? (i.e. do we need
> such examples for perl, python etc or is this an EMBOSS-specific issue?)

I think this is an EMBOSS specific issue. I don't know enough about how all the different EMBOSS parsers work, but is there a single place where you could add automatic handling of either new line convention when reading in text?

For reference, in Python, you can explicitly open text files in "universal newlines" mode, which takes care of this. I don't know about Perl.

Peter C.
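As a small illustration of the "universal newlines" behaviour mentioned here: in current Python 3, text-mode open() translates \r\n and lone \r to \n by default (newline=None), whereas the Python 2 of the time needed the explicit "rU" mode. The helper names below are made up for the example.

```python
import os
import tempfile


def read_lines(path):
    """Read a text file with universal newlines, returning lines without terminators."""
    # newline=None is the default; it is spelled out here to make the point explicit.
    with open(path, "r", newline=None) as handle:
        return [line.rstrip("\n") for line in handle]


def round_trip(data):
    """Write raw bytes to a temporary file and read them back as text lines."""
    handle = tempfile.NamedTemporaryFile(delete=False)
    try:
        handle.write(data)
        handle.close()
        return read_lines(handle.name)
    finally:
        handle.close()
        os.remove(handle.name)
```

The same three lines come back whether the bytes use Unix (\n), DOS/Windows (\r\n) or old Mac (\r) line endings, which is the behaviour being asked for from the EMBOSS parsers.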
Re: [EMBOSS] EMBOSS seqret : IntelliGenetics and new DOS lines
Peter Rice wrote:
>
> Thanks for the example files. I will start with those.
>
> Peter C. wrote:
>> I think this is an EMBOSS specific issue. I don't know enough about
>> how all the different EMBOSS parsers work, but is there a single
>> place where you could add automatic handling of either new line
>> convention when reading in text?
>
> Hope so. I think the issue is places where the parsing is checking
> explicitly for \n rather than \n and \r. The solution would be to strip
> both off before parsing. It will need a thorough clean through the
> ajseqread code.

That sounds like a good investment of effort in the long run :)

Peter C.
Re: [EMBOSS] EMBOSS seqret : IntelliGenetics and new DOS lines
Peter Rice wrote:
>
> Peter C. wrote:
>> However, if we now convert this input file to use DOS/Windows
>> newlines, and repeat the test (on Mac OS X, so Unix):
>>
>> $ embossversion
>> Reports the current EMBOSS version number
>> 6.1.0
>> $ seqret -sequence emboss_ig.txt -sformat ig -osformat fasta -auto -filter
>> H.sapiens fau mRNA, 518 bases
>> ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
>>
>> i.e. The ">" is missing on all the FASTA sequences.
>
> Actually, it's not missing ... it is hiding.
>
> The sequence id has a ^M appended to it, so the '>' and the id get
> overwritten by the description when you look at the file.

That makes sense, and I think I can see how it might have happened.

> Fixed by processing the IG format ID rather than simply copying it.
>
> Thanks for finding that one.

Sure,

Peter C.
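The "hiding" effect described above is easy to reproduce outside EMBOSS. In the Python sketch below, a FASTA header is built from an identifier that still carries a trailing carriage return (^M); a terminal then moves the cursor back to column one and the description overwrites the '>' and the identifier. The function names and the identifier "HSFAU" are assumptions made for illustration, and this is not the actual EMBOSS fix.

```python
def naive_header(identifier, description=""):
    """Copy the identifier as-is: a trailing \\r survives into the header line."""
    return f">{identifier} {description}"


def fasta_header(identifier, description=""):
    """Strip both \\r and \\n from the fields before building the header."""
    clean_id = identifier.rstrip("\r\n")
    clean_description = description.rstrip("\r\n")
    return f">{clean_id} {clean_description}".rstrip()
```

Printing naive_header("HSFAU\r", "H.sapiens fau mRNA, 518 bases") to a terminal shows only the description, because the carriage return sits between the identifier and the description; the '>' is still there in the bytes. Stripping both terminators first, as fasta_header does, avoids the problem regardless of which newline convention the input file used.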
[EMBOSS] FASTQ records with no sequence?
Hi all,

On the continuing topic of the nebulous FASTQ format, are there any strong views as to whether a FASTQ file could hold records without a sequence (and therefore no quality scores)? This could make sense as output from an (aggressive) quality filter. This is a corner case, and applies to other file formats too of course (e.g. FASTA).

I mentioned this to Peter Rice (EMBOSS) off list, and he replied:

On Thu, Jul 30, 2009 at 2:56 PM, Peter Rice wrote:
> EMBOSS rejects zero length sequences - something we put in some years
> ago for misformatted FASTA files that someone ran through a Taverna
> workflow to launch clustalw via EMBOSS's "emma". The user had got his
> carriage control characters mangled so the sequence was appended to the
> FASTA '>' line and appeared as a long description with no sequence.
>
> I can well imagine for filtering paired reads that zero length sequences
> would be useful.
>
> At the point where the test is made we know the sequence format.
> We can therefore define some or all formats as accepting or rejecting
> zero length sequences.
>
> Similarly we can easily extend to define some applications (e.g. emma)
> as requiring a minimum sequence length.
>
> regards,
>
> Peter

Peter Rice is of course correct - in general the meaning and validity of a zero length sequence is context dependent.

I think Peter Rice makes a good point regarding paired end reads. What I assume he was getting at is the situation where due to quality trimming, one of a pair might be trimmed to nothing - leaving essentially a singleton read. However, paired end reads are normally stored using a matched pair of FASTQ files, so it could be important to keep the zero length read present, so that they can be read in together in sync.

If we do want to allow zero length sequences in FASTQ, would both of the following be valid? Should there be empty sequence and quality lines, or no sequence and quality lines?

"@identifier\n+\n" (two lines, just the @ and + lines)
"@identifier\n\n+\n\n" (four lines, including blank seq and qual lines)

or with the repeated identifier on the plus lines:

"@identifier\n+identifier\n" (two lines, just the @ and + lines)
"@identifier\n\n+identifier\n\n" (four lines, including blank lines)

As we are recommending no line wrapping on output this means typical FASTQ records would be four lines - so doing the same makes sense here too.

Peter C.
Re: [EMBOSS] FASTQ records with no sequence?
On Thu, Jul 30, 2009 at 4:09 PM, Peter Rice wrote: > > Peter C. wrote: > >> As we are recommending no line wrapping on output this means >> typical FASTQ records would be four lines - so doing the same >> makes sense here too. > > I vote for 4 lines on output. If we want to allow zero length sequences, then yes, I would also vote for the 4 line output (i.e. blank lines for the sequence and the quality string). > It should be possible to allow zero lines on input depending on > where the '+' check is. Yes, I'm pretty sure a parser could cope with any of the zero length sequence FASTQ examples I gave. Peter ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
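To illustrate the claim that a parser could cope with either representation of a zero length record, here is a minimal Python sketch. It is not the EMBOSS or Biopython parser; the function name parse_fastq is made up for this example, and it assumes unwrapped records (one sequence line and one quality line at most).

```python
def parse_fastq(lines):
    """Yield (title, sequence, quality) tuples from an iterable of FASTQ lines.

    Sketch only: assumes unwrapped records, and accepts zero length records
    written either as two lines ("@id", "+") or as four lines with blank
    sequence and quality lines.
    """
    lines = [line.rstrip("\r\n") for line in lines]
    i = 0
    while i < len(lines):
        if not lines[i].startswith("@"):
            raise ValueError("expected '@' title line at line %d" % (i + 1))
        title = lines[i][1:]
        i += 1

        # The sequence line is optional: if we go straight to the '+' line
        # this is the two-line form of an empty record.
        seq = ""
        if i < len(lines) and not lines[i].startswith("+"):
            seq = lines[i]
            i += 1

        if i >= len(lines) or not lines[i].startswith("+"):
            raise ValueError("missing '+' line for record %r" % title)
        i += 1

        qual = ""
        if seq:
            # A non-empty sequence must be followed by a quality line, which
            # may itself begin with '@' or '+'.
            if i >= len(lines):
                raise ValueError("missing quality line for record %r" % title)
            qual = lines[i]
            i += 1
        elif i < len(lines) and lines[i] == "":
            # Four-line form of an empty record: skip the blank quality line.
            i += 1

        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ for %r" % title)
        yield title, seq, qual
```

The '+' check happens after at most one sequence line has been consumed, which is what lets both forms through; a parser that insists on reading exactly four lines per record would reject the two-line form.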
[EMBOSS] FASTQ records with no sequence?
Hi all,

On the continuing topic of the nebulous FASTQ format, are there any strong views as to whether a FASTQ file could hold records without a sequence (and therefore no quality scores)? This could make sense as output from an (aggressive) quality filter.

This was a discussion I meant to start on the OBF list, not the EMBOSS list - so here is the start of the thread:
http://lists.open-bio.org/pipermail/emboss/2009-July/003707.html

Basically in some contexts an empty FASTQ record makes sense, so perhaps we should include examples of this for our test suite. However, there is more than one reasonable way to represent such a record (either omitting the sequence and quality lines, or including blank sequence and quality lines).

On Thu, Jul 30, 2009 at 4:09 PM, Peter Rice wrote:
>
> Peter C. wrote:
>
>> As we are recommending no line wrapping on output this means
>> typical FASTQ records would be four lines - so doing the same
>> makes sense here too.
>
> I vote for 4 lines on output.

If we want to allow zero length sequences, then yes, I would also vote for the 4 line output (i.e. blank lines for the sequence and the quality string).

> It should be possible to allow zero lines on input depending on
> where the '+' check is.

Yes, I'm pretty sure a parser could cope with any of the zero length sequence FASTQ examples I gave.

Peter
[EMBOSS] GFF/GFF2/GFF3 examples on EMBOSS webpage
Hi all,

I was just looking at this page:
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

This table lists GFF2 as one entry, and GFF/GFF3 as another. They link to:
http://emboss.sourceforge.net/docs/themes/seqformats/gff2
and
http://emboss.sourceforge.net/docs/themes/seqformats/gff
respectively.

These examples appear to be identical (and the header says it is a GFF2 file). So I am a bit confused. Should one be a GFF3 file, and was one file simply uploaded twice by mistake?

Thanks,

Peter C.
[EMBOSS] vectorstrip on FASTQ files
Hi,

I'm trying to use vectorstrip on FASTQ files (as a simple way to remove adaptor or primer sequences). However, it seems that on output the FASTQ qualities are missing (all set to the double quote, ASCII 34, meaning PHRED quality 1 or random). Is this a known bug (or rather, a missing feature)?

For illustration I am using a Sanger style FASTQ file from the NCBI SRA (short reads originally from Solexa/Illumina), SRR014849.fastq, which you can download from
ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX003/SRX003639/SRR014849.fastq.gz

I am pretending "GTTGGAACCG" is a 5' adaptor sequence, and want to find any matches in some FASTQ reads, and trim it off taking only the sequence to the right. For simplicity I'm allowing no mismatches.

Here is the start of the file:

$ head -n 12 SRR014849.fastq
@SRR014849.1 EIXKN4201CFU84 length=93
CTTTGTTTGGAACCGAAAGGGGAATTTCAAACCCCGGTTTCCAACCTTCCAAAGCAATGCCAATA
+SRR014849.1 EIXKN4201CFU84 length=93
3+&$#"""""""""""7...@71,'";C?,B;?6B;:EA1EA1EA5'9B:?:#9e...@2ea5':>5?:%A;A8A;?9B;D@/=5B;4B>+C?,EA09B;@;9E@/EA/E@/B:;1B:B:;A9<5SRR014849.9_from_31_to_84 EIXKN4201AL42E length=84 AAAGGGTTTGAATTCAAACCCTTTGGTTCCAACTTGTCTTGCTTTAGCCA

Using Sanger FASTQ runs:

$ vectorstrip -sequence SRR014849.fastq -sformat fastq-sanger -readfile N -alinker "GTTGGAACCG" -blinker "" -osformat fastq-sanger -outseq SRR014849_5trimmed.fastq -mismatch 0 -besthits Y -outfile SRR014849_5trimmed.txt
Removes vectors from the ends of nucleotide sequence(s)

But the output is missing the quality scores:

$ head -n 4 SRR014849_5trimmed.fastq
@SRR014849.9_from_31_to_84 EIXKN4201AL42E length=84
AAAGGGTTTGAATTCAAACCCTTTGGTTCCAACTTGTCTTGCTTTAGCCA
+
""""""""""""""""""""""""""""""""""""""""""""""""""""""

Is this something simple to add to vectorstrip? What about other annotation (e.g. running vectorstrip on annotated GenBank or EMBL files)?

Thanks,

Peter C.

P.S. This is with EMBOSS 6.1.0 with a patch from Peter Rice, running on Mac OS X.
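For comparison, the behaviour being asked for (trim a 5' adaptor with no mismatches and keep the matching slice of the quality string) can be sketched in a few lines of Python. This is only an illustration of the idea, not how vectorstrip works internally, and the function name trim_5prime_adaptor is invented for the example.

```python
def trim_5prime_adaptor(seq, qual, adaptor):
    """Return (seq, qual) to the right of the first exact adaptor match.

    Returns None if the adaptor is not found. The quality string is sliced
    with the same offset as the sequence so the two stay in step.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    pos = seq.find(adaptor)  # exact match only, i.e. zero mismatches
    if pos == -1:
        return None
    start = pos + len(adaptor)
    return seq[start:], qual[start:]
```

Because the trimmed read is just a slice, carrying the per-base qualities along costs nothing extra once they have been parsed.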
Re: [EMBOSS] vectorstrip on FASTQ files
Peter Rice wrote:
>
> Peter C. wrote:
>> Hi,
>>
>> I'm trying to use vectorstrip on FASTQ files (as a simple way to
>> remove adaptor or primer sequences). However, it seems that on output
>> the FASTQ qualities are missing (all set to the double quote, ASCII
>> 34, meaning PHRED quality 1 or random). Is this a known bug (or
>> rather, a missing feature)?
>
> It is a missing feature. vectorstrip was written before quality scores
> became fashionable and, curiously, nobody has asked for them before.
>
> We will certainly retain them in a future release.

Great - thanks!

Peter C.
Re: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
On Wed, Sep 16, 2009 at 7:57 AM, Charles Plessy wrote: > > Dear EMBOSS developers, > > I have multi-sequence file in FASTQ format that contains sequencing reads, and > would like to retreive them the with seqret. But as you see in the following > example, quality scores are not preserved: > > $ seqret P13-CA.fq:F1EZY7316JY25B fastq::stdout > Reads and writes (returns) sequences > @F1EZY7316JY25B rank=040 x=3973.0 y=285.0 length=68 > AATGATACGGCGACCACCGAACACTGCGTTTGCTGGCTTTGATGCACTTCTCATGGCCAATTTCATTG > + > """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" You need to use "fastq-sanger" (or the other variants), since in EMBOSS, "fastq" currently means FASTQ ignoring the qualities. This is documented: http://emboss.sourceforge.net/docs/themes/SequenceFormats.html As an EMBOSS user, I think the current situation is confusing, and it would make much more sense to have "fastq" just an alias for "fastq-sanger" (which would be consistent with Biopython and BioPerl). http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000576.html And also this email - especially the last example: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000599.html > The purpose was to use seqret as a workaround for the fact that > vectorstrip does not keep the quality either. That's also been suggested, and is likely to be supported in future. http://lists.open-bio.org/pipermail/emboss/2009-August/003722.html Peter ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
Re: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
On Thu, Sep 17, 2009 at 8:24 AM, Peter Rice wrote:
>
>> Also, in contrary to what the documentation predicts, using the fastq
>> format for the output does not ignore the quality scores. (Not that
>> would be particularly useful, but…)
>
> This is deliberate. We have to write something in FASTQ format and we
> default to the fastq-sanger format. On input, fastq-sanger ignores qualities
> because there is no safe way to decide which format is correct.

So again, could you reconsider making "fastq" act like "fastq-sanger"? The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores, a superset of the Solexa/Illumina FASTQ variants - so even if you don't know which kind of FASTQ file you have, and you don't care about the qualities, parsing it as a Sanger FASTQ file will work.

Peter C.
Re: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
On Thu, Sep 17, 2009 at 10:18 AM, Peter Rice wrote:
>
>> So again, could you reconsider making "fastq" act like "fastq-sanger"?
>> The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
>> a superset of the Solexa/Illumina FASTQ variants - so even if you don't
>> know which kind of FASTQ file you have, and you don't care about the
>> qualities, parsing it as a Sanger FASTQ file will work.
>
> Yes, but it is dangerous if they could really be Solexa qualities.

Indeed, or an Illumina 1.3+ encoded FASTQ file. So if the EMBOSS tools are used to read a FASTQ file without specifying the FASTQ variant, do they currently detect it is FASTQ, default to the "fastq" setting, and ignore the quality information?

> What we could do is provide a utility that reads in fastq-sanger format and
> checks whether the quality scores make most sense as Sanger, Solexa or
> Illumina.

That could be useful - I guess you could scan all the reads building up a histogram of the ASCII characters used. This could immediately rule out some of the options, and then based on the distribution (if you assume they are raw reads) you could make a good guess.

> I consider reading as fastq-sanger by default to be rather dangerous.

That is understandable. How about removing the current "fastq" output then? That might prevent some of the confusion at the moment. I'm struggling to see any purpose for the current "fastq" output - can you give me any example use case? Right now it has to pick an arbitrary quality symbol, and uses ASCII 34 (double quote) which means PHRED 1 (random) for a Sanger FASTQ file but is invalid as a Solexa or Illumina 1.3+ FASTQ file.

Regards,

Peter
Re: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
On Thu, Sep 17, 2009 at 10:52 AM, Peter Rice wrote:
>
>>> What we could do is provide a utility that reads in fastq-sanger format
>>> and checks whether the quality scores make most sense as Sanger,
>>> Solexa or Illumina.
>>
>> That could be useful - I guess you could scan all the reads building up
>> a histogram of the ASCII characters used. This could immediately
>> rule out some of the options, and then based on the distribution (if
>> you assume they are raw reads) you could make a good guess.
>
> The ACD file would be 'interesting'. We could set the default format to be
> "fastq-sanger" and issue some warning if we find the user had tried to
> change it. That way the application would run with a filename as the input,
> though it will appear to interfaces to be able to read any sequence input.
>
> Are there rules we can use to decide on improbable qualities? Values below
> the Illumina and Solexa minima would seem a good guide, and perhaps
> values above the likely short read maximum score.
>
> Maybe some existing pipelines have some cutoff values we could adopt?

Quite possibly. Telling apart raw Sanger reads and raw Solexa/Illumina reads should be easy. However, unless there are some ASCII characters in the range 59 to 63 (Solexa -5 to -1), there isn't going to be a safe way to tell Solexa and Illumina 1.3+ apart. Of course, if they just have good reads above Solexa/PHRED 10 (which would be ASCII 74), either way it isn't going to make much difference.

In any case, it will be heuristic, and sometimes it will get it wrong (e.g. post processed Sanger FASTQ files with high scores might look like raw reads in Solexa/Illumina FASTQ).

>>> I consider reading as fastq-sanger by default to be rather dangerous.
>>
>> That is understandable. How about removing the current "fastq" output
>> then? That might prevent some of the confusion at the moment. I'm
>> struggling to see any purpose for the current "fastq" output - can you
>> give me any example use case? Right now it has to pick an arbitrary
>> quality symbol, and uses ASCII 34 (double quote) which means PHRED
>> 1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
>> Illumina 1.3+ FASTQ file.
>
> It is an alias for fastq-sanger which should be OK. I prefer to have an
> output format name for each input format name where it looks sensible,
> so if we read "fastq" as an input format it should do something on
> output. Unfortunately that means it has to write quality scores somehow.

I'm not convinced that the current "fastq" output (with the double quote quality string) is entirely "sensible". But I'll drop this now - I've argued my case, and will leave it at that. As long as the current behaviour is clear in the documentation, it should be OK.

Regards,

Peter
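The "rule out some of the options" step discussed in this thread can be sketched from the ASCII ranges alone: Sanger uses offset 33, Solexa uses offset 64 with a minimum score of -5 (ASCII 59), and Illumina 1.3+ uses offset 64 with a minimum of 0 (ASCII 64). The Python below is a hypothetical helper, not an EMBOSS utility; it only narrows the candidates and makes no attempt at the distribution-based guess.

```python
def candidate_fastq_variants(quality_strings):
    """Return the FASTQ variants consistent with the observed quality characters.

    Only the lowest ASCII code seen is used to rule variants out. Sanger is
    always a candidate, since its range (33 to 126) is a superset of the others.
    """
    codes = [ord(char) for qual in quality_strings for char in qual]
    if not codes:
        # No quality characters at all (e.g. only empty records): no evidence.
        return ["fastq-sanger", "fastq-solexa", "fastq-illumina"]
    lowest, highest = min(codes), max(codes)
    if lowest < 33 or highest > 126:
        raise ValueError("quality characters outside ASCII 33 to 126")

    candidates = ["fastq-sanger"]
    if lowest >= 59:  # Solexa scores start at -5, i.e. ASCII 59
        candidates.append("fastq-solexa")
    if lowest >= 64:  # Illumina 1.3+ scores start at 0, i.e. ASCII 64
        candidates.append("fastq-illumina")
    return candidates
```

When more than one candidate is returned the choice is a heuristic, as noted above: a file with nothing below ASCII 64 may be Solexa, Illumina 1.3+, or a Sanger file containing only high scores.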
Re: [EMBOSS] Trim polyA from fastq files
On Thu, Oct 1, 2009 at 1:53 PM, michael watson (IAH-C) wrote: > > Hi Peter > > Thanks for that. > > Is it possible to preserve the fastq format? My input was fastq, I also put > .fastq as my output, but it only gave me straight fasta > Use "fastq-sanger" (or a variant), not just "fastq" which means ignoring the qualities in EMBOSS. http://emboss.sourceforge.net/docs/themes/SequenceFormats.html#in Peter ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
[EMBOSS] Fwd: [DAS] DAS workshop 7th-9th April 2010
This might be of interest to some of you.

Peter

-- Forwarded message --
From: Jonathan Warren
Date: Thu, Nov 26, 2009 at 2:57 PM
Subject: [DAS] DAS workshop 7th-9th April 2010
To: d...@biodas.org, das_registry_annou...@sanger.ac.uk, biojava-dev, BioJava, BioPerl, a...@sanger.ac.uk, a...@ebi.ac.uk, ensembldev

We are considering running a Distributed Annotation System workshop here at the Sanger/EBI in the UK subject to decent demand. The workshop will be held from Wednesday 7th-Friday 9th April 2010.

If you would be interested in attending either to present or just take part then please email me j...@sanger.ac.uk

The format of the workshop is likely to be similar to last year's (1st day for beginners, 2nd for both beginners and advanced users, 3rd day for advanced), information for which can be found here:
http://www.dasregistry.org/course.jsp

If you would like to present then please send a short summary of what you would like to talk about.

Thanks

Jonathan.

Jonathan Warren
Senior Developer and DAS coordinator
j...@sanger.ac.uk

-- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
___ DAS mailing list d...@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/das
Re: [EMBOSS] Trimming illumina short reads based on quality
On Tue, Dec 1, 2009 at 2:33 PM, michael watson (IAH-C) wrote:
>
> Hi
>
> I'm sorry if I've not been keeping up to date on what is doubtless a hot
> topic.
>
> Does EMBOSS allow one to trim short reads based on quality data (from a fastq
> file)?
>
> If not, I have read that it is planned - any idea when it will be implemented?

Not yet, but it has been proposed and I understand it is on the EMBOSS to do list along with quality filtering (Peter Rice has suggested the name quaffle for this):
http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030493.html

I dare say suggestions for precise trimming algorithms (e.g. median over sliding window) might be welcome.

> Otherwise, alternative suggestions are welcome!

I'm sure there are plenty of scripts out there, in Perl, Python etc. What is your language of choice?

Peter C.
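As one concrete reading of "median over sliding window", here is a small Python sketch that trims the 3' end of a read at the first window whose median quality drops below a threshold. The function name, the default window size and the default threshold are all arbitrary choices for illustration; this is not a planned EMBOSS algorithm.

```python
from statistics import median


def trim_by_median_window(seq, quals, window=5, threshold=20):
    """Trim the 3' end at the first window whose median quality is too low.

    seq is the read sequence and quals a list of integer PHRED scores of the
    same length. Windows are scanned left to right; the read is cut at the
    start of the first window with median quality below the threshold.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    if not quals:
        return seq, quals
    # Reads shorter than the window are judged on a single, shorter window.
    last_start = max(len(quals) - window, 0)
    for start in range(last_start + 1):
        if median(quals[start:start + window]) < threshold:
            return seq[:start], quals[:start]
    return seq, quals
```

Cutting at the start of the failing window is conservative: it can discard a good base or two just before the quality drops. Other rules (cutting at the first bad base inside the window, or using the mean) are equally defensible.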
[EMBOSS] Unknown output format 'refseqp' and 'genpept'
Hi, I have a protein IntelliGenetics file used in the Biopython test suite: http://biopython.org/SRC/biopython/Tests/IntelliGenetics/VIF_mase-pro.txt I am using EMBOSS 6.1.0 (patch level 2 I think), and I am trying to turn this into a "GenBank Protein File", or GenPept file, using EMBOSS seqret. EMBOSS can read the file fine, this works: $ seqret -auto -sformat=ig -osformat=fasta VIF_mase-pro.txt temp.txt Giving FASTA output with 16 gapped protein sequences, which is good - although the ID of the first record is a bit odd. Using "genbank" as the output format in EMBOSS seems to mean nucleotide and not protein: $ seqret -auto -sformat=ig -osformat=genbank VIF_mase-pro.txt temp.txt Error: Sequence format 'genbank' not supported for protein sequences Error: Sequence format 'genbank' not supported for protein sequences ... Error: Sequence format 'genbank' not supported for protein sequences Referring to the documentation, http://emboss.sourceforge.net/docs/themes/SequenceFormats.html I then tried "genpept" and "refseqp": $ seqret -auto -sformat=ig -osformat=genpept VIF_mase-pro.txt temp.txt Error: Unknown output format 'genpept' Error: Unknown output format 'genpept' ... Error: unknown output format 'genpept' $ seqret -auto -sformat=ig -osformat=refseqp VIF_mase-pro.txt temp.txt Error: Unknown output format 'refseqp' Error: Unknown output format 'refseqp' ... Error: unknown output format 'refseqp' Doesn't EMBOSS seqret support genpept/refseqp as an output format? Thanks, Peter C. ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
Re: [EMBOSS] Unknown output format 'refseqp' and 'genpept'
On Tue, Dec 8, 2009 at 1:32 PM, Peter Rice wrote:
>
> Peter wrote:
>>
>> Hi,
>>
>> I have a protein IntelliGenetics file used in the Biopython test suite:
>> http://biopython.org/SRC/biopython/Tests/IntelliGenetics/VIF_mase-pro.txt

It probably doesn't matter what the input file is here; the fact that it was an (obsolete) format like IntelliGenetics was just chance as I was working on a Biopython unit test.

>> I am using EMBOSS 6.1.0 (patch level 2 I think), and I am trying
>> to turn this into a "GenBank Protein File", or GenPept file, using
>> EMBOSS seqret.
>>
>> Doesn't EMBOSS seqret support genpept/refseqp as an output format?
>
> Oddly enough you are the first to ask for it.

That surprises me a little bit. Could I suggest you treat known input formats which are not supported as output formats a little differently, and instead of this:

unknown output format 'genpept'

perhaps give:

format 'genpept' is not supported for output (only input)

This would help the user rule out having a typo etc.

> Does biopython have a definition of the fields it expects to write out in a
> GenPept or RefseqP format file? We would be able to allow GenBank as an
> alias for, presumably, genpept.

Not explicitly, no. I was hoping to use EMBOSS for cross validation ;)

With hindsight this may have been a mistake, but we use "genbank" format to mean either nucleotides or proteins. On parsing we just look at the units of length in the LOCUS line (bp or aa). We also try to cope with both the current NCBI files and some older variants we have in our unit tests (different offsets in the LOCUS line).

> Might be a good time to merge the format names and details from biopython
> and emboss. Where can I find the biopython ones?

There are two tables on the wiki which include version information:
http://biopython.org/wiki/SeqIO
http://biopython.org/wiki/AlignIO

You can also consult the built in documentation, also available online:
http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html
http://biopython.org/DIST/docs/api/Bio.AlignIO-module.html

For a long time I avoided having aliases (multiple names for the same thing). However, we now treat "gb" as an alias for "genbank" (since this is what the NCBI use in Entrez). We also treat "fastq-sanger" and "fastq" the same.

Peter C (the one at Biopython)
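The "look at the units of length in the LOCUS line" idea can be shown with a short Python sketch. This is not Biopython's actual parser (which also has to deal with fixed column offsets in older files); the function name is invented, and splitting on whitespace is a simplifying assumption.

```python
def locus_line_is_protein(locus_line):
    """Return True for a GenPept style LOCUS line (aa), False for nucleotide (bp).

    Sketch only: splits on whitespace rather than using column offsets, and
    looks for the length units after the locus name.
    """
    fields = locus_line.split()
    if len(fields) < 4 or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % locus_line)
    # fields[1] is the locus name, so only search from fields[2] onwards.
    if "aa" in fields[2:]:
        return True
    if "bp" in fields[2:]:
        return False
    raise ValueError("no 'aa' or 'bp' length units in LOCUS line")
```

For example, a line such as "LOCUS ACN78416 225 aa linear BCT 21-MAR-2009" would be reported as protein, while the same line with "bp" in place of "aa" would be reported as nucleotide.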
Re: [EMBOSS] Unknown output format 'refseqp' and 'genpept'
On Tue, Dec 8, 2009 at 2:11 PM, Peter Rice wrote:
>
>> With hindsight this may have been a mistake, but we use "genbank"
>> format to mean either nucleotides or proteins. On parsing we just
>> look at the units of length in the LOCUS line (bp or aa). We also
>> try to cope with both the current NCBI files and some older variants
>> we have in our unit tests (different offsets in the LOCUS line).
>
> We try that too on input, but for output we have to be explicit so the user
> can pick just one of the choices.

I imagine that as with Biopython, sometimes the user has made it explicit that they are dealing with nucleotides or proteins (lots of the EMBOSS tools have switches for this), so you know if you should be using "aa" or "bp" in the LOCUS line.

Peter
Re: [EMBOSS] Genpept entry in MSE
On Tue, Dec 15, 2009 at 12:14 PM, Steve Taylor wrote: > > Hi, > > I am trying to load a Genpept entry into MSE, EMBOSS Version 6.0.1 on > Fedora. Unfortunately it doesn't like the LOCUS line. > > It loads, but warns: > > Warning: bad Genbank LOCUS line 'LOCUS ACN78416 225 aa > linear BCT 21-MAR-2009' > > Changing the aa to bp fixes it. What command line did you use? If you specified format "genbank", I think you should use format name "genpept" or "refseqp" instead: http://emboss.sourceforge.net/docs/themes/SequenceFormats.html Peter ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
Re: [EMBOSS] Genpept entry in MSE
On Tue, Dec 15, 2009 at 3:26 PM, Steve Taylor wrote: > > I didn't specify any format. I assumed it would pick it up... Emboss is normally pretty good at deducing file formats, so I would have expected it to cope too. > However, I still get the error if I use > > mse -sformat1 genpept -sequence ACN78417.pep > > Is this what you mean? Probably - although I don't think I have ever used mse myself. Hopefully an EMBOSS developer can enlighten us. Peter C. ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
Re: [EMBOSS] Macports/EMBOSS dyld issue
On Tue, Dec 15, 2009 at 8:16 PM, Tom Keller wrote: > > Hi, > I'm running Mac OS X 10.6, and have EMBOSS 6.0.1 installed via MacPorts. And > I have macport installed jpeg.7.dylib at /opt/local/lib/ > > But I get the following error: > $ wossname wossname > dyld: Library not loaded: /opt/local/lib/libjpeg.62.dylib > Referenced from: /opt/local/bin/wossname > Reason: image not found > Trace/BPT trap > > I tried making a link from jpeg.7.dylib to /opt/local/lib/libjpeg.62.dylib > but then I get the error: > > dyld: Library not loaded: /opt/local/lib/libjpeg.62.dylib > Referenced from: /opt/local/bin/wossname > Reason: Incompatible library version: wossname requires version 63.0.0 or > later, but libjpeg.62.dylib provides version 8.0.0 > Trace/BPT trap > > Can someone suggest a solution? > > Thomas (Tom) Keller > kellert at ohsu.edu > 503.494.2442 > 6339b R Jones Hall (BSc/CROET) > www.ohsu.edu/xd/research/research-cores/dna-analysis/ That looks like two problems, you seem to have libjpeg 62.x.x which is too old, but also EMBOSS (or dyld) isn't reporting the same kind of version number. Do you (or MacPorts) have a libjpeg.63.dylib file you could try? [I've never tried this - this is an informed guess at best] Peter ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
Re: [EMBOSS] getorf includes unspecified amino acids as part of the ORF sequence
On Mon, Jan 11, 2010 at 2:26 PM, Fungazid wrote: > > Hello people, > > I just installed emboss on linux ubuntu (using the ubuntu synaptic package > manager). I am using the getorf program, and I see it gives me this kind of > output lines: > >>1_3 [803 - 1120] > LARLRFVVLGNSFIASAKGWSTPYGPTTFGPFRSCIYPRVFRSTRVRKAMATRIGSNRVN > ILIRCTXNPYLGWWCYIFCIFR > > I don't like the Xs as they represent unspecified amino acids. Is there an > input parameter to tell the program to report only the regions before and > after the Xs ? > > In addition (and maybe this is beyond the scope of this mailing list) what is > the biological meaning of such Xs ? What was the input sequence like? Was there a stretch of N perhaps? Peter ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
Re: [EMBOSS] Many-to-many with needle and water
On Mon, Jul 6, 2009 at 10:35 AM, Peter Rice wrote:
>
> Peter Cock or biopython wrote:
>> Hi Peter R. et al,
>>
>> I gather EMBOSS is looking for feedback for new applications (given
>> the recent funding from the BBSRC - congratulations again). How about
>> suggestions for extensions to existing EMBOSS applications?
>>
>> I've used bits of EMBOSS for several years now (thank you!). Something
>> I have sometimes wanted to do is a many-to-many pairwise sequence
>> alignment with the EMBOSS tools needle and water.
>>
>> Right now, needle and water take two files (here referred to as A and
>> B), file A has just one sequence, and file B can have one or more
>> sequences. I'd like to be able to supply two files both with multiple
>> entries, and have needle/water do pairwise alignments between all the
>> sequences in A against all the sequences in B. This might be useful
>> for finding reciprocal best hits in comparative genomics (as a slower
>> but exact alternative to FASTA or BLAST).
>
> The application is easy to add (after the release)
>
> The usual problem with all-against-all is that it involves loading one
> of the inputs as a sequence set entirely in memory - to avoid reading
> one input many times over.
>
> We have an application supermatcher which does this - the first sequence
> is streamed through, the second is a sequence set loaded into memory. It
> uses word matching to find seed alignments then runs a limited alignment
> around the hits.
>
> superwater would be a possible name (or superneedle).

I see EMBOSS 6.2 has a new tool "needleall" (although if there is a matching "waterall" the changelog doesn't mention it):
http://lists.open-bio.org/pipermail/emboss/2010-January/003823.html

I'll have to try this out...

Peter
[EMBOSS] Broken links on Emboss webpages
Hi, I was just looking for the EMBOSS EMBASSY documentation for the PHYLIPNEW packages, and noticed they are missing from this page: http://emboss.sourceforge.net/embassy/ Perhaps this should redirect to the latest release? i.e. http://emboss.sourceforge.net/apps/release/6.2/embassy/index.html I also found the links on this page seem to be broken: http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/phylogeny_molecular_sequence_group.html Regards, Peter ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
[EMBOSS] ABI to FASTQ with seqret
Hi all,

I've got some "Sanger" capillary sequence files in ABI trace file format, which I understand includes the probabilities of the 4 bases along the sequencing run. I'd like to extract this as a FASTQ file with meaningful quality scores based on the trace data (for use in assembly).

This doesn't seem to work - the FASTQ quality score characters are all double quotes (ASCII 34), meaning PHRED quality 1.

seqret -sformat abi -osformat fastq-sanger -sequence example.ab1 -outseq example.fastq -auto

Output as FASTA seems fine:

seqret -sformat abi -osformat fasta -sequence example.ab1 -outseq example.fasta -auto

Is ABI to FASTQ a reasonable thing to expect seqret to support? If so, could it be added to the TODO list please?

Peter C.

P.S. I'd be interested to hear suggestions for alternative tools to tackle this conversion.
Re: [EMBOSS] ABI to FASTQ with seqret
On Tue, Mar 30, 2010 at 1:02 PM, Peter Rice wrote:
>
> On 30/03/2010 12:46, Peter C. wrote:
>>
>> Hi all,
>>
>> I've got some "Sanger" capillary sequence files in ABI trace file
>> format, which I understand includes the probabilities of the 4 bases
>> along the sequencing run. I'd like to extract this as a FASTQ file
>> with meaningful quality scores based on the trace data (for use in
>> assembly).
>>
>> This doesn't seem to work - the FASTQ quality score characters are all
>> double quotes (ASCII 34), meaning PHRED quality 1.
>
> I will take a look. I don't recall anyone using the quality scores from ABI
> data when we first implemented it (at that time Staden Experiment files
> were the only supported output format with any quality scores)
>

Thanks Peter,

Regarding other possible tools, there is the obvious choice of PHRED (although getting a copy is non-trivial), and based on this thread:
http://seqanswers.com/forums/showthread.php?t=3165

I've just tried TraceTuner 3.0.6beta which is open source (specifically, GPL v2 or later):
https://sourceforge.net/projects/tracetuner/

Using the ttuner -nocall option to reuse the sequence as-is from the ABI file results in zero quality scores. Allowing ttuner to re-call the bases (the default), it can output FASTA/QUAL/PHD with meaningful qualities (from which I can easily make a FASTQ file).

Peter C.
Re: [EMBOSS] ABI to FASTQ with seqret
On Tue, Mar 30, 2010 at 2:25 PM, Peter Rice wrote: > > On 30/03/2010 14:13, Peter Rice wrote: > >> Where do I look to find scores that we can use (and how do we convert >> those to phred quality scores)? > > Aha, found something. The field is called PCON (confidence values), with > values 0-255. > > There is a possibility that these could be phred scores, but I suspect they > are whatever the basecaller has decided to write there. > > http://www.appliedbiosystems.com/support/software_community/ABIF_File_Format.pdf > > Peter R. Hmm. Good question - I don't know, although if they are PHRED scores they could go unusually high (we'd expect say 0 to 50 for a raw read). It could be some other encoding (e.g. scaled from 0 for a poor base to 255 for a perfect base). Do you have any contacts at Applied Biosystems to ask? Peter C. ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
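If the PCON values did turn out to be PHRED scores, which is exactly the open question here and is only an assumption in this sketch, writing them out as a Sanger FASTQ quality string would be a simple offset of 33. The Python below is illustrative only; the function name is made up, and the cap at 93 is there because Sanger FASTQ cannot encode anything above ASCII 126.

```python
def pcon_to_sanger_quality(pcon_values):
    """Encode per-base confidence values as a Sanger FASTQ quality string.

    ASSUMPTION: the values are PHRED scores. Values run from 0 to 255 in the
    ABI file, but Sanger FASTQ tops out at PHRED 93 (ASCII 126), so anything
    higher is capped.
    """
    chars = []
    for value in pcon_values:
        if not 0 <= value <= 255:
            raise ValueError("confidence value out of range: %r" % value)
        chars.append(chr(min(value, 93) + 33))
    return "".join(chars)
```

If the values are instead on some other scale (e.g. 0 for a poor base to 255 for a perfect base), a rescaling step would be needed first and this encoding alone would be misleading.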
Re: [EMBOSS] ABI to FASTQ with seqret
On Tue, Mar 30, 2010 at 2:33 PM, Zheng Jin Tu wrote: > > > Hi Peter: > > You may want to check this URL about how to > convert quality score: > > http://maq.sourceforge.net/fastq.shtml > > Thanks, TU Thanks - but that just covers converting between PHRED scores and Solexa Scores. Peter Rice and I are well aware of this. The question here is what do the numbers in ABI files mean? Peter C. ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
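For reference, the PHRED and Solexa scales mentioned here are related by Q_solexa = 10 log10(10^(Q_phred/10) - 1) and, in the other direction, Q_phred = 10 log10(10^(Q_solexa/10) + 1). A minimal Python sketch of both directions follows; the function names are arbitrary, and the results are left as floats rather than rounded to integers.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa quality score to the equivalent PHRED score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred_to_solexa(q_phred):
    """Convert a PHRED quality score to the equivalent Solexa score.

    PHRED 0 has no finite Solexa equivalent, so it is rejected here; a real
    converter would have to pick a floor value (Solexa scores start at -5).
    """
    if q_phred <= 0:
        raise ValueError("PHRED score must be positive")
    return 10 * math.log10(10 ** (q_phred / 10) - 1)
```

The two scales agree closely for good bases and only diverge at the low end, which is why the distinction matters mainly for poor quality reads.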
[EMBOSS] Nucleotide dotplots with EMBOSS
Hello EMBOSS team,

I've just been using dottup to produce dot plots comparing two nucleotide sequences (two assemblies), where I have regions of very high similarity but some inversions.
http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/dottup.html

I've noticed that I can tell dottup to reverse either of the sequences, but I would actually like it to search for both forward matches AND reverse matches and display them on the same plot (ideally using different colours). Is this possible already, or might it be a reasonable feature request? Right now I can generate one plot with the forward matches, and a second plot with the reverse matches - not ideal.

Thanks,

Peter C.

P.S. While I'm asking, I'd also like (colour) PDF output, since working with PDF files is much easier on the Mac than postscript (which thankfully is trivial to convert into PDF - so this isn't a big issue for me).
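To make the request concrete, here is a rough Python sketch of collecting exact word matches on both strands so that they could be drawn on one plot in two colours. It says nothing about how dottup is implemented; the function names and the default word size are invented, and only unambiguous A, C, G, T bases are handled.

```python
def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def word_matches(seq_a, seq_b, word=10):
    """List (i, j) where seq_a[i:i+word] == seq_b[j:j+word]."""
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    matches = []
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            matches.append((i, j))
    return matches


def forward_and_reverse_matches(seq_a, seq_b, word=10):
    """Return (forward, reverse) match lists in forward-strand coordinates.

    Reverse matches are found against the reverse complement of seq_b and
    then mapped back, so j is the leftmost position of the word on seq_b.
    """
    forward = word_matches(seq_a, seq_b, word)
    rc_b = reverse_complement(seq_b)
    reverse = [(i, len(seq_b) - j - word) for i, j in word_matches(seq_a, rc_b, word)]
    return forward, reverse
```

Plotting the forward list as diagonals rising to the right and the reverse list as diagonals falling to the right would show an inversion as a line running the opposite way to its neighbours.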
Re: [EMBOSS] Nucleotide dotplots with EMBOSS
On Wed, Apr 7, 2010 at 12:01 PM, Peter Rice wrote: > Sounds like a reasonable request. We will look into it. > ... > Should be possible with plplot. We will look into adding PDF to the possible > output devices. Great. Thanks, Peter ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
[EMBOSS] EMBOSS eprimer3 and latest primer3_core
Hello EMBOSS team,

I'm using EMBOSS 6.2.0 on Mac OS X 10.6.3 Snow Leopard:

$ embossversion
Reports the current EMBOSS version number
6.2.0

I need to design some primers so I wanted to try the EMBOSS tool eprimer3, which as your documentation clearly explains requires me to install the 'primer3' program from the Whitehead Institute (specifically the primer3_core tool).

I downloaded and compiled the latest version of primer3, version 2.2.2 beta (using the default, i.e. just "make", which seems to be fine - the Snow Leopard specific Makefile failed). It seems that EMBOSS eprimer3 does not like this:

$ export EMBOSS_PRIMER3_CORE="/Users/xxx/Downloads/Software/primer3-2.2.2-beta/src/primer3_core"
$ eprimer3 fasta::lupine.nu lupine.eprimer3
Picks PCR primers and hybridization oligos
Error: Missing SEQUENCE tag

Instead, I downloaded and compiled primer3 version 1.1.4 (using the defaults, i.e. just "make", there is no Snow Leopard specific Makefile included) and that seems to work:

$ export EMBOSS_PRIMER3_CORE="/Users/xxx/Downloads/Software/primer3-1.1.4/src/primer3_core"
$ eprimer3 fasta::lupine.nu lupine.eprimer3
Picks PCR primers and hybridization oligos

The eprimer3 output looks sensible too.

My guess is that something in the recent primer3 alpha and beta releases of 2.x.x has changed since version 1.x.x and that EMBOSS needs to be updated to cope. Is this a known issue?

Thanks,

Peter C.
[EMBOSS] EMBOSS eprimer3 and ambiguous DNA
Hello again,

I just ran eprimer3 on a multiple FASTA file (using published genome sequences), and noticed a couple of messages:

"Error: Unrecognized base in input sequence"

Additionally, for two of the sequences there were no primer pairs (just some blank lines instead). These appear to correspond to two of the sequences in my input which had IUPAC ambiguous characters in the sequence (e.g. R, W, Y, N).

The eprimer3 documentation does say explicitly that for some input files such characters are converted into N (options -mispriminglibraryfile and -mishyblibraryfile). What is supposed to happen if a sequence in the main input file has such characters? I would expect to still get back a candidate set of primers (even if they do not cover the regions with ambiguous letters).

As an experiment I added an N character to the end of an unambiguous sequence, and eprimer3 seemed happy. So, as a workaround I've simply replaced all the ambiguous characters (like R, W and Y) with N, and it seems to work. Maybe eprimer3 could do this for me, or at least have this limitation mentioned in the documentation?

Thanks,

Peter C.
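The workaround described above (replacing ambiguous bases with N before running eprimer3) can be sketched in a few lines of Python. This is only an illustration of the preprocessing step, not something EMBOSS provides; the function names `mask_ambiguous` and `mask_fasta` are made up for this example, and it assumes plain FASTA input with `>` header lines.

```python
import re

# IUPAC ambiguity codes other than N (which eprimer3 seemed happy with).
_AMBIGUOUS = re.compile(r"[RYSWKMBDHVryswkmbdhv]")


def mask_ambiguous(seq: str) -> str:
    """Replace IUPAC ambiguity codes with N (or n), leaving A/C/G/T/N alone."""
    return _AMBIGUOUS.sub(lambda m: "n" if m.group().islower() else "N", seq)


def mask_fasta(lines):
    """Yield FASTA lines with sequence lines masked and '>' headers untouched."""
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield mask_ambiguous(line)


if __name__ == "__main__":
    for out in mask_fasta([">example\n", "ACGTRWYN\n"]):
        print(out, end="")
```

Header lines are passed through unchanged so that identifiers containing letters such as R or Y are not altered.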
Re: [EMBOSS] ABI to FASTQ with seqret
On Tue, Mar 30, 2010 at 2:56 PM, Peter wrote: > On Tue, Mar 30, 2010 at 2:25 PM, Peter Rice wrote: >> >> On 30/03/2010 14:13, Peter Rice wrote: >> >>> Where do I look to find scores that we can use (and how do we convert >>> those to phred quality scores)? >> >> Aha, found something. The field is called PCON (confidence values), with >> values 0-255. >> >> There is a possibility that these could be phred scores, but I suspect they >> are whatever the basecaller has decided to write there. >> >> http://www.appliedbiosystems.com/support/software_community/ABIF_File_Format.pdf >> >> Peter R. > > Hmm. Good question - I don't know, although if they are PHRED scores > they could go unusually high (we'd expect say 0 to 50 for a raw read). > It could be some other encoding (e.g. scaled from 0 for a poor base to > 255 for a perfect base). Do you have any contacts at Applied Biosystems > to ask? > > Peter C. > Hello again Peter R (& everyone else at EMBOSS), Did you manage to find out if the PCON confidence values in ABI files are PHRED quality scores or not? Regards, Peter C. ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
Re: [EMBOSS] ABI to FASTQ with seqret
On Thu, Apr 22, 2010 at 4:22 PM, Peter Rice wrote:
>
> On 22/04/2010 16:06, Peter C. wrote:
>
>> Hello again Peter R (& everyone else at EMBOSS),
>>
>> Did you manage to find out if the PCON confidence values in ABI files
>> are PHRED quality scores or not?
>
> Yes ... and maybe.
>
> The first scores are written by the ABI basecaller.
>
> A second set can be written by any basecaller. These may be phred quality
> scores but could in theory be anything.
>
> EMBOSS will assume they are phred scores as there is no way to tell
> otherwise.
>
> regards,
>
> Peter Rice

Does this mean there is an updated seqret in a public repository where I can convert an ABI file to FASTQ taking the ABI basecaller's sequence and PHRED scores? I'd be interested to test that... or a patch against EMBOSS 6.2.0.

Thanks,

Peter C.
Re: [EMBOSS] ABI to FASTQ with seqret
On Thu, Apr 22, 2010 at 6:01 PM, Peter Rice wrote:
>> Does this mean there is an updated seqret in a public repository where I
>> can convert an ABI file to FASTQ taking the ABI basecaller's sequence
>> and PHRED scores? I'd be interested to test that... or a patch against
>> EMBOSS 6.2.0.
>
> It is in the latest CVS code and will appear in the July release.

Thanks Peter,

I tried to grab this from the anonymous CVS mirror as per the EMBOSS documentation here:

http://emboss.sourceforge.net/developers/cvs.html

Unfortunately it failed:

$ cvs -d :pserver:c...@cvs.open-bio.org:/home/repository/emboss login
Logging in to :pserver:c...@cvs.open-bio.org:2401/home/repository/emboss
CVS password:
cvs login: authorization failed: server cvs.open-bio.org rejected access to /home/repository/emboss for user cvs

I know there have been VM problems on this machine (also known as code.open-bio.org) which have intermittently been affecting the anonymous SVN access for other projects like BioPerl.

One short term solution would be to give my OBF username peterc access to the master EMBOSS CVS repository on dev.open-bio.org (joke), or look into an external mirror - for example BioPerl are using github (and seriously talking about moving from SVN to git).

This is going even more off topic but since ViewCVS broke a while back, I've found it much harder to browse the EMBOSS source code :(

Regards,

Peter C.
Re: [EMBOSS] Tranalign relaxation?
On Wed, May 26, 2010 at 7:50 PM, Justin Havird wrote: > > Hi, > > I am trying to align nucleic acid sequences based on amino acid alignments > using the program tranalign. The program normally works fine for me, but > lately I have been using mitochondrial genes and am beginning to run into > problems. > > These occur when the nucleotide sequence does not match the amino acid > translation exactly. For example, in the prawn M. japonicus, the first amino > acid (MET) in the COX1 gene is encoded by the codon "ACG" rather than the > typical "ATG". Tranalign doesn't recognize ACG as encoding MET, so it throws > up this message: > > Error: Guide protein sequence M. japonicus not found in nucleic sequence M. > japonicus > > These errors occur on a taxa by taxa basis and are usually because of the > first codon. However, the error also occurs when the nucleotide sequence has > an ambiguous nucleotide (e.g., Y), even if the ambiguous nucleotide position > doesn't affect the translation (e.g., both GTC and GTT = VAL). I can usually > pinpoint the error to a specific nucleotide/codon like in these examples. > > These errors are relatively rare, but happen more frequently in some groups > (inverts and fishes mostly). > > So, does anyone know a way to "relax" the tranalign translation rules to > circumvent this problem? Or have another program/solution? Hi Justin, This might be a silly question, but have you used the tranalign argument -table to specify which genetic code table to use? I'd guess you probably want the Vertebrate Mitochondrial Code instead of the Standard Code. Peter C. ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
[EMBOSS] Counting the number of sequences in a file
Hi all,

Is there a tool in EMBOSS to just count the number of sequences in a file? For simple file formats like FASTA or GenBank I'd typically just use grep:

$ grep -c "^LOCUS " gbvrt1.seq
31065

However, this becomes more complicated for general file formats (e.g. FASTQ files where in addition to identifiers the quality lines can also start with @) or binary files like BAM which EMBOSS now supports.

Right now I could handle this by using seqret to convert the file into FASTA and then pipe that through grep to count the records. But an EMBOSS tool would be more elegant, e.g.

$ countseq -sformat=genbank gbvrt1.seq
31065

For the implementation you might offer the choice between using the normal EMBOSS parsing (as in seqret) versus file format specific regular expression searches which just look for marker lines (without checking validity) which should be really fast.

Regards,

Peter C.
Re: [EMBOSS] Counting the number of sequences in a file
On Tue, Jul 20, 2010 at 6:04 PM, Peter Rice wrote:
>
> On 20/07/10 17:27, Peter C. wrote:
>> $ countseq -sformat=genbank gbvrt1.seq
>> 31065
>
> Of course, you could just use:
>
> $ seqret -filter -sformat=genbank gbvrt1.seq | grep -c '^>'
> 31065
>
> :-)
>

Exactly what I had in mind as the workaround ("handle this by using seqret to convert the file into FASTA and then pipe that through grep to count the records"), although I'd not thought about the fact that FASTA is the default output format which keeps it nice and short.

The (Unix) command line can be great :)

Peter C
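To illustrate why a plain grep for "@" is not enough for FASTQ, here is a minimal Python sketch. It is not an EMBOSS tool; the function names are invented for this example, and it assumes the common unwrapped FASTQ layout of exactly four lines per record (wrapped FASTQ would need a real parser).

```python
import io


def count_fastq_records(handle) -> int:
    """Count records in an unwrapped (four lines per record) FASTQ stream."""
    count = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        handle.readline()          # sequence line
        plus = handle.readline()   # separator line
        handle.readline()          # quality line (may itself start with '@')
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("input does not look like four-line FASTQ")
        count += 1
    return count


def count_lines_starting_with_at(handle) -> int:
    """The naive grep-style count, shown only for comparison."""
    return sum(1 for line in handle if line.startswith("@"))


# Two records; the first quality line happens to start with '@'.
EXAMPLE = "@read1\nACGT\n+\n@@II\n@read2\nTTGA\n+\nIIII\n"

if __name__ == "__main__":
    print(count_fastq_records(io.StringIO(EXAMPLE)))
    print(count_lines_starting_with_at(io.StringIO(EXAMPLE)))
```

On the two-record example the record-aware count and the naive line count disagree, which is exactly the complication mentioned in the original question.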
Re: [EMBOSS] ABI to FASTQ with seqret
On Thu, Apr 22, 2010 at 6:01 PM, Peter Rice wrote:
>
> On 22/04/2010 16:48, Peter Cock wrote:
>
>> Does this mean there is an updated seqret in a public repository where I
>> can convert an ABI file to FASTQ taking the ABI basecaller's sequence
>> and PHRED scores? I'd be interested to test that... or a patch against
>> EMBOSS 6.2.0.
>
> It is in the latest CVS code and will appear in the July release.
>

Hi Peter R et al,

I've just compiled and installed EMBOSS 6.3.1 on Mac OS X, and had a go converting some ABI (extension .ab1) files from our in house sequencing service to FASTQ - so far all the examples give Sanger FASTQ quality strings of "!" (ASCII 33, PHRED quality zero) or Illumina FASTQ quality strings of "@" (ASCII 64, again PHRED quality zero).

I remember you saying ABI files can have two sets of quality scores, so perhaps my files have one set all of PHRED zero?

I tried to find some 3rd party example files via Google, for example on http://www.elimbio.com/sequencing_sample_files.htm they have a zip file http://www.elimbio.com/Forms/pGEM.zip containing one ABI file. The output of this is more interesting:

$ seqret -sformat abi -osformat fastq -auto -stdout -sequence pGEM_\(ABI\)_A01.ab1
@pGEM_(ABI)
NANTCTATAGGCGAATTCGAGCTCGGTA...GNN
+
"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"...!"!"!"

I truncated this for brevity. Here the quality string repeats ASCII 34, ASCII 33 (PHRED quality 1, quality 0) which is rather strange. The sequence appears to agree with the provided file pGEM_(ABI)_A01.seq

Have I just been unlucky with the AB1 files that I have looked at? Thus far all the quality scores seem meaningless.

Peter C.
Re: [EMBOSS] ABI to FASTQ with seqret
On Thu, Jul 22, 2010 at 12:16 PM, Peter wrote:
> Have I just been unlucky with the AB1 files that I have looked at? Thus
> far all the quality scores seem meaningless.

I went back through my old emails, and see you had been testing with http://www.appliedbiosystems.com/support/software_community/ab1_files.zip (I had trouble downloading this with curl - Firefox worked).

Looking at these ABI files with seqret as FASTQ does seem to give meaningful quality scores. Curious.

Peter C.
Re: [EMBOSS] transeq and ambiguous codons
Hi again, Now that I have installed the latest and greatest version, EMBOSS 6.3.1, I'm revisiting some old issues I had with EMBOSS. In this case 'unambiguous ambiguous codons' and other translation issues. On Fri, Jul 10, 2009 at 10:14 AM, Peter C. wrote: > On Thu, Jul 9, 2009 at 10:08 AM, Peter Rice wrote: >> >> Peter C. wrote: >>> However, consider the codon TRR. R means A or G, so this can mean TAA, >>> TGA, TAG or TGG which translate to stop or W (both EMBOSS and the NCBI >>> standard table agree here). Therefore the translation of TRR should be >>> "* or W", which I would expect based on the above examples to result >>> in "X". But instead EMBOSS transeq gives "*": >> >> This is a side effect of the way backtranslation works... > > OK, leaving TRR aside for the moment (I'm not sure I'd have done it that > way, but I think I follow your logic), I have some more problem cases for > you to consider (all using the default standard NCBI table 1). > > Most of these are 'unambiguous ambiguous codons' as you put it, and > I would agree using X when a more specific letter is possible isn't ideal > but isn't actually wrong. The "ATS" and related codons (see below) > however are simply wrong. > > -- > > TRA means TAA or TGA, which are both stop codons. Therefore TRA > should translate as a stop, not as an X: > > $ transeq asis:TAATGATRA -stdout -auto -osformat raw > **X Same on EMBOSS 6.3.1, shouldn't TRA translate as stop? > -- > > Now look at YTA, which means CTA or TTA which encode L, so > YTA should be L not X: > > $ transeq asis:CTATTAYTA -stdout -auto -osformat raw > LLX Same on EMBOSS 6.3.1, giving X instead of specific amino acid (i.e. YTA is an "unambiguous ambiguous codon" for L) > Likewise for YTG and YTR, and YTN. I haven't re-checked these. 
> -- > > Another example, ATW means ATA or ATT, which both translate as I, > so ATW should translate as I not X: > > $ transeq asis:ATAATTATW -stdout -auto -osformat raw > IIX Same on EMBOSS 6.3.1, giving X instead of specific amino acid (i.e. ATW is an "unambiguous ambiguous codon" for I) > -- > > Conversely, ATS which means ATC or ATG which translate as I and M. > Remember S means G or C. Therefore ATS should translate as X, and > not I: > > $ transeq asis:ATCATGATS -stdout -auto -osformat raw > IMI Same on EMBOSS 6.3.1, giving potentially wrong amino acid instead of X. > Likewise H means A, G or C, so ATH shows the same bug, as do some > other AT* codons: > > $ transeq asis:ATAATCATGATH -stdout -auto -osformat raw > IIMI > > [*** This one strikes me as a clear bug ***] Same on EMBOSS 6.3.1, giving potentially wrong amino acid instead of X. As I noted before, this list is only partial, and only for the standard table. I could compile a much longer list of oddities using the Biopython translation as a reference if you wanted. Regards, Peter C. ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
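The rule being argued for in this thread (expand each ambiguous codon into all the unambiguous codons it could stand for, and report a specific letter only if they all agree) can be sketched as follows. This is an illustration in Python, not how transeq is implemented; it covers only the standard NCBI table 1, and the names `translate_codon` and `translate` are made up for the example.

```python
from itertools import product

# IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code (NCBI table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def translate_codon(codon: str) -> str:
    """Return the amino acid (or '*') if every expansion agrees, else 'X'."""
    expansions = product(*(IUPAC[base] for base in codon.upper()))
    results = {STANDARD_TABLE["".join(e)] for e in expansions}
    return results.pop() if len(results) == 1 else "X"


def translate(seq: str) -> str:
    """Translate complete codons only; trailing partial codons are ignored."""
    usable = len(seq) - len(seq) % 3
    return "".join(translate_codon(seq[i:i + 3]) for i in range(0, usable, 3))


if __name__ == "__main__":
    for example in ("TAATGATRA", "CTATTAYTA", "ATAATTATW", "ATCATGATS"):
        print(example, translate(example))
```

Under this rule TRA comes out as a stop, YTA as L, ATW as I, and ATS and TRR as X. Note that in the IUPAC codes used here H stands for A, C or T, so ATH (ATA, ATC or ATT) comes out as I.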
Re: [EMBOSS] ABI to FASTQ with seqret
On Thu, Jul 22, 2010 at 1:28 PM, Peter Rice wrote:
>
> On 22/07/10 12:22, Peter C. wrote:
>
>>> I truncated this for brevity. Here the quality string repeats ASCII 34, ASCII 33
>>> (PHRED quality 1, quality 0) which is rather strange. The sequence appears
>>> to agree with the provided file pGEM_(ABI)_A01.seq
>>>
>>> Have I just been unlucky with the AB1 files that I have looked at? Thus
>>> far all the quality scores seem meaningless.
>
> There are two sets of quality scores in that file. Both are the
> alternating characters 1 and 0. Adding 33 gives the scores you see.
>
> Looks as though EMBOSS is just reporting what it finds.
>
> The file offset is the value returned by function
> ajSeqABIGetConfidOffset. It simply reads one byte from there for each
> base of sequence length.

Looks like that particular random example from the internet was just odd.

>> I went back through my old emails, and see you had been testing with
>> http://www.appliedbiosystems.com/support/software_community/ab1_files.zip
>> (I had trouble downloading this with curl - Firefox worked). Looking at these
>> ABI files with seqret as FASTQ does seem to give meaningful quality scores.
>> Curious.
>
> It should look for a PCON tag in the file and pick up the second of two,
> or the first if there is only one.
>
> Can anyone on the list enlighten us further on what is intended for the
> quality scores in these example files?

The pGEM example I have no idea - I just found it with Google.

I can send you a couple of our locally produced AB1 files off list if you wouldn't mind having a look at them. It may be that however these are being generated there simply are no useful scores inside.

Peter
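The "adding 33" step discussed above is just the FASTQ quality encoding: each score is stored as the character whose ASCII code is the score plus an offset (33 for Sanger FASTQ, 64 for the older Illumina variant mentioned earlier in the thread). A minimal Python sketch, with function names invented for this example and no claim to match the EMBOSS source:

```python
def phred_to_fastq(scores, offset: int = 33) -> str:
    """Encode integer quality scores as a FASTQ quality string."""
    chars = []
    for q in scores:
        code = q + offset
        # Keep the result within printable ASCII ('!' to '~').
        if q < 0 or code > 126:
            raise ValueError(f"score {q} cannot be encoded with offset {offset}")
        chars.append(chr(code))
    return "".join(chars)


def fastq_to_phred(quality: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string back into integer scores."""
    return [ord(c) - offset for c in quality]


if __name__ == "__main__":
    # Raw confidence bytes alternating 1 and 0, as reported for the pGEM file.
    print(phred_to_fastq(bytes([1, 0, 1, 0])))
    print(phred_to_fastq([0, 0], offset=64))
```

Iterating over a `bytes` object yields integers, so raw confidence bytes read from a file can be passed in directly.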
Re: [EMBOSS] ABI to FASTQ with seqret
On Thu, Jul 22, 2010 at 5:33 PM, Tom Keller wrote: > Greetings, > The latest versions of the ABI basecaller does indeed give quality scores. I suspect the problem is my ABI files were not created using the latest ABI basecaller then. Do you have any more details (e.g. which version)? I've sent a couple of *.ab1 files off list to Peter Rice to confirm they really don't have quality scores. Tomorrow I will try and find out who to contact locally about the base calling, and what version of the base caller they have. Peter ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
Re: [EMBOSS] emboss stand-in for fasta
On Sat, Jul 24, 2010 at 11:56 AM, Ingo P. Korndoerfer wrote:
> could anybody help me out with what to use as a stand-in for fasta ?
>
> fasta by itself is fine, but under windows there is no way to make fasta
> accept filenames with spaces. neither "" nor """" nor '' seem to alleviate
> the problem.

You are talking about Bill Pearson's FASTA command line tools, right?

Have you tried wrapping the filename with double quote characters, "like this.fasta", which usually works on Windows. If not, I'd also try escaping with a backslash, "like\ this.fasta", just in case.

> so i was hoping emboss would have something (which would also save
> me having to install fasta on all of our pcs).
>
> what i need to do is run a sequence against an in house library and
> return me the top hit in alignment.

Sounds like BLAST might be a sensible choice to me - it works fine on Windows, although I'm not sure about filenames with spaces.

Personally I avoid filenames with spaces - they just cause trouble. Can't you rename things before calling FASTA? e.g. Write a wrapper script for FASTA to turn spaces into underscores?

Peter
Re: [EMBOSS] Jemboss/EMBOSS can't find the external clustalw binary
On Wed, Aug 11, 2010 at 2:31 PM, Nigel Binns wrote: > > Hi All, > > Please can anyone tell me why my installation of Jemboss (EMBOSS v6.3.1 > patch v1-4) can't find the external clustal binary and how to correct this. > When I run a multiple sequence alignment using emma, I get the following > output: > > Died: emma uses external program 'clustalw' which is not in the PATH or > defined as EMBOSS_CLUSTALW > > I can confirm that my jemboss.properties file > ($EMBOSS_ROOT/share/EMBOSS/jemboss/resource/jemboss.properties) correctly > points to the root directories of the clustalw and primer3 binaries e.g.: > > embossPath=/path/to/clustal:/path/to/primer3 Have you got clustalw 1.x or 2.x installed? The binary names differ, clustalw.exe versus clustalw2.exe (no extension on Unix/Linux), and perhaps EMBOSS only expects the former? Have you tried setting the EMBOSS_CLUSTALW variable? Peter ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
Re: [EMBOSS] Jemboss/EMBOSS can't find the external clustalw binary
On Thu, Aug 12, 2010 at 9:17 AM, Nigel Binns wrote: > > Hi Peter, > > Many thanks for your reply. I have ClustalW v2 installed (v2.0.12 - the > latest release). The binary is named clustalw2. However, as I understand it, > when running the Jemboss installation script, you are asked to provide the > root directory that contains the clustalw binary rather than the name of the > actual binary i.e /path/to/clustal/root/ rather than > /path/to/clustal/root/clustalw2 or have I got that wrong? I was suggesting the problem could be EMBOSS only looks for clustalw and not clustalw2. > The same issue applies to my installation of Primer3 ( latest release - > 3-2.2.2-beta). The binary name is primer3_core. I get this error when I try > to run eprimer3. > > Error application terminated > > Died: eprimer3 uses external program 'primer3_core' which is not in the > PATH or defined as EMBOSS_PRIMER3_CORE > Part of the 'primer3' package, version 3.0, available from the > Whitehead Institute. See: http://primer3.sourceforge.net/ > > Please can you tell me what file I should set the EMBOSS_CLUSTALW and > EMBOSS_PRIMER3_CORE variables in. They are just environment variables (set up in the OS), but I haven't ever used Jemboss so don't know how it would handle this. > > Many thanks for your help. > > Nigel Peter C. ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
[EMBOSS] Keep feature in union of GenBank files
Hi all, Prompted by this thread on seqanswers.com I tried using EMBOSS 6.3.1 union to merge multiple GenBank format records (in a single file) into a single GenBank record with the concatenated sequence. This worked, but the output file has no features: http://seqanswers.com/forums/showthread.php?t=7812 e.g. union -sequence many.gbk -sformat genbank -outseq merged.gbk -osformat genbank -auto Is support for features something that could be added to union please? Thanks, Peter ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
Re: [EMBOSS] Keep feature in union of GenBank files
On Mon, Nov 15, 2010 at 10:57 AM, Peter wrote: > Hi all, > > Prompted by this thread on seqanswers.com I tried using EMBOSS 6.3.1 > union to merge multiple GenBank format records (in a single file) into a > single GenBank record with the concatenated sequence. This worked, > but the output file has no features: > > http://seqanswers.com/forums/showthread.php?t=7812 > > e.g. > > union -sequence many.gbk -sformat genbank -outseq merged.gbk -osformat > genbank -auto > > Is support for features something that could be added to union please? > Thanks to Nick Loman for the seqanswers.com thread for pointing out this functionality is present but must be enabled explicitly with "-feature Y". Apologies for the noise. Peter P.S. Why isn't this the default? ___ EMBOSS mailing list EMBOSS@lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/emboss
Re: [EMBOSS] Transeq question, frame phases
On Wed, Feb 16, 2011 at 8:54 PM, David Mathog wrote:
> Test case fasta file
>>8Achars
>
>
> all 6 frames for transeq, standard mode emits:
>>_1
> KKX
>>_2
> KKX
>>_3
> KK
>>_4
> FF
>>_5
> FFX
>>_6
> FFX
>

Note you can do that with a single command line:

$ transeq asis: -filter -frame 6
>asis_1
KKX
>asis_2
KKX
>asis_3
KK
>asis_4
FF
>asis_5
FFX
>asis_6
FFX

Note that while using 1, 2, 3 for the forward frames is well defined, there are two conventions for the reverse frame - do you start from the left or the right?

First let's just do the forward frames,

$ transeq asis: -filter -frame 1
>asis_1
KKX

$ transeq asis: -filter -frame 2
>asis_2
KKX

$ transeq asis: -filter -frame 3
>asis_3
KK

Are you happy with them?

Now let's do that with the reverse complement strand:

$ transeq asis: -filter -frame 1
>asis_1
FFX

$ transeq asis: -filter -frame 2
>asis_2
FFX

$ transeq asis: -filter -frame 3
>asis_3
FF

Now let's do that with the original sequence but the negative frames:

$ transeq asis: -filter -frame -3
>asis_6
FFX

$ transeq asis: -filter -frame -2
>asis_5
FFX

$ transeq asis: -filter -frame -1
>asis_4
FF

Same results - perhaps the naming isn't as you expected?

Peter
Re: [EMBOSS] Transeq question, frame phases
On Thu, Feb 17, 2011 at 4:30 PM, David Mathog wrote:
>
>
>> Now let's do that with the reverse complement strand:
>>
>> $ transeq asis: -filter -frame 1
>> >asis_1
>> FFX

This is what I think that does (forward frames are easy):

Frame 1, so starts at first base:
Letters 123, codon TTT, gives F
Letters 456, codon TTT, gives F
Letters 78, partial codon TT-, gives X

>> $ transeq asis: -filter -frame 2
>> >asis_2
>> FFX

Frame 2, so starts at second base:
Letter 1, just T, ignored
Letters 234, codon TTT, gives F
Letters 567, codon TTT, gives F
Letter 8, partial codon T--, gives X

>> $ transeq asis: -filter -frame 3
>> >asis_3
>> FF

Frame 3, so starts at third base:
Letters 12, bases TT, ignored
Letters 345, codon TTT, gives F
Letters 678, codon TTT, gives F

> That is the problem. Let me try to explain more clearly what the issue is.
>
> That is, if the meaning of the + phases is to define the three codons
> a,b,c as shown in the diagram, such that the forward translation is as
> shown, then the reverse translation should be as shown above in
> expected. That is, it is the translation of the exact same set of
> codons done individually, but for the - strand reverse complement the
> codon first, and then invert the resulting translated sequence. That
> way the X, where it occurs is attached to the same partial codon "c".

I couldn't understand your diagram - probably font spacing issues in part.

The EMBOSS tool is doing all six frames, maybe all you need to work out is the mapping between its naming and yours.

Note that it can make sense to translate a trailing partial codon, e.g. TC... could be TCA, TCC, TCG or TCT which all code for S:

$ transeq asis:TCN -filter
>asis_1
S

$ transeq asis:TC -filter
>asis_1
S

Peter
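The letter-by-letter walk-through above can be reproduced with a short Python sketch. It assumes the test sequence is eight A's (as the "8Achars" header suggests; the sequence itself is not shown in the archived commands), uses the standard genetic code only, and simplifies by always reporting a trailing partial codon as X, whereas transeq reports S for a trailing TC as shown above. The function names are made up for this example and it accepts unambiguous A/C/G/T input only.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def translate_frame(seq: str, frame: int) -> str:
    """Translate forward frame 1, 2 or 3, skipping the first frame-1 bases.

    Simplification: any leftover one or two bases at the end give 'X'.
    """
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    seq = seq.upper()[frame - 1:]
    usable = len(seq) - len(seq) % 3
    protein = [CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3)]
    if len(seq) % 3:
        protein.append("X")  # trailing partial codon
    return "".join(protein)


if __name__ == "__main__":
    forward = "AAAAAAAA"  # assumed test sequence
    reverse = reverse_complement(forward)
    for frame in (1, 2, 3):
        print(frame, translate_frame(forward, frame), translate_frame(reverse, frame))
```

Translating the reverse complement in frames 1 to 3 is the "start from the right" convention; the other convention keeps the forward codon boundaries and simply reads them on the opposite strand, which can move the partial codon to the other end.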
Re: [EMBOSS] Problem indexing PDB fasta file
Enrique de Andres Saiz wrote:
> I have been looking at the PDB fasta file and I see that, for the previous
> warning, there is an entry whose id is '1FNT_A' and another one whose
> id is '1FNT_a'. This makes me think that EMBOSS is
> case-insensitive. Is this true? Is there any way to distinguish between
> the two ids?

Yes, EMBOSS is case-insensitive. So is the Staden/EMBLCD indexing standard that dbifasta uses. The standard also only allows one entry with each ID.

dbxfasta uses a new indexing format and can index both entries, but will still assume the names are the same (a search for 1FNT_A or 1FNT_a will return both entries).

Allowing indexing to be case-sensitive is possible in future, but can slow down searches. We will investigate.

Hope that helps,

Peter
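Before indexing, it can be useful to know which IDs will collide under case-insensitive matching. The following Python sketch is only a pre-check on a list of IDs (it does not read or write any EMBOSS index), and the function name is invented for this example.

```python
from collections import defaultdict


def case_insensitive_collisions(ids):
    """Group IDs that differ only by letter case.

    Returns a dict mapping the upper-cased ID to the distinct spellings seen,
    for those IDs that have more than one spelling.
    """
    groups = defaultdict(list)
    for identifier in ids:
        spellings = groups[identifier.upper()]
        if identifier not in spellings:
            spellings.append(identifier)
    return {key: names for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    print(case_insensitive_collisions(["1FNT_A", "1FNT_a", "1ABC_B"]))
```

Exact repeats of the same spelling are not reported here, since they are genuine duplicates rather than case collisions.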
Re: [EMBOSS] dbifasta index file format
Graziano P. wrote: hello EMBOSS users, I have some databases in fasta format (ncbi | format) and I want to index them using dbifasta, then I want to access the index files using a program that will be developed by a computer scientist of my group. I need to index the databases by accession number, ginumber and description. I have read in the dbifasta help info about the structure of the index files when the databases were indexed by accession number, but I have not found info about the structure of the index files when the databases are indexed by description. Anyone knows where I can find detailed information about the structure of the index files? Ciao Graziano, The dbifasta index files use the same format as the Staden package, the old EMBL CD-ROM distribution, and Erik Sonnhammer's "efetch" utility. They were documented in some old Staden documentation and papers. They are also documented in the EMBOSS distribution under doc/manuals/ in file internals-indexing.txt (see attached). I see that this document was written before we indexed the descriptions!!! The description (title) indexing is the same as the accession number indexing. The files are called des.hit and des.trg. dbifasta has a -maxindex option to limit the size of the longest words indexed (the index files have a value for the maximum record length). We also have a script in the distribution scripts/dbilist.pl which can list the contents of the description index (in the database index directory, run it as dbilist.pl des) The new dbxfasta index files are very different. For very large databases we recommend dbxfasta. For smaller databases dbifasta is fine and we will continue to support it. Hope that helps. If you need more details, just ask. regards, Peter EMBOSS database indexing The main index format is the named EMBLCD after its use in the CD-ROM distribution of the EMBL database. It is basically the Staden format, but we used an alternative name to allow some freedom to extend it. 
The intention was to keep compatibility with the Staden package. EMBOSS comes close to this, but no site seems to depend on using a common set of indices in both packages and there is no test plan so some small differences probably break this for now.

All index files have a header block of 300 bytes. The first 44 bytes contain:

int4 filesize
int4 record count
int2 record size
ch20 database name
ch10 database release
int4 date

This is followed, for no apparent reason, by 256 bytes of padding which EMBOSS fills with spaces. There is room here for any additional data EMBOSS may need.

Note the "record size" header field, used to seek individual records in the index files. It requires all strings in the index to be padded to the length of the longest string - not a problem for ID or accession, but a big problem for a des index. May be worth investigating a different format which has a separate offset file, needing only to rename the "X.trg" file to "X.str" and to add an "X.bin" file which can be easily created from the "X.str" file with a list of (ajlong) offsets.

For each database there is a "division lookup" file division.lkp which lists all the data files. Each division (think of EMBL or GenBank) can have up to 2 files (Staden's format allows for GCG databases, which use the NBRF format split into REF and SEQ files, as used for many years by the PIR database).

All entries in the database must have a unique ID, which is stored in the "entryname.idx" file as the ID string, the file number, and the offsets in each of the two data files.

Other index files (at present, only the accession numbers) have two files. The X.trg file lists the known values in sorted order, and has two numbers: the number of entries in the X.hit files, and the offset to the first entry in the X.hit file. The X.hit file has a simple list of offsets (record numbers) in the entryname.idx file.

Building these files uses temporary output files with lists of all values (accessions) and their IDs.
These are then sorted by value and by ID, and compared to the sorted list of IDs to build the index files. Naturally, a full index of descriptions could be rather large, especially if long words are allowed as each text string in the X.trg file must be padded out to the length of the longest string in the index. The natural solution for EMBOSS would be to limit the length of an index field for the description index, and possibly to restrict the maximum number of times a word can appear or at least to exclude certain common terms. Keywords are less of a problem because there are a limited number of them. To add further fields to database indexing, the indexing and query mechanisms for accession numbers need to be made into discrete functions, and the simple accession number structures need to be part of a general data structure for all field
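
For anyone writing their own reader for these index files, the header layout described in the attached document (44 bytes of fields followed by 256 bytes of padding) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the byte order (little-endian), the use of unsigned integers, and the function names parse_index_header and record_offset are not taken from the EMBOSS sources, and the 4-byte date is left undecoded because the text does not say how it is laid out.

```python
import struct

# Layout from the description: int4, int4, int2, ch20, ch10, int4 = 44 bytes,
# then 256 bytes of padding, 300 bytes in total.
# ASSUMPTION: little-endian ("<") with no alignment; use ">" for big-endian files.
HEADER_FORMAT = "<IIH20s10s4s"
HEADER_FIELDS_SIZE = struct.calcsize(HEADER_FORMAT)  # 44
HEADER_BLOCK_SIZE = 300


def parse_index_header(block: bytes) -> dict:
    """Decode the fixed fields at the start of a 300-byte index header block."""
    if len(block) < HEADER_BLOCK_SIZE:
        raise ValueError("header block must be at least 300 bytes")
    filesize, count, recsize, dbname, release, date = struct.unpack(
        HEADER_FORMAT, block[:HEADER_FIELDS_SIZE]
    )
    return {
        "filesize": filesize,
        "record_count": count,
        "record_size": recsize,
        "database_name": dbname.decode("ascii", "replace").rstrip(" \0"),
        "database_release": release.decode("ascii", "replace").rstrip(" \0"),
        "date_raw": date,  # kept as raw bytes; the internal layout is not assumed
    }


def record_offset(header: dict, index: int) -> int:
    """Byte offset of record number `index` (0-based), given fixed-size records."""
    return HEADER_BLOCK_SIZE + index * header["record_size"]
```

Because every record has the same size, a lookup is a single seek to record_offset(header, n). That is also why one very long description word inflates the whole des.trg file.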
Re: [EMBOSS] Problems with GenBank indexing
Natalia Jimenez Lozano wrote:
> I was looking for an explanation to this behaviour and I've found that
> skipped IDs correspond to CDS from genomic sequences and have this format:
>
> >gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]
> MELPDVPVWRRVIVSAFFEALTFNIDIEEERSEIMMKTGAVVSNPRSRVKWDAFLSFQRDTSHNFTDRLY...
> >gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]
> MSVVLQITKDWVQALLGFLLLSFANISTRTNHKHFPHGSCSSIMAGFWIYMYIYSYLFITLKIIDLTS...

As Jon says, dbxfasta is a solution. However, that is only a partial solution. The real problem is that these FASTA format sequences do indeed have duplicate IDs.

This is protein sequence data, so it is not GenBank - was this GenPept or some other database? GenPept and other databases have been known to report "gb" or "emb" as the database for protein sequences!!!

A possible solution is to add a new ID format to dbifasta and dbxfasta that uses AAG13419 and AAF79863 as the ID and ignores the AC000348_16 part.

Hope this helps,

Peter
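
Until such an ID format exists, the idea can be tried outside EMBOSS. The sketch below is not the dbifasta parser; the regular expression and the names accession_id, locus_id and duplicated are assumptions made for illustration. It shows that the last defline field repeats across the two records above while the accession (without its version) does not.

```python
import re
from collections import Counter

# ASSUMPTION: deflines look like "gi|<number>|<db>|<accession>.<version>|<locus>".
DEFLINE = re.compile(r"^>?gi\|(\d+)\|(\w+)\|([A-Za-z0-9_]+)(?:\.(\d+))?\|(\S*)")


def _match(defline: str) -> "re.Match[str]":
    m = DEFLINE.match(defline)
    if m is None:
        raise ValueError("not a gi|...|db|accession|locus defline: " + defline)
    return m


def accession_id(defline: str) -> str:
    """Accession without the version suffix, e.g. AAG13419."""
    return _match(defline).group(3)


def locus_id(defline: str) -> str:
    """The final bar-separated field, e.g. AC000348_16."""
    return _match(defline).group(5)


def duplicated(ids):
    """Return the IDs that occur more than once, in sorted order."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


deflines = [
    ">gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]",
    ">gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]",
]
print(duplicated(locus_id(d) for d in deflines))
print(duplicated(accession_id(d) for d in deflines))
```

Running duplicated() over a whole FASTA file before indexing is a quick way to see which ID choice would collide.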
Re: [EMBOSS] Fwd: EMBOSS for Windows without Cygwin
Duleep Samuel wrote:
> Is the latest EMBOSS version 3.0.0.0 available anywhere as a precompiled
> binary for Windows XP, I have tried compiling using cygwin and it
> crashed, I loaded EMBOSS for windows which is a port of version 2.10.0,
> loaded Staden Package and made Spin aware of EMBOSS and am working, but
> feel bad that I am _One_ whole release behind, If anyone has a compiled
> binary I can download for testing and report back on useability,
> regards, Samuel, Virologist, India

Staden has support for older versions of EMBOSS. We are trying to update Staden to work with EMBOSS 3.0.0 and future releases.

If anyone is using EMBOSS and Staden (especially EMBOSS under the Staden SPIN interface) please contact the EMBOSS developers ([EMAIL PROTECTED]) so we know how many EMBOSS SPIN users there are. It helps to set priorities for the work.

regards,

Peter
Re: [EMBOSS] nt-multi-fastA-file
Christiane Nerz wrote:
> Hi all,
>
> I put the gb-file of an whole genome in Artemis.
> Is there a possibility to export a multi-FastA-file with the bases of
> all ORFs? Example:
>
> >ORF_1
> ATGTGTTCGTT
> >ORF_2
> ATGTTCCCGACCA...
> >ORF_3
> ATGCCGCAT...
>
> I know how to get all bases, but only as one complete sequence.
> (That genome is not published yet, so there is no multi-Fasta-file at
> ncbi or EMBL available)

Yes, the coderet program will do this.

Unfortunately coderet tries to return CDS, mRNA and translations all in one file (to be fixed for the next release).

You can ask just for the CDS with a couple of extra command line options:

  coderet -nomrna -notranslation

Give it the filename as input. The output will be the coding sequences.

With -nocds instead of -notranslation you will get the protein sequences.

If you have any problems parsing the GenBank file let me know.

regards,

Peter Rice
[EMBOSS] EMBOSS Funding News
EMBOSS will be funded by the UK Biotechnology and Biological Sciences Research Council (BBSRC) for the next 3 years.

EBI has issued the following press release, also available from: http://www.ebi.ac.uk/Information/News/pdf/Press25Apr06-small.pdf

The EMBOSS team would like to thank all our users and developers for their patience over the past two years.

regards,

Peter Rice
Alan Bleasby
Jon Ison

A brighter future for Europe’s favourite molecular biology software package

New funding for EMBOSS – Europe’s leading suite of molecular biology analysis tools – guarantees open access for researchers and software developers

Hinxton, 25 April, 2006 – EMBOSS, the European Molecular Biology Open Software Suite, has received a vital funding boost from the UK Biotechnology and Biological Sciences Research Council (BBSRC) that will guarantee its continued maintenance under an open source license for the next three years. This ends two years of uncertainty over the future of the project.

Until recently, EMBOSS was hosted by the Medical Research Council’s Rosalind Franklin Centre for Genomics Research (RFCGR), where it was funded jointly by the BBSRC and the Medical Research Council (see ‘notes for editors’ for more information on the history of EMBOSS). With the announcement in April 2004 of the RFCGR’s closure, the future of EMBOSS hung in the balance.

The new funding from the BBSRC means that EMBOSS co-founders Peter Rice and Alan Bleasby will be able to continue the EMBOSS project at the EMBL-EBI for the next three years. EMBOSS will remain freely available from emboss.sourceforge.net and anyone who wants to develop it further will have access to its source code.

‘We’re delighted that the BBSRC has recognized EMBOSS as an important tool for molecular biology’ says project leader Peter Rice.
‘The EMBOSS user community has been very patient, and it highlights a great benefit of open source software that even users in industry have continued to rely on EMBOSS despite the uncertainty about its future. This simply could not have happened if EMBOSS had been a commercial package under threat.’ EMBOSS provides a powerful package of around 300 applications for molecular biology and bioinformatics analysis. Molecular biologists use EMBOSS at all stages of their research, from planning experiments to analysing results. It also has an application-programming interface (API) that enables software developers to write their own EMBOSS applications. These can readily be strung together, allowing users to create ‘workflows’ that automate complex and time-consuming tasks. EMBOSS has also been used in many commercial software developments and is included in commercial bioinformatics systems. Its flexibility has made it an obvious core component of several data integration and bioinformatics infrastructure projects, including myGrid and EMBRACE. The new funding also provides helpdesk support for EMBOSS’s users. ‘As well as helping researchers with limited bioinformatics expertise to make the most of EMBOSS, we will be able to provide better support and documentation to the estimated 20% of our users who are also software developers’, explains Alan Bleasby. ‘We will encourage these experts to contribute their code to the project. In return, we will make their software widely available through the EMBOSS website and provide ongoing user support for it. 
This mechanism will help to ensure that EMBOSS evolves according to the needs of its users.’

Contact:

Cath Brooksbank PhD, EMBL-EBI Scientific Outreach Officer, Hinxton, UK, Tel: +44 1223 492 552, www.ebi.ac.uk, [EMAIL PROTECTED]

Anna-Lynn Wegener, EMBL Press Officer, Heidelberg, Germany, Tel: +49 6221 387 452, www.embl.org, [EMAIL PROTECTED]

Notes for editors – a brief history of EMBOSS

EMBOSS, an open source suite of tools for the analysis of biological data, has its origins in the late 1980s when Peter Rice, a co-founder of EMBOSS, was working at EMBL. Encouraged by his colleagues in the lab, he began to write extensions to the GCG package, which at that time provided its source code to users. His efforts evolved into EGCG (extended GCG) and Rice moved to the Sanger Centre (now the Wellcome Trust Sanger Institute) to continue its development. However, the changes to the source code licensing of GCG in 1996 put an end to further development of EGCG. Recognizing the importance of free source code to the rapid and cost-effective development of bioinformatics tools, Rice, in collaboration with Alan Bleasby (then at SEQNET, Daresbury, UK) began working on a new suite of open-source bioinformatics tools – the EMBOSS project – in 1996.

EMBOSS has been funded by: the Wellcome Trust (1997–2000); the BBSRC and MRC (2001–2004); and through two posts at the MRC Rosalind Franklin Centre for Genomics Research following a merger with BBSRC’s SEQNET facility in 1998. After the closure of RFCGR in July 2005, EMBOSS moved to the
Re: [EMBOSS] New EMBL release
Wells, Isabelle wrote:
> Hi All,
>
> EMBL release 87 has just been made available and changes to the entry ID
> line were made. Did anyone install it and index the files with dbiflat?
> I am just wondering whether the change in ID line structure causes
> problems.

There are some small changes needed. We will produce patch files next week for 3.0.0 (the 4.0.0 code in CVS already works).

We waited to see a full release before making the patches, in case there are any surprises.

I will send an announcement to this list when the patches are tested and copied to emboss.open-bio.org

regards,

Peter
Re: [EMBOSS] display of long ensembl and vega identifiers in alignments
Hans Rudolf Hotz wrote:
> A few months back, I played around with the source code and changed one
> of the library files (ajalign.c). This now allows the display of up to 20
> characters, by using a new output format "pairln" for sequence alignment
> programs, like matcher or needle. This is in comparison to the default
> which displays only the first 6 characters, or "pair" which displays the
> first 13 characters, eg:

We can make the ID arbitrarily long for a "new" alignment format. We will need formats similar to the existing matcher and needle outputs to avoid breaking too many existing parsers (I remember when NCBI changed the use of a blank at the start of each line of blast output and almost all parsers had to change). The formats are easy to make (as you found out) from the existing ones.

We need to decide what to do with the standard alignment formats that have 6 characters in their definition (I assume this goes back to the days of PIR database identifiers when FASTP was first written). As we cannot fit many of the existing identifiers, we can make up unique identifiers for these (truncate the identifier, and make the names unique if they match).

Or, should we change the existing formats to allow longer IDs? What do the authors of parsers think?

regards,

Peter
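
To make the "truncate and make unique" option concrete for parser authors, here is one possible scheme sketched in Python. It is only an assumption about how this could work, not what ajalign.c does: names are cut to the field width, and a clashing name has its tail replaced by a running counter. The function name short_unique_ids and the sample identifiers are illustrative.

```python
def short_unique_ids(names, width=6):
    """Truncate each name to `width` characters, renaming on collision.

    ASSUMPTION: a clash is resolved by overwriting the end of the truncated
    name with a decimal counter (1, 2, 3, ...), keeping the total width.
    """
    used = set()
    result = []
    for name in names:
        candidate = name[:width]
        counter = 1
        while candidate in used:
            suffix = str(counter)
            if len(suffix) >= width:
                raise ValueError("too many clashing names for this width")
            candidate = name[: width - len(suffix)] + suffix
            counter += 1
        used.add(candidate)
        result.append(candidate)
    return result


# Long identifiers that share their first six characters.
print(short_unique_ids(["ENSG00000139618", "ENSG00000141510", "ENSG00000012048"]))
```

The obvious cost is that the short names no longer identify the entry by themselves, so any output using them would also need to list the mapping back to the full identifiers.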
[EMBOSS] EMBOSS 4.0.0 Latest Fixes
I have posted some further fixes on the EMBOSS FTP site. None are critical.

Users have been reporting interesting bugs. Some were also in release 3.0.0.

The fuzznuc, fuzzpro and fuzztran reports were changed in 4.0.0 to always report something. Unfortunately users running searches over the whole database found their output files were very large. We have changed the way reports work as follows:

1. fuzznuc, fuzzpro and fuzztran again report only sequences with hits

2. when a report is closed, a default header and footer are written (solving the problem of empty output files)

3. for sites that had concerns about searches for trivial patterns taking too long and generating too much output, reports have 2 new associated qualifiers. -rmaxall limits the total number of matches reported (fuzznuc, fuzzpro and fuzztran terminate when the limit is reached), -rmaxseq limits the maximum number of hits for one sequence.

We also have various fixes for reporting matches on the reverse strand, and for improved parsing of FASTA file IDs.

To update your EMBOSS 4.0.0 release, go to: ftp://emboss.open-bio.org/pub/EMBOSS/fixes/

File README.fixes (see below) lists the files and describes the fixes. Copy the files to the indicated directories and reinstall.

regards,

Peter Rice

file README.fixes 25-aug-2006

The files in this directory are bugfix replacements for files in the EMBOSS-4.0.0 distribution. Just drop the replacement files in the location shown and redo the 'make install'.

Fix 1. EMBOSS-4.0.0/nucleus/embpatlist.c

31 Jul 2006: Fixes a problem with searching for patterns and regular expression in the reverse strand of nucleotide sequences. The change is to use ajSeqReverseForce (always reverses the sequence provided) instead of ajSeqReverseDo (which only reverses if the reverse flag is set)

9 Aug 2006: Revised to also fix a problem with reverse strand sequence positions.

Fix 2. EMBOSS-4.0.0/ajax/ajfile.c

31 Jul 2006: This fixes a bug where deleting the last line of buffered input fails to reset the pointer to the last buffered line. This only affected debug traces. Unfortunately, the ajFileBuffClear function does call the debug trace. In practice we have only seen this bug when processing sequence data in EMBL format from an MRS server.

Fix 3. EMBOSS-4.0.0/ajax/ajnam.c

31 Jul 2006: New database access methods MRS and DBFETCH need to be explicitly turned on so that showdb can report them.

Fix 4. EMBOSS-4.0.0/ajax/ajseqdb.c

31 Jul 2006: The new MRS access method used a general search. This gave strange results when the ID or accession appeared in any other entry. It appears that MRS can search for id or accession only. This worked on the main MRS server at least. MRS access will be further extended in the next release. Please contact the developers [EMAIL PROTECTED] if you would like to help test new features in MRS access.

25 Aug 2006: Further change to allow multiple %s replacements in complex URLs for access method URL. Needed for complex SRS queries to resolve EMBL IDs so the following definition can be used for EMBL (warning, the URL may wrap badly in this email!)

  DB embl [
    method: "url"
    format: "embl"
    type: "N"
    url: "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-noSession+-ascii+-vn+2+-e+[embl-id:%s]|[embl-acc:%s]|([emblidacc-id:%s]>embl)"
    comment: "EMBL from SRS including old IDs"
  ]

Fix 5. EMBOSS-4.0.0/configure

07 Aug 2006: Fix configuration problem on Intel Mac machines. Make sure this file is executable (chmod 755 configure) after downloading it.

Fix 6. EMBOSS-4.0.0/ajax/ajseq.c

09 Aug 2006: Return correct USA for "asis::" sequence input.

Fix 7. EMBOSS-4.0.0/emboss/dreg.c

09 Aug 2006: Correct sequence positions on the reverse strand.

Fix 8. See Fix 13

Fix 9. See Fix 13

Fix 10. EMBOSS-4.0.0/doc/programs/html/banana.1.banana.gif EMBOSS-4.0.0/doc/programs/html/tcode.2.tcode.gif

14 Aug 2006: These graphics example outputs were missing from the distribution. When you run make install they will be copied to the installed documentation.

Fix 11. EMBOSS-4.0.0/emboss/merger.c EMBOSS-4.0.0/emboss/needle.c EMBOSS-4.0.0/emboss/prophet.c EMBOSS-4.0.0/emboss/water.c

14 Aug 2006: These programs calculate an internal path size from the lengths of the input sequences. For sequences that are too long, a fatal error is produced. But if the sequences are extremely long, the test failed and the program gave a segmentation fault. This fix tests in a different way that will catch all cases.

Fix 12. See Fix 13

Fix 13. EMBOSS-4.0.0/ajax/ajacd.c EMBOSS-4.0.0/ajax/ajfeat.c EMBOSS-4.0.0/ajax/ajfeat.h EMBOSS-4.0.0/ajax/ajreport.c EMBOSS-4.0.0/ajax/ajreport.h EMBOSS-4.0.0/emboss/fuzznuc.c EMBOSS-4.0.0/emboss/fuzzpro.c EMBOSS-4.0.0/emboss/fuzztran.c

21 Aug 2006: This provides new qualifiers to l
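
The interplay of the two report limits from the announcement (-rmaxseq per sequence, -rmaxall overall) can be sketched as below. This is a model of the behaviour as described in the message, not the ajreport.c code; limit_hits and its arguments are hypothetical names.

```python
def limit_hits(hits_by_sequence, rmaxseq=None, rmaxall=None):
    """Yield (sequence_id, hit) pairs, honouring per-sequence and overall caps.

    hits_by_sequence: iterable of (sequence_id, list_of_hits).
    A limit of None means "no limit". Sequences without hits yield nothing.
    """
    total = 0
    for seq_id, hits in hits_by_sequence:
        per_seq = 0
        for hit in hits:
            if rmaxall is not None and total >= rmaxall:
                return  # overall limit reached: the whole search stops
            if rmaxseq is not None and per_seq >= rmaxseq:
                break  # this sequence is full: continue with the next one
            yield seq_id, hit
            per_seq += 1
            total += 1


data = [("s1", [1, 2, 3]), ("s2", []), ("s3", [4, 5])]
print(list(limit_hits(data, rmaxseq=2, rmaxall=3)))
```

Note the difference in effect: hitting the per-sequence limit only skips the rest of that sequence, while hitting the overall limit ends the run.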
Re: [EMBOSS] iep program for multiple protein sequences
Tao Song wrote:
> Hi,
>
> I wonder can the iep program that calculates the isoelectric point of
> a protein be used for a protein database? When asked to input protein
> sequence I gave 'tsw' instead of 'tsw:laci_ecoli' I got an error that
> said 'sequence must be protein sequence without BZ U X or *: found bad
> character Z'. Does iep can only take one protein sequence as input file?

Your command does read the test swissprot database, but fails on an entry that is a sequence fragment with a Z ambiguity code.

For the next release, I have a patch that will convert B and Z to D/N and E/Q using the Dayhoff frequencies of naturally occurring amino acids. This will convert the first B or Z to a charged residue (as these are more common), the second to an uncharged residue, and so on.

With this change in place iep can be modified to accept any protein sequence and will produce consistent results on ambiguity codes.

A question: We can try this fix as a general solution for programs requiring "pureprotein" input, by converting any B or Z (or J) ambiguity code. Is this useful? For iep the order does not matter and the converted sequence does not appear in the output, but I think a program-by-program solution is better.

Other programs insisting on "pureprotein" input are hmoment, octanol and pepwindow

regards,

Peter
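
The alternating conversion described above (first occurrence to the charged residue, second to the uncharged one, and so on) can be sketched as follows. This is not the patch itself; in particular, keeping a separate alternation for B and for Z is an assumption, as is the function name resolve_bz.

```python
from itertools import cycle


def resolve_bz(sequence: str) -> str:
    """Replace B with D, N, D, ... and Z with E, Q, E, ... in turn.

    ASSUMPTION: B and Z alternate independently of each other, and only
    upper-case codes are converted.
    """
    choices = {"B": cycle("DN"), "Z": cycle("EQ")}
    return "".join(next(choices[r]) if r in choices else r for r in sequence)


print(resolve_bz("ABZBZB"))
```

For a charge-based calculation such as iep only the counts of D/N and E/Q matter, which is why the order of replacement has no effect on the result there.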
Re: [EMBOSS] EMBOSS-Explorer Follow-up
Ryan Golhar wrote:
> So I stepped through the code for tfm and it looks like it initially
> looks in /usr/share/EMBOSS/doc/programs/html. So 'make install' is
> putting the html docs in /usr/share/EMBOSS/doc/html/emboss/apps/... But
> why? Was this an inadvertent change?

Oops. /usr/share/EMBOSS/doc/html/emboss/apps/ is the new location in 4.0.0 (so we do not have to keep copies of all the EMBASSY application documentation in the EMBOSS source).

Will be fixed in 4.0.0. A simple copy is one way to fix it. I will make a fix for tfm.

regards,

Peter
Re: [EMBOSS] EMBOSS-Explorer Follow-up
Ryan Golhar wrote:
> So I stepped through the code for tfm and it looks like it initially
> looks in /usr/share/EMBOSS/doc/programs/html. So 'make install' is
> putting the html docs in /usr/share/EMBOSS/doc/html/emboss/apps/... But
> why? Was this an inadvertent change?

Aha ... tfm works, but tfm -html may fail.

If the program fails to find the html file, it will check the original distribution directory. Unfortunately, if it does find an html file ... it may be from version 3.

I forgot about the tfm -html option when we moved the files.

A fix will take a few days to test. EMBASSY html documentation is not under the embassy package, so TFM will have to check the ACD file to find the EMBASSY package name. Easy enough - several other programs do it - but needs quite a few tests to make sure it does it correctly in all cases.

regards,

Peter
Re: [EMBOSS] [Mrs-user] case sensitive identifiers
Guy Bottu wrote:
> My idea is to let the MRS parser store 1fnt_aLC
> (LC means lowercase) as identifier. A user can then search for the
> sequence he needs in MRS and in EMBOSS (if the EMBOSS installation uses
> MRS as databank access mechanism) ask for the sequence pdbprot:1fnt_alc.
> This would of course also work with 1fnt_a_12835 but it avoids the use of
> a meaningless and irreproducible number. Anybody a comment ?

Not a general solution, but for PDB chains you could use an extra underscore for the lower case ones.

For EMBOSS ... well, we could play with the way databases work. Not all access methods allow case sensitive searching, but we could fetch all entries and try to reject those that do not match.

This would need something in the EMBOSS id. We already allow modifiers after the id to set sequence ranges

  pdbprot:1fbt_a[1:20]

or we could add a qualifier -scasesensitive for all sequence inputs.

Peter
Re: [EMBOSS] case sensitive identifiers
Guy Bottu wrote:
> For the moment our emboss.default contains :
>
> DB pdbprot [ type: P format: fasta comment: 'protein sequences from PDB'
> methodquery: app app: "/nfsben/srs/bin/linux73/getz -e '[pdbprot-id:%s]'"
> methodall: direct dir: /nfsben/srs/data/blast/dbfb/pdb file: pdb
> ]

That raises a new problem ... the "app" method will work, but "srs" and "srswww" will not. They search for a pdbprot-acc match and there is no acc field.

I will add a new database attribute hasaccession (default "Y") so searches know whether the acc field can be used. Unfortunately the fields attribute is defined as "everything except id and acc" so I cannot use it.

So, there will be 2 new (and for the first time boolean) attributes for databases. To use them, you will need:

  caseidmatch: "Y"
  hasaccession: "N"

These will also be the first to use the default values for database attributes! All other default values are empty strings :-)

regards,

Peter
Re: [EMBOSS] Question regarding seqret
Jean Mao wrote:
> Hi,
> I have a question hopefully someone can help me about it.
>
> I downloaded the gbvrt1.seq file from ftp://ftp.ncbi.nih.gov/genbank/ as a
> test, gunzip and index it with dbxflat (I know it's not
> than 2gb):
>
> % dbxflat -dbname=testdb -dbresource=embl -idformat=gb -directory=.
> -fields='id,acc,sv,des' -filenames='gbvrt*.seq' -indexoutdir=. -release=0.0
> -date='00/00/00'
>
> Then I run 'seqret' but failed to retrieve entries using 'sv' or 'des' fields:

I didn't see an answer to this one, but I suspect you have already figured it out.

dbxflat and dbiflat will have created the sv and des indices. You have to edit the database definition in emboss.default to say the fields exist.

  fields: "sv des"

then seqret and other programs will know they can use them.

Yes, in theory seqret could work out what indices are available for a dbxflat or dbiflat indexed database - but it would be more difficult for an SRS or SRSWWW database (for example) so we depend on the database definitions.

Hope that helps,

Peter
Re: [EMBOSS] extracting noncoding regions
Hi Shrish,

Shrish Tiwari wrote:
> Hi!
> Is there a way of extracting the noncoding regions of a genome using an
> EMBOSS program?

That is a simple change to coderet to return non-coding sequence (exclude the CDS and mRNA features).

Does anyone else want this? We can do it for the next release.

regards,

Peter
Re: [EMBOSS] showfeat troubles
Hi Shrish,

Shrish Tiwari wrote:
> Hi!
> I used the following command to extract only positions of CDS from gbk files:
> showfeat -pos -matchtype CDS -width 0
> But I noticed that the program does not extract positions of CDS that lie on
> the complementary strand, e.g. CDS complement(5683..6459) did not
> show up in the resultant file. Any ideas on how I can get showfeat to extract
> these positions too.

It worked for me, but reports these as 5683..6469 (without -width 0 it will show the arrow in the reverse direction)

Can you try running entret on the same genbank entry, and sending the output file to [EMAIL PROTECTED] so we can take a look at it.

regards,

Peter Rice
Re: [EMBOSS] Batch retrieval of taxonomy/species names using entret.....
Hi Richard,

Richard Rothery wrote:
> I am interested in using entret to retrieve single field entries from
> swissprot or sptrembl. Specifically, I would like to feed entret a list
> of accessions and have it return a file with the species names and/or
> taxonomies. I intend to use this information to compare with my
> phylogeny analyses of clustalw alignments.

EMBOSS stores the full text in entret without parsing. We could try to extract specific fields but it is not easy to define them for all formats.

You can do this with SRS. Try the EBI server for example:

  Go to the library page
  Select UniProtKB/SwissProt (or UniProtKB/TrEMBL)
  Select "standard query form"
  Enter your query in the top part (e.g. accession number)
  In the "create a view" section click the "list" button to get the original lines.
  Select anything taxonomic from the pull down list (control-click to select more than one)
  Press "search".
  Refine your query.

You will see the URL at the top that can be used to retrieve data when you are happy.

Failing that, you could just parse out the ID and O* lines from entret using a simple perl script.

Hope that helps,

Peter
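
The "parse out the ID and O* lines" suggestion is short enough to sketch. Peter mentions perl; the version below uses Python instead, purely as an illustration. It assumes the entret output is in Swiss-Prot flat-file format, with two-letter line codes, "ID" starting each entry and "//" ending it. The function name species_by_entry and the sample entries are invented.

```python
import re

# Any two-letter line code starting with "O" (OS, OC, OG, OX, ...).
O_LINE = re.compile(r"^(O[A-Z])\s+(.*)$")


def species_by_entry(flatfile_text: str) -> dict:
    """Map each entry name to its list of (line_code, text) O* lines."""
    result = {}
    entry_id = None
    for line in flatfile_text.splitlines():
        if line.startswith("ID "):
            entry_id = line.split()[1]
            result[entry_id] = []
        elif line.startswith("//"):
            entry_id = None
        elif entry_id is not None:
            m = O_LINE.match(line)
            if m:
                result[entry_id].append((m.group(1), m.group(2)))
    return result


# Invented sample entries, for illustration only.
sample = """ID   DEMO1_ECOLI    STANDARD;      PRT;   360 AA.
AC   X00001;
OS   Escherichia coli.
OC   Bacteria; Proteobacteria.
SQ   SEQUENCE   10 AA;
     MKPVTLYDVA
//
ID   DEMO2_HUMAN    STANDARD;      PRT;    10 AA.
OS   Homo sapiens (Human).
//
"""
for name, lines in species_by_entry(sample).items():
    print(name, lines)
```

The input would be the file written by entret for the list of accessions; sequence lines start with spaces, so they cannot be mistaken for O* lines.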
Re: [EMBOSS] IDs in output
Hi Bernd,

Bernd Web wrote:
> Hi,
>
> Sometimes I use an EMBOSS command directly on a FastA file.
> I wonder if it is possible to select the ID used in the output, esp
> for FastA records with an NCBI defline.
>
> >gi|248166|g|AA21972.1| description...
>
> in the output of an EMBOSS command becomes:
> AA21972.1|
>
> It would be very easy if the ID could be chosen to be the GI number.
> Now the ID used depends on the GI record (sp, pdb, pir) show different
> IDs in EMBOSS output.

Did you mistype the defline? There is a defined set of database names that can appear in NCBI deflines. If the "|g|" is really "gb" then the ID will be AA21972 which is what I would expect.

If the database name is invalid (or a new one unknown to EMBOSS) then we could try to use the GI number, but the "EMBOSS way" would be to use the accession number from the sequence version. Unfortunately at present it is using the last part of sequence version "1" as the ID in your example. I will fix it for the next release.

You can use -sid on the command line to give an ID to a sequence that does not have one, but not to replace an existing ID. That seems strange. It may change for the next release so that you can always use -sid to define the ID.

Hope that helps

Peter
Re: [EMBOSS] IDs in output
Bernd Web wrote:
> Hi Peter,
>
> Although I copy pasted, indeed the defline was wrong. It should have been:
>
> >gi|248166|gb|AAB21972.1| invertase {EC 3.2.1.26} [baker's yeast,
> Peptide Partial, 6 aa, segment 10 of 12]
> ATNTTL
>
> EMBOSS extracts "AAB21972.1".
> Having the version number is OK since otherwise the sequence is not
> completely defined (AAB21972 could refer to multiple versions).

If you specify -osformat ncbi you should be able to recreate the original defline in the EMBOSS output.

> My idea was more related to selecting the GI number as ID to use in
> EMBOSS applications. Now the accession number depends on the format of
> the defline:
> sp -> Entry Name (not primary accession)

If there is an Entry name EMBOSS will use it.

> ref, emb, gb -> Accession

But now EMBL and Genbank define this as the entry name anyway.

> pdb -> PDB protein name with Chain concatenated to it.

That seems good to me ... although we know of a problem when there are more than 26 chains and -a comes round again.

> Although I wrote a script to map the names from NCBI deflines to
> EMBOSS names, it could be easy to have the option to use the GI
> number.

Hmmm ... in EMBOSS terms, this counts as yet another sequence format.

We could make a new output format (-osformat gifasta for example) that uses the GI as the ID... but it would use the original sequence name as the filename first time around (and then when you read the file it would start using the GI number as the filename).

But we could also make "gifasta" an input format (-sformat gifasta) and then it could use the GI number - but you would have to specify the -sformat on the command line (or gifasta::filename as input) because EMBOSS has to choose which way to interpret the defline.

Does that solve your problem?

NCBI regard the ID as the entire string with "|" characters embedded, but that is no use when making filenames so we had to choose something.

EMBL does not use GI numbers ... they only appear in GenBank and NCBI files. I never liked them, but EMBOSS does try to do whatever the users demand :-)

regards,

Peter
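
Since "gifasta" is only a proposal at this point, a small preprocessing step is one way to get GI numbers as IDs today. The sketch below rewrites a FASTA header so that the GI number becomes the first word and the original defline is kept as the description. The helper names gi_number and with_gi_as_id are hypothetical, and the output layout is an assumption rather than any existing EMBOSS format.

```python
def gi_number(defline: str):
    """Return the GI number from an NCBI-style defline, or None if absent."""
    words = defline.lstrip(">").split(None, 1)
    if not words:
        return None
    fields = words[0].split("|")
    if len(fields) >= 2 and fields[0] == "gi" and fields[1].isdigit():
        return fields[1]
    return None


def with_gi_as_id(defline: str) -> str:
    """Put the GI number first so a plain FASTA reader takes it as the ID."""
    gi = gi_number(defline)
    if gi is None:
        return defline  # no GI present: leave the header untouched
    return ">" + gi + " " + defline.lstrip(">").strip()


header = ">gi|248166|gb|AAB21972.1| invertase {EC 3.2.1.26}"
print(with_gi_as_id(header))
```

Headers without a leading gi| field pass through unchanged, so a mixed file keeps whatever ID EMBOSS would normally choose for those records.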
Re: [EMBOSS] Transeq and very large sequences
michael watson (IAH-C) wrote:
> I want to translate very large (eukaryotic chromosomes!) DNA sequences
> in all 6 frames. Transeq takes about a day per large chromosome,
> running on a linux machine with 3Gb of RAM.
>
> Any suggestions on alternatives or how I could speed it up?

You want just a 6-frame translation of an entire chromosome?

I will look into why it takes so long. We have made some changes to string size extension that may already help this for the next release.

regards,

Peter
Re: [EMBOSS] Transeq and very large sequences
michael watson (IAH-C) wrote:
> Excellent! I set the MAXSEQIN parameter to 200,000,000 and it ran in 18
> seconds

Ah, that is a challenge. I'll see what I can do with the EMBOSS code :-)

regards,

Peter
Re: [EMBOSS] Question regarding Reference Sequence Database
Hi Jean,

> Does any program in EMBOSS package can make use of the Reference Sequence
> Databases? I indexed refseq databases with dbxflat and run showfeat against
> them but receive error about has zero length sequence :

The next release will include refseq as a valid sequence format. You can usually get away with defining the format as Genbank. If that does not work please let me know and I will update the refseq format code.

Aha ... but in this case ... NG_002612 does have zero length. This appears to be one of those entries (the EMBL CON division does much the same) that only refer to sequence data in other entries. It ends with the line:

  CONTIG join(complement(AC006998.3:2483..110100))

We can try to process these. The database definition will need to know where to look up "AC006998.3" which is where the sequence data ... and all the missing features ... should be.

Can you exclude the CON entries from your indexing? If not, we can try excluding them.

Hope that helps,

Peter
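
To see what "processing" such an entry would involve, the CONTIG line can be broken into the pieces that have to be fetched from other entries. The sketch below is an illustration only: the regular expression and the name contig_segments are assumptions, and it handles just the simple form shown above (a complement() wrapped around a single segment, no gap() elements, no complement() around a whole join()).

```python
import re

# One segment: optional "complement(", then accession[.version]:start..end
SEGMENT = re.compile(r"(complement\()?([A-Z_]+\d+(?:\.\d+)?):(\d+)\.\.(\d+)")


def contig_segments(contig_line: str):
    """List the referenced entries and ranges in a CONTIG join(...) line."""
    return [
        {
            "accession": acc,
            "start": int(start),
            "end": int(end),
            "reverse": bool(comp),
        }
        for comp, acc, start, end in SEGMENT.findall(contig_line)
    ]


line = "CONTIG join(complement(AC006998.3:2483..110100))"
print(contig_segments(line))
```

Each returned accession is a sequence that the database definition would need to be able to resolve before the entry has any sequence of its own.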
Re: [EMBOSS] question!
Dear Fang,

> I installed EMBOSS 2.10.0 in on windowsXP PC. However, when I use command
> "extractfeat genbank:*", it does not work. The error message is "Error:uable
> to read sequence 'genbank:4101655', Died: extractfeat termined:Bad value for
> '-sequence' and no prompt". But it work fine with "extractfeat
> embl:AK222810".Do you know the reason?

If you used the database definitions provided with EMBOSS ... your genbank is possibly pointing to the CBR server in Canada which has now closed.

There is also a problem with the way SRS servers define the GI number - there are now servers that index it, but as "gid" not as "gi" which EMBOSS anticipated. We will change the field name in the next release of EMBOSS.

To test whether your genbank definition works, you could try the ID

We are now at release 4.0.0 which allows "gi" as a search field. Earlier versions only had "sv" (sequence version) ... whether that is indexed depends on the database provider. Indexing GenBank in EMBOSS does allow GI searches.

> Is there any way to access ENsembl database. Is there any new version of
> EMBOSS which could support more databases which could installed in windowsXP?

Ah, you are running EMBOSS under windows? embosswin was provided by Andre Blavier up to EMBOSS 2.10.0. We now provide a beta release of EMBOSS 4.0.0 for windows (nobody did version 3.0.0 for windows).

Hmmm ... we need to make that more obvious on the EMBOSS website. EMBOSSWIN is available by FTP from emboss.open-bio.org/pub/EMBOSS/windows/ ... only a few brave people have tested it so far, but they report that it is working.

> Are all the databases which EMBOSS connected are the latest version? since I
> found some database do not give the same results as what I get from the
> database directly.

That depends on where the databases are. There is a list of SRS servers you can check for the number of entries and the date they were indexed: http://downloads.biowisdomsrs.com/publicsrs.html

for example:

  DB genbank [
    type: N
    method: srswww
    format: genbank
    url: "http://iubio.bio.indiana.edu/srsbin/cgi-bin/wgetz"
    dbalias: "genbankrelease"
    fields: "gi sv des org key"
    comment: "Genbank IDs"
  ]

You can also try Entrez databases in EMBOSS 4.0.0 ... I wonder how many users have been using entrez as an access method?

Hope that helps

Peter Rice
Re: [EMBOSS] question about display double-stranded DNA
Hi Jean,

> When using remap, I prefer to use the '-noreverse' flag so that the
> translation of my DNA is located closer to my DNA strand. However, using
> this flag also remove the complementary strand of my DNA in the output which
> is less convinient when design primers. Is there a way in remap to display
> double-stranded DNA but turn off the restriction sites of the complementary
> strand?

I am looking at remap changes at the moment; I will see what I can do.

> If not, is there a program in EMBOSS which can retrieve the sequence from
> database, select start/end points and display both strands? I tried seqret
> but failed.

Showseq does that.

It has a bug at present (I noticed it this week; it is fixed in the next release) that makes it show additional bases up to the end of the last line.

regards,

Peter
Re: [EMBOSS] question about display double-stranded DNA
Hi Jean,

> Peter, Thanks for reply. seqret can retrieve entry and select start/end
> points. But seqret does NOT display both strands. Does it?

Right. Seqret returns a sequence, so it can only report one strand at a time.

> Showseq does that.
>
> It has a bug at present (I noticed it this week - fixed in the next release)
> that makes it show additional bases up to the end of the last line.

Oops. Spoke too soon. showseq uses the same display functions as remap and has the same limitations.

I will see what we can do for the next release.

regards,

Peter
Re: [EMBOSS] Restriction fragment sequences
Jean-Christophe AME wrote:

> Hello,
>
> I have a question concerning DNA restriction fragment analysis : Is
> there a way to generate the actual sequence of the restriction
> fragment generated by restrict or remap, this is to facilitate the in
> silico construction of recombinant plasmid just with a cut and paste.
> May there are some ways do this automatically (there was CloneIt but
> it doesn't work).

Interesting suggestion.

You really need a nucleotide version of digest (or restrict with the fragment start/end and sizes reported instead of the cut sites). With the command line option -rformat listfile you can then use seqret to return the sequences by using @filename as input.

Unfortunately, if you do that with restrict you only get the restriction sites.

We will add a new application to the next release.

regards,

Peter Rice
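Until such an application exists, the fragments can be cut out by hand once the cut positions are known. The sketch below is not an EMBOSS program. It assumes each cut is given as the 1-based position of the last base before the cut on the forward strand, and the function name and the circular option are invented for illustration.

```python
from typing import List, Tuple


def fragments(sequence: str, cuts: List[int],
              circular: bool = False) -> List[Tuple[int, int, str]]:
    """Return (start, end, fragment) for each piece, using 1-based coordinates.

    Each value in cuts is the position of the last base before a cut.
    """
    length = len(sequence)
    sites = sorted({c for c in cuts if 0 < c < length})
    bounds = [0] + sites + [length]
    pieces = [(bounds[i] + 1, bounds[i + 1], sequence[bounds[i]:bounds[i + 1]])
              for i in range(len(bounds) - 1)]
    if circular and len(pieces) > 1:
        # On a plasmid the last and first pieces are one fragment
        # that runs through the origin.
        first = pieces.pop(0)
        last = pieces.pop()
        pieces.append((last[0], first[1], last[2] + first[2]))
    return pieces
```

With circular=True the end coordinate of the wrapped fragment is smaller than its start, which marks it as spanning the origin.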
Re: [EMBOSS] question about 'fuzznuc' and 'fuzzpro'
Hi Jean,

> I know I can give a pattern like 'ACCGGT' and search against a file which
> contains multiple sequences. Is there a way I can specify a 'pattern file'
> which contains multiple patterns that I want to search for instead of just
> one pattern each time? For example, I have a fileA which contains multiple
> DNA sequences. I want to create a fileB which contains 20 patterns that I
> want to seach each of them against the sequences in the fileA. We are in the
> transition from GCG to EMBOSS. And the program 'findpatterns' in GCG can do
> this. But I couldn't find corresponding emboss program that does the same
> thing.

This is new in EMBOSS 4.0.0, contributed by Henrikki Almusa of Medicel in Helsinki.

fuzznuc (and fuzzpro and fuzztran) can now read in a file of patterns with the command line syntax:

  fuzznuc @patternfile

You can also use @patternfile in response to the prompt for a pattern.

Here is an example pattern file with FASTA-style IDs and mismatch counts for each pattern:

  >pat1
  cggccctaaccctagcccta
  >pat2
  cg(2)c(3)taac
  cctagc(3)ta
  >pat3
  cggc{2,4}taac{2,5}

Here is a file with just the second pattern, and no name (it will default to pattern1):

  cg(2)c(3)taac
  cctagc(3)ta

You can set a default name with -pname and a default mismatch with -pmismatch.

I note we could document this better in the fuzz* program manual entries. We will do so for the 4.1 release.

Hope that helps,

Peter
Re: [EMBOSS] question about 'fuzznuc' and 'fuzzpro'
Hi Jean,

I copied this reply to the list, as it includes poorly documented features and some suggestions for the future.

> It's great to know it can be done! I do have further questions. So in the
> pattern file that has no name and contains two lines, you said it's going to
> default to pattern 1. Does that means that without the '>', everything will
> be concatenated and treated as one pattern?

Yes. We did include a -pformat qualifier to set the format of the pattern file, so we can extend in future to have one pattern per line.

> Actually I should ask what's the difference between
>
> >pat2
> cg(2)c(3)taac
> cctagc(3)ta
>
> and
>
> >pat2
> cg(2)c(3)taaccctagc(3)ta

They are the same. Pattern lines are simply joined together until the next new pattern header (>pat3) is found.

> also what's the difference between a file containing
> >pat2
> cg(2)c(3)taac
> cctagc(3)ta
> with a file containing
> cg(2)c(3)taac
> cctagc(3)ta

The first allows one mismatch in matching the pattern.

These patterns work with the HHTETRA entry we use for the example in the program manual (accession number L46634):

>HHTETRA L46634.1 Human herpesvirus 7 (clone ED132'1.2) telomeric repeat region.
aagcttaaactgaggtcacacacgactttaattacggcaacgcaacagctgtaagctgca
ggaaagatacgatcgtaagcaaatgtagtcctacaatcaagcgaggttgtagacgttacc
tacaatgaactacacctctaagcataacctgtcgggcacagtgagacacgcagccgtaaa
ttcctcaacccaaaccgaagtctaagtctcaccctaatcgtaacagtaaccctaca
actctaatcctagtccgtaaccgtaaaatcctagcccttagccctaaccctagccc
taaccctagctctaaccttagctctaactctgaccctaggcctaaccctaagcctaaccc
taaccgtagctctaagtttaaccctaaccctaaccctaaccatgaccctgaccctaaccc
tagggctgcggccctaaccctagccctaaccctaaccctaatcctaatcctagccctaac
cctagggctgcggccctaaccctagccctaaccctaaccctaaccctagggctgcggccc
taaccctaaccctagggctgcggcccgaaccctaaccctaaccctaaccctaaccctagg
gctgcggccctaaccctaaccctagggctgcggccctaaccctaaccctagggctgcggc
ccgaaccctaaccctaaccctaaccctagggctgcggccctaaccctaaccctagggctg
cggccctaaccctaaccctaactctagggctgcggccctaaccctaaccctaaccctaac
cctagggctgcggcccgaaccctagccctaaccctaaccctgaccctgaccctaacccta
accctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta
accctaaccctaaccctaaccctaagcactggcagccaatgtcttgtaatgc
cttcaaggcactctgcgagccgcgcgcagcactcagtgacaagtttgtgcac
gagaaagacgctgccaaaccgcagctgcagcatgaaggctgagtgcacaaggcttt
agtcccataaaggcgcggcttcccgtagagtagccgcagcgcggcgcacagagcga
aggcagcggctttcagactgtttgccaagcgcagtctgcatcttaccaatgatgatcgca
agcaagatgttctttcttagcatatgcgtggttaatcctgttgtggtcatcactaa
gcaagctt

> Also could you explain how to use -pname and -pmismatch?
> I don't understand this part at all :-P Thank you very much!

Ah ... they are associated qualifiers (like -sformat, -sbegin, -send for sequences, -osformat for sequence output, -aformat for alignments and -rformat for reports). They only show up if you use -help -verbose to see the help. This caused some problems for fuzznuc users with release 4.0.0, as they replace the previous version, which had a -mismatch option and only read one pattern.

-pmismatch sets a default number of mismatches for all patterns (that you can override within the pattern file).

-pname sets a pattern name for the output (something that was missing before). Oops, we have a bug ... the name is being ignored in fuzznuc. Will be fixed in 4.1.0.

-pformat sets the pattern file format. So far this is ignored, so we have not documented pattern file format names. I think a file with one line for each pattern and numbering 1, 2, 3 added to the pattern name would be useful. We could call the formats "simple" (one line per pattern) and "fasta" (the current format with names).

Oops, another bug. Using a bad pattern file name is not being caught. Fixed in 4.1.0.

We also added files of regular expressions used by dreg and preg, so you can also use them for pattern searches (it depends on whether you prefer prosite-style patterns or regular expressions; I find the prosite-style patterns used by fuzznuc much easier). We can use the same file formats for them.

I have to check the original pattern file code from Henrikki Almusa to see whether we lost anything in the naming and formats.

Hope that helps,

Peter
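The joining rule described in this thread can be illustrated outside EMBOSS. The sketch below is not the fuzznuc code. It only mimics what the messages above state: lines are joined until the next ">" header, and a file with no header gets the default name pattern1. The function name is invented, and anything after the name on a header line is ignored here.

```python
from typing import Dict, Iterable, Optional


def read_patterns(lines: Iterable[str],
                  default_name: str = "pattern1") -> Dict[str, str]:
    """Map each pattern name to its pattern, joining continuation lines."""
    patterns: Dict[str, str] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New pattern header: the first word after '>' is the name.
            fields = line[1:].split()
            name = fields[0] if fields else default_name
            patterns[name] = ""
        else:
            if name is None:
                # No header seen yet, so fall back to the default name.
                name = default_name
                patterns[name] = ""
            patterns[name] += line
    return patterns
```

Read this way, the two-line and one-line forms of pat2 quoted above give the same pattern string.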
Re: [EMBOSS] dreg and reverse strand
Andres Pinzon wrote:

> Hi,
> Im using dreg to find some patterns on a xanthomonas* genome reverse strand.
> This is the command im using:
>
> dreg -sequence ./campestrisVesicatoria.gb -pattern
> 'TTC(G|T|C){14,17}TTC(G|A|T)' -outfile campestrisVes-rev.dreg.gb
> -rformat3 genbank -sask1

Oops. Can you send me the input sequence, please?

We will fix it for the next release (soon).

regards,

Peter
Re: [EMBOSS] Fuzznuc question: how to search complementary strand?
Andres Pinzon wrote:

> Hi,
> Im trying fuzznuc to search for some patterns in a a genome.
>
> ...But when I search the complementary strand:
>
> It reports a pattern on complement that exists, in fact, but on the
> forward strand not in complement.
>
> Am I doing something wrong?

I think this is one we patched soon after the 4.0.0 release. There are patches on our FTP server, and a new 4.1.0 release will appear soon with this fix included.

> What options do I have to use in order to make fuzznuc to report the
> occurrences of "pattern" on both: reverse and complementary strand?

-complement is correct. It searches both strands.

To search only the complementary strand, use the general EMBOSS option -sreverse and do not specify -complement.

Hope this helps,

Peter
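To make clear what searching both strands means, here is a small sketch that is independent of fuzznuc. It uses a plain regular expression rather than fuzznuc pattern syntax, handles only unambiguous bases, finds non-overlapping matches, and reports reverse-strand hits in forward-strand coordinates. All names in it are illustrative.

```python
import re
from typing import List, Tuple

_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_both_strands(seq: str, pattern: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, strand) hits in 1-based forward coordinates."""
    hits: List[Tuple[int, int, str]] = []
    for match in re.finditer(pattern, seq):
        hits.append((match.start() + 1, match.end(), "+"))
    length = len(seq)
    for match in re.finditer(pattern, reverse_complement(seq)):
        # Map the hit on the reverse complement back to the forward strand.
        hits.append((length - match.end() + 1, length - match.start(), "-"))
    return hits
```

Searching only the complementary strand corresponds to running just the second loop.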
Re: [EMBOSS] how to get translations using extractfeat?
Andres Pinzon wrote:

> Hi,
> Im trying to get all the "/translation" sequences from a genome embl
> feature file.
> I mean, each CD have a translation tag and I need those translations
> in a fasta file. I've tried all possible combinations of -type -tag
> but i can not get the translated sequences, but the DNA sequences.
>
> Is it possible to get this translated sequences from the feature file?
> Or do I have to get the corresponding CDS DNA sequences and then translate
> them?

Good suggestion ... we can try to make a new application.

The /translation tag is rather special (because the value is a real sequence) ... also it may have a different name in some databases or feature file formats.

We will need to make up names for each translation (sequence identifiers, and something derived from the feature table) like the names used by extractfeat.

Alternate splicing will make it difficult to create reliable unique names. Extractfeat does have the same problem, and nobody has complained. If we keep a table of names so far, we can add something to the end of any duplicates.

Extracttrans is a possible name for the program.

regards,

Peter
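In the meantime the translations can be pulled out with a short script. This is only a sketch and not the proposed application. It assumes EMBL-style FT lines with the feature key in columns 6-20 and qualifiers from column 22, names each protein after the /gene qualifier of its CDS (or "CDS" if there is none), and appends a counter to duplicate names as suggested above. The function names are invented.

```python
from typing import Dict, Iterable, List, Tuple


def unique_name(name: str, seen: Dict[str, int]) -> str:
    """Return name unchanged the first time, then name_2, name_3, ..."""
    seen[name] = seen.get(name, 0) + 1
    return name if seen[name] == 1 else f"{name}_{seen[name]}"


def extract_translations(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (name, protein) for every CDS with a /translation qualifier."""
    features = []      # one dict per CDS feature
    current = None     # the CDS being read, or None outside a CDS
    collecting = False # True while inside a multi-line /translation value
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.startswith("FT"):
            continue
        key, text = line[5:20].strip(), line[21:].strip()
        if key:  # a new feature starts on this line
            current = {"gene": None, "protein": []} if key == "CDS" else None
            if current is not None:
                features.append(current)
            collecting = False
        elif current is None:
            continue
        elif collecting:
            current["protein"].append(text.rstrip('"'))
            collecting = not text.endswith('"')
        elif text.startswith('/translation="'):
            value = text[len('/translation="'):]
            current["protein"].append(value.rstrip('"'))
            collecting = not value.endswith('"')
        elif text.startswith('/gene="'):
            current["gene"] = text[len('/gene="'):].rstrip('"')
    seen: Dict[str, int] = {}
    return [(unique_name(f["gene"] or "CDS", seen), "".join(f["protein"]))
            for f in features if f["protein"]]


def to_fasta(records: List[Tuple[str, str]]) -> str:
    """Format (name, protein) pairs as FASTA text."""
    return "".join(f">{name}\n{protein}\n" for name, protein in records)
```

Other feature file formats, or a differently named translation qualifier, would need their own handling.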
Re: [EMBOSS] question about translation start site
Dear Fang,

> Does anyone know if EMBOSS could give us the translation start site and
> translation start site ? Thanks!
> Looking forward to your reply.

Can you give an example of what you mean?

Start position and first codon, perhaps? Using the feature table, or from finding open reading frames?

regards,

Peter
Re: [EMBOSS] EMBOSS 4.1.0 released
Ryan Golhar wrote:

> I agree. I was also expecting the version number on the tarballs to change
> as well. At the moment, there is no way to tell they were updated...

The embassy changes are all minor. We like to use the version numbers of the original code, so it is a little difficult to merge in the EMBOSS version ... without making up a very long version number.

Does anyone have strong preferences?

Peter
Re: [EMBOSS] question (Error: Failed to find host 'srs.ebi.ac.uk' for database 'emblebi')
Dear Nikolai,

Воробцов Николай Вадимович wrote:

> The other day I have install the EMBOSS package (version for windows).
>
> Environmen parameters are setted as required:
> SET EMBOSS_ROOT=D:\Emboss-MS
> SET EMBOSS_ACDROOT=D:\Emboss-MS\acd
> SET EMBOSS_DATA=D:\Emboss-MS\data
>
> seqret emblebi:xlrhodop
> Reads and writes (returns) sequences
> Error: Failed to find host 'srs.ebi.ac.uk' for database 'emblebi'
> Error: Unable to read sequence 'emblebi:xlrhodop'
>
> Please say what a problem is?

The databases defined by default connect to servers here at EBI. These databases need an internet connection.

If you can connect your browser to http://srs.ebi.ac.uk then EMBOSS will be able to read databases.

There are some more settings you can add if, for example, you need to define an HTTP proxy.

You can also install the database flat files locally and index them with dbxflat (or dbiflat).

You can use any EMBOSS program with local sequence data files, or put the sequence on the command line with the syntax:

seqret asis::

regards,

Peter Rice
[EMBOSS] Profiling and testing water
Vivek Menon wrote:

> Hello all, I am having issues compiling the water and needle programs from
> the EMBOSS package.

That makes 3 related requests in the past week! It seems profiling and looking at the code for water is becoming popular.

For those who want to play with the code, it may be helpful to describe how the EMBOSS QA testing works. So far this has just been run internally to check that code changes have not broken anything.

Firstly, edit file test/.embossrc to set the locations of the source test directory (emboss_qadata) and the installed test directory (emboss_testdata).

The install directory is used for the test databases tsw, tembl (etc.) provided with the EMBOSS distribution. The source test directory is used so that the results of one test can be used in another.

cd to the source test directory.

cd to the qa subdirectory.

Run all the QA tests using:

  ../../scripts/qatest.pl -without=srs

(the command line option turns off tests that require SRS installed locally)

Run one selected QA test:

  ../../scripts/qatest.pl water-ex

Tests run in a subdirectory with the name of the test (test/qa/water-ex). If the test succeeds, the directory is removed (the command line option -kk keeps the directory).

New tests are easy to define: add them to test/qatest.dat.

Each test has to have a unique name. Descriptions of the definition line types are at the top of the file.

Tests assume files stderr and stdout are created and empty. All other output files must be included in the test definition (getting a surprise new file is an error).

The .embossrc file defines the date to be 15-jul-2006, so do not be surprised if you see that date in your output. We use it to keep the results constant when updating the documentation.

All the *-ex tests are examples for the manuals.

Have fun!!!

Peter