[EMBOSS] Many-to-many with needle and water

2009-07-06 Thread Peter
Hi Peter R. et al,

I gather EMBOSS is looking for feedback for new applications (given
the recent funding from the BBSRC - congratulations again). How about
suggestions for extensions to existing EMBOSS applications?

I've used bits of EMBOSS for several years now (thank you!). Something
I have sometimes wanted to do is a many-to-many pairwise sequence
alignment with the EMBOSS tools needle and water.

Right now, needle and water take two files (here referred to as A and
B), file A has just one sequence, and file B can have one or more
sequences. I'd like to be able to supply two files both with multiple
entries, and have needle/water do pairwise alignments between all the
sequences in A against all the sequences in B. This might be useful
for finding reciprocal best hits in comparative genomics (as a slower
but exact alternative to FASTA or BLAST).

From an implementation point of view, I might imagine doing sequence
A1 against all of B, then sequence A2 against all of B, etc. This
would require looping over file B many times (easy if on disk). This
would also work if the A input was stdin, but having the B input on
stdin would require caching the data if A has more than one sequence
:(

It may sometimes also be useful to have an all-against-all pairwise
comparison for a single set of sequences. The above suggested
enhancement would let you do this by comparing file A to file A.
However, here you only really need to do half the possible
combinations (as aligning sequence A1 to sequence A2 should be the
same as A2 to A1). This could be useful for implementing a basic
clustering algorithm, or maybe as part of a worked example in building
a simple NJ tree?
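
As a minimal sketch of the "half the combinations" point, assuming the
alignment of A1 to A2 really is equivalent to A2 to A1, Python's
itertools.combinations lists each unordered pair exactly once. The
sequence names here are placeholders, not real data:

```python
from itertools import combinations


def unordered_pairs(names):
    """Return every unordered pair of distinct sequences exactly once."""
    return list(combinations(names, 2))


names = ["A1", "A2", "A3", "A4"]  # placeholder sequence names
pairs = unordered_pairs(names)
print(len(pairs))  # 6, i.e. 4 * 3 / 2 rather than 4 * 4 = 16
```

For n sequences this gives n * (n - 1) / 2 alignments instead of n * n.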

So, does supporting many-to-many comparisons sound like a useful
enhancement to needle and water?

I should stress this isn't something I need right now. Also, it can be
worked around with a wrapper script to call needle/water once for each
sequence in file A (against all the sequences in file B), with the
added bonus that these one-to-many comparisons can then be
shared across multiple CPU cores.
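
A rough sketch of such a wrapper, assuming needle is on the PATH and
that both inputs are FASTA files. The helper names, the a_N.fasta file
naming and the gap penalties shown are illustrative choices, not
anything EMBOSS prescribes:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_fasta(text):
    """Split FASTA text into a list of single-record strings."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current) + "\n")
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


def needle_command(a_path, b_path, out_path):
    """Build one one-to-many needle call (gap penalties are illustrative)."""
    return ["needle", "-asequence", a_path, "-bsequence", b_path,
            "-gapopen", "10", "-gapextend", "0.5",
            "-outfile", out_path, "-auto"]


def run_all(a_fasta, b_path, workers=4):
    """Write each record of file A to its own file and align it against B."""
    commands = []
    with open(a_fasta) as handle:
        for i, record in enumerate(split_fasta(handle.read()), start=1):
            single = f"a_{i}.fasta"  # hypothetical naming scheme
            with open(single, "w") as out:
                out.write(record)
            commands.append(needle_command(single, b_path, f"a_{i}.needle"))
    # Threads are enough here: the real work happens in the child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
```

Swapping "needle" for "water" in needle_command would give the local
alignment equivalent.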

Regards,

Peter C.
___
EMBOSS mailing list
EMBOSS@lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/emboss


Re: [EMBOSS] Many-to-many with needle and water

2009-07-06 Thread Peter
On Mon, Jul 6, 2009 at 11:35 AM, Peter Rice wrote:
>
> Peter C wrote:
> > Hi Peter R. et al,
> >
> > I gather EMBOSS is looking for feedback for new applications (given
> > the recent funding from the BBSRC - congratulations again). How about
> > suggestions for extensions to existing EMBOSS applications?
> >
> > I've used bits of EMBOSS for several years now (thank you!). Something
> > I have sometimes wanted to do is a many-to-many pairwise sequence
> > alignment with the EMBOSS tools needle and water.
> >
> > Right now, needle and water take two files (here referred to as A and
> > B), file A has just one sequence, and file B can have one or more
> > sequences. I'd like to be able to supply two files both with multiple
> > entries, and have needle/water do pairwise alignments between all the
> > sequences in A against all the sequences in B. This might be useful
> > for finding reciprocal best hits in comparative genomics (as a slower
> > but exact alternative to FASTA or BLAST).
>
> The application is easy to add (after the release)
>
> The usual problem with all-against-all is that it involves loading one
> of the inputs as a sequence set entirely in memory - to avoid reading
> one input many times over.

Right - and it would be difficult to decide if in memory vs reading the
file many times is best in general without some specific use cases.

[I suppose you could do something a bit more cunning like start by
caching the sequences as you read them ready for re-use, but if the
number of sequences crosses a threshold, stop caching and switch
to re-reading the file for subsequent loops?]
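
To make the cache-then-fall-back idea concrete, here is a small sketch.
It is not how EMBOSS reads sequences: the class name, the max_cached
threshold and the simplification of one record per line are all
assumptions made for illustration.

```python
class ReusableRecords:
    """Iterate over a file many times, caching in memory only if it is small.

    Simplification: each line of the file is treated as one record.
    """

    def __init__(self, path, max_cached=1000):
        self.path = path
        self.max_cached = max_cached
        self._cache = None      # filled after one complete, small pass
        self._too_big = False   # set once the threshold has been crossed
        self.file_reads = 0     # how many passes actually hit the disk

    def __iter__(self):
        if self._cache is not None:
            yield from self._cache
            return
        self.file_reads += 1
        caching = not self._too_big
        cache = []
        with open(self.path) as handle:
            for line in handle:
                record = line.rstrip("\r\n")
                if caching:
                    cache.append(record)
                    if len(cache) > self.max_cached:
                        # Too many records: drop the cache, re-read from now on.
                        caching, cache = False, []
                        self._too_big = True
                yield record
        if caching:
            self._cache = cache
```

Looping over a ReusableRecords object once per sequence in file A would
then read file B from disk only once when B is small, and once per loop
when it is not.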

> We have an application supermatcher which does this - the first sequence
> is streamed through, the second is a sequence set loaded into memory. It
> uses word matching to find seed alignments then runs a limited alignment
> around the hits.
>
> superwater would be a possible name (or superneedle).

If you see many-to-many versions of water and needle as separate
applications, then those names sound fine.

> How popular would such a program be?

I don't know - as I said, this is more of a suggestion than a request.
I don't *need* this tool, but there have been occasions in the past
where I would have tried using it if it had existed.

Perhaps others on the list can think of better uses for this tool idea?

> How large would the smaller input set be?

Hard to say without specific examples in mind. For some hand waving
upper limits, for comparative genomics of bacteria using protein
sequences, you might have a few thousand in each file. If I was trying
this as part of an ad-hoc clustering algorithm (all-against-all), again
maybe a few thousand sequences. In practice, a heuristic tool like
supermatcher (or FASTA or BLAST) would probably be more sensible
for large datasets like this due to the computational time.
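
For a feel for the scale, a back-of-envelope sketch. It takes 3000
sequences per file as a stand-in for "a few thousand" and assumes a mean
protein length of 300 residues (a guess for illustration only), with
full dynamic programming filling one matrix cell per residue pair:

```python
def alignment_workload(n_a, n_b, mean_len_a, mean_len_b):
    """Return (number of pairwise alignments, total DP matrix cells)."""
    pairs = n_a * n_b
    cells = pairs * mean_len_a * mean_len_b
    return pairs, cells


pairs, cells = alignment_workload(3000, 3000, 300, 300)
print(pairs)  # 9000000
print(cells)  # 810000000000
```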

I see needle and water as most useful on smaller datasets where
the runtime cost of using an exact algorithm isn't too high. Therefore
many-to-many needle/water searches may be best targeted at
smaller sequence files. Things might be different with a multicore
or GPU/OpenCL version of needle and water ;)

Anyway, unless someone else thinks a many-to-many version
of needle and water would be useful, I wouldn't expect you to
implement this. I'm just putting the idea forward for discussion.

Regards,

Peter C.


[EMBOSS] Probabilistic versions of needle/water?

2009-07-06 Thread Peter
Hi all,

I have another suggestion for new or enhanced EMBOSS applications,
again related to the existing pairwise sequence alignment tools needle
and water.

The FASTQ file format (or others) contains quality scores (often PHRED
scores) representing the probability of an error in the associated
nucleotide. Solexa/Illumina machines also provide another file with a
more precise breakdown of the likelihood of each of the four bases.

In some cases both sequences could have probability scores (e.g.
trying to align the ends of contigs to each other), but often one
sequence will be taken as fact (e.g. mapping reads onto a reference).

It is possible to take these probabilities into account when
considering the matches in needle (or water) by using a probabilistic
version of the Needleman‐Wunsch sequence alignment algorithm (or a
probabilistic Smith-Waterman).
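
As a hedged illustration of how a quality score could feed into a match
score (this is a simple sketch, not GNUMAP's method or anything in
EMBOSS): a PHRED score Q corresponds to an error probability of
10^(-Q/10). If the reference is taken as fact and a miscalled base is
assumed equally likely to be any of the other three, the expected score
for one aligned position is as below. The match and mismatch values are
illustrative.

```python
def phred_to_error(q):
    """Convert a PHRED quality score to the probability the base is wrong."""
    return 10 ** (-q / 10)


def expected_score(read_base, q, ref_base, match=5.0, mismatch=-4.0):
    """Expected substitution score for one read base against a trusted reference.

    Assumes an erroneous call is equally likely to be any of the other
    three bases; match/mismatch values are illustrative.
    """
    p_err = phred_to_error(q)
    if read_base == ref_base:
        p_same = 1.0 - p_err   # the call is right
    else:
        p_same = p_err / 3.0   # the call is wrong and the true base is ref_base
    return p_same * match + (1.0 - p_same) * mismatch


print(phred_to_error(30))  # 0.001
```

A low quality mismatch is then penalised less than a high quality one,
which is the behaviour a fixed nucleotide comparison cannot give.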

As an example of this idea, did you (Peter R) see the GNUMAP
talk/poster at ISMB 2009? See http://dna.cs.byu.edu/gnumap/

I am aware of people using EMBOSS tools (I assume water) to identify
(known) adaptor sequences in raw Solexa/Illumina data. I considered
doing something similar myself when trying to remove primer sequences
from 454 data. Such a pipeline using the current EMBOSS water would be
doing this matching at a purely fixed nucleotide level (ignoring the
qualities), which isn't ideal. Upgrading to a probabilistic version of
water should be an improvement.

Peter C.



Re: [EMBOSS] Probabilistic versions of needle/water?

2009-07-06 Thread Peter
On Mon, Jul 6, 2009 at 1:32 PM, Peter Rice wrote:
>
>> I am aware of people using EMBOSS tools (I assume water) to identify
>> (known) adaptor sequences in raw Solexa/Illumina data. I considered
>> doing something similar myself when trying to remove primer sequences
>> from 454 data. Such a pipeline using the current EMBOSS water would be
>> doing this matching at a purely fixed nucleotide level (ignoring the
>> qualities), which isn't ideal. Upgrading to a probabilistic version of
>> water should be an improvement.
>
> Would be interesting.
>
> Where can I look up adaptor calling methods?

The particular example I had in mind was the thread with Giles Weaver
on the BioPerl mailing list, which I see you have just replied to:

http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html
http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030404.html

I think I made a typo earlier (needle versus water). If you are
comparing a short but complete adaptor sequence to a read
(which you expect may contain the full adaptor) doing a global
alignment is more sensible than a local one. On re-reading,
Giles did actually say he was using needle:
http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030411.html

Peter


[EMBOSS] transeq and ambiguous codons

2009-07-08 Thread Peter
Hi all,

Something I mentioned to Peter Rice in passing at BOSC/ISMB 2009 was
I'd found an oddity in transeq with certain ambiguous codons while
testing Biopython's translations. Here is a specific example (but I
suspect there are more). For reference, I am expecting EMBOSS transeq
to be using the NCBI tables:
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

First consider the following example, the codon TAN, which can be TAA,
TAC, TAG or TAT which translate to stop or Y. Therefore the
translation of TAN should be "* or Y", and EMBOSS transeq opts for
"X". Which is fine:

$ transeq asis:TAATACTAGTATTAN -stdout -auto
>asis_1
*Y*YX

Similarly for the codon TNN, again EMBOSS transeq opts for "X" because
this could be a stop codon, or W, or F, or L, or S, or Y or C! Again,
this is fine:

$ transeq asis:TNN -stdout -auto
>asis_1
X

However, consider the codon TRR. R means A or G, so this can mean TAA,
TGA, TAG or TGG which translate to stop or W (both EMBOSS and the NCBI
standard table agree here). Therefore the translation of TRR should be
"* or W", which I would expect based on the above examples to result
in "X". But instead EMBOSS transeq gives "*":

$ transeq asis:TAATGATAGTGGTRRTNN -stdout -auto
>asis_1
***W*X

I think this is a bug.

However, I am aware that the machine I tried this on is rather old,
and I don't actually know which version of EMBOSS it is. How can I
find out? As far as I know, there is no "-version" or "-v" or
"--version" switch, and the "-help" information doesn't include this
important piece of information. Nor is this in the FAQ:
http://emboss.sourceforge.net/docs/faq.html

So that makes two questions - how should transeq translate "TRR", and
how do I check the version of EMBOSS?

Thanks,

Peter C.


Re: [EMBOSS] transeq and ambiguous codons

2009-07-09 Thread Peter
On Thu, Jul 9, 2009 at 12:53 AM, Scott Markel wrote:
>
> Peter,
>
> Answer to question #2: run the program embossversion.
>
>> embossversion
> Writes the current EMBOSS version number to a file
> 6.0.1
>
> Scott

Thanks Scott (& Thomas) for pointing out the embossversion program.

I would still question why the EMBOSS tools don't also support the
Unix convention of a version switch. Hypothetically, aren't some
(many?) of the tools standalone and couldn't they be installed
individually (e.g. as part of someone else's software bundle)? i.e.
Can EMBOSS really guarantee that the needle tool and the
embossversion tool are in sync?

Peter


Re: [EMBOSS] transeq and ambiguous codons

2009-07-09 Thread Peter
On Thu, Jul 9, 2009 at 10:16 AM, Peter Rice wrote:
>
> Peter C. wrote:
>
>> Thanks Scott (& Thomas) for pointing out the embossversion program.
>>
>> I would still question why the EMBOSS tools don't also support the
>> Unix convention of a version switch. Hypothetically, aren't some
>> (many?) of the tools standalone and couldn't they be installed
>> individually (e.g. as part of someone else's software bundle)? i.e.
>> Can EMBOSS really guarantee that the needle tool and the
>> embossversion tool are in sync?
>
> We could easily add a -version global qualifier ... for the next release.
>
> We can guarantee that embossversion and needle are in sync - assuming
> they are built using the same libraries as that is where the version is
> recorded. Standalone builds are an issue though and it would help debug
> in a few cases.

That sounds good to me :)

Peter C.


Re: [EMBOSS] transeq and ambiguous codons

2009-07-09 Thread Peter
On Wed, Jul 8, 2009 at 10:50 PM, Peter wrote:
> Hi all,
>
> Something I mentioned to Peter Rice in passing at BOSC/ISMB 2009 was
> I'd found an oddity in transeq with certain ambiguous codons while
> testing Biopython's translations. Here is a specific example (but I
> suspect there are more). For reference, I am expecting EMBOSS transeq
> to be using the NCBI tables:
> http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
>
> First consider the following example, the codon TAN, which can be TAA,
> TAC, TAG or TAT which translate to stop or Y. Therefore the
> translation of TAN should be "* or Y", and EMBOSS transeq opts for
> "X". Which is fine:

Using raw output instead of the default FASTA works better in emails:

$ transeq asis:TAATACTAGTATTAN -stdout -auto -osformat raw
*Y*YX

> Similarly for the codon TNN, again EMBOSS transeq opts for "X" because
> this could be a stop codon, or W, or F, or L, or S, or Y or C! Again,
> this is fine:

Again, using raw output works better in emails:

$ transeq asis:TNN -stdout -auto -osformat raw
X

> However, consider the codon TRR. R means A or G, so this can mean TAA,
> TGA, TAG or TGG which translate to stop or W (both EMBOSS and the NCBI
> standard table agree here). Therefore the translation of TRR should be
> "* or W", which I would expect based on the above examples to result
> in "X". But instead EMBOSS transeq gives "*":

Again, using raw output works better in emails:

$ transeq asis:TAATGATAGTGGTRR -stdout -auto -osformat raw
***W*

> I think this is a bug.
>
> However, I am aware that the machine I tried this on is rather old,
> and I don't actually know which version of EMBOSS it is.

I can check the old machine later, but I just retested on a Mac using
EMBOSS 6.0.1 (the current release), and see the same behaviour.

Peter C.


Re: [EMBOSS] transeq and ambiguous codons

2009-07-10 Thread Peter
On Thu, Jul 9, 2009 at 10:08 AM, Peter Rice wrote:
>
> Peter C. wrote:
>> However, consider the codon TRR. R means A or G, so this can mean TAA,
>> TGA, TAG or TGG which translate to stop or W (both EMBOSS and the NCBI
>> standard table agree here). Therefore the translation of TRR should be
>> "* or W", which I would expect based on the above examples to result
>> in "X". But instead EMBOSS transeq gives "*":
>
> This is a side effect of the way backtranslation works...

OK, leaving TRR aside for the moment (I'm not sure I'd have done it that
way, but I think I follow your logic), I have some more problem cases for
you to consider (all using the default standard NCBI table 1).

Most of these are 'unambiguous ambiguous codons' as you put it, and
I would agree using X when a more specific letter is possible isn't ideal
but isn't actually wrong. The "ATS" and related codons (see below)
however are simply wrong.

--

TRA means TAA or TGA, which are both stop codons. Therefore TRA
should translate as a stop, not as an X:

$ transeq asis:TAATGATRA -stdout -auto -osformat raw
**X

--

Now look at YTA, which means CTA or TTA which encode L, so
YTA should be L not X:

$ transeq asis:CTATTAYTA -stdout -auto -osformat raw
LLX

Likewise for YTG and YTR, and YTN.

--

Another example, ATW means ATA or ATT, which both translate as I,
so ATW should translate as I not X:

$ transeq asis:ATAATTATW -stdout -auto -osformat raw
IIX

--

Conversely, ATS which means ATC or ATG which translate as I and M.
Remember S means G or C. Therefore ATS should translate as X, and
not I:

$ transeq asis:ATCATGATS -stdout -auto -osformat raw
IMI

Likewise H means A, G or C, so ATH shows the same bug, as do some
other AT* codons:

$ transeq asis:ATAATCATGATH -stdout -auto -osformat raw
IIMI

[*** This one strikes me as a clear bug ***]

--

Now for another debatable one, RAT means AAT or GAT which code
for N and D. So, you could use B (Asx) here rather than the broader X.

$ transeq asis:AATGATRAT -stdout -auto -osformat raw
NDX

Again, the same thing for others like RAC -> X not B, and RAY -> X not B.

Similarly, you don't use J to mean leucine (L) or isoleucine (I), and
opt for X (again, this is justifiable). e.g. WTA

$ transeq asis:ATATTAWTA -stdout -auto -osformat raw
ILX

------

This list is only partial, and only for the standard table.
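
For comparison, a minimal sketch of the rule argued for here: expand
each ambiguous codon into its concrete codons, and emit a specific
residue (or stop) only when they all agree, otherwise X. It covers the
standard table (NCBI table 1) only, never emits B, Z or J, and is
neither the EMBOSS nor the Biopython implementation; the names below are
made up for the example.

```python
from itertools import product

# NCBI standard table 1, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# IUPAC nucleotide ambiguity codes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def translate_codon(codon):
    """Return the single residue all expansions agree on, else 'X'."""
    choices = [IUPAC[base] for base in codon.upper()]
    residues = {CODON_TABLE["".join(c)] for c in product(*choices)}
    return residues.pop() if len(residues) == 1 else "X"


def translate(seq):
    """Translate in frame 1, ignoring any trailing partial codon."""
    return "".join(translate_codon(seq[i:i + 3])
                   for i in range(0, len(seq) - 2, 3))


print(translate("TAATGATRA"))  # ***
print(translate("CTATTAYTA"))  # LLL
print(translate("ATCATGATS"))  # IMX
```

Under this rule TRR comes out as X, since TGG (W) disagrees with the
three stop codons.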

Peter C.


Re: [EMBOSS] transeq and ambiguous codons

2009-07-10 Thread Peter
On Fri, Jul 10, 2009 at 10:30 AM, Peter Rice wrote:
>
> Peter C. wrote:
>>
>> OK, leaving TRR aside for the moment (I'm not sure I'd have done it that
>> way, but I think I follow your logic), I have some more problem cases for
>> you to consider (all using the default standard NCBI table 1).
>>
>> Most of these are 'unambiguous ambiguous codons' as you put it, and
>> I would agree using X when a more specific letter is possible isn't ideal
>> but isn't actually wrong. The "ATS" and related codons (see below)
>> however are simply wrong.
>
> They do look wrong. The "X when it could pick a residue" ones I knew of.
>
> The others need a closer look. The plan is to work through all possible
> codons and all the NCBI genetic codes as soon as the release is out.
>
> It should be a simple patch to ajtranslate.c when I'm done.
>

OK - I appreciate this is too last minute for the imminent EMBOSS release.

>> --
>>
>> Now for another debatable one, RAT means AAT or GAT which code
>> for N and D. So, you could use B (Asx) here rather than the broader X.
>>
>> Similarly, you don't use J to mean leucine (L) or isoleucine (I), and
>> opt for X (again, this is justifiable). e.g. WTA
>
> Hmmm ... B and Z are ambiguity codes for amino acid analyser where all the
> amide bonds are broken and that includes N->D and Q->E. We used to have one
> of those in the lab. Similarly, J is for mass spec where I and L have the
> same molecular weight. I don't consider them appropriate for translation.

Well, as I said, this is debatable. On the one hand B and Z are IUPAC standards
(although J isn't yet), but amino acids don't have the full ambiguous alphabet
that we have for nucleotides so some might find such a translation surprising.
http://www.chem.qmul.ac.uk/iupac/AminoAcid/A2021.html

> So I plan to go for unique amino acids where possible with the ambiguity
> codes.

Good :)

Peter C.


Re: [EMBOSS] transeq and ambiguous codons

2009-07-20 Thread Peter
On Thu, Jul 9, 2009 at 10:21 AM, Peter wrote:
> On Thu, Jul 9, 2009 at 10:16 AM, Peter Rice wrote:
>>
>> Peter C. wrote:
>>
>>> Thanks Scott (& Thomas) for pointing out the embossversion program.
>>>
>>> I would still question why the EMBOSS tools don't also support the
>>> Unix convention of a version switch. Hypothetically, aren't some
>>> (many?) of the tools standalone and couldn't they be installed
>>> individually (e.g. as part of someone else's software bundle)? i.e.
>>> Can EMBOSS really guarantee that the needle tool and the
>>> embossversion tool are in sync?
>>
>> We could easily add a -version global qualifier ... for the next release.
>>
>> We can guarantee that embossversion and needle are in sync - assuming
>> they are built using the same libraries as that is where the version is
>> recorded. Standalone builds are an issue though and it would help debug
>> in a few cases.
>
> That sounds good to me :)
>

Thinking about this again, rather than adding a whole new argument
(-version), why not just include the program version as the first line of
the help output (from -help)? This should also solve the corner case
of standalone builds, and makes it very easy to find the version
(without having to know about the embossversion tool).

Thanks,

Peter C.


[EMBOSS] EMBOSS seqret : IntelliGenetics and new DOS lines

2009-07-20 Thread Peter
aggtgaccggccaggaaac
ggtcgcccagatcaaggtaaggctgcttggtgcgccctgggttccacttgtgctct
tcactctcgcggcccgagggaacgcttacgagccttatctttccctgtaggctcatgtag
cctcactggagggcattgggaagatcaagtcgtgctcctggcaggcgcgtgg
aggatgaggccactctgggccagtgctggaggccctgactaccctggaagtagcag
gccgcatgcttggaggtgagtgagagaggaatgttctttgaagtaccggtaagcgtctag
tgagtgttgcatagtcctgacagctgagtgtcacacctatggtaatagagtacttc
tcactgtcttcagttcagagtgattcttcctgtttacatccctcatgttgaacacagacg
tccatgggagactgagccagagtgtagttgtatttcagtcacatcacgagatcctagtct
ggttatcagcttccacactattaggtcagaccaggaaagtgctctataaatt
agaagctggaagatcctgaaatgaaacttaagatttcaaggtcaaatatctgcaactttg
ttctcattacctattgggcgcagcttctctttaaaggcttgaattgaggatt
ctgctgggtggcaccttcttgctcttacctgctggtgccttcctttcccactacaggtaa
agtccatggttccctggcccgtgctgggtgagaggtcagactcctaaggtgagtga
gagtattagtggtcatggtgttaggactttcctttcacagctaaaccaagtccctg
ggctcttactcggtttgccttctccctccctggagatgagcctgagggaagggatgctag
gtgtggaagacaggaaccagggcctgattaaccttcccttctccaggtggccaaacagga
gaagaagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgt
caacgttgtgcccacctttggcaagaagaagggaatgccaactcttaagtcg
taattctggctttctctaataagccacttagttcagtcatcgcattgtttcatctt
tacttgcaaggcctcagggagaggtgtgcttctcgg

i.e. There was a problem with this example file in EMBOSS 6.0.1,
but things look fine in EMBOSS 6.1.0. Great :)

However, if we now convert this input file to use DOS/Windows
newlines, and repeat the test (on Mac OS X, so Unix):

$ embossversion
Reports the current EMBOSS version number
6.1.0
$ seqret -sequence emboss_ig.txt -sformat ig -osformat fasta -auto -filter
 H.sapiens fau mRNA, 518 bases
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgggaagatcaagtcgtgc
tcctggcaggcgcgtggaggatgaggccactctgggccagtgctggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctgggtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggaatgccaactcttaagtcgtaattctggctttc
tctaataagccacttagttcagtcaa
 H.sapiens fau 1 gene, 2016 bases
ctaccaccctctcgattctatatgtacactcgggacaagttctcctgatcgc
ggcctaaggaagtaggaatgccttagcttaacaatgattaacac
tgagcctcacacccacgcgatgccctcagctcctcgctcagcgctctcaccaacagccgt
agcccgcaggctggacaccggttctccatgcagcgtagcccggaacatggta
gctgccatctttacctgctacgccagccttctgtgcgcgcaactgtctggtcccgg
tcctgcgcgagctgctgcccaggcaggttcgccggtgcgagcgtaaacggagctag
gactgccttgggcggtacaaatagcagggaaccgcgcggtcgctcagcagtgacgtgaca
cgcagcccacggtctgtactgacgcgccctcgcttcttcctctttctcgactccatcttc
gcggtagctgggaccgccgttcaggtaagaatccttggctggatccgaagggcttg
tagcaggttggctgctcagaaggcgcggaaccgaagaaccctgctccg
tggccctgctccagtccctatccgaactccttgggaggcactggccttccgcacgtgagc
cgccgcgaccaccatcccgtcgcgatcgtttctggaccgctttccactcccaaatctcct
ttatcccagagcatttcttggcttctcttacaagccgtcctttactcagtcgccaa
tatgcagctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaac
ggtcgcccagatcaaggtaaggctgcttggtgcgccctgggttccacttgtgctct
tcactctcgcggcccgagggaacgcttacgagccttatctttccctgtaggctcatgtag
cctcactggagggcattgggaagatcaagtcgtgctcctggcaggcgcgtgg
aggatgaggccactctgggccagtgctggaggccctgactaccctggaagtagcag
gccgcatgcttggaggtgagtgagagaggaatgttctttgaagtaccggtaagcgtctag
tgagtgttgcatagtcctgacagctgagtgtcacacctatggtaatagagtacttc
tcactgtcttcagttcagagtgattcttcctgtttacatccctcatgttgaacacagacg
tccatgggagactgagccagagtgtagttgtatttcagtcacatcacgagatcctagtct
ggttatcagcttccacactattaggtcagaccaggaaagtgctctataaatt
agaagctggaagatcctgaaatgaaacttaagatttcaaggtcaaatatctgcaactttg
ttctcattacctattgggcgcagcttctctttaaaggcttgaattgaggatt
ctgctgggtggcaccttcttgctcttacctgctggtgccttcctttcccactacaggtaa
agtccatggttccctggcccgtgctgggtgagaggtcagactcctaaggtgagtga
gagtattagtggtcatggtgttaggactttcctttcacagctaaaccaagtccctg
ggctcttactcggtttgccttctccctccctggagatgagcctgagggaagggatgctag
gtgtggaagacaggaaccagggcctgattaaccttcccttctccaggtggccaaacagga
gaagaagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgt
caacgttgtgcccacctttggcaagaagaagggaatgccaactcttaagtcg
taattctggctttctctaataagccacttagttcagtcatcgcattgtttcatctt
tacttgcaaggcctcagggagaggtgtgcttctcgg

i.e. The ">" is missing on all the FASTA sequences.

So, it looks like EMBOSS 6.1.0 fixed one problem with
IntelliGenetics files, but that there is still an issue here.

Peter C.

P.S. Should I have reported this possible bug via sourceforge?

P.P.S. Back in 2006, I reported a similar issue with a data
corruption reading stockholm/pfam with DOS newlines
(Sourceforge Bug #1588956, long since fixed). It seems to
me that EMBOSS would benefit from explicit testing of all
the file formats using DOS/Windows newlines when run on
Unix, and vice versa. Does that sound feasible, or just
hopelessly ambitious?


[EMBOSS] FASTQ format documentation

2009-07-20 Thread Peter
Hi all,

I was just trying to double check the names EMBOSS 6.1.0 supports
for the various FASTQ file formats, and none of them are listed here:
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

Does this need updating, or should I be looking elsewhere?

Thanks

Peter C.


Re: [EMBOSS] EMBOSS seqret : IntelliGenetics and new DOS lines

2009-07-20 Thread Peter
On Mon, Jul 20, 2009 at 5:16 PM, Peter Rice wrote:
>
> Peter C. wrote:
>> Hi all,
>>
>> I've just updated my Mac to EMBOSS 6.1.0, and have found an
>> issue with seqret conversion of IntelliGenetics files. After some
>> digging, I think this problem relates to having DOS new lines in
>> a file on Unix (in my case, Mac OS X).
>
> we have an application "noreturn" to fix things like this.

That's basically an EMBOSS variant on unix2dos and dos2unix
(or similar) existing Unix command line tools?

I'm more interested in having all the EMBOSS tools handle either
new line format themselves automatically. These days I am mostly
working on Unix (including Mac OS X), but I do have to cope with
Windows style text files quite often.

> If you send me your file I will try to take a look at whether we should
> be catching the funny newline characters.

For this bug report I was using:
http://emboss.sourceforge.net/docs/themes/seqformats/ig

There are another three example files used in the Biopython unit
tests here:
http://biopython.open-bio.org/SRC/biopython/Tests/IntelliGenetics/

>> P.S. Should I have reported this possible bug via sourceforge?
>
> The emboss-...@emboss.open-bio.org list is the best way to get
> our attention

Great, another mailing list to sign up to... but if that is your
preferred route, that's fine.

>> P.P.S. Back in 2006, I reported a similar issue with a data
>> corruption reading stockholm/pfam with DOS newlines
>> (Sourceforge Bug #1588956, long since fixed). It seems to
>> me that EMBOSS would benefit from explicit testing of all
>> the file formats using DOS/Windows newlines when run on
>> Unix, and vice versa. Does that sound feasible, or just
>> hopelessly ambitious?
>
> We can try ... how well does Biopython handle these? (i.e. do we need
> such examples for perl, python etc or is this an EMBOSS-specific issue?)

I think this is an EMBOSS specific issue. I don't know enough about
how all the different EMBOSS parsers work, but is there a single
place where you could add automatic handling of either new line
convention when reading in text?

For reference, in Python, you can explicitly open text files in "universal
newlines" mode, which takes care of this. I don't know about Perl.
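
A small sketch of what that looks like in current Python 3, where
universal newlines is the default for text mode (newline=None). The
file contents are made up for the demonstration:

```python
import os
import tempfile


def read_lines(path):
    """Read a text file, accepting Unix, DOS/Windows or old Mac line endings."""
    # newline=None is the default; it is spelled out here for clarity.
    with open(path, "r", newline=None) as handle:
        return [line.rstrip("\n") for line in handle]


# Write a file mixing all three newline conventions, then read it back.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first\r\nsecond\nthird\r")
    path = tmp.name

print(read_lines(path))  # ['first', 'second', 'third']
os.remove(path)
```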

Peter C.


Re: [EMBOSS] EMBOSS seqret : IntelliGenetics and new DOS lines

2009-07-20 Thread Peter
Peter Rice wrote:
>
> Thanks for the example files. I will start with those.
>
> Peter C. wrote:
>> I think this is an EMBOSS specific issue. I don't know enough about
>> how all the different EMBOSS parsers work, but is there a single
>> place where you could add automatic handling of either new line
>> convention when reading in text?
>
> Hope so. I think the issue is places where the parsing is checking
> explicitly for \n rather than \n and \r. The solution would be to strip
> both off before parsing. It will need a thorough clean through the
> ajseqread code.

That sounds like a good investment of effort in the long run :)

Peter C.


Re: [EMBOSS] EMBOSS seqret : IntelliGenetics and new DOS lines

2009-07-21 Thread Peter
Peter Rice wrote:
>
> Peter C. wrote:
>> However, if we now convert this input file to use DOS/Windows
>> newlines, and repeat the test (on Mac OS X, so Unix):
>>
>> $ embossversion
>> Reports the current EMBOSS version number
>> 6.1.0
>> $ seqret -sequence emboss_ig.txt -sformat ig -osformat fasta -auto -filter
>>  H.sapiens fau mRNA, 518 bases
>> ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
>>
>> i.e. The ">" is missing on all the FASTA sequences.
>
> Actually, it's not missing ... it is hiding.
>
> The sequence id has a ^M appended to it, so the '>' and the id get
> overwritten by the description when you look at the file.

That makes sense, and I think I can see how it might have happened.
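
To illustrate the "hiding" effect, here is a toy simulation of how a
terminal displays a line containing a carriage return: the cursor jumps
back to column zero and later text overwrites what was there. The ID
HSFAU is a hypothetical stand-in for whatever ID the file really has.

```python
def render_with_carriage_returns(line):
    """Simulate terminal display of a line containing carriage returns."""
    cells = []
    col = 0
    for ch in line:
        if ch == "\r":
            col = 0          # cursor returns to the start of the line
            continue
        if col < len(cells):
            cells[col] = ch  # overwrite what was already displayed
        else:
            cells.append(ch)
        col += 1
    return "".join(cells)


# Hypothetical FASTA header whose ID kept the trailing ^M from the DOS file.
header = ">HSFAU\r H.sapiens fau mRNA, 518 bases"
print(render_with_carriage_returns(header))  # " H.sapiens fau mRNA, 518 bases"
```

The '>' and the ID are still in the file, which a hex dump or
cat -v would show.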

> Fixed by processing the IG format ID rather than simply copying it.
>
> Thanks for finding that one.

Sure,

Peter C.



[EMBOSS] FASTQ records with no sequence?

2009-07-30 Thread Peter
Hi all,

On the continuing topic of the nebulous FASTQ format, are there
any strong views as to whether a FASTQ file could hold records
without a sequence (and therefore no quality scores)? This could
make sense as output from an (aggressive) quality filter.

This is a corner case, and applies to other file formats too of course
(e.g. FASTA).

I mentioned this to Peter Rice (EMBOSS) off list, and he replied:

On Thu, Jul 30, 2009 at 2:56 PM, Peter Rice wrote:
> EMBOSS rejects zero length sequences - something we put in some years
> ago for misformatted FASTA files that someone ran through a Taverna
> workflow to launch clustalw via EMBOSS's "emma". The user had got his
> carriage control characters mangled so the sequence was appended to the
> FASTA '>' line and appeared as a long description with no sequence.
>
> I can well imagine for filtering paired reads that zero length sequences
> would be useful.
>
> At the point where the test is made we know the sequence format.
> We can therefore define some or all formats as accepting or rejecting
> zero length sequences.
>
> Similarly we can easily extend to define some applications (e.g. emma)
> as requiring a minimum sequence length.
>
> regards,
>
> Peter

Peter Rice is of course correct - in general the meaning and validity
of a zero length sequence is context dependent.

I think Peter Rice makes a good point regarding paired end reads.
What I assume he was getting at is the situation where due to
quality trimming, one of a pair might be trimmed to nothing - leaving
essentially a singleton read. However, paired end reads are normally
stored using a matched pair of FASTQ files, so it could be important
to keep the zero length read present, so that they can be read in
together in sync.

If we do want to allow zero length sequences in FASTQ, would
both of the following be valid? Should there be empty sequence
and quality lines, or no sequence and quality lines?

"@identifier\n+\n" (two lines, just the @ and + lines)
"@identifier\n\n+\n\n" (four lines, including blank seq and qual lines)

or with the repeated identifier on the plus lines:

"@identifier\n+identifier\n" (two lines, just the @ and + lines)
"@identifier\n\n+identifier\n\n" (four lines, including blank lines)

As we are recommending no line wrapping on output this means
typical FASTQ records would be four lines - so doing the same
makes sense here too.
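
To make the four-line form concrete, here is a minimal Python sketch of writing such a record. The function name format_fastq_record is made up for this illustration and is not an EMBOSS or Biopython API; it assumes unwrapped output and a bare '+' line.

```python
def format_fastq_record(identifier, sequence="", quality=""):
    """Return one unwrapped, four-line FASTQ record as a string.

    Hypothetical helper for illustration only. A zero length read
    still produces four lines, two of them blank.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # The '+' line is left bare; repeating the identifier there is optional.
    return "@%s\n%s\n+\n%s\n" % (identifier, sequence, quality)


# A zero length read keeps its place in the file as four lines.
print(repr(format_fastq_record("identifier")))
```

Writing the blank lines keeps every record at four lines, so a matched pair of files can still be walked record by record.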

Peter C.


Re: [EMBOSS] FASTQ records with no sequence?

2009-07-30 Thread Peter
On Thu, Jul 30, 2009 at 4:09 PM, Peter Rice wrote:
>
> Peter C. wrote:
>
>> As we are recommending no line wrapping on output this means
>> typical FASTQ records would be four lines - so doing the same
>> makes sense here too.
>
> I vote for 4 lines on output.

If we want to allow zero length sequences, then yes, I would also
vote for the 4 line output (i.e. blank lines for the sequence and
the quality string).

> It should be possible to allow zero lines on input depending on
> where the '+' check is.

Yes, I'm pretty sure a parser could cope with any of the zero length
sequence FASTQ examples I gave.
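
As a rough check of that claim, here is a minimal Python sketch of a parser that accepts both the two-line and the four-line zero length forms. It is an illustration only, not the EMBOSS or Biopython parser: parse_fastq is a made-up name, and it assumes unwrapped records (one sequence line, one quality line).

```python
def parse_fastq(handle):
    """Yield (identifier, sequence, quality) from unwrapped FASTQ lines.

    Hypothetical sketch: tolerates zero length records written either
    as "@id\\n+\\n" or as "@id\\n\\n+\\n\\n".
    """
    lines = [line.rstrip("\r\n") for line in handle]
    pos = 0
    while pos < len(lines):
        header = lines[pos]
        pos += 1
        if not header:
            continue  # tolerate stray blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected '@' line, got %r" % header)
        # Sequence line: absent if the '+' line follows the header directly.
        if pos < len(lines) and lines[pos].startswith("+"):
            sequence = ""
        elif pos < len(lines):
            sequence = lines[pos]
            pos += 1
        else:
            raise ValueError("truncated record")
        if pos >= len(lines) or not lines[pos].startswith("+"):
            raise ValueError("missing '+' line")
        pos += 1
        if sequence:
            if pos >= len(lines):
                raise ValueError("missing quality line")
            quality = lines[pos]
            pos += 1
        else:
            quality = ""
            # Four-line form: consume the blank quality line if present.
            if pos < len(lines) and lines[pos] == "":
                pos += 1
        if len(quality) != len(sequence):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality
```

The only real decision is where the '+' check goes: testing for it before reading a sequence line is what lets the two-line form through.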

Peter


[EMBOSS] FASTQ records with no sequence?

2009-07-30 Thread Peter
Hi all,

On the continuing topic of the nebulous FASTQ format, are there
any strong views as to whether a FASTQ file could hold records
without a sequence (and therefore no quality scores)? This could
make sense as output from an (aggressive) quality filter.

This was a discussion I meant to start on the OBF list, not the
EMBOSS list - so here is the start of the thread:
http://lists.open-bio.org/pipermail/emboss/2009-July/003707.html

Basically in some contexts an empty FASTQ record makes sense,
so perhaps we should include examples of this for our test suite.
However, there is more than one reasonable way to represent
such a record (either omitting the sequence and quality lines, or
including blank sequence and quality lines).

On Thu, Jul 30, 2009 at 4:09 PM, Peter Rice wrote:
>
> Peter C. wrote:
>
>> As we are recommending no line wrapping on output this means
>> typical FASTQ records would be four lines - so doing the same
>> makes sense here too.
>
> I vote for 4 lines on output.

If we want to allow zero length sequences, then yes, I would also
vote for the 4 line output (i.e. blank lines for the sequence and
the quality string).

> It should be possible to allow zero lines on input depending on
> where the '+' check is.

Yes, I'm pretty sure a parser could cope with any of the zero length
sequence FASTQ examples I gave.

Peter


[EMBOSS] GFF/GFF2/GFF3 examples on EMBOSS webpage

2009-08-06 Thread Peter
Hi all,

I was just looking at this page:
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

This table lists GFF2 as one entry, and GFF/GFF3 as another. They link
to: http://emboss.sourceforge.net/docs/themes/seqformats/gff2 and
http://emboss.sourceforge.net/docs/themes/seqformats/gff respectively.

These examples appear to be identical (and the header says it is a
GFF2 file). So I am a bit confused. Should one be a GFF3 file, and
simply one file was uploaded twice by mistake?

Thanks,

Peter C.


[EMBOSS] vectorstrip on FASTQ files

2009-08-19 Thread Peter
Hi,

I'm trying to use vectorstrip on FASTQ files (as a simple way to
remove adaptor or primer sequences). However, it seems that on output
the FASTQ qualities are missing (all set to the double quote, ASCII
34, meaning PHRED quality 1 or random). Is this a known bug (or
rather, a missing feature)?

For illustration I am using a Sanger style FASTQ file from the NCBI
SRA (short reads originally from Solexa/Illumina), SRR014849.fastq
which you can download from
ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX003/SRX003639/SRR014849.fastq.gz

I am pretending "GTTGGAACCG" is a 5' adaptor sequence, and want to find
any matches in some FASTQ reads, and trim it off taking only the
sequence to the right. For simplicity I'm allowing no mismatches.
Here is the start of the file:

$ head -n 12 SRR014849.fastq
@SRR014849.1 EIXKN4201CFU84 length=93
CTTTGTTTGGAACCGAAAGGGGAATTTCAAACCCCGGTTTCCAACCTTCCAAAGCAATGCCAATA
+SRR014849.1 EIXKN4201CFU84 length=93
3+&$#"""""""""""7...@71,'";C?,B;?6B;:EA1EA1EA5'9B:?:#9e...@2ea5':>5?:%A;A8A;?9B;D@/=5B;4B>+C?,EA09B;@;9E@/EA/E@/B:;1B:B:;A9<5SRR014849.9_from_31_to_84 EIXKN4201AL42E length=84
AAAGGGTTTGAATTCAAACCCTTTGGTTCCAACTTGTCTTGCTTTAGCCA

Using Sanger FASTQ runs:

$ vectorstrip -sequence SRR014849.fastq -sformat fastq-sanger \
    -readfile N -alinker "GTTGGAACCG" -blinker "" -osformat fastq-sanger \
    -outseq SRR014849_5trimmed.fastq -mismatch 0 -besthits Y \
    -outfile SRR014849_5trimmed.txt
Removes vectors from the ends of nucleotide sequence(s)

But the output is missing the quality scores:

$ head -n 4 SRR014849_5trimmed.fastq
@SRR014849.9_from_31_to_84 EIXKN4201AL42E length=84
AAAGGGTTTGAATTCAAACCCTTTGGTTCCAACTTGTCTTGCTTTAGCCA
+
""""""""""""""""""""""""""""""""""""""""""""""""""""""

Is this something simple to add to vectorstrip? What about other
annotation (e.g. running vectorstrip on annotated GenBank or EMBL
files)?
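
For illustration, this is roughly what "keeping the qualities" means when a 5' adaptor is trimmed with no mismatches. It is a minimal Python sketch, not how vectorstrip works internally; trim_5prime_adaptor is a made-up name, and the read and quality string in the usage lines are invented.

```python
def trim_5prime_adaptor(sequence, quality, adaptor):
    """Drop everything up to and including the first exact adaptor match.

    Hypothetical helper: the quality string is sliced at the same
    position as the sequence so the two stay the same length.
    """
    start = sequence.find(adaptor)
    if start == -1:
        return sequence, quality  # no match, leave the read untouched
    cut = start + len(adaptor)
    return sequence[cut:], quality[cut:]


# Invented read for demonstration; only the adaptor comes from the post.
seq, qual = trim_5prime_adaptor("TTGTTGGAACCGAAAGG", "ABCDEFGHIJKLMNOPQ",
                                "GTTGGAACCG")
print(seq, qual)
```

The point is simply that the same cut position applies to both strings, so the qualities do not need to be reset to a placeholder.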

Thanks,

Peter C.

P.S. This is with EMBOSS 6.1.0 with a patch from Peter Rice, running
on Mac OS X.


Re: [EMBOSS] vectorstrip on FASTQ files

2009-08-19 Thread Peter
 Peter Rice wrote:
>
> Peter C. wrote:
>> Hi,
>>
>> I'm trying to use vectorstrip on FASTQ files (as a simple way to
>> remove adaptor or primer sequences). However, it seems that on output
>> the FASTQ qualities are missing (all set to the double quote, ASCII
>> 34, meaning PHRED quality 1 or random). Is this a known bug (or
>> rather, a missing feature)?
>
> It is a missing feature. vectorstrip was written before quality scores
> became fashionable and, curiously, nobody has asked for them before.
>
> We will certainly retain them in a future release.

Great - thanks!

Peter C.


Re: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.

2009-09-16 Thread Peter
On Wed, Sep 16, 2009 at 7:57 AM, Charles Plessy
 wrote:
>
> Dear EMBOSS developers,
>
> I have a multi-sequence file in FASTQ format that contains sequencing reads, and
> would like to retrieve them with seqret. But as you see in the following
> example, quality scores are not preserved:
>
> $ seqret P13-CA.fq:F1EZY7316JY25B fastq::stdout
> Reads and writes (returns) sequences
> @F1EZY7316JY25B rank=040 x=3973.0 y=285.0 length=68
> AATGATACGGCGACCACCGAACACTGCGTTTGCTGGCTTTGATGCACTTCTCATGGCCAATTTCATTG
> +
> """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

You need to use "fastq-sanger" (or the other variants), since in
EMBOSS, "fastq" currently means FASTQ ignoring the qualities.
This is documented:

http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

As an EMBOSS user, I think the current situation is confusing, and
it would make much more sense to have "fastq" just an alias for
"fastq-sanger" (which would be consistent with Biopython and BioPerl).

http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000576.html

And also this email - especially the last example:
http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000599.html

> The purpose was to use seqret as a workaround for the fact that
> vectorstrip does not keep the quality either.

That's also been suggested, and is likely to be supported in future.
http://lists.open-bio.org/pipermail/emboss/2009-August/003722.html

Peter


Re: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.

2009-09-17 Thread Peter
On Thu, Sep 17, 2009 at 8:24 AM, Peter Rice  wrote:
>
>> Also, contrary to what the documentation predicts, using the fastq
>> format for the output does not ignore the quality scores. (Not that
>> it would be particularly useful, but…)
>
> This is deliberate. We have to write something in FASTQ format and we
> default to the fastq-sanger format. On input, "fastq" ignores qualities
> because there is no safe way to decide which format is correct.

So again, could you reconsider making "fastq" act like "fastq-sanger"?
The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
a superset of the Solexa/Illumina FASTQ variants - so even if you don't
know which kind of FASTQ file you have, and you don't care about the
qualities, parsing it as a Sanger FASTQ file will work.

Peter C.



Re: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.

2009-09-17 Thread Peter
On Thu, Sep 17, 2009 at 10:18 AM, Peter Rice  wrote:
>
>> So again, could you reconsider making "fastq" act like "fastq-sanger"?
>> The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
>> a superset of the Solexa/Illumina FASTQ variants - so even if you don't
>> know which kind of FASTQ file you have, and you don't care about the
>> qualities, parsing it as a Sanger FASTQ file will work.
>
> Yes, but it is dangerous if they could really be Solexa qualities.

Indeed, or an Illumina 1.3+ encoded FASTQ file.

So if the EMBOSS tools are used to read a FASTQ file without specifying
the FASTQ variant, do they currently detect it is FASTQ and default to the
"fastq" setting and ignore the quality information?

> What we could do is provide a utility that reads in fastq-sanger format and
> checks whether the quality scores make most sense as Sanger, Solexa or
> Illumina.

That could be useful - I guess you could scan all the reads building up
a histogram of the ASCII characters used. This could immediately
rule out some of the options, and then based on the distribution (if
you assume they are raw reads) you could make a good guess.

> I consider reading as fastq-sanger by default to be rather dangerous.

That is understandable. How about removing the current "fastq" output
then? That might prevent some of the confusion at the moment. I'm
struggling to see any purpose for the current "fastq" output - can you
give me any example use case? Right now it has to pick an arbitrary
quality symbol, and uses ASCII 34 (double quote) which means PHRED
1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
Illumina 1.3+ FASTQ file.
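
The arithmetic behind that claim can be sketched in a few lines of Python. The table below is an assumption written for this example (offset 33 for Sanger, offset 64 for Solexa and Illumina 1.3+, with Solexa scores allowed down to -5); decode_quality is a made-up name, not an EMBOSS function.

```python
# Assumed ASCII offsets and lowest legal scores for each FASTQ variant.
OFFSET = {"fastq-sanger": 33, "fastq-solexa": 64, "fastq-illumina": 64}
LOWEST_SCORE = {"fastq-sanger": 0, "fastq-solexa": -5, "fastq-illumina": 0}


def decode_quality(char, variant):
    """Turn one quality character into a score, or raise if it is illegal."""
    score = ord(char) - OFFSET[variant]
    if score < LOWEST_SCORE[variant]:
        raise ValueError("%r is not a valid %s quality" % (char, variant))
    return score


# The double quote is ASCII 34: 34 - 33 gives PHRED 1 in a Sanger file,
# while 34 - 64 is far below the lowest Solexa or Illumina 1.3+ score.
print(decode_quality('"', "fastq-sanger"))
```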

Regards,

Peter


Re: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.

2009-09-17 Thread Peter
On Thu, Sep 17, 2009 at 10:52 AM, Peter Rice  wrote:
>
>>> What we could do is provide a utility that reads in fastq-sanger format
>>> and checks whether the quality scores make most sense as Sanger,
>>> Solexa or Illumina.
>>
>> That could be useful - I guess you could scan all the reads building up
>> a histogram of the ASCII characters used. This could immediately
>> rule out some of the options, and then based on the distribution (if
>> you assume they are raw reads) you could make a good guess.
>
> The ACD file would be 'interesting'. We could set the default format to be
> "fastq-sanger" and issue some warning if we find the user had tried to
> change it. That way the application would run with a filename as the input,
> though it will appear to interfaces to be able to read any sequence input.
>
> Are there rules we can use to decide on improbable qualities? Values below
> the Illumina and Solexa minima would seem a good guide, and perhaps
> values above the likely short read maximum score.
>
> Maybe some existing pipelines have some cutoff values we could adopt?

Quite possibly. Telling apart raw Sanger reads and raw Solexa/Illumina
reads should be easy. However, unless there are some ASCII characters
in the range 59 to 63 (Solexa -5 to -1), there isn't going to be a safe way
to tell Solexa and Illumina 1.3+ apart. Of course, if they just have good
reads above Solexa/PHRED 10 (which would be ASCII 74), either way
it isn't going to make much difference. In any case, it will be heuristic,
and sometimes it will get it wrong (e.g. post processed Sanger FASTQ
files with high scores might look like raw reads in Solexa/Illumina
FASTQ).
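
The "rule out some of the options" step could look something like the following minimal Python sketch. It only uses the ASCII ranges discussed in this thread (33 to 126 for Sanger, 59 as the lowest Solexa character) plus the assumption that Illumina 1.3+ starts at ASCII 64. The function names are made up, and choosing among whatever remains would still need a distribution-based guess.

```python
from collections import Counter


def quality_histogram(quality_strings):
    """Count how often each ASCII character occurs in the quality strings."""
    counts = Counter()
    for quality in quality_strings:
        counts.update(quality)
    return counts


def variants_not_ruled_out(counts):
    """Return the FASTQ variants consistent with the characters seen."""
    if not counts:
        return ["fastq-sanger", "fastq-solexa", "fastq-illumina"]
    lowest = min(ord(char) for char in counts)
    highest = max(ord(char) for char in counts)
    if lowest < 33 or highest > 126:
        return []  # not valid in any of the three variants
    possible = ["fastq-sanger"]
    if lowest >= 59:   # nothing below Solexa -5
        possible.append("fastq-solexa")
    if lowest >= 64:   # nothing below Illumina 1.3+ PHRED 0 (assumed)
        possible.append("fastq-illumina")
    return possible
```

Anything below ASCII 59 leaves only Sanger, and characters in the range 59 to 63 exclude Illumina 1.3+; a file of only good reads stays ambiguous, as noted above.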

>>> I consider reading as fastq-sanger by default to be rather dangerous.
>>
>> That is understandable. How about removing the current "fastq" output
>> then? That might prevent some of the confusion at the moment. I'm
>> struggling to see any purpose for the current "fastq" output - can you
>> give me any example use case? Right now it has to pick an arbitrary
>> quality symbol, and uses ASCII 34 (double quote) which means PHRED
>> 1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
>> Illumina 1.3+ FASTQ file.
>
> It is an alias for fastq-sanger which should be OK. I prefer to have an
> output format name for each input format name where it looks sensible,
> so if we read "fastq" as an input format it should do something on
> output. Unfortunately that means it has to write quality scores somehow.

I'm not convinced that the current "fastq" output (with the double quote
quality string) is entirely "sensible". But I'll drop this now - I've argued my
case, and will leave it at that. As long as the current behaviour is clear
in the documentation, it should be OK.

Regards,

Peter


Re: [EMBOSS] Trim polyA from fastq files

2009-10-01 Thread Peter
On Thu, Oct 1, 2009 at 1:53 PM, michael watson (IAH-C)
 wrote:
>
> Hi Peter
>
> Thanks for that.
>
> Is it possible to preserve the fastq format?  My input was fastq, I also put 
> .fastq as my output, but it only gave me straight fasta
>

Use "fastq-sanger" (or a variant), not just "fastq" which means
ignoring the qualities in EMBOSS.

http://emboss.sourceforge.net/docs/themes/SequenceFormats.html#in

Peter



[EMBOSS] Fwd: [DAS] DAS workshop 7th-9th April 2010

2009-11-26 Thread Peter
This might be of interest to some of you.

Peter

-- Forwarded message --
From: Jonathan Warren 
Date: Thu, Nov 26, 2009 at 2:57 PM
Subject: [DAS] DAS workshop 7th-9th April 2010
To: d...@biodas.org, das_registry_annou...@sanger.ac.uk, biojava-dev
, BioJava , BioPerl
, a...@sanger.ac.uk, a...@ebi.ac.uk,
ensembldev 



We are considering running a Distributed Annotation System workshop
here at the Sanger/EBI in the UK subject to decent demand.
The workshop will be held from Wednesday 7th-Friday 9th April 2010. If
you would be interested in attending, either to present or just to take
part, then please email me j...@sanger.ac.uk

The format of the workshop is likely to be similar to last year's (1st
day for beginners, 2nd for both beginners and advanced users, 3rd day
for advanced), information for which can be found here:
http://www.dasregistry.org/course.jsp

If you would like to present then please send a short summary of what
you would like to talk about.

Thanks

Jonathan.

Jonathan Warren
Senior Developer and DAS coordinator
j...@sanger.ac.uk


Re: [EMBOSS] Trimming illumina short reads based on quality

2009-12-01 Thread Peter
On Tue, Dec 1, 2009 at 2:33 PM, michael watson (IAH-C)
 wrote:
>
> Hi
>
> I'm sorry if I've not been keeping up to date on what is doubtless a hot 
> topic.
>
> Does EMBOSS allow one to trim short reads based on quality data (from a fastq 
> file)?
>
> If not, I have read that it is planned - any idea when it will be implemented?

Not yet, but it has been proposed and I understand it is on the
EMBOSS to do list along with quality filtering (Peter Rice has
suggested the name quaffle for this):
http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030493.html

I dare say suggestions for precise trimming algorithms (e.g. median
over sliding window) might be welcome.
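
As one possible reading of "median over sliding window", here is a minimal Python sketch. It is an illustration, not a proposed EMBOSS algorithm: the function name, the window size of 5 and the threshold of 20 are all arbitrary choices made for this example.

```python
from statistics import median


def trim_by_window_median(sequence, qualities, window=5, threshold=20):
    """Cut the read at the first window whose median quality is too low.

    qualities is a list of integer scores, one per base. Reads shorter
    than the window are returned unchanged.
    """
    for start in range(len(qualities) - window + 1):
        if median(qualities[start:start + window]) < threshold:
            return sequence[:start], qualities[:start]
    return sequence, qualities


# Invented read: six good bases followed by four poor ones.
print(trim_by_window_median("ACGTACGTAC", [30] * 6 + [5] * 4))
```

Cutting at the start of the failing window is a conservative choice: it can discard a few good bases just before the quality drops, which a real tool might want to refine.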

> Otherwise, alternative suggestions are welcome!

I'm sure there are plenty of scripts out there, in Perl, Python etc.
What is your language of choice?

Peter C.


[EMBOSS] Unknown output format 'refseqp' and 'genpept'

2009-12-07 Thread Peter
Hi,

I have a protein IntelliGenetics file used in the Biopython test suite:
http://biopython.org/SRC/biopython/Tests/IntelliGenetics/VIF_mase-pro.txt

I am using EMBOSS 6.1.0 (patch level 2 I think), and I am trying
to turn this into a "GenBank Protein File", or GenPept file, using
EMBOSS seqret.

EMBOSS can read the file fine, this works:
$ seqret -auto -sformat=ig -osformat=fasta VIF_mase-pro.txt temp.txt

Giving FASTA output with 16 gapped protein sequences, which is
good - although the ID of the first record is a bit odd.

Using "genbank" as the output format in EMBOSS seems to
mean nucleotide and not protein:

$ seqret -auto -sformat=ig -osformat=genbank VIF_mase-pro.txt temp.txt
Error: Sequence format 'genbank' not supported for protein sequences
Error: Sequence format 'genbank' not supported for protein sequences
...
Error: Sequence format 'genbank' not supported for protein sequences

Referring to the documentation,
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
I then tried "genpept" and "refseqp":

$ seqret -auto -sformat=ig -osformat=genpept VIF_mase-pro.txt temp.txt
Error: Unknown output format 'genpept'
Error: Unknown output format 'genpept'
...
Error: unknown output format 'genpept'

$ seqret -auto -sformat=ig -osformat=refseqp VIF_mase-pro.txt temp.txt
Error: Unknown output format 'refseqp'
Error: Unknown output format 'refseqp'
...
Error: unknown output format 'refseqp'

Doesn't EMBOSS seqret support genpept/refseqp as an output format?

Thanks,

Peter C.


Re: [EMBOSS] Unknown output format 'refseqp' and 'genpept'

2009-12-08 Thread Peter
On Tue, Dec 8, 2009 at 1:32 PM, Peter Rice  wrote:
>
> Peter wrote:
>>
>> Hi,
>>
>> I have a protein IntelliGenetics file used in the Biopython test suite:
>> http://biopython.org/SRC/biopython/Tests/IntelliGenetics/VIF_mase-pro.txt

It probably doesn't matter what the input file is here, the fact that
it was an (obsolete) format like IntelliGenetics was just chance as
I was working on a Biopython unit test.

>> I am using EMBOSS 6.1.0 (patch level 2 I think), and I am trying
>> to turn this into a "GenBank Protein File", or GenPept file, using
>> EMBOSS seqret.
>>
>> Doesn't EMBOSS seqret support genpept/refseqp as an output format?
>
> Oddly enough you are the first to ask for it.

That surprises me a little bit.

Could I suggest you treat known input formats which are not supported
as output formats a little differently and instead of this:

unknown output format 'genpept'

Perhaps give,

format 'genpept' is not supported for output (only input)

This would help the user rule out having a typo etc.
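
In code terms the suggestion amounts to checking the input format table before giving up. A minimal Python sketch, where the two sets are invented for this example and do not reflect the real EMBOSS format tables:

```python
# Illustrative tables only, not the real EMBOSS lists of formats.
INPUT_FORMATS = {"fasta", "genbank", "genpept", "refseqp"}
OUTPUT_FORMATS = {"fasta", "genbank"}


def output_format_error(name):
    """Return an error message for an unusable output format, else None."""
    if name in OUTPUT_FORMATS:
        return None
    if name in INPUT_FORMATS:
        # Known name, so the user has not made a typo.
        return "format '%s' is not supported for output (only input)" % name
    return "unknown output format '%s'" % name


print(output_format_error("genpept"))
print(output_format_error("genpetp"))
```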

> Does biopython have a definition of the fields it expects to write out in a
> GenPept or RefseqP format file? We would be able to allow GenBank as an
> alias for, presumably, genpept.

Not explicitly, no. I was hoping to use EMBOSS for cross validation ;)

With hindsight this may have been a mistake, but we use "genbank"
format to mean either nucleotides or proteins. On parsing we just
look at the units of length in the LOCUS line (bp or aa). We also
try to cope with both the current NCBI files and some older variants
we have in our unit tests (different offsets in the LOCUS line).
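
The idea of looking at the units in the LOCUS line can be sketched as below. This is a simplified illustration and not the actual Biopython parser: it splits on whitespace rather than using column offsets, and locus_molecule_kind is a made-up name. The nucleotide LOCUS line in the usage example is invented.

```python
def locus_molecule_kind(locus_line):
    """Return "protein" or "nucleotide" from the LOCUS line length units."""
    fields = locus_line.split()
    if len(fields) < 4 or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % locus_line)
    # fields[1] is the name, fields[2] the length, fields[3] the units.
    if not fields[2].isdigit():
        raise ValueError("no length found in %r" % locus_line)
    if fields[3] == "aa":
        return "protein"
    if fields[3] == "bp":
        return "nucleotide"
    raise ValueError("unknown length units %r" % fields[3])


print(locus_molecule_kind(
    "LOCUS       EXAMPLE1                1000 bp    DNA     linear   UNA 01-JAN-2000"))
```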

> Might be a good time to merge the format names and details from biopython
> and emboss. Where can I find the biopython ones?

There are two tables on the wiki which include version information:
http://biopython.org/wiki/SeqIO
http://biopython.org/wiki/AlignIO

You can also consult the built in documentation, also available online:
http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html
http://biopython.org/DIST/docs/api/Bio.AlignIO-module.html

For a long time I avoided having aliases (multiple names for the same
thing). However, we now treat "gb" as an alias for "genbank" (since
this is what the NCBI use in Entrez). We also treat "fastq-sanger" and
"fastq" the same.

Peter C (the one at Biopython)


Re: [EMBOSS] Unknown output format 'refseqp' and 'genpept'

2009-12-08 Thread Peter
On Tue, Dec 8, 2009 at 2:11 PM, Peter Rice  wrote:
>
>> With hindsight this may have been a mistake, but we use "genbank"
>> format to mean either nucleotides or proteins. On parsing we just
>> look at the units of length in the LOCUS line (bp or aa). We also
>> try to cope with both the current NCBI files and some older variants
>> we have in our unit tests (different offsets in the LOCUS line).
>
> We try that too on input, but for output we have to be explicit so the user
> can pick just one of the choices.

I imagine that as with Biopython, sometimes the user has made it
explicit that they are dealing with nucleotides or proteins (lots of
the EMBOSS tools have switches for this), so you know if you
should be using "aa" or "bp" in the LOCUS line.

Peter


Re: [EMBOSS] Genpept entry in MSE

2009-12-15 Thread Peter
On Tue, Dec 15, 2009 at 12:14 PM, Steve Taylor
 wrote:
>
> Hi,
>
> I am trying to load a Genpept entry into MSE, EMBOSS Version 6.0.1 on
> Fedora. Unfortunately it doesn't like the LOCUS line.
>
> It loads, but warns:
>
> Warning: bad Genbank LOCUS line 'LOCUS       ACN78416                 225 aa
>          linear   BCT 21-MAR-2009'
>
> Changing the aa to bp fixes it.

What command line did you use? If you specified format "genbank",
I think you should use format name "genpept" or "refseqp" instead:
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

Peter



Re: [EMBOSS] Genpept entry in MSE

2009-12-15 Thread Peter
On Tue, Dec 15, 2009 at 3:26 PM, Steve Taylor
 wrote:
>
> I didn't specify any format. I assumed it would pick it up...

EMBOSS is normally pretty good at deducing file formats, so I
would have expected it to cope too.

> However, I still get the error if I use
>
> mse -sformat1 genpept -sequence ACN78417.pep
>
> Is this what you mean?

Probably - although I don't think I have ever used mse myself.

Hopefully an EMBOSS developer can enlighten us.

Peter C.


Re: [EMBOSS] Macports/EMBOSS dyld issue

2009-12-15 Thread Peter
On Tue, Dec 15, 2009 at 8:16 PM, Tom Keller  wrote:
>
> Hi,
> I'm running Mac OS X 10.6, and have EMBOSS 6.0.1 installed via MacPorts. And 
> I have macport installed jpeg.7.dylib at /opt/local/lib/
>
> But I get the following error:
> $ wossname wossname
> dyld: Library not loaded: /opt/local/lib/libjpeg.62.dylib
>  Referenced from: /opt/local/bin/wossname
>  Reason: image not found
> Trace/BPT trap
>
> I tried making a link from jpeg.7.dylib to /opt/local/lib/libjpeg.62.dylib 
> but then I get the error:
>
> dyld: Library not loaded: /opt/local/lib/libjpeg.62.dylib
>  Referenced from: /opt/local/bin/wossname
>  Reason: Incompatible library version: wossname requires version 63.0.0 or 
> later, but libjpeg.62.dylib provides version 8.0.0
> Trace/BPT trap
>
> Can someone suggest a solution?
>
> Thomas (Tom) Keller
> kellert at ohsu.edu
> 503.494.2442
> 6339b R Jones Hall (BSc/CROET)
> www.ohsu.edu/xd/research/research-cores/dna-analysis/

That looks like two problems, you seem to have libjpeg 62.x.x
which is too old, but also EMBOSS (or dyld) isn't reporting the
same kind of version number. Do you (or MacPorts) have a
libjpeg.63.dylib file you could try?

[I've never tried this - this is an informed guess at best]

Peter



Re: [EMBOSS] getorf includes unspecified amino acids as part of the ORF sequence

2010-01-11 Thread Peter
On Mon, Jan 11, 2010 at 2:26 PM, Fungazid  wrote:
>
> Hello people,
>
> I just installed emboss on linux ubuntu (using the ubuntu synaptic package 
> manager). I am using the getorf program, and I see it gives me this kind of 
> output lines:
>
>>1_3 [803 - 1120]
> LARLRFVVLGNSFIASAKGWSTPYGPTTFGPFRSCIYPRVFRSTRVRKAMATRIGSNRVN
> ILIRCTXNPYLGWWCYIFCIFR
>
> I don't like the Xs as they represent unspecified amino acids. Is there an 
> input parameter to tell the program to report only the regions before and 
> after the Xs ?
>
> In addition (and maybe this is beyond the scope of this mailing list) what is 
> the biological meaning of such Xs ?

What was the input sequence like? Was there a stretch of N perhaps?

Peter


Re: [EMBOSS] Many-to-many with needle and water

2010-01-15 Thread Peter
On Mon, Jul 6, 2009 at 10:35 AM, Peter Rice  wrote:
>
> Peter Cock or biopython wrote:
>> Hi Peter R. et al,
>>
>> I gather EMBOSS is looking for feedback for new applications (given
>> the recent funding from the BBSRC - congratulations again). How about
>> suggestions for extensions to existing EMBOSS applications?
>>
>> I've used bits of EMBOSS for several years now (thank you!). Something
>> I have sometimes wanted to do is a many-to-many pairwise sequence
>> alignment with the EMBOSS tools needle and water.
>>
>> Right now, needle and water take two files (here referred to as A and
>> B), file A has just one sequence, and file B can have one or more
>> sequences. I'd like to be able to supply two files both with multiple
>> entries, and have needle/water do pairwise alignments between all the
>> sequences in A against all the sequences in B. This might be useful
>> for finding reciprocal best hits in comparative genomics (as a slower
>> but exact alternative to FASTA or BLAST).
>
> The application is easy to add (after the release)
>
> The usual problem with all-against-all is that it involves loading one
> of the inputs as a sequence set entirely in memory - to avoid reading
> one input many times over.
>
> We have an application supermatcher which does this - the first sequence
> is streamed through, the second is a sequence set loaded into memory. It
> uses word matching to find seed alignments then runs a limited alignment
> around the hits.
>
> superwater would be a possible name (or superneedle).

I see EMBOSS 6.2 has a new tool "needleall" (although if there is a
matching "waterall" the changelog doesn't mention it):
http://lists.open-bio.org/pipermail/emboss/2010-January/003823.html

I'll have to try this out...
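
For reference, the pairing pattern discussed above (stream one input, hold the other in memory) can be sketched in Python as follows. This says nothing about how needleall itself is implemented; the function names are made up, and the second function reflects the point in the original post that a self-comparison only needs half the combinations.

```python
from itertools import combinations


def many_to_many(stream_a, set_b):
    """Pair each record streamed from A with every record of B.

    B is cached in memory so it can be reused for every record of A,
    which also works when A arrives on stdin.
    """
    cached_b = list(set_b)
    for a in stream_a:
        for b in cached_b:
            yield a, b


def all_against_all(records):
    """Each unordered pair once, skipping self and mirrored comparisons."""
    return combinations(list(records), 2)


print(list(many_to_many(iter(["A1", "A2"]), ["B1", "B2"])))
print(list(all_against_all(["A1", "A2", "A3"])))
```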

Peter


[EMBOSS] Broken links on Emboss webpages

2010-03-12 Thread Peter
Hi,

I was just looking for the EMBOSS EMBASSY documentation for the
PHYLIPNEW packages, and noticed they are missing from this page:
http://emboss.sourceforge.net/embassy/

Perhaps this should redirect to the latest release? i.e.
http://emboss.sourceforge.net/apps/release/6.2/embassy/index.html

I also found the links on this page seem to be broken:
http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/phylogeny_molecular_sequence_group.html

Regards,

Peter


[EMBOSS] ABI to FASTQ with seqret

2010-03-30 Thread Peter
Hi all,

I've got some "Sanger" capillary sequence files in ABI trace file
format, which I understand includes the probabilities of the 4 bases
along the sequencing run. I'd like to extract this as a FASTQ file
with meaningful quality scores based on the trace data (for use in
assembly).

This doesn't seem to work - the FASTQ quality score characters are all
double quotes (ASCII 34), meaning PHRED quality 1.

seqret -sformat abi -osformat fastq-sanger -sequence example.ab1 \
    -outseq example.fastq -auto

Output as FASTA seems fine:

seqret -sformat abi -osformat fasta -sequence example.ab1 \
    -outseq example.fasta -auto

Is ABI to FASTQ a reasonable thing to expect seqret to support? If so, could
it be added to the TODO list please?

Peter C.

P.S. I'd be interested to hear suggestions for alternative tools to
tackle this conversion.


Re: [EMBOSS] ABI to FASTQ with seqret

2010-03-30 Thread Peter
On Tue, Mar 30, 2010 at 1:02 PM, Peter Rice  wrote:
>
> On 30/03/2010 12:46, Peter C. wrote:
>>
>> Hi all,
>>
>> I've got some "Sanger" capillary sequence files in ABI trace file
>> format, which I understand includes the probabilities of the 4 bases
>> along the sequencing run. I'd like to extract this as a FASTQ file
>> with meaningful quality scores based on the trace data (for use in
>> assembly).
>>
>> This doesn't seem to work - the FASTQ quality score characters are all
>> double quotes (ASCII 34), meaning PHRED quality 1.
>
> I will take a look. I don't recall anyone using the quality scores from ABI
> data when we first implemented it (at that time Staden Experiment files
> were the only supported output format with any quality scores)
>

Thanks Peter,

Regarding other possible tools, there is the obvious choice of
PHRED (although getting a copy is non-trivial), and based on
this thread: http://seqanswers.com/forums/showthread.php?t=3165
I've just tried TraceTuner 3.0.6beta which is open source
(specifically, GPL v2 or later):
https://sourceforge.net/projects/tracetuner/

Using the ttuner -nocall option to reuse the sequence as-is from
the ABI file results in zero quality scores.

Allowing ttuner to re-call the bases (the default), it can output
FASTA/QUAL/PHD with meaningful qualities (from which I can
easily make a FASTQ file).
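
The last step is simple enough to sketch in Python. This assumes the sequence and its integer PHRED scores have already been read from the FASTA and QUAL files; phred_to_sanger_fastq is a made-up name, and capping scores at 93 (ASCII 126) is a choice made here, not something TraceTuner or EMBOSS prescribes.

```python
def phred_to_sanger_fastq(identifier, sequence, scores):
    """Build one Sanger FASTQ record from a sequence and PHRED scores."""
    if len(sequence) != len(scores):
        raise ValueError("sequence and score lengths differ")
    if any(score < 0 for score in scores):
        raise ValueError("PHRED scores cannot be negative")
    # Sanger FASTQ stores PHRED + 33; ASCII 126 is the highest character.
    quality = "".join(chr(33 + min(score, 93)) for score in scores)
    return "@%s\n%s\n+\n%s\n" % (identifier, sequence, quality)


print(phred_to_sanger_fastq("read1", "ACGT", [0, 1, 40, 93]))
```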

Peter C.


Re: [EMBOSS] ABI to FASTQ with seqret

2010-03-30 Thread Peter
On Tue, Mar 30, 2010 at 2:25 PM, Peter Rice  wrote:
>
> On 30/03/2010 14:13, Peter Rice wrote:
>
>> Where do I look to find scores that we can use (and how do we convert
>> those to phred quality scores)?
>
> Aha, found something. The field is called PCON (confidence values), with
> values 0-255.
>
> There is a possibility that these could be phred scores, but I suspect they
> are whatever the basecaller has decided to write there.
>
> http://www.appliedbiosystems.com/support/software_community/ABIF_File_Format.pdf
>
> Peter R.

Hmm. Good question - I don't know, although if they are PHRED scores
they could go unusually high (we'd expect say 0 to 50 for a raw read).
It could be some other encoding (e.g. scaled from 0 for a poor base to
255 for a perfect base). Do you have any contacts at Applied Biosystems
to ask?

Peter C.


Re: [EMBOSS] ABI to FASTQ with seqret

2010-03-30 Thread Peter
On Tue, Mar 30, 2010 at 2:33 PM, Zheng Jin Tu  wrote:
>
>
> Hi Peter:
>
> You may want to check this URL about how to
> convert quality score:
>
>  http://maq.sourceforge.net/fastq.shtml
>
> Thanks, TU

Thanks - but that just covers converting between PHRED scores
and Solexa Scores. Peter Rice and I are well aware of this.

The question here is what do the numbers in ABI files mean?

Peter C.



[EMBOSS] Nucleotide dotplots with EMBOSS

2010-04-07 Thread Peter
Hello EMBOSS team,

I've just been using dottup to produce dot plots comparing two
nucleotide sequences (two assemblies), where I have regions of very
high similarity but some inversions.
http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/dottup.html

I've noticed that I can tell dottup to reverse either of the
sequences, but I would actually like it to search both for forward
matches AND reverse matches to display on the same plot (ideally using
different colours). Is this possible already, or might it be a
reasonable feature request? Right now I can generate one plot with the
forward matches, and a second plot with the reverse matches - not
ideal.

Thanks,

Peter C.

P.S.
While I'm asking, I'd also like (colour) PDF output, since working
with PDF files is much easier on the Mac than postscript (which
thankfully is trivial to convert into PDF - so this isn't a big issue
for me).


Re: [EMBOSS] Nucleotide dotplots with EMBOSS

2010-04-07 Thread Peter
On Wed, Apr 7, 2010 at 12:01 PM, Peter Rice  wrote:
> Sounds like a reasonable request. We will look into it.
> ...
> Should be possible with plplot. We will look into adding PDF to the possible
> output devices.

Great.

Thanks,

Peter


[EMBOSS] EMBOSS eprimer3 and latest primer3_core

2010-04-13 Thread Peter
Hello EMBOSS team,

I'm using EMBOSS 6.2.0 on Mac OS X 10.6.3 Snow Leopard:

$ embossversion
Reports the current EMBOSS version number
6.2.0

I need to design some primers so I wanted to try the EMBOSS tool
eprimer3, which as your documentation clearly explains requires me to
install the 'primer3' program from the Whitehead Institute
(specifically the primer3_core tool).

I downloaded and compiled the latest version of primer3, version 2.2.2
beta (using the default, i.e. just "make", which seems to be fine -
the Snow Leopard specific Makefile failed). It seems that EMBOSS
eprimer3 does not like this:

$ export EMBOSS_PRIMER3_CORE="/Users/xxx/Downloads/Software/primer3-2.2.2-beta/src/primer3_core"
$ eprimer3 fasta::lupine.nu lupine.eprimer3
Picks PCR primers and hybridization oligos
Error: Missing SEQUENCE tag

Instead, I downloaded and compiled primer3 version 1.1.4 (using the
defaults, i.e. just "make", there is no Snow Leopard specific Makefile
included) and that seems to work:

$ export EMBOSS_PRIMER3_CORE="/Users/xxx/Downloads/Software/primer3-1.1.4/src/primer3_core"
$ eprimer3 fasta::lupine.nu lupine.eprimer3
Picks PCR primers and hybridization oligos

The eprimer3 output looks sensible too.

My guess is that something in the recent primer3 alpha and beta
releases of 2.x.x has changed since version 1.x.x and that EMBOSS
needs to be updated to cope. Is this a known issue?

Thanks,

Peter C.


[EMBOSS] EMBOSS eprimer3 and ambiguous DNA

2010-04-13 Thread Peter
Hello again,

I just ran eprimer3 on a multiple FASTA file (using published genome
sequences), and noticed a couple of messages:

"Error: Unrecognized base in input sequence"

Additionally, for two of the sequences there were no primer pairs (just
some blank lines instead). These appear to correspond to two of the
sequences in my input which had IUPAC ambiguous characters in the
sequence (e.g. R, W, Y, N). The eprimer3 documentation does say
explicitly that for some input files such characters are converted into
N (options -mispriminglibraryfile and -mishyblibraryfile).

What is supposed to happen if a sequence in the main input file has
such characters?

I would expect to still get back a candidate set of primers (even if they
do not cover the regions with ambiguous letters).

As an experiment I added an N character to the end of an unambiguous
sequence, and eprimer3 seemed happy. So, as a work around I've simply
replaced all the ambiguous characters (like R, W and Y) with N, and it
seems to work. Maybe eprimer3 could do this for me, or at least have
this limitation mentioned in the documentation?
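
For anyone wanting to script the same workaround, a minimal Python sketch
might look like this. The function name is hypothetical, and it assumes
plain nucleotide strings without gaps or whitespace; it is only the
pre-processing step, not anything eprimer3 does itself.

```python
import re


def mask_ambiguous(seq):
    """Replace every character other than A, C, G, T or N with N.

    Unambiguous bases keep their case; masked positions become upper case N.
    """
    return re.sub(r"[^ACGTNacgtn]", "N", seq)
```

Note this throws away information (R becomes the far vaguer N), so it is
only sensible when the aim is simply to get primers from the unambiguous
regions.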

Thanks,

Peter C.


Re: [EMBOSS] ABI to FASTQ with seqret

2010-04-22 Thread Peter
On Tue, Mar 30, 2010 at 2:56 PM, Peter  wrote:
> On Tue, Mar 30, 2010 at 2:25 PM, Peter Rice  wrote:
>>
>> On 30/03/2010 14:13, Peter Rice wrote:
>>
>>> Where do I look to find scores that we can use (and how do we convert
>>> those to phred quality scores)?
>>
>> Aha, found something. The field is called PCON (confidence values), with
>> values 0-255.
>>
>> There is a possibility that these could be phred scores, but I suspect they
>> are whatever the basecaller has decided to write there.
>>
>> http://www.appliedbiosystems.com/support/software_community/ABIF_File_Format.pdf
>>
>> Peter R.
>
> Hmm. Good question - I don't know, although if they are PHRED scores
> they could go unusually high (we'd expect say 0 to 50 for a raw read).
> It could be some other encoding (e.g. scaled from 0 for a poor base to
> 255 for a perfect base). Do you have any contacts at Applied Biosystems
> to ask?
>
> Peter C.
>

Hello again Peter R (& everyone else at EMBOSS),

Did you manage to find out if the PCON confidence values in ABI files
are PHRED quality scores or not?

Regards,

Peter C.


Re: [EMBOSS] ABI to FASTQ with seqret

2010-04-22 Thread Peter
On Thu, Apr 22, 2010 at 4:22 PM, Peter Rice  wrote:
>
> On 22/04/2010 16:06, Peter C. wrote:
>
>> Hello again Peter R (&  everyone else at EMBOSS),
>>
>> Did you manage to find out if the PCON confidence values in ABI files
>> are PHRED quality scores or not?
>
> Yes ... and maybe.
>
> The first scores are written by the ABI basecaller.
>
> A second set can be written by any basecaller. These may be phred quality
> scores but could in theory be anything.
>
> EMBOSS will assume they are phred scores as there is no way to tell
> otherwise.
>
> regards,
>
> Peter Rice

Does this mean there is an updated seqret in a public repository where I
can convert an ABI file to FASTQ taking the ABI basecaller's sequence
and PHRED scores? I'd be interested to test that... or a patch against
EMBOSS 6.2.0.

Thanks,

Peter C.



Re: [EMBOSS] ABI to FASTQ with seqret

2010-04-26 Thread Peter
On Thu, Apr 22, 2010 at 6:01 PM, Peter Rice  wrote:
>> Does this mean there is an updated seqret in a public repository where I
>> can convert an ABI file to FASTQ taking the ABI basecaller's sequence
>> and PHRED scores? I'd be interested to test that... or a patch against
>> EMBOSS 6.2.0.
>
> It is in the latest CVS code and will appear in the July release.

Thanks Peter,

I tried to grab this from the anonymous CVS mirror as per the EMBOSS
documentation here:
http://emboss.sourceforge.net/developers/cvs.html

Unfortunately it failed:

$ cvs -d :pserver:c...@cvs.open-bio.org:/home/repository/emboss login
Logging in to :pserver:c...@cvs.open-bio.org:2401/home/repository/emboss
CVS password:
cvs login: authorization failed: server cvs.open-bio.org rejected
access to /home/repository/emboss for user cvs

I know there have been VM problems on this machine (also known as
code.open-bio.org) which have intermittently been affecting the
anonymous SVN access for other projects like BioPerl.

One short term solution would be to give my OBF username peterc
access to the master Emboss CVS repository on dev.open-bio.org
(joke), or look into an external mirror - for example BioPerl are using
github (and seriously talking about moving from SVN to git). This is
going even more off topic but since ViewCVS broke a while back, I've
found it much harder to browse the Emboss source code :(

Regards,

Peter C.


Re: [EMBOSS] Tranalign relaxation?

2010-05-31 Thread Peter
On Wed, May 26, 2010 at 7:50 PM, Justin Havird  wrote:
>
> Hi,
>
> I am trying to align nucleic acid sequences based on amino acid alignments
> using the program tranalign. The program normally works fine for me, but
> lately I have been using mitochondrial genes and am beginning to run into
> problems.
>
> These occur when the nucleotide sequence does not match the amino acid
> translation exactly. For example, in the prawn M. japonicus, the first amino
> acid (MET) in the COX1 gene is encoded by the codon "ACG" rather than the
> typical "ATG". Tranalign doesn't recognize ACG as encoding MET, so it throws
> up this message:
>
> Error: Guide protein sequence M. japonicus not found in nucleic sequence M.
> japonicus
>
> These errors occur on a taxa by taxa basis and are usually because of the
> first codon. However, the error also occurs when the nucleotide sequence has
> an ambiguous nucleotide (e.g., Y), even if the ambiguous nucleotide position
> doesn't affect the translation (e.g., both GTC and GTT = VAL). I can usually
> pinpoint the error to a specific nucleotide/codon like in these examples.
>
> These errors are relatively rare, but happen more frequently in some groups
> (inverts and fishes mostly).
>
> So, does anyone know a way to "relax" the tranalign translation rules to
> circumvent this problem? Or have another program/solution?

Hi Justin,

This might be a silly question, but have you used the tranalign argument
-table to specify which genetic code table to use? I'd guess you probably
want the Vertebrate Mitochondrial Code instead of the Standard Code.

Peter C.


[EMBOSS] Counting the number of sequences in a file

2010-07-20 Thread Peter
Hi all,

Is there a tool in EMBOSS to just count the number of sequences in a file?
For simple file formats like FASTA or GenBank I'd typically just use grep:

$ grep -c "^LOCUS " gbvrt1.seq
31065

However, this becomes more complicated for general file formats (e.g. FASTQ
files, where in addition to identifiers the quality lines can also start
with @) or binary files like BAM, which EMBOSS now supports.

Right now I could handle this by using seqret to convert the file into FASTA
and then pipe that through grep to count the records. But an EMBOSS tool
would be more elegant, e.g.

$ countseq -sformat=genbank gbvrt1.seq
31065

For the implementation you might offer the choice between using the normal
EMBOSS parsing (as in seqret) versus file format specific regular expression
searches which just look for marker lines (without checking validity) which
should be really fast.
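
To illustrate why a plain grep for @ is not enough for FASTQ, here is a
minimal Python sketch of a record counter that tracks the sequence length
so that quality lines starting with @ are not mistaken for headers. The
function name is hypothetical and this is not how EMBOSS parses FASTQ; it
assumes a text handle and tolerates line-wrapped records.

```python
def count_fastq(handle):
    """Count FASTQ records, allowing line-wrapped sequence and quality."""
    count = 0
    line = handle.readline()
    while line:
        if not line.strip():
            # Skip blank lines between records.
            line = handle.readline()
            continue
        if not line.startswith("@"):
            raise ValueError("expected a header line starting with '@'")
        # Sequence lines run until the '+' separator line.
        seq_len = 0
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_len += len(line.strip())
            line = handle.readline()
        if not line:
            raise ValueError("truncated record: missing '+' line")
        # Quality lines run until as many characters as the sequence
        # have been read, so a quality line starting with '@' is safe.
        qual_len = 0
        while qual_len < seq_len:
            line = handle.readline()
            if not line:
                raise ValueError("truncated record: quality too short")
            qual_len += len(line.strip())
        if qual_len != seq_len:
            raise ValueError("quality length does not match sequence")
        count += 1
        line = handle.readline()
    return count
```

On a small file with two records where one quality line begins with @,
counting lines that start with @ would over-count, while this returns 2.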

Regards,

Peter C.


Re: [EMBOSS] Counting the number of sequences in a file

2010-07-20 Thread Peter
On Tue, Jul 20, 2010 at 6:04 PM, Peter Rice  wrote:
>
> On 20/07/10 17:27, Peter C. wrote:
>> $ countseq -sformat=genbank gbvrt1.seq
>> 31065
>
> Of course, you could just use:
>
> $ seqret -filter -sformat=genbank gbvrt1.seq | grep -c '^>'
> 31065
>
> :-)
>

Exactly what I had in mind as the work around ("handle this by
using seqret to convert the file into FASTA and then pipe that
through grep to count the records"), although I'd not thought
about the fact that FASTA is the default output format which
keeps it nice and short. The (Unix) command line can be great :)

Peter C


Re: [EMBOSS] ABI to FASTQ with seqret

2010-07-22 Thread Peter
On Thu, Apr 22, 2010 at 6:01 PM, Peter Rice  wrote:
>
> On 22/04/2010 16:48, Peter Cock wrote:
>
>> Does this mean there is an updated seqret in a public repository where I
>> can convert an ABI file to FASTQ taking the ABI basecaller's sequence
>> and PHRED scores? I'd be interested to test that... or a patch against
>> EMBOSS 6.2.0.
>
> It is in the latest CVS code and will appear in the July release.
>

Hi Peter R et al,

I've just compiled and installed EMBOSS 6.3.1 on Mac OS X, and had a
go converting some ABI (extension .ab1) files from our in house sequencing
service to FASTQ - so far all the examples give Sanger FASTQ quality strings
of "!" (ASCII 33, PHRED quality zero) or Illumina FASTQ quality strings of
"@" (ASCII 64, again PHRED quality zero).

I remember you saying ABI files can have two sets of quality scores,
so perhaps my files have one set all of PHRED zero?
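
For reference, the two encodings mentioned above differ only in the ASCII
offset. A minimal Python sketch (function names are my own, for
illustration) of how a PHRED score maps to a FASTQ quality character:

```python
def phred_to_char(score, offset=33):
    """Encode one PHRED score as a FASTQ quality character.

    offset=33 is the Sanger convention; offset=64 is the Illumina 1.3+ one.
    """
    return chr(score + offset)


def encode_qualities(scores, offset=33):
    """Encode a list of PHRED scores as a FASTQ quality string."""
    return "".join(phred_to_char(q, offset) for q in scores)
```

So PHRED zero is "!" with offset 33 and "@" with offset 64, which matches
what seqret is writing for my files.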

I tried to find some 3rd party example files via Google, for example on
http://www.elimbio.com/sequencing_sample_files.htm they have a zip
file http://www.elimbio.com/Forms/pGEM.zip containing one ABI file.
The output of this is more interesting:

$ seqret -sformat abi -osformat fastq  -auto -stdout -sequence
pGEM_\(ABI\)_A01.ab1
@pGEM_(ABI)
NANTCTATAGGCGAATTCGAGCTCGGTA...GNN
+
"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"...!"!"!"

I truncated this for brevity. Here the quality string repeats ASCII 34, ASCII 33
(PHRED quality 1, quality 0) which is rather strange. The sequence appears
to agree with the provided file pGEM_(ABI)_A01.seq

Have I just been unlucky with the AB1 files that I have looked at? Thus
far all the quality scores seem meaningless.

Peter C.


Re: [EMBOSS] ABI to FASTQ with seqret

2010-07-22 Thread Peter
On Thu, Jul 22, 2010 at 12:16 PM, Peter  wrote:
> On Thu, Apr 22, 2010 at 6:01 PM, Peter Rice  wrote:
>>
>> On 22/04/2010 16:48, Peter Cock wrote:
>>
>>> Does this mean there is an updated seqret in a public repository where I
>>> can convert an ABI file to FASTQ taking the ABI basecaller's sequence
>>> and PHRED scores? I'd be interested to test that... or a patch against
>>> EMBOSS 6.2.0.
>>
>> It is in the latest CVS code and will appear in the July release.
>>
>
> Hi Peter R et al,
>
> I've just compiled and installed EMBOSS 6.3.1 on Mac OS X, and had a
> go converting some ABI (extension .ab1) files from our in house sequencing
> service to FASTQ - so far all the examples give Sanger FASTQ quality strings
> of "!" (ASCII 33, PHRED quality zero) or Illumina FASTQ quality strings of
> "@" (ASCII 64, again PHRED quality zero).
>
> I remember you saying ABI files can have two sets of quality scores,
> so perhaps my files have one set all of PHRED zero?
>
> I tried to find some 3rd party example files via Google, for example on
> http://www.elimbio.com/sequencing_sample_files.htm they have a zip
> file http://www.elimbio.com/Forms/pGEM.zip containing one ABI file.
> The output of this is more interesting:
>
> $ seqret -sformat abi -osformat fastq  -auto -stdout -sequence
> pGEM_\(ABI\)_A01.ab1
> @pGEM_(ABI)
> NANTCTATAGGCGAATTCGAGCTCGGTA...GNN
> +
> "!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"!"...!"!"!"
>
> I truncated this for brevity. Here the quality string repeats ASCII 34, ASCII 33
> (PHRED quality 1, quality 0) which is rather strange. The sequence appears
> to agree with the provided file pGEM_(ABI)_A01.seq
>
> Have I just been unlucky with the AB1 files that I have looked at? Thus
> far all the quality scores seem meaningless.

I went back through my old emails, and see you had been testing with
http://www.appliedbiosystems.com/support/software_community/ab1_files.zip
(I had trouble downloading this with curl - Firefox worked). Looking at these
ABI files with seqret as FASTQ does seem to give meaningful quality scores.
Curious.

Peter C.



Re: [EMBOSS] transeq and ambiguous codons

2010-07-22 Thread Peter
Hi again,

Now that I have installed the latest and greatest version, EMBOSS 6.3.1,
I'm revisiting some old issues I had with EMBOSS. In this case  'unambiguous
ambiguous codons' and other translation issues.

On Fri, Jul 10, 2009 at 10:14 AM, Peter C. wrote:
> On Thu, Jul 9, 2009 at 10:08 AM, Peter Rice wrote:
>>
>> Peter C. wrote:
>>> However, consider the codon TRR. R means A or G, so this can mean TAA,
>>> TGA, TAG or TGG which translate to stop or W (both EMBOSS and the NCBI
>>> standard table agree here). Therefore the translation of TRR should be
>>> "* or W", which I would expect based on the above examples to result
>>> in "X". But instead EMBOSS transeq gives "*":
>>
>> This is a side effect of the way backtranslation works...
>
> OK, leaving TRR aside for the moment (I'm not sure I'd have done it that
> way, but I think I follow your logic), I have some more problem cases for
> you to consider (all using the default standard NCBI table 1).
>
> Most of these are 'unambiguous ambiguous codons' as you put it, and
> I would agree using X when a more specific letter is possible isn't ideal
> but isn't actually wrong. The "ATS" and related codons (see below)
> however are simply wrong.
>
> --
>
> TRA means TAA or TGA, which are both stop codons. Therefore TRA
> should translate as a stop, not as an X:
>
> $ transeq asis:TAATGATRA -stdout -auto -osformat raw
> **X

Same on EMBOSS 6.3.1, shouldn't TRA translate as stop?

> --
>
> Now look at YTA, which means CTA or TTA which encode L, so
> YTA should be L not X:
>
> $ transeq asis:CTATTAYTA -stdout -auto -osformat raw
> LLX

Same on EMBOSS 6.3.1, giving X instead of specific amino acid
(i.e. YTA is an "unambiguous ambiguous codon" for L)

> Likewise for YTG and YTR, and YTN.

I haven't re-checked these.

> --
>
> Another example, ATW means ATA or ATT, which both translate as I,
> so ATW should translate as I not X:
>
> $ transeq asis:ATAATTATW -stdout -auto -osformat raw
> IIX

Same on EMBOSS 6.3.1, giving X instead of specific amino acid
(i.e. ATW is an "unambiguous ambiguous codon" for I)

> --
>
> Conversely, ATS which means ATC or ATG which translate as I and M.
> Remember S means G or C. Therefore ATS should translate as X, and
> not I:
>
> $ transeq asis:ATCATGATS -stdout -auto -osformat raw
> IMI

Same on EMBOSS 6.3.1, giving potentially wrong amino acid instead of X.

> Likewise H means A, G or C, so ATH shows the same bug, as do some
> other AT* codons:
>
> $ transeq asis:ATAATCATGATH -stdout -auto -osformat raw
> IIMI
>
> [*** This one strikes me as a clear bug ***]

Same on EMBOSS 6.3.1, giving potentially wrong amino acid instead of X.

As I noted before, this list is only partial, and only for the standard table.
I could compile a much longer list of oddities using the Biopython
translation as a reference if you wanted.
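
To make the expected behaviour concrete, here is a minimal Python sketch of
the rule I have in mind: expand each IUPAC letter to the bases it stands
for, translate every resulting codon with the standard table, and only
report a specific letter if they all agree. The names are my own and this
is not the EMBOSS or Biopython implementation, just the idea.

```python
from itertools import product

# NCBI standard table 1, laid out in the usual T, C, A, G base order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# IUPAC nucleotide codes and the bases each one can represent.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def translate_codon(codon):
    """Translate one codon, returning X unless every expansion agrees."""
    options = {CODON_TABLE["".join(p)]
               for p in product(*(IUPAC[base] for base in codon.upper()))}
    return options.pop() if len(options) == 1 else "X"
```

With this rule TRA gives "*", YTA gives "L", ATW gives "I", and both ATS
and TRR give "X". Note that under the IUPAC code H stands for A, C or T, so
this sketch resolves ATH to "I"; it is codons such as ATS or ATV, whose
expansions include ATG, that come out as "X".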

Regards,

Peter C.


Re: [EMBOSS] ABI to FASTQ with seqret

2010-07-22 Thread Peter
On Thu, Jul 22, 2010 at 1:28 PM, Peter Rice  wrote:
>
> On 22/07/10 12:22, Peter C. wrote:
>
>>> I truncated this for brevity. Here the quality string repeats ASCII 34, ASCII 33
>>> (PHRED quality 1, quality 0) which is rather strange. The sequence appears
>>> to agree with the provided file pGEM_(ABI)_A01.seq
>>>
>>> Have I just been unlucky with the AB1 files that I have looked at? Thus
>>> far all the quality scores seem meaningless.
>
> There are two sets of quality scores in that file. Both are the
> alternating characters 1 and 0. Adding 33 gives the scores you see.
>
> Looks as though EMBOSS is just reporting what it finds.
>
> The file offset is the value returned by function
> ajSeqABIGetConfidOffset. It simply reads one byte from there for each
> base of sequence length.

Looks like that particular random example from the internet was just odd.
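
For checking other files quickly, a minimal Python sketch that turns a
Sanger FASTQ quality string back into numbers and flags reads whose scores
never rise above a tiny value. The function names and the threshold of 2
are arbitrary assumptions for illustration, not anything EMBOSS does.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string back to a list of PHRED scores."""
    return [ord(ch) - offset for ch in quality_string]


def looks_uninformative(scores, threshold=2):
    """Heuristic: True if no score reaches the (arbitrary) threshold."""
    return max(scores, default=0) < threshold
```

The alternating string from the pGEM example decodes to 1, 0, 1, 0, ...
which this heuristic would flag, as it would my all-zero files.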

>> I went back through my old emails, and see you had been testing with
>> http://www.appliedbiosystems.com/support/software_community/ab1_files.zip
>> (I had trouble downloading this with curl - Firefox worked). Looking at these
>> ABI files with seqret as FASTQ does seem to give meaningful quality scores.
>> Curious.
>
> It should look for a PCON tag in the file and pick up the second of two,
> or the first if there is only one.
>
> Can anyone on the list enlighten us further on what is intended for the
> quality scores in these example files?

The pGEM example I have no idea about - I just found it with Google.

I can send you a couple of our locally produced AB1 files off list
if you wouldn't mind having a look at them. It may be that however
these are being generated there simply are no useful scores inside.

Peter


Re: [EMBOSS] ABI to FASTQ with seqret

2010-07-22 Thread Peter
On Thu, Jul 22, 2010 at 5:33 PM, Tom Keller  wrote:
> Greetings,
> The latest versions of the ABI basecaller do indeed give quality scores.

I suspect the problem is my ABI files were not created using the latest ABI
basecaller then. Do you have any more details (e.g. which version)?

I've sent a couple of *.ab1 files off list to Peter Rice to confirm they really
don't have quality scores.

Tomorrow I will try and find out who to contact locally about the base calling,
and what version of the base caller they have.

Peter


Re: [EMBOSS] emboss stand-in for fasta

2010-07-24 Thread Peter
On Sat, Jul 24, 2010 at 11:56 AM, Ingo P. Korndoerfer
 wrote:
> could anybody help me out with what to use as a stand-in for fasta ?
>
> fasta by itself is fine, but under windows there is no way to make fasta
> accept filenames with spaces.  neither "" nor """" nor '' seem to alleviate
> the problem.

You are talking about Bill Pearson's FASTA command line tools, right?
Have you tried wrapping the filename with double quote characters,
"like this.fasta", which usually works on Windows. If not, I'd also try
escaping with a backslash, "like\ this.fasta", just in case.

> so i was hoping emboss would have something (which would also save
> me having to install fasta on all of our pcs).
>
> what i need to do is run a sequence against an in house library and
> return me the top hit in alignment.

Sounds like BLAST might be a sensible choice to me - it works fine
on Windows, although I'm not sure about filenames with spaces.

Personally I avoid filenames with spaces - they just cause trouble.
Can't you rename things before calling FASTA? e.g. Write a wrapper
script for FASTA to turn spaces into underscores?
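
As a minimal sketch of the copying half of such a wrapper (Python, with a
hypothetical function name), the input could be copied to a name without
spaces before the search tool is called on the copy:

```python
import os
import shutil
import tempfile


def copy_without_spaces(path, directory=None):
    """Copy path to a file whose name has underscores instead of spaces.

    Returns the path of the copy; the caller should delete it afterwards.
    If no directory is given, a new temporary directory is created.
    """
    directory = directory or tempfile.mkdtemp()
    safe_name = os.path.basename(path).replace(" ", "_")
    target = os.path.join(directory, safe_name)
    shutil.copyfile(path, target)
    return target
```

This only fixes the file name itself. If the temporary directory path
contains spaces (as can happen with some Windows user names), a directory
without spaces would need to be passed in explicitly.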

Peter



Re: [EMBOSS] Jemboss/EMBOSS can't find the external clustalw binary

2010-08-11 Thread Peter
On Wed, Aug 11, 2010 at 2:31 PM, Nigel Binns  wrote:
>
>  Hi All,
>
> Please can anyone tell me why my installation of Jemboss (EMBOSS v6.3.1
> patch v1-4) can't find the external clustal binary and how to correct this.
> When I run a multiple sequence alignment using emma, I get the following
> output:
>
> Died: emma uses external program 'clustalw' which is not in the PATH or
> defined as EMBOSS_CLUSTALW
>
> I can confirm that my jemboss.properties file
> ($EMBOSS_ROOT/share/EMBOSS/jemboss/resource/jemboss.properties) correctly
> points to the root directories of the clustalw and primer3 binaries e.g.:
>
> embossPath=/path/to/clustal:/path/to/primer3

Have you got clustalw 1.x or 2.x installed? The binary names differ,
clustalw.exe versus clustalw2.exe (no extension on Unix/Linux), and
perhaps EMBOSS only expects the former?

Have you tried setting the EMBOSS_CLUSTALW variable?

Peter



Re: [EMBOSS] Jemboss/EMBOSS can't find the external clustalw binary

2010-08-12 Thread Peter
On Thu, Aug 12, 2010 at 9:17 AM, Nigel Binns  wrote:
>
>  Hi Peter,
>
> Many thanks for your reply. I have ClustalW v2 installed (v2.0.12 - the
> latest release). The binary is named clustalw2. However, as I understand it,
> when running the Jemboss installation script, you are asked to provide the
> root directory that contains the clustalw binary rather than the name of the
> actual binary i.e /path/to/clustal/root/ rather than
> /path/to/clustal/root/clustalw2 or have I got that wrong?

I was suggesting the problem could be EMBOSS only looks for clustalw
and not clustalw2.

> The same issue applies to my installation of Primer3 ( latest release -
> 3-2.2.2-beta). The binary name is primer3_core. I get this error when I try
> to run eprimer3.
>
> Error   application terminated
>
>    Died: eprimer3 uses external program 'primer3_core' which is not in the
> PATH or defined as EMBOSS_PRIMER3_CORE
>         Part of the 'primer3' package, version 3.0, available from the
>         Whitehead Institute. See: http://primer3.sourceforge.net/
>
> Please can you tell me what file I should set the EMBOSS_CLUSTALW and
> EMBOSS_PRIMER3_CORE variables in.

They are just environment variables (set up in the OS), but I haven't
ever used Jemboss so don't know how it would handle this.
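
If you end up launching the EMBOSS tools from a script rather than from
Jemboss, one option is to set the variables only for the child process.
A minimal Python sketch follows; the function name is hypothetical, and I
am assuming (by analogy with how EMBOSS_PRIMER3_CORE worked for me on the
command line) that each variable holds the full path to the binary.

```python
import os


def emboss_environment(clustalw=None, primer3_core=None, base=None):
    """Return a copy of the environment with EMBOSS tool paths set.

    base defaults to the current process environment; the two path
    arguments are assumed to be full paths to the external binaries.
    """
    env = dict(os.environ if base is None else base)
    if clustalw:
        env["EMBOSS_CLUSTALW"] = clustalw
    if primer3_core:
        env["EMBOSS_PRIMER3_CORE"] = primer3_core
    return env
```

The returned dictionary can be passed as the env argument of
subprocess.run when calling emma or eprimer3.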

>
> Many thanks for your help.
>
> Nigel

Peter C.



[EMBOSS] Keep feature in union of GenBank files

2010-11-15 Thread Peter
Hi all,

Prompted by this thread on seqanswers.com I tried using EMBOSS 6.3.1
union to merge multiple GenBank format records (in a single file) into a
single GenBank record with the concatenated sequence. This worked,
but the output file has no features:

http://seqanswers.com/forums/showthread.php?t=7812

e.g.

union -sequence many.gbk -sformat genbank -outseq merged.gbk -osformat
genbank -auto

Is support for features something that could be added to union please?

Thanks,

Peter


Re: [EMBOSS] Keep feature in union of GenBank files

2010-11-15 Thread Peter
On Mon, Nov 15, 2010 at 10:57 AM, Peter  wrote:
> Hi all,
>
> Prompted by this thread on seqanswers.com I tried using EMBOSS 6.3.1
> union to merge multiple GenBank format records (in a single file) into a
> single GenBank record with the concatenated sequence. This worked,
> but the output file has no features:
>
> http://seqanswers.com/forums/showthread.php?t=7812
>
> e.g.
>
> union -sequence many.gbk -sformat genbank -outseq merged.gbk -osformat
> genbank -auto
>
> Is support for features something that could be added to union please?
>

Thanks to Nick Loman on the seqanswers.com thread for pointing out that this
functionality is present but must be enabled explicitly with "-feature Y".

Apologies for the noise.

Peter

P.S. Why isn't this the default?


Re: [EMBOSS] Transeq question, frame phases

2011-02-17 Thread Peter
On Wed, Feb 16, 2011 at 8:54 PM, David Mathog  wrote:
> Test case fasta file
>>8Achars
> 
>
> all 6 frames for transeq, standard mode emits:
>>_1
> KKX
>>_2
> KKX
>>_3
> KK
>>_4
> FF
>>_5
> FFX
>>_6
> FFX
>

Note you can do that with a single command line:

$ transeq asis: -filter -frame 6
>asis_1
KKX
>asis_2
KKX
>asis_3
KK
>asis_4
FF
>asis_5
FFX
>asis_6
FFX

Note that while using 1, 2, 3 for the forward frames is well defined, there
are two conventions for the reverse frame - do you start from the left or
the right?

First let's just do the forward frames,

$ transeq asis: -filter -frame 1
>asis_1
KKX
$ transeq asis: -filter -frame 2
>asis_2
KKX
$ transeq asis: -filter -frame 3
>asis_3
KK

Are you happy with them?

Now let's do that with the reverse complement strand:

$ transeq asis: -filter -frame 1
>asis_1
FFX
$ transeq asis: -filter -frame 2
>asis_2
FFX
$ transeq asis: -filter -frame 3
>asis_3
FF

Now let's do that with the original sequence but the negative frames:

$ transeq asis: -filter -frame -3
>asis_6
FFX
$ transeq asis: -filter -frame -2
>asis_5
FFX
$ transeq asis: -filter -frame -1
>asis_4
FF

Same results - perhaps the naming isn't as you expected?

Peter


Re: [EMBOSS] Transeq question, frame phases

2011-02-17 Thread Peter
On Thu, Feb 17, 2011 at 4:30 PM, David Mathog  wrote:
>
>
>> Now let's do that with the reverse complement strand:
>>
>> $ transeq asis: -filter -frame 1
>> >asis_1
>> FFX

This is what I think that does (forward frames are easy):

Frame 1, so starts at first base:
Letters 123, codon TTT, gives F
Letters 456, codon TTT, gives F
Letters 78, partial codon TT-, gives X

>> $ transeq asis: -filter -frame 2
>> >asis_2
>> FFX

Frame 2, so starts at second base:
Letter 1, just T, ignored
Letters 234, codon TTT, gives F
Letters 567, codon TTT, gives F
Letters 8, partial codon T--, gives X

>> $ transeq asis: -filter -frame 3
>> >asis_3
>> FF

Frame 3, so starts at third base:
Letters 12, bases TT, ignored
Letters 345, codon TTT, gives F
Letters 678, codon TTT, gives F
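
The same walk-through as a minimal Python sketch, assuming the reverse
complement strand is eight T characters (my assumption, based on the codons
listed above). The function name is hypothetical; it only splits the
sequence into codons and does not translate them.

```python
def split_frame(seq, frame):
    """Split seq into codons for forward frame 1, 2 or 3.

    Leading bases before the frame start are skipped, and the last
    item may be a partial codon of one or two bases.
    """
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    start = frame - 1
    return [seq[i:i + 3] for i in range(start, len(seq), 3)]
```

For eight bases, frames 1 and 2 end in a partial codon (translated as X
above), while frame 3 ends on a complete codon.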


> That is the problem.  Let me try to explain more clearly what the issue is.
>
> That is, if the meaning of the + phases is to define the three codons
> a,b,c as shown in the diagram, such that the forward translation is as
> shown, then the reverse translation should be as shown above in
> expected.  That is, it is the translation of the exact same set of
> codons done individually, but for the - strand reverse complement the
> codon first, and then invert the resulting translated sequence.  That
> way the X, where it occurs is attached to the same partial codon "c".

I couldn't understand your diagram - probably font spacing issues in part.

The EMBOSS tool is doing all six frames, so maybe all you need is to work out
the mapping between its naming and yours.

Note that it can make sense to translate a trailing partial codon, e.g.
TC... could be TCA, TCC, TCG or TCT which all code for S:

$ transeq asis:TCN -filter
>asis_1
S
$ transeq asis:TC -filter
>asis_1
S

Peter



Re: [EMBOSS] Problem indexing PDB fasta file

2006-04-10 Thread Peter Rice
Enrique de Andres Saiz wrote:
> I have been looking at the PDB fasta file and I see that, for the previous
> warning, there is an entry whose id is '1FNT_A' and another one whose
> id is '1FNT_a'. This makes me think that EMBOSS is
> case-insensitive. Is this true? Is there any way to distinguish between
> the two ids?

Yes, EMBOSS is case-insensitive. So is the Staden/EMBLCD indexing standard 
that dbifasta uses.

The standard also only allows one entry with each ID.

dbxfasta uses a new indexing format and can index both entries, but will still 
assume the names are the same (a search for 1FNT_A or 1FNT_a will return both 
entries). Allowing indexing to be case-sensitive is possible in future, but 
can slow down searches. We will investigate.
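
To see which entries in a file would clash under case-insensitive
matching before indexing, a minimal Python sketch such as the following
could be run over the list of IDs. The function name is hypothetical and
this is not part of dbifasta or dbxfasta.

```python
from collections import defaultdict


def case_collisions(ids):
    """Group identifiers that differ only by letter case.

    Returns a dict mapping the upper-cased ID to the original IDs,
    keeping only groups with more than one member.
    """
    groups = defaultdict(list)
    for identifier in ids:
        groups[identifier.upper()].append(identifier)
    return {key: names for key, names in groups.items() if len(names) > 1}
```

For the PDB example this would report 1FNT_A and 1FNT_a as one clashing
group.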

Hope that helps,

Peter



Re: [EMBOSS] dbifasta index file format

2006-04-10 Thread Peter Rice

Graziano P. wrote:

hello EMBOSS users,
I have some databases in fasta format (ncbi | format)
and I want to index them using dbifasta, then I want
to access the index files using a program that will be
developed by a computer scientist of my group.
I need to index the databases by accession number,
ginumber and description. I have read in the dbifasta
help info about the structure of the index files when
the databases were indexed by accession number, but I
have not found info about the structure of the index
files when the databases are indexed by description.
Anyone knows where I can find detailed information
about the structure of the index files?


Ciao Graziano,

The dbifasta index files use the same format as the Staden package, the old 
EMBL CD-ROM distribution, and Erik Sonnhammer's "efetch" utility.


They were documented in some old Staden documentation and papers.

They are also documented in the EMBOSS distribution under doc/manuals/ in file 
internals-indexing.txt (see attached). I see that this document was written 
before we indexed the descriptions!!!


The description (title) indexing is the same as the accession number indexing. 
The files are called des.hit and des.trg. dbifasta has a -maxindex option to 
limit the size of the longest words indexed (the index files have a value for 
the maximum record length).


We also have a script in the distribution scripts/dbilist.pl which can list 
the contents of the description index (in the database index directory, run it 
as dbilist.pl des)


The new dbxfasta index files are very different. For very large databases we 
recommend dbxfasta. For smaller databases dbifasta is fine and we will 
continue to support it.


Hope that helps. If you need more details, just ask.

regards,

Peter


EMBOSS database indexing

The main index format is named EMBLCD after its use in the CD-ROM
distribution of the EMBL database. It is basically the Staden format,
but we used an alternative name to allow some freedom to extend
it. The intention was to keep compatibility with the Staden
package. EMBOSS comes close to this, but no site seems to depend on
using a common set of indices in both packages and there is no test
plan so some small differences probably break this for now.

All index files have a header block of 300 bytes. The first 44 bytes contain:
int4 filesize
int4 record count
int2 record size
ch20 database name
ch10 database release
int4 date

This is followed, for no apparent reason, by 256 bytes of padding
which EMBOSS fills with spaces. There is room here for any additional
data EMBOSS may need.
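
As a rough illustration of the header layout above, the 44 bytes can be unpacked with Python's struct module. The byte order (little-endian here) and the treatment of the date as a single 4-byte integer are assumptions for this sketch; the text above only gives the field sizes. The function name parse_header is made up.

```python
import struct

HEADER_SIZE = 300  # 44 bytes of fields + 256 bytes of padding

# int4 filesize, int4 record count, int2 record size,
# ch20 database name, ch10 database release, int4 date.
# "<" (little-endian, no alignment padding) is an assumption.
HEADER_FIELDS = struct.Struct("<IIH20s10sI")


def parse_header(block: bytes) -> dict:
    """Decode the fixed fields at the start of an index file header."""
    if len(block) < HEADER_SIZE:
        raise ValueError("header block must be 300 bytes")
    filesize, count, recsize, name, release, date = HEADER_FIELDS.unpack_from(block)
    return {
        "filesize": filesize,
        "record_count": count,
        "record_size": recsize,
        "dbname": name.decode("ascii").rstrip(" \x00"),
        "release": release.decode("ascii").rstrip(" \x00"),
        "date": date,
    }


# Build a fake header to show the round trip (values are invented).
fake = HEADER_FIELDS.pack(1000, 7, 100, b"EMBL".ljust(20), b"87".ljust(10), 0)
fake += b" " * 256
print(parse_header(fake))
```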

Note the "record size" header field, used to seek individual records
in the index files. It requires all strings in the index to be padded
to the length of the longest string - not a problem for ID or
accession, but a big problem for a des index. May be worth
investigating a different format which has a separate offset file,
needing only to rename the "X.trg" file to "X.str" and to add
an "X.bin" file which can be easily created from the "X.str"
file with a list of (ajlong) offsets.
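
The cost of fixed-size records can be made concrete with a small sketch. It assumes records are numbered from zero and follow the 300-byte header directly; fixed_bytes stands for whatever numeric fields accompany each string and is a hypothetical parameter, as are the function names.

```python
def record_offset(index: int, record_size: int, header_size: int = 300) -> int:
    """Byte position of record number `index` (zero-based, an assumption)."""
    return header_size + index * record_size


def padded_file_size(words, fixed_bytes: int = 0, header_size: int = 300) -> int:
    """Size of an index file where every string is padded to the longest one."""
    record_size = max(len(w) for w in words) + fixed_bytes
    return header_size + len(words) * record_size


short_words = ["kinase", "ligase", "lyase"]
print(padded_file_size(short_words))
# One long description word inflates every record in the file.
print(padded_file_size(short_words + ["x" * 60]))
```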

For each database there is a "division lookup" file division.lkp which
lists all the data files. Each division (think of EMBL or GenBank) can
have up to 2 files (Staden's format allows for GCG databases, which
use the NBRF format split into REF and SEQ files, as used for many
years by the PIR database).

All entries in the database must have a unique ID, which is stored in
the "entryname.idx" file as the ID string, the file number, and the
offsets in each of the two data files.

Other index files (at present, only the accession numbers) have two
files. The X.trg file lists the known values in sorted order, and
has two numbers for each value: the number of entries in the X.hit file,
and the offset to the first entry in the X.hit file.

The X.hit file has a simple list of offsets (record numbers) in
the entryname.idx file.
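
The two-step lookup described above can be sketched with in-memory lists standing in for the three files. The tuple layouts, zero-based record numbers and the function name lookup are assumptions made for illustration only; the real files are binary with fixed-size records.

```python
from bisect import bisect_left


def lookup(value, trg, hit, entries):
    """Find all entries for `value`.

    trg:     sorted list of (value, hit_count, first_hit) rows  (X.trg)
    hit:     list of record numbers into `entries`              (X.hit)
    entries: list of entryname.idx records
    """
    keys = [row[0] for row in trg]
    i = bisect_left(keys, value)  # binary search works because trg is sorted
    if i == len(keys) or keys[i] != value:
        return []
    _, count, first = trg[i]
    return [entries[n] for n in hit[first:first + count]]


# Invented example data: two accessions, one of them shared by two entries.
entries = ["ENTRY_A", "ENTRY_B", "ENTRY_C"]
hit = [0, 2, 1]
trg = [("ACC001", 2, 0), ("ACC002", 1, 2)]
print(lookup("ACC001", trg, hit, entries))
```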

Building these files uses temporary output files with lists of all
values (accessions) and their IDs. These are then sorted by value and
by ID, and compared to the sorted list of IDs to build the index files.

Naturally, a full index of descriptions could be rather large,
especially if long words are allowed as each text string in the
X.trg file must be padded out to the length of the longest string
in the index. The natural solution for EMBOSS would be to limit the
length of an index field for the description index, and possibly to
restrict the maximum number of times a word can appear or at least to
exclude certain common terms. Keywords are less of a problem because
there are a limited number of them.

To add further fields to database indexing, the indexing and query
mechanisms for accession numbers need to be made into discrete
functions, and the simple accession number structures need to be part
of a general data structure for all field

Re: [EMBOSS] Problems with GenBank indexing

2006-04-10 Thread Peter Rice
Natalia Jimenez Lozano wrote:

> I was looking for an explanation to this behaviour and I've found that 
> skipped IDs correspond to CDS from genomic sequences and have this format:
> 
>  >gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]
> MELPDVPVWRRVIVSAFFEALTFNIDIEEERSEIMMKTGAVVSNPRSRVKWDAFLSFQRDTSHNFTDRLY...
>  >gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]
> MSVVLQITKDWVQALLGFLLLSFANISTRTNHKHFPHGSCSSIMAGFWIYMYIYSYLFITLKIIDLTS...

As Jon says, dbxfasta is a solution.

However, that is only a partial solution. The real problem is that these FASTA 
format sequences do indeed have duplicate IDs.

This is protein sequence data, so it is not GenBank - was this GenPept or some 
other database?

GenPept and other databases have been known to report "gb" or "emb" as the 
database for protein sequences!!!

A possible solution is to add a new ID format to dbifasta and dbxfasta that 
uses AAG13419 and AAF79863 as the ID and ignores the AC000348_16 part.
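
To show what such an ID format would extract, here is a minimal Python sketch. It is not dbifasta code; the function name accession_id is made up, and it assumes the "gi|number|db|accession.version|locus" layout seen in the deflines quoted above.

```python
def accession_id(defline: str) -> str:
    """Return the accession (without version) from an NCBI-style defline."""
    first_word = defline.strip().lstrip(">").split()[0]
    fields = first_word.split("|")
    # Expected fields: "gi", GI number, database, accession.version, locus
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError("not a gi|...|db|accession| defline")
    return fields[3].split(".")[0]


line = ">gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]"
print(accession_id(line))
```

With this rule the two entries quoted above get distinct IDs even though both end in AC000348_16.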

Hope this helps,

Peter



Re: [EMBOSS] Fwd: EMBOSS for Windows without Cygwin

2006-04-10 Thread Peter Rice
Duleep Samuel wrote:

> Is the latest EMBOSS version 3.0.0.0 available anywhere as a precompiled
> binary for Windows  XP,  I have tried  compiling  using cygwin and it
> crashed, I loaded EMBOSS for windows which is a port of version 2.10.0,
> loaded Staden Package and made Spin aware of EMBOSS and am working, but
> feel bad that I am _One_ whole release behind, If anyone has a compiled
> binary I can download for testing and report back on useability,
> regards, Samuel, Virologist, India

Staden has support for older versions of EMBOSS. We are trying to update 
Staden to work with EMBOSS 3.0.0 and future releases.

If anyone is using EMBOSS and Staden (especially EMBOSS under the Staden SPIN 
interface) please contact the EMBOSS developers 
([EMAIL PROTECTED]) so we know how many EMBOSS SPIN users there 
are. It helps to set priorities for the work.

regards,

Peter



Re: [EMBOSS] nt-multi-fastA-file

2006-04-12 Thread Peter Rice
Christiane Nerz wrote:
> Hi all,
> 
> I put the gb-file of an whole genome in Artemis.
> Is there a possibility to export a multi-FastA-file with the bases of 
> all ORFs? Example:
> 
>  >ORF_1
> ATGTGTTCGTT
>  >ORF_2
> ATGTTCCCGACCA...
>  >ORF_3
> ATGCCGCAT...
> 
> I know how to get all bases, but only as one complete sequence.
> (That genome is not published yet, so there is no multi-Fasta-file at 
> ncbi or EMBL available)

Yes, the coderet program will do this.

Unfortunately coderet tries to return CDS, mRNA and translations all in 
one file (to be fixed for the next release). You can ask just for the 
CDS with a couple of extra command line options:

coderet -nomrna -notranslation

Give it the filename as input.
The output will be the coding sequences.

With -nocds instead of -notranslation you will get the protein sequences.

If you have any problems parsing the GenBank file let me know.

regards,

Peter Rice


[EMBOSS] EMBOSS Funding News

2006-04-28 Thread Peter Rice
EMBOSS will be funded by the UK Biotechnology and Biological Sciences 
Research Council (BBSRC) for the next 3 years. EBI has issued the 
following press release, also available from:

http://www.ebi.ac.uk/Information/News/pdf/Press25Apr06-small.pdf

The EMBOSS team would like to thank all our users and developers for 
their patience over the past two years.

regards,

Peter Rice
Alan Bleasby
Jon Ison

A brighter future for Europe’s favourite molecular biology software package

New funding for EMBOSS – Europe’s leading suite of molecular biology 
analysis tools – guarantees open access for researchers and software 
developers

Hinxton, 25 April, 2006 – EMBOSS, the European Molecular Biology Open 
Software Suite, has received a vital funding boost from the UK 
Biotechnology and Biological Sciences Research Council (BBSRC) that will 
guarantee its continued maintenance under an open source license for the 
next three years. This ends two years of uncertainty over the future of 
the project.

Until recently, EMBOSS was hosted by the Medical Research Council’s 
Rosalind Franklin Centre for Genomics Research (RFCGR), where it was 
funded jointly by the BBSRC and the Medical Research Council (see ‘notes 
for editors’ for more information on the history of EMBOSS). With the 
announcement in April 2004 of the RFCGR’s closure, the future of EMBOSS 
hung in the balance. The new funding from the BBSRC means that EMBOSS 
co-founders Peter Rice and Alan Bleasby will be able to continue the 
EMBOSS project at the EMBL-EBI for the next three years. EMBOSS will 
remain freely available from emboss.sourceforge.net and anyone who wants 
to develop it further will have access to its source code. ‘We’re 
delighted that the BBSRC has recognized EMBOSS as an important tool for 
molecular biology’ says project leader Peter Rice. ‘The EMBOSS user 
community has been very patient, and it highlights a great benefit of 
open source software that even users in industry have continued to rely 
on EMBOSS despite the uncertainty about its future. This simply could 
not have happened if EMBOSS had been a commercial package under threat.’

EMBOSS provides a powerful package of around 300 applications for 
molecular biology and bioinformatics analysis. Molecular biologists use 
EMBOSS at all stages of their research, from planning experiments to 
analysing results. It also has an application-programming interface 
(API) that enables software developers to write their own EMBOSS 
applications. These can readily be strung together, allowing users to 
create ‘workflows’ that automate complex and time-consuming tasks. 
EMBOSS has also been used in many commercial software developments and 
is included in commercial bioinformatics systems. Its flexibility has 
made it an obvious core component of several data integration and 
bioinformatics infrastructure projects, including myGrid and EMBRACE.

The new funding also provides helpdesk support for EMBOSS’s users. ‘As 
well as helping researchers with limited bioinformatics expertise to 
make the most of EMBOSS, we will be able to provide better support and 
documentation to the estimated 20% of our users who are also software 
developers’, explains Alan Bleasby. ‘We will encourage these experts to 
contribute their code to the project. In return, we will make their 
software widely available through the EMBOSS website and provide ongoing 
user support for it. This mechanism will help to ensure that EMBOSS 
evolves according to the needs of its users.’

Contact:

Cath Brooksbank PhD, EMBL-EBI Scientific Outreach Officer, Hinxton, UK, 
Tel: +44 1223 492 552, www.ebi.ac.uk, [EMAIL PROTECTED]
Anna-Lynn Wegener, EMBL Press Officer, Heidelberg, Germany, Tel: +49 
6221 387 452, www.embl.org, [EMAIL PROTECTED]


Notes for editors – a brief history of EMBOSS

EMBOSS, an open source suite of tools for the analysis of biological 
data, has its origins in the late 1980s when Peter Rice, a co-founder of 
EMBOSS, was working at EMBL. Encouraged by his colleagues in the lab, he 
began to write extensions to the GCG package, which at that time 
provided its source code to users. His efforts evolved into EGCG 
(extended GCG) and Rice moved to the Sanger Centre (now the Wellcome 
Trust Sanger Institute) to continue its development. However, the 
changes to the source code licensing of GCG in 1996 put an end to 
further development of EGCG. Recognizing the importance of free source 
code to the rapid and cost-effective development of bioinformatics 
tools, Rice, in collaboration with Alan Bleasby (then at SEQNET, 
Daresbury, UK) began working on a new suite of open-source 
bioinformatics tools – the EMBOSS project – in 1996. EMBOSS has been 
funded by: the Wellcome Trust (1997–2000); the BBSRC and MRC 
(2001–2004); and through two posts at the MRC Rosalind Franklin Centre 
for Genomic Research following a merger with BBSRC’s SEQNET facility in 
1998. After the closure of RFCGR in July 2005, EMBOSS moved to the

Re: [EMBOSS] New EMBL release

2006-06-20 Thread Peter Rice
Wells, Isabelle wrote:
> Hi All,
> 
> EMBL release 87 has just been made available and changes to the entry ID
> line were made. Did anyone install it and index the files with dbiflat?
> I am just wondering whether the change in ID line structure causes
> problems.

There are some small changes needed. We will produce patch files next 
week for 3.0.0 (the 4.0.0 code in CVS already works).

We waited to see a full release before making the patches, in case there 
are any surprises.

I will send an announcement to this list when the patches are tested and 
copied to emboss.open-bio.org


regards,

Peter


Re: [EMBOSS] display of long ensembl and vega identifiers in alignments

2006-08-11 Thread Peter Rice
Hans Rudolf Hotz wrote:

> A few months back, I played around with the source code and changed one
> of the library files (ajalign.c). This now allows the display of up to 20
> characters, by using a new output format "pairln" for sequence alignment
> programs, like matcher or needle. This is in comparison to the default
> which displays only the first 6 characters, or "pair" which displays the
> first 13 characters, eg:

We can make the ID arbitrarily long for a "new" alignment format. We 
will need formats similar to the existing matcher and needle outputs to 
avoid breaking too many existing parsers (I remember when NCBI changed 
the use of a blank at the start of each line of blast output and almost 
all parsers had to change). The formats are easy to make (as you found 
out) from the existing ones.

We need to decide what to do with the standard alignment formats that 
have 6 characters in their definition (I assume this goes back to the 
days of PIR database identifiers when FASTP was first written). As we 
cannot fit many of the existing identifiers, we can make up unique 
identifiers for these (truncate the identifier, and make the names 
unique if they match).
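
One way the "truncate and make unique" idea could work is sketched below. The numeric-suffix scheme and the function name short_ids are assumptions for illustration, not a decision on how EMBOSS will do it.

```python
def short_ids(names, width=6):
    """Truncate names to `width` characters, keeping the results unique."""
    seen = set()
    result = []
    for name in names:
        candidate = name[:width]
        n = 1
        # On a clash, replace the tail with a counter until the ID is unused.
        while candidate in seen:
            suffix = str(n)
            candidate = name[:width - len(suffix)] + suffix
            n += 1
        seen.add(candidate)
        result.append(candidate)
    return result


print(short_ids(["ENSG00000139618", "ENSG00000157764", "P12345"]))
```

Note that the shortened names no longer match the originals, so a parser would need the mapping back to the full identifiers.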

Or, should we change the existing formats to allow longer IDs? What do 
the authors of parsers think?

regards,

Peter


[EMBOSS] EMBOSS 4.0.0 Latest Fixes

2006-08-25 Thread Peter Rice
I have posted some further fixes on the EMBOSS FTP site. None are 
critical. Users have been reporting interesting bugs. Some were also in 
release 3.0.0.

The fuzznuc, fuzzpro and fuzztran reports were changed in 4.0.0 to 
always report something. Unfortunately users running searches over the 
whole database found their output files were very large. We have changed 
the way reports work as follows:

1. fuzznuc, fuzzpro and fuzztran again report only sequences with hits

2. when a report is closed, a default header and footer are written 
(solving the problem of empty output files)

3. for sites that had concerns about searches for trivial patterns 
taking too long and generating too much output, reports have 2 new 
associated qualifiers. -rmaxall limits the total number of matches 
reported (fuzznuc, fuzzpro and fuzztran terminate when the limit is 
reached), -rmaxseq limits the maximum number of hits for one sequence.
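
How the two limits interact can be modelled in a few lines. This is a hypothetical model of the behaviour described in points 1 to 3, not the EMBOSS report code; the function name and the data layout are invented.

```python
def limit_hits(hits_by_sequence, rmaxseq=None, rmaxall=None):
    """Apply a per-sequence cap (rmaxseq) and a total cap (rmaxall).

    hits_by_sequence: iterable of (sequence_name, list_of_hits)
    """
    report = []
    total = 0
    for name, hits in hits_by_sequence:
        if not hits:
            continue  # sequences without hits are not reported
        kept = hits if rmaxseq is None else hits[:rmaxseq]
        for hit in kept:
            if rmaxall is not None and total >= rmaxall:
                return report  # total limit reached: stop the whole search
            report.append((name, hit))
            total += 1
    return report


search = [("seq_a", [1, 2, 3]), ("seq_b", []), ("seq_c", [4, 5])]
print(limit_hits(search, rmaxseq=2, rmaxall=3))
```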

We also have various fixes for reporting matches on the reverse strand, 
and for improved parsing of FASTA file IDs.

To update your EMBOSS 4.0.0 release, go to:

ftp://emboss.open-bio.org/pub/EMBOSS/fixes/

File README.fixes (see below) lists the files and describes the fixes.

Copy the files to the indicated directories and reinstall.

regards,

Peter Rice

file README.fixes 25-aug-2006

The files in this directory are bugfix replacements for files in
the EMBOSS-4.0.0 distribution. Just drop the replacement files in
the location shown and redo the 'make install.'


Fix 1. EMBOSS-4.0.0/nucleus/embpatlist.c

31 Jul 2006: Fixes a problem with searching for patterns and regular
expression in the reverse strand of nucleotide sequences. The change
is to use ajSeqReverseForce (always reverses the sequence provided)
instead of ajSeqReverseDo (which only reverses if the reverse flag is
set)

9 Aug 2006: Revised to also fix a problem with reverse strand sequence
positions.


Fix 2. EMBOSS-4.0.0/ajax/ajfile.c

31 Jul 2006: This fixes a bug where deleting the last line of buffered
input fails to reset the pointer to the last buffered line. This only
affected debug traces. Unfortunately, the ajFileBuffClear function
does call the debug trace. In practice we have only seen this bug when
processing sequence data in EMBL format from an MRS server.

Fix 3. EMBOSS-4.0.0/ajax/ajnam.c

31 Jul 2006: New database access methods MRS and DBFETCH need to be
explicitly turned on so that showdb can report them.


Fix 4. EMBOSS-4.0.0/ajax/ajseqdb.c

31 Jul 2006: The new MRS access method used a general search. This
gave strange results when the ID or accession appeared in any other
entry. It appears that MRS can search for id or accession only. This
worked on the main MRS server at least.

MRS access will be further extended in the next release. Please
contact the developers [EMAIL PROTECTED] if you would
like to help test new features in MRS access.

25 Aug 2006: Further change to allow multiple %s replacements in
complex URLs for access method URL. Needed for complex SRS queries to
resolve EMBL IDs so the following definition can be used for EMBL 
(warning, the URL may wrap badly in this email!)

DB embl [
  method: "url"
  format: "embl"
  type: "N"
  url:
 
"http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-noSession+-ascii+-vn+2+-e+[embl-id:%s]|[embl-acc:%s]|([emblidacc-id:%s]>embl)"
  comment: "EMBL from SRS including old IDs"
  ]
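
The effect of multiple %s replacements can be illustrated in Python. This only mimics the substitution; it is not the ajseqdb.c code, and the query value ABC123 is a placeholder rather than a real entry.

```python
def expand_url(template: str, query: str) -> str:
    """Replace every %s in the URL template with the same query string."""
    return template.replace("%s", query)


template = (
    "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-noSession+-ascii+-vn+2+-e+"
    "[embl-id:%s]|[embl-acc:%s]|([emblidacc-id:%s]>embl)"
)
print(expand_url(template, "ABC123"))
```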

Fix 5. EMBOSS-4.0.0/configure

07 Aug 2006: Fix configuration problem on Intel Mac machines. Make sure
this file is executable (chmod 755 configure) after downloading it.

Fix 6. EMBOSS-4.0.0/ajax/ajseq.c

09 Aug 2006: Return correct USA for "asis::" sequence input.

Fix 7. EMBOSS-4.0.0/emboss/dreg.c

09 Aug 2006: Correct sequence positions on the reverse strand.

Fix 8. See Fix 13

Fix 9. See Fix 13

Fix 10. EMBOSS-4.0.0/doc/programs/html/banana.1.banana.gif
EMBOSS-4.0.0/doc/programs/html/tcode.2.tcode.gif

14 Aug 2006: These graphics example outputs were missing from the
distribution.  When you run make install they will be copied to the
installed documentation.

Fix 11. EMBOSS-4.0.0/emboss/merger.c
EMBOSS-4.0.0/emboss/needle.c
EMBOSS-4.0.0/emboss/prophet.c
EMBOSS-4.0.0/emboss/water.c

14 Aug 2006: These programs calculate an internal path size from the
lengths of the input sequences. For sequences that are too long, a
fatal error is produced. But if the sequences are extremely long, the
test failed and the program gave a segmentation fault. This fix tests
in a different way that will catch all cases.
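
The README does not say why the old test failed. One plausible reading, which is only an assumption here, is that the product of the two lengths overflowed a 32-bit integer before it was compared with the limit. The Python sketch below simulates that failure and a division-based test that avoids it; all names and the limit value are hypothetical.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Mimic signed 32-bit wraparound of an integer result."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


def naive_too_long(len_a: int, len_b: int, limit: int) -> bool:
    # The product may wrap before the comparison, hiding the problem.
    return wrap_int32(len_a * len_b) > limit


def safe_too_long(len_a: int, len_b: int, limit: int) -> bool:
    # Compare by division instead, so no large product is ever formed.
    return len_b != 0 and len_a > limit // len_b


LIMIT = 1_000_000_000
print(naive_too_long(40_000, 40_000, LIMIT), safe_too_long(40_000, 40_000, LIMIT))
print(naive_too_long(70_000, 70_000, LIMIT), safe_too_long(70_000, 70_000, LIMIT))
```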

Fix 12. See Fix 13

Fix 13. EMBOSS-4.0.0/ajax/ajacd.c
 EMBOSS-4.0.0/ajax/ajfeat.c
 EMBOSS-4.0.0/ajax/ajfeat.h
 EMBOSS-4.0.0/ajax/ajreport.c
 EMBOSS-4.0.0/ajax/ajreport.h
 EMBOSS-4.0.0/emboss/fuzznuc.c
 EMBOSS-4.0.0/emboss/fuzzpro.c
 EMBOSS-4.0.0/emboss/fuzztran.c

21 Aug 2006: This provides new qualifiers to l

Re: [EMBOSS] iep program for multiple protein sequences

2006-09-08 Thread Peter Rice
Tao Song wrote:
> Hi,
> 
>  I wonder can the iep program  that calculates the isoelectric point of 
> a protein be used
> for a protein database? When asked to input protein sequence I gave 'tsw' 
> instead of
> 'tsw:laci_ecoli' I got an error that said 'sequence must be protein sequence 
> without BZ U X
> or *: found bad character Z'. Does iep can only take one protein sequence as 
> input file?

Your command does read the test swissprot database, but fails on an 
entry that is a sequence fragment with a Z ambiguity code.

For the next release, I have a patch that will convert B and Z to D/N 
and E/Q using the Dayhoff frequencies of naturally occurring amino 
acids. This will convert the first B or Z to a charged residue (as these 
are more common), the second to an uncharged residue, and so on. With 
this change in place iep can be modified to accept any protein sequence 
and will produce consistent results on ambiguity codes.
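
A minimal sketch of the conversion described above, assuming strict alternation and a separate counter for B and for Z (the description leaves both details open). This is not the actual patch, and the function name is made up.

```python
# Charged residue first, uncharged second, as described above.
CHOICES = {"B": ("D", "N"), "Z": ("E", "Q")}


def resolve_ambiguity(seq: str) -> str:
    """Replace B and Z codes by alternating between their two residues."""
    seen = {"B": 0, "Z": 0}
    out = []
    for residue in seq.upper():
        if residue in CHOICES:
            out.append(CHOICES[residue][seen[residue] % 2])
            seen[residue] += 1
        else:
            out.append(residue)
    return "".join(out)


print(resolve_ambiguity("ABZBZ"))
```

Because the result depends only on the input sequence, repeated runs give the same answer, which is the consistency the paragraph above asks for.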

A question: We can try this fix as a general solution for programs 
requiring "pureprotein" input, by converting any B or Z (or J) ambiguity 
code. Is this useful? For iep the order does not matter and the 
converted sequence does not appear in the output, but I think a 
program-by-program solution is better.

Other programs insisting on "pureprotein" input are hmoment, octanol and 
pepwindow

regards,

Peter


Re: [EMBOSS] EMBOSS-Explorer Follow-up

2006-09-11 Thread Peter Rice
Ryan Golhar wrote:
> So I stepped through the code for tfm and it looks like it initially
> looks in /usr/share/EMBOSS/doc/programs/html.  So 'make install' is
> putting the html docs in /usr/share/EMBOSS/doc/html/emboss/apps/...  But
> why?  Was this an inadvertent change?

Oops. /usr/share/EMBOSS/doc/html/emboss/apps/ is the new location in 
4.0.0 (so we do not have to keep copies of all the EMBASSY application 
documentation in the EMBOSS source).

Will be fixed in 4.0.0. A simple copy is one way to fix it. I will make 
a fix for tfm.

regards,

Peter


Re: [EMBOSS] EMBOSS-Explorer Follow-up

2006-09-11 Thread Peter Rice
Ryan Golhar wrote:
> So I stepped through the code for tfm and it looks like it initially
> looks in /usr/share/EMBOSS/doc/programs/html.  So 'make install' is
> putting the html docs in /usr/share/EMBOSS/doc/html/emboss/apps/...  But
> why?  Was this an inadvertent change?

Aha ... tfm works, but tfm -html may fail.

If the program fails to find the html file, it will check the original 
distribution directory. Unfortunately, if it does find an html file ... 
it may be from version 3. I forgot about the tfm -html option when we 
moved the files.

A fix will take a few days to test. EMBASSY html documentation is not 
under the embassy package, so TFM will have to check the ACD file to 
find the EMBASSY package name. Easy enough - several other programs do 
it - but needs quite a few tests to make sure it does it correctly in 
all cases.

regards,

Peter


Re: [EMBOSS] [Mrs-user] case sensitive identifiers

2006-09-28 Thread Peter Rice
Guy Bottu wrote:
> My idea is to let the MRS parser store 1fnt_aLC
> (LC means lowercase) as identifier. A user can then search for the 
> sequence he needs in MRS and in EMBOSS (if the EMBOSS installation uses 
> MRS as databank access mechanism) ask for the sequence pdbprot:1fnt_alc.
> This would of course also work with 1fnt_a_12835 but it avoids the use of 
> a meaningless and irreproducible number. Anybody a comment ?

Not a general solution, but for PDB chains you could use an extra 
underscore for the lower case ones.

For EMBOSS ... well, we could play with the way databases work. Not all 
access methods allow case sensitive searching, but we could fetch all 
entries and try to reject those that do not match. This would need 
something in the EMBOSS id. We already allow modifiers after the id to 
set sequence ranges pdbprot:1fbt_a[1:20] or we could add a qualifier 
-scasesensitive for all sequence inputs.
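
The "fetch all entries and reject those that do not match" idea can be sketched as a filter applied after a case-insensitive search. The data layout and function names are invented for this example, and the sequences are placeholders.

```python
def case_insensitive_search(query_id, database):
    """Stand-in for an access method that ignores case."""
    return [(eid, seq) for eid, seq in database if eid.lower() == query_id.lower()]


def case_sensitive_matches(query_id, database):
    """Keep only the entries whose ID matches the query exactly."""
    candidates = case_insensitive_search(query_id, database)
    return [(eid, seq) for eid, seq in candidates if eid == query_id]


pdbprot = [("1FNT_A", "SEQ1"), ("1FNT_a", "SEQ2"), ("1FNT_B", "SEQ3")]
print(case_insensitive_search("1FNT_a", pdbprot))
print(case_sensitive_matches("1FNT_a", pdbprot))
```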

Peter


Re: [EMBOSS] case sensitive identifiers

2006-09-29 Thread Peter Rice
Guy Bottu wrote:
> For the moment our emboss.default contains :
> 
> DB pdbprot [ type: P format: fasta comment: 'protein sequences from PDB'
>  methodquery: app app: "/nfsben/srs/bin/linux73/getz -e '[pdbprot-id:%s]'"
>  methodall: direct dir: /nfsben/srs/data/blast/dbfb/pdb file: pdb
> ]

That raises a new problem ... the "app" method will work, but "srs" and 
"srswww" will not.

They search for a pdbprot-acc match and there is no acc field.

I will add a new database attribute hasaccession (default "Y") so 
searches know whether the acc field can be used. Unfortunately the 
fields attribute is defined as "everything except id and acc" so I 
cannot use it.

So, there will be 2 new (and for the first time boolean) attributes for 
databases. To use them, you will need:

caseidmatch: "Y"
hasaccession: "N"

These will also be the first to use the default values for database 
attributes! All other default values are empty strings :-)

regards,

Peter


Re: [EMBOSS] Question regarding seqret

2006-10-25 Thread Peter Rice
Jean Mao wrote:
> Hi, 
> I have a question hopefully someone can help me about it.
> 
> I downloaded the gbrvt1.seq file from ftp://ftp.ncbi.nih.gov/genbank/ as a 
> test, gunzip and index it with dbxflat (I know it's not > than 2gb):
> 
> %  dbxflat -dbname=testdb -dbresource=embl -idformat=gb -directory=. 
> -fields='id,acc,sv,des' -filenames='gbvrt*.seq' -indexoutdir=. -release=0.0 
> -date='00/00/00'
> 
> Then I run 'seqret' but failed to retrieve entries using 'sv' or 'des' fields:

I didn't see an answer to this one, but I suspect you have already figured it 
out.

dbxflat and dbiflat will have created the sv and des indices.

You have to edit the database definition in emboss.default to say the fields 
exist.

fields: "sv des"

then seqret and other programs will know they can use them.

Yes, in theory seqret could work out what indices are available for a dbxflat
or dbiflat indexed database - but it would be more difficult for an SRS or
SRSWWW database (for example) so we depend on the database definitions.

Hope that helps,

Peter


Re: [EMBOSS] extracting noncoding regions

2006-10-31 Thread Peter Rice
Hi Shrish,

Shrish Tiwari wrote:
> Hi!
> Is there a way of extracting the noncoding regions of a genome using an 
> EMBOSS program?

That is a simple change to coderet to return non-coding sequence (exclude the 
CDS and mRNA features).

Does anyone else want this? We can do it for the next release.

regards,

Peter


Re: [EMBOSS] showfeat troubles

2006-10-31 Thread Peter Rice
Hi Shrish,

Shrish Tiwari wrote:
> Hi!
> I used the following command to extract only positions of CDS from gbk files:
> showfeat -pos -matchtype CDS -width 0
> But I noticed that the program does not extract positions of CDS that lie on 
> the complementary strand, e.g. CDS complement(5683..6459) did not 
> show up in the resultant file. Any ideas on how I can get showfeat to extract 
> these positions too.

It worked for me, but reports these as 5683..6469 (without -width 0 it will
show the arrow in the reverse direction)

Can you try running entret on the same genbank entry, and sending the output 
file to [EMAIL PROTECTED] so we can take a look at it.

regards,

Peter Rice



Re: [EMBOSS] Batch retrieval of taxonomy/species names using entret.....

2006-10-31 Thread Peter Rice
Hi Richard,

Richard Rothery wrote:
> I am interested in using entret to retrieve single field entries from
> swissprot or sptrembl. Specifically, I would like to feed entret a list
> of accessions and have it return a file with the species names and/or
> taxonomies. I intend to use this information to compare with my
> phylogeny analyses of clustalw alignments.

EMBOSS stores the full text in entret without parsing.

We could try to extract specific fields but it is not easy to define them for 
all formats.

You can do this with SRS. Try the EBI server for example:

Go to the library page

Select UniProtKB/SwissProt (or UniProtKB/TrEMBL)

Select "standard query form"

Enter your query in the top part (e.g. accession number)

In the "create a view" section click the "list" button to get the original 
lines. Select anything taxonomic from the pull down list (control-click to 
select more than one)

Press "search".

Refine your query. You will see the URL at the top that can be used to retrieve 
data when you are happy.

Failing that, you could just parse out the ID and O* lines from entret using a 
simple perl script.
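
The same filtering can be sketched in Python instead of perl. It assumes SwissProt-style flat file text, where each line starts with a two-letter code, and keeps the ID line plus the organism-related O* lines (OS, OG, OC, OX, OH). The sample entry text is invented and heavily shortened.

```python
KEEP_CODES = ("ID", "OS", "OG", "OC", "OX", "OH")


def id_and_organism_lines(text: str) -> list:
    """Return the ID and O* lines from flat file text written by entret."""
    keep = []
    for line in text.splitlines():
        # A line code is two letters followed by a space (or end of line).
        if line[:2] in KEEP_CODES and line[2:3] in (" ", ""):
            keep.append(line)
    return keep


sample = """ID   TEST_ENTRY   placeholder
AC   X00000;
DE   Example protein.
OS   Escherichia coli.
OC   Bacteria; Proteobacteria.
SQ   SEQUENCE   5 AA;
     MKPVT
//
"""
for line in id_and_organism_lines(sample):
    print(line)
```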

Hope that helps,

Peter



Re: [EMBOSS] IDs in output

2006-11-03 Thread Peter Rice
Hi Bernd,

Bernd Web wrote:
> Hi,
> 
> Sometimes I use an EMBOSS command directly on a FastA file.
> I wonder if it is possible to select the ID used in the output, esp
> for FastA records with an NCBI defline.
> 
>> gi|248166|g|AA21972.1| description...
> 
> in the output of an EMBOSS command becomes:
> AA21972.1|
> 
> It would be very easy if the ID could be chosen to be the GI number.
> Now the ID used depends on the GI record (sp, pdb, pir) show different
> IDs in EMBOSS output.

Did you mistype the defline? There is a defined set of database names that can
appear in NCBI deflines. If the "|g|" is really "gb" then the ID will be
AA21972, which is what I would expect.

If the database name is invalid (or a new one unknown to EMBOSS) then we could 
try to use the GI number, but the "EMBOSS way" would be to use the accession 
number from the sequence version. Unfortunately at present it is using the last 
part of sequence version "1" as the ID in your example. I will fix it for the 
next release.

You can use -sid on the command line to give an ID to a sequence that does not 
have one, but not to replace an existing ID. That seems strange. It may change 
for the next release so that you can always use -sid to define the ID.

Hope that helps

Peter






Re: [EMBOSS] IDs in output

2006-11-03 Thread Peter Rice
Bernd Web wrote:
> Hi Peter,
> 
> Although I copy pasted, indeed the defline was wrong. It should have been:
> 
>> gi|248166|gb|AAB21972.1| invertase {EC 3.2.1.26} [baker's yeast,
> Peptide Partial, 6 aa, segment 10 of 12]
> ATNTTL
> 
> EMBOSS extracts "AAB21972.1".
> Having the version number is OK since otherwise the sequence is not
> completely defined (AAB21972 could refer to multiple versions).

If you specify -osformat ncbi you should be able to recreate the original 
defline in the EMBOSS output.

> My idea was more related to selecting the GI number as ID to use in
> EMBOSS applications. Now the accession number depends on the format of
> the defline:
> sp ->  Entry Name (not primary accession)

If there is an Entry name EMBOSS will use it.

> ref, emb, gb -> Accession

But now EMBL and Genbank define this as the entry name anyway.

> pdb -> PDB protein name with Chain concatenated to it.

That seems good to me ... although we know of a problem when there are more
than 26 chains and -a comes round again.

> Although I wrote a script to map the names from NCBI deflines to
> EMBOSS names, it could be easy to have the option to use the GI
> number.

Hmmm ... in EMBOSS terms, this counts as yet another sequence format. We could
make a new output format (-osformat gifasta for example) that uses the GI as
the ID ... but it would use the original sequence name as the filename first
time around (and then when you read the file it would start using the GI number
as the filename).

But we could also make "gifasta" an input format (-sformat gifasta) and then it 
could use the GI number - but you would have to specify the -sformat on the 
command line (or gifasta::filename as input) because EMBOSS has to choose which 
way to interpret the defline. Does that solve your problem?

NCBI regard the ID as the entire string with "|" characters embedded, but that 
is no use when making filenames so we had to choose something.

EMBL does not use GI numbers ... they only appear in GenBank and NCBI files. I 
never liked them, but EMBOSS does try to do whatever the users demand :-)
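
For anyone scripting around this in the meantime, the GI-style defline discussed above can be split into its parts with a few lines of Python. This is only a sketch, not how EMBOSS parses deflines: the function name parse_ncbi_defline is made up for illustration, and it assumes the gi|number|db|accession.version| layout of the example (other layouts, such as pdb entries with a chain field, are not handled).

```python
def parse_ncbi_defline(defline):
    """Split a 'gi|number|db|accession.version| description' defline.

    Hypothetical helper for illustration only; returns None when the
    identifier does not follow the assumed gi|...|db|...| layout.
    """
    text = defline.lstrip(">").strip()
    ident, _, description = text.partition(" ")
    fields = ident.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        return None
    accession, _, version = fields[3].partition(".")
    return {
        "gi": fields[1],              # GI number, e.g. "248166"
        "db": fields[2],              # database tag, e.g. "gb"
        "accession": accession,       # accession without the version
        "version": version or None,   # sequence version, if present
        "description": description.strip(),
    }


if __name__ == "__main__":
    info = parse_ncbi_defline(">gi|248166|gb|AAB21972.1| invertase {EC 3.2.1.26}")
    print(info["gi"], info["db"], info["accession"], info["version"])
```

A wrapper script could use the "gi" field as the sequence name where the GI number is wanted, and the "accession" field where the EMBOSS-style ID is wanted.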

regards,

Peter


Re: [EMBOSS] Transeq and very large sequences

2006-11-27 Thread Peter Rice
michael watson (IAH-C) wrote:

> I want to translate very large (eukaryotic chromosomes!) DNA sequences
> in all 6 frames.  Transeq takes about a day per large chromosome,
> running on a linux machine with 3Gb of RAM. 
> 
> Any suggestions on alternatives or how I could speed it up?

You want just a 6-frame translation of an entire chromosome?

I will look into why it takes so long. We have made some changes to string size 
extension that may already help this for the next release.

regards,

Peter


Re: [EMBOSS] Transeq and very large sequences

2006-11-27 Thread Peter Rice
michael watson (IAH-C) wrote:
> Excellent!  I set the MAXSEQIN parameter to 200,000,000 and it ran in 18
> seconds 

Ah, that is a challenge. I'll see what I can do with the EMBOSS code :-)

regards,

Peter


Re: [EMBOSS] Question regarding Reference Sequence Database

2006-11-30 Thread Peter Rice
Hi Jean,

> Can any program in the EMBOSS package make use of the Reference Sequence
> Databases? I indexed the refseq databases with dbxflat and ran showfeat against
> them but received an error about a zero length sequence:

The next release will include refseq as a valid sequence format.

You can usually get away with defining the format as Genbank. If that does not 
work please let me know and I will update the refseq format code.

Aha ... but in this case ...

NG_002612 does have zero length. This appears to be one of those entries (the 
EMBL CON division does much the same) that only refer to sequence data in other 
entries. It ends with the line:

CONTIG  join(complement(AC006998.3:2483..110100))

We can try to process these. The database definition will need to know where to
look up "AC006998.3" which is where the sequence data ... and all the missing 
features ... should be.
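
As a rough illustration of what "processing these" would involve, the CONTIG line can be broken into the entries and ranges it refers to. The following Python sketch is an assumption-laden example, not EMBOSS code: the name contig_segments is invented, the regular expression only recognises simple accession.version:start..end references, and a complement() wrapped around several segments would only be flagged on the first one.

```python
import re

# One reference such as "AC006998.3:2483..110100", optionally preceded
# by "complement(". Gap elements and nested joins are not handled.
_SEGMENT = re.compile(r"(complement\()?([A-Z][A-Z0-9_]*\.\d+):(\d+)\.\.(\d+)")


def contig_segments(contig_line):
    """Return the entries a CONTIG line points at, in order of appearance."""
    segments = []
    for match in _SEGMENT.finditer(contig_line):
        segments.append({
            "entry": match.group(2),               # e.g. "AC006998.3"
            "start": int(match.group(3)),
            "end": int(match.group(4)),
            "complement": match.group(1) is not None,
        })
    return segments


if __name__ == "__main__":
    line = "CONTIG  join(complement(AC006998.3:2483..110100))"
    for segment in contig_segments(line):
        print(segment)
```

Each returned entry name is what the database definition would then have to resolve to find the real sequence data.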

Can you exclude the CON entries from your indexing? If not, we can try
excluding them.

Hope that helps,

Peter


Re: [EMBOSS] question!

2007-01-23 Thread Peter Rice
Dear Fang,

> I installed EMBOSS 2.10.0 on a Windows XP PC. However, when I use the command
> "extractfeat genbank:*", it does not work. The error message is "Error: Unable
> to read sequence 'genbank:4101655', Died: extractfeat terminated: Bad value for
> '-sequence' and no prompt". But it works fine with "extractfeat
> embl:AK222810". Do you know the reason?

If you used the database definitions provided with EMBOSS ... your genbank is
possibly pointing to the CBR server in Canada which has now closed.

There is also a problem with the way SRS servers define the GI number - there 
are now servers that index it, but as "gid" not as "gi" which EMBOSS 
anticipated. We will change the field name in the next release of EMBOSS.


To test whether your genbank definition works, you could try the ID
We are now at release 4.0.0 which allows "gi" as a search field. Earlier
versions only had "sv" (sequence version) ... whether that is indexed depends
on the database provider. Indexing GenBank in EMBOSS does allow GI searches.

> Is there any way to access ENsembl database. Is there any new version of 
> EMBOSS which could support more databases which could installed in windowsXP?

Ah, you are running EMBOSS under windows? embosswin was provided by Andre 
Blavier up to EMBOSS 2.10.0. We now provide a beta release of EMBOSS 4.0.0 for 
windows (nobody did version 3.0.0 for windows).

Hmmm ... we need to make that more obvious on the EMBOSS website. EMBOSSWIN is
available by FTP from emboss.open-bio.org/pub/EMBOSS/windows/ ... only a few 
brave people have tested it so far, but they report that it is working.

> Are all the databases which EMBOSS connects to the latest version? I ask since
> I found some databases do not give the same results as what I get from the
> database directly.

That depends on where the databases are. There is a list of SRS servers you can 
check for the number of entries and the date they were indexed:

http://downloads.biowisdomsrs.com/publicsrs.html

for example:

DB genbank [ type: N method: srswww format: genbank
url: "http://iubio.bio.indiana.edu/srsbin/cgi-bin/wgetz"
dbalias: "genbankrelease"
fields: "gi sv des org key"
comment: "Genbank IDs" ]

You can also try Entrez databases in EMBOSS 4.0.0 ... I wonder how many users 
have been using entrez as an access method?

Hope that helps

Peter Rice




Re: [EMBOSS] question about display double-stranded DNA

2007-01-25 Thread Peter Rice
Hi Jean,

> When using remap, I prefer to use the '-noreverse' flag so that the
> translation of my DNA is located closer to my DNA strand. However, using
> this flag also removes the complementary strand of my DNA in the output which
> is less convenient when designing primers. Is there a way in remap to display
> double-stranded DNA but turn off the restriction sites of the complementary
> strand?

I am looking at remap changes at the moment, I will see what I can do.

> If not, is there a program in EMBOSS which can retrieve the sequence from
> database, select start/end points and display both strands? I tried seqret
> but failed.

Showseq does that.

It has a bug at present (I noticed it this week - fixed in the next release) 
that makes it show additional bases up to the end of the last line.

regards,

Peter


Re: [EMBOSS] question about display double-stranded DNA

2007-01-25 Thread Peter Rice
Hi Jean,

> Peter, Thanks for reply. seqret can retrieve entry and select start/end
> points. But seqret does NOT display both strands. Does it?

Right. Seqret returns a sequence, so it can only report one strand at a time.

> Showseq does that.
> 
> It has a bug at present (I noticed it this week - fixed in the next release)
> that makes it show additional bases up to the end of the last line.

Oops. Spoke too soon. showseq uses the same display functions as remap and has
the same limitations.

I will see what we can do for the next release.

regards,

Peter


Re: [EMBOSS] Restriction fragment sequences

2007-02-02 Thread Peter Rice
Jean-Christophe AME wrote:
> Hello,
> 
> I have a question concerning DNA restriction fragment analysis : Is  
> there a way to generate the actual sequence of the restriction  
> fragment generated by restrict or remap, this is to facilitate the in  
> silico construction of recombinant plasmid just with a cut and paste.  
> Maybe there are some ways to do this automatically (there was CloneIt but
> it doesn't work).

Interesting suggestion. You really need a nucleotide version of digest (or 
restrict with the fragment start/end and sizes reported instead of the cut 
sites).

With the command line option -rformat listfile you can then use seqret to
return the sequences by using @filename as input.

Unfortunately if you do that with restrict you only get the restriction sites.
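
Until such an application exists, the step from cut sites to fragments is simple enough to script. Here is a minimal Python sketch under stated assumptions: the function name fragments_from_cuts is hypothetical, the molecule is treated as linear, each cut position is the 1-based base after which the top strand is cut, and overhangs (sticky ends) are ignored. The cut positions themselves would have to be read from the restrict report by hand or by another script.

```python
def fragments_from_cuts(sequence, cut_positions):
    """Return (start, end, size, fragment) tuples for a linear sequence.

    start and end are 1-based and inclusive. Positions outside the
    sequence, and duplicate positions, are ignored.
    """
    length = len(sequence)
    cuts = sorted({p for p in cut_positions if 0 < p < length})
    bounds = [0] + cuts + [length]
    fragments = []
    for begin, end in zip(bounds, bounds[1:]):
        fragments.append((begin + 1, end, end - begin, sequence[begin:end]))
    return fragments


if __name__ == "__main__":
    # Toy sequence with two made-up cut positions.
    for start, end, size, frag in fragments_from_cuts("AAGCTTGGGAATTCCC", [1, 10]):
        print(start, end, size, frag)
```

The start and end columns are the "fragment start/end and sizes" mentioned above, and the last column is the sequence that could be pasted into a construct.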

We will add a new application to the next release.

regards,

Peter Rice


Re: [EMBOSS] question about 'fuzznuc'and 'urzpro'

2007-02-12 Thread Peter Rice
Hi Jean,

> I know I can give a pattern like 'ACCGGT' and search against a file which
> contains multiple sequences. Is there a way I can specify a 'pattern file'
> which contains multiple patterns that I want to search for instead of just
> one pattern each time? For example, I have a fileA which contains multiple
> DNA sequences. I want to create a fileB which contains 20 patterns that I
> want to search each of them against the sequences in the fileA. We are in the
> transition from GCG to EMBOSS. And the program 'findpatterns' in GCG can do
> this. But I couldn't find corresponding emboss program that does the same
> thing.

New in EMBOSS 4.0.0, contributed by Henrikki Almusa of Medicel in Helsinki.

fuzznuc (and fuzzpro and fuzztran) now can read in a file of patterns with the 
commandline syntax:

fuzznuc @patternfile

You can also use @patternfile in response to the prompt for a pattern.

Here is an example pattern file with FASTA-style IDs and mismatch counts for 
each pattern:

>pat1
cggccctaaccctagcccta
>pat2 
cg(2)c(3)taac
cctagc(3)ta
>pat3
cggc{2,4}taac{2,5}

Here is a file with just the second pattern, and no name (it will default to
pattern1):

cg(2)c(3)taac
cctagc(3)ta

You can set a default name with -pname and a default mismatch with -pmismatch.

I note we could document this better in the fuzz* program manual entries. We
will do so for the 4.1 release.

Hope that helps,

Peter


Re: [EMBOSS] question about 'fuzznuc'and 'urzpro'

2007-02-13 Thread Peter Rice
Hi Jean,

I copied this reply to the list - as it includes poorly documented features
and some suggestions for the future.

> It's great to know it can be done! I do have further questions. So in the 
> pattern file that has no name and contains two lines, you said it's going to 
> default to pattern 1. Does that mean that without the '>', everything will
> be concatenated and treated as one pattern?

Yes. We did include a -pformat qualifier to set the format of the pattern file,
so we can extend in future to have one pattern per line.

> Actually I should ask what's the difference between
> 
>> pat2 
> cg(2)c(3)taac
> cctagc(3)ta
> 
> and 
> 
>> pat2 
> cg(2)c(3)taaccctagc(3)ta

They are the same - pattern lines are simply joined together until the next new
pattern header (>pat3) is found.
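
To make the joining rule concrete, here is a small Python sketch of a reader that behaves as described: lines are concatenated until the next ">" header, and a pattern with no header gets a default name. It is an illustration only, not the fuzznuc code. The function name read_pattern_file is invented, the "pattern" plus running number default is an assumption based on the "pattern1" default mentioned earlier, and anything after the name on a header line is simply ignored.

```python
def read_pattern_file(lines, default_name="pattern"):
    """Return a list of (name, pattern) pairs from FASTA-style pattern lines."""
    patterns = []
    name = None
    parts = []

    def flush():
        # Store the pattern collected so far, if any lines were seen.
        if parts:
            final = name if name else f"{default_name}{len(patterns) + 1}"
            patterns.append((final, "".join(parts)))

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            tokens = line[1:].split()
            name = tokens[0] if tokens else None
            parts = []
        else:
            parts.append(line)
    flush()
    return patterns


if __name__ == "__main__":
    split_over_two_lines = [">pat2", "cg(2)c(3)taac", "cctagc(3)ta"]
    on_one_line = [">pat2", "cg(2)c(3)taaccctagc(3)ta"]
    print(read_pattern_file(split_over_two_lines) == read_pattern_file(on_one_line))
```

Both layouts from the question give the same single pattern, which is the point of the answer above.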

> also what's the difference between a file containing
>> pat2 
> cg(2)c(3)taac
> cctagc(3)ta

> with a file containing
> cg(2)c(3)taac
> cctagc(3)ta

The first allows one mismatch in matching the pattern. These patterns work with
the HHTETRA entry we use for the example in the program manual (accession number
L46634).

>HHTETRA L46634.1 Human herpesvirus 7 (clone ED132'1.2) telomeric repeat region.
aagcttaaactgaggtcacacacgactttaattacggcaacgcaacagctgtaagctgca
ggaaagatacgatcgtaagcaaatgtagtcctacaatcaagcgaggttgtagacgttacc
tacaatgaactacacctctaagcataacctgtcgggcacagtgagacacgcagccgtaaa
ttcctcaacccaaaccgaagtctaagtctcaccctaatcgtaacagtaaccctaca
actctaatcctagtccgtaaccgtaaaatcctagcccttagccctaaccctagccc
taaccctagctctaaccttagctctaactctgaccctaggcctaaccctaagcctaaccc
taaccgtagctctaagtttaaccctaaccctaaccctaaccatgaccctgaccctaaccc
tagggctgcggccctaaccctagccctaaccctaaccctaatcctaatcctagccctaac
cctagggctgcggccctaaccctagccctaaccctaaccctaaccctagggctgcggccc
taaccctaaccctagggctgcggcccgaaccctaaccctaaccctaaccctaaccctagg
gctgcggccctaaccctaaccctagggctgcggccctaaccctaaccctagggctgcggc
ccgaaccctaaccctaaccctaaccctagggctgcggccctaaccctaaccctagggctg
cggccctaaccctaaccctaactctagggctgcggccctaaccctaaccctaaccctaac
cctagggctgcggcccgaaccctagccctaaccctaaccctgaccctgaccctaacccta
accctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta
accctaaccctaaccctaaccctaagcactggcagccaatgtcttgtaatgc
cttcaaggcactctgcgagccgcgcgcagcactcagtgacaagtttgtgcac
gagaaagacgctgccaaaccgcagctgcagcatgaaggctgagtgcacaaggcttt
agtcccataaaggcgcggcttcccgtagagtagccgcagcgcggcgcacagagcga
aggcagcggctttcagactgtttgccaagcgcagtctgcatcttaccaatgatgatcgca
agcaagatgttctttcttagcatatgcgtggttaatcctgttgtggtcatcactaa
gcaagctt

> Also could you explain how to use -pname and -pmismatch?
> I don't understand this part at all :-P Thank you very much!

Ah ... they are associated qualifiers (like -sformat, -sbegin, -send for
sequences, -osformat for sequence output, -aformat for alignments and -rformat
for reports).

They only show up if you use -help -verbose to see the help.

This caused some problems for fuzznuc users with release 4.0.0 as they replace
the previous version which had a -mismatch option and only read one pattern.

-pmismatch sets a default number of mismatches for all patterns (that you can
override within the pattern file).

-pname sets a pattern name for the output (something that was missing before).
Oops, we have a bug ... the name is being ignored in fuzznuc. Will be fixed in
4.1.0.

-pformat sets the pattern file format - so far this is ignored so we have not
documented pattern file format names. I think a file with one line for each
pattern and numbering 1, 2, 3 added to the pattern name would be useful. We
could call the formats "simple" (one line per pattern) and "fasta" (the current
format with names)

Oops, another bug. Using a bad pattern file name is not being caught. Fixed in
4.1.0.

We also added files of regular expressions used by dreg and preg so you can also
use them for pattern searches (it depends on whether you prefer prosite-style
patterns or regular expressions - I find the prosite-style patterns for fuzznuc
much easier). We can use the same file formats for them.

I have to check the original pattern file code from Henrikki Almusa to see
whether we lost anything in the naming and formats.

Hope that helps,

Peter





Re: [EMBOSS] dreg and reverse strand

2007-02-20 Thread Peter Rice
Andres Pinzon wrote:
> Hi,
> I'm using dreg to find some patterns on a xanthomonas* genome reverse strand.
> This is the command I'm using:
> 
>  dreg -sequence ./campestrisVesicatoria.gb -pattern
> 'TTC(G|T|C){14,17}TTC(G|A|T)' -outfile campestrisVes-rev.dreg.gb
> -rformat3 genbank -sask1

Oops. Can you send me the input sequence please? We will fix it for the next
release (soon).

regards,

Peter


Re: [EMBOSS] Fuzznuc question: how to search complementary strand?

2007-02-26 Thread Peter Rice
Andres Pinzon wrote:
> Hi,
> I'm trying fuzznuc to search for some patterns in a genome.
> 
> ...But when I search the complementary strand:
>
> It reports a pattern on the complement that exists, in fact, but on the
> forward strand, not on the complement.
> 
> Am I doing something wrong?

I think this is one we patched soon after the 4.0.0 release. There are patches 
on our FTP server, and a new 4.1.0 release will appear soon with this fix 
included.

> What options do I have to use in order to make fuzznuc to report the
> occurrences of  "pattern" on both: reverse and complementary strand?

-complement is correct. It searches both strands.

To search only the complementary strand, use the general EMBOSS option
-sreverse and do not specify -complement.
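
For checking reported hits by hand, it can help to see what "searching both strands" means in terms of coordinates. The Python sketch below is an illustration only, not the fuzznuc algorithm: it uses ordinary Python regular expressions rather than fuzznuc pattern syntax, the names reverse_complement and search_both_strands are made up, and reverse-strand hits are reported in forward-strand coordinates (1-based, inclusive).

```python
import re

_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


def search_both_strands(seq, regex):
    """Return (strand, start, end) for non-overlapping matches on each strand."""
    hits = []
    for match in re.finditer(regex, seq):
        hits.append(("+", match.start() + 1, match.end()))
    length = len(seq)
    for match in re.finditer(regex, reverse_complement(seq)):
        # Index i on the reverse complement is index length-1-i on the
        # forward strand, so the match interval is mirrored.
        hits.append(("-", length - match.end() + 1, length - match.start()))
    return hits


if __name__ == "__main__":
    print(search_both_strands("aaagggccc", "ggg"))
```

A hit that only makes sense on the forward strand but is labelled as reverse would show up here as a mismatch between the two lists.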

Hope this helps,

Peter



Re: [EMBOSS] how to get jtranslations using extractfeat?

2007-02-28 Thread Peter Rice
Andres Pinzon wrote:
> Hi,
> I'm trying to get all the "/translation" sequences from a genome embl
> feature file.
> I mean, each CDS has a translation tag and I need those translations
> in a fasta file. I've tried all possible combinations of  -type -tag
> but I cannot get the translated sequences, only the DNA sequences.
> 
> Is it possible to get this translated sequences from the feature file?
> Or do I have to get the corresponding CDS DNA sequences and then translate 
> them?

Good suggestion ... we can try to make a new application. The /translation tag 
is rather special (because the value is a real sequence) ... also it may have a 
different name in some databases or feature file formats.

We will need to make up names for each translation (sequence identifiers, and 
something derived from the feature table) like the names used by extractfeat.

Alternate splicing will make it difficult to create reliable unique names.
Extractfeat does have the same problem - and nobody has complained. If we keep
a table of names so far we can add something to the end of any duplicates.
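
The "table of names so far" idea can be sketched in a few lines of Python. This is an assumption about how it might work, not a description of any EMBOSS code: the function name unique_names and the "_2", "_3" suffix style are invented for the example.

```python
def unique_names(names):
    """Return the names in order, with a numeric suffix added to duplicates."""
    used = set()      # every name handed out so far
    counts = {}       # highest suffix tried for each base name
    result = []
    for name in names:
        candidate = name
        number = counts.get(name, 1)
        # Keep increasing the suffix until the candidate is unused, so a
        # generated name never collides with one already in the table.
        while candidate in used:
            number += 1
            candidate = f"{name}_{number}"
        counts[name] = number
        used.add(candidate)
        result.append(candidate)
    return result


if __name__ == "__main__":
    print(unique_names(["cds1", "cds1", "cds2", "cds1"]))
```

The first occurrence keeps its plain name, so output for entries without alternate splicing would be unchanged.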

Extracttrans is a possible name for the program.

regards,

Peter



Re: [EMBOSS] question about translation start stie

2007-03-01 Thread Peter Rice
Dear Fang,

> Does anyone know if EMBOSS could give us the translation start site and
> translation start site ? Thanks!
> Looking forward to your reply.

Can you give an example of what you mean? Start position and first codon 
perhaps? using the feature table, or from finding open reading frames?

regards,

Peter


Re: [EMBOSS] EMBOSS 4.1.0 released

2007-03-08 Thread Peter Rice
Ryan Golhar wrote:
> I agree.  I was also expecting the version number on the tarballs to change
> as well.  At the moment, there is no way to tell they were updated...

The embassy changes are all minor. We like to use the version numbers of the 
original code so it is a little difficult to merge in the EMBOSS version ... 
without making up a very long version number.

Does anyone have strong preferences?

Peter
___
EMBOSS mailing list
EMBOSS@lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/emboss


Re: [EMBOSS] question (Error: Failed to find host 'srs.ebi.ac.uk' for database 'emblebi')

2007-03-30 Thread Peter Rice
Dear Nikolai,

Воробцов Николай Вадимович wrote:
> The other day I installed the EMBOSS package (version for windows).
> 
> Environment variables are set as required:
> SET EMBOSS_ROOT=D:\Emboss-MS
> SET EMBOSS_ACDROOT=D:\Emboss-MS\acd
> SET EMBOSS_DATA=D:\Emboss-MS\data
> 
> seqret emblebi:xlrhodop
> Reads and writes (returns) sequences
> Error: Failed to find host 'srs.ebi.ac.uk' for database 'emblebi'
> Error: Unable to read sequence 'emblebi:xlrhodop'
>
> Please tell me what the problem is.

The databases defined by default connect to servers here at EBI.

These databases need an internet connection. If you can connect your browser to 
http://srs.ebi.ac.uk then EMBOSS will be able to read databases.

There are some more settings you can add if, for example, you need to define an 
HTTP proxy.

You can also install the database flat files locally and index them with
dbxflat (or dbiflat).

You can use any EMBOSS program with local sequence data files, or put the 
sequence on the command line with the syntax:

seqret asis::

regards,

Peter Rice


[EMBOSS] Profiling and testing water

2007-04-13 Thread Peter Rice
Vivek Menon wrote:
> Hello all, I am having issues compiling the water and needle programs from
> the EMBOSS package.

That makes 3 related requests in the past week! It seems profiling and looking
at the code for water is becoming popular.

For those who want to play with the code, it may be helpful to describe how the
EMBOSS QA testing works. So far this has just been run internally to check that
code changes have not broken anything.

Firstly, edit file test/.embossrc to set the locations of the source test
directory (emboss_qadata) and the installed test directory (emboss_testdata).

The install directory is used for the test databases tsw, tembl (etc.) provided
with the EMBOSS distribution. The source test directory is used so that the
results of one test can be used in another.

cd to the source test directory.

cd to the qa subdirectory.

Run all the QA tests using:
../../scripts/qatest.pl -without=srs

(the command line option turns off tests that require SRS installed locally)

Run one selected QA test:

../../scripts/qatest.pl water-ex

Tests run in a subdirectory with the name of the test (test/qa/water-ex)

If the test succeeds, the directory is removed (the command line option -kk
keeps the directory).

New tests are easy to define - add them to test/qatest.dat. Each test has to have
a unique name. Descriptions of the definition line types are at the top of the
file.

Tests assume files stderr and stdout are created and empty. All other output
files must be included in the test definition (getting a surprise new file is an
error).

The .embossrc file defines the date to be 15-jul-2006 so do not be surprised if
you see that date in your output - we use it to keep the results constant when
updating the documentation. All the *-ex tests are examples for the manuals.

Have fun!!!

Peter


