On Tue, 04-10-2011 at 15:13 +0100, Peter Rice wrote:
> On 10/04/2011 02:38 PM, Fernando Martinez wrote:
> > Hi, I am trying to retrieve sequences from a multi-fasta file where there are
> > identical sequences, and I want to extract only the ones in my list. How can
> > I do that?
> > Example:
> >
> > Multi.fasta file:
> >
> > >seq1
> > atataga...
> > >seq2
> > ttatggttca..
> > [...]
> > >seq1
> > atataga...
> > [...]
> > And I only want to take seq1 and seq2, not seq1 twice!!
>
> If you really must start from that file .... as usual with EMBOSS there
> are several ways to do it
>
> 1. Index with dbifasta
> ----------------------
>
> You can index with the older dbifasta program. This does not allow
> duplicate IDs so only one seq1 will be indexed.
>
> % dbifasta -dbname multi -dir . -index . -file Multi.fasta -idformat simple -auto
>
> Then define a database in your .embossrc file:
>
> DB multi [
>   format: "fasta"
>   method: "emblcd"
>   type: "nucleotide"
>   directory: "."
> ]
>
> Then replace "Multi.fasta" in your listfile with "multi" and you will
> have the sequences you want.
>
>
> 2. Rewrite as single files in a new directory, then rewrite as one file
> -----------------------------------------------------------------------
>
> % mkdir multi
> % seqret -ossingle -osdir multi Multi.fasta -auto
> % ls multi
> seq1.fasta seq2.fasta ...
>
> % cd multi
> % seqret '*.fasta' ../Single.fasta
>
> (note: you do need the quotes around the wild card file name)
>
> This will give you a file Single.fasta in the original directory with
> only the last version of each id, because each later copy overwrites
> the earlier single-sequence file of the same name.
>
>
> 3. Write a new application
> ---------------------------
>
> Another approach is to write your own new application. A copy of seqret
> which keeps a table of ids and rejects any sequence whose ID is already
> in the table will rewrite the file (in any format) with only the first
> occurrence of each id. We will add this to the next release.
>
>
> 4. ... there may be more ways, but these will be enough to solve your
> problem.
>
> Hope that helps,
>
> Peter Rice
> EMBOSS Team
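
[Note: the idea in approach 3 (keep a table of ids, reject any id already seen) can also be sketched outside EMBOSS. The Python below is only an illustrative sketch, not the EMBOSS implementation: the function names `read_fasta` and `unique_records` and the `keep` parameter are made up for this example, and it assumes plain FASTA input where the id is the first word after ">". Setting keep="last" mimics which copy survives in approach 2, though not necessarily the output order.]

```python
import io


def read_fasta(handle):
    """Yield (seq_id, record_lines) for each FASTA record in handle."""
    seq_id, lines = None, []
    for line in handle:
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, lines
            # Assumption: the id is the first whitespace-delimited word after '>'.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            lines = [line]
        elif seq_id is not None:
            lines.append(line)
    if seq_id is not None:
        yield seq_id, lines


def unique_records(handle, keep="first"):
    """Return the lines of one record per id, ordered by first appearance."""
    chosen = {}  # table of ids seen so far -> lines of the record to keep
    for seq_id, lines in read_fasta(handle):
        if keep == "last" or seq_id not in chosen:
            chosen[seq_id] = lines
    return [line for lines in chosen.values() for line in lines]


# Small in-memory example shaped like the Multi.fasta file in the question.
example = io.StringIO(">seq1\natataga\n>seq2\nttatggttca\n>seq1\natataga\n")
print("".join(unique_records(example)), end="")
```

[For a real file, the same function could be called on an open file handle, e.g. `unique_records(open("Multi.fasta"))`, and the returned lines written to a new file.]
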

Thanks, your help was very useful, in particular the second approach.

Best regards,
Fernando
_______________________________________________
EMBOSS mailing list
EMBOSS@lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/emboss