Re: [EMBOSS] shuffleseq for multifasta?

David Mathog Fri, 09 Nov 2018 09:32:54 -0800

On 08-Nov-2018 19:19, Anandkumar Surendrarao wrote:

I am new to EMBOSS, and trying to use shuffleseq to randomly shuffle entire genomes (one-by-one). My input genomic sequences are in multifasta format.
And I wish to retain the same multifasta format for the output file as
well, containing the shuffled DNA sequences.


This isn't an EMBOSS solution, but I use

  http://saf.bio.caltech.edu/pub/software/molbio/fastaselecth.c

to do many similar tasks. It takes sequence entries from -in, writes the selected ones to -out, and selects using the headers in the order provided through -sel. It can also be used to reject just those sequences from -sel (in which case the input order carries over to the output). Assuming the headers are simple (so that alternate header parsing flags are not needed) and using also my extract program (from drm_tools on sourceforge, there are lots of other ways of doing this) to make a list of the header names:


extract -in source.fasta -if '>' -ifonly -mt -dl '> ' -fmt '[1]' \
 | shuf \
 | fastaselecth -in source.fasta -out shuffled.fasta -sel -

If you want a random subset of 1000 change the second line to

 | shuf | head -n 1000 \

and so forth.
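
For anyone without extract and fastaselecth installed, the same idea (put the entries in random order, keep each sequence attached to its header) can be sketched with standard tools only. This is not how fastaselecth works internally, just one of the "other ways". The function name shuffle_fasta_entries is made up for illustration, and the sketch assumes GNU shuf is available, that the file starts with a '>' header line, and that neither headers nor sequence lines contain tab characters.

```shell
# Hypothetical helper, not part of EMBOSS, drm_tools, or fastaselecth.
# Assumptions: GNU shuf is installed, the file begins with a '>' header,
# and no line in the file contains a tab.
shuffle_fasta_entries() {
    # Step 1: join each entry onto one line, using tabs between its lines.
    # Step 2: shuffle those one-line entries.
    # Step 3: turn the tabs back into newlines to restore multifasta layout.
    awk '
        /^>/ { if (n++) printf "\n"; printf "%s", $0; next }
        n    { printf "\t%s", $0 }
        END  { if (n) printf "\n" }
    ' "$1" | shuf | tr '\t' '\n'
}

# Usage (file names are placeholders):
#   shuffle_fasta_entries source.fasta > shuffled.fasta
```

Note that this reorders whole entries; the residues inside each sequence are left untouched.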

Bump up the -wl parameter if sequences longer than 10Mbp are possible.

I'm sure that somewhere there are fasta files with headers so complex that the alternative header parsing options in the program are not sufficient. If that happens use extract (or perl or awk or ...) to simplify the headers.
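
As a rough illustration of the awk route, here is a sketch that cuts every header down to its first whitespace-delimited word. The function name simplify_fasta_headers is hypothetical, and the sketch assumes that first word is enough to identify each entry uniquely; if it is not, a different rule is needed.

```shell
# Hypothetical helper: keep only the first word of each FASTA header.
# Assumption: the first whitespace-delimited word after '>' is unique per entry.
simplify_fasta_headers() {
    awk '
        # Header line: drop ">" and any blanks after it, then keep field 1.
        /^>/ { sub(/^>[ \t]*/, ""); print ">" $1; next }
        # Sequence line: pass through unchanged.
        { print }
    ' "$1"
}

# Usage (file names are placeholders):
#   simplify_fasta_headers source.fasta > simple.fasta
```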


Regards,

David Mathog
[email protected]
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
EMBOSS mailing list
[email protected]
http://mailman.open-bio.org/mailman/listinfo/emboss
