On 08-Nov-2018 19:19, Anandkumar Surendrarao wrote:
I am new to EMBOSS, and trying to use shuffleseq to randomly shuffle entire genomes (one-by-one). My input genomic sequences are in multifasta format, and I wish to retain the same multifasta format for the output file as well, containing the shuffled DNA sequences.

This isn't an EMBOSS solution, but I use

  http://saf.bio.caltech.edu/pub/software/molbio/fastaselecth.c

to do many similar tasks. It reads sequence entries from -in, selects those whose headers are listed in -sel, and writes them to -out in the order given in -sel. It can also be used to reject just the sequences listed in -sel, in which case the input order carries over to the output.

Assuming the headers are simple (so that the alternate header parsing flags are not needed), and also using my extract program (from drm_tools on sourceforge; there are lots of other ways of doing this) to make a list of the header names:

extract -in source.fasta -if '>' -ifonly -mt -dl '> ' -fmt '[1]' \
 | shuf \
 | fastaselecth -in source.fasta -out shuffled.fasta -sel -

If you want a random subset of 1000 sequences, change the second line to

 | shuf | head -1000 \

and so forth.
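
As one of those other ways of making the header list, awk alone can do it. This is only a rough sketch, not an exact equivalent of the extract command above: the function name fasta_header_names is made up for illustration, and it assumes that the sequence name is the first whitespace-delimited token after the '>'.

```shell
# Hypothetical helper (not part of EMBOSS, drm_tools, or fastaselecth).
# Prints one sequence name per line: the first whitespace-delimited token
# of each header line, with the leading '>' (and any blanks after it) removed.
fasta_header_names() {
    awk '/^>/ { sub(/^>[ \t]*/, ""); print $1 }' "$1"
}
```

It would then take the place of the first line of the pipeline, with the shuf and fastaselecth stages unchanged:

  fasta_header_names source.fasta \
   | shuf \
   | fastaselecth -in source.fasta -out shuffled.fasta -sel -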

Bump up the -wl parameter if sequences longer than 10 Mbp are possible.

I'm sure that somewhere there are fasta files with headers so complex that the alternative header parsing options in the program are not sufficient. If that happens, use extract (or perl or awk or ...) to simplify the headers first.
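
For the awk route, a minimal sketch of that kind of header simplification might look like the following. The function name simplify_fasta_headers is hypothetical, and the sketch assumes that the first whitespace-delimited token of each header is enough to identify the sequence uniquely; if it is not, a different rule is needed.

```shell
# Hypothetical helper: rewrite each header so that only its first token
# remains (for example ">seq1 some long description" becomes ">seq1").
# Sequence lines are passed through unchanged.
simplify_fasta_headers() {
    awk '/^>/ { sub(/^>[ \t]*/, ""); print ">" $1; next } { print }' "$1"
}
```

For example:

  simplify_fasta_headers source.fasta > simple.fasta

after which simple.fasta is used in place of source.fasta in the commands above.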

Regards,

David Mathog
[email protected]
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
EMBOSS mailing list
[email protected]
http://mailman.open-bio.org/mailman/listinfo/emboss
