Greetings EMBOSS users!
- I am using shuffleseq on entire genomic DNA multi-FASTA input files (EMBOSS ver. 6.6.0).
- For just one genome that is relatively large (~2 GB), with several pseudomolecules in the 150-250 Mb size range, I am splitting the file into individual sequences and running them as an array job.
- All runs are on a UNIX-based compute cluster using the SLURM queue controller.
- My syntax is simply: srun shuffleseq -sformat pearson $IN $OUT
- For the most part, all is well.
- With that as context, I have a few questions about the use of shuffleseq:

*Q1.* How is the RAM requirement calculated from the input file size? Is there an approximate formula, or have users worked it out empirically?

*Q2.* When I performed some downstream analyses of shuffled genomes from 5 independent runs of shuffleseq, 4 of the 5 gave me no DNA sequence matches, suggesting that shuffling worked well, but for the remaining run this was not at all the case. So I wonder whether the randomization step during shuffling is quirky in any way. I came across this link <http://eyegene.ophthy.med.umich.edu/shuffle/> describing possible issues with a lack of true randomization in an old EMBOSS release. It makes me wonder whether this sort of issue still plays any role in version 6.6.0. Or could there be other explanation(s) for why 4 are good shuffles but 1 is not at all? The scripts for the repetitions are easy to copy and modify suitably, which leaves room for mistakes; nevertheless, I have checked and re-checked the syntax and found no errors there.

Thanks, in advance, for advice and pointers from forum members. And, in advance, best wishes for a happy and productive 2019.

Cheers!
Anand
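
For reference, the split-then-array setup described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact script used: the function name split_fasta, the output file names (seq_0001.fa, ...) and the split/ and shuffled/ directories are hypothetical. Only the srun shuffleseq line in the comments comes from the post itself.

```shell
#!/usr/bin/env bash
# Split a multi-FASTA file into one file per record.
# File names (seq_0001.fa, seq_0002.fa, ...) are hypothetical, chosen for illustration.
split_fasta() {
    local in="$1" outdir="$2"
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        # A header line starts a new record: close the previous file, open the next.
        /^>/ { if (out) close(out); n++; out = sprintf("%s/seq_%04d.fa", dir, n) }
        # Write every line of the current record; lines before the first header are skipped.
        out  { print > out }
    ' "$in"
}

# Inside a SLURM array task, the task ID could then select the matching file
# (directory names are assumptions):
#   IN=$(printf 'split/seq_%04d.fa' "$SLURM_ARRAY_TASK_ID")
#   OUT=$(printf 'shuffled/seq_%04d.fa' "$SLURM_ARRAY_TASK_ID")
#   srun shuffleseq -sformat pearson "$IN" "$OUT"
```

With one file per pseudomolecule, each array task only has to hold a single sequence in memory, which is also a convenient unit for measuring RAM use empirically (Q1).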
