Hi all,
The folks on the FlyBase help mailing list were able to answer my questions regarding the D. persimilis and D. willistoni scaffold naming discrepancies between the UCSC Genome Browser and FlyBase:

1. D. persimilis: There is a one-to-one relationship between the FlyBase "scaffold_" prefixed headers and the UCSC "super_" prefixed headers. For example, scaffold_14 corresponds to super_14.

2. D. willistoni: The mapping is not as trivial as it is for persimilis. Both the UCSC Genome Browser and FlyBase use the AAA CAF1 assemblies produced in 2007. However, 89 scaffolds were later suppressed from the CAF1 assembly for Dwil before it was added to GenBank, because these 89 scaffolds mapped to Wolbachia. The scaffold IDs were also changed to avoid confusion with the previous (original) assembly. Explicitly, it appears that the UCSC Genome Browser is using the unsuppressed, original willistoni assembly and FlyBase is using the modified assembly.

A mapping between the original AAA version and what exists in GenBank/FlyBase can be found here:
ftp://ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster/special_requests/CAF1/dwil/dwil_scaffold2GenBank

A full list of problematic scaffolds that were found during the GenBank submission process (for all 12 Drosophila genomes) can be found here:
ftp://ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster/special_requests/CAF1/foreign_scaffolds_in_caf1.txt

This explanation has been reworded from the response I received on the FlyBase help mailing list. I thank everyone for their generous help with this ambiguity. Hopefully others who encounter this discrepancy will find a suitable answer in this thread.

Thanks,
Jaaved

--
Jaaved Mohammed, Ph.D.
Student of Computational Biology
Tri-Institutional Training Program in Computational Biology and Medicine
(Cornell University - Ithaca, Weill Cornell Medical College, and Memorial Sloan-Kettering Cancer Center)

From: Greg Roe [mailto:[email protected]]
Sent: Tuesday, September 27, 2011 8:06 PM
To: Jaaved Mohammed
Cc: [email protected]
Subject: Re: [Genome] super vs scaffold coordinates & D. willistoni on the browser.

Hi Jaaved,

We ran faCount, so no need to do that yourself:

http://www.broadinstitute.org/ftp/pub/assemblies/insects/droSec1/assembly.bases.gz
21,424 contigs (UCSC: 14,730 super contigs)

http://www.broadinstitute.org/ftp/pub/assemblies/insects/droPer1/assembly.bases.gz
26,812 contigs (UCSC: 12,838 super contigs)

As stated before, the assemblies hosted at UCSC have not been updated for quite some time. Obviously a lot of work has been done on these organisms since. You would have to track down which labs produced the newer data, etc., in order to answer your questions. We just don't have that information.

Please let us know if you have any additional questions: [email protected]

-
Greg Roe
UCSC Genome Bioinformatics Group

On 9/19/11 6:53 AM, Jaaved Mohammed wrote:

Hi Vanessa,

Thanks for your response. Can you help point me to the download site with the latest assembly for any of the 3 fly species that the engineer speaks of? I can find multiple nucleotide sequences across several sites. For example, for D. willistoni, I can find an assembly at LBNL (http://rana.lbl.gov/drosophila/assemblies.html), and from NCBI, I can download all the raw sequences from http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7260. I could not find any of the insects up on NCBI Ensembl genome browser either. Can you help point me in the right direction?

Thanks,
Jaaved

On Fri, Sep 16, 2011 at 3:06 PM, Vanessa Kirkup Swing <[email protected]> wrote:

Hi Jaaved,

To answer your first question: The genomes we have are old, so it is possible that the differences are due to years of version updates. One of our engineers has this to say:

Go to the current download site for these genomes, fetch the sequence file, and run an faCount on it to see what they name the bits. Compare names and genome organization with what we display. I would assume after 5 or 6 years, these genomes most likely have new assemblies. These genome project sites would most likely explain their update history. You may also find assembly history in the browsers at Ensembl. There may also be information on their trace archive pages if they have them. For example: http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=AAMC01

To answer your second question: Unfortunately, our funding covers primarily vertebrate genomes, though we do host a few of the major model organisms.

Hope this helps you. If you have further questions, please contact the mailing list: [email protected].

Vanessa Kirkup Swing
UCSC Genome Bioinformatics Group

---------- Forwarded message ----------
From: Jaaved Mohammed <[email protected]>
Date: Thu, Sep 15, 2011 at 8:57 AM
Subject: [Genome] super vs scaffold coordinates & D. willistoni on the browser.
To: [email protected]

Hello,

I have two questions that I would really appreciate your help with answering.

===========
Firstly,
===========
I am trying to understand the origin of the "super*" coordinates for the droPer1 and droSec1 genomes available on the UCSC Genome Browser. For example, in the D. sechellia assembly, I see that all the chromosomes are prefixed by "super" on the Genome Browser:
http://genome-mirror.bscb.cornell.edu/cgi-bin/hgTracks?hgsid=36382&chromInfoPage=

However, on Flybase.org, the GFF files, or any coordinates for that matter, are always prefixed by "scaffold", as can be seen at ftp://flybase.net/genomes/Drosophila_sechellia/current/gff/. Why is this? How was the conversion from "scaffold" into "super" coordinates done? I'm trying to convert the FlyBase genes reported in the GFF files into a file that I can upload to the browser to see the FlyBase-annotated genes, non-coding RNAs, etc.; however, this clash of coordinate names is causing many problems. I should note that I looked in all the older revisions of the FlyBase GFF files and still see no "super" prefixed coordinates. I hope I'm not looking at the wrong FlyBase GFF files. The same observation was made in the droPer1 reference assembly.

=============
Secondly, I've noticed that the D. willistoni reference assembly is not available on the UCSC Genome Browser. Why is this? I've added this genome to the Cornell mirror using the droWil1.fa file downloaded/available from the UCSC browser. The added genome can be viewed here:
http://genome-mirror.bscb.cornell.edu/cgi-bin/hgGateway?hgsid=36387&clade=insect&org=D.+willistoni&db=0

On a similar note to the first point above, I've observed that the coordinates are prefixed with "scaffold" on the browser, but FlyBase reports coordinates prefixed with "scf2_": ftp://flybase.net/genomes/Drosophila_willistoni/current/gff/.

Thanks,
Jaaved

--
Jaaved Mohammed, Ph.D.
Student of Computational Biology
Tri-Institutional Training Program in Computational Biology and Medicine
(Cornell University - Ithaca, Weill Cornell Medical College, and Memorial Sloan-Kettering Cancer Center)

_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
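
For anyone who lands on this thread with the same problem: the D. persimilis renaming described at the top (scaffold_14 corresponds to super_14) could be applied to a GFF file along the lines of the sketch below. This is only an illustration, not something provided by UCSC or FlyBase. The function names are made up for this example, and it assumes tab-separated GFF feature lines whose first column holds the sequence name.

```python
import re


def scaffold_to_super(seqid: str) -> str:
    """Turn a 'scaffold_N' name into 'super_N'; leave any other name unchanged."""
    match = re.fullmatch(r"scaffold_(\d+)", seqid)
    return f"super_{match.group(1)}" if match else seqid


def rename_gff_lines(lines, rename=scaffold_to_super):
    """Yield GFF lines with column 1 renamed.

    Blank lines and '#' comment/directive lines pass through untouched.
    Only the first tab-separated field of each remaining line is changed.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        fields[0] = rename(fields[0])
        yield "\t".join(fields)


if __name__ == "__main__":
    # Made-up feature line, for illustration only (not real annotation data).
    example = [
        "##gff-version 3\n",
        "scaffold_14\tFlyBase\tgene\t100\t200\t.\t+\t.\tID=example\n",
    ]
    for out_line in rename_gff_lines(example):
        print(out_line, end="")
```

For D. willistoni a simple prefix swap is not enough. The same rename_gff_lines function could instead be given a lookup built from the dwil_scaffold2GenBank file mentioned above, but the column layout of that file is not described in this thread, so parsing it is left out here.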
