Re: [Samtools-help] Is this a bug in implementation, or my misunderstanding of the spec?

Jim Robinson Fri, 22 Aug 2014 18:36:07 -0700

Hmm, despite what that format description says, in practice it lookslike the second column is a transcript id in the files downloadable fromUCSC. For example, grepping the "refFlat.txt" file for hg38 as followsyields


grep EGFR refFlat.txt | cut -f 1-8


EGFR    NM_005228    chr7    +    55019031    55207338 55019277    55205617
EGFR    NM_201284    chr7    +    55019031    55171045 55019277    55170544
EGFR    NM_201283    chr7    +    55019031    55156951 55019277    55156843
EGFR    NM_201282    chr7    +    55019031    55168635 55019277    55168529

EGFR-AS1 NR_047551 chr7 - 55179749 55188949 5518894955188949

Jim

I've been using Picard's RefFlatReader for my own purposes, and havebeen home-rolling refFlat files. The specification I've found islisted here:
http://genome.ucsc.edu/goldenPath/gbdDescriptionsOld.html#RefFlat

In particular, the first two fields of the file are:
string geneName; "Name of gene as it appears in GenomeBrowser."
    string  name;               "Name of gene"

This looks like the gene symbol, followed by the UCSC accession ID.

This is backed up by the UCSC FAQ:

*Question: *
"I have the accession number for a gene and would like to link it tothe gene name. Is there a table that shows both pieces of information?"
*Response:*
If you are looking at the RefSeq Genes, the /refFlat/ table containsboth the gene name (usually a HUGO Gene Nomenclature Committee ID) andits accession number.
However, in the Picard implementation, the reader defines the columns as:

public enum RefFlatColumns{GENE_NAME, TRANSCRIPT_NAME, CHROMOSOME, STRAND, 
TX_START, TX_END, CDS_START, CDS_END,
        EXON_COUNT, EXON_STARTS, EXON_ENDS}
So, if the spec I listed is right, Picard interprets gene accessionIDs as transcript names. The problem here is that when a set oftranscripts is being built for a gene, transcripts are identified bytheir name, and transcripts with the same name in the same gene arerejected.
if (transcripts.containsKey(name)) {
throw new AnnotationException("Transcript " + name + " for gene" + this.getName() + " appears more than once");
}
Less amusingly, this exception is caught, and a debug level log isthrown that seems to be pretty easy to miss.
The following example (if I've implemented the spec correctly) is valid:
WASH7P ENSG00000227232 1 - 14363 29370 14363 29370 1014363,14970,15796,16607,16858,17233,17606,17915,24738,29321,14829,15038,15947,16765,17055,17368,17742,18061,24891,29370,WASH7P ENSG00000227232 1 - 14363 29370 14363 29370 1214363,14970,15796,15904,16607,16854,17233,17602,17915,18268,24738,29321,14829,15038,15901,15947,16765,17055,17364,17742,18061,18379,24891,29370,
However, since these two transcripts (that are clearly different asthey use different exons) share the same name, an exception is thrown,a message is logged to debug, andthe GeneAnnotationReader.loadRefFlat() method returns a truncatedobject, as parsing effectively stops when this happens.
What I'd like to know is:
1) Did I misinterpret the specification for the refFlat file? If Idid, my bad and I'll make the appropriate changes to adhere to the spec
2) If not #1, is this a bug in the Picard implementation?

Thanks for your help/advice.

-Jim Nemesh


------------------------------------------------------------------------------
Slashdot TV.
Video for Nerds.  Stuff that matters.
http://tv.slashdot.org/


_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help

------------------------------------------------------------------------------
Slashdot TV.  
Video for Nerds.  Stuff that matters.
http://tv.slashdot.org/

_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help

Re: [Samtools-help] Is this a bug in implementation, or my misunderstanding of the spec?

Reply via email to