Hi,

Exporting the hg19 RefSeq gene model from the Table Browser (group: Genes+Prediction; track: RefSeq Genes; table: refGene; output format: GTF) returns, for example:

chrX hg19_refGene start_codon 76709647 76709649 0.000000 + . gene_id "NM_003868"; transcript_id "NM_003868";
chrX hg19_refGene CDS 76709647 76709751 0.000000 + 0 gene_id "NM_003868"; transcript_id "NM_003868";
chrX hg19_refGene exon 76709647 76709751 0.000000 + . gene_id "NM_003868"; transcript_id "NM_003868";
chrX hg19_refGene CDS 76711768 76712010 0.000000 + 0 gene_id "NM_003868"; transcript_id "NM_003868";
chrX hg19_refGene stop_codon 76712011 76712013 0.000000 + . gene_id "NM_003868"; transcript_id "NM_003868";
chrX hg19_refGene exon 76711768 76712013 0.000000 + . gene_id "NM_003868"; transcript_id "NM_003868";

This incorrectly indicates that the start codon is the first three bases of the first aligned CDS exon. In fact, in cases like this one, the first exon is not aligned to hg19, so the 'first' CDS exon that appears in the hg19 alignment is actually midway through the coding sequence:

http://genome.ucsc.edu/cgi-bin/hgc?hgsid=240218517&g=htcCdnaAli&i=NM_003868&c=chrX&l=76709054&r=76712605&o=76709646&aliTable=refSeqAli&table=refGene

Why are such partial coding alignments included in gene models? If they are included intentionally, it seems that at a minimum the 'start_codon' entry should be removed from the gene model, to avoid inaccurate inferences based on the assumption that the start codon is actually at that location.

Is there a way to determine which refGene alignments do not have an aligned CDS start in the reference genome?

Dan
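
For reference, here is a minimal Python sketch of the layout described above: it parses the GTF records and reports, per transcript, whether the start_codon feature begins at the 5' edge of the first CDS feature. The function names (parse_gtf, start_codon_at_first_cds) are made up for illustration and are not part of any UCSC tool. Note that this condition also holds for any complete gene model with an annotated start codon, so the check only reproduces the reported layout. It cannot tell by itself that a 5' coding exon failed to align; that would need a comparison against the RefSeq record's own CDS annotation.

```python
import re
from collections import defaultdict

# The six records quoted above for NM_003868.
GTF_TEXT = """\
chrX hg19_refGene start_codon 76709647 76709649 0.000000 + . gene_id "NM_003868"; transcript_id "NM_003868";
chrX hg19_refGene CDS 76709647 76709751 0.000000 + 0 gene_id "NM_003868"; transcript_id "NM_003868";
chrX hg19_refGene exon 76709647 76709751 0.000000 + . gene_id "NM_003868"; transcript_id "NM_003868";
chrX hg19_refGene CDS 76711768 76712010 0.000000 + 0 gene_id "NM_003868"; transcript_id "NM_003868";
chrX hg19_refGene stop_codon 76712011 76712013 0.000000 + . gene_id "NM_003868"; transcript_id "NM_003868";
chrX hg19_refGene exon 76711768 76712013 0.000000 + . gene_id "NM_003868"; transcript_id "NM_003868";
"""


def parse_gtf(text):
    """Yield (transcript_id, feature, start, end, strand) for each GTF record."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Eight whitespace-separated columns, then the attribute string.
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        yield match.group(1), fields[2], int(fields[3]), int(fields[4]), fields[6]


def start_codon_at_first_cds(records):
    """Map transcript_id -> True if start_codon begins at the 5' edge of the first CDS."""
    features = defaultdict(lambda: defaultdict(list))
    strands = {}
    for tid, feature, start, end, strand in records:
        features[tid][feature].append((start, end))
        strands[tid] = strand

    result = {}
    for tid, by_feature in features.items():
        cds = by_feature.get("CDS", [])
        codon = by_feature.get("start_codon", [])
        if not cds or not codon:
            continue  # nothing to compare for this transcript
        if strands[tid] == "-":
            # On the minus strand the 5' edge is the largest end coordinate.
            result[tid] = max(e for _, e in codon) == max(e for _, e in cds)
        else:
            result[tid] = min(s for s, _ in codon) == min(s for s, _ in cds)
    return result


if __name__ == "__main__":
    for tid, flag in start_codon_at_first_cds(parse_gtf(GTF_TEXT)).items():
        print(tid, "start_codon at 5' edge of first CDS:", flag)
```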
