Hello Martin,

It is understandable that this is confusing, as the base coordinate systems used by the two tracks you are comparing are different. I will try to clarify.

Most data tables in the UCSC browser are configured so that coordinates are always stated with respect to the positive strand. The coordinates are also in a format that is called "zero-based, half-open". The .maf format is an exception. The coordinates are stranded and are "one-based, fully closed".

MAF file specification:
http://genome.ucsc.edu/FAQ/FAQformat.html#format5

Coordinate explanation in detail:
http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms

Because both of these genes are on the (-) strand, the coordinate formats need to be adjusted to compare between the two data sources. Here is the breakdown for one of the examples:

NM_000586

Position from RefSeq Gene track:

RNA/Genomic Alignments
SIZE  IDENTITY  CHROMOSOME  STRAND  START      END        QUERY      START  END  TOTAL
--------------------------------------------------------------------------------------
794   100.0%    4           -       123372630  123377650  NM_000586  1      794  822

Length of chr4 = 191,154,276

The "end" in the RefSeq alignment is really the start (start/end are reversed for negative stranded alignments, so that coordinates are with respect to the positive strand, as described in the "Coordinates" link above). The coordinates are also zero-based, half-open. Since the alignment is on the (-) strand, the start of the alignment is the "half-open" coordinate (the "END" in the data table above; we do not change the column labels for (-) strand alignments, this is just something you have to learn about in order to interpret the data correctly).

*Calculation to convert the RefSeq alignment coordinate to the MAF coordinate:*

chrom_end - alignment_start = start position in MAF
191,154,276 - 123,377,650 = 67,776,626

The MAF line for NM_000586 starts at 67775626 and is 1000 bases long, so it ends at 67776626, which is exactly where the transcript starts in (-) strand coordinates.

For a forward stranded alignment, it would be necessary to add "1" to the start position to match the start in the MAF file, since that start would be the "0-based" coordinate in the RefSeq alignment. You can double check this in the data.

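The subtraction above can be checked against both of your examples with a few lines of Python. This is only a sketch: the function name plus_to_minus_strand is made up for illustration, it does nothing more than the chromosome-length subtraction described above, and all numbers are copied from this thread.

```python
def plus_to_minus_strand(start, end, chrom_size):
    """Flip a positive-strand (start, end) pair onto the (-) strand.

    Hypothetical helper: it only applies the subtraction described above,
    so the positive-strand "end" becomes the (-) strand start.
    """
    return chrom_size - end, chrom_size - start


# (chromosome length, RefSeq start, RefSeq end, MAF start, MAF size)
examples = {
    "NM_000586": (191154276, 123372630, 123377650, 67775626, 1000),
    "NM_000162": (159138663, 44183870, 44229022, 114908641, 1000),
}

for acc, (chrom_size, start, end, maf_start, maf_size) in examples.items():
    minus_start, _ = plus_to_minus_strand(start, end, chrom_size)
    # The upstream block should end where the transcript starts.
    matches = maf_start + maf_size == minus_start
    print(acc, minus_start, matches)
```

For both accessions the 1000-base upstream block ends at the converted transcript start (67776626 for NM_000586 and 114909641 for NM_000162).
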
Examine and let us know if this does not resolve the discrepancies you noticed. If you have other related questions or if this is still unclear, just let us know what you would like more explanation about.

Coordinates can be tricky and we like to help,
Jennifer

---------------------------------
Jennifer Jackson
UCSC Genome Informatics Group
http://genome.ucsc.edu/

On 4/22/10 8:11 AM, Martin Haubrock wrote:
> Dear all,
>
> I have downloaded the upstream1000.maf.gz file for the hg19 build using
> the following URL:
>
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/
>
> This file contains the upstream region of RefSeq genes with annotated
> 5'UTRs. I find some inconsistency in this file in comparison to the UCSC
> genome browser annotation:
>
> ########################################
> refseq-acc: NM_000586
>
> => extract of the upstream1000.maf file:
> ...
> r txupstream 1000 NM_000586
> s hg19.chr4 67775626 1000 - 191154276 AGGACTCTCT-CTGAGACAGG...
> ...
> -----------------------
> But the genome browser find this gene in the following location:
>
> NM_000586 at chr4:123372630-123377650
>
> ########################################
>
> refseq-acc: NM_000162
>
> => extract of the upstream1000.maf file:
> ...
> r txupstream 1000 NM_000162
> s hg19.chr7 114908641 1000 - 159138663 acctctgag...
> ...
> -----------------------
> But the genome browser find this gene in the following location:
>
> NM_000162 at chr7:44183870-44229022
> ########################################
>
> Did I miss something? Could you please explain me that problem?
>
> Martin Haubrock.
>
> --
> Department of Bioinformatics (UMG)
> University Medicine Göttingen, Goldschmidtstr. 1, 37075 Göttingen
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
