Re: [Genome] Question: upstream1000.maf file

Jennifer Jackson Thu, 22 Apr 2010 11:16:53 -0700

Hello Martin,

It is understandable that this is confusing, as the base coordinate 
systems used by the two tracks you are comparing are different. I will 
try to clarify.

Most data tables in the UCSC browser are configured so that coordinates 
are always stated with respect to the positive strand. The coordinates 
are also in a format that is called "zero-based, half-open".
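
As a minimal sketch of what "zero-based, half-open" means next to ordinary one-based, fully closed counting (the function names here are made up for illustration and are not part of any UCSC tool):

```python
def half_open_to_closed(start, end):
    """Zero-based, half-open -> one-based, fully closed."""
    # Only the start shifts; the end value is the same number in both systems.
    return start + 1, end


def closed_to_half_open(start, end):
    """One-based, fully closed -> zero-based, half-open."""
    return start - 1, end


# The first ten bases of a sequence, written both ways.
zero_based = (0, 10)
one_based = half_open_to_closed(*zero_based)

print(one_based)                          # (1, 10)
print(zero_based[1] - zero_based[0])      # length in half-open form: 10
print(one_based[1] - one_based[0] + 1)    # length in closed form: 10
```

In the half-open form the length is simply end minus start, which is one reason the convention is convenient for interval arithmetic.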

The .maf format is an exception. The coordinates are stranded and are 
"one-based, fully closed".

MAF file specification:
http://genome.ucsc.edu/FAQ/FAQformat.html#format5

Coordinate explanation in detail:
http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms

Because both of these genes are on the (-) strand, the coordinate 
formats need to be adjusted to compare between the two data sources.

Here is the breakdown for one of the examples: NM_000586
Position from RefSeq Gene track:

RNA/Genomic Alignments

SIZE IDENTITY CHROMOSOME STRAND     START       END     QUERY START END TOTAL
-----------------------------------------------------------------------------
 794   100.0%          4      - 123372630 123377650 NM_000586     1 794   822

Length of chr4 = 191,154,276

The "END" in the RefSeq alignment is really the start of the gene. 
Start and end are reversed for negative-strand alignments so that 
coordinates are stated with respect to the positive strand, as described 
in the "Coordinate explanation in detail" link above. The coordinates 
are also zero-based, half-open. Since the alignment is on the (-) 
strand, the start of the alignment is the "half-open" coordinate, that 
is, the "END" column in the data table above. We do not change the 
column labels for (-) strand alignments; this is simply something you 
need to know in order to interpret the data correctly.
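
The rule described above can be sketched as a small helper. The name `five_prime_boundary` is hypothetical, and the sketch assumes that START and END are always given on the positive strand with START less than END, as in the table:

```python
def five_prime_boundary(start, end, strand):
    """Return the positive-strand coordinate at which the gene begins.

    For a (+) strand gene this is the START column; for a (-) strand
    gene the columns are not relabeled, so it is the END column.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return end if strand == "-" else start


# NM_000586 is on the (-) strand, so its start is the END column.
print(five_prime_boundary(123372630, 123377650, "-"))  # 123377650
```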

*Calculation to convert the RefSeq alignment coordinate to the MAF 
coordinate:*

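chrom_length - alignment_start = start position of the gene in MAF 
(-) strand coordinates, where alignment_start is the "END" column value

191,154,276 - 123,377,650 = 67,776,626

This is where the 1000-base upstream block in your MAF line ends 
(67,775,626 + 1,000 = 67,776,626).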
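
The same arithmetic can be checked against both MAF lines quoted in the original question. This is only a sketch: `parse_maf_s_line` is a hypothetical helper that assumes the usual whitespace-separated "s" line layout (source, start, size, strand, source size, text), and the gene end values are taken from the browser positions in that message.

```python
def parse_maf_s_line(line):
    """Split a MAF 's' line into its named fields (assumed layout)."""
    fields = line.split()
    return {
        "src": fields[1],
        "start": int(fields[2]),
        "size": int(fields[3]),
        "strand": fields[4],
        "src_size": int(fields[5]),
    }


# (MAF line, positive-strand END of the gene), both from the quoted message.
examples = [
    ("s hg19.chr4 67775626 1000 - 191154276 AGGACTCTCT-CTGAGACAGG...", 123377650),
    ("s hg19.chr7 114908641 1000 - 159138663 acctctgag...", 44229022),
]

for line, gene_end in examples:
    block = parse_maf_s_line(line)
    # Chromosome length minus the gene's positive-strand END column.
    converted = block["src_size"] - gene_end
    # Where the upstream block stops on the (-) strand.
    block_end = block["start"] + block["size"]
    print(block["src"], converted, block_end, converted == block_end)

# hg19.chr4 67776626 67776626 True
# hg19.chr7 114909641 114909641 True
```

For both genes the converted coordinate equals the MAF start plus the block size, so the upstream block stops exactly where the gene begins on the (-) strand.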

For a forward stranded alignment, it would be necessary to add "1" to 
the start position to match the start in the MAF file, since that start 
would be the "0-based" coordinate in the RefSeq alignment. You can 
double check this in the data.

Please check this against your data and let us know if it does not 
resolve the discrepancies you noticed. If you have other related 
questions, or if this is still unclear, just tell us what you would like 
explained further. Coordinates can be tricky and we like to help,

Jennifer

---------------------------------
Jennifer Jackson
UCSC Genome Informatics Group
http://genome.ucsc.edu/

On 4/22/10 8:11 AM, Martin Haubrock wrote:
> Dear all,
>
> I have downloaded the upstream1000.maf.gz file for the hg19 build using
> the following URL:
>
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/
>
> This file contains the upstream region of RefSeq genes with annotated
> 5'UTRs. I find some inconsistency in this file in comparison to the UCSC
> genome browser annotation:
>
> ########################################
> refseq-acc: NM_000586
>
> =>  extract of the upstream1000.maf file:
> ...
> r txupstream 1000 NM_000586
> s hg19.chr4 67775626 1000 - 191154276 AGGACTCTCT-CTGAGACAGG...
> ...
> -----------------------
> But the genome browser finds this gene at the following location:
>
> NM_000586 at chr4:123372630-123377650
>
>
> ########################################
>
> refseq-acc: NM_000162
>
> =>  extract of the upstream1000.maf file:
> ...
> r txupstream 1000 NM_000162
> s hg19.chr7 114908641 1000 - 159138663 acctctgag...
> ...
> -----------------------
> But the genome browser finds this gene at the following location:
>
> NM_000162 at chr7:44183870-44229022
> ########################################
>
>
>
>
> Did I miss something? Could you please explain this problem to me?
>
> Martin Haubrock.
>
> --
> Department of Bioinformatics (UMG)
> University Medicine Göttingen, Goldschmidtstr. 1, 37075 Göttingen
>
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
