Re: [Genome] Overlapping features annotation

Jennifer Jackson Tue, 23 Jun 2009 16:53:58 -0700

Hello Ricardo,

After considering your question with some internal discussion, the most 
accurate method may be the simplest. Compare probes to transcripts, 
generate alignment, analyze results.


Alternatively, there are three programs in our code tree that do 
comparisons between tracks mapped to genomic. None do this exact task in 
one step. Some experimentation on your part would be required.
http://genome.ucsc.edu/FAQ/FAQdownloads#download27

Thank you,
Jennifer Jackson
UCSC Genome Bioinformatics Group

-------------------------------

hgMapToGene - Map a track to a genePred track.
usage:
   hgMapToGene database track geneTrack mapTable
where:
   database - is a browser database (ex. hg16)
   track - the track to map to the genePred track
           This must be bed12 psl or genePred itself
   geneTrack - a genePred format track
   newMapTable - a new table to create with two columns
           <geneTrackId><trackId>

pslMap - map PSLs alignments to new targets using alignments of
the old target to the new target.  Given inPsl and mapPsl, where
the target of inPsl is the query of mapPsl, create a new PSL
with the query of inPsl aligned to the target of mapPsl.  If
inPsl is a protein to nucleotide alignment and mapPsl is a
nucleotide to nucleotide alignment, the resulting alignment is
nucleotide to nucleotide alignment of a hypothetical mRNA that
would code for the protein.  This is useful as it gives base
alignments of spliced codons.  A chain file may be used instead
mapPsl
usage:
   pslMap [options] inPsl mapFile outPsl
Options:
  -chainMapFile - mapFile is a chain file instead of a psl file
  -swapMap - swap query and target sides of map file.
  -swapIn - swap query and target sides of inPsl file.
  -suffix=str - append str to the query ids in the output
   alignment.  Useful with protein alignments, where the result
   is not actually and alignment of the protein.
  -keepTranslated - if either psl is translated, the output psl
   will be translated (both strands explicted).  Normally an
   untranslated psl will always be created
  -mapInfo=file - output a file with information about each mapping.
   The file has the following columns:
     o srcQName, srcQStart, srcQEnd, srcQSize - qName, etc of
       psl being mapped (source alignment)
     o srcTName, srcTStart, srcTEnd - tName, etc of psl being
       mapped
     o srcStrand - strand of psl being mapped
     o srcAligned - number of aligned based in psl being mapped
     o mappingQName, mappingQStart, mappingQEnd - qName, etc of
       mapping psl used to map alignment
     o mappingTName, mappingTStart, mappingTEnd - tName, etc of
       mapping psl
     o mappingStrand - strand of mapping psl
     o mappingId - chain id, or psl file row
     o mappedQName mappedQStart, mappedQEnd - qName, etc of
       mapped psl
     o mappedTName, mappedTStart, mappedTEnd - tName, etc of
       mapped psl
     o mappedStrand - strand of mapped psl
     o mappedAligned - number of aligned bases that were mapped
     o qStartTrunc - aligned bases at qStart not mapped due to
       mapping psl/chain not covering the entire soruce psl.
       This is from the start of the query in the positive
       direction.
     o qEndTrunc - similary for qEnd
  -simplifyMappingIds - simplifying mapping ids (inPsl target
   name and mapFile query name) before matching them. This
   first drops everything after the last `-', and then drops
   everything after the last remaining `.'.

featureBits - Correlate tables via bitmap projections.
usage:
   featureBits database table(s)
This will return the number of bits in all the tables anded together
Pipe warning:  output goes to stderr.
Options:
   -bed=output.bed   Put intersection into bed format. Can use stdout.
   -fa=output.fa     Put sequence in intersection into .fa file
   -faMerge          For fa output merge overlapping features.
   -minSize=N        Minimum size to output (default 1)
   -chrom=chrN       Restrict to one chromosome
   -chromSize=sizefile       Read chrom sizes from file instead of database.
   -or               Or tables together instead of anding them
   -not              Output negation of resulting bit set.
   -countGaps        Count gaps in denominator
   -noRandom         Don't include _random (or Un) chromosomes
   -noHap            Don't include _hap chromosomes
   -dots=N           Output dot every N chroms (scaffolds) processed
   -minFeatureSize=n Don't include bits of the track that are smaller than
                     minFeatureSize, useful for differentiating between
                     alignment gaps and introns.
   -bin=output.bin   Put bin counts in output file
   -binSize=N        Bin size for generating counts in bin file (default 
500000)
   -binOverlap=N     Bin overlap for generating counts in bin file 
(default 250000)
   -bedRegionIn=input.bed    Read in a bed file for bin counts in 
specific regions
                     and write to bedRegionsOut
   -bedRegionOut=output.bed  Write a bed file of bin counts in specific 
regions
                     from bedRegionIn
   -enrichment       Calculates coverage and enrichment assuming first table
                     is reference gene track and second track something else
   '-where=some sql pattern'  Restrict to features matching some sql pattern
You can include a '!' before a table name to negate it.
Some table names can be followed by modifiers such as:
    :exon:N          Break into exons and add N to each end of each exon
    :cds             Break into coding exons
    :intron:N        Break into introns, remove N from each end
    :utr5, :utr3     Break into 5' or 3' UTRs
    :upstream:N      Consider the region of N bases before region
    :end:N           Consider the region of N bases after region
    :score:N         Consider records with score >= N
    :upstreamAll:N   Like upstream, but doesn't filter out genes that
                     have txStart==cdsStart or txEnd==cdsEnd
    :endAll:N        Like end, but doesn't filter out genes that
                     have txStart==cdsStart or txEnd==cdsEnd
The tables can be bed, psl, or chain files, or a directory full of
such files as well as actual database tables.  To count the bits
used in dir/chrN_something*.bed you'd do:
   featureBits database dir/_something.bed


Ricardo Verdugo wrote:
> Hello:
>
> I have been looking for a tool that can compare two lists of multi- 
> block genomic features (or genome alignments in PSL, e.g.) and output  
> the list of features that overlap. I have found the intersectBed (BED  
> format) and the fjoin (GFF format) programs, which do this but in a  
> one block basis, i.e. they intersect one block or exon at a time,  
> regardless of wether other blocks also match or not.
>
> I am interested in annotating microarray probes with genes and  
> transcripts. If a probe align to the genome say, in two blocks  
> separated by a gap, this probe should be annotated to a transcript  
> only if both blocks match (overlap) two consecutive exons in a  
> transcript.
>
> I would like to know if you have any program that can do this before I  
> decide if I should write my own.
>
> Thank you in advance,
>
> Ricardo Verdugo
>
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Ricardo A. Verdugo S., D.V.M. Ph.D.
> Postdoctoral Fellow
> Gary Churchill Group
> The Jackson Laboratory
> 600 Main Street,
> Bar Harbor, ME 04609
> [email protected]
> +1-207-288-6715
> +1-207-288-6847 Fax
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome

Re: [Genome] Overlapping features annotation

Reply via email to