Hi Michiel,

I'd like to add a bit regarding this part of your original question, because I 
believe you'll still see this when you are removing only the lineage-specific 
repeats:

> However, when I look at the UCSC genome-wide alignment between rheMac2 and 
> rn4, it seems that the repeats that should have been removed are included in 
> the alignment.

Removing the lineage-specific repeats (LSRs) is only half of the story.  Before 
alignments, LSRs are removed, not masked, from the sequences -- the sequences 
become shorter.  We are in effect pretending that no new repeat insertions 
happened since the most recent common ancestor.  This helps the aligner to 
extend an alignment further along the diverged sequences, when otherwise a 
repeat insertion in one lineage would have broken the alignment.  But now the 
sequence coordinates returned by the aligner are for artificially shortened 
sequences.  
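To make the "removed, not masked" distinction concrete, here is a minimal Python sketch.  It is not the actual strip_rpts code; the function name, the toy sequence and the 0-based half-open intervals are all assumptions made for illustration.

```python
def strip_repeats(seq, repeats):
    """Return seq with the given repeat intervals cut out.

    repeats: non-overlapping (start, end) pairs, 0-based half-open
    (an assumption of this sketch, not the real .rpts file format).
    """
    pieces = []
    prev_end = 0
    for start, end in sorted(repeats):
        pieces.append(seq[prev_end:start])  # keep sequence before the repeat
        prev_end = end                      # skip over the repeat itself
    pieces.append(seq[prev_end:])           # keep the tail after the last repeat
    return "".join(pieces)


original = "AAAACCCCGGGGTTTT"   # toy sequence; CCCC plays the lineage-specific repeat
lsr = [(4, 8)]

stripped = strip_repeats(original, lsr)
masked = original[:4] + original[4:8].lower() + original[8:]

print(stripped, len(stripped))  # AAAAGGGGTTTT 12
print(masked, len(masked))      # AAAAccccGGGGTTTT 16
```

Masking keeps the length (and therefore the coordinates) intact, while stripping shifts every base downstream of the repeat, which is why the coordinates need adjusting afterwards.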

The resulting alignments' coordinates are adjusted back to original sequence 
coords by PSU's restore_rpts script.  Where an alignment extended across an 
excised repeat, it now gets a gap as it hops over the restored repeat.  
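The coordinate adjustment can be sketched roughly like this in Python.  This is only an illustration of the idea, not the restore_rpts implementation; restore_block is a hypothetical name, and the 0-based half-open intervals are an assumption.

```python
def restore_block(start, end, repeats):
    """Map an aligned block [start, end) from stripped-sequence coordinates
    back to original coordinates.

    repeats: non-overlapping (start, end) intervals in ORIGINAL coordinates,
    0-based half-open (an assumption of this sketch).
    Returns a list of original-coordinate segments; more than one segment
    means the block hopped over a restored repeat.
    """
    segments = []
    shift = 0    # total length of repeats restored so far
    cur = start  # current position in stripped coordinates
    for r_start, r_end in sorted(repeats):
        cut = r_start - shift  # where this repeat was excised, stripped coords
        if cut <= cur:
            # repeat lies at or before the block start: just shift
            shift += r_end - r_start
        elif cut < end:
            # repeat was excised inside the block: split the block here
            segments.append((cur + shift, cut + shift))
            shift += r_end - r_start
            cur = cut
        else:
            break  # remaining repeats are downstream of the block
    segments.append((cur + shift, end + shift))
    return segments


# One repeat at original positions 4..8; a block that spans the excision point.
print(restore_block(2, 10, [(4, 8)]))  # [(2, 4), (8, 14)]
```

The two returned segments are separated by exactly the length of the restored repeat, which is the gap you see in the chain.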

This example:

> LINE/L2 repeat at chr7:87564770..87565324

actually seems not to have been selected as lineage-specific according to our 
files.  But here is a nearby one that was selected as lineage-specific and 
excised from rheMac2:

  589  14.9  0.0  0.0  chr7  86256230 86256323 (83545043) C  L2  LINE/L2  (0) 3419 3326  364  0  0
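In case it helps with reading lines like that one, here is a small Python sketch that pulls out the main fields.  It assumes the standard whitespace-separated RepeatMasker .out column order with 1-based inclusive coordinates; the function name is made up for this example, and the trailing columns after the repeat positions are simply ignored.

```python
def parse_rmsk_line(line):
    """Parse one RepeatMasker .out-style line into a dict (illustrative only)."""
    f = line.split()
    strand = "-" if f[8] == "C" else "+"
    if strand == "-":
        # complement hits list: (left) end begin
        rep_begin, rep_end = int(f[13]), int(f[12])
    else:
        # forward hits list: begin end (left)
        rep_begin, rep_end = int(f[11]), int(f[12])
    return {
        "sw_score": int(f[0]),
        "perc_div": float(f[1]),
        "chrom": f[4],
        "start": int(f[5]),   # 1-based, inclusive
        "end": int(f[6]),     # 1-based, inclusive
        "strand": strand,
        "rep_name": f[9],
        "rep_class": f[10],
        "rep_begin": rep_begin,
        "rep_end": rep_end,
    }


line = ("589  14.9  0.0  0.0  chr7  86256230 86256323 (83545043) C  L2  "
        "LINE/L2  (0) 3419 3326  364  0  0")
rec = parse_rmsk_line(line)

genomic_len = rec["end"] - rec["start"] + 1
repeat_len = rec["rep_end"] - rec["rep_begin"] + 1
print(rec["chrom"], rec["start"], rec["end"], rec["strand"], rec["rep_class"])
print(genomic_len, repeat_len)  # 94 94
```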

If you view the surrounding region (chr7:86,256,138-86,256,416) in the rheMac2 
Genome Browser, you can see a gap in the Rat chain/net corresponding to the L2.

Hope that helps,

Angie

P.S. Drilling down a bit... note that the gap jumping over the L2 is 
double-sided (two horizontal lines, not one).  If you click on the blue chain, 
the details page has a link "Open Rat browser at position corresponding to the 
part of chain that is in this window."  Clicking on that link opens a new 
window or tab with the Rat browser, and you can see that rat also has an L2 
right in the middle.  Looking back in our files, that L2 was not deemed 
lineage-specific for rat.  And indeed it is probably an ancient/ancestral 
repeat (predating the lineage split) that was just incorrectly estimated to be 
lineage-specific in rhesus (using human heuristics) without the benefit of 
already having alignments like these.  

That's just a reminder that the lineage-specific designation is based on 
heuristics.  We use it because it resulted in a significant gain in coverage 
for human-mouse alignments, as described in the Blastz paper (Schwartz S et 
al., Genome Research 2003).  


----- "Michiel de Hoon" <[email protected]> wrote:

> From: "Michiel de Hoon" <[email protected]>
> To: [email protected]
> Sent: Friday, June 24, 2011 12:16:09 AM GMT -08:00 US/Canada Pacific
> Subject: Re: [Genome] Repeat regions in whole-genome alignments
>
> Dear Pauline,
> 
> Many thanks for your reply. I think you are right and that I am
> inadvertently removing repeats that should not be removed.
> 
> To select the appropriate repeats from the DateRepeats output, UCSC
> uses the extractRepeats and extractLinSpecReps scripts. Can these
> scripts be made available somewhere? I couldn't find them in Jim
> Kent's software collection or elsewhere.
> 
> Thank you again,
> --Michiel
> 
> > Hello Michiel,
> >
> > We do not remove all repeats, we only remove the lineage specific
> > repeats. It is possible that if your RepeatMasker scripts are
> > failing, that you have not been able to produce the actual lineage
> > specific repeats.
> >
> > For more info on the RepeatMasker scripts used to construct these
> > files please see the associated makedoc.
> >
> >
> > Hopefully this information was helpful and answers your question. If
> > you have further questions or require clarification feel free to
> > contact the mailing list at genome at soe.ucsc.edu.
> > <https://lists.soe.ucsc.edu/mailman/listinfo/genome>
> >
> > Regards,
> >
> > Pauline Fujita UCSC Genome Bioinformatics Group
> > http://genome.ucsc.edu
> >
> >
> > On 06/21/11 21:00, Michiel de Hoon wrote:
> >> Hello,
> >> I am trying to do a multiple alignment of the genomes of
> >> several organisms. To make sure I am doing this correctly, I tried to
> >> recreate the rheMac2 to rn4 pairwise alignment that is available from
> >> the UCSC FTP server. I found some discrepancies between my alignment
> >> and the UCSC alignment in the repeat regions.
> >> From looking at src/hg/utils/automation/blastz-run-ucsc, I understand
> >> that repeats are removed from the Fasta genome files by strip_rpts
> >> before running lastz. In src/hg/makeDb/doc/rheMac2.txt, the repeats
> >> to be removed are determined by running
> >>         DateRepeats chr*.fa.out -query human -comp mouse -comp dog
> >> and then running extractRepeats 1 on the output. I couldn't find the
> >> extractRepeats program, but I am guessing that I can get the
> >> appropriate result by
> >>         DateRepeats chr*.fa.out -query human -comp mouse
> >> Then I run selectRpts in blastz-run-ucsc on chr7.fa.out_mus-musculus
> >> to generate the chr*.rpts file, which I then use with strip_rpts to
> >> generate the stripped chromosome. I then run lastz on the stripped
> >> chromosomes.
> >> However, when I look at the UCSC genome-wide alignment between
> >> rheMac2 and rn4, it seems that the repeats that should have been
> >> removed are included in the alignment.
> >> As an example, one of the repeats removed by strip_rpts is the
> >> LINE/L2 repeat at chr7:87564770..87565324 in rheMac2. But the first
> >> aligned chain in rheMac2.rn4.all.chain.gz from UCSC starts with
> >>
> >>> chain 547645084 chr7 169801366 + 87564296 169142947 chr6 147636619 + 64051113 13 8454126 1
> >> 17      2       0
> >> 57      0       1
> >> 19      0       1
> >> 7       7       0
> >> 139     6       0
> >> 15      1       0
> >> 12      7       0
> >> 4       3       0
> >> 11      0       1
> >> 57      1       0
> >> 52      0       1
> >> 23      6       0
> >> 61      0       1    <=== this block overlaps the LINE/L2 repeat
> >> 51      2       0
> >> 71      1       0
> >>
> >>
> >> Now I understand that the lastz alignments shouldn't seed in
> >> a repeat, but are allowed to extend into a repeat. But since we
> >> removed the repeat sequences from the Fasta file altogether,
> >> how can this alignment extend into a repeat?
> >>
> >> Best wishes, and many thanks in advance,
> >> Michiel de Hoon
> >> RIKEN Omics Science Center
> 
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome