Hi,

I am working with the sequence data for rn4. There seem to be three
ways to obtain the sequence data:

1. chr*.fa.gz:
ftp://hgdownload.cse.ucsc.edu/goldenPath/rn4/chromosomes/chr*.fa.gz
2. chromFa.tar.gz:
ftp://hgdownload.cse.ucsc.edu/goldenPath/rn4/bigZips/chromFa.tar.gz
3. rn4.2bit:
ftp://hgdownload.cse.ucsc.edu/goldenPath/rn4/bigZips/rn4.2bit
and then use the twoBitToFa command to convert the data into FASTA
format.

Options 2 (chromFa.tar.gz) and 3 (rn4.2bit) give identical sequences
for chr1. However, options 1 (chr1.fa.gz) and 2 (chromFa.tar.gz)
differ. The output of the diff command comparing the two files on my
Linux computer is at the end of this email.
My understanding is that the two files should be identical, since the
same description applies to both:
"Repeats from RepeatMasker and Tandem Repeats Finder (with period
    of 12 or less) are shown in lower case; non-repeating sequence is
    shown in upper case."
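
To make that convention concrete, here is a minimal Python sketch that
lists the lowercase (soft-masked) runs in a sequence string. The
function name masked_intervals is my own, hypothetical helper, not part
of any UCSC tool, and the example string is the second line of the
first diff hunk at the end of this email.

```python
import re

def masked_intervals(seq):
    """Return half-open (start, end) intervals of lowercase (soft-masked) runs."""
    return [match.span() for match in re.finditer(r"[a-z]+", seq)]

# Line 343 of chr1.fa (option 1), taken from the diff at the end of this email.
line = "AGAAGCCTAAACATAttgagaacaggcaatctccattaatgggaggttgc"
for start, end in masked_intervals(line):
    print(f"masked run at {start}-{end}: {line[start:end]}")
```

Running the same function on the corresponding line from the other file
would show where the masked runs start and end in each version.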

My questions are:

1. What caused the difference between the two files? Could it be due
to different versions of RepeatMasker or Tandem Repeats Finder, or to
different parameter settings in those two programs?
2. Which file should I use for extracting the sequence?

Thanks,

Yongchao


-------------------------------------------------------------------------------------------------------
chr1.fa is the decompressed chr1.fa.gz (option 1), and
chromFa/1/chr1.fa is extracted from chromFa.tar.gz (option 2).

$ diff chr1.fa chromFa/1/chr1.fa |less
342,343c342,343
< ACTGCCTAAAGCAATACTAATTAGTAAGTTTTGGTGGCAAATGAGCTCTC
< AGAAGCCTAAACATAttgagaacaggcaatctccattaatgggaggttgc
---
> ACTGCCTAAAGCAATACTAATTAGTAAGTTTTGGTGGCAAATGAGCTCTc
> agaagcctaaacatattgagaacaggcaatctccattaatgggaggttgc
385,386c385,386
< AGCATATCCAAGATATTGTACTGTTTAATTTTTATCACCTTGATAAAATT
< AGAACCATTTGAGAGAAGGAAATGAGAACATGAGTTTAAGGGCCTTCTTT
---
> AGCATATCCAAGATATtgtactgtttaatttttatcaccttgataaaatt
> agaaccatttgagagaaggaaaTGAGAACATGAGTTTAAGGGCCTTCTTT
653,654c653,654
< acagtcaatgtctggcactgtggtatcccaaatatctgctagatatcttA
< AGTTtcatagcactgagtgcctccacaataaaacaggagatagcatgcat
---
> acagtcaatgtctggcactgtggtatcccaaatatctgctagatatctta
> agtttcatagcactgagtgcctcCACAATAaaacaggagatagcatgcat
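
In the hunks above, the differing lines appear to differ only in letter
case, that is, in the masking rather than in the bases. A minimal Python
sketch for checking this over a whole chromosome, assuming each file
holds a single FASTA record and fits in memory; the function names
read_fasta_sequence and compare_masking are hypothetical helpers of my
own, and the paths in the commented usage are the ones from my diff
command.

```python
def read_fasta_sequence(path):
    """Read a single-record FASTA file and return its sequence as one string."""
    with open(path) as handle:
        return "".join(line.strip() for line in handle if not line.startswith(">"))

def compare_masking(seq_a, seq_b):
    """Compare two sequences, separating base differences from case differences.

    Returns (same_bases, n_case_diffs): same_bases is True when the sequences
    are identical after upper-casing; n_case_diffs counts positions where the
    base is the same but the case (masking) differs.
    """
    same_bases = seq_a.upper() == seq_b.upper()
    n_case_diffs = sum(
        1 for a, b in zip(seq_a, seq_b) if a != b and a.upper() == b.upper()
    )
    return same_bases, n_case_diffs

# Usage with the two files compared above (left commented out because it
# needs the downloaded data):
# seq1 = read_fasta_sequence("chr1.fa")
# seq2 = read_fasta_sequence("chromFa/1/chr1.fa")
# print(compare_masking(seq1, seq2))
```

If same_bases comes back True, the two downloads carry the same
underlying sequence and differ only in which positions are shown in
lower case.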
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
