Hi, I am working on the sequence data for rn4. There seem to be three ways to access the sequence data:
1. chr*.fa.gz: ftp://hgdownload.cse.ucsc.edu/goldenPath/rn4/chromosomes/chr*.fa.gz
2. chromFa.tar.gz: ftp://hgdownload.cse.ucsc.edu/goldenPath/rn4/bigZips/chromFa.tar.gz
3. rn4.2bit: ftp://hgdownload.cse.ucsc.edu/goldenPath/rn4/bigZips/rn4.2bit, followed by the twoBitToFa command to convert the data into FASTA format.

Options 2 (chromFa.tar.gz) and 3 (rn4.2bit) give identical sequences for chr1. However, there are differences between options 1 (chr1.fa.gz) and 2 (chromFa.tar.gz). The output of the diff command comparing the two files on my Linux computer is at the end of this email. My understanding is that both files should be identical, since "Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case."

My questions are: what caused the difference between the two files? Was it possibly caused by a different version of RepeatMasker or Tandem Repeats Finder, or by different parameter settings in those two programs? And which file should I use for extracting the sequence?

Thanks,
Yongchao

-------------------------------------------------------------------------------------------------------
chr1.fa is the unzipped file chr1.fa.gz (option 1) and chromFa/1/chr1.fa is extracted from the file chromFa.tar.gz (option 2).
$ diff chr1.fa chromFa/1/chr1.fa |less
342,343c342,343
< ACTGCCTAAAGCAATACTAATTAGTAAGTTTTGGTGGCAAATGAGCTCTC
< AGAAGCCTAAACATAttgagaacaggcaatctccattaatgggaggttgc
---
> ACTGCCTAAAGCAATACTAATTAGTAAGTTTTGGTGGCAAATGAGCTCTc
> agaagcctaaacatattgagaacaggcaatctccattaatgggaggttgc
385,386c385,386
< AGCATATCCAAGATATTGTACTGTTTAATTTTTATCACCTTGATAAAATT
< AGAACCATTTGAGAGAAGGAAATGAGAACATGAGTTTAAGGGCCTTCTTT
---
> AGCATATCCAAGATATtgtactgtttaatttttatcaccttgataaaatt
> agaaccatttgagagaaggaaaTGAGAACATGAGTTTAAGGGCCTTCTTT
653,654c653,654
< acagtcaatgtctggcactgtggtatcccaaatatctgctagatatcttA
< AGTTtcatagcactgagtgcctccacaataaaacaggagatagcatgcat
---
> acagtcaatgtctggcactgtggtatcccaaatatctgctagatatctta
> agtttcatagcactgagtgcctcCACAATAaaacaggagatagcatgcat
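
In the diff lines shown, the differences appear to be in letter case only (that is, in the soft-masking), not in the bases themselves. One way to check this over the whole chromosome might be to compare the two sequences position by position and count case-only differences separately from real base differences. This is a minimal Python sketch, assuming both inputs are uncompressed single-record FASTA files; the function names read_fasta_sequence and compare_masking are made up for illustration and are not part of any UCSC tool.

```python
import sys


def read_fasta_sequence(path):
    """Return the sequence of a single-record FASTA file as one string."""
    parts = []
    with open(path) as handle:
        for line in handle:
            if not line.startswith(">"):  # skip the FASTA header line
                parts.append(line.strip())
    return "".join(parts)


def compare_masking(seq_a, seq_b):
    """Return (base_diffs, case_diffs) for two equal-length sequences.

    base_diffs counts positions where the bases differ even ignoring case;
    case_diffs counts positions where only upper/lower case differs.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences have different lengths")
    base_diffs = 0
    case_diffs = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        if a.upper() == b.upper():
            case_diffs += 1  # same base, different masking
        else:
            base_diffs += 1  # genuinely different base
    return base_diffs, case_diffs


# Only runs when two file paths are given on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    first = read_fasta_sequence(sys.argv[1])
    second = read_fasta_sequence(sys.argv[2])
    base_diffs, case_diffs = compare_masking(first, second)
    print(f"base differences: {base_diffs}, case-only differences: {case_diffs}")
```

Saved under a hypothetical name such as compare_masking.py, it could be run as "python compare_masking.py chr1.fa chromFa/1/chr1.fa". A base-difference count of zero would indicate that the two files contain the same sequence and differ only in which regions are shown in lower case.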
