> ________________________________________ > De : Walter Eckalbar [[email protected]] > Date d'envoi : 20 juillet 2011 10:57 > À : Sébastien Boisvert > Objet : Re: RE : [Denovoassembler-users] Ray assembly kmer coverage question > > Hi Sébastien, >
Hello, See my answers below. >>250 M 104 paired-end reads gives a raw coverage of 13, not 30. > > Well there is the other side of the pair, sorry if that was unclear, and I > actually have just over 260M, I was rounding, and again the genome size is > probably 1.8Gb. Ray found a total of 517M reads. So it is about 29-30x > coverage. Which I know is not huge for Next-Gen assemblies, but it should do > reasonably well. > OK I see. >>The first thing you want to do is a Ray quality assurance run with k=21. > >>If everything runs fine, Ray should detect a peak and the average insert size >>for your libraries. > >>Then if that works, you can start playing with the k-mer length. > >>Minimum, peak and repeat coverages are detected in the coverage distribution >>>(PREFIX.CoverageDistribution.txt). >>The minimum k-mer coverage is not set to one less than peak coverage. > > I've already completed two runs with k=31 and k=61 with otherwise default > conditions, and both times its run fine, only the minimum coverage has been > set > to 1 minus the found peak coverage, and the assembly was very small. I > understand that its not the default to make it 1 minus the peak, but that > seems > to be what it wants to do with my data, even with two different kmers. > See explanations below. > In both cases I've given Ray the insert size of both of the libraries, I > could, in my next run, take those specifications out and run with k=21. That is the first thing you should do I think. k=21 will pick up more redundant k-mers than k=31 or k=61. Basically, this is why: If the sequencing error rate is > 4.7% (1/21), then mostly all k-mers will be unique and bad for k=21. If the sequencing error rate is > 3.2% (1/31), then mostly all k-mers will be unique and bad for k=31. If the sequencing error rate is > 1.6% (1/61), then mostly all k-mers will be unique and bad for k=61. I believe Illumina HiSeq TruSeq sequencing error rate varies between 0 and 2 %. You mileage may vary however depending on the quality of DNA and library preparation (nicks in DNA for instance during the library preparation). > > The reason I'm asking a lot of questions however, is that I don't exactly > have a ton of money to pay for CPU hours (and it doesn't help when the system > crashes > as it did last night) What was the nature of the crash exactly ? Was it related to Ray, to the Linux process scheduler or something else such as another user overloading one of your job nodes ? > at this point and would like to demonstrate reasonable success for my PI > before buying more. So if this minimum coverage problem appears > again with k=21, I'm basically spending another $50 on yet another test. It > may sound trivial, but this is but one program I've been testing, and its > starting to add up. > k=21 should be the first assembly run in your roadmap in my opinion. k=31 produces more erroneous k-mers and k=63 even more. By the way, did you compile with MAXKMERLENGTH=64 for your k=61 runs ? Also, what is the scheduler utilised to run your jobs ? How is it configured ? To avoid crashes: In my experience, if your job nodes are shared with other users, your MPI run will often fail because of users that overload nodes, these trolls are always present. Don't use swap files, these are evil for high-performance computing. 50$ is not a lot of money, really. To develop Ray, I use a supercomputer at my university (colosse.clumeq.ca). The building hosting the machine and the machine cost around 13 M$ (in Canadian dollars). colosse has 7680 compute cores (960 nodes * 8 cores/node). The building hosting the supercomputer is cool, see CLUMEQ transforms rundown particle accelerator into high-efficiency cooling enclosure http://www.infoworld.com/d/green-it/the-green-it-stars-2010-454?page=0,3 Pretty unique stuff we have ! I have an allocation of 250 core-years on the said machine. If you do the maths, the part of the colosse (3.2%) I use is valued at 423 k$ ! Also, the estimated cost of Ray using the COMOMO model is 271 k$ US. I am pretty sure it is more than that if you consider the peer-to-peer nature of the code, which is hard to develop I think. See http://www.ohloh.net/p/denovoassembler/estimated_cost >>Can you post PREFIX.CoverageDistribution.txt with k=21 on >>http://pastebin.com/ and link it in >your next email. > > Sure thing: > I asked for k=21, not k=31 or k=61 > Here's the distribtion for K=61 > http://pastebin.com/6kHvarC0 Check the plot: http://imgur.com/zC2Jt You don't have any peak for k=61. > > And k=31 > http://pastebin.com/rHMFr9Fe Check the plot: http://imgur.com/5da5kl You don't have any peak for k=31. If you look closely, you can see a peak trying to separate itself from all these sequencing errors occuring once. You see it too for k=61. For k=31, the derivative (in green) almost cross 0 but don't. Maybe Ray could compute the derivative of the derivative but I don't know if that would work. > There are 2 ways to shift this soon-to-be peak on the right: - Decreases the k-mer length to decrease the number of erroneous k-mers. I am pretty sure this potato will shift on the right with k=21. - Or you can get more data, but you'll need more compute power too ! >>Why do you only have powers of 2 ? >>It is not generated by Ray, Ray measures coverage values from 1 to 65535. > > I did that because I figured it would be easier for most people to interpret > the general shape of the distribution and that it would be easier to just > put it strait into an email. > Yes, but you will surely miss important stuff because what matters could be between two powers of 2, right ? > Walter > I hope this helps, otherwise let me know. Sébastien > 2011/7/20 Sébastien Boisvert > <[email protected]<mailto:[email protected]>> > Hello, > >> Hello everyone, >> >> I'm attempting to assemble a genome using data from 2 lanes of the Illumina >> HiSeq, totaling ~250M 104 bp paired end reads with an insert size of ~450bp. >> We estimate the genome size just under 2Gbp. This would roughly compute to >> 30x coverage assuming it all maps to our genome. >> > > 250 M 104 paired-end reads gives a raw coverage of 13, not 30. > > irb(main):003:0> rawBases=250*1000*1000*104 > => 26000000000 > irb(main):004:0> genomeBases=2*1000*1000*1000 > => 2000000000 > irb(main):005:0> rawCoverage=(rawBases+0.0)/genomeBases > => 13.0 > > > Because of sequencing errors, the usable k-mer coverage will be lower than 13. > > The first thing you want to do is a Ray quality assurance run with k=21. > > If everything runs fine, Ray should detect a peak and the average insert size > for your libraries. > > Then if that works, you can start playing with the k-mer length. > >> I'm attempting to use Ray (v1.6.1-rc3) and am struggling to find a setting >> that both proves to finish in reasonable amount >> of time and constructs a reasonable assembly. I've noticed under the >> default settings the minimum kmer coverage is set >> to one less than the peak coverage (which does not appear to be the same as >> the max coverage > > Minimum, peak and repeat coverages are detected in the coverage distribution > (PREFIX.CoverageDistribution.txt). > The minimum k-mer coverage is not set to one less than peak coverage. > > > Besides, v1.6.1 is on sourceforge. > >> and is often in 500-700 range) and this leads to the exclusion of far too >> many Kmers (or so it appears), and assembles an >> awful genome, with n50 in the 100's. >> > > Can you post PREFIX.CoverageDistribution.txt with k=21 on > http://pastebin.com/ and link it in your next email. > > Example: > > http://pastebin.com/BJRYwzBZ > > The corresponding CoverageDistributionAnalysis.txt file: > > k-mer length: 31 > Lowest coverage observed: 1 > MinimumCoverage: 42 > PeakCoverage: 171 > RepeatCoverage: 300 > Number of k-mers with at least MinimumCoverage: 2453478644 k-mers > Estimated genome length: 1226739322 nucleotides > Percentage of vertices with coverage 1: 83.7745 % > DistributionFile: > parrot-Testbed-A2-k31-20110719-9c8b02dbd.CoverageDistribution.txt > > > So, you can see that Ray finds the peak coverage automatically when there is > one. > >> Below is my Kmer distribution: >> > > Why do you only have powers of 2 ? > It is not generated by Ray, Ray measures coverage values from 1 to 65535. > > >> Kmer coverage bin Frequency (k=61) Frequencey (k=31) >> 2 11422 12689 >> 4 3461 5764 >> 8 2570 5380 >> 16 2191 4239 >> 32 1753 3382 >> 64 1130 2386 >> 128 923 1804 >> 256 727 1308 >> 512 491 954 >> 1024 345 684 >> 2048 269 487 >> 4096 199 375 >> 8192 159 260 >> 16384 111 188 >> 32768 75 137 >> 65536 55 92 >> 131072 40 67 >> 262144 28 46 >> 524288 20 33 >> 1048576 14 24 >> 2097152 10 16 >> 4194304 6 11 >> 8388608 4 5 >> 16777216 3 4 >> 33554432 2 4 >> 67108864 2 3 >> 134217728 2 4 >> 268435456 2 5 >> 536870912 3 7« >> 1073741824 3 1 >> 2147483648 1 0 >> 4294967296 0 0 >> 8589934592 1 1 >> >> Does anyone have suggestions for Kmer values and coverage minimums to set? >> > > Well, a good start for quality assurance is to run Ray with -k 21 (default > value) and > look in PREFIX.CoverageDistribution.txt and > PREFIX.CoverageDistributionAnalysis.txt to see > if a peak was detected. For Illumina data, Ray should do well. > > But I don't know if a raw coverage of 13 is enough. Statistically, you will > get a lot of uncovered regions according > to the Lander-Waterman statistics. > > http://en.wikipedia.org/wiki/DNA_sequencing_theory#Lander-Waterman_theory > > >> Thanks for your help, >> >> Walter >> > > Best. > > Sébastien > http://github.com/sebhtml > > ------------------------------------------------------------------------------ 10 Tips for Better Web Security Learn 10 ways to better secure your business today. Topics covered include: Web security, SSL, hacker attacks & Denial of Service (DoS), private keys, security Microsoft Exchange, secure Instant Messaging, and much more. http://www.accelacomm.com/jaw/sfnl/114/51426210/ _______________________________________________ Denovoassembler-users mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
