[Denovoassembler-users] RE : RE : Ray assembly kmer coverage question

Sébastien Boisvert Wed, 20 Jul 2011 08:54:49 -0700

> ________________________________________
> De : Walter Eckalbar [[email protected]]
> Date d'envoi : 20 juillet 2011 10:57
> À : Sébastien Boisvert
> Objet : Re: RE : [Denovoassembler-users] Ray assembly kmer coverage question
> 
> Hi Sébastien,
>


Hello,

See my answers below.

>>250 M 104 paired-end reads gives a raw coverage of 13, not 30.
> 
> Well there is the other side of the pair, sorry if that was unclear, and I 
> actually have just over 260M, I was rounding, and again the genome size is 
> probably 1.8Gb.  Ray found a total of 517M reads.  So it is about 29-30x 
> coverage.  Which I know is not huge for Next-Gen assemblies, but it should do 
> reasonably well.
>

OK I see.
 
>>The first thing you want to do is a Ray quality assurance run with k=21.
> 
>>If everything runs fine, Ray should detect a peak and the average insert size 
>>for your libraries.
> 
>>Then if that works, you can start playing with the k-mer length.
> 
>>Minimum, peak and repeat coverages are detected in the coverage distribution 
>>>(PREFIX.CoverageDistribution.txt).
>>The minimum k-mer coverage is not set to one less than peak coverage.
> 
> I've already completed two runs with k=31 and k=61 with otherwise default 
> conditions, and both times its run fine, only the minimum coverage has been 
> set 
> to 1 minus the found peak coverage, and the assembly was very small.  I 
> understand that its not the default to make it 1 minus the peak, but that 
> seems 
> to be what it wants to do with my data, even with two different kmers.
> 

See explanations below.

> In both cases I've given Ray the insert size of both of the libraries, I 
> could, in my next run, take those specifications out and run with k=21.

That is the first thing you should do I think. k=21 will pick up more redundant 
k-mers than k=31 or k=61.

Basically, this is why:


If the sequencing error rate is > 4.7% (1/21), then mostly all k-mers will be 
unique and bad for k=21.
If the sequencing error rate is > 3.2% (1/31), then mostly all k-mers will be 
unique and bad for k=31.
If the sequencing error rate is > 1.6% (1/61), then mostly all k-mers will be 
unique and bad for k=61.

I believe Illumina HiSeq TruSeq sequencing error rate varies between 0 and 2 %. 
You mileage may vary however depending on the quality of DNA and library 
preparation (nicks 
in DNA for instance during the library preparation).



> 
> The reason I'm asking a lot of questions however, is that I don't exactly 
> have a ton of money to pay for CPU hours (and it doesn't help when the system 
> crashes 
> as it did last night) 

What was the nature of the crash exactly ? Was it related to Ray, to the Linux 
process scheduler or something else such as another user overloading one of 
your job nodes ?

> at this point and would like to demonstrate reasonable success for my PI 
> before buying more.  So if this minimum coverage problem appears 
> again with k=21, I'm basically spending another $50 on yet another test.  It 
> may sound trivial, but this is but one program I've been testing, and its 
> starting to add up.
> 



k=21 should be the first assembly run in your roadmap in my opinion. k=31 
produces more erroneous k-mers and k=63 even more.
By the way, did you compile with MAXKMERLENGTH=64 for your k=61 runs ?

Also, what is the scheduler utilised to run your jobs ? How is it configured ?


To avoid crashes:


In my experience, if your job nodes are shared with other users, your MPI run 
will often fail because of users that overload nodes, these trolls are always 
present.


Don't use swap files, these are evil for high-performance computing.



50$ is not a lot of money, really.


To develop Ray, I use a supercomputer at my university (colosse.clumeq.ca).

The building hosting the machine and the machine cost around 13 M$ (in Canadian 
dollars). colosse has 7680 compute cores (960 nodes * 8 cores/node).

The building hosting the supercomputer is cool, see 

CLUMEQ transforms rundown particle accelerator into high-efficiency cooling 
enclosure
http://www.infoworld.com/d/green-it/the-green-it-stars-2010-454?page=0,3

Pretty unique stuff we have !


I have an allocation of 250 core-years on the said machine.

If you do the maths, the part of the colosse (3.2%) I use is valued at 423 k$ !

Also, the estimated cost of Ray using the COMOMO model is 271 k$ US.
I am pretty sure it is more than that if you consider the peer-to-peer nature 
of the code, which is hard to develop I think.

See http://www.ohloh.net/p/denovoassembler/estimated_cost


>>Can you post PREFIX.CoverageDistribution.txt with k=21 on 
>>http://pastebin.com/ and link it in >your next email.
> 
> Sure thing:
> 


I asked for k=21, not k=31 or k=61


> Here's the distribtion for K=61
> http://pastebin.com/6kHvarC0

Check the plot: http://imgur.com/zC2Jt

You don't have any peak for k=61.

> 
> And k=31
> http://pastebin.com/rHMFr9Fe


Check the plot: http://imgur.com/5da5kl

You don't have any peak for k=31.


If you look closely, you can see a peak trying to separate itself from all 
these sequencing errors 
occuring once. You see it too for k=61.


For k=31, the derivative (in green) almost cross 0 but don't.

Maybe Ray could compute the derivative of the derivative but I don't know if 
that would work.


> 

There are 2 ways to shift this soon-to-be peak on the right:

- Decreases the k-mer length to decrease the number of erroneous k-mers. I am 
pretty sure this potato will shift on the right with k=21.

- Or you can get more data, but you'll need more compute power too !


>>Why do you only have powers of 2 ?
>>It is not generated by Ray, Ray measures coverage values from 1 to 65535.
> 
> I did that because I figured it would be easier for most people to interpret 
> the general shape of the distribution and that it would be easier to just 
> put it strait into an email.
> 

Yes, but you will surely miss important stuff because what matters could be 
between two powers of 2, right ?



> Walter
> 

I hope this helps, otherwise let me know.

Sébastien

> 2011/7/20 Sébastien Boisvert 
> <[email protected]<mailto:[email protected]>>
> Hello,
> 
>> Hello everyone,
>>
>> I'm attempting to assemble a genome using data from 2 lanes of the Illumina 
>> HiSeq, totaling ~250M 104 bp paired end reads with an insert size of ~450bp. 
>>  We estimate the genome size just under 2Gbp.  This would roughly compute to 
>> 30x coverage assuming it all maps to our genome.
>>
> 
> 250 M 104 paired-end reads gives a raw coverage of 13, not 30.
> 
> irb(main):003:0> rawBases=250*1000*1000*104
> => 26000000000
> irb(main):004:0> genomeBases=2*1000*1000*1000
> => 2000000000
> irb(main):005:0> rawCoverage=(rawBases+0.0)/genomeBases
> => 13.0
> 
> 
> Because of sequencing errors, the usable k-mer coverage will be lower than 13.
> 
> The first thing you want to do is a Ray quality assurance run with k=21.
> 
> If everything runs fine, Ray should detect a peak and the average insert size 
> for your libraries.
> 
> Then if that works, you can start playing with the k-mer length.
> 
>> I'm attempting to use Ray (v1.6.1-rc3) and am struggling to find a setting 
>> that both proves to finish in reasonable amount
>> of time and constructs a reasonable assembly.  I've noticed under the 
>> default settings the minimum kmer coverage is set
>> to one less than the peak coverage (which does not appear to be the same as 
>> the max coverage
> 
> Minimum, peak and repeat coverages are detected in the coverage distribution 
> (PREFIX.CoverageDistribution.txt).
> The minimum k-mer coverage is not set to one less than peak coverage.
> 
> 
> Besides, v1.6.1 is on sourceforge.
> 
>> and is often in 500-700 range) and this leads to the exclusion of far too 
>> many Kmers (or so it appears), and assembles an
>> awful genome, with n50 in the 100's.
>>
> 
> Can you post PREFIX.CoverageDistribution.txt with k=21 on 
> http://pastebin.com/ and link it in your next email.
> 
> Example:
> 
> http://pastebin.com/BJRYwzBZ
> 
> The corresponding CoverageDistributionAnalysis.txt file:
> 
> k-mer length:   31
> Lowest coverage observed:       1
> MinimumCoverage:        42
> PeakCoverage:   171
> RepeatCoverage: 300
> Number of k-mers with at least MinimumCoverage: 2453478644 k-mers
> Estimated genome length:        1226739322 nucleotides
> Percentage of vertices with coverage 1: 83.7745 %
> DistributionFile: 
> parrot-Testbed-A2-k31-20110719-9c8b02dbd.CoverageDistribution.txt
> 
> 
> So, you can see that Ray finds the peak coverage automatically when there is 
> one.
> 
>> Below is my Kmer distribution:
>>
> 
> Why do you only have powers of 2 ?
> It is not generated by Ray, Ray measures coverage values from 1 to 65535.
> 
> 
>> Kmer coverage bin     Frequency (k=61)        Frequencey (k=31)
>> 2     11422   12689
>> 4     3461    5764
>> 8     2570    5380
>> 16    2191    4239
>> 32    1753    3382
>> 64    1130    2386
>> 128   923     1804
>> 256   727     1308
>> 512   491     954
>> 1024  345     684
>> 2048  269     487
>> 4096  199     375
>> 8192  159     260
>> 16384         111     188
>> 32768         75      137
>> 65536         55      92
>> 131072        40      67
>> 262144        28      46
>> 524288        20      33
>> 1048576       14      24
>> 2097152       10      16
>> 4194304       6       11
>> 8388608       4       5
>> 16777216      3       4
>> 33554432      2       4
>> 67108864      2       3
>> 134217728     2       4
>> 268435456     2       5
>> 536870912     3       7«
>> 1073741824    3       1
>> 2147483648    1       0
>> 4294967296    0       0
>> 8589934592    1       1
>>
>> Does anyone have suggestions for Kmer values and coverage minimums to set?
>>
> 
> Well, a good start for quality assurance is to run Ray with -k 21 (default 
> value) and
> look in PREFIX.CoverageDistribution.txt and 
> PREFIX.CoverageDistributionAnalysis.txt to see
> if a peak was detected. For Illumina data, Ray should do well.
> 
> But I don't know if a raw coverage of 13 is enough. Statistically, you will 
> get a lot of uncovered regions according
> to the Lander-Waterman statistics.
> 
> http://en.wikipedia.org/wiki/DNA_sequencing_theory#Lander-Waterman_theory
> 
> 
>> Thanks for your help,
>>
>> Walter
>>
> 
> Best.
> 
> Sébastien
> http://github.com/sebhtml
> 
> 
------------------------------------------------------------------------------
10 Tips for Better Web Security
Learn 10 ways to better secure your business today. Topics covered include:
Web security, SSL, hacker attacks & Denial of Service (DoS), private keys,
security Microsoft Exchange, secure Instant Messaging, and much more.
http://www.accelacomm.com/jaw/sfnl/114/51426210/
_______________________________________________
Denovoassembler-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users

[Denovoassembler-users] RE : RE : Ray assembly kmer coverage question

Reply via email to