Hi,
> Hi Sébastien
>
> Do you have any general advice on using mate pair reads (illumina) with Ray
> assemblies? I think from our experience that there are a number of
> issues specific to MP data beyond those we find in PE data that need
> attention, for example:
>
My answers below are based on my experience so far with Assemblathon 2 datasets.
Data from BGI for the bird dataset:
|------------------------+------------------+------------+----------|
| Library name           | Insert size (bp) | Bases (Gb) | Coverage |
|------------------------+------------------+------------+----------|
| 1. PARprgDAPDCAAPE     |              220 |         48 |       39 |
| 2. PARprgDAPDIAAPE     |              500 |         47 |       38 |
| 3. PARprgDAPDMAAPE     |              800 |         43 |       35 |
|------------------------+------------------+------------+----------|
| 4. PARprgDAPDWAAPE     |             2000 |         15 |       12 |
| 5. PARprgDAPDWBAPE     |             2000 |         31 |       26 |
| 6. PARprgDABDLBAPE     |             5000 |         16 |       13 |
| 7. PARprgDABDLAAPE     |             5000 |         18 |       15 |
| 8. PARprgDAADTAAPE     |            10000 |         17 |       13 |
| 9. PARprgDAPDUAAPEI-12 |            20000 |         16 |       13 |
| 10. PARprgDABDVAAPEI-6 |            40000 |         15 |       12 |
|------------------------+------------------+------------+----------|
> The location of the MP junction can be within one of the two mate ends making
> that read chimeric (frequency is linked to read length/fragment size)
>
Yes, I think this may be a problem. In my tests so far, though, Ray's Optimal
Read Markers seem to do a good job of avoiding the junction when marking, and I
think Ray's heuristics don't favor the chimeric pairs.
> MP reads can have high over-read rates and an associated limited diversity
>
Do you mean that the same faulty Illumina cluster can be read several times
(producing the same data, thus limited diversity), or that some genome regions
are over-represented (thus limited diversity)?
> False mates can make up around 20% of a library - these usually turn out to
> be PE in orientation and almost end to end in genomic origin
>
I modified Ray recently to take this into account: most mate-pair libraries
have 2 peaks.
For instance, Ray automatically detects these peaks in the library
PARprgDAADTAAPE described above:
Peak 0
AverageOuterDistance: 306
StandardDeviation: 132
Peak 1
AverageOuterDistance: 10076
StandardDeviation: 1031
Presently, Ray will match the correct peak during the extension step, but only
the largest-valued peak is utilised for scaffolding.
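As a rough illustration of the two-peak idea (this is not Ray's actual code),
here is a minimal Python sketch that assigns an observed outer distance to one
of the two peaks reported above. The function name match_peak, the data layout,
and the cutoff of 3 standard deviations are assumptions made for the example.

```python
# Illustrative sketch only; this is not how Ray is implemented.
# Peak values are the ones Ray reported for library PARprgDAADTAAPE.
PEAKS = [
    {"name": "Peak 0", "average": 306, "stddev": 132},
    {"name": "Peak 1", "average": 10076, "stddev": 1031},
]


def match_peak(outer_distance, peaks=PEAKS, max_deviations=3.0):
    """Return the name of the closest peak, or None if no peak fits.

    max_deviations is an assumed cutoff chosen for this example.
    """
    best_name = None
    best_score = None
    for peak in peaks:
        # Distance from the peak average, in units of standard deviation.
        score = abs(outer_distance - peak["average"]) / peak["stddev"]
        if score <= max_deviations and (best_score is None or score < best_score):
            best_name = peak["name"]
            best_score = score
    return best_name


for distance in (290, 9500, 4000):
    print(distance, match_peak(distance))
```

A pair whose distance fits neither peak (4000 in this toy run) is left
unassigned rather than forced onto the nearest peak.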
> Some Mates seem to be formed by the synthesis of several loops meaning the
> two ends come from quite different genomic locations (this should hopefully
> be low)
>
Like super-chimeric pairs? I am not sure I understand.
> The first and third problems can I think be mostly addressed through
> pre-filtering reads against a contig assembly.
Basically, Ray detects the pairs by mapping the reads onto the seeds, so I
guess it is somewhat equivalent, though I am not sure.
> But how pro-active do you think
> we have to be in addressing these (and I guess other) technical problems
> through read pre-filtering?
Unless the same systematic chimeric read occurs many times, I think Ray will
just avoid using chimeric reads because they won't be similar to the population
from which they supposedly come.
> Working with SOAP seems to show it's relatively sensitive to
> the percentage of poorly constructed mates.
In Ray, as you may know, seeds are extended using paired information. I am
presently working on an assembly engine (the part that chooses where to
go next in the graph). It is called Ray NovaEngine.
> Do you think we can rely on Ray to compensate for these types of read errors
> or is it simply GIGO?
Garbage in, garbage out is always valid, regardless of the tool, in my opinion.
Ray will make some choices regarding the issues you list, but I don't have a
definite answer. I think the thing that will cause problems with Ray is the
presence of adaptors inside the provided reads, as they will look like genuine
genomic information by virtue of their redundancy.
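One simple way to look for such redundant sequences before assembly is to count
how many reads contain each k-mer and flag the ones shared by a suspiciously
large fraction of reads. The Python sketch below is only an illustration under
assumed settings: the function name frequent_kmers, the values of k and
min_fraction, and the toy reads (with a made-up shared sequence) are all
hypothetical and are not part of Ray.

```python
from collections import Counter


def frequent_kmers(reads, k=12, min_fraction=0.5):
    """Return the k-mers found in at least min_fraction of the reads."""
    counts = Counter()
    for read in reads:
        # Use a set so that each k-mer is counted at most once per read.
        kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
        counts.update(kmers)
    threshold = min_fraction * len(reads)
    return sorted(kmer for kmer, n in counts.items() if n >= threshold)


# Toy reads: the first three carry the same made-up sequence "GGATCCTT".
toy_reads = [
    "ACGTACGGATCCTTAGCA",
    "TTGACGGATCCTTCCGTA",
    "CAGTGGATCCTTGGACTA",
    "ATCGGCTAAGCTTGCATG",
]

print(frequent_kmers(toy_reads, k=8, min_fraction=0.75))
```

On real data, genuine repeats will also show up in such a list, so any
candidate would still need to be compared against the known adaptor sequences
for the library kit.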
>
> It would be good to hear about general approaches to coping with MP read
> error characteristics (biochemical/bioinformatic) if anyone is willing to
> share.
>
For the biochemical part, I have read that BGI uses only nitrogen gas, instead
of room air, to perform nebulization.
For the bioinformatics part, I think what I learned so far is listed above.
> Adrian
>
> Adrian Platts
> McGill
>
Sébastien Boisvert
http://github.com/sebhtml/ray