Hi,

> Hi Sébastien
> 
> Do you have any general advice on using mate pair reads (illumina) with Ray 
> assemblies?  I think from our experience that there are a number of
> issues specific to MP data beyond those we find in PE data that need 
> attention, for example:
>

My answers below are based on my experience so far with Assemblathon 2 datasets.

Data from BGI for the bird dataset:

|------------------------+------------+------------+----------|
|     Library name       |Insert (bp) | Bases (Gb) | Coverage |
|------------------------+------------+------------+----------|
| 1.  PARprgDAPDCAAPE    |   220      |   48       |   39     |
| 2.  PARprgDAPDIAAPE    |   500      |   47       |   38     |
| 3.  PARprgDAPDMAAPE    |   800      |   43       |   35     |
|------------------------+------------+------------+----------|
| 4.  PARprgDAPDWAAPE    |   2000     |   15       |   12     |
| 5.  PARprgDAPDWBAPE    |   2000     |   31       |   26     |
| 6.  PARprgDABDLBAPE    |   5000     |   16       |   13     |
| 7.  PARprgDABDLAAPE    |   5000     |   18       |   15     |
| 8.  PARprgDAADTAAPE    |   10000    |   17       |   13     |
| 9.  PARprgDAPDUAAPEI-12|   20000    |   16       |   13     |
| 10. PARprgDABDVAAPEI-6 |   40000    |   15       |   12     |
|------------------------+------------+------------+----------|
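
As a rough cross-check of the table: coverage is just bases divided by genome
size, so each row implies a genome size. Here is a minimal Python sketch using
only the rounded numbers from the table (the function name is mine, it is not
something from Ray):

```python
# (insert size in bp, bases in Gb, coverage) copied from the table above.
LIBRARIES = [
    (220, 48, 39),
    (500, 47, 38),
    (800, 43, 35),
    (2000, 15, 12),
    (2000, 31, 26),
    (5000, 16, 13),
    (5000, 18, 15),
    (10000, 17, 13),
    (20000, 16, 13),
    (40000, 15, 12),
]


def implied_genome_size_gb(bases_gb, coverage):
    """Genome size (Gb) implied by: coverage = bases / genome size."""
    return bases_gb / coverage


# One implied genome size per library; they differ slightly because the
# table values are rounded.
sizes = [implied_genome_size_gb(bases, cov) for _, bases, cov in LIBRARIES]
total_coverage = sum(cov for _, _, cov in LIBRARIES)

print("implied genome size (Gb): %.2f to %.2f" % (min(sizes), max(sizes)))
print("total coverage over all libraries:", total_coverage)
```

Because the table is rounded, the implied sizes only agree approximately
(roughly 1.2 to 1.3 Gb).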
 
> The location of the MP junction can be within one of the two mate ends making 
> that read chimeric (frequency is linked to read length/fragment size)
>

Yes, this may be a problem. In my tests so far, though, Ray's Optimal Read 
Markers seem to do a good job of avoiding the junction when marking reads, and
I think Ray's heuristics don't favor the chimeric pairs.
 
> MP reads can have high over-read rates and an associated limited diversity
> 

Do you mean that the same faulty Illumina cluster can be read several times 
(producing the same data, thus limited diversity), or that some genome regions 
are over-represented (thus limited diversity)?

> False mates can make up around 20% of a library - these usually turn out to 
> be PE in orientation and almost end to end in genomic origin
> 

I recently modified Ray to account for this: most mate-pair libraries have
two peaks in their distribution of outer distances.

For instance, Ray automatically detects these peaks in the library 
PARprgDAADTAAPE (library 8 in the table above):

 Peak 0
  AverageOuterDistance: 306
  StandardDeviation: 132
 Peak 1
  AverageOuterDistance: 10076
  StandardDeviation: 1031


Currently, Ray matches each pair to the correct peak during the extension step, 
but only the largest-valued peak is used for scaffolding.
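
To illustrate what matching a pair to a peak could look like, here is a minimal
Python sketch. It is not Ray's actual code: the function name and the
3-standard-deviation tolerance are my assumptions for the example. Only the
peak values come from the output above.

```python
# Peaks as (average outer distance, standard deviation), taken from the
# Ray output for library PARprgDAADTAAPE shown above.
PEAKS = [(306, 132), (10076, 1031)]


def match_peak(observed_distance, peaks=PEAKS, max_deviations=3.0):
    """Return the index of the peak compatible with a distance, or None.

    Assumed rule (not necessarily what Ray does): a distance is compatible
    with a peak when it lies within max_deviations standard deviations of
    the peak average. If several peaks qualify, the closest one in units
    of standard deviation wins.
    """
    best_index = None
    best_z = None
    for index, (average, deviation) in enumerate(peaks):
        z = abs(observed_distance - average) / deviation
        if z <= max_deviations and (best_z is None or z < best_z):
            best_index, best_z = index, z
    return best_index


for distance in (350, 9800, 5000):
    print(distance, "->", match_peak(distance))
```

With these values, a pair observed at 350 matches peak 0, a pair at 9800
matches peak 1, and a pair at 5000 matches neither peak.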

> Some Mates seem to be formed by the synthesis of several loops meaning the 
> two ends come from quite different genomic locations (this should hopefully 
> be low)
> 

Like super-chimeric pairs? I am not sure I understand.

> The first and third problems can I think be mostly addressed through 
> pre-filtering reads against a contig assembly.

Basically, Ray detects the pairs by mapping the reads onto the seeds, so I 
guess it is somewhat equivalent, though I am not sure.

>  But how pro-active do you think
> we have to be in addressing these (and I guess other) technical problems 
> through read pre-filtering? 

If you don't get the same systematic chimeric read many times, I think Ray will 
just avoid using chimeric reads because they won't be similar to the population 
from which they supposedly come.

>  Working with SOAP seems to show it's relatively sensitive to
> the percentage of poorly constructed mates.

In Ray, as you may know, seeds are extended using paired information. I am 
presently working on an assembly engine (the part that chooses where to
go next in the graph). It is called Ray NovaEngine.

>  Do you think we can rely on Ray to compensate for these types of read errors 
> or is it simply GIGO?

Garbage in, garbage out is always valid, regardless of the tool, in my opinion.

Ray will make some choices regarding the issues you list, but I don't have a 
definite answer. I think the thing that will cause problems with Ray is the 
presence of adaptors inside the provided reads, as they will look like genuine 
genomic information by virtue of their redundancy.
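
To show why redundancy is the issue, here is a toy Python sketch that counts
k-mers across reads. The reads and the adaptor sequence are made up for
illustration (it is not a real Illumina adaptor), and this is not how Ray is
implemented. The point is only that k-mers from a shared adaptor are seen in
every read that carries it, so they look well supported.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) over all reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


# Hypothetical adaptor sequence, invented for this example.
ADAPTOR = "GATTACAGATTC"

# Toy reads: unrelated genomic sequence followed by the same adaptor.
reads = [
    "ACGTTGCAAC" + ADAPTOR,
    "TTGACCGTAG" + ADAPTOR,
    "CCATGGTTAA" + ADAPTOR,
]

counts = count_kmers(reads, k=8)
adaptor_kmer = ADAPTOR[:8]   # a k-mer lying entirely inside the adaptor
genomic_kmer = reads[0][:8]  # a k-mer from the genomic part of one read

print("adaptor k-mer count:", counts[adaptor_kmer])
print("genomic k-mer count:", counts[genomic_kmer])
```

In this toy case the adaptor k-mer is counted once per read, while each genomic
k-mer is seen only once, so trimming adaptors before assembly is the safer
route.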

> 
> It would be good to hear about general approaches to coping with MP read 
> error characteristics (biochemical/bioinformatic) if anyone is willing to 
> share.
> 

For the biochemical part, I have read that BGI uses only nitrogen gas, 
instead of room air, to perform nebulization.

For the bioinformatics part, I think what I learned so far is listed above.

> Adrian
> 
> Adrian Platts
> McGill
> 
                  Sébastien Boisvert
                  http://github.com/sebhtml/ray