Re: [Denovoassembler-users] RAY questions

Sébastien Boisvert Fri, 21 Oct 2011 08:57:03 -0700

On 20/10/11 05:20 PM, Ola Wallerman wrote:
> Hi,
>
> thanks for the quick reply. The latency was between 245 - 275 ms. I am
> running a test with Ion Torrent data now (from the new 318 chip) and
> latencies are now ~ 133 ms. What is expected / good numbers?
>
>


I guess you mean microseconds, not milliseconds.
The latency depends on your interconnect technology.

> For the first test run I just used a subset of reads that was at hand
> in the right format. Also, I wasn't sure how many nodes would be
> needed to handle all reads. I actually got an error at first - I used
> Illumina _sequence.txt files and had to change filenames to *.fastq. I
> suppose the quality values will be off, but I am not sure if they are
> used in Ray?
>
>    

Ray does not utilise the qualities.
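
For other tools that do read the qualities, renamed _sequence.txt files can matter. A minimal sketch, assuming the files were written by an Illumina pipeline version that encodes qualities as Phred+64 (the thread does not say which version was used) and that the target is the Phred+33 encoding of standard FASTQ. The function name is hypothetical and not part of Ray:

```python
def phred64_to_phred33(quality):
    """Re-encode one FASTQ quality string from Phred+64 to Phred+33.

    Assumption: the input really is Phred+64. A character below '@'
    (ASCII 64) would give a negative score, so it is rejected.
    """
    converted = []
    for ch in quality:
        score = ord(ch) - 64          # decode the Phred+64 character
        if score < 0:
            raise ValueError("character %r is not valid Phred+64" % ch)
        converted.append(chr(score + 33))  # re-encode as Phred+33
    return "".join(converted)


# 'h' is Q40 in Phred+64 and becomes 'I' in Phred+33.
print(phred64_to_phred33("hhhh"))  # IIII
```

Since Ray ignores the qualities, this step is not needed for the assembly itself.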

> One reason to do de novo was to find contaminations, and apparently we
> have a bacterial contaminant in some of the libraries (ChIP-seq
> libraries with low input).
>
>    

Is this because your data was bar-coded and multiplexed with some
other experiments as well ?

> There were two things that apparently did not work: insert size was
> way off for one of the libraries (103 bp, sd 8, it should be ~ 280
> with sd 40).

This may indicate a problem with your reads.

> This was the major library (93 M pairs), the other insert
> sizes are ok. Next time I will set it manually. The results for
> contigs and scaffolds were exactly the same, we don't have any
> mate-pairs but I would think the PE reads would help some with
> scaffolding?
>
>    

Paired-end reads mostly help to go through small repeats, although they
are utilised for scaffolding too.

> The Ion Torrent assembly already finished; if you are interested, these are
> the stats from 5.5 M reads on 4 nodes:
>
>    Network testing: 5 seconds
>    File partitioning: 23 seconds
>    Sequence loading: 2 minutes, 8 seconds
>    K-mer counting: 4 minutes, 50 seconds
>    Coverage distribution analysis: 1 seconds
>    Graph construction: 13 minutes, 17 seconds
>    Edge purge: 1 minutes, 31 seconds
>    Selection of optimal read markers: 5 minutes, 39 seconds
>    Detection of assembly seeds: 1 minutes, 56 seconds
>    Estimation of outer distances for paired reads: 11 seconds
>    Bidirectional extension of seeds: 4 minutes, 33 seconds
>    Merging of redundant contigs: 4 minutes, 39 seconds
>    Generation of contigs: 0 seconds
>    Scaffolding of contigs: 1 minutes, 28 seconds
>    Total: 40 minutes, 42 seconds
>
> Contigs>= 100 nt
>    Number: 1348
>    Total length: 4451596
>    Average: 3302
>    N50: 8156
>    Median: 1194
>    Largest: 40215
> Contigs>= 500 nt
>    Number: 852
>    Total length: 4337720
>    Average: 5091
>    N50: 8305
>    Median: 3586
>    Largest: 40215
>
> Cheers,
>
>    

Nice. For 454, Ion Torrent and PacBio, I need to add something to handle 
the insertions and deletions.
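
For readers of the contig statistics quoted above, here is a minimal sketch of how N50 is conventionally computed. It follows the usual definition (the contig length at which the running total of lengths, sorted from longest to shortest, reaches half of the total). It is not taken from the Ray source, and the toy contig lengths are made up for illustration:

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (usual definition)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


# Toy example: total is 20, and 6 + 5 = 11 reaches half of it.
print(n50([2, 3, 4, 5, 6]))  # 5

# The averages in the quoted report are total length // number of contigs.
print(4451596 // 1348)  # 3302
print(4337720 // 852)   # 5091
```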


> Ola
>
>
>
> Quoting Sébastien Boisvert <[email protected]>:
>
>    
>> On 20/10/11 12:51 PM, Ola Wallerman wrote:
>>      
>>> Hi Sebastien,
>>>
>>>        
>> Hi Ola,
>>
>>      
>>> I am trying out Ray for assembly of a human genome. I must say I am
>>> quite surprised by the results from my first try since it worked
>>> straight away without any problems, with only one program to run,
>>> which is not what one is used to in the NGS field... I installed
>>> v1.7, ran it with 300 M HiSeq PE reads on 20 nodes and it finished
>>> without any errors after ~ 12h, with ~1 Gbp assembled.
>>>
>>>
>>>        
>> One thing our team is aiming for with Ray is ease of use for the
>> user. (It just works TM)
>> The complexity (like the various stages of the algorithm) is
>> encapsulated in Ray.
>>
>> Just out of curiosity, what is the inter-node latency of your
>> compute resource ?
>>
>> Ray tests the network before doing its deed so the latency is in the file
>> NetworkTest.txt. I am just curious though.
>>
>>      
>>> I wonder if you could give me any advice on how to run it in the best
>>> way, e.g. should one use as many nodes as possible (we have 384 nodes
>>> with at least 24 GB) and should reads be quality filtered beforehand?
>>>        
>> For the assemblathon, we did not filter reads at all.
>> With Ray, filtering reads only reduces memory usage, I believe.
>>
>> If you know you have DNA contamination in your reads (non-human for
>> instance),
>> then you should filter reads.
>>
>> Adaptors utilised for the construction of so-called mate-pairs
>> through the circularisation of long DNA molecules may be present if you
>> have mate pairs. So far, it seems that the optimal read markers in
>> Ray deal with that.
>>
>>
>>
>>      
>>> The dataset I have is around 200 M HiSeq paired reads (100 bp, inserts
>>> ~150 to 300 bp) and ~3 billion short single end reads (~36 bp). I
>>> tried now with k=27, but perhaps a higher k is better for the long
>>> reads? The reason for doing the assembly is to use the contigs to get
>>> a better precision in calling indels and rearrangements.
>>>
>>>
>>>        
>> Did you provide all the reads ?
>>
>> You have to be careful with a k that is too large because Ray does not
>> attempt at all to correct the reads. The erroneous k-mers all go in
>> an abyss (not the assembler !) and are not really utilised at all.
>>
>> In my experience, k=21 works well for bacteria. For other, larger genomes,
>> I usually utilise k=25 or k=31, although I don't have that much
>> experience on large genomes aside from the assemblathon.
>>
>>
>> You can check some files generated by Ray for your first assembly.
>>
>> In your assembly directory, the file CoverageDistributionAnalysis.txt
>> contains the peak coverage.
>>
>> For your paired reads, the file LibraryStatistics.txt contains what
>> Ray detected in your reads.
>>
>> This step is very important as paired reads are the workhorse to go
>> from reads to k-mer graph to seeds to extensions.
>>
>> The importance of pairs is also highlighted by the recent application note
>> published by Illumina using the MiSeq and Ray.
>>
>> http://www.illumina.com/documents/%5Cproducts%5Cappnotes%5Cappnote_miseq_denovo.pdf
>>
>> There is also a file called SeedLengthDistribution.txt, which contains
>> the distribution of seed lengths. In Ray, a seed is a region of the
>> genome that is unique. I like to say that a seed is mostly similar
>> conceptually to a unitig in overlap-layout-consensus assemblers,
>> although I suspect there are some differences.
>>
>>
>> Increasing k increases the uniqueness of sub-sequences extracted
>> from reads but also reduces the usable sub-sequence coverage because
>> of the sequencing errors. As I said above, you can assess your sub-sequence
>> coverage (also known as k-mer coverage) by reading the content of the file
>> CoverageDistributionAnalysis.txt
>>
>>      
>>> Best regards,
>>>
>>>
>>>        
>> Let me know if you have any other questions.
>>
>>      
>>> Ola
>>>
>>>
>>>
>>>
>>>        
>> Sébastien
>>
>>      
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> Ola Wallerman, PhD
>>> IGP, Uppsala Universitet
>>>
>>> [email protected]
>>> olawallerman@skype
>>> 0736400172
>>>
>>>
>>>        
>>
>>      
>
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Ola Wallerman, PhD
> IGP, Uppsala Universitet
>
> [email protected]
> olawallerman@skype
> 0736400172
>
>    
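
On the trade-off described in the quoted message (a larger k gives more unique sub-sequences but less usable k-mer coverage), a rough back-of-the-envelope sketch may help. It assumes independent, uniformly distributed sequencing errors, which is a simplification. The 30x base coverage and the 1% error rate are hypothetical values chosen for illustration, and none of these helper functions come from Ray:

```python
def kmers_per_read(read_length, k):
    """Number of k-mers a single read of the given length yields."""
    return max(0, read_length - k + 1)


def kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer coverage for a given base (nucleotide) coverage."""
    return base_coverage * kmers_per_read(read_length, k) / read_length


def error_free_fraction(error_rate, k):
    """Fraction of k-mers with no error, assuming independent errors."""
    return (1.0 - error_rate) ** k


# Read lengths and k values mentioned in the thread: 36 and 100 nt, k=21..31.
for read_length in (36, 100):
    for k in (21, 27, 31):
        print(read_length, k,
              kmers_per_read(read_length, k),
              round(kmer_coverage(30, read_length, k), 2),  # 30x is hypothetical
              round(error_free_fraction(0.01, k), 3))       # 1% is hypothetical
```

With k=27, a 36 nt read yields only 10 k-mers while a 100 nt read yields 74, so short reads lose most of their usable coverage as k grows. The peak actually observed for a given run is the one reported in CoverageDistributionAnalysis.txt.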


_______________________________________________
Denovoassembler-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
