On 20/10/11 05:20 PM, Ola Wallerman wrote:
> Hi,
>
> thanks for the quick reply. The latency was between 245 - 275 ms. I am
> running a test with Ion Torrent data now (from the new 318 chip) and
> latencies are now ~ 133 ms. What are expected / good numbers?
>
I guess you mean microseconds, not milliseconds.
The latency depends on your interconnect technology.

> For the first test run I just used a subset of reads that was at hand
> in the right format. Also, I wasn't sure how many nodes would be
> needed to handle all reads. I actually got an error at first - I used
> Illumina _sequence.txt files and had to change filenames to *.fastq. I
> suppose the quality values will be off, but I am not sure if they are
> used in Ray?
>

Ray does not utilise the qualities.

> One reason to do de novo was to find contaminations, and apparently we
> have a bacterial contaminant in some of the libraries (ChIP-seq
> libraries with low input).
>

Is this because your data was bar-coded and multiplexed with some other
experiments as well?

> There were two things that apparently did not work: insert size was
> way off for one of the libraries (103 bp, sd 8, it should be ~ 280
> with sd 40).

This may indicate some problem with your reads.

> This was the major library (93 M pairs), the other insert
> sizes are ok. Next time I will set it manually. The results for
> contigs and scaffolds were exactly the same, we don't have any
> mate-pairs but I would think the PE reads would help some with
> scaffolding?
>

Mostly to go through small repeats, although these are utilised for
scaffolding too.

> The Ion Torrent assembly already finished, if you are interested these
> are the stats from 5.5 M reads on 4 nodes:
>
> Network testing: 5 seconds
> File partitioning: 23 seconds
> Sequence loading: 2 minutes, 8 seconds
> K-mer counting: 4 minutes, 50 seconds
> Coverage distribution analysis: 1 seconds
> Graph construction: 13 minutes, 17 seconds
> Edge purge: 1 minutes, 31 seconds
> Selection of optimal read markers: 5 minutes, 39 seconds
> Detection of assembly seeds: 1 minutes, 56 seconds
> Estimation of outer distances for paired reads: 11 seconds
> Bidirectional extension of seeds: 4 minutes, 33 seconds
> Merging of redundant contigs: 4 minutes, 39 seconds
> Generation of contigs: 0 seconds
> Scaffolding of contigs: 1 minutes, 28 seconds
> Total: 40 minutes, 42 seconds
>
> Contigs >= 100 nt
>  Number: 1348
>  Total length: 4451596
>  Average: 3302
>  N50: 8156
>  Median: 1194
>  Largest: 40215
> Contigs >= 500 nt
>  Number: 852
>  Total length: 4337720
>  Average: 5091
>  N50: 8305
>  Median: 3586
>  Largest: 40215
>
> Cheers,
>

Nice. For 454, Ion Torrent and PacBio, I need to add something to handle
the insertions and deletions.

> Ola
>
> Quoting Sébastien Boisvert <[email protected]>:
>
>> On 20/10/11 12:51 PM, Ola Wallerman wrote:
>>
>>> Hi Sebastien,
>>>
>> Hi Ola,
>>
>>> I am trying out Ray for assembly of a human genome. I must say I am
>>> quite surprised by the results from my first try since it worked
>>> straight away without any problems, with only one program to run,
>>> which is not what one is used to in the NGS field... I installed v
>>> 1.7, ran it with 300 M HiSeq PE reads on 20 nodes and it finished
>>> without any errors after ~ 12h, with ~1 Gbp assembled.
>>>
>> One thing our team is aiming for with Ray is ease of use for the
>> user. (It just works TM)
>> The complexity (like the various stages of the algorithm) is
>> encapsulated in Ray.
>>
>> Just out of curiosity, what is the inter-node latency of your
>> compute resource?
>>
>> Ray tests the network before doing its deed so the latency is in the file
>> NetworkTest.txt. I am just curious though.
>>
>>> I wonder if you could give me any advice on how to run it in the best
>>> way, e.g. should one use as many nodes as possible (we have 384 nodes
>>> with at least 24 GB) and should reads be quality filtered beforehand?
>>>
>> For the assemblathon, we did not filter reads at all.
>> With Ray, filtering reads only reduces memory usage, I believe.
>>
>> If you know you have DNA contamination in your reads (non-human for
>> instance), then you should filter reads.
>>
>> Adaptors utilised for the construction of so-called mate-pairs
>> through the circularisation of long DNA molecules may be present if you
>> have mate pairs. So far, it seems that the optimal read markers in
>> Ray deal with that.
>>
>>> The dataset I have is around 200 M HiSeq paired reads (100 bp, inserts
>>> ~150 to 300 bp) and ~3 billion short single end reads (~36 bp). I
>>> tried now with k=27, but perhaps a higher k is better for the long
>>> reads? The reason for doing the assembly is to use the contigs to get
>>> a better precision in calling indels and rearrangements.
>>>
>> Did you provide all the reads?
>>
>> You have to be careful with a k that is too large because Ray does not
>> attempt at all to correct the reads. The erroneous k-mers all go in
>> an abyss (not the assembler!) and are not really utilised at all.
>>
>> In my experience, k=21 works well for bacteria; for other, larger
>> genomes I usually utilise k=25 or k=31, although I don't have that much
>> experience on large genomes aside from the assemblathon.
>>
>> You can check some files generated by Ray for your first assembly.
>>
>> In your assembly directory, the file CoverageDistributionAnalysis.txt
>> contains the peak coverage.
>>
>> For your paired reads, the file LibraryStatistics.txt contains what
>> Ray detected in your reads.
>>
>> This step is very important as paired reads are the workhorse to go
>> from reads to k-mer graph to seeds to extensions.
>>
>> The importance of pairs is also highlighted by the recent application note
>> published by Illumina using the MiSeq and Ray.
>>
>> http://www.illumina.com/documents/%5Cproducts%5Cappnotes%5Cappnote_miseq_denovo.pdf
>>
>> There is also a file called SeedLengthDistribution.txt.
>> This file contains the distribution of seed lengths. In Ray, a seed
>> is a region of the genome that is unique. I like to say that a seed is
>> mostly similar conceptually to unitigs in overlap-layout-consensus
>> assemblers although I suspect there are some differences.
>>
>> Increasing k increases the uniqueness of sub-sequences extracted
>> from reads but also reduces the usable sub-sequence coverage because
>> of the sequencing errors. As I said above, you can assess your sub-sequence
>> coverage (also known as k-mer coverage) by reading the content of the file
>> CoverageDistributionAnalysis.txt.
>>
>>> Best regards,
>>>
>> Let me know if you have any other questions.
>>
>>> Ola
>>>
>> Sébastien
>>
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> Ola Wallerman, PhD
>>> IGP, Uppsala Universitet
>>>
>>> [email protected]
>>> olawallerman@skype
>>> 0736400172
>>>
_______________________________________________
Denovoassembler-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
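
On the point made in the thread that a larger k gives more unique k-mers but less usable k-mer coverage: the trade-off can be illustrated with a rough Python sketch. This is not taken from Ray. It assumes fixed-length reads and independent per-base errors, and the numbers used (30x base coverage, 100 bp reads, 1% error rate) are illustrative assumptions, not values from this thread. The function names are made up for the example.

```python
def kmer_coverage(base_coverage: float, read_length: int, k: int) -> float:
    """Expected k-mer coverage for error-free reads of a fixed length.

    Each read of length L contains L - k + 1 k-mers, so the k-mer
    coverage is the base coverage scaled by (L - k + 1) / L.
    """
    if not 0 < k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    return base_coverage * (read_length - k + 1) / read_length


def error_free_fraction(per_base_error: float, k: int) -> float:
    """Fraction of k-mers without any error (assumes independent errors)."""
    return (1.0 - per_base_error) ** k


def usable_kmer_coverage(base_coverage: float, read_length: int,
                         k: int, per_base_error: float) -> float:
    """k-mer coverage contributed by error-free k-mers only."""
    return (kmer_coverage(base_coverage, read_length, k)
            * error_free_fraction(per_base_error, k))


if __name__ == "__main__":
    # Hypothetical inputs: 30x coverage, 100 bp reads, 1% per-base error.
    for k in (21, 25, 27, 31):
        print(k, round(usable_kmer_coverage(30.0, 100, k, 0.01), 2))
```

The k-mer coverage that matters for a real assembly is whatever CoverageDistributionAnalysis.txt reports; the sketch only shows the direction of the effect as k grows.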
