See my responses below.
On Mon, 2012-03-05 at 18:53 -0500, LIU wrote:
> Hi ,
>
>
> Thanks for your response.
>
> On Tue, Mar 6, 2012 at 2:21 AM, Sébastien Boisvert
> <[email protected]> wrote:
> 1. Using a k-mer length of 71 will _presumably_ not work very
> well
> because of sequencing errors. First do a test run at k=31.
> Yes i also ran k=31.
> It is the same case as k=71.
> One more question about choice of kmer length.
> I was also told that longer kmer is supposed to produce more accurate
> assembly, while shorter ones are more prone to sequencing errors.
> I am confused. perhaps i should open another ticket to ask this
> question. But i really appreciate your answer.
>
Using longer k-mer makes the k-mers more unique.
Let's say that this is a read:
*
TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG
The '*' marks a sequencing error.
For 71-mers, the sliding window is:
*
TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG
kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
So basically all the k-mers generated from that sliding window contain
the sequencing error.
For 31-mers, the sliding window is:
*
TGTGTGGGTCAGTATGTAGTCCACCTGGAAATCTTCTTTTTCCAGATTTGCCCATCCTTCTTCGTCCTCTTCCCG
kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
So with 31-mers, you will get some erroneous k-mers and some genuine
k-mers.
>
>
>
> 2. Are your interleaved files properly generated ?
>
> sequence1/1
> sequence1/2
> sequence2/1
> sequence2/2
> sequence3/1
> sequence3/2
> Yes, i think my sequences are correctlly interleaved. E.G.,
> >@AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/1
> TACATATATACATGATACATACATACATGATATATTCATATGTCACCTAAGGATGTATCATACATGATACATACATCCATGATACATACATACCG
>
>
> >@AGRF-21_0011_FC64J74AAXX:2:1:1804:936#CCGACT/2
> GATGTATGTATCATGTATGATACATCCTTAGGTGACATATGAATATATCATGTATGTATGTATCATGTATATATGTATAAATATGTAT
>
>
> >@AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/1
> TATATAGATAGATTTCA
>
>
> >@AGRF-21_0011_FC64J74AAXX:2:1:1983:932#AATTAA/2
> CTTTTTTTTTGTTTCAGTCCCCGTGCTTTCAAAATTGCCCGGGTTCAGTCCCTAAGTCGTTAAGTCCGTT
> In fact, i also tried velvet. It produced different contigs and
> scaffolds. But of course Ray and Velvet may not be directly compared
> because of different scaffolding strategy (i do not know this, it's
> simply a guess).
This look ok.
BUt why is the second sequence shorter than the first one ?
Usually, Illumina sequencing produces 2 sequences of the same length for
each pair of sequences.
>
>
> Do you get anything in LibraryStatistics.txt ?
>
>
> The LibraryStatixtics are
> NumberOfPairedLibraries: 3
>
>
> LibraryNumber: 0
> InputFormat: Interleaved,Paired
> DetectionType: Automatic
> File: /home/s4196896/mix_assembly/input/t15c15/gs1.shuffled.fasta.gz
> NumberOfSequences: 248332323
> Distribution: 31/Library0.txt
>
>
> LibraryNumber: 1
> InputFormat: Interleaved,Paired
> DetectionType: Automatic
> File: /home/s4196896/mix_assembly/input/t15c15/gs3.shuffled.fasta.gz
> NumberOfSequences: 405911176
> Distribution: 31/Library1.txt
>
>
> LibraryNumber: 2
> InputFormat: Interleaved,Paired
> DetectionType: Automatic
> File: /home/s4196896/mix_assembly/input/t15c15/gs2.shuffled.fasta.gz
> NumberOfSequences: 234114234
> Distribution: 31/Library2.txt
>
Is there anything in 31/Library0.txt, 31/Library1.txt, 31/Library2.txt
Can you provide the last 10 lines of SeedLengthDistribution.txt ?
>
> Best Regards,
> Huanle
>
> On Thu, 2012-03-01 at 17:06 -0500, LIU wrote:
> > Hi There,
> >
> > I have been using Ray to de novo assembly.
> >
> > The input reads are a mix of illumina pair-end reads (this
> account for
> > 90%), illumina single-end reads and 454 single end reads.
> >
> > The command i used is
> > mpiexec -n 60 Ray \
> > -i \
> >
> /home/s4196896/mix_assembly/input/t15c15/gs1.shuffled.fasta.gz \
> > -i \
> >
> /home/s4196896/mix_assembly/input/t15c15/gs2.shuffled.fasta.gz \
> > -i \
> >
> /home/s4196896/mix_assembly/input/t15c15/gs3.shuffled.fasta.gz \
> > -s \
> >
> /home/s4196896/mix_assembly/input/t15c15/gs2.single.fasta.gz
> \
> > -s \
> >
> /home/s4196896/mix_assembly/input/t15c15/gs3.single.fasta.gz
> \
> > -s \
> >
> /home/s4196896/mix_assembly/input/t15c15/gs1.single.fasta.gz
> \
> > -s \
> > /home/s4196896/mix_assembly/input/radseq1.seeds.fasta \
> > -s \
> > /home/s4196896/mix_assembly/input/radseq_v2.fasta \
> > -s \
> >
> /work1/s4196896/454_assembly/raw_reads/all_genomic_reads.short.fasta
> > \
> > -s \
> >
> /work1/s4196896/454_assembly/raw_reads/all_genomic_reads.long.fasta \
> > -o \
> > 71 \
> > -k \
> > 71
> >
> > The output shows that scaffolds and contigs are the same
> (same N50,
> > total number of bases and number of sequences etc.).
> >
> > This confused me.
> >
> >
> > I hope someone can help me out.
> >
> > Thanks in advance.
> >
> > Kind Regards,
> > --
> > Huanle
> >
> > School of biological Sciences, UQ, QLD, AU
>
>
>
>
>
>
>
> --
> Huanle
>
> School of biological Sciences, UQ, QLD, AU
>
------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Denovoassembler-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users