Re: [Denovoassembler-users] RE : Duplicate sequence in contigs

Sébastien Boisvert Tue, 29 May 2012 13:22:56 -0700

Sure for an rc8!

I need to run further tests for 2.0 final.



Mitchell Stanton-Cook wrote:

Hi Seb,

I see that this issue has now been closed.

Would it be possible to push an update (say rc8 or even 2.0 FINAL ;-)) to SourceForge?



Regards

Mitch

On Sat, May 26, 2012 at 2:20 AM, Sébastien Boisvert <[email protected]> wrote:


    I am not sure what this does.

    On my end, I have a tool that detects duplicated sequences in an
    assembly.


    Mitchell Stanton-Cook wrote:

    Hi Seb,

    I've been using cd-hit-est to investigate.

    Attached is a simple python script and shell script that you can
    craft to suit your environment. It may be useful for debugging.

    Also attached is a sample output (which I've truncated below)


    >Cluster 0
    0   271623nt, >19_contig-1000000... at +/100.00%
    1   63731nt, >19_contig-3... at +/100.00%
    2   346149nt, >21_contig-8000003... *
    3   243006nt, >23_contig-1000002... at +/100.00%
    4   330249nt, >25_contig-9000002... at +/100.00%
    5   63731nt, >27_contig-1000002... at +/100.00%
    6   271674nt, >27_contig-7000002... at -/99.99%
    7   63749nt, >29_contig-4000000... at +/100.00%
    8   242978nt, >29_contig-11000002... at +/100.00%

    We'd hope not to see multiple entries for the same kmer within a
    cluster.
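
The check described above can be sketched in Python. This is not the attached script (which is not reproduced in this thread), only a minimal illustration. It assumes the cd-hit-est cluster listing looks like the sample shown, and that sequence names follow the "<kmer>_contig-<n>" pattern seen in that sample. The function names and the SAMPLE constant are hypothetical.

```python
import re
from collections import defaultdict

# Matches a member line such as
# "0   271623nt, >19_contig-1000000... at +/100.00%".
MEMBER_RE = re.compile(r"^\s*\d+\s+(\d+)nt,\s+>(\S+?)\.\.\.")


def parse_clusters(lines):
    """Return {cluster_name: [(length, sequence_id), ...]} from cluster text lines."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
        elif current is not None:
            match = MEMBER_RE.match(line)
            if match:
                clusters[current].append((int(match.group(1)), match.group(2)))
    return dict(clusters)


def repeated_kmers(clusters):
    """Return {cluster: {kmer: [ids]}} for kmers with more than one member."""
    report = {}
    for name, members in clusters.items():
        by_kmer = defaultdict(list)
        for _length, seq_id in members:
            # Assumption: the prefix before the first "_" is the kmer size.
            kmer = seq_id.split("_", 1)[0]
            by_kmer[kmer].append(seq_id)
        repeated = {k: ids for k, ids in by_kmer.items() if len(ids) > 1}
        if repeated:
            report[name] = repeated
    return report


# The truncated sample from this message, used only as test input.
SAMPLE = """\
>Cluster 0
0   271623nt, >19_contig-1000000... at +/100.00%
1   63731nt, >19_contig-3... at +/100.00%
2   346149nt, >21_contig-8000003... *
3   243006nt, >23_contig-1000002... at +/100.00%
4   330249nt, >25_contig-9000002... at +/100.00%
5   63731nt, >27_contig-1000002... at +/100.00%
6   271674nt, >27_contig-7000002... at -/99.99%
7   63749nt, >29_contig-4000000... at +/100.00%
8   242978nt, >29_contig-11000002... at +/100.00%
""".splitlines()

if __name__ == "__main__":
    for cluster, kmers in repeated_kmers(parse_clusters(SAMPLE)).items():
        for kmer, ids in sorted(kmers.items()):
            print(cluster, "kmer", kmer, "->", ", ".join(ids))
```

Grouping by the name prefix only works if every assembly was given a distinct prefix before the contigs were pooled. With a different naming scheme the `split` line would need to change.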

    Hope this is useful,

    Regards

    Mitch



    On Fri, May 25, 2012 at 11:35 AM, Mitchell Stanton-Cook
    <[email protected]> wrote:

        Hi Seb,

        Thank you. We're still doing some more work this end. We
        really really really like Ray! We are more than happy to work
        through these problems.

        We hope to be pushing <information redacted> through Ray in a
        few months. We just want to be sure that we are getting
        useful results.


        If there is anything else we can do to help/bugfix please do
        not hesitate to let us know.


        Regards

        Mitch




        On Fri, May 25, 2012 at 11:23 AM, Sébastien Boisvert
        <[email protected]> wrote:

            Just to let you know that I may have found the problem.

            See https://github.com/sebhtml/ray/issues/55



                                                                Sébastien
            ________________________________________
            From: Mitchell Stanton-Cook [[email protected]]
            Sent: 17 May 2012 22:32
            To: Sébastien Boisvert
            Subject: Re: Duplicate sequence in contigs

            Hi Seb,

            Thanks for the reply.


            A bit more background:
            -------------------------------
            This is 100 bp Illumina PE data (insert size of ~300 bp,
            s.d. of ~10%).

            ~ 1000X coverage. Ray (1.7) did not like this high
            coverage (you have previously commented on the user-list
            about this). We sampled this down to ~100X coverage.
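
The thread does not say how the downsampling was done. As a rough sketch of one common approach, each read pair can be kept with probability target/current coverage (here 100/1000). The function name and the fixed seed are assumptions made for illustration.

```python
import random


def subsample_pairs(pairs, current_cov, target_cov, seed=0):
    """Keep each read pair with probability target_cov / current_cov.

    Mates are sampled together so that pairing is preserved.
    """
    if not 0 < target_cov <= current_cov:
        raise ValueError("target coverage must be in (0, current coverage]")
    fraction = target_cov / current_cov
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return [pair for pair in pairs if rng.random() < fraction]


if __name__ == "__main__":
    # Toy stand-ins for (read1, read2) records.
    toy = [(f"r{i}/1", f"r{i}/2") for i in range(10000)]
    kept = subsample_pairs(toy, current_cov=1000, target_cov=100)
    print(len(kept), "of", len(toy), "pairs kept")
```

Because each pair is sampled independently, the resulting coverage is only approximately the target.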

            We also cleaned the reads (you have pointed out this is
            not necessary, but having a consistent cleaned set makes
            downstream analysis, e.g. SNP calling, much easier). All
            input bases have a Q score >= 30. Only reads > 70 bp after
            trimming were kept. After this the mean read size is
            ~97 bp. We only consider read pairs (if one of the
            sequences in a pair fails a cleaning criterion, both
            do) and hence no single-end reads go into Ray.
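
The pair-aware rule above can be sketched as follows. This is not the pipeline actually used. It assumes each read is represented by its list of Phred scores after trimming, that "Q >= 30" applies to every remaining base, and that reads longer than 70 bp are the ones kept. All names are hypothetical.

```python
MIN_QUALITY = 30  # every remaining base must have Q >= 30
MIN_LENGTH = 70   # assumed: reads must be longer than 70 bp after trimming


def read_passes(qualities):
    """Return True if a trimmed read meets both cleaning criteria."""
    return len(qualities) > MIN_LENGTH and min(qualities) >= MIN_QUALITY


def filter_pairs(pairs):
    """Keep a pair only if BOTH mates pass; otherwise drop both."""
    return [(r1, r2) for r1, r2 in pairs if read_passes(r1) and read_passes(r2)]


if __name__ == "__main__":
    good = [35] * 97
    short = [35] * 60          # fails the length criterion
    low_q = [35] * 96 + [12]   # fails the quality criterion
    pairs = [(good, good), (good, short), (low_q, good)]
    print(len(filter_pairs(pairs)), "pair(s) kept")
```

Dropping both mates when either fails is what guarantees that no single-end reads reach the assembler.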

            Ray was executed like this (we used Ray's internal
            estimations/calculations to determine the best parameters):

            mpiexec -n $PROC Ray -i XXXX.fastq -k $K -o $OUT

            We looked at kmers from 15-35 in increments of 2.
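
For reference, the sweep over kmers 15 to 35 in steps of 2 can be written out as below. Only the flags that appear in the command above (-n, -i, -k, -o) are used. The output directory naming (`out_prefix`) and the process count are assumptions, and the commands are built but not executed.

```python
def ray_commands(fastq, procs, kmers=range(15, 36, 2), out_prefix="ray_k"):
    """Build one mpiexec/Ray argument list per kmer size."""
    return [
        ["mpiexec", "-n", str(procs), "Ray",
         "-i", fastq, "-k", str(k), "-o", f"{out_prefix}{k}"]
        for k in kmers
    ]


if __name__ == "__main__":
    for cmd in ray_commands("XXXX.fastq", procs=8):
        # Print instead of running; pass cmd to subprocess.run to execute.
        print(" ".join(cmd))
```

Writing each kmer to its own output directory keeps the assemblies separate for the later comparison.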

            The genome is ~1.8 Mb. From the assemblies it appears
            there are not a lot of repetitive elements in the genome. I
            have attached a csv of the results (XXXX_ALL.csv) (
            abbreviations: c= contigs, s= scaffold, N = scaffolding
            character).

            Results for kmer 21 and kmer 23 are interesting.

            kmer 21) As previously mentioned we have a 63 Kb
            duplication. The first 63 Kb of the two contigs are
            almost identical (3 mismatches and 1 gap):



             Score = 1.165e+05 bits (63086),  Expect = 0.0

             Identities = 63093/63096 (99%), Gaps = 1/63096 (0%)
             Strand=Plus/Plus

            I have attached the alignment (XXXX_b2s.aln)

            There is also a smaller 5 Kb duplication detected:


            Score = 9583 bits (5189),  Expect = 0.0

             Identities = 5192/5193 (99%), Gaps = 1/5193 (0%)
             Strand=Plus/Minus

            kmer 23) (Notice there is about 63 Kb difference between
            the "c 100 bp total len" in the .csv file of kmer 21 and
            kmer 23). From the blast2seq there is no such 63 Kb
            duplication found.

            Now, once again focusing on the "c 100 bp total len"  in
            the .csv file. I propose kmer 21 duplicate, kmer 23
            non-duplicate, kmer 25 non-duplicate, kmer 27 duplicate (I
            verified this is true), kmer 29 non-duplicate. Strangely,
            kmer 31 is missing about 200 Kb in comparison to the
            other assemblies.

            What we are wondering is:

            1) Why are there large, almost identical duplicates
            present in the assemblies?
            2) Why do we see it in some kmers and not others? (I
            could understand if lower kmer = duplicates, higher kmers
            = non-duplicates if these duplicates are propagated by a
            sequencing error)
            3) Is this a bug or an inherent issue with graph-based
            assembly?
            4) If possible, how can we fix this?
            5) How can I help you with this?


            I hope I have provided you with enough information. If
            not please let me know.

            p.s. I have no problem with this going to the list as
            long as the attachments are not included.



            Regards

            Mitch


            On Fri, May 18, 2012 at 12:25 AM, Sébastien Boisvert
            <[email protected]> wrote:
            Hi,

            Is the 63 kb region perfectly duplicated in your assembly?


            On 2012-05-17 01:26, Mitchell Stanton-Cook wrote:

            Hi Seb,

            Hope all is well.

            I was wondering if you have ever seen duplicate sequence
            in contigs.

            We have a ~63 kb duplicate at the start of two different
            contigs. Beyond this the contigs are unique.

            This is Ray 1.7.

            I'm re-running with Ray2.0rc5.

            I came across these posts (ABySS specific):

            
http://groups.google.com/group/abyss-users/browse_thread/thread/264e894b4ec0c96d/30dab75afa686878
            
http://groups.google.com/group/abyss-users/browse_thread/thread/7a03ee033b11afc4
            
http://groups.google.com/group/abyss-users/browse_thread/thread/f0f3a650bd12cf1e


            Any ideas?

            Regards

            Mitch


_______________________________________________
Denovoassembler-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
