Re: [Bioc-sig-seq] large BAM files and large BED files

Martin Morgan Mon, 19 Sep 2011 12:10:59 -0700

On 09/19/2011 11:55 AM, Michael Lawrence wrote:



On Mon, Sep 19, 2011 at 11:31 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:

    On 09/19/2011 11:26 AM, Rene Paradis wrote:

        Thanks Martin and Michael for your constructive advice.

        I used a ScanBamParam object to successfully load part of chr1
        from a BAM file via scanBam. Honestly, I do not know what the
        differences are between readGappedAlignments,
        readBamGappedAlignments and scanBam. The last two can take a
        ScanBamParam object.


    scanBam returns a list-of-lists, it's the most flexible but least
    'user-friendly'.

    readGappedAlignments is meant to be a 'front end' to read
    GappedAlignments from several different sources, and
    readBamGappedAlignments is meant to be one of those sources; usually
    the 'user' would readGappedAlignments.
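
To make the list-of-lists shape of scanBam concrete, here is a minimal sketch using the ex1.bam example file that ships with Rsamtools. The fields requested in 'what' are chosen only for illustration, and the variable names are arbitrary.

```r
library(Rsamtools)

# Example BAM file distributed with Rsamtools
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")

# Ask only for a few fields (an illustrative choice, not a recommendation)
param <- ScanBamParam(what=c("rname", "strand", "pos", "qwidth"))

# Outer list: one element per requested range (a single element when no
# 'which' is given). Inner list: one vector per requested field.
res <- scanBam(fl, param=param)
str(res[[1]])
```

The higher-level readers build a single alignments object from the same kind of information, which is usually more convenient than indexing into the nested list by hand.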


        But I wish I could select just the seqname in a GRanges to
        retrieve all the chr1 data (as an example) from the BAM file.
        It seems I must select a range. So I put in a value that goes
        beyond the end of chr1, because I do not know its length, and I
        got the error <<INTEGER () can only be applied to a 'integer',
        not a special>>.


Couldn't Rsamtools give something more informative?

The info in the original post isn't enough to understand how the error occurs.


        There must be something I missed that
        could help me do that.


    see ?scanBamHeader, e.g.,

     > fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
     > scanBamHeader(fl)[[1]]$targets
    seq1 seq2
    1575 1584

Would be nice to have a method for getting a Seqinfo out of a BAM
header. Then one can just coerce that to a GRanges. rtracklayer does the
equivalent for BigWig.


In devel,

> seqinfo(open(BamFile(fl)))
Seqinfo of length 2
seqnames seqlengths isCircular
seq1           1575         NA
seq2           1584         NA

I think this needs to be updated to deal with recent changes to Seqinfo to store the reference genome (which is sometimes also present in the BAM file).
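
For the original question (all of one chromosome without knowing its length), the Seqinfo can be turned into whole-sequence ranges and passed as 'which'. This is only a sketch: it assumes a version of Rsamtools and GenomicRanges in which seqinfo() works on a BamFile and a Seqinfo with known seqlengths can be coerced to GRanges, and it uses the ex1.bam example file, so "seq1" stands in for chr1.

```r
library(Rsamtools)

fl <- system.file("extdata", "ex1.bam", package="Rsamtools")

# Sequence names and lengths from the BAM header
bf <- open(BamFile(fl))
si <- seqinfo(bf)
close(bf)

# One range per reference sequence, spanning 1..seqlength
# (assumes the Seqinfo -> GRanges coercion is available)
whole <- as(si, "GRanges")

# Keep only the sequence of interest and use it to restrict the scan
seq1 <- whole[as.character(seqnames(whole)) == "seq1"]
param <- ScanBamParam(which=seq1, what=c("rname", "pos"))
res <- scanBam(fl, param=param)
```

Because the range comes from the header, there is no need to guess an end coordinate beyond the chromosome. The BAM file needs an index (.bai) for 'which' to work.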


Martin


Michael

    Martin



        Ultimately, I want to launch a PICS analysis, which requires a
        segReadsList object.

        Overall, I have definitely made progress thanks to your help.
        Thank you.

        Rene




        On Fri, 2011-09-16 at 14:29 -0700, Martin Morgan wrote:

            On 09/16/2011 02:11 PM, Michael Lawrence wrote:

                It sounds like you're trying to use BED as an
                alternative to BAM? Probably
                not a good idea, especially at this scale. Why are you
                aiming for a
                GenomeData? A GappedAlignments might be more
                appropriate. See
                GenomicRanges::readGappedAlignments() for bringing a
                BAM into a
                GappedAlignments.


            Hi Rene

            the 'which' argument to readGappedAlignments (it'll become
            'param' with
            the next release, and be a ScanBamParam object) allows you
            to select
            regions to process, e.g., chromosome-at-a-time, to help with
            file size.
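
Chromosome-at-a-time processing can be sketched as below. The sketch uses ScanBamParam with 'which' from Rsamtools rather than the readGappedAlignments argument described above, takes sequence lengths from the BAM header, and uses countBam only as a placeholder for whatever per-chromosome work is actually needed. The ex1.bam example file and all variable names are illustrative assumptions.

```r
library(Rsamtools)

fl <- system.file("extdata", "ex1.bam", package="Rsamtools")

# Named integer vector: sequence name -> sequence length
targets <- scanBamHeader(fl)[[1]]$targets

# Visit one reference sequence at a time so that only that
# sequence's records are touched in each iteration
counts <- vapply(names(targets), function(chr) {
    which <- GRanges(chr, IRanges(1, targets[[chr]]))
    param <- ScanBamParam(which=which)
    # Placeholder work: count the records overlapping this sequence
    countBam(fl, param=param)$records
}, numeric(1))

counts
```

For a 30 GB file the same loop shape keeps peak memory bounded by the largest single chromosome rather than by the whole file.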

            Martin


                This page might help:
                
                <http://bioconductor.org/help/workflows/high-throughput-sequencing/#sequencing-resources>

                But it could really be improved.

                Michael

                On Fri, Sep 16, 2011 at 1:44 PM, Rene Paradis
                <rene.paradis@genome.ulaval.ca> wrote:


                    Hello,

                    I am having a problem loading 30 GB BED files into
                    memory. My call to read.table fails with the error:
                    Error in unique(x) :
                    length xxxxxx is too large for hashing.

                    This is raised by the function MKsetup in the
                    unique.c file. Even after increasing the value
                    10,000-fold, the error persists. I believe the
                    function then pushes more data into RAM, but I am
                    not sure this is the right approach.

                    Ultimately, I would like to produce a GenomeData
                    object from either a BAM file or a BED file.

                    Has anyone ever worked with very large BAM files
                    (about 30 GB)?

                    Thanks,

                    Rene Paradis

                    _______________________________________________
                    Bioc-sig-sequencing mailing list
                    Bioc-sig-sequencing@r-project.org
                    https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing










    --
    Computational Biology
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

    Location: M1-B861
    Telephone: 206 667-2793





