Re: [Bioc-sig-seq] large BAM files and large BED files

Martin Morgan Tue, 20 Sep 2011 06:32:37 -0700

On 09/20/2011 05:56 AM, Rene Paradis wrote:

On Mon, 2011-09-19 at 12:10 -0700, Martin Morgan wrote:

On 09/19/2011 11:55 AM, Michael Lawrence wrote:



On Mon, Sep 19, 2011 at 11:31 AM, Martin Morgan<mtmor...@fhcrc.org
<mailto:mtmor...@fhcrc.org>>  wrote:

     On 09/19/2011 11:26 AM, Rene Paradis wrote:

         Thanks Martin and Michael for your constructive advices,

         I used the ScanBamParam object to successfully load a part of
         the Chr1
         from a Bam file via ScanBam. Honestly I do not know what are the
         differences between readGappedAlignments, readBamGappedAlignment and
         ScanBam. The last two of them can take a  ScanBamParam object.


     scanBam returns a list-of-lists, it's the most flexible but least
     'user-friendly'.

     readGappedAlignments is meant to be a 'front end' to read
     GappedAlignments from several different sources, and
     readBamGappedAlignments is meant to be one of those sources; usually
     the 'user' would readGappedAlignments.


         But I wished I could select the seqname in GRanges to retrieve
         all the
         chr1 (as an example) data from the Bam file. It seems I must
         select a
         range. So I put a value that goes beyond the range of the chr1
         because I
         do not know that range, and I got an<<INTEGER () can only be
         applied to
         a 'integer', not a special>>.


Couldn't Rsamtools give something more informative?


The info in the original post isn't enough to understand how the error
occurs.

To reproduce the original error:
choose a 30GB Bam file. and run:
dataIP<- read.table(fileName,header = TRUE, colClasses -
c("factor","integer","factor"))

after a while, it will trow the error.


the seqinfo above did not work for me, probably I do not have the latest
devel version of Seqinfo.

fl

[1] "/opt/R/lib/Rsamtools/extdata/ex1.bam"

seqinfo(open(BamFile(fl)))

Error in function (classes, fdef, mtable)  :
   unable to find an inherited method for function "seqinfo", for
signature "BamFile"



otherwise, we have splitted the Big bam into chromosomes with samtools.
We were then able to load these smaller files.


Hi Rene --

Here's a BAM file, for example

  library(Rsamtools)
  fl <- system.file("extdata", "ex1.bam", package="Rsamtools")

Get the information about each sequence length from the BAM file, andmake it a GRanges object, 1 ranges per sequence


  t <- scanBamHeader(fl)[[1]][["targets"]]
  which <- GRanges(names(t), IRanges(1, unname(t)))

Then iterate over the object, either in a 'for' loop

  for (i in seq_along(which)) {
      ## read one chromosome
      param <- ScanBamParam(which=which[i], what=character())
      aln <- readGappedAlignments(fl, param=param)
      ## do more work...
  }

or lapply

  lapply(seq_along(which), function(i, fl, which) {
      param <- ScanBamParam(which=which[i], what=character())
      aln <- readGappedAlignments(fl, param=param)
      ## do more work
      table(width(aln))
  }, fl, which)

Hope that helps,

Martin


         There must be something I missed that
         could help me doing that.


     see ?scanBamHeader, e.g.,

      >   fl<- system.file("extdata", "ex1.bam", package="Rsamtools")
      >  scanBamHeader(fl)[[1]]$targets
     seq1 seq2
     1575 1584

Would be nice to have a method for getting a Seqinfo out of a BAM
header. Then one can just coerce that to a GRanges. rtracklayer does the
equivalent for BigWig.


In devel,

  >  seqinfo(open(BamFile(fl)))
Seqinfo of length 2
seqnames seqlengths isCircular
seq1           1575         NA
seq2           1584         NA

I think this needs to be updated to deal with recent changes to Seqinfo
to store the reference genome (which is sometimes also present in the
BAM file).

I think this update would be useful for us. Thank you for your support
Martin and Michael.

Rene

Maritn


Michael

     Martin



         ultimately, I want to launch a PICS analysis that requires a
         segReadsList object.

         Overall I definitely progressed by your help, thank you.

         Rene




         On Fri, 2011-09-16 at 14:29 -0700, Martin Morgan wrote:

             On 09/16/2011 02:11 PM, Michael Lawrence wrote:

                 It sounds like you're trying to use BED as an
                 alternative to BAM? Probably
                 not a good idea, especially at this scale. Why are you
                 aiming for a
                 GenomeData? A GappedAlignments might be more
                 appropriate. See
                 GenomicRanges::__readGappedAlignments() for bringing a
                 BAM into a
                 GappedAlignments.


             Hi Rene

             the 'which' argument to readGappedAlignments (it'll become
             'param' with
             the next release, and be a ScanBamParam object) allows you
             to select
             regions to process, e.g., chromosome-at-a-time, to help with
             file size.

             Martin


                 This page might help:
                 
http://bioconductor.org/help/__workflows/high-throughput-__sequencing/#sequencing-__resources
                 
<http://bioconductor.org/help/workflows/high-throughput-sequencing/#sequencing-resources>

                 But it could really be improved.

                 Michael

                 On Fri, Sep 16, 2011 at 1:44 PM, Rene
                 Paradis<rene.paradis@genome.__ulaval.ca
                 <mailto:rene.para...@genome.ulaval.ca>

                     wrote:


                     Hello,

                     I am experiencing a problem regarding the load in
                     memory of bed files of
                     30 GB. my function read.table unleash the error :
                     Error in unique(x) :
                     length xxxxxx is too large for hashing.

                     this is generated by the function MKsetup of the
                     unique.c file. Even by
                     increasing by 10 000x the value, the error persists.
                     I believe the
                     function pushes more data in ram, but I am not sure
                     this is the good way
                     to focus on.

                     Ultimately, I would like to produce a GenomeData
                     object from either a
                     BAM file or a bed file.

                     has someone ever worked with very very big BAM files
                     (about 30 GB)

                     thanks

                     Rene paradis

                     _________________________________________________
                     Bioc-sig-sequencing mailing list
                     Bioc-sig-sequencing@r-project.__org
                     <mailto:Bioc-sig-sequencing@r-project.org>
                     https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
                     <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>


                         [[alternative HTML version deleted]]

                 _________________________________________________
                 Bioc-sig-sequencing mailing list
                 Bioc-sig-sequencing@r-project.__org
                 <mailto:Bioc-sig-sequencing@r-project.org>
                 https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
                 <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>







     --
     Computational Biology
     Fred Hutchinson Cancer Research Center
     1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

     Location: M1-B861
     Telephone: 206 667-2793<tel:206%20667-2793>

     _________________________________________________
     Bioc-sig-sequencing mailing list
     Bioc-sig-sequencing@r-project.__org
     <mailto:Bioc-sig-sequencing@r-project.org>
     https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
     <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>



--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] large BAM files and large BED files

Reply via email to