On 07/14/2011 08:47 AM, Ivan Gregoretti wrote:
Hello Martin,

I am currently reading in data with scanBam and making use of the
ScanBamParam function.

My bam data looks like this:

names(bam1)
[1] "chr1:1-197195432" "chr2:1-181748087" "chr3:1-159599783"

class(bam1)
[1] "list"

class(bam1$"chr1:1-197195432")
[1] "list"

class(bam1$"chr1:1-197195432"$qname)
[1] "character"

How do we consolidate all three lists (chr1, chr2, chr3) into a single list?

I read the Rsamtools documentation but the examples do not quite apply
to this case.

These are lists-of-lists, and the somewhat opaque way of getting elements out is along the lines of

  unlist(lapply(bam1, "[[", "qname"))

This paradigm does not work for XStringSet elements (e.g., 'seq'), for which something along the lines of

  do.call(c, unname(lapply(bam1, "[[", "seq")))

is needed instead.
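A minimal sketch of the list-of-lists pattern, using plain lists that stand in for scanBam()'s per-range results (the toy read names here are made up for illustration):

```r
## Plain lists mimicking scanBam() output: one element per 'which' range,
## each itself a list of fields such as 'qname'.
bam1 <- list(
  "chr1:1-197195432" = list(qname = c("r1", "r2")),
  "chr2:1-181748087" = list(qname = c("r3"))
)

## Pull the 'qname' field out of each per-range list, then concatenate
qnames <- unlist(lapply(bam1, "[[", "qname"), use.names = FALSE)
qnames  # "r1" "r2" "r3"
```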

Martin



Thank you,

Ivan





On Wed, Jul 13, 2011 at 5:25 PM, Ivan Gregoretti <ivang...@gmail.com> wrote:
Hi Martin,

I will have to play a little bit with both the counter function and
with simplify2array. It is not evident to me what they are intended to
do.

I can answer your question about BAM size. Our samples are all from
Illumina HiSeq now. So, we are talking about no less than 100 million
good quality tags per lane.

On different occasions I have run into the 2^31 - 1 limit, and I am
always thinking about ways to process large amounts of information in
chunks and in parallel. Since sequencers will only output more tags
over time, we will run into these problems more and more often.

The suggestion of using ScanBamParam is good, thank you, but at this
early stage of sample analysis I need to load the whole BAM. Some
sample quality assessments are in order.

Thank you,

Ivan




On Wed, Jul 13, 2011 at 5:04 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:
On 07/13/2011 01:57 PM, Martin Morgan wrote:

On 07/13/2011 01:36 PM, Ivan Gregoretti wrote:

Hi everybody,

As I wait for my large BAM to be read in by scanBam, I can't help but
wonder:

Has anybody tried combining scanBam with multicore to load the
chromosomes in parallel?

That would require

1) to merge the chunks at the end and

2) the original BAM to be indexed.

Does anybody have any experience to share?

Was wondering how large and long we're talking about?

Use of ScanBamParam(what=...) can help.
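For instance, something like the following restricts scanBam to just the fields needed, which can substantially reduce memory use (a sketch assuming Rsamtools is installed and 'fl' is the path to a BAM file):

```r
library(Rsamtools)

## Read only read name, reference, position and width; all other
## fields (seq, qual, flag, ...) are skipped.
param <- ScanBamParam(what = c("qname", "rname", "pos", "qwidth"))
bam <- scanBam(fl, param = param)
```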

For some tasks I'd think of a coarser granularity, e.g., in the context
of multiple bam files so that the data reduction (to a vector of
10,000's of counts) occurs on each core.

counter = function(fl, genes) {
    aln = readGappedAlignments(fl)
    strand(aln) = "*"                  # ignore strand when counting
    hits = countOverlaps(aln, genes)
    ## count only reads that hit exactly one gene
    countOverlaps(genes, aln[hits == 1])
}
simplify2array(mclapply(bamFiles, counter, genes))
simplify2array(mclapply(bamFiles, counter, genes))

One issue I understand people have is that mclapply uses 'serialize()'
to convert the return value of each function to a raw vector. raw
vectors have the same total length limit as any other vector (2^31 -1
elements) and this places a limit on the size of chunk returned by each
core. Also I believe that exceeding the limit can silently corrupt the
data (i.e., a bug). This is second-hand information.

Should also have mentioned that casting a GRanges object to RangesList
provides the appropriate list to iterate over chromosomes, and that
ScanBamParam will accept a which of class RangesList.
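Putting that together, one per-chromosome iteration might look like this (a sketch, assuming Rsamtools, GenomicRanges and multicore are available, and 'fl' is an indexed BAM; field choices are illustrative):

```r
library(Rsamtools)
library(GenomicRanges)
library(multicore)   # provides mclapply

## Build one full-length range per chromosome from the BAM header
targets <- scanBamHeader(fl)[[1]][["targets"]]
gr <- GRanges(names(targets), IRanges(1, unname(targets)))
rl <- as(gr, "RangesList")   # one element per chromosome

## Read each chromosome on its own core
res <- mclapply(seq_along(rl), function(i) {
    param <- ScanBamParam(which = rl[i], what = c("qname", "pos"))
    scanBam(fl, param = param)[[1]]
})
```

Each element of 'res' then corresponds to one chromosome, and the earlier unlist(lapply(...)) idiom applies for consolidating fields.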

Martin


Martin


Thank you,

Ivan

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing




--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793




