On 03/09/2011 09:34 AM, Martin Morgan wrote: > On 03/09/2011 08:18 AM, Ivan Gregoretti wrote: >> I use BED because it uses less memory. >> >> BAM format contains the read names, the sequences, the quality string >> and more information. I do not need that. I only need chromosome name, >> start, end, and strand. > > In Rsamtools, scanBam with param=ScanBamParam(what=c("qname", "pos", > "qwidth", "strand")) and it'll be very fast.
'rname' rather than 'qname'. > > see ?scanBam, ?ScanBamParam and the vignette. > > Martin > >> >> So, for almost all my analyses, I start by converting my .bam to a >> minimalistic .bed.gz outside R and then from R I load my tags into a >> GRanges with import(). >> >> As simple as that. >> >> Ivan >> >> >> Ivan Gregoretti, PhD >> National Institute of Diabetes and Digestive and Kidney Diseases >> National Institutes of Health >> 5 Memorial Dr, Building 5, Room 205. >> Bethesda, MD 20892. USA. >> Phone: 1-301-496-1016 and 1-301-496-1592 >> Fax: 1-301-496-9878 >> >> >> >> On Wed, Mar 9, 2011 at 10:51 AM, Michael Lawrence >> <lawrence.mich...@gene.com> wrote: >>> >>> >>> On Wed, Mar 9, 2011 at 7:33 AM, Ivan Gregoretti <ivang...@gmail.com> wrote: >>>> >>>> I find simple BED files to be slow to import. I only use BED without >>>> track headers. The data is derived mostly from *-seq so we are talking >>>> about multiple million lines per file. >>>> >>>> The problem as I understand it is that the function reads one row at a >>>> time. It could be much faster if it read, say, 1000 rows at a time. >>>> >>> >>> I hope it's not reading one row at a time. It just calls read.table(), in a >>> fairly efficient way, with colClasses specified, etc. Why do you have high >>> throughput sequencing results in BED files? BED is really for genes. Most >>> other things fit into BAM, bedGraph (which uses the same basic parser >>> though), WIG, etc. >>> >>>> >>>> I never get errors. There are no bugs to fix. It's just very slow for >>>> the real world of high throughput sequencing. That's all. >>>> >>>> Thanks, >>>> >>>> Ivan >>>> >>>> >>>> Ivan Gregoretti, PhD >>>> National Institute of Diabetes and Digestive and Kidney Diseases >>>> National Institutes of Health >>>> 5 Memorial Dr, Building 5, Room 205. >>>> Bethesda, MD 20892. USA. >>>> Phone: 1-301-496-1016 and 1-301-496-1592 >>>> Fax: 1-301-496-9878 >>>> >>>> >>>> >>>> On Wed, Mar 9, 2011 at 10:21 AM, Michael Lawrence >>>> <lawrence.mich...@gene.com> wrote: >>>>> >>>>> >>>>> On Wed, Mar 9, 2011 at 6:41 AM, Ivan Gregoretti <ivang...@gmail.com> >>>>> wrote: >>>>>> >>>>>> Just to expand a little bit Vincent's response. >>>>>> >>>>>> If you happen to be handling very large BED files, you probably keep >>>>>> them compressed. The good news is that even in that case, you can load >>>>>> them: >>>>>> >>>>>> lit = import("~/lit.bed.gz"."bed") >>>>>> >>>>>> There is still the long-standing issue of how slow the import() >>>>>> function is but I am still hopeful. >>>>>> >>>>> >>>>> This is the first I've heard of this. What sort of files are slow? Do >>>>> they >>>>> have a track line? The parsing gets complicated when there are track >>>>> lines >>>>> and multiple tracks in a file. BED is a complex format with many >>>>> variants. >>>>> >>>>>> >>>>>> Ivan >>>>>> >>>>>> Ivan Gregoretti, PhD >>>>>> National Institute of Diabetes and Digestive and Kidney Diseases >>>>>> National Institutes of Health >>>>>> 5 Memorial Dr, Building 5, Room 205. >>>>>> Bethesda, MD 20892. USA. >>>>>> Phone: 1-301-496-1016 and 1-301-496-1592 >>>>>> Fax: 1-301-496-9878 >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Mar 8, 2011 at 9:26 PM, Vincent Carey >>>>>> <st...@channing.harvard.edu> wrote: >>>>>>> 2011/3/8 Thiago Yukio Kikuchi Oliveira <strat...@gmail.com>: >>>>>>>> Hi, >>>>>>>> >>>>>>>> Is there a BED file parser for R? >>>>>>> >>>>>>> I suppose it depends on what you mean by "parser". import() from the >>>>>>> rtracklayer package imports BED and constructs and populates a >>>>>>> RangedData object with the contents. Here we look at a small bed >>>>>>> file >>>>>>> in text, >>>>>>> start R, load rtracklayer, import the data, show the result, and show >>>>>>> the resources used. >>>>>>> >>>>>>> bash-3.2$ head ~/junc716_20.bed >>>>>>> chr20 55658 64827 JUNC00000001 14 + 55658 64827 >>>>>>> 255,0,0 2 27,25 0,9144 >>>>>>> chr20 55662 64821 JUNC00000002 2 - 55662 64821 >>>>>>> 255,0,0 2 34,8 0,9151 >>>>>>> chr20 135774 147029 JUNC00000003 1 - 135774 >>>>>>> 147029 >>>>>>> 255,0,0 2 8,29 0,11226 >>>>>>> chr20 167951 172361 JUNC00000004 1 + 167951 >>>>>>> 172361 >>>>>>> 255,0,0 2 29,8 0,4402 >>>>>>> chr20 189824 192113 JUNC00000005 3 + 189824 >>>>>>> 192113 >>>>>>> 255,0,0 2 33,9 0,2280 >>>>>>> chr20 189829 192113 JUNC00000006 3 + 189829 >>>>>>> 192113 >>>>>>> 255,0,0 2 32,9 0,2275 >>>>>>> chr20 193930 199576 JUNC00000007 4 - 193930 >>>>>>> 199576 >>>>>>> 255,0,0 2 28,11 0,5635 >>>>>>> chr20 207050 207846 JUNC00000008 2 - 207050 >>>>>>> 207846 >>>>>>> 255,0,0 2 20,34 0,762 >>>>>>> chr20 218306 218925 JUNC00000009 1 - 218306 >>>>>>> 218925 >>>>>>> 255,0,0 2 11,26 0,593 >>>>>>> chr20 221160 225070 JUNC00000010 25 - 221160 >>>>>>> 225070 >>>>>>> 255,0,0 2 29,9 0,3901 >>>>>>> bash-3.2$ head ~/junc716_20.bed > ~/lit.bed >>>>>>> bash-3.2$ R213 --vanilla --quiet >>>>>>>> library(rtracklayer) >>>>>>> Loading required package: RCurl >>>>>>> Loading required package: bitops >>>>>>>> lit = import("~/lit.bed") >>>>>>>> lit >>>>>>> RangedData with 10 rows and 9 value columns across 1 space >>>>>>> space ranges | name score strand >>>>>>> thickStart >>>>>>> <character> <IRanges> | <character> <numeric> <character> >>>>>>> <integer> >>>>>>> 1 chr20 [ 55659, 64827] | JUNC00000001 14 + >>>>>>> 55658 >>>>>>> 2 chr20 [ 55663, 64821] | JUNC00000002 2 - >>>>>>> 55662 >>>>>>> 3 chr20 [135775, 147029] | JUNC00000003 1 - >>>>>>> 135774 >>>>>>> 4 chr20 [167952, 172361] | JUNC00000004 1 + >>>>>>> 167951 >>>>>>> 5 chr20 [189825, 192113] | JUNC00000005 3 + >>>>>>> 189824 >>>>>>> 6 chr20 [189830, 192113] | JUNC00000006 3 + >>>>>>> 189829 >>>>>>> 7 chr20 [193931, 199576] | JUNC00000007 4 - >>>>>>> 193930 >>>>>>> 8 chr20 [207051, 207846] | JUNC00000008 2 - >>>>>>> 207050 >>>>>>> 9 chr20 [218307, 218925] | JUNC00000009 1 - >>>>>>> 218306 >>>>>>> 10 chr20 [221161, 225070] | JUNC00000010 25 - >>>>>>> 221160 >>>>>>> thickEnd itemRgb blockCount blockSizes blockStarts >>>>>>> <integer> <character> <integer> <character> <character> >>>>>>> 1 64827 #FF0000 2 27,25 0,9144 >>>>>>> 2 64821 #FF0000 2 34,8 0,9151 >>>>>>> 3 147029 #FF0000 2 8,29 0,11226 >>>>>>> 4 172361 #FF0000 2 29,8 0,4402 >>>>>>> 5 192113 #FF0000 2 33,9 0,2280 >>>>>>> 6 192113 #FF0000 2 32,9 0,2275 >>>>>>> 7 199576 #FF0000 2 28,11 0,5635 >>>>>>> 8 207846 #FF0000 2 20,34 0,762 >>>>>>> 9 218925 #FF0000 2 11,26 0,593 >>>>>>> 10 225070 #FF0000 2 29,9 0,3901 >>>>>>> >>>>>>>> sessionInfo() >>>>>>> R version 2.13.0 Under development (unstable) (2011-03-01 r54628) >>>>>>> Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit) >>>>>>> >>>>>>> locale: >>>>>>> [1] C >>>>>>> >>>>>>> attached base packages: >>>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>>> >>>>>>> other attached packages: >>>>>>> [1] rtracklayer_1.11.11 RCurl_1.5-0 bitops_1.0-4.1 >>>>>>> >>>>>>> loaded via a namespace (and not attached): >>>>>>> [1] BSgenome_1.19.4 Biobase_2.11.9 Biostrings_2.19.15 >>>>>>> [4] GenomicRanges_1.3.23 IRanges_1.9.25 Matrix_0.999375-47 >>>>>>> [7] XML_3.2-0 grid_2.13.0 lattice_0.19-17 >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Thanks >>>>>>>> >>>>>>>> / Thiago Yukio Kikuchi Oliveira >>>>>>>> (=\ >>>>>>>> \=) Faculdade de Medicina de Ribeirão Preto >>>>>>>> / Laboratório de Genética Molecular e Bioinformática >>>>>>>> /=) >>>>>>>> ----------------------------------------------------------------- >>>>>>>> (=/ Centro de Terapia Celular/CEPID/FAPESP - Hemocentro de Rib. >>>>>>>> Preto >>>>>>>> / Rua Tenente Catão Roxo, 2501 CEP 14151-140 >>>>>>>> (=\ Ribeirão Preto - São Paulo >>>>>>>> \=) Fone: 55 16 2101-9300 Ramal: 9603 >>>>>>>> / E-mail: stra...@lgmb.fmrp.usp.br >>>>>>>> /=) strat...@gmail.com >>>>>>>> (=/ >>>>>>>> / Bioinformatic Team - BiT: http://lgmb.fmrp.usp.br >>>>>>>> (=\ Hemocentro de Ribeirão Preto: http://pegasus.fmrp.usp.br >>>>>>>> \=) >>>>>>>> / >>>>>>>> ----------------------------------------------------------------- >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioc-sig-sequencing mailing list >>>>>>>> Bioc-sig-sequencing@r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >>>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioc-sig-sequencing mailing list >>>>>>> Bioc-sig-sequencing@r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Bioc-sig-sequencing mailing list >>>>>> Bioc-sig-sequencing@r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >>>>> >>>>> >>> >>> >> >> _______________________________________________ >> Bioc-sig-sequencing mailing list >> Bioc-sig-sequencing@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > > -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing