Re: [Bioc-sig-seq] BED file parser

Michael Lawrence Thu, 10 Mar 2011 06:51:10 -0800

Thanks for this suggestion. I did that for GFF (because there can be an
insane number of attributes that take forever to parse), and I can do it for
BED, as well.
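
For anyone who wants the effect today, here is a minimal sketch in base R
of the column-skipping idea. It is not the rtracklayer or BayesPeak code;
read_bed_minimal() is a hypothetical helper name, and it assumes a
tab-separated BED file with no track or browser header lines. Columns whose
colClasses entry is "NULL" are skipped by read.table(), and gzfile() reads
both plain and gzip-compressed files.

```r
# Hypothetical helper (not part of rtracklayer or BayesPeak): keep only
# chrom, start, end and (if present) strand from a headerless BED file.
read_bed_minimal <- function(path) {
  # Peek at the first line to learn how many columns the file has.
  con <- gzfile(path, open = "rt")
  first <- readLines(con, n = 1L)
  close(con)
  n_cols <- length(strsplit(first, "\t", fixed = TRUE)[[1]])
  stopifnot(n_cols >= 3L)

  # "NULL" tells read.table() to drop that column without converting it.
  classes <- rep("NULL", n_cols)
  classes[1:3] <- c("character", "integer", "integer")
  col_names <- c("chrom", "start", "end")
  if (n_cols >= 6L) {
    classes[6] <- "character"
    col_names <- c(col_names, "strand")
  }

  # read.table() opens and closes the unopened gzfile() connection itself.
  bed <- read.table(gzfile(path), sep = "\t", colClasses = classes,
                    quote = "", comment.char = "", header = FALSE,
                    stringsAsFactors = FALSE)
  names(bed) <- col_names

  # BED starts are 0-based; shift to 1-based, matching the import() output.
  bed$start <- bed$start + 1L
  bed
}
```

Something like read_bed_minimal("~/lit.bed.gz") would then return a plain
data.frame; turning that into a GRanges or RangedData object is left out of
the sketch.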


Michael

On Thu, Mar 10, 2011 at 6:08 AM, Jonathan Cairns <
jonathan.cai...@cancer.org.uk> wrote:

> FWIW, if you need only the chr, start, end and strand information from a
> BED file, the function read.bed() in the BayesPeak package does this.
> Moreover it is usually around twice as fast as import() because it ignores
> the rest of the columns. However, it cannot yet handle the compressed
> .bed.gz format.
>
> Perhaps this would be a useful option in import.bed()? From my experience,
> use of BED files is not that uncommon for ChIP-seq reads.
>
> Jonathan
> ________________________________________
> From: bioc-sig-sequencing-boun...@r-project.org [
> bioc-sig-sequencing-boun...@r-project.org] On Behalf Of Ivan Gregoretti [
> ivang...@gmail.com]
> Sent: 09 March 2011 16:18
> To: Michael Lawrence
> Cc: bioc-sig-sequencing@r-project.org
> Subject: Re: [Bioc-sig-seq] BED file parser
>
> I use BED because it uses less memory.
>
> BAM format contains the read names, the sequences, the quality string
> and more information. I do not need that. I only need chromosome name,
> start, end, and strand.
>
> So, for almost all my analyses, I start by converting my .bam to a
> minimalistic .bed.gz outside R and then from R I load my tags into a
> GRanges with import().
>
> As simple as that.
>
> Ivan
>
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1016 and 1-301-496-1592
> Fax: 1-301-496-9878
>
>
>
> On Wed, Mar 9, 2011 at 10:51 AM, Michael Lawrence
> <lawrence.mich...@gene.com> wrote:
> >
> >
> > On Wed, Mar 9, 2011 at 7:33 AM, Ivan Gregoretti <ivang...@gmail.com> wrote:
> >>
> >> I find simple BED files to be slow to import. I only use BED without
> >> track headers. The data is derived mostly from *-seq so we are talking
> >> about multiple million lines per file.
> >>
> >> The problem as I understand it is that the function reads one row at a
> >> time. It could be much faster if it read, say, 1000 rows at a time.
> >>
> >
> > I hope it's not reading one row at a time. It just calls read.table(), in a
> > fairly efficient way, with colClasses specified, etc. Why do you have
> > high-throughput sequencing results in BED files? BED is really for genes.
> > Most other things fit into BAM, bedGraph (which uses the same basic parser
> > though), WIG, etc.
> >
> >>
> >> I never get errors. There are no bugs to fix. It's just very slow for
> >> the real world of high throughput sequencing. That's all.
> >>
> >> Thanks,
> >>
> >> Ivan
> >>
> >>
> >>
> >> On Wed, Mar 9, 2011 at 10:21 AM, Michael Lawrence
> >> <lawrence.mich...@gene.com> wrote:
> >> >
> >> >
> >> > On Wed, Mar 9, 2011 at 6:41 AM, Ivan Gregoretti <ivang...@gmail.com> wrote:
> >> >>
> >> >> Just to expand a little bit on Vincent's response.
> >> >>
> >> >> If you happen to be handling very large BED files, you probably keep
> >> >> them compressed. The good news is that even in that case, you can load
> >> >> them:
> >> >>
> >> >> lit = import("~/lit.bed.gz", "bed")
> >> >>
> >> >> There is still the long-standing issue of how slow the import()
> >> >> function is but I am still hopeful.
> >> >>
> >> >
> >> > This is the first I've heard of this. What sort of files are slow? Do they
> >> > have a track line? The parsing gets complicated when there are track lines
> >> > and multiple tracks in a file. BED is a complex format with many variants.
> >> >
> >> >>
> >> >> Ivan
> >> >>
> >> >>
> >> >> On Tue, Mar 8, 2011 at 9:26 PM, Vincent Carey
> >> >> <st...@channing.harvard.edu> wrote:
> >> >> > 2011/3/8 Thiago Yukio Kikuchi Oliveira <strat...@gmail.com>:
> >> >> >> Hi,
> >> >> >>
> >> >> >> Is there a BED file parser for R?
> >> >> >
> >> >> > I suppose it depends on what you mean by "parser".  import() from the
> >> >> > rtracklayer package imports BED and constructs and populates a
> >> >> > RangedData object with the contents.  Here we look at a small bed file
> >> >> > in text, start R, load rtracklayer, import the data, show the result,
> >> >> > and show the resources used.
> >> >> >
> >> >> > bash-3.2$ head ~/junc716_20.bed
> >> >> > chr20   55658   64827   JUNC00000001    14      +       55658   64827   255,0,0 2       27,25   0,9144
> >> >> > chr20   55662   64821   JUNC00000002    2       -       55662   64821   255,0,0 2       34,8    0,9151
> >> >> > chr20   135774  147029  JUNC00000003    1       -       135774  147029  255,0,0 2       8,29    0,11226
> >> >> > chr20   167951  172361  JUNC00000004    1       +       167951  172361  255,0,0 2       29,8    0,4402
> >> >> > chr20   189824  192113  JUNC00000005    3       +       189824  192113  255,0,0 2       33,9    0,2280
> >> >> > chr20   189829  192113  JUNC00000006    3       +       189829  192113  255,0,0 2       32,9    0,2275
> >> >> > chr20   193930  199576  JUNC00000007    4       -       193930  199576  255,0,0 2       28,11   0,5635
> >> >> > chr20   207050  207846  JUNC00000008    2       -       207050  207846  255,0,0 2       20,34   0,762
> >> >> > chr20   218306  218925  JUNC00000009    1       -       218306  218925  255,0,0 2       11,26   0,593
> >> >> > chr20   221160  225070  JUNC00000010    25      -       221160  225070  255,0,0 2       29,9    0,3901
> >> >> > bash-3.2$ head ~/junc716_20.bed > ~/lit.bed
> >> >> > bash-3.2$ R213 --vanilla --quiet
> >> >> >> library(rtracklayer)
> >> >> > Loading required package: RCurl
> >> >> > Loading required package: bitops
> >> >> >> lit = import("~/lit.bed")
> >> >> >> lit
> >> >> > RangedData with 10 rows and 9 value columns across 1 space
> >> >> >         space           ranges |         name     score      strand thickStart
> >> >> >   <character>        <IRanges> |  <character> <numeric> <character>  <integer>
> >> >> > 1        chr20 [ 55659,  64827] | JUNC00000001        14           +      55658
> >> >> > 2        chr20 [ 55663,  64821] | JUNC00000002         2           -      55662
> >> >> > 3        chr20 [135775, 147029] | JUNC00000003         1           -     135774
> >> >> > 4        chr20 [167952, 172361] | JUNC00000004         1           +     167951
> >> >> > 5        chr20 [189825, 192113] | JUNC00000005         3           +     189824
> >> >> > 6        chr20 [189830, 192113] | JUNC00000006         3           +     189829
> >> >> > 7        chr20 [193931, 199576] | JUNC00000007         4           -     193930
> >> >> > 8        chr20 [207051, 207846] | JUNC00000008         2           -     207050
> >> >> > 9        chr20 [218307, 218925] | JUNC00000009         1           -     218306
> >> >> > 10       chr20 [221161, 225070] | JUNC00000010        25           -     221160
> >> >> >    thickEnd     itemRgb blockCount  blockSizes blockStarts
> >> >> >   <integer> <character>  <integer> <character> <character>
> >> >> > 1      64827     #FF0000          2       27,25      0,9144
> >> >> > 2      64821     #FF0000          2        34,8      0,9151
> >> >> > 3     147029     #FF0000          2        8,29     0,11226
> >> >> > 4     172361     #FF0000          2        29,8      0,4402
> >> >> > 5     192113     #FF0000          2        33,9      0,2280
> >> >> > 6     192113     #FF0000          2        32,9      0,2275
> >> >> > 7     199576     #FF0000          2       28,11      0,5635
> >> >> > 8     207846     #FF0000          2       20,34       0,762
> >> >> > 9     218925     #FF0000          2       11,26       0,593
> >> >> > 10    225070     #FF0000          2        29,9      0,3901
> >> >> >
> >> >> >> sessionInfo()
> >> >> > R version 2.13.0 Under development (unstable) (2011-03-01 r54628)
> >> >> > Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
> >> >> >
> >> >> > locale:
> >> >> > [1] C
> >> >> >
> >> >> > attached base packages:
> >> >> > [1] stats     graphics  grDevices utils     datasets  methods   base
> >> >> >
> >> >> > other attached packages:
> >> >> > [1] rtracklayer_1.11.11 RCurl_1.5-0         bitops_1.0-4.1
> >> >> >
> >> >> > loaded via a namespace (and not attached):
> >> >> > [1] BSgenome_1.19.4      Biobase_2.11.9       Biostrings_2.19.15
> >> >> > [4] GenomicRanges_1.3.23 IRanges_1.9.25       Matrix_0.999375-47
> >> >> > [7] XML_3.2-0            grid_2.13.0          lattice_0.19-17
> >> >> >
> >> >> >
> >> >> >>
> >> >> >>
> >> >> >> Thanks
> >> >> >>
> >> >> >>     /    Thiago Yukio Kikuchi Oliveira
> >> >> >> (=\
> >> >> >>   \=) Faculdade de Medicina de Ribeirão Preto
> >> >> >>    /   Laboratório de Genética Molecular e Bioinformática
> >> >> >>   /=)
> >> >> >> -----------------------------------------------------------------
> >> >> >> (=/   Centro de Terapia Celular/CEPID/FAPESP - Hemocentro de Rib.
> >> >> >> Preto
> >> >> >>   /    Rua Tenente Catão Roxo, 2501 CEP 14151-140
> >> >> >> (=\   Ribeirão Preto - São Paulo
> >> >> >>   \=) Fone: 55 16 2101-9300   Ramal: 9603
> >> >> >>    /   E-mail: stra...@lgmb.fmrp.usp.br
> >> >> >>   /=)            strat...@gmail.com
> >> >> >> (=/
> >> >> >>   /    Bioinformatic Team - BiT: http://lgmb.fmrp.usp.br
> >> >> >> (=\   Hemocentro de Ribeirão Preto: http://pegasus.fmrp.usp.br
> >> >> >>   \=)
> >> >> >>    /
> >> >> >> -----------------------------------------------------------------
> >> >> >>
> >> >> >> _______________________________________________
> >> >> >> Bioc-sig-sequencing mailing list
> >> >> >> Bioc-sig-sequencing@r-project.org
> >> >> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >> >> >>
> >> >> >
> >> >
> >> >
> >
> >
>
