[Bioc-devel] Issue tracker for Bioconductor

2014-08-27 Thread Arun Kalyanasundaram
Hi,

I am a graduate student at CMU and I am interested in studying scientific
software eco-systems such as Bioconductor.

I wanted to know if there is a publicly available issue tracker / bug
reports for Bioconductor or something that I can gain read-only access to.

Thank you all for your help.

Best, 
-Arun 




___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] writeVcf performance

2014-08-27 Thread Gabe Becker
Martin and Val.

I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with
profiling enabled. The results of summaryRprof for that run are attached,
though for a variety of reasons they are pretty misleading.

It took over an hour to write (3700+ seconds), so it's definitely a
bottleneck when the data get very large, even if it isn't for smaller data.

Michael and I both think the culprit is all the pasting and cbinding that
is going on, and more to the point, that memory for an internal
representation to be written out is allocated at all.  Streaming across the
object, looping by rows and writing directly to file (e.g. from C) should
be blisteringly fast in comparison.
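
To make the streaming idea concrete, here is a minimal R-level sketch. It is not writeVcf and not the C implementation suggested above; `write_chunked` is a hypothetical helper, the toy data frame stands in for a real VCF object, and the tab-delimited output is not valid VCF. It only shows the shape of the approach: format a block of rows, write it to an open connection, and discard it, so that the full set of output strings is never held in memory at once.

```r
# Hypothetical helper (not part of VariantAnnotation): write a data frame as
# tab-delimited text in row chunks, so that only one chunk of formatted
# strings exists in memory at a time.
write_chunked <- function(df, path, chunk_size = 100000L) {
  con <- file(path, open = "wt")
  on.exit(close(con))
  n <- nrow(df)
  if (n == 0L) return(invisible(0L))
  for (s in seq(1L, n, by = chunk_size)) {
    e <- min(s + chunk_size - 1L, n)
    chunk <- df[s:e, , drop = FALSE]
    # unname() keeps column names from being mistaken for paste() arguments
    lines <- do.call(paste, c(unname(as.list(chunk)), sep = "\t"))
    writeLines(lines, con)
  }
  invisible(n)
}

# Toy stand-in for variant records
toy <- data.frame(chrom = "chr1", pos = 1:10, ref = "A", alt = "T",
                  stringsAsFactors = FALSE)
out <- tempfile(fileext = ".txt")
write_chunked(toy, out, chunk_size = 3L)
written <- readLines(out)
```

The chunk size trades peak memory against per-call overhead; a true row-by-row writer in C would avoid creating R strings altogether, which is the point being made above.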

~G


On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence micha...@gene.com
wrote:

 Gabe is still testing/profiling, but we'll send something randomized along
 eventually.


 On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan mtmor...@fhcrc.org
 wrote:

 I didn't see in the original thread a reproducible (simulated, I guess)
 example, to be explicit about what the problem is??

 Martin


 On 08/26/2014 10:47 AM, Michael Lawrence wrote:

 My understanding is that the heap optimization provided only marginal
 gains, and that we need to think harder about how to optimize all of the
 string manipulation in writeVcf. We either need to reduce it or reduce its
 overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.


 On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain voben...@fhcrc.org
 wrote:

  Hi Gabe,

 Martin responded, and so did Michael,

 https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html

 It sounded like Michael was ok with working with/around heap
 initialization.

 Michael, is that right or should we still consider this on the table?


 Val


 On 08/26/2014 09:34 AM, Gabe Becker wrote:

  Val,

 Has there been any movement on this? This remains a substantial
 bottleneck for us when writing very large VCF files (e.g.
 variants+genotypes for whole genome NGS samples).

 I was able to see a ~25% speedup with 4 cores and an optimal speedup of
 ~2x with 10-12 cores for a VCF with 500k rows, using a very naive
 parallelization strategy and no other changes. I suspect this could be
 improved on quite a bit, or possibly made irrelevant with judicious use
 of serial C code.
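
The message does not say how the naive parallelization was done. One plausible reading, sketched here purely as an assumption, is to split the rows into contiguous blocks, format each block in a separate worker, and concatenate the results in order. `format_rows` and `format_rows_parallel` are hypothetical names, and the toy data frame stands in for a real VCF object.

```r
library(parallel)

# Format each row of a data frame as one tab-delimited string.
format_rows <- function(df) {
  do.call(paste, c(unname(as.list(df)), sep = "\t"))
}

# Split the rows into contiguous blocks, format each block in a worker, and
# concatenate the pieces in their original order.
# mc.cores = 1 runs serially; values above 1 fork workers and are not
# supported on Windows.
format_rows_parallel <- function(df, n_chunks = 4L, cores = 1L) {
  idx <- splitIndices(nrow(df), n_chunks)
  pieces <- mclapply(idx, function(i) format_rows(df[i, , drop = FALSE]),
                     mc.cores = cores)
  unlist(pieces, use.names = FALSE)
}

toy <- data.frame(chrom = "chr1", pos = 1:20, ref = "A", alt = "T",
                  stringsAsFactors = FALSE)
same <- identical(format_rows_parallel(toy, n_chunks = 3L), format_rows(toy))
```

Each worker still allocates its own strings and the parent has to collect them all before writing, which may be part of why the reported speedup flattens out at around 2x.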

 Did you and Martin make any plans regarding optimizing writeVcf?

 Best
 ~G


 On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain voben...@fhcrc.org
 wrote:

  Hi Michael,

  I'm interested in working on this. I'll discuss with Martin next
  week when we're both back in the office.

  Val





  On 08/05/14 07:46, Michael Lawrence wrote:

  Hi guys (Val, Martin, Herve):

  Anyone have an itch for optimization? The writeVcf function is currently
  a bottleneck in our WGS genotyping pipeline. For a typical 50 million row
  gVCF, it was taking 2.25 hours prior to yesterday's improvements
  (pasteCollapseRows) that brought it down to about 1 hour, which is still
  too long by my standards (> 0). Only takes 3 minutes to call the
  genotypes (and associated likelihoods etc) from the variant calls (using
  80 cores and 450 GB RAM on one node), so the output is an issue.
  Profiling suggests that the running time scales non-linearly in the
  number of rows.

  Digging a little deeper, it seems to be something with R's string/memory
  allocation. Below, pasting 1 million strings takes 6 seconds, but 10
  million strings takes over 2 minutes. It gets way worse with 50 million.
  I suspect it has something to do with R's string hash table.

  set.seed(1000)
  end <- sample(1e8, 1e6)
  system.time(paste0("END", "=", end))
     user  system elapsed
    6.396   0.028   6.420

  end <- sample(1e8, 1e7)
  system.time(paste0("END", "=", end))
     user  system elapsed
  134.714   0.352 134.978

  Indeed, even this takes a long time (in a fresh session):

  set.seed(1000)
  end <- sample(1e8, 1e6)
  end <- sample(1e8, 1e7)
  system.time(as.character(end))
     user  system elapsed
   57.224   0.156  57.366

  But running it a second time is faster (about what one would expect?):

  system.time(levels <- as.character(end))
     user  system elapsed
   23.582   0.021  23.589

  I did some simple profiling of R to find that the resizing of the string
  hash table is not a significant component of the time. So maybe something
  to do with the R heap/gc? No time right now to go deeper. But I know
  Martin likes this sort of thing ;)
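
The timings quoted above compare only two input sizes. A small harness along the following lines could check whether the cost per string grows with the number of strings on a given machine and R version. It is a sketch rather than part of the original report: `time_paste` is a hypothetical helper, and the sizes are kept small so that it finishes quickly.

```r
# Time paste0() on n random integers and return the elapsed seconds.
time_paste <- function(n) {
  x <- sample(1e8, n)
  unname(system.time(paste0("END", "=", x))["elapsed"])
}

set.seed(1000)
sizes <- c(1e4, 1e5, 1e6)
elapsed <- vapply(sizes, time_paste, numeric(1))

# If the cost were linear in n, seconds per million strings would be roughly
# constant across rows; a rising value points to super-linear scaling.
scaling <- data.frame(n = sizes,
                      elapsed = elapsed,
                      sec_per_million = elapsed / (sizes / 1e6))
scaling
```

Running each size in a fresh session, as the message does, avoids the effect of strings already cached from an earlier run.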

  Michael


Re: [Bioc-devel] Issue tracker for Bioconductor

2014-08-27 Thread Dan Tenenbaum
Hi Arun,

- Original Message -
 From: Arun Kalyanasundaram arunk...@cs.cmu.edu
 To: bioc-devel@r-project.org
 Sent: Wednesday, August 27, 2014 10:01:44 AM
 Subject: [Bioc-devel] Issue tracker for Bioconductor
 
 Hi,
 
 I am a graduate student at CMU and I am interested in studying
 scientific
 software eco-systems such as Bioconductor.
 
 I wanted to know if there is a publicly available issue tracker / bug
 reports for Bioconductor or something that I can gain read-only
 access to.
 

There is no central tracker. Sometimes bugs are discussed on this mailing list. 

Some packages have issue trackers, mostly on GitHub. These packages *should*
list the issue tracker URL in the BugReports field of the package DESCRIPTION
file (but don't always). Most packages that are on GitHub as well as in svn
are listed here:

https://gitsvn.bioconductor.org/list_bridges
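
For someone collecting tracker URLs programmatically, a package's DESCRIPTION file can be parsed with base R's read.dcf(). The sketch below uses a made-up DESCRIPTION written to a temporary file; the package name and URL are placeholders, not a real Bioconductor package.

```r
# Write a small stand-in DESCRIPTION file; the package name and URL are
# invented for illustration.
desc <- tempfile()
writeLines(c(
  "Package: examplepkg",
  "Version: 0.1.0",
  "BugReports: https://example.org/examplepkg/issues"
), desc)

# read.dcf() returns a character matrix with one column per requested field;
# a field that is absent from the file comes back as NA.
fields <- read.dcf(desc, fields = c("Package", "BugReports"))
tracker <- fields[1, "BugReports"]
```

A package that does not declare a tracker yields NA here, which makes the "but don't always" case easy to count.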

Dan



 Thank you all for your help.
 
 Best,
 -Arun
 
 
 


Re: [Bioc-devel] writeVcf performance

2014-08-27 Thread Gabe Becker
The profiling I attached in my previous email is for 24 geno fields, as I
said, but our typical usecase involves only ~4-6 fields, and is faster but
still on the order of dozens of minutes.

Sorry for the confusion.
~G


On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker becke...@gene.com wrote:

 Martin and Val.

 I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
 with profiling enabled. The results of summaryRprof for that run are
 attached, though for a variety of reasons they are pretty misleading.

 It took over an hour to write (3700+ seconds), so it's definitely a
 bottleneck when the data get very large, even if it isn't for smaller data.

 Michael and I both think the culprit is all the pasting and cbinding that
 is going on, and more to the point, that memory for an internal
 representation to be written out is allocated at all.  Streaming across the
 object, looping by rows and writing directly to file (e.g. from C) should
 be blisteringly fast in comparison.

 ~G



Re: [Bioc-devel] Issue tracker for Bioconductor

2014-08-27 Thread Sean Davis
Hi, Arun.  There is no such system that covers the entire Bioconductor
project.  Since packages are largely contributed by diverse developers,
there are many disparate bug tracking systems in use (and many packages
with no formal bug tracking).  There is a facility in R to allow package
authors to specify a bug reporting mechanism.  Some details are available
on this page:

http://stat.ethz.ch/R-manual/R-devel/library/utils/html/bug.report.html

Several packages in Bioconductor supply bug reporting links, so you could
look into those that do.
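
As a rough way to find those packages among the ones installed locally, installed.packages() can be asked to return the BugReports field alongside its default columns. This is only a sketch: it covers whatever happens to be installed in the current library paths, not all of Bioconductor.

```r
# Ask for the BugReports field in addition to the default columns.
ip <- installed.packages(fields = "BugReports")

# Keep only packages whose DESCRIPTION declares a bug-reporting URL.
has_tracker <- !is.na(ip[, "BugReports"])
trackers <- ip[has_tracker, c("Package", "BugReports"), drop = FALSE]

head(trackers)
```

For a single package, utils::bug.report(package = "somepkg") is the interactive route described on the page linked above; "somepkg" is a placeholder name.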

Sean



On Wed, Aug 27, 2014 at 1:01 PM, Arun Kalyanasundaram arunk...@cs.cmu.edu
wrote:

 Hi,

 I am a graduate student at CMU and I am interested in studying scientific
 software eco-systems such as Bioconductor.

 I wanted to know if there is a publicly available issue tracker / bug
 reports for Bioconductor or something that I can gain read-only access to.

 Thank you all for your help.

 Best,
 -Arun


