[Rd] write.csv performance improvements?

2023-03-29 Thread Toby Hocking
Dear R-devel,
I did a systematic comparison of write.csv with similar functions, and
observed two asymptotic inefficiencies that could be improved.

1. write.csv is quadratic time (N^2) in the number of columns N.
Can write.csv be improved to use a linear time algorithm, so it can handle
CSV files with larger numbers of columns?
For more details including figures and session info, please see
https://github.com/tdhock/atime/issues/9

2. write.csv uses memory that is linear in the number of rows, whereas
similar R functions for writing CSV use only constant memory. This is a
less important issue to fix, because linear memory is already used to
store the data in R. But since the other functions use constant memory,
could write.csv do so as well? Is there some copying happening that could be
avoided? (This memory measurement uses bench::mark, which in turn uses
utils::Rprofmem.)
https://github.com/tdhock/atime/issues/10
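
For readers who want to check the column scaling locally, here is a minimal base-R sketch. It is not the atime benchmark from the linked issues; the helper name time_write_csv and the chosen sizes are assumptions made for illustration only.

```r
# Time write.csv on data frames with a growing number of columns.
# time_write_csv() is a hypothetical helper, not part of any package.
time_write_csv <- function(n_cols, n_rows = 10L) {
  df <- as.data.frame(matrix(0, nrow = n_rows, ncol = n_cols))
  out <- tempfile(fileext = ".csv")
  on.exit(unlink(out))  # remove the temporary file when the function exits
  elapsed <- system.time(write.csv(df, out, row.names = FALSE))[["elapsed"]]
  data.frame(n_cols = n_cols, seconds = elapsed)
}

sizes <- c(250L, 500L, 1000L, 2000L)
timings <- do.call(rbind, lapply(sizes, time_write_csv))
print(timings)
```

If the time is quadratic in the number of columns, each doubling of n_cols should roughly quadruple the elapsed time once the sizes are large enough to dominate fixed overhead.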

Sincerely,
Toby Dylan Hocking

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] read.csv quadratic time in number of columns

2023-03-29 Thread Toby Hocking
Dear R-devel,
A number of people have observed anecdotally that read.csv is slow for
a large number of columns, for example:
https://stackoverflow.com/questions/7327851/read-csv-is-extremely-slow-in-reading-csv-files-with-large-numbers-of-columns
I did a systematic comparison of read.csv with similar functions, and
observed that read.csv is quadratic time (N^2) in the number of columns N,
whereas the others are linear (N).
Can read.csv be improved to use a linear time algorithm, so it can handle
CSV files with larger numbers of columns?
For more details including figures and session info, please see
https://github.com/tdhock/atime/issues/8
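
A rough way to see the trend without extra packages is sketched below. It is only an illustration under assumptions (one data row, the hypothetical helper time_read_csv, arbitrary sizes) and not the comparison described in the linked issue.

```r
# Time read.csv on one-row CSV files with a growing number of columns.
# time_read_csv() is a hypothetical helper, not part of any package.
time_read_csv <- function(n_cols) {
  path <- tempfile(fileext = ".csv")
  on.exit(unlink(path))  # remove the temporary file when the function exits
  header <- paste0("V", seq_len(n_cols), collapse = ",")
  row <- paste(rep("1", n_cols), collapse = ",")
  writeLines(c(header, row), path)
  elapsed <- system.time(df <- read.csv(path))[["elapsed"]]
  stopifnot(ncol(df) == n_cols)  # sanity check: every column was parsed
  data.frame(n_cols = n_cols, seconds = elapsed)
}

read_sizes <- c(500L, 1000L, 2000L, 4000L)
read_timings <- do.call(rbind, lapply(read_sizes, time_read_csv))
# Ratio of each timing to the previous one: about 2 suggests linear time,
# about 4 suggests quadratic time (small sizes can give noisy ratios).
print(read_timings$seconds[-1] / read_timings$seconds[-nrow(read_timings)])
```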
Sincerely,
Toby Dylan Hocking



Re: [Bioc-devel] httr::GET() problem downloading a ExperimentHub resource

2023-03-29 Thread Martin Morgan
Some more not-necessarily helpful observations. You can get verbose output with

curl::curl_fetch_disk(url, tempfile(), handle = curl::new_handle(verbose = TRUE))

and on the command line with curl -v -L ...

Also, it seems that other BAM files can be downloaded, e.g., from 
eh[["EH3502"]] (also httr::with_verbose(eh[["EH3502"]])). It would be worthwhile 
verifying this a little more completely; I looked for

mcols(eh) |> as_tibble(rownames = "ehid") |>
    filter(sourcetype == "BAM", rdataclass == "BamFile")

If it's true that other BAM files are ok, then it points to the way the files 
are being served on "your" end.

One difference I see is that "your" files have Content-Encoding: gzip, but 
there is no Content-Encoding tag on the BAM file above. I guess BAM files are 
(some flavor of) gzip (?), but maybe this is confusing the R curl library...

Martin


Re: [Bioc-devel] httr::GET() problem downloading a ExperimentHub resource

2023-03-29 Thread Robert Castelo
good catch, but really enigmatic, BAI files work, but BAM don't:

dat <- read.csv("https://raw.githubusercontent.com/functionalgenomics/gDNAinRNAseqData/devel/inst/extdata/metadata_LiYu22subsetBAMfiles.csv")
rdatapath <- strsplit(dat$RDataPath, ":")
bamfiles <- unlist(rdatapath)[seq(1, 18, 2)]
baifiles <- unlist(rdatapath)[seq(2, 18, 2)]

bamurls <- paste0(dat$Location_Prefix, bamfiles)
baiurls <- paste0(dat$Location_Prefix, baifiles)

## BAM files give error
for (bf in bamurls) {
   cat(sprintf("%s\n", basename(bf)))
   tryCatch({
     curl::curl_fetch_disk(bf, tempfile())
   }, error=function(e) message(paste0(e, "\n")))
}

## BAI files do not give error
for (bf in baiurls) {
   cat(sprintf("%s\n", basename(bf)))
   tryCatch({
     curl::curl_fetch_disk(bf, tempfile())
   }, error=function(e) message(paste0(e, "\n")))
}

any further idea??

robert.


Re: [Bioc-devel] httr::GET() problem downloading a ExperimentHub resource

2023-03-29 Thread Martin Morgan
Not really helpful but this could be simplified a bit by removing the redirect 
from experiment hub, and the layer from httr to curl, so

url = "https://functionalgenomics.upf.edu/experimenthub/gdnainrnaseqdata/LiYu22subsetBAMfiles/s32gDNA0.bam"
curl::curl_fetch_disk(url, tempfile())
Error in curl::curl_fetch_disk("https://functionalgenomics.upf.edu/experimenthub/gdnainrnaseqdata/LiYu22subsetBAMfiles/s32gDNA0.bam",  :
  Failed writing received data to disk/application

I notice the index file (extension .bai) works; do other BAM files work, too?

Martin

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] rowSums, colSums, rowMeans, colMeans generics moved from BiocGenerics to MatrixGenerics

2023-03-29 Thread Hervé Pagès
Hi developers,

A couple of days ago I moved the rowSums, colSums, rowMeans, colMeans generics
from *BiocGenerics* to *MatrixGenerics*, and this seems to break a lot of
packages on today's build report for devel, sorry for that. I didn't have
time to look closely at the damage caused by this change yet, but will do
it in a few days and repair as much as possible.

The fix is very simple. Packages that explicitly import these generics
from *BiocGenerics* now need to import them from *MatrixGenerics* (they are
in *MatrixGenerics* >= 1.11.1). However, a lot of packages also fail
because they depend directly or indirectly on a package that tries to
import these generics from the old place. In this case, there's not much to
do; these packages will auto-repair when the packages they depend on get
fixed.
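
As a sketch of what the change might look like in a package that uses roxygen2 (an assumption; packages with a hand-written NAMESPACE would edit the importFrom() directive directly), the import tag would name MatrixGenerics instead of BiocGenerics. The DESCRIPTION file would also need MatrixGenerics (>= 1.11.1) in its Imports field.

```r
# In one of the package's R/ files. roxygen2 turns the tag below into the
# NAMESPACE directive
#   importFrom(MatrixGenerics, rowSums, colSums, rowMeans, colMeans)
# replacing the previous import of the same generics from BiocGenerics.
#' @importFrom MatrixGenerics rowSums colSums rowMeans colMeans
NULL
```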

Sorry for the inconvenience and let me know if you have any questions.

Cheers,
H.



[Bioc-devel] httr::GET() problem downloading a ExperimentHub resource

2023-03-29 Thread Robert Castelo
hi,

we recently added a few new ExperimentHub resources, consisting of BAM 
files and their corresponding BAI files and hosted in my own server. 
while it seems that they are accessible, they cannot be downloaded 
through the ExperimentHub API. the minimum example reproducing the 
problem is this one (using Bioc devel):

library(ExperimentHub)
httr::GET("https://experimenthub.bioconductor.org/fetch/8129")
Error in curl::curl_fetch_memory(url, handle = handle) :
   Failed writing received data to disk/application

while there's apparently no problem to "manually" download the resource 
using 'download.file()' and loading it with 
'GenomicAlignments::readGAlignments()':

download.file("https://experimenthub.bioconductor.org/fetch/8129", "file.bam")
trying URL 'https://experimenthub.bioconductor.org/fetch/8129'
Content type 'application/octet-stream' length 13296358 bytes (12.7 MB)
==
downloaded 12.7 MB

gal <- GenomicAlignments::readGAlignments("file.bam")
gal[1:3]
GAlignments object with 3 alignments and 0 metadata columns:
       seqnames strand      cigar qwidth start   end width
   [1]     chr1      +      49M1S     50 16208 16256    49
   [2]     chr1      +      3S47M     50 16976 17022    47
   [3]     chr1      - 10M177N40M     50 17046 17272   227
       njunc
   [1]     0
   [2]     0
   [3]     1
   ---
   seqinfo: 2580 sequences from an unspecified genome

any hint why 'httr::GET()' fails, while 'download.file()' doesn't?

thanks!!

robert.
ps: just to clarify, the 'httr::GET()' example is behind the following 
problem:

eh <- ExperimentHub()
z <- eh[["EH8079"]]
see ?gDNAinRNAseqData and browseVignettes('gDNAinRNAseqData') for documentation
downloading 2 resources
retrieving 2 resources
|==| 100%

Error: failed to load resource
   name: EH8079
   title: RNA-seq data BAM file subset of HRR589632 contaminated with 0% gDNA
   reason: 1 resources failed to download
In addition: Warning messages:
1: download failed
   web resource path: ‘https://experimenthub.bioconductor.org/fetch/8129’
   local file path: ‘/home/rcastelo/.cache/R/ExperimentHub/12ba1aa03_8129’
   reason: Failed writing received data to disk/application
2: bfcadd() failed; resource removed
   rid: BFC3
   fpath: ‘https://experimenthub.bioconductor.org/fetch/8129’
   reason: download failed
3: download failed
   hub path: ‘https://experimenthub.bioconductor.org/fetch/8129’
   cache resource: ‘EH8079 : 8129’
   reason: bfcadd() failed; see warnings()




[Bioc-devel] Important Bioconductor Release Deadlines

2023-03-29 Thread Kern, Lori
Please remember: the Bioconductor 3.16 branch will be frozen on Monday, April 
10th. After that date, no changes will ever be permitted on that branch.

The deadline for packages in devel (Bioconductor 3.17) to pass R CMD build and 
R CMD check is April 21st. You will still be able to make commits past this 
date, but meeting the deadline ensures that any changes pushed to 
git.bioconductor.org are reflected in at least one build report before the 
devel branch is copied to a release 3.17 branch.

Cheers,

Lori Shepherd - Kern
Bioconductor Core Team
Roswell Park Comprehensive Cancer Center
Department of Biostatistics & Bioinformatics
Elm & Carlton Streets
Buffalo, New York 14263




Re: [Rd] Incorrect behavior of ks.test and psmirnov functions with exact=TRUE

2023-03-29 Thread Kurt Hornik
> Alexey Sergushichev writes:

Thanks.  This is now fixed for the upcoming 4.3.0 release.

Best
-k

> Hi,
> I've noticed what I think is an incorrect behavior of stats::psmirnov
> function and consequently of ks.test when run in an exact mode.

> For example:
> psmirnov(1, sizes=c(50, 50), z=1:100, two.sided = FALSE, lower.tail = F,
> exact=TRUE)

> produces 2.775558e-15

> However, the exact value should be 1/combination(100, 50), which is
> 9.9e-30. While the absolute error is small, the relative error is huge, and
> it is not fixed by setting option log.p=T

> To compare, SciPy has a correct implementation in scipy.stats.ks_2samp:
> scipy.stats.ks_2samp(list(range(1,51)), list(range(51, 101)),
> alternative="greater", method="exact")
> returns 9.911653021418333e-30.

> I've tried to dig in a bit and the problem comes down to how the final
> value is calculated in psmirnov function:

> if (log.p & !lower.tail)
> return(log1p(-ret/exp(logdenom)))
> if (!log.p & !lower.tail)
> return(1 - ret/exp(logdenom))

> There exp(logdenom) is a relatively good (but not perfect) approximation of
> combination(100, 50) = 1.008913e+29, ret is also a good approximation of
> combination(100, 50)-1 = 1.008913e+29 but there is not enough double
> precision for 1 - ret/exp(logdenom) to capture 1/combination(100, 50).
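
The cancellation described above can be reproduced in isolation with a few lines of base R. This is a simplified illustration, not the code path inside psmirnov: there, ret and logdenom are themselves approximations, which is presumably why the reported result is about 2.8e-15 rather than exactly zero. The variable names n, naive and direct are chosen here for illustration.

```r
n <- choose(100, 50)   # about 1.008913e+29
ret <- n - 1           # rounds back to n: doubles near 1e29 are about 1.8e13 apart
naive <- 1 - ret / n   # the subtraction cancels completely
direct <- exp(-lchoose(100, 50))  # tail probability computed without subtracting from 1

print(naive)
print(direct)
```

Here naive is exactly 0, while direct agrees with the SciPy value quoted above to many significant digits. Any fix would have to compute the upper tail directly in some such way rather than as 1 minus the lower tail.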

> I don't have time to provide a fix, at least not now, but I think this
> behavior (good absolute error, but poor relative error for small values)
> should at least be mentioned in the manual of the methods psmirnov and/or
> ks.test

> Best,
> Alexey Sergushichev
