Re: [Bioc-devel] Missing CHM13v2.0 TxDB and OrgDb objects

2023-12-12 Thread Christian Arnold via Bioc-devel
Dear Vincent and others,

thanks for the reply! Irrespective of whether a different OrgDb is
required, the name itself suggested that there "should be" also
corresponding OrgDb and TxDb packages. I can build one on my own, I see,
is there anyone who works on providing the TxDB object for Bioc?

I am also asking this because the T2T people specifically provide an
"updated" gene annotation dataset which may differ from what's inside
OrgDb and may be incompatible with? See here:
https://github.com/marbl/CHM13:

/JHU RefSeqv110 + Liftoff v5.1
<https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13v2.0_RefSeq_Liftoff_v5.1.gff3.gz>:
This containscuratedannotations of the ampliconic genes on the Y
chromosome, correcting annotation errors in GENCODEv35 CAT/Liftoff and
RefSeqv110 annotation. Additional copies found in T2T-Y were annotated
to the closest available gene in RefSeq, allowing multiple genes to have
the same common name. This file has been modified to correct special
character issues from the original file.
/

/
/

For ArchR, I tried to understand how one can create a new genome by
checking here:
https://www.archrproject.com/bookdown/getting-set-up.html. There, they
explicitly mention the TxDb and OrgDb objects that are needed for
building a custom genome. There seems to be another option when both or
any of these 2 is not available ("Alternatively, if you dont have
a|TxDb|and|OrgDb|object, you can create a|geneAnnotation|object from the
following information" ), but I first tried to do it the easy way as I
want to properly embed it in a pipeline with as little "custom" code as
possible.


Thanks,
Christian




On 11/12/2023 15:30, Vincent Carey wrote:
> Thanks Jim, I tend to agree with you.  Christian, I had a look at
> ArchR but could not tell where the
> system contacts the Bioc annotation elements.  Can you give some
> hints?  I'd like to be able to
> verify compatibility.
>
> On Mon, Dec 11, 2023 at 9:19 AM James W. MacDonald  wrote:
>
> I don't believe a different OrgDb is required. The OrgDb package
> is meant to provide annotations for genes such as gene symbol or
> GO term, etc, which are orthogonal to the sequence of the genome,
> so the current version should suffice.
>
> -Original Message-
> From: Bioc-devel  On Behalf Of
> Vincent Carey
> Sent: Sunday, December 10, 2023 1:44 PM
> To: Christian Arnold 
> Cc: bioc-devel@r-project.org
> Subject: Re: [Bioc-devel] Missing CHM13v2.0 TxDB and OrgDb objects
>
> Good question.  I believe these will be forthcoming soon.  In the
> mean time you can create your own.  See, for example
>
> 
> https://urldefense.com/v3/__https://github.com/vjcitn/BiocT2T/blob/devel/inst/scripts/makeTxDb.R__;!!K-Hz7m0Vt54!ixhBX1kJeZc-9e3gcVgd5OOsvXj8vYfmUZphWadsaXZmdIMiLYcLZEGkJmZhkFTxT-wXY5c_hr0C9adMcpWaIEw$
>
>
> It's an active area so you can pull a gff file from
> 
> https://urldefense.com/v3/__https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T*CHM13*assemblies*annotation*__;Ly8vLw!!K-Hz7m0Vt54!ixhBX1kJeZc-9e3gcVgd5OOsvXj8vYfmUZphWadsaXZmdIMiLYcLZEGkJmZhkFTxT-wXY5c_hr0C9adM7PNUeks$
> and adjust the code noted above for the TxDb.
>
> For the org.db I have to get back to you.
>
> On Sun, Dec 10, 2023 at 12:06 PM Christian Arnold via Bioc-devel <
> bioc-devel@r-project.org> wrote:
>
> > Hello, I am working with the new human T2T-CHM13v2.0 assembly and
> > while a BSgenome package already exists
> > (BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0), I could not find the
> > corresponding TxDb and OrgDb packages. Is there any information
> when
> > they may also become available so it is easier to work with the new
> > genome for packages like ArchR, which support a custom genome
> but need
> > these standard annotation packages for their creation?
> >
> >
> > Thanks a lot for any information regarding this!
> >
> > Best, Christian
> >
> > ___
> > Bioc-devel@r-project.org mailing list
> >
> https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/bioc
> >
> -devel__;!!K-Hz7m0Vt54!ixhBX1kJeZc-9e3gcVgd5OOsvXj8vYfmUZphWadsaXZmdIM
> > iLYcLZEGkJmZhkFTxT-wXY5c_hr0C9adMOtbUwTc$
> >
>
> --
> The information in this e-mail is intended only for the
> ...{{dropped:18}}
>
> ___
> Bioc-devel@r-project.org mailing list
> 
> https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/bioc-devel__;!!K-Hz7m0Vt54!ixhBX1kJeZc-9e3gcVgd5OOsvXj8vYfmUZphWadsaXZmdIMiLYcLZEGkJmZhkFTxT-wXY5c_hr0C9adMOtbUwTc$
>
>
>
> The information in this e-mail is intended only for th...{{dropped:16}}

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] Missing CHM13v2.0 TxDB and OrgDb objects

2023-12-10 Thread Christian Arnold via Bioc-devel

Hello, I am working with the new human T2T-CHM13v2.0  assembly and while
a BSgenome package already exists
(BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0), I could not find the
corresponding TxDb and OrgDb packages. Is there any information when
they may also become available so it is easier to work with the new
genome for packages like ArchR, which support a custom genome but need
these standard annotation packages for their creation?


Thanks a lot for any information regarding this!

Best, Christian

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] Package problems due to results() function from other package?

2023-10-28 Thread Christian Arnold
For my package SNPhood that did not receive any code changes or updates
in quite a while, I suddenly see errors with Bioc 3.18:
https://master.bioconductor.org/checkResults/3.18/bioc-LATEST/SNPhood/nebbiolo2-buildsrc.html

Error: processing vignette 'workflow.Rmd' failed with diagnostics:
unused argument (type = "allelicBias")

This comes from this line I think:

names(results(SNPhood.o, type = "allelicBias"))

For literally years, this didnt cause any problems, and the results
function is actually (re)defined in the SNPhood package:

results <- function(SNPhood.o, type, elements = NULL)

I am not sure now what causes this. Should I use the syntax
SNPhood::results to make it clear, or I am wrongly assuming that the
wrong result function is taken that causes the error?

Any pointers?


Best

Christian


[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] Bioc-issue-bot remarks missing / wrong biocViews field in DESCRIPTION

2022-03-04 Thread Christian Arnold
Hi, for a new package submission
(https://github.com/Bioconductor/Contributions/issues/2550), I receive
an automated bot reply with the following:

"The package DESCRIPTION must contain a biocViews field with one or more
valid biocViews terms.

Please fix your DESCRIPTION. See current biocViews
"


However, my DESCRIPTION does contain a biocViews section and I run
BiocCheck() without warnings and errors (see
https://github.com/chrarnold/GRaNIE/blob/main/DESCRIPTION). I also
manually checked all biocViews terms with the current list, and they all
seem to be valid. What is the problem?

I cannot figure it out. Also, is there any way to re-open an issue or I
really need to re-create a new issue whenever I try a modification when
automated closing of a new issue occurs?


Best

Christian

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] GenomeInfoDb::fetchExtendedChromInfoFromUCSC broken again?

2017-01-31 Thread Christian Arnold


Hi,

it appears that in the newest devel version, 
fetchExtendedChromInfoFromUCSC is broken (again). I found an earlier 
post here: 
https://www.mail-archive.com/bioc-devel@r-project.org/msg06193.html, 
which however states that this has been fixed already in the previous 
version 1.11.2 of the package:



library(GenomeInfoDb)

fetchExtendedChromInfoFromUCSC("mm9")

> ("mm9")
Error in file(file, "rt") : cannot open connection
In addition: Warning messages:
1: In FUN(genome = names(SUPPORTED_UCSC_GENOMES)[idx], circ_seqs = 
supported_genome$circ_seqs,  :
  NCBI seqlevel was set to NA for mm9 UCSC seqlevel(s) not in the NCBI 
assembly: chr1_random, chr3_random, chr4_random, chr5_random, chr7_random,
  chr8_random, chr9_random, chr13_random, chr16_random, chr17_random, 
chrX_random, chrY_random, chrUn_random

2: In file(file, "rt") :
  URL 
'https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_01635.18.assembly.txt': 
status was '404 Not Found'




sessionInfo()
R Under development (unstable) (2016-06-30 r70858)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C LC_TIME=de_DE.UTF-8
LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_DE.UTF-8LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=de_DE.UTF-8   LC_NAME=C LC_ADDRESS=C   
LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C


attached base packages:
[1] stats4parallel  stats graphics  grDevices utils datasets  
methods   base


other attached packages:
[1] GenomeInfoDb_1.11.6 IRanges_2.9.16  S4Vectors_0.13.11 
BiocGenerics_0.21.3


loaded via a namespace (and not attached):
[1] tools_3.4.0

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] lpsymphony - BiocParallel crash

2016-09-21 Thread Christian Arnold
Dear Bioconductor developers,


I am having a somewhat mysterious and challenging problem which we 
believe is a bug in either (1) BiocParallel or (2) the lpsymphony 
library from either Bioconductor or the SYMPHONY backend.

The problem is, in a nutshell, that R silently crashes or, to be more 
precise, never finishes the calculation when using the only function 
from lpsymphony in combination with BiocParallel.

We updated the latest version of lpsymphony last week, did not help. 
This can be reproduced with two different configurations, using 
different library versions of both BiocParallel and lpsymphony, 
respectively. We also tested this on two different machines (one of 
which runs R in the devel version, one does not; see the two 
sessionInfo() at the end of this message)


The following code can reproduce the problem:


/library(BiocParallel)/

/library(lpsymphony)/


/lpsymphonyTest <- function(x){/

/# Example from the R help of lpsymphony_solve_LP/

/obj <- c(2, 4, 3)/

/mat <- matrix(c(3, 2, 1, 4, 1, 3, 2, 2, 2), nrow = 3)/

/dir <- c("<=", "<=", "<=")/

/rhs <- c(60, 40, 80)/

/max <- TRUE/

/test = lpsymphony_solve_LP(obj, mat, dir, rhs, max = max)/

/Sys.sleep(1)/

/return(1)/

/}/


/# First run: Two iterations, should run in parallel/

/nIterationsAndCores = 2/

/register(MulticoreParam(workers = nIterationsAndCores))/

/commonGenes <- bplapply(X = seq(2), FUN = lpsymphonyTest) /


/# Second run: One iteration, should run on only one core/

/commonGenes <- bplapply(X = seq(1), FUN = lpsymphonyTest) /


/# Third run: Two iterations again, should run in parallel but crashed 
and never finishes/

/commonGenes <- bplapply(X = seq(2), FUN = lpsymphonyTest) /


What happens in our case is that the third run never finishes. The two R 
processes are in state “SLEEP” rather than running. Once you execute the 
bplapply loop for only one iteration, everything after with number of 
iterations > 1 crashes. It works fine for any number of iterations > 1 
despite of the order, but again, executing only a single iteration 
causes the problems afterwards. Note that using registering again for 
the second and third run does not help and yields the same behavior.



In fact, we have another issue, namely that the parallelization with 
BiocParallel in combination with the IHW package (which calls lpsymphony 
in the background) does not yield running times as expected but instead 
much, much longer ones. However, let's focus on this one first, because 
we think they might be related.


Thanks for your help,


Best

Christian




These are the two configurations where this problem can be reproduced:


(1)

/R Under development (unstable) (2016-06-30 r70858)/

//

/Platform: x86_64-pc-linux-gnu (64-bit)/

//

/Running under: Ubuntu 16.04.1 LTS/

//

/
/

//

/locale:/

//

/[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 
LC_COLLATE=en_US.UTF-8 /

//

/[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8 
LC_PAPER=de_DE.UTF-8 LC_NAME=C /

//

/[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 
LC_IDENTIFICATION=C /

//

/attached base packages:/

//

/[1] stats graphics grDevices utils datasets methods base /

//

/other attached packages:/

//

/[1] lpsymphony_1.1.2 BiocParallel_1.7.8/

//

/loaded via a namespace (and not attached):/

//

/[1] parallel_3.4.0 tools_3.4.0 /



(2)

/R version 3.3.1 (2016-06-21)/

//

/Platform: x86_64-pc-linux-gnu (64-bit)/

//

/Running under: CentOS release 6.5 (Final)/

//

/
//locale:/

//

/[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 /

//

/[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 
LC_MESSAGES=en_US.UTF-8 /

//

/[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C /

//

/[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C /

//

/
//attached base packages:/

//

/[1] stats graphics grDevices utils datasets methods base /

//

/
//other attached packages:/

//

/[1] lpsymphony_1.0.2 BiocParallel_1.6.6/

//

/
//loaded via a namespace (and not attached):/

//

/[1] parallel_3.3.1 tools_3.3.1 /


[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

[Bioc-devel] Rsamtools: How to import only the first 1000 or so reads

2016-01-21 Thread Christian Arnold

Dear Bioc-Community,

I was wondering if it is currently possible to only import the first X 
reads rather than all reads from an arbitrary BAM file into a list using 
Rsamtools with scanBam? I did not find any parameter in scanBamParam 
that seems to capture what I need. Specifically, I do not want to 
specify genomic regions because for arbitrary BAM files, there might not 
be any read overlapping. Command-line solutions are also not an option 
here because I want to integrate this into the SNPhood package and it 
should run on Windows too.


The reason for this task is that I want to automatically determine the 
read length from a BAM file, and my currently favoured strategy is to 
import the first 1000 reads or so and then determine the maximum length 
of the sequence, which should be a good proxy for the real read length.


Thanks for suggestions,
Christian

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] Reducing memory footprint of large object

2015-11-05 Thread Christian Arnold


Hi all,

I wanted to ask around in this list with full of experts if any of you 
have an advice about the following problem:


I got a large SNPhood object from someone (package SNPhood, which I 
developed) from an analysis of 200.000 SNPs or so that stores lots of 
read counts and the positions of overlapping reads in general. In total, 
the object is 2 GB large. I examined the object and identified the slots 
that need the most memory. In this particular slot, a nested list is 
stored that saves the read start positions of all overlapping reads for 
each SNP region.


For example, for one individual, a list of length 120,049 with integer 
vectors, with 20,853,838 elements within the vectors in total:


> 
format(object.size(SNPhood.o@internal$readStartPos$ambiguous$GM12878), 
units = "Mb")

[1] "86 Mb"

Unsurprisingly, when unlisting, this can only be a bit improved:
format(object.size(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878)), 
units = "Mb")

"79.6 Mb"

> length(SNPhood.o@internal$readStartPos$ambiguous$GM12878)
[1] 120049
> length(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878))
[1] 20853838

The vector of read start positions may look like this:

head(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878),50)
 [1] 714086 714087 714088 714089 714099 714100 714106 714108 714110 
714114 714114 714123 714123 714123 714125 714125 714128 714130 714138 
714139 714145 714148 714149 714150 714151 714152 714154 714164 714164 
714172 714173 714184 714186 714187 714188 714189 714192 714194 714198 
714204 714206 714209 714209 714212 714216 714219 714219 714223 714224 714224


So there are a few reads with identical start sites, but this does not 
occur too often. I indeed need all of this information for further 
processing.


Do you have any idea if I can save this information more efficiently so 
that the overall object size is reduced? I could try an Rle, but the 
structure of the data does not be ideal for this...


Any tips are very much appreciated!

Thanks,
Christian

--
—
Christian Arnold, PhD
Staff Bioinformatician

SCB Unit - Computational Biology
Joint appointment Genome Biology
Joint appointment European Bioinformatics Institute (EMBL-EBI)

European Molecular Biology Laboratory (EMBL)
Meyerhofstrasse 1; 69117, Heidelberg, Germany

Email: christian.arn...@embl.de
Phone: +49(0)6221-387-8472
Web: http://www.zaugg.embl.de/

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

[Bioc-devel] Issues with R CMD check and an ExperimentalData package

2015-09-22 Thread Christian Arnold


Hi all,

I am sorry for writing again, but I didn't find anything in the internet 
and I hope someone can help me here.


Maybe I am doing something stupid, but I am having troubles running R 
CMD check for an ExperimentalData package (biocViews: ExperimentData). 
it has no actual code, only a few files in inst/extdata. And one 
vignette located in "vignettes" (Rmd, knitr)



 I receive the following warnings:
1. * checking for unstated dependencies in examples ... WARNING
no parsed files found
2. * checking re-building of vignette outputs ...Warning in 
file.copy(pkgdir, vd2, recursive = TRUE) : too deep nesting

Warning in file.copy(pkgdir, vd2, recursive = TRUE) : too deep nesting
3 . * checking PDF version of manual ... WARNING
LaTeX errors when creating PDF version.
This typically indicates Rd problems.
4. * checking PDF version of manual without hyperrefs or index ... ERROR

Probably, 3 and 4 are just consequences of 2, but I don't understand 2 
in particular. The vignette is very short, and compiles fine in RStudio.

How do I fix this, what is wrong? How do I fix 1?

R CMD BiocCheck --new-package runs without errors(required), only a 
recommendation to remove vignette sources from inst/doc



Any help is appreciated, thanks,
Christian


--
—
Christian Arnold, PhD
Staff Bioinformatician

SCB Unit - Computational Biology
Joint appointment Genome Biology
Joint appointment European Bioinformatics Institute (EMBL-EBI)

European Molecular Biology Laboratory (EMBL)
Meyerhofstrasse 1; 69117, Heidelberg, Germany

Email: christian.arn...@embl.de
Phone: +49(0)6221-387-8472
Web: http://www.zaugg.embl.de/

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Proper way of documenting a BiocGenerics generics with extended prototypes

2015-09-21 Thread Christian Arnold
Hi Jim, Kasper, and Hervé,

thanks a lot for the quick answer. For some reason, I was under the 
impression that I have to use exactly the original prototype from the 
generics, but I didn't fully realize I can extend this arbitrarily as 
long as the original arguments - object and ... in this case, are among 
the extended list.

It works now, perfect, no complaints.

*
**A related question that I have is the following, maybe someone can 
share with me his or her thoughts: *

In addition to counts (for which a generics already exists), users can 
extract "enrichment" instead of counts out of my object. Because there 
is a generics for counts, I thought it makes conceptually sense to also 
have a generics for enrichment then, because "enrichment" may also be 
something that other packages may use as a general terminology in a 
genomic context. So currently, I am creating one in an identical fashion 
to counts.

Similarly, I defined one for "parameters" to extract parameters out of 
the object that were defined previously in the main function that 
creates the object.

For both "enrichment" and "parameters", I could of course also regular 
functions, but I was wondering which of the two approaches I should 
take. Any thoughts?


Thanks a lot, very much appreciated!
Christian



On 20.09.2015 21:09, Jim Hester wrote:
> Christian,
>
> You need to use your extended prototype in the function definition 
> within your `setMethod()` call.  In particular you do _not_ need to 
> write an explicit `@usage` directive.
>
> See 
> https://github.com/jimhester/examplePrototype/blob/master/R/package.R 
> for an example package without any NOTEs for this situation.
>
> Jim
>
> On Sun, Sep 20, 2015 at 1:23 PM, Christian Arnold 
> <christian.arn...@embl.de <mailto:christian.arn...@embl.de>> wrote:
>
> Hi all,
>
> I am currently producing the documentation for a new hopefully
> soon-to-be-submitted Bioconductor package and I am facing some
> difficulties with R CMD check.
> In summary: I want to dispatch the existing generic for "counts"
> to make it usable with my object of type SNPhood as well.
>
> The original prototype from BiocGenerics:
>
> standardGeneric for "counts" defined from package "BiocGenerics"
>
> function (object, ...)
>
>
>
> I am calling, as you can see, an internal function for this.
> Currently, because of its original prototype, I use:
>
> setMethod("counts", "SNPhood", function(object, ...)
> {.getCounts(object, ...)})
>
>
>
> However, in my case, I want to use an extended prototype to
> specify which counts exactly should be extracted.
>
> .getCounts <- function(SNPhood.o, type, readGroup = NULL, dataset
> = NULL, ...)
>
>
>
> Everything works, but I want to document the additional arguments
> properly. However, with roxygen 2, I am not sure how to achieve
> that without producing warnings.
>
> I am using:
>
> #' @usage counts (object, type, readGroup, dataset)
>
> #' @template object
>
> #' @param type bla
>
> #' @template readGroup
>
> #' @template dataset
>
>
> This produces:
>
> * checking for code/documentation mismatches ... WARNING
> Codoc mismatches from documentation object 'counts,SNPhood-method':
> counts
>   Code: function(object, ...)
>   Docs: function(object, type, readGroup, dataset)
>   Argument names in code not in docs:
> ...
>   Argument names in docs not in code:
> type readGroup dataset
>   Mismatches in argument names:
> Position: 2 Code: ... Docs: type
>
>
>
> So, what is the recommended way here? I could write an own "new"
> counts function, but I thought if a generics already exists, I
> should use it because this is exactly what my function does also...
>
>
> Thanks,
> Christian
>
> -- 
> —
> Christian Arnold, PhD
> Staff Bioinformatician
>
> SCB Unit - Computational Biology
> Joint appointment Genome Biology
> Joint appointment European Bioinformatics Institute (EMBL-EBI)
>
> European Molecular Biology Laboratory (EMBL)
> Meyerhofstrasse 1; 69117, Heidelberg, Germany
>
> Email: christian.arn...@embl.de <mailto:christian.arn...@embl.de>
> Phone: +49(0)6221-387-8472 <tel:%2B49%280%296221-387-8472>
> Web: http://www.zaugg.embl.de/
>
> ___
> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mail

[Bioc-devel] Proper way of documenting a BiocGenerics generics with extended prototypes

2015-09-20 Thread Christian Arnold

Hi all,

I am currently producing the documentation for a new hopefully 
soon-to-be-submitted Bioconductor package and I am facing some 
difficulties with R CMD check.
In summary: I want to dispatch the existing generic for "counts" to make 
it usable with my object of type SNPhood as well.


The original prototype from BiocGenerics:

standardGeneric for "counts" defined from package "BiocGenerics"

function (object, ...)



I am calling, as you can see, an internal function for this. Currently, 
because of its original prototype, I use:


setMethod("counts", "SNPhood", function(object, ...) {.getCounts(object, ...)})



However, in my case, I want to use an extended prototype to specify 
which counts exactly should be extracted.


.getCounts <- function(SNPhood.o, type, readGroup = NULL, dataset = NULL, ...)



Everything works, but I want to document the additional arguments 
properly. However, with roxygen 2, I am not sure how to achieve that 
without producing warnings.


I am using:

#' @usage counts (object, type, readGroup, dataset)

#' @template object

#' @param type bla

#' @template readGroup

#' @template dataset


This produces:

* checking for code/documentation mismatches ... WARNING
Codoc mismatches from documentation object 'counts,SNPhood-method':
counts
  Code: function(object, ...)
  Docs: function(object, type, readGroup, dataset)
  Argument names in code not in docs:
...
  Argument names in docs not in code:
type readGroup dataset
  Mismatches in argument names:
Position: 2 Code: ... Docs: type



So, what is the recommended way here? I could write an own "new" counts 
function, but I thought if a generics already exists, I should use it 
because this is exactly what my function does also...



Thanks,
Christian

--
—
Christian Arnold, PhD
Staff Bioinformatician

SCB Unit - Computational Biology
Joint appointment Genome Biology
Joint appointment European Bioinformatics Institute (EMBL-EBI)

European Molecular Biology Laboratory (EMBL)
Meyerhofstrasse 1; 69117, Heidelberg, Germany

Email: christian.arn...@embl.de
Phone: +49(0)6221-387-8472
Web: http://www.zaugg.embl.de/

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Overloading subset operator for an S4 object with more than two dimensions

2015-05-18 Thread Christian Arnold

Thanks for your input, highly appreciated!

I can see that the semantics of [ are violated, so I agree that 
overwriting the subset method  is probably a better way to go. 
Essentially, the object stores several, individual-specific count 
matrices from RNA-Seq experiments in an potentially allele(read 
group)-specific manner. So the dimensions to subset on are the read 
groups, the rows and columns of the matrices, and the individuals itself.

So I guess overloading the subset method with four arguments, each 
corresponding to one of the dimensions a subset is suitable for this 
kind of object, is the way to go.

Thanks,
Christian


On 14.05.2015 15:57, Michael Lawrence wrote:
 I agree with Wolfgang that the semantics of [ are being violated here. 
 It would though help if you could be a little less vague about your 
 intent. What is this data structure going to store, how should it behave?

 On Thu, May 14, 2015 at 3:35 AM, Christian Arnold 
 christian.arn...@embl.de mailto:christian.arn...@embl.de wrote:

 Hi there,

 I am about to develop a Bioconductor package that implements a
 custom S4 object, and I am currently thinking about a few issues,
 including the following:

 Say we have an S4 object that stores a lot of information in
 different slots. Assume that it does make sense to extract
 information out of this object in four different dimensions
 (conceptually similar to a four-dimensional object), so one would
 like to use the subset [ operator for this, but extending beyond
 the typical one or two dimensions to 4:

 setClass(A,
 
 representation=representation(a=numeric,b=numeric,c=numeric,d=numeric))
 a = new(A, a=1:5,b=1:5,c=1:5,d=1:5)

 Now it would be nice to do stuff like a[1,2,3:4,5], which should
 simply return the selected elements in slots a, b, c, and d,
 respectively. So a[1,2,3:4,5] would return:

 An object of class A
 Slot a:
 [1] 1

 Slot b:
 [1] 2

 Slot c:
 [1] 3 4

 Slot d:
 [1] 5

 This is how far I've come:

 setMethod([, c(A, ANY, ANY,ANY),
   function(x, i, j, ..., drop=TRUE)
   {
 dots - list(...)
 if (length(dots)  2) {
   stop(Too many arguments, must be four dimensional)
 }

 # Parse the extra two dimensions that we need from the
 ... argument
 k = ifelse(length(dots)  0 , dots[[1]], c(1:5))
 l = ifelse(length(dots) == 2, dots[[2]], c(1:5))

 initialize(x, a=x@a[i],b=x@b[j],c=x@c[k],d=x@d[l])
   })

 This works for stuff like a[1,2,3, 4], but fails with a general
 error if one of the indices is a vector such as a[1:2,2,3, 4] or
 a[1,2,3,4:5].


 So, in summary, my questions are:
 1. Is there a reasonable way of achieving the 4-dimensional
 subsetting that works as a user would expect it to work?
 2. Does it make more sense to write a custom function instead to
 achieve this, such as subsetObject() without overloading [
 explicitly? What are the Bioconductor recommendations here?

 I'd appreciate any help, suggestions, etc!

 Thanks,
 Christian

 ___
 Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing
 list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel



-- 
—
Christian Arnold, PhD
Staff Bioinformatician

SCB Unit - Computational Biology
Joint appointment Genome Biology
Joint appointment European Bioinformatics Institute (EMBL-EBI)

European Molecular Biology Laboratory (EMBL)
Meyerhofstrasse 1; 69117, Heidelberg, Germany

Email: christian.arn...@embl.de
Phone: +49(0)6221-387-8472
Web: http://www.embl.de/research/units/scb/zaugg/


[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] Overloading subset operator for an S4 object with more than two dimensions

2015-05-14 Thread Christian Arnold

Hi there,

I am about to develop a Bioconductor package that implements a custom S4 
object, and I am currently thinking about a few issues, including the 
following:


Say we have an S4 object that stores a lot of information in different 
slots. Assume that it does make sense to extract information out of this 
object in four different dimensions (conceptually similar to a 
four-dimensional object), so one would like to use the subset [ 
operator for this, but extending beyond the typical one or two 
dimensions to 4:


setClass(A, 
representation=representation(a=numeric,b=numeric,c=numeric,d=numeric))

a = new(A, a=1:5,b=1:5,c=1:5,d=1:5)

Now it would be nice to do stuff like a[1,2,3:4,5], which should simply 
return the selected elements in slots a, b, c, and d, respectively. So 
a[1,2,3:4,5] would return:


An object of class A
Slot a:
[1] 1

Slot b:
[1] 2

Slot c:
[1] 3 4

Slot d:
[1] 5

This is how far I've come:

setMethod([, c(A, ANY, ANY,ANY),
  function(x, i, j, ..., drop=TRUE)
  {
dots - list(...)
if (length(dots)  2) {
  stop(Too many arguments, must be four dimensional)
}

# Parse the extra two dimensions that we need from the ... 
argument

k = ifelse(length(dots)  0 , dots[[1]], c(1:5))
l = ifelse(length(dots) == 2, dots[[2]], c(1:5))

initialize(x, a=x@a[i],b=x@b[j],c=x@c[k],d=x@d[l])
  })

This works for stuff like a[1,2,3, 4], but fails with a general error if 
one of the indices is a vector such as a[1:2,2,3, 4] or a[1,2,3,4:5].



So, in summary, my questions are:
1. Is there a reasonable way of achieving the 4-dimensional subsetting 
that works as a user would expect it to work?
2. Does it make more sense to write a custom function instead to achieve 
this, such as subsetObject() without overloading [ explicitly? What 
are the Bioconductor recommendations here?


I'd appreciate any help, suggestions, etc!

Thanks,
Christian

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel