Hi,

On 5/21/24 01:58, Vilhelm Suksi wrote:
> Hi!
>
> Excuse the long email, but there are a number of things to be clarified in 
> preparation for submitting the notame package which I have been developing to 
> meet Bioconductor guidelines. As of now it passes almost all of the automatic 
> checks, with the exception of formatting and some functions that are over 50 
> lines long.
>
> Background 1:
> The notame package already has a significant following, and was published in 
> 2020 with an associated protocol article published in the "Metabolomics Data 
> Processing and Data Analysis—Current Best Practices" special issue of the 
> Metabolites journal (https://www.mdpi.com/2218-1989/10/4/135). The original 
> package relies on the MetaboSet container class, which extends ExpressionSet 
> with three slots, namely group_col, time_col and subject_col. These slots are 
> used to store the names of the corresponding sample data columns, and are 
> used as default arguments to most functions. This makes for a more 
> streamlined experience. However, the submission guidelines state that 
> existing classes should be preferred, such as SummarizedExperiment. We will 
> be implementing support for SummarizedExperiment over the summer. We have 
> included a MetaboSet - SummarizedExperiment converter for interoperability.
>
> Q1: Can an initial Bioconductor submission rely on the Metaboset container 
> class? Support for MetaboSet would do well to be included anyways for 
> existing users until it is phased out.
Since you already have a user base, you will need a roadmap for the 
transition from MetaboSet to the new class (MetaboExperiment, see Q2 
below). Bioconductor's 6-month release cycle facilitates this. More on 
this below.
> Q2: Is it ok to extend the SummarizedExperiment class to utilize the three 
> aforementioned slots? It could be called MetaboExperiment. Or should the 
> functions be modified such that said columns are specified explicitly, using 
> SummarizedExperiment?

It's better to define your own SummarizedExperiment extension with the 
three additional slots. This way you will have a container 
(MetaboExperiment) that is semantically equivalent (or close) to 
MetaboSet. This means that: (1) in principle you won't need to modify 
the interface of your existing functions, and (2) you'll be able to 
provide coercion methods to go back and forth between the 
MetaboExperiment and MetaboSet representations (see ?setAs). Overall 
this should make the transition from MetaboSet to MetaboExperiment 
easier and smoother.
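
FWIW, here's a rough sketch of what the class extension and coercion 
could look like. The slot, accessor, and helper names below are 
illustrative only (not the actual notame API), and it assumes you 
already have code that turns a MetaboSet into a plain 
SummarizedExperiment:

    library(SummarizedExperiment)

    ## Hypothetical MetaboExperiment class: a SummarizedExperiment plus
    ## the three column names that the notame functions use as defaults.
    setClass("MetaboExperiment",
        contains = "SummarizedExperiment",
        slots = c(group_col = "character",
                  time_col = "character",
                  subject_col = "character"))

    ## One of the getters (the other two slots would get the same treatment).
    setGeneric("groupCol", function(x) standardGeneric("groupCol"))
    setMethod("groupCol", "MetaboExperiment", function(x) x@group_col)

    ## Coercion from MetaboSet to MetaboExperiment; the reverse direction
    ## is defined the same way. as_summarized_experiment() stands for
    ## whatever converter you already have, and group_col()/time_col()/
    ## subject_col() are assumed to be the MetaboSet getters.
    setAs("MetaboSet", "MetaboExperiment", function(from) {
        se <- as_summarized_experiment(from)   # hypothetical converter
        new("MetaboExperiment", se,
            group_col = group_col(from),
            time_col = time_col(from),
            subject_col = subject_col(from))
    })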

This transition would roughly look something like this:

1. Submit the MetaboSet-based version of the package for inclusion in 
BioC 3.20.

2. After the 3.20 release (next Fall), make the following changes in the 
devel branch of the package:

- Implement the MetaboExperiment class + accessors (getters/setters) + 
constructor function(s) + show() method.

- Implement the coercion methods to go from MetaboSet to 
MetaboExperiment and vice versa.

- Modify the implementation of all the functions that deal with 
MetaboSet objects so that they deal with MetaboExperiment objects 
instead. This will be the primary representation they handle. If they 
receive a MetaboSet, they immediately replace it with a 
MetaboExperiment using as(..., "MetaboExperiment").

- Modify all the documentation, unit tests, and serialized objects 
accordingly.

3. Now you are ready to deprecate the MetaboSet class. I recommend that 
you also do this in the devel branch, before the 3.21 release. There 
are no well-established guidelines for deprecating an S4 class; my 
suggestion is to call .Deprecated() to display a deprecation message in 
its show() method, constructor function(s), getters/setters, and the 
coercion method from MetaboExperiment to MetaboSet (see the sketch 
after step 4 below).

4. After the 3.21 release (Spring 2025), make the MetaboSet class 
defunct by replacing all the .Deprecated() calls with .Defunct() calls.
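
For concreteness, here's a minimal sketch of the patterns involved in 
steps 2-4 (function names and message wording are illustrative, not 
taken from notame):

    ## Step 2 pattern: coerce on entry, so the function body only ever
    ## deals with MetaboExperiment objects.
    some_notame_function <- function(object, ...) {
        if (is(object, "MetaboSet"))
            object <- as(object, "MetaboExperiment")
        ## ... rest of the body, written against MetaboExperiment ...
    }

    ## Step 3 pattern: signal deprecation wherever users still touch a
    ## MetaboSet, e.g. in its show() method.
    setMethod("show", "MetaboSet", function(object) {
        .Deprecated(msg = "The MetaboSet class is deprecated; coerce with as(x, \"MetaboExperiment\").")
        callNextMethod()
    })

    ## Step 4 is then mechanical: the same calls become e.g.
    ##   .Defunct(msg = "The MetaboSet class is defunct; use MetaboExperiment instead.")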

> Background 2:
> The notame package caters to untargeted LC-MS data analysis metabolic 
> profiling experiments, encompassing data pretreatment (quality control, 
> normalization, imputation and other steps leading up to feature selection) 
> and feature selection (univariate analysis and supervised learning). Raw data 
> preprocessing is not supported. Instead, the package offers utilities for 
> flexibly reading peak tables from an Excel file, resulting from various 
> point-and-click software such as MS-DIAL. As such, data in Excel format needs 
> to be included, but is not available in any Bioconductor package, although 
> such Excel data could be procured from existing data in Bioconductor. 
> However, existing untargeted LC-MS data in Bioconductor can not be used, as 
> is, to demonstrate the full functionality of the notame package. With regard 
> to feature data, there needs to be several analytical modes. Sample data 
> needs to include study group, time point, subject ID and several batches. 
> Blank samples would be good as well. Packages I have checked for data with 
> the above specifications include FaahKO, MetaMSdata, msdata, msqc1, mtbls2, 
> pmp, PtH2O2lipids, and ropls. As of now, the example data is not realistic in 
> that it is scrambled and I have not yet been informed of the origin and 
> modification of the data.
>
> Q3: If I get access to information about the origin and modification of the 
> now used data, can I further modify it to satisfy the needs of the package 
> for an initial Bioconductor release? Or does it need to be realistic? 
> Consider this the explicit pre-approval inquiry for including data in the 
> notame package.
I'm not sure I fully understand the question (or its connection with 
Excel), but yes, you can include unrealistic data in the package, as 
long as it allows you to properly illustrate the basic usage of your 
functions in the man pages and/or vignette(s). It can also be useful to 
have small (and unrealistic) data for the unit tests. The important 
thing here is that the data must be small.
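
As a rough illustration of scale, something along these lines (entirely 
made-up values, just enough structure to exercise the code paths in 
examples and tests) is typically plenty:

    ## Tiny synthetic peak table: 10 features x 6 samples, two groups,
    ## two batches, three subjects measured at two time points.
    set.seed(42)
    toy_assay <- matrix(rlnorm(60), nrow = 10,
                        dimnames = list(paste0("feature_", 1:10),
                                        paste0("sample_", 1:6)))
    toy_pheno <- data.frame(group   = rep(c("A", "A", "B"), 2),
                            time    = rep(c(1, 2), each = 3),
                            subject = rep(c("s1", "s2", "s3"), 2),
                            batch   = rep(c("b1", "b2"), 3),
                            row.names = colnames(toy_assay))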
> Q4: Do you think a separate ExperimentData package satisfying the 
> specifications laid out in Background 2 is warranted? This could be included 
> in a future version with SummarizedExperiment/MetaboExperiment support.
It depends on the size of the data. For a software package, we limit 
the size of the source tarball to 5 MB. So if you're going to exceed 
that limit, then the datasets need to go in an experiment data package.
>
> Q5: The instructions state that the data needs to be documented 
> (https://contributions.bioconductor.org/docs.html#doc-inst-script). Is the 
> availability of the original data strictly necessary?  I notice many packages 
> don't include documentation on how the data was procured.

The availability of the original data is not strictly necessary, but 
the data still needs to be documented, i.e. what it is, where it comes 
from, how it was imported/transformed, etc.
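
If it helps, the usual pattern (assuming you document the data object 
with roxygen2; the object and file names below are placeholders) is a 
block like this in R/data.R, plus a script under inst/script/ that 
records how the object was produced:

    #' Toy LC-MS peak table
    #'
    #' A small, scrambled peak table used in the examples and unit tests.
    #'
    #' @format A data frame with 10 features (rows) and 6 samples (columns).
    #' @source Derived from an in-house untargeted LC-MS experiment; see
    #'   inst/script/make_toy_data.R for how the original data was subset
    #'   and scrambled.
    "toy_peaks"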

Best,

H.

>
> Thanks,
> Vilhelm Suksi
> Turku Data Science Group
> vks...@utu.fi
>

-- 
Hervé Pagès

Bioconductor Core Team
hpages.on.git...@gmail.com



_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
