Re: RRID update on salsa on packages starting with A+B

Andreas Tille Tue, 27 Mar 2018 00:05:28 -0700

Hi Steffen,

On Mon, Mar 26, 2018 at 07:23:24PM +0200, Steffen Möller wrote:
> 
> I just procrastinated a bit into using the comfort of salsa to update
> debian/upstream/metadata and here the references to SciCrunch, OMICtools and
> bio.tools registries. All three registries have improved their coverage
> enormously over the past few months. I am deeply impressed.


Thanks a lot for the large update.
 
> Anyway. I came across
> 
>  * one or two entries

Which ones?

> that had perfect RRID descriptions on salsa but not on
> our task page - does the package need to be re-uploaded for the change to
> become visible?

Re-uploading is *not* needed.  The data come from the Salsa Git
repositories (about two weeks ago the machine-readable gatherer was
switched from Alioth to Salsa).  However, there is a delay of about 24
hours between a commit and the data becoming visible on the web
sentinel, since at least two cron jobs are involved (one that gathers
the data and one that creates the pages).

>  * belvu and blixem that are from the same source package but have different
> task entries and also separate catalog entries in all three registries. This
> breaks the current UDD schema. I have annotated it now as ['belvu','blixem']
> (for bio.tools, the others analogously).
> 
> Ideas for improvements anyone? Or is this how it should be for now?

I'm not sure.  In any case the current gatherer code will do nothing (at
best) or fail.  It seems that we are lucky and it does not break.  The
thing is that if we change our data model somebody (currently only me)
needs to adapt the code.  Currently there is no chance to resolve

 - Name: OMICtools
   Entry: ['OMICS_23183', 'OMICS_23184', 'OMICS_15828']

or

 - Name: SciCrunch
   Entry: ['SCR_015989','SCR_015994', 'NA']

How should the gatherer magically guess what binary package to choose?
The entry

 - Name: bio.tools
   Entry: ['belvu', 'blixem', 'dotter']

looks helpful - but it is just pure luck that bio.tools has chosen IDs
matching our package names.  So I think your data model is not helpful
since there is no way to define the order of the binary packages
built from one source package.  Thus we somehow need to define the
binary package name explicitly.

For citations we are using the field Debian-package[1], which is for
instance used in the meme package[2] (just to have another example, since
in seqtools the dotter publication is also marked like this).  However,
this works only because I once added an additional column "package" to
the bibref table, which looks for instance like this:


udd=# select * from bibref where (source = 'meme' or source = 'seqtools') and key = 'title';
  source  |  key  |                                                  value                                                   | package | rank
----------+-------+----------------------------------------------------------------------------------------------------------+---------+------
 meme     | title | MEME: discovering and analyzing DNA and protein sequence motifs                                          |         |    0
 meme     | title | Discovering Sequence Motifs with Arbitrary Insertions and Deletions                                      | glam2   |    0
 seqtools | title | SeqTools: visual tools for manual analysis of sequence alignments                                        |         |    0
 seqtools | title | Scoredist: A simple and robust protein sequence distance estimator                                       |         |    1
 seqtools | title | A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis | dotter  |    0
 seqtools | title | A workbench for large-scale sequence homology analysis                                                   |         |    2


You see, packages with different names than the source packages got an
additional value in the package column since it was defined in our data
model first and implemented in the code afterwards.  However, the
registry table looks like this:


udd=# select * from registry where source = 'seqtools';
  source  |   name    |           entry           
----------+-----------+---------------------------
 seqtools | OMICtools | {OMICS_23183,OMICS_23184}
 seqtools | bio.tools | {belvu,blixem}
 seqtools | SciCrunch | {SCR_015989,SCR_015994}


That's the status before your last commit since the machine-readable
gatherer cron job has not run yet.  The gatherer takes what it gets and
injects it into the database.  It's not magic - it's code that needs to
be adapted to a data model.  Changing the data model and hoping that
something sensible will happen does not work.
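
To make the point concrete, here is a minimal Python sketch of what a
gatherer step could do once the binary package is named explicitly.  It
is not the real UDD gatherer: the function name, the record layout and
the reuse of a "Debian-package" key for registry entries are assumptions
made up for this illustration.  A bare list of IDs is rejected because
nothing says which binary package each ID belongs to.

```python
def registry_rows(source, registries):
    """Turn registry records into (source, package, name, entry) rows.

    Hypothetical record layout: a dict with "Name" and "Entry" and an
    optional "Debian-package" naming the binary package.  An empty
    package means "same name as the source package", as in bibref.
    """
    rows = []
    for record in registries:
        entry = record["Entry"]
        if not isinstance(entry, str):
            # A list of IDs gives no hint which binary package each ID
            # belongs to, so reject it instead of guessing.
            raise ValueError(
                "%s: %s entry must be a single ID, got %r"
                % (source, record["Name"], entry)
            )
        package = record.get("Debian-package", "")
        rows.append((source, package, record["Name"], entry))
    return rows


# bio.tools IDs happen to match the binary package names, so this
# example needs no guessing about which ID belongs to which package.
seqtools = [
    {"Name": "bio.tools", "Entry": "belvu", "Debian-package": "belvu"},
    {"Name": "bio.tools", "Entry": "blixem", "Debian-package": "blixem"},
    {"Name": "bio.tools", "Entry": "dotter", "Debian-package": "dotter"},
]
for row in registry_rows("seqtools", seqtools):
    print(row)
```

Whether the metadata should use one record per binary package like this
is exactly the kind of question to settle before touching the data.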

What we should clarify in advance is:  Does the source column in the
registry table make sense at all or should it rather be a package column
referring to binary packages?  The web sentinel is working on binary
packages so maybe we should not keep source package names but rather
binary package names inside this table.  Alternatively we could add
another package column which is filled if the package we want to refer
to has a different name than the source package (as we are doing in the
bibref table).
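
The second alternative could look roughly like the following sketch.  It
uses an in-memory SQLite database from Python only to stay
self-contained; UDD itself is PostgreSQL, and the table layout and the
helper names are assumptions modelled on the bibref table, not an
existing schema.

```python
import sqlite3


def build_registry_demo():
    """Create a registry table with an extra package column (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE registry ("
        " source  TEXT NOT NULL,"
        " name    TEXT NOT NULL,"
        " entry   TEXT NOT NULL,"
        " package TEXT NOT NULL DEFAULT ''"  # empty: same name as source
        ")"
    )
    # One row per binary package instead of one array per source package.
    rows = [
        ("seqtools", "bio.tools", "belvu", "belvu"),
        ("seqtools", "bio.tools", "blixem", "blixem"),
        ("seqtools", "bio.tools", "dotter", "dotter"),
    ]
    conn.executemany(
        "INSERT INTO registry (source, name, entry, package)"
        " VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn


def lookup(conn, binary_package, registry_name):
    """Return the registry entry for a binary package, or None."""
    row = conn.execute(
        "SELECT entry FROM registry"
        " WHERE name = ?"
        " AND (package = ? OR (package = '' AND source = ?))",
        (registry_name, binary_package, binary_package),
    ).fetchone()
    return row[0] if row else None


conn = build_registry_demo()
print(lookup(conn, "dotter", "bio.tools"))
```

With such a layout the web sentinel could ask for a binary package
directly, and rows with an empty package column would still resolve via
the source package name.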

While we are talking about this: We also might question the bibref table
and drop the source column in favour of keeping only the package column.
The initial idea behind the source column is that upstream metadata
belong to a certain source, but maybe this fact is irrelevant if there
are different scientific metadata for different binaries created from
the same source.

In short: Let's discuss this first before adding new syntax hacks that
will not be properly understood by the current code.

Kind regards and thanks again for your effort to add scientific metadata

      Andreas.


[1] https://wiki.debian.org/UpstreamMetadata#Fields
[2] https://salsa.debian.org/med-team/meme/blob/master/debian/upstream/metadata

-- 
http://fam-tille.de
