Hi Steffen,

On Mon, Mar 26, 2018 at 07:23:24PM +0200, Steffen Möller wrote:
>
> I just procrastinated a bit into using the comfort of salsa to update
> debian/upstream/metadata and here the references to SciCrunch, OMICtools and
> bio.tools registries. All three registries have improved their coverage
> enormously over the past few months. I am deeply impressed.

Thanks a lot for the large update.

> Anyway. I came across
>
>  * one or two entries

Which ones?

> that had perfect RRID descriptions on salsa but not on
> our task page - does the package need to be re-uploaded for the change to
> become visible?

Re-uploading is *not* needed. The data come from the Salsa Git
repositories (about two weeks ago the machine-readable gatherer was
switched from Alioth to Salsa). However, there is a delay of about 24
hours between a commit and the data becoming visible on the web sentinel,
since at least two cron jobs are involved (one that gathers the data and
one that creates the pages).

>  * belvu and blixem that are from the same source package but have different
> task entries and also separate catalog entries in all three registries. This
> breaks the current UDD schema. I have annotated it now as ['belvu','blixem']
> (for bio.tools, the others analogously).
>
> Ideas for improvements anyone? Or is this how it should be for now?

I'm not sure. In any case the current gatherer code will do nothing (at
best) or fail. It seems that we are lucky and it does not break. The
thing is that if we change our data model, somebody (currently only me)
needs to adapt the code. Currently there is no chance to resolve

   - Name: OMICtools
     Entry: ['OMICS_23183', 'OMICS_23184', 'OMICS_15828']

or

   - Name: SciCrunch
     Entry: ['SCR_015989','SCR_015994', 'NA']

How should the gatherer magically guess which binary package to choose?
The entry

   - Name: bio.tools
     Entry: ['belvu', 'blixem', 'dotter']

looks helpful - but it is just pure luck that bio.tools has chosen IDs
matching our package names. So I think your data model is not helpful,
since there is no way to define a sequence of the binary packages built
from one source package. Thus we somehow need to define the binary
package name explicitly.

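To make the problem concrete, here is a minimal Python sketch. It is not
the actual gatherer code: the function name `resolve_entries`, the idea
of writing `Entry` as a package-to-ID mapping, and the placeholder IDs
`ID-1`/`ID-2` are assumptions made up purely for illustration. It only
shows that a bare list can be matched to binary packages solely when the
IDs happen to equal package names, while an explicit mapping needs no
guessing.

```python
# Hypothetical sketch, NOT the real UDD gatherer.
# Binary packages of the seqtools source package mentioned in this thread.
BINARIES = ["belvu", "blixem", "dotter"]


def resolve_entries(entry, binaries):
    """Return {binary_package: registry_id}, or None if it cannot be resolved."""
    if isinstance(entry, dict):
        # Explicit mapping (assumed syntax): keep known binaries, skip 'NA'.
        return {pkg: rid for pkg, rid in entry.items()
                if pkg in binaries and rid != "NA"}
    if isinstance(entry, list):
        # Bare list: resolvable only if every ID equals a binary package name.
        if all(rid in binaries for rid in entry):
            return {rid: rid for rid in entry}
        return None  # no way to tell which ID belongs to which binary
    return None


# The OMICtools list from above carries no package information at all.
print(resolve_entries(["OMICS_23183", "OMICS_23184", "OMICS_15828"], BINARIES))
# The bio.tools list works only because the IDs match our package names.
print(resolve_entries(["belvu", "blixem", "dotter"], BINARIES))
# An explicit mapping (placeholder IDs) resolves without any guessing.
print(resolve_entries({"belvu": "ID-1", "blixem": "ID-2", "dotter": "NA"}, BINARIES))
```

Whatever syntax we agree on in the end, the point is that the binary
package name has to be stated in the data rather than inferred.
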
For citations we are using the field Debian-package[1], which is used for
instance by the meme package[2] (just to have another example, since in
seqtools the dotter publication is also marked like this). However, this
works only because I once added an additional field "package" to the
bibref table, which looks for instance like this:

udd=# select * from bibref where (source = 'meme' or source = 'seqtools' ) and key = 'title';
  source  |  key  |                                                  value                                                   | package | rank
----------+-------+----------------------------------------------------------------------------------------------------------+---------+------
 meme     | title | MEME: discovering and analyzing DNA and protein sequence motifs                                          |         |    0
 meme     | title | Discovering Sequence Motifs with Arbitrary Insertions and Deletions                                      | glam2   |    0
 seqtools | title | SeqTools: visual tools for manual analysis of sequence alignments                                        |         |    0
 seqtools | title | Scoredist: A simple and robust protein sequence distance estimator                                       |         |    1
 seqtools | title | A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis | dotter  |    0
 seqtools | title | A workbench for large-scale sequence homology analysis                                                   |         |    2

You see, packages with names different from the source package got an
additional value in the package column, since it was defined in our data
model first and implemented in the code afterwards. However, the registry
table looks like this:

udd=# select * from registry where source = 'seqtools';
  source  |   name    |           entry
----------+-----------+---------------------------
 seqtools | OMICtools | {OMICS_23183,OMICS_23184}
 seqtools | bio.tools | {belvu,blixem}
 seqtools | SciCrunch | {SCR_015989,SCR_015994}

That's the status before your last commit, since the machine-readable
gatherer cron job has not run yet. The gatherer takes what it gets and
injects it into the database. It's not magic - it's code that needs to be
adapted to a data model.
Changing the data model and hoping that something sensible will happen
does not work.

What we should clarify in advance is: Does the source column in the
registry table make sense at all, or should it rather be a package column
referring to binary packages? The web sentinel works on binary packages,
so maybe we should not keep source package names but rather binary
package names in this table. Alternatively we could add another package
column which is filled if the package we want to refer to has a different
name than the source package (as we are doing in the bibref table).

While we are talking about this: We might also question the bibref table
and drop the source column in favour of keeping only the package column.
The initial idea behind the source column is that upstream metadata
belong to a certain source, but maybe this fact is irrelevant if there
are different scientific metadata for different binaries created from the
same source.

In short: Let's discuss this first before adding new syntax hacks that
will not be properly understood by the current code.

Kind regards and thanks again for your effort to add scientific metadata

       Andreas.

[1] https://wiki.debian.org/UpstreamMetadata#Fields
[2] https://salsa.debian.org/med-team/meme/blob/master/debian/upstream/metadata

-- 
http://fam-tille.de