Re: [spdx-tech] Element Identifier Proposal - spdxNamespace

Gary O'Neall Wed, 18 Aug 2021 11:36:07 -0700

Thanks Dave – really good points.


A few modifications to my proposal below, and one significant model issue I’d 
like to split out into its own discussion.  Details below.

 

Gary

 

From: Spdx-tech@lists.spdx.org <Spdx-tech@lists.spdx.org> On Behalf Of David 
Kemp
Sent: Wednesday, August 18, 2021 6:38 AM
To: Gary O'Neall <g...@sourceauditor.com>
Cc: SPDX-list <Spdx-tech@lists.spdx.org>
Subject: Re: [spdx-tech] Element Identifier Proposal - spdxNamespace

 

Gary, I nearly agree, and I think we are making this harder than it is. The 
model currently uses made-up typenames, not actual types from 
https://www.w3.org/TR/rdf11-concepts/#xsd-datatypes.  Fixing the model would 
fix the problem.

[G.O.] Agree – we should refer to the already defined data type names.

 

*       documentNamespace would have a restriction on the non-namespace portion 
of the ID to allow for parsing of external document references (e.g. 
externalDocumentRef-12:SPDXRef-14).  Currently, the separator is a colon (‘:’); 
however, we may want to choose a different character which is less common in 
URI’s.  We would remove the required prefix of “SPDXRef-”.


I don't think we should define any special syntax for the non-namespace portion 
of the URI - the last slash in hier-part separates the namespace from the 
non-namespace, if the URI ends with a slash the non-namespace part is empty, 
and the non-namespace part cannot contain a slash.  That is simple and 
unambiguous for both concatenating and splitting.
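
A minimal sketch of the split/concatenate rule Dave describes, in Python. The 
function names are hypothetical, and the sketch works on the whole string, 
assuming the URI has no query or fragment that contains a slash:

```python
def split_uri(uri: str) -> tuple[str, str]:
    """Split at the last slash: (namespace including the slash, local part)."""
    namespace, slash, local = uri.rpartition("/")
    if not slash:
        raise ValueError(f"no slash to split on: {uri!r}")
    return namespace + slash, local


def join_uri(namespace: str, local: str) -> str:
    """Inverse of split_uri: plain concatenation of the two parts."""
    if not namespace.endswith("/") or "/" in local:
        raise ValueError("namespace must end with '/'; local part must not contain '/'")
    return namespace + local
```

A URI ending in a slash yields an empty local part, so splitting and then 
joining returns the original string.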

[G.O.] Let me revise my proposal to allow any valid IRI character – basically 
remove the restriction.

This requires a change to the non-linked data serializations.  This would 
create a compatibility issue with SPDX version 2.2, but I don’t think it is a 
major issue.  Looking at the URI ABNF in RFC 3986 
<https://datatracker.ietf.org/doc/html/rfc3986#appendix-A> (which the IRI 
grammar builds on), there are quite a few characters not allowed in IRI 
segments and fragments that we can choose from.  It looks like ‘|’ is 
available (unless I missed something).
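
One way to double-check the choice of ‘|’ is to list the printable ASCII 
characters that the RFC 3986 grammar never permits.  The sketch below builds 
the character classes from Appendix A (unreserved, gen-delims, sub-delims, 
plus ‘%’ for percent-encoding); the function name is made up for illustration. 
RFC 3987 extends the grammar with non-ASCII characters only, so the ASCII 
result should carry over to IRI’s:

```python
import string

# Character classes from RFC 3986, Appendix A.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
# '%' is only legal as the start of a percent-encoded triplet.
URI_CHARS = UNRESERVED | GEN_DELIMS | SUB_DELIMS | {"%"}


def excluded_printable_ascii() -> list[str]:
    """Printable, non-space ASCII characters that cannot appear raw in a URI."""
    printable = (chr(code) for code in range(0x21, 0x7F))
    return sorted(ch for ch in printable if ch not in URI_CHARS)


if __name__ == "__main__":
    print(excluded_printable_ascii())
```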

This change will not have any impact on the linked-data serializations as we 
would use the native serialization mechanisms for using namespaces.  Nor will 
this change have any impact on the model (other than recording the namespace – 
already mentioned in other parts of this proposal).

For tag/value, we would change the way we parse external document references.  
In SPDX 2.2, we look for a pattern 
‘DocumentRef-[0-9a-zA-Z\.\-\+]+:SPDXRef-[0-9a-zA-Z\.\-\+]+’ for the ID (e.g. 
Relationship: SPDXRef-DOCUMENT COPY_OF 
DocumentRef-spdx-tool-1.2:SPDXRef-ToolsElement).  We could change the pattern 
to `[any valid IRI]\|[any valid IRI]`  (Note: I omitted the regex for valid 
IRI’s since the regex I found <https://stackoverflow.com/a/190405>  is 10,985 
characters long).  
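
To make the parsing change concrete, here is a rough Python sketch of both 
forms.  The first function uses the SPDX 2.2 pattern quoted above; the second 
is a hypothetical reading of the proposed ‘|’ separator and, like the text, 
leaves out IRI validation entirely:

```python
import re

# SPDX 2.2 external reference pattern, as quoted in the text.
SPDX22_EXTERNAL = re.compile(
    r"(DocumentRef-[0-9a-zA-Z.\-+]+):(SPDXRef-[0-9a-zA-Z.\-+]+)"
)


def parse_spdx22(ref: str):
    """Return (document ref, element id) for a 2.2 external reference, else None."""
    match = SPDX22_EXTERNAL.fullmatch(ref)
    return (match.group(1), match.group(2)) if match else None


def parse_with_pipe(ref: str):
    """Hypothetical 3.0 form '<external id>|<element id>'; no IRI validation."""
    external, sep, element = ref.partition("|")
    if not sep or not external or not element or "|" in element:
        return None
    return external, element
```

Because ‘|’ cannot appear in either half, the second form can be split 
without restricting the characters used in the ID itself.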

Note that this separator character is for separating the externalId  (known as 
the DocumentRef in SPDX 2.2) from the short ID used in non-linked data 
serialization.  It does not have any impact on how the IRI’s for the namespace 
or ID’s are  constructed.

I can also think of some very complicated algorithms that reference the 
external map to determine the external document reference without a separator 
character, but these algorithms are so complex, I doubt many tool providers 
would want to implement them.

*       Elements have a property document which references the Document 
containing the documentNamespace property

*       an alternative would be to include the documentNamespace property in 
the Element – not my preferred approach, but it would still solve the 
translation problem


This assumes that 1) a Document can contain other Documents, and 2) any 
contained Documents have their elements stripped out and moved into the 
top-level Document.  If you embed one document into another, wouldn't it be 
preferable to leave its contents intact, with the namespace of an Element 
always specified in the Document containing that Element?

[G.O.] TL;DR – Let’s table the specifics on how we model and store the 
documentNamespace and see if we agree a documentNamespace needs to be included 
somewhere in the model, referenceable from an individual Element.  I believe 
this is necessary if we want to be able to have lossless translation between 
linked data and non-linked data formats.

 

I wasn’t intending on supporting embedded documents – just referencing elements 
in external documents.  The problem I was trying to solve is how to find the 
Document level information needed to translate the element back if all you have 
is the element to start with.  We need the namespace to convert the ID’s back 
to a non-linked data format (e.g. in tag/value convert a full URI back into a 
namespace and ID).  If we only have an IRI reference to an element, we need to 
be able to find the Document level information.  One way would be to copy all 
Document level information to each element.  I think this would only work if 
all the properties were simple types which are represented in RDF as literals.  
Another (IMHO simpler) approach would be to have a class containing that 
information and each element would reference the class.  This may be an issue 
best left for later resolution – I have a number of use cases where I need 
Document level information and if we allow Elements to “stand alone” as per the 
Sean’s requirements, I’ll need a way to get at the Document level information.  I 
should also define “Document Level Information” to be the metadata describing 
the SBOM, typically generated when the SBOM is created (OK – I 
think that definition is even more confusing 😉).  The current model copies 
other Document level information into each Element which I disagree with – 
something we could discuss separately from the identifier discussion.
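
The “class containing that information” approach could look roughly like the 
following Python sketch.  The class and field names (DocumentInfo, creator, 
document_namespace) are illustrative assumptions, not names from the SPDX 
model:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentInfo:
    """Hypothetical holder for Document level information."""
    creator: str
    document_namespace: Optional[str] = None


@dataclass(frozen=True)
class Element:
    id: str                  # always a full URI/IRI
    document: DocumentInfo   # shared reference, not a per-element copy


def short_id(element: Element) -> str:
    """Recover the short ID for non-linked-data formats, starting from the element alone."""
    namespace = element.document.document_namespace
    if namespace and element.id.startswith(namespace):
        return element.id[len(namespace):]
    return element.id
```

Starting from a single Element, the namespace needed for translation is one 
reference away, and the Document level data is stored once rather than copied 
into each Element.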


-----------------

I agree with Sean that we should consider not requiring Document for Elements 
whose id is an absolute URI.  I started a drawing showing:

[G.O.] I agree you should be able to reference an Element JUST using an 
absolute URI without needing to include any reference to the Document.  I do 
think there is Document level information related to that element which needs 
to be accessed for several use cases.   Having a property from the Element to 
the Document could provide this.  Three reasons I believe this is important: 
(1) most use cases require Document level information such as the creator; 
(2) all elements are created by someone or something, and capturing the 
Document level information at creation time should not add too much 
complexity; and (3) linking to common Document metadata removes the 
duplication of data between Element properties and the Document properties.  
I may be in the minority on this 
opinion, but I would like to discuss it further on a future tech call separate 
from the identifier question.


1) a Document containing Elements with absolute URI ids

2) a Document containing Elements with relative URI ids  (what we are using 
today, and must continue to allow)

3) an Element not contained in a Document with an absolute URI id

The slide might be more confusing than enlightening, but it seems clear to me.  
It is intended to show that all references to an Element are always an absolute 
URI even when the element id is relative.
https://docs.google.com/presentation/d/1v62mftkzWvH8WwdQgtJwWM6FovHoS2VRTWkZxjbaG00.
 

 

Dave

 

 

On Tue, Aug 17, 2021 at 2:44 PM Gary O'Neall <g...@sourceauditor.com> wrote:

As suggested on today’s tech call, I’m reposting a slightly modified proposal 
for Element identifiers originally emailed on August 4.

 

Proposal Summary:

*       All Element ID’s and LicenseRef ID’s are URI’s (same as is the case for 
SPDX 2.2). 
*       Document has a property documentNamespace which is a string prefix for 
all ID’s local to the document (similar to SPDX 2.2, but with a few differences 
outlined below)

*       documentNamespace would be optional – if it is not present, the non 
linked-data formats would use the full URI for the ID’s within the document 
(SPDX 2.2 has this as a required field)
*       We would remove the requirement for all element ID’s to be fragments of 
the Document (e.g. no more special treatment of “#” – the documentNamespace 
would be a simple string prefix).  This would create a minor incompatibility 
with SPDX 2.2.
*       documentNamespace would have a restriction on the non-namespace portion 
of the ID to allow for parsing of external document references (e.g. 
externalDocumentRef-12:SPDXRef-14).  Currently, the separator is a colon (‘:’); 
however, we may want to choose a different character which is less common in 
URI’s.  We would remove the required prefix of “SPDXRef-”.

*       Elements have a property document which references the Document 
containing the documentNamespace property

*       an alternative would be to include the documentNamespace property in 
the Element – not my preferred approach, but it would still solve the 
translation problem

 

Below are the criteria from today’s call and my evaluation of how the proposal 
meets them:

*       Independently unique defined – All element ID’s are URI’s (or IRI’s if 
we change the spec) and meet this requirement
*       Independently unique referenced – All element ID’s are URI’s (or IRI’s 
if we change the spec) and meet this requirement
*       Not requiring intermediate steps

*       For linked data, the URI would be used for reference and can be done 
directly
*       For non-linked data going to linked data, the URI can be used to access 
directly
*       For non-linked data going to non-linked data, the target document would 
need to be deserialized for access (I think this would be true for any of the 
proposals)

*       Support non-linked data serializations – Storing the documentNamespace 
as a property allows for lossless translation to/from linked and non-linked 
data serialization formats

 

Original emailed proposal from August 4:

 

To support non-linked data, we need a way to translate the ID’s back to and 
from non-linked-data serialization formats.

 

The easiest approach would be to just include the entire ID String in formats 
like tag/value.

 

This would end up with something like:

 

               …

               SPDXID: http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C3301#SPDXRef-File

               …

 

Rather than the current format of:

 

               DocumentNamespace: http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C3301

               …

               SPDXID: SPDXRef-File

               …

 

Similar issues for the Spreadsheet and YAML formats.  We also have a 
non-linked-data JSON format which would also have the same ID issues.

 

If the above change is acceptable to those using the non-linked-data 
serialization formats, I would definitely go with the simpler approach.

 

If we want the ID’s to be short, however, we’ll need to introduce something 
like namespaces which are really a string prefix and have pre-defined rules to 
make it possible to reliably translate between the linked-data formats 
(which will always use URI’s) and the non-linked-data formats.

 

Here are the rules for SPDX-2.0:

*       The full URI is formed by concatenating the documentNamespace + ‘#’ + 
SPDXID in non-linked-data formats
*       Linked Data formats must include a default namespace in their 
serialization – this is the same namespace used as the documentNamespace 
property used in the non-linked-data format appended by ‘#’
*       SPDX ID’s are restricted to the format SPDXRef-[idString] where 
idString is a unique string containing letters, numbers, ., and/or -.
*       Any ID’s not defined within the SPDX document use the format 
DocumentRef-[idString]:SPDXRef-[idString] for non-linked-data formats and use 
the external map to form the full URI
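
As a rough illustration of the 2.x rules above, the following Python sketch 
forms the full URI for a local or external ID.  Representing the external map 
as a dict from DocumentRef ID to the external document’s namespace is an 
assumption made for the example:

```python
def full_uri_2x(document_namespace: str, ref: str, external_map: dict) -> str:
    """Resolve 'SPDXRef-x' or 'DocumentRef-y:SPDXRef-x' to a full URI.

    SPDX 2.x idStrings cannot contain ':', so the first colon, if present,
    separates the DocumentRef from the SPDXRef.
    """
    doc_ref, sep, spdx_id = ref.partition(":")
    if sep:
        namespace = external_map[doc_ref]   # namespace of the external document
    else:
        namespace, spdx_id = document_namespace, ref
    return namespace + "#" + spdx_id
```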

 

Note – there are similar rules for LicenseRef’s.

 

Sean raised a valid issue regarding the required use of ‘#’.

 

I have a proposed solution below:

 

In thinking about this, since we have the documentNamespace and XMLNS 
properties (for RDF/XML), we could relax this requirement and allow any valid 
URI namespace prefix.  This creates a minor incompatibility since we would need 
to append a ‘#’ to the documentNamespace property for any pre-3.0 SPDX 
documents.

 

I would still suggest restricting the characters available for SPDXRef’s to 
make it possible to parse the ID’s in the non-linked-data formats.  We could, 
however, extend some of the characters (e.g. add “/” as an allowed character).  
As per previous discussions, we could also remove the requirements for the 
SPDXRef- prefix.

 

This would solve some of the issues raised previously yet still allow support 
for both linked-data and non-linked data.

 

Here’s a proposed set of rules for 3.0:

*       The full URI is formed by concatenating the documentNamespace + SPDXID 
in non-linked-data formats.  The documentNamespace property would be optional.  
If the documentNamespace is not included, the SPDXID must be the full URI.
*       Linked Data formats may include a default namespace in their 
serialization – this is the same namespace used as the documentNamespace 
property used in the non-linked-data format
*       SPDX ID’s are restricted to be a unique (within the document) string 
containing only letters, numbers, ., /, and/or -.
*       Any ID’s not defined within the SPDX document use the format 
DocumentRef-[idString]:[idString] for non-linked-data formats (NOTE: the ‘:’ 
must not be an allowed character in the idString)
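
The first and third rules could be sketched in Python as below.  The function 
names are hypothetical, the character set is the one listed in the third rule, 
and uniqueness within the document is not checked:

```python
import re

# idString characters from the proposed rule: letters, numbers, '.', '/', '-'.
ID_STRING = re.compile(r"[A-Za-z0-9./-]+")


def to_full_uri(spdx_id: str, document_namespace: str = None) -> str:
    """Non-linked-data ID -> full URI (plain prefix concatenation, no '#')."""
    if document_namespace is None:
        return spdx_id  # without a namespace the SPDXID is already the full URI
    if not ID_STRING.fullmatch(spdx_id):
        raise ValueError(f"invalid characters in SPDX ID: {spdx_id!r}")
    return document_namespace + spdx_id


def to_short_id(uri: str, document_namespace: str = None) -> str:
    """Full URI -> short ID when the URI is local to the document, else the URI."""
    if document_namespace and uri.startswith(document_namespace):
        local = uri[len(document_namespace):]
        if ID_STRING.fullmatch(local):
            return local
    return uri
```

With these two functions an ID written in tag/value can be expanded for a 
linked-data format and shortened again without loss, provided the same 
documentNamespace is available on both sides.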

 

I would further propose some recommended practices:

*       Namespaces are used and must be unique
*       SPDX ID’s have a format that conveys information about the type (per 
previous conversations)
*       Namespaces should not include ‘#’, to make the URI’s more HTTP 
addressable (per Sean’s concern)

 

Variations on a theme:

*       We could introduce a separator character for the namespace that would 
be appended to the documentNamespace.  This would relax the requirement for an 
XMLNS property in the RDF serializations since we could then parse – although 
I’m not sure how reliable the parsing would be.
*       Require a namespace – this would make the tag/value more readable at 
the expense of flexibility

 

Let me know if this sounds reasonable.

 

Gary

 

 

 

 

-------------------------------------------------

Gary O'Neall

Principal Consultant

Source Auditor Inc.

Mobile: 408.805.0586

Email: g...@sourceauditor.com


 




