Thanks Dave – really good points.
A few modifications to my proposal below, and one significant model issue I’d like to split out into a separate discussion. Details below.

Gary

From: Spdx-tech@lists.spdx.org <Spdx-tech@lists.spdx.org> On Behalf Of David Kemp
Sent: Wednesday, August 18, 2021 6:38 AM
To: Gary O'Neall <g...@sourceauditor.com>
Cc: SPDX-list <Spdx-tech@lists.spdx.org>
Subject: Re: [spdx-tech] Element Identifier Proposal - spdxNamespace

Gary,

I nearly agree, and I think we are making this harder than it is. The model currently uses made-up typenames, not actual types from https://www.w3.org/TR/rdf11-concepts/#xsd-datatypes. Fixing the model would fix the problem.

[G.O.] Agree – we should refer to the already defined data type names.

* documentNamespace would have a restriction on the non-namespace portion of the ID to allow for parsing of external document references (e.g. externalDocumentRef-12:SPDXRef-14). Currently, this is a colon (‘:’), however, we may want to choose a different character which is less common in URI’s. We would remove the required prefix of “SPDXRef-”.

I don't think we should define any special syntax for the non-namespace portion of the URI - the last slash in hier-part separates the namespace from the non-namespace, if the URI ends with a slash the non-namespace part is empty, and the non-namespace part cannot contain a slash. That is simple and unambiguous for both concatenating and splitting.

[G.O.] Let me revise my proposal to allow any valid IRI character – basically remove the restriction. This requires a change to the non-linked data serializations. This would create a compatibility issue with SPDX version 2.2, but I don’t think it is a major issue. In looking at the URI ABNF <https://datatracker.ietf.org/doc/html/rfc3986#appendix-A>, which the IRI grammar in RFC 3987 extends, there are quite a few characters not allowed in segments and fragments that we can choose from. It looks like ‘|’ is available (unless I missed something).
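For illustration only, here is a small Python sketch that checks which printable ASCII characters fall outside the segment and fragment productions of RFC 3986, Appendix A. The set names and the helper disallowed_ascii are made up for this example, and it ignores the non-ASCII ranges that IRIs add on top of URIs. Treat it as a sanity check on the ‘|’ choice, not as part of the proposal.

```python
import string

# Character classes transcribed from RFC 3986, Appendix A.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
# A raw '%' is only valid as the start of a pct-encoded triplet, so it is
# deliberately left out of the "freely usable" set here.
PCHAR = UNRESERVED | SUB_DELIMS | set(":@")
# fragment = *( pchar / "/" / "?" )
FRAGMENT = PCHAR | set("/?")


def disallowed_ascii(allowed):
    """Return the printable ASCII characters (0x21-0x7E) not in `allowed`."""
    printable = {chr(code) for code in range(0x21, 0x7F)}
    return sorted(printable - allowed)


# Characters that cannot appear unescaped in a path segment or fragment,
# and are therefore candidates for a separator in non-linked-data formats.
candidates = disallowed_ascii(FRAGMENT)
print("".join(candidates))
print("|" in candidates)  # True
```

Under these assumptions the colon used as the separator in SPDX 2.2 is a legal segment character, while ‘|’ is not, which is what makes it attractive as a separator.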
This change will not have any impact on the linked-data serializations, as we would use the native serialization mechanisms for namespaces. Nor will this change have any impact on the model (other than recording the namespace – already mentioned in other parts of this proposal).

For tag/value, we would change the way we parse external document references. In SPDX 2.2, we look for a pattern ‘DocumentRef-[0-9a-zA-Z\.\-\+]+:SPDXRef-[0-9a-zA-Z\.\-\+]+’ for the ID (e.g. Relationship: SPDXRef-DOCUMENT COPY_OF DocumentRef-spdx-tool-1.2:SPDXRef-ToolsElement). We could change the pattern to `[any valid IRI]\|[any valid IRI]` (Note: I omitted the regex for valid IRI’s since the regex I found <https://stackoverflow.com/a/190405> is 10,985 characters long).

Note that this separator character is for separating the externalId (known as the DocumentRef in SPDX 2.2) from the short ID used in non-linked data serialization. It does not have any impact on how the IRI’s for the namespace or ID’s are constructed. I can also think of some very complicated algorithms that reference the external map to determine the external document reference without a separator character, but these algorithms are so complex that I doubt many tool providers would want to implement them.

* Elements have a property document which references the Document containing the documentNamespace property
  * an alternative would be to include the documentNamespace property in the Element – not my preferred approach, but it would still solve the translation problem

This assumes that 1) a Document can contain other Documents, and 2) any contained Documents have their elements stripped out and moved into the top-level Document. If you embed one document into another, wouldn't it be preferable to leave its contents intact, with the namespace of an Element always specified in the Document containing that Element?

[G.O.]
TL;DR – Let’s table the specifics on how we model and store the documentNamespace and see if we agree that a documentNamespace needs to be included somewhere in the model, referenceable from an individual Element. I believe this is necessary if we want lossless translation between linked data and non-linked data formats.

I wasn’t intending on supporting embedded documents – just referencing elements in external documents. The problem I was trying to solve is how to find the Document level information needed to translate the element back if all you have is the element to start with. We need the namespace to convert the ID’s back to a non-linked data format (e.g. in tag/value, convert a full URI back into a namespace and ID). If we only have an IRI reference to an element, we need to be able to find the Document level information. One way would be to copy all Document level information to each element. I think this would only work if all the properties were simple types which are represented in RDF as literals. Another (IMHO simpler) approach would be to have a class containing that information, with each element referencing the class.

This may be an issue best left for later resolution – I have a number of use cases where I need Document level information, and if we allow Elements to “stand alone” as per Sean’s requirements, I’ll need a way to get at the Document level information. I should also define “Document Level Information” to be the metadata describing the SBOM Metadata, typically generated when the SBOM Metadata is created (OK – I think that definition is even more confusing 😉). The current model copies other Document level information into each Element, which I disagree with – something we could discuss separately from the identifier discussion.

-----------------

I agree with Sean that we should consider not requiring Document for Elements whose id is an absolute URI. I started a drawing showing:

[G.O.]
I agree you should be able to reference an Element JUST using an absolute URI without needing to include any reference to the Document. I do think there is Document level information related to that element which needs to be accessed for several use cases. Having a property from the Element to the Document could provide this. Three reasons I believe this is important: most use cases require Document level information such as the creator; all elements are created by someone or something, and capturing the Document level information at creation time should not add too much complexity; and linking to common Document metadata removes the duplication of data between Element properties and the Document properties. I may be in the minority on this opinion, but I would like to discuss it further on a future tech call, separate from the identifier question.

1) a Document containing Elements with absolute URI ids
2) a Document containing Elements with relative URI ids (what we are using today, and must continue to allow)
3) an Element not contained in a Document with an absolute URI id

The slide might be more confusing than enlightening, but it seems clear to me. It is intended to show that all references to an Element are always an absolute URI even when the element id is relative.
https://docs.google.com/presentation/d/1v62mftkzWvH8WwdQgtJwWM6FovHoS2VRTWkZxjbaG00.

Dave

On Tue, Aug 17, 2021 at 2:44 PM Gary O'Neall <g...@sourceauditor.com> wrote:

As suggested on today’s tech call, I’m reposting a slightly modified proposal for Element identifiers originally emailed on August 4.

Proposal Summary:

* All Element ID’s and LicenseRef ID’s are URI’s (same as is the case for SPDX 2.2).
* Document has a property documentNamespace which is a string prefix for all ID’s local to the document (similar to SPDX 2.2, but with a few differences outlined below)
* documentNamespace would be optional – if it is not present, the non linked-data formats would use the full URI for the ID’s within the document (SPDX 2.2 has this as a required field)
* We would remove the requirement for all element ID’s to be fragments of the Document (e.g. no more special treatment of “#” – the documentNamespace would be a simple string prefix). This would create a minor incompatibility with SPDX 2.2.
* documentNamespace would have a restriction on the non-namespace portion of the ID to allow for parsing of external document references (e.g. externalDocumentRef-12:SPDXRef-14). Currently, this is a colon (‘:’), however, we may want to choose a different character which is less common in URI’s. We would remove the required prefix of “SPDXRef-”.
* Elements have a property document which references the Document containing the documentNamespace property
  * an alternative would be to include the documentNamespace property in the Element – not my preferred approach, but it would still solve the translation problem

Below are the criteria from today’s call and my evaluation of meeting those criteria:

* Independently unique defined – All element ID’s are URI’s (or IRI’s if we change the spec) and meet this requirement
* Independently unique referenced – All element ID’s are URI’s (or IRI’s if we change the spec) and meet this requirement
* Not requiring intermediate steps
  * For linked data, the URI would be used for reference and can be done directly
  * For non-linked data going to linked data, the URI can be used to access directly
  * For non-linked data going to non-linked data, the target document would need to be deserialized for access (I think this would be true for any of the proposals)
* Support non-linked data serializations – Storing the documentNamespace as a property allows for
lossless translation to/from linked and non-linked data serialization formats

Original emailed proposal from August 4:

To support non-linked data, we need a way to translate the ID’s back to and from non-linked-data serialization formats. The easiest approach would be to just include the entire ID String in formats like tag/value. This would end up with something like:

    …
    SPDXID: http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C3301#SPDXRef-File
    …

Rather than the current format of:

    DocumentNamespace: http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C3301
    …
    SPDXID: SPDXRef-File
    …

Similar issues apply to the Spreadsheet and YAML formats. We also have a non-linked-data JSON format which would have the same ID issues.

If the above change is acceptable to those using the non-linked-data serialization formats, I would definitely go with the simpler approach.

If we want the ID’s to be short, however, we’ll need to introduce something like namespaces, which are really a string prefix, and have pre-defined rules to make it possible to reliably translate between the linked-data formats (which will always use URI’s) and the non-linked-data formats.

Here are the rules for SPDX-2.0:

* The full URI is formed by concatenating the documentNamespace + ‘#’ + SPDXID in non-linked-data formats
* Linked Data formats must include a default namespace in their serialization – this is the same namespace used as the documentNamespace property in the non-linked-data format, with ‘#’ appended
* SPDX ID’s are restricted to the format SPDXRef-[idString] where idString is a unique string containing letters, numbers, ., and/or -.
* Any ID’s not defined within the SPDX document use the format DocumentRef-[idString]:SPDXRef-[idString] for non-linked-data formats and use the external map to form the full URI

Note – there are similar rules for LicenseRef’s.

Sean raised a valid issue regarding the required use of ‘#’.
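As a rough sketch of the SPDX-2.0 rules listed above, the translation in both directions could look like the following Python. The function names, the dict used for the external map, and the example.org namespace are assumptions made for this illustration, not anything defined by the spec or by existing tools.

```python
import re

# ID shapes as described in the rules above: letters, digits, '.' and '-'.
LOCAL_ID = re.compile(r"SPDXRef-[0-9a-zA-Z.\-]+")
EXTERNAL_ID = re.compile(r"(DocumentRef-[0-9a-zA-Z.\-]+):(SPDXRef-[0-9a-zA-Z.\-]+)")


def to_uri(spdx_id, document_namespace, external_map):
    """Expand a short tag/value ID into a full URI (namespace + '#' + ID)."""
    match = EXTERNAL_ID.fullmatch(spdx_id)
    if match:
        doc_ref, local_id = match.groups()
        # external_map: hypothetical dict of DocumentRef -> external namespace
        return external_map[doc_ref] + "#" + local_id
    if LOCAL_ID.fullmatch(spdx_id):
        return document_namespace + "#" + spdx_id
    raise ValueError(f"not a valid SPDX 2.x ID: {spdx_id!r}")


def to_short_id(uri, document_namespace, external_map):
    """Collapse a full URI back into the short form used in tag/value."""
    namespace, _, local_id = uri.rpartition("#")
    if namespace == document_namespace:
        return local_id
    for doc_ref, external_namespace in external_map.items():
        if namespace == external_namespace:
            return f"{doc_ref}:{local_id}"
    raise ValueError(f"no known namespace for {uri!r}")


NAMESPACE = "http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C3301"
# Hypothetical external document namespace, for illustration only.
EXTERNAL = {"DocumentRef-spdx-tool-1.2": "http://example.org/spdxdocs/spdx-tool-1.2"}

print(to_uri("SPDXRef-File", NAMESPACE, EXTERNAL))
print(to_uri("DocumentRef-spdx-tool-1.2:SPDXRef-ToolsElement", NAMESPACE, EXTERNAL))
```

The round trip only works because both the ‘#’ and the ‘:’ are excluded from idString, which is the constraint the rest of this proposal tries to relax.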
I have a proposed solution below:

In thinking about this, since we have the documentNamespace and XMLNS properties (for RDF/XML), we could relax this requirement and allow any valid URI namespace prefix. This creates a minor incompatibility since we would need to append a ‘#’ to the documentNamespace property for any pre-3.0 SPDX documents.

I would still suggest restricting the characters available for SPDXRef’s to make it possible to parse the ID’s in the non-linked-data formats. We could, however, extend the set of allowed characters (e.g. add “/” as an allowed character). As per previous discussions, we could also remove the requirement for the SPDXRef- prefix.

This would solve some of the issues raised previously yet still allow support for both linked-data and non-linked data.

Here’s a proposed set of rules for 3.0:

* The full URI is formed by concatenating the documentNamespace + SPDXID in non-linked-data formats. The documentNamespace property would be optional. If the documentNamespace is not included, the SPDXID must be the full URI.
* Linked Data formats may include a default namespace in their serialization – this is the same namespace used as the documentNamespace property in the non-linked-data format
* SPDX ID’s are restricted to be a unique (within the document) string containing only letters, numbers, ., /, and/or -.
* Any ID’s not defined within the SPDX document use the format DocumentRef-[idString]:[idString] for non-linked-data formats (NOTE: the ‘:’ must not be an allowed character in the idString)

I would further propose some recommended practices:

* Namespaces are used and must be unique
* SPDX ID’s have a format that conveys information about the type (per previous conversations)
* Namespaces do not include ‘#’, to make the URI’s more HTTP addressable (per Sean’s concern)

Variations on a theme:

* We could introduce a separator character for the namespace that would be appended to the documentNamespace.
  This would relax the requirement for an XMLNS property in the RDF serializations since we could then parse – although I’m not sure how reliable the parsing would be.
* Require a namespace – this would make the tag/value more readable at the expense of flexibility

Let me know if this sounds reasonable.

Gary

-------------------------------------------------
Gary O'Neall
Principal Consultant
Source Auditor Inc.
Mobile: 408.805.0586
Email: g...@sourceauditor.com

View/Reply Online (#4164): https://lists.spdx.org/g/Spdx-tech/message/4164