From: Stefano Tondo <[email protected]>

When consolidating SPDX documents via expand_collection(), objects with
the same SPDX ID can appear in multiple source documents with different
levels of completeness. The previous implementation used simple set
union (self.objects |= other.objects), which would keep an arbitrary
version when duplicates existed.

This caused data loss during consolidation, particularly affecting
externalIdentifier arrays where one version might have a basic PURL
while another has multiple PURLs with Git metadata qualifiers.

Fix by implementing intelligent object merging that:
- Detects objects with duplicate SPDX IDs
- Compares completeness based on externalIdentifier count
- Keeps the more complete version (more externalIdentifiers)
- Preserves objects without IDs as-is

This ensures that consolidated SBOMs contain the most complete metadata
available from all source documents.

The bug was discovered while testing multi-PURL support where packages
can have varying externalIdentifier counts (base PURL vs base + Git
commit + Git branch PURLs), but affects any scenario with duplicate
SPDX IDs during consolidation.

Signed-off-by: Stefano Tondo <[email protected]>
---
 meta/lib/oe/sbom30.py | 47 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/meta/lib/oe/sbom30.py b/meta/lib/oe/sbom30.py
index 227ac51877..c77e18f4e8 100644
--- a/meta/lib/oe/sbom30.py
+++ b/meta/lib/oe/sbom30.py
@@ -822,7 +822,52 @@ class ObjectSet(oe.spdx30.SHACLObjectSet):
                 if not e.externalSpdxId in imports:
                     imports[e.externalSpdxId] = e
 
-            self.objects |= other.objects
+            # Merge objects intelligently: if same SPDX ID exists, keep the one with more complete data
+            #
+            # WHY DUPLICATES OCCUR: When consolidating SPDX documents (e.g., recipe -> package -> image),
+            # the same package can be referenced at different build stages, each with varying levels of
+            # detail. Early stages may have basic PURLs, while later stages add Git metadata qualifiers.
+            # This is architectural - multi-stage builds naturally create multiple representations of
+            # the same entity.
+            #
+            # However, preserve object identity for types that get referenced (like CreationInfo)
+            # to avoid breaking serialization
+            other_by_id = {}
+            for obj in other.objects:
+                obj_id = getattr(obj, '_id', None)
+                if obj_id:
+                    other_by_id[obj_id] = obj
+
+            self_by_id = {}
+            for obj in self.objects:
+                obj_id = getattr(obj, '_id', None)
+                if obj_id:
+                    self_by_id[obj_id] = obj
+
+            # Merge: for duplicate IDs, prefer the object with more externalIdentifier entries
+            # but only for Element types (not CreationInfo, Agent, Tool, etc.)
+            for obj_id, other_obj in other_by_id.items():
+                if obj_id in self_by_id:
+                    self_obj = self_by_id[obj_id]
+                    # Only replace Elements with more complete data
+                    # Do NOT replace CreationInfo or other supporting types to preserve object identity
+                    if isinstance(self_obj, oe.spdx30.Element):
+                        # If both have externalIdentifier, keep the one with more entries
+                        self_ext_ids = getattr(self_obj, 'externalIdentifier', [])
+                        other_ext_ids = getattr(other_obj, 'externalIdentifier', [])
+                        if len(other_ext_ids) > len(self_ext_ids):
+                            # Replace self object with other (more complete) object
+                            self.objects.discard(self_obj)
+                            self.objects.add(other_obj)
+                    # For non-Element types (CreationInfo, Agent, Tool), keep existing to preserve identity
+                else:
+                    # New object, just add it
+                    self.objects.add(other_obj)
+
+            # Add any objects without IDs
+            for obj in other.objects:
+                if not getattr(obj, '_id', None):
+                    self.objects.add(obj)
 
         for o in add_objectsets:
             merge_doc(o)
-- 
2.53.0
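
For readers who want to try the merge rule from the commit message outside
of a build, here is a minimal standalone sketch. It is not the oe.spdx30
API: FakeObject, its is_element flag (standing in for the
isinstance(obj, oe.spdx30.Element) check) and the PURL strings are all
hypothetical, and merge_objects() only approximates the behaviour of the
patch under those assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity-based hashing, so instances can live in a set
class FakeObject:
    """Hypothetical stand-in for an SPDX object; not the oe.spdx30 API."""
    _id: Optional[str] = None
    externalIdentifier: list = field(default_factory=list)
    is_element: bool = True  # assumption: replaces the isinstance() check


def merge_objects(mine: set, theirs: set) -> None:
    """Merge `theirs` into `mine` in place, keeping the richer duplicate."""
    mine_by_id = {o._id: o for o in mine if o._id}
    for other in theirs:
        if not other._id:
            # Objects without an ID cannot collide, so keep them as-is
            mine.add(other)
            continue
        current = mine_by_id.get(other._id)
        if current is None:
            # New ID, just add it
            mine.add(other)
            mine_by_id[other._id] = other
        elif (current.is_element
              and len(other.externalIdentifier) > len(current.externalIdentifier)):
            # Duplicate element with more identifiers: swap it in
            mine.discard(current)
            mine.add(other)
            mine_by_id[other._id] = other
        # Otherwise keep the existing object to preserve its identity


# Hypothetical data: the same package seen with one PURL and with two
basic = FakeObject("pkg-a", ["pkg:generic/a@1.0"])
rich = FakeObject("pkg-a", ["pkg:generic/a@1.0", "pkg:generic/a@1.0?commit=abc123"])
creation = FakeObject("creation-1", [], is_element=False)
other_creation = FakeObject("creation-1", ["unused"], is_element=False)

objects = {basic, creation}
merge_objects(objects, {rich, other_creation, FakeObject()})
```

After the call, objects holds rich in place of basic, the original
creation object, and the ID-less object. A plain set union would have
kept both basic and rich, leaving the choice between them to whatever
later resolves objects by ID.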
