From: Stefano Tondo <[email protected]>

When consolidating SPDX documents via expand_collection(), objects
with the same SPDX ID can appear in multiple source documents with
different levels of completeness. The previous implementation used
simple set union (self.objects |= other.objects), which would keep
an arbitrary version when duplicates existed.

This caused data loss during consolidation, particularly affecting
externalIdentifier arrays where one version might have a basic PURL
while another has multiple PURLs with Git metadata qualifiers.
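
A minimal sketch of how a plain set union can lose the richer version.
It assumes, purely for illustration, that two objects with the same ID
compare equal and hash alike; ById and its fields are hypothetical and
say nothing about how oe.spdx30 objects actually hash. Under that
assumption, the in-place union keeps whichever element is already in
the set and never compares completeness:

```python
class ById:
    """Hypothetical object whose equality and hash depend only on its ID."""

    def __init__(self, spdx_id, purls):
        self.spdx_id = spdx_id
        self.purls = purls

    def __eq__(self, other):
        return isinstance(other, ById) and self.spdx_id == other.spdx_id

    def __hash__(self):
        return hash(self.spdx_id)


basic = ById("pkg-a", ["purl-1"])
rich = ById("pkg-a", ["purl-1", "purl-2", "purl-3"])

objects = {basic}
objects |= {rich}  # `rich` compares equal to `basic`, so it is not added

kept = next(iter(objects))  # still the `basic` instance
```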

Fix this by merging objects by SPDX ID instead. The merge:
- Detects objects with duplicate SPDX IDs
- For Element types, compares completeness by externalIdentifier
  count and keeps the version with more entries
- Keeps the existing object for non-Element types (CreationInfo,
  Agent, Tool) to preserve object identity
- Adds objects without IDs as-is
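
The selection rule listed above can be sketched outside of oe.sbom30
as follows. This is an illustration under stated assumptions, not the
patched code: FakeElement and merge_by_id are hypothetical names,
plain strings stand in for externalIdentifier entries, and the
Element type check is left out.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based hashing, so instances can live in a set
class FakeElement:
    """Hypothetical stand-in for an SPDX element; not part of oe.spdx30."""
    _id: str = ""
    externalIdentifier: list = field(default_factory=list)


def merge_by_id(mine, theirs):
    """Merge `theirs` into `mine`, preferring the richer duplicate."""
    by_id = {o._id: o for o in mine if o._id}
    for other in theirs:
        if not other._id:
            mine.add(other)  # objects without an ID are added as-is
            continue
        current = by_id.get(other._id)
        if current is None:
            mine.add(other)  # ID not seen before
            by_id[other._id] = other
        elif len(other.externalIdentifier) > len(current.externalIdentifier):
            mine.discard(current)  # swap in the more complete version
            mine.add(other)
            by_id[other._id] = other
    return mine


basic = FakeElement("pkg-a", ["purl-1"])
rich = FakeElement("pkg-a", ["purl-1", "purl-2", "purl-3"])
merged = merge_by_id({basic}, {rich})
```

On a tie in entry count the object already present is kept, which
matches the strict greater-than comparison used in the patch.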

As a result, for each duplicated element a consolidated SBOM keeps
the version with the most externalIdentifier entries found in the
source documents.

The bug was discovered while testing multi-PURL support where
packages can have varying externalIdentifier counts (base PURL
vs base + Git commit + Git branch PURLs), but affects any
scenario with duplicate SPDX IDs during consolidation.

Signed-off-by: Stefano Tondo <[email protected]>
---
 meta/lib/oe/sbom30.py | 47 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/meta/lib/oe/sbom30.py b/meta/lib/oe/sbom30.py
index 227ac51877..c77e18f4e8 100644
--- a/meta/lib/oe/sbom30.py
+++ b/meta/lib/oe/sbom30.py
@@ -822,7 +822,52 @@ class ObjectSet(oe.spdx30.SHACLObjectSet):
                 if not e.externalSpdxId in imports:
                     imports[e.externalSpdxId] = e
 
-            self.objects |= other.objects
+            # Merge objects intelligently: if same SPDX ID exists, keep the one with more complete data
+            #
+            # WHY DUPLICATES OCCUR: When consolidating SPDX documents (e.g., recipe -> package -> image),
+            # the same package can be referenced at different build stages, each with varying levels of
+            # detail. Early stages may have basic PURLs, while later stages add Git metadata qualifiers.
+            # This is architectural - multi-stage builds naturally create multiple representations of
+            # the same entity.
+            #
+            # However, preserve object identity for types that get referenced (like CreationInfo)
+            # to avoid breaking serialization
+            other_by_id = {}
+            for obj in other.objects:
+                obj_id = getattr(obj, '_id', None)
+                if obj_id:
+                    other_by_id[obj_id] = obj
+
+            self_by_id = {}
+            for obj in self.objects:
+                obj_id = getattr(obj, '_id', None)
+                if obj_id:
+                    self_by_id[obj_id] = obj
+
+            # Merge: for duplicate IDs, prefer the object with more externalIdentifier entries
+            # but only for Element types (not CreationInfo, Agent, Tool, etc.)
+            for obj_id, other_obj in other_by_id.items():
+                if obj_id in self_by_id:
+                    self_obj = self_by_id[obj_id]
+                    # Only replace Elements with more complete data
+                    # Do NOT replace CreationInfo or other supporting types to preserve object identity
+                    if isinstance(self_obj, oe.spdx30.Element):
+                        # If both have externalIdentifier, keep the one with more entries
+                        self_ext_ids = getattr(self_obj, 'externalIdentifier', [])
+                        other_ext_ids = getattr(other_obj, 'externalIdentifier', [])
+                        if len(other_ext_ids) > len(self_ext_ids):
+                            # Replace self object with other (more complete) object
+                            self.objects.discard(self_obj)
+                            self.objects.add(other_obj)
+                    # For non-Element types (CreationInfo, Agent, Tool), keep existing to preserve identity
+                else:
+                    # New object, just add it
+                    self.objects.add(other_obj)
+
+            # Add any objects without IDs
+            for obj in other.objects:
+                if not getattr(obj, '_id', None):
+                    self.objects.add(obj)
 
         for o in add_objectsets:
             merge_doc(o)
-- 
2.53.0
