MrJs133 opened a new pull request, #243:
URL: https://github.com/apache/incubator-hugegraph-ai/pull/243

   Bug Report
   After deleting vertices, running `update_vid_embedding` reports that the corresponding vectors were removed, but the removal is **not** persisted to the index file.
   
   ---
   
   ### Initial State
   
   * 13 vertices
   * 13 `vid` embeddings
   
   ---
   
   ### After Clearing Graph Data
   
   * 0 vertices
   
   ---
   
   ### After `update_vid_embedding`
   
   * Output shows: “Removed 13 vectors”
   
![image](https://github.com/user-attachments/assets/be815bf5-c516-4ebb-989b-71e2120c9e66)
   
   
   However, calling `get_vector_index_info` still shows 13 `vid` embeddings.
   
![image](https://github.com/user-attachments/assets/0ab510d8-6b2e-4ce3-b54f-979e4ec775ba)
   
   
   ---
   
   ### Problem Location
   
   Function `get_vector_index_info`:
   
   ```python
   def get_vector_index_info():
       chunk_vector_index = VectorIndex.from_index_file(
           os.path.join(resource_path, huge_settings.graph_name, "chunks")
       )
       graph_vid_vector_index = VectorIndex.from_index_file(
           os.path.join(resource_path, huge_settings.graph_name, "graph_vids")
       )
       return json.dumps({
           "embed_dim": chunk_vector_index.index.d,
           "vector_info": {
               "chunk_vector_num": chunk_vector_index.index.ntotal,
               "graph_vid_vector_num": graph_vid_vector_index.index.ntotal,
               "graph_properties_vector_num": len(chunk_vector_index.properties)
           }
       }, ensure_ascii=False, indent=2)
   ```
   
   This logic is correct. It reads `graph_vid_vector_num` from 
`graph_vid_vector_index.index.ntotal`.
   
   ---
   
   ### `update_vid_embedding` Code Analysis
   
   ```python
   past_vids = self.vid_index.properties
   present_vids = context["vertices"]
   removed_vids = set(past_vids) - set(present_vids)
   removed_num = self.vid_index.remove(removed_vids)
   added_vids = list(set(present_vids) - set(past_vids))
   ```
   
   This correctly identifies vectors to be removed and added.
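
To make the set arithmetic concrete, here is a small sketch with made-up vertex IDs (the IDs are illustrative and do not come from the report). It also shows the scenario from this bug: once the graph is cleared, every past ID lands in `removed_vids` and `added_vids` is empty.

```python
# Hypothetical vertex IDs, chosen only to illustrate the diff logic.
past_vids = ["1:alice", "1:bob", "2:hugegraph"]   # IDs already in the index
present_vids = ["1:bob", "3:new"]                 # IDs currently in the graph

# Indexed but no longer in the graph: these vectors should be removed.
removed_vids = set(past_vids) - set(present_vids)
# In the graph but not yet indexed: these need new embeddings.
added_vids = list(set(present_vids) - set(past_vids))

# The bug scenario: the graph is cleared, so everything is removed and nothing is added.
cleared_removed = set(past_vids) - set()
cleared_added = list(set() - set(past_vids))
```
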
   `self.vid_index.remove()` implementation:
   
   ```python
   def remove(self, props: Union[Set[Any], List[Any]]) -> int:
       if isinstance(props, list):
           props = set(props)
       indices = []
       remove_num = 0
   
       for i, p in enumerate(self.properties):
           if p in props:
               indices.append(i)
               remove_num += 1
       self.index.remove_ids(np.array(indices))
       self.properties = [p for i, p in enumerate(self.properties) if i not in indices]
       return remove_num
   ```
   
   This also seems correct.
   
   ---
   
   ### Debug Output
   
   ```python
   log.debug("before %s", self.vid_index.index.ntotal)
   removed_num = self.vid_index.remove(removed_vids)
   log.debug("after %s", self.vid_index.index.ntotal)
   ```
   
   Output:
   
   ```
   [05/20/25 13:51:20] DEBUG before 13
   [05/20/25 13:51:20] DEBUG after 0
   ```
   
   → This confirms that in-memory deletion is successful.
   
   However, running `update_vid_embedding` a second time shows:
   
   ```
   [05/20/25 13:53:23] DEBUG before 13
   [05/20/25 13:53:23] DEBUG after 0
   ```
   
   → The index loaded at the start of the second run again holds 13 vectors, which confirms that the index file was never updated (i.e., the deletion was not persisted).
   
   And this is **verified by loading the index from file** via:
   
   ```python
   self.index_dir = os.path.join(resource_path, huge_settings.graph_name, "graph_vids")
   self.vid_index = VectorIndex.from_index_file(self.index_dir)
   
   log.debug("after %s", self.vid_index.index.ntotal)
   ```

   Note: the result of the deletion was never saved to the index file.
   
   ---
   
   ### Root Cause
   
   In the full `BuildSemanticIndex.run()` implementation:
   
   ```python
   removed_vids = set(past_vids) - set(present_vids)
   removed_num = self.vid_index.remove(removed_vids)
   added_vids = list(set(present_vids) - set(past_vids))
   
   if added_vids:
       ...
       self.vid_index.add(...)
       self.vid_index.to_index_file(self.index_dir)
   else:
       log.debug("No update vertices to build vector index.")
   ```
   
   The call to `self.vid_index.to_index_file(self.index_dir)` only happens **if 
`added_vids` is non-empty**.
   
   So if you're only removing embeddings (i.e., no new vertices), the deletion 
is never persisted to disk.
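
The control flow above can be reproduced without the real index. The following is a minimal sketch, assuming a hypothetical `ToyIndex` class that stores only a list of IDs as JSON; it stands in for `VectorIndex` and does not use its actual storage format. `run_update` mirrors the quoted branch structure, saving only when `added_vids` is non-empty.

```python
import json
import os
import tempfile


class ToyIndex:
    """Hypothetical stand-in for VectorIndex: a list of IDs persisted as JSON."""

    def __init__(self, properties=None):
        self.properties = list(properties or [])

    @classmethod
    def from_index_file(cls, dir_path):
        path = os.path.join(dir_path, "index.json")
        if not os.path.exists(path):
            return cls()
        with open(path, encoding="utf-8") as f:
            return cls(json.load(f))

    def to_index_file(self, dir_path):
        os.makedirs(dir_path, exist_ok=True)
        with open(os.path.join(dir_path, "index.json"), "w", encoding="utf-8") as f:
            json.dump(self.properties, f)

    def remove(self, props):
        props = set(props)
        before = len(self.properties)
        self.properties = [p for p in self.properties if p not in props]
        return before - len(self.properties)


def run_update(index_dir, present_vids):
    """Mirror the quoted flow: the save happens only inside the `added_vids` branch."""
    index = ToyIndex.from_index_file(index_dir)
    past_vids = index.properties
    removed_num = index.remove(set(past_vids) - set(present_vids))
    added_vids = list(set(present_vids) - set(past_vids))
    if added_vids:
        index.properties.extend(added_vids)
        index.to_index_file(index_dir)
    # Return the reported removal count and the in-memory size after the update.
    return removed_num, len(index.properties)


with tempfile.TemporaryDirectory() as index_dir:
    ToyIndex([f"v{i}" for i in range(13)]).to_index_file(index_dir)
    first = run_update(index_dir, present_vids=[])    # graph cleared
    second = run_update(index_dir, present_vids=[])   # same update, run again
    on_disk = len(ToyIndex.from_index_file(index_dir).properties)
```

Both runs report 13 removals and an empty in-memory index, while the file still holds 13 entries, which matches the debug output above.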
   
   ---
   
   ### Fix
   
   ```python
   removed_num = self.vid_index.remove(removed_vids)
   self.vid_index.to_index_file(self.index_dir)  # <-- Add this line
   ```
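
One possible variation, which is not what this PR does, is to write the file once at the end whenever anything was removed or added, so a run that both removes and adds does not save twice. The sketch below assumes a hypothetical `sync_index` helper and `embed` callable, and it assumes an `add(vectors, props)` signature because the original call is elided. `RecordingIndex` is a fake used only to exercise the helper.

```python
from typing import Any, Callable, Iterable, List


def sync_index(index: Any, index_dir: str, present_vids: Iterable[Any],
               embed: Callable[[List[Any]], List[Any]]) -> dict:
    """Hypothetical helper: apply removals and additions, then persist once if anything changed."""
    past_vids = list(index.properties)
    present = set(present_vids)
    removed_vids = set(past_vids) - present
    added_vids = list(present - set(past_vids))

    removed_num = index.remove(removed_vids) if removed_vids else 0
    if added_vids:
        # Assumed signature add(vectors, props); the original call is elided.
        index.add(embed(added_vids), added_vids)

    changed = bool(removed_num or added_vids)
    if changed:
        index.to_index_file(index_dir)  # one write covers removals and additions
    return {"removed": removed_num, "added": len(added_vids), "saved": changed}


class RecordingIndex:
    """Tiny in-memory fake that counts saves; used only to exercise sync_index."""

    def __init__(self, properties):
        self.properties = list(properties)
        self.saves = 0

    def remove(self, props):
        before = len(self.properties)
        self.properties = [p for p in self.properties if p not in props]
        return before - len(self.properties)

    def add(self, vectors, props):
        self.properties.extend(props)

    def to_index_file(self, dir_path):
        self.saves += 1


def fake_embed(vids):
    """Placeholder embedding function returning one dummy vector per ID."""
    return [[0.0] for _ in vids]


fake = RecordingIndex(["a", "b", "c"])
remove_only = sync_index(fake, "unused-dir", [], embed=fake_embed)  # removal is saved
no_change = sync_index(fake, "unused-dir", [], embed=fake_embed)    # nothing to save
```

The trade-off is a slightly longer method in exchange for skipping the write when nothing changed; the one-line fix above is simpler and is sufficient to resolve the reported bug.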
   
   ---
   
   ### Verification
   
   * **Remove only**: works ✅
   * **Add only**: works ✅
   * **No change**: works ✅
   
   Problem solved.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

