kaori-seasons commented on issue #7370:
URL: https://github.com/apache/gravitino/issues/7370#issuecomment-2972319655

   @yuqi1129
   The root cause of this Metaspace OOM error is that IsolatedClassLoader 
instances are not released in time during high-frequency catalog attribute 
updates, which leaks Metaspace memory.
   
   - 1. Error triggering process
   The loop in the user script triggers the following chain of events:
   
   Catalog attribute update invalidates the cache: Each PUT request that 
updates a catalog attribute makes the alterCatalog method first invalidate the 
catalog cache (CatalogManager.java:683) and then reload the catalog instance 
(CatalogManager.java:703-706).
   
   Frequent creation of IsolatedClassLoader: Each time the catalog is reloaded, 
the createCatalogWrapper method creates a new IsolatedClassLoader instance 
(CatalogManager.java:962).
   
   ServiceLoader loading consumes Metaspace: In the lookupCatalogProvider 
method, ServiceLoader.load loads the CatalogProvider classes through the new 
IsolatedClassLoader (CatalogManager.java:1149). Because every loader defines 
its own copy of each class, every reload adds another set of class metadata to 
Metaspace.
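
To make the last step concrete, here is a minimal, self-contained sketch of why "one new loader per reload" multiplies class metadata. It does not use Gravitino code: `LoaderPerReloadDemo` and `reloadOnce` are hypothetical names, and `java.lang.reflect.Proxy` is used only as a convenient way to define a class inside a fresh loader without needing a jar on disk.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

/** Hypothetical demo, not Gravitino code: each fresh loader defines its own class. */
final class LoaderPerReloadDemo {

    /** Simulates one catalog reload: a brand-new loader that defines one class. */
    static Class<?> reloadOnce() {
        // Anonymous subclass because ClassLoader is abstract; it adds no behavior.
        ClassLoader fresh = new ClassLoader(LoaderPerReloadDemo.class.getClassLoader()) {};
        InvocationHandler noop = (proxy, method, args) -> null;
        // The proxy class is defined by the loader passed in, so it lives in `fresh`.
        Object instance = Proxy.newProxyInstance(fresh, new Class<?>[] {Runnable.class}, noop);
        return instance.getClass();
    }

    public static void main(String[] args) {
        Class<?> first = reloadOnce();
        Class<?> second = reloadOnce();
        // Same interface, but two distinct Class objects owned by two distinct loaders.
        System.out.println("same class:  " + (first == second));
        System.out.println("same loader: " + (first.getClassLoader() == second.getClassLoader()));
    }
}
```

Each distinct `Class` object carries its own metadata in Metaspace, so a loop of reloads grows Metaspace until the old loaders are actually unloaded.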
   
   - 2. Memory leak mechanism
   ClassLoader accumulation: Although a cache cleanup mechanism is configured 
(CatalogManager.java:302-306), under high-frequency operations new 
IsolatedClassLoader instances may be created faster than garbage collection can 
reclaim the old ones.
   
   Metaspace recovery lag: Each IsolatedClassLoader loads its class metadata 
into Metaspace. Even though CatalogWrapper.close() closes the classLoader 
(CatalogManager.java:250), closing does not free that metadata by itself: the 
JVM reclaims it only after the loader becomes unreachable and a GC cycle 
unloads its classes, so Metaspace recovery can lag behind.
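
The difference between "recovery lag" and a real leak can be probed with a small experiment. The sketch below is an assumption-laden illustration, not Gravitino code: `LoaderReachabilityDemo`, `createLoader`, and the `retained` field are hypothetical, and `System.gc()` is only a hint to the JVM, so the "released" case may report `false` on some runs or collectors.

```java
import java.lang.ref.WeakReference;
import java.lang.reflect.Proxy;

/** Hypothetical demo: a loader is reclaimable only once nothing references it. */
final class LoaderReachabilityDemo {

    // Plays the role of a leaked reference (cache entry, thread-local, listener).
    static Object retained;

    /** Creates a loader, defines one class in it, and optionally leaks an instance. */
    static WeakReference<ClassLoader> createLoader(boolean retain) {
        ClassLoader loader = new ClassLoader(LoaderReachabilityDemo.class.getClassLoader()) {};
        Object instance =
            Proxy.newProxyInstance(loader, new Class<?>[] {Runnable.class}, (p, m, a) -> null);
        if (retain) {
            // instance -> its Class -> its defining loader: the loader stays reachable.
            retained = instance;
        }
        return new WeakReference<>(loader);
    }

    /** Requests GC a few times and reports whether the loader was collected. */
    static boolean collectedAfterGc(WeakReference<ClassLoader> ref) {
        for (int attempt = 0; attempt < 10 && ref.get() != null; attempt++) {
            System.gc();
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return ref.get() == null;
    }

    public static void main(String[] args) {
        System.out.println("released loader collected: " + collectedAfterGc(createLoader(false)));
        System.out.println("retained loader collected: " + collectedAfterGc(createLoader(true)));
    }
}
```

If old IsolatedClassLoader instances behave like the retained case, no amount of waiting or GC tuning will return their Metaspace; a heap dump showing what still references them would then be the next step.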
   
   *Proposed solution*
   
   - 1. Short-term mitigation plan
   Increase the JVM Metaspace limits:
   
   ```
   -XX:MetaspaceSize=256m
   -XX:MaxMetaspaceSize=512m
   -XX:+CMSClassUnloadingEnabled
   -XX:+UseCMSInitiatingOccupancyOnly
   ```
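   Note that the two CMS flags only take effect with the CMS collector, which 
was deprecated in JDK 9 and removed in JDK 14; with G1, class unloading during 
concurrent marking is governed by -XX:+ClassUnloadingWithConcurrentMark, which 
is enabled by default. Raising the limits also only delays the OOM if the old 
loaders are never released.
   
   Adjust cache expiration time: Extend the catalog cache expiration time to 
reduce unnecessary reloads.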
   
   - 2. Code optimization plan
   Optimize the catalog attribute update mechanism:
    -  Identify which attribute changes actually require recreating the catalog 
instance
    -  For attribute updates that do not affect the catalog's core behavior, 
apply a hot update instead of a reload
    -  Implement an incremental update mechanism for attribute changes
   
   Improve IsolatedClassLoader lifecycle management:
    -  Add stricter resource management in the createCatalogWrapper method 
(CatalogManager.java:957-983)
    -  Consider reusing IsolatedClassLoader instances by sharing one class 
loader across catalogs of the same provider type
    -  Enhance the cleanup logic of the IsolatedClassLoader.close() method 
(IsolatedClassLoader.java:150-158)
   
   Add protection mechanisms:
    -  Rate-limit catalog operations when their frequency is too high
    -  Add Metaspace usage monitoring and alerts
    -  Implement batch processing for catalog updates
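
As a rough illustration of the class loader reuse idea, the following sketch keeps one loader per provider key and closes it when the last user releases it. Everything here is hypothetical (`SharedLoaderPool`, `acquire`, `release`, the string key); it is not how CatalogManager works today. It also assumes that catalogs sharing a key really can share a classpath and static state, which would need to be verified per provider.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical reference-counted pool: one class loader per provider key. */
final class SharedLoaderPool {

    private static final class Entry {
        final ClassLoader loader;
        int refCount;

        Entry(ClassLoader loader) {
            this.loader = loader;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final Function<String, ClassLoader> factory;

    /** The factory stands in for whatever builds an isolated loader for a provider. */
    SharedLoaderPool(Function<String, ClassLoader> factory) {
        this.factory = factory;
    }

    /** Returns the shared loader for the key, creating it on first use. */
    synchronized ClassLoader acquire(String providerKey) {
        Entry entry = entries.computeIfAbsent(providerKey, k -> new Entry(factory.apply(k)));
        entry.refCount++;
        return entry.loader;
    }

    /** Drops one reference; the loader is closed and forgotten when none remain. */
    synchronized void release(String providerKey) {
        Entry entry = entries.get(providerKey);
        if (entry == null) {
            throw new IllegalStateException("No loader acquired for " + providerKey);
        }
        if (--entry.refCount > 0) {
            return;
        }
        entries.remove(providerKey);
        if (entry.loader instanceof Closeable) {
            try {
                ((Closeable) entry.loader).close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

With a pool like this, an alter operation that reloads a catalog of the same provider would acquire the existing loader instead of defining every class again, so repeated updates would no longer add metadata to Metaspace.
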
   - 3. Monitoring and alerting
   Monitor key metrics:
   
   - Metaspace usage
   - IsolatedClassLoader creation/destruction frequency
   - Catalog cache hit rate
   - ServiceLoader call frequency
   
   Implement a degradation strategy:
   
   - When Metaspace usage is too high, suspend non-critical catalog operations
   - Queue catalog updates to avoid concurrent operations
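
For the Metaspace metric and the degradation check, the standard `java.lang.management` API is enough for a first version. This is a minimal sketch under assumptions: `MetaspaceGauge` and its methods are hypothetical names, the pool name "Metaspace" is what HotSpot reports (other JVMs may differ), and the ratio is only defined when `-XX:MaxMetaspaceSize` is set.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical helper: reads Metaspace usage from the platform MXBeans. */
final class MetaspaceGauge {

    /** Returns used/max for the Metaspace pool, or -1 if the pool or its limit is unknown. */
    static double usedRatio() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (!"Metaspace".equals(pool.getName())) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            // getMax() is -1 when no MaxMetaspaceSize is configured.
            if (usage == null || usage.getMax() <= 0) {
                return -1.0;
            }
            return (double) usage.getUsed() / usage.getMax();
        }
        return -1.0;
    }

    /** True when usage is known and at or above the threshold (for example 0.85). */
    static boolean shouldDeferNonCriticalOps(double threshold) {
        double ratio = usedRatio();
        return ratio >= 0 && ratio >= threshold;
    }

    public static void main(String[] args) {
        System.out.println("Metaspace used ratio: " + usedRatio());
        System.out.println("Defer non-critical ops: " + shouldDeferNonCriticalOps(0.85));
    }
}
```

A check like `shouldDeferNonCriticalOps` could gate the update queue, and the same ratio could be exported as a metric for alerting.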

