-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70463/
-----------------------------------------------------------

(Updated April 16, 2019, 6:15 a.m.)


Review request for atlas, Kapildeo Nayak, Madhan Neethiraj, Nikhil Bonte, Nixon Rodrigues, and Sarath Subramanian.


Changes
-------

Updates include:
- Addressed review comments.
- Renamed 'DataPatch' to 'AtlasPatch'.


Bugs: ATLAS-3132
    https://issues.apache.org/jira/browse/ATLAS-3132


Repository: atlas


Description
-------

**Approach**
- Refactored the existing implementation for the new design.
- Renamed 'Java Patch Framework' to 'Data Patch Framework', the rationale being that its purpose is essentially to modify the structure of existing data.
- New _DataPatchService_: Modified the order in which services are called. _DataPatchService_ is called before other services are invoked, giving it a chance to complete before new data is accepted.
- New _DataPatchRegistry_: Data access (CRUD) operations for data patches.
- New _UniqueAttributePatchHandler_: Current implementation for adding the new property to data vertices. Implemented rudimentary caching to prevent repetitive look-ups.
- New REST endpoint to query the status of patches.
- Duplicate entities are detected during the patch application process. (See below.)

**Performance**
Because data patching is a high-volume operation, its performance has been treated as a priority.
- New _NewPropertyDataHandler_ uses the database in bulk-loading mode for rapid processing. This scales with available resources. Additional properties:
    - _atlas.processing.batchSize_: Size of a batch.
    - _atlas.processing.numWorkers_: Number of worker threads to be employed.
- Leverages the existing PC framework.

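
To illustrate how a batch size and a worker count of the kind configured by the two properties above typically interact, here is a minimal, self-contained Java sketch. It is not the code in this patch and does not use the Atlas PC framework; the class `BatchedPatchSketch`, its methods, and the placeholder "work" are hypothetical, and the values 300 and 4 merely mirror one of the test runs reported here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch only: NOT the Atlas implementation or its PC framework.
public class BatchedPatchSketch {

    // Split the ids into consecutive batches of at most batchSize elements.
    public static <T> List<List<T>> partition(List<T> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            int end = Math.min(i + batchSize, ids.size());
            batches.add(new ArrayList<>(ids.subList(i, end)));
        }
        return batches;
    }

    // Process every batch on a fixed pool of numWorkers threads.
    // Returns the total number of ids handled.
    public static long processAll(List<Long> vertexIds, int batchSize, int numWorkers)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
        AtomicLong processed = new AtomicLong();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<Long> batch : partition(vertexIds, batchSize)) {
                pending.add(pool.submit(() -> {
                    for (Long id : batch) {
                        // Placeholder for the real per-vertex work
                        // (assumed: adding the new property to the vertex).
                        processed.incrementAndGet();
                    }
                    // A real implementation would likely commit once per batch here.
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface any worker failure
            }
        } finally {
            pool.shutdown();
        }
        return processed.get();
    }

    public static void main(String[] args) throws Exception {
        List<Long> ids = new ArrayList<>();
        for (long i = 0; i < 1000; i++) {
            ids.add(i);
        }
        System.out.println(processAll(ids, 300, 4)); // prints 1000
    }
}
```

In a scheme like this, a larger batch means fewer commits but more uncommitted state held per worker, while more workers raise throughput only as long as memory and the backing store keep up.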
Processing speed:
- 300K vertices: ~5 mins (8 threads, batch size: 3000)
- 3.2M vertices: ~39 mins (12 threads, batch size: 300, memory: 8192 MB)
- 4.2M entities: ~45 mins (from 2019-04-12 04:44:50 to 2019-04-12 05:29:04; 4 threads, batch size: 300)

**Duplicates Detection**
Once the patch has run, the user can fgrep application.log to get a dump of all the duplicates detected in the process:
_fgrep "Duplicates detected" /var/log/atlas/application.log_

**Memory & CPU**
The more memory is available, the more threads can be spawned.


Diffs (updated)
-----

  intg/src/main/java/org/apache/atlas/pc/WorkItemConsumer.java b7eb4d89c
  intg/src/main/java/org/apache/atlas/pc/WorkItemManager.java 0e7d3f22d
  notification/src/main/java/org/apache/atlas/kafka/EmbeddedKafkaServer.java 32b597fb6
  notification/src/main/java/org/apache/atlas/kafka/KafkaNotification.java 1d0a2734b
  repository/src/main/java/org/apache/atlas/repository/patches/AtlasJavaPatchHandler.java 9153d497b
  repository/src/main/java/org/apache/atlas/repository/patches/AtlasPatchHandler.java PRE-CREATION
  repository/src/main/java/org/apache/atlas/repository/patches/AtlasPatchManager.java PRE-CREATION
  repository/src/main/java/org/apache/atlas/repository/patches/AtlasPatchRegistry.java PRE-CREATION
  repository/src/main/java/org/apache/atlas/repository/patches/AtlasPatchService.java PRE-CREATION
  repository/src/main/java/org/apache/atlas/repository/patches/PatchContext.java a60422b80
  repository/src/main/java/org/apache/atlas/repository/patches/TypeNameAttributeCache.java PRE-CREATION
  repository/src/main/java/org/apache/atlas/repository/patches/UniqueAttributePatch.java PRE-CREATION
  repository/src/main/java/org/apache/atlas/repository/patches/UniqueAttributePatchHandler.java f2238f1b0
  repository/src/main/java/org/apache/atlas/repository/patches/UniqueAttributePatchProcessor.java PRE-CREATION
  repository/src/main/java/org/apache/atlas/repository/store/bootstrap/AtlasTypeDefStoreInitializer.java 78f3faf99
  repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasGraphUtilsV2.java 80141b4f1
  repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphRetriever.java 03d2c066b
  repository/src/test/java/org/apache/atlas/patches/AtlasPatchRegistryTest.java PRE-CREATION
  webapp/src/main/java/org/apache/atlas/notification/NotificationHookConsumer.java ce2d76f11
  webapp/src/main/java/org/apache/atlas/web/resources/AdminResource.java c5ceb9d6d
  webapp/src/test/java/org/apache/atlas/web/resources/AdminResourceTest.java 223a90a9c


Diff: https://reviews.apache.org/r/70463/diff/4/

Changes: https://reviews.apache.org/r/70463/diff/3-4/


Testing
-------

**Unit tests**
Additional tests added.

**Volume tests**
Verification with large datasets:
- 4M entities
- 3.2M entities
- 16K entities

**Performance tests**
CPU usage, memory usage and disk I/O.

**Pre-commit build**
https://builds.apache.org/view/A/view/Atlas/job/PreCommit-ATLAS-Build-Test/1031/

**Gremlin Queries for Verification**

Find entities that do not have the new attribute:
```
g.V().has('__typeName', within('hive_db','hive_table','hive_column')).hasNot('Referenceable.__u_qualifiedName').valueMap('__guid')
```

Drop the new attribute from entities that have it:
```
g.V().has('__typeName', within('hive_db','hive_table','hive_column')).has('Referenceable.__u_qualifiedName').properties('Referenceable.__u_qualifiedName').drop()
```

Mark the patch as FAILED so that it is re-run:
```
g.V().has('__patch.id', 'JAVA_PATCH_0000_001').property('__patch.state','FAILED');
```


Thanks,

Ashutosh Mestry