-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/57649/#review169383
-----------------------------------------------------------

Jenkins is going to check this review request...

- Apoorv Naik


On March 17, 2017, 5:09 a.m., Ashutosh Mestry wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/57649/
> -----------------------------------------------------------
>
> (Updated March 17, 2017, 5:09 a.m.)
>
>
> Review request for atlas and Madhan Neethiraj.
>
>
> Bugs: ATLAS-1665
>     https://issues.apache.org/jira/browse/ATLAS-1665
>
>
> Repository: atlas
>
>
> Description
> -------
>
> **Background**
> ==============
> The existing implementation of the Export API adds 1 *.json* file per entity
> when generating the ZIP file. This makes ZIP file creation inefficient. The
> ZIP files are 75% larger in size than what could be possible with fewer
> *.json* file entries.
>
> **Solution**
> ============
> The implementation uses the new v2 API *AtlasEntityWithExtInfo*
> representation instead of *AtlasEntity*. This format combines an entity with
> its related entities as one. E.g. *hive_table* will contain all the
> *hive_columns* that it is made up of. (See example section below.)
>
> This results in a significant reduction in the number of generated *JSON*
> files, which in turn reduces the size of the generated *ZIP* file.
>
> **Implementation Details**
> ==========================
> *Export API*
> - Modified the *Gremlin* query used to fetch connected entities to return
>   *guid* with a *boolean* indicating whether the entity is a process or not.
> - _ExportService_ Modified implementation to fetch *AtlasEntityWithExtInfo*
>   instead of *AtlasEntity*. Modified bookkeeping to save *process* (lineage)
>   entities after all non-process entities are saved.
> - _ZipSink_ Minor modification to serialize *AtlasEntityWithExtInfo*.
>
> *Import API*
> - _ZipSource_ Modified to source *AtlasEntityWithExtInfo*.
> - _EntityImportStream_ Modified to source *AtlasEntityWithExtInfo*.
> - _AtlasEntityStreamForImport.getGuid_ Modified to source requested entities
>   first from the stored *AtlasEntityWithExtInfo* object. Request from the
>   stream only if not found.
> - _AtlasEntityStoreV1.bulkImport_ Minor modification to use the new changes
>   to the stream.
>
>
> **Functional Areas Impacted**
> =============================
> *Export*
> - Full
> - Connected
> - HDFS path-based import.
>
> *Import*
> - Regular flow.
>
> **Examples**
> ============
> Case of *hive_db*: Within the GraphDB the database has inward edges from
> objects that refer to it (tables, in this case). So the
> *AtlasEntityWithExtInfo* for a database will not have any referred entities.
>
> Case of *hive_table*: Within the GraphDB the table has outward edges pointing
> to the columns it is made up of. It also has edges pointing to the database
> and storage descriptor. Hence, the *AtlasEntityWithExtInfo* for a table will
> have the full representation of all the columns and references to the
> database and storage descriptor.
>
> **Metrics**
> ===========
>
> Date | File Size | No. of Entities | Export Duration | Import Duration |
> -----|-----------|-----------------|-----------------|-----------------|
> 3/02 | 180 MB    | 202930          | 22 mins         | 1:38 hrs        |
> 3/08 | 7 KB      | 3               | 5 secs          | 7 sec           |
> -----------------------------------------------------------------------|
> Improvement                                                            |
> -----------------------------------------------------------------------|
> 3/14 | 38 MB     | 202930          | 20 mins         | 1:10 hrs        |
> 3/14 | 5 KB      | 3               | 5 secs          | 7 sec           |
>
>
> **Summary**
> ===========
> With these changes the file size reduction is: ~65%.
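
To make the Background point about per-entry overhead concrete, here is a minimal, self-contained sketch. It is not code from this patch: the class name `ZipEntryOverheadSketch`, the entry names, and the fake entity JSON are all made up for illustration, and only the `java.util.zip` calls are real. It writes the same payload once as one ZIP entry per entity and once as a single combined entry, then compares the archive sizes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class ZipEntryOverheadSketch {

    // Hypothetical stand-in for one serialized entity; not the real Atlas JSON.
    static String fakeEntityJson(int i) {
        return "{\"typeName\":\"hive_column\",\"guid\":\"guid-" + i
                + "\",\"name\":\"col_" + i + "\"}";
    }

    // Writes every (entry name -> content) pair into an in-memory ZIP
    // and returns the size of the resulting archive in bytes.
    static int zippedSize(Map<String, String> entries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> e : entries.entrySet()) {
                zip.putNextEntry(new ZipEntry(e.getKey()));
                zip.write(e.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.size();
    }

    // Layout A: one .json entry per entity.
    static int sizeWithOneEntryPerEntity(int count) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            entries.put("guid-" + i + ".json", fakeEntityJson(i));
        }
        return zippedSize(entries);
    }

    // Layout B: all entities combined into a single .json entry,
    // loosely analogous to a table carrying its columns.
    static int sizeWithOneCombinedEntry(int count) {
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                json.append(',');
            }
            json.append(fakeEntityJson(i));
        }
        json.append(']');
        Map<String, String> entries = new LinkedHashMap<>();
        entries.put("guid-table.json", json.toString());
        return zippedSize(entries);
    }

    public static void main(String[] args) {
        int count = 1000;
        System.out.println("one entry per entity: "
                + sizeWithOneEntryPerEntity(count) + " bytes");
        System.out.println("one combined entry:   "
                + sizeWithOneCombinedEntry(count) + " bytes");
    }
}
```

Each ZIP entry carries its own headers and is compressed on its own, so many tiny entries cost more than one larger entry with the same content. The exact ratio depends on the data; the numbers printed here say nothing about real Atlas exports.
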
>
>
> Diffs
> -----
>
>   intg/src/main/java/org/apache/atlas/model/instance/AtlasEntity.java 4e3895d
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityGraphDiscoveryV1.java 6c88510
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStoreV1.java cce3fca
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStream.java 5d9a7d4
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStreamForImport.java 8cb36ac
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/EntityStream.java 4c43921
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/InMemoryMapEntityStream.java 241f6d0
>   repository/src/main/java/org/apache/atlas/util/AtlasGremlin2QueryProvider.java 4743b73
>   webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java e123ff7
>   webapp/src/main/java/org/apache/atlas/web/resources/ZipSink.java 37d9eb5
>   webapp/src/main/java/org/apache/atlas/web/resources/ZipSource.java a69f7fa
>
>
> Diff: https://reviews.apache.org/r/57649/diff/6/
>
>
> Testing
> -------
>
> Test data:
> - QuickStart_v1: 3 databases.
> - A *hive_db* with 922 tables.
> - Stocks *hive_db* with 1 database, table, process and 5 columns.
> - A *hive_db* with 522K entities.
>
> The changes impact all the flows in the Export & Import APIs.
> Unit testing: Manual.
> Integration testing: Manual.
> Accuracy testing: Manual. Verified using Export -> Import -> Export -> file
> compare.
>
>
> File Attachments
> ----------------
>
> Patch on 2.6-maint
>   https://reviews.apache.org/media/uploaded/files/2017/03/17/5fc7a466-9bac-4282-a9fd-659d0528b443__export-size-optimized.2.6-maint.2.patch
>
>
> Thanks,
>
> Ashutosh Mestry
>
>