Hello,

What is the proper way to bulk-insert/bulk-update a large set (~75K) of 
dual-referenced entities in Atlas 2.0?

TL;DR:

  *   We can't POST (/v2/entity/bulk) an rdbms_instance with 15 rdbms_dbs, 
each with 100 rdbms_tables, each with 50 rdbms_columns, all with parent/child 
relationships to each other, in one API call.
  *   Splitting it bottom-up doesn’t work because column entities require table 
entities to exist.
  *   Splitting it top-down doesn’t work because the process creates false 
deletes/updates on the second synchronization cycle.
  *   Other than fetching all existing entities, comparing them 
field-by-field, and splitting up each request so that there are no "dangling" 
references across API calls, is there another/better way?

Details:
We are trying to harvest metadata from a large RDBMS instance. Suppose the 
instance has 15 databases, each with 100 tables, and each table with 50 
columns, producing ~75K entities. Including them all in one API call times out 
(or causes a "broken pipe" error), so we need to split the load into multiple 
API calls. But since every parent/child reference needs a GUID (and the 
referenced entity may not exist yet, i.e. the GUID would be negative), we have 
to be very careful about how we split things up.
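
For context, one of our bulk requests looks roughly like the sketch below 
(Python with requests; the host and credentials are placeholders, and the 
relationship attribute names "instance" and "db" are what we understand the 
bundled RDBMS model to use). Negative GUIDs only resolve between entities 
inside the same request, which is exactly what breaks once the 75K entities 
have to be split across calls:

    import requests

    ATLAS = "http://atlas-host:21000/api/atlas/v2"   # placeholder
    AUTH = ("admin", "admin")                        # placeholder

    payload = {
        "entities": [
            {
                # Negative GUID: only resolvable by other entities in this request.
                "typeName": "rdbms_db",
                "guid": "-1",
                "attributes": {"name": "sales",
                               "qualifiedName": "myinstance.sales"},
                "relationshipAttributes": {
                    "instance": {"typeName": "rdbms_instance",
                                 "uniqueAttributes": {"qualifiedName": "myinstance"}},
                },
            },
            {
                "typeName": "rdbms_table",
                "guid": "-2",
                "attributes": {"name": "orders",
                               "qualifiedName": "myinstance.sales.orders"},
                # Parent reference via the negative GUID assigned above.
                "relationshipAttributes": {"db": {"guid": "-1"}},
            },
        ]
    }

    requests.post(ATLAS + "/entity/bulk", json=payload, auth=AUTH).raise_for_status()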

We can't create all the columns first, because the column entities need valid 
table entities to reference (the same applies to the database-to-table case). 
When we create all the databases first, then the tables, then the columns, it 
works on the first go-around. However, when our script runs a second time, as 
soon as we POST all the rdbms_dbs to the /v2/entity/bulk endpoint, Atlas 
deletes all the rdbms_tables (which makes sense, since the tables can't exist 
without their databases and we have just removed all the db-table 
relationships). By the end of the script the relationship tree is built 
correctly again, but we end up with many Atlas deletes.
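
Concretely, the top-down split we tried is roughly the following (a sketch 
only; chunking, retries and the real attribute payloads are omitted, and 
all_dbs/all_tables/all_columns stand for the harvested entity lists). On the 
second run, the first call is what triggers the cascade of deletes, because 
each rdbms_db arrives without its table references:

    import requests

    ATLAS = "http://atlas-host:21000/api/atlas/v2"   # placeholder
    AUTH = ("admin", "admin")                        # placeholder

    def post_bulk(entities):
        # One /v2/entity/bulk call per list, kept small enough not to time out.
        requests.post(ATLAS + "/entity/bulk",
                      json={"entities": entities}, auth=AUTH).raise_for_status()

    post_bulk(all_dbs)      # 2nd run: dbs re-posted without table refs -> tables deleted
    post_bulk(all_tables)   # tables re-created and re-linked to their dbs
    post_bulk(all_columns)  # columns re-created and re-linked to their tables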

One solution would be to read all pre-existing entities for each entity type, 
compare each of them (previous vs. current), determine which entities are new 
and which are unchanged, and hope that the actual diff/update isn't bigger 
than the request limit, but that seems like a lot of work for a solution that 
could still fail.
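
A rough sketch of what that diff approach might look like is below (Python; 
the qualifiedName-keyed comparison and the attributes_changed() helper are 
hypothetical, and we are assuming the basic-search entity headers carry 
qualifiedName):

    import requests

    ATLAS = "http://atlas-host:21000/api/atlas/v2"   # placeholder
    AUTH = ("admin", "admin")                        # placeholder

    def existing_by_type(type_name):
        # Page through /v2/search/basic to collect what Atlas already knows about.
        found, offset = {}, 0
        while True:
            r = requests.get(ATLAS + "/search/basic", auth=AUTH,
                             params={"typeName": type_name, "limit": 100,
                                     "offset": offset,
                                     "excludeDeletedEntities": "true"})
            r.raise_for_status()
            page = r.json().get("entities") or []
            if not page:
                return found
            for header in page:
                found[header["attributes"]["qualifiedName"]] = header
            offset += len(page)

    def attributes_changed(new, old):
        # Hypothetical field-by-field comparison; the real logic would depend
        # on which attributes we actually harvest.
        return new["attributes"] != old.get("attributes", {})

    def to_upsert(harvested, existing):
        # Keep only entities that are new or changed; these still have to be
        # chunked so that no chunk references a GUID created only in a later chunk.
        return [e for e in harvested
                if e["attributes"]["qualifiedName"] not in existing
                or attributes_changed(e, existing[e["attributes"]["qualifiedName"]])]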

We've looked at https://atlas.apache.org/#/ImportAPIOptions, but that API 
seems to be designed to work with an export taken from Atlas, which doesn't 
apply to our scenario.

Is there a better way?

Thank you for your time!
Javier
