[ https://issues.apache.org/jira/browse/ATLAS-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Saad updated ATLAS-4389:
------------------------
    Attachment: image-2021-08-05-11-22-29-259.png

> Best practice or a way to bring in a large number of entities on a regular basis
> --------------------------------------------------------------------------------
>
>                 Key: ATLAS-4389
>                 URL: https://issues.apache.org/jira/browse/ATLAS-4389
>             Project: Atlas
>          Issue Type: Bug
>          Components: atlas-core
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Saad
>            Priority: Major
>              Labels: documentation, newbie, performance
>         Attachments: image-2021-08-05-11-22-29-259.png
>
>
> Would you be so kind as to let us know if there is any best practice or a
> way to bring in a large number of entities on a regular basis?
>
> *Our use case:*
> We will be bringing in around 12,000 datasets, 12,000 jobs and 70,000
> columns. We want to do this as part of our deployment pipeline for other
> upstream projects.
> At every deploy we want to do the following:
>  - Add the jobs, datasets and columns that are not in Atlas
>  - Update the jobs, datasets and columns that are already in Atlas
>  - Delete from Atlas the jobs that have been deleted from the upstream systems
>
> So far we have considered using the bulk API endpoint (/v2/entity/bulk).
> This has its own issues. We found that if the payload is too big (in our
> case, bigger than 300-500 entities) the request times out. The deeper the
> relationships, the fewer entities you can send through the bulk endpoint.
> Inspecting some of the code, we feel that both REST and streaming data
> through Kafka follow the same code path and ultimately yield the same
> performance.
> Further, we found that when creating entities the type registry becomes the
> bottleneck. We discovered this by profiling the JVM: only one core
> processes the entities and their relationships.
>
> *Questions:*
> 1. What is the best practice for bulk loading lots of entities in a
> reasonable time? We are aiming to load 12k jobs, 12k datasets and 70k
> columns in less than 10 minutes.
> 2. Where should we start if we want to scale the API? Is there any known
> way to horizontally scale Atlas?
>
> Here are some of the stats for the load testing we did:

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
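For reference, the workaround the reporter describes (keeping each bulk payload under the observed 300-500-entity ceiling) can be sketched as below. This is a minimal illustration, not the reporter's actual pipeline: the host, port, batch size, and entity dicts are assumptions, and the `{"entities": [...]}` wrapper matches the AtlasEntitiesWithExtInfo payload accepted by POST /api/atlas/v2/entity/bulk.

```python
import json
import urllib.request

# Hypothetical Atlas endpoint; substitute your own host and credentials.
ATLAS_BULK_URL = "http://atlas-host:21000/api/atlas/v2/entity/bulk"


def chunk(entities, size=300):
    """Yield batches of at most `size` entities.

    300 reflects the report above that payloads beyond 300-500 entities
    time out; deeper relationship graphs may need a smaller batch size.
    """
    for i in range(0, len(entities), size):
        yield entities[i:i + size]


def post_batch(batch, url=ATLAS_BULK_URL):
    """POST one batch of entity dicts as an AtlasEntitiesWithExtInfo payload."""
    payload = json.dumps({"entities": batch}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def load_all(entities):
    """Submit every entity in timeout-safe batches, collecting responses."""
    return [post_batch(batch) for batch in chunk(entities)]
```

Batching trades a single large request for many small ones, so it avoids the timeout but does not remove the single-core type-registry bottleneck profiled above; it only bounds the cost of each individual request.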