Aman,

 

Atlas uses the following two HBase tables:
atlas_janus: used by JanusGraph to store Atlas typedefs, entities and
relationships
ATLAS_ENTITY_AUDIT_EVENTS: used by Atlas to store audit logs of changes to
entities
 

Can you share the size taken up by each of the above tables? That will help
determine the next steps to purge data.
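One way to check the on-disk size of each table is via HDFS (a sketch; the paths assume the default HBase root directory and namespace, so adjust them to your cluster layout):

```shell
# Summarized, human-readable size of each Atlas table on HDFS.
# '/hbase' and the 'default' namespace are assumptions; check hbase-site.xml.
hdfs dfs -du -s -h /hbase/data/default/atlas_janus
hdfs dfs -du -s -h /hbase/data/default/ATLAS_ENTITY_AUDIT_EVENTS
```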

 

As Pinal and Nikhil suggested earlier, deleting from the atlas_janus table or
setting a TTL for entries in the table are viable options.
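For reference, a TTL can be set on an HBase column family from the hbase shell. A sketch only (the audit table's column family name 'dt' is an assumption, so verify with `describe` first; note that the rest of this thread warns that setting a TTL on atlas_janus column families can corrupt entities):

```shell
# Inspect the column families first, then set a 30-day TTL (in seconds).
hbase shell <<'EOF'
describe 'ATLAS_ENTITY_AUDIT_EVENTS'
alter 'ATLAS_ENTITY_AUDIT_EVENTS', {NAME => 'dt', TTL => 2592000}
EOF
```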

 

Thanks,

Madhan

 

 

From: Nikhil Bonte <bontenik...@gmail.com>
Date: Tuesday, September 24, 2024 at 1:17 AM
To: "dev@atlas.apache.org" <dev@atlas.apache.org>
Cc: "pinal.s...@freestoneinfotech.com" <pinal.s...@freestoneinfotech.com>, 
"mad...@apache.org" <mad...@apache.org>
Subject: Re: Exponential record growth in atlas janus hbase table

 

Hello Aman,

 

May I ask why you want to set TTL on the table `apache_atlas_janus`?

 

I ask this because the table `apache_atlas_janus` actually stores the typeDefs
and entities.

If you delete anything from this table bypassing the Atlas API layer, you
could end up with corrupted or incomplete typeDefs or entities.

 

If you are worried about the large number of rows in the table, that is simply
how JanusGraph manages HBase as a persistence layer. You may run HBase
compactions periodically to optimize and reduce disk usage.
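For example, a major compaction can be triggered from the hbase shell; it rewrites store files and reclaims space from deleted or expired cells (a sketch; best run during a low-traffic window):

```shell
# Major-compact the JanusGraph table (table name as used in this thread).
hbase shell <<'EOF'
major_compact 'apache_atlas_janus'
EOF
```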

 

Regarding the data in the table not being readable: JanusGraph serializes data
before persisting it to HBase; you may check the JanusGraph documentation for
more details on this.

 

Overall, I suggest not modifying the HBase table `apache_atlas_janus` by
bypassing the API layer. If you want to get rid of older assets that are no
longer relevant, you may use the DELETE APIs as suggested by Pinal.
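A sketch of that flow, assuming basic auth and an Atlas server on localhost:21000 (replace <guid> with a real entity GUID):

```shell
# 1) Soft-delete the entity through the Atlas API.
curl -u admin:admin -X DELETE \
  'http://localhost:21000/api/atlas/v2/entity/guid/<guid>'

# 2) Permanently purge the soft-deleted entity (admin privilege required).
curl -u admin:admin -X PUT \
  -H 'Content-Type: application/json' \
  -d '["<guid>"]' \
  'http://localhost:21000/api/atlas/admin/purge'
```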

 

 

Thanks

Nikhil P. Bonte

 

On Tue, Sep 24, 2024 at 11:31 AM Aman Kumar <akms21...@gmail.com> wrote:

Hi Pinal, thanks for the response,

Actually, our requirement here is specifically to set up a TTL on the 'janus
table' in HBase.

 - Every time we update an entity, new rows and records are being
created in HBase (the Janus table). In production, the row count is
reaching around 20-30M, further increasing disk usage.

 - Since the rows in the janus hbase table are stored in serialized form,
we don't know what exactly to delete from there.

 - The only attribute we can figure out is the record timestamp in HBase.
Pruning records based on timestamps causes entities to disappear
from the UI.

Since there has been no new release since 2.3.0 that supports TTL on the
'Janus table', we are thinking of deserializing the Janus HBase table rows
and implementing our own TTL logic.

Is there a way to deserialize the Janus HBase table rows? It would be very
helpful if you could point out the classes in the source code where
serialization/deserialization happens.

Or is there any property that can be set to disable compression in the
Janus HBase table?
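For what it's worth, JanusGraph's HBase backend does expose a compression setting; a hedged sketch (this controls HBase column-family compression and takes effect for newly created tables; since the rows are serialized, disabling it will not make them human-readable):

```properties
# In the graph configuration; Atlas passes through properties under the
# 'atlas.graph.' prefix in atlas-application.properties.
atlas.graph.storage.hbase.compression-algorithm=NONE
```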

Thanks!

On Mon, Sep 23, 2024 at 10:36 AM pinal shah <pi...@apache.org> wrote:

> On Mon, Sep 23, 2024 at 11:29 AM pinal shah <shahpina...@gmail.com> wrote:
>
> > Hi Aman Kumar,
> >
> > To permanently delete records from Atlas, you can use the REST API call
> > PUT /admin/purge.
> > But this API purges only soft-deleted entities, so before using this
> > API, delete the required entities using either DELETE
> > /v2/entity/bulk?guid={guid} or DELETE /v2/entity/guid/{guid}
> >
> > Below points to be noted when you purge a deleted entity:
> > - The entity is removed from Atlas.
> > - Related, dependent entities are also removed. For example, when purging
> > a deleted Hive table, the deleted entities for the table columns, DDL,
> > and storage description are also purged.
> > - The entity is no longer available in search results, even with Show
> > historical entities enabled.
> > - Lineage relationships that include the purged entities are removed,
> > which breaks lineages that depend upon a purged entity to show
> > connections between ancestors and descendants.
> > - Classifications propagated across the purged entities are removed in
> > all descendant entities.
> > - Classifications assigned to the purged entities and set to propagate
> > are removed from all descendant entities.
> >
> > Regards,
> > Pinal Shah
> >
> > On Fri, Sep 20, 2024 at 5:53 PM Aman Kumar <akms21...@gmail.com> wrote:
> >
> >> Hi Team,
> >>
> >> I'm using apache-atlas-2.3.0 with embedded hbase and solr.
> >>
> >> Problem Statement:
> >>
> >> - Row count in the 'apache_atlas_janus' table is increasing exponentially
> >> whenever a new typedef or entity is created/updated. While this is
> >> not an issue with the embedded HBase and Solr distribution, in production
> >> we are reaching around a 20-30M row count in the Janus table
> >>
> >> - While apache-atlas provides greater control on setting up the TTL on
> >> 'audit table' (
> >> https://issues.apache.org/jira/browse/ATLAS-4768
> >> ), there's no way to control TTL on the 'janus table', other than to
> >> setup TTL on the column families manually using hbase shell.
> >>
> >> - Records on 'janus table' are not human-readable, since they
> >> are stored in a serialized form.
> >>
> >> - Setting up TTL manually on the janus table's column families causes
> >> atlas to malfunction, evidently so, because we don't know what rows
> >> are getting deleted.
> >>
> >>
> >>
> >> What is required:
> >>
> >>     -> Some level of control on the
> >> janus table TTL, so as to purge out older/not required records,
> >> without messing up with any other components in atlas.
> >>
> >>     -> If TTL is not possible, then at least there should be a way to
> >> deserialize the hbase rows in
> >> the janus table, so that we can implement our own TTL logic.
> >>
> >>
> >>
> >> What I've tried:
> >>
> >>     -> Reading the janus hbase table through the java code, ran into
> >> this issue:
> >> https://github.com/JanusGraph/janusgraph/issues/941
> >>
> >>     -> Tried setting up the TTL on the vertices in JanusGraph using
> >> gremlin queries. The problem is, each vertex in Atlas is defined with
> >> the label 'vertex'.
> >>        Opening a management object on that label itself throws the
> >> error: 'Name cannot be in protected namespace: vertex'
> >>
> >>     -> Even tried setting up TTL on a vertex on a local JanusGraph
> >> instance (without Atlas). Didn't see any difference in row count even
> >> after the vertex TTL had expired
> >>
> >>     -> Lastly, tried to delete some rows in the Janus table based on a
> >> timestamp range, for the following scenarios:
> >>
> >>         - Tried deleting the rows in janus table only for a single
> >> update timestamp
> >>
> >>         - Tried deleting the rows only for entity updates timestamp
> >>
> >>         - Tried deleting the rows which were created before the latest
> >> entity update
> >>
> >>
> >>     In all the cases, the entity disappeared from the UI, with the
> >> following error:
> >>         No typename found for given entity with guid:
> >> c90744dc-7ac6-4b5f-8fd2-ffc6282f5a64
> >>
> >>
> >>     In short, deleting any rows related to an entity
> >> in the Janus table breaks the entity itself.
> >>
> >>
> >> Please let me know if there's any existing solution for the above
> >> problem, or should I reach out to
> >> the janusgraph community regarding the serialization/deserialization
> >> issue.
> >>
> >> Thanks!
> >>
> >
>
