Hi Team,

I'm using apache-atlas-2.3.0 with embedded hbase and solr.

Problem Statement:

- The row count in the 'apache_atlas_janus' table grows rapidly
whenever a typedef or entity is created or updated. While this is
not an issue with the embedded hbase and solr distribution, in
production we are reaching a row count of around 20-30M in the janus
table.

- While apache-atlas provides control over the TTL on the
'audit table' (
https://issues.apache.org/jira/browse/ATLAS-4768
), there's no way to control the TTL on the 'janus table', other than
to set up a TTL on the column families manually using the hbase shell.

- Records in the 'janus table' are not human-readable, since they
are stored in a serialized form.

- Setting up a TTL manually on the janus table's column families causes
atlas to malfunction, presumably because we cannot tell which rows
are getting deleted.



What is required:

    -> Some level of control over the
janus table TTL, so that older or no longer required records can be
purged without breaking any other components in atlas.

    -> If TTL is not possible, then at least there should be a way to
deserialize the hbase rows in
the janus table, so that we can implement our own TTL logic.



What I've tried:

    -> Tried reading the janus hbase table from java code, but ran into
this issue:
https://github.com/JanusGraph/janusgraph/issues/941

    -> Tried setting up the TTL on the vertices in janusgraph using
gremlin queries. The problem is that every vertex in atlas is defined
with the label 'vertex'.
       Configuring that label through the management object throws the
error: 'Name cannot be in protected namespace: vertex'

    -> Also tried setting up a TTL on a vertex in a local janusgraph
instance (without atlas). Did not see any difference in the row count,
even after the vertex TTL had expired.

    -> Finally, tried deleting some rows in the janus table based on a
timestamp range, for the following scenarios:

        - Tried deleting the rows in janus table only for a single
update timestamp

        - Tried deleting the rows only for entity update timestamps

        - Tried deleting the rows which were created before the latest
entity update


    In all of these cases, the entity disappeared from the UI, with the
following error:
        No typename found for given entity with guid:
c90744dc-7ac6-4b5f-8fd2-ffc6282f5a64


    In short, deleting any rows related to the entity
in the janus table breaks the entity itself.
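
    One plausible explanation, sketched as a toy model: JanusGraph
stores each vertex as a single row whose properties and edges are
separate cells, and hbase applies both TTL and timestamp-range deletes
per cell. The Cell class, the column names and the 'hive_table' value
below are readable stand-ins chosen for illustration (assumptions); the
real table holds serialized bytes, and this is not the actual Atlas
layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    column: str     # hypothetical readable stand-in for a serialized column
    timestamp: int  # write time of this cell, in milliseconds
    value: str


def delete_cells_in_range(row, start_ts, end_ts):
    """Keep only the cells written outside [start_ts, end_ts)."""
    return [c for c in row if not (start_ts <= c.timestamp < end_ts)]


def expire_cells(row, now_ts, ttl_ms):
    """Model a column-family TTL: drop every cell older than ttl_ms."""
    return [c for c in row if now_ts - c.timestamp <= ttl_ms]


# One vertex row: type name and guid written at creation time,
# one property rewritten by a later entity update.
entity_row = [
    Cell("__typeName", 1_000, "hive_table"),  # example value, an assumption
    Cell("__guid", 1_000, "c90744dc-7ac6-4b5f-8fd2-ffc6282f5a64"),
    Cell("description", 5_000, "updated description"),
]

# Deleting everything written before the latest update ...
after_delete = delete_cells_in_range(entity_row, 0, 2_000)
# ... and a 2 second TTL evaluated at t=6000 both keep only the newest cell.
after_ttl = expire_cells(entity_row, now_ts=6_000, ttl_ms=2_000)

print([c.column for c in after_delete])  # -> ['description']
print([c.column for c in after_ttl])     # -> ['description']
```

    In this model the row still exists after the purge, but the cell
holding the type name is gone, which would be consistent with the
'No typename found' error above.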


Please let me know if there's any existing solution for the above
problem, or whether I should reach out to
the janusgraph community regarding the serialization/deserialization issue.

Thanks!
