Hi all,
I think it's time to discuss how we want to prepare for scenarios where
the number of identities to manage (users, for the vast majority) is
very high, from 1 million upwards; the typical case is CIAM (Customer
IAM).
In the IdM deliveries I've been involved in so far, scaling Apache
Syncope up to hundreds of thousands of identities has not been trivial,
but doable: naturally, most of the optimization work has to be done at
the DBMS level, as that is the component under the most stress.
I think we can agree that, in such scenarios, the most critical data are
those bound to the actual identities (hence not connectors, resources,
tasks, reports or any other configuration): consider that with 1 million
users and 10 attributes for each user, we have the following table sizes
to deal with:
* SyncopeUser: 1M rows
* UPlainAttr: 10M rows
* UPlainAttrValue: 10M rows
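As a quick check of these figures, here is a minimal Python sketch of
the arithmetic. It assumes one value per attribute, which is what the
10M / 10M split above implies; the function name and the
values_per_attr parameter are only illustrative.

```python
def eav_row_counts(users, attrs_per_user, values_per_attr=1):
    """Estimate row counts for the three identity tables.

    values_per_attr=1 is an assumption (single-valued attributes);
    multi-valued attributes would only grow UPlainAttrValue.
    """
    plain_attrs = users * attrs_per_user
    return {
        "SyncopeUser": users,
        "UPlainAttr": plain_attrs,
        "UPlainAttrValue": plain_attrs * values_per_attr,
    }


counts = eav_row_counts(users=1_000_000, attrs_per_user=10)
for table, rows in counts.items():
    print(f"{table}: {rows:,} rows")
```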
Moreover, the search views [1] are all of the same order of magnitude
(although one can enable the Elasticsearch extension in such cases, to
improve performance).
I think this is what we need to change in order to get better results.
So far, I have been able to think of a couple of possibilities:
1. Leverage the JSON column support provided by PostgreSQL [2], MySQL
[3], SQL Server [4] and Oracle DB [5] to extend the current
OpenJPA-based persistence layer
Pros:
* reduce the sizing problems by removing the need for the UPlainAttr and
UPlainAttrValue tables, search views and joins
* limited implementation effort, as most of the current JPA layer can
be retained
* keep enjoying the benefits of referential integrity and other
constraints enforced by the DBMS (including UNIQUE)
Cons:
* each DBMS provides JSON support in its own fashion: the implementation
wouldn't be trivial (although we can make it incremental, and add
support for one DBMS at a time)
* scaling capabilities and performance might be overrated - even
though there seem to be very good references, at least for PostgreSQL
[6][7]
2. Implement a new persistence layer based on a different technology - I
have done some experiments with Apache Cassandra [8] and the DataStax
Java Driver [9]
Pros:
* built natively for scalability and high availability
* proven and widespread adoption
* Object Mapper [10] allows semi-transparent conversion between
query results and domain objects, somewhat similar to JPA's EntityManager
Cons:
* relations are obviously not available, only custom types [11]: the
persistence model would have to be redesigned to cope with this
* constraints are not available - more specifically UNIQUE, which would
have to be handled by additional code
* implementation effort: the whole persistence layer would have to be
redone, not only identity-related entities such as User, UPlainAttr,
UPlainAttrValue...
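To make option 1 a bit more concrete, here is a rough Python sketch of
the data-model change it implies: the rows that today live in UPlainAttr
and UPlainAttrValue are collapsed into one JSON document per user, which
could then be stored in a single JSON column. The sample rows, the tuple
layout and the function name are made up for illustration; this is not
Syncope code and it does not touch any DBMS.

```python
import json
from collections import defaultdict

# Hypothetical rows, shaped like the result of joining UPlainAttr with
# UPlainAttrValue: (user key, schema name, value).
EAV_ROWS = [
    ("u1", "firstname", "Ann"),
    ("u1", "email", "ann@example.org"),
    ("u1", "email", "a.smith@example.org"),
    ("u2", "firstname", "Bob"),
]


def to_json_documents(rows):
    """Group attribute rows by user and serialize each group as JSON."""
    grouped = defaultdict(lambda: defaultdict(list))
    for user_key, schema, value in rows:
        # Every schema maps to a list, so multi-valued attributes fit too.
        grouped[user_key][schema].append(value)
    return {
        user_key: json.dumps(attrs, sort_keys=True)
        for user_key, attrs in grouped.items()
    }


documents = to_json_documents(EAV_ROWS)
for user_key, document in documents.items():
    print(user_key, document)
```

With this shape, the four attribute rows above become two documents, one
per user, so the row count follows the number of users rather than the
number of attribute values.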
Besides the two above, there are of course other options in the NoSQL
world (Neo4j, MongoDB, ...), but I am afraid they all present similar
challenges to Cassandra.
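Among those challenges, the missing UNIQUE constraint is probably the
easiest to illustrate. Below is a minimal Python sketch of the kind of
additional code it would take: a separate index keyed by the unique
value, which has to be claimed before the entity is written. The class
and its methods are hypothetical, and the dictionary is only an
in-memory stand-in; with Cassandra the claim step would presumably
become a conditional write (INSERT ... IF NOT EXISTS) on a dedicated
table, which is my assumption and not something I have tested.

```python
import threading


class UniqueIndex:
    """In-memory stand-in for a lookup table keyed by the unique value."""

    def __init__(self):
        self._owners = {}
        self._lock = threading.Lock()

    def claim(self, value, owner_key):
        """Reserve value for owner_key.

        Returns True if the value was free or already belongs to
        owner_key, False if another owner holds it.
        """
        with self._lock:
            return self._owners.setdefault(value, owner_key) == owner_key

    def release(self, value, owner_key):
        """Free value, but only if owner_key is the one holding it."""
        with self._lock:
            if self._owners.get(value) == owner_key:
                del self._owners[value]


usernames = UniqueIndex()
first = usernames.claim("rossini", "user-1")   # free, so the claim succeeds
second = usernames.claim("rossini", "user-2")  # taken, so the claim is refused
usernames.release("rossini", "user-1")
third = usernames.claim("rossini", "user-2")   # free again after the release
print(first, second, third)
```

The point is that every create, rename and delete has to go through
such an index, and that the claim and the actual write are two separate
steps that the code must keep consistent.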
WDYT?
Regards.
[1]
https://github.com/apache/syncope/blob/master/core/persistence-jpa/src/main/resources/views.xml#L50-L94
[2] https://www.postgresql.org/docs/10/static/functions-json.html
[3] https://dev.mysql.com/doc/refman/8.0/en/json.html
[4]
https://docs.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-2017
[5] https://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246
[6]
https://www.postgresql.eu/events/fosdem2018/sessions/session/1691/slides/63/High-Performance%20JSON_%20PostgreSQL%20Vs.%20MongoDB.pdf
[7]
http://coussej.github.io/2016/01/14/Replacing-EAV-with-JSONB-in-PostgreSQL/
[8] http://cassandra.apache.org/
[9] https://github.com/datastax/java-driver
[10]
https://docs.datastax.com/en/developer/java-driver/3.5/manual/object_mapper/
[11]
http://cassandra.apache.org/doc/latest/cql/types.html?highlight=user%20defined%20types#user-defined-types
--
Francesco Chicchiriccò
Tirasa - Open Source Excellence
http://www.tirasa.net/
Member at The Apache Software Foundation
Syncope, Cocoon, Olingo, CXF, OpenJPA, PonyMail
http://home.apache.org/~ilgrosso/