Nice to hear more on this topic! Please see my comments inline:

On Fri, Oct 17, 2025 at 8:58 AM Kapil Panchal <[email protected]> wrote:
> Databases store rows as key-value pairs keyed by the primary key.

Actual DB storage is more nuanced. For MySQL/MariaDB, the default InnoDB engine uses B+ tree pages clustered on the PK, which somewhat resembles key-value organization. PostgreSQL, however, uses heap storage, with rows appended in roughly insertion order to 8KB data file pages, so the statement is largely inaccurate for Postgres: rows there are not stored as key-value pairs keyed by the PK.

> While UUIDs or Snowflake IDs can marginally improve primary-key lookup
> times via O(log n) algorithms, this benefits at most 10% of queries.

This 10% figure needs deeper investigation. Is it based on studies of existing systems or on Fineract production data? We should assess its relevance to Fineract's current and future workloads (e.g., microfinance reporting). One exercise we could do is analyze query logs from Fineract's backend DB (e.g., via MySQL's slow query log or PostgreSQL's pg_stat_statements) to quantify PK-indexed vs. non-PK queries. Focusing on key tables like m_loan or m_client, we could identify which indexes are hit most; I've included a rough sketch of how to pull those stats just before my signature below. Do any implementers/partners have such data (e.g., query stats or EXPLAIN outputs), and would they be willing to share it on the list or on a call?

> The remaining 90% of searches—those performed on columns or fields—still
> require scanning O(n) rows, where such an approach offers no improvement.

In well-optimized systems, common queries leverage PK or secondary indexes. From running several Mifos Sandbox and production instances as a SaaS solution, I've seen Fineract's key tables (e.g., m_loan) use well-defined PKs and relationships for efficient joins and lookups. Are we certain that 90% of queries bypass these indexes? Query log analysis could clarify this.

> In practical terms, adopting UUIDs or Snowflake IDs does not yield
> measurable performance gains for the majority of operations and risks
> creating the impression of optimization without substantive effect.

Beyond the 10/90 figure, practicality is a valid concern. Implementing UUIDs or Snowflake IDs requires combining developer and DB expertise. Having worked on Flyway and Liquibase migrations for multiple Fineract production tenants and analyzed schema changes and query patterns over the years, I'd be happy to share insights from this DevOps perspective. Paul's comparison table suggests UUIDs have lower maintenance costs than Snowflake, which we should weigh, especially for the multi-tenant setups common in Fineract.

My suggestion is to validate the three approaches (auto-increment, UUID, Snowflake) with pros, cons, and data-driven stats, as Paul and James emphasized. (To make the Snowflake option concrete, I've also appended a minimal generator sketch at the very end of this mail, below Adam's quoted proposal.) On James's points, we should check for downstream reliance on sequential IDs (e.g., in reporting tools) and verify idempotency impacts; Paul's note that UUIDs have minimal effect there is promising but needs testing with Fineract. An FSIP, as James proposed, is a great starting point to formalize this. I'm confident our community has the capability to tackle this, but let's ground it in real-world data.
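To make the query-log exercise concrete, here is a minimal sketch (not Fineract code, purely illustrative) that reads PostgreSQL's built-in statistics views to compare sequential scans against per-index hit counts on m_loan and m_client. The connection URL and credentials are placeholders, it assumes the PostgreSQL JDBC driver is on the classpath, and the counters only reflect activity since the statistics were last reset. For MySQL/MariaDB the equivalent data would come from the slow query log or performance_schema, which this sketch doesn't cover.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IndexUsageReport {

    public static void main(String[] args) throws Exception {
        // Placeholders: point these at the tenant database you want to inspect.
        String url = "jdbc:postgresql://localhost:5432/fineract_default";
        try (Connection conn = DriverManager.getConnection(url, "postgres", "postgres")) {

            // Sequential vs. index scans per table, from PostgreSQL's cumulative stats view.
            String tableSql = "SELECT relname, seq_scan, idx_scan "
                    + "FROM pg_stat_user_tables WHERE relname IN ('m_loan', 'm_client')";
            try (PreparedStatement ps = conn.prepareStatement(tableSql);
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s: seq_scan=%d, idx_scan=%d%n",
                            rs.getString("relname"), rs.getLong("seq_scan"), rs.getLong("idx_scan"));
                }
            }

            // How often each individual index (PK vs. secondary) was actually used.
            String indexSql = "SELECT relname, indexrelname, idx_scan "
                    + "FROM pg_stat_user_indexes WHERE relname IN ('m_loan', 'm_client') "
                    + "ORDER BY idx_scan DESC";
            try (PreparedStatement ps = conn.prepareStatement(indexSql);
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s.%s: idx_scan=%d%n",
                            rs.getString("relname"), rs.getString("indexrelname"), rs.getLong("idx_scan"));
                }
            }
        }
    }
}

Comparing the PK index's idx_scan count against the secondary indexes and the seq_scan counter would give us an evidence-based answer to the 10%/90% question.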
-Terence Monteiro

> On Thu, Sep 4, 2025 at 2:26 PM Ádám Sághy <[email protected]> wrote:
>
>> Dear Fineract community,
>>
>> I'd like to open a discussion on our current use of auto-increment
>> database primary keys (e.g. BIGINT IDENTITY) and whether it's time to
>> evolve this design.
>>
>> *Current Issues with Auto-Increment PKs*
>>
>>    - Always requires flush/commit: The database generates the ID only
>>      once the transaction is flushed or committed, which makes it difficult
>>      to work with new entities before persistence. This complicates batching,
>>      caching, and testability.
>>    - Performance implications: Because IDs are only known after flush,
>>      code often performs additional flushes just to obtain generated IDs.
>>      This introduces unnecessary round-trips and slows down persistence at
>>      scale. It also causes various issues around having multiple flushes per
>>      transaction.
>>    - Predictability: Sequential IDs are trivially guessable. When
>>      exposed via APIs, this can pose a security and privacy concern.
>>    - Operational flexibility: While a single database removes the need
>>      for distributed coordination, relying solely on DB-generated sequences
>>      ties identity generation to persistence, which reduces flexibility when
>>      considering service decomposition, replication, or multi-tenant setups.
>>
>> *Recommendation: Snowflake-style BIGINT IDs*
>>
>> Application-level ID generation using a Snowflake-style algorithm (64-bit
>> IDs with timestamp + machine ID + sequence).
>>
>> Pros:
>>
>>    - IDs are available immediately in memory, no flush required.
>>    - Globally unique across instances.
>>    - Roughly time-ordered, which helps with debugging and querying.
>>    - Removes the guessability of plain sequential IDs.
>>    - Still a compact BIGINT for indexing and joins.
>>    - *No need for migration!*
>>
>> Cons:
>>
>>    - Requires a generator implementation and clock synchronization
>>      safeguards.
>>    - IDs are less human-friendly than sequential integers.
>>
>> Looking forward to hearing your thoughts on this proposal!
>>
>> Regards,
>>
>> Adam
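P.S. To make the quoted Snowflake recommendation concrete, below is a minimal sketch of such a generator. It is only an illustration, not a proposal for the actual implementation: the 41/10/12 bit split, the custom epoch, and the naive clock-rollback handling are assumptions, and a production version would need proper per-instance node-ID assignment plus real clock safeguards.

public class SnowflakeIdGenerator {

    // Arbitrary custom epoch (2024-01-01T00:00:00Z); 41 bits of milliseconds last ~69 years from here.
    private static final long CUSTOM_EPOCH = 1_704_067_200_000L;
    private static final long NODE_ID_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MAX_NODE_ID = (1L << NODE_ID_BITS) - 1;   // 1023
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1; // 4095

    private final long nodeId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdGenerator(long nodeId) {
        if (nodeId < 0 || nodeId > MAX_NODE_ID) {
            throw new IllegalArgumentException("nodeId must be between 0 and " + MAX_NODE_ID);
        }
        this.nodeId = nodeId;
    }

    public synchronized long nextId() {
        long timestamp = System.currentTimeMillis();
        if (timestamp < lastTimestamp) {
            // Clock moved backwards; a production generator needs a real safeguard here.
            throw new IllegalStateException("Clock moved backwards by " + (lastTimestamp - timestamp) + " ms");
        }
        if (timestamp == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // More than 4096 IDs requested in one millisecond: busy-wait for the next one.
                while (timestamp <= lastTimestamp) {
                    timestamp = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = timestamp;
        // Layout: [41 bits timestamp][10 bits node id][12 bits sequence].
        return ((timestamp - CUSTOM_EPOCH) << (NODE_ID_BITS + SEQUENCE_BITS))
                | (nodeId << SEQUENCE_BITS)
                | sequence;
    }
}

Because the 63 bits used here fit a signed BIGINT and the 41-bit millisecond timestamp covers roughly 69 years from the chosen epoch, IDs from such a generator would still fit the existing BIGINT columns, which is what makes the "no need for migration" point in the proposal plausible.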
