Nice to hear more on this topic! Please see my comments inline:

On Fri, Oct 17, 2025 at 8:58 AM Kapil Panchal <[email protected]> wrote:
> Databases store rows as key-value pairs keyed by the primary key.

Actual DB storage is more nuanced. For MySQL/MariaDB, the default InnoDB engine uses B+ tree pages clustered on the PK, which somewhat resembles key-value organization. PostgreSQL, however, uses heap storage, with rows appended in roughly insertion order to 8KB data file pages, so the statement is largely inaccurate for Postgres: rows there are not stored as key-value pairs keyed by the PK.

> While UUIDs or Snowflake IDs can marginally improve primary-key lookup
> times via O(log n) algorithms, this benefits at most 10% of queries.

This 10% figure needs deeper investigation. Is it based on studies of existing systems or on Fineract production data? We should assess its relevance to Fineract's current and future workloads (e.g., microfinance reporting). One exercise we could do is analyze query logs from Fineract's backend DB (e.g., via MySQL's slow query log or PostgreSQL's pg_stat_statements) to quantify PK-indexed vs. non-PK queries. Focusing on key tables like m_loan or m_client, we could identify which indexes are hit most; I've included a rough sketch of how to pull those stats just before my signature below. Do any implementers/partners have such data (e.g., query stats or EXPLAIN outputs), and would they be willing to share it on the list or on a call?

> The remaining 90% of searches—those performed on columns or fields—still
> require scanning O(n) rows, where such an approach offers no improvement.

In well-optimized systems, common queries leverage PK or secondary indexes. From running several Mifos Sandbox and production instances as a SaaS solution, I've seen Fineract's key tables (e.g., m_loan) use well-defined PKs and relationships for efficient joins and lookups. Are we certain that 90% of queries bypass these indexes? Query log analysis could clarify this.

> In practical terms, adopting UUIDs or Snowflake IDs does not yield
> measurable performance gains for the majority of operations and risks
> creating the impression of optimization without substantive effect.

Beyond the 10/90 figure, practicality is a valid concern. Implementing UUIDs or Snowflake IDs requires combining developer and DB expertise. Having worked on Flyway and Liquibase migrations for multiple Fineract production tenants and analyzed schema changes and query patterns over the years, I'd be happy to share insights from this DevOps perspective. Paul's comparison table suggests UUIDs have lower maintenance costs than Snowflake, which we should weigh, especially for the multi-tenant setups common in Fineract.

My suggestion is to validate the three approaches (auto-increment, UUID, Snowflake) with pros, cons, and data-driven stats, as Paul and James emphasized. (To make the Snowflake option concrete, I've also appended a minimal generator sketch at the very end of this mail, below Adam's quoted proposal.) On James's points, we should check for downstream reliance on sequential IDs (e.g., in reporting tools) and verify idempotency impacts; Paul's note that UUIDs have minimal effect there is promising but needs testing with Fineract. An FSIP, as James proposed, is a great starting point to formalize this. I'm confident our community has the capability to tackle this, but let's ground it in real-world data.
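To make the query-log exercise concrete, here is a minimal sketch (not Fineract code, purely illustrative) that reads PostgreSQL's built-in statistics views to compare sequential scans against per-index hit counts on m_loan and m_client. The connection URL and credentials are placeholders, it assumes the PostgreSQL JDBC driver is on the classpath, and the counters only reflect activity since the statistics were last reset. For MySQL/MariaDB the equivalent data would come from the slow query log or performance_schema, which this sketch doesn't cover.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IndexUsageReport {

    public static void main(String[] args) throws Exception {
        // Placeholders: point these at the tenant database you want to inspect.
        String url = "jdbc:postgresql://localhost:5432/fineract_default";
        try (Connection conn = DriverManager.getConnection(url, "postgres", "postgres")) {

            // Sequential vs. index scans per table, from PostgreSQL's cumulative stats view.
            String tableSql = "SELECT relname, seq_scan, idx_scan "
                    + "FROM pg_stat_user_tables WHERE relname IN ('m_loan', 'm_client')";
            try (PreparedStatement ps = conn.prepareStatement(tableSql);
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s: seq_scan=%d, idx_scan=%d%n",
                            rs.getString("relname"), rs.getLong("seq_scan"), rs.getLong("idx_scan"));
                }
            }

            // How often each individual index (PK vs. secondary) was actually used.
            String indexSql = "SELECT relname, indexrelname, idx_scan "
                    + "FROM pg_stat_user_indexes WHERE relname IN ('m_loan', 'm_client') "
                    + "ORDER BY idx_scan DESC";
            try (PreparedStatement ps = conn.prepareStatement(indexSql);
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s.%s: idx_scan=%d%n",
                            rs.getString("relname"), rs.getString("indexrelname"), rs.getLong("idx_scan"));
                }
            }
        }
    }
}

Comparing the PK index's idx_scan count against the secondary indexes and the seq_scan counter would give us an evidence-based answer to the 10%/90% question.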
-Terence Monteiro

> On Thu, Sep 4, 2025 at 2:26 PM Ádám Sághy <[email protected]> wrote:
>
>> Dear Fineract community,
>>
>> I'd like to open a discussion on our current use of auto-increment
>> database primary keys (e.g. BIGINT IDENTITY) and whether it's time to
>> evolve this design.
>>
>> *Current Issues with Auto-Increment PKs*
>>
>>    - Always requires flush/commit: The database generates the ID only
>>      once the transaction is flushed or committed, which makes it difficult
>>      to work with new entities before persistence. This complicates batching,
>>      caching, and testability.
>>    - Performance implications: Because IDs are only known after flush,
>>      code often performs additional flushes just to obtain generated IDs.
>>      This introduces unnecessary round-trips and slows down persistence at
>>      scale. It also causes various issues around having multiple flushes per
>>      transaction.
>>    - Predictability: Sequential IDs are trivially guessable. When
>>      exposed via APIs, this can pose a security and privacy concern.
>>    - Operational flexibility: While a single database removes the need
>>      for distributed coordination, relying solely on DB-generated sequences
>>      ties identity generation to persistence, which reduces flexibility when
>>      considering service decomposition, replication, or multi-tenant setups.
>>
>> *Recommendation: Snowflake-style BIGINT IDs*
>>
>> Application-level ID generation using a Snowflake-style algorithm (64-bit
>> IDs with timestamp + machine ID + sequence).
>>
>> Pros:
>>
>>    - IDs are available immediately in memory, no flush required.
>>    - Globally unique across instances.
>>    - Roughly time-ordered, which helps with debugging and querying.
>>    - Removes the guessability of plain sequential IDs.
>>    - Still a compact BIGINT for indexing and joins.
>>    - *No need for migration!*
>>
>> Cons:
>>
>>    - Requires a generator implementation and clock synchronization
>>      safeguards.
>>    - IDs are less human-friendly than sequential integers.
>>
>> Looking forward to hearing your thoughts on this proposal!
>>
>> Regards,
>>
>> Adam
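P.S. To make the quoted Snowflake recommendation concrete, below is a minimal sketch of such a generator. It is only an illustration, not a proposal for the actual implementation: the 41/10/12 bit split, the custom epoch, and the naive clock-rollback handling are assumptions, and a production version would need proper per-instance node-ID assignment plus real clock safeguards.

public class SnowflakeIdGenerator {

    // Arbitrary custom epoch (2024-01-01T00:00:00Z); 41 bits of milliseconds last ~69 years from here.
    private static final long CUSTOM_EPOCH = 1_704_067_200_000L;
    private static final long NODE_ID_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MAX_NODE_ID = (1L << NODE_ID_BITS) - 1;   // 1023
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1; // 4095

    private final long nodeId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdGenerator(long nodeId) {
        if (nodeId < 0 || nodeId > MAX_NODE_ID) {
            throw new IllegalArgumentException("nodeId must be between 0 and " + MAX_NODE_ID);
        }
        this.nodeId = nodeId;
    }

    public synchronized long nextId() {
        long timestamp = System.currentTimeMillis();
        if (timestamp < lastTimestamp) {
            // Clock moved backwards; a production generator needs a real safeguard here.
            throw new IllegalStateException("Clock moved backwards by " + (lastTimestamp - timestamp) + " ms");
        }
        if (timestamp == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // More than 4096 IDs requested in one millisecond: busy-wait for the next one.
                while (timestamp <= lastTimestamp) {
                    timestamp = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = timestamp;
        // Layout: [41 bits timestamp][10 bits node id][12 bits sequence].
        return ((timestamp - CUSTOM_EPOCH) << (NODE_ID_BITS + SEQUENCE_BITS))
                | (nodeId << SEQUENCE_BITS)
                | sequence;
    }
}

Because the 63 bits used here fit a signed BIGINT and the 41-bit millisecond timestamp covers roughly 69 years from the chosen epoch, IDs from such a generator would still fit the existing BIGINT columns, which is what makes the "no need for migration" point in the proposal plausible.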
