sourabh-27 opened a new pull request, #18831: URL: https://github.com/apache/pinot/pull/18831
## Description

Solves [#18808](https://github.com/apache/pinot/issues/18808)

Apache Pinot's schema-update API has been strictly additive. The `PUT /schemas/{schemaName}` endpoint rejects any update that removes a column, to prevent accidental backward-incompatible changes. The only workaround was `force=true`, which is unsafe because it bypasses all structural validation. Furthermore, Pinot lacked a mechanism to reclaim the on-disk space occupied by removed columns in existing, already-built segments.

This PR introduces guard-railed support for deleting columns from a schema, divided into two independent, opt-in tiers:

### 1. Logical Deletion (Controller / Schema API)

A new `allowColumnDeletion` query parameter is introduced to the schema-update endpoints.

* **Behavior:** When set to `true`, callers can intentionally drop columns present in the old schema but absent from the new one.
* **Safety Guards:** All other backward-compatibility rules remain the same (e.g., changes to column types or primary keys are still rejected).

### 2. Physical Reclamation (Server / Segment Reload)

A config flag, `reclaimDeletedColumnsOnReload`, controls whether data for ingested columns missing from the schema is physically purged from segments during a reload operation.

* **Behavior:** Previously, only auto-generated default columns were cleaned up; ingested column data persisted indefinitely. When this flag is set to `true`, a segment reload drops the forward index, dictionary, and all auxiliary indexes for columns no longer present in the schema, freeing disk space.

### 3. Query Layer Behavior (Unchanged)

* **Behavior:** Queries referencing a column that has been deleted from the schema will **throw an error**. This is the existing behavior, and this PR does not change it.
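For reviewers who want a quick mental model of the logical-deletion rule, here is a minimal Python sketch. It is not the Pinot implementation: `SchemaModel` and `is_backward_compatible` are hypothetical names, a schema is reduced to a name-to-type map plus a primary-key tuple, and the real `Schema.isBackwardCompatibleWith` check presumably compares full field specs. It only illustrates that `allowColumnDeletion` relaxes the "no missing columns" rule while type and primary-key checks stay in force.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SchemaModel:
    """Hypothetical, simplified stand-in for a Pinot schema (not the real class)."""
    columns: Dict[str, str]              # column name -> data type
    primary_keys: Tuple[str, ...] = ()


def is_backward_compatible(old: SchemaModel, new: SchemaModel,
                           allow_column_deletion: bool = False) -> bool:
    # Primary keys must be unchanged regardless of the deletion flag.
    if old.primary_keys != new.primary_keys:
        return False
    for name, data_type in old.columns.items():
        if name not in new.columns:
            # A missing column is tolerated only when deletion is opted into.
            if not allow_column_deletion:
                return False
            continue
        # Columns that survive the update must keep their type.
        if new.columns[name] != data_type:
            return False
    return True


old = SchemaModel({"id": "INT", "city": "STRING", "legacy": "LONG"}, ("id",))
dropped = SchemaModel({"id": "INT", "city": "STRING"}, ("id",))
retyped = SchemaModel({"id": "INT", "city": "INT"}, ("id",))

print(is_backward_compatible(old, dropped))                              # rejected by default
print(is_backward_compatible(old, dropped, allow_column_deletion=True))  # accepted with opt-in
print(is_backward_compatible(old, retyped, allow_column_deletion=True))  # type change still rejected
```

In this model the flag defaults to `False`, mirroring the default described above, so existing callers keep the strictly additive behavior unless they opt in.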
---

## Changes

### `pinot-controller`

* **`PinotSchemaRestletResource`**: Added the `allowColumnDeletion` query parameter (default: `false`) to both the multipart and JSON `PUT /schemas/{schemaName}` endpoints.
* **`PinotHelixResourceManager`**: Passed the `allowColumnDeletion` parameter down into `updateSchema(...)`.

### `pinot-spi`

* **`Schema`**: Overloaded `isBackwardCompatibleWith(Schema oldSchema, boolean allowColumnDeletion)`. The original single-argument method signature remains intact.
* **`IndexingConfig`**: Added the `reclaimDeletedColumnsOnReload` option (default: `false`).

### `pinot-segment-local`

* **`IndexLoadingConfig`**: Exposed the `isReclaimDeletedColumnsOnReload()` configuration property.
* **`BaseDefaultColumnHandler`**: Updated to compute `REMOVE` actions for ingested columns absent from the schema when the reclamation flag is active (extending the existing auto-generated column removal logic).

### `pinot-clients`

* **`SchemaAdminClient`**: Overloaded `updateSchema(..., boolean allowColumnDeletion)`.
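The `REMOVE`-action rule described for `BaseDefaultColumnHandler` can be sketched roughly as follows. This is an illustrative Python model under stated assumptions, not the handler's actual code: `columns_to_remove` is a hypothetical helper, and a segment is reduced to a map from column name to an "auto-generated" flag.

```python
from typing import Dict, List, Set


def columns_to_remove(segment_columns: Dict[str, bool],
                      schema_columns: Set[str],
                      reclaim_deleted_columns_on_reload: bool = False) -> List[str]:
    """Return the segment columns a reload would drop, in sorted order.

    segment_columns maps column name -> True if the column was auto-generated
    as a default column, False if it was ingested (simplified, hypothetical model).
    """
    removals = []
    for name, auto_generated in segment_columns.items():
        if name in schema_columns:
            continue  # still in the schema, nothing to remove
        # Auto-generated columns were already cleaned up before this change;
        # ingested columns are removed only when the flag is enabled.
        if auto_generated or reclaim_deleted_columns_on_reload:
            removals.append(name)
    return sorted(removals)


segment = {"id": False, "legacy": False, "added_default": True}
schema = {"id"}

print(columns_to_remove(segment, schema))        # only the auto-generated column
print(columns_to_remove(segment, schema, True))  # the ingested column as well
```

With the flag left at its default of `false`, the model reproduces the previous behavior, in which ingested column data stays in the segment.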
---

## Testing

- `./mvnw -pl pinot-spi -am -Dtest=SchemaTest -Dsurefire.failIfNoSpecifiedTests=false test`
- `./mvnw -pl pinot-controller -am -Dtest=PinotSchemaRestletResourceTest -Dsurefire.failIfNoSpecifiedTests=false test`
- `./mvnw -pl pinot-segment-local -am -Dtest=DefaultColumnHandlerTest,SegmentPreProcessorTest -Dsurefire.failIfNoSpecifiedTests=false test`
- `./mvnw -pl pinot-integration-tests -am -Dtest=OfflineClusterIntegrationTest#testSchemaColumnDeletion -Dsurefire.failIfNoSpecifiedTests=false test`
- `./mvnw spotless:apply -pl pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-tests`
- `./mvnw license:format -pl pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-tests`
- `./mvnw checkstyle:check -pl pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-tests`
- `./mvnw license:check -pl pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-tests`
- `git diff --check`

---

## Release Notes

* **New Schema-Update Option:** `PUT /schemas/{schemaName}` endpoints now accept an optional `allowColumnDeletion` query parameter (default: `false`). When `true`, columns omitted from the new schema are safely dropped, provided they are not actively referenced by any table configuration. Structural type and primary-key compatibility assertions remain strictly enforced.
* **New Table Indexing Configuration:** Added `indexingConfig.reclaimDeletedColumnsOnReload` (default: `false`). When activated, ingested columns omitted from the schema are physically wiped from segments upon reload to reclaim storage space.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
