aminghadersohi opened a new pull request, #36150:
URL: https://github.com/apache/superset/pull/36150
### SUMMARY
This PR moves datetime format detection from query-time to dataset
configuration time, addressing performance concerns and log noise issues raised
in PR #35042.
**What changed:**
- Added `datetime_format` column to the `table_columns` database table to
persist detected formats
- Created `DatetimeFormatDetector` service class that samples column data
and detects datetime formats during dataset creation/refresh
- Updated `RefreshDatasetCommand` to automatically detect and store formats
when datasets are synced
- Modified `normalize_dttm_col()` utility to accept pre-detected format
mappings, eliminating runtime detection
- Added REST API endpoint `POST
/api/v1/dataset/<pk>/detect_datetime_formats` for manual format detection
- Included configuration options: `DATASET_AUTO_DETECT_DATETIME_FORMATS` and
`DATETIME_FORMAT_DETECTION_SAMPLE_SIZE`
**Why this matters:**
- **Performance**: Eliminates repeated format detection on every query
execution
- **Reliability**: Provides consistent datetime parsing across all queries
for the same column
- **Maintainability**: Reduces "Could not infer format" warnings that spam
application logs
- **User Control**: Allows manual format override and visibility into
detected formats
**Technical approach:**
- Detection happens once during dataset metadata refresh (existing workflow)
- Falls back to runtime detection if no format is stored (backward
compatible)
- Uses SQLAlchemy's `quoted_name` for safe SQL identifier handling
- Comprehensive test coverage with 7 unit tests
### BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
N/A - Backend-only change
### TESTING INSTRUCTIONS
1. **Set up test database with datetime columns:**
```sql
CREATE TABLE test_dates (
id INT,
date_iso VARCHAR(20), -- stores "2023-01-15"
date_us VARCHAR(20), -- stores "01/15/2023"
timestamp_col VARCHAR(30) -- stores "2023-01-15 14:30:00"
);
INSERT INTO test_dates VALUES
(1, '2023-01-15', '01/15/2023', '2023-01-15 14:30:00'),
(2, '2023-02-20', '02/20/2023', '2023-02-20 16:45:00');
```
2. **Add dataset to Superset:**
- Navigate to Data → Datasets → + Dataset
- Select the database and `test_dates` table
- Click "Create Dataset and Create Chart"
3. **Verify format detection:**
- Check application logs - should see "Detected format" messages
- Query the database to verify formats were stored:
```sql
SELECT column_name, datetime_format, is_dttm
FROM table_columns
WHERE table_id = (SELECT id FROM tables WHERE table_name =
'test_dates');
```
4. **Test manual detection API:**
```bash
curl -X POST
http://localhost:8088/api/v1/dataset/<dataset_id>/detect_datetime_formats?force=true
\
-H "Authorization: Bearer <token>"
```
5. **Verify query performance:**
- Run a query using the datetime columns
- Check logs - should NOT see format detection warnings during query
execution
- Verify datetime columns are parsed correctly in query results
6. **Test unit tests:**
```bash
pytest tests/unit_tests/datasets/test_datetime_format_detector.py -v
```
### ADDITIONAL INFORMATION
- [x] Includes DB Migration (follow approval process in
[SIP-59](https://github.com/apache/superset/issues/13351))
- [x] Migration is atomic, supports rollback & is backwards-compatible
- [x] Confirm DB migration upgrade and downgrade tested
- [x] Runtime estimates and downtime expectations provided:
- Migration adds a nullable column - near-instant on all databases
- No data migration required
- Zero downtime - existing queries continue to work
- [x] Introduces new feature or API
- New API endpoint: `POST /api/v1/dataset/<pk>/detect_datetime_formats`
- New config options: `DATASET_AUTO_DETECT_DATETIME_FORMATS`,
`DATETIME_FORMAT_DETECTION_SAMPLE_SIZE`
- [ ] Removes existing feature or API
- [ ] Changes UI
- [ ] Has associated issue
- [ ] Required feature flags
**Related work:**
- Builds on temporary fix from PR #35042
- Addresses Max Beauchemin's feedback about pandas being designed for REPL
use cases rather than production
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]