cccs-jc opened a new issue, #14995:
URL: https://github.com/apache/iceberg/issues/14995

   ### Apache Iceberg version
   
   1.9.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   ## Description
   There appears to be an issue in Apache Iceberg's data file pruning when 
using the SQL `LIKE` operator containing escaped underscore (`_`) characters. 
Unlike when using the Spark `startswith` function or simpler `LIKE` patterns 
without underscores, the presence of escaped special characters in the `LIKE` 
expression leads to improper query optimizations. As a result, all data files 
are scanned instead of pruning irrelevant files.
   
   The appropriate file pruning occurs for queries using the Spark `startswith` 
function or `LIKE` expressions without escaped underscores but fails when 
underscores are escaped manually (with `\_`) or using an explicit escape 
character.
   
   ---
   
   ## Steps to Reproduce
   
   1. Create an Iceberg table (e.g., `prod_catalog.admin.inventory`) with a 
`TRUNCATE(128)` transformation on the `file_path` column:
       ```sql
       ALTER TABLE prod_catalog.admin.inventory SET TBLPROPERTIES 
('write.metadata.metrics.column.file_path'='truncate(128)');
       ```
   
   2. Run the following query using the Spark `startswith` operator:
       ```sql
       SELECT count(*)
       FROM prod_catalog.admin.inventory
       WHERE 
           loaded_at = DATE('2026-01-05') AND 
           startswith(file_path, 'warehouse/iceberg/good_facts/fact1_table/');
       ```
      - **Execution Plan Observation:** File pruning is successful. Only 11 
data files are scanned (`number of result data files: 11`).
   
   3. Replace `startswith` with the SQL `LIKE` operator:
       ```sql
       SELECT count(*)
       FROM prod_catalog.admin.inventory
       WHERE 
           loaded_at = DATE('2026-01-05') AND 
           file_path LIKE 'warehouse/iceberg/good_facts/fact1_table/%';
       ```
      - **Execution Plan Observation:** File pruning fails, and all data files 
are scanned.
   
   4. Escape special characters in the `LIKE` expression:
       - Using a backslash:
         ```sql
         SELECT count(*)
         FROM prod_catalog.admin.inventory
         WHERE 
             loaded_at = DATE('2026-01-05') AND 
             file_path LIKE 'warehouse/iceberg/good\_facts/fact1\_table/%';
         ```
       - Using an explicit escape character (`ESCAPE 'X'`):
         ```sql
         SELECT count(*)
         FROM prod_catalog.admin.inventory
         WHERE 
             loaded_at = DATE('2026-01-05') AND 
             file_path LIKE 'warehouse/iceberg/goodX_facts/fact1X_table/%' 
ESCAPE 'X';
         ```
      - **Execution Plan Observation:** In both cases, file pruning still 
fails, and all data files are scanned.
   
   5. Finally, remove the underscore to see what happens:
       ```sql
       SELECT count(*)
       FROM prod_catalog.admin.inventory
       WHERE 
           loaded_at = DATE('2026-01-05') AND 
           file_path LIKE 'warehouse/iceberg/goodfacts/fact1table/%';
       ```
      - **Execution Plan Observation:** File pruning works correctly. The 
`LIKE` operator is converted into `StartsWith` during the Filter phase, and 
only relevant data files are scanned.
   
   ---
   
   ## Expected Behavior
   - The presence of escaped underscores in the `LIKE` operator (via `\_` or 
`ESCAPE`) should not prevent Iceberg from optimizing data file pruning. Iceberg 
should recognize the `LIKE` pattern and convert it to an equivalent 
`StartsWith` expression for efficient pruning.
   
   ---
   
   ## Observed Behavior
   - Iceberg does not optimize file pruning when the `LIKE` operator includes 
escaped special characters like underscores (`_`), leading to all files being 
scanned. This behavior is inconsistent with both `startswith` and simpler 
`LIKE` patterns that do not contain escaped characters.
   
   ---
   
   ## Additional Information
   - When the pattern contains no underscores, Iceberg correctly converts the 
`LIKE` expression to a `StartsWith` during the Filter phase.
   - Example of an unoptimized filter condition:
    ```
   (3) Filter [codegen id : 1]
   Input [2]: [loaded_at#811, file_path#817]
   Condition : ((isnotnull(file_path#817) AND (loaded_at#811 = 2026-01-05)) AND 
file_path#817 LIKE warehouse/iceberg/goodX_facts/fact1X_table/  ESCAPE 'X')
    ```
   
   - Example of an optimized filter condition when underscores are removed:
   ```
   (3) Filter [codegen id : 1]
   Input [2]: [loaded_at#844, file_path#850]
   Condition : ((isnotnull(file_path#850) AND (loaded_at#844 = 2026-01-05)) AND 
StartsWith(file_path#850, warehouse/iceberg/goodfacts/fact1table/))
   ```
   
   
   
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [x] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to