[I] ANTLR filter parser Unicode handling [directory-scimple]

via GitHub Thu, 04 Dec 2025 05:24:08 -0800


LohithkumarAV opened a new issue, #981:
URL: https://github.com/apache/directory-scimple/issues/981


   # SCIM Filter Parser Fails with Accented Characters
   
   ## Summary
   The Apache Directory SCIM filter parser fails to parse SCIM search requests 
when filter values contain accented/diacritic characters, returning `400 Bad 
Request` with error `"Unable to map or parse JSON to SCIM schema"`.
   
   ## Environment
   - **Library**: `org.apache.directory.scim:scim-spec`
   - **Parser**: ANTLR-based filter parser in 
`org.apache.directory.scim.spec.filter.Filter`
   - **Java Version**: 17+
   - **Affected Component**: `GroupService.find()` method calling 
`buildFilterTree(filter)`
   
   ## Impact
   - **Severity**: High
   - **Scope**: Blocks SCIM RFC 7644 Section 5 (Preparation and Comparison of 
Internationalized Strings) compliance
   - **Affected Operations**: All SCIM search operations with accented 
characters in filter values
   
   ## Steps to Reproduce
   
   ### 1. Create a group with accented characters
   **Request:**
   ```json
   POST /scim/v2/Groups
   {
     "schemas": ["urn:ietf:params:scim:schemas:core:2.0:Group"],
     "displayName": "José's Team"
   }
   ```
   
   **Response:** ✅ Success (201 Created)
   ```json
   {
     "id": "468b6df5-80aa-4c94-ab39-75e36172d859",
     "displayName": "José's Team"
   }
   ```
   
   ### 2. Search with exact accented characters
   **Request:**
   ```json
   POST /scim/v2/Groups/.search
   {
     "schemas": ["urn:ietf:params:scim:api:messages:2.0:SearchRequest"],
     "filter": "displayName eq \"José's Team\"",
     "startIndex": 1,
     "count": 10
   }
   ```
   
   **Response:** ❌ Failure (400 Bad Request)
   ```json
   {
     "status": 400,
     "scimType": "invalidSyntax",
     "error": "Unable to map or parse JSON to SCIM schema. Please check syntax and field types."
   }
   ```
   
   ### 3. Search without accents (workaround)
   **Request:**
   ```json
   POST /scim/v2/Groups/.search
   {
     "schemas": ["urn:ietf:params:scim:api:messages:2.0:SearchRequest"],
     "filter": "displayName eq \"Jose's Team\"",
     "startIndex": 1,
     "count": 10
   }
   ```
   
   **Response:** ✅ Success (200 OK). Returns the group with `displayName`: "José's Team"
   
   
   ## Test Results Matrix
   
   | Filter Value | Expected | Actual | Status |
   |-------------|----------|--------|--------|
   | "José's Team" | 200 OK | 400 Bad Request | ❌ FAIL |
   | "Jose's Team" | 200 OK | 200 OK | ✅ PASS |
   | "JOSÉ'S TEAM" | 200 OK | 400 Bad Request | ❌ FAIL |
   | "JOSE'S TEAM" | 200 OK | 200 OK | ✅ PASS |
   | "Müller's Gruppe" | 200 OK | 400 Bad Request | ❌ FAIL |
   | "Muller's Gruppe" | 200 OK | 200 OK | ✅ PASS |
   | "Café Équipe" | 200 OK | 400 Bad Request | ❌ FAIL |
   | "Cafe Equipe" | 200 OK | 200 OK | ✅ PASS |
   | "Ñoño's Tëäm" | 200 OK | 400 Bad Request | ❌ FAIL |
   | "Nono's Team" | 200 OK | 200 OK | ✅ PASS |
   | "Åse's Øverhead" | 200 OK | 400 Bad Request | ❌ FAIL |
   | "åse's øverhead" | 200 OK | 400 Bad Request | ❌ FAIL |
   
   
   ## Affected Character Sets
   
   The parser fails with:
   1. **Spanish accents**: José, JOSÉ
   2. **German umlauts**: Müller, MÜLLER
   3. **French accents**: Café, Équipe
   4. **Multiple diacritics**: Ñoño's Tëäm
   5. **Nordic characters**: Åse's Øverhead, åse's øverhead
   
   
   ## Root Cause Analysis
   
   The ANTLR grammar used by the SCIM filter parser appears to have issues 
tokenizing Unicode characters in the following contexts:
   
   1. **Accented characters combined with apostrophes**: José's, Müller's
   2. **Multiple diacritics**: Ñoño's Tëäm
   3. **Nordic characters**: Åse's Øverhead
   4. **French accents**: Café Équipe
   
   The parser likely treats these as invalid token sequences rather than valid 
string literals.
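
To make this hypothesis concrete, the following Java sketch uses two regular expressions as stand-ins for lexer rules. Both patterns and the class name `StringLiteralRules` are hypothetical and are not taken from the actual SCIMple ANTLR grammar. They only illustrate how an ASCII-only character range in a string-literal rule would reject the failing values while a Unicode-aware rule would accept them.

```java
import java.util.regex.Pattern;

/**
 * Illustration only: regex stand-ins for two hypothetical lexer rules.
 * Neither pattern is copied from the SCIMple grammar.
 */
final class StringLiteralRules {

    // Hypothetical ASCII-only rule: a quoted run of printable ASCII,
    // excluding the double quote (0x22) and the backslash (0x5C).
    static final Pattern ASCII_ONLY =
            Pattern.compile("\"[\\x20-\\x21\\x23-\\x5B\\x5D-\\x7E]*\"");

    // Hypothetical Unicode-aware rule: any character except the double
    // quote, the backslash, and ASCII control characters.
    static final Pattern UNICODE_AWARE =
            Pattern.compile("\"[^\"\\\\\\p{Cntrl}]*\"");

    /** Returns true when the whole quoted literal matches the given rule. */
    static boolean accepts(Pattern rule, String literal) {
        return rule.matcher(literal).matches();
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only:
        // \u00e9 is e-acute, \u00c5 is A-ring, \u00d8 is O-stroke.
        String[] samples = {
            "\"Jose's Team\"",
            "\"Jos\u00e9's Team\"",
            "\"\u00c5se's \u00d8verhead\""
        };
        for (String s : samples) {
            System.out.printf("%s ascii=%b unicode=%b%n",
                    s, accepts(ASCII_ONLY, s), accepts(UNICODE_AWARE, s));
        }
    }
}
```

If the real grammar restricts string contents in a similar way, the pattern in the test matrix (ASCII values pass, accented values fail) is what one would expect. Confirming this requires inspecting the grammar file itself.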
   
   
   ## Expected Behavior
   
   According to **RFC 7644 Section 3.4.2.2** (Filtering), a filter comparison 
value is a JSON string, so non-ASCII characters are valid in string literals. 
Section 5 of the same RFC covers preparation and comparison of internationalized 
strings.
   
   The filter parser should:
   1. Accept any valid Unicode characters in string literals
   2. Parse filter values containing accented characters without errors
   3. Allow the application layer to perform normalization for comparison
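
For context on point 3, here is a minimal sketch of what application-layer normalization could look like once the parser passes the value through. The class `DisplayNameMatcher` and its methods are hypothetical and are not part of SCIMple or of the reporter's application. It assumes the application wants accent-insensitive, case-insensitive matching and uses only `java.text.Normalizer` from the JDK.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical helper: accent- and case-insensitive comparison of display names. */
final class DisplayNameMatcher {

    /** Decomposes to NFD, removes combining marks, then lower-cases. */
    static String fold(String value) {
        // NFD splits a precomposed letter such as e-acute into "e" + U+0301.
        String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
        // \p{M} matches the combining marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    /** Returns true when both values are equal after folding. */
    static boolean matches(String stored, String filterValue) {
        return fold(stored).equals(fold(filterValue));
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute; escapes keep this source file ASCII-only.
        System.out.println(fold("Jos\u00e9's Team"));
        System.out.println(matches("Jos\u00e9's Team", "JOSE'S TEAM"));
    }
}
```

One limitation of this approach: letters without a canonical decomposition, such as Ø (U+00D8), are not folded to a base letter, so "Øverhead" would not match "Overhead" without an extra mapping step.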
   
   
   ## Actual Behavior
   
   The parser rejects filter values containing accented characters with a 
generic syntax error, preventing any normalization logic from executing.
   
   
   ## Code Flow Analysis
   
   The failure occurs **before** application code is reached:
   
   1. ❌ Apache Directory SCIM library receives HTTP POST with filter string
   2. ❌ ANTLR parser attempts to tokenize: `"displayName eq \"José's Team\""`
   3. ❌ Parser fails on accented characters → throws exception
   4. ❌ Returns 400 Bad Request
   5. ⛔ Application's `find(Filter filter, ...)` method **never called**
   6. ⛔ Custom normalization logic **never executes**
   
   **Proof**: When filter has no accents (`"Jose's Team"`), parsing succeeds 
and application-level normalization correctly matches groups with accented 
names.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

