(doris) branch master updated: [fix](fe) Reject invalid char filter replacement in tokenize (#64794)

airborne Wed, 01 Jul 2026 16:28:00 -0700

This is an automated email from the ASF dual-hosted git repository.

airborne12 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git



The following commit(s) were added to refs/heads/master by this push:
     new 43693081393 [fix](fe) Reject invalid char filter replacement in tokenize (#64794)
43693081393 is described below

commit 43693081393600544272858901686e421ab8af40
Author: hoshinojyunn <[email protected]>
AuthorDate: Thu Jul 2 07:27:39 2026 +0800

    [fix](fe) Reject invalid char filter replacement in tokenize (#64794)
    
    Issue Number: None
    
    Related PR: None
    
    Problem Summary: Doris FE accepted empty or multi-character
    char_filter_replacement values for inverted-index char_replace
    configuration. The BE char_replace implementation only supports
    replacing with a single byte, so invalid configurations could silently
    produce incorrect tokenize results. This change centralizes FE char
    filter validation, requires char_filter_replacement to be a single
    non-empty character when specified, and reuses the same validation for
    tokenize() analysis and inverted index property checks.
    
    Reject empty or multi-character char_filter_replacement values in
    tokenize() and inverted index property validation.
    
    - Test: FE unit test
    - ./run-fe-ut.sh --run org.apache.doris.analysis.InvertedIndexPropertiesTest,org.apache.doris.nereids.trees.expressions.functions.scalar.TokenizeTest
    - Regression test added but not run because
    regression-test/conf/regression-conf.groovy currently points to a
    user-configured external cluster
    - Behavior changed: Yes (invalid char_filter_replacement is now rejected
    during FE analysis/validation)
    - Does this need documentation: No
    
    ### What problem does this PR solve?
    
    Issue Number: close #xxx
    
    Related PR: #xxx
    
    Problem Summary:
    FE previously accepted invalid `char_filter_replacement` values in
    inverted-index `char_replace` configuration and `tokenize()` properties.
    This could pass analysis successfully but produce incorrect results in
    BE, because the BE `char_replace` implementation only supports a
    single-byte replacement.
    
    Two concrete examples are:
    
    1. Multi-character replacement:
    ```sql
    SELECT tokenize(
        'a.b.c',
        '"parser"="english","char_filter_type"="char_replace","char_filter_pattern"=".","char_filter_replacement"="xyz"'
    );
    ```
    Before this change, FE accepted the input, but the actual result was:
    
    [{"token":"axbxc"}]
    
    whereas the intuitively expected result of replacing `.` with `xyz` would be:
    
    [{"token":"axyzbxyzc"}]
    
    2. Empty replacement:
    ```sql
    SELECT tokenize(
        'a.b.c',
        '"parser"="english","char_filter_type"="char_replace","char_filter_pattern"=".","char_filter_replacement"=""'
    );
    ```
    Before this change, FE also accepted the input, but the actual result was:
    
    [{"token":"a"},{"token":"b"},{"token":"c"}]
    
    whereas the expected result of removing `.` would be:
    
    [{"token":"abc"}]
    
    The root cause is that FE did not validate char_filter_replacement strictly
    enough, while BE only handles single-byte replacement correctly. This PR fixes
    the issue by centralizing char filter validation in FE and reusing it from both
    inverted index property validation and
    Tokenize.checkLegalityBeforeTypeCoercion().
    
    After this change, FE rejects char_filter_replacement unless it is a single
    non-empty character, preventing these invalid configurations from reaching BE.
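    
    The rule described above can be sketched in isolation. This is a minimal,
    self-contained illustration and not the Doris implementation: the class name
    CharFilterReplacementSketch and the method validateReplacement are
    hypothetical, and only the two error messages are taken from this change.
    It also shows why counting characters alone does not guarantee a single
    byte: U+00E9 is one Java char but two bytes in UTF-8, which is presumably
    why the diff additionally requires ASCII.
    
    ```java
    import java.nio.charset.StandardCharsets;
    
    public class CharFilterReplacementSketch {
    
        // Hypothetical helper: returns null when the replacement is acceptable,
        // otherwise the reason it is rejected.
        public static String validateReplacement(String replacement) {
            if (replacement == null) {
                return null; // property not specified, nothing to validate
            }
            if (replacement.length() != 1) {
                // covers both the empty string and multi-character values
                return "'char_filter_replacement' must be a single non-empty character";
            }
            if (replacement.charAt(0) > 0x7F) {
                // one Java char, but more than one byte once encoded as UTF-8
                return "'char_filter_replacement' must contain only ASCII characters";
            }
            return null;
        }
    
        // Number of bytes the value occupies when encoded as UTF-8.
        public static int utf8Length(String value) {
            return value.getBytes(StandardCharsets.UTF_8).length;
        }
    
        public static void main(String[] args) {
            for (String candidate : new String[] {"_", "", "xyz", "\u00e9"}) {
                System.out.println("[" + candidate + "] utf8Bytes=" + utf8Length(candidate)
                        + " -> " + validateReplacement(candidate));
            }
        }
    }
    ```
    
    In this sketch "_" is the only candidate that passes; "" and "xyz" fail the
    length check, and "\u00e9" fails the ASCII check.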
    
    This PR also adds FE unit tests to cover:
    
    - the wrapped exception path in Tokenize.checkLegalityBeforeTypeCoercion()
    - every branch and every exception path in InvertedIndexUtil.checkCharFilterProperties()
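    
    The "wrapped exception path" mentioned above follows a common Java pattern:
    catch the specific checked exception first and rethrow it with its message
    and cause preserved, and let a broader catch turn any other failure into a
    generic error. The sketch below illustrates only that pattern. Every name in
    it (WrappedExceptionSketch, ValidationException, AnalysisError, validate,
    check) is hypothetical and stands in for the Doris classes.
    
    ```java
    public class WrappedExceptionSketch {
    
        // Hypothetical stand-in for a checked validation exception.
        public static class ValidationException extends Exception {
            public ValidationException(String message) {
                super(message);
            }
        }
    
        // Hypothetical stand-in for an unchecked analysis exception.
        public static class AnalysisError extends RuntimeException {
            public AnalysisError(String message) {
                super(message);
            }
    
            public AnalysisError(String message, Throwable cause) {
                super(message, cause);
            }
        }
    
        // Hypothetical validator: accepts only a one-character replacement.
        static void validate(String replacement) throws ValidationException {
            if (replacement.length() != 1) {
                throw new ValidationException("replacement must be a single non-empty character");
            }
        }
    
        public static void check(String replacement) {
            try {
                validate(replacement);
            } catch (ValidationException e) {
                // Specific failure: keep the message and attach the original as the cause.
                throw new AnalysisError(e.getMessage(), e);
            } catch (Throwable e) {
                // Anything else (here, a NullPointerException for null input)
                // collapses into a generic message without a cause.
                throw new AnalysisError("second argument must be properties format");
            }
        }
    }
    ```
    
    Because the specific catch comes first, a caller of check("xyz") sees the
    validator's own message and can inspect getCause(), while check(null) yields
    only the generic message.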
---
 .../apache/doris/analysis/InvertedIndexUtil.java   |  53 +++++----
 .../CharReplaceCharFilterValidator.java            |  30 +++---
 .../expressions/functions/scalar/Tokenize.java     |   5 +-
 .../analysis/InvertedIndexPropertiesTest.java      | 120 +++++++++++++++++++++
 .../doris/indexpolicy/PolicyValidatorTests.java    |  14 +++
 .../expressions/functions/scalar/TokenizeTest.java |  85 +++++++++++++++
 .../analyzer/test_custom_analyzer1.groovy          |  15 ++-
 .../inverted_index_p0/test_properties.groovy       |  43 ++++++++
 .../suites/inverted_index_p0/test_tokenize.groovy  |  12 +++
 9 files changed, 337 insertions(+), 40 deletions(-)

diff --git a/fe/fe-core/src/main/java/org/apache/doris/analysis/InvertedIndexUtil.java b/fe/fe-core/src/main/java/org/apache/doris/analysis/InvertedIndexUtil.java
index 54129adf81b..449e0333066 100644
--- a/fe/fe-core/src/main/java/org/apache/doris/analysis/InvertedIndexUtil.java
+++ b/fe/fe-core/src/main/java/org/apache/doris/analysis/InvertedIndexUtil.java
@@ -134,15 +134,43 @@ public class InvertedIndexUtil {
         }
     }
 
-    private static boolean isSingleByte(String str) {
+    private static boolean isAscii(String str) {
         for (int i = 0; i < str.length(); i++) {
-            if (str.charAt(i) > 0xFF) {
+            if (str.charAt(i) > 0x7F) {
                 return false;
             }
         }
         return true;
     }
 
+    public static void checkCharFilterProperties(Map<String, String> properties) throws AnalysisException {
+        String charFilterType = properties.get(INVERTED_INDEX_PARSER_CHAR_FILTER_TYPE);
+        if (charFilterType == null) {
+            return;
+        }
+
+        String charFilterPattern = properties.get(INVERTED_INDEX_PARSER_CHAR_FILTER_PATTERN);
+        String charFilterReplacement = properties.get(INVERTED_INDEX_PARSER_CHAR_FILTER_REPLACEMENT);
+        if (!INVERTED_INDEX_CHAR_FILTER_CHAR_REPLACE.equals(charFilterType)) {
+            throw new AnalysisException("Invalid 'char_filter_type', only '"
+                    + INVERTED_INDEX_CHAR_FILTER_CHAR_REPLACE + "' is supported");
+        }
+        if (charFilterPattern == null || charFilterPattern.isEmpty()) {
+            throw new AnalysisException("Missing 'char_filter_pattern' for 'char_replace' filter type");
+        }
+        if (!isAscii(charFilterPattern)) {
+            throw new AnalysisException("'char_filter_pattern' must contain only ASCII characters");
+        }
+        if (charFilterReplacement != null) {
+            if (charFilterReplacement.isEmpty() || charFilterReplacement.length() != 1) {
+                throw new AnalysisException("'char_filter_replacement' must be a single non-empty character");
+            }
+            if (!isAscii(charFilterReplacement)) {
+                throw new AnalysisException("'char_filter_replacement' must contain only ASCII characters");
+            }
+        }
+    }
+
     private static void checkInvertedIndexProperties(Map<String, String> properties, PrimitiveType colType,
             TInvertedIndexFileStorageFormat invertedIndexFileStorageFormat) throws AnalysisException {
         Set<String> allowedKeys = new HashSet<>(Arrays.asList(
@@ -174,9 +202,6 @@ public class InvertedIndexUtil {
         }
         String parserMode = properties.get(INVERTED_INDEX_PARSER_MODE_KEY);
         String supportPhrase = properties.get(INVERTED_INDEX_SUPPORT_PHRASE_KEY);
-        String charFilterType = properties.get(INVERTED_INDEX_PARSER_CHAR_FILTER_TYPE);
-        String charFilterPattern = properties.get(INVERTED_INDEX_PARSER_CHAR_FILTER_PATTERN);
-        String charFilterReplacement = properties.get(INVERTED_INDEX_PARSER_CHAR_FILTER_REPLACEMENT);
         String ignoreAbove = properties.get(INVERTED_INDEX_PARSER_IGNORE_ABOVE_KEY);
         String lowerCase = properties.get(INVERTED_INDEX_PARSER_LOWERCASE_KEY);
         String stopWords = properties.get(INVERTED_INDEX_PARSER_STOPWORDS_KEY);
@@ -233,23 +258,7 @@ public class InvertedIndexUtil {
                     + ", support_phrase must be true or false");
         }
 
-        if (charFilterType != null) {
-            if (!INVERTED_INDEX_CHAR_FILTER_CHAR_REPLACE.equals(charFilterType)) {
-                throw new AnalysisException("Invalid 'char_filter_type', only '"
-                    + INVERTED_INDEX_CHAR_FILTER_CHAR_REPLACE + "' is supported");
-            }
-            if (charFilterPattern == null || charFilterPattern.isEmpty()) {
-                throw new AnalysisException("Missing 'char_filter_pattern' for 'char_replace' filter type");
-            }
-            if (!isSingleByte(charFilterPattern)) {
-                throw new AnalysisException("'char_filter_pattern' must contain only ASCII characters");
-            }
-            if (charFilterReplacement != null && !charFilterReplacement.isEmpty()) {
-                if (!isSingleByte(charFilterReplacement)) {
-                    throw new AnalysisException("'char_filter_replacement' must contain only ASCII characters");
-                }
-            }
-        }
+        checkCharFilterProperties(properties);
 
         if (ignoreAbove != null) {
             try {
diff --git a/fe/fe-core/src/main/java/org/apache/doris/indexpolicy/CharReplaceCharFilterValidator.java b/fe/fe-core/src/main/java/org/apache/doris/indexpolicy/CharReplaceCharFilterValidator.java
index 2e7fe15b2a2..73f29a050c0 100644
--- a/fe/fe-core/src/main/java/org/apache/doris/indexpolicy/CharReplaceCharFilterValidator.java
+++ b/fe/fe-core/src/main/java/org/apache/doris/indexpolicy/CharReplaceCharFilterValidator.java
@@ -17,10 +17,13 @@
 
 package org.apache.doris.indexpolicy;
 
+import org.apache.doris.analysis.InvertedIndexUtil;
+import org.apache.doris.common.AnalysisException;
 import org.apache.doris.common.DdlException;
 
 import com.google.common.collect.ImmutableSet;
 
+import java.util.HashMap;
 import java.util.Map;
 import java.util.Set;
 
@@ -39,25 +42,20 @@ public class CharReplaceCharFilterValidator extends BasePolicyValidator {
 
     @Override
     protected void validateSpecific(Map<String, String> props) throws DdlException {
+        Map<String, String> charFilterProperties = new HashMap<>();
+        charFilterProperties.put(InvertedIndexUtil.INVERTED_INDEX_PARSER_CHAR_FILTER_TYPE,
+                InvertedIndexUtil.INVERTED_INDEX_CHAR_FILTER_CHAR_REPLACE);
         if (props.containsKey("pattern")) {
-            String pattern = props.get("pattern");
-            if (pattern != null && !pattern.isEmpty()) {
-                for (int i = 0; i < pattern.length(); i++) {
-                    if (pattern.charAt(i) > 255) {
-                        throw new DdlException(
-                                "pattern must contain only single-byte characters in [0,255]");
-                    }
-                }
-            }
+            charFilterProperties.put(InvertedIndexUtil.INVERTED_INDEX_PARSER_CHAR_FILTER_PATTERN, props.get("pattern"));
         }
         if (props.containsKey("replacement")) {
-            String replacement = props.get("replacement");
-            if (replacement == null || replacement.length() != 1) {
-                throw new DdlException("replacement must be exactly one byte");
-            }
-            if (replacement.charAt(0) > 255) {
-                throw new DdlException("replacement must be in [0,255]");
-            }
+            charFilterProperties.put(InvertedIndexUtil.INVERTED_INDEX_PARSER_CHAR_FILTER_REPLACEMENT,
+                    props.get("replacement"));
+        }
+        try {
+            InvertedIndexUtil.checkCharFilterProperties(charFilterProperties);
+        } catch (AnalysisException e) {
+            throw new DdlException(e.getMessage(), e);
         }
     }
 }
diff --git a/fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/Tokenize.java b/fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/Tokenize.java
index 634e8606710..0f43ab6d824 100644
--- a/fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/Tokenize.java
+++ b/fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/Tokenize.java
@@ -17,6 +17,7 @@
 
 package org.apache.doris.nereids.trees.expressions.functions.scalar;
 
+import org.apache.doris.analysis.InvertedIndexUtil;
 import org.apache.doris.catalog.FunctionSignature;
 import org.apache.doris.nereids.exceptions.AnalysisException;
 import org.apache.doris.nereids.parser.NereidsParser;
@@ -72,7 +73,9 @@ public class Tokenize extends ScalarFunction
             return;
         }
         try {
-            new NereidsParser().parseProperties(properties);
+            InvertedIndexUtil.checkCharFilterProperties(new NereidsParser().parseProperties(properties));
+        } catch (org.apache.doris.common.AnalysisException e) {
+            throw new AnalysisException(e.getMessage(), e);
         } catch (Throwable e) {
             throw new AnalysisException("tokenize second argument must be properties format");
         }
diff --git a/fe/fe-core/src/test/java/org/apache/doris/analysis/InvertedIndexPropertiesTest.java b/fe/fe-core/src/test/java/org/apache/doris/analysis/InvertedIndexPropertiesTest.java
index ce9376152e4..b2e2e01e278 100644
--- a/fe/fe-core/src/test/java/org/apache/doris/analysis/InvertedIndexPropertiesTest.java
+++ b/fe/fe-core/src/test/java/org/apache/doris/analysis/InvertedIndexPropertiesTest.java
@@ -17,6 +17,10 @@
 
 package org.apache.doris.analysis;
 
+import org.apache.doris.catalog.PrimitiveType;
+import org.apache.doris.common.AnalysisException;
+import org.apache.doris.thrift.TInvertedIndexFileStorageFormat;
+
 import org.junit.jupiter.api.Assertions;
 import org.junit.jupiter.api.Test;
 
@@ -25,6 +29,12 @@ import java.util.Map;
 
 public class InvertedIndexPropertiesTest {
 
+    private static void assertCheckCharFilterPropertiesThrows(Map<String, String> props, String expectedMessage) {
+        AnalysisException exception = Assertions.assertThrows(AnalysisException.class,
+                () -> InvertedIndexUtil.checkCharFilterProperties(props));
+        Assertions.assertTrue(exception.getMessage().contains(expectedMessage), exception.getMessage());
+    }
+
     // --- getInvertedIndexParser ---
 
     @Test
@@ -223,6 +233,116 @@ public class InvertedIndexPropertiesTest {
         Assertions.assertEquals("_", result.get("char_filter_replacement"));
     }
 
+    @Test
+    public void testCheckCharFilterPropertiesRejectsEmptyReplacement() {
+        Map<String, String> props = new HashMap<>();
+        props.put("char_filter_type", "char_replace");
+        props.put("char_filter_pattern", ".");
+        props.put("char_filter_replacement", "");
+        assertCheckCharFilterPropertiesThrows(props, "'char_filter_replacement' must be a single non-empty character");
+    }
+
+    @Test
+    public void testCheckCharFilterPropertiesRejectsMultiCharReplacement() {
+        Map<String, String> props = new HashMap<>();
+        props.put("char_filter_type", "char_replace");
+        props.put("char_filter_pattern", ".");
+        props.put("char_filter_replacement", "xyz");
+        assertCheckCharFilterPropertiesThrows(props, "'char_filter_replacement' must be a single non-empty character");
+    }
+
+    @Test
+    public void testCheckCharFilterPropertiesAllowsMissingType() {
+        Assertions.assertDoesNotThrow(() -> InvertedIndexUtil.checkCharFilterProperties(new HashMap<>()));
+    }
+
+    @Test
+    public void testCheckCharFilterPropertiesRejectsInvalidType() {
+        Map<String, String> props = new HashMap<>();
+        props.put("char_filter_type", "invalid");
+        assertCheckCharFilterPropertiesThrows(props, "Invalid 'char_filter_type'");
+    }
+
+    @Test
+    public void testCheckCharFilterPropertiesRejectsMissingPattern() {
+        Map<String, String> props = new HashMap<>();
+        props.put("char_filter_type", "char_replace");
+        assertCheckCharFilterPropertiesThrows(props, "Missing 'char_filter_pattern' for 'char_replace' filter type");
+    }
+
+    @Test
+    public void testCheckCharFilterPropertiesRejectsEmptyPattern() {
+        Map<String, String> props = new HashMap<>();
+        props.put("char_filter_type", "char_replace");
+        props.put("char_filter_pattern", "");
+        assertCheckCharFilterPropertiesThrows(props, "Missing 'char_filter_pattern' for 'char_replace' filter type");
+    }
+
+    @Test
+    public void testCheckCharFilterPropertiesRejectsNonAsciiPattern() {
+        Map<String, String> props = new HashMap<>();
+        props.put("char_filter_type", "char_replace");
+        props.put("char_filter_pattern", "中");
+        assertCheckCharFilterPropertiesThrows(props, "'char_filter_pattern' must contain only ASCII characters");
+    }
+
+    @Test
+    public void testCheckCharFilterPropertiesRejectsLatin1Pattern() {
+        Map<String, String> props = new HashMap<>();
+        props.put("char_filter_type", "char_replace");
+        props.put("char_filter_pattern", "é");
+        assertCheckCharFilterPropertiesThrows(props, "'char_filter_pattern' must contain only ASCII characters");
+    }
+
+    @Test
+    public void testCheckCharFilterPropertiesAllowsNullReplacement() {
+        Map<String, String> props = new HashMap<>();
+        props.put("char_filter_type", "char_replace");
+        props.put("char_filter_pattern", ".");
+
+        Assertions.assertDoesNotThrow(() -> InvertedIndexUtil.checkCharFilterProperties(props));
+    }
+
+    @Test
+    public void testCheckCharFilterPropertiesRejectsNonAsciiReplacement() {
+        Map<String, String> props = new HashMap<>();
+        props.put("char_filter_type", "char_replace");
+        props.put("char_filter_pattern", ".");
+        props.put("char_filter_replacement", "中");
+        assertCheckCharFilterPropertiesThrows(props, "'char_filter_replacement' must contain only ASCII characters");
+    }
+
+    @Test
+    public void testCheckCharFilterPropertiesRejectsLatin1Replacement() {
+        Map<String, String> props = new HashMap<>();
+        props.put("char_filter_type", "char_replace");
+        props.put("char_filter_pattern", ".");
+        props.put("char_filter_replacement", "é");
+        assertCheckCharFilterPropertiesThrows(props, "'char_filter_replacement' must contain only ASCII characters");
+    }
+
+    @Test
+    public void testCheckCharFilterPropertiesAllowsSingleAsciiReplacement() {
+        Map<String, String> props = new HashMap<>();
+        props.put("char_filter_type", "char_replace");
+        props.put("char_filter_pattern", ".");
+        props.put("char_filter_replacement", "_");
+
+        Assertions.assertDoesNotThrow(() -> InvertedIndexUtil.checkCharFilterProperties(props));
+    }
+
+    @Test
+    public void testCheckInvertedIndexParserAllowsDotCharFilterPattern() {
+        Map<String, String> props = new HashMap<>();
+        props.put("parser", "english");
+        props.put("char_filter_type", "char_replace");
+        props.put("char_filter_pattern", ".");
+        props.put("char_filter_replacement", "_");
+
+        Assertions.assertDoesNotThrow(() -> InvertedIndexUtil.checkInvertedIndexParser("c",
+                PrimitiveType.VARCHAR, props, TInvertedIndexFileStorageFormat.V2));
+    }
+
     // --- buildAnalyzerSqlFragment (migrated from InvertedIndexSqlGeneratorTest) ---
 
     @Test
diff --git a/fe/fe-core/src/test/java/org/apache/doris/indexpolicy/PolicyValidatorTests.java b/fe/fe-core/src/test/java/org/apache/doris/indexpolicy/PolicyValidatorTests.java
index e48432bd98f..60ac8a68ff6 100644
--- a/fe/fe-core/src/test/java/org/apache/doris/indexpolicy/PolicyValidatorTests.java
+++ b/fe/fe-core/src/test/java/org/apache/doris/indexpolicy/PolicyValidatorTests.java
@@ -220,4 +220,18 @@ public class PolicyValidatorTests {
                 () -> validator.validate(props));
         Assertions.assertTrue(exception.getMessage().contains("enclosed in square brackets"));
     }
+
+    @Test
+    public void testCharReplaceCharFilterValidator_RejectsNonAsciiReplacement() {
+        CharReplaceCharFilterValidator validator = new CharReplaceCharFilterValidator();
+        Map<String, String> props = new HashMap<>();
+        props.put("type", "char_replace");
+        props.put("pattern", ".");
+        props.put("replacement", "é");
+
+        Exception exception = Assertions.assertThrows(DdlException.class,
+                () -> validator.validate(props));
+        Assertions.assertTrue(exception.getMessage()
+                .contains("'char_filter_replacement' must contain only ASCII characters"));
+    }
 }
diff --git a/fe/fe-core/src/test/java/org/apache/doris/nereids/trees/expressions/functions/scalar/TokenizeTest.java b/fe/fe-core/src/test/java/org/apache/doris/nereids/trees/expressions/functions/scalar/TokenizeTest.java
new file mode 100644
index 00000000000..e3f17daa841
--- /dev/null
+++ b/fe/fe-core/src/test/java/org/apache/doris/nereids/trees/expressions/functions/scalar/TokenizeTest.java
@@ -0,0 +1,85 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.nereids.trees.expressions.functions.scalar;
+
+import org.apache.doris.nereids.exceptions.AnalysisException;
+import org.apache.doris.nereids.trees.expressions.literal.StringLiteral;
+import org.apache.doris.utframe.TestWithFeService;
+
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+public class TokenizeTest extends TestWithFeService {
+
+    @Test
+    public void testTokenizeAcceptsValidCharFilterReplacement() {
+        Tokenize tokenize = new Tokenize(new StringLiteral("a.b.c"),
+                new StringLiteral("\"parser\"=\"english\",\"char_filter_type\"=\"char_replace\","
+                        + "\"char_filter_pattern\"=\".\",\"char_filter_replacement\"=\"_\""));
+
+        Assertions.assertDoesNotThrow(tokenize::checkLegalityBeforeTypeCoercion);
+    }
+
+    @Test
+    public void testTokenizeRejectsEmptyCharFilterReplacement() {
+        Tokenize tokenize = new Tokenize(new StringLiteral("a.b.c"),
+                new StringLiteral("\"parser\"=\"english\",\"char_filter_type\"=\"char_replace\","
+                        + "\"char_filter_pattern\"=\".\",\"char_filter_replacement\"=\"\""));
+        AnalysisException exception = Assertions.assertThrows(AnalysisException.class,
+                tokenize::checkLegalityBeforeTypeCoercion);
+        Assertions.assertTrue(exception.getMessage()
+                        .contains("'char_filter_replacement' must be a single non-empty character"),
+                exception.getMessage());
+        Assertions.assertInstanceOf(org.apache.doris.common.AnalysisException.class, exception.getCause());
+    }
+
+    @Test
+    public void testTokenizeRejectsMultiCharFilterReplacement() {
+        Tokenize tokenize = new Tokenize(new StringLiteral("a.b.c"),
+                new StringLiteral("\"parser\"=\"english\",\"char_filter_type\"=\"char_replace\","
+                        + "\"char_filter_pattern\"=\".\",\"char_filter_replacement\"=\"xyz\""));
+        AnalysisException exception = Assertions.assertThrows(AnalysisException.class,
+                tokenize::checkLegalityBeforeTypeCoercion);
+        Assertions.assertTrue(exception.getMessage()
+                        .contains("'char_filter_replacement' must be a single non-empty character"),
+                exception.getMessage());
+        Assertions.assertInstanceOf(org.apache.doris.common.AnalysisException.class, exception.getCause());
+    }
+
+    @Test
+    public void testTokenizeRejectsLatin1CharFilterReplacement() {
+        Tokenize tokenize = new Tokenize(new StringLiteral("a.b.c"),
+                new StringLiteral("\"parser\"=\"english\",\"char_filter_type\"=\"char_replace\","
+                        + "\"char_filter_pattern\"=\".\",\"char_filter_replacement\"=\"é\""));
+        AnalysisException exception = Assertions.assertThrows(AnalysisException.class,
+                tokenize::checkLegalityBeforeTypeCoercion);
+        Assertions.assertTrue(exception.getMessage()
+                        .contains("'char_filter_replacement' must contain only ASCII characters"),
+                exception.getMessage());
+        Assertions.assertInstanceOf(org.apache.doris.common.AnalysisException.class, exception.getCause());
+    }
+
+    @Test
+    public void testTokenizeRejectsMalformedProperties() {
+        Tokenize tokenize = new Tokenize(new StringLiteral("a.b.c"), new StringLiteral("not_a_property"));
+
+        AnalysisException exception = Assertions.assertThrows(AnalysisException.class,
+                tokenize::checkLegalityBeforeTypeCoercion);
+        Assertions.assertEquals("tokenize second argument must be properties format", exception.getMessage());
+    }
+}
diff --git a/regression-test/suites/inverted_index_p0/analyzer/test_custom_analyzer1.groovy b/regression-test/suites/inverted_index_p0/analyzer/test_custom_analyzer1.groovy
index 71f56afb7df..1f40f3c42d4 100644
--- a/regression-test/suites/inverted_index_p0/analyzer/test_custom_analyzer1.groovy
+++ b/regression-test/suites/inverted_index_p0/analyzer/test_custom_analyzer1.groovy
@@ -18,6 +18,19 @@
 import java.sql.SQLException
 
 suite("test_custom_analyzer1", "p0") {
+    test {
+        sql """
+            CREATE INVERTED INDEX CHAR_FILTER invalid_non_ascii_replacement_char_filter_custom_analyzer
+            PROPERTIES
+            (
+                "type" = "char_replace",
+                "pattern" = ".",
+                "replacement" = "é"
+            );
+        """
+        exception "'char_filter_replacement' must contain only ASCII characters"
+    }
+
     sql """
         CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS word_splitter_all
         PROPERTIES
@@ -114,4 +127,4 @@ suite("test_custom_analyzer1", "p0") {
         qt_sql """ select * from test_custom_analyzer2 where ch match 'bg'; """
     } finally {
     }
-}
\ No newline at end of file
+}
diff --git a/regression-test/suites/inverted_index_p0/test_properties.groovy b/regression-test/suites/inverted_index_p0/test_properties.groovy
index 16c255be72d..57245fa9873 100644
--- a/regression-test/suites/inverted_index_p0/test_properties.groovy
+++ b/regression-test/suites/inverted_index_p0/test_properties.groovy
@@ -119,6 +119,49 @@ suite("test_properties", "p0"){
     create_table_with_inverted_index_properties(missing_char_filter_pattern, "Missing 'char_filter_pattern' for 'char_replace' filter type")
     assertEquals(success, false)
 
+    def valid_dot_char_filter_pattern = """
+        CREATE TABLE IF NOT EXISTS ${indexTblName}(
+            `id` int(11) NULL,
+            `c` text NULL,
+            INDEX c_idx(`c`) USING INVERTED PROPERTIES(
+                "parser"="english",
+                "char_filter_type"="char_replace",
+                "char_filter_pattern"=".",
+                "char_filter_replacement"="_"
+            ) COMMENT ''
+        ) ENGINE=OLAP
+        DUPLICATE KEY(`id`)
+        COMMENT 'OLAP'
+        DISTRIBUTED BY HASH(`id`) BUCKETS 1
+        PROPERTIES(
+            "replication_allocation" = "tag.location.default: 1"
+        );
+    """
+    create_table_with_inverted_index_properties(valid_dot_char_filter_pattern, "")
+    assertEquals(success, true)
+
+    def non_ascii_char_filter_replacement = """
+        CREATE TABLE IF NOT EXISTS ${indexTblName}(
+            `id` int(11) NULL,
+            `c` text NULL,
+            INDEX c_idx(`c`) USING INVERTED PROPERTIES(
+                "parser"="english",
+                "char_filter_type"="char_replace",
+                "char_filter_pattern"=".",
+                "char_filter_replacement"="é"
+            ) COMMENT ''
+        ) ENGINE=OLAP
+        DUPLICATE KEY(`id`)
+        COMMENT 'OLAP'
+        DISTRIBUTED BY HASH(`id`) BUCKETS 1
+        PROPERTIES(
+            "replication_allocation" = "tag.location.default: 1"
+        );
+    """
+    create_table_with_inverted_index_properties(non_ascii_char_filter_replacement,
+            "'char_filter_replacement' must contain only ASCII characters")
+    assertEquals(success, false)
+
     def invalid_property_key = """
         CREATE TABLE IF NOT EXISTS ${indexTblName}(
             `id` int(11) NULL,
diff --git a/regression-test/suites/inverted_index_p0/test_tokenize.groovy b/regression-test/suites/inverted_index_p0/test_tokenize.groovy
index d0bdada2e31..f57022ee3b4 100644
--- a/regression-test/suites/inverted_index_p0/test_tokenize.groovy
+++ b/regression-test/suites/inverted_index_p0/test_tokenize.groovy
@@ -96,6 +96,18 @@ suite("test_tokenize"){
 
     qt_tokenize_sql """SELECT TOKENIZE('GET /images/hm_bg.jpg HTTP/1.0 test:abc=bcd','"parser"="unicode","char_filter_type" = "char_replace","char_filter_pattern" = "._=:,","char_filter_replacement" = " "');"""
     qt_tokenize_sql """SELECT TOKENIZE('GET /images/hm_bg.jpg HTTP/1.0 test:abc=bcd', '"parser"="unicode","char_filter_type" = "char_replace", "char_filter_pattern" = "._=:,", "char_filter_replacement" = " "');"""
+    test {
+        sql """SELECT TOKENIZE('a.b.c', '"parser"="english","char_filter_type"="char_replace","char_filter_pattern"=".","char_filter_replacement"="xyz"');"""
+        exception "'char_filter_replacement' must be a single non-empty character"
+    }
+    test {
+        sql """SELECT TOKENIZE('a.b.c', '"parser"="english","char_filter_type"="char_replace","char_filter_pattern"=".","char_filter_replacement"=""');"""
+        exception "'char_filter_replacement' must be a single non-empty character"
+    }
+    test {
+        sql """SELECT TOKENIZE('a.b.c', '"parser"="english","char_filter_type"="char_replace","char_filter_pattern"=".","char_filter_replacement"="é"');"""
+        exception "'char_filter_replacement' must contain only ASCII characters"
+    }
 
     qt_tokenize_sql """SELECT TOKENIZE('华夏智胜新税股票A', '"parser"="unicode"');"""
     qt_tokenize_sql """SELECT TOKENIZE('华夏智胜新税股票A', '"parser"="unicode","stopwords" = "none"');"""


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
