stevedlawrence commented on code in PR #1106:
URL: https://github.com/apache/daffodil/pull/1106#discussion_r1376067919


##########
daffodil-lib/src/test/scala/org/apache/daffodil/lib/util/TestFixedICUDecimalFormat.java:
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.daffodil.lib.util;
+
+import com.ibm.icu.text.DecimalFormat;
+import com.ibm.icu.text.DecimalFormatSymbols;
+import org.junit.Test;
+import static junit.framework.TestCase.assertEquals;
+import static org.junit.Assert.*;
+
+import java.util.Locale;
+import java.util.Objects;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+/**
+ * Tests about ICU-22558
+ *
+ * https://unicode-org.atlassian.net/browse/ICU-22558
+ */
+public class TestFixedICUDecimalFormat {
+
+    // Create DecimalFormatSymbols for a Locale where the decimal separator is 
a dot
+    DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.US);
+
+    /**
+     * Illustrates the bug in ICU.
+     */
+    @Test
+    public void testDecimalFormatWithHash() {
+        // Create DecimalFormat instance with the pattern "#.##"
+        DecimalFormat decimalFormat = new DecimalFormat("#.##", symbols);

Review Comment:
   Seems like a patterm of `.##` gives the expected results
   
   ```scala
   scala> new com.ibm.icu.text.DecimalFormat(".##").format(0.12)
   res1: String = .12
   ```
   
   Is that a reasonable workaround?



##########
daffodil-lib/src/test/scala/org/apache/daffodil/lib/util/TestFixedICUDecimalFormat.java:
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.daffodil.lib.util;
+
+import com.ibm.icu.text.DecimalFormat;
+import com.ibm.icu.text.DecimalFormatSymbols;
+import org.junit.Test;
+import static junit.framework.TestCase.assertEquals;
+import static org.junit.Assert.*;
+
+import java.util.Locale;
+import java.util.Objects;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+/**
+ * Tests about ICU-22558
+ *
+ * https://unicode-org.atlassian.net/browse/ICU-22558
+ */
+public class TestFixedICUDecimalFormat {
+
+    // Create DecimalFormatSymbols for a Locale where the decimal separator is 
a dot
+    DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.US);
+
+    /**
+     * Illustrates the bug in ICU.
+     */
+    @Test
+    public void testDecimalFormatWithHash() {
+        // Create DecimalFormat instance with the pattern "#.##"
+        DecimalFormat decimalFormat = new DecimalFormat("#.##", symbols);
+
+        // Convert the number 0.12 to String
+        String formattedNumber = decimalFormat.format(0.12);
+
+        // This assert tests for the current behavior. (which is buggy)
+        assertEquals("0.12", formattedNumber); // Wrong answer: See ICU-22558
+        // Assert that the formatted string is ".12"
+        if (formattedNumber == ".12") {
+            // see https://unicode-org.atlassian.net/browse/ICU-22558
+            fail("Bug ICU-22558 has been fixed. This test set is no longer 
needed to illustrate the problem.");
+        }
+    }

Review Comment:
   Historically we've been putting tests that show ICU behavior here:
   
   
[daffodil-runtime1/src/test/scala/org/apache/daffodil/runtime1/processors/input/TestICU.scala](https://github.com/apache/daffodil/blob/main/daffodil-runtime1/src/test/scala/org/apache/daffodil/runtime1/processors/input/TestICU.scala)



##########
daffodil-lib/src/main/scala/org/apache/daffodil/lib/util/FixedICUDecimalFormat.java:
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.daffodil.lib.util;
+
+import com.ibm.icu.text.DecimalFormat;
+import com.ibm.icu.text.DecimalFormatSymbols;
+
+import java.util.Objects;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+public class FixedICUDecimalFormat {
+
+    /**
+     * This regex enables us to correct a problem in ICUs interpretation of 
the pattern in the case
+     * where the pattern has no digits before the decimal points, such as 
"#.##"
+     * so that there should be zero leading digits if the integer part is 0.
+     * <p>
+     *     See https://unicode-org.atlassian.net/browse/ICU-22558
+     * <p>
+     * It captures unescaped digits or '#' characters
+     * preceding a non-escaped decimal point in an ICU4J pattern string.
+     * <p>
+     * The whole regex (without java escapes) is {@code 
(?<!\\)([0-9#]*)(?<!\\)(\.) }
+     * <ul>
+     * <li>{@code (?<!\\)}: Negative lookbehind to ensure the character is not 
preceded by a backslash.</li>
+     * <li>{@code ([0-9#]*)}: Capture zero or more consecutive digits or '#' 
characters in group 1.</li>
+     * <li>{@code (?<!\\)(\.)}: The negative lookbehind ensures that only a 
non-escaped decimal point
+     * will be captured in group 2.</li>
+     * </ul>
+     *
+     * With this pattern:
+     * <ul>
+     * <li>Group 1 will capture any preceding unescaped digits or '#' 
characters.</li>
+     * <li>Group 2 will capture the non-escaped decimal point.</li>
+     * </ul>
+     *
+     * The match will fail only if there is no non-escaped decimal point.
+     *
+     * Package private for unit testing.
+     */
+    static Pattern pattern = 
Pattern.compile("(?<!\\\\)([0-9#]*)(?<!\\\\)(\\.)");
+
+    /**
+     * Construct a DecimalFormat, but correct for the ICU mis-interpretation 
of "#." where it
+     * mistakenly sets the minimum number of integer digits to 1, not 0.
+     *
+     * This is not fooled by things like "#5#.0" which is silly, but behaves 
like "0.0".
+     * <p>
+     * See https://unicode-org.atlassian.net/browse/ICU-22558
+     * </p>
+     *
+     * @param patternString - an ICU pattern for a DecimalFormat, to be 
created and fixed (if needed)
+     * @return a DecimalFormat properly initialized by the pattern.
+     */
+    public static DecimalFormat fromPattern(String patternString) {
+        DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(); // 
default locale
+        return fromPattern(patternString, symbols);
+    }
+    /**
+     * Package private helper version that takes symbols so that you can 
define a locale.
+     * This allows tests to not have to adjust for locale decimal points "." 
vs. ",",
+     * by specifying a locale. But it isn't needed for regular usage.
+     *
+     * @param patternString - ICU pattern to fix (if needed)
+     * @param symbols - used to specify locale (because decimal points can be 
"." or ","
+     * @return a DecimalFormat properly initialized by the pattern.
+     */
+    static DecimalFormat fromPattern(String patternString, 
DecimalFormatSymbols symbols) {
+        DecimalFormat decimalFormat = new DecimalFormat(patternString, 
symbols);
+        Matcher matcher = pattern.matcher(patternString);
+        if (matcher.find()) {
+            // There IS a decimal point.
+            String digits = matcher.group(1); // digits or # unescaped, before 
the decimal point
+            assert(Objects.equals(".", matcher.group(2))); // 2nd group 
matches the decimal point itself
+            long count = digits.chars().filter(Character::isDigit).count();
+            if (count == 0) {
+                // The decimal point is not preceded by any required digits.
+                // This is the fix ICU is getting wrong.
+                //
+                // Note even when ICU fixes this, this code won't break. It's 
just
+                // unnecessary.
+                //
+                //
+                decimalFormat.setMinimumIntegerDigits(0);
+            }
+        }
+        return decimalFormat;
+    }
+}

Review Comment:
   Does this belong in `src/main/test` instead? It feels like it's more about 
showing a workaround than becoming part of our library for use. If we were to 
do this or something similar in Daffodil for the `TextNumberEv`, I feel like 
the logic would just go directly in there. We already do inspection of the 
pattern and setting of properties to "fix" ICU behavior (see 
`PrimitivesTextNumber.scala` and `EvTextNumber.scala`), so it feels like the 
logic wants to be along side that instead of a special class. And if the 
PatternStringParser can be used, the complexity of the pattern goes away.



##########
daffodil-lib/src/main/scala/org/apache/daffodil/lib/util/FixedICUDecimalFormat.java:
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.daffodil.lib.util;
+
+import com.ibm.icu.text.DecimalFormat;
+import com.ibm.icu.text.DecimalFormatSymbols;
+
+import java.util.Objects;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+public class FixedICUDecimalFormat {
+
+    /**
+     * This regex enables us to correct a problem in ICUs interpretation of 
the pattern in the case
+     * where the pattern has no digits before the decimal points, such as 
"#.##"
+     * so that there should be zero leading digits if the integer part is 0.
+     * <p>
+     *     See https://unicode-org.atlassian.net/browse/ICU-22558
+     * <p>

Review Comment:
   Seems like this might be intentional behavior, I found this in the ICU code:
   
   
https://github.com/unicode-org/icu/blob/main/icu4j/main/core/src/main/java/com/ibm/icu/impl/number/PatternStringParser.java#L622-L635
   
   So if you want no integral zeros, you have to use a pattern like `.#`, but 
that will require a fractional zero.



##########
daffodil-lib/src/main/scala/org/apache/daffodil/lib/util/FixedICUDecimalFormat.java:
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.daffodil.lib.util;
+
+import com.ibm.icu.text.DecimalFormat;
+import com.ibm.icu.text.DecimalFormatSymbols;
+
+import java.util.Objects;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+public class FixedICUDecimalFormat {
+
+    /**
+     * This regex enables us to correct a problem in ICUs interpretation of 
the pattern in the case
+     * where the pattern has no digits before the decimal points, such as 
"#.##"
+     * so that there should be zero leading digits if the integer part is 0.
+     * <p>
+     *     See https://unicode-org.atlassian.net/browse/ICU-22558
+     * <p>
+     * It captures unescaped digits or '#' characters
+     * preceding a non-escaped decimal point in an ICU4J pattern string.
+     * <p>
+     * The whole regex (without java escapes) is {@code 
(?<!\\)([0-9#]*)(?<!\\)(\.) }
+     * <ul>
+     * <li>{@code (?<!\\)}: Negative lookbehind to ensure the character is not 
preceded by a backslash.</li>
+     * <li>{@code ([0-9#]*)}: Capture zero or more consecutive digits or '#' 
characters in group 1.</li>
+     * <li>{@code (?<!\\)(\.)}: The negative lookbehind ensures that only a 
non-escaped decimal point
+     * will be captured in group 2.</li>
+     * </ul>
+     *
+     * With this pattern:
+     * <ul>
+     * <li>Group 1 will capture any preceding unescaped digits or '#' 
characters.</li>
+     * <li>Group 2 will capture the non-escaped decimal point.</li>
+     * </ul>
+     *
+     * The match will fail only if there is no non-escaped decimal point.
+     *
+     * Package private for unit testing.
+     */
+    static Pattern pattern = 
Pattern.compile("(?<!\\\\)([0-9#]*)(?<!\\\\)(\\.)");
+
+    /**
+     * Construct a DecimalFormat, but correct for the ICU mis-interpretation 
of "#." where it
+     * mistakenly sets the minimum number of integer digits to 1, not 0.
+     *
+     * This is not fooled by things like "#5#.0" which is silly, but behaves 
like "0.0".
+     * <p>
+     * See https://unicode-org.atlassian.net/browse/ICU-22558
+     * </p>
+     *
+     * @param patternString - an ICU pattern for a DecimalFormat, to be 
created and fixed (if needed)
+     * @return a DecimalFormat properly initialized by the pattern.
+     */
+    public static DecimalFormat fromPattern(String patternString) {
+        DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(); // 
default locale
+        return fromPattern(patternString, symbols);
+    }
+    /**
+     * Package private helper version that takes symbols so that you can 
define a locale.
+     * This allows tests to not have to adjust for locale decimal points "." 
vs. ",",
+     * by specifying a locale. But it isn't needed for regular usage.
+     *
+     * @param patternString - ICU pattern to fix (if needed)
+     * @param symbols - used to specify locale (because decimal points can be 
"." or ","
+     * @return a DecimalFormat properly initialized by the pattern.
+     */
+    static DecimalFormat fromPattern(String patternString, 
DecimalFormatSymbols symbols) {
+        DecimalFormat decimalFormat = new DecimalFormat(patternString, 
symbols);
+        Matcher matcher = pattern.matcher(patternString);

Review Comment:
   Instead of using this kindof regex, it looks like icu4j has a public 
class/functions  (looks undocumented but stable) to get information from a 
pattern. Something like this:
   
   ```scala
   val ppi = 
com.ibm.icu.impl.number.PatternStringParser.parseToPatternInfo(patternString)
   val pi = ppi.positive
   ```
   
   The `pi` variable is a `ParsedSubpatternInfo` that has a bunch of properties 
about the pattern, like `integerTotal`, `integerNumerals`, and 
`integerLeadingHashSigns`. This is probably all subject to change since I think 
it's an internal class, but looking at the history of the PatternStringParser, 
it looks like it changes pretty rarely.
   
   In fact, I believe some of Daffodil inspects the pattern to detect things 
like if there is an `E`, comma, decimal, etc. and tries to deal with things 
like escaped and quoted characters like this pattern does. We should refactor 
all that to use this PatternStringParser if we need to ask any questions about 
the pattern.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to