stevedlawrence commented on code in PR #1106: URL: https://github.com/apache/daffodil/pull/1106#discussion_r1376067919
########## daffodil-lib/src/test/scala/org/apache/daffodil/lib/util/TestFixedICUDecimalFormat.java: ########## @@ -0,0 +1,159 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.daffodil.lib.util; + +import com.ibm.icu.text.DecimalFormat; +import com.ibm.icu.text.DecimalFormatSymbols; +import org.junit.Test; +import static junit.framework.TestCase.assertEquals; +import static org.junit.Assert.*; + +import java.util.Locale; +import java.util.Objects; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +/** + * Tests about ICU-22558 + * + * https://unicode-org.atlassian.net/browse/ICU-22558 + */ +public class TestFixedICUDecimalFormat { + + // Create DecimalFormatSymbols for a Locale where the decimal separator is a dot + DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.US); + + /** + * Illustrates the bug in ICU. + */ + @Test + public void testDecimalFormatWithHash() { + // Create DecimalFormat instance with the pattern "#.##" + DecimalFormat decimalFormat = new DecimalFormat("#.##", symbols); Review Comment: Seems like a patterm of `.##` gives the expected results ```scala scala> new com.ibm.icu.text.DecimalFormat(".##").format(0.12) res1: String = .12 ``` Is that a reasonable workaround? ########## daffodil-lib/src/test/scala/org/apache/daffodil/lib/util/TestFixedICUDecimalFormat.java: ########## @@ -0,0 +1,159 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.daffodil.lib.util; + +import com.ibm.icu.text.DecimalFormat; +import com.ibm.icu.text.DecimalFormatSymbols; +import org.junit.Test; +import static junit.framework.TestCase.assertEquals; +import static org.junit.Assert.*; + +import java.util.Locale; +import java.util.Objects; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +/** + * Tests about ICU-22558 + * + * https://unicode-org.atlassian.net/browse/ICU-22558 + */ +public class TestFixedICUDecimalFormat { + + // Create DecimalFormatSymbols for a Locale where the decimal separator is a dot + DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.US); + + /** + * Illustrates the bug in ICU. + */ + @Test + public void testDecimalFormatWithHash() { + // Create DecimalFormat instance with the pattern "#.##" + DecimalFormat decimalFormat = new DecimalFormat("#.##", symbols); + + // Convert the number 0.12 to String + String formattedNumber = decimalFormat.format(0.12); + + // This assert tests for the current behavior. (which is buggy) + assertEquals("0.12", formattedNumber); // Wrong answer: See ICU-22558 + // Assert that the formatted string is ".12" + if (formattedNumber == ".12") { + // see https://unicode-org.atlassian.net/browse/ICU-22558 + fail("Bug ICU-22558 has been fixed. This test set is no longer needed to illustrate the problem."); + } + } Review Comment: Historically we've been putting tests that show ICU behavior here: [daffodil-runtime1/src/test/scala/org/apache/daffodil/runtime1/processors/input/TestICU.scala](https://github.com/apache/daffodil/blob/main/daffodil-runtime1/src/test/scala/org/apache/daffodil/runtime1/processors/input/TestICU.scala) ########## daffodil-lib/src/main/scala/org/apache/daffodil/lib/util/FixedICUDecimalFormat.java: ########## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.daffodil.lib.util; + +import com.ibm.icu.text.DecimalFormat; +import com.ibm.icu.text.DecimalFormatSymbols; + +import java.util.Objects; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +public class FixedICUDecimalFormat { + + /** + * This regex enables us to correct a problem in ICUs interpretation of the pattern in the case + * where the pattern has no digits before the decimal points, such as "#.##" + * so that there should be zero leading digits if the integer part is 0. + * <p> + * See https://unicode-org.atlassian.net/browse/ICU-22558 + * <p> + * It captures unescaped digits or '#' characters + * preceding a non-escaped decimal point in an ICU4J pattern string. + * <p> + * The whole regex (without java escapes) is {@code (?<!\\)([0-9#]*)(?<!\\)(\.) } + * <ul> + * <li>{@code (?<!\\)}: Negative lookbehind to ensure the character is not preceded by a backslash.</li> + * <li>{@code ([0-9#]*)}: Capture zero or more consecutive digits or '#' characters in group 1.</li> + * <li>{@code (?<!\\)(\.)}: The negative lookbehind ensures that only a non-escaped decimal point + * will be captured in group 2.</li> + * </ul> + * + * With this pattern: + * <ul> + * <li>Group 1 will capture any preceding unescaped digits or '#' characters.</li> + * <li>Group 2 will capture the non-escaped decimal point.</li> + * </ul> + * + * The match will fail only if there is no non-escaped decimal point. + * + * Package private for unit testing. + */ + static Pattern pattern = Pattern.compile("(?<!\\\\)([0-9#]*)(?<!\\\\)(\\.)"); + + /** + * Construct a DecimalFormat, but correct for the ICU mis-interpretation of "#." where it + * mistakenly sets the minimum number of integer digits to 1, not 0. + * + * This is not fooled by things like "#5#.0" which is silly, but behaves like "0.0". + * <p> + * See https://unicode-org.atlassian.net/browse/ICU-22558 + * </p> + * + * @param patternString - an ICU pattern for a DecimalFormat, to be created and fixed (if needed) + * @return a DecimalFormat properly initialized by the pattern. + */ + public static DecimalFormat fromPattern(String patternString) { + DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(); // default locale + return fromPattern(patternString, symbols); + } + /** + * Package private helper version that takes symbols so that you can define a locale. + * This allows tests to not have to adjust for locale decimal points "." vs. ",", + * by specifying a locale. But it isn't needed for regular usage. + * + * @param patternString - ICU pattern to fix (if needed) + * @param symbols - used to specify locale (because decimal points can be "." or "," + * @return a DecimalFormat properly initialized by the pattern. + */ + static DecimalFormat fromPattern(String patternString, DecimalFormatSymbols symbols) { + DecimalFormat decimalFormat = new DecimalFormat(patternString, symbols); + Matcher matcher = pattern.matcher(patternString); + if (matcher.find()) { + // There IS a decimal point. + String digits = matcher.group(1); // digits or # unescaped, before the decimal point + assert(Objects.equals(".", matcher.group(2))); // 2nd group matches the decimal point itself + long count = digits.chars().filter(Character::isDigit).count(); + if (count == 0) { + // The decimal point is not preceded by any required digits. + // This is the fix ICU is getting wrong. + // + // Note even when ICU fixes this, this code won't break. It's just + // unnecessary. + // + // + decimalFormat.setMinimumIntegerDigits(0); + } + } + return decimalFormat; + } +} Review Comment: Does this belong in `src/main/test` instead? It feels like it's more about showing a workaround than becoming part of our library for use. If we were to do this or something similar in Daffodil for the `TextNumberEv`, I feel like the logic would just go directly in there. We already do inspection of the pattern and setting of properties to "fix" ICU behavior (see `PrimitivesTextNumber.scala` and `EvTextNumber.scala`), so it feels like the logic wants to be along side that instead of a special class. And if the PatternStringParser can be used, the complexity of the pattern goes away. ########## daffodil-lib/src/main/scala/org/apache/daffodil/lib/util/FixedICUDecimalFormat.java: ########## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.daffodil.lib.util; + +import com.ibm.icu.text.DecimalFormat; +import com.ibm.icu.text.DecimalFormatSymbols; + +import java.util.Objects; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +public class FixedICUDecimalFormat { + + /** + * This regex enables us to correct a problem in ICUs interpretation of the pattern in the case + * where the pattern has no digits before the decimal points, such as "#.##" + * so that there should be zero leading digits if the integer part is 0. + * <p> + * See https://unicode-org.atlassian.net/browse/ICU-22558 + * <p> Review Comment: Seems like this might be intentional behavior, I found this in the ICU code: https://github.com/unicode-org/icu/blob/main/icu4j/main/core/src/main/java/com/ibm/icu/impl/number/PatternStringParser.java#L622-L635 So if you want no integral zeros, you have to use a pattern like `.#`, but that will require a fractional zero. ########## daffodil-lib/src/main/scala/org/apache/daffodil/lib/util/FixedICUDecimalFormat.java: ########## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.daffodil.lib.util; + +import com.ibm.icu.text.DecimalFormat; +import com.ibm.icu.text.DecimalFormatSymbols; + +import java.util.Objects; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +public class FixedICUDecimalFormat { + + /** + * This regex enables us to correct a problem in ICUs interpretation of the pattern in the case + * where the pattern has no digits before the decimal points, such as "#.##" + * so that there should be zero leading digits if the integer part is 0. + * <p> + * See https://unicode-org.atlassian.net/browse/ICU-22558 + * <p> + * It captures unescaped digits or '#' characters + * preceding a non-escaped decimal point in an ICU4J pattern string. + * <p> + * The whole regex (without java escapes) is {@code (?<!\\)([0-9#]*)(?<!\\)(\.) } + * <ul> + * <li>{@code (?<!\\)}: Negative lookbehind to ensure the character is not preceded by a backslash.</li> + * <li>{@code ([0-9#]*)}: Capture zero or more consecutive digits or '#' characters in group 1.</li> + * <li>{@code (?<!\\)(\.)}: The negative lookbehind ensures that only a non-escaped decimal point + * will be captured in group 2.</li> + * </ul> + * + * With this pattern: + * <ul> + * <li>Group 1 will capture any preceding unescaped digits or '#' characters.</li> + * <li>Group 2 will capture the non-escaped decimal point.</li> + * </ul> + * + * The match will fail only if there is no non-escaped decimal point. + * + * Package private for unit testing. + */ + static Pattern pattern = Pattern.compile("(?<!\\\\)([0-9#]*)(?<!\\\\)(\\.)"); + + /** + * Construct a DecimalFormat, but correct for the ICU mis-interpretation of "#." where it + * mistakenly sets the minimum number of integer digits to 1, not 0. + * + * This is not fooled by things like "#5#.0" which is silly, but behaves like "0.0". + * <p> + * See https://unicode-org.atlassian.net/browse/ICU-22558 + * </p> + * + * @param patternString - an ICU pattern for a DecimalFormat, to be created and fixed (if needed) + * @return a DecimalFormat properly initialized by the pattern. + */ + public static DecimalFormat fromPattern(String patternString) { + DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(); // default locale + return fromPattern(patternString, symbols); + } + /** + * Package private helper version that takes symbols so that you can define a locale. + * This allows tests to not have to adjust for locale decimal points "." vs. ",", + * by specifying a locale. But it isn't needed for regular usage. + * + * @param patternString - ICU pattern to fix (if needed) + * @param symbols - used to specify locale (because decimal points can be "." or "," + * @return a DecimalFormat properly initialized by the pattern. + */ + static DecimalFormat fromPattern(String patternString, DecimalFormatSymbols symbols) { + DecimalFormat decimalFormat = new DecimalFormat(patternString, symbols); + Matcher matcher = pattern.matcher(patternString); Review Comment: Instead of using this kindof regex, it looks like icu4j has a public class/functions (looks undocumented but stable) to get information from a pattern. Something like this: ```scala val ppi = com.ibm.icu.impl.number.PatternStringParser.parseToPatternInfo(patternString) val pi = ppi.positive ``` The `pi` variable is a `ParsedSubpatternInfo` that has a bunch of properties about the pattern, like `integerTotal`, `integerNumerals`, and `integerLeadingHashSigns`. This is probably all subject to change since I think it's an internal class, but looking at the history of the PatternStringParser, it looks like it changes pretty rarely. In fact, I believe some of Daffodil inspects the pattern to detect things like if there is an `E`, comma, decimal, etc. and tries to deal with things like escaped and quoted characters like this pattern does. We should refactor all that to use this PatternStringParser if we need to ask any questions about the pattern. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
