[ https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377736#comment-16377736 ]
ASF GitHub Bot commented on ARROW-2145: --------------------------------------- cpcloud commented on a change in pull request #1651: ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal conversion not working for NaN values URL: https://github.com/apache/arrow/pull/1651#discussion_r170762992 ########## File path: cpp/src/arrow/util/decimal.cc ########## @@ -212,160 +253,124 @@ static void StringToInteger(const std::string& str, Decimal128* out) { } } -Status Decimal128::FromString(const std::string& s, Decimal128* out, int* precision, - int* scale) { - // Implements this regex: "(\\+?|-?)((0*)(\\d*))(\\.(\\d+))?((E|e)(\\+|-)?\\d+)?"; - if (s.empty()) { - return Status::Invalid("Empty string cannot be converted to decimal"); - } +static const boost::regex DECIMAL_REGEX( + // sign of the number + "(?<SIGN>[-+]?)" - std::string::const_iterator charp = s.cbegin(); - std::string::const_iterator end = s.cend(); + // digits around the decimal point + "(((?<LEFT_DIGITS>\\d+)\\.(?<FIRST_RIGHT_DIGITS>\\d*)|\\.(?<SECOND_RIGHT_DIGITS>\\d+)" + ")" - char first_char = *charp; - bool is_negative = false; - if (first_char == '+' || first_char == '-') { - is_negative = first_char == '-'; - ++charp; - } + // optional exponent + "([eE](?<FIRST_EXP_VALUE>[-+]?\\d+))?" - if (charp == end) { - std::stringstream ss; - ss << "Single character: '" << first_char << "' is not a valid decimal value"; - return Status::Invalid(ss.str()); - } + // otherwise + "|" - std::string::const_iterator numeric_string_start = charp; + // we're just an integer + "(?<INTEGER>\\d+)" - DCHECK_LT(charp, end); + // or an integer with an exponent + "(?:[eE](?<SECOND_EXP_VALUE>[-+]?\\d+))?)"); - // skip leading zeros - charp = std::find_if_not(charp, end, [](char value) { return value == '0'; }); +static inline bool is_zero_character(char c) { return c == '0'; } - // all zeros and no decimal point - if (charp == end) { - if (out != NULLPTR) { - *out = 0; - } +Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* precision, + int32_t* scale) { + if (s.empty()) { + return Status::Invalid("Empty string cannot be converted to decimal"); + } - // Not sure what other libraries assign precision to for this case (this case of - // a string consisting only of one or more zeros) + // case of all zeros + if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) { if (precision != NULLPTR) { - *precision = static_cast<int>(charp - numeric_string_start); + *precision = 0; } if (scale != NULLPTR) { *scale = 0; } + *out = 0; return Status::OK(); } - std::string::const_iterator whole_part_start = charp; + boost::smatch results; + const bool matches = boost::regex_match(s, results, DECIMAL_REGEX); + + if (!matches) { + std::stringstream ss; + ss << s << " does not match"; + return Status::Invalid(ss.str()); + } + + const std::string sign = results["SIGN"]; Review comment: So, `operator[](const std::string)` returns a `const_reference` to a `sub_match` object, which has a cast to `std::string` operator defined. `sub_match` has `first` and `second` attributes which are bidirectional iterators which are used to construct a string like `std::string(match.first, match.second)`. Alternatively we use `results["SIGN"].str()`. The main difference is that the first uses `__builtin_memcpy` and the second uses `reserve` then ultimately `__builtin_memset` N number of times. I suspect that one call to memcpy N bytes is cheaper than N calls to memset individual elements. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Decimal conversion not working for NaN values > ------------------------------------------------------ > > Key: ARROW-2145 > URL: https://issues.apache.org/jira/browse/ARROW-2145 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python > Affects Versions: 0.8.0 > Reporter: Antony Mayi > Assignee: Phillip Cloud > Priority: Major > Labels: pull-request-available > > {code:python} > import pyarrow as pa > import pandas as pd > import decimal > pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), > decimal.Decimal('NaN')]})) > {code} > throws following exception: > {code} > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927) > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in > dataframe_to_arrays > convert_types)] > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in > <listcomp> > for c, t in zip(columns_to_convert, > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in > convert_column > return pa.array(col, from_pandas=True, type=ty) > File "pyarrow/array.pxi", line 170, in pyarrow.lib.array > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224) > File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465) > File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068) > pyarrow.lib.ArrowException: Unknown error: an integer is required (got type > str) > {code} > Same problem with other special decimal values like {{infinity}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)