[jira] [Created] (ARROW-9122) [C++] Adapt ascii_lower/ascii_upper bulk transforms to work on sliced arrays
Wes McKinney created ARROW-9122: --- Summary: [C++] Adapt ascii_lower/ascii_upper bulk transforms to work on sliced arrays Key: ARROW-9122 URL: https://issues.apache.org/jira/browse/ARROW-9122 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 See comments at https://github.com/apache/arrow/pull/7418#discussion_r439754427 Also add unit tests to verify that only the referenced data slice has been transformed in the result -- This message was sent by Atlassian Jira (v8.3.4#803005)
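To illustrate the slicing concern, here is a minimal sketch that upper-cases only the bytes a slice refers to, in a single pass over the contiguous byte range. The `StringSlice` struct and `AsciiUpperSlice` are hypothetical names invented for this note and are not Arrow's API; the layout (an offsets vector with one more entry than there are values) only mirrors the general idea of a variable-length string array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for a sliced string array (illustrative, not Arrow's
// types). `offsets` has one more entry than the parent array has values, and
// the slice covers values [offset, offset + length).
struct StringSlice {
  const std::vector<int32_t>* offsets;
  const std::string* data;
  int64_t offset;
  int64_t length;
};

// Upper-case only the bytes referenced by the slice. The byte range is
// located once from the offsets and then transformed in bulk, instead of
// visiting each string value separately.
inline std::string AsciiUpperSlice(const StringSlice& s) {
  assert(s.offset >= 0 && s.length >= 0);
  assert(static_cast<std::size_t>(s.offset + s.length) < s.offsets->size());
  const int32_t begin = (*s.offsets)[s.offset];
  const int32_t end = (*s.offsets)[s.offset + s.length];
  std::string out(s.data->begin() + begin, s.data->begin() + end);
  for (char& c : out) {
    if (c >= 'a' && c <= 'z') c = static_cast<char>(c - 'a' + 'A');
  }
  return out;
}
```

A unit test in the spirit of the issue would check that bytes outside `[begin, end)` never appear in the output.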
[jira] [Created] (ARROW-9118) [C++] Add more general BoundsCheck function that also checks for arbitrary lower limits in integer arrays
Wes McKinney created ARROW-9118: --- Summary: [C++] Add more general BoundsCheck function that also checks for arbitrary lower limits in integer arrays Key: ARROW-9118 URL: https://issues.apache.org/jira/browse/ARROW-9118 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 See ARROW-9083. The current {{IndexBoundsCheck}} is specialized to skip a comparison for unsigned integers and uses 0 as the lower bound for signed integers. This could be generalized so that we could check e.g. if int64 values will fit in the int32 range -- This message was sent by Atlassian Jira (v8.3.4#803005)
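A rough sketch of what such a generalized check could look like, assuming a closed interval with caller-supplied limits. `AllWithinBounds` and `FitsInInt32` are hypothetical helpers written for this note, not functions from the Arrow codebase.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (not Arrow's API): returns true when every value lies
// in the closed interval [lower, upper].
template <typename T>
bool AllWithinBounds(const std::vector<T>& values, T lower, T upper) {
  assert(lower <= upper);
  bool ok = true;
  for (T v : values) {
    // Accumulate with bitwise AND so the loop body has no early exit.
    ok &= (v >= lower) & (v <= upper);
  }
  return ok;
}

// Example use from the issue text: can these int64 values be narrowed to
// int32 without overflow?
inline bool FitsInInt32(const std::vector<int64_t>& values) {
  return AllWithinBounds<int64_t>(values, std::numeric_limits<int32_t>::min(),
                                  std::numeric_limits<int32_t>::max());
}
```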
[jira] [Created] (ARROW-9115) [C++] Process data buffers in batch in ascii_lower / ascii_upper kernels rather than using string_view value iteration
Wes McKinney created ARROW-9115: --- Summary: [C++] Process data buffers in batch in ascii_lower / ascii_upper kernels rather than using string_view value iteration Key: ARROW-9115 URL: https://issues.apache.org/jira/browse/ARROW-9115 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 Also add a benchmark -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9092) [C++] gandiva-decimal-test hangs with LLVM 9
Wes McKinney created ARROW-9092: --- Summary: [C++] gandiva-decimal-test hangs with LLVM 9 Key: ARROW-9092 URL: https://issues.apache.org/jira/browse/ARROW-9092 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney I built Gandiva C++ unittests with LLVM 9 on Ubuntu 18.04 and gandiva-decimal-test hangs forever -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9091) [C++] Utilize function's default options when passing no options to CallFunction to a function that requires them
Wes McKinney created ARROW-9091: --- Summary: [C++] Utilize function's default options when passing no options to CallFunction to a function that requires them Key: ARROW-9091 URL: https://issues.apache.org/jira/browse/ARROW-9091 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Otherwise benign usage of {{CallFunction}} can cause an unintuitive segfault in some cases -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9085) [C++][CI] Appveyor CI test failures
Wes McKinney created ARROW-9085: --- Summary: [C++][CI] Appveyor CI test failures Key: ARROW-9085 URL: https://issues.apache.org/jira/browse/ARROW-9085 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 See https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/33417919 These seem to have been introduced by https://github.com/apache/arrow/commit/b058cf0d1c26ad7984c104bb84322cc7dcc66f00 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9080) [C++] arrow::AllocateBuffer returns a Result<std::unique_ptr<Buffer>>
Wes McKinney created ARROW-9080: --- Summary: [C++] arrow::AllocateBuffer returns a Result<std::unique_ptr<Buffer>> Key: ARROW-9080 URL: https://issues.apache.org/jira/browse/ARROW-9080 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 This seemed counterintuitive to me since using Buffers almost anywhere requires a shared_ptr -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9075) [C++] Optimize Filter implementation
Wes McKinney created ARROW-9075: --- Summary: [C++] Optimize Filter implementation Key: ARROW-9075 URL: https://issues.apache.org/jira/browse/ARROW-9075 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 I split this off from ARROW-5760 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9067) [C++] Create reusable branchless / vectorized index boundschecking functions
Wes McKinney created ARROW-9067: --- Summary: [C++] Create reusable branchless / vectorized index boundschecking functions Key: ARROW-9067 URL: https://issues.apache.org/jira/browse/ARROW-9067 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 It is possible to do branch-free index boundschecking in batches for better performance. I am implementing this as part of the Take/Filter optimization (so please wait until I have PRs up for this work), but these functions can be moved somewhere more general purpose and used in places where we are currently boundschecking inside inner loops. -- This message was sent by Atlassian Jira (v8.3.4#803005)
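As a sketch of the batching idea only (not the code from the Take/Filter work), the check below reduces each block of indices with a bitwise OR, so the inner loop has no data-dependent branch and the only branch is one test per block. `IndicesInBounds` and the block size of 64 are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: true when every index is in [0, upper_limit).
inline bool IndicesInBounds(const std::vector<int64_t>& indices,
                            int64_t upper_limit) {
  assert(upper_limit >= 0);
  constexpr std::size_t kBlockSize = 64;  // arbitrary; tune with benchmarks
  const uint64_t limit = static_cast<uint64_t>(upper_limit);
  for (std::size_t start = 0; start < indices.size(); start += kBlockSize) {
    const std::size_t stop = std::min(start + kBlockSize, indices.size());
    bool out_of_bounds = false;
    for (std::size_t i = start; i < stop; ++i) {
      // Casting to unsigned folds the "negative" and "too large" cases into
      // a single comparison.
      out_of_bounds |= static_cast<uint64_t>(indices[i]) >= limit;
    }
    if (out_of_bounds) return false;
  }
  return true;
}
```

Because the inner loop is a plain reduction, a compiler has a reasonable chance of vectorizing it, which is the point of moving the check out of per-element branches.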
[jira] [Created] (ARROW-9045) [C++] Improve and expand Take/Filter benchmarks
Wes McKinney created ARROW-9045: --- Summary: [C++] Improve and expand Take/Filter benchmarks Key: ARROW-9045 URL: https://issues.apache.org/jira/browse/ARROW-9045 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 I'm putting this up as a separate patch for review -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9043) [Go] Temporarily copy LICENSE.txt to go/
Wes McKinney created ARROW-9043: --- Summary: [Go] Temporarily copy LICENSE.txt to go/ Key: ARROW-9043 URL: https://issues.apache.org/jira/browse/ARROW-9043 Project: Apache Arrow Issue Type: Improvement Components: Go Reporter: Wes McKinney Fix For: 1.0.0 {{go mod}} needs to find a license file in the root of the Go module. In the future "go mod" may be able to follow symlinks in which case this can be replaced by a symlink. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9034) [C++] Implement binary (two bitmap) version of BitBlockCounter
Wes McKinney created ARROW-9034: --- Summary: [C++] Implement binary (two bitmap) version of BitBlockCounter Key: ARROW-9034 URL: https://issues.apache.org/jira/browse/ARROW-9034 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 The current BitBlockCounter from ARROW-9029 is useful for unary operations. Some operations involve multiple bitmaps and so it's useful to be able to determine the block popcounts of the AND of the respective words in the bitmaps. So each returned block would contain the number of bits that are set in both bitmaps at the same locations -- This message was sent by Atlassian Jira (v8.3.4#803005)
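The per-block quantity described here can be sketched as follows, assuming both bitmaps are already available as aligned 64-bit words of equal length. `Popcount64` and `AndBlockPopcounts` are illustrative names, and the portable bit-clearing loop stands in for whatever popcount intrinsic an actual implementation would use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable popcount: clears the lowest set bit until the word is zero.
inline int Popcount64(uint64_t word) {
  int count = 0;
  while (word != 0) {
    word &= word - 1;
    ++count;
  }
  return count;
}

// For each pair of 64-bit words, count the bits that are set in both
// bitmaps at the same positions.
inline std::vector<int> AndBlockPopcounts(const std::vector<uint64_t>& left,
                                          const std::vector<uint64_t>& right) {
  assert(left.size() == right.size());
  std::vector<int> counts;
  counts.reserve(left.size());
  for (std::size_t i = 0; i < left.size(); ++i) {
    counts.push_back(Popcount64(left[i] & right[i]));
  }
  return counts;
}
```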
[jira] [Created] (ARROW-9033) [Python] Add tests to verify that one can build a C++ extension against the manylinux1 wheels
Wes McKinney created ARROW-9033: --- Summary: [Python] Add tests to verify that one can build a C++ extension against the manylinux1 wheels Key: ARROW-9033 URL: https://issues.apache.org/jira/browse/ARROW-9033 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Some projects want to be able to use the Python wheels to build other Python packages with C++ extensions that need to link against libarrow.so. It would be great if someone would add automated tests to ensure that our wheel builds can be used successfully in this fashion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9032) [C++] Split arrow/util/bit_util.h into multiple header files
Wes McKinney created ARROW-9032: --- Summary: [C++] Split arrow/util/bit_util.h into multiple header files Key: ARROW-9032 URL: https://issues.apache.org/jira/browse/ARROW-9032 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 This header has grown quite large and any given compilation unit's use of it is likely limited to only a couple of functions or classes. I suspect it would improve compilation time to split up this header into a few headers organized by frequency of code use. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9031) [R] Implement conversion from Type::UINT64 to R vector
Wes McKinney created ARROW-9031: --- Summary: [R] Implement conversion from Type::UINT64 to R vector Key: ARROW-9031 URL: https://issues.apache.org/jira/browse/ARROW-9031 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Wes McKinney Fix For: 1.0.0 This case is not handled in array_to_vector.cpp -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9030) [Python] Clean up some usages of pyarrow.compat, move some common functions/symbols to lib.pyx
Wes McKinney created ARROW-9030: --- Summary: [Python] Clean up some usages of pyarrow.compat, move some common functions/symbols to lib.pyx Key: ARROW-9030 URL: https://issues.apache.org/jira/browse/ARROW-9030 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney I started doing this while looking into ARROW-4633 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9029) [C++] Implement BitmapScanner interface to accelerate processing of mostly-not-null data
Wes McKinney created ARROW-9029: --- Summary: [C++] Implement BitmapScanner interface to accelerate processing of mostly-not-null data Key: ARROW-9029 URL: https://issues.apache.org/jira/browse/ARROW-9029 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 In analytics, it is common for data to be all not-null or mostly not-null. Data with > 50% nulls tends to be more exceptional. In this light, our {{BitmapReader}} class, which allows iteration of each bit in a bitmap, can be wasteful for mostly set validity bitmaps. I propose instead a new interface for use in kernel implementations, for lack of a better term {{BitmapScanner}}. This works as follows: * Uses popcount to accumulate consecutive 64-bit words from a bitmap where all values are set, up to some limit (e.g. anywhere from 8 to 128 words -- we can use benchmarks to determine what is a good limit). The length of this "all-on" run is returned to the caller in a single function call, so that this "run" of data can be processed without any bit-by-bit bitmap checking * If a word containing unset bits is encountered, the scanner will similarly accumulate non-full words until the next full word is encountered or a limit is hit. The length of this "has nulls" run is returned to the caller, which then proceeds bit-by-bit to process the data For data with a lot of nulls, this may degrade performance somewhat but probably not that much empirically. However, data that is mostly-not-null should benefit from this. This BitmapScanner utility can probably also be used to accelerate the implementation of Filter for mostly-not-null data -- This message was sent by Atlassian Jira (v8.3.4#803005)
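The two kinds of runs described in this proposal can be sketched roughly as below. This is a simplification under stated assumptions: the bitmap is given as whole 64-bit words (no bit offset, length a multiple of 64), a word is treated as "full" by comparing it with all-ones (equivalent to a popcount of 64), and `BitRun`, `ScanRuns`, and `max_words_per_run` are hypothetical names rather than the proposed interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One run of consecutive words that are either all fully set or all
// containing at least one unset bit.
struct BitRun {
  int64_t length;  // run length in bits
  bool all_set;    // true: caller can skip per-bit validity checks
};

inline std::vector<BitRun> ScanRuns(const std::vector<uint64_t>& words,
                                    std::size_t max_words_per_run) {
  assert(max_words_per_run > 0);
  constexpr uint64_t kFullWord = ~uint64_t{0};
  std::vector<BitRun> runs;
  std::size_t i = 0;
  while (i < words.size()) {
    const bool full = (words[i] == kFullWord);
    std::size_t n = 0;
    // Extend the run while words have the same "fullness" and the limit
    // has not been reached.
    while (i + n < words.size() && n < max_words_per_run &&
           (words[i + n] == kFullWord) == full) {
      ++n;
    }
    runs.push_back({static_cast<int64_t>(n * 64), full});
    i += n;
  }
  return runs;
}
```

A kernel consuming these runs would process an `all_set` run with a tight loop over the values and fall back to bit-by-bit checks only for the other runs.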
[jira] [Created] (ARROW-9018) [C++] Remove APIs that were deprecated in 0.17.x and prior
Wes McKinney created ARROW-9018: --- Summary: [C++] Remove APIs that were deprecated in 0.17.x and prior Key: ARROW-9018 URL: https://issues.apache.org/jira/browse/ARROW-9018 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9006) [C++] Use Cast kernels to implement Scalar::Parse and Scalar::CastTo
Wes McKinney created ARROW-9006: --- Summary: [C++] Use Cast kernels to implement Scalar::Parse and Scalar::CastTo Key: ARROW-9006 URL: https://issues.apache.org/jira/browse/ARROW-9006 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 We should not maintain distinct (and possibly differently behaving) implementations of elementwise array casting and scalar casting. The new kernels framework provides for relatively easily generating kernels that can process arrays or scalars. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9003) [C++] Add VectorFunction wrapping arrow::Concatenate
Wes McKinney created ARROW-9003: --- Summary: [C++] Add VectorFunction wrapping arrow::Concatenate Key: ARROW-9003 URL: https://issues.apache.org/jira/browse/ARROW-9003 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 This would be a varargs function -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9001) [R] Box outputs as correct type in call_function
Wes McKinney created ARROW-9001: --- Summary: [R] Box outputs as correct type in call_function Key: ARROW-9001 URL: https://issues.apache.org/jira/browse/ARROW-9001 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Wes McKinney Fix For: 1.0.0 This would prevent segfaults by putting the SEXP in the wrong kind of R6 container -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8999) [Python][C++] Non-deterministic segfault in "AMD64 MacOS 10.15 Python 3.7" build
Wes McKinney created ARROW-8999: --- Summary: [Python][C++] Non-deterministic segfault in "AMD64 MacOS 10.15 Python 3.7" build Key: ARROW-8999 URL: https://issues.apache.org/jira/browse/ARROW-8999 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 1.0.0 I've been seeing this segfault periodically the last week, does anyone have an idea what might be wrong? https://github.com/apache/arrow/pull/7273/checks?check_run_id=717249862 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8998) [Python] Make NumPy an optional runtime dependency
Wes McKinney created ARROW-8998: --- Summary: [Python] Make NumPy an optional runtime dependency Key: ARROW-8998 URL: https://issues.apache.org/jira/browse/ARROW-8998 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Wes McKinney Since in the relatively near future, one will be able to do non-trivial analytical operations and query processing natively on Arrow data structures through pyarrow, it does not make sense to require users to always install NumPy when they install pyarrow. I propose to split the NumPy-depending parts of libarrow_python into a libarrow_numpy (which also must be bundled) and move this part of the codebase into a separate Cython module. This refactoring should be relatively painless though there may be a number of packaging details to chase up since this would introduce a new shared library to be installed in various packaging targets. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8995) [C++] Scalar formatting code used in array/diff.cc should be reusable
Wes McKinney created ARROW-8995: --- Summary: [C++] Scalar formatting code used in array/diff.cc should be reusable Key: ARROW-8995 URL: https://issues.apache.org/jira/browse/ARROW-8995 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Formatting Array values as strings is not specific to the diff.cc code, so it may make sense to move this code elsewhere where it can be used generally (perhaps a method like {{Array::FormatValue}}?). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8994) [C++] Disable include-what-you-use cpplint lint checks
Wes McKinney created ARROW-8994: --- Summary: [C++] Disable include-what-you-use cpplint lint checks Key: ARROW-8994 URL: https://issues.apache.org/jira/browse/ARROW-8994 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 If we want to be serious about IWYU, it would be better to use IWYU directly. The minimal checks that cpplint does can be a nuisance rather than addressing the problem holistically -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8991) [C++][Compute] Add scalar_hash function
Wes McKinney created ARROW-8991: --- Summary: [C++][Compute] Add scalar_hash function Key: ARROW-8991 URL: https://issues.apache.org/jira/browse/ARROW-8991 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 The purpose of this function is to compute 32- or 64-bit hash values for each cell in an Array. Hashes for nested types can be computed recursively by combining the hash values of their children -- This message was sent by Atlassian Jira (v8.3.4#803005)
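One way to picture the recursive combination for a nested type is a struct-like column with two children, where each row's hash is built from the hashes of its child values. The sketch below assumes a widely used shift-and-xor combining step and `std::hash` for the leaves; neither is claimed to be what Arrow uses, and `CombineHashes` and `HashStructRows` are names made up for this note.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// A common hash-combining idiom (assumption: any reasonable mixer would do).
inline uint64_t CombineHashes(uint64_t seed, uint64_t h) {
  return seed ^ (h + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

// Toy "struct array" with an int64 child and a string child. Produces one
// 64-bit hash per row by combining the child hashes.
inline std::vector<uint64_t> HashStructRows(
    const std::vector<int64_t>& ids, const std::vector<std::string>& names) {
  assert(ids.size() == names.size());
  std::vector<uint64_t> out(ids.size());
  for (std::size_t i = 0; i < ids.size(); ++i) {
    uint64_t h = std::hash<int64_t>{}(ids[i]);
    h = CombineHashes(h, std::hash<std::string>{}(names[i]));
    out[i] = h;
  }
  return out;
}
```

Deeper nesting would apply the same step again, feeding a child's combined hash into its parent's.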
[jira] [Created] (ARROW-8990) [C++] Benchmark hash table against thirdparty options, possibly vendor a thirdparty hash table library
Wes McKinney created ARROW-8990: --- Summary: [C++] Benchmark hash table against thirdparty options, possibly vendor a thirdparty hash table library Key: ARROW-8990 URL: https://issues.apache.org/jira/browse/ARROW-8990 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney While we have our own hash table implementation, it would be worthwhile to set up some benchmarks so that we can compare against std::unordered_map and some other thirdparty libraries for hash tables to know whether we should possibly use a thirdparty library. See e.g. https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8989) [C++] Document available functions in FunctionRegistry
Wes McKinney created ARROW-8989: --- Summary: [C++] Document available functions in FunctionRegistry Key: ARROW-8989 URL: https://issues.apache.org/jira/browse/ARROW-8989 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Create a compute page in the C++ section of the Sphinx docs and make a list of the available functions and what they do -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8985) [Format] Add "byte width" field with default of 16 to Decimal Flatbuffers type for forward compatibility
Wes McKinney created ARROW-8985: --- Summary: [Format] Add "byte width" field with default of 16 to Decimal Flatbuffers type for forward compatibility Key: ARROW-8985 URL: https://issues.apache.org/jira/browse/ARROW-8985 Project: Apache Arrow Issue Type: Improvement Components: Format Reporter: Wes McKinney Fix For: 1.0.0 This will permit larger or smaller decimals to be added to the format later without having to add a new Type union value -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8970) [C++] Reduce shared library code size (umbrella issue)
Wes McKinney created ARROW-8970: --- Summary: [C++] Reduce shared library code size (umbrella issue) Key: ARROW-8970 URL: https://issues.apache.org/jira/browse/ARROW-8970 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney We're reaching a point where we may need to be careful about decisions that increase code size: * Instantiating too many templates for code that isn't performance sensitive * Inlining functions that don't need to be inline Code size tends to correlate also with compilation times, but not always. I'll use this umbrella issue to organize issues related to reducing compiled code size -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8969) [C++] Reduce generated code in compute/kernels/scalar_compare.cc
Wes McKinney created ARROW-8969: --- Summary: [C++] Reduce generated code in compute/kernels/scalar_compare.cc Key: ARROW-8969 URL: https://issues.apache.org/jira/browse/ARROW-8969 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 We are instantiating templates in this module for cases that, byte-wise, do the exact same comparison. For example: * For equals, not_equals, we can use the same 32-bit/64-bit comparison kernels for signed int / unsigned int / floating point types of the same byte width * TimestampType can reuse int64 kernels, similarly for other date/time types * BinaryType/StringType can share kernels etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8966) [C++] Move arrow::ArrayData to a separate header file
Wes McKinney created ARROW-8966: --- Summary: [C++] Move arrow::ArrayData to a separate header file Key: ARROW-8966 URL: https://issues.apache.org/jira/browse/ARROW-8966 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 There are code modules (such as compute kernels) that only require ArrayData for doing computations, so pulling in all the code in array.h is not necessary. There are probably other code paths that might benefit from this also. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8961) [C++] Vendor utf8proc library
Wes McKinney created ARROW-8961: --- Summary: [C++] Vendor utf8proc library Key: ARROW-8961 URL: https://issues.apache.org/jira/browse/ARROW-8961 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 This is a minimal MIT-licensed library for UTF-8 data processing originally developed for use in Julia https://github.com/JuliaStrings/utf8proc -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8956) [C++] arrow::ScalarEquals returns false when values are both null
Wes McKinney created ARROW-8956: --- Summary: [C++] arrow::ScalarEquals returns false when values are both null Key: ARROW-8956 URL: https://issues.apache.org/jira/browse/ARROW-8956 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney I wasn't sure if this was deliberate but it appeared while writing unit tests and so wanted to check what was the intention before changing it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8955) [C++] Use kernels for casting Scalar values instead of bespoke implementation
Wes McKinney created ARROW-8955: --- Summary: [C++] Use kernels for casting Scalar values instead of bespoke implementation Key: ARROW-8955 URL: https://issues.apache.org/jira/browse/ARROW-8955 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 See details of casting in arrow/scalar.cc -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8951) [C++] Fix compiler warning in compute/kernels/scalar_cast_temporal.cc
Wes McKinney created ARROW-8951: --- Summary: [C++] Fix compiler warning in compute/kernels/scalar_cast_temporal.cc Key: ARROW-8951 URL: https://issues.apache.org/jira/browse/ARROW-8951 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 The kernel functor can return an uninitialized value on errors {code} ../src/arrow/compute/kernels/scalar_cast_temporal.cc: In member function ‘OUT arrow::compute::internal::ParseTimestamp::Call(arrow::compute::KernelContext*, ARG0) const [with OUT = long int; ARG0 = nonstd::sv_lite::basic_string_view]’: ../src/arrow/compute/kernels/scalar_cast_temporal.cc:267:12: warning: ‘result’ may be used uninitialized in this function [-Wmaybe-uninitialized] return result; {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8945) [Python] An independent Cython package for projects that want to program against the C data interface
Wes McKinney created ARROW-8945: --- Summary: [Python] An independent Cython package for projects that want to program against the C data interface Key: ARROW-8945 URL: https://issues.apache.org/jira/browse/ARROW-8945 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Wes McKinney I've been thinking it would be useful to have a minimal Cython package, call it "cyarrow", containing some pxd files and a small amount of compiled pyx code (using a C compiler only) that enables projects written in Cython to interact with Arrow datasets in minimal ways (for example, iterating over their values, interacting with dictionary-encoded/categorical arrays) that don't amount to reimplementation of the "hard stuff" where they would want to utilize pyarrow or the C++ library instead. Otherwise, every Python project that has compiled code in Cython and wants to use the C interface would have to create their own minimal implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8939) [C++] Arrow C++ Data Frame-style programming interface for analytics (umbrella issue)
Wes McKinney created ARROW-8939: --- Summary: [C++] Arrow C++ Data Frame-style programming interface for analytics (umbrella issue) Key: ARROW-8939 URL: https://issues.apache.org/jira/browse/ARROW-8939 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney This is an umbrella issue for the "C++ Data Frame" project that has been discussed on the mailing list with the following Google docs overview https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit I will attach issues to this JIRA to help organize and track the project as we make progress. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8938) [R] Provide binding and argument packing to use arrow::compute::CallFunction to use any compute kernel from R dynamically
Wes McKinney created ARROW-8938: --- Summary: [R] Provide binding and argument packing to use arrow::compute::CallFunction to use any compute kernel from R dynamically Key: ARROW-8938 URL: https://issues.apache.org/jira/browse/ARROW-8938 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Wes McKinney Fix For: 1.0.0 This will drastically simplify exposing new functions to R users -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8933) [C++] Reduce generated code in vector_hash.cc
Wes McKinney created ARROW-8933: --- Summary: [C++] Reduce generated code in vector_hash.cc Key: ARROW-8933 URL: https://issues.apache.org/jira/browse/ARROW-8933 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Since hashing doesn't need to know about logical types, we can do the following: * Use same generated code for both BinaryType and StringType * Use same generated code for primitive types having the same byte width These two changes should reduce binary size and improve compilation speed -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8937) [C++] Add "parse_strptime" function for string to timestamp conversions using the kernels framework
Wes McKinney created ARROW-8937: --- Summary: [C++] Add "parse_strptime" function for string to timestamp conversions using the kernels framework Key: ARROW-8937 URL: https://issues.apache.org/jira/browse/ARROW-8937 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney This should be relatively straightforward to implement using the new kernels framework -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8936) [C++] Parallelize execution of arrow::compute::ScalarFunction
Wes McKinney created ARROW-8936: --- Summary: [C++] Parallelize execution of arrow::compute::ScalarFunction Key: ARROW-8936 URL: https://issues.apache.org/jira/browse/ARROW-8936 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8935) [Python] Add necessary plumbing to enable Numba-generated functions to be registered as functions in the global C++ function/kernels registry
Wes McKinney created ARROW-8935: --- Summary: [Python] Add necessary plumbing to enable Numba-generated functions to be registered as functions in the global C++ function/kernels registry Key: ARROW-8935 URL: https://issues.apache.org/jira/browse/ARROW-8935 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Wes McKinney -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8934) [C++] Add timestamp subtract kernel aliased to int64 subtract implementation
Wes McKinney created ARROW-8934: --- Summary: [C++] Add timestamp subtract kernel aliased to int64 subtract implementation Key: ARROW-8934 URL: https://issues.apache.org/jira/browse/ARROW-8934 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 We can use the same scalar exec function for int64 subtraction as well as {{(array[TIMESTAMP], array[TIMESTAMP]) -> duration}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
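The aliasing idea can be illustrated with a toy registry in which two signatures point at the same function. Everything here (`ExecFn`, `SubtractInt64`, `MakeSubtractKernels`, and the signature strings) is a hypothetical stand-in and not the kernel registration API; it only shows that a timestamp subtraction needs no code of its own when timestamps are stored as int64 values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

using ExecFn = void (*)(const int64_t*, const int64_t*, int64_t*, std::size_t);

// Element-wise subtraction on raw int64 storage.
inline void SubtractInt64(const int64_t* left, const int64_t* right,
                          int64_t* out, std::size_t n) {
  assert(n == 0 || (left != nullptr && right != nullptr && out != nullptr));
  for (std::size_t i = 0; i < n; ++i) out[i] = left[i] - right[i];
}

// Toy registry: both signatures share one compiled function, so the second
// entry adds no generated code.
inline std::map<std::string, ExecFn> MakeSubtractKernels() {
  return {
      {"(int64, int64) -> int64", &SubtractInt64},
      {"(timestamp, timestamp) -> duration", &SubtractInt64},
  };
}
```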
[jira] [Created] (ARROW-8930) [C++] libz.so linking error with liborc.a
Wes McKinney created ARROW-8930: --- Summary: [C++] libz.so linking error with liborc.a Key: ARROW-8930 URL: https://issues.apache.org/jira/browse/ARROW-8930 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration Reporter: Wes McKinney Fix For: 1.0.0 This is failing in the Travis CI ARM build https://travis-ci.org/github/apache/arrow/jobs/690722203 {code} : && /usr/bin/ccache /usr/bin/c++ -Wno-noexcept-type -fdiagnostics-color=always -ggdb -O0 -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror -march=armv8-a -g -rdynamic src/arrow/adapters/orc/CMakeFiles/arrow-orc-adapter-test.dir/adapter_test.cc.o -o debug/arrow-orc-adapter-test -Wl,-rpath,/build/cpp/debug debug/libarrow_testing.a debug/libarrow.a debug//libgtest_maind.so debug//libgtestd.so /usr/lib/aarch64-linux-gnu/libsnappy.so.1.1.8 /usr/lib/aarch64-linux-gnu/liblz4.so /usr/lib/aarch64-linux-gnu/libz.so -lpthread -ldl orc_ep-install/lib/liborc.a /usr/lib/aarch64-linux-gnu/libssl.so /usr/lib/aarch64-linux-gnu/libcrypto.so /usr/lib/aarch64-linux-gnu/libbrotlienc.so /usr/lib/aarch64-linux-gnu/libbrotlidec.so /usr/lib/aarch64-linux-gnu/libbrotlicommon.so /usr/lib/aarch64-linux-gnu/libbz2.so /usr/lib/aarch64-linux-gnu/libzstd.so /usr/lib/aarch64-linux-gnu/libprotobuf.so /usr/lib/aarch64-linux-gnu/libglog.so jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -pthread -lrt && : /usr/bin/ld: orc_ep-install/lib/liborc.a(Compression.cc.o): undefined reference to symbol 'inflateEnd' /usr/bin/ld: /usr/lib/aarch64-linux-gnu/libz.so: error adding symbols: DSO missing from command line collect2: error: ld returned 1 exit status {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8929) [C++] Change compute::Arity::VarArgs min_args default to 0
Wes McKinney created ARROW-8929: --- Summary: [C++] Change compute::Arity::VarArgs min_args default to 0 Key: ARROW-8929 URL: https://issues.apache.org/jira/browse/ARROW-8929 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 The issue of minimum number of arguments is separate from providing an {{InputType}} for input type checking. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8928) [C++] Measure microperformance associated with data structure access interactions with arrow::compute::ExecBatch
Wes McKinney created ARROW-8928: --- Summary: [C++] Measure microperformance associated with data structure access interactions with arrow::compute::ExecBatch Key: ARROW-8928 URL: https://issues.apache.org/jira/browse/ARROW-8928 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 {{arrow::compute::ExecBatch}} uses a vector of {{arrow::Datum}} to contain a collection of ArrayData and Scalar objects for kernel execution. It would be helpful to know how many nanoseconds of overhead is associated with basic interactions with this data structure to know the cost of using our vendored variant, and other such issues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8926) [C++] Improve docstrings in new public APIs in arrow/compute and fix miscellaneous typos
Wes McKinney created ARROW-8926: --- Summary: [C++] Improve docstrings in new public APIs in arrow/compute and fix miscellaneous typos Key: ARROW-8926 URL: https://issues.apache.org/jira/browse/ARROW-8926 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 I've noticed some imprecise language while reading the headers and some other opportunities for improvement -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8923) [C++] Improve usability of arrow::compute::CallFunction by moving ExecContext* argument to end and adding default
Wes McKinney created ARROW-8923: --- Summary: [C++] Improve usability of arrow::compute::CallFunction by moving ExecContext* argument to end and adding default Key: ARROW-8923 URL: https://issues.apache.org/jira/browse/ARROW-8923 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8922) [C++] Implement example string scalar kernel function to assist with string kernels buildout per ARROW-555
Wes McKinney created ARROW-8922: --- Summary: [C++] Implement example string scalar kernel function to assist with string kernels buildout per ARROW-555 Key: ARROW-8922 URL: https://issues.apache.org/jira/browse/ARROW-8922 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 I will write a patch to provide an example of creating a string-input string-output kernel for executing scalar-valued string functions -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8921) [C++] Add "TypeResolver" class interface to replace current OutputType::Resolver pattern
Wes McKinney created ARROW-8921: --- Summary: [C++] Add "TypeResolver" class interface to replace current OutputType::Resolver pattern Key: ARROW-8921 URL: https://issues.apache.org/jira/browse/ARROW-8921 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Like the {{TypeMatcher}} for extensible input type checking, TypeResolver will allow more flexibility with respect to the output type resolution rule. Currently the resolver function is defined as {code} using Resolver = std::function<Result<ValueDescr>(KernelContext*, const std::vector<ValueDescr>&)>; {code} By changing to a {{TypeResolver}} interface with a virtual Resolve function, we can also provide better human-readability when printing kernel signatures (by having {{TypeResolver::ToString}}) and permit TypeResolvers to be compared -- This message was sent by Atlassian Jira (v8.3.4#803005)
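A rough sketch of what the interface proposed in ARROW-8921 might look like. The {{ValueDescr}} struct below is a simplified stand-in holding only a type name, the {{KernelContext*}} argument is omitted, and {{Equals}} and {{FirstArgResolver}} are hypothetical names chosen for illustration; none of this is the actual Arrow API.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for a value descriptor: just a type name.
struct ValueDescr {
  std::string type_name;
};

// Sketch of a resolver interface with a virtual Resolve function.
class TypeResolver {
 public:
  virtual ~TypeResolver() = default;
  // Compute the output descriptor from the input descriptors.
  virtual ValueDescr Resolve(const std::vector<ValueDescr>& args) const = 0;
  // Human-readable form, usable when printing kernel signatures.
  virtual std::string ToString() const = 0;
  // Lets two resolvers be compared, which a std::function cannot offer.
  virtual bool Equals(const TypeResolver& other) const = 0;
};

// Example rule: the output type is the type of the first argument.
class FirstArgResolver : public TypeResolver {
 public:
  ValueDescr Resolve(const std::vector<ValueDescr>& args) const override {
    return args.empty() ? ValueDescr{"null"} : args.front();
  }
  std::string ToString() const override { return "first-arg-type"; }
  bool Equals(const TypeResolver& other) const override {
    // Stateless rule: any other FirstArgResolver is equal.
    return dynamic_cast<const FirstArgResolver*>(&other) != nullptr;
  }
};
```

The point of the class-based form is that a printed signature could read something like "(int32, utf8) -> first-arg-type", which an opaque function object cannot provide.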
[jira] [Created] (ARROW-8920) [CI] ARM Travis CI build is failing with archery "case_sensitive" error
Wes McKinney created ARROW-8920: --- Summary: [CI] ARM Travis CI build is failing with archery "case_sensitive" error Key: ARROW-8920 URL: https://issues.apache.org/jira/browse/ARROW-8920 Project: Apache Arrow Issue Type: Bug Components: CI Reporter: Wes McKinney Fix For: 1.0.0 See https://travis-ci.org/github/apache/arrow/jobs/690602409 {code} Traceback (most recent call last): File "/home/travis/.local/bin/archery", line 11, in <module> load_entry_point('archery', 'console_scripts', 'archery')() File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 490, in load_entry_point return get_distribution(dist).load_entry_point(group, name) File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 2853, in load_entry_point return ep.load() File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 2453, in load return self.resolve() File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 2459, in resolve module = __import__(self.module_name, fromlist=['__name__'], level=0) File "/home/travis/build/apache/arrow/dev/archery/archery/cli.py", line 100, in <module> case_sensitive=False) TypeError: __init__() got an unexpected keyword argument 'case_sensitive' {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8919) [C++] Add "DispatchBest" APIs to compute::Function that selects a kernel that may require implicit casts to invoke
Wes McKinney created ARROW-8919: --- Summary: [C++] Add "DispatchBest" APIs to compute::Function that selects a kernel that may require implicit casts to invoke Key: ARROW-8919 URL: https://issues.apache.org/jira/browse/ARROW-8919 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Currently we have "DispatchExact" which requires an exact match of input types. "DispatchBest" would permit kernel selection with implicit casts required. Since multiple kernels may be valid when allowing implicit casts, we will need to break ties by estimating the "cost" of the implicit casts. For example, casting int8 to int32 is "less expensive" than implicitly casting to int64 -- This message was sent by Atlassian Jira (v8.3.4#803005)
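The tie-breaking idea in ARROW-8919 can be sketched as follows. This is only an illustration under simplifying assumptions: every type is a signed integer identified by its bit width, the cost of an implicit cast is the difference in widths, and narrowing casts are rejected. {{CastCost}}, {{KernelSig}} and this free-standing {{DispatchBest}} are hypothetical and do not reflect how Arrow implements or will implement dispatch.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical cost of implicitly casting an integer of `from_bits` to `to_bits`.
// Returns -1 when the cast would narrow, which this sketch does not allow.
int CastCost(int from_bits, int to_bits) {
  if (to_bits < from_bits) return -1;
  return to_bits - from_bits;  // exact match costs 0; wider targets cost more
}

// A kernel is described only by the bit widths of its integer parameters.
struct KernelSig {
  std::vector<int> param_bits;
};

// Return the index of the applicable kernel with the lowest total cast cost,
// or -1 if no kernel can accept the arguments.
int DispatchBest(const std::vector<KernelSig>& kernels,
                 const std::vector<int>& arg_bits) {
  int best = -1;
  int best_cost = std::numeric_limits<int>::max();
  for (std::size_t k = 0; k < kernels.size(); ++k) {
    const std::vector<int>& params = kernels[k].param_bits;
    if (params.size() != arg_bits.size()) continue;
    int total = 0;
    bool applicable = true;
    for (std::size_t i = 0; i < params.size(); ++i) {
      const int cost = CastCost(arg_bits[i], params[i]);
      if (cost < 0) {
        applicable = false;
        break;
      }
      total += cost;
    }
    if (applicable && total < best_cost) {
      best_cost = total;
      best = static_cast<int>(k);
    }
  }
  return best;
}
```

With an int64 kernel at index 0 and an int32 kernel at index 1, an int8 argument selects index 1, matching the example in the issue where casting int8 to int32 is "less expensive" than casting to int64.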
[jira] [Created] (ARROW-8918) [C++] Add cast "metafunction" to FunctionRegistry that addresses dispatching to appropriate type-specific CastFunction
Wes McKinney created ARROW-8918: --- Summary: [C++] Add cast "metafunction" to FunctionRegistry that addresses dispatching to appropriate type-specific CastFunction Key: ARROW-8918 URL: https://issues.apache.org/jira/browse/ARROW-8918 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 By setting the output type in {{CastOptions}}, we can write {code} call_function("cast", [arg], cast_options) {code} This simplifies use of casting for binding developers -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8917) [C++] Add compute::Function subclass for invoking certain kernels on RecordBatch/Table-valued inputs
Wes McKinney created ARROW-8917: --- Summary: [C++] Add compute::Function subclass for invoking certain kernels on RecordBatch/Table-valued inputs Key: ARROW-8917 URL: https://issues.apache.org/jira/browse/ARROW-8917 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 This will enable bindings to invoke such functions (like take, filter) like {code} call_function('take', [table, indices]) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8916) [Python] Add relevant glue for implementing each kind of FunctionOptions
Wes McKinney created ARROW-8916: --- Summary: [Python] Add relevant glue for implementing each kind of FunctionOptions Key: ARROW-8916 URL: https://issues.apache.org/jira/browse/ARROW-8916 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8905) [C++] Collapse Take APIs from 8 to 1 or 2
Wes McKinney created ARROW-8905: --- Summary: [C++] Collapse Take APIs from 8 to 1 or 2 Key: ARROW-8905 URL: https://issues.apache.org/jira/browse/ARROW-8905 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 There are currently 8 {{Take}} functions with different function signatures. Fewer functions would make life easier for binding developers -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8904) [Python] Fix usages of deprecated C++ APIs related to child/field
Wes McKinney created ARROW-8904: --- Summary: [Python] Fix usages of deprecated C++ APIs related to child/field Key: ARROW-8904 URL: https://issues.apache.org/jira/browse/ARROW-8904 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 1.0.0 {code} -- Running cmake --build for pyarrow cmake --build . --config debug -- -j16 [19/20] Building CXX object CMakeFiles/lib.dir/lib.cpp.o lib.cpp:20265:85: warning: 'num_children' is deprecated: Use num_fields() [-Wdeprecated-declarations] __pyx_t_1 = __pyx_f_7pyarrow_3lib__normalize_index(__pyx_v_i, __pyx_v_self->type->num_children()); if (unlikely(__pyx_t_1 == ((Py_ssize_t)-1L))) __PYX_ERR(1, 119, __pyx_L1_error) ^ /home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been explicitly marked deprecated here ARROW_DEPRECATED("Use num_fields()") ^ /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 'ARROW_DEPRECATED' # define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__))) ^ lib.cpp:20276:76: warning: 'child' is deprecated: Use field(i) [-Wdeprecated-declarations] __pyx_t_2 = __pyx_f_7pyarrow_3lib_pyarrow_wrap_field(__pyx_v_self->type->child(__pyx_v_index)); if (unlikely(!__pyx_t_2)) __PYX_ERR(1, 120, __pyx_L1_error) ^ /home/wesm/local/include/arrow/type.h:251:3: note: 'child' has been explicitly marked deprecated here ARROW_DEPRECATED("Use field(i)") ^ /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 'ARROW_DEPRECATED' # define ARROW_DEPRECATED(...) 
__attribute__((deprecated(__VA_ARGS__))) ^ lib.cpp:20507:56: warning: 'num_children' is deprecated: Use num_fields() [-Wdeprecated-declarations] __pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_self->type->num_children()); if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 139, __pyx_L1_error) ^ /home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been explicitly marked deprecated here ARROW_DEPRECATED("Use num_fields()") ^ /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 'ARROW_DEPRECATED' # define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__))) ^ lib.cpp:23361:44: warning: 'num_children' is deprecated: Use num_fields() [-Wdeprecated-declarations] __pyx_r = __pyx_v_self->__pyx_base.type->num_children(); ^ /home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been explicitly marked deprecated here ARROW_DEPRECATED("Use num_fields()") ^ /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 'ARROW_DEPRECATED' # define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__))) ^ lib.cpp:24039:44: warning: 'num_children' is deprecated: Use num_fields() [-Wdeprecated-declarations] __pyx_r = __pyx_v_self->__pyx_base.type->num_children(); ^ /home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been explicitly marked deprecated here ARROW_DEPRECATED("Use num_fields()") ^ /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 'ARROW_DEPRECATED' # define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__))) ^ lib.cpp:58220:37: warning: 'child' is deprecated: Use field(pos) [-Wdeprecated-declarations] __pyx_v_child = __pyx_v_self->ap->child(__pyx_v_child_id); ^ /home/wesm/local/include/arrow/array.h:1281:3: note: 'child' has been explicitly marked deprecated here ARROW_DEPRECATED("Use field(pos)") ^ /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 'ARROW_DEPRECATED' # define ARROW_DEPRECATED(...) 
__attribute__((deprecated(__VA_ARGS__))) ^ lib.cpp:58956:74: warning: 'children' is deprecated: Use fields() [-Wdeprecated-declarations] __pyx_v_child_fields = __pyx_v_self->__pyx_base.__pyx_base.type->type->children(); ^ /home/wesm/local/include/arrow/type.h:257:3: note: 'children' has been explicitly marked deprecated here ARROW_DEPRECATED("Use fields()") ^ /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 'ARROW_DEPRECATED' # define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
[jira] [Created] (ARROW-8903) [C++] Implement optimized "unsafe take" for use with selection vectors for kernel execution
Wes McKinney created ARROW-8903: --- Summary: [C++] Implement optimized "unsafe take" for use with selection vectors for kernel execution Key: ARROW-8903 URL: https://issues.apache.org/jira/browse/ARROW-8903 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Selection vectors constructed from filters do not need to be subjected to bounds checking and the other safety checks that are present in a usual invocation of {{take}}. So based on the type width of a selection vector (uint16?) we should implement highly streamlined take implementations that additionally take into consideration that selection vectors are monotonic by construction -- This message was sent by Atlassian Jira (v8.3.4#803005)
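As a rough illustration of the idea in ARROW-8903, the sketch below builds a uint16 selection vector from a boolean filter and then gathers values with no per-index bounds check. It uses std::vector instead of Arrow buffers, ignores nulls, and assumes the filter has at most 65536 entries; {{MakeSelection}} and {{UnsafeTake}} are hypothetical names, not Arrow functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a selection vector holding the positions where the filter is true.
// The indices are strictly increasing by construction.
// Assumes filter.size() <= 65536 so that every position fits in uint16_t.
std::vector<uint16_t> MakeSelection(const std::vector<bool>& filter) {
  std::vector<uint16_t> sel;
  for (std::size_t i = 0; i < filter.size(); ++i) {
    if (filter[i]) sel.push_back(static_cast<uint16_t>(i));
  }
  return sel;
}

// Gather values at the selected positions without checking each index.
// The caller guarantees every index is < values.size(), which holds when the
// selection vector was built from a filter of the same length as `values`.
// T must not be bool, because std::vector<bool> has no data().
template <typename T>
std::vector<T> UnsafeTake(const std::vector<T>& values,
                          const std::vector<uint16_t>& sel) {
  std::vector<T> out(sel.size());
  const T* src = values.data();
  for (std::size_t i = 0; i < sel.size(); ++i) {
    out[i] = src[sel[i]];
  }
  return out;
}
```

Because the indices only ever increase, memory is read in forward order, which is the property a specialized implementation could exploit further.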
[jira] [Created] (ARROW-8901) [C++] Reduce number of take kernels
Wes McKinney created ARROW-8901: --- Summary: [C++] Reduce number of take kernels Key: ARROW-8901 URL: https://issues.apache.org/jira/browse/ARROW-8901 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney After ARROW-8792 we can observe that we are generating 312 take kernels {code} In [1]: import pyarrow.compute as pc In [2]: reg = pc.function_registry() In [3]: reg.get_function('take') Out[3]: arrow.compute.Function kind: vector num_kernels: 312 {code} You can see them all here: https://gist.github.com/wesm/c3085bf40fa2ee5e555204f8c65b4ad5 It's probably going to be sufficient to only support int16, int32, and int64 index types for almost all types and insert implicit casts (once we implement implicit-cast-insertion into the execution code) for other index types. If we determine that there is some performance hot path where we need to specialize for other index types, then we can always do that. Additionally, we should be able to collapse the date/time kernels since we're just moving memory. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8898) [C++] Determine desirable maximum length for ExecBatch in pipelined and parallel execution of kernels
Wes McKinney created ARROW-8898: --- Summary: [C++] Determine desirable maximum length for ExecBatch in pipelined and parallel execution of kernels Key: ARROW-8898 URL: https://issues.apache.org/jira/browse/ARROW-8898 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Maximum lengths like 16K or 64K seem to be popular, but we should write our own benchmarks so that we can justify the choice of default chunksize -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8897) [C++] Determine strategy for propagating failures in initializing built-in function registry in arrow/compute
Wes McKinney created ARROW-8897: --- Summary: [C++] Determine strategy for propagating failures in initializing built-in function registry in arrow/compute Key: ARROW-8897 URL: https://issues.apache.org/jira/browse/ARROW-8897 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney As discussed on https://github.com/apache/arrow/pull/7240, we are using {{DCHECK_OK}} to check statuses when initializing the built-in registry. We could propagate failures by changing {{arrow::compute::GetFunctionRegistry}} to return Result, but there may be other ways -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8896) [C++] Reimplement dictionary unpacking in Cast kernels using Take
Wes McKinney created ARROW-8896: --- Summary: [C++] Reimplement dictionary unpacking in Cast kernels using Take Key: ARROW-8896 URL: https://issues.apache.org/jira/browse/ARROW-8896 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 As suggested by [~apitrou] this should yield less code to maintain -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8895) [C++] Add C++ unit tests for filter function on temporal type inputs, including timestamps
Wes McKinney created ARROW-8895: --- Summary: [C++] Add C++ unit tests for filter function on temporal type inputs, including timestamps Key: ARROW-8895 URL: https://issues.apache.org/jira/browse/ARROW-8895 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 These are used in R but not tested in C++, so I only found out that I had missed adding the kernels to the Filter VectorFunction when running the R test suite -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8894) [C++] C++ array kernels framework and execution buildout (umbrella issue)
Wes McKinney created ARROW-8894: --- Summary: [C++] C++ array kernels framework and execution buildout (umbrella issue) Key: ARROW-8894 URL: https://issues.apache.org/jira/browse/ARROW-8894 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney In the wake of ARROW-8792, this issue is to serve as an umbrella issue for follow-up work and associated "buildout" which includes things like: * Implementation of many new function types and adding new kernel cases to existing functions * Adding implicit casting functionality to function execution * Creation of "bound" physical array expressions * Pipeline execution (executing multiple kernels while eliminating temporary allocation) * Parallel execution of scalar and aggregate kernels (including parallel execution of pipelined kernels) There are quite a few existing JIRAs in the project that I'll attach to this issue, and I'll open plenty more issues as things occur to me to help organize the work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8893) [R] Fix cpplint issues introduced by ARROW-8885
Wes McKinney created ARROW-8893: --- Summary: [R] Fix cpplint issues introduced by ARROW-8885 Key: ARROW-8893 URL: https://issues.apache.org/jira/browse/ARROW-8893 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Wes McKinney Fix For: 1.0.0 {code} (arrow-3.7) 12:34 ~/code/arrow/r $ ./lint.sh /home/wesm/code/arrow/r/src/arrow_types.h:20: Include the directory when naming .h files [build/include_subdir] [4] /home/wesm/code/arrow/r/src/arrow_types.h:66: Add #include <utility> for forward [build/include_what_you_use] [4] /home/wesm/code/arrow/r/src/arrow_types.h:83: Add #include <vector> for vector<> [build/include_what_you_use] [4] /home/wesm/code/arrow/r/src/arrow_types.h:95: Add #include <limits> for numeric_limits<> [build/include_what_you_use] [4] /home/wesm/code/arrow/r/src/arrow_types.h:110: Add #include <memory> for shared_ptr<> [build/include_what_you_use] [4] /home/wesm/code/arrow/r/src/arrow_exports.h:22: Include the directory when naming .h files [build/include_subdir] [4] {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8892) [C++][CI] CI builds for MSVC do not build benchmarks
Wes McKinney created ARROW-8892: --- Summary: [C++][CI] CI builds for MSVC do not build benchmarks Key: ARROW-8892 URL: https://issues.apache.org/jira/browse/ARROW-8892 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 We must ensure that our benchmarks always build on Windows. For example, I'm fixing these errors in ARROW-8792: {code} C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(249): error C2220: warning treated as error - no 'object' file generated C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(256): note: see reference to function template instantiation 'void parquet::BM_PlainEncodingSpaced(benchmark::State &)' being compiled C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(249): warning C4244: 'argument': conversion from 'const int64_t' to 'int', possible loss of data C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(292): warning C4244: 'argument': conversion from 'const int64_t' to 'int', possible loss of data C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(306): note: see reference to function template instantiation 'void parquet::BM_PlainDecodingSpaced(benchmark::State &)' being compiled C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(299): warning C4244: 'argument': conversion from 'int64_t' to 'int', possible loss of data C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(300): warning C4244: 'argument': conversion from 'const int64_t' to 'int', possible loss of data [11/67] Linking CXX executable release\arrow-ipc-read-write-benchmark.exe {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8891) [C++] Split non-cast compute kernels into a separate shared library
Wes McKinney created ARROW-8891: --- Summary: [C++] Split non-cast compute kernels into a separate shared library Key: ARROW-8891 URL: https://issues.apache.org/jira/browse/ARROW-8891 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Since we are going to implement a lot more precompiled kernels, I am not sure it makes sense to require all of them to be compiled unconditionally just to get access to {{compute::Cast}}, which is needed in many different contexts. After ARROW-8792 is merged, I would suggest creating a plugin hook for adding a bundle of kernels from a shared library outside of libarrow.so, and then moving all the object code outside of Cast to something like libarrow_compute.so. Then we can change the CMake flags to compile Cast kernels always (?) and then opt in to building the additional kernels package separately -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8876) [C++] Implement casts from date types to Timestamp
Wes McKinney created ARROW-8876: --- Summary: [C++] Implement casts from date types to Timestamp Key: ARROW-8876 URL: https://issues.apache.org/jira/browse/ARROW-8876 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Discovered the absence of this while refactoring cast.cc. Since we can cast Timestamp -> date, we should be able to cast the other way -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8866) [C++] Split Type::UNION into Type::SPARSE_UNION and Type::DENSE_UNION
Wes McKinney created ARROW-8866: --- Summary: [C++] Split Type::UNION into Type::SPARSE_UNION and Type::DENSE_UNION Key: ARROW-8866 URL: https://issues.apache.org/jira/browse/ARROW-8866 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Similar to the recent {{Type::INTERVAL}} split, having these two array types, which have different memory layouts, under the same {{Type::type}} value makes function dispatch somewhat more complicated. This issue is less critical than INTERVAL, so it may not be urgent, but it seems like a good pre-1.0 change -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8863) [C++] Array constructors must set ArrayData::null_count to 0 when there is no validity bitmap
Wes McKinney created ARROW-8863: --- Summary: [C++] Array constructors must set ArrayData::null_count to 0 when there is no validity bitmap Key: ARROW-8863 URL: https://issues.apache.org/jira/browse/ARROW-8863 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Many type-specific array constructors incorrectly set the null count to unknown. It would be better to set it to 0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
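The rule requested in ARROW-8863 can be sketched as below. {{ArrayDataSketch}} and {{MakeArrayData}} are hypothetical simplifications (the validity bitmap is a shared byte vector rather than an Arrow buffer), and the sentinel value of -1 for an unknown null count is an assumption of this sketch.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Sentinel meaning "null count not computed yet" (assumed value for this sketch).
constexpr int64_t kUnknownNullCount = -1;

// Simplified stand-in for array data: a length, an optional validity bitmap,
// and a cached null count.
struct ArrayDataSketch {
  int64_t length;
  std::shared_ptr<std::vector<uint8_t>> validity;  // null pointer means "no bitmap"
  int64_t null_count;
};

// Without a validity bitmap every value is valid, so the null count is known
// to be 0. With a bitmap, the count is left unknown and computed lazily.
ArrayDataSketch MakeArrayData(int64_t length,
                              std::shared_ptr<std::vector<uint8_t>> validity) {
  const int64_t null_count = validity ? kUnknownNullCount : 0;
  return ArrayDataSketch{length, std::move(validity), null_count};
}
```

Setting the count to 0 up front means later code can skip a bitmap scan, or skip null handling entirely, without first having to resolve an "unknown" state.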
[jira] [Created] (ARROW-8823) [C++] Compute aggregate compression ratio when producing compressed IPC body messages
Wes McKinney created ARROW-8823: --- Summary: [C++] Compute aggregate compression ratio when producing compressed IPC body messages Key: ARROW-8823 URL: https://issues.apache.org/jira/browse/ARROW-8823 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney It would be beneficial to know the exact bytes-on-wire savings once the message has been produced. Since this computation would be relatively trivial it would not add overhead to the IPC write hot path. -- This message was sent by Atlassian Jira (v8.3.4#803005)
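A minimal sketch of the bookkeeping ARROW-8823 describes, assuming the writer already knows the uncompressed and compressed size of each body buffer at the moment it is written. {{CompressionStats}} is a hypothetical helper, not part of the Arrow IPC API.

```cpp
#include <cstdint>

// Accumulates sizes across all buffers of one IPC message body.
struct CompressionStats {
  int64_t raw_bytes = 0;
  int64_t compressed_bytes = 0;

  // Record one buffer: its size before and after compression.
  void Add(int64_t raw, int64_t compressed) {
    raw_bytes += raw;
    compressed_bytes += compressed;
  }

  // Aggregate ratio (uncompressed / compressed); 0.0 if nothing was written.
  double Ratio() const {
    return compressed_bytes == 0
               ? 0.0
               : static_cast<double>(raw_bytes) / static_cast<double>(compressed_bytes);
  }

  // Bytes-on-wire savings for the whole message.
  int64_t BytesSaved() const { return raw_bytes - compressed_bytes; }
};
```

The cost is two integer additions per buffer, which is consistent with the issue's expectation that the computation would not add overhead to the write path.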
[jira] [Created] (ARROW-8800) [C++] Split arrow::ChunkedArray into arrow/chunked_array.h
Wes McKinney created ARROW-8800: --- Summary: [C++] Split arrow::ChunkedArray into arrow/chunked_array.h Key: ARROW-8800 URL: https://issues.apache.org/jira/browse/ARROW-8800 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 There are plenty of scenarios where ChunkedArray is used separately from Table, so it would probably make sense to split up the headers, implementation, and unit tests -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8793) [C++] BitUtil::SetBitsTo probably doesn't need to be inline
Wes McKinney created ARROW-8793: --- Summary: [C++] BitUtil::SetBitsTo probably doesn't need to be inline Key: ARROW-8793 URL: https://issues.apache.org/jira/browse/ARROW-8793 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Inlining this function probably does not yield meaningful performance benefits -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8792) [C++] Improved declarative compute function / kernel development framework, normalize calling conventions
Wes McKinney created ARROW-8792: --- Summary: [C++] Improved declarative compute function / kernel development framework, normalize calling conventions Key: ARROW-8792 URL: https://issues.apache.org/jira/browse/ARROW-8792 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 I'm working on a significant revamp of the way that kernels are implemented in the project as discussed on the mailing list. PR to follow within the next week or sooner. A brief list of features: * Kernel selection that takes into account the shape of inputs (whether Scalar or Array, so you can provide an implementation just for Arrays and a separate one just for Scalars if you want) * More customizable / less monolithic type-to-kernel dispatch * Browsable function registry (see all available kernels and their input type signatures) * Central code path for type-checking and argument validation * Central code path for kernel execution on ChunkedArray inputs There are a lot of JIRAs in the backlog that will follow from this work, so I will attach those to this issue for visibility, but this issue will cover the initial refactoring work to port the existing code to the new framework without altering existing features. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8769) [C++] Add convenience methods to access fields by name in StructScalar
Wes McKinney created ARROW-8769: --- Summary: [C++] Add convenience methods to access fields by name in StructScalar Key: ARROW-8769 URL: https://issues.apache.org/jira/browse/ARROW-8769 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 This would improve usability of this type -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8762) [C++][Gandiva] Replace Gandiva's BitmapAnd with common implementation
Wes McKinney created ARROW-8762: --- Summary: [C++][Gandiva] Replace Gandiva's BitmapAnd with common implementation Key: ARROW-8762 URL: https://issues.apache.org/jira/browse/ARROW-8762 Project: Apache Arrow Issue Type: Improvement Components: C++, C++ - Gandiva Reporter: Wes McKinney Fix For: 1.0.0 Now that the arrow/util/bit_util.h implementation has been optimized, we should just use that one -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8750) [Python] pyarrow.feather.write_feather does not default to lz4 compression if it's available
Wes McKinney created ARROW-8750: --- Summary: [Python] pyarrow.feather.write_feather does not default to lz4 compression if it's available Key: ARROW-8750 URL: https://issues.apache.org/jira/browse/ARROW-8750 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 1.0.0, 0.17.1 This was my intention but I seem to have implemented it incorrectly -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8746) [Python][Documentation] Add column limit recommendations Parquet page
Wes McKinney created ARROW-8746: --- Summary: [Python][Documentation] Add column limit recommendations Parquet page Key: ARROW-8746 URL: https://issues.apache.org/jira/browse/ARROW-8746 Project: Apache Arrow Issue Type: Improvement Components: Documentation, Python Reporter: Wes McKinney Users would be well advised not to write files with large numbers (> 1000) of columns -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8727) [C++] Do not require struct-initialization of StringConverter to parse strings to other types
Wes McKinney created ARROW-8727: --- Summary: [C++] Do not require struct-initialization of StringConverter to parse strings to other types Key: ARROW-8727 URL: https://issues.apache.org/jira/browse/ARROW-8727 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 I ran into this issue while working on refactoring kernels. {{StringConverter}} must be initialized to be able to support parametric types like Timestamp, but this produces an awkwardness and possibly a performance penalty (I haven't measured yet) in inlined functions. In any case, I'm refactoring everything to be static non-stateful -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8711) [Python] Expose strptime timestamp parsing in read_csv conversion options
Wes McKinney created ARROW-8711: --- Summary: [Python] Expose strptime timestamp parsing in read_csv conversion options Key: ARROW-8711 URL: https://issues.apache.org/jira/browse/ARROW-8711 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Wes McKinney Fix For: 1.0.0 Follow up to ARROW-8111 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8712) [R] Expose strptime timestamp parsing in read_csv conversion options
Wes McKinney created ARROW-8712: --- Summary: [R] Expose strptime timestamp parsing in read_csv conversion options Key: ARROW-8712 URL: https://issues.apache.org/jira/browse/ARROW-8712 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Wes McKinney Fix For: 1.0.0 Follow up to ARROW-8111 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8706) [C++][Parquet] Tracking JIRA for PARQUET-1857 (unencrypted INT16_MAX Parquet row group limit)
Wes McKinney created ARROW-8706: --- Summary: [C++][Parquet] Tracking JIRA for PARQUET-1857 (unencrypted INT16_MAX Parquet row group limit) Key: ARROW-8706 URL: https://issues.apache.org/jira/browse/ARROW-8706 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 1.0.0, 0.17.1 JIRA to make sure this patch gets included in a patch release -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8700) [C++] static libgflags.a fails to link properly in gcc 4.x
Wes McKinney created ARROW-8700: --- Summary: [C++] static libgflags.a fails to link properly in gcc 4.x Key: ARROW-8700 URL: https://issues.apache.org/jira/browse/ARROW-8700 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney

I am seeing this with gcc 4.8 on Ubuntu 18.04

{code}
$ ninja
[55/179] Linking CXX executable release/arrow-json-integration-test
FAILED: release/arrow-json-integration-test
: && /usr/bin/ccache /usr/bin/g++-4.8 -O3 -DNDEBUG -Wall -Wno-attributes -msse4.2 -O3 -DNDEBUG -rdynamic src/arrow/ipc/CMakeFiles/arrow-json-integration-test.dir/json_integration_test.cc.o -o release/arrow-json-integration-test -Wl,-rpath,/home/wesm/code/arrow/cpp/build-4.8/release release/libarrow_testing.so.18.0.0 release/libarrow.so.18.0.0 -ldl release//libgtest_main.so release//libgtest.so release//libgmock.so boost_ep-prefix/src/boost_ep/stage/lib/libboost_filesystem.a boost_ep-prefix/src/boost_ep/stage/lib/libboost_system.a -ldl ../bundled/gflags_ep-prefix/src/gflags_ep/lib/libgflags.a jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -pthread -lrt -lpthread && :
src/arrow/ipc/CMakeFiles/arrow-json-integration-test.dir/json_integration_test.cc.o: In function `_GLOBAL__sub_I__ZN3fLS11FLAGS_arrowE':
json_integration_test.cc:(.text.startup+0x1cc): undefined reference to `google::FlagRegisterer::FlagRegisterer(char const*, char const*, char const*, std::string*, std::string*)'
json_integration_test.cc:(.text.startup+0x275): undefined reference to `google::FlagRegisterer::FlagRegisterer(char const*, char const*, char const*, std::string*, std::string*)'
json_integration_test.cc:(.text.startup+0x317): undefined reference to `google::FlagRegisterer::FlagRegisterer(char const*, char const*, char const*, std::string*, std::string*)'
collect2: error: ld returned 1 exit status
[88/179] Building CXX object src/arrow/ipc/CMakeFiles/arrow-ipc-read-write-test.dir/read_write_test.cc.o
ninja: build stopped: subcommand failed.
{code}

CMake invocation

{code}
$ cmake .. -GNinja -DARROW_GANDIVA=ON -DARROW_CSV=ON -DARROW_BUILD_TESTS=ON -DARROW_BUILD_BENCHMARKS=ON
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8684) [Packaging][Python] "SystemError: Bad call flags in _PyMethodDef_RawFastCallDict" in Python 3.7.7 on macOS when using pyarrow wheel
Wes McKinney created ARROW-8684: --- Summary: [Packaging][Python] "SystemError: Bad call flags in _PyMethodDef_RawFastCallDict" in Python 3.7.7 on macOS when using pyarrow wheel Key: ARROW-8684 URL: https://issues.apache.org/jira/browse/ARROW-8684 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 1.0.0

[~npr] reported this on the 0.17.0 RC0 vote thread but I have confirmed it independently. It was also reported at https://github.com/apache/arrow/issues/7082

Here are steps to reproduce on macOS:

{code}
conda create -yn py-3.7-defaults python=3.7 -c defaults
conda activate py-3.7-defaults
pip install pyarrow
{code}

Now open the Python interpreter, run {{import pyarrow}}, then exit the interpreter ({{python -c "import pyarrow"}} didn't trigger it for me):

{code}
$ python
Python 3.7.7 (default, Mar 26 2020, 10:32:53)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>>
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "pyarrow/types.pxi", line 2638, in pyarrow.lib._unregister_py_extension_types
SystemError: Bad call flags in _PyMethodDef_RawFastCallDict. METH_OLDARGS is no longer supported!
Segmentation fault: 11
{code}

It fails with Python 3.7.6 when using {{-c conda-forge}} also, so it is not particular to defaults. Frustratingly, the problem doesn't exist in Python 3.7.4 but occurs for me with 3.7.5, 3.7.6, and 3.7.7.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8683) [C++] Add option for user-defined version identifier for Arrow libraries
Wes McKinney created ARROW-8683: --- Summary: [C++] Add option for user-defined version identifier for Arrow libraries Key: ARROW-8683 URL: https://issues.apache.org/jira/browse/ARROW-8683 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 It would be useful to be able to "watermark" shared libraries with e.g. the git hash to determine the exact origin of a particular build of the project. The version identifier could default to the current git revision but be overridden in the CMake invocation -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8676) [Rust] Create implementation of IPC RecordBatch body buffer compression from ARROW-300
Wes McKinney created ARROW-8676: --- Summary: [Rust] Create implementation of IPC RecordBatch body buffer compression from ARROW-300 Key: ARROW-8676 URL: https://issues.apache.org/jira/browse/ARROW-8676 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Wes McKinney -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8674) [JS] Implement IPC RecordBatch body buffer compression from ARROW-300
Wes McKinney created ARROW-8674: --- Summary: [JS] Implement IPC RecordBatch body buffer compression from ARROW-300 Key: ARROW-8674 URL: https://issues.apache.org/jira/browse/ARROW-8674 Project: Apache Arrow Issue Type: Sub-task Components: JavaScript Reporter: Wes McKinney This may not be a hard requirement for JS because this would require pulling in implementations of LZ4 and ZSTD which not all users may want -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8675) [C#] Create implementation of ARROW-300 / IPC record batch body buffer compression
Wes McKinney created ARROW-8675: --- Summary: [C#] Create implementation of ARROW-300 / IPC record batch body buffer compression Key: ARROW-8675 URL: https://issues.apache.org/jira/browse/ARROW-8675 Project: Apache Arrow Issue Type: Sub-task Components: C# Reporter: Wes McKinney -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8673) [Go] Implement IPC RecordBatch body compression from ARROW-300
Wes McKinney created ARROW-8673: --- Summary: [Go] Implement IPC RecordBatch body compression from ARROW-300 Key: ARROW-8673 URL: https://issues.apache.org/jira/browse/ARROW-8673 Project: Apache Arrow Issue Type: Sub-task Components: Go Reporter: Wes McKinney -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8671) [C++] Use IPC body compression metadata approved in ARROW-300
Wes McKinney created ARROW-8671: --- Summary: [C++] Use IPC body compression metadata approved in ARROW-300 Key: ARROW-8671 URL: https://issues.apache.org/jira/browse/ARROW-8671 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 This will adapt the existing code to use the new metadata, while maintaining backward compatibility code to recognize the "experimental" metadata written in 0.17.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8672) [Java] Implement RecordBatch IPC buffer compression from ARROW-300
Wes McKinney created ARROW-8672: --- Summary: [Java] Implement RecordBatch IPC buffer compression from ARROW-300 Key: ARROW-8672 URL: https://issues.apache.org/jira/browse/ARROW-8672 Project: Apache Arrow Issue Type: Sub-task Components: Java Reporter: Wes McKinney Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8670) [Format] Create reference implementations of IPC RecordBatch body compression from ARROW-300
Wes McKinney created ARROW-8670: --- Summary: [Format] Create reference implementations of IPC RecordBatch body compression from ARROW-300 Key: ARROW-8670 URL: https://issues.apache.org/jira/browse/ARROW-8670 Project: Apache Arrow Issue Type: New Feature Components: Format Reporter: Wes McKinney Fix For: 1.0.0 Tracking JIRA for implementing ARROW-300 in different PLs -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8667) [C++] Add multi-consumer Scheduler API to sit one layer above ThreadPool
Wes McKinney created ARROW-8667: --- Summary: [C++] Add multi-consumer Scheduler API to sit one layer above ThreadPool Key: ARROW-8667 URL: https://issues.apache.org/jira/browse/ARROW-8667 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Fix For: 1.0.0

I believe we should define an abstraction to allow for custom resource allocation strategies (round robin, even time, etc.) to be devised for situations where there are different thread pool consumers that are working independently of each other. Consider the classic nested parallelism scenario:

* Task A in thread 1 may issue N subtasks that run in parallel
* Task B in thread 2 may issue K subtasks

With our current ThreadPool abstraction, it is easy to conceive of scenarios where Task A and Task B trample each other. One approach to remedy this problem is to have an API like so:

{code}
// Inform the scheduler that you want to submit tasks that are "your tasks"
int consumer_id = scheduler->NewConsumer();
for (...) {
  Future fut = scheduler->Submit(consumer_id, DoWork, ...);
}
scheduler->FinishConsumer(consumer_id);
{code}

The idea is that the scheduler would maintain separate task queues for each consumer and e.g. track consumer-specific metrics of interest to determine how tasks are allocated. The scheduler could have different logic to control tasks being assigned to worker threads:

* Round-robin
* Even-time allocation (run fewer tasks for consumers with "slow" tasks and more tasks from consumers with "fast" tasks, though there are some nuances here like avoiding starving a consumer if they've been doing a lot of "slow" tasks and then a "fast" consumer shows up)

-- This message was sent by Atlassian Jira (v8.3.4#803005)
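One possible reading of the "even-time allocation" idea is sketched below. It is only an illustration under stated assumptions, not Arrow code: the class `EvenTimeAllocator` and all of its methods are hypothetical names, and the policy (always serve the consumer that has used the least worker time so far) is just one way to interpret the description. It does nothing about the starvation nuance mentioned in the issue.

```cpp
#include <limits>
#include <map>

// Hypothetical bookkeeping for an "even-time" policy; not part of Arrow.
// It tracks, per consumer, how many tasks are waiting and how much worker
// time that consumer's finished tasks have used.
class EvenTimeAllocator {
 public:
  // Register a consumer with no pending tasks and no accumulated time.
  void AddConsumer(int consumer_id) { stats_[consumer_id]; }

  // Record that the consumer queued one more task.
  void TaskQueued(int consumer_id) { ++stats_[consumer_id].pending; }

  // Record that one of the consumer's tasks ran for `seconds`.
  void TaskFinished(int consumer_id, double seconds) {
    stats_[consumer_id].busy_seconds += seconds;
  }

  // Choose the consumer with pending work that has used the least worker
  // time so far, and consume one of its pending tasks. Ties go to the
  // lowest consumer id. Returns -1 when nothing is pending.
  int PickNext() {
    int best = -1;
    double best_time = std::numeric_limits<double>::infinity();
    for (const auto& kv : stats_) {
      if (kv.second.pending > 0 && kv.second.busy_seconds < best_time) {
        best = kv.first;
        best_time = kv.second.busy_seconds;
      }
    }
    if (best != -1) --stats_[best].pending;
    return best;
  }

 private:
  struct Stats {
    int pending = 0;
    double busy_seconds = 0.0;
  };
  std::map<int, Stats> stats_;
};
```

Under this policy a consumer with "slow" tasks accumulates time quickly and is picked less often, while a consumer with "fast" tasks gets more picks until the totals even out.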
[jira] [Created] (ARROW-8661) [C++][Gandiva] Reduce number of files and headers
Wes McKinney created ARROW-8661: --- Summary: [C++][Gandiva] Reduce number of files and headers Key: ARROW-8661 URL: https://issues.apache.org/jira/browse/ARROW-8661 Project: Apache Arrow Issue Type: Improvement Components: C++, C++ - Gandiva Reporter: Wes McKinney Fix For: 1.0.0 I feel that the Gandiva subpackage is more Java-like in its code organization than the rest of the Arrow codebase, and it might be easier to navigate and develop with closely related code condensed into some larger headers and compilation units. Additionally, it's not necessary to have a header file for each component of the function registry -- the registration functions can be declared in function_registry.h or function_registry_internal.h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8660) [C++][Gandiva] Reduce dependence on Boost
Wes McKinney created ARROW-8660: --- Summary: [C++][Gandiva] Reduce dependence on Boost Key: ARROW-8660 URL: https://issues.apache.org/jira/browse/ARROW-8660 Project: Apache Arrow Issue Type: Improvement Components: C++, C++ - Gandiva Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 Remove Boost usages aside from Boost.Multiprecision -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8635) [R] test-filesystem.R takes ~40 seconds to run?
Wes McKinney created ARROW-8635: --- Summary: [R] test-filesystem.R takes ~40 seconds to run? Key: ARROW-8635 URL: https://issues.apache.org/jira/browse/ARROW-8635 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Wes McKinney Fix For: 1.0.0

{code}
✔ | 22 | Expressions
✔ | 107 | Feather [0.2 s]
✔ | 7 | Field
✔ | 40 | File system [38.1 s]
✔ | 6 | install_arrow()
✔ | 26 | JsonTableReader [0.1 s]
✔ | 24 | MessageReader
✔ | 12 | Message
✔ | 31 | Parquet file reading/writing [0.2 s]
⠏ | 0 | To/from Pythonvirtualenv: arrow-test
{code}

Is this expected? I assume it's related to S3 but that seems like a long time.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8633) [C++] Add ValidateAscii function
Wes McKinney created ARROW-8633: --- Summary: [C++] Add ValidateAscii function Key: ARROW-8633 URL: https://issues.apache.org/jira/browse/ARROW-8633 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 In some cases, we want to be able to check whether it's safe to use functions that assume ASCII (like {{std::tolower}} or {{std::string::substr}}). This was implemented in a PR for ARROW-6131 that was not merged -- This message was sent by Atlassian Jira (v8.3.4#803005)
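The kind of check described above can be sketched as follows. This is a minimal illustration, not the implementation from the unmerged ARROW-6131 PR: the name `IsAsciiSketch` is hypothetical, and the approach (OR all bytes together and test the high bit) is just one straightforward way to detect non-ASCII data.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not the Arrow API: returns true when every byte in
// [data, data + length) has its high bit clear, i.e. is 7-bit ASCII.
inline bool IsAsciiSketch(const uint8_t* data, std::size_t length) {
  uint8_t accum = 0;
  for (std::size_t i = 0; i < length; ++i) {
    accum |= data[i];  // any non-ASCII byte leaves bit 7 set in accum
  }
  return (accum & 0x80) == 0;
}

// Convenience overload for std::string (also a hypothetical name).
inline bool IsAsciiSketch(const std::string& s) {
  return IsAsciiSketch(reinterpret_cast<const uint8_t*>(s.data()), s.size());
}
```

Because the loop has no early exit, it works on a whole data buffer at once and leaves the compiler free to vectorize it; a caller would run it once per buffer before choosing an ASCII-only code path.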
[jira] [Created] (ARROW-8626) [C++] Implement "round robin" scheduler interface to fixed-size ThreadPool
Wes McKinney created ARROW-8626: --- Summary: [C++] Implement "round robin" scheduler interface to fixed-size ThreadPool Key: ARROW-8626 URL: https://issues.apache.org/jira/browse/ARROW-8626 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Currently, when submitting tasks to a thread pool, they are all commingled in a common queue. When a new task submitter shows up, they must wait in the back of the line behind all other queued tasks. A simple alternative to this would be round-robin scheduling, where each new consumer is assigned a unique integer id, and the scheduler / thread pool internally maintains the tasks associated with each consumer in separate queues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
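The per-consumer queue idea can be sketched as below. This is a single-threaded illustration under stated assumptions, not Arrow's ThreadPool: the class `RoundRobinQueues` and its methods are hypothetical names, and a real scheduler would need locking and worker threads around this structure.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <map>
#include <utility>

// Hypothetical sketch, not Arrow code: tasks live in one FIFO queue per
// consumer, and Next() serves the consumers in rotating order.
class RoundRobinQueues {
 public:
  using Task = std::function<void()>;

  // Assign a unique integer id to a new task submitter.
  int NewConsumer() {
    int id = next_id_++;
    queues_[id];  // create an empty queue for this consumer
    return id;
  }

  // Append a task to the consumer's own queue.
  void Submit(int consumer_id, Task task) {
    queues_[consumer_id].push_back(std::move(task));
  }

  // Pop the next task in round-robin order into *out.
  // Returns false when every queue is empty.
  bool Next(Task* out) {
    if (queues_.empty()) return false;
    // Start with the first consumer id after the one served last.
    auto it = queues_.upper_bound(last_served_);
    for (std::size_t visited = 0; visited < queues_.size(); ++visited, ++it) {
      if (it == queues_.end()) it = queues_.begin();  // wrap around
      if (!it->second.empty()) {
        *out = std::move(it->second.front());
        it->second.pop_front();
        last_served_ = it->first;
        return true;
      }
    }
    return false;
  }

 private:
  std::map<int, std::deque<Task>> queues_;
  int next_id_ = 0;
  int last_served_ = -1;
};
```

With this layout, a consumer that arrives late with a single task is served after at most one task from each other consumer, instead of waiting behind everything already queued.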
[jira] [Created] (ARROW-8623) [C++][Gandiva] Reduce use of Boost, remove Boost headers from header files
Wes McKinney created ARROW-8623: --- Summary: [C++][Gandiva] Reduce use of Boost, remove Boost headers from header files Key: ARROW-8623 URL: https://issues.apache.org/jira/browse/ARROW-8623 Project: Apache Arrow Issue Type: Improvement Components: C++, C++ - Gandiva Reporter: Wes McKinney Fix For: 1.0.0 Boost is currently a transitive dependency of many of Gandiva's public header files. I suggest the following: * Do not include Boost transitively in any installed header file * Reduce usages of Boost altogether On the latter point, most usages of Boost can be trimmed by having a {{hash_combine}} function inside the Arrow codebase. See results of grepping the codebase https://gist.github.com/wesm/190006d91628e6bf7c04deb596a52cff It seems that Boost cannot be easily eliminated altogether at the present moment because of a use of Boost.Multiprecision ({{int256_t}}). At some point someone may want to implement sufficient 256-bit integer functions so that we don't have to depend on Boost.Multiprecision -- This message was sent by Atlassian Jira (v8.3.4#803005)
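A {{hash_combine}} helper of the kind suggested above could look roughly like this. It is a sketch, not code from the Arrow codebase: the name `HashCombine` and its pointer-based signature are assumptions, and the mixing step is the formula classically associated with {{boost::hash_combine}}, applied here on top of {{std::hash}}.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for boost::hash_combine; not Arrow code.
// Mixes the hash of `value` into `*seed` so that a sequence of calls
// yields a single hash for several fields.
template <typename T>
inline void HashCombine(std::size_t* seed, const T& value) {
  *seed ^= std::hash<T>{}(value) + 0x9e3779b9 + (*seed << 6) + (*seed >> 2);
}

// Example use: hash a (name, precision) pair field by field.
inline std::size_t HashNameAndPrecision(const std::string& name, int precision) {
  std::size_t seed = 0;
  HashCombine(&seed, name);
  HashCombine(&seed, precision);
  return seed;
}
```

Keeping a helper like this in an internal header would let call sites drop the Boost include without changing how composite keys are hashed in principle, although the resulting hash values would not necessarily match those produced by Boost.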