[jira] [Created] (ARROW-9127) [Rust] Update thrift library dependencies
Andrew Lamb created ARROW-9127: -- Summary: [Rust] Update thrift library dependencies Key: ARROW-9127 URL: https://issues.apache.org/jira/browse/ARROW-9127 Project: Apache Arrow Issue Type: Bug Reporter: Andrew Lamb Update to the latest version of Apache Thrift (0.13). Rationale: We were trying to update the version of `byteorder` that an internal project used, but arrow/parquet depends on parquet-format-rs, which depends on thrift. [~sunchao] recently updated the thrift pin in parquet-format in https://github.com/apache/arrow/pull/6626, so now it is possible to update the thrift version here as well. The thrift dependency was postponed when the dependencies were last updated. See: https://github.com/apache/arrow/pull/6626 https://issues.apache.org/jira/browse/ARROW-8124 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9126) [C++] Trimmed Boost bundle fails to build on Windows
Cuong Nguyen created ARROW-9126: --- Summary: [C++] Trimmed Boost bundle fails to build on Windows Key: ARROW-9126 URL: https://issues.apache.org/jira/browse/ARROW-9126 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Cuong Nguyen

Build with the following commands:
{code:java}
mkdir build
cd build
cmake .. -DARROW_PARQUET=ON
cmake --build .
{code}
Error from the build log:
{code:java}
.\boost/graph/two_bit_color_map.hpp(106): fatal error C1083: Cannot open include file: 'boost/graph/detail/empty_header.hpp': No such file or directory
{code}
This was because configuring Boost to build a subset of libraries doesn't work on Windows as it does on Linux. As a result, all libraries, including those being trimmed, were built:
{code:java}
Component configuration:
 - atomic : building
 - chrono : building
 - container : building
 - date_time : building
 - exception : building
 - filesystem : building
 - headers : building
 - iostreams : building
 - locale : building
 - log : building
 - mpi : building
 - program_options : building
 - python : building
 - random : building
 - regex : building
 - serialization : building
 - system : building
 - test : building
 - thread : building
 - timer : building
 - wave : building
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9125) [C++] Add missing include for arrow::internal::ZeroMemory() for Valgrind
Kouhei Sutou created ARROW-9125: --- Summary: [C++] Add missing include for arrow::internal::ZeroMemory() for Valgrind Key: ARROW-9125 URL: https://issues.apache.org/jira/browse/ARROW-9125 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9124) DFParser should consume sql query as &str instead of String
QP Hou created ARROW-9124: - Summary: DFParser should consume sql query as &str instead of String Key: ARROW-9124 URL: https://issues.apache.org/jira/browse/ARROW-9124 Project: Apache Arrow Issue Type: Improvement Reporter: QP Hou Assignee: QP Hou It's more efficient to take &str instead of String: the parser only needs to read the query, so borrowing avoids forcing callers to allocate or give up ownership. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9123) [Python][wheel] Use libzstd.a explicitly
Kouhei Sutou created ARROW-9123: --- Summary: [Python][wheel] Use libzstd.a explicitly Key: ARROW-9123 URL: https://issues.apache.org/jira/browse/ARROW-9123 Project: Apache Arrow Issue Type: Improvement Components: Packaging, Python Reporter: Kouhei Sutou Assignee: Kouhei Sutou {{ARROW_ZSTD_USE_SHARED}} is introduced by ARROW-9084. We need to set {{ARROW_ZSTD_USE_SHARED=OFF}} explicitly to use static zstd library. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9122) [C++] Adapt ascii_lower/ascii_upper bulk transforms to work on sliced arrays
Wes McKinney created ARROW-9122: --- Summary: [C++] Adapt ascii_lower/ascii_upper bulk transforms to work on sliced arrays Key: ARROW-9122 URL: https://issues.apache.org/jira/browse/ARROW-9122 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 See comments at https://github.com/apache/arrow/pull/7418#discussion_r439754427 Also add unit tests to verify that only the referenced data slice has been transformed in the result -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9121) Do not wipe the filesystem when path is empty
Mohamed Zenadi created ARROW-9121: - Summary: Do not wipe the filesystem when path is empty Key: ARROW-9121 URL: https://issues.apache.org/jira/browse/ARROW-9121 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Mohamed Zenadi

The `DeleteDirContents` method in the filesystems API has a default behavior of *wiping* the whole filesystem if we give it an empty path. It's documented as:

> Like DeleteDir, but doesn't delete the directory itself. Passing an empty path ("") will wipe the entire filesystem tree.

And the corresponding code confirms that:
{code:java}
auto parts = SplitAbstractPath(path);
RETURN_NOT_OK(ValidateAbstractPathParts(parts));
if (parts.empty()) {
  // Wipe filesystem
  impl_->RootDir().entries.clear();
  return Status::OK();
}
{code}
This is a weird default that does not make sense. If the user really wanted to wipe the filesystem, they would pass `/`.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
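The change the reporter proposes can be sketched in Python against a toy in-memory filesystem (all names here are hypothetical, not Arrow's actual API): refuse the empty path instead of silently wiping everything, while an explicit `/` still wipes the root.

```python
def delete_dir_contents(entries: dict, path: str) -> None:
    """Hypothetical sketch of the proposed behavior. `entries` stands in for a
    toy in-memory filesystem mapping slash-separated paths to file contents."""
    if path == "/":
        # Explicit root: the user unambiguously asked to wipe everything.
        entries.clear()
        return
    parts = [p for p in path.split("/") if p]
    if not parts:
        # Proposed change: an empty path is an error, not "wipe the filesystem".
        raise ValueError("DeleteDirContents: empty path; pass '/' to wipe the root")
    prefix = "/".join(parts) + "/"
    for key in [k for k in entries if k.startswith(prefix)]:
        del entries[key]
```

With this guard, an accidental empty string raises instead of destroying data, and the destructive operation requires a deliberate `/`.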
[jira] [Created] (ARROW-9120) [C++] Lint and Format _internal headers
Ben Kietzman created ARROW-9120: --- Summary: [C++] Lint and Format _internal headers Key: ARROW-9120 URL: https://issues.apache.org/jira/browse/ARROW-9120 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.17.1 Reporter: Ben Kietzman Assignee: Ben Kietzman Fix For: 1.0.0 Currently, headers named *_internal.h are neither clang-formatted nor cpplint-checked. Since they're not exported, the public-header lint rules (forbid , nullptr, ...) need not be applied -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9119) [C++] Add support for building with system static gRPC
Kouhei Sutou created ARROW-9119: --- Summary: [C++] Add support for building with system static gRPC Key: ARROW-9119 URL: https://issues.apache.org/jira/browse/ARROW-9119 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9118) [C++] Add more general BoundsCheck function that also checks for arbitrary lower limits in integer arrays
Wes McKinney created ARROW-9118: --- Summary: [C++] Add more general BoundsCheck function that also checks for arbitrary lower limits in integer arrays Key: ARROW-9118 URL: https://issues.apache.org/jira/browse/ARROW-9118 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 See ARROW-9083. The current {{IndexBoundsCheck}} is specialized to skip a comparison for unsigned integers and uses 0 as the lower bound for signed integers. This could be generalized so that we could check e.g. if int64 values will fit in the int32 range -- This message was sent by Atlassian Jira (v8.3.4#803005)
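The proposed generalization can be sketched in Python (function names are hypothetical, not Arrow's C++ API): a bounds check parameterized by arbitrary lower and upper limits, usable e.g. to verify that int64 values would fit in the int32 range.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def bounds_check(values, lower, upper):
    """Hypothetical sketch of a generalized BoundsCheck: every value must
    satisfy lower <= v <= upper. Returns the first offending value, or None."""
    for v in values:
        if v < lower or v > upper:
            return v
    return None

def fits_in_int32(values):
    # The use case from the issue: check whether int64 values fit in int32.
    return bounds_check(values, INT32_MIN, INT32_MAX) is None
```

The current index check is then just the special case `lower=0` (or no check at all for unsigned types).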
[jira] [Created] (ARROW-9117) [Python] Is there a Pandas circular dependency problem?
SEUNGMIN HEO created ARROW-9117: --- Summary: [Python] Is there a Pandas circular dependency problem? Key: ARROW-9117 URL: https://issues.apache.org/jira/browse/ARROW-9117 Project: Apache Arrow Issue Type: Bug Reporter: SEUNGMIN HEO -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9116) [C++] Add BinaryArray::total_values_length()
Antoine Pitrou created ARROW-9116: - Summary: [C++] Add BinaryArray::total_values_length() Key: ARROW-9116 URL: https://issues.apache.org/jira/browse/ARROW-9116 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Antoine Pitrou Fix For: 1.0.0

It's often useful to compute the total data size of a binary array. Sample implementation:
{code:c++}
int64_t total_values_length() const {
  return raw_value_offsets_[length() + data_->offset] -
         raw_value_offsets_[data_->offset];
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
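The same offset arithmetic, mirrored as a hedged Python sketch to make the slicing behavior concrete (this models the C++ sample, it is not pyarrow API): a sliced binary array reads its values between the offsets that bracket the slice, so the slice offset must be added on both sides.

```python
def total_values_length(value_offsets, slice_offset, slice_length):
    """Total byte length of the values referenced by a (possibly sliced)
    binary array: the difference of the two offsets bracketing the slice.
    `value_offsets` has parent_length + 1 entries, as in Arrow's layout."""
    return value_offsets[slice_offset + slice_length] - value_offsets[slice_offset]
```

For the full array this is simply `offsets[length] - offsets[0]`; for a slice it counts only the bytes the slice actually references.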
[jira] [Created] (ARROW-9115) [C++] Process data buffers in batch in ascii_lower / ascii_upper kernels rather than using string_view value iteration
Wes McKinney created ARROW-9115: --- Summary: [C++] Process data buffers in batch in ascii_lower / ascii_upper kernels rather than using string_view value iteration Key: ARROW-9115 URL: https://issues.apache.org/jira/browse/ARROW-9115 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 Also add a benchmark -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9114) Illegal instruction crash in arrow.dll
MP created ARROW-9114: - Summary: Illegal instruction crash in arrow.dll Key: ARROW-9114 URL: https://issues.apache.org/jira/browse/ARROW-9114 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.17.1 Environment: Conda environment on Windows Server 2016. Importantly, the CPU does *not* support AVX2. Reporter: MP

We have been encountering illegal instruction crashes in {{arrow.dll}} when using the {{conda}} packages from {{conda-forge}}. Here are the relevant packages that were installed:

{{arrow-cpp: 0.17.1-py37h1234567_4_cpu}}
{{parquet-cpp: 1.5.1-2}}
{{pyarrow: 0.17.1-py37h1234567_4_cpu}}
{{snappy: 1.1.8-he025d50_1}}

The error is:
{noformat}
Windows fatal exception: code 0xc000001d
{noformat}
Some further investigation revealed that the offending instruction is {{BZHI}}, which as I understand it is part of the {{BMI2}} set, in turn part of {{AVX2}}. We believe this is in fact arising in {{snappy}} code here: https://github.com/google/snappy/blob/1.1.8/snappy.cc#L717-L728 The {{snappy 1.1.8}} package appears to have been built with {{BMI2}} support enabled, if you look at the release build log here: https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=115252&view=logs&j=2cc45e14-23e3-52d7-b33a-8c2744410b97&t=21c44aa7-1ae3-5312-cacc-7f19fefc82f4 Of course, this is then arguably an upstream issue, but I have reported it here because perhaps that configuration is the desired choice for the 'standard' {{snappy}} package and something else might need to be done in {{arrow}} instead, for example. (Incidentally, is the {{snappy}} runtime dependency correct in the {{arrow}} feedstocks? If it's statically linked, shouldn't it only be required at build time?)
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9113) Fix exception causes in cli.py
Ram Rachum created ARROW-9113: - Summary: Fix exception causes in cli.py Key: ARROW-9113 URL: https://issues.apache.org/jira/browse/ARROW-9113 Project: Apache Arrow Issue Type: Bug Reporter: Ram Rachum

I recently went over [Matplotlib](https://github.com/matplotlib/matplotlib/pull/16706), [Pandas](https://github.com/pandas-dev/pandas/pull/32322) and [NumPy](https://github.com/numpy/numpy/pull/15731), fixing a small mistake in the way that Python 3's exception chaining is used. If you're interested, I can do it here too. I've done it on just one file right now.

The mistake is this: In some parts of the code, an exception is being caught and replaced with a more user-friendly error. In these cases the syntax `raise new_error from old_error` needs to be used. Python 3's exception chaining means it shows not only the traceback of the current exception, but also that of the original exception (and possibly more). This happens regardless of `raise from`. What `raise from` does is tell Python to put a more accurate message between the tracebacks. Instead of this:

During handling of the above exception, another exception occurred:

You'll get this:

The above exception was the direct cause of the following exception:

The first is inaccurate, because it signifies a bug in the exception-handling code itself, which is a separate situation from wrapping an exception. Let me know what you think!
-- This message was sent by Atlassian Jira (v8.3.4#803005)
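A minimal, self-contained illustration of the two chaining styles (function names are hypothetical, for demonstration only):

```python
def load_config_plain(path):
    try:
        return open(path).read()
    except OSError:
        # Without "from": Python prints "During handling of the above
        # exception, another exception occurred", as if the handler itself
        # had a bug. __context__ is set implicitly, __cause__ stays None.
        raise RuntimeError(f"could not load config {path!r}")

def load_config_chained(path):
    try:
        return open(path).read()
    except OSError as e:
        # With "raise ... from e": Python prints "The above exception was
        # the direct cause of the following exception", and __cause__ is set.
        raise RuntimeError(f"could not load config {path!r}") from e
```

The traceback output differs only in the connecting sentence, but that sentence is exactly what tells the reader whether the wrapping was intentional.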
[jira] [Created] (ARROW-9112) [R] Update autobrew script location
Neal Richardson created ARROW-9112: -- Summary: [R] Update autobrew script location Key: ARROW-9112 URL: https://issues.apache.org/jira/browse/ARROW-9112 Project: Apache Arrow Issue Type: Task Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 1.0.0 Jeroen is moving it to a different location. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9111) csv.read_csv progress bar
Jeff Hammerbacher created ARROW-9111: Summary: csv.read_csv progress bar Key: ARROW-9111 URL: https://issues.apache.org/jira/browse/ARROW-9111 Project: Apache Arrow Issue Type: Improvement Affects Versions: 0.17.1 Reporter: Jeff Hammerbacher When reading a very large csv file, it would be nice to see some diagnostic output from pyarrow. [readr|https://readr.tidyverse.org/reference/read_delim.html] has a `progress` parameter, for example. [tqdm|https://github.com/tqdm/tqdm] is often used in the Python community to provide this functionality. -- This message was sent by Atlassian Jira (v8.3.4#803005)
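As an illustration of how such a progress hook could work, here is a hedged Python sketch using only the standard library: a wrapper file object that reports bytes consumed to a callback (which could be a tqdm bar's `update`). The `ProgressReader` class is hypothetical, not pyarrow API.

```python
class ProgressReader:
    """Sketch: wrap a binary file object and report the running total of
    bytes read to a callback, as a CSV reader pulls data through it."""

    def __init__(self, raw, callback):
        self._raw = raw
        self._callback = callback
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = self._raw.read(size)
        self.bytes_read += len(chunk)
        self._callback(self.bytes_read)
        return chunk
```

A reader that accepts a file-like object could consume the wrapper transparently, and the callback compared against the file size gives a percentage for display.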
[jira] [Created] (ARROW-9110) [C++] Fix CPU cache size detection on macOS
Krisztian Szucs created ARROW-9110: -- Summary: [C++] Fix CPU cache size detection on macOS Key: ARROW-9110 URL: https://issues.apache.org/jira/browse/ARROW-9110 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 Running certain benchmarks on macOS never ends because CpuInfo detects the RAM size as the size of L1 cache. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9109) [Python][Packaging] Enable S3 support in manylinux wheels
Antoine Pitrou created ARROW-9109: - Summary: [Python][Packaging] Enable S3 support in manylinux wheels Key: ARROW-9109 URL: https://issues.apache.org/jira/browse/ARROW-9109 Project: Apache Arrow Issue Type: Sub-task Components: Packaging, Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9108) [C++][Dataset] Add Parquet Statistics conversion for timestamp columns
Francois Saint-Jacques created ARROW-9108: - Summary: [C++][Dataset] Add Parquet Statistics conversion for timestamp columns Key: ARROW-9108 URL: https://issues.apache.org/jira/browse/ARROW-9108 Project: Apache Arrow Issue Type: Sub-task Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9107) [C++][Dataset] Time-based types support
Francois Saint-Jacques created ARROW-9107: - Summary: [C++][Dataset] Time-based types support Key: ARROW-9107 URL: https://issues.apache.org/jira/browse/ARROW-9107 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques We lack support for date/timestamp partitions and the corresponding predicate pushdown rules. Timestamp columns are usually the most important predicate in OLAP-style queries, so we need to support this transparently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9106) [C++] Add C++ foundation to ease file transcoding
Antoine Pitrou created ARROW-9106: - Summary: [C++] Add C++ foundation to ease file transcoding Key: ARROW-9106 URL: https://issues.apache.org/jira/browse/ARROW-9106 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou In some situations (e.g. reading a Windows-produced CSV file), the user might transcode data before ingesting it into Arrow. Rather than build transcoding in C++ (which would require a library of encodings), we could delegate it to bindings as needed, by providing a generic InputStream facility. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9105) [C++] ParquetFileFragment::SplitByRowGroup doesn't handle filter on partition field
Joris Van den Bossche created ARROW-9105: Summary: [C++] ParquetFileFragment::SplitByRowGroup doesn't handle filter on partition field Key: ARROW-9105 URL: https://issues.apache.org/jira/browse/ARROW-9105 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Fix For: 1.0.0

When splitting a fragment into row group fragments, filtering on the partition field raises an error. Python reproducer:

```
df = pd.DataFrame({"dummy": [1, 1, 1, 1], "part": ["A", "A", "B", "B"]})
df.to_parquet("test_partitioned_filter", partition_cols="part", engine="pyarrow")

import pyarrow.dataset as ds
dataset = ds.dataset("test_partitioned_filter", format="parquet", partitioning="hive")
fragment = list(dataset.get_fragments())[0]
```

```
In [31]: dataset.to_table(filter=ds.field("part") == "A").to_pandas()
Out[31]:
   dummy part
0      1    A
1      1    A

In [32]: fragment.split_by_row_group(ds.field("part") == "A")
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
in
----> 1 fragment.split_by_row_group(ds.field("part") == "A")

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.ParquetFileFragment.split_by_row_group()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset._insert_implicit_casts()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Field named 'part' not found or not unique in the schema.
```

This is probably a "strange" thing to do, since the fragment from a partitioned dataset is already coming only from a single partition (so will always only satisfy a single equality expression). But it's still nice that as a user you don't have to care about only passing part of the filter down to {{split_by_row_group}}.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9104) [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory
Krisztian Szucs created ARROW-9104: -- Summary: [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory Key: ARROW-9104 URL: https://issues.apache.org/jira/browse/ARROW-9104 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Krisztian Szucs Fix For: 1.0.0

If the source directory is not writable the test raises a permission denied error:
{noformat}
[ RUN      ] TestEncryptionConfiguration.UniformEncryption
unknown file: Failure
C++ exception with description "IOError: Failed to open local file '/arrow/cpp/submodules/parquet-testing/data/tmp_uniform_encryption.parquet.encrypted'. Detail: [errno 13] Permission denied" thrown in the test body.
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9103) [Python] Clarify behaviour of CSV reader for non-UTF8 text data
Joris Van den Bossche created ARROW-9103: Summary: [Python] Clarify behaviour of CSV reader for non-UTF8 text data Key: ARROW-9103 URL: https://issues.apache.org/jira/browse/ARROW-9103 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche See https://stackoverflow.com/questions/62153229/how-does-pyarrow-read-csv-handle-different-file-encodings/62321673#62321673 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9102) [Packaging] Upload built manylinux docker images
Krisztian Szucs created ARROW-9102: -- Summary: [Packaging] Upload built manylinux docker images Key: ARROW-9102 URL: https://issues.apache.org/jira/browse/ARROW-9102 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 Although the secrets were set on Azure Pipelines, the upload step is failing: https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=13104&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181 So the manylinux builds take more than two hours. This is due to Azure's secret handling: we need to explicitly export the azure secret variables as environment variables. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9101) [Doc][C++][Python] Document encoding expected by CSV and JSON readers
Antoine Pitrou created ARROW-9101: - Summary: [Doc][C++][Python] Document encoding expected by CSV and JSON readers Key: ARROW-9101 URL: https://issues.apache.org/jira/browse/ARROW-9101 Project: Apache Arrow Issue Type: Task Components: C++, Documentation, Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9100) Add ascii_lower kernel
Maarten Breddels created ARROW-9100: --- Summary: Add ascii_lower kernel Key: ARROW-9100 URL: https://issues.apache.org/jira/browse/ARROW-9100 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Maarten Breddels -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9099) Add TRIM function for string
Sagnik Chakraborty created ARROW-9099: - Summary: Add TRIM function for string Key: ARROW-9099 URL: https://issues.apache.org/jira/browse/ARROW-9099 Project: Apache Arrow Issue Type: Task Reporter: Sagnik Chakraborty -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9098) RecordBatch::ToStructArray cannot handle record batches with 0 columns
Zhuo Peng created ARROW-9098: Summary: RecordBatch::ToStructArray cannot handle record batches with 0 columns Key: ARROW-9098 URL: https://issues.apache.org/jira/browse/ARROW-9098 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.17.1 Reporter: Zhuo Peng

If RecordBatch::ToStructArray is called against a record batch with 0 columns, the following error will be raised:
{noformat}
Invalid: Can't infer struct array length with 0 child arrays
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9097) [Rust] Customizable schema inference for CSV
Sergey Todyshev created ARROW-9097: -- Summary: [Rust] Customizable schema inference for CSV Key: ARROW-9097 URL: https://issues.apache.org/jira/browse/ARROW-9097 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Sergey Todyshev Please consider extracting the infer_csv_schema function into a separate module, allowing customization of field DataType inference. Currently the missing part is inference of datetime fields. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9096) data type "integer" not understood: pandas roundtrip
Richard Wu created ARROW-9096: - Summary: data type "integer" not understood: pandas roundtrip Key: ARROW-9096 URL: https://issues.apache.org/jira/browse/ARROW-9096 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.17.1 Reporter: Richard Wu

The following will fail the roundtrip since the column indexes' pandas_type is converted from int64 to integer when an additional column is introduced and subsequently moved to the index:
{code:java}
import numpy as np
import pandas as pd
import pyarrow

df = pd.DataFrame(np.ones((3, 1)), index=[[1, 2, 3]])
df['foo'] = np.arange(3)
df = df.set_index('foo', append=True)
table = pyarrow.Table.from_pandas(df)
table.to_pandas()  # Errors
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9095) [Rust] Fix NullArray to comply with spec
Neville Dipale created ARROW-9095: - Summary: [Rust] Fix NullArray to comply with spec Key: ARROW-9095 URL: https://issues.apache.org/jira/browse/ARROW-9095 Project: Apache Arrow Issue Type: Sub-task Components: Rust Affects Versions: 0.17.0 Reporter: Neville Dipale When I implemented the NullArray, I didn't comply with the spec under the premise that I'd handle reading and writing IPC in a spec-compliant way as that looked like the easier approach. After some integration testing, I realised that I wasn't doing it correctly, so it's better to comply with the spec by not allocating any buffers for the array. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9094) [Python] Bump versions of compiled dependencies in manylinux wheels
Antoine Pitrou created ARROW-9094: - Summary: [Python] Bump versions of compiled dependencies in manylinux wheels Key: ARROW-9094 URL: https://issues.apache.org/jira/browse/ARROW-9094 Project: Apache Arrow Issue Type: Task Components: Packaging, Python Reporter: Antoine Pitrou Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9093) [FlightRPC][C++][Python] Allow setting gRPC client options
David Li created ARROW-9093: --- Summary: [FlightRPC][C++][Python] Allow setting gRPC client options Key: ARROW-9093 URL: https://issues.apache.org/jira/browse/ARROW-9093 Project: Apache Arrow Issue Type: Improvement Components: C++, FlightRPC, Python Reporter: David Li Assignee: David Li There's no way to set generic gRPC options which are useful for tuning behavior (e.g. round-robin load balancing). Rather than bind all of these one by one, gRPC allows setting arguments as generic string-string or string-integer pairs; we could expose this (and leave the interpretation implementation-dependent). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9092) [C++] gandiva-decimal-test hangs with LLVM 9
Wes McKinney created ARROW-9092: --- Summary: [C++] gandiva-decimal-test hangs with LLVM 9 Key: ARROW-9092 URL: https://issues.apache.org/jira/browse/ARROW-9092 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney I built Gandiva C++ unittests with LLVM 9 on Ubuntu 18.04 and gandiva-decimal-test hangs forever -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9091) [C++] Utilize function's default options when passing no options to CallFunction to a function that requires them
Wes McKinney created ARROW-9091: --- Summary: [C++] Utilize function's default options when passing no options to CallFunction to a function that requires them Key: ARROW-9091 URL: https://issues.apache.org/jira/browse/ARROW-9091 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Otherwise benign usage of {{CallFunction}} can cause an unintuitive segfault in some cases -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9090) [C++] Bump versions of bundled libraries
Antoine Pitrou created ARROW-9090: - Summary: [C++] Bump versions of bundled libraries Key: ARROW-9090 URL: https://issues.apache.org/jira/browse/ARROW-9090 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Antoine Pitrou Fix For: 1.0.0 We should bump the versions of bundled dependencies, wherever possible, to ensure that users get bugfixes and improvements made in those third-party libraries. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9089) [Python] A PyFileSystem handler for fsspec-based filesystems
Joris Van den Bossche created ARROW-9089: Summary: [Python] A PyFileSystem handler for fsspec-based filesystems Key: ARROW-9089 URL: https://issues.apache.org/jira/browse/ARROW-9089 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Follow-up on ARROW-8766 to use this machinery to add an FSSpecHandler -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9088) [Rust] Recent version of arrow crate does not compile into wasm target
Sergey Todyshev created ARROW-9088: -- Summary: [Rust] Recent version of arrow crate does not compile into wasm target Key: ARROW-9088 URL: https://issues.apache.org/jira/browse/ARROW-9088 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Sergey Todyshev

arrow 0.16 compiles successfully into wasm32-unknown-unknown, but the recent git version does not. It would be nice to fix that. Compiler errors:
{noformat}
error[E0433]: failed to resolve: could not find `unix` in `os`
  --> /home/regl/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:41:18
   |
41 |     use std::os::unix::ffi::OsStringExt;
   |              could not find `unix` in `os`

error[E0432]: unresolved import `unix`
 --> /home/regl/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:6:5
  |
6 | use unix;
  |     no `unix` in the root
{noformat}
The problem is that the prettytable-rs dependency depends on dirs, which causes this error.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9087) Missing HDFS options parsing
Yuan Zhou created ARROW-9087: Summary: Missing HDFS options parsing Key: ARROW-9087 URL: https://issues.apache.org/jira/browse/ARROW-9087 Project: Apache Arrow Issue Type: Bug Reporter: Yuan Zhou Assignee: Yuan Zhou The HDFS options for the Kerberos ticket and extra configuration are not parsed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9086) [CI][Homebrew] Enable Gandiva
Kouhei Sutou created ARROW-9086: --- Summary: [CI][Homebrew] Enable Gandiva Key: ARROW-9086 URL: https://issues.apache.org/jira/browse/ARROW-9086 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9085) [C++][CI] Appveyor CI test failures
Wes McKinney created ARROW-9085: --- Summary: [C++][CI] Appveyor CI test failures Key: ARROW-9085 URL: https://issues.apache.org/jira/browse/ARROW-9085 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 See https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/33417919 These seem to have been introduced by https://github.com/apache/arrow/commit/b058cf0d1c26ad7984c104bb84322cc7dcc66f00 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9084) [C++] cmake is unable to find zstd target when ZSTD_SOURCE=SYSTEM
Dmitry Kalinkin created ARROW-9084: -- Summary: [C++] cmake is unable to find zstd target when ZSTD_SOURCE=SYSTEM Key: ARROW-9084 URL: https://issues.apache.org/jira/browse/ARROW-9084 Project: Apache Arrow Issue Type: Bug Affects Versions: 0.17.1 Environment: zstd 1.4.5 Reporter: Dmitry Kalinkin Assignee: Dmitry Kalinkin

The following problem occurs when arrow-cpp is built against system zstd:
{noformat}
CMake Error at cmake_modules/ThirdpartyToolchain.cmake:1860 (get_target_property):
  get_target_property() called with non-existent target "ZSTD::zstd".
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9083) [R] collect int64 as R integer type if not out of bounds
Neal Richardson created ARROW-9083: -- Summary: [R] collect int64 as R integer type if not out of bounds Key: ARROW-9083 URL: https://issues.apache.org/jira/browse/ARROW-9083 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson {{bit64::integer64}} can be awkward to work with in R (one example: https://github.com/apache/arrow/issues/7385). Often in Arrow we get {{int64}} types from [compute methods|https://github.com/apache/arrow/pull/7308] or other translation methods that auto-promote to the largest integer type, but they would fit fine in a 32-bit integer, which is R's native type. When calling {{Array__as_vector}} on an int64, we could first call the minmax function on the array, and if the extrema are within the range of a 32-bit int, return a regular R integer vector. This would add a little bit of ambiguity as to what R type you'll get from an Arrow type, but I wonder if the benefits are worth it since you can't do much with an integer64 in R. (We could also make this optional, similar to ARROW-7657, so you could specify a "strict" mode if you are in a use case where roundtrip fidelity is more important than R usability.) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9082) [Rust] Stream reader fails when stream is not ended with (optional) 0xFFFFFFFF 0x00000000
Eyal Leshem created ARROW-9082: -- Summary: [Rust] Stream reader fails when stream is not ended with (optional) 0xFFFFFFFF 0x00000000 Key: ARROW-9082 URL: https://issues.apache.org/jira/browse/ARROW-9082 Project: Apache Arrow Issue Type: Bug Affects Versions: 0.17.1 Reporter: Eyal Leshem According to the spec ([https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format]), the trailing 0xFFFFFFFF 0x00000000 marker is optional in an Arrow response stream, but currently when a client receives such a response it reads all the batches correctly yet returns an error at the end (instead of Ok(None)). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9081) [C++] Upgrade to LLVM 10
Ben Kietzman created ARROW-9081: --- Summary: [C++] Upgrade to LLVM 10 Key: ARROW-9081 URL: https://issues.apache.org/jira/browse/ARROW-9081 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.17.1 Reporter: Ben Kietzman Assignee: Ben Kietzman Fix For: 1.0.0 Upgrade llvm dependencies to use version 10 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9080) [C++] arrow::AllocateBuffer returns a Result<std::unique_ptr<Buffer>>
Wes McKinney created ARROW-9080: --- Summary: [C++] arrow::AllocateBuffer returns a Result<std::unique_ptr<Buffer>> Key: ARROW-9080 URL: https://issues.apache.org/jira/browse/ARROW-9080 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 This seemed counterintuitive to me since using Buffers almost anywhere requires a shared_ptr -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9079) [C++] Write benchmark for arithmetic kernels
Krisztian Szucs created ARROW-9079: -- Summary: [C++] Write benchmark for arithmetic kernels Key: ARROW-9079 URL: https://issues.apache.org/jira/browse/ARROW-9079 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 The add kernel's implementation changed in https://github.com/apache/arrow/pull/7341. To ensure that no performance regression was introduced, write a benchmark for the kernels and compare the results with the previous implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9078) [C++] Parquet writing of extension type with nested storage type fails
Joris Van den Bossche created ARROW-9078: Summary: [C++] Parquet writing of extension type with nested storage type fails Key: ARROW-9078 URL: https://issues.apache.org/jira/browse/ARROW-9078 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche A reproducer in Python: {code:python} import pyarrow as pa import pyarrow.parquet as pq class MyStructType(pa.PyExtensionType): def __init__(self): pa.PyExtensionType.__init__(self, pa.struct([('left', pa.int64()), ('right', pa.int64())])) def __reduce__(self): return MyStructType, () struct_array = pa.StructArray.from_arrays( [ pa.array([0, 1], type="int64", from_pandas=True), pa.array([1, 2], type="int64", from_pandas=True), ], names=["left", "right"], ) # works table = pa.table({'a': struct_array}) pq.write_table(table, "test_struct.parquet") # doesn't work mystruct_array = pa.ExtensionArray.from_storage(MyStructType(), struct_array) table = pa.table({'a': mystruct_array}) pq.write_table(table, "test_struct.parquet") {code} Writing the simple StructArray nowadays works (and reading it back in as well). But when the struct array is the storage array of an ExtensionType, it fails with the following error: {code} ArrowException: Unknown error: data type leaf_count != builder_leaf_count1 2 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9077) [C++] Fix aggregate/scalar-compare benchmark null_percent calculation
Frank Du created ARROW-9077: --- Summary: [C++] Fix aggregate/scalar-compare benchmark null_percent calculation Key: ARROW-9077 URL: https://issues.apache.org/jira/browse/ARROW-9077 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Frank Du Assignee: Frank Du The null percent reported by the aggregate/scalar-compare benchmarks is wrong after the changes in benchmark_util.h. Correct both to use the newly defined boilerplate. ./release/arrow-compute-aggregate-benchmark -- Benchmark Time CPU Iterations UserCounters... -- SumKernelFloat/32768/1 5.38 us 5.38 us 129832 bytes_per_second=5.67524G/s null_percent=10k size=32.768k SumKernelFloat/32768/1000 5.36 us 5.35 us 130069 bytes_per_second=5.6994G/s null_percent=1000 size=32.768k SumKernelFloat/32768/100 5.35 us 5.35 us 131071 bytes_per_second=5.70903G/s null_percent=100 size=32.768k SumKernelFloat/32768/50 10.8 us 10.7 us 65504 bytes_per_second=2.84073G/s null_percent=50 size=32.768k SumKernelFloat/32768/10 4.94 us 4.93 us 141624 bytes_per_second=6.18964G/s null_percent=10 size=32.768k SumKernelFloat/32768/1 4.41 us 4.40 us 158949 bytes_per_second=6.92913G/s null_percent=1 size=32.768k -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9076) [Rust] Async CSV reader
Sergey Todyshev created ARROW-9076: -- Summary: [Rust] Async CSV reader Key: ARROW-9076 URL: https://issues.apache.org/jira/browse/ARROW-9076 Project: Apache Arrow Issue Type: New Feature Reporter: Sergey Todyshev The rust-csv crate recently added an async implementation of its CSV reader. It would be nice to have that in the arrow crate as well; it is extremely useful in applications that need to parse large CSV files in WebAssembly. An async JSON reader would be nice to have too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9075) [C++] Optimize Filter implementation
Wes McKinney created ARROW-9075: --- Summary: [C++] Optimize Filter implementation Key: ARROW-9075 URL: https://issues.apache.org/jira/browse/ARROW-9075 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 I split this off from ARROW-5760 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9074) [GLib] Add missing arrow-json check
Kouhei Sutou created ARROW-9074: --- Summary: [GLib] Add missing arrow-json check Key: ARROW-9074 URL: https://issues.apache.org/jira/browse/ARROW-9074 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9073) [C++] RapidJSON include directory detection doesn't work with RapidJSONConfig.cmake
Kouhei Sutou created ARROW-9073: --- Summary: [C++] RapidJSON include directory detection doesn't work with RapidJSONConfig.cmake Key: ARROW-9073 URL: https://issues.apache.org/jira/browse/ARROW-9073 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9072) [C++][Gandiva][MinGW] Enable crashed tests
Kouhei Sutou created ARROW-9072: --- Summary: [C++][Gandiva][MinGW] Enable crashed tests Key: ARROW-9072 URL: https://issues.apache.org/jira/browse/ARROW-9072 Project: Apache Arrow Issue Type: Improvement Components: C++ - Gandiva Reporter: Kouhei Sutou Some Gandiva tests crash with MinGW and are disabled in {{ci/scripts/cpp_test.sh}}. We should fix the crashes and enable these tests. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9071) [C++] MakeArrayOfNull makes invalid ListArray
Zhuo Peng created ARROW-9071: Summary: [C++] MakeArrayOfNull makes invalid ListArray Key: ARROW-9071 URL: https://issues.apache.org/jira/browse/ARROW-9071 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Zhuo Peng One way to reproduce this bug is: >>> a = pa.array([[1, 2]]) >>> b = pa.array([None, None], type=pa.null()) >>> t1 = pa.Table.from_arrays([a], ["a"]) >>> t2 = pa.Table.from_arrays([b], ["b"]) >>> pa.concat_tables([t1, t2], promote=True) Traceback (most recent call last): File "", line 1, in File "pyarrow/table.pxi", line 2138, in pyarrow.lib.concat_tables File "pyarrow/public-api.pxi", line 390, in pyarrow.lib.pyarrow_wrap_table File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Column 0: In chunk 1: Invalid: List child array invalid: Invalid: Buffer #1 too small in array of type int64 and length 2: expected at least 16 byte(s), got 12 (because concat_tables(promote=True) calls MakeArrayOfNull: [https://github.com/apache/arrow/blob/ec3bae18157723411bb772fca628cbd02eea5c25/cpp/src/arrow/table.cc#L647]) The code here seems incorrect: [https://github.com/apache/arrow/blob/ec3bae18157723411bb772fca628cbd02eea5c25/cpp/src/arrow/array/util.cc#L218]: the length of the child array of a ListArray may not equal the length of the ListArray. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9070) [C++] StructScalar needs field accessor methods
Neal Richardson created ARROW-9070: -- Summary: [C++] StructScalar needs field accessor methods Key: ARROW-9070 URL: https://issues.apache.org/jira/browse/ARROW-9070 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Neal Richardson Fix For: 1.0.0 The minmax compute function returns a struct with fields "min" and "max". So to write an R binding for the {{min()}} method on arrow objects, I call "minmax" and then take the "min" field from the result. However, at least from my reading of scalar.h compared with array_nested.h, there are no field/GetFieldByName/etc. methods for StructScalar, so I can't get it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9069) [C++] MakeArrayFromScalar can't handle struct
Neal Richardson created ARROW-9069: -- Summary: [C++] MakeArrayFromScalar can't handle struct Key: ARROW-9069 URL: https://issues.apache.org/jira/browse/ARROW-9069 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Neal Richardson Fix For: 1.0.0 The R bindings translate data to/from Scalars by using the Array methods already implemented: to go from R object to a Scalar, it creates a length-1 Array and then slices out the 0th element with GetScalar(); to go from Scalar to R object, it calls MakeArrayFromScalar and then the as.vector method on that Array (in R, there is no scalar type anyway, only length-1 vectors). This generally works fine but if I get a Struct scalar (as the minmax compute function returns), I can't do anything with it because MakeArrayFromScalar doesn't work with structs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9068) [C++][Dataset] Simplify Partitioning interface
Francois Saint-Jacques created ARROW-9068: - Summary: [C++][Dataset] Simplify Partitioning interface Key: ARROW-9068 URL: https://issues.apache.org/jira/browse/ARROW-9068 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Francois Saint-Jacques The `int segment` of `Partitioning::Parse` should not be exposed to the user. KeyValuePartitioning should be a private Impl interface, not in public headers. The same applies to `Partitioning::Format`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9067) [C++] Create reusable branchless / vectorized index boundschecking functions
Wes McKinney created ARROW-9067: --- Summary: [C++] Create reusable branchless / vectorized index boundschecking functions Key: ARROW-9067 URL: https://issues.apache.org/jira/browse/ARROW-9067 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 It is possible to do branch-free index boundschecking in batches for better performance. I am implementing this as part of the Take/Filter optimization (so please wait until I have PRs up for this work), but these functions can be moved somewhere more general purpose and used in places where we are currently boundschecking inside inner loops. -- This message was sent by Atlassian Jira (v8.3.4#803005)
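The idea can be illustrated in Python (a sketch only; the real implementation is C++ and processes indices in batches): rather than branching per element, each element's out-of-bounds condition is folded into a flag with bitwise OR, and the flag is inspected once per batch.

```python
def all_in_bounds(indices, length):
    # accumulate the out-of-bounds condition with bitwise OR so the
    # loop body contains no data-dependent branch
    bad = 0
    for i in indices:
        bad |= (i < 0) | (i >= length)
    return not bad
```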
[jira] [Created] (ARROW-9066) [Python] Raise correct error in isnull()
Uwe Korn created ARROW-9066: --- Summary: [Python] Raise correct error in isnull() Key: ARROW-9066 URL: https://issues.apache.org/jira/browse/ARROW-9066 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.17.1 Reporter: Uwe Korn Assignee: Uwe Korn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9065) Support parsing date32 in dataset partition folders
Dave Hirschfeld created ARROW-9065: -- Summary: Support parsing date32 in dataset partition folders Key: ARROW-9065 URL: https://issues.apache.org/jira/browse/ARROW-9065 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Dave Hirschfeld I have some data which is partitioned by year/month/date. It would be useful if the date could be automatically parsed: ```python In [17]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.date32())]) In [18]: partition = DirectoryPartitioning(schema) In [19]: partition.parse("/2020/06/2020-06-08") --- ArrowNotImplementedError Traceback (most recent call last) in > 1 partition.parse("/2020/06/2020-06-08") ~\envs\dev\lib\site-packages\pyarrow\_dataset.pyx in pyarrow._dataset.Partitioning.parse() ~\envs\dev\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.pyarrow_internal_check_status() ~\envs\dev\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status() ArrowNotImplementedError: parsing scalars of type date32[day] ``` Not a big issue since you can just use string and convert, but nevertheless it would be nice if it Just Worked ```python In [22]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.string())]) In [23]: partition = DirectoryPartitioning(schema) In [24]: partition.parse("/2020/06/2020-06-08") Out[24]: ``` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9064) optimization debian package manager tweaks
Pratik Raj created ARROW-9064: - Summary: optimization debian package manager tweaks Key: ARROW-9064 URL: https://issues.apache.org/jira/browse/ARROW-9064 Project: Apache Arrow Issue Type: Improvement Reporter: Pratik Raj By default, the "apt" or "apt-get" system on Ubuntu or Debian installs recommended but not suggested packages. By passing the "--no-install-recommends" option, the user lets apt-get know not to consider recommended packages as dependencies to install. This results in smaller downloads and installations. Refer to the [Ubuntu Blog] post at https://ubuntu.com/blog/we-reduced-our-docker-images-by-60-with-no-install-recommends -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9063) [Python][C++] Order of files is not respected using the new pyarrow.dataset
William Liu created ARROW-9063: -- Summary: [Python][C++] Order of files is not respected using the new pyarrow.dataset Key: ARROW-9063 URL: https://issues.apache.org/jira/browse/ARROW-9063 Project: Apache Arrow Issue Type: Bug Components: C++, Python Affects Versions: 0.17.1 Environment: ubuntu-18.04 Reporter: William Liu Say we have multiple parquet files under the same folder (a.parquet, b.parquet, c.parquet). If I pass a list of file paths into either of the two statements below {code:java} ds = pq.ParquetDataset(fps, use_legacy_dataset=False) ds = pyarrow.dataset(fps){code} then the rows of the resulting table come back out of order: ......aaa......aaa...ccc..bbb... -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9062) [Rust] Support to read JSON into dictionary type
Sven Wagner-Boysen created ARROW-9062: - Summary: [Rust] Support to read JSON into dictionary type Key: ARROW-9062 URL: https://issues.apache.org/jira/browse/ARROW-9062 Project: Apache Arrow Issue Type: Sub-task Reporter: Sven Wagner-Boysen Currently a JSON reader built from a schema that uses a dictionary type for one of its fields will fail with JsonError("struct types are not yet supported") {code:java} let builder = ReaderBuilder::new().with_schema(..) let mut reader: Reader = builder.build::(File::open(path).unwrap()).unwrap(); let rb = reader.next().unwrap() {code} Suggested solution: support reading into a dictionary in the JSON reader: [https://github.com/apache/arrow/blob/master/rust/arrow/src/json/reader.rs#L368] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9061) [Packaging][APT][Yum][GLib] Add Apache Arrow Datasets GLib
Kouhei Sutou created ARROW-9061: --- Summary: [Packaging][APT][Yum][GLib] Add Apache Arrow Datasets GLib Key: ARROW-9061 URL: https://issues.apache.org/jira/browse/ARROW-9061 Project: Apache Arrow Issue Type: Improvement Components: GLib, Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9060) [GLib] Add support for building Apache Arrow Datasets GLib with non-installed Apache Arrow Datasets
Kouhei Sutou created ARROW-9060: --- Summary: [GLib] Add support for building Apache Arrow Datasets GLib with non-installed Apache Arrow Datasets Key: ARROW-9060 URL: https://issues.apache.org/jira/browse/ARROW-9060 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou It's required for packaging: https://travis-ci.org/github/ursa-labs/crossbow/builds/695595159 {noformat} CXX libarrow_dataset_glib_la-scanner.lo scanner.cpp:24:33: fatal error: arrow/util/iterator.h: No such file or directory #include <arrow/util/iterator.h> ^ {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9059) [Rust] Documentation for slicing array data has the wrong sign
Bobby Wagner created ARROW-9059: --- Summary: [Rust] Documentation for slicing array data has the wrong sign Key: ARROW-9059 URL: https://issues.apache.org/jira/browse/ARROW-9059 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Bobby Wagner In the slice_data function in array.rs, the docstring says it panics if offset + length is less than data.len(), but the code actually panics if offset + length is greater than data.len(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9058) [Packaging][wheel] Boost download fails
Kouhei Sutou created ARROW-9058: --- Summary: [Packaging][wheel] Boost download fails Key: ARROW-9058 URL: https://issues.apache.org/jira/browse/ARROW-9058 Project: Apache Arrow Issue Type: Improvement Components: Packaging, Python Reporter: Kouhei Sutou Assignee: Kouhei Sutou https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=12893&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181 {noformat} + curl -sL https://dl.bintray.com/boostorg/release/1.68.0/source/boost_1_68_0.tar.gz -o /boost_1_68_0.tar.gz + tar xf boost_1_68_0.tar.gz tar: This does not look like a tar archive tar: Error exit delayed from previous errors {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9057) Projection should work on InMemoryScan without error
QP Hou created ARROW-9057: - Summary: Projection should work on InMemoryScan without error Key: ARROW-9057 URL: https://issues.apache.org/jira/browse/ARROW-9057 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: QP Hou Assignee: QP Hou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9056) [C++] Aggregation methods for Scalars?
Neal Richardson created ARROW-9056: -- Summary: [C++] Aggregation methods for Scalars? Key: ARROW-9056 URL: https://issues.apache.org/jira/browse/ARROW-9056 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Neal Richardson Fix For: 1.0.0 See discussion on https://github.com/apache/arrow/pull/7308. Many/most would no-op (sum, mean, min, max), but maybe they should exist and not error? Maybe they're not needed, but I could see how you might invoke a function on the result of a previous aggregation or something. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9055) [C++] Add sum/mean kernels for Boolean type
Neal Richardson created ARROW-9055: -- Summary: [C++] Add sum/mean kernels for Boolean type Key: ARROW-9055 URL: https://issues.apache.org/jira/browse/ARROW-9055 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Neal Richardson Fix For: 1.0.0 See https://github.com/apache/arrow/pull/7308 (ARROW-6978) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9054) [C++] Add ScalarAggregateOptions
Neal Richardson created ARROW-9054: -- Summary: [C++] Add ScalarAggregateOptions Key: ARROW-9054 URL: https://issues.apache.org/jira/browse/ARROW-9054 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Neal Richardson Fix For: 1.0.0 See discussion on https://github.com/apache/arrow/pull/7308. MinMax has an option for null behavior, but Sum and Mean do not. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9049) [C++] Add a Result<> returning method for constructing a dictionary
Micah Kornfield created ARROW-9049: -- Summary: [C++] Add a Result<> returning method for constructing a dictionary Key: ARROW-9049 URL: https://issues.apache.org/jira/browse/ARROW-9049 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Micah Kornfield Assignee: Micah Kornfield Dictionary types require a signed integer index type. Today there is a DCHECK that this is the case in the constructor. When reading data from an unknown source it is possible, due to corruption (or user error), that the dictionary index type is not signed. We should add a method that checks for signedness and use it at all system boundaries to validate input data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9050) [Release] Use 1.0.0 as the next version
Kouhei Sutou created ARROW-9050: --- Summary: [Release] Use 1.0.0 as the next version Key: ARROW-9050 URL: https://issues.apache.org/jira/browse/ARROW-9050 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9047) [Rust] Setting 0-bits of a 0-length bitset segfaults
Max Burke created ARROW-9047: Summary: [Rust] Setting 0-bits of a 0-length bitset segfaults Key: ARROW-9047 URL: https://issues.apache.org/jira/browse/ARROW-9047 Project: Apache Arrow Issue Type: Improvement Reporter: Max Burke See PR for details -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9052) [CI][MinGW] Enable Gandiva
Kouhei Sutou created ARROW-9052: --- Summary: [CI][MinGW] Enable Gandiva Key: ARROW-9052 URL: https://issues.apache.org/jira/browse/ARROW-9052 Project: Apache Arrow Issue Type: Improvement Components: C++ - Gandiva, Continuous Integration, GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9053) [Rust] Add sort for lists and structs
Neville Dipale created ARROW-9053: - Summary: [Rust] Add sort for lists and structs Key: ARROW-9053 URL: https://issues.apache.org/jira/browse/ARROW-9053 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Neville Dipale -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9048) [C#] Support Float16
Eric Erhardt created ARROW-9048: --- Summary: [C#] Support Float16 Key: ARROW-9048 URL: https://issues.apache.org/jira/browse/ARROW-9048 Project: Apache Arrow Issue Type: Bug Components: C# Reporter: Eric Erhardt With [https://github.com/dotnet/runtime/issues/936], .NET is getting a `System.Half` type, which is a 16-bit floating point number. Once that type lands in .NET we can implement support for the Float16 type in Arrow. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9051) [GLib] Refer Array related objects from Array
Kouhei Sutou created ARROW-9051: --- Summary: [GLib] Refer Array related objects from Array Key: ARROW-9051 URL: https://issues.apache.org/jira/browse/ARROW-9051 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9046) [C++][R] Put more things in type_fwds
Neal Richardson created ARROW-9046: -- Summary: [C++][R] Put more things in type_fwds Key: ARROW-9046 URL: https://issues.apache.org/jira/browse/ARROW-9046 Project: Apache Arrow Issue Type: Improvement Components: C++, R Reporter: Neal Richardson Assignee: Ben Kietzman Fix For: 1.0.0 Hopefully to reduce compile time. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9045) [C++] Improve and expand Take/Filter benchmarks
Wes McKinney created ARROW-9045: --- Summary: [C++] Improve and expand Take/Filter benchmarks Key: ARROW-9045 URL: https://issues.apache.org/jira/browse/ARROW-9045 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 I'm putting this up as a separate patch for review -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9044) [Go][Packaging] Revisit the license file attachment to the go packages
Krisztian Szucs created ARROW-9044: -- Summary: [Go][Packaging] Revisit the license file attachment to the go packages Key: ARROW-9044 URL: https://issues.apache.org/jira/browse/ARROW-9044 Project: Apache Arrow Issue Type: Improvement Components: Go, Packaging Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 As per https://github.com/apache/arrow/pull/7355#issuecomment-639560475 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9043) [Go] Temporarily copy LICENSE.txt to go/
Wes McKinney created ARROW-9043: --- Summary: [Go] Temporarily copy LICENSE.txt to go/ Key: ARROW-9043 URL: https://issues.apache.org/jira/browse/ARROW-9043 Project: Apache Arrow Issue Type: Improvement Components: Go Reporter: Wes McKinney Fix For: 1.0.0 {{go mod}} needs to find a license file in the root of the Go module. In the future "go mod" may be able to follow symlinks in which case this can be replaced by a symlink. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9042) [C++] Add Subtract and Multiply arithmetic kernels with wrap-around behavior
Krisztian Szucs created ARROW-9042: -- Summary: [C++] Add Subtract and Multiply arithmetic kernels with wrap-around behavior Key: ARROW-9042 URL: https://issues.apache.org/jira/browse/ARROW-9042 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Krisztian Szucs Fix For: 1.0.0 Also avoid undefined behaviour caused by signed integer overflow. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9041) overloaded virtual function "arrow::io::Writable::Write" is only partially overridden in class
Karthikeyan Natarajan created ARROW-9041: Summary: overloaded virtual function "arrow::io::Writable::Write" is only partially overridden in class Key: ARROW-9041 URL: https://issues.apache.org/jira/browse/ARROW-9041 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.15.0 Reporter: Karthikeyan Natarajan The following warnings appear: cpp/build/arrow/install/include/arrow/io/file.h(189): warning: overloaded virtual function "arrow::io::Writable::Write" is only partially overridden in class "arrow::io::MemoryMappedFile" cpp/build/arrow/install/include/arrow/io/memory.h(98): warning: overloaded virtual function "arrow::io::Writable::Write" is only partially overridden in class "arrow::io::MockOutputStream" cpp/build/arrow/install/include/arrow/io/memory.h(116): warning: overloaded virtual function "arrow::io::Writable::Write" is only partially overridden in class "arrow::io::FixedSizeBufferWriter" The suggested solution is to add `using Writable::Write;` in the protected/private section. [https://isocpp.org/wiki/faq/strange-inheritance#hiding-rule] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9040) [Python][Parquet]"_ParquetDatasetV2" fail to read with columns and use_pandas_metadata=True
cmsxbc created ARROW-9040: - Summary: [Python][Parquet]"_ParquetDatasetV2" fail to read with columns and use_pandas_metadata=True Key: ARROW-9040 URL: https://issues.apache.org/jira/browse/ARROW-9040 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.17.1 Reporter: cmsxbc When calling _ParquetDatasetV2.read(columns=['column'], use_pandas_metadata=True), "TypeError: unhashable type 'dict'" is raised from {code:java} index_columns = set(_get_pandas_index_columns(metadata)) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
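The failure mode is easy to reproduce in isolation: pandas metadata may describe a RangeIndex with a dict entry in "index_columns", and dicts are not hashable, so building a set from the list raises (the values below are illustrative, not taken from the ticket):

```python
# one string entry and one RangeIndex-style dict entry, as can appear
# in the pandas metadata's "index_columns" list
index_columns = ["__index_level_0__", {"kind": "range", "name": None}]
try:
    set(index_columns)
    error = None
except TypeError as exc:
    error = str(exc)  # "unhashable type: 'dict'"
```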
[jira] [Created] (ARROW-9039) py_bytes created by pyarrow 0.11.1 cannot be deserialized by more recent versions
Yoav Git created ARROW-9039: --- Summary: py_bytes created by pyarrow 0.11.1 cannot be deserialized by more recent versions Key: ARROW-9039 URL: https://issues.apache.org/jira/browse/ARROW-9039 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.15.1, 0.11.1 Environment: python, windows Reporter: Yoav Git I have been saving dataframes into mongodb using: {{import pandas as pd; import pyarrow as pa}} {{df = pd.DataFrame([[1,2,3],[2,3,4]], columns = ['x','y','z'])}} {{byte = pa.serialize(df).to_buffer().to_pybytes()}} and then reading back using: {{df = pa.deserialize(pa.py_buffer(memoryview(byte)))}} However, pyarrow serialization is not compatible across versions: 0.11.1 and 0.15.1 can each read the pybytes they created themselves, but they cannot read each other's. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9038) [C++] Improve BitBlockCounter
Yibo Cai created ARROW-9038: --- Summary: [C++] Improve BitBlockCounter Key: ARROW-9038 URL: https://issues.apache.org/jira/browse/ARROW-9038 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Yibo Cai ARROW-9029 implements BitBlockCounter. There are opportunities to improve popcount performance per this review comment: https://github.com/apache/arrow/pull/7346#discussion_r435005226 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9037) [C++/C-ABI] unable to import array with null count == -1 (which could be exported)
Zhuo Peng created ARROW-9037: Summary: [C++/C-ABI] unable to import array with null count == -1 (which could be exported) Key: ARROW-9037 URL: https://issues.apache.org/jira/browse/ARROW-9037 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.17.1 Reporter: Zhuo Peng If an Array is created with null_count == -1 but without any null (and thus no null bitmap buffer), then ArrayData.null_count will remain -1 when exporting if null_count is never computed. The exported C struct also has null_count == -1 [1]. But when importing, if null_count != 0, an error [2] will be raised. [1] https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L560 [2] https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L1404 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9036) Null pointer exception when caching data frames
Gaurangi Saxena created ARROW-9036: -- Summary: Null pointer exception when caching data frames Key: ARROW-9036 URL: https://issues.apache.org/jira/browse/ARROW-9036 Project: Apache Arrow Issue Type: Bug Components: Java Affects Versions: 0.16.0 Reporter: Gaurangi Saxena I get an NPE when I try to cache a DataFrame in Spark with Arrow as the read format. Stack trace: java.lang.NullPointerException at org.apache.arrow.vector.ipc.ReadChannel.readFully(ReadChannel.java:61) at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:649) at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:106) at com.google.cloud.spark.bigquery.ArrowBinaryIterator$ArrowReaderIterator.hasNext(ArrowBinaryIterator.scala:84) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9035) 8 vs 64 byte alignment
Anthony Abate created ARROW-9035: Summary: 8 vs 64 byte alignment Key: ARROW-9035 URL: https://issues.apache.org/jira/browse/ARROW-9035 Project: Apache Arrow Issue Type: Bug Components: C++, Documentation Affects Versions: 0.17.0 Reporter: Anthony Abate I used the C++ library to create a very small Arrow file (1 field of 5 int32 values) and was surprised that the buffers are not aligned to 64 bytes as described in the documentation section "Buffer Alignment and Padding". Based on the examples there, the 20 bytes of int32 data should be padded to 64 bytes, but it is only padded to 24 (see the extracted message metadata below, where bodyLength = 24). {code:java} { version: "V4", header_type: "RecordBatch", header: { nodes: [ { length: 5, null_count: 0 } ], buffers: [ { offset: 0, length: 0 }, { offset: 0, length: 20 } ] }, bodyLength: 24 } {code} Further down, the documentation section "Encapsulated message format" says serialization should use 8 byte alignment. These two statements seem at odds with each other and some clarification is needed. Is the documentation wrong? Or should 8 byte alignment be used for the file format and 64 byte for IPC? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9034) [C++] Implement binary (two bitmap) version of BitBlockCounter
Wes McKinney created ARROW-9034: --- Summary: [C++] Implement binary (two bitmap) version of BitBlockCounter Key: ARROW-9034 URL: https://issues.apache.org/jira/browse/ARROW-9034 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 The current BitBlockCounter from ARROW-9029 is useful for unary operations. Some operations involve multiple bitmaps, so it is useful to be able to determine the block popcounts of the AND of the respective words in the bitmaps. Each returned block would then contain the number of bits that are set in both bitmaps at the same locations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9033) [Python] Add tests to verify that one can build a C++ extension against the manylinux1 wheels
Wes McKinney created ARROW-9033: --- Summary: [Python] Add tests to verify that one can build a C++ extension against the manylinux1 wheels Key: ARROW-9033 URL: https://issues.apache.org/jira/browse/ARROW-9033 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Some projects want to be able to use the Python wheels to build other Python packages with C++ extensions that need to link against libarrow.so. It would be great if someone would add automated tests to ensure that our wheel builds can be used successfully in this fashion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9032) [C++] Split arrow/util/bit_util.h into multiple header files
Wes McKinney created ARROW-9032: --- Summary: [C++] Split arrow/util/bit_util.h into multiple header files Key: ARROW-9032 URL: https://issues.apache.org/jira/browse/ARROW-9032 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 This header has grown quite large and any given compilation unit's use of it is likely limited to only a couple of functions or classes. I suspect it would improve compilation time to split up this header into a few headers organized by frequency of code use. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9031) [R] Implement conversion from Type::UINT64 to R vector
Wes McKinney created ARROW-9031: --- Summary: [R] Implement conversion from Type::UINT64 to R vector Key: ARROW-9031 URL: https://issues.apache.org/jira/browse/ARROW-9031 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Wes McKinney Fix For: 1.0.0 This case is not handled in array_to_vector.cpp -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9030) [Python] Clean up some usages of pyarrow.compat, move some common functions/symbols to lib.pyx
Wes McKinney created ARROW-9030: --- Summary: [Python] Clean up some usages of pyarrow.compat, move some common functions/symbols to lib.pyx Key: ARROW-9030 URL: https://issues.apache.org/jira/browse/ARROW-9030 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney I started doing this while looking into ARROW-4633 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9029) [C++] Implement BitmapScanner interface to accelerate processing of mostly-not-null data
Wes McKinney created ARROW-9029: --- Summary: [C++] Implement BitmapScanner interface to accelerate processing of mostly-not-null data Key: ARROW-9029 URL: https://issues.apache.org/jira/browse/ARROW-9029 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 In analytics, it is common for data to be all not-null or mostly not-null; data with > 50% nulls tends to be more exceptional. In this light, our {{BitmapReader}} class, which allows iteration of each bit in a bitmap, can be wasteful for mostly-set validity bitmaps. I propose instead a new interface for use in kernel implementations, for lack of a better term {{BitmapScanner}}. This works as follows: * Uses popcount to accumulate consecutive 64-bit words from a bitmap where all values are set, up to some limit (e.g. anywhere from 8 to 128 words -- we can use benchmarks to determine what is a good limit). The length of this "all-on" run is returned to the caller in a single function call, so that this run of data can be processed without any bit-by-bit bitmap checking * If a word containing unset bits is encountered, the scanner will similarly accumulate non-full words until the next full word is encountered or a limit is hit. The length of this "has nulls" run is returned to the caller, which then proceeds bit-by-bit to process the data For data with a lot of nulls, this may degrade performance somewhat, but probably not that much empirically. However, data that is mostly-not-null should benefit from this. This BitmapScanner utility can probably also be used to accelerate the implementation of Filter for mostly-not-null data -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9028) [R] Should be able to convert an empty table
Francois Saint-Jacques created ARROW-9028: - Summary: [R] Should be able to convert an empty table Key: ARROW-9028 URL: https://issues.apache.org/jira/browse/ARROW-9028 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)