[jira] [Created] (ARROW-8974) [C++] Refine TransferBitmap template parameters
Yibo Cai created ARROW-8974: --- Summary: [C++] Refine TransferBitmap template parameters Key: ARROW-8974 URL: https://issues.apache.org/jira/browse/ARROW-8974 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Yibo Cai Assignee: Yibo Cai [TransferBitmap|https://github.com/apache/arrow/blob/44e723d9ac7c64739d419ad66618d2d56003d1b7/cpp/src/arrow/util/bit_util.cc#L110] has two template parameters of bool type, giving four combinations. Changing them to function parameters could reduce code size. I think "restore_trailing_bits" cannot impact performance; "invert_bits" needs a benchmark. Also, bool parameters are hard to interpret at the [caller side|https://github.com/apache/arrow/blob/44e723d9ac7c64739d419ad66618d2d56003d1b7/cpp/src/arrow/util/bit_util.cc#L208]; it would be better to use meaningful defines. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8973) [Java] Support batch value appending for large varchar/varbinary vectors
Liya Fan created ARROW-8973: --- Summary: [Java] Support batch value appending for large varchar/varbinary vectors Key: ARROW-8973 URL: https://issues.apache.org/jira/browse/ARROW-8973 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Support appending values in batch for LargeVarCharVector/LargeVarBinaryVector.
[jira] [Created] (ARROW-8972) [Java] Support range value comparison for large varchar/varbinary vectors
Liya Fan created ARROW-8972: --- Summary: [Java] Support range value comparison for large varchar/varbinary vectors Key: ARROW-8972 URL: https://issues.apache.org/jira/browse/ARROW-8972 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan Support comparing a range of values for LargeVarCharVector and LargeVarBinaryVector.
[jira] [Created] (ARROW-8971) upgrade Pip
bindu created ARROW-8971: Summary: upgrade Pip Key: ARROW-8971 URL: https://issues.apache.org/jira/browse/ARROW-8971 Project: Apache Arrow Issue Type: Bug Reporter: bindu Could you please update pip to the latest version, 20.1? [https://github.com/apache/arrow/blob/2688a62f8179f20c20c06a10fcd22fe8a714ae48/python/manylinux1/scripts/requirements.txt] CVE-2018-20225: An issue was discovered in pip (all versions) because it installs the version with the highest version number, even if the user had intended to obtain a private package from a private index. This only affects use of the --extra-index-url option, and exploitation requires that the package does not already exist in the public index (and thus the attacker can put the package there with an arbitrary version number).
[jira] [Created] (ARROW-8970) [C++] Reduce shared library code size (umbrella issue)
Wes McKinney created ARROW-8970: --- Summary: [C++] Reduce shared library code size (umbrella issue) Key: ARROW-8970 URL: https://issues.apache.org/jira/browse/ARROW-8970 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney We're reaching a point where we may need to be careful about decisions that increase code size: * Instantiating too many templates for code that isn't performance sensitive * Inlining functions that don't need to be inline Code size also tends to correlate with compilation times, but not always. I'll use this umbrella issue to organize issues related to reducing compiled code size.
[jira] [Created] (ARROW-8969) [C++] Reduce generated code in compute/kernels/scalar_compare.cc
Wes McKinney created ARROW-8969: --- Summary: [C++] Reduce generated code in compute/kernels/scalar_compare.cc Key: ARROW-8969 URL: https://issues.apache.org/jira/browse/ARROW-8969 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 We are instantiating templates in this module for cases that, byte-wise, do the exact same comparison. For example: * For equals, not_equals, we can use the same 32-bit/64-bit comparison kernels for signed int / unsigned int / floating point types of the same byte width * TimestampType can reuse int64 kernels, similarly for other date/time types * BinaryType/StringType can share kernels etc.
[jira] [Created] (ARROW-8968) [c++][gandiva] Show link warning message on s390x
Kazuaki Ishizaki created ARROW-8968: --- Summary: [c++][gandiva] Show link warning message on s390x Key: ARROW-8968 URL: https://issues.apache.org/jira/browse/ARROW-8968 Project: Apache Arrow Issue Type: Bug Reporter: Kazuaki Ishizaki When executing a Gandiva test, the following warning message is shown: {code} ~/arrow/cpp/src/gandiva$ ../../build/debug/gandiva-binary-test -V Running main() from /home/ishizaki/arrow/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from TestBinary [ RUN ] TestBinary.TestSimple warning: Linking two modules of different data layouts: 'precompiled' is 'E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-a:8:16-n32:64' whereas 'codegen' is 'E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64' [ OK ] TestBinary.TestSimple (41 ms) [--] 1 test from TestBinary (41 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (41 ms total) [ PASSED ] 1 test. {code}
Re: Arrow sync all at 12pm US-Eastern / 16:00 UTC
No, we are just talking about removing static libraries from conda-forge that may be (/have been) used as part of the Arrow build. This shouldn't affect any non-conda Arrow users/developers. Cheers, Uwe On Wed, May 27, 2020, at 6:53 PM, Rémi Dettai wrote: > @Uwe: Just a quick question about the static build, I'm not sure I > understood correctly: are we talking about removing the install step for > the static libraries or the arrow_static target as a whole? > > Le mer. 27 mai 2020 à 18:34, Neal Richardson > a écrit : > > > Attendees: > > Mahmut Bulut > > Projjal Chanda > > Rémi Dettai > > Laurent Goujon > > Andy Grove > > Uwe Korn > > Micah Kornfield > > Wes McKinney > > Rok Mihevc > > Neal Richardson > > François Saint-Jacques > > > > Discussion: > > * patch queue is growing, please review things > > * 1.0 > > * Timeline: targeting July 1 > > * Desire to add forward compatibility changes to format > > * Documentation: opportunity to reposition website, add user guides > > * Integration testing: now in Rust, but questions about which tests are > > running/passing > > * Adding extra checks to Rust for undefined behavior > > * Conda: question about static library usage; also heads up that they're > > consolidating the recipes > > > > On Wed, May 27, 2020 at 8:03 AM Wes McKinney wrote: > > > > > The usual biweekly call will be held at > > > > > > https://meet.google.com/vtm-teks-phx > > > > > > All are welcome. Meeting notes will be posted to the mailing list > > afterword > > > > > >
[jira] [Created] (ARROW-8967) [Python] [Parquet] Table.to_pandas() fails to convert valid TIMESTAMP_MILLIS to pandas timestamp
Mark Waddle created ARROW-8967: -- Summary: [Python] [Parquet] Table.to_pandas() fails to convert valid TIMESTAMP_MILLIS to pandas timestamp Key: ARROW-8967 URL: https://issues.apache.org/jira/browse/ARROW-8967 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.17.0 Reporter: Mark Waddle Reading a parquet file with a valid TIMESTAMP_MILLIS value of -6155291520 (0019-06-20) results in the following error: {noformat} File "pyarrow/array.pxi", line 587, in pyarrow.lib._PandasConvertible.to_pandas File "pyarrow/table.pxi", line 1640, in pyarrow.lib.Table._to_pandas File "/Users/mark/.local/share/virtualenvs/parquetpy-BNIqCtDj/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 766, in table_to_blockmanager blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes) File "/Users/mark/.local/share/virtualenvs/parquetpy-BNIqCtDj/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 1102, in _table_to_blocks list(extension_columns.keys())) File "pyarrow/table.pxi", line 1107, in pyarrow.lib.table_to_blocks File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Casting from timestamp[ms] to timestamp[ns] would result in out of bounds timestamp: -6155291520 {noformat} As it stands there is no way to read this file. I would like to be able to choose the timestamp unit when reading, much like you can when writing.
Re: Arrow sync all at 12pm US-Eastern / 16:00 UTC
@Uwe: Just a quick question about the static build, I'm not sure I understood correctly: are we talking about removing the install step for the static libraries or the arrow_static target as a whole? Le mer. 27 mai 2020 à 18:34, Neal Richardson a écrit : > Attendees: > Mahmut Bulut > Projjal Chanda > Rémi Dettai > Laurent Goujon > Andy Grove > Uwe Korn > Micah Kornfield > Wes McKinney > Rok Mihevc > Neal Richardson > François Saint-Jacques > > Discussion: > * patch queue is growing, please review things > * 1.0 > * Timeline: targeting July 1 > * Desire to add forward compatibility changes to format > * Documentation: opportunity to reposition website, add user guides > * Integration testing: now in Rust, but questions about which tests are > running/passing > * Adding extra checks to Rust for undefined behavior > * Conda: question about static library usage; also heads up that they're > consolidating the recipes > > On Wed, May 27, 2020 at 8:03 AM Wes McKinney wrote: > > > The usual biweekly call will be held at > > > > https://meet.google.com/vtm-teks-phx > > > > All are welcome. Meeting notes will be posted to the mailing list > afterword > > >
Re: Arrow sync all at 12pm US-Eastern / 16:00 UTC
Attendees: Mahmut Bulut Projjal Chanda Rémi Dettai Laurent Goujon Andy Grove Uwe Korn Micah Kornfield Wes McKinney Rok Mihevc Neal Richardson François Saint-Jacques Discussion: * patch queue is growing, please review things * 1.0 * Timeline: targeting July 1 * Desire to add forward compatibility changes to format * Documentation: opportunity to reposition website, add user guides * Integration testing: now in Rust, but questions about which tests are running/passing * Adding extra checks to Rust for undefined behavior * Conda: question about static library usage; also heads up that they're consolidating the recipes On Wed, May 27, 2020 at 8:03 AM Wes McKinney wrote: > The usual biweekly call will be held at > > https://meet.google.com/vtm-teks-phx > > All are welcome. Meeting notes will be posted to the mailing list afterword >
[jira] [Created] (ARROW-8966) [C++] Move arrow::ArrayData to a separate header file
Wes McKinney created ARROW-8966: --- Summary: [C++] Move arrow::ArrayData to a separate header file Key: ARROW-8966 URL: https://issues.apache.org/jira/browse/ARROW-8966 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 There are code modules (such as compute kernels) that only require ArrayData for doing computations, so pulling in all the code in array.h is not necessary. There are probably other code paths that might benefit from this also.
[jira] [Created] (ARROW-8965) [Python][Documentation] Pyarrow documentation for pip nightlies references 404'd location
Erin Ryan created ARROW-8965: Summary: [Python][Documentation] Pyarrow documentation for pip nightlies references 404'd location Key: ARROW-8965 URL: https://issues.apache.org/jira/browse/ARROW-8965 Project: Apache Arrow Issue Type: Task Affects Versions: 0.17.1 Reporter: Erin Ryan The pyarrow documentation gives two options for nightly builds: one for use with anaconda, one for use with pip. While the anaconda command works, the command for pip sends users to [https://repo.fury.io/arrow-nightlies/], a URL which 404s. The Sphinx docs need to be updated with the correct URL, gemfury.com/arrow-nightlies/.
Arrow sync all at 12pm US-Eastern / 16:00 UTC
The usual biweekly call will be held at https://meet.google.com/vtm-teks-phx All are welcome. Meeting notes will be posted to the mailing list afterward
[jira] [Created] (ARROW-8964) Pyarrow: improve reading of partitioned parquet datasets whose schema changed
Ira Saktor created ARROW-8964: - Summary: Pyarrow: improve reading of partitioned parquet datasets whose schema changed Key: ARROW-8964 URL: https://issues.apache.org/jira/browse/ARROW-8964 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.17.1 Environment: Ubuntu 18.04, latest miniconda with python 3.7, pyarrow 0.17.1 Reporter: Ira Saktor Hi there, I'm encountering the following issue when reading from HDFS: *My situation:* I have a partitioned parquet dataset in HDFS, whose recent partitions contain parquet files with more columns than the older ones. When I try to read data using pyarrow.dataset.dataset and filter on recent data, I still get only the columns that are also contained in the old parquet files. I'd like to somehow merge the schema or use the schema from the parquet files from which data ends up being loaded. *When using:* `pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning = 'hive', filters = my_filter_expression).to_table().to_pandas()` Is there a way to handle schema changes so that the read data contains all columns? Everything works fine when I copy the needed parquet files into a separate folder, however it is a very inconvenient way of working.
[NIGHTLY] Arrow Build Report for Job nightly-2020-05-27-0
Arrow Build Report for Job nightly-2020-05-27-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0 Failed Tasks: - conda-linux-gcc-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-azure-conda-linux-gcc-py36 - conda-linux-gcc-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-azure-conda-linux-gcc-py37 - conda-linux-gcc-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-azure-conda-linux-gcc-py38 - conda-osx-clang-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-azure-conda-osx-clang-py36 - conda-osx-clang-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-azure-conda-osx-clang-py37 - conda-osx-clang-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-azure-conda-osx-clang-py38 - conda-win-vs2015-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-azure-conda-win-vs2015-py36 - conda-win-vs2015-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-azure-conda-win-vs2015-py37 - conda-win-vs2015-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-azure-conda-win-vs2015-py38 - homebrew-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-travis-homebrew-cpp - homebrew-r-autobrew: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-travis-homebrew-r-autobrew - test-conda-cpp-valgrind: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-github-test-conda-cpp-valgrind - test-conda-python-3.7-dask-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-github-test-conda-python-3.7-dask-latest - test-conda-python-3.7-spark-master: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-github-test-conda-python-3.7-spark-master - test-conda-python-3.8-dask-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-github-test-conda-python-3.8-dask-master - test-fedora-30-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-circle-test-fedora-30-cpp - test-fedora-30-python-3: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-azure-test-fedora-30-python-3 Succeeded Tasks: - centos-6-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-github-centos-6-amd64 - centos-7-aarch64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-travis-centos-7-aarch64 - centos-7-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-github-centos-7-amd64 - centos-8-aarch64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-travis-centos-8-aarch64 - centos-8-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-github-centos-8-amd64 - debian-buster-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-github-debian-buster-amd64 - debian-buster-arm64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-travis-debian-buster-arm64 - debian-stretch-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-github-debian-stretch-amd64 - debian-stretch-arm64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-travis-debian-stretch-arm64 - gandiva-jar-osx: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-travis-gandiva-jar-osx - gandiva-jar-xenial: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-travis-gandiva-jar-xenial - nuget: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-github-nuget - test-conda-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-github-test-conda-cpp - test-conda-python-3.6-pandas-0.23: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-github-test-conda-python-3.6-pandas-0.23 - test-conda-python-3.6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-github-test-conda-python-3.6 - test-conda-python-3.7-hdfs-2.9.2: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-github-test-conda-python-3.7-hdfs-2.9.2 - test-conda-python-3.7-kartothek-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-27-0-github-test-conda-python-3.7-kartothek-latest - test-conda-python-3.7-kartothek-master: URL:
[jira] [Created] (ARROW-8963) Parquet cpp optimize allocate memory
yiming.xu created ARROW-8963: Summary: Parquet cpp optimize allocate memory Key: ARROW-8963 URL: https://issues.apache.org/jira/browse/ARROW-8963 Project: Apache Arrow Issue Type: Improvement Components: Format Affects Versions: 0.17.1 Reporter: yiming.xu LeafReader::NextBatch should call Reset first; otherwise Reserve will allocate memory twice.
[jira] [Created] (ARROW-8962) [C++] Linking failure with clang-4.0
Uwe Korn created ARROW-8962: --- Summary: [C++] Linking failure with clang-4.0 Key: ARROW-8962 URL: https://issues.apache.org/jira/browse/ARROW-8962 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Uwe Korn Assignee: Uwe Korn {code:java} FAILED: release/arrow-file-to-stream : && /Users/uwe/miniconda3/envs/pyarrow-dev/bin/ccache /Users/uwe/miniconda3/envs/pyarrow-dev/bin/x86_64-apple-darwin13.4.0-clang++ -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -stdlib=libc++ -fvisibility-inlines-hidden -std=c++14 -fmessage-length=0 -Qunused-arguments -fcolor-diagnostics -O3 -DNDEBUG -Wall -Wno-unknown-warning-option -Wno-pass-failed -msse4.2 -O3 -DNDEBUG -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names -Wl,-pie -Wl,-headerpad_max_install_names -Wl,-dead_strip_dylibs src/arrow/ipc/CMakeFiles/arrow-file-to-stream.dir/file_to_stream.cc.o -o release/arrow-file-to-stream release/libarrow.a /usr/local/opt/openssl@1.1/lib/libssl.dylib /usr/local/opt/openssl@1.1/lib/libcrypto.dylib /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libbrotlienc-static.a /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libbrotlidec-static.a /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libbrotlicommon-static.a /Users/uwe/miniconda3/envs/pyarrow-dev/lib/liblz4.dylib /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libsnappy.1.1.7.dylib /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libz.dylib /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libzstd.dylib /Users/uwe/miniconda3/envs/pyarrow-dev/lib/liborc.a /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libprotobuf.dylib jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a && : Undefined symbols for architecture x86_64: "arrow::internal::(anonymous namespace)::StringToFloatConverterImpl::main_junk_value_", referenced from: arrow::internal::StringToFloat(char const*, unsigned long, float*) in 
libarrow.a(value_parsing.cc.o) arrow::internal::StringToFloat(char const*, unsigned long, double*) in libarrow.a(value_parsing.cc.o) "arrow::internal::(anonymous namespace)::StringToFloatConverterImpl::fallback_junk_value_", referenced from: arrow::internal::StringToFloat(char const*, unsigned long, float*) in libarrow.a(value_parsing.cc.o) arrow::internal::StringToFloat(char const*, unsigned long, double*) in libarrow.a(value_parsing.cc.o) ld: symbol(s) not found for architecture x86_64 clang-4.0: error: linker command failed with exit code 1 (use -v to see invocation) {code}