[jira] [Created] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
Frank Du created ARROW-8141:
---

Summary: [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API
Key: ARROW-8141
URL: https://issues.apache.org/jira/browse/ARROW-8141
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Frank Du
Attachments: image-2020-03-18-11-08-38-201.png

We are running benchmarks on the Arrow AVX512 build, and perf shows unpack1_32 as the major hotspot for the BM_PlainDecodingBoolean indicator. Implementing this function with intrinsics shows big improvements. See the results below on a CLX 8280 CPU, which supports AVX512.

|Benchmark|Indicator|Unit|default sse build|avx512 build|avx512 build + Intrinsics|Intrinsics improvement (ratio vs. avx512 build)|
|parquet-encoding-benchmark|BM_PlainDecodingBoolean/1024|G/s|1.55394|3.77701|5.02805|1.331224964|
|parquet-encoding-benchmark|BM_PlainDecodingBoolean/4096|G/s|1.83472|5.3826|8.3443|1.550235945|
|parquet-encoding-benchmark|BM_PlainDecodingBoolean/32768|G/s|2.00957|6.1258|10.3793|1.694358288|
|parquet-encoding-benchmark|BM_PlainDecodingBoolean/65536|G/s|2.02249|6.20035|10.5778|1.706000468|

-- This message was sent by Atlassian Jira (v8.3.4#803005)
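For reference, the last column of the table appears to be the throughput of the "avx512 build + Intrinsics" run divided by the throughput of the plain "avx512 build" run. A minimal Python sketch that recomputes it from the numbers quoted in the issue; the dictionary layout and the function name are illustrative assumptions and not part of the benchmark itself:

```python
# Throughput in G/s, copied from the issue's table:
# (default sse build, avx512 build, avx512 build + Intrinsics)
RESULTS = {
    "BM_PlainDecodingBoolean/1024": (1.55394, 3.77701, 5.02805),
    "BM_PlainDecodingBoolean/4096": (1.83472, 5.3826, 8.3443),
    "BM_PlainDecodingBoolean/32768": (2.00957, 6.1258, 10.3793),
    "BM_PlainDecodingBoolean/65536": (2.02249, 6.20035, 10.5778),
}


def intrinsics_speedup(avx512, avx512_intrinsics):
    """Ratio of the intrinsics run to the plain avx512 run (assumed definition)."""
    return avx512_intrinsics / avx512


if __name__ == "__main__":
    for name, (sse, avx512, intrinsics) in RESULTS.items():
        ratio = intrinsics_speedup(avx512, intrinsics)
        # Also show the overall gain relative to the default sse build.
        print(f"{name}: {ratio:.9f}x over avx512, {intrinsics / sse:.2f}x over sse")
```

Under that reading, the gain from the intrinsics grows with the batch size, from roughly 1.33x at 1024 values to roughly 1.71x at 65536 values.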
[C++] Arrow toolchains and developer tools upgraded to LLVM 8
Letting everyone know that, now that https://github.com/apache/arrow/pull/6266 has been merged, everyone doing C++ development needs to update from the clang*-7 tools to clang*-8, especially clang-format-8, for patches to pass lint checks.

Thanks to Jun, Kou, and Prudhvi for their teamwork on the LLVM upgrade.

- Wes
[jira] [Created] (ARROW-8140) [Developer] Follow NullType -> NullField change
Kouhei Sutou created ARROW-8140:
---

Summary: [Developer] Follow NullType -> NullField change
Key: ARROW-8140
URL: https://issues.apache.org/jira/browse/ARROW-8140
Project: Apache Arrow
Issue Type: Improvement
Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou

The lint CI job has been failing since ARROW-8101 was merged, because ARROW-8101 uses the old class name (NullType). The class was renamed to NullField by ARROW-2255.
[jira] [Created] (ARROW-8139) [C++] FileSystem enum causes attributes warning
Neal Richardson created ARROW-8139:
--

Summary: [C++] FileSystem enum causes attributes warning
Key: ARROW-8139
URL: https://issues.apache.org/jira/browse/ARROW-8139
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Neal Richardson
Assignee: Neal Richardson
Fix For: 0.17.0

See e.g. https://github.com/apache/arrow/runs/512427577?check_suite_focus=true#step:7:996

{code}
In file included from /arrow/r/check/arrow.Rcheck/00_pkg_src/arrow/libarrow/arrow-0.16.0.9000/include/arrow/dataset/discovery.h:31:0,
                 from /arrow/r/check/arrow.Rcheck/00_pkg_src/arrow/libarrow/arrow-0.16.0.9000/include/arrow/dataset/api.h:21,
                 from ./arrow_types.h:203,
                 from array_to_vector.cpp:18:
/arrow/r/check/arrow.Rcheck/00_pkg_src/arrow/libarrow/arrow-0.16.0.9000/include/arrow/filesystem/filesystem.h:65:1: warning: type attributes ignored after type is already defined [-Wattributes]
{code}

This isn't new, but I've been staring at the R Linux builds a lot and wanted to clean this up.
[jira] [Created] (ARROW-8138) parquet::arrow::FileReader cannot read multiple RowGroup
Feng Tian created ARROW-8138:

Summary: parquet::arrow::FileReader cannot read multiple RowGroup
Key: ARROW-8138
URL: https://issues.apache.org/jira/browse/ARROW-8138
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 0.16.0
Environment: Centos 7
Reporter: Feng Tian

When using parquet::arrow::FileReader to read a Parquet file consisting of multiple row groups, calling reader->RowGroup(i)->Column(c)->Read repeatedly returns the data of the first row group.
[jira] [Created] (ARROW-8137) [C++][Dataset] Investigate multithreaded discovery
Ben Kietzman created ARROW-8137:
---

Summary: [C++][Dataset] Investigate multithreaded discovery
Key: ARROW-8137
URL: https://issues.apache.org/jira/browse/ARROW-8137
Project: Apache Arrow
Issue Type: Improvement
Components: C++, C++ - Dataset
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Fix For: 1.0.0

Currently FileSystemDatasetFactory inspects all files serially. For slow file systems, or file systems which support batched reads, this could be accelerated by inspecting files in parallel.
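To illustrate the idea of parallel inspection in general terms, here is a minimal Python sketch using a thread pool. It is not the Arrow C++ implementation: inspect_file is a hypothetical stand-in that only reads a small header and the file size, whereas the real factory inspects file schemas.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def inspect_file(path):
    """Hypothetical per-file inspection: read a 4-byte header and the file size."""
    with open(path, "rb") as f:
        header = f.read(4)
    return path, header, os.path.getsize(path)


def inspect_serial(paths):
    """Baseline: inspect one file after another."""
    return [inspect_file(p) for p in paths]


def inspect_parallel(paths, max_workers=8):
    """Inspect files concurrently; map() keeps results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(inspect_file, paths))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(4):
            path = os.path.join(tmp, f"part-{i}.bin")
            with open(path, "wb") as f:
                f.write(b"PAR1" + bytes(i))
            paths.append(path)
        # Both strategies should report the same per-file information.
        print(inspect_parallel(paths) == inspect_serial(paths))
```

Threads help here mainly when each inspection is dominated by I/O latency, which is the slow-file-system case the issue describes.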
[jira] [Created] (ARROW-8136) [C++][Python] Creating dataset from relative path no longer working
Joris Van den Bossche created ARROW-8136:

Summary: [C++][Python] Creating dataset from relative path no longer working
Key: ARROW-8136
URL: https://issues.apache.org/jira/browse/ARROW-8136
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Reporter: Joris Van den Bossche
Fix For: 0.17.0

Since https://github.com/apache/arrow/pull/6597, local relative paths don't work anymore:

{code}
In [1]: import pyarrow.dataset as ds

In [2]: ds.dataset("test.parquet")
---
ArrowInvalid                              Traceback (most recent call last)
 in
----> 1 ds.dataset("test.parquet")

~/scipy/repos/arrow/python/pyarrow/dataset.py in dataset(paths_or_factories, filesystem, partitioning, format)
    327
    328     if isinstance(paths_or_factories, str):
--> 329         return factory(paths_or_factories, **kwargs).finish()
    330
    331     if not isinstance(paths_or_factories, list):

~/scipy/repos/arrow/python/pyarrow/dataset.py in factory(path_or_paths, filesystem, partitioning, format)
    246     factories = []
    247     for path in path_or_paths:
--> 248         fs, paths_or_selector = _ensure_fs_and_paths(path, filesystem)
    249         factories.append(FileSystemDatasetFactory(fs, paths_or_selector,
    250                                                    format, options))

~/scipy/repos/arrow/python/pyarrow/dataset.py in _ensure_fs_and_paths(path, filesystem)
    165     from pyarrow.fs import FileType, FileSelector
    166
--> 167     filesystem, path = _ensure_fs(filesystem, _stringify_path(path))
    168     infos = filesystem.get_target_infos([path])[0]
    169     if infos.type == FileType.Directory:

~/scipy/repos/arrow/python/pyarrow/dataset.py in _ensure_fs(filesystem, path)
    158     if filesystem is not None:
    159         return filesystem, path
--> 160     return FileSystem.from_uri(path)
    161
    162

~/scipy/repos/arrow/python/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.from_uri()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: URI has empty scheme: 'test.parquet'
{code}

[~apitrou] Is this something that should be fixed in {{FileSystemFromUriOrPath}}, or rather on the Python side? ({{FileSystem.from_uri}} makes sure to use the absolute path for pathlib objects, but not for strings.)
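One possible shape of a Python-side workaround, sketched with the standard library only. normalize_dataset_path is a hypothetical helper for illustration, not pyarrow's actual fix: it leaves strings that already carry a URI scheme untouched and turns everything else into an absolute local path, mirroring what the issue says is already done for pathlib objects.

```python
import os
from urllib.parse import urlparse


def normalize_dataset_path(path):
    """Return `path` unchanged if it has a URI scheme, else an absolute local path.

    Hypothetical helper, for illustration only.
    Caveat: a Windows path such as "C:\\data" parses with scheme "c",
    so a real implementation would need to special-case drive letters.
    """
    if urlparse(path).scheme:
        return path
    return os.path.abspath(path)


if __name__ == "__main__":
    # A bare relative file name becomes an absolute path under the current directory.
    print(normalize_dataset_path("test.parquet"))
    # A string with an explicit scheme is passed through as-is.
    print(normalize_dataset_path("s3://bucket/test.parquet"))
```

Whether such normalization belongs in Python or in {{FileSystemFromUriOrPath}} is exactly the open question raised above; the sketch only shows that the check itself is small.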
[NIGHTLY] Arrow Build Report for Job nightly-2020-03-17-0
Arrow Build Report for Job nightly-2020-03-17-0

All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0

Failed Tasks:
- conda-win-vs2015-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-win-vs2015-py38
- gandiva-jar-trusty:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-gandiva-jar-trusty
- test-conda-cpp-valgrind:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-cpp-valgrind
- test-conda-python-3.7-turbodbc-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-turbodbc-master
- wheel-osx-cp35m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-wheel-osx-cp35m
- wheel-osx-cp36m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-wheel-osx-cp36m
- wheel-osx-cp37m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-wheel-osx-cp37m
- wheel-osx-cp38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-wheel-osx-cp38

Succeeded Tasks:
- centos-6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-github-centos-6
- centos-7:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-github-centos-7
- centos-8:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-github-centos-8
- conda-linux-gcc-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-osx-clang-py38
- debian-buster:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-github-debian-buster
- debian-stretch:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-github-debian-stretch
- gandiva-jar-osx:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-gandiva-jar-osx
- homebrew-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-homebrew-cpp
- macos-r-autobrew:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-macos-r-autobrew
- test-conda-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-cpp
- test-conda-python-3.6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-kartothek-master
- test-conda-python-3.7-pandas-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-pandas-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-spark-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-spark-master
- test-conda-python-3.7:
  URL:
[jira] [Created] (ARROW-8135) Problem importing PyArrow on a cluster
Matej Murin created ARROW-8135:
--

Summary: Problem importing PyArrow on a cluster
Key: ARROW-8135
URL: https://issues.apache.org/jira/browse/ARROW-8135
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 0.16.0
Environment: Linux, RedHat CentOS 7
Reporter: Matej Murin

Hi, when I try to import pyarrow in Python, I get the following error:

*File "", line 1, in *
*File "/services/matejm/anaconda3/lib/python3.7/site-packages/pyarrow/__init__.py", line 49, in *
*from pyarrow.lib import cpu_count, set_cpu_count*
*ImportError: libaws-cpp-sdk-s3.so: cannot open shared object file: No such file or directory*

What could this be related to? I have searched everywhere I could and did not find any reason for it, so I figured I might as well ask here.

Thank you very much
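The ImportError says the dynamic loader could not open libaws-cpp-sdk-s3.so. As a first diagnostic step, one can ask the loader directly whether it is able to find that library, independently of pyarrow. A minimal Python sketch using only the standard library; can_load_shared_library is a hypothetical helper name, and the check does not explain why the library is missing:

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open `name`, False otherwise."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the loader cannot find or open the shared object.
        return False
    return True


if __name__ == "__main__":
    lib = "libaws-cpp-sdk-s3.so"  # library named in the ImportError above
    print(lib, "loadable:", can_load_shared_library(lib))
```

If this reports that the library is not loadable on the cluster nodes, the problem lies in the environment (a missing package or a library search path that differs between machines) rather than in the import statement itself.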