[jira] [Created] (ARROW-7558) [Packaging][deb][RPM] Use the host owner and group for artifacts
Kouhei Sutou created ARROW-7558:
---
Summary: [Packaging][deb][RPM] Use the host owner and group for artifacts
Key: ARROW-7558
URL: https://issues.apache.org/jira/browse/ARROW-7558
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7557) [C++][Compute] Validate sorting stability in random test
Yibo Cai created ARROW-7557:
---
Summary: [C++][Compute] Validate sorting stability in random test
Key: ARROW-7557
URL: https://issues.apache.org/jira/browse/ARROW-7557
Project: Apache Arrow
Issue Type: Improvement
Components: C++ - Compute
Reporter: Yibo Cai
Assignee: Yibo Cai

The sorting kernel unit test doesn't validate sorting stability in its random test [1]. It should assert "lhs < rhs" when "array.Value(lhs) == array.Value(rhs)".

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices_test.cc#L112-L121
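The check proposed in ARROW-7557 can be illustrated outside the C++ test. The following is only a minimal Python sketch of the idea, not the Arrow test code: `assert_stable_sort` is a hypothetical helper name, and `indices` stands in for the output of a sort-to-indices kernel.

```python
def assert_stable_sort(values, indices):
    """Check that `indices` is a stable ascending sort permutation of `values`.

    Hypothetical helper for illustration; not part of Arrow.
    """
    # `indices` must be a permutation of 0..len(values)-1.
    assert sorted(indices) == list(range(len(values))), "not a permutation"
    for lhs, rhs in zip(indices, indices[1:]):
        # Adjacent entries must be in non-decreasing value order.
        assert values[lhs] <= values[rhs], "not sorted"
        # Stability: equal values must keep their original relative order.
        if values[lhs] == values[rhs]:
            assert lhs < rhs, "not stable"


values = [3, 1, 3, 2, 1]
# Python's built-in sort is stable, so this permutation passes the check.
stable = sorted(range(len(values)), key=values.__getitem__)
assert_stable_sort(values, stable)
```

Comparing only adjacent pairs is enough here: once the output is sorted, equal values are contiguous, so "lhs < rhs" for every adjacent equal pair implies it for all equal pairs.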
[jira] [Created] (ARROW-7556) Performance regression in pyarrow-0.15.1 vs pyarrow-0.12.1 when reading a "partitioned parquet table" ?
Julien MASSIOT created ARROW-7556:
-
Summary: Performance regression in pyarrow-0.15.1 vs pyarrow-0.12.1 when reading a "partitioned parquet table" ?
Key: ARROW-7556
URL: https://issues.apache.org/jira/browse/ARROW-7556
Project: Apache Arrow
Issue Type: Bug
Components: Benchmarking, Python
Affects Versions: 0.15.1
Reporter: Julien MASSIOT
Attachments: load_df_pyarrow-0-15.1.png, load_df_pyarrow-0.12.1.png, load_pyarrow_0.12.1.cprof, load_pyarrow_0.15.1.cprof

Hi,

I am currently running a small test with pyarrow to load 2 "partitioned" parquet tables.

Loading appears to be about 2 times slower with pyarrow-0.15.1 than with pyarrow-0.12.1.

In my test:
* the 'parquet' tables I am loading are called 'reports' & 'memory'
* they have been generated through pandas.to_parquet by specifying the partition columns
* they are both partitioned by 2 columns, 'p_type' and 'p_start'
* they are small tables:
** reports
*** 90 partitions (1 parquet file / partition)
*** total size: 6.2MB
** memory
*** 105 partitions (1 parquet file / partition)
*** total size: 9.1MB

Here is the code of my simple test that tries to read them (I'm using a filter on the p_start partition):

{code:java}
import os
import sys
import time

import pyarrow
from pyarrow.parquet import ParquetDataset


def load_dataframe(data_dir, table, start_date, end_date):
    # Read only the partitions whose p_start lies in [start_date, end_date].
    return ParquetDataset(os.path.join(data_dir, table),
                          filters=[('p_start', '>=', start_date),
                                   ('p_start', '<=', end_date)
                                   ]).read().to_pandas()


print(f'pyarrow version;{pyarrow.__version__}')

data_dir = sys.argv[1]

# 9 iterations; each one loads both tables and prints the elapsed time.
for i in range(1, 10):
    start = time.time()
    start_date = '20191223'
    end_date = '20200108'
    load_dataframe(data_dir, 'reports', start_date, end_date)
    load_dataframe(data_dir, 'memory', start_date, end_date)
    print(f'loaded;in;{time.time()-start}')
{code}

Here are the results:

* with pyarrow-0.12.1

$ python -m cProfile -o load_pyarrow_0.12.1.cprof load_df_from_pyarrow.py parquet/
pyarrow version;0.12.1
loaded;in;0.5566098690032959
loaded;in;0.32605648040771484
loaded;in;0.28951501846313477
loaded;in;0.29279112815856934
loaded;in;0.3474299907684326
loaded;in;0.4075736999511719
loaded;in;0.425199031829834
loaded;in;0.34653329849243164
loaded;in;0.300839900970459

(~350ms to load the 2 tables)

!load_df_pyarrow-0.12.1.png!

* with pyarrow-0.15.1

$ python -m cProfile -o load_pyarrow_0.15.1.cprof load_df_from_pyarrow.py parquet/
pyarrow version;0.15.1
loaded;in;1.1126022338867188
loaded;in;0.8931224346160889
loaded;in;1.3298325538635254
loaded;in;0.8584625720977783
loaded;in;0.9232609272003174
loaded;in;1.0619215965270996
loaded;in;0.8619768619537354
loaded;in;0.8686420917510986
loaded;in;1.1183602809906006

(>800ms to load the 2 tables)

!load_df_pyarrow-0-15.1.png!

Is there a performance regression here? Am I missing something?

The 2 .cprof files are attached.

[^load_pyarrow_0.12.1.cprof]
[^load_pyarrow_0.15.1.cprof]
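The two sets of timings reported in ARROW-7556 can be summarized with a few lines of standard-library Python. This is only a sketch over the nine numbers printed by each run; the variable names are illustrative and the choice of the median as the summary statistic is an assumption, not something taken from the report.

```python
from statistics import median

# Wall-clock timings in seconds, copied from the two runs in the report.
t_0_12_1 = [
    0.5566098690032959, 0.32605648040771484, 0.28951501846313477,
    0.29279112815856934, 0.3474299907684326, 0.4075736999511719,
    0.425199031829834, 0.34653329849243164, 0.300839900970459,
]
t_0_15_1 = [
    1.1126022338867188, 0.8931224346160889, 1.3298325538635254,
    0.8584625720977783, 0.9232609272003174, 1.0619215965270996,
    0.8619768619537354, 0.8686420917510986, 1.1183602809906006,
]

m_old = median(t_0_12_1)
m_new = median(t_0_15_1)
slowdown = m_new / m_old

print(f"median with 0.12.1: {m_old * 1000:.0f} ms")
print(f"median with 0.15.1: {m_new * 1000:.0f} ms")
print(f"slowdown (ratio of medians): {slowdown:.2f}x")
# Every 0.15.1 iteration is slower than the slowest 0.12.1 iteration.
print(f"ranges do not overlap: {min(t_0_15_1) > max(t_0_12_1)}")
```

The median is less sensitive than the mean to the slower first iteration of the 0.12.1 run, which presumably includes warm-up cost.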
[NIGHTLY] Arrow Build Report for Job nightly-2020-01-12-0
Arrow Build Report for Job nightly-2020-01-12-0

All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0

Failed Tasks:
- conda-osx-clang-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-osx-clang-py38
- gandiva-jar-osx:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-travis-gandiva-jar-osx
- test-conda-python-3.7-pandas-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7-pandas-master
- wheel-manylinux1-cp37m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-wheel-manylinux1-cp37m

Succeeded Tasks:
- centos-6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-centos-6
- centos-7:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-centos-7
- centos-8:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-centos-8
- conda-linux-gcc-py27:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-linux-gcc-py27
- conda-linux-gcc-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py27:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-osx-clang-py27
- conda-osx-clang-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-osx-clang-py37
- conda-win-vs2015-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-win-vs2015-py38
- debian-buster:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-debian-buster
- debian-stretch:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-debian-stretch
- gandiva-jar-trusty:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-travis-gandiva-jar-trusty
- homebrew-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-travis-homebrew-cpp
- macos-r-autobrew:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-travis-macos-r-autobrew
- test-conda-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-cpp
- test-conda-python-2.7-pandas-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-2.7-pandas-latest
- test-conda-python-2.7:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-2.7
- test-conda-python-3.6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-pandas-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-spark-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7-spark-master
- test-conda-python-3.7-turbodbc-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.7:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7
- test-conda-python-3.8-dask-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.8-dask-master
-
[jira] [Created] (ARROW-7555) [Python] Drop support for python 2.7
Krisztian Szucs created ARROW-7555:
--
Summary: [Python] Drop support for python 2.7
Key: ARROW-7555
URL: https://issues.apache.org/jira/browse/ARROW-7555
Project: Apache Arrow
Issue Type: Improvement
Reporter: Krisztian Szucs
Fix For: 1.0.0

After the 0.16 release we should consider dropping support for Python 2.7 because it is no longer maintained.