[jira] [Created] (ARROW-7558) [Packaging][deb][RPM] Use the host owner and group for artifacts

2020-01-12 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-7558:
---

 Summary: [Packaging][deb][RPM] Use the host owner and group for artifacts
 Key: ARROW-7558
 URL: https://issues.apache.org/jira/browse/ARROW-7558
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7557) [C++][Compute] Validate sorting stability in random test

2020-01-12 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-7557:
---

 Summary: [C++][Compute] Validate sorting stability in random test
 Key: ARROW-7557
 URL: https://issues.apache.org/jira/browse/ARROW-7557
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Compute
Reporter: Yibo Cai
Assignee: Yibo Cai


The sorting kernel unit test doesn't validate sort stability in its random test. [1]
It should assert "lhs < rhs" whenever "array.Value(lhs) == array.Value(rhs)".

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices_test.cc#L112-L121
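
The proposed check can be sketched in Python for illustration (the actual test is C++; the helper name is_stable_sort_indices is hypothetical and not part of Arrow). For each pair of adjacent positions in the sorted output, equal values must keep their original relative order:

```python
import random


def is_stable_sort_indices(values, indices):
    """Return True if `indices` sorts `values` ascending and is stable.

    Hypothetical helper: `indices` plays the role of the sort-to-indices
    output, and `values[i]` plays the role of array.Value(i).
    """
    for lhs, rhs in zip(indices, indices[1:]):
        if values[lhs] > values[rhs]:
            return False  # output is not sorted
        if values[lhs] == values[rhs] and lhs >= rhs:
            return False  # equal values were reordered: not stable
    return True


if __name__ == "__main__":
    # Random data with many duplicates, so that ties actually occur.
    rng = random.Random(0)
    values = [rng.randrange(5) for _ in range(1000)]
    # Python's sorted() is stable, so this must pass the check.
    indices = sorted(range(len(values)), key=values.__getitem__)
    assert is_stable_sort_indices(values, indices)
```

Drawing the random values from a small range matters here: without duplicates the equality branch is never exercised.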





[jira] [Created] (ARROW-7556) Performance regression in pyarrow-0.15.1 vs pyarrow-0.12.1 when reading a "partitioned parquet table" ?

2020-01-12 Thread Julien MASSIOT (Jira)
Julien MASSIOT created ARROW-7556:
-

 Summary: Performance regression in pyarrow-0.15.1 vs pyarrow-0.12.1 when reading a "partitioned parquet table" ?
 Key: ARROW-7556
 URL: https://issues.apache.org/jira/browse/ARROW-7556
 Project: Apache Arrow
  Issue Type: Bug
  Components: Benchmarking, Python
Affects Versions: 0.15.1
Reporter: Julien MASSIOT
 Attachments: load_df_pyarrow-0-15.1.png, load_df_pyarrow-0.12.1.png, 
load_pyarrow_0.12.1.cprof, load_pyarrow_0.15.1.cprof

Hi,

I am currently running a small test with pyarrow that loads 2 "partitioned" 
parquet tables.

Loading seems to be at least 2 times slower with pyarrow-0.15.1 than with 
pyarrow-0.12.1.

In my test:
 * the 'parquet' tables I am loading are called 'reports' & 'memory'
 * they were generated with pandas.to_parquet, specifying the 
partition columns
 * they are both partitioned by 2 columns, 'p_type' and 'p_start'
 * they are small tables:
 ** reports
 *** 90 partitions (1 parquet file / partition)
 *** total size: 6.2MB
 ** memory
 *** 105 partitions (1 parquet file / partition)
 *** total size: 9.1MB

 

Here is the code of my simple test that tries to read them (I'm using a filter 
on the p_start partition):

 
{code:python}
import os
import sys
import time

import pyarrow
from pyarrow.parquet import ParquetDataset


def load_dataframe(data_dir, table, start_date, end_date):
    # Read only the partitions whose p_start lies in [start_date, end_date].
    return ParquetDataset(os.path.join(data_dir, table),
                          filters=[('p_start', '>=', start_date),
                                   ('p_start', '<=', end_date)
                                   ]).read().to_pandas()


print(f'pyarrow version;{pyarrow.__version__}')

data_dir = sys.argv[1]

# Nine timed iterations; each one loads both tables.
for i in range(1, 10):
    start = time.time()
    start_date = '20191223'
    end_date = '20200108'
    load_dataframe(data_dir, 'reports', start_date, end_date)
    load_dataframe(data_dir, 'memory', start_date, end_date)
    print(f'loaded;in;{time.time()-start}')
{code}
 

 

Here are the results:
 * with pyarrow-0.12.1

$ python -m cProfile -o load_pyarrow_0.12.1.cprof load_df_from_pyarrow.py parquet/

pyarrow version;0.12.1

loaded;in;0.5566098690032959
loaded;in;0.32605648040771484
loaded;in;0.28951501846313477
loaded;in;0.29279112815856934
loaded;in;0.3474299907684326
loaded;in;0.4075736999511719
loaded;in;0.425199031829834
loaded;in;0.34653329849243164
loaded;in;0.300839900970459

(~350ms to load the 2 tables)

 

!load_df_pyarrow-0.12.1.png!

 
 * with pyarrow-0.15.1

 

$ python -m cProfile -o load_pyarrow_0.15.1.cprof load_df_from_pyarrow.py parquet/

pyarrow version;0.15.1
loaded;in;1.1126022338867188
loaded;in;0.8931224346160889
loaded;in;1.3298325538635254
loaded;in;0.8584625720977783
loaded;in;0.9232609272003174
loaded;in;1.0619215965270996
loaded;in;0.8619768619537354
loaded;in;0.8686420917510986
loaded;in;1.1183602809906006

(>800ms to load the 2 tables)

 

!load_df_pyarrow-0-15.1.png!
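
To put a number on the difference, the two runs above can be summarised with a short script. This is only a sketch: the helper parse_timings is a hypothetical name, and the timings are copied verbatim from the output above. The medians come out at roughly 0.35 s for 0.12.1 and 0.92 s for 0.15.1, a slowdown of about 2.7x.

```python
from statistics import mean, median


def parse_timings(output):
    """Extract the seconds from lines of the form 'loaded;in;<seconds>'."""
    return [float(line.rsplit(";", 1)[1])
            for line in output.splitlines()
            if line.startswith("loaded;in;")]


# Output copied from the pyarrow-0.12.1 run.
RUN_0_12_1 = """\
loaded;in;0.5566098690032959
loaded;in;0.32605648040771484
loaded;in;0.28951501846313477
loaded;in;0.29279112815856934
loaded;in;0.3474299907684326
loaded;in;0.4075736999511719
loaded;in;0.425199031829834
loaded;in;0.34653329849243164
loaded;in;0.300839900970459
"""

# Output copied from the pyarrow-0.15.1 run.
RUN_0_15_1 = """\
loaded;in;1.1126022338867188
loaded;in;0.8931224346160889
loaded;in;1.3298325538635254
loaded;in;0.8584625720977783
loaded;in;0.9232609272003174
loaded;in;1.0619215965270996
loaded;in;0.8619768619537354
loaded;in;0.8686420917510986
loaded;in;1.1183602809906006
"""

old = parse_timings(RUN_0_12_1)
new = parse_timings(RUN_0_15_1)

print(f"0.12.1: median {median(old):.3f}s, mean {mean(old):.3f}s")
print(f"0.15.1: median {median(new):.3f}s, mean {mean(new):.3f}s")
print(f"median slowdown: {median(new) / median(old):.2f}x")
```

The median is used alongside the mean because the first iteration of each run is noticeably slower than the rest, presumably a warm-up effect.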

 

Is there a performance regression here?

Am I missing something?

The 2 .cprof files are attached.

[^load_pyarrow_0.12.1.cprof]

[^load_pyarrow_0.15.1.cprof]
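
For anyone who wants to inspect the attached profiles, a minimal sketch using the standard-library pstats module follows. The function name cumulative_report and the limit of 15 entries are arbitrary choices for illustration. It prints the functions with the highest cumulative time for each profile passed on the command line, so the two versions can be compared side by side.

```python
import io
import pstats
import sys


def cumulative_report(profile_path, limit=15):
    """Return a text report of the `limit` entries with the most cumulative time."""
    buf = io.StringIO()
    stats = pstats.Stats(profile_path, stream=buf)
    # strip_dirs() shortens file paths; "cumulative" sorts by cumtime.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buf.getvalue()


if __name__ == "__main__":
    # Example: python report.py load_pyarrow_0.12.1.cprof load_pyarrow_0.15.1.cprof
    for path in sys.argv[1:]:
        print(f"=== {path} ===")
        print(cumulative_report(path))
```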





[NIGHTLY] Arrow Build Report for Job nightly-2020-01-12-0

2020-01-12 Thread Crossbow


Arrow Build Report for Job nightly-2020-01-12-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0

Failed Tasks:
- conda-osx-clang-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-osx-clang-py38
- gandiva-jar-osx:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-travis-gandiva-jar-osx
- test-conda-python-3.7-pandas-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7-pandas-master
- wheel-manylinux1-cp37m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-wheel-manylinux1-cp37m

Succeeded Tasks:
- centos-6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-centos-6
- centos-7:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-centos-7
- centos-8:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-centos-8
- conda-linux-gcc-py27:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-linux-gcc-py27
- conda-linux-gcc-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py27:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-osx-clang-py27
- conda-osx-clang-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-osx-clang-py37
- conda-win-vs2015-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-conda-win-vs2015-py38
- debian-buster:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-debian-buster
- debian-stretch:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-azure-debian-stretch
- gandiva-jar-trusty:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-travis-gandiva-jar-trusty
- homebrew-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-travis-homebrew-cpp
- macos-r-autobrew:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-travis-macos-r-autobrew
- test-conda-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-cpp
- test-conda-python-2.7-pandas-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-2.7-pandas-latest
- test-conda-python-2.7:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-2.7
- test-conda-python-3.6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-pandas-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-spark-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7-spark-master
- test-conda-python-3.7-turbodbc-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.7:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.7
- test-conda-python-3.8-dask-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-12-0-circle-test-conda-python-3.8-dask-master
- 

[jira] [Created] (ARROW-7555) [Python] Drop support for python 2.7

2020-01-12 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-7555:
--

 Summary: [Python] Drop support for python 2.7
 Key: ARROW-7555
 URL: https://issues.apache.org/jira/browse/ARROW-7555
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Krisztian Szucs
 Fix For: 1.0.0


After the 0.16 release we should consider dropping support for Python 2.7 
because it is no longer maintained.




