[jira] [Created] (ARROW-8141) [C++] Optimize BM_PlainDecodingBoolean performance using AVX512 Intrinsics API

2020-03-17 Thread Frank Du (Jira)
Frank Du created ARROW-8141:
---

 Summary: [C++] Optimize BM_PlainDecodingBoolean performance using 
AVX512 Intrinsics API
 Key: ARROW-8141
 URL: https://issues.apache.org/jira/browse/ARROW-8141
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Frank Du
 Attachments: image-2020-03-18-11-08-38-201.png

We are running benchmarks on the Arrow AVX512 build, and perf shows unpack1_32 
as the major hotspot for the BM_PlainDecodingBoolean indicator.
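
For readers unfamiliar with the routine, here is a minimal Python sketch of what a 1-bit unpack step computes: one 32-bit word is expanded into 32 separate 0/1 values. The least-significant-bit-first order is an assumption, and this is neither Arrow's C++ code nor the proposed intrinsics version; it only illustrates the operation being optimized.

```python
from typing import List


def unpack1_32(word: int) -> List[int]:
    """Expand one 32-bit word into 32 values of 0 or 1.

    Assumption: bit 0 (least significant) becomes the first output value.
    """
    return [(word >> i) & 1 for i in range(32)]


if __name__ == "__main__":
    # Print the first eight unpacked values of an example word.
    print(unpack1_32(0b1011)[:8])
```

A SIMD version would do the same expansion for many bits per instruction, which is presumably where the reported gain comes from.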

Implementing this function with intrinsics shows big improvements. See below 
the results on a CLX 8280 CPU, which supports AVX512.
|Benchmark|Indicator|Unit|default sse build|avx512 build|avx512 build + Intrinsics|Intrinsics improvement (ratio vs. avx512 build)|
|parquet-encoding-benchmark|BM_PlainDecodingBoolean/1024|G/s|1.55394|3.77701|5.02805|1.331224964|
|parquet-encoding-benchmark|BM_PlainDecodingBoolean/4096|G/s|1.83472|5.3826|8.3443|1.550235945|
|parquet-encoding-benchmark|BM_PlainDecodingBoolean/32768|G/s|2.00957|6.1258|10.3793|1.694358288|
|parquet-encoding-benchmark|BM_PlainDecodingBoolean/65536|G/s|2.02249|6.20035|10.5778|1.706000468|
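
The last column of the table appears to be the throughput of the "avx512 build + Intrinsics" column divided by the throughput of the plain "avx512 build" column. The short Python check below recomputes that ratio from the figures copied out of the table; the variable and function names are illustrative only.

```python
# (input size, avx512 build G/s, avx512 build + Intrinsics G/s, reported ratio)
# All figures are copied from the table above.
rows = [
    (1024, 3.77701, 5.02805, 1.331224964),
    (4096, 5.3826, 8.3443, 1.550235945),
    (32768, 6.1258, 10.3793, 1.694358288),
    (65536, 6.20035, 10.5778, 1.706000468),
]


def speedup(baseline: float, optimized: float) -> float:
    """Return how many times faster the optimized throughput is."""
    return optimized / baseline


if __name__ == "__main__":
    for size, avx512, intrinsics, reported in rows:
        ratio = speedup(avx512, intrinsics)
        print(f"BM_PlainDecodingBoolean/{size}: {ratio:.6f} (reported {reported})")
```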



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[C++] Arrow toolchains and developer tools upgraded to LLVM 8

2020-03-17 Thread Wes McKinney
Letting everyone know that with https://github.com/apache/arrow/pull/6266
merged, everyone doing C++ development needs to update from
clang*-7 tools to clang*-8, especially clang-format-8, for
patches to pass lint checks.

Thanks to Jun, Kou, and Prudhvi for their teamwork on the LLVM upgrade

- Wes


[jira] [Created] (ARROW-8140) [Developer] Follow NullType -> NullField change

2020-03-17 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8140:
---

 Summary: [Developer] Follow NullType -> NullField change
 Key: ARROW-8140
 URL: https://issues.apache.org/jira/browse/ARROW-8140
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


The lint CI job has been failing since ARROW-8101 was merged because ARROW-8101 
uses the old class name (NullType). The class was renamed to NullField by 
ARROW-2255.





[jira] [Created] (ARROW-8139) [C++] FileSystem enum causes attributes warning

2020-03-17 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8139:
--

 Summary: [C++] FileSystem enum causes attributes warning
 Key: ARROW-8139
 URL: https://issues.apache.org/jira/browse/ARROW-8139
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.17.0


See e.g. 
https://github.com/apache/arrow/runs/512427577?check_suite_focus=true#step:7:996

{code}
In file included from 
/arrow/r/check/arrow.Rcheck/00_pkg_src/arrow/libarrow/arrow-0.16.0.9000/include/arrow/dataset/discovery.h:31:0,
 from 
/arrow/r/check/arrow.Rcheck/00_pkg_src/arrow/libarrow/arrow-0.16.0.9000/include/arrow/dataset/api.h:21,
 from ./arrow_types.h:203,
 from array_to_vector.cpp:18:
/arrow/r/check/arrow.Rcheck/00_pkg_src/arrow/libarrow/arrow-0.16.0.9000/include/arrow/filesystem/filesystem.h:65:1:
 warning: type attributes ignored after type is already defined [-Wattributes]
{code}

This isn't new but I've been staring at the R Linux builds a lot and wanted to 
clean this up.





[jira] [Created] (ARROW-8138) parquet::arrow::FileReader cannot read multiple RowGroup

2020-03-17 Thread Feng Tian (Jira)
Feng Tian created ARROW-8138:


 Summary: parquet::arrow::FileReader cannot read multiple RowGroup
 Key: ARROW-8138
 URL: https://issues.apache.org/jira/browse/ARROW-8138
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.16.0
 Environment: Centos 7
Reporter: Feng Tian


When using parquet::arrow::FileReader to read a Parquet file consisting of 
multiple row groups, calling

reader->RowGroup(i)->Column(c)->Read

repeatedly reads the data of the first row group.





[jira] [Created] (ARROW-8137) [C++][Dataset] Investigate multithreaded discovery

2020-03-17 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8137:
---

 Summary: [C++][Dataset] Investigate multithreaded discovery
 Key: ARROW-8137
 URL: https://issues.apache.org/jira/browse/ARROW-8137
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Dataset
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


Currently FileSystemDatasetFactory inspects all files serially. For slow file 
systems or systems that support batched reads, this could be accelerated by 
inspecting files in parallel.
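
As a rough illustration of the idea only (Python rather than the C++ dataset code, and with a hypothetical `inspect_file` standing in for real file inspection), serial and thread-pool-based inspection could be compared like this:

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def inspect_file(path: str) -> int:
    # Hypothetical stand-in for per-file inspection: returns the file size.
    return os.stat(path).st_size


def inspect_serial(paths: List[str]) -> Dict[str, int]:
    # One file after another, as in the current behaviour described above.
    return {path: inspect_file(path) for path in paths}


def inspect_parallel(paths: List[str], max_workers: int = 8) -> Dict[str, int]:
    # pool.map preserves input order, so results line up with paths.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(inspect_file, paths)))
```

Threads mainly help when each inspection waits on I/O, which matches the slow-file-system case mentioned above; whether it pays off in practice is what the issue proposes to investigate.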





[jira] [Created] (ARROW-8136) [C++][Python] Creating dataset from relative path no longer working

2020-03-17 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8136:


 Summary: [C++][Python] Creating dataset from relative path no 
longer working
 Key: ARROW-8136
 URL: https://issues.apache.org/jira/browse/ARROW-8136
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


Since https://github.com/apache/arrow/pull/6597, local relative paths don't 
work anymore:

{code}
In [1]: import pyarrow.dataset as ds  

In [2]: ds.dataset("test.parquet")  
---
ArrowInvalid  Traceback (most recent call last)
 in 
> 1 ds.dataset("test.parquet")

~/scipy/repos/arrow/python/pyarrow/dataset.py in dataset(paths_or_factories, 
filesystem, partitioning, format)
327 
328 if isinstance(paths_or_factories, str):
--> 329 return factory(paths_or_factories, **kwargs).finish()
330 
331 if not isinstance(paths_or_factories, list):

~/scipy/repos/arrow/python/pyarrow/dataset.py in factory(path_or_paths, 
filesystem, partitioning, format)
246 factories = []
247 for path in path_or_paths:
--> 248 fs, paths_or_selector = _ensure_fs_and_paths(path, filesystem)
249 factories.append(FileSystemDatasetFactory(fs, paths_or_selector,
250   format, options))

~/scipy/repos/arrow/python/pyarrow/dataset.py in _ensure_fs_and_paths(path, 
filesystem)
165 from pyarrow.fs import FileType, FileSelector
166 
--> 167 filesystem, path = _ensure_fs(filesystem, _stringify_path(path))
168 infos = filesystem.get_target_infos([path])[0]
169 if infos.type == FileType.Directory:

~/scipy/repos/arrow/python/pyarrow/dataset.py in _ensure_fs(filesystem, path)
158 if filesystem is not None:
159 return filesystem, path
--> 160 return FileSystem.from_uri(path)
161 
162 

~/scipy/repos/arrow/python/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.from_uri()

~/scipy/repos/arrow/python/pyarrow/error.pxi in 
pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: URI has empty scheme: 'test.parquet'

{code}

[~apitrou] Should this be fixed in {{FileSystemFromUriOrPath}} or rather on 
the Python side? ({{FileSystem.from_uri}} makes sure to use the absolute path 
for pathlib objects, but not for strings.)
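
To make the Python-side option concrete, here is a minimal sketch using only the standard library. `normalize_dataset_path` is a hypothetical helper, not part of pyarrow: it leaves strings that already carry a URI scheme untouched and turns everything else into an absolute local path.

```python
import os
from urllib.parse import urlparse


def normalize_dataset_path(path: str) -> str:
    """Hypothetical helper: make scheme-less string paths absolute.

    Caveat: a Windows path such as "C:\\data" parses with scheme "c",
    so a real fix would need to special-case drive letters.
    """
    if urlparse(path).scheme:
        # Looks like a URI, e.g. "s3://bucket/key"; pass it through unchanged.
        return path
    return os.path.abspath(path)


if __name__ == "__main__":
    print(normalize_dataset_path("test.parquet"))
    print(normalize_dataset_path("s3://bucket/data"))
```

This only shows one possible shape of a workaround; it does not settle whether the fix belongs in the C++ or the Python layer.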





[NIGHTLY] Arrow Build Report for Job nightly-2020-03-17-0

2020-03-17 Thread Crossbow


Arrow Build Report for Job nightly-2020-03-17-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0

Failed Tasks:
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-win-vs2015-py38
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-gandiva-jar-trusty
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-cpp-valgrind
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-turbodbc-master
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-wheel-osx-cp35m
- wheel-osx-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-wheel-osx-cp36m
- wheel-osx-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-wheel-osx-cp37m
- wheel-osx-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-wheel-osx-cp38

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-github-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-github-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-github-centos-8
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-azure-conda-osx-clang-py38
- debian-buster:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-github-debian-buster
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-github-debian-stretch
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-gandiva-jar-osx
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-homebrew-cpp
- macos-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-travis-macos-r-autobrew
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-cpp
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-kartothek-master
- test-conda-python-3.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-17-0-circle-test-conda-python-3.7-spark-master
- test-conda-python-3.7:
  URL: 

[jira] [Created] (ARROW-8135) Problem importing PyArrow on a cluster

2020-03-17 Thread Matej Murin (Jira)
Matej Murin created ARROW-8135:
--

 Summary: Problem importing PyArrow on a cluster
 Key: ARROW-8135
 URL: https://issues.apache.org/jira/browse/ARROW-8135
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.16.0
 Environment: Linux, RedHat CentOS 7
Reporter: Matej Murin


Hi, when I try to import pyarrow in Python, I get the following error:

*File "", line 1, in *
 *File 
"/services/matejm/anaconda3/lib/python3.7/site-packages/pyarrow/__init__.py", 
line 49, in *
 *from pyarrow.lib import cpu_count, set_cpu_count*
*ImportError: libaws-cpp-sdk-s3.so: cannot open shared object file: No such 
file or directory*
What could this be related to? I have searched everywhere I could and could 
not find any reason for it, so I figured I might as well ask here.
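
One way to narrow this kind of ImportError down is to ask the dynamic loader directly whether it can find the library named in the message. The sketch below uses only the standard `ctypes` module; `can_load` is a hypothetical helper name, and the result depends on the loader's search path on the machine where it runs (for example `LD_LIBRARY_PATH` on Linux).

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the library (or one of its dependencies) cannot be opened.
        return False
    return True


if __name__ == "__main__":
    # Library name taken from the error message above.
    print(can_load("libaws-cpp-sdk-s3.so"))
```

If this prints False on the cluster node but True elsewhere, the library is likely missing from that node or not on its search path; that is a diagnostic hint, not a confirmed cause.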

Thank you very much


