[jira] [Created] (ARROW-9127) [Rust] Update thrift library dependencies

2020-06-14 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-9127:
--

 Summary: [Rust] Update thrift library dependencies
 Key: ARROW-9127
 URL: https://issues.apache.org/jira/browse/ARROW-9127
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Andrew Lamb


Update to the latest version of the Apache Thrift crate (0.13)

 

Rationale:

We were trying to update the version of `byteorder` that an internal project 
uses, but arrow/parquet depends on parquet-format-rs, which in turn depends on thrift.

 

[~sunchao] recently updated the thrift pin in parquet-format in 
https://github.com/apache/arrow/pull/6626, so now it is possible to update 
the thrift version here as well.

 

The thrift dependency update was postponed when the dependencies were last updated. 
See:

https://github.com/apache/arrow/pull/6626

https://issues.apache.org/jira/browse/ARROW-8124



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9126) [C++] Trimmed Boost bundle fails to build on Windows

2020-06-13 Thread Cuong Nguyen (Jira)
Cuong Nguyen created ARROW-9126:
---

 Summary: [C++] Trimmed Boost bundle fails to build on Windows
 Key: ARROW-9126
 URL: https://issues.apache.org/jira/browse/ARROW-9126
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Cuong Nguyen


 

Build with the following commands:
{code:java}
mkdir build
cd build
cmake .. -DARROW_PARQUET=ON
cmake --build .{code}
Error from the build log:

 
{code:java}
.\boost/graph/two_bit_color_map.hpp(106): fatal error C1083: Cannot open 
include file: 'boost/graph/detail/empty_header.hpp': No such file or directory
{code}
This is because configuring Boost to build only a subset of libraries doesn't work 
on Windows the way it does on Linux. As a result, all libraries, including the ones 
that were supposed to be trimmed, were built:

 
{code:java}
Component configuration:

 - atomic : building
 - chrono : building
 - container : building
 - date_time : building
 - exception : building
 - filesystem : building
 - headers : building
 - iostreams : building
 - locale : building
 - log : building
 - mpi : building
 - program_options : building
 - python : building
 - random : building
 - regex : building
 - serialization : building
 - system : building
 - test : building
 - thread : building
 - timer : building
 - wave : building
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9125) [C++] Add missing include for arrow::internal::ZeroMemory() for Valgrind

2020-06-13 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-9125:
---

 Summary: [C++] Add missing include for 
arrow::internal::ZeroMemory() for Valgrind
 Key: ARROW-9125
 URL: https://issues.apache.org/jira/browse/ARROW-9125
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9124) DFParser should consume sql query as &str instead of String

2020-06-13 Thread QP Hou (Jira)
QP Hou created ARROW-9124:
-

 Summary: DFParser should consume sql query as &str instead of 
String
 Key: ARROW-9124
 URL: https://issues.apache.org/jira/browse/ARROW-9124
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: QP Hou
Assignee: QP Hou


It's more efficient to accept a &str instead of an owned String, since the parser does not need to take ownership of the query text.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9123) [Python][wheel] Use libzstd.a explicitly

2020-06-13 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-9123:
---

 Summary: [Python][wheel] Use libzstd.a explicitly
 Key: ARROW-9123
 URL: https://issues.apache.org/jira/browse/ARROW-9123
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging, Python
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


{{ARROW_ZSTD_USE_SHARED}} was introduced by ARROW-9084. We need to set 
{{ARROW_ZSTD_USE_SHARED=OFF}} explicitly to use the static zstd library.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9122) [C++] Adapt ascii_lower/ascii_upper bulk transforms to work on sliced arrays

2020-06-13 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9122:
---

 Summary: [C++] Adapt ascii_lower/ascii_upper bulk transforms to 
work on sliced arrays
 Key: ARROW-9122
 URL: https://issues.apache.org/jira/browse/ARROW-9122
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


See comments at https://github.com/apache/arrow/pull/7418#discussion_r439754427

Also add unit tests to verify that only the referenced data slice has been 
transformed in the result
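
As a sketch of the kind of unit test intended, assuming the kernels are exposed in Python as {{pyarrow.compute.ascii_lower}}:

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["AA", "BB", "CC", "DD"])
sliced = arr.slice(1, 2)  # refers only to "BB", "CC"

# The result should contain only the referenced slice, lower-cased,
# and the parent array's data must be left untouched.
assert pc.ascii_lower(sliced).to_pylist() == ["bb", "cc"]
assert arr.to_pylist() == ["AA", "BB", "CC", "DD"]
{code}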



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9121) Do not wipe the filesystem when path is empty

2020-06-13 Thread Mohamed Zenadi (Jira)
Mohamed Zenadi created ARROW-9121:
-

 Summary: Do not wipe the filesystem when path is empty
 Key: ARROW-9121
 URL: https://issues.apache.org/jira/browse/ARROW-9121
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Mohamed Zenadi


The `DeleteDirContents` method in the filesystems API has a default behavior of 
*wiping* the whole filesystem if we give it an empty path.

 

It's documented as:
 > Like DeleteDir, but doesn’t delete the directory itself. Passing an empty 
 > path (“”) will wipe the entire filesystem tree. 

 

And the corresponding code confirms that:
{code:java}
  auto parts = SplitAbstractPath(path);
  RETURN_NOT_OK(ValidateAbstractPathParts(parts));  
  
  if (parts.empty()) {
// Wipe filesystem
impl_->RootDir().entries.clear();
return Status::OK();
  }
{code}
 

This is a surprising default that does not make sense. If the user really wanted to 
wipe the filesystem, they would pass `/` explicitly.
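
Independently of what the C++ default ends up being, callers can guard against it; a minimal Python sketch, assuming a filesystem object that exposes {{delete_dir_contents}} as in the pyarrow.fs bindings:

{code:python}
def safe_delete_dir_contents(filesystem, path):
    # Refuse the dangerous default: an empty (or root) path would wipe
    # the entire filesystem tree.
    if not path or not path.strip("/"):
        raise ValueError("refusing to delete the entire filesystem; "
                         "pass an explicit, non-empty directory path")
    filesystem.delete_dir_contents(path)
{code}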

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9120) [C++] Lint and Format _internal headers

2020-06-12 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-9120:
---

 Summary: [C++] Lint and Format _internal headers
 Key: ARROW-9120
 URL: https://issues.apache.org/jira/browse/ARROW-9120
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.17.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Currently, headers named {{*_internal.h}} are neither clang-formatted nor checked by 
cpplint. Since they're not exported, the public-header lint rules (which forbid 
certain includes, nullptr, ...) need not be applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9119) [C++] Add support for building with system static gRPC

2020-06-12 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-9119:
---

 Summary: [C++] Add support for building with system static gRPC
 Key: ARROW-9119
 URL: https://issues.apache.org/jira/browse/ARROW-9119
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9118) [C++] Add more general BoundsCheck function that also checks for arbitrary lower limits in integer arrays

2020-06-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9118:
---

 Summary: [C++] Add more general BoundsCheck function that also 
checks for arbitrary lower limits in integer arrays
 Key: ARROW-9118
 URL: https://issues.apache.org/jira/browse/ARROW-9118
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


See ARROW-9083. The current {{IndexBoundsCheck}} is specialized to skip a 
comparison for unsigned integers and uses 0 as the lower bound for signed 
integers. This could be generalized so that we could check e.g. if int64 values 
will fit in the int32 range
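
A rough NumPy sketch of the generalized check (illustrative only, not the proposed C++ implementation):

{code:python}
import numpy as np

def values_in_range(values: np.ndarray, lower: int, upper: int) -> bool:
    # One vectorized pass that checks an arbitrary lower and upper bound.
    return bool(((values >= lower) & (values <= upper)).all())

# e.g. check whether int64 values can be safely narrowed to int32
int32 = np.iinfo(np.int32)
data = np.array([1, 2**40, -5], dtype=np.int64)
print(values_in_range(data, int32.min, int32.max))  # False
{code}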



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9117) [Python] Is there Pandas circular dependency problem?

2020-06-12 Thread SEUNGMIN HEO (Jira)
SEUNGMIN HEO created ARROW-9117:
---

 Summary: [Python] Is there Pandas circular dependency problem?
 Key: ARROW-9117
 URL: https://issues.apache.org/jira/browse/ARROW-9117
 Project: Apache Arrow
  Issue Type: Bug
Reporter: SEUNGMIN HEO






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9116) [C++] Add BinaryArray::total_values_length()

2020-06-12 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9116:
-

 Summary: [C++] Add BinaryArray::total_values_length()
 Key: ARROW-9116
 URL: https://issues.apache.org/jira/browse/ARROW-9116
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou
 Fix For: 1.0.0


It's often useful to compute the total data size of a binary array.
Sample implementation:
{code:c++}
  int64_t total_values_length() const {
return raw_value_offsets_[length() + data_->offset] - 
raw_value_offsets_[data_->offset];
  }
{code}
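
For reference, the same quantity can be computed from Python by reading the int32 value offsets directly; a sketch that assumes the standard BinaryArray buffer layout (validity, offsets, data) and accounts for slicing:

{code:python}
import struct
import pyarrow as pa

def total_values_length(arr: pa.BinaryArray) -> int:
    offsets_buf = arr.buffers()[1]  # int32 value offsets
    start = struct.unpack_from("<i", offsets_buf, arr.offset * 4)[0]
    end = struct.unpack_from("<i", offsets_buf, (arr.offset + len(arr)) * 4)[0]
    return end - start

print(total_values_length(pa.array([b"ab", b"cde"])))  # 5
{code}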




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9115) [C++] Process data buffers in batch in ascii_lower / ascii_upper kernels rather than using string_view value iteration

2020-06-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9115:
---

 Summary: [C++] Process data buffers in batch in ascii_lower / 
ascii_upper kernels rather than using string_view value iteration
 Key: ARROW-9115
 URL: https://issues.apache.org/jira/browse/ARROW-9115
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


Also add a benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9114) Illegal instruction crash in arrow.dll

2020-06-12 Thread MP (Jira)
MP created ARROW-9114:
-

 Summary: Illegal instruction crash in arrow.dll
 Key: ARROW-9114
 URL: https://issues.apache.org/jira/browse/ARROW-9114
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.17.1
 Environment: Conda environment on Windows Server 2016. Importantly, 
the CPU does *not* support AVX2.
Reporter: MP


We have been encountering illegal instruction crashes in {{arrow.dll}} when 
using the {{conda}} packages from {{conda-forge}}. Here are the relevant 
packages that were installed:

{{arrow-cpp: 0.17.1-py37h1234567_4_cpu}}
{{parquet-cpp: 1.5.1-2}}
{{pyarrow: 0.17.1-py37h1234567_4_cpu}}
{{snappy: 1.1.8-he025d50_1}}

The error is:

{noformat}Windows fatal exception: code 0xc000001d{noformat}

Some further investigation revealed that the offending instruction is {{BZHI}}, 
which as I understand it is part of the {{BMI2}} set, in turn part of {{AVX2}}. 
We believe this is in fact arising in {{snappy}} code here:

https://github.com/google/snappy/blob/1.1.8/snappy.cc#L717-L728

The {{snappy 1.1.8}} package appears to have been built with {{BMI2}} support 
enabled, if you look at the release build log here:

https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=115252&view=logs&j=2cc45e14-23e3-52d7-b33a-8c2744410b97&t=21c44aa7-1ae3-5312-cacc-7f19fefc82f4

Of course, this is then arguably an upstream issue, but I have reported it here 
because perhaps that configuration is the desired choice for the 'standard' 
{{snappy}} package and something else might need to be done in {{arrow}} 
instead, for example.

(Incidentally, is the {{snappy}} runtime dependency correct in the {{arrow}} 
feedstocks? If it's statically linked, shouldn't it only be required at build 
time?)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9113) Fix exception causes in cli.py

2020-06-12 Thread Ram Rachum (Jira)
Ram Rachum created ARROW-9113:
-

 Summary: Fix exception causes in cli.py
 Key: ARROW-9113
 URL: https://issues.apache.org/jira/browse/ARROW-9113
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Ram Rachum


I recently went over 
[Matplotlib](https://github.com/matplotlib/matplotlib/pull/16706), 
[Pandas](https://github.com/pandas-dev/pandas/pull/32322) and 
[NumPy](https://github.com/numpy/numpy/pull/15731), fixing a small mistake in 
the way that Python 3's exception chaining is used. If you're interested, I can 
do it here too. I've done it on just one file right now. 

The mistake is this: In some parts of the code, an exception is being caught 
and replaced with a more user-friendly error. In these cases the syntax `raise 
new_error from old_error` needs to be used.

Python 3's exception chaining means it shows not only the traceback of the 
current exception, but that of the original exception (and possibly more.) This 
is regardless of `raise from`. The usage of `raise from` tells Python to put a 
more accurate message between the tracebacks. Instead of this: 

During handling of the above exception, another exception occurred:

You'll get this: 

The above exception was the direct cause of the following exception:

The first is inaccurate, because it signifies a bug in the exception-handling 
code itself, which is a different situation from wrapping an exception.
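
A minimal sketch of the pattern (the exception and function names here are hypothetical):

{code:python}
class ConfigError(Exception):
    """Hypothetical user-friendly error."""

def load_port(settings: dict) -> int:
    try:
        return int(settings["port"])
    except KeyError as exc:
        # 'raise ... from ...' marks the KeyError as the direct cause, so the
        # traceback says "The above exception was the direct cause of the
        # following exception:" instead of "During handling of the above
        # exception, another exception occurred:".
        raise ConfigError("missing 'port' setting") from exc
{code}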

Let me know what you think! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9112) [R] Update autobrew script location

2020-06-11 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-9112:
--

 Summary: [R] Update autobrew script location
 Key: ARROW-9112
 URL: https://issues.apache.org/jira/browse/ARROW-9112
 Project: Apache Arrow
  Issue Type: Task
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.0


Jeroen is moving it to a different location.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9111) csv.read_csv progress bar

2020-06-11 Thread Jeff Hammerbacher (Jira)
Jeff Hammerbacher created ARROW-9111:


 Summary: csv.read_csv progress bar
 Key: ARROW-9111
 URL: https://issues.apache.org/jira/browse/ARROW-9111
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 0.17.1
Reporter: Jeff Hammerbacher


When reading a very large csv file, it would be nice to see some diagnostic 
output from pyarrow. [readr|https://readr.tidyverse.org/reference/read_delim.html] 
has a `progress` parameter, for example. [tqdm|https://github.com/tqdm/tqdm] is 
often used in the Python community to provide this functionality.
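
As a workaround sketch until something is built in, the input file can be wrapped so that reads drive a tqdm bar; this assumes read_csv is handed a Python file-like object:

{code:python}
import os
from tqdm import tqdm
import pyarrow.csv as csv

def read_csv_with_progress(path):
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # tqdm.wrapattr wraps f.read() so every read updates the bar.
        with tqdm.wrapattr(f, "read", total=size, desc=path) as wrapped:
            return csv.read_csv(wrapped)
{code}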



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9110) [C++] Fix CPU cache size detection on macOS

2020-06-11 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9110:
--

 Summary: [C++] Fix CPU cache size detection on macOS
 Key: ARROW-9110
 URL: https://issues.apache.org/jira/browse/ARROW-9110
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 1.0.0


Running certain benchmarks on macOS never ends because CpuInfo reports the RAM 
size as the size of the L1 cache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9109) [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-11 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9109:
-

 Summary: [Python][Packaging] Enable S3 support in manylinux wheels
 Key: ARROW-9109
 URL: https://issues.apache.org/jira/browse/ARROW-9109
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Packaging, Python
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9108) [C++][Dataset] Add Parquet Statistics conversion for timestamp columns

2020-06-11 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-9108:
-

 Summary: [C++][Dataset] Add Parquet Statistics conversion for 
timestamp columns
 Key: ARROW-9108
 URL: https://issues.apache.org/jira/browse/ARROW-9108
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Francois Saint-Jacques






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9107) [C++][Dataset] Time-based types support

2020-06-11 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-9107:
-

 Summary: [C++][Dataset] Time-based types support
 Key: ARROW-9107
 URL: https://issues.apache.org/jira/browse/ARROW-9107
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Francois Saint-Jacques


We lack support for date/timestamp partitions and the corresponding predicate 
pushdown rules. Timestamp columns are usually the most important predicate in 
OLAP-style queries, so we need to support this transparently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9106) [C++] Add C++ foundation to ease file transcoding

2020-06-11 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9106:
-

 Summary: [C++] Add C++ foundation to ease file transcoding
 Key: ARROW-9106
 URL: https://issues.apache.org/jira/browse/ARROW-9106
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


In some situations (e.g. reading a Windows-produced CSV file), the user might need 
to transcode data before ingesting it into Arrow. Rather than building transcoding 
into C++ (which would require a library of encodings), we could delegate it to the 
bindings as needed, by providing a generic InputStream facility.
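
For comparison, this is roughly what binding-level transcoding can already look like in Python today, using the standard library's codecs module (latin-1 is just an example source encoding):

{code:python}
import codecs
import pyarrow.csv as csv

def read_non_utf8_csv(path, file_encoding="latin-1"):
    with open(path, "rb") as raw:
        # EncodedFile decodes reads using file_encoding and re-encodes them
        # as UTF-8, which is what the Arrow CSV reader expects.
        utf8_stream = codecs.EncodedFile(
            raw, data_encoding="utf-8", file_encoding=file_encoding)
        return csv.read_csv(utf8_stream)
{code}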



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9105) [C++] ParquetFileFragment::SplitByRowGroup doesn't handle filter on partition field

2020-06-11 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9105:


 Summary: [C++] ParquetFileFragment::SplitByRowGroup doesn't handle 
filter on partition field
 Key: ARROW-9105
 URL: https://issues.apache.org/jira/browse/ARROW-9105
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


When splitting a fragment into row group fragments, filtering on the partition 
field raises an error.

Python reproducer:

```
df = pd.DataFrame({"dummy": [1, 1, 1, 1], "part": ["A", "A", "B", "B"]})
df.to_parquet("test_partitioned_filter", partition_cols="part", 
engine="pyarrow")

import pyarrow.dataset as ds
dataset = ds.dataset("test_partitioned_filter", format="parquet", 
partitioning="hive")
fragment = list(dataset.get_fragments())[0]
```

```
In [31]: dataset.to_table(filter=ds.field("part") == "A").to_pandas()
Out[31]: 
   dummy part
0      1    A
1      1    A

In [32]: fragment.split_by_row_group(ds.field("part") == "A")
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
----> 1 fragment.split_by_row_group(ds.field("part") == "A")

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.ParquetFileFragment.split_by_row_group()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset._insert_implicit_casts()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Field named 'part' not found or not unique in the schema.
```

This is probably a "strange" thing to do, since the fragment from a partitioned 
dataset is already coming only from a single partition (so will always only 
satisfy a single equality expression). But it's still nice that as a user you 
don't have to care about only passing part of the filter down to 
{{split_by_row_groups}}.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9104) [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-06-11 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9104:
--

 Summary: [C++] Parquet encryption tests should write files to a 
temporary directory instead of the testing submodule's directory
 Key: ARROW-9104
 URL: https://issues.apache.org/jira/browse/ARROW-9104
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Krisztian Szucs
 Fix For: 1.0.0


If the source directory is not writable, the test raises a permission-denied error:

{noformat}
[ RUN      ] TestEncryptionConfiguration.UniformEncryption
unknown file: Failure
C++ exception with description "IOError: Failed to open local file '/arrow/cpp/submodules/parquet-testing/data/tmp_uniform_encryption.parquet.encrypted'. Detail: [errno 13] Permission denied" thrown in the test body.
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9103) [Python] Clarify behaviour of CSV reader for non-UTF8 text data

2020-06-11 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9103:


 Summary: [Python] Clarify behaviour of CSV reader for non-UTF8 
text data
 Key: ARROW-9103
 URL: https://issues.apache.org/jira/browse/ARROW-9103
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


See 
https://stackoverflow.com/questions/62153229/how-does-pyarrow-read-csv-handle-different-file-encodings/62321673#62321673



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9102) [Packaging] Upload built manylinux docker images

2020-06-11 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9102:
--

 Summary: [Packaging] Upload built manylinux docker images
 Key: ARROW-9102
 URL: https://issues.apache.org/jira/browse/ARROW-9102
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 1.0.0


Although the secrets were set on Azure Pipelines, the upload step is failing: 
https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=13104&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181

As a result, the manylinux builds take more than two hours. The failure is due to 
Azure's secret handling: we need to explicitly export the Azure secret variables 
as environment variables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9101) [Doc][C++][Python] Document encoding expected by CSV and JSON readers

2020-06-11 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9101:
-

 Summary: [Doc][C++][Python] Document encoding expected by CSV and 
JSON readers
 Key: ARROW-9101
 URL: https://issues.apache.org/jira/browse/ARROW-9101
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Documentation, Python
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9100) Add ascii_lower kernel

2020-06-11 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-9100:
---

 Summary: Add ascii_lower kernel
 Key: ARROW-9100
 URL: https://issues.apache.org/jira/browse/ARROW-9100
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Maarten Breddels






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9099) Add TRIM function for string

2020-06-10 Thread Sagnik Chakraborty (Jira)
Sagnik Chakraborty created ARROW-9099:
-

 Summary: Add TRIM function for string
 Key: ARROW-9099
 URL: https://issues.apache.org/jira/browse/ARROW-9099
 Project: Apache Arrow
  Issue Type: Task
Reporter: Sagnik Chakraborty






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9098) RecordBatch::ToStructArray cannot handle record batches with 0 columns

2020-06-10 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-9098:


 Summary: RecordBatch::ToStructArray cannot handle record batches 
with 0 columns
 Key: ARROW-9098
 URL: https://issues.apache.org/jira/browse/ARROW-9098
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.17.1
Reporter: Zhuo Peng


If RecordBatch::ToStructArray is called on a record batch with 0 columns, 
the following error is raised:

Invalid: Can't infer struct array length with 0 child arrays



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9097) [Rust] Customizable schema inference for CSV

2020-06-10 Thread Sergey Todyshev (Jira)
Sergey Todyshev created ARROW-9097:
--

 Summary: [Rust] Customizable schema inference for CSV
 Key: ARROW-9097
 URL: https://issues.apache.org/jira/browse/ARROW-9097
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Sergey Todyshev


Please consider extracting the infer_csv_schema function into a separate module, 
allowing customization of the per-field DataType inference. Currently the missing 
piece is inference of datetime fields.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9096) data type "integer" not understood: pandas roundtrip

2020-06-10 Thread Richard Wu (Jira)
Richard Wu created ARROW-9096:
-

 Summary: data type "integer" not understood: pandas roundtrip
 Key: ARROW-9096
 URL: https://issues.apache.org/jira/browse/ARROW-9096
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.17.1
Reporter: Richard Wu


The following will fail the roundtrip since the column indexes' pandas_type is 
converted from int64 to integer when an additional column is introduced and 
subsequently moved to the index:

 
{code:python}
import numpy as np
import pandas as pd
import pyarrow

df = pd.DataFrame(np.ones((3, 1)), index=[[1, 2, 3]])
df['foo'] = np.arange(3)
df = df.set_index('foo', append=True)
table = pyarrow.Table.from_pandas(df)
table.to_pandas()  # Errors
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9095) [Rust] Fix NullArray to comply with spec

2020-06-10 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9095:
-

 Summary: [Rust] Fix NullArray to comply with spec
 Key: ARROW-9095
 URL: https://issues.apache.org/jira/browse/ARROW-9095
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 0.17.0
Reporter: Neville Dipale


When I implemented the NullArray, I didn't comply with the spec under the 
premise that I'd handle reading and writing IPC in a spec-compliant way as that 
looked like the easier approach.

After some integration testing, I realised that I wasn't doing it correctly, so 
it's better to comply with the spec by not allocating any buffers for the array.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9094) [Python] Bump versions of compiled dependencies in manylinux wheels

2020-06-10 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9094:
-

 Summary: [Python] Bump versions of compiled dependencies in 
manylinux wheels
 Key: ARROW-9094
 URL: https://issues.apache.org/jira/browse/ARROW-9094
 Project: Apache Arrow
  Issue Type: Task
  Components: Packaging, Python
Reporter: Antoine Pitrou
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9093) [FlightRPC][C++][Python] Allow setting gRPC client options

2020-06-10 Thread David Li (Jira)
David Li created ARROW-9093:
---

 Summary: [FlightRPC][C++][Python] Allow setting gRPC client options
 Key: ARROW-9093
 URL: https://issues.apache.org/jira/browse/ARROW-9093
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, FlightRPC, Python
Reporter: David Li
Assignee: David Li


There's no way to set generic gRPC options which are useful for tuning behavior 
(e.g. round-robin load balancing). Rather than bind all of these one by one, 
gRPC allows setting arguments as generic string-string or string-integer pairs; 
we could expose this (and leave the interpretation implementation-dependent).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9092) [C++] gandiva-decimal-test hangs with LLVM 9

2020-06-10 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9092:
---

 Summary: [C++] gandiva-decimal-test hangs with LLVM 9
 Key: ARROW-9092
 URL: https://issues.apache.org/jira/browse/ARROW-9092
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


I built Gandiva C++ unittests with LLVM 9 on Ubuntu 18.04 and 
gandiva-decimal-test hangs forever



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9091) [C++] Utilize function's default options when passing no options to CallFunction to a function that requires them

2020-06-10 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9091:
---

 Summary: [C++] Utilize function's default options when passing no 
options to CallFunction to a function that requires them
 Key: ARROW-9091
 URL: https://issues.apache.org/jira/browse/ARROW-9091
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Otherwise benign usage of {{CallFunction}} can cause an unintuitive segfault in 
some cases



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9090) [C++] Bump versions of bundled libraries

2020-06-10 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9090:
-

 Summary: [C++] Bump versions of bundled libraries
 Key: ARROW-9090
 URL: https://issues.apache.org/jira/browse/ARROW-9090
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou
 Fix For: 1.0.0


We should bump the versions of bundled dependencies, wherever possible, to 
ensure that users get bugfixes and improvements made in those third-party 
libraries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9089) [Python] A PyFileSystem handler for fsspec-based filesystems

2020-06-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9089:


 Summary: [Python] A PyFileSystem handler for fsspec-based 
filesystems
 Key: ARROW-9089
 URL: https://issues.apache.org/jira/browse/ARROW-9089
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Follow-up on ARROW-8766 to use this machinery to add an FSSpecHandler



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9088) [Rust] Recent version of arrow crate does not compile into wasm target

2020-06-10 Thread Sergey Todyshev (Jira)
Sergey Todyshev created ARROW-9088:
--

 Summary: [Rust] Recent version of arrow crate does not compile 
into wasm target
 Key: ARROW-9088
 URL: https://issues.apache.org/jira/browse/ARROW-9088
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Sergey Todyshev


Arrow 0.16 compiles successfully to the wasm32-unknown-unknown target, but the 
current git version does not. It would be nice to fix that.

compiler errors:

 
{noformat}
error[E0433]: failed to resolve: could not find `unix` in `os`
--> 
/home/regl/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:41:18
 |
  41 | use std::os::unix::ffi::OsStringExt;
 |   could not find `unix` in `os`
  
  error[E0432]: unresolved import `unix`
   --> 
/home/regl/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:6:5
|
  6 | use unix;
|  no `unix` in the root{noformat}
The problem is that the prettytable-rs dependency depends on dirs, which causes 
this error.

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9087) Missing HDFS options parsing

2020-06-09 Thread Yuan Zhou (Jira)
Yuan Zhou created ARROW-9087:


 Summary: Missing HDFS options parsing
 Key: ARROW-9087
 URL: https://issues.apache.org/jira/browse/ARROW-9087
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Yuan Zhou
Assignee: Yuan Zhou


The HDFS options for the Kerberos ticket and extra configuration are not parsed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9086) [CI][Homebrew] Enable Gandiva

2020-06-09 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-9086:
---

 Summary: [CI][Homebrew] Enable Gandiva
 Key: ARROW-9086
 URL: https://issues.apache.org/jira/browse/ARROW-9086
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9085) [C++][CI] Appveyor CI test failures

2020-06-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9085:
---

 Summary: [C++][CI] Appveyor CI test failures
 Key: ARROW-9085
 URL: https://issues.apache.org/jira/browse/ARROW-9085
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


See 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/33417919

These seem to have been introduced by 

https://github.com/apache/arrow/commit/b058cf0d1c26ad7984c104bb84322cc7dcc66f00



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9084) [C++] cmake is unable to find zstd target when ZSTD_SOURCE=SYSTEM

2020-06-09 Thread Dmitry Kalinkin (Jira)
Dmitry Kalinkin created ARROW-9084:
--

 Summary: [C++] cmake is unable to find zstd target when 
ZSTD_SOURCE=SYSTEM
 Key: ARROW-9084
 URL: https://issues.apache.org/jira/browse/ARROW-9084
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.17.1
 Environment: zstd 1.4.5
Reporter: Dmitry Kalinkin
Assignee: Dmitry Kalinkin


The following problem occurs when arrow-cpp is built against a system zstd:
{noformat}
CMake Error at cmake_modules/ThirdpartyToolchain.cmake:1860 
(get_target_property):
  get_target_property() called with non-existent target "ZSTD::zstd".
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9083) [R] collect int64 as R integer type if not out of bounds

2020-06-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-9083:
--

 Summary: [R] collect int64 as R integer type if not out of bounds
 Key: ARROW-9083
 URL: https://issues.apache.org/jira/browse/ARROW-9083
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson


{{bit64::integer64}} can be awkward to work with in R (one example: 
https://github.com/apache/arrow/issues/7385). Often in Arrow we get {{int64}} 
types from [compute methods|https://github.com/apache/arrow/pull/7308] or other 
translation methods that auto-promote to the largest integer type, but they 
would fit fine in a 32-bit integer, which is R's native type. 

When calling {{Array__as_vector}} on an int64, we could first call the minmax 
function on the array, and if the extrema are within the range of a 32-bit int, 
return a regular R integer vector. This would add a little bit of ambiguity as 
to what R type you'll get from an Arrow type, but I wonder if the benefits are 
worth it since you can't do much with an integer64 in R. (We could also make 
this optional, similar to ARROW-7657, so you could specify a "strict" mode if 
you are in a use case where roundtrip fidelity is more important than R 
usability.)
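
The proposed logic, sketched in Python/pyarrow purely for illustration (assuming the compute function is exposed as {{min_max}}; the real change would live in the R bindings):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def maybe_downcast_int64(arr: pa.Array) -> pa.Array:
    # Check the extrema first and keep the wide type only when the
    # values genuinely need 64 bits.
    extrema = pc.min_max(arr)
    lo, hi = extrema["min"].as_py(), extrema["max"].as_py()
    if lo is not None and lo >= INT32_MIN and hi <= INT32_MAX:
        return arr.cast(pa.int32())
    return arr

print(maybe_downcast_int64(pa.array([1, 2, 3], type=pa.int64())).type)  # int32
{code}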



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9082) [Rust] Stream reader fails when the stream does not end with the (optional) 0xFFFFFFFF 0x00000000 marker

2020-06-09 Thread Eyal Leshem (Jira)
Eyal Leshem created ARROW-9082:
--

 Summary: [Rust] Stream reader fails when the stream does not end with the 
(optional) 0xFFFFFFFF 0x00000000 marker
 Key: ARROW-9082
 URL: https://issues.apache.org/jira/browse/ARROW-9082
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.17.1
Reporter: Eyal Leshem


According to the spec 
([https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format]), the 
0xFFFFFFFF 0x00000000 end-of-stream marker is optional in an Arrow response stream. 
Currently, when the client receives such a response it reads all the batches 
correctly, but returns an error at the end (instead of Ok(None)).
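
A quick way to produce such a stream for testing is to write one with pyarrow and strip the trailing 8-byte end-of-stream marker; a sketch:

{code:python}
import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3])], names=["a"])
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, batch.schema)
writer.write_batch(batch)
writer.close()  # writes the optional 0xFFFFFFFF 0x00000000 marker
buf = sink.getvalue()

# Drop the 8-byte end-of-stream marker.
truncated = buf.slice(0, buf.size - 8)

# Per the spec, a reader should still return every batch and then stop cleanly.
table = pa.ipc.open_stream(truncated).read_all()
print(table.num_rows)  # 3
{code}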



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9081) [C++] Upgrade to LLVM 10

2020-06-09 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-9081:
---

 Summary: [C++] Upgrade to LLVM 10
 Key: ARROW-9081
 URL: https://issues.apache.org/jira/browse/ARROW-9081
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.17.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Upgrade llvm dependencies to use version 10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9080) [C++] arrow::AllocateBuffer returns a Result<std::unique_ptr<Buffer>>

2020-06-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9080:
---

 Summary: [C++] arrow::AllocateBuffer returns a 
Result<std::unique_ptr<Buffer>>
 Key: ARROW-9080
 URL: https://issues.apache.org/jira/browse/ARROW-9080
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This seemed counterintuitive to me since using Buffers almost anywhere requires 
a shared_ptr



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9079) [C++] Write benchmark for arithmetic kernels

2020-06-09 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9079:
--

 Summary: [C++] Write benchmark for arithmetic kernels
 Key: ARROW-9079
 URL: https://issues.apache.org/jira/browse/ARROW-9079
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 1.0.0


The add kernel's implementation changed in 
https://github.com/apache/arrow/pull/7341. In order to ensure that no 
performance regression was introduced, write a benchmark for the kernels and 
compare the results with the previous implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9078) [C++] Parquet writing of extension type with nested storage type fails

2020-06-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9078:


 Summary: [C++] Parquet writing of extension type with nested 
storage type fails
 Key: ARROW-9078
 URL: https://issues.apache.org/jira/browse/ARROW-9078
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


A reproducer in Python:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq


class MyStructType(pa.PyExtensionType):

    def __init__(self):
        pa.PyExtensionType.__init__(
            self, pa.struct([('left', pa.int64()), ('right', pa.int64())]))

    def __reduce__(self):
        return MyStructType, ()


struct_array = pa.StructArray.from_arrays(
    [
        pa.array([0, 1], type="int64", from_pandas=True),
        pa.array([1, 2], type="int64", from_pandas=True),
    ],
    names=["left", "right"],
)

# works
table = pa.table({'a': struct_array})
pq.write_table(table, "test_struct.parquet")

# doesn't work
mystruct_array = pa.ExtensionArray.from_storage(MyStructType(), struct_array)
table = pa.table({'a': mystruct_array})
pq.write_table(table, "test_struct.parquet")
{code}

Writing the simple StructArray nowadays works (and reading it back in as well). 

But when the struct array is the storage array of an ExtensionType, it fails 
with the following error:

{code}
ArrowException: Unknown error: data type leaf_count != builder_leaf_count1 2
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9077) [C++] Fix aggregate/scalar-compare benchmark null_percent calculation

2020-06-09 Thread Frank Du (Jira)
Frank Du created ARROW-9077:
---

 Summary: [C++] Fix aggregate/scalar-compare benchmark null_percent 
calculation
 Key: ARROW-9077
 URL: https://issues.apache.org/jira/browse/ARROW-9077
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Frank Du
Assignee: Frank Du


The null percent reported by the aggregate/scalar-compare benchmarks is wrong 
after the changes in benchmark_util.h. Correct both to use the newly defined 
boilerplate.

{noformat}
./release/arrow-compute-aggregate-benchmark
------------------------------------------------------------------------------
Benchmark                  Time     CPU      Iterations  UserCounters...
------------------------------------------------------------------------------
SumKernelFloat/32768/1     5.38 us  5.38 us  129832  bytes_per_second=5.67524G/s null_percent=10k size=32.768k
SumKernelFloat/32768/1000  5.36 us  5.35 us  130069  bytes_per_second=5.6994G/s null_percent=1000 size=32.768k
SumKernelFloat/32768/100   5.35 us  5.35 us  131071  bytes_per_second=5.70903G/s null_percent=100 size=32.768k
SumKernelFloat/32768/50    10.8 us  10.7 us   65504  bytes_per_second=2.84073G/s null_percent=50 size=32.768k
SumKernelFloat/32768/10    4.94 us  4.93 us  141624  bytes_per_second=6.18964G/s null_percent=10 size=32.768k
SumKernelFloat/32768/1     4.41 us  4.40 us  158949  bytes_per_second=6.92913G/s null_percent=1 size=32.768k
{noformat}

(The first line's null_percent=10k is the incorrectly reported value.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9076) [Rust] Async CSV reader

2020-06-08 Thread Sergey Todyshev (Jira)
Sergey Todyshev created ARROW-9076:
--

 Summary: [Rust] Async CSV reader
 Key: ARROW-9076
 URL: https://issues.apache.org/jira/browse/ARROW-9076
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Sergey Todyshev


The rust-csv crate recently added an async implementation of its CSV reader. It 
would be nice to have this in the arrow crate as well; it is extremely useful for 
applications that need to parse large CSV files in WebAssembly.

It would be nice to have an async JSON reader as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9075) [C++] Optimize Filter implementation

2020-06-08 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9075:
---

 Summary: [C++] Optimize Filter implementation
 Key: ARROW-9075
 URL: https://issues.apache.org/jira/browse/ARROW-9075
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


I split this off from ARROW-5760 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9074) [GLib] Add missing arrow-json check

2020-06-08 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-9074:
---

 Summary: [GLib] Add missing arrow-json check
 Key: ARROW-9074
 URL: https://issues.apache.org/jira/browse/ARROW-9074
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9073) [C++] RapidJSON include directory detection doesn't work with RapidJSONConfig.cmake

2020-06-08 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-9073:
---

 Summary: [C++] RapidJSON include directory detection doesn't work 
with RapidJSONConfig.cmake
 Key: ARROW-9073
 URL: https://issues.apache.org/jira/browse/ARROW-9073
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9072) [C++][Gandiva][MinGW] Enable crashed tests

2020-06-08 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-9072:
---

 Summary: [C++][Gandiva][MinGW] Enable crashed tests
 Key: ARROW-9072
 URL: https://issues.apache.org/jira/browse/ARROW-9072
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Reporter: Kouhei Sutou


Some Gandiva tests crash with MinGW. They are disabled in 
{{ci/scripts/cpp_test.sh}}.

We should fix the causes of the crashes and enable these tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9071) [C++] MakeArrayOfNull makes invalid ListArray

2020-06-08 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-9071:


 Summary: [C++] MakeArrayOfNull makes invalid ListArray
 Key: ARROW-9071
 URL: https://issues.apache.org/jira/browse/ARROW-9071
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Zhuo Peng


One way to reproduce this bug is:

 

{code:python}
>>> a = pa.array([[1, 2]])
>>> b = pa.array([None, None], type=pa.null())
>>> t1 = pa.Table.from_arrays([a], ["a"])
>>> t2 = pa.Table.from_arrays([b], ["b"])
>>> pa.concat_tables([t1, t2], promote=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 2138, in pyarrow.lib.concat_tables
  File "pyarrow/public-api.pxi", line 390, in pyarrow.lib.pyarrow_wrap_table
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 0: In chunk 1: Invalid: List child array invalid: Invalid: Buffer #1 too small in array of type int64 and length 2: expected at least 16 byte(s), got 12
{code}

(because concat_tables(promote=True) will call MakeArrayOfNulls: 
https://github.com/apache/arrow/blob/ec3bae18157723411bb772fca628cbd02eea5c25/cpp/src/arrow/table.cc#L647)

 

The code here seems incorrect:

[https://github.com/apache/arrow/blob/ec3bae18157723411bb772fca628cbd02eea5c25/cpp/src/arrow/array/util.cc#L218]

the length of the child array of a ListArray may not be equal to the length of the 
ListArray.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9070) [C++] StructScalar needs field accessor methods

2020-06-08 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-9070:
--

 Summary: [C++] StructScalar needs field accessor methods
 Key: ARROW-9070
 URL: https://issues.apache.org/jira/browse/ARROW-9070
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Neal Richardson
 Fix For: 1.0.0


The minmax compute function returns a struct with fields "min" and "max". So to 
write an R binding for the {{min()}} method on arrow objects, I call "minmax" 
and then take the "min" field from the result. However, at least from my 
reading of scalar.h compared with array_nested.h, there are no 
field/GetFieldByName/etc. methods for StructScalar, so I can't get it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9069) [C++] MakeArrayFromScalar can't handle struct

2020-06-08 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-9069:
--

 Summary: [C++] MakeArrayFromScalar can't handle struct
 Key: ARROW-9069
 URL: https://issues.apache.org/jira/browse/ARROW-9069
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Neal Richardson
 Fix For: 1.0.0


The R bindings translate data to/from Scalars by using the Array methods 
already implemented: to go from R object to a Scalar, it creates a length-1 
Array and then slices out the 0th element with GetScalar(); to go from Scalar 
to R object, it calls MakeArrayFromScalar and then the as.vector method on that 
Array (in R, there is no scalar type anyway, only length-1 vectors). 

This generally works fine but if I get a Struct scalar (as the minmax compute 
function returns), I can't do anything with it because MakeArrayFromScalar 
doesn't work with structs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9068) [C++][Dataset] Simplify Partitioning interface

2020-06-08 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-9068:
-

 Summary: [C++][Dataset] Simplify Partitioning interface
 Key: ARROW-9068
 URL: https://issues.apache.org/jira/browse/ARROW-9068
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Francois Saint-Jacques


The `int segment` of `Partitioning::Parse` should not be exposed to the user. 
KeyValuePartitioning should be a private Impl interface, not in public headers.

The same applies to `Partitioning::Format`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9067) [C++] Create reusable branchless / vectorized index boundschecking functions

2020-06-08 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9067:
---

 Summary: [C++] Create reusable branchless / vectorized index 
boundschecking functions
 Key: ARROW-9067
 URL: https://issues.apache.org/jira/browse/ARROW-9067
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


It is possible to do branch-free index boundschecking in batches for better 
performance. 

I am implementing this as part of the Take/Filter optimization (so please wait 
until I have PRs up for this work), but these functions can be moved somewhere 
more general purpose and used in places where we are currently boundschecking 
inside inner loops.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9066) [Python] Raise correct error in isnull()

2020-06-08 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-9066:
---

 Summary: [Python] Raise correct error in isnull()
 Key: ARROW-9066
 URL: https://issues.apache.org/jira/browse/ARROW-9066
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.17.1
Reporter: Uwe Korn
Assignee: Uwe Korn






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9065) Support parsing date32 in dataset partition folders

2020-06-08 Thread Dave Hirschfeld (Jira)
Dave Hirschfeld created ARROW-9065:
--

 Summary: Support parsing date32 in dataset partition folders
 Key: ARROW-9065
 URL: https://issues.apache.org/jira/browse/ARROW-9065
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Dave Hirschfeld


I have some data which is partitioned by year/month/date. It would be useful if 
the date could be automatically parsed:
```python
In [17]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.date32())])

In [18]: partition = DirectoryPartitioning(schema)

In [19]: partition.parse("/2020/06/2020-06-08")
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
----> 1 partition.parse("/2020/06/2020-06-08")

~\envs\dev\lib\site-packages\pyarrow\_dataset.pyx in pyarrow._dataset.Partitioning.parse()

~\envs\dev\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~\envs\dev\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: parsing scalars of type date32[day]
```


Not a big issue since you can just use string and convert, but nevertheless it 
would be nice if it Just Worked
```python
In [22]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.string())])

In [23]: partition = DirectoryPartitioning(schema)

In [24]: partition.parse("/2020/06/2020-06-08")
Out[24]: 
```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9064) optimization debian package manager tweaks

2020-06-08 Thread Pratik Raj (Jira)
Pratik Raj created ARROW-9064:
-

 Summary: optimization debian package manager tweaks
 Key: ARROW-9064
 URL: https://issues.apache.org/jira/browse/ARROW-9064
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Pratik Raj


By default, Ubuntu- or Debian-based "apt" or "apt-get" systems install recommended, 
but not suggested, packages.

By passing the "--no-install-recommends" option, the user lets apt-get know not to 
consider recommended packages as dependencies to install.

This results in smaller downloads and installations of packages.

See the Ubuntu Blog post at 
https://ubuntu.com/blog/we-reduced-our-docker-images-by-60-with-no-install-recommends



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9063) [Python][C++] Order of files are not respected using the new pyarrow.dataset

2020-06-08 Thread William Liu (Jira)
William Liu created ARROW-9063:
--

 Summary: [Python][C++] Order of files are not respected using the 
new pyarrow.dataset
 Key: ARROW-9063
 URL: https://issues.apache.org/jira/browse/ARROW-9063
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.17.1
 Environment: ubuntu-18.04
Reporter: William Liu


Say we have multiple parquet files under the same folder (a.parquet, b.parquet, 
c.parquet). If I pass a list of file paths into either of the two statements 
below
{code:java}
ds = pq.ParquetDataset(fps, use_legacy_dataset=False)
ds = pyarrow.dataset(fps){code}
Then the rows of the resulting table come back out of order:

......aaa......aaa...ccc..bbb...

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9062) [Rust] Support to read JSON into dictionary type

2020-06-08 Thread Sven Wagner-Boysen (Jira)
Sven Wagner-Boysen created ARROW-9062:
-

 Summary: [Rust] Support to read JSON into dictionary type
 Key: ARROW-9062
 URL: https://issues.apache.org/jira/browse/ARROW-9062
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Sven Wagner-Boysen


Currently, a JSON reader built from a schema that uses the dictionary type for one 
of its fields will fail with JsonError("struct types are not yet supported").
{code:java}
let builder = ReaderBuilder::new().with_schema(..);
let mut reader: Reader<File> =
    builder.build::<File>(File::open(path).unwrap()).unwrap();
let rb = reader.next().unwrap();
{code}
 

Suggested solution:

Support reading into a dictionary in Json Reader: 
[https://github.com/apache/arrow/blob/master/rust/arrow/src/json/reader.rs#L368]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9061) [Packaging][APT][Yum][GLib] Add Apache Arrow Datasets GLib

2020-06-07 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-9061:
---

 Summary: [Packaging][APT][Yum][GLib] Add Apache Arrow Datasets GLib
 Key: ARROW-9061
 URL: https://issues.apache.org/jira/browse/ARROW-9061
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib, Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9060) [GLib] Add support for building Apache Arrow Datasets GLib with non-installed Apache Arrow Datasets

2020-06-07 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-9060:
---

 Summary: [GLib] Add support for building Apache Arrow Datasets 
GLib with non-installed Apache Arrow Datasets
 Key: ARROW-9060
 URL: https://issues.apache.org/jira/browse/ARROW-9060
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


It's required for packaging: 
https://travis-ci.org/github/ursa-labs/crossbow/builds/695595159

{noformat}
  CXX  libarrow_dataset_glib_la-scanner.lo
scanner.cpp:24:33: fatal error: arrow/util/iterator.h: No such file or directory
 #include 
 ^
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9059) [Rust] Documentation for slicing array data has the wrong sign

2020-06-07 Thread Bobby Wagner (Jira)
Bobby Wagner created ARROW-9059:
---

 Summary: [Rust] Documentation for slicing array data has the wrong 
sign
 Key: ARROW-9059
 URL: https://issues.apache.org/jira/browse/ARROW-9059
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Bobby Wagner


In the slice_data function in array.rs, the docstring says it panics if 
offset + length is less than data.len(), but the code actually panics if 
offset + length is greater than data.len().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9058) [Packaging][wheel] Boost download fails

2020-06-07 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-9058:
---

 Summary: [Packaging][wheel] Boost download fails
 Key: ARROW-9058
 URL: https://issues.apache.org/jira/browse/ARROW-9058
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging, Python
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=12893&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181

{noformat}
+ curl -sL 
https://dl.bintray.com/boostorg/release/1.68.0/source/boost_1_68_0.tar.gz -o 
/boost_1_68_0.tar.gz
+ tar xf boost_1_68_0.tar.gz
tar: This does not look like a tar archive
tar: Error exit delayed from previous errors
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9057) Projection should work on InMemoryScan without error

2020-06-07 Thread QP Hou (Jira)
QP Hou created ARROW-9057:
-

 Summary: Projection should work on InMemoryScan without error
 Key: ARROW-9057
 URL: https://issues.apache.org/jira/browse/ARROW-9057
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: QP Hou
Assignee: QP Hou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9056) [C++] Aggregation methods for Scalars?

2020-06-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-9056:
--

 Summary: [C++] Aggregation methods for Scalars?
 Key: ARROW-9056
 URL: https://issues.apache.org/jira/browse/ARROW-9056
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Neal Richardson
 Fix For: 1.0.0


See discussion on https://github.com/apache/arrow/pull/7308. Many/most would 
no-op (sum, mean, min, max), but maybe they should exist and not error? Maybe 
they're not needed, but I could see how you might invoke a function on the 
result of a previous aggregation or something.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9055) [C++] Add sum/mean kernels for Boolean type

2020-06-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-9055:
--

 Summary: [C++] Add sum/mean kernels for Boolean type
 Key: ARROW-9055
 URL: https://issues.apache.org/jira/browse/ARROW-9055
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Neal Richardson
 Fix For: 1.0.0


See https://github.com/apache/arrow/pull/7308 (ARROW-6978)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9054) [C++] Add ScalarAggregateOptions

2020-06-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-9054:
--

 Summary: [C++] Add ScalarAggregateOptions
 Key: ARROW-9054
 URL: https://issues.apache.org/jira/browse/ARROW-9054
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Neal Richardson
 Fix For: 1.0.0


See discussion on https://github.com/apache/arrow/pull/7308. MinMax has an 
option for null behavior, but Sum and Mean do not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9049) [C++] Add a Result<> returning method for constructing a dictionary

2020-06-06 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-9049:
--

 Summary: [C++] Add a Result<> returning method for 
constructing a dictionary
 Key: ARROW-9049
 URL: https://issues.apache.org/jira/browse/ARROW-9049
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Micah Kornfield
Assignee: Micah Kornfield


Dictionary types require a signed integer index type.  Today there is a DCHECK 
that this is the case in the constructor.  

When reading data from an unknown source it is possible due to corruption (or 
user error) that the dictionary index type is not signed. We should add a 
method that checks for signedness and use that at all system boundaries to 
validate input data.
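
For illustration, a minimal sketch of what such a checked, Result-returning 
factory could look like (the name MakeDictionaryType is hypothetical, not the 
existing Arrow API; the happy path just delegates to arrow::dictionary()):

{code:cpp}
// Hypothetical sketch only -- not the existing Arrow API. A Result-returning
// factory that validates index signedness instead of relying on a DCHECK.
#include <memory>

#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/type.h"

arrow::Result<std::shared_ptr<arrow::DataType>> MakeDictionaryType(
    const std::shared_ptr<arrow::DataType>& index_type,
    const std::shared_ptr<arrow::DataType>& value_type, bool ordered = false) {
  switch (index_type->id()) {
    case arrow::Type::INT8:
    case arrow::Type::INT16:
    case arrow::Type::INT32:
    case arrow::Type::INT64:
      // Signed integer index: safe to construct.
      return arrow::dictionary(index_type, value_type, ordered);
    default:
      // Corrupted or user-provided unsigned/non-integer index types surface
      // as a Status instead of tripping a debug-only check.
      return arrow::Status::Invalid(
          "Dictionary index type must be a signed integer, got ",
          index_type->ToString());
  }
}
{code}

Callers at system boundaries (e.g. IPC/Flight readers) could then propagate the 
Status instead of aborting in debug builds.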



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9050) [Release] Use 1.0.0 as the next version

2020-06-06 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-9050:
---

 Summary: [Release] Use 1.0.0 as the next version
 Key: ARROW-9050
 URL: https://issues.apache.org/jira/browse/ARROW-9050
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9047) [Rust] Setting 0-bits of a 0-length bitset segfaults

2020-06-06 Thread Max Burke (Jira)
Max Burke created ARROW-9047:


 Summary: [Rust] Setting 0-bits of a 0-length bitset segfaults
 Key: ARROW-9047
 URL: https://issues.apache.org/jira/browse/ARROW-9047
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Max Burke


See PR for details



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9052) [CI][MinGW] Enable Gandiva

2020-06-06 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-9052:
---

 Summary: [CI][MinGW] Enable Gandiva
 Key: ARROW-9052
 URL: https://issues.apache.org/jira/browse/ARROW-9052
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva, Continuous Integration, GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9053) [Rust] Add sort for lists and structs

2020-06-06 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9053:
-

 Summary: [Rust] Add sort for lists and structs
 Key: ARROW-9053
 URL: https://issues.apache.org/jira/browse/ARROW-9053
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9048) [C#] Support Float16

2020-06-06 Thread Eric Erhardt (Jira)
Eric Erhardt created ARROW-9048:
---

 Summary: [C#] Support Float16
 Key: ARROW-9048
 URL: https://issues.apache.org/jira/browse/ARROW-9048
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#
Reporter: Eric Erhardt


With [https://github.com/dotnet/runtime/issues/936], .NET is getting a 
`System.Half` type, which is a 16-bit floating point number. Once that type 
lands in .NET we can implement support for the Float16 type in Arrow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9051) [GLib] Refer Array related objects from Array

2020-06-06 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-9051:
---

 Summary: [GLib] Refer Array related objects from Array
 Key: ARROW-9051
 URL: https://issues.apache.org/jira/browse/ARROW-9051
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9046) [C++][R] Put more things in type_fwds

2020-06-05 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-9046:
--

 Summary: [C++][R] Put more things in type_fwds
 Key: ARROW-9046
 URL: https://issues.apache.org/jira/browse/ARROW-9046
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, R
Reporter: Neal Richardson
Assignee: Ben Kietzman
 Fix For: 1.0.0


Hopefully to reduce compile time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9045) [C++] Improve and expand Take/Filter benchmarks

2020-06-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9045:
---

 Summary: [C++] Improve and expand Take/Filter benchmarks
 Key: ARROW-9045
 URL: https://issues.apache.org/jira/browse/ARROW-9045
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


I'm putting this up as a separate patch for review



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9044) [Go][Packaging] Revisit the license file attachment to the go packages

2020-06-05 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9044:
--

 Summary: [Go][Packaging] Revisit the license file attachment to 
the go packages
 Key: ARROW-9044
 URL: https://issues.apache.org/jira/browse/ARROW-9044
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go, Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 1.0.0


As per https://github.com/apache/arrow/pull/7355#issuecomment-639560475



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9043) [Go] Temporarily copy LICENSE.txt to go/

2020-06-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9043:
---

 Summary: [Go] Temporarily copy LICENSE.txt to go/
 Key: ARROW-9043
 URL: https://issues.apache.org/jira/browse/ARROW-9043
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Wes McKinney
 Fix For: 1.0.0


{{go mod}} needs to find a license file in the root of the Go module. In the 
future "go mod" may be able to follow symlinks in which case this can be 
replaced by a symlink.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9042) [C++] Add Subtract and Multiply arithmetic kernels with wrap-around behavior

2020-06-05 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9042:
--

 Summary: [C++] Add Subtract and Multiply arithmetic kernels with 
wrap-around behavior
 Key: ARROW-9042
 URL: https://issues.apache.org/jira/browse/ARROW-9042
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Krisztian Szucs
 Fix For: 1.0.0


Also avoid undefined behaviour caused by signed integer overflow.
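
As a sketch of the intended wrap-around semantics (illustrative helpers, not 
the kernel implementation): signed overflow is undefined behaviour in C++, but 
the same bit pattern can be obtained safely by computing in the unsigned domain 
and casting back.

{code:cpp}
// Illustrative only: two's-complement wrap-around without signed-overflow UB.
#include <cstdint>
#include <type_traits>

template <typename Int>
Int WrappingSubtract(Int a, Int b) {
  using Unsigned = std::make_unsigned_t<Int>;
  // Do the arithmetic in uint64_t (wraps modulo 2^64, never UB), reduce to the
  // target width via the unsigned type, then reinterpret as signed.
  return static_cast<Int>(static_cast<Unsigned>(
      static_cast<uint64_t>(a) - static_cast<uint64_t>(b)));
}

template <typename Int>
Int WrappingMultiply(Int a, Int b) {
  using Unsigned = std::make_unsigned_t<Int>;
  return static_cast<Int>(static_cast<Unsigned>(
      static_cast<uint64_t>(a) * static_cast<uint64_t>(b)));
}

// Example: WrappingMultiply<int32_t>(2000000000, 2) yields -294967296 (wrapped),
// whereas evaluating 2000000000 * 2 directly on int32_t would be UB.
{code}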



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9041) overloaded virtual function "arrow::io::Writable::Write" is only partially overridden in class

2020-06-04 Thread Karthikeyan Natarajan (Jira)
Karthikeyan Natarajan created ARROW-9041:


 Summary: overloaded virtual function "arrow::io::Writable::Write" 
is only partially overridden in class 
 Key: ARROW-9041
 URL: https://issues.apache.org/jira/browse/ARROW-9041
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.15.0
Reporter: Karthikeyan Natarajan


The following warnings appear:

cpp/build/arrow/install/include/arrow/io/file.h(189): warning: overloaded 
virtual function "arrow::io::Writable::Write" is only partially overridden in 
class "arrow::io::MemoryMappedFile"

cpp/build/arrow/install/include/arrow/io/memory.h(98): warning: overloaded 
virtual function "arrow::io::Writable::Write" is only partially overridden in 
class "arrow::io::MockOutputStream"

cpp/build/arrow/install/include/arrow/io/memory.h(116): warning: overloaded 
virtual function "arrow::io::Writable::Write" is only partially overridden in 
class "arrow::io::FixedSizeBufferWriter"

The suggested solution is to use `using Writable::Write;` in protected/private.

[https://isocpp.org/wiki/faq/strange-inheritance#hiding-rule]
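
A minimal illustration of the hiding rule and the suggested fix (simplified 
stand-in classes and signatures, not the actual arrow::io types):

{code:cpp}
#include <cstdint>
#include <string>

// Simplified stand-in for arrow::io::Writable: two Write overloads.
class Writable {
 public:
  virtual ~Writable() = default;
  virtual void Write(const uint8_t* data, int64_t nbytes) = 0;
  virtual void Write(const std::string& data) {
    Write(reinterpret_cast<const uint8_t*>(data.data()),
          static_cast<int64_t>(data.size()));
  }
};

class FixedSizeBufferWriter : public Writable {
 public:
  // Overriding only one overload hides the other inherited Write() and
  // triggers the "partially overridden" warning on some compilers.
  void Write(const uint8_t* /*data*/, int64_t /*nbytes*/) override {}

 protected:
  // Suggested fix: re-expose the hidden base-class overloads.
  using Writable::Write;
};
{code}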



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9040) [Python][Parquet] "_ParquetDatasetV2" fails to read with columns and use_pandas_metadata=True

2020-06-04 Thread cmsxbc (Jira)
cmsxbc created ARROW-9040:
-

 Summary: [Python][Parquet] "_ParquetDatasetV2" fails to read with 
columns and use_pandas_metadata=True
 Key: ARROW-9040
 URL: https://issues.apache.org/jira/browse/ARROW-9040
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.17.1
Reporter: cmsxbc


When calling _ParquetDatasetV2.read(columns=['column'], use_pandas_metadata=True),
a "TypeError: unhashable type 'dict'" is raised from 
{code:java}
index_columns = set(_get_pandas_index_columns(metadata))
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9039) py_bytes created by pyarrow 0.11.1 cannot be deserialized by more recent versions

2020-06-04 Thread Yoav Git (Jira)
Yoav Git created ARROW-9039:
---

 Summary: py_bytes created by pyarrow 0.11.1 cannot be deserialized 
by more recent versions
 Key: ARROW-9039
 URL: https://issues.apache.org/jira/browse/ARROW-9039
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.1, 0.11.1
 Environment: python, windows
Reporter: Yoav Git


I have been saving dataframes into mongodb using:

{{import pandas as pd; import pyarrow as pa}}
{{df = pd.DataFrame([[1,2,3],[2,3,4]], columns = ['x','y','z'])}}

{{byte = pa.serialize(df).to_buffer().to_pybytes()}}

and then reading back using:

{{df = pa.deserialize(pa.py_buffer(memoryview(byte)))}}

However, pyarrow serialization is not backward-compatible: versions 0.11.1 and 
0.15.1 can each read the pybytes they produced themselves, but neither can read 
the bytes produced by the other.
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9038) [C++] Improve BitBlockCounter

2020-06-04 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-9038:
---

 Summary: [C++] Improve BitBlockCounter
 Key: ARROW-9038
 URL: https://issues.apache.org/jira/browse/ARROW-9038
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai


ARROW-9029 implements BitBlockCounter. There are opportunities to improve its 
popcount performance, per this review comment: 
https://github.com/apache/arrow/pull/7346#discussion_r435005226



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9037) [C++/C-ABI] unable to import array with null count == -1 (which could be exported)

2020-06-04 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-9037:


 Summary: [C++/C-ABI] unable to import array with null count == -1 
(which could be exported)
 Key: ARROW-9037
 URL: https://issues.apache.org/jira/browse/ARROW-9037
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.17.1
Reporter: Zhuo Peng


If an Array is created with null_count == -1 but without any nulls (and thus no 
null bitmap buffer), then ArrayData.null_count will remain -1 when exporting if 
the null count is never computed. The exported C struct then also has 
null_count == -1 [1]. But when importing, if null_count != 0, an error [2] is 
raised.

[1] 
https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L560

[2] 
https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L1404

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9036) Null pointer exception when caching data frames

2020-06-04 Thread Gaurangi Saxena (Jira)
Gaurangi Saxena created ARROW-9036:
--

 Summary: Null pointer exception when caching data frames
 Key: ARROW-9036
 URL: https://issues.apache.org/jira/browse/ARROW-9036
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 0.16.0
Reporter: Gaurangi Saxena


I get an NPE when I try to cache a DataFrame in Spark with Arrow as the read format.

 

Stack Trace - 

java.lang.NullPointerException
 at org.apache.arrow.vector.ipc.ReadChannel.readFully(ReadChannel.java:61)
 at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:649)
 at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
 at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:106)
 at com.google.cloud.spark.bigquery.ArrowBinaryIterator$ArrowReaderIterator.hasNext(ArrowBinaryIterator.scala:84)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9035) 8 vs 64 byte alignment

2020-06-04 Thread Anthony Abate (Jira)
Anthony Abate created ARROW-9035:


 Summary: 8 vs 64 byte alignment
 Key: ARROW-9035
 URL: https://issues.apache.org/jira/browse/ARROW-9035
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Documentation
Affects Versions: 0.17.0
Reporter: Anthony Abate


I used the C++ library to create a very small Arrow file (one field of 5 int32 
values) and was surprised that the buffers are not aligned to 64 bytes as 
described in the documentation section "Buffer Alignment and Padding". Based on 
the examples there, the 20 bytes of int32 data should be padded to 64 bytes, but 
the body is only 24 bytes (see below).

Extracted message metadata (note bodyLength = 24):
{code:java}
{
  version: "V4",
  header_type: "RecordBatch",
  header: {
nodes: [
  {
length: 5,
null_count: 0
  }
],
buffers: [
  {
offset: 0,
length: 0
  },
  {
offset: 0,
length: 20
  }
]
  },
  bodyLength: 24
} {code}
Reading further down, the documentation section "Encapsulated message format" 
says that serialization should use 8-byte alignment.

These two statements seem at odds with each other, and some clarification is 
needed.

Is the documentation wrong? Or should 8-byte alignment be used for the file 
format and 64-byte alignment for IPC?
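
For reference, a tiny sketch of the padding arithmetic behind the two numbers 
above (20 bytes of int32 data rounded up to an 8-byte versus a 64-byte 
boundary):

{code:cpp}
#include <cstdint>
#include <cstdio>

// Round nbytes up to the next multiple of `alignment`.
int64_t PaddedLength(int64_t nbytes, int64_t alignment) {
  return ((nbytes + alignment - 1) / alignment) * alignment;
}

int main() {
  std::printf("8-byte alignment:  %lld\n",
              static_cast<long long>(PaddedLength(20, 8)));   // 24, as observed
  std::printf("64-byte alignment: %lld\n",
              static_cast<long long>(PaddedLength(20, 64)));  // 64, as documented
  return 0;
}
{code}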



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9034) [C++] Implement binary (two bitmap) version of BitBlockCounter

2020-06-04 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9034:
---

 Summary: [C++] Implement binary (two bitmap) version of 
BitBlockCounter
 Key: ARROW-9034
 URL: https://issues.apache.org/jira/browse/ARROW-9034
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


The current BitBlockCounter from ARROW-9029 is useful for unary operations. Some 
operations involve multiple bitmaps, so it is useful to be able to determine the 
block popcounts of the AND of the corresponding words in the bitmaps. Each 
returned block would then contain the number of bits that are set in both 
bitmaps at the same positions.
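
A rough sketch of the core computation (hypothetical helper, not the eventual 
Arrow API), counting bits set in both bitmaps word by word:

{code:cpp}
#include <bitset>
#include <cstdint>
#include <cstring>

// Counts positions where both bitmaps have the bit set, over num_bits bits.
// Simplified: assumes both bitmaps start at bit offset 0 and num_bits is a
// multiple of 64; the real kernel must handle offsets and a trailing word.
int64_t CountAndSetBits(const uint8_t* left, const uint8_t* right,
                        int64_t num_bits) {
  int64_t total = 0;
  for (int64_t i = 0; i < num_bits / 64; ++i) {
    uint64_t left_word, right_word;
    std::memcpy(&left_word, left + i * 8, sizeof(left_word));
    std::memcpy(&right_word, right + i * 8, sizeof(right_word));
    // AND the words, then popcount the result (std::bitset::count is portable).
    total += static_cast<int64_t>(std::bitset<64>(left_word & right_word).count());
  }
  return total;
}
{code}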



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9033) [Python] Add tests to verify that one can build a C++ extension against the manylinux1 wheels

2020-06-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9033:
---

 Summary: [Python] Add tests to verify that one can build a C++ 
extension against the manylinux1 wheels
 Key: ARROW-9033
 URL: https://issues.apache.org/jira/browse/ARROW-9033
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


Some projects want to be able to use the Python wheels to build other Python 
packages with C++ extensions that need to link against libarrow.so. It would be 
great if someone would add automated tests to ensure that our wheel builds can 
be used successfully in this fashion. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9032) [C++] Split arrow/util/bit_util.h into multiple header files

2020-06-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9032:
---

 Summary: [C++] Split arrow/util/bit_util.h into multiple header 
files
 Key: ARROW-9032
 URL: https://issues.apache.org/jira/browse/ARROW-9032
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This header has grown quite large and any given compilation unit's use of it is 
likely limited to only a couple of functions or classes. I suspect it would 
improve compilation time to split up this header into a few headers organized 
by frequency of code use. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9031) [R] Implement conversion from Type::UINT64 to R vector

2020-06-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9031:
---

 Summary: [R] Implement conversion from Type::UINT64 to R vector
 Key: ARROW-9031
 URL: https://issues.apache.org/jira/browse/ARROW-9031
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Wes McKinney
 Fix For: 1.0.0


This case is not handled in array_to_vector.cpp



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9030) [Python] Clean up some usages of pyarrow.compat, move some common functions/symbols to lib.pyx

2020-06-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9030:
---

 Summary: [Python] Clean up some usages of pyarrow.compat, move 
some common functions/symbols to lib.pyx
 Key: ARROW-9030
 URL: https://issues.apache.org/jira/browse/ARROW-9030
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


I started doing this while looking into ARROW-4633



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9029) [C++] Implement BitmapScanner interface to accelerate processing of mostly-not-null data

2020-06-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9029:
---

 Summary: [C++] Implement BitmapScanner interface to accelerate 
processing of mostly-not-null data
 Key: ARROW-9029
 URL: https://issues.apache.org/jira/browse/ARROW-9029
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


In analytics, it is common for data to be all not-null or mostly not-null; data 
with more than 50% nulls tends to be the exception. In this light, our 
{{BitmapReader}} class, which iterates over each bit in a bitmap, can be 
wasteful for mostly-set validity bitmaps.

I propose instead a new interface for use in kernel implementations, for lack 
of a better term {{BitmapScanner}}. This works as follows:

* Uses popcount to accumulate consecutive 64-bit words from a bitmap where all 
values are set, up to some limit (e.g. anywhere from 8 to 128 words -- we can 
use benchmarks to determine what is a good limit). The length of this "all-on" 
run is returned to the caller in a single function call, so that this "run" of 
data can be processed without any bit-by-bit bitmap checking
* If a word containing unset bits is encountered, the scanner will similarly 
accumulate non-full words until the next full word is encountered or a limit is 
hit. The length of this "has nulls" run is returned to the caller, which then 
proceeds bit-by-bit to process the data.

For data with a lot of nulls, this may degrade performance somewhat but 
probably not that much empirically. However, data that is mostly-not-null 
should benefit from this. 

This BitmapScanner utility can probably also be used to accelerate the 
implementation of Filter for mostly-not-null data
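
A rough sketch of the run classification described above (hypothetical names 
and shapes, not the eventual interface), splitting a validity bitmap into runs 
of fully-set words versus words that contain nulls:

{code:cpp}
#include <cstdint>
#include <cstring>
#include <vector>

struct BitmapRun {
  int64_t length;  // number of values covered by the run (multiple of 64 here)
  bool all_set;    // true if every validity bit in the run is set
};

// Simplified: assumes the bitmap starts on a byte boundary and num_values is a
// multiple of 64; the real scanner must handle offsets and a trailing word.
std::vector<BitmapRun> ScanBitmap(const uint8_t* bitmap, int64_t num_values,
                                  int64_t max_words_per_run = 64) {
  std::vector<BitmapRun> runs;
  const int64_t num_words = num_values / 64;
  int64_t i = 0;
  while (i < num_words) {
    uint64_t word;
    std::memcpy(&word, bitmap + i * 8, sizeof(word));
    const bool all_set = (word == ~uint64_t{0});
    int64_t run_words = 1;
    // Extend the run while consecutive words keep the same "all set" property,
    // up to max_words_per_run.
    while (run_words < max_words_per_run && i + run_words < num_words) {
      std::memcpy(&word, bitmap + (i + run_words) * 8, sizeof(word));
      if ((word == ~uint64_t{0}) != all_set) break;
      ++run_words;
    }
    runs.push_back({run_words * 64, all_set});
    i += run_words;
  }
  return runs;
}
{code}

A kernel would process all_set runs without any per-value null checks and fall 
back to bit-by-bit handling for the other runs.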



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9028) [R] Should be able to convert an empty table

2020-06-03 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-9028:
-

 Summary: [R] Should be able to convert an empty table
 Key: ARROW-9028
 URL: https://issues.apache.org/jira/browse/ARROW-9028
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Francois Saint-Jacques






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

