[jira] [Created] (ARROW-18048) [Dev][Archery][Crossbow] Comment bot doesn't report suitable task URLs

2022-10-14 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-18048:


 Summary: [Dev][Archery][Crossbow] Comment bot doesn't report 
suitable task URLs
 Key: ARROW-18048
 URL: https://issues.apache.org/jira/browse/ARROW-18048
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


If we generate a report immediately after we submit CI tasks, the generated 
report doesn't have suitable CI task URLs, because the CI tasks haven't 
started yet. We need to wait for a while before generating a report so that 
we can collect suitable CI task URLs.
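
A minimal sketch of the kind of wait-and-poll loop this implies; 
{{collect_task_urls}} is a hypothetical placeholder here, not an actual 
Archery/Crossbow API:

{code:python}
import time


def wait_for_task_urls(collect_task_urls, tasks, timeout=600, interval=30):
    """Poll until every submitted CI task reports a URL or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        urls = collect_task_urls(tasks)  # hypothetical: task name -> URL or None
        if all(urls.values()):
            return urls
        time.sleep(interval)  # the CI tasks may not have started yet
    return collect_task_urls(tasks)  # best effort after the timeout
{code}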



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18049) [R] Support column renaming in col_select argument to file reading functions

2022-10-14 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18049:


 Summary: [R] Support column renaming in col_select argument to 
file reading functions
 Key: ARROW-18049
 URL: https://issues.apache.org/jira/browse/ARROW-18049
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


We should support the ability to rename columns when reading in data via the 
CSV/Parquet/Feather/JSON file readers.

We currently have an argument {{col_select}}, which allows users to choose 
which columns to read in, but renaming doesn't work.  

To implement this, we'd need to check whether any columns have been renamed by 
{{col_select}} and then update the schema of the object being returned once 
the file has been read.

{code:r}
library(readr)
library(arrow)
readr::read_csv(readr_example("mtcars.csv"), col_select = c(not_hp = hp))
#> # A tibble: 32 × 1
#>    not_hp
#>     <dbl>
#>  1    110
#>  2    110
#>  3     93
#>  4    110
#>  5    175
#>  6    105
#>  7    245
#>  8     62
#>  9     95
#> 10    123
#> # … with 22 more rows
arrow::read_csv_arrow(readr_example("mtcars.csv"), col_select = c(not_hp = hp))
#> # A tibble: 32 × 1
#>       hp
#>    <int>
#>  1   110
#>  2   110
#>  3    93
#>  4   110
#>  5   175
#>  6   105
#>  7   245
#>  8    62
#>  9    95
#> 10   123
#> # … with 22 more rows
{code}
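
For comparison, the same "read first, then rename afterwards" idea expressed 
with pyarrow in Python (a rough sketch only; the filename is a placeholder and 
this is not the proposed R API):

{code:python}
import pyarrow.csv as pv

# Read only the "hp" column, then rename it after the fact.  This mirrors the
# "update the names/schema of the returned object once the file has been read"
# approach described above.
table = pv.read_csv(
    "mtcars.csv",  # assumed local copy of readr's example file
    convert_options=pv.ConvertOptions(include_columns=["hp"]),
)
table = table.rename_columns(["not_hp"])
print(table.column_names)  # ['not_hp']
{code}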




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18050) [C++] Substrait consumer should reject plans containing options that it doesn't recognize

2022-10-14 Thread Weston Pace (Jira)
Weston Pace created ARROW-18050:
---

 Summary: [C++] Substrait consumer should reject plans containing 
options that it doesn't recognize
 Key: ARROW-18050
 URL: https://issues.apache.org/jira/browse/ARROW-18050
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


This is a follow-up to ARROW-17966.

This is a bit tricky because today we basically just pass the call through to:

{noformat}
  using SubstraitCallToArrow =
      std::function<Result<compute::Expression>(const SubstraitCall&)>;
{noformat}

We then have no way of knowing which options were actually used by the call. 
Since we don't know this, we have no way of checking that all of the given 
options were used.

This is rather important: an option that is given but not recognized should be 
rejected, because it means the producer is asking for some kind of specific 
behavior that we are probably not going to be able to provide.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18051) [C++] Enable tests skipped by ARROW-16392

2022-10-14 Thread Weston Pace (Jira)
Weston Pace created ARROW-18051:
---

 Summary: [C++] Enable tests skipped by ARROW-16392
 Key: ARROW-18051
 URL: https://issues.apache.org/jira/browse/ARROW-18051
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


There are a number of unit tests that we still skip (on Windows) due to 
ARROW-16392.  However, ARROW-16392 has been fixed.  There is no reason to skip 
these any longer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18052) pyarrow.parquet.write_to_dataset permission denied when writing to S3 bucket in another account

2022-10-14 Thread John (Jira)
John created ARROW-18052:


 Summary: pyarrow.parquet.write_to_dataset permission denied when 
writing to S3 bucket in another account
 Key: ARROW-18052
 URL: https://issues.apache.org/jira/browse/ARROW-18052
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 9.0.0
Reporter: John


We are using pyarrow.parquet.write_to_dataset to write to an S3 file system in 
another account. This fails because write_to_dataset calls 
pyarrow.dataset.write_dataset with create_dir=True, and there is no parameter 
in the call to write_to_dataset that allows you to set create_dir to False. 
When create_dir is set to False, the write works correctly.
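
A minimal sketch of the workaround implied above, calling 
{{pyarrow.dataset.write_dataset}} directly so that {{create_dir}} can be 
disabled. The bucket, prefix and region are placeholders, and credentials are 
assumed to come from the environment:

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import fs

table = pa.table({"id": [1, 2, 3]})

# S3 bucket in the other account; region and path are placeholders.
s3 = fs.S3FileSystem(region="us-east-1")

# Calling write_dataset directly lets us pass create_dir=False, skipping the
# directory-creation call that the cross-account bucket policy rejects.
ds.write_dataset(
    table,
    "my-bucket/some/prefix",
    format="parquet",
    filesystem=s3,
    create_dir=False,
)
{code}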



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18053) [Dev] merge_arrow_pr.py doesn't detect Co-authored-by:

2022-10-14 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-18053:


 Summary: [Dev] merge_arrow_pr.py doesn't detect Co-authored-by:
 Key: ARROW-18053
 URL: https://issues.apache.org/jira/browse/ARROW-18053
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


For example, {{Co-authored-by:}} in 
https://github.com/apache/arrow/pull/14381/commits/41b7f26a3631c5c6cfa3abd369fbf39263cfb536
 isn't detected.
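
For reference, a generic illustration of the kind of trailer detection that is 
missing (this is a sketch, not the actual {{merge_arrow_pr.py}} code; the 
commit message and names are made up):

{code:python}
import re

CO_AUTHORED_BY_RE = re.compile(
    r"^Co-authored-by:\s*(.+?)\s*<(.+?)>\s*$", re.MULTILINE)


def find_co_authors(commit_message):
    """Return (name, email) pairs found in Co-authored-by: trailers."""
    return CO_AUTHORED_BY_RE.findall(commit_message)


message = """ARROW-00000: [Dev] Example commit

Co-authored-by: Jane Doe <jane@example.com>
"""
print(find_co_authors(message))  # [('Jane Doe', 'jane@example.com')]
{code}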



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18054) [Python][CI] Enable Cython tests on windows wheels

2022-10-14 Thread Jira
Raúl Cumplido created ARROW-18054:
-

 Summary: [Python][CI] Enable Cython tests on windows wheels
 Key: ARROW-18054
 URL: https://issues.apache.org/jira/browse/ARROW-18054
 Project: Apache Arrow
  Issue Type: Task
  Components: Continuous Integration, Python
Reporter: Raúl Cumplido
Assignee: Raúl Cumplido
 Fix For: 10.0.0


We currently have `set PYARROW_TEST_CYTHON=OFF` on the Windows wheel tests. We 
should run the Cython tests for Windows too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18055) arrow-dataset-dataset-writer-test still times out occasionally

2022-10-14 Thread Weston Pace (Jira)
Weston Pace created ARROW-18055:
---

 Summary: arrow-dataset-dataset-writer-test still times out 
occasionally
 Key: ARROW-18055
 URL: https://issues.apache.org/jira/browse/ARROW-18055
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace
 Fix For: 10.0.0


https://github.com/ursacomputing/crossbow/actions/runs/3242586440/jobs/5316061574

https://github.com/ursacomputing/crossbow/actions/runs/3246657465/jobs/5325660174



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18056) [Ruby] Add support for building Arrow::Table from {name: Arrow::Tensor}

2022-10-14 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-18056:


 Summary: [Ruby] Add support for building Arrow::Table from {name: 
Arrow::Tensor}
 Key: ARROW-18056
 URL: https://issues.apache.org/jira/browse/ARROW-18056
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Ruby
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18057) [R] tests for slice functions fail on builds without Datasets capability

2022-10-14 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18057:


 Summary: [R] tests for slice functions fail on builds without 
Datasets capability
 Key: ARROW-18057
 URL: https://issues.apache.org/jira/browse/ARROW-18057
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


The changes in ARROW-13766 introduced a test which depends on datasets 
functionality being enabled - we should skip this on CI builds where it is not.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18058) [Dev][Archery] Remove removed ARROW_JNI related code

2022-10-14 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-18058:


 Summary: [Dev][Archery] Remove removed ARROW_JNI related code
 Key: ARROW-18058
 URL: https://issues.apache.org/jira/browse/ARROW-18058
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18059) [CI][Packaging] Nightly test for centos-8-stream-arm64 times out installing arrow-libs

2022-10-14 Thread Jira
Raúl Cumplido created ARROW-18059:
-

 Summary: [CI][Packaging] Nightly test for centos-8-stream-arm64 
times out installing arrow-libs
 Key: ARROW-18059
 URL: https://issues.apache.org/jira/browse/ARROW-18059
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, Packaging
Reporter: Raúl Cumplido


It seems we sometimes time out when installing some dependencies:
{code:java}
+ dnf install -y --enablerepo=powertools arrow9-libs

Last metadata expiration check: 0:00:05 ago on Thu Oct 13 10:45:07 2022.

Dependencies resolved.



 Package          Arch        Version        Repository                     Size



Installing:

 arrow9-libs      aarch64     9.0.0-1.el8    apache-arrow-centos-stream    6.3 M

Transaction Summary



Install  1 Package

Total download size: 6.3 M

Installed size: 26 M

Downloading Packages:

[MIRROR] arrow9-libs-9.0.0-1.el8.aarch64.rpm: Curl error (28): Timeout was 
reached for 
https://apache.jfrog.io/artifactory/arrow/centos/8-stream/aarch64/Packages/arrow9-libs-9.0.0-1.el8.aarch64.rpm
 [Operation timed out after 3 milliseconds with 0 out of 0 bytes received]

[MIRROR] arrow9-libs-9.0.0-1.el8.aarch64.rpm: Curl error (35): SSL connect 
error for 
https://apache.jfrog.io/artifactory/arrow/centos/8-stream/aarch64/Packages/arrow9-libs-9.0.0-1.el8.aarch64.rpm
 [OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to apache.jfrog.io:443 ]

[MIRROR] arrow9-libs-9.0.0-1.el8.aarch64.rpm: Curl error (35): SSL connect 
error for 
https://apache.jfrog.io/artifactory/arrow/centos/8-stream/aarch64/Packages/arrow9-libs-9.0.0-1.el8.aarch64.rpm
 [OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to apache.jfrog.io:443 ]

[MIRROR] arrow9-libs-9.0.0-1.el8.aarch64.rpm: Curl error (28): Timeout was 
reached for 
https://apache.jfrog.io/artifactory/arrow/centos/8-stream/aarch64/Packages/arrow9-libs-9.0.0-1.el8.aarch64.rpm
 [Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds]

[FAILED] arrow9-libs-9.0.0-1.el8.aarch64.rpm: No more mirrors to try - All 
mirrors were already tried without success
{code}
This is the job that failed: 
 * [centos-8-stream-arm64|https://github.com/ursacomputing/crossbow/runs/8864772917]

We should probably retry, or find out whether there is a way to override the 
default timeout, since the library is available.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18060) [C++] Writing a dataset with 0 rows doesn't create any files

2022-10-14 Thread David Li (Jira)
David Li created ARROW-18060:


 Summary: [C++] Writing a dataset with 0 rows doesn't create any 
files
 Key: ARROW-18060
 URL: https://issues.apache.org/jira/browse/ARROW-18060
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 9.0.0
Reporter: David Li


If the input data has no rows, no files get created. This is potentially 
unexpected, as it looks like "nothing happened". It might be nicer to create 
an empty file. With partitioning, though, that gets weird (there are no 
partition values), so maybe an error would make more sense instead.

Reproduction in Python
{code:python}
import tempfile
from pathlib import Path

import pyarrow
import pyarrow.dataset

print("PyArrow version:", pyarrow.__version__)

table = pyarrow.table([
    [],
], schema=pyarrow.schema([
    ("ints", "int64"),
]))

with tempfile.TemporaryDirectory() as d:
    pyarrow.dataset.write_dataset(table, d, format="feather")
    print(list(Path(d).iterdir()))
{code}
Output
{noformat}
> python repro.py
PyArrow version: 9.0.0
[] {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18061) [CI][R] Reduce number of jobs on every commit

2022-10-14 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-18061:
---

 Summary: [CI][R] Reduce number of jobs on every commit
 Key: ARROW-18061
 URL: https://issues.apache.org/jira/browse/ARROW-18061
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Neal Richardson


* The "force tests: true" job fully contains all of the tests in the false 
version, so the false variant is redundant
* The CentOS 7 job was previously useful for catching gcc 4.8 issues, but 
that's no longer a problem, and the CentOS with devtoolset-8 variant is 
already tested nightly
* The Windows builds can be pruned if we test only on the current release and 
development versions, which only use UCRT. We test the older build nightly in 
the r-binary-packages workflow



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18062) [R] error in CI jobs for R 3.5 and 3.6 when the R package is being installed

2022-10-14 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18062:


 Summary: [R] error in CI jobs for R 3.5 and 3.6 when the R package 
is being installed
 Key: ARROW-18062
 URL: https://issues.apache.org/jira/browse/ARROW-18062
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane


e.g. 
https://github.com/ursacomputing/crossbow/actions/runs/3246698242/jobs/5325752692#step:5:3164

From the install logs on that CI job:
{code}
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for ‘arrow’:
 .onLoad failed in loadNamespace() for 'arrow', details:
  call: fun_cache[[unqualified_name]] <- fun
  error: invalid type/length (closure/0) in vector allocation
Error: loading failed
{code}

The nightly builds for R 3.5 and 3.6 are currently failing with this error.

The line of code where the error comes from was added in ARROW-16444, but 
seeing as that was 3 months ago, it seems unlikely that that change introduced 
the error.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18063) [C++][Python]

2022-10-14 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-18063:


 Summary: [C++][Python] 
 Key: ARROW-18063
 URL: https://issues.apache.org/jira/browse/ARROW-18063
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ben Kietzman


[Mailing list 
thread|https://lists.apache.org/thread/r484sqrd6xjdd058prbrcwh3t5vg91so]

The goal is to:
- generate a Substrait plan in Python using Ibis
- ... wherein tables are specified using custom URLs
- use the Python API {{run_query}} to execute the plan
- ... against source data which is *streamed* from those URLs rather than 
pulled fully into local memory

The obstacles include:
- The API for constructing a data stream from the custom URLs is only 
available in C++
- The Python {{run_query}} function requires tables as input and cannot 
accept a RecordBatchReader, even if one could be constructed from a custom URL
- Writing custom Cython is not preferred

Some potential solutions:
- Use {{ExecuteSerializedPlan()}}, which is directly usable from C++, so that 
construction of data sources need not be handled in Python. Passing a buffer 
from Python/Ibis down to C++ is much simpler and can be done without writing 
Cython
- Refactor NamedTableProvider from a lambda mapping {{names -> data source}} 
into a registry so that data source factories can be added from C++ and then 
referenced by name from Python
- Extend {{run_query}} to support non-Table sources and require the user to 
write a Python mapping from URLs to {{pa.RecordBatchReader}} (see the sketch 
below)
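
A rough sketch, from the user's point of view, of what that last option might 
look like. The {{table_provider}} keyword is a hypothetical extension of 
{{run_query}}, not its current signature, and the data here is fabricated in 
memory rather than streamed from a URL:

{code:python}
import pyarrow as pa
import pyarrow.substrait as substrait


def table_provider(names):
    # Hypothetical user-supplied mapping from the table name(s) referenced in
    # the Substrait plan to a streaming source.  A real implementation would
    # open a RecordBatchReader against the custom URL instead of fabricating
    # batches in memory.
    batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])
    return pa.RecordBatchReader.from_batches(batch.schema, [batch])


# plan_buf would be the serialized Substrait plan produced by Ibis.
# reader = substrait.run_query(plan_buf, table_provider=table_provider)
# for batch in reader:
#     ...
{code}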



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18064) Error of wrong number of rows read from file

2022-10-14 Thread Blake erickson (Jira)
Blake erickson created ARROW-18064:
--

 Summary: Error of wrong number of rows read from file
 Key: ARROW-18064
 URL: https://issues.apache.org/jira/browse/ARROW-18064
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 9.0.0, 8.0.1, 8.0.0, 7.0.1, 7.0.0
 Environment: Python Info
3.10.7 (tags/v3.10.7:6cc6b13, Sep  5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)]
Pyarrow Info
6.0.1
Platform Info
Windows-10-10.0.19042-SP0
Windows
10
10.0.19042
19042
AMD64
Reporter: Blake erickson
 Attachments: badplug.parquet, readBadParquet.py

On versions greater than 6.0.1, reading the table fails with an error saying 
"expected length n, got n=1 rows".

The table can be read fine column by column, or with a fixed number of rows 
matching the metadata. It reads correctly in version 6.0.1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18065) Change the way Arrow reads buffered IPC files

2022-10-14 Thread Jira
Percy Camilo Triveño Aucahuasi created ARROW-18065:
--

 Summary: Change the way Arrow reads buffered IPC files
 Key: ARROW-18065
 URL: https://issues.apache.org/jira/browse/ARROW-18065
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Percy Camilo Triveño Aucahuasi
Assignee: Percy Camilo Triveño Aucahuasi


[PR#14226|https://github.com/apache/arrow/pull/14226], which solves 
ARROW-17599, changes the logic for reading buffered Parquet files through the 
async FileReader::RecordBatchGenerator. Those changes were initially made for 
the Parquet file reader, but we need to apply the same changes to the IPC file 
reader as well.

The goal of this ticket is to change the IPC reader for buffered files in the 
same way as defined by 
[PR#14226|https://github.com/apache/arrow/pull/14226].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18066) [Python][Doc] Documentation for Test Groups out of sync

2022-10-14 Thread Bryce Mecum (Jira)
Bryce Mecum created ARROW-18066:
---

 Summary: [Python][Doc] Documentation for Test Groups out of sync
 Key: ARROW-18066
 URL: https://issues.apache.org/jira/browse/ARROW-18066
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 10.0.0
Reporter: Bryce Mecum


In the Python documentation for [Test 
Groups|https://arrow.apache.org/docs/dev/developers/python.html#test-groups] 
it's stated:
{quote}We have many tests that are grouped together using pytest marks. Some of 
these are disabled by default. To enable a test group, pass --$GROUP_NAME, e.g. 
--parquet. [continues]
{quote}
However, I get:

{{$ pytest --parquet # from arrow/python dir}}
{{ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]}}
{{pytest: error: unrecognized arguments: --parquet}}

 

Am I interpreting the docs wrong or has something changed since writing the 
docs?

 

Related: `pytest --markers` only reports the full list of pytest marks when 
I'm in the `arrow/python/pyarrow` directory, but not in `arrow/python`. When 
I'm in the `arrow/python/pyarrow` subdirectory, marks work as expected (i.e. 
`pytest -m parquet` works).
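
For context, here is a generic sketch of how a pytest command-line group flag 
like {{--parquet}} is typically wired up in a {{conftest.py}} via 
{{pytest_addoption}}; this also explains why such flags are only recognized 
when pytest picks up the right conftest. This is an illustration only, not 
Arrow's actual conftest:

{code:python}
# conftest.py -- illustrative only
groups = ["parquet", "dataset"]  # example group names


def pytest_addoption(parser):
    # Register an --<group> flag for each test group.  pytest only sees these
    # options when this conftest.py is inside the directory tree it collects
    # from, which is why the flags can be unrecognized elsewhere.
    for group in groups:
        parser.addoption(
            "--{}".format(group), action="store_true", default=False,
            help="enable the {} tests".format(group),
        )


def pytest_configure(config):
    # Declare the corresponding markers so `pytest --markers` lists them.
    for group in groups:
        config.addinivalue_line(
            "markers", "{}: {} test group".format(group, group))
{code}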



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18067) Compile error on ARM64

2022-10-14 Thread chenqiang (Jira)
chenqiang created ARROW-18067:
-

 Summary: Compile error on ARM64
 Key: ARROW-18067
 URL: https://issues.apache.org/jira/browse/ARROW-18067
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 8.0.0
 Environment: Kunpeng 920 CentOS 7
Reporter: chenqiang
 Attachments: image-2022-10-15-11-08-33-142.png

Compile error on ARM64: failed to verify the third-party Boost archive's 
SHA256 checksum.

The Boost version is boost_1_75_0.tar.gz; the download address is: 
[https://sourceforge.net/projects/boost/files/boost/1.75.0/]

!image-2022-10-15-11-08-33-142.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)