[jira] [Updated] (ARROW-16690) [R] Ability to specify chunk size in R flight do_put method

2022-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16690:
---
Labels: FlightClient R flight pull-request-available  (was: FlightClient R 
flight)

> [R] Ability to specify chunk size in R flight do_put method
> ---
>
> Key: ARROW-16690
> URL: https://issues.apache.org/jira/browse/ARROW-16690
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Affects Versions: 8.0.0
>Reporter: Chris Dunderdale
>Priority: Minor
>  Labels: FlightClient, R, flight, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi there :)
> Duplicating here what's already on the 
> [PR|https://github.com/apache/arrow/pull/13267]
> *Summary*
> An additional parameter in Flight do_put to specify chunk size in R.
> *Problem*
> Currently, all data is sent in a single message when using 
> [do_put|https://github.com/apache/arrow/blob/647371b504df166860bd33346dcbd962c85e046f/r/R/flight.R]
>  in R. Users will likely want to control the batch sizes without building a 
> custom do_put method.
> *Solution*
> Additional (optional) parameter to specify chunk size.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16690) [R] Ability to specify chunk size in R flight do_put method

2022-05-30 Thread Chris Dunderdale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Dunderdale updated ARROW-16690:
-
Summary: [R] Ability to specify chunk size in R flight do_put method  (was: 
Ability to specify chunk size in R flight do_put method)

> [R] Ability to specify chunk size in R flight do_put method
> ---
>
> Key: ARROW-16690
> URL: https://issues.apache.org/jira/browse/ARROW-16690
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Affects Versions: 8.0.0
>Reporter: Chris Dunderdale
>Priority: Minor
>  Labels: FlightClient, R, flight
>
> Hi there :)
> Duplicating here what's already on the 
> [PR|https://github.com/apache/arrow/pull/13267]
> *Summary*
> An additional parameter in Flight do_put to specify chunk size in R.
> *Problem*
> Currently, all data is sent in a single message when using 
> [do_put|https://github.com/apache/arrow/blob/647371b504df166860bd33346dcbd962c85e046f/r/R/flight.R]
>  in R. Users will likely want to control the batch sizes without building a 
> custom do_put method.
> *Solution*
> Additional (optional) parameter to specify chunk size.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16690) Ability to specify chunk size in R flight do_put method

2022-05-30 Thread Chris Dunderdale (Jira)
Chris Dunderdale created ARROW-16690:


 Summary: Ability to specify chunk size in R flight do_put method
 Key: ARROW-16690
 URL: https://issues.apache.org/jira/browse/ARROW-16690
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC
Affects Versions: 8.0.0
Reporter: Chris Dunderdale


Hi there :)

Duplicating here what's already on the 
[PR|https://github.com/apache/arrow/pull/13267]

*Summary*
An additional parameter in Flight do_put to specify chunk size in R.

*Problem*
Currently, all data is sent in a single message when using 
[do_put|https://github.com/apache/arrow/blob/647371b504df166860bd33346dcbd962c85e046f/r/R/flight.R]
 in R. Users will likely want to control the batch sizes without building a 
custom do_put method.

*Solution*
Additional (optional) parameter to specify chunk size.
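The requested behavior can be illustrated with a plain-Python sketch (a hypothetical helper, not the actual arrow R code): instead of sending all rows in one message, split the input into fixed-size chunks and send one message per chunk.

```python
def iter_chunks(rows, chunk_size):
    """Yield successive fixed-size chunks of `rows` (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for i in range(0, len(rows), chunk_size):
        yield rows[i:i + chunk_size]

# With a chunk size of 4, ten rows would go out as three messages
# (4 + 4 + 2 rows) rather than one message of ten.
messages = list(iter_chunks(list(range(10)), chunk_size=4))
```

An optional `chunk_size` parameter on do_put would effectively apply this split before writing each batch to the stream.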



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16687) [C++] Can not find conda-installed Google Benchmark library

2022-05-30 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544049#comment-17544049
 ] 

Kouhei Sutou commented on ARROW-16687:
--

Could you provide the CMake log with {{-DCMAKE_FIND_DEBUG_MODE=ON}}?

> [C++] Can not find conda-installed Google Benchmark library
> ---
>
> Key: ARROW-16687
> URL: https://issues.apache.org/jira/browse/ARROW-16687
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> I have {{benchmark 1.6.1}} installed from conda-forge, yet when trying to 
> build Arrow C++ with benchmarks enabled I get the following error:
> {code}
> CMake Error at cmake_modules/ThirdpartyToolchain.cmake:253 (find_package):
>   By not providing "Findbenchmark.cmake" in CMAKE_MODULE_PATH this project
>   has asked CMake to find a package configuration file provided by
>   "benchmark", but CMake did not find one.
>   Could not find a package configuration file provided by "benchmark"
>   (requested version 1.6.0) with any of the following names:
> benchmarkConfig.cmake
> benchmark-config.cmake
>   Add the installation prefix of "benchmark" to CMAKE_PREFIX_PATH or set
>   "benchmark_DIR" to a directory containing one of the above files.  If
>   "benchmark" provides a separate development package or SDK, be sure it has
>   been installed.
> Call Stack (most recent call first):
>   cmake_modules/ThirdpartyToolchain.cmake:2141 (resolve_dependency)
>   CMakeLists.txt:567 (include)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16689) [CI] Improve R Nightly Workflow

2022-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16689:
---
Labels: pull-request-available  (was: )

> [CI] Improve R Nightly Workflow
> ---
>
> Key: ARROW-16689
> URL: https://issues.apache.org/jira/browse/ARROW-16689
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Continuous Integration
>Reporter: Jacob Wujciak-Jens
>Assignee: Jacob Wujciak-Jens
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Only upload if all tests succeed, improve overall polish, and improve 
> documentation.
> Add an Ubuntu 22.04 binary (see ARROW-16678).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (ARROW-16688) [R][Python] Extension types cannot be registered in both R and Python

2022-05-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544039#comment-17544039
 ] 

Antoine Pitrou edited comment on ARROW-16688 at 5/30/22 7:27 PM:
-

Also it would be nice if you could test with the published R and Python binary 
packages instead
(and/or nightly builds thereof).


was (Author: pitrou):
Also it would be nice if you could test with the published R and Python binary 
packages instead.

> [R][Python] Extension types cannot be registered in both R and Python
> -
>
> Key: ARROW-16688
> URL: https://issues.apache.org/jira/browse/ARROW-16688
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python, R
>Reporter: Dewey Dunnington
>Priority: Major
>
> When registering extension types as is now possible in the R bindings, it 
> looks as though we cannot register an extension type in R and Python at the 
> same time:
> {code:R}
> # apache/arrow@master
> library(arrow, warn.conflicts = FALSE)
> library(reticulate)
> # this is a virtualenv with pyarrow installed against the same commit
> use_virtualenv(
>   "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev",
>   required = TRUE
> )
> pa <- import("pyarrow")
> pa[["__version__"]]
> #> [1] "9.0.0.dev131+g8a36f0f6c"
> py_run_string("
> import pyarrow as pa
> class TestExtensionType(pa.ExtensionType):
> 
> def __init__(self):
> super().__init__(pa.int32(), 'arrow.test_type')
> 
> def __arrow_ext_serialize__(self):
> return b''
> @classmethod
> def __arrow_ext_deserialize__(cls, storage_type, serialized):
> return cls()
> pa.register_extension_type(TestExtensionType())
> ")
> arrow::register_extension_type(
>   arrow::new_extension_type(int32(), "arrow.test_type")
> )
> #> Error: Key error: A type extension with name arrow.test_type already 
> defined
> {code}
> I also get a segfault if I try to surface a Python type into R (probably 
> because the R bindings mistakenly assume that if {{type.id() == 
> Type::EXTENSION}} then it is safe to cast to our own {{ExtensionType}} C++ 
> subclass that implements R-specific things).
> This came about because the 'geoarrow' Python and 'geoarrow' R packages both 
> register a number of extension type definitions.
> - geoarrow's Python registration: 
> https://github.com/jorisvandenbossche/python-geoarrow/blob/main/src/geoarrow/extension_types.py#L108-L117
> - geoarrow's R registration: 
> https://github.com/paleolimbot/geoarrow/blob/master/R/pkg-arrow.R#L208-L223
> I can also force an interaction if I build GDAL against the same Arrow that 
> the arrow R package is linked against and attempt to load a Feather file 
> saved with an extension type using the sf package. I will attempt to recreate 
> that interaction as well in both R and Python.
> I don't know enough about linking to know to what extent this is linked to my 
> own development setup/build of the R package, although I think there are at 
> least some environments where a shared library is picked up first by the R 
> config script (fedora36, for example). It does look like my own R package 
> build is dynamically linking to libarrow.dylib.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16688) [R][Python] Extension types cannot be registered in both R and Python

2022-05-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544039#comment-17544039
 ] 

Antoine Pitrou commented on ARROW-16688:


Also it would be nice if you could test with the published R and Python binary 
packages instead.

> [R][Python] Extension types cannot be registered in both R and Python
> -
>
> Key: ARROW-16688
> URL: https://issues.apache.org/jira/browse/ARROW-16688
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python, R
>Reporter: Dewey Dunnington
>Priority: Major
>
> When registering extension types as is now possible in the R bindings, it 
> looks as though we cannot register an extension type in R and Python at the 
> same time:
> {code:R}
> # apache/arrow@master
> library(arrow, warn.conflicts = FALSE)
> library(reticulate)
> # this is a virtualenv with pyarrow installed against the same commit
> use_virtualenv(
>   "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev",
>   required = TRUE
> )
> pa <- import("pyarrow")
> pa[["__version__"]]
> #> [1] "9.0.0.dev131+g8a36f0f6c"
> py_run_string("
> import pyarrow as pa
> class TestExtensionType(pa.ExtensionType):
> 
> def __init__(self):
> super().__init__(pa.int32(), 'arrow.test_type')
> 
> def __arrow_ext_serialize__(self):
> return b''
> @classmethod
> def __arrow_ext_deserialize__(cls, storage_type, serialized):
> return cls()
> pa.register_extension_type(TestExtensionType())
> ")
> arrow::register_extension_type(
>   arrow::new_extension_type(int32(), "arrow.test_type")
> )
> #> Error: Key error: A type extension with name arrow.test_type already 
> defined
> {code}
> I also get a segfault if I try to surface a Python type into R (probably 
> because the R bindings mistakenly assume that if {{type.id() == 
> Type::EXTENSION}} then it is safe to cast to our own {{ExtensionType}} C++ 
> subclass that implements R-specific things).
> This came about because the 'geoarrow' Python and 'geoarrow' R packages both 
> register a number of extension type definitions.
> - geoarrow's Python registration: 
> https://github.com/jorisvandenbossche/python-geoarrow/blob/main/src/geoarrow/extension_types.py#L108-L117
> - geoarrow's R registration: 
> https://github.com/paleolimbot/geoarrow/blob/master/R/pkg-arrow.R#L208-L223
> I can also force an interaction if I build GDAL against the same Arrow that 
> the arrow R package is linked against and attempt to load a Feather file 
> saved with an extension type using the sf package. I will attempt to recreate 
> that interaction as well in both R and Python.
> I don't know enough about linking to know to what extent this is linked to my 
> own development setup/build of the R package, although I think there are at 
> least some environments where a shared library is picked up first by the R 
> config script (fedora36, for example). It does look like my own R package 
> build is dynamically linking to libarrow.dylib.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16688) [R][Python] Extension types cannot be registered in both R and Python

2022-05-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544038#comment-17544038
 ] 

Antoine Pitrou commented on ARROW-16688:


{quote}probably because the R bindings mistakenly assume that if type.id() == 
Type::EXTENSION then it is safe to cast to our own ExtensionType C++ subclass 
that implements R-specific things
{quote}

That would be the first thing to fix IMHO.

> [R][Python] Extension types cannot be registered in both R and Python
> -
>
> Key: ARROW-16688
> URL: https://issues.apache.org/jira/browse/ARROW-16688
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python, R
>Reporter: Dewey Dunnington
>Priority: Major
>
> When registering extension types as is now possible in the R bindings, it 
> looks as though we cannot register an extension type in R and Python at the 
> same time:
> {code:R}
> # apache/arrow@master
> library(arrow, warn.conflicts = FALSE)
> library(reticulate)
> # this is a virtualenv with pyarrow installed against the same commit
> use_virtualenv(
>   "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev",
>   required = TRUE
> )
> pa <- import("pyarrow")
> pa[["__version__"]]
> #> [1] "9.0.0.dev131+g8a36f0f6c"
> py_run_string("
> import pyarrow as pa
> class TestExtensionType(pa.ExtensionType):
> 
> def __init__(self):
> super().__init__(pa.int32(), 'arrow.test_type')
> 
> def __arrow_ext_serialize__(self):
> return b''
> @classmethod
> def __arrow_ext_deserialize__(cls, storage_type, serialized):
> return cls()
> pa.register_extension_type(TestExtensionType())
> ")
> arrow::register_extension_type(
>   arrow::new_extension_type(int32(), "arrow.test_type")
> )
> #> Error: Key error: A type extension with name arrow.test_type already 
> defined
> {code}
> I also get a segfault if I try to surface a Python type into R (probably 
> because the R bindings mistakenly assume that if {{type.id() == 
> Type::EXTENSION}} then it is safe to cast to our own {{ExtensionType}} C++ 
> subclass that implements R-specific things).
> This came about because the 'geoarrow' Python and 'geoarrow' R packages both 
> register a number of extension type definitions.
> - geoarrow's Python registration: 
> https://github.com/jorisvandenbossche/python-geoarrow/blob/main/src/geoarrow/extension_types.py#L108-L117
> - geoarrow's R registration: 
> https://github.com/paleolimbot/geoarrow/blob/master/R/pkg-arrow.R#L208-L223
> I can also force an interaction if I build GDAL against the same Arrow that 
> the arrow R package is linked against and attempt to load a Feather file 
> saved with an extension type using the sf package. I will attempt to recreate 
> that interaction as well in both R and Python.
> I don't know enough about linking to know to what extent this is linked to my 
> own development setup/build of the R package, although I think there are at 
> least some environments where a shared library is picked up first by the R 
> config script (fedora36, for example). It does look like my own R package 
> build is dynamically linking to libarrow.dylib.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16689) [CI] Improve R Nightly Workflow

2022-05-30 Thread Jacob Wujciak-Jens (Jira)
Jacob Wujciak-Jens created ARROW-16689:
--

 Summary: [CI] Improve R Nightly Workflow
 Key: ARROW-16689
 URL: https://issues.apache.org/jira/browse/ARROW-16689
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Continuous Integration
Reporter: Jacob Wujciak-Jens
Assignee: Jacob Wujciak-Jens
 Fix For: 9.0.0


Only upload if all tests succeed, improve overall polish, and improve 
documentation.

Add an Ubuntu 22.04 binary (see ARROW-16678).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16688) [R][Python] Extension types cannot be registered in both R and Python

2022-05-30 Thread Dewey Dunnington (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544033#comment-17544033
 ] 

Dewey Dunnington commented on ARROW-16688:
--

cc [~jorisvandenbossche]

> [R][Python] Extension types cannot be registered in both R and Python
> -
>
> Key: ARROW-16688
> URL: https://issues.apache.org/jira/browse/ARROW-16688
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python, R
>Reporter: Dewey Dunnington
>Priority: Major
>
> When registering extension types as is now possible in the R bindings, it 
> looks as though we cannot register an extension type in R and Python at the 
> same time:
> {code:R}
> # apache/arrow@master
> library(arrow, warn.conflicts = FALSE)
> library(reticulate)
> # this is a virtualenv with pyarrow installed against the same commit
> use_virtualenv(
>   "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev",
>   required = TRUE
> )
> pa <- import("pyarrow")
> pa[["__version__"]]
> #> [1] "9.0.0.dev131+g8a36f0f6c"
> py_run_string("
> import pyarrow as pa
> class TestExtensionType(pa.ExtensionType):
> 
> def __init__(self):
> super().__init__(pa.int32(), 'arrow.test_type')
> 
> def __arrow_ext_serialize__(self):
> return b''
> @classmethod
> def __arrow_ext_deserialize__(cls, storage_type, serialized):
> return cls()
> pa.register_extension_type(TestExtensionType())
> ")
> arrow::register_extension_type(
>   arrow::new_extension_type(int32(), "arrow.test_type")
> )
> #> Error: Key error: A type extension with name arrow.test_type already 
> defined
> {code}
> I also get a segfault if I try to surface a Python type into R (probably 
> because the R bindings mistakenly assume that if {{type.id() == 
> Type::EXTENSION}} then it is safe to cast to our own {{ExtensionType}} C++ 
> subclass that implements R-specific things).
> This came about because the 'geoarrow' Python and 'geoarrow' R packages both 
> register a number of extension type definitions.
> - geoarrow's Python registration: 
> https://github.com/jorisvandenbossche/python-geoarrow/blob/main/src/geoarrow/extension_types.py#L108-L117
> - geoarrow's R registration: 
> https://github.com/paleolimbot/geoarrow/blob/master/R/pkg-arrow.R#L208-L223
> I can also force an interaction if I build GDAL against the same Arrow that 
> the arrow R package is linked against and attempt to load a Feather file 
> saved with an extension type using the sf package. I will attempt to recreate 
> that interaction as well in both R and Python.
> I don't know enough about linking to know to what extent this is linked to my 
> own development setup/build of the R package, although I think there are at 
> least some environments where a shared library is picked up first by the R 
> config script (fedora36, for example). It does look like my own R package 
> build is dynamically linking to libarrow.dylib.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16688) [R][Python] Extension types cannot be registered in both R and Python

2022-05-30 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-16688:


 Summary: [R][Python] Extension types cannot be registered in both 
R and Python
 Key: ARROW-16688
 URL: https://issues.apache.org/jira/browse/ARROW-16688
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python, R
Reporter: Dewey Dunnington


When registering extension types as is now possible in the R bindings, it looks 
as though we cannot register an extension type in R and Python at the same time:

{code:R}
# apache/arrow@master
library(arrow, warn.conflicts = FALSE)
library(reticulate)

# this is a virtualenv with pyarrow installed against the same commit
use_virtualenv(
  "/Users/deweydunnington/Desktop/rscratch/pyarrow-dev",
  required = TRUE
)

pa <- import("pyarrow")
pa[["__version__"]]
#> [1] "9.0.0.dev131+g8a36f0f6c"

py_run_string("
import pyarrow as pa

class TestExtensionType(pa.ExtensionType):

def __init__(self):
super().__init__(pa.int32(), 'arrow.test_type')

def __arrow_ext_serialize__(self):
return b''

@classmethod
def __arrow_ext_deserialize__(cls, storage_type, serialized):
return cls()


pa.register_extension_type(TestExtensionType())
")

arrow::register_extension_type(
  arrow::new_extension_type(int32(), "arrow.test_type")
)
#> Error: Key error: A type extension with name arrow.test_type already defined
{code}
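The error arises because extension types live in a single process-wide registry keyed by type name, shared by every binding loaded into the process. A minimal Python sketch of that behavior (a hypothetical class, not Arrow's actual API):

```python
class ExtensionTypeRegistry:
    """Hypothetical sketch of a process-wide, name-keyed type registry."""

    def __init__(self):
        self._types = {}

    def register(self, name, extension_type):
        if name in self._types:
            # Mirrors: "Key error: A type extension with name ... already defined"
            raise KeyError(f"A type extension with name {name} already defined")
        self._types[name] = extension_type

    def lookup(self, name):
        return self._types.get(name)

registry = ExtensionTypeRegistry()
registry.register("arrow.test_type", object())      # first caller (e.g. Python) wins
try:
    registry.register("arrow.test_type", object())  # second caller (e.g. R) fails
    second_registration_ok = True
except KeyError:
    second_registration_ok = False
```

Under this model, whichever binding registers "arrow.test_type" first owns the entry, and the second registration fails exactly as in the R error above.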

I also get a segfault if I try to surface a Python type into R (probably 
because the R bindings mistakenly assume that if {{type.id() == 
Type::EXTENSION}} then it is safe to cast to our own {{ExtensionType}} C++ 
subclass that implements R-specific things).

This came about because the 'geoarrow' Python and 'geoarrow' R packages both 
register a number of extension type definitions.

- geoarrow's Python registration: 
https://github.com/jorisvandenbossche/python-geoarrow/blob/main/src/geoarrow/extension_types.py#L108-L117
- geoarrow's R registration: 
https://github.com/paleolimbot/geoarrow/blob/master/R/pkg-arrow.R#L208-L223

I can also force an interaction if I build GDAL against the same Arrow that the 
arrow R package is linked against and attempt to load a Feather file saved with 
an extension type using the sf package. I will attempt to recreate that 
interaction as well in both R and Python.

I don't know enough about linking to know to what extent this is linked to my 
own development setup/build of the R package, although I think there are at 
least some environments where a shared library is picked up first by the R 
config script (fedora36, for example). It does look like my own R package build 
is dynamically linking to libarrow.dylib.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16613) [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)

2022-05-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16613:
---
Component/s: C++

> [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector 
> appears to be O(n^2)
> -
>
> Key: ARROW-16613
> URL: https://issues.apache.org/jira/browse/ARROW-16613
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet, Python
>Affects Versions: 8.0.0
>Reporter: Kyle Barron
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Hello!
>  
> I've noticed that when writing a `_metadata` file with 
> `pyarrow.parquet.write_metadata`, it is very slow with a large 
> `metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears 
> that the concatenation inside `metadata.append_row_groups` is very slow. The 
> writer first [iterates over every item of the 
> list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302]
>  and then [concatenates them on each 
> iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].
>  
> Would it be possible to make a vectorized implementation, where 
> `append_row_groups` accepts a list of `FileMetaData` objects and 
> concatenation happens only once?
>  
> Repro (in IPython to use `%time`)
> {code:python}
> from io import BytesIO
> import pyarrow as pa
> import pyarrow.parquet as pq
> def create_example_file_meta_data():
>     data = {
>         "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
>         "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
>         "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
>         "bool": pa.array([True, True, False, False], type=pa.bool_()),
>     }
>     table = pa.table(data)
>     metadata_collector = []
>     pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
>     return table.schema, metadata_collector[0]
> schema, meta = create_example_file_meta_data()
> metadata_collector = [meta] * 500
> %time pq.write_metadata(schema, BytesIO(), 
> metadata_collector=metadata_collector)
> # CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
> # Wall time: 234 ms
> metadata_collector = [meta] * 1000
> %time pq.write_metadata(schema, BytesIO(), 
> metadata_collector=metadata_collector)
> # CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
> # Wall time: 970 ms
> metadata_collector = [meta] * 2000
> %time pq.write_metadata(schema, BytesIO(), 
> metadata_collector=metadata_collector)
> # CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
> # Wall time: 4.3 s
> metadata_collector = [meta] * 4000
> %time pq.write_metadata(schema, BytesIO(), 
> metadata_collector=metadata_collector)
> # CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
> # Wall time: 17.3 s
> {code}
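The quadratic growth the timings show can be reproduced with plain lists (a sketch of the access pattern, not of the pyarrow internals): concatenating into a fresh list on every iteration re-copies everything accumulated so far, whereas a single-pass extend copies each element once.

```python
def append_each(groups):
    # O(n^2): `acc + g` builds a brand-new list each time, re-copying
    # all previously accumulated items on every iteration.
    acc = []
    for g in groups:
        acc = acc + g
    return acc

def append_once(groups):
    # O(n): each element is copied exactly once.
    acc = []
    for g in groups:
        acc.extend(g)
    return acc

# Both produce the same result; only the cost differs as `groups` grows.
groups = [[i] * 4 for i in range(1000)]
flat = append_once(groups)
```

Doubling the number of groups roughly doubles the work for the second version but roughly quadruples it for the first, matching the 500/1000/2000/4000 timings in the repro.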



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16613) [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)

2022-05-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-16613:
--

Assignee: Antoine Pitrou  (was: Kshiteej K)

> [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector 
> appears to be O(n^2)
> -
>
> Key: ARROW-16613
> URL: https://issues.apache.org/jira/browse/ARROW-16613
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Parquet, Python
>Affects Versions: 8.0.0
>Reporter: Kyle Barron
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Hello!
>  
> I've noticed that when writing a `_metadata` file with 
> `pyarrow.parquet.write_metadata`, it is very slow with a large 
> `metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears 
> that the concatenation inside `metadata.append_row_groups` is very slow. The 
> writer first [iterates over every item of the 
> list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302]
>  and then [concatenates them on each 
> iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].
>  
> Would it be possible to make a vectorized implementation, where 
> `append_row_groups` accepts a list of `FileMetaData` objects and 
> concatenation happens only once?
>  
> Repro (in IPython to use `%time`)
> {code:python}
> from io import BytesIO
> import pyarrow as pa
> import pyarrow.parquet as pq
> def create_example_file_meta_data():
>     data = {
>         "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
>         "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
>         "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
>         "bool": pa.array([True, True, False, False], type=pa.bool_()),
>     }
>     table = pa.table(data)
>     metadata_collector = []
>     pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
>     return table.schema, metadata_collector[0]
> schema, meta = create_example_file_meta_data()
> metadata_collector = [meta] * 500
> %time pq.write_metadata(schema, BytesIO(), 
> metadata_collector=metadata_collector)
> # CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
> # Wall time: 234 ms
> metadata_collector = [meta] * 1000
> %time pq.write_metadata(schema, BytesIO(), 
> metadata_collector=metadata_collector)
> # CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
> # Wall time: 970 ms
> metadata_collector = [meta] * 2000
> %time pq.write_metadata(schema, BytesIO(), 
> metadata_collector=metadata_collector)
> # CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
> # Wall time: 4.3 s
> metadata_collector = [meta] * 4000
> %time pq.write_metadata(schema, BytesIO(), 
> metadata_collector=metadata_collector)
> # CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
> # Wall time: 17.3 s
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-14632) [Python] Make write_dataset arguments keyword-only

2022-05-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14632:
---
Description: The 
[write_dataset|https://github.com/apache/arrow/blob/8a36f0f6cb385c88b637f479cc38b7e51d45c7e7/python/pyarrow/dataset.py#L804-L811]
 method has many arguments for customizing the behavior of the write.  Most of 
them could be made keyword only.  (was: The write_dataset method has many 
arguments for customizing the behavior of the write.  Most of them could be 
made keyword only.)

> [Python] Make write_dataset arguments keyword-only
> --
>
> Key: ARROW-14632
> URL: https://issues.apache.org/jira/browse/ARROW-14632
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: good-first-issue
>
> The 
> [write_dataset|https://github.com/apache/arrow/blob/8a36f0f6cb385c88b637f479cc38b7e51d45c7e7/python/pyarrow/dataset.py#L804-L811]
>  method has many arguments for customizing the behavior of the write.  Most 
> of them could be made keyword only.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16683) [C++] Bundled gflags misses dependency

2022-05-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-16683.

Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13256
[https://github.com/apache/arrow/pull/13256]

> [C++] Bundled gflags misses dependency
> --
>
> Key: ARROW-16683
> URL: https://issues.apache.org/jira/browse/ARROW-16683
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> https://lists.apache.org/thread/15470pntvv5p2x49qwpkdv0trn6d9p6j
> {noformat}
> cmake .. -G "Visual Studio 16 2019" -A x64 -DARROW_BUILD_TESTS=ON
> says it can't find gflags and will build them from source:
> -- Building gflags from source
> -- Added static library dependency gflags::gflags_static: 
> C:/Users/avertleyb/git/arrow/cpp/build/gflags_ep-prefix/src/gflags_ep/lib/gflags_static.lib
> The second step - 
> cmake --build . --config Release
> right away complains about this library:
> LINK : fatal error LNK1181: cannot open input file 
> 'C:\Users\avertleyb\git\arrow\cpp\build\gflags_ep-prefix\src\gflags_ep\lib\gflags_static.lib'
>  
> [C:\Users\avertleyb\git\arrow\cpp\build\src\arrow\arrow_bundled_dependencies.vcxproj]
> C:\Program Files (x86)\Microsoft Visual 
> Studio\2019\Professional\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(241,5):
>  error MSB8066: Custom build for 
> 'C:\Users\avertleyb\git\arrow\cpp\build\CMakeFiles\b033194e6d32d6a2595cc88c82
> 72e4b2\arrow_bundled_dependencies.lib.rule;C:\Users\avertleyb\git\arrow\cpp\build\CMakeFiles\672df30e18a621ddf9c15292835268fd\arrow_bundled_dependencies.rule'
>  exited with code 1181. [C:\Users\avertleyb\git\arrow\cpp\build\src\arrow\arro
> w_bundled_dependencies.vcxproj]
> However it proceeds with the build, and when the build ends, the library is 
> there:
> C:\Users\avertleyb\git\arrow\cpp\build>dir 
> C:\Users\avertleyb\git\arrow\cpp\build\gflags_ep-prefix\src\gflags_ep\lib\gflags_static.lib
>  Volume in drive C is Windows
>  Volume Serial Number is 3E24-1FC6
>  Directory of 
> C:\Users\avertleyb\git\arrow\cpp\build\gflags_ep-prefix\src\gflags_ep\lib
> 05/26/2022  02:40 PM   672,310 gflags_static.lib
>1 File(s)672,310 bytes
>0 Dir(s)  288,920,072,192 bytes free
> {noformat}





[jira] [Commented] (ARROW-16687) [C++] Can not find conda-installed Google Benchmark library

2022-05-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17543985#comment-17543985
 ] 

Antoine Pitrou commented on ARROW-16687:


cc [~kou]

> [C++] Can not find conda-installed Google Benchmark library
> ---
>
> Key: ARROW-16687
> URL: https://issues.apache.org/jira/browse/ARROW-16687
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> I have {{benchmark 1.6.1}} installed from conda-forge, yet when trying to 
> build Arrow C++ with benchmarks enabled I get the following error:
> {code}
> CMake Error at cmake_modules/ThirdpartyToolchain.cmake:253 (find_package):
>   By not providing "Findbenchmark.cmake" in CMAKE_MODULE_PATH this project
>   has asked CMake to find a package configuration file provided by
>   "benchmark", but CMake did not find one.
>   Could not find a package configuration file provided by "benchmark"
>   (requested version 1.6.0) with any of the following names:
> benchmarkConfig.cmake
> benchmark-config.cmake
>   Add the installation prefix of "benchmark" to CMAKE_PREFIX_PATH or set
>   "benchmark_DIR" to a directory containing one of the above files.  If
>   "benchmark" provides a separate development package or SDK, be sure it has
>   been installed.
> Call Stack (most recent call first):
>   cmake_modules/ThirdpartyToolchain.cmake:2141 (resolve_dependency)
>   CMakeLists.txt:567 (include)
> {code}





[jira] [Created] (ARROW-16687) [C++] Can not find conda-installed Google Benchmark library

2022-05-30 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-16687:
--

 Summary: [C++] Can not find conda-installed Google Benchmark 
library
 Key: ARROW-16687
 URL: https://issues.apache.org/jira/browse/ARROW-16687
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


I have {{benchmark 1.6.1}} installed from conda-forge, yet when trying to build 
Arrow C++ with benchmarks enabled I get the following error:
{code}
CMake Error at cmake_modules/ThirdpartyToolchain.cmake:253 (find_package):
  By not providing "Findbenchmark.cmake" in CMAKE_MODULE_PATH this project
  has asked CMake to find a package configuration file provided by
  "benchmark", but CMake did not find one.

  Could not find a package configuration file provided by "benchmark"
  (requested version 1.6.0) with any of the following names:

benchmarkConfig.cmake
benchmark-config.cmake

  Add the installation prefix of "benchmark" to CMAKE_PREFIX_PATH or set
  "benchmark_DIR" to a directory containing one of the above files.  If
  "benchmark" provides a separate development package or SDK, be sure it has
  been installed.
Call Stack (most recent call first):
  cmake_modules/ThirdpartyToolchain.cmake:2141 (resolve_dependency)
  CMakeLists.txt:567 (include)

{code}





[jira] [Commented] (ARROW-16609) [C++] xxhash not installed into dist/lib/include when building C++

2022-05-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17543976#comment-17543976
 ] 

Antoine Pitrou commented on ARROW-16609:


I think that pragmatically it's ok to start installing {{xxhash}} and the 
{{safe-math}} snippets.

Also, we can rename {{int_util_internal.h}} to {{int_util_overflow.h}} or 
something similar, to also install it.

> [C++] xxhash not installed into dist/lib/include when building C++
> --
>
> Key: ARROW-16609
> URL: https://issues.apache.org/jira/browse/ARROW-16609
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Alenka Frim
>Priority: Blocker
> Fix For: 9.0.0
>
>
> My C++ build setup doesn’t install {{dist/include/arrow/vendored/xxhash/}} 
> but only {{dist/include/arrow/vendored/xxhash.h}}. The last time the module 
> was installed was in November 2021.
> As {{arrow/python/arrow_to_pandas.cc}} includes {{arrow/util/hashing.h}} ->  
> {{arrow/vendored/xxhash.h}}  -> {{arrow/vendored/xxhash/xxhash.h}}, this 
> module is needed to build the Python C++ API separately from C++ 
> (ARROW-16340).





[jira] [Resolved] (ARROW-16684) [CI][Archery] Add retry mechanism to git fetch

2022-05-30 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-16684.
-
Resolution: Fixed

Issue resolved by pull request 13258
[https://github.com/apache/arrow/pull/13258]

> [CI][Archery] Add retry mechanism to git fetch
> --
>
> Key: ARROW-16684
> URL: https://issues.apache.org/jira/browse/ARROW-16684
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Archery, Continuous Integration, Developer Tools
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Archery sometimes fails to fetch branches for some repositories. Some of 
> the packaging jobs 
> ([https://github.com/ursacomputing/crossbow/runs/6643769198?check_suite_focus=true])
>  have been failing due to git errors when fetching:
> {code:java}
>    File 
> "/home/runner/work/crossbow/crossbow/arrow/dev/archery/archery/crossbow/cli.py",
>  line 238, in latest_prefix
>     queue.fetch()
>   File 
> "/home/runner/work/crossbow/crossbow/arrow/dev/archery/archery/crossbow/core.py",
>  line 271, in fetch
>     self.origin.fetch([refspec])
>   File 
> "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pygit2/remote.py",
>  line 146, in fetch
>     payload.check_error(err)
>   File 
> "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pygit2/callbacks.py",
>  line 93, in check_error
>     check_error(error_code)
>   File 
> "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pygit2/errors.py",
>  line 65, in check_error
>     raise GitError(message)
> _pygit2.GitError: SSL error: received early EOF
> Error: Process completed with exit code 1.{code}
> I have seen that retrying the job can make it pass.
> We should add a retry mechanism to Archery so that it retries on GitError 
> when fetching branches.
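The proposed retry mechanism can be sketched as a small wrapper (plain Python; in Archery the exception to catch would be {{pygit2.GitError}}, modeled here by a configurable tuple):

```python
import time

def call_with_retry(func, attempts=3, delay=1.0, retry_on=(Exception,)):
    """Call `func`, retrying on the given exception types with a simple
    linear backoff; re-raise after the final failed attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)

# Usage sketch: the `queue.fetch()` call from the traceback above would
# become call_with_retry(queue.fetch, retry_on=(GitError,)).
```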





[jira] [Commented] (ARROW-16272) [C++][Python] Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`

2022-05-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17543951#comment-17543951
 ] 

Antoine Pitrou commented on ARROW-16272:


The use case is fixed with https://github.com/apache/arrow/pull/13264 :

{code}
Running...
Time to create fs:  2.0029425621032715
Time to create fhandler:  0.4456977844238281
read time: 0.5826966762542725
Summons Number Plate ID Registration State Plate Type  Issue Date  
Violation Code  ... Community Board Community Council  Census Tract  BIN  BBL  
NTA
0   1363745270  GGY6450 99PAS  07/09/2015   
   46  ... NaNNaN  NaN  NaN  NaN  NaN
1   1363745293   KXD355 SCPAS  07/09/2015   
   21  ... NaNNaN  NaN  NaN  NaN  NaN
2   1363745438  JCK7576 PAPAS  07/09/2015   
   21  ... NaNNaN  NaN  NaN  NaN  NaN
3   1363745475  GYK7658 NYOMS  07/09/2015   
   21  ... NaNNaN  NaN  NaN  NaN  NaN
4   1363745487  GMT8141 NYPAS  07/09/2015   
   21  ... NaNNaN  NaN  NaN  NaN  NaN
.. ...  ......... ...   
  ...  ... ......  ...  ...  ...  ...
95  1363748464  GFV8489 NYPAS  07/09/2015   
   21  ... NaNNaN  NaN  NaN  NaN  NaN
96  1363748476   X15EGU NJPAS  07/09/2015   
   20  ... NaNNaN  NaN  NaN  NaN  NaN
97  1363748490  GDM1774 NYPAS  07/09/2015   
   38  ... NaNNaN  NaN  NaN  NaN  NaN
98  1363748531   G45DSY NJPAS  07/09/2015   
   37  ... NaNNaN  NaN  NaN  NaN  NaN
99  1363748579   RR76Y0 PAPAS  07/09/2015   
   20  ... NaNNaN  NaN  NaN  NaN  NaN

[100 rows x 51 columns]
total time: 3.0595762729644775
{code}

> [C++][Python] Poor read performance of S3FileSystem.open_input_file when used 
> with `pd.read_csv`
> 
>
> Key: ARROW-16272
> URL: https://issues.apache.org/jira/browse/ARROW-16272
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 4.0.1, 5.0.0, 7.0.0
> Environment: MacOS 12.1
> MacBook Pro
> Intel x86
>Reporter: Sahil Gupta
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: S3FileSystem, csv, pandas, pull-request-available, s3
> Fix For: 9.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> `pyarrow.fs.S3FileSystem.open_input_file` and 
> `pyarrow.fs.S3FileSystem.open_input_stream` perform very poorly when used 
> with Pandas' `read_csv`.
> {code:python}
> import pandas as pd
> import time
> from pyarrow.fs import S3FileSystem
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     fs = S3FileSystem(
>         anonymous=True,
>         region="us-east-2",
>         endpoint_override=None,
>         proxy_options=None,
>     )
>     print("Time to create fs: ", time.time() - t0)
>     t0 = time.time()
>     # fhandler = fs.open_input_stream(
>     #     
> "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     # )
>     fhandler = fs.open_input_file(
>         
> "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     )
>     print("Time to create fhandler: ", time.time() - t0)
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         fhandler,
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> Time to create fs:  0.0003612041473388672
> Time to create fhandler:  0.22461509704589844
> read time: 105.76488208770752
> total time: 105.99135684967041
> {code}
> This is with `pandas==1.4.2`.
> Getting similar performance with `fs.open_input_stream` as well (commented 
> out in the code).
> {code}
> Running...
> Time to create fs:  0.0002570152282714844
> Time to create fhandler:  0.18540692329406738
> read time: 186.8419930934906
> total time: 187.03169012069702
> {code}
> When running it with just pandas (which uses `s3fs` under the hood), it's 
> much faster:
> {code:python}
> import pandas as pd

[jira] [Resolved] (ARROW-16617) [C++] WinErrorMessage() should not use Windows ANSI APIs

2022-05-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-16617.

Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13203
[https://github.com/apache/arrow/pull/13203]

> [C++] WinErrorMessage() should not use Windows ANSI APIs
> 
>
> Key: ARROW-16617
> URL: https://issues.apache.org/jira/browse/ARROW-16617
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> The {{WinErrorMessage}} utility function calls {{FormatMessageA}} in order to 
> get the Windows error message. This unfortunately returns the message encoded 
> using the current "codepage", which can give unreadable results if there are 
> non-ASCII characters in it.
> Instead, we should probably use {{FormatMessageW}} and then convert to UTF-8. 
> At least  {{PyArrow}} expects the error message in a {{Status}} to be 
> utf8-encoded.





[jira] [Commented] (ARROW-16682) [Python] CSV reader: allow parsing without encoding errors

2022-05-30 Thread Thomas Buhrmann (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17543941#comment-17543941
 ] 

Thomas Buhrmann commented on ARROW-16682:
-

Ah, that makes sense as an explanation for the docs, thanks.

But the following still fails, and having a decoding error option would be 
useful in general, no?
{code:java}
csv = """
col_1,col2
0,😀
1,b
""".encode("utf-8")

pa.csv.read_csv(io.BytesIO(csv), pa.csv.ReadOptions(encoding="ascii")){code}
{noformat}
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 14: 
>>> ordinal not in range(128)
{noformat}

> [Python] CSV reader: allow parsing without encoding errors
> --
>
> Key: ARROW-16682
> URL: https://issues.apache.org/jira/browse/ARROW-16682
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Thomas Buhrmann
>Priority: Major
>
> When trying to read arbitrary CSV files, it is not possible to infer/guess 
> the correct encoding 100% of the time. The Arrow CSV reader will currently 
> fail if any byte cannot be decoded given the specified encoding (see example 
> below).
> With pandas.read_csv(), I can often get a result that is 99.9% correct by 
> passing it a text stream decoded in Python with 
> [errors="replace"|https://docs.python.org/3/library/codecs.html#error-handlers]
>  (or "ignore" etc.).
> Pyarrow's csv.read_csv() on the other hand neither accepts an already decoded 
> text stream (TypeError: binary file expected, got text file), nor a parameter 
> to configure what to do with decoding errors. As a result the parser simply 
> fails.
> The simplest solution would probably be to expose Python's error handling in 
> pyarrow.csv.ReadOptions (e.g. encoding_errors: "strict" | "ignore" | 
> "replace" ...).
> It would also be useful to document the behaviour of the CSV reader. E.g. 
> that it only accepts binary streams, and how encoding errors are handled. In 
> particular it is unclear what "Columns that cannot decode using this encoding 
> can still be read as Binary" means, since the parser will currently fail if 
> any bytes cannot be decoded.
> Toy example:
>  
> {code:java}
> txt = """
> col_😀_1, col2
> 0,a
> 1,b
> """
> buffer = io.BytesIO(txt.encode("utf-8"))
> pa.csv.read_csv(buffer, pa.csv.ReadOptions(encoding="ascii")){code}
> {noformat}
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 5: 
> ordinal not in range(128){noformat}
> whereas "with pandas":
> {code:java}
> buffer = io.BytesIO(txt.encode("utf-8"))
> text = io.TextIOWrapper(buffer, encoding="ascii", errors="replace")
> pd.read_csv(text){code}
> {noformat}
>col__1  col2
> 0   0 a
> 1   1 b
> {noformat}
>  
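The error handlers the reporter asks to expose are standard Python codec machinery; a hypothetical {{encoding_errors}} option would simply forward them to the decoder. Their effect on the failing bytes from the toy example:

```python
raw = "col_\N{GRINNING FACE}_1,col2\n0,a\n1,b\n".encode("utf-8")

# errors="strict" (the current behaviour) raises UnicodeDecodeError;
# errors="replace" substitutes U+FFFD for each undecodable byte, while
# errors="ignore" drops them entirely.
replaced = raw.decode("ascii", errors="replace")
ignored = raw.decode("ascii", errors="ignore")

print(replaced.splitlines()[0])  # col_????_1,col2 (four U+FFFD marks)
print(ignored.splitlines()[0])   # col__1,col2
```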





[jira] [Commented] (ARROW-16682) [Python] CSV reader: allow parsing without encoding errors

2022-05-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17543921#comment-17543921
 ] 

Antoine Pitrou commented on ARROW-16682:


Here is an example with invalid utf8 in the data:
{code:python}
>>> csv_bytes = b"""col1,col2\na,b\nc\xff,d"""
>>> pa.csv.read_csv(pa.BufferReader(csv_bytes))
pyarrow.Table
col1: binary
col2: string

col1: [[61,63FF]]
col2: [["b","d"]]
{code}

> [Python] CSV reader: allow parsing without encoding errors
> --
>
> Key: ARROW-16682
> URL: https://issues.apache.org/jira/browse/ARROW-16682
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Thomas Buhrmann
>Priority: Major





[jira] [Commented] (ARROW-16682) [Python] CSV reader: allow parsing without encoding errors

2022-05-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17543917#comment-17543917
 ] 

Antoine Pitrou commented on ARROW-16682:


{quote}In particular it is unclear what "Columns that cannot decode using this 
encoding can still be read as Binary" means, since the parser will currently 
fail if any bytes cannot be decoded.
{quote}

That sentence is valid for CSV _data_. However, in your case the column names 
are invalid utf8.

> [Python] CSV reader: allow parsing without encoding errors
> --
>
> Key: ARROW-16682
> URL: https://issues.apache.org/jira/browse/ARROW-16682
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Thomas Buhrmann
>Priority: Major





[jira] [Updated] (ARROW-16686) [C++] Use unique_ptr with FunctionOptions

2022-05-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16686:
---
Component/s: C++

> [C++] Use unique_ptr with FunctionOptions
> -
>
> Key: ARROW-16686
> URL: https://issues.apache.org/jira/browse/ARROW-16686
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
>  
> Update the usage of `FunctionOptions*` to `unique_ptr<FunctionOptions>`:
> {code:java}
> /// \brief Configure a grouped aggregation
> struct ARROW_EXPORT Aggregate {
>   /// the name of the aggregation function
>   std::string function;
>   /// options for the aggregation function
>   const FunctionOptions* options;
> };
> {code}
>  
>  





[jira] [Commented] (ARROW-16686) [C++] Use unique_ptr with FunctionOptions

2022-05-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17543916#comment-17543916
 ] 

Antoine Pitrou commented on ARROW-16686:


Why is that? The caller can keep ownership of the pointer. 


> [C++] Use unique_ptr with FunctionOptions
> -
>
> Key: ARROW-16686
> URL: https://issues.apache.org/jira/browse/ARROW-16686
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>





[jira] [Resolved] (ARROW-16560) [Website][Release] Version JSON files not updated in release

2022-05-30 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-16560.
-
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13257
[https://github.com/apache/arrow/pull/13257]

> [Website][Release] Version JSON files not updated in release
> 
>
> Key: ARROW-16560
> URL: https://issues.apache.org/jira/browse/ARROW-16560
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Reporter: Nicola Crane
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> ARROW-15366 added a script to automatically increment the version switchers 
> for the docs, which was updated as part of the changes in ARROW-1.  
> However, the latest release did not increment the version numbers (and 
> ARROW-1 changes the script to update on snapshots instead of releases - 
> could be the reason for it not happening?)





[jira] [Assigned] (ARROW-16654) [Dev][Archery] Support cherry-picking for major releases

2022-05-30 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-16654:
---

Assignee: Krisztian Szucs

> [Dev][Archery] Support cherry-picking for major releases 
> -
>
> Key: ARROW-16654
> URL: https://issues.apache.org/jira/browse/ARROW-16654
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Archery, Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-16654) [Dev][Archery] Support cherry-picking for major releases

2022-05-30 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-16654.
-
Resolution: Fixed

Issue resolved by pull request 13230
[https://github.com/apache/arrow/pull/13230]

> [Dev][Archery] Support cherry-picking for major releases 
> -
>
> Key: ARROW-16654
> URL: https://issues.apache.org/jira/browse/ARROW-16654
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Archery, Developer Tools
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-16686) [C++] Use unique_ptr with FunctionOptions

2022-05-30 Thread Vibhatha Lakmal Abeykoon (Jira)
Vibhatha Lakmal Abeykoon created ARROW-16686:


 Summary: [C++] Use unique_ptr with FunctionOptions
 Key: ARROW-16686
 URL: https://issues.apache.org/jira/browse/ARROW-16686
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Vibhatha Lakmal Abeykoon
Assignee: Vibhatha Lakmal Abeykoon


 

Update the usage of `FunctionOptions*` to `unique_ptr<FunctionOptions>`:
{code:java}
/// \brief Configure a grouped aggregation
struct ARROW_EXPORT Aggregate {
  /// the name of the aggregation function
  std::string function;
  /// options for the aggregation function
  const FunctionOptions* options;
};
{code}
 

 





[jira] [Updated] (ARROW-16685) [Python] Failing docstring example in Table.join

2022-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16685:
---
Labels: pull-request-available  (was: )

> [Python] Failing docstring example in Table.join
> 
>
> Key: ARROW-16685
> URL: https://issues.apache.org/jira/browse/ARROW-16685
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See: 
> https://github.com/apache/arrow/runs/6615749316?check_suite_focus=true#step:6:5590





[jira] [Updated] (ARROW-16685) [Python] Failing docstring example in Table.join

2022-05-30 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim updated ARROW-16685:

Summary: [Python] Failing docstring example in Table.join  (was: [Python] 
Failing doctest in Table.join)

> [Python] Failing docstring example in Table.join
> 
>
> Key: ARROW-16685
> URL: https://issues.apache.org/jira/browse/ARROW-16685
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
> Fix For: 9.0.0
>
>
> See: 
> https://github.com/apache/arrow/runs/6615749316?check_suite_focus=true#step:6:5590





[jira] [Created] (ARROW-16685) [Python] Failing doctest in Table.join

2022-05-30 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-16685:
---

 Summary: [Python] Failing doctest in Table.join
 Key: ARROW-16685
 URL: https://issues.apache.org/jira/browse/ARROW-16685
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Alenka Frim
Assignee: Alenka Frim
 Fix For: 9.0.0


See: 
https://github.com/apache/arrow/runs/6615749316?check_suite_focus=true#step:6:5590





[jira] [Assigned] (ARROW-13844) [Doc] Document the release process

2022-05-30 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-13844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido reassigned ARROW-13844:
-

Assignee: Raúl Cumplido

> [Doc] Document the release process
> --
>
> Key: ARROW-13844
> URL: https://issues.apache.org/jira/browse/ARROW-13844
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Nicola Crane
>Assignee: Raúl Cumplido
>Priority: Major
>
> (Ensure not to duplicate the release management guide.)
> Maybe a more high-level overview of the steps for someone with no prior 
> knowledge, plus any diagrams etc. that show how everything fits together. 
> More on the why, not the what.
> It may be helpful for someone to shadow Krisztian.





[jira] [Updated] (ARROW-16641) [R] How to filter array columns?

2022-05-30 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-16641:
--
Fix Version/s: 9.0.0
   (was: 8.0.0)

> [R] How to filter array columns?
> 
>
> Key: ARROW-16641
> URL: https://issues.apache.org/jira/browse/ARROW-16641
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: R
>Reporter: Vladimir
>Priority: Minor
> Fix For: 9.0.0
>
>
> In the Parquet data we have, there is a column with an array data type 
> ({{list<...>}}), which flags records that have different issues. For each 
> record, multiple values could be stored in the column, for example 
> {{[A, B, C]}}.
> I'm trying to perform a data filtering step and exclude some flagged records.
> Filtering is trivial for the regular columns that contain just a single 
> value. E.g.,
> {code:java}
> flags_to_exclude <- c("A", "B")
> datt %>% filter(! col %in% flags_to_exclude)
> {code}
> Given the array column, is it possible to exclude records with at least one 
> of the flags from `flags_to_exclude` using the arrow R package?
> I really appreciate any advice you can provide!
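Stated in plain Python terms, the requested filter is "drop rows whose list column shares any element with {{flags_to_exclude}}". A stand-in sketch of that semantics (made-up data, not the arrow R API):

```python
flags_to_exclude = {"A", "B"}

rows = [
    {"id": 1, "flags": ["A", "C"]},  # excluded: contains "A"
    {"id": 2, "flags": ["C"]},       # kept
    {"id": 3, "flags": []},          # kept
]

# Keep rows whose list column has no element in the exclusion set.
kept = [r for r in rows if not flags_to_exclude.intersection(r["flags"])]
print([r["id"] for r in kept])  # [2, 3]
```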





[jira] [Updated] (ARROW-16684) [CI][Archery] Add retry mechanism to git fetch

2022-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16684:
---
Labels: pull-request-available  (was: )

> [CI][Archery] Add retry mechanism to git fetch
> --
>
> Key: ARROW-16684
> URL: https://issues.apache.org/jira/browse/ARROW-16684
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Archery, Continuous Integration, Developer Tools
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Archery sometimes fails to fetch branches for some repositories. Some of the 
> reported packaging jobs 
> ([https://github.com/ursacomputing/crossbow/runs/6643769198?check_suite_focus=true])
>  have been failing due to git errors when fetching:
> {code:java}
>    File 
> "/home/runner/work/crossbow/crossbow/arrow/dev/archery/archery/crossbow/cli.py",
>  line 238, in latest_prefix
>     queue.fetch()
>   File 
> "/home/runner/work/crossbow/crossbow/arrow/dev/archery/archery/crossbow/core.py",
>  line 271, in fetch
>     self.origin.fetch([refspec])
>   File 
> "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pygit2/remote.py",
>  line 146, in fetch
>     payload.check_error(err)
>   File 
> "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pygit2/callbacks.py",
>  line 93, in check_error
>     check_error(error_code)
>   File 
> "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pygit2/errors.py",
>  line 65, in check_error
>     raise GitError(message)
> _pygit2.GitError: SSL error: received early EOF
> Error: Process completed with exit code 1.{code}
> I have seen that retrying the job can make it pass.
> We should add a retry mechanism to archery to allow retry on GitErrors when 
> fetching branches.
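The retry described above can be sketched as a small wrapper around the fetch call. The helper name and parameters below are hypothetical; in archery the retriable exception type would be `pygit2.GitError`:

```python
import time

def fetch_with_retry(fetch, retries=3, delay=0.1, retriable=(Exception,)):
    """Call fetch() and retry up to `retries` total attempts on retriable
    errors, sleeping `delay` seconds between attempts; re-raise on the last."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except retriable:
            if attempt == retries:
                raise
            time.sleep(delay)

# Example: a fetch that fails twice before succeeding.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("SSL error: received early EOF")
    return "ok"

print(fetch_with_retry(flaky_fetch))  # -> ok (after two retried failures)
```

Backoff between attempts (fixed here, could be exponential) avoids hammering the remote when the failure is transient network trouble.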





[jira] [Created] (ARROW-16684) [CI][Archery] Add retry mechanism to git fetch

2022-05-30 Thread Jira
Raúl Cumplido created ARROW-16684:
-

 Summary: [CI][Archery] Add retry mechanism to git fetch
 Key: ARROW-16684
 URL: https://issues.apache.org/jira/browse/ARROW-16684
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Archery, Continuous Integration, Developer Tools
Reporter: Raúl Cumplido
Assignee: Raúl Cumplido
 Fix For: 9.0.0


Archery sometimes fails to fetch branches for some repositories. Some of the 
reported packaging jobs 
([https://github.com/ursacomputing/crossbow/runs/6643769198?check_suite_focus=true])
 have been failing due to git errors when fetching:
{code:java}
   File 
"/home/runner/work/crossbow/crossbow/arrow/dev/archery/archery/crossbow/cli.py",
 line 238, in latest_prefix
    queue.fetch()
  File 
"/home/runner/work/crossbow/crossbow/arrow/dev/archery/archery/crossbow/core.py",
 line 271, in fetch
    self.origin.fetch([refspec])
  File 
"/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pygit2/remote.py",
 line 146, in fetch
    payload.check_error(err)
  File 
"/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pygit2/callbacks.py",
 line 93, in check_error
    check_error(error_code)
  File 
"/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pygit2/errors.py",
 line 65, in check_error
    raise GitError(message)
_pygit2.GitError: SSL error: received early EOF
Error: Process completed with exit code 1.{code}
I have seen that retrying the job can make it pass.

We should add a retry mechanism to archery to allow retry on GitErrors when 
fetching branches.





[jira] [Assigned] (ARROW-16678) [R] Cannot install fresh Arrow 8.0.0 on Ubuntu 22.04 with "NOT_CRAN" = TRUE

2022-05-30 Thread Jacob Wujciak-Jens (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Wujciak-Jens reassigned ARROW-16678:
--

Assignee: Jacob Wujciak-Jens

> [R] Cannot install fresh Arrow 8.0.0 on Ubuntu 22.04 with "NOT_CRAN" = TRUE
> ---
>
> Key: ARROW-16678
> URL: https://issues.apache.org/jira/browse/ARROW-16678
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
> Environment: Ubuntu 22.04, 
> R 4.1.2
> On AWS
>Reporter: Kevin Crouse
>Assignee: Jacob Wujciak-Jens
>Priority: Minor
>
> Trying to install Arrow 8.0 in R in the following way compiles successfully 
> but fails to load:
> {code:java}
> Sys.setenv("NOT_CRAN" = TRUE)
> install.packages("arrow") {code}
> Output at failure:
> {code:java}
> installing to /usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs
> ** R
> ** inst
> ** byte-compile and prepare package for lazy loading
> ** help
> *** installing help indices
> ** building package indices
> ** installing vignettes
> ** testing if installed package can be loaded from temporary location
> Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath 
> = DLLpath, ...):
>  unable to load shared object 
> '/usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs/arrow.so':
>   /usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs/arrow.so: 
> undefined symbol: EVP_MD_size
> Error: loading failed {code}
> This is a freshly created Ubuntu 22.04 instance created on AWS.
> Looks like EVP_MD_size may be related to openssl. The libssl3, libssl-dev, 
> and libcrypt-dev apt packages are all installed.
> Alternate methods to install Arrow seem to succeed. In particular, this 
> worked:
> install.packages("arrow", repos = 
> "https://packagemanager.rstudio.com/all/__linux__/jammy/latest")
> So, I'm not sure how serious the issue is, but it's notable since the failing 
> method above is the recommended way to install the package.





[jira] [Updated] (ARROW-16678) [R] Cannot install fresh Arrow 8.0.0 on Ubuntu 22.04 with "NOT_CRAN" = TRUE

2022-05-30 Thread Jacob Wujciak-Jens (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Wujciak-Jens updated ARROW-16678:
---
Priority: Critical  (was: Minor)

> [R] Cannot install fresh Arrow 8.0.0 on Ubuntu 22.04 with "NOT_CRAN" = TRUE
> ---
>
> Key: ARROW-16678
> URL: https://issues.apache.org/jira/browse/ARROW-16678
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
> Environment: Ubuntu 22.04, 
> R 4.1.2
> On AWS
>Reporter: Kevin Crouse
>Assignee: Jacob Wujciak-Jens
>Priority: Critical
>





[jira] [Updated] (ARROW-16678) [R] Cannot install fresh Arrow 8.0.0 on Ubuntu 22.04 with "NOT_CRAN" = TRUE

2022-05-30 Thread Jacob Wujciak-Jens (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Wujciak-Jens updated ARROW-16678:
---
Fix Version/s: 9.0.0

> [R] Cannot install fresh Arrow 8.0.0 on Ubuntu 22.04 with "NOT_CRAN" = TRUE
> ---
>
> Key: ARROW-16678
> URL: https://issues.apache.org/jira/browse/ARROW-16678
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
> Environment: Ubuntu 22.04, 
> R 4.1.2
> On AWS
>Reporter: Kevin Crouse
>Assignee: Jacob Wujciak-Jens
>Priority: Critical
> Fix For: 9.0.0
>
>





[jira] [Commented] (ARROW-16678) [R] Cannot install fresh Arrow 8.0.0 on Ubuntu 22.04 with "NOT_CRAN" = TRUE

2022-05-30 Thread Jacob Wujciak-Jens (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17543827#comment-17543827
 ] 

Jacob Wujciak-Jens commented on ARROW-16678:


I can reproduce 
[this|https://github.com/assignUser/test-repo-b/runs/6651182415?check_suite_focus=true#step:4:502].
 I am guessing the issue is that Ubuntu 22.04 ships with OpenSSL 3, while the 
binary arrow uses when "NOT_CRAN" is set was compiled on Ubuntu 18.04 against 
OpenSSL 1. RSPM works because they compile on 22.04.

We will need to add a separate 22.04 binary.

cc [~npr]
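This hypothesis is consistent with the missing symbol: in OpenSSL 3.0 `EVP_MD_size` became a macro for `EVP_MD_get_size`, so libcrypto.so.3 no longer exports the old function name and a binary built against OpenSSL 1.x fails to load. One way to check which symbols the installed libcrypto exports is to probe it; a hypothetical Python sketch (helper name is mine):

```python
import ctypes
import ctypes.util

def has_symbol(libname, symbol):
    """Return True if the shared library exports `symbol`; False when the
    symbol is absent or the library itself cannot be found/loaded."""
    path = ctypes.util.find_library(libname)
    if path is None:
        return False
    try:
        lib = ctypes.CDLL(path)
        getattr(lib, symbol)  # raises AttributeError if not exported
        return True
    except (OSError, AttributeError):
        return False

# True on an OpenSSL 1.x system, False on a stock Ubuntu 22.04 / OpenSSL 3 box.
print(has_symbol("crypto", "EVP_MD_size"))
print(has_symbol("crypto", "EVP_MD_get_size"))
```

The same check can be done from a shell with `nm -D` on the library, but the ctypes version needs no extra tooling.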

> [R] Cannot install fresh Arrow 8.0.0 on Ubuntu 22.04 with "NOT_CRAN" = TRUE
> ---
>
> Key: ARROW-16678
> URL: https://issues.apache.org/jira/browse/ARROW-16678
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
> Environment: Ubuntu 22.04, 
> R 4.1.2
> On AWS
>Reporter: Kevin Crouse
>Priority: Minor
>





[jira] [Updated] (ARROW-16560) [Website][Release] Version JSON files not updated in release

2022-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16560:
---
Labels: pull-request-available  (was: )

> [Website][Release] Version JSON files not updated in release
> 
>
> Key: ARROW-16560
> URL: https://issues.apache.org/jira/browse/ARROW-16560
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Reporter: Nicola Crane
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-15366 added a script to automatically increment the version switchers 
> for the docs, which was updated as part of the changes in ARROW-1.  
> However, the latest release did not increment the version numbers (and 
> ARROW-1 changes the script to update on snapshots instead of releases - 
> could be the reason for it not happening?)





[jira] [Assigned] (ARROW-16560) [Website][Release] Version JSON files not updated in release

2022-05-30 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-16560:


Assignee: Kouhei Sutou

> [Website][Release] Version JSON files not updated in release
> 
>
> Key: ARROW-16560
> URL: https://issues.apache.org/jira/browse/ARROW-16560
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Reporter: Nicola Crane
>Assignee: Kouhei Sutou
>Priority: Major
>





[jira] [Resolved] (ARROW-16206) [Ruby] Add support for DictionaryArray#values, #raw_records of {Month,DayTime,MonthDayNano} Interval Type

2022-05-30 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-16206.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13255
[https://github.com/apache/arrow/pull/13255]

> [Ruby] Add support for DictionaryArray#values, #raw_records of 
> {Month,DayTime,MonthDayNano} Interval Type
> -
>
> Key: ARROW-16206
> URL: https://issues.apache.org/jira/browse/ARROW-16206
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Ruby
>Reporter: Keisuke Okada
>Assignee: Keisuke Okada
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-16683) [C++] Bundled gflags misses dependency

2022-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16683:
---
Labels: pull-request-available  (was: )

> [C++] Bundled gflags misses dependency
> --
>
> Key: ARROW-16683
> URL: https://issues.apache.org/jira/browse/ARROW-16683
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://lists.apache.org/thread/15470pntvv5p2x49qwpkdv0trn6d9p6j
> {noformat}
> cmake .. -G "Visual Studio 16 2019" -A x64 -DARROW_BUILD_TESTS=ON
> says it can't find gflags and will build them from source:
> -- Building gflags from source
> -- Added static library dependency gflags::gflags_static: 
> C:/Users/avertleyb/git/arrow/cpp/build/gflags_ep-prefix/src/gflags_ep/lib/gflags_static.lib
> The second step - 
> cmake --build . --config Release
> right away complains about this library:
> LINK : fatal error LNK1181: cannot open input file 
> 'C:\Users\avertleyb\git\arrow\cpp\build\gflags_ep-prefix\src\gflags_ep\lib\gflags_static.lib'
>  
> [C:\Users\avertleyb\git\arrow\cpp\build\src\arrow\arrow_bundled_dependencies.vcxproj]
> C:\Program Files (x86)\Microsoft Visual 
> Studio\2019\Professional\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(241,5):
>  error MSB8066: Custom build for 
> 'C:\Users\avertleyb\git\arrow\cpp\build\CMakeFiles\b033194e6d32d6a2595cc88c82
> 72e4b2\arrow_bundled_dependencies.lib.rule;C:\Users\avertleyb\git\arrow\cpp\build\CMakeFiles\672df30e18a621ddf9c15292835268fd\arrow_bundled_dependencies.rule'
>  exited with code 1181. [C:\Users\avertleyb\git\arrow\cpp\build\src\arrow\arro
> w_bundled_dependencies.vcxproj]
> However it proceeds with the build, and when the build ends, the library is 
> there:
> C:\Users\avertleyb\git\arrow\cpp\build>dir 
> C:\Users\avertleyb\git\arrow\cpp\build\gflags_ep-prefix\src\gflags_ep\lib\gflags_static.lib
>  Volume in drive C is Windows
>  Volume Serial Number is 3E24-1FC6
>  Directory of 
> C:\Users\avertleyb\git\arrow\cpp\build\gflags_ep-prefix\src\gflags_ep\lib
> 05/26/2022  02:40 PM   672,310 gflags_static.lib
>1 File(s)672,310 bytes
>0 Dir(s)  288,920,072,192 bytes free
> {noformat}





[jira] [Created] (ARROW-16683) [C++] Bundled gflags misses dependency

2022-05-30 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-16683:


 Summary: [C++] Bundled gflags misses dependency
 Key: ARROW-16683
 URL: https://issues.apache.org/jira/browse/ARROW-16683
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


https://lists.apache.org/thread/15470pntvv5p2x49qwpkdv0trn6d9p6j

{noformat}
cmake .. -G "Visual Studio 16 2019" -A x64 -DARROW_BUILD_TESTS=ON

says it can't find gflags and will build them from source:

-- Building gflags from source
-- Added static library dependency gflags::gflags_static: 
C:/Users/avertleyb/git/arrow/cpp/build/gflags_ep-prefix/src/gflags_ep/lib/gflags_static.lib

The second step - 

cmake --build . --config Release

right away complains about this library:

LINK : fatal error LNK1181: cannot open input file 
'C:\Users\avertleyb\git\arrow\cpp\build\gflags_ep-prefix\src\gflags_ep\lib\gflags_static.lib'
 
[C:\Users\avertleyb\git\arrow\cpp\build\src\arrow\arrow_bundled_dependencies.vcxproj]
C:\Program Files (x86)\Microsoft Visual 
Studio\2019\Professional\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(241,5):
 error MSB8066: Custom build for 
'C:\Users\avertleyb\git\arrow\cpp\build\CMakeFiles\b033194e6d32d6a2595cc88c82
72e4b2\arrow_bundled_dependencies.lib.rule;C:\Users\avertleyb\git\arrow\cpp\build\CMakeFiles\672df30e18a621ddf9c15292835268fd\arrow_bundled_dependencies.rule'
 exited with code 1181. [C:\Users\avertleyb\git\arrow\cpp\build\src\arrow\arro
w_bundled_dependencies.vcxproj]

However it proceeds with the build, and when the build ends, the library is 
there:

C:\Users\avertleyb\git\arrow\cpp\build>dir 
C:\Users\avertleyb\git\arrow\cpp\build\gflags_ep-prefix\src\gflags_ep\lib\gflags_static.lib
 Volume in drive C is Windows
 Volume Serial Number is 3E24-1FC6

 Directory of 
C:\Users\avertleyb\git\arrow\cpp\build\gflags_ep-prefix\src\gflags_ep\lib

05/26/2022  02:40 PM   672,310 gflags_static.lib
   1 File(s)672,310 bytes
   0 Dir(s)  288,920,072,192 bytes free
{noformat}


