[jira] [Commented] (ARROW-17839) [Python] Cannot create RecordBatch with nested struct containing extension type

2022-09-28 Thread Matthias Vallentin (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610822#comment-17610822
 ] 

Matthias Vallentin commented on ARROW-17839:


Thanks for the pointer, [~jorisvandenbossche]. Glad to see that a fix is underway.

Would you mind pointing me to instructions for running the test you performed? 
I am using Poetry and couldn't get the branch to compile. In theory, I thought 
this should do the trick:
{code:java}
[tool.poetry.dependencies]
#pyarrow = "^9.0"
pyarrow = { git = "https://github.com/milesgranger/arrow.git", branch = 
"ARROW-15545_cast-of-extension-types", subdirectory = "python" }{code}
But this fails to compile due to missing dependencies. (I managed to work 
around OpenSSL by providing the right environment variable, but now I'm stuck 
with Flight not being found.) I was hoping there is some sort of dev guide that 
shows how to get going.

> [Python] Cannot create RecordBatch with nested struct containing extension 
> type
> ---
>
> Key: ARROW-17839
> URL: https://issues.apache.org/jira/browse/ARROW-17839
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
> Environment: macOS 12.5.1 on an Apple M1 Ultra.
>Reporter: Matthias Vallentin
>Priority: Blocker
> Attachments: example.py
>
>
> I'm running into the following issue:
> {code:java}
> pyarrow.lib.ArrowNotImplementedError: Unsupported cast to 
> extension<...> from fixed_size_binary[16]{code}
> Use case: I want to create a record batch that contains this type:
> {code:java}
> pa.struct([("address", AddressType()), ("length", pa.uint8())]){code}
> Here, {{AddressType}} is an extension type that models an IP address, stored 
> as {{pa.binary(16)}}.
> Please find attached a self-contained example that illustrates the issue.
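> For readers without the attachment, here is a minimal sketch of the failing 
> pattern. The {{AddressType}} class below is a hypothetical reconstruction 
> based on the description above, not the attached code:
> {code:python}
> import pyarrow as pa
> 
> class AddressType(pa.ExtensionType):
>     # Hypothetical extension type: an IP address stored as 16 bytes.
>     def __init__(self):
>         super().__init__(pa.binary(16), "example.address")
> 
>     def __arrow_ext_serialize__(self):
>         return b""
> 
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         return AddressType()
> 
> pa.register_extension_type(AddressType())
> 
> ty = pa.struct([("address", AddressType()), ("length", pa.uint8())])
> # Converting Python data to the nested struct casts the fixed_size_binary[16]
> # storage to the extension type; that cast is what raises the error above.
> batch = pa.record_batch(
>     [pa.array([{"address": b"\x00" * 16, "length": 1}], type=ty)], ["f"])
> {code}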
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17868) [C++][Python] Keep and deprecate ARROW_PYTHON CMake option for backward compatibility

2022-09-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17868:
---
Labels: pull-request-available  (was: )

> [C++][Python] Keep and deprecate ARROW_PYTHON CMake option for backward 
> compatibility
> -
>
> Key: ARROW-17868
> URL: https://issues.apache.org/jira/browse/ARROW-17868
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-6858 removed the {{ARROW_PYTHON}} CMake option because ARROW-16340 moved 
> {{cpp/src/arrow/python/}} to {{python/pyarrow/src/}}. But this broke backward 
> compatibility: users who used {{-DARROW_PYTHON=ON}} now need to add 
> {{-DARROW_CSV=ON}}, {{-DARROW_DATASET=ON}}, and so on manually.
> See also: https://github.com/apache/arrow/pull/14224#discussion_r981399130
> {quote}
> FWIW this broke my local development because of no longer including those 
> (although I should probably start using presets ..)
> Now, it's probably fine to remove this now that the Python C++ code has moved, but we do 
> assume that some C++ modules are built on the pyarrow side (eg we assume that 
> CSV is always built, while with the above change you need to ensure manually 
> that this is done in your cmake call).
> In any case we should update the documentation at 
> https://arrow.apache.org/docs/dev/developers/python.html#build-and-test to 
> indicate that there are a few components required to be able to build pyarrow.
> {quote}
> Eventually, we can remove the {{ARROW_PYTHON}} CMake option, but we should 
> provide a deprecation period before removing it.
> We should also mention in our documentation ( 
> https://arrow.apache.org/docs/dev/developers/python.html#build-and-test ) that 
> {{ARROW_PYTHON}} is deprecated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17886) [R] Convert schema to the corresponding ptype (zero-row data frame)?

2022-09-28 Thread Jira
Kirill Müller created ARROW-17886:
-

 Summary: [R] Convert schema to the corresponding ptype (zero-row 
data frame)?
 Key: ARROW-17886
 URL: https://issues.apache.org/jira/browse/ARROW-17886
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Kirill Müller


When fetching data, e.g. from a RecordBatchReader, I would like to know ahead 
of time what the data will look like after it's converted to a data frame. I 
have found a way using utils::head(0), but I'm not sure it's efficient in all 
scenarios.

My use case is the Arrow extension to DBI, in particular the default 
implementation for drivers that don't speak Arrow yet. I'd like to know which 
types the columns should have in the database. I can already infer this from 
the corresponding R types, but those existing drivers don't know about Arrow 
types.

Should we support as.data.frame() for schema objects? The semantics would be to 
return a zero-row data frame with correct column names and types.
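For comparison, pyarrow exposes this pattern directly: a schema can produce a 
zero-row table, which then converts to a zero-row data frame. A sketch, 
assuming a pyarrow version where Schema.empty_table() is available:

{code:python}
import pyarrow as pa

schema = pa.schema([("a", pa.int32()), ("b", pa.float64()), ("c", pa.string())])

# Zero-row table with the schema's column names and types (the "ptype").
empty = schema.empty_table()
print(empty.num_rows)            # 0
print(empty.to_pandas().dtypes)  # column types without fetching any data
{code}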


library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for 
more information.
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#> timestamp

data <- data.frame(
  a = 1:3,
  b = 2.5,
  c = "three",
  stringsAsFactors = FALSE
)
data$d <- blob::blob(as.raw(1:10))

tbl <- arrow::as_arrow_table(data)
rbr <- arrow::as_record_batch_reader(tbl)

tibble::as_tibble(head(rbr, 0))
#> # A tibble: 0 × 4
#> # … with 4 variables: a <int>, b <dbl>, c <chr>, d <blob>
rbr$read_table()
#> Table
#> 3 rows x 4 columns
#> $a <int32>
#> $b <double>
#> $c <string>
#> $d <...>
#> 
#> See $metadata for additional Schema metadata



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17885) [R] Return BLOB data as list of raw instead of a list of integers

2022-09-28 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Müller updated ARROW-17885:
--
Description: 
BLOBs should be mapped to lists of raw in R, not lists of integer. Tested with 
ec714db3995549309b987fc8112db98bb93102d0.

library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for 
more information.
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

data <- data.frame(
  a = 1:3,
  b = 2.5,
  c = "three",
  stringsAsFactors = FALSE
)
data$d <- blob::blob(as.raw(1:10))

tbl <- arrow::as_arrow_table(data)
rbr <- arrow::as_record_batch_reader(tbl)

waldo::compare(as.data.frame(rbr$read_next_batch()), data)
#> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...)
#> 
#> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...)
#> 
#> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...)


  was:
BLOBs should be mapped to lists of raw in R, not lists of integer. Tested with 
ec714db3995549309b987fc8112db98bb93102d0.

{{
library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for 
more information.
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

data <- data.frame(
  a = 1:3,
  b = 2.5,
  c = "three",
  stringsAsFactors = FALSE
)
data$d <- blob::blob(as.raw(1:10))

tbl <- arrow::as_arrow_table(data)
rbr <- arrow::as_record_batch_reader(tbl)

waldo::compare(as.data.frame(rbr$read_next_batch()), data)
#> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...)
#> 
#> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...)
#> 
#> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...)
}}



> [R] Return BLOB data as list of raw instead of a list of integers
> -
>
> Key: ARROW-17885
> URL: https://issues.apache.org/jira/browse/ARROW-17885
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 10.0.0, 9.0.1
> Environment: macOS, R 4.1.3
>Reporter: Kirill Müller
>Priority: Minor
>
> BLOBs should be mapped to lists of raw in R, not lists of integer. Tested 
> with ec714db3995549309b987fc8112db98bb93102d0.
> library(arrow)
> #> Some features are not enabled in this build of Arrow. Run `arrow_info()` 
> for more information.
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> data <- data.frame(
>   a = 1:3,
>   b = 2.5,
>   c = "three",
>   stringsAsFactors = FALSE
> )
> data$d <- blob::blob(as.raw(1:10))
> tbl <- arrow::as_arrow_table(data)
> rbr <- arrow::as_record_batch_reader(tbl)
> waldo::compare(as.data.frame(rbr$read_next_batch()), data)
> #> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...)
> #> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...)
> #> 
> #> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...)
> #> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...)
> #> 
> #> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...)
> #> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17885) [R] Return BLOB data as list of raw instead of a list of integers

2022-09-28 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Müller updated ARROW-17885:
--
Environment: macOS arm64, R 4.1.3  (was: macOS, R 4.1.3)

> [R] Return BLOB data as list of raw instead of a list of integers
> -
>
> Key: ARROW-17885
> URL: https://issues.apache.org/jira/browse/ARROW-17885
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 10.0.0, 9.0.1
> Environment: macOS arm64, R 4.1.3
>Reporter: Kirill Müller
>Priority: Minor
>
> BLOBs should be mapped to lists of raw in R, not lists of integer. Tested 
> with ec714db3995549309b987fc8112db98bb93102d0.
> library(arrow)
> #> Some features are not enabled in this build of Arrow. Run `arrow_info()` 
> for more information.
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> data <- data.frame(
>   a = 1:3,
>   b = 2.5,
>   c = "three",
>   stringsAsFactors = FALSE
> )
> data$d <- blob::blob(as.raw(1:10))
> tbl <- arrow::as_arrow_table(data)
> rbr <- arrow::as_record_batch_reader(tbl)
> waldo::compare(as.data.frame(rbr$read_next_batch()), data)
> #> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...)
> #> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...)
> #> 
> #> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...)
> #> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...)
> #> 
> #> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...)
> #> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17885) [R] Return BLOB data as list of raw instead of a list of integers

2022-09-28 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Müller updated ARROW-17885:
--
Description: 
BLOBs should be mapped to lists of raw in R, not lists of integer. Tested with 
ec714db3995549309b987fc8112db98bb93102d0.

{{
library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for 
more information.
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

data <- data.frame(
  a = 1:3,
  b = 2.5,
  c = "three",
  stringsAsFactors = FALSE
)
data$d <- blob::blob(as.raw(1:10))

tbl <- arrow::as_arrow_table(data)
rbr <- arrow::as_record_batch_reader(tbl)

waldo::compare(as.data.frame(rbr$read_next_batch()), data)
#> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...)
#> 
#> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...)
#> 
#> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...)
}}


  was:
BLOBs should be mapped to lists of raw in R, not lists of integer. Tested with 
ec714db3995549309b987fc8112db98bb93102d0.

``` r
library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for 
more information.
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

data <- data.frame(
  a = 1:3,
  b = 2.5,
  c = "three",
  stringsAsFactors = FALSE
)
data$d <- blob::blob(as.raw(1:10))

tbl <- arrow::as_arrow_table(data)
rbr <- arrow::as_record_batch_reader(tbl)

waldo::compare(as.data.frame(rbr$read_next_batch()), data)
#> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...)
#> 
#> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...)
#> 
#> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...)
```

Created on 2022-09-29 with [reprex 
v2.0.2](https://reprex.tidyverse.org)


> [R] Return BLOB data as list of raw instead of a list of integers
> -
>
> Key: ARROW-17885
> URL: https://issues.apache.org/jira/browse/ARROW-17885
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 10.0.0, 9.0.1
> Environment: macOS, R 4.1.3
>Reporter: Kirill Müller
>Priority: Minor
>
> BLOBs should be mapped to lists of raw in R, not lists of integer. Tested 
> with ec714db3995549309b987fc8112db98bb93102d0.
> {{
> library(arrow)
> #> Some features are not enabled in this build of Arrow. Run `arrow_info()` 
> for more information.
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> data <- data.frame(
>   a = 1:3,
>   b = 2.5,
>   c = "three",
>   stringsAsFactors = FALSE
> )
> data$d <- blob::blob(as.raw(1:10))
> tbl <- arrow::as_arrow_table(data)
> rbr <- arrow::as_record_batch_reader(tbl)
> waldo::compare(as.data.frame(rbr$read_next_batch()), data)
> #> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...)
> #> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...)
> #> 
> #> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...)
> #> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...)
> #> 
> #> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...)
> #> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...)
> }}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17885) [R] Return BLOB data as list of raw instead of a list of integers

2022-09-28 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Müller updated ARROW-17885:
--
Summary: [R] Return BLOB data as list of raw instead of a list of integers  
(was: Return BLOB data as list of raw instead of a list of integers)

> [R] Return BLOB data as list of raw instead of a list of integers
> -
>
> Key: ARROW-17885
> URL: https://issues.apache.org/jira/browse/ARROW-17885
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 10.0.0, 9.0.1
> Environment: macOS, R 4.1.3
>Reporter: Kirill Müller
>Priority: Minor
>
> BLOBs should be mapped to lists of raw in R, not lists of integer. Tested 
> with ec714db3995549309b987fc8112db98bb93102d0.
> ``` r
> library(arrow)
> #> Some features are not enabled in this build of Arrow. Run `arrow_info()` 
> for more information.
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> data <- data.frame(
>   a = 1:3,
>   b = 2.5,
>   c = "three",
>   stringsAsFactors = FALSE
> )
> data$d <- blob::blob(as.raw(1:10))
> tbl <- arrow::as_arrow_table(data)
> rbr <- arrow::as_record_batch_reader(tbl)
> waldo::compare(as.data.frame(rbr$read_next_batch()), data)
> #> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...)
> #> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...)
> #> 
> #> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...)
> #> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...)
> #> 
> #> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...)
> #> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...)
> ```
> Created on 2022-09-29 with [reprex 
> v2.0.2](https://reprex.tidyverse.org)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17885) Return BLOB data as list of raw instead of a list of integers

2022-09-28 Thread Jira
Kirill Müller created ARROW-17885:
-

 Summary: Return BLOB data as list of raw instead of a list of 
integers
 Key: ARROW-17885
 URL: https://issues.apache.org/jira/browse/ARROW-17885
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 10.0.0, 9.0.1
 Environment: macOS, R 4.1.3
Reporter: Kirill Müller


BLOBs should be mapped to lists of raw in R, not lists of integer. Tested with 
ec714db3995549309b987fc8112db98bb93102d0.

``` r
library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for 
more information.
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

data <- data.frame(
  a = 1:3,
  b = 2.5,
  c = "three",
  stringsAsFactors = FALSE
)
data$d <- blob::blob(as.raw(1:10))

tbl <- arrow::as_arrow_table(data)
rbr <- arrow::as_record_batch_reader(tbl)

waldo::compare(as.data.frame(rbr$read_next_batch()), data)
#> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...)
#> 
#> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...)
#> 
#> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...)
```

Created on 2022-09-29 with [reprex 
v2.0.2](https://reprex.tidyverse.org)
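For comparison, the Python bindings already map Arrow binary values to 
Python's native bytes type, which is the analogue of an R raw vector (a quick 
sketch):

{code:python}
import pyarrow as pa

arr = pa.array([b"\x01\x02\x03"], type=pa.binary())
values = arr.to_pylist()
print(type(values[0]))  # <class 'bytes'> -- the analogue of R's raw
{code}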



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17884) Add Intel®-IAA/QPL-based Parquet RLE Decode

2022-09-28 Thread zhaoyaqi (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyaqi reassigned ARROW-17884:


Assignee: zhaoyaqi

> Add Intel®-IAA/QPL-based Parquet RLE Decode
> ---
>
> Key: ARROW-17884
> URL: https://issues.apache.org/jira/browse/ARROW-17884
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: zhaoyaqi
>Assignee: zhaoyaqi
>Priority: Minor
>  Labels: performance, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Intel® In-Memory Analytics Accelerator (Intel® IAA) is a hardware accelerator 
> available in the upcoming generation of Intel® Xeon® Scalable processors 
> ("Sapphire Rapids"). Its goal is to speed up common operations in analytics 
> like data (de)compression and filtering. It supports decoding of the Parquet 
> RLE format. We add a new codec that utilizes the Intel® IAA offloading 
> technology to provide a high-performance RLE decode implementation. The codec 
> uses the [Intel® Query Processing Library (QPL)|https://github.com/intel/qpl], 
> which abstracts access to the hardware accelerator. The new solution generally 
> provides higher performance than the current solution and also consumes less 
> CPU.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17884) Add Intel®-IAA/QPL-based Parquet RLE Decode

2022-09-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17884:
---
Labels: performance pull-request-available  (was: performance)

> Add Intel®-IAA/QPL-based Parquet RLE Decode
> ---
>
> Key: ARROW-17884
> URL: https://issues.apache.org/jira/browse/ARROW-17884
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: zhaoyaqi
>Priority: Minor
>  Labels: performance, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Intel® In-Memory Analytics Accelerator (Intel® IAA) is a hardware accelerator 
> available in the upcoming generation of Intel® Xeon® Scalable processors 
> ("Sapphire Rapids"). Its goal is to speed up common operations in analytics 
> like data (de)compression and filtering. It supports decoding of the Parquet 
> RLE format. We add a new codec that utilizes the Intel® IAA offloading 
> technology to provide a high-performance RLE decode implementation. The codec 
> uses the [Intel® Query Processing Library (QPL)|https://github.com/intel/qpl], 
> which abstracts access to the hardware accelerator. The new solution generally 
> provides higher performance than the current solution and also consumes less 
> CPU.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17884) Add Intel®-IAA/QPL-based Parquet RLE Decode

2022-09-28 Thread zhaoyaqi (Jira)
zhaoyaqi created ARROW-17884:


 Summary: Add Intel®-IAA/QPL-based Parquet RLE Decode
 Key: ARROW-17884
 URL: https://issues.apache.org/jira/browse/ARROW-17884
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: zhaoyaqi


Intel® In-Memory Analytics Accelerator (Intel® IAA) is a hardware accelerator 
available in the upcoming generation of Intel® Xeon® Scalable processors 
("Sapphire Rapids"). Its goal is to speed up common operations in analytics 
like data (de)compression and filtering. It supports decoding of the Parquet 
RLE format. We add a new codec that utilizes the Intel® IAA offloading 
technology to provide a high-performance RLE decode implementation. The codec 
uses the [Intel® Query Processing Library (QPL)|https://github.com/intel/qpl], 
which abstracts access to the hardware accelerator. The new solution generally 
provides higher performance than the current solution and also consumes less CPU.
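For background on what the codec accelerates: RLE (run-length encoding) 
represents a run of repeated values as a (run length, value) pair, and 
decoding expands those runs back into plain values. A simplified illustration 
in Python (Parquet's actual encoding is a hybrid of RLE and bit-packing, and 
the QPL-backed decoder is implemented in C++; this sketch only shows the 
concept):

{code:python}
def rle_decode(runs):
    """Expand (run_length, value) pairs into a flat list of values."""
    out = []
    for run_length, value in runs:
        out.extend([value] * run_length)
    return out

# Three zeros followed by two sevens:
print(rle_decode([(3, 0), (2, 7)]))  # [0, 0, 0, 7, 7]
{code}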



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17850) [Java] Upgrade netty-codec-http dependencies

2022-09-28 Thread Hui Yu (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610754#comment-17610754
 ] 

Hui Yu commented on ARROW-17850:


Thank you all !

> [Java] Upgrade netty-codec-http dependencies
> 
>
> Key: ARROW-17850
> URL: https://issues.apache.org/jira/browse/ARROW-17850
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 9.0.0
>Reporter: Hui Yu
>Assignee: David Dali Susanibar Arce
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [CVE-2022-24823]([https://github.com/advisories/GHSA-269q-hmxg-m83q]) reports 
> a security vulnerability in *netty-codec-http*.
> The version of *netty-codec-http* in the master branch is currently 
> *4.1.72.Final*, which is unsafe.
> The ticket https://issues.apache.org/jira/browse/ARROW-16996 bumped 
> *netty-codec* to *4.1.78.Final*, but it didn't bump *netty-codec-http*.
> Can you upgrade the version of *netty-codec-http*? 
>  
> Here is my current output of mvn dependency:tree:
> ```bash
> [INFO] +- org.apache.arrow:flight-core:jar:9.0.0:compile
> [INFO] |  +- io.grpc:grpc-netty:jar:1.47.0:compile
> [INFO] |  |  +- io.netty:netty-codec-http2:jar:4.1.72.Final:compile
> [INFO] |  |  |  - io.netty:{*}netty-codec-http{*}:jar:4.1.72.Final:compile
> [INFO] |  |  +- io.netty:netty-handler-proxy:jar:4.1.72.Final:runtime
> [INFO] |  |  |  - io.netty:netty-codec-socks:jar:4.1.72.Final:runtime
> [INFO] |  |  +- 
> com.google.errorprone:error_prone_annotations:jar:2.10.0:compile
> [INFO] |  |  +- io.perfmark:perfmark-api:jar:0.25.0:runtime
> [INFO] |  |  - 
> io.netty:netty-transport-native-unix-common:jar:4.1.72.Final:compile
> [INFO] |  +- io.grpc:grpc-core:jar:1.47.0:compile
> [INFO] |  |  +- com.google.android:annotations:jar:4.1.1.4:runtime
> [INFO] |  |  - org.codehaus.mojo:animal-sniffer-annotations:jar:1.19:runtime
> [INFO] |  +- io.grpc:grpc-context:jar:1.47.0:compile
> [INFO] |  +- io.grpc:grpc-protobuf:jar:1.47.0:compile
> [INFO] |  |  +- 
> com.google.api.grpc:proto-google-common-protos:jar:2.0.1:compile
> [INFO] |  |  - io.grpc:grpc-protobuf-lite:jar:1.47.0:compile
> [INFO] |  +- io.netty:netty-tcnative-boringssl-static:jar:2.0.53.Final:compile
> [INFO] |  |  +- io.netty:netty-tcnative-classes:jar:2.0.53.Final:compile
> [INFO] |  |  +- 
> io.netty:netty-tcnative-boringssl-static:jar:linux-x86_64:2.0.53.Final:compile
> [INFO] |  |  +- 
> io.netty:netty-tcnative-boringssl-static:jar:linux-aarch_64:2.0.53.Final:compile
> [INFO] |  |  +- 
> io.netty:netty-tcnative-boringssl-static:jar:osx-x86_64:2.0.53.Final:compile
> [INFO] |  |  +- 
> io.netty:netty-tcnative-boringssl-static:jar:osx-aarch_64:2.0.53.Final:compile
> [INFO] |  |  - 
> io.netty:netty-tcnative-boringssl-static:jar:windows-x86_64:2.0.53.Final:compile
> [INFO] |  +- io.netty:netty-handler:jar:4.1.78.Final:compile
> [INFO] |  |  +- io.netty:netty-resolver:jar:4.1.78.Final:compile
> [INFO] |  |  - io.netty:netty-codec:jar:4.1.78.Final:compile
> [INFO] |  +- io.netty:netty-transport:jar:4.1.78.Final:compile
> [INFO] |  +- com.google.guava:guava:jar:30.1.1-jre:compile
> [INFO] |  |  +- com.google.guava:failureaccess:jar:1.0.1:compile
> [INFO] |  |  +- 
> com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava:compile
> [INFO] |  |  +- org.checkerframework:checker-qual:jar:3.8.0:compile
> [INFO] |  |  - com.google.j2objc:j2objc-annotations:jar:1.3:compile
> [INFO] |  +- io.grpc:grpc-stub:jar:1.47.0:compile
> [INFO] |  +- com.google.protobuf:protobuf-java:jar:3.21.2:compile
> [INFO] |  +- io.grpc:grpc-api:jar:1.47.0:compile
> [INFO] |  - javax.annotation:javax.annotation-api:jar:1.3.2:compile
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17550) [C++][CI] MinGW builds shouldn't compile grpcio

2022-09-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17550:
---
Labels: pull-request-available  (was: )

> [C++][CI] MinGW builds shouldn't compile grpcio
> ---
>
> Key: ARROW-17550
> URL: https://issues.apache.org/jira/browse/ARROW-17550
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> MinGW builds currently compile the GCS testbench and grpcio for MinGW.
> When the compiled MinGW wheel is not in cache, compiling takes a very long 
> time (\*). But Win32 and Win64 binary wheels are available on PyPI.
> This is pointless: the GCS testbench could simply run with the system Python 
> instead of the msys2 Python, and always use the binaries from PyPI.
> (\*) see for example https://github.com/pitrou/arrow/runs/8071607360 where 
> installing the GCS testbench took 18 minutes



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17850) [Java] Upgrade netty-codec-http dependencies

2022-09-28 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-17850.
--
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 14265
[https://github.com/apache/arrow/pull/14265]

> [Java] Upgrade netty-codec-http dependencies
> 
>
> Key: ARROW-17850
> URL: https://issues.apache.org/jira/browse/ARROW-17850
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 9.0.0
>Reporter: Hui Yu
>Assignee: David Dali Susanibar Arce
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [CVE-2022-24823]([https://github.com/advisories/GHSA-269q-hmxg-m83q]) reports 
> a security vulnerability in *netty-codec-http*.
> The version of *netty-codec-http* in the master branch is currently 
> *4.1.72.Final*, which is unsafe.
> The ticket https://issues.apache.org/jira/browse/ARROW-16996 bumped 
> *netty-codec* to *4.1.78.Final*, but it didn't bump *netty-codec-http*.
> Can you upgrade the version of *netty-codec-http*? 
>  
> Here is my current output of mvn dependency:tree:
> ```bash
> [INFO] +- org.apache.arrow:flight-core:jar:9.0.0:compile
> [INFO] |  +- io.grpc:grpc-netty:jar:1.47.0:compile
> [INFO] |  |  +- io.netty:netty-codec-http2:jar:4.1.72.Final:compile
> [INFO] |  |  |  - io.netty:{*}netty-codec-http{*}:jar:4.1.72.Final:compile
> [INFO] |  |  +- io.netty:netty-handler-proxy:jar:4.1.72.Final:runtime
> [INFO] |  |  |  - io.netty:netty-codec-socks:jar:4.1.72.Final:runtime
> [INFO] |  |  +- 
> com.google.errorprone:error_prone_annotations:jar:2.10.0:compile
> [INFO] |  |  +- io.perfmark:perfmark-api:jar:0.25.0:runtime
> [INFO] |  |  - 
> io.netty:netty-transport-native-unix-common:jar:4.1.72.Final:compile
> [INFO] |  +- io.grpc:grpc-core:jar:1.47.0:compile
> [INFO] |  |  +- com.google.android:annotations:jar:4.1.1.4:runtime
> [INFO] |  |  - org.codehaus.mojo:animal-sniffer-annotations:jar:1.19:runtime
> [INFO] |  +- io.grpc:grpc-context:jar:1.47.0:compile
> [INFO] |  +- io.grpc:grpc-protobuf:jar:1.47.0:compile
> [INFO] |  |  +- 
> com.google.api.grpc:proto-google-common-protos:jar:2.0.1:compile
> [INFO] |  |  - io.grpc:grpc-protobuf-lite:jar:1.47.0:compile
> [INFO] |  +- io.netty:netty-tcnative-boringssl-static:jar:2.0.53.Final:compile
> [INFO] |  |  +- io.netty:netty-tcnative-classes:jar:2.0.53.Final:compile
> [INFO] |  |  +- 
> io.netty:netty-tcnative-boringssl-static:jar:linux-x86_64:2.0.53.Final:compile
> [INFO] |  |  +- 
> io.netty:netty-tcnative-boringssl-static:jar:linux-aarch_64:2.0.53.Final:compile
> [INFO] |  |  +- 
> io.netty:netty-tcnative-boringssl-static:jar:osx-x86_64:2.0.53.Final:compile
> [INFO] |  |  +- 
> io.netty:netty-tcnative-boringssl-static:jar:osx-aarch_64:2.0.53.Final:compile
> [INFO] |  |  - 
> io.netty:netty-tcnative-boringssl-static:jar:windows-x86_64:2.0.53.Final:compile
> [INFO] |  +- io.netty:netty-handler:jar:4.1.78.Final:compile
> [INFO] |  |  +- io.netty:netty-resolver:jar:4.1.78.Final:compile
> [INFO] |  |  - io.netty:netty-codec:jar:4.1.78.Final:compile
> [INFO] |  +- io.netty:netty-transport:jar:4.1.78.Final:compile
> [INFO] |  +- com.google.guava:guava:jar:30.1.1-jre:compile
> [INFO] |  |  +- com.google.guava:failureaccess:jar:1.0.1:compile
> [INFO] |  |  +- 
> com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava:compile
> [INFO] |  |  +- org.checkerframework:checker-qual:jar:3.8.0:compile
> [INFO] |  |  - com.google.j2objc:j2objc-annotations:jar:1.3:compile
> [INFO] |  +- io.grpc:grpc-stub:jar:1.47.0:compile
> [INFO] |  +- com.google.protobuf:protobuf-java:jar:3.21.2:compile
> [INFO] |  +- io.grpc:grpc-api:jar:1.47.0:compile
> [INFO] |  - javax.annotation:javax.annotation-api:jar:1.3.2:compile
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-15479) [C++] Cast fixed size list to compatible fixed size list type (other values type, other field name)

2022-09-28 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-15479.
--
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 14181
[https://github.com/apache/arrow/pull/14181]

> [C++] Cast fixed size list to compatible fixed size list type (other values 
> type, other field name)
> ---
>
> Key: ARROW-15479
> URL: https://issues.apache.org/jira/browse/ARROW-15479
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Kshiteej K
>Priority: Major
>  Labels: good-second-issue, kernel, pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Casting a FixedSizeListArray to a compatible type that differs only in its 
> field name isn't implemented:
> {code:python}
> >>> my_type = pa.list_(pa.field("element", pa.int64()), 2)
> >>> arr = pa.FixedSizeListArray.from_arrays(pa.array([1, 2, 3, 4, 5, 6]), 2)
> >>> arr.type
> FixedSizeListType(fixed_size_list<item: int64>[2])
> >>> my_type
> FixedSizeListType(fixed_size_list<element: int64>[2])
> >>> arr.cast(my_type)
> ...
> ArrowNotImplementedError: Unsupported cast from fixed_size_list<item: 
> int64>[2] to fixed_size_list<element: int64>[2] using function 
> cast_fixed_size_list
> {code}
> While the similar operation with a variable sized list actually works:
> {code:python}
> >>> my_type = pa.list_(pa.field("element", pa.int64()))
> >>> arr = pa.array([[1, 2], [3, 4]], pa.list_(pa.int64()))
> >>> arr.type
> ListType(list<item: int64>)
> >>> arr.cast(my_type).type
> ListType(list<element: int64>)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17877) [CI][Python] verify-rc python nightly builds fail due to missing some flags that were activated with ARROW_PYTHON=ON

2022-09-28 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610694#comment-17610694
 ] 

Kouhei Sutou commented on ARROW-17877:
--

Yes. I'll add {{ARROW_PYTHON}} back, but we should not use {{ARROW_PYTHON=ON}} 
in {{verify-release-candidate.sh}}, because the {{ARROW_PYTHON}} dependencies 
are inconsistent. For example, they include {{ARROW_DATASET=ON}}, which is an 
optional component (not a required one) in PyArrow, and they don't include all 
optional components, such as {{ARROW_PARQUET=ON}}.

I think that CMake presets would be a better replacement for {{ARROW_PYTHON}}, 
because we can define multiple presets such as {{features-python-minimum}} and 
{{features-python-maximum}}. But CMake presets require CMake 3.19 or later...

> [CI][Python] verify-rc python nightly builds fail due to missing some flags 
> that were activated with ARROW_PYTHON=ON
> 
>
> Key: ARROW-17877
> URL: https://issues.apache.org/jira/browse/ARROW-17877
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Blocker
>  Labels: Nightly
> Fix For: 10.0.0
>
>
> Some of our nightly builds are failing with:
> {code:java}
>  [ 35%] Building CXX object CMakeFiles/_dataset.dir/_dataset.cpp.o
> /arrow/python/build/temp.linux-x86_64-cpython-38/_dataset.cpp:833:10: fatal 
> error: arrow/csv/api.h: No such file or directory
>  #include "arrow/csv/api.h"
>           ^
> compilation terminated.{code}
> I suspect the changes here, where flags such as CSV=ON used to be included 
> when building with PYTHON=ON, might be related: 
> [https://github.com/apache/arrow/commit/53ac2a00aa9ff199773513f6f996f73a07b37989]
> Example of nightly failures:
> https://github.com/ursacomputing/crossbow/actions/runs/3135833175/jobs/5091988801



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-16155) [R] lubridate functions for 9.0.0

2022-09-28 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc closed ARROW-16155.
--
Resolution: Done

> [R] lubridate functions for 9.0.0
> -
>
> Key: ARROW-16155
> URL: https://issues.apache.org/jira/browse/ARROW-16155
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Alessandro Molina
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> Umbrella ticket for lubridate functions in 9.0.0
> Future work that is not going to happen in v9 is recorded under 
> https://issues.apache.org/jira/browse/ARROW-16841



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (ARROW-16155) [R] lubridate functions for 9.0.0

2022-09-28 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc reopened ARROW-16155:

  Assignee: Dragoș Moldovan-Grünfeld

> [R] lubridate functions for 9.0.0
> -
>
> Key: ARROW-16155
> URL: https://issues.apache.org/jira/browse/ARROW-16155
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Alessandro Molina
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> Umbrella ticket for lubridate functions in 9.0.0
> Future work that is not going to happen in v9 is recorded under 
> https://issues.apache.org/jira/browse/ARROW-16841



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17857) [C++] Table::CombineChunksToBatch segfaults on empty tables

2022-09-28 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-17857:
-
Fix Version/s: 10.0.0

> [C++] Table::CombineChunksToBatch segfaults on empty tables
> ---
>
> Key: ARROW-17857
> URL: https://issues.apache.org/jira/browse/ARROW-17857
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> There can be 0 chunks in a ChunkedArray
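> For illustration, such a zero-chunk column is easy to construct from Python 
> (a sketch; the crash itself is in the C++ {{Table::CombineChunksToBatch}} 
> path, which this Python code does not call directly):
> {code:python}
> import pyarrow as pa
> 
> # A ChunkedArray may hold zero chunks.
> col = pa.chunked_array([], type=pa.int32())
> assert col.num_chunks == 0
> 
> # Combining the chunks of a table built from such columns must yield an
> # empty result rather than crash.
> table = pa.table({"a": col})
> print(table.combine_chunks())
> {code}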



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610690#comment-17610690
 ] 

Kouhei Sutou commented on ARROW-17872:
--

{quote}
Do you know why we decided to use Homebrew for dependencies on macOS?
{quote}

Because Homebrew is one of the major package managers used by macOS users. For 
CI, we should use an environment similar to the one used by users, so that we 
find bugs before we release.

Anyway, I'm OK with disabling some features for PRs.

> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on Github Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17848) [R][CI] Failed tests in test-dplyr-funcs-datetime.R

2022-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-17848:

Fix Version/s: 10.0.0

> [R][CI] Failed tests in test-dplyr-funcs-datetime.R
> ---
>
> Key: ARROW-17848
> URL: https://issues.apache.org/jira/browse/ARROW-17848
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, R
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 10.0.0
>
>
> Just saw this on an unrelated PR:
> https://github.com/pitrou/arrow/actions/runs/3129051648/jobs/5078785139#step:11:23882
> {code}
> -- Failure (test-dplyr-funcs-datetime.R:461:3): format_ISO8601 
> -
> `object` (`actual`) not equal to `expected` (`expected`).
> actual vs expected
>   a  a2
> - actual[1, ]   2018-10-07  2018-10-07 
> + expected[1, ] 2018-10-07 19:04:05 2018-10-07 19:04:05
>   actual[2, ]   NA  NA 
> `actual$a`:   "2018-10-07"  NA
> `expected$a`: "2018-10-07 19:04:05" NA
> `actual$a2`:   "2018-10-07"  NA
> `expected$a2`: "2018-10-07 19:04:05" NA
> Backtrace:
> x
>  1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:461:2
>  2.   \-arrow:::expect_equal(via_batch, expected, ...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:115:4
>  3. \-testthat::expect_equal(...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4
> -- Failure (test-dplyr-funcs-datetime.R:461:3): format_ISO8601 
> -
> `object` (`actual`) not equal to `expected` (`expected`).
> actual vs expected
>   a  a2
> - actual[1, ]   2018-10-07  2018-10-07 
> + expected[1, ] 2018-10-07 19:04:05 2018-10-07 19:04:05
>   actual[2, ]   NA  NA 
> `actual$a`:   "2018-10-07"  NA
> `expected$a`: "2018-10-07 19:04:05" NA
> `actual$a2`:   "2018-10-07"  NA
> `expected$a2`: "2018-10-07 19:04:05" NA
> Backtrace:
> x
>  1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:461:2
>  2.   \-arrow:::expect_equal(via_table, expected, ...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:129:4
>  3. \-testthat::expect_equal(...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4
> -- Failure (test-dplyr-funcs-datetime.R:492:5): format_ISO8601 
> -
> `object` (`actual`) not equal to `expected` (`expected`).
> actual vs expected
>   x
> - actual[1, ]   2018-10-07-0600
> + expected[1, ] 2018-10-07 19:04:05
>   actual[2, ]   NA 
> `actual$x`:   "2018-10-07-0600" NA
> `expected$x`: "2018-10-07 19:04:05" NA
> Backtrace:
> x
>  1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:492:4
>  2.   \-arrow:::expect_equal(via_batch, expected, ...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:115:4
>  3. \-testthat::expect_equal(...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4
> -- Failure (test-dplyr-funcs-datetime.R:492:5): format_ISO8601 
> -
> `object` (`actual`) not equal to `expected` (`expected`).
> actual vs expected
>   x
> - actual[1, ]   2018-10-07-0600
> + expected[1, ] 2018-10-07 19:04:05
>   actual[2, ]   NA 
> `actual$x`:   "2018-10-07-0600" NA
> `expected$x`: "2018-10-07 19:04:05" NA
> Backtrace:
> x
>  1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:492:4
>  2.   \-arrow:::expect_equal(via_table, expected, ...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:129:4
>  3. \-testthat::expect_equal(...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4
> -- Failure (test-dplyr-funcs-datetime.R:500:5): format_ISO8601 
> -
> `object` (`actual`) not equal to `expected` (`expected`).
> actual vs expected
>x
> - actual[1, ]   2018-10-07T19:04:05-0600
> + expected[1, ] 2018-10-07 19:04:05 
>   actual[2, ]   NA  
> `actual$x`:   "2018-10-07T19:04:05-0600" NA
> `expected$x`: "2018-10-07 19:04:05"  NA
> Backtrace:
> x
>  1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:500:4
>  2.   \-arrow:::expect_equal(via_batch, expected, ...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:115:4
>  3. \-testthat::expect_equal(...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:4

[jira] [Updated] (ARROW-17848) [R][CI] Failed tests in test-dplyr-funcs-datetime.R

2022-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-17848:

Priority: Critical  (was: Major)

> [R][CI] Failed tests in test-dplyr-funcs-datetime.R
> ---
>
> Key: ARROW-17848
> URL: https://issues.apache.org/jira/browse/ARROW-17848
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, R
>Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: 10.0.0
>
>
> Just saw this on an unrelated PR:
> https://github.com/pitrou/arrow/actions/runs/3129051648/jobs/5078785139#step:11:23882
> {code}
> -- Failure (test-dplyr-funcs-datetime.R:461:3): format_ISO8601 
> -
> `object` (`actual`) not equal to `expected` (`expected`).
> actual vs expected
>   a  a2
> - actual[1, ]   2018-10-07  2018-10-07 
> + expected[1, ] 2018-10-07 19:04:05 2018-10-07 19:04:05
>   actual[2, ]   NA  NA 
> `actual$a`:   "2018-10-07"  NA
> `expected$a`: "2018-10-07 19:04:05" NA
> `actual$a2`:   "2018-10-07"  NA
> `expected$a2`: "2018-10-07 19:04:05" NA
> Backtrace:
> x
>  1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:461:2
>  2.   \-arrow:::expect_equal(via_batch, expected, ...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:115:4
>  3. \-testthat::expect_equal(...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4
> -- Failure (test-dplyr-funcs-datetime.R:461:3): format_ISO8601 
> -
> `object` (`actual`) not equal to `expected` (`expected`).
> actual vs expected
>   a  a2
> - actual[1, ]   2018-10-07  2018-10-07 
> + expected[1, ] 2018-10-07 19:04:05 2018-10-07 19:04:05
>   actual[2, ]   NA  NA 
> `actual$a`:   "2018-10-07"  NA
> `expected$a`: "2018-10-07 19:04:05" NA
> `actual$a2`:   "2018-10-07"  NA
> `expected$a2`: "2018-10-07 19:04:05" NA
> Backtrace:
> x
>  1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:461:2
>  2.   \-arrow:::expect_equal(via_table, expected, ...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:129:4
>  3. \-testthat::expect_equal(...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4
> -- Failure (test-dplyr-funcs-datetime.R:492:5): format_ISO8601 
> -
> `object` (`actual`) not equal to `expected` (`expected`).
> actual vs expected
>   x
> - actual[1, ]   2018-10-07-0600
> + expected[1, ] 2018-10-07 19:04:05
>   actual[2, ]   NA 
> `actual$x`:   "2018-10-07-0600" NA
> `expected$x`: "2018-10-07 19:04:05" NA
> Backtrace:
> x
>  1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:492:4
>  2.   \-arrow:::expect_equal(via_batch, expected, ...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:115:4
>  3. \-testthat::expect_equal(...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4
> -- Failure (test-dplyr-funcs-datetime.R:492:5): format_ISO8601 
> -
> `object` (`actual`) not equal to `expected` (`expected`).
> actual vs expected
>   x
> - actual[1, ]   2018-10-07-0600
> + expected[1, ] 2018-10-07 19:04:05
>   actual[2, ]   NA 
> `actual$x`:   "2018-10-07-0600" NA
> `expected$x`: "2018-10-07 19:04:05" NA
> Backtrace:
> x
>  1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:492:4
>  2.   \-arrow:::expect_equal(via_table, expected, ...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:129:4
>  3. \-testthat::expect_equal(...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4
> -- Failure (test-dplyr-funcs-datetime.R:500:5): format_ISO8601 
> -
> `object` (`actual`) not equal to `expected` (`expected`).
> actual vs expected
>x
> - actual[1, ]   2018-10-07T19:04:05-0600
> + expected[1, ] 2018-10-07 19:04:05 
>   actual[2, ]   NA  
> `actual$x`:   "2018-10-07T19:04:05-0600" NA
> `expected$x`: "2018-10-07 19:04:05"  NA
> Backtrace:
> x
>  1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:500:4
>  2.   \-arrow:::expect_equal(via_batch, expected, ...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:115:4
>  3. \-testthat::expect_equal(...) at 
> D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-e

[jira] [Resolved] (ARROW-17811) [Doc][Java] Document how dictionary encoding works

2022-09-28 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-17811.
--
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 14213
[https://github.com/apache/arrow/pull/14213]

> [Doc][Java] Document how dictionary encoding works
> --
>
> Key: ARROW-17811
> URL: https://issues.apache.org/jira/browse/ARROW-17811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Affects Versions: 9.0.0
>Reporter: Larry White
>Assignee: Larry White
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> The ValueVector documentation does not include any discussion of dictionary 
> encoding. There is example code on the IPC page 
> https://arrow.apache.org/docs/dev/java/ipc.html, but it doesn't provide an 
> overview. 
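> For background, dictionary encoding stores each distinct value once in a 
> dictionary and replaces the data with integer indices into it. A quick 
> conceptual illustration using pyarrow (the Java ValueVector API differs; 
> this only shows the idea the documentation should cover):
> {code:python}
> import pyarrow as pa
> 
> arr = pa.array(["a", "b", "a", "a", "b"])
> encoded = arr.dictionary_encode()
> print(encoded.dictionary)  # unique values: ["a", "b"]
> print(encoded.indices)     # per-row indices: [0, 1, 0, 0, 1]
> {code}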



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17878) [Website] Exclude Ballista docs from being deleted

2022-09-28 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-17878:


Assignee: Andy Grove

> [Website] Exclude Ballista docs from being deleted
> --
>
> Key: ARROW-17878
> URL: https://issues.apache.org/jira/browse/ARROW-17878
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Exclude Ballista docs from being deleted



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17878) [Website] Exclude Ballista docs from being deleted

2022-09-28 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-17878.
--
Fix Version/s: 10.0.0
   Resolution: Fixed

https://github.com/apache/arrow-site/pull/241

> [Website] Exclude Ballista docs from being deleted
> --
>
> Key: ARROW-17878
> URL: https://issues.apache.org/jira/browse/ARROW-17878
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 10.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Exclude Ballista docs from being deleted



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-16155) [R] lubridate functions for 9.0.0

2022-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson closed ARROW-16155.
---
Resolution: Done

> [R] lubridate functions for 9.0.0
> -
>
> Key: ARROW-16155
> URL: https://issues.apache.org/jira/browse/ARROW-16155
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Alessandro Molina
>Priority: Major
>
> Umbrella ticket for lubridate functions in 9.0.0
> Future work that is not going to happen in v9 is recorded under 
> https://issues.apache.org/jira/browse/ARROW-16841



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17822) Seg Fault in pyarrow FlightClient with unknown uri schema

2022-09-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17822:
---
Labels: pull-request-available  (was: )

> Seg Fault in pyarrow FlightClient with unknown uri schema
> -
>
> Key: ARROW-17822
> URL: https://issues.apache.org/jira/browse/ARROW-17822
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC, Python
>Affects Versions: 9.0.0
> Environment: Linux U801802 5.14.0-1051-oem #58-Ubuntu SMP Fri Aug 26 
> 05:50:00 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
> Tried with standard ubuntu
> Python 3.8.10 (default, Jun 22 2022, 20:18:18) 
> [GCC 9.4.0] on linux
> And miniconda with python 3.10
>Reporter: Martin
>Assignee: David Li
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Running python in gdb for a bit of info.
> Here I misspelled "grpc" as "grps", but any unrecognized scheme will make it 
> segfault.
> {code:java}
> gdb$ r
> Starting program: /home/user/miniconda3/envs/duckdb10/bin/python 
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 
> 10.4.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa
> [New Thread 0x72fff700 (LWP 1058902)]
> [New Thread 0x71faa700 (LWP 1058903)]
> [New Thread 0x717a9700 (LWP 1058904)]
> [New Thread 0x70fa8700 (LWP 1058905)]
> [New Thread 0x7fffd1d57700 (LWP 1058906)]
> [New Thread 0x7fffd1556700 (LWP 1058907)]
> [New Thread 0x7fffc0d55700 (LWP 1058908)]
> [New Thread 0x7fffb8554700 (LWP 1058909)]
> [New Thread 0x7fffafd53700 (LWP 1058910)]
> [New Thread 0x7fffaf552700 (LWP 1058911)]
> [New Thread 0x7fff9ed51700 (LWP 1058912)]
> [New Thread 0x7fff96550700 (LWP 1058913)]
> [New Thread 0x7fff8dd4f700 (LWP 1058914)]
> [New Thread 0x7fff8554e700 (LWP 1058915)]
> [New Thread 0x7fff84d4d700 (LWP 1058916)]
> [New Thread 0x7fff7c54c700 (LWP 1058917)]
> >>> import pyarrow.flight
> >>> client = pa.flight.connect("grps://0.0.0.0:4")
> Thread 1 "python" received signal SIGSEGV, Segmentation fault.
> ---[regs]
>   RAX: 0x  RBX: 0x55B2C3B0  RBP: 0x55B2C3B0  
> RSP: 0x7FFFC490  o d I t s Z a P c 
>   RDI: 0x  RSI: 0x55A8B040  RDX: 0x55BDEEA0  
> RCX: 0x0004  RIP: 0x7FFF6BAA43D6
>   R8 : 0x0003  R9 : 0x559F4797  R10: 0x55CEDA70  
> R11: 0x55CEDA70  R12: 0x7FFFC990
>   R13: 0x7FFFC6B0  R14: 0x7FFFC8D0  R15: 0x7FFFC530
>   CS: 0033  DS:   ES:   FS:   GS:   SS: 002B                
> ---[code]
> => 0x7fff6baa43d6 <_ZN5arrow6flight12FlightClientD2Ev+38>:    mov    
> rax,QWORD PTR [rdi]
>    0x7fff6baa43d9 <_ZN5arrow6flight12FlightClientD2Ev+41>:    lea    
> rbp,[rsp+0x8]
>    0x7fff6baa43de <_ZN5arrow6flight12FlightClientD2Ev+46>:    mov    rsi,rdi
>    0x7fff6baa43e1 <_ZN5arrow6flight12FlightClientD2Ev+49>:    mov    BYTE PTR 
> [rbx+0x8],0x1
>    0x7fff6baa43e5 <_ZN5arrow6flight12FlightClientD2Ev+53>:    mov    rdi,rbp
>    0x7fff6baa43e8 <_ZN5arrow6flight12FlightClientD2Ev+56>:    call   QWORD 
> PTR [rax+0x18]
>    0x7fff6baa43eb <_ZN5arrow6flight12FlightClientD2Ev+59>:    mov    
> rax,QWORD PTR [rsp+0x8]
>    0x7fff6baa43f0 <_ZN5arrow6flight12FlightClientD2Ev+64>:    test   rax,rax
> -
> 0x7fff6baa43d6 in arrow::flight::FlightClient::~FlightClient() () from 
> /home/user/miniconda3/envs/duckdb10/lib/python3.10/site-packages/pyarrow/../../../libarrow_flight.so.900{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17187) [R] Improve lazy ALTREP implementation for String

2022-09-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17187:
---
Labels: pull-request-available  (was: )

> [R] Improve lazy ALTREP implementation for String
> -
>
> Key: ARROW-17187
> URL: https://issues.apache.org/jira/browse/ARROW-17187
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dewey Dunnington
>Assignee: Dewey Dunnington
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-16578 noted that there was a high cost to looping through an ALTREP 
> character vector that we created in the arrow R package. The temporary 
> workaround is to materialize whenever the first element is requested, which 
> is much faster than our initial implementation but is probably not necessary, 
> given that other ALTREP character implementations appear not to have this 
> issue.
> (Timings are from before merging ARROW-16578, which reduced the 5-second 
> operation below to 0.05 seconds.)
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> #> Some features are not enabled in this build of Arrow. Run `arrow_info()` 
> for more information.
> df1 <- tibble::tibble(x=as.character(floor(runif(100) * 20)))
> write_parquet(df1,"/tmp/test.parquet")
> df2 <- read_parquet("/tmp/test.parquet")
> system.time(unique(df1$x))
> #>user  system elapsed 
> #>   0.022   0.001   0.023
> system.time(unique(df2$x))
> #>user  system elapsed 
> #>   4.529   0.680   5.226
> # the speed is almost certainly not due to ALTREP itself
> # but is probably something to do with our implementation
> tf <- tempfile()
> readr::write_csv(df1, tf)
> df3 <- vroom::vroom(tf, delim = ",", altrep = TRUE)
> #> Rows: 100 Columns: 1
> #> ── Column specification 
> 
> #> Delimiter: ","
> #> dbl (1): x
> #> 
> #> ℹ Use `spec()` to retrieve the full column specification for this data.
> #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this 
> message.
> .Internal(inspect(df3$x))
> #> @2d2042048 14 REALSXP g0c0 [REF(65535)] vroom_dbl (len=100, 
> materialized=F)
> system.time(unique(df3$x))
> #>user  system elapsed 
> #>   0.127   0.001   0.128
> .Internal(inspect(df3$x))
> #> @2d2042048 14 REALSXP g1c0 [MARK,REF(65535)] vroom_dbl (len=100, 
> materialized=F)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17883) [Java] Implement an immutable table object

2022-09-28 Thread Larry White (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry White reassigned ARROW-17883:
---

Assignee: Larry White

> [Java] Implement an immutable table object
> --
>
> Key: ARROW-17883
> URL: https://issues.apache.org/jira/browse/ARROW-17883
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 10.0.0
>Reporter: Larry White
>Assignee: Larry White
>Priority: Major
>
> Implement an immutable Table object without the batch semantics provided by 
> VectorSchemaRoot. 
> See original design document/discussion here: 
> https://docs.google.com/document/d/1J77irZFWNnSID7vK71z26Nw_Pi99I9Hb9iryno8B03c/edit?usp=sharing
> Note that this ticket covers only the immutable Table implementation. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17883) [Java] Implement an immutable table object

2022-09-28 Thread Larry White (Jira)
Larry White created ARROW-17883:
---

 Summary: [Java] Implement an immutable table object
 Key: ARROW-17883
 URL: https://issues.apache.org/jira/browse/ARROW-17883
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Affects Versions: 10.0.0
Reporter: Larry White


Implement an immutable Table object without the batch semantics provided by 
VectorSchemaRoot. 

See original design document/discussion here: 
https://docs.google.com/document/d/1J77irZFWNnSID7vK71z26Nw_Pi99I9Hb9iryno8B03c/edit?usp=sharing

Note that this ticket covers only the immutable Table implementation. 
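As a rough illustration of the idea (hypothetical names and shape, not the proposed API; see the linked design doc for the actual proposal), the table would take ownership of the vectors so that loading a new batch into the root cannot mutate it:
{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.util.TransferPair;

// Hypothetical sketch only: an immutable snapshot of a VectorSchemaRoot.
final class ImmutableTable implements AutoCloseable {
  private final List<FieldVector> vectors = new ArrayList<>();
  private final int rowCount;

  ImmutableTable(VectorSchemaRoot root) {
    this.rowCount = root.getRowCount();
    for (FieldVector v : root.getFieldVectors()) {
      // Transfer the buffers out of the root, so a later batch load
      // into the root cannot mutate this table's data.
      TransferPair pair = v.getTransferPair(v.getAllocator());
      pair.transfer();
      vectors.add((FieldVector) pair.getTo());
    }
  }

  public int getRowCount() {
    return rowCount;
  }

  @Override
  public void close() {
    vectors.forEach(FieldVector::close);
  }
}
{code}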



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17687) [C++] ScanningStress test is flaky in CI

2022-09-28 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610349#comment-17610349
 ] 

Percy Camilo Triveño Aucahuasi edited comment on ARROW-17687 at 9/28/22 5:11 PM:
-

I got this [^backtrace.log.cpp].

It seems we are moving the unique_lock and then trying to lock an invalid mutex.

Also, I was able to hit another issue, this time a deadlock, using these values:
{code:java}
constexpr int kNumIters = 1;
constexpr int kNumFragments = 10;
constexpr int kBatchesPerFragment = 10;
constexpr int kNumConcurrentTasks = 2;{code}
I'll keep exploring where these errors come from; so far I was able to reduce 
and reproduce the test issue using these values:
{code:java}
constexpr int kNumIters = 1;
constexpr int kNumFragments = 2;
constexpr int kBatchesPerFragment = 1;
constexpr int kNumConcurrentTasks = 1;{code}
Given that we can use C++17 now, I'll try to use the new std::scoped_lock 
instead of the other lock types (in the places where it makes sense to do so); 
a sketch of that migration follows.
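A minimal sketch (hypothetical mutex names, nothing from the actual scanner code): the pre-C++17 idiom needs std::lock plus adopt_lock to take several mutexes deadlock-free, while std::scoped_lock does the same in one RAII object. Note also that a moved-from std::unique_lock no longer references its mutex, so locking it again is an error, which would be consistent with the invalid-mutex observation above.
{code:c++}
#include <mutex>

std::mutex queue_mutex;  // hypothetical names, for illustration only
std::mutex state_mutex;

void before_cxx17() {
  // Lock both mutexes with deadlock avoidance, then adopt ownership.
  std::lock(queue_mutex, state_mutex);
  std::lock_guard<std::mutex> g1(queue_mutex, std::adopt_lock);
  std::lock_guard<std::mutex> g2(state_mutex, std::adopt_lock);
  // ... critical section ...
}

void with_cxx17() {
  // One RAII object acquires both, with the same deadlock avoidance.
  std::scoped_lock lock(queue_mutex, state_mutex);
  // ... critical section ...
}
{code}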


was (Author: aucahuasi):
I got this [^backtrace.log.cpp].

It seems we are moving the unique_locker and trying to lock some invalid mutex.

Also, I was able to get another issue, this time a deadlock using these values:
{code:java}
constexpr int kNumIters = 1;
constexpr int kNumFragments = 10;
constexpr int kBatchesPerFragment = 10;
constexpr int kNumConcurrentTasks = 2;{code}
I'll try to explore more about where we are getting these errors, so far I was 
able to reduce and reproduce the test issue using these values:
{code:java}
constexpr int kNumIters = 1;
constexpr int kNumFragments = 2;
constexpr int kBatchesPerFragment = 1;
constexpr int kNumConcurrentTasks = 1;{code}
Given that we can use C++ 17 now, I'll try to use the new std::scoped_lock 
instead of the the other lockers (in the places where it make sense to do so)

> [C++] ScanningStress test is flaky in CI
> 
>
> Key: ARROW-17687
> URL: https://issues.apache.org/jira/browse/ARROW-17687
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Major
> Attachments: backtrace.log.cpp
>
>
> There is at least one nightly failure: 
> https://github.com/ursacomputing/crossbow/actions/runs/3033965241/jobs/4882574634



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers

2022-09-28 Thread Clark Zinzow (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610662#comment-17610662
 ] 

Clark Zinzow edited comment on ARROW-10739 at 9/28/22 5:04 PM:
---

[~jorisvandenbossche] I did a quick implementation of (2), where the Arrow IPC 
format is used under the hood for pickle serialization, and confirmed that the 
buffer truncation works as expected. Although this is a far simpler solution 
than (1), the overhead of the {{RecordBatch}} wrapper adds ~230 extra bytes to 
the pickled payload (per {{Array}} chunk) compared to current Arrow master, 
which can be pretty bad for the many-chunk and/or many-column case (an order of 
magnitude larger serialized payloads). We could sidestep this issue by having 
{{{}Table{}}}, {{{}RecordBatch{}}}, and {{ChunkedArray}} port their 
{{\_\_reduce\_\_}} to the Arrow IPC serialization as well, which should avoid 
the many-column and many-chunk blow-up, but there will still be the baseline 
~230-byte bloat for {{ChunkedArray}} and {{Array}} that we might find untenable.

I can try to get a PR up for (2) either today or tomorrow while I start working 
on (1) in the background. (1) is going to have a much larger Arrow code impact, 
and we'll continue having two serialization paths to maintain, but it shouldn't 
result in any serialized payload bloat.
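To make the idea concrete, below is a rough Python sketch of what (2) does (not 
the actual patch; {{reduce_array}}, {{_rebuild_array}}, and the field name are 
invented for illustration). Routing the slice through the IPC stream format is 
what truncates the parent buffers:
{code:python}
import pickle
import pyarrow as pa

def reduce_array(arr):
    # Wrap the (possibly sliced) array in a single-column RecordBatch;
    # this wrapper is where the ~230 bytes of overhead come from.
    batch = pa.record_batch([arr], names=["f0"])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        # The IPC writer serializes only the logical slice, so the
        # parent buffers are truncated in the payload.
        writer.write_batch(batch)
    return _rebuild_array, (sink.getvalue().to_pybytes(),)

def _rebuild_array(data):
    return pa.ipc.open_stream(data).read_next_batch().column(0)

arr = pa.array(["foo"] * 100_000)
fn, args = reduce_array(arr.slice(10, 1))
payload = pickle.dumps((fn, args))  # small, unlike pickling the slice directly
assert fn(*args).equals(arr.slice(10, 1))
{code}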


was (Author: clarkzinzow):
[~jorisvandenbossche] I did a quick implementation of (2), where the Arrow IPC 
format is used under-the-hood for pickle serialization, and confirmed that the 
buffer truncation works as expected. Although this is a far simpler solution 
than (1), the overhead of the {{RecordBatch}} wrapper adds ~230 extra bytes to 
the pickled payload (per {{Array}} chunk) compared to current Arrow master, 
which can be pretty bad for the many-chunk and/or many-column case (order of 
magnitude larger serialized payloads). We could sidestep this issue by having 
{{{}Table{}}}, {{{}RecordBatch{}}}, and {{ChunkedArray}} port their 
{{\_\_reduce\_\_}} to the Arrow IPC serialization as well, which should avoid 
this many-column and many-chunk blow-up, but there will still be the baseline 
~230 byte bloat for {{ChunkedArray}} and {{Array}} that we might find untenable.

 

I can try to get a PR up for (2) either today or tomorrow while I start working 
on (1) in the background. (1) is going to have a much larger Arrow code impact 
+ we'll continue having two serialization paths to maintain, but it shouldn't 
result in any serialized payload bloat.

> [Python] Pickling a sliced array serializes all the buffers
> ---
>
> Key: ARROW-10739
> URL: https://issues.apache.org/jira/browse/ARROW-10739
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Maarten Breddels
>Assignee: Alessandro Molina
>Priority: Critical
> Fix For: 10.0.0
>
>
> If a large array is sliced and pickled, it seems the full buffer is 
> serialized; this leads to excessive memory usage and data transfer when using 
> multiprocessing or dask.
> {code:java}
> >>> import pyarrow as pa
> >>> ar = pa.array(['foo'] * 100_000)
> >>> ar.nbytes
> 74
> >>> import pickle
> >>> len(pickle.dumps(ar.slice(10, 1)))
> 700165
> NumPy for instance
> >>> import numpy as np
> >>> ar_np = np.array(ar)
> >>> ar_np
> array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
> >>> import pickle
> >>> len(pickle.dumps(ar_np[10:11]))
> 165{code}
> I think this makes sense if you know Arrow, but it is kind of unexpected as a user.
> Is there a workaround for this? For instance, copying an Arrow array to get rid 
> of the offset and trimming the buffers?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers

2022-09-28 Thread Clark Zinzow (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610662#comment-17610662
 ] 

Clark Zinzow edited comment on ARROW-10739 at 9/28/22 5:04 PM:
---

[~jorisvandenbossche] I did a quick implementation of (2), where the Arrow IPC 
format is used under-the-hood for pickle serialization, and confirmed that the 
buffer truncation works as expected. Although this is a far simpler solution 
than (1), the overhead of the {{RecordBatch}} wrapper adds ~230 extra bytes to 
the pickled payload (per {{Array}} chunk) compared to current Arrow master, 
which can be pretty bad for the many-chunk and/or many-column case (order of 
magnitude larger serialized payloads). We could sidestep this issue by having 
{{{}Table{}}}, {{{}RecordBatch{}}}, and {{ChunkedArray}} port their 
{{\_\_reduce\_\_}} to the Arrow IPC serialization as well, which should avoid 
this many-column and many-chunk blow-up, but there will still be the baseline 
~230 byte bloat for {{ChunkedArray}} and {{Array}} that we might find untenable.

 

I can try to get a PR up for (2) either today or tomorrow while I start working 
on (1) in the background. (1) is going to have a much larger Arrow code impact 
+ we'll continue having two serialization paths to maintain, but it shouldn't 
result in any serialized payload bloat.


was (Author: clarkzinzow):
[~jorisvandenbossche] I did a quick implementation of (2), where the Arrow IPC 
format is used under-the-hood for pickle serialization, and confirmed that the 
buffer truncation works as expected. Although this is a far simpler solution 
than (1), the overhead of the {{RecordBatch}} wrapper adds ~230 extra bytes to 
the pickled payload (per {{Array}} chunk) compared to current Arrow master, 
which can be pretty bad for the many-chunk and/or many-column case (order of 
magnitude larger serialized payloads). We could sidestep this issue by having 
{{{}Table{}}}, {{{}RecordBatch{}}}, and {{ChunkedArray}} port their 
{{__reduce__}} to the Arrow IPC serialization as well, which should avoid this 
many-column and many-chunk blow-up, but there will still be the baseline ~230 
byte bloat for {{ChunkedArray}} and {{Array}} that we might find untenable.

 

I can try to get a PR up for (2) either today or tomorrow while I start working 
on (1) in the background. (1) is going to have a much larger Arrow code impact 
+ we'll continue having two serialization paths to maintain, but it shouldn't 
result in any serialized payload bloat.

> [Python] Pickling a sliced array serializes all the buffers
> ---
>
> Key: ARROW-10739
> URL: https://issues.apache.org/jira/browse/ARROW-10739
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Maarten Breddels
>Assignee: Alessandro Molina
>Priority: Critical
> Fix For: 10.0.0
>
>
> If a large array is sliced and pickled, it seems the full buffer is 
> serialized; this leads to excessive memory usage and data transfer when using 
> multiprocessing or dask.
> {code:java}
> >>> import pyarrow as pa
> >>> ar = pa.array(['foo'] * 100_000)
> >>> ar.nbytes
> 74
> >>> import pickle
> >>> len(pickle.dumps(ar.slice(10, 1)))
> 700165
> NumPy for instance
> >>> import numpy as np
> >>> ar_np = np.array(ar)
> >>> ar_np
> array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
> >>> import pickle
> >>> len(pickle.dumps(ar_np[10:11]))
> 165{code}
> I think this makes sense if you know Arrow, but it is kind of unexpected as a user.
> Is there a workaround for this? For instance, copying an Arrow array to get rid 
> of the offset and trimming the buffers?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17866) [Python] List child array invalid

2022-09-28 Thread Sean Conroy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610666#comment-17610666
 ] 

Sean Conroy commented on ARROW-17866:
-

[~jorisvandenbossche] Thanks so much for noticing the connection.  Yes - this 
appears to be the same issue.  I will implement the suggested workaround now...

> [Python] List child array invalid
> -
>
> Key: ARROW-17866
> URL: https://issues.apache.org/jira/browse/ARROW-17866
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: Sean Conroy
>Priority: Major
>
> This issue happens for all the versions of pyarrow I checked (9.0.0, 7.0.0, 
> 6.0.0, 6.0.1).
> Running on Windows 11.
> {code:java}
> log.to_feather(log_fname)
> Traceback (most recent call last):
>   File "G:\My 
> Drive\ds-atcore-etl\venv\lib\site-packages\IPython\core\interactiveshell.py", 
> line 3444, in run_code
>     exec(code_obj, self.user_global_ns, self.user_ns)
>   File "", line 1, in 
>     log.to_feather(log_fname)
>   File "G:\My 
> Drive\ds-atcore-etl\venv\lib\site-packages\pandas\util\_decorators.py", line 
> 207, in wrapper
>     return func(*args, **kwargs)
>   File "G:\My 
> Drive\ds-atcore-etl\venv\lib\site-packages\pandas\core\frame.py", line 2519, 
> in to_feather
>     to_feather(self, path, **kwargs)
>   File "G:\My 
> Drive\ds-atcore-etl\venv\lib\site-packages\pandas\io\feather_format.py", line 
> 87, in to_feather
>     feather.write_feather(df, handles.handle, **kwargs)
>   File "G:\My Drive\ds-atcore-etl\venv\lib\site-packages\pyarrow\feather.py", 
> line 164, in write_feather
>     table = Table.from_pandas(df, preserve_index=preserve_index)
>   File "pyarrow\table.pxi", line 3495, in pyarrow.lib.Table.from_pandas
>   File "pyarrow\table.pxi", line 3597, in pyarrow.lib.Table.from_arrays
>   File "pyarrow\table.pxi", line 2793, in pyarrow.lib.Table.validate
>   File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 13: In chunk 0: Invalid: List child array 
> invalid: Invalid: Struct child array #0 has length smaller than expected for 
> struct array (67186731 < 67186732) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers

2022-09-28 Thread Clark Zinzow (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610662#comment-17610662
 ] 

Clark Zinzow edited comment on ARROW-10739 at 9/28/22 5:03 PM:
---

[~jorisvandenbossche] I did a quick implementation of (2), where the Arrow IPC 
format is used under-the-hood for pickle serialization, and confirmed that the 
buffer truncation works as expected. Although this is a far simpler solution 
than (1), the overhead of the {{RecordBatch}} wrapper adds ~230 extra bytes to 
the pickled payload (per {{Array}} chunk) compared to current Arrow master, 
which can be pretty bad for the many-chunk and/or many-column case (order of 
magnitude larger serialized payloads). We could sidestep this issue by having 
{{{}Table{}}}, {{{}RecordBatch{}}}, and {{ChunkedArray}} port their 
{{__reduce__}} to the Arrow IPC serialization as well, which should avoid this 
many-column and many-chunk blow-up, but there will still be the baseline ~230 
byte bloat for {{ChunkedArray}} and {{Array}} that we might find untenable.

 

I can try to get a PR up for (2) either today or tomorrow while I start working 
on (1) in the background. (1) is going to have a much larger Arrow code impact 
+ we'll continue having two serialization paths to maintain, but it shouldn't 
result in any serialized payload bloat.


was (Author: clarkzinzow):
[~jorisvandenbossche] I did a quick implementation of (2), where the Arrow IPC 
format is used under-the-hood for pickle serialization, and confirmed that the 
buffer truncation works as expected. Although this is a far simpler solution 
than (1), the overhead of the {{RecordBatch}} wrapper adds ~230 extra bytes to 
the pickled payload (per {{Array}} chunk) compared to current Arrow master, 
which can be pretty bad for the many-chunk and/or many-column case (order of 
magnitude larger serialized payloads). We could sidestep this issue by having 
{{{}Table{}}}, {{{}RecordBatch{}}}, and {{ChunkedArray}} port their 
{{__reduce__}} to the Arrow IPC serialization as well, which should avoid this 
many-column and many-chunk blow-up, but there will still be the baseline ~230 
byte bloat for {{ChunkedArray}} and {{Array}} that we might find untenable.

 

I can try to get a PR up for (2) either today or tomorrow while I start working 
on (1) in the background. (1) is going to have a much larger Arrow code impact 
+ we'll continue having two serialization paths to maintain, but it shouldn't 
result in any serialized payload bloat.

> [Python] Pickling a sliced array serializes all the buffers
> ---
>
> Key: ARROW-10739
> URL: https://issues.apache.org/jira/browse/ARROW-10739
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Maarten Breddels
>Assignee: Alessandro Molina
>Priority: Critical
> Fix For: 10.0.0
>
>
> If a large array is sliced and pickled, it seems the full buffer is 
> serialized; this leads to excessive memory usage and data transfer when using 
> multiprocessing or dask.
> {code:java}
> >>> import pyarrow as pa
> >>> ar = pa.array(['foo'] * 100_000)
> >>> ar.nbytes
> 74
> >>> import pickle
> >>> len(pickle.dumps(ar.slice(10, 1)))
> 700165
> NumPy for instance
> >>> import numpy as np
> >>> ar_np = np.array(ar)
> >>> ar_np
> array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
> >>> import pickle
> >>> len(pickle.dumps(ar_np[10:11]))
> 165{code}
> I think this makes sense if you know Arrow, but it is kind of unexpected as a user.
> Is there a workaround for this? For instance, copying an Arrow array to get rid 
> of the offset and trimming the buffers?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers

2022-09-28 Thread Clark Zinzow (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610662#comment-17610662
 ] 

Clark Zinzow edited comment on ARROW-10739 at 9/28/22 5:02 PM:
---

[~jorisvandenbossche] I did a quick implementation of (2), where the Arrow IPC 
format is used under-the-hood for pickle serialization, and confirmed that the 
buffer truncation works as expected. Although this is a far simpler solution 
than (1), the overhead of the {{RecordBatch}} wrapper adds ~230 extra bytes to 
the pickled payload (per {{Array}} chunk) compared to current Arrow master, 
which can be pretty bad for the many-chunk and/or many-column case (order of 
magnitude larger serialized payloads). We could sidestep this issue by having 
{{{}Table{}}}, {{{}RecordBatch{}}}, and {{ChunkedArray}} port their 
{{__reduce__}} to the Arrow IPC serialization as well, which should avoid this 
many-column and many-chunk blow-up, but there will still be the baseline ~230 
byte bloat for {{ChunkedArray}} and {{Array}} that we might find untenable.

 

I can try to get a PR up for (2) either today or tomorrow while I start working 
on (1) in the background. (1) is going to have a much larger Arrow code impact 
+ we'll continue having two serialization paths to maintain, but it shouldn't 
result in any serialized payload bloat.


was (Author: clarkzinzow):
[~jorisvandenbossche] I did a quick implementation of (2), where the Arrow IPC 
format is used under-the-hood for pickle serialization, and confirmed that the 
buffer truncation works as expected. Although this is a far simpler solution 
than (1), the overhead of the `RecordBatch` wrapper adds ~230 extra bytes to 
the pickled payload (per `Array` chunk) compared to current Arrow master, which 
can be pretty bad for the many-chunk and/or many-column case (order of 
magnitude larger serialized payloads). We could sidestep this issue by having 
`Table`, `RecordBatch`, and `ChunkedArray` port their `__reduce__` to the Arrow 
IPC serialization as well, which should avoid this many-column and many-chunk 
blow-up, but there will still be the baseline ~230 byte bloat for 
`ChunkedArray` and `Array` that we might find untenable.

 

I can try to get a PR up for (2) either today or tomorrow while I start working 
on (1) in the background. (1) is going to have a much larger Arrow code impact 
+ we'll continue having two serialization paths to maintain, but it shouldn't 
result in any serialized payload bloat.

> [Python] Pickling a sliced array serializes all the buffers
> ---
>
> Key: ARROW-10739
> URL: https://issues.apache.org/jira/browse/ARROW-10739
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Maarten Breddels
>Assignee: Alessandro Molina
>Priority: Critical
> Fix For: 10.0.0
>
>
> If a large array is sliced and pickled, it seems the full buffer is 
> serialized; this leads to excessive memory usage and data transfer when using 
> multiprocessing or dask.
> {code:java}
> >>> import pyarrow as pa
> >>> ar = pa.array(['foo'] * 100_000)
> >>> ar.nbytes
> 74
> >>> import pickle
> >>> len(pickle.dumps(ar.slice(10, 1)))
> 700165
> NumPy for instance
> >>> import numpy as np
> >>> ar_np = np.array(ar)
> >>> ar_np
> array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
> >>> import pickle
> >>> len(pickle.dumps(ar_np[10:11]))
> 165{code}
> I think this makes sense if you know Arrow, but it is kind of unexpected as a user.
> Is there a workaround for this? For instance, copying an Arrow array to get rid 
> of the offset and trimming the buffers?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers

2022-09-28 Thread Clark Zinzow (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610662#comment-17610662
 ] 

Clark Zinzow commented on ARROW-10739:
--

[~jorisvandenbossche] I did a quick implementation of (2), where the Arrow IPC 
format is used under-the-hood for pickle serialization, and confirmed that the 
buffer truncation works as expected. Although this is a far simpler solution 
than (1), the overhead of the `RecordBatch` wrapper adds ~230 extra bytes to 
the pickled payload (per `Array` chunk) compared to current Arrow master, which 
can be pretty bad for the many-chunk and/or many-column case (order of 
magnitude larger serialized payloads). We could sidestep this issue by having 
`Table`, `RecordBatch`, and `ChunkedArray` port their `__reduce__` to the Arrow 
IPC serialization as well, which should avoid this many-column and many-chunk 
blow-up, but there will still be the baseline ~230 byte bloat for 
`ChunkedArray` and `Array` that we might find untenable.

 

I can try to get a PR up for (2) either today or tomorrow while I start working 
on (1) in the background. (1) is going to have a much larger Arrow code impact 
+ we'll continue having two serialization paths to maintain, but it shouldn't 
result in any serialized payload bloat.

> [Python] Pickling a sliced array serializes all the buffers
> ---
>
> Key: ARROW-10739
> URL: https://issues.apache.org/jira/browse/ARROW-10739
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Maarten Breddels
>Assignee: Alessandro Molina
>Priority: Critical
> Fix For: 10.0.0
>
>
> If a large array is sliced and pickled, it seems the full buffer is 
> serialized; this leads to excessive memory usage and data transfer when using 
> multiprocessing or dask.
> {code:java}
> >>> import pyarrow as pa
> >>> ar = pa.array(['foo'] * 100_000)
> >>> ar.nbytes
> 74
> >>> import pickle
> >>> len(pickle.dumps(ar.slice(10, 1)))
> 700165
> NumPy for instance
> >>> import numpy as np
> >>> ar_np = np.array(ar)
> >>> ar_np
> array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
> >>> import pickle
> >>> len(pickle.dumps(ar_np[10:11]))
> 165{code}
> I think this makes sense if you know Arrow, but it is kind of unexpected as a user.
> Is there a workaround for this? For instance, copying an Arrow array to get rid 
> of the offset and trimming the buffers?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17847) [C++] Support unquoted decimal in JSON parser

2022-09-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-17847.

Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 14242
[https://github.com/apache/arrow/pull/14242]

> [C++] Support unquoted decimal in JSON parser
> -
>
> Key: ARROW-17847
> URL: https://issues.apache.org/jira/browse/ARROW-17847
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 9.0.0
>Reporter: Jin Shang
>Assignee: Jin Shang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> -Add an option to parse decimal as unquoted numbers in JSON-
> Support both quoted and unquoted decimal in JSON parser automatically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17866) [Python] List child array invalid

2022-09-28 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610657#comment-17610657
 ] 

Joris Van den Bossche commented on ARROW-17866:
---

[~meystingray] thanks for the report! 
This sounds very similar to ARROW-17137 (which also mentions a possible 
workaround for now).

> [Python] List child array invalid
> -
>
> Key: ARROW-17866
> URL: https://issues.apache.org/jira/browse/ARROW-17866
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: Sean Conroy
>Priority: Major
>
> This issue happens for all the versions of pyarrow I checked (9.0.0, 7.0.0, 
> 6.0.0, 6.0.1).
> Running on Windows 11.
> {code:java}
> log.to_feather(log_fname)
> Traceback (most recent call last):
>   File "G:\My 
> Drive\ds-atcore-etl\venv\lib\site-packages\IPython\core\interactiveshell.py", 
> line 3444, in run_code
>     exec(code_obj, self.user_global_ns, self.user_ns)
>   File "", line 1, in 
>     log.to_feather(log_fname)
>   File "G:\My 
> Drive\ds-atcore-etl\venv\lib\site-packages\pandas\util\_decorators.py", line 
> 207, in wrapper
>     return func(*args, **kwargs)
>   File "G:\My 
> Drive\ds-atcore-etl\venv\lib\site-packages\pandas\core\frame.py", line 2519, 
> in to_feather
>     to_feather(self, path, **kwargs)
>   File "G:\My 
> Drive\ds-atcore-etl\venv\lib\site-packages\pandas\io\feather_format.py", line 
> 87, in to_feather
>     feather.write_feather(df, handles.handle, **kwargs)
>   File "G:\My Drive\ds-atcore-etl\venv\lib\site-packages\pyarrow\feather.py", 
> line 164, in write_feather
>     table = Table.from_pandas(df, preserve_index=preserve_index)
>   File "pyarrow\table.pxi", line 3495, in pyarrow.lib.Table.from_pandas
>   File "pyarrow\table.pxi", line 3597, in pyarrow.lib.Table.from_arrays
>   File "pyarrow\table.pxi", line 2793, in pyarrow.lib.Table.validate
>   File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 13: In chunk 0: Invalid: List child array 
> invalid: Invalid: Struct child array #0 has length smaller than expected for 
> struct array (67186731 < 67186732) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16319) [R] [Docs] Document the lubridate functions we support in {arrow}

2022-09-28 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-16319:
---

Assignee: (was: Stephanie Hazlitt)

> [R] [Docs] Document the lubridate functions we support in {arrow}
> -
>
> Key: ARROW-16319
> URL: https://issues.apache.org/jira/browse/ARROW-16319
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 8.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> Add documentation around the {{lubridate}} functionality supported in 
> {{arrow}}. Could be made up of:
> * a blogpost 
> * a more in-depth piece of documentation



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14209) [R] Allow multiple arguments to n_distinct()

2022-09-28 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-14209:
---

Assignee: (was: Dragoș Moldovan-Grünfeld)

> [R] Allow multiple arguments to n_distinct()
> 
>
> Key: ARROW-14209
> URL: https://issues.apache.org/jira/browse/ARROW-14209
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Priority: Major
>
> ARROW-13620 and ARROW-14036 added support for the {{n_distinct()}} function 
> in the dplyr verb {{summarise()}} but only with a single argument. Add 
> support for multiple arguments to {{n_distinct()}}. This should return the 
> number of unique combinations of values in the specified columns/expressions.
> See the comment about this here: 
> [https://github.com/apache/arrow/pull/11257#discussion_r720873549]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-12311) [Python][R] Expose (hide?) ScanOptions

2022-09-28 Thread Todd Farmer (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610656#comment-17610656
 ] 

Todd Farmer commented on ARROW-12311:
-

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [Python][R] Expose (hide?) ScanOptions
> --
>
> Key: ARROW-12311
> URL: https://issues.apache.org/jira/browse/ARROW-12311
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python, R
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
> Fix For: 10.0.0
>
>
> Currently R completely hides the `ScanOptions` class.
> In python the class is exposed but the documentation prefers `dataset.scan` 
> (which hides both the scanner and the scan options).
> However, there is some useful information in the `ScanOptions`.  
> Specifically, the projected schema (which is a product of the dataset schema 
> and the projection expression and not easily recreated) and the materialized 
> fields (the list of fields referenced by either the filter or the projection) 
> which might be useful for reporting purposes.
> Currently R uses the projected schema to convert a list of column names into 
> a partition schema.  Python does not rely on either field.
>  
> Options:
>  - Keep the status quo
>  - Expose the ScanOptions object (which itself is exposed via the Scanner)
>  - Expose the interesting fields via the Scanner
>  
> Currently the C++ design is halfway between the latter two (the projected 
> schema is exposed, as are the options). My preference would be the third 
> option. It raises a further question about how to expose the scanner itself 
> in Python: should the user be using ScannerBuilder? Should they use NewScan? 
> Should they use the scanner directly at all, or should it be hidden?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14138) [R] update metadata when casting a record batch column

2022-09-28 Thread Todd Farmer (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610654#comment-17610654
 ] 

Todd Farmer commented on ARROW-14138:
-

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [R] update metadata when casting a record batch column
> --
>
> Key: ARROW-14138
> URL: https://issues.apache.org/jira/browse/ARROW-14138
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain Francois
>Assignee: Romain Francois
>Priority: Minor
> Fix For: 10.0.0
>
>
> library(arrow, warn.conflicts = FALSE)
> #> See arrow_info() for available features
> raws <- structure(list(
>   as.raw(c(0x70, 0x65, 0x72, 0x73, 0x6f, 0x6e))
> ), class = c("arrow_binary", "vctrs_vctr", "list"))
> batch <- record_batch(b = raws)
> batch$metadata$r
> #>  'arrow_r_metadata' chr 
> "A\n3\n262147\n197888\n5\nUTF-8\n531\n1\n531\n1\n531\n2\n531\n1\n16\n3\n262153\n12\narrow_binary\n262153\n10\nvc"|
>  __truncated__
> #> List of 1
> #>  $ columns:List of 1
> #>   ..$ b:List of 2
> #>   .. ..$ attributes:List of 1
> #>   .. .. ..$ class: chr [1:3] "arrow_binary" "vctrs_vctr" "list"
> #>   .. ..$ columns   : NULL
> # when casting `b` to a string column, the metadata is kept
> batch$b <- batch$b$cast(utf8())
> batch$metadata$r
> #>  'arrow_r_metadata' chr 
> "A\n3\n262147\n197888\n5\nUTF-8\n531\n1\n531\n1\n531\n2\n531\n1\n16\n3\n262153\n12\narrow_binary\n262153\n10\nvc"|
>  __truncated__
> #> List of 1
> #>  $ columns:List of 1
> #>   ..$ b:List of 2
> #>   .. ..$ attributes:List of 1
> #>   .. .. ..$ class: chr [1:3] "arrow_binary" "vctrs_vctr" "list"
> #>   .. ..$ columns   : NULL
> # but it should not have
> batch2 <- record_batch(b = "string")
> batch2$metadata$r
> #> NULL



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14987) [C++]Memory leak while reading parquet file

2022-09-28 Thread Todd Farmer (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610651#comment-17610651
 ] 

Todd Farmer commented on ARROW-14987:
-

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++]Memory leak while reading parquet file
> ---
>
> Key: ARROW-14987
> URL: https://issues.apache.org/jira/browse/ARROW-14987
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 6.0.1
>Reporter: Qingxiang Chen
>Assignee: Weston Pace
>Priority: Major
>
> When I used parquet to access data, I found that the memory usage was still 
> high after the function ended. I reproduced this problem in the example code 
> shown below:
>  
> {code:c++}
> #include <arrow/api.h>
> #include <arrow/io/api.h>
> #include <parquet/arrow/reader.h>
> #include <parquet/arrow/writer.h>
> #include <parquet/exception.h>
> #include <iostream>
> #include <unistd.h>
> 
> std::shared_ptr<arrow::Table> generate_table() {
>   arrow::Int64Builder i64builder;
>   for (int i = 0; i < 32; i++) {
>     i64builder.Append(i);
>   }
>   std::shared_ptr<arrow::Array> i64array;
>   PARQUET_THROW_NOT_OK(i64builder.Finish(&i64array));
>   std::shared_ptr<arrow::Schema> schema = arrow::schema(
>       {arrow::field("int", arrow::int64())});
>   return arrow::Table::Make(schema, {i64array});
> }
> void write_parquet_file(const arrow::Table& table) {
>   std::shared_ptr<arrow::io::FileOutputStream> outfile;
>   PARQUET_ASSIGN_OR_THROW(
>       outfile,
>       arrow::io::FileOutputStream::Open("parquet-arrow-example.parquet"));
>   PARQUET_THROW_NOT_OK(
>       parquet::arrow::WriteTable(table, arrow::default_memory_pool(),
>                                  outfile, 3));
> }
> void read_whole_file() {
>   std::cout << "Reading parquet-arrow-example.parquet at once" << std::endl;
>   std::shared_ptr<arrow::io::ReadableFile> infile;
>   PARQUET_ASSIGN_OR_THROW(
>       infile,
>       arrow::io::ReadableFile::Open("parquet-arrow-example.parquet",
>                                     arrow::default_memory_pool()));
>   std::unique_ptr<parquet::arrow::FileReader> reader;
>   PARQUET_THROW_NOT_OK(
>       parquet::arrow::OpenFile(infile, arrow::default_memory_pool(),
>                                &reader));
>   std::shared_ptr<arrow::Table> table;
>   PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
>   std::cout << "Loaded " << table->num_rows() << " rows in "
>             << table->num_columns() << " columns." << std::endl;
> }
> int main(int argc, char** argv) {
>   std::shared_ptr<arrow::Table> table = generate_table();
>   write_parquet_file(*table);
>   std::cout << "start " << std::endl;
>   read_whole_file();
>   std::cout << "end " << std::endl;
>   sleep(100);
> }
> {code}
> After the end, during the sleep, the memory usage is still more than 100 MB 
> and has not dropped. When I increase the data volume by 5 times, the memory 
> usage is about 500 MB, and it does not drop either.
> I want to know whether this part of the data is cached by the memory pool, or 
> whether it is a memory leak. If there is no leak, how can I set the memory 
> pool size or release the memory?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16155) [R] lubridate functions for 9.0.0

2022-09-28 Thread Todd Farmer (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610655#comment-17610655
 ] 

Todd Farmer commented on ARROW-16155:
-

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [R] lubridate functions for 9.0.0
> -
>
> Key: ARROW-16155
> URL: https://issues.apache.org/jira/browse/ARROW-16155
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Alessandro Molina
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> Umbrella ticket for lubridate functions in 9.0.0
> Future work that is not going to happen in v9 is recorded under 
> https://issues.apache.org/jira/browse/ARROW-16841



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14138) [R] update metadata when casting a record batch column

2022-09-28 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-14138:
---

Assignee: (was: Romain Francois)

> [R] update metadata when casting a record batch column
> --
>
> Key: ARROW-14138
> URL: https://issues.apache.org/jira/browse/ARROW-14138
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain Francois
>Priority: Minor
> Fix For: 10.0.0
>
>
> library(arrow, warn.conflicts = FALSE)
> #> See arrow_info() for available features
> raws <- structure(list(
>   as.raw(c(0x70, 0x65, 0x72, 0x73, 0x6f, 0x6e))
> ), class = c("arrow_binary", "vctrs_vctr", "list"))
> batch <- record_batch(b = raws)
> batch$metadata$r
> #>  'arrow_r_metadata' chr 
> "A\n3\n262147\n197888\n5\nUTF-8\n531\n1\n531\n1\n531\n2\n531\n1\n16\n3\n262153\n12\narrow_binary\n262153\n10\nvc"|
>  __truncated__
> #> List of 1
> #>  $ columns:List of 1
> #>   ..$ b:List of 2
> #>   .. ..$ attributes:List of 1
> #>   .. .. ..$ class: chr [1:3] "arrow_binary" "vctrs_vctr" "list"
> #>   .. ..$ columns   : NULL
> # when casting `b` to a string column, the metadata is kept
> batch$b <- batch$b$cast(utf8())
> batch$metadata$r
> #>  'arrow_r_metadata' chr 
> "A\n3\n262147\n197888\n5\nUTF-8\n531\n1\n531\n1\n531\n2\n531\n1\n16\n3\n262153\n12\narrow_binary\n262153\n10\nvc"|
>  __truncated__
> #> List of 1
> #>  $ columns:List of 1
> #>   ..$ b:List of 2
> #>   .. ..$ attributes:List of 1
> #>   .. .. ..$ class: chr [1:3] "arrow_binary" "vctrs_vctr" "list"
> #>   .. ..$ columns   : NULL
> # but it should not have
> batch2 <- record_batch(b = "string")
> batch2$metadata$r
> #> NULL



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14987) [C++]Memory leak while reading parquet file

2022-09-28 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-14987:
---

Assignee: (was: Weston Pace)

> [C++]Memory leak while reading parquet file
> ---
>
> Key: ARROW-14987
> URL: https://issues.apache.org/jira/browse/ARROW-14987
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 6.0.1
>Reporter: Qingxiang Chen
>Priority: Major
>
> When I used parquet to access data, I found that the memory usage was still 
> high after the function ended. I reproduced this problem in the example code 
> shown below:
>  
> {code:c++}
> #include <arrow/api.h>
> #include <arrow/io/api.h>
> #include <parquet/arrow/reader.h>
> #include <parquet/arrow/writer.h>
> #include <parquet/exception.h>
> #include <iostream>
> #include <unistd.h>
> 
> std::shared_ptr<arrow::Table> generate_table() {
>   arrow::Int64Builder i64builder;
>   for (int i = 0; i < 32; i++) {
>     i64builder.Append(i);
>   }
>   std::shared_ptr<arrow::Array> i64array;
>   PARQUET_THROW_NOT_OK(i64builder.Finish(&i64array));
>   std::shared_ptr<arrow::Schema> schema = arrow::schema(
>       {arrow::field("int", arrow::int64())});
>   return arrow::Table::Make(schema, {i64array});
> }
> void write_parquet_file(const arrow::Table& table) {
>   std::shared_ptr<arrow::io::FileOutputStream> outfile;
>   PARQUET_ASSIGN_OR_THROW(
>       outfile,
>       arrow::io::FileOutputStream::Open("parquet-arrow-example.parquet"));
>   PARQUET_THROW_NOT_OK(
>       parquet::arrow::WriteTable(table, arrow::default_memory_pool(),
>                                  outfile, 3));
> }
> void read_whole_file() {
>   std::cout << "Reading parquet-arrow-example.parquet at once" << std::endl;
>   std::shared_ptr<arrow::io::ReadableFile> infile;
>   PARQUET_ASSIGN_OR_THROW(
>       infile,
>       arrow::io::ReadableFile::Open("parquet-arrow-example.parquet",
>                                     arrow::default_memory_pool()));
>   std::unique_ptr<parquet::arrow::FileReader> reader;
>   PARQUET_THROW_NOT_OK(
>       parquet::arrow::OpenFile(infile, arrow::default_memory_pool(),
>                                &reader));
>   std::shared_ptr<arrow::Table> table;
>   PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
>   std::cout << "Loaded " << table->num_rows() << " rows in "
>             << table->num_columns() << " columns." << std::endl;
> }
> int main(int argc, char** argv) {
>   std::shared_ptr<arrow::Table> table = generate_table();
>   write_parquet_file(*table);
>   std::cout << "start " << std::endl;
>   read_whole_file();
>   std::cout << "end " << std::endl;
>   sleep(100);
> }
> {code}
> After the end, during the sleep, the memory usage is still more than 100 MB 
> and has not dropped. When I increase the data volume by 5 times, the memory 
> usage is about 500 MB, and it does not drop either.
> I want to know whether this part of the data is cached by the memory pool, or 
> whether it is a memory leak. If there is no leak, how can I set the memory 
> pool size or release the memory?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16155) [R] lubridate functions for 9.0.0

2022-09-28 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-16155:
---

Assignee: (was: Dragoș Moldovan-Grünfeld)

> [R] lubridate functions for 9.0.0
> -
>
> Key: ARROW-16155
> URL: https://issues.apache.org/jira/browse/ARROW-16155
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Alessandro Molina
>Priority: Major
>
> Umbrella ticket for lubridate functions in 9.0.0
> Future work that is not going to happen in v9 is recorded under 
> https://issues.apache.org/jira/browse/ARROW-16841



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16319) [R] [Docs] Document the lubridate functions we support in {arrow}

2022-09-28 Thread Todd Farmer (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610653#comment-17610653
 ] 

Todd Farmer commented on ARROW-16319:
-

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [R] [Docs] Document the lubridate functions we support in {arrow}
> -
>
> Key: ARROW-16319
> URL: https://issues.apache.org/jira/browse/ARROW-16319
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 8.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Stephanie Hazlitt
>Priority: Major
>
> Add documentation around the {{lubridate}} functionality supported in 
> {{arrow}}. Could be made up of:
> * a blogpost 
> * a more in-depth piece of documentation



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14588) [R] Create an arrow-specific checklist for a CRAN release

2022-09-28 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-14588:
---

Assignee: (was: Dragoș Moldovan-Grünfeld)

> [R] Create an arrow-specific checklist for a CRAN release  
> ---
>
> Key: ARROW-14588
> URL: https://issues.apache.org/jira/browse/ARROW-14588
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Minor
>
> This would adapt and implement the functionality of 
> {{usethis::use_release_issue()}} for {{arrow}}'s specific context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-12311) [Python][R] Expose (hide?) ScanOptions

2022-09-28 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-12311:
---

Assignee: (was: Weston Pace)

> [Python][R] Expose (hide?) ScanOptions
> --
>
> Key: ARROW-12311
> URL: https://issues.apache.org/jira/browse/ARROW-12311
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python, R
>Reporter: Weston Pace
>Priority: Major
> Fix For: 10.0.0
>
>
> Currently R completely hides the `ScanOptions` class.
> In python the class is exposed but the documentation prefers `dataset.scan` 
> (which hides both the scanner and the scan options).
> However, there is some useful information in the `ScanOptions`.  
> Specifically, the projected schema (which is a product of the dataset schema 
> and the projection expression and not easily recreated) and the materialized 
> fields (the list of fields referenced by either the filter or the projection) 
> which might be useful for reporting purposes.
> Currently R uses the projected schema to convert a list of column names into 
> a partition schema.  Python does not rely on either field.
>  
> Options:
>  - Keep the status quo
>  - Expose the ScanOptions object (which itself is exposed via the Scanner)
>  - Expose the interesting fields via the Scanner
>  
> Currently the C++ design is halfway between the latter two (the projected 
> schema is exposed, as are the options). My preference would be the third 
> option. It raises a further question about how to expose the scanner itself 
> in Python: should the user be using ScannerBuilder? Should they use NewScan? 
> Should they use the scanner directly at all, or should it be hidden?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14209) [R] Allow multiple arguments to n_distinct()

2022-09-28 Thread Todd Farmer (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610650#comment-17610650
 ] 

Todd Farmer commented on ARROW-14209:
-

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [R] Allow multiple arguments to n_distinct()
> 
>
> Key: ARROW-14209
> URL: https://issues.apache.org/jira/browse/ARROW-14209
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> ARROW-13620 and ARROW-14036 added support for the {{n_distinct()}} function 
> in the dplyr verb {{summarise()}} but only with a single argument. Add 
> support for multiple arguments to {{n_distinct()}}. This should return the 
> number of unique combinations of values in the specified columns/expressions.
> See the comment about this here: 
> [https://github.com/apache/arrow/pull/11257#discussion_r720873549]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14588) [R] Create an arrow-specific checklist for a CRAN release

2022-09-28 Thread Todd Farmer (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610652#comment-17610652
 ] 

Todd Farmer commented on ARROW-14588:
-

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [R] Create an arrow-specific checklist for a CRAN release  
> ---
>
> Key: ARROW-14588
> URL: https://issues.apache.org/jira/browse/ARROW-14588
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Minor
>
> This would adapt and implement the functionality of 
> {{usethis::use_release_issue()}} for {{arrow}}'s specific context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17877) [CI][Python] verify-rc python nightly builds fail due to missing some flags that were activated with ARROW_PYTHON=ON

2022-09-28 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610646#comment-17610646
 ] 

Joris Van den Bossche commented on ARROW-17877:
---

See also ARROW-17868, where [~kou] proposed adding the option back but 
deprecating it.

(That still means we should also update all internal usage and docs that rely 
on it.)
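For anyone hitting this on a local dev build in the meantime, the workaround is to enable the components explicitly; a sketch of the kind of invocation that ARROW_PYTHON=ON used to imply (the exact component list is an assumption and may vary with your pyarrow build options):
{code:bash}
# Placeholder paths; turn on the C++ components pyarrow assumes are present.
cmake -S arrow/cpp -B arrow/cpp/build \
  -DARROW_COMPUTE=ON \
  -DARROW_CSV=ON \
  -DARROW_DATASET=ON \
  -DARROW_FILESYSTEM=ON \
  -DARROW_JSON=ON
cmake --build arrow/cpp/build
{code}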


> [CI][Python] verify-rc python nightly builds fail due to missing some flags 
> that were activated with ARROW_PYTHON=ON
> 
>
> Key: ARROW-17877
> URL: https://issues.apache.org/jira/browse/ARROW-17877
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Blocker
>  Labels: Nightly
> Fix For: 10.0.0
>
>
> Some of our nightly builds are failing with:
> {code:java}
>  [ 35%] Building CXX object CMakeFiles/_dataset.dir/_dataset.cpp.o
> /arrow/python/build/temp.linux-x86_64-cpython-38/_dataset.cpp:833:10: fatal 
> error: arrow/csv/api.h: No such file or directory
>  #include "arrow/csv/api.h"
>           ^
> compilation terminated.{code}
> I suspect the changes here, which dropped flags such as CSV=ON that used to 
> be implied when building with PYTHON=ON, might be related: 
> [https://github.com/apache/arrow/commit/53ac2a00aa9ff199773513f6f996f73a07b37989]
> Example of nightly failures:
> https://github.com/ursacomputing/crossbow/actions/runs/3135833175/jobs/5091988801



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17882) [Java][Doc] Document build & use of new artifact on Windows environment

2022-09-28 Thread David Dali Susanibar Arce (Jira)
David Dali Susanibar Arce created ARROW-17882:
-

 Summary: [Java][Doc] Document build & use of new artifact on 
Windows environment
 Key: ARROW-17882
 URL: https://issues.apache.org/jira/browse/ARROW-17882
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Documentation, Java
Reporter: David Dali Susanibar Arce
Assignee: David Dali Susanibar Arce


* Update build documentation with new Windows JNI DLL support
 * Update use documentation with new Windows JNI DLL support



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-17546) [C++] Remove pre-C++17 compatibility measures

2022-09-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-17546.
--
Resolution: Fixed

> [C++] Remove pre-C++17 compatibility measures
> -
>
> Key: ARROW-17546
> URL: https://issues.apache.org/jira/browse/ARROW-17546
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Critical
> Fix For: 10.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15075) [C++][Dataset] Implement Dataset for reading JSON format

2022-09-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-15075:
---
Summary: [C++][Dataset] Implement Dataset for reading JSON format  (was: 
[C++][Dataset] Implement Dataset for JSON format)

> [C++][Dataset] Implement Dataset for reading JSON format
> 
>
> Key: ARROW-15075
> URL: https://issues.apache.org/jira/browse/ARROW-15075
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Will Jones
>Assignee: Ben Harkins
>Priority: Major
>  Labels: dataset
>
> We already have support for reading individual files, but not yet for reading 
> datasets. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17877) [CI][Python] verify-rc python nightly builds fail due to missing some flags that were activated with ARROW_PYTHON=ON

2022-09-28 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-17877:
--
Fix Version/s: 10.0.0

> [CI][Python] verify-rc python nightly builds fail due to missing some flags 
> that were activated with ARROW_PYTHON=ON
> 
>
> Key: ARROW-17877
> URL: https://issues.apache.org/jira/browse/ARROW-17877
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Blocker
>  Labels: Nightly
> Fix For: 10.0.0
>
>
> Some of our nightly builds are failing with:
> {code:java}
>  [ 35%] Building CXX object CMakeFiles/_dataset.dir/_dataset.cpp.o
> /arrow/python/build/temp.linux-x86_64-cpython-38/_dataset.cpp:833:10: fatal 
> error: arrow/csv/api.h: No such file or directory
>  #include "arrow/csv/api.h"
>           ^
> compilation terminated.{code}
> I suspect the changes here, which removed flags like CSV=ON that used to be 
> implied when building with PYTHON=ON, might be related: 
> [https://github.com/apache/arrow/commit/53ac2a00aa9ff199773513f6f996f73a07b37989]
> Example of nightly failures:
> https://github.com/ursacomputing/crossbow/actions/runs/3135833175/jobs/5091988801



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17881) [C++] Not able to build the project with the latest commit of the master branch

2022-09-28 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610617#comment-17610617
 ] 

Antoine Pitrou commented on ARROW-17881:


If you're using Homebrew, this is because of 
https://github.com/Homebrew/homebrew-core/issues/111810

In any case, I recommend passing {{-DGTest_SOURCE=BUNDLED}} to CMake so that 
GTest is built from source in C++17 mode.
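
A minimal configure sketch of that suggestion (the checkout path and the 
ARROW_BUILD_TESTS flag are illustrative assumptions; only -DGTest_SOURCE=BUNDLED 
comes from the recommendation above):
{code:bash}
# Build GTest from source as a bundled dependency so it is compiled with the
# same C++17 standard as Arrow, instead of linking the Homebrew binary.
cmake -S arrow/cpp -B arrow/cpp/build \
  -DCMAKE_BUILD_TYPE=Debug \
  -DARROW_BUILD_TESTS=ON \
  -DGTest_SOURCE=BUNDLED
cmake --build arrow/cpp/build
{code}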

> [C++] Not able to build the project with the latest commit of the master 
> branch
> ---
>
> Key: ARROW-17881
> URL: https://issues.apache.org/jira/browse/ARROW-17881
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Anirudh Acharya
>Priority: Major
>
> I am trying to build the arrow C++ project with the latest commit (9af43f11b) 
> from the master branch using this guide - 
> [https://arrow.apache.org/docs/developers/cpp/building.html] But the build 
> fails with the following error -
> {code:java}
> [ 58%] Linking CXX executable ../../debug/arrow-array-test
> Undefined symbols for architecture x86_64:
>   "testing::Matcher std::__1::char_traits > const&>::Matcher(char const*)", referenced from:
>       testing::Matcher std::__1::char_traits > const&> 
> testing::internal::MatcherCastImpl std::__1::char_traits > const&, char const*>::CastImpl(char 
> const* const&, std::__1::integral_constant, 
> std::__1::integral_constant) in array_test.cc.o
>       testing::Matcher std::__1::char_traits > const&> 
> testing::internal::MatcherCastImpl std::__1::char_traits > const&, char const*>::CastImpl(char 
> const* const&, std::__1::integral_constant, 
> std::__1::integral_constant) in array_binary_test.cc.o
> ld: symbol(s) not found for architecture x86_64
> clang-14: error: linker command failed with exit code 1 (use -v to see 
> invocation)
> make[2]: *** [src/arrow/CMakeFiles/arrow-array-test.dir/build.make:207: 
> debug/arrow-array-test] Error 1
> make[1]: *** [CMakeFiles/Makefile2:1653: 
> src/arrow/CMakeFiles/arrow-array-test.dir/all] Error 2
> make[1]: *** Waiting for unfinished jobs
> [ 58%] Building CXX object 
> src/arrow/CMakeFiles/arrow-table-test.dir/table_test.cc.o
> [ 58%] Building CXX object 
> src/parquet/CMakeFiles/parquet_objlib.dir/types.cc.o
> [ 58%] Building CXX object 
> src/arrow/CMakeFiles/arrow-table-test.dir/table_builder_test.cc.o
> [ 58%] Building CXX object 
> src/parquet/CMakeFiles/parquet_objlib.dir/level_comparison_avx2.cc.o
> [ 58%] Building CXX object 
> src/parquet/CMakeFiles/parquet_objlib.dir/level_conversion_bmi2.cc.o
> [ 58%] Building CXX object 
> src/parquet/CMakeFiles/parquet_objlib.dir/encryption/encryption_internal.cc.o
> [ 59%] Building CXX object 
> src/parquet/CMakeFiles/parquet_objlib.dir/encryption/crypto_factory.cc.o
> [ 60%] Linking CXX executable ../../debug/arrow-table-test
> [ 60%] Building CXX object 
> src/parquet/CMakeFiles/parquet_objlib.dir/encryption/file_key_unwrapper.cc.o
> [ 60%] Built target arrow-table-test
> [ 60%] Building CXX object 
> src/parquet/CMakeFiles/parquet_objlib.dir/encryption/file_key_wrapper.cc.o
> [ 60%] Building CXX object 
> src/parquet/CMakeFiles/parquet_objlib.dir/encryption/kms_client.cc.o
> [ 60%] Building CXX object 
> src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_material.cc.o
> [ 61%] Building CXX object 
> src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_metadata.cc.o
> [ 61%] Building CXX object 
> src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_toolkit.cc.o
> [ 61%] Building CXX object 
> src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_toolkit_internal.cc.o
> [ 61%] Building CXX object 
> src/parquet/CMakeFiles/parquet_objlib.dir/encryption/local_wrap_kms_client.cc.o
> [ 61%] Built target parquet_objlib
> make: *** [Makefile:146: all] Error 2 {code}
>  
> I am compiling this on macOS Monterey Version 12.0.1, and the versions of GCC, 
> python and clang are as follows -
> {code:java}
> $ clang --version
> clang version 14.0.4
> Target: x86_64-apple-darwin21.1.0
> Thread model: posix
> $ python --version
> Python 3.9.13
> $ gcc --version
> Configured with: --prefix=/Library/Developer/CommandLineTools/usr 
> --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
> Apple clang version 12.0.5 (clang-1205.0.22.9)
> Target: x86_64-apple-darwin21.1.0
> Thread model: posix
> InstalledDir: /Library/Developer/CommandLineTools/usr/bin {code}
>  
> I see that there were nightly job failures for macOS that were reported in 
> the mailing list - 
> [https://lists.apache.org/thread/rrdwxw1st4vdcf3nh5nqfo16n3ymj90x] I am not 
> sure if this failure is related to the issue I am reporting.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17881) [C++] Not able to build the project with the latest commit of the master branch

2022-09-28 Thread Anirudh Acharya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Acharya updated ARROW-17881:

Description: 
I am trying to build the arrow C++ project with the latest commit (9af43f11b) 
from the master branch using this guide - 
[https://arrow.apache.org/docs/developers/cpp/building.html] But the build 
fails with the following error -
{code:java}
[ 58%] Linking CXX executable ../../debug/arrow-array-test
Undefined symbols for architecture x86_64:
  "testing::Matcher > const&>::Matcher(char const*)", referenced from:
      testing::Matcher > const&> 
testing::internal::MatcherCastImpl > const&, char const*>::CastImpl(char const* 
const&, std::__1::integral_constant, 
std::__1::integral_constant) in array_test.cc.o
      testing::Matcher > const&> 
testing::internal::MatcherCastImpl > const&, char const*>::CastImpl(char const* 
const&, std::__1::integral_constant, 
std::__1::integral_constant) in array_binary_test.cc.o
ld: symbol(s) not found for architecture x86_64
clang-14: error: linker command failed with exit code 1 (use -v to see 
invocation)
make[2]: *** [src/arrow/CMakeFiles/arrow-array-test.dir/build.make:207: 
debug/arrow-array-test] Error 1
make[1]: *** [CMakeFiles/Makefile2:1653: 
src/arrow/CMakeFiles/arrow-array-test.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs
[ 58%] Building CXX object 
src/arrow/CMakeFiles/arrow-table-test.dir/table_test.cc.o
[ 58%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/types.cc.o
[ 58%] Building CXX object 
src/arrow/CMakeFiles/arrow-table-test.dir/table_builder_test.cc.o
[ 58%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/level_comparison_avx2.cc.o
[ 58%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/level_conversion_bmi2.cc.o
[ 58%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/encryption_internal.cc.o
[ 59%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/crypto_factory.cc.o
[ 60%] Linking CXX executable ../../debug/arrow-table-test
[ 60%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/file_key_unwrapper.cc.o
[ 60%] Built target arrow-table-test
[ 60%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/file_key_wrapper.cc.o
[ 60%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/kms_client.cc.o
[ 60%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_material.cc.o
[ 61%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_metadata.cc.o
[ 61%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_toolkit.cc.o
[ 61%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_toolkit_internal.cc.o
[ 61%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/local_wrap_kms_client.cc.o
[ 61%] Built target parquet_objlib
make: *** [Makefile:146: all] Error 2 {code}
 

I am compiling this on macOS Monterey Version 12.0.1, and the versions of GCC, 
python and clang are as follows -
{code:java}
$ clang --version
clang version 14.0.4
Target: x86_64-apple-darwin21.1.0
Thread model: posix

$ python --version
Python 3.9.13

$ gcc --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr 
--with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 12.0.5 (clang-1205.0.22.9)
Target: x86_64-apple-darwin21.1.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin {code}
 

I see that there were nightly job failures for macOS that were reported in the 
mailing list - 
[https://lists.apache.org/thread/rrdwxw1st4vdcf3nh5nqfo16n3ymj90x] I am not 
sure if this failure is related to the issue I am reporting.


[jira] [Created] (ARROW-17881) [C++] Not able to build the project with the latest commit of the master branch

2022-09-28 Thread Anirudh Acharya (Jira)
Anirudh Acharya created ARROW-17881:
---

 Summary: [C++] Not able to build the project with the latest 
commit of the master branch
 Key: ARROW-17881
 URL: https://issues.apache.org/jira/browse/ARROW-17881
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Anirudh Acharya


I am trying to build the arrow C++ project with the latest commit (9af43f11b) 
from the master branch using this guide - 
[https://arrow.apache.org/docs/developers/cpp/building.html] But the build 
fails with the following error -
{code:java}
[ 58%] Linking CXX executable ../../debug/arrow-array-test
Undefined symbols for architecture x86_64:
  "testing::Matcher > const&>::Matcher(char const*)", referenced from:
      testing::Matcher > const&> 
testing::internal::MatcherCastImpl > const&, char const*>::CastImpl(char const* 
const&, std::__1::integral_constant, 
std::__1::integral_constant) in array_test.cc.o
      testing::Matcher > const&> 
testing::internal::MatcherCastImpl > const&, char const*>::CastImpl(char const* 
const&, std::__1::integral_constant, 
std::__1::integral_constant) in array_binary_test.cc.o
ld: symbol(s) not found for architecture x86_64
clang-14: error: linker command failed with exit code 1 (use -v to see 
invocation)
make[2]: *** [src/arrow/CMakeFiles/arrow-array-test.dir/build.make:207: 
debug/arrow-array-test] Error 1
make[1]: *** [CMakeFiles/Makefile2:1653: 
src/arrow/CMakeFiles/arrow-array-test.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs
[ 58%] Building CXX object 
src/arrow/CMakeFiles/arrow-table-test.dir/table_test.cc.o
[ 58%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/types.cc.o
[ 58%] Building CXX object 
src/arrow/CMakeFiles/arrow-table-test.dir/table_builder_test.cc.o
[ 58%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/level_comparison_avx2.cc.o
[ 58%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/level_conversion_bmi2.cc.o
[ 58%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/encryption_internal.cc.o
[ 59%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/crypto_factory.cc.o
[ 60%] Linking CXX executable ../../debug/arrow-table-test
[ 60%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/file_key_unwrapper.cc.o
[ 60%] Built target arrow-table-test
[ 60%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/file_key_wrapper.cc.o
[ 60%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/kms_client.cc.o
[ 60%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_material.cc.o
[ 61%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_metadata.cc.o
[ 61%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_toolkit.cc.o
[ 61%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_toolkit_internal.cc.o
[ 61%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/local_wrap_kms_client.cc.o
[ 61%] Built target parquet_objlib
make: *** [Makefile:146: all] Error 2 {code}
 

I am compiling this on macOS Monterey Version 12.0.1, and the versions of GCC, 
python and clang are as follows -
{code:java}
$ clang --version
clang version 14.0.4
Target: x86_64-apple-darwin21.1.0
Thread model: posix
InstalledDir: /Users/anirudhacharya/miniconda3/envs/pyarrow-dev/bin

$ python --version
Python 3.9.13

$ gcc --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr 
--with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 12.0.5 (clang-1205.0.22.9)
Target: x86_64-apple-darwin21.1.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin {code}
 

I see that there were nightly job failures for macOS that were reported in the 
mailing list - 
[https://lists.apache.org/thread/rrdwxw1st4vdcf3nh5nqfo16n3ymj90x] I am not 
sure if this failure is related to the issue I am reporting.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17850) [Java] Upgrade netty-codec-http dependencies

2022-09-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17850:
---
Labels: pull-request-available  (was: )

> [Java] Upgrade netty-codec-http dependencies
> 
>
> Key: ARROW-17850
> URL: https://issues.apache.org/jira/browse/ARROW-17850
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 9.0.0
>Reporter: Hui Yu
>Assignee: David Dali Susanibar Arce
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [CVE-2022-24823]([https://github.com/advisories/GHSA-269q-hmxg-m83q]) reports 
> a security vulnerability in *netty-codec-http*.
> The version of *netty-codec-http* currently on the master branch is 
> *4.1.72.Final*, which is unsafe.
> The ticket https://issues.apache.org/jira/browse/ARROW-16996 bumped 
> *netty-codec* to {*}4.1.78.Final{*}, but it didn't bump *netty-codec-http*.
> Can you upgrade the version of *netty-codec-http*? 
>  
> Here is my output of mvn dependency:tree now:
> ```bash
> [INFO] +- org.apache.arrow:flight-core:jar:9.0.0:compile
> [INFO] |  +- io.grpc:grpc-netty:jar:1.47.0:compile
> [INFO] |  |  +- io.netty:netty-codec-http2:jar:4.1.72.Final:compile
> [INFO] |  |  |  - io.netty:{*}netty-codec-http{*}:jar:4.1.72.Final:compile
> [INFO] |  |  +- io.netty:netty-handler-proxy:jar:4.1.72.Final:runtime
> [INFO] |  |  |  - io.netty:netty-codec-socks:jar:4.1.72.Final:runtime
> [INFO] |  |  +- 
> com.google.errorprone:error_prone_annotations:jar:2.10.0:compile
> [INFO] |  |  +- io.perfmark:perfmark-api:jar:0.25.0:runtime
> [INFO] |  |  - 
> io.netty:netty-transport-native-unix-common:jar:4.1.72.Final:compile
> [INFO] |  +- io.grpc:grpc-core:jar:1.47.0:compile
> [INFO] |  |  +- com.google.android:annotations:jar:4.1.1.4:runtime
> [INFO] |  |  - org.codehaus.mojo:animal-sniffer-annotations:jar:1.19:runtime
> [INFO] |  +- io.grpc:grpc-context:jar:1.47.0:compile
> [INFO] |  +- io.grpc:grpc-protobuf:jar:1.47.0:compile
> [INFO] |  |  +- 
> com.google.api.grpc:proto-google-common-protos:jar:2.0.1:compile
> [INFO] |  |  - io.grpc:grpc-protobuf-lite:jar:1.47.0:compile
> [INFO] |  +- io.netty:netty-tcnative-boringssl-static:jar:2.0.53.Final:compile
> [INFO] |  |  +- io.netty:netty-tcnative-classes:jar:2.0.53.Final:compile
> [INFO] |  |  +- 
> io.netty:netty-tcnative-boringssl-static:jar:linux-x86_64:2.0.53.Final:compile
> [INFO] |  |  +- 
> io.netty:netty-tcnative-boringssl-static:jar:linux-aarch_64:2.0.53.Final:compile
> [INFO] |  |  +- 
> io.netty:netty-tcnative-boringssl-static:jar:osx-x86_64:2.0.53.Final:compile
> [INFO] |  |  +- 
> io.netty:netty-tcnative-boringssl-static:jar:osx-aarch_64:2.0.53.Final:compile
> [INFO] |  |  - 
> io.netty:netty-tcnative-boringssl-static:jar:windows-x86_64:2.0.53.Final:compile
> [INFO] |  +- io.netty:netty-handler:jar:4.1.78.Final:compile
> [INFO] |  |  +- io.netty:netty-resolver:jar:4.1.78.Final:compile
> [INFO] |  |  - io.netty:netty-codec:jar:4.1.78.Final:compile
> [INFO] |  +- io.netty:netty-transport:jar:4.1.78.Final:compile
> [INFO] |  +- com.google.guava:guava:jar:30.1.1-jre:compile
> [INFO] |  |  +- com.google.guava:failureaccess:jar:1.0.1:compile
> [INFO] |  |  +- 
> com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava:compile
> [INFO] |  |  +- org.checkerframework:checker-qual:jar:3.8.0:compile
> [INFO] |  |  - com.google.j2objc:j2objc-annotations:jar:1.3:compile
> [INFO] |  +- io.grpc:grpc-stub:jar:1.47.0:compile
> [INFO] |  +- com.google.protobuf:protobuf-java:jar:3.21.2:compile
> [INFO] |  +- io.grpc:grpc-api:jar:1.47.0:compile
> [INFO] |  - javax.annotation:javax.annotation-api:jar:1.3.2:compile
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17865) [Java] Deprecate Plasma JNI bindings

2022-09-28 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-17865.
--
Resolution: Fixed

Issue resolved by pull request 14262
[https://github.com/apache/arrow/pull/14262]

> [Java] Deprecate Plasma JNI bindings
> 
>
> Key: ARROW-17865
> URL: https://issues.apache.org/jira/browse/ARROW-17865
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Antoine Pitrou
>Assignee: David Dali Susanibar Arce
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17867) [C++][FlightRPC] Expose bulk parameter binding in Flight SQL client

2022-09-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17867:
---
Labels: pull-request-available  (was: )

> [C++][FlightRPC] Expose bulk parameter binding in Flight SQL client
> ---
>
> Key: ARROW-17867
> URL: https://issues.apache.org/jira/browse/ARROW-17867
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Also fix various issues noticed as part of ARROW-17661



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17880) [Go] Add support for Decimal types in go/arrow/csv

2022-09-28 Thread Mitchell Devenport (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitchell Devenport updated ARROW-17880:
---
Summary: [Go] Add support for Decimal types in go/arrow/csv  (was: Add 
support for Decimal types in go/arrow/csv)

> [Go] Add support for Decimal types in go/arrow/csv
> --
>
> Key: ARROW-17880
> URL: https://issues.apache.org/jira/browse/ARROW-17880
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Mitchell Devenport
>Priority: Major
>
> The Go CSV library lacks support for Decimal types which are supported by the 
> C++ CSV library:
> [arrow/writer.cc at master · apache/arrow 
> (github.com)|https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L378]
> [arrow/type_traits.h at master · apache/arrow 
> (github.com)|https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L642]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17880) Add support for Decimal types in go/arrow/csv

2022-09-28 Thread Mitchell Devenport (Jira)
Mitchell Devenport created ARROW-17880:
--

 Summary: Add support for Decimal types in go/arrow/csv
 Key: ARROW-17880
 URL: https://issues.apache.org/jira/browse/ARROW-17880
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Mitchell Devenport


The Go CSV library lacks support for Decimal types which are supported by the 
C++ CSV library:
[arrow/writer.cc at master · apache/arrow 
(github.com)|https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L378]
[arrow/type_traits.h at master · apache/arrow 
(github.com)|https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L642]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17879) [R] Intermittent memory leaks in the valgrind nightly test

2022-09-28 Thread Dewey Dunnington (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dewey Dunnington reassigned ARROW-17879:


Assignee: Dewey Dunnington

> [R] Intermittent memory leaks in the valgrind nightly test
> --
>
> Key: ARROW-17879
> URL: https://issues.apache.org/jira/browse/ARROW-17879
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Dewey Dunnington
>Assignee: Dewey Dunnington
>Priority: Major
> Fix For: 10.0.0
>
>
> The memory leaks that were fixed by a workaround before the last release 
> (ARROW-17252) are present again. I had hoped that the improvements to the 
> captured R thread infrastructure in ARROW-11841 and ARROW-17178 would fix 
> this; however, they don't (and it's not even clear that the failures are 
> related to that, since as part of diagnosing those failures the last time I 
> disabled the safe call infrastructure completely and was still able to 
> observe failures).
> These failures need to be debugged before the release!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17879) [R] Intermittent memory leaks in the valgrind nightly test

2022-09-28 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-17879:


 Summary: [R] Intermittent memory leaks in the valgrind nightly test
 Key: ARROW-17879
 URL: https://issues.apache.org/jira/browse/ARROW-17879
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Dewey Dunnington
 Fix For: 10.0.0


The memory leaks that were fixed by a workaround before the last release 
(ARROW-17252) are present again. I had hoped that the improvements to the 
captured R thread infrastructure in ARROW-11841 and ARROW-17178 would fix this; 
however, they don't (and it's not even clear that the failures are related to 
that, since as part of diagnosing those failures the last time I disabled the 
safe call infrastructure completely and was still able to observe failures).

These failures need to be debugged before the release!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17878) [Website] Exclude Ballista docs from being deleted

2022-09-28 Thread Andy Grove (Jira)
Andy Grove created ARROW-17878:
--

 Summary: [Website] Exclude Ballista docs from being deleted
 Key: ARROW-17878
 URL: https://issues.apache.org/jira/browse/ARROW-17878
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Andy Grove


Exclude Ballista docs from being deleted



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17850) [Java] Upgrade netty-codec-http dependencies

2022-09-28 Thread David Dali Susanibar Arce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610595#comment-17610595
 ] 

David Dali Susanibar Arce commented on ARROW-17850:
---

Updated to:
{code:java}
$ mvn dependency:tree --debug | grep netty-codec-http

[DEBUG]       io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version 
managed from 4.1.77.Final)
[DEBUG]          io.netty:netty-codec-http:jar:4.1.82.Final:compile (version 
managed from 4.1.82.Final)
[DEBUG]       io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version 
managed from 4.1.77.Final)
[DEBUG]          io.netty:netty-codec-http:jar:4.1.82.Final:compile (version 
managed from 4.1.82.Final)
[INFO] |  +- io.netty:netty-codec-http2:jar:4.1.82.Final:compile
[INFO] |  |  \- io.netty:netty-codec-http:jar:4.1.82.Final:compile
[DEBUG]          io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version 
managed from 4.1.77.Final)
[DEBUG]             io.netty:netty-codec-http:jar:4.1.82.Final:compile (version 
managed from 4.1.82.Final)
[DEBUG]          io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version 
managed from 4.1.77.Final)
[DEBUG]             io.netty:netty-codec-http:jar:4.1.82.Final:compile (version 
managed from 4.1.82.Final)
[INFO] |  |  +- io.netty:netty-codec-http2:jar:4.1.82.Final:compile
[INFO] |  |  |  \- io.netty:netty-codec-http:jar:4.1.82.Final:compile
[DEBUG]          io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version 
managed from 4.1.77.Final)
[DEBUG]             io.netty:netty-codec-http:jar:4.1.82.Final:compile (version 
managed from 4.1.82.Final)
[DEBUG]          io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version 
managed from 4.1.77.Final)
[DEBUG]             io.netty:netty-codec-http:jar:4.1.82.Final:compile (version 
managed from 4.1.82.Final)
[INFO] |  |  +- io.netty:netty-codec-http2:jar:4.1.82.Final:compile
[INFO] |  |  |  \- io.netty:netty-codec-http:jar:4.1.82.Final:compile
[DEBUG]          io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version 
managed from 4.1.77.Final)
[DEBUG]             io.netty:netty-codec-http:jar:4.1.82.Final:compile (version 
managed from 4.1.82.Final)
[DEBUG]          io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version 
managed from 4.1.77.Final)
[DEBUG]             io.netty:netty-codec-http:jar:4.1.82.Final:compile (version 
managed from 4.1.82.Final)
[INFO] |  |  +- io.netty:netty-codec-http2:jar:4.1.82.Final:compile
[INFO] |  |  |  \- io.netty:netty-codec-http:jar:4.1.82.Final:compile
[DEBUG]          io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version 
managed from 4.1.77.Final)
[DEBUG]             io.netty:netty-codec-http:jar:4.1.82.Final:compile (version 
managed from 4.1.82.Final)
[DEBUG]          io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version 
managed from 4.1.77.Final)
[DEBUG]             io.netty:netty-codec-http:jar:4.1.82.Final:compile (version 
managed from 4.1.82.Final)
[INFO] |  |  +- io.netty:netty-codec-http2:jar:4.1.82.Final:compile
[INFO] |  |  |  \- io.netty:netty-codec-http:jar:4.1.82.Final:compile {code}
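
As a quick sanity check, a hedged sketch (the dependency:tree include filter is 
standard maven-dependency-plugin usage, not the exact command used above):
{code:bash}
# Print only the netty-codec-http(2) nodes of the tree; any remaining
# 4.1.72.Final entry would point at a dependency path that was not bumped.
mvn dependency:tree -Dincludes='io.netty:netty-codec-http*'
{code}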

> [Java] Upgrade netty-codec-http dependencies
> 
>
> Key: ARROW-17850
> URL: https://issues.apache.org/jira/browse/ARROW-17850
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 9.0.0
>Reporter: Hui Yu
>Assignee: David Dali Susanibar Arce
>Priority: Major
>
> [CVE-2022-24823]([https://github.com/advisories/GHSA-269q-hmxg-m83q]) reports 
> a security vulnerability in *netty-codec-http*.
> The version of *netty-codec-http* currently on the master branch is 
> *4.1.72.Final*, which is unsafe.
> The ticket https://issues.apache.org/jira/browse/ARROW-16996 bumped 
> *netty-codec* to {*}4.1.78.Final{*}, but it didn't bump *netty-codec-http*.
> Can you upgrade the version of *netty-codec-http*? 
>  
> Here is my output of mvn dependency:tree now:
> ```bash
> [INFO] +- org.apache.arrow:flight-core:jar:9.0.0:compile
> [INFO] |  +- io.grpc:grpc-netty:jar:1.47.0:compile
> [INFO] |  |  +- io.netty:netty-codec-http2:jar:4.1.72.Final:compile
> [INFO] |  |  |  - io.netty:{*}netty-codec-http{*}:jar:4.1.72.Final:compile
> [INFO] |  |  +- io.netty:netty-handler-proxy:jar:4.1.72.Final:runtime
> [INFO] |  |  |  - io.netty:netty-codec-socks:jar:4.1.72.Final:runtime
> [INFO] |  |  +- 
> com.google.errorprone:error_prone_annotations:jar:2.10.0:compile
> [INFO] |  |  +- io.perfmark:perfmark-api:jar:0.25.0:runtime
> [INFO] |  |  - 
> io.netty:netty-transport-native-unix-common:jar:4.1.72.Final:compile
> [INFO] |  +- io.grpc:grpc-core:jar:1.47.0:compile
> [INFO] |  |  +- com.google.android:annotations:jar:4.1.1.4:runtime
> [INFO] |  |  - org.codehaus.mojo:animal-sniffer-annotations:jar:1.19:runtime
> [INFO] |  +- io.grpc:grpc-context:jar:1.47.0:compile
> [

[jira] [Resolved] (ARROW-17875) [C++] Remove assorted pre-C++17 compatibility measures

2022-09-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-17875.

Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 14263
[https://github.com/apache/arrow/pull/14263]

> [C++] Remove assorted pre-C++17 compatibility measures
> --
>
> Key: ARROW-17875
> URL: https://issues.apache.org/jira/browse/ARROW-17875
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Some assorted pre-C++17 compatibility measures remain in the code base.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610581#comment-17610581
 ] 

Antoine Pitrou edited comment on ARROW-17872 at 9/28/22 1:46 PM:
-

A build that takes 60 minutes or more is horrible for developer experience. So 
I would suggest disabling Gandiva and S3 support on all our PR-based macOS 
builds (and update the brew files to remove/disable the corresponding 
third-party deps).

Do you want to take this [~assignUser]?


was (Author: pitrou):
A build that takes 60 seconds or more is horrible for developer experience. So 
I would suggest disabling Gandiva and S3 support on all our PR-based macOS 
builds (and update the brew files to remove/disable the corresponding 
third-party deps).

Do you want to take this [~assignUser]?

> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on Github Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610581#comment-17610581
 ] 

Antoine Pitrou edited comment on ARROW-17872 at 9/28/22 1:46 PM:
-

A build that takes 60 seconds or more is horrible for developer experience. So 
I would suggest disabling Gandiva and S3 support on all our PR-based macOS 
builds (and update the brew files to remove/disable the corresponding 
third-party deps).

Do you want to take this [~assignUser]?


was (Author: pitrou):
A build that takes 60 seconds or more is horrible for developer experience. So 
I would suggest disabling Gandiva and S3 support on all our PR-based macOS 
builds.


> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on Github Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610581#comment-17610581
 ] 

Antoine Pitrou commented on ARROW-17872:


A build that takes 60 seconds or more is horrible for developer experience. So 
I would suggest disabling Gandiva and S3 support on all our PR-based macOS 
builds.


> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on Github Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17556) [C++] Unbound scan projection expression leads to all fields being loaded

2022-09-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17556:
---
Labels: pull-request-available  (was: )

> [C++] Unbound scan projection expression leads to all fields being loaded
> -
>
> Key: ARROW-17556
> URL: https://issues.apache.org/jira/browse/ARROW-17556
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If a projection expression is unbound then we should bind it to the 
> (augmented) dataset schema and carry on.  Instead it appears we are 
> interpreting "unbound expression" as "nothing set at all" and loading all 
> fields.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17854) [CI][Developer] Host preview docs on S3

2022-09-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-17854.

Resolution: Fixed

Issue resolved by pull request 14247
[https://github.com/apache/arrow/pull/14247]

> [CI][Developer] Host preview docs on S3
> ---
>
> Key: ARROW-17854
> URL: https://issues.apache.org/jira/browse/ARROW-17854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools
>Reporter: Jacob Wujciak-Jens
>Assignee: Jacob Wujciak-Jens
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Hosting on Github Pages as implemented in [ARROW-12958] is unsustainable due 
> to the size of the arrow docs (~ 200mb).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Jacob Wujciak-Jens (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610573#comment-17610573
 ] 

Jacob Wujciak-Jens commented on ARROW-17872:


bq. 10 minutes for extracting 1.5GB seems quite unexpected

I have checked in detail and each of the bigger dependencies (aws, llvm, boost) 
takes 2-3 minutes to "pour", so the speeds are OK, I would say. It just adds up 
to a lot overall, and I don't see a cache really speeding that up. 

The timeout is set to 60 minutes, so we could just raise that limit if it is no 
longer adequate for the current build complexity (or, as you said, remove 
features). The build should already be using all 3 available cores.

> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on Github Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17877) [CI][Python] verify-rc python nightly builds fail due to missing some flags that were activated with ARROW_PYTHON=ON

2022-09-28 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610566#comment-17610566
 ] 

Raúl Cumplido commented on ARROW-17877:
---

[~kou] what do you think: when we are also building pyarrow, should we 
individually enable, in verify-release-candidate.sh and the other jobs, all 
the flags that `ARROW_PYTHON=ON` used to activate:
{code:java}
if(ARROW_PYTHON)
  set(ARROW_COMPUTE ON)
  set(ARROW_CSV ON)
  set(ARROW_DATASET ON)
  set(ARROW_FILESYSTEM ON)
  set(ARROW_HDFS ON)
  set(ARROW_JSON ON)
endif() {code}
or should we create some CMake group flag that enables all the requirements for 
pyarrow?

There are still quite a lot of occurrences of this flag:
{code:java}
$ grep -r "ARROW_PYTHON=ON" | wc -l
22 {code}
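
A minimal sketch of the first option (the flag list is copied from the snippet 
above; the plain cmake invocation is illustrative, not the exact verify-rc 
command line):
{code:bash}
# Enable individually what ARROW_PYTHON=ON used to imply, when building the
# C++ library that pyarrow will be compiled against.
cmake ../cpp \
  -DARROW_COMPUTE=ON \
  -DARROW_CSV=ON \
  -DARROW_DATASET=ON \
  -DARROW_FILESYSTEM=ON \
  -DARROW_HDFS=ON \
  -DARROW_JSON=ON
{code}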

> [CI][Python] verify-rc python nightly builds fail due to missing some flags 
> that were activated with ARROW_PYTHON=ON
> 
>
> Key: ARROW-17877
> URL: https://issues.apache.org/jira/browse/ARROW-17877
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Blocker
>  Labels: Nightly
>
> Some of our nightly builds are failing with:
> {code:java}
>  [ 35%] Building CXX object CMakeFiles/_dataset.dir/_dataset.cpp.o
> /arrow/python/build/temp.linux-x86_64-cpython-38/_dataset.cpp:833:10: fatal 
> error: arrow/csv/api.h: No such file or directory
>  #include "arrow/csv/api.h"
>           ^
> compilation terminated.{code}
> I suspect the changes here, which removed flags like CSV=ON that used to be 
> implied when building with PYTHON=ON, might be related: 
> [https://github.com/apache/arrow/commit/53ac2a00aa9ff199773513f6f996f73a07b37989]
> Example of nightly failures:
> https://github.com/ursacomputing/crossbow/actions/runs/3135833175/jobs/5091988801



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17877) [CI][Python] verify-rc python nightly builds fail due to missing some flags that were activated with ARROW_PYTHON=ON

2022-09-28 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-17877:
--
Summary: [CI][Python] verify-rc python nightly builds fail due to missing 
some flags that were activated with ARROW_PYTHON=ON  (was: [CI][Python] 
verify-rc python nightly builds fail due to missing arrow/csv/api.h)

> [CI][Python] verify-rc python nightly builds fail due to missing some flags 
> that were activated with ARROW_PYTHON=ON
> 
>
> Key: ARROW-17877
> URL: https://issues.apache.org/jira/browse/ARROW-17877
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Blocker
>  Labels: Nightly
>
> Some of our nightly builds are failing with:
> {code:java}
>  [ 35%] Building CXX object CMakeFiles/_dataset.dir/_dataset.cpp.o
> /arrow/python/build/temp.linux-x86_64-cpython-38/_dataset.cpp:833:10: fatal 
> error: arrow/csv/api.h: No such file or directory
>  #include "arrow/csv/api.h"
>           ^
> compilation terminated.{code}
> I suspect the changes here, which removed flags like CSV=ON that used to be 
> implied when building with PYTHON=ON, might be related: 
> [https://github.com/apache/arrow/commit/53ac2a00aa9ff199773513f6f996f73a07b37989]
> Example of nightly failures:
> https://github.com/ursacomputing/crossbow/actions/runs/3135833175/jobs/5091988801



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17877) [CI][Python] verify-rc python nightly builds fail due to missing arrow/csv/api.h

2022-09-28 Thread Jira
Raúl Cumplido created ARROW-17877:
-

 Summary: [CI][Python] verify-rc python nightly builds fail due to 
missing arrow/csv/api.h
 Key: ARROW-17877
 URL: https://issues.apache.org/jira/browse/ARROW-17877
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, Python
Reporter: Raúl Cumplido
Assignee: Raúl Cumplido


Some of our nightly builds are failing with:
{code:java}
 [ 35%] Building CXX object CMakeFiles/_dataset.dir/_dataset.cpp.o
/arrow/python/build/temp.linux-x86_64-cpython-38/_dataset.cpp:833:10: fatal 
error: arrow/csv/api.h: No such file or directory
 #include "arrow/csv/api.h"
          ^
compilation terminated.{code}
I suspect the changes here, which removed flags like CSV=ON that used to be 
implied when building with PYTHON=ON, might be related: 
[https://github.com/apache/arrow/commit/53ac2a00aa9ff199773513f6f996f73a07b37989]

Example of nightly failures:

https://github.com/ursacomputing/crossbow/actions/runs/3135833175/jobs/5091988801



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610545#comment-17610545
 ] 

Antoine Pitrou commented on ARROW-17872:


We may perhaps want to disable some Arrow components on those macOS builds, 
unless there's another package manager that we can use?

[~kou] Do you know why we decided to use Homebrew for dependencies on macOS?

> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on Github Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610539#comment-17610539
 ] 

Antoine Pitrou edited comment on ARROW-17872 at 9/28/22 12:27 PM:
--

(and even 10 minutes for extracting 1.5GB  seems quite unexpected: that's only 
2.5 MB/s... so it's not a gzip problem but probably an IO/memory issue)


was (Author: pitrou):
(and even 10 minutes for extracting 1.5GB  seems quite unexpected)

> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on Github Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Jacob Wujciak-Jens (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610541#comment-17610541
 ] 

Jacob Wujciak-Jens commented on ARROW-17872:


relevant homebrew issue: https://github.com/Homebrew/brew/issues/13621

> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on Github Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610539#comment-17610539
 ] 

Antoine Pitrou commented on ARROW-17872:


(and even 10 minutes for extracting 1.5GB  seems quite unexpected)

> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on Github Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610538#comment-17610538
 ] 

Antoine Pitrou commented on ARROW-17872:


Ouch, LLVM can be heavy but 1.5GB sounds really outlandish.
(for comparison, the combined unpacked size for the conda-forge packages 
{{libllvm}}, {{llvm-tools}} and {{llvmdev}} is 500MB)

> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on Github Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Jacob Wujciak-Jens (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610540#comment-17610540
 ] 

Jacob Wujciak-Jens commented on ARROW-17872:


And we have LLVM 12 & 15, both of similar size (do we need both?); the aws sdk is 800M...
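
One hedged way to check that on a runner (standard Homebrew commands; which 
formulae actually pull in each LLVM there is the open question):
{code:bash}
# List the installed LLVM versions, then see which installed formulae
# depend on each of them.
brew list --formula | grep -E '^llvm(@[0-9]+)?$'
brew uses --installed llvm@12
brew uses --installed llvm@15
{code}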

> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on Github Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17876) [R][CI] Remove ubuntu-18.04 from nixlibs & prebuilt binaries

2022-09-28 Thread Jacob Wujciak-Jens (Jira)
Jacob Wujciak-Jens created ARROW-17876:
--

 Summary: [R][CI] Remove ubuntu-18.04 from nixlibs & prebuilt 
binaries
 Key: ARROW-17876
 URL: https://issues.apache.org/jira/browse/ARROW-17876
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Jacob Wujciak-Jens
 Fix For: 10.0.0


The new dts-compiled centos-7 binaries ([ARROW-17594]) should be able to 
replace the ubuntu-18.04 binaries.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17876) [R][CI] Remove ubuntu-18.04 from nixlibs & prebuilt binaries

2022-09-28 Thread Jacob Wujciak-Jens (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Wujciak-Jens updated ARROW-17876:
---
Priority: Critical  (was: Major)

> [R][CI] Remove ubuntu-18.04 from nixlibs & prebuilt binaries
> 
>
> Key: ARROW-17876
> URL: https://issues.apache.org/jira/browse/ARROW-17876
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Jacob Wujciak-Jens
>Priority: Critical
> Fix For: 10.0.0
>
>
> The new dts-compiled centos-7 binaries ([ARROW-17594]) should be able to 
> replace the ubuntu-18.04 binaries.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17875) [C++] Remove assorted pre-C++17 compatibility measures

2022-09-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17875:
---
Labels: pull-request-available  (was: )

> [C++] Remove assorted pre-C++17 compatibility measures
> --
>
> Key: ARROW-17875
> URL: https://issues.apache.org/jira/browse/ARROW-17875
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Some assorted pre-C++17 compatibility measures remain in the code base.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Jacob Wujciak-Jens (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610528#comment-17610528
 ] 

Jacob Wujciak-Jens edited comment on ARROW-17872 at 9/28/22 12:14 PM:
--

it looks like homebrew is using system tar to extract the gzipped bottles, 
maybe we can speed it up by symlinking in pigz to make use of the 3 cores the 
mac runners have...


was (Author: JIRAUSER287549):
it looks like homebrew is using system tar to extract the gzipped bottles, 
maybe we can speed it up by symlinking in pzip to make use of the 3 cores the 
mac runners have...
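
A rough sketch of the pigz idea (hypothetical: it assumes Homebrew's tar picks 
up a gzip found earlier on PATH, which is exactly what would need verifying):
{code:bash}
# Shadow gzip with pigz so tar's gzip filter can use the runner's 3 cores.
brew install pigz
mkdir -p "$HOME/bin"
ln -sf "$(command -v pigz)" "$HOME/bin/gzip"
export PATH="$HOME/bin:$PATH"
{code}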

> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on Github Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Jacob Wujciak-Jens (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610528#comment-17610528
 ] 

Jacob Wujciak-Jens commented on ARROW-17872:


it looks like homebrew is using system tar to extract the gzipped bottles, 
maybe we can speed it up by symlinking in pzip to make use of the 3 cores the 
mac runners have...

> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on Github Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17875) [C++] Remove assorted pre-C++17 compatibility measures

2022-09-28 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-17875:
--

 Summary: [C++] Remove assorted pre-C++17 compatibility measures
 Key: ARROW-17875
 URL: https://issues.apache.org/jira/browse/ARROW-17875
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Some assorted pre-C++17 compatibility measures remain in the code base.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Jacob Wujciak-Jens (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610524#comment-17610524
 ] 

Jacob Wujciak-Jens commented on ARROW-17872:


I have set up a test job with debug output to see what exactly is taking so 
long: 
https://github.com/assignUser/test-repo-a/actions/runs/3142905685/jobs/5107502078#step:4:392

If you turn on timestamps you can see that what takes the time is extracting 
the archives (e.g. llvm ~1.5G), not downloading them, so caching the {{brew 
--cache}} directory would not save significant time. As the cache is also 
tar'd, extracting the cache might just become the new bottleneck.

> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on Github Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17874) [Archery] C++ linting with --clang-format or archery lint --clang-tidy fails on M1

2022-09-28 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610520#comment-17610520
 ] 

Alenka Frim commented on ARROW-17874:
-

cc [~raulcd] 

> [Archery] C++ linting with --clang-format or archery lint --clang-tidy fails 
> on M1
> ---
>
> Key: ARROW-17874
> URL: https://issues.apache.org/jira/browse/ARROW-17874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery
>Reporter: Alenka Frim
>Priority: Major
>
> It seems there is some CMake target issue with the {{clang-format}} and 
> {{clang-tidy}} options when running {{archery lint}} on M1:
> {code:java}
> ...
> -- Build files have been written to: 
> /private/var/folders/gw/q7wqd4tx18n_9t4kbkd0bj1mgn/T/arrow-lint-g7drna9_/cpp-build
> ninja: error: unknown target 'check-format' {code}
> [https://gist.github.com/AlenkaF/f60e24549529cd096bc9c975bcb71179]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17874) [Archery] C++ linting with --clang-format or archery lint --clang-tidy fails on M1

2022-09-28 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-17874:
---

 Summary: [Archery] C++ linting with --clang-format or archery 
lint --clang-tidy fails on M1
 Key: ARROW-17874
 URL: https://issues.apache.org/jira/browse/ARROW-17874
 Project: Apache Arrow
  Issue Type: Bug
  Components: Archery
Reporter: Alenka Frim


It seems there is some CMake target issue with the {{clang-format}} and 
{{clang-tidy}} options when running {{archery lint}} on M1:

{code:java}
...
-- Build files have been written to: 
/private/var/folders/gw/q7wqd4tx18n_9t4kbkd0bj1mgn/T/arrow-lint-g7drna9_/cpp-build
ninja: error: unknown target 'check-format' {code}

[https://gist.github.com/AlenkaF/f60e24549529cd096bc9c975bcb71179]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17873) Writing Arrow Files using C#.

2022-09-28 Thread N Gautam Animesh (Jira)
N Gautam Animesh created ARROW-17873:


 Summary: Writing Arrow Files using C#.
 Key: ARROW-17873
 URL: https://issues.apache.org/jira/browse/ARROW-17873
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: N Gautam Animesh


I was working with Arrow in C# and wanted to know how to write an Arrow file 
using C#.

Do let me know if there is anything regarding this; I was not able to find 
anything on the internet.
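
In Python with pyarrow, the pattern would look like the minimal sketch below; 
I am looking for the equivalent in C# (the C# library appears to ship an 
analogous IPC file writer, but I could not confirm the exact API):
{code:java}
# A minimal pyarrow sketch of writing an Arrow IPC file; shown in Python
# only as a reference for the equivalent C# pattern.
import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, batch.schema) as writer:
        writer.write_batch(batch)
{code}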



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610482#comment-17610482
 ] 

Antoine Pitrou commented on ARROW-17872:


Here is an example job which timed out due to an overlong dependency step:
https://github.com/pitrou/arrow/actions/runs/3141950727/jobs/5104979517

> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on GitHub Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-13745) [CI][C++] conda python turbodbc nightly job failed

2022-09-28 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai closed ARROW-13745.

Resolution: Fixed

> [CI][C++] conda python turbodbc nightly job failed
> --
>
> Key: ARROW-13745
> URL: https://issues.apache.org/jira/browse/ARROW-13745
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Yibo Cai
>Priority: Major
>
> https://github.com/ursacomputing/crossbow/runs/3408001481#step:7:4473
> [NIGHTLY] Arrow Build Report for Job nightly-2021-08-24-0
> https://www.mail-archive.com/builds@arrow.apache.org/msg00109.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17855) [R] Simultaneous read-write operations causing file corruption.

2022-09-28 Thread N Gautam Animesh (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610466#comment-17610466
 ] 

N Gautam Animesh commented on ARROW-17855:
--

Yes, when someone is writing to the file and we are reading it simultaneously, 
the file gets corrupted.
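
One common workaround is to write to a temporary file and atomically rename it 
into place, so a concurrent reader only ever sees a complete file. Below is a 
minimal Python sketch of the pattern (a general filesystem technique, not an 
Arrow API guarantee; the R arrow package could follow the same 
write-then-rename approach):
{code:java}
# Write the table to a temporary file in the destination directory,
# then atomically swap it into place with os.replace.
import os
import tempfile

import pyarrow as pa
import pyarrow.feather as feather

def write_feather_atomically(table: pa.Table, path: str) -> None:
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".arrow")
    os.close(fd)
    try:
        feather.write_feather(table, tmp)
        os.replace(tmp, path)  # atomic within a single filesystem
    except BaseException:
        os.remove(tmp)
        raise
{code}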

> [R] Simultaneous read-write operations causing file corruption.
> ---
>
> Key: ARROW-17855
> URL: https://issues.apache.org/jira/browse/ARROW-17855
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: N Gautam Animesh
>Priority: Major
>
> Use case: I was trying to simultaneously read and write an Arrow file, which in 
> turn gave me an error; it is leading to file corruption. I am currently using 
> the read_feather and write_feather functions to save it as a .arrow file. Do let 
> me know if there's anything in this regard or any other way to avoid this. 
> [Error: Invalid: Not an Arrow file]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds

2022-09-28 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610450#comment-17610450
 ] 

Antoine Pitrou commented on ARROW-17872:


[~assignUser] Do you think that's reasonably doable?

> [CI] Cache dependencies on macOS builds
> ---
>
> Key: ARROW-17872
> URL: https://issues.apache.org/jira/browse/ARROW-17872
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration, GLib, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Our macOS CI builds on GitHub Actions usually take at least 10 minutes 
> installing dependencies from Homebrew (because of compiling from source?). It 
> would be nice to cache those, especially as they probably don't change often.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

