[jira] [Commented] (ARROW-17839) [Python] Cannot create RecordBatch with nested struct containing extension type
[ https://issues.apache.org/jira/browse/ARROW-17839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610822#comment-17610822 ] Matthias Vallentin commented on ARROW-17839: Thanks for the pointer, [~jorisvandenbossche]. Glad to see that a fix is underway. Would you mind pointing me to instructions on how to run the test that you performed? I am using Poetry and couldn't get the branch to compile. In theory, I thought this should do the trick: {code:java} [tool.poetry.dependencies] #pyarrow = "^9.0" pyarrow = { git = "https://github.com/milesgranger/arrow.git", branch = "ARROW-15545_cast-of-extension-types", subdirectory = "python" }{code} But this fails to compile due to missing dependencies. (I managed to work around the OpenSSL issue by providing the right env var, but now I'm stuck with Flight not being found.) I was hoping that there is some sort of dev guide that shows how to get going. > [Python] Cannot create RecordBatch with nested struct containing extension > type > --- > > Key: ARROW-17839 > URL: https://issues.apache.org/jira/browse/ARROW-17839 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 9.0.0 > Environment: macOS 12.5.1 on an Apple M1 Ultra. >Reporter: Matthias Vallentin >Priority: Blocker > Attachments: example.py > > > I'm running into the following issue: > {code:java} > pyarrow.lib.ArrowNotImplementedError: Unsupported cast to > extension> from fixed_size_binary[16]{code} > Use case: I want to create a record batch that contains this type: > {code:java} > pa.struct([("address", AddressType()), ("length", pa.uint8())]){code} > Here, {{AddressType}} is an extension type that models an IP address > ({{pa.binary(16)}}). > Please find attached a self-contained example that illustrates the issue. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17868) [C++][Python] Keep and deprecate ARROW_PYTHON CMake option for backward compatibility
[ https://issues.apache.org/jira/browse/ARROW-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17868: --- Labels: pull-request-available (was: ) > [C++][Python] Keep and deprecate ARROW_PYTHON CMake option for backward > compatibility > - > > Key: ARROW-17868 > URL: https://issues.apache.org/jira/browse/ARROW-17868 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > ARROW-6858 removed the {{ARROW_PYTHON}} CMake option because ARROW-16340 moved > {{cpp/src/arrow/python/}} to {{python/pyarrow/src/}}. But this broke backward > compatibility: users who used {{-DARROW_PYTHON=ON}} now need to add > {{-DARROW_CSV=ON}}, {{-DARROW_DATASET=ON}} and so on manually. > See also: https://github.com/apache/arrow/pull/14224#discussion_r981399130 > {quote} > FWIW this broke my local development because of no longer including those > (although I should probably start using presets ..) > Now, it's probably fine to remove this now Python C++ has moved, but we do > assume that some C++ modules are built on the pyarrow side (eg we assume that > CSV is always built, while with the above change you need to ensure manually > that this is done in your cmake call). > In any case we should update the documentation at > https://arrow.apache.org/docs/dev/developers/python.html#build-and-test to > indicate that there are a few components required to be able to build pyarrow. > {quote} > Eventually, we can remove the {{ARROW_PYTHON}} CMake option, but we should provide > a deprecation period before doing so. > We should also mention that {{ARROW_PYTHON}} is deprecated in our > documentation ( > https://arrow.apache.org/docs/dev/developers/python.html#build-and-test ). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17886) [R] Convert schema to the corresponding ptype (zero-row data frame)?
Kirill Müller created ARROW-17886: - Summary: [R] Convert schema to the corresponding ptype (zero-row data frame)? Key: ARROW-17886 URL: https://issues.apache.org/jira/browse/ARROW-17886 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Kirill Müller When fetching data e.g. from a RecordBatchReader, I would like to know, ahead of time, what the data will look like after it's converted to a data frame. I have found a way using utils::head(0), but I'm not sure if it's efficient in all scenarios. My use case is the Arrow extension to DBI, in particular the default implementation for drivers that don't speak Arrow yet. I'd like to know which types the columns should have on the database. I can already infer this from the corresponding R types, but those existing drivers don't know about Arrow types. Should we support as.data.frame() for schema objects? The semantics would be to return a zero-row data frame with correct column names and types. library(arrow) #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information. #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp data <- data.frame( a = 1:3, b = 2.5, c = "three", stringsAsFactors = FALSE ) data$d <- blob::blob(as.raw(1:10)) tbl <- arrow::as_arrow_table(data) rbr <- arrow::as_record_batch_reader(tbl) tibble::as_tibble(head(rbr, 0)) #> # A tibble: 0 × 4 #> # … with 4 variables: a <int>, b <dbl>, c <chr>, d <blob> rbr$read_table() #> Table #> 3 rows x 4 columns #> $a <int32> #> $b <double> #> $c <string> #> $d <binary> #> #> See $metadata for additional Schema metadata -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17885) [R] Return BLOB data as list of raw instead of a list of integers
[ https://issues.apache.org/jira/browse/ARROW-17885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kirill Müller updated ARROW-17885: -- Description: BLOBs should be mapped to lists of raw in R, not lists of integer. Tested with ec714db3995549309b987fc8112db98bb93102d0. library(arrow) #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information. #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp data <- data.frame( a = 1:3, b = 2.5, c = "three", stringsAsFactors = FALSE ) data$d <- blob::blob(as.raw(1:10)) tbl <- arrow::as_arrow_table(data) rbr <- arrow::as_record_batch_reader(tbl) waldo::compare(as.data.frame(rbr$read_next_batch()), data) #> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...) #> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...) #> #> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...) #> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...) #> #> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...) #> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...) was: BLOBs should be mapped to lists of raw in R, not lists of integer. Tested with ec714db3995549309b987fc8112db98bb93102d0. {{ library(arrow) #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information. #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp data <- data.frame( a = 1:3, b = 2.5, c = "three", stringsAsFactors = FALSE ) data$d <- blob::blob(as.raw(1:10)) tbl <- arrow::as_arrow_table(data) rbr <- arrow::as_record_batch_reader(tbl) waldo::compare(as.data.frame(rbr$read_next_batch()), data) #> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...) #> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...) #> #> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...) #> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...) 
#> #> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...) #> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...) }} > [R] Return BLOB data as list of raw instead of a list of integers > - > > Key: ARROW-17885 > URL: https://issues.apache.org/jira/browse/ARROW-17885 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 10.0.0, 9.0.1 > Environment: macOS, R 4.1.3 >Reporter: Kirill Müller >Priority: Minor > > BLOBs should be mapped to lists of raw in R, not lists of integer. Tested > with ec714db3995549309b987fc8112db98bb93102d0. > library(arrow) > #> Some features are not enabled in this build of Arrow. Run `arrow_info()` > for more information. > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > data <- data.frame( > a = 1:3, > b = 2.5, > c = "three", > stringsAsFactors = FALSE > ) > data$d <- blob::blob(as.raw(1:10)) > tbl <- arrow::as_arrow_table(data) > rbr <- arrow::as_record_batch_reader(tbl) > waldo::compare(as.data.frame(rbr$read_next_batch()), data) > #> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...) > #> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...) > #> > #> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...) > #> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...) > #> > #> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...) > #> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17885) [R] Return BLOB data as list of raw instead of a list of integers
[ https://issues.apache.org/jira/browse/ARROW-17885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kirill Müller updated ARROW-17885: -- Environment: macOS arm64, R 4.1.3 (was: macOS, R 4.1.3) > [R] Return BLOB data as list of raw instead of a list of integers > - > > Key: ARROW-17885 > URL: https://issues.apache.org/jira/browse/ARROW-17885 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 10.0.0, 9.0.1 > Environment: macOS arm64, R 4.1.3 >Reporter: Kirill Müller >Priority: Minor > > BLOBs should be mapped to lists of raw in R, not lists of integer. Tested > with ec714db3995549309b987fc8112db98bb93102d0. > library(arrow) > #> Some features are not enabled in this build of Arrow. Run `arrow_info()` > for more information. > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > data <- data.frame( > a = 1:3, > b = 2.5, > c = "three", > stringsAsFactors = FALSE > ) > data$d <- blob::blob(as.raw(1:10)) > tbl <- arrow::as_arrow_table(data) > rbr <- arrow::as_record_batch_reader(tbl) > waldo::compare(as.data.frame(rbr$read_next_batch()), data) > #> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...) > #> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...) > #> > #> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...) > #> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...) > #> > #> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...) > #> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17885) [R] Return BLOB data as list of raw instead of a list of integers
[ https://issues.apache.org/jira/browse/ARROW-17885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kirill Müller updated ARROW-17885: -- Description: BLOBs should be mapped to lists of raw in R, not lists of integer. Tested with ec714db3995549309b987fc8112db98bb93102d0. {{ library(arrow) #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information. #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp data <- data.frame( a = 1:3, b = 2.5, c = "three", stringsAsFactors = FALSE ) data$d <- blob::blob(as.raw(1:10)) tbl <- arrow::as_arrow_table(data) rbr <- arrow::as_record_batch_reader(tbl) waldo::compare(as.data.frame(rbr$read_next_batch()), data) #> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...) #> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...) #> #> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...) #> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...) #> #> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...) #> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...) }} was: BLOBs should be mapped to lists of raw in R, not lists of integer. Tested with ec714db3995549309b987fc8112db98bb93102d0. ``` r library(arrow) #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information. #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp data <- data.frame( a = 1:3, b = 2.5, c = "three", stringsAsFactors = FALSE ) data$d <- blob::blob(as.raw(1:10)) tbl <- arrow::as_arrow_table(data) rbr <- arrow::as_record_batch_reader(tbl) waldo::compare(as.data.frame(rbr$read_next_batch()), data) #> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...) #> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...) #> #> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...) #> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...) 
#> #> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...) #> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...) ``` Created on 2022-09-29 with [reprex v2.0.2](https://reprex.tidyverse.org) > [R] Return BLOB data as list of raw instead of a list of integers > - > > Key: ARROW-17885 > URL: https://issues.apache.org/jira/browse/ARROW-17885 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 10.0.0, 9.0.1 > Environment: macOS, R 4.1.3 >Reporter: Kirill Müller >Priority: Minor > > BLOBs should be mapped to lists of raw in R, not lists of integer. Tested > with ec714db3995549309b987fc8112db98bb93102d0. > {{ > library(arrow) > #> Some features are not enabled in this build of Arrow. Run `arrow_info()` > for more information. > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > data <- data.frame( > a = 1:3, > b = 2.5, > c = "three", > stringsAsFactors = FALSE > ) > data$d <- blob::blob(as.raw(1:10)) > tbl <- arrow::as_arrow_table(data) > rbr <- arrow::as_record_batch_reader(tbl) > waldo::compare(as.data.frame(rbr$read_next_batch()), data) > #> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...) > #> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...) > #> > #> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...) > #> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...) > #> > #> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...) > #> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...) > }} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17885) [R] Return BLOB data as list of raw instead of a list of integers
[ https://issues.apache.org/jira/browse/ARROW-17885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kirill Müller updated ARROW-17885: -- Summary: [R] Return BLOB data as list of raw instead of a list of integers (was: Return BLOB data as list of raw instead of a list of integers) > [R] Return BLOB data as list of raw instead of a list of integers > - > > Key: ARROW-17885 > URL: https://issues.apache.org/jira/browse/ARROW-17885 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 10.0.0, 9.0.1 > Environment: macOS, R 4.1.3 >Reporter: Kirill Müller >Priority: Minor > > BLOBs should be mapped to lists of raw in R, not lists of integer. Tested > with ec714db3995549309b987fc8112db98bb93102d0. > ``` r > library(arrow) > #> Some features are not enabled in this build of Arrow. Run `arrow_info()` > for more information. > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > data <- data.frame( > a = 1:3, > b = 2.5, > c = "three", > stringsAsFactors = FALSE > ) > data$d <- blob::blob(as.raw(1:10)) > tbl <- arrow::as_arrow_table(data) > rbr <- arrow::as_record_batch_reader(tbl) > waldo::compare(as.data.frame(rbr$read_next_batch()), data) > #> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...) > #> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...) > #> > #> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...) > #> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...) > #> > #> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...) > #> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...) > ``` > Created on 2022-09-29 with [reprex > v2.0.2](https://reprex.tidyverse.org) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17885) Return BLOB data as list of raw instead of a list of integers
Kirill Müller created ARROW-17885: - Summary: Return BLOB data as list of raw instead of a list of integers Key: ARROW-17885 URL: https://issues.apache.org/jira/browse/ARROW-17885 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 10.0.0, 9.0.1 Environment: macOS, R 4.1.3 Reporter: Kirill Müller BLOBs should be mapped to lists of raw in R, not lists of integer. Tested with ec714db3995549309b987fc8112db98bb93102d0. ``` r library(arrow) #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information. #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp data <- data.frame( a = 1:3, b = 2.5, c = "three", stringsAsFactors = FALSE ) data$d <- blob::blob(as.raw(1:10)) tbl <- arrow::as_arrow_table(data) rbr <- arrow::as_record_batch_reader(tbl) waldo::compare(as.data.frame(rbr$read_next_batch()), data) #> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...) #> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...) #> #> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...) #> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...) #> #> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...) #> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...) ``` Created on 2022-09-29 with [reprex v2.0.2](https://reprex.tidyverse.org) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17884) Add Intel®-IAA/QPL-based Parquet RLE Decode
[ https://issues.apache.org/jira/browse/ARROW-17884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyaqi reassigned ARROW-17884: Assignee: zhaoyaqi > Add Intel®-IAA/QPL-based Parquet RLE Decode > --- > > Key: ARROW-17884 > URL: https://issues.apache.org/jira/browse/ARROW-17884 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: zhaoyaqi >Assignee: zhaoyaqi >Priority: Minor > Labels: performance, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Intel® In-Memory Analytics Accelerator (Intel® IAA) is a hardware accelerator > available in the upcoming generation of Intel® Xeon® Scalable processors > ("Sapphire Rapids"). Its goal is to speed up common operations in analytics > like data (de)compression and filtering. It supports decoding of the Parquet RLE > format. We add a new codec that utilizes the Intel® IAA offloading technology > to provide a high-performance RLE decode implementation. The codec uses the > [Intel® Query Processing Library (QPL)|https://github.com/intel/qpl], which > abstracts access to the hardware accelerator. The new codec generally provides > higher performance than the current solution and also consumes less CPU. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17884) Add Intel®-IAA/QPL-based Parquet RLE Decode
[ https://issues.apache.org/jira/browse/ARROW-17884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17884: --- Labels: performance pull-request-available (was: performance) > Add Intel®-IAA/QPL-based Parquet RLE Decode > --- > > Key: ARROW-17884 > URL: https://issues.apache.org/jira/browse/ARROW-17884 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: zhaoyaqi >Priority: Minor > Labels: performance, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Intel® In-Memory Analytics Accelerator (Intel® IAA) is a hardware accelerator > available in the upcoming generation of Intel® Xeon® Scalable processors > ("Sapphire Rapids"). Its goal is to speed up common operations in analytics > like data (de)compression and filtering. It supports decoding of the Parquet RLE > format. We add a new codec that utilizes the Intel® IAA offloading technology > to provide a high-performance RLE decode implementation. The codec uses the > [Intel® Query Processing Library (QPL)|https://github.com/intel/qpl], which > abstracts access to the hardware accelerator. The new codec generally provides > higher performance than the current solution and also consumes less CPU. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17884) Add Intel®-IAA/QPL-based Parquet RLE Decode
zhaoyaqi created ARROW-17884: Summary: Add Intel®-IAA/QPL-based Parquet RLE Decode Key: ARROW-17884 URL: https://issues.apache.org/jira/browse/ARROW-17884 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: zhaoyaqi Intel® In-Memory Analytics Accelerator (Intel® IAA) is a hardware accelerator available in the upcoming generation of Intel® Xeon® Scalable processors ("Sapphire Rapids"). Its goal is to speed up common operations in analytics like data (de)compression and filtering. It supports decoding of the Parquet RLE format. We add a new codec that utilizes the Intel® IAA offloading technology to provide a high-performance RLE decode implementation. The codec uses the [Intel® Query Processing Library (QPL)|https://github.com/intel/qpl], which abstracts access to the hardware accelerator. The new codec generally provides higher performance than the current solution and also consumes less CPU. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17850) [Java] Upgrade netty-codec-http dependencies
[ https://issues.apache.org/jira/browse/ARROW-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610754#comment-17610754 ] Hui Yu commented on ARROW-17850: Thank you all! > [Java] Upgrade netty-codec-http dependencies > > > Key: ARROW-17850 > URL: https://issues.apache.org/jira/browse/ARROW-17850 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 9.0.0 >Reporter: Hui Yu >Assignee: David Dali Susanibar Arce >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > [CVE-2022-24823]([https://github.com/advisories/GHSA-269q-hmxg-m83q]) reports > a security vulnerability for *netty-codec-http*. > The version of *netty-codec-http* in the master branch is currently *4.1.72.Final*, > which is unsafe. > The ticket https://issues.apache.org/jira/browse/ARROW-16996 bumped > *netty-codec* to {*}4.1.78.Final{*} but didn't bump *netty-codec-http*. > Can you upgrade the version of *netty-codec-http*? 
> > Here is my output of mvn:dependency now: > ```bash > [INFO] +- org.apache.arrow:flight-core:jar:9.0.0:compile > [INFO] | +- io.grpc:grpc-netty:jar:1.47.0:compile > [INFO] | | +- io.netty:netty-codec-http2:jar:4.1.72.Final:compile > [INFO] | | | - io.netty:{*}netty-codec-http{*}:jar:4.1.72.Final:compile > [INFO] | | +- io.netty:netty-handler-proxy:jar:4.1.72.Final:runtime > [INFO] | | | - io.netty:netty-codec-socks:jar:4.1.72.Final:runtime > [INFO] | | +- > com.google.errorprone:error_prone_annotations:jar:2.10.0:compile > [INFO] | | +- io.perfmark:perfmark-api:jar:0.25.0:runtime > [INFO] | | - > io.netty:netty-transport-native-unix-common:jar:4.1.72.Final:compile > [INFO] | +- io.grpc:grpc-core:jar:1.47.0:compile > [INFO] | | +- com.google.android:annotations:jar:4.1.1.4:runtime > [INFO] | | - org.codehaus.mojo:animal-sniffer-annotations:jar:1.19:runtime > [INFO] | +- io.grpc:grpc-context:jar:1.47.0:compile > [INFO] | +- io.grpc:grpc-protobuf:jar:1.47.0:compile > [INFO] | | +- > com.google.api.grpc:proto-google-common-protos:jar:2.0.1:compile > [INFO] | | - io.grpc:grpc-protobuf-lite:jar:1.47.0:compile > [INFO] | +- io.netty:netty-tcnative-boringssl-static:jar:2.0.53.Final:compile > [INFO] | | +- io.netty:netty-tcnative-classes:jar:2.0.53.Final:compile > [INFO] | | +- > io.netty:netty-tcnative-boringssl-static:jar:linux-x86_64:2.0.53.Final:compile > [INFO] | | +- > io.netty:netty-tcnative-boringssl-static:jar:linux-aarch_64:2.0.53.Final:compile > [INFO] | | +- > io.netty:netty-tcnative-boringssl-static:jar:osx-x86_64:2.0.53.Final:compile > [INFO] | | +- > io.netty:netty-tcnative-boringssl-static:jar:osx-aarch_64:2.0.53.Final:compile > [INFO] | | - > io.netty:netty-tcnative-boringssl-static:jar:windows-x86_64:2.0.53.Final:compile > [INFO] | +- io.netty:netty-handler:jar:4.1.78.Final:compile > [INFO] | | +- io.netty:netty-resolver:jar:4.1.78.Final:compile > [INFO] | | - io.netty:netty-codec:jar:4.1.78.Final:compile > [INFO] | +- 
io.netty:netty-transport:jar:4.1.78.Final:compile > [INFO] | +- com.google.guava:guava:jar:30.1.1-jre:compile > [INFO] | | +- com.google.guava:failureaccess:jar:1.0.1:compile > [INFO] | | +- > com.google.guava:listenablefuture:jar:.0-empty-to-avoid-conflict-with-guava:compile > [INFO] | | +- org.checkerframework:checker-qual:jar:3.8.0:compile > [INFO] | | - com.google.j2objc:j2objc-annotations:jar:1.3:compile > [INFO] | +- io.grpc:grpc-stub:jar:1.47.0:compile > [INFO] | +- com.google.protobuf:protobuf-java:jar:3.21.2:compile > [INFO] | +- io.grpc:grpc-api:jar:1.47.0:compile > [INFO] | - javax.annotation:javax.annotation-api:jar:1.3.2:compile > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17550) [C++][CI] MinGW builds shouldn't compile grpcio
[ https://issues.apache.org/jira/browse/ARROW-17550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17550: --- Labels: pull-request-available (was: ) > [C++][CI] MinGW builds shouldn't compile grpcio > --- > > Key: ARROW-17550 > URL: https://issues.apache.org/jira/browse/ARROW-17550 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > MinGW builds currently compile the GCS testbench and grpcio for MinGW. > When the compiled MinGW wheel is not in cache, compiling takes a very long > time (\*). But Win32 and Win64 binary wheels are available on PyPI. > This is pointless: the GCS testbench could simply run with the system Python > instead of the msys2 Python, and always use the binaries from PyPI. > (\*) see for example https://github.com/pitrou/arrow/runs/8071607360 where > installing the GCS testbench took 18 minutes -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17850) [Java] Upgrade netty-codec-http dependencies
[ https://issues.apache.org/jira/browse/ARROW-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-17850. -- Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 14265 [https://github.com/apache/arrow/pull/14265] > [Java] Upgrade netty-codec-http dependencies > > > Key: ARROW-17850 > URL: https://issues.apache.org/jira/browse/ARROW-17850 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 9.0.0 >Reporter: Hui Yu >Assignee: David Dali Susanibar Arce >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > [CVE-2022-24823]([https://github.com/advisories/GHSA-269q-hmxg-m83q]) reports > a security vulnerability for *netty-codec-http*. > The version of *netty-codec-http* in the master branch is currently *4.1.72.Final*, > which is unsafe. > The ticket https://issues.apache.org/jira/browse/ARROW-16996 bumped > *netty-codec* to {*}4.1.78.Final{*} but didn't bump *netty-codec-http*. > Can you upgrade the version of *netty-codec-http*? 
> > Here is my output of mvn:dependency now: > ```bash > [INFO] +- org.apache.arrow:flight-core:jar:9.0.0:compile > [INFO] | +- io.grpc:grpc-netty:jar:1.47.0:compile > [INFO] | | +- io.netty:netty-codec-http2:jar:4.1.72.Final:compile > [INFO] | | | - io.netty:{*}netty-codec-http{*}:jar:4.1.72.Final:compile > [INFO] | | +- io.netty:netty-handler-proxy:jar:4.1.72.Final:runtime > [INFO] | | | - io.netty:netty-codec-socks:jar:4.1.72.Final:runtime > [INFO] | | +- > com.google.errorprone:error_prone_annotations:jar:2.10.0:compile > [INFO] | | +- io.perfmark:perfmark-api:jar:0.25.0:runtime > [INFO] | | - > io.netty:netty-transport-native-unix-common:jar:4.1.72.Final:compile > [INFO] | +- io.grpc:grpc-core:jar:1.47.0:compile > [INFO] | | +- com.google.android:annotations:jar:4.1.1.4:runtime > [INFO] | | - org.codehaus.mojo:animal-sniffer-annotations:jar:1.19:runtime > [INFO] | +- io.grpc:grpc-context:jar:1.47.0:compile > [INFO] | +- io.grpc:grpc-protobuf:jar:1.47.0:compile > [INFO] | | +- > com.google.api.grpc:proto-google-common-protos:jar:2.0.1:compile > [INFO] | | - io.grpc:grpc-protobuf-lite:jar:1.47.0:compile > [INFO] | +- io.netty:netty-tcnative-boringssl-static:jar:2.0.53.Final:compile > [INFO] | | +- io.netty:netty-tcnative-classes:jar:2.0.53.Final:compile > [INFO] | | +- > io.netty:netty-tcnative-boringssl-static:jar:linux-x86_64:2.0.53.Final:compile > [INFO] | | +- > io.netty:netty-tcnative-boringssl-static:jar:linux-aarch_64:2.0.53.Final:compile > [INFO] | | +- > io.netty:netty-tcnative-boringssl-static:jar:osx-x86_64:2.0.53.Final:compile > [INFO] | | +- > io.netty:netty-tcnative-boringssl-static:jar:osx-aarch_64:2.0.53.Final:compile > [INFO] | | - > io.netty:netty-tcnative-boringssl-static:jar:windows-x86_64:2.0.53.Final:compile > [INFO] | +- io.netty:netty-handler:jar:4.1.78.Final:compile > [INFO] | | +- io.netty:netty-resolver:jar:4.1.78.Final:compile > [INFO] | | - io.netty:netty-codec:jar:4.1.78.Final:compile > [INFO] | +- 
io.netty:netty-transport:jar:4.1.78.Final:compile > [INFO] | +- com.google.guava:guava:jar:30.1.1-jre:compile > [INFO] | | +- com.google.guava:failureaccess:jar:1.0.1:compile > [INFO] | | +- > com.google.guava:listenablefuture:jar:.0-empty-to-avoid-conflict-with-guava:compile > [INFO] | | +- org.checkerframework:checker-qual:jar:3.8.0:compile > [INFO] | | - com.google.j2objc:j2objc-annotations:jar:1.3:compile > [INFO] | +- io.grpc:grpc-stub:jar:1.47.0:compile > [INFO] | +- com.google.protobuf:protobuf-java:jar:3.21.2:compile > [INFO] | +- io.grpc:grpc-api:jar:1.47.0:compile > [INFO] | - javax.annotation:javax.annotation-api:jar:1.3.2:compile > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-15479) [C++] Cast fixed size list to compatible fixed size list type (other values type, other field name)
[ https://issues.apache.org/jira/browse/ARROW-15479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-15479. -- Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 14181 [https://github.com/apache/arrow/pull/14181] > [C++] Cast fixed size list to compatible fixed size list type (other values > type, other field name) > --- > > Key: ARROW-15479 > URL: https://issues.apache.org/jira/browse/ARROW-15479 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Kshiteej K >Priority: Major > Labels: good-second-issue, kernel, pull-request-available > Fix For: 10.0.0 > > Time Spent: 4.5h > Remaining Estimate: 0h > > Casting a FixedSizeListArray to a compatible type but only a different field > name isn't implemented: > {code:python} > >>> my_type = pa.list_(pa.field("element", pa.int64()), 2) > >>> arr = pa.FixedSizeListArray.from_arrays(pa.array([1, 2, 3, 4, 5, 6]), 2) > >>> arr.type > FixedSizeListType(fixed_size_list<item: int64>[2]) > >>> my_type > FixedSizeListType(fixed_size_list<element: int64>[2]) > >>> arr.cast(my_type) > ... > ArrowNotImplementedError: Unsupported cast from fixed_size_list<item: > int64>[2] to fixed_size_list<element: int64>[2] using function cast_fixed_size_list > {code} > While the similar operation with a variable sized list actually works: > {code:python} > >>> my_type = pa.list_(pa.field("element", pa.int64())) > >>> arr = pa.array([[1, 2], [3, 4]], pa.list_(pa.int64())) > >>> arr.type > ListType(list<item: int64>) > >>> arr.cast(my_type).type > ListType(list<element: int64>) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17877) [CI][Python] verify-rc python nightly builds fail due to missing some flags that were activated with ARROW_PYTHON=ON
[ https://issues.apache.org/jira/browse/ARROW-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610694#comment-17610694 ] Kouhei Sutou commented on ARROW-17877: -- Yes. I'll add {{ARROW_PYTHON}} back but we should not use {{ARROW_PYTHON=ON}} in the {{verify-release-candidate.sh}}. Because the {{ARROW_PYTHON}} dependencies are inconsistent. For example, they include {{ARROW_DATASET=ON}} but it's an optional component (not a required component) in PyArrow. And they don't include all optional components such as {{ARROW_PARQUET=ON}}. I think that CMake presets will be better replacement for {{ARROW_PYTHON}} because we can define multiple presets such as {{features-python-minimum}} and {{features-python-maximum}}. But CMake presets require CMake 3.19 or later... > [CI][Python] verify-rc python nightly builds fail due to missing some flags > that were activated with ARROW_PYTHON=ON > > > Key: ARROW-17877 > URL: https://issues.apache.org/jira/browse/ARROW-17877 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, Python >Reporter: Raúl Cumplido >Assignee: Raúl Cumplido >Priority: Blocker > Labels: Nightly > Fix For: 10.0.0 > > > Some of our nightly builds are failing with: > {code:java} > [ 35%] Building CXX object CMakeFiles/_dataset.dir/_dataset.cpp.o > /arrow/python/build/temp.linux-x86_64-cpython-38/_dataset.cpp:833:10: fatal > error: arrow/csv/api.h: No such file or directory > #include "arrow/csv/api.h" > ^ > compilation terminated.{code} > I suspect the flags included CSV=ON when building with PYTHON=ON changes here > might be related: > [https://github.com/apache/arrow/commit/53ac2a00aa9ff199773513f6f996f73a07b37989] > Example of nightly failures: > https://github.com/ursacomputing/crossbow/actions/runs/3135833175/jobs/5091988801 -- This message was sent by Atlassian Jira (v8.20.10#820010)
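The preset idea in the comment could look roughly like this {{CMakePresets.json}} fragment. The preset names come from the comment; the component lists are illustrative assumptions, not copied from an actual Arrow file, and schema version 1 is used because it matches the CMake 3.19 minimum mentioned above:

```json
{
  "version": 1,
  "configurePresets": [
    {
      "name": "features-python-minimum",
      "hidden": true,
      "cacheVariables": {
        "ARROW_COMPUTE": "ON",
        "ARROW_CSV": "ON",
        "ARROW_FILESYSTEM": "ON",
        "ARROW_JSON": "ON"
      }
    },
    {
      "name": "features-python-maximum",
      "hidden": true,
      "inherits": "features-python-minimum",
      "cacheVariables": {
        "ARROW_DATASET": "ON",
        "ARROW_FLIGHT": "ON",
        "ARROW_PARQUET": "ON"
      }
    }
  ]
}
```

A verification script could then pick {{--preset features-python-minimum}} or {{--preset features-python-maximum}} instead of relying on the inconsistent {{ARROW_PYTHON=ON}} dependency set.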
[jira] [Closed] (ARROW-16155) [R] lubridate functions for 9.0.0
[ https://issues.apache.org/jira/browse/ARROW-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc closed ARROW-16155. -- Resolution: Done > [R] lubridate functions for 9.0.0 > - > > Key: ARROW-16155 > URL: https://issues.apache.org/jira/browse/ARROW-16155 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 8.0.0 >Reporter: Alessandro Molina >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > > Umbrella ticket for lubridate functions in 9.0.0 > Future work that is not going to happen in v9 is recorded under > https://issues.apache.org/jira/browse/ARROW-16841 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (ARROW-16155) [R] lubridate functions for 9.0.0
[ https://issues.apache.org/jira/browse/ARROW-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc reopened ARROW-16155: Assignee: Dragoș Moldovan-Grünfeld > [R] lubridate functions for 9.0.0 > - > > Key: ARROW-16155 > URL: https://issues.apache.org/jira/browse/ARROW-16155 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 8.0.0 >Reporter: Alessandro Molina >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > > Umbrella ticket for lubridate functions in 9.0.0 > Future work that is not going to happen in v9 is recorded under > https://issues.apache.org/jira/browse/ARROW-16841 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17857) [C++] Table::CombineChunksToBatch segfaults on empty tables
[ https://issues.apache.org/jira/browse/ARROW-17857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-17857: - Fix Version/s: 10.0.0 > [C++] Table::CombineChunksToBatch segfaults on empty tables > --- > > Key: ARROW-17857 > URL: https://issues.apache.org/jira/browse/ARROW-17857 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > There can be 0 chunks in a ChunkedArray -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610690#comment-17610690 ] Kouhei Sutou commented on ARROW-17872: -- {quote} Do you know why we decided to use Homebrew for dependencies on macOS? {quote} Because Homebrew is one of major package managers that are used by macOS users. We should use an environment similar to the one that is used by users for CI to find bugs before we release. Anyway, I'm OK with disabling some features for PR. > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on Github Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often. -- This message was sent by Atlassian Jira (v8.20.10#820010)
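A minimal sketch of such caching for a GitHub Actions macOS job. The step name, cache key, and the {{cpp/Brewfile}} path are assumptions for illustration; {{actions/cache}} with Homebrew's download cache directory is the standard mechanism:

```yaml
# Hypothetical workflow step: persist Homebrew's download cache between
# runs so bottles are not re-downloaded (or rebuilt from source) each time.
- name: Cache Homebrew downloads
  uses: actions/cache@v3
  with:
    path: ~/Library/Caches/Homebrew
    key: brew-${{ runner.os }}-${{ hashFiles('cpp/Brewfile') }}
    restore-keys: brew-${{ runner.os }}-
```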
[jira] [Updated] (ARROW-17848) [R][CI] Failed tests in test-dplyr-funcs-datetime.R
[ https://issues.apache.org/jira/browse/ARROW-17848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-17848: Fix Version/s: 10.0.0 > [R][CI] Failed tests in test-dplyr-funcs-datetime.R > --- > > Key: ARROW-17848 > URL: https://issues.apache.org/jira/browse/ARROW-17848 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, R >Reporter: Antoine Pitrou >Priority: Major > Fix For: 10.0.0 > > > Just saw this on an unrelated PR: > https://github.com/pitrou/arrow/actions/runs/3129051648/jobs/5078785139#step:11:23882 > {code} > -- Failure (test-dplyr-funcs-datetime.R:461:3): format_ISO8601 > - > `object` (`actual`) not equal to `expected` (`expected`). > actual vs expected > a a2 > - actual[1, ] 2018-10-07 2018-10-07 > + expected[1, ] 2018-10-07 19:04:05 2018-10-07 19:04:05 > actual[2, ] NA NA > `actual$a`: "2018-10-07" NA > `expected$a`: "2018-10-07 19:04:05" NA > `actual$a2`: "2018-10-07" NA > `expected$a2`: "2018-10-07 19:04:05" NA > Backtrace: > x > 1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:461:2 > 2. \-arrow:::expect_equal(via_batch, expected, ...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:115:4 > 3. \-testthat::expect_equal(...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4 > -- Failure (test-dplyr-funcs-datetime.R:461:3): format_ISO8601 > - > `object` (`actual`) not equal to `expected` (`expected`). > actual vs expected > a a2 > - actual[1, ] 2018-10-07 2018-10-07 > + expected[1, ] 2018-10-07 19:04:05 2018-10-07 19:04:05 > actual[2, ] NA NA > `actual$a`: "2018-10-07" NA > `expected$a`: "2018-10-07 19:04:05" NA > `actual$a2`: "2018-10-07" NA > `expected$a2`: "2018-10-07 19:04:05" NA > Backtrace: > x > 1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:461:2 > 2. \-arrow:::expect_equal(via_table, expected, ...) 
at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:129:4 > 3. \-testthat::expect_equal(...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4 > -- Failure (test-dplyr-funcs-datetime.R:492:5): format_ISO8601 > - > `object` (`actual`) not equal to `expected` (`expected`). > actual vs expected > x > - actual[1, ] 2018-10-07-0600 > + expected[1, ] 2018-10-07 19:04:05 > actual[2, ] NA > `actual$x`: "2018-10-07-0600" NA > `expected$x`: "2018-10-07 19:04:05" NA > Backtrace: > x > 1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:492:4 > 2. \-arrow:::expect_equal(via_batch, expected, ...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:115:4 > 3. \-testthat::expect_equal(...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4 > -- Failure (test-dplyr-funcs-datetime.R:492:5): format_ISO8601 > - > `object` (`actual`) not equal to `expected` (`expected`). > actual vs expected > x > - actual[1, ] 2018-10-07-0600 > + expected[1, ] 2018-10-07 19:04:05 > actual[2, ] NA > `actual$x`: "2018-10-07-0600" NA > `expected$x`: "2018-10-07 19:04:05" NA > Backtrace: > x > 1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:492:4 > 2. \-arrow:::expect_equal(via_table, expected, ...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:129:4 > 3. \-testthat::expect_equal(...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4 > -- Failure (test-dplyr-funcs-datetime.R:500:5): format_ISO8601 > - > `object` (`actual`) not equal to `expected` (`expected`). > actual vs expected >x > - actual[1, ] 2018-10-07T19:04:05-0600 > + expected[1, ] 2018-10-07 19:04:05 > actual[2, ] NA > `actual$x`: "2018-10-07T19:04:05-0600" NA > `expected$x`: "2018-10-07 19:04:05" NA > Backtrace: > x > 1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:500:4 > 2. 
\-arrow:::expect_equal(via_batch, expected, ...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:115:4 > 3. \-testthat::expect_equal(...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:4
[jira] [Updated] (ARROW-17848) [R][CI] Failed tests in test-dplyr-funcs-datetime.R
[ https://issues.apache.org/jira/browse/ARROW-17848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-17848: Priority: Critical (was: Major) > [R][CI] Failed tests in test-dplyr-funcs-datetime.R > --- > > Key: ARROW-17848 > URL: https://issues.apache.org/jira/browse/ARROW-17848 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, R >Reporter: Antoine Pitrou >Priority: Critical > Fix For: 10.0.0 > > > Just saw this on an unrelated PR: > https://github.com/pitrou/arrow/actions/runs/3129051648/jobs/5078785139#step:11:23882 > {code} > -- Failure (test-dplyr-funcs-datetime.R:461:3): format_ISO8601 > - > `object` (`actual`) not equal to `expected` (`expected`). > actual vs expected > a a2 > - actual[1, ] 2018-10-07 2018-10-07 > + expected[1, ] 2018-10-07 19:04:05 2018-10-07 19:04:05 > actual[2, ] NA NA > `actual$a`: "2018-10-07" NA > `expected$a`: "2018-10-07 19:04:05" NA > `actual$a2`: "2018-10-07" NA > `expected$a2`: "2018-10-07 19:04:05" NA > Backtrace: > x > 1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:461:2 > 2. \-arrow:::expect_equal(via_batch, expected, ...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:115:4 > 3. \-testthat::expect_equal(...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4 > -- Failure (test-dplyr-funcs-datetime.R:461:3): format_ISO8601 > - > `object` (`actual`) not equal to `expected` (`expected`). > actual vs expected > a a2 > - actual[1, ] 2018-10-07 2018-10-07 > + expected[1, ] 2018-10-07 19:04:05 2018-10-07 19:04:05 > actual[2, ] NA NA > `actual$a`: "2018-10-07" NA > `expected$a`: "2018-10-07 19:04:05" NA > `actual$a2`: "2018-10-07" NA > `expected$a2`: "2018-10-07 19:04:05" NA > Backtrace: > x > 1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:461:2 > 2. \-arrow:::expect_equal(via_table, expected, ...) 
at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:129:4 > 3. \-testthat::expect_equal(...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4 > -- Failure (test-dplyr-funcs-datetime.R:492:5): format_ISO8601 > - > `object` (`actual`) not equal to `expected` (`expected`). > actual vs expected > x > - actual[1, ] 2018-10-07-0600 > + expected[1, ] 2018-10-07 19:04:05 > actual[2, ] NA > `actual$x`: "2018-10-07-0600" NA > `expected$x`: "2018-10-07 19:04:05" NA > Backtrace: > x > 1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:492:4 > 2. \-arrow:::expect_equal(via_batch, expected, ...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:115:4 > 3. \-testthat::expect_equal(...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4 > -- Failure (test-dplyr-funcs-datetime.R:492:5): format_ISO8601 > - > `object` (`actual`) not equal to `expected` (`expected`). > actual vs expected > x > - actual[1, ] 2018-10-07-0600 > + expected[1, ] 2018-10-07 19:04:05 > actual[2, ] NA > `actual$x`: "2018-10-07-0600" NA > `expected$x`: "2018-10-07 19:04:05" NA > Backtrace: > x > 1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:492:4 > 2. \-arrow:::expect_equal(via_table, expected, ...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:129:4 > 3. \-testthat::expect_equal(...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:42:4 > -- Failure (test-dplyr-funcs-datetime.R:500:5): format_ISO8601 > - > `object` (`actual`) not equal to `expected` (`expected`). > actual vs expected >x > - actual[1, ] 2018-10-07T19:04:05-0600 > + expected[1, ] 2018-10-07 19:04:05 > actual[2, ] NA > `actual$x`: "2018-10-07T19:04:05-0600" NA > `expected$x`: "2018-10-07 19:04:05" NA > Backtrace: > x > 1. \-arrow:::compare_dplyr_binding(...) at test-dplyr-funcs-datetime.R:500:4 > 2. 
\-arrow:::expect_equal(via_batch, expected, ...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-expectation.R:115:4 > 3. \-testthat::expect_equal(...) at > D:\a\arrow\arrow\r\check\arrow.Rcheck\tests\testthat\helper-e
[jira] [Resolved] (ARROW-17811) [Doc][Java] Document how dictionary encoding works
[ https://issues.apache.org/jira/browse/ARROW-17811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-17811. -- Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 14213 [https://github.com/apache/arrow/pull/14213] > [Doc][Java] Document how dictionary encoding works > -- > > Key: ARROW-17811 > URL: https://issues.apache.org/jira/browse/ARROW-17811 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Java >Affects Versions: 9.0.0 >Reporter: Larry White >Assignee: Larry White >Priority: Minor > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 5h 20m > Remaining Estimate: 0h > > The ValueVector documentation does not include any discussion of dictionary > encoding. There is example code on the IPC page > https://arrow.apache.org/docs/dev/java/ipc.html, but it doesn't provide an > overview. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17878) [Website] Exclude Ballista docs from being deleted
[ https://issues.apache.org/jira/browse/ARROW-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou reassigned ARROW-17878: Assignee: Andy Grove > [Website] Exclude Ballista docs from being deleted > -- > > Key: ARROW-17878 > URL: https://issues.apache.org/jira/browse/ARROW-17878 > Project: Apache Arrow > Issue Type: Improvement > Components: Website >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > Exclude Ballista docs from being deleted -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17878) [Website] Exclude Ballista docs from being deleted
[ https://issues.apache.org/jira/browse/ARROW-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-17878. -- Fix Version/s: 10.0.0 Resolution: Fixed https://github.com/apache/arrow-site/pull/241 > [Website] Exclude Ballista docs from being deleted > -- > > Key: ARROW-17878 > URL: https://issues.apache.org/jira/browse/ARROW-17878 > Project: Apache Arrow > Issue Type: Improvement > Components: Website >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 10.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Exclude Ballista docs from being deleted -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-16155) [R] lubridate functions for 9.0.0
[ https://issues.apache.org/jira/browse/ARROW-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson closed ARROW-16155. --- Resolution: Done > [R] lubridate functions for 9.0.0 > - > > Key: ARROW-16155 > URL: https://issues.apache.org/jira/browse/ARROW-16155 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 8.0.0 >Reporter: Alessandro Molina >Priority: Major > > Umbrella ticket for lubridate functions in 9.0.0 > Future work that is not going to happen in v9 is recorded under > https://issues.apache.org/jira/browse/ARROW-16841 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17822) Seg Fault in pyarrow FlightClient with unknown uri schema
[ https://issues.apache.org/jira/browse/ARROW-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17822: --- Labels: pull-request-available (was: ) > Seg Fault in pyarrow FlightClient with unknown uri schema > - > > Key: ARROW-17822 > URL: https://issues.apache.org/jira/browse/ARROW-17822 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC, Python >Affects Versions: 9.0.0 > Environment: Linux U801802 5.14.0-1051-oem #58-Ubuntu SMP Fri Aug 26 > 05:50:00 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux > Tried with standard ubuntu > Python 3.8.10 (default, Jun 22 2022, 20:18:18) > [GCC 9.4.0] on linux > And miniconda with python 3.10 >Reporter: Martin >Assignee: David Li >Priority: Minor > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Running python in gdb for a bit of info. > Here I misspelled "grpc" as "grps" but any unrcognized schema will make it seg > {code:java} > gdb$ r > Starting program: /home/user/miniconda3/envs/duckdb10/bin/python > [Thread debugging using libthread_db enabled] > Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". > Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC > 10.4.0] on linux > Type "help", "copyright", "credits" or "license" for more information. 
> >>> import pyarrow as pa > [New Thread 0x72fff700 (LWP 1058902)] > [New Thread 0x71faa700 (LWP 1058903)] > [New Thread 0x717a9700 (LWP 1058904)] > [New Thread 0x70fa8700 (LWP 1058905)] > [New Thread 0x7fffd1d57700 (LWP 1058906)] > [New Thread 0x7fffd1556700 (LWP 1058907)] > [New Thread 0x7fffc0d55700 (LWP 1058908)] > [New Thread 0x7fffb8554700 (LWP 1058909)] > [New Thread 0x7fffafd53700 (LWP 1058910)] > [New Thread 0x7fffaf552700 (LWP 1058911)] > [New Thread 0x7fff9ed51700 (LWP 1058912)] > [New Thread 0x7fff96550700 (LWP 1058913)] > [New Thread 0x7fff8dd4f700 (LWP 1058914)] > [New Thread 0x7fff8554e700 (LWP 1058915)] > [New Thread 0x7fff84d4d700 (LWP 1058916)] > [New Thread 0x7fff7c54c700 (LWP 1058917)] > >>> import pyarrow.flight > >>> client = pa.flight.connect("grps://0.0.0.0:4")Thread 1 "python" > >>> received signal SIGSEGV, Segmentation fault. > ---[regs] > RAX: 0x RBX: 0x55B2C3B0 RBP: 0x55B2C3B0 > RSP: 0x7FFFC490 o d I t s Z a P c > RDI: 0x RSI: 0x55A8B040 RDX: 0x55BDEEA0 > RCX: 0x0004 RIP: 0x7FFF6BAA43D6 > R8 : 0x0003 R9 : 0x559F4797 R10: 0x55CEDA70 > R11: 0x55CEDA70 R12: 0x7FFFC990 > R13: 0x7FFFC6B0 R14: 0x7FFFC8D0 R15: 0x7FFFC530 > CS: 0033 DS: ES: FS: GS: SS: 002B > ---[code] > => 0x7fff6baa43d6 <_ZN5arrow6flight12FlightClientD2Ev+38>: mov > rax,QWORD PTR [rdi] > 0x7fff6baa43d9 <_ZN5arrow6flight12FlightClientD2Ev+41>: lea > rbp,[rsp+0x8] > 0x7fff6baa43de <_ZN5arrow6flight12FlightClientD2Ev+46>: mov rsi,rdi > 0x7fff6baa43e1 <_ZN5arrow6flight12FlightClientD2Ev+49>: mov BYTE PTR > [rbx+0x8],0x1 > 0x7fff6baa43e5 <_ZN5arrow6flight12FlightClientD2Ev+53>: mov rdi,rbp > 0x7fff6baa43e8 <_ZN5arrow6flight12FlightClientD2Ev+56>: call QWORD > PTR [rax+0x18] > 0x7fff6baa43eb <_ZN5arrow6flight12FlightClientD2Ev+59>: mov > rax,QWORD PTR [rsp+0x8] > 0x7fff6baa43f0 <_ZN5arrow6flight12FlightClientD2Ev+64>: test rax,rax > - > 0x7fff6baa43d6 in arrow::flight::FlightClient::~FlightClient() () from > 
/home/user/miniconda3/envs/duckdb10/lib/python3.10/site-packages/pyarrow/../../../libarrow_flight.so.900{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
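Until running a release with the fix, callers can avoid the crash by validating the scheme before it ever reaches {{FlightClient}}. A sketch using only the standard library — the {{SUPPORTED_SCHEMES}} set and the helper name are illustrative, not a pyarrow API:

```python
from urllib.parse import urlparse

# Schemes Arrow Flight recognizes (illustrative list; check your Arrow
# version's documentation for the authoritative set).
SUPPORTED_SCHEMES = {"grpc", "grpc+tcp", "grpc+tls", "grpc+unix"}

def checked_flight_uri(uri: str) -> str:
    """Raise a clean Python error instead of letting a typo like
    "grps" reach FlightClient, which segfaulted on it pre-fix."""
    scheme = urlparse(uri).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported Flight URI scheme: {scheme!r}")
    return uri

checked_flight_uri("grpc://0.0.0.0:4")  # accepted
try:
    checked_flight_uri("grps://0.0.0.0:4")
except ValueError as exc:
    print(exc)  # unsupported Flight URI scheme: 'grps'
```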
[jira] [Updated] (ARROW-17187) [R] Improve lazy ALTREP implementation for String
[ https://issues.apache.org/jira/browse/ARROW-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17187: --- Labels: pull-request-available (was: ) > [R] Improve lazy ALTREP implementation for String > - > > Key: ARROW-17187 > URL: https://issues.apache.org/jira/browse/ARROW-17187 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dewey Dunnington >Assignee: Dewey Dunnington >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > ARROW-16578 noted that there was a high cost to looping through an ALTREP > character vector that we created in the arrow R package. The temporary > workaround is to materialize whenever the first element is requested, which > is much faster than our initial implementation but is probably not necessary > given that other ALTREP character implementations appear to not have this > issue: > (Timings before merging ARROW-16578, which reduces the 5 second operation > below to 0.05 seconds). > {code:R} > library(arrow, warn.conflicts = FALSE) > #> Some features are not enabled in this build of Arrow. Run `arrow_info()` > for more information. > df1 <- tibble::tibble(x=as.character(floor(runif(100) * 20))) > write_parquet(df1,"/tmp/test.parquet") > df2 <- read_parquet("/tmp/test.parquet") > system.time(unique(df1$x)) > #>user system elapsed > #> 0.022 0.001 0.023 > system.time(unique(df2$x)) > #>user system elapsed > #> 4.529 0.680 5.226 > # the speed is almost certainly not due to ALTREP itself > # but is probably something to do with our implementation > tf <- tempfile() > readr::write_csv(df1, tf) > df3 <- vroom::vroom(tf, delim = ",", altrep = TRUE) > #> Rows: 100 Columns: 1 > #> ── Column specification > > #> Delimiter: "," > #> dbl (1): x > #> > #> ℹ Use `spec()` to retrieve the full column specification for this data. 
> #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this > message. > .Internal(inspect(df3$x)) > #> @2d2042048 14 REALSXP g0c0 [REF(65535)] vroom_dbl (len=100, > materialized=F) > system.time(unique(df3$x)) > #>user system elapsed > #> 0.127 0.001 0.128 > .Internal(inspect(df3$x)) > #> @2d2042048 14 REALSXP g1c0 [MARK,REF(65535)] vroom_dbl (len=100, > materialized=F) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17883) [Java] Implement an immutable table object
[ https://issues.apache.org/jira/browse/ARROW-17883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry White reassigned ARROW-17883: --- Assignee: Larry White > [Java] Implement an immutable table object > -- > > Key: ARROW-17883 > URL: https://issues.apache.org/jira/browse/ARROW-17883 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 10.0.0 >Reporter: Larry White >Assignee: Larry White >Priority: Major > > Implement an immutable Table object without the batch semantics provided by > VectorSchemaRoot. > See original design document/discussion here: > https://docs.google.com/document/d/1J77irZFWNnSID7vK71z26Nw_Pi99I9Hb9iryno8B03c/edit?usp=sharing > Note that this ticket covers only the immutable Table implementation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17883) [Java] Implement an immutable table object
Larry White created ARROW-17883: --- Summary: [Java] Implement an immutable table object Key: ARROW-17883 URL: https://issues.apache.org/jira/browse/ARROW-17883 Project: Apache Arrow Issue Type: Improvement Components: Java Affects Versions: 10.0.0 Reporter: Larry White Implement an immutable Table object without the batch semantics provided by VectorSchemaRoot. See original design document/discussion here: https://docs.google.com/document/d/1J77irZFWNnSID7vK71z26Nw_Pi99I9Hb9iryno8B03c/edit?usp=sharing Note that this ticket covers only the immutable Table implementation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17687) [C++] ScanningStress test is flaky in CI
[ https://issues.apache.org/jira/browse/ARROW-17687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610349#comment-17610349 ] Percy Camilo Triveño Aucahuasi edited comment on ARROW-17687 at 9/28/22 5:11 PM: - I got this [^backtrace.log.cpp]. It seems we are moving the unique_locker and trying to lock some invalid mutex. Also, I was able to get another issue, this time a deadlock using these values: {code:java} constexpr int kNumIters = 1; constexpr int kNumFragments = 10; constexpr int kBatchesPerFragment = 10; constexpr int kNumConcurrentTasks = 2;{code} I'll try to explore more about where we are getting these errors, so far I was able to reduce and reproduce the test issue using these values: {code:java} constexpr int kNumIters = 1; constexpr int kNumFragments = 2; constexpr int kBatchesPerFragment = 1; constexpr int kNumConcurrentTasks = 1;{code} Given that we can use C++ 17 now, I'll try to use the new std::scoped_lock instead of the other lockers (in the places where it make sense to do so) was (Author: aucahuasi): I got this [^backtrace.log.cpp]. It seems we are moving the unique_locker and trying to lock some invalid mutex. 
Also, I was able to get another issue, this time a deadlock using these values: {code:java} constexpr int kNumIters = 1; constexpr int kNumFragments = 10; constexpr int kBatchesPerFragment = 10; constexpr int kNumConcurrentTasks = 2;{code} I'll try to explore more about where we are getting these errors, so far I was able to reduce and reproduce the test issue using these values: {code:java} constexpr int kNumIters = 1; constexpr int kNumFragments = 2; constexpr int kBatchesPerFragment = 1; constexpr int kNumConcurrentTasks = 1;{code} Given that we can use C++ 17 now, I'll try to use the new std::scoped_lock instead of the the other lockers (in the places where it make sense to do so) > [C++] ScanningStress test is flaky in CI > > > Key: ARROW-17687 > URL: https://issues.apache.org/jira/browse/ARROW-17687 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Assignee: Percy Camilo Triveño Aucahuasi >Priority: Major > Attachments: backtrace.log.cpp > > > There is at least one nightly failure: > https://github.com/ursacomputing/crossbow/actions/runs/3033965241/jobs/4882574634 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers
[ https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610662#comment-17610662 ] Clark Zinzow edited comment on ARROW-10739 at 9/28/22 5:04 PM: --- [~jorisvandenbossche] I did a quick implementation of (2), where the Arrow IPC format is used under-the-hood for pickle serialization, and confirmed that the buffer truncation works as expected. Although this is a far simpler solution than (1), the overhead of the {{RecordBatch}} wrapper adds ~230 extra bytes to the pickled payload (per {{Array}} chunk) compared to current Arrow master, which can be pretty bad for the many-chunk and/or many-column case (order of magnitude larger serialized payloads). We could sidestep this issue by having {{{}Table{}}}, {{{}RecordBatch{}}}, and {{ChunkedArray}} port their {{\_\_reduce\_\_}} to the Arrow IPC serialization as well, which should avoid this many-column and many-chunk blow-up, but there will still be the baseline ~230 byte bloat for {{ChunkedArray}} and {{Array}} that we might find untenable. I can try to get a PR up for (2) either today or tomorrow while I start working on (1) in the background. (1) is going to have a much larger Arrow code impact + we'll continue having two serialization paths to maintain, but it shouldn't result in any serialized payload bloat. was (Author: clarkzinzow): [~jorisvandenbossche] I did a quick implementation of (2), where the Arrow IPC format is used under-the-hood for pickle serialization, and confirmed that the buffer truncation works as expected. Although this is a far simpler solution than (1), the overhead of the {{RecordBatch}} wrapper adds ~230 extra bytes to the pickled payload (per {{Array}} chunk) compared to current Arrow master, which can be pretty bad for the many-chunk and/or many-column case (order of magnitude larger serialized payloads). 
We could sidestep this issue by having {{{}Table{}}}, {{{}RecordBatch{}}}, and {{ChunkedArray}} port their {{\_\_reduce\_\_}} to the Arrow IPC serialization as well, which should avoid this many-column and many-chunk blow-up, but there will still be the baseline ~230 byte bloat for {{ChunkedArray}} and {{Array}} that we might find untenable. I can try to get a PR up for (2) either today or tomorrow while I start working on (1) in the background. (1) is going to have a much larger Arrow code impact + we'll continue having two serialization paths to maintain, but it shouldn't result in any serialized payload bloat. > [Python] Pickling a sliced array serializes all the buffers > --- > > Key: ARROW-10739 > URL: https://issues.apache.org/jira/browse/ARROW-10739 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Maarten Breddels >Assignee: Alessandro Molina >Priority: Critical > Fix For: 10.0.0 > > > If a large array is sliced, and pickled, it seems the full buffer is > serialized, this leads to excessive memory usage and data transfer when using > multiprocessing or dask. > {code:java} > >>> import pyarrow as pa > >>> ar = pa.array(['foo'] * 100_000) > >>> ar.nbytes > 74 > >>> import pickle > >>> len(pickle.dumps(ar.slice(10, 1))) > 700165 > NumPy for instance > >>> import numpy as np > >>> ar_np = np.array(ar) > >>> ar_np > array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object) > >>> import pickle > >>> len(pickle.dumps(ar_np[10:11])) > 165{code} > I think this makes sense if you know arrow, but kind of unexpected as a user. > Is there a workaround for this? For instance copy an arrow array to get rid > of the offset, and trim the buffers? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers
[ https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610662#comment-17610662 ] Clark Zinzow edited comment on ARROW-10739 at 9/28/22 5:04 PM: --- [~jorisvandenbossche] I did a quick implementation of (2), where the Arrow IPC format is used under-the-hood for pickle serialization, and confirmed that the buffer truncation works as expected. Although this is a far simpler solution than (1), the overhead of the {{RecordBatch}} wrapper adds ~230 extra bytes to the pickled payload (per {{Array}} chunk) compared to current Arrow master, which can be pretty bad for the many-chunk and/or many-column case (order of magnitude larger serialized payloads). We could sidestep this issue by having {{{}Table{}}}, {{{}RecordBatch{}}}, and {{ChunkedArray}} port their {{\_\_reduce\_\_}} to the Arrow IPC serialization as well, which should avoid this many-column and many-chunk blow-up, but there will still be the baseline ~230 byte bloat for {{ChunkedArray}} and {{Array}} that we might find untenable. I can try to get a PR up for (2) either today or tomorrow while I start working on (1) in the background. (1) is going to have a much larger Arrow code impact + we'll continue having two serialization paths to maintain, but it shouldn't result in any serialized payload bloat. was (Author: clarkzinzow): [~jorisvandenbossche] I did a quick implementation of (2), where the Arrow IPC format is used under-the-hood for pickle serialization, and confirmed that the buffer truncation works as expected. Although this is a far simpler solution than (1), the overhead of the {{RecordBatch}} wrapper adds ~230 extra bytes to the pickled payload (per {{Array}} chunk) compared to current Arrow master, which can be pretty bad for the many-chunk and/or many-column case (order of magnitude larger serialized payloads). 
We could sidestep this issue by having {{{}Table{}}}, {{{}RecordBatch{}}}, and {{ChunkedArray}} port their {{__reduce__}} to the Arrow IPC serialization as well, which should avoid this many-column and many-chunk blow-up, but there will still be the baseline ~230 byte bloat for {{ChunkedArray}} and {{Array}} that we might find untenable. I can try to get a PR up for (2) either today or tomorrow while I start working on (1) in the background. (1) is going to have a much larger Arrow code impact + we'll continue having two serialization paths to maintain, but it shouldn't result in any serialized payload bloat. > [Python] Pickling a sliced array serializes all the buffers > --- > > Key: ARROW-10739 > URL: https://issues.apache.org/jira/browse/ARROW-10739 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Maarten Breddels >Assignee: Alessandro Molina >Priority: Critical > Fix For: 10.0.0 > > > If a large array is sliced, and pickled, it seems the full buffer is > serialized, this leads to excessive memory usage and data transfer when using > multiprocessing or dask. > {code:java} > >>> import pyarrow as pa > >>> ar = pa.array(['foo'] * 100_000) > >>> ar.nbytes > 74 > >>> import pickle > >>> len(pickle.dumps(ar.slice(10, 1))) > 700165 > NumPy for instance > >>> import numpy as np > >>> ar_np = np.array(ar) > >>> ar_np > array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object) > >>> import pickle > >>> len(pickle.dumps(ar_np[10:11])) > 165{code} > I think this makes sense if you know arrow, but kind of unexpected as a user. > Is there a workaround for this? For instance copy an arrow array to get rid > of the offset, and trim the buffers? -- This message was sent by Atlassian Jira (v8.20.10#820010)
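To make the buffer-truncation idea in option (2) concrete, here is a self-contained toy model in plain Python (this illustrates the technique only; it is not pyarrow's actual implementation, and the class and method names are invented):

```python
import pickle

class SlicedArray:
    """Toy stand-in for an Arrow array: a (buffer, offset, length) view.

    Default pickling would ship the entire underlying buffer, which is
    the reported bug. Overriding __reduce__ to re-serialize only the
    visible region mimics what routing pickling through the IPC writer
    achieves: buffers get truncated to the slice.
    """

    def __init__(self, buf, offset=0, length=None):
        self.buf = buf
        self.offset = offset
        self.length = len(buf) - offset if length is None else length

    def slice(self, offset, length):
        # Zero-copy view into the same buffer, like Array.slice().
        return SlicedArray(self.buf, self.offset + offset, length)

    def view(self):
        return self.buf[self.offset:self.offset + self.length]

    def __reduce__(self):
        # Serialize only the bytes this view can actually see.
        return (SlicedArray, (self.view(), 0, self.length))

big = SlicedArray(b"foo" * 100_000)   # ~300 kB buffer
tiny = big.slice(10, 3)               # 3-byte view
payload = pickle.dumps(tiny)
assert len(payload) < 200             # truncated, not ~300 kB
assert pickle.loads(payload).view() == tiny.view()
```

Option (1) in the thread would instead truncate the buffers themselves before handing them to pickle, which avoids the per-chunk wrapper overhead discussed above.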
[jira] [Commented] (ARROW-17866) [Python] List child array invalid
[ https://issues.apache.org/jira/browse/ARROW-17866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610666#comment-17610666 ] Sean Conroy commented on ARROW-17866: - [~jorisvandenbossche] Thanks so much for noticing the connection. Yes - this appears to be the same issue. I will implement the suggested workaround now... > [Python] List child array invalid > - > > Key: ARROW-17866 > URL: https://issues.apache.org/jira/browse/ARROW-17866 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 9.0.0 >Reporter: Sean Conroy >Priority: Major > > This issue happens for all the versions of pyarrow I checked (9.0.0, 7.0.0, > 6.0.0, 6.0.1). > Running on Windows 11. > {code:java} > log.to_feather(log_fname) > Traceback (most recent call last): > File "G:\My > Drive\ds-atcore-etl\venv\lib\site-packages\IPython\core\interactiveshell.py", > line 3444, in run_code > exec(code_obj, self.user_global_ns, self.user_ns) > File "", line 1, in > log.to_feather(log_fname) > File "G:\My > Drive\ds-atcore-etl\venv\lib\site-packages\pandas\util\_decorators.py", line > 207, in wrapper > return func(*args, **kwargs) > File "G:\My > Drive\ds-atcore-etl\venv\lib\site-packages\pandas\core\frame.py", line 2519, > in to_feather > to_feather(self, path, **kwargs) > File "G:\My > Drive\ds-atcore-etl\venv\lib\site-packages\pandas\io\feather_format.py", line > 87, in to_feather > feather.write_feather(df, handles.handle, **kwargs) > File "G:\My Drive\ds-atcore-etl\venv\lib\site-packages\pyarrow\feather.py", > line 164, in write_feather > table = Table.from_pandas(df, preserve_index=preserve_index) > File "pyarrow\table.pxi", line 3495, in pyarrow.lib.Table.from_pandas > File "pyarrow\table.pxi", line 3597, in pyarrow.lib.Table.from_arrays > File "pyarrow\table.pxi", line 2793, in pyarrow.lib.Table.validate > File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Column 13: In chunk 0: Invalid: List 
child array > invalid: Invalid: Struct child array #0 has length smaller than expected for > struct array (67186731 < 67186732) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17847) [C++] Support unquoted decimal in JSON parser
[ https://issues.apache.org/jira/browse/ARROW-17847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-17847. Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 14242 [https://github.com/apache/arrow/pull/14242] > [C++] Support unquoted decimal in JSON parser > - > > Key: ARROW-17847 > URL: https://issues.apache.org/jira/browse/ARROW-17847 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 9.0.0 >Reporter: Jin Shang >Assignee: Jin Shang >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 5.5h > Remaining Estimate: 0h > > -Add an option to parse decimal as unquoted numbers in JSON- > Support both quoted and unquoted decimal in JSON parser automatically. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17866) [Python] List child array invalid
[ https://issues.apache.org/jira/browse/ARROW-17866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610657#comment-17610657 ] Joris Van den Bossche commented on ARROW-17866: --- [~meystingray] thanks for the report! This sounds very similar to ARROW-17137 (which also mentions a possible workaround for now). > [Python] List child array invalid > - > > Key: ARROW-17866 > URL: https://issues.apache.org/jira/browse/ARROW-17866 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 9.0.0 >Reporter: Sean Conroy >Priority: Major > > This issue happens for all the versions of pyarrow I checked (9.0.0, 7.0.0, > 6.0.0, 6.0.1). > Running on Windows 11. > {code:java} > log.to_feather(log_fname) > Traceback (most recent call last): > File "G:\My > Drive\ds-atcore-etl\venv\lib\site-packages\IPython\core\interactiveshell.py", > line 3444, in run_code > exec(code_obj, self.user_global_ns, self.user_ns) > File "", line 1, in > log.to_feather(log_fname) > File "G:\My > Drive\ds-atcore-etl\venv\lib\site-packages\pandas\util\_decorators.py", line > 207, in wrapper > return func(*args, **kwargs) > File "G:\My > Drive\ds-atcore-etl\venv\lib\site-packages\pandas\core\frame.py", line 2519, > in to_feather > to_feather(self, path, **kwargs) > File "G:\My > Drive\ds-atcore-etl\venv\lib\site-packages\pandas\io\feather_format.py", line > 87, in to_feather > feather.write_feather(df, handles.handle, **kwargs) > File "G:\My Drive\ds-atcore-etl\venv\lib\site-packages\pyarrow\feather.py", > line 164, in write_feather > table = Table.from_pandas(df, preserve_index=preserve_index) > File "pyarrow\table.pxi", line 3495, in pyarrow.lib.Table.from_pandas > File "pyarrow\table.pxi", line 3597, in pyarrow.lib.Table.from_arrays > File "pyarrow\table.pxi", line 2793, in pyarrow.lib.Table.validate > File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Column 13: In chunk 0: Invalid: List child array > 
invalid: Invalid: Struct child array #0 has length smaller than expected for > struct array (67186731 < 67186732) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16319) [R] [Docs] Document the lubridate functions we support in {arrow}
[ https://issues.apache.org/jira/browse/ARROW-16319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Farmer reassigned ARROW-16319: --- Assignee: (was: Stephanie Hazlitt) > [R] [Docs] Document the lubridate functions we support in {arrow} > - > > Key: ARROW-16319 > URL: https://issues.apache.org/jira/browse/ARROW-16319 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, R >Affects Versions: 8.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > > Add documentation around the {{lubridate}} functionality supported in > {{arrow}}. Could be made up of: > * a blogpost > * a more in-depth piece of documentation -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-14209) [R] Allow multiple arguments to n_distinct()
[ https://issues.apache.org/jira/browse/ARROW-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Farmer reassigned ARROW-14209: --- Assignee: (was: Dragoș Moldovan-Grünfeld) > [R] Allow multiple arguments to n_distinct() > > > Key: ARROW-14209 > URL: https://issues.apache.org/jira/browse/ARROW-14209 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Priority: Major > > ARROW-13620 and ARROW-14036 added support for the {{n_distinct()}} function > in the dplyr verb {{summarise()}} but only with a single argument. Add > support for multiple arguments to {{n_distinct()}}. This should return the > number of unique combinations of values in the specified columns/expressions. > See the comment about this here: > [https://github.com/apache/arrow/pull/11257#discussion_r720873549] -- This message was sent by Atlassian Jira (v8.20.10#820010)
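The requested semantics — counting unique combinations across several columns rather than distinct values per column — can be sketched in a few lines of plain Python (the column names and data here are invented for illustration):

```python
# n_distinct(cyl, gear) should count unique (cyl, gear) combinations.
# A minimal model of that semantics over two parallel columns:
rows = {"cyl": [4, 4, 6, 6, 8], "gear": [3, 4, 3, 3, 5]}
distinct_pairs = len(set(zip(rows["cyl"], rows["gear"])))
assert distinct_pairs == 4  # {(4, 3), (4, 4), (6, 3), (8, 5)}
```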
[jira] [Commented] (ARROW-12311) [Python][R] Expose (hide?) ScanOptions
[ https://issues.apache.org/jira/browse/ARROW-12311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610656#comment-17610656 ] Todd Farmer commented on ARROW-12311: - This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon. > [Python][R] Expose (hide?) ScanOptions > -- > > Key: ARROW-12311 > URL: https://issues.apache.org/jira/browse/ARROW-12311 > Project: Apache Arrow > Issue Type: Improvement > Components: Python, R >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Fix For: 10.0.0 > > > Currently R completely hides the `ScanOptions` class. > In python the class is exposed but the documentation prefers `dataset.scan` > (which hides both the scanner and the scan options). > However, there is some useful information in the `ScanOptions`. > Specifically, the projected schema (which is a product of the dataset schema > and the projection expression and not easily recreated) and the materialized > fields (the list of fields referenced by either the filter or the projection) > which might be useful for reporting purposes. > Currently R uses the projected schema to convert a list of column names into > a partition schema. Python does not rely on either field. > > Options: > - Keep the status quo > - Expose the ScanOptions object (which itself is exposed via the Scanner) > - Expose the interesting fields via the Scanner > > Currently the C++ design is halfway between the latter two (projected schema > is exposed and options). My preference would be the third option. It raises > a further question about how to expose the scanner itself in Python? 
Should > the user be using ScannerBuilder? Should they use NewScan? Should they use > the scanner directly at all or should it be hidden? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-14138) [R] update metadata when casting a record batch column
[ https://issues.apache.org/jira/browse/ARROW-14138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610654#comment-17610654 ] Todd Farmer commented on ARROW-14138: - This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon. > [R] update metadata when casting a record batch column > -- > > Key: ARROW-14138 > URL: https://issues.apache.org/jira/browse/ARROW-14138 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Romain Francois >Assignee: Romain Francois >Priority: Minor > Fix For: 10.0.0 > > > library(arrow, warn.conflicts = FALSE) > #> See arrow_info() for available features > raws <- structure(list( > as.raw(c(0x70, 0x65, 0x72, 0x73, 0x6f, 0x6e)) > ), class = c("arrow_binary", "vctrs_vctr", "list")) > batch <- record_batch(b = raws) > batch$metadata$r > #> 'arrow_r_metadata' chr > "A\n3\n262147\n197888\n5\nUTF-8\n531\n1\n531\n1\n531\n2\n531\n1\n16\n3\n262153\n12\narrow_binary\n262153\n10\nvc"| > __truncated__ > #> List of 1 > #> $ columns:List of 1 > #> ..$ b:List of 2 > #> .. ..$ attributes:List of 1 > #> .. .. ..$ class: chr [1:3] "arrow_binary" "vctrs_vctr" "list" > #> .. ..$ columns : NULL > # when casting `b` to a string column, the metadata is kept > batch$b <- batch$b$cast(utf8()) > batch$metadata$r > #> 'arrow_r_metadata' chr > "A\n3\n262147\n197888\n5\nUTF-8\n531\n1\n531\n1\n531\n2\n531\n1\n16\n3\n262153\n12\narrow_binary\n262153\n10\nvc"| > __truncated__ > #> List of 1 > #> $ columns:List of 1 > #> ..$ b:List of 2 > #> .. ..$ attributes:List of 1 > #> .. .. ..$ class: chr [1:3] "arrow_binary" "vctrs_vctr" "list" > #> .. 
..$ columns : NULL > # but it should not have > batch2 <- record_batch(b = "string") > batch2$metadata$r > #> NULL -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-14987) [C++]Memory leak while reading parquet file
[ https://issues.apache.org/jira/browse/ARROW-14987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610651#comment-17610651 ] Todd Farmer commented on ARROW-14987: - This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon. > [C++]Memory leak while reading parquet file > --- > > Key: ARROW-14987 > URL: https://issues.apache.org/jira/browse/ARROW-14987 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 6.0.1 >Reporter: Qingxiang Chen >Assignee: Weston Pace >Priority: Major > > When I used parquet to access data, I found that the memory usage was still > high after the function ended. I reproduced this problem in the example. 
Code is shown below:
> {code:c++}
> #include <arrow/api.h>
> #include <arrow/io/api.h>
> #include <parquet/arrow/reader.h>
> #include <parquet/arrow/writer.h>
> #include <parquet/exception.h>
> #include <iostream>
> #include <unistd.h>
> std::shared_ptr<arrow::Table> generate_table() {
>   arrow::Int64Builder i64builder;
>   for (int i = 0; i < 32; i++) {
>     i64builder.Append(i);
>   }
>   std::shared_ptr<arrow::Array> i64array;
>   PARQUET_THROW_NOT_OK(i64builder.Finish(&i64array));
>   std::shared_ptr<arrow::Schema> schema = arrow::schema(
>       {arrow::field("int", arrow::int64())});
>   return arrow::Table::Make(schema, {i64array});
> }
> void write_parquet_file(const arrow::Table& table) {
>   std::shared_ptr<arrow::io::FileOutputStream> outfile;
>   PARQUET_ASSIGN_OR_THROW(
>       outfile,
>       arrow::io::FileOutputStream::Open("parquet-arrow-example.parquet"));
>   PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
>       table, arrow::default_memory_pool(), outfile, 3));
> }
> void read_whole_file() {
>   std::cout << "Reading parquet-arrow-example.parquet at once" << std::endl;
>   std::shared_ptr<arrow::io::ReadableFile> infile;
>   PARQUET_ASSIGN_OR_THROW(
>       infile, arrow::io::ReadableFile::Open("parquet-arrow-example.parquet",
>                                             arrow::default_memory_pool()));
>   std::unique_ptr<parquet::arrow::FileReader> reader;
>   PARQUET_THROW_NOT_OK(
>       parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
>   std::shared_ptr<arrow::Table> table;
>   PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
>   std::cout << "Loaded " << table->num_rows() << " rows in "
>             << table->num_columns() << " columns." << std::endl;
> }
> int main(int argc, char** argv) {
>   std::shared_ptr<arrow::Table> table = generate_table();
>   write_parquet_file(*table);
>   std::cout << "start " << std::endl;
>   read_whole_file();
>   std::cout << "end " << std::endl;
>   sleep(100);
> }
> {code}
> After the end, during sleep, the memory usage is still more than 100M and has not dropped. When I increase the data volume by 5 times, the memory usage is about 500M, and it will not drop.
> I want to know whether this part of the data is cached by the memory pool, or whether it is a memory leak problem. If there is no memory leak, how to set the memory pool size or release the memory? 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16155) [R] lubridate functions for 9.0.0
[ https://issues.apache.org/jira/browse/ARROW-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610655#comment-17610655 ] Todd Farmer commented on ARROW-16155: - This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon. > [R] lubridate functions for 9.0.0 > - > > Key: ARROW-16155 > URL: https://issues.apache.org/jira/browse/ARROW-16155 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 8.0.0 >Reporter: Alessandro Molina >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > > Umbrella ticket for lubridate functions in 9.0.0 > Future work that is not going to happen in v9 is recorded under > https://issues.apache.org/jira/browse/ARROW-16841 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-14138) [R] update metadata when casting a record batch column
[ https://issues.apache.org/jira/browse/ARROW-14138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Farmer reassigned ARROW-14138: --- Assignee: (was: Romain Francois) > [R] update metadata when casting a record batch column > -- > > Key: ARROW-14138 > URL: https://issues.apache.org/jira/browse/ARROW-14138 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Romain Francois >Priority: Minor > Fix For: 10.0.0 > > > library(arrow, warn.conflicts = FALSE) > #> See arrow_info() for available features > raws <- structure(list( > as.raw(c(0x70, 0x65, 0x72, 0x73, 0x6f, 0x6e)) > ), class = c("arrow_binary", "vctrs_vctr", "list")) > batch <- record_batch(b = raws) > batch$metadata$r > #> 'arrow_r_metadata' chr > "A\n3\n262147\n197888\n5\nUTF-8\n531\n1\n531\n1\n531\n2\n531\n1\n16\n3\n262153\n12\narrow_binary\n262153\n10\nvc"| > __truncated__ > #> List of 1 > #> $ columns:List of 1 > #> ..$ b:List of 2 > #> .. ..$ attributes:List of 1 > #> .. .. ..$ class: chr [1:3] "arrow_binary" "vctrs_vctr" "list" > #> .. ..$ columns : NULL > # when casting `b` to a string column, the metadata is kept > batch$b <- batch$b$cast(utf8()) > batch$metadata$r > #> 'arrow_r_metadata' chr > "A\n3\n262147\n197888\n5\nUTF-8\n531\n1\n531\n1\n531\n2\n531\n1\n16\n3\n262153\n12\narrow_binary\n262153\n10\nvc"| > __truncated__ > #> List of 1 > #> $ columns:List of 1 > #> ..$ b:List of 2 > #> .. ..$ attributes:List of 1 > #> .. .. ..$ class: chr [1:3] "arrow_binary" "vctrs_vctr" "list" > #> .. ..$ columns : NULL > # but it should not have > batch2 <- record_batch(b = "string") > batch2$metadata$r > #> NULL -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-14987) [C++]Memory leak while reading parquet file
[ https://issues.apache.org/jira/browse/ARROW-14987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Farmer reassigned ARROW-14987: --- Assignee: (was: Weston Pace) > [C++]Memory leak while reading parquet file > --- > > Key: ARROW-14987 > URL: https://issues.apache.org/jira/browse/ARROW-14987 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 6.0.1 >Reporter: Qingxiang Chen >Priority: Major > > When I used parquet to access data, I found that the memory usage was still > high after the function ended. I reproduced this problem in the example. Code is shown below:
> {code:c++}
> #include <arrow/api.h>
> #include <arrow/io/api.h>
> #include <parquet/arrow/reader.h>
> #include <parquet/arrow/writer.h>
> #include <parquet/exception.h>
> #include <iostream>
> #include <unistd.h>
> std::shared_ptr<arrow::Table> generate_table() {
>   arrow::Int64Builder i64builder;
>   for (int i = 0; i < 32; i++) {
>     i64builder.Append(i);
>   }
>   std::shared_ptr<arrow::Array> i64array;
>   PARQUET_THROW_NOT_OK(i64builder.Finish(&i64array));
>   std::shared_ptr<arrow::Schema> schema = arrow::schema(
>       {arrow::field("int", arrow::int64())});
>   return arrow::Table::Make(schema, {i64array});
> }
> void write_parquet_file(const arrow::Table& table) {
>   std::shared_ptr<arrow::io::FileOutputStream> outfile;
>   PARQUET_ASSIGN_OR_THROW(
>       outfile,
>       arrow::io::FileOutputStream::Open("parquet-arrow-example.parquet"));
>   PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
>       table, arrow::default_memory_pool(), outfile, 3));
> }
> void read_whole_file() {
>   std::cout << "Reading parquet-arrow-example.parquet at once" << std::endl;
>   std::shared_ptr<arrow::io::ReadableFile> infile;
>   PARQUET_ASSIGN_OR_THROW(
>       infile, arrow::io::ReadableFile::Open("parquet-arrow-example.parquet",
>                                             arrow::default_memory_pool()));
>   std::unique_ptr<parquet::arrow::FileReader> reader;
>   PARQUET_THROW_NOT_OK(
>       parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
>   std::shared_ptr<arrow::Table> table;
>   PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
>   std::cout << "Loaded " << table->num_rows() << " rows in "
>             << table->num_columns() << " columns." << std::endl;
> }
> int main(int argc, char** argv) {
>   std::shared_ptr<arrow::Table> table = generate_table();
>   write_parquet_file(*table);
>   std::cout << "start " << std::endl;
>   read_whole_file();
>   std::cout << "end " << std::endl;
>   sleep(100);
> }
> {code}
> After the end, during sleep, the memory usage is still more than 100M and has not dropped. When I increase the data volume by 5 times, the memory usage is about 500M, and it will not drop.
> I want to know whether this part of the data is cached by the memory pool, or whether it is a memory leak problem. If there is no memory leak, how to set the memory pool size or release the memory? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16155) [R] lubridate functions for 9.0.0
[ https://issues.apache.org/jira/browse/ARROW-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Farmer reassigned ARROW-16155: --- Assignee: (was: Dragoș Moldovan-Grünfeld) > [R] lubridate functions for 9.0.0 > - > > Key: ARROW-16155 > URL: https://issues.apache.org/jira/browse/ARROW-16155 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 8.0.0 >Reporter: Alessandro Molina >Priority: Major > > Umbrella ticket for lubridate functions in 9.0.0 > Future work that is not going to happen in v9 is recorded under > https://issues.apache.org/jira/browse/ARROW-16841 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16319) [R] [Docs] Document the lubridate functions we support in {arrow}
[ https://issues.apache.org/jira/browse/ARROW-16319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610653#comment-17610653 ] Todd Farmer commented on ARROW-16319: - This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon. > [R] [Docs] Document the lubridate functions we support in {arrow} > - > > Key: ARROW-16319 > URL: https://issues.apache.org/jira/browse/ARROW-16319 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, R >Affects Versions: 8.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Stephanie Hazlitt >Priority: Major > > Add documentation around the {{lubridate}} functionality supported in > {{arrow}}. Could be made up of: > * a blogpost > * a more in-depth piece of documentation -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-14588) [R] Create an arrow-specific checklist for a CRAN release
[ https://issues.apache.org/jira/browse/ARROW-14588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Farmer reassigned ARROW-14588: --- Assignee: (was: Dragoș Moldovan-Grünfeld) > [R] Create an arrow-specific checklist for a CRAN release > --- > > Key: ARROW-14588 > URL: https://issues.apache.org/jira/browse/ARROW-14588 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Priority: Minor > > This would adapt and implement the functionality of > {{usethis::use_release_issue()}} for {{arrow}}'s specific context. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-12311) [Python][R] Expose (hide?) ScanOptions
[ https://issues.apache.org/jira/browse/ARROW-12311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Farmer reassigned ARROW-12311: --- Assignee: (was: Weston Pace) > [Python][R] Expose (hide?) ScanOptions > -- > > Key: ARROW-12311 > URL: https://issues.apache.org/jira/browse/ARROW-12311 > Project: Apache Arrow > Issue Type: Improvement > Components: Python, R >Reporter: Weston Pace >Priority: Major > Fix For: 10.0.0 > > > Currently R completely hides the `ScanOptions` class. > In python the class is exposed but the documentation prefers `dataset.scan` > (which hides both the scanner and the scan options). > However, there is some useful information in the `ScanOptions`. > Specifically, the projected schema (which is a product of the dataset schema > and the projection expression and not easily recreated) and the materialized > fields (the list of fields referenced by either the filter or the projection) > which might be useful for reporting purposes. > Currently R uses the projected schema to convert a list of column names into > a partition schema. Python does not rely on either field. > > Options: > - Keep the status quo > - Expose the ScanOptions object (which itself is exposed via the Scanner) > - Expose the interesting fields via the Scanner > > Currently the C++ design is halfway between the latter two (projected schema > is exposed and options). My preference would be the third option. It raises > a further question about how to expose the scanner itself in Python? Should > the user be using ScannerBuilder? Should they use NewScan? Should they use > the scanner directly at all or should it be hidden? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-14209) [R] Allow multiple arguments to n_distinct()
[ https://issues.apache.org/jira/browse/ARROW-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610650#comment-17610650 ] Todd Farmer commented on ARROW-14209: - This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon. > [R] Allow multiple arguments to n_distinct() > > > Key: ARROW-14209 > URL: https://issues.apache.org/jira/browse/ARROW-14209 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > > ARROW-13620 and ARROW-14036 added support for the {{n_distinct()}} function > in the dplyr verb {{summarise()}} but only with a single argument. Add > support for multiple arguments to {{n_distinct()}}. This should return the > number of unique combinations of values in the specified columns/expressions. > See the comment about this here: > [https://github.com/apache/arrow/pull/11257#discussion_r720873549] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-14588) [R] Create an arrow-specific checklist for a CRAN release
[ https://issues.apache.org/jira/browse/ARROW-14588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610652#comment-17610652 ] Todd Farmer commented on ARROW-14588: - This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon. > [R] Create an arrow-specific checklist for a CRAN release > --- > > Key: ARROW-14588 > URL: https://issues.apache.org/jira/browse/ARROW-14588 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Dragoș Moldovan-Grünfeld >Priority: Minor > > This would adapt and implement the functionality of > {{usethis::use_release_issue()}} for {{arrow}}'s specific context. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17877) [CI][Python] verify-rc python nightly builds fail due to missing some flags that were activated with ARROW_PYTHON=ON
[ https://issues.apache.org/jira/browse/ARROW-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610646#comment-17610646 ] Joris Van den Bossche commented on ARROW-17877: --- See also ARROW-17868, where [~kou] suggested adding the option back but deprecating it. (That still means we should also update all internal usage and docs that use it.) > [CI][Python] verify-rc python nightly builds fail due to missing some flags > that were activated with ARROW_PYTHON=ON > > > Key: ARROW-17877 > URL: https://issues.apache.org/jira/browse/ARROW-17877 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, Python >Reporter: Raúl Cumplido >Assignee: Raúl Cumplido >Priority: Blocker > Labels: Nightly > Fix For: 10.0.0 > > > Some of our nightly builds are failing with: > {code:java} > [ 35%] Building CXX object CMakeFiles/_dataset.dir/_dataset.cpp.o > /arrow/python/build/temp.linux-x86_64-cpython-38/_dataset.cpp:833:10: fatal > error: arrow/csv/api.h: No such file or directory > #include "arrow/csv/api.h" > ^ > compilation terminated.{code} > I suspect the flags included CSV=ON when building with PYTHON=ON changes here > might be related: > [https://github.com/apache/arrow/commit/53ac2a00aa9ff199773513f6f996f73a07b37989] > Example of nightly failures: > https://github.com/ursacomputing/crossbow/actions/runs/3135833175/jobs/5091988801 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17882) [Java][Doc] Document build & use of new artifact on Windows environment
David Dali Susanibar Arce created ARROW-17882: - Summary: [Java][Doc] Document build & use of new artifact on Windows environment Key: ARROW-17882 URL: https://issues.apache.org/jira/browse/ARROW-17882 Project: Apache Arrow Issue Type: Sub-task Components: Documentation, Java Reporter: David Dali Susanibar Arce Assignee: David Dali Susanibar Arce * Update build documentation with new Windows JNI DLL support * Update use documentation with new Windows JNI DLL support -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17546) [C++] Remove pre-C++17 compatibility measures
[ https://issues.apache.org/jira/browse/ARROW-17546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-17546. -- Resolution: Fixed > [C++] Remove pre-C++17 compatibility measures > - > > Key: ARROW-17546 > URL: https://issues.apache.org/jira/browse/ARROW-17546 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Critical > Fix For: 10.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15075) [C++][Dataset] Implement Dataset for reading JSON format
[ https://issues.apache.org/jira/browse/ARROW-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15075: --- Summary: [C++][Dataset] Implement Dataset for reading JSON format (was: [C++][Dataset] Implement Dataset for JSON format) > [C++][Dataset] Implement Dataset for reading JSON format > > > Key: ARROW-15075 > URL: https://issues.apache.org/jira/browse/ARROW-15075 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Will Jones >Assignee: Ben Harkins >Priority: Major > Labels: dataset > > We already have support for reading individual files, but not yet for reading > datasets. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17877) [CI][Python] verify-rc python nightly builds fail due to missing some flags that were activated with ARROW_PYTHON=ON
[ https://issues.apache.org/jira/browse/ARROW-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido updated ARROW-17877: -- Fix Version/s: 10.0.0 > [CI][Python] verify-rc python nightly builds fail due to missing some flags > that were activated with ARROW_PYTHON=ON > > > Key: ARROW-17877 > URL: https://issues.apache.org/jira/browse/ARROW-17877 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, Python >Reporter: Raúl Cumplido >Assignee: Raúl Cumplido >Priority: Blocker > Labels: Nightly > Fix For: 10.0.0 > > > Some of our nightly builds are failing with: > {code:java} > [ 35%] Building CXX object CMakeFiles/_dataset.dir/_dataset.cpp.o > /arrow/python/build/temp.linux-x86_64-cpython-38/_dataset.cpp:833:10: fatal > error: arrow/csv/api.h: No such file or directory > #include "arrow/csv/api.h" > ^ > compilation terminated.{code} > I suspect the flags included CSV=ON when building with PYTHON=ON changes here > might be related: > [https://github.com/apache/arrow/commit/53ac2a00aa9ff199773513f6f996f73a07b37989] > Example of nightly failures: > https://github.com/ursacomputing/crossbow/actions/runs/3135833175/jobs/5091988801 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17881) [C++] Not able to build the project with the latest commit of the master branch
[ https://issues.apache.org/jira/browse/ARROW-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610617#comment-17610617 ] Antoine Pitrou commented on ARROW-17881: If you're using Homebrew, this is because of https://github.com/Homebrew/homebrew-core/issues/111810 In any case, I recommend passing {{-DGTest_SOURCE=BUNDLED}} to CMake so that GTest is built from source in C++17 mode. > [C++] Not able to build the project with the latest commit of the master > branch > --- > > Key: ARROW-17881 > URL: https://issues.apache.org/jira/browse/ARROW-17881 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Anirudh Acharya >Priority: Major > > I am trying to build the arrow C++ project with the latest commit( 9af43f11b) > from the master branch using this guide - > [https://arrow.apache.org/docs/developers/cpp/building.html] But the build > fails with the following error - > {code:java} > [ 58%] Linking CXX executable ../../debug/arrow-array-test > Undefined symbols for architecture x86_64: > "testing::Matcher std::__1::char_traits > const&>::Matcher(char const*)", referenced from: > testing::Matcher std::__1::char_traits > const&> > testing::internal::MatcherCastImpl std::__1::char_traits > const&, char const*>::CastImpl(char > const* const&, std::__1::integral_constant, > std::__1::integral_constant) in array_test.cc.o > testing::Matcher std::__1::char_traits > const&> > testing::internal::MatcherCastImpl std::__1::char_traits > const&, char const*>::CastImpl(char > const* const&, std::__1::integral_constant, > std::__1::integral_constant) in array_binary_test.cc.o > ld: symbol(s) not found for architecture x86_64 > clang-14: error: linker command failed with exit code 1 (use -v to see > invocation) > make[2]: *** [src/arrow/CMakeFiles/arrow-array-test.dir/build.make:207: > debug/arrow-array-test] Error 1 > make[1]: *** [CMakeFiles/Makefile2:1653: > src/arrow/CMakeFiles/arrow-array-test.dir/all] Error 2 > 
make[1]: *** Waiting for unfinished jobs > [ 58%] Building CXX object > src/arrow/CMakeFiles/arrow-table-test.dir/table_test.cc.o > [ 58%] Building CXX object > src/parquet/CMakeFiles/parquet_objlib.dir/types.cc.o > [ 58%] Building CXX object > src/arrow/CMakeFiles/arrow-table-test.dir/table_builder_test.cc.o > [ 58%] Building CXX object > src/parquet/CMakeFiles/parquet_objlib.dir/level_comparison_avx2.cc.o > [ 58%] Building CXX object > src/parquet/CMakeFiles/parquet_objlib.dir/level_conversion_bmi2.cc.o > [ 58%] Building CXX object > src/parquet/CMakeFiles/parquet_objlib.dir/encryption/encryption_internal.cc.o > [ 59%] Building CXX object > src/parquet/CMakeFiles/parquet_objlib.dir/encryption/crypto_factory.cc.o > [ 60%] Linking CXX executable ../../debug/arrow-table-test > [ 60%] Building CXX object > src/parquet/CMakeFiles/parquet_objlib.dir/encryption/file_key_unwrapper.cc.o > [ 60%] Built target arrow-table-test > [ 60%] Building CXX object > src/parquet/CMakeFiles/parquet_objlib.dir/encryption/file_key_wrapper.cc.o > [ 60%] Building CXX object > src/parquet/CMakeFiles/parquet_objlib.dir/encryption/kms_client.cc.o > [ 60%] Building CXX object > src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_material.cc.o > [ 61%] Building CXX object > src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_metadata.cc.o > [ 61%] Building CXX object > src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_toolkit.cc.o > [ 61%] Building CXX object > src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_toolkit_internal.cc.o > [ 61%] Building CXX object > src/parquet/CMakeFiles/parquet_objlib.dir/encryption/local_wrap_kms_client.cc.o > [ 61%] Built target parquet_objlib > make: *** [Makefile:146: all] Error 2 {code} > > I am compiling this on macOS Monterey Version 12.0.1. 
and versions of GCC, > python and clang are as follows - > {code:java} > $ clang --version > clang version 14.0.4 > Target: x86_64-apple-darwin21.1.0 > Thread model: posix > $ python --version > Python 3.9.13 > $ gcc --version > Configured with: --prefix=/Library/Developer/CommandLineTools/usr > --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1 > Apple clang version 12.0.5 (clang-1205.0.22.9) > Target: x86_64-apple-darwin21.1.0 > Thread model: posix > InstalledDir: /Library/Developer/CommandLineTools/usr/bin {code} > > I see that there were nightly job failures for macOS that were reported in > the mailing list - > [https://lists.apache.org/thread/rrdwxw1st4vdcf3nh5nqfo16n3ymj90x] I am not > sure if this failure is related to the issue I am reporting. -- This message was sent by Atlassian Jira (v8.20.10#820010)
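Antoine's workaround above translates to a configure step along these lines (a sketch; the source and build directory paths are illustrative, and `-DGTest_SOURCE=BUNDLED` forces GTest to be compiled from source with the same C++17 settings as Arrow rather than linking a prebuilt Homebrew copy):

```shell
# Configure Arrow C++ with tests enabled and a bundled, from-source GTest.
cmake -S arrow/cpp -B arrow/cpp/build \
      -DCMAKE_BUILD_TYPE=Debug \
      -DARROW_BUILD_TESTS=ON \
      -DGTest_SOURCE=BUNDLED
cmake --build arrow/cpp/build
```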
[jira] [Updated] (ARROW-17881) [C++] Not able to build the project with the latest commit of the master branch
[ https://issues.apache.org/jira/browse/ARROW-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Acharya updated ARROW-17881: Description: I am trying to build the arrow C++ project with the latest commit( 9af43f11b) from the master branch using this guide - [https://arrow.apache.org/docs/developers/cpp/building.html] But the build fails with the following error - {code:java} [ 58%] Linking CXX executable ../../debug/arrow-array-test Undefined symbols for architecture x86_64: "testing::Matcher > const&>::Matcher(char const*)", referenced from: testing::Matcher > const&> testing::internal::MatcherCastImpl > const&, char const*>::CastImpl(char const* const&, std::__1::integral_constant, std::__1::integral_constant) in array_test.cc.o testing::Matcher > const&> testing::internal::MatcherCastImpl > const&, char const*>::CastImpl(char const* const&, std::__1::integral_constant, std::__1::integral_constant) in array_binary_test.cc.o ld: symbol(s) not found for architecture x86_64 clang-14: error: linker command failed with exit code 1 (use -v to see invocation) make[2]: *** [src/arrow/CMakeFiles/arrow-array-test.dir/build.make:207: debug/arrow-array-test] Error 1 make[1]: *** [CMakeFiles/Makefile2:1653: src/arrow/CMakeFiles/arrow-array-test.dir/all] Error 2 make[1]: *** Waiting for unfinished jobs [ 58%] Building CXX object src/arrow/CMakeFiles/arrow-table-test.dir/table_test.cc.o [ 58%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/types.cc.o [ 58%] Building CXX object src/arrow/CMakeFiles/arrow-table-test.dir/table_builder_test.cc.o [ 58%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/level_comparison_avx2.cc.o [ 58%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/level_conversion_bmi2.cc.o [ 58%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/encryption_internal.cc.o [ 59%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/crypto_factory.cc.o 
[ 60%] Linking CXX executable ../../debug/arrow-table-test [ 60%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/file_key_unwrapper.cc.o [ 60%] Built target arrow-table-test [ 60%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/file_key_wrapper.cc.o [ 60%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/kms_client.cc.o [ 60%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_material.cc.o [ 61%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_metadata.cc.o [ 61%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_toolkit.cc.o [ 61%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_toolkit_internal.cc.o [ 61%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/local_wrap_kms_client.cc.o [ 61%] Built target parquet_objlib make: *** [Makefile:146: all] Error 2 {code} I am compiling this on macOS Monterey Version 12.0.1. and versions of GCC, python and clang are as follows - {code:java} $ clang --version clang version 14.0.4 Target: x86_64-apple-darwin21.1.0 Thread model: posix $ python --version Python 3.9.13 $ gcc --version Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1 Apple clang version 12.0.5 (clang-1205.0.22.9) Target: x86_64-apple-darwin21.1.0 Thread model: posix InstalledDir: /Library/Developer/CommandLineTools/usr/bin {code} I see that there were nightly job failures for macOS that were reported in the mailing list - [https://lists.apache.org/thread/rrdwxw1st4vdcf3nh5nqfo16n3ymj90x] I am not sure if this failure is related to the issue I am reporting. 
was: I am trying to build the arrow C++ project with the latest commit( 9af43f11b) from the master branch using this guide - [https://arrow.apache.org/docs/developers/cpp/building.html] But the build fails with the following error - {code:java} [ 58%] Linking CXX executable ../../debug/arrow-array-test Undefined symbols for architecture x86_64: "testing::Matcher > const&>::Matcher(char const*)", referenced from: testing::Matcher > const&> testing::internal::MatcherCastImpl > const&, char const*>::CastImpl(char const* const&, std::__1::integral_constant, std::__1::integral_constant) in array_test.cc.o testing::Matcher > const&> testing::internal::MatcherCastImpl > const&, char const*>::CastImpl(char const* const&, std::__1::integral_constant, std::__1::integral_constant) in array_binary_test.cc.o ld: symbol(s) not found for architecture x86_64 clang-14: error: linker command failed with exit code 1 (use -v to see invocation) make[2]: *** [src/arrow/CMakeFiles/arrow-array-test.dir/build.make:207: debug/arrow-array-test] Error 1 make[1]: *** [
[jira] [Created] (ARROW-17881) [C++] Not able to build the project with the latest commit of the master branch
Anirudh Acharya created ARROW-17881: --- Summary: [C++] Not able to build the project with the latest commit of the master branch Key: ARROW-17881 URL: https://issues.apache.org/jira/browse/ARROW-17881 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Anirudh Acharya I am trying to build the arrow C++ project with the latest commit( 9af43f11b) from the master branch using this guide - [https://arrow.apache.org/docs/developers/cpp/building.html] But the build fails with the following error - {code:java} [ 58%] Linking CXX executable ../../debug/arrow-array-test Undefined symbols for architecture x86_64: "testing::Matcher > const&>::Matcher(char const*)", referenced from: testing::Matcher > const&> testing::internal::MatcherCastImpl > const&, char const*>::CastImpl(char const* const&, std::__1::integral_constant, std::__1::integral_constant) in array_test.cc.o testing::Matcher > const&> testing::internal::MatcherCastImpl > const&, char const*>::CastImpl(char const* const&, std::__1::integral_constant, std::__1::integral_constant) in array_binary_test.cc.o ld: symbol(s) not found for architecture x86_64 clang-14: error: linker command failed with exit code 1 (use -v to see invocation) make[2]: *** [src/arrow/CMakeFiles/arrow-array-test.dir/build.make:207: debug/arrow-array-test] Error 1 make[1]: *** [CMakeFiles/Makefile2:1653: src/arrow/CMakeFiles/arrow-array-test.dir/all] Error 2 make[1]: *** Waiting for unfinished jobs [ 58%] Building CXX object src/arrow/CMakeFiles/arrow-table-test.dir/table_test.cc.o [ 58%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/types.cc.o [ 58%] Building CXX object src/arrow/CMakeFiles/arrow-table-test.dir/table_builder_test.cc.o [ 58%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/level_comparison_avx2.cc.o [ 58%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/level_conversion_bmi2.cc.o [ 58%] Building CXX object 
src/parquet/CMakeFiles/parquet_objlib.dir/encryption/encryption_internal.cc.o [ 59%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/crypto_factory.cc.o [ 60%] Linking CXX executable ../../debug/arrow-table-test [ 60%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/file_key_unwrapper.cc.o [ 60%] Built target arrow-table-test [ 60%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/file_key_wrapper.cc.o [ 60%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/kms_client.cc.o [ 60%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_material.cc.o [ 61%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_metadata.cc.o [ 61%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_toolkit.cc.o [ 61%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_toolkit_internal.cc.o [ 61%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/local_wrap_kms_client.cc.o [ 61%] Built target parquet_objlib make: *** [Makefile:146: all] Error 2 {code} I am compiling this on macOS Monterey Version 12.0.1. 
and versions of GCC, python and clang are as follows - {code:java} $ clang --version clang version 14.0.4 Target: x86_64-apple-darwin21.1.0 Thread model: posix InstalledDir: /Users/anirudhacharya/miniconda3/envs/pyarrow-dev/bin $ python --version Python 3.9.13 $ gcc --version Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1 Apple clang version 12.0.5 (clang-1205.0.22.9) Target: x86_64-apple-darwin21.1.0 Thread model: posix InstalledDir: /Library/Developer/CommandLineTools/usr/bin {code} I see that there were nightly job failures for macOS that were reported in the mailing list - [https://lists.apache.org/thread/rrdwxw1st4vdcf3nh5nqfo16n3ymj90x] I am not sure if this failure is related to the issue I am reporting. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17850) [Java] Upgrade netty-codec-http dependencies
[ https://issues.apache.org/jira/browse/ARROW-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17850: --- Labels: pull-request-available (was: ) > [Java] Upgrade netty-codec-http dependencies > > > Key: ARROW-17850 > URL: https://issues.apache.org/jira/browse/ARROW-17850 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 9.0.0 >Reporter: Hui Yu >Assignee: David Dali Susanibar Arce >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > [CVE-2022-24823]([https://github.com/advisories/GHSA-269q-hmxg-m83q]) reports > a security vulnerability for *netty-codec-http* > Now the version of *netty-codec-http* in the master branch is *4.1.72.Final,* > that is unsafe. > The ticket https://issues.apache.org/jira/browse/ARROW-16996 bumps > *netty-codec* to {*}4.1.78.Final{*}, it didn't bump *netty-codec-http.* > Can you upgrade the version of *netty-codec-http* ? > > Here is my output of mvn:dependency now: > ```bash > [INFO] +- org.apache.arrow:flight-core:jar:9.0.0:compile > [INFO] | +- io.grpc:grpc-netty:jar:1.47.0:compile > [INFO] | | +- io.netty:netty-codec-http2:jar:4.1.72.Final:compile > [INFO] | | | - io.netty:{*}netty-codec-http{*}:jar:4.1.72.Final:compile > [INFO] | | +- io.netty:netty-handler-proxy:jar:4.1.72.Final:runtime > [INFO] | | | - io.netty:netty-codec-socks:jar:4.1.72.Final:runtime > [INFO] | | +- > com.google.errorprone:error_prone_annotations:jar:2.10.0:compile > [INFO] | | +- io.perfmark:perfmark-api:jar:0.25.0:runtime > [INFO] | | - > io.netty:netty-transport-native-unix-common:jar:4.1.72.Final:compile > [INFO] | +- io.grpc:grpc-core:jar:1.47.0:compile > [INFO] | | +- com.google.android:annotations:jar:4.1.1.4:runtime > [INFO] | | - org.codehaus.mojo:animal-sniffer-annotations:jar:1.19:runtime > [INFO] | +- io.grpc:grpc-context:jar:1.47.0:compile > [INFO] | +- io.grpc:grpc-protobuf:jar:1.47.0:compile > [INFO] | | +- > 
com.google.api.grpc:proto-google-common-protos:jar:2.0.1:compile > [INFO] | | - io.grpc:grpc-protobuf-lite:jar:1.47.0:compile > [INFO] | +- io.netty:netty-tcnative-boringssl-static:jar:2.0.53.Final:compile > [INFO] | | +- io.netty:netty-tcnative-classes:jar:2.0.53.Final:compile > [INFO] | | +- > io.netty:netty-tcnative-boringssl-static:jar:linux-x86_64:2.0.53.Final:compile > [INFO] | | +- > io.netty:netty-tcnative-boringssl-static:jar:linux-aarch_64:2.0.53.Final:compile > [INFO] | | +- > io.netty:netty-tcnative-boringssl-static:jar:osx-x86_64:2.0.53.Final:compile > [INFO] | | +- > io.netty:netty-tcnative-boringssl-static:jar:osx-aarch_64:2.0.53.Final:compile > [INFO] | | - > io.netty:netty-tcnative-boringssl-static:jar:windows-x86_64:2.0.53.Final:compile > [INFO] | +- io.netty:netty-handler:jar:4.1.78.Final:compile > [INFO] | | +- io.netty:netty-resolver:jar:4.1.78.Final:compile > [INFO] | | - io.netty:netty-codec:jar:4.1.78.Final:compile > [INFO] | +- io.netty:netty-transport:jar:4.1.78.Final:compile > [INFO] | +- com.google.guava:guava:jar:30.1.1-jre:compile > [INFO] | | +- com.google.guava:failureaccess:jar:1.0.1:compile > [INFO] | | +- > com.google.guava:listenablefuture:jar:.0-empty-to-avoid-conflict-with-guava:compile > [INFO] | | +- org.checkerframework:checker-qual:jar:3.8.0:compile > [INFO] | | - com.google.j2objc:j2objc-annotations:jar:1.3:compile > [INFO] | +- io.grpc:grpc-stub:jar:1.47.0:compile > [INFO] | +- com.google.protobuf:protobuf-java:jar:3.21.2:compile > [INFO] | +- io.grpc:grpc-api:jar:1.47.0:compile > [INFO] | - javax.annotation:javax.annotation-api:jar:1.3.2:compile > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
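Until an upstream bump lands, consumers of `flight-core` can pin the transitive artifact themselves with Maven's standard `dependencyManagement` override. A sketch (the version shown matches the `netty-codec` bump mentioned above and is an assumption; pick whatever patched line your audit requires):

```xml
<!-- Force the transitive netty-codec-http up to a patched release line,
     overriding the 4.1.72.Final pulled in via grpc-netty. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>io.netty</groupId>
      <artifactId>netty-codec-http</artifactId>
      <version>4.1.78.Final</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Re-running `mvn dependency:tree` afterwards should show the overridden version in place of 4.1.72.Final.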
[jira] [Resolved] (ARROW-17865) [Java] Deprecate Plasma JNI bindings
[ https://issues.apache.org/jira/browse/ARROW-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-17865. -- Resolution: Fixed Issue resolved by pull request 14262 [https://github.com/apache/arrow/pull/14262] > [Java] Deprecate Plasma JNI bindings > > > Key: ARROW-17865 > URL: https://issues.apache.org/jira/browse/ARROW-17865 > Project: Apache Arrow > Issue Type: Sub-task > Components: Java >Reporter: Antoine Pitrou >Assignee: David Dali Susanibar Arce >Priority: Blocker > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17867) [C++][FlightRPC] Expose bulk parameter binding in Flight SQL client
[ https://issues.apache.org/jira/browse/ARROW-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17867: --- Labels: pull-request-available (was: ) > [C++][FlightRPC] Expose bulk parameter binding in Flight SQL client > --- > > Key: ARROW-17867 > URL: https://issues.apache.org/jira/browse/ARROW-17867 > Project: Apache Arrow > Issue Type: Improvement >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Also fix various issues noticed as part of ARROW-17661 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17880) [Go] Add support for Decimal types in go/arrow/csv
[ https://issues.apache.org/jira/browse/ARROW-17880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mitchell Devenport updated ARROW-17880: --- Summary: [Go] Add support for Decimal types in go/arrow/csv (was: Add support for Decimal types in go/arrow/csv) > [Go] Add support for Decimal types in go/arrow/csv > -- > > Key: ARROW-17880 > URL: https://issues.apache.org/jira/browse/ARROW-17880 > Project: Apache Arrow > Issue Type: Improvement > Components: Go >Reporter: Mitchell Devenport >Priority: Major > > The Go CSV library lacks support for Decimal types which are supported by the > C++ CSV library: > [arrow/writer.cc at master · apache/arrow > (github.com)|https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L378] > [arrow/type_traits.h at master · apache/arrow > (github.com)|https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L642] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17880) Add support for Decimal types in go/arrow/csv
Mitchell Devenport created ARROW-17880: -- Summary: Add support for Decimal types in go/arrow/csv Key: ARROW-17880 URL: https://issues.apache.org/jira/browse/ARROW-17880 Project: Apache Arrow Issue Type: Improvement Components: Go Reporter: Mitchell Devenport The Go CSV library lacks support for Decimal types which are supported by the C++ CSV library: [arrow/writer.cc at master · apache/arrow (github.com)|https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L378] [arrow/type_traits.h at master · apache/arrow (github.com)|https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L642] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17879) [R] Intermittent memory leaks in the valgrind nightly test
[ https://issues.apache.org/jira/browse/ARROW-17879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dewey Dunnington reassigned ARROW-17879: Assignee: Dewey Dunnington > [R] Intermittent memory leaks in the valgrind nightly test > -- > > Key: ARROW-17879 > URL: https://issues.apache.org/jira/browse/ARROW-17879 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Dewey Dunnington >Assignee: Dewey Dunnington >Priority: Major > Fix For: 10.0.0 > > > The memory leaks that were fixed by a workaround before the last release > (ARROW-17252) are present again. I had hoped that the improvements to the > captured R thread infrastructure in ARROW-11841 and ARROW-17178 would fix > this; however, they don't (and it's not even clear that the failures are > related to that, since as part of diagnosing those failures the last time I > disabled the safe call infrastructure completely and was still able to > observe failures). > These failures need to be debugged before the release! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17879) [R] Intermittent memory leaks in the valgrind nightly test
Dewey Dunnington created ARROW-17879: Summary: [R] Intermittent memory leaks in the valgrind nightly test Key: ARROW-17879 URL: https://issues.apache.org/jira/browse/ARROW-17879 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Dewey Dunnington Fix For: 10.0.0 The memory leaks that were fixed by a workaround before the last release (ARROW-17252) are present again. I had hoped that the improvements to the captured R thread infrastructure in ARROW-11841 and ARROW-17178 would fix this; however, they don't (and it's not even clear that the failures are related to that, since as part of diagnosing those failures the last time I disabled the safe call infrastructure completely and was still able to observe failures). These failures need to be debugged before the release! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17878) [Website] Exclude Ballista docs from being deleted
Andy Grove created ARROW-17878: -- Summary: [Website] Exclude Ballista docs from being deleted Key: ARROW-17878 URL: https://issues.apache.org/jira/browse/ARROW-17878 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Andy Grove Exclude Ballista docs from being deleted -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17850) [Java] Upgrade netty-codec-http dependencies
[ https://issues.apache.org/jira/browse/ARROW-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610595#comment-17610595 ] David Dali Susanibar Arce commented on ARROW-17850: --- Updated to: {code:java} $ mvn dependency:tree --debug | grep netty-codec-http [DEBUG] io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version managed from 4.1.77.Final) [DEBUG] io.netty:netty-codec-http:jar:4.1.82.Final:compile (version managed from 4.1.82.Final) [DEBUG] io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version managed from 4.1.77.Final) [DEBUG] io.netty:netty-codec-http:jar:4.1.82.Final:compile (version managed from 4.1.82.Final) [INFO] | +- io.netty:netty-codec-http2:jar:4.1.82.Final:compile [INFO] | | \- io.netty:netty-codec-http:jar:4.1.82.Final:compile [DEBUG] io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version managed from 4.1.77.Final) [DEBUG] io.netty:netty-codec-http:jar:4.1.82.Final:compile (version managed from 4.1.82.Final) [DEBUG] io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version managed from 4.1.77.Final) [DEBUG] io.netty:netty-codec-http:jar:4.1.82.Final:compile (version managed from 4.1.82.Final) [INFO] | | +- io.netty:netty-codec-http2:jar:4.1.82.Final:compile [INFO] | | | \- io.netty:netty-codec-http:jar:4.1.82.Final:compile [DEBUG] io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version managed from 4.1.77.Final) [DEBUG] io.netty:netty-codec-http:jar:4.1.82.Final:compile (version managed from 4.1.82.Final) [DEBUG] io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version managed from 4.1.77.Final) [DEBUG] io.netty:netty-codec-http:jar:4.1.82.Final:compile (version managed from 4.1.82.Final) [INFO] | | +- io.netty:netty-codec-http2:jar:4.1.82.Final:compile [INFO] | | | \- io.netty:netty-codec-http:jar:4.1.82.Final:compile [DEBUG] io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version managed from 4.1.77.Final) [DEBUG] io.netty:netty-codec-http:jar:4.1.82.Final:compile 
(version managed from 4.1.82.Final) [DEBUG] io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version managed from 4.1.77.Final) [DEBUG] io.netty:netty-codec-http:jar:4.1.82.Final:compile (version managed from 4.1.82.Final) [INFO] | | +- io.netty:netty-codec-http2:jar:4.1.82.Final:compile [INFO] | | | \- io.netty:netty-codec-http:jar:4.1.82.Final:compile [DEBUG] io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version managed from 4.1.77.Final) [DEBUG] io.netty:netty-codec-http:jar:4.1.82.Final:compile (version managed from 4.1.82.Final) [DEBUG] io.netty:netty-codec-http2:jar:4.1.82.Final:compile (version managed from 4.1.77.Final) [DEBUG] io.netty:netty-codec-http:jar:4.1.82.Final:compile (version managed from 4.1.82.Final) [INFO] | | +- io.netty:netty-codec-http2:jar:4.1.82.Final:compile [INFO] | | | \- io.netty:netty-codec-http:jar:4.1.82.Final:compile {code} > [Java] Upgrade netty-codec-http dependencies > > > Key: ARROW-17850 > URL: https://issues.apache.org/jira/browse/ARROW-17850 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 9.0.0 >Reporter: Hui Yu >Assignee: David Dali Susanibar Arce >Priority: Major > > [CVE-2022-24823]([https://github.com/advisories/GHSA-269q-hmxg-m83q]) reports > a security vulnerability for *netty-codec-http* > Now the version of *netty-codec-http* in the master branch is *4.1.72.Final,* > that is unsafe. > The ticket https://issues.apache.org/jira/browse/ARROW-16996 bumps > *netty-codec* to {*}4.1.78.Final{*}, it didn't bump *netty-codec-http.* > Can you upgrade the version of *netty-codec-http* ? 
> > Here is my output of mvn:dependency now: > ```bash > [INFO] +- org.apache.arrow:flight-core:jar:9.0.0:compile > [INFO] | +- io.grpc:grpc-netty:jar:1.47.0:compile > [INFO] | | +- io.netty:netty-codec-http2:jar:4.1.72.Final:compile > [INFO] | | | - io.netty:{*}netty-codec-http{*}:jar:4.1.72.Final:compile > [INFO] | | +- io.netty:netty-handler-proxy:jar:4.1.72.Final:runtime > [INFO] | | | - io.netty:netty-codec-socks:jar:4.1.72.Final:runtime > [INFO] | | +- > com.google.errorprone:error_prone_annotations:jar:2.10.0:compile > [INFO] | | +- io.perfmark:perfmark-api:jar:0.25.0:runtime > [INFO] | | - > io.netty:netty-transport-native-unix-common:jar:4.1.72.Final:compile > [INFO] | +- io.grpc:grpc-core:jar:1.47.0:compile > [INFO] | | +- com.google.android:annotations:jar:4.1.1.4:runtime > [INFO] | | - org.codehaus.mojo:animal-sniffer-annotations:jar:1.19:runtime > [INFO] | +- io.grpc:grpc-context:jar:1.47.0:compile > [
[jira] [Resolved] (ARROW-17875) [C++] Remove assorted pre-C++17 compatibility measures
[ https://issues.apache.org/jira/browse/ARROW-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-17875. Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 14263 [https://github.com/apache/arrow/pull/14263] > [C++] Remove assorted pre-C++17 compatibility measures > -- > > Key: ARROW-17875 > URL: https://issues.apache.org/jira/browse/ARROW-17875 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Trivial > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Some assorted pre-C++17 compatibility measures remain in the code base. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610581#comment-17610581 ] Antoine Pitrou edited comment on ARROW-17872 at 9/28/22 1:46 PM: - A build that takes 60 minutes or more is horrible for developer experience. So I would suggest disabling Gandiva and S3 support on all our PR-based macOS builds (and update the brew files to remove/disable the corresponding third-party deps). Do you want to take this [~assignUser]? was (Author: pitrou): A build that takes 60 seconds or more is horrible for developer experience. So I would suggest disabling Gandiva and S3 support on all our PR-based macOS builds (and update the brew files to remove/disable the corresponding third-party deps). Do you want to take this [~assignUser]? > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on Github Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often. -- This message was sent by Atlassian Jira (v8.20.10#820010)
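As a rough sketch, the suggestion above could translate into a configure step like the following. ARROW_GANDIVA and ARROW_S3 are real Arrow C++ CMake options; the source/build paths and the remaining flags are illustrative assumptions, not the actual CI configuration:

```shell
# Sketch: configure the C++ library for a PR-based macOS build with the
# heavy components switched off. ARROW_GANDIVA / ARROW_S3 are existing
# CMake options; paths and other flags here are assumptions.
cmake -S cpp -B cpp/build \
  -DARROW_GANDIVA=OFF \
  -DARROW_S3=OFF \
  -DARROW_BUILD_TESTS=ON
cmake --build cpp/build
```

The corresponding Homebrew Brewfiles would also need the llvm and aws-sdk-cpp entries removed so the dependencies are never installed in the first place.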
[jira] [Comment Edited] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610581#comment-17610581 ] Antoine Pitrou edited comment on ARROW-17872 at 9/28/22 1:46 PM: - A build that takes 60 seconds or more is horrible for developer experience. So I would suggest disabling Gandiva and S3 support on all our PR-based macOS builds (and update the brew files to remove/disable the corresponding third-party deps). Do you want to take this [~assignUser]? was (Author: pitrou): A build that takes 60 seconds or more is horrible for developer experience. So I would suggest disabling Gandiva and S3 support on all our PR-based macOS builds. > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on Github Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610581#comment-17610581 ] Antoine Pitrou commented on ARROW-17872: A build that takes 60 seconds or more is horrible for developer experience. So I would suggest disabling Gandiva and S3 support on all our PR-based macOS builds. > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on Github Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17556) [C++] Unbound scan projection expression leads to all fields being loaded
[ https://issues.apache.org/jira/browse/ARROW-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17556: --- Labels: pull-request-available (was: ) > [C++] Unbound scan projection expression leads to all fields being loaded > - > > Key: ARROW-17556 > URL: https://issues.apache.org/jira/browse/ARROW-17556 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Assignee: Vibhatha Lakmal Abeykoon >Priority: Blocker > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > If a projection expression is unbound then we should bind it to the > (augmented) dataset schema and carry on. Instead it appears we are > interpreting "unbound expression" as "nothing set at all" and loading all > fields. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17854) [CI][Developer] Host preview docs on S3
[ https://issues.apache.org/jira/browse/ARROW-17854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-17854. Resolution: Fixed Issue resolved by pull request 14247 [https://github.com/apache/arrow/pull/14247] > [CI][Developer] Host preview docs on S3 > --- > > Key: ARROW-17854 > URL: https://issues.apache.org/jira/browse/ARROW-17854 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Developer Tools >Reporter: Jacob Wujciak-Jens >Assignee: Jacob Wujciak-Jens >Priority: Critical > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > Hosting on Github Pages as implemented in [ARROW-12958] is unsustainable due > to the size of the arrow docs (~ 200mb). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610573#comment-17610573 ] Jacob Wujciak-Jens commented on ARROW-17872: bq. 10 minutes for extracting 1.5GB seems quite unexpected I have checked in detail and each of the bigger dependencies (aws, llvm, boost) takes 2-3 minutes to "pour", so acceptable speeds, I would say. It is just a lot overall, but still nothing I see the cache really speeding up. The timeout is set to 60 minutes, so we could just raise that limit if it is no longer appropriate for the current build complexity (or, as you said, remove features). The build should already be using all 3 available cores. > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on Github Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17877) [CI][Python] verify-rc python nightly builds fail due to missing some flags that were activated with ARROW_PYTHON=ON
[ https://issues.apache.org/jira/browse/ARROW-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610566#comment-17610566 ] Raúl Cumplido commented on ARROW-17877: --- [~kou] what do you think: if we are building pyarrow too, should we individually enable, in verify-release-candidate.sh and the other jobs, all the flags that used to be activated by `ARROW_PYTHON=ON`: {code:java} if(ARROW_PYTHON) set(ARROW_COMPUTE ON) set(ARROW_CSV ON) set(ARROW_DATASET ON) set(ARROW_FILESYSTEM ON) set(ARROW_HDFS ON) set(ARROW_JSON ON) endif() {code} or should we create some CMake group flag that enables all the requirements for pyarrow? There are still quite a lot of occurrences of this flag: {code:java} $ grep -r "ARROW_PYTHON=ON" | wc -l 22 {code} > [CI][Python] verify-rc python nightly builds fail due to missing some flags > that were activated with ARROW_PYTHON=ON > > > Key: ARROW-17877 > URL: https://issues.apache.org/jira/browse/ARROW-17877 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, Python >Reporter: Raúl Cumplido >Assignee: Raúl Cumplido >Priority: Blocker > Labels: Nightly > > Some of our nightly builds are failing with: > {code:java} > [ 35%] Building CXX object CMakeFiles/_dataset.dir/_dataset.cpp.o > /arrow/python/build/temp.linux-x86_64-cpython-38/_dataset.cpp:833:10: fatal > error: arrow/csv/api.h: No such file or directory > #include "arrow/csv/api.h" > ^ > compilation terminated.{code} > I suspect the flags included CSV=ON when building with PYTHON=ON changes here > might be related: > [https://github.com/apache/arrow/commit/53ac2a00aa9ff199773513f6f996f73a07b37989] > Example of nightly failures: > https://github.com/ursacomputing/crossbow/actions/runs/3135833175/jobs/5091988801 -- This message was sent by Atlassian Jira (v8.20.10#820010)
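One possible shape for the group-flag idea, as a sketch only: ARROW_PYTHON_REQUIREMENTS is a hypothetical option name, not an existing Arrow CMake flag, while the component flags it sets are the real ones from the old ARROW_PYTHON block.

```cmake
# Hypothetical grouping option (the option name is illustrative):
option(ARROW_PYTHON_REQUIREMENTS
       "Enable every C++ component that PyArrow requires" OFF)
if(ARROW_PYTHON_REQUIREMENTS)
  set(ARROW_COMPUTE ON)
  set(ARROW_CSV ON)
  set(ARROW_DATASET ON)
  set(ARROW_FILESYSTEM ON)
  set(ARROW_HDFS ON)
  set(ARROW_JSON ON)
endif()
```

A single grouping flag would also reduce the 22 `ARROW_PYTHON=ON` occurrences to one place that needs updating when PyArrow's requirements change.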
[jira] [Updated] (ARROW-17877) [CI][Python] verify-rc python nightly builds fail due to missing some flags that were activated with ARROW_PYTHON=ON
[ https://issues.apache.org/jira/browse/ARROW-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido updated ARROW-17877: -- Summary: [CI][Python] verify-rc python nightly builds fail due to missing some flags that were activated with ARROW_PYTHON=ON (was: [CI][Python] verify-rc python nightly builds fail due to missing arrow/csv/api.h) > [CI][Python] verify-rc python nightly builds fail due to missing some flags > that were activated with ARROW_PYTHON=ON > > > Key: ARROW-17877 > URL: https://issues.apache.org/jira/browse/ARROW-17877 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, Python >Reporter: Raúl Cumplido >Assignee: Raúl Cumplido >Priority: Blocker > Labels: Nightly > > Some of our nightly builds are failing with: > {code:java} > [ 35%] Building CXX object CMakeFiles/_dataset.dir/_dataset.cpp.o > /arrow/python/build/temp.linux-x86_64-cpython-38/_dataset.cpp:833:10: fatal > error: arrow/csv/api.h: No such file or directory > #include "arrow/csv/api.h" > ^ > compilation terminated.{code} > I suspect the flags included CSV=ON when building with PYTHON=ON changes here > might be related: > [https://github.com/apache/arrow/commit/53ac2a00aa9ff199773513f6f996f73a07b37989] > Example of nightly failures: > https://github.com/ursacomputing/crossbow/actions/runs/3135833175/jobs/5091988801 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17877) [CI][Python] verify-rc python nightly builds fail due to missing arrow/csv/api.h
Raúl Cumplido created ARROW-17877: - Summary: [CI][Python] verify-rc python nightly builds fail due to missing arrow/csv/api.h Key: ARROW-17877 URL: https://issues.apache.org/jira/browse/ARROW-17877 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, Python Reporter: Raúl Cumplido Assignee: Raúl Cumplido Some of our nightly builds are failing with: {code:java} [ 35%] Building CXX object CMakeFiles/_dataset.dir/_dataset.cpp.o /arrow/python/build/temp.linux-x86_64-cpython-38/_dataset.cpp:833:10: fatal error: arrow/csv/api.h: No such file or directory #include "arrow/csv/api.h" ^ compilation terminated.{code} I suspect the flags included CSV=ON when building with PYTHON=ON changes here might be related: [https://github.com/apache/arrow/commit/53ac2a00aa9ff199773513f6f996f73a07b37989] Example of nightly failures: https://github.com/ursacomputing/crossbow/actions/runs/3135833175/jobs/5091988801 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610545#comment-17610545 ] Antoine Pitrou commented on ARROW-17872: We may perhaps want to disable some Arrow components on those macOS builds, unless there's another package manager that we can use? [~kou] Do you know why we decided to use Homebrew for dependencies on macOS? > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on Github Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610539#comment-17610539 ] Antoine Pitrou edited comment on ARROW-17872 at 9/28/22 12:27 PM: -- (and even 10 minutes for extracting 1.5GB seems quite unexpected: that's only 2.5 MB/s... so it's not a gzip problem but probably an IO/memory issue) was (Author: pitrou): (and even 10 minutes for extracting 1.5GB seems quite unexpected) > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on Github Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610541#comment-17610541 ] Jacob Wujciak-Jens commented on ARROW-17872: relevant homebrew issue: https://github.com/Homebrew/brew/issues/13621 > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on Github Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610539#comment-17610539 ] Antoine Pitrou commented on ARROW-17872: (and even 10 minutes for extracting 1.5GB seems quite unexpected) > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on Github Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610538#comment-17610538 ] Antoine Pitrou commented on ARROW-17872: Ouch, LLVM can be heavy but 1.5GB sounds really outlandish. (for comparison, the combined unpacked size for the conda-forge packages {{libllvm}}, {{llvm-tools}} and {{llvmdev}} is 500MB) > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on Github Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610540#comment-17610540 ] Jacob Wujciak-Jens commented on ARROW-17872: And we have 12 & 15 both similar size (do we need both?), aws sdk is 800M... > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on Github Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17876) [R][CI] Remove ubuntu-18.04 from nixlibs & prebuilt binaries
Jacob Wujciak-Jens created ARROW-17876: -- Summary: [R][CI] Remove ubuntu-18.04 from nixlibs & prebuilt binaries Key: ARROW-17876 URL: https://issues.apache.org/jira/browse/ARROW-17876 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Jacob Wujciak-Jens Fix For: 10.0.0 The new dts compiled centos-7 binaries ([ARROW-17594]) should be able to replace the ubuntu-18.04 binaries. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17876) [R][CI] Remove ubuntu-18.04 from nixlibs & prebuilt binaries
[ https://issues.apache.org/jira/browse/ARROW-17876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacob Wujciak-Jens updated ARROW-17876: --- Priority: Critical (was: Major) > [R][CI] Remove ubuntu-18.04 from nixlibs & prebuilt binaries > > > Key: ARROW-17876 > URL: https://issues.apache.org/jira/browse/ARROW-17876 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jacob Wujciak-Jens >Priority: Critical > Fix For: 10.0.0 > > > The new dts compiled centos-7 binaries ([ARROW-17594]) should be able to > replace the ubuntu-18.04 binaries. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17875) [C++] Remove assorted pre-C++17 compatibility measures
[ https://issues.apache.org/jira/browse/ARROW-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17875: --- Labels: pull-request-available (was: ) > [C++] Remove assorted pre-C++17 compatibility measures > -- > > Key: ARROW-17875 > URL: https://issues.apache.org/jira/browse/ARROW-17875 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Trivial > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Some assorted pre-C++17 compatibility measures remain in the code base. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610528#comment-17610528 ] Jacob Wujciak-Jens edited comment on ARROW-17872 at 9/28/22 12:14 PM: -- it looks like homebrew is using system tar to extract the gzipped bottles, maybe we can speed it up by symlinking in pigz to make use of the 3 cores the mac runners have... was (Author: JIRAUSER287549): it looks like homebrew is using system tar to extract the gzipped bottles, maybe we can speed it up by symlinking in pzip to make use of the 3 cores the mac runners have... > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on Github Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often. -- This message was sent by Atlassian Jira (v8.20.10#820010)
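The symlink idea above could look roughly like this. It is an untested sketch: whether Homebrew's tar invocation actually picks up a shadowed gzip on PATH is an assumption that would need verifying against the linked Homebrew issue.

```shell
# Sketch: install pigz (a parallel gzip) and shadow the system gzip on
# PATH, so tar's -z decompression can use all 3 cores of the runner.
brew install pigz
mkdir -p "$HOME/bin"
ln -sf "$(command -v pigz)" "$HOME/bin/gzip"
export PATH="$HOME/bin:$PATH"
command -v gzip   # should now resolve to $HOME/bin/gzip
```

An alternative that avoids shadowing binaries is to extract explicitly with `tar --use-compress-program=pigz`, but that only helps for archives we unpack ourselves, not for Homebrew's internal bottle extraction.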
[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610528#comment-17610528 ] Jacob Wujciak-Jens commented on ARROW-17872: it looks like homebrew is using system tar to extract the gzipped bottles, maybe we can speed it up by symlinking in pzip to make use of the 3 cores the mac runners have... > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on Github Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17875) [C++] Remove assorted pre-C++17 compatibility measures
Antoine Pitrou created ARROW-17875: -- Summary: [C++] Remove assorted pre-C++17 compatibility measures Key: ARROW-17875 URL: https://issues.apache.org/jira/browse/ARROW-17875 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou Some assorted pre-C++17 compatibility measures remain in the code base. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610524#comment-17610524 ] Jacob Wujciak-Jens commented on ARROW-17872: I have set up a test job with debug output to see what exactly is taking so long: https://github.com/assignUser/test-repo-a/actions/runs/3142905685/jobs/5107502078#step:4:392 If you turn on timestamps, you can see that what takes the time is extracting the archives (e.g. llvm ~1.5G), not downloading them, so caching the {{homebrew --cache}} directory would not save significant time. As the cache is also tar'd, extracting the cache might be the new bottleneck. > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on Github Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17874) [Archery] C++ linting with --clang-format or archery lint --clang-tidy fails on M1
[ https://issues.apache.org/jira/browse/ARROW-17874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610520#comment-17610520 ] Alenka Frim commented on ARROW-17874: - cc [~raulcd] > [Archery] C++ linting with --clang-format or archery lint --clang-tidy fails > on M1 > --- > > Key: ARROW-17874 > URL: https://issues.apache.org/jira/browse/ARROW-17874 > Project: Apache Arrow > Issue Type: Bug > Components: Archery >Reporter: Alenka Frim >Priority: Major > > It seems there is some CMake target issue for the {{clang-format}} and > {{clang-tidy}} options when running {{archery lint}} on M1: > {code:java} > ... > -- Build files have been written to: > /private/var/folders/gw/q7wqd4tx18n_9t4kbkd0bj1mgn/T/arrow-lint-g7drna9_/cpp-build > ninja: error: unknown target 'check-format' {code} > [https://gist.github.com/AlenkaF/f60e24549529cd096bc9c975bcb71179]
[jira] [Created] (ARROW-17874) [Archery] C++ linting with --clang-format or archery lint --clang-tidy fails on M1
Alenka Frim created ARROW-17874: --- Summary: [Archery] C++ linting with --clang-format or archery lint --clang-tidy fails on M1 Key: ARROW-17874 URL: https://issues.apache.org/jira/browse/ARROW-17874 Project: Apache Arrow Issue Type: Bug Components: Archery Reporter: Alenka Frim It seems there is some CMake target issue for the {{clang-format}} and {{clang-tidy}} options when running {{archery lint}} on M1: {code:java} ... -- Build files have been written to: /private/var/folders/gw/q7wqd4tx18n_9t4kbkd0bj1mgn/T/arrow-lint-g7drna9_/cpp-build ninja: error: unknown target 'check-format' {code} [https://gist.github.com/AlenkaF/f60e24549529cd096bc9c975bcb71179]
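The missing {{check-format}} target usually means CMake could not find a suitable clang-format when Archery configured the lint build, so the target was never generated. A minimal sketch of a workaround, under the assumption that Homebrew's LLVM provides the binaries; the formula name and version here are assumptions, since Arrow pins a specific clang-format major version that should be checked in the repo first:

```shell
# Put a Homebrew-installed clang-format/clang-tidy on PATH before running
# archery, so CMake's find_program() can locate them and generate the
# formatting targets.
brew install llvm@14                        # assumed formula; match Arrow's pinned version
export PATH="$(brew --prefix llvm@14)/bin:$PATH"
command -v clang-format                     # verify it resolves before linting
archery lint --clang-format
```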
[jira] [Created] (ARROW-17873) Writing Arrow Files using C#.
N Gautam Animesh created ARROW-17873: Summary: Writing Arrow Files using C#. Key: ARROW-17873 URL: https://issues.apache.org/jira/browse/ARROW-17873 Project: Apache Arrow Issue Type: Improvement Reporter: N Gautam Animesh I was working with Arrow in C# and wanted to know how to write to an Arrow file using C#. I was not able to find anything about this on the internet; do let me know if there's anything in this regard.
[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610482#comment-17610482 ] Antoine Pitrou commented on ARROW-17872: Here is an example job which timed out due to an overlong dependency step: https://github.com/pitrou/arrow/actions/runs/3141950727/jobs/5104979517 > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on GitHub Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often.
[jira] [Closed] (ARROW-13745) [CI][C++] conda python turbodbc nightly job failed
[ https://issues.apache.org/jira/browse/ARROW-13745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibo Cai closed ARROW-13745. Resolution: Fixed > [CI][C++] conda python turbodbc nightly job failed > -- > > Key: ARROW-13745 > URL: https://issues.apache.org/jira/browse/ARROW-13745 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Yibo Cai >Priority: Major > > https://github.com/ursacomputing/crossbow/runs/3408001481#step:7:4473 > [NIGHTLY] Arrow Build Report for Job nightly-2021-08-24-0 > https://www.mail-archive.com/builds@arrow.apache.org/msg00109.html
[jira] [Commented] (ARROW-17855) [R] Simultaneous read-write operations causing file corruption.
[ https://issues.apache.org/jira/browse/ARROW-17855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610466#comment-17610466 ] N Gautam Animesh commented on ARROW-17855: -- Yes, when someone is writing to the file and we are reading it simultaneously, the file gets corrupted. > [R] Simultaneous read-write operations causing file corruption. > --- > > Key: ARROW-17855 > URL: https://issues.apache.org/jira/browse/ARROW-17855 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: N Gautam Animesh >Priority: Major > > Use case: I was trying to simultaneously read and write an Arrow file, which in > turn gave me an error and is leading to file corruption. I am currently using > the read_feather and write_feather functions to save it as a .arrow file. Do let > me know if there's anything in this regard or any other way to avoid this. > [Error: Invalid: Not an Arrow file] -- This message was sent by Atlassian Jira (v8.20.10#820010)
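Until concurrent access is handled by the application, a common workaround for the "Not an Arrow file" error on the reader side is to make the writer atomic: write to a temporary file on the same filesystem and rename it into place, so a reader opening the path sees either the complete old file or the complete new one, never a partial write. A minimal shell sketch of the pattern; the writer here is a stand-in for the real feather writer, and the file names are illustrative. Note this protects readers from partial writes but does not arbitrate between two simultaneous writers.

```shell
# Atomic-replace pattern: rename(2) within a filesystem is atomic, so
# concurrent readers never observe a half-written file.
write_atomically() {
  target="$1"
  tmp="${target}.tmp.$$"    # unique temp name next to the target
  cat > "$tmp"              # stand-in for the real Arrow/feather writer
  mv "$tmp" "$target"       # atomic swap into place
}
echo "payload" | write_atomically /tmp/data.arrow
cat /tmp/data.arrow         # prints "payload"
```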
[jira] [Commented] (ARROW-17872) [CI] Cache dependencies on macOS builds
[ https://issues.apache.org/jira/browse/ARROW-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610450#comment-17610450 ] Antoine Pitrou commented on ARROW-17872: [~assignUser] Do you think that's reasonably doable? > [CI] Cache dependencies on macOS builds > --- > > Key: ARROW-17872 > URL: https://issues.apache.org/jira/browse/ARROW-17872 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Continuous Integration, GLib, Python >Reporter: Antoine Pitrou >Priority: Major > > Our macOS CI builds on GitHub Actions usually take at least 10 minutes > installing dependencies from Homebrew (because of compiling from source?). It > would be nice to cache those, especially as they probably don't change often.