[jira] [Assigned] (ARROW-18318) [Python] Expose Scalar.validate

2023-01-05 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-18318:
--

Assignee: buaazhwb

> [Python] Expose Scalar.validate
> ---
>
> Key: ARROW-18318
> URL: https://issues.apache.org/jira/browse/ARROW-18318
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: buaazhwb
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In C++, scalars have {{Validate}} and {{ValidateFull}} methods, just like 
> arrays. However, these methods were not exposed on PyArrow scalars (while 
> they are on PyArrow arrays).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-7594) [C++] Implement HTTP and FTP file systems

2023-01-05 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654992#comment-17654992
 ] 

Antoine Pitrou commented on ARROW-7594:
---

[~icook] We'll need someone or something to allocate the required workforce.

> [C++] Implement HTTP and FTP file systems
> -
>
> Key: ARROW-7594
> URL: https://issues.apache.org/jira/browse/ARROW-7594
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Priority: Major
> Fix For: 12.0.0
>
>
> It'd be handy to have a generic (probably read-only) filesystem 
> implementation which wrapped {{any cURLable base URL}}:
> {code}
> ARROW_ASSIGN_OR_RAISE(auto fs, 
> HttpFileSystem::Make("https://some.site/json-api/v3"));
> ASSERT_OK_AND_ASSIGN(auto json_stream, fs->OpenInputStream("slug"));
> // ...
> {code}
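[Editor's note: the core of such a filesystem is resolving a relative path ("slug") against the configured base URL. A stdlib-only Python sketch of that resolution step, for illustration; `HttpFileSystem` and its behavior here are hypothetical, not an eventual Arrow API:]

```python
# Hypothetical illustration of the path resolution an HttpFileSystem
# wrapping a cURLable base URL would perform. Stdlib only.
from urllib.parse import urljoin

base = "https://some.site/json-api/v3/"  # trailing slash keeps the last segment
print(urljoin(base, "slug"))             # https://some.site/json-api/v3/slug
```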





[jira] [Resolved] (ARROW-18195) [R][C++] Final value returned by case_when is NA when input has 64 or more values and 1 or more NAs

2023-01-04 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18195.

Resolution: Fixed

Issue resolved by pull request 15131
https://github.com/apache/arrow/pull/15131

> [R][C++] Final value returned by case_when is NA when input has 64 or more 
> values and 1 or more NAs
> ---
>
> Key: ARROW-18195
> URL: https://issues.apache.org/jira/browse/ARROW-18195
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 10.0.0
>Reporter: Lee Mendelowitz
>Assignee: Will Jones
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
> Attachments: test_issue.R
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> There appears to be a bug when processing an Arrow table with NA values and 
> using `dplyr::case_when`. A reproducible example is below: the output from 
> arrow table processing does not match the output when processing a tibble. If 
> the NA's are removed from the dataframe, then the outputs match.
> {noformat}
> ``` r
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #> filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #> intersect, setdiff, setequal, union
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #> timestamp
> library(assertthat)
> play_results = c('single', 'double', 'triple', 'home_run')
> nrows = 1000
> # Change frac_na to 0, and the result error disappears.
> frac_na = 0.05
> # Create a test dataframe with NA values
> test_df = tibble(
> play_result = sample(play_results, nrows, replace = TRUE)
> ) %>%
> mutate(
> play_result = ifelse(runif(nrows) < frac_na, NA_character_, 
> play_result)
> )
> 
> test_arrow = arrow_table(test_df)
> process_plays = function(df) {
> df %>%
> mutate(
> avg = case_when(
> play_result == 'single' ~ 1,
> play_result == 'double' ~ 1,
> play_result == 'triple' ~ 1,
> play_result == 'home_run' ~ 1,
> is.na(play_result) ~ NA_real_,
> TRUE ~ 0
> )
> ) %>%
> count(play_result, avg) %>%
> arrange(play_result)
> }
> # Compare arrow_table result to tibble result
> result_tibble = process_plays(test_df)
> result_arrow = process_plays(test_arrow) %>% collect()
> assertthat::assert_that(identical(result_tibble, result_arrow))
> #> Error: result_tibble not identical to result_arrow
> ```
> Created on 2022-10-29 with [reprex 
> v2.0.2](https://reprex.tidyverse.org)
> {noformat}
> I have reproduced this issue both on Mac OS and Ubuntu 20.04.
>  
> {noformat}
> ```
> r$> sessionInfo()
> R version 4.2.1 (2022-06-23)
> Platform: aarch64-apple-darwin21.5.0 (64-bit)
> Running under: macOS Monterey 12.5.1
> Matrix products: default
> BLAS:   /opt/homebrew/Cellar/openblas/0.3.20/lib/libopenblasp-r0.3.20.dylib
> LAPACK: /opt/homebrew/Cellar/r/4.2.1/lib/R/lib/libRlapack.dylib
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
> other attached packages:
> [1] assertthat_0.2.1 arrow_10.0.0     dplyr_1.0.10
> loaded via a namespace (and not attached):
>  [1] compiler_4.2.1    pillar_1.8.1      highr_0.9         R.methodsS3_1.8.2 
> R.utils_2.12.0    tools_4.2.1       bit_4.0.4         digest_0.6.29
>  [9] evaluate_0.15     lifecycle_1.0.1   tibble_3.1.8      R.cache_0.16.0    
> pkgconfig_2.0.3   rlang_1.0.5       reprex_2.0.2      DBI_1.1.2
> [17] cli_3.3.0         rstudioapi_0.13   yaml_2.3.5        xfun_0.31         
> fastmap_1.1.0     withr_2.5.0       styler_1.8.0      knitr_1.39
> [25] generics_0.1.3    fs_1.5.2          vctrs_0.4.1       bit64_4.0.5       
> tidyselect_1.1.2  glue_1.6.2        R6_2.5.1          processx_3.5.3
> [33] fansi_1.0.3       rmarkdown_2.14    purrr_0.3.4       callr_3.7.0       
> clipr_0.8.0       magrittr_2.0.3    ellipsis_0.3.2    ps_1.7.0
> [41] htmltools_0.5.3   renv_0.16.0       utf8_1.2.2        R.oo_1.25.0
> ```
> {noformat}





[jira] [Resolved] (ARROW-18436) [Python] `FileSystem.from_uri` doesn't decode %-encoded characters in path

2023-01-03 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18436.

Resolution: Fixed

Issue resolved by pull request 14974
https://github.com/apache/arrow/pull/14974

> [Python] `FileSystem.from_uri` doesn't decode %-encoded characters in path
> --
>
> Key: ARROW-18436
> URL: https://issues.apache.org/jira/browse/ARROW-18436
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 10.0.1
> Environment: - OS: macOS
> - `python=3.9.15:h709bd14_0_cpython` (installed from conda-forge)
> - `pyarrow=10.0.1:py39h2db5b05_1_cpu` (installed from conda-forge)
>Reporter: James Bourbeau
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When attempting to create a new filesystem object from a public dataset in 
> S3, where there is a space in the bucket name, an error is raised.
>  
> Here's a minimal reproducer:
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet") {code}
> which fails with the following traceback:
>  
> {code:java}
> Traceback (most recent call last):
>   File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
>     result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet")
>   File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet'{code}
>  
> Note that things work if I use a different dataset that doesn't have a space 
> in the URI, or if I replace the portion of the URI that has a space with a 
> `*` wildcard
>  
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet") 
> # works
>  result = 
> FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet") # works
> {code}
>  
> The wildcard isn't necessarily equivalent to the original failing URI, but I 
> think it highlights that the space is somehow problematic.
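[Editor's note: per the resolution above, the URI must be percent-encoded before being handed to `FileSystem.from_uri`. A stdlib-only sketch of the encoding step (the pyarrow call itself is shown in the issue body):]

```python
# Percent-encode the object key before building the URI; quote() leaves
# '/' intact by default, so only the space becomes %20.
from urllib.parse import quote

key = "trip data/fhvhv_tripdata_2022-06.parquet"
uri = "s3://nyc-tlc/" + quote(key)
print(uri)  # s3://nyc-tlc/trip%20data/fhvhv_tripdata_2022-06.parquet
```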





[jira] [Resolved] (ARROW-18318) [Python] Expose Scalar.validate

2023-01-03 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18318.

Resolution: Fixed

Issue resolved by pull request 15149
https://github.com/apache/arrow/pull/15149

> [Python] Expose Scalar.validate
> ---
>
> Key: ARROW-18318
> URL: https://issues.apache.org/jira/browse/ARROW-18318
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In C++, scalars have {{Validate}} and {{ValidateFull}} methods, just like 
> arrays. However, these methods were not exposed on PyArrow scalars (while 
> they are on PyArrow arrays).





[jira] [Resolved] (ARROW-18202) [R][C++] Different behaviour of R's base::gsub() binding aka libarrow's replace_string_regex kernel since 10.0.0

2023-01-03 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18202.

Resolution: Fixed

Issue resolved by pull request 15132
https://github.com/apache/arrow/pull/15132

> [R][C++] Different behaviour of R's base::gsub() binding aka libarrow's 
> replace_string_regex kernel since 10.0.0
> 
>
> Key: ARROW-18202
> URL: https://issues.apache.org/jira/browse/ARROW-18202
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 10.0.0
>Reporter: Lorenzo Isella
>Assignee: Will Jones
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hello,
> I think there is a problem with arrow 10.0 and R. I did not have this issue 
> with arrow 9.0.
> Could you please have a look?
> Many thanks
>  
> {code:r}
> library(tidyverse)
> library(arrow)
> ll <- c(      "100",   "1000",  "200"  , "3000" , "50"   ,
>         "500", ""   ,   "Not Range")
> df <- tibble(x=rep(ll, 1000), y=seq(8000))
> write_tsv(df, "data.tsv")
> data <- open_dataset("data.tsv", format="tsv",
>                      skip_rows=1,
>                      schema=schema(x=string(),
>                      y=double())
> )
> test <- data |>
>     collect()
> ###I want to replace the "" with "0". I believe this worked with arrow 9.0
> df2 <- data |>
>     mutate(x=gsub("^$","0",x) ) |>
>     collect()
> df2 ### the "" entries in x were not modified
> #> # A tibble: 8,000 × 2
> #>    x               y
> #>    <chr>       <dbl>
> #>  1 "100"       1
> #>  2 "1000"      2
> #>  3 "200"       3
> #>  4 "3000"      4
> #>  5 "50"        5
> #>  6 "500"       6
> #>  7 ""              7
> #>  8 "Not Range"     8
> #>  9 "100"       9
> #> 10 "1000"     10
> #> # … with 7,990 more rows
>  
> df3 <- df |>
>     mutate(x=gsub("^$","0",x) )
> df3  ## and this is fine
> #> # A tibble: 8,000 × 2
> #>    x             y
> #>    <chr>     <dbl>
> #>  1 100       1
> #>  2 1000      2
> #>  3 200       3
> #>  4 3000      4
> #>  5 50        5
> #>  6 500       6
> #>  7 0             7
> #>  8 Not Range     8
> #>  9 100       9
> #> 10 1000     10
> #> # … with 7,990 more rows
> ## How to fix this...I believe this issue did not arise with arrow 9.0.
> sessionInfo()
> #> R version 4.2.1 (2022-06-23)
> #> Platform: x86_64-pc-linux-gnu (64-bit)
> #> Running under: Debian GNU/Linux 11 (bullseye)
> #> 
> #> Matrix products: default
> #> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
> #> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
> #> 
> #> locale:
> #>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
> #>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
> #>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
> #>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
> #>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> #> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
> #> 
> #> attached base packages:
> #> [1] stats     graphics  grDevices utils     datasets  methods   base     
> #> 
> #> other attached packages:
> #>  [1] arrow_10.0.0    forcats_0.5.2   stringr_1.4.1   dplyr_1.0.10   
> #>  [5] purrr_0.3.5     readr_2.1.3     tidyr_1.2.1     tibble_3.1.8   
> #>  [9] ggplot2_3.3.6   tidyverse_1.3.2
> #> 
> #> loaded via a namespace (and not attached):
> #>  [1] lubridate_1.8.0     assertthat_0.2.1    digest_0.6.30      
> #>  [4] utf8_1.2.2          R6_2.5.1            cellranger_1.1.0   
> #>  [7] backports_1.4.1     reprex_2.0.2        evaluate_0.17      
> #> [10] httr_1.4.4          highr_0.9           pillar_1.8.1       
> #> [13] rlang_1.0.6         googlesheets4_1.0.1 readxl_1.4.1       
> #> [16] R.utils_2.12.1      R.oo_1.25.0         rmarkdown_2.17     
> #> [19] styler_1.8.0        googledrive_2.0.0   bit_4.0.4          
> #> [22] munsell_0.5.0       broom_1.0.1         compiler_4.2.1     
> #> [25] modelr_0.1.9        xfun_0.34           pkgconfig_2.0.3    
> #> [28] htmltools_0.5.3     tidyselect_1.2.0    fansi_1.0.3        
> #> [31] crayon_1.5.2        tzdb_0.3.0          dbplyr_2.2.1       
> #> [34] withr_2.5.0         R.methodsS3_1.8.2   grid_4.2.1         
> #> [37] jsonlite_1.8.3      gtable_0.3.1        lifecycle_1.0.3    
> #> [40] DBI_1.1.3           magrittr_2.0.3      scales_1.2.1       
> #> [43] vroom_1.6.0         cli_3.4.1           stringi_1.7.8      
> #> [46] fs_1.5.2            xml2_1.3.3          ellipsis_0.3.2     
> #> [49] generics_0.1.3      vctrs_0.5.0         tools_4.2.1        
> #> [52] bit64_4.0.5         

[jira] [Commented] (ARROW-12938) [C++] Investigate spawning arbitrary callbacks from StopToken

2022-12-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649652#comment-17649652
 ] 

Antoine Pitrou commented on ARROW-12938:


Yes, this is possible even without a separate thread. But you're right that 
there's some API design work involved.

> [C++] Investigate spawning arbitrary callbacks from StopToken
> -
>
> Key: ARROW-12938
> URL: https://issues.apache.org/jira/browse/ARROW-12938
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> In some situations, we may want to forward stop requests to external 
> runtimes, e.g. gRPC (see https://github.com/apache/arrow/pull/10318 ), 
> without polling.
> Ideally, one may temporarily add a callback to a StopToken. This bears 
> complications, especially in the case where the stop request comes from a 
> signal handler.
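[Editor's note: as a rough illustration of the design space described above, and emphatically not Arrow's C++ implementation, a toy stop token with temporarily registered, removable callbacks might look like this. The signal-handler complication mentioned in the issue is out of scope for this sketch.]

```python
# Toy stop token: callbacks can be added while a stop request may arrive
# from another thread, and removed again when the caller's scope exits.
import threading

class StopToken:
    def __init__(self):
        self._lock = threading.Lock()
        self._stopped = False
        self._callbacks = {}
        self._next_id = 0

    def add_callback(self, cb):
        """Register cb; if a stop was already requested, invoke it now
        and return None, otherwise return a handle for remove_callback."""
        with self._lock:
            if self._stopped:
                fire = True
            else:
                fire = False
                self._next_id += 1
                self._callbacks[self._next_id] = cb
                handle = self._next_id
        if fire:
            cb()
            return None
        return handle

    def remove_callback(self, handle):
        with self._lock:
            self._callbacks.pop(handle, None)

    def request_stop(self):
        with self._lock:
            if self._stopped:
                return
            self._stopped = True
            cbs = list(self._callbacks.values())
            self._callbacks.clear()
        for cb in cbs:  # run outside the lock to avoid deadlocks
            cb()
```

The "add after stop" path (invoke immediately, return no handle) is one of the API questions the issue alludes to; a real design also has to decide which thread runs the callbacks.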





[jira] [Updated] (ARROW-18436) [Python] `FileSystem.from_uri` doesn't decode %-encoded characters in path

2022-12-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18436:
---
Summary: [Python] `FileSystem.from_uri` doesn't decode %-encoded characters 
in path  (was: [Python] `FileSystem.from_uri` doesn't decode %-encoded 
characters)

> [Python] `FileSystem.from_uri` doesn't decode %-encoded characters in path
> --
>
> Key: ARROW-18436
> URL: https://issues.apache.org/jira/browse/ARROW-18436
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 10.0.1
> Environment: - OS: macOS
> - `python=3.9.15:h709bd14_0_cpython` (installed from conda-forge)
> - `pyarrow=10.0.1:py39h2db5b05_1_cpu` (installed from conda-forge)
>Reporter: James Bourbeau
>Assignee: Antoine Pitrou
>Priority: Minor
> Fix For: 11.0.0
>
>
> When attempting to create a new filesystem object from a public dataset in 
> S3, where there is a space in the bucket name, an error is raised.
>  
> Here's a minimal reproducer:
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet") {code}
> which fails with the following traceback:
>  
> {code:java}
> Traceback (most recent call last):
>   File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
>     result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet")
>   File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet'{code}
>  
> Note that things work if I use a different dataset that doesn't have a space 
> in the URI, or if I replace the portion of the URI that has a space with a 
> `*` wildcard
>  
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet") 
> # works
>  result = 
> FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet") # works
> {code}
>  
> The wildcard isn't necessarily equivalent to the original failing URI, but I 
> think it highlights that the space is somehow problematic.





[jira] [Assigned] (ARROW-18436) `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space

2022-12-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-18436:
--

Assignee: Antoine Pitrou

> `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space
> -
>
> Key: ARROW-18436
> URL: https://issues.apache.org/jira/browse/ARROW-18436
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 10.0.1
> Environment: - OS: macOS
> - `python=3.9.15:h709bd14_0_cpython` (installed from conda-forge)
> - `pyarrow=10.0.1:py39h2db5b05_1_cpu` (installed from conda-forge)
>Reporter: James Bourbeau
>Assignee: Antoine Pitrou
>Priority: Minor
> Fix For: 11.0.0
>
>
> When attempting to create a new filesystem object from a public dataset in 
> S3, where there is a space in the bucket name, an error is raised.
>  
> Here's a minimal reproducer:
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet") {code}
> which fails with the following traceback:
>  
> {code:java}
> Traceback (most recent call last):
>   File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
>     result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet")
>   File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet'{code}
>  
> Note that things work if I use a different dataset that doesn't have a space 
> in the URI, or if I replace the portion of the URI that has a space with a 
> `*` wildcard
>  
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet") 
> # works
>  result = 
> FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet") # works
> {code}
>  
> The wildcard isn't necessarily equivalent to the original failing URI, but I 
> think it highlights that the space is somehow problematic.





[jira] [Updated] (ARROW-18436) [Python] `FileSystem.from_uri` doesn't decode %-encoded characters

2022-12-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18436:
---
Summary: [Python] `FileSystem.from_uri` doesn't decode %-encoded characters 
 (was: `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space)

> [Python] `FileSystem.from_uri` doesn't decode %-encoded characters
> --
>
> Key: ARROW-18436
> URL: https://issues.apache.org/jira/browse/ARROW-18436
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 10.0.1
> Environment: - OS: macOS
> - `python=3.9.15:h709bd14_0_cpython` (installed from conda-forge)
> - `pyarrow=10.0.1:py39h2db5b05_1_cpu` (installed from conda-forge)
>Reporter: James Bourbeau
>Assignee: Antoine Pitrou
>Priority: Minor
> Fix For: 11.0.0
>
>
> When attempting to create a new filesystem object from a public dataset in 
> S3, where there is a space in the bucket name, an error is raised.
>  
> Here's a minimal reproducer:
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet") {code}
> which fails with the following traceback:
>  
> {code:java}
> Traceback (most recent call last):
>   File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
>     result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet")
>   File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet'{code}
>  
> Note that things work if I use a different dataset that doesn't have a space 
> in the URI, or if I replace the portion of the URI that has a space with a 
> `*` wildcard
>  
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet") 
> # works
>  result = 
> FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet") # works
> {code}
>  
> The wildcard isn't necessarily equivalent to the original failing URI, but I 
> think it highlights that the space is somehow problematic.





[jira] [Updated] (ARROW-18436) `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space

2022-12-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18436:
---
Component/s: C++

> `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space
> -
>
> Key: ARROW-18436
> URL: https://issues.apache.org/jira/browse/ARROW-18436
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 10.0.1
> Environment: - OS: macOS
> - `python=3.9.15:h709bd14_0_cpython` (installed from conda-forge)
> - `pyarrow=10.0.1:py39h2db5b05_1_cpu` (installed from conda-forge)
>Reporter: James Bourbeau
>Priority: Minor
> Fix For: 11.0.0
>
>
> When attempting to create a new filesystem object from a public dataset in 
> S3, where there is a space in the bucket name, an error is raised.
>  
> Here's a minimal reproducer:
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet") {code}
> which fails with the following traceback:
>  
> {code:java}
> Traceback (most recent call last):
>   File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
>     result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet")
>   File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet'{code}
>  
> Note that things work if I use a different dataset that doesn't have a space 
> in the URI, or if I replace the portion of the URI that has a space with a 
> `*` wildcard
>  
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet") 
> # works
>  result = 
> FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet") # works
> {code}
>  
> The wildcard isn't necessarily equivalent to the original failing URI, but I 
> think it highlights that the space is somehow problematic.





[jira] [Updated] (ARROW-18436) `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space

2022-12-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18436:
---
Fix Version/s: 11.0.0

> `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space
> -
>
> Key: ARROW-18436
> URL: https://issues.apache.org/jira/browse/ARROW-18436
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.1
> Environment: - OS: macOS
> - `python=3.9.15:h709bd14_0_cpython` (installed from conda-forge)
> - `pyarrow=10.0.1:py39h2db5b05_1_cpu` (installed from conda-forge)
>Reporter: James Bourbeau
>Priority: Minor
> Fix For: 11.0.0
>
>
> When attempting to create a new filesystem object from a public dataset in 
> S3, where there is a space in the bucket name, an error is raised.
>  
> Here's a minimal reproducer:
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet") {code}
> which fails with the following traceback:
>  
> {code:java}
> Traceback (most recent call last):
>   File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
>     result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet")
>   File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet'{code}
>  
> Note that things work if I use a different dataset that doesn't have a space 
> in the URI, or if I replace the portion of the URI that has a space with a 
> `*` wildcard
>  
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet") 
> # works
>  result = 
> FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet") # works
> {code}
>  
> The wildcard isn't necessarily equivalent to the original failing URI, but I 
> think it highlights that the space is somehow problematic.





[jira] [Commented] (ARROW-18436) `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space

2022-12-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647715#comment-17647715
 ] 

Antoine Pitrou commented on ARROW-18436:


That's because the space needs to be encoded. However, there is a genuine 
issue here: the %-encoded path isn't decoded on return:
{code:python}
>>> result = 
>>> FileSystem.from_uri("s3://nyc-tlc/trip%20data/fhvhv_tripdata_2022-06.parquet")
>>> result
(<pyarrow._s3fs.S3FileSystem object at 0x...>,
 'nyc-tlc/trip%20data/fhvhv_tripdata_2022-06.parquet')
{code}


> `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space
> -
>
> Key: ARROW-18436
> URL: https://issues.apache.org/jira/browse/ARROW-18436
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.1
> Environment: - OS: macOS
> - `python=3.9.15:h709bd14_0_cpython` (installed from conda-forge)
> - `pyarrow=10.0.1:py39h2db5b05_1_cpu` (installed from conda-forge)
>Reporter: James Bourbeau
>Priority: Minor
>
> When attempting to create a new filesystem object from a public dataset in 
> S3, where there is a space in the bucket name, an error is raised.
>  
> Here's a minimal reproducer:
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet") {code}
> which fails with the following traceback:
>  
> {code:java}
> Traceback (most recent call last):
>   File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
>     result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet")
>   File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet'{code}
>  
> Note that things work if I use a different dataset that doesn't have a space 
> in the URI, or if I replace the portion of the URI that has a space with a 
> `*` wildcard
>  
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet") 
> # works
>  result = 
> FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet") # works
> {code}
>  
> The wildcard isn't necessarily equivalent to the original failing URI, but I 
> think it highlights that the space is somehow problematic.





[jira] [Resolved] (ARROW-18106) [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"

2022-12-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18106.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14741
https://github.com/apache/arrow/pull/14741

> [C++] JSON reader ignores explicit schema with default 
> unexpected_field_behavior="infer"
> 
>
> Key: ARROW-18106
> URL: https://issues.apache.org/jira/browse/ARROW-18106
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Harkins
>Priority: Major
>  Labels: json, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Not 100% sure this is a "bug", but at least I find it an unexpected interplay 
> between two options.
> By default, when reading json, we _infer_ the data type of columns, and when 
> specifying an explicit schema, we _also_ by default infer the type of columns 
> that are not specified in the explicit schema. The docs for 
> {{unexpected_field_behavior}}:
> > How JSON fields outside of explicit_schema (if given) are treated
> But it seems that if you specify a schema, and the parsing of one of the 
> columns fails according to that schema, we still fall back to this default of 
> inferring the data type (while I would have expected an error, since we 
> should only infer for columns _not_ in the schema).
> Example code using pyarrow:
> {code:python}
> import io
> import pyarrow as pa
> from pyarrow import json
> s_json = """{"column":"2022-09-05T08:08:46.000"}"""
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]))
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> The parsing fails here because there are milliseconds and the type is "s", 
> but the explicit schema is ignored, and we get a result with a string column 
> as result:
> {code}
> pyarrow.Table
> column: string
> 
> column: [["2022-09-05T08:08:46.000"]]
> {code}
> But when adding {{unexpected_field_behaviour="ignore"}}, we actually get the 
> expected parse error:
> {code:python}
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]), unexpected_field_behavior="ignore")
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> gives
> {code}
> ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't 
> parse:2022-09-05T08:08:46.000
> {code}
> It might be this is specific to timestamps, I don't directly see a similar 
> issue with eg {{"column": "A"}} and setting the schema to "column" being 
> int64.





[jira] [Resolved] (ARROW-18435) [C++][Java] Update ORC to 1.8.1

2022-12-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18435.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14942
https://github.com/apache/arrow/pull/14942

> [C++][Java] Update ORC to 1.8.1
> ---
>
> Key: ARROW-18435
> URL: https://issues.apache.org/jira/browse/ARROW-18435
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Java
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-17798) [C++][Parquet] Add DELTA_BINARY_PACKED encoder to Parquet writer

2022-12-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-17798.

Resolution: Fixed

Issue resolved by pull request 14191
https://github.com/apache/arrow/pull/14191

> [C++][Parquet] Add DELTA_BINARY_PACKED encoder to Parquet writer
> 
>
> Key: ARROW-17798
> URL: https://issues.apache.org/jira/browse/ARROW-17798
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Parquet
>Reporter: Rok Mihevc
>Assignee: Rok Mihevc
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 26h 10m
>  Remaining Estimate: 0h
>
> We need to add DELTA_BINARY_PACKED encoder to implement DELTA_BYTE_ARRAY 
> encoder (ARROW-17619).





[jira] [Assigned] (ARROW-18423) [Python] Expose reading a schema from an IPC message

2022-12-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-18423:
--

Assignee: Andre Kohn

> [Python] Expose reading a schema from an IPC message
> 
>
> Key: ARROW-18423
> URL: https://issues.apache.org/jira/browse/ARROW-18423
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Andre Kohn
>Assignee: Andre Kohn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Pyarrow currently does not implement the reading of an Arrow schema from an 
> IPC message.
> [https://github.com/apache/arrow/blob/80b389efe902af376a85a8b3740e0dbdc5f80900/python/pyarrow/ipc.pxi#L1094]
>  
> We'd like to consume an Arrow IPC stream like the following:
>  
> {code:java}
> schema_msg = pyarrow.ipc.read_message(result_iter.next().data)
> schema = pyarrow.ipc.read_schema(schema_msg)
> for batch_data in result_iter:
> batch_msg = pyarrow.ipc.read_message(batch_data.data)
>     batch = pyarrow.ipc.read_record_batch(batch_msg, schema){code}
>  
> The associated (tiny) PR on GitHub implements this reading by binding the 
> existing C++ function.





[jira] [Resolved] (ARROW-18423) [Python] Expose reading a schema from an IPC message

2022-12-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18423.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14831
https://github.com/apache/arrow/pull/14831

> [Python] Expose reading a schema from an IPC message
> 
>
> Key: ARROW-18423
> URL: https://issues.apache.org/jira/browse/ARROW-18423
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Andre Kohn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Pyarrow currently does not implement the reading of an Arrow schema from an 
> IPC message.
> [https://github.com/apache/arrow/blob/80b389efe902af376a85a8b3740e0dbdc5f80900/python/pyarrow/ipc.pxi#L1094]
>  
> We'd like to consume an Arrow IPC stream like the following:
>  
> {code:java}
> schema_msg = pyarrow.ipc.read_message(result_iter.next().data)
> schema = pyarrow.ipc.read_schema(schema_msg)
> for batch_data in result_iter:
> batch_msg = pyarrow.ipc.read_message(batch_data.data)
>     batch = pyarrow.ipc.read_record_batch(batch_msg, schema){code}
>  
> The associated (tiny) PR on GitHub implements this reading by binding the 
> existing C++ function.





[jira] [Resolved] (ARROW-18420) [C++][Parquet] Introduce ColumnIndex and OffsetIndex

2022-12-13 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18420.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14803
https://github.com/apache/arrow/pull/14803

> [C++][Parquet] Introduce ColumnIndex and OffsetIndex
> 
>
> Key: ARROW-18420
> URL: https://issues.apache.org/jira/browse/ARROW-18420
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Parquet
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 12h 50m
>  Remaining Estimate: 0h
>
> Define interface of ColumnIndex and OffsetIndex and provide implementation to 
> read from serialized form.





[jira] [Resolved] (ARROW-17932) [C++] Implement streaming RecordBatchReader for JSON

2022-12-13 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-17932.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14355
https://github.com/apache/arrow/pull/14355

> [C++] Implement streaming RecordBatchReader for JSON
> 
>
> Key: ARROW-17932
> URL: https://issues.apache.org/jira/browse/ARROW-17932
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Ben Harkins
>Assignee: Ben Harkins
>Priority: Major
>  Labels: json, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> We don't currently support incremental RecordBatch reading from JSON streams, 
> which is needed to properly implement JSON support in Dataset. The existing 
> CSV StreamingReader API can be used as a model.





[jira] [Resolved] (ARROW-16430) [Python] Read/Write record batch custom metadata API in pyarrow

2022-12-12 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-16430.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 13041
https://github.com/apache/arrow/pull/13041

> [Python] Read/Write record batch custom metadata API in pyarrow
> ---
>
> Key: ARROW-16430
> URL: https://issues.apache.org/jira/browse/ARROW-16430
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 7.0.0
>Reporter: Yue Ni
>Assignee: Yue Ni
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 8h 20m
>  Remaining Estimate: 0h
>
> In https://issues.apache.org/jira/browse/ARROW-16131, Arrow C++ APIs were 
> added so that users can read/write record batch custom metadata for IPC file. 
> But pyarrow still lacks corresponding APIs for doing this.





[jira] [Resolved] (ARROW-18421) [C++][ORC] Add accessor for number of rows by stripe in reader

2022-12-12 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18421.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14806
https://github.com/apache/arrow/pull/14806

> [C++][ORC] Add accessor for number of rows by stripe in reader
> --
>
> Key: ARROW-18421
> URL: https://issues.apache.org/jira/browse/ARROW-18421
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Louis Calot
>Assignee: Louis Calot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> I need to have the number of rows by stripe to be able to read specific 
> ranges of records in the ORC file without reading it all. The number of rows 
> was already stored in the implementation but not available in the API.





[jira] [Commented] (ARROW-18277) [R] Unable to install R's arrow on RStudio

2022-12-12 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646121#comment-17646121
 ] 

Antoine Pitrou commented on ARROW-18277:


I think we can close it.

> [R] Unable to install R's arrow on RStudio
> --
>
> Key: ARROW-18277
> URL: https://issues.apache.org/jira/browse/ARROW-18277
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Connor
>Priority: Minor
>
> Hello! Following the instructions on 
> [https://arrow.apache.org/docs/r/articles/install.html] I am filing this 
> ticket for help installing R's arrow package on RStudio. Output below
>  
> {code:java}
> > Sys.setenv(ARROW_R_DEV=TRUE)
> > install.packages("arrow")
> Installing package into ‘/var/lib/rstudio-server/local/site-library’
> (as ‘lib’ is unspecified)
> trying URL 'https://cran.rstudio.com/src/contrib/arrow_10.0.0.tar.gz'
> Content type 'application/x-gzip' length 4843530 bytes (4.6 MB)
> ==
> downloaded 4.6 MB* installing *source* package ‘arrow’ ...
> ** package ‘arrow’ successfully unpacked and MD5 sums checked
> ** using staged installation
> *** Found local C++ source: 'tools/cpp'
> *** Building libarrow from source
>     For build options and troubleshooting, see the install vignette:
>     https://cran.r-project.org/web/packages/arrow/vignettes/install.html
> *** Building with MAKEFLAGS= -j2 
>  cmake
> trying URL 
> 'https://github.com/Kitware/CMake/releases/download/v3.21.4/cmake-3.21.4-linux-x86_64.tar.gz'
> Content type 'application/octet-stream' length 44684259 bytes (42.6 MB)
> ==
> downloaded 42.6 MB arrow with SOURCE_DIR='tools/cpp' 
> BUILD_DIR='/tmp/RtmpRnb6XO/file4484b64e7cde3' 
> DEST_DIR='libarrow/arrow-10.0.0' 
> CMAKE='/tmp/RtmpRnb6XO/file4484b4c7f3eba/cmake-3.21.4-linux-x86_64/bin/cmake' 
> EXTRA_CMAKE_FLAGS='' CC='/usr/bin/gcc -fPIC' CXX='/usr/bin/g++ -fPIC 
> -std=c++17' LDFLAGS='-L/usr/local/lib' ARROW_S3='OFF' ARROW_GCS='OFF' 
> ++ pwd
> + : /tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow
> + : tools/cpp
> + : /tmp/RtmpRnb6XO/file4484b64e7cde3
> + : libarrow/arrow-10.0.0
> + : /tmp/RtmpRnb6XO/file4484b4c7f3eba/cmake-3.21.4-linux-x86_64/bin/cmake
> ++ cd tools/cpp
> ++ pwd
> + SOURCE_DIR=/tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow/tools/cpp
> ++ mkdir -p libarrow/arrow-10.0.0
> ++ cd libarrow/arrow-10.0.0
> ++ pwd
> + DEST_DIR=/tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow/libarrow/arrow-10.0.0
> ++ nproc
> + : 16
> + '[' '' '!=' '' ']'
> + '[' '' = false ']'
> + ARROW_DEFAULT_PARAM=OFF
> + mkdir -p /tmp/RtmpRnb6XO/file4484b64e7cde3
> + pushd /tmp/RtmpRnb6XO/file4484b64e7cde3
> /tmp/RtmpRnb6XO/file4484b64e7cde3 /tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow
> + /tmp/RtmpRnb6XO/file4484b4c7f3eba/cmake-3.21.4-linux-x86_64/bin/cmake 
> -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=OFF 
> -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON -DARROW_CSV=ON -DARROW_DATASET=ON 
> -DARROW_DEPENDENCY_SOURCE=AUTO -DAWSSDK_SOURCE= -DARROW_FILESYSTEM=ON 
> -DARROW_GCS=OFF -DARROW_JEMALLOC=OFF -DARROW_MIMALLOC=ON -DARROW_JSON=ON 
> -DARROW_PARQUET=ON -DARROW_S3=OFF -DARROW_WITH_BROTLI=OFF 
> -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=ON -DARROW_WITH_RE2=ON 
> -DARROW_WITH_SNAPPY=ON -DARROW_WITH_UTF8PROC=ON -DARROW_WITH_ZLIB=OFF 
> -DARROW_WITH_ZSTD=OFF -DARROW_VERBOSE_THIRDPARTY_BUILD=OFF 
> -DCMAKE_BUILD_TYPE=Release -DCMAKE_FIND_DEBUG_MODE=OFF 
> -DCMAKE_INSTALL_LIBDIR=lib 
> -DCMAKE_INSTALL_PREFIX=/tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow/libarrow/arrow-10.0.0
>  -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON 
> -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=OFF 
> -Dxsimd_SOURCE= -Dzstd_SOURCE= -G 'Unix Makefiles' 
> /tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow/tools/cpp
> -- Building using CMake version: 3.21.4
> -- The C compiler identification is GNU 6.3.0
> -- The CXX compiler identification is GNU 6.3.0
> -- Detecting C compiler ABI info
> -- Detecting C compiler ABI info - done
> -- Check for working C compiler: /usr/bin/gcc - skipped
> -- Detecting C compile features
> -- Detecting C compile features - done
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Check for working CXX compiler: /usr/bin/g++ - skipped
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> -- Arrow version: 10.0.0 (full: '10.0.0')
> -- Arrow SO version: 1000 (full: 1000.0.0)
> -- clang-tidy 14 not found
> -- clang-format 14 not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) 
> -- infer not found
> -- Found Python3: /usr/local/bin/python3.9 (found version "3.9.4") found 
> components: Interpreter 
> -- Found cpplint executable at 
> 

[jira] [Commented] (ARROW-12264) [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down

2022-12-12 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17645984#comment-17645984
 ] 

Antoine Pitrou commented on ARROW-12264:


That's right.

> [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down
> ---
>
> Key: ARROW-12264
> URL: https://issues.apache.org/jira/browse/ARROW-12264
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Antoine Pitrou
>Priority: Major
>
> The Parquet spec (in parquet.thrift) says the following about handling of 
> floating-point statistics:
> {code}
>* (*) Because the sorting order is not specified properly for floating
>* point values (relations vs. total ordering) the following
>* compatibility rules should be applied when reading statistics:
>* - If the min is a NaN, it should be ignored.
>* - If the max is a NaN, it should be ignored.
>* - If the min is +0, the row group may contain -0 values as well.
>* - If the max is -0, the row group may contain +0 values as well.
>* - When looking for NaN values, min and max should be ignored.
> {code}
> It appears that the dataset code uses the following filter expression when 
> doing Parquet predicate push-down (in {{file_parquet.cc}}):
> {code:c++}
> return and_(greater_equal(field_expr, literal(min)),
> less_equal(field_expr, literal(max)));
> {code}
> A NaN value will fail that filter and yet may be found in the given Parquet 
> column chunk.
> We may instead need a "greater_equal_or_nan" comparison that returns true if 
> either value is NaN.
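The failure mode follows from IEEE 754 semantics: every ordered comparison involving NaN is false, so NaN satisfies neither bound of the pushdown predicate. A plain-Python sketch (the min/max values are illustrative):

```python
nan = float("nan")
row_group_min, row_group_max = 0.0, 10.0  # hypothetical chunk statistics

# The pushdown predicate min <= v <= max evaluates to false for NaN,
# so a row group containing NaN would be wrongly pruned.
keeps_nan = (nan >= row_group_min) and (nan <= row_group_max)
print(keeps_nan)  # False
```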





[jira] [Resolved] (ARROW-14999) [C++] List types with different field names are not equal

2022-12-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-14999.

Resolution: Fixed

Issue resolved by pull request 14847
https://github.com/apache/arrow/pull/14847

> [C++] List types with different field names are not equal
> -
>
> Key: ARROW-14999
> URL: https://issues.apache.org/jira/browse/ARROW-14999
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 6.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> When comparing map types, the names of the fields are ignored. This was 
> introduced in ARROW-7173.
> However for list types, they are not ignored. For example,
> {code:python}
> In [6]: l1 = pa.list_(pa.field("val", pa.int64()))
> In [7]: l2 = pa.list_(pa.int64())
> In [8]: l1
> Out[8]: ListType(list)
> In [9]: l2
> Out[9]: ListType(list)
> In [10]: l1 == l2
> Out[10]: False
> {code}
> Should we make list type comparison ignore field names too?





[jira] [Updated] (ARROW-12264) [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down

2022-12-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-12264:
---
Issue Type: Bug  (was: Task)

> [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down
> ---
>
> Key: ARROW-12264
> URL: https://issues.apache.org/jira/browse/ARROW-12264
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Antoine Pitrou
>Priority: Major
>
> The Parquet spec (in parquet.thrift) says the following about handling of 
> floating-point statistics:
> {code}
>* (*) Because the sorting order is not specified properly for floating
>* point values (relations vs. total ordering) the following
>* compatibility rules should be applied when reading statistics:
>* - If the min is a NaN, it should be ignored.
>* - If the max is a NaN, it should be ignored.
>* - If the min is +0, the row group may contain -0 values as well.
>* - If the max is -0, the row group may contain +0 values as well.
>* - When looking for NaN values, min and max should be ignored.
> {code}
> It appears that the dataset code uses the following filter expression when 
> doing Parquet predicate push-down (in {{file_parquet.cc}}):
> {code:c++}
> return and_(greater_equal(field_expr, literal(min)),
> less_equal(field_expr, literal(max)));
> {code}
> A NaN value will fail that filter and yet may be found in the given Parquet 
> column chunk.
> We may instead need a "greater_equal_or_nan" comparison that returns true if 
> either value is NaN.





[jira] [Comment Edited] (ARROW-12264) [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down

2022-12-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644354#comment-17644354
 ] 

Antoine Pitrou edited comment on ARROW-12264 at 12/7/22 2:08 PM:
-

cc [~westonpace]


was (Author: pitrou):
cc @westonpace

> [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down
> ---
>
> Key: ARROW-12264
> URL: https://issues.apache.org/jira/browse/ARROW-12264
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Antoine Pitrou
>Priority: Major
>
> The Parquet spec (in parquet.thrift) says the following about handling of 
> floating-point statistics:
> {code}
>* (*) Because the sorting order is not specified properly for floating
>* point values (relations vs. total ordering) the following
>* compatibility rules should be applied when reading statistics:
>* - If the min is a NaN, it should be ignored.
>* - If the max is a NaN, it should be ignored.
>* - If the min is +0, the row group may contain -0 values as well.
>* - If the max is -0, the row group may contain +0 values as well.
>* - When looking for NaN values, min and max should be ignored.
> {code}
> It appears that the dataset code uses the following filter expression when 
> doing Parquet predicate push-down (in {{file_parquet.cc}}):
> {code:c++}
> return and_(greater_equal(field_expr, literal(min)),
> less_equal(field_expr, literal(max)));
> {code}
> A NaN value will fail that filter and yet may be found in the given Parquet 
> column chunk.
> We may instead need a "greater_equal_or_nan" comparison that returns true if 
> either value is NaN.





[jira] [Commented] (ARROW-12264) [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down

2022-12-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644354#comment-17644354
 ] 

Antoine Pitrou commented on ARROW-12264:


cc @westonpace

> [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down
> ---
>
> Key: ARROW-12264
> URL: https://issues.apache.org/jira/browse/ARROW-12264
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Antoine Pitrou
>Priority: Major
>
> The Parquet spec (in parquet.thrift) says the following about handling of 
> floating-point statistics:
> {code}
>* (*) Because the sorting order is not specified properly for floating
>* point values (relations vs. total ordering) the following
>* compatibility rules should be applied when reading statistics:
>* - If the min is a NaN, it should be ignored.
>* - If the max is a NaN, it should be ignored.
>* - If the min is +0, the row group may contain -0 values as well.
>* - If the max is -0, the row group may contain +0 values as well.
>* - When looking for NaN values, min and max should be ignored.
> {code}
> It appears that the dataset code uses the following filter expression when 
> doing Parquet predicate push-down (in {{file_parquet.cc}}):
> {code:c++}
> return and_(greater_equal(field_expr, literal(min)),
> less_equal(field_expr, literal(max)));
> {code}
> A NaN value will fail that filter and yet may be found in the given Parquet 
> column chunk.
> We may instead need a "greater_equal_or_nan" comparison that returns true if 
> either value is NaN.





[jira] [Commented] (ARROW-13240) [C++][Parquet] Page statistics not written in v2?

2022-12-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644350#comment-17644350
 ] 

Antoine Pitrou commented on ARROW-13240:


[~jorgecarleitao] Could you try to check if that still happens with the latest 
PyArrow?

> [C++][Parquet] Page statistics not written in v2?
> -
>
> Key: ARROW-13240
> URL: https://issues.apache.org/jira/browse/ARROW-13240
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jorge Leitão
>Priority: Major
>
> While working in integration tests of parquet2 against pyarrow, I noticed 
> that page statistics are only written by pyarrow when using version 1.
> I do not have an easy way to reproduce this within pyarrow as I am not sure 
> how to access individual pages from a column chunk, but it is something that 
> I observe when trying to integrate.
> The row group stats are still written, this only affects page statistics.
> pyarrow call:
> {code:python}
> pa.parquet.write_table(
>     t,
>     path,
>     version="2.0",
>     data_page_version="2.0",
>     write_statistics=True,
> )
> {code}
> changing version to "1.0" does not impact this behavior, suggesting that the 
> specific option causing this behavior is the data_page_version="2.0".





[jira] [Commented] (ARROW-13240) [C++][Parquet] Page statistics not written in v2?

2022-12-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644349#comment-17644349
 ] 

Antoine Pitrou commented on ARROW-13240:


[~emkornfield] When would that have happened?

> [C++][Parquet] Page statistics not written in v2?
> -
>
> Key: ARROW-13240
> URL: https://issues.apache.org/jira/browse/ARROW-13240
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jorge Leitão
>Priority: Major
>
> While working in integration tests of parquet2 against pyarrow, I noticed 
> that page statistics are only written by pyarrow when using version 1.
> I do not have an easy way to reproduce this within pyarrow as I am not sure 
> how to access individual pages from a column chunk, but it is something that 
> I observe when trying to integrate.
> The row group stats are still written, this only affects page statistics.
> pyarrow call:
> {code:python}
> pa.parquet.write_table(
>     t,
>     path,
>     version="2.0",
>     data_page_version="2.0",
>     write_statistics=True,
> )
> {code}
> changing version to "1.0" does not impact this behavior, suggesting that the 
> specific option causing this behavior is the data_page_version="2.0".





[jira] [Updated] (ARROW-13240) [C++][Parquet] Page statistics not written in v2?

2022-12-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-13240:
---
Priority: Major  (was: Minor)

> [C++][Parquet] Page statistics not written in v2?
> -
>
> Key: ARROW-13240
> URL: https://issues.apache.org/jira/browse/ARROW-13240
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jorge Leitão
>Priority: Major
>
> While working in integration tests of parquet2 against pyarrow, I noticed 
> that page statistics are only written by pyarrow when using version 1.
> I do not have an easy way to reproduce this within pyarrow as I am not sure 
> how to access individual pages from a column chunk, but it is something that 
> I observe when trying to integrate.
> The row group stats are still written, this only affects page statistics.
> pyarrow call:
> {code:python}
> pa.parquet.write_table(
>     t,
>     path,
>     version="2.0",
>     data_page_version="2.0",
>     write_statistics=True,
> )
> {code}
> changing version to "1.0" does not impact this behavior, suggesting that the 
> specific option causing this behavior is the data_page_version="2.0".





[jira] [Resolved] (ARROW-18424) [C++] Fix Doxygen error on `arrow::engine::ConversionStrictness`

2022-12-05 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18424.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14845
https://github.com/apache/arrow/pull/14845

> [C++] Fix Doxygen error on `arrow::engine::ConversionStrictness`
> 
>
> Key: ARROW-18424
> URL: https://issues.apache.org/jira/browse/ARROW-18424
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Doxygen is hitting the following error: 
> `/arrow/cpp/src/arrow/engine/substrait/options.h:37: error: documented symbol 
> 'enum ARROW_ENGINE_EXPORT arrow::engine::arrow::engine::ConversionStrictness' 
> was not declared or defined. (warning treated as error, aborting now)`. See 
> [this CI job 
> output|https://github.com/apache/arrow/actions/runs/3557712768/jobs/5975904381],
>  for example.





[jira] [Updated] (ARROW-18424) [C++] Fix Doxygen error on `arrow::engine::ConversionStrictness`

2022-12-05 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18424:
---
Priority: Trivial  (was: Major)

> [C++] Fix Doxygen error on `arrow::engine::ConversionStrictness`
> 
>
> Key: ARROW-18424
> URL: https://issues.apache.org/jira/browse/ARROW-18424
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Doxygen is hitting the following error: 
> `/arrow/cpp/src/arrow/engine/substrait/options.h:37: error: documented symbol 
> 'enum ARROW_ENGINE_EXPORT arrow::engine::arrow::engine::ConversionStrictness' 
> was not declared or defined. (warning treated as error, aborting now)`. See 
> [this CI job 
> output|https://github.com/apache/arrow/actions/runs/3557712768/jobs/5975904381],
>  for example.





[jira] [Resolved] (ARROW-14161) [C++][Parquet][Docs] Reading/Writing Parquet Files

2022-12-05 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-14161.

Resolution: Fixed

Issue resolved by pull request 14018
https://github.com/apache/arrow/pull/14018

> [C++][Parquet][Docs] Reading/Writing Parquet Files
> --
>
> Key: ARROW-14161
> URL: https://issues.apache.org/jira/browse/ARROW-14161
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> Missing documentation on Reading/Writing Parquet files C++ api:
>  * 
> [WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE]
>  missing docs on chunk_size; found some 
> [here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53]
>  _size of the RowGroup in the parquet file. Normally you would choose this to 
> be rather large_
>  * Typo in file reader 
> [example|https://arrow.apache.org/docs/cpp/parquet.html#filereader] the 
> include should be {{#include "parquet/arrow/reader.h"}}
>  * 
> [WriteProperties/Builder|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet16WriterPropertiesE]
>  missing docs on {{compression}}
>  * Missing example on using WriteProperties





[jira] [Resolved] (ARROW-18269) [C++] Slash character in partition value handling

2022-12-05 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18269.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14646
[https://github.com/apache/arrow/pull/14646]

> [C++] Slash character in partition value handling
> -
>
> Key: ARROW-18269
> URL: https://issues.apache.org/jira/browse/ARROW-18269
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 10.0.0
>Reporter: Vadym Dytyniak
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
>  
> Provided example shows that pyarrow does not handle partition value that 
> contains '/' correctly:
> {code:java}
> import pandas as pd
> import pyarrow as pa
> from pyarrow import dataset as ds
> df = pd.DataFrame({
> 'value': [1, 2],
> 'instrument_id': ['A/Z', 'B'],
> })
> ds.write_dataset(
> data=pa.Table.from_pandas(df),
> base_dir='data',
> format='parquet',
> partitioning=['instrument_id'],
> partitioning_flavor='hive',
> )
> table = ds.dataset(
> source='data',
> format='parquet',
> partitioning='hive',
> ).to_table()
> tables = [table]
> df = pa.concat_tables(tables).to_pandas()
> print(df.head()){code}
> Result:
> {code:java}
>    value instrument_id
> 0      1             A
> 1      2             B {code}
> Expected behaviour:
> Option 1: Result should be:
> {code:java}
>    value instrument_id
> 0      1             A/Z
> 1      2             B {code}
> Option 2: Error should be raised to avoid '/' in partition value.
>  
>  
>  
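A note on the failure mode: hive-style partitioning encodes each value as a `key=value` path segment, so an unescaped '/' inside the value is indistinguishable from a directory separator when the path is read back. A minimal stdlib sketch (hypothetical helper names, not pyarrow API) of why the round trip silently drops data:

```python
# Sketch of how a hive-style path is written and split back into key/value
# pairs. All names here are illustrative, not part of pyarrow.

def write_segment(key, value):
    # Naive encoding: a '/' inside the value is NOT escaped.
    return f"{key}={value}"

def read_segments(path):
    # Readers split on '/', so 'instrument_id=A/Z' becomes two segments.
    pairs = {}
    for segment in path.split("/"):
        if "=" in segment:
            k, _, v = segment.partition("=")
            pairs[k] = v
    return pairs

path = write_segment("instrument_id", "A/Z") + "/part-0.parquet"
print(path)                 # instrument_id=A/Z/part-0.parquet
print(read_segments(path))  # {'instrument_id': 'A'} -- the 'Z' is lost
```

This matches the reported result above, where the value 'A/Z' comes back as just 'A'.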





[jira] [Resolved] (ARROW-18419) [C++] Update vendored fast_float

2022-12-05 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18419.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14817
[https://github.com/apache/arrow/pull/14817]

> [C++] Update vendored fast_float
> 
>
> Key: ARROW-18419
> URL: https://issues.apache.org/jira/browse/ARROW-18419
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> For https://github.com/fastfloat/fast_float/pull/147 .





[jira] [Resolved] (ARROW-18413) [C++][Parquet] FileMetaData exposes page index metadata

2022-12-01 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18413.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14742
[https://github.com/apache/arrow/pull/14742]

> [C++][Parquet] FileMetaData exposes page index metadata
> ---
>
> Key: ARROW-18413
> URL: https://issues.apache.org/jira/browse/ARROW-18413
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Parquet
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Parquet ColumnChunk thrift object has recorded metadata for page index:
> [parquet-format/parquet.thrift at master · apache/parquet-format 
> (github.com)|https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L799]
> We just need to add public API to ColumnChunkMetaData to make it ready to 
> read.





[jira] [Commented] (ARROW-17538) [C++] Importing an ArrowArrayStream can't handle errors from get_schema

2022-11-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641430#comment-17641430
 ] 

Antoine Pitrou commented on ARROW-17538:


[~benpharkins] I don't know if you would like to carve out a bit of time for 
this?

> [C++] Importing an ArrowArrayStream can't handle errors from get_schema
> ---
>
> Key: ARROW-17538
> URL: https://issues.apache.org/jira/browse/ARROW-17538
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 9.0.0
>Reporter: David Li
>Priority: Major
>  Labels: good-first-issue
> Fix For: 11.0.0
>
>
> As indicated in the code: 
> https://github.com/apache/arrow/blob/cd3c6ead97d584366aafd2f14d99a1cb8ace9ca2/cpp/src/arrow/c/bridge.cc#L1823
>  
> This probably needs a static initializer so we can catch things.





[jira] [Updated] (ARROW-17538) [C++] Importing an ArrowArrayStream can't handle errors from get_schema

2022-11-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-17538:
---
Fix Version/s: 11.0.0

> [C++] Importing an ArrowArrayStream can't handle errors from get_schema
> ---
>
> Key: ARROW-17538
> URL: https://issues.apache.org/jira/browse/ARROW-17538
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 9.0.0
>Reporter: David Li
>Priority: Major
>  Labels: good-first-issue
> Fix For: 11.0.0
>
>
> As indicated in the code: 
> https://github.com/apache/arrow/blob/cd3c6ead97d584366aafd2f14d99a1cb8ace9ca2/cpp/src/arrow/c/bridge.cc#L1823
>  
> This probably needs a static initializer so we can catch things.





[jira] [Commented] (ARROW-18375) MIGRATION: Enable GitHub issue type labels

2022-11-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641382#comment-17641382
 ] 

Antoine Pitrou commented on ARROW-18375:


I use "Type: enhancement" for user-visible enhancements (such as new features, 
performance improvements...) and "Type: task" for things that don't affect them 
directly (such as an internal refactor).

> MIGRATION: Enable GitHub issue type labels
> --
>
> Key: ARROW-18375
> URL: https://issues.apache.org/jira/browse/ARROW-18375
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Priority: Major
>
> As part of enabling GitHub issue reporting, the following labels have been 
> defined and need to be added to the repository label options. Without these 
> labels added, [new issues|https://github.com/apache/arrow/issues/14692] do 
> not get the issue template-defined issue type labels set properly.
>  
> Labels:
>  * Type: bug
>  * Type: enhancement
>  * Type: usage
>  * Type: task
>  * Type: test
>  





[jira] [Commented] (ARROW-18375) MIGRATION: Enable GitHub issue type labels

2022-11-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641379#comment-17641379
 ] 

Antoine Pitrou commented on ARROW-18375:


I don't understand what "Type: test" is for either :-)

> MIGRATION: Enable GitHub issue type labels
> --
>
> Key: ARROW-18375
> URL: https://issues.apache.org/jira/browse/ARROW-18375
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Priority: Major
>
> As part of enabling GitHub issue reporting, the following labels have been 
> defined and need to be added to the repository label options. Without these 
> labels added, [new issues|https://github.com/apache/arrow/issues/14692] do 
> not get the issue template-defined issue type labels set properly.
>  
> Labels:
>  * Type: bug
>  * Type: enhancement
>  * Type: usage
>  * Type: task
>  * Type: test
>  





[jira] [Updated] (ARROW-18373) MIGRATION: Enable multiple component selection in issue templates

2022-11-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18373:
---
Labels: gh-migration pull-request-available  (was: pull-request-available)

> MIGRATION: Enable multiple component selection in issue templates
> -
>
> Key: ARROW-18373
> URL: https://issues.apache.org/jira/browse/ARROW-18373
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Assignee: Todd Farmer
>Priority: Major
>  Labels: gh-migration, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Per comments in [this merged PR|https://github.com/apache/arrow/pull/14675], 
> we would like to enable selection of multiple components when reporting 
> issues via GitHub issues.
> Additionally, we may want to add the needed Apache license to the issue 
> templates and remove the exclusion rules from rat_exclude_files.txt.





[jira] [Updated] (ARROW-18378) MIGRATION: Disable issue reporting in ASF Jira

2022-11-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18378:
---
Labels: gh-migration  (was: )

> MIGRATION: Disable issue reporting in ASF Jira
> --
>
> Key: ARROW-18378
> URL: https://issues.apache.org/jira/browse/ARROW-18378
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Assignee: Todd Farmer
>Priority: Major
>  Labels: gh-migration
>
> ARROW-18364 enabled issue reporting for Apache Arrow in GitHub issues. Even 
> though existing Jira issues have not yet been migrated and are still being 
> worked in the Jira system, we should assess disabling creation of new issues 
> in ASF Jira, and instead pointing users to GitHub issues. This may benefit 
> the project by reducing the need to monitor inflow in two discrete systems.





[jira] [Updated] (ARROW-18381) MIGRATION: Create milestones for every needed fix version

2022-11-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18381:
---
Labels: gh-migration  (was: )

> MIGRATION: Create milestones for every needed fix version
> -
>
> Key: ARROW-18381
> URL: https://issues.apache.org/jira/browse/ARROW-18381
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Priority: Major
>  Labels: gh-migration
> Attachments: Screenshot from 2022-11-22 11-53-07.png, Screenshot from 
> 2022-11-22 11-54-26.png
>
>
> The Apache Arrow project uses the "Fix version" field in ASF Jira issues to 
> track the version in which issues were resolved/fixed/implemented. The most 
> equivalent field in GitHub issues is the "milestone" field. This field is 
> explicitly managed - the versions need to be added to the repository 
> configuration before they can be used. This mapping needs to be established 
> as a prerequisite for completing the import from ASF Jira.





[jira] [Updated] (ARROW-18364) MIGRATION: Update GitHub issue templates to support bug reports and feature requests

2022-11-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18364:
---
Labels: gh-migration  (was: )

> MIGRATION: Update GitHub issue templates to support bug reports and feature 
> requests
> 
>
> Key: ARROW-18364
> URL: https://issues.apache.org/jira/browse/ARROW-18364
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Assignee: Todd Farmer
>Priority: Major
>  Labels: gh-migration
> Attachments: image-2022-11-22-11-53-20-840.png, 
> image-2022-11-22-11-55-17-106.png
>
>
> The [GitHub issue creation page for 
> Arrow|https://github.com/apache/arrow/issues/new/choose] directs users to 
> open bug reports in Jira. Now that ASF Infra has disabled self-service 
> registration in Jira, and in light of the pending migration of Apache Arrow 
> issue tracking from ASF Jira to GitHub issues, we should enable bug reports 
> to be submitted via GitHub directly. Issue templates will help distinguish 
> bug reports and feature requests from existing usage assistance questions.
> It's also worth noting, now that GitHub issue reporting is enabled, that 
> issues cannot be resolved in a way that explicitly tracks the version where 
> the resolution was made, if the issue is tracked only in GitHub issues.





[jira] [Updated] (ARROW-18377) MIGRATION: Automate component labels from issue form content

2022-11-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18377:
---
Labels: gh-migration  (was: )

> MIGRATION: Automate component labels from issue form content
> 
>
> Key: ARROW-18377
> URL: https://issues.apache.org/jira/browse/ARROW-18377
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Priority: Major
>  Labels: gh-migration
>
> ARROW-18364 added the ability to report issues in GitHub, and includes GitHub 
> issue templates with a drop-down component(s) selector. These form elements 
> drive resulting issue markdown only, and cannot dynamically drive issue 
> labels. This requires GitHub actions, which also have a few limitations. 
> First, the issue form does not produce any structured data; it only produces 
> the issue description markdown, so a parser is required. Second, ASF 
> restricts GitHub actions to a selection of approved actions. It is likely 
> that while community actions exist to generate structured data from issue 
> forms, the Apache Arrow project will need to write its own parser and label 
> application action.
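As a rough sketch of the parsing step described above (everything here is hypothetical — the actual heading text and label names would come from Arrow's real issue templates, and an action would receive the body via the GitHub API):

```python
import re

# Hypothetical parser: extract the selected component(s) from the markdown
# body that a GitHub issue form produces, then map them to repository labels.
# The "### Component(s)" heading name is an assumption, not Arrow's template.

def parse_components(issue_body):
    match = re.search(r"### Component\(s\)\s*\n+(.+)", issue_body)
    if not match:
        return []
    return [c.strip() for c in match.group(1).split(",") if c.strip()]

def component_labels(issue_body):
    return [f"Component: {name}" for name in parse_components(issue_body)]

body = """### What happened?

Reading a file crashes.

### Component(s)

C++, Python
"""
print(component_labels(body))  # ['Component: C++', 'Component: Python']
```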





[jira] [Updated] (ARROW-18376) MIGRATION: Add component labels to GitHub

2022-11-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18376:
---
Labels: gh-migration  (was: )

> MIGRATION: Add component labels to GitHub
> -
>
> Key: ARROW-18376
> URL: https://issues.apache.org/jira/browse/ARROW-18376
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Priority: Major
>  Labels: gh-migration
>
> Similar to ARROW-18375, component labels have been established based on 
> existing component values defined in ASF Jira. The following labels are 
> needed:
> * Component: Archery
> * Component: Benchmarking
> * Component: C
> * Component: C#
> * Component: C++
> * Component: C++ - Gandiva
> * Component: C++ - Plasma
> * Component: Continuous Integration
> * Component: Dart
> * Component: Developer Tools
> * Component: Documentation
> * Component: FlightRPC
> * Component: Format
> * Component: GLib
> * Component: Go
> * Component: GPU
> * Component: Integration
> * Component: Java
> * Component: JavaScript
> * Component: MATLAB
> * Component: Packaging
> * Component: Parquet
> * Component: Python
> * Component: R
> * Component: Ruby
> * Component: Swift
> * Component: Website
> * Component: Other





[jira] [Commented] (ARROW-18371) [C++] Expose *FromJSON helpers

2022-11-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641250#comment-17641250
 ] 

Antoine Pitrou commented on ARROW-18371:


We would have to prefix those macros with {{ARROW_}}. 

> [C++] Expose *FromJSON helpers
> --
>
> Key: ARROW-18371
> URL: https://issues.apache.org/jira/browse/ARROW-18371
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: testing
>
> {{ArrayFromJSON}}, {{ExecBatchFromJSON}} and {{RecordBatchFromJSON}} helper 
> functions would be useful when testing in projects that use Arrow. 
> {{BatchesWithSchema}} and {{MakeBasicBatches}} could be considered as well.





[jira] [Commented] (ARROW-18375) MIGRATION: Enable GitHub issue type labels

2022-11-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641234#comment-17641234
 ] 

Antoine Pitrou commented on ARROW-18375:


cc [~jorisvandenbossche] [~assignUser]

> MIGRATION: Enable GitHub issue type labels
> --
>
> Key: ARROW-18375
> URL: https://issues.apache.org/jira/browse/ARROW-18375
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Priority: Major
>
> As part of enabling GitHub issue reporting, the following labels have been 
> defined and need to be added to the repository label options. Without these 
> labels added, [new issues|https://github.com/apache/arrow/issues/14692] do 
> not get the issue template-defined issue type labels set properly.
>  
> Labels:
>  * Type: bug
>  * Type: enhancement
>  * Type: usage
>  * Type: task
>  * Type: test
>  





[jira] [Commented] (ARROW-18375) MIGRATION: Enable GitHub issue type labels

2022-11-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641233#comment-17641233
 ] 

Antoine Pitrou commented on ARROW-18375:


I renamed the existing "bug", "enhancement" and "usage" labels to "Type: bug", 
"Type: enhancement" and "Type: usage". The last two do not seem to exist, nor 
are they referenced in any issue template, though.


> MIGRATION: Enable GitHub issue type labels
> --
>
> Key: ARROW-18375
> URL: https://issues.apache.org/jira/browse/ARROW-18375
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Priority: Major
>
> As part of enabling GitHub issue reporting, the following labels have been 
> defined and need to be added to the repository label options. Without these 
> labels added, [new issues|https://github.com/apache/arrow/issues/14692] do 
> not get the issue template-defined issue type labels set properly.
>  
> Labels:
>  * Type: bug
>  * Type: enhancement
>  * Type: usage
>  * Type: task
>  * Type: test
>  





[jira] [Commented] (ARROW-13221) [C++] arrow_reader_writer_test.cc slow to compile

2022-11-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641195#comment-17641195
 ] 

Antoine Pitrou commented on ARROW-13221:


bq. It seems that 1) may not reduce total build time, because it just splits 
the whole test suite and changes neither the total number of tests nor the 
current build approach (templated tests).

It depends on the parallelism. In a parallel build, compiling 
arrow_reader_writer_test.cc is often the last task to finish at the end...

> [C++] arrow_reader_writer_test.cc slow to compile
> -
>
> Key: ARROW-13221
> URL: https://issues.apache.org/jira/browse/ARROW-13221
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>
> As soon as some optimizations are enabled, 
> {{src/parquet/arrow/arrow_reader_writer_test.cc}} becomes extremely slow to 
> compile (more than one minute just for itself). This is perceivable on e.g. 
> the {{conda-cpp-valgrind}} build where we add {{-Og}} to the gcc flags in 
> order to make the tests less slow under emulation.





[jira] [Commented] (ARROW-13221) [C++] arrow_reader_writer_test.cc slow to compile

2022-11-29 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17640576#comment-17640576
 ] 

Antoine Pitrou commented on ARROW-13221:


I think there are two aspects to this:
1) split the tests into at least two different test files
2) rewrite some of the templated tests (inducing long code generation times at 
compilation) to be runtime-parametric instead (for example using 
RandomArrayGenerator or ArrayFromJSON)

Given that those tests have grown organically, step 2 above will probably be 
slightly cumbersome... One needs to understand what each test checks for and 
rewrite it while keeping its intent.

cc [~kou] [~benpharkins]


> [C++] arrow_reader_writer_test.cc slow to compile
> -
>
> Key: ARROW-13221
> URL: https://issues.apache.org/jira/browse/ARROW-13221
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>
> As soon as some optimizations are enabled, 
> {{src/parquet/arrow/arrow_reader_writer_test.cc}} becomes extremely slow to 
> compile (more than one minute just for itself). This is perceivable on e.g. 
> the {{conda-cpp-valgrind}} build where we add {{-Og}} to the gcc flags in 
> order to make the tests less slow under emulation.





[jira] [Commented] (ARROW-18039) [C++][CI] Reduce MinGW build times

2022-11-29 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17640491#comment-17640491
 ] 

Antoine Pitrou commented on ARROW-18039:


Yes, it was already reported at 
https://issues.apache.org/jira/browse/ARROW-13221

> [C++][CI] Reduce MinGW build times
> --
>
> Key: ARROW-18039
> URL: https://issues.apache.org/jira/browse/ARROW-18039
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Assignee: Kouhei Sutou
>Priority: Major
>
> The MinGW C++ builds on CI currently build in release mode. This is probably 
> because debug builds on Windows are complicated (you must get all the 
> dependencies also compiled in debug mode, AFAIU).
> However, we could probably disable optimizations, so as to reduce compilation 
> times.
> The compilation flags are currently as follows:
> {code}
> -- CMAKE_C_FLAGS:  -O2 -DNDEBUG -ftree-vectorize  -Wa,-mbig-obj -Wall 
> -Wno-conversion -Wno-sign-conversion -Wunused-result 
> -fno-semantic-interposition -mxsave -msse4.2 
> -- CMAKE_CXX_FLAGS:  -Wno-noexcept-type  -fdiagnostics-color=always -O2 
> -DNDEBUG -ftree-vectorize  -Wa,-mbig-obj -Wall -Wno-conversion 
> -Wno-sign-conversion -Wunused-result -fno-semantic-interposition -mxsave 
> -msse4.2 
> {code}
> Perhaps we can pass {{-O0}}?
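For illustration, one way this could be tried — assuming the MinGW CI job invokes CMake directly; the flag-override variables are standard CMake, but whether the CI wrapper exposes them is an assumption:

```shell
# Sketch: keep a Release-style MinGW build but drop optimization to -O0,
# overriding only the per-config flags instead of switching to a debug build.
cmake -S cpp -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_FLAGS_RELEASE="-O0 -DNDEBUG" \
  -DCMAKE_CXX_FLAGS_RELEASE="-O0 -DNDEBUG"
```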





[jira] [Resolved] (ARROW-17836) [C++] Allow specifying of alignment in MemoryPool's allocations

2022-11-25 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-17836.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14225
[https://github.com/apache/arrow/pull/14225]

> [C++] Allow specifying of alignment in MemoryPool's allocations 
> 
>
> Key: ARROW-17836
> URL: https://issues.apache.org/jira/browse/ARROW-17836
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Sasha Krassovsky
>Assignee: Sasha Krassovsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> For spilling, I need to create buffers that are 512-byte aligned. The task is 
> to augment MemoryPool to allow for specifying alignment explicitly when 
> allocating (but keep the default the same).





[jira] [Updated] (ARROW-18400) [Python] Quadratic memory usage of Table.to_pandas with nested data

2022-11-24 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18400:
---
Fix Version/s: 11.0.0

> [Python] Quadratic memory usage of Table.to_pandas with nested data
> ---
>
> Key: ARROW-18400
> URL: https://issues.apache.org/jira/browse/ARROW-18400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.1
> Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X 
> with 64 GB RAM
>Reporter: Adam Reeve
>Priority: Critical
> Fix For: 11.0.0
>
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame 
> shows quadratic memory usage and will eventually run out of memory for 
> reasonably small files. I had initially thought this was a regression since 
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks 
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
> _characters = string.ascii_uppercase + string.digits + string.punctuation
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>             [{
>                 'a': None if i % 1000 == 0 else np.random.choice(1, 
> size=3).astype(np.int64),
>                 'b': None if i % 100 == 0 else random.choice(range(100)),
>                 'c': None if i % 10 == 0 else make_random_string(5)
>             } for i in range(arr_len)]
>         ))
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem, it's the to_pandas method 
> that exhibits the large memory usage. I haven't tested generating nested 
> Arrow data in memory without writing Parquet from Pandas but I assume the 
> problem probably isn't Parquet specific.
> Memory usage I see when reading different sized files on a machine with 64 GB 
> RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when row count 
> doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but 
> then quadruples from 1024k to 2048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something 
> changed between 7.0.0 and 8.0.0.





[jira] [Commented] (ARROW-18400) [Python] Quadratic memory usage of Table.to_pandas with nested data

2022-11-24 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17638234#comment-17638234
 ] 

Antoine Pitrou commented on ARROW-18400:


[~alenka] [~milesgranger] This seems like something we'd like to fix.

> [Python] Quadratic memory usage of Table.to_pandas with nested data
> ---
>
> Key: ARROW-18400
> URL: https://issues.apache.org/jira/browse/ARROW-18400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.1
> Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X 
> with 64 GB RAM
>Reporter: Adam Reeve
>Priority: Critical
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame 
> shows quadratic memory usage and will eventually run out of memory for 
> reasonably small files. I had initially thought this was a regression since 
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks 
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
> _characters = string.ascii_uppercase + string.digits + string.punctuation
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>             [{
>                 'a': None if i % 1000 == 0 else np.random.choice(1, 
> size=3).astype(np.int64),
>                 'b': None if i % 100 == 0 else random.choice(range(100)),
>                 'c': None if i % 10 == 0 else make_random_string(5)
>             } for i in range(arr_len)]
>         ))
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem, it's the to_pandas method 
> that exhibits the large memory usage. I haven't tested generating nested 
> Arrow data in memory without writing Parquet from Pandas but I assume the 
> problem probably isn't Parquet specific.
> Memory usage I see when reading different sized files on a machine with 64 GB 
> RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when row count 
> doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but 
> then quadruples from 1024k to 2048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something 
> changed between 7.0.0 and 8.0.0.





[jira] [Updated] (ARROW-18400) Quadratic memory usage of Table.to_pandas with nested data

2022-11-24 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18400:
---
Priority: Critical  (was: Major)

> Quadratic memory usage of Table.to_pandas with nested data
> --
>
> Key: ARROW-18400
> URL: https://issues.apache.org/jira/browse/ARROW-18400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.1
> Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X 
> with 64 GB RAM
>Reporter: Adam Reeve
>Priority: Critical
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame 
> shows quadratic memory usage and will eventually run out of memory for 
> reasonably small files. I had initially thought this was a regression since 
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks 
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
> _characters = string.ascii_uppercase + string.digits + string.punctuation
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>             [{
>                 'a': None if i % 1000 == 0 else np.random.choice(1, 
> size=3).astype(np.int64),
>                 'b': None if i % 100 == 0 else random.choice(range(100)),
>                 'c': None if i % 10 == 0 else make_random_string(5)
>             } for i in range(arr_len)]
>         ))
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem, it's the to_pandas method 
> that exhibits the large memory usage. I haven't tested generating nested 
> Arrow data in memory without writing Parquet from Pandas but I assume the 
> problem probably isn't Parquet specific.
> Memory usage I see when reading different sized files on a machine with 64 GB 
> RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when row count 
> doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but 
> then quadruples from 1024k to 2048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something 
> changed between 7.0.0 and 8.0.0.
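Peak-memory figures like the table above can be collected with a stdlib-only helper along these lines (a sketch, not the reporter's actual measurement method; note that `ru_maxrss` is kilobytes on Linux but bytes on macOS, so Linux is assumed):

```python
import resource

def peak_rss_mb() -> float:
    # Peak resident set size of the current process, in MB.
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

before = peak_rss_mb()
buf = bytearray(50 * 1024 * 1024)  # stand-in for table.to_pandas()
after = peak_rss_mb()
print(f"peak RSS grew by ~{after - before:.0f} MB")
```

In the issue's reproduction, one would record `peak_rss_mb()` before and after the `table.to_pandas()` call for each row count.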



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18400) [Python] Quadratic memory usage of Table.to_pandas with nested data

2022-11-24 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18400:
---
Summary: [Python] Quadratic memory usage of Table.to_pandas with nested 
data  (was: Quadratic memory usage of Table.to_pandas with nested data)

> [Python] Quadratic memory usage of Table.to_pandas with nested data
> ---
>
> Key: ARROW-18400
> URL: https://issues.apache.org/jira/browse/ARROW-18400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.1
> Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X 
> with 64 GB RAM
>Reporter: Adam Reeve
>Priority: Critical
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame 
> shows quadratic memory usage and will eventually run out of memory for 
> reasonably small files. I had initially thought this was a regression since 
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks 
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
> _characters = string.ascii_uppercase + string.digits + string.punctuation
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>             [{
>                 'a': None if i % 1000 == 0 else np.random.choice(1, 
> size=3).astype(np.int64),
>                 'b': None if i % 100 == 0 else random.choice(range(100)),
>                 'c': None if i % 10 == 0 else make_random_string(5)
>             } for i in range(arr_len)]
>         ))
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem, it's the to_pandas method 
> that exhibits the large memory usage. I haven't tested generating nested 
> Arrow data in memory without writing Parquet from Pandas but I assume the 
> problem probably isn't Parquet specific.
> Memory usage I see when reading different sized files on a machine with 64 GB 
> RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when row count 
> doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but 
> then quadruples from 1024k to 2048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something 
> changed between 7.0.0 and 8.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17859) [C++] Use self-pipe in signal-receiving StopSource

2022-11-23 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-17859.

Resolution: Fixed

Issue resolved by pull request 14250
[https://github.com/apache/arrow/pull/14250]

> [C++] Use self-pipe in signal-receiving StopSource
> --
>
> Key: ARROW-17859
> URL: https://issues.apache.org/jira/browse/ARROW-17859
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 9h
>  Remaining Estimate: 0h
>
> The signal-receiving StopSource currently uses elaborate hacks to request the 
> StopSource from a signal handler. Instead we should just use a SelfPipe and 
> send signals to a worker thread.
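For illustration, the self-pipe trick adopted by the fix can be sketched in Python, where `signal.set_wakeup_fd` performs the async-signal-safe pipe write and a worker thread does the actual stop-request work (this models the idea only, not Arrow's C++ implementation):

```python
import os
import signal
import threading

def install_selfpipe_handler(signum, on_signal):
    """Self-pipe trick: the signal handler only writes a byte to a pipe;
    a worker thread blocked on the read end does the real work."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)          # required by set_wakeup_fd
    signal.signal(signum, lambda *_: None)    # install a Python-level handler
    signal.set_wakeup_fd(write_fd)            # C-level handler writes one byte
    def worker():
        os.read(read_fd, 1)                   # blocks until a signal arrives
        on_signal()                           # e.g. request stop on a StopSource
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t
```

The key property is that nothing elaborate runs in signal-handler context; only the pipe write is async-signal-safe, and the worker thread runs ordinary code.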



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18399) [Python] Reduce warnings during tests

2022-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-18399:
--

 Summary: [Python] Reduce warnings during tests
 Key: ARROW-18399
 URL: https://issues.apache.org/jira/browse/ARROW-18399
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: Antoine Pitrou


Numerous warnings are displayed at the end of a test run; we should strive to 
reduce them:
https://github.com/apache/arrow/actions/runs/3533792571/jobs/5929880345#step:6:5489




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18399) [Python] Reduce warnings during tests

2022-11-23 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637891#comment-17637891
 ] 

Antoine Pitrou commented on ARROW-18399:


cc [~milesgranger]

> [Python] Reduce warnings during tests
> -
>
> Key: ARROW-18399
> URL: https://issues.apache.org/jira/browse/ARROW-18399
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Minor
>
> Numerous warnings are displayed at the end of a test run; we should strive 
> to reduce them:
> https://github.com/apache/arrow/actions/runs/3533792571/jobs/5929880345#step:6:5489
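One common way to chip away at such warning noise (an illustrative sketch, not pyarrow's actual test configuration) is to escalate unexpected warnings to errors while explicitly ignoring a vetted list:

```python
import warnings

def configure_test_warnings():
    # Any warning not explicitly vetted below becomes an error...
    warnings.simplefilter("error")
    # ...while known, accepted warnings are ignored one by one.
    warnings.filterwarnings(
        "ignore", message=r".*deprecated.*", category=DeprecationWarning)
```

This makes new warnings fail loudly in CI instead of accumulating silently at the end of the run.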



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18392) [CI][Python] Some nightly python tests fail due to ACCESS DENIED to S3 bucket

2022-11-23 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18392.

Resolution: Fixed

Issue resolved by pull request 14716
[https://github.com/apache/arrow/pull/14716]

> [CI][Python] Some nightly python tests fail due to ACCESS DENIED to S3 bucket 
> --
>
> Key: ARROW-18392
> URL: https://issues.apache.org/jira/browse/ARROW-18392
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Assignee: Miles Granger
>Priority: Critical
>  Labels: Nightly, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Several nightly tests fail with:
> {code:java}
>  === FAILURES 
> ===
>  test_s3fs_wrong_region 
>     @pytest.mark.s3
>     def test_s3fs_wrong_region():
>         from pyarrow.fs import S3FileSystem
>     
>         # wrong region for bucket
>         fs = S3FileSystem(region='eu-north-1')
>     
>         msg = ("When getting information for bucket 
> 'voltrondata-labs-datasets': "
>                r"AWS Error UNKNOWN \(HTTP status 301\) during HeadBucket "
>                "operation: No response body. Looks like the configured region 
> is "
>                "'eu-north-1' while the bucket is located in 'us-east-2'."
>                "|NETWORK_CONNECTION")
>         with pytest.raises(OSError, match=msg) as exc:
>             fs.get_file_info("voltrondata-labs-datasets")
>     
>         # Sometimes fails on unrelated network error, so next call would also 
> fail.
>         if 'NETWORK_CONNECTION' in str(exc.value):
>             return
>     
>         fs = S3FileSystem(region='us-east-2')
> >       
> > fs.get_file_info("voltrondata-labs-datasets")opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_fs.py:1339:
> >  
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> pyarrow/_fs.pyx:571: in pyarrow._fs.FileSystem.get_file_info
>     ???
> pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
>     ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ >   ???
> E   OSError: When getting information for bucket 'voltrondata-labs-datasets': 
> AWS Error ACCESS_DENIED during HeadBucket operation: No response body. {code}
> I can't seem to reproduce this locally, but it is pretty consistent:
>  * 
> [test-conda-python-3.10|https://github.com/ursacomputing/crossbow/actions/runs/3528202639/jobs/5918051269]
>  * 
> [test-conda-python-3.11|https://github.com/ursacomputing/crossbow/actions/runs/3528201175/jobs/5918048135]
>  * 
> [test-conda-python-3.7|https://github.com/ursacomputing/crossbow/actions/runs/3528195566/jobs/5918035812]
>  * 
> [test-conda-python-3.7-pandas-latest|https://github.com/ursacomputing/crossbow/actions/runs/3528211334/jobs/5918069152]
>  * 
> [test-conda-python-3.8|https://github.com/ursacomputing/crossbow/actions/runs/3528193702/jobs/5918032370]
>  * 
> [test-conda-python-3.8-pandas-latest|https://github.com/ursacomputing/crossbow/actions/runs/3528213536/jobs/5918073481]
>  * 
> [test-conda-python-3.8-pandas-nightly|https://github.com/ursacomputing/crossbow/actions/runs/3528205157/jobs/5918056277]
>  * 
> [test-conda-python-3.9|https://github.com/ursacomputing/crossbow/actions/runs/3528202402/jobs/5918050613]
>  * 
> [test-conda-python-3.9-pandas-upstream_devel|https://github.com/ursacomputing/crossbow/actions/runs/3528210560/jobs/5918067302]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18398) [C++] Sporadic error in StressSourceGroupedSumStop

2022-11-23 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637880#comment-17637880
 ] 

Antoine Pitrou commented on ARROW-18398:


cc [~westonpace]

> [C++] Sporadic error in StressSourceGroupedSumStop
> --
>
> Key: ARROW-18398
> URL: https://issues.apache.org/jira/browse/ARROW-18398
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> I just saw this occasional failure:
> https://github.com/apache/arrow/actions/runs/3533672097/jobs/5929601817#step:11:294
> {code}
> [ RUN  ] ExecPlanExecution.StressSourceGroupedSumStop
> D:/a/arrow/arrow/cpp/src/arrow/compute/exec/plan_test.cc:850: Failure
> Value of: _fut.Wait(::arrow::kDefaultAssertFinishesWaitSeconds)
>   Actual: false
> Expected: true
> Google Test trace:
> D:/a/arrow/arrow/cpp/src/arrow/compute/exec/plan_test.cc:825: parallel
> D:/a/arrow/arrow/cpp/src/arrow/compute/exec/plan_test.cc:822: unslowed
> D:/a/arrow/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:60: Plan was 
> destroyed before finishing
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18398) [C++] Sporadic error in StressSourceGroupedSumStop

2022-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-18398:
--

 Summary: [C++] Sporadic error in StressSourceGroupedSumStop
 Key: ARROW-18398
 URL: https://issues.apache.org/jira/browse/ARROW-18398
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


I just saw this occasional failure:
https://github.com/apache/arrow/actions/runs/3533672097/jobs/5929601817#step:11:294

{code}
[ RUN  ] ExecPlanExecution.StressSourceGroupedSumStop
D:/a/arrow/arrow/cpp/src/arrow/compute/exec/plan_test.cc:850: Failure
Value of: _fut.Wait(::arrow::kDefaultAssertFinishesWaitSeconds)
  Actual: false
Expected: true
Google Test trace:
D:/a/arrow/arrow/cpp/src/arrow/compute/exec/plan_test.cc:825: parallel
D:/a/arrow/arrow/cpp/src/arrow/compute/exec/plan_test.cc:822: unslowed
D:/a/arrow/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:60: Plan was destroyed 
before finishing
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18395) [C++] Move select-k implementation into separate module

2022-11-23 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18395:
---
Labels: good-second-issue  (was: )

> [C++] Move select-k implementation into separate module
> ---
>
> Key: ARROW-18395
> URL: https://issues.apache.org/jira/browse/ARROW-18395
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>  Labels: good-second-issue
>
> The select-k kernel implementations are currently in {{vector_sort.cc}}, 
> amongst other things.
> To make the code more readable and faster to compile, we should move them 
> into their own file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18396) [C++] Move rank implementation into separate module

2022-11-23 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18396:
---
Labels: good-second-issue  (was: )

> [C++] Move rank implementation into separate module
> ---
>
> Key: ARROW-18396
> URL: https://issues.apache.org/jira/browse/ARROW-18396
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>  Labels: good-second-issue
>
> The rank kernel implementations are currently in {{vector_sort.cc}}, amongst 
> other things.
> To make the code more readable and faster to compile, we should move them 
> into their own file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18396) [C++] Move rank implementation into separate module

2022-11-23 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637870#comment-17637870
 ] 

Antoine Pitrou commented on ARROW-18396:


cc [~benpharkins] if/when you have time for a reasonably small task.

> [C++] Move rank implementation into separate module
> ---
>
> Key: ARROW-18396
> URL: https://issues.apache.org/jira/browse/ARROW-18396
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>
> The rank kernel implementations are currently in {{vector_sort.cc}}, amongst 
> other things.
> To make the code more readable and faster to compile, we should move them 
> into their own file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18395) [C++] Move select-k implementation into separate module

2022-11-23 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637869#comment-17637869
 ] 

Antoine Pitrou commented on ARROW-18395:


cc [~benpharkins] if/when you have time for a reasonably small task.

> [C++] Move select-k implementation into separate module
> ---
>
> Key: ARROW-18395
> URL: https://issues.apache.org/jira/browse/ARROW-18395
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>
> The select-k kernel implementations are currently in {{vector_sort.cc}}, 
> amongst other things.
> To make the code more readable and faster to compile, we should move them 
> into their own file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18383) [C++] Avoid global variables for thread pools and at-fork handlers

2022-11-23 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18383.

Resolution: Fixed

Issue resolved by pull request 14704
[https://github.com/apache/arrow/pull/14704]

> [C++] Avoid global variables for thread pools and at-fork handlers
> --
>
> Key: ARROW-18383
> URL: https://issues.apache.org/jira/browse/ARROW-18383
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Investigation revealed an issue where the global IO thread pool could be 
> constructed before the at-fork handler internal state. The IO thread pool, 
> created on library load, would register an at-fork handler; then, the at-fork 
> handler state would be initialized and clobber the handler registered just 
> before.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18397) [C++] Clear S3 region resolver client at S3 shutdown

2022-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-18397:
--

 Summary: [C++] Clear S3 region resolver client at S3 shutdown
 Key: ARROW-18397
 URL: https://issues.apache.org/jira/browse/ARROW-18397
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 10.0.2, 11.0.0


The S3 region resolver caches an S3 client at module scope. This client can be 
destroyed very late and trigger an assertion error in the AWS SDK because the 
SDK has already been shut down:
https://github.com/aws/aws-sdk-cpp/issues/2204

When explicitly finalizing S3, we should ensure we also destroy the cached S3 
client.
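The shape of the fix can be sketched in Python terms: a lazily created module-scope cache paired with an explicit finalizer that drops the cached object before the underlying SDK is torn down (all names here are illustrative, not Arrow's actual API):

```python
_cached_client = None  # module-scope cache, created lazily

def get_resolver_client():
    # Hand out one shared client, creating it on first use.
    global _cached_client
    if _cached_client is None:
        _cached_client = object()  # stand-in for a real S3 client
    return _cached_client

def finalize_s3():
    # Drop the cached client *before* SDK shutdown, so its destructor
    # cannot run after the SDK has been torn down.
    global _cached_client
    _cached_client = None
```

The essential point is that finalization is explicit and ordered, rather than left to process-exit destruction of globals.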



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18350) [C++] Use std::to_chars instead of std::to_string

2022-11-23 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18350.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14666
[https://github.com/apache/arrow/pull/14666]

> [C++] Use std::to_chars instead of std::to_string
> -
>
> Key: ARROW-18350
> URL: https://issues.apache.org/jira/browse/ARROW-18350
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> {{std::to_chars}} is locale-independent unlike {{std::to_string}}; it may 
> also be faster in some cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18396) [C++] Move rank implementation into separate module

2022-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-18396:
--

 Summary: [C++] Move rank implementation into separate module
 Key: ARROW-18396
 URL: https://issues.apache.org/jira/browse/ARROW-18396
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou


The rank kernel implementations are currently in {{vector_sort.cc}}, amongst 
other things.
To make the code more readable and faster to compile, we should move them into 
their own file.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18395) [C++] Move select-k implementation into separate module

2022-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-18395:
--

 Summary: [C++] Move select-k implementation into separate module
 Key: ARROW-18395
 URL: https://issues.apache.org/jira/browse/ARROW-18395
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou


The select-k kernel implementations are currently in {{vector_sort.cc}}, 
amongst other things.
To make the code more readable and faster to compile, we should move them into 
their own file.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18381) MIGRATION: Create milestones for every needed fix version

2022-11-23 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637624#comment-17637624
 ] 

Antoine Pitrou commented on ARROW-18381:


bq. This means that 314 legacy issues will lose metadata associating with at 
least one version that it was associated with in Jira. Is this acceptable? 
Should the lowest or highest associated Jira version be used during import, if 
so?

Well, it can't be avoided, so I guess we'll have to accept it :-) 
As for which version should be used, I would say the lowest version - which is 
generally a bugfix version released earlier than the following feature version.

cc [~kou] for advice.

> MIGRATION: Create milestones for every needed fix version
> -
>
> Key: ARROW-18381
> URL: https://issues.apache.org/jira/browse/ARROW-18381
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Priority: Major
> Attachments: Screenshot from 2022-11-22 11-53-07.png, Screenshot from 
> 2022-11-22 11-54-26.png
>
>
> The Apache Arrow projects uses the "Fix version" field in ASF Jira issue to 
> track the version in which issues were resolved/fixed/implemented. The most 
> equivalent field in GitHub issues is the "milestone" field. This field is 
> explicitly managed - the versions need to be added to the repository 
> configuration before they can be used. This mapping needs to be established 
> as a prerequisite for completing the import from ASF Jira.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18383) [C++] Avoid global variables for thread pools and at-fork handlers

2022-11-22 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-18383:
--

 Summary: [C++] Avoid global variables for thread pools and at-fork 
handlers
 Key: ARROW-18383
 URL: https://issues.apache.org/jira/browse/ARROW-18383
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 11.0.0


Investigation revealed an issue where the global IO thread pool could be 
constructed before the at-fork handler internal state. The IO thread pool, 
created on library load, would register an at-fork handler; then, the at-fork 
handler state would be initialized and clobber the handler registered just 
before.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18382) [C++] "ADDRESS_SANITIZER" not defined in fuzzing builds

2022-11-22 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-18382:
--

 Summary: [C++] "ADDRESS_SANITIZER" not defined in fuzzing builds
 Key: ARROW-18382
 URL: https://issues.apache.org/jira/browse/ARROW-18382
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Fuzzing builds (as run by OSS-Fuzz) enable Address Sanitizer through their own 
set of options rather than by enabling {{ARROW_USE_ASAN}}. However, the Arrow 
source code needs to be informed of this situation.

One example of where this matters is that eternal thread pools produce spurious 
leaks at shutdown because of the vector of at-fork handlers; this therefore 
needs to be worked around on those builds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-4709) [C++] Optimize for ordered JSON fields

2022-11-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-4709.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14100
[https://github.com/apache/arrow/pull/14100]

> [C++] Optimize for ordered JSON fields
> --
>
> Key: ARROW-4709
> URL: https://issues.apache.org/jira/browse/ARROW-4709
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Harkins
>Priority: Minor
>  Labels: good-second-issue, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Fields appear consistently ordered in most JSON data in the wild, but the 
> JSON parser currently looks fields up in a hash table. The ordering can 
> probably be exploited to yield better performance when looking up field 
> indices
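The idea can be sketched as a lookup that first guesses the position a field held in the previous row, falling back to the hash table on a miss (illustrative Python, not the parser's actual implementation):

```python
class OrderedFieldLookup:
    """Exploit stable field order: try the position after the previously
    matched field before falling back to a hash-table lookup."""
    def __init__(self, names):
        self.names = list(names)
        self.index = {n: i for i, n in enumerate(self.names)}
        self.next_guess = 0  # where the next field is expected

    def lookup(self, name):
        i = self.next_guess
        if i < len(self.names) and self.names[i] == name:
            # Fast path: field appeared where the previous row predicted.
            self.next_guess = (i + 1) % len(self.names)
            return i
        # Slow path: out-of-order field, resolve via the hash table.
        j = self.index[name]
        self.next_guess = (j + 1) % len(self.names)
        return j
```

When fields arrive in the same order on every row, every lookup after the first takes the comparison-only fast path.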



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-13677) [C++] Improve performance of unpack64

2022-11-22 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637299#comment-17637299
 ] 

Antoine Pitrou commented on ARROW-13677:


[~benpharkins] This could be a reasonable undertaking, assuming you're 
interested in looking at this kind of micro-optimization.

> [C++] Improve performance of unpack64
> -
>
> Key: ARROW-13677
> URL: https://issues.apache.org/jira/browse/ARROW-13677
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> unpack32 benefits from auto-generated SIMD optimizations, but unpack64 
> doesn't. The latter is used by Parquet for DELTA_BINARY_PACKED encoding.
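For reference, the scalar operation being optimized is reading fixed-width little-endian values back out of a packed bit stream; a plain-Python model of what unpack32/unpack64 compute (not the SIMD code) looks like:

```python
def unpack(packed: bytes, bit_width: int, count: int) -> list:
    # Read `count` consecutive `bit_width`-bit little-endian integers
    # from the packed buffer, lowest-order bits first.
    value = int.from_bytes(packed, "little")
    mask = (1 << bit_width) - 1
    return [(value >> (i * bit_width)) & mask for i in range(count)]
```

The SIMD versions produce the same output but shuffle and shift many values per instruction instead of one at a time.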



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17985) [Python][C++] Opaque error code ([code: 100]), when not setting region

2022-11-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-17985.

Resolution: Fixed

Issue resolved by pull request 14601
[https://github.com/apache/arrow/pull/14601]

> [Python][C++] Opaque error code ([code: 100]), when not setting region
> --
>
> Key: ARROW-17985
> URL: https://issues.apache.org/jira/browse/ARROW-17985
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Vedant Roy
>Assignee: Miles Granger
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> A few odd things are going on with the Python bindings:
>  # Statefulness. I ran the following code:
> {code:java}
> import os
> import pyarrow.fs as arrow_fs
> def fs_():
>     s3_fs = arrow_fs.S3FileSystem(
>         access_key="",
>         secret_key="",
>         endpoint_override="",
>     )
>     return s3_fs
> fs = fs_()
> print(fs.get_file_info("data"))
> {code}
> and it worked on one machine but not the other. Only setting
> {code:java}
> region="auto"
> {code}
>  allowed the code to work consistently on both computers.
> Furthermore, the error message is very opaque:
> {code:java}
> Traceback (most recent call last):
>   File "cluster_scripts/test_s3.py", line 51, in 
> print(fs.get_file_info("data"))
>   File "pyarrow/_fs.pyx", line 439, in pyarrow._fs.FileSystem.get_file_info
>   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> OSError: When getting information for bucket 'data': AWS Error [code 100]: No 
> response body.
> {code}
> Googling this error gives no information whatsoever. I managed to figure out 
> the issue by switching from Cloudflare to S3, and when the issue was still 
> going on, I explicitly set a region, but the experience was pretty painful.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18343) [C++] AllocateBitmap() with out parameter is declared but not defined

2022-11-21 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18343.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14657
[https://github.com/apache/arrow/pull/14657]

> [C++] AllocateBitmap() with out parameter is declared but not defined
> -
>
> Key: ARROW-18343
> URL: https://issues.apache.org/jira/browse/ARROW-18343
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jin Shang
>Assignee: Jin Shang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> [This variant of 
> AllocateBitmap|https://github.com/apache/arrow/blob/master/cpp/src/arrow/buffer.h#L483]
>  is declared but not defined in buffer.cc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18371) [C++] Expose *FromJSON helpers

2022-11-21 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636794#comment-17636794
 ] 

Antoine Pitrou commented on ARROW-18371:


> I assume the comment is regarding BatchesWithSchema and MakeBasicBatches.

Yes, this is what I meant. Sorry for the imprecision.

> [C++] Expose *FromJSON helpers
> --
>
> Key: ARROW-18371
> URL: https://issues.apache.org/jira/browse/ARROW-18371
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: testing
>
> {{ArrayFromJSON}}, {{ExecBatchFromJSON}} and {{RecordBatchFromJSON}} helper 
> functions would be useful when testing in projects that use Arrow. 
> BatchesWithSchema and MakeBasicBatches could be considered as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18371) [C++] Expose *FromJSON helpers

2022-11-21 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636747#comment-17636747
 ] 

Antoine Pitrou commented on ARROW-18371:


Definitely not. These are functions generating ad hoc data tailored for 
specific tests, with little consistency.
We could expose the Random generation class, though, possibly together with 
some API cleanup.

> [C++] Expose *FromJSON helpers
> --
>
> Key: ARROW-18371
> URL: https://issues.apache.org/jira/browse/ARROW-18371
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: testing
>
> {{ArrayFromJSON}}, {{ExecBatchFromJSON}} and {{RecordBatchFromJSON}} helper 
> functions would be useful when testing in projects that use Arrow. 
> BatchesWithSchema and MakeBasicBatches could be considered as well.
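What such a FromJSON-style test helper does can be modeled in a few lines of stdlib Python: parse a JSON array of row objects and pivot it into columns (a toy analogue; the real C++ helpers build typed Arrow arrays from the same kind of input):

```python
import json

def columns_from_json(field_names, json_text):
    # Parse a JSON array of row objects and pivot into columnar lists,
    # with missing fields becoming nulls (None).
    rows = json.loads(json_text)
    return {name: [row.get(name) for row in rows] for name in field_names}
```

This is why such helpers are popular in tests: small literal JSON strings are far easier to read in a test body than hand-built builders.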



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18362) [C++] Accelerate Parquet bit-packing decoding with ICX AVX-512

2022-11-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18362:
---
Summary: [C++] Accelerate Parquet bit-packing decoding with ICX AVX-512  
(was: Accelerate Parquet bit-packing decoding with ICX AVX-512)

> [C++] Accelerate Parquet bit-packing decoding with ICX AVX-512
> --
>
> Key: ARROW-18362
> URL: https://issues.apache.org/jira/browse/ARROW-18362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: zhaoyaqi
>Priority: Major
>
> h1. Accelerate Parquet bit-packing decoding with ICX AVX-512 instructions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18362) [C++] Accelerate Parquet bit-packing decoding with AVX-512

2022-11-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18362:
---
Summary: [C++] Accelerate Parquet bit-packing decoding with AVX-512  (was: 
[C++] Accelerate Parquet bit-packing decoding with ICX AVX-512)

> [C++] Accelerate Parquet bit-packing decoding with AVX-512
> --
>
> Key: ARROW-18362
> URL: https://issues.apache.org/jira/browse/ARROW-18362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: zhaoyaqi
>Priority: Major
>
> h1. Accelerate Parquet bit-packing decoding with ICX AVX-512 instructions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18362) [C++] Accelerate Parquet bit-packing decoding with AVX-512

2022-11-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18362:
---
Description: Accelerate Parquet bit-packing decoding with AVX-512 
instructions?  (was: h1. Accelerate Parquet bit-packing decoding with ICX 
AVX-512 instructions)

> [C++] Accelerate Parquet bit-packing decoding with AVX-512
> --
>
> Key: ARROW-18362
> URL: https://issues.apache.org/jira/browse/ARROW-18362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: zhaoyaqi
>Priority: Major
>
> Accelerate Parquet bit-packing decoding with AVX-512 instructions?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18362) Accelerate Parquet bit-packing decoding with ICX AVX-512

2022-11-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635863#comment-17635863
 ] 

Antoine Pitrou commented on ARROW-18362:


Are you willing to contribute this?

> Accelerate Parquet bit-packing decoding with ICX AVX-512
> 
>
> Key: ARROW-18362
> URL: https://issues.apache.org/jira/browse/ARROW-18362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: zhaoyaqi
>Priority: Major
>
> h1. Accelerate Parquet bit-packing decoding with ICX AVX-512 instructions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18353) [C++][Flight] Sporadic hang in UCX tests

2022-11-17 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-18353:
--

 Summary: [C++][Flight] Sporadic hang in UCX tests
 Key: ARROW-18353
 URL: https://issues.apache.org/jira/browse/ARROW-18353
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, FlightRPC
Reporter: Antoine Pitrou


The UCX tests sometimes hang here.

Full gdb backtraces for all threads:
{code}

Thread 8 (Thread 0x7f4562fcd700 (LWP 76837)):
#0  0x7f4577b72ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x564ebe5b5b3c)
at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x564ebe5b5ae0, 
cond=0x564ebe5b5b10) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x564ebe5b5b10, mutex=0x564ebe5b5ae0) at 
pthread_cond_wait.c:655
#3  0x7f457b4ce7cb in std::condition_variable::wait<...>(std::unique_lock<std::mutex> &, struct {...}) (this=0x564ebe5b5b10, __lock=..., __p=...)
    at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/condition_variable:111
#4  0x7f457b4c7b5e in arrow::flight::transport::ucx::(anonymous 
namespace)::WriteClientStream::WritesDone (this=0x564ebe5b5a90)
at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_client.cc:277
#5  0x7f457b4cc989 in arrow::flight::transport::ucx::(anonymous 
namespace)::UcxClientStream::DoFinish (this=0x564ebe5b5a90)
at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_client.cc:692
#6  0x7f457af80e04 in arrow::flight::internal::ClientDataStream::Finish 
(this=0x564ebe5b5a90, st=...) at /arrow/cpp/src/arrow/flight/transport.cc:46
#7  0x7f457af4f6e1 in arrow::flight::ClientMetadataReader::ReadMetadata 
(this=0x564ebe560630, out=0x7f4562fcc170)
at /arrow/cpp/src/arrow/flight/client.cc:263
#8  0x7f457b593af6 in operator() (__closure=0x564ebe4e4848) at 
/arrow/cpp/src/arrow/flight/test_definitions.cc:1538
#9  0x7f457b5b66b8 in std::__invoke_impl<...>(std::__invoke_other, struct {...} &&) (__f=...)
    at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/bits/invoke.h:60
#10 0x7f457b5b6529 in std::__invoke<...>(struct {...} &&) (__fn=...)
    at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/bits/invoke.h:95
#11 0x7f457b5b63c4 in std::thread::_Invoker<std::tuple<...> >::_M_invoke<0>(std::_Index_tuple<0>) (this=0x564ebe4e4848)
    at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/thread:264
#12 0x7f457b5b6224 in std::thread::_Invoker<std::tuple<...> >::operator()(void) (this=0x564ebe4e4848)
    at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/thread:271
#13 0x7f457b5b5e1e in std::thread::_State_impl<std::thread::_Invoker<std::tuple<...> > >::_M_run(void) (this=0x564ebe4e4840)
    at /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/thread:215
#14 0x7f4578242a93 in std::execute_native_thread_routine (__p=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516830325/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/new_allocator.h:82
#15 0x7f4577b6c6db in start_thread (arg=0x7f4562fcd700) at 
pthread_create.c:463
#16 0x7f4577ea561f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 7 (Thread 0x7f45725ca700 (LWP 76828)):
#0  0x7f4577ea5947 in epoll_wait (epfd=36, 
events=events@entry=0x7f45725c86c0, maxevents=16, timeout=timeout@entry=0)
at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x7f45779fe3e3 in ucs_event_set_wait (event_set=0x7f4564026240, 
num_events=num_events@entry=0x7f45725c8804, timeout_ms=timeout_ms@entry=0, 
event_set_handler=event_set_handler@entry=0x7f4575d29320 
, arg=arg@entry=0x7f45725c8800) at 
sys/event_set.c:198
#2  0x7f4575d29283 in uct_tcp_iface_progress (tl_iface=0x7f4564026900) at 
tcp/tcp_iface.c:327
#3  0x7f4577a7de22 in ucs_callbackq_dispatch (cbq=<optimized out>) at 
/usr/local/src/conda/ucx-1.13.1/src/ucs/datastruct/callbackq.h:211
#4  uct_worker_progress (worker=<optimized out>) at 
/usr/local/src/conda/ucx-1.13.1/src/uct/api/uct.h:2638
#5  ucp_worker_progress (worker=0x7f4564000c80) at core/ucp_worker.c:2782
#6  0x7f457b4f186f in 
arrow::flight::transport::ucx::UcpCallDriver::Impl::MakeProgress 
(this=0x7f456404d3b0)
at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:759
#7  0x7f457b4eee40 in 
arrow::flight::transport::ucx::UcpCallDriver::Impl::ReadNextFrame 
(this=0x7f456404d3b0)
at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:449
#8  0x7f457b4f3661 in 
arrow::flight::transport::ucx::UcpCallDriver::ReadNextFrame 
(this=0x7f456c0016d0)
at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:1037
#9  0x7f457b4d8c43 in arrow::flight::transport::ucx::(anonymous 
namespace)::PutServerStream::ReadImpl (this=0x7f45725c8b60, data=0x7f45725c8af0)
at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_server.cc:153
#10 0x7f457b4d8525 in arrow::flight::transport::ucx::(anonymous 
namespace)::PutServerStream::ReadData 

[jira] [Commented] (ARROW-18351) [C++][Flight] Crash in UcxErrorHandlingTest.TestDoExchange

2022-11-17 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635483#comment-17635483
 ] 

Antoine Pitrou commented on ARROW-18351:


cc [~lidavidm] [~yibocai]

> [C++][Flight] Crash in UcxErrorHandlingTest.TestDoExchange
> --
>
> Key: ARROW-18351
> URL: https://issues.apache.org/jira/browse/ARROW-18351
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Reporter: Antoine Pitrou
>Priority: Major
>
> I get a non-deterministic crash in the Flight UCX tests.
> {code}
> [--] 3 tests from UcxErrorHandlingTest
> [ RUN  ] UcxErrorHandlingTest.TestGetFlightInfo
> [   OK ] UcxErrorHandlingTest.TestGetFlightInfo (24 ms)
> [ RUN  ] UcxErrorHandlingTest.TestDoPut
> [   OK ] UcxErrorHandlingTest.TestDoPut (15 ms)
> [ RUN  ] UcxErrorHandlingTest.TestDoExchange
> /arrow/cpp/src/arrow/util/future.cc:125:  Check failed: 
> !IsFutureFinished(state_) Future already marked finished
> {code}
> Here is the GDB backtrace:
> {code}
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> #1  0x7f18c49cd7f1 in __GI_abort () at abort.c:79
> #2  0x7f18c5854e00 in arrow::util::CerrLog::~CerrLog 
> (this=0x7f18a81607b0, __in_chrg=<optimized out>) at 
> /arrow/cpp/src/arrow/util/logging.cc:72
> #3  0x7f18c5854e1c in arrow::util::CerrLog::~CerrLog 
> (this=0x7f18a81607b0, __in_chrg=<optimized out>) at 
> /arrow/cpp/src/arrow/util/logging.cc:74
> #4  0x7f18c5855181 in arrow::util::ArrowLog::~ArrowLog 
> (this=0x7f18c07fc380, __in_chrg=<optimized out>) at 
> /arrow/cpp/src/arrow/util/logging.cc:250
> #5  0x7f18c5826f86 in arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed 
> (this=0x7f18a815f030, state=arrow::FutureState::FAILURE)
> at /arrow/cpp/src/arrow/util/future.cc:125
> #6  0x7f18c58265af in arrow::ConcreteFutureImpl::DoMarkFailed 
> (this=0x7f18a815f030) at /arrow/cpp/src/arrow/util/future.cc:40
> #7  0x7f18c5827660 in arrow::FutureImpl::MarkFailed (this=0x7f18a815f030) 
> at /arrow/cpp/src/arrow/util/future.cc:195
> #8  0x7f18c80ff8d8 in arrow::Future<...>::DoMarkFinished (this=0x7f18a815efb0, res=...)
> at /arrow/cpp/src/arrow/util/future.h:660
> #9  0x7f18c80fb37d in arrow::Future<...>::MarkFinished (this=0x7f18a815efb0, res=...)
> at /arrow/cpp/src/arrow/util/future.h:403
> #10 0x7f18c80f5ae3 in 
> arrow::flight::transport::ucx::UcpCallDriver::Impl::Push 
> (this=0x7f18a804d2d0, status=...)
> at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:780
> #11 0x7f18c80f5c1f in 
> arrow::flight::transport::ucx::UcpCallDriver::Impl::RecvActiveMessage 
> (this=0x7f18a804d2d0, header=0x7f18c8081865, header_length=12, 
> data=0x7f18c8081864, data_length=1, param=0x7f18c07fc680) at 
> /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:791
> #12 0x7f18c80f7d29 in 
> arrow::flight::transport::ucx::UcpCallDriver::RecvActiveMessage 
> (this=0x7f18b80017e0, header=0x7f18c8081865, header_length=12, 
> data=0x7f18c8081864, data_length=1, param=0x7f18c07fc680) at 
> /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:1082
> #13 0x7f18c80e3ea4 in arrow::flight::transport::ucx::(anonymous 
> namespace)::UcxServerImpl::HandleIncomingActiveMessage (self=0x7f18a80259a0, 
> header=0x7f18c8081865, header_length=12, data=0x7f18c8081864, 
> data_length=1, param=0x7f18c07fc680)
> at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_server.cc:586
> #14 0x7f18c4661a09 in ucp_am_invoke_cb (recv_flags=<optimized out>, 
> reply_ep=<optimized out>, data_length=1, data=<optimized out>, 
> user_hdr_length=<optimized out>, user_hdr=0x7f18c8081865, am_id=4132, 
> worker=<optimized out>) at core/ucp_am.c:1220
> #15 ucp_am_handler_common (name=<optimized out>, recv_flags=<optimized 
> out>, am_flags=0, reply_ep=<optimized out>, total_length=<optimized out>, 
> am_hdr=0x7f18c808185c, worker=<optimized out>) at core/ucp_am.c:1289
> #16 ucp_am_handler_reply (am_arg=<optimized out>, am_data=<optimized out>, 
> am_length=<optimized out>, am_flags=<optimized out>) at core/ucp_am.c:1327
> #17 0x7f18c28e3f1c in uct_iface_invoke_am (flags=0, length=29, 
> data=0x7f18c808185c, id=<optimized out>, iface=0x7f18a8027e20)
> at /usr/local/src/conda/ucx-1.13.1/src/uct/base/uct_iface.h:861
> #18 uct_mm_iface_invoke_am (flags=0, length=29, data=0x7f18c808185c, 
> am_id=<optimized out>, iface=0x7f18a8027e20) at sm/mm/base/mm_iface.h:256
> #19 uct_mm_iface_process_recv (iface=0x7f18a8027e20) at 
> sm/mm/base/mm_iface.c:256
> #20 uct_mm_iface_poll_fifo (iface=0x7f18a8027e20) at sm/mm/base/mm_iface.c:304
> #21 uct_mm_iface_progress (tl_iface=0x7f18a8027e20) at 
> sm/mm/base/mm_iface.c:357
> #22 0x7f18c4686e22 in ucs_callbackq_dispatch (cbq=<optimized out>) at 
> /usr/local/src/conda/ucx-1.13.1/src/ucs/datastruct/callbackq.h:211
> #23 uct_worker_progress (worker=<optimized out>) at 
> /usr/local/src/conda/ucx-1.13.1/src/uct/api/uct.h:2638
> #24 ucp_worker_progress (worker=0x7f18a80008d0) at core/ucp_worker.c:2782
> #25 0x7f18c80f586f in 
> 

[jira] [Created] (ARROW-18351) [C++][Flight] Crash in UcxErrorHandlingTest.TestDoExchange

2022-11-17 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-18351:
--

 Summary: [C++][Flight] Crash in UcxErrorHandlingTest.TestDoExchange
 Key: ARROW-18351
 URL: https://issues.apache.org/jira/browse/ARROW-18351
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, FlightRPC
Reporter: Antoine Pitrou


I get a non-deterministic crash in the Flight UCX tests.
{code}
[--] 3 tests from UcxErrorHandlingTest
[ RUN  ] UcxErrorHandlingTest.TestGetFlightInfo
[   OK ] UcxErrorHandlingTest.TestGetFlightInfo (24 ms)
[ RUN  ] UcxErrorHandlingTest.TestDoPut
[   OK ] UcxErrorHandlingTest.TestDoPut (15 ms)
[ RUN  ] UcxErrorHandlingTest.TestDoExchange
/arrow/cpp/src/arrow/util/future.cc:125:  Check failed: 
!IsFutureFinished(state_) Future already marked finished
{code}

Here is the GDB backtrace:
{code}
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x7f18c49cd7f1 in __GI_abort () at abort.c:79
#2  0x7f18c5854e00 in arrow::util::CerrLog::~CerrLog (this=0x7f18a81607b0, 
__in_chrg=<optimized out>) at /arrow/cpp/src/arrow/util/logging.cc:72
#3  0x7f18c5854e1c in arrow::util::CerrLog::~CerrLog (this=0x7f18a81607b0, 
__in_chrg=<optimized out>) at /arrow/cpp/src/arrow/util/logging.cc:74
#4  0x7f18c5855181 in arrow::util::ArrowLog::~ArrowLog 
(this=0x7f18c07fc380, __in_chrg=<optimized out>) at 
/arrow/cpp/src/arrow/util/logging.cc:250
#5  0x7f18c5826f86 in arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed 
(this=0x7f18a815f030, state=arrow::FutureState::FAILURE)
at /arrow/cpp/src/arrow/util/future.cc:125
#6  0x7f18c58265af in arrow::ConcreteFutureImpl::DoMarkFailed 
(this=0x7f18a815f030) at /arrow/cpp/src/arrow/util/future.cc:40
#7  0x7f18c5827660 in arrow::FutureImpl::MarkFailed (this=0x7f18a815f030) 
at /arrow/cpp/src/arrow/util/future.cc:195
#8  0x7f18c80ff8d8 in arrow::Future<...>::DoMarkFinished (this=0x7f18a815efb0, res=...)
    at /arrow/cpp/src/arrow/util/future.h:660
#9  0x7f18c80fb37d in arrow::Future<...>::MarkFinished (this=0x7f18a815efb0, res=...)
    at /arrow/cpp/src/arrow/util/future.h:403
#10 0x7f18c80f5ae3 in 
arrow::flight::transport::ucx::UcpCallDriver::Impl::Push (this=0x7f18a804d2d0, 
status=...)
at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:780
#11 0x7f18c80f5c1f in 
arrow::flight::transport::ucx::UcpCallDriver::Impl::RecvActiveMessage 
(this=0x7f18a804d2d0, header=0x7f18c8081865, header_length=12, 
data=0x7f18c8081864, data_length=1, param=0x7f18c07fc680) at 
/arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:791
#12 0x7f18c80f7d29 in 
arrow::flight::transport::ucx::UcpCallDriver::RecvActiveMessage 
(this=0x7f18b80017e0, header=0x7f18c8081865, header_length=12, 
data=0x7f18c8081864, data_length=1, param=0x7f18c07fc680) at 
/arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:1082
#13 0x7f18c80e3ea4 in arrow::flight::transport::ucx::(anonymous 
namespace)::UcxServerImpl::HandleIncomingActiveMessage (self=0x7f18a80259a0, 
header=0x7f18c8081865, header_length=12, data=0x7f18c8081864, 
data_length=1, param=0x7f18c07fc680)
at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_server.cc:586
#14 0x7f18c4661a09 in ucp_am_invoke_cb (recv_flags=<optimized out>, 
reply_ep=<optimized out>, data_length=1, data=<optimized out>, 
user_hdr_length=<optimized out>, user_hdr=0x7f18c8081865, am_id=4132, 
worker=<optimized out>) at core/ucp_am.c:1220
#15 ucp_am_handler_common (name=<optimized out>, recv_flags=<optimized out>, 
am_flags=0, reply_ep=<optimized out>, total_length=<optimized out>, 
am_hdr=0x7f18c808185c, worker=<optimized out>) at core/ucp_am.c:1289
#16 ucp_am_handler_reply (am_arg=<optimized out>, am_data=<optimized out>, 
am_length=<optimized out>, am_flags=<optimized out>) at core/ucp_am.c:1327
#17 0x7f18c28e3f1c in uct_iface_invoke_am (flags=0, length=29, 
data=0x7f18c808185c, id=<optimized out>, iface=0x7f18a8027e20)
at /usr/local/src/conda/ucx-1.13.1/src/uct/base/uct_iface.h:861
#18 uct_mm_iface_invoke_am (flags=0, length=29, data=0x7f18c808185c, 
am_id=<optimized out>, iface=0x7f18a8027e20) at sm/mm/base/mm_iface.h:256
#19 uct_mm_iface_process_recv (iface=0x7f18a8027e20) at 
sm/mm/base/mm_iface.c:256
#20 uct_mm_iface_poll_fifo (iface=0x7f18a8027e20) at sm/mm/base/mm_iface.c:304
#21 uct_mm_iface_progress (tl_iface=0x7f18a8027e20) at sm/mm/base/mm_iface.c:357
#22 0x7f18c4686e22 in ucs_callbackq_dispatch (cbq=<optimized out>) at 
/usr/local/src/conda/ucx-1.13.1/src/ucs/datastruct/callbackq.h:211
#23 uct_worker_progress (worker=<optimized out>) at 
/usr/local/src/conda/ucx-1.13.1/src/uct/api/uct.h:2638
#24 ucp_worker_progress (worker=0x7f18a80008d0) at core/ucp_worker.c:2782
#25 0x7f18c80f586f in 
arrow::flight::transport::ucx::UcpCallDriver::Impl::MakeProgress 
(this=0x7f18a804d2d0)
at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:759
#26 0x7f18c80f2e40 in 
arrow::flight::transport::ucx::UcpCallDriver::Impl::ReadNextFrame 
(this=0x7f18a804d2d0)
at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:449
#27 0x7f18c80f7661 in 
arrow::flight::transport::ucx::UcpCallDriver::ReadNextFrame 
(this=0x7f18b80017e0)

[jira] [Created] (ARROW-18350) [C++] Use std::to_chars instead of std::to_string

2022-11-17 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-18350:
--

 Summary: [C++] Use std::to_chars instead of std::to_string
 Key: ARROW-18350
 URL: https://issues.apache.org/jira/browse/ARROW-18350
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


{{std::to_chars}} is locale-independent unlike {{std::to_string}}; it may also 
be faster in some cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18349) [CI][C++][Flight] Exercise UCX on CI

2022-11-17 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635471#comment-17635471
 ] 

Antoine Pitrou commented on ARROW-18349:


cc [~yibocai] [~lidavidm] [~kou]

> [CI][C++][Flight] Exercise UCX on CI
> 
>
> Key: ARROW-18349
> URL: https://issues.apache.org/jira/browse/ARROW-18349
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Continuous Integration, FlightRPC
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 11.0.0
>
>
> UCX doesn't seem enabled on any CI configuration for now.
> We should have at least a nightly job with UCX enabled, for example one of 
> the Conda or Ubuntu builds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18349) [CI][C++][Flight] Exercise UCX on CI

2022-11-17 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-18349:
--

 Summary: [CI][C++][Flight] Exercise UCX on CI
 Key: ARROW-18349
 URL: https://issues.apache.org/jira/browse/ARROW-18349
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Continuous Integration, FlightRPC
Reporter: Antoine Pitrou
 Fix For: 11.0.0


UCX doesn't seem enabled on any CI configuration for now.

We should have at least a nightly job with UCX enabled, for example one of the 
Conda or Ubuntu builds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16817) [C++][Python] Segfaults for unsupported datatypes in the ORC writer

2022-11-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-16817:
--

Assignee: Will Jones  (was: Ian Alexander Joiner)

> [C++][Python] Segfaults for unsupported datatypes in the ORC writer
> ---
>
> Key: ARROW-16817
> URL: https://issues.apache.org/jira/browse/ARROW-16817
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Ian Alexander Joiner
>Assignee: Will Jones
>Priority: Major
>  Labels: good-first-issue, good-second-issue, 
> pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> In the ORC writer, if a table has at least one column with an unsupported 
> datatype, a segfault occurs when we try to write it to ORC.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16817) [C++][Python] Segfaults for unsupported datatypes in the ORC writer

2022-11-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-16817:
--

Assignee: Ian Alexander Joiner  (was: Will Jones)

> [C++][Python] Segfaults for unsupported datatypes in the ORC writer
> ---
>
> Key: ARROW-16817
> URL: https://issues.apache.org/jira/browse/ARROW-16817
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Ian Alexander Joiner
>Assignee: Ian Alexander Joiner
>Priority: Major
>  Labels: good-first-issue, good-second-issue, 
> pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> In the ORC writer, if a table has at least one column with an unsupported 
> datatype, a segfault occurs when we try to write it to ORC.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-16817) [C++][Python] Segfaults for unsupported datatypes in the ORC writer

2022-11-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-16817.

Resolution: Fixed

Issue resolved by pull request 14638
[https://github.com/apache/arrow/pull/14638]

> [C++][Python] Segfaults for unsupported datatypes in the ORC writer
> ---
>
> Key: ARROW-16817
> URL: https://issues.apache.org/jira/browse/ARROW-16817
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Ian Alexander Joiner
>Assignee: Will Jones
>Priority: Major
>  Labels: good-first-issue, good-second-issue, 
> pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> In the ORC writer, if a table has at least one column with an unsupported 
> datatype, a segfault occurs when we try to write it to ORC.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-15538) [C++] Create mapping from Substrait "standard functions" to Arrow equivalents

2022-11-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-15538.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14434
[https://github.com/apache/arrow/pull/14434]

> [C++] Create mapping from Substrait "standard functions" to Arrow equivalents
> -
>
> Key: ARROW-15538
> URL: https://issues.apache.org/jira/browse/ARROW-15538
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Tom Drabas
>Priority: Major
>  Labels: pull-request-available, substrait
> Fix For: 11.0.0
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> Substrait has a number of "stock" functions defined here: 
> https://github.com/substrait-io/substrait/tree/main/extensions
> This is basically a set of standard extensions.
> We should map these functions to the equivalent Arrow functions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18344) [C++] Use input pre-sortedness to create sorted table with ConcatenateTables

2022-11-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17634997#comment-17634997
 ] 

Antoine Pitrou commented on ARROW-18344:


That's what we should do indeed. But do we want the user to be able to pass 
sort indices or not?

> [C++] Use input pre-sortedness to create sorted table with ConcatenateTables
> 
>
> Key: ARROW-18344
> URL: https://issues.apache.org/jira/browse/ARROW-18344
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: kernel
>
> In case of concatenating large sorted tables (e.g. sorted timeseries data) 
> the resulting table is no longer sorted. However the input sortedness can be 
> used to significantly speed up post concatenation sorting. A potential API 
> could be to add ConcatenateTablesOptions.inputs_sorted and implement the 
> logic in ConcatenateTables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-18344) [C++] Use input pre-sortedness to create sorted table with ConcatenateTables

2022-11-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17634877#comment-17634877
 ] 

Antoine Pitrou edited comment on ARROW-18344 at 11/16/22 3:11 PM:
--

We don't actually sort data in Arrow, we produce indices that would sort the 
(untouched, unsorted) data.

Here we should follow the same approach, which means it can't be part of 
Concatenate.

We probably want something like a "merge_indices" compute function, similar to 
"sort_indices". The building blocks required for implementation are already 
there, since that's how "sort_indices" is implemented for chunked inputs.

One limitation is that this requires physical chunking to be aligned with 
logical sortedness? Unless we optionally allow the user to pass a vector of the 
boundaries between (logical) sorted chunks.




was (Author: pitrou):
We don't actually sort data in Arrow, we produce indices that would sort the 
(untouched, unsorted) data.

Here we should follow the same approach, which means it can't be part of 
Concatenate.

We probably want something like a "merge_indices" compute function, similar to 
"sort_indices". The building blocks required for implementation are already 
there, since that's how "sort_indices" is implemented for chunked inputs.

One limitation is that this requires physical chunking to be aligned with 
logical sortedness?


> [C++] Use input pre-sortedness to create sorted table with ConcatenateTables
> 
>
> Key: ARROW-18344
> URL: https://issues.apache.org/jira/browse/ARROW-18344
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: kernel
>
> In case of concatenating large sorted tables (e.g. sorted timeseries data) 
> the resulting table is no longer sorted. However the input sortedness can be 
> used to significantly speed up post concatenation sorting. A potential API 
> could be to add ConcatenateTablesOptions.inputs_sorted and implement the 
> logic in ConcatenateTables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18344) [C++] Use input pre-sortedness to create sorted table with ConcatenateTables

2022-11-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17634877#comment-17634877
 ] 

Antoine Pitrou commented on ARROW-18344:


We don't actually sort data in Arrow, we produce indices that would sort the 
(untouched, unsorted) data.

Here we should follow the same approach, which means it can't be part of 
Concatenate.

We probably want something like a "merge_indices" compute function, similar to 
"sort_indices". The building blocks required for implementation are already 
there, since that's how "sort_indices" is implemented for chunked inputs.

One limitation is that this requires physical chunking to be aligned with 
logical sortedness?


> [C++] Use input pre-sortedness to create sorted table with ConcatenateTables
> 
>
> Key: ARROW-18344
> URL: https://issues.apache.org/jira/browse/ARROW-18344
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: kernel
>
> In case of concatenating large sorted tables (e.g. sorted timeseries data) 
> the resulting table is no longer sorted. However the input sortedness can be 
> used to significantly speed up post concatenation sorting. A potential API 
> could be to add ConcatenateTablesOptions.inputs_sorted and implement the 
> logic in ConcatenateTables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17825) [C++] Allow to write several tables successively with ORCFileWriter::Write method

2022-11-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-17825.

Resolution: Fixed

Issue resolved by pull request 14219
[https://github.com/apache/arrow/pull/14219]

> [C++] Allow to write several tables successively with ORCFileWriter::Write 
> method
> -
>
> Key: ARROW-17825
> URL: https://issues.apache.org/jira/browse/ARROW-17825
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Louis Calot
>Assignee: Louis Calot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I had the need to write an ORC file little by little, so as to not consume 
> too much memory.
> Following [this|https://github.com/apache/arrow/issues/14211] discussion, it 
> appeared that the API did not seem to prevent doing that, but that the 
> internal implementation was not reusing the writer accordingly.
> This PR makes the needed changes to reuse the "writer_" correctly.
> I do not think that the preceding behaviour was correct, as calling the 
> "Write" method several times would lead to incorrect ORC files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

