[jira] [Assigned] (ARROW-18400) [Python] Quadratic memory usage of Table.to_pandas with nested data

2023-01-05 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-18400:
--

Assignee: Will Jones

> [Python] Quadratic memory usage of Table.to_pandas with nested data
> ---
>
> Key: ARROW-18400
> URL: https://issues.apache.org/jira/browse/ARROW-18400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.1
> Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X 
> with 64 GB RAM
>Reporter: Adam Reeve
>Assignee: Will Jones
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
> Attachments: test_memory.py
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame 
> shows quadratic memory usage and will eventually run out of memory for 
> reasonably small files. I had initially thought this was a regression since 
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks 
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
> _characters = string.ascii_uppercase + string.digits + string.punctuation
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>             [{
>                 'a': None if i % 1000 == 0 else np.random.choice(1, 
> size=3).astype(np.int64),
>                 'b': None if i % 100 == 0 else random.choice(range(100)),
>                 'c': None if i % 10 == 0 else make_random_string(5)
>             } for i in range(arr_len)]
>         ))
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem, it's the to_pandas method 
> that exhibits the large memory usage. I haven't tested generating nested 
> Arrow data in memory without writing Parquet from Pandas but I assume the 
> problem probably isn't Parquet specific.
> Memory usage I see when reading different sized files on a machine with 64 GB 
> RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when row count 
> doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but 
> then quadruples from 1024k to 2048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something 
> changed between 7.0.0 and 8.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18400) [Python] Quadratic memory usage of Table.to_pandas with nested data

2023-01-05 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655100#comment-17655100
 ] 

Will Jones commented on ARROW-18400:


Took a look at the issue in Joris' last repro. It seems to stem from the fact 
that the {{ListArray.values()}} method in C++ doesn't account for slices. I 
think if it did, the numpy conversion issue would be solved. Created a repro in 
a PR here: [https://github.com/apache/arrow/pull/15210]

Do you agree with that assessment?
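
For anyone following along, here is a minimal pyarrow illustration of the 
behaviour being described (my own sketch; it assumes the Python bindings mirror 
the C++ method):

{code:python}
import pyarrow as pa

arr = pa.array([[1, 2], [3, 4], [5, 6]])
sliced = arr.slice(1, 2)  # logically [[3, 4], [5, 6]]

# .values exposes the full child array and ignores the slice offset ...
print(sliced.values.to_pylist())     # [1, 2, 3, 4, 5, 6] on affected versions

# ... while .flatten() takes the slice into account.
print(sliced.flatten().to_pylist())  # [3, 4, 5, 6]
{code}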

> [Python] Quadratic memory usage of Table.to_pandas with nested data
> ---
>
> Key: ARROW-18400
> URL: https://issues.apache.org/jira/browse/ARROW-18400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.1
> Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X 
> with 64 GB RAM
>Reporter: Adam Reeve
>Assignee: Alenka Frim
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
> Attachments: test_memory.py
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame 
> shows quadratic memory usage and will eventually run out of memory for 
> reasonably small files. I had initially thought this was a regression since 
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks 
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
> _characters = string.ascii_uppercase + string.digits + string.punctuation
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>             [{
>                 'a': None if i % 1000 == 0 else np.random.choice(1, 
> size=3).astype(np.int64),
>                 'b': None if i % 100 == 0 else random.choice(range(100)),
>                 'c': None if i % 10 == 0 else make_random_string(5)
>             } for i in range(arr_len)]
>         ))
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem, it's the to_pandas method 
> that exhibits the large memory usage. I haven't tested generating nested 
> Arrow data in memory without writing Parquet from Pandas but I assume the 
> problem probably isn't Parquet specific.
> Memory usage I see when reading different sized files on a machine with 64 GB 
> RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when row count 
> doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but 
> then quadruples from 1024k to 2048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something 
> changed between 7.0.0 and 8.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18411) [Python] MapType comparison ignores nullable flag of item_field

2023-01-05 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones resolved ARROW-18411.

Resolution: Fixed

> [Python] MapType comparison ignores nullable flag of item_field
> ---
>
> Key: ARROW-18411
> URL: https://issues.apache.org/jira/browse/ARROW-18411
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: pyarrow==10.0.1
>Reporter: &res
>Assignee: Will Jones
>Priority: Minor
>
> By default MapType value fields are nullable:
> {code:java}
>  pa.map_(pa.string(), pa.int32()).item_field.nullable == True {code}
> It is possible to mark the value field of a MapType as not-nullable:
> {code:java}
>  pa.map_(pa.string(), pa.field("value", pa.int32(), 
> nullable=False)).item_field.nullable == False{code}
> But comparing these two types, that are semantically different, returns True:
> {code:java}
> pa.map_(pa.string(), pa.int32()) == pa.map_(pa.string(), pa.field("value", 
> pa.int32(), nullable=False)) # Returns True {code}
> So it looks like the comparison omits the nullable flag. 
> {code:java}
> import pyarrow as pa
> map_type = pa.map_(pa.string(), pa.int32())
> non_null_map_type = pa.map_(pa.string(), pa.field("value", pa.int32(), 
> nullable=False))
> nullable_map_type = pa.map_(pa.string(), pa.field("value", pa.int32(), 
> nullable=True))
> map_type_different_field_name = pa.map_(pa.string(), pa.field("value", 
> pa.int32(), nullable=True))
> assert nullable_map_type == map_type  # Wrong
> assert str(nullable_map_type) == str(map_type)
> assert str(non_null_map_type) == str(map_type) # Wrong
> assert non_null_map_type == map_type
> assert non_null_map_type.item_type == map_type.item_type
> assert non_null_map_type.item_field != map_type.item_field
> assert non_null_map_type.item_field.nullable != map_type.item_field.nullable
> assert non_null_map_type.item_field.name == map_type.item_field.name
> assert map_type == map_type_different_field_name # This makes sense
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17302) [R] Configure curl timeout policy for S3

2023-01-03 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones resolved ARROW-17302.

Resolution: Fixed

> [R] Configure curl timeout policy for S3
> 
>
> Key: ARROW-17302
> URL: https://issues.apache.org/jira/browse/ARROW-17302
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 9.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Nicola Crane
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See ARROW-16521



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18202) [R][C++] Different behaviour of R's base::gsub() binding aka libarrow's replace_string_regex kernel since 10.0.0

2022-12-30 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653186#comment-17653186
 ] 

Will Jones commented on ARROW-18202:


The following lines were added to return early if the input string is empty:

https://github.com/apache/arrow/blob/498b645e1d09306bf5399a9a019a5caa99513815/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2048-L2051
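
For reference, the same kernel can be exercised from Python (a quick sketch; 
I'm assuming R's {{gsub()}} binding maps to the {{replace_substring_regex}} 
compute function):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["100", "", "Not Range"])
# On builds with the early return above, the empty string is skipped and comes
# back unchanged instead of being replaced with "0".
print(pc.replace_substring_regex(arr, pattern="^$", replacement="0").to_pylist())
{code}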

> [R][C++] Different behaviour of R's base::gsub() binding aka libarrow's 
> replace_string_regex kernel since 10.0.0
> 
>
> Key: ARROW-18202
> URL: https://issues.apache.org/jira/browse/ARROW-18202
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 10.0.0
>Reporter: Lorenzo Isella
>Assignee: Will Jones
>Priority: Critical
> Fix For: 11.0.0
>
>
> Hello,
> I think there is a problem with arrow 10.0 and R. I did not have this issue 
> with arrow 9.0.
> Could you please have a look?
> Many thanks
>  
> {code:r}
> library(tidyverse)
> library(arrow)
> ll <- c(      "100",   "1000",  "200"  , "3000" , "50"   ,
>         "500", ""   ,   "Not Range")
> df <- tibble(x=rep(ll, 1000), y=seq(8000))
> write_tsv(df, "data.tsv")
> data <- open_dataset("data.tsv", format="tsv",
>                      skip_rows=1,
>                      schema=schema(x=string(),
>                      y=double())
> )
> test <- data |>
>     collect()
> ###I want to replace the "" with "0". I believe this worked with arrow 9.0
> df2 <- data |>
>     mutate(x=gsub("^$","0",x) ) |>
>     collect()
> df2 ### now I did not modify the  "" entries in x
> #> # A tibble: 8,000 × 2
> #>    x               y
> #>           
> #>  1 "100"       1
> #>  2 "1000"      2
> #>  3 "200"       3
> #>  4 "3000"      4
> #>  5 "50"        5
> #>  6 "500"       6
> #>  7 ""              7
> #>  8 "Not Range"     8
> #>  9 "100"       9
> #> 10 "1000"     10
> #> # … with 7,990 more rows
>  
> df3 <- df |>
>     mutate(x=gsub("^$","0",x) )
> df3  ## and this is fine
> #> # A tibble: 8,000 × 2
> #>    x             y
> #>         
> #>  1 100       1
> #>  2 1000      2
> #>  3 200       3
> #>  4 3000      4
> #>  5 50        5
> #>  6 500       6
> #>  7 0             7
> #>  8 Not Range     8
> #>  9 100       9
> #> 10 1000     10
> #> # … with 7,990 more rows
> ## How to fix this...I believe this issue did not arise with arrow 9.0.
> sessionInfo()
> #> R version 4.2.1 (2022-06-23)
> #> Platform: x86_64-pc-linux-gnu (64-bit)
> #> Running under: Debian GNU/Linux 11 (bullseye)
> #> 
> #> Matrix products: default
> #> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
> #> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
> #> 
> #> locale:
> #>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
> #>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
> #>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
> #>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
> #>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> #> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
> #> 
> #> attached base packages:
> #> [1] stats     graphics  grDevices utils     datasets  methods   base     
> #> 
> #> other attached packages:
> #>  [1] arrow_10.0.0    forcats_0.5.2   stringr_1.4.1   dplyr_1.0.10   
> #>  [5] purrr_0.3.5     readr_2.1.3     tidyr_1.2.1     tibble_3.1.8   
> #>  [9] ggplot2_3.3.6   tidyverse_1.3.2
> #> 
> #> loaded via a namespace (and not attached):
> #>  [1] lubridate_1.8.0     assertthat_0.2.1    digest_0.6.30      
> #>  [4] utf8_1.2.2          R6_2.5.1            cellranger_1.1.0   
> #>  [7] backports_1.4.1     reprex_2.0.2        evaluate_0.17      
> #> [10] httr_1.4.4          highr_0.9           pillar_1.8.1       
> #> [13] rlang_1.0.6         googlesheets4_1.0.1 readxl_1.4.1       
> #> [16] R.utils_2.12.1      R.oo_1.25.0         rmarkdown_2.17     
> #> [19] styler_1.8.0        googledrive_2.0.0   bit_4.0.4          
> #> [22] munsell_0.5.0       broom_1.0.1         compiler_4.2.1     
> #> [25] modelr_0.1.9        xfun_0.34           pkgconfig_2.0.3    
> #> [28] htmltools_0.5.3     tidyselect_1.2.0    fansi_1.0.3        
> #> [31] crayon_1.5.2        tzdb_0.3.0          dbplyr_2.2.1       
> #> [34] withr_2.5.0         R.methodsS3_1.8.2   grid_4.2.1         
> #> [37] jsonlite_1.8.3      gtable_0.3.1        lifecycle_1.0.3    
> #> [40] DBI_1.1.3           magrittr_2.0.3      scales_1.2.1       
> #> [43] vroom_1.6.0         cli_3.4.1           stringi_1.7.8      
> #> [46] fs_1.5.2            xml2_1.3.3          ellipsis_0.3.2     
> #> [49] generics_0.1.3      vctrs_0.5.0

[jira] [Assigned] (ARROW-18202) [R][C++] Different behaviour of R's base::gsub() binding aka libarrow's replace_string_regex kernel since 10.0.0

2022-12-30 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-18202:
--

Assignee: Will Jones

> [R][C++] Different behaviour of R's base::gsub() binding aka libarrow's 
> replace_string_regex kernel since 10.0.0
> 
>
> Key: ARROW-18202
> URL: https://issues.apache.org/jira/browse/ARROW-18202
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 10.0.0
>Reporter: Lorenzo Isella
>Assignee: Will Jones
>Priority: Critical
> Fix For: 11.0.0
>
>
> Hello,
> I think there is a problem with arrow 10.0 and R. I did not have this issue 
> with arrow 9.0.
> Could you please have a look?
> Many thanks
>  
> {code:r}
> library(tidyverse)
> library(arrow)
> ll <- c(      "100",   "1000",  "200"  , "3000" , "50"   ,
>         "500", ""   ,   "Not Range")
> df <- tibble(x=rep(ll, 1000), y=seq(8000))
> write_tsv(df, "data.tsv")
> data <- open_dataset("data.tsv", format="tsv",
>                      skip_rows=1,
>                      schema=schema(x=string(),
>                      y=double())
> )
> test <- data |>
>     collect()
> ###I want to replace the "" with "0". I believe this worked with arrow 9.0
> df2 <- data |>
>     mutate(x=gsub("^$","0",x) ) |>
>     collect()
> df2 ### now I did not modify the  "" entries in x
> #> # A tibble: 8,000 × 2
> #>    x               y
> #>           
> #>  1 "100"       1
> #>  2 "1000"      2
> #>  3 "200"       3
> #>  4 "3000"      4
> #>  5 "50"        5
> #>  6 "500"       6
> #>  7 ""              7
> #>  8 "Not Range"     8
> #>  9 "100"       9
> #> 10 "1000"     10
> #> # … with 7,990 more rows
>  
> df3 <- df |>
>     mutate(x=gsub("^$","0",x) )
> df3  ## and this is fine
> #> # A tibble: 8,000 × 2
> #>    x             y
> #>         
> #>  1 100       1
> #>  2 1000      2
> #>  3 200       3
> #>  4 3000      4
> #>  5 50        5
> #>  6 500       6
> #>  7 0             7
> #>  8 Not Range     8
> #>  9 100       9
> #> 10 1000     10
> #> # … with 7,990 more rows
> ## How to fix this...I believe this issue did not arise with arrow 9.0.
> sessionInfo()
> #> R version 4.2.1 (2022-06-23)
> #> Platform: x86_64-pc-linux-gnu (64-bit)
> #> Running under: Debian GNU/Linux 11 (bullseye)
> #> 
> #> Matrix products: default
> #> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
> #> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
> #> 
> #> locale:
> #>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
> #>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
> #>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
> #>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
> #>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> #> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
> #> 
> #> attached base packages:
> #> [1] stats     graphics  grDevices utils     datasets  methods   base     
> #> 
> #> other attached packages:
> #>  [1] arrow_10.0.0    forcats_0.5.2   stringr_1.4.1   dplyr_1.0.10   
> #>  [5] purrr_0.3.5     readr_2.1.3     tidyr_1.2.1     tibble_3.1.8   
> #>  [9] ggplot2_3.3.6   tidyverse_1.3.2
> #> 
> #> loaded via a namespace (and not attached):
> #>  [1] lubridate_1.8.0     assertthat_0.2.1    digest_0.6.30      
> #>  [4] utf8_1.2.2          R6_2.5.1            cellranger_1.1.0   
> #>  [7] backports_1.4.1     reprex_2.0.2        evaluate_0.17      
> #> [10] httr_1.4.4          highr_0.9           pillar_1.8.1       
> #> [13] rlang_1.0.6         googlesheets4_1.0.1 readxl_1.4.1       
> #> [16] R.utils_2.12.1      R.oo_1.25.0         rmarkdown_2.17     
> #> [19] styler_1.8.0        googledrive_2.0.0   bit_4.0.4          
> #> [22] munsell_0.5.0       broom_1.0.1         compiler_4.2.1     
> #> [25] modelr_0.1.9        xfun_0.34           pkgconfig_2.0.3    
> #> [28] htmltools_0.5.3     tidyselect_1.2.0    fansi_1.0.3        
> #> [31] crayon_1.5.2        tzdb_0.3.0          dbplyr_2.2.1       
> #> [34] withr_2.5.0         R.methodsS3_1.8.2   grid_4.2.1         
> #> [37] jsonlite_1.8.3      gtable_0.3.1        lifecycle_1.0.3    
> #> [40] DBI_1.1.3           magrittr_2.0.3      scales_1.2.1       
> #> [43] vroom_1.6.0         cli_3.4.1           stringi_1.7.8      
> #> [46] fs_1.5.2            xml2_1.3.3          ellipsis_0.3.2     
> #> [49] generics_0.1.3      vctrs_0.5.0         tools_4.2.1        
> #> [52] bit64_4.0.5         R.cache_0.16.0      glue_1.6.2         
> #> [55] hms_1.1.2           parallel_4.2.1      fastmap_1.1.0      
> #> [58] yaml_2.3.6          colorspace_2.0-3    gargle_1.2.1       
> #> [61

[jira] [Commented] (ARROW-18195) [R][C++] Final value returned by case_when is NA when input has 64 or more values and 1 or more NAs

2022-12-30 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653169#comment-17653169
 ] 

Will Jones commented on ARROW-18195:


Thank you for all the reproductions. I zeroed in on one simple one and was able 
to reproduce in C++. Additional observations:

 {code:R}
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)

# Condition has NA and more than 64 values
# Expression generated internally:
# case_when({1=x}, 1)
test_df4 = tibble::tibble(x = c(NA, rep(TRUE, 64)))
test_arrow4 = arrow_table(test_df4)
test_arrow4 %>%
  mutate(y = case_when(x ~ 1L)) %>%
  collect() %>%
  tail()
#> # A tibble: 6 × 2
#>   x y
#>
#> 1 TRUE  1
#> 2 TRUE  1
#> 3 TRUE  1
#> 4 TRUE  1
#> 5 TRUE  1
#> 6 TRUE NA

# It seems to be coming from the next clause, which defaults to NA
# Expression generated internally:
# case_when({1=x, 2=true}, 1, 2)
test_df4 = tibble::tibble(x = c(NA, rep(TRUE, 64)))
test_arrow4 = arrow_table(test_df4)
test_arrow4 %>%
  mutate(y = case_when(x ~ 1L, TRUE ~ 2L)) %>%
  collect() %>%
  tail()
#> # A tibble: 6 × 2
#>   x y
#>
#> 1 TRUE  1
#> 2 TRUE  1
#> 3 TRUE  1
#> 4 TRUE  1
#> 5 TRUE  1
#> 6 TRUE  2

# Applies also to vectors
test_df4 = tibble::tibble(x = c(NA, rep(TRUE, 64)), left = rep(1L, 65), right = 
rep(2L, 65))
test_arrow4 = arrow_table(test_df4)
test_arrow4 %>%
  mutate(y = case_when(x ~ left, TRUE ~ right)) %>%
  collect() %>%
  tail()
#> # A tibble: 6 × 4
#>   x  left right y
#>  
#> 1 TRUE  1 2 1
#> 2 TRUE  1 2 1
#> 3 TRUE  1 2 1
#> 4 TRUE  1 2 1
#> 5 TRUE  1 2 1
#> 6 TRUE  1 2 2

# It does seem the 65th and onward elements become the else value for no reason
lapply(c(65, 68, 127, 140), function(len) {
  test_df4 = tibble::tibble(x = c(NA, rep(TRUE, len - 1)))
  test_arrow4 = arrow_table(test_df4)
  y <- test_arrow4 %>%
mutate(y = case_when(x ~ 1L)) %>%
collect() %>%
.$y
  which(is.na(y))
})
#> [[1]]
#> [1]  1 65
#> 
#> [[2]]
#> [1]  1 65 66 67 68
#> 
#> [[3]]
#>  [1]   1  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82
#> [20]  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101
#> [39] 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
#> [58] 121 122 123 124 125 126 127
#> 
#> [[4]]
#>  [1]  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83
#> [20]  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102
#> [39] 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121
#> [58] 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
{code}

Created on 2022-12-30 with [reprex 
v2.0.2](https://reprex.tidyverse.org)


> [R][C++] Final value returned by case_when is NA when input has 64 or more 
> values and 1 or more NAs
> ---
>
> Key: ARROW-18195
> URL: https://issues.apache.org/jira/browse/ARROW-18195
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 10.0.0
>Reporter: Lee Mendelowitz
>Assignee: Will Jones
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
> Attachments: test_issue.R
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> There appears to be a bug when processing an Arrow table with NA values and 
> using `dplyr::case_when`. A reproducible example is below: the output from 
> arrow table processing does not match the output when processing a tibble. If 
> the NA's are removed from the dataframe, then the outputs match.
> {noformat}
> ``` r
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #> filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #> intersect, setdiff, setequal, union
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #> timestamp
> library(assertthat)
> play_results = c('single', 'double', 'triple', 'home_run')
> nrows = 1000
> # Change frac_na to 0, and the result error disappears.
> frac_na = 0.05
> # Create a test dataframe with NA values
> test_df = tibble(
> play_result = sample(play_results, nrows, replace = TRUE)
> ) %>%
> mutate(
> play_result = ifelse(runif(nrows) < frac_na, NA_character_, 
> play_result)
> )
> 
> test_arrow = arrow_table(test_df)
> process_plays = function(df) {
> df %>%
> mutate(
> avg = case_when(
> play_result == 'single' ~ 1,
>   

[jira] [Assigned] (ARROW-18195) [R][C++] Final value returned by case_when is NA when input has 64 or more values and 1 or more NAs

2022-12-30 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-18195:
--

Assignee: Will Jones

> [R][C++] Final value returned by case_when is NA when input has 64 or more 
> values and 1 or more NAs
> ---
>
> Key: ARROW-18195
> URL: https://issues.apache.org/jira/browse/ARROW-18195
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 10.0.0
>Reporter: Lee Mendelowitz
>Assignee: Will Jones
>Priority: Critical
> Fix For: 11.0.0
>
> Attachments: test_issue.R
>
>
> There appears to be a bug when processing an Arrow table with NA values and 
> using `dplyr::case_when`. A reproducible example is below: the output from 
> arrow table processing does not match the output when processing a tibble. If 
> the NA's are removed from the dataframe, then the outputs match.
> {noformat}
> ``` r
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #> filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #> intersect, setdiff, setequal, union
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #> timestamp
> library(assertthat)
> play_results = c('single', 'double', 'triple', 'home_run')
> nrows = 1000
> # Change frac_na to 0, and the result error disappears.
> frac_na = 0.05
> # Create a test dataframe with NA values
> test_df = tibble(
> play_result = sample(play_results, nrows, replace = TRUE)
> ) %>%
> mutate(
> play_result = ifelse(runif(nrows) < frac_na, NA_character_, 
> play_result)
> )
> 
> test_arrow = arrow_table(test_df)
> process_plays = function(df) {
> df %>%
> mutate(
> avg = case_when(
> play_result == 'single' ~ 1,
> play_result == 'double' ~ 1,
> play_result == 'triple' ~ 1,
> play_result == 'home_run' ~ 1,
> is.na(play_result) ~ NA_real_,
> TRUE ~ 0
> )
> ) %>%
> count(play_result, avg) %>%
> arrange(play_result)
> }
> # Compare arrow_table result to tibble result
> result_tibble = process_plays(test_df)
> result_arrow = process_plays(test_arrow) %>% collect()
> assertthat::assert_that(identical(result_tibble, result_arrow))
> #> Error: result_tibble not identical to result_arrow
> ```
> Created on 2022-10-29 with [reprex 
> v2.0.2](https://reprex.tidyverse.org)
> {noformat}
> I have reproduced this issue both on Mac OS and Ubuntu 20.04.
>  
> {noformat}
> ```
> r$> sessionInfo()
> R version 4.2.1 (2022-06-23)
> Platform: aarch64-apple-darwin21.5.0 (64-bit)
> Running under: macOS Monterey 12.5.1
> Matrix products: default
> BLAS:   /opt/homebrew/Cellar/openblas/0.3.20/lib/libopenblasp-r0.3.20.dylib
> LAPACK: /opt/homebrew/Cellar/r/4.2.1/lib/R/lib/libRlapack.dylib
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
> other attached packages:
> [1] assertthat_0.2.1 arrow_10.0.0     dplyr_1.0.10
> loaded via a namespace (and not attached):
>  [1] compiler_4.2.1    pillar_1.8.1      highr_0.9         R.methodsS3_1.8.2 
> R.utils_2.12.0    tools_4.2.1       bit_4.0.4         digest_0.6.29
>  [9] evaluate_0.15     lifecycle_1.0.1   tibble_3.1.8      R.cache_0.16.0    
> pkgconfig_2.0.3   rlang_1.0.5       reprex_2.0.2      DBI_1.1.2
> [17] cli_3.3.0         rstudioapi_0.13   yaml_2.3.5        xfun_0.31         
> fastmap_1.1.0     withr_2.5.0       styler_1.8.0      knitr_1.39
> [25] generics_0.1.3    fs_1.5.2          vctrs_0.4.1       bit64_4.0.5       
> tidyselect_1.1.2  glue_1.6.2        R6_2.5.1          processx_3.5.3
> [33] fansi_1.0.3       rmarkdown_2.14    purrr_0.3.4       callr_3.7.0       
> clipr_0.8.0       magrittr_2.0.3    ellipsis_0.3.2    ps_1.7.0
> [41] htmltools_0.5.3   renv_0.16.0       utf8_1.2.2        R.oo_1.25.0
> ```
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18400) [Python] Quadratic memory usage of Table.to_pandas with nested data

2022-11-30 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641484#comment-17641484
 ] 

Will Jones commented on ARROW-18400:


Under the hood, {{pyarrow.parquet.read_table}} is using Dataset. Has anyone 
looked at the effect of changing {{batch_readahead}} and 
{{fragment_readahead}}? (They can be passed as kwargs to 
{{.to_table()}}) 
([docs|https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner])
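
For example (a sketch; the readahead values below are arbitrary and would need 
tuning):

{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("nested.parquet", format="parquet")
# Smaller readahead values trade scan throughput for lower peak memory.
table = dataset.to_table(batch_readahead=4, fragment_readahead=2)
df = table.to_pandas()
{code}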

> [Python] Quadratic memory usage of Table.to_pandas with nested data
> ---
>
> Key: ARROW-18400
> URL: https://issues.apache.org/jira/browse/ARROW-18400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.1
> Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X 
> with 64 GB RAM
>Reporter: Adam Reeve
>Assignee: Alenka Frim
>Priority: Critical
> Fix For: 11.0.0
>
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame 
> shows quadratic memory usage and will eventually run out of memory for 
> reasonably small files. I had initially thought this was a regression since 
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks 
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
> _characters = string.ascii_uppercase + string.digits + string.punctuation
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>             [{
>                 'a': None if i % 1000 == 0 else np.random.choice(1, 
> size=3).astype(np.int64),
>                 'b': None if i % 100 == 0 else random.choice(range(100)),
>                 'c': None if i % 10 == 0 else make_random_string(5)
>             } for i in range(arr_len)]
>         ))
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem, it's the to_pandas method 
> that exhibits the large memory usage. I haven't tested generating nested 
> Arrow data in memory without writing Parquet from Pandas but I assume the 
> problem probably isn't Parquet specific.
> Memory usage I see when reading different sized files on a machine with 64 GB 
> RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when row count 
> doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but 
> then quadruples from 1024k to 2048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something 
> changed between 7.0.0 and 8.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18411) [Python] MapType comparison ignores nullable flag of item_field

2022-11-28 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640137#comment-17640137
 ] 

Will Jones commented on ARROW-18411:


Thanks for reporting this. This will be fixed by 
[https://github.com/apache/arrow/pull/13851]

> [Python] MapType comparison ignores nullable flag of item_field
> ---
>
> Key: ARROW-18411
> URL: https://issues.apache.org/jira/browse/ARROW-18411
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: pyarrow==10.0.1
>Reporter: &res
>Assignee: Will Jones
>Priority: Minor
>
> By default MapType value fields are nullable:
> {code:java}
>  pa.map_(pa.string(), pa.int32()).item_field.nullable == True {code}
> It is possible to mark the value field of a MapType as not-nullable:
> {code:java}
>  pa.map_(pa.string(), pa.field("value", pa.int32(), 
> nullable=False)).item_field.nullable == False{code}
> But comparing these two types, that are semantically different, returns True:
> {code:java}
> pa.map_(pa.string(), pa.int32()) == pa.map_(pa.string(), pa.field("value", 
> pa.int32(), nullable=False)) # Returns True {code}
> So it looks like the comparison omits the nullable flag. 
> {code:java}
> import pyarrow as pa
> map_type = pa.map_(pa.string(), pa.int32())
> non_null_map_type = pa.map_(pa.string(), pa.field("value", pa.int32(), 
> nullable=False))
> nullable_map_type = pa.map_(pa.string(), pa.field("value", pa.int32(), 
> nullable=True))
> map_type_different_field_name = pa.map_(pa.string(), pa.field("value", 
> pa.int32(), nullable=True))
> assert nullable_map_type == map_type  # Wrong
> assert str(nullable_map_type) == str(map_type)
> assert str(non_null_map_type) == str(map_type) # Wrong
> assert non_null_map_type == map_type
> assert non_null_map_type.item_type == map_type.item_type
> assert non_null_map_type.item_field != map_type.item_field
> assert non_null_map_type.item_field.nullable != map_type.item_field.nullable
> assert non_null_map_type.item_field.name == map_type.item_field.name
> assert map_type == map_type_different_field_name # This makes sense
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18411) [Python] MapType comparison ignores nullable flag of item_field

2022-11-28 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-18411:
--

Assignee: Will Jones

> [Python] MapType comparison ignores nullable flag of item_field
> ---
>
> Key: ARROW-18411
> URL: https://issues.apache.org/jira/browse/ARROW-18411
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: pyarrow==10.0.1
>Reporter: &res
>Assignee: Will Jones
>Priority: Minor
>
> By default MapType value fields are nullable:
> {code:java}
>  pa.map_(pa.string(), pa.int32()).item_field.nullable == True {code}
> It is possible to mark the value field of a MapType as not-nullable:
> {code:java}
>  pa.map_(pa.string(), pa.field("value", pa.int32(), 
> nullable=False)).item_field.nullable == False{code}
> But comparing these two types, that are semantically different, returns True:
> {code:java}
> pa.map_(pa.string(), pa.int32()) == pa.map_(pa.string(), pa.field("value", 
> pa.int32(), nullable=False)) # Returns True {code}
> So it looks like the comparison omits the nullable flag. 
> {code:java}
> import pyarrow as pa
> map_type = pa.map_(pa.string(), pa.int32())
> non_null_map_type = pa.map_(pa.string(), pa.field("value", pa.int32(), 
> nullable=False))
> nullable_map_type = pa.map_(pa.string(), pa.field("value", pa.int32(), 
> nullable=True))
> map_type_different_field_name = pa.map_(pa.string(), pa.field("value", 
> pa.int32(), nullable=True))
> assert nullable_map_type == map_type  # Wrong
> assert str(nullable_map_type) == str(map_type)
> assert str(non_null_map_type) == str(map_type) # Wrong
> assert non_null_map_type == map_type
> assert non_null_map_type.item_type == map_type.item_type
> assert non_null_map_type.item_field != map_type.item_field
> assert non_null_map_type.item_field.nullable != map_type.item_field.nullable
> assert non_null_map_type.item_field.name == map_type.item_field.name
> assert map_type == map_type_different_field_name # This makes sense
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15812) [R] Allow user to supply col_names argument when reading in a CSV dataset

2022-11-21 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-15812:
--

Assignee: Will Jones

> [R] Allow user to supply col_names argument when reading in a CSV dataset
> -
>
> Key: ARROW-15812
> URL: https://issues.apache.org/jira/browse/ARROW-15812
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Will Jones
>Priority: Major
>
> Allow the user to supply the {{col_names}} argument from {{readr}} when 
> reading in a dataset.  
> This is already possible when reading in a single CSV file via 
> {{arrow::read_csv_arrow()}} via the {{readr_to_csv_read_options}} function, 
> and so once the C++ functionality to autogenerate column names for Datasets 
> is implemented, we should hook up {{readr_to_csv_read_options}} in 
> {{csv_file_format_read_opts}} just like we have with 
> {{readr_to_csv_parse_options}} in {{csv_file_format_parse_options}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-15812) [R] Allow user to supply col_names argument when reading in a CSV dataset

2022-11-21 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636918#comment-17636918
 ] 

Will Jones commented on ARROW-15812:


Auto-generation of column names was added to Datasets in 
https://issues.apache.org/jira/browse/ARROW-16436

> [R] Allow user to supply col_names argument when reading in a CSV dataset
> -
>
> Key: ARROW-15812
> URL: https://issues.apache.org/jira/browse/ARROW-15812
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>
> Allow the user to supply the {{col_names}} argument from {{readr}} when 
> reading in a dataset.  
> This is already possible when reading in a single CSV file via 
> {{arrow::read_csv_arrow()}} via the {{readr_to_csv_read_options}} function, 
> and so once the C++ functionality to autogenerate column names for Datasets 
> is implemented, we should hook up {{readr_to_csv_read_options}} in 
> {{csv_file_format_read_opts}} just like we have with 
> {{readr_to_csv_parse_options}} in {{csv_file_format_parse_options}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15470) [R] Allows user to specify string to be used for missing data when writing CSV dataset

2022-11-18 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-15470:
--

Assignee: Will Jones

> [R] Allows user to specify string to be used for missing data when writing 
> CSV dataset
> --
>
> Key: ARROW-15470
> URL: https://issues.apache.org/jira/browse/ARROW-15470
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Will Jones
>Priority: Major
>
> The ability to select the string to be used for missing data was implemented 
> for the CSV Writer in ARROW-14903 and as David Li points out below, is 
> available, so I think we just need to hook it up on the R side.
> This requires the values passed in as the "na" argument to be instead passed 
> through to "null_strings", similarly to what has been done with "skip" and 
> "skip_rows" in ARROW-15743.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18355) [R] support the quoted_na argument in open_dataset for CSVs by mapping it to CSVConvertOptions$strings_can_be_null

2022-11-18 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636012#comment-17636012
 ] 

Will Jones commented on ARROW-18355:


This feature is "soft-deprecated" in readr. Do we still want to add support?

> [R] support the quoted_na argument in open_dataset for CSVs by mapping it to 
> CSVConvertOptions$strings_can_be_null
> --
>
> Key: ARROW-18355
> URL: https://issues.apache.org/jira/browse/ARROW-18355
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Will Jones
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18355) [R] support the quoted_na argument in open_dataset for CSVs by mapping it to CSVConvertOptions$strings_can_be_null

2022-11-18 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-18355:
--

Assignee: Will Jones

> [R] support the quoted_na argument in open_dataset for CSVs by mapping it to 
> CSVConvertOptions$strings_can_be_null
> --
>
> Key: ARROW-18355
> URL: https://issues.apache.org/jira/browse/ARROW-18355
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Will Jones
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18359) PrettyPrint Improvements

2022-11-17 Thread Will Jones (Jira)
Will Jones created ARROW-18359:
--

 Summary: PrettyPrint Improvements
 Key: ARROW-18359
 URL: https://issues.apache.org/jira/browse/ARROW-18359
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python, R
Reporter: Will Jones


We have some pretty printing capabilities, but we may want to think at a high 
level about the design first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-15026) [Python] datetime.timedelta to pyarrow.duration('us') silently overflows

2022-11-15 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones resolved ARROW-15026.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 13718
[https://github.com/apache/arrow/pull/13718]

> [Python] datetime.timedelta to pyarrow.duration('us') silently overflows
> 
>
> Key: ARROW-15026
> URL: https://issues.apache.org/jira/browse/ARROW-15026
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Andreas Rappold
>Assignee: Anja Boskovic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
>  
> Hi! This reproduces the issue:
> {code:java}
> # python 3.9.9
> # pyarrow 6.0.1
> import datetime
> import pyarrow
> d = datetime.timedelta(days=-106751992, seconds=71945, microseconds=224192)
> pyarrow.scalar(d)
> #  microseconds=224192)>
> pyarrow.scalar(d).as_py() == d
> # True
> d2 = d - datetime.timedelta(microseconds=1)
> pyarrow.scalar(d2)
> #  microseconds=775807)>
> pyarrow.scalar(d2).as_py() == d2
> # False{code}
> Other conversions (e.g. to int*) raise an exception instead. I didn't check 
> if duration overflows for too large timedeltas. If its easy to fix, point me 
> in the right direction and I try to create a PR. Thanks
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2022-11-15 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-14196:
--

Assignee: Will Jones

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Assignee: Will Jones
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]
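> For illustration, a minimal sketch of opting in from Python today (this assumes 
> the option is exposed in {{pyarrow.parquet.write_table}} as 
> {{use_compliant_nested_type}}, per ARROW-11497):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({"col": [[1, 2], [3]]})
> # Write list columns with the Parquet-spec-compliant "element" child name;
> # this issue is about making that behaviour the default.
> pq.write_table(table, "nested_compliant.parquet", use_compliant_nested_type=True)
> {code}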



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17812) [C++][Documentation] Add Gandiva User Guide

2022-11-08 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones resolved ARROW-17812.

Resolution: Fixed

Issue resolved by pull request 14200
[https://github.com/apache/arrow/pull/14200]

> [C++][Documentation] Add Gandiva User Guide
> ---
>
> Key: ARROW-17812
> URL: https://issues.apache.org/jira/browse/ARROW-17812
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments

2022-11-04 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-18246:
---
Component/s: Documentation

> [Python][Docs] PyArrow table join docstring typos for left and right suffix 
> arguments
> -
>
> Key: ARROW-18246
> URL: https://issues.apache.org/jira/browse/ARROW-18246
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation
>Reporter: d33bs
>Assignee: Will Jones
>Priority: Minor
>  Labels: docs-impacting, documentation, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hello, thank you for all the amazing work on Arrow! I'd like to report a 
> potential issue with PyArrow's Table Join docstring which may make it 
> confusing for others to read. I believe this content is translated into the 
> documentation website as well.
> The content which needs to be corrected may be found starting at: 
> [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737]
> The block currently reads:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to right column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffic to add to the left column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> It could be improved with the following:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to left column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffix to add to the right column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> Please let me know if I may clarify or if there are any questions on the 
> above. Thanks again for your help!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments

2022-11-04 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-18246:
---
Fix Version/s: 11.0.0

> [Python][Docs] PyArrow table join docstring typos for left and right suffix 
> arguments
> -
>
> Key: ARROW-18246
> URL: https://issues.apache.org/jira/browse/ARROW-18246
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation
>Reporter: d33bs
>Assignee: Will Jones
>Priority: Minor
>  Labels: docs-impacting, documentation, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hello, thank you for all the amazing work on Arrow! I'd like to report a 
> potential issue with PyArrow's Table Join docstring which may make it 
> confusing for others to read. I believe this content is translated into the 
> documentation website as well.
> The content which needs to be corrected may be found starting at: 
> [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737]
> The block currently reads:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to right column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffic to add to the left column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> It could be improved with the following:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to left column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffix to add to the right column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> Please let me know if I may clarify or if there are any questions on the 
> above. Thanks again for your help!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments

2022-11-04 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629039#comment-17629039
 ] 

Will Jones commented on ARROW-18246:


Thanks for reporting. I have created an update fixing those and a couple other 
issues in the docs.

> [Python][Docs] PyArrow table join docstring typos for left and right suffix 
> arguments
> -
>
> Key: ARROW-18246
> URL: https://issues.apache.org/jira/browse/ARROW-18246
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: d33bs
>Assignee: Will Jones
>Priority: Minor
>  Labels: docs-impacting, documentation
>
> Hello, thank you for all the amazing work on Arrow! I'd like to report a 
> potential issue with PyArrow's Table Join docstring which may make it 
> confusing for others to read. I believe this content is translated into the 
> documentation website as well.
> The content which needs to be corrected may be found starting at: 
> [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737]
> The block currently reads:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to right column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffic to add to the left column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> It could be improved with the following:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to left column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffix to add to the right column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> Please let me know if I may clarify or if there are any questions on the 
> above. Thanks again for your help!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments

2022-11-04 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-18246:
--

Assignee: Will Jones

> [Python][Docs] PyArrow table join docstring typos for left and right suffix 
> arguments
> -
>
> Key: ARROW-18246
> URL: https://issues.apache.org/jira/browse/ARROW-18246
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: d33bs
>Assignee: Will Jones
>Priority: Minor
>  Labels: docs-impacting, documentation
>
> Hello, thank you for all the amazing work on Arrow! I'd like to report a 
> potential issue with PyArrow's Table Join docstring which may make it 
> confusing for others to read. I believe this content is translated into the 
> documentation website as well.
> The content which needs to be corrected may be found starting at: 
> [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737]
> The block currently reads:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to right column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffic to add to the left column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> It could be improved with the following:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to left column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffix to add to the right column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> Please let me know if I may clarify or if there are any questions on the 
> above. Thanks again for your help!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-18245) wheels for PyArrow + Python 3.11

2022-11-04 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones closed ARROW-18245.
--
Resolution: Duplicate

Hello! This is being actively worked on in ARROW-17487. I've closed this ticket 
since it duplicates that one.

> wheels for PyArrow + Python 3.11
> 
>
> Key: ARROW-18245
> URL: https://issues.apache.org/jira/browse/ARROW-18245
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 10.0.0
> Environment: Linux RH8
>Reporter: Aleksandar
>Priority: Minor
>
> Hi,
> May we know the plan for when the PyPI pyarrow 10 package will have build 
> dependencies installed as part of the package? Right now the pyarrow 10 package 
> has no wheels for Python 3.11.0.
> Maybe this is not the right forum, but someone is maintaining and packaging 
> these things for developers.
> Thanks much, and sorry for intruding...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18228) AWS Error SLOW_DOWN during PutObject operation

2022-11-04 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629016#comment-17629016
 ] 

Will Jones commented on ARROW-18228:


If you are still getting errors, it might be worth reviewing how you could slow 
your app down somewhat so it doesn't hit these errors.

[https://aws.amazon.com/premiumsupport/knowledge-center/http-5xx-errors-s3/]

[https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html]

[https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html]

I'm not sure if we have any other settings to limit concurrent requests or tune 
the backoff strategy, but that might be helpful for cases like this.

> AWS Error SLOW_DOWN during PutObject operation
> --
>
> Key: ARROW-18228
> URL: https://issues.apache.org/jira/browse/ARROW-18228
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 10.0.0
>Reporter: Vadym Dytyniak
>Priority: Major
>
> We use Dask to parallelise read/write operations and pyarrow to write dataset 
> from worker nodes.
> After pyarrow released version 10.0.0, our data flows automatically switched 
> to the latest version and some of them started to fail with the following 
> error:
> {code:java}
> File "/usr/local/lib/python3.10/dist-packages/org/store/storage.py", line 
> 768, in _write_partition
> ds.write_dataset(
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 988, in write_dataset
> _filesystemdataset_write(
>   File "pyarrow/_dataset.pyx", line 2859, in 
> pyarrow._dataset._filesystemdataset_write
> check_status(CFileSystemDataset.Write(c_options, c_scanner))
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> raise IOError(message)
> OSError: When creating key 'equities.us.level2.by_security/' in bucket 
> 'org-prod': AWS Error SLOW_DOWN during PutObject operation: Please reduce 
> your request rate. {code}
> In total flow failed many times: most failed with the error above, but one 
> failed with:
> {code:java}
> File "/usr/local/lib/python3.10/dist-packages/chronos/store/storage.py", line 
> 857, in _load_partition
> table = ds.dataset(
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 752, in dataset
> return _filesystem_dataset(source, **kwargs)
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 444, in _filesystem_dataset
> fs, paths_or_selector = _ensure_single_source(source, filesystem)
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 411, in _ensure_single_source
> file_info = filesystem.get_file_info(path)
>   File "pyarrow/_fs.pyx", line 564, in pyarrow._fs.FileSystem.get_file_info
> info = GetResultValue(self.fs.GetFileInfo(path))
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
> return check_status(status)
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> raise IOError(message)
> OSError: When getting information for key 
> 'ns/date=2022-10-31/channel=4/feed=A/9f41f928eedc431ca695a7ffe5fc60c2-0.parquet'
>  in bucket 'org-poc': AWS Error NETWORK_CONNECTION during HeadObject 
> operation: curlCode: 28, Timeout was reached {code}
>  
> Do you have any idea what was changed for dataset write between 9.0.0 and 
> 10.0.0 to help us to fix the issue?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18228) AWS Error SLOW_DOWN during PutObject operation

2022-11-03 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628391#comment-17628391
 ] 

Will Jones commented on ARROW-18228:


I think this may have been caused by 
https://issues.apache.org/jira/browse/ARROW-17057

In 10.0.0, we exposed the retry strategy in Python, but we set the default 
number of retries to 3, while I believe it was previously set to 10 in the 
underlying C++ code.

Could you try setting {{max_attempts=10}} in your code:

{code:python}
from pyarrow.fs import AwsDefaultS3RetryStrategy, S3FileSystem

fs = S3FileSystem(retry_strategy=AwsDefaultS3RetryStrategy(max_attempts=10))
{code}

> AWS Error SLOW_DOWN during PutObject operation
> --
>
> Key: ARROW-18228
> URL: https://issues.apache.org/jira/browse/ARROW-18228
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 10.0.0
>Reporter: Vadym Dytyniak
>Priority: Major
>
> We use Dask to parallelise read/write operations and pyarrow to write dataset 
> from worker nodes.
> After pyarrow released version 10.0.0, our data flows automatically switched 
> to the latest version and some of them started to fail with the following 
> error:
> {code:java}
> File "/usr/local/lib/python3.10/dist-packages/org/store/storage.py", line 
> 768, in _write_partition
> ds.write_dataset(
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 988, in write_dataset
> _filesystemdataset_write(
>   File "pyarrow/_dataset.pyx", line 2859, in 
> pyarrow._dataset._filesystemdataset_write
> check_status(CFileSystemDataset.Write(c_options, c_scanner))
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> raise IOError(message)
> OSError: When creating key 'equities.us.level2.by_security/' in bucket 
> 'org-prod': AWS Error SLOW_DOWN during PutObject operation: Please reduce 
> your request rate. {code}
> In total flow failed many times: most failed with the error above, but one 
> failed with:
> {code:java}
> File "/usr/local/lib/python3.10/dist-packages/chronos/store/storage.py", line 
> 857, in _load_partition
> table = ds.dataset(
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 752, in dataset
> return _filesystem_dataset(source, **kwargs)
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 444, in _filesystem_dataset
> fs, paths_or_selector = _ensure_single_source(source, filesystem)
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 411, in _ensure_single_source
> file_info = filesystem.get_file_info(path)
>   File "pyarrow/_fs.pyx", line 564, in pyarrow._fs.FileSystem.get_file_info
> info = GetResultValue(self.fs.GetFileInfo(path))
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
> return check_status(status)
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> raise IOError(message)
> OSError: When getting information for key 
> 'ns/date=2022-10-31/channel=4/feed=A/9f41f928eedc431ca695a7ffe5fc60c2-0.parquet'
>  in bucket 'org-poc': AWS Error NETWORK_CONNECTION during HeadObject 
> operation: curlCode: 28, Timeout was reached {code}
>  
> Do you have any idea what was changed for dataset write between 9.0.0 and 
> 10.0.0 to help us to fix the issue?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18210) [C++][Parquet] Skip check in StreamWriter

2022-11-03 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628379#comment-17628379
 ] 

Will Jones commented on ARROW-18210:


Created https://issues.apache.org/jira/browse/ARROW-18239

> [C++][Parquet] Skip check in StreamWriter
> -
>
> Key: ARROW-18210
> URL: https://issues.apache.org/jira/browse/ARROW-18210
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Affects Versions: 10.0.0
>Reporter: Madhur
>Priority: Major
>
> Currently StreamWriter is slower only because of the column checking. If we 
> allowed a customization option (maybe a constructor argument) to skip the check, 
> could StreamWriter be more efficient?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18239) [C++][Docs] Add examples of Parquet TypedColumnWriter to user guide

2022-11-03 Thread Will Jones (Jira)
Will Jones created ARROW-18239:
--

 Summary: [C++][Docs] Add examples of Parquet TypedColumnWriter to 
user guide
 Key: ARROW-18239
 URL: https://issues.apache.org/jira/browse/ARROW-18239
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Will Jones


Since this is the more performant non-Arrow way to write Parquet data, we 
should show examples of it in the user guide.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18230) [Python] Pass Cmake args to Python CPP

2022-11-02 Thread Will Jones (Jira)
Will Jones created ARROW-18230:
--

 Summary: [Python] Pass Cmake args to Python CPP 
 Key: ARROW-18230
 URL: https://issues.apache.org/jira/browse/ARROW-18230
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Will Jones
 Fix For: 11.0.0


We pass {{extra_cmake_args}} to {{_run_cmake}} (Cython build) but not to 
{{_run_cmake_pyarrow_cpp}} (PyArrow C++ build). We should probably be passing it 
to both.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18204) [R] Allow setting field metadata

2022-10-31 Thread Will Jones (Jira)
Will Jones created ARROW-18204:
--

 Summary: [R] Allow setting field metadata
 Key: ARROW-18204
 URL: https://issues.apache.org/jira/browse/ARROW-18204
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 10.0.0
Reporter: Will Jones


Currently, you can't create a {{Field}} with metadata in R, which makes it hard 
to write tests regarding field metadata.
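
For comparison, a minimal sketch of what pyarrow already allows, which the R 
bindings could mirror:

{code:python}
import pyarrow as pa

# pyarrow allows attaching metadata when constructing a field
f = pa.field("x", pa.int32(), metadata={"description": "an example column"})
print(f.metadata)  # {b'description': b'an example column'}
{code}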



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14999) [C++] List types with different field names are not equal

2022-10-24 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623449#comment-17623449
 ] 

Will Jones commented on ARROW-14999:


So here are the conclusions I've gathered so far:

1. Equality of ListTypes and MapTypes behaves differently right now: list 
types with different field names are unequal, but map types with different 
field names are equal. We should make this behavior consistent and probably 
have an option in the {{.Equals()}} method to toggle checking these internal 
field names.
2. For extension arrays, it's important that we preserve these field names in 
most operations. That means that even if the default behavior is to ignore 
field names in equality for List/Map, unit tests for functions should check for 
field name equality.

Right now I'm leaning toward the default for checking equality being to 
ignore field names for List/Map (obviously not for struct) in cases where we 
also don't check metadata. For example, {{TypeEquals()}} will check metadata 
and field names, while {{DataType::Equals()}} will not.
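
A minimal sketch of the inconsistency in point 1, reflecting the behavior 
described above:

{code:python}
import pyarrow as pa

# List types: internal field names participate in equality
l1 = pa.list_(pa.field("val", pa.int64()))
l2 = pa.list_(pa.int64())          # default field name is "item"
print(l1.equals(l2))               # False

# Map types: internal field names are ignored in equality
m1 = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
m2 = pa.map_(pa.int32(), pa.int32())
print(m1.equals(m2))               # True
{code}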

> [C++] List types with different field names are not equal
> -
>
> Key: ARROW-14999
> URL: https://issues.apache.org/jira/browse/ARROW-14999
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 6.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When comparing map types, the names of the fields are ignored. This was 
> introduced in ARROW-7173.
> However for list types, they are not ignored. For example,
> {code:python}
> In [6]: l1 = pa.list_(pa.field("val", pa.int64()))
> In [7]: l2 = pa.list_(pa.int64())
> In [8]: l1
> Out[8]: ListType(list<val: int64>)
> In [9]: l2
> Out[9]: ListType(list<item: int64>)
> In [10]: l1 == l2
> Out[10]: False
> {code}
> Should we make list type comparison ignore field names too?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16817) [C++][Python] Segfaults for unsupported datatypes in the ORC writer

2022-10-18 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-16817:
--

Assignee: Will Jones  (was: Ian Alexander Joiner)

> [C++][Python] Segfaults for unsupported datatypes in the ORC writer
> ---
>
> Key: ARROW-16817
> URL: https://issues.apache.org/jira/browse/ARROW-16817
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Ian Alexander Joiner
>Assignee: Will Jones
>Priority: Major
>  Labels: good-first-issue, good-second-issue, 
> pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> In the ORC writer, if a table has at least one column with an unsupported 
> datatype, segfaults occur when we try to write it to ORC.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17994) [C++] Add overflow argument is required when it shouldn't be

2022-10-13 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-17994:
---
Attachment: generate_ibis_queries.py

> [C++] Add overflow argument is required when it shouldn't be
> 
>
> Key: ARROW-17994
> URL: https://issues.apache.org/jira/browse/ARROW-17994
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Will Jones
>Priority: Major
>  Labels: acero, substrait
> Fix For: 11.0.0
>
> Attachments: generate_ibis_queries.py, try_queries_acero.py
>
>
> If I pass a substrait plan that contains an add function, but don't provide 
> the nullability argument, I get the following error:
> {code:none}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/_substrait.pyx", line 140, in pyarrow._substrait.run_query
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Expected Substrait call to have an enum argument at 
> index 0 but the argument was not an enum.
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/extension_set.cc:684
>   call.GetEnumArg(arg_index)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/extension_set.cc:702
>   ParseEnumArg(call, 0, kOverflowParser)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:332
>   FromProto(expr, ext_set, conversion_options)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/serde.cc:156
>   FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), 
> ext_set, conversion_options)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:106
>   engine::DeserializePlans(substrait_buffer, consumer_factory, registry, 
> nullptr, conversion_options_)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:130
>   executor.Init(substrait_buffer, registry)
> {code}
> Yet in the spec, this argument is supposed to be optional: 
> https://github.com/substrait-io/substrait/blob/f3f6bdc947e689e800279666ff33f118e42d2146/extensions/functions_arithmetic.yaml#L11
> If I modify the plan to include the argument, it works as expected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17994) [C++] Add overflow argument is required when it shouldn't be

2022-10-13 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-17994:
---
Attachment: try_queries_acero.py

> [C++] Add overflow argument is required when it shouldn't be
> 
>
> Key: ARROW-17994
> URL: https://issues.apache.org/jira/browse/ARROW-17994
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Will Jones
>Priority: Major
>  Labels: acero, substrait
> Fix For: 11.0.0
>
> Attachments: generate_ibis_queries.py, try_queries_acero.py
>
>
> If I pass a substrait plan that contains an add function, but don't provide 
> the nullability argument, I get the following error:
> {code:none}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/_substrait.pyx", line 140, in pyarrow._substrait.run_query
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Expected Substrait call to have an enum argument at 
> index 0 but the argument was not an enum.
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/extension_set.cc:684
>   call.GetEnumArg(arg_index)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/extension_set.cc:702
>   ParseEnumArg(call, 0, kOverflowParser)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:332
>   FromProto(expr, ext_set, conversion_options)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/serde.cc:156
>   FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), 
> ext_set, conversion_options)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:106
>   engine::DeserializePlans(substrait_buffer, consumer_factory, registry, 
> nullptr, conversion_options_)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:130
>   executor.Init(substrait_buffer, registry)
> {code}
> Yet in the spec, this argument is supposed to be optional: 
> https://github.com/substrait-io/substrait/blob/f3f6bdc947e689e800279666ff33f118e42d2146/extensions/functions_arithmetic.yaml#L11
> If I modify the plan to include the argument, it works as expected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17069) [Python][R] GCSFIleSystem reports cannot resolve host on public buckets

2022-10-12 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-17069:
--

Assignee: Will Jones

> [Python][R] GCSFIleSystem reports cannot resolve host on public buckets
> ---
>
> Key: ARROW-17069
> URL: https://issues.apache.org/jira/browse/ARROW-17069
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python, R
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> GCSFileSystem will return {{Couldn't resolve host name}} if you don't supply 
> {{anonymous}} as the user:
> {code:python}
> import pyarrow.dataset as ds
> # Fails:
> dataset = 
> ds.dataset("gs://voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 749, in dataset
> return _filesystem_dataset(source, **kwargs)
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 441, in _filesystem_dataset
> fs, paths_or_selector = _ensure_single_source(source, filesystem)
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 408, in _ensure_single_source
> file_info = filesystem.get_file_info(path)
>   File "pyarrow/_fs.pyx", line 444, in pyarrow._fs.FileSystem.get_file_info
> info = GetResultValue(self.fs.GetFileInfo(path))
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
> return check_status(status)
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> raise IOError(message)
> OSError: google::cloud::Status(UNAVAILABLE: Retry policy exhausted in 
> GetObjectMetadata: EasyPerform() - CURL error [6]=Couldn't resolve host name)
> # This works fine:
> >>> dataset = 
> >>> ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
> {code}
> I would expect that we could connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17069) [Python][R] GCSFIleSystem reports cannot resolve host on public buckets

2022-10-12 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17616691#comment-17616691
 ] 

Will Jones commented on ARROW-17069:


Sure. I had done this earlier for R: 
https://arrow.apache.org/docs/r/articles/fs.html#gcs-authentication

I will make a PR to update the Python user guide.

> [Python][R] GCSFIleSystem reports cannot resolve host on public buckets
> ---
>
> Key: ARROW-17069
> URL: https://issues.apache.org/jira/browse/ARROW-17069
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python, R
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Priority: Critical
> Fix For: 10.0.0
>
>
> GCSFileSystem will return {{Couldn't resolve host name}} if you don't supply 
> {{anonymous}} as the user:
> {code:python}
> import pyarrow.dataset as ds
> # Fails:
> dataset = 
> ds.dataset("gs://voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 749, in dataset
> return _filesystem_dataset(source, **kwargs)
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 441, in _filesystem_dataset
> fs, paths_or_selector = _ensure_single_source(source, filesystem)
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 408, in _ensure_single_source
> file_info = filesystem.get_file_info(path)
>   File "pyarrow/_fs.pyx", line 444, in pyarrow._fs.FileSystem.get_file_info
> info = GetResultValue(self.fs.GetFileInfo(path))
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
> return check_status(status)
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> raise IOError(message)
> OSError: google::cloud::Status(UNAVAILABLE: Retry policy exhausted in 
> GetObjectMetadata: EasyPerform() - CURL error [6]=Couldn't resolve host name)
> # This works fine:
> >>> dataset = 
> >>> ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
> {code}
> I would expect that we could connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17994) [C++] Add overflow argument is required when it shouldn't be

2022-10-11 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17616096#comment-17616096
 ] 

Will Jones commented on ARROW-17994:


Example plan that is broken:

{code:json}
{
  "extensionUris": [
{
  "extensionUriAnchor": 1
}
  ],
  "extensions": [
{
  "extensionFunction": {
"extensionUriReference": 1,
"functionAnchor": 1,
"name": "equal"
  }
},
{
  "extensionFunction": {
"extensionUriReference": 1,
"functionAnchor": 2,
"name": "add"
  }
}
  ],
  "relations": [
{
  "root": {
"input": {
  "project": {
"input": {
  "read": {
"baseSchema": {
  "names": [
"a",
"b",
"c"
  ],
  "struct": {
"types": [
  {
"fp64": {
  "nullability": "NULLABILITY_NULLABLE"
}
  },
  {
"i64": {
  "nullability": "NULLABILITY_NULLABLE"
}
  },
  {
"fp64": {
  "nullability": "NULLABILITY_NULLABLE"
}
  }
],
"nullability": "NULLABILITY_REQUIRED"
  }
},
"namedTable": {
  "names": [
"table0"
  ]
}
  }
},
"expressions": [
  {
"scalarFunction": {
  "functionReference": 2,
  "outputType": {
"fp64": {
  "nullability": "NULLABILITY_NULLABLE"
}
  },
  "arguments": [
{
  "value": {
"selection": {
  "directReference": {
"structField": {}
  },
  "rootReference": {}
}
  }
},
{
  "value": {
"selection": {
  "directReference": {
"structField": {
  "field": 2
}
  },
  "rootReference": {}
}
  }
}
  ]
}
  }
]
  }
},
"names": [
  "a",
  "b",
  "c",
  "v"
]
  }
}
  ]
}
{code}

Example plan that works:

{code:json}
{
"extensionUris": [
  {
"extensionUriAnchor": 1
  }
],
"extensions": [
  {
"extensionFunction": {
  "extensionUriReference": 1,
  "functionAnchor": 1,
  "name": "equal"
}
  },
  {
"extensionFunction": {
  "extensionUriReference": 1,
  "functionAnchor": 2,
  "name": "add"
}
  }
],
"relations": [
  {
"root": {
  "input": {
"project": {
  "input": {
"read": {
  "baseSchema": {
"names": [
  "a",
  "b",
  "c"
],
"struct": {
  "types": [
{
  "fp64": {
"nullability": "NULLABILITY_NULLABLE"
  }
},
{
  "i64": {
"nullability": "NULLABILITY_NULLABLE"
  }
},
{
  "fp64": {
"nullability": "NULLABILITY_NULLABLE"
  }
}
  ],
  "nullability": "NULLABILITY_REQUIRED"
}
  },
  "namedTable": {
"names": [
  "table0"
]
  }
}
  },
  "expressions": [
{
  "scalarFunction": {
"functionReference": 2,
"outputType": {
  "fp64": {
"nullability": "NULLABILITY_NULLA

[jira] [Created] (ARROW-17994) [C++] Add overflow argument is required when it shouldn't be

2022-10-11 Thread Will Jones (Jira)
Will Jones created ARROW-17994:
--

 Summary: [C++] Add overflow argument is required when it shouldn't 
be
 Key: ARROW-17994
 URL: https://issues.apache.org/jira/browse/ARROW-17994
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Will Jones
 Fix For: 11.0.0


If I pass a substrait plan that contains an add function, but don't provide the 
nullability argument, I get the following error:

{code:none}
Traceback (most recent call last):
  File "", line 1, in 
  File "pyarrow/_substrait.pyx", line 140, in pyarrow._substrait.run_query
  File "pyarrow/error.pxi", line 144, in 
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected Substrait call to have an enum argument at 
index 0 but the argument was not an enum.
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/extension_set.cc:684
  call.GetEnumArg(arg_index)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/extension_set.cc:702
  ParseEnumArg(call, 0, kOverflowParser)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:332
  FromProto(expr, ext_set, conversion_options)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/serde.cc:156
  FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), 
ext_set, conversion_options)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:106
  engine::DeserializePlans(substrait_buffer, consumer_factory, registry, 
nullptr, conversion_options_)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:130
  executor.Init(substrait_buffer, registry)
{code}

Yet in the spec, this argument is supposed to be optional: 
https://github.com/substrait-io/substrait/blob/f3f6bdc947e689e800279666ff33f118e42d2146/extensions/functions_arithmetic.yaml#L11

If I modify the plan to include the argument, it works as expected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17963) [C++] Implement cast_dictionary for string

2022-10-07 Thread Will Jones (Jira)
Will Jones created ARROW-17963:
--

 Summary: [C++] Implement cast_dictionary for string
 Key: ARROW-17963
 URL: https://issues.apache.org/jira/browse/ARROW-17963
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Will Jones
 Fix For: 11.0.0


We can cast dictionary(int32(), string()) to string, but not the other way around.

{code:R}
> Array$create(c("a", "b"))$cast(dictionary(int32(), string()))
Error: NotImplemented: Unsupported cast from string to dictionary using 
function cast_dictionary
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/compute/function.cc:249  
func.DispatchBest(&in_types)

> Array$create(as.factor(c("a", "b")))$cast(string())
Array

[
  "a",
  "b"
]
{code}
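
Until the cast kernel exists, a workaround sketch in Python is to call 
{{dictionary_encode()}} directly rather than casting (assuming the equivalent 
compute function is reachable from R via {{call_function}}):

{code:python}
import pyarrow as pa

arr = pa.array(["a", "b", "a"])
# A direct cast to a dictionary type is not implemented, but dictionary
# encoding produces the equivalent dictionary<string> array.
dict_arr = arr.dictionary_encode()
print(dict_arr.type)  # dictionary<values=string, indices=int32, ordered=0>
{code}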



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17438) [R] glimpse() errors if there is a UDF

2022-10-06 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones resolved ARROW-17438.

Resolution: Duplicate

> [R] glimpse() errors if there is a UDF
> --
>
> Key: ARROW-17438
> URL: https://issues.apache.org/jira/browse/ARROW-17438
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 10.0.0
>
>
> Using the example from ARROW-17437:
> {code}
> register_scalar_function(
>   "test", 
>   function(context, x) paste(x, collapse=","), 
>   utf8(), 
>   utf8(), 
>   auto_convert=TRUE
> )
> Table$create(x = c("a", "b", "c")) |>
>   transmute(test(x)) |>
>   glimpse()
> # Table (query)
> # 3 rows x 1 columns
> # Error in `dplyr::collect()`:
> # ! NotImplemented: Call to R (resolve scalar user-defined function output 
> data type) from a non-R thread from an unsupported context
> # Run `rlang::last_error()` to see where the error occurred.
> {code}
> A variety of things could fix this:
> * Supporting UDFs in any query (I think there's a draft PR open for this)
> * The limit operator (FetchNode?) so that {{head()}} is handled in the 
> ExecPlan and we don't need to use the RecordBatchReader workaround to get it 
> efficiently (also PR in the works)
> * Worst case, error more informatively  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17438) [R] glimpse() errors if there is a UDF

2022-10-06 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613721#comment-17613721
 ] 

Will Jones commented on ARROW-17438:


I just tested, and this is now fixed. (I believe in ARROW-17178.) cc 
[~paleolimbot] 

> [R] glimpse() errors if there is a UDF
> --
>
> Key: ARROW-17438
> URL: https://issues.apache.org/jira/browse/ARROW-17438
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 10.0.0
>
>
> Using the example from ARROW-17437:
> {code}
> register_scalar_function(
>   "test", 
>   function(context, x) paste(x, collapse=","), 
>   utf8(), 
>   utf8(), 
>   auto_convert=TRUE
> )
> Table$create(x = c("a", "b", "c")) |>
>   transmute(test(x)) |>
>   glimpse()
> # Table (query)
> # 3 rows x 1 columns
> # Error in `dplyr::collect()`:
> # ! NotImplemented: Call to R (resolve scalar user-defined function output 
> data type) from a non-R thread from an unsupported context
> # Run `rlang::last_error()` to see where the error occurred.
> {code}
> A variety of things could fix this:
> * Supporting UDFs in any query (I think there's a draft PR open for this)
> * The limit operator (FetchNode?) so that {{head()}} is handled in the 
> ExecPlan and we don't need to use the RecordBatchReader workaround to get it 
> efficiently (also PR in the works)
> * Worst case, error more informatively  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-16897) [R][C++] Full join on Arrow objects is incorrect

2022-10-06 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones closed ARROW-16897.
--
Resolution: Duplicate

> [R][C++] Full join on Arrow objects is incorrect
> 
>
> Key: ARROW-16897
> URL: https://issues.apache.org/jira/browse/ARROW-16897
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 8.0.0, 9.0.0
> Environment: Linux
>Reporter: Oliver Reiter
>Assignee: Weston Pace
>Priority: Critical
>  Labels: joins, query-engine
> Fix For: 10.0.0
>
>
> Hello,
> I am trying to do a full join on a dataset. It produces the correct number of 
> observations, but not the correct result (the resulting data.frame is just 
> filled up with NA-rows).
> My use case: I want to include the 'full' year range for every factor value:
> {code:java}
> library(data.table)
> library(arrow)
> library(dplyr)
> year_range <- 2000:2019
> group_n <- 100
> N <- 1000 ## the resulting data should have 100 groups * 20 years
> dt <- data.table(value = rnorm(N),
>                  group = rep(paste0("g", 1:group_n), length.out = N))
> ## there are only observations for some years in every group
> dt[, year := sample(year_range, size = N / group_n), by = .(group)]
> dt[group == "g1", ]
> ## this would be the 'full' data.table
> group_years <- data.table(group = rep(unique(dt$group), each = 20),
>                           year = rep(year_range, times = 10))
> group_years[group == "g1", ]
> write_dataset(dt, path = "parquet_db")
> db <- open_dataset(sources = "parquet_db")
> ## full_join using data.table -> expected result
> db_full <- merge(dt, group_years,
>                  by = c("group", "year"),
>                  all = TRUE)
> setorder(db_full, group, year)
> db_full[group == "g1", ]
> ## try to do the full_join with arrow -> incorrect result
> db_full_arrow <- db |>
>   full_join(group_years, by = c("group", "year")) |>
>   collect() |>
>   setDT()
> setorder(db_full_arrow, group, year)
> db_full_arrow[group == "g1", ]
> ## or: convert data.table to arrow_table beforehand -> incorrect result
> group_years_arrow <- group_years |>
>   as_arrow_table()
> db_full_arrow <- db |>
>   full_join(group_years_arrow, by = c("group", "year")) |>
>   collect() |>
>   setDT()
> setorder(db_full_arrow, group, year)
> db_full_arrow[group == "g1", ]{code}
> The [documentation|https://arrow.apache.org/docs/r/] says equality joins are 
> supported, which should hold also for `full_join` I guess?
> Thanks for your time and work!
>  
> Oliver



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17149) [R] Enable GCS tests for Windows

2022-10-06 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-17149:
---
Fix Version/s: 11.0.0
   (was: 10.0.0)

> [R] Enable GCS tests for Windows
> 
>
> Key: ARROW-17149
> URL: https://issues.apache.org/jira/browse/ARROW-17149
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Affects Versions: 9.0.0
>Reporter: Will Jones
>Priority: Major
> Fix For: 11.0.0
>
>
> In ARROW-16879, I found the GCS tests were hanging in CI, but couldn't 
> diagnose why. We should solve that and enable the tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17954) [R] Update News for 10.0.0

2022-10-06 Thread Will Jones (Jira)
Will Jones created ARROW-17954:
--

 Summary: [R] Update News for 10.0.0
 Key: ARROW-17954
 URL: https://issues.apache.org/jira/browse/ARROW-17954
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 10.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14342) [Python] Add support for the SSO credential provider

2022-10-05 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613238#comment-17613238
 ] 

Will Jones commented on ARROW-14342:


In the meantime, you can work around this by using boto3 to resolve the SSO 
credentials:

{code:Python}
from boto3 import Session
from pyarrow.fs import S3FileSystem

session = Session()
credentials = session.get_credentials()

# Pass the resolved SSO credentials explicitly to the Arrow filesystem
s3 = S3FileSystem(
    access_key=credentials.access_key,
    secret_key=credentials.secret_key,
    session_token=credentials.token,
)
{code}

> [Python] Add support for the SSO credential provider
> 
>
> Key: ARROW-14342
> URL: https://issues.apache.org/jira/browse/ARROW-14342
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 3.0.0, 5.0.0
>Reporter: Björn Boschman
>Priority: Major
> Fix For: 11.0.0
>
>
> Not sure about other languages
>  see also: [https://github.com/boto/botocore/pull/2070]
> {code:java}
> from pyarrow.fs import S3FileSystem 
> bucket = 'some-bucket-with-read-access' 
> key = 'some-existing-key' 
> s3 = S3FileSystem() 
> s3.open_input_file(f'{bucket}/{key}'){code}
>  
>  results in
>  
> {code:java}
> Traceback (most recent call last):
>   File "test.py", line 7, in 
> s3.open_input_file(f'{bucket}/{key}')
>   File "pyarrow/_fs.pyx", line 587, in pyarrow._fs.FileSystem.open_input_file
>   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> OSError: When reading information for key 'some-existing-key' in bucket 
> 'some-bucket-with-read-access': AWS Error [code 15]: No response body.
> {code}
>  
> without sso creds supported - shouldn't it raise some kind of AccessDenied 
> Exception?
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-14342) Add support for the SSO credential provider

2022-10-05 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-14342:
---
Fix Version/s: 11.0.0

> Add support for the SSO credential provider
> ---
>
> Key: ARROW-14342
> URL: https://issues.apache.org/jira/browse/ARROW-14342
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 3.0.0, 5.0.0
>Reporter: Björn Boschman
>Priority: Major
> Fix For: 11.0.0
>
>
> Not sure about other languages
>  see also: [https://github.com/boto/botocore/pull/2070]
> {code:java}
> from pyarrow.fs import S3FileSystem 
> bucket = 'some-bucket-with-read-access' 
> key = 'some-existing-key' 
> s3 = S3FileSystem() 
> s3.open_input_file(f'{bucket}/{key}'){code}
>  
>  results in
>  
> {code:java}
> Traceback (most recent call last):
>   File "test.py", line 7, in 
> s3.open_input_file(f'{bucket}/{key}')
>   File "pyarrow/_fs.pyx", line 587, in pyarrow._fs.FileSystem.open_input_file
>   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> OSError: When reading information for key 'some-existing-key' in bucket 
> 'some-bucket-with-read-access': AWS Error [code 15]: No response body.
> {code}
>  
> without sso creds supported - shouldn't it raise some kind of AccessDenied 
> Exception?
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-14342) [Python] Add support for the SSO credential provider

2022-10-05 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-14342:
---
Summary: [Python] Add support for the SSO credential provider  (was: Add 
support for the SSO credential provider)

> [Python] Add support for the SSO credential provider
> 
>
> Key: ARROW-14342
> URL: https://issues.apache.org/jira/browse/ARROW-14342
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 3.0.0, 5.0.0
>Reporter: Björn Boschman
>Priority: Major
> Fix For: 11.0.0
>
>
> Not sure about other languages
>  see also: [https://github.com/boto/botocore/pull/2070]
> {code:java}
> from pyarrow.fs import S3FileSystem 
> bucket = 'some-bucket-with-read-access' 
> key = 'some-existing-key' 
> s3 = S3FileSystem() 
> s3.open_input_file(f'{bucket}/{key}'){code}
>  
>  results in
>  
> {code:java}
> Traceback (most recent call last):
>   File "test.py", line 7, in 
> s3.open_input_file(f'{bucket}/{key}')
>   File "pyarrow/_fs.pyx", line 587, in pyarrow._fs.FileSystem.open_input_file
>   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> OSError: When reading information for key 'some-existing-key' in bucket 
> 'some-bucket-with-read-access': AWS Error [code 15]: No response body.
> {code}
>  
> without sso creds supported - shouldn't it raise some kind of AccessDenied 
> Exception?
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14342) Add support for the SSO credential provider

2022-10-05 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613235#comment-17613235
 ] 

Will Jones commented on ARROW-14342:


SSO support was added in aws-sdk-cpp 1.9. Once we upgrade that dependency we 
should automatically get support for this.

https://github.com/aws/aws-sdk-cpp/issues/1433#issuecomment-1079267499

> Add support for the SSO credential provider
> ---
>
> Key: ARROW-14342
> URL: https://issues.apache.org/jira/browse/ARROW-14342
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 3.0.0, 5.0.0
>Reporter: Björn Boschman
>Priority: Major
>
> Not sure about other languages
>  see also: [https://github.com/boto/botocore/pull/2070]
> {code:java}
> from pyarrow.fs import S3FileSystem 
> bucket = 'some-bucket-with-read-access' 
> key = 'some-existing-key' 
> s3 = S3FileSystem() 
> s3.open_input_file(f'{bucket}/{key}'){code}
>  
>  results in
>  
> {code:java}
> Traceback (most recent call last):
>   File "test.py", line 7, in 
> s3.open_input_file(f'{bucket}/{key}')
>   File "pyarrow/_fs.pyx", line 587, in pyarrow._fs.FileSystem.open_input_file
>   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> OSError: When reading information for key 'some-existing-key' in bucket 
> 'some-bucket-with-read-access': AWS Error [code 15]: No response body.
> {code}
>  
> without sso creds supported - shouldn't it raise some kind of AccessDenied 
> Exception?
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17944) [Python] Accept bytes object in pyarrow.substrait.run_query

2022-10-05 Thread Will Jones (Jira)
Will Jones created ARROW-17944:
--

 Summary: [Python] Accept bytes object in 
pyarrow.substrait.run_query
 Key: ARROW-17944
 URL: https://issues.apache.org/jira/browse/ARROW-17944
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Will Jones
 Fix For: 11.0.0


{{pyarrow.substrait.run_query()}} only accepts a PyArrow buffer, and will 
segfault if something else is passed. People might try to pass a Python bytes 
object, which isn't unreasonable. For example, they might use the value 
returned by protobuf's {{SerializeToString()}} function, which is Python bytes. 
At the very least, we should not segfault.
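
In the meantime, a workaround sketch is to wrap the bytes in a PyArrow buffer 
before calling {{run_query}} (the plan bytes below are a hypothetical placeholder):

{code:python}
import pyarrow as pa
import pyarrow.substrait as substrait

plan_bytes = b"..."  # e.g. the result of a protobuf Plan's SerializeToString()

# Wrapping the bytes in a Buffer avoids passing an unsupported type
reader = substrait.run_query(pa.py_buffer(plan_bytes))
table = reader.read_all()
{code}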



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17349) [C++] Add casting support for map type

2022-10-05 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613030#comment-17613030
 ] 

Will Jones commented on ARROW-17349:


Yes, I've updated the title. Casting lists was only broken if the list was inside 
a map. The only reason casting maps looked as if it was working was because of 
the early return if types are "equal" (and maps are "equal" even if they have 
different field names).

> [C++] Add casting support for map type
> --
>
> Key: ARROW-17349
> URL: https://issues.apache.org/jira/browse/ARROW-17349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 9.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: good-first-issue, kernel, pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Different parquet implementations use different field names for internal 
> fields of ListType and MapType, which can sometimes cause silly conflicts. 
> For example, we use {{item}} as the field name for list, but Spark uses 
> {{element}}. Fortunately, we can automatically cast between List and Map 
> Types with different field names. Unfortunately, it only works at the top 
> level. We should get it to work at arbitrary levels of nesting.
> This was discovered in delta-rs: 
> https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285
> Here's a reproduction in Python:
> {code:Python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> def roundtrip_scanner(in_arr, out_type):
> table = pa.table({"arr": in_arr})
> pq.write_table(table, "test.parquet")
> schema = pa.schema({"arr": out_type})
> ds.dataset("test.parquet", schema=schema).to_table()
> # MapType
> ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
> ty = pa.map_(pa.int32(), pa.int32())
> arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # ListType
> ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False))
> ty = pa.list_(pa.int32())
> arr_named = pa.array([[1, 2, 4]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # Combination MapType and ListType
> ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", 
> pa.int32(), nullable=True)), nullable=False))
> ty = pa.map_(pa.string(), pa.list_(pa.int32()))
> arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # Traceback (most recent call last):
> #   File "", line 1, in 
> #   File "", line 5, in roundtrip_scanner
> #   File "pyarrow/_dataset.pyx", line 331, in 
> pyarrow._dataset.Dataset.to_table
> #   File "pyarrow/_dataset.pyx", line 2577, in 
> pyarrow._dataset.Scanner.to_table
> #   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
> #   File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
> # pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map list> from map ('arr')>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17349) [C++] Add casting support for map type

2022-10-05 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-17349:
---
Summary: [C++] Add casting support for map type  (was: [C++] Support 
casting field names of list and map when nested)

> [C++] Add casting support for map type
> --
>
> Key: ARROW-17349
> URL: https://issues.apache.org/jira/browse/ARROW-17349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 9.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: good-first-issue, kernel, pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Different parquet implementations use different field names for internal 
> fields of ListType and MapType, which can sometimes cause silly conflicts. 
> For example, we use {{item}} as the field name for list, but Spark uses 
> {{element}}. Fortunately, we can automatically cast between List and Map 
> Types with different field names. Unfortunately, it only works at the top 
> level. We should get it to work at arbitrary levels of nesting.
> This was discovered in delta-rs: 
> https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285
> Here's a reproduction in Python:
> {code:Python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> def roundtrip_scanner(in_arr, out_type):
> table = pa.table({"arr": in_arr})
> pq.write_table(table, "test.parquet")
> schema = pa.schema({"arr": out_type})
> ds.dataset("test.parquet", schema=schema).to_table()
> # MapType
> ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
> ty = pa.map_(pa.int32(), pa.int32())
> arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # ListType
> ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False))
> ty = pa.list_(pa.int32())
> arr_named = pa.array([[1, 2, 4]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # Combination MapType and ListType
> ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", 
> pa.int32(), nullable=True)), nullable=False))
> ty = pa.map_(pa.string(), pa.list_(pa.int32()))
> arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # Traceback (most recent call last):
> #   File "", line 1, in 
> #   File "", line 5, in roundtrip_scanner
> #   File "pyarrow/_dataset.pyx", line 331, in 
> pyarrow._dataset.Dataset.to_table
> #   File "pyarrow/_dataset.pyx", line 2577, in 
> pyarrow._dataset.Scanner.to_table
> #   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
> #   File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
> # pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map list> from map ('arr')>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17923) [C++] Consider dictionary arrays for special fragment fields

2022-10-03 Thread Will Jones (Jira)
Will Jones created ARROW-17923:
--

 Summary: [C++] Consider dictionary arrays for special fragment 
fields
 Key: ARROW-17923
 URL: https://issues.apache.org/jira/browse/ARROW-17923
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Will Jones


I noticed in ARROW-15281 we made {{__filename}} a string column. In common 
cases, this will be inefficient if materialized. If possible, it may be better 
to have these columns be dictionary arrays.

As an example, 
[here|https://github.com/apache/arrow/pull/12826#issuecomment-1230745059] is a 
user report of 10x increased memory usage caused by accidentally including 
these special fragment columns.
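
A quick sketch of why dictionary encoding helps for a column of repeated file 
paths:

{code:python}
import pyarrow as pa

# A million rows that all point at the same file
paths = pa.array(["part-0001.parquet"] * 1_000_000)
encoded = paths.dictionary_encode()

print(paths.nbytes)    # stores the full string (plus offsets) for every row
print(encoded.nbytes)  # int32 indices plus a one-entry dictionary
{code}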



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17897) [Packaging][Conan] Add back ARROW_GCS to conanfile.py

2022-09-29 Thread Will Jones (Jira)
Will Jones created ARROW-17897:
--

 Summary: [Packaging][Conan] Add back ARROW_GCS to conanfile.py
 Key: ARROW-17897
 URL: https://issues.apache.org/jira/browse/ARROW-17897
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 10.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14159) [R] Re-allow some multithreading on Windows

2022-09-26 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-14159:
--

Assignee: (was: Will Jones)

> [R] Re-allow some multithreading on Windows
> ---
>
> Key: ARROW-14159
> URL: https://issues.apache.org/jira/browse/ARROW-14159
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 10.0.0
>
>
> Followup to ARROW-8379, which set use_threads = FALSE on Windows. See 
> discussion about adding more controls, disabling threading in some places and 
> not others, etc. We want to do this soon after release so that we have a few 
> months to see how things behave on CI before releasing again.
> -
> Collecting some CI hangs after ARROW-8379
> 1. Rtools35, 64bit test suite hangs: 
> https://github.com/apache/arrow/pull/11290/checks?check_run_id=3767787034
> {code}
> ** running tests for arch 'i386' ...
>   Running 'testthat.R' [17s]
>  OK
> ** running tests for arch 'x64' ...
> Error: Error:   stderr is not a pipe.>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16880) [R] Test GCS auth with gargle/googleAuthR

2022-09-26 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-16880:
--

Assignee: (was: Will Jones)

> [R] Test GCS auth with gargle/googleAuthR
> -
>
> Key: ARROW-16880
> URL: https://issues.apache.org/jira/browse/ARROW-16880
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 10.0.0
>
>
> These are the main packages that let folks work with Google Cloud from R, so 
> we should make sure we can play nicely with their auth methods, how they 
> cache credentials, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16880) [R] Test GCS auth with gargle/googleAuthR

2022-09-26 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-16880:
---
Fix Version/s: (was: 10.0.0)

> [R] Test GCS auth with gargle/googleAuthR
> -
>
> Key: ARROW-16880
> URL: https://issues.apache.org/jira/browse/ARROW-16880
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>
> These are the main packages that let folks work with Google Cloud from R, so 
> we should make sure we can play nicely with their auth methods, how they 
> cache credentials, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17069) [Python][R] GCSFIleSystem reports cannot resolve host on public buckets

2022-09-26 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-17069:
--

Assignee: (was: Will Jones)

> [Python][R] GCSFIleSystem reports cannot resolve host on public buckets
> ---
>
> Key: ARROW-17069
> URL: https://issues.apache.org/jira/browse/ARROW-17069
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python, R
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Priority: Critical
> Fix For: 10.0.0
>
>
> GCSFileSystem will return {{Couldn't resolve host name}} if you don't supply 
> {{anonymous}} as the user:
> {code:python}
> import pyarrow.dataset as ds
> # Fails:
> dataset = 
> ds.dataset("gs://voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 749, in dataset
> return _filesystem_dataset(source, **kwargs)
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 441, in _filesystem_dataset
> fs, paths_or_selector = _ensure_single_source(source, filesystem)
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 408, in _ensure_single_source
> file_info = filesystem.get_file_info(path)
>   File "pyarrow/_fs.pyx", line 444, in pyarrow._fs.FileSystem.get_file_info
> info = GetResultValue(self.fs.GetFileInfo(path))
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
> return check_status(status)
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> raise IOError(message)
> OSError: google::cloud::Status(UNAVAILABLE: Retry policy exhausted in 
> GetObjectMetadata: EasyPerform() - CURL error [6]=Couldn't resolve host name)
> # This works fine:
> >>> dataset = 
> >>> ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
> {code}
> I would expect that we could connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16089) [Packaging] Add support for Conan C/C++ package manager

2022-09-26 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17609576#comment-17609576
 ] 

Will Jones commented on ARROW-16089:


[~kou] I heard we are waiting on the 10.0.0 release to upstream our changes to 
the Conan files. I found there is an issue with OpenSSL on some platforms in 
the current Conan that is fixed in ours. Would it be alright if I brought these 
upstream now instead of waiting?

> [Packaging] Add support for Conan C/C++ package manager
> --
>
> Key: ARROW-16089
> URL: https://issues.apache.org/jira/browse/ARROW-16089
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17845) [CI][Conan] Re-enable Flight in Conan CI check

2022-09-26 Thread Will Jones (Jira)
Will Jones created ARROW-17845:
--

 Summary: [CI][Conan] Re-enable Flight in Conan CI check
 Key: ARROW-17845
 URL: https://issues.apache.org/jira/browse/ARROW-17845
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Will Jones
Assignee: Will Jones






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15838) [C++] Key column behavior in joins

2022-09-22 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-15838:
--

Assignee: Will Jones

> [C++] Key column behavior in joins
> --
>
> Key: ARROW-15838
> URL: https://issues.apache.org/jira/browse/ARROW-15838
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Jonathan Keane
>Assignee: Will Jones
>Priority: Major
> Fix For: 10.0.0
>
>
> By default, dplyr (and possibly pandas too?) coalesces the key column for 
> full joins so that it holds the (non-null) values from both key columns:
> {code}
> > left <- tibble::tibble(
>   key = c(1, 2),
>   A = c(0, 1),  
> )  
> left_tab <- Table$create(left)
> > right <- tibble::tibble(
>   key = c(2, 3),
>   B = c(0, 1),
> )  
> right_tab <- Table$create(right)
> > left %>% full_join(right) 
> Joining, by = "key"
> # A tibble: 3 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     1     0    NA
> 2     2     1     0
> 3     3    NA     1
> > left_tab %>% full_join(right_tab) %>% collect()
> # A tibble: 3 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     2     1     0
> 2     1     0    NA
> 3    NA    NA     1
> {code}
> And for a right join, we would expect the key from the right table to be in the 
> result, but we get the key from the left instead:
> {code}
> > left <- tibble::tibble(
>   key = c(1, 2),
>   A = c(0, 1),  
> )  
> left_tab <- Table$create(left)
> > right <- tibble::tibble(
>   key = c(2, 3),
>   B = c(0, 1),
> )  
> right_tab <- Table$create(right)
> > left %>% right_join(right)
> Joining, by = "key"
> # A tibble: 2 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     2     1     0
> 2     3    NA     1
> > left_tab %>% right_join(right_tab) %>% collect()
> # A tibble: 2 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     2     1     0
> 2    NA    NA     1
> {code}
> Additionally, we should be able to keep both key columns with an option (cf 
> https://github.com/apache/arrow/blob/9719eae66dcf38c966ae769215d27020a6dd5550/r/R/dplyr-join.R#L32)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17812) [C++][Documentation] Add Gandiva User Guide

2022-09-21 Thread Will Jones (Jira)
Will Jones created ARROW-17812:
--

 Summary: [C++][Documentation] Add Gandiva User Guide
 Key: ARROW-17812
 URL: https://issues.apache.org/jira/browse/ARROW-17812
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 10.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17349) [C++] Support casting field names of list and map when nested

2022-09-20 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607422#comment-17607422
 ] 

Will Jones commented on ARROW-17349:


What's actually going on is that we don't have any cast kernel for Map. Casting 
from one map type to another works because we return early if the types are equal, 
and our equality check doesn't care about map field names. It does care about list 
field names, though, so if the map contains a list it will go looking for a cast 
function.

I'll create a separate ticket for implementing Cast for Map, but for this 
particular issue, I think it would be nice to have a fast path for renaming 
fields in cast.
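
For illustration, a minimal sketch of that behavior, reusing the map types from the repro quoted below (the expected outputs here just restate the description above and are not re-verified):
{code:python}
# Sketch only: map types that differ only in their key field name are considered
# equal, so a cast between them takes the equal-types early return described above.
import pyarrow as pa

ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
ty_plain = pa.map_(pa.int32(), pa.int32())

print(ty_named.equals(ty_plain))  # expected: True (map field names are ignored)

arr = pa.array([[(1, 2), (2, 4)]], type=ty_named)
print(arr.cast(ty_plain).type)    # expected to succeed via the early return
{code}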

> [C++] Support casting field names of list and map when nested
> -
>
> Key: ARROW-17349
> URL: https://issues.apache.org/jira/browse/ARROW-17349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 9.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: good-first-issue, kernel, pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Different parquet implementations use different field names for internal 
> fields of ListType and MapType, which can sometimes cause silly conflicts. 
> For example, we use {{item}} as the field name for list, but Spark uses 
> {{element}}. Fortunately, we can automatically cast between List and Map 
> Types with different field names. Unfortunately, it only works at the top 
> level. We should get it to work at arbitrary levels of nesting.
> This was discovered in delta-rs: 
> https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285
> Here's a reproduction in Python:
> {code:Python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> def roundtrip_scanner(in_arr, out_type):
> table = pa.table({"arr": in_arr})
> pq.write_table(table, "test.parquet")
> schema = pa.schema({"arr": out_type})
> ds.dataset("test.parquet", schema=schema).to_table()
> # MapType
> ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
> ty = pa.map_(pa.int32(), pa.int32())
> arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # ListType
> ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False))
> ty = pa.list_(pa.int32())
> arr_named = pa.array([[1, 2, 4]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # Combination MapType and ListType
> ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", 
> pa.int32(), nullable=True)), nullable=False))
> ty = pa.map_(pa.string(), pa.list_(pa.int32()))
> arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # Traceback (most recent call last):
> #   File "", line 1, in 
> #   File "", line 5, in roundtrip_scanner
> #   File "pyarrow/_dataset.pyx", line 331, in 
> pyarrow._dataset.Dataset.to_table
> #   File "pyarrow/_dataset.pyx", line 2577, in 
> pyarrow._dataset.Scanner.to_table
> #   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
> #   File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
> # pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map list> from map ('arr')>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17349) [C++] Support casting field names of list and map when nested

2022-09-20 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-17349:
--

Assignee: Will Jones

> [C++] Support casting field names of list and map when nested
> -
>
> Key: ARROW-17349
> URL: https://issues.apache.org/jira/browse/ARROW-17349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 9.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: good-first-issue, kernel, pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Different parquet implementations use different field names for internal 
> fields of ListType and MapType, which can sometimes cause silly conflicts. 
> For example, we use {{item}} as the field name for list, but Spark uses 
> {{element}}. Fortunately, we can automatically cast between List and Map 
> Types with different field names. Unfortunately, it only works at the top 
> level. We should get it to work at arbitrary levels of nesting.
> This was discovered in delta-rs: 
> https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285
> Here's a reproduction in Python:
> {code:Python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> def roundtrip_scanner(in_arr, out_type):
> table = pa.table({"arr": in_arr})
> pq.write_table(table, "test.parquet")
> schema = pa.schema({"arr": out_type})
> ds.dataset("test.parquet", schema=schema).to_table()
> # MapType
> ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
> ty = pa.map_(pa.int32(), pa.int32())
> arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # ListType
> ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False))
> ty = pa.list_(pa.int32())
> arr_named = pa.array([[1, 2, 4]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # Combination MapType and ListType
> ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", 
> pa.int32(), nullable=True)), nullable=False))
> ty = pa.map_(pa.string(), pa.list_(pa.int32()))
> arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # Traceback (most recent call last):
> #   File "", line 1, in 
> #   File "", line 5, in roundtrip_scanner
> #   File "pyarrow/_dataset.pyx", line 331, in 
> pyarrow._dataset.Dataset.to_table
> #   File "pyarrow/_dataset.pyx", line 2577, in 
> pyarrow._dataset.Scanner.to_table
> #   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
> #   File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
> # pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map list> from map ('arr')>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17788) [R][Doc] Add example of using Scanner

2022-09-20 Thread Will Jones (Jira)
Will Jones created ARROW-17788:
--

 Summary: [R][Doc] Add example of using Scanner
 Key: ARROW-17788
 URL: https://issues.apache.org/jira/browse/ARROW-17788
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, R
Affects Versions: 9.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 10.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17776) [C++] Stabilize Parquet ArrowReaderProperties

2022-09-19 Thread Will Jones (Jira)
Will Jones created ARROW-17776:
--

 Summary: [C++] Stabilize Parquet ArrowReaderProperties
 Key: ARROW-17776
 URL: https://issues.apache.org/jira/browse/ARROW-17776
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Parquet
Affects Versions: 9.0.0
Reporter: Will Jones


{{ArrowReaderProperties}} is still marked experimental, but it's pretty well 
used at this point.

One possible change we might wish to make before stabilizing its API, though: 
the {{ArrowWriterProperties}} class uses a namespaced builder class, which provides 
a nice syntax for creation and enforces immutability of the final properties. 
Perhaps we should mirror that design in the reader properties?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17400) [C++] Move Parquet APIs to use Result instead of Status

2022-09-14 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17604933#comment-17604933
 ] 

Will Jones commented on ARROW-17400:


[~devavret] Are you still working on this? I did a little bit of this in my PR 
[https://github.com/apache/arrow/pull/14018] but there are other APIs to do as 
well. 

> [C++] Move Parquet APIs to use Result instead of Status
> ---
>
> Key: ARROW-17400
> URL: https://issues.apache.org/jira/browse/ARROW-17400
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 9.0.0
>Reporter: Will Jones
>Assignee: Devavret Makkar
>Priority: Minor
>  Labels: good-first-issue
>
> Notably, IPC and CSV have "open file" methods that return result, while 
> opening a Parquet file requires passing in an out variable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17593) [C++] Try and maintain input shape in Acero

2022-09-01 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599079#comment-17599079
 ] 

Will Jones commented on ARROW-17593:


I've been reading through the Parquet implementation, and was surprised to find 
that you cannot write out a row group with multiple batches. We've decoupled 
row group sizes and batch size on read (great!), but not on write. Perhaps that 
should also be part of the solution.

I'm not deeply familiar with Acero internals yet, but what you've described 
here seems very sensible. Though it sounds like we may need some helper class 
to allocate the batch and line up the morsels, IIUC.

> [C++] Try and maintain input shape in Acero
> ---
>
> Key: ARROW-17593
> URL: https://issues.apache.org/jira/browse/ARROW-17593
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> Data is scanned in large chunks based on the format.  For example, CSV scans 
> chunks based on a chunk_size while parquet scans entire row groups.
> Then, upon entry into Acero, these chunks are sliced into morsels (~L3 size) 
> for parallelism and batches (~L1-L2 size) for cache efficient processing.
> However, the way it is currently done, means that the output of Acero is a 
> stream of tiny batches.  This is somewhat undesirable in many cases.
> For example, if a pyarrow user calls pq.read_table they might expect to get 
> one batch per row group.  If they were to turn around and write out that 
> table to a new parquet file then either they end up with a non-ideal parquet 
> file (tiny row groups) or they are forced to concatenate the batches (which 
> is an allocation + copy).
> Even if the user is doing their own streaming processing (e.g. in pyarrow) 
> these small batch sizes are undesirable as the overhead of python means that 
> streaming processing should be done in larger batches.
> Instead, there should be a configurable max_batch_size, independent of row 
> group size and morsel size, which is configurable, and quite large by default 
> (1Mi or 64Mi rows).  This control exists for users that want to do their own 
> streaming processing and need to be able to tune for RAM usage.
> Acero will read in data based on the format, as it does today (e.g. CSV chunk 
> size, row group size).  If the source data is very large (bigger than 
> max_batch_size) it will be sliced.  From that point on, any morsels or 
> batches should simply be views into this larger output batch.  For example, 
> when doing a projection to add a new column, we should allocate a 
> max_batch_size array and then populate it over many runs of the project node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17590) Lower memory usage with filters

2022-09-01 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599019#comment-17599019
 ] 

Will Jones commented on ARROW-17590:


First, I don't believe the row-level filters avoid reading any data, unless 
they can be applied to the dataset partition values. In order to evaluate the 
expression on a row, the data still has to be decoded into Arrow format.

If you want to reduce memory usage, I have two suggestions:
 # Turn off prebuffering, if you haven't already. In Python it's on by default 
for some interfaces and off for others. It gives better performance on some 
filesystems, but it uses more memory.
 # Consider reading in batches, using the {{iter_batches()}} method on Parquet 
files for instance. Then you can filter as the data comes in and concatenate 
the results into a Table (see the sketch below).

Which interface are you using? {{pyarrow.parquet.read_table}} or datasets?
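
As a rough illustration of the second suggestion, here is a minimal sketch; the file name {{data.parquet}} and the column {{flag}} are made-up placeholders:
{code:python}
# Sketch only: stream row batches, filter each batch as it arrives, and keep just
# the matching rows, so the whole file is never materialized as a single Table.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")  # placeholder file name
filtered = []
for batch in pf.iter_batches(batch_size=64_000):
    mask = pc.is_valid(batch.column("flag"))  # example predicate: "flag" is not null
    filtered.append(pc.filter(batch, mask))

table = pa.Table.from_batches(filtered, schema=pf.schema_arrow)

# For the first suggestion, pq.read_table(..., pre_buffer=False) turns prebuffering off.
{code}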

> Lower memory usage with filters
> ---
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
>
> Hi,
> When I read a parquet file (about 23MB with 250K rows and 600 object/string 
> columns with lots of None) with filter on a not null column for a small 
> number of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB 
> to 1GB). The result table and dataframe have only a few rows (1 row 20kb, 500 
> rows 20MB). Looks like it scans/loads many rows from the parquet file. Not 
> only the footprint or watermark of memory usage is high, but also it seems 
> not releasing the memory in time (such as after GC in Python, but may get 
> used for subsequent read).
> When reading the same parquet file for all columns without filtering, the 
> memory usage is about the same, at 900MB. It goes up to 2.3GB after converting 
> to a pandas dataframe; df.info(memory_usage='deep') shows 4.3GB, maybe double 
> counting something.
> It helps to limit the number of columns read. Reading 1 column, with a filter 
> for 1 or more rows or with no filter at all, takes about 10MB, which is much 
> smaller, but still bigger than the size of a table or dataframe with 1 or 500 
> rows of 1 column (under 1MB).
> The filtered column is not a partition key, which functionally works to get 
> the correct rows. But the memory usage is quite high even when the parquet 
> file is not really large, partitioned or not. There were some references 
> similar to this issue, for example: 
> [https://github.com/apache/arrow/issues/7338]
> Related classes/methods (pyarrow 9.0.0): 
> _ParquetDatasetV2.read
>     self._dataset.to_table(columns=columns, filter=self._filter_expression, 
> use_threads=use_threads)
> pyarrow._dataset.FileSystemDataset.to_table
> I played with pyarrow._dataset.Scanner.to_table
>     self._dataset.scanner(columns=columns, 
> filter=self._filter_expression).to_table()
> The memory usage is small to construct the scanner but then goes up after the 
> to_table call materializes it.
> Is there some way or workaround to reduce the memory usage with read 
> filtering? 
> If not supported yet, can it be fixed/improved with priority? 
> This is a blocking issue for us when we need to load all or many columns. 
> I am not sure what improvement is possible with respect to how the parquet 
> columnar format works, and if it can be patched somehow in the Pyarrow Python 
> code, or need to change and build the arrow C++ code.
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14161) [C++][Parquet][Docs] Reading/Writing Parquet Files

2022-08-31 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-14161:
--

Assignee: Will Jones

> [C++][Parquet][Docs] Reading/Writing Parquet Files
> --
>
> Key: ARROW-14161
> URL: https://issues.apache.org/jira/browse/ARROW-14161
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Rares Vernica
>Assignee: Will Jones
>Priority: Minor
> Fix For: 10.0.0
>
>
> Missing documentation on Reading/Writing Parquet files C++ api:
>  * 
> [WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE]
>  missing docs on chunk_size found some 
> [here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53]
>  _size of the RowGroup in the parquet file. Normally you would choose this to 
> be rather large_
>  * Typo in file reader 
> [example|https://arrow.apache.org/docs/cpp/parquet.html#filereader] the 
> include should be {{#include "parquet/arrow/reader.h"}}
>  * 
> [WriteProperties/Builder|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet16WriterPropertiesE]
>  missing docs on {{compression}}
>  * Missing example on using WriteProperties



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-13454) [C++][Docs] Tables vs Record Batches

2022-08-30 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-13454:
--

Assignee: Will Jones

> [C++][Docs] Tables vs Record Batches
> 
>
> Key: ARROW-13454
> URL: https://issues.apache.org/jira/browse/ARROW-13454
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Rares Vernica
>Assignee: Will Jones
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> It is not clear what the difference between Tables and Record Batches is, 
> as described on [https://arrow.apache.org/docs/cpp/tables.html#tables]
> _A 
> [{{arrow::Table}}|https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow5TableE]
>  is a two-dimensional dataset with chunked arrays for columns_
> _A 
> [{{arrow::RecordBatch}}|https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow11RecordBatchE]
>  is a two-dimensional dataset of a number of contiguous arrays_
> Or maybe the distinction between _chunked arrays_ and _contiguous arrays_ can 
> be clarified.
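
As an illustrative aside, the distinction is easy to see from Python; a minimal sketch:
{code:python}
# Sketch only: a RecordBatch column is a single contiguous Array, while a Table
# column is a ChunkedArray that may span several batches.
import pyarrow as pa

batch = pa.record_batch({"x": [1, 2, 3]})
table = pa.Table.from_batches([batch, batch])

print(type(batch.column(0)))       # a contiguous Int64Array
print(type(table.column(0)))       # a ChunkedArray
print(table.column(0).num_chunks)  # 2 chunks, one per source batch
{code}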



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-15006) [Python][Doc] Iteratively enable more numpydoc checks

2022-08-30 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597937#comment-17597937
 ] 

Will Jones commented on ARROW-15006:


Great spreadsheet! Best place for developer discussions is the dev mailing 
list, since it's the most public. Add a link to this ticket and the 
spreadsheet, and you can add a {{[Python][Doc]}} prefix to the subject to help 
recipients know if it's relevant to them.

> [Python][Doc] Iteratively enable more numpydoc checks
> -
>
> Key: ARROW-15006
> URL: https://issues.apache.org/jira/browse/ARROW-15006
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Krisztian Szucs
>Assignee: Bryce Mecum
>Priority: Major
>  Labels: good-first-issue, pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Asof https://github.com/apache/arrow/pull/7732 we're going to have a numpydoc 
> check running on pull requests. There is a single rule enabled at the moment: 
> PR01
> Additional checks we can run:
> {code}
> ERROR_MSGS = {
> "GL01": "Docstring text (summary) should start in the line immediately "
> "after the opening quotes (not in the same line, or leaving a "
> "blank line in between)",
> "GL02": "Closing quotes should be placed in the line after the last text "
> "in the docstring (do not close the quotes in the same line as "
> "the text, or leave a blank line between the last text and the "
> "quotes)",
> "GL03": "Double line break found; please use only one blank line to "
> "separate sections or paragraphs, and do not leave blank lines "
> "at the end of docstrings",
> "GL05": 'Tabs found at the start of line "{line_with_tabs}", please use '
> "whitespace only",
> "GL06": 'Found unknown section "{section}". Allowed sections are: '
> "{allowed_sections}",
> "GL07": "Sections are in the wrong order. Correct order is: 
> {correct_sections}",
> "GL08": "The object does not have a docstring",
> "GL09": "Deprecation warning should precede extended summary",
> "GL10": "reST directives {directives} must be followed by two colons",
> "SS01": "No summary found (a short summary in a single line should be "
> "present at the beginning of the docstring)",
> "SS02": "Summary does not start with a capital letter",
> "SS03": "Summary does not end with a period",
> "SS04": "Summary contains heading whitespaces",
> "SS05": "Summary must start with infinitive verb, not third person "
> '(e.g. use "Generate" instead of "Generates")',
> "SS06": "Summary should fit in a single line",
> "ES01": "No extended summary found",
> "PR01": "Parameters {missing_params} not documented",
> "PR02": "Unknown parameters {unknown_params}",
> "PR03": "Wrong parameters order. Actual: {actual_params}. "
> "Documented: {documented_params}",
> "PR04": 'Parameter "{param_name}" has no type',
> "PR05": 'Parameter "{param_name}" type should not finish with "."',
> "PR06": 'Parameter "{param_name}" type should use "{right_type}" instead '
> 'of "{wrong_type}"',
> "PR07": 'Parameter "{param_name}" has no description',
> "PR08": 'Parameter "{param_name}" description should start with a '
> "capital letter",
> "PR09": 'Parameter "{param_name}" description should finish with "."',
> "PR10": 'Parameter "{param_name}" requires a space before the colon '
> "separating the parameter name and type",
> "RT01": "No Returns section found",
> "RT02": "The first line of the Returns section should contain only the "
> "type, unless multiple values are being returned",
> "RT03": "Return value has no description",
> "RT04": "Return value description should start with a capital letter",
> "RT05": 'Return value description should finish with "."',
> "YD01": "No Yields section found",
> "SA01": "See Also section not found",
> "SA02": "Missing period at end of description for See Also "
> '"{reference_name}" reference',
> "SA03": "Description should be capitalized for See Also "
> '"{reference_name}" reference',
> "SA04": 'Missing description for See Also "{reference_name}" reference',
> "EX01": "No examples section found",
> }
> {code}
> cc [~alenkaf] [~amol-] [~jorisvandenbossche]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-30 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597915#comment-17597915
 ] 

Will Jones commented on ARROW-17459:


We have a section of our docs devoted to [developer setup and 
guidelines|https://arrow.apache.org/docs/developers/contributing.html]. And we 
have documentation describing the [Arrow in-memory 
format|https://arrow.apache.org/docs/format/Columnar.html] (it may be worth 
reviewing the structure of nested arrays, for example). For the internals of 
the Parquet arrow code, it's best to read through the source headers at 
{{{}cpp/src/parquet/arrow/{}}}.

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17399) pyarrow may use a lot of memory to load a dataframe from parquet

2022-08-29 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597400#comment-17597400
 ] 

Will Jones commented on ARROW-17399:


Sorry, you are right, you had a single column there already.

I tried your repro on my M1 Macbook and didn't see the memory usage you are 
seeing. (This is with mimalloc allocator, but I got similar results with 
jemalloc and the system allocator.)

Are you able to reproduce this on the latest versions of Pandas and Numpy? And 
could you confirm your package version numbers?

{code:none}
❯ python test_pyarrow.py
  0 time:   0.0 rss:  79.5
  1 time:   2.0 rss: 617.1
  2 time:   3.4 rss:1090.6
  3 time:   3.7 rss: 633.6
  4 time:   6.7 rss: 633.6
  5 time:  10.1 rss:1942.9
  6 time:  13.1 rss:1942.9
  7 time:  13.6 rss: 664.8
  8 time:  16.6 rss: 664.8
{code}

> pyarrow may use a lot of memory to load a dataframe from parquet
> 
>
> Key: ARROW-17399
> URL: https://issues.apache.org/jira/browse/ARROW-17399
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Affects Versions: 9.0.0
> Environment: linux
>Reporter: Gianluca Ficarelli
>Priority: Major
> Attachments: memory-profiler.png
>
>
> When a pandas dataframe is loaded from a parquet file using 
> {{{}pyarrow.parquet.read_table{}}}, the memory usage may grow a lot more than 
> what should be needed to load the dataframe, and it's not freed until the 
> dataframe is deleted.
> The problem is evident when the dataframe has a {*}column containing lists or 
> numpy arrays{*}, while it seems absent (or not noticeable) if the column 
> contains only integer or floats.
> I'm attaching a simple script to reproduce the issue, and a graph created 
> with memory-profiler showing the memory usage.
> In this example, the dataframe created with pandas needs around 1.2 GB, but 
> the memory usage after loading it from parquet is around 16 GB.
> The items of the column are created as numpy arrays and not lists, to be 
> consistent with the types loaded from parquet (pyarrow produces numpy arrays 
> and not lists).
>  
> {code:python}
> import gc
> import time
> import numpy as np
> import pandas as pd
> import pyarrow
> import pyarrow.parquet
> import psutil
> def pyarrow_dump(filename, df, compression="snappy"):
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_table(table, filename, compression=compression)
> def pyarrow_load(filename):
> table = pyarrow.parquet.read_table(filename)
> return table.to_pandas()
> def print_mem(msg, start_time=time.monotonic(), process=psutil.Process()):
> # gc.collect()
> current_time = time.monotonic() - start_time
> rss = process.memory_info().rss / 2 ** 20
> print(f"{msg:>3} time:{current_time:>10.1f} rss:{rss:>10.1f}")
> if __name__ == "__main__":
> print_mem(0)
> rows = 500
> df = pd.DataFrame({"a": [np.arange(10) for i in range(rows)]})
> print_mem(1)
> 
> pyarrow_dump("example.parquet", df)
> print_mem(2)
> 
> del df
> print_mem(3)
> time.sleep(3)
> print_mem(4)
> df = pyarrow_load("example.parquet")
> print_mem(5)
> time.sleep(3)
> print_mem(6)
> del df
> print_mem(7)
> time.sleep(3)
> print_mem(8)
> {code}
> Run with memory-profiler:
> {code:bash}
> mprof run --multiprocess python test_pyarrow.py
> {code}
> Output:
> {code:java}
> mprof: Sampling memory every 0.1s
> running new process
>   0 time:   0.0 rss: 135.4
>   1 time:   4.9 rss:1252.2
>   2 time:   7.1 rss:1265.0
>   3 time:   7.5 rss: 760.2
>   4 time:  10.7 rss: 758.9
>   5 time:  19.6 rss:   16745.4
>   6 time:  22.6 rss:   16335.4
>   7 time:  22.9 rss:   15833.0
>   8 time:  25.9 rss: 955.0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-29 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597392#comment-17597392
 ] 

Will Jones commented on ARROW-17459:


Hi Arthur,

Here's a simple repro I created in Python:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array([[("a" * 2**30, 1)]], type = pa.map_(pa.string(), pa.int32()))
arr = pa.chunked_array([arr, arr])
tab = pa.table({ "arr": arr })

pq.write_table(tab, "test.parquet")

pq.read_table("test.parquet")
#Traceback (most recent call last):
#  File "", line 1, in 
#  File 
"/Users/willjones/mambaforge/envs/notebooks/lib/python3.10/site-#packages/pyarrow/parquet/__init__.py",
 line 2827, in read_table
#return dataset.read(columns=columns, use_threads=use_threads,
#  File 
"/Users/willjones/mambaforge/envs/notebooks/lib/python3.10/site-#packages/pyarrow/parquet/__init__.py",
 line 2473, in read
#table = self._dataset.to_table(
#  File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
#  File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
#  File "pyarrow/error.pxi", line 144, in 
pyarrow.lib.pyarrow_internal_check_status
#  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
#pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented 
for chunked array outputs
{code}

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-18 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581502#comment-17581502
 ] 

Will Jones commented on ARROW-17459:


I haven't tried this, but perhaps {{GetRecordBatchReader}} will work instead: 
[https://github.com/wjones127/arrow/blob/895e2da93c0af3a1525c8c75ec8d612d96c28647/cpp/src/parquet/arrow/reader.h#L165]

It sounds like there are some code paths that do work and some that don't.

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-15006) [Python][Doc] Iteratively enable more numpydoc checks

2022-08-18 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581404#comment-17581404
 ] 

Will Jones commented on ARROW-15006:


Perhaps we should start with the style-only ones first (PR06: Parameter 
"accept_root_dir" type should use "bool" instead of "boolean"), and skip the 
"missing" warnings (GL08: The object does not have a docstring). I think the 
missing ones will be good to do, but will require a lot more work to get enough 
context to properly describe each of the objects, so possibly better as a 
follow up. That's should narrow things down to some mechanical changes that we 
can get out of the way quickly (if only there were an automatic formatter).

Also, we should double check that we have instructions for how to run these 
checks locally, so that developers can verify their changes before pushing them.
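
As a rough sketch of what a local check could look like, assuming the {{numpydoc}} package is installed (this is not the project's CI script, just numpydoc's own validation API):
{code:python}
# Sketch only: validate a single docstring and report only the style-oriented
# rules discussed above; the rule selection here is illustrative.
from numpydoc.validate import validate

STYLE_RULES = {"PR06", "SS02", "SS03"}

result = validate("pyarrow.Table.to_pandas")
for code, message in result["errors"]:
    if code in STYLE_RULES:
        print(f"{code}: {message}")
{code}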

> [Python][Doc] Iteratively enable more numpydoc checks
> -
>
> Key: ARROW-15006
> URL: https://issues.apache.org/jira/browse/ARROW-15006
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: good-first-issue
>
> Asof https://github.com/apache/arrow/pull/7732 we're going to have a numpydoc 
> check running on pull requests. There is a single rule enabled at the moment: 
> PR01
> Additional checks we can run:
> {code}
> ERROR_MSGS = {
> "GL01": "Docstring text (summary) should start in the line immediately "
> "after the opening quotes (not in the same line, or leaving a "
> "blank line in between)",
> "GL02": "Closing quotes should be placed in the line after the last text "
> "in the docstring (do not close the quotes in the same line as "
> "the text, or leave a blank line between the last text and the "
> "quotes)",
> "GL03": "Double line break found; please use only one blank line to "
> "separate sections or paragraphs, and do not leave blank lines "
> "at the end of docstrings",
> "GL05": 'Tabs found at the start of line "{line_with_tabs}", please use '
> "whitespace only",
> "GL06": 'Found unknown section "{section}". Allowed sections are: '
> "{allowed_sections}",
> "GL07": "Sections are in the wrong order. Correct order is: 
> {correct_sections}",
> "GL08": "The object does not have a docstring",
> "GL09": "Deprecation warning should precede extended summary",
> "GL10": "reST directives {directives} must be followed by two colons",
> "SS01": "No summary found (a short summary in a single line should be "
> "present at the beginning of the docstring)",
> "SS02": "Summary does not start with a capital letter",
> "SS03": "Summary does not end with a period",
> "SS04": "Summary contains heading whitespaces",
> "SS05": "Summary must start with infinitive verb, not third person "
> '(e.g. use "Generate" instead of "Generates")',
> "SS06": "Summary should fit in a single line",
> "ES01": "No extended summary found",
> "PR01": "Parameters {missing_params} not documented",
> "PR02": "Unknown parameters {unknown_params}",
> "PR03": "Wrong parameters order. Actual: {actual_params}. "
> "Documented: {documented_params}",
> "PR04": 'Parameter "{param_name}" has no type',
> "PR05": 'Parameter "{param_name}" type should not finish with "."',
> "PR06": 'Parameter "{param_name}" type should use "{right_type}" instead '
> 'of "{wrong_type}"',
> "PR07": 'Parameter "{param_name}" has no description',
> "PR08": 'Parameter "{param_name}" description should start with a '
> "capital letter",
> "PR09": 'Parameter "{param_name}" description should finish with "."',
> "PR10": 'Parameter "{param_name}" requires a space before the colon '
> "separating the parameter name and type",
> "RT01": "No Returns section found",
> "RT02": "The first line of the Returns section should contain only the "
> "type, unless multiple values are being returned",
> "RT03": "Return value has no description",
> "RT04": "Return value description should start with a capital letter",
> "RT05": 'Return value description should finish with "."',
> "YD01": "No Yields section found",
> "SA01": "See Also section not found",
> "SA02": "Missing period at end of description for See Also "
> '"{reference_name}" reference',
> "SA03": "Description should be capitalized for See Also "
> '"{reference_name}" reference',
> "SA04": 'Missing description for See Also "{reference_name}" reference',
> "EX01": "No examples section found",
> }
> 

[jira] [Comment Edited] (ARROW-17441) [Python] Memory kept after del and pool.released_unused()

2022-08-16 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580475#comment-17580475
 ] 

Will Jones edited comment on ARROW-17441 at 8/16/22 9:10 PM:
-

Going back to my original test with Parquet, it does seem like there is some 
long-standing issue with Parquet reads and mimalloc. And a regression with the 
system allocator on MacOS?

Here is the original Parquet read test (so all buffers are allocated within 
Arrow, no numpy):
{code:python}
import os
import psutil
import time
import gc
process = psutil.Process(os.getpid())


import pyarrow.parquet as pq
import pyarrow as pa

def print_rss():
print(f"RSS: {process.memory_info().rss:,} bytes")

pq_path = "tall.parquet"

print(f"memory_pool={pa.default_memory_pool().backend_name}")
print_rss()
print("reading table")
tab = pq.read_table(pq_path)
print_rss()
print("deleting table")
del tab
gc.collect()
print_rss()
print("releasing unused memory")
pa.default_memory_pool().release_unused()
print_rss()
print("waiting 10 seconds")
time.sleep(10)
print_rss()
print(f"Total allocated bytes: {pa.total_allocated_bytes():,}") {code}
Result in PyArrow 7.0.0:
{code:none}
ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool2.py
memory_pool=mimalloc
RSS: 47,906,816 bytes
reading table
RSS: 2,077,507,584 bytes
deleting table
RSS: 2,071,887,872 bytes
releasing unused memory
RSS: 2,064,875,520 bytes
waiting 10 seconds
RSS: 1,862,352,896 bytes
Total allocated bytes: 0
memory_pool=jemalloc
RSS: 47,415,296 bytes
reading table
RSS: 2,704,965,632 bytes
deleting table
RSS: 70,746,112 bytes
releasing unused memory
RSS: 71,663,616 bytes
waiting 10 seconds
RSS: 71,663,616 bytes
Total allocated bytes: 0
memory_pool=system
RSS: 47,857,664 bytes
reading table
RSS: 2,705,408,000 bytes
deleting table
RSS: 71,106,560 bytes
releasing unused memory
RSS: 71,106,560 bytes
waiting 10 seconds
RSS: 71,106,560 bytes
Total allocated bytes: 0 {code}
Result in PyArrow 9.0.0:
{code:none}
> ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool2.py
memory_pool=mimalloc
RSS: 48,037,888 bytes
reading table
RSS: 2,140,487,680 bytes
deleting table
RSS: 2,149,711,872 bytes
releasing unused memory
RSS: 2,142,273,536 bytes
waiting 10 seconds
RSS: 1,710,981,120 bytes
Total allocated bytes: 0
memory_pool=jemalloc
RSS: 48,136,192 bytes
reading table
RSS: 2,681,274,368 bytes
deleting table
RSS: 71,942,144 bytes
releasing unused memory
RSS: 72,908,800 bytes
waiting 10 seconds
RSS: 72,908,800 bytes
Total allocated bytes: 0
memory_pool=system
RSS: 48,005,120 bytes
reading table
RSS: 2,847,965,184 bytes
deleting table
RSS: 1,440,071,680 bytes
releasing unused memory
RSS: 1,440,071,680 bytes
waiting 10 seconds
RSS: 1,440,071,680 bytes
Total allocated bytes: 0 {code}


was (Author: willjones127):
Going back to my original test with Parquet, it does seem like there is some 
long-standing issue with Parquet reads and mimalloc. And a regression with the 
system allocator on MacOS?

Here is the original Parquet read test (so all buffers are allocated within 
Arrow, no numpy):
{code:python}
import os
import psutil
import time
import gc
process = psutil.Process(os.getpid())


import pyarrow.parquet as pq
import pyarrow as pa

def print_rss():
print(f"RSS: {process.memory_info().rss:,} bytes")

pq_path = "tall.parquet"

print(f"memory_pool={pa.default_memory_pool().backend_name}")
print_rss()
print("reading table")
tab = pq.read_table(pq_path)
print_rss()
print("deleting table")
del tab
gc.collect()
print_rss()
print("releasing unused memory")
pa.default_memory_pool().release_unused()
print_rss()
print("waiting 10 seconds")
time.sleep(10)
print_rss()
{code}
Result in PyArrow 7.0.0:
{code:none}
ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool2.py
memory_pool=mimalloc
RSS: 47,906,816 bytes
reading table
RSS: 2,077,507,584 bytes
deleting table
RSS: 2,071,887,872 bytes
releasing unused memory
RSS: 2,064,875,520 bytes
waiting 10 seconds
RSS: 1,862,352,896 bytes
memory_pool=jemalloc
RSS: 47,415,296 bytes
reading table
RSS: 2,704,965,632 bytes
deleting table
RSS: 70,746,112 bytes
releasing unused memory
RSS: 71,663,616 bytes
waiting 10 seconds
RSS: 71,663,616 bytes
memory_pool=system
RSS: 47,857,664 bytes
reading table
RSS: 2,705,408,000 bytes
deleting table
RSS: 71,106,560 bytes
releasing unused memory
RSS: 71,106,560 bytes
waiting 10 seconds
RSS: 71,106,560 bytes
{code}
Result in PyArrow 9.0.0:
{code:none}
> ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc p

[jira] [Commented] (ARROW-17441) [Python] Memory kept after del and pool.released_unused()

2022-08-16 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580475#comment-17580475
 ] 

Will Jones commented on ARROW-17441:


Going back to my original test with Parquet, it does seem like there is some 
long-standing issue with Parquet reads and mimalloc. And a regression with the 
system allocator on MacOS?

Here is the original Parquet read test (so all buffers are allocated within 
Arrow, no numpy):
{code:python}
import os
import psutil
import time
import gc
process = psutil.Process(os.getpid())


import pyarrow.parquet as pq
import pyarrow as pa

def print_rss():
print(f"RSS: {process.memory_info().rss:,} bytes")

pq_path = "tall.parquet"

print(f"memory_pool={pa.default_memory_pool().backend_name}")
print_rss()
print("reading table")
tab = pq.read_table(pq_path)
print_rss()
print("deleting table")
del tab
gc.collect()
print_rss()
print("releasing unused memory")
pa.default_memory_pool().release_unused()
print_rss()
print("waiting 10 seconds")
time.sleep(10)
print_rss()
{code}
Result in PyArrow 7.0.0:
{code:none}
ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool2.py
memory_pool=mimalloc
RSS: 47,906,816 bytes
reading table
RSS: 2,077,507,584 bytes
deleting table
RSS: 2,071,887,872 bytes
releasing unused memory
RSS: 2,064,875,520 bytes
waiting 10 seconds
RSS: 1,862,352,896 bytes
memory_pool=jemalloc
RSS: 47,415,296 bytes
reading table
RSS: 2,704,965,632 bytes
deleting table
RSS: 70,746,112 bytes
releasing unused memory
RSS: 71,663,616 bytes
waiting 10 seconds
RSS: 71,663,616 bytes
memory_pool=system
RSS: 47,857,664 bytes
reading table
RSS: 2,705,408,000 bytes
deleting table
RSS: 71,106,560 bytes
releasing unused memory
RSS: 71,106,560 bytes
waiting 10 seconds
RSS: 71,106,560 bytes
{code}
Result in PyArrow 9.0.0:
{code:none}
> ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool2.py
memory_pool=mimalloc
RSS: 48,037,888 bytes
reading table
RSS: 2,140,487,680 bytes
deleting table
RSS: 2,149,711,872 bytes
releasing unused memory
RSS: 2,142,273,536 bytes
waiting 10 seconds
RSS: 1,710,981,120 bytes
memory_pool=jemalloc
RSS: 48,136,192 bytes
reading table
RSS: 2,681,274,368 bytes
deleting table
RSS: 71,942,144 bytes
releasing unused memory
RSS: 72,908,800 bytes
waiting 10 seconds
RSS: 72,908,800 bytes
memory_pool=system
RSS: 48,005,120 bytes
reading table
RSS: 2,847,965,184 bytes
deleting table
RSS: 1,440,071,680 bytes
releasing unused memory
RSS: 1,440,071,680 bytes
waiting 10 seconds
RSS: 1,440,071,680 bytes
{code}

> [Python] Memory kept after del and pool.released_unused()
> -
>
> Key: ARROW-17441
> URL: https://issues.apache.org/jira/browse/ARROW-17441
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: Will Jones
>Priority: Major
>
> I was trying to reproduce another issue involving memory pools not releasing 
> memory, but encountered this confusing behavior: if I create a table, then 
> call {{{}del table{}}}, and then {{{}pool.release_unused(){}}}, I still see 
> significant memory usage. On mimalloc in particular, I see no meaningful drop 
> in memory usage on either call.
> Am I missing something? My understanding prior has been that memory will be 
> held onto by a memory pool, but will be forced free by release_unused; and 
> that system memory pool should release memory immediately. But neither of 
> those seem true.
> {code:python}
> import os
> import psutil
> import time
> import gc
> process = psutil.Process(os.getpid())
> import numpy as np
> from uuid import uuid4
> import pyarrow as pa
> def gen_batches(n_groups=200, rows_per_group=200_000):
> for _ in range(n_groups):
> id_val = uuid4().bytes
> yield pa.table({
> "x": np.random.random(rows_per_group), # This will compress poorly
> "y": np.random.random(rows_per_group),
> "a": pa.array(list(range(rows_per_group)), type=pa.int32()), # 
> This compresses with delta encoding
> "id": pa.array([id_val] * rows_per_group), # This compresses with 
> RLE
> })
> def print_rss():
> print(f"RSS: {process.memory_info().rss:,} bytes")
> print(f"memory_pool={pa.default_memory_pool().backend_name}")
> print_rss()
> print("reading table")
> tab = pa.concat_tables(list(gen_batches()))
> print_rss()
> print("deleting table")
> del tab
> gc.collect()
> print_rss()
> print("releasing unused memory")
> pa.default_memory_pool().release_unused()
> print_rss()
> print("waiting 10 seconds")
> time.sleep(10)
> print_

[jira] [Commented] (ARROW-17441) [Python] Memory kept after del and pool.released_unused()

2022-08-16 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580471#comment-17580471
 ] 

Will Jones commented on ARROW-17441:


{quote}I must admit I don't understand the references to compression in your 
comments. Were you planning to use Parquet at some point?{quote}

Sorry, I was testing memory usage from Parquet reads and seeing something like 
this, but decided to take Parquet out of the picture to simplify.

{quote}Other than that, Numpy-allocated memory does not use the Arrow memory 
pool, so I'm not sure those stats are very indicative.{quote}

Ah I think you are likely right there.

> [Python] Memory kept after del and pool.released_unused()
> -
>
> Key: ARROW-17441
> URL: https://issues.apache.org/jira/browse/ARROW-17441
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: Will Jones
>Priority: Major
>
> I was trying to reproduce another issue involving memory pools not releasing 
> memory, but encountered this confusing behavior: if I create a table, then 
> call {{{}del table{}}}, and then {{{}pool.release_unused(){}}}, I still see 
> significant memory usage. On mimalloc in particular, I see no meaningful drop 
> in memory usage on either call.
> Am I missing something? My understanding prior has been that memory will be 
> held onto by a memory pool, but will be forced free by release_unused; and 
> that system memory pool should release memory immediately. But neither of 
> those seem true.
> {code:python}
> import os
> import psutil
> import time
> import gc
> process = psutil.Process(os.getpid())
> import numpy as np
> from uuid import uuid4
> import pyarrow as pa
> def gen_batches(n_groups=200, rows_per_group=200_000):
> for _ in range(n_groups):
> id_val = uuid4().bytes
> yield pa.table({
> "x": np.random.random(rows_per_group), # This will compress poorly
> "y": np.random.random(rows_per_group),
> "a": pa.array(list(range(rows_per_group)), type=pa.int32()), # 
> This compresses with delta encoding
> "id": pa.array([id_val] * rows_per_group), # This compresses with 
> RLE
> })
> def print_rss():
> print(f"RSS: {process.memory_info().rss:,} bytes")
> print(f"memory_pool={pa.default_memory_pool().backend_name}")
> print_rss()
> print("reading table")
> tab = pa.concat_tables(list(gen_batches()))
> print_rss()
> print("deleting table")
> del tab
> gc.collect()
> print_rss()
> print("releasing unused memory")
> pa.default_memory_pool().release_unused()
> print_rss()
> print("waiting 10 seconds")
> time.sleep(10)
> print_rss()
> {code}
> {code:none}
> > ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool.py && \
> ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool.py && \
> ARROW_DEFAULT_MEMORY_POOL=system python test_pool.py
> memory_pool=mimalloc
> RSS: 44,449,792 bytes
> reading table
> RSS: 1,819,557,888 bytes
> deleting table
> RSS: 1,819,590,656 bytes
> releasing unused memory
> RSS: 1,819,852,800 bytes
> waiting 10 seconds
> RSS: 1,819,852,800 bytes
> memory_pool=jemalloc
> RSS: 45,629,440 bytes
> reading table
> RSS: 1,668,677,632 bytes
> deleting table
> RSS: 698,400,768 bytes
> releasing unused memory
> RSS: 699,023,360 bytes
> waiting 10 seconds
> RSS: 699,023,360 bytes
> memory_pool=system
> RSS: 44,875,776 bytes
> reading table
> RSS: 1,713,569,792 bytes
> deleting table
> RSS: 540,311,552 bytes
> releasing unused memory
> RSS: 540,311,552 bytes
> waiting 10 seconds
> RSS: 540,311,552 bytes
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17441) [Python] Memory kept after del and pool.released_unused()

2022-08-16 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580472#comment-17580472
 ] 

Will Jones commented on ARROW-17441:


I reran this in PyArrow 7.0.0 and got results where mimalloc is more in line 
with the others, so I think mimalloc 2 is actually worse rather than better at 
releasing unused memory:

{code}
> ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool.py
memory_pool=mimalloc
RSS: 43,958,272 bytes
reading table
RSS: 1,728,200,704 bytes
deleting table
RSS: 1,600,585,728 bytes
releasing unused memory
RSS: 549,797,888 bytes
waiting 10 seconds
RSS: 549,797,888 bytes
memory_pool=jemalloc
RSS: 43,663,360 bytes
reading table
RSS: 1,663,483,904 bytes
deleting table
RSS: 693,682,176 bytes
releasing unused memory
RSS: 694,304,768 bytes
waiting 10 seconds
RSS: 694,304,768 bytes
memory_pool=system
RSS: 44,220,416 bytes
reading table
RSS: 1,667,072,000 bytes
deleting table
RSS: 697,171,968 bytes
releasing unused memory
RSS: 697,171,968 bytes
waiting 10 seconds
RSS: 697,171,968 bytes
{code}

> [Python] Memory kept after del and pool.released_unused()
> -
>
> Key: ARROW-17441
> URL: https://issues.apache.org/jira/browse/ARROW-17441
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: Will Jones
>Priority: Major
>
> I was trying to reproduce another issue involving memory pools not releasing 
> memory, but encountered this confusing behavior: if I create a table, then 
> call {{{}del table{}}}, and then {{{}pool.release_unused(){}}}, I still see 
> significant memory usage. On mimalloc in particular, I see no meaningful drop 
> in memory usage on either call.
> Am I missing something? My understanding prior has been that memory will be 
> held onto by a memory pool, but will be forced free by release_unused; and 
> that system memory pool should release memory immediately. But neither of 
> those seem true.
> {code:python}
> import os
> import psutil
> import time
> import gc
> process = psutil.Process(os.getpid())
> import numpy as np
> from uuid import uuid4
> import pyarrow as pa
> def gen_batches(n_groups=200, rows_per_group=200_000):
> for _ in range(n_groups):
> id_val = uuid4().bytes
> yield pa.table({
> "x": np.random.random(rows_per_group), # This will compress poorly
> "y": np.random.random(rows_per_group),
> "a": pa.array(list(range(rows_per_group)), type=pa.int32()), # 
> This compresses with delta encoding
> "id": pa.array([id_val] * rows_per_group), # This compresses with 
> RLE
> })
> def print_rss():
> print(f"RSS: {process.memory_info().rss:,} bytes")
> print(f"memory_pool={pa.default_memory_pool().backend_name}")
> print_rss()
> print("reading table")
> tab = pa.concat_tables(list(gen_batches()))
> print_rss()
> print("deleting table")
> del tab
> gc.collect()
> print_rss()
> print("releasing unused memory")
> pa.default_memory_pool().release_unused()
> print_rss()
> print("waiting 10 seconds")
> time.sleep(10)
> print_rss()
> {code}
> {code:none}
> > ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool.py && \
> ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool.py && \
> ARROW_DEFAULT_MEMORY_POOL=system python test_pool.py
> memory_pool=mimalloc
> RSS: 44,449,792 bytes
> reading table
> RSS: 1,819,557,888 bytes
> deleting table
> RSS: 1,819,590,656 bytes
> releasing unused memory
> RSS: 1,819,852,800 bytes
> waiting 10 seconds
> RSS: 1,819,852,800 bytes
> memory_pool=jemalloc
> RSS: 45,629,440 bytes
> reading table
> RSS: 1,668,677,632 bytes
> deleting table
> RSS: 698,400,768 bytes
> releasing unused memory
> RSS: 699,023,360 bytes
> waiting 10 seconds
> RSS: 699,023,360 bytes
> memory_pool=system
> RSS: 44,875,776 bytes
> reading table
> RSS: 1,713,569,792 bytes
> deleting table
> RSS: 540,311,552 bytes
> releasing unused memory
> RSS: 540,311,552 bytes
> waiting 10 seconds
> RSS: 540,311,552 bytes
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17441) [Python] Memory kept after del and pool.released_unused()

2022-08-16 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-17441:
---
Description: 
I was trying to reproduce another issue involving memory pools not releasing 
memory, but encountered this confusing behavior: if I create a table, then call 
{{{}del table{}}}, and then {{{}pool.release_unused(){}}}, I still see 
significant memory usage. On mimalloc in particular, I see no meaningful drop 
in memory usage on either call.

Am I missing something? My prior understanding has been that memory may be 
held onto by a memory pool but is forcibly freed by release_unused, and that 
the system memory pool should release memory immediately. But neither of those 
seems true.
{code:python}
import os
import psutil
import time
import gc
process = psutil.Process(os.getpid())
import numpy as np
from uuid import uuid4


import pyarrow as pa

def gen_batches(n_groups=200, rows_per_group=200_000):
for _ in range(n_groups):
id_val = uuid4().bytes
yield pa.table({
"x": np.random.random(rows_per_group), # This will compress poorly
"y": np.random.random(rows_per_group),
"a": pa.array(list(range(rows_per_group)), type=pa.int32()), # This 
compresses with delta encoding
"id": pa.array([id_val] * rows_per_group), # This compresses with 
RLE
})

def print_rss():
print(f"RSS: {process.memory_info().rss:,} bytes")

print(f"memory_pool={pa.default_memory_pool().backend_name}")
print_rss()
print("reading table")
tab = pa.concat_tables(list(gen_batches()))
print_rss()
print("deleting table")
del tab
gc.collect()
print_rss()
print("releasing unused memory")
pa.default_memory_pool().release_unused()
print_rss()
print("waiting 10 seconds")
time.sleep(10)
print_rss()
{code}
{code:none}
> ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool.py
memory_pool=mimalloc
RSS: 44,449,792 bytes
reading table
RSS: 1,819,557,888 bytes
deleting table
RSS: 1,819,590,656 bytes
releasing unused memory
RSS: 1,819,852,800 bytes
waiting 10 seconds
RSS: 1,819,852,800 bytes
memory_pool=jemalloc
RSS: 45,629,440 bytes
reading table
RSS: 1,668,677,632 bytes
deleting table
RSS: 698,400,768 bytes
releasing unused memory
RSS: 699,023,360 bytes
waiting 10 seconds
RSS: 699,023,360 bytes
memory_pool=system
RSS: 44,875,776 bytes
reading table
RSS: 1,713,569,792 bytes
deleting table
RSS: 540,311,552 bytes
releasing unused memory
RSS: 540,311,552 bytes
waiting 10 seconds
RSS: 540,311,552 bytes
{code}

  was:
I was trying to reproduce another issue involving memory pools not releasing 
memory, but encountered this confusing behavior: if I create a table, then call 
{{{}del table{}}}, and then {{{}pool.release_unused(){}}}, I still see 
significant memory usage. On mimalloc in particular, I see no meaningful drop 
in memory usage on either call.

Am I missing something?
{code:python}
import os
import psutil
import time
import gc
process = psutil.Process(os.getpid())
import numpy as np
from uuid import uuid4


import pyarrow as pa

def gen_batches(n_groups=200, rows_per_group=200_000):
for _ in range(n_groups):
id_val = uuid4().bytes
yield pa.table({
"x": np.random.random(rows_per_group), # This will compress poorly
"y": np.random.random(rows_per_group),
"a": pa.array(list(range(rows_per_group)), type=pa.int32()), # This 
compresses with delta encoding
"id": pa.array([id_val] * rows_per_group), # This compresses with 
RLE
})

def print_rss():
print(f"RSS: {process.memory_info().rss:,} bytes")

print(f"memory_pool={pa.default_memory_pool().backend_name}")
print_rss()
print("reading table")
tab = pa.concat_tables(list(gen_batches()))
print_rss()
print("deleting table")
del tab
gc.collect()
print_rss()
print("releasing unused memory")
pa.default_memory_pool().release_unused()
print_rss()
print("waiting 10 seconds")
time.sleep(10)
print_rss()
{code}
{code:none}
> ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool.py
memory_pool=mimalloc
RSS: 44,449,792 bytes
reading table
RSS: 1,819,557,888 bytes
deleting table
RSS: 1,819,590,656 bytes
releasing unused memory
RSS: 1,819,852,800 bytes
waiting 10 seconds
RSS: 1,819,852,800 bytes
memory_pool=jemalloc
RSS: 45,629,440 bytes
reading table
RSS: 1,668,677,632 bytes
deleting table
RSS: 698,400,768 bytes
releasing unused memory
RSS: 699,023,360 bytes
waiting 10 seconds
RSS: 699,023,360 bytes
memory_pool=system
RSS: 44,875,776 bytes
reading table
RSS: 1,713,569,792 bytes
deleting table
RSS: 540,311,552 bytes
releasing unused memory
RSS: 540,311,552 bytes
waiting 10 seconds
RSS: 540,311,552 bytes
{code}


> [Python] Memory kept after del and pool.released_unused()

[jira] [Created] (ARROW-17441) [Python] Memory kept after del and pool.released_unused()

2022-08-16 Thread Will Jones (Jira)
Will Jones created ARROW-17441:
--

 Summary: [Python] Memory kept after del and pool.released_unused()
 Key: ARROW-17441
 URL: https://issues.apache.org/jira/browse/ARROW-17441
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 9.0.0
Reporter: Will Jones


I was trying to reproduce another issue involving memory pools not releasing 
memory, but encountered this confusing behavior: if I create a table, then call 
{{{}del table{}}}, and then {{{}pool.release_unused(){}}}, I still see 
significant memory usage. On mimalloc in particular, I see no meaningful drop 
in memory usage on either call.

Am I missing something?
{code:python}
import os
import psutil
import time
import gc
process = psutil.Process(os.getpid())
import numpy as np
from uuid import uuid4


import pyarrow as pa

def gen_batches(n_groups=200, rows_per_group=200_000):
for _ in range(n_groups):
id_val = uuid4().bytes
yield pa.table({
"x": np.random.random(rows_per_group), # This will compress poorly
"y": np.random.random(rows_per_group),
"a": pa.array(list(range(rows_per_group)), type=pa.int32()), # This 
compresses with delta encoding
"id": pa.array([id_val] * rows_per_group), # This compresses with 
RLE
})

def print_rss():
print(f"RSS: {process.memory_info().rss:,} bytes")

print(f"memory_pool={pa.default_memory_pool().backend_name}")
print_rss()
print("reading table")
tab = pa.concat_tables(list(gen_batches()))
print_rss()
print("deleting table")
del tab
gc.collect()
print_rss()
print("releasing unused memory")
pa.default_memory_pool().release_unused()
print_rss()
print("waiting 10 seconds")
time.sleep(10)
print_rss()
{code}
{code:none}
> ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool.py
memory_pool=mimalloc
RSS: 44,449,792 bytes
reading table
RSS: 1,819,557,888 bytes
deleting table
RSS: 1,819,590,656 bytes
releasing unused memory
RSS: 1,819,852,800 bytes
waiting 10 seconds
RSS: 1,819,852,800 bytes
memory_pool=jemalloc
RSS: 45,629,440 bytes
reading table
RSS: 1,668,677,632 bytes
deleting table
RSS: 698,400,768 bytes
releasing unused memory
RSS: 699,023,360 bytes
waiting 10 seconds
RSS: 699,023,360 bytes
memory_pool=system
RSS: 44,875,776 bytes
reading table
RSS: 1,713,569,792 bytes
deleting table
RSS: 540,311,552 bytes
releasing unused memory
RSS: 540,311,552 bytes
waiting 10 seconds
RSS: 540,311,552 bytes
{code}
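
One way to see how much of that remaining RSS is still tracked by Arrow's pool, 
as opposed to being retained by the allocator itself, is to log the pool's own 
counters next to RSS. A minimal sketch (reusing psutil as above; the table built 
here is just a stand-in for the real workload):

{code:python}
import gc
import os

import psutil
import pyarrow as pa

process = psutil.Process(os.getpid())
pool = pa.default_memory_pool()

def report(label):
    # RSS is what the OS sees; bytes_allocated is what Arrow still tracks.
    rss = process.memory_info().rss
    print(f"{label}: rss={rss:,} "
          f"pool.bytes_allocated={pool.bytes_allocated():,} "
          f"pool.max_memory={pool.max_memory():,}")

report("start")
tab = pa.table({"x": list(range(2_000_000))})
report("after allocation")
del tab
gc.collect()
report("after del + gc")
pool.release_unused()
report("after release_unused")
{code}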



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15368) [C++] [Docs] Improve our SIMD documentation

2022-08-15 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-15368:
---
Fix Version/s: 10.0.0

> [C++] [Docs] Improve our SIMD documentation
> ---
>
> Key: ARROW-15368
> URL: https://issues.apache.org/jira/browse/ARROW-15368
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Jonathan Keane
>Priority: Major
> Fix For: 10.0.0
>
>
> We should document the various env vars ({{{}ARROW_SIMD_LEVEL{}}}, 
> {{{}ARROW_RUNTIME_SIMD_LEVEL{}}}, {{{}ARROW_USER_SIMD_LEVEL{}}}, others?).
> We should also document what the defaults are (and what that means for 
> performance and possible optimization if you're compiling and you know you'll 
> be on more or less modern hardware).
> E.g. pyarrow and the R package are compiled with SSE4_2, but there is some 
> amount of runtime-dispatched SIMD code, and MAX there means that it will 
> compile everything it can, while at runtime it will use whatever is available. 
> So if you compile on a machine with AVX512 and run on a machine with AVX512, 
> you'll get any AVX512 runtime-dispatched code that's available (probably not 
> much). There is more (esp. in the query engine) that is runtime AVX2.
> FWIW I (Neal) would leave ARROW_RUNTIME_SIMD_LEVEL=MAX always. You can set 
> ARROW_USER_SIMD_LEVEL to change/limit what level the runtime dispatch uses.
> Additionally we should document that valgrind does not support AVX512: 
> [https://bugs.kde.org/show_bug.cgi?id=383010] 
> And users should set ARROW_USER_SIMD_LEVEL to AVX2 if they plan to run 
> valgrind on an AVX512-capable machine, similar to what we do for our 
> [CI|https://github.com/apache/arrow/blob/bc1a16cd0eceeffe67893a7e8000d2dd28dcf3f1/docker-compose.yml#L309]
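
A small illustration of how a user could pin the dispatch level and verify what 
is actually in use. This is a sketch under the assumption that 
{{pa.runtime_info()}} exposes {{simd_level}} / {{detected_simd_level}} and that 
ARROW_USER_SIMD_LEVEL is read when pyarrow initializes, so it has to be set 
before import:

{code:python}
import os

# Limit runtime dispatch to AVX2 (e.g. before a valgrind run); must be set
# before pyarrow is imported so the CPU info is initialized with it.
os.environ["ARROW_USER_SIMD_LEVEL"] = "AVX2"

import pyarrow as pa

info = pa.runtime_info()
print("SIMD level in use:   ", info.simd_level)           # capped by the env var
print("Best level detected: ", info.detected_simd_level)  # what the CPU supports
{code}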



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17397) [R] Does R API for Apache Arrow has a tableFromIPC function ?

2022-08-12 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones resolved ARROW-17397.

  Assignee: Will Jones
Resolution: Information Provided

> [R] Does R API for Apache Arrow has a tableFromIPC function ? 
> --
>
> Key: ARROW-17397
> URL: https://issues.apache.org/jira/browse/ARROW-17397
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Roy Assis
>Assignee: Will Jones
>Priority: Minor
>
> I'm building an API using Python and Flask. I want to return a dataframe from 
> the API; I'm serializing the dataframe like so and sending it in the response:
> {code:python}
> batch = pa.record_batch(df)
> sink = pa.BufferOutputStream()
> with pa.ipc.new_stream(sink, batch.schema) as writer:
> writer.write_batch(batch)
> pybytes = sink.getvalue().to_pybytes()
> {code}
> Is it possible to read it with R? If so, can you provide a code snippet?
> Best,
> Roy



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17399) pyarrow may use a lot of memory to load a dataframe from parquet

2022-08-12 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579120#comment-17579120
 ] 

Will Jones commented on ARROW-17399:


That helps narrow it down. Are you able to identify and share the specific 
data types ({{table.schema}}) that seem to be problematic?
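
For example, something along these lines would show both the schema and a rough 
per-column Arrow-side footprint (the file name is just a placeholder):

{code:python}
import pyarrow.parquet as pq

table = pq.read_table("example.parquet")

# The schema shows which columns are nested (list / struct types).
print(table.schema)

# Per-column Arrow-side size, to see which columns dominate.
for name in table.column_names:
    col = table[name]
    print(f"{name}: {col.type} -> {col.nbytes:,} bytes")
{code}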

> pyarrow may use a lot of memory to load a dataframe from parquet
> 
>
> Key: ARROW-17399
> URL: https://issues.apache.org/jira/browse/ARROW-17399
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Affects Versions: 9.0.0
> Environment: linux
>Reporter: Gianluca Ficarelli
>Priority: Major
> Attachments: memory-profiler.png
>
>
> When a pandas dataframe is loaded from a parquet file using 
> {{{}pyarrow.parquet.read_table{}}}, the memory usage may grow a lot more than 
> what should be needed to load the dataframe, and it's not freed until the 
> dataframe is deleted.
> The problem is evident when the dataframe has a {*}column containing lists or 
> numpy arrays{*}, while it seems absent (or not noticeable) if the column 
> contains only integers or floats.
> I'm attaching a simple script to reproduce the issue, and a graph created 
> with memory-profiler showing the memory usage.
> In this example, the dataframe created with pandas needs around 1.2 GB, but 
> the memory usage after loading it from parquet is around 16 GB.
> The items of the column are created as numpy arrays and not lists, to be 
> consistent with the types loaded from parquet (pyarrow produces numpy arrays 
> and not lists).
>  
> {code:python}
> import gc
> import time
> import numpy as np
> import pandas as pd
> import pyarrow
> import pyarrow.parquet
> import psutil
> def pyarrow_dump(filename, df, compression="snappy"):
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_table(table, filename, compression=compression)
> def pyarrow_load(filename):
> table = pyarrow.parquet.read_table(filename)
> return table.to_pandas()
> def print_mem(msg, start_time=time.monotonic(), process=psutil.Process()):
> # gc.collect()
> current_time = time.monotonic() - start_time
> rss = process.memory_info().rss / 2 ** 20
> print(f"{msg:>3} time:{current_time:>10.1f} rss:{rss:>10.1f}")
> if __name__ == "__main__":
> print_mem(0)
> rows = 500
> df = pd.DataFrame({"a": [np.arange(10) for i in range(rows)]})
> print_mem(1)
> 
> pyarrow_dump("example.parquet", df)
> print_mem(2)
> 
> del df
> print_mem(3)
> time.sleep(3)
> print_mem(4)
> df = pyarrow_load("example.parquet")
> print_mem(5)
> time.sleep(3)
> print_mem(6)
> del df
> print_mem(7)
> time.sleep(3)
> print_mem(8)
> {code}
> Run with memory-profiler:
> {code:bash}
> mprof run --multiprocess python test_pyarrow.py
> {code}
> Output:
> {code:java}
> mprof: Sampling memory every 0.1s
> running new process
>   0 time:   0.0 rss: 135.4
>   1 time:   4.9 rss:1252.2
>   2 time:   7.1 rss:1265.0
>   3 time:   7.5 rss: 760.2
>   4 time:  10.7 rss: 758.9
>   5 time:  19.6 rss:   16745.4
>   6 time:  22.6 rss:   16335.4
>   7 time:  22.9 rss:   15833.0
>   8 time:  25.9 rss: 955.0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17400) [C++] Move Parquet APIs to use Result instead of Status

2022-08-12 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-17400:
---
Labels: good-first-issue  (was: )

> [C++] Move Parquet APIs to use Result instead of Status
> ---
>
> Key: ARROW-17400
> URL: https://issues.apache.org/jira/browse/ARROW-17400
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 9.0.0
>Reporter: Will Jones
>Priority: Minor
>  Labels: good-first-issue
>
> Notably, IPC and CSV have "open file" methods that return Result, while 
> opening a Parquet file requires passing in an out variable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17399) pyarrow may use a lot of memory to load a dataframe from parquet

2022-08-12 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579071#comment-17579071
 ] 

Will Jones commented on ARROW-17399:


Hi Gianluca,

There are two conversions happening when reading: first, Parquet data is 
deserialized into Arrow data; second, Arrow data is converted into Pandas / 
numpy data. Are you able to narrow down during which conversion memory is 
increasing?
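
One way to separate the two steps is to measure RSS after each one 
individually; a minimal sketch along the lines of the attached script (file 
name as in that script):

{code:python}
import os

import psutil
import pyarrow.parquet as pq

process = psutil.Process(os.getpid())

def rss_mb():
    return process.memory_info().rss / 2**20

print(f"start:            {rss_mb():10.1f} MiB")

# Step 1: Parquet -> Arrow
table = pq.read_table("example.parquet")
print(f"after read_table: {rss_mb():10.1f} MiB")

# Step 2: Arrow -> pandas / numpy
df = table.to_pandas()
print(f"after to_pandas:  {rss_mb():10.1f} MiB")
{code}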


> pyarrow may use a lot of memory to load a dataframe from parquet
> 
>
> Key: ARROW-17399
> URL: https://issues.apache.org/jira/browse/ARROW-17399
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Affects Versions: 9.0.0
> Environment: linux
>Reporter: Gianluca Ficarelli
>Priority: Major
> Attachments: memory-profiler.png
>
>
> When a pandas dataframe is loaded from a parquet file using 
> {{{}pyarrow.parquet.read_table{}}}, the memory usage may grow a lot more than 
> what should be needed to load the dataframe, and it's not freed until the 
> dataframe is deleted.
> The problem is evident when the dataframe has a {*}column containing lists or 
> numpy arrays{*}, while it seems absent (or not noticeable) if the column 
> contains only integers or floats.
> I'm attaching a simple script to reproduce the issue, and a graph created 
> with memory-profiler showing the memory usage.
> In this example, the dataframe created with pandas needs around 1.2 GB, but 
> the memory usage after loading it from parquet is around 16 GB.
> The items of the column are created as numpy arrays and not lists, to be 
> consistent with the types loaded from parquet (pyarrow produces numpy arrays 
> and not lists).
>  
> {code:python}
> import gc
> import time
> import numpy as np
> import pandas as pd
> import pyarrow
> import pyarrow.parquet
> import psutil
> def pyarrow_dump(filename, df, compression="snappy"):
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_table(table, filename, compression=compression)
> def pyarrow_load(filename):
> table = pyarrow.parquet.read_table(filename)
> return table.to_pandas()
> def print_mem(msg, start_time=time.monotonic(), process=psutil.Process()):
> # gc.collect()
> current_time = time.monotonic() - start_time
> rss = process.memory_info().rss / 2 ** 20
> print(f"{msg:>3} time:{current_time:>10.1f} rss:{rss:>10.1f}")
> if __name__ == "__main__":
> print_mem(0)
> rows = 500
> df = pd.DataFrame({"a": [np.arange(10) for i in range(rows)]})
> print_mem(1)
> 
> pyarrow_dump("example.parquet", df)
> print_mem(2)
> 
> del df
> print_mem(3)
> time.sleep(3)
> print_mem(4)
> df = pyarrow_load("example.parquet")
> print_mem(5)
> time.sleep(3)
> print_mem(6)
> del df
> print_mem(7)
> time.sleep(3)
> print_mem(8)
> {code}
> Run with memory-profiler:
> {code:bash}
> mprof run --multiprocess python test_pyarrow.py
> {code}
> Output:
> {code:java}
> mprof: Sampling memory every 0.1s
> running new process
>   0 time:   0.0 rss: 135.4
>   1 time:   4.9 rss:1252.2
>   2 time:   7.1 rss:1265.0
>   3 time:   7.5 rss: 760.2
>   4 time:  10.7 rss: 758.9
>   5 time:  19.6 rss:   16745.4
>   6 time:  22.6 rss:   16335.4
>   7 time:  22.9 rss:   15833.0
>   8 time:  25.9 rss: 955.0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17397) [R] Does R API for Apache Arrow has a tableFromIPC function ?

2022-08-12 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579067#comment-17579067
 ] 

Will Jones commented on ARROW-17397:


Hi Roy,

I think what you are looking for is a 
[read_ipc_stream|https://arrow.apache.org/docs/r/reference/read_ipc_stream.html].

Here is an example:

{code:R}
library(arrow)
library(dplyr)

output_stream <- BufferOutputStream$create()

test_tbl <- tibble::tibble(
  x = 1:1e4,
  y = vapply(x, rlang::hash, character(1), USE.NAMES = FALSE),
  z = vapply(y, rlang::hash, character(1), USE.NAMES = FALSE)
)

write_ipc_stream(test_tbl, output_stream)

ipc_buffer <- output_stream$finish()

read_ipc_stream(ipc_buffer)
{code}
 

> [R] Does R API for Apache Arrow has a tableFromIPC function ? 
> --
>
> Key: ARROW-17397
> URL: https://issues.apache.org/jira/browse/ARROW-17397
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Roy Assis
>Priority: Minor
>
> I'm building an API using Python and Flask. I want to return a dataframe from 
> the API; I'm serializing the dataframe like so and sending it in the response:
> {code:python}
> batch = pa.record_batch(df)
> sink = pa.BufferOutputStream()
> with pa.ipc.new_stream(sink, batch.schema) as writer:
> writer.write_batch(batch)
> pybytes = sink.getvalue().to_pybytes()
> {code}
> Is it possible to read it with R? If so, can you provide a code snippet?
> Best,
> Roy



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17401) [C++] Add ReadTable method to RecordBatchFileReader

2022-08-12 Thread Will Jones (Jira)
Will Jones created ARROW-17401:
--

 Summary: [C++] Add ReadTable method to RecordBatchFileReader
 Key: ARROW-17401
 URL: https://issues.apache.org/jira/browse/ARROW-17401
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 9.0.0
Reporter: Will Jones


For convenience, it would be helpful to add a method for reading the entire 
file as a table.
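
For comparison, the Python bindings already expose this convenience via 
{{read_all()}}; a minimal sketch of the intended usage (the file name is 
illustrative):

{code:python}
import pyarrow as pa

# Open an Arrow IPC file and materialize every record batch as one Table.
reader = pa.ipc.open_file("data.arrow")
table = reader.read_all()

print(table.num_rows)
print(table.schema)
{code}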



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17400) [C++] Move Parquet APIs to use Result instead of Status

2022-08-12 Thread Will Jones (Jira)
Will Jones created ARROW-17400:
--

 Summary: [C++] Move Parquet APIs to use Result instead of Status
 Key: ARROW-17400
 URL: https://issues.apache.org/jira/browse/ARROW-17400
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 9.0.0
Reporter: Will Jones


Notably, IPC and CSV have "open file" methods that return Result, while opening 
a Parquet file requires passing in an out variable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14999) [C++] List types with different field names are not equal

2022-08-11 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578528#comment-17578528
 ] 

Will Jones commented on ARROW-14999:


Do you expect to be able to roundtrip that from Parquet? It seems like the 
conclusion of the discussion in ARROW-11497 was that we should transition in the 
long term towards always using "element", but maybe we would still be able to 
roundtrip by casting back based on the Arrow schema saved in the metadata?

> [C++] List types with different field names are not equal
> -
>
> Key: ARROW-14999
> URL: https://issues.apache.org/jira/browse/ARROW-14999
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 6.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When comparing map types, the names of the fields are ignored. This was 
> introduced in ARROW-7173.
> However for list types, they are not ignored. For example,
> {code:python}
> In [6]: l1 = pa.list_(pa.field("val", pa.int64()))
> In [7]: l2 = pa.list_(pa.int64())
> In [8]: l1
> Out[8]: ListType(list<val: int64>)
> In [9]: l2
> Out[9]: ListType(list<item: int64>)
> In [10]: l1 == l2
> Out[10]: False
> {code}
> Should we make list type comparison ignore field names too?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14999) [C++] List types with different field names are not equal

2022-08-10 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-14999:
--

Assignee: Will Jones

> [C++] List types with different field names are not equal
> -
>
> Key: ARROW-14999
> URL: https://issues.apache.org/jira/browse/ARROW-14999
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 6.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
> Fix For: 10.0.0
>
>
> When comparing map types, the names of the fields are ignored. This was 
> introduced in ARROW-7173.
> However for list types, they are not ignored. For example,
> {code:python}
> In [6]: l1 = pa.list_(pa.field("val", pa.int64()))
> In [7]: l2 = pa.list_(pa.int64())
> In [8]: l1
> Out[8]: ListType(list<val: int64>)
> In [9]: l2
> Out[9]: ListType(list<item: int64>)
> In [10]: l1 == l2
> Out[10]: False
> {code}
> Should we make list type comparison ignore field names too?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-12958) [CI][Developer] Build + host the docs for PR branches

2022-08-10 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-12958:
---
Component/s: Documentation

> [CI][Developer] Build + host the docs for PR branches
> -
>
> Key: ARROW-12958
> URL: https://issues.apache.org/jira/browse/ARROW-12958
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools, Documentation
>Reporter: Jonathan Keane
>Priority: Major
> Fix For: 10.0.0
>
>
> We already run the docs building with crossbow, could we host the rendered 
> docs somewhere so that we can see what they look like during the PR process?
> ARROW-1299 is a ticket for nightly docs updates for what's in master.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-12958) [CI][Developer] Build + host the docs for PR branches

2022-08-10 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-12958:
---
Fix Version/s: 10.0.0

> [CI][Developer] Build + host the docs for PR branches
> -
>
> Key: ARROW-12958
> URL: https://issues.apache.org/jira/browse/ARROW-12958
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools
>Reporter: Jonathan Keane
>Priority: Major
> Fix For: 10.0.0
>
>
> We already run the docs building with crossbow, could we host the rendered 
> docs somewhere so that we can see what they look like during the PR process?
> ARROW-1299 is a ticket for nightly docs updates for what's in master.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-12958) [CI][Developer] Build + host the docs for PR branches

2022-08-10 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578148#comment-17578148
 ] 

Will Jones commented on ARROW-12958:


Alternatively, we could possibly host on GitHub Pages, where a crossbow job 
publishes the pages to a folder in a branch of some repo, and a nightly cleanup 
job deletes any pages that are older than 30 days. We could do that in a 
free repo, which eliminates hosting cost concerns.
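
A rough sketch of what that nightly cleanup could look like, assuming each PR's 
rendered docs are published into their own directory of the checked-out docs 
branch (the layout and paths are made up for illustration):

{code:python}
import shutil
import time
from pathlib import Path

MAX_AGE_SECONDS = 30 * 24 * 60 * 60
docs_root = Path("gh-pages-checkout")  # hypothetical checkout of the docs branch

for pr_dir in docs_root.glob("pr-*"):
    if not pr_dir.is_dir():
        continue
    age = time.time() - pr_dir.stat().st_mtime
    if age > MAX_AGE_SECONDS:
        shutil.rmtree(pr_dir)  # drop previews older than 30 days
        print(f"removed {pr_dir}")

# A follow-up git add/commit/push would publish the cleanup.
{code}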

> [CI][Developer] Build + host the docs for PR branches
> -
>
> Key: ARROW-12958
> URL: https://issues.apache.org/jira/browse/ARROW-12958
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools
>Reporter: Jonathan Keane
>Priority: Major
>
> We already run the docs building with crossbow, could we host the rendered 
> docs somewhere so that we can see what they look like during the PR process?
> ARROW-1299 is a ticket for nightly docs updates for what's in master.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-12958) [CI][Developer] Build + host the docs for PR branches

2022-08-10 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578147#comment-17578147
 ] 

Will Jones commented on ARROW-12958:


Yeah I think this could likely be solved by:
 # Create some hosting location where docs can be served out of. Perhaps it 
should automatically clean up anything 30 days or older.
 # Create a crossbow job that builds the docs and uploads to the hosting 
location.

One solution to (1) is using an S3 bucket to statically host the site, and 
implement [lifecycle 
rules|https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-expire-general-considerations.html]
 to make objects expire 30 days after creation. Not sure if someone is willing 
to host those resources, though, or if there is a cheaper alternative.
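
For (1), a lifecycle rule along these lines would make objects expire 30 days 
after upload; sketched with boto3 (the bucket name and prefix are hypothetical):

{code:python}
import boto3

s3 = boto3.client("s3")

# Expire everything under the docs-previews prefix 30 days after upload.
s3.put_bucket_lifecycle_configuration(
    Bucket="arrow-pr-docs-preview",          # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-pr-doc-previews",
                "Filter": {"Prefix": "pr-docs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
{code}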

> [CI][Developer] Build + host the docs for PR branches
> -
>
> Key: ARROW-12958
> URL: https://issues.apache.org/jira/browse/ARROW-12958
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools
>Reporter: Jonathan Keane
>Priority: Major
>
> We already run the docs building with crossbow, could we host the rendered 
> docs somewhere so that we can see what they look like during the PR process?
> ARROW-1299 is a ticket for nightly docs updates for what's in master.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17076) [Python][Docs] Enable building documentation with pyarrow nightly builds

2022-08-10 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-17076:
---
Fix Version/s: 10.0.0

> [Python][Docs] Enable building documentation with pyarrow nightly builds
> 
>
> Key: ARROW-17076
> URL: https://issues.apache.org/jira/browse/ARROW-17076
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Python
>Reporter: Todd Farmer
>Priority: Minor
> Fix For: 10.0.0
>
>
> The [instructions for building 
> documentation|https://arrow.apache.org/docs/developers/documentation.html] 
> describes needing pyarrow to successfully build the docs. It also highlights 
> that certain optional pyarrow features must be enabled to successfully build:
> {code:java}
> Note that building the documentation may fail if your build of pyarrow is not 
> sufficiently comprehensive. Portions of the Python API documentation will 
> also not build without CUDA support having been built. {code}
> "Sufficiently comprehensive" is relatively ambiguous, leaving users to repeat 
> a sequence of steps to identify and resolve required elements:
>  * Build C++
>  * Build Python
>  * Attempt to build docs
>  * Evaluate missing features based on error messages
> This adds significant overhead to simply building the docs, making it harder 
> for less experienced users to contribute docs improvements.
> Rather than attempt to follow the steps above, I attempted to use a nightly 
> pyarrow build to satisfy docs build requirements. This did not work, though, 
> because nightly builds are not built with the options needed to build docs:
> {code:java}
> (base) todd@pop-os:~/arrow$ pushd docs
> make html
> popd
> ~/arrow/docs ~/arrow
> sphinx-build -b html -d _build/doctrees  -j8 source _build/html
> Running Sphinx v5.0.2
> WARNING: Invalid configuration value found: 'language = None'. Update your 
> configuration to a valid langauge code. Falling back to 'en' (English).
> making output directory... done
> [autosummary] generating autosummary for: c_glib/index.rst, cpp/api.rst, 
> cpp/api/array.rst, cpp/api/async.rst, cpp/api/builder.rst, cpp/api/c_abi.rst, 
> cpp/api/compute.rst, cpp/api/cuda.rst, cpp/api/dataset.rst, 
> cpp/api/datatype.rst, ..., python/json.rst, python/memory.rst, 
> python/numpy.rst, python/orc.rst, python/pandas.rst, python/parquet.rst, 
> python/plasma.rst, python/timestamps.rst, r/index.rst, status.rst
> WARNING: [autosummary] failed to import pyarrow.compute.CumulativeSumOptions.
> Possible hints:
> * ModuleNotFoundError: No module named 
> 'pyarrow.compute.CumulativeSumOptions'; 'pyarrow.compute' is not a package
> * AttributeError: module 'pyarrow.compute' has no attribute 
> 'CumulativeSumOptions'
> * ImportError: 
> WARNING: [autosummary] failed to import pyarrow.compute.cumulative_sum.
> Possible hints:
> * ModuleNotFoundError: No module named 'pyarrow.compute.cumulative_sum'; 
> 'pyarrow.compute' is not a package
> * ImportError: 
> * AttributeError: module 'pyarrow.compute' has no attribute 'cumulative_sum'
> WARNING: [autosummary] failed to import 
> pyarrow.compute.cumulative_sum_checked.
> Possible hints:
> * ImportError: 
> * AttributeError: module 'pyarrow.compute' has no attribute 
> 'cumulative_sum_checked'
> * ModuleNotFoundError: No module named 
> 'pyarrow.compute.cumulative_sum_checked'; 'pyarrow.compute' is not a package
> WARNING: [autosummary] failed to import pyarrow.dataset.WrittenFile.
> Possible hints:
> * ModuleNotFoundError: No module named 'pyarrow.dataset.WrittenFile'; 
> 'pyarrow.dataset' is not a package
> * ImportError: 
> * AttributeError: module 'pyarrow.dataset' has no attribute 
> 'WrittenFile'Extension error (sphinx.ext.autosummary):
> Handler  for event 
> 'builder-inited' threw an exception (exception: no module named 
> pyarrow.parquet.encryption)
> make: *** [Makefile:81: html] Error 2
> ~/arrow
> {code}
> Nightly builds should be made sufficient to build documentation.
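
Given the import errors in the log quoted above, a quick way to check whether an 
installed pyarrow build is "sufficiently comprehensive" is to try importing the 
optional modules the docs touch (the module list below is illustrative, not 
exhaustive):

{code:python}
import importlib

# Optional pieces the docs build complains about when missing; extend as needed.
optional_modules = [
    "pyarrow.cuda",
    "pyarrow.dataset",
    "pyarrow.parquet.encryption",
]

for mod in optional_modules:
    try:
        importlib.import_module(mod)
        print(f"{mod}: OK")
    except ImportError as exc:
        print(f"{mod}: MISSING ({exc})")
{code}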



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-13457) [C++][Docs] Scalars User Guide

2022-08-10 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-13457:
---
Fix Version/s: 10.0.0

> [C++][Docs] Scalars User Guide
> --
>
> Key: ARROW-13457
> URL: https://issues.apache.org/jira/browse/ARROW-13457
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Rares Vernica
>Priority: Minor
> Fix For: 10.0.0
>
>
> In the C++ User Guide, Scalars are briefly mentioned in Compute Functions 
> [https://arrow.apache.org/docs/cpp/compute.html] It would be nice to have 
> some examples on some of the ways a Scalar can be created or manipulated.
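
For reference, the Python bindings hint at the kind of content such a guide 
could cover; a minimal pyarrow sketch of creating and using scalars (the C++ API 
is analogous):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

# Create a scalar directly from a Python value...
three = pa.scalar(3, type=pa.int64())

# ...or pull one out of an array.
arr = pa.array([1, 2, 3])
first = arr[0]                 # Int64Scalar

# Scalars participate in compute functions alongside arrays...
print(pc.add(arr, three))      # [4, 5, 6]

# ...and convert back to plain Python objects.
print(first.as_py())           # 1
{code}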



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-13454) [C++][Docs] Tables vs Record Batches

2022-08-10 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-13454:
---
Fix Version/s: 10.0.0

> [C++][Docs] Tables vs Record Batches
> 
>
> Key: ARROW-13454
> URL: https://issues.apache.org/jira/browse/ARROW-13454
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Rares Vernica
>Priority: Minor
> Fix For: 10.0.0
>
>
> It is not clear what the difference is between Tables and Record Batches is 
> as described on [https://arrow.apache.org/docs/cpp/tables.html#tables]
> _A 
> [{{arrow::Table}}|https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow5TableE]
>  is a two-dimensional dataset with chunked arrays for columns_
> _A 
> [{{arrow::RecordBatch}}|https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow11RecordBatchE]
>  is a two-dimensional dataset of a number of contiguous arrays_
> Or maybe the distinction between _chunked arrays_ and _contiguous arrays_ can 
> be clarified.
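
A small pyarrow sketch of the distinction the issue asks to document (the C++ 
classes behave the same way): a record batch column is a single contiguous 
array, while a table column is a chunked array that may span several batches.

{code:python}
import pyarrow as pa

batch1 = pa.RecordBatch.from_pydict({"x": [1, 2]})
batch2 = pa.RecordBatch.from_pydict({"x": [3, 4]})

# A RecordBatch column is one contiguous Array.
print(type(batch1.column(0)).__name__)   # Int64Array

# A Table built from several batches keeps them as chunks of a ChunkedArray.
table = pa.Table.from_batches([batch1, batch2])
print(type(table.column(0)).__name__)    # ChunkedArray
print(table.column(0).num_chunks)        # 2
{code}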



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

