[jira] [Assigned] (ARROW-18400) [Python] Quadratic memory usage of Table.to_pandas with nested data
[ https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Will Jones reassigned ARROW-18400:
----------------------------------
    Assignee: Will Jones

> [Python] Quadratic memory usage of Table.to_pandas with nested data
> -------------------------------------------------------------------
>
>                 Key: ARROW-18400
>                 URL: https://issues.apache.org/jira/browse/ARROW-18400
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 10.0.1
>         Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900X with 64 GB RAM
>            Reporter: Adam Reeve
>            Assignee: Will Jones
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 11.0.0
>
>         Attachments: test_memory.py
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame shows quadratic memory usage and will eventually run out of memory for reasonably small files. I had initially thought this was a regression since 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
>
> _characters = string.ascii_uppercase + string.digits + string.punctuation
>
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
>
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>         [{
>             'a': None if i % 1000 == 0 else np.random.choice(1, size=3).astype(np.int64),
>             'b': None if i % 100 == 0 else random.choice(range(100)),
>             'c': None if i % 10 == 0 else make_random_string(5)
>         } for i in range(arr_len)]
>     ))
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
>
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem; it's the to_pandas method that exhibits the large memory usage. I haven't tested generating nested Arrow data in memory without writing Parquet from Pandas, but I assume the problem probably isn't Parquet specific.
> Memory usage I see when reading different sized files on a machine with 64 GB RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when the row count doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but then quadruples from 1,024k to 2,048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1, so it looks like something changed between 7.0.0 and 8.0.0.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (ARROW-18400) [Python] Quadratic memory usage of Table.to_pandas with nested data
[ https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655100#comment-17655100 ]

Will Jones commented on ARROW-18400:
------------------------------------

Took a look at the issue in Joris' last repro. It seems to stem from the fact that the {{ListArray.values()}} method in C++ doesn't account for slices. I think if it did, the numpy conversion issue would be solved.

Created a repro in a PR here: [https://github.com/apache/arrow/pull/15210]

Do you agree with that assessment?
[jira] [Resolved] (ARROW-18411) [Python] MapType comparison ignores nullable flag of item_field
[ https://issues.apache.org/jira/browse/ARROW-18411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Will Jones resolved ARROW-18411.
--------------------------------
    Resolution: Fixed

> [Python] MapType comparison ignores nullable flag of item_field
> ---------------------------------------------------------------
>
>                 Key: ARROW-18411
>                 URL: https://issues.apache.org/jira/browse/ARROW-18411
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>         Environment: pyarrow==10.0.1
>            Reporter: &res
>            Assignee: Will Jones
>            Priority: Minor
>
> By default MapType value fields are nullable:
> {code:python}
> pa.map_(pa.string(), pa.int32()).item_field.nullable == True
> {code}
> It is possible to mark the value field of a MapType as not-nullable:
> {code:python}
> pa.map_(pa.string(), pa.field("value", pa.int32(), nullable=False)).item_field.nullable == False
> {code}
> But comparing these two types, which are semantically different, returns True:
> {code:python}
> pa.map_(pa.string(), pa.int32()) == pa.map_(pa.string(), pa.field("value", pa.int32(), nullable=False))  # Returns True
> {code}
> So it looks like the comparison omits the nullable flag.
> {code:python}
> import pyarrow as pa
>
> map_type = pa.map_(pa.string(), pa.int32())
> non_null_map_type = pa.map_(pa.string(), pa.field("value", pa.int32(), nullable=False))
> nullable_map_type = pa.map_(pa.string(), pa.field("value", pa.int32(), nullable=True))
> map_type_different_field_name = pa.map_(pa.string(), pa.field("value", pa.int32(), nullable=True))
>
> assert nullable_map_type == map_type  # Wrong
> assert str(nullable_map_type) == str(map_type)
> assert str(non_null_map_type) == str(map_type)  # Wrong
> assert non_null_map_type == map_type
> assert non_null_map_type.item_type == map_type.item_type
> assert non_null_map_type.item_field != map_type.item_field
> assert non_null_map_type.item_field.nullable != map_type.item_field.nullable
> assert non_null_map_type.item_field.name == map_type.item_field.name
> assert map_type == map_type_different_field_name  # This makes sense
> {code}
[jira] [Resolved] (ARROW-17302) [R] Configure curl timeout policy for S3
[ https://issues.apache.org/jira/browse/ARROW-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Will Jones resolved ARROW-17302.
--------------------------------
    Resolution: Fixed

> [R] Configure curl timeout policy for S3
> ----------------------------------------
>
>                 Key: ARROW-17302
>                 URL: https://issues.apache.org/jira/browse/ARROW-17302
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: R
>    Affects Versions: 9.0.0
>            Reporter: Dragoș Moldovan-Grünfeld
>            Assignee: Nicola Crane
>            Priority: Major
>              Labels: good-first-issue, pull-request-available
>             Fix For: 11.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See ARROW-16521
[jira] [Commented] (ARROW-18202) [R][C++] Different behaviour of R's base::gsub() binding aka libarrow's replace_string_regex kernel since 10.0.0
[ https://issues.apache.org/jira/browse/ARROW-18202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653186#comment-17653186 ]

Will Jones commented on ARROW-18202:
------------------------------------

The following lines were added to return early if the input string is empty:

https://github.com/apache/arrow/blob/498b645e1d09306bf5399a9a019a5caa99513815/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2048-L2051

> [R][C++] Different behaviour of R's base::gsub() binding aka libarrow's replace_string_regex kernel since 10.0.0
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-18202
>                 URL: https://issues.apache.org/jira/browse/ARROW-18202
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 10.0.0
>            Reporter: Lorenzo Isella
>            Assignee: Will Jones
>            Priority: Critical
>             Fix For: 11.0.0
>
> Hello,
> I think there is a problem with arrow 10.0 and R. I did not have this issue with arrow 9.0.
> Could you please have a look?
> Many thanks
>
> {code:r}
> library(tidyverse)
> library(arrow)
>
> ll <- c("100", "1000", "200", "3000", "50", "500", "", "Not Range")
>
> df <- tibble(x = rep(ll, 1000), y = seq(8000))
> write_tsv(df, "data.tsv")
>
> data <- open_dataset("data.tsv", format = "tsv",
>                      skip_rows = 1,
>                      schema = schema(x = string(), y = double()))
>
> test <- data |>
>     collect()
>
> ### I want to replace the "" with "0". I believe this worked with arrow 9.0
> df2 <- data |>
>     mutate(x = gsub("^$", "0", x)) |>
>     collect()
>
> df2 ### now I did not modify the "" entries in x
> #> # A tibble: 8,000 × 2
> #>    x           y
> #>
> #>  1 "100"       1
> #>  2 "1000"      2
> #>  3 "200"       3
> #>  4 "3000"      4
> #>  5 "50"        5
> #>  6 "500"       6
> #>  7 ""          7
> #>  8 "Not Range" 8
> #>  9 "100"       9
> #> 10 "1000"     10
> #> # … with 7,990 more rows
>
> df3 <- df |>
>     mutate(x = gsub("^$", "0", x))
>
> df3 ## and this is fine
> #> # A tibble: 8,000 × 2
> #>    x         y
> #>
> #>  1 100       1
> #>  2 1000      2
> #>  3 200       3
> #>  4 3000      4
> #>  5 50        5
> #>  6 500       6
> #>  7 0         7
> #>  8 Not Range 8
> #>  9 100       9
> #> 10 1000     10
> #> # … with 7,990 more rows
>
> ## How to fix this... I believe this issue did not arise with arrow 9.0.
> sessionInfo()
> #> R version 4.2.1 (2022-06-23)
> #> Platform: x86_64-pc-linux-gnu (64-bit)
> #> Running under: Debian GNU/Linux 11 (bullseye)
> #>
> #> Matrix products: default
> #> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
> #> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
> #>
> #> locale:
> #>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
> #>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
> #>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
> #>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
> #>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> #> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
> #>
> #> attached base packages:
> #> [1] stats     graphics  grDevices utils     datasets  methods   base
> #>
> #> other attached packages:
> #> [1] arrow_10.0.0    forcats_0.5.2   stringr_1.4.1   dplyr_1.0.10
> #> [5] purrr_0.3.5     readr_2.1.3     tidyr_1.2.1     tibble_3.1.8
> #> [9] ggplot2_3.3.6   tidyverse_1.3.2
> #>
> #> loaded via a namespace (and not attached):
> #>  [1] lubridate_1.8.0     assertthat_0.2.1    digest_0.6.30
> #>  [4] utf8_1.2.2          R6_2.5.1            cellranger_1.1.0
> #>  [7] backports_1.4.1     reprex_2.0.2        evaluate_0.17
> #> [10] httr_1.4.4          highr_0.9           pillar_1.8.1
> #> [13] rlang_1.0.6         googlesheets4_1.0.1 readxl_1.4.1
> #> [16] R.utils_2.12.1      R.oo_1.25.0         rmarkdown_2.17
> #> [19] styler_1.8.0        googledrive_2.0.0   bit_4.0.4
> #> [22] munsell_0.5.0       broom_1.0.1         compiler_4.2.1
> #> [25] modelr_0.1.9        xfun_0.34           pkgconfig_2.0.3
> #> [28] htmltools_0.5.3     tidyselect_1.2.0    fansi_1.0.3
> #> [31] crayon_1.5.2        tzdb_0.3.0          dbplyr_2.2.1
> #> [34] withr_2.5.0         R.methodsS3_1.8.2   grid_4.2.1
> #> [37] jsonlite_1.8.3      gtable_0.3.1        lifecycle_1.0.3
> #> [40] DBI_1.1.3           magrittr_2.0.3      scales_1.2.1
> #> [43] vroom_1.6.0         cli_3.4.1           stringi_1.7.8
> #> [46] fs_1.5.2            xml2_1.3.3          ellipsis_0.3.2
> #> [49] generics_0.1.3      vctrs_0.5.0
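For reference, the semantics the reporter expects (and what base R's gsub provides) is that the anchored pattern ^$ matches an empty string, so the early return on empty input linked in the comment above skips a legitimate match. A minimal sketch with Python's re module, standing in for the replace_string_regex kernel:

```python
import re

# "^$" matches only a fully empty string, so replacement should turn
# "" into "0" and leave non-empty values untouched.
values = ["100", "1000", "", "Not Range"]
replaced = [re.sub("^$", "0", v) for v in values]
assert replaced == ["100", "1000", "0", "Not Range"]
```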
[jira] [Assigned] (ARROW-18202) [R][C++] Different behaviour of R's base::gsub() binding aka libarrow's replace_string_regex kernel since 10.0.0
[ https://issues.apache.org/jira/browse/ARROW-18202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Will Jones reassigned ARROW-18202:
----------------------------------
    Assignee: Will Jones
[jira] [Commented] (ARROW-18195) [R][C++] Final value returned by case_when is NA when input has 64 or more values and 1 or more NAs
[ https://issues.apache.org/jira/browse/ARROW-18195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653169#comment-17653169 ]

Will Jones commented on ARROW-18195:
------------------------------------

Thank you for all the reproductions. I zeroed in on one simple one and was able to reproduce it in C++. Additional observations:

{code:R}
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)

# Condition has NA and more than 64 values
# Expression generated internally:
#   case_when({1=x}, 1)
test_df4 = tibble::tibble(x = c(NA, rep(TRUE, 64)))
test_arrow4 = arrow_table(test_df4)
test_arrow4 %>%
  mutate(y = case_when(x ~ 1L)) %>%
  collect() %>%
  tail()
#> # A tibble: 6 × 2
#>   x         y
#> 1 TRUE      1
#> 2 TRUE      1
#> 3 TRUE      1
#> 4 TRUE      1
#> 5 TRUE      1
#> 6 TRUE     NA

# It seems to be coming from the next clause, which defaults to NA
# Expression generated internally:
#   case_when({1=x, 2=true}, 1, 2)
test_df4 = tibble::tibble(x = c(NA, rep(TRUE, 64)))
test_arrow4 = arrow_table(test_df4)
test_arrow4 %>%
  mutate(y = case_when(x ~ 1L, TRUE ~ 2L)) %>%
  collect() %>%
  tail()
#> # A tibble: 6 × 2
#>   x         y
#> 1 TRUE      1
#> 2 TRUE      1
#> 3 TRUE      1
#> 4 TRUE      1
#> 5 TRUE      1
#> 6 TRUE      2

# Applies also to vectors
test_df4 = tibble::tibble(x = c(NA, rep(TRUE, 64)),
                          left = rep(1L, 65),
                          right = rep(2L, 65))
test_arrow4 = arrow_table(test_df4)
test_arrow4 %>%
  mutate(y = case_when(x ~ left, TRUE ~ right)) %>%
  collect() %>%
  tail()
#> # A tibble: 6 × 4
#>   x      left right     y
#> 1 TRUE      1     2     1
#> 2 TRUE      1     2     1
#> 3 TRUE      1     2     1
#> 4 TRUE      1     2     1
#> 5 TRUE      1     2     1
#> 6 TRUE      1     2     2

# It does seem the 65th and onward element become the else value for no reason
lapply(c(65, 68, 127, 140), function(len) {
  test_df4 = tibble::tibble(x = c(NA, rep(TRUE, len - 1)))
  test_arrow4 = arrow_table(test_df4)
  y <- test_arrow4 %>%
    mutate(y = case_when(x ~ 1L)) %>%
    collect() %>%
    .$y
  which(is.na(y))
})
#> [[1]]
#> [1]  1 65
#>
#> [[2]]
#> [1]  1 65 66 67 68
#>
#> [[3]]
#>  [1]   1  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82
#> [20]  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101
#> [39] 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
#> [58] 121 122 123 124 125 126 127
#>
#> [[4]]
#>  [1]  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83
#> [20]  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102
#> [39] 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121
#> [58] 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
{code}

Created on 2022-12-30 with [reprex v2.0.2](https://reprex.tidyverse.org)
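The pattern in the observations above (correct results for the first 64 elements, the else/NA value from element 65 onward) is consistent with a kernel that walks the validity bitmap one 64-bit word at a time and mishandles words after the first. The diagnosis is an assumption; the boundary arithmetic itself is just this:

```python
# Arrow validity bitmaps are commonly processed in 64-bit words.
WORD_BITS = 64

def word_of(index: int) -> int:
    # Which 64-bit word of the bitmap holds this element's validity bit.
    return index // WORD_BITS

# Elements 0..63 live in word 0; element 64 (the 65th value, exactly where
# the wrong results begin) is the first bit of word 1.
assert word_of(63) == 0
assert word_of(64) == 1
```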
[jira] [Assigned] (ARROW-18195) [R][C++] Final value returned by case_when is NA when input has 64 or more values and 1 or more NAs
[ https://issues.apache.org/jira/browse/ARROW-18195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Will Jones reassigned ARROW-18195:
----------------------------------
    Assignee: Will Jones

> [R][C++] Final value returned by case_when is NA when input has 64 or more values and 1 or more NAs
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-18195
>                 URL: https://issues.apache.org/jira/browse/ARROW-18195
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 10.0.0
>            Reporter: Lee Mendelowitz
>            Assignee: Will Jones
>            Priority: Critical
>             Fix For: 11.0.0
>
>         Attachments: test_issue.R
>
> There appears to be a bug when processing an Arrow table with NA values and using `dplyr::case_when`. A reproducible example is below: the output from arrow table processing does not match the output when processing a tibble. If the NAs are removed from the dataframe, then the outputs match.
> {noformat}
> ``` r
> library(dplyr)
> #>
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #>
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #>
> #>     intersect, setdiff, setequal, union
> library(arrow)
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #>     timestamp
> library(assertthat)
>
> play_results = c('single', 'double', 'triple', 'home_run')
> nrows = 1000
>
> # Change frac_na to 0, and the result error disappears.
> frac_na = 0.05
>
> # Create a test dataframe with NA values
> test_df = tibble(
>     play_result = sample(play_results, nrows, replace = TRUE)
> ) %>%
>     mutate(
>         play_result = ifelse(runif(nrows) < frac_na, NA_character_, play_result)
>     )
>
> test_arrow = arrow_table(test_df)
>
> process_plays = function(df) {
>     df %>%
>         mutate(
>             avg = case_when(
>                 play_result == 'single' ~ 1,
>                 play_result == 'double' ~ 1,
>                 play_result == 'triple' ~ 1,
>                 play_result == 'home_run' ~ 1,
>                 is.na(play_result) ~ NA_real_,
>                 TRUE ~ 0
>             )
>         ) %>%
>         count(play_result, avg) %>%
>         arrange(play_result)
> }
>
> # Compare arrow_table result to tibble result
> result_tibble = process_plays(test_df)
> result_arrow = process_plays(test_arrow) %>% collect()
>
> assertthat::assert_that(identical(result_tibble, result_arrow))
> #> Error: result_tibble not identical to result_arrow
> ```
> Created on 2022-10-29 with [reprex v2.0.2](https://reprex.tidyverse.org)
> {noformat}
> I have reproduced this issue both on Mac OS and Ubuntu 20.04.
>
> {noformat}
> ```
> r$> sessionInfo()
> R version 4.2.1 (2022-06-23)
> Platform: aarch64-apple-darwin21.5.0 (64-bit)
> Running under: macOS Monterey 12.5.1
>
> Matrix products: default
> BLAS:   /opt/homebrew/Cellar/openblas/0.3.20/lib/libopenblasp-r0.3.20.dylib
> LAPACK: /opt/homebrew/Cellar/r/4.2.1/lib/R/lib/libRlapack.dylib
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
>
> other attached packages:
> [1] assertthat_0.2.1 arrow_10.0.0     dplyr_1.0.10
>
> loaded via a namespace (and not attached):
>  [1] compiler_4.2.1    pillar_1.8.1      highr_0.9         R.methodsS3_1.8.2
>  [5] R.utils_2.12.0    tools_4.2.1       bit_4.0.4         digest_0.6.29
>  [9] evaluate_0.15     lifecycle_1.0.1   tibble_3.1.8      R.cache_0.16.0
> [13] pkgconfig_2.0.3   rlang_1.0.5       reprex_2.0.2      DBI_1.1.2
> [17] cli_3.3.0         rstudioapi_0.13   yaml_2.3.5        xfun_0.31
> [21] fastmap_1.1.0     withr_2.5.0       styler_1.8.0      knitr_1.39
> [25] generics_0.1.3    fs_1.5.2          vctrs_0.4.1       bit64_4.0.5
> [29] tidyselect_1.1.2  glue_1.6.2        R6_2.5.1          processx_3.5.3
> [33] fansi_1.0.3       rmarkdown_2.14    purrr_0.3.4       callr_3.7.0
> [37] clipr_0.8.0       magrittr_2.0.3    ellipsis_0.3.2    ps_1.7.0
> [41] htmltools_0.5.3   renv_0.16.0       utf8_1.2.2        R.oo_1.25.0
> ```
> {noformat}
[jira] [Commented] (ARROW-18400) [Python] Quadratic memory usage of Table.to_pandas with nested data
[ https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641484#comment-17641484 ]

Will Jones commented on ARROW-18400:
------------------------------------

Under the hood, {{pyarrow.parquet.read_table}} is using Dataset. Has anyone looked at the effect of changing {{batch_readahead}} and {{fragment_readahead}}? (They can be passed as kwargs to {{.to_table()}}.) ([docs|https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner])
[jira] [Commented] (ARROW-18411) [Python] MapType comparison ignores nullable flag of item_field
[ https://issues.apache.org/jira/browse/ARROW-18411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640137#comment-17640137 ]

Will Jones commented on ARROW-18411:
------------------------------------

Thanks for reporting this. This will be fixed by [https://github.com/apache/arrow/pull/13851]
[jira] [Assigned] (ARROW-18411) [Python] MapType comparison ignores nullable flag of item_field
[ https://issues.apache.org/jira/browse/ARROW-18411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Will Jones reassigned ARROW-18411:
----------------------------------
    Assignee: Will Jones
> {code:java} > import pyarrow as pa > map_type = pa.map_(pa.string(), pa.int32()) > non_null_map_type = pa.map_(pa.string(), pa.field("value", pa.int32(), > nullable=False)) > nullable_map_type = pa.map_(pa.string(), pa.field("value", pa.int32(), > nullable=True)) > map_type_different_field_name = pa.map_(pa.string(), pa.field("value", > pa.int32(), nullable=True)) > assert nullable_map_type == map_type # Wrong > assert str(nullable_map_type) == str(map_type) > assert str(non_null_map_type) == str(map_type) # Wrong > assert non_null_map_type == map_type > assert non_null_map_type.item_type == map_type.item_type > assert non_null_map_type.item_field != map_type.item_field > assert non_null_map_type.item_field.nullable != map_type.item_field.nullable > assert non_null_map_type.item_field.name == map_type.item_field.name > assert map_type == map_type_different_field_name # This makes sense > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-15812) [R] Allow user to supply col_names argument when reading in a CSV dataset
[ https://issues.apache.org/jira/browse/ARROW-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-15812: -- Assignee: Will Jones > [R] Allow user to supply col_names argument when reading in a CSV dataset > - > > Key: ARROW-15812 > URL: https://issues.apache.org/jira/browse/ARROW-15812 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Will Jones >Priority: Major > > Allow the user to supply the {{col_names}} argument from {{readr}} when > reading in a dataset. > This is already possible when reading in a single CSV file via > {{arrow::read_csv_arrow()}} via the {{readr_to_csv_read_options}} function, > and so once the C++ functionality to autogenerate column names for Datasets > is implemented, we should hook up {{readr_to_csv_read_options}} in > {{csv_file_format_read_opts}} just like we have with > {{readr_to_csv_parse_options}} in {{csv_file_format_parse_options}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15812) [R] Allow user to supply col_names argument when reading in a CSV dataset
[ https://issues.apache.org/jira/browse/ARROW-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636918#comment-17636918 ] Will Jones commented on ARROW-15812: Auto-generation of column names was added to Datasets in https://issues.apache.org/jira/browse/ARROW-16436 > [R] Allow user to supply col_names argument when reading in a CSV dataset > - > > Key: ARROW-15812 > URL: https://issues.apache.org/jira/browse/ARROW-15812 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Priority: Major > > Allow the user to supply the {{col_names}} argument from {{readr}} when > reading in a dataset. > This is already possible when reading in a single CSV file via > {{arrow::read_csv_arrow()}} via the {{readr_to_csv_read_options}} function, > and so once the C++ functionality to autogenerate column names for Datasets > is implemented, we should hook up {{readr_to_csv_read_options}} in > {{csv_file_format_read_opts}} just like we have with > {{readr_to_csv_parse_options}} in {{csv_file_format_parse_options}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-15470) [R] Allows user to specify string to be used for missing data when writing CSV dataset
[ https://issues.apache.org/jira/browse/ARROW-15470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-15470: -- Assignee: Will Jones > [R] Allows user to specify string to be used for missing data when writing > CSV dataset > -- > > Key: ARROW-15470 > URL: https://issues.apache.org/jira/browse/ARROW-15470 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Will Jones >Priority: Major > > The ability to select the string to be used for missing data was implemented > for the CSV Writer in ARROW-14903 and as David Li points out below, is > available, so I think we just need to hook it up on the R side. > This requires the values passed in as the "na" argument to be instead passed > through to "null_strings", similarly to what has been done with "skip" and > "skip_rows" in ARROW-15743. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18355) [R] support the quoted_na argument in open_dataset for CSVs by mapping it to CSVConvertOptions$strings_can_be_null
[ https://issues.apache.org/jira/browse/ARROW-18355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636012#comment-17636012 ] Will Jones commented on ARROW-18355: This feature is "soft-deprecated" in readr. Do we still want to add support? > [R] support the quoted_na argument in open_dataset for CSVs by mapping it to > CSVConvertOptions$strings_can_be_null > -- > > Key: ARROW-18355 > URL: https://issues.apache.org/jira/browse/ARROW-18355 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Will Jones >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-18355) [R] support the quoted_na argument in open_dataset for CSVs by mapping it to CSVConvertOptions$strings_can_be_null
[ https://issues.apache.org/jira/browse/ARROW-18355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-18355: -- Assignee: Will Jones > [R] support the quoted_na argument in open_dataset for CSVs by mapping it to > CSVConvertOptions$strings_can_be_null > -- > > Key: ARROW-18355 > URL: https://issues.apache.org/jira/browse/ARROW-18355 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Will Jones >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18359) PrettyPrint Improvements
Will Jones created ARROW-18359: -- Summary: PrettyPrint Improvements Key: ARROW-18359 URL: https://issues.apache.org/jira/browse/ARROW-18359 Project: Apache Arrow Issue Type: Improvement Components: C++, Python, R Reporter: Will Jones We have some pretty printing capabilities, but we may want to think at a high level about the design first. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-15026) [Python] datetime.timedelta to pyarrow.duration('us') silently overflows
[ https://issues.apache.org/jira/browse/ARROW-15026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones resolved ARROW-15026. Fix Version/s: 11.0.0 Resolution: Fixed Issue resolved by pull request 13718 [https://github.com/apache/arrow/pull/13718] > [Python] datetime.timedelta to pyarrow.duration('us') silently overflows > > > Key: ARROW-15026 > URL: https://issues.apache.org/jira/browse/ARROW-15026 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Andreas Rappold >Assignee: Anja Boskovic >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > > Hi! This reproduces the issue: > {code:java} > # python 3.9.9 > # pyarrow 6.0.1 > import datetime > import pyarrow > d = datetime.timedelta(days=-106751992, seconds=71945, microseconds=224192) > pyarrow.scalar(d) > # <pyarrow.DurationScalar: datetime.timedelta(days=-106751992, seconds=71945, microseconds=224192)> > pyarrow.scalar(d).as_py() == d > # True > d2 = d - datetime.timedelta(microseconds=1) > pyarrow.scalar(d2) > # <pyarrow.DurationScalar: datetime.timedelta(days=106751991, seconds=14454, microseconds=775807)> > pyarrow.scalar(d2).as_py() == d2 > # False{code} > Other conversions (e.g. to int*) raise an exception instead. I didn't check > if duration overflows for too large timedeltas. If it's easy to fix, point me > in the right direction and I'll try to create a PR. Thanks > -- This message was sent by Atlassian Jira (v8.20.10#820010)
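The overflow boundary in the report can be verified in pure Python: `d` is exactly the int64 minimum when expressed in microseconds, so subtracting one more microsecond wraps around. The helper `timedelta_to_us_checked` below is a hypothetical illustration of the bounds check the issue asks for, not Arrow's actual implementation:

```python
import datetime

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def timedelta_to_us_checked(td: datetime.timedelta) -> int:
    # timedelta stores exact (days, seconds, microseconds) components,
    # so total microseconds can be computed without float rounding.
    us = (td.days * 86_400 + td.seconds) * 1_000_000 + td.microseconds
    if not (INT64_MIN <= us <= INT64_MAX):
        raise OverflowError(f"{td!r} does not fit in duration('us')")
    return us

# The boundary case from the report: d maps exactly to int64 min.
d = datetime.timedelta(days=-106751992, seconds=71945, microseconds=224192)
assert timedelta_to_us_checked(d) == INT64_MIN

# One microsecond less overflows, which is what pyarrow.scalar silently wrapped.
d2 = d - datetime.timedelta(microseconds=1)
try:
    timedelta_to_us_checked(d2)
    raise AssertionError("expected OverflowError")
except OverflowError:
    pass
```

This also explains the garbled reprs in the report: the wrapped value 2**63 - 1 microseconds is precisely `timedelta(days=106751991, seconds=14454, microseconds=775807)`.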
[jira] [Assigned] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-14196: -- Assignee: Will Jones > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Assignee: Will Jones >Priority: Major > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17812) [C++][Documentation] Add Gandiva User Guide
[ https://issues.apache.org/jira/browse/ARROW-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones resolved ARROW-17812. Resolution: Fixed Issue resolved by pull request 14200 [https://github.com/apache/arrow/pull/14200] > [C++][Documentation] Add Gandiva User Guide > --- > > Key: ARROW-17812 > URL: https://issues.apache.org/jira/browse/ARROW-17812 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Gandiva >Reporter: Will Jones >Assignee: Will Jones >Priority: Minor > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 5.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments
[ https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-18246: --- Component/s: Documentation > [Python][Docs] PyArrow table join docstring typos for left and right suffix > arguments > - > > Key: ARROW-18246 > URL: https://issues.apache.org/jira/browse/ARROW-18246 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation >Reporter: d33bs >Assignee: Will Jones >Priority: Minor > Labels: docs-impacting, documentation, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Hello, thank you for all the amazing work on Arrow! I'd like to report a > potential issue with PyArrow's Table Join docstring which may make it > confusing for others to read. This content is I believe translated into the > documentation website as well. > The content which needs to be corrected may be found starting at: > [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737] > The block currently reads: > {code:java} > left_suffix : str, default None > Which suffix to add to right column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffic to add to the left column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > It could be improved with the following: > {code:java} > left_suffix : str, default None > Which suffix to add to left column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffix to add to the right column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > Please let me know if I may clarify or if there are any questions on the > above. Thanks again for your help! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments
[ https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-18246: --- Fix Version/s: 11.0.0 > [Python][Docs] PyArrow table join docstring typos for left and right suffix > arguments > - > > Key: ARROW-18246 > URL: https://issues.apache.org/jira/browse/ARROW-18246 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation >Reporter: d33bs >Assignee: Will Jones >Priority: Minor > Labels: docs-impacting, documentation, pull-request-available > Fix For: 11.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Hello, thank you for all the amazing work on Arrow! I'd like to report a > potential issue with PyArrow's Table Join docstring which may make it > confusing for others to read. This content is I believe translated into the > documentation website as well. > The content which needs to be corrected may be found starting at: > [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737] > The block currently reads: > {code:java} > left_suffix : str, default None > Which suffix to add to right column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffic to add to the left column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > It could be improved with the following: > {code:java} > left_suffix : str, default None > Which suffix to add to left column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffix to add to the right column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > Please let me know if I may clarify or if there are any questions on the > above. Thanks again for your help! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments
[ https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629039#comment-17629039 ] Will Jones commented on ARROW-18246: Thanks for reporting. I have created an update fixing those and a couple other issues in the docs. > [Python][Docs] PyArrow table join docstring typos for left and right suffix > arguments > - > > Key: ARROW-18246 > URL: https://issues.apache.org/jira/browse/ARROW-18246 > Project: Apache Arrow > Issue Type: Bug >Reporter: d33bs >Assignee: Will Jones >Priority: Minor > Labels: docs-impacting, documentation > > Hello, thank you for all the amazing work on Arrow! I'd like to report a > potential issue with PyArrow's Table Join docstring which may make it > confusing for others to read. This content is I believe translated into the > documentation website as well. > The content which needs to be corrected may be found starting at: > [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737] > The block currently reads: > {code:java} > left_suffix : str, default None > Which suffix to add to right column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffic to add to the left column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > It could be improved with the following: > {code:java} > left_suffix : str, default None > Which suffix to add to left column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffix to add to the right column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > Please let me know if I may clarify or if there are any questions on the > above. Thanks again for your help! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments
[ https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-18246: -- Assignee: Will Jones > [Python][Docs] PyArrow table join docstring typos for left and right suffix > arguments > - > > Key: ARROW-18246 > URL: https://issues.apache.org/jira/browse/ARROW-18246 > Project: Apache Arrow > Issue Type: Bug >Reporter: d33bs >Assignee: Will Jones >Priority: Minor > Labels: docs-impacting, documentation > > Hello, thank you for all the amazing work on Arrow! I'd like to report a > potential issue with PyArrow's Table Join docstring which may make it > confusing for others to read. This content is I believe translated into the > documentation website as well. > The content which needs to be corrected may be found starting at: > [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737] > The block currently reads: > {code:java} > left_suffix : str, default None > Which suffix to add to right column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffic to add to the left column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > It could be improved with the following: > {code:java} > left_suffix : str, default None > Which suffix to add to left column names. This prevents confusion > when the columns in left and right tables have colliding names. > right_suffix : str, default None > Which suffix to add to the right column names. This prevents confusion > when the columns in left and right tables have colliding names.{code} > Please let me know if I may clarify or if there are any questions on the > above. Thanks again for your help! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-18245) wheels for PyArrow + Python 3.11
[ https://issues.apache.org/jira/browse/ARROW-18245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones closed ARROW-18245. -- Resolution: Duplicate Hello! This is being actively worked on in ARROW-17487. I've closed this ticket since it duplicates that one. > wheels for PyArrow + Python 3.11 > > > Key: ARROW-18245 > URL: https://issues.apache.org/jira/browse/ARROW-18245 > Project: Apache Arrow > Issue Type: Improvement >Affects Versions: 10.0.0 > Environment: Linux RH8 >Reporter: Aleksandar >Priority: Minor > > Hi, > May we know the plan for when the PyPI pyarrow 10 package will have build > dependencies installed as part of the package? Right now the pyarrow 10 package > has no wheels for Python 3.11.0. > Maybe this is not the right forum, but someone is maintaining and packaging > these things for developers. > Thanks much and sorry for intruding ... -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18228) AWS Error SLOW_DOWN during PutObject operation
[ https://issues.apache.org/jira/browse/ARROW-18228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629016#comment-17629016 ] Will Jones commented on ARROW-18228: If you are still getting errors, it might be worth looking at ways to slow your app down somewhat so you don't hit these errors. [https://aws.amazon.com/premiumsupport/knowledge-center/http-5xx-errors-s3/] [https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html] [https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html] I'm not sure if we have any other settings to limit concurrent requests or tune the backoff strategy, but that might be helpful for cases like this. > AWS Error SLOW_DOWN during PutObject operation > -- > > Key: ARROW-18228 > URL: https://issues.apache.org/jira/browse/ARROW-18228 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 10.0.0 >Reporter: Vadym Dytyniak >Priority: Major > > We use Dask to parallelise read/write operations and pyarrow to write dataset > from worker nodes. > After pyarrow released version 10.0.0, our data flows automatically switched > to the latest version and some of them started to fail with the following > error: > {code:java} > File "/usr/local/lib/python3.10/dist-packages/org/store/storage.py", line > 768, in _write_partition > ds.write_dataset( > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 988, in write_dataset > _filesystemdataset_write( > File "pyarrow/_dataset.pyx", line 2859, in > pyarrow._dataset._filesystemdataset_write > check_status(CFileSystemDataset.Write(c_options, c_scanner)) > File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status > raise IOError(message) > OSError: When creating key 'equities.us.level2.by_security/' in bucket > 'org-prod': AWS Error SLOW_DOWN during PutObject operation: Please reduce > your request rate. 
{code} > In total flow failed many times: most failed with the error above, but one > failed with: > {code:java} > File "/usr/local/lib/python3.10/dist-packages/chronos/store/storage.py", line > 857, in _load_partition > table = ds.dataset( > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 752, in dataset > return _filesystem_dataset(source, **kwargs) > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 444, in _filesystem_dataset > fs, paths_or_selector = _ensure_single_source(source, filesystem) > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 411, in _ensure_single_source > file_info = filesystem.get_file_info(path) > File "pyarrow/_fs.pyx", line 564, in pyarrow._fs.FileSystem.get_file_info > info = GetResultValue(self.fs.GetFileInfo(path)) > File "pyarrow/error.pxi", line 144, in > pyarrow.lib.pyarrow_internal_check_status > return check_status(status) > File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status > raise IOError(message) > OSError: When getting information for key > 'ns/date=2022-10-31/channel=4/feed=A/9f41f928eedc431ca695a7ffe5fc60c2-0.parquet' > in bucket 'org-poc': AWS Error NETWORK_CONNECTION during HeadObject > operation: curlCode: 28, Timeout was reached {code} > > Do you have any idea what was changed for dataset write between 9.0.0 and > 10.0.0 to help us to fix the issue? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18228) AWS Error SLOW_DOWN during PutObject operation
[ https://issues.apache.org/jira/browse/ARROW-18228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628391#comment-17628391 ] Will Jones commented on ARROW-18228: I think this may have been caused by https://issues.apache.org/jira/browse/ARROW-17057 In 10.0.0, we exposed the retry strategy in Python, but we set the default number of retries to 3, while I believe in the underlying C++ code it was set to 10 before. Could you try setting the {{max_attempts=10}} in your code: {code:python} from pyarrow.fs import AwsDefaultS3RetryStrategy, S3FileSystem fs = S3FileSystem(retry_strategy=AwsDefaultS3RetryStrategy(max_attempts=10)) {code} > AWS Error SLOW_DOWN during PutObject operation > -- > > Key: ARROW-18228 > URL: https://issues.apache.org/jira/browse/ARROW-18228 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 10.0.0 >Reporter: Vadym Dytyniak >Priority: Major > > We use Dask to parallelise read/write operations and pyarrow to write dataset > from worker nodes. > After pyarrow released version 10.0.0, our data flows automatically switched > to the latest version and some of them started to fail with the following > error: > {code:java} > File "/usr/local/lib/python3.10/dist-packages/org/store/storage.py", line > 768, in _write_partition > ds.write_dataset( > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 988, in write_dataset > _filesystemdataset_write( > File "pyarrow/_dataset.pyx", line 2859, in > pyarrow._dataset._filesystemdataset_write > check_status(CFileSystemDataset.Write(c_options, c_scanner)) > File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status > raise IOError(message) > OSError: When creating key 'equities.us.level2.by_security/' in bucket > 'org-prod': AWS Error SLOW_DOWN during PutObject operation: Please reduce > your request rate. 
{code} > In total flow failed many times: most failed with the error above, but one > failed with: > {code:java} > File "/usr/local/lib/python3.10/dist-packages/chronos/store/storage.py", line > 857, in _load_partition > table = ds.dataset( > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 752, in dataset > return _filesystem_dataset(source, **kwargs) > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 444, in _filesystem_dataset > fs, paths_or_selector = _ensure_single_source(source, filesystem) > File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line > 411, in _ensure_single_source > file_info = filesystem.get_file_info(path) > File "pyarrow/_fs.pyx", line 564, in pyarrow._fs.FileSystem.get_file_info > info = GetResultValue(self.fs.GetFileInfo(path)) > File "pyarrow/error.pxi", line 144, in > pyarrow.lib.pyarrow_internal_check_status > return check_status(status) > File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status > raise IOError(message) > OSError: When getting information for key > 'ns/date=2022-10-31/channel=4/feed=A/9f41f928eedc431ca695a7ffe5fc60c2-0.parquet' > in bucket 'org-poc': AWS Error NETWORK_CONNECTION during HeadObject > operation: curlCode: 28, Timeout was reached {code} > > Do you have any idea what was changed for dataset write between 9.0.0 and > 10.0.0 to help us to fix the issue? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18210) [C++][Parquet] Skip check in StreamWriter
[ https://issues.apache.org/jira/browse/ARROW-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628379#comment-17628379 ] Will Jones commented on ARROW-18210: Created https://issues.apache.org/jira/browse/ARROW-18239 > [C++][Parquet] Skip check in StreamWriter > - > > Key: ARROW-18210 > URL: https://issues.apache.org/jira/browse/ARROW-18210 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Affects Versions: 10.0.0 >Reporter: Madhur >Priority: Major > > Currently StreamWriter is slower only because of column checking. If we > allowed a customization option (maybe a ctor arg) to skip the check, could > StreamWriter be more efficient? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18239) [C++][Docs] Add examples of Parquet TypedColumnWriter to user guide
Will Jones created ARROW-18239: -- Summary: [C++][Docs] Add examples of Parquet TypedColumnWriter to user guide Key: ARROW-18239 URL: https://issues.apache.org/jira/browse/ARROW-18239 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: Will Jones Since this is the more performant non-Arrow way to write Parquet data, we should show that. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18230) [Python] Pass Cmake args to Python CPP
Will Jones created ARROW-18230: -- Summary: [Python] Pass Cmake args to Python CPP Key: ARROW-18230 URL: https://issues.apache.org/jira/browse/ARROW-18230 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Will Jones Fix For: 11.0.0 We pass {{extra_cmake_args}} to {{_run_cmake}} (Cython build) but not to {{_run_cmake_pyarrow_cpp}} (PyArrow C++ build). We should probably pass them to both. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18204) [R] Allow setting field metadata
Will Jones created ARROW-18204: -- Summary: [R] Allow setting field metadata Key: ARROW-18204 URL: https://issues.apache.org/jira/browse/ARROW-18204 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 10.0.0 Reporter: Will Jones Currently, you can't create a {{Field}} with metadata in R, which makes it hard to write tests involving field metadata. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-14999) [C++] List types with different field names are not equal
[ https://issues.apache.org/jira/browse/ARROW-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623449#comment-17623449 ] Will Jones commented on ARROW-14999: So here are the conclusions I've gathered so far: 1. Equality of ListTypes and MapTypes has different behavior right now: list types with different field names are unequal, but map types with different field names are equal. We should make this behavior consistent and probably have an option in the {{.Equals()}} method to toggle checking these internal field names. 2. For extension arrays, it's important that we preserve these field names in most operations. That means that even if the default behavior is to ignore field names in equality for List/Map, unit tests for functions should check for field name equality. My current leaning is that the default equality check should ignore field names for List/Map (obviously not for struct) in cases where we also don't check metadata. For example, {{TypeEquals()}} will check metadata and field names, while {{DataType::Equals()}} will not. > [C++] List types with different field names are not equal > - > > Key: ARROW-14999 > URL: https://issues.apache.org/jira/browse/ARROW-14999 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 6.0.0 >Reporter: Will Jones >Assignee: Will Jones >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > When comparing map types, the names of the fields are ignored. This was > introduced in ARROW-7173. > However for list types, they are not ignored. For example, > {code:python} > In [6]: l1 = pa.list_(pa.field("val", pa.int64())) > In [7]: l2 = pa.list_(pa.int64()) > In [8]: l1 > Out[8]: ListType(list<val: int64>) > In [9]: l2 > Out[9]: ListType(list<item: int64>) > In [10]: l1 == l2 > Out[10]: False > {code} > Should we make list type comparison ignore field names too? 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16817) [C++][Python] Segfaults for unsupported datatypes in the ORC writer
[ https://issues.apache.org/jira/browse/ARROW-16817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-16817: -- Assignee: Will Jones (was: Ian Alexander Joiner) > [C++][Python] Segfaults for unsupported datatypes in the ORC writer > --- > > Key: ARROW-16817 > URL: https://issues.apache.org/jira/browse/ARROW-16817 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Ian Alexander Joiner >Assignee: Will Jones >Priority: Major > Labels: good-first-issue, good-second-issue, > pull-request-available > Fix For: 10.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > In the ORC writer if a table has at least a column with unsupported datatype > segfaults occur when we try to write them in ORC. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17994) [C++] Add overflow argument is required when it shouldn't be
[ https://issues.apache.org/jira/browse/ARROW-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-17994: --- Attachment: generate_ibis_queries.py
> [C++] Add overflow argument is required when it shouldn't be
>
> Key: ARROW-17994
> URL: https://issues.apache.org/jira/browse/ARROW-17994
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Will Jones
> Priority: Major
> Labels: acero, substrait
> Fix For: 11.0.0
> Attachments: generate_ibis_queries.py, try_queries_acero.py
>
> If I pass a Substrait plan that contains an add function but don't provide
> the nullability argument, I get the following error:
> {code:none}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/_substrait.pyx", line 140, in pyarrow._substrait.run_query
>   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Expected Substrait call to have an enum argument at index 0 but the argument was not an enum.
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/extension_set.cc:684  call.GetEnumArg(arg_index)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/extension_set.cc:702  ParseEnumArg(call, 0, kOverflowParser)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:332  FromProto(expr, ext_set, conversion_options)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/serde.cc:156  FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), ext_set, conversion_options)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:106  engine::DeserializePlans(substrait_buffer, consumer_factory, registry, nullptr, conversion_options_)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:130  executor.Init(substrait_buffer, registry)
> {code}
> Yet in the spec, this argument is supposed to be optional:
> https://github.com/substrait-io/substrait/blob/f3f6bdc947e689e800279666ff33f118e42d2146/extensions/functions_arithmetic.yaml#L11
> If I modify the plan to include the argument, it works as expected.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17994) [C++] Add overflow argument is required when it shouldn't be
[ https://issues.apache.org/jira/browse/ARROW-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-17994: --- Attachment: try_queries_acero.py
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17069) [Python][R] GCSFileSystem reports cannot resolve host on public buckets
[ https://issues.apache.org/jira/browse/ARROW-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-17069: -- Assignee: Will Jones
> [Python][R] GCSFileSystem reports cannot resolve host on public buckets
> ---
>
> Key: ARROW-17069
> URL: https://issues.apache.org/jira/browse/ARROW-17069
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python, R
> Affects Versions: 8.0.0
> Reporter: Will Jones
> Assignee: Will Jones
> Priority: Critical
> Labels: pull-request-available
> Fix For: 10.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> GCSFileSystem returns {{Couldn't resolve host name}} if you don't supply
> {{anonymous}} as the user:
> {code:python}
> import pyarrow.dataset as ds
>
> # Fails:
> dataset = ds.dataset("gs://voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 749, in dataset
>     return _filesystem_dataset(source, **kwargs)
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 441, in _filesystem_dataset
>     fs, paths_or_selector = _ensure_single_source(source, filesystem)
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 408, in _ensure_single_source
>     file_info = filesystem.get_file_info(path)
>   File "pyarrow/_fs.pyx", line 444, in pyarrow._fs.FileSystem.get_file_info
>     info = GetResultValue(self.fs.GetFileInfo(path))
>   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
>     return check_status(status)
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
>     raise IOError(message)
> OSError: google::cloud::Status(UNAVAILABLE: Retry policy exhausted in GetObjectMetadata: EasyPerform() - CURL error [6]=Couldn't resolve host name)
>
> # This works fine:
> >>> dataset = ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
> {code}
> I would expect that we could connect.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
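The {{anonymous@}} workaround works because the user-info portion of the {{gs://}} URI is read as a credential hint: {{anonymous}} asks the filesystem to skip credential lookup entirely. A stdlib-only sketch of that URI convention ({{parse_gcs_uri}} is a hypothetical helper for illustration, not PyArrow API):

```python
from urllib.parse import urlsplit

def parse_gcs_uri(uri):
    """Split a gs:// URI into (anonymous, bucket, path).

    A user-info part of 'anonymous' is treated as a request to skip
    credential lookup; without it, the client tries to resolve default
    credentials, which can fail on hosts that cannot reach the
    credential endpoints.
    """
    parts = urlsplit(uri)
    if parts.scheme != "gs":
        raise ValueError("expected a gs:// URI")
    anonymous = parts.username == "anonymous"
    bucket = parts.hostname
    path = parts.path.lstrip("/")
    return anonymous, bucket, path

print(parse_gcs_uri("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/"))
# (True, 'voltrondata-labs-datasets', 'nyc-taxi/')
```

Query-string options such as {{retry_limit_seconds}} ride along in the URI the same way and are split off by {{urlsplit}} into {{parts.query}}.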
[jira] [Commented] (ARROW-17069) [Python][R] GCSFileSystem reports cannot resolve host on public buckets
[ https://issues.apache.org/jira/browse/ARROW-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17616691#comment-17616691 ] Will Jones commented on ARROW-17069: Sure. I had done this earlier for R: https://arrow.apache.org/docs/r/articles/fs.html#gcs-authentication I will make a PR to update the Python user guide.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17994) [C++] Add overflow argument is required when it shouldn't be
[ https://issues.apache.org/jira/browse/ARROW-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17616096#comment-17616096 ] Will Jones commented on ARROW-17994: Example plan that is broken: {code:json} { "extensionUris": [ { "extensionUriAnchor": 1 } ], "extensions": [ { "extensionFunction": { "extensionUriReference": 1, "functionAnchor": 1, "name": "equal" } }, { "extensionFunction": { "extensionUriReference": 1, "functionAnchor": 2, "name": "add" } } ], "relations": [ { "root": { "input": { "project": { "input": { "read": { "baseSchema": { "names": [ "a", "b", "c" ], "struct": { "types": [ { "fp64": { "nullability": "NULLABILITY_NULLABLE" } }, { "i64": { "nullability": "NULLABILITY_NULLABLE" } }, { "fp64": { "nullability": "NULLABILITY_NULLABLE" } } ], "nullability": "NULLABILITY_REQUIRED" } }, "namedTable": { "names": [ "table0" ] } } }, "expressions": [ { "scalarFunction": { "functionReference": 2, "outputType": { "fp64": { "nullability": "NULLABILITY_NULLABLE" } }, "arguments": [ { "value": { "selection": { "directReference": { "structField": {} }, "rootReference": {} } } }, { "value": { "selection": { "directReference": { "structField": { "field": 2 } }, "rootReference": {} } } } ] } } ] } }, "names": [ "a", "b", "c", "v" ] } } ] } {code} Example plan that works: {code:json} { "extensionUris": [ { "extensionUriAnchor": 1 } ], "extensions": [ { "extensionFunction": { "extensionUriReference": 1, "functionAnchor": 1, "name": "equal" } }, { "extensionFunction": { "extensionUriReference": 1, "functionAnchor": 2, "name": "add" } } ], "relations": [ { "root": { "input": { "project": { "input": { "read": { "baseSchema": { "names": [ "a", "b", "c" ], "struct": { "types": [ { "fp64": { "nullability": "NULLABILITY_NULLABLE" } }, { "i64": { "nullability": "NULLABILITY_NULLABLE" } }, { "fp64": { "nullability": "NULLABILITY_NULLABLE" } } ], "nullability": "NULLABILITY_REQUIRED" } }, "namedTable": { "names": [ "table0" ] } } }, 
"expressions": [ { "scalarFunction": { "functionReference": 2, "outputType": { "fp64": { "nullability": "NULLABILITY_NULLA
[jira] [Created] (ARROW-17994) [C++] Add overflow argument is required when it shouldn't be
Will Jones created ARROW-17994: -- Summary: [C++] Add overflow argument is required when it shouldn't be Key: ARROW-17994 URL: https://issues.apache.org/jira/browse/ARROW-17994 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Will Jones Fix For: 11.0.0
If I pass a Substrait plan that contains an add function but don't provide the nullability argument, I get the following error:
{code:none}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_substrait.pyx", line 140, in pyarrow._substrait.run_query
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected Substrait call to have an enum argument at index 0 but the argument was not an enum.
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/extension_set.cc:684  call.GetEnumArg(arg_index)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/extension_set.cc:702  ParseEnumArg(call, 0, kOverflowParser)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:332  FromProto(expr, ext_set, conversion_options)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/serde.cc:156  FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), ext_set, conversion_options)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:106  engine::DeserializePlans(substrait_buffer, consumer_factory, registry, nullptr, conversion_options_)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:130  executor.Init(substrait_buffer, registry)
{code}
Yet in the spec, this argument is supposed to be optional: https://github.com/substrait-io/substrait/blob/f3f6bdc947e689e800279666ff33f118e42d2146/extensions/functions_arithmetic.yaml#L11
If I modify the plan to include the argument, it works as expected.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17963) [C++] Implement cast_dictionary for string
Will Jones created ARROW-17963: -- Summary: [C++] Implement cast_dictionary for string Key: ARROW-17963 URL: https://issues.apache.org/jira/browse/ARROW-17963 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Will Jones Fix For: 11.0.0
We can cast dictionary(string, X) to string, but not the other way around.
{code:R}
> Array$create(c("a", "b"))$cast(dictionary(int32(), string()))
Error: NotImplemented: Unsupported cast from string to dictionary using function cast_dictionary
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/compute/function.cc:249  func.DispatchBest(&in_types)
> Array$create(as.factor(c("a", "b")))$cast(string())
Array
[
  "a",
  "b"
]
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
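For context on what the missing cast would do: casting string to dictionary amounts to dictionary-encoding the array, i.e. building a list of unique values plus an index array pointing into it. A plain-Python sketch of that encoding (illustrative only, not Arrow's implementation):

```python
def dictionary_encode(values):
    """Return (indices, dictionary) such that
    [dictionary[i] for i in indices] reproduces `values`."""
    dictionary = []   # unique values, in first-seen order
    lookup = {}       # value -> its index in `dictionary`
    indices = []
    for v in values:
        if v not in lookup:
            lookup[v] = len(dictionary)
            dictionary.append(v)
        indices.append(lookup[v])
    return indices, dictionary

indices, dictionary = dictionary_encode(["a", "b", "a", "a"])
# indices == [0, 1, 0, 0], dictionary == ["a", "b"]
```

The reverse direction (dictionary to string), which Arrow already supports, is just the take operation `[dictionary[i] for i in indices]`.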
[jira] [Resolved] (ARROW-17438) [R] glimpse() errors if there is a UDF
[ https://issues.apache.org/jira/browse/ARROW-17438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones resolved ARROW-17438. Resolution: Duplicate
> [R] glimpse() errors if there is a UDF
> --
>
> Key: ARROW-17438
> URL: https://issues.apache.org/jira/browse/ARROW-17438
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: Neal Richardson
> Priority: Major
> Fix For: 10.0.0
>
> Using the example from ARROW-17437:
> {code}
> register_scalar_function(
>   "test",
>   function(context, x) paste(x, collapse = ","),
>   utf8(),
>   utf8(),
>   auto_convert = TRUE
> )
> Table$create(x = c("a", "b", "c")) |>
>   transmute(test(x)) |>
>   glimpse()
> # Table (query)
> # 3 rows x 1 columns
> # Error in `dplyr::collect()`:
> # ! NotImplemented: Call to R (resolve scalar user-defined function output data type) from a non-R thread from an unsupported context
> # Run `rlang::last_error()` to see where the error occurred.
> {code}
> A variety of things could fix this:
> * Supporting UDFs in any query (I think there's a draft PR open for this)
> * The limit operator (FetchNode?) so that {{head()}} is handled in the ExecPlan and we don't need to use the RecordBatchReader workaround to get it efficiently (also PR in the works)
> * Worst case, error more informatively
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17438) [R] glimpse() errors if there is a UDF
[ https://issues.apache.org/jira/browse/ARROW-17438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613721#comment-17613721 ] Will Jones commented on ARROW-17438: I just tested, and this is now fixed. (I believe in ARROW-17178.) cc [~paleolimbot]
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-16897) [R][C++] Full join on Arrow objects is incorrect
[ https://issues.apache.org/jira/browse/ARROW-16897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones closed ARROW-16897. -- Resolution: Duplicate
> [R][C++] Full join on Arrow objects is incorrect
>
> Key: ARROW-16897
> URL: https://issues.apache.org/jira/browse/ARROW-16897
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Affects Versions: 8.0.0, 9.0.0
> Environment: Linux
> Reporter: Oliver Reiter
> Assignee: Weston Pace
> Priority: Critical
> Labels: joins, query-engine
> Fix For: 10.0.0
>
> Hello,
> I am trying to do a full join on a dataset. It produces the correct number of observations, but not the correct result (the resulting data.frame is just filled up with NA-rows).
> My use case: I want to include the 'full' year range for every factor value:
> {code:java}
> library(data.table)
> library(arrow)
> library(dplyr)
>
> year_range <- 2000:2019
> group_n <- 100
> N <- 1000  ## the resulting data should have 100 groups * 20 years
>
> dt <- data.table(value = rnorm(N),
>                  group = rep(paste0("g", 1:group_n), length.out = N))
> ## there are only observations for some years in every group
> dt[, year := sample(year_range, size = N / group_n), by = .(group)]
> dt[group == "g1", ]
>
> ## this would be the 'full' data.table
> group_years <- data.table(group = rep(unique(dt$group), each = 20),
>                           year = rep(year_range, times = 10))
> group_years[group == "g1", ]
>
> write_dataset(dt, path = "parquet_db")
> db <- open_dataset(sources = "parquet_db")
>
> ## full_join using data.table -> expected result
> db_full <- merge(dt, group_years,
>                  by = c("group", "year"),
>                  all = TRUE)
> setorder(db_full, group, year)
> db_full[group == "g1", ]
>
> ## try to do the full_join with arrow -> incorrect result
> db_full_arrow <- db |>
>   full_join(group_years, by = c("group", "year")) |>
>   collect() |>
>   setDT()
> setorder(db_full_arrow, group, year)
> db_full_arrow[group == "g1", ]
>
> ## or: convert data.table to arrow_table beforehand -> incorrect result
> group_years_arrow <- group_years |>
>   as_arrow_table()
> db_full_arrow <- db |>
>   full_join(group_years_arrow, by = c("group", "year")) |>
>   collect() |>
>   setDT()
> setorder(db_full_arrow, group, year)
> db_full_arrow[group == "g1", ]{code}
> The [documentation|https://arrow.apache.org/docs/r/] says equality joins are supported, which should hold also for `full_join` I guess?
> Thanks for your time and work!
>
> Oliver
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17149) [R] Enable GCS tests for Windows
[ https://issues.apache.org/jira/browse/ARROW-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-17149: --- Fix Version/s: 11.0.0 (was: 10.0.0) > [R] Enable GCS tests for Windows > > > Key: ARROW-17149 > URL: https://issues.apache.org/jira/browse/ARROW-17149 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Affects Versions: 9.0.0 >Reporter: Will Jones >Priority: Major > Fix For: 11.0.0 > > > In ARROW-16879, I found the GCS tests were hanging in CI, but couldn't > diagnose why. We should solve that and enable the tests. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17954) [R] Update News for 10.0.0
Will Jones created ARROW-17954: -- Summary: [R] Update News for 10.0.0 Key: ARROW-17954 URL: https://issues.apache.org/jira/browse/ARROW-17954 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Will Jones Assignee: Will Jones Fix For: 10.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-14342) [Python] Add support for the SSO credential provider
[ https://issues.apache.org/jira/browse/ARROW-14342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613238#comment-17613238 ] Will Jones commented on ARROW-14342: In the meantime, you can work around this by using boto3 to resolve the SSO credentials:
{code:python}
from boto3 import Session
from pyarrow.fs import S3FileSystem

session = Session()
credentials = session.get_credentials()
s3 = S3FileSystem(
    access_key=credentials.access_key,
    secret_key=credentials.secret_key,
    session_token=credentials.token,
)
{code}
> [Python] Add support for the SSO credential provider
>
> Key: ARROW-14342
> URL: https://issues.apache.org/jira/browse/ARROW-14342
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 3.0.0, 5.0.0
> Reporter: Björn Boschman
> Priority: Major
> Fix For: 11.0.0
>
> Not sure about other languages
> see also: [https://github.com/boto/botocore/pull/2070]
> {code:java}
> from pyarrow.fs import S3FileSystem
>
> bucket = 'some-bucket-with-read-access'
> key = 'some-existing-key'
> s3 = S3FileSystem()
> s3.open_input_file(f'{bucket}/{key}'){code}
> results in
> {code:java}
> Traceback (most recent call last):
>   File "test.py", line 7, in <module>
>     s3.open_input_file(f'{bucket}/{key}')
>   File "pyarrow/_fs.pyx", line 587, in pyarrow._fs.FileSystem.open_input_file
>   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> OSError: When reading information for key 'some-existing-key' in bucket 'some-bucket-with-read-access': AWS Error [code 15]: No response body.
> {code}
> Without SSO creds supported - shouldn't it raise some kind of AccessDenied exception?
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-14342) Add support for the SSO credential provider
[ https://issues.apache.org/jira/browse/ARROW-14342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-14342: --- Fix Version/s: 11.0.0
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-14342) [Python] Add support for the SSO credential provider
[ https://issues.apache.org/jira/browse/ARROW-14342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-14342: --- Summary: [Python] Add support for the SSO credential provider (was: Add support for the SSO credential provider)
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-14342) Add support for the SSO credential provider
[ https://issues.apache.org/jira/browse/ARROW-14342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613235#comment-17613235 ] Will Jones commented on ARROW-14342: SSO support was added in aws-sdk-cpp 1.9. Once we upgrade that dependency we should automatically get support for this. https://github.com/aws/aws-sdk-cpp/issues/1433#issuecomment-1079267499
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17944) [Python] Accept bytes object in pyarrow.substrait.run_query
Will Jones created ARROW-17944: -- Summary: [Python] Accept bytes object in pyarrow.substrait.run_query Key: ARROW-17944 URL: https://issues.apache.org/jira/browse/ARROW-17944 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Will Jones Fix For: 11.0.0
{{pyarrow.substrait.run_query()}} only accepts a PyArrow buffer, and will segfault if something else is passed. People might try to pass a Python bytes object, which isn't unreasonable. For example, they might use the value returned by protobuf's {{SerializeToString()}} method, which is Python bytes. At the very least, we should not segfault.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17349) [C++] Add casting support for map type
[ https://issues.apache.org/jira/browse/ARROW-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613030#comment-17613030 ] Will Jones commented on ARROW-17349: Yes, I've updated the title. Casting lists was only broken if the list was inside a map. The only reason casting maps looked as if it was working was because of the early return if types are "equal" (and maps are "equal" even if they have different field names).
> [C++] Add casting support for map type
> --
>
> Key: ARROW-17349
> URL: https://issues.apache.org/jira/browse/ARROW-17349
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 9.0.0
> Reporter: Will Jones
> Assignee: Will Jones
> Priority: Major
> Labels: good-first-issue, kernel, pull-request-available
> Fix For: 10.0.0
>
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> Different parquet implementations use different field names for internal fields of ListType and MapType, which can sometimes cause silly conflicts. For example, we use {{item}} as the field name for list, but Spark uses {{element}}. Fortunately, we can automatically cast between List and Map Types with different field names. Unfortunately, it only works at the top level. We should get it to work at arbitrary levels of nesting.
> This was discovered in delta-rs: https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285
> Here's a reproduction in Python:
> {code:Python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
>
> def roundtrip_scanner(in_arr, out_type):
>     table = pa.table({"arr": in_arr})
>     pq.write_table(table, "test.parquet")
>     schema = pa.schema({"arr": out_type})
>     ds.dataset("test.parquet", schema=schema).to_table()
>
> # MapType
> ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
> ty = pa.map_(pa.int32(), pa.int32())
> arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
>
> # ListType
> ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False))
> ty = pa.list_(pa.int32())
> arr_named = pa.array([[1, 2, 4]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
>
> # Combination MapType and ListType
> ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", pa.int32(), nullable=True)), nullable=False))
> ty = pa.map_(pa.string(), pa.list_(pa.int32()))
> arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # Traceback (most recent call last):
> #   File "<stdin>", line 1, in <module>
> #   File "<stdin>", line 5, in roundtrip_scanner
> #   File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
> #   File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
> #   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
> #   File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
> # pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map list> from map ('arr')>
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17349) [C++] Add casting support for map type
[ https://issues.apache.org/jira/browse/ARROW-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-17349: --- Summary: [C++] Add casting support for map type (was: [C++] Support casting field names of list and map when nested)
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17923) [C++] Consider dictionary arrays for special fragment fields
Will Jones created ARROW-17923: -- Summary: [C++] Consider dictionary arrays for special fragment fields Key: ARROW-17923 URL: https://issues.apache.org/jira/browse/ARROW-17923 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Will Jones I noticed in ARROW-15281 we made {{__filename}} a string column. In common cases, this will be inefficient if materialized. If possible, it may be better to have them be dictionary arrays. As an example, [here|https://github.com/apache/arrow/pull/12826#issuecomment-1230745059] is a user report of 10x increased memory usage caused by accidentally including these special fragment columns. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17897) [Packaging][Conan] Add back ARROW_GCS to conanfile.py
Will Jones created ARROW-17897: -- Summary: [Packaging][Conan] Add back ARROW_GCS to conanfile.py Key: ARROW-17897 URL: https://issues.apache.org/jira/browse/ARROW-17897 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Will Jones Assignee: Will Jones Fix For: 10.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-14159) [R] Re-allow some multithreading on Windows
[ https://issues.apache.org/jira/browse/ARROW-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-14159: -- Assignee: (was: Will Jones) > [R] Re-allow some multithreading on Windows > --- > > Key: ARROW-14159 > URL: https://issues.apache.org/jira/browse/ARROW-14159 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Priority: Major > Fix For: 10.0.0 > > > Followup to ARROW-8379, which set use_threads = FALSE on Windows. See > discussion about adding more controls, disabling threading in some places and > not others, etc. We want to do this soon after release so that we have a few > months to see how things behave on CI before releasing again. > - > Collecting some CI hangs after ARROW-8379 > 1. Rtools35, 64bit test suite hangs: > https://github.com/apache/arrow/pull/11290/checks?check_run_id=3767787034 > {code} > ** running tests for arch 'i386' ... > Running 'testthat.R' [17s] > OK > ** running tests for arch 'x64' ... > Error: Error: stderr is not a pipe.> > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16880) [R] Test GCS auth with gargle/googleAuthR
[ https://issues.apache.org/jira/browse/ARROW-16880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-16880: -- Assignee: (was: Will Jones) > [R] Test GCS auth with gargle/googleAuthR > - > > Key: ARROW-16880 > URL: https://issues.apache.org/jira/browse/ARROW-16880 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > Fix For: 10.0.0 > > > These are the main packages that let folks work with Google Cloud from R, so > we should make sure we can play nicely with their auth methods, how they > cache credentials, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16880) [R] Test GCS auth with gargle/googleAuthR
[ https://issues.apache.org/jira/browse/ARROW-16880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-16880: --- Fix Version/s: (was: 10.0.0) > [R] Test GCS auth with gargle/googleAuthR > - > > Key: ARROW-16880 > URL: https://issues.apache.org/jira/browse/ARROW-16880 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > > These are the main packages that let folks work with Google Cloud from R, so > we should make sure we can play nicely with their auth methods, how they > cache credentials, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17069) [Python][R] GCSFileSystem reports cannot resolve host on public buckets
[ https://issues.apache.org/jira/browse/ARROW-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-17069: -- Assignee: (was: Will Jones) > [Python][R] GCSFileSystem reports cannot resolve host on public buckets > --- > > Key: ARROW-17069 > URL: https://issues.apache.org/jira/browse/ARROW-17069 > Project: Apache Arrow > Issue Type: Bug > Components: Python, R >Affects Versions: 8.0.0 >Reporter: Will Jones >Priority: Critical > Fix For: 10.0.0 > > > GCSFileSystem will return {{Couldn't resolve host name}} if you don't supply > {{anonymous}} as the user: > {code:python} > import pyarrow.dataset as ds > # Fails: > dataset = > ds.dataset("gs://voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3") > Traceback (most recent call last): > File "", line 1, in > File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", > line 749, in dataset > return _filesystem_dataset(source, **kwargs) > File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", > line 441, in _filesystem_dataset > fs, paths_or_selector = _ensure_single_source(source, filesystem) > File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", > line 408, in _ensure_single_source > file_info = filesystem.get_file_info(path) > File "pyarrow/_fs.pyx", line 444, in pyarrow._fs.FileSystem.get_file_info > info = GetResultValue(self.fs.GetFileInfo(path)) > File "pyarrow/error.pxi", line 144, in > pyarrow.lib.pyarrow_internal_check_status > return check_status(status) > File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status > raise IOError(message) > OSError: google::cloud::Status(UNAVAILABLE: Retry policy exhausted in > GetObjectMetadata: EasyPerform() - CURL error [6]=Couldn't resolve host name) > # This works fine: > >>> dataset = > >>> ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3") > {code} > I would expect that we could connect. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16089) [Packaging] Add support for Conan C/C++ package manager
[ https://issues.apache.org/jira/browse/ARROW-16089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17609576#comment-17609576 ] Will Jones commented on ARROW-16089: [~kou] I heard we are waiting on the 10.0.0 release to upstream our changes to the Conan files. I found there is an issue with OpenSSL on some platforms in the current Conan that is fixed in ours. Would it be alright if I brought these upstream now instead of waiting? > [Packaging] Add support for Conan C/C++ package manager > -- > > Key: ARROW-16089 > URL: https://issues.apache.org/jira/browse/ARROW-16089 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Packaging >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17845) [CI][Conan] Re-enable Flight in Conan CI check
Will Jones created ARROW-17845: -- Summary: [CI][Conan] Re-enable Flight in Conan CI check Key: ARROW-17845 URL: https://issues.apache.org/jira/browse/ARROW-17845 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Will Jones Assignee: Will Jones -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-15838) [C++] Key column behavior in joins
[ https://issues.apache.org/jira/browse/ARROW-15838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-15838: -- Assignee: Will Jones > [C++] Key column behavior in joins > -- > > Key: ARROW-15838 > URL: https://issues.apache.org/jira/browse/ARROW-15838 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Jonathan Keane >Assignee: Will Jones >Priority: Major > Fix For: 10.0.0 > > > By default, dplyr (and possibly pandas too?) coalesces the key column > for full joins to be the (non-null) values from both key columns: > {code} > > left <- tibble::tibble( > key = c(1, 2), > A = c(0, 1), > ) > left_tab <- Table$create(left) > > right <- tibble::tibble( > key = c(2, 3), > B = c(0, 1), > ) > right_tab <- Table$create(right) > > left %>% full_join(right) > Joining, by = "key" > # A tibble: 3 × 3 > key A B > > 1 1 0NA > 2 2 1 0 > 3 3NA 1 > > left_tab %>% full_join(right_tab) %>% collect() > # A tibble: 3 × 3 > key A B > > 1 2 1 0 > 2 1 0NA > 3NANA 1 > {code} > And for right joins, we would expect the key from the right table to be in the > result, but we get the key from the left instead: > {code} > > left <- tibble::tibble( > key = c(1, 2), > A = c(0, 1), > ) > left_tab <- Table$create(left) > > right <- tibble::tibble( > key = c(2, 3), > B = c(0, 1), > ) > right_tab <- Table$create(right) > > left %>% right_join(right) > Joining, by = "key" > # A tibble: 2 × 3 > key A B > > 1 2 1 0 > 2 3NA 1 > > left_tab %>% right_join(right_tab) %>% collect() > # A tibble: 2 × 3 > key A B > > 1 2 1 0 > 2NANA 1 > {code} > Additionally, we should be able to keep both key columns with an option (cf > https://github.com/apache/arrow/blob/9719eae66dcf38c966ae769215d27020a6dd5550/r/R/dplyr-join.R#L32) > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17812) [C++][Documentation] Add Gandiva User Guide
Will Jones created ARROW-17812: -- Summary: [C++][Documentation] Add Gandiva User Guide Key: ARROW-17812 URL: https://issues.apache.org/jira/browse/ARROW-17812 Project: Apache Arrow Issue Type: Improvement Components: C++ - Gandiva Reporter: Will Jones Assignee: Will Jones Fix For: 10.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17349) [C++] Support casting field names of list and map when nested
[ https://issues.apache.org/jira/browse/ARROW-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607422#comment-17607422 ] Will Jones commented on ARROW-17349: What's actually going on is we don't have any cast kernel for Map. Casting from a map to map works, because we early return if types are equal, and our equals method doesn't care about map field names. But it does care about list field names, so if the map contains a list then it will look for a cast function. I'll create a separate ticket for implementing Cast for Map, but for this particular issue, I think it would be nice to have a fast path for renaming fields in cast. > [C++] Support casting field names of list and map when nested > - > > Key: ARROW-17349 > URL: https://issues.apache.org/jira/browse/ARROW-17349 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 9.0.0 >Reporter: Will Jones >Assignee: Will Jones >Priority: Major > Labels: good-first-issue, kernel, pull-request-available > Fix For: 10.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Different parquet implementations use different field names for internal > fields of ListType and MapType, which can sometimes cause silly conflicts. > For example, we use {{item}} as the field name for list, but Spark uses > {{element}}. Fortunately, we can automatically cast between List and Map > Types with different field names. Unfortunately, it only works at the top > level. We should get it to work at arbitrary levels of nesting. 
> This was discovered in delta-rs: > https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285 > Here's a reproduction in Python: > {code:Python} > import pyarrow as pa > import pyarrow.parquet as pq > import pyarrow.dataset as ds > def roundtrip_scanner(in_arr, out_type): > table = pa.table({"arr": in_arr}) > pq.write_table(table, "test.parquet") > schema = pa.schema({"arr": out_type}) > ds.dataset("test.parquet", schema=schema).to_table() > # MapType > ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32()) > ty = pa.map_(pa.int32(), pa.int32()) > arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named) > roundtrip_scanner(arr_named, ty) > # ListType > ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False)) > ty = pa.list_(pa.int32()) > arr_named = pa.array([[1, 2, 4]], type=ty_named) > roundtrip_scanner(arr_named, ty) > # Combination MapType and ListType > ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", > pa.int32(), nullable=True)), nullable=False)) > ty = pa.map_(pa.string(), pa.list_(pa.int32())) > arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named) > roundtrip_scanner(arr_named, ty) > # Traceback (most recent call last): > # File "", line 1, in > # File "", line 5, in roundtrip_scanner > # File "pyarrow/_dataset.pyx", line 331, in > pyarrow._dataset.Dataset.to_table > # File "pyarrow/_dataset.pyx", line 2577, in > pyarrow._dataset.Scanner.to_table > # File "pyarrow/error.pxi", line 144, in > pyarrow.lib.pyarrow_internal_check_status > # File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status > # pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map list> from map ('arr')> > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17349) [C++] Support casting field names of list and map when nested
[ https://issues.apache.org/jira/browse/ARROW-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-17349: -- Assignee: Will Jones > [C++] Support casting field names of list and map when nested > - > > Key: ARROW-17349 > URL: https://issues.apache.org/jira/browse/ARROW-17349 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 9.0.0 >Reporter: Will Jones >Assignee: Will Jones >Priority: Major > Labels: good-first-issue, kernel, pull-request-available > Fix For: 10.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Different parquet implementations use different field names for internal > fields of ListType and MapType, which can sometimes cause silly conflicts. > For example, we use {{item}} as the field name for list, but Spark uses > {{element}}. Fortunately, we can automatically cast between List and Map > Types with different field names. Unfortunately, it only works at the top > level. We should get it to work at arbitrary levels of nesting. 
> This was discovered in delta-rs: > https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285 > Here's a reproduction in Python: > {code:Python} > import pyarrow as pa > import pyarrow.parquet as pq > import pyarrow.dataset as ds > def roundtrip_scanner(in_arr, out_type): > table = pa.table({"arr": in_arr}) > pq.write_table(table, "test.parquet") > schema = pa.schema({"arr": out_type}) > ds.dataset("test.parquet", schema=schema).to_table() > # MapType > ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32()) > ty = pa.map_(pa.int32(), pa.int32()) > arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named) > roundtrip_scanner(arr_named, ty) > # ListType > ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False)) > ty = pa.list_(pa.int32()) > arr_named = pa.array([[1, 2, 4]], type=ty_named) > roundtrip_scanner(arr_named, ty) > # Combination MapType and ListType > ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", > pa.int32(), nullable=True)), nullable=False)) > ty = pa.map_(pa.string(), pa.list_(pa.int32())) > arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named) > roundtrip_scanner(arr_named, ty) > # Traceback (most recent call last): > # File "", line 1, in > # File "", line 5, in roundtrip_scanner > # File "pyarrow/_dataset.pyx", line 331, in > pyarrow._dataset.Dataset.to_table > # File "pyarrow/_dataset.pyx", line 2577, in > pyarrow._dataset.Scanner.to_table > # File "pyarrow/error.pxi", line 144, in > pyarrow.lib.pyarrow_internal_check_status > # File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status > # pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map list> from map ('arr')> > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17788) [R][Doc] Add example of using Scanner
Will Jones created ARROW-17788: -- Summary: [R][Doc] Add example of using Scanner Key: ARROW-17788 URL: https://issues.apache.org/jira/browse/ARROW-17788 Project: Apache Arrow Issue Type: Improvement Components: Documentation, R Affects Versions: 9.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 10.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17776) [C++] Stabilize Parquet ArrowReaderProperties
Will Jones created ARROW-17776: -- Summary: [C++] Stabilize Parquet ArrowReaderProperties Key: ARROW-17776 URL: https://issues.apache.org/jira/browse/ARROW-17776 Project: Apache Arrow Issue Type: Improvement Components: C++, Parquet Affects Versions: 9.0.0 Reporter: Will Jones {{ArrowReaderProperties}} is still marked experimental, but it's pretty well used at this point. One possible change we might wish to make before stabilizing the API for it though: The {{ArrowWriterProperties}} class uses a namespaced builder class, which provides a nice syntax for creation and enforces immutability of the final properties. Perhaps we should mirror that design in the reader properties? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17400) [C++] Move Parquet APIs to use Result instead of Status
[ https://issues.apache.org/jira/browse/ARROW-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17604933#comment-17604933 ] Will Jones commented on ARROW-17400: [~devavret] Are you still working on this? I did a little bit of this in my PR [https://github.com/apache/arrow/pull/14018] but there are other APIs to do as well. > [C++] Move Parquet APIs to use Result instead of Status > --- > > Key: ARROW-17400 > URL: https://issues.apache.org/jira/browse/ARROW-17400 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 9.0.0 >Reporter: Will Jones >Assignee: Devavret Makkar >Priority: Minor > Labels: good-first-issue > > Notably, IPC and CSV have "open file" methods that return result, while > opening a Parquet file requires passing in an out variable. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17593) [C++] Try and maintain input shape in Acero
[ https://issues.apache.org/jira/browse/ARROW-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599079#comment-17599079 ] Will Jones commented on ARROW-17593: I've been reading through the Parquet implementation, and was surprised to find that you cannot write out a row group with multiple batches. We've decoupled row group sizes and batch size on read (great!), but not on write. Perhaps that should also be part of the solution. I'm not deeply familiar with Acero internals yet, but what you've described here seems very sensible. Though it sounds like we may need some helper class to allocate the batch and line up the morsels, IIUC. > [C++] Try and maintain input shape in Acero > --- > > Key: ARROW-17593 > URL: https://issues.apache.org/jira/browse/ARROW-17593 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Priority: Major > > Data is scanned in large chunks based on the format. For example, CSV scans > chunks based on a chunk_size while parquet scans entire row groups. > Then, upon entry into Acero, these chunks are sliced into morsels (~L3 size) > for parallelism and batches (~L1-L2 size) for cache efficient processing. > However, the way it is currently done means that the output of Acero is a > stream of tiny batches. This is somewhat undesirable in many cases. > For example, if a pyarrow user calls pq.read_table they might expect to get > one batch per row group. If they were to turn around and write out that > table to a new parquet file then either they end up with a non-ideal parquet > file (tiny row groups) or they are forced to concatenate the batches (which > is an allocation + copy). > Even if the user is doing their own streaming processing (e.g. in pyarrow) > these small batch sizes are undesirable as the overhead of python means that > streaming processing should be done in larger batches. 
> Instead, there should be a configurable max_batch_size, independent of row > group size and morsel size, which is configurable, and quite large by default > (1Mi or 64Mi rows). This control exists for users that want to do their own > streaming processing and need to be able to tune for RAM usage. > Acero will read in data based on the format, as it does today (e.g. CSV chunk > size, row group size). If the source data is very large (bigger than > max_batch_size) it will be sliced. From that point on, any morsels or > batches should simply be views into this larger output batch. For example, > when doing a projection to add a new column, we should allocate a > max_batch_size array and then populate it over many runs of the project node. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17590) Lower memory usage with filters
[ https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599019#comment-17599019 ] Will Jones commented on ARROW-17590: First, I don't believe the row-level filters avoid reading any data, unless they can be applied to the data set partition values. In order to evaluate the expression on the row, it needs to be parsed into Arrow data. If you want to reduce memory usage, I have two suggestions: # Turn off prebuffering, if you haven't already. In Python some interfaces it's on by default, some off. It gives better performance on some filesystems, but it uses more memory. # Consider reading in batches, using the {{iter_batches()}} method on Parquet files for instance. Then you can filter as the data comes in and concatenate the results into a Table. Which interface are you using? {{pyarrow.parquet.read_table}} or datasets? > Lower memory usage with filters > --- > > Key: ARROW-17590 > URL: https://issues.apache.org/jira/browse/ARROW-17590 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Yin >Priority: Major > > Hi, > When I read a parquet file (about 23MB with 250K rows and 600 object/string > columns with lots of None) with filter on a not null column for a small > number of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB > to 1GB). The result table and dataframe have only a few rows (1 row 20kb, 500 > rows 20MB). Looks like it scans/loads many rows from the parquet file. Not > only the footprint or watermark of memory usage is high, but also it seems > not releasing the memory in time (such as after GC in Python, but may get > used for subsequent read). > When reading the same parquet file for all columns without filtering, the > memory usage is about the same at 900MB. It goes up to 2.3GB after to_pandas > dataframe,. df.info(memory_usage='deep') shows 4.3GB maybe double counting > something. > It helps to limit the number of columns read. 
Read 1 column with filter for 1 > row or more or without filter, it takes about 10MB, which is quite smaller > and better, but still bigger than the size of table or data frame with 1 or > 500 rows of 1 columns (under 1MB) > The filtered column is not a partition key, which functionally works to get > the correct rows. But the memory usage is quite high even when the parquet > file is not really large, partitioned or not. There were some references > similar to this issue, for example: > [https://github.com/apache/arrow/issues/7338] > Related classes/methods in (pyarrow 9.0.0) > _ParquetDatasetV2.read > self._dataset.to_table(columns=columns, filter=self._filter_expression, > use_threads=use_threads) > pyarrow._dataset.FileSystemDatase.to_table > I played with pyarrow._dataset.Scanner.to_table > self._dataset.scanner(columns=columns, > filter=self._filter_expression).to_table() > The memory usage is small to construct the scanner but then goes up after the > to_table call materializes it. > Is there some way or workaround to reduce the memory usage with read > filtering? > If not supported yet, can it be fixed/improved with priority? > This is a blocking issue for us when we need to load all or many columns. > I am not sure what improvement is possible with respect to how the parquet > columnar format works, and if it can be patched somehow in the Pyarrow Python > code, or need to change and build the arrow C++ code. > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-14161) [C++][Parquet][Docs] Reading/Writing Parquet Files
[ https://issues.apache.org/jira/browse/ARROW-14161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-14161: -- Assignee: Will Jones > [C++][Parquet][Docs] Reading/Writing Parquet Files > -- > > Key: ARROW-14161 > URL: https://issues.apache.org/jira/browse/ARROW-14161 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Rares Vernica >Assignee: Will Jones >Priority: Minor > Fix For: 10.0.0 > > > Missing documentation on Reading/Writing Parquet files C++ api: > * > [WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE] > missing docs on chunk_size found some > [here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53] > _size of the RowGroup in the parquet file. Normally you would choose this to > be rather large_ > * Typo in file reader > [example|https://arrow.apache.org/docs/cpp/parquet.html#filereader] the > include should be {{#include "parquet/arrow/reader.h"}} > * > [WriteProperties/Builder|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet16WriterPropertiesE] > missing docs on {{compression}} > * Missing example on using WriteProperties -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-13454) [C++][Docs] Tables vs Record Batches
[ https://issues.apache.org/jira/browse/ARROW-13454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-13454: -- Assignee: Will Jones > [C++][Docs] Tables vs Record Batches > > > Key: ARROW-13454 > URL: https://issues.apache.org/jira/browse/ARROW-13454 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Rares Vernica >Assignee: Will Jones >Priority: Minor > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > It is not clear what the difference is between Tables and Record Batches is > as described on [https://arrow.apache.org/docs/cpp/tables.html#tables] > _A > [{{arrow::Table}}|https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow5TableE] > is a two-dimensional dataset with chunked arrays for columns_ > _A > [{{arrow::RecordBatch}}|https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow11RecordBatchE] > is a two-dimensional dataset of a number of contiguous arrays_ > Or maybe the distinction between _chunked arrays_ and _contiguous arrays_ can > be clarified. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15006) [Python][Doc] Iteratively enable more numpydoc checks
[ https://issues.apache.org/jira/browse/ARROW-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597937#comment-17597937 ] Will Jones commented on ARROW-15006: Great spreadsheet! Best place for developer discussions is the dev mailing list, since it's the most public. Add a link to this ticket and the spreadsheet, and you can add a {{[Python][Doc]}} prefix to the subject to help recipients know if it's relevant to them. > [Python][Doc] Iteratively enable more numpydoc checks > - > > Key: ARROW-15006 > URL: https://issues.apache.org/jira/browse/ARROW-15006 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Python >Reporter: Krisztian Szucs >Assignee: Bryce Mecum >Priority: Major > Labels: good-first-issue, pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > Asof https://github.com/apache/arrow/pull/7732 we're going to have a numpydoc > check running on pull requests. There is a single rule enabled at the moment: > PR01 > Additional checks we can run: > {code} > ERROR_MSGS = { > "GL01": "Docstring text (summary) should start in the line immediately " > "after the opening quotes (not in the same line, or leaving a " > "blank line in between)", > "GL02": "Closing quotes should be placed in the line after the last text " > "in the docstring (do not close the quotes in the same line as " > "the text, or leave a blank line between the last text and the " > "quotes)", > "GL03": "Double line break found; please use only one blank line to " > "separate sections or paragraphs, and do not leave blank lines " > "at the end of docstrings", > "GL05": 'Tabs found at the start of line "{line_with_tabs}", please use ' > "whitespace only", > "GL06": 'Found unknown section "{section}". Allowed sections are: ' > "{allowed_sections}", > "GL07": "Sections are in the wrong order. 
Correct order is: > {correct_sections}", > "GL08": "The object does not have a docstring", > "GL09": "Deprecation warning should precede extended summary", > "GL10": "reST directives {directives} must be followed by two colons", > "SS01": "No summary found (a short summary in a single line should be " > "present at the beginning of the docstring)", > "SS02": "Summary does not start with a capital letter", > "SS03": "Summary does not end with a period", > "SS04": "Summary contains heading whitespaces", > "SS05": "Summary must start with infinitive verb, not third person " > '(e.g. use "Generate" instead of "Generates")', > "SS06": "Summary should fit in a single line", > "ES01": "No extended summary found", > "PR01": "Parameters {missing_params} not documented", > "PR02": "Unknown parameters {unknown_params}", > "PR03": "Wrong parameters order. Actual: {actual_params}. " > "Documented: {documented_params}", > "PR04": 'Parameter "{param_name}" has no type', > "PR05": 'Parameter "{param_name}" type should not finish with "."', > "PR06": 'Parameter "{param_name}" type should use "{right_type}" instead ' > 'of "{wrong_type}"', > "PR07": 'Parameter "{param_name}" has no description', > "PR08": 'Parameter "{param_name}" description should start with a ' > "capital letter", > "PR09": 'Parameter "{param_name}" description should finish with "."', > "PR10": 'Parameter "{param_name}" requires a space before the colon ' > "separating the parameter name and type", > "RT01": "No Returns section found", > "RT02": "The first line of the Returns section should contain only the " > "type, unless multiple values are being returned", > "RT03": "Return value has no description", > "RT04": "Return value description should start with a capital letter", > "RT05": 'Return value description should finish with "."', > "YD01": "No Yields section found", > "SA01": "See Also section not found", > "SA02": "Missing period at end of description for See Also " > '"{reference_name}" reference', > 
"SA03": "Description should be capitalized for See Also " > '"{reference_name}" reference', > "SA04": 'Missing description for See Also "{reference_name}" reference', > "EX01": "No examples section found", > } > {code} > cc [~alenkaf] [~amol-] [~jorisvandenbossche] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597915#comment-17597915 ] Will Jones commented on ARROW-17459: We have a section of our docs devoted to [developer setup and guidelines|https://arrow.apache.org/docs/developers/contributing.html]. And we have documentation describing the [Arrow in-memory format|https://arrow.apache.org/docs/format/Columnar.html] (it may be worth reviewing the structure of nested arrays, for example). For the internals of the Parquet arrow code, it's best to read through the source headers at {{{}cpp/src/parquet/arrow/{}}}. > [C++] Support nested data conversions for chunked array > --- > > Key: ARROW-17459 > URL: https://issues.apache.org/jira/browse/ARROW-17459 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Arthur Passos >Priority: Blocker > > `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not > implemented for chunked array outputs". It fails on > [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95]) > Data schema is: > {code:java} > optional group fields_map (MAP) = 217 { > repeated group key_value { > required binary key (STRING) = 218; > optional binary value (STRING) = 219; > } > } > fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047 > fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963 > {code} > Is there a way to work around this issue in the cpp lib? > In any case, I am willing to implement this, but I need some guidance. I am > very new to parquet (as in started reading about it yesterday). > > Probably related to: https://issues.apache.org/jira/browse/ARROW-10958 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17399) pyarrow may use a lot of memory to load a dataframe from parquet
[ https://issues.apache.org/jira/browse/ARROW-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597400#comment-17597400 ] Will Jones commented on ARROW-17399:

Sorry, you are right, you had a single column there already. I tried your repro on my M1 MacBook and didn't see the memory usage you are seeing. (This is with the mimalloc allocator, but I got similar results with jemalloc and the system allocator.) Are you able to reproduce this on the latest versions of Pandas and NumPy? And could you confirm your package version numbers?

{code:none}
❯ python test_pyarrow.py
  0 time:       0.0 rss:      79.5
  1 time:       2.0 rss:     617.1
  2 time:       3.4 rss:    1090.6
  3 time:       3.7 rss:     633.6
  4 time:       6.7 rss:     633.6
  5 time:      10.1 rss:    1942.9
  6 time:      13.1 rss:    1942.9
  7 time:      13.6 rss:     664.8
  8 time:      16.6 rss:     664.8
{code}

> pyarrow may use a lot of memory to load a dataframe from parquet
>
> Key: ARROW-17399
> URL: https://issues.apache.org/jira/browse/ARROW-17399
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 9.0.0
> Environment: linux
> Reporter: Gianluca Ficarelli
> Priority: Major
> Attachments: memory-profiler.png
>
> When a pandas dataframe is loaded from a parquet file using {{pyarrow.parquet.read_table}}, the memory usage may grow a lot more than what should be needed to load the dataframe, and it's not freed until the dataframe is deleted.
> The problem is evident when the dataframe has a *column containing lists or numpy arrays*, while it seems absent (or not noticeable) if the column contains only integers or floats.
> I'm attaching a simple script to reproduce the issue, and a graph created with memory-profiler showing the memory usage.
> In this example, the dataframe created with pandas needs around 1.2 GB, but the memory usage after loading it from parquet is around 16 GB.
> The items of the column are created as numpy arrays and not lists, to be consistent with the types loaded from parquet (pyarrow produces numpy arrays and not lists).
>
> {code:python}
> import gc
> import time
>
> import numpy as np
> import pandas as pd
> import pyarrow
> import pyarrow.parquet
> import psutil
>
>
> def pyarrow_dump(filename, df, compression="snappy"):
>     table = pyarrow.Table.from_pandas(df)
>     pyarrow.parquet.write_table(table, filename, compression=compression)
>
>
> def pyarrow_load(filename):
>     table = pyarrow.parquet.read_table(filename)
>     return table.to_pandas()
>
>
> def print_mem(msg, start_time=time.monotonic(), process=psutil.Process()):
>     # gc.collect()
>     current_time = time.monotonic() - start_time
>     rss = process.memory_info().rss / 2 ** 20
>     print(f"{msg:>3} time:{current_time:>10.1f} rss:{rss:>10.1f}")
>
>
> if __name__ == "__main__":
>     print_mem(0)
>     rows = 500
>     df = pd.DataFrame({"a": [np.arange(10) for i in range(rows)]})
>     print_mem(1)
>
>     pyarrow_dump("example.parquet", df)
>     print_mem(2)
>
>     del df
>     print_mem(3)
>     time.sleep(3)
>     print_mem(4)
>
>     df = pyarrow_load("example.parquet")
>     print_mem(5)
>     time.sleep(3)
>     print_mem(6)
>
>     del df
>     print_mem(7)
>     time.sleep(3)
>     print_mem(8)
> {code}
> Run with memory-profiler:
> {code:bash}
> mprof run --multiprocess python test_pyarrow.py
> {code}
> Output:
> {code:java}
> mprof: Sampling memory every 0.1s
> running new process
>   0 time:       0.0 rss:     135.4
>   1 time:       4.9 rss:    1252.2
>   2 time:       7.1 rss:    1265.0
>   3 time:       7.5 rss:     760.2
>   4 time:      10.7 rss:     758.9
>   5 time:      19.6 rss:   16745.4
>   6 time:      22.6 rss:   16335.4
>   7 time:      22.9 rss:   15833.0
>   8 time:      25.9 rss:     955.0
> {code}
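The script above measures RSS with psutil. As a side note, on Unix the peak RSS can also be read with only the standard library; a minimal sketch (note the platform quirk: {{ru_maxrss}} is reported in kilobytes on Linux but in bytes on macOS):

```python
import resource
import sys


def peak_rss_bytes() -> int:
    """Return this process's peak resident set size in bytes (Unix only)."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes; macOS reports bytes.
    return rss if sys.platform == "darwin" else rss * 1024


print(f"peak RSS: {peak_rss_bytes():,} bytes")
```

Unlike the current RSS psutil reports, this is a high-water mark, so it never decreases over the life of the process.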
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597392#comment-17597392 ] Will Jones commented on ARROW-17459:

Hi Arthur,

Here's a simple repro I created in Python:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array([[("a" * 2**30, 1)]], type=pa.map_(pa.string(), pa.int32()))
arr = pa.chunked_array([arr, arr])
tab = pa.table({"arr": arr})
pq.write_table(tab, "test.parquet")
pq.read_table("test.parquet")
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/Users/willjones/mambaforge/envs/notebooks/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
#     return dataset.read(columns=columns, use_threads=use_threads,
#   File "/Users/willjones/mambaforge/envs/notebooks/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
#     table = self._dataset.to_table(
#   File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
#   File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
#   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
#   File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
# pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
{code}

> [C++] Support nested data conversions for chunked array
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581502#comment-17581502 ] Will Jones commented on ARROW-17459:

I haven't tried this, but perhaps {{GetRecordBatchReader}} will work instead: [https://github.com/wjones127/arrow/blob/895e2da93c0af3a1525c8c75ec8d612d96c28647/cpp/src/parquet/arrow/reader.h#L165]

It sounds like there are some code paths that do work and some that don't.

> [C++] Support nested data conversions for chunked array
[jira] [Commented] (ARROW-15006) [Python][Doc] Iteratively enable more numpydoc checks
[ https://issues.apache.org/jira/browse/ARROW-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581404#comment-17581404 ] Will Jones commented on ARROW-15006:

Perhaps we should start with the style-only ones first (PR06: 'Parameter "accept_root_dir" type should use "bool" instead of "boolean"'), and skip the "missing" warnings (GL08: "The object does not have a docstring"). I think the missing ones will be good to do, but they will require a lot more work to get enough context to properly describe each of the objects, so they are possibly better as a follow-up. That should narrow things down to some mechanical changes that we can get out of the way quickly (if only there were an automatic formatter). Also, we should double-check that we have instructions for how to run these checks locally, so that developers can verify their changes before pushing them.

> [Python][Doc] Iteratively enable more numpydoc checks
> -
>
> Key: ARROW-15006
> URL: https://issues.apache.org/jira/browse/ARROW-15006
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Documentation, Python
> Reporter: Krisztian Szucs
> Priority: Major
> Labels: good-first-issue
>
> As of https://github.com/apache/arrow/pull/7732 we're going to have a numpydoc check running on pull requests.
> There is a single rule enabled at the moment: PR01.
> Additional checks we can run:
> {code}
> ERROR_MSGS = {
>     "GL01": "Docstring text (summary) should start in the line immediately "
>             "after the opening quotes (not in the same line, or leaving a "
>             "blank line in between)",
>     "GL02": "Closing quotes should be placed in the line after the last text "
>             "in the docstring (do not close the quotes in the same line as "
>             "the text, or leave a blank line between the last text and the "
>             "quotes)",
>     "GL03": "Double line break found; please use only one blank line to "
>             "separate sections or paragraphs, and do not leave blank lines "
>             "at the end of docstrings",
>     "GL05": 'Tabs found at the start of line "{line_with_tabs}", please use '
>             "whitespace only",
>     "GL06": 'Found unknown section "{section}". Allowed sections are: '
>             "{allowed_sections}",
>     "GL07": "Sections are in the wrong order. Correct order is: {correct_sections}",
>     "GL08": "The object does not have a docstring",
>     "GL09": "Deprecation warning should precede extended summary",
>     "GL10": "reST directives {directives} must be followed by two colons",
>     "SS01": "No summary found (a short summary in a single line should be "
>             "present at the beginning of the docstring)",
>     "SS02": "Summary does not start with a capital letter",
>     "SS03": "Summary does not end with a period",
>     "SS04": "Summary contains heading whitespaces",
>     "SS05": "Summary must start with infinitive verb, not third person "
>             '(e.g. use "Generate" instead of "Generates")',
>     "SS06": "Summary should fit in a single line",
>     "ES01": "No extended summary found",
>     "PR01": "Parameters {missing_params} not documented",
>     "PR02": "Unknown parameters {unknown_params}",
>     "PR03": "Wrong parameters order. Actual: {actual_params}. "
>             "Documented: {documented_params}",
>     "PR04": 'Parameter "{param_name}" has no type',
>     "PR05": 'Parameter "{param_name}" type should not finish with "."',
>     "PR06": 'Parameter "{param_name}" type should use "{right_type}" instead '
>             'of "{wrong_type}"',
>     "PR07": 'Parameter "{param_name}" has no description',
>     "PR08": 'Parameter "{param_name}" description should start with a '
>             "capital letter",
>     "PR09": 'Parameter "{param_name}" description should finish with "."',
>     "PR10": 'Parameter "{param_name}" requires a space before the colon '
>             "separating the parameter name and type",
>     "RT01": "No Returns section found",
>     "RT02": "The first line of the Returns section should contain only the "
>             "type, unless multiple values are being returned",
>     "RT03": "Return value has no description",
>     "RT04": "Return value description should start with a capital letter",
>     "RT05": 'Return value description should finish with "."',
>     "YD01": "No Yields section found",
>     "SA01": "See Also section not found",
>     "SA02": "Missing period at end of description for See Also "
>             '"{reference_name}" reference',
>     "SA03": "Description should be capitalized for See Also "
>             '"{reference_name}" reference',
>     "SA04": 'Missing description for See Also "{reference_name}" reference',
>     "EX01": "No examples section found",
> }
> {code}
[jira] [Comment Edited] (ARROW-17441) [Python] Memory kept after del and pool.released_unused()
[ https://issues.apache.org/jira/browse/ARROW-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580475#comment-17580475 ] Will Jones edited comment on ARROW-17441 at 8/16/22 9:10 PM:
-

Going back to my original test with Parquet, it does seem like there is some long-standing issue with Parquet reads and mimalloc. And a regression with the system allocator on macOS?

Here is the original Parquet read test (so all buffers are allocated within Arrow, no numpy):

{code:python}
import os
import psutil
import time
import gc

process = psutil.Process(os.getpid())

import pyarrow.parquet as pq
import pyarrow as pa


def print_rss():
    print(f"RSS: {process.memory_info().rss:,} bytes")


pq_path = "tall.parquet"

print(f"memory_pool={pa.default_memory_pool().backend_name}")
print_rss()

print("reading table")
tab = pq.read_table(pq_path)
print_rss()

print("deleting table")
del tab
gc.collect()
print_rss()

print("releasing unused memory")
pa.default_memory_pool().release_unused()
print_rss()

print("waiting 10 seconds")
time.sleep(10)
print_rss()

print(f"Total allocated bytes: {pa.total_allocated_bytes():,}")
{code}

Result in PyArrow 7.0.0:

{code:none}
ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool2.py

memory_pool=mimalloc
RSS: 47,906,816 bytes
reading table
RSS: 2,077,507,584 bytes
deleting table
RSS: 2,071,887,872 bytes
releasing unused memory
RSS: 2,064,875,520 bytes
waiting 10 seconds
RSS: 1,862,352,896 bytes
Total allocated bytes: 0

memory_pool=jemalloc
RSS: 47,415,296 bytes
reading table
RSS: 2,704,965,632 bytes
deleting table
RSS: 70,746,112 bytes
releasing unused memory
RSS: 71,663,616 bytes
waiting 10 seconds
RSS: 71,663,616 bytes
Total allocated bytes: 0

memory_pool=system
RSS: 47,857,664 bytes
reading table
RSS: 2,705,408,000 bytes
deleting table
RSS: 71,106,560 bytes
releasing unused memory
RSS: 71,106,560 bytes
waiting 10 seconds
RSS: 71,106,560 bytes
Total allocated bytes: 0
{code}

Result in PyArrow 9.0.0:

{code:none}
> ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool2.py

memory_pool=mimalloc
RSS: 48,037,888 bytes
reading table
RSS: 2,140,487,680 bytes
deleting table
RSS: 2,149,711,872 bytes
releasing unused memory
RSS: 2,142,273,536 bytes
waiting 10 seconds
RSS: 1,710,981,120 bytes
Total allocated bytes: 0

memory_pool=jemalloc
RSS: 48,136,192 bytes
reading table
RSS: 2,681,274,368 bytes
deleting table
RSS: 71,942,144 bytes
releasing unused memory
RSS: 72,908,800 bytes
waiting 10 seconds
RSS: 72,908,800 bytes
Total allocated bytes: 0

memory_pool=system
RSS: 48,005,120 bytes
reading table
RSS: 2,847,965,184 bytes
deleting table
RSS: 1,440,071,680 bytes
releasing unused memory
RSS: 1,440,071,680 bytes
waiting 10 seconds
RSS: 1,440,071,680 bytes
Total allocated bytes: 0
{code}

was (Author: willjones127): Going back to my original test with Parquet, it does seem like there is some long-standing issue with Parquet reads and mimalloc. And a regression with the system allocator on macOS?
Here is the original Parquet read test (so all buffers are allocated within Arrow, no numpy): {code:python} import os import psutil import time import gc process = psutil.Process(os.getpid()) import pyarrow.parquet as pq import pyarrow as pa def print_rss(): print(f"RSS: {process.memory_info().rss:,} bytes") pq_path = "tall.parquet" print(f"memory_pool={pa.default_memory_pool().backend_name}") print_rss() print("reading table") tab = pq.read_table(pq_path) print_rss() print("deleting table") del tab gc.collect() print_rss() print("releasing unused memory") pa.default_memory_pool().release_unused() print_rss() print("waiting 10 seconds") time.sleep(10) print_rss() {code} Result in PyArrow 7.0.0: {code:none} ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool2.py && \ ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool2.py && \ ARROW_DEFAULT_MEMORY_POOL=system python test_pool2.py memory_pool=mimalloc RSS: 47,906,816 bytes reading table RSS: 2,077,507,584 bytes deleting table RSS: 2,071,887,872 bytes releasing unused memory RSS: 2,064,875,520 bytes waiting 10 seconds RSS: 1,862,352,896 bytes memory_pool=jemalloc RSS: 47,415,296 bytes reading table RSS: 2,704,965,632 bytes deleting table RSS: 70,746,112 bytes releasing unused memory RSS: 71,663,616 bytes waiting 10 seconds RSS: 71,663,616 bytes memory_pool=system RSS: 47,857,664 bytes reading table RSS: 2,705,408,000 bytes deleting table RSS: 71,106,560 bytes releasing unused memory RSS: 71,106,560 bytes waiting 10 seconds RSS: 71,106,560 bytes {code} Result in PyArrow 9.0.0: {code:none} > ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool2.py && \ ARROW_DEFAULT_MEMORY_POOL=jemalloc p
[jira] [Commented] (ARROW-17441) [Python] Memory kept after del and pool.released_unused()
[ https://issues.apache.org/jira/browse/ARROW-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580475#comment-17580475 ] Will Jones commented on ARROW-17441:

Going back to my original test with Parquet, it does seem like there is some long-standing issue with Parquet reads and mimalloc. And a regression with the system allocator on macOS?

Here is the original Parquet read test (so all buffers are allocated within Arrow, no numpy):

{code:python}
import os
import psutil
import time
import gc

process = psutil.Process(os.getpid())

import pyarrow.parquet as pq
import pyarrow as pa


def print_rss():
    print(f"RSS: {process.memory_info().rss:,} bytes")


pq_path = "tall.parquet"

print(f"memory_pool={pa.default_memory_pool().backend_name}")
print_rss()

print("reading table")
tab = pq.read_table(pq_path)
print_rss()

print("deleting table")
del tab
gc.collect()
print_rss()

print("releasing unused memory")
pa.default_memory_pool().release_unused()
print_rss()

print("waiting 10 seconds")
time.sleep(10)
print_rss()
{code}

Result in PyArrow 7.0.0:

{code:none}
ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool2.py

memory_pool=mimalloc
RSS: 47,906,816 bytes
reading table
RSS: 2,077,507,584 bytes
deleting table
RSS: 2,071,887,872 bytes
releasing unused memory
RSS: 2,064,875,520 bytes
waiting 10 seconds
RSS: 1,862,352,896 bytes

memory_pool=jemalloc
RSS: 47,415,296 bytes
reading table
RSS: 2,704,965,632 bytes
deleting table
RSS: 70,746,112 bytes
releasing unused memory
RSS: 71,663,616 bytes
waiting 10 seconds
RSS: 71,663,616 bytes

memory_pool=system
RSS: 47,857,664 bytes
reading table
RSS: 2,705,408,000 bytes
deleting table
RSS: 71,106,560 bytes
releasing unused memory
RSS: 71,106,560 bytes
waiting 10 seconds
RSS: 71,106,560 bytes
{code}

Result in PyArrow 9.0.0:

{code:none}
> ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool2.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool2.py

memory_pool=mimalloc
RSS: 48,037,888 bytes
reading table
RSS: 2,140,487,680 bytes
deleting table
RSS: 2,149,711,872 bytes
releasing unused memory
RSS: 2,142,273,536 bytes
waiting 10 seconds
RSS: 1,710,981,120 bytes

memory_pool=jemalloc
RSS: 48,136,192 bytes
reading table
RSS: 2,681,274,368 bytes
deleting table
RSS: 71,942,144 bytes
releasing unused memory
RSS: 72,908,800 bytes
waiting 10 seconds
RSS: 72,908,800 bytes

memory_pool=system
RSS: 48,005,120 bytes
reading table
RSS: 2,847,965,184 bytes
deleting table
RSS: 1,440,071,680 bytes
releasing unused memory
RSS: 1,440,071,680 bytes
waiting 10 seconds
RSS: 1,440,071,680 bytes
{code}

> [Python] Memory kept after del and pool.released_unused()
> -
>
> Key: ARROW-17441
> URL: https://issues.apache.org/jira/browse/ARROW-17441
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 9.0.0
> Reporter: Will Jones
> Priority: Major
>
> I was trying to reproduce another issue involving memory pools not releasing memory, but encountered this confusing behavior: if I create a table, then call {{del table}}, and then {{pool.release_unused()}}, I still see significant memory usage. On mimalloc in particular, I see no meaningful drop in memory usage on either call.
> Am I missing something? My understanding prior has been that memory will be held onto by a memory pool, but will be force-freed by release_unused, and that the system memory pool should release memory immediately. But neither of those seems true.
>
> {code:python}
> import os
> import psutil
> import time
> import gc
>
> process = psutil.Process(os.getpid())
>
> import numpy as np
> from uuid import uuid4
> import pyarrow as pa
>
>
> def gen_batches(n_groups=200, rows_per_group=200_000):
>     for _ in range(n_groups):
>         id_val = uuid4().bytes
>         yield pa.table({
>             "x": np.random.random(rows_per_group),  # This will compress poorly
>             "y": np.random.random(rows_per_group),
>             "a": pa.array(list(range(rows_per_group)), type=pa.int32()),  # This compresses with delta encoding
>             "id": pa.array([id_val] * rows_per_group),  # This compresses with RLE
>         })
>
>
> def print_rss():
>     print(f"RSS: {process.memory_info().rss:,} bytes")
>
>
> print(f"memory_pool={pa.default_memory_pool().backend_name}")
> print_rss()
>
> print("reading table")
> tab = pa.concat_tables(list(gen_batches()))
> print_rss()
>
> print("deleting table")
> del tab
> gc.collect()
> print_rss()
[jira] [Commented] (ARROW-17441) [Python] Memory kept after del and pool.released_unused()
[ https://issues.apache.org/jira/browse/ARROW-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580471#comment-17580471 ] Will Jones commented on ARROW-17441:

{quote}I must admit I don't understand the references to compression in your comments. Were you planning to use Parquet at some point?{quote}

Sorry, I was testing memory usage from Parquet reads and seeing something like this, but decided to take Parquet out of the picture to simplify.

{quote}Other than that, Numpy-allocated memory does not use the Arrow memory pool, so I'm not sure those stats are very indicative.{quote}

Ah, I think you are likely right there.

> [Python] Memory kept after del and pool.released_unused()
[jira] [Commented] (ARROW-17441) [Python] Memory kept after del and pool.released_unused()
[ https://issues.apache.org/jira/browse/ARROW-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580472#comment-17580472 ] Will Jones commented on ARROW-17441:

I reran this on PyArrow 7.0.0 and got results where mimalloc is more in line with the others, so I think mimalloc 2 is actually worse, rather than better, at releasing unused memory:

{code}
> ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool.py

memory_pool=mimalloc
RSS: 43,958,272 bytes
reading table
RSS: 1,728,200,704 bytes
deleting table
RSS: 1,600,585,728 bytes
releasing unused memory
RSS: 549,797,888 bytes
waiting 10 seconds
RSS: 549,797,888 bytes

memory_pool=jemalloc
RSS: 43,663,360 bytes
reading table
RSS: 1,663,483,904 bytes
deleting table
RSS: 693,682,176 bytes
releasing unused memory
RSS: 694,304,768 bytes
waiting 10 seconds
RSS: 694,304,768 bytes

memory_pool=system
RSS: 44,220,416 bytes
reading table
RSS: 1,667,072,000 bytes
deleting table
RSS: 697,171,968 bytes
releasing unused memory
RSS: 697,171,968 bytes
waiting 10 seconds
RSS: 697,171,968 bytes
{code}

> [Python] Memory kept after del and pool.released_unused()
[jira] [Updated] (ARROW-17441) [Python] Memory kept after del and pool.released_unused()
[ https://issues.apache.org/jira/browse/ARROW-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-17441: --- Description: I was trying to reproduce another issue involving memory pools not releasing memory, but encountered this confusing behavior: if I create a table, then call {{del table}}, and then {{pool.release_unused()}}, I still see significant memory usage. On mimalloc in particular, I see no meaningful drop in memory usage on either call. Am I missing something? My prior understanding has been that memory will be held onto by a memory pool, but will be forced free by release_unused; and that the system memory pool should release memory immediately. But neither of those seems true.
{code:python}
import os
import psutil
import time
import gc

process = psutil.Process(os.getpid())

import numpy as np
from uuid import uuid4
import pyarrow as pa


def gen_batches(n_groups=200, rows_per_group=200_000):
    for _ in range(n_groups):
        id_val = uuid4().bytes
        yield pa.table({
            "x": np.random.random(rows_per_group),  # This will compress poorly
            "y": np.random.random(rows_per_group),
            "a": pa.array(list(range(rows_per_group)), type=pa.int32()),  # This compresses with delta encoding
            "id": pa.array([id_val] * rows_per_group),  # This compresses with RLE
        })


def print_rss():
    print(f"RSS: {process.memory_info().rss:,} bytes")


print(f"memory_pool={pa.default_memory_pool().backend_name}")
print_rss()
print("reading table")
tab = pa.concat_tables(list(gen_batches()))
print_rss()
print("deleting table")
del tab
gc.collect()
print_rss()
print("releasing unused memory")
pa.default_memory_pool().release_unused()
print_rss()
print("waiting 10 seconds")
time.sleep(10)
print_rss()
{code}
{code:none}
> ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool.py && \
  ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool.py && \
  ARROW_DEFAULT_MEMORY_POOL=system python test_pool.py
memory_pool=mimalloc
RSS: 44,449,792 bytes
reading table
RSS: 1,819,557,888 bytes
deleting table
RSS: 1,819,590,656 bytes
releasing unused memory
RSS: 1,819,852,800 bytes
waiting 10 seconds
RSS: 1,819,852,800 bytes
memory_pool=jemalloc
RSS: 45,629,440 bytes
reading table
RSS: 1,668,677,632 bytes
deleting table
RSS: 698,400,768 bytes
releasing unused memory
RSS: 699,023,360 bytes
waiting 10 seconds
RSS: 699,023,360 bytes
memory_pool=system
RSS: 44,875,776 bytes
reading table
RSS: 1,713,569,792 bytes
deleting table
RSS: 540,311,552 bytes
releasing unused memory
RSS: 540,311,552 bytes
waiting 10 seconds
RSS: 540,311,552 bytes
{code}
[jira] [Created] (ARROW-17441) [Python] Memory kept after del and pool.released_unused()
Will Jones created ARROW-17441: -- Summary: [Python] Memory kept after del and pool.released_unused() Key: ARROW-17441 URL: https://issues.apache.org/jira/browse/ARROW-17441 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 9.0.0 Reporter: Will Jones -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15368) [C++] [Docs] Improve our SIMD documentation
[ https://issues.apache.org/jira/browse/ARROW-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-15368: --- Fix Version/s: 10.0.0
> [C++] [Docs] Improve our SIMD documentation
> ---
>
> Key: ARROW-15368
> URL: https://issues.apache.org/jira/browse/ARROW-15368
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Documentation
>Reporter: Jonathan Keane
>Priority: Major
> Fix For: 10.0.0
>
>
> We should document the various env vars ({{ARROW_SIMD_LEVEL}}, {{ARROW_RUNTIME_SIMD_LEVEL}}, {{ARROW_USER_SIMD_LEVEL}}, others?).
> We should also document what the defaults are (and what that means for performance, and for possible optimization if you're compiling and you know you'll be on more or less modern hardware).
> For example, pyarrow and the R package are compiled with SSE4_2, but there is some amount of runtime-dispatched SIMD code, and MAX there means the build will compile everything it can, while at runtime it will use whatever is available. So if you compile on a machine with AVX512 and run on a machine with AVX512, you'll get any AVX512 runtime-dispatched code that's available (probably not much). There is more (esp. in the query engine) that is runtime AVX2.
> FWIW, I (Neal) would leave ARROW_RUNTIME_SIMD_LEVEL=MAX always. You can set ARROW_USER_SIMD_LEVEL to change/limit what level the runtime dispatch uses.
> Additionally, we should document that valgrind does not support AVX512: [https://bugs.kde.org/show_bug.cgi?id=383010]
> And users should set ARROW_USER_SIMD_LEVEL to AVX2 if they plan to run valgrind on an AVX512-capable machine, similar to what we do for our [CI|https://github.com/apache/arrow/blob/bc1a16cd0eceeffe67893a7e8000d2dd28dcf3f1/docker-compose.yml#L309] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17397) [R] Does R API for Apache Arrow has a tableFromIPC function ?
[ https://issues.apache.org/jira/browse/ARROW-17397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones resolved ARROW-17397. Assignee: Will Jones Resolution: Information Provided
> [R] Does R API for Apache Arrow has a tableFromIPC function ?
> --
>
> Key: ARROW-17397
> URL: https://issues.apache.org/jira/browse/ARROW-17397
> Project: Apache Arrow
> Issue Type: Improvement
>Reporter: Roy Assis
>Assignee: Will Jones
>Priority: Minor
>
> I'm building an API using Python and Flask. I want to return a dataframe from the API; I'm serializing the dataframe like so and sending it in the response:
> {code:python}
> batch = pa.record_batch(df)
> sink = pa.BufferOutputStream()
> with pa.ipc.new_stream(sink, batch.schema) as writer:
>     writer.write_batch(batch)
> pybytes = sink.getvalue().to_pybytes()
> {code}
> Is it possible to read it with R? If so, can you provide a code snippet.
> Best,
> Roy -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17399) pyarrow may use a lot of memory to load a dataframe from parquet
[ https://issues.apache.org/jira/browse/ARROW-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579120#comment-17579120 ] Will Jones commented on ARROW-17399: That helps narrow it down. Are you able to narrow down and share the specific data types ({{table.schema}}) that seem to be problematic?
> pyarrow may use a lot of memory to load a dataframe from parquet
> 
>
> Key: ARROW-17399
> URL: https://issues.apache.org/jira/browse/ARROW-17399
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
>Affects Versions: 9.0.0
> Environment: linux
>Reporter: Gianluca Ficarelli
>Priority: Major
> Attachments: memory-profiler.png
>
>
> When a pandas dataframe is loaded from a parquet file using {{pyarrow.parquet.read_table}}, the memory usage may grow a lot more than what should be needed to load the dataframe, and it's not freed until the dataframe is deleted.
> The problem is evident when the dataframe has a {*}column containing lists or numpy arrays{*}, while it seems absent (or not noticeable) if the column contains only integers or floats.
> I'm attaching a simple script to reproduce the issue, and a graph created with memory-profiler showing the memory usage.
> In this example, the dataframe created with pandas needs around 1.2 GB, but the memory usage after loading it from parquet is around 16 GB.
> The items of the column are created as numpy arrays and not lists, to be consistent with the types loaded from parquet (pyarrow produces numpy arrays and not lists).
>
> {code:python}
> import gc
> import time
> import numpy as np
> import pandas as pd
> import pyarrow
> import pyarrow.parquet
> import psutil
>
> def pyarrow_dump(filename, df, compression="snappy"):
>     table = pyarrow.Table.from_pandas(df)
>     pyarrow.parquet.write_table(table, filename, compression=compression)
>
> def pyarrow_load(filename):
>     table = pyarrow.parquet.read_table(filename)
>     return table.to_pandas()
>
> def print_mem(msg, start_time=time.monotonic(), process=psutil.Process()):
>     # gc.collect()
>     current_time = time.monotonic() - start_time
>     rss = process.memory_info().rss / 2 ** 20
>     print(f"{msg:>3} time:{current_time:>10.1f} rss:{rss:>10.1f}")
>
> if __name__ == "__main__":
>     print_mem(0)
>     rows = 500
>     df = pd.DataFrame({"a": [np.arange(10) for i in range(rows)]})
>     print_mem(1)
>
>     pyarrow_dump("example.parquet", df)
>     print_mem(2)
>
>     del df
>     print_mem(3)
>     time.sleep(3)
>     print_mem(4)
>     df = pyarrow_load("example.parquet")
>     print_mem(5)
>     time.sleep(3)
>     print_mem(6)
>     del df
>     print_mem(7)
>     time.sleep(3)
>     print_mem(8)
> {code}
> Run with memory-profiler:
> {code:bash}
> mprof run --multiprocess python test_pyarrow.py
> {code}
> Output:
> {code:none}
> mprof: Sampling memory every 0.1s
> running new process
> 0 time:       0.0 rss:     135.4
> 1 time:       4.9 rss:    1252.2
> 2 time:       7.1 rss:    1265.0
> 3 time:       7.5 rss:     760.2
> 4 time:      10.7 rss:     758.9
> 5 time:      19.6 rss:   16745.4
> 6 time:      22.6 rss:   16335.4
> 7 time:      22.9 rss:   15833.0
> 8 time:      25.9 rss:     955.0
> {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17400) [C++] Move Parquet APIs to use Result instead of Status
[ https://issues.apache.org/jira/browse/ARROW-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-17400: --- Labels: good-first-issue (was: ) > [C++] Move Parquet APIs to use Result instead of Status > --- > > Key: ARROW-17400 > URL: https://issues.apache.org/jira/browse/ARROW-17400 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 9.0.0 >Reporter: Will Jones >Priority: Minor > Labels: good-first-issue > > Notably, IPC and CSV have "open file" methods that return result, while > opening a Parquet file requires passing in an out variable. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17399) pyarrow may use a lot of memory to load a dataframe from parquet
[ https://issues.apache.org/jira/browse/ARROW-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579071#comment-17579071 ] Will Jones commented on ARROW-17399: Hi Gianluca, There are two conversions happening when reading: first, Parquet data is deserialized into Arrow data; second, Arrow data is converted into Pandas / numpy data. Are you able to narrow down during which conversion memory is increasing? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17397) [R] Does R API for Apache Arrow has a tableFromIPC function ?
[ https://issues.apache.org/jira/browse/ARROW-17397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579067#comment-17579067 ] Will Jones commented on ARROW-17397: Hi Roy, I think what you are looking for is [read_ipc_stream|https://arrow.apache.org/docs/r/reference/read_ipc_stream.html]. Here is an example:
{code:R}
library(arrow)
library(dplyr)

output_stream <- BufferOutputStream$create()

test_tbl <- tibble::tibble(
  x = 1:1e4,
  y = vapply(x, rlang::hash, character(1), USE.NAMES = FALSE),
  z = vapply(y, rlang::hash, character(1), USE.NAMES = FALSE)
)

write_ipc_stream(test_tbl, output_stream)
ipc_buffer <- output_stream$finish()
read_ipc_stream(ipc_buffer)
{code}
> [R] Does R API for Apache Arrow has a tableFromIPC function ?
> --
>
> Key: ARROW-17397
> URL: https://issues.apache.org/jira/browse/ARROW-17397
> Project: Apache Arrow
> Issue Type: Improvement
>Reporter: Roy Assis
>Priority: Minor
>
> I'm building an API using Python and Flask. I want to return a dataframe from the API; I'm serializing the dataframe like so and sending it in the response:
> {code:python}
> batch = pa.record_batch(df)
> sink = pa.BufferOutputStream()
> with pa.ipc.new_stream(sink, batch.schema) as writer:
>     writer.write_batch(batch)
> pybytes = sink.getvalue().to_pybytes()
> {code}
> Is it possible to read it with R? If so, can you provide a code snippet.
> Best,
> Roy -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17401) [C++] Add ReadTable method to RecordBatchFileReader
Will Jones created ARROW-17401: -- Summary: [C++] Add ReadTable method to RecordBatchFileReader Key: ARROW-17401 URL: https://issues.apache.org/jira/browse/ARROW-17401 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 9.0.0 Reporter: Will Jones For convenience, it would be helpful to add a method for reading the entire file as a table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17400) [C++] Move Parquet APIs to use Result instead of Status
Will Jones created ARROW-17400: -- Summary: [C++] Move Parquet APIs to use Result instead of Status Key: ARROW-17400 URL: https://issues.apache.org/jira/browse/ARROW-17400 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 9.0.0 Reporter: Will Jones Notably, IPC and CSV have "open file" methods that return {{Result}}, while opening a Parquet file requires passing in an out variable. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-14999) [C++] List types with different field names are not equal
[ https://issues.apache.org/jira/browse/ARROW-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578528#comment-17578528 ] Will Jones commented on ARROW-14999: Do you expect to be able to roundtrip that from Parquet? It seems like the conclusion of discussion in ARROW-11497 was that we should transition in the long term towards always using "element", but maybe we would still be able to roundtrip by casting back based on the Arrow schema saved in the metadata?
> [C++] List types with different field names are not equal
> -
>
> Key: ARROW-14999
> URL: https://issues.apache.org/jira/browse/ARROW-14999
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
>Affects Versions: 6.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
> Labels: pull-request-available
> Fix For: 10.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When comparing map types, the names of the fields are ignored. This was introduced in ARROW-7173.
> However for list types, they are not ignored. For example,
> {code:python}
> In [6]: l1 = pa.list_(pa.field("val", pa.int64()))
>
> In [7]: l2 = pa.list_(pa.int64())
>
> In [8]: l1
> Out[8]: ListType(list<val: int64>)
>
> In [9]: l2
> Out[9]: ListType(list<item: int64>)
>
> In [10]: l1 == l2
> Out[10]: False
> {code}
> Should we make list type comparison ignore field names too? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-14999) [C++] List types with different field names are not equal
[ https://issues.apache.org/jira/browse/ARROW-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-14999: -- Assignee: Will Jones -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-12958) [CI][Developer] Build + host the docs for PR branches
[ https://issues.apache.org/jira/browse/ARROW-12958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-12958: --- Component/s: Documentation > [CI][Developer] Build + host the docs for PR branches > - > > Key: ARROW-12958 > URL: https://issues.apache.org/jira/browse/ARROW-12958 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Developer Tools, Documentation >Reporter: Jonathan Keane >Priority: Major > Fix For: 10.0.0 > > > We already run the docs building with crossbow, could we host the rendered > docs somewhere so that we can see what they look like during the PR process? > ARROW-1299 is a ticket for nightly docs updates for what's in master. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-12958) [CI][Developer] Build + host the docs for PR branches
[ https://issues.apache.org/jira/browse/ARROW-12958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-12958: --- Fix Version/s: 10.0.0 > [CI][Developer] Build + host the docs for PR branches > - > > Key: ARROW-12958 > URL: https://issues.apache.org/jira/browse/ARROW-12958 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Developer Tools >Reporter: Jonathan Keane >Priority: Major > Fix For: 10.0.0 > > > We already run the docs building with crossbow, could we host the rendered > docs somewhere so that we can see what they look like during the PR process? > ARROW-1299 is a ticket for nightly docs updates for what's in master. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-12958) [CI][Developer] Build + host the docs for PR branches
[ https://issues.apache.org/jira/browse/ARROW-12958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578148#comment-17578148 ] Will Jones commented on ARROW-12958: Alternatively, we could possibly host on Github pages, where a crossbow job publishes the pages to a folder in a branch of some repo, and a nightly cleanup job will delete any pages that are older than 30 days. We could do that in a free repo, so that eliminates hosting cost concerns. > [CI][Developer] Build + host the docs for PR branches > - > > Key: ARROW-12958 > URL: https://issues.apache.org/jira/browse/ARROW-12958 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Developer Tools >Reporter: Jonathan Keane >Priority: Major > > We already run the docs building with crossbow, could we host the rendered > docs somewhere so that we can see what they look like during the PR process? > ARROW-1299 is a ticket for nightly docs updates for what's in master. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-12958) [CI][Developer] Build + host the docs for PR branches
[ https://issues.apache.org/jira/browse/ARROW-12958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578147#comment-17578147 ] Will Jones commented on ARROW-12958: Yeah, I think this could likely be solved by: # Create some hosting location that docs can be served out of. Perhaps it should automatically clean up anything 30 days or older. # Create a crossbow job that builds the docs and uploads them to the hosting location. One solution to (1) is using an S3 bucket to statically host the site and implementing [lifecycle rules|https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-expire-general-considerations.html] to make objects expire 30 days after creation. I'm not sure whether someone will be willing to host those resources, though, or whether there is a cheaper alternative. > [CI][Developer] Build + host the docs for PR branches > - > > Key: ARROW-12958 > URL: https://issues.apache.org/jira/browse/ARROW-12958 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Developer Tools >Reporter: Jonathan Keane >Priority: Major > > We already run the docs building with crossbow, could we host the rendered > docs somewhere so that we can see what they look like during the PR process? > ARROW-1299 is a ticket for nightly docs updates for what's in master. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17076) [Python][Docs] Enable building documentation with pyarrow nightly builds
[ https://issues.apache.org/jira/browse/ARROW-17076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-17076: --- Fix Version/s: 10.0.0
> [Python][Docs] Enable building documentation with pyarrow nightly builds
> 
>
> Key: ARROW-17076
> URL: https://issues.apache.org/jira/browse/ARROW-17076
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Documentation, Python
>Reporter: Todd Farmer
>Priority: Minor
> Fix For: 10.0.0
>
>
> The [instructions for building documentation|https://arrow.apache.org/docs/developers/documentation.html] describe needing pyarrow to successfully build the docs. They also highlight that certain optional pyarrow features must be enabled for the build to succeed:
> {code:none}
> Note that building the documentation may fail if your build of pyarrow is not sufficiently comprehensive. Portions of the Python API documentation will also not build without CUDA support having been built.
> {code}
> "Sufficiently comprehensive" is relatively ambiguous, leaving users to repeat a sequence of steps to identify and resolve required elements:
> * Build C++
> * Build Python
> * Attempt to build docs
> * Evaluate missing features based on error messages
> This adds significant overhead to simply building docs, making it harder for less experienced users to offer docs improvements.
> Rather than attempt to follow the steps above, I tried to use a nightly pyarrow build to satisfy the docs build requirements. This did not work, though, because nightly builds are not built with the options needed to build the docs:
> {code:none}
> (base) todd@pop-os:~/arrow$ pushd docs
> make html
> popd
> ~/arrow/docs ~/arrow
> sphinx-build -b html -d _build/doctrees -j8 source _build/html
> Running Sphinx v5.0.2
> WARNING: Invalid configuration value found: 'language = None'. Update your configuration to a valid langauge code. Falling back to 'en' (English).
> making output directory... done
> [autosummary] generating autosummary for: c_glib/index.rst, cpp/api.rst, cpp/api/array.rst, cpp/api/async.rst, cpp/api/builder.rst, cpp/api/c_abi.rst, cpp/api/compute.rst, cpp/api/cuda.rst, cpp/api/dataset.rst, cpp/api/datatype.rst, ..., python/json.rst, python/memory.rst, python/numpy.rst, python/orc.rst, python/pandas.rst, python/parquet.rst, python/plasma.rst, python/timestamps.rst, r/index.rst, status.rst
> WARNING: [autosummary] failed to import pyarrow.compute.CumulativeSumOptions.
> Possible hints:
> * ModuleNotFoundError: No module named 'pyarrow.compute.CumulativeSumOptions'; 'pyarrow.compute' is not a package
> * AttributeError: module 'pyarrow.compute' has no attribute 'CumulativeSumOptions'
> * ImportError:
> WARNING: [autosummary] failed to import pyarrow.compute.cumulative_sum.
> Possible hints:
> * ModuleNotFoundError: No module named 'pyarrow.compute.cumulative_sum'; 'pyarrow.compute' is not a package
> * ImportError:
> * AttributeError: module 'pyarrow.compute' has no attribute 'cumulative_sum'
> WARNING: [autosummary] failed to import pyarrow.compute.cumulative_sum_checked.
> Possible hints:
> * ImportError:
> * AttributeError: module 'pyarrow.compute' has no attribute 'cumulative_sum_checked'
> * ModuleNotFoundError: No module named 'pyarrow.compute.cumulative_sum_checked'; 'pyarrow.compute' is not a package
> WARNING: [autosummary] failed to import pyarrow.dataset.WrittenFile.
> Possible hints:
> * ModuleNotFoundError: No module named 'pyarrow.dataset.WrittenFile'; 'pyarrow.dataset' is not a package
> * ImportError:
> * AttributeError: module 'pyarrow.dataset' has no attribute 'WrittenFile'
> Extension error (sphinx.ext.autosummary):
> Handler for event 'builder-inited' threw an exception (exception: no module named pyarrow.parquet.encryption)
> make: *** [Makefile:81: html] Error 2
> ~/arrow
> {code}
> Nightly builds should be made sufficient to build documentation.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-13457) [C++][Docs] Scalars User Guide
[ https://issues.apache.org/jira/browse/ARROW-13457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-13457: --- Fix Version/s: 10.0.0 > [C++][Docs] Scalars User Guide > -- > > Key: ARROW-13457 > URL: https://issues.apache.org/jira/browse/ARROW-13457 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Rares Vernica >Priority: Minor > Fix For: 10.0.0 > > > In the C++ User Guide, Scalars are briefly mentioned in Compute Functions > ([https://arrow.apache.org/docs/cpp/compute.html]). It would be nice to have > some examples of the ways a Scalar can be created or manipulated. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-13454) [C++][Docs] Tables vs Record Batches
[ https://issues.apache.org/jira/browse/ARROW-13454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-13454: --- Fix Version/s: 10.0.0 > [C++][Docs] Tables vs Record Batches > > > Key: ARROW-13454 > URL: https://issues.apache.org/jira/browse/ARROW-13454 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Rares Vernica >Priority: Minor > Fix For: 10.0.0 > > > It is not clear what the difference between Tables and Record Batches is, as > described on [https://arrow.apache.org/docs/cpp/tables.html#tables]: > _A > [{{arrow::Table}}|https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow5TableE] > is a two-dimensional dataset with chunked arrays for columns_ > _A > [{{arrow::RecordBatch}}|https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow11RecordBatchE] > is a two-dimensional dataset of a number of contiguous arrays_ > Or maybe the distinction between _chunked arrays_ and _contiguous arrays_ can > be clarified. -- This message was sent by Atlassian Jira (v8.20.10#820010)